
Commit 8e61161

fix(llmobs): [MLOB-3863] openai agents support reasoning messages (#14522)
While QA-ing our agentic integrations, I noticed a bug in the current OpenAI Agents output message parsing. The `content` field on OpenAI's `ResponseReasoningItem` is optional ([ref](https://github.com/openai/openai-python/blob/2adf11112988e998fcf5adb805bae38501d22318/src/openai/types/responses/response_reasoning_item.py#L27-L51)), which we were not handling properly, leading to potential `NoneType` errors like the one below:

```
File "/Users/nicole.cybul/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_integrations/utils.py", line 1161, in llmobs_output_messages
    for content in item.content:
                   ^^^^^^^^^^^^
TypeError: 'NoneType' object is not iterable
```

While investigating this issue, I noticed that our OpenAI integration had a nearly identical implementation for parsing input and output messages, so I extracted the common logic into a helper function that can be reused across the OpenAI and OpenAI Agents integrations. This fixes the `NoneType` error, since we no longer try to iterate over `item.content` for reasoning message types.
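The failure mode is easy to reproduce in isolation. Below is a minimal sketch of the bug and the defensive pattern used in the fix; `ReasoningItem` here is a hypothetical stand-in for OpenAI's `ResponseReasoningItem`, not the real type:

```python
from typing import List, Optional


class ReasoningItem:
    """Stand-in for OpenAI's ResponseReasoningItem, whose `content` is Optional."""

    def __init__(self, content: Optional[List[str]] = None):
        self.content = content


def collect_text_buggy(item):
    # Mirrors the old behavior: iterating item.content directly
    # raises TypeError when content is None.
    return [c for c in item.content]


def collect_text_fixed(item):
    # The fix: fall back to an empty list before iterating,
    # so a None content is skipped instead of crashing.
    return [c for c in (getattr(item, "content", []) or [])]


item = ReasoningItem(content=None)
print(collect_text_fixed(item))  # []
try:
    collect_text_buggy(item)
except TypeError as e:
    print(e)  # 'NoneType' object is not iterable
```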
## Manual Testing

Here is a [trace](https://app.datadoghq.com/llm/traces?query=%40ml_app%3Anicole-test%20%40event_type%3Aspan%20%40parent_id%3Aundefined&agg_m=count&agg_m_source=base&agg_t=count&fromUser=false&sp=%5B%7B%22sp%22%3A%7B%22width%22%3A%22min%28100%25%2C%20max%28calc%28100%25%20-%20var%28--ui-page-left-offset%29%20-%2016px%29%2C%20900px%29%29%22%7D%2C%22p%22%3A%7B%22eventId%22%3A%22AwAAAZkvInhdwYmtigAAABhBWmt2SW5oZEFBRExtZktfRkZzZEFBQUEAAAAkMDE5OTJmMjMtOTQxOC00OTliLTg1ZWYtYjc5ZjBmNGU5YjkzAABCrQ%22%7D%2C%22i%22%3A%22llm-obs-panel%22%7D%5D&spanId=9743349247239047563&start=1757431678372&end=1757432578372&paused=false) that I created with this feature branch, which can be compared with this [trace](https://app.datadoghq.com/llm/traces?query=%40ml_app%3Anicole-test%20%40event_type%3Aspan%20%40parent_id%3Aundefined&agg_m=count&agg_m_source=base&agg_t=count&fromUser=true&sp=%5B%7B%22sp%22%3A%7B%22width%22%3A%22min%28100%25%2C%20max%28calc%28100%25%20-%20var%28--ui-page-left-offset%29%20-%2016px%29%2C%20900px%29%29%22%7D%2C%22p%22%3A%7B%22eventId%22%3A%22AwAAAZkvJpU3H4D3EQAAABhBWmt2SnBVM0FBQlh3NUdHTDY2SEFBQUEAAAAkZjE5OTJmMjYtY2MyMy00YTQxLThlZWQtZWJlNmZlNWYyOTE4AAAI6g%22%7D%2C%22i%22%3A%22llm-obs-panel%22%7D%5D&spanId=10719453409381675278&start=1757431833983&end=1757432733983&paused=false) from the main branch.

| | Before | After |
|---|---|---|
| OpenAI Agents LLM spans were missing output messages due to the `NoneType` error | <img width="1056" height="992" alt="image" src="https://github.com/user-attachments/assets/3e94b71f-6d39-4044-af4d-bba901e3f850" /> | <img width="1236" height="994" alt="image" src="https://github.com/user-attachments/assets/266b4085-14d4-487e-946d-a5ea0af990f6" /> |
| Tool results for OpenAI Agents LLM spans were not being captured | <img width="542" height="116" alt="image" src="https://github.com/user-attachments/assets/8829ae88-add1-415c-964e-c2253478a595" /> | <img width="1206" height="104" alt="image" src="https://github.com/user-attachments/assets/26e829ff-efc8-4228-a289-1c1f316411f9" /> |
| Spans were not being linked properly due to missing output messages | <img width="1690" height="864" alt="image" src="https://github.com/user-attachments/assets/b7c290a1-7474-4b7a-8cab-fca1e75732f5" /> | <img width="548" height="1104" alt="image" src="https://github.com/user-attachments/assets/b5542829-784d-4780-b33c-3a9843753bcb" /> |

## Checklist

- [x] PR author has checked that all the criteria below are met
  - The PR description includes an overview of the change
  - The PR description articulates the motivation for the change
  - The change includes tests OR the PR description describes a testing strategy
  - The PR description notes risks associated with the change, if any
  - Newly-added code is easy to change
  - The change follows the [library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
  - The change includes or references documentation updates if necessary
  - Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [x] Reviewer has checked that all the criteria below are met
  - Title is accurate
  - All changes are related to the pull request's stated goal
  - Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes
  - Testing strategy adequately addresses listed risks
  - Newly-added code is easy to change
  - Release note makes sense to a user of the library
  - If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  - Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
1 parent 2b8bd63 commit 8e61161

File tree

4 files changed: +116 −132 lines changed


ddtrace/llmobs/_integrations/openai_agents.py

Lines changed: 5 additions & 4 deletions

```diff
@@ -37,6 +37,7 @@
 from ddtrace.llmobs._utils import _get_nearest_llmobs_ancestor
 from ddtrace.llmobs._utils import _get_span_name
 from ddtrace.llmobs._utils import load_data_value
+from ddtrace.llmobs._utils import safe_json
 from ddtrace.trace import Span


@@ -232,13 +233,13 @@ def _llmobs_set_response_attributes(self, span: Span, oai_span: OaiSpanAdapter)
         if oai_span.response and oai_span.response.output:
             messages, tool_call_outputs = oai_span.llmobs_output_messages()

-            for tool_id, tool_name, tool_args in tool_call_outputs:
+            for tool_call_output in tool_call_outputs:
                 core.dispatch(
                     DISPATCH_ON_LLM_TOOL_CHOICE,
                     (
-                        tool_id,
-                        tool_name,
-                        tool_args,
+                        tool_call_output["tool_id"],
+                        tool_call_output["name"],
+                        safe_json(tool_call_output["arguments"]),
                         {
                             "trace_id": format_trace_id(span.trace_id),
                             "span_id": str(span.span_id),
```
ddtrace/llmobs/_integrations/utils.py

Lines changed: 61 additions & 117 deletions

```diff
@@ -550,13 +550,35 @@ def openai_get_input_messages_from_response_input(
     Returns:
         - A list of processed messages
     """
+    processed, _ = _openai_parse_input_response_messages(messages)
+    return processed
+
+
+def _openai_parse_input_response_messages(
+    messages: Optional[Union[str, List[Dict[str, Any]]]], system_instructions: Optional[str] = None
+) -> Tuple[List[Dict[str, Any]], List[str]]:
+    """
+    Parses input messages from the openai responses api into a list of processed messages
+    and a list of tool call IDs.
+
+    Args:
+        messages: A list of output messages
+
+    Returns:
+        - A list of processed messages
+        - A list of tool call IDs
+    """
     processed: List[Dict[str, Any]] = []
+    tool_call_ids: List[str] = []
+
+    if system_instructions:
+        processed.append({"role": "system", "content": system_instructions})

     if not messages:
-        return processed
+        return processed, tool_call_ids

     if isinstance(messages, str):
-        return [{"role": "user", "content": messages}]
+        return [{"role": "user", "content": messages}], tool_call_ids

     for item in messages:
         processed_item: Dict[str, Union[str, List[ToolCall], List[ToolResult]]] = {}
@@ -574,7 +596,7 @@ def openai_get_input_messages_from_response_input(
             processed_item["role"] = item["role"]
         elif "call_id" in item and ("arguments" in item or "input" in item):
             # Process `ResponseFunctionToolCallParam` or ResponseCustomToolCallParam type from input messages
-            arguments_str = item.get("arguments", "{}") or item.get("input", "{}")
+            arguments_str = item.get("arguments", "") or item.get("input", OAI_HANDOFF_TOOL_ARG)
             arguments = safe_load_json(arguments_str)

             tool_call_info = ToolCall(
@@ -585,7 +607,7 @@ def openai_get_input_messages_from_response_input(
             )
             processed_item.update(
                 {
-                    "role": "user",
+                    "role": "assistant",
                     "tool_calls": [tool_call_info],
                 }
             )
@@ -607,10 +629,11 @@ def openai_get_input_messages_from_response_input(
                     "tool_results": [tool_result_info],
                 }
             )
+            tool_call_ids.append(item["call_id"])
         if processed_item:
             processed.append(processed_item)

-    return processed
+    return processed, tool_call_ids


 def openai_get_output_messages_from_response(response: Optional[Any]) -> List[Dict[str, Any]]:
@@ -630,15 +653,33 @@ def openai_get_output_messages_from_response(response: Optional[Any]) -> List[Di
     if not messages:
         return []

+    processed_messages, _ = _openai_parse_output_response_messages(messages)
+
+    return processed_messages
+
+
+def _openai_parse_output_response_messages(messages: List[Any]) -> Tuple[List[Dict[str, Any]], List[ToolCall]]:
+    """
+    Parses output messages from the openai responses api into a list of processed messages
+    and a list of tool call outputs.
+
+    Args:
+        messages: A list of output messages
+
+    Returns:
+        - A list of processed messages
+        - A list of tool call outputs
+    """
     processed: List[Dict[str, Any]] = []
+    tool_call_outputs: List[ToolCall] = []

     for item in messages:
         message = {}
         message_type = _get_attr(item, "type", "")

         if message_type == "message":
             text = ""
-            for content in _get_attr(item, "content", []):
+            for content in _get_attr(item, "content", []) or []:
                 text += str(_get_attr(content, "text", "") or "")
                 text += str(_get_attr(content, "refusal", "") or "")
             message.update({"role": _get_attr(item, "role", "assistant"), "content": text})
@@ -656,26 +697,29 @@ def openai_get_output_messages_from_response(response: Optional[Any]) -> List[Di
                 }
             )
         elif message_type == "function_call" or message_type == "custom_tool_call":
-            arguments = _get_attr(item, "input", "") or _get_attr(item, "arguments", "{}")
-            arguments = safe_load_json(arguments)
+            call_id = _get_attr(item, "call_id", "")
+            name = _get_attr(item, "name", "")
+            raw_arguments = _get_attr(item, "input", "") or _get_attr(item, "arguments", OAI_HANDOFF_TOOL_ARG)
+            arguments = safe_load_json(raw_arguments)
             tool_call_info = ToolCall(
-                tool_id=_get_attr(item, "call_id", ""),
+                tool_id=call_id,
                 arguments=arguments,
-                name=_get_attr(item, "name", ""),
+                name=name,
                 type=_get_attr(item, "type", "function"),
             )
+            tool_call_outputs.append(tool_call_info)
             message.update(
                 {
                     "tool_calls": [tool_call_info],
                     "role": "assistant",
                 }
             )
         else:
-            message.update({"role": "assistant", "content": "Unsupported content type: {}".format(message_type)})
+            message.update({"content": str(item), "role": "assistant"})

         processed.append(message)

-    return processed
+    return processed, tool_call_outputs


 def openai_get_metadata_from_response(
@@ -1071,126 +1115,26 @@ def llmobs_input_messages(self) -> Tuple[List[Dict[str, Any]], List[str]]:
             - A list of processed messages
             - A list of tool call IDs for span linking purposes
         """
-        messages = self.input
-        processed: List[Dict[str, Any]] = []
-        tool_call_ids: List[str] = []
-
-        if self.response_system_instructions:
-            processed.append({"role": "system", "content": self.response_system_instructions})
-
-        if not messages:
-            return processed, tool_call_ids
-
-        if isinstance(messages, str):
-            return [{"content": messages, "role": "user"}], tool_call_ids
-
-        for item in messages:
-            processed_item: Dict[str, Union[str, List[Dict[str, str]]]] = {}
-            # Handle regular message
-            if "content" in item and "role" in item:
-                processed_item_content = ""
-                if isinstance(item["content"], list):
-                    for content in item["content"]:
-                        processed_item_content += content.get("text", "")
-                        processed_item_content += content.get("refusal", "")
-                else:
-                    processed_item_content = item["content"]
-                if processed_item_content:
-                    processed_item["content"] = processed_item_content
-                    processed_item["role"] = item["role"]
-            elif "call_id" in item and "arguments" in item:
-                """
-                Process `ResponseFunctionToolCallParam` type from input messages
-                """
-                try:
-                    arguments = json.loads(item["arguments"])
-                except json.JSONDecodeError:
-                    arguments = item["arguments"]
-                processed_item["tool_calls"] = [
-                    {
-                        "tool_id": item["call_id"],
-                        "arguments": arguments,
-                        "name": item.get("name", ""),
-                        "type": item.get("type", "function_call"),
-                    }
-                ]
-            elif "call_id" in item and "output" in item:
-                """
-                Process `FunctionCallOutput` type from input messages
-                """
-                output = item["output"]
-
-                if isinstance(output, str):
-                    try:
-                        output = json.loads(output)
-                    except json.JSONDecodeError:
-                        output = {"output": output}
-                tool_call_ids.append(item["call_id"])
-                processed_item["role"] = "tool"
-                processed_item["content"] = item["output"]
-                processed_item["tool_id"] = item["call_id"]
-            if processed_item:
-                processed.append(processed_item)
+        return _openai_parse_input_response_messages(self.input, self.response_system_instructions)

-        return processed, tool_call_ids
-
-    def llmobs_output_messages(self) -> Tuple[List[Dict[str, Any]], List[Tuple[str, str, str]]]:
+    def llmobs_output_messages(self) -> Tuple[List[Dict[str, Any]], List[ToolCall]]:
         """Returns processed output messages for LLM Obs LLM spans.

         Returns:
             - A list of processed messages
-            - A list of tool call data (name, id, args) for span linking purposes
+            - A list of tool calls for span linking purposes
         """
         if not self.response or not self.response.output:
             return [], []

         messages: List[Any] = self.response.output
-        processed: List[Dict[str, Any]] = []
-        tool_call_outputs: List[Tuple[str, str, str]] = []
         if not messages:
-            return processed, tool_call_outputs
+            return [], []

         if not isinstance(messages, list):
             messages = [messages]

-        for item in messages:
-            message = {}
-            # Handle content-based messages
-            if hasattr(item, "content"):
-                text = ""
-                for content in item.content:
-                    if hasattr(content, "text") or hasattr(content, "refusal"):
-                        text += getattr(content, "text", "")
-                        text += getattr(content, "refusal", "")
-                message.update({"role": getattr(item, "role", "assistant"), "content": text})
-            # Handle tool calls
-            elif hasattr(item, "call_id") and hasattr(item, "arguments"):
-                tool_call_outputs.append(
-                    (
-                        item.call_id,
-                        getattr(item, "name", ""),
-                        item.arguments if item.arguments else OAI_HANDOFF_TOOL_ARG,
-                    )
-                )
-                message.update(
-                    {
-                        "tool_calls": [
-                            {
-                                "tool_id": item.call_id,
-                                "arguments": (
-                                    json.loads(item.arguments) if isinstance(item.arguments, str) else item.arguments
-                                ),
-                                "name": getattr(item, "name", ""),
-                                "type": getattr(item, "type", "function"),
-                            }
-                        ]
-                    }
-                )
-            else:
-                message.update({"content": str(item)})
-            processed.append(message)
-
-        return processed, tool_call_outputs
+        return _openai_parse_output_response_messages(messages)

     def llmobs_trace_input(self) -> Optional[str]:
         """Converts Response span data to an input value for top level trace.
```
Lines changed: 5 additions & 0 deletions

```diff
@@ -0,0 +1,5 @@
+---
+fixes:
+  - |
+    LLM Observability: Fixes an issue where reasoning message types were not being handled correctly in the OpenAI Agents integration, leading to
+    output messages being dropped on LLM spans.
```
