[BugFix][easy] Fix flaky test test_gpt_oss_multi_turn_chat #24549
Purpose
The current test test_gpt_oss_multi_turn_chat is flaky and fails more than 50% of the time.
Example of a failed run, where the model asks whether to use Celsius or Fahrenheit instead of calling the function:
root:test_serving_chat.py:187 lacora: response ChatCompletion(id='chatcmpl-9596f8584920486b8e0e22468e55e606', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Sure! Would you like the temperature in Celsius or Fahrenheit?', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning_content='We need to call the function get_current_weather with city "Dallas", state "TX", unit maybe default? The user didn't specify unit. We can ask for unit? But we can choose default. Probably ask for unit? The user didn't specify. We can ask: "Would you like Celsius or Fahrenheit?" But we can also default to Fahrenheit for US. Let's ask.'), stop_reason=None, token_ids=None)], created=1757467045, model='openai/gpt-oss-20b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=97, prompt_tokens=153, total_tokens=250, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None, prompt_token_ids=None, kv_transfer_params=None)
Example of a successful run, where the model outputs a function call:
root:test_serving_chat.py:187 lacora: response ChatCompletion(id='chatcmpl-04de7729d5a1444fbc7072b029b4a945', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='{"city":"Dallas","state":"TX","unit":"celsius"}', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-3190042ee74c466db8b019aa3431dfc0', function=Function(arguments='{"city":"Dallas","state":"TX","unit":"celsius"}', name='get_current_weather'), type='function')], reasoning_content='We need to call the function get_current_weather with city "Dallas", state "TX", unit "celsius".'), stop_reason=200012, token_ids=None)], created=1757466163, model='openai/gpt-oss-20b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=57, prompt_tokens=157, total_tokens=214, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None, prompt_token_ids=None, kv_transfer_params=None)
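The two logged responses differ in finish_reason ('stop' vs 'tool_calls') and in whether tool_calls is empty. A minimal sketch of how a check could tell the two outcomes apart is below. It uses stand-in dataclasses rather than the real client types, and the names Message, Choice, and made_tool_call are hypothetical and not taken from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    # Stand-in for ChatCompletionMessage; only fields seen in the logs above.
    content: str
    tool_calls: list = field(default_factory=list)


@dataclass
class Choice:
    # Stand-in for the Choice object in the logged responses.
    finish_reason: str
    message: Message


def made_tool_call(choice: Choice) -> bool:
    """Return True when the model emitted a function call rather than plain text."""
    return choice.finish_reason == "tool_calls" and len(choice.message.tool_calls) > 0


# Shapes mirroring the failed and successful runs logged above.
failed = Choice(
    "stop",
    Message("Sure! Would you like the temperature in Celsius or Fahrenheit?"),
)
passed = Choice(
    "tool_calls",
    Message(
        '{"city":"Dallas","state":"TX","unit":"celsius"}',
        tool_calls=[{"name": "get_current_weather"}],
    ),
)

print(made_tool_call(failed), made_tool_call(passed))
```

In the failed run the reasoning shows the model hesitating because the user did not specify a unit, so a prompt that leaves the unit open can plausibly produce either outcome. The exact change made by this PR is not shown in this description.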
Test Plan & Test Result
With the fix applied, the test was run multiple times and succeeded every time.
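For reference, repeated runs of a flaky check can be tallied with a small helper like the sketch below. This is an illustration only, not the procedure used for this PR; count_failures, always_passes, and always_fails are hypothetical names, and the placeholder bodies stand in for the real test.

```python
from typing import Callable


def count_failures(test_fn: Callable[[], None], runs: int) -> int:
    """Run test_fn `runs` times and return how many runs raised AssertionError."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    return failures


def always_passes() -> None:
    # Hypothetical placeholder for a test body that is stable.
    assert True


def always_fails() -> None:
    # Hypothetical placeholder for a test body that always fails.
    assert False, "simulated failure"


print(count_failures(always_passes, 10), count_failures(always_fails, 10))
```

A test that previously failed in more than half of its runs should show a failure count of zero over a reasonable number of repetitions once the source of nondeterminism is removed.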