Conversation


grahama1970 commented Sep 11, 2025

feat(router): Parallel Acompletions

Summary

  • Adds parallel_acompletions in the router to fan out concurrent sub-requests and aggregate results.

Motivation

  • Many production calls need parallel fanout (tooling, ensembles, multi-provider redundancy). Providing a first-class helper in the router reduces duplicated user logic and improves observability.

Diagram

flowchart TD
  A[Client Request] --> B[Router parallel acompletions]
  B --> C[Build requests and batch id]
  C --> S[Semaphore]
  S --> T1[run one 0]
  S --> T2[run one 1]
  T1 --> G[Gather results]
  T2 --> G
  G --> R[Aggregated results]
  R --> O[index, request, response or exception]
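
A minimal sketch of the fanout pattern the diagram describes, with assumed names (fan_out, run_one) rather than the actual helper in router_utils/parallel_acompletion.py: a semaphore bounds concurrency, each sub-request runs as its own task, and results are gathered alongside their index and originating request.

import asyncio
from typing import Any, Awaitable, Callable, List, Tuple

async def fan_out(
    requests: List[Any],
    run_one: Callable[[Any], Awaitable[Any]],
    concurrency: int = 5,
) -> List[Tuple[int, Any, Any]]:
    # Bound the number of in-flight sub-requests.
    sem = asyncio.Semaphore(concurrency)

    async def _bounded(i: int, req: Any) -> Tuple[int, Any, Any]:
        async with sem:
            try:
                resp = await run_one(req)
                return (i, req, resp)
            except Exception as exc:
                # Keep the exception with its index instead of failing the whole batch.
                return (i, req, exc)

    # gather() preserves input order, so results[i] corresponds to requests[i].
    return await asyncio.gather(*(_bounded(i, r) for i, r in enumerate(requests)))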

Return shape

  • parallel_acompletions(...) -> List[RouterParallelResult] (optionally preserve_order=True).
  • iter_parallel_acompletions(...) -> AsyncIterator[RouterParallelResult] (yields in completion order).
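
Roughly, each result pairs the input index and request with either a response or an exception. The sketch below is inferred from the return shape described above; field names and types in the actual helper may differ.

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class RouterParallelResult:
    index: int                                  # position of the originating request
    request: Any                                # the RouterParallelRequest that was sent
    response: Optional[Any] = None              # completion response on success
    exception: Optional[BaseException] = None   # set when return_exceptions=True and the call failed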

What’s Included

Router

  • New litellm/router_utils/parallel_acompletion.py orchestration helper.
  • Integration in litellm/router.py gated behind an experimental flag.
  • Minor auth/route compat tweaks in proxy code paths.
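
For context, the experimental gate boils down to an env-var check along these lines (hypothetical sketch based on the LITELLM_ENABLE_PARALLEL_ACOMPLETIONS variable used in the examples below; the actual check in litellm/router.py may differ):

import os

def _parallel_acompletions_enabled() -> bool:
    # Feature is off unless the experimental flag is explicitly set.
    return os.getenv("LITELLM_ENABLE_PARALLEL_ACOMPLETIONS", "0") == "1"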

Docs/Tests

  • Guide: docs/my-website/docs/guides/parallel_acompletions.md + sidebar entry.
  • Unit test: tests/router/test_parallel_acompletions.py.
  • Live test (Gemini): tests/router/test_parallel_acompletions_live_gemini.py.

Scope/Impact

  • Default behavior unchanged; feature is gated via experimental flag.
  • No broad formatting or unrelated refactors.

Validation

  • Unit tests pass locally with pytest -n auto.
  • Router behavior verified via unit + live tests.

Risks/Follow-ups

  • Parallel fanout introduces concurrency. Isolated via helper and flag-gated; feedback on API ergonomics and metrics surface is welcome.
  • If CI enforces repo-wide formatting, maintainers may prefer “Squash and merge” to keep history tidy.

Checklist

  • Feature behind experimental flag
  • Docs and sidebar updated
  • Unit + live tests added

Ancillary: Tokenizer Stability

  • Rationale: stabilize tokenizer loading behavior in CI/offline environments without affecting the public API. These changes exist only so the parallel acompletions tests pass reliably in CI; they are not part of the feature surface.
  • Changes
    • Use single-argument HF tokenizer loading to match mocks/tests.
    • Temporarily disable HF_HUB_ENABLE_HF_TRANSFER during HF loads (avoid timeouts/noise when hf_transfer isn’t installed).
    • Preserve robust fallback to tiktoken when HF is unavailable or download/signature fails.
  • Validation
    • Verified that tests exercising the tokenizer path no longer flake; the fallback path is exercised when HF is unavailable.
    • Non-breaking; same return types and behavior when HF is present.
    • Lives in its own commit: fix(tokenizer): stabilize create_pretrained_tokenizer.
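
For reference, the hf_transfer toggle described above amounts to a small context manager along these lines (illustrative sketch with an assumed name; the committed fix may be structured differently):

import os
from contextlib import contextmanager

@contextmanager
def _hf_transfer_disabled():
    # Temporarily force HF_HUB_ENABLE_HF_TRANSFER off, then restore the prior value.
    prev = os.environ.get("HF_HUB_ENABLE_HF_TRANSFER")
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
    try:
        yield
    finally:
        if prev is None:
            os.environ.pop("HF_HUB_ENABLE_HF_TRANSFER", None)
        else:
            os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = prev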

Links

Examples

Gather in one shot (preserve order)

import os
import asyncio
from litellm import Router
from litellm.router_utils.parallel_acompletion import RouterParallelRequest

os.environ["LITELLM_ENABLE_PARALLEL_ACOMPLETIONS"] = "1"  # enable feature

async def main():
    router = Router(
        model_list=[{
            "model_name": "prod",
            "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "sk-..."},
        }]
    )

    requests = [
        RouterParallelRequest(model="prod", messages=[{"role": "user", "content": "A"}]),
        RouterParallelRequest(model="prod", messages=[{"role": "user", "content": "B"}]),
        RouterParallelRequest(model="prod", messages=[{"role": "user", "content": "C"}]),
    ]

    results = await router.parallel_acompletions(
        requests,
        concurrency=2,
        preserve_order=True,    # returned list matches input order
        return_exceptions=True, # keep errors in result.exception
    )

    for r in results:
        if r.exception:
            print("error", r.index, r.exception)
        else:
            print("ok", r.index, r.response)

asyncio.run(main())

Iterate as each finishes (completion order)

import os
import asyncio
from litellm import Router
from litellm.router_utils.parallel_acompletion import RouterParallelRequest

os.environ["LITELLM_ENABLE_PARALLEL_ACOMPLETIONS"] = "1"

async def main():
    router = Router(model_list=[{"model_name": "prod", "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "sk-..."}}])

    requests = [
        RouterParallelRequest(model="prod", messages=[{"role": "user", "content": "X"}]),
        RouterParallelRequest(model="prod", messages=[{"role": "user", "content": "Y"}]),
        RouterParallelRequest(model="prod", messages=[{"role": "user", "content": "Z"}]),
    ]

    try:
        async for r in router.iter_parallel_acompletions(requests, concurrency=3, return_exceptions=False):
            # if any call fails, iteration raises immediately (fail-fast)
            print("ok", r.index, r.response)
    except Exception as e:
        print("aborted due to:", e)

asyncio.run(main())
