
Conversation

tbtommyb

I am not expecting you to merge this PR as it is quite messy with a lot of line noise (not sure why, as I did use ruff), but I thought you might be interested in my approach.

I have been working to implement batch analysis using Anthropic on AWS Bedrock because I find the sequential analysis too slow on large codebases. I have hardcoded quite a lot of Anthropic/Bedrock-specific types and logic into the batch services, but if you want to support batch mode and settle on an interface, I can try to refactor it into a BedrockAnthropic batch_client. Separate clients could then implement support for OpenAI's and Anthropic's batch modes and keep the main codebase generic.
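Roughly, the interface I have in mind would look something like this (names and signatures are illustrative, not code from this PR):

```python
from typing import Protocol


class BatchClient(Protocol):
    """Hypothetical provider-agnostic batch interface (sketch only)."""

    def submit(self, prompts: dict[str, str]) -> str:
        """Submit record-id -> prompt pairs; return a provider job id."""
        ...

    def poll(self, job_id: str) -> bool:
        """Return True once the provider reports the job as finished."""
        ...

    def results(self, job_id: str) -> dict[str, str]:
        """Fetch completed outputs keyed by the original record ids."""
        ...
```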

Analysis using this batch mode is much, much faster than sequential analysis. On a codebase of approximately 12,500 files I estimate that sequential analysis would take ~10 days. In batch mode it takes less than a day, and that includes a summarization call somewhere that I haven't yet moved to batch mode, which alone took 10 hours.

Since the existing code is based on sequentially analysing documents, I have had to make a lot of changes. The basic flow is:

  • generate all information extraction prompts in one file using stable node ids.
  • split into reasonably sized sub-files: Bedrock times out after 24 hours and can run a few batch tasks in parallel, so splitting into sub-files makes the run much faster (see the sketch after this list).
  • run the information extraction process again, but provide the prompt output and actually do the extraction.
  • use the extracted output to create glean output (I now see in a separate ticket you advise against gleaning, so I will skip this).
  • create the graphs. This is sequential, so I use pickle to save progress (see the checkpointing sketch below).
  • batch the summarize_nodes_description step (run once for prompts, run again with the output to do the work).
  • batch the summarize_edges_description step (as above).
  • upsert everything (also sequential).
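
Concretely, the prompt generation and submission (the first two steps) look roughly like the sketch below: one JSONL record per node with a stable id, split into sub-files, each submitted as a Bedrock batch inference job via boto3's `create_model_invocation_job`. The helper names, records-per-file count, model id, and ARNs are illustrative placeholders, not values from this PR.

```python
# Sketch only: names, sizes, the model id and the role ARN are placeholders.
import hashlib
import json

import boto3


def stable_node_id(file_path: str, chunk_index: int) -> str:
    # Deterministic id, so the prompt run and the extraction run can be
    # matched up record-by-record across batch jobs.
    return hashlib.sha256(f"{file_path}:{chunk_index}".encode()).hexdigest()[:16]


def write_subfiles(prompts: dict[str, str], per_file: int = 5000) -> list[str]:
    # Smaller sub-files stay well under Bedrock's 24-hour job timeout and
    # can be submitted as parallel jobs.
    items = list(prompts.items())
    paths = []
    for i in range(0, len(items), per_file):
        path = f"batch_{i // per_file:04d}.jsonl"
        with open(path, "w") as f:
            for record_id, prompt in items[i : i + per_file]:
                record = {
                    "recordId": record_id,
                    "modelInput": {
                        "anthropic_version": "bedrock-2023-05-31",
                        "max_tokens": 4096,
                        "messages": [{"role": "user", "content": prompt}],
                    },
                }
                f.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths


def submit_job(job_name: str, s3_input_uri: str, s3_output_uri: str) -> str:
    # Each sub-file (uploaded to S3 first) becomes one batch inference job.
    bedrock = boto3.client("bedrock")
    resp = bedrock.create_model_invocation_job(
        jobName=job_name,
        roleArn="arn:aws:iam::123456789012:role/bedrock-batch",  # placeholder
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder
        inputDataConfig={"s3InputDataConfig": {"s3Uri": s3_input_uri}},
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": s3_output_uri}},
    )
    return resp["jobArn"]
```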

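The pickle checkpointing in the graph step is roughly this pattern (the checkpoint path and the per-node work are stand-ins for the real logic):

```python
import os
import pickle

CHECKPOINT = "graph_progress.pkl"  # hypothetical path


def build_graph(nodes: list[dict]) -> dict:
    # Resume from the last checkpoint if a previous run was interrupted.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"graph": {}, "done": 0}

    for node in nodes[state["done"]:]:
        state["graph"][node["id"]] = node  # stand-in for the real work
        state["done"] += 1
        if state["done"] % 500 == 0:  # persist progress periodically
            with open(CHECKPOINT, "wb") as f:
                pickle.dump(state, f)

    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)  # clean up once the full pass completes
    return state["graph"]
```
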
However, in experimenting with the output I find the batched RAG generates less accurate answers than the sequential RAG. It hallucinates entities a bit more and seems to have less breadth available. For example, I asked it a question about a subdirectory of a React/Redux app. The sequential RAG generated a pretty good answer, but the batched one focused inappropriately on Redux reducers for some reason. I am now wondering if that is due to the gleaning step, as I see you warn that it causes hallucinations. I will try without the gleaning, but please let me know if you can see anything obviously wrong in the code. I track error rates at the various steps and they are all well below 1%, so I am not sure what else might be wrong.

tbtommyb changed the title from "Add support for batched analysis using Anthropic on AWS Bedrock" to "Add support for batched analysis?" on Feb 13, 2025
tbtommyb commented Feb 13, 2025

I have re-run that subdirectory without gleaning and it seems much better. Trying on a larger codebase now.

Edit: Updating this in case anyone has similar issues.

Without gleaning I saw far fewer hallucinations, so I tried it on the full codebase. There, the result quality dropped quite a lot: it didn't seem to know about key components and frequently referenced irrelevant things. When I looked at the entities and relationships the RAG system was supplying with the query, I noticed a lot of garbage being sent. It turned out that because I was gathering data from all [ts, tsx, js, json, md, yaml] files, I was inadvertently including very large bundled-output .js and network response .json files. These seemed to overwhelm the RAG system, particularly because they didn't have clearly defined entities or relationships.

I cut the gathered data down to just [ts, tsx] files. The total processed file count went down by less than 10%, but the number of prompts submitted dropped by nearly half, so clearly a relatively small number of very large junk files had been generating a disproportionate share of the prompts. The quality of the output is now much better and seems pretty usable. I can probably improve it further by better specifying the example queries and entity types (I am struggling to get it to reliably report the file path of each file it knows about).
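The filtering itself is trivial; roughly the following (the size cutoff here is a guess on my part, not a measured threshold):

```python
# Sketch of the tightened file gathering: only .ts/.tsx sources, skipping
# very large files that are likely bundles rather than hand-written code.
from pathlib import Path

MAX_BYTES = 200_000  # generated bundles are usually far larger than source


def gather_sources(root: str) -> list[Path]:
    files = []
    for pattern in ("*.ts", "*.tsx"):
        for path in Path(root).rglob(pattern):
            if path.stat().st_size <= MAX_BYTES:
                files.append(path)
    return files
```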

Incidentally, the batch run took less than 6 hours, and that includes waiting a few hours for Bedrock capacity. I kicked off a sequential build at the same time and it is currently at 2.5%.

@kobrinartem

Hi @tbtommyb, thanks for sharing. I am researching batch processing as it fits this type of task better and is way cheaper. I did a few experiments with the Anthropic API. I tried Bedrock, but I don't like the 50-message minimum per request, and Bedrock's limits are still not the best.
What about embeddings? Do you still use OpenAI, or did you disable them for your case?

tbtommyb commented Apr 8, 2025

You could write different BatchClients for Anthropic, OpenAI, etc. against their batch APIs and reuse some of my logic. Bedrock is what I have access to, so that is what I use.

For embeddings I just use the standard embedding client in the codebase. They are fast enough that it's not really worth batching them.
