Add support for batched analysis? #80
Open
I am not expecting you to merge this PR as it is quite messy, with a lot of line noise (not sure why, since I did use ruff), but I thought you might be interested in my approach.
I have been working to implement batch analysis using Anthropic on AWS Bedrock because I find the sequential analysis too slow on large codebases. I have hardcoded quite a lot of Anthropic/Bedrock-specific types and logic into the batch services, but if you want to support batch mode and settle on an interface, I can try to refactor it into a BedrockAnthropic batch_client. Separate clients could then implement OpenAI's and Anthropic's batch modes and keep the main codebase generic.
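To make the interface idea concrete, here is a rough sketch of the shape I have in mind for a generic batch client. This is only an illustration: the `BatchClient` protocol, the `BatchRequest`/`BatchResult` types, and the method names are placeholders I made up for this description, not code that exists in the PR.

```python
# Illustrative sketch only -- names and signatures are placeholders, not the PR's actual code.
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class BatchRequest:
    """One prompt to be analysed, keyed so results can be matched back to a document."""
    custom_id: str          # e.g. the path of the file being analysed
    prompt: str
    max_tokens: int = 4096


@dataclass
class BatchResult:
    custom_id: str
    text: str | None        # model output, or None if this item errored
    error: str | None = None


class BatchClient(Protocol):
    """Minimal interface a provider-specific batch client could implement."""

    def submit(self, requests: Sequence[BatchRequest]) -> str:
        """Upload the requests and start a batch job, returning a job id."""
        ...

    def wait(self, job_id: str, poll_seconds: float = 60.0) -> None:
        """Block until the job finishes (or raise on failure/timeout)."""
        ...

    def results(self, job_id: str) -> list[BatchResult]:
        """Download and parse the job output once it has completed."""
        ...


class BedrockAnthropicBatchClient:
    """Skeleton of the Bedrock-backed implementation.

    A real version would write the requests as JSONL to S3, start a Bedrock
    batch (model invocation) job via boto3, poll its status, and read the
    output JSONL back from S3 -- those details are intentionally elided here.
    """

    def __init__(self, model_id: str, s3_bucket: str, role_arn: str) -> None:
        self.model_id = model_id
        self.s3_bucket = s3_bucket
        self.role_arn = role_arn

    def submit(self, requests: Sequence[BatchRequest]) -> str:
        raise NotImplementedError

    def wait(self, job_id: str, poll_seconds: float = 60.0) -> None:
        raise NotImplementedError

    def results(self, job_id: str) -> list[BatchResult]:
        raise NotImplementedError
```

An `OpenAIBatchClient` could then implement the same three methods on top of OpenAI's batch API, and the analysis pipeline would only ever depend on `BatchClient`.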
Analysis using this batch mode is much, much faster than sequential analysis. On a codebase of approximately 12,500 files I estimate that sequential analysis would take ~10 days. In batch mode it takes less than a day, and that includes a summarization call somewhere that I haven't moved to batch mode, which alone took 10 hours.
Since the existing code is based on sequentially analysing documents, I have had to make a lot of changes. The basic flow is:
However, in experimenting with the output, I find the batched RAG generates less accurate answers than sequential RAG. It hallucinates entities a bit more and seems to have less breadth available. For example, I asked it a question about a subdirectory of a React/Redux app: the sequential RAG generated a pretty good answer, but the batched one focused inappropriately on Redux reducers for some reason. I am now wondering if that is due to the gleaning step, as I see you warn that it causes hallucinations. I will try without the gleaning, but please let me know if you can see anything obviously wrong in the code. I track error rates at the various steps and they are all well below 1%, so I am not sure what else might be wrong.