Feature Enhancement: Batch Inference Support in candle-binding #71
Conversation
Force-pushed from 17a1c5a to bfb87d9
/// Unified classifier with shared ModernBERT backbone and multiple task heads
pub struct UnifiedClassifier {
    // Shared ModernBERT encoder (saves ~800MB memory vs 3 separate models)
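The shared-backbone idea in the struct above can be sketched in plain Python (an illustration, not the PR's actual code): the encoder runs once per input, and every task head reuses the same embedding, instead of loading and running three separate full models.

```python
# Toy sketch of one shared encoder feeding multiple task heads.
# encode() stands in for the shared ModernBERT encoder; the heads and
# their labels are invented for illustration.

def encode(text):
    """Stand-in for the shared encoder: a toy 2-dimensional embedding."""
    return [float(len(text)), float(len(text.split()))]

HEADS = {
    "intent": lambda emb: "long" if emb[1] > 4 else "short",
    "pii": lambda emb: "none",
    "security": lambda emb: "benign",
}

def classify_all(text):
    emb = encode(text)  # one forward pass, shared by every head
    return {task: head(emb) for task, head in HEADS.items()}

print(classify_all("one two three four five"))
# {'intent': 'long', 'pii': 'none', 'security': 'benign'}
```

Because the encoder dominates the memory footprint, sharing it across heads is what yields the ~800MB saving noted in the comment.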
@OneZero-Y we are thinking the same: using the same base BERT model in all classification tasks. But with the current model training, the base BERT models are all different, so this does not work right now.
The next steps are in two directions:
- Multi-task fine-tuning: a single classification head for all classes. This attempt, however, yields very poor accuracy on some classes.
- LoRA: we are still working on this. If you'd like to get started, that would be very promising.
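For reference, the core LoRA idea can be shown with tiny matrices in plain Python (no ML libraries; values are made up for the example): the base weight W stays frozen, and training learns a low-rank update B·A that is scaled by alpha/r and added to W.

```python
def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_merge(W, A, B, alpha, r):
    """Merge a LoRA update into the frozen base weight: W + (alpha/r) * B @ A."""
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: d_out = d_in = 2, rank r = 1, alpha = 2.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
B = [[1.0], [2.0]]             # d_out x r (trained)
A = [[0.5, 0.5]]               # r x d_in (trained)
merged = lora_merge(W, A, B, alpha=2.0, r=1)
print(merged)  # [[2.0, 1.0], [2.0, 3.0]]
```

With r much smaller than the hidden size, the adapter adds only a tiny number of trainable parameters per task, which is why it can share one base model across classifiers.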
@rootfs I'll definitely give it a try.
That's great! This will need work in both training and classification:
- The candle LoRA crate has some BERT examples but no ModernBERT. If you can get this started, that'll be cool!
- The fine-tuning scripts will use the PEFT library for LoRA support.
- Then candle-binding can be extended to support multi-LoRA classification tasks.
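On the training side, a PEFT fine-tuning run produces an `adapter_config.json` roughly like the sketch below. The field names follow PEFT's standard adapter config; the base model name, hyperparameter values, and target module names are illustrative assumptions (the right `target_modules` depend on how the ModernBERT attention layers are named).

```json
{
  "peft_type": "LORA",
  "task_type": "SEQ_CLS",
  "base_model_name_or_path": "answerdotai/ModernBERT-base",
  "r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "target_modules": ["query", "value"]
}
```

The candle-binding side would then read this config plus the adapter weights to apply the low-rank update on top of the shared base model at load time.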
I've implemented a complete LoRA solution that addresses the low-confidence issue you mentioned.
Problem solved:
- High confidence: 0.99+ (vs. the original 0.19-0.38)
- Unified base models: the same architecture across all tasks (BERT/RoBERTa/ModernBERT)
- Production ready: complete training pipeline + documentation
Key features:
- Smart model selection: BERT > RoBERTa > ModernBERT priority
- Proven results: Python-Go numerical consistency achieved
- Zero config: automatic model discovery and loading
The implementation includes complete training scripts (src/training/training_lora/), Rust integration, and comprehensive documentation.
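The "smart model selection" described above can be sketched as a priority scan over discovered models. This is a hypothetical plain-Python illustration; the discovery structure and function names are invented, and the actual logic lives in pkg/utils/classification/model_discovery.go.

```python
# Pick the first available architecture in priority order
# (BERT > RoBERTa > ModernBERT, as stated in this PR).
PRIORITY = ["bert", "roberta", "modernbert"]

def select_model(discovered):
    """Return the path of the highest-priority discovered model, or None.

    `discovered` maps an architecture name to a local model directory.
    """
    by_arch = {arch.lower(): path for arch, path in discovered.items()}
    for arch in PRIORITY:
        if arch in by_arch:  # exact key match, so "bert" never matches "modernbert"
            return by_arch[arch]
    return None

discovered = {
    "modernbert": "models/lora_intent_classifier_modernbert",
    "roberta": "models/lora_intent_classifier_roberta",
}
print(select_model(discovered))  # models/lora_intent_classifier_roberta
```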
@OneZero-Y great work!
@OneZero-Y your test results show low-confidence classification. I also observed this in my tests. The root cause is that when the classifiers were trained, both the base BERT model and the classification head were distinct from the others; you can double-check the base BERT models on HF. The next steps are in the two directions described above.
Force-pushed from bfb87d9 to a2322bb
unit test failed:
Makefile (outdated)
@@ -256,6 +256,13 @@ download-models:
	hf download LLM-Semantic-Router/pii_classifier_modernbert-base_presidio_token_model --local-dir models/pii_classifier_modernbert-base_presidio_token_model; \
	fi

# LoRA Enhanced Models (Note: These need to be trained locally)
@OneZero-Y do you have a trained LoRA? Can you upload it to Hugging Face and let me try it first? Thanks.
@rootfs
I don't have permission to upload models to https://huggingface.co/llm-semantic-router, so I uploaded them to my personal directory first (https://huggingface.co/OneZero-Y). If there are no issues, please download them from there and upload them to https://huggingface.co/llm-semantic-router.
For now, the download link in the Makefile will pull from my personal repository.
@OneZero-Y let's fast-track this PR. Would you please fix the unit test and upload your trained LoRA models to Hugging Face? We have a Hugging Face community at https://huggingface.co/llm-semantic-router. From there, let's migrate to LoRA-based classifiers. Again, this is amazing work!
@OneZero-Y can you also join the semantic-router channel on the vLLM Slack? I want to follow up on the LoRA approach with respect to #59.
Force-pushed from a2322bb to e02abb9
@OneZero-Y I just sent you a Hugging Face community invite; please see if you can upload the models there. Thanks.
Feature Enhancement: Batch Inference Support in candle-binding
Signed-off-by: OneZero-Y <[email protected]>

fix: unified_classifier_test
Signed-off-by: OneZero-Y <[email protected]>

fix: unified_classifier_test
Signed-off-by: OneZero-Y <[email protected]>

fix: unit_test
Signed-off-by: OneZero-Y <[email protected]>

- Complete LoRA training scripts for 3 classification tasks
- Smart model selection with architecture priority (BERT > RoBERTa > ModernBERT)
- Official Candle BERT integration for Python-Go consistency
- Enhanced unified classifier with high-confidence LoRA models
Signed-off-by: OneZero-Y <[email protected]>
Force-pushed from e02abb9 to baa0822
Force-pushed from 499333d to 8feb3b0
Signed-off-by: OneZero-Y <[email protected]>

fix: unit test and model download from huggingface
Signed-off-by: OneZero-Y <[email protected]>
Force-pushed from 8feb3b0 to 55103aa
@OneZero-Y great work! thanks!
What type of PR is this?
Feature Enhancement
What this PR does / why we need it:
This PR implements complete unified batch inference for the semantic router, providing true batch inference capabilities.
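The batching step behind this can be sketched minimally: group the input texts into fixed-size batches so each batch is classified in one forward pass, instead of one pass per text. The function name and batch size below are illustrative, not the PR's actual API.

```python
def batches(texts, batch_size):
    """Split texts into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

texts = ["a", "b", "c", "d", "e"]
print(batches(texts, 2))  # [['a', 'b'], ['c', 'd'], ['e']]
```

Padding each batch to its longest sequence and running it through the shared encoder once is what amortizes the per-call overhead across all texts in the batch.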
Key Features
Key Components:
- candle-binding/src/unified_classifier.rs: ModernBERT + multi-task heads
- pkg/utils/classification/unified_classifier.go: memory-safe CGO interface
- pkg/services/classification.go: unified batch processing service
- pkg/api/server.go: enhanced batch classification endpoint
- pkg/utils/classification/model_discovery.go: zero-config model loading

Unit Tests:
Integration Tests:
API Changes
Enhanced Batch Endpoint:
Response with Probabilities:
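As a hypothetical illustration of the request/response shape: one request carries many texts, and the response returns one result per text with the full class-probability map. All field names and labels here are invented for this sketch; the real handler lives in pkg/api/server.go.

```python
import json

# One batch request carrying several texts (field names are assumptions).
request = {"texts": ["What is the capital of France?", "DROP TABLE users;"]}

# A response in the spirit of "Response with Probabilities": one result
# per input text, with the predicted class and the probability map.
response = {
    "results": [
        {"class": "question", "confidence": 0.99,
         "probabilities": {"question": 0.99, "other": 0.01}},
        {"class": "security_risk", "confidence": 0.97,
         "probabilities": {"security_risk": 0.97, "other": 0.03}},
    ]
}

# Every input text gets exactly one result, in order.
assert len(response["results"]) == len(request["texts"])
print(json.dumps(response["results"][0]))
```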
Which issue(s) this PR fixes:
Fixes #32
Release Notes: Yes/No