Description
System Info
Docker
Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.0
Commit sha: 169178b
Docker label: sha-169178b
nvidia-smi
Args {
    model_id: "/share/base_model/Mistral-Nemo-Instruct-2407-GPTQ",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: Some(
        Gptq,
    ),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(
        8192,
    ),
    max_input_length: None,
    max_total_tokens: Some(
        10240,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "545eaf4c39af",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
}
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
I launched a TGI server on an A100 GPU machine and served the Mistral-Nemo-Instruct-2407-GPTQ model.
As shown in the config above, I set max_input_tokens to 8192 and max_total_tokens to 10240. However, when I sent a message containing more than 8192 tokens, it did not seem to be truncated. The error info is shown below:
2024-10-11T11:27:58.527278Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:105: `inputs` tokens + `max_new_tokens` must be <= 10240. Given: 9266 `inputs` tokens and 1000 `max_new_tokens`
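For reference, this is roughly how I sent the request (a minimal sketch of my client code, reconstructed from memory; the placeholder string stands in for the real >8192-token message):

```python
import requests

# Placeholder for the actual long user message (>8192 tokens after tokenization).
long_document = "..." * 10000

# Request against the OpenAI-compatible chat endpoint exposed by the container
# (port 80 as in my config); max_tokens matches the 1000 max_new_tokens in the error.
resp = requests.post(
    "http://localhost:80/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": long_document}],
        "max_tokens": 1000,
        "stream": True,
    },
    stream=True,
)
print(resp.status_code)
for line in resp.iter_lines():
    print(line)
```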
My questions:
- Will TGI automatically truncate the user input according to max_input_tokens?
- Can I use a parameter to truncate the input to at most max_input_tokens? (A client-side sketch of what I mean is shown below this list.)
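This is the client-side workaround I have in mind, assuming the tokenizer in the model directory matches what the server uses; I would prefer a built-in server-side option if one exists:

```python
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 8192  # matches max_input_tokens from my config

# Load the tokenizer from the same local path the server uses (my assumption).
tokenizer = AutoTokenizer.from_pretrained(
    "/share/base_model/Mistral-Nemo-Instruct-2407-GPTQ"
)

def truncate_prompt(text: str) -> str:
    # Keep only the first MAX_INPUT_TOKENS tokens, then decode back to text
    # before sending the request to TGI.
    ids = tokenizer(text, truncation=True, max_length=MAX_INPUT_TOKENS)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)
```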
Thanks a lot for your help.
Expected behavior
Inputs longer than max_input_tokens should be truncated instead of triggering an error.