Description
System Info
Docker
Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.0
Commit sha: 169178b
Docker label: sha-169178b
nvidia-smi
Args {
    model_id: "/share/base_model/Mistral-Nemo-Instruct-2407-GPTQ",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: Some(
        Gptq,
    ),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(
        8192,
    ),
    max_input_length: None,
    max_total_tokens: Some(
        10240,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "545eaf4c39af",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
}
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
I launched a TGI server on an A100 GPU machine and served the Mistral-Nemo-Instruct-2407-GPTQ model.
As shown in the config above, I set max_input_tokens to 8192 and max_total_tokens to 10240. However, when I sent a message containing more than 8192 tokens, it did not seem to be truncated. The error info is shown below:
2024-10-11T11:27:58.527278Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:105: `inputs` tokens + `max_new_tokens` must be <= 10240. Given: 9266 `inputs` tokens and 1000 `max_new_tokens`
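For reference, this is roughly how I sent the request (a minimal sketch of my client code, reconstructed from memory; the placeholder string stands in for the real >8192-token message):

```python
import requests

# Placeholder for the actual long user message (>8192 tokens after tokenization).
long_document = "..." * 10000

# Request against the OpenAI-compatible chat endpoint exposed by the container
# (port 80 as in my config); max_tokens matches the 1000 max_new_tokens in the error.
resp = requests.post(
    "http://localhost:80/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": long_document}],
        "max_tokens": 1000,
        "stream": True,
    },
    stream=True,
)
print(resp.status_code)
for line in resp.iter_lines():
    print(line)
```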
My questions:
- Will TGI automatically truncate the user input according to max_input_tokens?
- Can I use a parameter to truncate the input to at most max_input_tokens? (A client-side sketch of what I mean is shown below this list.)
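This is the client-side workaround I have in mind, assuming the tokenizer in the model directory matches what the server uses; I would prefer a built-in server-side option if one exists:

```python
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 8192  # matches max_input_tokens from my config

# Load the tokenizer from the same local path the server uses (my assumption).
tokenizer = AutoTokenizer.from_pretrained(
    "/share/base_model/Mistral-Nemo-Instruct-2407-GPTQ"
)

def truncate_prompt(text: str) -> str:
    # Keep only the first MAX_INPUT_TOKENS tokens, then decode back to text
    # before sending the request to TGI.
    ids = tokenizer(text, truncation=True, max_length=MAX_INPUT_TOKENS)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)
```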
Thanks a lot for your help.
Expected behavior
Inputs longer than max_input_tokens should be truncated instead of triggering an error.