Set the maximum number of tokens that can be processed in a single request.
Needed for some models. Use with caution, as it may allow execution of untrusted code.
In-flight quantization with bitsandbytes. Reduces the memory required to run large models on smaller GPUs, but may negatively impact performance.
vLLM settings
Set the maximum number of tokens that can be processed in a single request (including both input and output). Reducing the context length can help avoid out-of-memory errors.
Set the maximum number of images allowed in a single query. Reducing this parameter can help avoid out-of-memory errors with multimodal models.
Tool calling is not supported on all models. Please refer to the vLLM documentation for more details.
Reasoning is not supported on all models. Please refer to the vLLM documentation for more details.
Overriding the default chat template is generally not recommended. However, some models require a custom chat template to support tool calling.
Set the number of tensor parallel groups. Leave empty for auto-detection by DSS, set to 1 to disable, or enter a value greater than 1 to enforce a specific number.
Set the number of pipeline parallel groups. Leave empty for auto-detection by DSS, set to 1 to disable, or enter a value greater than 1 to enforce a specific number.
In Mixture-of-Experts (MoE) models, allows experts to be distributed across separate GPUs.
This is an advanced setting of the vLLM engine (ignored when the vLLM engine is disabled). Tune it with caution, as it may cause performance degradation or engine failure.
Data type for model weights and activations.
Enabling CUDA graph improves inference speed but increases memory requirements. Enforcing eager mode can help avoid out-of-memory errors.
Set the maximum number of sequences that can be processed per iteration. Reducing this parameter value can help avoid out-of-memory errors, in particular with multimodal models.
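For reference, the engine settings above correspond roughly to vLLM's standard server flags. As an illustrative sketch only (the model name and values are examples, not DSS defaults, and flag syntax may vary across vLLM versions), a comparable standalone launch might look like:

```shell
# Illustrative example only; adjust model and values to your setup.
# --max-model-len: max tokens per request (input + output)
# --limit-mm-per-prompt: cap on images per query for multimodal models
# --tensor-parallel-size / --pipeline-parallel-size: parallelism degrees
# --dtype: data type for model weights and activations
# --enforce-eager: disable CUDA graphs to reduce memory use
# --max-num-seqs: max sequences processed per iteration
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --max-model-len 8192 \
  --limit-mm-per-prompt image=2 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 1 \
  --dtype bfloat16 \
  --enforce-eager \
  --max-num-seqs 64
```

Lowering --max-model-len, --max-num-seqs, or the image limit, or adding --enforce-eager, are the usual first steps when hitting out-of-memory errors.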
The minimum number of instances that DSS will keep running. This setting only applies if "Reserved capacity" is enabled.
Limits the number of model instances that can be running at once.
Minimum is greater than maximum!
Sets the target load per instance, counting both active and queued requests. The system automatically adds or removes instances to keep the average load near this target, balancing performance with cost-efficiency.
In seconds. The time period used to average the request load. A longer window makes scaling more stable by ignoring brief spikes, while a shorter window reacts faster. For the last instance, this also functions as a TTL, shutting down the instance if it remains idle for this duration.
Set the CUDA_VISIBLE_DEVICES environment variable with a comma-separated list of GPU IDs (e.g., 0,1) to select specific GPUs. Leave this setting empty to use all available GPUs or for containerized execution.
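For example, restricting a local (non-containerized) engine to two specific GPUs could be done as follows (the GPU IDs here are an example, not a recommendation):

```shell
# Example only: expose GPUs 0 and 1 to the process; adjust IDs to your hardware.
export CUDA_VISIBLE_DEVICES=0,1
```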