Set the maximum number of tokens that can be processed in a single request (including both input and output). Reducing the context length can help avoid out-of-memory errors.
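For reference, on a standalone vLLM deployment this setting corresponds to the `--max-model-len` engine argument. A minimal sketch (the model name is a placeholder):

```shell
# Cap the context window (prompt + generated tokens) at 8192 tokens,
# which also bounds the KV-cache memory the engine must reserve.
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192
```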
Set the maximum number of images allowed in a single query. Reducing this parameter can help avoid out-of-memory errors with multimodal models.
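In a standalone vLLM deployment, the equivalent knob is `--limit-mm-per-prompt`. A sketch, assuming a recent vLLM version (the value syntax has varied across releases, and the model name is a placeholder):

```shell
# Allow at most 2 images per request for a multimodal model.
vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=2
```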
Tools are not supported on all models. Please refer to the vLLM documentation for more details.
Required
Overriding the default chat template is generally not recommended. However, some models require a custom chat template to support tool calling.
Needed for some models. Use with caution, as it may allow execution of untrusted code.
In-flight quantization with bitsandbytes. Reduces the memory required to run large models on smaller GPUs, but may degrade performance.
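On a standalone vLLM deployment, this corresponds to the `--quantization bitsandbytes` engine argument. A sketch (the model name is a placeholder; older vLLM versions also required `--load-format bitsandbytes`):

```shell
# Quantize weights on the fly with bitsandbytes so a larger model
# fits on a smaller GPU; expect some throughput loss.
vllm serve meta-llama/Llama-3.1-8B-Instruct --quantization bitsandbytes
```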
Set the number of tensor parallel groups. Leave empty for auto-detection by DSS, set to 1 to disable, or enter a value greater than 1 to enforce a specific number.
Set the number of pipeline parallel groups. Leave empty for auto-detection by DSS, set to 1 to disable, or enter a value greater than 1 to enforce a specific number.
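In a standalone vLLM deployment, these two settings map to the `--tensor-parallel-size` and `--pipeline-parallel-size` engine arguments. A sketch with placeholder values (this layout uses 2 × 2 = 4 GPUs in total):

```shell
# Shard each layer's weights across 2 GPUs (tensor parallelism) and
# split the layer stack into 2 sequential stages (pipeline parallelism).
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2
```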
In Mixture-of-Experts (MoE) models, allows experts to be distributed across separate GPUs.
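For a standalone vLLM deployment, this corresponds to the `--enable-expert-parallel` flag, which reuses the tensor-parallel group to place experts on different GPUs. A sketch (model name and sizes are placeholders):

```shell
# Distribute the MoE experts across the 2 GPUs of the tensor-parallel group
# instead of sharding every expert's weights across both.
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel
```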
This is an advanced setting of the vLLM engine (ignored when the vLLM engine is disabled). Tune it with caution, as it may cause performance degradation or engine failure.
Data type for model weights and activations.
Enabling CUDA graph improves inference speed but increases memory requirements. Enforcing eager mode can help avoid out-of-memory errors.
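On a standalone vLLM deployment, eager mode is enforced with the `--enforce-eager` flag. A sketch (the model name is a placeholder):

```shell
# Skip CUDA graph capture: lower memory usage at some inference-speed cost.
vllm serve meta-llama/Llama-3.1-8B-Instruct --enforce-eager
```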
Set the maximum number of sequences that can be processed per iteration. Reducing this value can help avoid out-of-memory errors, in particular with multimodal models.
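In a standalone vLLM deployment, this is the `--max-num-seqs` engine argument. A sketch with a placeholder multimodal model and an illustrative value:

```shell
# Schedule at most 16 concurrent sequences per engine iteration,
# bounding peak activation and KV-cache memory.
vllm serve llava-hf/llava-1.5-7b-hf --max-num-seqs 16
```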
The minimum number of instances that DSS will keep running. This setting only applies if "Reserved capacity" is enabled.
This limits the number of model instances that can run at once.
Minimum is greater than maximum!
Set the CUDA_VISIBLE_DEVICES environment variable with a comma-separated list of GPU IDs (e.g., 0,1) to select specific GPUs. Leave this setting empty to use all available GPUs or for containerized execution.
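As an illustration, setting the variable in a shell before launching the serving process restricts it to the listed GPUs:

```shell
# Expose only GPUs 0 and 1 to the process; other GPUs become invisible to CUDA.
export CUDA_VISIBLE_DEVICES=0,1
echo "$CUDA_VISIBLE_DEVICES"   # prints 0,1
```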