• Model settings
  • Inference settings
  • Deployment settings
  • In $/1k tokens
  • In $/1k tokens
  • In $/1k tokens
  • In $/1k requests
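The per-1k-unit prices above translate into a request cost as follows (a minimal sketch; the function name and the example prices are illustrative, not taken from the settings page):

```python
def usage_cost(input_tokens: int, output_tokens: int,
               input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one request, given prices expressed in $/1k tokens."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# e.g. 2,000 input tokens at $0.50/1k plus 500 output tokens at $1.50/1k
print(usage_cost(2000, 500, 0.50, 1.50))  # → 1.75
```

A per-request price ($/1k requests) would simply add `requests / 1000 * request_price_per_1k` on top.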
    Set the maximum number of tokens that can be processed in a single request.
    Needed for some models. Use with caution, as it may allow execution of untrusted code.
    In-flight quantization with bitsandbytes. Reduces the memory required to run large models on smaller GPUs, but may negatively impact performance.
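Taken together, the hints above map onto vLLM engine arguments roughly as follows (a sketch: the argument names are assumptions about which settings these hints describe, and the model id is illustrative):

```python
# Hypothetical engine arguments matching the hints above.
engine_args = {
    "model": "some-org/custom-model",   # illustrative model id
    "max_model_len": 4096,              # cap on tokens processed per request
    "trust_remote_code": True,          # needed for some models; runs the repo's
                                        # own modeling code, so audit it first
    "quantization": "bitsandbytes",     # in-flight quantization: lower memory,
                                        # possibly lower throughput
}
# Would be passed through as: vllm.LLM(**engine_args)
```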

    vLLM settings

    Set the maximum number of tokens that can be processed in a single request (including both input and output). Reducing the context length can help avoid out-of-memory errors.
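The "input plus output" constraint above can be sketched as a simple pre-check (`max_model_len` is vLLM's name for this limit; the helper function is illustrative):

```python
def fits_context(prompt_tokens: int, max_new_tokens: int, max_model_len: int) -> bool:
    """True if the prompt plus the requested generation fits the context window."""
    return prompt_tokens + max_new_tokens <= max_model_len

print(fits_context(3000, 1024, 4096))  # 3000 + 1024 = 4024 <= 4096 → True
print(fits_context(3500, 1024, 4096))  # 3500 + 1024 = 4524 >  4096 → False
```

Lowering `max_model_len` shrinks the KV cache vLLM must reserve, which is why it helps with out-of-memory errors.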
    Set the maximum number of images allowed in a single query. Reducing this parameter can help avoid out-of-memory errors with multimodal models.
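For the image cap, vLLM exposes a per-modality limit; a sketch, assuming the `limit_mm_per_prompt` engine argument (the model id is illustrative):

```python
# Hypothetical engine arguments capping images per request.
engine_args = {
    "model": "Qwen/Qwen2-VL-7B-Instruct",   # illustrative multimodal model id
    "limit_mm_per_prompt": {"image": 2},    # reject queries with more than 2 images
}
```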
    Tool calling is not supported by all models. Please refer to the vLLM documentation for more details.
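For models that do support it, tool calling is switched on at serve time; a sketch of the relevant `vllm serve` flags (the parser name depends on the model family, and both the model id and `hermes` are illustrative):

```python
# Hypothetical `vllm serve` invocation enabling tool calling.
serve_args = [
    "vllm", "serve", "NousResearch/Hermes-2-Pro-Llama-3-8B",
    "--enable-auto-tool-choice",     # let the model decide when to call a tool
    "--tool-call-parser", "hermes",  # parser matching this model's tool-call format
]
print(" ".join(serve_args))
```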
    Reasoning is not supported by all models. Please refer to the vLLM documentation for more details.
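Reasoning output likewise requires a model-specific parser at serve time; a sketch (the parser name and model id are illustrative):

```python
# Hypothetical `vllm serve` invocation enabling reasoning-output parsing.
serve_args = [
    "vllm", "serve", "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "--reasoning-parser", "deepseek_r1",  # split reasoning traces from the final answer
]
```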
    Overriding the default chat template is generally not recommended. However, some models require a custom chat template in order to support tool calling.
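When an override is needed, the template is supplied as a Jinja file at serve time; a sketch (the model id and template path are illustrative):

```python
# Hypothetical `vllm serve` invocation overriding the chat template.
serve_args = [
    "vllm", "serve", "my-org/my-model",
    "--chat-template", "./tool_chat_template.jinja",  # custom Jinja chat template
]
```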