Input dataset

*required fields
If empty, defaults to a random value. Limited to a-z, 0-9, and _. Specify the same id as another Model Evaluation to overwrite it.
If empty, defaults to the date and time of the evaluation.
Can contain variables in ${} notation, which are expanded when the recipe is run.
A Model Evaluation will include dynamically generated evaluation:date and evaluationDataset:dataset-name labels, which you may override here.
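The ${} variable notation mentioned above can be sketched with Python's standard `string.Template`, which uses the same placeholder syntax. The variable names below (`projectKey`, `evaluation_date`) are illustrative assumptions, not the recipe's actual variable set.

```python
from string import Template

# Hypothetical label value using the ${} notation; the recipe defines its own
# variables, so these names are assumptions for illustration only.
label_template = Template("run-${projectKey}-${evaluation_date}")
variables = {"projectKey": "AGENT_EVAL", "evaluation_date": "2024-06-01"}

# safe_substitute leaves unknown placeholders untouched instead of raising.
expanded = label_template.safe_substitute(variables)
print(expanded)  # run-AGENT_EVAL-2024-06-01
```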

Metrics

LLM-as-a-judge metrics computation

Define the LLM connection to use for metrics computation (see the metrics help for when each is required).

  • Fields: Input, Output, Ground Truth, Actual tool calls, Reference tool calls
  • Custom Metrics

    Code Metrics

    Add custom evaluation metrics to score the agent

Warning: You do not have permission to run arbitrary code. The recipe will fail if it includes custom metrics and is run by a user who lacks this permission.
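A custom code metric is essentially a scoring function over an evaluation row. The signature and field names below are assumptions for illustration; the recipe's actual custom-metric interface may differ.

```python
# Hypothetical custom metric: scores 1.0 when the agent's output matches the
# ground truth exactly (after trimming whitespace). The parameter names are
# illustrative, not the recipe's real contract.
def exact_match_score(question: str, output: str, ground_truth: str) -> float:
    """Return 1.0 on an exact match with the ground truth, else 0.0."""
    return 1.0 if output.strip() == ground_truth.strip() else 0.0

print(exact_match_score("What is 2+2?", " 4 ", "4"))  # 1.0
print(exact_match_score("What is 2+2?", "5", "4"))    # 0.0
```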

    Traits

Traits are a list of assertions submitted to an LLM to validate the agent's actions or answers.

    The trait name will appear when hovering over the result.
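Traits as described above might be declared as named assertions and turned into per-trait judge prompts. Everything here is a sketch: the field names and prompt wording are assumptions, since the recipe builds its judge prompts internally.

```python
# Illustrative trait definitions; the "name"/"assertion" keys are assumptions.
traits = [
    {"name": "politeness", "assertion": "The answer is polite and professional."},
    {"name": "grounded", "assertion": "The answer only uses facts from the retrieved context."},
]

def judge_prompt(answer: str, trait: dict) -> str:
    """Assemble a hypothetical LLM-as-a-judge prompt for one trait."""
    return (
        f"Assertion: {trait['assertion']}\n"
        f"Answer: {answer}\n"
        "Does the answer satisfy the assertion? Reply PASS or FAIL."
    )

print(judge_prompt("Hello, happy to help!", traits[0]))
```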

    Python environment

    Warning: Running this recipe with code env {{codeEnvWarning.envName}} may fail. {{codeEnvWarning.reason}}

    Container configuration

    Execution configuration

    Stop recipe execution if any metric produces an error. If disabled, metrics in error produce only empty values.
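The two failure modes described above can be sketched as follows; the function and parameter names are illustrative, not the recipe's internals.

```python
# Sketch of the stop-on-error setting: either propagate a metric's exception
# (stopping execution) or swallow it and record an empty value.
def run_metric(metric, sample, stop_on_error: bool):
    try:
        return metric(sample)
    except Exception:
        if stop_on_error:
            raise  # stop recipe execution on the first failing metric
        return None  # metric in error yields an empty value

def broken_metric(sample):
    raise ValueError("metric failed")

print(run_metric(broken_metric, {}, stop_on_error=False))  # None
```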

    Completion LLM parameters

    Metric configuration

    If set, the HuggingFace model needed for BERT-Score computation will be downloaded through the connection. The HuggingFace Model Cache will be used if enabled in the connection.
    If empty, defaults to bert-base-uncased.
    Sets the maximum number of parallel requests for RAGAS metrics. If empty, defaults to {{defaultRagasMaxWorkers}}.
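Capping parallel requests works like bounding a worker pool. The sketch below uses Python's standard `ThreadPoolExecutor`; the value 4 and the request function are stand-ins, since the real default and the RAGAS call are environment-specific.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one RAGAS metric request; the real call goes to an LLM.
def fake_ragas_request(sample_id: int) -> str:
    return f"scored-{sample_id}"

# max_workers=4 stands in for the configured maximum; at most 4 requests
# run concurrently, and map() preserves input order in its results.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_ragas_request, range(8)))

print(results[:2])  # ['scored-0', 'scored-1']
```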