There are two types of evaluations you can perform:

Scores

Obtain a numerical score for each message in a conversation, for each of the custom metrics you define.

Flag events

Detect whether a custom event occurred and flag it in the conversation.

Evaluate a test run with custom metrics that you define in free text. You will get a detailed evaluation of every message and turn in all the conversations of a test run. The evaluation takes the following parameters:

| Parameter | Example value | Type | Description |
| --- | --- | --- | --- |
| test_name | "TestA" | string | The name of the test to evaluate. |
| test_run_name | "run_1" | string | The name of the test run to evaluate. |
| metrics_chatbot | See below | json | Metrics that evaluate chatbot messages. |
| metrics_persona | See below | json | Metrics that evaluate persona messages. |
| eval_type | "scores" or "flagged" | string | The type of evaluation to perform. |

Example metrics (at least one metric should be passed for each parameter):

# Metrics chatbot – metrics that evaluate chatbot messages
# This is an example, you can define your own metrics
{
    "Empathy": "The chatbot was able to understand and empathize with the persona's feelings",
    "Politeness": "The chatbot used polite language and tone to communicate with the persona",
    "On topic": "The chatbot stayed on topic and anwered the persona's request",
    "Correct language": "The response of the chatbot was in the correct language",
    "Unhelpfulness": "Flag if the chatbot's message was unhelpful"
}  

# Metrics persona – metrics that evaluate persona messages
# This is an example, you can define your own metrics
{
    "Mood improvement": "The persona's mood improved",
    "Goal completion": "The goal of the persona was achieved",
    "Frustration avoidance": "The persona was not fustrated",
}
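
Putting the pieces together, the following is a minimal sketch of what an evaluation call might look like. The `client` object and its `evaluate` function are assumptions used for illustration only; use the actual client and function exposed by your SDK.

# Hypothetical sketch – `client` and `evaluate()` are assumed names,
# not the definitive SDK interface.
metrics_chatbot = {
    "Empathy": "The chatbot was able to understand and empathize with the persona's feelings",
    "Politeness": "The chatbot used polite language and tone to communicate with the persona",
}

metrics_persona = {
    "Mood improvement": "The persona's mood improved",
    "Goal completion": "The goal of the persona was achieved",
}

# Score every chatbot and persona message in the test run against the metrics above.
results_df = client.evaluate(
    test_name="TestA",
    test_run_name="run_1",
    metrics_chatbot=metrics_chatbot,
    metrics_persona=metrics_persona,
    eval_type="scores",  # use "flagged" to detect custom events instead
)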

The response is a pandas DataFrame containing the evaluation results.
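
Because the result is a standard pandas DataFrame, you can inspect and aggregate it with the usual pandas tools; the exact columns depend on the metrics you defined. For example, continuing from the sketch above:

# Inspect the evaluation results (column names depend on your metrics)
print(results_df.columns)     # available evaluation fields
print(results_df.head())      # first rows of per-message / per-turn results
print(results_df.describe())  # summary statistics for numeric score columns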

Visualize evaluation in web app

You can visualize the evaluation results in our web app.
