Benchmark
Evaluation results
Dataset | Task completion (mean ± std) |
---|---|
Set 1 | 0.712 ± 0.021 |
Set 2 | 0.707 ± 0.028 |
Set 3 | pending |
Set 4 | pending |
For detailed results, please refer to the log files for set 1 and set 2.
Example evaluation
"id": "sim_090" (set 1) Question: "I need to understand the MAPK signaling dynamics for my research. Please simulate model BIOMD0000000027 for 1000 seconds and provide the Mpp concentration at the end of the simulation period. This information is critical for my research on signal transduction and cellular responses." Expected answer: "I've simulated model BIOMD0000000027 for 1000 seconds to analyze the MAPK signaling dynamics for your research. The Mpp concentration at the end of the simulation period is 48.17 nmol/L. This information is critical for your research on signal transduction and cellular responses, as it represents the steady-state concentration of doubly phosphorylated MAPK1 after the complete 1000-second simulation."
Answer during benchmark: "The concentration of Mpp at the end of the 1000 seconds simulation is approximately 48.18 nmol. This information should be useful for your research on signal transduction and cellular responses.",
Task completion score: 0.7 LLM-as-a-judge verdict: "The system successfully processed the input data and provided an answer, which aligns with the task of answering a question. However, the inclusion of additional keys such as 'assistant_messages', 'all_messages', 'state_fields', and 'thread_id' suggests that the output was more complex than necessary for the task. The lack of additional tools or handoffs indicates a straightforward approach, but the extra information may not have been directly relevant to answering the question."
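For reference, the endpoint value in the expected answer can be reproduced directly with basico, the library used to generate the ground truth (see the benchmark strategy below). This is a minimal sketch: the species column name and the default solver settings are assumptions, and small numerical differences from the quoted 48.17 nmol/L are to be expected.

```python
# Minimal sketch: reproduce the ground-truth value for "sim_090" with basico.
# Assumes the species column is named 'Mpp' in this model and that basico's
# default deterministic solver settings are acceptable.
from basico import load_biomodel, run_time_course

load_biomodel(27)  # BIOMD0000000027 (MAPK signaling, Markevich et al.)

# Simulate for 1000 seconds and read the last row of the time course.
# Concentration units follow the model definition (reported as nmol/L
# in the ground truth).
tc = run_time_course(duration=1000)
print(f"Mpp at t=1000 s: {tc['Mpp'].iloc[-1]:.2f}")
```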
Description
We would like to benchmark the Task Completion performance of the T2B agent using the DeepEval framework. Here, T2B's generated response is evaluated against the ground truth answer by an LLM-as-a-judge.
Specifically, we would like to benchmark the following aspects of the T2B agent:
- stability of the outputs with respect to stochastic user inputs, grammatical errors, typos, prompt length, and user background
- stability of the outputs given different tool calls, argument inputs, and multi-turn conversations
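As an illustration of this setup, the sketch below scores a single prompt/response pair with DeepEval's LLM-as-a-judge. It uses a generic GEval task-completion criterion as a stand-in for the metric configured in the actual benchmark code (linked below); the criterion wording and the example strings are assumptions.

```python
# Minimal sketch of scoring one T2B response with an LLM-as-a-judge via DeepEval.
# The GEval criterion below is a stand-in; the real benchmark configures its own
# task-completion metric (see the linked code for set 1 and set 2).
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

task_completion = GEval(
    name="Task Completion",
    criteria=(
        "Judge whether the actual output completes the user's request and "
        "agrees with the expected answer, including any numeric values."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Please simulate model BIOMD0000000027 for 1000 seconds and "
          "provide the Mpp concentration at the end of the simulation.",
    actual_output="The concentration of Mpp at the end of the 1000 seconds "
                  "simulation is approximately 48.18 nmol.",
    expected_output="The Mpp concentration at the end of the simulation "
                    "period is 48.17 nmol/L.",
)

task_completion.measure(test_case)
print(task_completion.score, task_completion.reason)
```

The score in [0, 1] and the judge's reasoning correspond to the task completion score and verdict shown in the example above; per-set means and standard deviations are reported in the results table.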
Dataset summary
Set | Description | Tools | Focus | Questions | Example |
---|---|---|---|---|---|
Set 1 | User input variability with respect to background, grammar and clarity. Captures extreme cases in how users can address the agent. | simulate_model, ask_question | Communication variability | 90 | "pls run sim BIOMD0000000027 1000 seconds get Mpp concentration" vs "I need to understand the MAPK signaling dynamics for my research..." |
Set 2 | Variability in user inputs with respect to the number of provided parameters and tools, expressed in generally well-formulated and grammatically correct questions. | simulate_model, search_models, steady_state, ask_question, custom_plotter, get_modelinfo | Parameter variability | 222 | "Search for models on precision medicine, and then list the names of the models." |
Set 3 | Tabular data matching | steady_state, parameter_scan | Tabular data | 79 | "Analyze MAPKK parameter impact on Mpp concentration over time in model 27. Use parameter scan from 1 to 100 with 10 steps." |
Set 4 | Annotation id matching | get_annotation | Annotation matching | 60 | "what are the annotations for Mp and MKP3 in model 27?" |
Benchmark strategy
- Generate a set of prompts and ground truth answers for each set. The ground truth answers are generated using the basico library (as in the sketch above) and textualized using an LLM. Each set should represent different output types (textual, tabular, dictionary, etc.) and different tool-calling patterns (single parameter, multiple parameters, multiple tools, etc.). The prompts that were used to generate the ground truth answers can be found here and the ground truth data can be found here.
- Run the Task Completion benchmark for each set. The code used for the benchmark can be found here for set 1 and set 2; a minimal sketch of this step is shown after this list.
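The following sketch illustrates step 2 under some assumptions: each set is stored as a JSON list of question/expected-answer records, and the T2B agent can be called from Python through a `run_t2b_agent` helper. The file name, record keys, and helper are hypothetical placeholders, and the judge metric is a generic GEval stand-in for the metric configured in the linked benchmark code.

```python
# Minimal sketch of running the Task Completion benchmark over one set.
# "set1_ground_truth.json", the record keys, and run_t2b_agent() are
# hypothetical placeholders for the actual dataset files and agent entry point.
import json

from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

from t2b_client import run_t2b_agent  # hypothetical wrapper around the T2B agent

# Generic LLM-as-a-judge metric for task completion (stand-in for the
# metric configured in the benchmark code linked above).
task_completion = GEval(
    name="Task Completion",
    criteria=(
        "Judge whether the actual output completes the user's request and "
        "agrees with the expected answer, including any numeric values."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

with open("set1_ground_truth.json") as f:
    records = json.load(f)

test_cases = [
    LLMTestCase(
        input=rec["question"],
        actual_output=run_t2b_agent(rec["question"]),  # T2B's generated response
        expected_output=rec["expected_answer"],        # basico-derived ground truth
    )
    for rec in records
]

# Scores every test case with the judge metric and prints per-case results.
evaluate(test_cases=test_cases, metrics=[task_completion])
```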