# Benchmark - All Results

## BioCypher query generation

In this set of tasks, we test the ability of LLMs to generate queries for a BioCypher knowledge graph using BioChatter. The `schema_config.yaml` of the BioCypher knowledge graph and a natural language query are passed to BioChatter.
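
The schema configuration declares which entities and relationships the graph contains, which is what the LLM needs in order to map a natural language question onto graph patterns. A minimal illustrative fragment in BioCypher's `schema_config.yaml` style (the entity names here are placeholders, not the benchmark's actual schema):

```yaml
protein:
  represented_as: node
  preferred_id: uniprot
  input_label: protein

protein protein interaction:
  is_a: pairwise molecular interaction
  represented_as: edge
  input_label: interacts_with
```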

Individual steps of the query generation process are tested separately, as well as its end-to-end performance.
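
The scores in the tables below can be reproduced from the raw columns: each benchmark is repeated for the given number of iterations, the passed test cases are averaged across iterations (which, presumably, is why fractional values such as 7.8 appear), and accuracy is that mean divided by the total number of test cases. A minimal sketch of this scoring, under those assumptions:

```python
def accuracy(passed_per_iteration, total):
    """Mean number of passed test cases across iterations, and the
    resulting accuracy (mean passed / total test cases).

    Fractional "passed" values in the tables arise from this averaging.
    """
    mean_passed = sum(passed_per_iteration) / len(passed_per_iteration)
    return mean_passed, mean_passed / total

# E.g. a model passing 8 of 9 cases in four runs and 7 in the fifth:
mean_passed, acc = accuracy([8, 8, 8, 8, 7], total=9)
print(mean_passed, round(acc, 6))  # 7.8 0.866667
```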

| Full model name | Passed test cases | Total test cases | Accuracy | Iterations |
|---|---|---|---|---|
| gpt-3.5-turbo-0125 | 8 | 8 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q6_K | 8 | 8 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q3_K_M | 9 | 9 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q8_0 | 8 | 9 | 0.888889 | 5 |
| openhermes-2.5:7:ggufv2:Q5_K_M | 8 | 9 | 0.888889 | 5 |
| openhermes-2.5:7:ggufv2:Q4_K_M | 8 | 9 | 0.888889 | 5 |
| gpt-4-0613 | 8 | 9 | 0.888889 | 5 |
| gpt-3.5-turbo-0613 | 8 | 9 | 0.888889 | 5 |
| gpt-4-0125-preview | 7 | 9 | 0.777778 | 5 |
| chatglm3:6:ggmlv3:q4_0 | 6 | 8 | 0.75 | 5 |
| openhermes-2.5:7:ggufv2:Q2_K | 5 | 9 | 0.555556 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q6_K | 4 | 8 | 0.5 | 5 |
| code-llama-instruct:7:ggufv2:Q3_K_M | 4 | 8 | 0.5 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q6_K | 3.8 | 8 | 0.475 | 5 |
| code-llama-instruct:13:ggufv2:Q3_K_M | 3.6 | 8 | 0.45 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q5_K_M | 4 | 9 | 0.444444 | 5 |
| llama-2-chat:7:ggufv2:Q8_0 | 4 | 9 | 0.444444 | 5 |
| llama-2-chat:7:ggufv2:Q5_K_M | 4 | 9 | 0.444444 | 5 |
| llama-2-chat:7:ggufv2:Q4_K_M | 4 | 9 | 0.444444 | 5 |
| llama-2-chat:70:ggufv2:Q5_K_M | 4 | 9 | 0.444444 | 5 |
| llama-2-chat:70:ggufv2:Q4_K_M | 4 | 9 | 0.444444 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M | 3.8 | 9 | 0.422222 | 5 |
| llama-2-chat:7:ggufv2:Q6_K | 3 | 8 | 0.375 | 5 |
| llama-2-chat:70:ggufv2:Q3_K_M | 3 | 9 | 0.333333 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M | 3 | 9 | 0.333333 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q8_0 | 3 | 9 | 0.333333 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q3_K_M | 3 | 9 | 0.333333 | 5 |
| llama-2-chat:7:ggufv2:Q3_K_M | 3 | 9 | 0.333333 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q4_K_M | 3 | 9 | 0.333333 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M | 3 | 9 | 0.333333 | 5 |
| code-llama-instruct:7:ggufv2:Q4_K_M | 3 | 9 | 0.333333 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 | 2.8 | 9 | 0.311111 | 5 |
| code-llama-instruct:7:ggufv2:Q2_K | 2 | 8 | 0.25 | 5 |
| code-llama-instruct:34:ggufv2:Q8_0 | 2 | 8 | 0.25 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q2_K | 2 | 9 | 0.222222 | 5 |
| code-llama-instruct:34:ggufv2:Q6_K | 1 | 8 | 0.125 | 5 |
| code-llama-instruct:34:ggufv2:Q5_K_M | 1 | 8 | 0.125 | 5 |
| code-llama-instruct:7:ggufv2:Q5_K_M | 1 | 9 | 0.111111 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q2_K | 0 | 9 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q6_K | 0 | 8 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q5_K_M | 0 | 8 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q4_K_M | 0 | 8 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q2_K | 0 | 9 | 0 | 5 |
| llama-2-chat:70:ggufv2:Q2_K | 0 | 9 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q2_K | 0 | 8 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q6_K | 0 | 8 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q5_K_M | 0 | 9 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q4_K_M | 0 | 9 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q3_K_M | 0 | 9 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q2_K | 0 | 9 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q8_0 | 0 | 8 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q2_K | 0 | 8 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q3_K_M | 0 | 8 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q4_K_M | 0 | 8 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q8_0 | 0 | 9 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q6_K | 0 | 8 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q8_0 | 0 | 9 | 0 | 5 |

| Full model name | Passed test cases | Total test cases | Accuracy | Iterations |
|---|---|---|---|---|
| openhermes-2.5:7:ggufv2:Q8_0 | 12 | 12 | 1 | 5 |
| gpt-3.5-turbo-0125 | 12 | 12 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q6_K | 12 | 12 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q5_K_M | 12 | 12 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q4_K_M | 12 | 12 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q3_K_M | 12 | 12 | 1 | 5 |
| gpt-4-0125-preview | 9 | 12 | 0.75 | 5 |
| gpt-4-0613 | 7.8 | 12 | 0.65 | 5 |
| openhermes-2.5:7:ggufv2:Q2_K | 6 | 12 | 0.5 | 5 |
| code-llama-instruct:34:ggufv2:Q2_K | 6 | 12 | 0.5 | 5 |
| gpt-3.5-turbo-0613 | 6 | 12 | 0.5 | 5 |
| chatglm3:6:ggmlv3:q4_0 | 4.8 | 12 | 0.4 | 5 |
| code-llama-instruct:7:ggufv2:Q3_K_M | 3 | 12 | 0.25 | 5 |
| code-llama-instruct:7:ggufv2:Q2_K | 3 | 12 | 0.25 | 5 |
| llama-2-chat:70:ggufv2:Q5_K_M | 3 | 12 | 0.25 | 5 |
| llama-2-chat:70:ggufv2:Q4_K_M | 3 | 12 | 0.25 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 | 3 | 12 | 0.25 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q6_K | 3 | 12 | 0.25 | 5 |
| code-llama-instruct:34:ggufv2:Q3_K_M | 3 | 12 | 0.25 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M | 3 | 12 | 0.25 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M | 3 | 12 | 0.25 | 5 |
| code-llama-instruct:34:ggufv2:Q8_0 | 0 | 12 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q6_K | 0 | 12 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q2_K | 0 | 12 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q3_K_M | 0 | 12 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q4_K_M | 0 | 12 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q5_K_M | 0 | 12 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M | 0 | 12 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q8_0 | 0 | 12 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q2_K | 0 | 12 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q6_K | 0 | 12 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q8_0 | 0 | 12 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q6_K | 0 | 12 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q5_K_M | 0 | 12 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q4_K_M | 0 | 12 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q3_K_M | 0 | 12 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q8_0 | 0 | 12 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q3_K_M | 0 | 12 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q5_K_M | 0 | 12 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q2_K | 0 | 12 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q4_K_M | 0 | 12 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q5_K_M | 0 | 12 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q6_K | 0 | 12 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q8_0 | 0 | 12 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q6_K | 0 | 12 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q5_K_M | 0 | 12 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q4_K_M | 0 | 12 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q3_K_M | 0 | 12 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q4_K_M | 0 | 12 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q4_K_M | 0 | 12 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q5_K_M | 0 | 12 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q6_K | 0 | 12 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q2_K | 0 | 12 | 0 | 5 |
| llama-2-chat:70:ggufv2:Q2_K | 0 | 12 | 0 | 5 |
| llama-2-chat:70:ggufv2:Q3_K_M | 0 | 12 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q2_K | 0 | 12 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q8_0 | 0 | 12 | 0 | 5 |

| Full model name | Passed test cases | Total test cases | Accuracy | Iterations |
|---|---|---|---|---|
| gpt-3.5-turbo-0613 | 23.2 | 64 | 0.3625 | 5 |
| gpt-4-0613 | 23 | 64 | 0.359375 | 5 |
| gpt-3.5-turbo-0125 | 22.8 | 64 | 0.35625 | 5 |
| chatglm3:6:ggmlv3:q4_0 | 18.4 | 64 | 0.2875 | 5 |
| llama-2-chat:70:ggufv2:Q3_K_M | 11 | 64 | 0.171875 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M | 10.4 | 64 | 0.1625 | 5 |
| openhermes-2.5:7:ggufv2:Q5_K_M | 8 | 64 | 0.125 | 5 |
| openhermes-2.5:7:ggufv2:Q3_K_M | 8 | 64 | 0.125 | 5 |
| openhermes-2.5:7:ggufv2:Q8_0 | 8 | 64 | 0.125 | 5 |
| llama-2-chat:7:ggufv2:Q3_K_M | 6.4 | 64 | 0.1 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M | 4.2 | 64 | 0.065625 | 5 |
| code-llama-instruct:7:ggufv2:Q2_K | 4 | 64 | 0.0625 | 5 |
| openhermes-2.5:7:ggufv2:Q6_K | 3 | 64 | 0.046875 | 5 |
| openhermes-2.5:7:ggufv2:Q4_K_M | 3 | 64 | 0.046875 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q3_K_M | 3 | 64 | 0.046875 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q6_K | 3 | 64 | 0.046875 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q8_0 | 2.4 | 64 | 0.0375 | 5 |
| llama-2-chat:7:ggufv2:Q5_K_M | 2.4 | 64 | 0.0375 | 5 |
| llama-2-chat:7:ggufv2:Q6_K | 0 | 64 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q8_0 | 0 | 64 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q2_K | 0 | 64 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q4_K_M | 0 | 64 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q5_K_M | 0 | 64 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q4_K_M | 0 | 64 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q2_K | 0 | 64 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q5_K_M | 0 | 64 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M | 0 | 64 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q6_K | 0 | 64 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 | 0 | 64 | 0 | 5 |
| openhermes-2.5:7:ggufv2:Q2_K | 0 | 64 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q4_K_M | 0 | 64 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q3_K_M | 0 | 64 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q4_K_M | 0 | 64 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q2_K | 0 | 64 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q5_K_M | 0 | 64 | 0 | 5 |
| gpt-4-0125-preview | 0 | 64 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q6_K | 0 | 64 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q8_0 | 0 | 64 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q3_K_M | 0 | 64 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q4_K_M | 0 | 64 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q5_K_M | 0 | 64 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q6_K | 0 | 64 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q8_0 | 0 | 64 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q3_K_M | 0 | 64 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q2_K | 0 | 64 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q8_0 | 0 | 64 | 0 | 5 |
| llama-2-chat:70:ggufv2:Q5_K_M | 0 | 64 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q2_K | 0 | 64 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q3_K_M | 0 | 64 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q4_K_M | 0 | 64 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q5_K_M | 0 | 64 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q6_K | 0 | 64 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q2_K | 0 | 64 | 0 | 5 |
| llama-2-chat:70:ggufv2:Q2_K | 0 | 64 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q6_K | 0 | 64 | 0 | 5 |
| llama-2-chat:70:ggufv2:Q4_K_M | 0 | 64 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q8_0 | 0 | 64 | 0 | 5 |

| Full model name | Passed test cases | Total test cases | Accuracy | Iterations |
|---|---|---|---|---|
| code-llama-instruct:34:ggufv2:Q4_K_M | 7.8 | 8 | 0.975 | 5 |
| code-llama-instruct:34:ggufv2:Q5_K_M | 7.6 | 8 | 0.95 | 5 |
| code-llama-instruct:34:ggufv2:Q8_0 | 7.4 | 8 | 0.925 | 5 |
| code-llama-instruct:34:ggufv2:Q6_K | 7.2 | 8 | 0.9 | 5 |
| gpt-4-0613 | 8 | 9 | 0.888889 | 5 |
| code-llama-instruct:13:ggufv2:Q2_K | 7 | 8 | 0.875 | 5 |
| code-llama-instruct:34:ggufv2:Q3_K_M | 7 | 8 | 0.875 | 5 |
| gpt-3.5-turbo-0125 | 7.8 | 9 | 0.866667 | 5 |
| code-llama-instruct:13:ggufv2:Q3_K_M | 6.8 | 8 | 0.85 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q6_K | 6.8 | 8 | 0.85 | 5 |
| openhermes-2.5:7:ggufv2:Q2_K | 7.6 | 9 | 0.844444 | 5 |
| code-llama-instruct:13:ggufv2:Q6_K | 6.6 | 8 | 0.825 | 5 |
| code-llama-instruct:7:ggufv2:Q2_K | 6.4 | 8 | 0.8 | 5 |
| code-llama-instruct:7:ggufv2:Q3_K_M | 7.2 | 9 | 0.8 | 5 |
| llama-2-chat:70:ggufv2:Q5_K_M | 7 | 9 | 0.777778 | 5 |
| llama-2-chat:13:ggufv2:Q4_K_M | 7 | 9 | 0.777778 | 5 |
| openhermes-2.5:7:ggufv2:Q5_K_M | 7 | 9 | 0.777778 | 5 |
| llama-2-chat:70:ggufv2:Q3_K_M | 7 | 9 | 0.777778 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M | 7 | 9 | 0.777778 | 5 |
| code-llama-instruct:7:ggufv2:Q6_K | 6.2 | 8 | 0.775 | 5 |
| code-llama-instruct:13:ggufv2:Q5_K_M | 6.2 | 8 | 0.775 | 5 |
| code-llama-instruct:13:ggufv2:Q4_K_M | 6.2 | 8 | 0.775 | 5 |
| llama-2-chat:13:ggufv2:Q6_K | 6.2 | 8 | 0.775 | 5 |
| llama-2-chat:70:ggufv2:Q4_K_M | 6.8 | 9 | 0.755556 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M | 6.8 | 9 | 0.755556 | 5 |
| openhermes-2.5:7:ggufv2:Q8_0 | 6.8 | 9 | 0.755556 | 5 |
| gpt-3.5-turbo-0613 | 6.8 | 9 | 0.755556 | 5 |
| openhermes-2.5:7:ggufv2:Q4_K_M | 6.8 | 9 | 0.755556 | 5 |
| code-llama-instruct:34:ggufv2:Q2_K | 6 | 8 | 0.75 | 5 |
| code-llama-instruct:13:ggufv2:Q8_0 | 6 | 8 | 0.75 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q2_K | 6.6 | 9 | 0.733333 | 5 |
| llama-2-chat:13:ggufv2:Q3_K_M | 6.6 | 9 | 0.733333 | 5 |
| gpt-4-0125-preview | 6.6 | 9 | 0.733333 | 5 |
| openhermes-2.5:7:ggufv2:Q6_K | 6.6 | 9 | 0.733333 | 5 |
| openhermes-2.5:7:ggufv2:Q3_K_M | 7.2 | 10 | 0.72 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M | 6.4 | 9 | 0.711111 | 5 |
| llama-2-chat:13:ggufv2:Q8_0 | 6.4 | 9 | 0.711111 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q4_K_M | 6.2 | 9 | 0.688889 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q5_K_M | 6.2 | 9 | 0.688889 | 5 |
| llama-2-chat:7:ggufv2:Q2_K | 6.2 | 9 | 0.688889 | 5 |
| code-llama-instruct:7:ggufv2:Q5_K_M | 6.2 | 9 | 0.688889 | 5 |
| code-llama-instruct:7:ggufv2:Q8_0 | 6 | 9 | 0.666667 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q3_K_M | 6 | 9 | 0.666667 | 5 |
| llama-2-chat:70:ggufv2:Q2_K | 6 | 9 | 0.666667 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 | 6 | 9 | 0.666667 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q6_K | 5.2 | 8 | 0.65 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q8_0 | 5.8 | 9 | 0.644444 | 5 |
| llama-2-chat:13:ggufv2:Q5_K_M | 5.8 | 9 | 0.644444 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q2_K | 5.4 | 9 | 0.6 | 5 |
| code-llama-instruct:7:ggufv2:Q4_K_M | 5.4 | 9 | 0.6 | 5 |
| llama-2-chat:7:ggufv2:Q4_K_M | 4.4 | 9 | 0.488889 | 5 |
| llama-2-chat:7:ggufv2:Q3_K_M | 4.2 | 9 | 0.466667 | 5 |
| llama-2-chat:7:ggufv2:Q8_0 | 3.2 | 9 | 0.355556 | 5 |
| llama-2-chat:7:ggufv2:Q6_K | 3 | 9 | 0.333333 | 5 |
| llama-2-chat:13:ggufv2:Q2_K | 2.6 | 9 | 0.288889 | 5 |
| llama-2-chat:7:ggufv2:Q5_K_M | 2.6 | 9 | 0.288889 | 5 |
| chatglm3:6:ggmlv3:q4_0 | 2.2 | 8 | 0.275 | 5 |

| Full model name | Passed test cases | Total test cases | Accuracy | Iterations |
|---|---|---|---|---|
| gpt-4-0613 | 20.4 | 30 | 0.68 | 5 |
| code-llama-instruct:7:ggufv2:Q4_K_M | 19.6 | 30 | 0.653333 | 5 |
| code-llama-instruct:34:ggufv2:Q3_K_M | 18 | 30 | 0.6 | 5 |
| openhermes-2.5:7:ggufv2:Q5_K_M | 17.6 | 30 | 0.586667 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q2_K | 17.2 | 30 | 0.573333 | 5 |
| code-llama-instruct:13:ggufv2:Q2_K | 17 | 30 | 0.566667 | 5 |
| code-llama-instruct:13:ggufv2:Q5_K_M | 17 | 30 | 0.566667 | 5 |
| code-llama-instruct:13:ggufv2:Q8_0 | 17 | 30 | 0.566667 | 5 |
| code-llama-instruct:34:ggufv2:Q2_K | 17 | 30 | 0.566667 | 5 |
| code-llama-instruct:13:ggufv2:Q6_K | 16.2 | 30 | 0.54 | 5 |
| openhermes-2.5:7:ggufv2:Q6_K | 16 | 30 | 0.533333 | 5 |
| code-llama-instruct:13:ggufv2:Q3_K_M | 16 | 30 | 0.533333 | 5 |
| code-llama-instruct:13:ggufv2:Q4_K_M | 16 | 30 | 0.533333 | 5 |
| code-llama-instruct:7:ggufv2:Q2_K | 16 | 30 | 0.533333 | 5 |
| gpt-3.5-turbo-0613 | 15 | 30 | 0.5 | 5 |
| gpt-3.5-turbo-0125 | 14.6 | 30 | 0.486667 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q2_K | 14.4 | 30 | 0.48 | 5 |
| llama-2-chat:13:ggufv2:Q3_K_M | 14.4 | 30 | 0.48 | 5 |
| chatglm3:6:ggmlv3:q4_0 | 14.4 | 30 | 0.48 | 5 |
| llama-2-chat:13:ggufv2:Q8_0 | 14.4 | 30 | 0.48 | 5 |
| llama-2-chat:70:ggufv2:Q2_K | 14.2 | 30 | 0.473333 | 5 |
| code-llama-instruct:34:ggufv2:Q6_K | 14.2 | 30 | 0.473333 | 5 |
| code-llama-instruct:34:ggufv2:Q8_0 | 14 | 30 | 0.466667 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q3_K_M | 14 | 30 | 0.466667 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q5_K_M | 14 | 30 | 0.466667 | 5 |
| openhermes-2.5:7:ggufv2:Q8_0 | 14 | 30 | 0.466667 | 5 |
| openhermes-2.5:7:ggufv2:Q3_K_M | 14 | 30 | 0.466667 | 5 |
| openhermes-2.5:7:ggufv2:Q4_K_M | 14 | 30 | 0.466667 | 5 |
| code-llama-instruct:34:ggufv2:Q5_K_M | 14 | 30 | 0.466667 | 5 |
| code-llama-instruct:34:ggufv2:Q4_K_M | 14 | 30 | 0.466667 | 5 |
| gpt-4-0125-preview | 13.2 | 30 | 0.44 | 5 |
| llama-2-chat:13:ggufv2:Q5_K_M | 13 | 30 | 0.433333 | 5 |
| openhermes-2.5:7:ggufv2:Q2_K | 13 | 30 | 0.433333 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q8_0 | 13 | 30 | 0.433333 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q6_K | 13 | 30 | 0.433333 | 5 |
| code-llama-instruct:7:ggufv2:Q3_K_M | 12.8 | 30 | 0.426667 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M | 12.8 | 30 | 0.426667 | 5 |
| llama-2-chat:70:ggufv2:Q4_K_M | 12.6 | 30 | 0.42 | 5 |
| llama-2-chat:70:ggufv2:Q3_K_M | 12.4 | 30 | 0.413333 | 5 |
| code-llama-instruct:7:ggufv2:Q8_0 | 12 | 30 | 0.4 | 5 |
| code-llama-instruct:7:ggufv2:Q5_K_M | 12 | 30 | 0.4 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 | 11.6 | 30 | 0.386667 | 5 |
| llama-2-chat:13:ggufv2:Q6_K | 11.6 | 30 | 0.386667 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M | 11.4 | 30 | 0.38 | 5 |
| llama-2-chat:13:ggufv2:Q4_K_M | 11 | 30 | 0.366667 | 5 |
| llama-2-chat:13:ggufv2:Q2_K | 11 | 30 | 0.366667 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q4_K_M | 11 | 30 | 0.366667 | 5 |
| llama-2-chat:70:ggufv2:Q5_K_M | 10.8 | 30 | 0.36 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M | 10 | 30 | 0.333333 | 5 |
| code-llama-instruct:7:ggufv2:Q6_K | 10 | 30 | 0.333333 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q6_K | 10 | 30 | 0.333333 | 5 |
| llama-2-chat:7:ggufv2:Q5_K_M | 8.8 | 30 | 0.293333 | 5 |
| llama-2-chat:7:ggufv2:Q6_K | 8 | 30 | 0.266667 | 5 |
| llama-2-chat:7:ggufv2:Q8_0 | 8 | 30 | 0.266667 | 5 |
| llama-2-chat:7:ggufv2:Q4_K_M | 7.2 | 30 | 0.24 | 5 |
| llama-2-chat:7:ggufv2:Q3_K_M | 7 | 30 | 0.233333 | 5 |
| llama-2-chat:7:ggufv2:Q2_K | 3 | 30 | 0.1 | 5 |

| Full model name | Passed test cases | Total test cases | Accuracy | Iterations |
|---|---|---|---|---|
| gpt-3.5-turbo-0125 | 29 | 30 | 0.966667 | 5 |
| code-llama-instruct:7:ggufv2:Q4_K_M | 29 | 30 | 0.966667 | 5 |
| gpt-4-0613 | 29 | 30 | 0.966667 | 5 |
| code-llama-instruct:7:ggufv2:Q8_0 | 28.8 | 30 | 0.96 | 5 |
| code-llama-instruct:7:ggufv2:Q5_K_M | 28.8 | 30 | 0.96 | 5 |
| code-llama-instruct:7:ggufv2:Q6_K | 28.8 | 30 | 0.96 | 5 |
| gpt-3.5-turbo-0613 | 28.4 | 30 | 0.946667 | 5 |
| openhermes-2.5:7:ggufv2:Q3_K_M | 28.2 | 30 | 0.94 | 5 |
| openhermes-2.5:7:ggufv2:Q2_K | 28.2 | 30 | 0.94 | 5 |
| llama-2-chat:70:ggufv2:Q4_K_M | 27.6 | 30 | 0.92 | 5 |
| code-llama-instruct:7:ggufv2:Q2_K | 27.6 | 30 | 0.92 | 5 |
| openhermes-2.5:7:ggufv2:Q5_K_M | 27.4 | 30 | 0.913333 | 5 |
| llama-2-chat:70:ggufv2:Q3_K_M | 27.2 | 30 | 0.906667 | 5 |
| llama-2-chat:70:ggufv2:Q5_K_M | 27.2 | 30 | 0.906667 | 5 |
| code-llama-instruct:34:ggufv2:Q4_K_M | 27.2 | 30 | 0.906667 | 5 |
| llama-2-chat:70:ggufv2:Q2_K | 27 | 30 | 0.9 | 5 |
| code-llama-instruct:34:ggufv2:Q5_K_M | 27 | 30 | 0.9 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M | 26.8 | 30 | 0.893333 | 5 |
| openhermes-2.5:7:ggufv2:Q8_0 | 26.4 | 30 | 0.88 | 5 |
| code-llama-instruct:7:ggufv2:Q3_K_M | 26.2 | 30 | 0.873333 | 5 |
| openhermes-2.5:7:ggufv2:Q4_K_M | 26.2 | 30 | 0.873333 | 5 |
| code-llama-instruct:34:ggufv2:Q8_0 | 25.8 | 30 | 0.86 | 5 |
| openhermes-2.5:7:ggufv2:Q6_K | 25.8 | 30 | 0.86 | 5 |
| code-llama-instruct:34:ggufv2:Q6_K | 25.6 | 30 | 0.853333 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 | 25.4 | 30 | 0.846667 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q8_0 | 25.4 | 30 | 0.846667 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M | 25.2 | 30 | 0.84 | 5 |
| code-llama-instruct:13:ggufv2:Q3_K_M | 25 | 30 | 0.833333 | 5 |
| gpt-4-0125-preview | 25 | 30 | 0.833333 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q6_K | 25 | 30 | 0.833333 | 5 |
| code-llama-instruct:13:ggufv2:Q4_K_M | 25 | 30 | 0.833333 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q5_K_M | 24.8 | 30 | 0.826667 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q6_K | 24.8 | 30 | 0.826667 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q4_K_M | 24.8 | 30 | 0.826667 | 5 |
| code-llama-instruct:13:ggufv2:Q2_K | 24.6 | 30 | 0.82 | 5 |
| llama-2-chat:13:ggufv2:Q6_K | 24.4 | 30 | 0.813333 | 5 |
| code-llama-instruct:13:ggufv2:Q6_K | 23.8 | 30 | 0.793333 | 5 |
| llama-2-chat:13:ggufv2:Q8_0 | 23.6 | 30 | 0.786667 | 5 |
| code-llama-instruct:34:ggufv2:Q3_K_M | 23.6 | 30 | 0.786667 | 5 |
| code-llama-instruct:13:ggufv2:Q5_K_M | 23.4 | 30 | 0.78 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q3_K_M | 23.2 | 30 | 0.773333 | 5 |
| code-llama-instruct:13:ggufv2:Q8_0 | 23 | 30 | 0.766667 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M | 22.8 | 30 | 0.76 | 5 |
| llama-2-chat:13:ggufv2:Q4_K_M | 22.8 | 30 | 0.76 | 5 |
| llama-2-chat:13:ggufv2:Q5_K_M | 22.4 | 30 | 0.746667 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q2_K | 21.8 | 30 | 0.726667 | 5 |
| llama-2-chat:7:ggufv2:Q3_K_M | 20.8 | 30 | 0.693333 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q2_K | 20.8 | 30 | 0.693333 | 5 |
| code-llama-instruct:34:ggufv2:Q2_K | 20.6 | 30 | 0.686667 | 5 |
| llama-2-chat:7:ggufv2:Q2_K | 20.6 | 30 | 0.686667 | 5 |
| llama-2-chat:13:ggufv2:Q3_K_M | 20.4 | 30 | 0.68 | 5 |
| llama-2-chat:7:ggufv2:Q6_K | 19.8 | 30 | 0.66 | 5 |
| llama-2-chat:7:ggufv2:Q4_K_M | 19.4 | 30 | 0.646667 | 5 |
| llama-2-chat:7:ggufv2:Q8_0 | 19.2 | 30 | 0.64 | 5 |
| llama-2-chat:7:ggufv2:Q5_K_M | 19 | 30 | 0.633333 | 5 |
| chatglm3:6:ggmlv3:q4_0 | 16.6 | 30 | 0.553333 | 5 |
| llama-2-chat:13:ggufv2:Q2_K | 13 | 30 | 0.433333 | 5 |

| Full model name | Passed test cases | Total test cases | Accuracy | Iterations |
|---|---|---|---|---|
| gpt-3.5-turbo-0125 | 27.8 | 30 | 0.926667 | 5 |
| gpt-4-0613 | 26.4 | 30 | 0.88 | 5 |
| gpt-3.5-turbo-0613 | 25 | 30 | 0.833333 | 5 |
| chatglm3:6:ggmlv3:q4_0 | 0 | 30 | 0 | 5 |
| llama-2-chat:70:ggufv2:Q3_K_M | 0 | 30 | 0 | 5 |
| llama-2-chat:70:ggufv2:Q5_K_M | 0 | 30 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q2_K | 0 | 30 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q3_K_M | 0 | 30 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q4_K_M | 0 | 30 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q5_K_M | 0 | 30 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q6_K | 0 | 30 | 0 | 5 |
| llama-2-chat:7:ggufv2:Q8_0 | 0 | 30 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q2_K | 0 | 30 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q3_K_M | 0 | 30 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q4_K_M | 0 | 30 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q5_K_M | 0 | 30 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q6_K | 0 | 30 | 0 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q8_0 | 0 | 30 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q2_K | 0 | 30 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M | 0 | 30 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M | 0 | 30 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M | 0 | 30 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q6_K | 0 | 30 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 | 0 | 30 | 0 | 5 |
| openhermes-2.5:7:ggufv2:Q2_K | 0 | 30 | 0 | 5 |
| openhermes-2.5:7:ggufv2:Q3_K_M | 0 | 30 | 0 | 5 |
| openhermes-2.5:7:ggufv2:Q4_K_M | 0 | 30 | 0 | 5 |
| openhermes-2.5:7:ggufv2:Q5_K_M | 0 | 30 | 0 | 5 |
| openhermes-2.5:7:ggufv2:Q6_K | 0 | 30 | 0 | 5 |
| llama-2-chat:70:ggufv2:Q4_K_M | 0 | 30 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q8_0 | 0 | 30 | 0 | 5 |
| llama-2-chat:70:ggufv2:Q2_K | 0 | 30 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q2_K | 0 | 30 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q3_K_M | 0 | 30 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q4_K_M | 0 | 30 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q5_K_M | 0 | 30 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q6_K | 0 | 30 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q8_0 | 0 | 30 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q2_K | 0 | 30 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q3_K_M | 0 | 30 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q4_K_M | 0 | 30 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q5_K_M | 0 | 30 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q6_K | 0 | 30 | 0 | 5 |
| code-llama-instruct:34:ggufv2:Q8_0 | 0 | 30 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q2_K | 0 | 30 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q3_K_M | 0 | 30 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q4_K_M | 0 | 30 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q5_K_M | 0 | 30 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q6_K | 0 | 30 | 0 | 5 |
| code-llama-instruct:7:ggufv2:Q8_0 | 0 | 30 | 0 | 5 |
| gpt-4-0125-preview | 0 | 30 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q2_K | 0 | 30 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q3_K_M | 0 | 30 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q4_K_M | 0 | 30 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q5_K_M | 0 | 30 | 0 | 5 |
| llama-2-chat:13:ggufv2:Q6_K | 0 | 30 | 0 | 5 |
| openhermes-2.5:7:ggufv2:Q8_0 | 0 | 30 | 0 | 5 |

## Retrieval-Augmented Generation (RAG)

In this set of tasks, we test the ability of LLMs to generate answers to a given question using a RAG agent, or to judge the relevance of a retrieved fragment to a given question. Instructions can be explicit ("Is this fragment relevant to the question?") or implicit (asking the question without further instructions and evaluating whether the model responds with "not enough information given").
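
For the explicit variant, each test case amounts to asking a yes/no question about a fragment and comparing the model's verdict to the expected one. A minimal sketch of such a check (the prompt wording and reply parsing here are illustrative assumptions, not BioChatter's exact implementation):

```python
def build_explicit_prompt(question: str, fragment: str) -> str:
    # Illustrative wording; the benchmark's actual prompt may differ.
    return (
        f"Question: {question}\n"
        f"Fragment: {fragment}\n"
        "Is this fragment relevant to the question? Answer 'yes' or 'no'."
    )

def judge_passed(model_reply: str, expected_relevant: bool) -> bool:
    """Score one test case: does the model's yes/no verdict match the label?"""
    verdict = model_reply.strip().lower().startswith("yes")
    return verdict == expected_relevant

print(judge_passed("Yes, it is relevant.", expected_relevant=True))  # True
print(judge_passed("No.", expected_relevant=True))                   # False
```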

TODO: description of `rag_interpretation_test_explicit_relevance_of_single_fragments`

| Full model name | Passed test cases | Total test cases | Accuracy | Iterations |
|---|---|---|---|---|
| llama-2-chat:13:ggufv2:Q8_0 | 6 | 6 | 1 | 5 |
| llama-2-chat:13:ggufv2:Q4_K_M | 6 | 6 | 1 | 5 |
| llama-2-chat:13:ggufv2:Q6_K | 6 | 6 | 1 | 5 |
| llama-2-chat:70:ggufv2:Q2_K | 6 | 6 | 1 | 5 |
| llama-2-chat:70:ggufv2:Q3_K_M | 6 | 6 | 1 | 5 |
| llama-2-chat:70:ggufv2:Q4_K_M | 6 | 6 | 1 | 5 |
| llama-2-chat:70:ggufv2:Q5_K_M | 6 | 6 | 1 | 5 |
| llama-2-chat:7:ggufv2:Q3_K_M | 6 | 6 | 1 | 5 |
| llama-2-chat:7:ggufv2:Q4_K_M | 6 | 6 | 1 | 5 |
| llama-2-chat:7:ggufv2:Q5_K_M | 6 | 6 | 1 | 5 |
| llama-2-chat:7:ggufv2:Q6_K | 6 | 6 | 1 | 5 |
| llama-2-chat:7:ggufv2:Q8_0 | 6 | 6 | 1 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q2_K | 6 | 6 | 1 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q3_K_M | 6 | 6 | 1 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q4_K_M | 6 | 6 | 1 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q5_K_M | 6 | 6 | 1 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q6_K | 6 | 6 | 1 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q8_0 | 6 | 6 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q2_K | 6 | 6 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q3_K_M | 6 | 6 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q4_K_M | 6 | 6 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q5_K_M | 6 | 6 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q6_K | 6 | 6 | 1 | 5 |
| llama-2-chat:13:ggufv2:Q5_K_M | 6 | 6 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q8_0 | 6 | 6 | 1 | 5 |
| llama-2-chat:13:ggufv2:Q3_K_M | 6 | 6 | 1 | 5 |
| llama-2-chat:13:ggufv2:Q2_K | 6 | 6 | 1 | 5 |
| gpt-4-0613 | 6 | 6 | 1 | 5 |
| gpt-4-0125-preview | 6 | 6 | 1 | 5 |
| gpt-3.5-turbo-0613 | 6 | 6 | 1 | 5 |
| gpt-3.5-turbo-0125 | 6 | 6 | 1 | 5 |
| code-llama-instruct:7:ggufv2:Q8_0 | 6 | 6 | 1 | 5 |
| code-llama-instruct:7:ggufv2:Q4_K_M | 6 | 6 | 1 | 5 |
| code-llama-instruct:13:ggufv2:Q6_K | 5 | 6 | 0.833333 | 5 |
| llama-2-chat:7:ggufv2:Q2_K | 5 | 6 | 0.833333 | 5 |
| code-llama-instruct:7:ggufv2:Q6_K | 5 | 6 | 0.833333 | 5 |
| code-llama-instruct:7:ggufv2:Q5_K_M | 5 | 6 | 0.833333 | 5 |
| code-llama-instruct:7:ggufv2:Q3_K_M | 5 | 6 | 0.833333 | 5 |
| code-llama-instruct:13:ggufv2:Q8_0 | 5 | 6 | 0.833333 | 5 |
| chatglm3:6:ggmlv3:q4_0 | 4.4 | 6 | 0.733333 | 5 |
| code-llama-instruct:13:ggufv2:Q5_K_M | 4 | 6 | 0.666667 | 5 |
| code-llama-instruct:34:ggufv2:Q4_K_M | 3 | 6 | 0.5 | 5 |
| code-llama-instruct:34:ggufv2:Q3_K_M | 3 | 6 | 0.5 | 5 |
| code-llama-instruct:34:ggufv2:Q2_K | 3 | 6 | 0.5 | 5 |
| code-llama-instruct:34:ggufv2:Q6_K | 2 | 6 | 0.333333 | 5 |
| code-llama-instruct:34:ggufv2:Q8_0 | 2 | 6 | 0.333333 | 5 |
| code-llama-instruct:13:ggufv2:Q4_K_M | 2 | 6 | 0.333333 | 5 |
| code-llama-instruct:7:ggufv2:Q2_K | 2 | 6 | 0.333333 | 5 |
| code-llama-instruct:34:ggufv2:Q5_K_M | 2 | 6 | 0.333333 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q2_K | 2 | 6 | 0.333333 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M | 1 | 6 | 0.166667 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 | 0.8 | 6 | 0.133333 | 5 |
| code-llama-instruct:13:ggufv2:Q2_K | 0.2 | 6 | 0.0333333 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q6_K | 0 | 6 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M | 0 | 6 | 0 | 5 |
| code-llama-instruct:13:ggufv2:Q3_K_M | 0 | 6 | 0 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M | 0 | 6 | 0 | 5 |

TODO: description of `rag_interpretation_test_implicit_relevance_of_multiple_fragments`

| Full model name | Passed test cases | Total test cases | Accuracy | Iterations |
|---|---|---|---|---|
| chatglm3:6:ggmlv3:q4_0 | 2 | 2 | 1 | 5 |
| gpt-3.5-turbo-0613 | 2 | 2 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q6_K | 2 | 2 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q5_K_M | 2 | 2 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q4_K_M | 2 | 2 | 1 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M | 2 | 2 | 1 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M | 2 | 2 | 1 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q6_K | 2 | 2 | 1 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q5_K_M | 2 | 2 | 1 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q4_K_M | 2 | 2 | 1 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q3_K_M | 2 | 2 | 1 | 5 |
| llama-2-chat:7:ggufv2:Q3_K_M | 2 | 2 | 1 | 5 |
| llama-2-chat:7:ggufv2:Q2_K | 2 | 2 | 1 | 5 |
| llama-2-chat:70:ggufv2:Q4_K_M | 2 | 2 | 1 | 5 |
| gpt-4-0613 | 2 | 2 | 1 | 5 |
| openhermes-2.5:7:ggufv2:Q8_0 | 2 | 2 | 1 | 5 |
| code-llama-instruct:34:ggufv2:Q2_K | 2 | 2 | 1 | 5 |
| code-llama-instruct:7:ggufv2:Q4_K_M | 2 | 2 | 1 | 5 |
| code-llama-instruct:34:ggufv2:Q5_K_M | 2 | 2 | 1 | 5 |
| gpt-3.5-turbo-0125 | 1.8 | 2 | 0.9 | 5 |
| code-llama-instruct:7:ggufv2:Q6_K | 1.8 | 2 | 0.9 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q8_0 | 1.8 | 2 | 0.9 | 5 |
| code-llama-instruct:34:ggufv2:Q6_K | 1.8 | 2 | 0.9 | 5 |
| code-llama-instruct:34:ggufv2:Q8_0 | 1.8 | 2 | 0.9 | 5 |
| llama-2-chat:70:ggufv2:Q5_K_M | 1.8 | 2 | 0.9 | 5 |
| code-llama-instruct:7:ggufv2:Q2_K | 1.4 | 2 | 0.7 | 5 |
| code-llama-instruct:7:ggufv2:Q3_K_M | 1.4 | 2 | 0.7 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q6_K | 1.4 | 2 | 0.7 | 5 |
| llama-2-chat:7:ggufv2:Q5_K_M | 1.2 | 2 | 0.6 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 | 1.2 | 2 | 0.6 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q2_K | 1.2 | 2 | 0.6 | 5 |
| code-llama-instruct:13:ggufv2:Q6_K | 1 | 2 | 0.5 | 5 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M | 1 | 2 | 0.5 | 5 |
| code-llama-instruct:13:ggufv2:Q8_0 | 1 | 2 | 0.5 | 5 |
| openhermes-2.5:7:ggufv2:Q2_K | 1 | 2 | 0.5 | 5 |
| openhermes-2.5:7:ggufv2:Q3_K_M | 1 | 2 | 0.5 | 5 |
| code-llama-instruct:13:ggufv2:Q5_K_M | 1 | 2 | 0.5 | 5 |
| code-llama-instruct:13:ggufv2:Q4_K_M | 1 | 2 | 0.5 | 5 |
| code-llama-instruct:34:ggufv2:Q3_K_M | 1 | 2 | 0.5 | 5 |
| code-llama-instruct:7:ggufv2:Q5_K_M | 1 | 2 | 0.5 | 5 |
| mistral-instruct-v0.2:7:ggufv2:Q2_K | 1 | 2 | 0.5 | 5 |
| llama-2-chat:13:ggufv2:Q5_K_M | 1 | 2 | 0.5 | 5 |
| code-llama-instruct:7:ggufv2:Q8_0 | 1 | 2 | 0.5 | 5 |
| gpt-4-0125-preview | 1 | 2 | 0.5 | 5 |
| llama-2-chat:13:ggufv2:Q2_K | 1 | 2 | 0.5 | 5 |
| llama-2-chat:13:ggufv2:Q3_K_M | 1 | 2 | 0.5 | 5 |
| llama-2-chat:13:ggufv2:Q4_K_M | 1 | 2 | 0.5 | 5 |
| llama-2-chat:13:ggufv2:Q6_K | 1 | 2 | 0.5 | 5 |
| llama-2-chat:70:ggufv2:Q2_K | 1 | 2 | 0.5 | 5 |
| llama-2-chat:70:ggufv2:Q3_K_M | 1 | 2 | 0.5 | 5 |
| llama-2-chat:7:ggufv2:Q4_K_M | 1 | 2 | 0.5 | 5 |
| llama-2-chat:7:ggufv2:Q6_K | 1 | 2 | 0.5 | 5 |
| llama-2-chat:7:ggufv2:Q8_0 | 1 | 2 | 0.5 | 5 |
| llama-2-chat:13:ggufv2:Q8_0 | 1 | 2 | 0.5 | 5 |
| code-llama-instruct:34:ggufv2:Q4_K_M | 0.8 | 2 | 0.4 | 5 |
| code-llama-instruct:13:ggufv2:Q2_K | 0.8 | 2 | 0.4 | 5 |
| code-llama-instruct:13:ggufv2:Q3_K_M | 0 | 2 | 0 | 5 |

Coming soon.