
Benchmark - All Results

BioCypher query generation

In this set of tasks, we test LLM abilities to generate queries for a BioCypher Knowledge Graph using BioChatter. The schema_config.yaml of the BioCypher Knowledge Graph and a natural language query are passed to BioChatter.

We test the individual steps of the query generation process separately, as well as the end-to-end performance of the whole process.

Full model name Score achieved Score possible Accuracy Iterations
gpt-3.5-turbo-0125 8 8 1 5
openhermes-2.5:7:ggufv2:Q6_K 8 8 1 5
openhermes-2.5:7:ggufv2:Q3_K_M 9 9 1 5
gpt-4o-2024-05-13 8 8 1 5
openhermes-2.5:7:ggufv2:Q8_0 8 9 0.888889 5
openhermes-2.5:7:ggufv2:Q5_K_M 8 9 0.888889 5
openhermes-2.5:7:ggufv2:Q4_K_M 8 9 0.888889 5
gpt-4-0613 8 9 0.888889 5
gpt-3.5-turbo-0613 8 9 0.888889 5
llama-3-instruct:8:ggufv2:Q8_0 7 8 0.875 5
llama-3-instruct:8:ggufv2:Q6_K 7 8 0.875 5
llama-3-instruct:8:ggufv2:Q5_K_M 7 8 0.875 5
llama-3-instruct:8:ggufv2:Q4_K_M 7 8 0.875 5
gpt-4-0125-preview 7 9 0.777778 5
chatglm3:6:ggmlv3:q4_0 6 8 0.75 5
openhermes-2.5:7:ggufv2:Q2_K 5 9 0.555556 5
code-llama-instruct:7:ggufv2:Q3_K_M 4 8 0.5 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 4 8 0.5 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 3.8 8 0.475 5
code-llama-instruct:13:ggufv2:Q3_K_M 3.6 8 0.45 5
llama-2-chat:70:ggufv2:Q5_K_M 4 9 0.444444 5
llama-2-chat:7:ggufv2:Q8_0 4 9 0.444444 5
llama-2-chat:70:ggufv2:Q4_K_M 4 9 0.444444 5
llama-2-chat:7:ggufv2:Q4_K_M 4 9 0.444444 5
llama-2-chat:7:ggufv2:Q5_K_M 4 9 0.444444 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 4 9 0.444444 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 3.8 9 0.422222 5
llama-2-chat:7:ggufv2:Q6_K 3 8 0.375 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 3 9 0.333333 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 3 9 0.333333 5
llama-2-chat:7:ggufv2:Q3_K_M 3 9 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 3 9 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 3 9 0.333333 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 3 9 0.333333 5
code-llama-instruct:7:ggufv2:Q4_K_M 3 9 0.333333 5
llama-2-chat:70:ggufv2:Q3_K_M 3 9 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 2.8 9 0.311111 5
code-llama-instruct:7:ggufv2:Q2_K 2 8 0.25 5
code-llama-instruct:34:ggufv2:Q8_0 2 8 0.25 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 2 9 0.222222 5
code-llama-instruct:34:ggufv2:Q6_K 1 8 0.125 5
code-llama-instruct:34:ggufv2:Q5_K_M 1 8 0.125 5
code-llama-instruct:7:ggufv2:Q5_K_M 1 9 0.111111 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 0 9 0 5
code-llama-instruct:7:ggufv2:Q6_K 0 8 0 5
code-llama-instruct:13:ggufv2:Q4_K_M 0 8 0 5
code-llama-instruct:13:ggufv2:Q5_K_M 0 8 0 5
code-llama-instruct:13:ggufv2:Q6_K 0 8 0 5
code-llama-instruct:7:ggufv2:Q8_0 0 9 0 5
llama-2-chat:7:ggufv2:Q2_K 0 9 0 5
llama-2-chat:13:ggufv2:Q2_K 0 9 0 5
code-llama-instruct:13:ggufv2:Q2_K 0 8 0 5
llama-2-chat:13:ggufv2:Q4_K_M 0 9 0 5
llama-2-chat:13:ggufv2:Q5_K_M 0 9 0 5
llama-2-chat:13:ggufv2:Q6_K 0 8 0 5
code-llama-instruct:13:ggufv2:Q8_0 0 8 0 5
code-llama-instruct:34:ggufv2:Q2_K 0 8 0 5
code-llama-instruct:34:ggufv2:Q3_K_M 0 8 0 5
code-llama-instruct:34:ggufv2:Q4_K_M 0 8 0 5
llama-2-chat:13:ggufv2:Q8_0 0 9 0 5
llama-2-chat:70:ggufv2:Q2_K 0 9 0 5
llama-2-chat:13:ggufv2:Q3_K_M 0 9 0 5

Full model name Score achieved Score possible Accuracy Iterations
openhermes-2.5:7:ggufv2:Q8_0 12 12 1 5
gpt-3.5-turbo-0125 12 12 1 5
openhermes-2.5:7:ggufv2:Q6_K 12 12 1 5
openhermes-2.5:7:ggufv2:Q5_K_M 12 12 1 5
openhermes-2.5:7:ggufv2:Q4_K_M 12 12 1 5
openhermes-2.5:7:ggufv2:Q3_K_M 12 12 1 5
gpt-4-0125-preview 9 12 0.75 5
gpt-4-0613 7.8 12 0.65 5
openhermes-2.5:7:ggufv2:Q2_K 6 12 0.5 5
code-llama-instruct:34:ggufv2:Q2_K 6 12 0.5 5
gpt-3.5-turbo-0613 6 12 0.5 5
chatglm3:6:ggmlv3:q4_0 4.8 12 0.4 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 3 12 0.25 5
code-llama-instruct:7:ggufv2:Q3_K_M 3 12 0.25 5
code-llama-instruct:7:ggufv2:Q2_K 3 12 0.25 5
llama-2-chat:70:ggufv2:Q5_K_M 3 12 0.25 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 3 12 0.25 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 3 12 0.25 5
code-llama-instruct:34:ggufv2:Q3_K_M 3 12 0.25 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 3 12 0.25 5
llama-2-chat:70:ggufv2:Q4_K_M 3 12 0.25 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 0 12 0 5
llama-3-instruct:8:ggufv2:Q6_K 0 12 0 5
llama-3-instruct:8:ggufv2:Q8_0 0 12 0 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 0 12 0 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 0 12 0 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 0 12 0 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 0 12 0 5
code-llama-instruct:34:ggufv2:Q8_0 0 12 0 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 0 12 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 0 12 0 5
llama-3-instruct:8:ggufv2:Q4_K_M 0 12 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 0 12 0 5
code-llama-instruct:13:ggufv2:Q8_0 0 12 0 5
code-llama-instruct:13:ggufv2:Q6_K 0 12 0 5
code-llama-instruct:13:ggufv2:Q5_K_M 0 12 0 5
code-llama-instruct:13:ggufv2:Q4_K_M 0 12 0 5
code-llama-instruct:13:ggufv2:Q3_K_M 0 12 0 5
llama-3-instruct:8:ggufv2:Q5_K_M 0 12 0 5
llama-2-chat:7:ggufv2:Q5_K_M 0 12 0 5
llama-2-chat:7:ggufv2:Q8_0 0 12 0 5
llama-2-chat:7:ggufv2:Q6_K 0 12 0 5
code-llama-instruct:7:ggufv2:Q4_K_M 0 12 0 5
code-llama-instruct:7:ggufv2:Q5_K_M 0 12 0 5
code-llama-instruct:7:ggufv2:Q6_K 0 12 0 5
code-llama-instruct:7:ggufv2:Q8_0 0 12 0 5
code-llama-instruct:34:ggufv2:Q6_K 0 12 0 5
code-llama-instruct:34:ggufv2:Q5_K_M 0 12 0 5
code-llama-instruct:34:ggufv2:Q4_K_M 0 12 0 5
gpt-4o-2024-05-13 0 12 0 5
llama-2-chat:13:ggufv2:Q2_K 0 12 0 5
llama-2-chat:13:ggufv2:Q3_K_M 0 12 0 5
llama-2-chat:13:ggufv2:Q4_K_M 0 12 0 5
llama-2-chat:13:ggufv2:Q5_K_M 0 12 0 5
llama-2-chat:13:ggufv2:Q6_K 0 12 0 5
llama-2-chat:13:ggufv2:Q8_0 0 12 0 5
llama-2-chat:70:ggufv2:Q2_K 0 12 0 5
code-llama-instruct:13:ggufv2:Q2_K 0 12 0 5
llama-2-chat:7:ggufv2:Q2_K 0 12 0 5
llama-2-chat:7:ggufv2:Q3_K_M 0 12 0 5
llama-2-chat:7:ggufv2:Q4_K_M 0 12 0 5
llama-2-chat:70:ggufv2:Q3_K_M 0 12 0 5

Full model name Score achieved Score possible Accuracy Iterations
gpt-3.5-turbo-0613 23.2 64 0.3625 5
gpt-4-0613 23 64 0.359375 5
gpt-3.5-turbo-0125 22.8 64 0.35625 5
chatglm3:6:ggmlv3:q4_0 18.4 64 0.2875 5
llama-3-instruct:8:ggufv2:Q8_0 18 64 0.28125 5
llama-3-instruct:8:ggufv2:Q6_K 18 64 0.28125 5
llama-3-instruct:8:ggufv2:Q5_K_M 12 64 0.1875 5
llama-2-chat:70:ggufv2:Q3_K_M 11 64 0.171875 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 10.4 64 0.1625 5
openhermes-2.5:7:ggufv2:Q8_0 8 64 0.125 5
openhermes-2.5:7:ggufv2:Q5_K_M 8 64 0.125 5
openhermes-2.5:7:ggufv2:Q3_K_M 8 64 0.125 5
llama-3-instruct:8:ggufv2:Q4_K_M 7 64 0.109375 5
llama-2-chat:7:ggufv2:Q3_K_M 6.4 64 0.1 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 4.2 64 0.065625 5
code-llama-instruct:7:ggufv2:Q2_K 4 64 0.0625 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 3 64 0.046875 5
openhermes-2.5:7:ggufv2:Q6_K 3 64 0.046875 5
openhermes-2.5:7:ggufv2:Q4_K_M 3 64 0.046875 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 3 64 0.046875 5
llama-2-chat:7:ggufv2:Q5_K_M 2.4 64 0.0375 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 2.4 64 0.0375 5
code-llama-instruct:13:ggufv2:Q5_K_M 0 64 0 5
code-llama-instruct:13:ggufv2:Q4_K_M 0 64 0 5
code-llama-instruct:13:ggufv2:Q3_K_M 0 64 0 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 0 64 0 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 0 64 0 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 0 64 0 5
code-llama-instruct:7:ggufv2:Q3_K_M 0 64 0 5
llama-2-chat:7:ggufv2:Q8_0 0 64 0 5
code-llama-instruct:34:ggufv2:Q8_0 0 64 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 0 64 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 0 64 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 0 64 0 5
openhermes-2.5:7:ggufv2:Q2_K 0 64 0 5
code-llama-instruct:34:ggufv2:Q6_K 0 64 0 5
code-llama-instruct:34:ggufv2:Q5_K_M 0 64 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 0 64 0 5
llama-2-chat:7:ggufv2:Q6_K 0 64 0 5
code-llama-instruct:7:ggufv2:Q4_K_M 0 64 0 5
llama-2-chat:13:ggufv2:Q3_K_M 0 64 0 5
code-llama-instruct:7:ggufv2:Q5_K_M 0 64 0 5
code-llama-instruct:7:ggufv2:Q6_K 0 64 0 5
code-llama-instruct:7:ggufv2:Q8_0 0 64 0 5
code-llama-instruct:34:ggufv2:Q2_K 0 64 0 5
code-llama-instruct:13:ggufv2:Q8_0 0 64 0 5
gpt-4-0125-preview 0 64 0 5
code-llama-instruct:13:ggufv2:Q6_K 0 64 0 5
gpt-4o-2024-05-13 0 64 0 5
llama-2-chat:13:ggufv2:Q2_K 0 64 0 5
llama-2-chat:13:ggufv2:Q4_K_M 0 64 0 5
llama-2-chat:7:ggufv2:Q4_K_M 0 64 0 5
llama-2-chat:13:ggufv2:Q5_K_M 0 64 0 5
llama-2-chat:13:ggufv2:Q6_K 0 64 0 5
llama-2-chat:13:ggufv2:Q8_0 0 64 0 5
llama-2-chat:70:ggufv2:Q2_K 0 64 0 5
code-llama-instruct:13:ggufv2:Q2_K 0 64 0 5
llama-2-chat:70:ggufv2:Q4_K_M 0 64 0 5
llama-2-chat:70:ggufv2:Q5_K_M 0 64 0 5
llama-2-chat:7:ggufv2:Q2_K 0 64 0 5
code-llama-instruct:34:ggufv2:Q4_K_M 0 64 0 5
code-llama-instruct:34:ggufv2:Q3_K_M 0 64 0 5

Full model name Score achieved Score possible Accuracy Iterations
code-llama-instruct:34:ggufv2:Q4_K_M 7.8 8 0.975 5
code-llama-instruct:34:ggufv2:Q5_K_M 7.6 8 0.95 5
code-llama-instruct:34:ggufv2:Q8_0 7.4 8 0.925 5
code-llama-instruct:34:ggufv2:Q6_K 7.2 8 0.9 5
gpt-4-0613 8 9 0.888889 5
code-llama-instruct:13:ggufv2:Q2_K 7 8 0.875 5
code-llama-instruct:34:ggufv2:Q3_K_M 7 8 0.875 5
gpt-3.5-turbo-0125 7.8 9 0.866667 5
code-llama-instruct:13:ggufv2:Q3_K_M 6.8 8 0.85 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 6.8 8 0.85 5
gpt-4o-2024-05-13 6.8 8 0.85 5
openhermes-2.5:7:ggufv2:Q2_K 7.6 9 0.844444 5
code-llama-instruct:13:ggufv2:Q6_K 6.6 8 0.825 5
code-llama-instruct:7:ggufv2:Q3_K_M 7.2 9 0.8 5
code-llama-instruct:7:ggufv2:Q2_K 6.4 8 0.8 5
llama-2-chat:70:ggufv2:Q5_K_M 7 9 0.777778 5
llama-2-chat:13:ggufv2:Q4_K_M 7 9 0.777778 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 7 9 0.777778 5
llama-2-chat:70:ggufv2:Q3_K_M 7 9 0.777778 5
openhermes-2.5:7:ggufv2:Q5_K_M 7 9 0.777778 5
code-llama-instruct:7:ggufv2:Q6_K 6.2 8 0.775 5
llama-3-instruct:8:ggufv2:Q6_K 6.2 8 0.775 5
llama-3-instruct:8:ggufv2:Q4_K_M 6.2 8 0.775 5
code-llama-instruct:13:ggufv2:Q5_K_M 6.2 8 0.775 5
code-llama-instruct:13:ggufv2:Q4_K_M 6.2 8 0.775 5
llama-2-chat:13:ggufv2:Q6_K 6.2 8 0.775 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 6.8 9 0.755556 5
openhermes-2.5:7:ggufv2:Q4_K_M 6.8 9 0.755556 5
llama-2-chat:70:ggufv2:Q4_K_M 6.8 9 0.755556 5
openhermes-2.5:7:ggufv2:Q8_0 6.8 9 0.755556 5
gpt-3.5-turbo-0613 6.8 9 0.755556 5
code-llama-instruct:13:ggufv2:Q8_0 6 8 0.75 5
code-llama-instruct:34:ggufv2:Q2_K 6 8 0.75 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 6.6 9 0.733333 5
gpt-4-0125-preview 6.6 9 0.733333 5
openhermes-2.5:7:ggufv2:Q6_K 6.6 9 0.733333 5
llama-2-chat:13:ggufv2:Q3_K_M 6.6 9 0.733333 5
llama-3-instruct:8:ggufv2:Q8_0 5.8 8 0.725 5
openhermes-2.5:7:ggufv2:Q3_K_M 7.2 10 0.72 5
llama-2-chat:13:ggufv2:Q8_0 6.4 9 0.711111 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 6.4 9 0.711111 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 6.2 9 0.688889 5
code-llama-instruct:7:ggufv2:Q5_K_M 6.2 9 0.688889 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 6.2 9 0.688889 5
llama-2-chat:7:ggufv2:Q2_K 6.2 9 0.688889 5
code-llama-instruct:7:ggufv2:Q8_0 6 9 0.666667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 6 9 0.666667 5
llama-2-chat:70:ggufv2:Q2_K 6 9 0.666667 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 6 9 0.666667 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 5.2 8 0.65 5
llama-3-instruct:8:ggufv2:Q5_K_M 5.2 8 0.65 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 5.8 9 0.644444 5
llama-2-chat:13:ggufv2:Q5_K_M 5.8 9 0.644444 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 5.4 9 0.6 5
code-llama-instruct:7:ggufv2:Q4_K_M 5.4 9 0.6 5
llama-2-chat:7:ggufv2:Q4_K_M 4.4 9 0.488889 5
llama-2-chat:7:ggufv2:Q3_K_M 4.2 9 0.466667 5
llama-2-chat:7:ggufv2:Q8_0 3.2 9 0.355556 5
llama-2-chat:7:ggufv2:Q6_K 3 9 0.333333 5
llama-2-chat:13:ggufv2:Q2_K 2.6 9 0.288889 5
llama-2-chat:7:ggufv2:Q5_K_M 2.6 9 0.288889 5
chatglm3:6:ggmlv3:q4_0 2.2 8 0.275 5

Full model name Score achieved Score possible Accuracy Iterations
gpt-4-0613 20.4 30 0.68 5
llama-3-instruct:8:ggufv2:Q4_K_M 20 30 0.666667 5
llama-3-instruct:8:ggufv2:Q8_0 20 30 0.666667 5
llama-3-instruct:8:ggufv2:Q6_K 20 30 0.666667 5
code-llama-instruct:7:ggufv2:Q4_K_M 19.6 30 0.653333 5
llama-3-instruct:8:ggufv2:Q5_K_M 18 30 0.6 5
code-llama-instruct:34:ggufv2:Q3_K_M 18 30 0.6 5
openhermes-2.5:7:ggufv2:Q5_K_M 17.6 30 0.586667 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 17.2 30 0.573333 5
code-llama-instruct:13:ggufv2:Q8_0 17 30 0.566667 5
code-llama-instruct:34:ggufv2:Q2_K 17 30 0.566667 5
code-llama-instruct:13:ggufv2:Q2_K 17 30 0.566667 5
code-llama-instruct:13:ggufv2:Q5_K_M 17 30 0.566667 5
code-llama-instruct:13:ggufv2:Q6_K 16.2 30 0.54 5
code-llama-instruct:7:ggufv2:Q2_K 16 30 0.533333 5
gpt-4o-2024-05-13 16 30 0.533333 5
code-llama-instruct:13:ggufv2:Q4_K_M 16 30 0.533333 5
code-llama-instruct:13:ggufv2:Q3_K_M 16 30 0.533333 5
openhermes-2.5:7:ggufv2:Q6_K 16 30 0.533333 5
gpt-3.5-turbo-0613 15 30 0.5 5
gpt-3.5-turbo-0125 14.6 30 0.486667 5
llama-2-chat:13:ggufv2:Q3_K_M 14.4 30 0.48 5
llama-2-chat:13:ggufv2:Q8_0 14.4 30 0.48 5
chatglm3:6:ggmlv3:q4_0 14.4 30 0.48 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 14.4 30 0.48 5
code-llama-instruct:34:ggufv2:Q6_K 14.2 30 0.473333 5
llama-2-chat:70:ggufv2:Q2_K 14.2 30 0.473333 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 14 30 0.466667 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 14 30 0.466667 5
openhermes-2.5:7:ggufv2:Q3_K_M 14 30 0.466667 5
openhermes-2.5:7:ggufv2:Q4_K_M 14 30 0.466667 5
openhermes-2.5:7:ggufv2:Q8_0 14 30 0.466667 5
code-llama-instruct:34:ggufv2:Q4_K_M 14 30 0.466667 5
code-llama-instruct:34:ggufv2:Q5_K_M 14 30 0.466667 5
code-llama-instruct:34:ggufv2:Q8_0 14 30 0.466667 5
gpt-4-0125-preview 13.2 30 0.44 5
openhermes-2.5:7:ggufv2:Q2_K 13 30 0.433333 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 13 30 0.433333 5
llama-2-chat:13:ggufv2:Q5_K_M 13 30 0.433333 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 13 30 0.433333 5
code-llama-instruct:7:ggufv2:Q3_K_M 12.8 30 0.426667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 12.8 30 0.426667 5
llama-2-chat:70:ggufv2:Q4_K_M 12.6 30 0.42 5
llama-2-chat:70:ggufv2:Q3_K_M 12.4 30 0.413333 5
code-llama-instruct:7:ggufv2:Q5_K_M 12 30 0.4 5
code-llama-instruct:7:ggufv2:Q8_0 12 30 0.4 5
llama-2-chat:13:ggufv2:Q6_K 11.6 30 0.386667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 11.6 30 0.386667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 11.4 30 0.38 5
llama-2-chat:13:ggufv2:Q2_K 11 30 0.366667 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 11 30 0.366667 5
llama-2-chat:13:ggufv2:Q4_K_M 11 30 0.366667 5
llama-2-chat:70:ggufv2:Q5_K_M 10.8 30 0.36 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 10 30 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 10 30 0.333333 5
code-llama-instruct:7:ggufv2:Q6_K 10 30 0.333333 5
llama-2-chat:7:ggufv2:Q5_K_M 8.8 30 0.293333 5
llama-2-chat:7:ggufv2:Q8_0 8 30 0.266667 5
llama-2-chat:7:ggufv2:Q6_K 8 30 0.266667 5
llama-2-chat:7:ggufv2:Q4_K_M 7.2 30 0.24 5
llama-2-chat:7:ggufv2:Q3_K_M 7 30 0.233333 5
llama-2-chat:7:ggufv2:Q2_K 3 30 0.1 5

Full model name Score achieved Score possible Accuracy Iterations
gpt-3.5-turbo-0125 29 30 0.966667 5
gpt-4-0613 29 30 0.966667 5
code-llama-instruct:7:ggufv2:Q4_K_M 29 30 0.966667 5
code-llama-instruct:7:ggufv2:Q5_K_M 28.8 30 0.96 5
code-llama-instruct:7:ggufv2:Q8_0 28.8 30 0.96 5
code-llama-instruct:7:ggufv2:Q6_K 28.8 30 0.96 5
gpt-3.5-turbo-0613 28.4 30 0.946667 5
openhermes-2.5:7:ggufv2:Q3_K_M 28.2 30 0.94 5
openhermes-2.5:7:ggufv2:Q2_K 28.2 30 0.94 5
llama-3-instruct:8:ggufv2:Q6_K 27.8 30 0.926667 5
llama-3-instruct:8:ggufv2:Q5_K_M 27.8 30 0.926667 5
llama-3-instruct:8:ggufv2:Q4_K_M 27.6 30 0.92 5
llama-3-instruct:8:ggufv2:Q8_0 27.6 30 0.92 5
code-llama-instruct:7:ggufv2:Q2_K 27.6 30 0.92 5
llama-2-chat:70:ggufv2:Q4_K_M 27.6 30 0.92 5
openhermes-2.5:7:ggufv2:Q5_K_M 27.4 30 0.913333 5
llama-2-chat:70:ggufv2:Q3_K_M 27.2 30 0.906667 5
llama-2-chat:70:ggufv2:Q5_K_M 27.2 30 0.906667 5
code-llama-instruct:34:ggufv2:Q4_K_M 27.2 30 0.906667 5
code-llama-instruct:34:ggufv2:Q5_K_M 27 30 0.9 5
llama-2-chat:70:ggufv2:Q2_K 27 30 0.9 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 26.8 30 0.893333 5
openhermes-2.5:7:ggufv2:Q8_0 26.4 30 0.88 5
openhermes-2.5:7:ggufv2:Q4_K_M 26.2 30 0.873333 5
code-llama-instruct:7:ggufv2:Q3_K_M 26.2 30 0.873333 5
code-llama-instruct:34:ggufv2:Q8_0 25.8 30 0.86 5
openhermes-2.5:7:ggufv2:Q6_K 25.8 30 0.86 5
code-llama-instruct:34:ggufv2:Q6_K 25.6 30 0.853333 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 25.4 30 0.846667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 25.4 30 0.846667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 25.2 30 0.84 5
gpt-4-0125-preview 25 30 0.833333 5
code-llama-instruct:13:ggufv2:Q4_K_M 25 30 0.833333 5
code-llama-instruct:13:ggufv2:Q3_K_M 25 30 0.833333 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 25 30 0.833333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 24.8 30 0.826667 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 24.8 30 0.826667 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 24.8 30 0.826667 5
code-llama-instruct:13:ggufv2:Q2_K 24.6 30 0.82 5
llama-2-chat:13:ggufv2:Q6_K 24.4 30 0.813333 5
gpt-4o-2024-05-13 24 30 0.8 5
code-llama-instruct:13:ggufv2:Q6_K 23.8 30 0.793333 5
llama-2-chat:13:ggufv2:Q8_0 23.6 30 0.786667 5
code-llama-instruct:34:ggufv2:Q3_K_M 23.6 30 0.786667 5
code-llama-instruct:13:ggufv2:Q5_K_M 23.4 30 0.78 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 23.2 30 0.773333 5
code-llama-instruct:13:ggufv2:Q8_0 23 30 0.766667 5
llama-2-chat:13:ggufv2:Q4_K_M 22.8 30 0.76 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 22.8 30 0.76 5
llama-2-chat:13:ggufv2:Q5_K_M 22.4 30 0.746667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 21.8 30 0.726667 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 20.8 30 0.693333 5
llama-2-chat:7:ggufv2:Q3_K_M 20.8 30 0.693333 5
code-llama-instruct:34:ggufv2:Q2_K 20.6 30 0.686667 5
llama-2-chat:7:ggufv2:Q2_K 20.6 30 0.686667 5
llama-2-chat:13:ggufv2:Q3_K_M 20.4 30 0.68 5
llama-2-chat:7:ggufv2:Q6_K 19.8 30 0.66 5
llama-2-chat:7:ggufv2:Q4_K_M 19.4 30 0.646667 5
llama-2-chat:7:ggufv2:Q8_0 19.2 30 0.64 5
llama-2-chat:7:ggufv2:Q5_K_M 19 30 0.633333 5
chatglm3:6:ggmlv3:q4_0 16.6 30 0.553333 5
llama-2-chat:13:ggufv2:Q2_K 13 30 0.433333 5

Full model name Score achieved Score possible Accuracy Iterations
gpt-3.5-turbo-0125 27.8 30 0.926667 5
gpt-4-0613 26.4 30 0.88 5
gpt-3.5-turbo-0613 25 30 0.833333 5
chatglm3:6:ggmlv3:q4_0 0 30 0 5
llama-2-chat:70:ggufv2:Q5_K_M 0 30 0 5
llama-2-chat:7:ggufv2:Q3_K_M 0 30 0 5
llama-2-chat:7:ggufv2:Q4_K_M 0 30 0 5
llama-2-chat:7:ggufv2:Q5_K_M 0 30 0 5
llama-2-chat:7:ggufv2:Q6_K 0 30 0 5
llama-2-chat:7:ggufv2:Q8_0 0 30 0 5
llama-3-instruct:8:ggufv2:Q4_K_M 0 30 0 5
llama-3-instruct:8:ggufv2:Q5_K_M 0 30 0 5
llama-3-instruct:8:ggufv2:Q6_K 0 30 0 5
llama-3-instruct:8:ggufv2:Q8_0 0 30 0 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 0 30 0 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 0 30 0 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 0 30 0 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 0 30 0 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 0 30 0 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 0 30 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 0 30 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 0 30 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 0 30 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 0 30 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 0 30 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 0 30 0 5
openhermes-2.5:7:ggufv2:Q2_K 0 30 0 5
openhermes-2.5:7:ggufv2:Q3_K_M 0 30 0 5
openhermes-2.5:7:ggufv2:Q4_K_M 0 30 0 5
openhermes-2.5:7:ggufv2:Q5_K_M 0 30 0 5
openhermes-2.5:7:ggufv2:Q6_K 0 30 0 5
llama-2-chat:7:ggufv2:Q2_K 0 30 0 5
llama-2-chat:70:ggufv2:Q3_K_M 0 30 0 5
llama-2-chat:70:ggufv2:Q4_K_M 0 30 0 5
code-llama-instruct:7:ggufv2:Q3_K_M 0 30 0 5
code-llama-instruct:13:ggufv2:Q3_K_M 0 30 0 5
code-llama-instruct:13:ggufv2:Q4_K_M 0 30 0 5
code-llama-instruct:13:ggufv2:Q5_K_M 0 30 0 5
code-llama-instruct:13:ggufv2:Q6_K 0 30 0 5
code-llama-instruct:13:ggufv2:Q8_0 0 30 0 5
code-llama-instruct:34:ggufv2:Q2_K 0 30 0 5
code-llama-instruct:34:ggufv2:Q3_K_M 0 30 0 5
code-llama-instruct:34:ggufv2:Q4_K_M 0 30 0 5
code-llama-instruct:34:ggufv2:Q5_K_M 0 30 0 5
code-llama-instruct:34:ggufv2:Q6_K 0 30 0 5
code-llama-instruct:34:ggufv2:Q8_0 0 30 0 5
code-llama-instruct:7:ggufv2:Q2_K 0 30 0 5
code-llama-instruct:7:ggufv2:Q4_K_M 0 30 0 5
code-llama-instruct:13:ggufv2:Q2_K 0 30 0 5
code-llama-instruct:7:ggufv2:Q5_K_M 0 30 0 5
code-llama-instruct:7:ggufv2:Q6_K 0 30 0 5
code-llama-instruct:7:ggufv2:Q8_0 0 30 0 5
gpt-4-0125-preview 0 30 0 5
gpt-4o-2024-05-13 0 30 0 5
llama-2-chat:13:ggufv2:Q2_K 0 30 0 5
llama-2-chat:13:ggufv2:Q3_K_M 0 30 0 5
llama-2-chat:13:ggufv2:Q4_K_M 0 30 0 5
llama-2-chat:13:ggufv2:Q5_K_M 0 30 0 5
llama-2-chat:13:ggufv2:Q6_K 0 30 0 5
llama-2-chat:13:ggufv2:Q8_0 0 30 0 5
llama-2-chat:70:ggufv2:Q2_K 0 30 0 5
openhermes-2.5:7:ggufv2:Q8_0 0 30 0 5
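
The Accuracy column in the tables above appears to be Score achieved divided by Score possible (for example, 8 / 9 ≈ 0.888889). As a minimal sketch, assuming the whitespace-separated row layout shown on this page, a row can be parsed and its accuracy recomputed as follows. The `Row` type and the `parse_row` and `recomputed_accuracy` helpers are hypothetical names introduced for illustration; they are not part of BioChatter.

```python
from dataclasses import dataclass


@dataclass
class Row:
    model: str
    achieved: float
    possible: float
    accuracy: float
    iterations: int


def parse_row(line: str) -> Row:
    """Split one flattened table row into its five columns.

    Assumes the last four whitespace-separated fields are the numeric
    columns and everything before them is the model name.
    """
    model, achieved, possible, accuracy, iterations = line.rsplit(maxsplit=4)
    return Row(model, float(achieved), float(possible), float(accuracy), int(iterations))


def recomputed_accuracy(row: Row) -> float:
    """Accuracy as score achieved / score possible (assumed definition)."""
    return row.achieved / row.possible if row.possible else 0.0


row = parse_row("gpt-4-0613 8 9 0.888889 5")
# The recomputed value should match the reported one up to rounding.
assert abs(recomputed_accuracy(row) - row.accuracy) < 1e-6
```

Splitting from the right keeps the model name intact even though it contains colons and underscores, because only the four trailing fields are treated as numbers.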

Retrieval-Augmented Generation (RAG)

In this set of tasks, we test LLM abilities to generate answers to a given question using a RAG agent, or to judge the relevance of a RAG fragment to a given question. Instructions can be explicit ("is this fragment relevant to the question?") or implicit (just asking the question without instructions and evaluating whether the model responds with 'not enough information given').
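
The two instruction styles can be sketched as follows. This is only an illustration under stated assumptions: the prompt wording, the `build_prompt` and `score_response` helpers, and the yes/no answer convention are hypothetical and are not taken from the benchmark code. Only the check for a "not enough information" reply comes from the description above.

```python
def build_prompt(question: str, fragment: str, explicit: bool) -> str:
    """Build an explicit or implicit relevance prompt (hypothetical wording)."""
    if explicit:
        return (
            f"Fragment: {fragment}\n"
            f"Question: {question}\n"
            "Is this fragment relevant to the question? Answer yes or no."
        )
    # Implicit: only the context and the question, with no instruction.
    return f"Context: {fragment}\nQuestion: {question}"


def score_response(response: str, fragment_is_relevant: bool, explicit: bool) -> int:
    """Return 1 if the response matches the expected judgment, else 0."""
    text = response.strip().lower()
    if explicit:
        # Assumed convention: the model starts its answer with yes or no.
        said_relevant = text.startswith("yes")
    else:
        # Assumed convention: declining to answer signals an irrelevant fragment.
        said_relevant = "not enough information" not in text
    return int(said_relevant == fragment_is_relevant)


# An implicit-style reply to an irrelevant fragment counts as correct.
print(score_response("Not enough information given.", fragment_is_relevant=False, explicit=False))
```

In the implicit case the model is never told to judge relevance, so the score depends entirely on whether it declines to answer from an unrelated fragment.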

Full model name Score achieved Score possible Accuracy Iterations
llama-2-chat:70:ggufv2:Q3_K_M 6 6 1 5
llama-3-instruct:8:ggufv2:Q6_K 6 6 1 5
llama-2-chat:13:ggufv2:Q8_0 6 6 1 5
llama-2-chat:70:ggufv2:Q2_K 6 6 1 5
llama-2-chat:70:ggufv2:Q4_K_M 6 6 1 5
llama-2-chat:70:ggufv2:Q5_K_M 6 6 1 5
llama-2-chat:7:ggufv2:Q3_K_M 6 6 1 5
llama-2-chat:7:ggufv2:Q4_K_M 6 6 1 5
llama-2-chat:7:ggufv2:Q5_K_M 6 6 1 5
llama-2-chat:7:ggufv2:Q6_K 6 6 1 5
llama-2-chat:7:ggufv2:Q8_0 6 6 1 5
llama-3-instruct:8:ggufv2:Q4_K_M 6 6 1 5
llama-3-instruct:8:ggufv2:Q5_K_M 6 6 1 5
llama-3-instruct:8:ggufv2:Q8_0 6 6 1 5
llama-2-chat:13:ggufv2:Q5_K_M 6 6 1 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 6 6 1 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 6 6 1 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 6 6 1 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 6 6 1 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 6 6 1 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 6 6 1 5
openhermes-2.5:7:ggufv2:Q2_K 6 6 1 5
openhermes-2.5:7:ggufv2:Q3_K_M 6 6 1 5
openhermes-2.5:7:ggufv2:Q4_K_M 6 6 1 5
openhermes-2.5:7:ggufv2:Q5_K_M 6 6 1 5
openhermes-2.5:7:ggufv2:Q6_K 6 6 1 5
llama-2-chat:13:ggufv2:Q6_K 6 6 1 5
openhermes-2.5:7:ggufv2:Q8_0 6 6 1 5
llama-2-chat:13:ggufv2:Q4_K_M 6 6 1 5
llama-2-chat:13:ggufv2:Q3_K_M 6 6 1 5
llama-2-chat:13:ggufv2:Q2_K 6 6 1 5
gpt-4o-2024-05-13 6 6 1 5
gpt-4-0613 6 6 1 5
gpt-4-0125-preview 6 6 1 5
gpt-3.5-turbo-0613 6 6 1 5
gpt-3.5-turbo-0125 6 6 1 5
code-llama-instruct:7:ggufv2:Q8_0 6 6 1 5
code-llama-instruct:7:ggufv2:Q4_K_M 6 6 1 5
code-llama-instruct:13:ggufv2:Q6_K 5 6 0.833333 5
llama-2-chat:7:ggufv2:Q2_K 5 6 0.833333 5
code-llama-instruct:7:ggufv2:Q6_K 5 6 0.833333 5
code-llama-instruct:7:ggufv2:Q5_K_M 5 6 0.833333 5
code-llama-instruct:7:ggufv2:Q3_K_M 5 6 0.833333 5
code-llama-instruct:13:ggufv2:Q8_0 5 6 0.833333 5
chatglm3:6:ggmlv3:q4_0 4.4 6 0.733333 5
code-llama-instruct:13:ggufv2:Q5_K_M 4 6 0.666667 5
code-llama-instruct:34:ggufv2:Q4_K_M 3 6 0.5 5
code-llama-instruct:34:ggufv2:Q3_K_M 3 6 0.5 5
code-llama-instruct:34:ggufv2:Q2_K 3 6 0.5 5
code-llama-instruct:34:ggufv2:Q8_0 2 6 0.333333 5
code-llama-instruct:34:ggufv2:Q5_K_M 2 6 0.333333 5
code-llama-instruct:13:ggufv2:Q4_K_M 2 6 0.333333 5
code-llama-instruct:7:ggufv2:Q2_K 2 6 0.333333 5
code-llama-instruct:34:ggufv2:Q6_K 2 6 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 2 6 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 1 6 0.166667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 0.8 6 0.133333 5
code-llama-instruct:13:ggufv2:Q2_K 0.2 6 0.0333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 0 6 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 0 6 0 5
code-llama-instruct:13:ggufv2:Q3_K_M 0 6 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 0 6 0 5

Full model name Score achieved Score possible Accuracy Iterations
chatglm3:6:ggmlv3:q4_0 2 2 1 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 2 2 1 5
llama-2-chat:70:ggufv2:Q4_K_M 2 2 1 5
llama-2-chat:7:ggufv2:Q2_K 2 2 1 5
llama-2-chat:7:ggufv2:Q3_K_M 2 2 1 5
llama-3-instruct:8:ggufv2:Q4_K_M 2 2 1 5
llama-3-instruct:8:ggufv2:Q5_K_M 2 2 1 5
llama-3-instruct:8:ggufv2:Q6_K 2 2 1 5
llama-3-instruct:8:ggufv2:Q8_0 2 2 1 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 2 2 1 5
gpt-3.5-turbo-0613 2 2 1 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 2 2 1 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 2 2 1 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 2 2 1 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 2 2 1 5
openhermes-2.5:7:ggufv2:Q4_K_M 2 2 1 5
openhermes-2.5:7:ggufv2:Q5_K_M 2 2 1 5
openhermes-2.5:7:ggufv2:Q6_K 2 2 1 5
gpt-4-0613 2 2 1 5
openhermes-2.5:7:ggufv2:Q8_0 2 2 1 5
code-llama-instruct:34:ggufv2:Q2_K 2 2 1 5
code-llama-instruct:34:ggufv2:Q5_K_M 2 2 1 5
code-llama-instruct:7:ggufv2:Q4_K_M 2 2 1 5
gpt-3.5-turbo-0125 1.8 2 0.9 5
code-llama-instruct:7:ggufv2:Q6_K 1.8 2 0.9 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 1.8 2 0.9 5
code-llama-instruct:34:ggufv2:Q6_K 1.8 2 0.9 5
code-llama-instruct:34:ggufv2:Q8_0 1.8 2 0.9 5
llama-2-chat:70:ggufv2:Q5_K_M 1.8 2 0.9 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 1.4 2 0.7 5
code-llama-instruct:7:ggufv2:Q3_K_M 1.4 2 0.7 5
code-llama-instruct:7:ggufv2:Q2_K 1.4 2 0.7 5
gpt-4o-2024-05-13 1.4 2 0.7 5
llama-2-chat:7:ggufv2:Q5_K_M 1.2 2 0.6 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 1.2 2 0.6 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 1.2 2 0.6 5
code-llama-instruct:34:ggufv2:Q3_K_M 1 2 0.5 5
llama-2-chat:70:ggufv2:Q2_K 1 2 0.5 5
code-llama-instruct:13:ggufv2:Q4_K_M 1 2 0.5 5
code-llama-instruct:13:ggufv2:Q5_K_M 1 2 0.5 5
openhermes-2.5:7:ggufv2:Q3_K_M 1 2 0.5 5
openhermes-2.5:7:ggufv2:Q2_K 1 2 0.5 5
code-llama-instruct:7:ggufv2:Q8_0 1 2 0.5 5
code-llama-instruct:13:ggufv2:Q6_K 1 2 0.5 5
code-llama-instruct:13:ggufv2:Q8_0 1 2 0.5 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 1 2 0.5 5
llama-2-chat:13:ggufv2:Q2_K 1 2 0.5 5
llama-2-chat:70:ggufv2:Q3_K_M 1 2 0.5 5
llama-2-chat:13:ggufv2:Q3_K_M 1 2 0.5 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 1 2 0.5 5
llama-2-chat:13:ggufv2:Q4_K_M 1 2 0.5 5
llama-2-chat:13:ggufv2:Q5_K_M 1 2 0.5 5
gpt-4-0125-preview 1 2 0.5 5
llama-2-chat:13:ggufv2:Q6_K 1 2 0.5 5
llama-2-chat:7:ggufv2:Q8_0 1 2 0.5 5
llama-2-chat:7:ggufv2:Q6_K 1 2 0.5 5
llama-2-chat:7:ggufv2:Q4_K_M 1 2 0.5 5
llama-2-chat:13:ggufv2:Q8_0 1 2 0.5 5
code-llama-instruct:7:ggufv2:Q5_K_M 1 2 0.5 5
code-llama-instruct:34:ggufv2:Q4_K_M 0.8 2 0.4 5
code-llama-instruct:13:ggufv2:Q2_K 0.8 2 0.4 5
code-llama-instruct:13:ggufv2:Q3_K_M 0 2 0 5

Text Extraction

In this set of tasks, we test LLM abilities to extract text from a given document.

Full model name Score achieved Score possible Accuracy Iterations
gpt-4-0125-preview 68.2808 99 0.689705 5
gpt-4-0613 66.2214 99 0.668903 5
gpt-4o-2024-05-13 64.7407 99 0.653946 5
openhermes-2.5:7:ggufv2:Q6_K 61.2976 99 0.619167 5
openhermes-2.5:7:ggufv2:Q8_0 59.482 99 0.600829 5
openhermes-2.5:7:ggufv2:Q4_K_M 59.1309 99 0.597281 5
openhermes-2.5:7:ggufv2:Q5_K_M 57.4117 99 0.579916 5
gpt-3.5-turbo-0613 56.9628 99 0.575381 5
openhermes-2.5:7:ggufv2:Q3_K_M 54.8943 99 0.554488 5
gpt-3.5-turbo-0125 50.4932 99 0.510032 5
openhermes-2.5:7:ggufv2:Q2_K 43.9613 99 0.444054 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 38.1897 99 0.385754 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 36.5284 99 0.368974 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 36.3738 99 0.367412 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 34.8167 99 0.351684 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 34.3555 99 0.347025 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 32.7949 99 0.331261 5
llama-2-chat:70:ggufv2:Q4_K_M 23.8526 99 0.240936 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 23.3302 99 0.235659 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 22.7326 99 0.229622 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 22.3269 99 0.225524 5
llama-2-chat:70:ggufv2:Q2_K 21.2897 99 0.215047 5
llama-2-chat:70:ggufv2:Q5_K_M 20.8064 99 0.210166 5
llama-2-chat:70:ggufv2:Q3_K_M 19.5919 99 0.197898 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 19.1849 99 0.193786 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 18.7286 99 0.189177 5
llama-3-instruct:8:ggufv2:Q8_0 18.6669 99 0.188555 5
chatglm3:6:ggmlv3:q4_0 18.6402 99 0.188284 5
llama-3-instruct:8:ggufv2:Q5_K_M 16.4769 99 0.166434 5
llama-3-instruct:8:ggufv2:Q6_K 16.103 99 0.162657 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 15.5939 99 0.157514 5
code-llama-instruct:7:ggufv2:Q4_K_M 13.7345 99 0.138732 5
llama-3-instruct:8:ggufv2:Q4_K_M 11.5703 99 0.116871 5
llama-2-chat:13:ggufv2:Q3_K_M 11.1504 99 0.112631 5
llama-2-chat:13:ggufv2:Q4_K_M 8.79788 99 0.0888675 5
llama-2-chat:7:ggufv2:Q4_K_M 8.43969 99 0.0852494 5
llama-2-chat:13:ggufv2:Q5_K_M 7.58505 99 0.0766167 5
llama-2-chat:13:ggufv2:Q8_0 7.54833 99 0.0762457 5
llama-2-chat:7:ggufv2:Q5_K_M 6.90615 99 0.0697591 5
llama-2-chat:7:ggufv2:Q3_K_M 6.4421 99 0.0650717 5
llama-2-chat:13:ggufv2:Q2_K 6.42895 99 0.0649389 5
llama-2-chat:7:ggufv2:Q2_K 3.58247 99 0.0361865 5

Full model name Subtask Score achieved Score possible Accuracy Iterations
gpt-4o-2024-05-13 assay 6.67307 9 0.741453 5
gpt-4-0125-preview assay 6.60264 9 0.733627 5
openhermes-2.5:7:ggufv2:Q6_K assay 6.45354 9 0.717059 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M assay 6.42156 9 0.713507 5
openhermes-2.5:7:ggufv2:Q8_0 assay 6.24141 9 0.69349 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 assay 5.8662 9 0.6518 5
mistral-instruct-v0.2:7:ggufv2:Q2_K assay 5.84165 9 0.649072 5
mistral-instruct-v0.2:7:ggufv2:Q6_K assay 5.83272 9 0.64808 5
openhermes-2.5:7:ggufv2:Q5_K_M assay 5.77475 9 0.641639 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M assay 5.72421 9 0.636024 5
gpt-3.5-turbo-0613 assay 5.71717 9 0.635241 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M assay 5.66084 9 0.628983 5
gpt-3.5-turbo-0125 assay 5.48324 9 0.609249 5
gpt-4-0613 assay 5.47238 9 0.608042 5
openhermes-2.5:7:ggufv2:Q4_K_M assay 5.40473 9 0.600526 5
openhermes-2.5:7:ggufv2:Q3_K_M assay 4.99329 9 0.55481 5
openhermes-2.5:7:ggufv2:Q2_K assay 4.35689 9 0.484099 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M assay 3.17543 9 0.352825 5
llama-2-chat:70:ggufv2:Q4_K_M assay 1.8509 9 0.205655 5
llama-2-chat:70:ggufv2:Q5_K_M assay 1.81844 9 0.202049 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K assay 1.68419 9 0.187133 5
chatglm3:6:ggmlv3:q4_0 assay 1.61672 9 0.179636 5
code-llama-instruct:7:ggufv2:Q4_K_M assay 1.53778 9 0.170864 5
llama-3-instruct:8:ggufv2:Q6_K assay 1.48103 9 0.164559 5
llama-3-instruct:8:ggufv2:Q8_0 assay 1.37088 9 0.15232 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 assay 1.16327 9 0.129252 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M assay 1.15926 9 0.128806 5
llama-2-chat:70:ggufv2:Q2_K assay 1.15095 9 0.127884 5
llama-2-chat:70:ggufv2:Q3_K_M assay 1.07788 9 0.119765 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M assay 1.05347 9 0.117052 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K assay 1.02909 9 0.114343 5
llama-2-chat:13:ggufv2:Q2_K assay 0.974441 9 0.108271 5
llama-3-instruct:8:ggufv2:Q5_K_M assay 0.922706 9 0.102523 5
llama-2-chat:7:ggufv2:Q5_K_M assay 0.919259 9 0.10214 5
llama-2-chat:13:ggufv2:Q5_K_M assay 0.836349 9 0.0929276 5
llama-2-chat:13:ggufv2:Q8_0 assay 0.756302 9 0.0840336 5
llama-2-chat:13:ggufv2:Q3_K_M assay 0.750557 9 0.0833952 5
llama-2-chat:13:ggufv2:Q4_K_M assay 0.647223 9 0.0719137 5
llama-2-chat:7:ggufv2:Q4_K_M assay 0.604799 9 0.0671999 5
llama-3-instruct:8:ggufv2:Q4_K_M assay 0.522273 9 0.0580303 5
llama-2-chat:7:ggufv2:Q3_K_M assay 0.455699 9 0.0506332 5
llama-2-chat:7:ggufv2:Q2_K assay 0.233824 9 0.0259804 5

Full model name Subtask Score achieved Score possible Accuracy Iterations
gpt-4-0613 chemical 6.38889 9 0.709877 5
gpt-4-0125-preview chemical 6.22222 9 0.691358 5
openhermes-2.5:7:ggufv2:Q6_K chemical 6.16667 9 0.685185 5
gpt-4o-2024-05-13 chemical 5.55556 9 0.617284 5
gpt-3.5-turbo-0613 chemical 5.44444 9 0.604938 5
openhermes-2.5:7:ggufv2:Q3_K_M chemical 5.23309 9 0.581455 5
openhermes-2.5:7:ggufv2:Q8_0 chemical 5.16667 9 0.574074 5
openhermes-2.5:7:ggufv2:Q5_K_M chemical 5.06667 9 0.562963 5
gpt-3.5-turbo-0125 chemical 5.06444 9 0.562716 5
openhermes-2.5:7:ggufv2:Q4_K_M chemical 4.95556 9 0.550617 5
openhermes-2.5:7:ggufv2:Q2_K chemical 4.66667 9 0.518519 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M chemical 4.02332 9 0.447036 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M chemical 3.69824 9 0.410916 5
mistral-instruct-v0.2:7:ggufv2:Q6_K chemical 3.5588 9 0.395422 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M chemical 3.23175 9 0.359083 5
mistral-instruct-v0.2:7:ggufv2:Q2_K chemical 2.9648 9 0.329422 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M chemical 2.85926 9 0.317696 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 chemical 2.80214 9 0.311349 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K chemical 2.28839 9 0.254265 5
llama-3-instruct:8:ggufv2:Q6_K chemical 1.99259 9 0.221399 5
llama-3-instruct:8:ggufv2:Q5_K_M chemical 1.98451 9 0.220501 5
llama-3-instruct:8:ggufv2:Q8_0 chemical 1.98451 9 0.220501 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M chemical 1.92687 9 0.214097 5
llama-2-chat:70:ggufv2:Q2_K chemical 1.92403 9 0.213781 5
llama-2-chat:70:ggufv2:Q4_K_M chemical 1.86594 9 0.207326 5
llama-2-chat:70:ggufv2:Q5_K_M chemical 1.7972 9 0.199689 5
llama-2-chat:70:ggufv2:Q3_K_M chemical 1.65417 9 0.183796 5
llama-2-chat:13:ggufv2:Q4_K_M chemical 1.60885 9 0.178761 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K chemical 1.37178 9 0.15242 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 chemical 1.02473 9 0.113859 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M chemical 0.993896 9 0.110433 5
llama-3-instruct:8:ggufv2:Q4_K_M chemical 0.920791 9 0.10231 5
chatglm3:6:ggmlv3:q4_0 chemical 0.839293 9 0.0932548 5
llama-2-chat:7:ggufv2:Q5_K_M chemical 0.580952 9 0.0645503 5
llama-2-chat:13:ggufv2:Q5_K_M chemical 0.473978 9 0.0526642 5
llama-2-chat:13:ggufv2:Q8_0 chemical 0.473978 9 0.0526642 5
llama-2-chat:13:ggufv2:Q3_K_M chemical 0.447004 9 0.0496671 5
code-llama-instruct:7:ggufv2:Q4_K_M chemical 0.44189 9 0.0490989 5
llama-2-chat:13:ggufv2:Q2_K chemical 0.429118 9 0.0476798 5
llama-2-chat:7:ggufv2:Q4_K_M chemical 0.416702 9 0.0463002 5
llama-2-chat:7:ggufv2:Q3_K_M chemical 0.270151 9 0.0300168 5
llama-2-chat:7:ggufv2:Q2_K chemical 0.264943 9 0.0294381 5
Full model name Subtask Score achieved Score possible Accuracy Iterations
gpt-4-0613 context 7.90663 9 0.878514 5
gpt-4-0125-preview context 7.85253 9 0.872503 5
gpt-4o-2024-05-13 context 7.82965 9 0.869961 5
gpt-3.5-turbo-0125 context 6.89247 9 0.76583 5
openhermes-2.5:7:ggufv2:Q4_K_M context 6.89055 9 0.765616 5
openhermes-2.5:7:ggufv2:Q6_K context 6.79989 9 0.755543 5
openhermes-2.5:7:ggufv2:Q3_K_M context 6.77271 9 0.752524 5
openhermes-2.5:7:ggufv2:Q8_0 context 6.67749 9 0.741944 5
gpt-3.5-turbo-0613 context 6.50472 9 0.722747 5
openhermes-2.5:7:ggufv2:Q5_K_M context 6.44769 9 0.71641 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 context 5.16754 9 0.574171 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M context 5.12599 9 0.569555 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M context 5.02844 9 0.558716 5
mistral-instruct-v0.2:7:ggufv2:Q6_K context 5.0158 9 0.557311 5
mistral-instruct-v0.2:7:ggufv2:Q2_K context 4.99362 9 0.554846 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K context 4.51314 9 0.50146 5
llama-2-chat:70:ggufv2:Q3_K_M context 4.22332 9 0.469258 5
llama-2-chat:70:ggufv2:Q4_K_M context 4.10284 9 0.455871 5
llama-2-chat:70:ggufv2:Q2_K context 4.08979 9 0.454421 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M context 4.06318 9 0.451465 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M context 4.01117 9 0.445686 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M context 3.90982 9 0.434425 5
openhermes-2.5:7:ggufv2:Q2_K context 3.86897 9 0.429886 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M context 3.79416 9 0.421573 5
llama-2-chat:70:ggufv2:Q5_K_M context 3.74591 9 0.416212 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 context 3.70126 9 0.411251 5
code-llama-instruct:7:ggufv2:Q4_K_M context 3.32657 9 0.369619 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K context 3.1452 9 0.349466 5
chatglm3:6:ggmlv3:q4_0 context 2.85636 9 0.317373 5
llama-2-chat:7:ggufv2:Q3_K_M context 2.10857 9 0.234285 5
llama-2-chat:7:ggufv2:Q4_K_M context 1.89605 9 0.210672 5
llama-2-chat:13:ggufv2:Q3_K_M context 1.78868 9 0.198742 5
llama-2-chat:13:ggufv2:Q5_K_M context 1.78618 9 0.198464 5
llama-2-chat:13:ggufv2:Q4_K_M context 1.77351 9 0.197056 5
llama-3-instruct:8:ggufv2:Q8_0 context 1.67334 9 0.185926 5
llama-3-instruct:8:ggufv2:Q5_K_M context 1.64821 9 0.183134 5
llama-2-chat:13:ggufv2:Q8_0 context 1.58821 9 0.176468 5
llama-3-instruct:8:ggufv2:Q4_K_M context 1.57169 9 0.174632 5
llama-2-chat:13:ggufv2:Q2_K context 1.34289 9 0.14921 5
llama-2-chat:7:ggufv2:Q5_K_M context 1.23881 9 0.137645 5
llama-2-chat:7:ggufv2:Q2_K context 1.12335 9 0.124816 5
llama-3-instruct:8:ggufv2:Q6_K context 1.10292 9 0.122547 5
Full model name Subtask Score achieved Score possible Accuracy Iterations
openhermes-2.5:7:ggufv2:Q8_0 disease 6.46667 9 0.718519 5
openhermes-2.5:7:ggufv2:Q6_K disease 6.46667 9 0.718519 5
openhermes-2.5:7:ggufv2:Q5_K_M disease 6.46667 9 0.718519 5
openhermes-2.5:7:ggufv2:Q3_K_M disease 6.46667 9 0.718519 5
openhermes-2.5:7:ggufv2:Q4_K_M disease 6.46667 9 0.718519 5
gpt-4-0125-preview disease 6.21333 9 0.69037 5
gpt-4o-2024-05-13 disease 6.2 9 0.688889 5
gpt-4-0613 disease 6.13333 9 0.681481 5
gpt-3.5-turbo-0613 disease 6.06667 9 0.674074 5
gpt-3.5-turbo-0125 disease 4.75238 9 0.528042 5
openhermes-2.5:7:ggufv2:Q2_K disease 4.32493 9 0.480548 5
mistral-instruct-v0.2:7:ggufv2:Q2_K disease 4.20708 9 0.467453 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K disease 4.14674 9 0.460748 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M disease 4.02927 9 0.447696 5
mistral-instruct-v0.2:7:ggufv2:Q6_K disease 4.01581 9 0.446201 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 disease 3.47244 9 0.385827 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M disease 3.04532 9 0.338369 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M disease 2.92854 9 0.325393 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M disease 2.65437 9 0.294929 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 disease 2.57657 9 0.286285 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M disease 2.44785 9 0.271983 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M disease 2.29171 9 0.254634 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K disease 2.29094 9 0.254549 5
llama-3-instruct:8:ggufv2:Q5_K_M disease 1.73452 9 0.192725 5
llama-3-instruct:8:ggufv2:Q6_K disease 1.73452 9 0.192725 5
llama-3-instruct:8:ggufv2:Q8_0 disease 1.73452 9 0.192725 5
code-llama-instruct:7:ggufv2:Q4_K_M disease 1.33093 9 0.147881 5
chatglm3:6:ggmlv3:q4_0 disease 1.21669 9 0.135188 5
llama-3-instruct:8:ggufv2:Q4_K_M disease 0.995894 9 0.110655 5
llama-2-chat:13:ggufv2:Q5_K_M disease 0.306386 9 0.0340429 5
llama-2-chat:13:ggufv2:Q8_0 disease 0.26663 9 0.0296256 5
llama-2-chat:13:ggufv2:Q4_K_M disease 0.250053 9 0.0277836 5
llama-2-chat:70:ggufv2:Q5_K_M disease 0.235648 9 0.0261831 5
llama-2-chat:7:ggufv2:Q3_K_M disease 0.185035 9 0.0205595 5
llama-2-chat:70:ggufv2:Q2_K disease 0.182046 9 0.0202274 5
llama-2-chat:70:ggufv2:Q4_K_M disease 0.179398 9 0.0199331 5
llama-2-chat:7:ggufv2:Q5_K_M disease 0.150208 9 0.0166897 5
llama-2-chat:70:ggufv2:Q3_K_M disease 0.142957 9 0.0158841 5
llama-2-chat:13:ggufv2:Q3_K_M disease 0.103277 9 0.0114752 5
llama-2-chat:7:ggufv2:Q4_K_M disease 0.0898052 9 0.00997835 5
llama-2-chat:13:ggufv2:Q2_K disease 0.0874203 9 0.00971337 5
llama-2-chat:7:ggufv2:Q2_K disease 0.0587138 9 0.00652375 5
Full model name Subtask Score achieved Score possible Accuracy Iterations
gpt-4o-2024-05-13 entity 5.9909 9 0.665655 5
gpt-4-0125-preview entity 4.59502 9 0.510558 5
gpt-3.5-turbo-0613 entity 4.57972 9 0.508858 5
openhermes-2.5:7:ggufv2:Q4_K_M entity 4.22461 9 0.469401 5
openhermes-2.5:7:ggufv2:Q8_0 entity 4.1344 9 0.459377 5
gpt-4-0613 entity 4.12852 9 0.458724 5
openhermes-2.5:7:ggufv2:Q6_K entity 4.09333 9 0.454814 5
openhermes-2.5:7:ggufv2:Q5_K_M entity 4.02016 9 0.446685 5
gpt-3.5-turbo-0125 entity 3.71195 9 0.412439 5
openhermes-2.5:7:ggufv2:Q3_K_M entity 3.65819 9 0.406466 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M entity 2.42313 9 0.269236 5
openhermes-2.5:7:ggufv2:Q2_K entity 2.33413 9 0.259348 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M entity 2.30597 9 0.256219 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M entity 2.20283 9 0.244759 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K entity 2.10077 9 0.233419 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M entity 2.0607 9 0.228967 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 entity 2.00802 9 0.223113 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M entity 1.99809 9 0.22201 5
mistral-instruct-v0.2:7:ggufv2:Q6_K entity 1.99214 9 0.221349 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 entity 1.79999 9 0.199999 5
mistral-instruct-v0.2:7:ggufv2:Q2_K entity 1.77563 9 0.197292 5
chatglm3:6:ggmlv3:q4_0 entity 1.22227 9 0.135808 5
llama-2-chat:70:ggufv2:Q3_K_M entity 1.20851 9 0.134279 5
llama-2-chat:70:ggufv2:Q2_K entity 1.16189 9 0.129099 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M entity 1.10007 9 0.12223 5
llama-2-chat:70:ggufv2:Q4_K_M entity 1.01555 9 0.112839 5
code-llama-instruct:7:ggufv2:Q4_K_M entity 0.948961 9 0.10544 5
llama-2-chat:70:ggufv2:Q5_K_M entity 0.903324 9 0.100369 5
llama-2-chat:13:ggufv2:Q2_K entity 0.807379 9 0.0897088 5
llama-2-chat:13:ggufv2:Q4_K_M entity 0.785233 9 0.0872482 5
llama-3-instruct:8:ggufv2:Q5_K_M entity 0.75253 9 0.0836144 5
llama-3-instruct:8:ggufv2:Q6_K entity 0.749495 9 0.0832772 5
llama-2-chat:7:ggufv2:Q3_K_M entity 0.699988 9 0.0777764 5
llama-3-instruct:8:ggufv2:Q8_0 entity 0.695524 9 0.0772805 5
llama-3-instruct:8:ggufv2:Q4_K_M entity 0.694377 9 0.077153 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K entity 0.685368 9 0.076152 5
llama-2-chat:7:ggufv2:Q4_K_M entity 0.685027 9 0.0761141 5
llama-2-chat:13:ggufv2:Q8_0 entity 0.629764 9 0.0699737 5
llama-2-chat:7:ggufv2:Q5_K_M entity 0.623851 9 0.0693168 5
llama-2-chat:13:ggufv2:Q5_K_M entity 0.623813 9 0.0693126 5
llama-2-chat:13:ggufv2:Q3_K_M entity 0.56502 9 0.06278 5
llama-2-chat:7:ggufv2:Q2_K entity 0.318196 9 0.0353551 5
Full model name Subtask Score achieved Score possible Accuracy Iterations
openhermes-2.5:7:ggufv2:Q2_K experiment_yes_or_no 9 9 1 5
gpt-4-0125-preview experiment_yes_or_no 9 9 1 5
llama-2-chat:70:ggufv2:Q4_K_M experiment_yes_or_no 9 9 1 5
chatglm3:6:ggmlv3:q4_0 experiment_yes_or_no 8.6 9 0.955556 5
openhermes-2.5:7:ggufv2:Q6_K experiment_yes_or_no 8.33333 9 0.925926 5
openhermes-2.5:7:ggufv2:Q5_K_M experiment_yes_or_no 8.33333 9 0.925926 5
openhermes-2.5:7:ggufv2:Q4_K_M experiment_yes_or_no 8.33333 9 0.925926 5
llama-2-chat:70:ggufv2:Q5_K_M experiment_yes_or_no 8.025 9 0.891667 5
openhermes-2.5:7:ggufv2:Q3_K_M experiment_yes_or_no 8 9 0.888889 5
gpt-4-0613 experiment_yes_or_no 8 9 0.888889 5
gpt-4o-2024-05-13 experiment_yes_or_no 8 9 0.888889 5
openhermes-2.5:7:ggufv2:Q8_0 experiment_yes_or_no 8 9 0.888889 5
gpt-3.5-turbo-0613 experiment_yes_or_no 8 9 0.888889 5
llama-2-chat:70:ggufv2:Q2_K experiment_yes_or_no 7.05061 9 0.783401 5
llama-2-chat:70:ggufv2:Q3_K_M experiment_yes_or_no 6.07336 9 0.674818 5
gpt-3.5-turbo-0125 experiment_yes_or_no 6.03333 9 0.67037 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M experiment_yes_or_no 5.23564 9 0.581738 5
llama-2-chat:13:ggufv2:Q3_K_M experiment_yes_or_no 5.16593 9 0.573993 5
llama-3-instruct:8:ggufv2:Q8_0 experiment_yes_or_no 3.7 9 0.411111 5
llama-3-instruct:8:ggufv2:Q5_K_M experiment_yes_or_no 3.68182 9 0.409091 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 experiment_yes_or_no 3.32028 9 0.36892 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M experiment_yes_or_no 3.26963 9 0.363292 5
code-llama-instruct:7:ggufv2:Q4_K_M experiment_yes_or_no 3.0913 9 0.343478 5
llama-3-instruct:8:ggufv2:Q6_K experiment_yes_or_no 2.36364 9 0.262626 5
mistral-instruct-v0.2:7:ggufv2:Q6_K experiment_yes_or_no 2.36015 9 0.262239 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M experiment_yes_or_no 2.2851 9 0.2539 5
mistral-instruct-v0.2:7:ggufv2:Q2_K experiment_yes_or_no 2.2802 9 0.253355 5
llama-2-chat:7:ggufv2:Q4_K_M experiment_yes_or_no 2.06817 9 0.229797 5
llama-3-instruct:8:ggufv2:Q4_K_M experiment_yes_or_no 1.89935 9 0.211039 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M experiment_yes_or_no 1.45686 9 0.161873 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M experiment_yes_or_no 1.29991 9 0.144434 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M experiment_yes_or_no 1.1661 9 0.129567 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K experiment_yes_or_no 1.15184 9 0.127982 5
llama-2-chat:13:ggufv2:Q8_0 experiment_yes_or_no 1.06643 9 0.118492 5
llama-2-chat:13:ggufv2:Q5_K_M experiment_yes_or_no 1.03147 9 0.114607 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K experiment_yes_or_no 0.785587 9 0.0872874 5
llama-2-chat:7:ggufv2:Q3_K_M experiment_yes_or_no 0.726745 9 0.0807495 5
llama-2-chat:7:ggufv2:Q5_K_M experiment_yes_or_no 0.618798 9 0.0687554 5
llama-2-chat:13:ggufv2:Q4_K_M experiment_yes_or_no 0.468722 9 0.0520802 5
llama-2-chat:13:ggufv2:Q2_K experiment_yes_or_no 0.267272 9 0.0296969 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 experiment_yes_or_no 0.201489 9 0.0223876 5
llama-2-chat:7:ggufv2:Q2_K experiment_yes_or_no 0.130285 9 0.0144761 5
Full model name Subtask Score achieved Score possible Accuracy Iterations
mistral-instruct-v0.2:7:ggufv2:Q4_K_M hypothesis 3.67339 9 0.408154 5
mistral-instruct-v0.2:7:ggufv2:Q6_K hypothesis 3.33681 9 0.370756 5
gpt-4-0613 hypothesis 3.29696 9 0.366328 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 hypothesis 2.9272 9 0.325244 5
gpt-4o-2024-05-13 hypothesis 2.89512 9 0.32168 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M hypothesis 2.75585 9 0.306206 5
gpt-3.5-turbo-0125 hypothesis 2.72775 9 0.303084 5
gpt-3.5-turbo-0613 hypothesis 2.64497 9 0.293885 5
openhermes-2.5:7:ggufv2:Q4_K_M hypothesis 2.57382 9 0.28598 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M hypothesis 2.47292 9 0.274769 5
openhermes-2.5:7:ggufv2:Q8_0 hypothesis 2.37196 9 0.263551 5
gpt-4-0125-preview hypothesis 2.33518 9 0.259465 5
openhermes-2.5:7:ggufv2:Q6_K hypothesis 2.29085 9 0.254539 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M hypothesis 2.23255 9 0.248061 5
openhermes-2.5:7:ggufv2:Q3_K_M hypothesis 2.09626 9 0.232917 5
mistral-instruct-v0.2:7:ggufv2:Q2_K hypothesis 2.05375 9 0.228195 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M hypothesis 1.87442 9 0.208269 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 hypothesis 1.83735 9 0.20415 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M hypothesis 1.71557 9 0.190619 5
openhermes-2.5:7:ggufv2:Q5_K_M hypothesis 1.52181 9 0.16909 5
openhermes-2.5:7:ggufv2:Q2_K hypothesis 1.4915 9 0.165722 5
llama-2-chat:70:ggufv2:Q3_K_M hypothesis 1.44143 9 0.160158 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K hypothesis 1.44009 9 0.16001 5
llama-2-chat:70:ggufv2:Q2_K hypothesis 1.4389 9 0.159878 5
llama-2-chat:70:ggufv2:Q4_K_M hypothesis 1.41421 9 0.157134 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K hypothesis 1.39565 9 0.155072 5
llama-3-instruct:8:ggufv2:Q4_K_M hypothesis 1.13596 9 0.126218 5
chatglm3:6:ggmlv3:q4_0 hypothesis 0.98676 9 0.10964 5
llama-3-instruct:8:ggufv2:Q8_0 hypothesis 0.878406 9 0.0976006 5
llama-3-instruct:8:ggufv2:Q6_K hypothesis 0.876219 9 0.0973577 5
llama-2-chat:7:ggufv2:Q5_K_M hypothesis 0.68638 9 0.0762645 5
llama-2-chat:70:ggufv2:Q5_K_M hypothesis 0.623758 9 0.0693064 5
llama-2-chat:7:ggufv2:Q4_K_M hypothesis 0.62053 9 0.0689478 5
llama-3-instruct:8:ggufv2:Q5_K_M hypothesis 0.604423 9 0.0671582 5
code-llama-instruct:7:ggufv2:Q4_K_M hypothesis 0.572369 9 0.0635965 5
llama-2-chat:13:ggufv2:Q8_0 hypothesis 0.55524 9 0.0616934 5
llama-2-chat:7:ggufv2:Q2_K hypothesis 0.520453 9 0.0578281 5
llama-2-chat:13:ggufv2:Q2_K hypothesis 0.49279 9 0.0547544 5
llama-2-chat:13:ggufv2:Q3_K_M hypothesis 0.424638 9 0.0471819 5
llama-2-chat:13:ggufv2:Q5_K_M hypothesis 0.408017 9 0.0453352 5
llama-2-chat:7:ggufv2:Q3_K_M hypothesis 0.402337 9 0.0447041 5
llama-2-chat:13:ggufv2:Q4_K_M hypothesis 0.366299 9 0.0406999 5
Full model name Subtask Score achieved Score possible Accuracy Iterations
gpt-4o-2024-05-13 intervention 5.34631 9 0.594034 5
openhermes-2.5:7:ggufv2:Q4_K_M intervention 4.9841 9 0.553789 5
gpt-4-0125-preview intervention 4.92171 9 0.546857 5
gpt-4-0613 intervention 4.72253 9 0.524725 5
openhermes-2.5:7:ggufv2:Q6_K intervention 4.71449 9 0.523833 5
openhermes-2.5:7:ggufv2:Q8_0 intervention 4.44465 9 0.49385 5
gpt-3.5-turbo-0613 intervention 4.27143 9 0.474603 5
openhermes-2.5:7:ggufv2:Q5_K_M intervention 4.00021 9 0.444467 5
gpt-3.5-turbo-0125 intervention 3.75141 9 0.416823 5
openhermes-2.5:7:ggufv2:Q3_K_M intervention 3.55238 9 0.394709 5
openhermes-2.5:7:ggufv2:Q2_K intervention 2.92766 9 0.325296 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K intervention 2.23683 9 0.248536 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M intervention 2.23319 9 0.248132 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M intervention 1.66677 9 0.185196 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M intervention 1.23412 9 0.137124 5
code-llama-instruct:7:ggufv2:Q4_K_M intervention 1.17173 9 0.130192 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M intervention 1.15754 9 0.128615 5
llama-2-chat:13:ggufv2:Q4_K_M intervention 1.02157 9 0.113507 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 intervention 0.987919 9 0.109769 5
chatglm3:6:ggmlv3:q4_0 intervention 0.881806 9 0.0979784 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 intervention 0.879646 9 0.0977385 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M intervention 0.723791 9 0.0804212 5
mistral-instruct-v0.2:7:ggufv2:Q2_K intervention 0.680182 9 0.0755758 5
llama-2-chat:70:ggufv2:Q2_K intervention 0.668995 9 0.0743328 5
mistral-instruct-v0.2:7:ggufv2:Q6_K intervention 0.640258 9 0.0711397 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M intervention 0.550643 9 0.0611826 5
llama-2-chat:70:ggufv2:Q5_K_M intervention 0.542302 9 0.0602558 5
llama-2-chat:13:ggufv2:Q2_K intervention 0.502722 9 0.055858 5
llama-2-chat:70:ggufv2:Q4_K_M intervention 0.417501 9 0.046389 5
llama-2-chat:7:ggufv2:Q3_K_M intervention 0.416756 9 0.0463062 5
llama-3-instruct:8:ggufv2:Q5_K_M intervention 0.410888 9 0.0456543 5
llama-2-chat:70:ggufv2:Q3_K_M intervention 0.402319 9 0.0447021 5
llama-3-instruct:8:ggufv2:Q4_K_M intervention 0.37923 9 0.0421366 5
llama-2-chat:13:ggufv2:Q5_K_M intervention 0.339683 9 0.0377425 5
llama-3-instruct:8:ggufv2:Q6_K intervention 0.327257 9 0.0363619 5
llama-3-instruct:8:ggufv2:Q8_0 intervention 0.319187 9 0.0354652 5
llama-2-chat:13:ggufv2:Q3_K_M intervention 0.265476 9 0.0294974 5
llama-2-chat:7:ggufv2:Q5_K_M intervention 0.24986 9 0.0277622 5
llama-2-chat:13:ggufv2:Q8_0 intervention 0.244444 9 0.0271605 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K intervention 0.2273 9 0.0252555 5
llama-2-chat:7:ggufv2:Q2_K intervention 0.118691 9 0.0131879 5
llama-2-chat:7:ggufv2:Q4_K_M intervention 0.0769231 9 0.00854701 5
Full model name Subtask Score achieved Score possible Accuracy Iterations
gpt-4-0125-preview ncbi_link 6.48768 9 0.720853 5
gpt-4-0613 ncbi_link 6.05933 9 0.673259 5
openhermes-2.5:7:ggufv2:Q8_0 ncbi_link 3.5303 9 0.392256 5
openhermes-2.5:7:ggufv2:Q6_K ncbi_link 3.5303 9 0.392256 5
gpt-4o-2024-05-13 ncbi_link 3.51302 9 0.390335 5
openhermes-2.5:7:ggufv2:Q5_K_M ncbi_link 3.47436 9 0.38604 5
openhermes-2.5:7:ggufv2:Q4_K_M ncbi_link 3.11111 9 0.345679 5
openhermes-2.5:7:ggufv2:Q3_K_M ncbi_link 2.37436 9 0.263818 5
gpt-3.5-turbo-0613 ncbi_link 2.16667 9 0.240741 5
gpt-3.5-turbo-0125 ncbi_link 1.42925 9 0.158805 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M ncbi_link 1.03429 9 0.114921 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M ncbi_link 0.884957 9 0.0983286 5
mistral-instruct-v0.2:7:ggufv2:Q2_K ncbi_link 0.881705 9 0.0979672 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M ncbi_link 0.710989 9 0.0789988 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K ncbi_link 0.656812 9 0.0729791 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M ncbi_link 0.615714 9 0.0684127 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 ncbi_link 0.596131 9 0.0662368 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M ncbi_link 0.574422 9 0.0638246 5
mistral-instruct-v0.2:7:ggufv2:Q6_K ncbi_link 0.558824 9 0.0620915 5
openhermes-2.5:7:ggufv2:Q2_K ncbi_link 0.505458 9 0.0561621 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 ncbi_link 0.429927 9 0.0477697 5
code-llama-instruct:7:ggufv2:Q4_K_M ncbi_link 0.328564 9 0.0365071 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K ncbi_link 0.271548 9 0.030172 5
llama-2-chat:13:ggufv2:Q8_0 ncbi_link 0.255217 9 0.0283574 5
llama-2-chat:70:ggufv2:Q2_K ncbi_link 0.253735 9 0.0281928 5
llama-2-chat:13:ggufv2:Q4_K_M ncbi_link 0.246231 9 0.027359 5
llama-2-chat:70:ggufv2:Q4_K_M ncbi_link 0.241357 9 0.0268174 5
llama-2-chat:13:ggufv2:Q5_K_M ncbi_link 0.236802 9 0.0263113 5
llama-3-instruct:8:ggufv2:Q4_K_M ncbi_link 0.233815 9 0.0259795 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M ncbi_link 0.230909 9 0.0256566 5
llama-2-chat:7:ggufv2:Q4_K_M ncbi_link 0.216341 9 0.0240378 5
llama-2-chat:70:ggufv2:Q5_K_M ncbi_link 0.196981 9 0.0218868 5
llama-2-chat:13:ggufv2:Q2_K ncbi_link 0.192574 9 0.0213971 5
llama-3-instruct:8:ggufv2:Q8_0 ncbi_link 0.179211 9 0.0199123 5
llama-2-chat:7:ggufv2:Q3_K_M ncbi_link 0.177339 9 0.0197044 5
llama-3-instruct:8:ggufv2:Q6_K ncbi_link 0.173014 9 0.0192238 5
llama-2-chat:7:ggufv2:Q5_K_M ncbi_link 0.170952 9 0.0189947 5
llama-2-chat:70:ggufv2:Q3_K_M ncbi_link 0.166777 9 0.0185308 5
llama-3-instruct:8:ggufv2:Q5_K_M ncbi_link 0.166614 9 0.0185127 5
llama-2-chat:7:ggufv2:Q2_K ncbi_link 0.15271 9 0.0169677 5
llama-2-chat:13:ggufv2:Q3_K_M ncbi_link 0.150011 9 0.0166679 5
chatglm3:6:ggmlv3:q4_0 ncbi_link 0.122857 9 0.0136508 5
Full model name Subtask Score achieved Score possible Accuracy Iterations
gpt-4-0613 significance 5.6 9 0.622222 5
gpt-4-0125-preview significance 5.18384 9 0.575982 5
gpt-4o-2024-05-13 significance 4.22424 9 0.46936 5
openhermes-2.5:7:ggufv2:Q4_K_M significance 3.92996 9 0.436662 5
openhermes-2.5:7:ggufv2:Q8_0 significance 3.78182 9 0.420202 5
openhermes-2.5:7:ggufv2:Q6_K significance 3.78182 9 0.420202 5
openhermes-2.5:7:ggufv2:Q5_K_M significance 3.77787 9 0.419763 5
openhermes-2.5:7:ggufv2:Q3_K_M significance 3.69091 9 0.410101 5
gpt-3.5-turbo-0613 significance 3.58562 9 0.398402 5
gpt-3.5-turbo-0125 significance 3.51717 9 0.390797 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M significance 2.93833 9 0.326481 5
mistral-instruct-v0.2:7:ggufv2:Q6_K significance 2.87928 9 0.31992 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M significance 2.79423 9 0.31047 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 significance 2.62296 9 0.29144 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M significance 2.56724 9 0.285249 5
openhermes-2.5:7:ggufv2:Q2_K significance 2.48514 9 0.276127 5
mistral-instruct-v0.2:7:ggufv2:Q2_K significance 2.4813 9 0.2757 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 significance 1.50696 9 0.16744 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K significance 1.34869 9 0.149854 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M significance 1.31454 9 0.14606 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M significance 1.2312 9 0.1368 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M significance 1.01129 9 0.112366 5
llama-3-instruct:8:ggufv2:Q6_K significance 0.994971 9 0.110552 5
llama-3-instruct:8:ggufv2:Q8_0 significance 0.957259 9 0.106362 5
llama-2-chat:70:ggufv2:Q3_K_M significance 0.758379 9 0.0842644 5
llama-2-chat:70:ggufv2:Q2_K significance 0.716547 9 0.0796163 5
llama-2-chat:70:ggufv2:Q4_K_M significance 0.68386 9 0.0759845 5
llama-3-instruct:8:ggufv2:Q5_K_M significance 0.636128 9 0.0706809 5
llama-2-chat:70:ggufv2:Q5_K_M significance 0.518572 9 0.0576191 5
llama-2-chat:7:ggufv2:Q4_K_M significance 0.329457 9 0.0366064 5
llama-2-chat:13:ggufv2:Q8_0 significance 0.326026 9 0.0362251 5
llama-2-chat:7:ggufv2:Q5_K_M significance 0.281188 9 0.0312431 5
llama-3-instruct:8:ggufv2:Q4_K_M significance 0.228461 9 0.0253845 5
llama-2-chat:13:ggufv2:Q4_K_M significance 0.213246 9 0.023694 5
llama-2-chat:13:ggufv2:Q2_K significance 0.207957 9 0.0231063 5
llama-2-chat:13:ggufv2:Q5_K_M significance 0.205271 9 0.0228079 5
llama-2-chat:7:ggufv2:Q3_K_M significance 0.194946 9 0.0216607 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K significance 0.178078 9 0.0197864 5
llama-2-chat:13:ggufv2:Q3_K_M significance 0.131484 9 0.0146093 5
code-llama-instruct:7:ggufv2:Q4_K_M significance 0.123914 9 0.0137682 5
chatglm3:6:ggmlv3:q4_0 significance 0.118153 9 0.0131281 5
llama-2-chat:7:ggufv2:Q2_K significance 0.103278 9 0.0114753 5
Full model name Subtask Score achieved Score possible Accuracy Iterations
gpt-4-0125-preview stats 8.86667 9 0.985185 5
openhermes-2.5:7:ggufv2:Q8_0 stats 8.66667 9 0.962963 5
openhermes-2.5:7:ggufv2:Q6_K stats 8.66667 9 0.962963 5
openhermes-2.5:7:ggufv2:Q5_K_M stats 8.52821 9 0.947578 5
gpt-4o-2024-05-13 stats 8.51282 9 0.945869 5
gpt-4-0613 stats 8.51282 9 0.945869 5
openhermes-2.5:7:ggufv2:Q4_K_M stats 8.25641 9 0.917379 5
openhermes-2.5:7:ggufv2:Q3_K_M stats 8.05641 9 0.895157 5
openhermes-2.5:7:ggufv2:Q2_K stats 8 9 0.888889 5
gpt-3.5-turbo-0613 stats 7.98135 9 0.886817 5
gpt-3.5-turbo-0125 stats 7.12976 9 0.792195 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M stats 6.89091 9 0.765657 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M stats 6.29908 9 0.699898 5
mistral-instruct-v0.2:7:ggufv2:Q6_K stats 6.18322 9 0.687024 5
llama-3-instruct:8:ggufv2:Q8_0 stats 5.17406 9 0.574895 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M stats 5.1041 9 0.567122 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 stats 5.04591 9 0.560657 5
mistral-instruct-v0.2:7:ggufv2:Q2_K stats 4.63496 9 0.514995 5
llama-3-instruct:8:ggufv2:Q6_K stats 4.30739 9 0.478599 5
llama-3-instruct:8:ggufv2:Q5_K_M stats 3.9346 9 0.437178 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 stats 3.60737 9 0.400819 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M stats 3.58841 9 0.398712 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K stats 3.21213 9 0.356904 5
llama-2-chat:70:ggufv2:Q4_K_M stats 3.08109 9 0.342344 5
llama-3-instruct:8:ggufv2:Q4_K_M stats 2.98843 9 0.332048 5
llama-2-chat:70:ggufv2:Q2_K stats 2.65216 9 0.294684 5
llama-2-chat:70:ggufv2:Q3_K_M stats 2.44276 9 0.271418 5
llama-2-chat:70:ggufv2:Q5_K_M stats 2.3993 9 0.266589 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M stats 2.21549 9 0.246165 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M stats 1.96241 9 0.218045 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K stats 1.76057 9 0.195619 5
llama-2-chat:7:ggufv2:Q4_K_M stats 1.43589 9 0.159543 5
llama-2-chat:13:ggufv2:Q4_K_M stats 1.41695 9 0.157439 5
llama-2-chat:13:ggufv2:Q8_0 stats 1.38608 9 0.154009 5
llama-2-chat:7:ggufv2:Q5_K_M stats 1.3859 9 0.153989 5
llama-2-chat:13:ggufv2:Q3_K_M stats 1.35834 9 0.150927 5
llama-2-chat:13:ggufv2:Q5_K_M stats 1.3371 9 0.148567 5
llama-2-chat:13:ggufv2:Q2_K stats 1.12439 9 0.124932 5
code-llama-instruct:7:ggufv2:Q4_K_M stats 0.860471 9 0.0956079 5
llama-2-chat:7:ggufv2:Q3_K_M stats 0.804538 9 0.0893931 5
llama-2-chat:7:ggufv2:Q2_K stats 0.558031 9 0.0620034 5
chatglm3:6:ggmlv3:q4_0 stats 0.17925 9 0.0199166 5
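
In the tables above, the Accuracy column appears to be Score achieved divided by Score possible, rounded to about six significant digits. The following is a minimal sketch for checking that relationship on a few rows copied verbatim from the tables. The names `Row`, `parse_row` and `recomputed_accuracy` are hypothetical helpers introduced only for this illustration and are not part of BioChatter. The sketch assumes that each row has exactly six whitespace-separated fields, which holds here because neither model names nor subtask names contain spaces.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One benchmark table row (hypothetical helper type for this sketch)."""
    model: str
    subtask: str
    achieved: float
    possible: float
    accuracy: float
    iterations: int


def parse_row(line: str) -> Row:
    """Split a flattened table row into its six whitespace-separated fields."""
    model, subtask, achieved, possible, accuracy, iterations = line.split()
    return Row(model, subtask, float(achieved), float(possible),
               float(accuracy), int(iterations))


def recomputed_accuracy(row: Row) -> float:
    """Accuracy as the ratio of achieved to possible score (assumed definition)."""
    return row.achieved / row.possible


# Rows copied verbatim from the tables above.
sample_lines = [
    "gpt-4-0613 significance 5.6 9 0.622222 5",
    "gpt-4-0125-preview stats 8.86667 9 0.985185 5",
    "chatglm3:6:ggmlv3:q4_0 stats 0.17925 9 0.0199166 5",
]

rows = [parse_row(line) for line in sample_lines]

for row in rows:
    # Allow a small tolerance because the published values are rounded.
    matches = abs(recomputed_accuracy(row) - row.accuracy) < 1e-5
    print(f"{row.model} {row.subtask}: reported={row.accuracy} consistent={matches}")
```

The same check can be run over a whole table by passing every data row through `parse_row` and skipping the repeated header lines that begin with "Full model name".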

Stripplot Extraction Subtask