Benchmark - All Results

BioCypher query generation

In this set of tasks, we test the ability of LLMs to generate queries for a BioCypher Knowledge Graph using BioChatter. BioChatter receives two inputs: the schema_config.yaml of the BioCypher Knowledge Graph and a natural language query.

We test the individual steps of the query generation process separately, and we also test the end-to-end performance of the whole process.
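
To make the result rows on this page easier to read, here is a minimal Python sketch that parses one row and recomputes its accuracy. It assumes that each row consists of a model name without spaces followed by five numeric fields (score achieved, score possible, score SD, accuracy, iterations), and that accuracy is the score achieved divided by the score possible, which is what the listed values suggest. The names `ResultRow`, `parse_row`, and `recomputed_accuracy` are hypothetical helpers for illustration, not part of BioChatter.

```python
from dataclasses import dataclass


@dataclass
class ResultRow:
    model: str
    achieved: float
    possible: float
    sd: float
    accuracy: float
    iterations: int


def parse_row(line: str) -> ResultRow:
    """Split a flattened result row into its six fields.

    Assumption: the model name contains no spaces, so the last five
    whitespace-separated tokens are the numeric columns.
    """
    model, achieved, possible, sd, accuracy, iterations = line.rsplit(maxsplit=5)
    return ResultRow(
        model=model,
        achieved=float(achieved),
        possible=float(possible),
        sd=float(sd),
        accuracy=float(accuracy),
        iterations=int(iterations),
    )


def recomputed_accuracy(row: ResultRow) -> float:
    """Assumed definition: score achieved divided by score possible."""
    return row.achieved / row.possible if row.possible else 0.0


# Example row copied from the results on this page.
row = parse_row("gpt-3.5-turbo-0125 45 46 0 0.978261 5")
# The recomputed value agrees with the listed accuracy up to rounding.
matches = abs(recomputed_accuracy(row) - row.accuracy) < 1e-6
```

Because the score possible differs between models within a table (for example 24 for some rows and 279 for others), the accuracy column is the more comparable figure when ranking models.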

Full model name Score achieved Score possible Score SD Accuracy Iterations
claude-3-5-sonnet-20240620 22 22 0 1 3
claude-3-opus-20240229 24 24 0 1 3
llama-3.1-instruct:8:ggufv2:Q5_K_M 24 24 0 1 3
openhermes-2.5:7:ggufv2:Q3_K_M 279 279 0 1 5
llama-3.1-instruct:70:ggufv2:IQ4_XS 24 24 0 1 3
llama-3.1-instruct:70:ggufv2:Q3_K_S 24 24 0 1 3
llama-3.1-instruct:70:ggufv2:IQ2_M 24 24 0 1 3
llama-3.1-instruct:8:ggufv2:Q3_K_L 132 132 0 1 3
llama-3.1-instruct:8:ggufv2:Q6_K 24 24 0 1 3
gpt-4o-2024-05-13 70 70 0 1 5
gpt-4-turbo-2024-04-09 64 64 0 1 5
gpt-4o-2024-08-06 42 42 0 1 3
llama-3.1-instruct:8:ggufv2:Q8_0 186 186 0 1 3
gpt-3.5-turbo-0125 45 46 0 0.978261 5
gpt-4o-mini-2024-07-18 70 76 0 0.921053 5
gpt-4-0613 58 63 0 0.920635 5
llama-3.1-instruct:8:ggufv2:Q4_K_M 138 150 0 0.92 3
llama-3.1-instruct:8:ggufv2:IQ4_XS 59 66 0 0.893939 3
gpt-3.5-turbo-0613 40 45 0 0.888889 5
llama-3-instruct:8:ggufv2:Q8_0 35 40 0 0.875 5
llama-3-instruct:8:ggufv2:Q6_K 35 40 0 0.875 5
llama-3-instruct:8:ggufv2:Q5_K_M 35 40 0 0.875 5
llama-3-instruct:8:ggufv2:Q4_K_M 31 36 0 0.861111 5
gpt-4-0125-preview 47 57 0 0.824561 5
openhermes-2.5:7:ggufv2:Q6_K 307 400 0 0.7675 5
openhermes-2.5:7:ggufv2:Q5_K_M 280 369 0 0.758808 5
chatglm3:6:ggmlv3:q4_0 30 40 0 0.75 5
openhermes-2.5:7:ggufv2:Q4_K_M 223 333 0 0.66967 5
openhermes-2.5:7:ggufv2:Q8_0 277 441 0 0.628118 5
openhermes-2.5:7:ggufv2:Q2_K 136 225 0 0.604444 5
llama-2-chat:7:ggufv2:Q4_K_M 77 135 0 0.57037 5
llama-2-chat:7:ggufv2:Q5_K_M 77 135 0 0.57037 5
llama-2-chat:7:ggufv2:Q6_K 72 130 0 0.553846 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 20 40 0 0.5 5
code-llama-instruct:7:ggufv2:Q3_K_M 20 40 0 0.5 5
llama-2-chat:7:ggufv2:Q8_0 65 135 0 0.481481 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 19 40 0 0.475 5
code-llama-instruct:13:ggufv2:Q3_K_M 18 40 0 0.45 5
llama-2-chat:70:ggufv2:Q4_K_M 20 45 0 0.444444 5
llama-2-chat:70:ggufv2:Q5_K_M 20 45 0 0.444444 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 20 45 0 0.444444 5
llama-2-chat:7:ggufv2:Q3_K_M 51 117 0 0.435897 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 19 45 0 0.422222 5
llama-2-chat:7:ggufv2:Q2_K 48 117 0 0.410256 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 15 45 0 0.333333 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 15 45 0 0.333333 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 15 45 0 0.333333 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 15 45 0 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 15 45 0 0.333333 5
llama-2-chat:70:ggufv2:Q3_K_M 15 45 0 0.333333 5
code-llama-instruct:7:ggufv2:Q4_K_M 15 45 0 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 14 45 0 0.311111 5
code-llama-instruct:34:ggufv2:Q8_0 10 40 0 0.25 5
code-llama-instruct:7:ggufv2:Q2_K 10 40 0 0.25 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 10 45 0 0.222222 5
code-llama-instruct:34:ggufv2:Q5_K_M 5 40 0 0.125 5
code-llama-instruct:34:ggufv2:Q6_K 5 40 0 0.125 5
code-llama-instruct:7:ggufv2:Q5_K_M 5 45 0 0.111111 5
code-llama-instruct:13:ggufv2:Q4_K_M 0 40 0 0 5
code-llama-instruct:13:ggufv2:Q2_K 0 40 0 0 5
code-llama-instruct:34:ggufv2:Q4_K_M 0 40 0 0 5
code-llama-instruct:34:ggufv2:Q3_K_M 0 40 0 0 5
code-llama-instruct:34:ggufv2:Q2_K 0 40 0 0 5
code-llama-instruct:13:ggufv2:Q8_0 0 40 0 0 5
code-llama-instruct:13:ggufv2:Q6_K 0 40 0 0 5
code-llama-instruct:13:ggufv2:Q5_K_M 0 40 0 0 5
llama-2-chat:13:ggufv2:Q5_K_M 0 45 0 0 5
code-llama-instruct:7:ggufv2:Q8_0 0 45 0 0 5
code-llama-instruct:7:ggufv2:Q6_K 0 40 0 0 5
llama-2-chat:13:ggufv2:Q2_K 0 45 0 0 5
llama-2-chat:13:ggufv2:Q3_K_M 0 45 0 0 5
llama-2-chat:13:ggufv2:Q4_K_M 0 45 0 0 5
llama-2-chat:13:ggufv2:Q8_0 0 45 0 0 5
llama-2-chat:13:ggufv2:Q6_K 0 40 0 0 5
llama-2-chat:70:ggufv2:Q2_K 0 45 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 0 45 0 0 5
Full model name Score achieved Score possible Score SD Accuracy Iterations
gpt-3.5-turbo-0125 69 69 0 1 5
openhermes-2.5:7:ggufv2:Q3_K_M 87 87 0 1 5
gpt-4o-2024-08-06 63 63 0 1 3
openhermes-2.5:7:ggufv2:Q5_K_M 78 87 0 0.896552 5
openhermes-2.5:7:ggufv2:Q4_K_M 78 87 0 0.896552 5
openhermes-2.5:7:ggufv2:Q8_0 78 87 0 0.896552 5
openhermes-2.5:7:ggufv2:Q6_K 78 87 0 0.896552 5
gpt-4-0125-preview 54 69 0 0.782609 5
gpt-4-0613 48 69 0 0.695652 5
openhermes-2.5:7:ggufv2:Q2_K 57 87 0 0.655172 5
code-llama-instruct:34:ggufv2:Q2_K 30 60 0 0.5 5
gpt-3.5-turbo-0613 30 60 0 0.5 5
chatglm3:6:ggmlv3:q4_0 24 60 0 0.4 5
llama-3.1-instruct:8:ggufv2:Q4_K_M 18 63 0 0.285714 3
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 15 60 0 0.25 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 15 60 0 0.25 5
llama-2-chat:70:ggufv2:Q4_K_M 15 60 0 0.25 5
code-llama-instruct:7:ggufv2:Q2_K 15 60 0 0.25 5
code-llama-instruct:7:ggufv2:Q3_K_M 15 60 0 0.25 5
llama-2-chat:70:ggufv2:Q5_K_M 15 60 0 0.25 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 15 60 0 0.25 5
code-llama-instruct:34:ggufv2:Q3_K_M 15 60 0 0.25 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 15 60 0 0.25 5
llama-3.1-instruct:8:ggufv2:Q3_K_L 9 63 0 0.142857 3
llama-3.1-instruct:8:ggufv2:Q8_0 9 63 0 0.142857 3
gpt-4-turbo-2024-04-09 9 69 0 0.130435 5
gpt-4o-2024-05-13 9 69 0 0.130435 5
gpt-4o-mini-2024-07-18 9 69 0 0.130435 5
llama-2-chat:7:ggufv2:Q8_0 9 87 0 0.103448 5
llama-2-chat:7:ggufv2:Q3_K_M 9 87 0 0.103448 5
code-llama-instruct:34:ggufv2:Q4_K_M 0 60 0 0 5
code-llama-instruct:13:ggufv2:Q6_K 0 60 0 0 5
code-llama-instruct:13:ggufv2:Q4_K_M 0 60 0 0 5
code-llama-instruct:13:ggufv2:Q3_K_M 0 60 0 0 5
claude-3-opus-20240229 0 36 0 0 3
code-llama-instruct:13:ggufv2:Q2_K 0 60 0 0 5
claude-3-5-sonnet-20240620 0 36 0 0 3
code-llama-instruct:13:ggufv2:Q8_0 0 60 0 0 5
code-llama-instruct:13:ggufv2:Q5_K_M 0 60 0 0 5
code-llama-instruct:7:ggufv2:Q4_K_M 0 60 0 0 5
code-llama-instruct:7:ggufv2:Q8_0 0 60 0 0 5
code-llama-instruct:7:ggufv2:Q5_K_M 0 60 0 0 5
code-llama-instruct:7:ggufv2:Q6_K 0 60 0 0 5
code-llama-instruct:34:ggufv2:Q8_0 0 60 0 0 5
llama-2-chat:7:ggufv2:Q6_K 0 87 0 0 5
llama-2-chat:7:ggufv2:Q5_K_M 0 87 0 0 5
llama-2-chat:7:ggufv2:Q4_K_M 0 87 0 0 5
llama-2-chat:7:ggufv2:Q2_K 0 87 0 0 5
llama-2-chat:70:ggufv2:Q3_K_M 0 60 0 0 5
llama-2-chat:13:ggufv2:Q6_K 0 60 0 0 5
llama-2-chat:13:ggufv2:Q8_0 0 60 0 0 5
llama-2-chat:70:ggufv2:Q2_K 0 60 0 0 5
llama-2-chat:13:ggufv2:Q4_K_M 0 60 0 0 5
llama-2-chat:13:ggufv2:Q3_K_M 0 60 0 0 5
llama-2-chat:13:ggufv2:Q2_K 0 60 0 0 5
llama-2-chat:13:ggufv2:Q5_K_M 0 60 0 0 5
code-llama-instruct:34:ggufv2:Q5_K_M 0 60 0 0 5
code-llama-instruct:34:ggufv2:Q6_K 0 60 0 0 5
llama-3-instruct:8:ggufv2:Q8_0 0 60 0 0 5
llama-3-instruct:8:ggufv2:Q4_K_M 0 60 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 0 60 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 0 60 0 0 5
llama-3.1-instruct:8:ggufv2:Q6_K 0 36 0 0 3
llama-3.1-instruct:8:ggufv2:Q5_K_M 0 36 0 0 3
llama-3.1-instruct:8:ggufv2:IQ4_XS 0 45 0 0 3
llama-3.1-instruct:70:ggufv2:IQ2_M 0 36 0 0 3
llama-3.1-instruct:70:ggufv2:IQ4_XS 0 36 0 0 3
llama-3.1-instruct:70:ggufv2:Q3_K_S 0 36 0 0 3
llama-3-instruct:8:ggufv2:Q6_K 0 60 0 0 5
llama-3-instruct:8:ggufv2:Q5_K_M 0 60 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 0 60 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 0 60 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 0 60 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 0 60 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 0 60 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 0 60 0 0 5
Full model name Score achieved Score possible Score SD Accuracy Iterations
llama-3.1-instruct:8:ggufv2:Q8_0 129 228 0 0.565789 3
llama-3.1-instruct:8:ggufv2:Q4_K_M 117 228 0 0.513158 3
llama-3.1-instruct:8:ggufv2:Q3_K_L 111 228 0 0.486842 3
llama-3.1-instruct:8:ggufv2:Q6_K 90 192 0 0.46875 3
llama-3.1-instruct:8:ggufv2:Q5_K_M 84 192 0 0.4375 3
gpt-4o-2024-08-06 97 228 1.1547 0.425439 3
claude-3-opus-20240229 81 192 0 0.421875 3
gpt-4o-mini-2024-07-18 129 332 0.547723 0.388554 5
llama-3.1-instruct:8:ggufv2:IQ4_XS 79 204 0 0.387255 3
gpt-4-0613 127 332 0 0.38253 5
llama-3.1-instruct:70:ggufv2:IQ4_XS 72 192 0 0.375 3
claude-3-5-sonnet-20240620 72 192 0 0.375 3
llama-3.1-instruct:70:ggufv2:Q3_K_S 72 192 0 0.375 3
gpt-3.5-turbo-0125 122 332 0 0.36747 5
gpt-3.5-turbo-0613 116 320 0 0.3625 5
llama-3.1-instruct:70:ggufv2:IQ2_M 63 192 0 0.328125 3
gpt-4-turbo-2024-04-09 108 332 0.894427 0.325301 5
chatglm3:6:ggmlv3:q4_0 92 320 0 0.2875 5
llama-3-instruct:8:ggufv2:Q8_0 90 320 0 0.28125 5
llama-3-instruct:8:ggufv2:Q6_K 90 320 0 0.28125 5
openhermes-2.5:7:ggufv2:Q8_0 70 356 0 0.196629 5
openhermes-2.5:7:ggufv2:Q5_K_M 70 356 0 0.196629 5
llama-3-instruct:8:ggufv2:Q5_K_M 60 320 0 0.1875 5
llama-2-chat:70:ggufv2:Q3_K_M 55 320 0 0.171875 5
openhermes-2.5:7:ggufv2:Q3_K_M 61 356 0 0.171348 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 52 320 0 0.1625 5
openhermes-2.5:7:ggufv2:Q6_K 45 356 0 0.126404 5
openhermes-2.5:7:ggufv2:Q4_K_M 45 356 0 0.126404 5
llama-3-instruct:8:ggufv2:Q4_K_M 35 320 0 0.109375 5
llama-2-chat:7:ggufv2:Q3_K_M 32 356 0 0.0898876 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 21 320 0 0.065625 5
code-llama-instruct:7:ggufv2:Q2_K 20 320 0 0.0625 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 15 320 0 0.046875 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 15 320 0 0.046875 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 12 320 0 0.0375 5
llama-2-chat:7:ggufv2:Q5_K_M 12 356 0 0.0337079 5
gpt-4o-2024-05-13 10 332 0 0.0301205 5
gpt-4-0125-preview 10 332 0 0.0301205 5
openhermes-2.5:7:ggufv2:Q2_K 6 356 0 0.0168539 5
code-llama-instruct:13:ggufv2:Q2_K 0 320 0 0 5
code-llama-instruct:13:ggufv2:Q6_K 0 320 0 0 5
code-llama-instruct:13:ggufv2:Q5_K_M 0 320 0 0 5
code-llama-instruct:13:ggufv2:Q4_K_M 0 320 0 0 5
code-llama-instruct:13:ggufv2:Q3_K_M 0 320 0 0 5
code-llama-instruct:34:ggufv2:Q4_K_M 0 320 0 0 5
code-llama-instruct:34:ggufv2:Q3_K_M 0 320 0 0 5
code-llama-instruct:34:ggufv2:Q2_K 0 320 0 0 5
code-llama-instruct:13:ggufv2:Q8_0 0 320 0 0 5
llama-2-chat:7:ggufv2:Q6_K 0 356 0 0 5
llama-2-chat:7:ggufv2:Q4_K_M 0 356 0 0 5
llama-2-chat:70:ggufv2:Q5_K_M 0 320 0 0 5
llama-2-chat:7:ggufv2:Q2_K 0 356 0 0 5
llama-2-chat:13:ggufv2:Q4_K_M 0 320 0 0 5
llama-2-chat:13:ggufv2:Q3_K_M 0 320 0 0 5
llama-2-chat:13:ggufv2:Q2_K 0 320 0 0 5
llama-2-chat:13:ggufv2:Q5_K_M 0 320 0 0 5
code-llama-instruct:34:ggufv2:Q8_0 0 320 0 0 5
code-llama-instruct:34:ggufv2:Q6_K 0 320 0 0 5
code-llama-instruct:34:ggufv2:Q5_K_M 0 320 0 0 5
code-llama-instruct:7:ggufv2:Q8_0 0 320 0 0 5
code-llama-instruct:7:ggufv2:Q3_K_M 0 320 0 0 5
code-llama-instruct:7:ggufv2:Q4_K_M 0 320 0 0 5
code-llama-instruct:7:ggufv2:Q5_K_M 0 320 0 0 5
code-llama-instruct:7:ggufv2:Q6_K 0 320 0 0 5
llama-2-chat:70:ggufv2:Q4_K_M 0 320 0 0 5
llama-2-chat:13:ggufv2:Q6_K 0 320 0 0 5
llama-2-chat:13:ggufv2:Q8_0 0 320 0 0 5
llama-2-chat:70:ggufv2:Q2_K 0 320 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 0 320 0 0 5
llama-2-chat:7:ggufv2:Q8_0 0 356 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 0 320 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 0 320 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 0 320 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 0 320 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 0 320 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 0 320 0 0 5
Full model name Score achieved Score possible Score SD Accuracy Iterations
claude-3-opus-20240229 24 24 0 1 3
code-llama-instruct:34:ggufv2:Q4_K_M 39 40 0 0.975 5
code-llama-instruct:34:ggufv2:Q5_K_M 38 40 0 0.95 5
code-llama-instruct:34:ggufv2:Q8_0 37 40 0 0.925 5
llama-3.1-instruct:70:ggufv2:IQ2_M 22 24 0.57735 0.916667 3
code-llama-instruct:34:ggufv2:Q6_K 36 40 0 0.9 5
code-llama-instruct:34:ggufv2:Q3_K_M 35 40 0 0.875 5
code-llama-instruct:13:ggufv2:Q2_K 35 40 0 0.875 5
claude-3-5-sonnet-20240620 26 30 0.57735 0.866667 3
code-llama-instruct:13:ggufv2:Q3_K_M 34 40 0 0.85 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 34 40 0 0.85 5
llama-3.1-instruct:8:ggufv2:Q6_K 20 24 1.1547 0.833333 3
llama-3.1-instruct:8:ggufv2:Q5_K_M 20 24 1.1547 0.833333 3
code-llama-instruct:13:ggufv2:Q6_K 33 40 0 0.825 5
code-llama-instruct:7:ggufv2:Q3_K_M 36 45 0 0.8 5
code-llama-instruct:7:ggufv2:Q2_K 32 40 0 0.8 5
gpt-3.5-turbo-0125 45 57 0 0.789474 5
llama-2-chat:70:ggufv2:Q5_K_M 35 45 0 0.777778 5
llama-2-chat:70:ggufv2:Q3_K_M 35 45 0 0.777778 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 35 45 0 0.777778 5
llama-2-chat:13:ggufv2:Q4_K_M 35 45 0 0.777778 5
llama-3-instruct:8:ggufv2:Q4_K_M 31 40 0 0.775 5
code-llama-instruct:7:ggufv2:Q6_K 31 40 0 0.775 5
code-llama-instruct:13:ggufv2:Q4_K_M 31 40 0 0.775 5
code-llama-instruct:13:ggufv2:Q5_K_M 31 40 0 0.775 5
llama-3-instruct:8:ggufv2:Q6_K 31 40 0 0.775 5
llama-2-chat:13:ggufv2:Q6_K 31 40 0 0.775 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 34 45 0 0.755556 5
gpt-3.5-turbo-0613 34 45 0 0.755556 5
llama-2-chat:70:ggufv2:Q4_K_M 34 45 0 0.755556 5
code-llama-instruct:34:ggufv2:Q2_K 30 40 0 0.75 5
code-llama-instruct:13:ggufv2:Q8_0 30 40 0 0.75 5
llama-3.1-instruct:70:ggufv2:IQ4_XS 18 24 0 0.75 3
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 33 45 0 0.733333 5
llama-2-chat:13:ggufv2:Q3_K_M 33 45 0 0.733333 5
llama-3-instruct:8:ggufv2:Q8_0 29 40 0 0.725 5
llama-2-chat:13:ggufv2:Q8_0 32 45 0 0.711111 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 32 45 0 0.711111 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 31 45 0 0.688889 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 31 45 0 0.688889 5
code-llama-instruct:7:ggufv2:Q5_K_M 31 45 0 0.688889 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 30 45 0 0.666667 5
code-llama-instruct:7:ggufv2:Q8_0 30 45 0 0.666667 5
gpt-4-0613 46 69 0 0.666667 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 30 45 0 0.666667 5
llama-2-chat:70:ggufv2:Q2_K 30 45 0 0.666667 5
gpt-4-turbo-2024-04-09 46 70 0 0.657143 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 26 40 0 0.65 5
llama-3-instruct:8:ggufv2:Q5_K_M 26 40 0 0.65 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 29 45 0 0.644444 5
llama-2-chat:13:ggufv2:Q5_K_M 29 45 0 0.644444 5
llama-3.1-instruct:70:ggufv2:Q3_K_S 15 24 0 0.625 3
gpt-4-0125-preview 39 63 0 0.619048 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 27 45 0 0.6 5
code-llama-instruct:7:ggufv2:Q4_K_M 27 45 0 0.6 5
gpt-4o-2024-05-13 40 76 0 0.526316 5
gpt-4o-mini-2024-07-18 43 82 0.547723 0.52439 5
llama-3.1-instruct:8:ggufv2:IQ4_XS 29 66 0 0.439394 3
llama-2-chat:7:ggufv2:Q2_K 38 117 0.57735 0.324786 5
llama-2-chat:13:ggufv2:Q2_K 13 45 0 0.288889 5
chatglm3:6:ggmlv3:q4_0 11 40 0 0.275 5
llama-2-chat:7:ggufv2:Q4_K_M 34 135 1.1547 0.251852 5
llama-3.1-instruct:8:ggufv2:Q3_K_L 36 150 2.88675 0.24 3
llama-2-chat:7:ggufv2:Q3_K_M 28 135 1.1547 0.207407 5
openhermes-2.5:7:ggufv2:Q2_K 52 279 0.57735 0.18638 5
gpt-4o-2024-08-06 34 189 0.57735 0.179894 3
llama-2-chat:7:ggufv2:Q6_K 24 135 0 0.177778 5
llama-2-chat:7:ggufv2:Q8_0 25 153 1 0.163399 5
llama-3.1-instruct:8:ggufv2:Q8_0 33 204 1.73205 0.161765 3
openhermes-2.5:7:ggufv2:Q3_K_M 53 338 0.57735 0.156805 5
llama-2-chat:7:ggufv2:Q5_K_M 21 135 0.57735 0.155556 5
llama-3.1-instruct:8:ggufv2:Q4_K_M 28 186 0.57735 0.150538 3
openhermes-2.5:7:ggufv2:Q4_K_M 52 369 0 0.140921 5
openhermes-2.5:7:ggufv2:Q5_K_M 49 405 0.57735 0.120988 5
openhermes-2.5:7:ggufv2:Q6_K 50 441 0.57735 0.113379 5
openhermes-2.5:7:ggufv2:Q8_0 48 477 1.1547 0.100629 5
Full model name Score achieved Score possible Score SD Accuracy Iterations
claude-3-opus-20240229 66 90 0 0.733333 3
llama-3.1-instruct:8:ggufv2:Q5_K_M 63 90 0 0.7 3
gpt-4-0613 118 173 0 0.682081 5
llama-3-instruct:8:ggufv2:Q4_K_M 100 150 0 0.666667 5
llama-3-instruct:8:ggufv2:Q8_0 100 150 0 0.666667 5
llama-3-instruct:8:ggufv2:Q6_K 100 150 0 0.666667 5
llama-3.1-instruct:8:ggufv2:Q4_K_M 105 159 0 0.660377 3
llama-3.1-instruct:8:ggufv2:Q8_0 105 159 0 0.660377 3
code-llama-instruct:7:ggufv2:Q4_K_M 98 150 0 0.653333 5
llama-3.1-instruct:8:ggufv2:IQ4_XS 73 113 0 0.646018 3
llama-3.1-instruct:70:ggufv2:IQ2_M 57 90 0 0.633333 3
claude-3-5-sonnet-20240620 57 90 0 0.633333 3
llama-3.1-instruct:70:ggufv2:Q3_K_S 57 90 0 0.633333 3
llama-3.1-instruct:8:ggufv2:Q6_K 57 90 0 0.633333 3
llama-3.1-instruct:8:ggufv2:Q3_K_L 99 159 0 0.622642 3
code-llama-instruct:34:ggufv2:Q3_K_M 90 150 0 0.6 5
llama-3-instruct:8:ggufv2:Q5_K_M 90 150 0 0.6 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 86 150 0 0.573333 5
code-llama-instruct:13:ggufv2:Q2_K 85 150 0 0.566667 5
code-llama-instruct:13:ggufv2:Q8_0 85 150 0 0.566667 5
code-llama-instruct:13:ggufv2:Q5_K_M 85 150 0 0.566667 5
code-llama-instruct:34:ggufv2:Q2_K 85 150 0 0.566667 5
llama-3.1-instruct:70:ggufv2:IQ4_XS 51 90 0 0.566667 3
openhermes-2.5:7:ggufv2:Q5_K_M 124 219 0 0.56621 5
openhermes-2.5:7:ggufv2:Q6_K 122 219 0 0.557078 5
code-llama-instruct:13:ggufv2:Q6_K 81 150 0 0.54 5
gpt-4o-2024-05-13 93 173 0 0.537572 5
gpt-4o-mini-2024-07-18 93 173 0 0.537572 5
code-llama-instruct:7:ggufv2:Q2_K 80 150 0 0.533333 5
code-llama-instruct:13:ggufv2:Q3_K_M 80 150 0 0.533333 5
code-llama-instruct:13:ggufv2:Q4_K_M 80 150 0 0.533333 5
gpt-4o-2024-08-06 84 159 0 0.528302 3
gpt-3.5-turbo-0125 89 173 0 0.514451 5
gpt-4-turbo-2024-04-09 88 173 0 0.508671 5
gpt-3.5-turbo-0613 75 150 0 0.5 5
openhermes-2.5:7:ggufv2:Q8_0 109 219 0 0.497717 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 72 150 0 0.48 5
chatglm3:6:ggmlv3:q4_0 72 150 0 0.48 5
llama-2-chat:13:ggufv2:Q8_0 72 150 0 0.48 5
llama-2-chat:13:ggufv2:Q3_K_M 72 150 0 0.48 5
openhermes-2.5:7:ggufv2:Q4_K_M 105 219 0.57735 0.479452 5
llama-2-chat:70:ggufv2:Q2_K 71 150 0 0.473333 5
code-llama-instruct:34:ggufv2:Q6_K 71 150 0 0.473333 5
openhermes-2.5:7:ggufv2:Q3_K_M 103 219 0 0.47032 5
code-llama-instruct:34:ggufv2:Q4_K_M 70 150 0 0.466667 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 70 150 0 0.466667 5
code-llama-instruct:34:ggufv2:Q5_K_M 70 150 0 0.466667 5
code-llama-instruct:34:ggufv2:Q8_0 70 150 0 0.466667 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 70 150 0 0.466667 5
gpt-4-0125-preview 79 173 0 0.456647 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 65 150 0 0.433333 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 65 150 0 0.433333 5
llama-2-chat:13:ggufv2:Q5_K_M 65 150 0 0.433333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 64 150 0 0.426667 5
code-llama-instruct:7:ggufv2:Q3_K_M 64 150 0 0.426667 5
openhermes-2.5:7:ggufv2:Q2_K 92 219 0 0.420091 5
llama-2-chat:70:ggufv2:Q4_K_M 63 150 0 0.42 5
llama-2-chat:70:ggufv2:Q3_K_M 62 150 0 0.413333 5
code-llama-instruct:7:ggufv2:Q8_0 60 150 0 0.4 5
code-llama-instruct:7:ggufv2:Q5_K_M 60 150 0 0.4 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 58 150 0 0.386667 5
llama-2-chat:13:ggufv2:Q6_K 58 150 0 0.386667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 57 150 0 0.38 5
llama-2-chat:13:ggufv2:Q2_K 55 150 0 0.366667 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 55 150 0 0.366667 5
llama-2-chat:13:ggufv2:Q4_K_M 55 150 0 0.366667 5
llama-2-chat:70:ggufv2:Q5_K_M 54 150 0 0.36 5
llama-2-chat:7:ggufv2:Q5_K_M 74 219 0 0.3379 5
code-llama-instruct:7:ggufv2:Q6_K 50 150 0 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 50 150 0 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 50 150 0 0.333333 5
llama-2-chat:7:ggufv2:Q6_K 64 219 0 0.292237 5
llama-2-chat:7:ggufv2:Q8_0 64 219 0 0.292237 5
llama-2-chat:7:ggufv2:Q4_K_M 60 219 0 0.273973 5
llama-2-chat:7:ggufv2:Q3_K_M 50 219 0 0.228311 5
llama-2-chat:7:ggufv2:Q2_K 36 219 0 0.164384 5
Full model name Score achieved Score possible Score SD Accuracy Iterations
claude-3-5-sonnet-20240620 87 90 0 0.966667 3
code-llama-instruct:7:ggufv2:Q4_K_M 145 150 0 0.966667 5
llama-3.1-instruct:70:ggufv2:Q3_K_S 87 90 0 0.966667 3
code-llama-instruct:7:ggufv2:Q6_K 144 150 0 0.96 5
code-llama-instruct:7:ggufv2:Q5_K_M 144 150 0 0.96 5
code-llama-instruct:7:ggufv2:Q8_0 144 150 0 0.96 5
gpt-4-0613 166 173 0 0.959538 5
llama-3.1-instruct:70:ggufv2:IQ2_M 86 90 1.1547 0.955556 3
llama-3.1-instruct:70:ggufv2:IQ4_XS 86 90 1.1547 0.955556 3
llama-3.1-instruct:8:ggufv2:Q6_K 86 90 0.57735 0.955556 3
gpt-3.5-turbo-0125 165 173 0 0.953757 5
gpt-4o-mini-2024-07-18 165 173 0 0.953757 5
llama-3.1-instruct:8:ggufv2:IQ4_XS 107 113 0.57735 0.946903 3
gpt-3.5-turbo-0613 142 150 0 0.946667 5
claude-3-opus-20240229 85 90 0.57735 0.944444 3
llama-3.1-instruct:8:ggufv2:Q3_K_L 150 159 0 0.943396 3
llama-3.1-instruct:8:ggufv2:Q8_0 149 159 0.57735 0.937107 3
llama-3.1-instruct:8:ggufv2:Q5_K_M 84 90 0 0.933333 3
llama-3-instruct:8:ggufv2:Q5_K_M 139 150 0 0.926667 5
llama-3-instruct:8:ggufv2:Q6_K 139 150 0 0.926667 5
llama-3.1-instruct:8:ggufv2:Q4_K_M 147 159 0 0.924528 3
code-llama-instruct:7:ggufv2:Q2_K 138 150 0 0.92 5
llama-3-instruct:8:ggufv2:Q4_K_M 138 150 0 0.92 5
llama-2-chat:70:ggufv2:Q4_K_M 138 150 0 0.92 5
llama-3-instruct:8:ggufv2:Q8_0 138 150 0 0.92 5
openhermes-2.5:7:ggufv2:Q5_K_M 201 219 0.57735 0.917808 5
openhermes-2.5:7:ggufv2:Q2_K 201 219 0 0.917808 5
openhermes-2.5:7:ggufv2:Q3_K_M 201 219 3 0.917808 5
llama-2-chat:70:ggufv2:Q5_K_M 136 150 0 0.906667 5
code-llama-instruct:34:ggufv2:Q4_K_M 136 150 0 0.906667 5
llama-2-chat:70:ggufv2:Q3_K_M 136 150 0 0.906667 5
openhermes-2.5:7:ggufv2:Q8_0 198 219 0 0.90411 5
llama-2-chat:70:ggufv2:Q2_K 135 150 0 0.9 5
code-llama-instruct:34:ggufv2:Q5_K_M 135 150 0 0.9 5
openhermes-2.5:7:ggufv2:Q4_K_M 196 219 0.57735 0.894977 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 134 150 0 0.893333 5
openhermes-2.5:7:ggufv2:Q6_K 195 219 0 0.890411 5
gpt-4o-2024-08-06 139 159 2.3094 0.874214 3
code-llama-instruct:7:ggufv2:Q3_K_M 131 150 0 0.873333 5
code-llama-instruct:34:ggufv2:Q8_0 129 150 0 0.86 5
code-llama-instruct:34:ggufv2:Q6_K 128 150 0 0.853333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 127 150 0 0.846667 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 127 150 0 0.846667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 126 150 0 0.84 5
gpt-4-0125-preview 145 173 0 0.83815 5
code-llama-instruct:13:ggufv2:Q3_K_M 125 150 0 0.833333 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 125 150 0 0.833333 5
code-llama-instruct:13:ggufv2:Q4_K_M 125 150 0 0.833333 5
gpt-4-turbo-2024-04-09 144 173 0.447214 0.83237 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 124 150 0 0.826667 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 124 150 0 0.826667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 124 150 0 0.826667 5
code-llama-instruct:13:ggufv2:Q2_K 123 150 0 0.82 5
llama-2-chat:13:ggufv2:Q6_K 122 150 0 0.813333 5
gpt-4o-2024-05-13 140 173 0 0.809249 5
code-llama-instruct:13:ggufv2:Q6_K 119 150 0 0.793333 5
code-llama-instruct:34:ggufv2:Q3_K_M 118 150 0 0.786667 5
llama-2-chat:13:ggufv2:Q8_0 118 150 0 0.786667 5
code-llama-instruct:13:ggufv2:Q5_K_M 117 150 0 0.78 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 116 150 0 0.773333 5
code-llama-instruct:13:ggufv2:Q8_0 115 150 0 0.766667 5
llama-2-chat:13:ggufv2:Q4_K_M 114 150 0 0.76 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 114 150 0 0.76 5
llama-2-chat:13:ggufv2:Q5_K_M 112 150 0 0.746667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 109 150 0 0.726667 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 104 150 0 0.693333 5
code-llama-instruct:34:ggufv2:Q2_K 103 150 0 0.686667 5
llama-2-chat:13:ggufv2:Q3_K_M 102 150 0 0.68 5
llama-2-chat:7:ggufv2:Q2_K 134 219 1.1547 0.611872 5
llama-2-chat:7:ggufv2:Q4_K_M 134 219 2.68223 0.611872 5
llama-2-chat:7:ggufv2:Q8_0 129 219 0 0.589041 5
llama-2-chat:7:ggufv2:Q3_K_M 129 219 1.1547 0.589041 5
llama-2-chat:7:ggufv2:Q6_K 123 219 0 0.561644 5
chatglm3:6:ggmlv3:q4_0 83 150 0 0.553333 5
llama-2-chat:7:ggufv2:Q5_K_M 120 219 0.57735 0.547945 5
llama-2-chat:13:ggufv2:Q2_K 65 150 0 0.433333 5
Full model name Score achieved Score possible Score SD Accuracy Iterations
gpt-3.5-turbo-0125 159 173 0 0.919075 5
gpt-4-0613 152 173 0 0.878613 5
gpt-3.5-turbo-0613 125 150 0 0.833333 5
gpt-4o-2024-08-06 132 159 1.1547 0.830189 3
llama-3.1-instruct:8:ggufv2:Q3_K_L 129 159 0 0.811321 3
llama-3.1-instruct:8:ggufv2:Q8_0 123 159 0 0.773585 3
llama-3.1-instruct:8:ggufv2:IQ4_XS 84 113 0 0.743363 3
llama-3.1-instruct:8:ggufv2:Q4_K_M 117 159 0 0.735849 3
claude-3-5-sonnet-20240620 66 90 0 0.733333 3
llama-3.1-instruct:8:ggufv2:Q5_K_M 66 90 0 0.733333 3
llama-3.1-instruct:8:ggufv2:Q6_K 66 90 0 0.733333 3
gpt-4o-mini-2024-07-18 119 173 1.54266 0.687861 5
claude-3-opus-20240229 59 90 3.21455 0.655556 3
gpt-4-turbo-2024-04-09 110 173 0 0.635838 5
llama-3.1-instruct:70:ggufv2:Q3_K_S 54 90 3.4641 0.6 3
llama-3.1-instruct:70:ggufv2:IQ4_XS 54 90 0 0.6 3
llama-3.1-instruct:70:ggufv2:IQ2_M 54 90 0 0.6 3
openhermes-2.5:7:ggufv2:Q3_K_M 63 219 1 0.287671 5
openhermes-2.5:7:ggufv2:Q6_K 60 219 1.73205 0.273973 5
openhermes-2.5:7:ggufv2:Q5_K_M 58 219 0.57735 0.26484 5
openhermes-2.5:7:ggufv2:Q4_K_M 54 219 1.1547 0.246575 5
openhermes-2.5:7:ggufv2:Q8_0 52 219 1.1547 0.237443 5
openhermes-2.5:7:ggufv2:Q2_K 35 219 2.3094 0.159817 5
gpt-4o-2024-05-13 20 173 0 0.115607 5
gpt-4-0125-preview 19 173 0 0.109827 5
chatglm3:6:ggmlv3:q4_0 0 150 0 0 5
code-llama-instruct:13:ggufv2:Q2_K 0 150 0 0 5
code-llama-instruct:13:ggufv2:Q3_K_M 0 150 0 0 5
code-llama-instruct:34:ggufv2:Q4_K_M 0 150 0 0 5
code-llama-instruct:34:ggufv2:Q3_K_M 0 150 0 0 5
code-llama-instruct:34:ggufv2:Q2_K 0 150 0 0 5
code-llama-instruct:13:ggufv2:Q8_0 0 150 0 0 5
code-llama-instruct:13:ggufv2:Q6_K 0 150 0 0 5
code-llama-instruct:13:ggufv2:Q5_K_M 0 150 0 0 5
code-llama-instruct:13:ggufv2:Q4_K_M 0 150 0 0 5
code-llama-instruct:7:ggufv2:Q8_0 0 150 0 0 5
code-llama-instruct:7:ggufv2:Q6_K 0 150 0 0 5
code-llama-instruct:7:ggufv2:Q5_K_M 0 150 0 0 5
code-llama-instruct:7:ggufv2:Q4_K_M 0 150 0 0 5
code-llama-instruct:7:ggufv2:Q3_K_M 0 150 0 0 5
code-llama-instruct:7:ggufv2:Q2_K 0 150 0 0 5
code-llama-instruct:34:ggufv2:Q8_0 0 150 0 0 5
code-llama-instruct:34:ggufv2:Q6_K 0 150 0 0 5
code-llama-instruct:34:ggufv2:Q5_K_M 0 150 0 0 5
llama-2-chat:7:ggufv2:Q6_K 0 219 0 0 5
llama-2-chat:7:ggufv2:Q5_K_M 0 219 0 0 5
llama-2-chat:7:ggufv2:Q4_K_M 0 219 0 0 5
llama-2-chat:7:ggufv2:Q3_K_M 0 219 0 0 5
llama-2-chat:7:ggufv2:Q2_K 0 219 0 0 5
llama-2-chat:70:ggufv2:Q5_K_M 0 150 0 0 5
llama-2-chat:70:ggufv2:Q4_K_M 0 150 0 0 5
llama-2-chat:70:ggufv2:Q3_K_M 0 150 0 0 5
llama-2-chat:70:ggufv2:Q2_K 0 150 0 0 5
llama-2-chat:13:ggufv2:Q8_0 0 150 0 0 5
llama-2-chat:13:ggufv2:Q6_K 0 150 0 0 5
llama-2-chat:13:ggufv2:Q5_K_M 0 150 0 0 5
llama-2-chat:13:ggufv2:Q4_K_M 0 150 0 0 5
llama-2-chat:13:ggufv2:Q3_K_M 0 150 0 0 5
llama-2-chat:13:ggufv2:Q2_K 0 150 0 0 5
llama-3-instruct:8:ggufv2:Q8_0 0 150 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 0 150 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 0 150 0 0 5
llama-3-instruct:8:ggufv2:Q4_K_M 0 150 0 0 5
llama-2-chat:7:ggufv2:Q8_0 0 219 0 0 5
llama-3-instruct:8:ggufv2:Q6_K 0 150 0 0 5
llama-3-instruct:8:ggufv2:Q5_K_M 0 150 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 0 150 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 0 150 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 0 150 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 0 150 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 0 150 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 0 150 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 0 150 0 0 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 0 150 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 0 150 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 0 150 0 0 5

Retrieval-Augmented Generation (RAG)

In this set of tasks, we test the ability of LLMs to answer a given question using a RAG agent, or to judge the relevance of a RAG fragment to a given question. Instructions can be explicit ("is this fragment relevant to the question?") or implicit (asking the question without further instructions and evaluating whether the model responds with 'not enough information given').
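
The implicit variant can be illustrated with a short Python sketch. It assumes that a response counts as a correct refusal when it contains the phrase "not enough information", ignoring case; the actual matching rule used by the benchmark is not stated here, and the names `flags_insufficient_information` and `score_implicit_case` are hypothetical.

```python
def flags_insufficient_information(response: str) -> bool:
    """Return True if the response signals that the context is insufficient.

    Assumption: a case-insensitive substring match on the phrase
    "not enough information" is enough to detect this.
    """
    return "not enough information" in response.lower()


def score_implicit_case(response: str, fragment_is_relevant: bool) -> int:
    """Score one implicit-relevance case as 1 (correct) or 0 (incorrect).

    An irrelevant fragment should trigger the refusal phrase; a relevant
    fragment should not.
    """
    refused = flags_insufficient_information(response)
    return int(refused != fragment_is_relevant)


# An irrelevant fragment answered with a refusal counts as correct.
example_score = score_implicit_case(
    "Not enough information given.", fragment_is_relevant=False
)
```

Summing such per-case scores over all test cases and iterations would yield a "Score achieved" figure of the kind listed in the tables.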

Full model name Score achieved Score possible Score SD Accuracy Iterations
claude-3-5-sonnet-20240620 18 18 0 1 3
llama-2-chat:13:ggufv2:Q6_K 30 30 0 1 5
llama-2-chat:70:ggufv2:Q2_K 30 30 0 1 5
gpt-4o-2024-08-06 18 18 0 1 3
code-llama-instruct:7:ggufv2:Q8_0 30 30 0 1 5
gpt-3.5-turbo-0125 30 30 0 1 5
gpt-3.5-turbo-0613 30 30 0 1 5
gpt-4-0125-preview 30 30 0 1 5
gpt-4-0613 30 30 0 1 5
gpt-4-turbo-2024-04-09 30 30 0 1 5
gpt-4o-2024-05-13 30 30 0 1 5
code-llama-instruct:7:ggufv2:Q4_K_M 30 30 0 1 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 30 30 0 1 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 30 30 0 1 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 30 30 0 1 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 30 30 0 1 5
openhermes-2.5:7:ggufv2:Q4_K_M 30 30 0 1 5
openhermes-2.5:7:ggufv2:Q5_K_M 30 30 0 1 5
openhermes-2.5:7:ggufv2:Q2_K 30 30 0 1 5
openhermes-2.5:7:ggufv2:Q3_K_M 30 30 0 1 5
openhermes-2.5:7:ggufv2:Q6_K 30 30 0 1 5
llama-3.1-instruct:8:ggufv2:Q3_K_L 18 18 0 1 3
llama-3.1-instruct:8:ggufv2:Q4_K_M 18 18 0 1 3
llama-2-chat:13:ggufv2:Q8_0 30 30 0 1 5
llama-2-chat:13:ggufv2:Q5_K_M 30 30 0 1 5
llama-2-chat:13:ggufv2:Q4_K_M 30 30 0 1 5
llama-2-chat:13:ggufv2:Q3_K_M 30 30 0 1 5
llama-2-chat:13:ggufv2:Q2_K 30 30 0 1 5
llama-2-chat:7:ggufv2:Q5_K_M 30 30 0 1 5
llama-2-chat:70:ggufv2:Q4_K_M 30 30 0 1 5
llama-2-chat:70:ggufv2:Q5_K_M 30 30 0 1 5
llama-2-chat:7:ggufv2:Q3_K_M 30 30 0 1 5
llama-2-chat:7:ggufv2:Q6_K 30 30 0 1 5
llama-2-chat:7:ggufv2:Q4_K_M 30 30 0 1 5
llama-2-chat:70:ggufv2:Q3_K_M 30 30 0 1 5
llama-3.1-instruct:8:ggufv2:IQ4_XS 18 18 0 1 3
llama-3.1-instruct:70:ggufv2:Q3_K_S 18 18 0 1 3
llama-3.1-instruct:70:ggufv2:IQ4_XS 18 18 0 1 3
llama-3.1-instruct:70:ggufv2:IQ2_M 18 18 0 1 3
llama-3-instruct:8:ggufv2:Q8_0 30 30 0 1 5
llama-3-instruct:8:ggufv2:Q6_K 30 30 0 1 5
llama-3-instruct:8:ggufv2:Q5_K_M 30 30 0 1 5
llama-3-instruct:8:ggufv2:Q4_K_M 30 30 0 1 5
llama-2-chat:7:ggufv2:Q8_0 30 30 0 1 5
openhermes-2.5:7:ggufv2:Q8_0 30 30 0 1 5
llama-3.1-instruct:8:ggufv2:Q8_0 18 18 0 1 3
mistral-instruct-v0.2:7:ggufv2:Q2_K 30 30 0 1 5
llama-3.1-instruct:8:ggufv2:Q5_K_M 18 18 0 1 3
llama-3.1-instruct:8:ggufv2:Q6_K 18 18 0 1 3
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 30 30 0 1 5
code-llama-instruct:7:ggufv2:Q6_K 25 30 0 0.833333 5
code-llama-instruct:7:ggufv2:Q5_K_M 25 30 0 0.833333 5
code-llama-instruct:13:ggufv2:Q6_K 25 30 0 0.833333 5
code-llama-instruct:13:ggufv2:Q8_0 25 30 0 0.833333 5
claude-3-opus-20240229 15 18 0 0.833333 3
code-llama-instruct:7:ggufv2:Q3_K_M 25 30 0 0.833333 5
llama-2-chat:7:ggufv2:Q2_K 25 30 0 0.833333 5
gpt-4o-mini-2024-07-18 25 30 0 0.833333 5
chatglm3:6:ggmlv3:q4_0 22 30 0 0.733333 5
code-llama-instruct:13:ggufv2:Q5_K_M 20 30 0 0.666667 5
code-llama-instruct:34:ggufv2:Q2_K 15 30 0 0.5 5
code-llama-instruct:34:ggufv2:Q3_K_M 15 30 0 0.5 5
code-llama-instruct:34:ggufv2:Q4_K_M 15 30 0 0.5 5
code-llama-instruct:13:ggufv2:Q4_K_M 10 30 0 0.333333 5
code-llama-instruct:34:ggufv2:Q5_K_M 10 30 0 0.333333 5
code-llama-instruct:34:ggufv2:Q6_K 10 30 0 0.333333 5
code-llama-instruct:34:ggufv2:Q8_0 10 30 0 0.333333 5
code-llama-instruct:7:ggufv2:Q2_K 10 30 0 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 10 30 0 0.333333 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 5 30 0 0.166667 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 4 30 0 0.133333 5
code-llama-instruct:13:ggufv2:Q2_K 1 30 0 0.0333333 5
code-llama-instruct:13:ggufv2:Q3_K_M 0 30 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 0 30 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 0 30 0 0 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 0 30 0 0 5
Full model name Score achieved Score possible Score SD Accuracy Iterations
chatglm3:6:ggmlv3:q4_0 10 10 0 1 5
claude-3-5-sonnet-20240620 6 6 0 1 3
claude-3-opus-20240229 6 6 0 1 3
code-llama-instruct:34:ggufv2:Q2_K 10 10 0 1 5
llama-2-chat:7:ggufv2:Q2_K 10 10 0 1 5
llama-2-chat:7:ggufv2:Q3_K_M 10 10 0 1 5
llama-2-chat:70:ggufv2:Q4_K_M 10 10 0 1 5
gpt-4-turbo-2024-04-09 10 10 0 1 5
gpt-3.5-turbo-0613 10 10 0 1 5
gpt-4-0613 10 10 0 1 5
code-llama-instruct:7:ggufv2:Q4_K_M 10 10 0 1 5
code-llama-instruct:34:ggufv2:Q5_K_M 10 10 0 1 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 10 10 0 1 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 10 10 0 1 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 10 10 0 1 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 10 10 0 1 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 10 10 0 1 5
openhermes-2.5:7:ggufv2:Q5_K_M 10 10 0 1 5
openhermes-2.5:7:ggufv2:Q6_K 10 10 0 1 5
llama-3.1-instruct:8:ggufv2:Q8_0 6 6 0 1 3
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 10 10 0 1 5
llama-3.1-instruct:70:ggufv2:Q3_K_S 6 6 0 1 3
llama-3.1-instruct:70:ggufv2:IQ2_M 6 6 0 1 3
llama-3.1-instruct:70:ggufv2:IQ4_XS 6 6 0 1 3
llama-3-instruct:8:ggufv2:Q8_0 10 10 0 1 5
llama-3-instruct:8:ggufv2:Q6_K 10 10 0 1 5
openhermes-2.5:7:ggufv2:Q8_0 10 10 0 1 5
openhermes-2.5:7:ggufv2:Q4_K_M 10 10 0 1 5
llama-3-instruct:8:ggufv2:Q5_K_M 10 10 0 1 5
llama-3-instruct:8:ggufv2:Q4_K_M 10 10 0 1 5
gpt-3.5-turbo-0125 9 10 0 0.9 5
code-llama-instruct:7:ggufv2:Q6_K 9 10 0 0.9 5
llama-2-chat:70:ggufv2:Q5_K_M 9 10 0 0.9 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 9 10 0 0.9 5
code-llama-instruct:34:ggufv2:Q8_0 9 10 0 0.9 5
code-llama-instruct:34:ggufv2:Q6_K 9 10 0 0.9 5
llama-3.1-instruct:8:ggufv2:Q3_K_L 5 6 0.57735 0.833333 3
llama-3.1-instruct:8:ggufv2:IQ4_XS 5 6 0.57735 0.833333 3
llama-3.1-instruct:8:ggufv2:Q5_K_M 5 6 0.57735 0.833333 3
llama-3.1-instruct:8:ggufv2:Q4_K_M 5 6 0.57735 0.833333 3
llama-3.1-instruct:8:ggufv2:Q6_K 5 6 0.57735 0.833333 3
code-llama-instruct:7:ggufv2:Q3_K_M 7 10 0 0.7 5
gpt-4o-2024-05-13 7 10 0 0.7 5
code-llama-instruct:7:ggufv2:Q2_K 7 10 0 0.7 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 7 10 0 0.7 5
gpt-4o-2024-08-06 4 6 0.57735 0.666667 3
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 6 10 0 0.6 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 6 10 0 0.6 5
llama-2-chat:7:ggufv2:Q5_K_M 6 10 0 0.6 5
code-llama-instruct:13:ggufv2:Q6_K 5 10 0 0.5 5
code-llama-instruct:13:ggufv2:Q4_K_M 5 10 0 0.5 5
code-llama-instruct:34:ggufv2:Q3_K_M 5 10 0 0.5 5
llama-2-chat:7:ggufv2:Q6_K 5 10 0 0.5 5
llama-2-chat:7:ggufv2:Q4_K_M 5 10 0 0.5 5
llama-2-chat:70:ggufv2:Q3_K_M 5 10 0 0.5 5
code-llama-instruct:13:ggufv2:Q8_0 5 10 0 0.5 5
code-llama-instruct:13:ggufv2:Q5_K_M 5 10 0 0.5 5
code-llama-instruct:7:ggufv2:Q5_K_M 5 10 0 0.5 5
gpt-4-0125-preview 5 10 0 0.5 5
code-llama-instruct:7:ggufv2:Q8_0 5 10 0 0.5 5
llama-2-chat:13:ggufv2:Q2_K 5 10 0 0.5 5
llama-2-chat:13:ggufv2:Q3_K_M 5 10 0 0.5 5
gpt-4o-mini-2024-07-18 5 10 0 0.5 5
llama-2-chat:13:ggufv2:Q4_K_M 5 10 0 0.5 5
llama-2-chat:13:ggufv2:Q5_K_M 5 10 0 0.5 5
llama-2-chat:13:ggufv2:Q6_K 5 10 0 0.5 5
llama-2-chat:13:ggufv2:Q8_0 5 10 0 0.5 5
llama-2-chat:70:ggufv2:Q2_K 5 10 0 0.5 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 5 10 0 0.5 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 5 10 0 0.5 5
openhermes-2.5:7:ggufv2:Q3_K_M 5 10 0 0.5 5
llama-2-chat:7:ggufv2:Q8_0 5 10 0 0.5 5
openhermes-2.5:7:ggufv2:Q2_K 5 10 0 0.5 5
code-llama-instruct:13:ggufv2:Q2_K 4 10 0 0.4 5
code-llama-instruct:34:ggufv2:Q4_K_M 4 10 0 0.4 5
code-llama-instruct:13:ggufv2:Q3_K_M 0 10 0 0 5

Text Extraction

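In this set of tasks, we test the ability of LLMs to extract information from a given document. The first table gives the overall result per model; the tables that follow break the results down by subtask (for example assay, chemical, context, disease, entity, experiment_yes_or_no, hypothesis, and intervention).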

Full model name Score achieved Score possible Score SD Accuracy Iterations
claude-3-5-sonnet-20240620 224.558 297 0.0766981 0.756088 3
gpt-4o-2024-08-06 211.222 297 1.31903 0.711185 3
llama-3.1-instruct:70:ggufv2:IQ4_XS 207.674 297 1.29175e-15 0.699238 3
claude-3-opus-20240229 205.297 297 0.441227 0.691235 3
gpt-4-0125-preview 341.404 495 0 0.689705 5
gpt-4o-mini-2024-07-18 338.854 495 1.74807 0.684553 5
gpt-4-0613 331.107 495 0 0.668903 5
gpt-4o-2024-05-13 323.703 495 0 0.653946 5
gpt-4-turbo-2024-04-09 321.933 495 4.39682 0.650369 5
llama-3.1-instruct:70:ggufv2:Q3_K_S 190.774 297 9.26323e-16 0.642336 3
llama-3.1-instruct:70:ggufv2:IQ2_M 186.07 297 8.83831e-16 0.626498 3
openhermes-2.5:7:ggufv2:Q6_K 306.488 495 0 0.619167 5
openhermes-2.5:7:ggufv2:Q8_0 297.41 495 0 0.600829 5
openhermes-2.5:7:ggufv2:Q4_K_M 295.654 495 0 0.597281 5
openhermes-2.5:7:ggufv2:Q5_K_M 287.059 495 0 0.579916 5
gpt-3.5-turbo-0613 284.814 495 0 0.575381 5
openhermes-2.5:7:ggufv2:Q3_K_M 274.471 495 0 0.554488 5
gpt-3.5-turbo-0125 252.466 495 0 0.510032 5
openhermes-2.5:7:ggufv2:Q2_K 219.807 495 0 0.444054 5
llama-3.1-instruct:8:ggufv2:IQ4_XS 123.142 297 5.69391e-16 0.414621 3
llama-3.1-instruct:8:ggufv2:Q6_K 117.157 297 3.73928e-16 0.394469 3
llama-3.1-instruct:8:ggufv2:Q8_0 115.554 297 4.18545e-16 0.38907 3
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 190.948 495 0 0.385754 5
llama-3.1-instruct:8:ggufv2:Q4_K_M 113.462 297 0.300229 0.382027 3
llama-3.1-instruct:8:ggufv2:Q5_K_M 113.002 297 0.00031952 0.380477 3
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 182.642 495 0 0.368974 5
mistral-instruct-v0.2:7:ggufv2:Q6_K 181.869 495 0 0.367412 5
llama-3.1-instruct:8:ggufv2:Q3_K_L 107.033 297 5.50801e-16 0.360379 3
mistral-instruct-v0.2:7:ggufv2:Q8_0 174.084 495 0 0.351684 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 171.777 495 0 0.347025 5
mistral-instruct-v0.2:7:ggufv2:Q2_K 163.974 495 0 0.331261 5
llama-2-chat:70:ggufv2:Q4_K_M 119.263 495 0 0.240936 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 116.651 495 0 0.235659 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M 113.663 495 0 0.229622 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 111.634 495 0 0.225524 5
llama-2-chat:70:ggufv2:Q2_K 106.448 495 0 0.215047 5
llama-2-chat:70:ggufv2:Q5_K_M 104.032 495 0 0.210166 5
llama-2-chat:70:ggufv2:Q3_K_M 97.9593 495 0 0.197898 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 95.9243 495 0 0.193786 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 93.6428 495 0 0.189177 5
llama-3-instruct:8:ggufv2:Q8_0 93.3345 495 0 0.188555 5
chatglm3:6:ggmlv3:q4_0 93.2008 495 0 0.188284 5
llama-3-instruct:8:ggufv2:Q5_K_M 82.3847 495 0 0.166434 5
llama-3-instruct:8:ggufv2:Q6_K 80.5152 495 0 0.162657 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 77.9693 495 0 0.157514 5
code-llama-instruct:7:ggufv2:Q4_K_M 68.6724 495 0 0.138732 5
llama-3-instruct:8:ggufv2:Q4_K_M 57.8514 495 0 0.116871 5
llama-2-chat:13:ggufv2:Q3_K_M 55.7521 495 0 0.112631 5
llama-2-chat:13:ggufv2:Q4_K_M 43.9894 495 0 0.0888675 5
llama-2-chat:7:ggufv2:Q4_K_M 42.1985 495 0 0.0852494 5
llama-2-chat:7:ggufv2:Q8_0 25.1647 297 1.46597e-16 0.0847297 3
llama-2-chat:13:ggufv2:Q6_K 23.2057 297 0.00246731 0.0781337 3
llama-2-chat:13:ggufv2:Q5_K_M 37.9252 495 0 0.0766167 5
llama-2-chat:13:ggufv2:Q8_0 37.7416 495 0 0.0762457 5
llama-2-chat:7:ggufv2:Q5_K_M 34.5308 495 0 0.0697591 5
llama-2-chat:7:ggufv2:Q3_K_M 32.2105 495 0 0.0650717 5
llama-2-chat:13:ggufv2:Q2_K 32.1447 495 0 0.0649389 5
llama-2-chat:7:ggufv2:Q6_K 18.2539 297 2.57076e-16 0.0614608 3
llama-2-chat:7:ggufv2:Q2_K 17.9123 495 0 0.0361865 5
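
The Accuracy column in the table above matches Score achieved divided by Score possible, and Score possible seems to scale with the number of iterations (297 for 3 iterations, 495 for 5). The following is a minimal sketch for checking this on a single row. It assumes rows are whitespace-separated exactly as shown on this page; `ResultRow`, `parse_row`, and `recomputed_accuracy` are hypothetical helper names for illustration and are not part of BioChatter.

```python
from typing import NamedTuple, Optional


class ResultRow(NamedTuple):
    """One benchmark row; `subtask` is None for the overall tables."""
    model: str
    subtask: Optional[str]
    score_achieved: float
    score_possible: float
    score_sd: float
    accuracy: float
    iterations: int


def parse_row(line: str) -> ResultRow:
    """Parse a whitespace-separated row with 6 fields, or 7 if a subtask is present."""
    fields = line.split()
    if len(fields) == 6:
        model, subtask, numbers = fields[0], None, fields[1:]
    elif len(fields) == 7:
        model, subtask, numbers = fields[0], fields[1], fields[2:]
    else:
        raise ValueError(f"expected 6 or 7 fields, got {len(fields)}: {line!r}")
    achieved, possible, sd, accuracy = (float(x) for x in numbers[:4])
    return ResultRow(model, subtask, achieved, possible, sd, accuracy, int(numbers[4]))


def recomputed_accuracy(row: ResultRow) -> float:
    """Score achieved / Score possible (0.0 if nothing was possible)."""
    if row.score_possible == 0:
        return 0.0
    return row.score_achieved / row.score_possible


overall = parse_row("claude-3-5-sonnet-20240620 224.558 297 0.0766981 0.756088 3")
per_subtask = parse_row("claude-3-5-sonnet-20240620 assay 23.4242 27 0 0.867565 3")

for row in (overall, per_subtask):
    # The table values are rounded, so compare with a small tolerance.
    matches = abs(recomputed_accuracy(row) - row.accuracy) < 1e-5
    print(row.model, row.subtask, round(recomputed_accuracy(row), 6), matches)
```

Because Score possible differs between models run with 3 and 5 iterations, Accuracy rather than Score achieved is the column to use when comparing across rows.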

Full model name Subtask Score achieved Score possible Score SD Accuracy Iterations
claude-3-5-sonnet-20240620 assay 23.4242 27 0 0.867565 3
llama-3.1-instruct:70:ggufv2:IQ2_M assay 22.1667 27 0 0.820988 3
llama-3.1-instruct:70:ggufv2:Q3_K_S assay 22.1667 27 0 0.820988 3
claude-3-opus-20240229 assay 21.4909 27 1.51082e-17 0.79596 3
gpt-4o-2024-08-06 assay 20.3897 27 3.02164e-17 0.755176 3
llama-3.1-instruct:70:ggufv2:IQ4_XS assay 20.3238 27 3.02164e-17 0.752734 3
gpt-4o-mini-2024-07-18 assay 33.0217 45 0.00559644 0.733816 5
gpt-4-turbo-2024-04-09 assay 26.4233 45 0.0114637 0.587184 5
llama-3.1-instruct:8:ggufv2:Q3_K_L assay 12.9241 27 1.60525e-17 0.478671 3
llama-3.1-instruct:8:ggufv2:IQ4_XS assay 12.4373 27 8.49837e-18 0.460639 3
llama-3.1-instruct:8:ggufv2:Q4_K_M assay 12.4264 27 7.55411e-18 0.460235 3
llama-3.1-instruct:8:ggufv2:Q6_K assay 12.1618 27 8.49837e-18 0.450438 3
llama-3.1-instruct:8:ggufv2:Q5_K_M assay 11.2618 27 0 0.417103 3
llama-3.1-instruct:8:ggufv2:Q8_0 assay 10.4603 27 1.51082e-17 0.387417 3
gpt-4o-2024-05-13 assay 6.67307 45 0 0.148291 5
gpt-4-0125-preview assay 6.60264 45 0 0.146725 5
openhermes-2.5:7:ggufv2:Q6_K assay 6.45354 45 0 0.143412 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M assay 6.42156 45 0 0.142701 5
openhermes-2.5:7:ggufv2:Q8_0 assay 6.24141 45 0 0.138698 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 assay 5.8662 45 0 0.13036 5
mistral-instruct-v0.2:7:ggufv2:Q2_K assay 5.84165 45 0 0.129814 5
mistral-instruct-v0.2:7:ggufv2:Q6_K assay 5.83272 45 0 0.129616 5
openhermes-2.5:7:ggufv2:Q5_K_M assay 5.77475 45 0 0.128328 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M assay 5.72421 45 0 0.127205 5
gpt-3.5-turbo-0613 assay 5.71717 45 0 0.127048 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M assay 5.66084 45 0 0.125797 5
gpt-3.5-turbo-0125 assay 5.48324 45 0 0.12185 5
gpt-4-0613 assay 5.47238 45 0 0.121608 5
openhermes-2.5:7:ggufv2:Q4_K_M assay 5.40473 45 0 0.120105 5
openhermes-2.5:7:ggufv2:Q3_K_M assay 4.99329 45 0 0.110962 5
openhermes-2.5:7:ggufv2:Q2_K assay 4.35689 45 0 0.0968198 5
llama-2-chat:7:ggufv2:Q6_K assay 2.34166 27 7.55411e-18 0.0867281 3
llama-2-chat:13:ggufv2:Q6_K assay 2.19772 27 3.77706e-18 0.081397 3
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M assay 3.17543 45 0 0.070565 5
llama-2-chat:7:ggufv2:Q8_0 assay 1.62311 27 0 0.0601152 3
llama-2-chat:70:ggufv2:Q4_K_M assay 1.8509 45 0 0.041131 5
llama-2-chat:70:ggufv2:Q5_K_M assay 1.81844 45 0 0.0404097 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K assay 1.68419 45 0 0.0374265 5
chatglm3:6:ggmlv3:q4_0 assay 1.61672 45 0 0.0359271 5
code-llama-instruct:7:ggufv2:Q4_K_M assay 1.53778 45 0 0.0341728 5
llama-3-instruct:8:ggufv2:Q6_K assay 1.48103 45 0 0.0329118 5
llama-3-instruct:8:ggufv2:Q8_0 assay 1.37088 45 0 0.0304641 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 assay 1.16327 45 0 0.0258505 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M assay 1.15926 45 0 0.0257612 5
llama-2-chat:70:ggufv2:Q2_K assay 1.15095 45 0 0.0255768 5
llama-2-chat:70:ggufv2:Q3_K_M assay 1.07788 45 0 0.023953 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M assay 1.05347 45 0 0.0234104 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K assay 1.02909 45 0 0.0228686 5
llama-2-chat:13:ggufv2:Q2_K assay 0.974441 45 0 0.0216542 5
llama-3-instruct:8:ggufv2:Q5_K_M assay 0.922706 45 0 0.0205046 5
llama-2-chat:7:ggufv2:Q5_K_M assay 0.919259 45 0 0.020428 5
llama-2-chat:13:ggufv2:Q5_K_M assay 0.836349 45 0 0.0185855 5
llama-2-chat:13:ggufv2:Q8_0 assay 0.756302 45 0 0.0168067 5
llama-2-chat:13:ggufv2:Q3_K_M assay 0.750557 45 0 0.016679 5
llama-2-chat:13:ggufv2:Q4_K_M assay 0.647223 45 0 0.0143827 5
llama-2-chat:7:ggufv2:Q4_K_M assay 0.604799 45 0 0.01344 5
llama-3-instruct:8:ggufv2:Q4_K_M assay 0.522273 45 0 0.0116061 5
llama-2-chat:7:ggufv2:Q3_K_M assay 0.455699 45 0 0.0101266 5
llama-2-chat:7:ggufv2:Q2_K assay 0.233824 45 0 0.00519608 5

Full model name Subtask Score achieved Score possible Score SD Accuracy Iterations
claude-3-opus-20240229 chemical 24 27 0 0.888889 3
claude-3-5-sonnet-20240620 chemical 21.6667 27 0 0.802469 3
llama-3.1-instruct:70:ggufv2:IQ4_XS chemical 20 27 0 0.740741 3
gpt-4o-2024-08-06 chemical 18.6667 27 0 0.691358 3
llama-3.1-instruct:70:ggufv2:Q3_K_S chemical 18 27 0 0.666667 3
llama-3.1-instruct:70:ggufv2:IQ2_M chemical 18 27 0 0.666667 3
gpt-4-turbo-2024-04-09 chemical 29.188 45 0 0.648623 5
gpt-4o-mini-2024-07-18 chemical 27.7778 45 0 0.617284 5
llama-3.1-instruct:8:ggufv2:IQ4_XS chemical 12.3451 27 0 0.457227 3
llama-3.1-instruct:8:ggufv2:Q4_K_M chemical 12.0531 27 0 0.446411 3
llama-3.1-instruct:8:ggufv2:Q6_K chemical 11.0168 27 0 0.408029 3
llama-3.1-instruct:8:ggufv2:Q3_K_L chemical 10.8547 27 2.36066e-19 0.402026 3
llama-3.1-instruct:8:ggufv2:Q8_0 chemical 9.13698 27 0 0.338407 3
llama-3.1-instruct:8:ggufv2:Q5_K_M chemical 8.35802 27 0 0.309556 3
gpt-4-0613 chemical 6.38889 45 0 0.141975 5
gpt-4-0125-preview chemical 6.22222 45 0 0.138272 5
openhermes-2.5:7:ggufv2:Q6_K chemical 6.16667 45 0 0.137037 5
gpt-4o-2024-05-13 chemical 5.55556 45 0 0.123457 5
gpt-3.5-turbo-0613 chemical 5.44444 45 0 0.120988 5
openhermes-2.5:7:ggufv2:Q3_K_M chemical 5.23309 45 0 0.116291 5
openhermes-2.5:7:ggufv2:Q8_0 chemical 5.16667 45 0 0.114815 5
openhermes-2.5:7:ggufv2:Q5_K_M chemical 5.06667 45 0 0.112593 5
gpt-3.5-turbo-0125 chemical 5.06444 45 0 0.112543 5
openhermes-2.5:7:ggufv2:Q4_K_M chemical 4.95556 45 0 0.110123 5
openhermes-2.5:7:ggufv2:Q2_K chemical 4.66667 45 0 0.103704 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M chemical 4.02332 45 0 0.0894072 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M chemical 3.69824 45 0 0.0821832 5
mistral-instruct-v0.2:7:ggufv2:Q6_K chemical 3.5588 45 0 0.0790845 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M chemical 3.23175 45 0 0.0718166 5
mistral-instruct-v0.2:7:ggufv2:Q2_K chemical 2.9648 45 0 0.0658845 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M chemical 2.85926 45 0 0.0635392 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 chemical 2.80214 45 0 0.0622698 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K chemical 2.28839 45 0 0.050853 5
llama-2-chat:13:ggufv2:Q6_K chemical 1.33748 27 0 0.0495362 3
llama-3-instruct:8:ggufv2:Q6_K chemical 1.99259 45 0 0.0442798 5
llama-3-instruct:8:ggufv2:Q5_K_M chemical 1.98451 45 0 0.0441003 5
llama-3-instruct:8:ggufv2:Q8_0 chemical 1.98451 45 0 0.0441003 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M chemical 1.92687 45 0 0.0428194 5
llama-2-chat:70:ggufv2:Q2_K chemical 1.92403 45 0 0.0427562 5
llama-2-chat:70:ggufv2:Q4_K_M chemical 1.86594 45 0 0.0414653 5
llama-2-chat:7:ggufv2:Q8_0 chemical 1.11429 27 3.77706e-18 0.0412698 3
llama-2-chat:70:ggufv2:Q5_K_M chemical 1.7972 45 0 0.0399378 5
llama-2-chat:70:ggufv2:Q3_K_M chemical 1.65417 45 0 0.0367593 5
llama-2-chat:13:ggufv2:Q4_K_M chemical 1.60885 45 0 0.0357522 5
llama-2-chat:7:ggufv2:Q6_K chemical 0.85 27 1.88853e-18 0.0314815 3
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K chemical 1.37178 45 0 0.030484 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 chemical 1.02473 45 0 0.0227718 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M chemical 0.993896 45 0 0.0220866 5
llama-3-instruct:8:ggufv2:Q4_K_M chemical 0.920791 45 0 0.020462 5
chatglm3:6:ggmlv3:q4_0 chemical 0.839293 45 0 0.018651 5
llama-2-chat:7:ggufv2:Q5_K_M chemical 0.580952 45 0 0.0129101 5
llama-2-chat:13:ggufv2:Q5_K_M chemical 0.473978 45 0 0.0105328 5
llama-2-chat:13:ggufv2:Q8_0 chemical 0.473978 45 0 0.0105328 5
llama-2-chat:13:ggufv2:Q3_K_M chemical 0.447004 45 0 0.00993343 5
code-llama-instruct:7:ggufv2:Q4_K_M chemical 0.44189 45 0 0.00981978 5
llama-2-chat:13:ggufv2:Q2_K chemical 0.429118 45 0 0.00953595 5
llama-2-chat:7:ggufv2:Q4_K_M chemical 0.416702 45 0 0.00926004 5
llama-2-chat:7:ggufv2:Q3_K_M chemical 0.270151 45 0 0.00600336 5
llama-2-chat:7:ggufv2:Q2_K chemical 0.264943 45 0 0.00588762 5

Full model name Subtask Score achieved Score possible Score SD Accuracy Iterations
llama-3.1-instruct:70:ggufv2:IQ2_M context 25.2426 27 1.51082e-17 0.93491 3
llama-3.1-instruct:70:ggufv2:IQ4_XS context 25.2195 27 3.02164e-17 0.934057 3
claude-3-5-sonnet-20240620 context 24.5401 27 6.04329e-17 0.908892 3
gpt-4o-2024-08-06 context 23.6991 27 0.00396564 0.877746 3
gpt-4-turbo-2024-04-09 context 38.9656 45 0.0255706 0.865903 5
claude-3-opus-20240229 context 23.2287 27 3.02164e-17 0.860323 3
gpt-4o-mini-2024-07-18 context 38.476 45 0.00134967 0.855023 5
llama-3.1-instruct:70:ggufv2:Q3_K_S context 22.2637 27 1.51082e-17 0.82458 3
llama-3.1-instruct:8:ggufv2:Q8_0 context 16.8902 27 9.44264e-19 0.625563 3
llama-3.1-instruct:8:ggufv2:IQ4_XS context 16.8016 27 1.51082e-17 0.622281 3
llama-3.1-instruct:8:ggufv2:Q4_K_M context 16.4994 27 2.83279e-18 0.611089 3
llama-3.1-instruct:8:ggufv2:Q6_K context 16.4093 27 1.88853e-18 0.607753 3
llama-3.1-instruct:8:ggufv2:Q5_K_M context 16.1817 27 1.88853e-18 0.599324 3
llama-3.1-instruct:8:ggufv2:Q3_K_L context 15.6028 27 1.88853e-18 0.57788 3
llama-2-chat:7:ggufv2:Q8_0 context 5.70797 27 0 0.211406 3
llama-2-chat:13:ggufv2:Q6_K context 4.88293 27 1.69967e-17 0.180849 3
gpt-4-0613 context 7.90663 45 0 0.175703 5
gpt-4-0125-preview context 7.85253 45 0 0.174501 5
gpt-4o-2024-05-13 context 7.82965 45 0 0.173992 5
gpt-3.5-turbo-0125 context 6.89247 45 0 0.153166 5
openhermes-2.5:7:ggufv2:Q4_K_M context 6.89055 45 0 0.153123 5
openhermes-2.5:7:ggufv2:Q6_K context 6.79989 45 0 0.151109 5
openhermes-2.5:7:ggufv2:Q3_K_M context 6.77271 45 0 0.150505 5
openhermes-2.5:7:ggufv2:Q8_0 context 6.67749 45 0 0.148389 5
gpt-3.5-turbo-0613 context 6.50472 45 0 0.144549 5
openhermes-2.5:7:ggufv2:Q5_K_M context 6.44769 45 0 0.143282 5
llama-2-chat:7:ggufv2:Q6_K context 3.73057 27 0 0.138169 3
mistral-instruct-v0.2:7:ggufv2:Q8_0 context 5.16754 45 0 0.114834 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M context 5.12599 45 0 0.113911 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M context 5.02844 45 0 0.111743 5
mistral-instruct-v0.2:7:ggufv2:Q6_K context 5.0158 45 0 0.111462 5
mistral-instruct-v0.2:7:ggufv2:Q2_K context 4.99362 45 0 0.110969 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K context 4.51314 45 0 0.100292 5
llama-2-chat:70:ggufv2:Q3_K_M context 4.22332 45 0 0.0938516 5
llama-2-chat:70:ggufv2:Q4_K_M context 4.10284 45 0 0.0911743 5
llama-2-chat:70:ggufv2:Q2_K context 4.08979 45 0 0.0908843 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M context 4.06318 45 0 0.090293 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M context 4.01117 45 0 0.0891372 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M context 3.90982 45 0 0.0868849 5
openhermes-2.5:7:ggufv2:Q2_K context 3.86897 45 0 0.0859772 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M context 3.79416 45 0 0.0843146 5
llama-2-chat:70:ggufv2:Q5_K_M context 3.74591 45 0 0.0832424 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 context 3.70126 45 0 0.0822502 5
code-llama-instruct:7:ggufv2:Q4_K_M context 3.32657 45 0 0.0739237 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K context 3.1452 45 0 0.0698933 5
chatglm3:6:ggmlv3:q4_0 context 2.85636 45 0 0.0634747 5
llama-2-chat:7:ggufv2:Q3_K_M context 2.10857 45 0 0.046857 5
llama-2-chat:7:ggufv2:Q4_K_M context 1.89605 45 0 0.0421345 5
llama-2-chat:13:ggufv2:Q3_K_M context 1.78868 45 0 0.0397484 5
llama-2-chat:13:ggufv2:Q5_K_M context 1.78618 45 0 0.0396929 5
llama-2-chat:13:ggufv2:Q4_K_M context 1.77351 45 0 0.0394113 5
llama-3-instruct:8:ggufv2:Q8_0 context 1.67334 45 0 0.0371853 5
llama-3-instruct:8:ggufv2:Q5_K_M context 1.64821 45 0 0.0366268 5
llama-2-chat:13:ggufv2:Q8_0 context 1.58821 45 0 0.0352936 5
llama-3-instruct:8:ggufv2:Q4_K_M context 1.57169 45 0 0.0349264 5
llama-2-chat:13:ggufv2:Q2_K context 1.34289 45 0 0.0298419 5
llama-2-chat:7:ggufv2:Q5_K_M context 1.23881 45 0 0.0275291 5
llama-2-chat:7:ggufv2:Q2_K context 1.12335 45 0 0.0249632 5
llama-3-instruct:8:ggufv2:Q6_K context 1.10292 45 0 0.0245094 5

Full model name Subtask Score achieved Score possible Score SD Accuracy Iterations
claude-3-5-sonnet-20240620 disease 20.4 27 1.51082e-17 0.755556 3
gpt-4o-mini-2024-07-18 disease 32.3333 45 0 0.718519 5
gpt-4o-2024-08-06 disease 19.4 27 1.51082e-17 0.718519 3
gpt-4-turbo-2024-04-09 disease 32.2667 45 0.00331269 0.717037 5
llama-3.1-instruct:70:ggufv2:IQ4_XS disease 17.2 27 7.55411e-18 0.637037 3
llama-3.1-instruct:70:ggufv2:IQ2_M disease 13.2 27 7.55411e-18 0.488889 3
llama-3.1-instruct:70:ggufv2:Q3_K_S disease 13.2 27 7.55411e-18 0.488889 3
llama-3.1-instruct:8:ggufv2:Q5_K_M disease 11.7851 27 7.55411e-18 0.436484 3
llama-3.1-instruct:8:ggufv2:Q3_K_L disease 11.5883 27 2.36066e-19 0.429196 3
llama-3.1-instruct:8:ggufv2:Q6_K disease 11.5414 27 1.51082e-17 0.42746 3
llama-3.1-instruct:8:ggufv2:IQ4_XS disease 11.5414 27 1.51082e-17 0.42746 3
llama-3.1-instruct:8:ggufv2:Q4_K_M disease 10.2531 27 2.36066e-19 0.379746 3
claude-3-opus-20240229 disease 10 27 0 0.37037 3
llama-3.1-instruct:8:ggufv2:Q8_0 disease 9.31558 27 2.36066e-19 0.345022 3
openhermes-2.5:7:ggufv2:Q3_K_M disease 6.46667 45 0 0.143704 5
openhermes-2.5:7:ggufv2:Q4_K_M disease 6.46667 45 0 0.143704 5
openhermes-2.5:7:ggufv2:Q5_K_M disease 6.46667 45 0 0.143704 5
openhermes-2.5:7:ggufv2:Q6_K disease 6.46667 45 0 0.143704 5
openhermes-2.5:7:ggufv2:Q8_0 disease 6.46667 45 0 0.143704 5
gpt-4-0125-preview disease 6.21333 45 0 0.138074 5
gpt-4o-2024-05-13 disease 6.2 45 0 0.137778 5
gpt-4-0613 disease 6.13333 45 0 0.136296 5
gpt-3.5-turbo-0613 disease 6.06667 45 0 0.134815 5
gpt-3.5-turbo-0125 disease 4.75238 45 0 0.105608 5
openhermes-2.5:7:ggufv2:Q2_K disease 4.32493 45 0 0.0961096 5
mistral-instruct-v0.2:7:ggufv2:Q2_K disease 4.20708 45 0 0.0934906 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K disease 4.14674 45 0 0.0921497 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M disease 4.02927 45 0 0.0895392 5
mistral-instruct-v0.2:7:ggufv2:Q6_K disease 4.01581 45 0 0.0892402 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 disease 3.47244 45 0 0.0771654 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M disease 3.04532 45 0 0.0676737 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M disease 2.92854 45 0 0.0650787 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M disease 2.65437 45 0 0.0589859 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 disease 2.57657 45 0 0.057257 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M disease 2.44785 45 0 0.0543966 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M disease 2.29171 45 0 0.0509269 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K disease 2.29094 45 0 0.0509099 5
llama-3-instruct:8:ggufv2:Q8_0 disease 1.73452 45 0 0.0385449 5
llama-3-instruct:8:ggufv2:Q6_K disease 1.73452 45 0 0.0385449 5
llama-3-instruct:8:ggufv2:Q5_K_M disease 1.73452 45 0 0.0385449 5
llama-2-chat:13:ggufv2:Q6_K disease 0.827524 27 0 0.030649 3
code-llama-instruct:7:ggufv2:Q4_K_M disease 1.33093 45 0 0.0295762 5
chatglm3:6:ggmlv3:q4_0 disease 1.21669 45 0 0.0270376 5
llama-3-instruct:8:ggufv2:Q4_K_M disease 0.995894 45 0 0.022131 5
llama-2-chat:7:ggufv2:Q8_0 disease 0.444887 27 2.36066e-19 0.0164773 3
llama-2-chat:7:ggufv2:Q6_K disease 0.439254 27 0 0.0162687 3
llama-2-chat:13:ggufv2:Q5_K_M disease 0.306386 45 0 0.00680858 5
llama-2-chat:13:ggufv2:Q8_0 disease 0.26663 45 0 0.00592511 5
llama-2-chat:13:ggufv2:Q4_K_M disease 0.250053 45 0 0.00555673 5
llama-2-chat:70:ggufv2:Q5_K_M disease 0.235648 45 0 0.00523663 5
llama-2-chat:7:ggufv2:Q3_K_M disease 0.185035 45 0 0.0041119 5
llama-2-chat:70:ggufv2:Q2_K disease 0.182046 45 0 0.00404548 5
llama-2-chat:70:ggufv2:Q4_K_M disease 0.179398 45 0 0.00398663 5
llama-2-chat:7:ggufv2:Q5_K_M disease 0.150208 45 0 0.00333795 5
llama-2-chat:70:ggufv2:Q3_K_M disease 0.142957 45 0 0.00317683 5
llama-2-chat:13:ggufv2:Q3_K_M disease 0.103277 45 0 0.00229505 5
llama-2-chat:7:ggufv2:Q4_K_M disease 0.0898052 45 0 0.00199567 5
llama-2-chat:13:ggufv2:Q2_K disease 0.0874203 45 0 0.00194267 5
llama-2-chat:7:ggufv2:Q2_K disease 0.0587138 45 0 0.00130475 5

Full model name Subtask Score achieved Score possible Score SD Accuracy Iterations
gpt-4o-2024-08-06 entity 17.9286 27 0 0.664021 3
llama-3.1-instruct:70:ggufv2:Q3_K_S entity 17.8096 27 1.51082e-17 0.659615 3
llama-3.1-instruct:70:ggufv2:IQ4_XS entity 17.6096 27 7.55411e-18 0.652208 3
claude-3-opus-20240229 entity 16.325 27 4.53247e-17 0.60463 3
llama-3.1-instruct:70:ggufv2:IQ2_M entity 16.225 27 1.51082e-17 0.600926 3
claude-3-5-sonnet-20240620 entity 15.625 27 1.51082e-17 0.578704 3
gpt-4o-mini-2024-07-18 entity 24.5545 45 0.0661682 0.545656 5
gpt-4-turbo-2024-04-09 entity 19.2455 45 0.0562119 0.427677 5
llama-3.1-instruct:8:ggufv2:Q8_0 entity 8.02772 27 0 0.297323 3
llama-3.1-instruct:8:ggufv2:Q6_K entity 7.84022 27 0 0.290379 3
llama-3.1-instruct:8:ggufv2:IQ4_XS entity 6.80681 27 1.51082e-17 0.252104 3
llama-3.1-instruct:8:ggufv2:Q3_K_L entity 6.16922 27 1.69967e-17 0.22849 3
llama-3.1-instruct:8:ggufv2:Q4_K_M entity 5.96677 27 7.55411e-18 0.220991 3
llama-3.1-instruct:8:ggufv2:Q5_K_M entity 5.58394 27 0 0.206813 3
gpt-4o-2024-05-13 entity 5.9909 45 0 0.133131 5
gpt-4-0125-preview entity 4.59502 45 0 0.102112 5
gpt-3.5-turbo-0613 entity 4.57972 45 0 0.101772 5
openhermes-2.5:7:ggufv2:Q4_K_M entity 4.22461 45 0 0.0938803 5
openhermes-2.5:7:ggufv2:Q8_0 entity 4.1344 45 0 0.0918755 5
gpt-4-0613 entity 4.12852 45 0 0.0917448 5
openhermes-2.5:7:ggufv2:Q6_K entity 4.09333 45 0 0.0909629 5
openhermes-2.5:7:ggufv2:Q5_K_M entity 4.02016 45 0 0.0893369 5
gpt-3.5-turbo-0125 entity 3.71195 45 0 0.0824877 5
openhermes-2.5:7:ggufv2:Q3_K_M entity 3.65819 45 0 0.0812932 5
llama-2-chat:13:ggufv2:Q6_K entity 2.14189 27 0 0.0793293 3
llama-2-chat:7:ggufv2:Q6_K entity 2.07106 27 9.44264e-19 0.0767059 3
llama-2-chat:7:ggufv2:Q8_0 entity 1.79733 27 0 0.0665678 3
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M entity 2.42313 45 0 0.0538473 5
openhermes-2.5:7:ggufv2:Q2_K entity 2.33413 45 0 0.0518696 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M entity 2.30597 45 0 0.0512437 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M entity 2.20283 45 0 0.0489518 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K entity 2.10077 45 0 0.0466838 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M entity 2.0607 45 0 0.0457934 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 entity 2.00802 45 0 0.0446226 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M entity 1.99809 45 0 0.044402 5
mistral-instruct-v0.2:7:ggufv2:Q6_K entity 1.99214 45 0 0.0442699 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 entity 1.79999 45 0 0.0399998 5
mistral-instruct-v0.2:7:ggufv2:Q2_K entity 1.77563 45 0 0.0394584 5
chatglm3:6:ggmlv3:q4_0 entity 1.22227 45 0 0.0271617 5
llama-2-chat:70:ggufv2:Q3_K_M entity 1.20851 45 0 0.0268558 5
llama-2-chat:70:ggufv2:Q2_K entity 1.16189 45 0 0.0258197 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M entity 1.10007 45 0 0.0244461 5
llama-2-chat:70:ggufv2:Q4_K_M entity 1.01555 45 0 0.0225677 5
code-llama-instruct:7:ggufv2:Q4_K_M entity 0.948961 45 0 0.021088 5
llama-2-chat:70:ggufv2:Q5_K_M entity 0.903324 45 0 0.0200739 5
llama-2-chat:13:ggufv2:Q2_K entity 0.807379 45 0 0.0179418 5
llama-2-chat:13:ggufv2:Q4_K_M entity 0.785233 45 0 0.0174496 5
llama-3-instruct:8:ggufv2:Q5_K_M entity 0.75253 45 0 0.0167229 5
llama-3-instruct:8:ggufv2:Q6_K entity 0.749495 45 0 0.0166554 5
llama-2-chat:7:ggufv2:Q3_K_M entity 0.699988 45 0 0.0155553 5
llama-3-instruct:8:ggufv2:Q8_0 entity 0.695524 45 0 0.0154561 5
llama-3-instruct:8:ggufv2:Q4_K_M entity 0.694377 45 0 0.0154306 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K entity 0.685368 45 0 0.0152304 5
llama-2-chat:7:ggufv2:Q4_K_M entity 0.685027 45 0 0.0152228 5
llama-2-chat:13:ggufv2:Q8_0 entity 0.629764 45 0 0.0139947 5
llama-2-chat:7:ggufv2:Q5_K_M entity 0.623851 45 0 0.0138634 5
llama-2-chat:13:ggufv2:Q5_K_M entity 0.623813 45 0 0.0138625 5
llama-2-chat:13:ggufv2:Q3_K_M entity 0.56502 45 0 0.012556 5
llama-2-chat:7:ggufv2:Q2_K entity 0.318196 45 0 0.00707101 5

Full model name Subtask Score achieved Score possible Score SD Accuracy Iterations
claude-3-opus-20240229 experiment_yes_or_no 27 27 0 1 3
gpt-4-turbo-2024-04-09 experiment_yes_or_no 45 45 0 1 5
llama-3.1-instruct:70:ggufv2:IQ4_XS experiment_yes_or_no 27 27 0 1 3
llama-3.1-instruct:70:ggufv2:Q3_K_S experiment_yes_or_no 27 27 0 1 3
claude-3-5-sonnet-20240620 experiment_yes_or_no 27 27 0 1 3
gpt-4o-2024-08-06 experiment_yes_or_no 25 27 0 0.925926 3
gpt-4o-mini-2024-07-18 experiment_yes_or_no 40 45 0 0.888889 5
llama-3.1-instruct:70:ggufv2:IQ2_M experiment_yes_or_no 22 27 0 0.814815 3
llama-3.1-instruct:8:ggufv2:Q6_K experiment_yes_or_no 18.0146 27 0 0.667206 3
llama-3.1-instruct:8:ggufv2:Q4_K_M experiment_yes_or_no 18.006 27 0 0.666888 3
llama-3.1-instruct:8:ggufv2:IQ4_XS experiment_yes_or_no 18.0059 27 0 0.666887 3
llama-3.1-instruct:8:ggufv2:Q8_0 experiment_yes_or_no 18.0059 27 0 0.666886 3
llama-3.1-instruct:8:ggufv2:Q5_K_M experiment_yes_or_no 18 27 0 0.666667 3
llama-3.1-instruct:8:ggufv2:Q3_K_L experiment_yes_or_no 18 27 0 0.666667 3
openhermes-2.5:7:ggufv2:Q2_K experiment_yes_or_no 9 45 0 0.2 5
gpt-4-0125-preview experiment_yes_or_no 9 45 0 0.2 5
llama-2-chat:70:ggufv2:Q4_K_M experiment_yes_or_no 9 45 0 0.2 5
chatglm3:6:ggmlv3:q4_0 experiment_yes_or_no 8.6 45 0 0.191111 5
openhermes-2.5:7:ggufv2:Q5_K_M experiment_yes_or_no 8.33333 45 0 0.185185 5
openhermes-2.5:7:ggufv2:Q6_K experiment_yes_or_no 8.33333 45 0 0.185185 5
openhermes-2.5:7:ggufv2:Q4_K_M experiment_yes_or_no 8.33333 45 0 0.185185 5
llama-2-chat:70:ggufv2:Q5_K_M experiment_yes_or_no 8.025 45 0 0.178333 5
openhermes-2.5:7:ggufv2:Q8_0 experiment_yes_or_no 8 45 0 0.177778 5
gpt-3.5-turbo-0613 experiment_yes_or_no 8 45 0 0.177778 5
gpt-4-0613 experiment_yes_or_no 8 45 0 0.177778 5
openhermes-2.5:7:ggufv2:Q3_K_M experiment_yes_or_no 8 45 0 0.177778 5
gpt-4o-2024-05-13 experiment_yes_or_no 8 45 0 0.177778 5
llama-2-chat:7:ggufv2:Q8_0 experiment_yes_or_no 4.67535 27 0 0.173161 3
llama-2-chat:70:ggufv2:Q2_K experiment_yes_or_no 7.05061 45 0 0.15668 5
llama-2-chat:70:ggufv2:Q3_K_M experiment_yes_or_no 6.07336 45 0 0.134964 5
gpt-3.5-turbo-0125 experiment_yes_or_no 6.03333 45 0 0.134074 5
llama-2-chat:13:ggufv2:Q6_K experiment_yes_or_no 3.25916 27 9.44264e-19 0.12071 3
mistral-instruct-v0.2:7:ggufv2:Q3_K_M experiment_yes_or_no 5.23564 45 0 0.116348 5
llama-2-chat:13:ggufv2:Q3_K_M experiment_yes_or_no 5.16593 45 0 0.114799 5
llama-3-instruct:8:ggufv2:Q8_0 experiment_yes_or_no 3.7 45 0 0.0822222 5
llama-3-instruct:8:ggufv2:Q5_K_M experiment_yes_or_no 3.68182 45 0 0.0818182 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 experiment_yes_or_no 3.32028 45 0 0.073784 5
llama-2-chat:7:ggufv2:Q6_K experiment_yes_or_no 1.97565 27 7.08198e-19 0.0731722 3
mistral-instruct-v0.2:7:ggufv2:Q5_K_M experiment_yes_or_no 3.26963 45 0 0.0726584 5
code-llama-instruct:7:ggufv2:Q4_K_M experiment_yes_or_no 3.0913 45 0 0.0686956 5
llama-3-instruct:8:ggufv2:Q6_K experiment_yes_or_no 2.36364 45 0 0.0525253 5
mistral-instruct-v0.2:7:ggufv2:Q6_K experiment_yes_or_no 2.36015 45 0 0.0524479 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M experiment_yes_or_no 2.2851 45 0 0.05078 5
mistral-instruct-v0.2:7:ggufv2:Q2_K experiment_yes_or_no 2.2802 45 0 0.0506711 5
llama-2-chat:7:ggufv2:Q4_K_M experiment_yes_or_no 2.06817 45 0 0.0459593 5
llama-3-instruct:8:ggufv2:Q4_K_M experiment_yes_or_no 1.89935 45 0 0.0422078 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M experiment_yes_or_no 1.45686 45 0 0.0323746 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M experiment_yes_or_no 1.29991 45 0 0.0288868 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M experiment_yes_or_no 1.1661 45 0 0.0259134 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K experiment_yes_or_no 1.15184 45 0 0.0255965 5
llama-2-chat:13:ggufv2:Q8_0 experiment_yes_or_no 1.06643 45 0 0.0236984 5
llama-2-chat:13:ggufv2:Q5_K_M experiment_yes_or_no 1.03147 45 0 0.0229215 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K experiment_yes_or_no 0.785587 45 0 0.0174575 5
llama-2-chat:7:ggufv2:Q3_K_M experiment_yes_or_no 0.726745 45 0 0.0161499 5
llama-2-chat:7:ggufv2:Q5_K_M experiment_yes_or_no 0.618798 45 0 0.0137511 5
llama-2-chat:13:ggufv2:Q4_K_M experiment_yes_or_no 0.468722 45 0 0.010416 5
llama-2-chat:13:ggufv2:Q2_K experiment_yes_or_no 0.267272 45 0 0.00593938 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 experiment_yes_or_no 0.201489 45 0 0.00447753 5
llama-2-chat:7:ggufv2:Q2_K experiment_yes_or_no 0.130285 45 0 0.00289522 5

Full model name Subtask Score achieved Score possible Score SD Accuracy Iterations
gpt-4o-mini-2024-07-18 hypothesis 17.2242 45 0.0552888 0.382759 5
gpt-4-turbo-2024-04-09 hypothesis 16.3215 45 0.154249 0.3627 5
gpt-4o-2024-08-06 hypothesis 8.73977 27 0.0621005 0.323695 3
llama-3.1-instruct:70:ggufv2:IQ4_XS hypothesis 6.99218 27 0 0.25897 3
claude-3-opus-20240229 hypothesis 6.74202 27 0.0355843 0.249704 3
llama-3.1-instruct:70:ggufv2:Q3_K_S hypothesis 5.51964 27 4.72132e-18 0.204431 3
claude-3-5-sonnet-20240620 hypothesis 4.76919 27 0.000375971 0.176637 3
llama-3.1-instruct:8:ggufv2:IQ4_XS hypothesis 4.4134 27 0 0.163459 3
llama-3.1-instruct:70:ggufv2:IQ2_M hypothesis 4.27846 27 0 0.158462 3
llama-3.1-instruct:8:ggufv2:Q3_K_L hypothesis 4.14868 27 0 0.153655 3
llama-3.1-instruct:8:ggufv2:Q5_K_M hypothesis 3.48866 27 3.55023e-05 0.129209 3
llama-2-chat:7:ggufv2:Q8_0 hypothesis 2.854 27 3.77706e-18 0.105704 3
llama-3.1-instruct:8:ggufv2:Q4_K_M hypothesis 2.74519 27 1.88853e-18 0.101674 3
llama-3.1-instruct:8:ggufv2:Q8_0 hypothesis 2.70116 27 0 0.100043 3
llama-3.1-instruct:8:ggufv2:Q6_K hypothesis 2.64133 27 9.44264e-19 0.097827 3
mistral-instruct-v0.2:7:ggufv2:Q4_K_M hypothesis 3.67339 45 0 0.0816309 5
llama-2-chat:7:ggufv2:Q6_K hypothesis 2.01944 27 5.19345e-18 0.074794 3
mistral-instruct-v0.2:7:ggufv2:Q6_K hypothesis 3.33681 45 0 0.0741512 5
gpt-4-0613 hypothesis 3.29696 45 0 0.0732657 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 hypothesis 2.9272 45 0 0.0650489 5
gpt-4o-2024-05-13 hypothesis 2.89512 45 0 0.064336 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M hypothesis 2.75585 45 0 0.0612411 5
gpt-3.5-turbo-0125 hypothesis 2.72775 45 0 0.0606168 5
llama-2-chat:13:ggufv2:Q6_K hypothesis 1.61253 27 1.88853e-18 0.0597233 3
gpt-3.5-turbo-0613 hypothesis 2.64497 45 0 0.0587771 5
openhermes-2.5:7:ggufv2:Q4_K_M hypothesis 2.57382 45 0 0.0571961 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M hypothesis 2.47292 45 0 0.0549539 5
openhermes-2.5:7:ggufv2:Q8_0 hypothesis 2.37196 45 0 0.0527103 5
gpt-4-0125-preview hypothesis 2.33518 45 0 0.051893 5
openhermes-2.5:7:ggufv2:Q6_K hypothesis 2.29085 45 0 0.0509077 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M hypothesis 2.23255 45 0 0.0496122 5
openhermes-2.5:7:ggufv2:Q3_K_M hypothesis 2.09626 45 0 0.0465835 5
mistral-instruct-v0.2:7:ggufv2:Q2_K hypothesis 2.05375 45 0 0.045639 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M hypothesis 1.87442 45 0 0.0416537 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 hypothesis 1.83735 45 0 0.04083 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M hypothesis 1.71557 45 0 0.0381237 5
openhermes-2.5:7:ggufv2:Q5_K_M hypothesis 1.52181 45 0 0.033818 5
openhermes-2.5:7:ggufv2:Q2_K hypothesis 1.4915 45 0 0.0331444 5
llama-2-chat:70:ggufv2:Q3_K_M hypothesis 1.44143 45 0 0.0320317 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K hypothesis 1.44009 45 0 0.032002 5
llama-2-chat:70:ggufv2:Q2_K hypothesis 1.4389 45 0 0.0319755 5
llama-2-chat:70:ggufv2:Q4_K_M hypothesis 1.41421 45 0 0.0314268 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K hypothesis 1.39565 45 0 0.0310144 5
llama-3-instruct:8:ggufv2:Q4_K_M hypothesis 1.13596 45 0 0.0252436 5
chatglm3:6:ggmlv3:q4_0 hypothesis 0.98676 45 0 0.021928 5
llama-3-instruct:8:ggufv2:Q8_0 hypothesis 0.878406 45 0 0.0195201 5
llama-3-instruct:8:ggufv2:Q6_K hypothesis 0.876219 45 0 0.0194715 5
llama-2-chat:7:ggufv2:Q5_K_M hypothesis 0.68638 45 0 0.0152529 5
llama-2-chat:70:ggufv2:Q5_K_M hypothesis 0.623758 45 0 0.0138613 5
llama-2-chat:7:ggufv2:Q4_K_M hypothesis 0.62053 45 0 0.0137896 5
llama-3-instruct:8:ggufv2:Q5_K_M hypothesis 0.604423 45 0 0.0134316 5
code-llama-instruct:7:ggufv2:Q4_K_M hypothesis 0.572369 45 0 0.0127193 5
llama-2-chat:13:ggufv2:Q8_0 hypothesis 0.55524 45 0 0.0123387 5
llama-2-chat:7:ggufv2:Q2_K hypothesis 0.520453 45 0 0.0115656 5
llama-2-chat:13:ggufv2:Q2_K hypothesis 0.49279 45 0 0.0109509 5
llama-2-chat:13:ggufv2:Q3_K_M hypothesis 0.424638 45 0 0.00943639 5
llama-2-chat:13:ggufv2:Q5_K_M hypothesis 0.408017 45 0 0.00906704 5
llama-2-chat:7:ggufv2:Q3_K_M hypothesis 0.402337 45 0 0.00894082 5
llama-2-chat:13:ggufv2:Q4_K_M hypothesis 0.366299 45 0 0.00813997 5
Full model name Subtask Score achieved Score possible Score SD Accuracy Iterations
llama-3.1-instruct:70:ggufv2:Q3_K_S intervention 21.7143 27 0 0.804233 3
llama-3.1-instruct:70:ggufv2:IQ4_XS intervention 20.4286 27 0 0.756614 3
llama-3.1-instruct:70:ggufv2:IQ2_M intervention 20.4286 27 0 0.756614 3
gpt-4o-mini-2024-07-18 intervention 30.0404 45 0.0608581 0.667565 5
gpt-4o-2024-08-06 intervention 17.3762 27 0.0687322 0.643563 3
llama-3.1-instruct:8:ggufv2:Q8_0 intervention 14.9465 27 0 0.553575 3
claude-3-opus-20240229 intervention 14.8366 27 0 0.549502 3
claude-3-5-sonnet-20240620 intervention 14.533 27 0.00814604 0.538261 3
gpt-4-turbo-2024-04-09 intervention 23.5746 45 0.0579382 0.52388 5
llama-3.1-instruct:8:ggufv2:Q5_K_M intervention 13.2322 27 0 0.490083 3
llama-3.1-instruct:8:ggufv2:IQ4_XS intervention 12.6108 27 0 0.467065 3
llama-3.1-instruct:8:ggufv2:Q6_K intervention 11.4946 27 0 0.425726 3
llama-3.1-instruct:8:ggufv2:Q4_K_M intervention 11.4467 27 0 0.423953 3
llama-3.1-instruct:8:ggufv2:Q3_K_L intervention 7.23948 27 1.88853e-18 0.268129 3
gpt-4o-2024-05-13 intervention 5.34631 45 0 0.118807 5
openhermes-2.5:7:ggufv2:Q4_K_M intervention 4.9841 45 0 0.110758 5
gpt-4-0125-preview intervention 4.92171 45 0 0.109371 5
gpt-4-0613 intervention 4.72253 45 0 0.104945 5
openhermes-2.5:7:ggufv2:Q6_K intervention 4.71449 45 0 0.104767 5
openhermes-2.5:7:ggufv2:Q8_0 intervention 4.44465 45 0 0.09877 5
gpt-3.5-turbo-0613 intervention 4.27143 45 0 0.0949206 5
openhermes-2.5:7:ggufv2:Q5_K_M intervention 4.00021 45 0 0.0888935 5
gpt-3.5-turbo-0125 intervention 3.75141 45 0 0.0833647 5
openhermes-2.5:7:ggufv2:Q3_K_M intervention 3.55238 45 0 0.0789418 5
openhermes-2.5:7:ggufv2:Q2_K intervention 2.92766 45 0 0.0650591 5
llama-2-chat:7:ggufv2:Q8_0 intervention 1.56417 27 0 0.0579323 3
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K intervention 2.23683 45 0 0.0497073 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M intervention 2.23319 45 0 0.0496264 5
llama-2-chat:13:ggufv2:Q6_K intervention 1.13241 27 0.000274145 0.0419412 3
mistral-instruct-v0.2:7:ggufv2:Q3_K_M intervention 1.66677 45 0 0.0370393 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M intervention 1.23412 45 0 0.0274249 5
code-llama-instruct:7:ggufv2:Q4_K_M intervention 1.17173 45 0 0.0260384 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M intervention 1.15754 45 0 0.025723 5
llama-2-chat:13:ggufv2:Q4_K_M intervention 1.02157 45 0 0.0227015 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 intervention 0.987919 45 0 0.0219538 5
chatglm3:6:ggmlv3:q4_0 intervention 0.881806 45 0 0.0195957 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 intervention 0.879646 45 0 0.0195477 5
llama-2-chat:7:ggufv2:Q6_K intervention 0.514286 27 3.77706e-18 0.0190476 3
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M intervention 0.723791 45 0 0.0160842 5
mistral-instruct-v0.2:7:ggufv2:Q2_K intervention 0.680182 45 0 0.0151152 5
llama-2-chat:70:ggufv2:Q2_K intervention 0.668995 45 0 0.0148666 5
mistral-instruct-v0.2:7:ggufv2:Q6_K intervention 0.640258 45 0 0.0142279 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M intervention 0.550643 45 0 0.0122365 5
llama-2-chat:70:ggufv2:Q5_K_M intervention 0.542302 45 0 0.0120512 5
llama-2-chat:13:ggufv2:Q2_K intervention 0.502722 45 0 0.0111716 5
llama-2-chat:70:ggufv2:Q4_K_M intervention 0.417501 45 0 0.00927779 5
llama-2-chat:7:ggufv2:Q3_K_M intervention 0.416756 45 0 0.00926124 5
llama-3-instruct:8:ggufv2:Q5_K_M intervention 0.410888 45 0 0.00913085 5
llama-2-chat:70:ggufv2:Q3_K_M intervention 0.402319 45 0 0.00894042 5
llama-3-instruct:8:ggufv2:Q4_K_M intervention 0.37923 45 0 0.00842733 5
llama-2-chat:13:ggufv2:Q5_K_M intervention 0.339683 45 0 0.0075485 5
llama-3-instruct:8:ggufv2:Q6_K intervention 0.327257 45 0 0.00727237 5
llama-3-instruct:8:ggufv2:Q8_0 intervention 0.319187 45 0 0.00709304 5
llama-2-chat:13:ggufv2:Q3_K_M intervention 0.265476 45 0 0.00589947 5
llama-2-chat:7:ggufv2:Q5_K_M intervention 0.24986 45 0 0.00555244 5
llama-2-chat:13:ggufv2:Q8_0 intervention 0.244444 45 0 0.0054321 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K intervention 0.2273 45 0 0.0050511 5
llama-2-chat:7:ggufv2:Q2_K intervention 0.118691 45 0 0.00263758 5
llama-2-chat:7:ggufv2:Q4_K_M intervention 0.0769231 45 0 0.0017094 5
Full model name Subtask Score achieved Score possible Score SD Accuracy Iterations
claude-3-5-sonnet-20240620 ncbi_link 24 27 0 0.888889 3
claude-3-opus-20240229 ncbi_link 20.88 27 1.51082e-17 0.773333 3
gpt-4o-2024-08-06 ncbi_link 15.6389 27 0.00890973 0.579218 3
llama-3.1-instruct:70:ggufv2:IQ4_XS ncbi_link 15.3 27 0 0.566667 3
gpt-4-turbo-2024-04-09 ncbi_link 23.3333 45 0.132508 0.518519 5
llama-3.1-instruct:70:ggufv2:IQ2_M ncbi_link 13.5 27 0 0.5 3
gpt-4o-mini-2024-07-18 ncbi_link 16.8619 45 0.00496904 0.374709 5
llama-3.1-instruct:70:ggufv2:Q3_K_S ncbi_link 9.5 27 0 0.351852 3
gpt-4-0125-preview ncbi_link 6.48768 45 0 0.144171 5
llama-3.1-instruct:8:ggufv2:IQ4_XS ncbi_link 3.86536 27 0 0.143161 3
gpt-4-0613 ncbi_link 6.05933 45 0 0.134652 5
llama-3.1-instruct:8:ggufv2:Q8_0 ncbi_link 2.32755 27 7.55411e-18 0.0862055 3
openhermes-2.5:7:ggufv2:Q6_K ncbi_link 3.5303 45 0 0.0784512 5
openhermes-2.5:7:ggufv2:Q8_0 ncbi_link 3.5303 45 0 0.0784512 5
gpt-4o-2024-05-13 ncbi_link 3.51302 45 0 0.078067 5
openhermes-2.5:7:ggufv2:Q5_K_M ncbi_link 3.47436 45 0 0.077208 5
llama-3.1-instruct:8:ggufv2:Q4_K_M ncbi_link 1.90181 27 7.91155e-07 0.0704374 3
openhermes-2.5:7:ggufv2:Q4_K_M ncbi_link 3.11111 45 0 0.0691358 5
llama-3.1-instruct:8:ggufv2:Q5_K_M ncbi_link 1.59127 27 0 0.0589359 3
openhermes-2.5:7:ggufv2:Q3_K_M ncbi_link 2.37436 45 0 0.0527635 5
llama-3.1-instruct:8:ggufv2:Q6_K ncbi_link 1.31 27 0 0.0485184 3
gpt-3.5-turbo-0613 ncbi_link 2.16667 45 0 0.0481481 5
gpt-3.5-turbo-0125 ncbi_link 1.42925 45 0 0.031761 5
llama-2-chat:13:ggufv2:Q6_K ncbi_link 0.690904 27 0 0.0255891 3
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M ncbi_link 1.03429 45 0 0.0229841 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M ncbi_link 0.884957 45 0 0.0196657 5
mistral-instruct-v0.2:7:ggufv2:Q2_K ncbi_link 0.881705 45 0 0.0195934 5
llama-2-chat:7:ggufv2:Q6_K ncbi_link 0.507313 27 9.44264e-19 0.0187894 3
llama-3.1-instruct:8:ggufv2:Q3_K_L ncbi_link 0.486291 27 2.95082e-19 0.0180108 3
mistral-instruct-v0.2:7:ggufv2:Q5_K_M ncbi_link 0.710989 45 0 0.0157998 5
llama-2-chat:7:ggufv2:Q8_0 ncbi_link 0.410766 27 9.44264e-19 0.0152135 3
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K ncbi_link 0.656812 45 0 0.0145958 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M ncbi_link 0.615714 45 0 0.0136825 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 ncbi_link 0.596131 45 0 0.0132474 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M ncbi_link 0.574422 45 0 0.0127649 5
mistral-instruct-v0.2:7:ggufv2:Q6_K ncbi_link 0.558824 45 0 0.0124183 5
openhermes-2.5:7:ggufv2:Q2_K ncbi_link 0.505458 45 0 0.0112324 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 ncbi_link 0.429927 45 0 0.00955394 5
code-llama-instruct:7:ggufv2:Q4_K_M ncbi_link 0.328564 45 0 0.00730142 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K ncbi_link 0.271548 45 0 0.0060344 5
llama-2-chat:13:ggufv2:Q8_0 ncbi_link 0.255217 45 0 0.00567148 5
llama-2-chat:70:ggufv2:Q2_K ncbi_link 0.253735 45 0 0.00563856 5
llama-2-chat:13:ggufv2:Q4_K_M ncbi_link 0.246231 45 0 0.00547179 5
llama-2-chat:70:ggufv2:Q4_K_M ncbi_link 0.241357 45 0 0.00536348 5
llama-2-chat:13:ggufv2:Q5_K_M ncbi_link 0.236802 45 0 0.00526226 5
llama-3-instruct:8:ggufv2:Q4_K_M ncbi_link 0.233815 45 0 0.00519589 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M ncbi_link 0.230909 45 0 0.00513131 5
llama-2-chat:7:ggufv2:Q4_K_M ncbi_link 0.216341 45 0 0.00480757 5
llama-2-chat:70:ggufv2:Q5_K_M ncbi_link 0.196981 45 0 0.00437735 5
llama-2-chat:13:ggufv2:Q2_K ncbi_link 0.192574 45 0 0.00427942 5
llama-3-instruct:8:ggufv2:Q8_0 ncbi_link 0.179211 45 0 0.00398247 5
llama-2-chat:7:ggufv2:Q3_K_M ncbi_link 0.177339 45 0 0.00394087 5
llama-3-instruct:8:ggufv2:Q6_K ncbi_link 0.173014 45 0 0.00384476 5
llama-2-chat:7:ggufv2:Q5_K_M ncbi_link 0.170952 45 0 0.00379894 5
llama-2-chat:70:ggufv2:Q3_K_M ncbi_link 0.166777 45 0 0.00370615 5
llama-3-instruct:8:ggufv2:Q5_K_M ncbi_link 0.166614 45 0 0.00370254 5
llama-2-chat:7:ggufv2:Q2_K ncbi_link 0.15271 45 0 0.00339354 5
llama-2-chat:13:ggufv2:Q3_K_M ncbi_link 0.150011 45 0 0.00333359 5
chatglm3:6:ggmlv3:q4_0 ncbi_link 0.122857 45 0 0.00273017 5
Full model name Subtask Score achieved Score possible Score SD Accuracy Iterations
claude-3-5-sonnet-20240620 significance 21.6 27 6.04329e-17 0.8 3
gpt-4o-mini-2024-07-18 significance 36 45 0 0.8 5
gpt-4o-2024-08-06 significance 17.8444 27 0.00285111 0.660905 3
llama-3.1-instruct:70:ggufv2:Q3_K_S significance 15.6 27 6.04329e-17 0.577778 3
claude-3-opus-20240229 significance 13.7935 27 0.013441 0.51087 3
llama-3.1-instruct:70:ggufv2:IQ4_XS significance 13.6 27 6.7987e-17 0.503704 3
gpt-4-turbo-2024-04-09 significance 22.6141 45 0.0472812 0.502536 5
llama-3.1-instruct:70:ggufv2:IQ2_M significance 13.0286 27 6.04329e-17 0.48254 3
llama-3.1-instruct:8:ggufv2:Q6_K significance 6.06508 27 1.51082e-17 0.224633 3
llama-3.1-instruct:8:ggufv2:IQ4_XS significance 6.00361 27 7.55411e-18 0.222356 3
llama-3.1-instruct:8:ggufv2:Q5_K_M significance 5.21313 27 1.51082e-17 0.193079 3
llama-3.1-instruct:8:ggufv2:Q8_0 significance 5.0798 27 2.26623e-17 0.188141 3
llama-3.1-instruct:8:ggufv2:Q4_K_M significance 4.94646 27 3.02164e-17 0.183202 3
llama-3.1-instruct:8:ggufv2:Q3_K_L significance 4.31237 27 2.36066e-17 0.159717 3
gpt-4-0613 significance 5.6 45 0 0.124444 5
gpt-4-0125-preview significance 5.18384 45 0 0.115196 5
gpt-4o-2024-05-13 significance 4.22424 45 0 0.0938721 5
openhermes-2.5:7:ggufv2:Q4_K_M significance 3.92996 45 0 0.0873325 5
openhermes-2.5:7:ggufv2:Q8_0 significance 3.78182 45 0 0.0840404 5
openhermes-2.5:7:ggufv2:Q6_K significance 3.78182 45 0 0.0840404 5
openhermes-2.5:7:ggufv2:Q5_K_M significance 3.77787 45 0 0.0839526 5
openhermes-2.5:7:ggufv2:Q3_K_M significance 3.69091 45 0 0.0820202 5
gpt-3.5-turbo-0613 significance 3.58562 45 0 0.0796804 5
gpt-3.5-turbo-0125 significance 3.51717 45 0 0.0781594 5
mistral-instruct-v0.2:7:ggufv2:Q4_K_M significance 2.93833 45 0 0.0652963 5
mistral-instruct-v0.2:7:ggufv2:Q6_K significance 2.87928 45 0 0.063984 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M significance 2.79423 45 0 0.062094 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 significance 2.62296 45 0 0.0582881 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M significance 2.56724 45 0 0.0570498 5
openhermes-2.5:7:ggufv2:Q2_K significance 2.48514 45 0 0.0552254 5
mistral-instruct-v0.2:7:ggufv2:Q2_K significance 2.4813 45 0 0.05514 5
llama-2-chat:7:ggufv2:Q8_0 significance 1.10159 27 0 0.0407996 3
llama-2-chat:13:ggufv2:Q6_K significance 1.07015 27 1.0623e-18 0.0396352 3
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 significance 1.50696 45 0 0.033488 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K significance 1.34869 45 0 0.0299709 5
llama-2-chat:7:ggufv2:Q6_K significance 0.806474 27 0 0.0298694 3
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M significance 1.31454 45 0 0.0292119 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M significance 1.2312 45 0 0.0273599 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M significance 1.01129 45 0 0.0224731 5
llama-3-instruct:8:ggufv2:Q6_K significance 0.994971 45 0 0.0221105 5
llama-3-instruct:8:ggufv2:Q8_0 significance 0.957259 45 0 0.0212724 5
llama-2-chat:70:ggufv2:Q3_K_M significance 0.758379 45 0 0.0168529 5
llama-2-chat:70:ggufv2:Q2_K significance 0.716547 45 0 0.0159233 5
llama-2-chat:70:ggufv2:Q4_K_M significance 0.68386 45 0 0.0151969 5
llama-3-instruct:8:ggufv2:Q5_K_M significance 0.636128 45 0 0.0141362 5
llama-2-chat:70:ggufv2:Q5_K_M significance 0.518572 45 0 0.0115238 5
llama-2-chat:7:ggufv2:Q4_K_M significance 0.329457 45 0 0.00732127 5
llama-2-chat:13:ggufv2:Q8_0 significance 0.326026 45 0 0.00724502 5
llama-2-chat:7:ggufv2:Q5_K_M significance 0.281188 45 0 0.00624862 5
llama-3-instruct:8:ggufv2:Q4_K_M significance 0.228461 45 0 0.00507691 5
llama-2-chat:13:ggufv2:Q4_K_M significance 0.213246 45 0 0.0047388 5
llama-2-chat:13:ggufv2:Q2_K significance 0.207957 45 0 0.00462127 5
llama-2-chat:13:ggufv2:Q5_K_M significance 0.205271 45 0 0.00456158 5
llama-2-chat:7:ggufv2:Q3_K_M significance 0.194946 45 0 0.00433214 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K significance 0.178078 45 0 0.00395728 5
llama-2-chat:13:ggufv2:Q3_K_M significance 0.131484 45 0 0.00292186 5
code-llama-instruct:7:ggufv2:Q4_K_M significance 0.123914 45 0 0.00275365 5
chatglm3:6:ggmlv3:q4_0 significance 0.118153 45 0 0.00262562 5
llama-2-chat:7:ggufv2:Q2_K significance 0.103278 45 0 0.00229507 5
Full model name Subtask Score achieved Score possible Score SD Accuracy Iterations
claude-3-5-sonnet-20240620 stats 27 27 0 1 3
claude-3-opus-20240229 stats 27 27 0 1 3
gpt-4-turbo-2024-04-09 stats 45 45 0 1 5
gpt-4o-2024-08-06 stats 26.5385 27 0 0.982906 3
gpt-4o-mini-2024-07-18 stats 42.5641 45 0 0.945869 5
llama-3.1-instruct:70:ggufv2:IQ4_XS stats 24 27 0 0.888889 3
llama-3.1-instruct:8:ggufv2:Q8_0 stats 18.6622 27 0 0.691194 3
llama-3.1-instruct:8:ggufv2:Q6_K stats 18.6622 27 0 0.691194 3
llama-3.1-instruct:8:ggufv2:IQ4_XS stats 18.3112 27 1.88853e-18 0.678192 3
llama-3.1-instruct:8:ggufv2:Q5_K_M stats 18.3058 27 0 0.677993 3
llama-3.1-instruct:70:ggufv2:Q3_K_S stats 18 27 0 0.666667 3
llama-3.1-instruct:70:ggufv2:IQ2_M stats 18 27 0 0.666667 3
llama-3.1-instruct:8:ggufv2:Q4_K_M stats 17.2171 27 0.033358 0.637672 3
llama-3.1-instruct:8:ggufv2:Q3_K_L stats 15.7067 27 0 0.581731 3
gpt-4-0125-preview stats 8.86667 45 0 0.197037 5
openhermes-2.5:7:ggufv2:Q8_0 stats 8.66667 45 0 0.192593 5
openhermes-2.5:7:ggufv2:Q6_K stats 8.66667 45 0 0.192593 5
openhermes-2.5:7:ggufv2:Q5_K_M stats 8.52821 45 0 0.189516 5
gpt-4-0613 stats 8.51282 45 0 0.189174 5
gpt-4o-2024-05-13 stats 8.51282 45 0 0.189174 5
openhermes-2.5:7:ggufv2:Q4_K_M stats 8.25641 45 0 0.183476 5
openhermes-2.5:7:ggufv2:Q3_K_M stats 8.05641 45 0 0.179031 5
openhermes-2.5:7:ggufv2:Q2_K stats 8 45 0 0.177778 5
gpt-3.5-turbo-0613 stats 7.98135 45 0 0.177363 5
gpt-3.5-turbo-0125 stats 7.12976 45 0 0.158439 5
mistral-instruct-v0.2:7:ggufv2:Q5_K_M stats 6.89091 45 0 0.153131 5
llama-2-chat:13:ggufv2:Q6_K stats 4.05299 27 2.26623e-17 0.150111 3
llama-2-chat:7:ggufv2:Q8_0 stats 3.87128 27 7.55411e-18 0.143381 3
mistral-instruct-v0.2:7:ggufv2:Q4_K_M stats 6.29908 45 0 0.13998 5
mistral-instruct-v0.2:7:ggufv2:Q6_K stats 6.18322 45 0 0.137405 5
llama-3-instruct:8:ggufv2:Q8_0 stats 5.17406 45 0 0.114979 5
mistral-instruct-v0.2:7:ggufv2:Q3_K_M stats 5.1041 45 0 0.113424 5
mistral-instruct-v0.2:7:ggufv2:Q8_0 stats 5.04591 45 0 0.112131 5
llama-2-chat:7:ggufv2:Q6_K stats 2.99816 27 7.55411e-18 0.111043 3
mistral-instruct-v0.2:7:ggufv2:Q2_K stats 4.63496 45 0 0.102999 5
llama-3-instruct:8:ggufv2:Q6_K stats 4.30739 45 0 0.0957198 5
llama-3-instruct:8:ggufv2:Q5_K_M stats 3.9346 45 0 0.0874356 5
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 stats 3.60737 45 0 0.0801638 5
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M stats 3.58841 45 0 0.0797425 5
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K stats 3.21213 45 0 0.0713807 5
llama-2-chat:70:ggufv2:Q4_K_M stats 3.08109 45 0 0.0684688 5
llama-3-instruct:8:ggufv2:Q4_K_M stats 2.98843 45 0 0.0664096 5
llama-2-chat:70:ggufv2:Q2_K stats 2.65216 45 0 0.0589368 5
llama-2-chat:70:ggufv2:Q3_K_M stats 2.44276 45 0 0.0542835 5
llama-2-chat:70:ggufv2:Q5_K_M stats 2.3993 45 0 0.0533177 5
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M stats 2.21549 45 0 0.049233 5
mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M stats 1.96241 45 0 0.0436091 5
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K stats 1.76057 45 0 0.0391237 5
llama-2-chat:7:ggufv2:Q4_K_M stats 1.43589 45 0 0.0319086 5
llama-2-chat:13:ggufv2:Q4_K_M stats 1.41695 45 0 0.0314878 5
llama-2-chat:13:ggufv2:Q8_0 stats 1.38608 45 0 0.0308019 5
llama-2-chat:7:ggufv2:Q5_K_M stats 1.3859 45 0 0.0307977 5
llama-2-chat:13:ggufv2:Q3_K_M stats 1.35834 45 0 0.0301854 5
llama-2-chat:13:ggufv2:Q5_K_M stats 1.3371 45 0 0.0297134 5
llama-2-chat:13:ggufv2:Q2_K stats 1.12439 45 0 0.0249865 5
code-llama-instruct:7:ggufv2:Q4_K_M stats 0.860471 45 0 0.0191216 5
llama-2-chat:7:ggufv2:Q3_K_M stats 0.804538 45 0 0.0178786 5
llama-2-chat:7:ggufv2:Q2_K stats 0.558031 45 0 0.0124007 5
chatglm3:6:ggmlv3:q4_0 stats 0.17925 45 0 0.00398332 5
Full model name Score achieved Score possible Score SD Accuracy Iterations
gpt-4-turbo-2024-04-09 99 100 0.447214 0.99 5
gpt-4o-mini-2024-07-18 98 100 0.547723 0.98 5
gpt-4o-2024-05-13 96 100 0.83666 0.96 5

[Figure: Stripplot, Extraction Subtask]

Medical Exam Question Answering

In this set of tasks, we test LLM abilities to answer medical exam questions.

Full model name Score achieved Score possible Score SD Accuracy Iterations
gpt-4o-2024-08-06 806 948 2.73205 0.850211 3
gpt-4o-mini-2024-07-18 1352 1608 1.44215 0.840796 5
gpt-4-turbo-2024-04-09 1366 1632 3.231 0.83701 5
llama-3.1-instruct:70:ggufv2:IQ4_XS 765 930 0 0.822581 3
claude-3-opus-20240229 754 936 2.88675 0.805556 3
llama-3.1-instruct:70:ggufv2:Q3_K_S 732 915 0 0.8 3
gpt-4-0125-preview 837 1077 4.04145 0.777159 3
claude-3-5-sonnet-20240620 759 981 0 0.7737 3
llama-3.1-instruct:70:ggufv2:IQ2_M 684 885 0 0.772881 3
llama-3.1-instruct:8:ggufv2:Q3_K_L 657 855 0 0.768421 3
llama-3.1-instruct:8:ggufv2:Q8_0 657 858 0 0.765734 3
gpt-4o-2024-05-13 820 1074 4.04145 0.763501 3
llama-3.1-instruct:8:ggufv2:IQ4_XS 654 864 0 0.756944 3
llama-3.1-instruct:8:ggufv2:Q6_K 645 858 0 0.751748 3
llama-3.1-instruct:8:ggufv2:Q5_K_M 636 849 0 0.749117 3
llama-3.1-instruct:8:ggufv2:Q4_K_M 621 837 0 0.741935 3
gpt-4-0613 785 1074 3.4641 0.730912 3
gpt-3.5-turbo-0125 721 1074 7.50555 0.671322 3
llama-3-instruct:8:ggufv2:Q8_0 690 1077 0 0.640669 3
llama-3-instruct:8:ggufv2:Q5_K_M 684 1077 0 0.635097 3
llama-3-instruct:8:ggufv2:Q4_K_M 673 1077 0.57735 0.624884 3
llama-3-instruct:8:ggufv2:Q6_K 672 1077 0 0.623955 3
openhermes-2.5:7:ggufv2:Q4_K_M 628 1071 1.73205 0.586368 3
openhermes-2.5:7:ggufv2:Q8_0 618 1071 0 0.577031 3
openhermes-2.5:7:ggufv2:Q6_K 615 1071 0 0.57423 3
openhermes-2.5:7:ggufv2:Q5_K_M 612 1071 0 0.571429 3
openhermes-2.5:7:ggufv2:Q3_K_M 604 1071 1.73205 0.563959 3
openhermes-2.5:7:ggufv2:Q2_K 579 1074 0 0.539106 3
llama-2-chat:13:ggufv2:Q8_0 462 1071 0 0.431373 3
llama-2-chat:13:ggufv2:Q5_K_M 462 1071 0 0.431373 3
llama-2-chat:13:ggufv2:Q4_K_M 459 1071 0 0.428571 3
llama-2-chat:13:ggufv2:Q6_K 459 1071 0 0.428571 3
llama-2-chat:13:ggufv2:Q3_K_M 459 1071 0 0.428571 3
chatglm3:6:ggmlv3:q4_0 457 1071 21.6162 0.426704 3
llama-2-chat:13:ggufv2:Q2_K 444 1071 0 0.414566 3
llama-2-chat:7:ggufv2:Q6_K 435 1071 0 0.406162 3
llama-2-chat:7:ggufv2:Q8_0 429 1071 0 0.40056 3
llama-2-chat:7:ggufv2:Q5_K_M 429 1071 0 0.40056 3
llama-2-chat:7:ggufv2:Q4_K_M 429 1071 0 0.40056 3
llama-2-chat:7:ggufv2:Q3_K_M 423 1071 0 0.394958 3
llama-2-chat:7:ggufv2:Q2_K 396 1071 0 0.369748 3
mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M 395 1071 0.57735 0.368814 3
mistral-instruct-v0.2:7:ggufv2:Q8_0 393 1071 0 0.366947 3
mistral-instruct-v0.2:7:ggufv2:Q6_K 393 1071 0 0.366947 3
mistral-instruct-v0.2:7:ggufv2:Q4_K_M 391 1071 0.57735 0.365079 3
mistral-instruct-v0.2:7:ggufv2:Q5_K_M 390 1071 0 0.364146 3
mistral-instruct-v0.2:7:ggufv2:Q3_K_M 386 1071 0.57735 0.360411 3
mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 384 1071 0 0.358543 3
mistral-instruct-v0.2:7:ggufv2:Q2_K 378 1071 0 0.352941 3
mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M 378 1071 0 0.352941 3
mixtral-instruct-v0.1:46_7:ggufv2:Q6_K 367 1071 0.57735 0.34267 3
mixtral-instruct-v0.1:46_7:ggufv2:Q2_K 353 1071 0.57735 0.329599 3
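
The Accuracy column in the tables above appears to be the ratio of Score achieved to Score possible. The page does not state this definition, so treat it as an assumption. The sketch below checks it against three rows copied from the medical exam table. The `accuracy` helper is a hypothetical name used only for illustration and is not part of BioChatter.

```python
# Rows copied from the medical exam table:
# (model, score achieved, score possible, reported accuracy)
rows = [
    ("gpt-4o-2024-08-06", 806, 948, 0.850211),
    ("gpt-4o-mini-2024-07-18", 1352, 1608, 0.840796),
    ("gpt-4-turbo-2024-04-09", 1366, 1632, 0.83701),
]


def accuracy(score_achieved: float, score_possible: float) -> float:
    """Assumed definition: fraction of the possible score that was achieved."""
    if score_possible <= 0:
        raise ValueError("score_possible must be positive")
    return score_achieved / score_possible


for model, achieved, possible, reported in rows:
    computed = accuracy(achieved, possible)
    # The reported values are rounded to about six significant digits,
    # so compare with a small tolerance rather than for exact equality.
    assert abs(computed - reported) < 1e-6, model
    print(f"{model}: {computed:.6f}")
```

Because Score possible differs between models (for example 948 versus 1608), the ratio is the comparable quantity across rows, not the raw score.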