# Benchmark Results - Overview

Here we collect the results of the living BioChatter benchmark. For an explanation of the benchmarking approach, see the benchmarking documentation; the developer docs provide further reading.

## Scores per model

The table is sorted by median accuracy in descending order; click the column names to reorder. Model size is given in billions of parameters, and SD denotes the standard deviation of the scores. A sketch of how such a per-model summary could be computed follows the table.

| Model name | Size (B) | Median Accuracy | SD |
|------------|----------|-----------------|------|
| gpt-3.5-turbo-0125 | 175 | 0.87 | 0.21 |
| gpt-4-0613 | Unknown | 0.78 | 0.18 |
| gpt-4-0125-preview | Unknown | 0.73 | 0.30 |
| gpt-3.5-turbo-0613 | 175 | 0.73 | 0.24 |
| openhermes-2.5 | 7 | 0.70 | 0.32 |
| gpt-4o-2024-05-13 | Unknown | 0.70 | 0.35 |
| llama-3-instruct | 8 | 0.64 | 0.36 |
| chatglm3 | 6 | 0.44 | 0.26 |
| llama-2-chat | 70 | 0.42 | 0.34 |
| mistral-instruct-v0.2 | 7 | 0.40 | 0.33 |
| code-llama-instruct | 7 | 0.40 | 0.35 |
| code-llama-instruct | 34 | 0.38 | 0.35 |
| code-llama-instruct | 13 | 0.38 | 0.33 |
| llama-2-chat | 13 | 0.38 | 0.33 |
| llama-2-chat | 7 | 0.34 | 0.31 |
| mixtral-instruct-v0.1 | 46.7 | 0.34 | 0.28 |
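
The per-model scores above aggregate over all quantisations and tasks of a model. As a rough illustration of how such a summary could be computed, here is a minimal pandas sketch; the file name `benchmark_results.csv` and its columns (`model_name`, `size`, `quantisation`, `task`, `score`) are hypothetical placeholders, not the benchmark's actual output format.

```python
import pandas as pd

# Hypothetical raw results in long format: one row per
# (model, size, quantisation, task) combination with a score in [0, 1].
# File and column names are assumptions for illustration.
df = pd.read_csv("benchmark_results.csv")

# Collapse quantisations and tasks into one summary row per model,
# then sort by median accuracy, as in the table above.
per_model = (
    df.groupby(["model_name", "size"])["score"]
    .agg(median_accuracy="median", sd="std")
    .round(2)
    .reset_index()
    .sort_values("median_accuracy", ascending=False)
)
print(per_model.to_string(index=False))
```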

[Interactive figures: scatter plots of the scores, grouped by quantisation or by model name, and a boxplot per model]

## Scores per quantisation

The table is sorted by median accuracy in descending order; click the column names to reorder. A sketch for quantifying each model's sensitivity to quantisation follows the table.

| Model name | Size (B) | Version | Quantisation | Median Accuracy | SD |
|------------|----------|---------|--------------|-----------------|------|
| gpt-3.5-turbo-0125 | 175 | n/a | n/a | 0.87 | 0.21 |
| gpt-4-0613 | Unknown | n/a | n/a | 0.78 | 0.18 |
| gpt-4-0125-preview | Unknown | n/a | n/a | 0.73 | 0.30 |
| openhermes-2.5 | 7 | ggufv2 | Q5_K_M | 0.73 | 0.32 |
| gpt-3.5-turbo-0613 | 175 | n/a | n/a | 0.73 | 0.24 |
| openhermes-2.5 | 7 | ggufv2 | Q8_0 | 0.71 | 0.32 |
| openhermes-2.5 | 7 | ggufv2 | Q4_K_M | 0.71 | 0.33 |
| openhermes-2.5 | 7 | ggufv2 | Q6_K | 0.70 | 0.33 |
| gpt-4o-2024-05-13 | Unknown | n/a | n/a | 0.70 | 0.35 |
| llama-3-instruct | 8 | ggufv2 | Q8_0 | 0.65 | 0.35 |
| llama-3-instruct | 8 | ggufv2 | Q4_K_M | 0.64 | 0.38 |
| llama-3-instruct | 8 | ggufv2 | Q6_K | 0.64 | 0.36 |
| llama-3-instruct | 8 | ggufv2 | Q5_K_M | 0.62 | 0.36 |
| openhermes-2.5 | 7 | ggufv2 | Q3_K_M | 0.59 | 0.32 |
| openhermes-2.5 | 7 | ggufv2 | Q2_K | 0.51 | 0.30 |
| code-llama-instruct | 34 | ggufv2 | Q2_K | 0.50 | 0.33 |
| code-llama-instruct | 7 | ggufv2 | Q3_K_M | 0.49 | 0.31 |
| code-llama-instruct | 7 | ggufv2 | Q4_K_M | 0.47 | 0.39 |
| mistral-instruct-v0.2 | 7 | ggufv2 | Q5_K_M | 0.46 | 0.34 |
| mistral-instruct-v0.2 | 7 | ggufv2 | Q6_K | 0.45 | 0.34 |
| code-llama-instruct | 34 | ggufv2 | Q3_K_M | 0.45 | 0.31 |
| chatglm3 | 6 | ggmlv3 | q4_0 | 0.44 | 0.26 |
| llama-2-chat | 70 | ggufv2 | Q4_K_M | 0.44 | 0.35 |
| llama-2-chat | 70 | ggufv2 | Q5_K_M | 0.44 | 0.35 |
| code-llama-instruct | 13 | ggufv2 | Q6_K | 0.44 | 0.35 |
| code-llama-instruct | 13 | ggufv2 | Q8_0 | 0.44 | 0.33 |
| code-llama-instruct | 13 | ggufv2 | Q5_K_M | 0.43 | 0.32 |
| llama-2-chat | 70 | ggufv2 | Q3_K_M | 0.41 | 0.33 |
| mistral-instruct-v0.2 | 7 | ggufv2 | Q3_K_M | 0.41 | 0.34 |
| mistral-instruct-v0.2 | 7 | ggufv2 | Q8_0 | 0.40 | 0.33 |
| llama-2-chat | 13 | ggufv2 | Q8_0 | 0.40 | 0.34 |
| code-llama-instruct | 7 | ggufv2 | Q8_0 | 0.40 | 0.37 |
| code-llama-instruct | 7 | ggufv2 | Q5_K_M | 0.39 | 0.34 |
| llama-2-chat | 13 | ggufv2 | Q3_K_M | 0.39 | 0.33 |
| llama-2-chat | 13 | ggufv2 | Q5_K_M | 0.39 | 0.33 |
| code-llama-instruct | 7 | ggufv2 | Q2_K | 0.38 | 0.29 |
| code-llama-instruct | 34 | ggufv2 | Q4_K_M | 0.38 | 0.35 |
| code-llama-instruct | 7 | ggufv2 | Q6_K | 0.38 | 0.39 |
| code-llama-instruct | 34 | ggufv2 | Q5_K_M | 0.38 | 0.38 |
| llama-2-chat | 70 | ggufv2 | Q2_K | 0.38 | 0.35 |
| llama-2-chat | 13 | ggufv2 | Q6_K | 0.37 | 0.34 |
| code-llama-instruct | 34 | ggufv2 | Q8_0 | 0.37 | 0.35 |
| llama-2-chat | 7 | ggufv2 | Q4_K_M | 0.37 | 0.29 |
| mistral-instruct-v0.2 | 7 | ggufv2 | Q2_K | 0.37 | 0.29 |
| mistral-instruct-v0.2 | 7 | ggufv2 | Q4_K_M | 0.37 | 0.35 |
| code-llama-instruct | 34 | ggufv2 | Q6_K | 0.37 | 0.36 |
| llama-2-chat | 13 | ggufv2 | Q4_K_M | 0.36 | 0.34 |
| llama-2-chat | 7 | ggufv2 | Q3_K_M | 0.36 | 0.34 |
| mixtral-instruct-v0.1 | 46.7 | ggufv2 | Q4_K_M | 0.35 | 0.30 |
| llama-2-chat | 7 | ggufv2 | Q8_0 | 0.35 | 0.29 |
| mixtral-instruct-v0.1 | 46.7 | ggufv2 | Q5_K_M | 0.34 | 0.31 |
| mixtral-instruct-v0.1 | 46.7 | ggufv2 | Q6_K | 0.34 | 0.29 |
| mixtral-instruct-v0.1 | 46.7 | ggufv2 | Q3_K_M | 0.33 | 0.28 |
| code-llama-instruct | 13 | ggufv2 | Q4_K_M | 0.33 | 0.31 |
| llama-2-chat | 7 | ggufv2 | Q6_K | 0.33 | 0.29 |
| mixtral-instruct-v0.1 | 46.7 | ggufv2 | Q8_0 | 0.33 | 0.25 |
| llama-2-chat | 7 | ggufv2 | Q5_K_M | 0.32 | 0.29 |
| mixtral-instruct-v0.1 | 46.7 | ggufv2 | Q2_K | 0.32 | 0.27 |
| llama-2-chat | 13 | ggufv2 | Q2_K | 0.28 | 0.29 |
| llama-2-chat | 7 | ggufv2 | Q2_K | 0.22 | 0.36 |
| code-llama-instruct | 13 | ggufv2 | Q2_K | 0.17 | 0.34 |
| code-llama-instruct | 13 | ggufv2 | Q3_K_M | 0.15 | 0.34 |
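
Beyond ranking individual quantisations, it can be useful to quantify how sensitive each model is to quantisation. Continuing with the same hypothetical long-format CSV as in the sketch above:

```python
import pandas as pd

# Same hypothetical input as in the previous sketch.
df = pd.read_csv("benchmark_results.csv")

# Median accuracy per (model, quantisation) pair, then the spread
# between each model's best and worst quantisation level.
per_quant = df.groupby(["model_name", "quantisation"])["score"].median()
quant_spread = (
    per_quant.groupby("model_name")
    .agg(best="max", worst="min")
    .assign(spread=lambda d: d["best"] - d["worst"])
    .sort_values("spread", ascending=False)
)
print(quant_spread.round(2))
```

In the table above, for example, openhermes-2.5 spans 0.51 (Q2_K) to 0.73 (Q5_K_M), and the 13B code-llama-instruct spans 0.15 (Q3_K_M) to 0.44 (Q6_K and Q8_0), so the choice of quantisation can matter as much as the choice of model.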

[Interactive figure: boxplot of the scores per quantisation]

## Scores of all tasks

This is a wide table; you may need to scroll horizontally to see all columns. The table is sorted by median accuracy in descending order; click the column names to reorder. A sketch of how such a wide per-task view could be assembled from long-format results follows the table.

| Full model name | explicit_relevance_of_single_fragments | sourcedata_info_extraction | implicit_relevance_of_multiple_fragments | api_calling | query_generation | naive_query_generation_using_schema | entity_selection | property_exists | medical_exam | end_to_end_query_generation | property_selection | relationship_selection | Mean Accuracy | Median Accuracy | SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gpt-3.5-turbo-0125 | 1 | 0.510032 | 0.9 | 0.647059 | 0.966667 | 0.486667 | 1 | 0.866667 | 0.670401 | 0.926667 | 0.35625 | 1 | 0.777534 | 0.866667 | 0.214646 |
| gpt-4-0613 | 1 | 0.668903 | 1 | 0.619048 | 0.966667 | 0.68 | 0.888889 | 0.888889 | 0.730159 | 0.88 | 0.359375 | 0.65 | 0.777661 | 0.777661 | 0.177558 |
| gpt-4-0125-preview | 1 | 0.689705 | 0.5 | 0.793651 | 0.833333 | 0.44 | 0.777778 | 0.733333 | 0.77591 | 0 | 0 | 0.75 | 0.607809 | 0.733333 | 0.295129 |
| openhermes-2.5:7:ggufv2:Q5_K_M | 1 | 0.579916 | 1 | n/a | 0.913333 | 0.586667 | 0.888889 | 0.777778 | 0.571429 | 0 | 0.125 | 1 | 0.676637 | 0.727208 | 0.318593 |
| gpt-3.5-turbo-0613 | 1 | 0.575381 | 1 | n/a | 0.946667 | 0.5 | 0.888889 | 0.755556 | 0.291317 | 0.833333 | 0.3625 | 0.5 | 0.695786 | 0.725671 | 0.237079 |
| openhermes-2.5:7:ggufv2:Q8_0 | 1 | 0.600829 | 1 | n/a | 0.88 | 0.466667 | 0.888889 | 0.755556 | 0.577031 | 0 | 0.125 | 1 | 0.663088 | 0.709322 | 0.319919 |
| openhermes-2.5:7:ggufv2:Q4_K_M | 1 | 0.597281 | 1 | n/a | 0.873333 | 0.466667 | 0.888889 | 0.755556 | 0.586368 | 0 | 0.046875 | 1 | 0.655906 | 0.705731 | 0.330932 |
| openhermes-2.5:7:ggufv2:Q6_K | 1 | 0.619167 | 1 | n/a | 0.86 | 0.533333 | 1 | 0.733333 | 0.57423 | 0 | 0.046875 | 1 | 0.669722 | 0.701528 | 0.334697 |
| gpt-4o-2024-05-13 | 1 | 0.653946 | 0.7 | 0.809524 | 0.8 | 0.533333 | 1 | 0.85 | 0.762838 | 0 | 0 | 0 | 0.59247 | 0.7 | 0.350859 |
| llama-3-instruct:8:ggufv2:Q8_0 | 1 | 0.188555 | 1 | n/a | 0.92 | 0.666667 | 0.875 | 0.725 | 0.638655 | 0 | 0.28125 | 0 | 0.572284 | 0.652661 | 0.354494 |
| llama-3-instruct:8:ggufv2:Q4_K_M | 1 | 0.116871 | 1 | n/a | 0.92 | 0.666667 | 0.861111 | 0.775 | 0.622782 | 0 | 0.109375 | 0 | 0.551982 | 0.644725 | 0.376704 |
| llama-3-instruct:8:ggufv2:Q6_K | 1 | 0.162657 | 1 | n/a | 0.926667 | 0.666667 | 0.875 | 0.775 | 0.621849 | 0 | 0.28125 | 0 | 0.573554 | 0.644258 | 0.359128 |
| llama-3-instruct:8:ggufv2:Q5_K_M | 1 | 0.166434 | 1 | n/a | 0.926667 | 0.6 | 0.875 | 0.65 | 0.633053 | 0 | 0.1875 | 0 | 0.548969 | 0.616527 | 0.360513 |
| openhermes-2.5:7:ggufv2:Q3_K_M | 1 | 0.554488 | 0.5 | n/a | 0.94 | 0.466667 | 1 | 0.72 | 0.563959 | 0 | 0.125 | 1 | 0.624556 | 0.594257 | 0.318981 |
| openhermes-2.5:7:ggufv2:Q2_K | 1 | 0.444054 | 0.5 | n/a | 0.94 | 0.433333 | 0.555556 | 0.844444 | 0.537815 | 0 | 0 | 0.5 | 0.5232 | 0.5116 | 0.298404 |
| code-llama-instruct:34:ggufv2:Q2_K | 0.5 | n/a | 1 | n/a | 0.686667 | 0.566667 | 0 | 0.75 | n/a | 0 | 0 | 0.5 | 0.444815 | 0.5 | 0.328199 |
| code-llama-instruct:7:ggufv2:Q3_K_M | 0.833333 | n/a | 0.7 | n/a | 0.873333 | 0.426667 | 0.5 | 0.8 | n/a | 0 | 0 | 0.25 | 0.487037 | 0.493519 | 0.307716 |
| code-llama-instruct:7:ggufv2:Q4_K_M | 1 | 0.138732 | 1 | n/a | 0.966667 | 0.653333 | 0.333333 | 0.6 | n/a | 0 | 0 | 0 | 0.469207 | 0.469207 | 0.38731 |
| mistral-instruct-v0.2:7:ggufv2:Q5_K_M | 1 | 0.385754 | 1 | n/a | 0.826667 | 0.466667 | 0.444444 | 0.688889 | 0.364146 | 0 | 0 | 0 | 0.470597 | 0.455556 | 0.34385 |
| mistral-instruct-v0.2:7:ggufv2:Q6_K | 1 | 0.367412 | 1 | n/a | 0.833333 | 0.433333 | 0.5 | 0.65 | 0.366947 | 0 | 0.046875 | 0 | 0.472536 | 0.452935 | 0.337974 |
| code-llama-instruct:34:ggufv2:Q3_K_M | 0.5 | n/a | 0.5 | n/a | 0.786667 | 0.6 | 0 | 0.875 | n/a | 0 | 0 | 0.25 | 0.390185 | 0.445093 | 0.306514 |
| chatglm3:6:ggmlv3:q4_0 | 0.733333 | 0.188284 | 1 | n/a | 0.553333 | 0.48 | 0.75 | 0.275 | 0.426704 | 0 | 0.2875 | 0.4 | 0.463105 | 0.444905 | 0.260423 |
| llama-2-chat:70:ggufv2:Q4_K_M | 1 | 0.240936 | 1 | n/a | 0.92 | 0.42 | 0.444444 | 0.755556 | n/a | 0 | 0 | 0.25 | 0.503094 | 0.444444 | 0.354692 |
| llama-2-chat:70:ggufv2:Q5_K_M | 1 | 0.210166 | 0.9 | n/a | 0.906667 | 0.36 | 0.444444 | 0.777778 | n/a | 0 | 0 | 0.25 | 0.484905 | 0.444444 | 0.346535 |
| code-llama-instruct:13:ggufv2:Q6_K | 0.833333 | n/a | 0.5 | n/a | 0.793333 | 0.54 | 0 | 0.825 | n/a | 0 | 0 | 0 | 0.387963 | 0.443981 | 0.345581 |
| code-llama-instruct:13:ggufv2:Q8_0 | 0.833333 | n/a | 0.5 | n/a | 0.766667 | 0.566667 | 0 | 0.75 | n/a | 0 | 0 | 0 | 0.37963 | 0.439815 | 0.334971 |
| code-llama-instruct:13:ggufv2:Q5_K_M | 0.666667 | n/a | 0.5 | n/a | 0.78 | 0.566667 | 0 | 0.775 | n/a | 0 | 0 | 0 | 0.36537 | 0.432685 | 0.320506 |
| llama-2-chat:70:ggufv2:Q3_K_M | 1 | 0.197898 | 0.5 | n/a | 0.906667 | 0.413333 | 0.333333 | 0.777778 | n/a | 0 | 0.171875 | 0 | 0.430088 | 0.413333 | 0.327267 |
| mistral-instruct-v0.2:7:ggufv2:Q3_K_M | 1 | 0.368974 | 1 | n/a | 0.773333 | 0.466667 | 0.333333 | 0.666667 | 0.360411 | 0 | 0.046875 | 0 | 0.456024 | 0.412499 | 0.335885 |
| mistral-instruct-v0.2:7:ggufv2:Q8_0 | 1 | 0.351684 | 0.9 | n/a | 0.846667 | 0.433333 | 0.333333 | 0.644444 | 0.366947 | 0 | 0.0375 | 0 | 0.446719 | 0.40014 | 0.330107 |
| llama-2-chat:13:ggufv2:Q8_0 | 1 | 0.0762457 | 0.5 | n/a | 0.786667 | 0.48 | 0 | 0.711111 | 0.431373 | 0 | 0 | 0 | 0.362309 | 0.396841 | 0.335904 |
| code-llama-instruct:7:ggufv2:Q8_0 | 1 | n/a | 0.5 | n/a | 0.96 | 0.4 | 0 | 0.666667 | n/a | 0 | 0 | 0 | 0.391852 | 0.395926 | 0.37338 |
| code-llama-instruct:7:ggufv2:Q5_K_M | 0.833333 | n/a | 0.5 | n/a | 0.96 | 0.4 | 0.111111 | 0.688889 | n/a | 0 | 0 | 0 | 0.388148 | 0.394074 | 0.340156 |
| llama-2-chat:13:ggufv2:Q3_K_M | 1 | 0.112631 | 0.5 | n/a | 0.68 | 0.48 | 0 | 0.733333 | 0.428571 | 0 | 0 | 0 | 0.357685 | 0.393128 | 0.325419 |
| llama-2-chat:13:ggufv2:Q5_K_M | 1 | 0.0766167 | 0.5 | n/a | 0.746667 | 0.433333 | 0 | 0.644444 | 0.431373 | 0 | 0 | 0 | 0.348403 | 0.389888 | 0.32518 |
| code-llama-instruct:7:ggufv2:Q2_K | 0.333333 | n/a | 0.7 | n/a | 0.92 | 0.533333 | 0.25 | 0.8 | n/a | 0 | 0.0625 | 0.25 | 0.427685 | 0.380509 | 0.292686 |
| code-llama-instruct:34:ggufv2:Q4_K_M | 0.5 | n/a | 0.4 | n/a | 0.906667 | 0.466667 | 0 | 0.975 | n/a | 0 | 0 | 0 | 0.360926 | 0.380463 | 0.350483 |
| code-llama-instruct:7:ggufv2:Q6_K | 0.833333 | n/a | 0.9 | n/a | 0.96 | 0.333333 | 0 | 0.775 | n/a | 0 | 0 | 0 | 0.422407 | 0.37787 | 0.391629 |
| code-llama-instruct:34:ggufv2:Q5_K_M | 0.333333 | n/a | 1 | n/a | 0.9 | 0.466667 | 0.125 | 0.95 | n/a | 0 | 0 | 0 | 0.419444 | 0.376389 | 0.384096 |
| llama-2-chat:70:ggufv2:Q2_K | 1 | 0.215047 | 0.5 | n/a | 0.9 | 0.473333 | 0 | 0.666667 | n/a | 0 | 0 | 0 | 0.375505 | 0.375505 | 0.352226 |
| llama-2-chat:13:ggufv2:Q6_K | 1 | 0.0781337 | 0.5 | n/a | 0.813333 | 0.386667 | 0 | 0.775 | 0.428571 | 0 | 0 | 0 | 0.361973 | 0.37432 | 0.342819 |
| code-llama-instruct:34:ggufv2:Q8_0 | 0.333333 | n/a | 0.9 | n/a | 0.86 | 0.466667 | 0.25 | 0.925 | n/a | 0 | 0 | 0 | 0.415 | 0.374167 | 0.353285 |
| llama-2-chat:7:ggufv2:Q4_K_M | 1 | 0.0852494 | 0.5 | n/a | 0.646667 | 0.24 | 0.444444 | 0.488889 | 0.40056 | 0 | 0 | 0 | 0.345983 | 0.373271 | 0.290686 |
| mistral-instruct-v0.2:7:ggufv2:Q2_K | 1 | 0.331261 | 0.5 | n/a | 0.693333 | 0.573333 | 0.222222 | 0.6 | 0.352941 | 0 | 0 | 0 | 0.388463 | 0.370702 | 0.294881 |
| mistral-instruct-v0.2:7:ggufv2:Q4_K_M | 1 | 0.347025 | 1 | n/a | 0.826667 | 0.366667 | 0.333333 | 0.688889 | 0.365079 | 0 | 0 | 0 | 0.447969 | 0.365873 | 0.348328 |
| code-llama-instruct:34:ggufv2:Q6_K | 0.333333 | n/a | 0.9 | n/a | 0.853333 | 0.473333 | 0.125 | 0.9 | n/a | 0 | 0 | 0 | 0.398333 | 0.365833 | 0.356636 |
| llama-2-chat:13:ggufv2:Q4_K_M | 1 | 0.0888675 | 0.5 | n/a | 0.76 | 0.366667 | 0 | 0.777778 | 0.428571 | 0 | 0 | 0 | 0.356535 | 0.361601 | 0.336686 |
| llama-2-chat:7:ggufv2:Q3_K_M | 1 | 0.0650717 | 1 | n/a | 0.693333 | 0.233333 | 0.333333 | 0.466667 | 0.394958 | 0 | 0.1 | 0 | 0.3897 | 0.361517 | 0.337204 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q4_K_M | 0.166667 | 0.193786 | 1 | n/a | 0.76 | 0.426667 | 0.333333 | 0.755556 | 0.368814 | 0 | 0.1625 | 0 | 0.378848 | 0.351074 | 0.301567 |
| llama-2-chat:7:ggufv2:Q8_0 | 1 | 0.0847297 | 0.5 | n/a | 0.64 | 0.266667 | 0.444444 | 0.355556 | 0.40056 | 0 | 0 | 0 | 0.335632 | 0.345594 | 0.286246 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q5_K_M | 0 | 0.235659 | 1 | n/a | 0.84 | 0.333333 | 0.422222 | 0.711111 | 0.352941 | 0 | 0 | 0.25 | 0.376842 | 0.343137 | 0.313874 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q6_K | 0 | 0.225524 | 0.7 | n/a | 0.826667 | 0.333333 | 0.475 | 0.85 | 0.34267 | 0 | 0 | 0.25 | 0.363927 | 0.338002 | 0.289705 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q3_K_M | 0 | 0.229622 | 0.5 | n/a | 0.893333 | 0.38 | 0.333333 | 0.777778 | n/a | 0 | 0.065625 | 0.25 | 0.342969 | 0.333333 | 0.278279 |
| code-llama-instruct:13:ggufv2:Q4_K_M | 0.333333 | n/a | 0.5 | n/a | 0.833333 | 0.533333 | 0 | 0.775 | n/a | 0 | 0 | 0 | 0.330556 | 0.331944 | 0.30939 |
| llama-2-chat:7:ggufv2:Q6_K | 1 | 0.0614608 | 0.5 | n/a | 0.66 | 0.266667 | 0.375 | 0.333333 | 0.406162 | 0 | 0 | 0 | 0.327511 | 0.330422 | 0.288285 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q8_0 | 0.133333 | 0.189177 | 0.6 | n/a | 0.846667 | 0.386667 | 0.311111 | 0.666667 | 0.358543 | 0 | 0 | 0.25 | 0.340197 | 0.325654 | 0.248216 |
| llama-2-chat:7:ggufv2:Q5_K_M | 1 | 0.0697591 | 0.6 | n/a | 0.633333 | 0.293333 | 0.444444 | 0.288889 | 0.40056 | 0 | 0.0375 | 0 | 0.342529 | 0.317931 | 0.289372 |
| mixtral-instruct-v0.1:46_7:ggufv2:Q2_K | 0.333333 | 0.157514 | 0.6 | n/a | 0.726667 | 0.48 | 0 | 0.733333 | 0.329599 | 0 | 0 | 0 | 0.305495 | 0.317547 | 0.269925 |
| llama-2-chat:13:ggufv2:Q2_K | 1 | 0.0649389 | 0.5 | n/a | 0.433333 | 0.366667 | 0 | 0.288889 | 0.414566 | 0 | 0 | 0 | 0.278945 | 0.283917 | 0.285171 |
| llama-2-chat:7:ggufv2:Q2_K | 0.833333 | 0.0361865 | 1 | n/a | 0.686667 | 0.1 | 0 | 0.688889 | 0.369748 | 0 | 0 | 0 | 0.337711 | 0.218856 | 0.359055 |
| code-llama-instruct:13:ggufv2:Q2_K | 0.0333333 | n/a | 0.4 | n/a | 0.82 | 0.566667 | 0 | 0.875 | n/a | 0 | 0 | 0 | 0.299444 | 0.166389 | 0.336056 |
| code-llama-instruct:13:ggufv2:Q3_K_M | 0 | n/a | 0 | n/a | 0.833333 | 0.533333 | 0.45 | 0.85 | n/a | 0 | 0 | 0 | 0.296296 | 0.148148 | 0.336707 |
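
The wide layout above can be assembled from the same hypothetical long-format data by pivoting tasks into columns and appending row-wise summary statistics; note that pandas skips missing values (the n/a cells) in row-wise statistics by default. The `full_model_name` column assumed here would hold identifiers such as `openhermes-2.5:7:ggufv2:Q5_K_M`.

```python
import pandas as pd

# Same hypothetical input as in the sketches above, with an added
# `full_model_name` column such as "openhermes-2.5:7:ggufv2:Q5_K_M".
df = pd.read_csv("benchmark_results.csv")

# One row per full model name, one column per task.
wide = df.pivot_table(
    index="full_model_name", columns="task", values="score", aggfunc="mean"
)

# Row-wise summary statistics; NaN task scores are skipped by default.
summary = wide.agg(["mean", "median", "std"], axis=1)
summary.columns = ["Mean Accuracy", "Median Accuracy", "SD"]

full = wide.join(summary).sort_values("Median Accuracy", ascending=False)
print(full.round(2).to_string())
```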