Evaluation of LLM performance on 60 text-based questions and 8 image-based questions related to biosafety laboratories.
Caption
The performance of Gemini Pro (A), Claude-3 (B), Claude-2 (C), GPT-4 (D), GPT-3.5 (E), Gemini Pro Vision (F), and GPT-4V (G). The right side of each panel displays, from top to bottom, the strict accuracy (the proportion of questions answered correctly all three times), overall accuracy (the proportion of questions answered correctly at least twice), and ideal accuracy (the proportion of questions answered correctly at least once). (H) The bar chart compares the performance of the LLMs on text-based questions. (I) The bar chart compares the performance of the LLMs on image-based questions. LLMs: Large language models; RAAR: Reference answer accuracy rate; SAAR: Subjective answer accuracy rate; SAR: Strict accuracy rate.
Credit
Chang Qi, Anqi Lin, Anghua Li, Peng Luo, Shuofeng Yuan
Usage Restrictions
Credit must be given to the creator.
License
CC BY