Breaking Down the Metrics: A Comparative Analysis of LLM Benchmarks

Valentina Porcu * and Aneta Havlínová

Independent researchers.
 
Research Article
International Journal of Science and Research Archive, 2024, 13(02), 777–788.
Article DOI: 10.30574/ijsra.2024.13.2.2209
 
Publication history: 
Received on 02 October 2024; revised on 11 November 2024; accepted on 14 November 2024
 
Abstract: 
The rapid advancement of large language models (LLMs) has transformed the natural language processing (NLP) domain, bringing groundbreaking developments in how machines understand and generate human language. Despite this progress, it remains difficult to objectively compare and evaluate LLM performance because assessments rely on a wide variety of benchmarks and metrics. This study therefore conducts a comparative analysis of prominent LLM benchmarks to examine the validity of different evaluation metrics. An analysis of recognized standards, including GLUE, SuperGLUE, and SQuAD, reveals both weaknesses and potential in current evaluation frameworks. The study combines quantitative assessments of model performance on these benchmarks with critical evaluation of benchmark design and scope. It also discusses methodological issues in benchmarking research and argues that current measures are insufficient and lack comprehensive evaluation coverage. The conclusions of this research underline the importance of creating integrated and efficient reference sets that keep pace with innovation in LLM capability, informing future studies and the subsequent development of more sophisticated and flexible language models.
 
Keywords: 
LLM Benchmarks; BIG-bench; NLP Tasks; SQuAD; Model Comparison; NLP Evaluation
 