Part 4: Measuring Models for Massive Multitask Language Understanding (MMLU)
Most of us have encountered large language models (LLMs) described as versatile tools, much like a Swiss Army knife: adept in many areas but not necessarily expert in all of them. This raises the question of how to evaluate their strengths and limitations effectively across different tasks, and it makes standardized methods for assessing their multitask language understanding across domains essential.
What are the MMLU standards and recommendations?
When it comes to evaluating LLMs for Massive Multitask Language Understanding (MMLU), the most frequently referenced paper is "Measuring Massive Multitask Language Understanding" by Hendrycks et al., which introduces the benchmark and outlines a framework for these evaluations. It is often cited when discussing standards for assessing LLM capabilities across multiple domains. You can find the paper here:
https://arxiv.org/abs/2009.03300
MMLU evaluations involve various datasets and usually specify the number of "shots", i.e. worked examples included in the prompt during testing. Some models are particularly sensitive to the length of the context provided, which affects their accuracy on specific subjects.
A concern often raised is that models may have memorized parts of their training data, which can lead to artificially high accuracy when evaluation questions overlap with the training set. To mitigate this, evaluators sometimes source questions from different documents or ensure that questions and answers are located on different pages. There are multiple MMLU datasets available; here I have used cais/mmlu.
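To make this concrete, below is a minimal sketch of how a cais/mmlu subset can be loaded from Hugging Face and assembled into a k-shot prompt, with the dev split supplying the in-context examples. The prompt template and the build_k_shot_prompt helper are my own illustrative choices, not part of any official harness.

```python
# Sketch: load a cais/mmlu subset and build a k-shot prompt from its dev split.
# Each row has "question", "choices" (four strings) and "answer" (index 0-3).
from datasets import load_dataset

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_example(row, include_answer=True):
    """Render one MMLU row as question + lettered choices (+ answer for shots)."""
    lines = [row["question"]]
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, row["choices"])]
    lines.append(f"Answer: {CHOICE_LABELS[row['answer']]}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_k_shot_prompt(subject="elementary_mathematics", k=5, test_index=0):
    """k dev-split examples followed by one unanswered test question."""
    dev = load_dataset("cais/mmlu", subject, split="dev")
    test = load_dataset("cais/mmlu", subject, split="test")
    shots = [format_example(dev[i]) for i in range(min(k, len(dev)))]
    target = format_example(test[test_index], include_answer=False)
    return "\n\n".join(shots + [target])

print(build_k_shot_prompt(k=5))
```

Varying k here is exactly the "number of shots" knob mentioned above, and longer prompts are where context-length sensitivity starts to show.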
List of subsets in the cais/mmlu dataset:
The MMLU dataset is divided into several subsets, each covering a distinct field of knowledge. Here is a breakdown of the areas included in cais/mmlu, which is available on Hugging Face:
- Abstract Algebra (116 rows)
- Anatomy (154 rows)
- Astronomy (173 rows)
- Auxiliary Train (99.8k rows)
- Business Ethics (116 rows)
- Clinical Knowledge (299 rows)
- College Biology (165 rows)
- College Chemistry (113 rows)
- College Computer Science (116 rows)
- College Mathematics (116 rows)
- College Medicine (200 rows)
- College Physics (118 rows)
- Computer Security (116 rows)
- Conceptual Physics (266 rows)
- Econometrics (131 rows)
- Electrical Engineering (166 rows)
- Elementary Mathematics (424 rows)
- Formal Logic (145 rows)
- Global Facts (115 rows)
- High School Biology (347 rows)
- High School Chemistry (230 rows)
- High School Computer Science (114 rows)
- High School European History (188 rows)
- High School Geography (225 rows)
- High School Government and Politics (219 rows)
- High School Macroeconomics (438 rows)
- High School Mathematics (304 rows)
- High School Microeconomics (269 rows)
- High School Physics (173 rows)
- High School Psychology (610 rows)
- High School Statistics (244 rows)
- High School U.S. History (231 rows)
- High School World History (268 rows)
- Human Aging (251 rows)
- Human Sexuality (148 rows)
- International Law (139 rows)
- Jurisprudence (124 rows)
- Logical Fallacies (186 rows)
- Machine Learning (128 rows)
- Management (119 rows)
- Marketing (264 rows)
- Medical Genetics (116 rows)
- Miscellaneous (874 rows)
- Moral Disputes (389 rows)
- Moral Scenarios (1k rows)
- Nutrition (344 rows)
- Philosophy (350 rows)
- Prehistory (364 rows)
- Professional Accounting (318 rows)
- Professional Law (1.71k rows)
- Professional Medicine (308 rows)
- Professional Psychology (686 rows)
- Public Relations (127 rows)
- Security Studies (277 rows)
- Sociology (228 rows)
- U.S. Foreign Policy (116 rows)
- Virology (189 rows)
- World Religions (195 rows)
Evaluations using MMLU often cover these areas at a high level. Other MMLU datasets can also be used for more targeted evaluations, especially if you’re looking to apply LLMs in specific fields. It’s crucial to ensure the model’s evaluation in your area of interest meets the necessary standards.
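If you would rather enumerate these subsets programmatically than rely on the list above, the datasets library can report them directly; a small sketch, assuming the current layout of the cais/mmlu repository:

```python
# Sketch: list every configuration (subject subset) exposed by cais/mmlu.
from datasets import get_dataset_config_names

configs = get_dataset_config_names("cais/mmlu")
print(len(configs), "configs available")
print(configs[:5])  # e.g. the first few subject names
```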
Increasing model size alone doesn't guarantee better performance; it must be paired with rich, diverse training data. Scaling-law research indicates that training data needs to grow roughly in step with parameter count, so adding parameters without a corresponding increase in data quickly runs into diminishing returns.
I ran a couple of tests on the bert-base-uncased model, which was pretrained on English Wikipedia (2.5B words) and BooksCorpus (800M words). Its key specifications are:
- Number of layers: 12 Transformer blocks
- Hidden size: 768
- Total parameters: ~110 million
- Maximum input length: 512 tokens
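Those figures are easy to verify against the published checkpoint. The short sketch below pulls the configuration and counts the parameters rather than hard-coding anything; the expected values from above are shown as comments.

```python
# Sketch: confirm bert-base-uncased's architecture numbers from its config and weights.
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

print("Layers:", config.num_hidden_layers)                  # 12
print("Hidden size:", config.hidden_size)                   # 768
print("Max input length:", config.max_position_embeddings)  # 512
print("Parameters (M):", round(sum(p.numel() for p in model.parameters()) / 1e6))  # ~110
```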
On the elementary mathematics subset it achieved an accuracy of around 21.95%, roughly chance level for a four-option multiple-choice task, and the model's confidence was low. This model, developed by Google AI, uses a transformer architecture that leverages bidirectional training to understand the context of words in a sentence. Its primary pretraining objectives were:
- Masked Language Modeling (MLM): predicting randomly masked tokens in sentences.
- Next Sentence Prediction (NSP): understanding the relationship between pairs of sentences.
This evaluation specifically focuses on elementary mathematics. However, you can choose any subset from the dataset to assess a model’s performance, providing insights into its average accuracy across various domains.
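For reference, here is a rough sketch of how such a run can be scripted end to end. Because bert-base-uncased has no instruction-following or trained multiple-choice head, I score each candidate answer by the masked-LM pseudo-log-likelihood of "question + choice"; that scoring strategy is an assumption of mine for illustration and is not necessarily the setup behind the 21.95% figure above.

```python
# Sketch: evaluate bert-base-uncased on an MMLU subset by scoring each choice
# with the masked-LM pseudo-log-likelihood of "question + choice".
# Note: this runs one forward pass per token per choice, so it is slow;
# sample the split for quick experiments.
import torch
from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(text: str) -> float:
    """Sum of log-probabilities of each token when it is masked in turn."""
    input_ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)["input_ids"][0]
    total = 0.0
    for i in range(1, len(input_ids) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total

ds = load_dataset("cais/mmlu", "elementary_mathematics", split="test")
correct = 0
for row in ds:
    scores = [pseudo_log_likelihood(f"{row['question']} {choice}") for choice in row["choices"]]
    correct += int(scores.index(max(scores)) == row["answer"])

print(f"Accuracy: {correct / len(ds):.2%}")
```

With no task-specific fine-tuning, accuracy near the 25% chance level of a four-choice task is the expected outcome, which is consistent with the result reported above.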
Conclusion:
As we continue to develop and use LLMs, it’s vital to assess whether existing evaluation standards are sufficient for our specific use cases. Creating custom evaluation datasets for your applications might be necessary. Over time, models may memorize evaluation data, requiring us to develop new datasets to ensure robust performance on unseen data. Ultimately, it’s up to us to decide how to evaluate pre-trained models effectively, and I hope these insights help you in evaluating any model from the MMLU perspective.