While artificial intelligence excels at tasks like coding and podcast generation, it struggles to accurately answer high-level history questions, according to a study.
Researchers tested OpenAI’s GPT-4, Meta’s Llama and Google’s Gemini using a newly developed benchmark called Hist-LLM.
The benchmark relies on the Seshat Global History Databank, a comprehensive database of historical knowledge.
The results, presented at the NeurIPS AI conference last month, were disappointing, according to TechCrunch.
GPT-4 Turbo performed best but achieved only about 46% accuracy, little better than random guessing.
“LLMs, while impressive, still lack the depth required for advanced history,” said Maria del Rio-Chanona, a co-author of the paper and associate professor at University College London.
“They’re great for basic facts, but they fail at nuanced, PhD-level historical inquiries.”
Researchers found that LLMs often extrapolate from prominent historical data but struggle with more obscure details.
For instance, GPT-4 incorrectly stated that scale armor was present in ancient Egypt during a specific time period, when in reality the technology appeared there only 1,500 years later.
Similarly, the model falsely claimed ancient Egypt had a professional standing army during a particular period, likely due to the prevalence of information on standing armies in other ancient empires, such as Persia.
“If you get told A and B 100 times, and C only once, you’re more likely to recall A and B,” del Rio-Chanona explained.
Another concern was potential bias.
OpenAI’s GPT-4 and Meta’s Llama models performed worse when answering questions about regions such as sub-Saharan Africa, suggesting limitations in their training data.
“These biases suggest LLMs reflect gaps in historical documentation rather than an unbiased representation of history,” said Peter Turchin, the study’s lead researcher.
Despite these limitations, researchers remain hopeful that AI can assist historians in the future.
They plan to refine the Hist-LLM benchmark by incorporating more diverse data sources and increasing the complexity of the questions.
“Our findings highlight areas where LLMs need improvement, but they also showcase their potential to support historical research,” the paper concluded.
As AI continues to evolve, experts say human historians remain irreplaceable in interpreting complex historical narratives and ensuring accuracy in academic inquiry.
[Notigroup Newsroom in collaboration with other media outlets, with information from the following sources]