Summary
- China’s gaokao is widely acknowledged as one of the most challenging college entrance exams globally. Mistral, developed by a French startup, ended up with the lowest ranking. The evaluations were conducted on OpenCompass, an open-source LLM evaluation platform developed by Shanghai AI Lab. The results showed that the large models generally performed well on the Chinese and English exams, but all failed mathematics; the highest math score was only 75 points, from InternLM2, followed by GPT-4o with 73 points. The highest score in Chinese was achieved by Tongyi Qianwen, and in English by GPT-4o. There is still a lot of room for improvement in mathematics for the large models, according to Shanghai AI Lab. Teachers who were involved in grading the papers said essays from large models read more like answers to question-and-answer tasks and lack techniques such as using examples as evidence and citing references and famous quotes, which human candidates typically use. Most models cannot comprehend language concepts such as metaphors and metaphorical expressions, and still struggle to fully grasp some of the implicit meanings in language, Shanghai AI Lab said. Regarding mathematics, the teachers said large models are relatively weak at answering subjective questions. In some cases, the working contained errors even though the final answer was correct. The evaluation was based on the national new curriculum standard paper, testing Chinese, mathematics, and English, including both objective and subjective questions. The papers were graded anonymously by at least three teachers with experience in grading gaokao papers.
Approximate Time
- 3 minutes, 475 words
Categories
- AI models, large models, Most models, Shanghai AI Lab, lower scores
Main Section
AI. Photo: VCG
China’s gaokao is widely acknowledged as one of the most challenging college entrance exams globally. When artificial intelligence (AI) models were used to answer the exam questions, the highest score reached only 303 out of a possible 420, and all of the models failed the math section.
Out of seven AI models, Alibaba Cloud’s Tongyi Qianwen 2-72B secured the top spot with a score of 303, followed by GPT-4o, developed by OpenAI, with a score of 296, and InternLM2 from Shanghai AI Lab in third place. Mistral, developed by a French startup, ended up with the lowest ranking.
The evaluations were conducted on OpenCompass, an open-source LLM evaluation platform developed by Shanghai AI Lab.
The results showed that the large models generally performed well on the Chinese and English exams, but all failed mathematics; the highest math score was only 75 points, from InternLM2, followed by GPT-4o with 73 points.
The highest score in Chinese was achieved by Tongyi Qianwen, and in English by GPT-4o. There is still a lot of room for improvement in mathematics for the large models, according to Shanghai AI Lab.
Teachers who were involved in grading the papers said essays from large models read more like answers to question-and-answer tasks and lack techniques such as using examples as evidence and citing references and famous quotes, which human candidates typically use.
Most models cannot comprehend language concepts such as metaphors and metaphorical expressions, and still struggle to fully grasp some of the implicit meanings in language, Shanghai AI Lab said.
Regarding mathematics, the teachers said large models are relatively weak at answering subjective questions. In some cases, the working contained errors even though the final answer was correct. The exam results suggest that large models have a strong ability to memorize formulas but are unable to apply them flexibly in the problem-solving process.
Math involves complex reasoning abilities, which is a common challenge faced by large models and a key capability required for reliable implementation in various industrial scenarios, industry observers said.
In English, the overall performance was good, but some models had lower scores in English essays due to exceeding the word limit, while human candidates often lose points for not meeting the word count, the lab said.
Gaokao refers to the annual national college entrance exam, which is regarded as one of the most important exams for Chinese students. The evaluation was based on the national new curriculum standard paper, testing Chinese, mathematics, and English, including both objective and subjective questions.
The papers were graded anonymously by at least three teachers with experience in grading gaokao papers. Before the grading, the teachers were not informed that the answers were all generated by AI models, according to Shanghai AI Lab.
Global Times
Content comes from the Internet : AI models get poor score when tackling gaokao questions