VIEGAS, C. V.; http://lattes.cnpq.br/9064657341820241; VIEGAS, Cayo Vinicíus.
Abstract:
Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. However, their performance in specialized domains such as computer science remains relatively underexplored. This study investigates whether LLMs can match or surpass human performance on the POSCOMP, a prestigious Brazilian examination used for graduate admissions in computer science. Four LLMs (ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large) were evaluated on the 2022 and 2023 POSCOMP exams. The evaluation consisted of two assessments, one including questions that require image interpretation and another restricted to a text-only format, to determine the models' proficiency in handling the complex questions typical of the exam. Results indicated that LLMs performed significantly better on text-based questions, with image interpretation posing a major challenge. For instance, in the image-based assessment, ChatGPT-4 answered 40 out of 70 questions correctly, while Gemini 1.0 Advanced managed only 11 correct answers. In the text-based assessment of the 2022 exam, ChatGPT-4 led with 57 correct answers, followed by Gemini 1.0 Advanced (49), Le Chat Mistral Large (48), and Claude 3 Sonnet (44). The 2023 exam showed similar trends.