A direct prompting test of OpenAI GPT-4 Omni and Anthropic Claude 3.5 Sonnet LLM inference on the public ARC-AGI evaluation data.
This report documents the evaluation of two prominent large language models (LLMs)—OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet—on a subset of the publicly available ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) dataset. This benchmark is designed to assess the ability of AI systems to efficiently learn new skills, a crucial aspect of general intelligence.
The ARC-AGI dataset consists of 800 public tasks (400 training and 400 evaluation tasks) designed to test reasoning abilities in a minimal-prior, skill-based setting. Each task is stored as JSON, with grids encoded as nested lists of integers: a sequence of input-output example pairs followed by a final test input whose output must be predicted from the pattern demonstrated in the examples. From the user's perspective, these tasks appear as geometric generalization and reasoning puzzles on a 2D grid filled with colored squares. Colors in the following examples are identified by integers between 0 and 9.
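For orientation, the sketch below shows the general shape of a task file; the grids are hypothetical miniatures invented for illustration, not taken from the actual evaluation set.

```python
# Hypothetical miniature of an ARC-AGI task (illustrative only).
# Each grid is a list of rows, and each cell is an integer color code 0-9.
example_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        # The model must predict the output grid for this input.
        # In the public files the test item also carries an "output" key
        # that is used for scoring.
        {"input": [[3, 0], [0, 3]]},
    ],
}
```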
The public evaluation dataset can be retrieved from https://github.com/fchollet/ARC-AGI/tree/master/data/evaluation
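Assuming the repository has been cloned locally, the evaluation tasks can be loaded along the lines of the following sketch; the directory path and helper name are illustrative and not part of the test scripts in this report.

```python
import json
from pathlib import Path

# Path to a local clone of the fchollet/ARC-AGI repository (adjust as needed).
EVAL_DIR = Path("ARC-AGI/data/evaluation")

def load_evaluation_tasks(directory: Path = EVAL_DIR) -> dict:
    """Load all evaluation tasks keyed by task id (the JSON file stem)."""
    tasks = {}
    for task_file in sorted(directory.glob("*.json")):
        with task_file.open() as f:
            tasks[task_file.stem] = json.load(f)
    return tasks

if __name__ == "__main__":
    tasks = load_evaluation_tasks()
    print(f"Loaded {len(tasks)} evaluation tasks")
```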
<aside> 💡 Note: The ARC-AGI $1M prize competition is restricted to locally run puzzle solvers. The test suite presented in this report queries LLMs over an internet connection, so it is not eligible for the competition. The final tests must be run in a restricted environment (Kaggle) to prevent the actual evaluation data from leaking to the internet, which keeps the tasks fresh and novel for all competing models.
</aside>
This report applies the most straightforward, vanilla prompting strategy to examine how efficiently two of the most prominent LLMs currently handle these tasks.
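As an illustration of what such a vanilla prompt could look like, here is a minimal sketch assuming the grids are serialized as JSON; the exact template and wording used by run_test.py may differ, and the function name is only illustrative.

```python
import json

def build_prompt(task: dict) -> str:
    """Assemble a plain-text prompt from a task's training pairs and test input.

    This is only a sketch of a vanilla prompting strategy; the actual prompt
    template in run_test.py may be worded differently.
    """
    lines = ["Below are input-output grid pairs that follow a hidden rule."]
    for i, pair in enumerate(task["train"], start=1):
        lines.append(f"Example {i} input: {json.dumps(pair['input'])}")
        lines.append(f"Example {i} output: {json.dumps(pair['output'])}")
    lines.append(f"Test input: {json.dumps(task['test'][0]['input'])}")
    lines.append("Respond with the test output grid as a JSON list of lists only.")
    return "\n".join(lines)
```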
https://github.com/markomanninen/ARC-AGI/tree/master/test
The test runner, run_test.py, is a Python script that interacts with both LLMs via their respective APIs. It iterates through the evaluation set, performing the following steps for each task: