A direct prompting test of OpenAI GPT-4 Omni and Anthropic Claude 3.5 Sonnet LLM inference on the public ARC-AGI evaluation data.
This report documents the evaluation of two prominent large language models (LLMs)—OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet—on a subset of the publicly available ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) dataset. This benchmark is designed to assess the ability of AI systems to efficiently learn new skills, a crucial aspect of general intelligence.
The ARC-AGI dataset consists of 800 public tasks (400 training and 400 evaluation tasks) designed to test reasoning abilities in a minimal-prior, skill-based setting. Each task is stored as JSON, with grids encoded as nested lists of integers: a sequence of input-output example pairs followed by a final test input whose output must be predicted from the pattern demonstrated in the examples. From the user's perspective, these tasks appear as geometric generalization and reasoning puzzles on a 2D grid filled with colored squares. Colors in the following examples are identified by integers between 0 and 9.
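For orientation, the sketch below shows the general shape of a task file; the grids are hypothetical miniatures invented for illustration, not taken from the actual evaluation set.

```python
# Hypothetical miniature of an ARC-AGI task (illustrative only).
# Each grid is a list of rows, and each cell is an integer color code 0-9.
example_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        # The model must predict the output grid for this input.
        # In the public files the test item also carries an "output" key
        # that is used for scoring.
        {"input": [[3, 0], [0, 3]]},
    ],
}
```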
The public evaluation dataset can be retrieved from https://github.com/fchollet/ARC-AGI/tree/master/data/evaluation
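Assuming the repository has been cloned locally, the evaluation tasks can be loaded along the lines of the following sketch; the directory path and helper name are illustrative and not part of the test scripts in this report.

```python
import json
from pathlib import Path

# Path to a local clone of the fchollet/ARC-AGI repository (adjust as needed).
EVAL_DIR = Path("ARC-AGI/data/evaluation")

def load_evaluation_tasks(directory: Path = EVAL_DIR) -> dict:
    """Load all evaluation tasks keyed by task id (the JSON file stem)."""
    tasks = {}
    for task_file in sorted(directory.glob("*.json")):
        with task_file.open() as f:
            tasks[task_file.stem] = json.load(f)
    return tasks

if __name__ == "__main__":
    tasks = load_evaluation_tasks()
    print(f"Loaded {len(tasks)} evaluation tasks")
```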
<aside> 💡 Note: The ARC-AGI $1M prize competition is restricted to locally run puzzle solvers. The test suite presented in this report queries LLMs over an internet connection, so it is not eligible for the competition. The final tests must be run in a restricted environment (Kaggle) to prevent the actual evaluation data from leaking to the internet, which keeps the tasks fresh and novel for all competing models.
</aside>
This report applies the most straightforward, vanilla prompting strategy to examine how efficiently two of the most prominent LLMs currently handle these tasks.
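As an illustration of what such a vanilla prompt could look like, here is a minimal sketch assuming the grids are serialized as JSON; the exact template and wording used by run_test.py may differ, and the function name is only illustrative.

```python
import json

def build_prompt(task: dict) -> str:
    """Assemble a plain-text prompt from a task's training pairs and test input.

    This is only a sketch of a vanilla prompting strategy; the actual prompt
    template in run_test.py may be worded differently.
    """
    lines = ["Below are input-output grid pairs that follow a hidden rule."]
    for i, pair in enumerate(task["train"], start=1):
        lines.append(f"Example {i} input: {json.dumps(pair['input'])}")
        lines.append(f"Example {i} output: {json.dumps(pair['output'])}")
    lines.append(f"Test input: {json.dumps(task['test'][0]['input'])}")
    lines.append("Respond with the test output grid as a JSON list of lists only.")
    return "\n".join(lines)
```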
https://github.com/markomanninen/ARC-AGI/tree/master/test
The test runner, run_test.py, is a Python script that interacts with both LLMs via their respective APIs. It iterates through the evaluation set, performing the following steps for each task: