This project investigates whether transformer models truly learn the underlying rules of structured data, or merely exploit statistical patterns in the training distribution.
## Contents

- [Overview](#overview)
- [Framework](#framework)
- [Results](#results)
- [Implications](#implications)
## Overview

Modern transformer models achieve strong performance across sequence tasks, but high accuracy does not necessarily imply structural understanding.
This research isolates that distinction by studying formal languages generated by deterministic finite automata (DFAs), where the true generative rules are known and unambiguous.
The central question: do models learn rules, or do they merely approximate patterns?
## Framework

DFA → Synthetic Data → Transformer Training → OOD Evaluation → Behavioral Analysis
Sequences are generated from deterministic finite automata, so the ground-truth generative rules are fully known and every valid continuation, including sequence termination, is unambiguous.
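As a concrete illustration, here is a minimal sketch of this generation step. The `DFA` class, the `sample_sequence` walk, and the even-parity example automaton are hypothetical stand-ins; the actual automata and sampling scheme used in the project are not specified in this README.

```python
import random

class DFA:
    """Tabular deterministic finite automaton (illustrative)."""
    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # {(state, symbol): next_state}
        self.start = start
        self.accepting = accepting      # states where a string may legally end

    def step(self, state, symbol):
        return self.transitions[(state, symbol)]

def sample_sequence(dfa, max_len, end_prob=0.2, rng=random):
    """Random walk over the DFA's transition graph.

    Termination is only allowed in accepting states, so the end-of-sequence
    decision depends directly on the DFA's rules.
    """
    state, seq = dfa.start, []
    for _ in range(max_len):
        if state in dfa.accepting and rng.random() < end_prob:
            break
        # Choose uniformly among symbols with a defined transition from here.
        symbol = rng.choice([s for (st, s) in dfa.transitions if st == state])
        seq.append(symbol)
        state = dfa.step(state, symbol)
    return seq

# Illustrative automaton: strings over {a, b} with an even number of a's.
parity = DFA(
    transitions={("even", "a"): "odd", ("even", "b"): "even",
                 ("odd", "a"): "even", ("odd", "b"): "odd"},
    start="even",
    accepting={"even"},
)
print("".join(sample_sequence(parity, max_len=20)))
```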
A model is said to learn rules if its predictions depend only on the current DFA state, not on the specific sequence that led there.
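This definition suggests a direct behavioral test: take two distinct prefixes that reach the same DFA state and compare the model's next-token distributions. The sketch below assumes a hypothetical `model_next_token_probs` hook into the trained model and reuses the `DFA` class from the generation sketch above.

```python
import numpy as np

def dfa_state(dfa, seq):
    """Run the DFA over a sequence and return the state it ends in."""
    state = dfa.start
    for symbol in seq:
        state = dfa.step(state, symbol)
    return state

def state_consistency_gap(model_next_token_probs, dfa, prefix_a, prefix_b):
    """Total variation distance between the model's next-token distributions
    for two prefixes that reach the same DFA state.

    A model that has learned the rules would always return 0 here.
    """
    assert dfa_state(dfa, prefix_a) == dfa_state(dfa, prefix_b)
    p = np.asarray(model_next_token_probs(prefix_a))
    q = np.asarray(model_next_token_probs(prefix_b))
    return 0.5 * np.abs(p - q).sum()
```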
## Results

- Models achieve low loss and stable perplexity on sequences similar to the training data.
- Performance degrades sharply on longer sequences, with systematic rather than random errors (see the evaluation sketch after this list).
- Failures are most evident in predicting sequence termination, which depends directly on the underlying rules.
- Models produce different predictions for sequences that correspond to the same DFA state, indicating a lack of structural understanding.
- Increasing model size does not resolve these generalization failures.
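One plausible way to surface the length-generalization failure, sketched under assumed names (`sequence_loss` standing in for the model's mean next-token loss on one sequence), is to bucket held-out sequences by length and inspect the resulting loss curve:

```python
from collections import defaultdict

def loss_by_length(sequence_loss, sequences, bucket_size=8):
    """Mean loss per length bucket; keys are the bucket lower bounds."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq) // bucket_size].append(sequence_loss(seq))
    return {b * bucket_size: sum(v) / len(v)
            for b, v in sorted(buckets.items())}

# A rule-learning model would produce a roughly flat curve; the models
# studied here instead degrade sharply beyond the training lengths.
```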
## Implications

These findings suggest that transformer models trained via next-token prediction do not reliably learn underlying rules, even in simple, controlled settings. Instead, they rely on surface-level features such as sequence length, token position, and local correlations.

More broadly, this work highlights a fundamental gap between performance and understanding, raising important questions for how we evaluate and interpret machine learning systems.