This project investigates whether transformer models truly learn the underlying rules of structured data, or merely exploit statistical patterns in the training distribution.
## Contents

- [Overview](#overview)
- [Framework](#framework)
- [Results](#results)
- [Implications](#implications)
## Overview

Modern transformer models achieve strong performance across sequence tasks, but high accuracy does not necessarily imply structural understanding.
This research isolates that distinction by studying formal languages generated by deterministic finite automata (DFAs), where the true generative rules are known and unambiguous.
The central question: do models learn rules, or do they merely approximate patterns?
## Framework

DFA → Synthetic Data → Transformer Training → OOD Evaluation → Behavioral Analysis
Sequences are generated from deterministic finite automata, so the ground-truth generative rules are fully known and every valid continuation, including sequence termination, is unambiguous.
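As a concrete illustration, here is a minimal sketch of this generation step. The `DFA` class, the `sample_sequence` walk, and the even-parity example automaton are hypothetical stand-ins; the actual automata and sampling scheme used in the project are not specified in this README.

```python
import random

class DFA:
    """Tabular deterministic finite automaton (illustrative)."""
    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # {(state, symbol): next_state}
        self.start = start
        self.accepting = accepting      # states where a string may legally end

    def step(self, state, symbol):
        return self.transitions[(state, symbol)]

def sample_sequence(dfa, max_len, end_prob=0.2, rng=random):
    """Random walk over the DFA's transition graph.

    Termination is only allowed in accepting states, so the end-of-sequence
    decision depends directly on the DFA's rules.
    """
    state, seq = dfa.start, []
    for _ in range(max_len):
        if state in dfa.accepting and rng.random() < end_prob:
            break
        # Choose uniformly among symbols with a defined transition from here.
        symbol = rng.choice([s for (st, s) in dfa.transitions if st == state])
        seq.append(symbol)
        state = dfa.step(state, symbol)
    return seq

# Illustrative automaton: strings over {a, b} with an even number of a's.
parity = DFA(
    transitions={("even", "a"): "odd", ("even", "b"): "even",
                 ("odd", "a"): "even", ("odd", "b"): "odd"},
    start="even",
    accepting={"even"},
)
print("".join(sample_sequence(parity, max_len=20)))
```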
A model is said to learn rules if its predictions depend only on the current DFA state, not on the specific sequence that led there.
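This definition suggests a direct behavioral test: take two distinct prefixes that reach the same DFA state and compare the model's next-token distributions. The sketch below assumes a hypothetical `model_next_token_probs` hook into the trained model and reuses the `DFA` class from the generation sketch above.

```python
import numpy as np

def dfa_state(dfa, seq):
    """Run the DFA over a sequence and return the state it ends in."""
    state = dfa.start
    for symbol in seq:
        state = dfa.step(state, symbol)
    return state

def state_consistency_gap(model_next_token_probs, dfa, prefix_a, prefix_b):
    """Total variation distance between the model's next-token distributions
    for two prefixes that reach the same DFA state.

    A model that has learned the rules would always return 0 here.
    """
    assert dfa_state(dfa, prefix_a) == dfa_state(dfa, prefix_b)
    p = np.asarray(model_next_token_probs(prefix_a))
    q = np.asarray(model_next_token_probs(prefix_b))
    return 0.5 * np.abs(p - q).sum()
```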
## Results

- Models achieve low loss and stable perplexity on sequences similar to the training data.
- Performance degrades sharply on longer sequences, with systematic rather than random errors (see the evaluation sketch after this list).
- Failures are most evident in predicting sequence termination, which depends directly on the underlying rules.
- Models produce different predictions for sequences that correspond to the same DFA state, indicating a lack of structural understanding.
- Increasing model size does not resolve these generalization failures.
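One plausible way to surface the length-generalization failure, sketched under assumed names (`sequence_loss` standing in for the model's mean next-token loss on one sequence), is to bucket held-out sequences by length and inspect the resulting loss curve:

```python
from collections import defaultdict

def loss_by_length(sequence_loss, sequences, bucket_size=8):
    """Mean loss per length bucket; keys are the bucket lower bounds."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq) // bucket_size].append(sequence_loss(seq))
    return {b * bucket_size: sum(v) / len(v)
            for b, v in sorted(buckets.items())}

# A rule-learning model would produce a roughly flat curve; the models
# studied here instead degrade sharply beyond the training lengths.
```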
## Implications

These findings suggest that transformer models trained via next-token prediction do not reliably learn underlying rules, even in simple, controlled settings. Instead, they rely on surface-level features such as sequence length, token position, and local correlations.

More broadly, this work highlights a fundamental gap between performance and understanding, raising important questions for how we evaluate and interpret machine learning systems.