Greater-Than Circuit in GPT-2 Small

An independent TransformerLens replication of Hanna, Liu, and Variengien's analysis of how GPT-2 Small performs a narrow "greater-than" behavior in year-completion prompts.

The canonical experiment in this repository is not general arithmetic. It asks whether GPT-2 assigns more probability to valid end years in prompts like:

The war lasted from the year 1732 to the year 17

For this prompt, the two-digit completions 33 through 99 are valid because they give an end year after 1732. The main result is that activation patching identifies MLP layers 9 and 10 as the strongest contributors, with attention layers around 7-9 routing year information into that computation.

Quick Start

git clone https://github.com/ashioyajotham/greater-than-circuit
cd greater-than-circuit

python -m venv venv
.\venv\Scripts\activate  # Windows PowerShell
# On Linux/macOS instead: source venv/bin/activate
pip install -r requirements.txt

python run_hanna_analysis.py --n_examples 50 --device cpu

On first run, TransformerLens/Hugging Face may download GPT-2 Small. A CPU run is usable for smoke tests, but larger runs are much faster on CUDA.

For a quick smoke test:

python run_hanna_analysis.py --n_examples 1 --device cpu

Expected output for the one-example smoke test (approximate):

Baseline Probability Difference: about 0.93
Top MLP layers: MLP9, MLP10, MLP8, MLP11
Top attention layers: L9, L7, L8

What This Reproduces

Primary reference:

Hanna, Liu, and Variengien (NeurIPS 2023), "How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model"

The original paper found that GPT-2 Small can perform a specific year-span completion task and that the behavior is concentrated in late MLPs, especially MLPs 9 and 10. This repo replicates that result with TransformerLens rather than the original rust-circuit stack.

This repository is best read as:

  • a runnable replication of the Hanna-style year-completion experiment
  • a small activation-patching codebase for exploring this circuit
  • a starting point for further mechanistic interpretability experiments

It is not evidence that GPT-2 has general numerical reasoning. GPT-2 Small fails many arithmetic tasks, and this circuit does not imply robust less-than, subtraction, or symbolic comparison ability.

Method

Task

The main task uses year-span prompts:

Prompt:   "The war lasted from the year 1732 to the year 17"
Target:   More probability on tokens 33-99 than on tokens 01-32
Metric:   PD = sum p(y > YY) - sum p(y <= YY)

The metric is Probability Difference (PD), where YY is the two-digit start year. GPT-2's tokenizer represents most two-digit year suffixes (the YY in 17YY) as single tokens, so PD can be read off a single next-token distribution, which keeps the intervention setup relatively clean.
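
As a rough illustration of the metric, the sketch below computes PD from a mapping of two-digit completions to probabilities. The name `probability_difference` and the toy distribution are assumptions made for this example; the repository's own implementation is `compute_probability_difference` in src/prompt_design_hanna.py, which works from logits and token ids and may differ in detail.

```python
def probability_difference(probs, start_yy):
    """PD = sum of p(y) for y > start_yy minus sum of p(y) for y <= start_yy.

    probs maps two-digit year completions (ints 0..99) to next-token
    probabilities. Hypothetical helper, not the repository's API.
    """
    above = sum(p for y, p in probs.items() if y > start_yy)
    at_or_below = sum(p for y, p in probs.items() if y <= start_yy)
    return above - at_or_below


# Toy distribution: 0.9 of the mass spread evenly over 33..99 (67 tokens),
# 0.05 on "32", and the remaining 0.05 on non-year tokens (not in the dict).
toy_probs = {y: 0.9 / 67 for y in range(33, 100)}
toy_probs[32] = 0.05

pd_value = probability_difference(toy_probs, start_yy=32)
print(f"PD = {pd_value:.2f}")  # 0.90 - 0.05 = 0.85
```

PD ranges from -1 (all year mass on invalid completions) to 1 (all mass on valid ones), so a value near 1 means the model strongly prefers end years after the start year.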

Corruption

The corrupted baseline follows the paper's 01 dataset. Clean prompts vary the start year, while corrupted prompts use 01 as the start year:

Clean:     "The war lasted from the year 1732 to the year 17"
Corrupted: "The war lasted from the year 1701 to the year 17"

Because almost all years are greater than 01, this changes the model's completion distribution in a controlled way.
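
A minimal sketch of how such a clean/corrupted pair could be built, assuming the single template shown above. `make_prompt_pair` and `TEMPLATE` are hypothetical names for illustration; the repository's YearPromptGenerator may construct its prompts differently.

```python
TEMPLATE = (
    "The war lasted from the year {century:02d}{yy:02d} "
    "to the year {century:02d}"
)


def make_prompt_pair(century, start_yy):
    """Return (clean, corrupted) prompts that differ only in the start year.

    The corrupted prompt replaces the two-digit start year with 01, as in
    the 01 dataset described above. Hypothetical helper for illustration.
    """
    clean = TEMPLATE.format(century=century, yy=start_yy)
    corrupted = TEMPLATE.format(century=century, yy=1)
    return clean, corrupted


clean, corrupted = make_prompt_pair(century=17, start_yy=32)
print(clean)      # The war lasted from the year 1732 to the year 17
print(corrupted)  # The war lasted from the year 1701 to the year 17
```

The two prompts differ in only one place, so, provided both start years tokenize into the same number of tokens, activations from the clean and corrupted runs line up position by position.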

Activation Patching

For each candidate component, the script:

  1. Runs the clean prompt and caches activations.
  2. Runs the corrupted prompt.
  3. Patches one clean activation into the corrupted run.
  4. Measures how much the Probability Difference recovers.

High recovery means the patched component is causally involved in the behavior under this intervention.
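
The four steps can be sketched on a toy stand-in model. Everything here is an assumption for illustration: `run_toy_model` is a two-component arithmetic function rather than GPT-2, its raw output stands in for Probability Difference, and normalizing recovery by the clean-minus-corrupted gap is one common convention, not necessarily what the repository reports. With TransformerLens, caching and patching are typically done with `run_with_cache` and `run_with_hooks`.

```python
def run_toy_model(x, patches=None):
    """Run a tiny two-component 'model' and return (output, activations).

    patches optionally maps a component name to a value that overrides
    that component's activation, mimicking a forward hook.
    """
    patches = patches or {}
    acts = {}
    acts["attn"] = patches.get("attn", 0.25 * x)
    acts["mlp"] = patches.get("mlp", 2.0 * acts["attn"] + 0.5 * x)
    output = acts["attn"] + acts["mlp"]
    return output, acts


clean_x, corrupted_x = 32.0, 1.0

# 1. Run the clean input and cache activations.
clean_out, clean_acts = run_toy_model(clean_x)
# 2. Run the corrupted input.
corrupted_out, _ = run_toy_model(corrupted_x)

recovery = {}
for name in ("attn", "mlp"):
    # 3. Patch one clean activation into the corrupted run.
    patched_out, _ = run_toy_model(
        corrupted_x, patches={name: clean_acts[name]}
    )
    # 4. Fraction of the clean-vs-corrupted gap that the patch restores.
    recovery[name] = (patched_out - corrupted_out) / (clean_out - corrupted_out)

# Rank components from highest to lowest recovery.
for name, frac in sorted(recovery.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {frac:.2f}")
```

In this toy, patching "mlp" restores more of the gap than patching "attn", because the attention stand-in only reaches the output partly through the MLP. A recovery near 1 means the patched component alone carries most of the clean behavior; a value near 0 means it carries little under this intervention.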

Current Results

The Hanna-style entry point reports the same qualitative structure as the paper:

Component family        Strongest layers in this implementation
MLPs                    MLP 9, MLP 10, MLP 8, MLP 11
Attention               Layers 7-9, especially layer 9
Baseline task behavior  High positive Probability Difference

The exact percentages vary with prompt sample, template choice, device, and implementation details. The important replication claim is qualitative: late MLPs, especially 9 and 10, dominate the recovered greater-than behavior.

Project Structure

greater-than-circuit/
|-- src/
|   |-- __init__.py
|   |-- model_setup.py          # TransformerLens model loading
|   |-- prompt_design.py        # Exploratory True/False comparison prompts
|   |-- prompt_design_hanna.py  # Hanna-style year-completion prompts and PD metric
|   |-- activation_patching.py  # Core patching utilities
|   |-- circuit_analysis.py     # Component ranking and summaries
|   |-- circuit_validation.py   # Exploratory validation utilities
|   `-- visualization.py        # Plotting helpers
|-- tests/
|   |-- test_model_setup.py
|   |-- test_activation_patching.py
|   `-- test_circuit_analysis.py
|-- notebooks/
|   `-- quick_start_analysis.ipynb
|-- results/                    # Generated outputs from exploratory runs
|-- main.py                     # Exploratory True/False comparison pipeline
|-- run_hanna_analysis.py       # Canonical Hanna-style replication entry point
|-- requirements.txt
`-- pyproject.toml

Use run_hanna_analysis.py for the replication result. main.py is an older exploratory pipeline for direct True/False number-comparison prompts; GPT-2 Small is weak on that task, so its results should not be used as the primary replication claim.

Programmatic Use

import torch

from src.model_setup import ModelSetup
from src.prompt_design_hanna import (
    YearPromptGenerator,
    compute_probability_difference,
    get_year_token_ids,
)

setup = ModelSetup(device="cpu")
model = setup.load_model()
year_token_ids = get_year_token_ids(model)

generator = YearPromptGenerator(seed=42)
examples = generator.generate_balanced_year_dataset(n_examples=5, template_idx=0)

for example in examples:
    tokens = model.to_tokens(example.prompt_text)
    with torch.no_grad():
        logits = model(tokens)

    final_logits = logits[0, -1, :]
    pd = compute_probability_difference(
        final_logits,
        example.start_year,
        year_token_ids,
    )
    print(f"{example.prompt_text!r} -> PD={pd:.3f}")

Testing

The test suite uses mocks for most model-facing behavior, so it is much faster than the full activation-patching run:

python -m pytest

If the Windows pytest.exe launcher is broken, prefer:

venv\Scripts\python.exe -m pytest

Limitations

  • The code replicates the year-completion circuit structure, not general arithmetic.
  • The generic True/False comparison pipeline in main.py is exploratory and can show weak baseline performance.
  • The activation-patching script patches whole attention/MLP layer outputs; it is not a full path-patching reimplementation of every analysis in the paper.
  • The mechanism inside MLPs 9 and 10 is not fully characterized here. The code identifies important components, but it does not explain exactly how those MLPs encode year order.
  • Larger runs can be slow on CPU.

References

@inproceedings{hanna2023greater,
  title={How does {GPT-2} compute greater-than?: Interpreting mathematical abilities in a pre-trained language model},
  author={Hanna, Michael and Liu, Ollie and Variengien, Alexandre},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2023},
  url={https://arxiv.org/abs/2305.00586}
}

Acknowledgments

This project builds on the original greater-than circuit work by Michael Hanna, Ollie Liu, and Alexandre Variengien, and on TransformerLens and the broader mechanistic interpretability tooling ecosystem.

License

MIT License. See LICENSE.

Citation

@software{ashioya2025greaterthan,
  title={Greater-Than Circuit in GPT-2 Small: A TransformerLens Replication},
  author={Ashioya, Jotham Victor},
  year={2025},
  url={https://github.com/ashioyajotham/greater-than-circuit},
  note={Independent replication of Hanna et al. (2023)}
}
