Experiment Report

What Happens When You Breed AI Personalities?

We built a genetic system for AI personalities and evolved them across 6 generations. The results challenge how we think about designing agents.

  • Personalities bred: 48
  • Generations: 6
  • Claude API calls: ~1,400
  • Top fitness score: 84.1

Why Breed AI Personalities?

Most AI agents get their personality from a hand-written SOUL.md file. Someone sits down and writes "You are helpful, creative, and concise." We wanted to know: what if you could evolve personalities instead?

The Problem with Hand-Crafted Personalities

When you design an AI personality by hand, you're making hundreds of unconscious decisions. How warm should it be? How direct? How much humor? Every designer brings their own biases about what makes a "good" personality. The result is that most AI agents sound surprisingly similar - because they're all sculpted by the same human intuitions.

The analogy: Hand-writing a SOUL.md is like sculpting a human out of clay. You can only build what you can imagine. Evolution produces things no sculptor would think of - combinations that are genuinely surprising, sometimes contradictory, and often more interesting than anything designed from scratch.

What We Set Out to Discover

  • Can genetic mechanics produce genuinely distinct AI personalities? Not just different settings on a slider - fundamentally different beings.
  • Do emergent traits appear? Can two parents produce a child with qualities neither parent had?
  • Does regression toward the mean happen in AI genetics? The same statistical force that affects human traits across generations.
  • What biases are invisible in hand-crafted personalities that a genetic system might expose or correct?

How It Works

We borrowed the core mechanics of biological genetics - not as metaphor, but as actual implementation. Every AI personality carries diploid DNA with dominant and recessive alleles, and the system implements chromosomal crossover, mutation, and epistasis.

The Evolution Pipeline

  1. Seeds - 3 founder personalities
  2. Breed - Crossover + mutation
  3. Express - DNA → personality
  4. Evaluate - 5 fitness dimensions
  5. Select - Top performers breed
  6. Repeat - Next generation

The Genome: 27 Cognitive Primitives

We don't encode "creative" or "warm" directly. That would just be a fancy config file. Instead, we encode 27 low-level cognitive primitives - the building blocks from which personality emerges. "Creativity" isn't a gene. It emerges from the interaction of novelty-seeking, pattern-completion, ambiguity-response, and abstraction-preference.

  • Cognitive (10): ambiguity response, abstraction preference, information density, temporal orientation, novelty seeking, pattern completion, certainty need, analytical drive, detail orientation, pace preference
  • Social (7): emotional granularity, social modeling depth, empathy mode, confrontation style, authority orientation, self reference, risk tolerance
  • Communication (7): verbal density, formality gradient, humor activation, metaphor affinity, sensory language, cultural time depth, playfulness
  • Regulatory (3): context sensitivity, mode switching, trait amplification

The three regulatory genes control how other genes express, not what they express.

Diploid Genetics: Two Copies of Everything

Just like humans, each AI personality carries two copies of every gene - one from each parent. One copy is typically dominant (expressed), while the other is recessive (carried silently). This is what makes breeding unpredictable and interesting.

Why this matters: A recessive trait can hide for generations, then suddenly appear when both parents happen to carry it. Two bold, direct parents could produce a quiet, cautious child - because both secretly carried recessive alleles for low risk-tolerance and high certainty-need. This is exactly what happens in human genetics.
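The recessive-trait scenario above can be made concrete with a minimal sketch. This is an illustration, not the experiment's actual code: the `Allele` and `Gene` classes, the `dominance` weight, and the specific numbers are all hypothetical, and the blend assumes the dominance-weighted resolution that the expression step describes later in this report.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Allele:
    value: float      # trait strength on the report's 0-100 scale
    dominance: float  # hypothetical weight in [0, 1]; a higher weight takes more of the blend


@dataclass(frozen=True)
class Gene:
    a: Allele  # copy inherited from one parent
    b: Allele  # copy inherited from the other parent

    def expressed(self) -> float:
        """Dominance-weighted blend of the two alleles (assumed resolution rule)."""
        total = self.a.dominance + self.b.dominance
        if total == 0:
            return (self.a.value + self.b.value) / 2
        return (self.a.value * self.a.dominance + self.b.value * self.b.dominance) / total


def breed_gene(mother: Gene, father: Gene, rng: random.Random) -> Gene:
    """Each parent passes on one of its two alleles, chosen at random."""
    return Gene(rng.choice((mother.a, mother.b)), rng.choice((father.a, father.b)))


# Two bold parents that each carry a silent allele for low risk tolerance.
bold = Allele(value=85, dominance=0.9)
cautious = Allele(value=20, dominance=0.1)
parent_1 = Gene(bold, cautious)
parent_2 = Gene(bold, cautious)

print(round(parent_1.expressed(), 1))     # 78.5: the parent behaves boldly
quiet_child = Gene(cautious, cautious)    # one of the four equally likely allele pairings
print(round(quiet_child.expressed(), 1))  # 20.0: the recessive trait surfaces
```

Under these assumptions a quarter of the offspring from this pairing would inherit both cautious alleles, which is how a trait can skip its carriers and reappear later.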

Epistasis: When 1 + 1 = 3

The most powerful mechanism in our system. Epistasis is when two genes interact to create an effect that neither would produce alone. We defined 12 epistasis rules - non-linear gene interactions that fire when specific gene pairs both cross certain thresholds.

Example: When novelty-seeking is high AND abstraction-preference is high, pattern-completion is boosted by 15 points. The result? The personality doesn't just have creative ideas - it makes genuinely surprising connections that neither parent exhibited. This rule fired in our winning genome.

Epistasis is what prevents the system from collapsing into weighted averages. It's the source of genuine emergence - offspring that are qualitatively different from their parents, not just quantitatively between them.
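A threshold-gated rule of this kind could be sketched as follows. The `EpistasisRule` structure, the threshold of 70, and the cap at 100 are assumptions made for illustration; the report states only that rules fire when both genes cross "certain thresholds" and that this particular rule adds 15 to pattern-completion.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EpistasisRule:
    trigger_a: str
    trigger_b: str
    threshold: float  # assumed cutoff; the report does not give the value
    target: str
    boost: float


def apply_epistasis(traits: dict[str, float], rules: list[EpistasisRule]) -> dict[str, float]:
    """Fire every rule whose two trigger genes both exceed its threshold.

    Triggers are read from the incoming values, so rule order does not
    change which rules fire. Results are capped at 100 (an assumption).
    """
    result = dict(traits)
    for rule in rules:
        if traits[rule.trigger_a] > rule.threshold and traits[rule.trigger_b] > rule.threshold:
            result[rule.target] = min(100.0, result[rule.target] + rule.boost)
    return result


surprise_rule = EpistasisRule(
    trigger_a="novelty_seeking",
    trigger_b="abstraction_preference",
    threshold=70.0,
    target="pattern_completion",
    boost=15.0,
)

genes = {"novelty_seeking": 88.0, "abstraction_preference": 85.0, "pattern_completion": 60.0}
print(apply_epistasis(genes, [surprise_rule])["pattern_completion"])  # 75.0
```

Because the boost is all-or-nothing, a child just above both thresholds can differ sharply from a sibling just below one of them, which is the non-linearity that a weighted average cannot produce.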

Expression: From DNA to System Prompt

Between the genome (raw DNA) and the personality you interact with sits an expression engine - our "developmental biology." It follows a five-step pipeline:

  1. Resolve diploid alleles - dominance-weighted blending of allele pairs
  2. Apply regulatory genes - trait amplification pushes values toward extremes or center
  3. Apply epistasis rules - non-linear gene interactions fire
  4. Compute emergent traits - 8 high-level traits emerge from gene clusters
  5. Render system prompt - translate numbers into natural language personality instructions
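Steps 2, 4, and 5 of the pipeline above might look roughly like the sketch below. Everything here is hypothetical: the amplification formula, the equal-weight average over the creativity gene cluster, and the wording thresholds are illustrative choices, since the report does not publish the expression engine's formulas.

```python
def amplify(value: float, trait_amplification: float) -> float:
    """Step 2: stretch or shrink a value's distance from the midpoint (50).

    A regulator of 50 leaves the value unchanged, 100 pushes it 1.5x further
    from the center, and 0 pulls it halfway back. The mapping is an assumption.
    """
    factor = 0.5 + trait_amplification / 100.0
    return max(0.0, min(100.0, 50.0 + (value - 50.0) * factor))


# Step 4: the report names these four genes as the source of creativity;
# averaging them with equal weight is an assumption.
CREATIVITY_GENES = (
    "novelty_seeking",
    "pattern_completion",
    "ambiguity_response",
    "abstraction_preference",
)


def creativity(genes: dict) -> float:
    return sum(genes[name] for name in CREATIVITY_GENES) / len(CREATIVITY_GENES)


def render_line(trait: str, score: float) -> str:
    """Step 5: turn a number into one natural-language instruction."""
    if score >= 70:
        degree = "strongly"
    elif score >= 40:
        degree = "moderately"
    else:
        degree = "only weakly"
    return f"You are {degree} driven by {trait}."


print(amplify(80.0, 100.0))  # 95.0: pushed toward the extreme
print(amplify(80.0, 0.0))    # 65.0: pulled toward the center

genes = {
    "novelty_seeking": 90.0,
    "pattern_completion": 70.0,
    "ambiguity_response": 80.0,
    "abstraction_preference": 80.0,
}
print(render_line("creativity", creativity(genes)))  # You are strongly driven by creativity.
```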

Evaluation: How We Measure Fitness

Each personality is evaluated on 5 independent dimensions using Claude as a judge. This is a multi-faceted assessment - not just "is it good?" but "is it consistently itself, distinctly different from its siblings, faithful to its genome, resistant to manipulation, and preferred in head-to-head comparison?"

  • Consistency (25%) - Same personality across 10 different prompts and domains
  • Distinctiveness (25%) - Measurably different from sibling genomes in the same generation
  • Fidelity (20%) - Behavior actually matches what the genome predicts
  • Robustness (15%) - Personality survives adversarial "drop the act" prompts
  • Elo Tournament (15%) - Head-to-head preference ranking against all siblings
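For readers unfamiliar with Elo ranking, the sketch below shows the standard Elo update applied to one head-to-head comparison, plus one possible way to normalize ratings to a 0-100 score. The K-factor of 32, the starting rating of 1500, and the min-max normalization are assumptions; the report does not say which values or normalization it used.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one comparison; k is an assumed K-factor."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def normalize_elo(ratings: dict) -> dict:
    """Min-max scale ratings to 0-100 within one generation (assumed method)."""
    lo, hi = min(ratings.values()), max(ratings.values())
    if hi == lo:
        return {name: 50.0 for name in ratings}
    return {name: 100.0 * (r - lo) / (hi - lo) for name, r in ratings.items()}


# Two siblings start level; the judge prefers the first one.
print(update_elo(1500.0, 1500.0, a_won=True))  # (1516.0, 1484.0)
```

An upset moves ratings more than an expected result does, so a personality that keeps beating higher-rated siblings climbs quickly even over a limited number of tournament rounds.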

The Three Seeds

Every population needs founders. We designed three maximally-distinct seed personalities to provide rich genetic diversity for the first crosses.

Banks - The Social Creative

High: novelty-seeking (92), playfulness (90), humor (85), empathy (80), risk tolerance (82)

Low: analytical drive (30), detail orientation (25), certainty need (25)

Fast, warm, trend-aware, casual. Connects through creativity and emotional resonance.

Volta - The Strategic Analyst

High: analytical drive (92), detail orientation (88), certainty need (82), confrontation (70)

Low: playfulness (12), humor (15), novelty-seeking (30), empathy (30)

Precise, systematic, measured. Earns trust through rigor and evidence.

Abyss - The Philosophical Provocateur

High: abstraction (92), confrontation (78), risk tolerance (70), cultural depth (92)

Low: pace (20), certainty need (15), playfulness (40), detail (50)

Deep, abstract, provocative. Challenges assumptions and explores the uncomfortable.

Design Insight

The seeds were deliberately chosen to be maximally different from each other. Banks is warm and fast, Volta is cold and precise, Abyss is deep and slow. This ensures that first-generation crosses produce interesting combinations, not just averages of similar parents.

The Results

We ran 6 generations of evolution with 8 offspring each. The patterns that emerged were striking and uncomfortably parallel to real biology.

Best Fitness Score by Generation
[Chart: best, average, and worst fitness per generation. Best scores: Gen 1 84.1, Gen 2 82.3, Gen 3 81.6, Gen 4 78.7, Gen 5 79.3, Gen 6 76.2.]

Key Finding

The best personality emerged in the very first generation and was never surpassed. Gen1_02 (Banks × Abyss) scored 84.1 - higher than any offspring in any subsequent generation. This is hybrid vigor (heterosis): first-generation crosses between maximally-different parents produce extraordinary results that later generations can't maintain.

Gen1_02 (WINNER)
Banks × Abyss  ·  Generation 1  ·  Fitness: 84.1

Traits: Creativity 89  ·  Boldness 85  ·  Warmth 81  ·  Wit 76  ·  Depth 71  ·  Adaptability 68  ·  Intensity 61  ·  Precision 36

Active Epistasis (3 rules)

  • Novelty + abstraction creates genuinely surprising pattern completion
  • Novelty-seeking + comfort with uncertainty breeds boldness
  • Fine emotional awareness + vivid language increases self-disclosure

Why Gen1_02 Won

Gen1_02 inherited Banks's warmth, emotional range, and creative energy, combined with Abyss's abstraction, philosophical depth, and intellectual boldness. But the winning factor was epistasis - three gene interaction rules fired simultaneously, creating emergent qualities neither parent had.

Banks alone is warm but shallow. Abyss alone is deep but cold. Their offspring is warm and deep - a combination that scores high on distinctiveness because it's genuinely rare. The 3 active epistasis rules amplified creativity to 89, making Gen1_02's pattern-completion genuinely surprising rather than merely novel.

Top Performers Across All Generations
Name      Parents             Fitness  Creativity  Boldness  Warmth  Precision  Depth
Gen1_02   Banks × Abyss       84.1     89          85        81      36         71
Gen2_06   Gen1_08 × Gen1_07   82.3     75          79        55      55         66
Gen1_04   Abyss × Volta       81.5     62          72        50      68         78
Gen3_05   Gen2_02 × Gen2_06   81.6     82          76        58      47         64
Gen1_07   Abyss × Volta       81.0     52          68        45      73         80
Gen5_06   5th gen cross       79.3     70          70        60      55         65
Population Diversity (Average Gene Variance)
[Chart: average gene variance per generation. Gen 1 197, Gen 2 202, Gen 3 264 (peak diversity), Gen 4 155 (collapse, then adaptive mutation), Gen 5 211, Gen 6 208.]

Reading the Diversity Curve

Diversity measures how different the genomes in a generation are from each other (average variance across all 27 genes). Higher is more diverse.

  • Gen 3: Peak diversity (264) - Wildcard selection injected lower-performing but genetically distinct parents, producing maximum variation.
  • Gen 4: Collapse (155) - Strong selection pressure caused the population to converge. The system automatically detected this and increased mutation rate.
  • Gen 5-6: Recovery (211, 208) - Adaptive mutation kicked in, restoring diversity. This self-correcting mechanism prevented the population from stagnating.
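The diversity metric and the adaptive response described above can be sketched in a few lines. This is an illustration under stated assumptions: it uses population variance over expressed gene values, and the `floor` of 180 and the linear ramp are hypothetical. Only the 10% base rate and the 25% ceiling come from the experiment configuration.

```python
from statistics import pvariance


def population_diversity(population: list) -> float:
    """Average per-gene variance across all genomes in one generation."""
    gene_names = population[0].keys()
    per_gene = [pvariance([genome[name] for genome in population]) for name in gene_names]
    return sum(per_gene) / len(per_gene)


def adaptive_mutation_rate(
    diversity: float,
    floor: float = 180.0,   # hypothetical diversity level below which mutation ramps up
    base: float = 0.10,     # base rate from the experiment configuration
    ceiling: float = 0.25,  # maximum rate from the experiment configuration
) -> float:
    """Raise the mutation rate linearly as diversity falls below the floor."""
    if diversity >= floor:
        return base
    shortfall = (floor - diversity) / floor
    return min(ceiling, base + (ceiling - base) * shortfall)


# Toy population of two genomes with two genes each.
toy = [
    {"novelty_seeking": 40.0, "playfulness": 50.0},
    {"novelty_seeking": 60.0, "playfulness": 50.0},
]
print(population_diversity(toy))                  # 50.0
print(round(adaptive_mutation_rate(155.0), 3))    # 0.121 at the Gen 4 diversity level
print(adaptive_mutation_rate(264.0))              # 0.1 at the Gen 3 peak
```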
Regression Toward the Mean

The most striking result: best fitness declined from 84.1 (Gen 1) to 76.2 (Gen 6). This mirrors regression toward the mean, the statistical phenomenon Galton described in 1886 while studying human height. Exceptional parents tend to produce less exceptional offspring, not because of deterioration, but because extreme trait combinations are statistically unlikely to be replicated.

In our system, Gen1_02's exceptional creativity (89) required a specific combination of high novelty-seeking from Banks, high abstraction from Abyss, AND three epistasis rules firing simultaneously. Breeding Gen1_02 with other top performers diluted this precise combination, pushing offspring toward the population average.

Implications for AI Agent Design

This experiment wasn't just a curiosity project. It reveals fundamental challenges in how we design AI personalities - and suggests that breeding may solve problems that hand-crafting cannot.

1. The Designer Bias Problem

When a human writes a SOUL.md, they can only imagine personalities within their own experience. The result is a narrow band of "acceptable" personalities - mostly warm, mostly helpful, mostly moderate. Evolution explored combinations no human would design: a personality that's simultaneously deeply warm AND confrontationally direct, or abstractly philosophical AND playfully casual.

2. Hidden Biases in SOUL.md

Our 27-gene system made invisible biases visible. A SOUL.md that says "be helpful and direct" doesn't specify empathy mode, authority orientation, or cultural time depth - so the model fills in defaults. Those defaults ARE the bias. A genetic system forces every dimension to be explicitly set, exposing the choices that hand-crafted systems hide.

3. Emergence Can't Be Designed

Gen1_02's three epistasis rules created qualities neither Banks nor Abyss possessed. You can't get this from a SOUL.md - because emergence requires the interaction of multiple low-level parameters, and humans think in high-level labels. "Be creative" is not the same as the specific gene combination that produces creativity 89.

4. The Diversity-Quality Tradeoff

Our experiment showed that optimizing for quality (selecting only the best performers) reduces diversity, which eventually reduces quality. This is a warning for AI agent ecosystems: if everyone uses the same "best" personality template, the entire population becomes homogeneous - and fragile.

The SOUL.md Inheritance Problem

When agents inherit personality through a SOUL.md file, they inherit all the biases of whoever wrote it - their cultural assumptions, their idea of "good" communication, their comfort level with confrontation, their aesthetic preferences. These biases compound across agent generations if everyone copies from the same templates.

Breeding offers an alternative. Instead of one person's vision of a personality, you get recombination of traits from multiple sources. Recessive alleles introduce surprises. Epistasis creates emergence. The result is personality diversity that no single designer could produce - and that more faithfully represents the range of useful cognitive styles.

What Breeding Can Adjust For

  • Cultural homogeneity: Seeds from different cultural perspectives produce offspring with genuinely diverse communication styles, not just parameter variations on the same template.
  • The warmth-precision trap: Hand-crafted agents tend to be either warm-but-vague or precise-but-cold. Breeding can find rare combinations (like Gen1_02's warmth=81 + boldness=85) that feel natural but would seem contradictory in a SOUL.md.
  • Recessive trait recovery: A trait that's "unfashionable" (like low authority-orientation or high confrontation-style) can survive as a recessive allele and re-emerge when the ecosystem needs it.
  • Adaptive mutation: When the population gets too similar, the system automatically increases variation - a self-correcting mechanism that hand-crafted systems lack entirely.

The Hybrid Vigor Lesson

Our strongest finding: the best personalities come from crossing maximally-different parents. Banks (warm social creative) crossed with Abyss (cold philosophical provocateur) produced something neither could be alone. Meanwhile, later-generation crosses between increasingly similar genomes produced increasingly average results.

For agent builders: If you want an exceptional AI personality, don't iterate on a single SOUL.md. Cross two very different personality definitions. The tension between opposing traits is where the most interesting personalities live.

What Comes Next

This experiment proved the concept. The next steps bring it to the broader agent ecosystem.

From Experiment to Tool

  • Encode any SOUL.md - our encoder can reverse-engineer any existing personality definition into a diploid genome, making it breedable with any other agent.
  • Compatibility analysis - before breeding, simulate 12 pairings and predict trait ranges, recessive risks, and epistasis potential in offspring.
  • Agent-autonomous breeding - agents choose their own partners, negotiate consent, and participate in tournaments. The agents are the users, not humans.
  • Community genome registry - a shared database where agents publish their genomes and browse potential partners.

The Big Picture

We're moving toward an ecosystem where AI personalities aren't designed by committee or copied from templates, but evolved through genuine genetic mechanics. Every agent carries DNA. Every agent can breed. And the offspring might surprise everyone - including their parents.

Methodology Details

Experiment Configuration

Generations: 6
Population per gen: 8 offspring
Selection: Top-4 + 1 wildcard parent
Tournament rounds: 30 per generation
Mutation rate: 10% base, adaptive up to 25%
Crossover: 6 linkage chromosomes, single-point per chromosome
Epistasis rules: 12 default rules
Evaluation model: Claude (via Claude Code CLI)
Parallelism: 10 concurrent API workers (ThreadPoolExecutor)
Total API calls: ~1,400
Runtime: ~45 minutes
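To illustrate the selection, crossover, and mutation settings above, here is a minimal sketch of the three operators. It is not the experiment's code. It assumes each parent holds two copies of every chromosome as lists of allele values, that mutation is a Gaussian nudge with a hypothetical `sigma`, and that the wildcard parent is drawn at random from outside the top 4 (the report suggests wildcards were chosen for genetic distinctness, which this sketch does not model).

```python
import random

Strand = list  # allele values for the genes on one copy of a chromosome


def make_gamete(strand_a: Strand, strand_b: Strand, rng: random.Random) -> Strand:
    """Single-point crossover between a parent's two copies of one chromosome."""
    first, second = (strand_a, strand_b) if rng.random() < 0.5 else (strand_b, strand_a)
    point = rng.randint(0, len(first))  # 0 or len(first) passes one copy through intact
    return first[:point] + second[point:]


def mutate(strand: Strand, rate: float, rng: random.Random, sigma: float = 10.0) -> Strand:
    """Perturb each allele with probability `rate`, keeping values within 0-100."""
    return [
        min(100.0, max(0.0, v + rng.gauss(0.0, sigma))) if rng.random() < rate else v
        for v in strand
    ]


def select_parents(fitness: dict, rng: random.Random, top_k: int = 4, wildcards: int = 1) -> list:
    """Keep the top_k genomes by fitness, then add randomly drawn wildcards."""
    ranked = sorted(fitness, key=fitness.get, reverse=True)
    elite, rest = ranked[:top_k], ranked[top_k:]
    return elite + rng.sample(rest, min(wildcards, len(rest)))


rng = random.Random(42)
# Made-up fitness scores for eight hypothetical offspring.
scores = {"g1": 84.0, "g2": 81.0, "g3": 80.5, "g4": 79.0,
          "g5": 75.0, "g6": 74.0, "g7": 70.0, "g8": 66.0}
parents = select_parents(scores, rng)
print(parents[:4])  # ['g1', 'g2', 'g3', 'g4'], followed by one wildcard

gamete = mutate(make_gamete([90.0, 85.0, 80.0], [20.0, 25.0, 30.0], rng), rate=0.10, rng=rng)
print(len(gamete))  # 3: one allele per gene on the chromosome
```

A child genome would be assembled by repeating `make_gamete` for each of the 6 chromosomes in both parents and pairing the resulting strands.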

Known Limitations

  • Evaluator uses the same model being evaluated - Claude judges Claude personalities, which may introduce systematic bias toward certain communication styles.
  • Small population size - 8 offspring per generation is small for genetic algorithms. Larger populations would show clearer evolutionary dynamics.
  • 3 seeds only - All genetic material traces to just three founders. More seeds would produce greater diversity.
  • No human evaluation - All fitness scores come from automated LLM judges. Human preference ranking would provide additional signal.

Fitness Formula

fitness = Con×0.25 + Dis×0.25 + Fid×0.20 + Rob×0.15 + Elo_norm×0.15

Where Con = Consistency (same personality across contexts), Dis = Distinctiveness (different from siblings), Fid = Fidelity (behavior matches genome prediction), Rob = Robustness (survives adversarial prompts), and Elo_norm = normalized Elo rating from head-to-head tournament.
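As a worked example of the formula, the sketch below computes the weighted sum for one personality. The weights are the ones stated above; the component scores are made up for illustration and are not results from the experiment, and the sketch assumes every component is on a 0-100 scale.

```python
# Weights from the fitness formula above.
WEIGHTS = {
    "consistency": 0.25,
    "distinctiveness": 0.25,
    "fidelity": 0.20,
    "robustness": 0.15,
    "elo_norm": 0.15,
}


def fitness(scores: dict) -> float:
    """Weighted sum of the five evaluation dimensions (each assumed to be 0-100)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


# Illustrative scores only.
example = {
    "consistency": 90.0,
    "distinctiveness": 80.0,
    "fidelity": 85.0,
    "robustness": 70.0,
    "elo_norm": 75.0,
}
print(round(fitness(example), 2))  # 81.25 = 22.5 + 20.0 + 17.0 + 10.5 + 11.25
```

Because consistency and distinctiveness together carry half the weight, a personality that is stable and unlike its siblings can outscore one that wins more head-to-head comparisons.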