ElephantFormer - AI-Powered Elephant Chess Engine

Open Table of contents

🎯 Project Overview
🏗️ System Architecture
📊 Dataset & Data Pipeline
⚡ Implementation Highlights
📊 Results & Performance
🔬 Research Discovery: Training Sequence Length Effects
🧩 Technical Challenges & Solutions
🎮 Interactive Demo & Usage
🔗 Links & Resources
What I Learned
🔮 Future Work & To-Do
- 🎯 Model Development
- 📱 Cross-Platform Deployment

🎯 Project Overview

ElephantFormer is a sophisticated AI system that learns to play Elephant Chess using modern Transformer architecture. Unlike traditional chess engines that rely on hand-crafted evaluation functions and minimax algorithms, ElephantFormer learns strategic patterns directly from game data through deep learning.

Motivation & Prior Work

While powerful traditional engines like Pikafish — a state-of-the-art xiangqi engine adapted from Stockfish that combines alpha-beta search with neural network evaluation — dominate competitive play, this project explores an alternative approach: end-to-end learning of chess strategy using pure Transformer architecture.

By treating chess as a sequence modeling problem, ElephantFormer aims to capture the nuanced patterns of strategic play without explicit game tree search, offering insights into how modern language model architectures can be adapted for complex strategic games.

4 Output Heads

41 Token Vocabulary

3+ Evaluation Metrics

100% Legal Move Compliance

Key Innovation

Transforms complex board game moves into a sequence prediction problem, representing each move as a 4-tuple (from_x, from_y, to_x, to_y) and training the model to predict the next logical move given game history.

Technical Approach

Implements a GPT-style transformer with four separate classification heads, each predicting one component of the move coordinates, ensuring legal move generation through game engine integration.

🏗️ System Architecture

Data Pipeline Flow

PGN Files → ICCS Parsing → Token Sequences → Transformer → Move Prediction → Legal Filtering

Core Components

🔤 Tokenization Layer

Input Processing: Converts ICCS move notation into unified token vocabulary
Special Tokens: <start>, <pad>, <unk> for sequence management
Move Encoding: Each move becomes 4 consecutive tokens representing coordinates

🧠 Transformer Core

Architecture: Multi-layer transformer encoder with causal masking
Embeddings: Token + positional embeddings for sequence understanding
Attention: Self-attention mechanism learns move patterns and strategies

🎯 Output Heads

Four Classifiers: Separate heads for from_x, from_y, to_x, to_y
Legal Filtering: Scores only valid moves from game engine
Move Selection: Highest-scoring legal move becomes the prediction

Example Tokenization Process

move = "H2-E2"  # ICCS notation
coords = (7, 2, 4, 2)  # Parse to coordinates
tokens = [fx_7, fy_2, tx_4, ty_2]  # Convert to token IDs
sequence = [<start>, fx_7, fy_2, tx_4, ty_2, ...]  # Build sequence

📊 Dataset & Data Pipeline

Dataset Composition

The model is trained on a comprehensive dataset of 41,738 professional Elephant Chess games sourced from:

World Xiangqi Federation games (41,743 games, ICCS format)
High-quality tournament and professional play data
Games parsed from standardized ICCS coordinate notation

Sequence Length Distribution

The dataset exhibits interesting characteristics when tokenized using the 1 + (num_moves-1)*4 scheme:

Percentile	Sequence Length	Description
50th (Median)	305 tokens	Typical game length
90th	545 tokens	Longer strategic games
99th	869 tokens	Very long endgames
Max	1,593 tokens	Exceptional marathon games

Key Statistics:

Mean length: 339.25 tokens
Range: 1-1,593 tokens (accommodating everything from quick wins to complex endgames)
Training considerations: Variable-length sequences handled via dynamic batching

Data Processing Pipeline

PGN Parsing: ICCS coordinate moves converted to (from_x, from_y, to_x, to_y) format
Tokenization: Each move represented as 4 tokens + START_TOKEN
Quality Filtering: Games with parsing errors or invalid moves removed
Train/Val Split: Subset selection for fast prototyping and experimentation

Why This Dataset Distribution Matters

The long tail of sequence lengths (99.9th percentile at 1,213 tokens) demonstrates the model’s ability to handle:

Short tactical games: Quick decisive victories
Standard games: Typical tournament-length matches
Complex endgames: Extended positional play requiring long-term planning

This distribution directly influenced architectural decisions like context window size and memory-efficient attention mechanisms.

⚡ Implementation Highlights

Modular Design

Data Layer: PGN parsing, tokenization utilities
Model Layer: Transformer architecture, training modules
Engine Layer: Game logic, move validation
Evaluation Layer: Performance metrics, win rate testing

PyTorch Lightning Integration

Training: Automated training loops with checkpointing
Validation: Early stopping and metric tracking
Scalability: Easy GPU/CPU switching and distributed training
Callbacks: Model checkpointing and learning rate scheduling

Game Engine Integration

Custom Elephant Chess engine with complete rule implementation:

# Legal move validation ensures AI always plays valid moves
legal_moves = game_engine.get_legal_moves()
for move in legal_moves:
    score = calculate_move_score(model_output, move)
best_move = max(legal_moves, key=lambda m: calculate_move_score(model_output, m))

Features include:

Perpetual check and chase detection
Traditional draw claim system
Move highlighting and game replay
ICCS notation parsing and validation

Training Strategy

Loss Function: Sum of CrossEntropyLoss from four output heads
Teacher Forcing: Use true previous moves during training
Data Splits: Configurable train/validation/test ratios
Optimization: AdamW optimizer with learning rate scheduling

📊 Results & Performance

Evaluation Metrics

Metric	Score	Description
Prediction Accuracy	12.49%	Exact move prediction (all 4 components)
Perplexity	683.05	Model confidence and pattern understanding
Win Rate vs Random	46.2%	Slightly below random baseline (1,241 games comprehensive analysis)

Comprehensive Evaluation Suite

# Accuracy testing
python -m elephant_former.evaluation.evaluator \
    --model_path checkpoints/best_model.ckpt \
    --pgn_file_path data/test_split.pgn \
    --metric accuracy \
    --device cuda

# Win rate against random opponent
python -m elephant_former.evaluation.evaluator \
    --metric win_rate \
    --num_win_rate_games 100 \
    --max_turns_win_rate 150

Model Performance Analysis

The model at training epoch 22 (validation loss: 6.36) achieved a 12.49% prediction accuracy on the test set, correctly predicting 7,027 out of 56,277 moves across 642 games. While this accuracy reflects the challenging nature of exact move prediction in chess, the model demonstrates several key capabilities:

Win Rate Performance: In comprehensive gameplay evaluation against random opponents across 1,241 games, the model achieved:

Overall Win Rate: 46.2% (162 wins out of 351 decisive games)
Strategic Consistency: Maintained performance across variable game lengths
Decisive Gameplay: 28.3% decisive game rate with strong win percentage when games reach resolution

Key Performance Insights:

46.2% win rate is slightly below random baseline (~50%), indicating room for improvement in overall strategy
Significant performance variation across game length ranges with statistical significance - the key research finding
Strategic adaptation: Performance varies dramatically by game phase, with peak effectiveness in mid-length games (129-192 moves reaching 63.4%)
Large-scale validation: Results based on comprehensive 1,241-game analysis ensuring statistical reliability

While overall performance is slightly below random baseline, the significant variation across game length ranges suggests the model has learned some strategic patterns, particularly excelling in specific game phases.

Pattern Recognition: The perplexity score of 683.05 indicates the model has learned meaningful chess patterns, though there’s room for optimization in future iterations.

🔬 Research Discovery: Training Sequence Length Effects

Counterintuitive Performance Boundary

Through comprehensive analysis of 1,241 games across different game lengths, I discovered a surprising relationship between the model’s training sequence length (128 moves = 512 tokens) and its strategic performance:

Game Length Range	Win Rate	Sample Size	Statistical Significance
0-64 moves	50.6%	85 games	Strong early game
65-128 moves	25.3%	99 games	Training boundary cliff
129-192 moves	63.4%	134 games	Peak performance
193-256 moves	38.5%	13 games	Gradual decline
257+ moves	20.0%	20 games	Severe degradation

Performance by Game Length Range

Key Research Insights

🎯 The Training Boundary Paradox (p < 0.0001):

Performance cliff at training limit: Win rate drops to 25.3% when approaching the 512-token training boundary
Peak performance beyond training: 63.4% win rate at 129-192 moves - counterintuitively the model performs best slightly beyond its training sequence length
Complete collapse: Performance degrades to 20% beyond 257 moves (2× training length)

📊 Statistical Rigor:

65-128 vs 129-192 moves: p < 0.0001 (***), Odds Ratio = 0.19
Early game vs training boundary: p = 0.0004 (***), Odds Ratio = 3.03
Peak vs far beyond training: p = 0.0004 (***), Odds Ratio = 6.94

🧠 Strategic Implications:

Context Window Limitation: Model struggles when approaching its 512-token training limit
Sweet Spot Discovery: Games ending at 129-192 moves show optimal strategic resolution
Architectural Insight: Transformer sequence length limits create unexpected performance boundaries in strategic domains
Training Methodology: Traditional fixed-length training may not be optimal for strategic games

Research Impact

This discovery demonstrates how transformer architecture constraints manifest in strategic gameplay, with implications for:

Chess AI Training: Models may benefit from varied sequence length training
Game AI Deployment: Consider time controls that favor the model’s “sweet spot”
Transformer Research: Evidence of performance boundaries tied to training sequence limits

Publication Potential: This counterintuitive finding challenges assumptions about transformer sequence length effects and provides novel insights for the game AI research community.

Key Achievements

✅ Successfully trains on complex game sequences
✅ Generates 100% legal moves through engine integration
✅ Discovered counterintuitive performance boundaries tied to transformer sequence length limits
✅ Handles variable-length game sequences effectively
✅ Achieved statistical significance in comprehensive 1,241-game analysis
✅ Peak performance of 63.4% win rate in optimal game length range (129-192 moves)
✅ Scalable architecture for different model sizes
✅ Research-quality evaluation methodology with proper statistical rigor
✅ Novel insights into transformer architecture constraints in strategic domains

🧩 Technical Challenges & Solutions

Challenge: Move Representation

Problem: Converting 2D board moves into transformer-compatible sequences
Solution: Designed unified token vocabulary representing each move as 4 consecutive tokens (fx, fy, tx, ty), enabling the model to learn coordinate relationships while maintaining sequence structure.

Challenge: Legal Move Enforcement

Problem: Ensuring AI never makes illegal moves despite free-form generation
Solution: Integrated game engine to filter predictions - model generates logits for all possible moves, but only legal moves are scored and selected, guaranteeing valid gameplay.

Challenge: Variable Sequence Lengths

Problem: Games have different lengths, complicating batch training
Solution: Implemented custom collate function with padding tokens and attention masks, allowing efficient batching while preserving sequence information.

Challenge: Multi-Output Architecture

Problem: Predicting 4 coordinate components simultaneously
Solution: Designed 4 separate classification heads sharing the same transformer backbone, with combined loss function ensuring all components are learned jointly.

🎮 Interactive Demo & Usage

Quick Setup

git clone https://github.com/SumYg/ElephantFormer.git
cd ElephantFormer
uv install

# Run interactive game demo
uv run python -m demos.quick_replay_demo.py

# Train your own model
uv run python train.py --pgn_file_path data/sample_games.pgn --max_epochs 10

# Test against the AI
uv run python -m elephant_former.inference.generator \
    --model_checkpoint_path checkpoints/best_model.ckpt

Available Demos

Game Replay: Visualize games with move highlighting
Perpetual Rules: See complex rule enforcement in action
Interactive Play: Play against the trained model
Training Visualization: Monitor model learning progress

Key Features

Real-time inference: Fast move prediction
Move validation: 100% legal move guarantee
Game analysis: Replay with strategic insights
Configurable difficulty: Adjust model parameters

🔗 Links & Resources

GitHub Repository: ElephantFormer
Documentation: Complete setup and usage guides included
Demo Video: ElephantFormer’s self-play capabilities - the trained AI model playing a complete game of Elephant Chess against itself, demonstrating learned strategic patterns and legal move generation in real-time.

What I Learned

This project pushed me to solve several complex problems at the intersection of game AI and modern NLP techniques, ultimately leading to an unexpected research discovery:

Sequence Modeling for Games: Learning how to represent spatial board game moves as sequences suitable for transformer architecture
Multi-Output Neural Networks: Designing and training models with multiple classification heads while maintaining consistency
Game Engine Integration: Ensuring AI-generated moves are always legal through real-time validation
Production ML Pipeline: Building complete train/evaluate/inference pipeline with proper checkpointing and evaluation
Statistical Analysis & Research: Conducting rigorous performance analysis that revealed counterintuitive findings about transformer sequence length boundaries

The most challenging aspect was balancing the model’s creative freedom with the strict constraints of legal gameplay - a problem that taught me valuable lessons about constrained generation in AI systems.

Key Research Insight: Through comprehensive evaluation of 1,241 games, I discovered that the model’s performance varies dramatically with game length in unexpected ways - performing worst near its training sequence boundary (65-128 moves: 25.3% win rate) but best just beyond it (129-192 moves: 63.4% win rate). This finding challenged my assumptions about transformer capabilities and taught me that sometimes the most valuable discoveries come from thorough analysis of apparent failures.

Scientific Methodology: This project taught me the importance of comprehensive evaluation and statistical rigor in AI research. What began as performance optimization became a research contribution demonstrating how architectural constraints manifest in strategic domains - a reminder that understanding failure modes can be as valuable as achieving high performance.

🔮 Future Work & To-Do

🎯 Model Development

Benchmark against Pikafish: Evaluate ElephantFormer’s performance against the state-of-the-art Pikafish engine to establish competitive baseline metrics
PPO Integration: Explore the effectiveness and potential of using Proximal Policy Optimization (PPO) in offline reinforcement learning settings for strategic improvement

📱 Cross-Platform Deployment

Mobile Application: Deploy the trained model on iOS/Android platforms for portable xiangqi gameplay
Web Interface: Create a browser-based implementation for accessible online play
Model Optimization: Optimize model size and inference speed for resource-constrained environments
Real-time Performance: Ensure smooth gameplay experience across different devices and platforms

This project represents my exploration into applying modern NLP techniques to traditional game AI problems, demonstrating both technical depth and practical engineering skills.