System Architecture

Public Binance trade data is converted into discrete event tokens, modeled with a compact Llama-style decoder, and evaluated on chronological holdout data.

Positioning

Generative order-flow modeling from public market data.

Mega-Trading asks a simple question: can market events be modeled like language tokens? It converts public Binance trade data into discrete event tokens, trains a compact Llama-style decoder, and compares predictions and generated event statistics against held-out chronological data.

public Binance trade data | one integer token per event | Llama-style decoder | chronological backtest | GitHub: github.com/duoan/mega-trading

Backtest loss

2.3379

Teacher-forced next-token loss on held-out chronological tokens.

Backtest perplexity

10.3592

exp(loss), capped in code to avoid overflow.

Top-1 / Top-5

49.23% / 69.42%

Token accuracy over 524,288 scored tokens.

Throughput

108K tok/s

Measured in the last submitted metric window; 111K tok/s over the last 50 points.

Take-Home Scope

The assignment was intentionally open-ended: build a non-trivial system, define the problem, decide what success means, and show how far the implementation goes. Mega-Trading frames that requirement as an end-to-end system for generative market-event modeling.

What I chose to build

A public-data prototype that treats market events like language tokens: ingest Binance trades, tokenize events, train a Transformer, evaluate predictions, and package the result as a reviewer-facing report.

End-to-End System

The project covers data ingestion, feature engineering, tokenizer fitting, dataset splits, model architecture, distributed-ready training, MLflow tracking, backtesting, generated-event analysis, and static deployment artifacts.

Success Criteria

Success is not only model accuracy. It includes reproducible data lineage, held-out chronological evaluation, baseline comparisons, training throughput, readable artifacts, and clear limitations for the next iteration.

Version Control and Evidence

Reviewing the repository stays simple because configs, source code, tests, reports, and generated Space assets are kept separate. The showcase artifacts are covered by tests, so the presentation stays tied to real metrics instead of hand-written claims.

Motivation and Fit

I chose this project because Deeter Analytics focuses on turning market data into decisive action, and my background is building end-to-end foundation-model systems with researchers and production teams.

Company context

Deeter Analytics describes its mission as transforming complex market data into actionable insights with clarity, precision, and transparent methodology. This project targets the same problem shape: convert noisy market data into measurable model outputs that can support downstream decisions. Company site.

My background

I work as a research engineer on end-to-end model systems, partnering with researchers to move models from experiments into production. My current work on Amazon Nova foundation models shaped the choice to build a real data-to-training-to-evaluation pipeline instead of a narrow demo.

Why this project

A market-event foundation-model prototype lets me show the skills the take-home prompt is trying to reveal: problem framing, data processing, model implementation, training infrastructure, optimization, evaluation, and reviewable artifacts.

System Story

Public trades

Binance public archives provide downloadable historical trades.

Event fields

Each event keeps direction, relative price, depth, size, and elapsed time.

Event token

Quantile buckets turn each event into one integer token ID.

Decoder model

A Llama-style causal Transformer learns next-token probabilities.

Backtest

Held-out chronological slices measure loss, perplexity, and top-k accuracy.

Rollout facts

Generated price-depth tokens are compared with real held-out distributions.
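The backtest metrics named above can be sketched in a few lines. This is a minimal illustration, not the repository's evaluation code: the function names and the cap value `max_loss=20.0` are assumptions, and the toy probabilities are invented for the example.

```python
import math

def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the true next token (teacher-forced)."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def perplexity(loss, max_loss=20.0):
    """exp(loss) with the exponent capped; the cap value is hypothetical."""
    return math.exp(min(loss, max_loss))

def top_k_accuracy(probs, targets, k):
    """Fraction of positions whose target is among the k most probable tokens."""
    hits = 0
    for p, t in zip(probs, targets):
        top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
        hits += t in top
    return hits / len(targets)

# Toy predictions over a 3-token vocabulary (illustrative values only).
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1]]
targets = [0, 1, 1]
loss = cross_entropy(probs, targets)
print(loss, perplexity(loss))
print(top_k_accuracy(probs, targets, 1), top_k_accuracy(probs, targets, 2))
```

Applying `perplexity` to the reported loss of 2.3379 gives a value close to the reported 10.3592; the small difference comes from the loss being rounded to four decimals.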

Dataset Evidence

Chronological split

Prepared from 62,012,828 total token windows: train 54,695,345, validation 1,116,216, and backtest 6,201,267. Validation and backtest are held out chronologically within each ticker.
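As a quick arithmetic check, the three split counts add up to the stated total, and the shares work out to roughly 88% train, 2% validation, and 10% backtest. The dictionary below is only a convenience for this check, not a structure from the repository.

```python
# Split sizes as reported in the dataset summary.
splits = {"train": 54_695_345, "validation": 1_116_216, "backtest": 6_201_267}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(total)
for name, share in shares.items():
    print(f"{name:>10}: {share:.1%}")
```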

Symbol coverage

32 USDT symbols, 1,096 symbol-month partitions, 2023-05 through 2026-04. Top symbols dominate sequence volume, as expected for liquid public spot trades.

Model Architecture

Why one token per event?

Each market event is encoded as one integer token, similar to how an LLM encodes text. This makes the task easy to explain: given recent events, predict the next event.
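One way to picture the one-token-per-event encoding is quantile bucketing per field followed by a mixed-radix pack. The sketch below is an assumption-laden illustration: the field set is reduced to side, price offset, and size, the bucket counts are arbitrary, and `fit_edges`, `bucket`, `encode`, and `decode` are hypothetical names rather than the repository's tokenizer API. It also assigns an ID to every bucket combination, whereas the reported vocabulary counts only realized tokens.

```python
from bisect import bisect_right
from statistics import quantiles

def fit_edges(values, n_buckets):
    """Interior quantile cut points that split `values` into n_buckets bins."""
    return quantiles(values, n=n_buckets, method="inclusive")

def bucket(value, edges):
    """Index of the quantile bin that `value` falls into (0 .. len(edges))."""
    return bisect_right(edges, value)

def encode(side, price_bps, size, edges_price, edges_size):
    """Pack (side, price bucket, size bucket) into a single integer token."""
    n_price = len(edges_price) + 1
    n_size = len(edges_size) + 1
    return (side * n_price + bucket(price_bps, edges_price)) * n_size + bucket(size, edges_size)

def decode(token, edges_price, edges_size):
    """Invert `encode` back to (side, price bucket, size bucket)."""
    n_price = len(edges_price) + 1
    n_size = len(edges_size) + 1
    rest, size_bucket = divmod(token, n_size)
    side, price_bucket = divmod(rest, n_price)
    return side, price_bucket, size_bucket

# Toy fitting sample (invented values): 4 price buckets and 4 size buckets.
edges_price = fit_edges([-5, -2, -1, 0, 0, 1, 2, 5], 4)
edges_size = fit_edges([0.1, 0.2, 0.5, 1, 2, 5, 10, 50], 4)
token = encode(1, 1.5, 0.3, edges_price, edges_size)
print(token, decode(token, edges_price, edges_size))
```

Decoding recovers bucket indices, not the original continuous values, so the round trip is lossy by design.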

Why Llama 3-style?

The decoder uses standard modern LLM components: RoPE, RMSNorm, grouped-query attention, SwiGLU, and tied embeddings. The report cites Llama and Transformer references for this choice.

Why this model size?

The submitted baseline uses 20 layers, hidden size 1024, 16 attention heads, and 4 KV heads: about 226.1M trainable parameters. It is large enough to exercise a real Transformer pipeline, but small enough for one high-memory GPU.
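The parameter figure can be reproduced with a back-of-the-envelope count. The sketch assumes no bias terms, one tied embedding matrix sized to the 579-token vocabulary, two RMSNorm weights per layer plus a final norm, and the SwiGLU width of 2816 from the component table; `count_parameters` is a hypothetical helper, not a function from the repository.

```python
def count_parameters(vocab=579, hidden=1024, layers=20, heads=16, kv_heads=4, ffn=2816):
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q and O, then K and V
    feed_forward = 3 * hidden * ffn                        # SwiGLU gate, up, down
    norms = 2 * hidden                                     # two RMSNorm weights per layer
    per_layer = attention + feed_forward + norms
    embedding = vocab * hidden                             # shared with the output head
    final_norm = hidden
    return layers * per_layer + embedding + final_norm

total = count_parameters()
print(f"{total:,} parameters, about {total / 1e6:.1f}M")
```

Under these assumptions the count comes to 226,078,720, which matches the reported figure of about 226.1M.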

Why GQA and RoPE?

Grouped-query attention lowers KV projection cost while preserving multiple query heads. RoPE avoids learned absolute tables and works naturally with fixed 512-token causal windows.
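To make the GQA saving concrete, the sketch below compares 4 KV heads with a full 16-head layout for the submitted shape. The helper names are hypothetical, bias terms are assumed absent, and the cache estimate assumes 2-byte (bf16) values for a single 512-token sequence.

```python
def kv_projection_params(hidden, heads, kv_heads):
    """Weights in the K and V projections of one attention layer (no biases)."""
    head_dim = hidden // heads
    return 2 * hidden * kv_heads * head_dim

def kv_cache_bytes(layers, context, hidden, heads, kv_heads, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    head_dim = hidden // heads
    return 2 * layers * context * kv_heads * head_dim * bytes_per_value

gqa_params = kv_projection_params(1024, 16, 4)
mha_params = kv_projection_params(1024, 16, 16)
gqa_cache = kv_cache_bytes(20, 512, 1024, 16, 4)
mha_cache = kv_cache_bytes(20, 512, 1024, 16, 16)
print(gqa_params, mha_params, mha_params // gqa_params)
print(gqa_cache, mha_cache, mha_cache // gqa_cache)
```

Both the K/V projection weights and the per-sequence cache shrink by the ratio of query heads to KV heads, here a factor of 4.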

Component	Submitted value	Reason
Vocabulary	579 realized tokens	Fitted quantile buckets over action, side, price/depth, size, and time plus special tokens.
Context	block_size=512, stride=256	Balances event context with large sample count and tractable attention cost.
Attention	heads=16, kv_heads=4, backend=triton	GQA plus local fused causal attention for GPU efficiency.
Feed-forward	SwiGLU intermediate_dim=2816	Llama-family FFN ratio, rounded to hardware-friendly width.
Dropout	0.0	The Triton fused attention path currently targets no-dropout causal GQA.
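The context row above (block_size=512, stride=256) implies overlapping training windows. A minimal sketch of that windowing follows, assuming each window carries one extra token so targets are the inputs shifted by one; `make_windows` is a hypothetical helper and the repository may slice its token streams differently.

```python
def make_windows(tokens, block_size=512, stride=256):
    """Return (inputs, targets) pairs; targets are inputs shifted by one token."""
    windows = []
    for start in range(0, len(tokens) - block_size, stride):
        chunk = tokens[start:start + block_size + 1]
        windows.append((chunk[:-1], chunk[1:]))
    return windows

# Toy stream: 1,281 consecutive integers standing in for event tokens.
stream = list(range(1281))
windows = make_windows(stream)
print(len(windows), windows[1][0][0], windows[1][1][0])
```

With a stride of half the block size, each token away from the stream edges appears in two windows, which doubles the sample count relative to non-overlapping slices.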

Training System

Loss curves

Final train loss is 3.3414 and validation loss is 2.3500; the backtest loss is 2.3379.

Optimization throughput

Preloaded NumPy arrays, pinned workers, non-blocking transfer, torch.compile, and Triton attention reduce data-wait and kernel overhead after warmup.

Engineering Design Evidence

The repository includes design and profiling documents that make the implementation reviewable beyond the final metrics. These documents show how data contracts, kernels, training configs, profiling, and distributed execution were reasoned about.

Data plane design

docs/data-plane-design.md defines the raw Binance trade path, bounded-memory streaming prepare, tokenizer artifacts, and chronological split contract. This is the lineage layer behind every reported metric.

Training plane design

docs/training-plane-design.md records the training inputs, checkpoint/manifest outputs, DDP/FSDP-ready execution, and backtest handoff. It shows how data, model, metrics, and evaluation are kept consistent.

Custom kernels

docs/kernels.md documents three Triton paths. Causal GQA attention forward: 156.23 approximate TFLOP/s vs 148.49 for PyTorch SDPA on the RTX-shape bf16 benchmark (backward bottleneck logged honestly). RoPE and SwiGLU gate forward+backward beat PyTorch on the target shape. A Liger-Kernel-style fused linear cross-entropy path drops loss-head peak memory from 509 MB to 126 MB (~4×) at PyTorch-parity latency, with the failed hand-written fused-matmul attempts kept as a cautionary tuning history.

Performance profiling

docs/performance-profiling.md captures an end-to-end PyTorch profiler investigation. The baseline CPU optimizer window was 323.645 ms over two optimizer updates; after shape-bucketed Muon and ns_steps=3 it dropped to 20.018 ms, roughly a 16× reduction.

Training configs

docs/training-configs.md explains Mac, RTX, Modal, and default environments. The same project can run a smoke test, single-server CUDA training, or Modal GPU training through config-driven entry points.

Systems alignment

docs/training-systems-alignment.md maps the project to training-system concerns: throughput, GPU utilization, memory pressure, I/O bottlenecks, checkpoint reliability, failure recovery, and learning per unit of compute.

Ablation Study

The ablation pass turns engineering choices into testable hypotheses: capacity should help on a fixed slice, adding tickers should increase difficulty at the same budget, and longer training should be judged by both likelihood and rollout realism.

Hypotheses

Model capacity: small should beat micro on the same one-ticker data.
Data size: two tickers should be harder than one at 20 steps.
Training budget: 100 steps should improve likelihood, but not necessarily generated-event quality.

Probe: BTCUSDT and ETHUSDT, April 2026, block_size=128, stride=4096, 16 backtest batches, 5 decoded examples per run.

Conclusions

The cleanest result is capacity: one-ticker small improves backtest loss from 3.060 to 2.114 and price-depth L1 from 0.632 to 0.146. Multi-ticker training is harder: token accuracy can collapse even when marginal rollout statistics look plausible. The 100-step follow-up improves loss but worsens L1, so future selection needs both open-loop and rollout gates.

[Figure: ablation study charts for backtest loss, price-depth L1, and Top-1 accuracy]

Data	Model	Steps	Train loss	Val loss	Backtest loss	Top-1	Top-5	Price-depth L1
1 ticker	micro	20	3.410	2.915	3.060	0.619	0.790	0.632
1 ticker	small	20	2.580	1.922	2.114	0.621	0.802	0.146
2 tickers	micro	20	3.675	4.773	4.771	0.304	0.597	1.229
2 tickers	small	20	2.797	4.990	4.943	0.004	0.111	0.141
2 tickers	small	100	1.457	4.152	4.166	0.051	0.161	0.801

Optimization Stack

Optimizer

Muon + AdamW routing applies Muon to large matrix parameters while preserving AdamW behavior for embeddings and tied output weights. Momentum=0.95, Newton-Schulz steps=5.
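The routing rule described above can be sketched as a simple partition over parameter names and shapes. Everything here is illustrative: the parameter names, the keyword list, and `route_parameters` are assumptions, not the repository's optimizer code.

```python
def route_parameters(named_shapes, adamw_keywords=("embed", "lm_head")):
    """Send 2-D weight matrices to Muon; embeddings, tied heads, and vectors go to AdamW."""
    groups = {"muon": [], "adamw": []}
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2
        is_embedding = any(key in name for key in adamw_keywords)
        groups["muon" if is_matrix and not is_embedding else "adamw"].append(name)
    return groups

# Hypothetical parameter names and shapes for one layer of the submitted model.
params = {
    "embed_tokens.weight": (579, 1024),
    "layers.0.attn.q_proj.weight": (1024, 1024),
    "layers.0.attn.k_proj.weight": (256, 1024),
    "layers.0.mlp.gate_proj.weight": (2816, 1024),
    "layers.0.norm.weight": (1024,),
}
groups = route_parameters(params)
print(groups)
```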

Schedule

Cosine LR with 125 warmup steps and min LR 2e-05. This keeps the early fused-kernel run stable and avoids an abrupt final LR drop.
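A schedule of this shape can be sketched as linear warmup followed by cosine decay to a floor. The warmup length and minimum LR come from the text; the peak LR, total step count, and linear warmup form are assumptions made only for this example.

```python
import math

def learning_rate(step, total_steps, peak_lr, warmup_steps=125, min_lr=2e-5):
    """Linear warmup to peak_lr, then cosine decay that bottoms out at min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (peak_lr - min_lr) * cosine

# peak_lr=3e-4 and total_steps=1000 are hypothetical values.
for step in (0, 124, 125, 500, 1000):
    print(step, learning_rate(step, total_steps=1000, peak_lr=3e-4))
```

Because the cosine term decays toward a nonzero floor, the learning rate approaches 2e-05 smoothly instead of falling to zero at the end of training.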

Data path

Streaming prepare fits tokenizer samples, writes partitioned NumPy token streams, then training preloads arrays and overlaps workers with GPU compute.

Observability

MLflow stores the experiment history: 9 runs, 596,177 metric rows, 273 params, and 72 tags in the local tracking database.

submitted run: finished | metrics tracked with MLflow | Triton attention path | Triton RoPE + SwiGLU gate | fused linear cross-entropy | torch.compile enabled

Backtest and Stylized Facts

Generated vs real event facts

The model tracks the mean price-depth direction reasonably well, but the kurtosis of generated price depth remains much higher than that of the real held-out backtest data. That gap is a useful target for future sampling and data improvements.
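Two of the comparisons mentioned here can be sketched directly. The code assumes the price-depth L1 figure is the L1 distance between normalized token histograms and that kurtosis is the plain fourth standardized moment; the report may define either differently, and the function names are hypothetical.

```python
from collections import Counter

def histogram(tokens):
    """Normalized frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}

def l1_distance(real, generated):
    """Sum of absolute differences between two normalized histograms (0 to 2)."""
    p, q = histogram(real), histogram(generated)
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in set(p) | set(q))

def kurtosis(values):
    """Fourth standardized moment; heavier tails give larger values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)

# Invented toy samples: the generated one has a single extreme move.
real_depth = [-1, 1, -1, 1, -1, 1, -1, 1]
generated_depth = [-1, 1, -1, 1, -1, 1, -1, 9]
print(l1_distance(real_depth, generated_depth))
print(kurtosis(real_depth), kurtosis(generated_depth))
```

A single outlier is enough to raise kurtosis sharply while barely moving the histogram distance, which is one reason to track both.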

Submitted baseline

Run	Train seq	Vocab	Loss	PPL	Top-1	Top-5	L1
single-GPU baseline	54,695,345	579	2.3379	10.3592	49.23%	69.42%	0.1420

This is the submitted baseline used for the poster and technical report.

Important limitation: this repo reports open-loop chronological backtest loss and rollout-vs-real stylized facts. The TradeFM-inspired closed-loop market simulator for fill validation, impact, stress tests, and optimal-execution agents is not implemented yet.

Baseline Sanity Check

Uniform random is not the right comparator

With 579 realized token IDs, uniform random top-1 is only 0.17% and uniform random top-5 is 0.86%. The reported model top-1 of 49.23% is far above that.
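The uniform figures follow directly from the vocabulary size: a uniform guesser with k distinct guesses is right with probability k / 579. The helper name is hypothetical.

```python
def uniform_top_k(vocab_size, k):
    """Chance that the true token is among k distinct uniform random guesses."""
    return k / vocab_size

for k in (1, 5):
    print(f"uniform top-{k}: {uniform_top_k(579, k):.2%}")
```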

Token imbalance is the real concern

The scored backtest sample has only 135 unique labels and an effective vocabulary of about 19.56. A majority-token baseline that cheats by picking the most frequent token in the backtest sample itself would already get 25.41% top-1.
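These imbalance statistics can be computed from label counts alone. The sketch assumes that effective vocabulary means the exponential of the Shannon entropy of the label distribution; the report may use a different definition, and `label_stats` is a hypothetical helper.

```python
import math
from collections import Counter

def label_stats(labels):
    """Unique labels, exp(entropy) as effective vocabulary, and majority share."""
    counts = Counter(labels)
    total = len(labels)
    probs = [count / total for count in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return {
        "unique": len(counts),
        "effective_vocab": math.exp(entropy),
        "majority_share": max(probs),
    }

# Toy labels (invented): one dominant token and a few rare ones.
stats = label_stats([7] * 6 + [3] * 2 + [1, 2])
print(stats)
```

The majority share is the top-1 accuracy of always predicting the most frequent label in the same sample, which is why it counts as a cheating baseline.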

Simple temporal baselines

Copying the previous token gets 33.63% top-1. A train-sample one-step Markov argmax baseline gets 31.91% on the same scored sample. The model's 49.23% top-1 still clears both, but this is the comparison the report should emphasize.
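Both temporal baselines are simple enough to sketch. The functions below are hypothetical stand-ins, not the repository's evaluation code: one scores the copy-previous rule, the other fits a one-step transition table on a training sequence and predicts the most frequent successor.

```python
from collections import Counter, defaultdict

def copy_previous_accuracy(tokens):
    """Top-1 accuracy of predicting that the next token repeats the current one."""
    hits = sum(prev == nxt for prev, nxt in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

def fit_markov_argmax(train_tokens):
    """Map each token to its most frequent successor in the training sequence."""
    transitions = defaultdict(Counter)
    for prev, nxt in zip(train_tokens, train_tokens[1:]):
        transitions[prev][nxt] += 1
    return {prev: counts.most_common(1)[0][0] for prev, counts in transitions.items()}

def markov_accuracy(table, tokens, fallback):
    """Top-1 accuracy of the fitted table; unseen tokens predict `fallback`."""
    hits = sum(table.get(prev, fallback) == nxt for prev, nxt in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

# Invented toy sequences.
table = fit_markov_argmax([0, 1, 0, 1, 0, 1, 2])
held_out = [0, 1, 0, 1, 1]
print(copy_previous_accuracy(held_out), markov_accuracy(table, held_out, fallback=0))
```

Fitting the table on training data and scoring on held-out data keeps this baseline honest, unlike a majority token chosen from the scored sample.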

Inference Path

Load

Read the trained checkpoint, tokenizer metadata, and token profile. Validate vocab size and context length before scoring or rollout.

Condition

Use a recent event-token window from a ticker stream, or start from a beginning-of-sequence seed. The model consumes token IDs and keeps the latest 512 events as context.

Generate

TradingModel.generate applies top-k sampling, default top_k=16, and appends sampled event tokens one step at a time.
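The sampling loop can be illustrated without the model itself. This sketch is not the body of TradingModel.generate: `sample_top_k`, `generate`, and the `next_logits` callback are hypothetical stand-ins, and only the top_k=16 default and the 512-token context come from the text.

```python
import math
import random

def sample_top_k(logits, top_k=16, rng=random):
    """Sample one token ID from the softmax over the top_k largest logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]  # stable unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]

def generate(next_logits, context, steps, block_size=512, top_k=16, rng=random):
    """Append sampled tokens one at a time, feeding back only the latest window."""
    tokens = list(context)
    for _ in range(steps):
        logits = next_logits(tokens[-block_size:])
        tokens.append(sample_top_k(logits, top_k, rng))
    return tokens

# Toy "model": favors the token after the last one, over a 4-token vocabulary.
def toy_logits(window):
    return [3.0 if i == (window[-1] + 1) % 4 else 0.0 for i in range(4)]

rollout = generate(toy_logits, context=[0], steps=8, top_k=2, rng=random.Random(0))
print(rollout)
```

Restricting sampling to the top-k logits trims the long tail of rare event tokens, at the cost of never generating events outside that set at a given step.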

Decode

Token IDs decode back into event features; price-depth bps can be accumulated into model-implied price paths for charting and downstream simulation.
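Accumulating decoded offsets into a price path might look like the sketch below. It assumes each decoded event yields a signed price-depth value in basis points and that these add up relative to a fixed starting price; the actual decoding convention in the repository may differ, and `price_path` is a hypothetical helper.

```python
from itertools import accumulate

def price_path(start_price, depth_bps):
    """Price after each event, from cumulative basis-point offsets on start_price."""
    return [start_price * (1.0 + total / 10_000.0) for total in accumulate(depth_bps)]

# Invented offsets: +10 bps, -10 bps, +25 bps from a starting price of 100.
path = price_path(100.0, [10, -10, 25])
print(path)
```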

Future Direction

Multimodal

Add news, filings, macro calendar events, funding/flows, and social sentiment as time-aligned conditioning streams. The goal is not text generation, but event-flow generation conditioned on exogenous market context.

L3 Data

Replace public trade-only proxies with venue-grade order-book data. The planned next source is IEX DEEP/HIST market data, whose historical feeds are distributed as pcap files with feed specifications, enabling lower-level market-data parsing and richer order-flow reconstruction.

Closed-loop simulator

Route generated order-flow through a deterministic matching or LOB simulator to evaluate fills, slippage, market impact, liquidity shocks, and execution policies.

Inference service

Package checkpoint + tokenizer + manifest as a reproducible model card artifact, expose batched rollout APIs, and attach provenance to every generated scenario.

Reference Frame

TradeFM

The project is motivated by TradeFM's framing of market microstructure as a generative foundation-model problem, while using a smaller public-data prototype. See arXiv:2602.23784.

Llama and LLMs

The model follows the decoder-only next-token prediction pattern used by modern LLMs and Llama-style architectures. See Llama 3 and Attention Is All You Need.

Data and Systems

Input data comes from Binance Public Data. The L3 roadmap targets IEX DEEP/HIST market data as a venue-grade historical feed candidate. GPU acceleration uses a Triton attention path where available; see Triton.

Reproducibility Artifacts

Artifact	Path	Use
Static overview	`huggingface-space/index.html`	Reviewer-facing project summary and key evidence.
Source repository	`github.com/duoan/mega-trading`	GitHub repository with source code, configs, tests, docs, and generated showcase artifacts.
Technical report PDF	`huggingface-space/technical-report.pdf`	Reviewer-facing compiled report hosted from the GitHub repository.
Technical report source	`huggingface-space/technical-report.tex`	LaTeX source with methods, config rationale, metrics, limitations, and roadmap.
Backtest dashboard	`huggingface-space/backtest-report-rtx.html`	Training curves, split chart, candles, forecast paths, and metrics.
Backtest JSON	`.mega-trading/data/evals/rtx/backtest.json`	Source of backtest loss, top-k accuracy, and stylized facts.
MLflow DB	`.mega-trading/data/runs/mlflow/mlflow.db`	Training and evaluation metrics for local experiment tracking.

Metrics and figures reflect the submitted run artifacts packaged in huggingface-space/.