System Architecture

Public Binance trade data is converted into discrete event tokens, modeled with a compact Llama-style decoder, and evaluated on chronological holdout data.

[Figure: Mega-Trading system architecture showing data, tokenization, model, training, evaluation, and inference outputs]

Positioning

Generative order-flow modeling from public market data.

Mega-Trading asks a simple question: can market events be modeled like language tokens? It converts public Binance trade data into discrete event tokens, trains a compact Llama-style decoder, and compares predictions and generated event statistics against held-out chronological data.

  • public Binance trade data
  • one integer token per event
  • Llama-style decoder
  • chronological backtest

GitHub: github.com/duoan/mega-trading

Backtest loss
2.3379

Teacher-forced next-token loss on held-out chronological tokens.

Backtest perplexity
10.3592

exp(loss), capped in code to avoid overflow.

Top-1 / Top-5
49.23% / 69.42%

Token accuracy over 524,288 scored tokens.

Throughput
108K tok/s

Last submitted metric window, 111K tok/s over the last 50 points.
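
The four headline numbers above come from standard next-token scoring. The following is a minimal sketch of how loss, capped perplexity, and top-k accuracy could be computed from logits; the function name `next_token_metrics` and the cap value `MAX_LOSS_FOR_PPL` are illustrative assumptions, since the page only says that the perplexity is capped in code.

```python
import math
import numpy as np

# Hypothetical cap; the report only states that exp(loss) is capped.
MAX_LOSS_FOR_PPL = 20.0

def next_token_metrics(logits, targets, k=5):
    """Mean cross-entropy, capped perplexity, and top-1/top-k accuracy.

    logits: float array of shape (n_tokens, vocab)
    targets: int array of shape (n_tokens,)
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = float(-log_probs[np.arange(len(targets)), targets].mean())
    # Cap the exponent so a diverged loss cannot overflow.
    ppl = math.exp(min(loss, MAX_LOSS_FOR_PPL))
    top1 = float((logits.argmax(axis=1) == targets).mean())
    # Indices of the k largest logits per row (order inside the k is irrelevant).
    topk = np.argpartition(-logits, k - 1, axis=1)[:, :k]
    topk_acc = float((topk == targets[:, None]).any(axis=1).mean())
    return {"loss": loss, "ppl": ppl, "top1": top1, f"top{k}": topk_acc}
```

As a consistency check on the reported values, exp(2.3379) is approximately 10.359, which matches the reported perplexity up to rounding of the loss.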

Take-Home Scope

The assignment was intentionally open-ended: build a non-trivial system, define the problem, decide what success means, and show how far the implementation goes. Mega-Trading frames that requirement as an end-to-end system for generative market-event modeling.

What I chose to build

A public-data prototype that treats market events like language tokens: ingest Binance trades, tokenize events, train a Transformer, evaluate predictions, and package the result as a reviewer-facing report.

End-to-End System

The project covers data ingestion, feature engineering, tokenizer fitting, dataset splits, model architecture, distributed-ready training, MLflow tracking, backtesting, generated-event analysis, and static deployment artifacts.

Success Criteria

Success is not only model accuracy. It includes reproducible data lineage, held-out chronological evaluation, baseline comparisons, training throughput, readable artifacts, and clear limitations for the next iteration.

Version Control and Evidence

Version control review stays simple because configs, source code, tests, reports, and generated space assets are separated. The showcase artifacts are covered by tests so the presentation stays tied to real metrics instead of hand-written claims.

Motivation and Fit

I chose this project because Deeter Analytics focuses on turning market data into decisive action, and my background is building end-to-end foundation-model systems with researchers and production teams.

Company context

Deeter Analytics describes its mission as transforming complex market data into actionable insights with clarity, precision, and transparent methodology. This project targets the same problem shape: convert noisy market data into measurable model outputs that can support downstream decisions. Company site.

My background

I work as a research engineer on end-to-end model systems, partnering with researchers to move models from experiments into production. My current work on Amazon Nova foundation models shaped the choice to build a real data-to-training-to-evaluation pipeline instead of a narrow demo.

Why this project

A market-event foundation-model prototype lets me show the skills the take-home prompt is trying to reveal: problem framing, data processing, model implementation, training infrastructure, optimization, evaluation, and reviewable artifacts.

System Story

1. Public trades: Binance public archives provide downloadable historical trades.
2. Event fields: Each event keeps direction, relative price, depth, size, and elapsed time.
3. Event token: Quantile buckets turn each event into one integer token ID.
4. Decoder model: A Llama-style causal Transformer learns next-token probabilities.
5. Backtest: Held-out chronological slices measure loss, perplexity, and top-k accuracy.
6. Rollout facts: Generated price-depth tokens are compared with real held-out distributions.
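
The event-token step can be illustrated with a small sketch. It assumes a mixed-radix combination of per-field quantile buckets; the field names, bucket counts, side coding, and helper names (`fit_edges`, `bucketize`, `encode_events`) are hypothetical, and the repository's tokenizer may differ, for example by remapping only the realized combinations to a compact ID range.

```python
import numpy as np

def fit_edges(values, n_buckets):
    """Interior quantile cut points that split `values` into n_buckets bins."""
    qs = np.linspace(0.0, 1.0, n_buckets + 1)[1:-1]
    return np.quantile(values, qs)

def bucketize(values, edges):
    """Map each value to a bucket index in [0, len(edges)]."""
    return np.searchsorted(edges, values, side="right")

def encode_events(side, fields, edges):
    """Combine side and per-field buckets into one integer per event."""
    token = np.asarray(side, dtype=np.int64)  # assumed coding: 0 = sell, 1 = buy
    for name, values in fields.items():
        radix = len(edges[name]) + 1          # number of buckets for this field
        token = token * radix + bucketize(values, edges[name])
    return token

# Synthetic stand-in data; real inputs would be Binance trade features.
rng = np.random.default_rng(0)
n = 10_000
side = rng.integers(0, 2, n)
fields = {
    "depth_bps": rng.normal(0.0, 1.0, n),
    "size": rng.lognormal(0.0, 1.0, n),
    "dt": rng.exponential(1.0, n),
}
buckets = {"depth_bps": 8, "size": 4, "dt": 4}  # hypothetical bucket counts
edges = {name: fit_edges(fields[name], buckets[name]) for name in fields}
tokens = encode_events(side, fields, edges)
```

With these illustrative bucket counts the ID space has 2 × 8 × 4 × 4 = 256 combinations. Quantile edges give each bucket roughly equal mass per field, which is one reason this kind of scheme keeps the vocabulary small.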

Dataset Evidence

Chronological split

train 54.7M, valid 1.1M, backtest 6.2M

Prepared from 62,012,828 total token windows: train 54,695,345, validation 1,116,216, and backtest 6,201,267. Validation and backtest are held out by ticker chronology.
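
The reported counts correspond to roughly 88.2% train, 1.8% validation, and 10.0% backtest. A per-ticker chronological split along those lines might look like the sketch below; the function name and the default fractions are assumptions inferred from the counts, not values taken from the project configuration.

```python
import numpy as np

def chronological_split(timestamps, valid_frac=0.018, backtest_frac=0.10):
    """Split one ticker's samples into train/valid/backtest by time.

    Returns three index arrays; every train sample is no newer than any
    validation sample, and every validation sample is no newer than any
    backtest sample.
    """
    order = np.argsort(timestamps, kind="stable")
    n = len(order)
    n_backtest = int(n * backtest_frac)
    n_valid = int(n * valid_frac)
    n_train = n - n_valid - n_backtest
    train = order[:n_train]
    valid = order[n_train:n_train + n_valid]
    backtest = order[n_train + n_valid:]
    return train, valid, backtest
```

Applying the split separately to each ticker keeps the newest slice of every symbol out of training, which is what "held out by ticker chronology" suggests.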

Symbol coverage

BTCUSDT 12.4M, ETHUSDT 11.1M, SOLUSDT 6.2M, SHIBUSDT 5.2M, XRPUSDT 4.3M, DOGEUSDT 4.0M, BNBUSDT 3.3M, SUIUSDT 2.0M

32 USDT symbols, 1096 symbol-month partitions, 2023-05 through 2026-04. Top symbols dominate sequence volume, as expected for liquid public spot trades.

Model Architecture

Why one token per event?

Each market event is encoded as one integer token, similar to how an LLM encodes text. This makes the task easy to explain: given recent events, predict the next event.

Why Llama 3-style?

The decoder uses standard modern LLM components: RoPE, RMSNorm, grouped-query attention, SwiGLU, and tied embeddings. The report cites Llama and Transformer references for this choice.

Why this model size?

The submitted baseline uses 20 layers, hidden size 1024, 16 attention heads, and 4 KV heads: about 226.1M trainable parameters. It is large enough to exercise a real Transformer pipeline, but small enough for one high-memory GPU.

Why GQA and RoPE?

Grouped-query attention lowers KV projection cost while preserving multiple query heads. RoPE avoids learned absolute tables and works naturally with fixed 512-token causal windows.
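
To make the head sharing concrete, here is a minimal NumPy sketch of causal grouped-query attention with 16 query heads and 4 KV heads. It is an unfused reference for illustration only, not the project's Triton kernel, and it omits RoPE; the function name `causal_gqa` is hypothetical.

```python
import numpy as np

def causal_gqa(q, k, v):
    """q: (n_heads, T, d); k, v: (n_kv_heads, T, d). Returns (n_heads, T, d)."""
    n_heads, T, d = q.shape
    group = n_heads // k.shape[0]
    # Each KV head is shared by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Causal mask: position t may only attend to positions <= t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8, 64))
k = rng.normal(size=(4, 8, 64))
v = rng.normal(size=(4, 8, 64))
out = causal_gqa(q, k, v)
```

Assuming a head dimension of 1024 / 16 = 64, the K and V projections each map 1024 to 256 features instead of 1024 to 1024, so they hold a quarter of the weights that full multi-head attention would need.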

| Component | Submitted value | Reason |
| --- | --- | --- |
| Vocabulary | 579 realized tokens | Fitted quantile buckets over action, side, price/depth, size, and time, plus special tokens. |
| Context | block_size=512, stride=256 | Balances event context with large sample count and tractable attention cost. |
| Attention | heads=16, kv_heads=4, backend=triton | GQA plus local fused causal attention for GPU efficiency. |
| Feed-forward | SwiGLU intermediate_dim=2816 | Llama-family FFN ratio, rounded to hardware-friendly width. |
| Dropout | 0.0 | The Triton fused attention path currently targets no-dropout causal GQA. |
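
The submitted values can be cross-checked against the stated parameter count with a short calculation. This sketch assumes bias-free linear layers, two RMSNorm weights per layer plus a final norm, a tied embedding of exactly 579 rows, and a rounding unit of 256 for the FFN width; those layout details are assumptions, not confirmed by the report.

```python
import math

def swiglu_width(dim=1024, multiple=256):
    """Round 8/3 * dim up to a multiple (256 is an assumed rounding unit)."""
    return math.ceil((8 * dim / 3) / multiple) * multiple

def count_parameters(vocab=579, dim=1024, layers=20, heads=16, kv_heads=4, ffn=2816):
    head_dim = dim // heads
    kv_dim = kv_heads * head_dim
    attention = 2 * dim * dim + 2 * dim * kv_dim  # Wq and Wo, then Wk and Wv
    feed_forward = 3 * dim * ffn                  # gate, up, and down projections
    norms = 2 * dim                               # two RMSNorm weights per layer
    per_layer = attention + feed_forward + norms
    embedding = vocab * dim                       # shared with the output head
    final_norm = dim
    return layers * per_layer + embedding + final_norm

total = count_parameters()
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
```

Under these assumptions the total is 226,078,720, consistent with the stated figure of about 226.1M, and `swiglu_width()` reproduces the intermediate width of 2816.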

Training System

Loss curves

Observed range: 2.333 to 6.553

Final train loss 3.3414, validation loss 2.3500. The backtest loss is 2.3379.

Optimization throughput

Observed range: 1,182.983 to 154,444.025 tok/s

Preloaded NumPy arrays, pinned workers, non-blocking transfer, torch.compile, and Triton attention reduce data-wait and kernel overhead after warmup.

Engineering Design Evidence

The repository includes design and profiling documents that make the implementation reviewable beyond the final metrics. These documents show how data contracts, kernels, training configs, profiling, and distributed execution were reasoned about.

Data plane design

docs/data-plane-design.md defines the raw Binance trade path, bounded-memory streaming prepare, tokenizer artifacts, and chronological split contract. This is the lineage layer behind every reported metric.

Training plane design

docs/training-plane-design.md records the training inputs, checkpoint/manifest outputs, DDP/FSDP-ready execution, and backtest handoff. It shows how data, model, metrics, and evaluation are kept consistent.

Custom kernels

docs/kernels.md documents three Triton paths. The causal GQA attention forward pass reaches approximately 156.23 TFLOP/s versus 148.49 for PyTorch SDPA on the RTX-shape bf16 benchmark; the backward pass remains a bottleneck and is logged as such. The RoPE and SwiGLU gate kernels beat PyTorch on the target shape for forward plus backward. A Liger-Kernel-style fused linear cross-entropy path cuts loss-head peak memory from 509 MB to 126 MB (~4×) at PyTorch-parity latency. The failed hand-written fused-matmul attempts are kept as a cautionary tuning history.

Performance profiling

docs/performance-profiling.md captures an end-to-end PyTorch profiler investigation. The baseline CPU optimizer window was 323.645 ms over two optimizer updates; after shape-bucketed Muon and ns_steps=3 it dropped to about 20.018 ms.

Training configs

docs/training-configs.md explains Mac, RTX, Modal, and default environments. The same project can run a smoke test, single-server CUDA training, or Modal GPU training through config-driven entry points.

Systems alignment

docs/training-systems-alignment.md maps the project to training-system concerns: throughput, GPU utilization, memory pressure, I/O bottlenecks, checkpoint reliability, failure recovery, and learning per unit of compute.

Ablation Study

The ablation pass turns engineering choices into testable hypotheses: capacity should help on a fixed slice, adding tickers should increase difficulty at the same budget, and longer training should be judged by both likelihood and rollout realism.

Hypotheses

  • Model capacity: small should beat micro on the same one-ticker data.
  • Data size: two tickers should be harder than one at 20 steps.
  • Training budget: 100 steps should improve likelihood, but not necessarily generated-event quality.

Probe: BTCUSDT and ETHUSDT, April 2026, block_size=128, stride=4096, 16 backtest batches, 5 decoded examples per run.

Conclusions

The cleanest result is capacity: one-ticker small improves backtest loss from 3.060 to 2.114 and price-depth L1 from 0.632 to 0.146. Multi-ticker training is harder: token accuracy can collapse even when marginal rollout statistics look plausible. The 100-step follow-up improves loss but worsens L1, so future selection needs both open-loop and rollout gates.

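[Figure: Ablation study charts for backtest loss, price-depth L1, and Top-1 accuracy]

| Data | Model | Steps | Train loss | Val loss | Backtest loss | Top-1 | Top-5 | Price-depth L1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 ticker | micro | 20 | 3.410 | 2.915 | 3.060 | 0.619 | 0.790 | 0.632 |
| 1 ticker | small | 20 | 2.580 | 1.922 | 2.114 | 0.621 | 0.802 | 0.146 |
| 2 tickers | micro | 20 | 3.675 | 4.773 | 4.771 | 0.304 | 0.597 | 1.229 |
| 2 tickers | small | 20 | 2.797 | 4.990 | 4.943 | 0.004 | 0.111 | 0.141 |
| 2 tickers | small | 100 | 1.457 | 4.152 | 4.166 | 0.051 | 0.161 | 0.801 |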
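
The page does not define the price-depth L1 metric. One plausible reading, sketched here as an assumption, is the L1 distance between the normalized histograms of real and generated price-depth buckets, which ranges from 0 (identical) to 2 (disjoint) and is compatible with the largest value in the table, 1.229. The function name `histogram_l1` is hypothetical.

```python
import numpy as np

def histogram_l1(real_ids, generated_ids, n_buckets):
    """L1 distance between two normalized bucket histograms (assumed definition)."""
    p = np.bincount(real_ids, minlength=n_buckets) / len(real_ids)
    q = np.bincount(generated_ids, minlength=n_buckets) / len(generated_ids)
    return float(np.abs(p - q).sum())

# Real puts half its mass in bucket 0; generated puts a quarter there.
example = histogram_l1(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1]), n_buckets=2)
```

A marginal metric like this ignores ordering, which fits the observation that rollout statistics can look plausible while token accuracy collapses.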

Optimization Stack

Optimizer

Muon + AdamW routing applies Muon to large matrix parameters while preserving AdamW behavior for embeddings and tied output weights. Momentum=0.95, Newton-Schulz steps=5.

Schedule

Cosine LR with 125 warmup steps and min LR 2e-05. This keeps the early fused-kernel run stable and avoids an abrupt final LR drop.
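
A minimal sketch of a warmup-plus-cosine schedule with the stated 125 warmup steps and minimum LR of 2e-05 follows. The peak learning rate and total step count are not given on this page, so `peak_lr` and `total_steps` are placeholder arguments, and the linear warmup shape is an assumption.

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_steps=125, min_lr=2e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Because the cosine term flattens near the end, the rate approaches the floor gradually instead of dropping abruptly at the final step.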

Data path

Streaming prepare fits tokenizer samples, writes partitioned NumPy token streams, then training preloads arrays and overlaps workers with GPU compute.

Observability

MLflow stores the experiment history: 9 runs, 596,177 metric rows, 273 params, and 72 tags in the local tracking database.

  • submitted run: finished
  • metrics tracked with MLflow
  • Triton attention path
  • Triton RoPE + SwiGLU gate
  • fused linear cross-entropy
  • torch.compile enabled

Backtest and Stylized Facts

Generated vs real event facts

| Statistic | Real | Generated |
| --- | --- | --- |
| Mean bps | -0.701 | -0.833 |
| Std bps | 0.627 | 0.440 |
| Kurtosis | 3.095 | 14.297 |
| Return ACF1 | -0.070 | -0.006 |

The model tracks the mean price-depth direction reasonably well, but generated kurtosis (14.297) remains much higher than in the real held-out backtest data (3.095). Closing that gap is a useful target for future sampling and data improvements.
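
The four statistics can be computed from a series of per-event bps values as in the sketch below. It assumes Pearson (non-excess) kurtosis, for which a normal distribution scores 3, and the standard lag-1 autocorrelation estimator; the report does not state which conventions the project uses, and `stylized_facts` is a hypothetical name.

```python
import numpy as np

def stylized_facts(bps):
    """Mean, standard deviation, Pearson kurtosis, and lag-1 autocorrelation."""
    x = np.asarray(bps, dtype=np.float64)
    centered = x - x.mean()
    var = (centered ** 2).mean()
    kurtosis = float((centered ** 4).mean() / var ** 2)
    acf1 = float((centered[1:] * centered[:-1]).sum() / (centered ** 2).sum())
    return {"mean": float(x.mean()), "std": float(np.sqrt(var)),
            "kurtosis": kurtosis, "acf1": acf1}

# An alternating series has kurtosis 1 and strongly negative lag-1 correlation.
facts = stylized_facts([1.0, -1.0] * 50)
```

Running the same function on real and generated series gives directly comparable rows.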

Submitted baseline

| Run | Train seq | Vocab | Loss | PPL | Top-1 | Top-5 | L1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| single-GPU baseline | 54,695,345 | 579 | 2.3379 | 10.3592 | 49.23% | 69.42% | 0.1420 |

This is the submitted baseline used for the poster and technical report.

Important limitation: this repo reports open-loop chronological backtest loss and rollout-vs-real stylized facts. The TradeFM-inspired closed-loop market simulator for fill validation, impact, stress tests, and optimal-execution agents is not implemented yet.

Baseline Sanity Check

Uniform random is not the right comparator

With 579 realized token IDs, uniform random top-1 is only 0.17% and uniform random top-5 is 0.86%. The reported model top-1 of 49.23% is far above that.

Token imbalance is the real concern

The scored backtest sample has only 135 unique labels and an effective vocabulary of about 19.56. A majority-token baseline fitted on the backtest labels themselves, which is a deliberately optimistic cheat, would already get 25.41% top-1.

Simple temporal baselines

Copying the previous token gets 33.63%. A train-sample one-step Markov argmax baseline gets 31.91% on the same scored sample. The model still clears both, but this is the comparison the report should emphasize.
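
The baselines in this section are simple enough to sketch directly. The helper names below are hypothetical, and the Markov baseline falls back to the most common training token for unseen contexts, which is an assumption about how such cases are handled.

```python
from collections import Counter, defaultdict
import numpy as np

def uniform_topk(vocab, k):
    """Expected top-k accuracy of a uniform random guesser."""
    return min(k, vocab) / vocab

def majority_accuracy(labels):
    """Top-1 accuracy of always predicting the most frequent label."""
    counts = np.bincount(np.asarray(labels))
    return float(counts.max() / len(labels))

def copy_previous_accuracy(tokens):
    """Top-1 accuracy of predicting that the next token repeats the last one."""
    tokens = np.asarray(tokens)
    return float((tokens[1:] == tokens[:-1]).mean())

def markov_argmax_accuracy(train_tokens, test_tokens):
    """One-step Markov baseline: predict the most frequent successor seen in training."""
    successors = defaultdict(Counter)
    for prev, nxt in zip(train_tokens[:-1], train_tokens[1:]):
        successors[prev][nxt] += 1
    best = {prev: c.most_common(1)[0][0] for prev, c in successors.items()}
    fallback = Counter(train_tokens).most_common(1)[0][0]
    pairs = list(zip(test_tokens[:-1], test_tokens[1:]))
    hits = sum(best.get(prev, fallback) == nxt for prev, nxt in pairs)
    return hits / len(pairs)

print(f"uniform top-1: {100 * uniform_topk(579, 1):.2f}%")
print(f"uniform top-5: {100 * uniform_topk(579, 5):.2f}%")
```

The two printed values reproduce the 0.17% and 0.86% figures quoted for a 579-token vocabulary.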

Inference Path

Load

Read the trained checkpoint, tokenizer metadata, and token profile. Validate vocab size and context length before scoring or rollout.

Condition

Use a recent event-token window from a ticker stream, or start from a beginning-of-sequence seed. The model consumes token IDs and keeps the latest 512 events as context.

Generate

TradingModel.generate applies top-k sampling, default top_k=16, and appends sampled event tokens one step at a time.
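
The generation loop can be sketched independently of the model. In this illustration `next_logits` is a hypothetical callable standing in for a forward pass that returns next-token logits; it is not the signature of `TradingModel.generate`, and the helper names are assumptions.

```python
import numpy as np

def sample_top_k(logits, top_k=16, rng=None):
    """Sample one token ID from the top_k highest-scoring logits."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    k = min(top_k, logits.size)
    keep = np.argpartition(-logits, k - 1)[:k]   # indices of the k largest logits
    shifted = logits[keep] - logits[keep].max()  # stabilize the softmax
    probs = np.exp(shifted)
    probs = probs / probs.sum()
    return int(rng.choice(keep, p=probs))

def generate(next_logits, context, n_new, block_size=512, top_k=16, rng=None):
    """Append n_new sampled tokens, feeding at most block_size tokens back in."""
    tokens = list(context)
    for _ in range(n_new):
        logits = next_logits(tokens[-block_size:])
        tokens.append(sample_top_k(logits, top_k=top_k, rng=rng))
    return tokens
```

Restricting sampling to the 16 most likely tokens trims the long tail of rare events, which may be relevant when tuning the tails of generated distributions.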

Decode

Token IDs decode back into event features; price-depth bps can be accumulated into model-implied price paths for charting and downstream simulation.
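
The page does not specify how decoded bps values are accumulated. One simple reading, shown here purely as an assumption, treats each decoded value as a move in basis points relative to the starting price and sums them; `price_path` is a hypothetical helper, and a compounding variant would be equally plausible.

```python
import numpy as np

def price_path(start_price, step_bps):
    """Additive accumulation of per-event bps moves (assumed convention)."""
    cumulative_bps = np.cumsum(np.asarray(step_bps, dtype=np.float64))
    return start_price * (1.0 + cumulative_bps / 10_000.0)

path = price_path(100.0, [10.0, -10.0, 5.0])
```

For a start price of 100, moves of +10, -10, and +5 bps give a path of 100.10, 100.00, and 100.05.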

Future Direction

Multimodal

Add news, filings, macro calendar events, funding/flows, and social sentiment as time-aligned conditioning streams. The goal is not text generation, but event-flow generation conditioned on exogenous market context.

L3 Data

Replace public trade-only proxies with venue-grade order-book data. The planned next source is IEX DEEP/HIST market data, whose historical feeds are distributed as pcap files with feed specifications, enabling lower-level market-data parsing and richer order-flow reconstruction.

Closed-loop simulator

Route generated order-flow through a deterministic matching or LOB simulator to evaluate fills, slippage, market impact, liquidity shocks, and execution policies.

Inference service

Package checkpoint + tokenizer + manifest as a reproducible model card artifact, expose batched rollout APIs, and attach provenance to every generated scenario.

Reference Frame

TradeFM

The project is motivated by TradeFM's framing of market microstructure as a generative foundation-model problem, while using a smaller public-data prototype. See arXiv:2602.23784.

Llama and LLMs

The model follows the decoder-only next-token prediction pattern used by modern LLMs and Llama-style architectures. See Llama 3 and Attention Is All You Need.

Data and Systems

Input data comes from Binance Public Data. The L3 roadmap targets IEX DEEP/HIST market data as a venue-grade historical feed candidate. GPU acceleration uses a Triton attention path where available; see Triton.

Reproducibility Artifacts

| Artifact | Path | Use |
| --- | --- | --- |
| Static overview | huggingface-space/index.html | Reviewer-facing project summary and key evidence. |
| Source repository | github.com/duoan/mega-trading | GitHub repository with source code, configs, tests, docs, and generated showcase artifacts. |
| Technical report PDF | huggingface-space/technical-report.pdf | Reviewer-facing compiled report hosted from the GitHub repository. |
| Technical report source | huggingface-space/technical-report.tex | LaTeX source with methods, config rationale, metrics, limitations, and roadmap. |
| Backtest dashboard | huggingface-space/backtest-report-rtx.html | Training curves, split chart, candles, forecast paths, and metrics. |
| Backtest JSON | .mega-trading/data/evals/rtx/backtest.json | Source of backtest loss, top-k accuracy, and stylized facts. |
| MLflow DB | .mega-trading/data/runs/mlflow/mlflow.db | Training and evaluation metrics for local experiment tracking. |