How to Backtest a Prediction-Market Strategy
Backtesting a prediction-market strategy looks like backtesting equities until you hit the first contract resolution. Then the differences stop being cosmetic. A YES contract does not drift forever. It terminates at exactly $1.00 or $0.00 on a known date, and that single fact reshapes how you build, validate, and trust a backtest. This post covers what changes, the data you actually need, the pitfalls that quietly inflate your results, and why a backtest alone is never enough.
Why binary resolution changes the problem
An equity has no terminal value. You mark it to the last trade and the backtest is a sequence of price changes. A prediction-market contract is a binary claim with a hard expiry. On Kalshi, a winning contract settles at $1 and a losing one at $0, resolved against named source agencies filed with the CFTC as part of each contract's terms. On Polymarket, resolution runs through UMA's optimistic oracle: a proposer posts the outcome with a bond (roughly $750 pUSD), a challenge window opens (two hours by default, longer for some higher-value markets), and the market settles if uncontested. A dispute escalates to a UMA token-holder vote and can take days.
Three consequences follow directly:
- Your ground truth is the resolution, not a closing price. Every backtested position eventually marks to $1 or $0, so a correct outcome label per market is non-negotiable. Get a label wrong and every trade in that market settles to the wrong payoff.
- Payoff is bounded and asymmetric. Buying at 90c risks 90c to make 10c; buying at 5c risks 5c to make 95c. Before fees, a buy at price p breaks even only if it wins with probability p, so the 90c trade needs better than a 90% hit rate to pay. P&L distributions are lumpy and heavily skewed, not roughly normal, so a Sharpe ratio computed on these returns says little. Look at hit rate, average payoff per resolved market, and the full distribution of outcomes instead.
- Time has a cliff. Mean reversion that works at T-minus-30-days can reverse hard in the final hours as the market converges to 0 or 1. A backtest that ignores time-to-resolution will badly misjudge late-stage trades.
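The payoff arithmetic above is easy to check in a few lines. This is a minimal sketch, not a full metrics module: `expected_pnl` and `summarize` are hypothetical helper names, fees are ignored, and every position is assumed to be a single YES contract held to resolution.

```python
from statistics import mean

def expected_pnl(price: float, p_win: float) -> float:
    """Expected P&L per YES contract bought at `price` (dollars, fees ignored).

    Win (1 - price) with probability p_win, lose `price` otherwise.
    """
    return p_win * (1.0 - price) - (1.0 - p_win) * price

def summarize(trades):
    """Summarize resolved trades given as (entry_price, won) pairs."""
    pnls = [(1.0 - price) if won else -price for price, won in trades]
    return {
        "hit_rate": mean(1.0 if won else 0.0 for _, won in trades),
        "avg_payoff": mean(pnls),
        "worst": min(pnls),
        "best": max(pnls),
    }

# Ten buys at 90c, nine of which win: a 90% hit rate that only breaks even.
stats = summarize([(0.90, True)] * 9 + [(0.90, False)])
print(stats)
```

The example is the reason hit rate alone is not enough: nine wins out of ten at 90c leaves average payoff at roughly zero, because the single loss costs as much as the nine wins earned.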
The data you actually need
Free price history gets you started and then quietly lies to you. Four inputs matter, in rough order of how often they are missed:
- Resolution outcomes. The definitive YES/NO per market. This is the cheapest data to get and the most expensive to get wrong.
- Order book depth, not mid-price. A mid of 55c does not mean you could have bought at 55c. The fillable price depends on the spread and the size resting at each level. Polymarket's CLOB API exposes current books and a prices-history endpoint, but its public orderbook-history endpoint stopped producing fresh snapshots around February 2026, so recent depth-aware work leans on third-party archives. Kalshi's REST API gives trades and 1m/1h/1d candlesticks plus live book snapshots over WebSocket, but no historical L2 book archive. Either way, reconstructing past depth is the hard part.
- Fees and settlement mechanics. Model the venue's real costs. Kalshi charges a per-trade fee; Polymarket moved off its old zero-fee model in 2026 to a dynamic taker fee (makers still pay zero, takers pay a category-dependent rate that is symmetric and peaks near 50c, with geopolitical and world-events markets fee-free). Also model that capital is locked through resolution: a couple of hours for an uncontested Polymarket market, days or weeks if disputed. A strategy that recycles capital fast in the backtest may stall in reality.
- Timestamps you could have acted on. Every signal must be computable from data available at that exact moment, or your backtest is fiction.
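To see how fees and locked capital eat into an edge, here is a rough sketch. The fee curve `rate * price * (1 - price)` is an assumption chosen only because it is symmetric and peaks at 50c, matching the shape described above; it is not either venue's published schedule, and the `rate` of 0.05 is a placeholder. All function names are hypothetical.

```python
def taker_fee(price: float, contracts: int, rate: float) -> float:
    """Hypothetical fee curve: symmetric around 50c and largest there.

    `rate` is a placeholder parameter, not a quoted venue rate.
    """
    return rate * contracts * price * (1.0 - price)

def net_edge(price: float, p_win: float, rate: float) -> float:
    """Expected P&L per contract after the assumed taker fee."""
    gross = p_win - price  # expected payoff minus cost, before fees
    return gross - taker_fee(price, 1, rate)

def daily_return_on_locked_capital(pnl: float, cost: float, days_locked: float) -> float:
    """P&L per dollar per day while capital waits for settlement."""
    return pnl / (cost * days_locked)

# A 3c edge at 50c shrinks once the assumed fee is charged.
print(net_edge(0.50, 0.53, rate=0.05))
# The same $2 gain on $90 looks very different after a two-week dispute.
print(daily_return_on_locked_capital(2.0, 90.0, 0.1))
print(daily_return_on_locked_capital(2.0, 90.0, 14.0))
```

Swap in each venue's actual fee schedule before trusting any number this produces. The point of the sketch is only that fees and settlement delay belong inside the backtest, not in a footnote.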
The pitfalls that inflate results
The classic biases all have prediction-market-specific shapes. Watch for these:
- Survivorship bias. If your dataset only includes clean, resolved markets, you have silently dropped the messy ones: rescheduled events, ambiguous resolutions, disputed Polymarket markets, and markets that were delisted or voided. Those are exactly the situations that hurt a live strategy. Test on the full universe including the ugly cases.
- Look-ahead leakage. The sharpest version here is the resolution itself. It is trivially easy to let the known outcome bleed into a feature, for example by joining on a resolved-outcome column or by using a news timestamp that postdates the trade. If a backtest looks suspiciously good on near-resolution trades, suspect leakage first.
- Mid-price fills. Filling at the mid is the single most common reason a backtested edge evaporates live. Use depth snapshots where you have them. Where you do not, assume conservative slippage of one to several cents scaled by market liquidity, and treat thin overnight or low-volume books as worse.
- Single-market overfit. A rule that nails one election or one team's matchups is curve-fit, not a strategy. Require an edge that holds across many independent markets before you believe it.
- Period selection. A few months that happened to contain a volatile election cycle are not a representative sample. Test across regimes and event types.
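Look-ahead leakage is the easiest of these to guard against mechanically. The sketch below assumes each feature carries the timestamp at which it became known; `as_of` and `assert_no_lookahead` are hypothetical helpers, and timestamps are plain numbers for brevity.

```python
from bisect import bisect_right

def as_of(timestamps, values, t):
    """Return the latest value known at time t, or None if nothing is known yet.

    `timestamps` must be sorted ascending and aligned with `values`.
    """
    i = bisect_right(timestamps, t)
    return values[i - 1] if i else None

def assert_no_lookahead(trade_time, feature_times, resolution_time):
    """Raise if a trade uses information it could not have had."""
    if trade_time >= resolution_time:
        raise ValueError("trade placed at or after resolution")
    late = [ft for ft in feature_times if ft > trade_time]
    if late:
        raise ValueError(f"{len(late)} feature(s) postdate the trade")

news_times = [10, 20, 30]
news_items = ["poll A", "poll B", "result announced"]
print(as_of(news_times, news_items, 25))  # only items stamped at or before t=25
assert_no_lookahead(trade_time=25, feature_times=[10, 20], resolution_time=30)
```

Routing every feature lookup through something like `as_of` makes it hard to join on a resolved-outcome column by accident, because the outcome has no timestamp earlier than the trade.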
A minimal backtest loop
Stripped down, an honest backtest does roughly this per market:

    for market in universe:                              # full universe, including voided/disputed
        positions = []                                   # fills for this market only
        for t, book in market.snapshots():               # tick or interval book snapshots
            signal = strategy(book, t)                   # only data available at time t
            if signal:
                fill = simulate_fill(book, signal.size)  # walk the book, not the mid
                fill.cost += venue_fees(fill)
                positions.append(fill)
        settle(positions, market.resolution)             # mark every position to $1 or $0
    report(hit_rate, avg_payoff, outcome_distribution)   # not Sharpe

The two lines that separate a real backtest from a fantasy are simulate_fill walking actual depth and settle using the true resolution. Everything else is bookkeeping.
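For concreteness, here is one way those two functions might look. This is a simplified sketch under stated assumptions: the book is reduced to a list of ask levels `(price, quantity)` sorted best-first, only YES buys are modeled, partial fills are allowed, and the `Fill` dataclass is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Fill:
    size: float  # contracts actually filled
    cost: float  # total dollars paid, before fees

def simulate_fill(asks, size):
    """Buy `size` contracts by walking ask levels instead of filling at the mid."""
    remaining, cost = size, 0.0
    for price, qty in asks:
        take = min(remaining, qty)
        cost += take * price
        remaining -= take
        if remaining <= 0:
            break
    # If the book is too thin, the fill is partial rather than imaginary.
    return Fill(size=size - remaining, cost=cost)

def settle(fills, resolved_yes):
    """P&L per fill once YES contracts mark to $1 (YES) or $0 (NO)."""
    payout = 1.0 if resolved_yes else 0.0
    return [f.size * payout - f.cost for f in fills]

asks = [(0.55, 100), (0.57, 200), (0.60, 500)]
fill = simulate_fill(asks, 250)
print(fill.cost / fill.size)  # average price paid, above the 55c best ask
print(settle([fill], resolved_yes=True))
```

Buying 250 contracts here takes all 100 at 55c and 150 more at 57c, so the average price sits above the top of book. That difference is the slippage a mid-price backtest never sees.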
Why paper trading is the required complement
A backtest answers one question: would this rule have worked on past data, under my modeling assumptions? It cannot tell you whether your fill model matches the live book today, whether your latency lets you actually hit the prices you saw, or whether liquidity has dried up since your sample. Prediction-market books are often thin, and the gap between assumed slippage and real slippage is exactly where a backtested edge dies.
Paper trading closes that gap. Running the strategy forward against the current live order book, with no capital at risk, surfaces the execution reality a historical backtest cannot. The honest sequence is: backtest to reject obviously broken ideas cheaply, then paper trade the survivors against live markets to see whether the edge holds when the book is real and moving, then go live small. That is the loop Banger runs (write a Python strategy, paper it against the live Polymarket and Kalshi books, promote it to live under a declarative risk envelope once it has earned it, with your own venue keys), but the sequence matters more than the tooling.
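Once paper fills exist, the gap between assumed and realized slippage can be measured directly. A minimal sketch, assuming buys only and that each paper fill is logged as a `(mid_at_signal, fill_price)` pair in dollars; `slippage_gap` is a hypothetical helper.

```python
from statistics import mean

def slippage_gap(assumed_cents, paper_fills):
    """Compare the backtest's assumed slippage with what paper trading showed.

    paper_fills: (mid_at_signal, fill_price) pairs for buy orders, in dollars.
    """
    realized = [(fill - mid) * 100 for mid, fill in paper_fills]
    avg = mean(realized)
    return {
        "assumed_cents": assumed_cents,
        "realized_cents": avg,
        "gap_cents": avg - assumed_cents,  # positive means the backtest was too optimistic
    }

print(slippage_gap(1.0, [(0.50, 0.52), (0.40, 0.43)]))
```

A persistently positive gap is a signal to feed the realized figure back into the backtest and rerun it before going live.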
Backtesting tells you the idea is not stupid. Paper trading tells you it might actually work. You need both before you risk a dollar.