Before discussing probability levels, it's essential to clarify event definitions and settlement rules. Once the rules are clear, the next natural question is: Are the market prices reliable? Many people answer intuitively—"Last time it got it right, so it must be accurate," or "Last time it was wrong, so prediction markets don't work." Both judgments are too simplistic. Prediction markets don't output a binary "will/won't happen" conclusion but a set of probability estimates; evaluating whether they "got it right" also requires probabilistic language.
In fact, a market can frequently "hit the outcome" yet be highly distorted in probabilistic terms; or it can often "miss the direction" while honestly reflecting uncertainty. Evaluating solely by win/loss misses the most valuable—and most misunderstood—aspect of prediction markets: calibration.
To judge market quality, we must ask: What is calibration, and when can we say the market truly "got it right"?
Accuracy answers: Does the final judgment match the outcome?
Calibration answers: When the market says 70%, do about 70% of such events actually happen?
A simple example illustrates the difference. Suppose a market quotes 90% on each of 100 comparable events. If 90 occur and 10 do not, the 90% quote is well calibrated, and siding with the favored outcome is right 90% of the time. If instead the market quotes 51% on each of 100 events, and exactly 51 happen while 49 do not, calibration still looks perfect, but the favored side wins only 51% of the time and the quote carries almost no information beyond a coin flip: the market simply stands slightly on one side.
Conversely, an honest 60% quote that ultimately fails doesn't mean "the market lied"; 60% inherently means there's a 40% chance of not happening. Equating "didn't happen" directly with "market failure" is evaluating a probabilistic tool with deterministic thinking.
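The contrast can be made concrete with a small sketch. The Python below is illustrative only: the helper names `accuracy` and `observed_rate` are assumptions introduced here, not part of any platform's tooling, and the data are simply the two hypothetical 100-event markets described above.

```python
def accuracy(quotes, outcomes, threshold=0.5):
    """Share of events where the side favored by the quote matched the outcome."""
    hits = sum((q >= threshold) == bool(o) for q, o in zip(quotes, outcomes))
    return hits / len(quotes)


def observed_rate(outcomes):
    """Share of events that actually happened."""
    return sum(outcomes) / len(outcomes)


# Market A: always quotes 90%; 90 of 100 events occur.
quotes_a = [0.90] * 100
outcomes_a = [1] * 90 + [0] * 10

# Market B: always quotes 51%; 51 of 100 events occur.
quotes_b = [0.51] * 100
outcomes_b = [1] * 51 + [0] * 49

for name, quotes, outcomes in [("A", quotes_a, outcomes_a),
                               ("B", quotes_b, outcomes_b)]:
    print(name,
          "quote:", quotes[0],
          "observed rate:", observed_rate(outcomes),
          "accuracy:", accuracy(quotes, outcomes))
```

In both markets the observed rate equals the quote, so both are calibrated. Only accuracy separates them: the favored side wins 90% of the time in A and 51% of the time in B.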
For readers, the probabilistic reading from Lesson 1 needs upgrading here: reading prediction markets isn't just about "which side is favored," but also about whether the quoted probability matches how often such events have actually happened.
A common way to assess calibration is to plot a calibration curve: group historical predictions by probability intervals (e.g., 50%–60%, 60%–70%, 70%–80%), then tally the actual occurrence rate in each interval. Ideally, the curve should approach the diagonal—events quoted at 70% should happen about 70% of the time over the long term.
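The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `calibration_table`, the 10-point bin width, and the eight data points are all assumptions made up for the example, and a real analysis would need far more samples per interval.

```python
from collections import defaultdict


def calibration_table(quotes, outcomes, bin_width=0.10):
    """Group quotes into probability bins and compare mean quote with observed rate."""
    n_bins = round(1 / bin_width)
    bins = defaultdict(list)
    for q, o in zip(quotes, outcomes):
        # Small epsilon guards against float rounding at bin edges;
        # min() keeps a quote of exactly 1.0 in the top bin.
        idx = min(int(q * n_bins + 1e-9), n_bins - 1)
        bins[idx].append((q, o))

    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        n = len(pairs)
        rows.append({
            "bin": (idx / n_bins, (idx + 1) / n_bins),
            "n": n,
            "mean_quote": sum(q for q, _ in pairs) / n,
            "observed": sum(o for _, o in pairs) / n,
        })
    return rows


# Invented data: four quotes at 55% (two occur), four at 75% (three occur).
quotes = [0.55] * 4 + [0.75] * 4
outcomes = [1, 1, 0, 0] + [1, 1, 1, 0]

rows = calibration_table(quotes, outcomes)
for row in rows:
    print(row)
```

Plotting `mean_quote` against `observed` for each row gives the calibration curve; points on the diagonal are intervals where quotes and frequencies agree.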
Three deviations are commonly seen:
Overconfidence: The market quotes 80%, but actual occurrence is well below 80%. Hot topics and single-narrative markets often exhibit this.
Over-cautiousness: The market quotes 55%, but actual occurrence exceeds 55%. This can happen when information spreads slowly or participants are cautious.
Insufficient samples: Too few historical cases in a probability interval make statistics unstable. Long-tail events and new-topic markets often see this.
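As a rough sketch, the three deviations can be turned into a labeling rule for a single probability interval. Everything numeric here is an assumption chosen for illustration: the 30-sample minimum, the 5-point tolerance, and the function name `label_bin`. Treating quotes below 50% as the mirror image of quotes above 50% is also an assumption; the examples above only cover quotes above 50%.

```python
def label_bin(mean_quote, observed, n, min_samples=30, tolerance=0.05):
    """Classify one probability interval; thresholds are illustrative, not standard."""
    if n < min_samples:
        # Too few cases: any gap between quote and frequency is mostly noise.
        return "insufficient samples"

    gap = observed - mean_quote
    if abs(gap) <= tolerance:
        return "roughly calibrated"

    # Overconfident = the quote sits further from 50% than outcomes justify.
    if mean_quote >= 0.5:
        return "overconfident" if gap < 0 else "over-cautious"
    return "overconfident" if gap > 0 else "over-cautious"


print(label_bin(0.80, 0.65, n=200))  # quoted 80%, happened 65% of the time
print(label_bin(0.55, 0.66, n=200))  # quoted 55%, happened 66% of the time
print(label_bin(0.75, 0.50, n=8))    # large gap, but only 8 cases
```

Checking the sample size first is deliberate: a large gap in an interval with only a handful of cases should not be read as overconfidence or over-cautiousness.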
Thus, calibration isn't a one-off "right/wrong" label but a long-term property requiring enough samples and interval-based observation. This lesson does not aim to give a precise calibration coefficient for any platform—that requires professional data and methodology—but only to establish an evaluation framework: don't judge calibration based on just one or two hot markets.
The Brier score is a common metric for assessing the quality of probabilistic predictions. For binary events, it is the average squared difference between the quoted probability and the outcome (1 if the event happened, 0 if it did not). Lower is better: a perfect forecaster scores 0, a forecaster who is always certain and always wrong scores 1, and always quoting 50% scores 0.25.
The value of the Brier score lies in penalizing "overconfident mistakes." Quoting 99% and failing is penalized far more heavily than quoting 60% and failing (a squared error of about 0.98 versus 0.36). This aligns with probabilistic thinking: the former claims much greater certainty, so its errors cost more.
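For binary events the calculation is short enough to write out. The sketch below uses the standard binary form, the mean of (quote minus outcome) squared; the function name `brier_score` and the single-event inputs are chosen here only for illustration.

```python
def brier_score(quotes, outcomes):
    """Mean squared gap between quoted probability and the 0/1 outcome."""
    return sum((q - o) ** 2 for q, o in zip(quotes, outcomes)) / len(quotes)


miss_at_99 = brier_score([0.99], [0])  # quoted 99%, event did not happen
miss_at_60 = brier_score([0.60], [0])  # quoted 60%, event did not happen
hit_at_60 = brier_score([0.60], [1])   # quoted 60%, event happened

print(round(miss_at_99, 4), round(miss_at_60, 4), round(hit_at_60, 4))
```

The miss at 99% costs roughly 0.98 and the miss at 60% roughly 0.36, while a hit at 60% still costs about 0.16 because the quote left 40% on the other side.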
Ordinary users need not calculate Brier scores by hand, but should understand their meaning:
If two markets have similar accuracy, the one with the lower Brier score usually expresses probability more honestly;
If a market often pushes probabilities toward extremes (0 or 1), it may seem "decisive" short-term, but long-term calibration tends to be worse;
To evaluate market quality, consider both "was it right" and "were probabilities quoted reasonably."
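The first two points can be checked with a toy comparison. The data below are invented: two hypothetical markets quote on the same ten events, of which seven happen. One quotes 70% every time, the other pushes every quote to 95%. The helper names are assumptions for this sketch.

```python
def brier_score(quotes, outcomes):
    """Mean squared gap between quoted probability and the 0/1 outcome."""
    return sum((q - o) ** 2 for q, o in zip(quotes, outcomes)) / len(quotes)


def accuracy(quotes, outcomes):
    """Share of events where the favored side (quote >= 50%) matched the outcome."""
    return sum((q >= 0.5) == bool(o) for q, o in zip(quotes, outcomes)) / len(quotes)


outcomes = [1] * 7 + [0] * 3  # 7 of 10 events happen (invented data)
moderate = [0.70] * 10         # quotes that match the observed frequency
extreme = [0.95] * 10          # quotes pushed toward certainty

for name, quotes in [("moderate", moderate), ("extreme", extreme)]:
    print(name,
          "accuracy:", accuracy(quotes, outcomes),
          "brier:", round(brier_score(quotes, outcomes), 4))
```

Both markets favor the right side 70% of the time, yet the Brier score is about 0.21 for the moderate quotes and about 0.27 for the extreme ones. Identical accuracy hides the cost of the three confident misses.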
For the same event, quotes at different times contain different information. A 40% quote a week before a game may reflect medium-term factors such as lineup, injuries, and schedule; a 65% quote an hour before may incorporate the starting roster, weather, and real-time news. Both are "probabilities," but they answer slightly different questions: early quotes are estimates made with less information, while later ones sit closer to the final consensus.
When reading prediction markets, note the timestamp. Claiming that "the market has always been bullish" without saying when can misjudge information efficiency. The same applies to major macro events: the quote on a Fed-related contract a week before the nonfarm payrolls (NFP) release and the quote one minute before it are driven by different sources of volatility and are not interchangeable.
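One way to respect timestamps when reviewing past quotes is to score them separately by how far ahead of resolution they were made. The sketch below is hypothetical throughout: the record layout `(hours_before_resolution, quote, outcome)`, the horizon cut-offs, and the four sample records are assumptions for illustration, not a format any platform is known to export.

```python
from collections import defaultdict


def horizon_label(hours_before):
    """Bucket a quote by how long before resolution it was taken."""
    if hours_before <= 1:
        return "final hour"
    if hours_before <= 24:
        return "final day"
    return "earlier"


def brier_by_horizon(records):
    """Average squared error of quotes, computed separately per time horizon."""
    buckets = defaultdict(list)
    for hours_before, quote, outcome in records:
        buckets[horizon_label(hours_before)].append((quote - outcome) ** 2)
    return {label: sum(errs) / len(errs) for label, errs in buckets.items()}


# Invented records: (hours before resolution, quoted probability, outcome)
records = [
    (168, 0.40, 1),  # a week out, event later happened
    (1, 0.65, 1),    # same event, one hour out
    (168, 0.55, 0),  # a week out, event did not happen
    (1, 0.30, 0),    # same event, one hour out
]

scores = brier_by_horizon(records)
print(scores)
```

Keeping horizons separate avoids mixing week-ahead estimates with last-hour quotes in one number, which would blur exactly the distinction this section draws.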
Probabilities shown on the Gate Prediction Market interface are snapshots of that moment. If you use Gate for AI Agent's capabilities to pull macro background (such as BTC price, the dollar index, or rate expectations), be clear that the purpose is to explain "why probabilities changed," not to translate asset price moves directly into an event contract's Yes price. A surge in BTC does not automatically mean the approval odds for a crypto event should rise; the two may be related, but the link must be separately defined and verified.
Different topic markets vary greatly in participant structure, information sources, and liquidity; calibration performance cannot be generalized.
Political and election markets: Information-rich and heavily covered by media, but polarized narratives can cause phases of overconfidence. Post-election reviews often discuss cases where a pre-election 90% quote diverged from the outcome, which is a calibration question that a single result cannot settle.
Sports markets: Rules are clearer and data histories are long, so some mainstream events calibrate well; but sudden injuries or referee controversies can still cause short-term distortions.
Crypto and industry event markets: Contracts on FDV thresholds, approval progress, and partnership launches rely more heavily on written definitions (see Lesson 2). Speculative and narrative-driven participants may dominate, thin markets and jumpy quotes are common, and calibration tends to be less stable.
Therefore, sweeping claims like "prediction markets are accurate" or "prediction markets aren't accurate" are meaningless. Instead ask: For which types of events, which periods, under what liquidity conditions is calibration achieved?
Gate for AI Agent or general AI tools can take on research tasks in this lesson such as organizing historical base rates for certain events, compiling past market quotes and settlement results, assisting in grouped statistics or sketching calibration curves. These accelerate organization and help form hypotheses to be tested.
Tasks they cannot take on include: asserting "this market has always been accurate" without reading original rules; packaging a few cases as general laws; or directly outputting "should buy Yes." Any AI-generated figures must trace back to original data; if sample size is insufficient, it should clearly state "not enough to evaluate calibration," rather than offering false precision. Agents stop at research; whether to trust a market's probabilities must be judged by humans based on rules, liquidity, and independent sources.
The core question of this lesson is: What is calibration, and when can we say the market "got it right"? The answer is that in prediction markets "getting it right" has two layers: whether the result occurred and whether probability estimates were reasonable. Accuracy only considers the first; calibration looks at the long-term consistency of estimates. Indicators like the Brier score remind us that quoting 90% and failing is penalized more heavily than quoting 60% and failing, because the 90% quote claimed more certainty.
We also see that time, topic, and liquidity significantly affect calibration performance; you cannot use wins and losses in a single hot market to draw conclusions about all prediction markets. Gate Prediction Market offers snapshots of the current consensus; Gate for AI Agent's macro data provides background for comparison but cannot replace reading the event contract's probability itself.
The next lesson will turn to another dimension that determines trustworthiness: even if calibration is good long-term, single quotes can still be skewed by liquidity, spreads, and manipulation—liquidity and information efficiency are essential steps when reading prediction markets.