Appendix B: Empirical Studies

This appendix outlines the empirical studies referenced throughout the book that warrant their own investigation. Each study is designed to be runnable on retail-tier data infrastructure (Databento, IB API, Sierra Chart historical, or similar) on a single contract or a small set of contracts.

The studies are framed as research questions with a methodology, expected analysis, and reporting format. They are not "results" in the sense of finished papers; they are protocols for the reader (or the author's own research pipeline) to execute and contribute to the empirical evidence base behind the book's claims.

Where the author has done preliminary work on a study, the working result is noted. These are internal, not peer-reviewed; treat as hypotheses to validate.


B.1 Naked POC Retest Probability

Research question

What is the empirical probability that a naked POC retests, conditional on:

  • Time since formation
  • Distance from current price
  • Volatility regime at formation

Methodology

  1. Identify all POCs across N sessions (N ≥ 90) on liquid futures (ES, NQ).
  2. Tag each POC as "naked" if price has not traded back to it since the session in which it formed.
  3. For each naked POC, track whether and when it retests over the subsequent 30 sessions.
  4. Stratify retest probability by:
     • Days-since-formation (bins of 5 days).
     • Distance-from-current-price (bins of 0.5 × ATR).
     • Volatility regime at formation (low, normal, high).
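
The stratification step can be sketched with pandas. This is a minimal illustration under stated assumptions, not the author's pipeline: the column names (`age_sessions`, `dist_atr`, `regime`, `retested`) and the sample rows are hypothetical, and each naked POC is assumed to have been tagged already with its 30-session outcome.

```python
import pandas as pd

def retest_rate_table(pocs: pd.DataFrame) -> pd.DataFrame:
    """Retest rate and sample size per (regime, age bin, distance bin)."""
    df = pocs.copy()
    # 5-session age bins, labelled by their lower edge: 0, 5, 10, ...
    df["age_bin"] = (df["age_sessions"] // 5) * 5
    # 0.5 x ATR distance bins, labelled by their lower edge: 0.0, 0.5, 1.0, ...
    df["dist_bin"] = (df["dist_atr"] // 0.5) * 0.5
    grouped = df.groupby(["regime", "age_bin", "dist_bin"])["retested"]
    # Mean of a boolean column is the empirical retest rate; size is n per cell.
    return grouped.agg(rate="mean", n="size").reset_index()

# Hypothetical toy rows, for illustration only (not real market data).
sample = pd.DataFrame({
    "age_sessions": [1, 3, 7, 8, 22],
    "dist_atr": [0.2, 0.4, 1.2, 1.4, 3.1],
    "regime": ["normal", "normal", "normal", "normal", "high"],
    "retested": [True, False, True, True, False],
})
table = retest_rate_table(sample)
print(table)
```

Keeping `n` next to each rate makes thin cells visible before any conclusion is drawn from them.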

Expected analysis

A bivariate plot of retest probability vs (days-since, distance) for each regime. The hypothesis: probability is high (60 to 80%) for distances within 1 to 2 × ATR, regardless of age; probability decays with distance more than with age.

Author's preliminary result

Informal sampling on ES 2024 to 2025 suggests:

  • Naked POCs within 1.5 × ATR retest at ~75% within 30 sessions.
  • Naked POCs > 3 × ATR retest at ~40% within 30 sessions.
  • Older naked POCs (formed 20+ sessions ago) actually retest at higher rates than recent ones, consistent with "more time means more opportunity for price to oscillate to the level."

Reporting

A table of conditional retest rates, plus the bivariate plot. Report the sample size and a confidence interval for each cell.


B.2 AVWAP Magnetism

Research question

When an AVWAP is approached from one side (above or below), what is the conditional distribution of:

  • Touch-and-reverse (price reverses on the touch).
  • Touch-and-extend (price touches and continues through).

Does it depend on:

  • The AVWAP's anchor type (session, prior settlement, FOMC, etc.).
  • Distance traveled to the AVWAP.
  • Regime at the time of approach.

Methodology

  1. Plot AVWAPs from 4 anchor types (session, prior settlement, weekly, FOMC) on ES over 90 sessions.
  2. For each AVWAP, identify all approaches (touch within 0.05 × ATR).
  3. For each approach, classify outcome over the next 20 bars:
     • Reverse: price moves at least 0.5 × ATR away from AVWAP in the opposite direction.
     • Extend: price moves at least 0.5 × ATR through the AVWAP in the approach direction.
     • Chop: neither.
  4. Stratify by regime, anchor type, and distance traveled.
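
The outcome classification can be sketched as a first-threshold-hit rule. The function name and arguments are hypothetical, and the sketch makes two assumptions the protocol does not state: the AVWAP value at the touch is used as a fixed reference for the following bars, and a bar that reaches both thresholds is labelled "chop" because bar data cannot show which was hit first.

```python
def classify_approach(avwap_at_touch, approach_from, highs, lows, atr, threshold=0.5):
    """Label one AVWAP approach as 'reverse', 'extend', or 'chop'.

    approach_from: 'below' if price rose into the AVWAP, 'above' if it fell into it.
    highs, lows:   bar highs and lows for the bars after the touch (e.g. the next 20).
    """
    move = threshold * atr
    upper = avwap_at_touch + move
    lower = avwap_at_touch - move
    for high, low in zip(highs, lows):
        hit_up = high >= upper
        hit_down = low <= lower
        if hit_up and hit_down:
            return "chop"  # assumption: intrabar order unknown, so treat as unresolved
        if hit_up:
            return "extend" if approach_from == "below" else "reverse"
        if hit_down:
            return "reverse" if approach_from == "below" else "extend"
    return "chop"  # neither threshold reached within the window

# Hypothetical numbers: AVWAP at 5000, ATR of 10, so each threshold sits 5 points away.
print(classify_approach(5000.0, "below", [5002.0, 5006.0], [4998.0, 5001.0], atr=10.0))
```

Passing exactly 20 bars of highs and lows reproduces the 20-bar window in step 3.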

Expected analysis

A three-way stratification (regime × anchor type × distance traveled), with the conditional probability of each outcome in every cell.

Author's hypothesis

  • In Range regimes, AVWAP touches reverse at high rates (~70%).
  • In Trend regimes, AVWAP touches extend at high rates (~65%) on the trend's side; reverse on the counter-trend side.
  • Multi-AVWAP confluence (zone where 2+ AVWAPs converge) has even higher reaction rates.

B.3 Footprint Stacked-Imbalance Reversal Probability

Research question

For a stacked imbalance (3 or 4 or 5+ consecutive cells with imbalance ratio > 2:1), what is the probability of subsequent reversal (or continuation), conditional on:

  • Stack length (3, 4, 5+).
  • Position relative to structure (at high-score level vs not).
  • Regime.

Methodology

  1. Identify stacked imbalances on ES 5-min footprint over 60 sessions.
  2. Tag each by stack length and structural location.
  3. Track 10-bar outcome:
     • Reversal: price moves opposite the imbalance direction by at least 0.5 × ATR.
     • Continuation: price extends in the imbalance direction by at least 0.5 × ATR.
  4. Stratify.
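
Tagging by stack length can be sketched as a longest-run count. The helper names are hypothetical, and the sketch assumes the per-cell imbalance ratios for one bar have already been computed by whatever footprint convention the reader uses; it only counts consecutive cells above the 2:1 threshold.

```python
def longest_stack(ratios, min_ratio=2.0):
    """Length of the longest run of consecutive cells with ratio > min_ratio."""
    best = run = 0
    for ratio in ratios:
        run = run + 1 if ratio > min_ratio else 0
        best = max(best, run)
    return best

def stack_bucket(length):
    """Map a run length to the study's buckets: '3', '4', '5+', or None if too short."""
    if length >= 5:
        return "5+"
    if length >= 3:
        return str(length)
    return None

# Hypothetical ratios for the price cells of one bar, ordered by price.
cells = [1.0, 2.5, 3.0, 2.1, 1.5, 4.0]
print(longest_stack(cells), stack_bucket(longest_stack(cells)))
```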

Expected analysis

The hypothesis: stacked imbalances at high-score levels have asymmetric outcomes (reversal-favored or continuation-favored depending on regime). Stacked imbalances away from structure are approximately coin-flip.


B.4 Sweep-of-Equal-Highs Reversal Probability

Research question

When equal highs are swept, what is the conditional reversal probability based on:

  • Regime (Trend vs Range).
  • Pierce magnitude (0.05, 0.10, 0.15, 0.20+ × ATR).
  • Order-flow agreement (CVD divergence + absorption present vs absent).
  • Time of day.

Methodology

  1. On NQ 5-min over 60 sessions, identify all sweeps of equal-high clusters (by Chapter 5 definitions).
  2. For each, tag the conditioning variables.
  3. Track 10-bar outcome.

Expected analysis

The book's main hypothesis: at structure with order-flow confirmation in a Range regime, reversal probability is 70 to 80%. Without order-flow confirmation in a Trend regime, the same pattern produces 25 to 35% reversal (i.e., continuation more likely).

Author's preliminary result

Informal sampling on NQ 2024 to 2025 supports the conditioning hypothesis qualitatively. Quantitative validation requires the full study.


B.5 Open-Type Classifier Accuracy

Research question

How accurately does a 09:45 ET open-type classification predict end-of-day classification?

Methodology

  1. Over 60 ES sessions, classify each by open type at 09:45 ET (using the protocol from Chapter 13).
  2. At end of day, classify by the actual day-type (Trend Day, Range Day, etc., from Dalton's framework).
  3. Cross-tabulate.
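
The cross-tabulation can be sketched with `pandas.crosstab`. The column names and the five toy sessions below are hypothetical and exist only to show the shape of the output; a real run would have one row per session with the 09:45 ET label and the end-of-day label.

```python
import pandas as pd

# Hypothetical toy labels, for illustration only.
sessions = pd.DataFrame({
    "open_type": ["Open Drive", "Open Drive", "Open Drive",
                  "Open Auction", "Open Auction"],
    "day_type": ["Trend Day", "Trend Day", "Range Day",
                 "Range Day", "Range Day"],
})

# Raw counts: rows are open types, columns are end-of-day types.
counts = pd.crosstab(sessions["open_type"], sessions["day_type"])

# Row-normalized shares: for each open type, the distribution of day types.
row_share = pd.crosstab(sessions["open_type"], sessions["day_type"],
                        normalize="index")

print(counts)
print(row_share.round(2))
```

Reading along a row of `row_share` gives the conditional day-type distribution for that open type, which is the quantity the hypothesis is stated in.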

Expected analysis

A 5x6 confusion matrix (5 open types × 6 day types). The hypothesis: Open Drive predicts Trend Day with ~50 to 60% accuracy; Open Auction predicts Range/Normal with ~50 to 60%. The accuracy is high enough to be useful, low enough that 09:45 classification should not be treated as deterministic.

Author's preliminary result

Informal validation on ES 2024 supports ~55 to 65% top-1 accuracy at 09:45 ET. Accuracy improves to ~80% by 10:30 ET.


B.6 Composite Regime Classification Stability

Research question

How often does the regime composite (ATR percentile, BBW percentile, KER, ADX) remain stable through a session vs flip?

Methodology

  1. Compute the composite at 10:30 ET, 12:00 ET, 14:00 ET on 60 ES sessions.
  2. For each session, count the number of regime label changes.
  3. Sessions with 0 changes: stable. With 1 change: one transition. With 2+: choppy / unclassifiable.
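
Counting label changes across the three checkpoints can be sketched as follows. The function names and the regime labels in the example are hypothetical; the sketch assumes the composite has already been reduced to one label per checkpoint.

```python
def count_changes(labels):
    """Number of times the regime label differs from the previous checkpoint."""
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)

def stability_class(labels):
    """Map a session's checkpoint labels to the study's three categories."""
    changes = count_changes(labels)
    if changes == 0:
        return "stable"
    if changes == 1:
        return "one transition"
    return "choppy"

# Labels at 10:30, 12:00, and 14:00 ET for one hypothetical session.
print(stability_class(["Trend", "Trend", "Range"]))
```

With three checkpoints a session can have at most two changes, so the "2+" category here always means a label that flipped at both 12:00 and 14:00.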

Expected analysis

The hypothesis:

  • ~70% of sessions have 0 changes.
  • ~20% have 1 change (typically morning trend → afternoon range or vice versa).
  • ~10% have 2+ changes (skip days).

If trade performance in sessions with 2+ changes is statistically distinguishable from performance in stable sessions, and worse, the data supports the "skip transitional regimes" rule.


B.7 FVG Retest Conditional Probability

Research question

For an FVG ≥ 1.0 × ATR formed during RTH, what is the probability of:

  • Proximal-edge retest (within 20 bars).
  • Mid-FVG retest.
  • Full-fill retest.

Conditional on:

  • Regime.
  • Impulse strength of the bar that formed the FVG.
  • Session phase (open, mid-morning, lunch, afternoon).

Methodology

  1. On ES 5-min over 90 sessions, identify all FVGs ≥ 1.0 × ATR.
  2. Track retest outcomes within 20 bars.
  3. Stratify.
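
Tracking the retest outcome can be sketched as a deepest-level-reached label. This rests on assumptions that this appendix does not spell out: the proximal edge is taken to be the gap edge nearest to price after the impulse (the top of an upward gap, the bottom of a downward gap), the mid level is the arithmetic midpoint, and a full fill means reaching the far edge. The function name and arguments are hypothetical.

```python
def fvg_retest_depth(gap_low, gap_high, direction, highs, lows):
    """Deepest retest level reached: 'none', 'proximal', 'mid', or 'full'.

    direction:   'up' for a gap left by an upward impulse, 'down' otherwise.
    highs, lows: bar highs and lows for the tracking window (e.g. the next 20 bars).
    """
    mid = (gap_low + gap_high) / 2
    if direction == "up":
        deepest = min(lows, default=float("inf"))
        # Check the deepest level first so the label reflects maximum penetration.
        for name, level in (("full", gap_low), ("mid", mid), ("proximal", gap_high)):
            if deepest <= level:
                return name
    else:
        highest = max(highs, default=float("-inf"))
        for name, level in (("full", gap_high), ("mid", mid), ("proximal", gap_low)):
            if highest >= level:
                return name
    return "none"

# Hypothetical upward gap between 100 and 104; price later dips to 103.5.
print(fvg_retest_depth(100.0, 104.0, "up", highs=[106.0, 105.0], lows=[105.0, 103.5]))
```

Because the labels are nested, the proximal-edge rate is the share of gaps labelled 'proximal', 'mid', or 'full', and the mid rate is the share labelled 'mid' or 'full'.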

Expected analysis

The hypothesis:

  • Proximal-edge retest: 65 to 75% in Trend regimes; 40 to 50% in Range.
  • Mid-FVG retest: 35 to 45% in Trend; 20 to 30% in Range.
  • Full-fill retest: 10 to 20% in Trend; 5 to 15% in Range.


B.8 Cross-Asset Divergence and Subsequent Reversion

Research question

When ES and NQ decouple (correlation drops below 0.7 intraday), how often does the divergence resolve toward the "median" view in subsequent sessions?

Methodology

  1. On 90 sessions, compute intraday rolling correlation between ES and NQ in 1-hour windows.
  2. Identify the 10% of sessions with the largest deviation from typical correlation.
  3. For each, track the subsequent 5 sessions: did the relative ranking of ES vs NQ revert?
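
The rolling-correlation step can be sketched with pandas. The function name is hypothetical, the inputs are assumed to be aligned 1-minute return series (so a 60-observation window is one hour), and the demo series are synthetic random numbers constructed to be perfectly correlated in the first half and perfectly anti-correlated in the second.

```python
import numpy as np
import pandas as pd

def rolling_decoupling(es_ret, nq_ret, window=60, threshold=0.7):
    """Rolling ES/NQ return correlation and a flag for windows below threshold."""
    corr = es_ret.rolling(window).corr(nq_ret)
    # Comparisons with NaN are False, so warm-up rows are never flagged.
    return pd.DataFrame({"corr": corr, "decoupled": corr < threshold})

# Synthetic returns in arbitrary units, for illustration only.
rng = np.random.default_rng(0)
es = pd.Series(rng.normal(0.0, 1.0, 240))
nq = 1.2 * es
nq.iloc[120:] = -nq.iloc[120:]  # force a sign flip halfway through

result = rolling_decoupling(es, nq)
print(result.iloc[[100, 200]])
```

On real data, the per-session minimum or mean of `corr` is one possible way to rank sessions for step 2; which summary to use is a design choice the protocol leaves open.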

Expected analysis

The hypothesis: divergences resolve in 60 to 70% of cases. This validates the "watch for divergence" framework as a flag rather than as a primary signal.


B.9 News Blackout Window Effectiveness

Research question

For traders who use the 5-min-before to 15-min-after blackout protocol on tier-one news, how much does the protocol reduce news-related slippage compared to traders who do not?

Methodology

  1. Identify all tier-one news events in 60 ES sessions.
  2. Two scenarios: (a) blackout discipline; (b) trade through.
  3. Compute hypothetical slippage on stops and entries in each scenario.
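
Scenario (a) needs a rule for which timestamps fall inside a blackout. A minimal sketch, using the 5-minutes-before to 15-minutes-after window from the research question; the function name and the example event time are hypothetical.

```python
from datetime import datetime, timedelta

def in_blackout(ts, events,
                before=timedelta(minutes=5), after=timedelta(minutes=15)):
    """True if ts falls within [event - before, event + after] for any event."""
    return any(event - before <= ts <= event + after for event in events)

# Hypothetical tier-one release at 08:30.
release = datetime(2025, 1, 10, 8, 30)
print(in_blackout(datetime(2025, 1, 10, 8, 26), [release]))
print(in_blackout(datetime(2025, 1, 10, 8, 46), [release]))
```

Fills whose timestamps return True are the ones scenario (a) removes and scenario (b) keeps, so the slippage comparison reduces to a comparison over that subset.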

Expected analysis

The hypothesis: blackout discipline reduces news-related stop slippage by 70 to 90% compared to trade-through. The economic value compounds over months.


B.10 Daily-Loss Cap Compliance Effects

Research question

For traders with documented per-session loss caps, do those who comply consistently outperform those who violate?

Methodology

  1. From a sample of trader journals (synthetic or anonymized), classify sessions by:
     • Hit cap and stopped (compliant).
     • Hit cap but continued (violated).
     • Did not hit cap.
  2. Compute monthly aggregate P&L by compliance category.
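
The classification and monthly aggregation can be sketched with pandas. The journal schema (`month`, `hit_cap`, `traded_after_cap`, `pnl`), the category names, and the four rows are all hypothetical; the rows are synthetic and carry no empirical meaning.

```python
import pandas as pd

def label_session(hit_cap, traded_after_cap):
    """Assign one of the three compliance categories to a session."""
    if not hit_cap:
        return "no_cap_hit"
    return "violated" if traded_after_cap else "compliant"

# Synthetic journal rows, for illustration only.
journal = pd.DataFrame({
    "month": ["2025-01"] * 4,
    "hit_cap": [True, True, False, False],
    "traded_after_cap": [False, True, False, False],
    "pnl": [-500.0, -1200.0, 300.0, 450.0],
})
journal["category"] = [
    label_session(hit, after)
    for hit, after in zip(journal["hit_cap"], journal["traded_after_cap"])
]

# One row per month, one column per compliance category.
monthly = (journal.groupby(["month", "category"])["pnl"]
           .sum()
           .unstack(fill_value=0.0))
print(monthly)
```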

Expected analysis

The hypothesis: violators have larger monthly drawdowns despite occasionally recovering from violation losses. The expected value of "one more trade after the cap" is negative on net.


B.11 Level Quality Score Calibration

Research question

For levels scored by Chapter 5's quality function, is the empirical reaction rate monotone increasing in score?

Methodology

  1. Compute scores for all levels across 60 sessions on ES.
  2. For each level, observe the actual reaction (price reversed or continued through).
  3. Bin by score in 2-unit increments.
  4. Compute empirical reaction rate per bin.
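
The binning and the monotonicity check can be sketched as follows. The column names (`score`, `reacted`) and the toy rows are hypothetical, and the sketch treats the quality score as a plain number without reproducing Chapter 5's scoring function.

```python
import pandas as pd

def reaction_by_score(levels, bin_width=2):
    """Empirical reaction rate and sample size per score bin (lower edge as label)."""
    df = levels.copy()
    df["score_bin"] = (df["score"] // bin_width) * bin_width
    return df.groupby("score_bin")["reacted"].agg(rate="mean", n="size")

def is_monotone_increasing(rates):
    """True if each bin's rate is at least the previous bin's rate."""
    values = list(rates)
    return all(a <= b for a, b in zip(values, values[1:]))

# Hypothetical toy levels, for illustration only.
levels = pd.DataFrame({
    "score": [1, 1, 3, 3, 5, 5],
    "reacted": [False, False, True, False, True, True],
})
curve = reaction_by_score(levels)
print(curve)
print(is_monotone_increasing(curve["rate"]))
```

A strict pass/fail check like this is sensitive to noise in thin bins, so it is worth reading alongside the per-bin `n` rather than on its own.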

Expected analysis

The hypothesis: monotone increasing curve, with ~30% reaction at score 0-2 (random level), rising to ~75 to 85% at score 15+.

Calibration insight

If the curve is not monotone, the score function needs re-weighting. The walk-forward approach: split data, fit weights on first half, validate on second half.


B.12 Regime-Stratified Strategy Performance

Research question

For each of the book's frameworks (2, 3, 4), what is the regime-stratified Sharpe?

Methodology

  1. Run each framework on 250+ sessions of historical data.
  2. Tag each trade with the regime classification at the time.
  3. Compute Sharpe per (framework × regime) cell.
  4. Identify the regimes where each framework is strongest and weakest.
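
The per-cell Sharpe computation can be sketched with a pandas groupby. The sketch assumes a per-trade, unannualized Sharpe (mean trade return divided by its sample standard deviation), which is one of several possible conventions and not necessarily the one the book uses. Column names and the toy returns are hypothetical.

```python
import pandas as pd

def per_trade_sharpe(returns):
    """Mean over sample standard deviation; NaN when the cell is too thin."""
    sd = returns.std(ddof=1)
    # sd is NaN for a single trade, and NaN > 0 is False, so this returns NaN.
    return returns.mean() / sd if sd > 0 else float("nan")

# Synthetic trade returns in R-multiples, for illustration only.
trades = pd.DataFrame({
    "framework": ["Framework 2"] * 5,
    "regime": ["Trend-Vol", "Trend-Vol", "Trend-Vol", "Range-Calm", "Range-Calm"],
    "ret": [1.0, 2.0, 3.0, -1.0, 1.0],
})

# Rows are frameworks, columns are regimes.
grid = (trades.groupby(["framework", "regime"])["ret"]
        .apply(per_trade_sharpe)
        .unstack())
print(grid)
```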

Expected analysis

The hypothesis:

  • Framework 2 (Trend Continuation): strongest in Trend-Vol; positive but smaller in Trend-Calm; near zero or negative in Range regimes.
  • Framework 3 (Range Fade): strongest in Range-Calm; positive but smaller in Range-Vol; near zero or negative in Trend regimes.
  • Framework 4 (Squeeze Breakout): strongest in Squeeze regime transitions; near zero in stable regimes.


B.13 Open-Type x Open-Balance Interaction

Research question

How does the interaction of open type and open balance (in/out-of-balance relative to prior VAH/VAL) predict day-type classification?

Methodology

  1. For 60 ES sessions, classify open type and open balance.
  2. For each cell of the 5x3 matrix, compute the empirical day-type distribution.
  3. Identify the highest-conviction (highest day-type concentration) cells.

Expected analysis

The hypothesis: out-of-balance + Open Drive in same direction is the highest-conviction Trend Day setup (~80%+ Trend Day frequency). In-balance + Open Auction is the highest-conviction Range Day setup.


B.14 Reporting standards

For each study above, report:

  1. Methodology summary (data, period, contracts).
  2. Sample size for each cell of the stratification.
  3. Confidence intervals on conditional probabilities.
  4. Robustness check: re-run on a different time period or contract.
  5. Visualization (heatmap, bar chart, or other appropriate format).
  6. Implications for the framework: where do the results align with the book's claims, and where do they suggest revision?
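
For item 3, one common choice for a confidence interval on a conditional probability is the Wilson score interval, which behaves better than the normal approximation when cells are small. The appendix does not prescribe an interval method, so this is a suggestion, and the function name is hypothetical.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Example: 15 retests observed in a cell of 20 levels (hypothetical counts).
low, high = wilson_interval(15, 20)
print(round(low, 2), round(high, 2))
```

With 15 of 20, the interval runs from about 0.53 to 0.89, which shows how wide the uncertainty remains in a 20-observation cell even when the point estimate is 75%.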

The studies are alive: they should be re-run periodically as data accumulates and as market regimes evolve. A study that supported a claim in 2025 may not in 2027; the book's frameworks should adjust accordingly.


B.15 Computational infrastructure suggestions

For the studies above, the recommended toolchain:

  • Data: Databento (databento-python) for clean intraday futures data; CME DataMine for tick-precision when needed.
  • Computation: Python with pandas, numpy. For walk-forward, vectorbt or hand-rolled. For DSR/PBO, mlfinlab.
  • Visualization: matplotlib for static; plotly for interactive heatmaps and confusion matrices.
  • Versioning: git for code, parquet files for data snapshots.
  • Replication: every study should be runnable from a single notebook end to end.

This appendix is the empirical contract. The book's claims are the hypotheses; this appendix is the protocol for testing them. A trader who runs even half of these studies has a substantially calibrated picture of which book frameworks have edge in their specific market context and which do not.