DatasetDoctor | ML Insights Hub

Data leakage is one of the most common and damaging issues in modern machine learning. It occurs when information from the target variable "leaks" into the features used for training, providing the model with unrealistic clues that won't be available in production.

The Target Leakage Trap

Target leakage happens when your predictors include data that wouldn't actually be available at the time of prediction. For example, a model that predicts whether a customer will default on a loan might be given a feature like "Collection Agency Contact Date", which only exists *after* the default has occurred.

Pro Tip

Always verify the "Time of Generation" for every feature in your dataset. If a feature is generated at the same time or after the target, it's a high-risk candidate for leakage.
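The timing check described in this tip can be sketched in a few lines. This is a minimal illustration, not a complete audit tool: the `feature_generated_at` mapping, the dates, and the `flag_late_features` helper are all hypothetical, and in practice the timestamps would have to come from your own data lineage or schema documentation.

```python
from datetime import datetime

# Hypothetical metadata: when each feature's value is first recorded for a loan.
feature_generated_at = {
    "credit_score": datetime(2023, 1, 10),
    "income": datetime(2023, 1, 10),
    "collection_agency_contact_date": datetime(2023, 9, 2),
}

# Hypothetical moment at which the target (the default) is recorded.
target_generated_at = datetime(2023, 8, 1)


def flag_late_features(generated_at, target_time):
    """Return features generated at the same time as, or after, the target."""
    return sorted(name for name, ts in generated_at.items() if ts >= target_time)


late_features = flag_late_features(feature_generated_at, target_generated_at)
print(late_features)  # ['collection_agency_contact_date']
```

Anything the helper returns is a candidate for review, not an automatic deletion: some late-generated fields have an earlier snapshot that is safe to use.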

Technical Detection: Correlation isn't enough

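While high correlation between a feature and the target is a warning sign, it's not definitive proof of leakage, and a low correlation does not rule leakage out. We recommend using a Predictive Power Score (PPS) analysis. PPS can detect non-linear relationships that standard Pearson correlation misses, and a single feature with a score close to 1 predicts the target almost perfectly on its own, which makes it worth investigating.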
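To make the difference between Pearson correlation and a PPS-style score concrete, here is a small standard-library sketch. It is an assumption-laden stand-in rather than a real PPS implementation: PPS tooling typically fits a decision tree, whereas the hypothetical `binned_power_score` below predicts the median of the target within equal-width bins of the feature. The idea is the same, though: compare a one-feature model's held-out error against a naive baseline.

```python
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def binned_power_score(xs, ys, n_bins=10):
    """PPS-style score: 1 - MAE(one-feature model) / MAE(naive median).

    The "model" is a per-bin median, a simplified substitute for the
    decision tree that PPS implementations usually rely on.
    """
    # Even-indexed rows train the model, odd-indexed rows evaluate it.
    train = list(zip(xs[0::2], ys[0::2]))
    test = list(zip(xs[1::2], ys[1::2]))

    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_of(x):
        return min(int((x - lo) / width), n_bins - 1)

    bins = {}
    for x, y in train:
        bins.setdefault(bin_of(x), []).append(y)

    naive = statistics.median(y for _, y in train)
    per_bin = {b: statistics.median(v) for b, v in bins.items()}

    mae_model = statistics.fmean(abs(y - per_bin.get(bin_of(x), naive)) for x, y in test)
    mae_naive = statistics.fmean(abs(y - naive) for _, y in test)
    if mae_naive == 0:
        return 0.0
    return max(0.0, 1 - mae_model / mae_naive)


# A purely non-linear relationship: y = x^2 on a range symmetric around zero.
xs = [i / 50 - 1 for i in range(101)]
ys = [x * x for x in xs]

r = pearson(xs, ys)
score = binned_power_score(xs, ys)
print(f"Pearson r = {r:.3f}, PPS-style score = {score:.3f}")
```

On this toy data the Pearson coefficient is essentially zero while the PPS-style score is high, which is the kind of relationship a correlation matrix alone would miss.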

python / leakage_detector.py

# Detecting target leakage using feature importance
from sklearn.ensemble import RandomForestClassifier


def check_leakage(df, target):
    """Return features whose importance is suspiciously close to 1."""
    X = df.drop(columns=[target])
    y = df[target]

    model = RandomForestClassifier()
    model.fit(X, y)

    # Flags features with suspicious 0.95+ importance.
    # Importances sum to 1, so at most one feature can cross this threshold.
    leakage_candidates = [
        f for f, s in zip(X.columns, model.feature_importances_) if s > 0.95
    ]
    return leakage_candidates

Prevention Strategies

Building a robust validation pipeline is your first line of defense. Use Time-Series Split for any data that has a temporal component. Random K-Fold cross-validation often masks leakage because it lets the model train on rows from the "future" relative to the rows it is validated on.
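The splitting idea can be illustrated without any dependencies. The sketch below is a hand-rolled expanding-window splitter, with `expanding_window_splits` as a hypothetical helper name; it assumes rows are already sorted by timestamp. In a scikit-learn pipeline you would normally reach for `sklearn.model_selection.TimeSeriesSplit` rather than writing this yourself.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for time-ordered rows.

    Each training window ends before its test window begins, so the model
    is never fitted on rows that come later than the rows it is scored on.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Not enough samples for the requested number of splits")

    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(start))
        test_idx = list(range(start, start + test_size))
        yield train_idx, test_idx


# Eight rows, assumed to be sorted from oldest to newest.
splits = list(expanding_window_splits(n_samples=8, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
# train: [0, 1] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
# train: [0, 1, 2, 3, 4, 5] test: [6, 7]
```

If a model scores much better under shuffled K-Fold than under a split like this, that gap is itself a hint that some feature is carrying information from the future.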