In machine learning workflows, the standard practice is to split a dataset into training and test subsets before applying most preprocessing transformations, in order to prevent data leakage.
However, certain preliminary data cleaning operations can be performed safely on the entire dataset beforehand, because they do not depend on statistical summaries of the data and therefore cannot introduce information from the test set into the training process.
Below are examples of preprocessing that can safely be done before splitting (a short pandas sketch follows the list).
- Removing duplicate rows.
- Fixing data types, e.g. parsing date strings into datetimes.
- Removing bad data or impossible values, e.g. age > 150.
- Trimming leading and trailing whitespace from strings.
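
A minimal pandas sketch of these pre-split cleaning steps; the file name and the column names ("signup_date", "age", "city") are hypothetical placeholders, not taken from a specific dataset.

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file

# 1. Remove exact duplicate rows.
df = df.drop_duplicates()

# 2. Fix data types, e.g. parse date strings into datetimes.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# 3. Drop impossible values, e.g. ages above 150 or below 0.
df = df[(df["age"] >= 0) & (df["age"] <= 150)]

# 4. Trim stray whitespace from a string column.
df["city"] = df["city"].str.strip()

# None of these steps uses statistics computed from the data (means,
# variances, etc.), so they leak nothing from a future test split.
```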
- To see why this matters, imagine you're studying for an exam.
- You're supposed to practice using your textbook (training data) and then take the exam (test data) to see how well you've learned.
- Now imagine someone secretly shows you some of the exam questions while you're studying.
- When you take the test, you score really high, but not because you truly understood the material; you just recognized the questions. That's data leakage!
- The training data is what the model learns from.
- The test data is supposed to check how well it learned.
- If information from the test data sneaks into training, the model gets an unfair advantage.
- It looks like it performs very well, but when you give it completely new data in the real world, performance drops.
- So data leakage makes the model look smarter than it actually is, which is dangerous because it won't work as well in real-life situations (see the sketch below).
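
To make this concrete, here is a minimal scikit-learn sketch on synthetic data (the LogisticRegression model and the generated data are purely illustrative): the scaler is wrapped in a Pipeline and fit only on the training split, while the commented-out lines show the leaky alternative of scaling before the split.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                    # synthetic features
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)  # synthetic labels

# Leaky version: the scaler's mean/std are computed from ALL rows,
# including the rows that will later become the test set.
# X_scaled = StandardScaler().fit_transform(X)
# X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)

# Leak-free version: split first, then let the pipeline fit the scaler
# on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)        # scaler statistics come from X_train only
print("test accuracy:", model.score(X_test, y_test))
```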
- Suppose you are building a model to predict house prices, and the dataset contains missing values in the feature “Lot Size.”
- You calculate the mean lot size using the entire dataset (including both training and test data) and use that value to fill in all missing entries.
- After performing this imputation, you split the data into training and test sets.
- This creates data leakage because the imputed values were influenced by information from the test set.
- As a result, the model's evaluation may appear more accurate than it truly is, since the training process indirectly incorporated knowledge from unseen data, as the sketch below illustrates.
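
Below is a minimal sketch of the "Lot Size" example; the column name and the synthetic values are illustrative. It shows that the mean computed on the full dataset differs from the mean computed on the training rows alone, and that only the training mean should be used to fill either split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"LotSize": [5000, 6200, np.nan, 7500, np.nan, 8800, 4100, 9900]})

# Leaky approach: the fill value is computed over ALL rows,
# including the rows that will end up in the test set.
leaky_mean = df["LotSize"].mean()
df_leaky = df.fillna({"LotSize": leaky_mean})
train_leaky, test_leaky = train_test_split(df_leaky, test_size=0.25, random_state=0)

# Leak-free approach: split first, compute the mean on the training rows only,
# then reuse that same training mean to fill the test rows.
train, test = train_test_split(df, test_size=0.25, random_state=0)
train_mean = train["LotSize"].mean()
train = train.fillna({"LotSize": train_mean})
test = test.fillna({"LotSize": train_mean})

print("mean from full data:    ", leaky_mean)
print("mean from training rows:", train_mean)
```

In practice the same leak-free behaviour falls out of putting scikit-learn's SimpleImputer inside a Pipeline, since the pipeline is fit on the training data only.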

