Data leakage in applied ML: reproducing examples from genomics, medicine and radiology
Analyzing the Causes and Consequences of Data Leakage
Hello everyone! I’m Shaivi Malik, a computer science and engineering student. I am thrilled to announce that I have been selected as a Summer of Reproducibility Fellow. I will be contributing to the Data leakage in applied ML: reproducing examples of irreproducibility project under the mentorship of Fraida Fund and Mohamed Saeed. You can find my proposal here.
This summer, we will reproduce studies from medicine, radiology and genomics. Through these studies, we’ll explore and demonstrate three types of data leakage:
- Pre-processing the training and test sets together
- Using features that are not legitimate inputs for the model
- Performing feature selection on the combined training and test sets
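To make the first type concrete, here is a minimal sketch of how pre-processing the training and test sets together lets information from the test set leak into the model's inputs. The toy values and the helper names `fit_scaler` and `transform` are made up for illustration; they are not taken from any of the papers we will reproduce.

```python
import statistics


def fit_scaler(values):
    """Return the (mean, standard deviation) of the given values only."""
    return statistics.mean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical toy feature, already split into training and test sets.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky: the scaling statistics are computed on train and test together,
# so the test values influence how every sample is transformed.
leaky_mean, leaky_std = fit_scaler(train + test)

# Leak-free: the scaling statistics come from the training set alone
# and are then reused, unchanged, on the test set.
clean_mean, clean_std = fit_scaler(train)

leaky_train = transform(train, leaky_mean, leaky_std)
clean_train = transform(train, clean_mean, clean_std)
clean_test = transform(test, clean_mean, clean_std)

print("mean used with leakage:   ", leaky_mean)
print("mean used without leakage:", clean_mean)
```

The two means differ because the leaky version has already "seen" the test values. The same rule applies to any fitted pre-processing step, such as imputation or feature selection: fit it on the training set only, then apply it to the test set.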
For each paper, we will replicate the published results with and without the data leakage error, and present performance metrics for comparison. We will also provide explanatory materials and example questions to test understanding. All these resources will be bundled together in a dedicated repository for each paper.
This project aims to address the need for accessible educational material on data leakage. The materials will be designed so that instructors teaching machine learning in a wide variety of contexts can readily adopt them. They will be written to be clear and easy to follow for readers from a broad range of backgrounds, and to raise awareness of the consequences of data leakage.
Stay tuned for updates on my progress! You can follow me on GitHub and watch out for my upcoming blog posts.