Extending AIDRIN for Scientific Data Formats

AIDRIN - inspecting dataset readiness for AI pipelines

โฑ๏ธ Reading time: 2โ€“3 minutes

Hi ๐Ÿ‘‹

I’m Dhanush, a first-year Master’s student in Computer Engineering at New York University. I like building systems that work reliably at scale, and I care a lot about data quality because I’ve seen firsthand what happens when you ignore it. Before NYU, I spent a year at Tata Consultancy Services building backend systems and monitoring production workflows, which gave me a very concrete sense of how data quality problems compound quietly until they suddenly aren’t quiet anymore.

This summer I’m contributing to AIDRIN (AI Data Readiness Inspector) as part of OSRE 2026, under the mentorship of Dr. Jean Luca Bez and Prof. Suren Byna from the Scientific Data Division at Lawrence Berkeley National Laboratory (LBNL).

AIDRIN is an open-source framework that helps researchers and data scientists evaluate whether a dataset is genuinely ready for AI workflows by assessing data quality, FAIR-principle compliance, data structure performance, bias, and the impact of data on AI applications. The idea is straightforward: before you trust a model, you should be able to trust the data behind it. AIDRIN also supports remediating lapses found in the data used by AI applications.

Why this work matters to me

While working on a course project building a taxi-trip prediction system, I spent an afternoon chasing a degraded prediction. The root cause was a single bad record propagating through the whole batch of data. I had to diagnose it manually. That experience stuck with me, because it showed me that data problems don’t announce themselves. They just quietly make everything worse.

AIDRIN is built to surface those problems earlier. What drew me to this project is a gap that limits who can use it today: AIDRIN supports CSV, Excel, JSON, NumPy, and HDF5, but many scientific workflows rely on formats that are harder to flatten into a spreadsheet, such as ROOT in high-energy physics, Zarr in climate and genomics, and HDF5 layouts that aren’t simple tables. Getting data out of those formats into CSV is often painful or lossy.

What I’m working on this summer

My project is about improving AIDRIN’s support for data and file formats that are not tabular. I’ll start by hardening the HDF5 reader, building an inventory of what’s actually in each file, and handling real scientific layouts instead of assuming everything is a flat table. On top of that, I’m adding native support for Zarr and ROOT, two formats that show up constantly in climate, genomics, and high-energy physics workflows but aren’t well served by converting everything to CSV first. I’m also building a pluggable ingestion layer for custom sources and multi-file support, so related datasets can be combined before analysis.
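To make the "pluggable ingestion layer" idea a bit more concrete, here is a minimal sketch of one way such a layer could be shaped. It is not AIDRIN’s actual code: the names `READERS`, `register_reader`, and `load_records` are hypothetical, and it uses only the Python standard library with CSV and JSON as stand-ins. In this design, readers for formats like Zarr or ROOT would register themselves in the same way.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Hypothetical registry mapping a file suffix to a reader function.
READERS: Dict[str, Callable[[Path], List[dict]]] = {}


def register_reader(suffix: str):
    """Decorator that registers a reader function for one file suffix."""
    def decorator(func):
        READERS[suffix.lower()] = func
        return func
    return decorator


@register_reader(".csv")
def read_csv(path: Path) -> List[dict]:
    # Each CSV row becomes a dict keyed by the header row.
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


@register_reader(".json")
def read_json(path: Path) -> List[dict]:
    # Accept either a list of records or a single record.
    with path.open() as handle:
        data = json.load(handle)
    return data if isinstance(data, list) else [data]


def load_records(paths: Iterable) -> List[dict]:
    """Read each file with its registered reader and concatenate the records."""
    records: List[dict] = []
    for raw in paths:
        path = Path(raw)
        reader = READERS.get(path.suffix.lower())
        if reader is None:
            raise ValueError(f"No reader registered for {path.suffix!r}")
        records.extend(reader(path))
    return records
```

Calling `load_records(["trips.csv", "trips_extra.json"])` (file names made up for illustration) would return one combined list of records. The point of the registry is that adding a new format means writing one decorated function and leaving the loading loop untouched.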

What’s next

Right now, I’m mapping how data flows from upload through to the metric modules and writing an HDF5 audit of sample files before changing the implementation.

More updates as the summer goes on. If you work in data quality, scientific data management, or open-source tooling, I’d love to connect.

👉 Read my proposal here

Dhanush
Master’s student, New York University

Dhanush is a Master’s student at New York University with interests in software engineering, distributed systems, machine learning, and open-source development. He enjoys building scalable backend systems, contributing to open-source projects, and exploring practical applications of AI. He is excited to collaborate with the community and create impactful solutions.

Jean Luca Bez
Research Scientist, Lawrence Berkeley National Laboratory

Jean Luca is a Career-Track Research Scientist at Lawrence Berkeley National Laboratory (LBNL), USA. Jean Luca’s research interests are in High Performance Computing (HPC), data management, I/O, storage, and AI data readiness.

Suren Byna
Professor, The Ohio State University

Suren Byna is a Professor in the Department of Computer Science and Engineering (CSE) at The Ohio State University. He is a Visiting Faculty Scientist at Lawrence Berkeley National Lab (LBNL). Prior to joining OSU, he was a Senior Scientist in the Scientific Data Division at LBNL. His research interests are in developing and optimizing scientific data management systems.