<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Dhanush | UCSC OSPO</title><link>https://ucsc-ospo.netlify.app/author/dhanush/</link><atom:link href="https://ucsc-ospo.netlify.app/author/dhanush/index.xml" rel="self" type="application/rss+xml"/><description>Dhanush</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://ucsc-ospo.netlify.app/author/dhanush/avatar_hue726a3e5a04ea0af791512aef9a957e1_90797_270x270_fill_q75_lanczos_center.jpg</url><title>Dhanush</title><link>https://ucsc-ospo.netlify.app/author/dhanush/</link></image><item><title>Extending AIDRIN for Scientific Data Formats</title><link>https://ucsc-ospo.netlify.app/report/osre26/lbl/aidrin/20260619-dhanush010/</link><pubDate>Fri, 19 Jun 2026 00:00:00 +0000</pubDate><guid>https://ucsc-ospo.netlify.app/report/osre26/lbl/aidrin/20260619-dhanush010/</guid><description>&lt;p>⏱️ Reading time: 2–3 minutes&lt;/p>
&lt;p>Hi 👋&lt;/p>
&lt;p>I&amp;rsquo;m Dhanush, a first-year Master&amp;rsquo;s student in Computer Engineering at New York University. I like building systems that work reliably at scale, and I care a lot about data quality because I&amp;rsquo;ve seen firsthand what happens when you ignore it. Before NYU, I spent a year at Tata Consultancy Services building backend systems and monitoring production workflows, which gave me a very concrete sense of how data quality problems compound quietly until they suddenly aren&amp;rsquo;t quiet anymore.&lt;/p>
&lt;p>This summer I&amp;rsquo;m contributing to &lt;a href="https://ucsc-ospo.netlify.app/project/osre26/lbl/aidrin">AIDRIN&lt;/a> (AI Data Readiness Inspector) as part of OSRE 2026, under the mentorship of Dr. Jean Luca Bez and Prof. Suren Byna from the &lt;a href="https://scidata.lbl.gov" target="_blank" rel="noopener">Scientific Data Division&lt;/a> at Lawrence Berkeley National Laboratory (LBNL).&lt;/p>
&lt;p>&lt;a href="https://github.com/idtlab/AIDRIN" target="_blank" rel="noopener">AIDRIN&lt;/a> is an open-source framework that helps researchers and data scientists evaluate whether a dataset is genuinely ready for AI workflows by assessing data quality, FAIR-principle compliance, data structure performance, bias, and impact of data on AI applications. The idea is straightforward: before you trust a model, you should be able to trust the data behind it. AIDRIN also supports remediating data when lapses occur in the data used by AI applications.&lt;/p>
&lt;h2 id="why-this-work-matters-to-me">Why this work matters to me&lt;/h2>
&lt;p>While working on a course project building a taxi-trip prediction system, I spent an afternoon chasing a degraded prediction. The root cause was a single bad record propagating through the whole batch of data. I had to diagnose it manually. That experience stuck with me, because it showed me that data problems don&amp;rsquo;t announce themselves. They just quietly make everything worse.&lt;/p>
&lt;p>AIDRIN is built to surface those problems earlier. What drew me to this project is a gap that limits who can use it today: AIDRIN supports CSV, Excel, JSON, NumPy, and HDF5, but many scientific workflows rely on formats that are harder to flatten into a spreadsheet, such as ROOT in high-energy physics, Zarr in climate and genomics, and HDF5 layouts that aren&amp;rsquo;t simple tables. Getting data out of those formats into CSV is often painful or lossy.&lt;/p>
&lt;h2 id="what-im-working-on-this-summer">What I&amp;rsquo;m working on this summer&lt;/h2>
&lt;p>My project is about improving AIDRIN&amp;rsquo;s support for non-tabular data and file formats. I&amp;rsquo;ll start by hardening the HDF5 reader, building an inventory of what&amp;rsquo;s actually in each file, and handling real scientific layouts instead of assuming everything is a flat table. On top of that, I&amp;rsquo;m adding native support for Zarr and ROOT, two formats that show up constantly in climate, genomics, and high-energy physics workflows but aren&amp;rsquo;t well served by converting everything to CSV first. I&amp;rsquo;m also building a pluggable ingestion layer for custom sources, along with multi-file support so that related datasets can be combined before analysis.&lt;/p>
&lt;h2 id="whats-next">What&amp;rsquo;s next&lt;/h2>
&lt;p>Right now, I&amp;rsquo;m mapping how data flows from upload through to the metric modules and writing an HDF5 audit of sample files before changing the implementation.&lt;/p>
&lt;p>More updates as the summer goes on. If you work in data quality, scientific data management, or open-source tooling, I&amp;rsquo;d love to connect.&lt;/p>
&lt;p>👉 &lt;a href="https://drive.google.com/file/d/1g44iFr41vXxzA6rlCRQ_1L-hkEqhS_Zu/view?usp=sharing" target="_blank" rel="noopener">Read my proposal here&lt;/a>&lt;/p></description></item></channel></rss>