data science
[Mid-term] Capturing provenance into Data Science/Machine Learning workflows
This post describes our midterm status and what we have achieved so far in the project for the noWorkflow package. In the initial weeks, I carried out a bibliographical review on reproducibility in the Data Science (DS) and Machine Learning (ML) realms.
Jesse Lima
Last updated on Dec 6, 2024
Verify the reproducibility of an experiment
Hello everyone, my name is Jesse and I’m proud to be a fellow in the 2023 Summer of Reproducibility program, contributing to the noWorkflow project. My proposal was accepted under the mentorship of João Felipe Pimentel and Juliana Freire and aims to map and test the capture of provenance in typical Data Science and Machine Learning experiments.
Jesse Lima
Last updated on Aug 4, 2023
FlashNet: Towards Reproducible Data Science for Storage System
The Data Storage Research Vision 2025, organized in an NSF workshop, calls for more “AI for storage” research. However, performing ML-for-storage research can be a daunting task for new storage researchers.
Haryadi S. Gunawi
Polyphorm / PolyPhy
PolyPhy is a GPU-oriented agent-based system for reconstructing and visualizing optimal transport networks defined over sparse data. Rooted in astronomy and inspired by nature, we have used an early prototype called Polyphorm to reconstruct the Cosmic web structure, but also to discover network-like patterns in natural language data.
Oskar Elek
Apache AsterixDB
AsterixDB is an open source parallel big-data management system. AsterixDB is a well-established Apache project that has been active in research for more than 10 years. It provides a flexible data model that supports modern NoSQL applications with a powerful query processor that can scale to billions of records and terabytes of data.
Ahmed Eldawy
FasTensor
FasTensor is a parallel execution engine for user-defined functions on multidimensional arrays. The user-defined functions follow the stencil metaphor used for scientific computing and are effective for expressing a wide range of computations for data analyses, including common aggregation operations from database management systems and advanced machine learning pipelines.
John Wu
FasTensor
FasTensor is a parallel execution engine for user-defined functions on multidimensional arrays. The user-defined functions follow the stencil metaphor used for scientific computing and are effective for expressing a wide range of computations for data analyses, including common aggregation operations from database management systems and advanced machine learning pipelines.
John Wu, Bin Dong
Polyphorm / PolyPhy
Polyphorm is an agent-based system for reconstructing and visualizing optimal transport networks defined over sparse data. Rooted in astronomy and inspired by nature, we have used Polyphorm to reconstruct the Cosmic web structure, but also to discover network-like patterns in natural language data.
Oskar Elek
DirtViz 2.0 (2023)
DirtViz is a project to visualize data collected from sensors deployed in sensor networks. We have deployed a number of sensors measuring qualities like soil moisture, temperature, current and voltage in outdoor settings.
Colleen Josephson