Introducing Unstructured

The Problem

Our team has years of collective experience working on large-scale machine learning initiatives across diverse industries including finance, healthcare/pharma, CPG, logistics, energy, and government. Throughout that time we noticed a problem: slow, arduous data preparation processes hamper customers’ ability to successfully deploy artificial intelligence solutions.

Lacking adequate tooling, developers resort to building one-off solutions for each project, resulting in wasted effort and slow turnaround times. By the time their data is prepped and ready for a machine learning pipeline, professional services bills have piled up and delivery timelines have stretched by weeks or months. While powerful solutions have emerged over the past several years to help engineers deal with messy tabular data, the ecosystem of tools to curate and clean unstructured data such as text, images, and audio remains sparse. This is a huge problem because more than 80 percent of organizations’ data is unstructured.

The Solution

We’ve decided to tackle this challenge head-on by building the first open-source platform designed to accelerate the preprocessing of unstructured data. With an initial focus on natural language data, our Python library provides off-the-shelf components that make it fast, easy, and inexpensive to transform these data sets from raw to ML-ready. The library includes bricks for partitioning documents into their constituent parts, cleaning out unwanted text such as boilerplate and sentence fragments, and staging outputs for downstream tasks such as data labeling in Label Studio or inference with Hugging Face. Once users have orchestrated a preprocessing pipeline, our tooling allows them to publish it as a REST API or download it in a container. Together, these bricks fill critical gaps in tooling for data scientists who need to rapidly preprocess their unstructured data.
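
To make the partition, clean, and stage flow more concrete, here is a minimal sketch in plain Python. It is an illustration only and does not use or reproduce the library's API: the names `Element`, `partition`, `clean`, `stage_for_labeling`, and `preprocess`, the title heuristic, the boilerplate phrases, and the output dictionary shape are all assumptions made for this example.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Element:
    """One piece of a document (hypothetical type for this sketch)."""
    kind: str  # "title" or "narrative"
    text: str


def partition(raw: str) -> list[Element]:
    """Split raw text on blank lines and label each block.

    Assumed heuristic: a short block without closing punctuation is a title.
    """
    elements = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        text = " ".join(block.split())  # collapse internal whitespace
        if not text:
            continue
        is_title = len(text.split()) <= 8 and not text.endswith((".", "!", "?", ":"))
        elements.append(Element("title" if is_title else "narrative", text))
    return elements


def clean(elements: list[Element],
          boilerplate: tuple[str, ...] = ("all rights reserved", "click here"),
          min_words: int = 4) -> list[Element]:
    """Drop blocks containing boilerplate phrases and very short narrative fragments."""
    kept = []
    for el in elements:
        lowered = el.text.lower()
        if any(phrase in lowered for phrase in boilerplate):
            continue
        if el.kind == "narrative" and len(el.text.split()) < min_words:
            continue
        kept.append(el)
    return kept


def stage_for_labeling(elements: list[Element]) -> list[dict]:
    """Convert elements to a generic list of task dictionaries (assumed shape)."""
    return [{"data": {"text": el.text, "kind": el.kind}} for el in elements]


def preprocess(raw: str) -> list[dict]:
    """Chain the three steps: partition, then clean, then stage."""
    return stage_for_labeling(clean(partition(raw)))


if __name__ == "__main__":
    sample = (
        "Risk Factors\n\n"
        "Our business depends on a small number of customers.\n\n"
        "Click here to subscribe.\n\n"
        "See also.\n"
    )
    for task in preprocess(sample):
        print(task)
```

Each step takes and returns plain data, so a step can be swapped or reordered without touching the others, which is the property that makes a pipeline like this easy to share and fork.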

Critically, our community-based platform makes it easy to share pipeline code so no one needs to start at square one. Simply fork, edit, and publish. What used to take days, weeks, or months now takes minutes, giving data scientists time back to focus on model selection, training, and other high-value machine learning tasks. Inspired by the Hugging Face and Ribrix communities, we’re adopting a similar model with the goal of transforming the economics of applying ML to unstructured data.

Get Involved

Now that we’ve officially launched, we have an exciting few months planned. Here are a few ways you can keep up to date on our latest features and announcements:

Sign up for beta access to our hosted REST API. The API currently supports preprocessing 10-K, 10-Q, and S-1 filings, with new preprocessing pipelines coming soon.

Subscribe to our blog to hear about new releases as they ship. Our near-term product roadmap includes staging bricks for downstream ML vendors, PDF parsing capabilities, and the ability to publish APIs for custom preprocessing pipelines.

Follow, star, and contribute back to our repos on GitHub. We love community contributions!

Join us as we work together to make machine learning with unstructured data a breeze.

Brian Raymond

Founder & CEO
