LabelStudio Integration

Here at Unstructured, we’re dedicated to developing tools that enable data scientists to integrate seamlessly with their favorite downstream tools. With that in mind, we’re excited to announce the release of our first staging brick, which formats data for upload to LabelStudio. Staging bricks are included in our open-source unstructured package and require just a simple pip install. In this post, we’ll show how to use the LabelStudio staging brick together with the SEC pipeline we introduced last week to render the risk section of SEC filings ready to label for sentiment analysis.

First, let’s pull an SEC filing from EDGAR and extract the risk section.
from prepline_sec_filings.sec_document import SECDocument, SECSection
from prepline_sec_filings.fetch import get_form_by_ticker

# EDGAR asks automated clients to identify themselves, so pass
# your name/organization and email along with the request
text = get_form_by_ticker(
    'rgld',
    '10-K',
    company='<your-name-or-org>',
    email='<your-email>',
)
doc = SECDocument.from_string(text)
# Pull the narrative text out of the risk factors section
risk_section = doc.get_section_narrative(SECSection.RISK_FACTORS)

Next, we’ll use the stage_for_label_studio staging brick to get the data ready for upload.

import json

from unstructured.staging.label_studio import stage_for_label_studio

label_studio_data = stage_for_label_studio(risk_section, text_field="text", id_field="id")

# The resulting JSON file is ready to be uploaded to LabelStudio
with open("label_studio.json", "w") as f:
    json.dump(label_studio_data, f, indent=4)
That’s it! Your data is ready to upload to a project in LabelStudio. Create a new project, upload your data, choose “Text Classification” for your labeling setup, and you’re ready to go! You can see the full documentation for our LabelStudio brick here, but it really is that simple.
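For context, the staged records follow LabelStudio’s JSON task format. Here is a minimal sketch of what label_studio_data might look like for two risk-section paragraphs; the exact field layout is an assumption based on the text_field and id_field arguments above, not a guarantee from the library:

```python
import json

# Hypothetical example of the staged structure: a list of tasks,
# each wrapping its content in a "data" payload keyed by the
# configured text/id fields.
sample = [
    {"data": {"text": "The company faces commodity price risk.", "id": "0"}},
    {"data": {"text": "Operations depend on a small number of mines.", "id": "1"}},
]

# Serializing with json.dump, as in the snippet above, produces an
# upload-ready file; round-tripping shows the structure is preserved.
serialized = json.dumps(sample, indent=4)
parsed = json.loads(serialized)
```

Each task becomes one labeling item in the LabelStudio UI, so one paragraph of the risk section maps to one classification decision.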


Now you are ready to label data!

Need help getting started? Join the conversation on our Slack page to chat with the Unstructured team and fellow community members.


SEC Pipelines

10-K, 10-Q, and S-1 filings provide investors with a vital source of information about the risks and opportunities associated with publicly traded companies. In order to understand the impact of these filings on investment decisions, however, analysts first need to extract information from complex and variable XML documents. For an analyst doing this by hand, it would take over 300 hours to extract and structure the content of the risk factors section for each of the 4,000+ publicly traded companies in the US. It could take equally long to develop custom parsing code capable of handling all of the corner cases that appear in real-world filings.

<XBRL>
<?xml version='1.0' encoding='UTF-8'?>
<!-- iXBRL document created with: Toppan Merrill Bridge iXBRL 9.6.7811.37134 -->
      <!-- Based on: iXBRL 1.1 -->
      <!-- Created on: 8/11/2021 10:45:07 PM -->
      <!-- iXBRL Library version: 1.0.7811.37150 -->
      <!-- iXBRL Service Job ID: 19f8db26-9ac2-4427-9c71-ed8734c57db7 -->
<html xmlns:us-gaap="http://fasb.org/us-gaap/2020-01-31" 
xmlns:link="http://www.xbrl.org/2003/linkbase" 
xmlns:country="http://xbrl.sec.gov/country/2020-01-31" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns:rgld="http://www.royalgold.com/20210630" 
xmlns:xbrldt="http://xbrl.org/2005/xbrldt" 
xmlns:ixt-sec=
"http://www.sec.gov/inlineXBRL/transformation/20contextRef=
"Duration_7_1_2020_To_6_30_2021_srt_TitleOfIndividualAxis_rgld_
OfficersAndCertainEmployeesMember_us-gaap

The start of a messy XBRL file from EDGAR

Fortunately, Unstructured is here to help! We’re excited to announce the release of our open-source preprocessing pipeline for select SEC filings, which you can find on GitHub here. The SEC filings API allows users to extract narrative text from one or more sections of a 10-K, 10-Q, or S-1 filing in iXBRL format.
curl -X 'POST' \
  'https://api.unstructured.io/sec-filings/v0.2.0/section' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'text_files=@<your-file-name>.xbrl' \
  -F 'section=RISK_FACTORS'
See this file for a list of valid inputs for the section parameter. To fetch an iXBRL document from EDGAR, use the following helper function from the pipeline repo.
from prepline_sec_filings.fetch import get_form_by_ticker
text = get_form_by_ticker(
    'rgld', 
    '10-K', 
    company='<your-name-or-org>', 
    email='<your-email>'
)
After fetching the document, save it locally and pass it to the API using the file parameter. The API knows which sections are valid for each filing type, and you may specify one or more of them in the request using the section parameter. The valid values are listed in this file under the SECSection enum in the GitHub repo, e.g. “PROSPECTUS_SUMMARY” or “RISK_FACTORS”. Alternatively, use section=_ALL to retrieve all sections.
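The same call can be made from Python. Below is a sketch that only assembles the form fields; the endpoint URL and parameter names are taken from the curl example above, and the final POST is left commented out to avoid a live network request:

```python
# Sketch: assemble the form fields for the SEC filings API.
# Endpoint and parameter names mirror the curl example above.
API_URL = "https://api.unstructured.io/sec-filings/v0.2.0/section"

def build_section_payload(sections):
    """The API accepts a repeated 'section' form field;
    passing '_ALL' retrieves every section of the filing."""
    return [("section", s) for s in sections]

payload = build_section_payload(["RISK_FACTORS", "PROSPECTUS_SUMMARY"])

# With the requests library installed, the upload would look like:
# import requests
# with open("filing.xbrl", "rb") as f:
#     resp = requests.post(API_URL, files={"text_files": f}, data=payload)
#     print(resp.json())
```

Using a list of tuples (rather than a dict) is what lets the same form key repeat once per requested section.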

This API is open and free to use. Enjoy!

Have questions or need help? Use this invite to join us on our community Slack channel.


Introducing Unstructured

The Problem

Our team has years of collective experience working on large-scale machine learning initiatives across diverse industries including finance, healthcare/pharma, CPG, logistics, energy, and government. Throughout that time we noticed a problem: slow, arduous data preparation processes hamper customers’ ability to successfully deploy artificial intelligence solutions.

Lacking adequate tooling, developers resort to building one-off solutions for each project, resulting in wasted effort and slow turnaround times. By the time their data is prepped and ready for a machine learning pipeline, professional services bills have piled up and delivery timelines have stretched by weeks or months. While powerful solutions have emerged over the past several years to help engineers deal with messy tabular data, the ecosystem of tools to curate and clean unstructured data such as text, images, and audio remains sparse. This is a huge problem because more than 80 percent of organizations’ data is unstructured.

The Solution

We’ve decided to tackle this challenge head on by building the first open-source platform designed to accelerate the preprocessing of unstructured data. With an initial focus on natural language data, our Python library provides off-the-shelf components that make it fast, easy, and inexpensive to transform these data sets from raw to ML-ready. The library includes bricks for partitioning documents into their constituent parts, cleaning out unwanted text such as boilerplate and sentence fragments, and staging outputs for downstream tasks such as data labeling in LabelStudio or inference with Hugging Face. Once users have orchestrated a preprocessing pipeline, our tooling allows them to publish it as a REST API or download it in a container. Together, these bricks fill critical gaps in the tooling data scientists need to rapidly preprocess their unstructured data.
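To make the brick pattern concrete, here is a hypothetical sketch of two cleaning bricks. The function names echo the ideas above but are illustrative stand-ins, not the library’s actual API:

```python
import re

# Hypothetical cleaning bricks in the spirit described above;
# the real library's function names and behavior may differ.
def clean_bullets(text: str) -> str:
    """Strip a leading bullet character left over from document layout."""
    return re.sub(r"^[\u2022\-\*]\s*", "", text)

def clean_extra_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def pipeline(text: str) -> str:
    # Bricks compose: each one takes plain text and returns plain text,
    # so a preprocessing pipeline is just function composition.
    return clean_extra_whitespace(clean_bullets(text))

print(pipeline("\u2022  Risk   factors  may  change."))
```

Because every brick shares the same text-in, text-out contract, swapping, reordering, or adding steps doesn’t require rewriting the rest of the pipeline.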

Critically, our community-based platform makes it easy to share pipeline code so no one needs to start at square one. Simply fork, edit, and publish. What used to take days, weeks, or months now takes minutes, giving data scientists time back to focus on model selection, training, and other high-value machine learning tasks. Inspired by the Hugging Face and Rubrix communities, we’re adopting a similar model with the goal of transforming the economics of applying ML to unstructured data.

Get Involved

Now that we’ve officially launched, we have an exciting few months planned. Here are a few ways you can keep up-to-date on our latest features and announcements:

Sign up for Beta access for our hosted REST API. The API currently supports preprocessing 10-K, 10-Q, and S-1 filings, with new preprocessing pipelines coming soon.

Subscribe to our blog to stay up to date on our latest features. Features we have on our near term product roadmap include staging bricks for downstream ML vendors, PDF parsing capabilities, and the ability to publish APIs for custom preprocessing pipelines.

Follow, star, and contribute back to our repos on GitHub. We love community contributions!

Join us as we work together to make machine learning with unstructured data a breeze.


Brian Raymond

Founder & CEO
