Unstructured

LabelStudio Integration

Here at Unstructured, we’re dedicated to developing tools that let data scientists integrate seamlessly with their favorite downstream tools. With that in mind, we’re excited to announce the release of our first staging brick, which formats data for upload to LabelStudio. Staging bricks are included in our open-source unstructured package and require just a simple pip install. In this post, we’ll show how to use the LabelStudio staging brick together with the SEC pipeline we introduced last week to prepare the risk section of an SEC filing for sentiment labeling. First, let’s pull an SEC filing from EDGAR and extract the risk section.
from prepline_sec_filings.sec_document import SECDocument, SECSection
from prepline_sec_filings.fetch import get_form_by_ticker
text = get_form_by_ticker(
    'rgld', 
    '10-K', 
    company='<your-name-or-org>', 
    email='<your-email>'
)
doc = SECDocument.from_string(text)
risk_section = doc.get_section_narrative(SECSection.RISK_FACTORS) 

Next, we’ll use the stage_for_label_studio staging brick to get the data ready for upload.

import json

from unstructured.staging.label_studio import stage_for_label_studio

label_studio_data = stage_for_label_studio(risk_section, text_field="text", id_field="id")

# The resulting JSON file is ready to be uploaded to LabelStudio
with open("label_studio.json", "w") as f:
    json.dump(label_studio_data, f, indent=4)
That’s it! Your data is ready to upload to a project in LabelStudio. Create a new project, upload your data, choose “Text Classification” for your labeling setup, and you’re ready to go! You can see the full documentation for our LabelStudio brick here, but it really is that simple.
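
For readers curious about what ends up in label_studio.json, here is a minimal sketch of the task structure, assuming each element becomes one task whose "data" dictionary holds the text and an id under the field names passed to the brick. The helper to_label_studio_tasks and its hash-based ids are hypothetical stand-ins written for this illustration, not the brick's actual implementation, and the sample sentences are made-up placeholders rather than text from a real filing.

```python
import hashlib
import json


def to_label_studio_tasks(texts, text_field="text", id_field="id"):
    """Hypothetical stand-in for the staging brick: wrap each string in a task dict."""
    tasks = []
    for text in texts:
        # Derive a stable id from the text itself (an assumption for this
        # sketch; the real brick's id scheme may differ).
        element_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]
        tasks.append({"data": {text_field: text, id_field: element_id}})
    return tasks


# Placeholder sentences standing in for extracted risk-factor paragraphs.
sample_paragraphs = [
    "Our revenue depends on metal prices, which are volatile.",
    "We operate in jurisdictions with changing regulations.",
]

tasks = to_label_studio_tasks(sample_paragraphs)
print(json.dumps(tasks, indent=4))
```

Each entry in the list is one labeling task, so a risk section with many narrative paragraphs would produce one task per paragraph.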

Upload Data

Choose Text Classification

Now you are ready to label data!

Need help getting started? Join the conversation on our Slack page to chat with the Unstructured team and fellow community members.

SEC Pipelines

10-K, 10-Q, and S-1 filings provide investors with a vital source of information about the risks and opportunities associated with publicly traded companies. To understand the impact of these filings on investment decisions, however, analysts first need to extract information from complex and variable XML documents. An analyst working by hand would need over 300 hours to extract and structure the content of the risk factors section for each of the 4,000+ publicly traded companies in the US. It could take equally long to develop custom parsing code capable of handling all of the corner cases that appear in real-world filings.

<XBRL>
<?xml version='1.0' encoding='UTF-8'?>
<!-- iXBRL document created with: Toppan Merrill Bridge iXBRL 9.6.7811.37134 -->
      <!-- Based on: iXBRL 1.1 -->
      <!-- Created on: 8/11/2021 10:45:07 PM -->
      <!-- iXBRL Library version: 1.0.7811.37150 -->
      <!-- iXBRL Service Job ID: 19f8db26-9ac2-4427-9c71-ed8734c57db7 -->
<html xmlns:us-gaap="http://fasb.org/us-gaap/2020-01-31" 
xmlns:link="http://www.xbrl.org/2003/linkbase" 
xmlns:country="http://xbrl.sec.gov/country/2020-01-31" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns:rgld="http://www.royalgold.com/20210630" 
xmlns:xbrldt="http://xbrl.org/2005/xbrldt" 
xmlns:ixt-sec=
"http://www.sec.gov/inlineXBRL/transformation/20contextRef=
"Duration_7_1_2020_To_6_30_2021_srt_TitleOfIndividualAxis_rgld_
OfficersAndCertainEmployeesMember_us-gaap

The start of a messy XBRL file from EDGAR

Fortunately, Unstructured is here to help! We’re excited to announce the release of our open-source preprocessing pipeline for select SEC filings, which you can find on GitHub here. The SEC filings API allows users to extract narrative text from one or more sections of a 10-K, 10-Q, or S-1 filing in iXBRL format.
curl -X 'POST' \
  'https://api.unstructured.io/sec-filings/v0.2.0/section' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'text_files=@<your-file-name>.xbrl' \
  -F 'section=RISK_FACTORS'
See this file for a list of valid inputs for the section parameter. To fetch an iXBRL document from EDGAR, use the following helper function from the pipeline repo.
from prepline_sec_filings.fetch import get_form_by_ticker
text = get_form_by_ticker(
    'rgld', 
    '10-K', 
    company='<your-name-or-org>', 
    email='<your-email>'
)
After fetching the document, save it locally and pass it to the API in the text_files form field, as in the curl example above. The API knows which sections are valid for each filing type, and you may request one or more of them using the section parameter. The valid values are listed in this file under the SECSection enum in the GitHub repo, e.g. “PROSPECTUS_SUMMARY” or “RISK_FACTORS”. Alternatively, use section=_ALL to retrieve all sections.
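
For those who prefer Python over curl, the request above can be sketched with the standard library alone. This is a rough illustration, not an official client: the URL and the text_files and section field names are copied from the curl example, while the helper build_section_request, the file name rgld-10k.xbrl, and the choice to send several sections as repeated section fields are assumptions made for this sketch. The function only builds the request; nothing is sent until you call urllib.request.urlopen on it.

```python
import urllib.request
import uuid

# Endpoint taken from the curl example above.
API_URL = "https://api.unstructured.io/sec-filings/v0.2.0/section"


def build_section_request(filename, file_bytes, sections, url=API_URL):
    """Build (but do not send) a multipart/form-data POST mirroring the curl call."""
    boundary = uuid.uuid4().hex
    parts = []
    # One form field per requested section, e.g. section=RISK_FACTORS.
    # Repeating the field for multiple sections is an assumption.
    for section in sections:
        parts.append(
            (
                f"--{boundary}\r\n"
                'Content-Disposition: form-data; name="section"\r\n\r\n'
                f"{section}\r\n"
            ).encode("utf-8")
        )
    # The filing itself goes in the text_files field.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="text_files"; filename="{filename}"\r\n'
        "Content-Type: text/plain\r\n\r\n"
    ).encode("utf-8")
    parts.append(file_header + file_bytes + b"\r\n")
    # Closing boundary marks the end of the form data.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        url,
        data=b"".join(parts),
        method="POST",
        headers={
            "accept": "application/json",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder bytes stand in for the document returned by get_form_by_ticker.
request = build_section_request(
    "rgld-10k.xbrl", b"<html>...</html>", ["RISK_FACTORS"]
)
```

With a real filing, you would pass text.encode("utf-8") as file_bytes, then send the request with urllib.request.urlopen(request) and parse the JSON body of the response.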

This API is open and free to use. Enjoy!

Have questions or need help? Use this invite to join us on our community Slack channel.
