LabelStudio Integration
Here at Unstructured, we’re dedicated to building tools that integrate seamlessly with the downstream tools data scientists already use. With that in mind, we’re excited to announce the release of our first staging brick, which formats data for upload to LabelStudio. Staging bricks are included in our open-source unstructured package and require just a simple pip install. In this post, we’ll show how to use the LabelStudio staging brick together with the SEC pipeline we introduced last week to prepare the risk section of an SEC filing for sentiment labeling.

First, let’s pull an SEC filing from EDGAR and extract the risk section.
from prepline_sec_filings.sec_document import SECDocument, SECSection
from prepline_sec_filings.fetch import get_form_by_ticker

text = get_form_by_ticker(
    'rgld',
    '10-K',
    company='<your-name-or-org>',
    email='<your-email>'
)
doc = SECDocument.from_string(text)
risk_section = doc.get_section_narrative(SECSection.RISK_FACTORS)
Next, we’ll use the stage_for_label_studio staging brick to get the data ready for upload.
import json
from unstructured.staging.label_studio import stage_for_label_studio

label_studio_data = stage_for_label_studio(risk_section, text_field="text", id_field="id")

# The resulting JSON file is ready to be uploaded to LabelStudio
with open("label_studio.json", "w") as f:
    json.dump(label_studio_data, f, indent=4)
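
To get a feel for what ends up in label_studio.json, here is a rough sketch of the shape of the staged data. It is not the brick’s actual implementation: FakeElement and sketch_stage are hypothetical stand-ins, and the sketch assumes each element becomes one task whose "data" object holds the element text under text_field and, when the element has a string id, that id under id_field.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in for a text element from the SEC pipeline."""
    text: str
    id: Optional[str] = None


def sketch_stage(elements, text_field="text", id_field="id"):
    """Illustrative only: build one task per element (assumed output shape)."""
    tasks = []
    for element in elements:
        data = {text_field: element.text}
        # Only include the id when the element actually carries a string id.
        if isinstance(element.id, str):
            data[id_field] = element.id
        tasks.append({"data": data})
    return tasks


sample = [
    FakeElement("Gold prices are volatile.", id="abc123"),
    FakeElement("We depend on third-party operators."),
]
tasks = sketch_stage(sample)
print(json.dumps(tasks, indent=4))
```

Each entry in the list is a single labeling task, so a risk section with many narrative paragraphs turns into many small items to label.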
That’s it! Your data is ready to upload to a project in LabelStudio. Create a new project, upload your data, choose “Text Classification” for your labeling setup, and you’re ready to go! You can see the full documentation for our LabelStudio brick here, but it really is that simple.
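
Before uploading, a quick sanity check on the staged tasks can catch empty or malformed entries. The sketch below is an optional extra, not part of the staging brick: validate_tasks is a hypothetical helper, and it assumes the file is a JSON list of tasks that each carry a "data" object with a non-empty "text" field.

```python
import json


def validate_tasks(tasks, text_field="text"):
    """Hypothetical helper: check the assumed task shape and return the task count."""
    if not isinstance(tasks, list):
        raise ValueError("expected a JSON list of tasks")
    for i, task in enumerate(tasks):
        data = task.get("data") if isinstance(task, dict) else None
        if not isinstance(data, dict):
            raise ValueError(f"task {i} has no 'data' object")
        text = data.get(text_field)
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"task {i} has empty or missing '{text_field}'")
    return len(tasks)


# Inline sample standing in for the contents of label_studio.json.
sample_json = json.dumps(
    [{"data": {"text": "Gold prices are volatile.", "id": "abc123"}}]
)
task_count = validate_tasks(json.loads(sample_json))
print(f"{task_count} task(s) look ready to upload")
```

To check the real file, load it with json.load and pass the result to validate_tasks in place of the inline sample.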
[Screenshot: Upload Data]
[Screenshot: Choose “Text Classification”]
[Screenshot: Now you are ready to label data!]
Need help getting started? Join the conversation on our Slack page to chat with the Unstructured team and fellow community members.