Getting Structured Data Out of Text with LLMs
We’ve done multiple projects to automatically pull structured data from text, including:
- Automatically parsing emails
- Scanning research articles
- Processing customer feedback reports
This is a natural language processing task: using computers to pull information out of unstructured text. It is increasingly done with large language models (LLMs) because they offer remarkable flexibility for processing natural language from simple, intuitive instructions.
LLMs are next-token generators. Acting kind of like a glorified autocomplete, they generate the text that is most likely to come next based on the input they were given. To get structured output, we need to guide the model toward the format we want.
Here I'm going to show you how we do that with some basic prompts.
Structured Formats
There are a few common formats for structuring data: JSON, YAML, and XML. These are all designed so that they can be used by a computer program—whether that is sticking data into a spreadsheet or running a database operation or code function.
Luckily, most foundation LLMs have been trained on many examples of structured data, so they have an understanding of these data formats out of the box. This means we can usually get pretty good results with zero- or few-shot learning (i.e., no need for a custom-trained model).
Let’s take a look at an example and see how this works. For all of these examples, I’m using Mistral 7B Instruct (which we could hook up for you through Amazon Bedrock!), but other models behave similarly. Here is the prompt used throughout, where {format} is substituted with JSON, YAML, or XML (a sketch of the Bedrock call follows the prompt).
Extract the name, email address, and sentiment (positive, neutral, or negative) from this EMAIL.
Return the OUTPUT as formatted {format} with keys for sentiment and contact.
Contact should have subkeys for name and email.
Wrap it all in a result key.
EMAIL:
From: aaron@stradiapartners.com
Subject: Help
Body:
Hi,
I'm really happy with my product, but I need some help.
can you write me back?
Aaron
OUTPUT:
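Before comparing the outputs, here is a minimal sketch of how we might send that prompt to Mistral 7B Instruct through Amazon Bedrock using boto3 in Python. The model ID, request body fields, and response shape are my assumptions about Bedrock's Mistral text-completion interface, so verify them against the current Bedrock documentation before relying on them:

import json
import boto3

# Assumed Bedrock model ID for Mistral 7B Instruct; confirm the ID available in your region.
MODEL_ID = "mistral.mistral-7b-instruct-v0:2"

PROMPT_TEMPLATE = (
    "Extract the name, email address, and sentiment (positive, neutral, or negative) from this EMAIL.\n"
    "Return the OUTPUT as formatted {format} with keys for sentiment and contact.\n"
    "Contact should have subkeys for name and email.\n"
    "Wrap it all in a result key.\n\n"
    "EMAIL:\n{email}\n\n"
    "OUTPUT:"
)

def extract(email_text, fmt="JSON"):
    """Send the extraction prompt to Bedrock and return the raw model text."""
    client = boto3.client("bedrock-runtime")
    prompt = PROMPT_TEMPLATE.format(format=fmt, email=email_text)
    body = {
        # Mistral models expect the instruction wrapped in [INST] tags (assumption).
        "prompt": f"<s>[INST] {prompt} [/INST]",
        "max_tokens": 200,
        "temperature": 0.0,  # keep the structured output as deterministic as possible
    }
    response = client.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    payload = json.loads(response["body"].read())
    return payload["outputs"][0]["text"]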
JSON – nests key-value pairs in brackets
Pros:
- Handles numbers and text differently, simplifying post-processing
- Strong support in the newer OpenAI models, with specific JSON training and built-in output validation
Cons:
- Brackets and quotations add extra tokens, increasing cost and decreasing efficiency
Output (57 tokens):
{
  "result": {
    "contact": {
      "name": "Aaron",
      "email": "aaron@stradiapartners.com"
    },
    "sentiment": "positive"
  }
}
YAML – uses indents to identify hierarchical structure
Pros:
- Easily human readable
- With less extra “fluff,” it is more token-efficient and cheaper than the other formats
Cons:
- Easy to mess up the indents, potentially increasing errors (although I haven’t quantified that)
Output (30 tokens):
result:
  contact:
    email: aaron@stradiapartners.com
    name: Aaron
  sentiment: positive
XML – wraps data in tags
Pros:
- Explicit wrappers make it easy to ignore any “extra” text the model returns (some models, especially “chatty” ones, like to add context before or after the structured data, such as “Here is your data: … Do you need anything else?”)
Cons:
- Opening and closing tags make it extremely verbose, increasing cost and decreasing efficiency
Output (57 tokens):
<result>
  <contact>
    <name>Aaron</name>
    <email>aaron@stradiapartners.com</email>
  </contact>
  <sentiment>positive</sentiment>
</result>
Challenges
As stochastic parrots, LLMs are not without issues.
For structured data, the first challenge is incomplete output. In the JSON example, if the model stops generating before the closing } bracket, the result can't be parsed. This can often be handled by validating the output in code and rerunning the request when parsing fails; a sketch of that check-and-retry loop follows below. Some models also offer built-in generation constraints that guarantee validly formatted output.
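Here is a minimal sketch of that check-and-retry loop, reusing the hypothetical extract helper from the Bedrock sketch above:

import json

def extract_with_retry(email_text, attempts=3):
    """Parse the model output as JSON, rerunning the request if parsing fails."""
    last_error = None
    for _ in range(attempts):
        raw = extract(email_text, fmt="JSON")  # hypothetical helper defined earlier
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err  # truncated or malformed output; try again
    raise ValueError(f"No valid JSON after {attempts} attempts: {last_error}")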
The second challenge is hallucinated data. Sometimes hallucination can be reduced by adding more specifics to the input (in the sentiment example, if we don't give a specific list of options, the model might get creative with the output) or by lowering the temperature to reduce randomness.
Other times, unclear inputs can confuse the model (e.g., if there are multiple names in the email, it could inadvertently grab the wrong one), or the model may simply interpolate results that aren't there (perhaps based on content it saw in its training set).
There are a few tricks for avoiding these shortcomings.
Adding Extra Guardrails
Specifying the Schema
This can help ensure that the output is formatted correctly, and OpenAPI specifications provide a clear way to describe the structure. In fact, OpenAI models can accept an explicit JSON Schema (the same schema language OpenAPI builds on) that the output must conform to (see the sketch after the example below).
From the OpenAPI documentation, here is an example showing how a model specification (in this case in YAML format! The JSON format is also common) can represent data in JSON or XML format:
Model specification
components:
  schemas:
    book:
      type: object
      properties:
        id:
          type: integer
        title:
          type: string
        author:
          type: string
JSON data
{
  "id": 0,
  "title": "string",
  "author": "string"
}
XML data
<book>
  <id>0</id>
  <title>string</title>
  <author>string</author>
</book>
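For OpenAI models specifically, the schema can be passed directly with the request. Here is a rough sketch using the OpenAI Python SDK's structured-output option; the schema and parameter values are illustrative assumptions, so check them against OpenAI's current documentation:

from openai import OpenAI

client = OpenAI()

email_text = (
    "From: aaron@stradiapartners.com\n"
    "Subject: Help\n"
    "Body: Hi, I'm really happy with my product, but I need some help. Can you write me back? Aaron"
)

# JSON Schema mirroring the email-extraction example above.
email_schema = {
    "type": "object",
    "properties": {
        "result": {
            "type": "object",
            "properties": {
                "contact": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "email": {"type": "string"},
                    },
                    "required": ["name", "email"],
                    "additionalProperties": False,
                },
                "sentiment": {
                    "type": "string",
                    "enum": ["positive", "neutral", "negative"],
                },
            },
            "required": ["contact", "sentiment"],
            "additionalProperties": False,
        }
    },
    "required": ["result"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "Extract the name, email address, and sentiment from this EMAIL.\n\nEMAIL:\n" + email_text,
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "email_extraction", "schema": email_schema, "strict": True},
    },
)
print(response.choices[0].message.content)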
Including a Few Examples
This can provide additional guidance to the model through a "worked" sample. To do this, we might include a sample EMAIL and OUTPUT in the desired format before the actual EMAIL that we need to process.
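For example, a few-shot version of the earlier prompt might look like this, with a worked sample (made-up contact details) placed before the real email:

Extract the name, email address, and sentiment (positive, neutral, or negative) from this EMAIL.
Return the OUTPUT as formatted YAML with keys for sentiment and contact.
Contact should have subkeys for name and email.
Wrap it all in a result key.

EMAIL:
From: jane@example.com
Subject: Broken part
Body:
The replacement part arrived damaged again. Very disappointing.
Jane

OUTPUT:
result:
  contact:
    email: jane@example.com
    name: Jane
  sentiment: negative

EMAIL:
<the actual email to process>

OUTPUT: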
Fine-Tuning
Finally, for high-risk or mission-critical cases where it is important to get it right, fine-tuning the model with lots of training samples can really lock in the output for what you need. We've found fine-tuning to increase accuracy from the low-70%s to over 90%.
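As a rough illustration, a single chat-style fine-tuning record (shown here in the JSONL layout that OpenAI's fine-tuning endpoint expects; other providers use similar layouts, and in the actual file each record sits on one line) pairs the prompt with the exact output we want:

{"messages": [
  {"role": "user", "content": "<the extraction prompt and EMAIL from above>"},
  {"role": "assistant", "content": "{\"result\": {\"contact\": {\"name\": \"Aaron\", \"email\": \"aaron@stradiapartners.com\"}, \"sentiment\": \"positive\"}}"}
]}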
As for which format to choose, this decision depends on the data you need to process and the model you plan to use. Generally, we like JSON for OpenAI models and XML for other foundation models. We find that the extra token costs and processing time are rarely a concern, and these formats allow for more explicit definition.
This kind of project is exciting for us because it lets us tailor the output to exactly what you need to get out of your text. We also find that it can save a lot of time, for example speeding up how quickly a team can work through hundreds or thousands of messages, and it can even add some "objectivity" when multiple people are working on the same task.
We'd love to hear how you think this can be helpful for you! Email us to talk.