Document Understanding and the Power of Entity Extraction

October 30, 2024

Your business needs answers fast, but traditional document processing methods like Intelligent Document Processing (IDP) and Optical Character Recognition (OCR) struggle with unstructured data and unforeseen questions. In our recent podcast series, we explored how these technologies fall short in handling complex document understanding, particularly when the question or data format isn’t clear.

In previous episodes, we discussed two scenarios: when users have questions but lack answers, as with supplier requests, and when they have answers but not the specific questions, such as in chatbot product queries. Today, we examine a third scenario: everything outside those two, where flexible, context-driven Natural Language Processing (NLP) enables automation and reduces dependence on rigid structures. Modern NLP empowers technology leaders to interpret data in context, offering more accurate answers and a new approach to information extraction that drives decision-making.

The Problem with Traditional Methods for Extracting Information

Traditional document processing depends on structured formats and fixed data locations, which break down when handling unstructured text or ambiguous requests. Most legacy systems rely on mapping and regular expressions to locate information like zip codes, phone numbers, or product codes. These methods work for well-structured data but fail when data structure is inconsistent or context is unclear.

For instance, a system may be programmed to detect a five-digit zip code but fail to recognize a “zip+4” format or distinguish between different types of phone numbers. Regular expressions are rigid tools—they can capture patterns but cannot understand context, leaving users to manually sort through and confirm data accuracy, slowing down workflows and increasing error risk.
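
To see why, consider a minimal sketch of the kind of rigid pattern these systems rely on (plain Python, for illustration). The five-digit rule below both drops the “zip+4” extension and happily matches unrelated numbers:

```python
import re

# A rigid five-digit rule of the kind legacy systems depend on.
zip_pattern = re.compile(r"\b\d{5}\b")

print(zip_pattern.findall("Ship to: Dallas, TX 75201-1234"))
# ['75201'] -- the +4 extension is silently dropped

print(zip_pattern.findall("Account 88421 is past due"))
# ['88421'] -- an account number is misread as a zip code
```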

Why Context Matters in Data Extraction

Context is essential for accurate data extraction, but traditional systems can’t interpret it. NLP overcomes this limitation by allowing systems to understand data meaningfully, rather than relying solely on structure. This change enables systems to handle ambiguous information, identify patterns without rigid structure, and retrieve what users need rather than only what fits a specific pattern.

Lexical matching makes user input more forgiving because users don’t need to supply exact matches. For example, a query about a football team doesn’t require the exact team name; NLP can recognize variations like “Buccaneers” or “Rams.” This approach also applies to fields where exact matches aren’t realistic, such as finding addresses, dates, or contact information in documents.
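
As a rough sketch, a fuzzy lexical match against a known list (here using Python’s standard difflib module; the team list and cutoff are illustrative) resolves partial or misspelled input without demanding an exact string:

```python
from difflib import get_close_matches

TEAMS = ["Tampa Bay Buccaneers", "Los Angeles Rams", "Dallas Cowboys"]

def resolve_team(user_input: str) -> str | None:
    """Return the closest lexical match for the user's input, if any."""
    matches = get_close_matches(user_input, TEAMS, n=1, cutoff=0.3)
    return matches[0] if matches else None

print(resolve_team("Buccaneers"))   # Tampa Bay Buccaneers
print(resolve_team("Rams"))         # Los Angeles Rams
print(resolve_team("Buckaneers"))   # Tampa Bay Buccaneers (misspelling still resolves)
```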

Entity extraction goes further by identifying items within text based on their role rather than their format. In an invoice, for instance, NLP can distinguish between an invoice number, a PO number, and a part number, capturing each without pre-defined rules for every possible format. Context-aware NLP allows businesses to isolate the correct information, such as distinguishing between a shipping address and a billing address, without manual intervention.
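
As a sketch of what this looks like in practice (assuming the open-source spaCy library and its small English model; the labels shown are spaCy’s, not a universal standard), a pretrained pipeline tags each span by role rather than by format:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("Invoice from Acme Corp dated March 3, 2024. "
        "Ship to 500 Main Street, Dallas. Bill to 42 Elm Avenue, Austin.")

# The pretrained pipeline labels spans by role (e.g., ORG, DATE, GPE)
# instead of needing a hand-written pattern for every possible format.
for ent in nlp(text).ents:
    print(f"{ent.label_:8} {ent.text}")
```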

Advancements in NLP Add Flexibility and Speed

Traditional document processing depends heavily on regular expressions, which require technical expertise and ongoing maintenance. NLP offers a superior alternative. Instead of rigid rules, NLP leverages context and language understanding to locate and interpret data intuitively, reducing the technical burden on users.

NLP enables flexibility beyond regular expressions. While regular expressions might locate patterns like “PO” followed by six digits, they can’t adapt to variations without extensive configuration. With NLP, users don’t need to set rigid patterns for every possible variation; the system recognizes relevant data based on meaning and context.
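
A quick sketch makes the brittleness concrete. The pattern below finds the literal text “PO” followed by six digits, and nothing else; every real-world variant needs another hand-written rule:

```python
import re

# The rigid rule: the literal text "PO" followed by six digits.
po_pattern = re.compile(r"PO\s*(\d{6})")

print(po_pattern.search("PO 482913"))              # matches
print(po_pattern.search("P.O. 482913"))            # None: punctuation variant
print(po_pattern.search("Purchase Order 482913"))  # None: spelled-out variant
```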

This flexibility lowers the technical bar, allowing non-developers to set up business rules in plain language. A business analyst can define rules to identify data such as invoice numbers, shipping addresses, or hazardous material restrictions without understanding NLP’s technical details. NLP interprets these rules based on context, capturing the correct data without exhaustive manual input or pattern definitions.

Another strength of NLP is handling ambiguity. If a document contains similar data, like multiple phone numbers or dates, NLP can determine which is relevant based on context, minimizing human intervention and making processes more efficient. NLP thus enables automation without the need for complete system overhauls or specialized technical skills, freeing up valuable resources.

Key Techniques in NLP for Improved Data Extraction

NLP offers several techniques that enhance data extraction by adding context and flexibility, allowing systems to handle unstructured data accurately and with minimal technical setup.

  1. Lexical Comparison for Flexible Input: This allows systems to match terms that are written similarly, making data entry more forgiving. For example, a user doesn’t need to type the exact name of a football team for the system to identify it accurately.
  2. Entity Extraction for Data Identification: NLP can identify key data points within text, such as invoice numbers or PO numbers, based on context rather than exact format, enabling it to adapt to various documents.
  3. Context-Based Distinction Between Similar Items: NLP uses context to distinguish between items that look similar, like multiple dates or addresses. It can pick out the relevant date or address without requiring manual sorting.
  4. Semantic Analysis for Intent and Relevance: NLP can assess whether content is relevant to the intended topic, filtering out irrelevant responses in surveys or customer feedback and reducing the need for manual sorting (see the sketch after this list).
  5. Key Phrase Detection and Content Validation: NLP detects essential phrases and themes, helping validate content against specific criteria. This approach ensures important details aren’t missed, reducing the need for manual review.
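
For technique 4, here is a minimal sketch of relevance filtering, assuming the open-source sentence-transformers package; the model, threshold, and sample texts are illustrative only:

```python
from sentence_transformers import SentenceTransformer, util

# Assumes: pip install sentence-transformers
model = SentenceTransformer("all-MiniLM-L6-v2")

topic = "feedback about our checkout process"
responses = [
    "The payment page kept timing out before I could finish my order.",
    "Click here to win a free cruise!!!",
]

topic_vec = model.encode(topic, convert_to_tensor=True)
for text in responses:
    # Cosine similarity between the topic and each response, in [-1, 1].
    score = util.cos_sim(topic_vec, model.encode(text, convert_to_tensor=True)).item()
    verdict = "keep" if score >= 0.30 else "flag for review"
    print(f"{score:.2f}  {verdict}: {text}")
```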

These techniques provide a comprehensive approach to data extraction, allowing systems to adapt dynamically to unstructured data, minimize user errors, and streamline workflows. The result is faster, more accurate data extraction that supports better decision-making.

How to Provide Context Using Modern NLP

The effectiveness of NLP depends on clear instructions on what data to extract and what to ignore. Setting these parameters allows NLP to filter out irrelevant details and focus on the necessary information.

Define What to Look For: When working with NLP, specify the information needed. For example, if you’re extracting data from invoices, make it clear that the system should identify items like PO numbers, part numbers, or shipping addresses. This helps NLP isolate relevant data points, especially when dealing with unstructured or mixed data formats.

Specify What to Ignore: Just as important is defining what the system should disregard. If a document contains multiple addresses, the system needs to know which address type is relevant. By setting parameters to ignore certain types, like a billing address, you ensure only the needed data is extracted, reducing irrelevant noise.
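
One way to express both instructions is a simple declarative spec. The structure below is hypothetical (not any specific product’s API), but it shows the shape: plain-language descriptions of what to capture and what to skip, which the NLP layer resolves from context:

```python
# Hypothetical extraction spec -- illustrative structure only.
INVOICE_SPEC = {
    "extract": {
        "po_number":        "the purchase order number the buyer references",
        "shipping_address": "the address the goods should be delivered to",
        "invoice_date":     "the date the invoice was issued",
    },
    "ignore": [
        "the billing address",
        "our own remittance address printed in the letterhead",
        "support and sales email addresses",
    ],
}
```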

Context Enhances Accuracy: Providing context is crucial for accurate and efficient NLP. When a system understands which entities to look for and which to ignore, it can adapt to various document types and layouts without needing complex rules, improving data consistency.

Moving Beyond Rigid Patterns: Leveraging context allows NLP to perform data extraction based on meaning, not just format. This capability minimizes errors, streamlines data workflows, and reduces reliance on technical configurations. Context-driven NLP adds a level of intelligence to document processing that allows businesses to handle data more accurately, reliably, and efficiently.

Experience Effortless NLP-Driven Automation with Krista

Krista makes it simple for businesses to leverage NLP advancements without the technical burden. With NLP capabilities built into its platform, Krista enables context-driven, accurate data extraction that doesn’t require in-house expertise or manual configuration. Krista automates tasks like entity identification and context-based data sorting, freeing your team to act on insights rather than manage complex rules. This means faster, more reliable decision-making and the flexibility to scale document understanding across workflows. To see how Krista can transform your approach to document processing, evaluate Krista today and experience NLP-driven automation made easy.

Speakers

Scott King

Chief Marketer @ Krista

John Michelsen

Chief Geek @ Krista

Chris Kraus

VP Product @ Krista

Transcription

Scott King

Well, hey everyone. Thanks for joining this episode of the Union Podcast. I’m Scott, and I’m joined by my usual co-hosts, Chris Kraus and John Michelsen. John, I see Dallas in the background again, so welcome home. Today, we’re wrapping up our document understanding series. In our first episode, we discussed unknown questions—where I have a lot of answers and content but don’t know the specific questions. This is similar to someone using a chatbot to look up information or product specs. In the second episode, we covered known questions where I have the questions but not the answers, like needing to ask suppliers for specific information. Chris, today, we want to talk about everything else outside of those scenarios.

Advancements in NLP make these scenarios possible. John, maybe you can give us some insight into why this wasn’t feasible 6, 12, or 18 months ago due to limitations like GPU speeds. But Chris, if someone isn’t in either of those scenarios, what are we talking about today?

Chris Kraus

What people often don’t realize is that natural language processing, even outside of an LLM, is incredibly powerful. We’ve been trained by intelligent document processing to look for a structured document: find the zip code, look left of the zip code label or right of it, everything located with regular expressions. But, as John and I often joke, there’s nothing regular about a regular expression. Not everyone’s mind works in dot-slash-dot-star-asterisk; it’s very irregular, and it’s location-based.

However, lexical processing is an amazing tool. For example, if I want to find a football team, do I have to use the exact name, or can I use something like the Buccaneers or the Rams to lexically match the team name? When it comes to comparing content—checking if two statements are similar or identifying conflicting information within a paragraph, like an unhappy customer statement—there are many ways we can apply natural language.

The exciting part is that this opens up possibilities with unstructured text. We’ve traditionally been trained to think in structured, mathematical terms, which, of course, I enjoy, but now we’re analyzing documents in free-form text. With unstructured text, verb-noun agreement matters, and we must recognize names and proper nouns. Krista helps by handling these rules and letting NLP work it out. I don’t have to worry about those specifics—I simply ask it for the name of a place, and it recognizes proper nouns and general nouns. John, would you like to go over how we address these challenges and the advancements in NLP with similarity processing?

John Michelsen

Great setup. You definitely have these interesting new projects where you have a body of content you can expose to random questions. Or you know what you want from inbound documents, but you need specific answers.

We’ve talked about that, but what about all the other work? You might be an automation specialist or a business analyst maintaining requirements and guiding development for your business. Or you could be a developer or anyone involved in deploying technology.

What can natural language processing do outside of those two big scenarios? The answer is quite a bit, especially with the use cases Chris mentioned, plus validating user input without needing exact precision. If you have a long list of items and need a specific one, displaying a massive dropdown list or forcing precise matches doesn’t work well. Irregular expressions and exact matches are blunt instruments.

The natural language toolbox has a lot of useful tools. You don’t need the whole toolbox, as we might in a full document understanding case, but you can use a few tools for everyday processing. My hope is to raise awareness for non-developers that they can specify these expectations in the solutions they’re building. And for technical teams, these capabilities are available. We’ve made them accessible, and they should be part of every use case. Soon, we won’t even call it AI—it’ll just be software.

This evolution has happened with terms like big data, which we no longer need to mention explicitly. Big data is still around; it’s just a standard part of software now. The same shift will happen with GenAI and other aspects of AI.

For example, if you have a long list and need user input, you can do a lexical fuzzy match to find the closest items. If two or three items are too close, it’ll prompt the user to pick from those instead of showing a huge list or requiring exact matches. This improves usability, enhances user experience, and reduces errors in data input.

Scott King

So John, it sounds like you’re talking about less technical users—probably like me, since I’m not a software developer. Maybe this is why people often prefer to talk to other people, because the context of what I’m trying to do can be lost. I don’t always understand the system, but if I talk to a person, they help bridge that gap. Are you saying NLP is better at figuring out what I want to do? And Chris, does that connect to your football example, where the Buccaneers are clearly a team? Is that the big advancement, or is there more to it?

John Michelsen

It’s a variety of things. But just to be clear—does AI naturally or automatically understand context? Generally, no. If you provide the context, then ask your question, you’ll get a solid answer. Without it, the response might not be as accurate. A good example, which Chris could expand on, is processing inbound orders. This isn’t the new stuff we discussed last week; it’s the current process of handling orders—whether they come from an EDI feed or elsewhere. Often, someone still needs to validate the information on the order before it enters the system. But, really, why isn’t the software handling 90% of this validation? And the answer is, it could, right, Chris?

Chris Kraus

Yes, it could. In an order, many elements look similar. For example, there could be multiple email addresses. I don’t want the company’s or support email address; I want the one linked to the buyer, not the shipper. You provide that level of context to NLP, and when it encounters multiple addresses, it knows which to ignore. Similarly, if your company logo appears with your address, it’s important to clarify that the system should exclude your address and only review the customer’s. Just as we instinctively skip over our own address, NLP can be guided to ignore certain data. Providing these hints about what not to look for is just as crucial as specifying what to find.

John Michelsen

If I’m doing something simple, like validating a zip code, it’s straightforward—five digits with some additional checks. You might use one of those irregular expressions, especially with the plus-four zip code in the US. But it’s much easier to specify that the zip code should be near the rest of the address. Doing so immediately increases the likelihood of getting accurate results. The toolbox includes options like fuzzy matching and lexical matching, though the GenAI world has often overlooked lexical matching.
Semantic matching, for example, considers words like “cat” and “dog” more similar than “cat” and “car” because it compares meaning. Lexical analysis, by contrast, scores “cat” and “car” as the closer pair because it compares literal character construction. Homonyms can throw off lexical analysis, which is why a mix of tools is essential.

Let’s say you’re processing survey responses. You might want a semantic analysis to check if a comment is related to a specific topic. If it’s 70% likely to match the topic, you could accept the survey answer. But if it’s off-topic, it might be spam or a bot response. Other situations call for lexical analysis. For instance, when collecting feedback about managers, you’d want to ensure comments are accurately linked to the right manager.
If I need to collect accurate input without requiring exact spelling—for example, with my name “Michelsen”—a lexical analysis can capture typical misspellings. This is useful, especially in a call center setting, where finding a customer can be challenging. If I say “Michelsen,” they might assume it’s spelled with an “A” as in “Michaelsen,” which isn’t the case for my name. Lexical analysis would help here, allowing a simple search using additional context like “John Michelsen in Dallas.”

All applications could benefit from a refresh with this perspective. In every project, consider where you’re placing unnecessary burdens on end users, allowing incorrect data, or making people manually verify information that a comprehensive NLP toolbox could handle. These aren’t just niche use cases; they’re relevant to everyday processes. With today’s technology, we can move beyond outdated regular expressions. It’s crucial to understand that software now has the capability to handle much of this validation automatically.
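
To make the cat/dog/car contrast above concrete, here is a minimal sketch, assuming Python’s standard difflib for the lexical side and the sentence-transformers package for the semantic side (the scores shown are typical, not guaranteed):

```python
from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer, util

# Lexical view: literal character construction. "cat" looks like "car".
for a, b in [("cat", "dog"), ("cat", "car")]:
    print(a, b, round(SequenceMatcher(None, a, b).ratio(), 2))
# cat dog 0.0
# cat car 0.67

# Semantic view: meaning. Embeddings typically place "cat" nearer "dog".
model = SentenceTransformer("all-MiniLM-L6-v2")
cat, dog, car = model.encode(["cat", "dog", "car"], convert_to_tensor=True)
print("cat~dog", round(util.cos_sim(cat, dog).item(), 2))
print("cat~car", round(util.cos_sim(cat, car).item(), 2))
```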

Scott King

John, as you were talking, I was wondering where to apply this. Then you mentioned that every software application needs a refresh. Are you suggesting that every app and user interface should be updated with capabilities like these?

John Michelsen

In a way, yes. I’d put it like this: raise your expectations and look at any application you use or manage. Ask yourself, what are the tasks currently handled by people that a machine could be doing? Especially when it comes to validating input, which is a huge area. This becomes even more critical in fields like medicine or financial services.
For example, consider that an organization recently lost five billion dollars due to a single input error—someone accidentally typed in the wrong data. Another case, back in 2019, involved a candidate we were interviewing. He wasn’t a fit for us, but he shared a great story. He was presented with a time-off request system that showed his current balance and asked what he wanted to adjust. Being technical, he typed “minus eight” to deduct eight hours. The system subtracted his entry, and subtracting negative eight adds eight, so his balance came out to 48 hours after a day off instead of 32.

This could’ve been prevented by simply not allowing negative numbers. We’re aiming to make these controls accessible, especially to non-developers who might miss such details. A system should validate requests like time-off balances. Krista, for instance, is good at anomaly detection—spotting outliers. If a system encounters a negative number in a PTO request, it should flag it as unusual.

This is the point: whether you’re using or creating software, your expectations should increase. Expect software to handle these kinds of checks automatically, and if you’re responsible for building it, recognize that these capabilities are readily available.

Scott King

You mentioned financial services and healthcare, which make sense to me, with things like tickers, product codes, and healthcare codes. Missing one of those could have serious consequences. And these codes are often related; certain healthcare codes are adjacent to others. As you mentioned, NLP could identify anomalous behavior there, which would be valuable, especially considering the complexities of healthcare billing. I’ve never worked in that industry, but it sounds like a nightmare.

John Michelsen

Yes, we’ve actually done some projects in that space, and we’re about to start a major one in the coming weeks. Chris is very involved with it.

Scott King

Chris, based on what John discussed regarding advancements, what can people do? John mentioned that those building applications need to elevate their expectations and become more aware of what’s possible. Sometimes, though, it’s hard because people see AI as magic. So, what steps should we take?

Chris Kraus

There’s a balance between seeing AI as magic and understanding NLP—natural language processing. Generative AI is focused on language, so don’t expect it to handle all math, but use it as intended. We’ve traditionally thought of unstructured text as something only humans can handle, marking it as an exception. Now, though, we have technologies that can work with unstructured text, even if it’s just an order in an email and not structured XML or EDI.

First, we shouldn’t fear unstructured text; we should embrace it. Computers can handle it now. Some of our customers have been amazed by Krista’s accuracy, saying it’s 80% accurate even when they initially doubted it. They checked and found that Krista was right, while human employees were less accurate due to fatigue from repetitive tasks.
Another thing to realize is that applying business rules doesn’t require investing in complex, hard-to-configure rules engines. Human beings can now communicate business rules in simple language. For instance, ensuring invoices are processed in the right order or checking if shipping hazardous materials on a specific train is allowed can all be handled by NLP.

This shift will change how we work. Business analysts and subject matter experts will write rules in plain English, without needing to understand the technical side of NLP or structured text. They shouldn’t have to—that’s the point. The technology should understand and handle it.

John Michelsen

Yes, that’s a great example. You asked if it’s okay to ship a particular material via a certain delivery method, like a train. That’s very context-specific, which ties into the point you made earlier, Scott. For someone to succeed with these AI capabilities, even in regular business activities they’ve been automating, it’s not as challenging as it might seem. That may sound unlikely, but it’s actually feasible. You just need to make sure the solution you’re building can connect the dots for you, uniquely adjusting each time the question is asked, since everyone’s context may vary.

We don’t usually get 100% certainty in responses; confidence is rarely absolute. Without going into philosophy, none of us are 100% confident in anything—we’re just sufficiently confident to proceed. It’s like stepping forward without being absolutely sure nothing is on the floor that could hurt you. Software works similarly, but we’re learning how to handle high, but not perfect, confidence levels. We’re used to binary answers—yes or no, did or didn’t—but life isn’t like that, and neither is modern software. It’s becoming more human in that it can account for uncertainty.

The point is, if you provide the right context, software can correctly answer questions like whether a material can be shipped via a specific vehicle. It’s not as technically demanding as you might think and is a good example of raising expectations. Currently, people often look up this information manually or rely on experience, but software could handle that. And if something changes, such as a rail line gaining permission to ship certain materials, it’s easy to update the system’s context. Instead of retraining everyone, you simply adjust the software’s context.

Getting back to the topic, large AI projects in areas like the ones we’ve discussed are great, and we’re here to help with those. But day-to-day tasks can also benefit from NLP capabilities. It doesn’t mean a complete app overhaul—it just reduces the need for humans to handle tasks software can manage. More processes can be straight-through, with built-in input validation, better user experience, and more flexible, high-confidence matching. Ultimately, NLP can and should be part of every app.

Scott King

I love it. Thanks, John. I appreciate you explaining all of this in the series. I liked how you mentioned technology understanding people and how software is becoming more human. That really aligns with our mission. Thanks, everyone, for joining, and until next time.
