text extraction Articles - Enterprise Knowledge
https://enterprise-knowledge.com/tag/text-extraction/

Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup
https://enterprise-knowledge.com/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup/
Wed, 02 Jul 2025

The post Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup appeared first on Enterprise Knowledge.

 

The Challenge

Enterprise Knowledge (EK) recently worked with a Federally Funded Research and Development Center (FFRDC) that was having difficulty retrieving relevant content in a large volume of archival scientific papers. Researchers were burdened with excessive search times and the potential for knowledge loss when target documents could not be found at all. To learn more about the client’s use case and EK’s initial strategy, please see the first blog in the Optimizing Historical Knowledge Retrieval series: Standardizing Metadata for Enhanced Research Access.

To make these research papers more discoverable, part of EK’s solution was to add “about-ness” tags to the document metadata through a classification process. Many of the files in this document management system (DMS) were lower-quality PDF scans of older documents, such as typewritten papers and pre-digital technical reports that often included handwritten annotations. To begin classifying the content, the team first needed to transform the scanned PDFs into machine-readable text. EK utilized an Optical Character Recognition (OCR) tool, which “reads” non-text file formats for recognizable language and converts it into digital text. Even the most advanced OCR tools, however, still introduced a significant amount of noise when processing the archival documents. This noise frequently manifested as:

  • Tables, figures, or handwriting read in as random symbols and white space.
  • Random punctuation inserted where a spot or pen mark appeared on the file, breaking up words and sentences.
  • Excessive or misplaced line breaks separating related content.
  • Other miscellaneous irregularities that made the text less comprehensible.
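A rough sense of how noisy a given extract is can be captured with a simple heuristic. The scoring function below is a hypothetical illustration (not part of EK’s pipeline) that flags extracts with heavy symbol junk or fragmented line breaks:

```python
def ocr_noise_score(text: str) -> float:
    """Rough 0-1 score of OCR noise: symbol density plus line-break fragmentation."""
    if not text:
        return 1.0
    # Share of characters that are neither alphanumeric, whitespace, nor common punctuation.
    junk = sum(1 for c in text if not (c.isalnum() or c.isspace() or c in ".,;:!?'\"()-"))
    junk_ratio = junk / len(text)
    # Share of line breaks that cut off what looks like a mid-sentence fragment.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    broken = sum(1 for ln in lines[:-1] if not ln.endswith((".", "!", "?", ":")))
    break_ratio = broken / len(lines) if lines else 0.0
    return min(1.0, 0.7 * junk_ratio * 10 + 0.3 * break_ratio)

clean = "This report summarizes the experiment. Results were positive."
noisy = "Th~e re#po@rt\nsum\nmar@izes the\nexp#eri\nment"
```

A threshold on such a score could, for example, route only the worst extracts through extra cleanup or review.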

The first round of text extraction using out-of-the-box OCR capabilities produced many of the above issues across the output text files. This starter batch of text extracts was sent to the classification model to be tagged. The results were assessed by examining the classifier’s evidence for tagging (or failing to tag) a concept within each document. Through this inspection, the team found enough clutter and inconsistency within the text extracts that some irrelevant concepts were applied while other, applicable concepts were missed entirely. The negative impact on classification performance made it clear that document comprehension needed to be enhanced.

Auto-Classification
Auto-Classification (also referred to as auto-tagging) is an advanced process that automatically applies relevant terms or labels (tags) from a defined information model (such as a taxonomy) to your data.

The Solution

To address this challenge, the team explored several potential solutions for cleaning up the text extracts. However, there was concern that direct text manipulation might lead to the loss of critical information if applied wholesale across the entire corpus. Rather than modifying the raw text directly, the team decided to leverage a client-side Large Language Model (LLM) to generate additional text based on the extracts. The idea was that the LLM could interpret the noise from OCR processing as irrelevant and produce a refined summary of the text that could be used to improve classification.

The team tested various summarization strategies via careful prompt engineering, generating different kinds of summaries (such as abstractive vs. extractive) of varying lengths and levels of detail. The team then performed a human-in-the-loop grading process to manually assess the effectiveness of these approaches. To determine the prompt to be used in the application, graders evaluated the quality of the summaries each trial prompt generated over a sample set of documents with particularly low-quality source PDFs. Evaluation metrics included the complexity of the prompt, summary generation time, human readability, errors, hallucinations, and, of course, the precision of auto-classification results.
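The grading workflow can be pictured as assembling one request per trial prompt. The templates and field names below are hypothetical illustrations, not the client’s actual prompts or model interface:

```python
# Illustrative prompt templates for abstractive vs. extractive trials;
# the real prompts and LLM interface were client-specific.
PROMPT_TEMPLATES = {
    "abstractive": (
        "Paraphrase the key findings of the following scanned-document text "
        "in about {n} sentences, ignoring OCR artifacts such as stray symbols "
        "and broken line breaks:\n\n{text}"
    ),
    "extractive": (
        "Select the {n} most informative original sentences from the following "
        "text, skipping any garbled passages:\n\n{text}"
    ),
}

def build_summary_request(style: str, length: int, extract: str) -> dict:
    """Assemble one trial request for the human-in-the-loop grading process."""
    return {
        "style": style,
        "target_sentences": length,
        "prompt": PROMPT_TEMPLATES[style].format(n=length, text=extract),
    }

req = build_summary_request("abstractive", 4, "...noisy OCR text...")
```

Keeping the prompt variants in a table like this makes it straightforward for graders to compare summary styles and lengths side by side.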

The EK Difference

Through this iterative process, the team determined that the most effective summaries for this use case were abstractive summaries (summaries that paraphrase content) of around four complete sentences in length. The selected prompt generated summaries with a sufficient level of detail (for both human readers and the classifier) while maintaining brevity. To improve classification, the LLM-generated summaries were meant to supplement the full text extract, not to replace it. The team incorporated the new summaries into the classification pipeline by creating a new metadata field for the source document. The new ‘summary’ metadata field was added to the auto-classification submission along with the full text extracts to provide additional clarity and context. This required adjusting classification model configurations, such as the weights (or priorities) for the new and existing fields.
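A sketch of what such a weighted, multi-field submission might look like (field names and weights are hypothetical, not the actual classifier configuration):

```python
def build_classification_submission(full_text: str, summary: str) -> dict:
    """Combine the OCR extract and the LLM summary into one auto-classification
    request. Weights are illustrative: the summary is cleaner, so it gets higher
    priority, while the full extract still contributes breadth."""
    return {
        "fields": [
            {"name": "full_text", "value": full_text, "weight": 1.0},
            {"name": "summary", "value": summary, "weight": 2.0},
        ]
    }

submission = build_classification_submission(
    "raw OCR text...", "A four-sentence abstractive summary."
)
```

Tuning the relative weights is then a matter of re-running the same grading process against classification precision.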

Large Language Models (LLMs)
A Large Language Model is an advanced AI model designed to perform Natural Language Processing (NLP) tasks, including interpreting, translating, predicting, and generating coherent, contextually relevant text.

The Results

By including the LLM-generated summaries in the classification request, the team was able to provide more context and structure to the existing text. This additional information filled in previous gaps and allowed the classifier to better interpret the content, leading to more precise subject tags compared to using the original OCR text alone. As a bonus, the LLM-generated summaries were also added to the document metadata in the DMS, further improving the discoverability of the archived documents.

By leveraging the power of LLMs, the team was able to clean up noisy OCR output to improve auto-tagging capabilities and to further enrich document metadata with content descriptions. If your organization is facing similar challenges managing and archiving older or difficult-to-parse documents, consider how Enterprise Knowledge can assist in optimizing your content findability with advanced AI techniques.


How Do I Update and Scale My Knowledge Graph?
https://enterprise-knowledge.com/how-do-i-update-and-scale-my-knowledge-graph/
Tue, 12 Jan 2021

The post How Do I Update and Scale My Knowledge Graph? appeared first on Enterprise Knowledge.

Enterprise Knowledge Graph Governance Best Practices

Successfully building, implementing, and scaling an enterprise knowledge graph is a serious undertaking. Those who have been successful at it would emphasize that it takes a clear definition of need (use cases), an appetite to start small, and a few iterations to get it right. When done right, a knowledge graph provides valuable business outcomes, including the scalable organizational flexibility to enrich your data and information with institutional knowledge while aggregating content from numerous sources, enabling your systems to understand the context and the evolving nature of your business domain.

Having worked on multiple knowledge graph implementation projects, I find the most common question I get is, “What does it take for an organization to maintain and update an enterprise knowledge graph?” Though many organizations have successfully built knowledge graph pilots and prototypes that adequately demonstrate the potential of the technology, few have successfully deployed an enterprise knowledge graph that proves out the true business value and ROI this technology offers. Forethought about governance from the get-go plays a key role in ensuring that the upfront investment in a tangible solution remains a long-term success. Here, I’ll share the key considerations and approaches we have found effective for growing and managing an enterprise knowledge graph so that it continues serving the upstream and downstream applications that rely on it.

First and foremost, building an effective knowledge graph begins with understanding and defining clear use cases and the business problems that it will be solving for your organization. Starting here will enable you to anticipate and tackle questions like: 

  • “Who will be the primary end-users or subject matter experts?”
  • “What type of data do you need?”
  • “What data or systems will it be applied to?”
  • “How often does your data change?”
  • “Who will be updating and maintaining it?”

Addressing these questions early on will not only allow you to shape your development and implementation scope, but also define a repeatable process for managing change and future efforts. The section below provides specific areas of consideration when getting started.

1. Build it Right – Use Standards

As a natural integration framework, an enterprise knowledge graph is part of an architectural layer that consists of a wide array of solutions, ranging from the organizational data itself, to data models that support object- or context-oriented information models (taxonomy, ontology, and knowledge graph), to user-facing applications that allow you to interact with data and information directly (search, analytics dashboards, chatbots, etc.). Thus, properly understanding and designing the architecture is one of the most fundamental aspects of making sure the graph doesn’t become stale or irrelevant.

A practical knowledge graph needs to leverage common semantic information organization models such as metadata schemas, taxonomies, and ontologies. These serve as data models or schemas, representing your content in systems and placing constraints on what types of business entities are connected in a graph and how they relate to one another. Building a knowledge graph through these layers, which serve as “blueprints” of your business processes, helps maintain the identity and structure your knowledge graph needs to continue growing and evolving through time. A knowledge graph built on explicitly defined logical models makes your business logic machine-readable and allows systems to understand the context and relationships of your data and your business entities. Using these unifying data models also enables you to integrate data in different formats (for example, unstructured PDF documents, relational databases, and structured text formats like XML and JSON), rendering your enterprise data interconnected and reusable across disparate and diverse technologies such as Content Management Systems (CMS) or Customer Relationship Management (CRM) systems.

When building these information models (taxonomies and ontologies), leveraging semantic web standards such as the Resource Description Framework (RDF), the Simple Knowledge Organization System (SKOS), and the Web Ontology Language (OWL) offers many long-term benefits by facilitating governance, interoperability, and scale. Specifically, leveraging these well-established standards when developing your knowledge graph allows you to:

  • Represent and transfer information across multiple systems, solutions, or types of data/content and avoid vendor lock-in to proprietary solutions;
  • Share your content internally across the organization or externally with other organizations;
  • Support and integrate with publicly available taxonomies, ontologies, and linked open data sources to jump start your enterprise semantic models or to enrich your existing information architecture with industry standards; and
  • Enable your systems to understand business vocabulary and design for its evolution.
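For example, the standards above can be seen in a minimal SKOS fragment written in Turtle syntax. The concepts and URIs below are purely illustrative:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/vocab/> .

ex:DataManagement a skos:Concept ;
    skos:prefLabel "Data Management"@en ;
    skos:narrower  ex:MetadataGovernance .

ex:MetadataGovernance a skos:Concept ;
    skos:prefLabel "Metadata Governance"@en ;
    skos:altLabel  "Metadata Stewardship"@en ;
    skos:broader   ex:DataManagement .
```

Because SKOS and RDF are W3C standards, a vocabulary expressed this way can move between taxonomy management tools and graph databases without lossy, vendor-specific conversion.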

2. Understand the Frequency of Change and the Volume of Your Data

A viable knowledge graph solution is closely linked to the business model and domain of the organization, which means it should always be relevant, up to date, and accurate, with scalable coverage of all valuable sources of information. Frequent changes to your data model or knowledge graph mean your organization’s domain is in constant shift and needs your knowledge and information to keep up.

In this context, changes to your content/data include: adding new information or processing new data; updating your entities or metadata; adding or removing relationships between content; or updating the query that maps your taxonomy/ontology to your content (due to a change in your content).

These types of changes should not require rebuilding or restructuring your entire graph. As such, determining update frequency and intervals, along with a governance model suited to your industry and use cases, is an effective way to keep your enterprise knowledge graph well governed.

For instance, for our clients in the accounting or tax domain, industry and organizational vocabulary/metadata and the underlying processes/content are relatively static. Therefore, the knowledge, entities, and processes in their business domain don’t typically change that frequently. This means real-time updates and editing of their knowledge graph solution at scale may not be a primary need or a capability that needs focus right away. Such use cases allow these organizations to realize savings by shifting the focus from enterprise-level metadata management tools or large-scale data engineering solutions to effectively defining their data model and governance to address the immediate use cases or business requirements at hand.

In other scenarios, such as for our clients in the digital marketing and analytics industry, obtaining a 360-degree view of a consumer in real time is their bread and butter. This means that marketing and analytics teams need to know immediately when, for example, a “marketable consumer” changes their address or contact information. It is imperative in this case that such rapidly changing business domains have the resources, capabilities, and automation necessary to update and govern their knowledge graphs at scale.

[Figure: Venn diagram titled “Understanding Your Use Cases and How Often Your Knowledge Graph Needs to Be Updated Helps You Determine the Right Solution Architecture and Technology Investment.”]

When content is mostly static, or the semantic solution is a small proof of concept (PoC):
  • Manual data transformation processes requiring human intervention
  • Manual graph creation and data extraction

When content is highly dynamic, or the semantic solution is implemented enterprise-wide:
  • Taxonomy/ontology manager with history tracking and an audit trail to view the history of a concept
  • Enterprise graph database with APIs to push/pull data programmatically (important for frequently changing data)
  • Data engineering pipelines and automation tools
  • Automated data extraction (text extraction, tagging, etc.)

Where the two overlap, the categorization of the semantic solution depends on use cases:
  • AI/ML applications (chatbots, recommendation engines, natural language search, etc.)

3. Develop Programmatic Access Points to Connect Your Applications

Common enterprise knowledge graph solutions are constructed through data transformation pipelines. This creates a repeatable process for mapping structured sources and for extracting, disambiguating, classifying, and tagging unstructured sources. It also means that the main way to affect the data in the knowledge graph is to govern the input data (e.g., exports from taxonomy management systems, content management platforms, database systems, etc.). Otherwise, ad-hoc changes to the knowledge graph will be lost or erased every time new data is loaded from a connected application.

Construct your graph and ontology in systems or through pipelines. Manage governance at your source systems or front-end applications that are connecting to your graph.

Therefore, designing and implementing a repeatable data extraction and application model, guided by the governance of the source systems, is one of the fundamental architectures for building a reliable knowledge graph.
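A toy version of this principle, with a purely illustrative rebuild function (real pipelines would read exports from taxonomy managers, content platforms, and databases): because the graph is always reconstructed from governed sources, any ad-hoc edit made directly to the graph disappears on the next load.

```python
def rebuild_graph(sources: list[dict]) -> set[tuple]:
    """Reconstruct the knowledge graph from governed source exports.

    Each export carries (subject, predicate, object) rows; the graph is
    always derived from the sources, never hand-edited."""
    graph = set()
    for export in sources:
        for s, p, o in export["triples"]:
            graph.add((s, p, o))
    return graph

sources = [
    {"system": "taxonomy_manager", "triples": [("ex:Doc1", "ex:about", "ex:Tax")]},
    {"system": "dms", "triples": [("ex:Doc1", "ex:author", "ex:Smith")]},
]
graph = rebuild_graph(sources)

# An ad-hoc edit made directly to the graph...
graph.add(("ex:Doc1", "ex:status", "ex:Reviewed"))
# ...is lost the next time the pipeline runs, because the change was
# never made in a governed source system.
graph = rebuild_graph(sources)
```

This is why corrections belong in the source systems or front-end applications, not in the graph itself.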

4. Put Validation Checks and Analytics Processes in Place

Apply checks to identify conflicting information within your knowledge graph. Even though it’s rather challenging to train a knowledge graph to automatically know the right way to organize new knowledge and information, the ability to track and check why certain attributes and values were applied to your data or content should be part of the design for all data that is aggregated in the solution. One technique we’ve used is to segment inferred or predicted data into a separate graph reserved for new and uncertain information. In this way, uncertain data can be isolated from observed or confirmed information, making it easier to trace the origins of inferred information or to recompute inferences and predictions as your underlying data or artificial intelligence models change. Confidence scores or ratings on both entities and relationships can also be used to indicate graph accuracy.

Additional effective practices that provide checks and processes for creating and updating a knowledge graph include instituting consistent naming conventions throughout the design and implementation (e.g., URIs) and establishing guidelines for version control and workflows, including a log of all changes and edits to the graph. Many enterprise graph databases also support the SHACL Semantic Web standard, which can be used to validate your graph when adding new data and to check for logical inconsistencies.
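The segmentation technique can be sketched with a toy data model (illustrative predicates and scores, not a specific product’s API): inferred statements live apart from confirmed facts, carry confidence scores, and can be checked for conflicts on single-valued properties.

```python
# Two segregated graphs: confirmed facts vs. inferred statements with confidence.
confirmed = {("ex:Smith", "ex:worksAt", "ex:LabA")}
inferred = {
    ("ex:Smith", "ex:expertIn", "ex:Spectroscopy"): 0.92,
    ("ex:Smith", "ex:worksAt", "ex:LabB"): 0.40,
}

def conflicts(confirmed: set, inferred: dict, functional_predicates: set) -> list:
    """Flag inferred statements that contradict confirmed facts on
    single-valued (functional) predicates, e.g. a person's employer."""
    flagged = []
    for (s, p, o), score in inferred.items():
        if p in functional_predicates:
            for cs, cp, co in confirmed:
                if (cs, cp) == (s, p) and co != o:
                    flagged.append(((s, p, o), score))
    return flagged

issues = conflicts(confirmed, inferred, {"ex:worksAt"})
```

Because the inferred graph is separate, flagged statements can be reviewed, re-scored, or recomputed without disturbing the confirmed facts.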

5. Develop a Governance Plan and Operating Model

An effective knowledge graph governance model addresses the common set of standards and processes to handle changes and requests to the knowledge graph and peripheral systems at all levels. Specifically, a good knowledge graph governance model will provide an approach or specification for the following: 

  • Governance roles and responsibilities. Common governance roles include a governance group of taxonomists/ontologists, data engineers or scientists, database and application managers and administrators, and knowledge or business representatives or analysts;
  • Governance around the data sources that feed the knowledge graph. For instance, when unclean data comes in from a source system, there should be specific roles and processes for correcting it;
  • Specific processes for updating the knowledge graph in the system where it is managed (i.e., processes to ensure major and minor changes to the knowledge graph are accurately assessed and implemented), including governance around adding new data sources: what it looks like, who needs to be involved, etc.;
  • Approaches to handling changes to the underlying ontology data model. Common change requests include the addition, modification, or deprecation of ontological classes, attributes, synonyms, or relationships;
  • Approaches to tackling common barriers to continuing to build and enhance a successful ontology and knowledge graph. Common challenges include the lack of effective text analytics and extraction tools to automate the organization of content and the application of tags/relationships, as well as the lack of intuitive ways to manage and update Linked Data;
  • Guidance on communication to stakeholders and end users, including sample messaging and communication best practices and methods; and
  • Review cadence. Identify common intervals for changes and adjustments to the knowledge graph solution by understanding the complexity and fluidity of your data, and build in recurring review cycles and governance meetings accordingly.

Closing

As a representation of an organization’s knowledge, an enterprise knowledge graph allows for the aggregation of a breadth of information across systems and departments. If left without ownership or a plan, it can easily grow out of sync, resulting in rework, redesign, and a lot of wasted effort.

Whether you are just beginning to design an enterprise knowledge graph and wish to understand the value and benefits, or you are looking for a proven approach to defining governance, maintenance, and a plan to scale, check out our additional thought leadership and real-world case studies to learn more. Our expert graph engineers and consultants are also on standby if you need any support. Contact us with any questions.


 

A Case for Chatbots
https://enterprise-knowledge.com/a-case-for-chatbots/
Mon, 07 Sep 2020

The post A Case for Chatbots appeared first on Enterprise Knowledge.

“Welcome to our chat! You can ask me anything or request real human help at any time. Ask me anything!” Whether you’re looking to buy a car, ordering pizza, or browsing research publications, the average internet experience tends to feature some form of user-oriented chatbot. Whether you find the omnipresence of chatbots maddening or inspiring, trends suggest they’re here to stay. Chatbots empower your users by automating question answering and guiding troubleshooting. Many organizations choose to feature a chatbot on their company’s home page and, after ironing out the initial wrinkles, see a welcome drop in customer service requests. Intrigued? This blog is an introductory overview of the how of chatbots, considering the various elements at play behind the curtain.

Chatbots Today

Chatbots are most commonly used for information gathering, user-empowering customer service assistance, or request mapping (think of the phone prompts you go through each time you call your internet service provider – if you’re looking to cancel your service because you’re moving, it would be a waste of both parties’ time to connect you to someone who provides service bundling packages). Chatbots also vary in their capabilities, with some looking for responses to Boolean questions (‘Did you say you want to cancel your internet services?’) while others are primed to mimic human conversation by ‘responding’ to users via full, natural-sounding sentences (though no known chatbot has yet convinced its audience that it is capable of autonomous thought – that is, passed the Turing test).

In addition to their varying abilities, chatbots also tend to take one of two forms. This blog considers ‘real’ chatbots and how they work. Specifically, these ‘smart’ chatbots include machine learning components that work hand-in-hand with natural language processing capabilities, allowing the bot to ‘understand’ human language and the forms that questions or requests take per the structure of language. In addition to taking in human language queries, these bots also produce responses mimicking human language, allowing them to provide contextualized answers or share actionable options with the user. Alternatively, a simpler bot (or ‘dumb’ bot), commonly seen across the web, functions only as another avenue to the organization’s search. These bots process incoming requests or questions as search queries and, as their ‘responses,’ return links to the content determined most relevant per the keywords identified in that query.
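The difference is easy to see in miniature. A search-backed ‘dumb’ bot amounts to keyword matching over content; the sketch below, with hypothetical pages and scoring, returns ranked links rather than generating an answer:

```python
def keyword_bot(query: str, pages: dict[str, str]) -> list[str]:
    """Rank page URLs by how many query keywords appear in their text --
    the behavior of a search-backed 'dumb' bot."""
    terms = set(query.lower().split())
    scored = []
    for url, text in pages.items():
        hits = sum(1 for t in terms if t in text.lower())
        if hits:
            scored.append((hits, url))
    return [url for hits, url in sorted(scored, reverse=True)]

pages = {
    "/billing": "how to cancel or change your internet billing plan",
    "/outages": "report an internet service outage in your area",
}
results = keyword_bot("cancel my internet plan", pages)
```

A ‘smart’ bot would instead parse the intent behind the query and compose a contextual response, rather than simply returning the best-matching links.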

A Peek Behind the Curtain – The How

From a development perspective, we can shed some light on the technical tasks necessary to successfully implement a chatbot by reviewing one of our past projects. At a global development bank, executive leadership knew their colleagues were missing information because they were routinely unaware of it or couldn’t find it. The bank wanted a solution that collected all of their institutional wisdom and learned people’s areas of knowledge and need so it could automatically assemble and send targeted information in two instances: the first being to connect individuals to that information at the moment of request, and the second being to send appropriate information to the right people prior to important events, like a topic-specific board meeting.

To address this two-pronged need, the EK team built a semantic hub. This solution uses machine learning to automatically deliver content to bank employees when and where they need it, through email and related webpage widgets. The hub interfaces with a graph database and knowledge graph platform, as well as a semantic technology tool from the Semantic Web Company that manages both taxonomies and ontologies. The hub then provides contextualized recommendations to deliver relevant content on any bank-managed website. Although the project required collaboration between the bank and two service providers, we delivered a complex solution without the project managers feeling like they were managing three different, disparate organizations. As a finished product, our chatbot recognizes that a user is requesting documents and forwards their message to the recommendation engine, which generates results by searching for metadata that matches, or is akin to, the user’s query. The tool is in active daily use, providing timely recommendations on three different web applications and in advance of important meetings.

Additionally, at the outset of any chatbot development project, before you can connect users to content, you first have to affix descriptive metadata to that content so that the chatbot can find it. A dedicated taxonomy and ontology development effort is necessary, as are subsequent validation sessions throughout a series of project phases. Such sessions can and should include a content cleanup process and a series of focus groups to validate user needs and the prioritized use cases to be addressed by the bot. Once your content is nice and NERDy, a selection of that content should go through a text extraction process, which further informs and validates the taxonomy/ontology, ensuring you are working with metadata that best reflects your organization’s information. Content from around the organization is then auto-tagged (using a taxonomy management tool) and collected within the graph database.
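The auto-tagging step can be pictured as matching taxonomy labels (and their synonyms) against content. This toy tagger uses a hypothetical taxonomy and is not any taxonomy management tool’s actual API:

```python
# Toy taxonomy: preferred label -> synonyms/variants found in text.
TAXONOMY = {
    "Machine Learning": ["machine learning", "ml model"],
    "Risk Management": ["risk management", "risk assessment"],
}

def auto_tag(text: str, taxonomy: dict[str, list[str]]) -> list[str]:
    """Return preferred labels whose variants appear in the text,
    ready to be stored as 'about-ness' metadata for the chatbot."""
    lowered = text.lower()
    return [label for label, variants in taxonomy.items()
            if any(v in lowered for v in variants)]

tags = auto_tag("This memo proposes an ML model for loan risk assessment.", TAXONOMY)
```

Real taxonomy management tools add disambiguation and scoring on top of this, but the core idea is the same: content carries tags the bot can query.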

The Why for Implementation

If the above example seems at all daunting to you and you’re wondering if chatbots are worth the effort to implement at your own organization, consider this recent stat:

63% of customers will leave a company after just one poor experience, and almost two-thirds will no longer wait more than two minutes for assistance (Forrester Research).

Chatbots allow for a customized user experience, not only letting users get the information they need more quickly, but also being designed and oriented toward each user’s unique intent and interest. And for an additional convincing statistic: a 2016 Oracle survey showed that 80% of business decision-makers said they already used chatbots or planned to use them by the end of 2020. All kinds of users are interfacing with chatbots, whether they’re service users, potential customers, or those with executive decision-making abilities. And while chatbot development can be complex, firms like ours, with deep experience in data and ontology mapping and user experience design, can facilitate the design and development process for your organization. Additional benefits of chatbot implementation include increases in revenue as easy-to-answer inquiries are automated, decreases in overstaffing on your CX team, and greater customer satisfaction for users on your site, increasing the likelihood that they become repeat users.

If you think chatbots could benefit your organization (and they probably can), don’t hesitate to reach out to us at info@enterprise-knowledge.com. We can assist you at any stage of the chatbot process, from design and data mapping through end-to-end development.

 
