auto-tagging Articles - Enterprise Knowledge https://enterprise-knowledge.com/tag/auto-tagging/

How to Leverage LLMs for Auto-tagging & Content Enrichment https://enterprise-knowledge.com/how-to-leverage-llms-for-auto-tagging-content-enrichment/ Wed, 29 Oct 2025 14:57:56 +0000

The post How to Leverage LLMs for Auto-tagging & Content Enrichment appeared first on Enterprise Knowledge.

When working with organizations on key data and knowledge management initiatives, we have often found that a common roadblock is a lack of quality (relevant, meaningful, and up-to-date) existing content. Stakeholders may be excited to start their initiatives with advanced tools, like graph solutions, personalized search solutions, or advanced AI solutions; however, without a strong backbone of semantic models and context-rich content, these solutions are significantly less effective. For example, without proper tags and content types, a knowledge portal development effort cannot fully demonstrate the value of faceting and aggregating pieces of content and data together in 'knowledge panes'. With a more semantically rich set of content to work with, the portal can begin showing value through search, filtering, and aggregation, leading to further organizational and leadership buy-in.

One key step in preparing content is the application of metadata and organizational context to pieces of content through tagging. There are several tagging approaches an organization can take to enrich pre-existing content with metadata and organizational context, including manual tagging, automated tagging capabilities from a taxonomy and ontology management system (TOMS), using apps and features directly from a content management solution, and various hybrid approaches. While many of these approaches, in particular acquiring a TOMS, are recommended as long-term auto-tagging solutions, EK has recommended and implemented Large Language Model (LLM)-based auto-tagging capabilities across several recent engagements. Due to LLM-based tagging's lower initial investment compared to a TOMS and its greater efficiency compared to manual tagging, these auto-tagging solutions have been able to provide immediate value and jumpstart the process of re-tagging existing content. This blog dives deeper into how LLM tagging works, the value of semantics, technical considerations, and next steps for implementing an LLM-based tagging solution.

Overview of LLM-Based Auto-Tagging Process

Similar to existing auto-tagging approaches, the LLM suggests tags by parsing through a piece of content, processing and identifying key phrases, terms, or structural elements that give the document context. Through prompt engineering, the LLM is then asked to compare the similarity of key semantic components (e.g., named entities, key phrases) with various term lists, returning a set of terms that could be used to categorize the piece of content. The tagging workflow can be adjusted to only return terms meeting a minimum similarity score. These tagging results are then exported to a data store and applied to the content source. Many factors, including the particular LLM used, the knowledge the LLM is working with, and the source location of content, can greatly impact tagging effectiveness and accuracy. In addition, adjusting parameters, taxonomies/term lists, and/or prompts to improve precision and recall can ensure tagging results align with an organization's needs. The final step is the auto-tagging itself: the application of the tags in the source system, typically through a script or workflow that applies the stored tags to pieces of content.
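The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not a specific product's API: `call_llm` is a hypothetical stand-in for a real LLM client, and the JSON response format and 0.75 threshold are illustrative assumptions.

```python
import json

# Hypothetical stand-in for a real LLM client call; it returns the model's
# raw response as a JSON string. Swap in your provider's SDK here.
def call_llm(prompt: str) -> str:
    return json.dumps([
        {"term": "Knowledge Management", "score": 0.91},
        {"term": "Financial Audit", "score": 0.42},
    ])

def suggest_tags(document_text: str, term_list: list[str],
                 threshold: float = 0.75) -> list[str]:
    """Ask the LLM to compare a document against a term list, then keep
    only suggestions at or above the similarity threshold."""
    prompt = (
        "Compare the key phrases and named entities in the document below "
        f"against these terms: {', '.join(term_list)}.\n"
        'Return a JSON array of {"term", "score"} objects.\n\n'
        f"Document:\n{document_text}"
    )
    suggestions = json.loads(call_llm(prompt))
    return [s["term"] for s in suggestions if s["score"] >= threshold]

tags = suggest_tags("Quarterly KM program review...",
                    ["Knowledge Management", "Financial Audit"])
print(tags)  # ['Knowledge Management'] with the stubbed response above
```

The filtered tags would then be written to a data store before a separate script applies them in the source system.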

Figure 1: High-level steps for LLM content enrichment

EK has put these steps into practice, for example, when engaging with a trade association on a content modernization project to migrate and auto-tag content into a new content management system (CMS). The organization had been struggling with content findability, standardization, and governance, in particular with the language used to describe the diverse areas of work the trade association covers. As part of this engagement, EK first worked with the organization's subject matter experts (SMEs) to develop new enterprise-wide taxonomies and controlled vocabularies, integrated across multiple platforms and utilized by both external and internal end-users. To operationalize and apply these common vocabularies, EK developed an LLM-based auto-tagging workflow utilizing the four high-level steps above to auto-tag metadata fields and identify content types. This content modernization effort set up the organization for document workflows, search solutions, and generative AI projects, all of which can leverage the added metadata on documents.

Value of Semantics with LLM-Based Auto-Tagging

Semantic models such as taxonomies, metadata models, ontologies, and content types can all be valuable inputs to guide an LLM on how to effectively categorize a piece of content. When considering how an LLM is trained for auto-tagging content, a greater emphasis needs to be put on organization-specific context. If using a taxonomy as a training input, organizational context can be added through weighting specific terms, increasing the number of synonyms/alternative labels, and providing organization-specific definitions. For example, by providing organizational context through a taxonomy or business glossary that the term “Green Account” refers to accounts that have met a specific environmental standard, the LLM would not accidentally tag content related to the color green or an account that is financially successful.
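As a sketch of how that organizational context might be packaged for the model, the following renders taxonomy entries (with alternative labels and organization-specific definitions) into a prompt fragment. The data shape here, including the `alt_labels` and `definition` field names, is an illustrative assumption rather than any TOMS export format.

```python
def build_term_context(taxonomy: dict[str, dict]) -> str:
    """Render taxonomy terms, alternative labels, and organization-specific
    definitions into a context block for a tagging prompt."""
    lines = []
    for term, info in taxonomy.items():
        line = f"- {term}"
        alts = ", ".join(info.get("alt_labels", []))
        if alts:
            line += f" (also known as: {alts})"
        if info.get("definition"):
            line += f": {info['definition']}"
        lines.append(line)
    return "\n".join(lines)

taxonomy = {
    "Green Account": {
        "alt_labels": ["Environmentally Certified Account"],
        "definition": "An account that has met a specific environmental standard.",
    }
}
print(build_term_context(taxonomy))
```

Including the definition in the prompt is what steers the model away from tagging content about the color green or a financially healthy account.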

Another benefit of an LLM-based approach is the ability to evolve both the semantic model and the LLM as tagging results are received. As sets of tags are generated for an initial set of content, the taxonomies and content models being used to train the LLM can be refined to better fit the specific organizational context. This could look like adding alternative labels, adjusting the definitions of terms, or adjusting the taxonomy hierarchy. Similarly, additional tools and techniques, such as weighting and prompt engineering, can tune the results provided by the LLM to achieve higher recall (the rate at which the LLM includes the correct terms) and precision (the rate at which the LLM selects only the correct terms) when recommending terms. One example is adding a weighting from 0 to 10 for all taxonomy terms and assigning higher scores to terms the organization prefers to use. The workflow developed alongside the LLM can use this context to include or exclude a particular term.
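A minimal sketch of that include/exclude step might combine the LLM's similarity score with the organizational weight per term. Both thresholds below are illustrative assumptions, not values from an actual engagement.

```python
def filter_by_weight(suggestions: list[tuple[str, float]],
                     weights: dict[str, int],
                     score_threshold: float = 0.6,
                     weight_threshold: int = 5) -> list[str]:
    """Keep a suggested term only if its similarity score clears the
    threshold AND the organization weights the term highly enough
    (on an assumed 0-10 scale)."""
    return [
        term for term, score in suggestions
        if score >= score_threshold and weights.get(term, 0) >= weight_threshold
    ]

weights = {"Green Account": 9, "Account": 2}  # org prefers the specific term
suggestions = [("Green Account", 0.82), ("Account", 0.90)]
print(filter_by_weight(suggestions, weights))  # ['Green Account']
```

Here the generic term "Account" is excluded despite a high similarity score, because the organization has weighted it low.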

Implementation Considerations for LLM-Based Auto-Tagging 

Several factors, such as the timeframe, volume of information, necessary accuracy, types of content management systems, and desired capabilities, inform the complexity and resources needed for LLM-based content enrichment. The following considerations expand upon the factors an organization must consider for effective LLM content enrichment. 

Tagging Accuracy

The accuracy of tags from an LLM directly impacts the end-users and systems (e.g., search instances or dashboards) that utilize the tags. Safeguards need to be implemented to ensure end-users can trust the accuracy of the tagged content they are using; these safeguards help ensure that a user does not mistakenly access or use the wrong document and is not frustrated by the results they get. To mitigate both of these concerns, achieving high recall and precision scores in LLM tagging improves overall accuracy and lowers the chance of miscategorization. This can be done by investing in human test-tagging and input from SMEs to create a gold-standard set of tagged content as training data for the LLM. The gold-standard set can then be used to adjust how the LLM weights or prioritizes terms, based on the organizational context captured in the set. These practices help avoid hallucinations (factually incorrect or misleading content) that could appear in applications utilizing the auto-tagged set of content.
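Scoring the LLM's suggestions against a gold-standard set comes down to a simple precision/recall calculation per document. This is a generic sketch, not any particular evaluation framework:

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision: share of predicted tags that are correct.
    Recall: share of gold-standard tags the tagger found."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# One document: the tagger found "Audit" correctly, missed "Compliance",
# and added the spurious "Finance".
print(precision_recall({"Audit", "Finance"}, {"Audit", "Compliance"}))  # (0.5, 0.5)
```

Averaging these scores across the gold-standard set gives a concrete baseline to measure against as prompts, weights, and taxonomies are tuned.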

Content Repositories

One factor that greatly adds technical complexity is accessing the various types of content repositories that an LLM solution, or any auto-tagging solution, needs to read from. The best content management practice for auto-tagging is to read content in its source location, limiting the risk of duplication and the effort needed to download and then read content. When developing a custom solution, each content repository often needs a distinctive approach to read and apply tags. A content or document repository like SharePoint, for example, has a robust API for reading content and seamlessly applying tags, while a less widely adopted platform may not have the same level of support. It is important to account for the unique needs of each system in order to limit the disruption end-users may experience when embarking on a tagging effort.
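One way to contain that per-repository complexity is a small adapter interface, so the tagging workflow itself stays repository-agnostic while each system (SharePoint, a wiki, a file share) gets its own implementation. This is a design sketch with hypothetical method names, not any platform's actual API:

```python
from typing import Protocol

class ContentRepository(Protocol):
    """Hypothetical adapter interface: each repository implements these two
    methods, so the tagging workflow never touches system-specific APIs."""
    def read_text(self, doc_id: str) -> str: ...
    def apply_tags(self, doc_id: str, tags: list[str]) -> None: ...

class InMemoryRepository:
    """Toy adapter used here as a stand-in for a real integration."""
    def __init__(self, docs: dict[str, str]):
        self.docs = docs
        self.tags: dict[str, list[str]] = {}

    def read_text(self, doc_id: str) -> str:
        return self.docs[doc_id]

    def apply_tags(self, doc_id: str, tags: list[str]) -> None:
        self.tags[doc_id] = tags

repo: ContentRepository = InMemoryRepository({"doc1": "Quarterly KM review"})
repo.apply_tags("doc1", ["Knowledge Management"])
print(repo.tags["doc1"])
```

A SharePoint adapter would implement the same two methods against its REST API; a less widely supported platform might need file exports and a manual write-back step, but the rest of the pipeline stays unchanged.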

Knowledge Assets

When considering the scalability of the auto-tagging effort, it is also important to evaluate the breadth of knowledge asset types being analyzed. While the ability of LLMs to process several types of knowledge assets has been growing, each step of additional complexity, particularly evaluating multiple asset types, can require additional resources and time to read and tag documents. A PDF document with 2-3 pages of text will take far fewer tokens and resources for an LLM to read than a long visual or audio asset. Going from a tagging workflow of structured knowledge assets to tagging unstructured content will increase the overall time, resources, and custom development needed to run a tagging workflow.

Data Security & Entitlements

When utilizing an LLM, it is recommended that an organization invest in a private or in-house LLM to complete analysis, rather than leveraging a publicly available model. Notably, a private LLM does not need to be 'on-premises'; several providers offer LLMs hosted within your company's own environment. This ensures a higher level of document security and additional options for customization. Particularly when tackling use cases with higher levels of personal information and access controls, a robust mapping of content and an understanding of what needs to be tagged is imperative. As an example, if a publicly facing LLM were reading confidential documents on how to develop a company-specific product, this information could then be leveraged in other public queries and would have a higher likelihood of being accessed outside of the organization. In an enterprise data ecosystem, running an LLM-based auto-tagging solution can raise red flags around data access, controls, and compliance. These challenges can be addressed through a Unified Entitlements System (UES) that creates a centralized policy management system for both end-users and the LLM solutions being deployed.

Next Steps

One major consideration with an LLM tagging solution is maintenance and governance over time. For some organizations, after completing an initial enrichment of content by the LLM, a combination of manual tagging and forms within each CMS helps them maintain tagging standards over time. However, a more mature organization that is dealing with several content repositories and systems may want to either operationalize the content enrichment solution for continued use or invest in a TOMS. With either approach, completing an initial LLM enrichment of content is a key method to prove the value of semantics and metadata to decision-makers in an organization. 
Many technical solutions and initiatives that excite both technical and business stakeholders can be actualized through an LLM content enrichment effort. With content that is tagged and adheres to semantic standards, solutions like knowledge graphs, knowledge portals, and semantic search engines, or even an enterprise-wide LLM solution, can demonstrate even greater organizational value.

If your organization is interested in upgrading your content and developing new KM solutions, contact us!


Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup https://enterprise-knowledge.com/optimizing-historical-knowledge-retrieval-leveraging-an-llm-for-content-cleanup/ Wed, 02 Jul 2025 19:39:00 +0000

The post Optimizing Historical Knowledge Retrieval: Leveraging an LLM for Content Cleanup appeared first on Enterprise Knowledge.


The Challenge

Enterprise Knowledge (EK) recently worked with a Federally Funded Research and Development Center (FFRDC) that was having difficulty retrieving relevant content in a large volume of archival scientific papers. Researchers were burdened with excessive search times and the potential for knowledge loss when target documents could not be found at all. To learn more about the client’s use case and EK’s initial strategy, please see the first blog in the Optimizing Historical Knowledge Retrieval series: Standardizing Metadata for Enhanced Research Access.

To make these research papers more discoverable, part of EK's solution was to add "about-ness" tags to the document metadata through a classification process. Many of the files in this document management system (DMS) were lower-quality PDF scans of older documents, such as typewritten papers and pre-digital technical reports that often included handwritten annotations. To begin classifying the content, the team first needed to transform the scanned PDFs into machine-readable text. EK utilized an Optical Character Recognition (OCR) tool, which can "read" non-text file formats for recognizable language and convert them into digital text. When processing the archival documents, even the most advanced OCR tools still introduced a significant amount of noise in the extracted text. This frequently manifested as:

  • Tables, figures, or handwriting being read in as random symbols and white space.
  • Random punctuation inserted where a spot or pen mark appeared on the file, breaking up words and sentences.
  • Excessive or misplaced line breaks separating related content.
  • Other miscellaneous irregularities that make the text less comprehensible.

The first round of text extraction using out-of-the-box OCR capabilities resulted in many of the above issues across the output text files. This starter batch of text extracts was sent to the classification model to be tagged. The results were assessed by examining the classifier’s evidence within the document for tagging (or failing to tag) a concept. Through this inspection, the team found that there was enough clutter or inconsistency within the text extracts that some irrelevant concepts were misapplied and other, applicable concepts were being missed entirely. It was clear from the negative impact on classification performance that document comprehension needed to be enhanced.
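A rough heuristic like the following can flag extracts that need extra attention before classification, by measuring how much of the text is symbol clutter. The 0.7 ratio is an illustrative threshold, not a value from the project:

```python
def looks_noisy(text: str, min_clean_ratio: float = 0.7) -> bool:
    """Flag an OCR extract whose share of letters, digits, and whitespace
    falls below a threshold -- a rough signal of symbol clutter from
    tables, handwriting, or scan artifacts."""
    if not text:
        return True
    clean = sum(ch.isalnum() or ch.isspace() for ch in text)
    return clean / len(text) < min_clean_ratio

print(looks_noisy("The experiment achieved a 95% yield."))  # False
print(looks_noisy("|~*  ##  ..%%//  @@ ^^ }{ []"))          # True
```

A check like this could route the worst extracts to extra processing while leaving clean text untouched.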

Auto-Classification
Auto-Classification (also referred to as auto-tagging) is an advanced process that automatically applies relevant terms or labels (tags) from a defined information model (such as a taxonomy) to your data. Read more about Enterprise Knowledge's auto-tagging solutions here.

The Solution

To address this challenge, the team explored several potential solutions for cleaning up the text extracts. However, there was concern that direct text manipulation might lead to the loss of critical information if applied wholesale across the entire corpus. Rather than modifying the raw text directly, the team decided to leverage a client-side Large Language Model (LLM) to generate additional text based on the extracts. The idea was that the LLM could interpret the noise from OCR processing as irrelevant and produce a refined summary of the text that could be used to improve classification.

The team tested various summarization strategies via careful prompt engineering to generate different kinds of summaries (such as abstractive vs. extractive) of varying lengths and levels of detail. The team then performed a human-in-the-loop grading process to manually assess the effectiveness of these different approaches. To determine the prompt to be used in the application, graders evaluated the quality of the summaries generated per trial prompt over a sample set of documents with particularly low-quality source PDFs. Evaluation metrics included the complexity of the prompt, summary generation time, human readability, errors, hallucinations, and, of course, the precision of auto-classification results.

The EK Difference

Through this iterative process, the team determined that the most effective summaries for this use case were abstractive summaries (summaries that paraphrase content) of around four complete sentences in length. The selected prompt generated summaries with a sufficient level of detail (for both human readers and the classifier) while maintaining brevity. To improve classification, the LLM-generated summaries are meant to supplement the full text extract, not to replace it. The team incorporated the new summaries into the classification pipeline by creating a new metadata field for the source document. The new ‘summary’ metadata field was added to the auto-classification submission along with the full text extracts to provide additional clarity and context. This required adjusting classification model configurations, such as the weights (or priority) for the new and existing fields.
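The resulting submission might look like the following sketch, where the field names and the idea of numeric per-field weights are assumptions about a generic classifier interface, not the client's actual configuration:

```python
def build_classification_request(doc_id: str, full_text: str, summary: str,
                                 text_weight: float = 1.0,
                                 summary_weight: float = 2.0) -> dict:
    """Bundle the raw OCR extract and the LLM-generated summary into one
    classification submission, with per-field weights the classifier can
    use to prioritize the cleaner summary text."""
    return {
        "id": doc_id,
        "fields": [
            {"name": "full_text", "value": full_text, "weight": text_weight},
            {"name": "summary", "value": summary, "weight": summary_weight},
        ],
    }

req = build_classification_request(
    "doc-001",
    "raw OCR text with ~~ artifacts ...",
    "A four-sentence abstractive summary of the report.",
)
print([f["name"] for f in req["fields"]])  # ['full_text', 'summary']
```

The key design point is that the summary supplements rather than replaces the extract: both fields travel together, and the weights steer the classifier.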

Large Language Models (LLMs)
A Large Language Model is an advanced AI model designed to perform Natural Language Processing (NLP) tasks, including interpreting, translating, predicting, and generating coherent, contextually relevant text. Read more about how Enterprise Knowledge is leveraging LLMs in client solutions here.

The Results

By including the LLM-generated summaries in the classification request, the team was able to provide more context and structure to the existing text. This additional information filled in previous gaps and allowed the classifier to better interpret the content, leading to more precise subject tags compared to using the original OCR text alone. As a bonus, the LLM-generated summaries were also added to the document metadata in the DMS, further improving the discoverability of the archived documents.

By leveraging the power of LLMs, the team was able to clean up noisy OCR output, improving auto-tagging capabilities and further enriching document metadata with content descriptions. If your organization is facing similar challenges managing and archiving older or difficult-to-parse documents, consider how Enterprise Knowledge can assist in optimizing your content findability with advanced AI techniques.



Extracting Knowledge from Documents: Enabling Semantic Search for Pharmaceutical Research and Development https://enterprise-knowledge.com/extracting-knowledge-from-documents-enabling-semantic-search/ Mon, 03 Mar 2025 18:00:37 +0000

The post Extracting Knowledge from Documents: Enabling Semantic Search for Pharmaceutical Research and Development appeared first on Enterprise Knowledge.


The Challenge

A major pharmaceutical research and development company faced difficulty creating regulatory reports and files based on years of drug experimentation data. Their regulatory intelligence teams and drug development chemists spent dozens of hours searching through hundreds of thousands of documents to find past experiments and their results in order to fill out regulatory compliance documentation. The company’s internal search platform enabled users to look for documents, but required exact matches on specific keywords to surface relevant results, and lacked useful search filters. Additionally, due to the nature of chemistry and drug development, many documents were difficult to understand at a glance and required scientists to read through them in order to determine if they were relevant or not.

The Solution

EK collaborated with the company to improve their internal search platform by enhancing Electronic Lab Notebook (ELN) metadata, thereby increasing the searchability and findability of critical research documents, and created a strategy for leveraging ELNs in AI-powered services such as chatbots and LLM-generated document summaries. EK worked with business stakeholders to evaluate the most important information within ELNs and understand the document structure, then developed semantic models in their taxonomy management system with more than 960 relevant concepts designed to capture the way their expert chemists understand the experimental activities and molecules referenced in the ELNs. With the help of the client's technical infrastructure team, EK developed a new corpus analysis and ELN auto-tagging pipeline that leveraged the taxonomy management system's built-in document analyzer and integrated the results with their data warehouse and search schema. Through three rounds of testing, EK iteratively improved the extraction of metadata from ELNs using the concepts in the semantic model, providing additional metadata on over 30,000 ELNs to be leveraged within the search platform. EK also wireframed six new User Interface (UI) features and enhancements for the search platform designed to leverage the additional metadata provided by the auto-tagging pipeline, including search-as-you-type functionality and improved search filters, and socialized them with the client's UI/User Experience (UX) team. Finally, EK supported the client with strategic guidance for leveraging their internal LLM service to create accurate regulatory reports and AI summaries of ELNs within the search platform.

The EK Difference

EK leveraged its understanding of the capabilities and features of enterprise search platforms and taxonomy management systems to advise the organization on industry standards and best practices for managing its taxonomy and optimizing search with semantics. Furthermore, EK's experience developing semantic models with other pharmaceutical institutions and large organizations benefited the client by ensuring their semantic models were comprehensive and specifically tailored to meet their needs for the development of their semantic search platform and generative AI use cases. Throughout the engagement, EK incorporated an Agile project approach focused on iterative development and regular insight gathering from client stakeholders to quickly prototype enhancements to the auto-tagging pipeline, semantic models, and search platform that the client could present to internal stakeholders to gain buy-in for future expansion.

The Results

EK's expertise in knowledge extraction, semantic modeling, and implementation, along with a user-focused strategy that grounded improvements to the search platform in stakeholder needs, enabled EK to deliver a major update to the client's search experience. As a result of the engagement, the client's newly established auto-tagging pipeline is enhancing tens of thousands of critical research documents with much-needed additional metadata, enabling dynamic, context-aware searches and providing users of the search platform with at-a-glance insight into what information an ELN contains. The semantic models powering the upgraded search experience allow users to look for information using natural, familiar language by capturing synonyms and alternative spellings of common search terms, ensuring that users can find what they are looking for without having to run multiple searches. The planned enhancements to the search platform will save scientists at the company hours every week searching for information and judging whether specific ELNs are useful for their purposes, reducing reliance on individual employee knowledge and the need for the regulatory intelligence team to rediscover institutional knowledge. Furthermore, the company is equipped to move forward with leveraging the combined power of semantic models and AI to improve the speed and efficiency of document understanding and use. By utilizing the improved document metadata provided by the auto-tagging pipeline in conjunction with their internal LLM service, they will be able to generate factual document summaries in the search platform and automate the creation of regulatory reports in a secure, verifiable, and hallucination-free manner.

A Guide to Selecting the Right Auto-Tagging Approach https://enterprise-knowledge.com/a-guide-to-selecting-the-right-auto-tagging-approach/ Wed, 15 Jan 2025 14:25:09 +0000

The post A Guide to Selecting the Right Auto-Tagging Approach appeared first on Enterprise Knowledge.

Auto-tagging processes automate the manual labor of applying relevant keyword tags to data and content, enhancing accessibility and improving the organization of large datasets. Whether you're trying to improve how quickly you find data, embarking on a content cleanup effort, or setting up a knowledge management system across an entire organization, these tools streamline the process and save you time. In this guide, we'll explore the spectrum of auto-tagging options, from ready-made solutions to fully customized AI models, to help you find the perfect fit for your needs.

What is Auto-Tagging?

When managing large volumes of data, applying the right tags to clearly organize your information can feel like an endless task. Auto-tagging is an advanced process that automatically applies relevant terms or labels to your data. By identifying key information within a source and adding it as metadata, it makes organizing and finding data easy.

For organizations today, auto-tagging can streamline knowledge management systems, boost content and data personalization, and improve search engine optimization. Imagine automatically enabling customers to find exactly what they need from product manuals, categorizing thousands of marketing materials in just a few clicks, or categorizing research articles to help scientists locate relevant studies and data faster.

Auto-tagging isn't just about efficiency: it can boost revenue by speeding up team workflows, cutting down search time, and enhancing customer experiences. It ensures that the right data reaches the right people, making it a smart choice for better business outcomes.

Do You Have an Existing Taxonomy or Metadata Model?

First things first: auto-tagging processes require an existing framework (a taxonomy, a metadata model, or both) to work from. Instead of starting from scratch and attempting to label your data from a blank slate, the auto-tagging process uses your taxonomy or metadata model to tap into your established framework, efficiently organize information, and fine-tune keywords. If you don't have a taxonomy or metadata model in place, don't worry; it just means you'll need to start by building one. To get started, you can review EK's many resources on the topic: The 4 Steps to Designing an Effective Taxonomy, How Do I Implement A Taxonomy?, or Best Practices for Successful Metadata Governance. This is a crucial step to create a proper auto-tagging solution tailored to your unique data.

Auto-Tagging Approaches

When implementing auto-tagging, organizations can choose from a range of options that fall on a Spectrum of Customizability. On one end, you have Ready-Made Auto-Tagging Solutions, which are typically features housed within Taxonomy Management Systems (TMS) such as PoolParty, Semaphore, and TopBraid EDG. These plug-and-play tools offer quick setup, user-friendly features, and low-code/no-code interfaces, making them ideal for standard use cases. They also integrate easily if you are already using one of the tools listed above for taxonomy management, and they require minimal upkeep. However, they come with limitations: customization options are restricted to the system's current configuration capabilities, making them less suitable for unique business needs or specialized datasets like images or audio. Additionally, a TMS often involves recurring licensing costs and lacks transparency regarding model details, performance, and training methods. If these limitations conflict with your project requirements, consider alternative approaches such as Custom AI solutions, which are reviewed later in this blog.

On the other end of the spectrum are Custom AI-Enabled Auto-Tagging solutions, where models are trained specifically on your business's unique data for more specialized tagging needs. These solutions offer high levels of customization, greater accuracy for specific data types, and full transparency into how the model is built and performs. Once deployed, they avoid recurring costs beyond internal upkeep. The trade-off, however, is the requirement for significant technical expertise, high upfront costs, and a time-intensive data labeling and training process. These models also need regular updates and maintenance to ensure continued accuracy and performance.

While each approach has its benefits, no solution is entirely hands-off: all solutions on the Spectrum of Customizability require human input to ensure accuracy, relevance, and usability. The right choice depends on your organization's specific needs, data complexity, and desired level of control over the tagging process. The next few sections dive into the specifics of Ready-Made vs. Custom AI-Enabled Solutions and expand on the Spectrum of Customizability.


The Spectrum of Customizability

When it comes to auto-tagging, one size doesn't fit all. Different organizations have unique data, needs, and resources, which means the right tagging solution can vary significantly. This is where the concept of customizability comes into play. Whether you need a simple, off-the-shelf solution or a highly tailored model, auto-tagging options exist on a spectrum of customizability, balancing ease of use with the ability to fine-tune for specific needs.

In this section, we'll break down the levels of customizability, from pre-built solutions that are ready to go out of the box to fully custom AI models that give complete control over tag generation and data processing. By understanding this spectrum, you can better assess which solution might work best for your use case.

TMS (Low Customizability)

Description: These solutions are designed for rapid deployment, requiring minimal setup or configuration. However, you have little control over tag generation or model parameters. They’re best for simple use cases or highly standardized data.

Best For: Organizations that need a fast, simple tagging solution that works out of the box.

Hybrid (Moderate Customizability)

Description: These tools combine pre-built systems with the ability to integrate existing taxonomies, tweak configurations, or fine-tune pre-trained models using proprietary data. They balance ease of use with moderate customizability, enabling tailored tagging without full-scale development efforts.

Best For: Organizations that need adaptable solutions for specific data contexts, with flexibility to adjust terms, tagging logic, and visibility on model types.

Custom AI (High Customizability)

Description: These models are built from the ground up to meet highly specific tagging needs, starting with proprietary training data and leveraging traditional supervised learning techniques, offering complete control over the tagging logic.

Best For: Organizations with highly specialized tagging needs that require complete control over the model, and are willing to invest in dedicated resources for maintenance.

    Conclusion

    In summary, we’ve explored what auto-tagging is, how it can help your organization, and the auto-tagging Spectrum of Customizability, discussing benefits, challenges, and best use cases. If you’re unsure which method best fits your needs, or if you need help selecting a ready-made tool or developing an AI-enabled model, EK’s auto-tagging experts are here to assist. Don’t hesitate to reach out for personalized support!

    The post A Guide to Selecting the Right Auto-Tagging Approach appeared first on Enterprise Knowledge.

Expert Analysis: How Does My Organization Use Auto-tagging Effectively? Part Two https://enterprise-knowledge.com/expert-analysis-how-does-my-organization-use-auto-tagging-effectively-part-two/ Tue, 20 Dec 2022 15:47:58 +0000 https://enterprise-knowledge.com/?p=16913

    The post Expert Analysis: How Does My Organization Use Auto-tagging Effectively? Part Two appeared first on Enterprise Knowledge.

    In part one of this series, James Midkiff and Sara Duane, senior technical consultants at EK, discussed what auto-tagging is and when an organization should use it. This blog will discuss auto-tagging best practices to answer the foundational question, “How does my organization use auto-tagging effectively?”

    How do I improve my data (content or taxonomy) for auto-tagging? What are some auto-tagging best practices?

    James Midkiff

    There are two main ways that I would recommend improving your data for auto-tagging, content componentization and enriching your taxonomy for auto-tagging.

    Componentize the Content

    Break up larger content items into smaller consumable segments. There’s an art to segmenting content and, for larger content items, this could be as simple as breaking up the content for each entry in the table of contents. In most cases, each segment should describe a specific task or concept of information. A segment should answer a user’s question completely and avoid covering multiple topics. Once the content is broken down into components, auto-tag each component separately. This ensures that when producing search results or recommendations for users, specific sections of documents can be provided rather than requiring users to dig through the larger document for the section relevant to them. Auto-tagging the content components increases our understanding of the content because we can see the similarities and unique tags between the components.
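As a rough sketch of this idea (assuming Markdown-style headings mark the segment boundaries, which won’t hold for every content format), componentization can be as simple as splitting on headings so each segment can be tagged on its own:

```python
import re

def componentize(document):
    """Split a document into segments at Markdown-style headings.

    Each segment keeps its heading as a title plus its body text, so it
    can be auto-tagged (and later retrieved) independently.
    """
    parts = re.split(r"^(#+ .+)$", document, flags=re.MULTILINE)
    segments, current_title = [], "Introduction"
    for part in parts:
        part = part.strip()
        if not part:
            continue
        if part.startswith("#"):
            current_title = part.lstrip("# ")  # heading becomes the segment title
        else:
            segments.append({"title": current_title, "body": part})
    return segments

doc = """Overview of the fleet policy.

# Vehicle Maintenance
Vans and trucks must be serviced quarterly.

# Driver Safety
All drivers complete annual training."""

for segment in componentize(doc):
    print(segment["title"])
```

Each returned segment can then be passed to the auto-tagger individually, rather than tagging the full document as one unit.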

    Enriching your Taxonomy for Auto-tagging

Taxonomies should be evaluated for their use with auto-tagging, as they need to be expressive in ways that an auto-tagger can leverage (unlike the high-level, more intuitive taxonomies used for search and navigation). Below are a number of questions and expected outcomes that can lead you to revise the taxonomy.

    Question 1: Does the taxonomy have a lot of single-word labels, should we auto-tag all of them, and do we expect them to get tagged incorrectly?

    Expected Outcome: We want to ensure that single-word labels, e.g., “Investment”, are tagged in the correct context since a single-word label is easier to match in text than a multi-word label.

    Question 2: How does the taxonomy leverage acronyms and do they make it easier or harder for terms to be auto-tagged?

Expected Outcome: When acronyms are included in the preferred label of a term, e.g., “National Park Service (NPS)”, we want to make sure that synonyms are provided so that “National Park Service” and “NPS” can each be matched separately.
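To illustrate, a small hypothetical helper can derive both matchable labels from a preferred label of that shape (in practice, your taxonomy management tool would manage these synonyms for you):

```python
import re

def expand_acronym_label(pref_label):
    """Derive separately matchable labels from a preferred label that
    embeds an acronym in parentheses, e.g. "National Park Service (NPS)".
    """
    match = re.fullmatch(r"(.+?)\s*\((.+?)\)", pref_label)
    if not match:
        return [pref_label]  # no embedded acronym; label matches as-is
    full_name, acronym = match.groups()
    return [full_name, acronym]

print(expand_acronym_label("National Park Service (NPS)"))
# A label without an embedded acronym passes through unchanged:
print(expand_acronym_label("Investment"))
```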

    Question 3: Do the labels match how each taxonomy concept appears in the content?

Expected Outcome: Similar to question 2, we want to review the taxonomy concepts to make sure they contain the same labels or synonyms as expected in the tagged text. If two departments of an organization refer to a topic by different names, then both names should be captured in the taxonomy. This can be an issue when the taxonomy serves multiple purposes, e.g., organizing content for users as well as auto-tagging.

    The taxonomy should reflect the way that concepts are represented in the content and an evaluation helps align the taxonomy for auto-tagging purposes.

    Sara Duane

When considering auto-tagging best practices, there are two main “buckets” of auto-tagging to consider. The first is topic-based auto-tagging which, as described above, is best suited to content that is rich in subject matter. To improve your taxonomy for this type of auto-tagging, you should dedicate time to analyzing the content and ensuring that your taxonomy meets the level of term granularity within it. You must ensure that these terms are included as preferred labels or synonyms. For example, if your taxonomy contains a list of motor vehicles, such as automobiles and motorcycles, but the content is much more granular and refers only to specific brands, tags will not be effectively applied to the content. These brands may need to be integrated into the taxonomy to match the level of granularity of the content.

    The second “bucket” is rules-based auto-tagging. This refers to automatically applying tags to content based on rules that you establish ahead of time. These rules can be inheriting existing metadata on a piece of content or applying metadata to the content based upon where it is stored or who authored it. For example, it might be the case that all final reports are stored in a few folders. It can be established that any content item coming from those folders is given that content type tag. You will need to work with content SMEs to establish these rules and ensure the accuracy of the tags.
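As an illustrative sketch (the folder path, field names, and rules here are invented for the example), rules-based tagging boils down to checking each rule’s condition against a content item and applying the corresponding tag:

```python
# Hypothetical rules: storage location and author department drive tags.
RULES = [
    (lambda item: item["folder"].startswith("/projects/final-reports"),
     ("content_type", "Final Report")),
    (lambda item: item["author_dept"] == "Legal",
     ("department", "Legal")),
]

def apply_rules(item):
    """Apply each matching rule's (field, tag) pair to the item's metadata."""
    tags = dict(item.get("tags", {}))  # inherit any existing metadata
    for predicate, (field, value) in RULES:
        if predicate(item):
            tags[field] = value
    return tags

item = {"folder": "/projects/final-reports/2022", "author_dept": "Legal", "tags": {}}
print(apply_rules(item))
```

The rules themselves are exactly what you would establish with content SMEs ahead of time, and they can be reviewed and revised without retraining anything.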

    In some cases, rules-based auto-tagging will need to be extended with an algorithmic approach. You can develop a machine learning classification model that analyzes existing content and tags in order to predict the tags that should be applied to new content. This approach helps scale the rules-based auto-tagging approach as the amount of content grows.
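To make the idea concrete, here is a toy version of such a classifier using a simple Naive Bayes approach over word counts; a real implementation would use a proper machine learning library and far more training data than the two invented documents shown here:

```python
import math
from collections import Counter, defaultdict

class TinyTagClassifier:
    """A toy Naive Bayes tag predictor trained on already-tagged content."""

    def fit(self, docs, tags):
        self.tag_counts = Counter(tags)
        self.word_counts = defaultdict(Counter)
        for doc, tag in zip(docs, tags):
            self.word_counts[tag].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        words = doc.lower().split()
        best_tag, best_score = None, float("-inf")
        total_docs = sum(self.tag_counts.values())
        for tag, tag_count in self.tag_counts.items():
            # log P(tag) + sum of log P(word | tag), with add-one smoothing
            score = math.log(tag_count / total_docs)
            total_words = sum(self.word_counts[tag].values())
            for word in words:
                score += math.log(
                    (self.word_counts[tag][word] + 1)
                    / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag

clf = TinyTagClassifier().fit(
    ["quarterly earnings and revenue growth", "new hire onboarding checklist"],
    ["Finance", "HR"],
)
print(clf.predict("revenue fell this quarter"))
```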


    In either case, the most significant best practice for conducting auto-tagging is dedicating time to fine-tuning the process. You will need to conduct multiple rounds, tweaking the taxonomy and rules to best fit the content you are working with.

    How do I measure auto-tagging success?

    James Midkiff

    There are some mathematical ways to measure auto-tagging success such as precision, recall, and, a value that combines the two, an F-score. Given two sets of tags for a document, one from a human indexer and one from an auto-tagger,

    • Precision is the number of matches divided by the total number of auto-tagger tags,
    • Recall is the number of matches divided by the total number of human indexer tags, and
    • F-score is the harmonic mean of precision and recall.
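In code, these three measures can be computed directly from the two tag sets (the example tag sets below are invented for illustration):

```python
def tagging_scores(auto_tags, human_tags):
    """Compare an auto-tagger's tags against a human indexer's tags."""
    matches = len(auto_tags & human_tags)
    precision = matches / len(auto_tags) if auto_tags else 0.0
    recall = matches / len(human_tags) if human_tags else 0.0
    # Harmonic mean of precision and recall
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    return {"precision": precision, "recall": recall, "f_score": f_score}

human = {"dogs", "veterinary care", "pet adoption", "animal shelters"}
auto = {"dogs", "veterinary care", "cats"}
print(tagging_scores(auto, human))
```

Here the auto-tagger matched 2 of its 3 tags (precision 0.67) and found 2 of the 4 human tags (recall 0.5), giving an F-score of about 0.57.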

    Simply put, precision measures how often the auto-tagger was correct when it assigned a concept. For example, if the concept “dogs” was assigned by the auto-tagger to 100 documents, but a human indexer only assigned “dogs” to 75 of those documents, then the precision of the auto-tagger for “dogs” is 75%. However, this is an incomplete measure as it doesn’t consider how many documents were tagged by the auto-tagger. If the auto-tagger assigned “dogs” to one document that matched the human indexer, then it would have a precision of 100%, even if a human indexer still assigned that concept to 75 other documents – the auto-tagger never incorrectly added “dogs” when it shouldn’t have.

    Recall, on the other hand, measures how often the auto-tagger assigned a concept that the human indexer assigned. For example, if a human indexer assigns “dogs” to 100 documents and the auto-tagger only assigns it to 20 of those, “dogs” would have a recall of 20%. Again, this number alone isn’t that helpful – if the auto-tagger assigned “dogs” to every document it saw, it would have perfect recall (although the precision would likely be terrible).

As you can see, neither precision nor recall tells the full story on its own. That’s where the F-score comes in. The F-score balances precision and recall to give a single number that provides an overall idea of how similar the auto-tagger is to a human indexer.

    The F-score is a number between 0 and 1, where 0 is terrible and 1 is a perfect match to the human indexer. The number takes into account where tags match, where tags are missed by either party, and the total number of tags. If you have perfect precision but terrible recall (or vice versa), the F-score won’t be very good. This makes sure you don’t over-rely on precision or recall alone.

    We can use the precision, recall, and F-score values to help us iterate on the taxonomy and auto-tagging process to see how different changes impact the overall performance of the auto-tagger.

    Sara Duane

    EK has an established process for measuring auto-tagging success, although specific measures of success may change based on the client or use case. For a determination of success, oftentimes, a gold standard is key. This gold standard is a pre-determination of the best set of tags for a specific input, such as a content item, given the relevant taxonomies. This gold standard needs to be built or confirmed by SMEs who have expert knowledge of the content for the use case. This gold standard can also be used to train the auto-tagging ML algorithm. 

Organizations may also want to consider comparing the results of earlier rounds of auto-tagging to rounds that occur after fine-tuning to demonstrate the improvement in results that this process can bring. For example, EK completed auto-tagging with a US-based investment and insurance company, and following a couple of rounds of fine-tuning and auto-tagging, the taxonomy was applied to content with an accuracy of 86-99% depending on the metadata field. One of the metrics used to define the success of this result was an accuracy comparison to the earlier round of auto-tagging that occurred pre-fine-tuning.

[Image: a table comparing the tagging performance of three taxonomies across three content types. A sample comparison of auto-tagging accuracy per content type from an initial round to a second round.]

Additionally, the success of auto-tagging can also be measured by the coverage of tags across the taxonomy. When analyzing the tags that were applied to content, you should determine what categories of your taxonomy these tags originated from. Were all categories of the taxonomy auto-tagged? Were some categories auto-tagged more than others? Were there any categories that didn’t tag well at all? If there were areas of the taxonomy that did not receive as much coverage, you may need to do additional, more granular taxonomy design for the purposes of auto-tagging, focusing on terms that are prominent and important from the content. For example, in the image above, the vehicle type and department taxonomies were tagged less than the topic taxonomy. If I were expecting more pieces of content to be auto-tagged with a vehicle type, I would need to add additional, more granular terms to this taxonomy that are prevalent in the content.

    Conclusion

    There is no single way to conduct auto-tagging or measure its success. However, adhering to best practices, such as preparing your content and taxonomy for auto-tagging, will increase your organization’s level of success. When choosing a method to measure this success, ensure that the metrics meet your organization’s business and technical requirements. 

    Are you considering auto-tagging at your organization? Do you want to learn more about the process and how auto-tagging will perform for your use case? We can help! Contact us for more.


Identifying Security Risks Using Auto-Tagging and Text Analytics https://enterprise-knowledge.com/identifying-security-risks-using-auto-tagging-and-text-analytics/ Mon, 21 Nov 2022 21:19:47 +0000 https://enterprise-knowledge.com/?p=16859

    The post Identifying Security Risks Using Auto-Tagging and Text Analytics appeared first on Enterprise Knowledge.

    On Thursday, November 10, Joe Hilger and Sara Duane spoke at Text Analytics Forum about identifying secure and confidential information using auto-tagging. Information security continues to grow in importance in today’s society. We hear stories all of the time about hackers accessing private information from companies and government agencies. Every organization struggles with employees who store confidential information on insecure network drives or cloud drives. Hilger and Duane did a project with a federal research organization that used auto-tagging and text analytics to identify confidential information that needed to be moved to a secure location. During the presentation, we shared the approach we took to identify this information and how we made sure that the tagging and text analytics were accurate. Attendees learned best practices for designing a taxonomy for auto-tagging and tuning auto-tagging as well as ways to identify confidential information across the enterprise.


Expert Analysis: When should my organization use auto-tagging? Part One https://enterprise-knowledge.com/when-should-my-organization-use-auto-tagging/ Tue, 19 Jul 2022 13:52:03 +0000 https://enterprise-knowledge.com/?p=15706

    The post Expert Analysis: When should my organization use auto-tagging? Part One appeared first on Enterprise Knowledge.

    As EK works with our clients to design data models, including taxonomies and knowledge graphs, we often implement corresponding auto-tagging solutions to augment the organization and enrichment of unstructured content. In this blog series, two of our senior technical consultants, James Midkiff and Sara Duane, answer questions to define auto-tagging and help your organization understand when and how to leverage automated tagging and classification capabilities successfully. This is part one of a two-part expert analysis series that focuses on when and how to implement auto-tagging in your organization.

    What is Auto-tagging?

    James Midkiff

    Auto-tagging is the process of identifying key pieces of information inside of text and applying those key pieces as metadata to a document. This process usually includes a combination of named entity recognition (NER) and natural language processing (NLP) techniques that identify terms or phrases anywhere within the text. Auto-tagging leverages the following ideas to apply metadata to content:

    • Synonyms – the ability to recognize alternative ways of referring to the same concept and their associated acronyms
    • Disambiguation – the ability to discern between multiple concepts with the same label (i.e. does “ballast” refer to a ship’s ballast or those used when laying railroad tracks?)
    • Custom Rules – using “if-then” scenarios to check for patterns within the text and apply concepts as a result of those scenarios.

These applied terms and phrases give content and data authors and consumers a better understanding of the text by providing additional context clues.
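As a simplified illustration of the synonym idea (real auto-taggers layer NER, disambiguation, and custom rules on top of this), a dictionary-based matcher might look like the following, using a hypothetical two-concept taxonomy:

```python
import re

# Hypothetical taxonomy fragment: preferred label -> alternative labels.
TAXONOMY = {
    "National Park Service": ["NPS"],
    "Automobile": ["car", "motor vehicle"],
}

def dictionary_tag(text):
    """Return preferred labels whose label or any synonym appears in the text.

    Only whole-word, case-insensitive matching is done here; real tools
    add lemmatization, disambiguation, and custom rules.
    """
    tags = set()
    for pref_label, synonyms in TAXONOMY.items():
        for label in [pref_label, *synonyms]:
            if re.search(rf"\b{re.escape(label)}\b", text, flags=re.IGNORECASE):
                tags.add(pref_label)  # always apply the preferred label
                break
    return tags

print(dictionary_tag("The NPS maintains roads, so bring your car."))
```

Note that the synonym “car” and the acronym “NPS” both resolve back to their preferred labels, which is what makes the applied tags consistent regardless of how the content refers to a concept.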

    Sara Duane

    Auto-tagging is an advanced application of taxonomy in which terms are automatically applied to content as tags through text recognition, inheritance, or other automated means. These tags describe the content and thus can be useful for search and browsing, as they improve the findability of the content items users are looking for.

[Image: a sample piece of content with tags applied in department, country, and subject metadata fields, with the fields and values defined by a taxonomy.]

    For example, a company we recently worked with wanted to use auto-tagging to automatically apply tags to content based on topics to improve search and findability as well as identify or flag confidential information to be moved to a secure location.

    When should my organization use auto-tagging?

    James Midkiff

    Organizations should consider auto-tagging when they have a lot of text-based content that needs to be described and found. For example, one of our clients had a repository of call center requests and they couldn’t find existing content due to insufficient metadata. They implemented an auto-tagging pipeline to fill in the gaps, enabling call center personnel to quickly find relevant content when serving requests. Whether considering auto-tagging for a new system or a backlog of content that has been accumulating for the past 50 years, auto-tagging helps ensure that text is well described. By improving content metadata, we improve the ability to find and discover properly tagged content in downstream navigation, recommendation, and search interfaces.

    Manually tagging content can be costly, time-consuming, error-prone, and require the indexers to be familiar with an organization’s domain and taxonomy. An auto-tagging solution works around these issues by automating the application of metadata and taxonomy quickly while leveraging all the synonyms and acronyms that the organization uses to describe information, ultimately saving the organization money long-term.

    Sara Duane

    There are many reasons that organizations seek out the powers of auto-tagging. In general, many of these reasons revolve around the overall themes of improving search/findability and saving human effort and time.

    An organization may be looking for ways to improve the findability of content. As having accurate content tags helps to meet this goal, an organization may want to use auto-tagging to automatically apply a taxonomy and/or ontology to content items as tags. Oftentimes, this auto-tagging will work best when the content is both highly text-heavy and subject-oriented and the taxonomy has been aligned with the terminology within the content, providing an organization a way to automatically recognize “aboutness.”

It is also possible that organizations may already tag their content for findability, but consider auto-tagging for time-saving and consistency reasons. With auto-tagging, individuals do not need to manually tag content, giving them time back in their day for other tasks and also increasing the consistency with which tags are applied, since individual taggers inevitably introduce more subjectivity.

    How do I get started? How do I implement this into existing workflows?

    James Midkiff

There are a few iterative steps an organization can take to start auto-tagging content: (1) identify the content, (2) aggregate data that describes the content, and (3) test and determine the desired approach.

    Identify the Content

    Determine what systems and what content within those systems could benefit from auto-tagging. Benefits could be realized through improved content description and findability as well as time saved creating content.

    Aggregate or build the data that describes your content

    We want to ensure we tag all of the relevant and informative data about a piece of content. For larger pieces of content, the full text should be auto-tagged. However, it’s common to highlight critical concepts by tagging the title and abstract of the text separately. Make sure to workshop the fields and rules you want to apply when tagging content.

    Test it and determine the desired approach

    Workshop your approach. Don’t just slap an auto-tagging solution on your content and expect success. Iterate on what is tagged, how the tags are combined to describe the content, and what portions of the taxonomy can be auto-tagged and validated. Additionally, identify points in tagging workflows where individuals can review, update, and approve tags applied by the auto-tagger. Providing this supervised tagging approach, also referred to as a human-in-the-loop workflow, is key to a successful auto-tagging process as it allows you to confirm the auto-tagging approach and discover new tagging rules that could be applied. Integrate auto-tagging with existing systems and workflows or determine if a one-time migration covers your use case. Evaluate each decision as you work towards the best auto-tagging approach for your organization.

    Sara Duane

    To start the auto-tagging process, there are foundational steps an organization should take in the realms of content and taxonomy. As James mentioned, an organization should select the sampling of content that will be used for auto-tagging. Ideally, this content should be text-heavy and driven by specific use cases. Once the content is selected, taxonomies should be built or refined to represent the tags you’d like the content to receive. EK’s bottom-up analysis approach is key in this process, as it is important for the taxonomy to be designed around the content that will be tagged.

This taxonomy can then be implemented into a taxonomy management system with auto-tagging capabilities that then applies these tags to content via an API. Building the application of tags into an organization’s existing workflows is key for consistency and usability of the process, so the system can be configured to apply these tags upon the upload of the content. Typically, to start, EK recommends that an SME manually reviews the automatically applied tags to ensure alignment with the content and provide learning/feedback for the next iteration and fine-tuning of the auto-tagging algorithms.

    Conclusion

    In this Expert Analysis, we covered the value of auto-tagging and the initial steps for getting started at your organization. Part two of this blog will go a step further and address best practices for auto-tagging and measuring auto-tagging success.

    In the meantime, check out some of our case studies to see how organizations are leveraging auto-tagging:


    And, are you interested in learning more about auto-tagging for your organization? Contact us!


Designing Your Taxonomy to Fit Your Use Case: Advanced Taxonomy Use Cases (Part 2) https://enterprise-knowledge.com/designing-your-taxonomy-to-fit-your-use-case-advanced-taxonomy-use-cases-part-2/ Thu, 27 Jan 2022 15:00:39 +0000 https://enterprise-knowledge.com/?p=14246

    The post Designing Your Taxonomy to Fit Your Use Case: Advanced Taxonomy Use Cases (Part 2) appeared first on Enterprise Knowledge.

    In my previous blog, I wrote about building a taxonomy for two foundational use cases – findability and navigation – that we commonly design for at EK. This blog will focus on two advanced, yet still common, taxonomy use cases. Both of these more advanced use cases have more complex requirements and factors to consider during the design process, as I will dive into below.

    Auto-Tagging

5 Things to Keep in Mind for an Auto-Tagging Use Case:

• Remember your end user: a machine!
• Focus on a topical taxonomy to reflect “aboutness”
• Granularity is important
• Again, synonyms are your friend!
• Ensure taxonomy terms are reflective of what is in the content

    Developing a taxonomy for an auto-tagging use case brings us to the more advanced applications of taxonomy. Auto-tagging refers to the advanced application of taxonomy in which terms are automatically applied to content as tags through text recognition, inheritance, or other automated means. This process is important because, if implemented and iterated upon correctly, it can save SMEs or content taggers time as they will not have to manually apply tags to the content. For example, let’s say that you would like to automatically apply your enterprise topical taxonomy to each piece of content in your Knowledge Base through the use of an auto-tagging tool, such as a taxonomy management system. When you design a taxonomy for auto-tagging, your main consideration should be that the taxonomy is designed for a machine as the end user, instead of a human. For the taxonomy to be leveraged by an auto-tagging tool, there are very different design requirements.

    Current auto-tagging tools and capabilities are often limited by the text found within the content item being tagged. Whereas manual tagging performs best at determining the “aboutness” of content because of the subject matter expertise of human taggers, auto-tagging tools are not able to read between the lines or identify true “aboutness” like a human can. Auto-tagging parses through the content it is given and uses text recognition and context to determine the subject of the text and then applies tags based on terms from the taxonomy. Since determining the “aboutness” of content is often the biggest challenge of an auto-tagging tool, it is essential that the taxonomy is designed to best support the machine in finding key topics. Thus, a taxonomist should design a topical taxonomy for this use case. 

    The topical taxonomy should represent what the content is about in a way that reflects the subjects directly used in the text.  The taxonomy should match the granularity of the content, and get into the details of what is presented in the content. With a taxonomy management system, the more granular child concepts can trigger the broader tag of a parent concept, demonstrating the importance of ensuring a detailed taxonomy. If the auto-tagging tool recognized and determined that the content was about a “van,” for example, the parent concept, “automobile,” could also be correctly identified.
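A sketch of this broader-term rollup, using a small hypothetical vehicle taxonomy, might look like the following: a tag on a granular child concept also triggers every broader concept above it.

```python
# Hypothetical taxonomy: child concept -> broader (parent) concept.
BROADER = {
    "Van": "Automobile",
    "Sedan": "Automobile",
    "Automobile": "Motor Vehicle",
}

def with_broader_tags(tags):
    """Expand tags so each concept also triggers its broader concepts."""
    expanded = set(tags)
    for tag in tags:
        parent = BROADER.get(tag)
        while parent is not None:  # walk up the hierarchy to the root
            expanded.add(parent)
            parent = BROADER.get(parent)
    return expanded

print(with_broader_tags({"Van"}))
```

In a taxonomy management system this inheritance is usually handled through SKOS-style broader/narrower relationships rather than a flat dictionary, but the rollup behavior is the same.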

    A significant consideration when designing a taxonomy for auto-tagging is that, in order for the auto-tagging tool to best succeed, the taxonomy terms need to be explicitly mentioned in the text. Due to the granularity and consistency needed, bottom-up taxonomy design (as referenced in this blog), or analyzing the content itself when developing the taxonomy, is the most important part of the design process. The topics of your content items should help form the basis of your taxonomy. Analyze the content, determining the topic of each piece and consulting SMEs to accurately represent the language. Additionally, conduct a corpus analysis. A corpus analysis is an examination of the words that are both most commonly used and of most significance in a set of content items, and can be conducted by most taxonomy management tools on the market. You should seek to include many of the terms surfaced by corpus analysis in your taxonomy through terms and synonyms, as they reflect what is in the content itself.  
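As a rough stand-in for what a taxonomy management tool does during corpus analysis, the most common terms across a content set can be surfaced with simple term frequencies (the stopword list and sample documents here are invented for illustration; real tools also weight terms by significance):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "and", "of", "to", "is", "for", "in", "are"}

def corpus_terms(documents, top_n=5):
    """Return the most frequent non-stopword terms across a content set,
    as candidate taxonomy terms or synonyms."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_n)]

docs = [
    "The sedan and the van are serviced quarterly.",
    "Van maintenance schedules for the fleet.",
    "Fleet safety training for van drivers.",
]
print(corpus_terms(docs, top_n=3))
```

Terms that rank highly but are missing from the taxonomy (here, “van” and “fleet”) are good candidates to add as concepts or synonyms.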

And, I’ll bring it up again for this use case, too – don’t underestimate the power of alternative labels, or synonyms! Synonyms are your friends when designing a taxonomy to use for auto-tagging. The more relevant and accurate synonyms that are applied to taxonomy terms, the more likely the auto-tagging tool will be able to correctly parse through the text and recognize what the content is about. Of course, though, you should be careful to use synonyms correctly; for example, don’t use repetitive synonyms for multiple terms. Even if “vehicle” can oftentimes be synonymous with “car,” you should not include it as a synonym for “car” if “vehicle” is contained in the taxonomy as a broader term. This would be repetitive and ineffective.

    Ontology / Graph Applications

4 Things to Keep in Mind for an Ontology / Graph Applications Use Case:

• Frame your taxonomy around the information modeled in the graph
• Conduct analyses and source taxonomy terms from data/content that will be in the graph
• Leverage linked open data
• Keep ontology design in mind throughout

    The next advanced use case for a taxonomy is an ontology or graph application use case, such as, for example, a recommendation engine that suggests training courses to education professionals based on their profiles or a chatbot that allows employees of a consulting firm to request documents using natural language queries. For these use cases, modeling both information and relationships effectively is of the highest importance.

    When developing your taxonomy for an ontology/graph use case, you should begin the analysis and design stages by thinking about the main questions that will be asked of the ontology/graph application. If the application is going to be a customer-facing chatbot, you should think through the questions that a customer will frequently ask, and seek to frame your taxonomy around the topics of these questions. In this way, a topical taxonomy is essential for an ontology/graph use case, as the topical taxonomy encompasses what the content is about, and the ontology will tie the topics together through the relationships that connect them to each other, and to other key business concepts, like “customer” or “product.” 

    Similar to the auto-tagging use case, you should conduct various types of analyses to best source terms for your taxonomy. Two of these include content and corpus analyses, which will allow your taxonomy to reflect the content and data stored in the knowledge graph, thus ensuring that the graph application will better understand the content. If your graph application relates to search, you should also conduct a keyword analysis to determine the most popular terms users search for. These popularly used terms should also be included in your topical taxonomy. In this way, your taxonomy will be equipped to support users’ search habits.
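As a rough illustration of corpus analysis, the sketch below counts candidate terms across a hypothetical set of documents; the documents, stopword list, and thresholds are invented, and real term-sourcing would combine this with content and keyword analyses:

```python
from collections import Counter
import re

# Hypothetical corpus standing in for the content that will be in the graph
corpus = [
    "Customers ask about loan rates and loan terms.",
    "Loan applications require proof of income.",
    "Search logs show queries for mortgage rates.",
]

STOPWORDS = {"and", "of", "the", "for", "about", "show"}

def candidate_terms(docs, top_n=5):
    """Count non-stopword tokens across documents to surface candidate terms."""
    counts = Counter()
    for doc in docs:
        for token in re.findall(r"[a-z]+", doc.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return counts.most_common(top_n)

print(candidate_terms(corpus, top_n=3))
```

Frequently occurring terms such as “loan” would become candidates for the topical taxonomy, subject to taxonomist review.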

    Additionally, consider leveraging linked open data to reflect vocabulary and terms used in the industry that are relevant in your topical taxonomy too. In order for your ontology or graph application to be most useful, it is important that your topical taxonomy model not only your company’s information, but also similar information relevant to your industry so that your graph application can search and extend beyond your individual content items.

    Keeping ontology design in mind as you design your taxonomy is another opportunity to advance your taxonomy, as you can leverage the higher-level concepts in your taxonomy for ontology classes, and the more specific taxonomy terms and synonyms as attributes or relationships. For example, the higher-level taxonomy concept of “automobile” may become an ontology class, and “vehicle type” may become an attribute to incorporate the types of vehicles described in the child concepts of the “automobile” concept. One of my colleagues writes further about moving from a taxonomy to an ontology in the blog, “From Taxonomy to Ontology.”
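As a simplified illustration of this promotion from taxonomy to ontology, the sketch below (with invented concept names) turns a parent concept into a class and its child concepts into the values of an attribute:

```python
from dataclasses import dataclass, field

# Illustrative taxonomy fragment: parent concept -> child concepts
taxonomy = {
    "automobile": ["sedan", "truck", "SUV"],
}

@dataclass
class OntologyClass:
    name: str
    attributes: dict = field(default_factory=dict)

def promote(parent, children):
    """Promote a parent taxonomy concept to an ontology class; its child
    concepts become the permitted values of a 'vehicle_type' attribute."""
    return OntologyClass(name=parent.capitalize(),
                         attributes={"vehicle_type": children})

cls = promote("automobile", taxonomy["automobile"])
```

In a production setting this mapping would be expressed in an ontology language such as OWL rather than in application code, but the structural idea is the same.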

    Conclusion

    As is evident from the various design considerations for the above two advanced use cases for a taxonomy, and the two foundational use cases described in my last blog, determining your use case before starting to design a taxonomy is essential for the success of the project. As taxonomists, we must also consider and plan for the possibility that a taxonomy may have multiple use cases that overlap, or even contradict each other from a design perspective, ultimately affecting the overall complexity of the taxonomy design.

    Do you need help determining the use cases for a successful taxonomy design? Let us help. Contact us at info@enterprise-knowledge.com to get in touch.

    The post Designing Your Taxonomy to Fit Your Use Case: Advanced Taxonomy Use Cases (Part 2) appeared first on Enterprise Knowledge.

    The Phantom Data Problem: Finding and Managing Secure Content https://enterprise-knowledge.com/the-phantom-data-problem-finding-and-managing-secure-content/ Fri, 10 Sep 2021 13:39:20 +0000 https://enterprise-knowledge.com/?p=13609

    The post The Phantom Data Problem: Finding and Managing Secure Content appeared first on Enterprise Knowledge.

    Are you actually aware of the knowledge, content, and information you have housed on your network? Does your organization have content that should be secured so that not everyone can see it? Are you confident that all of the content that you should be securing is actually in a secure location? If someone hacked into your network, would you be worried about the information they could access?

    Every organization has content and information that needs to be treated as confidential. In some cases, it’s easy to know where this content is stored and to make sure that it is secure. In many other cases, this sensitive or confidential content is created and stored on shared drives or in insecure locations that employees could stumble upon or hackers could take advantage of. Especially in larger organizations that have been in operation for decades, sensitive content and data that have been left and forgotten in unsecured locations are a common, high-risk problem. We call this hidden and risky content ‘Phantom Data’ to express that it is often unknown or unseen and also has the strong potential to hurt your organization’s operations. Most organizations have a Phantom Data problem, and very few know how to solve it. We have helped a number of organizations address this problem, and I am going to share our approach so that others can be protected from the exposure of confidential information that could lead to fines, a loss of reputation, and/or potential lawsuits.

    We’ve consolidated our recommended approach to this problem into four steps. This approach offers better ways to defend against hackers, unwanted information loss, and unintended information disclosures.

    1. Identify a way to manage the unmanaged content.
    2. Implement software to identify Personally Identifiable Information (PII) and Personal Health Information (PHI).
    3. Implement an automated tagging solution to further identify secure information.
    4. Design ongoing content governance to ensure continued compliance.

    Manage Unmanaged Content

    Shared drives and other unmanaged data sources are the most common cause of the Phantom Data problem. If possible, organizations should have well-defined content management systems (document management, digital asset management, and web content management solutions) to store their information. These systems should be configured with a security model that is auditable and aligns with the company’s security policies.

    Typically we work with our clients to define a security model and an information architecture for their CMS tools, and then migrate content to the properly secured infrastructure. The security model needs to align with the identity and access management tools already in place. The information architecture should be defined in a way that makes information findable for staff across business departments/units, but also makes it very clear as to where secure content should be stored. Done properly, the CMS will be easy to use and your knowledge workers will find it easier to place secure content in the right place.

    In some cases, our clients need to store content in multiple locations and are unable to consolidate it onto a single platform. In these cases, we recommend a federated content management approach using a metadata store or content hub, a solution we have built for many of our clients. The hub stores the metadata and security information about each piece of content and points to the content in its source location. The image below shows how this works.

    Metadata hub

    Once the hub is in place, the business can now see which content needs security and ensure that the security of the source systems matches the required security identified in the hub.
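A hub record might conceptually look like the following sketch; the field names and security labels are illustrative, not drawn from a specific implementation:

```python
from dataclasses import dataclass

@dataclass
class HubRecord:
    """Metadata-hub entry: required security plus a pointer to the content
    in the source system where it actually lives."""
    content_id: str
    source_system: str        # e.g. "sharepoint", "file-share"
    source_uri: str           # pointer back to the content's location
    required_security: str    # e.g. "confidential", "public"

def needs_remediation(record, actual_security):
    """Flag content whose security in the source system does not match
    the security the hub says it requires."""
    return actual_security != record.required_security

rec = HubRecord("doc-001", "file-share",
                "//share/contracts/acme.docx", "confidential")
print(needs_remediation(rec, "public"))   # content is under-secured
```

Comparing the hub’s required security against the source system’s actual security is what lets the business spot under-secured content at scale.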

    Implement PII and PHI Software

    There are a number of security software solutions that are designed to scan content to identify PII and PHI information. These tools look at content to identify the following information:

    • Credit card and bank account information
    • Passport or driver’s license information
    • Names, DOBs, phone numbers
    • Email addresses
    • Medical conditions
    • Disabilities
    • Information about relatives

    These are powerful tools that are worth implementing as part of this solution set. They are focused on one important part of the Phantom Data issue, and can deliver a solution with out-of-the-box software. In addition, many of these tools already have pre-established connectors to common CMS tools.

    Once integrated, these tools provide a powerful alert function to the existence of PII and PHI information that should be stored in more secure locations.
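The general idea behind this kind of scanning can be shown with a few illustrative regular expressions; commercial PII/PHI tools rely on far more robust detection (checksum validation, contextual analysis, machine learning), so this is only a conceptual sketch:

```python
import re

# Illustrative patterns for a few common PII categories
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]){3}\d{4}\b"),
}

def scan_for_pii(text):
    """Return the PII categories that appear in the text."""
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(text)]

hits = scan_for_pii("Contact jane.doe@example.com or 555-867-5309.")
print(hits)
```

Each hit would feed the alert function described above, flagging content that should be moved to a more secure location.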

    Implement an Automated Tagging Solution

    Many organizations assume that a PII and PHI scanning tool will completely resolve the problem of finding and managing Phantom Data. Unfortunately, PII and PHI are only part of the problem. There is a lot of content that needs to be secured or controlled that does not have personal or health information in it. As an example, at EK we have content from clients that describes internal processes, which should not be shared. There is no personal information in it, but it still needs to be stored in a secure environment to protect our clients’ confidentiality. Our clients may also have customer or product information that needs to be secured. Taxonomies and auto-tagging solutions can help identify these files. 

    We work with our clients to develop taxonomies (controlled vocabularies) that can be used to identify content that needs to be secured. For example, we can create a taxonomy of client names to spot content about a specific client. We can also create a topical taxonomy that identifies the type of information in the document. Together, these two fields can help an administrator see content whose topic and text suggest that it should be secured.

    The steps to implement this tagging are as follows:

    1. Identify and procure a taxonomy management tool that supports auto-tagging.
    2. Develop one or more taxonomies that can be used to identify content that should be secured.
    3. Implement and tune auto-tagging (through the taxonomy management tool) to tag content.
    4. Review the tagging combinations that most likely suggest a need for security, and develop rules to notify administrators when these situations arise.
    5. Implement notifications to content/security administrators based on the content tags.

    Once the tagging solution is in place, your organization will have two complementary methods to automatically identify content and information that should be secured according to your data security policy.
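The rule-and-notification step above might be sketched as follows; the tag names and rules are invented for illustration, and in practice the rules would be authored against your actual taxonomy and data security policy:

```python
# Security rules: a description paired with a predicate over a content
# item's set of tags. Triggered rules generate administrator alerts.
SECURITY_RULES = [
    ("client deliverable",
     lambda tags: "client" in tags and "internal-process" in tags),
    ("financial record",
     lambda tags: "financials" in tags),
]

def review_alerts(content_tags):
    """Return the descriptions of all rules triggered by a content item."""
    return [desc for desc, rule in SECURITY_RULES if rule(content_tags)]

# A document auto-tagged with a client name and an internal-process topic
print(review_alerts({"client", "internal-process", "roadmap"}))
```

Combining a client-name taxonomy with a topical taxonomy in this way surfaces documents whose tags, taken together, suggest they need to be secured.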

    Design and Implement Content Governance

    The steps described above provide a great way to get started solving your Phantom Data problem. Each of these tools is designed to provide automated methods to alert users about this problem going forward. The solution will stagnate if a governance plan is not put in place to ensure that content is properly managed and the solution adapts over time.

    We typically help our clients develop a governance plan and framework that:

    • Identifies the roles and responsibilities of people managing content;
    • Provides auditable reports and metrics for monitoring compliance with security requirements; and
    • Provides processes for regularly testing, reviewing, and enhancing the tagging and alerting logic so that security is maintained even as content adapts.

    The governance plan gives our clients step-by-step instructions, showing how to ensure ongoing compliance with data protection policies to continually enhance the process over time.

    Beyond simply creating a governance plan, the key to success is to implement it in a way that is easy to follow and difficult to ignore. For instance, content governance roles and processes should be implemented as security privileges and workflows directly within your systems.

    In Summary

    If you work in a large organization with any sort of decentralized management of confidential information, you likely have a Phantom Data problem. Exposure of Phantom Data can cost organizations millions of dollars, not to mention the loss of reputation that organizations can suffer if the information security failure becomes public.

    If you are worried about your Phantom Data risks and are looking for an answer, please do not hesitate to reach out to us.

    Knowledge AI: Content Recommender and Chatbot Powered by Auto-Tagging and an Enterprise Knowledge Graph https://enterprise-knowledge.com/knowledge-ai-content-recommender-and-chatbot-powered-by-auto-tagging-and-an-enterprise-knowledge-graph/ Mon, 26 Apr 2021 13:00:00 +0000 https://enterprise-knowledge.com/?p=13047

    The post Knowledge AI: Content Recommender and Chatbot Powered by Auto-Tagging and an Enterprise Knowledge Graph appeared first on Enterprise Knowledge.


    The Challenge

    A global development bank needed a better way to disseminate information and in-house expertise to all of their staff to support the efficient completion of projects, while also providing employees with an intuitive knowledge sharing tool that is embedded in their daily process to mitigate rework and knowledge loss.

    Leadership recognized that their employees were unable to leverage the organization’s knowledge capital because it wasn’t easily findable. By categorizing and ingesting both the institutional knowledge contained in structured (web pages, databases, etc.) and unstructured (emails, PDFs, videos, etc.) content items and each individual’s area of expertise, the bank hoped to automatically assemble and proactively deliver targeted information to the appropriate individuals. Their goal, as summed up by the project sponsor, was: “We want knowledge to reach out to people!”

    The Solution

    To organize and typify the various categories of both the institution’s knowledge and that of its employees, EK enriched the bank’s business taxonomy and developed an ontology and a knowledge graph to create a semantic hub (colloquially referred to as “The Brain”) that collects organizational content, user context, and project activities. This solution uses AI to automatically deliver content to bank employees when and where they need it. The Brain was built on a graph database and a taxonomy management tool: content from around the organization is auto-tagged (using the taxonomy management tool) and collected within the graph database. Together, these two tools, in which the aggregated information is managed and stored, power a recommendation engine that delivers contextualized recommendations via email, suggesting (in the form of links or attachments) relevant articles and information in the following scenarios:

    • A user schedules a calendar event on a given topic, or 
    • New content is introduced to the system that matches a user’s pre-defined interests.

    Presently, the same strategy is being expanded to power a chatbot as part of the bank’s larger AI Strategy. These outputs are published to the bank’s website to help improve knowledge retention and to showcase the institution’s in-house expertise via Google recognition and search optimization for future reference.
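The two trigger scenarios can be sketched conceptually as follows; all names, topics, and data are illustrative, not taken from the bank’s actual system:

```python
def recommendations(user_interests, calendar_topics, new_content):
    """Return (title, reason) recommendations for a user.

    new_content: list of (title, tags) pairs, where tags is a set of
    auto-tagged topics from the knowledge graph.
    """
    recs = []
    for title, tags in new_content:
        # Scenario 1: content matching an upcoming calendar event's topic
        if tags & calendar_topics:
            recs.append((title, "upcoming meeting"))
        # Scenario 2: new content matching the user's pre-defined interests
        elif tags & user_interests:
            recs.append((title, "matches interests"))
    return recs

recs = recommendations(
    user_interests={"microfinance"},
    calendar_topics={"climate"},
    new_content=[("Climate Lending 101", {"climate", "lending"}),
                 ("Microfinance Trends", {"microfinance"})],
)
print(recs)
```

In the production system this matching is driven by the knowledge graph and recommendation engine rather than simple set intersection, but the trigger logic follows the same pattern.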

    The EK Difference

    Leveraging our vast experience with taxonomy/ontology design and semantic technologies, we helped the bank model their domain through a series of workshops and stakeholder interviews. Once the domain was in place, we applied our expertise in Solutions Architecture and Big Data orchestration to develop an application that quickly and efficiently loads and tags content from multiple sources into a single repository – a Knowledge Graph – used to provide recommendations to bank staff.

    We specifically applied our core competency in analysis, design, implementation, operations, and maintenance of information management systems and technical platforms for managing subject expert knowledge and topical information to ensure the bank had a solution that met their specific needs. Throughout the entire process, EK went beyond technical implementation, engaging with business users to ensure we were designing interfaces, workflows, security models, content cleanup practices, classification procedures, and governance guidelines to inform and define the long-term adoption and sustainability of the system.

    EK further employed our data science and engineering experience to iteratively enable knowledge-oriented AI to train the recommendation algorithm and upstream applications to consume and “understand” the bank’s data in a manner similar to which their staff understands and uses it.

    The Results

    In addition to connecting people to information, the tool is providing timely content recommendations on three different web applications and in advance of important meetings, as well as via a Chatbot service.

    Using knowledge graphs based on this linked data strategy enabled the bank to connect all of their knowledge assets in a meaningful way to:

    • Increase the relevancy and personalization of the search experience;
    • Enable employees to discover content across unstructured content types, such as webinars, classes, or other learning materials based on factors like location, interest, role, seniority level, etc.; and
    • Further facilitate connections between people who share similar interests, expertise, or location.
