Ben Kass, Author at Enterprise Knowledge

Cutting Through the Noise: An Introduction to RDF & LPG Graphs

Ben Kass — Wed, 09 Apr 2025 18:29:40 +0000

Graph is good. From capturing business understanding to support standardization and data analytics to informing more accurate LLM results through Graph-RAG, knowledge graphs are an important component of how modern businesses translate data and content into actionable knowledge and information. For individuals and organizations that are beginning their journey with graph, two of the most puzzling abbreviations that they will encounter early on are RDF and LPG. What are these two acronyms, what are their strengths and weaknesses, and what does this mean for you? Follow along as this article walks through RDF and LPG, touching on these and other common questions.

Definitions

RDF

To paraphrase from our deep dive on RDF, the Resource Description Framework (RDF) is a semantic web standard used to describe and model information. RDF consists of “triples,” or statements, with a subject, predicate, and object that resemble an English sentence; RDF data is then stored in what are known as “triple-store graph databases”. RDF is a W3C standard for representing information, with common serializations, and is the foundation for a mature framework of related standards such as RDFS and OWL that are used in ontology and knowledge graph development. RDF and its related standards are queried using SPARQL, a W3C recommended RDF query language that uses pattern matching to identify and return graph information.

LPG

A Labeled Property Graph (LPG) is a data model for graph databases that represents data as nodes and edges in a directed graph. Within an LPG, nodes and edges have associated properties such as labels that are modeled as single value key-value pairs. There are no native or centralized standards for the creation of LPGs; however, the Graph Query Language (GQL), an ISO standardized query language released in April 2024, is designed to serve as a standardized query template for LPGs. Because GQL is a relatively recent standard, it is not yet adopted by all LPG databases.

What does this mean? How are they different?

There are a number of differences between RDF graphs and LPGs, some of which we will get into. At their core, though, the differences between RDF and LPG stem from different approaches to information capture.

RDF and its associated standards put a premium on defining a conceptual model, applying this conceptual model to data, and inferring new information using category theory and first order logic. They are closely tied to standards for taxonomies and linked data philosophies of data reuse and connection.

LPGs, by contrast, are not model-driven, and instead are more concerned with capturing data rather than applying a schema over it. There is less of a focus on philosophical underpinnings and shared standards, and more importance given to the ability to traverse and mathematically analyze nodes in the graph.

Specific Benefits & Drawbacks of Each

RDF

Pluses

Minuses

Self-Describing: RDF describes both data and the data model in the same graph
Data Validation: RDF can validate data and data models using SHACL, a W3C standard
Expressivity: RDF and its larger semantic family is well suited to capturing the logical underpinnings and human understanding of a subject area.
Flexible Modeling: RDF was originally designed for web use cases in which multiple data schemas / sources of truth are aggregated together. Due to this flexibility, RDF is useful in aligning schemas and querying across heterogeneous / different datasets, as well as metadata management and master data management
Global Identifiers: Entities in the graph are assigned (resolvable) URIs. This has enabled the creation of open source models for both foundational concepts such as provenance and time, as well as domain specific models in complex subject areas like Process Chemistry and Finance that can be utilized and reused.
Standardization: Wide standard implementation enables simple switching between vendor solutions
Native Reasoning: OWL is another W3C standard built on RDF that enables logical reasoning over the graph using category theory

High Cognitive Load: Due to the mathematical and philosophical underpinnings it can take more time to come up to speed on how to model in RDF and OWL
Complexity of OWL Implementations: There are a number of different standards for how to implement OWL reasoning, and it is not always clear even to some experienced modelers which should be used when
N-ary Structures: RDF cannot model many-to-many relationships. Instead, intermediary structures are required, which can increase the verbosity of the graph.
Property Relations: Relationships cannot be added to existing properties in base RDF, restricting the kinds of statements that can be made. An RDF standard to extend this functionality, RDF*, is available in some triple-stores but is still under development and not consistently offered by vendors.

LPG

Pluses

Minuses

Efficient Storage: LPGs are generally more performant with large datasets, and frequently updated data compared to RDF
Graph Traversal: LPGs were designed for graph traversal to facilitate clustering, centrality, shortest path, and other common graph algorithms to perform deep data analysis.
Analytics Libraries: There are a number of open source machine learning and graph algorithm libraries available for use with LPGs.
Developer-Friendly: LPGs are often a first choice for developers since LPGs’ data-first design and query languages more closely align to preexisting SQL expertise.
Property Relations: LPGs support the ability to attach relationships on properties natively.

No Formal Schema: There is not a formal mechanism for enforcing a data schema on an LPG. Without a validation mechanism to ensure adherence to a model, the translation of data into entities and connections can become fuzzy and difficult to verify for correctness.
Vendor Lock-In: Tooling is often proprietary, and switching between LPG databases is difficult and inflexible due to the lack of a common serialization and proliferation of proprietary languages.
Lack of Reasoning: There are no native reasoning capabilities for logical inferences based on class inheritance, transitive properties, and other common logical expressions, although some tools have plug-ins to enable basic inference.

Common Questions

Which do I use for a knowledge graph?

Although some organizations define knowledge graphs as being built upon RDF triple stores, you can use either RDF or LPG to develop a knowledge graph so long as you apply and enforce adherence to a knowledge model and schema over your LPG. Managing and applying a knowledge model is easier within RDF, so it is often the first choice for knowledge graphs, but it is still doable with LPGs. For example, in his book Semantic Modeling for Data, Panos Alexopoulos references using Neo4j, an LPG vendor, to represent and store a knowledge graph.

Is it easier to use an LPG?

LPGs have a reputation for being easier to use because they do not require you to begin by developing a model, unlike RDF, allowing users to quickly get started and stand up a graph. This does not necessarily translate to LPGs being easier to use over time, however. Modeling up front helps to solve data governance questions that will come up later as a graph scales. Ultimately, data governance and the need for a graph to reflect a unified view of the world, regardless of format, mean that the work which happens to model up-front in RDF also ends up happening over the lifetime of an LPG.

Which do I need to support an LLM with RAG?

Graph-RAG is a design framework that supports an LLM by utilizing both vector embeddings and a knowledge graph. Either an LPG or an RDF graph can be used to power Graph-RAG. Semantic RAG is a more contextually aware variant that uses a small amount of locally stored vector embeddings and an RDF data graph with an RDF ontology for its semantic inference capabilities.

Do I have to choose between RDF and LPG when creating a graph?

It depends. We have seen larger enterprises embrace both in instances where they want to take advantage of the pros of each. For example, utilizing an RDF graph for data aggregation across sources, and then pulling the data from the RDF graph into an LPG for data analysis. However, if you are within a single graph database tool/application, you will be required to choose which standard you want to use. Although there are graph databases that allow you to store either RDF or LPG, such as Amazon Neptune, these databases lock you into RDF or LPG once you select a standard to use for storage. Neptune does allow users to query over data using both SPARQL and property graph query languages, which bridges some of the gaps in RDF and LPG functionality. As of the time of writing, however, Neptune is less feature rich for RDF and LPG data management than comparable purely RDF or purely LPG databases such as GraphDB and Neo4J.

Can I use both?

You can use RDF and LPGs together, but there are manageability concerns when doing so. Because there are no formal semantic standards for LPGs in the same way as there are for RDF, it is generally destructive to move data from an LPG into an RDF graph. Instead, the RDF graph should be used as a source of logical reasoning information using constructs like class inheritance. Smaller portions of the RDF graph, called subgraphs, can then be exported to the LPG for use with graph-based ML and traversal-based algorithms. Below is a sample architecture that utilizes both RDF and LPG for entity resolution:

Which should I choose if I want to use programming languages like Python and Java?

Both RDF and LPG ecosystems offer robust support for both Java and Python, each with mature libraries and dedicated APIs tailored to their respective data models. For RDF, Java developers can leverage tools like RDF4J, which provides comprehensive support for constructing, querying (via SPARQL), and reasoning over RDF datasets, while Python developers benefit from RDFlib’s simplicity in parsing, serializing, and querying RDF data. In contrast, LPG databases such as Neo4j deliver specialized libraries—Neo4j’s native Java API and Python drivers like Py2neo or the official Neo4j Python driver—that excel at handling graph traversals, pattern matching, and executing graph algorithms. Additionally, these LPG tools often integrate with popular frameworks (e.g., Spring Data for Java or NetworkX for Python), enabling more sophisticated data analytics and machine learning workflows.

How should I choose between RDF and LPG?

How are you answering business use cases with the graph? What kind of queries will you be asking/running? That will determine which graph format best fits your needs. Regardless of model or standard, when defining a graph the first thing to do is to determine personas, use cases, requirements, and competency questions. Once you have these, particularly requirements and competency questions, you can determine which graph form best fits your use case(s). To help clarify this, we have a list of use case-based rules of thumb.

Use Case Rules of Thumb

Conclusion

Both RDF and LPGs have relative strengths and weaknesses, as well as preferred use cases. LPGs are suited for big data analytics and graph analysis, while RDF are more useful for data aggregation and categorization. Ultimately, you can build a knowledge graph and semantic layer with either, but how you manage it and what it can do will be different for each. If you have more questions on RDF and LPG, reach out to EK with questions and we will be happy to provide additional guidance.

The post Cutting Through the Noise: An Introduction to RDF & LPG Graphs appeared first on Enterprise Knowledge.

What is Semantics and Why Does it Matter?

Ben Kass — Thu, 20 Mar 2025 15:05:34 +0000

This white paper will unpack what semantics is, and walk through the benefits of a semantic approach to your organization’s data across search, usability, and standardization. As a knowledge and information management consultancy, EK works closely with clients to help them reorganize and transform their organization’s knowledge structure and culture. One habit that we’ve noticed in working with clients is a hesitancy on their part to engage with the meaning and semantics of their data, summed up in the question “Why semantics?” This can come from a few places:

An unfamiliarity with the concept;
The fear that semantics is too advanced for a lower data-maturity organization; or
The assumption that problems of semantics can be engineered away with the right codebase

These are all reasons we’ve seen for semantic hesitancy. And to be fair, between the semantic layer, semantic web, semantic search, and other applications, it can be easy to lose track of what semantics means and what the benefits are.

What is Semantics?

The term semantics originally comes from philosophy, where it refers to the study of how we construct and transmit meaning through concepts and language. This might sound daunting, but the semantics we refer to when looking at data is a much more limited application.

Data semantics looks at what data is meant to represent – the meaning and information contained within the data – as well as our ability to encode and interpret that meaning.

Data semantics can cover the context in which the data was produced, what the data is referring to, and any information needed to understand and make use of the data.

To better understand what this looks like, let’s look at an imaginary example of tabular data for a veterinary clinic tracking visits:

Name	Animal	Breed	Sex	Date	Reason for Visit	Notes
Katara	Cat	American Shorthair	F	11/22/23	Checkup
Grayson	Rabbit	English Lop	M	10/13/23	Yearly Vaccination
Abby	Dog	German Shorthaired Pointer	F	9/28/23	Appointment	Urinary problems

My Awesome Vet Clinic

Above, we have the table of sample data for our imaginary veterinary clinic. Within this table, we can tell what a piece of data is meant to represent by its row and column placement. Looking at the second row of the first column, we can see that the string “Katara” refers to a name because it sits under the column header “Name”. If we look at the cell to the right of that one, we can see that Katara is in fact the name of a Cat. Continuing along the row to the right tells us the breed of cat, the date of the visit, and the reason that Katara’s owners have taken her in today.

The real-life Katara, watching the author as he typed the first draft of this white paper

While the semantics of our table might seem basic compared to more advanced applications and data formats, it is important for being able to understand and make use of the data. This leads into my first point:

You are Already Using Semantics

Whether you have a formal semantics program or not, your organization is already engaging with the semantics of data as a daily activity. Because semantics is so often mentioned as a component of advanced data applications and solutioning, people sometimes wrongly assume that enhancing and improving the semantics of data can only be a high-maturity activity. One client at a lower-maturity organization brought this up directly, saying “Semantics is the balcony of a house. Right now what I need is the foundation.” What they were missing, and what we showed them through our time with this client, is that understanding and improving the semantics of your data is a foundational activity. From how tables are laid out, to naming conventions, to the lists of terms that appear in cells and drop-downs, semantics is inextricably linked to how we use and understand data.

Achieving Five-Star Semantic Data

To prevent misunderstandings, we need to improve our data’s semantic expressiveness. Let’s look at the veterinary clinic data sample again. Earlier, we assumed that “Name” refers to the name of the animal being brought in, but suppose someone unfamiliar with the table’s setup is given the data to use. If the clinic’s billing needs to make a call, will they realize that “Katara” refers to the name of the cat and not the cat’s owner, or will they make an embarrassing mistake? When evaluating the semantics of data, I like to reference Panos Alexopoulos’s book Semantic Modeling for Data. Here, Panos defines semantic data modeling as creating representations of data such that the meaning is “explicit, accurate, and commonly understood by both humans and computer systems.” Each of these is an important component of ensuring that the semantics of our data support use, growth over time, and analysis.

Explicit

Data carries meaning. Often the meaning of data is understood implicitly by the people who are close to the production of the data and involved in the creation of datasets. Because they already know what the data is, they might not feel the need to describe explicitly what the data is, how it was put together, and what the definition of different terms are. Unfortunately, this can lead to some common issues once the data is passed on to be used by other people:

Misunderstanding what the data can be used for
Misunderstanding what the data describes
Misunderstanding what data elements mean

When we look at the initial tabular example, we know that Katara is a cat because of the table’s structure. If we were to take the concept of “Katara” outside of the table, though, we would lose that information–”Katara” would just be a string, without any guidance as to whether that string refers to Katara the cat, Katara the fictional character, or any other Kataras that may exist.

To handle the issue of understanding what the data can be used for, we want to capture how the data was produced, and what it is meant to be used for, explicitly for consumers. Links between source and target data sets should also be called out explicitly to facilitate use and understanding, instead of being left to the reader to assume.

What the data describes can be captured by modeling the most important things (or entities) that the data is describing. Looking at our veterinary clinic data, let’s pull out these entities, their information, and the links between them:

A sample conceptual model for the veterinary clinic information, adding in some additional information such as phone number and address

We now have the beginnings of a conceptual model. This model is an abstraction that identifies the “things” behind the data–the conceptual entities that the information within the cells is referring to. Because the model makes relationships between entities explicit, it helps people new to the data understand the inherent structure. This makes it easier to join or map new datasets to the dataset that we modeled.

Finally, to capture what data elements mean, we can make use of a data dictionary. A data dictionary contains additional metadata about data elements, such as their definition, standardized name, and attributes. Using a data dictionary, we can see what allowable values are for the field “Animal”, for instance, or how the definitions of an “appointment” vs a “checkup” differ.

Accurate

Data should be able to vouch for its own accuracy to promote trust and usage. It is also important for those accuracy checks to be human-readable as well as something that can be used and understood by machines. It might seem obvious at first glance that we want data to be accurate. Less obvious is how we can achieve accuracy. To ensure our data is accurate, we should define what accuracy looks like for our data. This can be formatting information: dates should be encoded as YYYY-MM-DD following the ISO 8601 Standard rather than Month/Day/Year, for example. It can also take the form of a regular expression that ensures that phone numbers are 10 digits with a valid North American area code. Having accuracy information captured as a part of your data’s semantics works both to ensure that data is correct at the source, and that poor, inaccurate data is not mixed into the dataset down the line. As the saying goes, “Garbage in, garbage out.”

Machine Readable

Looking back to our conceptual diagram, it has a clear limitation. Human users can use the model to understand how entities in the data link together, but there isn’t any machine readability. With a well-defined machine-readable model, programs would be able to know that visits are always associated with one animal, and that animals must have one or more owners. That knowledge could then be used programmatically to verify when our data is accurate or inaccurate. This is the benefit of machine-readable semantics, and it is something that we want to enable across all aspects of our data. One way of encoding data semantics to be readable by humans and machines is to use a semantic knowledge graph. A semantic knowledge graph captures data, models, and associated semantic metadata in a graph structure that can be browsed by humans or queried programmatically. It fulfills the explicit semantics and accuracy requirements listed above in order to promote the usability and reliability of data.

Example: Solving for Semantics with a Knowledge Graph

To demonstrate what good semantic data can do, let’s take our data and put it into a simple semantic knowledge graph:

Sample knowledge graph based on our veterinary clinic data

Within this graph, we have made our semantics explicit by defining not just the data but also the model that our data follows. The graph also captures the data concept information that we would want to find in a data dictionary. If we want to know more about any part of this model – for example, what the relationship “hasBreed” refers to – we can navigate to that part of the model and find out more information:

The definition of the relationship “hasBreed” within the model

Within the graph model, we’ve captured the taxonomies that can be used to fill out information on an animal’s Type and Breed as well as cardinality of relationships to ensure that the data and its use remains accurate. And, because we are using a knowledge graph, all of this information is machine readable, allowing us to do things like query the graph. Going back to the first example, we can ask the graph for the name of Katara’s Owner vs Katara’s Name to receive the contextually correct response (see the sample SPARQL query below).

SELECT ?PetOwnerName ?PetName

WHERE {

    ?PetOwner hasPet ?Pet .

    ?PetOwner schema:name ?PetOwnerName .

    ?Pet schema:name ?PetName .

}

Rather than having to guess at the meaning of different cells in a table, our three components of good semantics ensure that we can understand and make sense of the data.

?PetOwnerName	?PetName
Ben	Katara
Shay	Grayson
Michael	Abby

Example csv result based on the above SPARQL Query

Conclusion

This article has walked through how semantics is a core part of being able to understand and make use of an organization’s data. For any group working with data at scale, where there is a degree of separation between data producers and data consumers for a given dataset, having clear and well-documented semantics is crucial. Without good semantics, many of the fundamental uses of data run into roadblocks and difficulties. Ultimately, the important question to ask is not “Why semantics?”, rather “Where does semantics fit into a data strategy?” At Enterprise Knowledge, we can work with you to develop an enterprise data strategy that takes your needs into account. Contact us to learn more.

The post What is Semantics and Why Does it Matter? appeared first on Enterprise Knowledge.

AI & Taxonomy: the Good and the Bad

Ben Kass — Tue, 04 Mar 2025 18:18:49 +0000

The recent popularity of new machine learning (ML) and artificial intelligence (AI) applications has disrupted a great deal of traditional data and knowledge management understanding and tooling. At EK, we have worked with a number of clients who have questions–how can these AI tools help with our taxonomy development and implementation efforts? As a rapidly developing area, there is still more to be discovered in terms of how these applications and agents can be used as tools. However, from our own experience, experiments, and work with AI-literate clients, we have noticed alignment on a number of benefits, as well as a few persistent pitfalls. This article will walk you through where AI and ML can be used effectively for taxonomy work, and where it can lead to limitations and challenges. Ultimately, AI and ML should be used as an additional tool in a taxonomist’s toolbelt, rather than as a replacement for human understanding and decision-making.

Pluses

Taxonomy Component Generation

One area of AI integration that Taxonomy Management System (TMS) vendors quickly aligned on is the usefulness of LLMs (Large Language Models) and ML for assisting in the creation of taxonomy components like Alternative Labels, Child Concepts, and Definitions. Using AI to create a list of potential labels or sub-terms that can quickly be added or discarded is a great productivity aid. Content generation is especially powerful when it comes to generating definitions for a taxonomy. Using AI, you can draft hundreds of definitions for a taxonomy at a time, which can then be reviewed, updated, and approved. This is an immensely useful time-saver for taxonomists–especially those that are working solo within a larger organization. By giving an LLM instructions on how to construct definitions, you can avoid creating bad definitions that restate the term being defined (for example, Customer Satisfaction: def. When the customer is satisfied.), and save time that would be spent by the taxonomist looking up definitions individually. I also like using LLMs to help suggest labels for categories when I am struggling to find a descriptive term that isn’t a phrase or jargon.

Mapping Between Vocabularies

Some taxonomists may already be familiar with this use case; I first encountered this five years ago back in 2020. LLMs, as well as applications that can do semantic embedding and similarity analysis, are great for doing an initial pass at cross-mapping between vocabularies. Especially for application taxonomies that ingest a lot of already-tagged content/data from different sources, this can cut down on the time spent reviewing hundreds of terms across multiple taxonomies for potential mappings. One example of this is Learning Management Systems, or LMSs. Large LMSs typically license learning content from a number of different educational vendors. In order to present users with a unified discovery and search experience, the topic categories, audiences, and experience levels that vendors assign to their learning content will need to be mapped to the LMS’s own taxonomies to ensure consistent tagging for findability.

Document Processing and Summarization

One helpful content use case for LLMs is their ability to summarize existing content and text, rather than creating new text from scratch. Using an LLM to create content summaries and abstracts can be a useful input for automatic tagging of longer, technical documents. This should not be the only input for auto-tagging, since hallucinations may lead to missed tags, but when tagged alongside the document text, we have seen improved tagging performance.

Topic Modeling and Classification

The components that make up BERTopic framework for topic modeling. Within each category, components are interchangeable. Image of BERTopic components reprinted with permission from https://maartengr.github.io/BERTopic/algorithm/algorithm.html

Most taxonomists are familiar with using NLP (Natural Language Processing) tools to perform corpus analysis, or the automated identification of potential taxonomy terms from a set of documents. Often taxonomists use either standalone applications or TMS modules to investigate word frequency, compound phrases, and overall relevancy of terms. These tools serve an important part of taxonomy development and validation processes, and we recommend using a TMS to handle NLP analysis and tagging of documents at scale.

BERTopic is an innovative topic modeling approach that is remarkably flexible in handling various information formats and can identify hierarchical relationships with adjustable levels of detail. BERTopic uses document embedding and clustering to add additional layers of analysis and processing to the traditional NLP approach of term frequency-inverse document frequency, or TF-IDF, and can incorporate LLMs to generate topic labels and summaries for topics. For organizations with a well-developed semantic model, the BERTopic technique can also be used for supervised classification, sentiment analysis, and topic tagging. Topic modeling is a useful tool for providing another dimension with which taxonomists can review documents, and demonstrates how LLMs and ML frameworks can be used for analysis and classification.

Pitfalls

Taxonomy Management

One of the most desired features that we have heard from clients is the use of an Agentic AI to handle long-term updates to and expansion of a taxonomy. Despite the desire for a magic bullet that will allow an organization to scale up their taxonomy use without additional headcount, to date there has been no ML or AI application or framework that can replace human decision making in this sphere. As the following pitfalls will show, decision making for taxonomy management still requires human judgement to determine whether management decisions are appropriate, aligned to organizational understanding and business objectives, support taxonomy scaling, and more.

Human Expertise and Contextual Understanding

Taxonomy management requires discussions with experts in a subject area and the explicit capture of their information categories. Many organizations struggle with knowledge capture, especially for tacit knowledge gained through experience. Taxonomies that are designed with only document inputs will fail to capture this important implicit information and language, which can lead to issues in utilization and adoption.

These taxonomies may struggle to handle instances where common terms are used differently in a business context, along with terms where the definition is ambiguous. For example, “Product” at an organization may refer not only to purchasable goods and services, but also to internal data products & APIs, or even not-for-profit offerings like educational materials and research. And within a single taxonomy term, such as “Data Product”, there may be competing ideas of scope and definition across the organization that need to be brought into alignment for it to be accurately used.

Content Quality and Bias

AI taxonomy tools are dependent on the quality of content used for training them. Content cleanup and management is a difficult task, and unfortunately many businesses lag behind in both capturing up-to-date information, and deprecating or removing out-of-date information. This can lead to taxonomies that are out of date with modern understanding of a field. Additionally, if the documents used have a bias towards a particular audience, stakeholder group, or view of a topic then the taxonomy terms and definitions suggested by the AI reflect that bias, even if that audience, stakeholder group, or view is not aligned with your organization. I’ve seen this problem come up when trying to use press releases and news to generate taxonomies – the results are too generic, vague, and public rather than expert oriented to be of much use.

Governance Processes and Decision Making

Similar to the pitfalls of using AI for taxonomy management, governance and decision making are another area where human judgement is required to ensure that taxonomies are aligned to an organization’s initiatives and strategic direction. Choosing whether undertagged terms should be sunsetted or updated, responding to changes in how words are used, and identifying new domain areas for taxonomy expansion are all tasks that require conversation with content owners and users, alongside careful consideration of consequences. As a result, ultimate taxonomy ownership and responsibility should lie with trained taxonomists or subject matter experts.

AI Scalability

There are two challenges to using AI alongside taxonomies. The first challenge is the shortage of individuals with the specialized expertise required to scale AI initiatives from pilot projects to full implementations. In today’s fast-evolving landscape, organizations often struggle to find taxonomists or semantic engineers who can bridge deep domain knowledge with advanced machine learning skills. Addressing this gap can take two main approaches. Upskilling existing teams is a viable strategy—it is cost-effective and builds long-term internal capability, though it typically requires significant time investment and may slow progress in the short term. Alternatively, partnering with external experts offers immediate access to specialized skills and fresh insights, but it can be expensive and sometimes misaligned with established internal processes. Ultimately, a hybrid approach—leveraging external partners to guide and accelerate the upskilling of internal teams—can balance these tradeoffs, ensuring that organizations build sustainable expertise while benefiting from immediate technical support.

The second challenge is overcoming infrastructure and performance limitations that can impede the scaling of AI solutions. Robust and scalable infrastructure is essential for maintaining data latency, integrity, and managing storage costs as the volume of content and complexity of taxonomies grow. For example, an organization might experience significant delays in real-time content tagging when migrating a legacy database to a cloud-based system, thereby affecting overall efficiency. Similarly, a media company processing vast amounts of news content can encounter bottlenecks in automated tagging, document summarization, and cross-mapping, resulting in slower turnaround times and reduced responsiveness. One mitigation strategy would be to leverage scalable cloud architectures, which offer dynamic resource allocation to automatically adjust computing power based on demand—directly reducing latency and enhancing performance. Additionally, the implementation of continuous performance monitoring to detect system bottlenecks and data integrity issues early would ensure that potential problems are addressed before they impact operations.

Closing

Advances in AI, particularly with large language models, have opened up transformative opportunities in taxonomy development and the utilization of semantic technologies in general. Yet, like any tool, AI is most effective when its strengths are matched with human expertise and a well-thought-out strategy. When combined with the insights of domain experts, ML/AI not only streamlines productivity and uncovers new layers of content understanding but also accelerates the rollout of innovative applications.

Our experience shows that overcoming the challenges of expertise gaps and infrastructure limitations through a blend of internal upskilling and strategic external partnerships can yield lasting benefits. We’re committed to sharing these insights, so if you have any questions or would like to explore how AI can support your taxonomy initiatives, we’re here to help.

The post AI & Taxonomy: the Good and the Bad appeared first on Enterprise Knowledge.

The Minimum Requirements To Consider Something a Semantic Layer

Ben Kass — Fri, 28 Feb 2025 17:58:58 +0000

Semantic Layers are an important design framework for connecting information across an organization in preparation for Enterprise AI and Knowledge Intelligence. But with every new technology and framework, interest in utilizing the technological advance outpaces experience in effective implementation. As awareness of the importance of a semantic layer grows, and the market is becoming saturated with products, it is crucial to clearly distinguish between what is and is not a semantic layer. This distinction helps identify architectures that provide the full benefits of a semantic layer–such as aggregating structured and unstructured data with business context and understanding–versus more general data fabrics and semantic applications that may only provide some of its benefits.

To draw this distinction, it’s essential to understand the components that make up a semantic layer and how they connect, as well as the core capabilities and design requirements.

A Semantic Layer is not

One vendor
A metrics layer
A TMS, LMS, or CMS

No one application is a semantic layer; a semantic layer is a framework for design. This article will focus on summarizing the requirements of the semantic layer framework design. For a deeper exploration of the specific components and how they interact and can be implemented, see “What is a Semantic Layer? (Components and Enterprise Applications)”.

Requirement 1: A Semantic Layer supports more than one consuming application

A Semantic Layer is not equivalent to a model or orchestration layer developed to serve only one data product. While application-specific semantic models and integrations–such as those unifying customer information or tracking specific business health analytics via executive dashboards–can be critical to your business’s tech stack, they are not enough to connect information across the organization. To do this, there must be multiple applications operating within a design framework that enables the sharing of semantic data, such as catalogs, recommendation engines, dashboards, and semantic search engines. A semantic layer-type framework that serves only one downstream application risks becoming too closely tied to one domain and stakeholder group, limiting its broader organizational impact.

Requirement 2: A Semantic Layer connects data/content from more than one source system

Similar to enabling more than one application, a semantic layer should also connect information from multiple source systems. A layer that pulls only from one source is not able to meet modern needs for structured and unstructured data aggregation across silos to generate insights. Without a layer for interconnection, organizations run the risk of creating silos between data sources and applications. Additionally, organizations and semantic layer teams should develop data processing and analytics tools that are reusable across source systems as a part of the semantic layer. Tying a layer to a single downstream application encourages the duplication of work, instead of solution reuse enabled by the semantic layer. One multi-national bank that EK worked with developed a semantic layer to pull together complex risk management information from across multiple sources. The bank ended up cutting down their time spent on what used to be weeks-long efforts to aggregate data for regulators, and made information from siloed process-specific applications available in one central system for easy access and use.

Requirement 3: A Semantic Layer establishes a logical architecture

The semantic layer can serve as a logical connection layer between source systems

What separates a semantic layer from a well-implemented data catalog or data governance tool is its ability to serve as a connection and insight layer between multiple heterogenous sources of information, for multiple downstream data products and applications. To serve sources of information that have different data and content structures, the semantic layer needs to be based on a logical architecture that source models can map to. This logical architecture can be managed as a part of the ontology models if desired, but the important thing is that it serves as a necessary abstraction step so that business stakeholders can move from talking about the specific physical details of databases and documents, to what the data and content is about. Without this, the work required to ensure that the layer is both aggregating and enriching information will fail to scale over multiple domains. Moreso, the layer itself may begin to fracture over time without a logical architecture to unify its approach to disparate applications and data sources.

Requirement 4: A Semantic Layer reflects business stakeholder context and vocabulary

A semantic layer is more than simply a means of mapping data between systems. It serves to capture business knowledge and context so that actionable insights can be pulled from structured and unstructured data. In order to do this, the semantic layer must leverage terminology that business stakeholders use to describe and categorize information. These vocabularies then serve as a standardized source of metadata to ensure that insights from across the enterprise can be compared, contrasted, and linked for further analysis. Without reflecting the language of business stakeholders and understanding the context they use it in to describe key information, the semantic layer will not be able to accurately and effectively enrich data with business meaning. The layer may make data more accessible, but it will fail to make that data meaningful.

Requirement 5: A Semantic Layer leverages standards

As a semantic layer evolves, so too does a business’s understanding of what tooling best fits their needs for the layer’s components. Core semantic layer components such as the glossary, graph, and metadata storage should be, based on widely adopted semantic web standards, such as RDF and SKOS, to avoid vendor tool lock in for interchangeability. No one component should become an anchor – instead, each component should function more like lego bricks that may be changed out as an organization’s semantic ecosystem and needs evolve. Additionally, basing a semantic layer on standards opens up a world of already matured libraries, design frameworks, and application integrations that can extend and enhance the functionality of the semantic layer, rather than requiring a development team to re-create the wheel. Without standards-based architecture, organizations risk problems with the long-term scalability and management of their semantic layer.

Conclusion

A Semantic Layer connects information across an organization, by establishing a standards-based logical architecture, informed by business context and vocabulary, that connects two or more source systems to two or more downstream applications. Efforts that do not meet these requirements will fail to realize the benefits of the semantic layer, resulting in incomplete or failed projects. The five key requirements for the semantic layer framework described in this article create a baseline for what a semantic layer implementation should be. While not exhaustive, understanding and following these requirements will help organizations unlock the full benefits of the semantic layer to deliver real business value. These requirements will ensure that your semantic layer is able to capture knowledge and embed business context across your organization to power Enterprise AI and Knowledge Intelligence. If you are interested in learning more about semantic layer development and how EK can help, check out our other blogs on the subject, or reach out to EK if you have specific questions.

The post The Minimum Requirements To Consider Something a Semantic Layer appeared first on Enterprise Knowledge.

Governing a Federated Data Model

Ben Kass — Thu, 25 Apr 2024 15:36:11 +0000

Kjerish, CC BY-SA 4.0, via Wikimedia Commons

Data proliferates. Whether you are a small team or a multinational enterprise, information grows at an accelerated rate over time. As that data proliferates, you can run into issues of interoperability, duplication, and inconsistency that slows down the speed with which actionable insights can be derived. In order to mitigate this natural tendency, we develop and enforce standardized data models.

Developing enterprise data models introduces new concerns, such as how centralized ownership of the model should be. While it can be helpful to have a singular overarching data model team, centralization can also introduce its own challenges. Chief among these is the introduction of a modeling bottleneck. If only one team can produce or approve models, that slows down the speed with which models can be developed and improved. Even if that team is incorporating feedback and review from the data experts, centralization is typically a blocker to ensuring that deep domain knowledge is captured and kept updated. It is for that reason that frameworks such as data mesh and data fabric promote domain ownership of data models by the people closest to the data within a larger federated framework.

Before continuing, we should define a few terms:

Of course, implementing domains working within a federated data model brings its own challenges for data governance. Some–such as the need for global standardization to promote interoperability across data products–are common data challenges, while others–such as the federation of governance responsibilities–may be new to organizations embarking on a decentralized model journey. This article will walk through how to begin transitioning to federated data model governance.

Similar to a town hall or local government, success will rely on ensuring that many different stakeholders have a seat at the table and a sense of shared responsibility. Dietmar Rabich / Wikimedia Commons / “Dülmen, Rathaus, Ratssaal — 2017 — 9667-73” / CC BY-SA 4.0

Moving away from a centralized model: Balancing standardization and autonomy

For organizations that have already implemented centralized data governance, the thought that governance responsibilities can or should be federated out to different domains may seem strange. Data governance grows out of the need for standardization, interoperability, and regulatory standards, all of which are typically associated with centralized management. These needs are central to any large organization’s data governance, and they don’t go away when creating a federated governance model. However, within a federated data model, these standardization needs are balanced against the principles of domain autonomy that support data innovation and agile production. Time spent explaining field naming conventions and data structure to non-experts and waiting for approval can slow or even stymie the ability to make data internally available, resulting in increased cost, lost hours, and lower innovation.

To support domain autonomy and the ability to move quickly when iterating on or creating new data products, some of the responsibility for ensuring that data meets governance standards is shifted onto the domains as part of a shared governance model. Business-wide governance concerns like security and enterprise authorization remain with central teams, while specific governance implementations like authorization rules and data quality assurance are handled on a domain basis. Domains handle the domain-level governance checks and leave the centralized governance group to tackle more central issues like regulatory compliance, meaning that the data product teams spend less time waiting on centralized governance checks when iterating on a data product.

The federated and central governance teams are not separate entities, working without knowledge of one another. Domain teams are able to weigh in on and guide global data product governance policies, through a cross-functional governance team.

Global governance, local implementation

Within a federated governance model, it is still important to be able to create enterprise-wide governance policies. Individual domains need guidance on meeting regulatory requirements in areas of privacy and security, among others. Additionally, for standardization to be of the greatest benefit, all of the groups producing data need to align on the same standards.

It is for these reasons that the federated governance model relies on a cross-functional governance team for policy decisions, as well as guidance on how to implement governance to meet those policies. This cross-functional team should be made up of domain representatives and representatives from Central IT, Compliance, Standards, and other experts in relevant governance areas. This ensures that policy decisions are not removed from the data producers, and that domains have a say in governance decisions while remaining connected to your organization’s central governance bodies. Policies that should be determined by this governance team can include PII requirements, API contracts, mappings, security policies, representation requirements, and more.

An example federated governance diagram

In order to ensure that domains are fully engaged in the governance process, it is best practice to involve the data product teams early in the governance process. For an organization new to federated data models, this should happen when the data product teams are being stood up, rather than waiting for product teams to be fully formed before grafting on later. When Enterprise Knowledge spearheaded the development of an enterprise ontology for data fabric at a major multinational bank, we worked with the major stakeholders to start defining a federated governance program and initial domains from the beginning of the engagement alongside the initial ontology modeling. This helped to ensure that there was early buy-in from the teams that would later define and be responsible for data products.

The data product teams are ultimately responsible for executing the governance policies of this group, so it is vital that they are involved in defining those policies. Lack of involvement can lead to friction between the governance and data product teams, especially if the data product teams feel that the policies are out of sync with their governance needs.

Shift left on standards

The idea of “shifting left” comes from software testing, where it means to evaluate early rather than later in project lifecycles. Similarly, shifting left on standards looks to incorporate data management practices early into data lifecycles, rather than trying to tack them on at the end. Data management frameworks prioritize working with data close to its source, and data governance should be no different. Standards need to be embedded as early within the data lifecycle as possible in order to promote greater usability downstream within data products.

For data attributes, this could mean mapping data to standardized concept definitions and business rules as defined in an ontology. EK has worked with clients to shift left on standardization by using a semantic layer to connect standardized vocabulary to source data across disparate datasets, and map the data to a shared logical model. Applying standardization within the data products improves the user experience for data consumers and lessens the time lost when working with multiple data products.

Zhamak Dehghani, the creator of the data mesh framework, suggests looking for places where standardization and governance can be applied programmatically as part of what she refers to as “computational governance.” Depending on an organization’s technical maturity (i.e. the availability and use of technical solutions internally), governance tasks such as the anonymization of PII, access controls, retention schedules, and more can be coded as a part of the data products. This is another instance of embedding standardization within domains to promote data quality and ease of use. Early standardization lessens the amount of later coordination that is required to publish data products, resulting in faster production, and it is one of the keys to enabling a federated data model.

Conclusion

While federated data governance will be a new paradigm to many organizations, it has clear advantages for data environments that rely on expertise across different subject areas. The best practices discussed in this article will ensure that your organization’s data ecosystem is not only a powerful tool for standardization and insights, but also a robust and reliable one. Data product thinking can be an exciting new way to gain insights from your data, but the change in paradigm required can also leave new users feeling lost and unsure. If you want to learn more about the social and technical sides of setting up federated data governance, contact us and we can discuss your organization’s needs in detail.

The post Governing a Federated Data Model appeared first on Enterprise Knowledge.

How a Knowledge Graph Can Accelerate Data Mesh Transformation

Ben Kass — Tue, 11 Jul 2023 13:32:29 +0000

Despite a variety of powerful technical solutions designed to centrally manage enterprise data, organizations are still running into the same stubborn data issues. Bottlenecks in data registration, lack of ownership, inflexible models, out of date schemas and streams, and quality issues have continued to plague centralized data initiatives like data warehouses and data lakes. To combat the organizational and technical problems that give rise to these issues, businesses are turning to the data mesh framework as an alternative.

Data mesh focuses on decentralizing data, moving from the data lake paradigm to a network of data products owned and produced by specific business domains. Within the data mesh, data is treated like a product: it has specific business units that own the production of data as a product and work to meet the needs of the enterprise data consumers. Domain ownership and data product thinking, both of which promote autonomy, are balanced through a self-serve data platform and a federated governance model that ensure data products meet standardization and quality needs, and are easy for domains to produce and register. The name “data mesh” comes from the graph nature of this decentralized model: the data products serve as points, or nodes, which are linked together to power deeper data insights.

The shift from centralized data management to decentralized data management can be a stark change in mindset, and organizations can struggle to know where to begin. Translating the four principles of the data mesh framework–data as product, domain ownership of data, self-service data platform, and federated computational governance–to technical and organizational requirements requires a new perspective on data and semantics. It requires a graph perspective, similar to a knowledge graph. At Enterprise Knowledge, we help clients to model their data using knowledge graphs, and we’ve demonstrated how a knowledge graph can be instrumental to jump-starting an organization’s data mesh transformation. As this article will walk through, knowledge graphs are an intuitive and powerful method for accelerating a mesh initiative.

How can knowledge graphs accelerate data mesh?

A knowledge graph is a graph-based representation of a knowledge domain. The term was first introduced by Google as a means to connect searches to the concepts like a person or organization searched for rather than the exact strings used. A knowledge graph models the human understanding of concepts and processes as entities, with relationships between them showing the context and dependencies of each entity. Data is then mapped to the graph as “instances” of entities and relationships. This encodes the human understanding of a knowledge domain such that it can be understood and reasoned over by machines. Knowledge graphs are a powerful tool for mapping different data sources to a shared understanding, and can power enterprise search, content recommenders, enterprise AI applications, and more.

Both knowledge graphs and data mesh take different sources of information and represent them as nodes on a graph with links between them in order to power advanced insights and applications. Unlike data mesh, knowledge graphs are often more centrally managed (one example being a knowledge graph powering a semantic layer) and focus on representing attributes rather than products at the node level, but both approaches take a domain approach to modeling business understanding and capturing the meaning or semantics underlying the data.

In her book Data Mesh: Delivering Data-Driven Value at Scale, Zhamak Dehghani, the creator of the data mesh framework, even describes data products within a data mesh as constituting an “emergent knowledge graph” (pp. 8, 199). The deep similarities between data mesh and knowledge graphs lead to a natural synergy between the two approaches. Having a knowledge graph is like having superpowers when it comes to tackling a data mesh transformation.

Superpower #1: The knowledge graph provides a place and framework for data product modeling

One of the first stumbling blocks for organizations embarking on a data mesh initiative is understanding how to model for data products within a domain. For the purpose of this article, we will assume that starter domains and data products have already been identified within an organization. One of the first questions people ask is how they can model data products without creating silos.

Having a self-service data platform provides a common registration model, but it doesn’t enforce interoperability between data products. To be able to link between data products without wasting time creating cross-mappings, it is necessary to align on common representations of shared data, which Zhamak Dehghani refers to as “polysemes.” Polysemes are concepts that are shared across data products in different domains, like customer or account. Taking “account” as our example of a shared concept, we will want to be able to link information related to accounts in different data products. To do these, the products we are linking across will need to align on a method of identifying an account, as well as a common conceptual understanding of what an account is. This is where a knowledge graph comes in, as a flexible framework for modeling domain knowledge.

Using an RDF knowledge graph, we can model account as well as its associated identifier and the relationship between them. Preferably, account and other polysemes will be modeled by the domain(s) closest to them, with input from other relevant data product domains. If we then model other data products within our knowledge graph, those data products will be able to link to and reuse the account polyseme knowing that the minimum modeling required to link across account has been achieved. The knowledge graph, unlike traditional data models, provides the flexibility for the federated governance group to begin aligning on the concepts and relationships common to multiple domains, and to reuse existing concepts when modeling data products.

Note: There are multiple ways of creating or enforcing a data product schema from a knowledge graph once modeled. SHACL is one language for describing and validating a particular data shape in a knowledge graph, as is the less common ShEx. SPARQL, GraphQL, and Cypher are all query languages that can be used to retrieve semantic information to construct schemas.

Superpower #2: Use the knowledge graph to power your self-service data mesh platform

The data mesh self-service platform is the infrastructure that enables the data products. It should provide a common method for data product registration that can be completed by generalist data practitioners rather than specialists, and a place for data consumers to find and make use of the data products that are available. In order to fulfill the data mesh infrastructure needs, a data mesh platform needs to: support data product discovery, provide unique identifiers for data products, resolve entities, and enable linking between different data products. To show how a knowledge graph can help a data mesh platform to meet these structural requirements, I will walk through each of them in turn.

Discover Data Products

For data products to transform an organization, they need to be easily found by the data consumers who can best make use of the analytical data. This includes both analysts as well as other data product teams who can reuse existing products. Take a financial services company’s credit risk domain, for example. Within this domain you can have data products that serve attributes and scores from each of the three big credit bureaus, as well as aggregate products that combine these credit bureau data products with internally tracked metrics. These products can be used by a wide variety of the business’s applications: reviewing potential loan and credit card customers, extending further credit to existing customers, identifying potential new customers for credit products, analyzing how risk changes and develops over time, etc.

A knowledge graph supports this reuse and findability by defining the metadata that describes the data products. Associated taxonomies can also help to tag use cases and subject areas for search, and existing knowledge graph connections between data products can surface similar datasets as recommendations when consumers browse the mesh.

Assign Unique Identifiers

The core of the data product node within a mesh is its Uniform Resource Identifier, or URI. This identifier is connected to a data product’s discovery metadata and its semantic information: relationships to other data products, schema, documentation, data sensitivity, and more. Knowledge graphs are built on a system of URIs that uniquely identify the nodes and relationships with the graph. These URIs can be reused by the data mesh to identify and link data products without having to create and manage a second set of identifiers.

Resolve Entities

Entity resolution is the process of identifying and linking data records that refer to the same entity, such as a customer, business, or product. The node-property relationship structure of knowledge graphs is well suited to entity resolution problems, since it categorizes types of entities and the relationships between them. This in turn powers machine learning entity resolution to disambiguate entities. Disambiguated entities can then be linked to a single identifier. Resolving entities surfaces linkages across data products that would not otherwise be identified, generating new insights into data that was previously isolated.

Link Products Together

A data mesh is not a data mesh unless there are linkages across the data products to form new data products and power new analytics. Without those linkages, it is only a set of disconnected silos that fail to address institutional data challenges. Linkages that exist between data products in the knowledge graph will be surfaced by a data mesh platform, such as a data catalog, to provide further context and information to users. New linkages can be easily defined and codified using the knowledge graph system of nodes and properties that underlie that data mesh platform.

Conclusion

Data thrives on connections. The ability to create insights across an enterprise can’t come from one stream or relational table alone. Insights rely on being able to place data within an entire constellation of different sources and different domains, so that you can understand the data’s context and dependencies across a sea of information. Both data mesh and knowledge graphs understand this, and work to define that necessary context and the relationships while simultaneously providing opportunities to enhance the quality of the information sources. These graph based representations of data are not an either-or that businesses need to choose between; instead, they complement each other when used together as a framework for data. If you are interested in learning more about beginning or accelerating a data mesh transformation with a knowledge graph, reach out to us to learn how we can drive your data transformation together.

The post How a Knowledge Graph Can Accelerate Data Mesh Transformation appeared first on Enterprise Knowledge.

Top 5 Tips for Managing and Versioning an Ontology

Ben Kass — Tue, 10 Jan 2023 21:50:29 +0000

Sometimes, clients who come to EK confident in their ontology development capabilities find themselves wrongfooted when it comes to creating an ontology management plan. This is partly a result of documentation – there are a wide variety of resources on development methodologies, and considerably less on the nitty gritty of making sure that an ontology remains use case-aligned and usable for years to come. Lack of guidance can result in vague ontology management plans that don’t fully account for the actions that will need to be taken. This article will go over the five key components to an effective ontology versioning and management plan. After reading, you will be able to pursue ontology management confident that you have the details down.

Accommodating Change

Ontologies are not static artifacts. They grow as new use cases are identified and brought on, and develop as the content and understanding that underpins them changes. Sometimes, the process of deploying an ontology leads to these changes, as the business understanding of how the ontology will be applied is refined. For example, one of our clients began their modeling project with the goal of creating a standardized set of canonical data schemas. As the project approached implementation and met roadblocks, the team realized that what data consumers really needed were trusted data products made available through an internal data catalog, rather than additional schemas. Modeling and governance practices shifted to support the new use case, and the project was successful thanks to a greater alignment with data consumer needs.

Whatever the source of change, the ontology will need to have a plan in place to ensure that updates are transparent, maintain interoperability, and can scale. To ensure that the goals of transparency and interoperability are met requires a robust approach to versioning, as part of a comprehensive governance plan. Periodic change is common to ontologies, especially in the first year of development. We have seen clients completely change their approach to modeling relationships, or who decide to move away from reusing open ontology models that weren’t well suited to their use case. In both those instances, the client needed a way to communicate the magnitude of the change, and ensure that users aligned to the newest version of the ontology.

Versioning is the ability to track and communicate what changes have been made to a file as it is updated. Generally, new versions are identified through the use of a version number, and come with information on what changes were made. Like updates to a piece of software, versioning lets users know when there is a more up-to-date version of the ontology they should move onto, and what changes were made. Versioning is critical to making sure that integrations with the ontology stay aligned Versioning can also communicate the level of changes made via an update, and whether those changes are backwards compatible or not. Tracking changes via a versioning plan is the key to ensuring usability of an ontology over time, as well as its longevity in the face of change. This leads into the first tip:

1. Track Version Information within the Ontology

There are a number of places where a version number can be tracked and delivered to consumers. One of the best places to track this information is within the ontology itself, using a datatype property. This has the benefits of making sure that the ontology cannot be separated from its version information, and that this information is easy for the ontologist to access and update.

OWL ontologies can use the preexisting OWL attribute owl:versionInfo to store version information. If the OWL standard is not in use, then version information can be tracked using rdfs:comment or an annotation property assigned by the editor. Semantic Web Company’s PoolParty Thesaurus Manager (PPT), for instance, uses rdfs:comment to track description information and can be used to record a version number for ontologies. TopQuadrant’s TopBraid EDG Ontology editor defines a custom attribute, http://topbraid.org/metadata#version or metadata:version, to store ontology version information.

Example of how rdfs:comment can be used to store a version number in PPT. In this example, the version number is 1.2.0

Regardless of which standard you use, the key is to ensure the version info can always be found alongside the ontology, and that the version info can be updated easily when changes are committed.

Be Aware: Some ontologies track the version number in the namespace of the ontology. Tracking the version number like this means that the namespace changes with every new update, which can cause difficulties with software integrations. As a result, this method of version tracking is generally not recommended.

One example of a namespace with a version number is the FOAF ontology namespace, pictured above. For more information on namespace and URIs, check out Resolving Uniform Resource Identifiers.

2. Use the Semantic Versioning (SEMVER) Standard

The Semantic Versioning specification, or SEMVER, is a software development standard that guides how to create and apply version numbers in such a way that users can understand the level of changes made. Within SEMVER, version numbers are constructed following a pattern of X.Y.Z where X is the major version number, Y is the minor version number, and Z is the patch version number.

A mock version number. In this example, 8 is the major version number, 1 is the minor version number, and 7 is the patch version number. Note that the “Version” is not a part of the number, and not required under SEMVER.

The Major version number is incremented when updates are made that will cause a break in backwards compatibility. The Minor version number is incremented when updates are made that add functionality without causing a break in backwards compatibility. Finally, the Patch version number is reserved for bug fixes that do not cause a break in backwards compatibility. The Patch version number is less commonly used alongside ontologies, as an ontology editor will typically be able to catch any RDF issues as part of its quality assurance features. More information detailing how to use this standard can be found within the documentation.

By following the SEMVER rules for the construction of version numbers, the ontology will communicate an update’s level of impact to users. Changes to the major version number signal that there may be required updates to integrations with the ontology, while minor and patch number changes do not. It ensures that the ontology versioning follows the best practices of a widely adopted standard.

Note: Not every change that first appears to break backwards compatibility actually will, depending on implementation. Consider first if the entity being updated is in use within a source system. If the entity is not in use, then it can safely be altered or removed without affecting compatibility. Generally, anything not in use by another system can be safely changed or removed without requiring a major change process.

3. Have a Plan for Deprecation

Removing outdated modeling is a reality of ontology upkeep and development. Privacy and security in particular are two areas that we often see evolve as an organizational understanding of how to enforce privacy and security develops and language shifts. When this happens, the previous concepts and terms need to be sunsetted once their replacements become available.

Just deleting the entities everytime that modeling becomes outdated will quickly rack up potentially breaking changes however, and this can lead to wide disruption of downstream consuming systems. Instead of immediately deleting the outdated entities, it is better to deprecate them first.

Unlike deletion, deprecation does not immediately remove a piece of modeling. Instead, deprecation involves signaling that the modeling in question is no longer supported, and will be removed in the future. Deprecation should also indicate where possible what modeling should be used instead of the deprecated modeling.

Deprecating before deletion allows for ontology users to prepare for upcoming breaking changes. The deprecated entity should clearly state that it is deprecated, why it was deprecated, and point to possible replacements if any exist. Deprecated entities should be easily distinguished from non-deprecated entities in the editor. The open-source ontology editor Protégé will automatically strike through deprecated concepts, for example.

Example of a deprecated class in Protégé.

Deprecated concepts can also be distinguished by adding “(Deprecated)” to the end of the label. Following these rules for deprecation will help to preserve backwards compatibility in the short term, while ensuring that users move away from outdated modeling before it is removed.

Bonus Tip: Deprecated entities should stay in the ontology until the next major change, at which point they should be removed. This helps to group major changes together, while also giving ontology users time to move off of the deprecated entities.

4. Keep a Changelog

Data consumers will want to know what changes were made between versions. If you are using a version control system like git or GitHub, then the changes made between versions will automatically be reflected in the file comparison, also known as a diff. It is important to note here that these systems track every change in the RDF serialization. Normal RDF editors do not write entities in a specific order, so changes in that order will be incorrectly flagged by the diff as updates to the ontology. We typically avoid this by using an extension that sorts the RDF when it is written, such as the Ordered Turtle Serializer for rdflib, or TopBraid’s Sorted Turtle.

Alternatively, the record of these changes can be tracked and delivered by the ontologist, either through a note attached to new version publications or an excel sheet documenting the changes. One example of this is the Financial Industry Business Ontology, or FIBO, which maintains an extensive record of changes made within each revision as part of their release notes. Note that manual tracking can quickly become overwhelming, so look to automate where possible when producing a changelog.

5. Deliver the Right Version

Once an ontology has been versioned, you need to make sure that the correct version of the ontology is being delivered to users. While it may be tempting to simply delete and replace the old version, there may be consumers who need to stay on a prior ontology version temporarily. Rather than deletion, look into providing both Ontology IRI and an Ontology Version IRI endpoints alongside exports. The Version IRI endpoint is a link or identifier that points to a specific version of the ontology, while the more general Ontology IRI is a link or identifier that points to the latest version of the ontology. Manchester University’s Protégé pizza tutorial ontology distinguishes between its IRIs by adding the version number to the end of the version IRI.

The Manchester University pizza tutorial ontology IRIs. Note that Protégé automatically supports the use of both IRI types, as do other editors like TopBraid EDG.

Another approach is to create a publicly available archive that hosts prior versions of the ontology. Consumers who are unable to move onto the latest version of the ontology can then have uninterrupted access to the modeling they need. Be sure to communicate that prior versions are no longer being updated, however. Trying to maintain and update different ontology versions can quickly get out of hand. Also make sure that the location of the latest version is stable and does not change.

While it may be easy to overlook, having a versioning plan in place is an important part of maintaining the long-term usability of an ontology. These are our top considerations and tips if you and your team are looking to understand what it takes to develop and maintain your model. Here at Enterprise Knowledge, we work with a wide variety of clients helping them to create and manage ontologies, and help our clients to create the customized versioning and governance plan that best suits their needs. If you would like to learn more about ontology versioning and governance, reach out to us to learn more about how we can create a customized plan together.

The post Top 5 Tips for Managing and Versioning an Ontology appeared first on Enterprise Knowledge.

Trimming an Untamed Taxonomy

Ben Kass — Tue, 21 Dec 2021 15:00:53 +0000

When your taxonomy has overgrown your path towards usability, it’s time to do some gardening.

Congratulations: you have a taxonomy! You’ve gone through the work of gathering user feedback, developing a design, validating the design, and you’ve come out the other end with a stable set of terms. Maybe you’ve only just completed your efforts, or maybe you’re working with a legacy system. Either way, you’ve done it. Time to rest on your laurels.

Only…now that you have it, the taxonomy doesn’t really seem to be working for you. Maybe it turns out that five separate levels of hierarchy devoted to the technical differences between slippers and slip-ons can’t be implemented in the current version of your content management system, and now they sit unused in an Excel sheet instead. Or you’re hearing from content authors that some subjects have so many terms that they struggle to reliably choose the correct tags while other subjects struggle from a lack of terms and ambiguity among them. Maybe this taxonomy worked fine years ago when it was first implemented, but it has gradually become more and more unwieldy and difficult to manage year after year.

Whatever your situation, you’ve reached the point of crisis. What was once or what could have been a well-tended garden of terms has become an untamed forest of thorns. What now? This blog post will take you through some of the common problems with thorny taxonomies and solutions for turning your frog back into a prince.

Problem 1: Too Many Terms

Symptoms:

Too many terms to choose from.
Some terms with very little uptake and use.
Likely ambiguity in spite of this proliferation of terms.

When creating a taxonomy, we want to gather and process as much input as we can to inform the ultimate design. As a result, it might seem like a taxonomy with more terms is inevitably better than a smaller taxonomy on the same subject. However, this isn’t the case. The greater the number of terms you have, the greater the tradeoff you’re making in regard to usability. Now, there are use cases in which you would want a larger taxonomy; for example, taxonomies developed for machine learning, auto-tagging, and natural language processing all require a high level of granularity. In contrast, taxonomies developed for search and navigation should seek to be broad and intuitive for a wide range of users. For enterprise taxonomies, you want to look for areas of compromise and “best fit” rather than aiming for perfect coverage in your design.

So, what should you do?

Solutions:

Consider how your taxonomy is going to be used – do you want a broadly comprehensive taxonomy for search, or a taxonomy made to tag a large corpus of documents with high detail? If you’re tracking metrics on the current use and application of your taxonomy, look at the terms that are used the least. Are the concepts these terms refer to already covered by other areas of the taxonomy? Terms that are already covered can become synonyms of related, more commonly used terms, or removed entirely. Are they ambiguous? Ambiguous terms can be removed or replaced with a more specific term. Alternatively, you might want to consider enhancing an ambiguous term with a scope note. Especially in specialized vocabularies, it may be that what appears to your taggers as an ambiguous term has, in fact, a specific industry or field definition that they are unfamiliar with. In this case, it may be enough to define the term rather than removing or replacing it.

Problem 2: Hierarchy and Balance

Symptoms:

The bottom terms of your taxonomy vary widely in regard to specificity.
Some branches of the taxonomy are 5 or more levels deep, while others bottom out at 2 or 3 levels of hierarchy.
The most used terms tend to be hidden behind several layers of categorization.
Project teams that contributed the most to the taxonomy effort have far more defined branches of the taxonomy.

Solutions:

Take stock of how many layers of hierarchy your taxonomy has, versus how many it needs. In general, because an enterprise taxonomy will be used by people with a variety of subject backgrounds, we aim to keep an enterprise taxonomy to 2-3 levels deep to promote usability. More specialized vocabularies or advanced use cases may require further hierarchy. Hierarchical or parent-child relationships are a powerful tool for distinguishing between different concepts, and they are what differentiates a taxonomy from a flat list of terms. Implementing that specificity comes at a cost – the more levels of hierarchy you require users to understand and navigate, the greater the complexity and difficulty they will encounter when searching for a specific term.

So, what should you do?

In some cases, it may help to move up branches of the taxonomy that are at a lower level of hierarchy in order to make them easier for users to apply and find. It can be counterintuitive, since this move may involve moving terms out of categories that they would otherwise fit into. The focus should be on striking a balance between accuracy and usability.

Your Fantastic Farm Taxonomy

If you have a few lower branches of your taxonomy that are notably more specific and relate to a particular project team or work area, you should also consider whether you would be better served by moving these branches into a secondary taxonomy that can then be used in conjunction with your original taxonomy. A secondary taxonomy applies to certain types of content within a narrow focus area, and is an excellent method of providing further granularity for teams that require and can make use of it. After moving these terms out of the original taxonomy, you may then fill their area of the taxonomy with a few more general terms that better match the specificity of the rest of your taxonomy.

For instance, looking at the example of a farming-related taxonomy in the image to the left (Your Fantastic Farm Taxonomy), suppose we want to tag documents related to farming. Most of the documents fit into a general subject area, but a subset is specific to individual crops. Let’s say that our users want to be able to find the documents they need related to the crop they’re planting. Rather than unbalancing our taxonomy by placing all of the crop information under Agriculture, where it would be difficult to discover and unusually large relative to the taxonomy as a whole, we can pull out crops into its own category instead. This results in a taxonomy that is easier to apply and navigate.

Finally, read through the terms of your taxonomy with an eye toward consistency. If you’re not looking to implement auto-tagging immediately, it’s fine to have some mix of general terms (i.e. Car → Toyota) and specific terms (Silver Toyota Camry 2009). But if there are areas marked by specificity, such as using product and program names, or individual items of a class rather than the class, consider making the existing terms synonyms and finding either newer, more generalized class terms or just going with the next highest level of terms instead.

A Note on Faceting:

You may know the old phrase, sometimes attributed to Benjamin Franklin or Samuel Johnson: “A place for everything and everything in its place.”; or the taxonomist’s version: “Mutually exclusive and collectively exhaustive”, also known as MECE. When creating a taxonomy, it can be tempting to search for that perfect place for each and every term. Inevitably, however, you are going to come across instances in which a term could conceivably belong to two very different branches of your tree. There are several routes you can take when this happens:

If your system supports it (more on that in a moment) you can use polyhierarchy to place that child under two parent terms.
You can try to make the two instances of the term more specific, in order to distinguish them from one another (Milk (Agriculture) for milk as relates to farming vs Milk (MNCH) for milk as relates to maternal and neonatal child health, for instance).
You can make a choice and remove the term from one of its two parents. This will obviously affect the use case that isn’t chosen, but is easier to handle from a system perspective.

What I would suggest though, is to consider faceting. A faceted taxonomy is structured such that the user is expected to combine multiple terms when searching/filtering. This is especially common in product taxonomies – think of Amazon. Rather than having a hierarchy of terms that eventually leads to “red ball,” there would be a branch of color descriptors and a separate branch of toys, with the expectation that a tagger would apply both “red” and “ball” as separate terms to describe a red ball. Faceting is well-matched to the search behaviors of non-experts, and a powerful option for balancing your taxonomy.

Problem 3: Systems and Training

Symptoms:

There is confusion over how to apply terms, or a disconnect between your current users and the groups that gave input on the taxonomy.
The taxonomy is implemented in several systems, with slight differences between each implementation.
There are differences between the “master” taxonomy and the system implementations.

In discussing a taxonomy, it’s important to consider not just the terms and the relations between them, but the systems in which the taxonomy is instantiated. You should consider:

How many levels of hierarchy can the target system hold and display?
Is the taxonomy contained and managed centrally, feeding into various systems and keeping each system up to date, or is it scattered across the enterprise in various conditions?
How do new and existing users learn to use the taxonomy?

Problems in any of these areas will adversely affect the usability of your taxonomy and your ability to manage it long term.

So, what should you do?

Solutions:

Make sure that your taxonomy is designed with implementation in mind. If there is limited support for hierarchy in your target system, then there are a couple of avenues for adopting your taxonomy. You can turn lower-level terms into synonyms of their parent terms, and display only the parents. Or you can look into faceting, and implement your taxonomy as a series of flat vocabularies that can be combined for greater specificity.

You may also want to look into using a Taxonomy Management System, or TMS. This is a tool that can centrally manage your taxonomy and feed it to target systems. Standard TMSs will have features to aid in the management and quality checking of your taxonomy, and may also have auto-tagging, corpus analysis, and other capabilities. One of the biggest advantages of using one, though, is keeping your taxonomy updated and in sync across systems. Allowing different implementations of your taxonomy across systems and differences in terms can lead to user confusion, driving low adoption and a failure to use your taxonomy effectively.

Training is another system-related aspect of your taxonomy. Instructional documentation around term definitions, how to apply them, and mechanisms for providing feedback all help to improve tagging accuracy. Maintaining documentation around taxonomy sources, updates, and changes over time is also important for avoiding confusion within the long-term health and management of the taxonomy.

For the Future

Once you have a good working taxonomy it’s important to ask yourself what you can do to avoid problems in the future. A well-managed taxonomy will be a tool for findability, analytics, and alignment for years to come, while a poorly-managed one will fail to support its use cases and actively cause confusion. So what can you do? How do we defeat the expiration date on your taxonomy?

The answer is to turn to the unsung hero of taxonomy and KM efforts: governance. Creating a KM-governance structure with stakeholders from across your organization’s taxonomy users is the best way to maintain and adapt your taxonomy over time. Admittedly, this is a tempting step to skip – after all the work that goes into creating the taxonomy, it can be hard to muster the energy for the long-term work of maintaining and adding terms. However, governance work can be one of the most valuable KM duties if you give it the necessary attention. Finding areas of alignment and compromise across the enterprise is a challenge that not only improves your taxonomy’s quality, but also forces you to develop a greater understanding of your institution and build connections across teams. A great governance structure is the secret weapon to keeping and using a strong business taxonomy. You should have approval processes for major and minor changes to the taxonomy, and meet regularly to discuss proposed changes.

Conclusion

I hope that this article helps to provide a first step if you are struggling with an untamed taxonomy of your own. Enterprise Knowledge has many taxonomy, ontology, and KM experts well-versed in problems across many organizational contexts, and we are happy to partner with you at any stage of your taxonomy journey. Contact us to learn more.

The post Trimming an Untamed Taxonomy appeared first on Enterprise Knowledge.