Expert Analysis: Does My Organization Need a Graph Database?
https://enterprise-knowledge.com/expert-analysis-does-my-organization-need-a-graph-database/ | January 14, 2022
As EK works with our clients to integrate knowledge graphs into their technical ecosystems, client stakeholders often ask, “Why should we leverage knowledge graphs?” and more specifically, “Do we need a graph database?” Our consultants then collaborate with stakeholders to weigh the pros and cons of using a knowledge graph and graph database to solve their findability and discoverability use cases. In this blog, two of our senior technical consultants, Bess Schrader and James Midkiff, answer common questions about knowledge graphs and graph databases, focusing on how to best fit them into your organization’s ecosystem without overengineering the solution.

Why should I leverage knowledge graphs?

Bess Schrader

Knowledge graphs model information in the same way human beings think about information, making it easier to organize, store, and logically find data. This reduces the silos between technical users and business users, reduces the ambiguity about what data and information mean, and makes knowledge more sustainable and accessible. Knowledge graphs are the implementation of an ontology, a critical design component for understanding your organization. 

Many graph databases also support inference, allowing you to explore previously uncaptured relationships in your data, based on logic developed in your ontology. This reasoning capability can be an incredibly powerful tool, helping you gain insights from your business logic.
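As a minimal sketch of how this works (assuming a hypothetical ex: namespace rather than any particular ontology), declaring two properties as inverses lets a reasoner derive relationships that were never explicitly stated:

@prefix ex: <http://example.com/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# Business logic captured in the ontology: "employs" is the inverse of "is employed by"
ex:employs owl:inverseOf ex:isEmployedBy .

# A single asserted triple...
ex:BessSchrader ex:isEmployedBy ex:EnterpriseKnowledge .

# ...lets a reasoner infer the unstated triple:
# ex:EnterpriseKnowledge ex:employs ex:BessSchrader .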

James Midkiff

Knowledge graphs are a concept, a way of thinking, and they aren’t necessarily tied to a graph database. Even if you are against adopting a graph database, you should design an ontology for your organization’s data to visualize and align with how you and your colleagues think. Modeling your organization gives you the complete view and vision for how to best leverage your organization’s content. This vision is your knowledge graph, an innovative way for your organization to tackle the latest data problems. However, this ontology doesn’t have to be implemented in a graph database. The technical implementation should be built using technologies that efficiently support the use cases and are easy to maintain.

Does my use case require a graph database?

Bess Schrader

Any organization that wants to map its internal data to external data would benefit from a graph. If your use case includes publishing your data and connecting to other data sets, a knowledge graph and graph database (particularly one that uses the Resource Description Framework, or RDF) are the way to go to ensure the data is flexible and interoperable. Even if you do not intend to connect and/or publish data, storing robust definitions alongside data in a graph is one of the best ways to ensure that the meaning behind fields is not lost. With the addition of RDF*, the expressivity of a graph to describe organizational data is unmatched by other data formats.

When your ontology and instance data are all in the same place (a graph database), technical and non-technical users alike can always determine what a given field is supposed to mean. This ensures that your data is sustainable and maintainable. For example, many organizations use acronyms or abbreviations when setting up relational databases or nested data structures like JSON or XML. However, the definition and usage notes for these fields are often not included alongside the data itself. This leads to situations where data consumers and developers may find, for example, a field called “pqf” in a SQL table or JSON file created several years ago by a former employee. If no one at the organization knows what “pqf” means or what downstream systems might be using this field, this data becomes an unusable maintenance burden.

However, using well-formed ontologies and RDF knowledge graphs, this property “pqf” would be a “first-class citizen” with its own properties, including a label (“Prequalified for Funding”) and definition (“This field indicates whether a customer has been prequalified for a financial product. A value of ‘true’ indicates that the customer has been preapproved”), explaining what the property means and how it should be used. This reduces ambiguity and confusion for both developers and data consumers.
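As a rough sketch of what this looks like in practice (Turtle syntax, with a hypothetical ex: namespace; the property, label, and definition come from the example above), the meaning travels with the data itself:

@prefix ex: <http://example.com/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# The "pqf" field becomes a first-class property with its own label and definition
ex:pqf a owl:DatatypeProperty ;
    rdfs:label "Prequalified for Funding" ;
    skos:definition "Indicates whether a customer has been prequalified for a financial product. A value of 'true' indicates that the customer has been preapproved." .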

James Midkiff

A majority of knowledge graph use cases involve information discovery and search. Graph databases are flexible, allowing you to easily adapt the model as new data and use cases are considered. Additionally, graphs make it painless to aggregate data from separate data sources and combine the data to create a single view of an employee, topic, or other important entity at your organization. Below are some questions to ask when making this decision.

  • Does the use case require data model flexibility and are the use cases going to adapt or change over time?
  • Do you need to combine data from multiple sources into a single view?
  • Do you need to be able to search for multiple types of information simultaneously?

If you answer yes to any of the above, graph databases are a good solution. Some use cases do not require cross-entity examination (i.e., asking questions across relationships) or involve computations that are not easily performed in a graph. In these cases, you should not invest in learning and implementing a graph database. As an alternative, you can create a dynamic model inside of a NoSQL database and provide search functionality via a search engine. You can also do network-based and machine learning calculations in your programming language of choice after a small data transformation. As stated previously, implementations should be largely driven by the use cases they are supporting and will support in the future.

I’m nervous about migrating to a new data format. Why should my team learn about and invest in graph database technologies?

Bess Schrader

In addition to the advantages described above, one major benefit of using RDF-compliant graph databases is that they’re based on standards, including RDF and SPARQL, that the W3C has developed and maintained for over two decades to promote long-term growth for the web. In other words, RDF is not a trendy new format that may disappear in five years, and you can be confident when investing in learning about this technology. The use of standards provides freedom from proprietary vendor tools, enabling you to effortlessly create, move, integrate, and share your data between different standards-compliant software. Using semantic web standards also enables you to seamlessly connect your content and data to a taxonomy (whether internal or external), as most taxonomies are created and stored in an RDF format.

Similarly, SPARQL, the RDF query language, is based on pattern matching and can be easier to learn for non-technical users than more complex programming languages. SPARQL also allows for federated queries, which enable a user to query across multiple knowledge graphs stored in different graph databases, as long as the databases are RDF-compliant. Using federated queries, you could query your own data (e.g., a dataset of articles discussing the stock market and finances) in combination with public knowledge graphs like Wikidata, a free and openly accessible RDF knowledge graph used by Wikipedia. This would allow you to take any article that mentions a stock ticker symbol, follow that symbol to the Wikidata entry, and pull back the size and industry of the organization to which the ticker refers. You could then leverage this information to filter your articles by industry or company size, without needing to gather that information yourself. In other words, federated queries allow you to query beyond the bounds of your own organization’s knowledge.
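As an illustrative sketch only (the local ex: properties are hypothetical, and the Wikidata property IDs used here, P249 for ticker symbol and P452 for industry, should be verified against Wikidata before use), a federated query along these lines would join local articles to Wikidata via the SERVICE keyword:

PREFIX ex: <http://example.com/ontology/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?article ?company ?industry WHERE {
    # Local graph: articles tagged with the ticker symbols they mention
    ?article ex:mentionsTicker ?ticker .

    # Remote graph: resolve each ticker to a company and its industry on Wikidata
    SERVICE <https://query.wikidata.org/sparql> {
        ?company wdt:P249 ?ticker ;
                 wdt:P452 ?industry .
    }
}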

James Midkiff

Many organizations do not need to externally share the knowledge graph data they create. The data may support externally-facing use cases, like chatbots, search, and knowledge panels, and this is usually more than sufficient to meet an organization’s knowledge graph needs. Taxonomies can be transformed and imported into any relational or NoSQL database, much as we translate other data formats into RDF when building a graph. While graph databases can make this connection more seamless, they are by no means the only way to implement a taxonomy. Relational and NoSQL databases are more commonly used, making it easier to find the necessary skill sets to implement and maintain them. And with so many developers used to query languages like SQL, the pattern-based nature of SPARQL can be difficult for developers to learn and adopt.

To be clear, graph databases are an investment. They’re a shift in how we approach and integrate data, which can lead to some adoption costs. However, they can also bring advantages to an organization in addition to what Bess mentioned above:

  • Comprehensive, Connected Data – Graphs provide descriptive data models and the ability to query and combine multiple graphs together painlessly, without requiring the join tables, intermediary schemas, or rules often required by relational databases.
  • Extendable Foundation – Knowledge graphs and graph databases enable the reuse of existing information as well as the flexibility to add more types of data, properties, and relationships to implement new use cases with minimal effort. 
  • Lower Costs – The upfront investment (licensing fees, the cost of migrating data, and the cost of hiring or growing the appropriate skill sets) can balance out in the long term given the flexibility to adapt the data model with evolving data and use cases.

Graph technologies are important to consider when building for the future and scale of data at your organization.

Conclusion

Like any major data architecture component, graph databases have their fair share of both pros and cons, and the choice to use them will ultimately come down to what fits the needs of each organization. If you’re looking for help in determining whether a graph database is a good fit for your organization, contact us.

How Do I Update and Scale My Knowledge Graph?
https://enterprise-knowledge.com/how-do-i-update-and-scale-my-knowledge-graph/ | January 12, 2021
Enterprise Knowledge Graph Governance Best Practices

Successfully building, implementing, and scaling an enterprise knowledge graph is a serious undertaking. Those who have been successful at it would emphasize that it takes a clear definition of need (use cases), an appetite to start small, and a few iterations to get it right. When done right, a knowledge graph provides valuable business outcomes, including scalable organizational flexibility to enrich your data and information with institutional knowledge while aggregating content from numerous sources, enabling your systems to understand the context and the evolving nature of your business domain.

Having worked on multiple knowledge graph implementation projects, I find the most common question I get is, “What does it take for an organization to maintain and update an enterprise knowledge graph?” Though many organizations have been successfully building knowledge graph pilots and prototypes that adequately demonstrate the potential of the technology, few have successfully deployed an enterprise knowledge graph that proves out the true business value and ROI this technology offers. Forethought about governance from the get-go plays a key role in ensuring that the upfront investment in a tangible solution remains a long-term success. Here, I’ll share the key considerations and the approaches we have found effective for growing and managing an enterprise knowledge graph so that it continues serving the upstream and downstream applications that rely on it.

First and foremost, building an effective knowledge graph begins with understanding and defining clear use cases and the business problems that it will be solving for your organization. Starting here will enable you to anticipate and tackle questions like: 

  • “Who will be the primary end-users or subject matter experts?”
  • “What type of data do you need?”
  • “What data or systems will it be applied to?”
  • “How often does your data change?”
  • “Who will be updating and maintaining it?”

Addressing these questions early on will not only allow you to shape your development and implementation scope, but also define a repeatable process for managing change and future efforts. The section below provides specific areas of consideration when getting started.

1. Build it Right – Use Standards

As a natural integration framework, an enterprise knowledge graph is part of an architectural layer that consists of a wide array of solutions, ranging from the organizational data itself, to data models that provide an object- or context-oriented view of your information (taxonomies, ontologies, and the knowledge graph itself), to user-facing applications that allow you to interact with data and information directly (search, analytics dashboards, chatbots, etc.). Thus, properly understanding and designing this architecture is one of the most fundamental steps toward making sure the graph doesn’t become stale or irrelevant.

A practical knowledge graph needs to leverage common semantic information organization models such as metadata schemas, taxonomies, and ontologies. These serve as data models or schemas by representing your content in systems and placing constraints on what types of business entities are connected to a graph and related to one another. Building a knowledge graph through these layers, which serve as “blueprints” of your business processes, helps maintain the identity and structure your knowledge graph needs to continue growing and evolving through time. A knowledge graph built on these explicitly defined logical models makes your business logic machine readable and allows for the understanding of the context and relationships of your data and your business entities. Using these unifying data models also enables you to integrate data in different formats (for example, unstructured PDF documents, relational databases, and structured text formats like XML and JSON), rendering your enterprise data interconnected and reusable across disparate and diverse technologies such as Content Management Systems (CMS) or Customer Relationship Management (CRM) systems.

When building these information models (taxonomies and ontologies), leveraging semantic web standards such as the Resource Description Framework (RDF), the Simple Knowledge Organization System (SKOS), and the Web Ontology Language (OWL) offers many long-term benefits by facilitating governance, interoperability, and scale. Specifically, leveraging these well-established standards when developing your knowledge graph allows you to:

  • Represent and transfer information across multiple systems, solutions, or types of data/content and avoid vendor lock-in to proprietary solutions; 
  • Share your content internally across the organization or externally with other organizations;
  • Support and integrate with publicly available taxonomies, ontologies, and linked open data sources to jump start your enterprise semantic models or to enrich your existing information architecture with industry standards; and
  • Enable your systems to understand business vocabulary and design for its evolution.
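To make the standards above concrete, here is a brief sketch (hypothetical ex: namespace) of a single taxonomy concept expressed in SKOS; the same pattern scales to a full taxonomy or ontology:

@prefix ex: <http://example.com/taxonomy/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

ex:KnowledgeGraph a skos:Concept ;
    skos:prefLabel "Knowledge Graph"@en ;      # the preferred display label
    skos:altLabel "KG"@en ;                    # a synonym or abbreviation
    skos:broader ex:SemanticTechnology ;       # position in the hierarchy
    skos:definition "A graph-based representation of an organization's entities and the relationships between them."@en .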

2. Understand the Frequency of Change and the Volume of Your Data

A viable knowledge graph solution is closely linked to the business model and domain of the organization, which means it should always be relevant, up to date, and accurate, with scalable coverage of all valuable sources of information. Frequent changes to your data model or knowledge graph mean your organization’s domain is in constant shift and needs your knowledge and information to constantly keep up. 

In this context, changes to your content/data include: adding new information or processing new data; updates to your entities or metadata; adding or removing relationships between content; or updating the query that maps your taxonomy/ontology to your content (due to a change in your content).

These types of changes should not require the rebuilding or restructuring of your entire graph.  As such, depending on your industry and use cases, determining the frequency and update intervals as well as your governance model is a good way to effectively govern your enterprise knowledge graph.

For instance, for our clients in the accounting or tax domain, industry and organizational vocabulary/metadata and their underlying processes/content are relatively static. Therefore, the knowledge, entities, and processes in their business domain don’t typically change that frequently. This means real-time updates and editing of their knowledge graph solution at scale may not be a primary need or capability that needs focus right away. Such use cases allow these organizations to realize savings by shifting the focus from enterprise-level metadata management tools or large-scale data engineering solutions to effectively defining their data model and governance to address the immediate use cases or business requirements at hand.

In other scenarios for our clients in the digital marketing and analytics industry, obtaining a 360-view of a consumer in real-time is their bread and butter. This means that marketing and analytics teams need to immediately know when, for example, a “marketable consumer” changes their address or contact information. It is imperative in this case that such rapidly changing business domains have the resources, capabilities, and automation necessary to update and govern their knowledge graphs at scale.

[Venn diagram] Understanding your use cases and how often your knowledge graph needs to be updated helps you determine the right solution architecture and technology investment:

  • Content is mostly static, or the semantic solution is a small proof of concept (PoC): manual data transformation processes requiring human intervention; manual graph creation and data extraction.
  • Content is highly dynamic, or the semantic solution is implemented enterprise-wide: a taxonomy/ontology manager with history tracking and an audit trail to view the history of a concept; an enterprise graph database with APIs to push/pull data programmatically (important for frequently changing data); data engineering pipelines and automation tools; and automated data extraction (text extraction, tagging, etc.).
  • Either category, depending on use cases: AI/ML applications (chatbots, recommendation engines, natural language search, etc.).

3. Develop Programmatic Access Points to Connect Your Applications

Common enterprise knowledge graph solutions are constructed through data transformation pipelines. This provides a repeatable process for mapping structured sources and for extracting, disambiguating, classifying, and tagging unstructured sources. It also means that the main way to affect the data in the knowledge graph is to govern the input data (e.g., exports from taxonomy management systems, content management platforms, database systems, etc.). Otherwise, ad-hoc changes to the knowledge graph will be lost or erased every time new data is loaded from a connected application. 

Construct your graph and ontology in systems or through pipelines. Manage governance at your source systems or front-end applications that are connecting to your graph.

Therefore, designing and implementing a repeatable data extraction and application model that is guided by the governance of the source systems is one of the fundamental architectural patterns for building a reliable knowledge graph.

4. Put Validation Checks and Analytics Processes in Place

Apply checks to identify conflicting information within your knowledge graph. Even though it’s rather challenging to train a knowledge graph to automatically know the right way to organize new knowledge and information, the ability to track and check why certain attributes and values were applied to your data or content should be part of the design for all data that is aggregated in the solution. One technique we’ve used is to segment inferred or predicted data into a separate graph reserved for new and uncertain information. In this way, uncertain data can be isolated from observed or confirmed information, making it easier to trace the origins of inferred information, or to recompute inferences and predictions as your underlying data or artificial intelligence models change. Confidence scores or ratings in both entities and relationships can also be used to indicate graph accuracy. Additional effective practices that provide checks and processes for creating and updating a knowledge graph include instituting consistent naming conventions throughout the design and implementation (e.g., URIs) and establishing guidelines for version control and workflows, including a log of all changes and edits to the graph. Many enterprise knowledge graphs also support the SHACL Semantic Web standard, which can be used to validate your graph when adding new data and check for logical inconsistencies.
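For example, a minimal SHACL shape (hypothetical ex: namespace) that flags any person loaded into the graph without exactly one string identifier might look like this:

@prefix ex: <http://example.com/ontology/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;                 # validate every instance of ex:Person
    sh:property [
        sh:path ex:identifier ;
        sh:datatype xsd:string ;               # identifiers must be strings
        sh:minCount 1 ;                        # at least one identifier...
        sh:maxCount 1 ;                        # ...and no more than one
    ] .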

5. Develop a Governance Plan and Operating Model

An effective knowledge graph governance model addresses the common set of standards and processes to handle changes and requests to the knowledge graph and peripheral systems at all levels. Specifically, a good knowledge graph governance model will provide an approach or specification for the following: 

  • Governance roles and responsibilities. Common governance roles include a governance group of taxonomists/ontologists, data engineers or scientists, database and application managers and administrators, and knowledge or business representatives or analysts;
  • Governance around the data sources that feed the knowledge graph. For instance, when unclean data comes in from a source system, there should be specific roles and processes for correcting it;
  • Specific processes for updating the knowledge graph in the system where it is managed (i.e., processes to ensure major and minor changes to the knowledge graph are accurately assessed and implemented), including governance around adding new data sources: what the process looks like, who needs to be involved, etc.;
  • Approaches to handle changes to the underlying ontology data model. Common change requests include the addition, modification, or deprecation of ontological classes, attributes, synonyms, or relationships; 
  • Approaches to tackling common barriers to continue building and enhancing a successful ontology and knowledge graph. Common challenges include the lack of effective text analytics and extraction tools to automate the organization of content and the application of tags/relationships, and the lack of intuitive management and updating of linked data;  
  • Guidance on communication to stakeholders and end users, including sample messaging and communication best practices and methods; and 
  • Review cadence. Identify common intervals for changes and adjustments to the knowledge graph solution by understanding the complexity and fluidity of your data, and build in recurring review cycles and governance meetings accordingly.

Closing

As a representation of an organization’s knowledge, an enterprise knowledge graph allows for the aggregation of a breadth of information across systems and departments. If left without ownership or a plan, it can easily grow out of sync and result in rework, redesign, and a lot of wasted effort. 

Whether you are just beginning to design an enterprise knowledge graph and wish to understand the value and benefits, or you are looking for a proven approach for defining governance, maintenance, and a plan to scale, check out our additional thought leadership and real-world case studies to learn more. Our expert graph engineers and consultants are also on standby if you need any support. Contact us with any questions.


Why am I Mr. SPARQL?
https://enterprise-knowledge.com/why-am-i-mr-sparql/ | October 9, 2020
Over the past few years, I have gained a lot of experience working with graph databases, RDF, and SPARQL.¹ SPARQL can be tricky for both new and experienced users as it is not always obvious why a query is returning unexpected data. After a brief intro to SPARQL, I will note some reminders and tips to consider when writing SPARQL queries to reduce the number of head-scratching moments. This blog is intended for users who have a basic knowledge of SPARQL.


What is SPARQL?

RDF is a W3C standard model for describing and relating content on the web through triples. It’s also a standard storage model for graph databases. The W3C recommended RDF query language is SPARQL. Similar to other query languages like SQL, SPARQL allows technical business analysts to transform and retrieve data from a graph database. Some graph databases provide support for other query languages but most provide support for both RDF and SPARQL. You can find more detailed information in section 2 of our best practices for knowledge graphs and in our blog titled “Why a Taxonomist Should Know SPARQL.” Now that we have a basic understanding of SPARQL, let’s jump into some SPARQL recommendations.

SPARQL is Based on Patterns

SPARQL queries match patterns in the RDF data. In the WHERE clause of a query, you specify what triples to look for, i.e., what subjects, predicates, and objects you need to answer a question. When retrieving the identifiers of all people in a database, a new SPARQL user might write the query as follows:

SELECT ?id WHERE {
    ?s a :Person .
}

This is a common mistake for new SPARQL-ers, especially those coming from a SQL background. A SPARQL query only knows the patterns that you give it; it does not know the schema of your graph (at least in this instance). The above query has no knowledge of an ?id variable or where to retrieve it from, so every result row will come back with nothing bound to ?id. Extend the query with an additional triple pattern to explicitly define where the ?id variable can be found: 

SELECT ?id WHERE {
    ?s a :Person .
    ?s :identifier ?id .
}

The WHERE clause provides the pattern you wish to match, while the SELECT clause explicitly lists which variables from your WHERE clause you’d like to return.

SPARQL Matches Patterns Exactly

I often find myself unexpectedly restricting or duplicating the results of a query. This is best explained with an example query: “Find the name and telephone number for all people in the database.”

SELECT ?name ?cellNumber WHERE {
    ?s a :Person .
    ?s :name ?name .
    ?s :cellNumber ?cellNumber .
}

The above SPARQL query only returns results for people who have a cell number. This might be what you want, but what if you were looking for a complete list of people regardless of whether they have a cell number? In SPARQL, you would have to wrap the cell number pattern in an OPTIONAL clause.

SELECT ?s ?name ?cellNumber WHERE {
    ?s a :Person .
    ?s :name ?name .

    OPTIONAL {
        ?s :cellNumber ?cellNumber .
    }
}

A person will also appear twice in the results if they have two numbers. If this isn’t the behavior you want, you will need to group the results on the person (?s) and combine the numbers.

SELECT ?s ?name (GROUP_CONCAT(?cellNumber) as ?numbers) WHERE {
    ?s a :Person .
    ?s :name ?name .

    OPTIONAL {
        ?s :cellNumber ?cellNumber .
    }
} GROUP BY ?s ?name

For simplicity, I also assumed that each person only has one name in the database, but you can expand this to meet your data needs.

When writing SPARQL queries, you have to be aware of your data model and know which predicates are required, optional, or multi-valued. If a predicate is required for every subject, you can match it in a pattern with no issues. If a predicate is optional, make sure you are not removing any results that you want. And, if a predicate is multi-valued, you might need to group results to avoid data duplication. It never hurts to run a query to check that your data model matches what you expect. This could lead you to find problems in your data transformation or loading process.
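One quick way to run such a check is a profiling query like the sketch below, which lists every predicate in the database and how often it is used; unexpected counts often point to optional or multi-valued predicates you did not know about:

SELECT ?p (COUNT(*) as ?uses) WHERE {
    ?s ?p ?o .
} GROUP BY ?p
ORDER BY DESC(?uses)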

Subqueries and Unions Can Save Complexity

Occasionally a query I am writing needs to cover a number of different conditions. An example query is, “Find all topics and countries that our content is tagged with that have a tagging score greater than 50.” This question is not too complex on its own, but it helps emphasize the point. 

You could write this query and go down the rabbit hole of IF and BIND as I initially did. A SPARQL IF statement allows you to select between two values based on a boolean (true or false) statement. BIND statements let you set the value of a variable. IF and BIND statements are very useful in certain situations for dynamically setting variables. The above query could be written as follows.

SELECT 
    ?content 
    (GROUP_CONCAT(?topic) as ?topics)
    (GROUP_CONCAT(?country) as ?countries)
WHERE {
    ?content :tagged ?tag .

    # Verify the tag is for a topic
    ?tag :about ?term .
    ?term a ?type .
    BIND(IF(?type = :Topic, ?term, ?null) as ?topic)
    BIND(IF(?type = :Country, ?term, ?null) as ?country)

    # Check the score
    ?tag :score ?score .
    FILTER(?score > 50)
} GROUP BY ?content

The query matches the type of each term associated with ?content and then sets the value of ?topic and ?country based on the type. We use a FILTER to restrict the tags to only those with a score greater than 50. In this case, the query solves the question by leveraging a nifty use of BIND and IF, but there are less complex solutions. 

As your queries and data get more complex, the RDF patterns that you need to match may not line up as nicely. In our case, the relationship between content and topics or countries is the same, so we only needed to include two lines of logic. A much simpler approach is to UNION together two subqueries or subpatterns. This allows the query to retrieve topics and countries separately, matching two different sets of RDF patterns.

SELECT 
    ?content 
    (GROUP_CONCAT(?topic) as ?topics)
    (GROUP_CONCAT(?country) as ?countries)
WHERE {
    {
        ?content :tagged ?tag .

        # Verify the tag is for a topic
        ?tag :about ?topic .
        ?topic a :Topic .

        # Check the score
        ?tag :score ?score .
        FILTER(?score > 50)
    } UNION {
        ?content :tagged ?tag .

        # Verify the tag is for a country
        ?tag :about ?country .
        ?country a :Country .

        # Check the score
        ?tag :score ?score .
        FILTER(?score > 50)
    }
} GROUP BY ?content

This breaks up the SPARQL query into two smaller queries that are much easier to approach without needing to worry about how to combine multiple sets of patterns in the same query. Additionally, this query could be optimized by using a subquery that retrieves the content and tags with a score above 50 before checking for the valid types.

SELECT
    ?content 
    (GROUP_CONCAT(?topic) as ?topics)
    (GROUP_CONCAT(?country) as ?countries)
WHERE {
    {
        SELECT ?content ?tag WHERE {
            ?content :tagged ?tag .

            # Check the score
            ?tag :score ?score .
            FILTER(?score > 50)
        }
    }
    {
        # Verify the tag is for a topic
        ?tag :about ?topic .
        ?topic a :Topic .
    } UNION {
        # Verify the tag is for a country
        ?tag :about ?country .
        ?country a :Country .
    }
} GROUP BY ?content

In this query, the results of the subquery are merged with the results of the UNION enabling us to still apply custom patterns to topics and countries. We use a subquery in order to avoid matching the ?content and ?tag values more than once and the merge enforces that every tag has to be about a topic or a country.

Final SPARQL Thoughts

SPARQL is a robust query language for working with RDF data. Try not to overlook uncommon SPARQL functions (such as VALUES, STRDT, and SAMPLE) and check if your graph database has any proprietary functions that you can leverage for even more flexibility. As a more general recommendation, always take the time to step back and see if there’s a cleaner, more efficient way to retrieve the data you need.
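As one small sketch (reusing the hypothetical :Topic and :Country classes from the earlier examples), VALUES lets you inline a set of allowed bindings directly into a query, which can replace a simple UNION:

SELECT ?term ?type WHERE {
    # Restrict ?type to an inline list of values
    VALUES ?type { :Topic :Country }
    ?term a ?type .
}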

Enterprise Knowledge writes performant queries and designs data models to enable advanced graph solutions. If you can’t find your own Mr. SPARQL unicorn, whether you are new to the graph space or looking to optimize your existing data, contact us to discuss how EK can help take your solution to the next level.

¹If the title of this blog is familiar, that’s because it is a reference to an episode of The Simpsons. In one episode, a Japanese cleaning agency used Homer’s face (or one closely resembling it) as the logo of a brand called “Mr. Sparkle.” Homer calls up the brand and asks, “Why am I Mr. Sparkle?” One of my colleagues mentioned that he was reminded of this episode anytime he heard me discussing SPARQL with the rest of the EK Team. 

I have been using SPARQL actively for the past 3 years and have come to recognize that it requires a unique mindset. There are common gotcha moments and optimization techniques for improving queries, but writing the initial query requires an understanding of the RDF format, and piecing it together correctly is just as critical as making it efficient. The most effective SPARQL developers in your organization will be the unicorns, the individuals with a knowledge of code who are able to adjust course quickly, hold complex logic in their head, and enjoy the time it takes to solve puzzles.

RDF*: What is it and Why do I Need it?
https://enterprise-knowledge.com/rdf-what-is-it-and-why-do-i-need-it/ | July 24, 2020
RDF* (pronounced RDF star) is an extension to the Resource Description Framework (RDF) that enables RDF graphs to more intuitively represent complex interactions and attributes through the implementation of embedded triples. This allows graphs to capture relationships between more than two entities, add metadata to existing relationships, and add provenance information to all triples, reducing the burden of maintenance.

But let’s back up…before we talk about RDF*, let’s cover the basics — what is RDF, and how is RDF* different from RDF?

What is RDF?

The Resource Description Framework (RDF) is a semantic web standard used to describe and model information for web resources or knowledge management systems. RDF consists of “triples,” or statements, with a subject, predicate, and object that resemble an English sentence. 

For example, take the English sentence: “Bess Schrader is employed by Enterprise Knowledge.” This sentence has:

  • A subject: Bess Schrader
  • A predicate: is employed by 
  • An object: Enterprise Knowledge

Bess Schrader and Enterprise Knowledge are two entities that are linked by the relationship is employed by. An RDF triple representing this information would look like this:

Visual representation of the RDF triple "Bess Schrader is employed by Enterprise Knowledge"
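In Turtle syntax (using a hypothetical ex: namespace in place of a real vocabulary), that triple could be written as:

@prefix ex: <http://example.com/ontology/> .

ex:BessSchrader ex:isEmployedBy ex:EnterpriseKnowledge .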

(There are many ways, or serializations, to represent RDF. In this blog, I’ll be using the Turtle syntax because it’s easy to read, but this information could also be shown in RDF/XML, JSON-LD (JSON for Linking Data), and other formats.)

The World Wide Web Consortium (W3C) maintains the RDF Specification, making it easy for applications and organizations to develop RDF data in an interoperable way. This means if you create RDF data in one tool and share it with someone else using a different RDF tool, they will still be able to easily use your data. This interoperability allows you to build on what’s already been done — you can combine your enterprise knowledge graph with established, open RDF datasets like Wikidata, jump starting your analytic capabilities. This also makes data sharing and migration between internal RDF systems simple, enabling you to unify data and reducing your dependency on a single tool or vendor.

For more information on RDF and how it can be used, check out Why a Taxonomist Should Know SPARQL.

What are the limitations of RDF (Why is RDF* necessary)?

Standard RDF has many strengths:

  • Like most graph models, it more intuitively captures the way we think about the world as humans (as networks, not as tables), making it easier to design, capture, and query data.
  • As a standard supported by the W3C, it allows us to create interoperable data and systems, all using the same standard to represent and encode data.

However, it has one key weakness: because RDF is based on triples, standard RDF can only connect two entities at a time. For many use cases, this limitation isn’t a problem. Consider my example from above, where I want to represent the relationship between me and my employer:

Visual representation of the RDF triple "Bess Schrader is employed by Enterprise Knowledge"

Simple! However, what if I want to capture the role or position that I hold at this organization? I could add a triple denoting my position:

A visual representation of an additional triple, showing not only that Bess Schrader is employed by enterprise knowledge, but also that Bess Schrader holds the position of Consultant

Great! But what if I decide to add in my (fictional) employment history?

These triples attempt to add employment history, showing that not only is Bess Schrader employed by enterprise knowledge and holds the position of consultant, but also that she is employed by Hogwarts and holds position professor

Now it’s unclear whether I was a consultant at Enterprise Knowledge or at Hogwarts. 

There are a variety of ways to address this problem in RDF. One of the most popular is reification or n-ary relations, in which you create an intermediary node that allows you to group more than two entities together. For example:

Triples with the addition of intermediary nodes, "Employment Event 1" and "Employment Event 2" to add the temporality that RDF triples do not allow for

Using this technique allows you to clear up confusion and model the complexity of the world. However, adding these intermediary nodes takes away some of the simplicity of graph data — the idea of an “employment event” isn’t exactly intuitive.

There are many other methods that have been developed to handle this kind of complexity in RDF, including singleton properties and named graphs/quads. Additionally, an entirely different type of non-RDF graph model, labeled property graphs, allows users to attach properties directly to relationships. However, labeled property graphs don’t allow for interoperability at the same scale as RDF — it’s much harder to share and combine different data sets, and moving data from tool to tool isn’t as simple.

None of these solutions retain both of the strengths of RDF: the interoperable standards and the intuitive data model. This crucial limitation of RDF has limited its effectiveness in certain applications, particularly those involving temporal or transactional data.

What is RDF*?

RDF* (pronounced RDF-star) is an extension to RDF that proposes a solution to the weaknesses of RDF mentioned above. As an extension, RDF* supplements RDF but doesn’t replace it. 

The main idea behind RDF* is to treat a triple as a single entity. By “nesting” or “embedding” triples, an entire triple can become the subject of a second triple. This allows you to add metadata to triples: assigning attributes to a triple, or creating relationships not just between two entities in your knowledge graph, but between triples and entities, or between triples and triples. Take our example from above. In standard RDF, if I want to express past employers and positions, I need to use reification: 

Triples with the addition of intermediary nodes, "Employment Event 1" and "Employment Event 2" to add the temporality that RDF triples do not allow for

In RDF*, I can use nested triples to simply denote the same information:

Visual representation of a nested triple
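In Turtle extended for RDF* (again with hypothetical ex: terms), each employment triple is embedded with << >> and given its own position:

@prefix ex: <http://example.com/ontology/> .

# Each embedded triple becomes the subject of a statement about that employment
<< ex:BessSchrader ex:isEmployedBy ex:EnterpriseKnowledge >> ex:holdsPosition ex:Consultant .
<< ex:BessSchrader ex:isEmployedBy ex:Hogwarts >> ex:holdsPosition ex:Professor .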

This eliminates the need for intermediary entities and makes the model easier to understand and implement. 

Just as standard RDF can be queried via the SPARQL query language, RDF* can be queried using SPARQL*, allowing users to query both standard and nested triples.
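A hedged SPARQL* sketch of the same model (hypothetical ex: terms) would retrieve each employer together with the position held there:

PREFIX ex: <http://example.com/ontology/>

SELECT ?org ?position WHERE {
    << ex:BessSchrader ex:isEmployedBy ?org >> ex:holdsPosition ?position .
}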

Currently, RDF* is under consideration by the W3C and has not yet been officially accepted as a standard. However, the specification has been formally defined in Foundations of an Alternative Approach to Reification in RDF, and many enterprise tools supporting RDF have added support for RDF* (including BlazeGraph, AnzoGraph, Stardog, and GraphDB). Hopefully this standard will be formally adopted by the W3C, allowing it to retain and build on the original strengths of RDF: its intuitive model/simplicity and interoperability.

What are the benefits of RDF*?

As you can see above, RDF* can be used to represent relationships that involve more than one entity (e.g. person, role, and organization) in a more intuitive manner than standard RDF. However, RDF* has additional use cases, including:

  • Adding metadata to a relationship (For example: start dates and end dates for jobs, marriages, events, etc.)

Illustrates the addition of start dates to each nested triple

  • Adding provenance information for triples: I have a triple that indicates Bess Schrader works for Enterprise Knowledge. When did I add this triple to my graph? What was the source of this information? Who added the information to the graph?

Illustrates how you can add additional metadata to nested triples
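Pulling both of these ideas together, a sketch in Turtle for RDF* (all ex: properties and the literal values are hypothetical, for illustration) might attach a start date and provenance to the same embedded triple:

@prefix ex: <http://example.com/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<< ex:BessSchrader ex:isEmployedBy ex:EnterpriseKnowledge >>
    ex:startDate "2017-06-01"^^xsd:date ;   # hypothetical start date for the job
    ex:source ex:HRSystem ;                 # where the statement came from
    ex:addedOn "2020-07-24"^^xsd:date .     # when it entered the graph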

Conclusion

On its own, RDF provides an excellent way to create, combine, and share semantic information. Extending this framework with RDF* gives knowledge engineers more flexibility to model complex interactions between multiple entities, attach attributes to relationships, and store metadata about triples, helping us more accurately model the real world while improving our ability to understand and verify where data originates. 

Looking for more information on RDF* and how you can leverage it to solve your data challenges? Contact Enterprise Knowledge.
