SPARQL Articles - Enterprise Knowledge
https://enterprise-knowledge.com/tag/sparql/

Transforming Tabular Data into Personalized, Componentized Content using Knowledge Graphs in Python
https://enterprise-knowledge.com/transforming-tabular-data-into-personalized-componentized-content-using-knowledge-graphs-in-python/

My colleagues Joe Hilger and Neil Quinn recently wrote blogs highlighting the benefits of leveraging a knowledge graph in tandem with a componentized content management system (CCMS) to curate personalized content for users. Hilger set the stage by explaining the business value of a personalized digital experience and the logistics of these two technologies supporting one another to create it. Quinn then made these concepts more tangible by processing sample data into a knowledge graph in Python and querying the graph to find tailored information for a particular user. This post will again show the creation and querying of a knowledge graph in Python; this time, however, the same sample data will be sourced from external CSV files.

A Quick Note on CSVs

CSV files, or comma-separated values files, are widely used to store tabular data. If your company uses spreadsheet applications, such as Microsoft Excel or Google Sheets, or relational databases, then it is likely you have encountered CSV files before. This post will help you take the CSV-formatted data that already exists throughout your company, transform it into a usable knowledge graph, and resurface relevant pieces of information to users in a CCMS. Although this example uses CSV files as the tabular dataset format, the same principles apply to Excel sheets and SQL tables alike.

Aggregating Data

The diagram below is a visual model of the knowledge graph we will create from data in our example CSV files.

Diagram showing customers, products and parts

In order to populate this graph, just as in Quinn’s blog, we will begin with three sets of data about:

  • Customers and the products they own
  • Products and the parts they are composed of
  • Parts and the actions that need to be taken on them

This information is stored in three CSV files, Customer_Data.csv, Product_Data.csv and Part_Data.csv:

Customers

Customer ID    Customer Name    Owns Product
1              Stephen Smith    Product A
2              Lisa Lu          Product A

Products

Product ID    Product Name    Composed of
1             Product A       Part X
1             Product A       Part Y
1             Product A       Part Z

Parts

Part ID    Part Name    Action
1          Part X
2          Part Y
3          Part Z       Recall

To create a knowledge graph from these tables, we will need to:

  • Read the data tables from our CSV files into DataFrames (objects representing 2-D data structures, such as spreadsheets or tables)
  • Transform the DataFrames into RDF triples and add them to the graph

In order to accomplish these two tasks, we will be utilizing two Python libraries. Pandas, a data analysis and manipulation library, will help us read our CSV files into DataFrames, and rdflib, a library for working with RDF data, will allow us to create RDF triples from the data in our DataFrames.

Reading CSV Data

This first task is quite easy to accomplish using pandas. Pandas provides a read_csv function for ingesting CSV data into a DataFrame. For this use case, we only need to provide two parameters: the CSV’s file path and the number of rows to read. To read the Customers table from our Customer_Data.csv file:

import pandas as pd

customer_data = pd.read_csv("Customer_Data.csv", nrows=2)

The value of customer_data is:

       Customer ID      Customer Name     Owns Product
0                1      Stephen Smith        Product A
1                2            Lisa Lu        Product A

We repeat this process for the Products and Parts files, altering the filepath_or_buffer and nrows parameters to reflect the respective file’s location and table size.
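
For example, using the row counts from the sample tables above, the remaining two reads might look like this, storing the results in product_data and part_data for use later when we populate the graph:

product_data = pd.read_csv("Product_Data.csv", nrows=3)
part_data = pd.read_csv("Part_Data.csv", nrows=3)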

Tabular to RDF

Now that we have our tabular data stored in DataFrame variables, we are going to use rdflib to create subject-predicate-object triples for each column/row entry in the three DataFrames. I would recommend reading Quinn’s blog prior to this one as I am following the methods and conventions that he explains in his post. 

Utilizing the Namespace class will provide us a shorthand for creating URIs, and the create_eg_uri function will URL-encode our data values.

from rdflib import Namespace, URIRef
from urllib.parse import quote

EG = Namespace("http://example.com/")

def create_eg_uri(name: str) -> URIRef:
    """Take a string and return a valid example.com URI"""
    quoted = quote(name.replace(" ", "_"))
    return EG[quoted]

The columns in our data tables will need to be mapped to predicates in our graph. For example, the Owns Product column in the Customers table will map to the http://example.com/owns predicate in our graph. We must define the column to predicate mappings for each of our tables before diving into the DataFrame transformations. Additionally, each mapping object contains a “uri” field which indicates the column to use when creating the unique identifier for an object.

customer_mapping = {
    "uri": "Customer Name",
    "Customer ID": create_eg_uri("customerId"),
    "Customer Name": create_eg_uri("customerName"),
    "Owns Product": create_eg_uri("owns"),
}

product_mapping = {
    "uri": "Product Name",
    "Product ID": create_eg_uri("productId"),
    "Product Name": create_eg_uri("productName"),
    "Composed of": create_eg_uri("isComposedOf"),
}

part_mapping = {
    "uri": "Part Name",
    "Part ID": create_eg_uri("partId"),
    "Part Name": create_eg_uri("partName"),
    "Action": create_eg_uri("needs"),
}

uri_objects = ["Owns Product", "Composed of", "Action"]

The uri_objects variable created above indicates which columns from the three data tables should have their values parsed as URI References, rather than Literals. For example, Composed of maps to a Part object. We want the <Part> object in the triple EG:Product_A EG:isComposedOf <Part> to be a URI pointing to (referencing) a particular Part, not just the string name of the Part. The Product Name column, by contrast, creates triples such as EG:Product_A EG:productName "Product A", where the object is simply a string, i.e. a Literal, and not a reference to another object.

Now, using all of the variables and methods declared above, we can begin the translation from DataFrame to RDF. For the purposes of this example, we create a global graph variable and a reusable translate_df_to_rdf function which we will call for each of the three DataFrames. With each call to the translate function, all triples for that particular table are added to the graph.

from rdflib import URIRef, Graph, Literal
import pandas as pd

graph = Graph()

def translate_df_to_rdf(customer_data, customer_mapping):
    # Counter variable representing current row in the table
    i = 0
    num_rows = len(customer_data.index)

    # For each row in the table
    while i < num_rows:
        # Create URI subject for triples in this row using the column named in the "uri" field
        name = customer_data.loc[i, customer_mapping["uri"]]
        row_uri = create_eg_uri(name)

        # For each column/predicate mapping in mapping dictionary
        for column_name, predicate in customer_mapping.items():
            # Skip the "uri" entry; it only names the identifier column and is not a real column/predicate pair
            if column_name == "uri":
                continue

            # Grab the value at this specific row/column entry
            value = customer_data.loc[i, column_name]

            # Strip extra characters from value
            if isinstance(value, str):
                value = value.strip()

            # Check if the value exists
            if not pd.isnull(value):
                # Determine if object should be a URI or Literal
                if column_name in uri_objects:
                    # Create URI object and add triple to graph
                    uri_value = create_eg_uri(value)
                    graph.add((row_uri, predicate, uri_value))
                else:
                    # Create Literal object and add triple to graph
                    graph.add((row_uri, predicate, Literal(value)))
        i = i + 1

With the translation function defined, we make three calls to translate_df_to_rdf, one for each DataFrame:

translate_df_to_rdf(customer_data, customer_mapping)
translate_df_to_rdf(product_data, product_mapping)
translate_df_to_rdf(part_data, part_mapping)

Querying the Graph

Now that our graph is populated with the Customers, Products, and Parts data, we can query it for personalized content of our choosing. So, if we want to find all customers who own products that are composed of parts that need a recall, we can create and use the same query from Quinn’s previous blog:

sparql_query = """SELECT ?customer ?product
WHERE {
  ?customer eg:owns ?product .
  ?product eg:isComposedOf ?part .
  ?part eg:needs eg:Recall .
}"""

results = graph.query(sparql_query, initNs={"eg": EG})
for row in results:
    print(row)

As you would expect, the results printed in the console are two ?customer ?product pairings:

(rdflib.term.URIRef('http://example.com/Stephen_Smith'), rdflib.term.URIRef('http://example.com/Product_A'))
(rdflib.term.URIRef('http://example.com/Lisa_Lu'), rdflib.term.URIRef('http://example.com/Product_A'))
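
As an optional final step, the populated graph can also be saved for reuse elsewhere; rdflib can write it out in Turtle or another RDF serialization (the file name below is arbitrary):

graph.serialize(destination="customer_product_graph.ttl", format="turtle")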

Summary

By transforming our CSV files into RDF triples, we created a centralized, connected graph of information, enabling the simple retrieval of very granular and case-specific data. In this case, we simply traversed the relationships in our graph between Customers, Products, Parts, and Actions to determine which Customers needed to be notified of a recall. In practice, these concepts can be expanded to meet any personalization needs for your organization.

Knowledge Graphs are an integral part of serving up targeted, useful information via a Componentized Content Management System, and your organization doesn’t need to start from scratch. CSVs and tabular data can easily be transformed into RDF and aggregated as the foundation for your organization’s Knowledge Graph. If you are interested in transforming your data into RDF and want help planning or implementing a transformation effort, contact us here.

Content Personalization with Knowledge Graphs in Python
https://enterprise-knowledge.com/content-personalization-with-knowledge-graphs-in-python/

In a recent blog post, my colleague Joe Hilger described how a knowledge graph can be used in conjunction with a componentized content management system (CCMS) to provide personalized content to customers. This post will show the example data from Hilger’s post being loaded into a knowledge graph and queried to find the content appropriate for each customer, using Python and the rdflib package. In doing so, it will help make these principles more concrete, and help you in your journey towards content personalization.

To follow along, a basic understanding of Python programming is required.

Aggregating Data

Hilger’s article shows the following visualization of a knowledge graph to illustrate how the graph connects data from many different sources and encodes the relationship between them.

Diagram showing customers, products and parts

To show this system in action, we will start out with a few sets of data about:

  • Customers and the products they own
  • Products and the parts they are composed of
  • Parts and the actions that need to be taken on them

In practice, this information would be pulled from the sales tracking, product support, and other systems it lives in via APIs or database queries, as described by Hilger.

customers_products = [
    {"customer": "Stephen Smith", "product": "Product A"},
    {"customer": "Lisa Lu", "product": "Product A"},
]

products_parts = [
    {"product": "Product A", "part": "Part X"},
    {"product": "Product A", "part": "Part Y"},
    {"product": "Product A", "part": "Part Z"},
]
parts_actions = [{"part": "Part Z", "action": "Recall"}]

We will enter this data into a graph as a series of subject-predicate-object triples, each of which represents a node (the subject) and its relationship (the predicate) to another node (the object). RDF graphs use uniform resource identifiers (URIs) to provide a unique identifier for both nodes and relationships, though an object can also be a literal value.

Unlike the traditional identifiers you may be used to in a relational database, URIs in RDF always use a URL format (meaning they begin with http://), although a URI is not required to point to an existing website. The base part of this URI is referred to as a namespace, and it’s common to use your organization’s domain as part of this. For this tutorial we will use http://example.com as our namespace.

We also need a way to represent these relationship predicates. For most enterprise RDF knowledge graphs, we start with an ontology, which is a data model that defines the types of things in our graph, their attributes, and the relationships between them. For this example, we will use the following relationships:

Relationship                           URI
Customer’s ownership of a product      http://example.com/owns
Product being composed of a part       http://example.com/isComposedOf
Part requiring an action               http://example.com/needs

Note the use of camelCase in the name – for more best practices in ontology design, including how to incorporate open standard vocabularies like SKOS and OWL into your graph, see here.

The triple representing Stephen Smith’s ownership of Product A in rdflib would then look like this, using the URIRef class to encode each URI:

from rdflib import URIRef

triple = (
    URIRef("http://example.com/Stephen_Smith"),
    URIRef("http://example.com/owns"),
    URIRef("http://example.com/Product_A"),
)

Because typing out full URLs every time you want to add or reference a component of a graph can be cumbersome, most RDF-compliant tools and development resources provide some shorthand way to refer to these URIs. In rdflib that’s the Namespace class. Here we create our own namespace for example.com, and use it to more concisely create that triple:

from rdflib import Namespace

EG = Namespace("http://example.com/")

triple = (EG["Stephen_Smith"], EG["owns"], EG["Product_A"])

We can further simplify this process by defining a function to transform these strings into valid URIs using the quote function from the urllib.parse module:

from urllib.parse import quote

def create_eg_uri(name: str) -> URIRef:
    """Take a string and return a valid example.com URI"""
    quoted = quote(name.replace(" ", "_"))
    return EG[quoted]

Now, let’s create a new Graph object and add these relationships to it:

from rdflib import Graph

graph = Graph()

owns = create_eg_uri("owns")
for item in customers_products:
    customer = create_eg_uri(item["customer"])
    product = create_eg_uri(item["product"])
    graph.add((customer, owns, product))

is_composed_of = create_eg_uri("isComposedOf")
for item in products_parts:
    product = create_eg_uri(item["product"])
    part = create_eg_uri(item["part"])
    graph.add((product, is_composed_of, part))

needs = create_eg_uri("needs")
for item in parts_actions:
    part = create_eg_uri(item["part"])
    action = create_eg_uri(item["action"])
    graph.add((part, needs, action))

Querying the Graph

Now we are able to query the graph, in order to find all of the customers that own a product containing a part that requires a recall. To do this, we’ll construct a query in SPARQL, the query language for RDF graphs.

SPARQL has some features in common with SQL, but works quite differently. Instead of selecting from a table and joining others, we will describe a path through the graph based on the relationships each kind of node has to another:

sparql_query = """SELECT ?customer ?product
WHERE {
  ?customer eg:owns ?product .
  ?product eg:isComposedOf ?part .
  ?part eg:needs eg:Recall .
}"""

The WHERE clause asks for:

  1. Any node that has an owns relationship to another – the subject is bound to the variable ?customer and the object to ?product
  2. Any node that has an isComposedOf relationship to the ?product from the previous line, the subject of which is then bound to ?part
  3. Any node that has a needs relationship to the object eg:Recall – here the subject is the ?part bound on the previous line

Note that we did not at any point tell the graph which of the URIs in our graph referred to a customer. By simply looking for any node that owns something, we were able to find the customers automatically. If we had a requirement to be more explicit about typing, we could add triples to our graph describing the type of each entity using the RDF type relationship, then refer to these in the query.
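
As a rough sketch of that optional typing approach (the class URI eg:Customer below is a hypothetical addition, not something defined in the example data), we could assert a type for each customer while loading the data:

from rdflib import RDF

for item in customers_products:
    # Assert that each customer node is an eg:Customer (hypothetical class)
    graph.add((create_eg_uri(item["customer"]), RDF.type, EG["Customer"]))

The query could then include an extra pattern such as ?customer a eg:Customer . to match only nodes explicitly typed as customers.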

We can then execute this query against the graph, using the initNs argument to map the “eg:” prefixes in the query string to our example.com namespace, and print the results:

results = graph.query(sparql_query, initNs={"eg": EG})
 
for row in results:
    print(row)

This shows us the URIs for the affected customers and the products they own:

(rdflib.term.URIRef('http://example.com/Stephen_Smith'), rdflib.term.URIRef('http://example.com/Product_A'))
(rdflib.term.URIRef('http://example.com/Lisa_Lu'), rdflib.term.URIRef('http://example.com/Product_A'))

These fields could then be sent back to our componentized content management system, allowing it to send the appropriate recall messages to those customers!

Summary

The concepts and steps described in this post are generally applicable to setting up a knowledge graph in any environment, whether in-memory using Python or Java, or with a commercial graph database product. By breaking your organization’s content down into chunks inside a componentized content management system and using the graph to aggregate this data with your other systems, you can ensure that the exact content each customer needs to see gets delivered to them at the right time. You can also use your graph to create effective enterprise search systems, among many other applications.

Interested in best in class personalization using a CCMS plus a knowledge graph? Contact us.

Why am I Mr. SPARQL?
https://enterprise-knowledge.com/why-am-i-mr-sparql/

Over the past few years, I have gained a lot of experience working with graph databases, RDF, and SPARQL.¹ SPARQL can be tricky for both new and experienced users as it is not always obvious why a query is returning unexpected data. After a brief intro to SPARQL, I will note some reminders and tips to consider when writing SPARQL queries to reduce the number of head-scratching moments. This blog is intended for users who have a basic knowledge of SPARQL.

A cartoon unicorn flying through the air followed by a trail of sparkles, with the block letters SPARQL and Enterprise Knowledge above it.

What is SPARQL?

RDF is a W3C standard model for describing and relating content on the web through triples. It’s also a standard storage model for graph databases. The W3C recommended RDF query language is SPARQL. Similar to other query languages like SQL, SPARQL allows technical business analysts to transform and retrieve data from a graph database. Some graph databases provide support for other query languages but most provide support for both RDF and SPARQL. You can find more detailed information in section 2 of our best practices for knowledge graphs and in our blog titled “Why a Taxonomist Should Know SPARQL.” Now that we have a basic understanding of SPARQL, let’s jump into some SPARQL recommendations.

SPARQL is Based on Patterns

SPARQL queries match patterns in the RDF data. In the WHERE clause of a query, you specify what triples to look for, i.e. what subjects, predicates, and objects you need to answer a question. When retrieving the identifier of all people in a database, a new SPARQL user might write the query as follows:

SELECT ?id WHERE {
    ?s a :Person .
}

This is a common mistake for new SPARQL-ers, especially those coming from a SQL background. A SPARQL query only knows the patterns that you give it – it does not know the schema of your graph (at least in this instance). The above query has no knowledge of an ?id variable or where to retrieve it from, so the query will fail to retrieve data. Extend the query with an additional triple pattern to explicitly define where the ?id variable can be found:

SELECT ?id WHERE {
    ?s a :Person .
    ?s :identifier ?id .
}

The WHERE clause provides the pattern you wish to match, while the SELECT clause explicitly lists which variables from your WHERE clause you’d like to return.

SPARQL Matches Patterns Exactly

I often find myself unexpectedly restricting or duplicating the results of a query. This is best explained with an example query: “Find the name and telephone number for all people in the database.”

SELECT ?name ?cellNumber WHERE {
    ?s a :Person .
    ?s :name ?name .
    ?s :cellNumber ?cellNumber .
}

The above SPARQL query only returns a result for people that have a cell number. This might be what you want, but what if you were looking for a complete list of people regardless of whether they have a cell number? In SPARQL, you would have to wrap the cell number pattern in an OPTIONAL clause.

SELECT ?s ?name ?cellNumber WHERE {
    ?s a :Person .
    ?s :name ?name .

    OPTIONAL {
        ?s :cellNumber ?cellNumber .
    }
}

A person will also appear twice in the results if they have two numbers. If this isn’t the behavior you want, you will need to group the results on the person (?s) and combine the numbers.

SELECT ?s ?name (GROUP_CONCAT(?cellNumber) as ?numbers) WHERE {
    ?s a :Person .
    ?s :name ?name .

    OPTIONAL {
        ?s :cellNumber ?cellNumber .
    }
} GROUP BY ?s ?name

For simplicity, I also assumed that each person only has one name in the database, but you can expand this to meet your data needs.

When writing SPARQL queries, you have to be aware of your data model and know which predicates are required, optional, or multi-valued. If a predicate is required for every subject, you can match it in a pattern with no issues. If a predicate is optional, make sure you are not removing any results that you want. And, if a predicate is multi-valued, you might need to group results to avoid data duplication. It never hurts to run a query to check that your data model matches what you expect. This could lead you to find problems in your data transformation or loading process.
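
One quick sanity check along those lines is a query that counts how often each predicate appears, which makes missing or multi-valued properties easy to spot (a generic sketch, not tied to any particular model):

SELECT ?predicate (COUNT(?s) as ?uses) WHERE {
    ?s ?predicate ?o .
} GROUP BY ?predicate
ORDER BY DESC(?uses)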

Subqueries and Unions Can Save Complexity

Occasionally a query I am writing needs to cover a number of different conditions. An example query is, “Find all topics and countries that our content is tagged with that have a tagging score of greater than 50.” This question is not too complex on its own but it helps emphasize the point. 

You could write this query and go down the rabbit hole of IF and BIND as I initially did. A SPARQL IF statement allows you to select between two values based on a boolean (true or false) statement. BIND statements let you set the value of a variable. IF and BIND statements are very useful in certain situations for dynamically setting variables. The above query could be written as follows.

SELECT 
    ?content 
    (GROUP_CONCAT(?topic) as ?topics)
    (GROUP_CONCAT(?country) as ?countries)
WHERE {
    ?content :tagged ?tag .

    # Verify the tag is for a topic
    ?tag :about ?term .
    ?term a ?type .
    BIND(IF(?type = :Topic, ?term, ?null) as ?topic)
    BIND(IF(?type = :Country, ?term, ?null) as ?country)

    # Check the score
    ?tag :score ?score .
    FILTER(?score > 50)
} GROUP BY ?content

The query matches the type of each term associated with ?content and then sets the value of ?topic and ?country based on the type. We use a FILTER to restrict the tags to only those with a score greater than 50. In this case, the query solves the question by leveraging a nifty use of BIND and IF, but there are less complex solutions. 

As your queries and data get more complex, the RDF patterns that you need to match may not line up as nicely. In our case, the relationship between content and topics or countries is the same, so we only needed to include two lines of logic. A much simpler approach is to UNION together two subqueries or subpatterns. This allows the query to retrieve topics and countries separately, matching two different sets of RDF patterns.

SELECT 
    ?content 
    (GROUP_CONCAT(?topic) as ?topics)
    (GROUP_CONCAT(?country) as ?countries)
WHERE {
    {
        ?content :tagged ?tag .

        # Verify the tag is for a topic
        ?tag :about ?topic .
        ?topic a :Topic .

        # Check the score
        ?tag :score ?score .
        FILTER(?score > 50)
    } UNION {
        ?content :tagged ?tag .

        # Verify the tag is for a country
        ?tag :about ?country .
        ?country a :Country .

        # Check the score
        ?tag :score ?score .
        FILTER(?score > 50)
    }
} GROUP BY ?content

This breaks up the SPARQL query into two smaller queries that are much easier to approach without needing to worry about how to combine multiple sets of patterns in the same query. Additionally, this query could be optimized by using a subquery that retrieves the content and tags with a score above 50 before checking for the valid types.

SELECT
    ?content 
    (GROUP_CONCAT(?topic) as ?topics)
    (GROUP_CONCAT(?country) as ?countries)
WHERE {
    {
        SELECT ?content ?tag WHERE {
            ?content :tagged ?tag .

            # Check the score
            ?tag :score ?score .
            FILTER(?score > 50)
        }
    }
    {
        # Verify the tag is for a topic
        ?tag :about ?topic .
        ?topic a :Topic .
    } UNION {
        # Verify the tag is for a country
        ?tag :about ?country .
        ?country a :Country .
    }
} GROUP BY ?content

In this query, the results of the subquery are merged with the results of the UNION enabling us to still apply custom patterns to topics and countries. We use a subquery in order to avoid matching the ?content and ?tag values more than once and the merge enforces that every tag has to be about a topic or a country.

Final SPARQL Thoughts

SPARQL is a robust query language for working with RDF data. Try not to overlook less common SPARQL features (such as VALUES, STRDT, and SAMPLE) and check if your graph database has any proprietary functions that you can leverage for even more flexibility. As a more general recommendation, always take the time to step back and see if there’s a cleaner, more efficient way to retrieve the data you need.
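
For example, VALUES offers a concise way to restrict a variable to a fixed set of terms without writing a UNION or a FILTER; a small sketch using the :Topic and :Country types from the earlier examples:

SELECT ?term ?type WHERE {
    VALUES ?type { :Topic :Country }
    ?term a ?type .
}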

Enterprise Knowledge writes performant queries and designs data models to enable advanced graph solutions. If you can’t find your own Mr. SPARQL unicorn – whether you are new to the graph space or looking to optimize your existing data – contact us to discuss how EK can help take your solution to the next level.

¹If the title of this blog is familiar, that’s because it is a reference to an episode of The Simpsons. In one episode, a Japanese cleaning agency used Homer’s face (or one closely resembling it) as the logo of a brand called “Mr. Sparkle.” Homer calls up the brand and asks, “Why am I Mr. Sparkle?” One of my colleagues mentioned that he was reminded of this episode anytime he heard me discussing SPARQL with the rest of the EK Team. 

I have been using SPARQL actively for the past 3 years and have come to recognize that it requires a unique mindset. There are some common gotcha moments and optimization techniques for improving queries but writing the initial query requires an understanding of the RDF format and piecing it together is just as critical as making it more efficient. The most effective SPARQL developers in your organization will be the unicorns, the individuals with a knowledge of code who are able to adjust course quickly, hold complex logic in their head, and enjoy the time it takes to solve puzzles.

RDF*: What is it and Why do I Need it?
https://enterprise-knowledge.com/rdf-what-is-it-and-why-do-i-need-it/

RDF* (pronounced RDF star) is an extension to the Resource Description Framework (RDF) that enables RDF graphs to more intuitively represent complex interactions and attributes through the implementation of embedded triples. This allows graphs to capture relationships between more than two entities, add metadata to existing relationships, and add provenance information to all triples, reducing the burden of maintenance.

But let’s back up…before we talk about RDF*, let’s cover the basics — what is RDF, and how is RDF* different from RDF?

What is RDF?

The Resource Description Framework (RDF) is a semantic web standard used to describe and model information for web resources or knowledge management systems. RDF consists of “triples,” or statements, with a subject, predicate, and object that resemble an English sentence. 

For example, take the English sentence: “Bess Schrader is employed by Enterprise Knowledge.” This sentence has:

  • A subject: Bess Schrader
  • A predicate: is employed by 
  • An object: Enterprise Knowledge

Bess Schrader and Enterprise Knowledge are two entities that are linked by the relationship is employed by. An RDF triple representing this information would look like this:

Visual representation of the RDF triple "Bess Schrader is employed by Enterprise Knowledge"

(There are many ways, or serializations, to represent RDF. In this blog, I’ll be using the Turtle syntax because it’s easy to read, but this information could also be shown in RDF/XML, JSON for Linking Data, and other formats.)
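
As a rough sketch, the triple above can be built with rdflib (as in the earlier posts in this collection) and printed in Turtle; the eg: namespace and the employedBy property name are illustrative choices rather than an established vocabulary:

from rdflib import Graph, Namespace

EG = Namespace("http://example.com/")
graph = Graph()
graph.bind("eg", EG)
graph.add((EG["Bess_Schrader"], EG["employedBy"], EG["Enterprise_Knowledge"]))

# Prints roughly:
#   @prefix eg: <http://example.com/> .
#   eg:Bess_Schrader eg:employedBy eg:Enterprise_Knowledge .
print(graph.serialize(format="turtle"))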

The World Wide Web Consortium (W3C) maintains the RDF Specification, making it easy for applications and organizations to develop RDF data in an interoperable way. This means if you create RDF data in one tool and share it with someone else using a different RDF tool, they will still be able to easily use your data. This interoperability allows you to build on what’s already been done — you can combine your enterprise knowledge graph with established, open RDF datasets like Wikidata, jump starting your analytic capabilities. This also makes data sharing and migration between internal RDF systems simple, enabling you to unify data and reducing your dependency on a single tool or vendor.

For more information on RDF and how it can be used, check out Why a Taxonomist Should Know SPARQL.

What are the limitations of RDF (Why is RDF* necessary)?

Standard RDF has many strengths:

  • Like most graph models, it more intuitively captures the way we think about the world as humans (as networks, not as tables), making it easier to design, capture, and query data.
  • As a standard supported by the W3C, it allows us to create interoperable data and systems, all using the same standard to represent and encode data.

However, it has one key weakness: because RDF is based on triples, standard RDF can only connect two objects at a time. For many use cases, this limitation isn’t a problem. Consider my example from above, where I want to represent the relationship between me and my employer:

Visual representation of the RDF triple "Bess Schrader is employed by Enterprise Knowledge"

Simple! However, what if I want to capture the role or position that I hold at this organization? I could add a triple denoting my position:

A visual representation of an additional triple, showing not only that Bess Schrader is employed by enterprise knowledge, but also that Bess Schrader holds the position of Consultant

Great! But what if I decide to add in my (fictional) employment history?

These triples attempt to add employment history, showing that not only is Bess Schrader employed by enterprise knowledge and holds the position of consultant, but also that she is employed by Hogwarts and holds position professor

Now it’s unclear whether I was a consultant at Enterprise Knowledge or at Hogwarts. 

There are a variety of ways to address this problem in RDF. One of the most popular is reification or n-ary relations, in which you create an intermediary node that allows you to group more than two entities together. For example:

Triples with the addition of intermediary nodes, "Employment Event 1" and "Employment Event 2" to add the temporality that RDF triples do not allow for

Using this technique allows you to clear up confusion and model the complexity of the world. However, adding these intermediary nodes takes away some of the simplicity of graph data — the idea of an “employment event” isn’t exactly intuitive.

There are many other methods that have been developed to handle this kind of complexity in RDF, including singleton properties and named graphs/quads. Additionally, an entirely different type of non-RDF graph model, labeled property graphs, allows users to attach properties directly to relationships. However, labeled property graphs don’t allow for interoperability at the same scale as RDF — it’s much harder to share and combine different data sets, and moving data from tool to tool isn’t as simple.

None of these solutions retain both of the strengths of RDF: the interoperable standards and the intuitive data model. This crucial limitation of RDF has limited its effectiveness in certain applications, particularly those involving temporal or transactional data.

What is RDF*?

RDF* (pronounced RDF-star) is an extension to RDF that proposes a solution to the weaknesses of RDF mentioned above. As an extension, RDF* supplements RDF but doesn’t replace it. 

The main idea behind RDF* is to treat a triple as a single entity. By “nesting” or “embedding” triples, an entire triple can become the subject of a second triple. This allows you to add metadata to triples, assigning attributes to a triple, or creating relationships not just between two entities in your knowledge graph, but between triples and entities, or triples and triples. Take our example from above. In standard RDF, if I want to express past employers and positions, I need to use reification: 

Triples with the addition of intermediary nodes, "Employment Event 1" and "Employment Event 2" to add the temporality that RDF triples do not allow for

In RDF*, I can use nested triples to simply denote the same information:

Visual representation of a nested triple

This eliminates the need for intermediary entities and makes the model easier to understand and implement. 

Just as standard RDF can be queried via the SPARQL query language, RDF* can be queried using SPARQL*, allowing users to query both standard and nested triples.
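
As an illustrative sketch using the draft RDF-star/SPARQL-star syntax, where a nested triple is written between << and >> (the eg: names and the startDate property below are hypothetical, and exact syntax support varies by tool), the employment triple can be annotated and then queried:

@prefix eg: <http://example.com/> .

# Turtle-star: annotate the nested employment triple with a start date
<< eg:Bess_Schrader eg:employedBy eg:Enterprise_Knowledge >> eg:startDate "2017-01-01" .

PREFIX eg: <http://example.com/>

# SPARQL-star: retrieve the annotation attached to the nested triple
SELECT ?start WHERE {
  << eg:Bess_Schrader eg:employedBy eg:Enterprise_Knowledge >> eg:startDate ?start .
}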

Currently, RDF* is under consideration by the W3C and has not yet been officially accepted as a standard. However, the specification has been formally defined in Foundations of an Alternative Approach to Reification in RDF, and many enterprise tools supporting RDF have added support for RDF* (including BlazeGraph, AnzoGraph, Stardog, and GraphDB). Hopefully this standard will be formally adopted by the W3C, allowing it to retain and build on the original strengths of RDF: its intuitive model/simplicity and interoperability.

What are the benefits of RDF*?

As you can see above, RDF* can be used to represent relationships that involve more than one entity (e.g. person, role, and organization) in a more intuitive manner than standard RDF. However, RDF* has additional use cases, including:

  • Adding metadata to a relationship (For example: start dates and end dates for jobs, marriages, events, etc.)

Illustrates the addition of start dates to each nested triple

  • Adding provenance information for triples: I have a triple that indicates Bess Schrader works for Enterprise Knowledge. When did I add this triple to my graph? What was the source of this information? Who added the information to the graph?

Illustrates how you can add additional metadata to nested triples

Conclusion

On its own, RDF provides an excellent way to create, combine, and share semantic information. Extending this framework with RDF* gives knowledge engineers more flexibility to model complex interactions between multiple entities, attach attributes to relationships, and store metadata about triples, helping us more accurately model the real world while improving our ability to understand and verify where our data comes from.

Looking for more information on RDF* and how you can leverage it to solve your data challenges? Contact Enterprise Knowledge.

Why a Taxonomist Should Know SPARQL
https://enterprise-knowledge.com/why-a-taxonomist-should-know-sparql/

As the Knowledge and Information Management field moves towards adopting semantic technologies like ontologies and enterprise knowledge graphs, taxonomists and taxonomy managers need to know about W3C semantic web standards, including RDF, SKOS, and SPARQL, because data is becoming more interconnected and complex, and we need to move beyond the traditional hierarchical taxonomy relationships in order to truly model our knowledge domains.

Infographic listing common W3C semantic web standards

Taxonomies can also use these standards to extend into ontologies, which increase the value of a taxonomist’s work by supporting AI initiatives and features. As my colleagues have defined in previous blogs, ontologies are semantic data models that define the types of things that exist in our domain and the properties that can be used to describe them; knowledge graphs are the instantiation of our ontology models with real, live business data. EK recommends designing both of these using the W3C standards for interoperability, which will be discussed in this blog. It is critical that taxonomists and taxonomy managers become familiar with RDF, SKOS, and SPARQL as more and more taxonomies are being built and implemented using the underlying structure of RDF and SKOS. The top taxonomy management tools in this space are also built to support these semantic standards.

Knowing, and being able to leverage, these semantic standards will not only increase a taxonomist or taxonomy manager’s ability to maintain and enhance taxonomy designs, but will also ensure that taxonomies are built to last as a source of truth for their domain and to serve as the building blocks of an ontology.

What a Taxonomist needs to know about RDF, SKOS, and SPARQL

The W3C, or World Wide Web Consortium, is an international standards organization that develops open standards to ensure the growth and longevity of the world wide web. Among these are the standards and recommendations for RDF, SKOS, and SPARQL. RDF stands for Resource Description Framework and is used to describe and model information for web resources or knowledge management systems. RDF consists of “triples” or statements that resemble a sentence. If we think back to elementary school English classes and sentence diagramming, we build sentences or triples that contain a subject, predicate, and object. 

SKOS, which stands for Simple Knowledge Organization System, is built on RDF and is another W3C recommendation for how taxonomies should be structured and represented.

SPARQL is pronounced “sparkle” and is a recursive acronym for “SPARQL Protocol and RDF Query Language,” which is a set of specifications from the W3C. SPARQL allows you to query one or more triples and return varied results based on the type of information you are looking for from your taxonomy or graph database. All that is needed to leverage SPARQL is (1) data that is represented in RDF format and (2) an endpoint inside an enterprise taxonomy/ontology management tool, or a publicly available endpoint like Wikidata.

What is the Value of RDF and SPARQL?

When metadata about concepts within a taxonomy is stored using RDF (last modified date, created by, approval status, etc.), taxonomists can use SPARQL to interact with and ask questions about the taxonomy design in many different ways, including to update the taxonomy, pull concrete values from the data, or even track changes for governance. A query could pull all concepts in draft status, or all concepts edited by a specific person in the last 30 days. We can also use SPARQL to explore our data by querying unknown relationships to discover new connections. We’ve received questions from clients that have prompted the need for SPARQL queries to do basic reporting on a taxonomy structure or to return a subset of the project data for updating another system. Consider if we only want the portion of the enterprise taxonomy that is used for the intranet and Content Management System (CMS) to be available in the Digital Asset Management (DAM) system. We can use a SPARQL query to pull only the concepts that live under a certain tree or broader concept to then import into the DAM.
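
A sketch of what such a governance query might look like follows; the eg:status property is a placeholder, since the actual predicates depend on how your taxonomy management tool stores this metadata:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX eg: <http://example.com/>

SELECT ?concept ?label WHERE {
    ?concept a skos:Concept ;
             skos:prefLabel ?label ;
             eg:status "Draft" .
}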

The primary value of RDF is in the triples that allow us to make statements and connect different concepts beyond broader and narrower relationships, building a flexible and interoperable taxonomy. Specifically, RDF adds value in three main ways, all of which are related to the idea of use and reuse of information.

  • URIs (Uniform Resource Identifiers): do exactly what they sound like – identify resources with unique IDs without being specific to the resource’s location or use so that it can be reused. 
  • Linked Open Data: Openly available data (triples) or models (taxonomies/ontologies) that can be sourced and used to enhance a custom taxonomy, or to negate the need to design a taxonomy that already exists (e.g. DBPedia)
  • Interoperability: The idea that by using semantic standards, all vocabularies or models built using those standards can be integrated and used with each other, and with other systems or applications.

Even though our business or enterprise taxonomies may be highly specific, internal vocabularies, we can still leverage RDF and SKOS to ensure interoperability behind our firewalls. Specifically, they enable the use and reuse of the taxonomy in multiple systems so that all the systems, and all their users, are speaking the same language. This is also key for the development and implementation of knowledge graphs that will leverage RDF and SPARQL to pull information from disparate systems together for greater usability.

What Kind of Information Can I Query?

When writing a SPARQL query you are typically saying “I want X information from Y data that meets Z conditions.” The conditions are written as triple patterns, which are similar to RDF triples but may include variables to add flexibility in how they match against the data. For example, if we have a taxonomy where all terms should have preferred labels in both English and French, and we need to get a list of terms from our taxonomy that still need French translations, we can use a SPARQL query scoped to their scheme or top concept so that we can send terms to the appropriate SMEs to translate. This SPARQL query would follow the pattern above and ask “I want all concepts that are narrower terms of Concept A that do not have a French prefLabel.” It might look like this:

Example of a SPARQL Query
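
A sketch of what such a query might look like using SKOS predicates, with eg:ConceptA standing in for the scheme or top concept, is:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX eg: <http://example.com/>

SELECT ?concept WHERE {
    ?concept skos:broader+ eg:ConceptA .
    FILTER NOT EXISTS {
        ?concept skos:prefLabel ?frenchLabel .
        FILTER(lang(?frenchLabel) = "fr")
    }
}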

Some SPARQL queries might be as simple as identifying how many concepts are under a parent concept in our taxonomy. Many taxonomy management tools will provide statistics on the total number of concepts within the taxonomy but may not provide those statistics at the remaining lower levels of the taxonomy hierarchy. 
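
A count like that is a short aggregation query; for example, counting every concept that sits anywhere below a given parent (reusing the prefixes above, with eg:ConceptA again as a placeholder):

SELECT (COUNT(DISTINCT ?concept) as ?total) WHERE {
    ?concept skos:broader+ eg:ConceptA .
}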

We have also used SPARQL to support the approval workflow and update process from the taxonomy management system to a second, custom application for tagging data. In this case, we needed a query that would return all the draft concepts and all their related triples (the information that makes up the concept) so the second application could be updated with the new concepts, leaving existing concepts as they were.

Conclusion

SKOS, RDF, and SPARQL work together to ensure interoperability and usability of your organization’s data and information by standardizing the way taxonomists design and manage taxonomies and streamlining the path toward ontologies and knowledge graphs. Leveraging what I’ve described in this blog, with the appropriate designs and implementations, can translate to Enterprise AI readiness for an organization and overall, better visibility and usage of your organization’s data and information.

Whether you are just beginning the process of designing a taxonomy, or are focused on implementation, semantic standards should be a primary consideration to ensure longevity, usability, and interoperability with many systems and tools. We are here to help you utilize these standards and implement them efficiently. Contact us.

What’s the Difference Between an Ontology and a Knowledge Graph?
https://enterprise-knowledge.com/whats-the-difference-between-an-ontology-and-a-knowledge-graph/

As semantic applications become increasingly hot topics in the industry, clients often come to EK asking about ontologies and knowledge graphs. Specifically, they want to know the differences between the two. Are ontologies and knowledge graphs the same thing? If not, how are they different? What is the relationship between the two?

In this blog, I’ll walk you through both ontologies and knowledge graphs, describing how they’re different and how they work together to organize large amounts of data and information. 

What is an ontology?

Ontologies are semantic data models that define the types of things that exist in our domain and the properties that can be used to describe them. Ontologies are generalized data models, meaning that they only model general types of things that share certain properties, but don’t include information about specific individuals in our domain. For example, instead of describing your dog, Spot, and all of his individual characteristics, an ontology should focus on the general concept of dogs, trying to capture characteristics that most/many dogs might have. Doing this allows us to reuse the ontology to describe additional dogs in the future.

There are three main components to an ontology, which are usually described as follows:

  • Classes: the distinct types of things that exist in our data.
  • Relationships: properties that connect two classes.
  • Attributes: properties that describe an individual class. 

For example, imagine we have the following information on books, authors, and publishers:

First we want to identify our classes (the unique types of things that are in the data). This sample data appears to capture information about books, so that’s a good candidate for a class. Specifically, the sample data captures certain types of things about books, such as authors and publishers. Digging a little deeper, we can see our data also captures information about publishers and authors, such as their locations. This leaves us with four classes for this example:

  • Books
  • Authors
  • Publishers
  • Locations

Next, we need to identify relationships and attributes (for simplicity, we can consider both relationships and attributes as properties). Using the classes that we identified above, we can look at the data and start to list all of the properties we see for each class. For example, looking at the book class, some properties might be:

  • Books have authors
  • Books have publishers
  • Books are published on a date
  • Books are followed by sequels (other books)

Some of these properties are relationships that connect two of our classes. For example, the property “books have authors” is a relationship that connects our book class and our author class. Other properties, such as “books are published on a date,” are attributes, describing only one class, instead of connecting two classes together. 

It’s important to note that these properties might apply to any given book, but they don’t necessarily have to apply to every book. For example, many books don’t have sequels. That’s fine in our ontology, because we just want to make sure we capture possible properties that could apply to many, but not necessarily all, books. 

While the above list of properties is easy to read, it can be helpful to rewrite these properties to more clearly identify our classes and properties. For example, “books have authors” can be written as:

Book → has author → Author 

Although there are many more properties that you could include, depending on your use case, for this blog, I’ve identified the following properties:

  • Book → has author → Author
  • Book → has publisher→ Publisher
  • Book → published on → Publication date
  • Book → is followed by → Book
  • Author → works with → Publisher
  • Publisher → located in → Location 
  • Location → located in → Location

Remember that our ontology is a general data model, meaning that we don’t want to include information about specific books in our ontology. Instead, we want to create a reusable framework we could use to describe additional books in the future.

When we combine our classes and relationships, we can view our ontology in a graph format:

A graph representation of our ontology model for our book data. Includes classes and their properties, such as "Book" published in year "Year," and "Book" has author "Author."

What is a knowledge graph?

Using our ontology as a framework, we can add in real data about individual books, authors, publishers, and locations to create a knowledge graph. With the information in our tables above, as well as our ontology, we can create specific instances of each of our ontological relationships. For example, if we have the relationship Book → has author → Author in our ontology, an individual instance of this relationship looks like:

A graph representation of a piece of our knowledge graph. Specifically, a representation of an individual instance of the "Book has Author" relationship, with the example, "To Kill a Mockingbird" has author "Harper Lee."

If we add in all of the individual information that we have about one of our books, To Kill a Mockingbird, we can start to see the beginnings of our knowledge graph:

A graph representation of our knowledge graph when we apply our ontology to a subset of our data. Specifically, when we apply out ontology to all the information we know about one book, "To Kill a Mockingbird."

If we do this with all of our data, we will eventually wind up with a graph that has our data encoded using our ontology. Using this knowledge graph, we can view our data as a web of relationships, instead of as separate tables, drawing new connections between data points that we would otherwise be unable to understand. Specifically, we can query this data using SPARQL, and with inferencing our knowledge graph can make connections for us that weren’t explicitly defined.

A graph representation of our knowledge graph when we input all of our data about "books" into our ontology.
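
To make that querying step concrete, here is a sketch of a SPARQL query over this knowledge graph that returns every book along with its author and its publisher’s location; the property URIs (eg:hasAuthor, eg:hasPublisher, eg:locatedIn) are illustrative names for the relationships listed above, not URIs defined in the post:

PREFIX eg: <http://example.com/>

SELECT ?book ?author ?location WHERE {
    ?book eg:hasAuthor ?author ;
          eg:hasPublisher ?publisher .
    ?publisher eg:locatedIn ?location .
}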

So how are ontologies and knowledge graphs different?

As you can see from the example above, a knowledge graph is created when you apply an ontology (our data model) to a set of individual data points (our book, author, and publisher data). In other words:

ontology + data = knowledge graph

Ready to get started? Check our ontology design and knowledge graph design best practices, and contact us if you need help beginning your journey with advanced semantic data models. 
