Graph Analytics in the Semantic Layer: Architectural Framework for Knowledge Intelligence

Introduction

As enterprises accelerate AI adoption, the semantic layer has become essential for unifying siloed data and delivering actionable, contextualized insights. Graph analytics plays a pivotal role within this architecture, serving as the analytical engine that reveals patterns and relationships often missed by traditional data analysis approaches. By integrating metadata graphs, knowledge graphs, and analytics graphs, organizations can bridge disparate data sources and empower AI-driven decision-making. With recent technological advances in graph-based technologies, including knowledge graphs, property graphs, Graph Neural Networks (GNNs), and Large Language Models (LLMs), the semantic layer is evolving into a core enabler of intelligent, explainable, and business-ready insights.

The Semantic Layer: Foundation for Connected Intelligence

A semantic layer acts as an enterprise-wide framework that standardizes data meaning across both structured and unstructured sources. Unlike traditional data fabrics, it integrates content, media, data, metadata, and domain knowledge through three main interconnected components:

1. Metadata Graphs capture the data about data. They track business, technical, and operational metadata – from data lineage and ownership to security classifications – and interconnect these descriptors across the organization. In practice, a metadata graph serves as a unified catalog or map of data assets, making it ideal for governance, compliance, and discovery use cases. For example, a bank might use a metadata graph to trace how customer data flows through dozens of systems, ensuring regulatory requirements are met and identifying duplicate or stale data assets.

2. Knowledge Graphs encode the business meaning and context of information. They integrate heterogeneous data (structured and unstructured) into an ontology-backed model of real-world entities (customers, accounts, products, and transactions) and the relationships between them. A knowledge graph serves as a semantic abstraction layer over enterprise data, where relationships are explicitly defined using standards like RDF/OWL for machine understanding. For example, a retailer might utilize a knowledge graph to map the relationships between sources of customer data to help define a “high-risk customer”. This model is essential for creating a common understanding of business concepts and for powering context-aware applications such as semantic search and question answering.

3. Analytics Graphs focus on connected data analysis. They are often implemented as property graphs (LPGs) and used to model relationships among data points to uncover patterns, trends, and anomalies. Analytics graphs enable data scientists to run sophisticated graph algorithms – from community detection and centrality to pathfinding and similarity – on complex networks of data that would be difficult to analyze in tables. Common use cases include fraud detection/prevention, customer influence networks, recommendation engines, and other link analysis scenarios. For instance, fraud analytics teams in financial institutions have found success using analytics graphs to detect suspicious patterns that traditional SQL queries missed. Analysts frequently use tools like Kuzu and Neo4j, which have built-in graph data science modules, to store and query these graphs at scale, while graph visualization tools (Linkurious and Hume) help analysts explore the relationships intuitively (a brief code sketch of these algorithms appears below).

Together, these layers transform raw data into knowledge intelligence; read more about these types of graphs here.
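
To make this concrete, the minimal sketch below runs community detection and betweenness centrality over a small, entirely hypothetical transaction network using NetworkX. The accounts, edges, and algorithm choices are illustrative assumptions; in production, equivalent algorithms would typically run inside a graph platform such as Neo4j or Kuzu rather than in application code.

```python
# Minimal sketch: community detection and centrality on a hypothetical transaction network.
# Requires the networkx package; all accounts and edges are made up.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("acct_A", "acct_B"), ("acct_B", "acct_C"), ("acct_C", "acct_A"),  # tightly knit group
    ("acct_D", "acct_E"), ("acct_E", "acct_F"), ("acct_F", "acct_D"),  # second group
    ("acct_C", "acct_D"),                                              # bridge between groups
])

# Community detection groups densely connected accounts together.
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"Community {i}: {sorted(community)}")

# Betweenness centrality surfaces the accounts that bridge communities,
# a common signal in fraud and influence analysis.
centrality = nx.betweenness_centrality(G)
print(sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:2])
```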

Driving Insights with Graph Analytics: From Knowledge Representation to Knowledge Intelligence with the Semantic Layer

  • Relationship Discovery
    Graph analytics reveals hidden, non-obvious connections that traditional relational analysis often misses. It leverages network topology, how entities relate across multiple hops, to uncover complex patterns. Graph algorithms like pathfinding, community detection, and centrality analysis can identify fraud rings, suspicious transaction loops, and intricate ownership chains through systematic relationship analysis. These patterns are often invisible when data is viewed in tables or queried without regard for structure. With a semantic layer, this discovery is not just technical; it enables the business to ask new types of questions and uncover previously inaccessible insights.
  • Context-Aware Enrichment
    While raw data can be linked, it only becomes usable when placed in context. Graph analytics, when layered over a semantic foundation of ontologies and taxonomies, enables the enrichment of data assets with richer and more precise information. For example, multiple risk reports or policies can be semantically clustered and connected to related controls, stakeholders, and incidents. This process transforms disconnected documents and records into a cohesive knowledge base. With a semantic layer as its backbone, graph enrichment supports advanced capabilities such as faceted search, recommendation systems, and intelligent navigation.
  • Dynamic Knowledge Integration
    Enterprise data landscapes evolve rapidly with new data sources, regulatory updates, and changing relationships that must be accounted for in real-time. Graph analytics supports this by enabling incremental and dynamic integration. Standards-based knowledge graphs (e.g., RDF/SPARQL) ensure portability and interoperability, while graph platforms support real-time updates and streaming analytics. This flexibility makes the semantic layer resilient, future-proof, and always current. These traits are crucial in high-stakes environments like financial services, where outdated insights can lead to risk exposure or compliance failure.

These mechanisms, when combined, elevate the semantic layer from knowledge representation to a knowledge intelligence engine for insight generation. Graph analytics not only helps interpret the structure of knowledge but also allows AI models and human users alike to reason across it.
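
As a small illustration of the standards-based integration described above, the sketch below loads a handful of hypothetical RDF triples with rdflib and runs a multi-hop SPARQL query across them. The namespace, entities, and predicates are illustrative assumptions rather than a prescribed model.

```python
# Minimal sketch: a standards-based (RDF/SPARQL) multi-hop query over connected data.
# Requires the rdflib package; all entities and predicates are hypothetical.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/kg/")
g = Graph()

# Hypothetical facts: a customer owns an account that sent a flagged transaction.
g.add((EX.cust_42, RDF.type, EX.Customer))
g.add((EX.cust_42, EX.owns, EX.acct_9))
g.add((EX.acct_9, EX.sentTransaction, EX.txn_123))
g.add((EX.txn_123, EX.flaggedBy, EX.AMLControl))

# Multi-hop query: which customers are connected, through any account,
# to a transaction flagged by a control?
query = """
PREFIX ex: <http://example.org/kg/>
SELECT ?customer ?control WHERE {
    ?customer a ex:Customer ;
              ex:owns ?account .
    ?account ex:sentTransaction ?txn .
    ?txn ex:flaggedBy ?control .
}
"""
for row in g.query(query):
    print(f"{row.customer} is linked to {row.control}")
```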

Graph Analytics in the Semantic Layer Architecture

Business Impact and Case Studies

Enterprise Knowledge’s implementations demonstrate how organizations leverage graph analytics within semantic layers to solve complex business challenges. Below are three real-world examples from their case studies:
1. Global Investment Firm: Unified Knowledge Portal

A global investment firm managing over $250 billion in assets faced siloed information across 12+ systems, including CRM platforms, research repositories, and external data sources. Analysts wasted hours manually piecing together insights for mergers and acquisitions (M&A) due diligence.

Enterprise Knowledge designed and deployed a semantic layer-powered knowledge portal featuring:

  • A knowledge graph integrating structured and unstructured data (research reports, market data, expert insights)
  • Taxonomy-driven semantic search with auto-tagging of key entities (companies, industries, geographies)
  • Graph analytics to map relationships between investment targets, stakeholders, and market trends

Results

  • Single source of truth for 50,000+ employees, reducing redundant data entry
  • Accelerated M&A analysis through graph visualization of ownership structures and competitor linkages
  • AI-ready foundation for advanced use cases like predictive market trend modeling

2. Insurance Fraud Detection: Graph Link Analysis

A national insurance regulator struggled to detect synthetic identity fraud, where bad actors slightly alter personal details (e.g., “John Doe” vs “Jon Doh”) across multiple claims. Traditional relational databases failed to surface these subtle connections.

Enterprise Knowledge designed a graph-powered semantic layer with the following features:

  • Property graph database modeling claimants, policies, and claim details as interconnected nodes/edges
  • Link analysis algorithms (Jaccard similarity, community detection) to identify fraud rings
  • Centrality metrics highlighting high-risk networks based on claim frequency and payout patterns

Results

  • Improved detection of complex fraud schemes through relationship pattern analysis
  • Dynamic risk scoring of claims based on graph-derived connection strength
  • Explainable AI outputs via graph visualizations for investigator collaboration
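
The kind of link analysis used in this engagement can be illustrated in a few lines of code. The hedged sketch below computes Jaccard similarity over shared claim attributes for hypothetical claimants; in a real deployment this would typically run as a node-similarity algorithm inside the property graph platform, with entity resolution and richer features.

```python
# Minimal sketch: Jaccard similarity over shared claim attributes to surface
# possible synthetic identities. All records and the threshold are hypothetical.
claimants = {
    "John Doe": {"ssn:123-45-6789", "phone:555-0100", "addr:12 Oak St"},
    "Jon Doh":  {"ssn:123-45-6789", "phone:555-0100", "addr:12 Oak Street"},
    "Jane Roe": {"ssn:987-65-4321", "phone:555-0199", "addr:90 Elm Ave"},
}

def jaccard(a: set, b: set) -> float:
    """Ratio of shared attributes to all attributes across two claimants."""
    return len(a & b) / len(a | b)

names = list(claimants)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = jaccard(claimants[names[i]], claimants[names[j]])
        if score >= 0.4:  # flagging threshold is an illustrative assumption
            print(f"Possible match: {names[i]} <-> {names[j]} (Jaccard = {score:.2f})")
```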

3. Government Linked Data Investigations: Semantic Layer Strategy

A government agency investigating cross-border crimes needed to connect fragmented data from inspection reports, vehicle registrations, and suspect databases. Analysts manually tracked connections using spreadsheets, leading to missed patterns and delayed cases.

Enterprise Knowledge delivered a semantic layer solution featuring:

  • Entity resolution to reconcile inconsistent naming conventions across systems
  • Investigative knowledge graph linking people, vehicles, locations, and events
  • Graph analytics dashboard with pathfinding algorithms to surface hidden relationships

Results

  • 30% faster case resolution through automated relationship mapping
  • Reduced cognitive load with graph visualizations replacing manual correlation
  • Scalable framework for integrating new data sources without schema changes

Implementation Best Practices

Enterprise Knowledge’s methodology emphasizes several critical success factors:

1. Standardize with Semantics
Establishing a shared semantic foundation through reusable ontologies, taxonomies, and controlled vocabularies ensures consistency and scalability across domains, departments, and systems. Standardized semantic models enhance data alignment, minimize ambiguity, and facilitate long-term knowledge integration. This practice is critical when linking diverse data sources or enabling federated analysis across heterogeneous environments.

2. Ground Analytics in Knowledge Graphs
Analytics graphs risk misinterpretation when created without proper ontological context. Enterprise Knowledge’s approach involves collaboration with intelligence subject matter experts to develop and implement ontology and taxonomy designs that map to Common Core Ontologies for a standard, interoperable foundation.

3. Adopt Phased Implementation
Enterprise Knowledge develops iterative implementation plans to scale foundational data models and architecture components, unlocking incremental technical capabilities. EK’s methodology includes identifying starter pilot activities, defining success criteria, and outlining necessary roles and skill sets.

4. Optimize for Hybrid Workloads
Recent research on Semantic Property Graph (SPG) architectures demonstrates how to combine RDF reasoning with the performance of property graphs, enabling efficient hybrid workloads. Enterprise Knowledge advises on bridging RDF and LPG formats to enable seamless data integration and interoperability while maintaining semantic standards.

Conclusion

The semantic layer achieves transformative impact when metadata graphs, knowledge graphs, and analytics graphs operate as interconnected layers within a unified architecture. Enterprise Knowledge’s implementations demonstrate that organizations adopting this triad architecture achieve accelerated decision-making in complex scenarios. By treating these components as interdependent rather than isolated tools, businesses transform static data into dynamic, context-rich intelligence.

Graph analytics is not a standalone tool but the analytical core of the semantic layer. Grounded in robust knowledge graphs and aligned with strategic goals, it unlocks hidden value in connected data. In essence, the semantic layer, when coupled with graph analytics, becomes the central knowledge intelligence engine of modern data-driven organizations.
If your organization is interested in developing a graph solution or implementing a semantic layer, contact us today!

Unlocking Knowledge Intelligence from Unstructured Data

Introduction

Organizations generate, source, and consume vast amounts of unstructured data every day, including emails, reports, research documents, technical documentation, marketing materials, learning content and customer interactions. However, this wealth of information often remains hidden and siloed, making it challenging to utilize without proper organization. Unlike structured data, which fits neatly into databases, unstructured data often lacks a predefined format, making it difficult to extract insights or apply advanced analytics effectively.

Integrating unstructured data into a knowledge graph is an effective way to overcome these challenges. This approach allows businesses to move beyond traditional storage and keyword search methods to unlock knowledge intelligence. Knowledge graphs contextualize unstructured data by linking and structuring it, leveraging business-relevant concepts and relationships. This enhances enterprise search capabilities, automates knowledge discovery, and powers AI-driven applications.

This blog explores why structuring unstructured data is essential, the challenges organizations face, and the right approach to integrate unstructured content into a graph-powered knowledge system. Additionally, this blog highlights real-world implementations demonstrating how we have applied this approach to help organizations unlock knowledge intelligence, streamline workflows, and drive meaningful business outcomes.

Why Structure Unstructured Data in a Graph

Unstructured data offers immense value to organizations if it can be effectively harnessed and contextualized using a knowledge graph. Structuring content in this way unlocks potential and drives business value. Below are three key reasons to structure unstructured data:

1. Knowledge Intelligence Requires Context

Unstructured data often holds valuable information, but is disconnected across different formats, sources, and teams. A knowledge graph enables organizations to connect these pieces by linking concepts, relationships, and metadata into a structured framework. For example, a financial institution can link regulatory reports, policy documents, and transaction logs to uncover compliance risks. With traditional document repositories, achieving knowledge intelligence may be impossible, or at least very resource intensive.

Additionally, organizations must ensure that domain-specific knowledge informs AI systems to improve relevance and accuracy. Injecting organizational knowledge into AI models enhances AI-driven decision-making by grounding models in enterprise-specific data.

2. Enhancing Findability and Discovery

Unstructured data lacks standard metadata, making traditional search and retrieval inefficient. Knowledge graphs power semantic search by linking related concepts, improving content recommendations, and eliminating reliance on simple keyword matching. For example, in the financial industry, investment analysts often struggle to locate relevant market reports, regulatory updates, and historical trade data buried in siloed repositories. A knowledge graph-powered system can link related entities, such as companies, transactions, and market events, allowing analysts to surface contextually relevant information with a single query, rather than sifting through disparate databases and document archives.

3. Powering Explainable AI and Generative Applications

Generative AI and Large Language Models (LLMs) require structured, contextualized data to produce meaningful and accurate responses. A graph-enhanced AI pipeline allows enterprises to:

A. Retrieve verified knowledge rather than relying on AI-generated assumptions that can result in hallucinations.

B. Trace AI-generated insights back to trusted enterprise data for validation.

C. Improve explainability and accuracy in AI-driven decision-making.
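
A rough sketch of this pattern is shown below: verified facts are retrieved from a small rdflib knowledge graph and assembled into a prompt so that answers can be traced back to the graph. The entities are hypothetical, and call_llm is a placeholder rather than a real provider API.

```python
# Minimal sketch: ground an LLM prompt in facts retrieved from a knowledge graph.
# Requires rdflib; entities, predicates, and the LLM call are placeholders.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/kg/")
g = Graph()
g.add((EX.AcmeCorp, RDF.type, EX.Supplier))
g.add((EX.AcmeCorp, EX.riskRating, Literal("High")))
g.add((EX.AcmeCorp, EX.governedBy, EX.ThirdPartyRiskPolicy))

# 1. Retrieve verified knowledge about the entity in question.
facts = g.query("SELECT ?p ?o WHERE { <http://example.org/kg/AcmeCorp> ?p ?o }")
fact_lines = [f"- {p} {o}" for p, o in facts]

# 2. Constrain the model to the retrieved facts so every answer can be
#    traced back to the graph for validation.
prompt = (
    "Answer using only the facts below, and cite the fact you used.\n"
    + "\n".join(fact_lines)
    + "\n\nQuestion: What is AcmeCorp's current risk posture?"
)

def call_llm(prompt: str) -> str:
    # Placeholder: wire this to the LLM service of your choice.
    raise NotImplementedError

print(prompt)  # response = call_llm(prompt)
```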

 

Challenges of Handling Unstructured Data in a Graph

While structured data fits neatly into predefined models that facilitate easy storage and retrieval, unstructured data presents a stark contrast. Unstructured data, encompassing diverse formats such as text documents, images, and videos, lacks the inherent organization and standardization that facilitate machine understanding and readability. This lack of structure poses significant challenges for data management and analysis, hindering the ability to extract valuable insights. The following key challenges highlight the complexities of handling unstructured data:

1. Unstructured Data is Disorganized and Diverse

Unstructured data is frequently available in multiple formats, including PDF documents, slide presentations, email communications, and video recordings. However, these diverse formats lack a standardized structure, making it challenging to extract and organize the data. Format inconsistency can hinder effective data analysis and retrieval, as each type presents unique obstacles for seamless integration and usability.

2. Extracting Meaningful Entities and Relationships

Turning free text into structured graph nodes and edges requires advanced Natural Language Processing (NLP) to identify key entities, detect relationships, and disambiguate concepts. Graph connections may be inaccurate, incomplete, or irrelevant without proper entity linking.

3. Managing Scalability and Performance

Storing large-scale unstructured data in a graph requires efficient modeling, indexing, and processing strategies to ensure fast query performance and scalability.

Complementary Approaches to Unlocking Knowledge Intelligence from Unstructured Data

A strategic and comprehensive approach is essential to unlock knowledge intelligence from unstructured data. This involves designing a scalable and adaptable knowledge graph schema, deconstructing and enriching unstructured data with metadata, leveraging AI-powered entity and relationship extraction, and ensuring accuracy with human-in-the-loop validation and governance.

1. Knowledge Graph Schema Design for Scalability

A well-structured schema efficiently models entities, relationships, and metadata. As outlined in our best practices for enterprise knowledge graph design, a strategic approach to schema development ensures scalability, adaptability, and alignment with business needs. Enriching the graph with structured data sources (databases, taxonomies, and ontologies) improves accuracy and enhances AI-driven knowledge retrieval, ensuring that knowledge graphs are robust and optimized for enterprise applications.

2. Content Deconstruction and Metadata Enrichment

Instead of treating documents as static text, break them into structured knowledge assets, such as sections, paragraphs, and sentences, then link them to relevant concepts, entities, and metadata in a graph. Our Content Deconstruction approach helps organizations break large documents into smaller, interlinked knowledge assets, improving search accuracy and discoverability.

3. AI-Powered Entity and Relationship Extraction

Advanced NLP and machine learning techniques can extract insights from unstructured text data. These techniques can identify key entities, categorize documents, recognize semantic relationships, perform sentiment analysis, summarize text, translate languages, answer questions, and generate text. They offer a powerful toolkit for extracting insights and automating tasks related to natural language processing and understanding.
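
As one hedged example of what this can look like, the sketch below uses spaCy's small pretrained English model to pull named entities out of a sentence and stages them as nodes in a NetworkX graph linked back to their source document. The model choice, labels, and linking rule are simplifying assumptions; real pipelines add entity disambiguation and relationship extraction.

```python
# Minimal sketch: extract entities from unstructured text and stage them as graph nodes.
# Requires spacy and networkx, plus the en_core_web_sm model
# (install with: python -m spacy download en_core_web_sm).
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")
doc_id = "policy_memo_001"  # hypothetical document identifier
text = ("Acme Corp signed a distribution agreement with Globex in Chicago, "
        "effective January 2024.")

G = nx.DiGraph()
G.add_node(doc_id, type="Document")

for ent in nlp(text).ents:
    # Each recognized entity becomes a candidate node, typed by its NER label
    # (ORG, GPE, DATE, ...), and is linked back to the source document.
    G.add_node(ent.text, type=ent.label_)
    G.add_edge(doc_id, ent.text, relation="mentions")

print(G.nodes(data=True))
print(list(G.edges(data=True)))
```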

A well-structured knowledge graph enhances AI’s ability to retrieve, analyze, and generate insights from content. As highlighted in How to Prepare Content for AI, ensuring content is well-structured, tagged, and semantically enriched is crucial for making AI outputs accurate and context-aware.

4. Human-in-the-loop for Validation and Governance

AI models are powerful but have limitations and can produce errors, especially when leveraging domain-specific taxonomies and classifications. AI-generated results should be reviewed and refined by domain experts to ensure alignment with standards, regulations, and subject matter nuances. Combining AI efficiency with human expertise maximizes data accuracy and reliability while minimizing compliance risks and costly errors.

From Unstructured Data to Knowledge Intelligence: Real-World Implementations and Case Studies

Our innovative approach addresses the challenges organizations face in managing and leveraging their vast knowledge assets. By implementing AI-driven recommendation engines, knowledge portals, and content delivery systems, we empower businesses to unlock the full potential of their unstructured data, streamline processes, and enhance decision-making. The following case studies illustrate how organizations have transformed their data ecosystems using our enterprise AI and knowledge management solutions which incorporate the four components discussed in the previous section.

  • AI-Driven Learning Content and Product Recommendation Engine
    A global enterprise learning and product organization struggled with the searchability and accessibility of its vast unstructured marketing and learning content, causing inefficiencies in product discovery and user engagement. Customers frequently left the platform to search externally, leading to lost opportunities and revenue. To solve this, we developed an AI-powered recommendation engine that seamlessly integrated structured product data with unstructured content through a knowledge graph and advanced AI algorithms. This solution enabled personalized, context-aware recommendations, improving search relevance, automating content connections, and enhancing metadata application. As a result, the company achieved increased customer retention and better product discovery, leading to six figures in closed revenue.
  • Knowledge Portal for a Global Investment Firm
    A global investment firm faced challenges leveraging its vast knowledge assets due to fragmented information spread across multiple systems. Analysts struggled with duplication of work, slow decision-making, and unreliable investment insights due to inconsistent or missing context. To address this, we developed Discover, a centralized knowledge portal powered by a knowledge graph that integrates research reports, investment data, and financial models into a 360-degree view of existing resources. The system aggregates information from multiple sources, applies AI-driven auto-tagging for enhanced search, and ensures secure access control to maintain compliance with strict data governance policies. As a result, the firm achieved faster decision-making, reduced duplicate efforts, and improved investment reliability, empowering analysts with real-time, contextualized insights for more informed financial decisions.
  • Knowledge AI Content Recommender and Chatbot
    A leading development bank faced challenges in making its vast knowledge capital easily discoverable and delivering contextual, relevant content to employees at the right time. Information was scattered across multiple systems, making it difficult for employees to find critical knowledge and expertise when performing research and due diligence. To solve this, we developed an AI-powered content recommender and chatbot, leveraging a knowledge graph, auto-tagging, and machine learning to categorize, structure, and intelligently deliver knowledge. The knowledge platform was designed to ingest data from eight sources, apply auto-tagging using a multilingual taxonomy with over 4,000 terms, and proactively recommend content across eight enterprise systems. This approach significantly improved enterprise search, automated knowledge delivery, and minimized time spent searching for information. Bank leadership recognized the initiative as “the most forward-thinking project in recent history.”
  • Course Recommendation System Based on a Knowledge Graph
    A healthcare workforce solutions provider faced challenges in delivering personalized learning experiences and effective course recommendations across its learning platform. The organization sought to connect users with tailored courses that would help them master key competencies, but its existing recommendation system struggled to deliver relevant, user-specific content and was difficult to maintain. To address this, we developed a cloud-hosted semantic course recommendation service, leveraging a healthcare-oriented knowledge graph and Named Entity Recognition (NER) models to extract key terms and build relationships between content components. The AI-powered recommendation engine was seamlessly integrated with the learning platform, automating content recommendations and optimizing learning paths. As a result, the new system outperformed accuracy benchmarks, replaced manual processes, and provided high-quality, transparent course recommendations, ensuring users understood why specific courses were suggested.

Conclusion

Unstructured data holds immense potential, but without structure and context, it remains difficult to navigate. Unlike structured data, which is already organized and easily searchable, unstructured data requires advanced techniques like knowledge graphs and AI to extract valuable insights. However, both data types are complementary and essential for maximizing knowledge intelligence. By integrating structured and unstructured data, organizations can connect fragmented content, enhance search and discovery, and fuel AI-powered insights. 

At Enterprise Knowledge, we know success requires a well-planned strategy, including preparing content for AI,  AI-driven entity and relationship extraction, scalable graph modeling or enterprise ontologies, and expert validation. We help organizations unlock knowledge intelligence by structuring unstructured content in a graph-powered ecosystem. If you want to transform unstructured data into actionable insights, contact us today to learn how we can help your business maximize its knowledge assets.

 

Driving Behavioral Change for Information Management through Data-Driven Green Strategy (EDW 2024)

Enterprise Knowledge’s Urmi Majumder, Principal Data Architecture Consultant, and Fernando Aguilar Islas, Senior Data Science Consultant, presented “Driving Behavioral Change for Information Management through Data-Driven Green Strategy” on March 27, 2024 at Enterprise Data World (EDW) in Orlando, Florida. 

In this presentation, Majumder and Aguilar Islas discussed a case study describing how the information management division in a large supply chain organization drove user behavior change through awareness of the carbon footprint of their duplicated and near-duplicated content, identified via advanced data analytics. Check out their presentation to gain valuable perspectives on utilizing data-driven strategies to influence positive behavioral shifts and support sustainability initiatives within your organization.

In this session, participants gained answers to the following questions:

  • What is a Green Information Management (IM) Strategy, and why should you have one?
  • How can Artificial Intelligence (AI) and Machine Learning (ML) support your Green IM Strategy through content deduplication? 
  • How can an organization use insights into their data to influence employee behavior for IM?
  • How can you reap additional benefits from content reduction that go beyond Green IM?


Exploring Vector Search: Advantages and Disadvantages

The search for information is at the core of enhancing productivity and decision-making in the enterprise. In today’s digital age, searching for information has become more intuitive. With just a few clicks, we can explore vast knowledge and gain once-inaccessible insights. The ability to search for information empowers individuals and organizations to stay informed, make educated decisions, and ultimately drive success. The introduction of numerous search strategies and frameworks has made information more accessible, but it has also presented companies with a difficult choice between search systems for delivering knowledge to their consumers at the point of need. Vector search is one of the latest enterprise search frameworks that leverages the power of large language models (LLMs) to index and retrieve content. In this article, I will examine the main advantages and disadvantages of vector search when choosing the framework for your enterprise search initiative.

 

Advantages of Vector Search

One of vector search’s main advantages is its ability to deliver highly relevant and accurate search results. Unlike traditional keyword-based search systems, which only match exact words or phrases, vector search considers the semantic meaning and context of the search query. For example, if you are searching for “apple stock”, then keyword search will retrieve content related to those keywords which may include food recipes, or references to the “big apple”, while vector search will retrieve content in the financial domain. Moreover, even if the user does not use exact keywords, the system can still understand the query’s intent and return relevant results based on semantic similarity and context. This functionality dramatically improves the user experience and increases the likelihood of quickly and efficiently finding the desired information. Furthermore, they are well-suited to handling conversational search queries and understanding user intent, thus enhancing user engagement.
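
To make the “apple stock” example concrete, the hedged sketch below embeds a query and a few candidate passages with a sentence-transformers model and ranks them by cosine similarity. The model name and passages are illustrative assumptions; an enterprise deployment would typically store embeddings in a dedicated vector index rather than compare them in memory.

```python
# Minimal sketch: rank passages by semantic similarity to a query.
# Requires the sentence-transformers package; the model is one common choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "apple stock"
passages = [
    "AAPL shares rose 3% after the quarterly earnings call.",
    "This apple pie recipe uses tart Granny Smith apples.",
    "Visiting the Big Apple: a weekend guide to New York City.",
]

query_vec = model.encode(query, convert_to_tensor=True)
passage_vecs = model.encode(passages, convert_to_tensor=True)

# Cosine similarity measures semantic closeness rather than keyword overlap,
# so finance-related text can rank highly even without the literal word "stock".
scores = util.cos_sim(query_vec, passage_vecs)[0]
for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {passage}")
```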

Another main advantage of vector search is the versatility of content and use cases it can accommodate by leveraging the many language tasks its underlying LLM can perform. Three features in particular drive the primary differentiation between vector search and other search methods:

  • Multilingual Capabilities: Vector search engines rely on LLMs that interpret and process linguistic nuances, ensuring accurate information retrieval even in complex multilingual settings. Additionally, the multilingual capabilities of vector search engines make them invaluable tools for cross-lingual information retrieval, facilitating knowledge sharing and collaboration across language barriers.
  • Summarization, Named Entity Recognition (NER), and Natural Language Generation (NLG): Advanced vector search engines can efficiently summarize top results or lengthy documents. They can also use named entity recognition (NER) to identify, extract, and classify named entities such as people, organizations, and locations from unstructured content. Furthermore, generative AI enables these search engines to generate human-like text, which makes them useful for tasks such as content creation and automated report generation. These features benefit chatbots, virtual assistants, and customer support applications.
  • Recommendation System and Content-to-Content Search: Vector search engines go beyond retrieving search results; they can also power recommendation systems and content-to-content search. By representing content as vectors and measuring their similarity, vector search can efficiently identify duplicate or closely related documents. It is a valuable tool for organizations aiming to maintain content quality and integrity and those seeking to deliver relevant content recommendations to their users. This capability allows vector search engines to excel in plagiarism detection, content recommendation, and content clustering applications.

In summary, the advantages of vector search are numerous and compelling. Its ability to provide highly relevant and context-aware search results, its versatility in accommodating diverse language tasks, and its support for summarization, named entity recognition, natural language generation, recommendation systems, and content-to-content searches demonstrate its relevance as part of a comprehensive search strategy. However, it’s also essential to explore this approach’s potential drawbacks. Let’s shift our focus to the challenges of vector search to gain a more comprehensive understanding of its implications and limitations in various contexts.

 

Disadvantages of Vector Search

Vector search undoubtedly presents a wide range of opportunities but has challenges and limitations. One of the main disadvantages is the complex implementation process required for vector search. It can require significant computational power and expertise to properly design and implement the algorithms and models needed for vector search. It is essential to have a solid understanding of these drawbacks to conduct an accurate and thorough analysis of the viability of vector search in various settings. Here are some additional disadvantages of vector search:

  • Loss of Transparency and Hidden Bias: The inner workings of vector search engines are often opaque since they rely on pre-trained LLMs to vectorize the content. This lack of transparency can be a drawback in scenarios where you must explain or justify search results, such as in regulatory compliance or auditing processes. In these situations, the inability to explain clearly how the vector search engine arrived at specific results can raise concerns regarding bias or unfairness. Additionally, the lack of transparency can hinder efforts to identify and rectify potential issues or biases in the search algorithm.
  • Challenges in Specialized and Niche Contexts: Vector search encounters difficulties with rare or niche items, struggles to capture nuanced semantic meanings, and may lack the precision required in highly specialized fields. This limitation can lead to suboptimal search results in industries where precise terminology is crucial, like legal, healthcare, or scientific research. In this instance, a graph-based semantic search engine would be ideal because it could leverage an ontology to capture the intricate relationships and connections between specialized terms and concepts defined in an industry or enterprise taxonomy.
  • Performance vs. Accuracy Trade-off: LLM-based content vectorization can produce vectors of varying dimensions. The higher the dimensionality, the more information can be kept in the vectors, resulting in more exact search results. Higher dimensionality, however, comes at a higher processing cost and slower response times. As a result, vector search engines use approximate nearest neighbor (ANN) techniques to accelerate the process while sacrificing some search precision. These algorithms return results that are close to, but not always identical to, the true nearest neighbors. It’s a trade-off between speed and precision, and organizations must decide how much precision they’re willing to give up for faster search speeds.
  • Privacy Concerns: Handling sensitive or personal data with vector search engines, especially when using APIs to access and train LLM services, may raise privacy concerns. If not carefully managed, the training and utilization of such models could result in unintentional data exposure, leading to data breaches or privacy violations.

Overall, the complex implementation process demands computational power and expertise, while the lack of transparency and potential hidden biases can raise concerns, particularly in compliance- and fairness-sensitive contexts. Vector search struggles in specialized fields and encounters a trade-off between search speed and precision when employing approximate nearest-neighbor algorithms to deal with high vector dimensionality and content at scale. Furthermore, handling sensitive data poses privacy risks if not carefully managed. Understanding these disadvantages is pivotal to making informed decisions regarding adopting vector search.
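
The speed-versus-precision trade-off can be seen directly by comparing an exact index with an approximate one. The sketch below uses FAISS with random vectors purely as an illustration; the dimensionality, cluster count, and nprobe values are arbitrary assumptions that would be tuned per workload.

```python
# Minimal sketch: exact search vs. approximate nearest neighbor (ANN) search.
# Requires faiss (faiss-cpu) and numpy; the data is random and purely illustrative.
import numpy as np
import faiss

dim, n_vectors = 128, 10_000
rng = np.random.default_rng(42)
corpus = rng.random((n_vectors, dim), dtype=np.float32)
query = rng.random((1, dim), dtype=np.float32)

# Exact (brute-force) index: highest precision, slowest at scale.
exact = faiss.IndexFlatL2(dim)
exact.add(corpus)
_, exact_ids = exact.search(query, 5)

# ANN index (inverted file): probes only a subset of clusters, trading some
# recall for much faster queries on large corpora.
quantizer = faiss.IndexFlatL2(dim)
ann = faiss.IndexIVFFlat(quantizer, dim, 100)  # 100 clusters (assumption)
ann.train(corpus)
ann.add(corpus)
ann.nprobe = 8                                 # clusters probed per query (assumption)
_, ann_ids = ann.search(query, 5)

print("Exact neighbors:", exact_ids[0])
print("ANN neighbors:  ", ann_ids[0])  # usually overlaps, but can differ slightly
```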

 

Conclusion

In conclusion, vector search represents a significant leap in search technology but requires careful assessment to maximize its benefits and mitigate potential limitations in diverse applications. As knowledge management and AI continue to evolve, the right search strategy can be a game-changer in unlocking the full potential of your organization’s knowledge assets. At EK, we recognize that adopting vector search should align with the organization’s goals, resources, and data characteristics. We recently worked with one of our clients to iteratively develop the vector search process and training algorithms to help them take advantage of their multilingual content and varied unstructured and structured data. Contact us to have our search experts work closely with you to understand your specific requirements and design a tailored search solution that optimizes the retrieval of relevant and accurate information.

Expert Analysis: Keyword Search vs Semantic Search – Part One

For a long time, keyword search was the predominant method to provide search to an enterprise application. In fact, it is still a tried-and-true means to help your users find what they are looking for within your content. However, semantic search has recently gained wider acceptance as a plausible alternative to keyword search. In this Expert Analysis blog, two of our senior consultants, Fernando Aguilar and Chris Marino, explain these different methods and provide guidance on when to choose one over the other.

What’s the difference between a keyword search system and a semantic search system?

Keyword Search (Chris Marino)

The heart of a keyword search system is a data structure called an “inverted index.” You can think of it as a two-column table. Each row in the table corresponds to a term found in your corpus of documents. One column contains the term, and the other column contains a list of all your documents (by ID) where that particular term appears. The process of filling up this table with the content in your documents is called “indexing.”

When a user performs a search in a keyword system, the search engine takes the words from their query and looks for an exact match in the inverted index. Then, it returns the list of matching documents. However, instead of returning them in random order, it applies a ranking (or scoring) algorithm to ensure that the more relevant documents appear first. This ranking algorithm is normally based on a couple of factors: “term frequency” (the number of times the terms appear in the document) and the rarity of the word across your entire corpus of documents. For example, if you search for “vacation policy” in your company’s documents, “vacation” most likely appears less frequently than “policy,” so those documents with “vacation” should have a higher score.
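
A toy version of this index and its scoring might look like the sketch below. It uses a smoothed TF-IDF-style weighting as a stand-in for the tuned ranking functions (such as BM25) that production engines like Lucene, Elasticsearch, or Solr apply; the documents are hypothetical.

```python
# Minimal sketch: an inverted index with TF-IDF-style scoring over toy documents.
import math
from collections import defaultdict

docs = {
    1: "vacation policy for full time employees",
    2: "expense policy and travel policy updates",
    3: "security policy overview and escalation contacts",
}

# Indexing: map each term to the documents (and counts) where it appears.
index = defaultdict(dict)  # term -> {doc_id: term frequency}
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def search(query: str):
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Rarer terms ("vacation") weigh more than common ones ("policy").
        idf = math.log(1 + len(docs) / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("vacation policy"))  # doc 1 ranks first: it matches both terms
```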

Semantic Search (Fernando Aguilar)

Semantic search, also known as vector search, is a type of search method that goes beyond traditional keyword-based search and attempts to understand the intent and meaning behind the user’s query. It uses natural language processing (NLP) and machine learning algorithms to analyze the context and relationships between words and concepts in a query, and to identify the most relevant results based on their semantic meaning. This approach is often used in applications such as chatbots, virtual assistants, and enterprise search to provide more accurate and personalized results to users.

In contrast to keyword search, which relies on matching specific keywords or phrases in documents or databases, semantic search is able to understand the underlying meaning of the query and identify related concepts, synonyms, and even ambiguous terms. This enables it to provide more comprehensive and relevant results, especially in cases where the user’s intent may not be well-defined or where multiple meanings are possible.

What are the Pros and Cons of using Keyword Search vs Semantic Search?

Keyword Search (Chris Marino)

Keyword search is a workhorse application that has been around for decades. This fact makes it a natural choice for many search solutions. It tends to be easier to implement because it’s a more familiar application. It’s been battle-tested, and there is a wealth of developers out there who know how to integrate it. As with many legacy systems, there are many thought pieces, ample documentation, pre-built components, and sample applications available via a Google search (or just ask ChatGPT).

Another benefit of keyword search is its interpretability – the ability for a user to understand why a certain result matched the query. You can easily see the terms you have searched for in your results. Although there is an algorithm performing the scoring ranking, a search developer can quickly discern why a certain result appeared before another and make tweaks to impact the algorithm. Conversely, the logic behind semantic search results is more of a “black box” variety. It’s not always readily apparent why a particular result was returned. This has a significant impact on overall user experience; when users understand why they’re getting a search result, they trust the system and feel more positively towards it.

The biggest drawback of keyword search is that it lacks the ability to determine the proper context of your searches. Instead of seeing your search terms as concepts or things, it sees them simply as strings of characters. Take for instance the following query:

“What do eagles eat?”

Keyword search processes and searches for each term individually. It has no concept that you are asking a question or that “what” and “do” are unimportant. Further, there are many different concepts known as “Eagles”: the bird-of-prey, the 70’s rock group, the Philadelphia football team, and the Boston College sports teams. While a person can surmise that you’re interested in the bird, keyword search is simply looking for any mention of the letter string: “e-a-g-l-e.”

Semantic Search (Fernando Aguilar)

Semantic search has gained popularity in recent years due to its ability to understand the intent and meaning behind the user’s query, resulting in more relevant and personalized results. However, not all use cases benefit from it. Understanding the advantages, limitations, and the trade-offs between semantic and keyword search can help you choose the best approach for your organization’s specific needs.

Pros:

  • Semantic search makes search results more comprehensive and inclusive by identifying and matching term synonyms and variations.
  • Vector search provides more relevant results by considering a query’s context, allowing it to differentiate between “Paris,” the location, and “Paris,” a person’s name. It also understands the relationship between its terms, such as part-of-speech (POS) tagging, and identifying different terms as verbs, adjectives, adverbs, or nouns.
  • It enables the user to express their intent more accurately by allowing them to make queries using natural language phrases, synonyms, or variations of terms and misspellings, leading to a more user-friendly search experience.

Cons:

  • Calculating similarity metrics to retrieve search results is computationally intensive. Optimization algorithms are generally needed to speed up the process. However, faster search times come at the cost of decreased accuracy.
  • The search results can be less relevant if users are accustomed to searching using one or two-term queries instead of using search phrases. Therefore, it is essential to analyze current search patterns before implementing vector search.
  • Pre-trained language models need to be fine-tuned to learn and understand the relationships between words in the context of your business domain. Fine-tuning a language model will improve the accuracy of the search results, but training is usually time-consuming and resource intensive.

How do the use cases for each type of search differ?

Keyword Search (Chris Marino)

In general, any search use case is a good case for keyword search. It has been around for many years and, when configured correctly, can provide solid results at a reasonable cost. However, there are a few use cases that are particularly well-suited for keyword search: academic and legal search, primarily by librarians. It’s been my experience that these types of searchers have very exact, complex queries. Characteristics of these queries might include:

  • Exact phrase matching
  • Multi-field searches (“show me documents with X in Field 1, Y in Field 2, Z in Field 3 …”)
  • Heavy boolean searches (“show me this OR these AND those but NOT that”)

In these instances, the user needs to ensure and validate that each result matches their exact query. They are not looking for suggestions. Precision (“show me exactly what I asked for”) is more important than recall (“show me things I may be interested in but didn’t specifically request”).

Semantic Search (Fernando Aguilar)

The primary use case differentiator will be determined by how search users format their queries. Semantic search will prove best for users that submit search phrases where context, word relationships, and term variations are present versus searching for a couple of exact terms. Hence, beyond a search query, chatbots, virtual assistants, or customer service applications are great examples where users may be conversationally asking questions.

What are the cool features found in keyword search vs semantic search?

Keyword Search (Chris Marino)

There are a number of features that keyword search provides to improve a searcher’s overall experience. Some of the main ones include facets, phrase-searching, and snippets.

Facets

Facets are filters that let you refine your search results to only view items that are of particular interest to you based on a common characteristic. Think of the left-hand side of an Amazon search results page. They are based on the metadata associated with your documents, so the richer your metadata, the better options you can provide to your users. In an enterprise setting, common facets are geography-based ones (State, Country), enterprise-based ones (Department, Business Unit), and time-based ones (Published Date, Modified Date – whose values can even contain relative values like “Today”, “Last 7 days”, “This Year”).

Phrase searching

Phrase searching allows you to find exact phrases in your documents, normally by including the phrase within quotation marks (“”). A search for “tuition reimbursement” will only return documents that match this exact phrase, and not documents that only mention “tuition” or “reimbursement” independent from one another.

Snippets

Snippets are small sections from your document which include your search terms and are displayed on the search results page. They show the search terms in the context of the overall document, e.g., the main sentence that contains the terms. This helps by providing a visual cue to help the searcher understand why this particular document appears. Normally, the search results page displays the title of the document, but often your search term does not appear in the title. By displaying the snippet, with the search term highlighted, the user feels validated that the search has returned relevant information. We refer to this as “information scent.”
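
A simplified version of snippet generation might look like the sketch below: it returns the first sentence containing a query term and wraps matches in highlight markers. Real engines also handle stemming, multi-term proximity, and snippet length limits; the document and markup here are illustrative assumptions.

```python
# Minimal sketch: pull a highlighted snippet from a document for the results page.
import re

def snippet(document: str, query: str) -> str:
    terms = [t for t in query.lower().split() if t]
    # Naive sentence split; production engines use proper tokenizers.
    for sentence in re.split(r"(?<=[.!?])\s+", document):
        if any(term in sentence.lower() for term in terms):
            # Wrap each matched term in markers the UI can render as highlights.
            for term in terms:
                sentence = re.sub(f"(?i)({re.escape(term)})", r"<em>\1</em>", sentence)
            return sentence
    return document[:120]  # fallback: lead text when no term matches

doc = ("Employees may enroll in several benefits programs. "
       "Tuition reimbursement requests are reviewed each quarter by HR.")
print(snippet(doc, "tuition reimbursement"))
```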

Semantic Search (Fernando Aguilar)

Currently, semantic search is one of the most promising techniques for improving search and organizing information. While semantic methods have already proven effective in a variety of fields, such as computer vision and natural language processing, there are several cool features that make semantic search an exciting area to watch for enterprise search. Some examples include:

  • Concurrent Multilingual Capabilities: Vector search can leverage multilingual language models to retrieve content regardless of the language of the content or the query itself.
  • Text-to-Multimodal Search: Natural language queries can retrieve un-tagged video, image, or audio content, depending on the model used to create the vectors.
  • Content Similarity Search: Semantic search can also take content as query input, so applications can retrieve similar content to the one the user is currently viewing.

Conclusion

If perfecting the relevancy of your search results isn’t directly tied to your organization’s revenue or mission achievement, keyword search provides an efficient, proven, and effective method for implementing search in your application. On the other hand, semantic search will be a better solution when the clients are using natural language to describe what they are looking for, when the content to be retrieved is not all text-based, or when an API (not a person) is consuming your search.

Check out some of our other thought leadership pieces on search:

5 Steps to Enhance Search with a Knowledge Graph

Dashboards – The Changing Face of Search

And if you are embarking on your own search project and need proven expertise to help guide you to success, contact us!

The Value of Data Catalogs for Data Scientists

Introduction

After the Harvard Business Review called Data Scientist the sexiest job of the 21st century in 2012, much attention went into the interdisciplinary field of data science. Students and professionals were curious to know more about what data scientists do, while businesses and organizations across industries wanted to understand how data scientists could bring them value.

In 2016, CrowdFlower, now Appen, published their Data Scientist report to respond to this newfound interest. This report aimed to survey professional data scientists with different years of experience and fields of expertise to find, among other things, what their everyday tasks were. The most important takeaway from this report is that it supports the famous 80/20 rule of data science. This rule states that data scientists spend around 80% of their time sourcing and cleaning data, leaving only 20% of their time to perform analysis and develop machine learning models, which, according to the same CrowdFlower survey, is the task that data scientists enjoy the most. The pie chart below shows that 1 out of every 5 data scientists spends most of their time collecting data, while 3 out of every 5 spend most of their time cleaning and organizing it.

[Figure: donut chart showing what data scientists spend most of their time doing]

[Figure: bar chart from Anaconda’s 2020 State of Data Science report summarizing how data scientists spend their time]

More recently, Anaconda’s 2020 State of Data Science Report shows that the time data scientists spent collecting, cleaning, and organizing data improved. It now takes up to 50% of their time. From the bar chart, we can notice that most of the improvement is due to a dramatic decrease in the time spent cleaning data, from 60% in 2016 to 26%. However, collecting data remained static at 19%. We can also notice the introduction of time spent on data visualizations. This addition speaks to the growing need to communicate the value of the data scientist’s work to non-technical executives and stakeholders. It is therefore not surprising that the amount of time dedicated to developing those visualizations is a third of the time spent generating that value through model selection, model training and scoring, and deploying models.

In my experience, Anaconda’s report remains true to this date. When starting a data science project, finding the relevant data to fit the client’s use case is time-consuming. It often involves not only querying databases but also interviewing data consumers and producers that may point to data silos only known to a small group or even bring out discrepancies in understanding among teams regarding the data. Bridging the gap in understanding data among data personas is the most time-consuming task and one that I have witnessed data catalogs excel at performing.

To sustain this trend and further reverse the 80/20 rule, businesses and organizations must adopt tools that facilitate tasks throughout the data science process, especially data sourcing and cleaning. Implementing an enterprise data catalog would be an ideal solution with an active role throughout the data science process. By doing so, data scientists will have more time to spend on high-value-generating tasks, increasing the return on investment.

Enterprise Data Catalogs

A data catalog is a metadata management system for an organization’s data resources. In the context of this blog, data catalogs help data scientists and analysts find the data they need and provide information to evaluate its suitability for the intended use. Some capabilities enabled by a data catalog are:

  • Increased search speed utilizing a comprehensive index of all included data
  • Improved visibility with custom views of your data organized by defined criteria
  • Contextual insights from analytics dashboards and statistical metrics
  • Documentation of cross-system relationships between data at an enterprise level
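
As a rough illustration of what sits behind these capabilities, the sketch below models a single catalog entry as plain Python data. The field names (owner, steward, lineage, quality_score, and so on) are assumptions for the sake of the example and do not reflect the schema of any particular catalog product.

```python
# Hypothetical shape of a single catalog entry; the field names are assumptions
# for illustration and do not reflect any specific vendor's schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    name: str                 # business-friendly asset name
    system: str               # source system holding the data
    owner: str                # accountable data owner
    steward: str              # day-to-day point of contact
    tags: List[str] = field(default_factory=list)      # entities present in the data
    lineage: List[str] = field(default_factory=list)   # upstream assets feeding this one
    quality_score: float = 0.0                          # e.g., a completeness/validity rollup

entry = CatalogEntry(
    name="customer_orders",
    system="warehouse.sales",
    owner="sales-ops@example.com",
    steward="data-steward@example.com",
    tags=["customer", "order", "transaction"],
    lineage=["crm.contacts", "erp.order_lines"],
    quality_score=0.92,
)
print(entry.name, entry.owner, entry.quality_score)
```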

Because of these capabilities, data catalogs prove relevant throughout the data science process. To demonstrate this, let's review their relevance at each step of the OSEMN framework.

Value of Data Catalogs to the OSEMN Framework

The acronym OSEMN stands for Obtaining, Scrubbing, Exploring, Modeling, and iNterpreting data. It is a convenient framework to analyze the value of data catalogs because each step translates to a specific task in a typical data science project. Mason and Wiggins introduced this five-step OSEMN framework in their article “A Taxonomy of Data Science” in 2010, and it has been widely adopted by data scientists since.

Obtain

This step involves searching for and sourcing relevant data for the project. That is easy enough to do if you know what specific datasets to look for and whom to ask for access to the data. However, in practice, this is rarely the case. In my experience, the projects that generate the most significant value for the organization require integrating data across systems, teams, departments, and geographies. Furthermore, teams leading analytics and data science efforts recognize that the ROI of the project is highly dependent on the quality, integrity, and relevance of the sourced data. They, therefore, have been spending about a fifth of their time sourcing and verifying that they have high-quality and complete data for their projects. Data catalogs can help reduce this time through advanced search, enhanced discoverability, and data trustworthiness.

  • Advanced Search: Enterprise-wide faceted search provides knowledge instead of simple results by displaying the data's contextual information, such as the data asset's owner, steward, approved uses, content, and quality indicators. Most teams won't have access to every enterprise dataset, but these metadata profiles help data scientists quickly find what data is available to them, assess its fitness for their use case, and identify whom to ask for access to the data they need (see the sketch after this list).
  • Enhanced Discoverability: Although Obtain is the first step in the OSEMN framework, it comes after understanding the business problem. That understanding gives greater insight into the entities involved, such as customers, orders, transactions, organizations, and metrics. Users can tag datasets according to the entities present in the data, and the data catalog can then auto-tag new content as it is ingested. This allows new data to become discoverable almost immediately, making additional and more recent data available for the project.
  • Data Trustworthiness: Searching for data across systems and departments can be time-consuming and often does not yield great results. Occasionally, you might stumble upon data that seems fit for your use case, but can you trust it? With a data catalog, data scientists no longer have to do detective work tracking down the data's origins to assess its reliability: the catalog traces the data's lineage and displays quality metrics, taking the guesswork out of sourcing your data.
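
To make the advanced search point concrete, here is a minimal sketch of a faceted search call against a hypothetical catalog REST API. The endpoint, parameters, and response fields are assumptions for illustration, since each catalog product exposes its own interface.

```python
# Minimal sketch of a faceted catalog search; the endpoint and fields are
# hypothetical and will differ by catalog product.
import requests

CATALOG_URL = "https://catalog.example.com/api/search"  # hypothetical endpoint

params = {
    "q": "high-risk customer",               # free-text query
    "facets": "owner,system,quality_score",  # metadata facets to return
    "tags": "customer,transaction",          # entity tags to filter on
}

response = requests.get(CATALOG_URL, params=params, timeout=30)
response.raise_for_status()

for asset in response.json().get("results", []):
    # Enough context to judge fitness for use and whom to ask for access
    print(asset.get("name"), asset.get("owner"), asset.get("quality_score"))
```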

Scrub

In this step, data scientists curate a clean data set for their project. Typical tasks include merging data from different sources into a single data set, standardizing column formatting, and imputing missing values. As noted in the introduction, the time spent cleaning and organizing data has sharply decreased, and I believe the advent of user-friendly ETL solutions has played a significant role in that drop. These solutions let users define pipelines and actions in graphical interfaces that handle data merging, standardization, and cleaning. While some data catalogs offer comprehensive ETL features, most provide only basic ETL capabilities, which users can expand through third-party integrations. ETL capabilities aside, data catalogs are still helpful in this step.

Many organizations reuse the same data assets across multiple initiatives, yet each team cleans and standardizes the data only to store it in its own project folder. These data silos add clutter, waste storage, and increase duplicative work. Why not catalog the clean, standardized data instead? The next team that needs it will save time by reusing an already vetted and cleaned data set.
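
As a minimal sketch of the scrubbing tasks described above, the snippet below merges two sources, standardizes column names, and imputes missing values with pandas. The file names and columns are hypothetical.

```python
# Minimal scrubbing sketch with pandas; file names and columns are hypothetical.
import pandas as pd

crm = pd.read_csv("crm_customers.csv")        # e.g., CustomerID, Full Name, Region
orders = pd.read_csv("warehouse_orders.csv")  # e.g., customer_id, order_total

# Standardize column names to snake_case so the merge keys line up
crm.columns = crm.columns.str.strip().str.lower().str.replace(" ", "_")
orders.columns = orders.columns.str.strip().str.lower().str.replace(" ", "_")

# Merge the sources into a single data set
df = crm.rename(columns={"customerid": "customer_id"}).merge(
    orders, on="customer_id", how="left"
)

# Impute missing numeric values with the median; flag missing categoricals
df["order_total"] = df["order_total"].fillna(df["order_total"].median())
df["region"] = df["region"].fillna("unknown")

# Persist (and ideally catalog) the cleaned data set for reuse by other teams
df.to_csv("customers_orders_clean.csv", index=False)
```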

Explore

In this step, data scientists typically perform exploratory data analysis (EDA): examining the data to understand its underlying structure and looking for patterns and trends. The queries developed here produce descriptive statistics, visualizations, and correlation reports, and some may even yield new features. Data catalogs can support federated queries so that data scientists can perform their EDA from a single self-service store, saving the time otherwise spent querying multiple systems at different access points and figuring out how to aggregate the results in one repository. The benefits do not stop there: the queries, the aggregated data set, and the visualizations developed during EDA can also be cataloged and made discoverable by other teams that might reuse them for their own initiatives. Furthermore, these cataloged assets become fundamental for future reproductions of the model.
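
A minimal EDA sketch over the cleaned data set from the previous step might look like the following; the file and column names are the same hypothetical ones used above, and the corr call assumes a recent pandas release.

```python
# Minimal EDA sketch; assumes the hypothetical cleaned data set from the Scrub step.
import pandas as pd

df = pd.read_csv("customers_orders_clean.csv")

print(df.describe(include="all"))      # descriptive statistics per column
print(df.corr(numeric_only=True))      # correlations between numeric features

# A simple derived feature that might fall out of exploration
df["high_value"] = df["order_total"] > df["order_total"].quantile(0.9)
print(df.groupby("region")["high_value"].mean())  # share of high-value customers by region
```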

Model

According to the CrowdFlower survey, this is the most enjoyable task for data scientists. We have been building up to this step, which many data scientists would say is "where the magic happens." But "magic" does not have to be a black box: data catalogs can enhance a model's explainability through their traceability features. Every stakeholder with access to the information can see the training and test data, its origin, any documented transformations, and the EDA. This information gives non-technical stakeholders enough context to understand the model's results.

So far, data catalogs have provided circumstantial help in this phase, mostly as byproducts of the interactions between the data scientist and the data catalog in the previous steps. However, data catalogs are directly beneficial during model selection. As more training data becomes available, the results of different models become more similar; in other words, model selection loses relevancy as the amount of training data increases. A data catalog gives data scientists a self-service data discovery platform to source more data than was feasible in previous efforts, making their work more efficient by removing the constraints on model selection caused by insufficient data. It also saves time and resources, since data scientists can train fewer models without significantly impacting the results, which is paramount when developing proof-of-concept models.
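
The point about model selection mattering less as training data grows can be sketched with a simple cross-validated comparison. The models and data here are placeholders built on scikit-learn's synthetic data generator, not the experiment behind any survey; with more rows, the scores of the two model families tend to converge.

```python
# Sketch: compare two model families under cross-validation as the data grows.
# Uses synthetic data as a stand-in for real enterprise data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

for n_rows in (500, 5_000, 50_000):
    X, y = make_classification(n_samples=n_rows, n_features=20, random_state=0)
    for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"{n_rows:>6} rows  {type(model).__name__:<22} accuracy={score:.3f}")
```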

iNterpret

This step is where data scientists communicate their findings and the project's ROI to stakeholders. In my experience, it is the most challenging step. In Anaconda's report, data scientists were asked how effectively their teams demonstrated the impact and value of data science on business outcomes. As the results below show, data science teams were more effective at communicating their impact in industries with a higher proportion of technical staff. There is also a wide gap across sectors, with teams in consulting and technology firms conveying their projects' impact almost twice as effectively as teams driving healthcare data science projects.

How effective are data scientist teams at demonstrating the impact of data science on business outcomes?

To accommodate non-technical audiences, many data scientists rely on dashboards and visualizations, which improve how value is communicated to stakeholders. Data scientists can also catalog these dashboards and visualizations in the metadata management solution. In this way, data catalogs increase the project's visibility by storing these interpretations as insights discoverable by stakeholders and a wider approved audience. Data scientists in other departments, geographies, or subsidiaries with a similar project in mind can then build on the previous work whenever possible, reducing duplicative work.
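
One lightweight way to make those interpretations discoverable is to register the finished dashboard back into the catalog. The snippet below sketches that step against the same hypothetical catalog API used earlier; the endpoint and payload are assumptions, not a real product's interface.

```python
# Sketch: register a finished dashboard as a catalog asset so stakeholders
# and other teams can discover it. Endpoint and payload are hypothetical.
import requests

CATALOG_URL = "https://catalog.example.com/api/assets"  # hypothetical endpoint

dashboard_asset = {
    "name": "Customer Churn Model - Executive Summary",
    "type": "dashboard",
    "url": "https://bi.example.com/dashboards/churn-summary",  # hypothetical BI link
    "owner": "data-science@example.com",
    "tags": ["churn", "customer", "model-results"],
    "lineage": ["customers_orders_clean.csv", "churn_model_v1"],
}

response = requests.post(CATALOG_URL, json=dashboard_asset, timeout=30)
response.raise_for_status()
print("Registered dashboard asset:", response.json().get("id"))
```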

Conclusion

Data catalogs offer many benefits throughout a data science project’s process. They provide data scientists with self-service data access and a discoverability ecosystem on which to obtain, process, aggregate, and document the data resources they need to develop a successful project. Most of the benefits are front-loaded in the first step of the OSEMN framework, obtaining data. However, we can note their relevance throughout the remaining steps.

I would like to clarify that no single data catalog solution will have all the capabilities discussed in this article embedded as core features. Please consider your enterprise needs and evaluate them against the features of the data catalog solution you are considering implementing. Our team of metadata management professionals has led over 40 successful data catalog implementations with most major solution providers. Don't hesitate to contact us; we can help you navigate the available data catalog solutions, choose the one that best fits your organization's needs, and lead its successful implementation.

The post The Value of Data Catalogs for Data Scientists appeared first on Enterprise Knowledge.

]]>
Metadata Use Case: IMDB in Amazon Prime Video https://enterprise-knowledge.com/metadata-use-case-imdb-in-amazon-prime-video/ Tue, 19 May 2020 16:00:05 +0000 https://enterprise-knowledge.com/?p=11182 Have you been catching up on your favorite TV shows lately? If so, while watching a series or movie from home, it is very likely you might have asked yourself the following questions: “The narrator’s voice sounds familiar, who is … Continue reading

The post Metadata Use Case: IMDB in Amazon Prime Video appeared first on Enterprise Knowledge.

]]>
Have you been catching up on your favorite TV shows lately? If so, while watching a series or movie from home, it is very likely you might have asked yourself the following questions:

  • “The narrator’s voice sounds familiar, who is it?”
  • “What is that actor’s name? I think I might have seen him in another movie.”
  • “Isn’t she the actress from this other show I watched some years ago?”

A few years ago, these questions might have gone unanswered if neither you nor anyone watching with you knew the answer, or you might have had to wait until the credits rolled. Now, all it takes is a simple Google search to find the answers. The information you might find on the internet about a series includes the cast, number of seasons, number of episodes per season, airing dates, episode summaries, episode length, and production details, among others. This relationship between the TV series and the information you found about it online brings us to the concept of metadata.

an example of metadata from the movie La La Land, such as "title," "description," "director," and more.

Metadata

As you might notice from the example above, metadata is simply data about data; in this case, the data on the internet about the videos you watched. Metadata's primary use is to provide context, describe data, and enhance findability, all of which are especially helpful when dealing with unstructured data (a concrete example follows the list below).

  • Structured data: These data follow a defined framework with a set number of fields. Think of a well-formatted spreadsheet where every column contains one specific type of data. An example of this would be a table with personal information, such as name, address, telephone number, and age of multiple people.  
  • Unstructured data: This data cannot be stored in a traditional column-row database or spreadsheet. Think of photos, videos, audio, text documents, and websites. Unstructured data is also the most common type of data, and because of its lack of structure, its metadata is particularly useful for finding it and making sense of it. How would you find a movie without being able to search by its title, who stars in it, or what it is about?
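
To make the idea concrete, a metadata record for a film (echoing the La La Land example above) might look like the following; the fields and values are illustrative only.

```python
# Illustrative metadata record for a film; fields and values are examples only.
movie_metadata = {
    "title": "La La Land",
    "description": "A jazz pianist and an aspiring actress pursue their dreams in Los Angeles.",
    "director": "Damien Chazelle",
    "release_year": 2016,
    "genres": ["Musical", "Drama", "Romance"],
    "cast": ["Ryan Gosling", "Emma Stone"],
    "runtime_minutes": 128,
}

# The video file itself is unstructured; this record is what makes it searchable,
# e.g. "find musicals starring Emma Stone released after 2015".
print(movie_metadata["title"], movie_metadata["director"])
```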

Amazon Prime Video Meets IMDB

IMDB is a database designed to provide TV watchers and cinephiles with information about millions of TV shows and films, including cast biographies and reviews. Amazon bought IMDB back in 1998 for its lucrative user base, and the acquisition would later give the Amazon Prime Video streaming service a marketing push by allowing Amazon to promote the service to an already targeted audience.

An example of the X-Ray feature on Amazon Prime Video, which presents facts such as general trivia or the cast of the show you are watching. This image shows general trivia displayed through the X-Ray feature for the show Jack Ryan.

Amazon Prime Video and IMDB kept growing in both content and active users. The streaming service not only got its marketing push, but also integrated its user-generated data (such as user behavior and preferences) with IMDB’s database to boost its recommendation systems across platforms. So, how could these two successful products be further integrated? Fast forward about a decade, and Amazon Prime Video added a new feature called X-Ray.

Remember those questions many people have while watching a video? X-Ray takes care of that. When you pause your favorite show, X-Ray displays information about what you are currently viewing: cast information, filmographies, facts, trivia, character backstories, photo galleries, bonus video content, and music. By leveraging metadata from IMDB, Amazon Prime Video adds structure to unstructured video content, enabling users to answer those nagging questions.
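
Conceptually, an X-Ray-style lookup amounts to joining the current playback position against time-coded metadata. The sketch below shows that idea with invented scene data; it is an illustration of the general technique, not Amazon's actual implementation.

```python
# Conceptual sketch of an X-Ray-style lookup: map a playback timestamp to the
# cast on screen using time-coded scene metadata. Data is invented for illustration.
scenes = [
    {"start": 0,    "end": 180,  "cast": ["Narrator (voice)"]},
    {"start": 180,  "end": 1020, "cast": ["Lead Actor", "Supporting Actress"]},
    {"start": 1020, "end": 1800, "cast": ["Lead Actor", "Guest Star"]},
]

def who_is_on_screen(timestamp_seconds: float) -> list:
    """Return the cast listed for the scene covering the given timestamp."""
    for scene in scenes:
        if scene["start"] <= timestamp_seconds < scene["end"]:
            return scene["cast"]
    return []

print(who_is_on_screen(1200))  # ['Lead Actor', 'Guest Star']
```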

Building More Informed Recommendation Systems

Recently, due to the COVID-19 crisis, video streaming services have experienced a surge in demand. As people catch up on their favorite series, streaming firms need to keep customers engaged by recommending related material.

At Enterprise Knowledge, we had the opportunity to work with a prominent client in the telecommunications industry, improving their recommendation systems by leveraging the power of metadata on their unstructured content. The enhanced recommendation engine takes the specific scene the viewer is currently watching as input, ingesting information pulled from the closed captioning file and from internal and external databases about the TV series, episode, and scene. As a result, the system works not only on general information about the series, such as genre, recurring cast, summaries, and network, but also on specific details about the scene, such as sentiment inferred from the subtitles, non-recurring cast appearances, and the music in the scene, improving the recommendations provided to the viewer.
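
As a rough sketch of how scene-level metadata can feed a recommender, the snippet below scores similarity between a currently watched scene and candidate titles using simple attribute overlap. The attributes and titles are invented, and this is an illustration of the general approach, not the system built for the client.

```python
# Rough sketch of metadata-driven similarity scoring between a watched scene and
# candidate titles; attributes and titles are illustrative only.
def jaccard(a: set, b: set) -> float:
    """Overlap between two attribute sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

current_scene = {"genre:thriller", "mood:tense", "music:orchestral", "cast:guest_star_x"}

candidates = {
    "Series A, S02E05": {"genre:thriller", "mood:tense", "cast:guest_star_x"},
    "Film B":           {"genre:comedy", "mood:light", "music:pop"},
    "Series C, S01E01": {"genre:thriller", "music:orchestral", "mood:dark"},
}

ranked = sorted(candidates.items(),
                key=lambda item: jaccard(current_scene, item[1]),
                reverse=True)
for title, attrs in ranked:
    print(f"{title}: {jaccard(current_scene, attrs):.2f}")
```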

Beyond the media and telecommunications industries, metadata plays an equally crucial role in making unstructured data usable and accessible. It allows enterprise applications to link unstructured content based on the attributes assigned in its metadata. In the pharmaceutical industry, for example, a recommendation system could take research papers, formulations, and experiment reports and link them based on related chemical compounds, illnesses, or authors. These links power recommendation systems and enterprise search engines that provide content at the user's point of need. The resulting enterprise applications are only as powerful as the quality and completeness of the metadata used to derive the results.

The benefits of including metadata as an integral part of an organization’s strategy include:

  • Content findability, reuse, and sharing: Metadata ensures that complex content is easily understood and processed by people other than the content creator. Hence, it allows anyone in the organization to find the content they need to do their jobs regardless of content type, knowledge of its existence, who owns it, or where it is located. This results in increased productivity and higher quality of work. 
  • Data Governance: Metadata can also serve as an annotation tool that denotes content ownership and temporality since some data may be deemed irrelevant after a specific timeframe. This also makes it easier to identify who is responsible for the timeliness and the quality of the content. Furthermore, it can be used to trigger workflows that ensure the content is accurate and up to date, if necessary. As a result, organizations have greater control over their content and data, ensuring the right people are finding and acting on the right information.
  • Innovation and Service: When employees spend less time asking coworkers for content, looking for information, recreating information, and waiting for answers, they have more time for innovation and customer support. This, in turn, results in greater employee and customer satisfaction, which leads to higher employee and customer retention.

Conclusion

In conclusion, metadata provides structure to unstructured content, making it machine-readable and ready to work with machine learning and artificial intelligence applications. In the example above, Enterprise Knowledge enhanced the client's unstructured video content with internally and externally sourced data to provide a metadata-rich environment. This environment gave the recommendation system access to new information on which to base its decisions, culminating in better recommendations that keep TV watchers engaged. Similarly, we can help your organization connect its data, content, and people in ways that enhance your corporate knowledge, resulting in the benefits discussed above.

Does your organization need assistance in leveraging metadata to enhance its unstructured content? Feel free to reach out to us for help!

The post Metadata Use Case: IMDB in Amazon Prime Video appeared first on Enterprise Knowledge.

]]>