data catalog Articles - Enterprise Knowledge
https://enterprise-knowledge.com/tag/data-catalog/

EK / DataGalaxy Joint Webinar
https://enterprise-knowledge.com/data-catalog-implementation-and-adoption-webinar/ - Wed, 16 Oct 2024
Thomas Mitrevski, Principal Consultant for Data Management at Enterprise Knowledge, and Laurent Dresse, Chief Evangelist for DataGalaxy, will present a joint webinar on implementing a data catalog and garnering adoption across your organization.

Within the webinar, Mitrevski and Dresse will cover how to:

  • Deploy data catalogs for maximum impact;
  • Overcome adoption challenges and boost user engagement; and
  • Drive global data governance and foster a data-driven culture.

This interactive one-hour session will include real-world examples from Enterprise Knowledge demonstrating how to evaluate your current catalog maturity, develop actionable use cases, and identify where crucial information resides within your organization to best support your catalog use cases. 

The webinar will take place Thursday, October 31, at 11:00 a.m. EDT. The event is free, but registration is required.

Register for the webinar here!

The post EK / DataGalaxy Joint Webinar appeared first on Enterprise Knowledge.

Case Studies: Applications of Data Governance in the Enterprise
https://enterprise-knowledge.com/case-studies-applications-of-data-governance-in-the-enterprise/ - Tue, 19 Dec 2023
Thomas Mitrevski, Senior Data Management and Governance Consultant, and Lulit Tesfaye, Partner and Vice President of Knowledge and Data Services, presented "Case Studies: Applications of Data Governance in the Enterprise" on December 6th, 2023, at DGIQ in Washington, D.C.

In this presentation, Mitrevski and Tesfaye detailed their experiences developing strategies for multiple enterprise-scale data initiatives and provided an understanding of common data governance and maturity needs. Mitrevski and Tesfaye based their talk on real-world examples and case studies and provided the audience with examples of achieving buy-in to invest in governance tools and processes, as well as the expected return on investment (ROI).

Check out the presentation below to learn: 

  • How Leading Organizations are Benchmarking Their Data Governance Maturity
  • Why End-User Training was Imperative in Seeing Scaled Governance Program Adoption
  • Which Tools and Frameworks were Critical in Getting Started with Data Governance
  • How Organizations Achieved Success with Data Governance in Under 12 Weeks
  • What Successful Data Governance Implementation Roadmaps Really Look Like

The post Case Studies: Applications of Data Governance in the Enterprise appeared first on Enterprise Knowledge.

Data Catalog Evaluation Criteria
https://enterprise-knowledge.com/data-catalog-evaluation-criteria/ - Mon, 09 Jan 2023
Data catalogs have risen in adoption and popularity over the past several years, and it's no coincidence why. The amount of data, and therefore metadata, is exploding at a rapid pace and will not slow down anytime soon, making it difficult to manage and make sense of it all and pushing the need for a cloud solution that creates a source of truth for data and information. Moreover, many organizations are unsure what the best use of all this data is for their business. There are many data catalog vendors out there, all with seemingly the same message: that they are the right choice for you. That can't be true for everyone, because choosing the right data catalog for your business depends on several criteria. Before looking at vendors and selection criteria, let's narrow down what is important for your data catalog solution to have.

Know Your Use Cases and Users

Before delving into what criteria and vendor you want for your data catalog, thoroughly consider the use cases and users of your business, because they are the main drivers of getting the most efficient use out of your data catalog solution.

Use Cases: Consider the root problem that led your business to decide they need a data catalog solution. Beyond the fact that you have siloed data sources that you want to bring together in one centralized location, what are the true needs behind this? Are you trying to enable discovery, governance, data quality, analytics and/or delivery of your data assets? While all data catalog vendors share the common goal of merging your siloed data sources, each vendor will have a tailored functionality that answers one or more of the previous questions.

Users: Who will be accessing your data catalog? Your users should align with your use cases, and knowing who they are will help you focus on the most pertinent criteria for your data catalog. Do you need a platform for data scientists and engineers to build and monitor ETL processes? Are business users using the data catalog as a go-to discovery platform for insights and answers? Some example users of your data catalog might be:

  • Casual Users: Conduct broad searches and perform data discovery.
  • Data Stewards: Make governance decisions about data within their domain.
  • Data Analysts: Analyze data sets to generate insights and trends.
  • Data Architects/Engineers: Build data transformation pipelines (ETL).
  • System Administrators: Monitor system performance, use and other metrics.
  • Mission Enablers: Transform data and information into insights within analysis and reports to support objectives.
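One way to reason about persona coverage is to map each persona to the catalog capabilities it needs, then take the union across the personas you must serve. A minimal sketch follows; the persona names and capability sets are illustrative assumptions, not any vendor's actual role model:

```python
# Hypothetical mapping of catalog personas to the capabilities they need.
# Both the persona names and the capability sets are illustrative assumptions.
PERSONA_CAPABILITIES = {
    "casual_user": {"search", "browse"},
    "data_steward": {"search", "browse", "approve", "curate"},
    "data_analyst": {"search", "browse", "query", "export"},
    "data_engineer": {"search", "browse", "query", "build_pipelines"},
    "system_admin": {"search", "browse", "manage_users", "view_metrics"},
}

def required_capabilities(personas):
    """Union of capabilities a catalog must support for a given set of personas."""
    needed = set()
    for p in personas:
        needed |= PERSONA_CAPABILITIES[p]
    return needed

# A catalog serving casual users and data stewards must at minimum support:
print(sorted(required_capabilities(["casual_user", "data_steward"])))
# → ['approve', 'browse', 'curate', 'search']
```

Walking through this exercise for your own personas makes gaps in a candidate catalog's feature set easy to spot during vendor demos.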

Selection Criteria

In the previous section, I listed some potential use cases your organization may be focused on, depending on the root cause of your need for a data catalog and your identified users. Let's dive deeper into the six criteria you should prioritize when evaluating a data catalog solution.

Availability & Discovery

To maximize the value of your data, you need to understand what you have and how it relates to other data. Increased availability means catalog users spend less time looking for data, reducing time to insight and analysis. Discovery gives your data professionals greater room for creativity and innovation with the data and metadata within your infrastructure, making your business more efficient. For example, a client I am supporting with a data catalog implementation needs their casual end users to be able to search for keywords and documents across separate databases and see all related results in one place, reducing the time spent searching multiple databases for the same information.
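The cross-database discovery described above boils down to indexing metadata from every source in one place and searching it together. A minimal sketch, with made-up source names and records:

```python
# Minimal sketch of cross-source discovery: metadata records from separate
# sources are held in one index and searched together. All source names,
# dataset names, and descriptions below are hypothetical.
records = [
    {"source": "warehouse", "name": "sales_2024", "description": "quarterly sales figures"},
    {"source": "crm",       "name": "accounts",   "description": "customer account records"},
    {"source": "lake",      "name": "web_logs",   "description": "raw clickstream sales funnel data"},
]

def search(keyword):
    """Return all records, from any source, whose metadata mentions the keyword."""
    kw = keyword.lower()
    return [r for r in records
            if kw in r["name"].lower() or kw in r["description"].lower()]

# One query surfaces related results from two different systems at once:
for hit in search("sales"):
    print(hit["source"], hit["name"])
```

A real catalog adds relevance ranking, facets, and access controls on top, but the core value proposition is exactly this single point of search.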

Interoperability

Interoperability pertains to the data catalog's ability to integrate with your siloed information platforms and aggregate them into one centralized location. Data catalog vendors do not serve every database, data warehouse, or data lake on the market; rather, they often target one or a few particular business software suites. Integration compatibility across your current environment is necessary to maximize the user experience, and indeed to make the data catalog usable at all. In addition to system interoperability, evaluate the data interoperability of the catalog. I recommend a data catalog that stores and relates your data using graphs and Semantic Web standards, which help transform unstructured and semi-structured data at scale into meaningful, human-readable relationships. Before choosing your catalog, assess how easily it can be configured and linked to your current environment. For example, if your environment spans multiple data storage providers such as AWS, Google, or Microsoft, it's important that your data catalog can aggregate information from all mission-critical sources.
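The graph approach mentioned above can be sketched as subject-predicate-object triples, in the spirit of RDF. The dataset names, predicates, and storage systems below are illustrative assumptions; a production catalog would use a proper triple store and a standard vocabulary:

```python
# Sketch of relating catalog metadata as (subject, predicate, object) triples,
# in the spirit of Semantic Web standards. All names below are hypothetical.
triples = [
    ("sales_2024", "storedIn",  "aws_redshift"),
    ("accounts",   "storedIn",  "gcp_bigquery"),
    ("sales_2024", "joinsWith", "accounts"),
    ("sales_2024", "ownedBy",   "finance_team"),
]

def related(entity):
    """Everything directly connected to an entity, regardless of which system holds it."""
    out = {}
    for s, p, o in triples:
        if s == entity:
            out.setdefault(p, []).append(o)
        elif o == entity:
            out.setdefault(f"inverse_{p}", []).append(s)
    return out

# Note that the relationship crosses storage providers: sales_2024 lives in
# one system, accounts in another, yet the graph connects them.
print(related("sales_2024"))
```

This is why graph representations pay off for interoperability: relationships between assets are first-class data, not locked inside any one platform's schema.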

Governance

Businesses wrap their data in complicated security processes and rules, typically enforced by a specialized data governance team. These processes and rules are enforced with a top-down approach and can slow down your work. The modern, rising data framework highlights the need for governance to be a bottom-up approach that reduces bottlenecks in discovery and analysis. Choose a data catalog that provides governance features prioritizing catalog setup, data quality, data access, and end-to-end data lineage. A few key governance features to consider are data ownership, curation, workflow management, and policy/usability controls. These features streamline and consolidate efforts to provide proper data access and management for users, with an easy-to-use interface that spans all data within the catalog. The right data catalog solution for your business will contain and specialize in the governance features needed by your user personas, such as allowing system administrators to control data access for users based on role, responsibility, and sensitivity. For more information regarding metadata governance, check out my colleague's post on the Best Practices for Successful Metadata Governance.

Analytics & Reporting

Analytics and reporting pertains to the ability to develop, automate and deliver analytical summaries and reports about your data. Internally or through integration, your data catalog needs to expand beyond being a centralized repository for your assets and provide analytical insights about how your data is being consumed and what business outcomes it is helping to drive. Some insights that are of interest to many organizations are which datasets are most popular, which users are consuming particular datasets, and the overall quality of the data contained within your data catalog. The most sought after insight I see with client implementations surrounds data usage by user types (analyzing which users consume particular data sets to get a better understanding of the data that has the most business impact).
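The usage-by-user-type insight described above is, at its core, an aggregation over access logs. A minimal sketch with made-up log entries (persona labels and dataset names are assumptions):

```python
from collections import Counter

# Sketch of the usage-by-user-type insight: count catalog accesses per persona.
# The access log entries below are made up for illustration.
access_log = [
    ("data_analyst",  "sales_2024"),
    ("data_analyst",  "sales_2024"),
    ("casual_user",   "accounts"),
    ("data_engineer", "web_logs"),
    ("data_analyst",  "accounts"),
]

usage_by_type = Counter(user_type for user_type, _ in access_log)
# The persona driving the most data consumption:
print(usage_by_type.most_common(1))  # → [('data_analyst', 3)]
```

The same pattern, grouped by dataset instead of persona, yields the "most popular datasets" report mentioned above.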

Metadata Management

Metadata often outlasts the lifecycle of the data itself after it is deprecated, replaced, or deleted. Some of the key components of metadata management are availability, quality, lineage, and licensing.

  • Availability: Metadata needs to be stored where it can be accessed, indexed, and discovered in a timely manner.
  • Quality: Metadata needs to have consistency in its quality so that the consumers of the data know it can be trusted.
  • Historical Lineage: Metadata needs to be kept over time to be able to track data curation and deprecation.
  • Proper Licensing: Metadata needs to contain proper licensing information to ensure proper use by the appropriate users.
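The four components above can be made concrete as fields on a metadata record. A minimal sketch; the field names and example values are assumptions, not any specific catalog's schema:

```python
from dataclasses import dataclass, field

# Illustrative record structure covering the four metadata components above.
# Field names and values are assumptions, not a real catalog's schema.
@dataclass
class MetadataRecord:
    dataset: str
    location: str                       # Availability: where it can be accessed and indexed
    quality_score: float                # Quality: e.g. share of consistency checks passed
    lineage: list = field(default_factory=list)  # Historical lineage: upstream sources/versions
    license: str = "internal-use-only"  # Proper licensing: governs who may use it

record = MetadataRecord(
    dataset="sales_2024",
    location="warehouse.analytics.sales_2024",
    quality_score=0.92,
    lineage=["raw.pos_exports", "staging.sales_clean"],
)
print(record.license)  # prints "internal-use-only"
```

Note that a record like this remains useful even after `sales_2024` itself is deprecated: the lineage and licensing fields document where the data came from and how it could be used.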

Depending on your use cases and personas, some of the key components above will take priority over others. Ensure that your data catalog contains, collects, and analyzes the metadata your business needs. In my data catalog implementations, one feature clients usually need is data lineage. If historical lineage of your data is a dealbreaker, that requirement alone will help narrow your data catalog search.

Enterprise Scale

Enterprise scale is the capability for widespread use across multiple organizations, domains, and projects. Your data catalog will need to scale vertically with the amount of data ingested, as well as horizontally to continually serve new business ventures within your roadmap. Evaluate how you foresee your data catalog growing in the coming years. Vertical scaling reflects a need to continually add more data to the catalog, whereas horizontal scaling reflects a need to spread the reach of your data catalog to more users.

[Diagram: vertical vs. horizontal scaling]

Conclusion

Now that you have an idea of the criteria that matter most when selecting a data catalog vendor, it's time to explore your options further. Take advantage of demos offered by data catalog vendors to get a feel for which catalogs fit your use cases and users. Carefully consider the pros and cons of each vendor's platform and how it can meet your business goals for the catalog. If a data catalog is the right fit for your business and you're still not sure which one is right for you, reach out to us at Enterprise Knowledge and we can help you evaluate your use cases and recommend the right data catalog solution for you!

 

The post Data Catalog Evaluation Criteria appeared first on Enterprise Knowledge.

Three Pillars of Successful Data Catalog Adoption
https://enterprise-knowledge.com/three-pillars-of-successful-data-catalog-adoption/ - Thu, 11 Aug 2022
Data catalogs function as a central library of information about content and data across an enterprise, making them a useful metadata management tool. They can aid organizations that struggle with poorly documented metadata, duplicated data work, and wasted time due to the inability to find proper resources. As my colleague Joe Hilger elaborates in his post on The Top 5 KM Technologies, data catalogs can benefit companies seeking to manage siloed content and improve resource findability. Unlocking the knowledge management and data insight advertised by catalog providers, however, requires a careful catalog implementation strategy.

The top 10 data catalog providers in 2022, according to The Forrester Wave:

  • Atlan
  • data.world
  • Collibra
  • Informatica
  • Google
  • Microsoft
  • Alation
  • Talend
  • Oracle
  • Cloudera

I have led the training and adoption process for data catalog implementations for my full tenure at Enterprise Knowledge (EK). During this time, I have consulted with both our client catalog program managers and catalog provider implementation teams to determine what did or didn’t work to drive catalog adoption at their companies. I have reviewed catalog user research studies across widely differing organizations to learn how various implementation approaches affect the data teams they intend to support. 

My key finding is that successful adoption of a data catalog requires both a user-driven program design and integrating the tool into your team’s day-to-day tasks. In this white paper, I consolidate my experience and findings into three strategic pillars essential to create the necessary catalog environment:

1. Cater to your company’s culture.

2. Make it easy (and enticing) to use.

3. Measure how it’s going.

As each company is unique in both its data ecosystem and goals, there is not a singular approach or definition of “success” for this framework. Each use case requires context-based solutions. I present these pillars as a guide to help you brainstorm an implementation strategy specific to your organization. If you would prefer expert assistance, EK’s team of data specialists is available to help you design an enterprise catalog program tailored to your unique team and data strategy. 

1. Cater to your company's culture

A culture-driven design means defining an initial use case that satisfies the requirements of the organization as a whole, including both your executive stakeholders and your data users.

Top-down approach 

Obtain support from stakeholders by aligning catalog program goals with the broader data strategy at your company.

As a first step, clarify why your organization is moving to implement a catalog. What are you trying to accomplish by adding this tool to the data ecosystem? Once you've ascertained your organization's why, move on to the how. What level of investment are you able to put in? Which catalog providers fit within that scope? Secure executive and stakeholder sponsorship by demonstrating how the addition of a data catalog will fit with the broader data vision. Aligning these two strategies will help you develop your catalog program's high-level goals.

What does success mean for your program overall? With your high-level goals in mind, what specific use cases should you focus on first to step toward those goals? Is your priority increased efficiency or increased findability? You learn this by talking to the teams who will use the data catalog in their regular operations. What do your users need to be successful?

Bottom-up approach

Consult with your data team and prospective catalog users to determine what solutions they need to reach the company’s broader goals.

Collaborate with your users to define a range of catalog use cases. To do this, you must clearly define who your company's catalog users will be. What are their needs, current workflows, and vital tools? Do not assume; ask them.

Different data personas will have different needs. For example, catering only to data consumers and neglecting the data producers will not build a lasting catalog. Conversely, a catalog with only detailed technical metadata will exclude business decision makers and less technical users. Strive to create a data ecosystem that supports all of your personas. When everyone on the team feels heard, teams trust each other and work together. Therefore, it is vital to explore and clearly define the catalog personas specific to your company and data strategy before you design the catalog. Model the catalog metadata fields around the personas you develop – What problems are your users trying to solve? What information do they need to be able to solve them? 

Resources for persona and use case development:

Personas to Products: Writing Persona Driven Epics and User Stories

How to Build a User-Centric Product: A Quick Guide to UX Design

The Value of User Stories

The goal of combining top-down and bottom-up approaches is to build a culture of community and shared purpose around the catalog. Empathy between user groups leads to workflows that unite, rather than clash, in support of the group effort. To achieve this, the catalog must be approachable, accurate, and integrate into the team’s established workflows. 

2. Make it easy (and enticing) to use

Data catalogs become more valuable as more people use them. How can you recruit more users?

Meet your team where they are

Don’t let catalog adoption be an added stress to your users. Design a program and select a tool that fits their current workflow. Simplify the onboarding process with customized training offerings. 

Technology ecosystem – Integrate with current workflows, embedding commonly used tooling where possible. Actively engaged users make it easier to keep the catalog updated. A catalog that is too burdensome or convoluted to use will collect dust and depreciate. Then, when users log in and find inaccurate data due to neglect, they will lose their trust in the entire catalog and adoption will fail. Avoiding this requires smart tool selection. Determine and understand the tooling that is vital to your users and confirm it can integrate with the catalog provider you select. The goal of this segment is to configure a catalog that becomes part of the team's daily operations for data work. Constant access and collaboration from your teams helps ensure the accuracy and completeness of metadata by surfacing issues sooner.

Education – Aim to provide self-serve documentation so users can learn at their own pace as their unique needs arise. EK suggests creating customized training materials for the tool using company-specific resources, use cases, and your catalog instance UI. This helps your users envision how the catalog fits their workflow and how they can use it to complete their unique tasks. It can be helpful to designate a few catalog SMEs within your data teams, train them to be power users, and then set them up to help onboard additional users.

What’s in it for me? 

Develop a marketing strategy to broadcast catalog capabilities and internal successes.  

At first, the catalog may seem like added work for your users. Some might think, “Oh great, management is adding another ‘solution’ to our already busy process.” Why should your team make the time to interact with it? What is the reward? The technical benefits of a properly implemented catalog should speak for themselves, so broadcast these! 

EK's experience has found that you don't need to recruit everyone all at once. Aim to first establish something beneficial for an initial core group of users and develop some success stories to share. Then, market to the broader audience. Let the benefits speak for themselves and entice people to seek out the catalog rather than trying to force it.

For example, did a data manager curate metadata for x amount of resources, supporting the y team to save z units of time using the catalog? Create a case study to share this success with your organization. A well-crafted case study serves two purposes – first, it recognizes the team members who have added the catalog to their workflow. Second, it increases awareness about the success your data team can unlock using the catalog. Ideally, you want to create an atmosphere where new users are drawn to the catalog by witnessing the success of their peers.

Resources about the value of data catalogs from EK thought leaders:

The Value of Data Catalogs for Data Scientists

Semantic Data Portal/Data Catalog

Managing Disparate Learning Content

– Metadata Catalogs featured as part of The Top 5 KM Technologies

Knowledge Management is More Than SharePoint

The best way to entice your data team is through testimonials from users similar to them supported by actual performance data. Strategize your internal marketing plan before implementation begins. Plan to do an initial marketing push at regular intervals to recap the growing success of the catalog and showcase the increasing number of teams and users engaged with it. After the catalog has been established, consider sending updates whenever there have been substantial wins or when teams develop new use cases. 

Some data catalog products facilitate sharing catalog usage and successes directly in the user-interface. When choosing a catalog, look for tools that include the ability for users to collaborate within the app and view the activity of other users. Being able to login and see what other team members are working on not only invites discussion but also encourages new users to contribute. 

3. Measure how it’s going

What is working as expected and what is not? Do you need to change course? How can you demonstrate value and progress to stakeholders?

Key Performance Indicators (KPIs)

Use quantifiable results to demonstrate ROI and to monitor catalog usage. 

Compass iconHave a KPI monitoring plan in place before tool selection. Will your catalog of choice enable you to measure what matters to you? To determine what metrics to measure, reflect on your use case. What is your desired outcome? What are the success criteria to support it? 

Example success indicators and the insight they provide:

  • Successful searches (searches that result in a click-through) – Are your users finding what they need in the catalog? How much time does it take them to find it?
  • New user sign-in/activity – Are new users enrolling? Did they browse for one day and leave?
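The successful-search indicator can be computed directly from search logs. A minimal sketch, with made-up log entries and field names:

```python
# Sketch of computing a "successful search" KPI from hypothetical search logs:
# a search counts as successful if it resulted in at least one click-through.
search_log = [
    {"user": "ana",  "query": "sales",    "clicks": 2},
    {"user": "ben",  "query": "churn",    "clicks": 0},
    {"user": "ana",  "query": "accounts", "clicks": 1},
    {"user": "carl", "query": "logs",     "clicks": 0},
]

successful = sum(1 for s in search_log if s["clicks"] > 0)
success_rate = successful / len(search_log)
print(f"Successful search rate: {success_rate:.0%}")  # prints "Successful search rate: 50%"
```

Tracking this rate over time (and per user group) is what lets you distinguish a catalog users can't navigate from one they simply haven't discovered yet.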

It is important to know not only what to measure, but when to measure it. For example, we have found that effective catalog use doesn't necessarily mean users will go to the catalog every day. Catalog use may peak during the discovery phase of a project and then steadily decline. When reviewing a decline in user stats, is it because your users could not find the resources they needed and abandoned the catalog? Or did they find valuable resources and are now deep in analysis, which won't require daily catalog use? How can you determine the reason? Survey your users!

User feedback

Successful catalog adoption hinges entirely on your users. Seek out feedback to understand how the catalog is (or isn’t) working for them.

While usage monitoring and KPIs will help inform you how users interact with the catalog, it is also essential to frequently engage with your users directly: in the case of a data catalog, better content quality enables greater product functionality. Direct user feedback can help you improve the platform's usability and highlight value to stakeholders. Methods for gathering feedback include focus groups, surveys, and direct interviews. Demonstrating that you value and act on gathered feedback will build users' trust in the catalog program. When your users trust the catalog, they will rely on it as part of their default workflow.

As you progress from implementation to the next phase, use feedback to learn from your users whether the next iteration should focus on refactoring what currently exists, deepening the current use case, or exploring new territory. 

Conclusion

A user-driven catalog approach is adaptive to changes in data needs and flexible when scaling for both more users and use cases. Centering your users when designing your catalog program provides the most value to your team members who rely on it. When your teams are successful, they will push the broader data strategy forward for the entire organization.

EK's team of metadata management and data modeling specialists has the experience needed to help you explore and adapt these pillars to your unique organization. Contact us to learn more about how Enterprise Knowledge can help drive your data catalog adoption to success.

The post Three Pillars of Successful Data Catalog Adoption appeared first on Enterprise Knowledge.

The Value of Data Catalogs for Data Scientists
https://enterprise-knowledge.com/the-value-of-data-catalogs-for-data-scientists/ - Thu, 30 Jun 2022
Introduction

After the Harvard Business Review called Data Scientist the sexiest job of the 21st century in 2012, much attention went into the interdisciplinary field of data science. Students and professionals were curious to know more about what data scientists do, while businesses and organizations across industries wanted to understand how data scientists could bring them value.

In 2016, CrowdFlower, now Appen, published their Data Scientist report to respond to this newfound interest. This report aimed to survey professional data scientists with different years of experience and fields of expertise to find, among other things, what their everyday tasks were. The most important takeaway from this report is that it supports the famous 80/20 rule of data science. This rule states that data scientists spend around 80% of their time sourcing and cleaning data. And, only 20% of their time is left to perform analysis and develop machine learning models, which according to the same CrowdFlower survey, is the task that data scientists enjoy the most. The pie chart below shows that 1 out of every 5 data scientists spends most time collecting data, while 3 out of every 5 spend most of their time cleaning and organizing it.

[Chart: what data scientists spend most of their time doing]

[Chart: Anaconda's 2020 State of Data Science report – time spent per task]

More recently, Anaconda's 2020 State of Data Science Report shows that the time data scientists spend collecting, cleaning, and organizing data has improved: it now takes up to 50% of their time. From the bar chart, we can see that most of the improvement is due to a dramatic decrease in the time spent cleaning data, from 60% in 2016 to 26%, while the share spent collecting data remained static at 19%. We can also notice the introduction of time spent on data visualizations. This addition speaks to the growing need to communicate the value of the data scientist's work to non-technical executives and stakeholders, and it is therefore not surprising that the time dedicated to developing those visualizations is a third of the time spent generating that value through model selection, model training and scoring, and model deployment.

In my experience, Anaconda’s report remains true to this date. When starting a data science project, finding the relevant data to fit the client’s use case is time-consuming. It often involves not only querying databases but also interviewing data consumers and producers that may point to data silos only known to a small group or even bring out discrepancies in understanding among teams regarding the data. Bridging the gap in understanding data among data personas is the most time-consuming task and one that I have witnessed data catalogs excel at performing.

To sustain this trend and reverse the 80/20 rule, businesses and organizations must adopt tools that facilitate tasks throughout the data science process, especially data sourcing and cleaning. Implementing an enterprise data catalog is an ideal solution, with an active role throughout the data science process. With one in place, data scientists have more time to spend on high-value-generating tasks, increasing the return on investment.

Enterprise Data Catalogs

Data catalogs are metadata management systems for an organization's data resources. In the context of this blog, they help data scientists and analysts find the data they need and provide the information to evaluate its suitability for the intended use. Some capabilities enabled by a data catalog are:

  • Increased search speed utilizing a comprehensive index of all included data
  • Improved visibility with custom views of your data organized by defined criteria
  • Contextual insights from analytics dashboards and statistical metrics
  • Documentation of cross-system relationships between data at an enterprise level
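To make these capabilities concrete, here is a minimal sketch of faceted search over cataloged dataset metadata. The in-memory index, its record fields, and the `faceted_search` helper are all invented for illustration; they are not the API of any particular catalog product:

```python
# A toy metadata index: each record describes one cataloged dataset.
# The field names (owner, domain, quality_score, ...) are illustrative only.
CATALOG = [
    {"name": "orders_2023", "owner": "sales-ops", "domain": "sales",
     "tags": ["orders", "transactions"], "quality_score": 0.92},
    {"name": "crm_contacts", "owner": "marketing", "domain": "customers",
     "tags": ["customers"], "quality_score": 0.71},
    {"name": "web_clickstream", "owner": "analytics", "domain": "marketing",
     "tags": ["events", "customers"], "quality_score": 0.55},
]

def faceted_search(catalog, tag=None, min_quality=0.0):
    """Return datasets matching a tag facet and a minimum quality score."""
    return [
        r for r in catalog
        if (tag is None or tag in r["tags"]) and r["quality_score"] >= min_quality
    ]

# Find trustworthy datasets about customers.
hits = faceted_search(CATALOG, tag="customers", min_quality=0.7)
print([r["name"] for r in hits])  # ['crm_contacts']
```

Even this toy version shows the point: the search returns context (owner, domain, quality) alongside names, so a data scientist can judge fitness without opening the data itself.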

Because of these capabilities, data catalogs prove relevant throughout the data science process. To demonstrate this, let's review their relevance at each step of the OSEMN framework.

Value of Data Catalogs to the OSEMN Framework

The acronym OSEMN stands for Obtaining, Scrubbing, Exploring, Modeling, and iNterpreting data. Mason and Wiggins introduced the five-step OSEMN framework in their 2010 article "A Taxonomy of Data Science," and data scientists have widely adopted it since. It is a convenient lens for analyzing the value of data catalogs because each step translates to a specific task in a typical data science project.

Obtain

This step involves searching for and sourcing relevant data for the project. That is easy enough to do if you know what specific datasets to look for and whom to ask for access to the data. However, in practice, this is rarely the case. In my experience, the projects that generate the most significant value for the organization require integrating data across systems, teams, departments, and geographies. Furthermore, teams leading analytics and data science efforts recognize that the ROI of the project is highly dependent on the quality, integrity, and relevance of the sourced data. They, therefore, have been spending about a fifth of their time sourcing and verifying that they have high-quality and complete data for their projects. Data catalogs can help reduce this time through advanced search, enhanced discoverability, and data trustworthiness.

  • Advanced Search: Enterprise-wide faceted search provides knowledge instead of simple results by displaying the data's contextual information, such as the data asset's owner, steward, approved uses, content, and quality indicators. Most teams won't have access to all of the enterprise's datasets, but these metadata profiles help data scientists quickly find what data is available to them, assess its fitness for their use case, and determine whom to ask for access to the data they need.
  • Enhanced Discoverability: Although obtaining data is the first step in the OSEMN framework, it comes after understanding the business problem. That understanding gives greater insight into the entities involved, such as customers, orders, transactions, organizations, and metrics. Users can therefore tag datasets according to the entities present in the data, and the data catalog can then auto-tag new content as it gets ingested. This makes new data discoverable almost immediately, putting additional, more recent data at the project's disposal.
  • Data Trustworthiness: Searching for data across systems and departments can be time-consuming and often does not yield great results. Occasionally, you might stumble upon data that seems fit for your use case, but can you trust it? With a data catalog, data scientists no longer need to do detective work tracking down the data's origins to assess its reliability: catalogs let you trace the data's lineage and display quality metrics, taking the guesswork out of sourcing your data.
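The auto-tagging described under Enhanced Discoverability can be sketched as a simple keyword match against a business glossary. The glossary and the `auto_tag` helper below are hypothetical; production catalogs use far richer rules (and often machine learning), but the mechanics are similar:

```python
# Hypothetical entity glossary: business entities mapped to the column-name
# keywords that signal their presence in a dataset.
ENTITY_KEYWORDS = {
    "customer": {"customer_id", "customer_name", "email"},
    "order": {"order_id", "order_date", "order_total"},
    "transaction": {"txn_id", "amount", "currency"},
}

def auto_tag(columns):
    """Tag a newly ingested dataset with every entity whose keywords appear."""
    cols = set(columns)
    return sorted(entity for entity, kw in ENTITY_KEYWORDS.items() if cols & kw)

# A new extract lands in the catalog and is immediately tagged -- and
# therefore immediately discoverable by entity.
print(auto_tag(["order_id", "customer_id", "ship_date"]))
# ['customer', 'order']
```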

Scrub

In this step, data scientists curate a clean data set for their project. Typical tasks include merging data from different sources into a single data set, standardizing column formatting, and imputing missing values. As examined in the introduction, the time spent cleaning and organizing data has sharply decreased. I believe the advent of user-friendly ETL solutions has played a significant role in bringing down the time spent in this step: they allow users to define, in a graphical interface, pipelines and actions that handle data merging, standardization, and cleaning. While some data catalogs offer such comprehensive ETL features, most provide only basic ETL capabilities that users can extend through third-party integrations. But ETL capabilities aside, data catalogs are still helpful in this step.

Many organizations reuse the same data assets for multiple initiatives, yet each team cleans and standardizes the data only to store it inside its own project folder. These data silos add clutter, waste storage, and increase duplicative work. Why not catalog the cleaned and standardized data? That way, the next team that needs it saves time by using the already vetted data set.
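As a concrete illustration of the Scrub tasks named above, here is a minimal pandas sketch that merges two sources, standardizes column formatting, and imputes missing values. The data, column names, and imputation choices are invented for the example:

```python
import pandas as pd

# Two toy extracts of the same customers, with inconsistent column names
# and missing values -- the kind of mess the Scrub step deals with.
crm = pd.DataFrame({"Customer ID": [1, 2], "Region": ["east", None]})
billing = pd.DataFrame({"customer_id": [1, 2], "spend": [120.0, None]})

# Standardize column formatting (snake_case) before merging.
crm.columns = [c.strip().lower().replace(" ", "_") for c in crm.columns]

# Merge the sources into a single data set on the shared key.
merged = crm.merge(billing, on="customer_id", how="inner")

# Impute missing values with simple defaults (a modeling choice, not a rule).
merged["region"] = merged["region"].fillna("unknown")
merged["spend"] = merged["spend"].fillna(merged["spend"].mean())

print(merged)
```

Once a set like `merged` is vetted, cataloging it (rather than leaving it in a project folder) is what lets the next team skip this work.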

Explore

Data scientists usually perform exploratory data analysis (EDA) in this step. EDA entails examining the data to understand its underlying structure and to look for patterns and trends. The queries developed here produce descriptive statistics, visualizations, and correlation reports, and some may even yield new features. Data catalogs can support federated queries so that data scientists can perform their EDA from a single self-service store, saving them from querying multiple systems at different access points and figuring out how to aggregate the results in one repository. The benefits do not stop there: the queries, the aggregated data set, and the visualizations developed during EDA can also be cataloged and made discoverable by other teams that might reuse them for their own initiatives. These cataloged assets also become fundamental for future reproductions of the model.
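A minimal EDA pass over such an aggregated result set might look like the following pandas sketch (the data set is invented for illustration):

```python
import pandas as pd

# A small, invented result set pulled through a federated query layer;
# in practice this would be aggregated from several source systems.
df = pd.DataFrame({
    "order_total": [10.0, 20.0, 30.0, 40.0],
    "items": [1, 2, 3, 4],
    "region": ["east", "west", "east", "west"],
})

# Descriptive statistics for the numeric columns.
summary = df.describe()

# Correlation report across numeric features.
corr = df[["order_total", "items"]].corr()

print(summary.loc["mean", "order_total"])  # 25.0
print(round(corr.loc["order_total", "items"], 2))  # 1.0
```

Both `summary` and `corr` are exactly the kind of EDA artifacts worth cataloging alongside the data they describe.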

Model

According to the CrowdFlower survey, this is the task data scientists enjoy most. We have been building up to this step, which many data scientists would say is "where the magic happens." But "magic" does not have to be a black box: data catalogs can enhance a model's explainability through their traceability features. Every stakeholder with access to the information can see the training and test data, their origin, any documented transformations, and the EDA, which gives non-technical stakeholders enough context to understand the model's results.

So far, data catalogs have provided circumstantial help in this phase, mostly as byproducts of the data scientist's interactions with the catalog in the previous steps. During model selection, however, data catalogs are directly beneficial. Research has repeatedly shown that as more training data becomes available, the results of different models grow more similar; in other words, model selection loses relevance as the amount of available training data increases. A data catalog gives data scientists a self-service data discovery platform to source more data than was feasible in previous efforts, making their work more efficient by removing the constraints on model selection caused by insufficient data. It also saves time and resources, since data scientists can now train fewer models without significantly impacting the results, which is paramount when developing proof-of-concept models.
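The convergence effect can be sketched numerically. Below, an ordinary least-squares fit and a ridge-regularized fit (two deliberately different models) are trained on growing samples of the same synthetic data; the gap between their coefficients shrinks as the sample grows. This is a toy illustration of the general phenomenon, not a benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_gap(n, lam=10.0):
    """Fit OLS and ridge on n synthetic points; return |coef_ols - coef_ridge|."""
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(scale=0.5, size=n)
    xtx, xty = x @ x, x @ y          # scalar normal equations (one feature)
    coef_ols = xty / xtx
    coef_ridge = xty / (xtx + lam)   # ridge shrinks the coefficient toward zero
    return abs(coef_ols - coef_ridge)

small, large = fit_gap(20), fit_gap(20_000)
print(small > large)  # with more data, the two models nearly agree
```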

iNterpret

This step is where data scientists communicate their findings and the project's ROI to stakeholders. In my experience, it is the most challenging step. In Anaconda's report, data scientists were asked how effectively their teams demonstrate the impact and value of data science on business outcomes. As the results below show, data science teams communicated their business impact more effectively in industries with a higher proportion of technical staff. There is also a wide effectiveness gap across sectors, with teams at consulting and technology firms almost twice as effective at conveying their projects' impact as teams driving healthcare data science projects.

How effective are data scientist teams at demonstrating the impact of data science on business outcomes?

To accommodate non-technical audiences, many data scientists rely on dashboards and visualizations, which improve how value is communicated from the team to its stakeholders. Data scientists can also catalog these dashboards and visualizations in the metadata management solution. Stored as discoverable insights, they increase the project's visibility among stakeholders and a wider approved audience: data scientists in other departments, geographies, or subsidiaries with a similar project in mind can benefit from the previous work and build on top of it wherever possible, reducing duplicative work.
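Publishing interpretations back into the catalog can be as simple as registering a metadata record. The insight-record shape and `publish_insight` helper below are invented to show how a cataloged dashboard, linked back to its source data, becomes discoverable by other teams:

```python
# Toy "insight registry": publishing a finished dashboard back into the
# catalog with enough metadata for other teams to find and reuse it.
insights = []

def publish_insight(title, audience, source_datasets, tags):
    record = {
        "title": title,
        "audience": audience,
        "source_datasets": source_datasets,  # lineage back to the data used
        "tags": sorted(tags),
    }
    insights.append(record)
    return record

publish_insight(
    "Q3 churn drivers",
    audience=["marketing", "executive"],
    source_datasets=["crm_contacts", "web_clickstream"],
    tags={"churn", "customers"},
)

# Another team searching for prior work on customers finds it immediately.
reusable = [i["title"] for i in insights if "customers" in i["tags"]]
print(reusable)  # ['Q3 churn drivers']
```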

Conclusion

Data catalogs offer many benefits throughout a data science project. They provide data scientists with a self-service data access and discoverability ecosystem on which to obtain, process, aggregate, and document the data resources they need to develop a successful project. Most of the benefits are front-loaded in the first step of the OSEMN framework, obtaining data, but they remain relevant throughout the remaining steps.

I would like to clarify that no single data catalog solution offers all the capabilities discussed in this article as core features. Please weigh your enterprise's needs against the features of any data catalog solution you are considering. Our team of metadata management professionals has led over 40 successful data catalog implementations with most major solution providers. Don't hesitate to contact us: we can help you navigate the available data catalog solutions, choose the one that best fits your organization's needs, and lead its successful implementation.

Resources and Further Reading

The post The Value of Data Catalogs for Data Scientists appeared first on Enterprise Knowledge.

What is the Roadmap to Enterprise AI? https://enterprise-knowledge.com/enterprise-ai-in-5-steps/ Wed, 18 Dec 2019 14:00:57 +0000 https://enterprise-knowledge.com/?p=10153 Artificial Intelligence technologies allow organizations to streamline processes, optimize logistics, drive engagement, and enhance predictability as the organizations themselves become more agile, experimental, and adaptable. To demystify the process of incorporating AI capabilities into your own enterprise, we broke it … Continue reading

The post What is the Roadmap to Enterprise AI? appeared first on Enterprise Knowledge.

Artificial Intelligence technologies allow organizations to streamline processes, optimize logistics, drive engagement, and enhance predictability as the organizations themselves become more agile, experimental, and adaptable. To demystify the process of incorporating AI capabilities into your own enterprise, we broke it down into five key steps in the infographic below.

An infographic about implementing AI (artificial intelligence) capabilities into your enterprise.

If you are exploring ways your own enterprise can benefit from implementing AI capabilities, we can help! EK has deep experience designing and implementing solutions that optimize the way you use your knowledge, data, and information, and can produce actionable, personalized recommendations for you. Please feel free to contact us for more information.
