Data management tools for our times

Methodologies are evolving, as are the tools, meaning a shorter time to value and insights, finds Paul Hearns

12 February 2020

We hear every day about how data is the new [insert your favourite term] – for example Cloudera’s ‘data is the new bacon’. Like most of these analogies, it is not very helpful, but it is a bit of fun.

However, what the application of these various monikers indicates is the importance that data has assumed in today’s enterprise. Every aspect of modern business is becoming data-driven. That drive requires data, a lot of it, in accessible form, cleaned, sorted and made ready for analysis.

And yet, many organisations are struggling. They know how to gather data, but according to Gartner, only 6% of enterprise data is analysed to provide a competitive driver for the business. An Australian study by the analyst in 2018 concluded that 87% of organisations have low business intelligence and analytics maturity.

Isaac Sacolick, contributing editor, CIO, wrote that enterprises need better tools to learn and collaborate around data sources. One way of doing this, he suggests, is the data catalogue.

Data catalogues have been around for some time and have become more strategic today, as organisations scale Big Data platforms, operate in hybrid clouds, invest in data science and machine learning programmes, and sponsor data-driven organisational behaviours.

The first concept to understand about data catalogues is that they are tools for the entire organisation to learn and collaborate around data sources. They are important to organisations trying to be more data-driven, ones with data scientists experimenting with machine learning, and others embedding analytics in customer-facing applications.

Database engineers, software developers, and other technologists take on responsibilities to integrate data catalogues with the primary enterprise data sources. They also use and contribute to the data catalogue, especially when databases are created or updated.

In that respect, data catalogues that interface with the majority of an enterprise’s data assets are a single source of truth. They help answer what data exists, how to find the best data sources, how to protect data, and who has expertise. The data catalogue includes tools to discover data sources, capture metadata about those sources, search them, and provide some metadata management capabilities.

Many data catalogues go beyond the notion of a structured directory. Data catalogues often include relationships between data sources, entities, and objects. Most catalogues track different classes of metadata, especially on confidentiality, privacy, and security. They capture and share information on how different people, departments, and applications utilise data sources. Most data catalogues also include tools to define data dictionaries; some bundle in tools to profile data, cleanse data, and perform other data stewardship functions. Specialised data catalogues also enable or interface with master data management and data lineage capabilities.
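
To make the idea concrete, here is a minimal sketch of what a catalogue stores and how it can be searched. It is not modelled on any of the products discussed here: the class names, field names, and sample sources are all hypothetical, and a real catalogue would discover sources automatically rather than have them registered by hand.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """Metadata describing one data source (all field names are illustrative)."""
    name: str
    owner: str
    description: str
    sensitivity: str = "internal"  # e.g. public / internal / confidential
    tags: set = field(default_factory=set)
    upstream: list = field(default_factory=list)  # sources this one derives from


class DataCatalogue:
    """In-memory registry supporting registration, keyword search, and lineage."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return names of entries whose name, description, or tags mention the keyword."""
        needle = keyword.lower()
        hits = []
        for entry in self._entries.values():
            haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
            if needle in haystack:
                hits.append(entry.name)
        return sorted(hits)

    def lineage(self, name):
        """Walk upstream links to list every source the named entry depends on."""
        seen = []
        stack = list(self._entries[name].upstream)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                stack.extend(self._entries[current].upstream)
        return seen


# Hypothetical sample sources.
catalogue = DataCatalogue()
catalogue.register(CatalogueEntry("crm_contacts", "sales-ops", "Customer contact records",
                                  sensitivity="confidential", tags={"customer", "pii"}))
catalogue.register(CatalogueEntry("web_orders", "ecommerce", "Orders placed online",
                                  tags={"orders"}))
catalogue.register(CatalogueEntry("customer_360", "analytics", "Joined customer view",
                                  tags={"customer"}, upstream=["crm_contacts", "web_orders"]))

print(catalogue.search("customer"))
print(catalogue.lineage("customer_360"))
```

Even at this scale, the sketch answers the questions a catalogue is meant to answer: what data exists, who owns it, how sensitive it is, and what it is derived from.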

Data catalogue products and services

The market is full of data catalogue tools and platforms. Some products grew out of other infrastructure and enterprise data management capabilities. Others represent a new generation of capabilities and focus on ease of use, collaboration, and machine learning differentiators. Naturally, choice will depend on scale, user experience, data science strategy, data architecture, and other organisation requirements.

Here is a sample of data catalogue products:
• Azure Data Catalog and AWS Glue are data cataloguing services built into public cloud platforms.
• Many data integration platforms have data cataloguing capabilities, including Informatica Enterprise Data Catalog, Talend Data Catalog, SAP Data Hub, and IBM InfoSphere Information Governance Catalog.
• Some data catalogues are designed for big data platforms and hybrid clouds, such as Cloudera Data Platform and InfoWorks DataFoundry, which supports data operations and orchestration.
• There are stand-alone platforms with machine learning capabilities, including Unifi Data Catalog, Alation Data Catalog, Collibra Catalog, Waterline Data, and IBM Watson Knowledge Catalog.
• Master data management tools such as Stibo Systems and Reltio and customer data platforms such as Arm Treasure Data can also function as data catalogues.

ML driving insights and experimentation

Data catalogues that automate data discovery, enable searching the repository, and provide collaboration tools are the basics. More advanced data catalogues include capabilities in machine learning, natural language processing, and low-code implementations.

Machine learning capabilities take on several forms depending on the platform. For example, Unifi has a built-in recommendation engine that reviews how people are using, joining, and labelling primary and derived data sets. It captures utilisation metrics and uses machine learning to make recommendations when other end-users query for similar data sets and patterns. Unifi also uses machine learning algorithms to profile data, identify sensitive personally identifiable information, and tag data sources.
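
The principle behind usage-based recommendations can be illustrated with a simple co-occurrence count. This is a deliberately simplified stand-in, not a description of how Unifi or any other vendor implements its engine; the function names and the query log are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_co_usage(query_log):
    """Count how often each pair of data sets appears in the same query."""
    co_usage = defaultdict(Counter)
    for datasets in query_log:
        for a, b in combinations(sorted(set(datasets)), 2):
            co_usage[a][b] += 1
            co_usage[b][a] += 1
    return co_usage


def recommend(co_usage, dataset, top_n=2):
    """Suggest the data sets most often used alongside the given one."""
    return [name for name, _ in co_usage[dataset].most_common(top_n)]


# Hypothetical usage log: each entry lists the data sets joined in one query.
query_log = [
    ["orders", "customers"],
    ["orders", "customers", "products"],
    ["orders", "products"],
    ["orders", "customers"],
    ["customers", "support_tickets"],
]

co_usage = build_co_usage(query_log)
print(recommend(co_usage, "orders"))
```

A production system would presumably weigh recency, user role, and data quality as well, but the underlying signal is the same: what colleagues have already found useful together.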

Collibra is using machine learning to help data stewards classify data. Automatic Data Classification analyses new data sets and matches them to 40 out-of-the-box classifications, such as addresses, financial information, and product identifiers.

Waterline Data has patented fingerprinting technology that automates the discovery, classification, and management of enterprise data. One of their focus areas is identifying and tagging sensitive data; they claim to reduce the time needed for tagging by 80%.
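
For readers who want a feel for what automated classification and tagging involves, the sketch below labels columns by matching sample values against patterns. It is a rule-based simplification and makes no claim about how Collibra or Waterline Data actually work; the patterns, the 80% matching threshold, and the list of sensitive labels are all assumptions made for the example.

```python
import re

# Illustrative patterns only; real products ship far richer rule sets and models.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ipv4_address": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

# Assumed set of labels that should be flagged as sensitive.
SENSITIVE = {"email", "ipv4_address"}


def classify_column(values, threshold=0.8):
    """Return the first label whose pattern matches at least `threshold` of the non-empty values."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return "unknown"
    for label, pattern in PATTERNS.items():
        matches = sum(1 for v in cleaned if pattern.match(v))
        if matches / len(cleaned) >= threshold:
            return label
    return "unknown"


def tag_table(columns):
    """Map each column name to a (classification, is_sensitive) pair."""
    result = {}
    for name, values in columns.items():
        label = classify_column(values)
        result[name] = (label, label in SENSITIVE)
    return result


# Hypothetical sample table, keyed by column name.
sample = {
    "contact": ["ann@example.com", "bob@example.org", "", "cy@example.net"],
    "signup": ["2020-01-03", "2020-02-12", "2019-11-30"],
    "notes": ["called twice", "prefers email", "n/a"],
}
tags = tag_table(sample)
print(tags)
```

The appeal of automating this step is clear from the sketch: once a column is labelled, the sensitivity flag follows without a data steward having to inspect it by hand.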

Different platforms have different strategies and technical capabilities around data processing. Some only function at a data catalogue and metadata level, whereas others have extended data prep, integration, cleansing, and other data operational capabilities.

InfoWorks DataFoundry is an enterprise data operations and orchestration system that has direct integration with machine learning algorithms. It has a low-code, visual programming interface enabling end-users to connect data with machine learning algorithms such as k-means clustering and random forest classification.

We are, argues Sacolick, in the early stages of active platforms such as data catalogues that provide governance, operational capabilities, and discovery tools for enterprises with growing data assets. As organisations realise more business value from data and analytics, there will be a greater need to scale and manage data practices. Machine learning capabilities will likely be one area where different data catalogue platforms compete.

As important as these methodologies are, data preparation and hygiene are equally deserving of attention.

To reap the benefits of data analytics, you first have to get data preparation right, writes Thor Olavsrud, senior writer, CIO. For many organisations, this is a significant bottleneck, with up to 70% of their time focused on data preparation tasks, according to recent research from Gartner.

“Finding, accessing, cleaning, transforming, and sharing the data, with the right people and in a timely manner, continues to be one of the most time-consuming roadblocks in data management and analytics,” said Ehtisham Zaidi, senior director analyst, data & analytics, Gartner, lead author of Gartner’s Market Guide for Data Preparation Tools.

For organisations seeking to transform their business with analytics, the chief problem is less about mastering AI and more about mastering the data pipeline, said Jonathan Martin, chief marketing officer of Hitachi Vantara.

“The data preparation piece is the piece that is most challenging,” he says. “How do I identify where all this data is? Am I able to build a portfolio? Am I able to engineer the pipelines to connect all those data sources together in an automated and managed and governed way to allow us to get that data to the right place, the right person, the right machine in the right time frame?”

What to look for in a data preparation tool

As organisations evaluate modern data preparation tools, they should look for key capabilities:
• Data ingestion and profiling. Look for a visual environment that enables users to interactively ingest, search, sample, and prepare data assets.
• Data cataloguing and basic metadata management. Tools should allow you to create and search metadata.
• Data modelling and transformation. Tools should support data mashup and blending, data cleansing, filtering, and user-defined calculations, groups, and hierarchies.
• Data security. Tools should include security features such as data masking, platform authentication, and security filtering at the user/group/role level.
• Basic data quality and governance support. Data preparation tools should integrate with tools supporting data governance/stewardship and capabilities for data quality, user permissions, and data lineage.
• Data enrichment. Tools should support basic data enrichment capabilities, including entity extraction and capturing of attributes from the integrated data.
• User collaboration and operationalisation. The tools should facilitate the sharing of queries and datasets, including publishing, sharing, and promoting models with governance features such as dataset user ratings or official watermarking.
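
Two of the capabilities above, profiling and data masking, are easy to illustrate in a few lines. The following is a minimal sketch under stated assumptions: the function names and sample rows are hypothetical, the card numbers are dummy test values, and commercial tools do considerably more than this.

```python
from collections import Counter


def profile_column(values):
    """Summarise completeness and cardinality for one column."""
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "missing": len(values) - len(non_null),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


def mask_value(value, visible=2):
    """Hide all but the last `visible` characters of a value."""
    if value is None:
        return None
    text = str(value)
    if len(text) <= visible:
        return "*" * len(text)
    return "*" * (len(text) - visible) + text[-visible:]


def prepare(rows, sensitive_columns):
    """Return a profile per column plus a masked copy of the rows."""
    columns = {key: [row.get(key) for row in rows] for key in rows[0]}
    profiles = {name: profile_column(values) for name, values in columns.items()}
    masked = [
        {k: mask_value(v) if k in sensitive_columns else v for k, v in row.items()}
        for row in rows
    ]
    return profiles, masked


# Hypothetical sample rows.
rows = [
    {"customer": "A. Byrne", "county": "Dublin", "card": "4111111111111111"},
    {"customer": "C. Doyle", "county": "Cork", "card": "5500005555555559"},
    {"customer": "E. Flynn", "county": "Dublin", "card": None},
]
profiles, masked = prepare(rows, sensitive_columns={"card"})
print(profiles["county"])
print(masked[0]["card"])
```

Profiling tells the analyst how complete and how varied each column is before any modelling begins, while masking lets the prepared data be shared more widely without exposing the sensitive values.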

Differentiating capabilities to look for

• Data source access/connectivity. Tools should feature APIs and standards-based connectivity, including native access to cloud application and data sources, such as popular database PaaS and cloud data warehouses, on-premises data sources, relational and unstructured data, and non-relational databases.
• Machine learning. Tools should support the use of machine learning and AI to improve or even automate the data preparation process.
• Hybrid and multi-cloud deployment options. Data preparation tools need to support deployment in the cloud, on-premises, or in a hybrid integration platform setting.
• Domain- or vertical-specific offerings or templates. Tools should provide packaged templates or offerings for domain- or vertical-specific data and models that can accelerate the time to data preparation.

Ultimately, Zaidi says one of the first things you must consider is whether your organisation will go with a standalone data preparation tool or with a vendor that embeds data preparation into its broader analytics/BI, data science, or data integration tools. Consider standalone tools if you have a general-purpose use case that depends on integration of data for a range of analytics/BI and data science tools. On the other hand, if you need data preparation only within the context of a particular platform or ecosystem, it may make more sense to go with the embedded data preparation capability of those tools.

Market overview

Gartner breaks data preparation tool vendors into four categories, each of which is in flux as data preparation capabilities are being embedded across all data management and analytics tools.

Standalone data preparation tools: Vendors in this space focus on enabling tighter integration with downstream processes, such as API access and support for multiple analytics/BI, data science, and data integration tools. Tools in this space include offerings from vendors such as Altair, Datameer, Lore IO, Modak Analytics, Paxata and Trifacta.

Data integration tools: Vendors in this category have historically focused on data integration and management. This includes offerings from vendors such as Cambridge Semantics, Denodo, Infogix, Informatica, SAP, SAS, Talend, and TMMData.

Modern analytics and BI platforms: These vendors focus on data preparation as part of an end-to-end analytics workflow. Because data preparation is critical to modern analytics and BI, all vendors in the space are embedding data preparation capabilities, Zaidi says. Vendors in this category include Alteryx, Tableau, Cambridge Semantics, Infogix, Microsoft, MicroStrategy, Oracle, Qlik, SAP, SAS, TIBCO Software, and TMMData.

Data science and machine learning platforms: Gartner says these vendors provide data preparation capabilities as part of an end-to-end data science and ML process. Representative vendors include Alteryx, Cambridge Semantics, Dataiku, IBM, Infogix, Rapid Insight, SAP, and SAS.

In addition to the above four broad categories, Gartner sees new categories emerging with data preparation capabilities, including the following platforms and representative vendors:

• Data management/data lake enablement platforms: Informatica, Talend, Unifi, and Zaloni
• Data engineering platforms: Infoworks
• Data quality tools: Experian
• Data integration specialists: Alooma, Nexla, StreamSets, and Striim

Six key data preparation tools

Alteryx Designer

This standalone data preparation tool is also a part of the Alteryx Analytics and Data Science platform, meaning it is also embedded as a capability within a broader modern analytics and BI platform, and as a capability within a broader data science and machine learning platform.

Cambridge Semantics Anzo

Anzo is Cambridge Semantics’ end-to-end data discovery and integration platform, and so crosses all four of Gartner’s categories. Anzo applies a semantic, graph-based data fabric layer over existing data infrastructure to map enterprise data, expose connections between datasets, enable visual exploration and discovery, and blend multiple datasets.

Datameer Enterprise

Datameer Enterprise is a data preparation and data engineering platform squarely in Gartner’s standalone category. It focuses on bringing together raw, disparate data sources to create a single data store using a wizard-led integration process. Datameer offers a spreadsheet-like interface for point-and-click blending and visual exploration capabilities.

Infogix Data3Sixty Analyze

Infogix’s Data3Sixty Analyze is a web-based solution born from Infogix’s acquisition of Lavastorm. Like Anzo, it crosses all four of Gartner’s categories. Data3Sixty uses roles to define users. Designers can create and edit data flows, explorers can only execute data flows, while schedulers can create and modify schedules for automated processing.
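
Role-based access of this kind is straightforward to picture as a lookup from role to permitted actions. The sketch below is purely illustrative and is not Infogix's API: the action names are hypothetical, and allowing designers to run the flows they build is an assumption not stated in the description above.

```python
from enum import Enum, auto


class Action(Enum):
    EDIT_FLOW = auto()
    RUN_FLOW = auto()
    EDIT_SCHEDULE = auto()


# Roles as described above. Letting designers also run flows is an assumption.
ROLE_PERMISSIONS = {
    "designer": {Action.EDIT_FLOW, Action.RUN_FLOW},
    "explorer": {Action.RUN_FLOW},
    "scheduler": {Action.EDIT_SCHEDULE},
}


def is_allowed(role, action):
    """Return True if the role may perform the action; unknown roles get nothing."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("explorer", Action.RUN_FLOW))
print(is_allowed("explorer", Action.EDIT_FLOW))
```

Keeping permissions in one table like this makes it easy to audit who can change a data flow and who can only run it.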

Talend Data Preparation

Talend offers three data preparation tools: Talend Data Preparation (an open source desktop version), Talend Data Preparation Cloud (a commercial version offered as part of the Talend Cloud platform), and another version of Talend Data Preparation (a commercial version that is part of the on-premises Talend Data Fabric offering). Talend Data Preparation is a standalone tool, while Talend Cloud and Talend Data Fabric are examples of data preparation integrated as a capability within a broader data integration/data management tool. Talend uses machine learning algorithms for standardisation, cleansing, pattern recognition, and reconciliation.

Trifacta Wrangler

Trifacta Wrangler is a standalone data preparation platform that comes in various editions for supporting cloud and on-premises computing environments. It offers embedded ML capabilities for recommending data with which to connect, inferring data structure and schema, recommending joins, defining user access, and automating visualisations for exploration/data quality.
