NLP and AI boost the automated data warehouse

As digital business continues to accelerate, companies are automating elements of their data warehouses to shorten their data-to-insights cycles using AI and machine learning. Augmented analytics plays a role, as do traditional tools such as ETL (extract, transform, and load). Collectively, the landscape of increasingly intelligent data management tools helps make data more accessible and usable.

The influence of augmented analytics

Augmented analytics represents the state of the art in data analysis. Instead of typing SQL queries, users can simply ask questions in natural language.

Another distinguishing factor is that augmented analytics platforms have extended past analytics to include data preparation and even some data warehouse capabilities. According to Mark Beyer, research vice president and analyst at Gartner, the role of augmented analytics is to uncover data usage patterns that determine who accesses what data, how often, in what combinations, and at what rate that usage is accelerating or decelerating.

“Augmented analytics can only learn from patterns and previous activity. It can add data analysis at the level of profiling content by individual assets and inferring that similar data across different datasets might be the same data,” Beyer said. “Any inference model should be trained to recognize long-term patterns, which requires both time and many use cases interfacing with the same data to show how variable the patterns are and what conditional scenarios cause the different variations.”

Augmented analytics platform provider Qlik offers a suite of data management tools bundled into a single SKU. Qlik Replicate, a universal data replication and ingestion tool, integrates with Qlik Compose, a data lake and automation tool, to enable and automate batch and real-time data feeds from source systems to data warehouses and lakes.

Qlik Enterprise Manager centrally manages data replication and pipeline automation across the enterprise, providing a single point of control for designing, running, and monitoring replication and composition jobs.

Data structures and resulting metadata are shared with Qlik Catalog so users can provide data directly from Catalog to the Qlik Sense augmented analytics platform or similar platforms like Power BI and Tableau.

“Qlik enables batch and continuous migration of data across many data sources and targets, both on-premises and in the cloud,” said Anand Rao, Director of Product Marketing at Qlik. “[It] supports use cases ranging from cloud migrations to platform modernizations and tightly integrates with all major cloud providers.”

Augmented analytics platform provider Sisense offers a full suite of data management features, including ingestion, manual and AI-assisted preparation, modeling, governance, and cataloging. Each of these capabilities can be swapped out for a third-party service that specializes in that area.


“One of the most unique things about Sisense is that we designed [it] as a true microservices solution, [so] each workflow can be supplemented or replaced entirely,” said Ryan Segar, senior vice president of field engineering at Sisense.

For example, with ETL, customers can use Stitch, Fivetran, CData, or Matillion. For data warehouses or data lakes, they can use Redshift, Snowflake, SingleStore, Databricks, or BigQuery. For governance and cataloging, they can use Collibra, Alation, BigID, Alteryx, Trifacta, and others.

“ML-based data prep is by far the biggest trend we’re seeing in the space,” Segar said. “The amount of time people spend traversing tables to perform such simple tasks as deduplications is staggering and can be automated.”
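The deduplication task Segar mentions can be illustrated with a minimal sketch. This is not Sisense's implementation; it simply shows the rule-based baseline that ML approaches improve on: normalize each record's key fields, then keep the first record per key. The field names ("name", "email") and sample rows are invented for illustration.

```python
def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so near-identical values collide."""
    return " ".join(value.lower().split())

def deduplicate(records, key_fields):
    """Keep the first record for each normalized key tuple."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(normalize(rec[f]) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada lovelace ", "email": "ADA@example.com"},  # near-duplicate
    {"name": "Grace Hopper", "email": "grace@example.com"},
]
print(len(deduplicate(rows, ["name", "email"])))  # 2 unique records remain
```

An ML-based system replaces the hand-written `normalize` rule with a learned similarity model, which is what removes the "traversing tables" labor Segar describes.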

Natural language processing comes to the fore

Natural language processing (NLP) has been added to data analysis platforms so that less technical users, such as “citizen data scientists,” can access and analyze data.

“NLP understands user intent and analyzes search strings to identify key attributes of an analytics query. [It] then leverages AI to generate the best insights for the user, which can be refined and added to dashboards for further exploration,” Rao said. “Similarly, queries to an automated data warehouse can benefit from NLP, allowing business analysts to request data and analytical calculations without sophisticated SQL queries.”
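The intent-extraction step Rao describes can be sketched in miniature. This hypothetical example maps a search string to the key attributes of an analytics query via simple keyword matching; the catalog of measures and dimensions is invented, and production systems use trained language models rather than word lookup.

```python
# Illustrative catalog mapping business words to query fragments.
CATALOG = {
    "measures": {"revenue": "SUM(revenue)", "orders": "COUNT(order_id)"},
    "dimensions": {"region": "region", "month": "order_month"},
}

def parse_query(text: str) -> dict:
    """Identify which known measures and dimensions a search string mentions."""
    words = text.lower().split()
    return {
        "measures": [v for k, v in CATALOG["measures"].items() if k in words],
        "dimensions": [v for k, v in CATALOG["dimensions"].items() if k in words],
    }

intent = parse_query("show revenue by region")
print(intent)  # {'measures': ['SUM(revenue)'], 'dimensions': ['region']}
```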

Rao defines data warehouse automation as creating and importing data models, performing custom data type mappings across different data stores, applying validation and data quality rules, and creating data warehouses or derived data stores. An NLP-driven query generation tool can initially be used to update individual tasks in a workflow and eventually replace less complex tasks downstream.
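One of the automation tasks Rao lists, custom data type mapping between data stores, can be sketched as a lookup over source/target pairs. The mapping table below is illustrative, not any vendor's actual type system.

```python
# Hypothetical type-mapping table for a PostgreSQL-to-Snowflake migration.
TYPE_MAP = {
    ("postgres", "snowflake"): {
        "integer": "NUMBER(38,0)",
        "text": "VARCHAR",
        "timestamp": "TIMESTAMP_NTZ",
    }
}

def map_column_types(source: str, target: str, columns: dict) -> dict:
    """Translate each column's source type into the target store's type."""
    mapping = TYPE_MAP[(source, target)]
    return {name: mapping[src_type] for name, src_type in columns.items()}

print(map_column_types(
    "postgres", "snowflake",
    {"order_id": "integer", "customer": "text", "placed_at": "timestamp"},
))
```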

[Figure: a data warehouse versus a data store.]

“The NLP-driven tool must be able to generate typical OLAP [online analytical processing] queries to create data stores with requested datasets,” Rao said.
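The final step Rao describes, turning a parsed request into an OLAP-style query that materializes a derived data store, might look like the following sketch. Table and column names are illustrative, and a real generator would also handle joins, filters, and dialect differences.

```python
def build_olap_query(target: str, source: str, dimensions, measures) -> str:
    """Assemble a CREATE TABLE ... AS SELECT with aggregation over dimensions."""
    select_list = ", ".join(dimensions + measures)
    group_by = ", ".join(dimensions)
    return (
        f"CREATE TABLE {target} AS "
        f"SELECT {select_list} FROM {source} GROUP BY {group_by}"
    )

sql = build_olap_query(
    target="sales_by_region",
    source="fact_sales",
    dimensions=["region"],
    measures=["SUM(revenue) AS total_revenue"],
)
print(sql)
```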

While the semantic layer made self-service data accessible, the growing volume and variety of data revealed its fatal flaw, according to Segar.

“Humans are still the ones creating and maintaining it. NLP has been spurred by advances in data cataloging, and ML is changing the game so systems can retrain on the business terms used in both global and local contexts,” Segar said. “When implemented correctly, we can automate the most difficult task of data management: recognizing the uniqueness of each user rather than training them to think differently.”

If the language is used consistently over many occurrences and has a subject, object, predicate (SOP) construct, then it can be parsed into code entries, according to Gartner’s Beyer. For example, computer code always has the same SOP construct:

  • Subjects are derived from a department, business unit, or functional requirement such as “patient admission”.
  • The objects are the desired attributes or memory arrays to fill in, such as “patient admission date and time”.
  • Predicates are verb phrases in the language, such as “patient admitted”.

“The NLP can thus be coded as a program module to capture the admission of the patient as a subject. The objects are patient ID, hospital, ward ID (if applicable), and the date and time the patient is registered as present. The predicate is to capture data inputs,” Beyer said. “On the back end, an augmented system could know how that data is used based on previous use cases. It could also learn typical error types and create checks to filter them out using data quality or query plan rules as a second predicate.”
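Beyer's SOP idea can be sketched as a small rule module: a predicate phrase ("patient admitted") triggers capture, the subject names the target table, and the objects name the attributes to fill. All identifiers below are illustrative, not a real hospital schema.

```python
# Hypothetical SOP rule: predicate phrase -> subject (table) and objects (columns).
SOP_RULES = {
    "patient admitted": {
        "subject": "patient_admission",
        "objects": ["patient_id", "hospital", "ward_id", "admitted_at"],
    }
}

def capture_event(sentence: str, data: dict):
    """If a known predicate appears, capture only the attributes its rule expects."""
    for predicate, rule in SOP_RULES.items():
        if predicate in sentence.lower():
            row = {col: data.get(col) for col in rule["objects"]}
            return rule["subject"], row
    return None, None

table, row = capture_event(
    "Patient admitted to cardiology",
    {"patient_id": 42, "hospital": "General", "ward_id": "C3",
     "admitted_at": "2024-05-01T09:30"},
)
print(table, row["patient_id"])  # patient_admission 42
```

Beyer's "second predicate" (data quality checks learned from past errors) would slot in as an additional validation pass over `row` before it is written.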

The bottom line is that data warehouses continue to automate over time through AI and machine learning. Augmented analytics is part of the fabric that helps generate value from data by enabling more people to glean important contextual insights.

