Data Quality Metrics using Talend Data Quality Management

Overview

Data Quality is the process of examining data in different data sources according to predefined business goals. It helps to improve the quality of the data and collect statistics and information about the data. It helps business users in making more informed decisions with the quality data.

In this blog, let us discuss Data Quality Statistics (DQS) using Talend Data Quality Management (DQM).

Pre-requisites

Download and install Talend data quality tool from the following link:
https://www.talend.com/products/data-quality/

Data Description

Loan applicant dataset, with basic applicant details such as applicant ID, gender, age, marital status, and so on, is used as the source data.

Sample Data Source in MySQL

sample-data-source-in-mysql

Use Case

Perform column and table level quality statistics on the input data source.

Synopsis

Connect data source with Talend DQM
Create analysis and data quality statistics
- Simple statistics
- Pattern matching statistics
- Text statistics
- Pattern frequency statistics
Apply static rules
Perform Correlation Analysis
Identify data duplicates using Match Analysis

Connecting Data Source with Talend DQM

To connect Talend DQM with the database, perform the following:

Open Talend Open Studio for Data Quality.
In the left panel, click Metadata –> DB connections –> Create DB Connection to create a database connection to import the source data from the database for collecting statistics.

Provide the required credentials to create metadata for MySQL DB connection as shown in the below diagram:

Creating Analysis and Data Quality Statistics

On successfully connecting Talend with MySQL, perform analysis on the following levels:

Column
Table

To collect the data quality statistics with the applicant dataset on column level, perform the following:

Create analysis as shown in the below diagram:

Select columns for performing the data quality statistics as shown in the below diagram:

Select quality indicator for the selected columns to run analysis and view the analysis results.

The quality statistics based on the above-selected indicators in Talend DQM are:

Simple Statistics
Pattern Matching Statistics
Text Statistics
Pattern Frequency Statistics

Simple Statistics

The simple statistics on the applicant ID column, with Row Count, Null Count, Distinct Count, Unique Count, Duplicate Count, and Blank Count, is shown in the below diagram:

simple-statistics

Pattern Matching Statistics

The pattern matching statistics is used to analyze the format of several types of data such as date in different formats, phone number patterns in different countries, zip codes, and so on. It provides both matching and non-matching patterns.

Matched and unmatched patterns of the phone numbers with the countries are shown in the below diagram:

pattern-matching-statistics

Matched and unmatched patterns of the applicant last name starting with uppercase are shown in the below diagram:

pattern-matching-statistics1

Matched and unmatched patterns of the US state codes in the applicant data are shown in the below diagram:

pattern-matching-statistics2

Matched and unmatched patterns of the date matching with the date of birth of the applicant are shown in the below diagram:

pattern-matching-statistics3

Text Statistics

The text statistics is used to check the data with default length such as phone number (for example, 10 digits in India). The text statistics on the applicants’ phone number is shown in the below diagram:

text-statistics

Pattern Frequency Statistics

The pattern frequency statistics is used to check the pattern formats of the data source. The patterns of the phone numbers are shown in the below diagram:

pattern-frequency-statistics

Applying Static Rules

Business Rule statistics, called as table level statistics, is used to apply static rules and predefined business rules in the table columns.

Few static rules created are:

Approved loan amount should not be greater than requested loan amount.
Age column record is valid based on date of birth column in table.
Gender column should have valid data like Male or Female.

The business rule analysis performed using the above static rules is shown in the below diagram:

applying-static-rules

Performing Correlation Analysis

The correlation analysis is used to explore the relationships and correlations in the data. It is used to highlight weak relationships between the data to find potential incorrect relationships. The correlation analysis between the cities and the states is shown in the below diagram:

Identifying Data Duplicates Using Match Analysis

The match analysis is used to assess the number of duplicates in the data. It estimates the number of groups of similar data on a table-set or a column-set basis. The match analysis, with column sets such as state and gender, is shown in the below diagrams:

match-analysis

match-analysis1

References

Data Profiling – Concepts and Principles

Data Quality Metrics using Talend Data Quality Management

Overview

Pre-requisites

Data Description

Use Case

Connecting Data Source with Talend DQM

Creating Analysis and Data Quality Statistics

Simple Statistics

Pattern Matching Statistics

Text Statistics

Pattern Frequency Statistics

Applying Static Rules

Performing Correlation Analysis

Identifying Data Duplicates Using Match Analysis

References

Trending Articles

Scuffham Amps - S-GEAR 2.6.0 VST, AAX, STANDALONE x86 x64 (R2R NO iLok2, +NO...

Practice Sheet of Right form of verbs for HSC Students

VHSE First (1st) Allotment 2025 - vhscap.kerala.gov.in

UNIVERSE LEAGUE – UNIVERSE LEAGUE – WAR (We Are Ready) – EP [iTunes Plus M4A]

City Hunter Teledrama – Episode 18 – 07th May 2016

Comment on Proposed Criteria for Identifying Predatory Conferences by Luke...

Bureau of Internal Revenue: Regional Offices (Directory)

Kendrick Lamar – Not Like Us (2024) [24Bit-88.2kHz] [PMEDIA] ⭐️

Inception 2010 Hindi Dual Audio 650MB BRRip 720p ESubs HEVC

East Hull MD admits sexual assaults after another victim comes forward

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

R. v. Sargeant, 2023 ONSC 6406 (CanLII)

Rajasthan Board 10th Result 2016 Roll No wise & Name Wise

Who’s been sentenced at Northampton Magistrates’ Court

मतलबी दोस्त स्टेट्स | Matlabi Dost Status in Hindi – Selfish Friends Status

Family cries out as traditional ruler allegedly abducts brother, extorts N2.5m

Long-Running Conflict In Springfield (MA) Gangland Sphere Has Manzi Family &...

Wondershare Filmora X v10.1.20.16 x64

Man arrested after fracas in flat

Man charged in ongoing Sexual Assault Investigation Derek Nyilas, 46, Faces...