
Overview
Data profiling is the process of assessing data values and deriving statistics or business information about the data. It allows data scientists to validate data quality and business analysts to determine whether the existing data can be used for different purposes. Kylo automatically generates profile statistics such as minimum, maximum, mean, standard deviation, variance, aggregates (count and sum), and the occurrence of null values, unique values, missing values, duplicates, top values, and valid and invalid values.
Once the data has been ingested, cleansed, and persisted in the data lake, business analysts search it to determine whether it can deliver business impact. Kylo lets users build queries against this data, making it easy to create data products that support analysis and keeping data discovery simple.
In this blog, we discuss automatic data profiling and search-based data discovery in Kylo.
Pre-requisites
To learn about Kylo deployment and the components and technologies it requires, refer to our previous blog on Kylo Setup for Data Lake Management.
To learn more about Kylo self-service data ingestion, refer to our previous blog on Kylo – Self-Service Data Ingestion, Cleansing, and Validation (No Coding Required!).
Data Profiling
Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection. Kylo's data profiling routine generates statistics for each field in an incoming dataset, and these statistics are used to validate data quality. The profiling statistics can be found on the Feed Details page.
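For illustration, the sketch below uses the Spark Java API to compute the kind of per-field statistics described above (minimum, maximum, mean, standard deviation, variance, sum, null count, and unique count) for a numeric column. It is a simplified standalone example rather than Kylo's actual profiler routine; the table name category.feed_valid and the amount column are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class ProfileSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("profile-sketch")
                .enableHiveSupport()
                .getOrCreate();

        // Placeholder table name; Kylo profiles every field of the ingested feed table.
        Dataset<Row> df = spark.table("category.feed_valid");

        // Basic numeric statistics for the "amount" column.
        Dataset<Row> stats = df.agg(
                min("amount").alias("min"),
                max("amount").alias("max"),
                avg("amount").alias("mean"),
                stddev("amount").alias("std_dev"),
                variance("amount").alias("variance"),
                sum("amount").alias("sum"),
                count(lit(1)).alias("total_count"),
                count(when(col("amount").isNull(), 1)).alias("null_count"),
                countDistinct("amount").alias("unique_count"));

        stats.show(false);
        spark.stop();
    }
}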
Feed Details
Feed ingestion using Kafka is shown in the diagram below:
Informative summaries of each field in the ingested data can be viewed under the View option on the Profile page.
Profiling details for a string field (the user field in the sample dataset) and a numeric field (the amount field in the sample dataset) are shown in the diagrams below:
Profiling Statistics
Kylo profiling jobs automatically calculate basic numeric field statistics such as minimum, maximum, mean, standard deviation, variance, and sum, as well as basic statistics for string fields. The numeric field statistics for the amount field are shown in the diagram below:
The basic statistics for the string field (i.e., the user field) are shown in the diagram below:
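String fields are profiled in a similar way. Reusing the Dataset df and the static functions import from the sketch above, the snippet below derives the longest and shortest value lengths, the empty and null counts, and the top occurring values for the user column; again, this is only an illustration of the statistics, not Kylo's implementation.

// Reuses df and the static functions import from the previous sketch; "user" is the profiled string column.
Dataset<Row> stringStats = df.agg(
        max(length(col("user"))).alias("max_length"),
        min(length(col("user"))).alias("min_length"),
        count(when(col("user").equalTo(""), 1)).alias("empty_count"),
        count(when(col("user").isNull(), 1)).alias("null_count"));
stringStats.show(false);

// Top 10 most frequent values (Kylo's "top values" statistic).
df.groupBy("user").count().orderBy(col("count").desc()).show(10, false);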
Standardization Rules
Predefined standardization rules are used to manipulate data into conventional or canonical formats (for example, normalizing dates or stripping special characters) or to protect data (masking credit card numbers, PII, and so on). A few standardization rules applied to the ingested data are as follows:
Kylo also provides an extensible Java API for developing custom validation, cleansing, and standardization routines to suit business needs. The standardization rules applied to the user, business, and address fields, as configured, are shown in the diagram below:
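As a rough sketch of what such a custom routine can look like, the class below masks a credit card number, keeping only the last four digits. The CustomStandardizer interface here is a hypothetical stand-in for Kylo's standardization policy contract; consult the Kylo developer documentation for the exact interface and annotations to implement.

// Illustrative only: CustomStandardizer is a hypothetical stand-in for Kylo's
// standardization policy interface (see the Kylo developer guide for the real contract).
interface CustomStandardizer {
    String convertValue(String value);
}

// Masks a credit card number, keeping only the last four digits.
public class CreditCardMaskStandardizer implements CustomStandardizer {

    @Override
    public String convertValue(String value) {
        if (value == null) {
            return null;
        }
        String digits = value.replaceAll("[^0-9]", "");
        if (digits.length() < 4) {
            return "****";
        }
        String lastFour = digits.substring(digits.length() - 4);
        return "XXXX-XXXX-XXXX-" + lastFour;
    }
}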
Profiling Window
Kylo's profiling window provides additional tabs, Valid and Invalid, to view both valid and invalid data after ingestion. If a validation rule fails, the record is marked as invalid and shown under the Invalid tab along with the reason for failure, such as a Range Validator rule violation or a value not recognized as a timestamp.
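A custom validation rule follows the same pattern as a standardizer. The sketch below is a hypothetical range validator in the spirit of the Range Validator rule mentioned above; the CustomValidator interface is again an illustrative stand-in, not Kylo's exact validator API.

// Hypothetical stand-in for Kylo's validation policy interface.
interface CustomValidator {
    boolean validate(String value);
}

// Marks a numeric value as invalid when it falls outside the configured range.
public class RangeValidator implements CustomValidator {

    private final double min;
    private final double max;

    public RangeValidator(double min, double max) {
        this.min = min;
        this.max = max;
    }

    @Override
    public boolean validate(String value) {
        try {
            double d = Double.parseDouble(value);
            return d >= min && d <= max;
        } catch (NumberFormatException | NullPointerException e) {
            return false; // non-numeric or null values are invalid
        }
    }
}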
The data is ingested from Kafka. During feed creation, the Kafka batch size is set to 10000, which is the maximum number of messages pulled from Kafka and processed as a single batch. To learn more about batch size, refer to our previous blog on Kylo – Self-Service Data Ingestion, Cleansing, and Validation (No Coding Required!).
Profiling is applied to each batch of data, and an informative summary is available on the Profile page. The 68K records consumed from Kafka are shown in the diagram below:
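For comparison, the sketch below shows roughly equivalent batch behavior with the plain Kafka Java consumer, where max.poll.records caps each poll at 10000 messages. Kylo itself handles consumption through its NiFi-based ingest flow, so this is only an illustration; the bootstrap server, group id, and topic name are placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "kylo-ingest-sketch");        // placeholder
        props.put("max.poll.records", "10000");             // up to 10000 messages per poll
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("transactions"));  // placeholder topic
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : batch) {
                    // hand each batch of records to the ingest/profiling pipeline
                    System.out.println(record.value());
                }
            }
        }
    }
}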
Search-based Data Discovery
Kylo uses Elasticsearch to provide the index for its search features, such as free-form search over data and metadata. It allows business analysts to decide which fields should be searchable and to enable the index option for those fields while creating the feed. The indexed user and business fields, searchable from Kylo Global Search, are shown in the diagram below:
Index Feed
The predefined “Index Feed” queries the index-enabled field data from the persisted Hive table and indexes the feed data into Elasticsearch. The Index Feed is triggered automatically as part of the “Data Ingest” template. The index feed job status is highlighted in the diagram below:
If the index feed fails, searches cannot be performed on the ingested data. Because user is a reserved word in Hive, the search functionality for the user and business fields failed due to the field name user, as shown in the diagram below:
To resolve this, the user field was renamed to customer_name during feed creation.
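Conceptually, the Index Feed does something like the following Spark (Java) sketch: read the index-enabled fields from the Hive table and write them to Elasticsearch through the elasticsearch-spark connector. This assumes the connector jar is on the classpath, and the table, field, and index names are placeholders; it is not the actual Index Feed implementation.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IndexFeedSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("index-feed-sketch")
                .enableHiveSupport()
                .getOrCreate();

        // Read only the index-enabled fields from the persisted Hive table (placeholder names).
        Dataset<Row> indexable = spark.sql(
                "SELECT customer_name, business FROM category.feed_valid");

        // Write the rows to Elasticsearch via the elasticsearch-spark connector
        // (placeholder node address and index name).
        indexable.write()
                .format("org.elasticsearch.spark.sql")
                .option("es.nodes", "localhost:9200")
                .mode("append")
                .save("kylo-data");

        spark.stop();
    }
}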
Search Queries
The search query that returns the matched documents from Elasticsearch is:
customer_name: "Bradley Martinez"
The Lucene search query to search data and metadata is:
business: "JP Morgan Chase & Co"
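Both queries use Lucene query string syntax, so they can also be issued programmatically. The sketch below runs the first query through the Elasticsearch high-level REST Java client against the index populated by the Index Feed; the host, index name, and field name are placeholder values.

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class SearchSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Same Lucene-style query string as used in Kylo Global Search.
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.queryStringQuery("customer_name: \"Bradley Martinez\""));

            SearchRequest request = new SearchRequest("kylo-data").source(source);
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            System.out.println(response.getHits().getTotalHits());
        }
    }
}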
Feed Lineage
Lineage is maintained automatically at the feed level by the Kylo framework, based on the sources and sinks identified by the template designer when registering the template.
Conclusion
In this blog, we discussed automatic data profiling and search-based data discovery in Kylo, along with an issue we faced with the Index Feed and its resolution. Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection, and it provides an extensible API for building custom validators and standardizers. With the underlying technologies properly set up, Kylo performs data profiling and discovery automatically in the background.