
Overview
Data profiling is the process of assessing data values and deriving statistics or business information about the data. It allows data scientists to validate data quality and business analysts to determine whether the existing data can be used for different purposes. Kylo automatically generates profile statistics such as minimum, maximum, mean, standard deviation, variance, aggregates (count and sum), and the occurrence of null values, unique values, missing values, duplicates, top values, and valid and invalid values.
Once the data has been ingested, cleansed, and persisted in the data lake, business analysts search it to determine whether it can deliver business impact. Kylo lets users build queries against this data, making it easy to create data products that support analysis and keeping data discovery simple.
In this blog, we discuss automatic data profiling and search-based data discovery in Kylo.
Pre-requisites
To learn about Kylo deployment and the components and technologies it requires, refer to our previous blog on Kylo Setup for Data Lake Management.
To learn more about Kylo self-service data ingestion, refer to our previous blog on Kylo – Self-Service Data Ingestion, Cleansing, and Validation (No Coding Required!).
Data Profiling
Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection. Kylo's data profiling routine generates statistics for each field in an incoming dataset, and these statistics are used to validate data quality. The profiling statistics can be found on the Feed Details page.
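For illustration, the sketch below uses the Spark Java API to compute the kind of per-field statistics described above (minimum, maximum, mean, standard deviation, variance, sum, null count, and unique count) for a numeric column. It is a simplified standalone example rather than Kylo's actual profiler routine; the table name category.feed_valid and the amount column are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class ProfileSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("profile-sketch")
                .enableHiveSupport()
                .getOrCreate();

        // Placeholder table name; Kylo profiles every field of the ingested feed table.
        Dataset<Row> df = spark.table("category.feed_valid");

        // Basic numeric statistics for the "amount" column.
        Dataset<Row> stats = df.agg(
                min("amount").alias("min"),
                max("amount").alias("max"),
                avg("amount").alias("mean"),
                stddev("amount").alias("std_dev"),
                variance("amount").alias("variance"),
                sum("amount").alias("sum"),
                count(lit(1)).alias("total_count"),
                count(when(col("amount").isNull(), 1)).alias("null_count"),
                countDistinct("amount").alias("unique_count"));

        stats.show(false);
        spark.stop();
    }
}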
Feed Details
Feed ingestion using Kafka is shown in the diagram below:
Informative summaries of each field in the ingested data can be viewed under the View option on the Profile page.
Profiling details for a string field (the user field in the sample dataset) and a numeric field (the amount field in the sample dataset) are shown in the diagrams below:
Profiling Statistics
Kylo profiling jobs automatically calculate basic numeric field statistics such as minimum, maximum, mean, standard deviation, variance, and sum, as well as basic statistics for string fields. The numeric field statistics for the amount field are shown in the diagram below:
The basic statistics for the string field (i.e., the user field) are shown in the diagram below:
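String fields are profiled in a similar way. Reusing the Dataset df and the static functions import from the sketch above, the snippet below derives the longest and shortest value lengths, the empty and null counts, and the top occurring values for the user column; again, this is only an illustration of the statistics, not Kylo's implementation.

// Reuses df and the static functions import from the previous sketch; "user" is the profiled string column.
Dataset<Row> stringStats = df.agg(
        max(length(col("user"))).alias("max_length"),
        min(length(col("user"))).alias("min_length"),
        count(when(col("user").equalTo(""), 1)).alias("empty_count"),
        count(when(col("user").isNull(), 1)).alias("null_count"));
stringStats.show(false);

// Top 10 most frequent values (Kylo's "top values" statistic).
df.groupBy("user").count().orderBy(col("count").desc()).show(10, false);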
Standardization Rules
Predefined standardization rules are used to manipulate data into conventional or canonical formats (for example, normalizing dates or stripping special characters) or to protect data (masking credit card numbers, PII, and so on). A few standardization rules applied to the ingested data are as follows:
Kylo also provides an extensible Java API for developing custom validation, cleansing, and standardization routines to suit business needs. The standardization rules applied to the user, business, and address fields, as configured, are shown in the diagram below:
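As a rough sketch of what such a custom routine can look like, the class below masks a credit card number, keeping only the last four digits. The CustomStandardizer interface here is a hypothetical stand-in for Kylo's standardization policy contract; consult the Kylo developer documentation for the exact interface and annotations to implement.

// Illustrative only: CustomStandardizer is a hypothetical stand-in for Kylo's
// standardization policy interface (see the Kylo developer guide for the real contract).
interface CustomStandardizer {
    String convertValue(String value);
}

// Masks a credit card number, keeping only the last four digits.
public class CreditCardMaskStandardizer implements CustomStandardizer {

    @Override
    public String convertValue(String value) {
        if (value == null) {
            return null;
        }
        String digits = value.replaceAll("[^0-9]", "");
        if (digits.length() < 4) {
            return "****";
        }
        String lastFour = digits.substring(digits.length() - 4);
        return "XXXX-XXXX-XXXX-" + lastFour;
    }
}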
Profiling Window
Kylo's profiling window provides additional tabs, Valid and Invalid, to view both valid and invalid data after ingestion. If a validation rule fails, the record is marked as invalid and shown under the Invalid tab along with the reason for failure, such as a Range Validator rule violation or a value not recognized as a timestamp.
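A custom validation rule follows the same pattern as a standardizer. The sketch below is a hypothetical range validator in the spirit of the Range Validator rule mentioned above; the CustomValidator interface is again an illustrative stand-in, not Kylo's exact validator API.

// Hypothetical stand-in for Kylo's validation policy interface.
interface CustomValidator {
    boolean validate(String value);
}

// Marks a numeric value as invalid when it falls outside the configured range.
public class RangeValidator implements CustomValidator {

    private final double min;
    private final double max;

    public RangeValidator(double min, double max) {
        this.min = min;
        this.max = max;
    }

    @Override
    public boolean validate(String value) {
        try {
            double d = Double.parseDouble(value);
            return d >= min && d <= max;
        } catch (NumberFormatException | NullPointerException e) {
            return false; // non-numeric or null values are invalid
        }
    }
}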
The data is ingested from Kafka. During feed creation, the Kafka batch size is set to 10000, which is the maximum number of messages pulled from Kafka and processed as a single batch. To learn more about batch size, refer to our previous blog on Kylo – Self-Service Data Ingestion, Cleansing, and Validation (No Coding Required!).
Profiling is applied to each batch of data, and an informative summary is available on the Profile page. The 68K records consumed from Kafka are shown in the diagram below:
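For comparison, the sketch below shows roughly equivalent batch behavior with the plain Kafka Java consumer, where max.poll.records caps each poll at 10000 messages. Kylo itself handles consumption through its NiFi-based ingest flow, so this is only an illustration; the bootstrap server, group id, and topic name are placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "kylo-ingest-sketch");        // placeholder
        props.put("max.poll.records", "10000");             // up to 10000 messages per poll
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("transactions"));  // placeholder topic
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : batch) {
                    // hand each batch of records to the ingest/profiling pipeline
                    System.out.println(record.value());
                }
            }
        }
    }
}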
Search-based Data Discovery
Kylo uses Elasticsearch to provide the index for its search features, such as free-form search over data and metadata. It allows business analysts to decide which fields should be searchable and to enable the index option for those fields while creating the feed. The indexed user and business fields, searchable from Kylo Global Search, are shown in the diagram below:
Index Feed
The predefined “Index Feed” queries the index-enabled field data from the persisted Hive table and indexes the feed data into Elasticsearch. The Index Feed is triggered automatically as part of the “Data Ingest” template. The index feed job status is highlighted in the diagram below:
If the index feed fails, searches cannot be performed on the ingested data. Because user is a reserved word in Hive, the search functionality for the user and business fields failed due to the field name user, as shown in the diagram below:
To resolve this, the user field was renamed to customer_name during feed creation.
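Conceptually, the Index Feed does something like the following Spark (Java) sketch: read the index-enabled fields from the Hive table and write them to Elasticsearch through the elasticsearch-spark connector. This assumes the connector jar is on the classpath, and the table, field, and index names are placeholders; it is not the actual Index Feed implementation.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IndexFeedSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("index-feed-sketch")
                .enableHiveSupport()
                .getOrCreate();

        // Read only the index-enabled fields from the persisted Hive table (placeholder names).
        Dataset<Row> indexable = spark.sql(
                "SELECT customer_name, business FROM category.feed_valid");

        // Write the rows to Elasticsearch via the elasticsearch-spark connector
        // (placeholder node address and index name).
        indexable.write()
                .format("org.elasticsearch.spark.sql")
                .option("es.nodes", "localhost:9200")
                .mode("append")
                .save("kylo-data");

        spark.stop();
    }
}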
Search Queries
The search query that returns the matched documents from Elasticsearch is:
customer_name: "Bradley Martinez"
The Lucene search query to search data and metadata is:
business: "JP Morgan Chase & Co"
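Both queries use Lucene query string syntax, so they can also be issued programmatically. The sketch below runs the first query through the Elasticsearch high-level REST Java client against the index populated by the Index Feed; the host, index name, and field name are placeholder values.

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class SearchSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Same Lucene-style query string as used in Kylo Global Search.
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.queryStringQuery("customer_name: \"Bradley Martinez\""));

            SearchRequest request = new SearchRequest("kylo-data").source(source);
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            System.out.println(response.getHits().getTotalHits());
        }
    }
}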
Feed Lineage
Lineage is maintained automatically at the feed level by the Kylo framework, based on the sources and sinks identified by the template designer when registering the template.
Conclusion
In this blog, we discussed automatic data profiling and search-based data discovery in Kylo, along with an issue we faced with the Index Feed and its resolution. Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection, and it provides an extensible API for building custom validators and standardizers. With the underlying technologies properly set up, Kylo performs data profiling and discovery automatically in the background.