
Apache Drill vs Amazon Athena – A Comparison on Data Partitioning


Overview

Big data exploration in almost all fields has led to the development of multiple big data technologies such as Hadoop (Hive, HDFS, Pig, HBase), NoSQL databases (MongoDB), and so on for accessing, exploring, and reporting on huge volumes of data. Amazon Athena, a serverless, interactive query service, is used to easily analyze big data in Amazon S3 using standard SQL. Apache Drill, a schema-free, low-latency SQL query engine, enables self-service data exploration on big data.

In this blog, let us compare data partitioning in Apache Drill and AWS Athena and the distinct features of both.

Dataset Description

A sample dataset containing age-specific fertility rate census data, broken down by country and gender, is used in this use case. For the sample dataset, see the References section.

Partitioning Data

In this section, let us discuss data partitioning based on male and female fertility rate in a predefined age group in Apache Drill and Athena.

Partitioning Data in Apache Drill

To partition data in Drill, perform the following:

  • Change data storage format to Parquet using the following command:
ALTER SESSION SET `store.format`='parquet';
  • Create table and partition data using the following command:
CREATE TABLE dfs.`csvOut`.AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION (
  country_code, country_name, `year`,
  fertility_rate_15_19, fertility_rate_20_24, fertility_rate_25_29,
  fertility_rate_30_34, fertility_rate_35_39, fertility_rate_40_44,
  fertility_rate_45_49, total_fertility_rate, gross_reproduction_rate,
  sex_ratio_at_birth, gender)
PARTITION BY (gender)
AS
SELECT columns[0]  AS country_code,
       columns[1]  AS country_name,
       columns[2]  AS `year`,
       columns[3]  AS fertility_rate_15_19,
       columns[4]  AS fertility_rate_20_24,
       columns[5]  AS fertility_rate_25_29,
       columns[6]  AS fertility_rate_30_34,
       columns[7]  AS fertility_rate_35_39,
       columns[8]  AS fertility_rate_40_44,
       columns[9]  AS fertility_rate_45_49,
       columns[10] AS total_fertility_rate,
       columns[11] AS gross_reproduction_rate,
       columns[12] AS sex_ratio_at_birth,
       columns[13] AS gender
FROM dfs.`/user/tsldp/drillathena/age_specific_fertility_rates_gender.csv`;
The table created is as shown below:


The time taken to create a table is as shown below:


You can check the data loaded into the database using the following command:

select * from dfs.`csvOut`.`AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION` ;

The time taken to select the required data in a table is as shown below:


  • Get total count of male and female fertility data using the following command:
select count(*),gender from dfs.`csvOut`.`AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION` group by gender;
The count of males and females in a country is shown below:


The file size after partitioning data using Apache Drill is as shown below:

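The same partitioned table can also be queried programmatically over Drill's REST API. The sketch below is an illustration rather than part of the original walkthrough; the host, port, and endpoint defaults are assumptions to adjust for your Drill installation.

import requests

# Drill's REST query endpoint (the default web UI port 8047 is an assumption)
DRILL_URL = "http://localhost:8047/query.json"

sql = """
SELECT gender, COUNT(*) AS cnt
FROM dfs.`csvOut`.`AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION`
GROUP BY gender
"""

# Drill accepts a JSON payload containing the query type and the SQL text
response = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
response.raise_for_status()
result = response.json()

# The response carries the column names and one dictionary per row
print(result.get("columns"))
for row in result.get("rows", []):
    print(row)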

Partitioning Data in Athena

Athena uses Hive data partitioning and provides improved query performance by reducing the amount of data scanned.

In Athena, data partitioning can be done in two separate ways as follows:

  • With already partitioned data stored on Amazon S3 and accessed on Athena.
  • With unpartitioned data.

In both methods, specify the partition column in the CREATE statement.

To partition data in Athena, perform the following:

  • Create table using the below query:
create external table sampledb.age_fertility_rates_gender_parq_part(
country_code string,
country_name string,
year string,
fertility_rate_15_19 decimal(10,5),
fertility_rate_20_24 decimal(10,5),
fertility_rate_25_29 decimal(10,5),
fertility_rate_30_34 decimal(10,5),
fertility_rate_35_39 decimal(10,5),
fertility_rate_40_44 decimal(10,5),
fertility_rate_45_49 decimal(10,5),
total_fertility_rate decimal(10,5),
gross_reproduction_rate decimal(10,5),
sex_ratio_at_birth decimal(10,5))
PARTITIONED BY (gender string)
stored as parquet
LOCATION 's3://cps3bucket/data_gender_parquet/';
  •  Add partitions to the catalog by using the below command:
MSCK REPAIR TABLE age_fertility.age_fertility_rates_gender_parq_part;
  • Check partitioned data using the below query:
select * from age_fertility.age_fertility_rates_gender_parq_part;
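
For completeness, a hedged sketch of running a partition-aware query against the same Athena table from Python with boto3 is shown below. The results bucket, region, and the 'male' partition value are placeholders, not values from this walkthrough.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT gender, COUNT(*) AS cnt
FROM age_fertility.age_fertility_rates_gender_parq_part
WHERE gender = 'male'   -- filtering on the partition column limits the data scanned
GROUP BY gender
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "age_fertility"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the results
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])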

Data Partition Comparison between Apache Drill and Amazon Athena

The time taken to create the partitioned table and to select from it in each tool is as follows:


Distinct Features of Drill and Athena


Conclusion

In Apache Drill, data partitioning concepts can be applied directly. In Athena, we need to convert the files into Parquet format using EMR before performing data partitioning. Separate storage is not required in Athena, as you can query the data directly from Amazon S3.

References


Amazon Athena & Tableau – Serverless Interactive Query Service and Business Intelligence (BI)


Overview

Amazon Athena, a serverless, pay-per-query service, is used to easily analyze data in Amazon Simple Storage Service (S3) using standard SQL. It delivers high query performance even for huge datasets and complex queries.

Athena can process both structured and semi-structured data in different file formats such as CSV, JSON, Parquet, and ORC, and it can be used to generate reports in connection with BI tools. It uses Hive DDL for creating databases and tables, and because its catalog is a Hive-compatible metastore, the same table definitions can also be used from Hadoop, Spark, and Presto.

Pre-requisites

  • Sign into AWS console.
  • Create your own S3 bucket in S3 to upload data as shown below:


  • Create database and tables in Athena to query the data.


Dataset Description

A sample dataset containing bank transaction data of customers is used in this use case. For the sample dataset, see the References section.

Columns

  • Step – Maps a unit of time in the real world. In this use case, 1 step represents 1 hour of time.
  • Type – Diverse types of payments (CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER).
  • Amount – Amount of transaction in local currency.
  • nameOrig – Customer who started the transaction.
  • oldbalanceOrg – Initial balance before the transaction.
  • newbalanceOrig – New balance after the transaction.
  • NameDest – Customer who is the recipient of the transaction.
  • OldbalanceDest – Initial balance recipient before the transaction.
  • NewbalanceDest – New balance recipient after the transaction.
  • IsFraud – Transactions made by the fraudulent agents inside the simulation.
  • IsFlaggedFraud – Flags illegal attempts; in this dataset, an illegal attempt is an attempt to transfer more than 200,000 in a single transaction.

Note: There is no information for customers name starting with M (Merchants).

Use Cases

Data security plays a significant role in almost all sectors, especially in financial institutions such as banks, credit unions, and so on. As mobile transactions in the financial industry have increased in recent years, securing data on the mobile platform has become highly challenging. In this use case, let us discuss finding and managing malicious behavior during mobile transactions.

In this blog, let us discuss the below use cases:

  • Connect Athena with JDBC SQL Workbench Driver
  • Connect Athena with BI Tools (Tableau 10.3)

Connecting Athena with JDBC SQL Workbench Driver

You can query Athena data outside the AWS console by connecting SQL Workbench to Athena through the Athena JDBC driver.

Pre-requisites:

To connect Athena with JDBC driver using SQL workbench, perform the following:

  • In SQL workbench, choose File –> Manage drivers.
  • Perform configuration as shown in the below diagram:


  • Add a new driver connection by configuring user name and password as your AWS access key and secret key, respectively.
  • Configure your s3_staging_dir in extended properties to save your executed queries in your S3 bucket.


  • Create a table with SQL workbench.


The table structure is as follows:


The data visualization in table format is as follows:


The data is queried as follows:

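As a rough Python equivalent of the SQL Workbench JDBC setup above, the PyAthena client can be pointed at the same staging directory and credentials. The keys, bucket, region, and table name below are placeholders.

from pyathena import connect

conn = connect(
    aws_access_key_id="YOUR_ACCESS_KEY",                 # plays the role of the JDBC user name
    aws_secret_access_key="YOUR_SECRET_KEY",             # plays the role of the JDBC password
    s3_staging_dir="s3://your-bucket/athena-results/",   # same purpose as s3_staging_dir above
    region_name="us-east-1",
)

cursor = conn.cursor()
cursor.execute("SELECT type, COUNT(*) FROM sampledb.transactions GROUP BY type")
for row in cursor.fetchall():
    print(row)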

Connecting Athena with BI Tools (Tableau 10.3)

To connect Athena with Tableau, perform the following:


On successfully creating connections, Athena databases will be listed out as shown in the below diagram:


Data Visualization

Few data visualizations are as follows:

Total Amount Based on Type and IsFraud Flag


New Balance and Old Balance Based on Type


Total Amount Based on Type and Fraudulent Activities


Minimum New Balance and Minimum Old Balance Based on Type


Count of Steps for Fraudulent Types of Transactions


Percentage of Steps for Fraudulent Type of Activities


Conclusion

Athena queries data directly from Amazon S3 very quickly, without any ETL. As a pay-per-query service, it charges $5 per TB of data scanned. With performance-improving techniques such as partitioning, compression, and so on, it supports querying huge datasets.

References

Self Service Analytics using Dremio


Overview

Dremio, a self-service data platform, helps data analysts and data scientists to discover, organize, accelerate, and share any data at any time irrespective of volume, velocity, location, or structure. Dremio allows business users to access data from a variety of sources and reduces their reliance on developers.

In this blog, let us discuss data transformation and data analysis using Dremio and data visualization using Tableau.

Pre-requisites

Download and install Dremio from the following link:
https://www.dremio.com/download/

Data Description

Online retail data with different product types, product prices, and quantities sold from December, 2010 to December, 2011 is used as a data source.

Sample Data Source

sample_data_source1

Synopsis

  • Connect different data sources with Dremio
  • Perform data transformation
  • Create virtual datasets in Dremio
  • Connect virtual datasets with BI tools
  • Visualize results in Tableau

Connecting Different Data Sources with Dremio

Different types of data sources available for performing data transformation activities are shown in the below screenshot:

connecting_different_data_sources_with_dremio

To connect Amazon S3 data sources with Dremio, perform the following:

  • In Data Source Types page, select Amazon S3 data source.
  • Connect to Amazon S3 location as shown in the below screenshot:

connecting_different_data_sources_with_dremio11

  • Connect to MySQL connection and provide required credentials as shown in the below screenshot:

connecting_different_data_sources_with_dremio22

  • Connect to Network Attached Storage (NAS) as shown in the below screenshot:

connecting_different_data_sources_with_dremio33

Performing Data Transformation

To transform data, perform the following:

  • Use UNION function to merge data from 3 different data sources such as S3, MySQL, & NAS and load data as virtual dataset as shown in the below screenshot:

performing_data_transformation

As the price values are per unit, the total price needs to be calculated from the quantity.

  • Add “Total_Price” as a new field.
  • Calculate total price based on number of quantity as shown in the below diagram:

performing_data_transformation1

  • Perform aggregation with stock quantity and stock price based on the products in the source data as shown in the below diagram:

performing_data_transformation2

  • Round off the total price values to 2 decimal digits as shown in the below diagram:

performing_data_transformation3
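
For readers who prefer to see the transformation logic as code, the following pandas sketch approximates the same steps (union, Total_Price derivation, aggregation, and rounding). It is an illustration only; Dremio performs these steps in its own engine, and the file and column names here are assumptions.

import pandas as pd

# Placeholder extracts of the three sources (S3, MySQL, NAS)
s3_df = pd.read_csv("retail_s3.csv")
mysql_df = pd.read_csv("retail_mysql.csv")
nas_df = pd.read_csv("retail_nas.csv")

# UNION of the three sources
retail = pd.concat([s3_df, mysql_df, nas_df], ignore_index=True)

# Total_Price = unit price * quantity
retail["Total_Price"] = retail["UnitPrice"] * retail["Quantity"]

# Aggregate quantity and total price per product, then round to 2 decimal digits
summary = (
    retail.groupby("Description", as_index=False)
          .agg(Total_Quantity=("Quantity", "sum"), Total_Price=("Total_Price", "sum"))
)
summary["Total_Price"] = summary["Total_Price"].round(2)
print(summary.head())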

Creating Virtual Datasets in Dremio

On successfully transforming the data, create virtual datasets (views) in Dremio spaces to store the data based on source.

The virtual dataset for purchases done by each customer is as shown below:

creating_virtual_datasets_in_dremio

The virtual dataset for most quantity sold based on the product is as shown in the below diagram:

creating_virtual_datasets_in_dremio1

Connecting Virtual Datasets with BI Tools

To connect the virtual datasets with BI tools, export virtual dataset in .tds format to be used with BI tools such as Tableau, Qlik Sense, and Power BI as shown in the below diagram:

connecting_virtual_datasets_with_bi_tools

 

connecting_virtual_datasets_with_bi_tools1

Visualizing Results in Tableau

On clicking the .tds file, you will be redirected to Tableau for visualizing the data.

Most Purchases by Customers

most_purchases_by_customers

Maximum Number of Products Sold

maximum_number_of_products_sold

References

Dremio: https://www.dremio.com/

Data Quality Checks with StreamSets using Drift Rules


Overview

In the world of big data, data drift has emerged as a critical technical challenge for data scientists and engineers in unleashing the power of data. It delays businesses from gaining real-time actionable business insights and making more informed business decisions.

StreamSets is not only used for big data ingestion but also for analyzing real-time streaming data. It is used to identify null or bad data in source data and filter out the bad data from the source data in order to get precise results. It also helps the businesses in making quick and accurate decisions.

In this blog, let us discuss checking the quality of data using data rules and data drift rules in StreamSets.

Pre-requisites

  • Install Java 1.8
  • Install streamsets-datacollector-2.6.0.1

Use Case

Create a dataflow pipeline to check quality of source data and load the data into HDFS using StreamSets.

Data Description

Network data of outdoor field sensors is used as the source file. Additional fields, dummy data, empty data, and duplicate data were added to the source file.

The dataset has total record count of 600K.

Sample data

{"ambient_temperature":"16.70","datetime":"Wed Aug 30 18:42:45 IST 2017","humidity":"76.4517","lat":36.17,"lng":-119.7462,"photo_sensor":"1003.3","radiation_level":"201","sensor_id":"c6698873b4f14b995c9e66ad0d8f29e3","sensor_name":"California","sensor_uuid":"probe-2a2515fc","timestamp":1504098765}

Synopsis

  • Read data from local file system
  • Configure data drift rules and alerts
  • Convert data types
  • Configure data rules and alerts
  • Derive fields
  • Load data into HDFS
  • Get Alerts During Data Quality Checks
  • Visualize data in motion

Reading Data from Local File System

To read data from the local file system, perform the following:

  • Create a new pipeline.
  • Configure “Directory” origin to read files from a directory.
  • Set Batch Size (recs) as “1” to read records one by one to easily analyze data and get accurate results.
  • Set “Data Format” as JSON.
  • Select “JSON content” as Multiple JSON objects.

reading_data_from_local_file_system

Configuring Data Drift Rules and Alerts

To configure data drift rules and alerts, perform the following:

  • Gather details about data drift as and when data passes between two stages.
  • Provide meters and alerts.
  • Create data drift rules to indicate data structure changes.
  • Click “Add” to add the conditions in the links between the stages.

Few conditions applied are:

    • Alerts when field names vary between two subsequent JSON records.
      Function: drift:names(<field path>, <ignore when missing>)
      For example: ${drift:names('/', false)}
    • Alerts when the number of fields varies between two subsequent JSON records.
      Function: drift:size(<field path>, <ignore when missing>)
      For example: ${drift:size('/', false)}
    • Alerts when the data type of the specified field changes or the specified field is missing (for example, Double to String or String to Integer).
      Function: drift:type(<field path>, <ignore when missing>)
      For example: ${drift:type('/photo_sensor', false)}
    • Alerts when the order of fields varies between two subsequent JSON records.
      Function: drift:order(<field path>, <ignore when missing>)
      For example: ${drift:order('/', false)}
    • Alerts when a string field is empty.
      For example: ${record:value('/photo_sensor')==""}
  • Click "Activate" to activate all the rules.

configuring_data_drift_rules_and_alerts

 

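To make the intent of these drift rules concrete, the following plain-Python sketch (an illustration, not StreamSets expression language) shows the kind of comparison each rule performs between two consecutive JSON records.

import json

previous = json.loads('{"humidity": "76.4517", "photo_sensor": "1003.3"}')
current  = json.loads('{"humidity": 76.4517, "radiation_level": "201"}')

def drift_names(prev, curr):
    """Alert when the set of field names changes (similar to drift:names)."""
    return set(prev) != set(curr)

def drift_size(prev, curr):
    """Alert when the number of fields changes (similar to drift:size)."""
    return len(prev) != len(curr)

def drift_order(prev, curr):
    """Alert when the field order changes (similar to drift:order)."""
    return list(prev) != list(curr)

def drift_type(prev, curr, field):
    """Alert when a field's type changes or the field is missing (similar to drift:type)."""
    return field not in curr or type(prev.get(field)) is not type(curr.get(field))

print(drift_names(previous, current))             # True: field names differ
print(drift_size(previous, current))              # False: both records have two fields
print(drift_order(previous, current))             # True: field order differs
print(drift_type(previous, current, "humidity"))  # True: str changed to float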

Converting Data Types

To analyze data and apply data rules, convert data with String data type into Decimal or Integer type.
For example: Convert the String data type of "humidity" data ("humidity":"76.4517") in the source data into Double type ("humidity":76.4517).

converting_data_types

Configuring Data Rules and Alerts

To configure data rules and alerts, perform the following:

  • Click “Add” to add the conditions in data rules and data drift rules in the links between stages.
  • Apply data rules for attributes.
    For example: ${record:value('/humidity') < 66.2353 or record:value('/humidity') > 92.4165}

configuring_data_rules_and_alerts

 

configuring_data_rules_and_alerts2

 


Deriving Fields

To derive a new field using the "Expression Evaluator" processor, the logic to implement in the Field Expression is:

if ambient_temperature < 20 and humidity > 90:
    return 'Anomaly'
elif 20 < ambient_temperature < 30 and 80 < humidity < 90:
    return 'Suspicious'
else:
    return 'Normal'

For example, if the derived field is "/prediction", the expression is:

${record:value('/ambient_temperature') < 20 and record:value('/humidity') > 90? "Anomaly": (record:value('/ambient_temperature') > 20 and record:value('/ambient_temperature') < 30 and record:value('/humidity') > 80 and record:value('/humidity') < 90? "Suspicious": "Normal")}

deriving_fields
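
The same classification logic, written as a small Python function purely for clarity (it is not part of the pipeline), looks like this:

def classify(record):
    # Values arrive as strings in the source JSON, so cast before comparing
    temperature = float(record["ambient_temperature"])
    humidity = float(record["humidity"])
    if temperature < 20 and humidity > 90:
        return "Anomaly"
    if 20 < temperature < 30 and 80 < humidity < 90:
        return "Suspicious"
    return "Normal"

print(classify({"ambient_temperature": "16.70", "humidity": "91.2"}))  # Anomaly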

Use the "Stream Selector" processor to split records using the following conditions:

${record:value('/prediction')=="Suspicious"} and ${record:value('/prediction')=="Anomaly"}

deriving_fields1

Loading Data into HDFS

To load data into HDFS, perform the following:

  • Configure “Hadoop FS” destination processor.
  • Select data format as “JSON”.

Note: Hadoop-conf directory (/var/lib/sdc-resources/hadoop-conf) contains core-site.xml and hdfs-site.xml files. sdc-resources directory will be created while installing StreamSets.

loading_data_into_hdfs

Getting Alerts During Data Quality Checks

Alerts while Data in Motion

alerts_while_data_in_motion

Alert Summary on Detecting Data Anomalies

alert_summary on_detecting_data_anamolies

Visualizing Data in Motion

Record Summary Statistics

record_summary_statistics

Record Count In/Out Statistics


References

Handle Class Imbalance Data with R


Overview

Imbalanced data refers to classification problems where one class outnumbers the other class by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-level classification. For example, extreme imbalance can be seen in banking or financial data, where the majority of credit card transactions are legitimate and very few are fraudulent.

With an imbalanced dataset, an algorithm cannot obtain the information required to make accurate predictions about the minority class. So, it is recommended to balance the dataset before classification. In this blog, let us discuss tackling imbalanced classification problems using R.

Data Description

A credit card transaction dataset, having total transactions of 284K with 492 fraudulent transactions and 31 columns, is used as a source file. For sample dataset, refer to References section.

Columns

  • Time – Time (in seconds) elapsed between each transaction and the first transaction in the dataset.
  • V1-V28 – Principal component variables obtained with PCA.
  • Amount – Transaction amount.
  • Class – Dependent (or) response variable with value as 1 in case of fraud and 0 in case of good.


Synopsis

  • Performing exploratory data analysis
    • Checking imbalance data
    • Checking number of transactions by hour
    • Checking mean using PCA variables
  • Partitioning data
  • Building model on training set
  • Applying sampling methods to balance dataset

Performing Exploratory Data Analysis

Exploratory data analysis is carried out using R to summarize and visualize significant characteristics of the dataset.

Checking Imbalance Data

To find the imbalance in the dependent variable, perform the following:

  • Group the data by Class value using the group_by function from the dplyr package.


  • Use ggplot to show the percentage of class category.


Checking Number of Transactions by Hour

To check the number of transactions by day and hour, normalize the time by day and categorize them into four quarters according to the time of the day.


The above graph shows the transactions of 2 days. It states that most of the fraudulent transactions occurred between 13:00 and 18:00 hours.

Checking Mean using PCA Variables

To find data anomalies, take mean of variables from V1 to V28 and check the variation.

The blue points with much variations are shown in the below plot:


Partitioning Data

In predictive modeling, data needs to be partitioned for training set (80% of data) and testing set (20% of data). After partitioning the data, feature scaling is applied to standardize the range of independent variables.


Building Model on Training Set

To build a model on the training set, perform the following:

  • Apply a logistic regression classifier to the training set.
  • Predict the test set.
  • Check the predicted output on the imbalance data.

Using the confusion matrix, the test result shows 99.9% accuracy, but only because the majority class dominates the data, so this accuracy is ignored. Using the ROC curve, the test result shows 78% accuracy, which is quite low.


Applying Sampling Methods to Balance Dataset

Different sampling methods are used to balance the given data, apply model on the balanced data, and check the number of good and fraud transactions in the training set.


There are 227K good and 394 fraud transactions.

In R, Random Over Sampling Examples (ROSE) and DMwR packages are used to quickly perform sampling strategies. ROSE package is used to generate artificial data based on sampling methods and smoothed bootstrap approach. This package provides well-defined accuracy functions to quickly perform the tasks.

The different types of sampling methods are:

Oversampling

This method instructs the algorithm to perform oversampling. As the original dataset had 227K good observations, this method is used to oversample the minority class until it reaches 227K, giving the dataset a total of about 454K samples. This can be attained using method = "over".


Undersampling

This method works like the oversampling method in reverse and is done without replacement: the majority class is reduced until good transactions equal fraud transactions. Hence, significant information from the majority class can be lost in this sample. This can be attained using method = "under".


Both Sampling

This method is a combination of both oversampling and undersampling methods. Using this method, the majority class is undersampled without replacement and the minority class is oversampled with replacement. This can be attained using method = “both”.

ROSE Sampling

ROSE sampling method generates data synthetically and provides a better estimate of original data.

Synthetic Minority Over-Sampling Technique (SMOTE) Sampling

This method is used to avoid overfitting when adding exact replicas of minority instances to the main dataset.

For example, a subset of data from the minority class is taken. New synthetic similar instances are created and added to the original dataset.

The count of each class records after applying sampling techniques is shown below:


A logistic classifier model is trained on each balanced dataset, and the test data is predicted. The confusion matrix accuracy is again neglected because the test data remains imbalanced. The built-in roc.curve function is used to capture the ROC metric.


Conclusion

In this blog, the highest accuracy is obtained using the SMOTE method. As there is not much variation among these sampling methods, combining them with a more robust algorithm such as random forest or boosting can provide exceptionally high accuracy.

When dealing with an imbalanced dataset, experiment with all these methods to find the best-suited sampling method for your dataset. For better results, advanced approaches combining synthetic sampling with boosting methods can be used.

These sampling methods can be implemented in much the same way in Python, as sketched below.
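
A hedged Python sketch of the same workflow, using scikit-learn and the imbalanced-learn package, is shown below. The file name and column names follow the credit card dataset described above; exact AUC values will differ from the R results.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

data = pd.read_csv("creditcard.csv")
X, y = data.drop("Class", axis=1), data["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Balance only the training set, then evaluate on the untouched test set
samplers = {
    "over": RandomOverSampler(random_state=42),
    "under": RandomUnderSampler(random_state=42),
    "smote": SMOTE(random_state=42),
}

for name, sampler in samplers.items():
    X_bal, y_bal = sampler.fit_resample(X_train, y_train)
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")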

References

API Response Tracking with StreamSets, Elasticsearch, and Kibana


Overview

RESTful API JSON response data can be used to view various aspects of the StreamSets Data Collector, such as pipeline configuration or monitoring information. The Data Collector REST API can be used to provide these details to a REST-based monitoring system.

In this blog, let us discuss capturing all alerts produced by StreamSets pipelines using the RESTful API, loading the alerts in Elasticsearch, and visualizing the alerts in Kibana.

Pre-requisites

  • Install Java 1.8
  • Install streamsets-datacollector-2.6.0.1

Use Case

Create a dataflow pipeline to capture response of RESTful API using StreamSets and to load it in Elasticsearch.

Synopsis

  • View RESTful API response data
  • Capture RESTful API response
  • Load API response in Elasticsearch
  • Visualize pipeline alerts in Kibana

Viewing RESTful API Response Data

To view RESTful API response data, perform the following:

  • Log in to StreamSets.
  • On the top right corner, click Help icon.
  • Click RESTful API.
    Different categories such as ACL, definitions, manager, preview, store, and system can be viewed.


  • Click manager to view API required to get alerts triggered for all the pipelines.
  • Click try it out! to get the request URL.


  • Check the response in UI using the below URL:
    http://<sdc_host>:<sdc_port>/rest/v1/pipelines/alerts


Capturing RESTful API Response

To capture RESTful API response, perform the following:

  • Configure the HTTP Client processor by setting Resource URL as "http://<sdc_host>:<sdc_port>/rest/v1/pipelines/alerts", Mode as "Polling", and the Polling Interval.


  • Capture RESTful API response using the HTTP client processor.
  • In Pagination tab, set Pagination Mode as “Link HTTP header” and Result Field Path as “/”.


Loading API Response in Elasticsearch

To load API Response in Elasticsearch, perform the following:

  • Configure “Elasticsearch” processor.
  • Set Cluster HTTP URI.
  • Use the below template for Elasticsearch:
{
  "template": "streamsets*",
  "mappings": {
    "uri": {
      "properties": {
        "gauge": {
          "properties": {
            "value": {
              "properties": {
                "timestamp": {
                  "type": "date",
                  "format": "yyyy-MM-dd HH:mm:ss.SSS||yyyy-MM-dd'T'HH:mm:ss.SSS'Z'||yyyy-MM-dd||yyyy-MM-dd HH:mm:ss||mmm dd, yyyy HH:mm:ss a||epoch_millis"
                }
              }
            }
          }
        }
      }
    }
  }
}
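
As an alternative illustration of the same idea, the alerts endpoint can be polled from Python and the response indexed into Elasticsearch directly. The host names, port, credentials, and index name below are placeholders, and a recent elasticsearch-py client is assumed.

import requests
from elasticsearch import Elasticsearch

SDC_ALERTS_URL = "http://sdc-host:18630/rest/v1/pipelines/alerts"
es = Elasticsearch("http://es-host:9200")

# The X-Requested-By header is commonly required by SDC REST calls (harmless for GETs)
resp = requests.get(
    SDC_ALERTS_URL,
    auth=("admin", "admin"),
    headers={"X-Requested-By": "sdc"},
)
resp.raise_for_status()

# Assuming the endpoint returns a JSON array of alert objects, index each one
for alert in resp.json():
    es.index(index="streamsets-alerts", document=alert)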

Visualizing Pipeline Alerts in Kibana

The alerts produced by all the pipelines can be viewed in Kibana without using StreamSets.

Number of Alerts vs Label as Attribute


Number of Alerts vs Timestamp


Conclusion

StreamSets provides different RESTful APIs to get metrics, status, alerts, and so on. These APIs can be used with different visualization tools to visualize data and to monitor the pipelines externally.

References

Import and Ingest Data into HDFS using Kafka in StreamSets


Overview

StreamSets provides state-of-the-art data ingestion to easily and continuously ingest data from various origins such as relational databases, flat files, AWS, and so on, and write data to various systems such as HDFS, HBase, Solr, and so on. Its configuration-driven User Interface (UI) helps you design pipelines for data ingestion in minutes. Data is routed, transformed, and enriched during ingestion and made ready for consumption and delivery to downstream systems.

Kafka, an intermediate data store, helps to very easily replay ingestion, consume datasets across multiple applications, and perform data analysis. In this blog, let us discuss reading the data from different data sources such as Amazon Simple Storage Service (S3) & flat files and writing the data into HDFS using Kafka in StreamSets.

Pre-requisites

  • Install Java 1.8
  • Install streamsets-datacollector-2.6.0.1

Use Case

Import and ingest data from different data sources into HDFS using Kafka in StreamSets.

Data Description

Network data of outdoor field sensors is used as the source file. Additional fields, dummy data, empty data, and duplicate data were added to the source file.

The dataset has total record count of 600K with 3.5K duplicate records.

Sample data

{"ambient_temperature":"16.70","datetime":"Wed Aug 30 18:42:45 IST 
2017","humidity":"76.4517","lat":36.17,"lng":-
119.7462,"photo_sensor":"1003.3","radiation_level":"201","sensor_id":"c6698873b4f14b995c9e66ad0d8f29e3","
sensor_name":"California","sensor_uuid":"probe-2a2515fc","timestamp":1504098765}

Synopsis

  • Read data from local file system and produce data to Kafka
  • Read data from Amazon S3 and produce data to Kafka
  • Consume streaming data produced by Kafka
  • Remove duplicate records
  • Persist data into HDFS
  • View data loading statistics

Reading Data from Local File System and Producing Data to Kafka

To read data from the local file system, perform the following:

  • Create a new pipeline.
  • Configure File Directory origin to read files from a directory.
  • Set Data Format as JSON and JSON content as Multiple JSON objects.
  • Use Kafka Producer processor to produce data into Kafka.
    Note: If there are no Kafka processors, install Apache Kafka package and restart SDC.
  • Produce the data under topic sensor_data.

reading-data-from-local-file-system

reading-data-from-local-file-system1
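
For reference, the following kafka-python sketch does roughly what the Kafka Producer stage does here: it reads the multi-object JSON file and publishes each record to the sensor_data topic. The broker address and file path are placeholders.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# Read the multi-object JSON file and publish each record to the sensor_data topic
with open("/data/sensor/sensor_readings.json") as source:
    for line in source:
        line = line.strip()
        if line:
            producer.send("sensor_data", json.loads(line))

producer.flush()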

Reading Data from Amazon S3 and Producing Data to Kafka

To read data from Amazon S3 and produce data into Kafka, perform the following:

  • Create another pipeline.
  • Use Amazon S3 origin processor to read data from S3.
    Note: If there are no Amazon S3 processors, install Amazon Web Services 1.11.123 package available under Package Manager.
  • Configure processor by providing Access Key ID, Secret Access Key, Region, and Bucket name.
  • Set the data format as JSON.
  • Produce data under the same Kafka topic – sensor_data.

reading-data-from-amazon-s3

reading-data-from-amazon-s3-1

Consuming Streaming Data Produced by Kafka

To consume streaming data produced by Kafka, perform the following:

  • Create a new pipeline.
  • Use Kafka Consumer origin to consume Kafka produced data.
  • Configure processor by providing the following details:
    • Broker URI
    • ZooKeeper URI
    • Topic – set the topic name as sensor_data (same data produced in previous sections 1 & 2)
  • Set the data format as JSON.

consuming-streaming-data-produced-by-kafka

Removing Duplicate Records

To remove duplicate records using Record Deduplicator processor, perform the following:

  • Under Deduplication tab, provide the following fields to compare and find duplicates:
    • Max. Records to Compare
    • Time to Compare
    • Compare
    • Fields to Compare
      For example, find duplicates based on sensor_id and sensor_uuid.
  • Move the duplicate records to Trash.
  • Store the unique records in HDFS.

removing-duplicate-records

Persisting Data into HDFS

To load data into HDFS, perform the following:

  • Configure Hadoop FS destination processor from stage library HDP 2.6.
  • Select data format as JSON.
    Note: core-site.xml and hdfs-site.xml files are placed in Hadoop-conf directory (/var/lib/sdc-resources/hadoop-conf). While installing StreamSets, sdc-resources directory will be created.

persisting-data-into-hdfs

Viewing Data Loading Statistics

Data loading statistics, after removing duplicates from different sources, is as follows:

viewing-data-loading-statistics

viewing-data-loading-statistics1

References

Kylo – Self-Service Data Ingestion, Cleansing, and Validation (No Coding Required!)


Overview

Kylo, a feature-rich data lake platform, is built on Apache Hadoop and Apache Spark. Kylo provides a business-friendly data lake solution and enables self-service data ingestion, data wrangling, data profiling, data validation, data cleansing/standardization, and data discovery. Its intuitive user interface allows IT professionals to access the data lake (without having to code).

While many tools ingest either batch data or streaming/real-time data, Kylo supports both. It provides a plug-in architecture with a variety of extensions. Apache NiFi templates provide incredible flexibility for batch and streaming use cases.

In this blog post, let us discuss ingesting data from Apache Kafka, performing data cleansing and validation at real-time, and persisting the data into Apache Hive table.

Pre-requisites

  • Install Kafka.
  • Deploy Kylo, where the deployment requires knowledge on different components/technologies such as:
    • AngularJS for Kylo UI
    • Apache Spark for data wrangling, data profiling, data validation, data cleansing, and schema detection
    • JBoss ModeShape and MySQL for Kylo Metadata Server
    • Apache NiFi for pipeline orchestration
    • Apache ActiveMQ for interprocess communication
    • Elasticsearch for search-based data discovery
    • All Hadoop technologies but most preferably HDFS, YARN, and Hive

To know more about basics and installation of Kylo in AWS EC2 instance, refer our previous blog on Kylo Setup for Data Lake Management.

Data Description

User transaction dataset with 68K rows, generated by Treselle team, is used as the source file. The input dataset has time, uuid, user, business, address, amount, and disputed columns.

Sample dataset


Examples of invalid and missing values in the dataset:


Use Case

  • Publish user transaction dataset into Kafka.
  • Ingest data from Kafka using Kylo data ingestion template and standardize & validate data.

Synopsis

  • Customize data ingest pipeline template
  • Define categories for feeds
  • Define feeds with source and destination
  • Cleanse and validate data
  • Schedule feeds
  • Monitor feeds

Self-Service Data Ingest, Data Cleansing, and Data Validation

Kylo utilizes Spark to provide a pre-defined pipeline template that implements multiple best practices around data ingestion. By default, it ships with file system and database sources, and it helps business users simplify the configuration of data ingestion from new sources such as JMS, Kafka, HDFS, HBase, FTP, SFTP, REST, HTTP, TCP, IMAP, AMQP, POP3, MQTT, WebSocket, Flume, Elasticsearch and Solr, Microsoft Azure Event Hub, Microsoft Exchange using Exchange Web Services (EWS), Couchbase, MongoDB, Amazon S3, SQS, DynamoDB, and Splunk.

Apache NiFi, a scheduler and orchestration engine, provides an integrated framework for designing new types of pipelines with 250+ processors (data connectors and transforms).

The pre-defined data ingest template is modified by adding Kafka, S3, HDFS, and FTP as shown in the below screenshot:


Get, Consume, and Fetch named processors are used to ingest the data. The Get and Consume versions of the Kafka processors in NiFi are as follows:

GetKafka 1.3.0: Fetches messages from the earlier version of Apache Kafka (specifically 0.8.x versions). The complementary NiFi processor used to send messages is PutKafka.

ConsumeKafka_0_10 1.3.0: Consumes messages from the newer version of Apache Kafka specifically built against the Kafka 0.10.x Consumer API.

Based on need, a custom processor or other custom extension for NiFi can be written & packaged as an NAR file and deployed into NiFi.

Customizing Data Ingest Pipeline Template

On updating and saving the data ingest template in NiFi, the same template can be customized in Kylo UI. The customization steps involve:

  • Customizing feed destination table
  • Adding input properties
  • Adding additional properties
  • Performing access control
  • Registering the template


Defining Categories for Feeds

All the feeds created in Kylo should be categorized. The process group in NiFi is launched to execute the feeds. “Transaction raw data” category is created to categorize the feeds.


Defining Feeds with Source and Destination

Kylo UI is self-explanatory to create and schedule the feeds. To define feeds, perform the following:

  • Choose data ingest template.
  • Provide feed name, category, and description.


  • Choose input Data Source to ingest data.
  • Customize the configuration parameter related to that source.
    For example, “transactionRawTopic” in Kafka and batch size “10000”.


  • Define output feed table using either of the following methods:
    • Manually define the table columns and its data type.
    • Upload sample file and update the data type as per the data in the column.
  • Preview the data under Feed Details section in the top right corner.


  • Define partitioning output table by choosing Source Field and Partition Formula.
    For example, “time” as source field and “year” as partition formula to partition the data.

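As a small illustration of what the "year" partition formula does (an assumption about its behavior, shown in Python purely for clarity), the partition value is simply the year component of the source field:

from datetime import datetime

record = {"time": "2017-08-30 18:42:45", "amount": 1250.75}

# The "year" formula reduces the source field to its year component,
# so this record would land in the year=2017 partition of the output table.
partition_value = datetime.strptime(record["time"], "%Y-%m-%d %H:%M:%S").year
print(f"year={partition_value}")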

Cleansing and Validating Data

Feed creation wizard UI allows end-users to configure cleansing and standardization functions to manipulate data into conventional or canonical formats (for example, simple data type conversion such as dates, stripping special characters) or data protection (for example, masking credit cards, PII, and so on).

It allows users to define field-level validation to protect data against quality issues and provides schema validation automatically. It provides an extensible Java API to develop custom validation, custom cleansing, and standardization routines as per needs. It provides predefined rules for standardization and validation of different data types.


To clean and validate data, perform the following:

  • Apply different pre-defined standardization rules for time, user, address, and amount columns as shown below:


  • Apply standardization and validation for different columns as shown in the below screenshot:


  • Define data ingestion merge strategy in the output table.
  • Choose “Dedupe and merge” to ignore duplicated batch data and insert it into the desired output table.


  • Use Target Format section to define data storage and compression options.
    Supported Storage Formats: ORC, Parquet, Avro, TextFile, and RCFile
    Compression Options: Snappy and Zlib


Scheduling Feeds

Feeds can be scheduled using a cron or timer based mechanism. The "Enable Feed immediately" option starts the feed immediately, without waiting for the cron or timer criteria to be met.


Monitoring Feeds

After scheduling the feeds, the actual execution is performed in NiFi. Feed status can be monitored, feed details can be changed at any time, and feeds can be re-scheduled.


An overview of the created feed's job status can be seen under Jobs in the Operations section. By drilling down into the jobs, you can identify the details of each job and debug feed job execution failures.


The Job Activity section provides details such as completed and running runs of a specific feed's recurring activity.


The Operational Job Statistics section provides details such as success rate, flow rate per second, flow duration, and step duration for a specific job.


Conclusion

In this blog, we discussed data ingestion, cleansing, and validation without any coding in Kylo data lake platform. The ingested data output from Kafka is shown in Hive table in Ambari as follows:


In our next blog – Kylo: Data Profiling and Search-based Data Discovery, let us discuss data profiling and search-based data discovery.


References


Predict Lending Club Loan Default Using Seahorse and SparkR


Overview

Data scientists are using Python and R to solve data problems due to the ready availability of these packages. These languages are often limited as the data is processed on a single machine, where the movement of data from the development environment to production environment is time-consuming and requires extensive re-engineering.

To overcome this problem, Spark provides a powerful, unified engine that is both fast (100x faster than Hadoop for large-scale data processing) and easy to use by the data scientists and data engineers. It is simple, scalable, and easy to integrate with other tools.

Seahorse, a scalable data analytics workbench, allows the data scientists to visually build Spark applications. It allows the data scientists to perform data preparation, data transformation, data modeling, data training, data analysis, and data visualization collaboratively. Seahorse has built-in operations to allow the data scientists to customize parameter values.

In this blog, let us discuss predicting loan default of Lending Club. Lending Club is the world’s largest online marketplace to connect borrowers and investors.

Pre-requisites

  • VirtualBox (version 5.0.10)
  • Vagrant (version 1.8.1)
  • Google Chrome (60.0.3112.113)

Data Description

Loan data of Lending Club, from 2007-2011, with 40K records is used as the source file. Each loan has more than 100 characteristics of the loan and the borrower.


Use Case

  • Analyze loan data of Lending Club.
  • Predict loan default in Lending Club dataset by building data model using Logistic Regression.

Loan status falls under two categories such as Charged Off (default loan) and Fully Paid (desirable loan). Lending Club defines Charged Off loans as loans that are non-collectable and the lender has no hope of recovering money.

Synopsis

  • Read Data from Source
  • Prepare Data
  • Train and Evaluate Data Model
  • Visualize Data

Workflow Operations

In Seahorse, all the machine learning processes are made as operations. R Transformation operations are used to clean and prepare the data. The operations used for Lending Club loan data analysis are as follows:

  • Input / Output – Read DataFrame
  • Action – Fit, Transform, Evaluate
  • Set Operation – Split
  • Filtering – Filter Columns, Handle Missing Values
  • Transformation – SQL Transformation, R Transformation
  • Feature Conversion – String Indexer, One Hot Encoder, Assemble Vector
  • Machine Learning – Logistic Regression from Classification, Binary Classification Evaluator from Evaluation


Reading Data from Source

Seahorse supports three different file formats such as CSV, Parquet, and JSON from different types of data sources such as HDFS, Database, Local, and Google Spreadsheets. Read DataFrame operation is used to read the files from the data sources and upload it into Seahorse library.


Preparing Data

To prepare the data for analysis, perform the following:

  • Remove irrelevant data (loan ID, URL, and so on), poorly documented data (average current balance), and less important features (payment plan, home state) from the source data.
  • Use Filter Columns operation to select 17 key features from the dataset as shown in the below diagram:


  • Use R Transformation operation to write any custom function in R.
  • Convert string columns into numeric columns by removing special characters and duplicate data.
    For Example, convert int_rate and revol_util columns into numeric by removing special characters (%).
transform <- function(dataframe) {
# Convert into R dataframe using collect function
dataframe <- collect(dataframe)
# Remove special character(%) from the features
dataframe$int_rate <- as.numeric(gsub("%","",dataframe$int_rate))
dataframe$revol_util <- as.numeric(gsub("%","",dataframe$revol_util))
# Convert string to numeric by removing same word(months)
dataframe$term <- as.numeric(gsub("months","",dataframe$term))
# Reduce factor level for some features column.
dataframe$home_ownership[dataframe$home_ownership=="NONE"] <- "OTHER"
# verified and source verified both are giving same meaning so we have convert as single state
dataframe$verification_status[dataframe$verification_status=="Source Verified"] <- "Verified"
dataframe$loan_status[dataframe$loan_status=="Does not meet the credit policy. Status:Charged Off"] <- "Charged Off"
dataframe$loan_status[dataframe$loan_status=="Does not meet the credit policy. Status:Fully Paid"] <- "Fully Paid"
return(dataframe)
}
  • Derive new features from the date columns by applying feature engineering.
    For example, derive issue_month and issue_year from the issue_d feature, and similarly for the earliest_cr_line feature.
transform <- function(dataframe) {
dataframe <- collect(dataframe)
# Add default value for day in date_time columns
dataframe$issue_d <- as.Date(paste("01-",dataframe$issue_d,sep=""),"%d-%b-%Y")
dataframe$earliest_cr_line <- as.Date(paste("01-",dataframe$earliest_cr_line,sep=""),"%d-%b-%Y")
# Get year from the date_time column
dataframe$issue_year <- as.numeric(format(dataframe$issue_d,"%Y"))
dataframe$cr_line_year <- as.numeric(format(dataframe$earliest_cr_line,"%Y"))
# Get month from the date_time
dataframe$issue_month <- as.numeric(format(dataframe$issue_d,"%m"))
dataframe$cr_line_month <- as.numeric(format(dataframe$earliest_cr_line,"%m"))
dataframe$issue_d <- NULL
dataframe$earliest_cr_line <- NULL
return(dataframe)
}
The derived features are shown in the below diagram:


After Preprocessing

After preprocessing, perform the following:

  • Use Handle Missing Values operation to find the rows with missing values and to handle them with the selected strategy such as remove row, remove column, custom value, and mode.
    For example, provide custom values for NAs and empty string.
  • Select numeric and string columns from the DataFrame and select remove row as strategy as shown in the below diagram:


  • Use String Indexer to map the categorical features into numbers.
  • Choose the columns from the DataFrame using name, index, or type.
  • Select string type columns from the DataFrame and apply string indexer to those columns.
    For example, after the String Indexer execution, Fully Paid will become 0 and Charged Off will become 1 in the loan_status column.
  • Use One Hot Encoder operation to convert categorical values into numbers in a fixed range of values.
    A vector will be produced in each column corresponding to one possible value of the feature.


  • Use Assemble Vector operation to group all relevant columns together and to form a column with a single vector of all the features.
    For example, the loan_status column is prediction variable and all other columns are features.
  • Use excluding mode to select all the columns other than the prediction variable.


Training and Evaluating Data Model

To split the dataset into training set and validation set using Split operation based on split ratio, perform the following:

  • Use 0.7 as the split ratio to put 70 percent of the data in the training set and 30 percent in the validation set.


  • Use Logistic Regression and Fit operations to perform model training.
  • Use the Fit operation to fit an estimator so as to produce a Transformer.
  • In the Fit operation, select the feature columns and the prediction variable.
  • Select maximum iterations and threshold value for the model.
    The Fit operation provides the prediction variable with predicted values and confidence scores in the raw prediction and probability columns.


  • Use Evaluate action with Binary Classification Evaluator to find the performance of the model.
  • Find AUC, F-Score, and Recall values from the Binary Classification Evaluator and select AUC as a metric for the model.


  • Use custom functions (R or Python Transformation) to find the confusion matrix of the model and derive the metrics for that model.
  • Use SQL Transformation to write custom Spark SQL query and to get correctly predicted values and wrongly predicted values from the DataFrame.

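A hedged PySpark equivalent of this Seahorse flow (String Indexer, One Hot Encoder, Assemble Vector, Split, Logistic Regression, Evaluate, and a confusion-matrix query) is sketched below. It assumes Spark 3.x and an already cleaned CSV with illustrative column names; the blog itself builds the workflow visually in Seahorse.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("lending-club").getOrCreate()
loans = spark.read.csv("loans_prepared.csv", header=True, inferSchema=True)

# Feature conversion: index the label and a categorical feature, then one-hot encode it
label_indexer = StringIndexer(inputCol="loan_status", outputCol="label")
grade_indexer = StringIndexer(inputCol="grade", outputCol="grade_idx")
encoder = OneHotEncoder(inputCols=["grade_idx"], outputCols=["grade_vec"])
assembler = VectorAssembler(
    inputCols=["loan_amnt", "int_rate", "term", "revol_util", "grade_vec"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)

pipeline = Pipeline(stages=[label_indexer, grade_indexer, encoder, assembler, lr])

# 70/30 split, train, and predict on the validation set
train, validation = loans.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train)
predictions = model.transform(validation)

evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
)
print("AUC:", evaluator.evaluate(predictions))

# One possible SQL Transformation query to tally correct and wrong predictions
predictions.createOrReplaceTempView("predictions")
spark.sql("""
    SELECT label, prediction, COUNT(*) AS cnt
    FROM predictions
    GROUP BY label, prediction
    ORDER BY label, prediction
""").show()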

Visualizing Data

DataFrame Report

In DataFrame Report, every column has some plots based on the datatype.


Int_rate Column Visualization

For Continuous features, the bar chart is used for data visualization as shown in the below diagram:


Grade Column Visualization

For Discrete features, the pie chart is used for data visualization as shown in the below diagram:


To create a custom plot like the combination of two column values, use custom operations such as R, Python, SQL Transformation or Python or R Notebook.

References

Data Quality Metrics using Talend Data Quality Management


Overview

Data Quality is the process of examining data in different data sources according to predefined business goals. It helps to improve the quality of the data and collect statistics and information about the data. It helps business users in making more informed decisions with the quality data.

In this blog, let us discuss Data Quality Statistics (DQS) using Talend Data Quality Management (DQM).

Pre-requisites

Download and install Talend data quality tool from the following link:
https://www.talend.com/products/data-quality/

Data Description

Loan applicant dataset, with basic applicant details such as applicant ID, gender, age, marital status, and so on, is used as the source data.

Sample Data Source in MySQL

sample-data-source-in-mysql

Use Case

Perform column and table level quality statistics on the input data source.

Synopsis

  • Connect data source with Talend DQM
  • Create analysis and data quality statistics
    • Simple statistics
    • Pattern matching statistics
    • Text statistics
    • Pattern frequency statistics
  • Apply static rules
  • Perform Correlation Analysis
  • Identify data duplicates using Match Analysis

Connecting Data Source with Talend DQM

To connect Talend DQM with the database, perform the following:

  • Open Talend Open Studio for Data Quality.
  • In the left panel, click Metadata –> DB connections –> Create DB Connection to create a database connection to import the source data from the database for collecting statistics.


  • Provide the required credentials to create metadata for MySQL DB connection as shown in the below diagram:


Creating Analysis and Data Quality Statistics

On successfully connecting Talend with MySQL, perform analysis on the following levels:

  • Column
  • Table

To collect the data quality statistics with the applicant dataset on column level, perform the following:

  • Create analysis as shown in the below diagram:


  • Select columns for performing the data quality statistics as shown in the below diagram:


  • Select quality indicator for the selected columns to run analysis and view the analysis results.


The quality statistics based on the above-selected indicators in Talend DQM are:

  • Simple Statistics
  • Pattern Matching Statistics
  • Text Statistics
  • Pattern Frequency Statistics

Simple Statistics

The simple statistics on the applicant ID column, with Row Count, Null Count, Distinct Count, Unique Count, Duplicate Count, and Blank Count, is shown in the below diagram:

simple-statistics

Pattern Matching Statistics

Pattern matching statistics are used to analyze the format of several types of data, such as dates in different formats, phone number patterns in different countries, zip codes, and so on. They provide both matching and non-matching patterns.

Matched and unmatched patterns of the phone numbers with the countries are shown in the below diagram:

pattern-matching-statistics

Matched and unmatched patterns of the applicant last name starting with uppercase are shown in the below diagram:

pattern-matching-statistics1

Matched and unmatched patterns of the US state codes in the applicant data are shown in the below diagram:

pattern-matching-statistics2

Matched and unmatched patterns of the date matching with the date of birth of the applicant are shown in the below diagram:

pattern-matching-statistics3

Text Statistics

Text statistics are used to check data with a fixed expected length, such as phone numbers (for example, 10 digits in India). The text statistics on the applicants' phone numbers are shown in the below diagram:

text-statistics

Pattern Frequency Statistics

Pattern frequency statistics are used to check the pattern formats in the data source. The patterns of the phone numbers are shown in the below diagram:

pattern-frequency-statistics
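
To show what these indicators compute, here is a hedged pandas sketch that reproduces the four kinds of statistics outside Talend. The file name, column names, and phone pattern are placeholders based on the applicant dataset.

import re
import pandas as pd

applicants = pd.read_csv("loan_applicants.csv")
phone = applicants["phone_number"].astype(str)

# Simple statistics
print("row count:      ", len(applicants))
print("null count:     ", applicants["applicant_id"].isna().sum())
print("distinct count: ", applicants["applicant_id"].nunique())
print("duplicate count:", applicants["applicant_id"].duplicated().sum())

# Pattern matching statistics: US-style phone numbers
us_pattern = re.compile(r"^\d{3}-\d{3}-\d{4}$")
matches = phone.apply(lambda value: bool(us_pattern.match(value)))
print("matching:", matches.sum(), "non-matching:", (~matches).sum())

# Text statistics: length of each phone number
print(phone.str.len().describe())

# Pattern frequency statistics: digit/letter shape of each value
shape = phone.str.replace(r"\d", "9", regex=True).str.replace(r"[A-Za-z]", "A", regex=True)
print(shape.value_counts().head())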

Applying Static Rules

Business rule statistics, also called table-level statistics, are used to apply static rules and predefined business rules to table columns.

Few static rules created are:

  • Approved loan amount should not be greater than requested loan amount.
  • The age column value should be consistent with the date of birth column in the table.
  • Gender column should have valid data like Male or Female.

The business rule analysis performed using the above static rules is shown in the below diagram:

applying-static-rules

Performing Correlation Analysis

The correlation analysis is used to explore the relationships and correlations in the data. It is used to highlight weak relationships between the data to find potential incorrect relationships. The correlation analysis between the cities and the states is shown in the below diagram:


Identifying Data Duplicates Using Match Analysis

The match analysis is used to assess the number of duplicates in the data. It estimates the number of groups of similar data on a table-set or a column-set basis. The match analysis, with column sets such as state and gender, is shown in the below diagrams:

match-analysis

match-analysis1

References

Kylo – Automatic Data Profiling and Search-based Data Discovery


Overview

Data profiling is the process of assessing data values and deriving statistics or business information about the data. It allows data scientists to validate data quality and business analysts to determine the usage of the existing data for different purposes. Kylo automatically generates profile statistics such as minimum, maximum, mean, standard deviation, variance, aggregates (count & sum), occurrence of null values, occurrence of uniqueness, occurrence of missing values, occurrence of duplicates, occurrence of top values, and occurrence of valid & invalid values.

Once the data has been ingested, cleansed, and persisted in data lake, the business analyst searches and finds out if the data can deliver business impact. Kylo allows users to build queries to access the data so as to build data products supporting analysis and to make data discovery simple.

In this blog, let us discuss automatic data profiling and search-based data discovery in Kylo.

Pre-requisites

To know about Kylo deployment requiring knowledge on different components/technologies, refer our previous blog on Kylo Setup for Data Lake Management.

To learn more about Kylo self-service data ingest, refer our previous blog on Kylo – Self-Service Data Ingestion, Cleansing, and Validation (No Coding Required!).

Data Profiling

Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection. Kylo’s data profiling routine generates statistics for each field in an incoming dataset. Profiling is used to validate data quality. The profiling statistics can be found in Feed Details page.

Feed Details

The feed ingestion using Kafka is shown in the below diagram:


Informative summaries about each field from the ingested data can be viewed under View option in Profile page.

String (user field in the sample dataset) and numeric data type (amount field in the sample dataset) profiling details are shown in the below diagrams:


Profiling Statistics

Kylo profiling jobs automatically calculate the basic numeric field statistics such as minimum, maximum, mean, standard deviation, variance, and sum. Kylo provides basic statistics for string field. The numeric field statistics for the amount field is shown in the below diagram:


The basic statistics for the string field (i.e. user field) is shown in the below diagram:


Standardization Rules

Predefined standardization rules are used to manipulate data into conventional or canonical formats (dates, stripping special characters) or data protection (masking credit cards, PII, and so on). Few standardization rules applied on the ingested data are as follows:


Kylo provides an extensible Java API to develop custom validation, custom cleansing, and standardization routines as per business needs. The standardization rules applied to the user, business, and address fields as per the configuration is shown in the below diagram:

select

Profiling Window

Kylo’s profiling window provides additional tabs such as valid and invalid to view both valid and invalid data after data ingestion. If validation rules fail, the data will be marked as invalid and will be shown under the Invalid tab with the reason for failure such as Range Validator Rule violation, not considered as timestamp, and so on.

select

The data is ingested from Kafka. During feed creation, the Kafka batch size is set to “10000”, which is the number of messages the producer will attempt to batch before sending them. To know more about batch size, refer to our previous blog on Kylo – Self-Service Data Ingestion, Cleansing, and Validation (No Coding Required!).
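As a point of reference, producer-side batching is usually tuned with batch size and linger settings. The sketch below uses the kafka-python client, which is an assumption for illustration and not how Kylo configures the feed; note that kafka-python's batch_size is measured in bytes rather than in messages.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # assumed broker address
    batch_size=16384,                     # kafka-python batches by size in bytes, not by message count
    linger_ms=50,                         # wait up to 50 ms to fill a batch before sending
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("userdata", {"user": "Bradley Martinez", "amount": 120.5})  # hypothetical topic and record
producer.flush()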

Profiling is applied to each batch of data, and an informative summary is available on the Profile page. The 68K records consumed from Kafka are shown in the below diagram:

select

Search-based Data Discovery

Kylo uses Elasticsearch to provide the index for search features such as free-form data and metadata search. It allows the business analysts to decide which fields need to be searchable and to enable the index option for those fields while creating the feed. The indexed “user” and “business” fields, searchable from Kylo Global Search, are shown in the below diagram:

select

Index Feed

The predefined “Index Feed” queries the index-enabled field data from the persisted Hive table and indexes the feed data into Elasticsearch. The “Index Feed” is automatically triggered as a part of the “Data Ingest” template. The index feed job status is highlighted in the below diagram:

select

If the index feed fails, search cannot be performed on the ingested data. As “user” is a reserved word in Hive, the search functionality for user and business fields failed due to the field name “user” as shown in the below diagram:

select

To resolve this, the “user” field name is modified as “customer_name” during feed creation.

Search Queries

The search query to return the matched documents from Elasticsearch is:

customer_name: “Bradley Martinez”

select

The Lucene search query to search data and metadata is:

business: “JP Morgan Chase & Co”

select
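Because the index feed stores the data in Elasticsearch, an equivalent Lucene-style query can also be issued directly against the index. The sketch below is a hedged example using the Python requests package; the index name kylo-data and the host are assumptions.

import requests

# query_string accepts the same Lucene syntax shown above.
query = {"query": {"query_string": {"query": 'business:"JP Morgan Chase & Co"'}}}
resp = requests.post("http://localhost:9200/kylo-data/_search", json=query)  # assumed index and host
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])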

Feed Lineage

Lineage is automatically maintained at the feed level by the Kylo framework, based on the sources and sinks identified by the template designer when registering the template.

select

Conclusion

In this blog, we discussed automatic data profiling and search-based data discovery in Kylo. We also discussed a few issues faced with the Index Feed and their solutions. Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection. It provides an extensible API to build custom validators and standardizers. Once the setup with the different technologies is in place, Kylo performs data profiling and discovery automatically in the background.

References

Sensor Data Quality Management using PySpark & Seaborn

$
0
0

Overview

Data Quality Management (DQM) is the process of analyzing, defining, monitoring, and continuously improving the quality of data. A few data quality dimensions widely used by data practitioners are Accuracy, Completeness, Consistency, Timeliness, and Validity. Various DQM rules are configured to apply DQM to the existing data. These rules are applied to clean up, repair, and standardize incoming data and to identify and correct invalid data.

In this blog, let us check the data for required values, validate data types, and detect integrity violations. DQM is applied to correct the data by providing default values, formatting numbers and dates, and removing missing values, null values, non-relevant values, duplicates, out-of-bounds values, referential integrity violations, and value integrity violations.

Pre-requisites

Install the following Python packages:

  • PySpark
  • XGBoost
  • Pandas
  • Matplotlib
  • Seaborn
  • NumPy
  • sklearn

Data Description

Sensor data from the pub-nub source is used as the source file.

  • Total Record Count: 6K
  • File Types: JSON and CSV
  • # of Columns: 11
  • # of Records: 600K
  • # of Duplicate Records: 3.5K
  • # of NA Values:
    • Ambient Temperature: 3370
    • Humidity: 345
    • Sensor IDs: 12

Sample Dataset

select

Use Case

Perform data quality management on sensor data using Python API – PySpark.

Data Quality Management Process

select

Synopsis

  • Data Integrity
  • Data Profiling
  • Data Cleansing
  • Data Transformation

Data Integrity

Data integrity is the process of guaranteeing the quality of the data in the database.

  • Analyzed input sensor data with
    • 11 columns
    • 6K records
  • Validated source metadata
  • Populated relationships for an entity

Data Profiling

Data profiling is the process of discovering and analyzing enterprise metadata to discover patterns, entity relationships, data structure, and business rules. It provides statistics or informative summaries of the data to assess data issues and quality.

Few data profiling analyses include:

  • Completeness Analysis – Analyze frequency of attribute population versus blank or null values.
  • Uniqueness Analysis – Analyze and find unique or distinct values and duplicate values for a given attribute across all records.
  • Values Distribution Analysis – Analyze and find the distribution of records across different values of a given attribute.
  • Range Analysis – Analyze and find minimum, maximum, median, and average values of a given attribute.
  • Pattern Analysis – Analyze and find character patterns and pattern frequency.

Generating Profile Reports

To generate profile reports, use either Pandas profiling or PySpark data profiling using the below commands:

Pandas Profiling

import pandas as pd
import pandas_profiling
import numpy as np

#Read the source file that contains sensor data details
df= pd.read_json('E:\sensor_data.json', lines=True)

#Preprocessing on data
df = df.replace(r'\s+', np.nan, regex=True)
df['ambient_temperature']= df['ambient_temperature'].astype(float)
df['humidity'] = df['humidity'].astype(float)

#Generate profile report using pandas_profiling
report = pandas_profiling.ProfileReport(df)

#convert profile report to an html file
report.to_file("E:\sensor_data.html")

PySpark Profiling

import pandas as pd
import spark_df_profiling
import numpy as np

#Initializing PySpark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

#Spark Config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)
sql = SQLContext(sc)

# Loading transaction Data
sensor_data_df = sql.read.format("com.databricks.spark.csv").option("header", "true").load("E:\spireon\Data\ganga\sensor_data.csv")
report = spark_df_profiling.ProfileReport(sensor_data_df)
report.to_file("E:\spireon\Data\ganga\pyspark_sensor_data_profiling_v2.html")

The profile report provides the following details:

  • Essentials – type, unique values, missing values
  • Quantile Statistics – minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive Statistics – mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram

Profile Report Overview

select

The sample profile report for a single attribute (ambient temperature) is as follows:

Ambient Temperature – Statistics

select

Ambient Temperature – Histogram

select

Ambient Temperature – Extreme Values

select

To view the complete profile report, see Reference section.

Data Cleansing

Data cleansing is the process of identifying incomplete, incorrect, inaccurate, duplicate, or irrelevant data and modifying, replacing, or deleting the dirty data.

select
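The dataset description above also notes about 3.5K duplicate records. A minimal pandas step for removing them, assuming df is the DataFrame read earlier, could look like this (this step is not part of the original flow):

# Drop exact duplicate rows before further cleansing.
df = df.drop_duplicates()
print(len(df))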

  • Analyzed the number of null (NaN) values in the dataset using the below command:
    df.isnull().sum()

The number of null values is as follows:

select

  • Deleted NaN values in String type columns using the below command:
df_v1 = df.dropna(subset=['sensor_id', 'sensor_name', 'sensor_uuid'], how='all')
df_v1.isnull().sum()
  • Imputed missing values using one of the below methods:

Method 1 – Impute package

Imputation is defined as the process of replacing the missing data with substituted values using any of the following options:

  • most_frequent: Columns of the dtype object (string) are imputed with the most frequent values in the column as mean or median cannot be found for this data type.
  • Mean: Ratio of the sum of elements to the number of elements in the list.
  • Median: Middle value of the sorted list (for an even number of elements, the average of the two middle values).

Note: If the missing values in the records are negligible, ignore those records.

In our use case, the most_frequent strategy is used for substituting the missing values using the below command:

from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imputer = imputer.fit(df_v1.ix[:, [2, 3, 4, 5, 6]])
df_v1.ix[:, [2, 3, 4, 5, 6]] = imputer.transform(df_v1.ix[:, [2, 3, 4, 5, 6]])

Method 2 – Linear Regression model

To replace the missing data with the substituted values using Linear Regression model, use the below commands:

from sklearn.linear_model import LinearRegression,LogisticRegression

# Split values into sets with known and unknown ambient_temperature values
df_v2 = df_v1[["ambient_temperature","humidity","photosensor","radiation_level"]]
knownTemperature = df_v2.loc[(df_v1.ambient_temperature.notnull())]
unknownTemperature = df_v2.loc[(df_v1.ambient_temperature.isnull())]

# All ambient_temperature values stored in a target array
Y = knownTemperature.values[:, 0]

# All the other values stored in the feature array
X = knownTemperature.values[:,1::]

# Create and fit a linear regression model
linear_regression = LinearRegression()
linear_regression.fit(X, Y)

# Use the fitted regression model to predict the missing values
predictedTemperature = linear_regression.predict(unknownTemperature.values[:, 1::])

# Assign those predicted values to the full data set
df_v1.loc[ (df_v1.ambient_temperature.isnull()), 'ambient_temperature' ] = predictedTemperature

Data Transformation

Data transformation deals with converting data from the source format into the required destination format.

select

  • Converted attributes such as ambient_temperature and humidity from object type to float type using the below command:
#Preprocessing on data transformation
df = df.replace(r'\s+', np.nan, regex=True)
df['ambient_temperature'] = df['ambient_temperature'].astype(float)
df['humidity'] = df['humidity'].astype(float)
  • Converted a non_numeric value of sensor_name into numeric data using the below command:
from sklearn.preprocessing import LabelEncoder

labelencoder_X = LabelEncoder()
labelencoder_X.fit(df_v1.ix[:, 6])
list(labelencoder_X.classes_)
df_v1.ix[:, 6] = labelencoder_X.transform(df_v1.ix[:, 6])
  • Converted a non_numeric sensor name into numeric data using the below command:
labelencoder_y = LabelEncoder()
labelencoder_y.fit(df_v1.ix[:, 4])
list(labelencoder_y.classes_)
df_v1.ix[:, 4] = labelencoder_y.transform(df_v1.ix[:, 4])
  • Converted a non_numeric value of sensor ID into numeric data using the below command:
labelencoder_z = LabelEncoder()
labelencoder_z.fit(df_v1.ix[:, 5])
list(labelencoder_z.classes_)
df_v1.ix[:, 5] = labelencoder_z.transform(df_v1.ix[:, 5])
  • Based on the above transformation, found feature importance using built-in function using the below commands:
# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot as plt

# split data into X and y
X = df_v1.ix[:,[0,1,2,3,4,5,6,7,10]]
Y = df_v1.ix[:,[10]]
plt.clf()

# fit model no training data
model = XGBClassifier()
model.fit(X, Y)

# plot feature importance
plot_importance(model)
plt.gcf().subplots_adjust(bottom=0.15)
plt.tight_layout()
plt.show()
Feature Importance Chart

select

From the above diagram, it is evident that photosensor feature has the highest importance and lat (latitude) feature has the lowest importance.

Correlation Analysis

Correlation analysis was performed to explore relationships in the data and to highlight weak relationships that may point to potentially incorrect ones. The correlation analysis between the sensor data variables is shown in the below diagram:

select

From the above diagram, it is evident that ambient_temperature is highly correlated with dewpoint and humidity, while latitude and longitude are negatively correlated.
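The blog's title mentions Seaborn; a correlation heatmap of this kind is typically produced with a few lines like the sketch below, assuming df_v1 holds the numeric sensor columns after the transformations above.

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations of the numeric columns, rendered as an annotated heatmap.
corr = df_v1.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.tight_layout()
plt.show()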

Reference

Predict Bad Loans with H2O Flow AutoML

$
0
0

Overview

Machine learning algorithms play a key role in accurately predicting the loan data of any bank. The greatest challenge is to employ the best models and algorithms to accurately predict the probability of loan default, so that both investors and borrowers can make sound financial decisions. H2O Flow, a web-based interactive computational environment, is used to combine text, code execution, and rich media into a single document.

H2O’s AutoML, an easy-to-use interface that also serves advanced users, automates the machine learning workflow, including training a large set of models. Stacked Ensembles are used to produce a top-performing, highly predictive ensemble model on the AutoML Leaderboard. In this blog, let us predict bad loan data in order to help borrowers make financial decisions and investors choose the best investment strategy.

Pre-requisites

  • Install Python 2.7 or 3.5+
  • Install H2O Flow with the following packages:
    • pip install requests
    • pip install tabulate
    • pip install scikit-learn
    • pip install colorama
    • pip install future
    • pip install http://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/2/Python/h2o-3.14.0.2-py2.py3-none-any.whl
  • On successfully installing H2O, check Cluster connection using h2o.init().

Data Description

Lending Club loan data from 2007-2011, with 163K rows and 15 columns, is used as the source file. Lending Club is a peer-to-peer lending platform for both investors and borrowers.

Sample Dataset
select

Dataset Variables

  1. loan_amnt
  2. term
  3. int_rate
  4. addr_state
  5. dti
  6. revol_util
  7. delinq_2yrs
  8. emp_length
  9. annual_inc
  10. home_ownership
  11. purpose
  12. total_acc
  13. longest_credit_length
  14. verification_status
  15. Dependent variable

Use Case

  • Analyze Lending Club’s loan data.
  • Predict bad loan data in the dataset by using the distributed random forest model and the stacked ensembles in AutoML based on the borrower loan amount approval or rejection.

Based on the percentage of bad loans, investors can easily decide whether or not to finance a borrower for new loans. For example, a loan is considered rejected if the bad loan value is 1.

Synopsis

  • Import data from source
  • View parsing data
  • View job details and dataset summary
  • Visualize labels
  • Impute data
  • Split Data
  • Run AutoML
  • View Leaderboard
  • Compute Variable Importance
  • View Output

Importing Data from Source

To import the data from the source, perform the following:

  • Open H2O Flow.
  • Click Data –> Import Files to import the source files into H2O Flow as shown in the below diagram:

select

select

After importing the files, a summary displays the results of the import.

Viewing Parsing Data

On successfully importing these files, click Parse these files to parse the files and to view the details of the source data as shown in the below diagram:

select

The parsed files contain the column names and data types of all features. The data types are assigned by default and can be changed if required. For example, in our use case, the data type of the response column (bad loan) is changed from numeric to factor (Enum). After making all changes, click Parse.

select

Viewing Job Details and Dataset Summary

After parsing the files, you can view the job details. Click View to view the summary of the DataFrame.

select

Loan Dataset Summary

select

From the above summary, the input columns show multiple label values. Each label's data can be visualized by clicking the corresponding column name.

Visualizing Labels

In this section, let us visualize data of loan amount and employee length columns.

Loan Amount Data

select

Employee Length Data

select

Imputing Data

Missing label values are imputed in place, with aggregates computed on the “na.rm’d” vector.
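For reference, the same in-place imputation can be done outside the Flow UI through the H2O Python API. The sketch below is only an approximation of the UI steps; the file name and the dti column are assumptions.

import h2o

h2o.init()
loans = h2o.import_file("loan.csv")          # assumed file name

# Impute a single column in place with its median (mirrors Frame/Column/Method in the UI).
col_index = loans.names.index("dti")         # "dti" is an assumed column choice
loans.impute(col_index, method="median", combine_method="interp")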

To impute the data, perform the following:

  • Choose the attribute with missing values.
  • Click Impute as shown in the below diagram:

select

  • Specify the following details:
    • Frame
    • Column
    • Method
    • Combine Method

select

On successfully imputing the column with the median values, the summary of the column will be displayed as shown in the below diagram:

select

Splitting Data

To split the dataset into a training set (70%) and a test set (30%), perform the following:

  • Click Assist Me and Split Frame (or click the Data drop-down and select Split Frame) to split the DataFrame.
    It automatically adjusts the ratio values to sum to one. On entering unsupported values, an error is displayed.
  • Click Create to view the split frames.

select

select

Running AutoML

To run AutoML, perform the following:

  • Select Model –> RunAutoML as shown in the below diagram:

select

  • Provide the following details as shown in the below diagram:
    • Training Frame – Select the dataset to build the model.
    • Response Column – Select the column to be used as a dependent variable. Required only for GLM, GBM, DL, DRF, Naïve Bayes (classification model).
    • Fold Column – (Optional in AutoML) Select the column with the cross-validation fold index assignment / observation.
    • Weight Column – Weights are per row observation weights and do not increase data size. During data training, rows with higher weights matter more due to the larger loss function pre-factor.
    • Validation Frame – (optional) Select the dataset to evaluate the model accuracy.
    • Leaderboard Frame – Specify the Leaderboard frame when configuring AutoML run. If not specified, the Leaderboard frame will be created from the Training Frame. The output models with best results will be displayed on the Leaderboard.
    • Max Models – (AutoML) Specify the maximum number of models to be built in an AutoML run.
    • Max Runtime Secs – Controls execution time of AutoML run (default time is 3600 seconds).
    • Stopping Rounds – Stops training based on a simple moving average when the stopping_metric does not improve for a specified number of training rounds. Specify 0 to disable this feature.
    • Stopping Tolerance – Specify the tolerance value to improve a model before training ceases.

select
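The same run can also be configured programmatically. The sketch below uses the H2O Python API and mirrors the walkthrough above (import, parse, 70/30 split, AutoML); the file path, the bad_loan column name, and the parameter values are assumptions, not the exact Flow configuration.

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Import and parse the loan data (path is an assumption).
loans = h2o.import_file("lending_club_2007_2011.csv")
loans["bad_loan"] = loans["bad_loan"].asfactor()      # response as a factor, as in the parse step

# 70/30 split, mirroring the Split Frame step.
train, test = loans.split_frame(ratios=[0.7], seed=42)

# Bounded AutoML run; stopping and runtime values are illustrative.
aml = H2OAutoML(max_models=10, max_runtime_secs=3600, stopping_rounds=3, seed=42)
aml.train(y="bad_loan", training_frame=train, leaderboard_frame=test)

print(aml.leaderboard.head())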

Viewing Leaderboard

The Leaderboard displays the models with the best results first as shown in the below diagram:

select

Model

select

ROC Curve – Training Metrics

select

Computing Variable Importance

The statistical significance of all variables affecting the model is computed depending on the algorithm and is listed in the order of most to least importance.
The percentage importance of all variables is scaled to 100. The scaled importance value of the variables is shown in the below diagram:

select

Viewing Output

Predicted Model of Loan Dataset

select

ROC Curve

select

Prediction Scores

select

Conclusion

In this blog, AutoML, the distributed random forest model, and the stacked ensembles are used to build and test the best model for predicting loan default. The data is analyzed to obtain a cut-off value. Investors use this cut-off value to decide on the best investment strategy and to determine which applicants get loans.

References

Crime Analysis Using H2O Autoencoders – Part 1

$
0
0

Overview

Nowadays, Deep Learning (DL) and Machine Learning (ML) are used to analyze and accurately predict data. Machine Learning models are used to accurately predict crimes. Crime prediction not only helps in crime prevention but also enhances public safety. Autoencoder, a simple, 3-layer neural network, is used for dimensionality reduction and for extracting key features from the model.

Data Engineers spend much of their time building analytic models with proper validation metrics in order to improve model performance, while Data Analysts spend a lot of time building data pipelines as part of Big Data Analytics. The Machine Learning models are developed within these pipelines with their own functionalities/features. On passing the models through the analytical pipeline, they can be easily deployed for real-time processing.

This blog is part one of a two-part series of Crime Analysis using H2O Autoencoders. In this blog, let us discuss building the analytical pipeline and applying Deep Learning to predict the arrest status of the crimes happening in Los Angeles (LA).

Pre-requisites

Install the following in R:

Dataset Description

Crime dataset of Los Angeles, from 2016-2017, with 224K records and 27 attributes is used as the source file. This dataset is an open data resource for governments, non-profit organizations, and NGOs.

Sample Dataset

select

Use Case

  • Predict the arrest status of the crimes happening in Los Angeles.
  • Achieve analytical pipeline.
  • Analyze the performance of Autoencoders.
  • Build deep learning and machine learning models.
  • Apply required mechanisms to increase the performance of the models.

Synopsis

  • Access data
  • Prepare data
    • Clean data
    • Preprocess data
  • Perform Exploratory Data Analysis (EDA)
  • Build Machine Learning model
    • Initialize H2O cluster
    • Impute data
    • Train model
  • Validate model
  • Execute model
    • Pre-trained supervised model

Accessing Data

The crime dataset is obtained from https://dev.socrata.com/ and imported into the database. The Socrata APIs provide rich query functionality through a query language called “Socrata Query Language” or “SoQL”.

The data structure is as follows:

select

Preparing Data

In this section, let us discuss data preparation for building a model.

Cleansing Data

Data cleansing is performed to find NA values in the dataset. These NA values should be either removed or imputed with some imputation techniques to get desired data.

To get the count of NA values and view the results, use the below commands:

select

Total Number of NA Values for Each Column

select

From the above diagram, it is evident that attributes such as crm_cd_2, crm_cd_3, crm_cd_4, cross_street, premis_cd, and weapon_used_cd are largely empty or redundant. These attributes are removed from the dataset.

Preprocessing Data

Data preprocessing, such as data type conversion, date conversion, derivation of month, year, and week from the date field, derivation of new attributes, and so on, is performed on the dataset. The date attribute is converted from a factor to a POSIXct object. The lubridate package is used to extract fields such as month, year, and week from this object, and the chron package is used along with the time attribute to derive the crime time interval (Morning, Afternoon, Midnight, and so on).

select

Performing Exploratory Data Analysis

EDA is performed on the crime dataset to gain useful insights into the data.

Top 20 Crimes in Los Angeles

select

Crime Timings

select

Month with Highest Crimes

select

Area with Highest Crime Percentage

select

Top 10 Descent Groups Getting Affected

select

Top 10 Frequently Used Weapons for Crime

select

Safest Living Places in Los Angeles

select

Building Machine Learning Model

In this section, let us discuss building the best Machine Learning model for our dataset using Machine Learning algorithms.

Initializing H2O Cluster

Before imputing the data, initiate an H2O cluster on port 12345 using h2o.init(). The cluster can be accessed at http://localhost:12345/flow/index.html#.

select

Imputing Data

In H2O, data imputation is performed using h2o.impute() to fill the NA values using default methods such as mean, median, and mode. The method is chosen based on the data type of each column. For example, factor or categorical columns are imputed using mode method.

select

The dependent variable is grouped based on the status codes of the crimes that occurred. The arrest status codes are grouped into Not Arrested and Arrested.

select

Training Model

The dataset is split into Train, Test, and Validation frames based on certain ratios specified using h2o.splitFrame(). Each frame is assigned to a separate variable using h2o.assign().

select

To train the model, perform the following:

  • Take the data pertaining to the year 2016 as the training set.
  • Take the data pertaining to the year 2017 as the test set.
  • Apply Deep Learning to the model.
  • Perform Unsupervised classification to predict the arrest status of the crimes.
  • Make the autoencoder model to learn the patterns of the input data irrespective of the given class labels.
  • Make the model to learn the status behavior based on the features.

Function Used to Apply Deep Learning to Our Data: h2o.deeplearning
@param x – features for our model.
@param training_frame – dataset on which the model is trained.
@param model_id – string representing our model, used to save and load it.
@param seed – for reproducibility.
@param hidden – sizes of the hidden layers.
@param epochs – number of iterations our dataset must go through.
@param activation – a string representing the activation function to be used.
@params stopping_rounds, stopping_metric, export_weights_and_biases – used for cross-validation purposes.
@param autoencoder – logical indicating whether autoencoders should be applied or not.

select

select

The above diagram shows the summary of our Autoencoders model and its performance for our training set.

A problem is encountered because a Gaussian distribution is applied to our model instead of a Binomial classification.

As the above results are not satisfactory, the dimensionality of our model is reduced to get better results. The features of one of the hidden layers are extracted and the results are plotted to classify the arrest status using the deep features functions in the H2O package.

select

From the above results, the arrest status of the crimes cannot be exactly determined.

select

So, dimensionality reduction with our autoencoder model alone is not sufficient to identify the arrest status in this dataset. The dimensionality representation of one of our hidden layers is used as features for Model Training. Supervised Classification is applied to the extracted features and the results are tested.

select

Validating Model

To validate the performance of our model, the cross-validation parameters used while building the model are used to plot the ROC curves and get the AUC value on our validation frames. A detailed overview of our model is obtained using the summary() function.

select

select

Executing Model

To predict the arrest status of the crimes, perform the following:

    • Apply the deep features to the dataset.
    • Use our model to predict the arrest status.

select

    • Plot the ROC curve with AUC values based on Sensitivity and Specificity.

select

    • Group the results based on the predicted and actual values with the total number of classes and its frequencies.
    • Decide the performance of our model on the arrest status of the crimes.

select

From the above diagram, the predicted number of Not Arrested cases is 28 and the predicted number of Arrested cases is 150. As these numbers are low, this model may cause problems in maintaining the historical records when used in real time.

Pre-trained Supervised Model

The autoencoder model is used as a pre-training input for a supervised model, and its weights are used for model fitting. The same training and validation sets are used for the supervised model. A parameter called pretrained_autoencoder is added to our model along with the autoencoder model name.

select
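In the Python API, the same idea is expressed through the pretrained_autoencoder argument. The sketch below is a hedged illustration; the response column name, the frames, and the layer sizes are assumptions and must match the autoencoder trained earlier.

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Supervised model that starts from the autoencoder's weights.
clf = H2ODeepLearningEstimator(
    pretrained_autoencoder="crime_model_auto",  # model_id of the autoencoder above
    hidden=[10, 2, 10],                         # must match the pretrained architecture
    epochs=50,
    seed=42,
)
clf.train(x=features, y="arrest_status", training_frame=train, validation_frame=valid)
predictions = clf.predict(test)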

This pre-trained model is used to predict the results of our new data and to find the probability of classes for our new data.

select

The results are grouped based on the actual and predicted values and the performance of our model is decided based on the arrest status of the crimes.

select

From the above results, it is evident that there are only minor changes compared to our previous results with the dimensionality representation. Let us plot the ROC curves and AUC values to compare both results.

select

select

Conclusion

In this blog, we discussed creating the analytical pipeline for the Los Angeles crime dataset, applying the Autoencoders to the dataset, performing both Unsupervised and Supervised Classifications, extracting the dimensionality representation of our model, and applying the Supervised model.

In our next blog on Crime Analysis Using H2O Autoencoders – Part 2, let us discuss deploying the model by converting it into POJO/MOJO objects with the help of H2O functions.

References

Streaming Analytics using Kafka SQL

$
0
0

Overview

Kafka SQL, a streaming SQL engine for Apache Kafka by Confluent, is used for real-time data integration, data monitoring, and data anomaly detection. KSQL is used to read, write, and process Citi Bike trip data in real-time, enrich the trip data with other station details, and find the number of trips started and ended in a day for a particular station. It is also used to publish the trip data from source to other destinations for further analysis.

In this blog, let us discuss enriching the Citi Bike trip data and finding the number of trips on a particular day to/from a particular station.

Pre-requisites

Install the following:

  • Scala
  • Apache Kafka
  • KSQL
  • JDK

Data Description

Trip dataset of Citi Bike March 2017 is used as the source data. It contains basic details such as trip duration, ride start time, ride end time, station ID, station name, station latitude, and station longitude.

select

Station dataset of Citi Bike is used for enriching trip details for further analysis after data consumption. It contains basic details such as availableBikes, availableDocks, statusValue, and totalDocks.

select

Use Case

  • Enrich Citi Bike trip data in real time using join and aggregation concepts.
  • Find the number of trips on the day to/from the particular station.
  • View trip details with station details & aggregate trip count of each station.

Synopsis

  • Produce station details
  • Join stream data and table data
  • Group data
  • Produce trip details
  • View output
    • View trip details with station details
    • View aggregate trip count of each station

Producing Station Details

To produce the station details using Scala, perform the following:

  • Create trip-details and station-details topics in Kafka using the below commands:
./bin/kafka-topics --create --zookeeper localhost:2181 --topic station-details --replication-factor 1 --partitions 1
./bin/kafka-topics --create --zookeeper localhost:2181 --topic trip-details --replication-factor 1 --partitions 1
select

select

    • Iterate the station list to produce JSON file using the below commands:

select

  • Produce the station data into the station-details topic via the below Scala command:
java -cp kafka_producer_consumer.jar com.treselle.kafka.core.Producer station-details localhost:9092 station_data
select
  • Iterate and produce the station details list in JSON format.
  • Check the produced and consumed station details using the below command:
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic station-details --from-beginning
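The producer in this walkthrough is a Scala jar; as a reference point, an equivalent hedged sketch with the kafka-python client is shown below. The client choice, the input file name, and the record layout are assumptions.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish each station record as a JSON message to the station-details topic.
with open("station_data.json") as f:          # assumed file of one JSON record per line
    for line in f:
        producer.send("station-details", json.loads(line))

producer.flush()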

Joining Stream Data and Table Data

To join the stream and table data, perform the following:

  • In KSQL console, create a table for the station details to join it with the trip details while producing the stream using the below commands:
CREATE TABLE
station_details_table
(
id BIGINT,
stationName VARCHAR,
availableDocks BIGINT,
totalDocks BIGINT,
latitude DOUBLE,
longitude DOUBLE,
statusValue VARCHAR,
statusKey BIGINT,
availableBikes BIGINT,
stAddress1 VARCHAR,
stAddress2 VARCHAR,
city VARCHAR,
postalCode VARCHAR,
location VARCHAR,
altitude VARCHAR,
testStation BOOLEAN,
lastCommunicationTime VARCHAR,
landMark VARCHAR
)
WITH
(
kafka_topic='station-details',
value_format='JSON'
);
select
  • In KSQL Console, create a stream for the trip details to enrich the data with the start station details and to find the trip count of each station for the day using the below commands:
CREATE STREAM
trip_details_stream
(
tripduration BIGINT,
starttime VARCHAR,
stoptime VARCHAR,
start_station_id BIGINT,
start_station_name VARCHAR,
start_station_latitude DOUBLE,
start_station_longitude DOUBLE,
end_station_id BIGINT,
end_station_name VARCHAR,
end_station_latitude DOUBLE,
end_station_longitude DOUBLE,
bikeid INT,
usertype VARCHAR,
birth_year VARCHAR,
gender VARCHAR
)
WITH
(
kafka_topic='trip-details',
value_format='DELIMITED'
);
select
  • Join the stream with the station details table to get fields such as availableBikes, totalDocks, and availableDocks using the station ID as the key.
  • In the select statement, extract the start time in date format as a timestamp, so that only the day can be derived from the start time when finding the count of trips started per day, using the below commands:
CREATE STREAM
citibike_trip_start_station_details WITH
(
value_format='JSON'
) AS
SELECT
a.tripduration,
a.starttime,
STRINGTOTIMESTAMP(a.starttime, 'yyyy-MM-dd HH:mm:ss') AS startime_timestamp,
a.start_station_id,
a.start_station_name,
a.start_station_latitude,
a.start_station_longitude,
a.bikeid,
a.usertype,
a.birth_year,
a.gender,
b.availableDocks AS start_station_availableDocks,
b.totalDocks AS start_station_totalDocks,
b.availableBikes AS start_station_availableBikes,
b.statusValue AS start_station_service_value
FROM
trip_details_stream a
LEFT JOIN
station_details_table b
ON
a.start_station_id=b.id;
select
  • Add the end station details with the trip details in another topic similar to the start station.
  • Extract end time field as a long timestamp using the below commands:
CREATE STREAM
citibike_trip_end_station_details WITH
(
value_format='JSON'
) AS
SELECT
a.tripduration,
a.stoptime,
STRINGTOTIMESTAMP(a.stoptime, 'yyyy-MM-dd HH:mm:ss') AS stoptime_timestamp,
a.end_station_id,
a.end_station_name,
a.end_station_latitude,
a.end_station_longitude,
a.bikeid,
a.usertype,
a.birth_year,
a.gender,
b.availableDocks AS end_station_availableDocks,
b.totalDocks AS end_station_totalDocks,
b.availableBikes AS end_station_availableBikes,
b.statusValue AS end_station_service_value
FROM
trip_details_stream a
LEFT JOIN
station_details_table b
ON
a.end_station_id=b.id;
select
  • Join the streamed trip details with the station details table as KSQL does not allow joining of two streams or two tables.

Grouping Data

To group data based on the station details and the date, perform the following:

  • Format date as YYYY-MM-DD from the long timestamp to group by date in the start trip details using the below commands:
CREATE STREAM
citibike_trip_start_station_details_with_date AS
SELECT
TIMESTAMPTOSTRING(startime_timestamp, 'yyyy-MM-dd') AS DATE,
starttime,
start_station_id,
start_station_name
FROM
citibike_trip_start_station_details;
select
  • Format date as YYYY-MM-DD from the long timestamp to group by date in the end trip details using the below commands:
CREATE STREAM
citibike_trip_end_station_details_with_date AS
SELECT
TIMESTAMPTOSTRING(stoptime_timestamp, 'yyyy-MM-dd') AS DATE,
stoptime,
end_station_id,
end_station_name
FROM
citibike_trip_end_station_details;
select
  • Create a table by grouping the data based on the date and the stations for finding the started trip counts and the ended trip counts of each station for the day using the below commands:
CREATE TABLE
start_trip_count_by_stations AS
SELECT
DATE,
start_station_id,
start_station_name,
COUNT(*) AS trip_count
FROM
citibike_trip_start_station_details_with_date
GROUP BY
DATE,
start_station_name,
start_station_id;
select
CREATE TABLE
end_trip_count_by_stations AS
SELECT
DATE,
end_station_id,
end_station_name,
COUNT(*) AS trip_count
FROM
citibike_trip_end_station_details_with_date
GROUP BY
DATE,
end_station_name,
end_station_id;
select
  • List the topics to check whether the topics are created for persistent queries or not.

select

Producing Trip Details

The trip details are produced into the topic trip-details using Scala, in the same way as the station details. To check the produced messages, consume them using the below command:

./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic trip-details --from-beginning

select

From the above console output, it is evident that a total of 727664 messages are produced for data enrichment at the stream.

Viewing Output

Viewing Trip Details with Station Details

To view the trip details with the station details, perform the following:

  • Consume the message using the topic CITIBIKE_TRIP_START_STATION_DETAILS to view the extra fields added to trip details from the station details table and to extract the long timestamp field from the start and end times using the below commands:
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic CITIBIKE_TRIP_START_STATION_DETAILS --from-beginning
select
  • Consume the message using the topic CITIBIKE_TRIP_END_STATION_DETAILS using the below commands:
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic CITIBIKE_TRIP_END_STATION_DETAILS --from-beginning
select

From the above console output, it is evident that the fields of the station details are added to the trip while producing the trip details.

Viewing Aggregate Trip Count of Each Station

To view the aggregate trip count of each station based on the date, perform the following:

  • Consume the message via the console to check the trip counts obtained on the stream using the below commands:
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic START_TRIP_COUNT_BY_STATIONS --from-beginning
select

From the above console output, it is evident that the trip counts are updated and added to the topic for each day when producing the message. So, this data can be filtered to the latest trip count in consumer for further analysis.

  • Obtain the end trip count details based on the stations using the below commands:
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic END_TRIP_COUNT_BY_STATIONS --from-beginning
select

Conclusion

In this blog, we discussed adding extra fields from the station details table, extracting date in the YYYY-MM-DD format, and grouping the details based on the station ID & the day for getting the start and end trip count details of the station.

References


Crime Analysis Using H2O Autoencoders – Part 2

$
0
0

Overview

This is the second part of a two-part series on Crime Analysis using H2O Autoencoders. In our previous blog on Crime Analysis Using H2O Autoencoders – Part 1, we discussed building the analytical pipeline and applying Deep Learning to predict the arrest status of the crimes happening in Los Angeles (LA). Our Machine Learning model can be deployed as a jar file using POJO and MOJO objects. H2O-generated POJO and MOJO models are easily embeddable in a Java environment based on the autogenerated h2o-genmodel.jar file.

In this blog, let us discuss deploying the H2O Autoencoders model into a real-time production environment by converting it into POJO objects using H2O functions. As Autoencoders do not support MOJO models, the POJO model is used in this blog.

Dataset Description

The Los Angeles crime dataset from 2016-2017, with 224K records and 27 attributes, is used as the source file. For a more detailed description, refer to our previous blog on Crime Analysis Using H2O Autoencoders – Part 1.

Sample Deployment Model

select

Use Case

Deploy the H2O Autoencoders model into the production environment.

Synopsis

  • Generate JAR File for H2O Autoencoder Model
  • Run model
  • Deploy model into production environment
  • Implement machine learning model (Java Spring)
    • Set up model execution project
    • Set up model deployment project
  • Perform overall production deployment

Generating JAR File for H2O Autoencoder Model

The Autoencoders model created from our previous analysis is as follows:

select

To generate the JAR file, perform the following:

    • Download the Autoencoders model using h2o.download_pojo() function in H2O package.
    • Execute the below syntax to create a Java file along with the JAR file:

select

  • Download the Java file along with the JAR file using a Java Decompiler as shown in the below diagram:

select

Note: If the downloaded dependency JAR file does not contain logic to implement the autoencoder model, an UnsupportedOperationException error will be thrown similar to the one shown in the below diagram:

select

The error can be viewed in the PredictCsv.java file as shown in the below diagram:

select

Similarly, you can view other models such as BinomialModelPrediction, MultinomialModelPrediction, and so on.

To overcome this exception error, perform the following:

select

  • View the new jar file downloaded from the external site containing logic for the Autoencoders as shown in the below diagram:

select

Running Model

You need the Java file generated from the POJO object, an input file, and the h2o-genmodel.jar file with its dependencies to run the model.

To run the model, perform the following:

  • Use test_input.csv as an input file and output.csv as an output file.
  • Run the model with all the dependencies using the below commands:
javac -cp h2o-genmodel.jar -J-Xmx2g crime_model_auto.java
java -cp .;* hex.genmodel.tools.PredictCsv --header --model crime_model_auto --input test_input.csv --output output.csv
Note: As the Autoencoders return reconstruction MSE error values for all columns for each class, the arrest status of the crimes cannot be predicted.
    • Download the already trained Supervised Classification model as the POJO object using the pre-trained autoencoder model to predict the values.
    • Create a separate folder named “pre-trained” for this process.
    • Append all the JAR files into this folder.
    • Copy and paste the dependency JAR files and inputs into this folder.
    • Compile and run the Java file using the below commands:

select

  • Obtain the output of our prediction model. The output looks similar to the one shown below:

select

From the above results, it is evident that our model works fine as a standalone Java file. Let us convert this model into a JAR file and move it into the production environment along with h2o-genmodel.jar and input files.

Deploying Model into Production Environment

To deploy the model into the production environment, perform the following:

  • Convert the model into the JAR file with all the class files using the below command:
jar cf crime_model.jar *.class
select
  • Place the above setup on any server and run the JAR file using the below command:
java -cp .;* hex.genmodel.tools.PredictCsv --header --model crime_pretrained --input test_input.csv --output output.csv

Implementing Machine Learning Model (Java Spring)

To implement the POJO model in the Java environment using the Spring Framework, set up a simple Spring WebService project and pass the input as a JSON payload through a POST call.
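From the client side, the payload ends up as a plain JSON POST. A minimal Python sketch of such a call is shown below; the endpoint URL and field names are hypothetical, not the actual Spring controller routes.

import requests

# One crime record serialized as JSON and sent to a hypothetical prediction endpoint.
record = {"area_id": 12, "crm_cd": 624, "vict_age": 31, "vict_sex": "M"}
resp = requests.post("http://localhost:8080/crime/predict", json=record)
print(resp.status_code, resp.json())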

Setting Up Model Execution Project

To set up a model execution project, perform the following:

  • Parse an input CSV file and convert it into required Java collection objects.
  • Convert the collection objects into JSON string to pass it as a JSON payload in the POST call.
  • Create a function to make the JSON string as a valid request for our API call and to make all necessary connection objects within it.

Project Setup

select

Few class files in the project setup are:

  • CrimeModelExecution.java – Makes all the required function calls and converts the input file string into a valid JSON string. It is the core file for our project.
  • CSVParser.java – Parses a CSV file and converts it into required Java collections.
  • URLExecution.java – Contains functions to make the JSON string as the valid request for our API call. It makes all necessary connection objects within it.
  • StringUtil.java – All Util functions are made in this class.

Setting Up Model Deployment Project

To set up model deployment project, perform the following:

  • Convert the execution project into the JAR file with all its dependencies.
  • Initiate a server to run all APIs containing necessary logic to apply prediction on the dataset.
  • Setup the project in a server environment and pass the required input files as parameters.

The project setup is as follows:

select

Few class files in the project setup are:

  • CrimeController.java – Contains all APIs required to apply Model Prediction for the datasets and to pass the input as JSON payload through POST call and as the File format in POST call.
  • UtilHelper.java – Performs basic string datatype conversions.

The project is implemented based on dependencies present in the h2o-genmodel.jar (PredictCSV.java) file. So, add this JAR to our classpath during implementation.

Performing Overall Production Deployment

The overall production deployment involves analyzing the input, implementing a model using R scripts, downloading the model into required Java Objects, and implementing these objects in the production environment.

The flow of moving the Machine Learning models into the production environment is as follows:

select

To deploy the model, perform the following:

  • Upload all the codes in a specified location.
  • Create separate batch files (in Windows environment) for implementing R Script.
  • Make the project execution JAR.
  • Deploy the model in the production environment as shown in the below diagram:

select

Conclusion

In this blog, we discussed setting up a simple Spring Webservice project in Java environment and deploying the Machine Learning model in the real-time production environment using the command prompt and the POJO model. In our use case, the setup was performed on Windows. But, the same can be followed in any real-time server setup. The h2o-genmodel.jar file contains all the dependencies and default functionalities required to build the model using Java.

To know about building the analytical pipeline and applying Deep Learning to predict the arrest status of the crimes happening in Los Angeles, consider our previous blog on Crime Analysis Using H2O Autoencoders – Part 1.

References

Ingest IoT Sensor Data into S3 with Raspberry Pi3 & StreamSets Data Collector Edge

$
0
0

Overview

Due to the increasing amount of data produced outside of source systems, enterprises face difficulties in reading, collecting, and ingesting data into a desired central database system. An edge pipeline runs on an edge device with limited resources, receives data from another pipeline or reads the data from the device, and controls the device based on the data.

StreamSets Data Collector (SDC) Edge, an ultra-lightweight agent, is used to create end-to-end data flow pipelines in StreamSets Data Collector and to run the pipelines to read and export data in and out of the systems. In this blog, StreamSets Data Collector Edge is used to read data of air pressure BMP180 sensor from IoT Device (Raspberry Pi3) and StreamSets Data Collector is used to load the data into Amazon Simple Storage Service (S3) via MQTT.

Pre-requisites

  • Install StreamSets
  • Raspberry Pi3
  • BMP180 Sensor
  • Amazon S3 Storage

Use Case

  • Read air pressure BMP180 sensor data with IoT Device (Raspberry Pi3) and send to MQTT
  • Use SDC to load the data into Amazon S3 via MQTT

Synopsis

  • Connect BMP180 temperature/pressure sensor with Raspberry Pi3
  • Create edge sending pipeline
  • Create data collector receiving pipeline

Flow Diagram

select

Connecting BMP180 Temperature/Pressure Sensor with Raspberry Pi3

The I2C bus, a communication protocol, is used by the Raspberry Pi3 to communicate with other embedded IoT devices such as temperature sensors, displays, accelerometers, and so on. The I2C bus has two wires called SCL and SDA, where SCL is a clock line that synchronizes all data transfers over the I2C bus and SDA is a data line. The devices are connected to the I2C bus via the SCL and SDA lines.

To enable I2C drivers on Raspberry Pi3, perform the following:

  • Run sudo raspi-config.
  • Choose Interfacing Options from the menu as shown in the below diagram:

select

  • Choose I2C as shown in the below diagram:

select

Note: If I2C is not available in the Interfacing Options, check Advanced Options for I2C availability.

  • Click Yes to enable the I2C driver.
  • Click Yes again to load the driver by default.
  • Add i2c-dev to /etc/modules using the below commands:
pi@raspberrypi:~$ sudo nano /etc/modules
i2c-bcm2708
i2c-dev
  • Install i2c-tools using the below command:
pi@raspberrypi:~$ sudo apt-get install python-smbus i2c-tools
  • Reboot the Raspberry Pi3 from the command line using the below command:
sudo reboot
  • Ensure that the I2C modules are loaded and made active using the below command:
pi@raspberrypi:~$ lsmod | grep i2c
  • Connect the Raspberry Pi3 with the BMP180 temperature/pressure sensor as shown in the below diagram:

select

  • Ensure that the hardware and software are working fine with i2cdetect using the below command:
pi@raspberrypi:~$ sudo i2cdetect -y 1
select
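As an optional sanity check from Python, the sensor can also be probed for its chip-id register. The sketch below assumes the smbus2 package is installed on the Pi; it is not part of the SDC Edge pipeline itself.

from smbus2 import SMBus

BMP180_ADDRESS = 0x77      # same I2C address used in the edge pipeline below
CHIP_ID_REGISTER = 0xD0

with SMBus(1) as bus:      # I2C bus 1 on the Raspberry Pi3
    chip_id = bus.read_byte_data(BMP180_ADDRESS, CHIP_ID_REGISTER)

# BMP180/BMP085 devices report 0x55 from the chip-id register.
print("BMP180 detected" if chip_id == 0x55 else "Unexpected chip id: 0x%02X" % chip_id)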

Building Edge Sending Pipeline

To build an edge sending pipeline for reading the sensor data, perform the following:

  • Create an SDC Edge Sending pipeline on StreamSets Data Collector.
  • Read the data directly from the device (using I2C Address) using “Sensor Reader” component.
  • Set the I2C address as “0x77”.
  • Use an Expression Evaluator to convert temperature from Celsius to Fahrenheit.
  • Publish data to MQTT topic as “bmp_sensor/data”.
  • Download the SDC Edge pipeline’s executable format (Linux) and move it to the device side (Raspberry Pi3), where the pipeline runs.
  • Start SDC Edge from the SDC Edge home directory on the edge device using the following command:
bin/edge --start=<pipeline_id>
For example:
bin/edge --start=sendingpipeline137e204d-1970-48a3-b449-d28e68e5220e
select

Building Data Collector Receiving Pipeline

To build a data collector receiving pipeline for storing the received data in Amazon S3, perform the following:

  • Create a receiving pipeline on the StreamSets Data Collector.
  • Use MQTT subscriber component to consume data from MQTT topic (bmp_sensor/data).
  • Use Amazon S3 destination component to load the data into Amazon S3.
  • Run the receiving pipeline in the StreamSets Data Collector.

select

The real-time air pressure data collected and stored is shown in the below diagram:

select

Conclusion

In this blog, we discussed reading the air pressure BMP180 sensor data from Raspberry Pi3 using StreamSets Data Collector Edge and loading the collected data into Amazon S3 via MQTT using StreamSets Data Collector.

Custom Partitioning and Analysis using Kafka SQL Windowing

$
0
0

Overview

Apache Kafka uses a round-robin fashion to produce messages to multiple partitions. The custom partitioning technique is used to produce a particular type of message to a defined partition and to have the produced messages consumed by a particular consumer. This technique gives us control over the produced messages. Windowing allows event-time-driven analysis and data grouping based on time limits. The three different types of windowing are Tumbling, Session, and Hopping.

In this blog, we will discuss processing Citibike trip data in the following ways:

  • Partitioning trip data based on user type using the custom partitioning technique.
  • Analyzing trip details at stream using Kafka SQL Windowing.

Pre-requisites

Install the following:

  • Scala
  • Java
  • Kafka
  • Confluent
  • KSQL

Data Description

Trip dataset of Citi Bike March 2017 is used as the source data. It contains basic details such as trip duration, start time, stop time, station name, station ID, station latitude, and station longitude.

Sample Dataset

select

Use Case

  • Process Citibike trip data to two different brokers by partitioning the messages according to user types (Subscriber or Customer).
  • Use Kafka SQL Windowing concepts to analyze the following details:
    • Number of trips started at particular time limits using Tumbling Window.
    • Number of trips started using advanced time intervals using Hopping Window.
    • Number of trips started with session intervals using Session Window.

Synopsis

  • Set up Kafka cluster
  • Produce and consume trip details using custom partitioning
  • Create trip data stream
  • Perform streaming analytics using Window Tumbling
  • Perform streaming analytics using Window Session
  • Perform streaming analytics using Window Hopping

Setting Up Kafka Cluster

To set up the cluster on the same server by changing the ports of the brokers in the cluster, perform the following steps:

  • Run ZooKeeper on default port 2181.
    The ZooKeeper data will be stored by default in /tmp/data.
  • Change the default path (/tmp/data) to another path with enough space for non-disrupted producing and consuming.
  • Edit the ZooKeeper configurations in zookeeper.properties file available in the confluent base path etc/kafka/zookeeper.properties as shown in the below diagram:

select

  • Start the ZooKeeper using the following command:
./bin/zookeeper-server-start etc/kafka/zookeeper.properties
You can view the below ZooKeeper startup screen:

select

  • Start 1st broker in the cluster by running default Kafka broker in port 9092 and setting broker ID as 0.
    The default log path is /tmp/kafka-logs.
  • Edit the default log path (/tmp/kafka-logs) for starting the 1st broker in the server.properties file available in the confluent base path.
    vi etc/kafka/server.properties.

select

  • Start the broker using the following command:
./bin/kafka-server-start etc/kafka/server.properties
 

You can view the 1st broker startup with broker ID 0 and port 9092:

select

  • Start 2nd broker in the cluster by copying server.properties as server1.properties under etc/kafka/ for configuring 2nd broker in cluster.
  • Edit server1.properties.
    vi etc/kafka/server1.properties.

select

  • Start the broker using the following command:
./bin/kafka-server-start etc/kafka/server1.properties
You can view the 2nd broker startup with broker ID 1 and port 9093:

select

  • List the brokers available in the cluster using the following command:
./bin/zookeeper-shell localhost:2181 ls /brokers/ids
You can view the brokers available in the cluster as shown in the below diagram:

select

In the above case, two brokers are started on the same node. If the brokers were on different nodes, parallel message processing would be faster, and memory issues from producing a large number of messages could be mitigated by sharing the messages across the nodes' memory.

Producing and Consuming Trip Details Using Custom Partitioning

To produce and consume trip details using custom partitioning, perform the following steps:

  • Create topic trip-data with two partitions using the following command:
./bin/kafka-topics --create --zookeeper localhost:2181 --topic trip-data --replication-factor 1 --partitions 2
select
  • Describe the topic to view the leaders of partitions created.

You can see broker 0 responsible for partition 0 and broker 1 responsible for partition 1 for message transfer as shown in the below diagram:

select

• Use the custom partitioner technique to produce messages.
• Create a CustomPartitioner class by overriding the partitioner interface using the below code:

override def partition(topic : String, key : Any, keyBytes : Array[Byte],value : Any, valueBytes : Array[Byte], cluster : Cluster) : Int = {
var partition = 0
val keyInt = Integer.parseInt(key.asInstanceOf[String])
val tripData = value.asInstanceOf[String]
//Gets the UserType from the message produced
val userType = tripData.split(",")(12)
//Assigns the partitions to the messages based on the user types
if("Subscriber".equalsIgnoreCase(userType)) {
partition = 0;
} else if ("Customer".equalsIgnoreCase(userType)){
partition = 1;
}
println("Partition for message "+value+" is "+partition)
partition
}

You can view the Subscriber user type messages produced into partition 0 and the Customer user type messages routed to partition 1.

  • Define the CustomPartitioner class in producer properties as shown below:
//Splits messages to particular partitions
props.put("partitioner.class", "com.treselle.core.CustomPartitioner");
  • Assign specific partitions of the topic to each consumer as shown below:
val topicPartition = new TopicPartition(TOPIC,partition)
consumer.assign(Collections.singletonList(topicPartition))
  • Pass the partition number as a command-line argument to the consumer when running multiple consumers, each listening to a different partition.
  • Start multiple consumers with different partitions.
  • Start Consumer1 using the below command:
java -cp custom_partitioner.jar com.treselle.core.ConsumerBasedOnPartition trip-data localhost:9092 0
  • Start Consumer2 using the below command:
java -cp custom_partitioner.jar com.treselle.core.ConsumerBasedOnPartition trip-data localhost:9092 1
  • Produce the trip details by defining the custom partitioner using the below command:
java -cp custom_partitioner.jar com.treselle.core.CustomPartionedProducer trip-data localhost:9092
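For reference, a minimal Scala sketch of the producer side is shown below. It assumes string keys and values and a hypothetical input file name (trip_data.csv); the object name is also illustrative. It only shows how partitioner.class ties the CustomPartitioner into the producer:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object CustomPartitionedProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092,localhost:9093")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // Splits messages to particular partitions
    props.put("partitioner.class", "com.treselle.core.CustomPartitioner")

    val producer = new KafkaProducer[String, String](props)
    // Each line of the trip details file becomes one message; the key is the record number
    // (trip_data.csv is a hypothetical file name)
    scala.io.Source.fromFile("trip_data.csv").getLines().zipWithIndex.foreach {
      case (line, index) =>
        producer.send(new ProducerRecord[String, String]("trip-data", index.toString, line))
    }
    producer.close()
  }
}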
You can see Consumer1 consuming only Subscriber messages from partition 0 and Consumer2 consuming only Customer messages from partition 1.

Consumer1

select

Consumer2

select

  • Check the memory used by the brokers after all the messages have been consumed by both consumers.

The memory shared between the brokers and the size of each broker's logs can be viewed in the below diagram:

select

Here, the Customer messages are handled by the broker at localhost:9092 and the Subscriber messages by the broker at localhost:9093. Since there are fewer Customer messages, less memory is occupied in its kafka-logs directory (localhost:9092).

Creating Trip Data Stream

In KSQL, there is no option to consume the messages based on the partitions. The messages are consumed from all the partitions in the given topic for stream or table creation.

To create trip data stream, perform the following steps:

  • Separate the Subscriber and Customer data using conditions for Window processing.
  • Create trip_data_stream with columns in trip data produced using the following command:
CREATE STREAM
trip_data_stream
(
tripduration BIGINT,
starttime VARCHAR,
stoptime VARCHAR,
start_station_id BIGINT,
start_station_name VARCHAR,
start_station_latitude DOUBLE,
start_station_longitude DOUBLE,
end_station_id BIGINT,
end_station_name VARCHAR,
end_station_latitude DOUBLE,
end_station_longitude DOUBLE,
bikeid INT,
usertype VARCHAR,
birth_year VARCHAR,
gender VARCHAR
)
WITH
(
kafka_topic='trip-data',
value_format='DELIMITED'
);
  • Extract a Unix TIMESTAMP from the trip start time for windowing.
  • Set the extracted start time Unix TIMESTAMP as the TIMESTAMP property of the stream, so that windowing is based on the trip start times instead of the message produce time.
  • Create the stream with the extracted Unix TIMESTAMP, keeping only the Subscriber messages, to find the trip details of the subscribers using the below command:
CREATE STREAM
subscribers_trip_data_stream
WITH
(
TIMESTAMP='startime_timestamp',
PARTITIONS=2
) AS
select
STRINGTOTIMESTAMP(starttime, 'yyyy-MM-dd HH:mm:ss') AS startime_timestamp,
tripduration,
starttime,
usertype
FROM TRIP_DATA_STREAM
where usertype='Subscriber';

Performing Streaming Analytics Using Window Tumbling

Window tumbling groups the data into non-overlapping, fixed-size windows of a given interval. It is useful, for example, for detecting anomalies in the stream over a certain time interval. As an example, consider tumbling with a time interval of 5 minutes.

select

To find the number of trips started by subscribers at the interval of 5 minutes, execute the following command:

SELECT
COUNT(*),
starttime
FROM subscribers_trip_data_stream
WINDOW TUMBLING (SIZE 5 MINUTE)
GROUP BY usertype;

select

From the above result, it is evident that 19 trips have been started by the end of the 4th minute, 25 trips by the end of the 9th minute, and 26 trips by the end of the 14th minute. Thus, the started trips are counted within each given interval of time.

Performing Streaming Analytics Using Window Session

In window session, data is grouped into sessions of activity. For example, when a session interval of 1 minute is set and no data arrives within 1 minute, a new session is started for grouping the data. Consider a session of 1 minute working as shown in the following diagram:

select

To group the subscribers' trip starts by session, set the session interval to 20 seconds using the below command:

SELECT
count(*),
starttime
FROM subscribers_trip_data_stream
WINDOW SESSION (20 SECOND)
GROUP BY usertype;

select

From the above result, it is evident that the data is grouped within each session. When no data arrives within a 20-second interval, a new session is started for grouping the data.

For example, consider the time range between 00:01:09 and 00:01:57. Between 00:01:09 and 00:01:33, there is no gap of 20 seconds or more, so the trip count keeps incrementing within the same session. Between 00:01:33 and 00:01:57, there is an inactivity gap of more than 20 seconds, so a new session is started at the 57th second.

Performing Streaming Analytics Using Window Hopping

In window hopping, data is grouped into overlapping windows of a given size that advance by a given interval. For example, consider a window size of 5 minutes with an advance interval of 1 minute, as shown in the below diagram:

select

To group the trip starts into 5-minute windows advancing by 1 minute, execute the following command for hopping window analysis:

SELECT
count(*),
starttime
FROM subscribers_trip_data_stream
WINDOW HOPPING (SIZE 5 MINUTE, ADVANCE BY 1 MINUTE)
GROUP BY usertype;

select

From the above result, it is evident that each record appears in 5 entries, since a 5-minute window advancing by 1 minute places every event in 5 overlapping windows. The number of entries varies based on the window size and the advance interval.

In the above example, consider the record at 00:02:12 to check how hopping works with a 5-minute window advancing by 1 minute. The 00:02:12 record has five entries with trip counts 7, 7, 7, 6, and 1. Within the first 2 minutes, only two 1-minute advances have been made, so the first three entries all cover the interval 00:00:00 to 00:02:12, which has 7 started trips. The 4th entry reflects an advance of 1 minute, covering 00:01:00 to 00:02:12 with 6 trips, and the 5th entry reflects another 1-minute advance, so the window from 00:02:00 to 00:02:12 has only 1 trip.

Conclusion

In this blog, we discussed a custom partitioning technique to split the trip details into two partitions based on user type. We also discussed KSQL windowing concepts such as window tumbling, window session, and window hopping, and how they work on the trip start times, to understand the differences between the types of windowing.


Customer Churn – Logistic Regression with R

$
0
0

Overview

In the customer management lifecycle, customer churn refers to a customer's decision to end the business relationship; it is also referred to as loss of clients or customers. Customer loyalty and customer churn always add up to 100%: if a firm has a loyalty rate of 60%, then its churn rate is 40%. As per the 80/20 customer profitability rule, 20% of customers generate 80% of the revenue. So, it is very important to predict which users are likely to churn and the factors affecting those decisions. In this blog post, we show how a logistic regression model built with R can be used to identify customer churn in a telecom dataset.

Learning/Prediction Steps

churn_lr_model_diagram

Data Description

The telecom dataset has details for 7000+ unique customers, where each customer is represented by a unique row. The structure of the dataset is shown below:

chrun_lr_dataframe

Input Variables: These variables are also called predictors or independent variables.

  • Customer Demographics (Gender and Senior citizenship)
  • Billing Information (Monthly and Annual charges, Payment method)
  • Product Services (Multiple line, Online security, Streaming TV, Streaming Movies, and so on)
  • Customer relationship variables (Tenure and Contract period)

Output Variables: These variables are also called response or dependent variables. Since the output variable (churn value) takes the binary form "0" or "1", this is a classification problem in supervised machine learning.

chrun_lr_head

Data Preprocessing

    • Data cleansing and preparation are done in this step. Transforming continuous variables into meaningful factor variables improves model performance and helps in understanding the insights of the data. For example, in this dataset, the tenure variable is converted into a factor variable with ranges in months, which helps in understanding how customer tenure relates to the churn decision.
    • As part of data cleansing, the missing values are identified using the missing map plot. The telecom dataset has a minimal number of records with missing values, and these are dropped from the analysis.

churn_lr_na chrun_lr_missing_plot

    • Custom logic is implemented to derive categorical variables from the tenure variable and other continuous variables (a minimal sketch is given at the end of this section). Since they do not affect the prediction, the customer id and raw tenure values are dropped from further processing.

chrun_lr_custom_logic

    • New categorical feature is created as mentioned above.

churn_lr_feature_head

    • A few categorical variables have duplicate reference values that refer to the same level. For example, the "MultipleLine" feature has the possible values "Yes", "No", and "No Phone Service". Since "No" and "No Phone Service" have the same meaning, these records are replaced with a single reference value.

churn_lr_categorical_var
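A minimal R sketch of the preprocessing described above is shown below. The data frame and column names (churn_data, tenure, MultipleLines, OnlineSecurity) and the bin boundaries are illustrative assumptions, not the exact code behind the figures:

# drop the few records with missing values identified in the missing map
churn_data <- na.omit(churn_data)

# derive a tenure interval factor from the continuous tenure value (bins are illustrative)
churn_data$tenure_interval <- cut(
  churn_data$tenure,
  breaks = c(-Inf, 6, 12, 24, 36, 48, 60, Inf),
  labels = c("0-6 Month", "6-12 Month", "12-24 Month", "24-36 Month",
             "36-48 Month", "48-60 Month", "> 60 Month")
)

# collapse duplicate reference levels such as "No phone service" to "No"
churn_data$MultipleLines  <- as.factor(gsub("No phone service", "No", churn_data$MultipleLines))
churn_data$OnlineSecurity <- as.factor(gsub("No internet service", "No", churn_data$OnlineSecurity))

# customer id and the raw tenure value do not help prediction, so drop them
churn_data$customerID <- NULL
churn_data$tenure     <- NULL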

Partitioning the Data & Logistic Regression

    • In predictive modeling, the data needs to be partitioned into train and test sets: 70% of the data is used for training and 30% for testing.
    • In this dataset, 4K+ customer records are used for training and 2K+ records for testing.
    • Classification algorithms such as Logistic Regression, Decision Tree, and Random Forest, available in R, Python, or Spark ML, can be used to predict churn.
    • Multiple models can be run on the telecom dataset to compare their performance and error rates and choose the best model. In this blog post, we use a logistic regression model built in R with the glm function (a minimal sketch follows the figure below). Future blogs will focus on other models and combinations of models.

churn_lr_train_test
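A minimal sketch of the 70/30 split and the logistic regression fit is shown below; the churn_data data frame and the Churn column name are assumptions carried over from the preprocessing sketch:

# 70% of the records for training, 30% for testing
set.seed(123)
train_rows <- sample(seq_len(nrow(churn_data)), size = floor(0.7 * nrow(churn_data)))
train_data <- churn_data[train_rows, ]
test_data  <- churn_data[-train_rows, ]

# logistic regression on all remaining predictors using the binomial family
churn_model <- glm(Churn ~ ., data = train_data, family = binomial(link = "logit"))
summary(churn_model)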

Model Summary

From the model summary, the response churn variable is affected by the tenure interval, contract period, paperless billing, senior citizen, and multiple line variables. The importance of each variable is indicated by the significance codes next to the coefficients (*** – high importance, * – medium importance, and . – the next level of importance). Rerunning the model with only these significant variables will impact the model performance and accuracy.

churn_lr_model

Prediction Accuracy

    • Models built using the train dataset are tested against the test dataset. Accuracy and error rate are used to understand how the models behave on the test dataset, and the best model is selected based on these measures.
    • Confusion Matrix / Misclassification Table: a table used to describe the performance of a classification model on test data. It cross-tabulates the actual values with the predicted values based on the counts of correctly classified and wrongly classified customers.

chrun_lr_cf_basics

    • The various measures derived from the confusion matrix are:

churn_lr_cf_derive

    • With logistic regression, the accuracy of this model is evaluated as 80% and the error rate as 20% (a minimal sketch of the calculation follows the figure below). The accuracy can be improved with other classification models such as decision tree and random forest with parameter tuning.

chrun_lr_cf_results
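A minimal sketch of scoring the test set and deriving accuracy and error rate from the confusion matrix is shown below; the 0.5 probability cut-off and the 0/1 coding of the Churn column are assumptions:

# predicted churn probabilities on the test set
pred_prob  <- predict(churn_model, newdata = test_data, type = "response")
# classify using a 0.5 cut-off (assumption)
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# confusion matrix: predicted vs actual, and the derived measures
conf_matrix <- table(Predicted = pred_class, Actual = test_data$Churn)
accuracy    <- sum(diag(conf_matrix)) / sum(conf_matrix)
error_rate  <- 1 - accuracy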


The post Customer Churn – Logistic Regression with R appeared first on treselle.com.

Embrace Relationships with Neo4J, R & Java

$
0
0

Introduction

Graphs are everywhere, used by everyone, for everything. Neo4j is one of the most popular graph databases and can be used to make recommendations, get social, find paths, uncover fraud, manage networks, and so on. A graph database can store any kind of data using Nodes (graph data records), Relationships (connections between nodes), and Properties (named data values).

A graph database can be used for connected data in ways that are not practical with relational or other NoSQL databases, as they lack native relationships and multi-depth traversals. Graph databases embrace relationships, which naturally form paths, and querying or traversing the graph involves following those paths. Because of the fundamentally path-oriented nature of the data model, the majority of path-based graph database operations are highly aligned with the way the data is laid out, making them extremely efficient.

Use Case

This use case is based on a modified version of a StackOverflow dataset. It shows a network of programming languages, the questions that refer to these languages, and the users who asked and answered these questions, and how these nodes are connected with relationships to find deeper insights in the Neo4j graph database, which is otherwise hard to achieve with a common relational database or other NoSQL databases.

What we want to do:

  • Prerequisites
  • Download StackOverflow Dataset
  • Data Manipulation with R
  • Create Nodes & Relationships file with Java
  • Create GraphDB with BatchImporter
  • Visualize Graph with Neo4J

Solution

Prerequisites

  • Download and Install Neo4j: We will be using the Neo4j 2.x version, and installing it on Windows is very easy. Follow the instructions at the below link to download and install.

Note: Neo4j 2.x requires JDK 1.7 and above.

http://www.neo4j.org/download/windows

  • Download and Install RStudio: We will be using R to perform some data manipulation on the StackOverflow dataset, which is available in RData format; this includes filtering, altering, and dropping columns, among others. This is done to show the power of R with respect to data manipulation, and the same can be done in other programming languages as well. Download the open source edition of RStudio from the below link.

http://www.rstudio.com/products/rstudio/#Desk

Download StackOverflow Dataset

  • Download Dataset: This use case is based on a modified version of a StackOverflow dataset, which is rather old and available in both CSV and RData formats. Follow the below links to download the dataset: the first link contains details about the various fields, and the second link downloads the RData file.

http://www.ics.uci.edu/~duboisc/StackOverflow

http://www.ics.uci.edu/~duboisc/StackOverflow/answers.Rdata

  • Understanding Dataset:

We will be mostly interested in the following fields which will be used to create nodes and relationships in Neo4j.

qid: Unique question id
i: User id of questioner
qs: Score of the question
tags: a comma-separated list of the tags associated with the question that refers to programming languages
qvc: Number of views of this question
aid: Unique answer id
j: User id of answerer
as: Score of the answer

 

Data Manipulation with R

We will reshape the dataset to fit our needs and appreciate the power of data manipulation with R. The actual RData contains around 250K rows, but this use case performs the following manipulation to keep it interesting and small.

  • Open RStudio and Set Working Directory: Open RStudio and set the working directory to where the RData file was downloaded.
  • Load and Perform Data Manipulation:




Note: Ignore the warning message
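A minimal sketch of the kind of manipulation described, assuming the RData file loads a data frame named answers and using purely illustrative filters and paths, might look like the following:

# working directory set to where answers.Rdata was downloaded (path is illustrative)
setwd("C:/so_neo4j")
load("answers.Rdata")

# keep only the fields needed to build nodes and relationships
so_data <- answers[, c("qid", "i", "qs", "tags", "qvc", "aid", "j", "as")]

# keep a small, interesting subset of questions tagged with a few popular languages (illustrative filter)
so_data <- subset(so_data, grepl("java|php|javascript|c#|python", tags, ignore.case = TRUE))

# write the reshaped data for the Java program that builds the node and relationship files
write.csv(so_data, "finaldata.csv", row.names = FALSE)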

Create Nodes and Relationship file with Java

We will write a Java program that takes the finaldata.csv generated from the above R program and creates multiple node files and a single relationship file that contains the relations between the nodes. Our node and relationship structure is as follows:

Nodes: question_nodes, answer_nodes, user_nodes, lang_nodes
Relationships: The following are the relationships

 

  • Details about the Java Program: This Java program is self-explanatory and simply creates node and relationship files in CSV format, as needed by the Neo4j Batch Importer program. A few things to keep in mind about the Java program:
    • The format of Nodes file is as follows:

       

       

    • The format of Relationship file is as follows:

       

       

    • lang_nodes is created manually as it is static. All other node files and the relationship file are generated programmatically.

       

       

    • finaldata.csv is renamed to sodata.csv (optional)
    • The dataset doesn’t come with the names of questioners and answerers, so we downloaded some fictional names and associated them with the user ids. This will make more sense when we view them in the Neo4j graphical interface. A file of around 1500 fictional names was created from http://homepage.net/name_generator/ and stored as “random_names.txt”.

       

       

  • Java Program to Create Nodes & Relationships:

Note: The below program depends only on the OpenCSV library, which can be downloaded from http://sourceforge.net/projects/opencsv/

    • Output of the Program: 

Run the above program from the command line or within Eclipse to create question_nodes.csv, answer_nodes.csv, user_nodes.csv, and rels.csv. Click here to download the nodes and relationships zip file to quickly run it through Batch Importer and create the graph DB.

Create GraphDB with Batch Importer

  • Download and Set up Batch Importer: The Batch Importer is a separate library that creates the graph DB data files needed by Neo4j. Its input is configured in the batch.properties file, which indicates which files to use as nodes and relationships. More details about the Batch Importer can be found in the readme at https://github.com/jexp/batch-import/tree/20

Download Link: https://dl.dropboxusercontent.com/u/14493611/batch_importer_20.zip

Note: Unzip to the location where the nodes and relationship files are created by the Java program.

      • Create batch.properties: Create the batch.properties file as shown below. The details of each property are better explained on the Batch Importer site. The highlighted properties are the most important ones, as they define the nodes and relationship input files.
      • Execute Batch Importer: Execute the batch importer program with import.bat within the Batch Importer directory and pass batch.properties and the name of the graph DB file to create.

Visualize Graph with Neo4j

  • Copy graph.db file: Create a new directory “data” under the root of the Neo4j installation directory and copy graph.db to the data directory. This is optional but recommended, so that graph.db is kept in the same location as Neo4j.
  • Start Neo4j: Execute the “neo4j-community” file under the bin directory of Neo4j to start Neo4j. You will be prompted to choose the location of the graph.db file.
  • Visualize Graphs:
    • Launch Neo4j Web Console: http://localhost:7474/browser/
  • Navigate to Graphs: Click on the bubbles on the left top and choose “*”
  • Customize Graph Attributes: Double click on “Java” node and choose “name” as the caption.
  • Explore Graphs: The below exploration shows the following:

Tracing the orange line shows that the user Trevor, who answered a Java question (aid_853052), also asked a PHP question (qid_865476). Tracing the red line shows that the user Audrey answered two Java questions (aid_853030 and aid_892379). It’s a lot of fun to work with a graph database as the traversals are limitless. Note that the user names are fictional and not real users.

 

Conclusion

  • Neo4j is one of the best graph databases around and comes with the powerful Cypher Query Language, which enables us to traverse the nodes via their relationships and node properties as well. We will be covering Cypher in our next blog post based on this graph data.
  • R is very handy in performing many data manipulation techniques to quickly cleanse, transform, and alter the data to our needs.
  • Neo4j also comes with Rest API to add nodes and relationships dynamically on the existing graph DB.


The post Embrace Relationships with Neo4J, R & Java appeared first on treselle.com.
