
Apache Drill vs Amazon Athena – A Comparison on Data Partitioning


Overview

Big data exploration in almost all fields has led to the development of multiple big data technologies such as Hadoop (Hive, HDFS, Pig, HBase), NoSQL databases (MongoDB), and so on for accessing, exploring, and reporting on huge volumes of data. Amazon Athena, a serverless, interactive query service, is used to easily analyze big data in Amazon S3 using standard SQL. Apache Drill, a schema-free, low-latency SQL query engine, enables self-service data exploration on big data.

In this blog, let us compare data partitioning in Apache Drill and AWS Athena and the distinct features of both.

Dataset Description

A sample dataset containing age-specific fertility rate census data, broken down by country and gender, is used in this use case. For the sample dataset, see the References section.

Partitioning Data

In this section, let us discuss data partitioning based on male and female fertility rate in a predefined age group in Apache Drill and Athena.

Partitioning Data in Apache Drill

To partition data in Drill, perform the following:

  • Change data storage format to Parquet using the following command:
ALTER SESSION SET `store.format`='parquet';
  • Create table and partition data using the following command:
CREATE TABLE dfs.`csvOut`.AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION (
  country_code, country_name, `year`,
  fertility_rate_15_19, fertility_rate_20_24, fertility_rate_25_29,
  fertility_rate_30_34, fertility_rate_35_39, fertility_rate_40_44,
  fertility_rate_45_49, total_fertility_rate, gross_reproduction_rate,
  sex_ratio_at_birth, gender)
PARTITION BY (gender)
AS
SELECT columns[0]  AS country_code,
       columns[1]  AS country_name,
       columns[2]  AS `year`,
       columns[3]  AS fertility_rate_15_19,
       columns[4]  AS fertility_rate_20_24,
       columns[5]  AS fertility_rate_25_29,
       columns[6]  AS fertility_rate_30_34,
       columns[7]  AS fertility_rate_35_39,
       columns[8]  AS fertility_rate_40_44,
       columns[9]  AS fertility_rate_45_49,
       columns[10] AS total_fertility_rate,
       columns[11] AS gross_reproduction_rate,
       columns[12] AS sex_ratio_at_birth,
       columns[13] AS gender
FROM dfs.`/user/tsldp/drillathena/age_specific_fertility_rates_gender.csv`;
The table created is as shown below:


The time taken to create a table is as shown below:


You can check the data loaded into the database using the following command:

select * from dfs.`csvOut`.`AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION` ;

The time taken to select the required data in a table is as shown below:


  • Get total count of male and female fertility data using the following command:
select count(*),gender from dfs.`csvOut`.`AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION` group by gender;
The count of males and females in a country is shown below:


The file size after partitioning data using Apache Drill is as shown below:

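The same partitioned table can also be queried programmatically over Drill's REST API. The sketch below is an illustration rather than part of the original walkthrough; the host, port, and endpoint defaults are assumptions to adjust for your Drill installation.

import requests

# Drill's REST query endpoint (the default web UI port 8047 is an assumption)
DRILL_URL = "http://localhost:8047/query.json"

sql = """
SELECT gender, COUNT(*) AS cnt
FROM dfs.`csvOut`.`AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION`
GROUP BY gender
"""

# Drill accepts a JSON payload containing the query type and the SQL text
response = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
response.raise_for_status()
result = response.json()

# The response carries the column names and one dictionary per row
print(result.get("columns"))
for row in result.get("rows", []):
    print(row)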

Partitioning Data in Athena

Athena uses Hive data partitioning and provides improved query performance by reducing the amount of data scanned.

In Athena, data partitioning can be done in two separate ways as follows:

  • With already partitioned data stored on Amazon S3 and accessed on Athena.
  • With unpartitioned data.

In both methods, specify the partition column in the CREATE statement.

To partition data in Athena, perform the following:

  • Create table using the below query:
create external table sampledb.age_fertility_rates_gender_parq_part(
country_code string,
country_name string,
year string,
fertility_rate_15_19 decimal(10,5),
fertility_rate_20_24 decimal(10,5),
fertility_rate_25_29 decimal(10,5),
fertility_rate_30_34 decimal(10,5),
fertility_rate_35_39 decimal(10,5),
fertility_rate_40_44 decimal(10,5),
fertility_rate_45_49 decimal(10,5),
total_fertility_rate decimal(10,5),
gross_reproduction_rate decimal(10,5),
sex_ratio_at_birth decimal(10,5))
PARTITIONED BY (gender string)
stored as parquet
LOCATION 's3://cps3bucket/data_gender_parquet/';
  •  Add partitions to the catalog by using the below command:
MSCK REPAIR TABLE age_fertility.age_fertility_rates_gender_parq_part;
  • Check partitioned data using the below query:
select * from age_fertility.age_fertility_rates_gender_parq_part;
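
For completeness, a hedged sketch of running a partition-aware query against the same Athena table from Python with boto3 is shown below. The results bucket, region, and the 'male' partition value are placeholders, not values from this walkthrough.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT gender, COUNT(*) AS cnt
FROM age_fertility.age_fertility_rates_gender_parq_part
WHERE gender = 'male'   -- filtering on the partition column limits the data scanned
GROUP BY gender
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "age_fertility"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the results
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])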

Data Partition Comparison between Apache Drill and Amazon Athena

The time taken to create the partitioned table and to select from it in each tool is as follows:


Distinct Features of Drill and Athena


Conclusion

In Apache Drill, data partitioning concepts can be applied directly. In Athena, we need to convert the files into Parquet format using EMR before performing data partitioning. Separate storage is not required in Athena, as you can query the data directly from Amazon S3.

References


Amazon Athena & Tableau – Serverless Interactive Query Service and Business Intelligence (BI)


Overview

Amazon Athena, a serverless, pay-per-query service, is used to easily analyze data in Amazon Simple Storage Service (S3) using standard SQL. It delivers high query performance even for huge datasets and complex queries.

Athena can process both structured and semi-structured data in different file formats such as CSV, JSON, Parquet, and ORC, and it can be used to generate reports in connection with BI tools. It uses Hive DDL for creating databases and tables, and because its catalog is a Hive-compatible metastore, the same table definitions can also be used from Hadoop, Spark, and Presto.

Pre-requisites

  • Sign into AWS console.
  • Create your own S3 bucket in S3 to upload data as shown below:


  • Create database and tables in Athena to query the data.


Dataset Description

A sample dataset containing bank transaction data of customers is used in this use case. For the sample dataset, see the References section.

Columns

  • Step – Maps a unit of time in the real world. In this use case, 1 step represents 1 hour of time.
  • Type – Diverse types of payments (CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER).
  • Amount – Amount of transaction in local currency.
  • nameOrig – Customer who started the transaction.
  • oldbalanceOrg – Initial balance before the transaction.
  • newbalanceOrig – New balance after the transaction.
  • NameDest – Customer who is the recipient of the transaction.
  • OldbalanceDest – Initial balance recipient before the transaction.
  • NewbalanceDest – New balance recipient after the transaction.
  • IsFraud – Transactions made by the fraudulent agents inside the simulation.
  • IsFlaggedFraud – Flags illegal attempts; in this dataset, an illegal attempt is an attempt to transfer more than 200,000 in a single transaction.

Note: There is no information for customers name starting with M (Merchants).

Use Cases

Data security plays a significant role in almost all sectors, especially in financial institutions such as banks, credit unions, and so on. As mobile transactions in the financial industry have increased in recent years, securing data on the mobile platform has become highly challenging. In this use case, let us discuss finding and managing malicious behavior during mobile transactions.

In this blog, let us discuss the below use cases:

  • Connect Athena with JDBC SQL Workbench Driver
  • Connect Athena with BI Tools (Tableau 10.3)

Connecting Athena with JDBC SQL Workbench Driver

You can query Athena data outside the AWS console by connecting SQL Workbench to Athena through the Athena JDBC driver.

Pre-requisites:

To connect Athena with JDBC driver using SQL workbench, perform the following:

  • In SQL workbench, choose File –> Manage drivers.
  • Perform configuration as shown in the below diagram:


  • Add a new driver connection by configuring user name and password as your AWS access key and secret key, respectively.
  • Configure your s3_staging_dir in extended properties to save your executed queries in your S3 bucket.


  • Create a table with SQL workbench.


The table structure is as follows:


The data visualization in table format is as follows:


The data is queried as follows:

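As a rough Python equivalent of the SQL Workbench JDBC setup above, the PyAthena client can be pointed at the same staging directory and credentials. The keys, bucket, region, and table name below are placeholders.

from pyathena import connect

conn = connect(
    aws_access_key_id="YOUR_ACCESS_KEY",                 # plays the role of the JDBC user name
    aws_secret_access_key="YOUR_SECRET_KEY",             # plays the role of the JDBC password
    s3_staging_dir="s3://your-bucket/athena-results/",   # same purpose as s3_staging_dir above
    region_name="us-east-1",
)

cursor = conn.cursor()
cursor.execute("SELECT type, COUNT(*) FROM sampledb.transactions GROUP BY type")
for row in cursor.fetchall():
    print(row)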

Connecting Athena with BI Tools (Tableau 10.3)

To connect Athena with Tableau, perform the following:


On successfully creating connections, Athena databases will be listed out as shown in the below diagram:


Data Visualization

Few data visualizations are as follows:

Total Amount Based on Type and IsFraud Flag


New Balance and Old Balance Based on Type


Total Amount Based on Type and Fraudulent Activities


Minimum New Balance and Minimum Old Balance Based on Type


Count of Steps for Fraudulent Types of Transactions


Percentage of Steps for Fraudulent Type of Activities


Conclusion

Athena queries data directly from Amazon S3 very quickly, without any ETL. As a pay-per-query service, it charges $5 per TB of data scanned. With performance-improving techniques such as partitioning, compression, and so on, it supports querying huge datasets.

References

Self Service Analytics using Dremio


Overview

Dremio, a self-service data platform, helps data analysts and data scientists to discover, organize, accelerate, and share any data at any time irrespective of volume, velocity, location, or structure. Dremio allows business users to access data from a variety of sources and reduces their reliance on developers.

In this blog, let us discuss data transformation and data analysis using Dremio and data visualization using Tableau.

Pre-requisites

Download and install Dremio from the following link:
https://www.dremio.com/download/

Data Description

Online retail data with different product types, product prices, and quantities sold from December, 2010 to December, 2011 is used as a data source.

Sample Data Source

sample_data_source1

Synopsis

  • Connect different data sources with Dremio
  • Perform data transformation
  • Create virtual datasets in Dremio
  • Connect virtual datasets with BI tools
  • Visualize results in Tableau

Connecting Different Data Sources with Dremio

Different types of data sources available for performing data transformation activities are shown in the below screenshot:

connecting_different_data_sources_with_dremio

To connect Amazon S3 data sources with Dremio, perform the following:

  • In Data Source Types page, select Amazon S3 data source.
  • Connect to Amazon S3 location as shown in the below screenshot:

connecting_different_data_sources_with_dremio11

  • Connect to MySQL connection and provide required credentials as shown in the below screenshot:

connecting_different_data_sources_with_dremio22

  • Connect to Network Attached Storage (NAS) as shown in the below screenshot:

connecting_different_data_sources_with_dremio33

Performing Data Transformation

To transform data, perform the following:

  • Use UNION function to merge data from 3 different data sources such as S3, MySQL, & NAS and load data as virtual dataset as shown in the below screenshot:

performing_data_transformation

As the price values are per unit, the total price needs to be calculated from the quantity.

  • Add “Total_Price” as a new field.
  • Calculate total price based on number of quantity as shown in the below diagram:

performing_data_transformation1

  • Perform aggregation with stock quantity and stock price based on the products in the source data as shown in the below diagram:

performing_data_transformation2

  • Round off the total price values to 2 decimal digits as shown in the below diagram:

performing_data_transformation3
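
For readers who prefer to see the transformation logic as code, the following pandas sketch approximates the same steps (union, Total_Price derivation, aggregation, and rounding). It is an illustration only; Dremio performs these steps in its own engine, and the file and column names here are assumptions.

import pandas as pd

# Placeholder extracts of the three sources (S3, MySQL, NAS)
s3_df = pd.read_csv("retail_s3.csv")
mysql_df = pd.read_csv("retail_mysql.csv")
nas_df = pd.read_csv("retail_nas.csv")

# UNION of the three sources
retail = pd.concat([s3_df, mysql_df, nas_df], ignore_index=True)

# Total_Price = unit price * quantity
retail["Total_Price"] = retail["UnitPrice"] * retail["Quantity"]

# Aggregate quantity and total price per product, then round to 2 decimal digits
summary = (
    retail.groupby("Description", as_index=False)
          .agg(Total_Quantity=("Quantity", "sum"), Total_Price=("Total_Price", "sum"))
)
summary["Total_Price"] = summary["Total_Price"].round(2)
print(summary.head())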

Creating Virtual Datasets in Dremio

On successfully transforming the data, create virtual datasets (views) in Dremio spaces to store the data based on source.

The virtual dataset for purchases done by each customer is as shown below:

creating_virtual_datasets_in_dremio

The virtual dataset for most quantity sold based on the product is as shown in the below diagram:

creating_virtual_datasets_in_dremio1

Connecting Virtual Datasets with BI Tools

To connect the virtual datasets with BI tools, export virtual dataset in .tds format to be used with BI tools such as Tableau, Qlik Sense, and Power BI as shown in the below diagram:

connecting_virtual_datasets_with_bi_tools

 

connecting_virtual_datasets_with_bi_tools1

Visualizing Results in Tableau

On clicking the .tds file, you will be redirected to Tableau for visualizing the data.

Most Purchases by Customers

most_purchases_by_customers

Maximum Number of Products Sold

maximum_number_of_products_sold

References

Dremio: https://www.dremio.com/

Data Quality Checks with StreamSets using Drift Rules


Overview

In the world of big data, data drift has emerged as a critical technical challenge for data scientists and engineers in unleashing the power of data. It delays businesses from gaining real-time actionable business insights and making more informed business decisions.

StreamSets is not only used for big data ingestion but also for analyzing real-time streaming data. It is used to identify null or bad data in source data and filter out the bad data from the source data in order to get precise results. It also helps the businesses in making quick and accurate decisions.

In this blog, let us discuss checking the quality of data using data rules and data drift rules in StreamSets.

Pre-requisites

  • Install Java 1.8
  • Install streamsets-datacollector-2.6.0.1

Use Case

Create a dataflow pipeline to check quality of source data and load the data into HDFS using StreamSets.

Data Description

Network data of outdoor field sensors is used as the source file. Additional fields, dummy data, empty data, and duplicate data were added to the source file.

The dataset has total record count of 600K.

Sample data

{"ambient_temperature":"16.70","datetime":"Wed Aug 30 18:42:45 IST 2017","humidity":"76.4517","lat":36.17,"lng":-119.7462,"photo_sensor":"1003.3","radiation_level":"201","sensor_id":"c6698873b4f14b995c9e66ad0d8f29e3","sensor_name":"California","sensor_uuid":"probe-2a2515fc","timestamp":1504098765}

Synopsis

  • Read data from local file system
  • Configure data drift rules and alerts
  • Convert data types
  • Configure data rules and alerts
  • Derive fields
  • Load data into HDFS
  • Get Alerts During Data Quality Checks
  • Visualize data in motion

Reading Data from Local File System

To read data from the local file system, perform the following:

  • Create a new pipeline.
  • Configure “Directory” origin to read files from a directory.
  • Set Batch Size (recs) as “1” to read records one by one to easily analyze data and get accurate results.
  • Set “Data Format” as JSON.
  • Select “JSON content” as Multiple JSON objects.

reading_data_from_local_file_system

Configuring Data Drift Rules and Alerts

To configure data drift rules and alerts, perform the following:

  • Gather details about data drift as and when data passes between two stages.
  • Provide meters and alerts.
  • Create data drift rules to indicate data structure changes.
  • Click “Add” to add the conditions in the links between the stages.

Few conditions applied are:

    • Alerts when field names vary between two subsequent JSON records.
      Function: drift:names(<field path>, <ignore when missing>)
      For example: ${drift:names('/', false)}
    • Alerts when the number of fields varies between two subsequent JSON records.
      Function: drift:size(<field path>, <ignore when missing>)
      For example: ${drift:size('/', false)}
    • Alerts when the data type of the specified field changes or the specified field is missing (for example, Double to String or String to Integer).
      Function: drift:type(<field path>, <ignore when missing>)
      For example: ${drift:type('/photo_sensor', false)}
    • Alerts when the order of fields varies between two subsequent JSON records.
      Function: drift:order(<field path>, <ignore when missing>)
      For example: ${drift:order('/', false)}
    • Alerts when a string field is empty.
      For example: ${record:value('/photo_sensor')==""}
  • Click "Activate" to activate all the rules.

configuring_data_drift_rules_and_alerts

 

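To make the intent of these drift rules concrete, the following plain-Python sketch (an illustration, not StreamSets expression language) shows the kind of comparison each rule performs between two consecutive JSON records.

import json

previous = json.loads('{"humidity": "76.4517", "photo_sensor": "1003.3"}')
current  = json.loads('{"humidity": 76.4517, "radiation_level": "201"}')

def drift_names(prev, curr):
    """Alert when the set of field names changes (similar to drift:names)."""
    return set(prev) != set(curr)

def drift_size(prev, curr):
    """Alert when the number of fields changes (similar to drift:size)."""
    return len(prev) != len(curr)

def drift_order(prev, curr):
    """Alert when the field order changes (similar to drift:order)."""
    return list(prev) != list(curr)

def drift_type(prev, curr, field):
    """Alert when a field's type changes or the field is missing (similar to drift:type)."""
    return field not in curr or type(prev.get(field)) is not type(curr.get(field))

print(drift_names(previous, current))             # True: field names differ
print(drift_size(previous, current))              # False: both records have two fields
print(drift_order(previous, current))             # True: field order differs
print(drift_type(previous, current, "humidity"))  # True: str changed to float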

Converting Data Types

To analyze data and apply data rules, convert data with String data type into Decimal or Integer type.
For example: Convert the String data type of "humidity" data ("humidity":"76.4517") in the source data into Double type ("humidity":76.4517).

converting_data_types

Configuring Data Rules and Alerts

To configure data rules and alerts, perform the following:

  • Click “Add” to add the conditions in data rules and data drift rules in the links between stages.
  • Apply data rules for attributes.
    For example: ${record:value('/humidity') < 66.2353 or record:value('/humidity') > 92.4165}

configuring_data_rules_and_alerts

 

configuring_data_rules_and_alerts2

 


Deriving Fields

To derive a new field using the "Expression Evaluator" processor, the logic to implement in the Field Expression is:

if ambient_temperature < 20 and humidity > 90:
    return 'Anomaly'
elif 20 < ambient_temperature < 30 and 80 < humidity < 90:
    return 'Suspicious'
else:
    return 'Normal'

For example, if the derived field is "/prediction", the expression is:

${record:value('/ambient_temperature') < 20 and record:value('/humidity') > 90? "Anomaly": (record:value('/ambient_temperature') > 20 and record:value('/ambient_temperature') < 30 and record:value('/humidity') > 80 and record:value('/humidity') < 90? "Suspicious": "Normal")}

deriving_fields
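
The same classification logic, written as a small Python function purely for clarity (it is not part of the pipeline), looks like this:

def classify(record):
    # Values arrive as strings in the source JSON, so cast before comparing
    temperature = float(record["ambient_temperature"])
    humidity = float(record["humidity"])
    if temperature < 20 and humidity > 90:
        return "Anomaly"
    if 20 < temperature < 30 and 80 < humidity < 90:
        return "Suspicious"
    return "Normal"

print(classify({"ambient_temperature": "16.70", "humidity": "91.2"}))  # Anomaly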

Use the "Stream Selector" processor to split records using the following conditions:

${record:value('/prediction')=="Suspicious"} and ${record:value('/prediction')=="Anomaly"}

deriving_fields1

Loading Data into HDFS

To load data into HDFS, perform the following:

  • Configure “Hadoop FS” destination processor.
  • Select data format as “JSON”.

Note: Hadoop-conf directory (/var/lib/sdc-resources/hadoop-conf) contains core-site.xml and hdfs-site.xml files. sdc-resources directory will be created while installing StreamSets.

loading_data_into_hdfs

Getting Alerts During Data Quality Checks

Alerts while Data in Motion

alerts_while_data_in_motion

Alert Summary on Detecting Data Anomalies

alert_summary on_detecting_data_anamolies

Visualizing Data in Motion

Record Summary Statistics

record_summary_statistics

Record Count In/Out Statistics


References

Handle Class Imbalance Data with R


Overview

Imbalanced data refers to classification problems where one class outnumbers the other class by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-level classification. For example, extreme imbalance can be seen in banking or financial data, where the majority of credit card transactions are legitimate and very few are fraudulent.

With an imbalanced dataset, an algorithm cannot obtain the information required to make accurate predictions about the minority class. So, it is recommended to balance the dataset before classification. In this blog, let us discuss tackling imbalanced classification problems using R.

Data Description

A credit card transaction dataset, having total transactions of 284K with 492 fraudulent transactions and 31 columns, is used as a source file. For sample dataset, refer to References section.

Columns

  • Time – Time (in seconds) elapsed between each transaction and the first transaction in the dataset.
  • V1-V28 – Principal component variables obtained with PCA.
  • Amount – Transaction amount.
  • Class – Dependent (or) response variable with value as 1 in case of fraud and 0 in case of good.


Synopsis

  • Performing exploratory data analysis
    • Checking imbalance data
    • Checking number of transactions by hour
    • Checking mean using PCA variables
  • Partitioning data
  • Building model on training set
  • Applying sampling methods to balance dataset

Performing Exploratory Data Analysis

Exploratory data analysis is carried out using R to summarize and visualize significant characteristics of the dataset.

Checking Imbalance Data

To find the imbalance in the dependent variable, perform the following:

  • Group the data by Class value using the group_by function from the dplyr package.


  • Use ggplot to show the percentage of class category.


Checking Number of Transactions by Hour

To check the number of transactions by day and hour, normalize the time by day and categorize them into four quarters according to the time of the day.


The above graph shows the transactions of 2 days. It states that most of the fraudulent transactions occurred between 13:00 and 18:00 hours.

Checking Mean using PCA Variables

To find data anomalies, take mean of variables from V1 to V28 and check the variation.

The blue points with much variations are shown in the below plot:


Partitioning Data

In predictive modeling, data needs to be partitioned for training set (80% of data) and testing set (20% of data). After partitioning the data, feature scaling is applied to standardize the range of independent variables.


Building Model on Training Set

To build a model on the training set, perform the following:

  • Apply a logistic regression classifier to the training set.
  • Predict the test set.
  • Check the predicted output on the imbalance data.

Using the confusion matrix, the test result shows 99.9% accuracy, but only because the majority class dominates the data, so this accuracy is ignored. Using the ROC curve, the test result shows 78% accuracy, which is quite low.


Applying Sampling Methods to Balance Dataset

Different sampling methods are used to balance the given data, apply model on the balanced data, and check the number of good and fraud transactions in the training set.


There are 227K good and 394 fraud transactions.

In R, Random Over Sampling Examples (ROSE) and DMwR packages are used to quickly perform sampling strategies. ROSE package is used to generate artificial data based on sampling methods and smoothed bootstrap approach. This package provides well-defined accuracy functions to quickly perform the tasks.

The different types of sampling methods are:

Oversampling

This method instructs the algorithm to perform oversampling. As the original dataset had 227K good observations, this method is used to oversample the minority class until it reaches 227K, giving the dataset a total of about 454K samples. This can be attained using method = "over".


Undersampling

This method works like the oversampling method in reverse and is done without replacement: the majority class is reduced until good transactions equal fraud transactions. Hence, significant information from the majority class can be lost in this sample. This can be attained using method = "under".


Both Sampling

This method is a combination of both oversampling and undersampling methods. Using this method, the majority class is undersampled without replacement and the minority class is oversampled with replacement. This can be attained using method = “both”.

ROSE Sampling

ROSE sampling method generates data synthetically and provides a better estimate of original data.

Synthetic Minority Over-Sampling Technique (SMOTE) Sampling

This method is used to avoid overfitting when adding exact replicas of minority instances to the main dataset.

For example, a subset of data from the minority class is taken. New synthetic similar instances are created and added to the original dataset.

The count of each class records after applying sampling techniques is shown below:


A logistic classifier model is trained on each balanced dataset, and the test data is predicted. The confusion matrix accuracy is again neglected because the test data remains imbalanced. The built-in roc.curve function is used to capture the ROC metric.


Conclusion

In this blog, the highest accuracy is obtained using the SMOTE method. As there is not much variation among these sampling methods, combining them with a more robust algorithm such as random forest or boosting can provide exceptionally high accuracy.

When dealing with an imbalanced dataset, experiment with all these methods to find the best-suited sampling method for your dataset. For better results, advanced approaches combining synthetic sampling with boosting methods can be used.

These sampling methods can be implemented in much the same way in Python, as sketched below.
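
A hedged Python sketch of the same workflow, using scikit-learn and the imbalanced-learn package, is shown below. The file name and column names follow the credit card dataset described above; exact AUC values will differ from the R results.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

data = pd.read_csv("creditcard.csv")
X, y = data.drop("Class", axis=1), data["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Balance only the training set, then evaluate on the untouched test set
samplers = {
    "over": RandomOverSampler(random_state=42),
    "under": RandomUnderSampler(random_state=42),
    "smote": SMOTE(random_state=42),
}

for name, sampler in samplers.items():
    X_bal, y_bal = sampler.fit_resample(X_train, y_train)
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")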

References

API Response Tracking with StreamSets, Elasticsearch, and Kibana


Overview

RESTful API JSON response data can be used to view various aspects of the StreamSets Data Collector, such as pipeline configuration or monitoring information. The Data Collector REST API can be used to provide these details to a REST-based monitoring system.

In this blog, let us discuss capturing all alerts produced by StreamSets pipelines using the RESTful API, loading the alerts in Elasticsearch, and visualizing the alerts in Kibana.

Pre-requisites

  • Install Java 1.8
  • Install streamsets-datacollector-2.6.0.1

Use Case

Create a dataflow pipeline to capture response of RESTful API using StreamSets and to load it in Elasticsearch.

Synopsis

  • View RESTful API response data
  • Capture RESTful API response
  • Load API response in Elasticsearch
  • Visualize pipeline alerts in Kibana

Viewing RESTful API Response Data

To view RESTful API response data, perform the following:

  • Log in to StreamSets.
  • On the top right corner, click Help icon.
  • Click RESTful API.
    Different categories such as ACL, definitions, manager, preview, store, and system can be viewed.


  • Click manager to view API required to get alerts triggered for all the pipelines.
  • Click try it out! to get the request URL.


  • Check the response in UI using the below URL:
    http://<sdc_host>:<sdc_port>/rest/v1/pipelines/alerts


Capturing RESTful API Response

To capture RESTful API response, perform the following:

  • Configure the HTTP Client processor by setting Resource URL as "http://<sdc_host>:<sdc_port>/rest/v1/pipelines/alerts", Mode as "Polling", and the Polling Interval.


  • Capture RESTful API response using the HTTP client processor.
  • In Pagination tab, set Pagination Mode as “Link HTTP header” and Result Field Path as “/”.


Loading API Response in Elasticsearch

To load API Response in Elasticsearch, perform the following:

  • Configure “Elasticsearch” processor.
  • Set Cluster HTTP URI.
  • Use the below template for Elasticsearch:
{
  "template": "streamsets*",
  "mappings": {
    "uri": {
      "properties": {
        "gauge": {
          "properties": {
            "value": {
              "properties": {
                "timestamp": {
                  "type": "date",
                  "format": "yyyy-MM-dd HH:mm:ss.SSS||yyyy-MM-dd'T'HH:mm:ss.SSS'Z'||yyyy-MM-dd||yyyy-MM-dd HH:mm:ss||mmm dd, yyyy HH:mm:ss a||epoch_millis"
                }
              }
            }
          }
        }
      }
    }
  }
}
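
As an alternative illustration of the same idea, the alerts endpoint can be polled from Python and the response indexed into Elasticsearch directly. The host names, port, credentials, and index name below are placeholders, and a recent elasticsearch-py client is assumed.

import requests
from elasticsearch import Elasticsearch

SDC_ALERTS_URL = "http://sdc-host:18630/rest/v1/pipelines/alerts"
es = Elasticsearch("http://es-host:9200")

# The X-Requested-By header is commonly required by SDC REST calls (harmless for GETs)
resp = requests.get(
    SDC_ALERTS_URL,
    auth=("admin", "admin"),
    headers={"X-Requested-By": "sdc"},
)
resp.raise_for_status()

# Assuming the endpoint returns a JSON array of alert objects, index each one
for alert in resp.json():
    es.index(index="streamsets-alerts", document=alert)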

Visualizing Pipeline Alerts in Kibana

The alerts produced by all the pipelines can be viewed in Kibana without using StreamSets.

Number of Alerts vs Label as Attribute


Number of Alerts vs Timestamp


Conclusion

StreamSets provides different RESTful APIs to get metrics, status, alerts, and so on. These APIs can be used with different visualization tools to visualize data and to monitor the pipelines externally.

References

Import and Ingest Data into HDFS using Kafka in StreamSets


Overview

StreamSets provides state-of-the-art data ingestion to easily and continuously ingest data from various origins such as relational databases, flat files, AWS, and so on, and write data to various systems such as HDFS, HBase, Solr, and so on. Its configuration-driven User Interface (UI) helps you design pipelines for data ingestion in minutes. Data is routed, transformed, and enriched during ingestion and made ready for consumption and delivery to downstream systems.

Kafka, an intermediate data store, helps to very easily replay ingestion, consume datasets across multiple applications, and perform data analysis. In this blog, let us discuss reading the data from different data sources such as Amazon Simple Storage Service (S3) & flat files and writing the data into HDFS using Kafka in StreamSets.

Pre-requisites

  • Install Java 1.8
  • Install streamsets-datacollector-2.6.0.1

Use Case

Import and ingest data from different data sources into HDFS using Kafka in StreamSets.

Data Description

Network data of outdoor field sensors is used as the source file. Additional fields, dummy data, empty data, and duplicate data were added to the source file.

The dataset has total record count of 600K with 3.5K duplicate records.

Sample data

{"ambient_temperature":"16.70","datetime":"Wed Aug 30 18:42:45 IST 
2017","humidity":"76.4517","lat":36.17,"lng":-
119.7462,"photo_sensor":"1003.3","radiation_level":"201","sensor_id":"c6698873b4f14b995c9e66ad0d8f29e3","
sensor_name":"California","sensor_uuid":"probe-2a2515fc","timestamp":1504098765}

Synopsis

  • Read data from local file system and produce data to Kafka
  • Read data from Amazon S3 and produce data to Kafka
  • Consume streaming data produced by Kafka
  • Remove duplicate records
  • Persist data into HDFS
  • View data loading statistics

Reading Data from Local File System and Producing Data to Kafka

To read data from the local file system, perform the following:

  • Create a new pipeline.
  • Configure File Directory origin to read files from a directory.
  • Set Data Format as JSON and JSON content as Multiple JSON objects.
  • Use Kafka Producer processor to produce data into Kafka.
    Note: If there are no Kafka processors, install Apache Kafka package and restart SDC.
  • Produce the data under topic sensor_data.

reading-data-from-local-file-system

reading-data-from-local-file-system1
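
For reference, the following kafka-python sketch does roughly what the Kafka Producer stage does here: it reads the multi-object JSON file and publishes each record to the sensor_data topic. The broker address and file path are placeholders.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# Read the multi-object JSON file and publish each record to the sensor_data topic
with open("/data/sensor/sensor_readings.json") as source:
    for line in source:
        line = line.strip()
        if line:
            producer.send("sensor_data", json.loads(line))

producer.flush()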

Reading Data from Amazon S3 and Producing Data to Kafka

To read data from Amazon S3 and produce data into Kafka, perform the following:

  • Create another pipeline.
  • Use Amazon S3 origin processor to read data from S3.
    Note: If there are no Amazon S3 processors, install Amazon Web Services 1.11.123 package available under Package Manager.
  • Configure processor by providing Access Key ID, Secret Access Key, Region, and Bucket name.
  • Set the data format as JSON.
  • Produce data under the same Kafka topic – sensor_data.

reading-data-from-amazon-s3

reading-data-from-amazon-s3-1

Consuming Streaming Data Produced by Kafka

To consume streaming data produced by Kafka, perform the following:

  • Create a new pipeline.
  • Use Kafka Consumer origin to consume Kafka produced data.
  • Configure processor by providing the following details:
    • Broker URI
    • ZooKeeper URI
    • Topic – set the topic name as sensor_data (same data produced in previous sections 1 & 2)
  • Set the data format as JSON.

consuming-streaming-data-produced-by-kafka

Removing Duplicate Records

To remove duplicate records using Record Deduplicator processor, perform the following:

  • Under Deduplication tab, provide the following fields to compare and find duplicates:
    • Max. Records to Compare
    • Time to Compare
    • Compare
    • Fields to Compare
      For example, find duplicates based on sensor_id and sensor_uuid.
  • Move the duplicate records to Trash.
  • Store the unique records in HDFS.

removing-duplicate-records

Persisting Data into HDFS

To load data into HDFS, perform the following:

  • Configure Hadoop FS destination processor from stage library HDP 2.6.
  • Select data format as JSON.
    Note: core-site.xml and hdfs-site.xml files are placed in Hadoop-conf directory (/var/lib/sdc-resources/hadoop-conf). While installing StreamSets, sdc-resources directory will be created.

persisting-data-into-hdfs

Viewing Data Loading Statistics

Data loading statistics, after removing duplicates from different sources, is as follows:

viewing-data-loading-statistics

viewing-data-loading-statistics1

References

Kylo – Self-Service Data Ingestion, Cleansing, and Validation (No Coding Required!)


Overview

Kylo, a feature-rich data lake platform, is built on Apache Hadoop and Apache Spark. Kylo provides a business-friendly data lake solution and enables self-service data ingestion, data wrangling, data profiling, data validation, data cleansing/standardization, and data discovery. Its intuitive user interface allows IT professionals to access the data lake (without having to code).

While many tools ingest either batch data or streaming/real-time data, Kylo supports both. It provides a plug-in architecture with a variety of extensions. Apache NiFi templates provide incredible flexibility for batch and streaming use cases.

In this blog post, let us discuss ingesting data from Apache Kafka, performing data cleansing and validation at real-time, and persisting the data into Apache Hive table.

Pre-requisites

  • Install Kafka.
  • Deploy Kylo, where the deployment requires knowledge on different components/technologies such as:
    • AngularJS for Kylo UI
    • Apache Spark for data wrangling, data profiling, data validation, data cleansing, and schema detection
    • JBoss ModeShape and MySQL for Kylo Metadata Server
    • Apache NiFi for pipeline orchestration
    • Apache ActiveMQ for interprocess communication
    • Elasticsearch for search-based data discovery
    • All Hadoop technologies but most preferably HDFS, YARN, and Hive

To know more about basics and installation of Kylo in AWS EC2 instance, refer our previous blog on Kylo Setup for Data Lake Management.

Data Description

User transaction dataset with 68K rows, generated by Treselle team, is used as the source file. The input dataset has time, uuid, user, business, address, amount, and disputed columns.

Sample dataset


Examples of invalid and missing values in the dataset:


Use Case

  • Publish user transaction dataset into Kafka.
  • Ingest data from Kafka using Kylo data ingestion template and standardize & validate data.

Synopsis

  • Customize data ingest pipeline template
  • Define categories for feeds
  • Define feeds with source and destination
  • Cleanse and validate data
  • Schedule feeds
  • Monitor feeds

Self-Service Data Ingest, Data Cleansing, and Data Validation

Kylo utilizes Spark to provide a pre-defined pipeline template that implements multiple best practices around data ingestion. By default, it ships with file system and database sources, and it helps business users simplify the configuration of data ingestion from new sources such as JMS, Kafka, HDFS, HBase, FTP, SFTP, REST, HTTP, TCP, IMAP, AMQP, POP3, MQTT, WebSocket, Flume, Elasticsearch and Solr, Microsoft Azure Event Hub, Microsoft Exchange using Exchange Web Services (EWS), Couchbase, MongoDB, Amazon S3, SQS, DynamoDB, and Splunk.

Apache NiFi, a scheduler and orchestration engine, provides an integrated framework for designing new types of pipelines with 250+ processors (data connectors and transforms).

The pre-defined data ingest template is modified by adding Kafka, S3, HDFS, and FTP as shown in the below screenshot:


Get, Consume, and Fetch named processors are used to ingest the data. The Get and Consume versions of the Kafka processors in NiFi are as follows:

GetKafka 1.3.0: Fetches messages from the earlier version of Apache Kafka (specifically 0.8.x versions). The complementary NiFi processor used to send messages is PutKafka.

ConsumeKafka_0_10 1.3.0: Consumes messages from the newer version of Apache Kafka specifically built against the Kafka 0.10.x Consumer API.

Based on need, a custom processor or other custom extension for NiFi can be written & packaged as an NAR file and deployed into NiFi.

Customizing Data Ingest Pipeline Template

On updating and saving the data ingest template in NiFi, the same template can be customized in Kylo UI. The customization steps involve:

  • Customizing feed destination table
  • Adding input properties
  • Adding additional properties
  • Performing access control
  • Registering the template


Defining Categories for Feeds

All the feeds created in Kylo should be categorized. The process group in NiFi is launched to execute the feeds. “Transaction raw data” category is created to categorize the feeds.


Defining Feeds with Source and Destination

Kylo UI is self-explanatory to create and schedule the feeds. To define feeds, perform the following:

  • Choose data ingest template.
  • Provide feed name, category, and description.


  • Choose input Data Source to ingest data.
  • Customize the configuration parameter related to that source.
    For example, “transactionRawTopic” in Kafka and batch size “10000”.


  • Define output feed table using either of the following methods:
    • Manually define the table columns and its data type.
    • Upload sample file and update the data type as per the data in the column.
  • Preview the data under Feed Details section in the top right corner.


  • Define partitioning output table by choosing Source Field and Partition Formula.
    For example, “time” as source field and “year” as partition formula to partition the data.

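As a small illustration of what the "year" partition formula does (an assumption about its behavior, shown in Python purely for clarity), the partition value is simply the year component of the source field:

from datetime import datetime

record = {"time": "2017-08-30 18:42:45", "amount": 1250.75}

# The "year" formula reduces the source field to its year component,
# so this record would land in the year=2017 partition of the output table.
partition_value = datetime.strptime(record["time"], "%Y-%m-%d %H:%M:%S").year
print(f"year={partition_value}")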

Cleansing and Validating Data

Feed creation wizard UI allows end-users to configure cleansing and standardization functions to manipulate data into conventional or canonical formats (for example, simple data type conversion such as dates, stripping special characters) or data protection (for example, masking credit cards, PII, and so on).

It allows users to define field-level validation to protect data against quality issues and provides schema validation automatically. It provides an extensible Java API to develop custom validation, custom cleansing, and standardization routines as per needs. It provides predefined rules for standardization and validation of different data types.


To clean and validate data, perform the following:

  • Apply different pre-defined standardization rules for time, user, address, and amount columns as shown below:


  • Apply standardization and validation for different columns as shown in the below screenshot:


  • Define data ingestion merge strategy in the output table.
  • Choose “Dedupe and merge” to ignore duplicated batch data and insert it into the desired output table.


  • Use Target Format section to define data storage and compression options.
    Supported Storage Formats: ORC, Parquet, Avro, TextFile, and RCFile
    Compression Options: Snappy and Zlib


Scheduling Feeds

Feeds can be scheduled using a cron or timer based mechanism. The "Enable Feed immediately" option starts the feed immediately, without waiting for the cron or timer criteria to be met.


Monitoring Feeds

After scheduling the feeds, the actual execution is performed in NiFi. Feed status can be monitored, feed details can be changed at any time, and feeds can be re-scheduled.


An overview of the created feed's job status can be seen under Jobs in the Operations section. By drilling down into the jobs, you can identify the details of each job and debug feed job execution failures.


The Job Activity section provides details such as completed and running runs of a specific feed's recurring activity.


The Operational Job Statistics section provides details such as success rate, flow rate per second, flow duration, and step duration for a specific job.


Conclusion

In this blog, we discussed data ingestion, cleansing, and validation without any coding in Kylo data lake platform. The ingested data output from Kafka is shown in Hive table in Ambari as follows:


In our next blog – Kylo: Data Profiling and Search-based Data Discovery, let us discuss data profiling and search-based data discovery.


References


Predict Lending Club Loan Default Using Seahorse and SparkR


Overview

Data scientists are using Python and R to solve data problems due to the ready availability of these packages. These languages are often limited as the data is processed on a single machine, where the movement of data from the development environment to production environment is time-consuming and requires extensive re-engineering.

To overcome this problem, Spark provides a powerful, unified engine that is both fast (100x faster than Hadoop for large-scale data processing) and easy to use by the data scientists and data engineers. It is simple, scalable, and easy to integrate with other tools.

Seahorse, a scalable data analytics workbench, allows the data scientists to visually build Spark applications. It allows the data scientists to perform data preparation, data transformation, data modeling, data training, data analysis, and data visualization collaboratively. Seahorse has built-in operations to allow the data scientists to customize parameter values.

In this blog, let us discuss predicting loan default of Lending Club. Lending Club is the world’s largest online marketplace to connect borrowers and investors.

Pre-requisites

  • VirtualBox (version 5.0.10)
  • Vagrant (version 1.8.1)
  • Google Chrome (60.0.3112.113)

Data Description

Loan data of Lending Club, from 2007-2011, with 40K records is used as the source file. Each loan has more than 100 characteristics of the loan and the borrower.


Use Case

  • Analyze loan data of Lending Club.
  • Predict loan default in Lending Club dataset by building data model using Logistic Regression.

Loan status falls under two categories such as Charged Off (default loan) and Fully Paid (desirable loan). Lending Club defines Charged Off loans as loans that are non-collectable and the lender has no hope of recovering money.

Synopsis

  • Read Data from Source
  • Prepare Data
  • Train and Evaluate Data Model
  • Visualize Data

Workflow Operations

In Seahorse, all the machine learning processes are made as operations. R Transformation operations are used to clean and prepare the data. The operations used for Lending Club loan data analysis are as follows:

  • Input / Output – Read DataFrame
  • Action – Fit, Transform, Evaluate
  • Set Operation – Split
  • Filtering – Filter Columns, Handle Missing Values
  • Transformation – SQL Transformation, R Transformation
  • Feature Conversion – String Indexer, One Hot Encoder, Assemble Vector
  • Machine Learning – Logistic Regression from Classification, Binary Classification Evaluator from Evaluation


Reading Data from Source

Seahorse supports three different file formats such as CSV, Parquet, and JSON from different types of data sources such as HDFS, Database, Local, and Google Spreadsheets. Read DataFrame operation is used to read the files from the data sources and upload it into Seahorse library.


Preparing Data

To prepare the data for analysis, perform the following:

  • Remove irrelevant data (loan ID, URL, and so on), poorly documented data (average current balance), and less important features (payment plan, home state) from the source data.
  • Use Filter Columns operation to select 17 key features from the dataset as shown in the below diagram:


  • Use R Transformation operation to write any custom function in R.
  • Convert string columns into numeric columns by removing special characters and duplicate data.
    For Example, convert int_rate and revol_util columns into numeric by removing special characters (%).
transform <- function(dataframe) {
# Convert into R dataframe using collect function
dataframe <- collect(dataframe)
# Remove special character(%) from the features
dataframe$int_rate <- as.numeric(gsub("%","",dataframe$int_rate))
dataframe$revol_util <- as.numeric(gsub("%","",dataframe$revol_util))
# Convert string to numeric by removing same word(months)
dataframe$term <- as.numeric(gsub("months","",dataframe$term))
# Reduce factor level for some features column.
dataframe$home_ownership[dataframe$home_ownership=="NONE"] <- "OTHER"
# verified and source verified both are giving same meaning so we have convert as single state
dataframe$verification_status[dataframe$verification_status=="Source Verified"] <- "Verified"
dataframe$loan_status[dataframe$loan_status=="Does not meet the credit policy. Status:Charged Off"] <- "Charged Off"
dataframe$loan_status[dataframe$loan_status=="Does not meet the credit policy. Status:Fully Paid"] <- "Fully Paid"
return(dataframe)
}
  • Derive new features from the date columns by applying feature engineering.
    For example, derive issue_month and issue_year from the issue_d feature, and similarly for the earliest_cr_line feature.
transform <- function(dataframe) {
dataframe <- collect(dataframe)
# Add default value for day in date_time columns
dataframe$issue_d <- as.Date(paste("01-",dataframe$issue_d,sep=""),"%d-%b-%Y")
dataframe$earliest_cr_line <- as.Date(paste("01-",dataframe$earliest_cr_line,sep=""),"%d-%b-%Y")
# Get year from the date_time column
dataframe$issue_year <- as.numeric(format(dataframe$issue_d,"%Y"))
dataframe$cr_line_year <- as.numeric(format(dataframe$earliest_cr_line,"%Y"))
# Get month from the date_time
dataframe$issue_month <- as.numeric(format(dataframe$issue_d,"%m"))
dataframe$cr_line_month <- as.numeric(format(dataframe$earliest_cr_line,"%m"))
dataframe$issue_d <- NULL
dataframe$earliest_cr_line <- NULL
return(dataframe)
}
The derived features are shown in the below diagram:


After Preprocessing

After preprocessing, perform the following:

  • Use Handle Missing Values operation to find the rows with missing values and to handle them with the selected strategy such as remove row, remove column, custom value, and mode.
    For example, provide custom values for NAs and empty string.
  • Select numeric and string columns from the DataFrame and select remove row as strategy as shown in the below diagram:


  • Use String Indexer to map the categorical features into numbers.
  • Choose the columns from the DataFrame using name, index, or type.
  • Select string type columns from the DataFrame and apply string indexer to those columns.
    For example, after the String Indexer execution, Fully Paid will become 0 and Charged Off will become 1 in the loan_status column.
  • Use One Hot Encoder operation to convert categorical values into numbers in a fixed range of values.
    A vector will be produced in each column corresponding to one possible value of the feature.


  • Use Assemble Vector operation to group all relevant columns together and to form a column with a single vector of all the features.
    For example, the loan_status column is prediction variable and all other columns are features.
  • Use excluding mode to select all the columns other than the prediction variable.


Training and Evaluating Data Model

To split the dataset into training set and validation set using Split operation based on split ratio, perform the following:

  • Use 0.7 as the split ratio to put 70 percent of the data in the training set and 30 percent in the validation set.


  • Use Logistic Regression and Fit operations to perform model training.
  • Use the Fit operation to fit an estimator so as to produce a Transformer.
  • In the Fit operation, select the feature columns and the prediction variable.
  • Select maximum iterations and threshold value for the model.
    The Fit operation provides the prediction variable with predicted values and confidence scores in the raw prediction and probability columns.


  • Use Evaluate action with Binary Classification Evaluator to find the performance of the model.
  • Find AUC, F-Score, and Recall values from the Binary Classification Evaluator and select AUC as a metric for the model.


  • Use custom functions (R or Python Transformation) to find the confusion matrix of the model and derive the metrics for that model.
  • Use SQL Transformation to write custom Spark SQL query and to get correctly predicted values and wrongly predicted values from the DataFrame.

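A hedged PySpark equivalent of this Seahorse flow (String Indexer, One Hot Encoder, Assemble Vector, Split, Logistic Regression, Evaluate, and a confusion-matrix query) is sketched below. It assumes Spark 3.x and an already cleaned CSV with illustrative column names; the blog itself builds the workflow visually in Seahorse.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("lending-club").getOrCreate()
loans = spark.read.csv("loans_prepared.csv", header=True, inferSchema=True)

# Feature conversion: index the label and a categorical feature, then one-hot encode it
label_indexer = StringIndexer(inputCol="loan_status", outputCol="label")
grade_indexer = StringIndexer(inputCol="grade", outputCol="grade_idx")
encoder = OneHotEncoder(inputCols=["grade_idx"], outputCols=["grade_vec"])
assembler = VectorAssembler(
    inputCols=["loan_amnt", "int_rate", "term", "revol_util", "grade_vec"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)

pipeline = Pipeline(stages=[label_indexer, grade_indexer, encoder, assembler, lr])

# 70/30 split, train, and predict on the validation set
train, validation = loans.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train)
predictions = model.transform(validation)

evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
)
print("AUC:", evaluator.evaluate(predictions))

# One possible SQL Transformation query to tally correct and wrong predictions
predictions.createOrReplaceTempView("predictions")
spark.sql("""
    SELECT label, prediction, COUNT(*) AS cnt
    FROM predictions
    GROUP BY label, prediction
    ORDER BY label, prediction
""").show()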

Visualizing Data

DataFrame Report

In DataFrame Report, every column has some plots based on the datatype.


Int_rate Column Visualization

For Continuous features, the bar chart is used for data visualization as shown in the below diagram:


Grade Column Visualization

For Discrete features, the pie chart is used for data visualization as shown in the below diagram:


To create a custom plot like the combination of two column values, use custom operations such as R, Python, SQL Transformation or Python or R Notebook.

References

Data Quality Metrics using Talend Data Quality Management


Overview

Data Quality is the process of examining data in different data sources according to predefined business goals. It helps to improve the quality of the data and collect statistics and information about the data. It helps business users in making more informed decisions with the quality data.

In this blog, let us discuss Data Quality Statistics (DQS) using Talend Data Quality Management (DQM).

Pre-requisites

Download and install Talend data quality tool from the following link:
https://www.talend.com/products/data-quality/

Data Description

Loan applicant dataset, with basic applicant details such as applicant ID, gender, age, marital status, and so on, is used as the source data.

Sample Data Source in MySQL

sample-data-source-in-mysql

Use Case

Perform column and table level quality statistics on the input data source.

Synopsis

  • Connect data source with Talend DQM
  • Create analysis and data quality statistics
    • Simple statistics
    • Pattern matching statistics
    • Text statistics
    • Pattern frequency statistics
  • Apply static rules
  • Perform Correlation Analysis
  • Identify data duplicates using Match Analysis

Connecting Data Source with Talend DQM

To connect Talend DQM with the database, perform the following:

  • Open Talend Open Studio for Data Quality.
  • In the left panel, click Metadata –> DB connections –> Create DB Connection to create a database connection to import the source data from the database for collecting statistics.


  • Provide the required credentials to create metadata for MySQL DB connection as shown in the below diagram:


Creating Analysis and Data Quality Statistics

On successfully connecting Talend with MySQL, perform analysis on the following levels:

  • Column
  • Table

To collect the data quality statistics with the applicant dataset on column level, perform the following:

  • Create analysis as shown in the below diagram:


  • Select columns for performing the data quality statistics as shown in the below diagram:


  • Select quality indicator for the selected columns to run analysis and view the analysis results.


The quality statistics based on the above-selected indicators in Talend DQM are:

  • Simple Statistics
  • Pattern Matching Statistics
  • Text Statistics
  • Pattern Frequency Statistics

Simple Statistics

The simple statistics on the applicant ID column, with Row Count, Null Count, Distinct Count, Unique Count, Duplicate Count, and Blank Count, is shown in the below diagram:

simple-statistics

Pattern Matching Statistics

Pattern matching statistics are used to analyze the format of several types of data, such as dates in different formats, phone number patterns in different countries, zip codes, and so on. They provide both matching and non-matching patterns.

Matched and unmatched patterns of the phone numbers with the countries are shown in the below diagram:

pattern-matching-statistics

Matched and unmatched patterns of the applicant last name starting with uppercase are shown in the below diagram:

pattern-matching-statistics1

Matched and unmatched patterns of the US state codes in the applicant data are shown in the below diagram:

pattern-matching-statistics2

Matched and unmatched patterns of the date matching with the date of birth of the applicant are shown in the below diagram:

pattern-matching-statistics3

Text Statistics

Text statistics are used to check data with a fixed expected length, such as phone numbers (for example, 10 digits in India). The text statistics on the applicants' phone numbers are shown in the below diagram:

text-statistics

Pattern Frequency Statistics

Pattern frequency statistics are used to check the pattern formats in the data source. The patterns of the phone numbers are shown in the below diagram:

pattern-frequency-statistics
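
To show what these indicators compute, here is a hedged pandas sketch that reproduces the four kinds of statistics outside Talend. The file name, column names, and phone pattern are placeholders based on the applicant dataset.

import re
import pandas as pd

applicants = pd.read_csv("loan_applicants.csv")
phone = applicants["phone_number"].astype(str)

# Simple statistics
print("row count:      ", len(applicants))
print("null count:     ", applicants["applicant_id"].isna().sum())
print("distinct count: ", applicants["applicant_id"].nunique())
print("duplicate count:", applicants["applicant_id"].duplicated().sum())

# Pattern matching statistics: US-style phone numbers
us_pattern = re.compile(r"^\d{3}-\d{3}-\d{4}$")
matches = phone.apply(lambda value: bool(us_pattern.match(value)))
print("matching:", matches.sum(), "non-matching:", (~matches).sum())

# Text statistics: length of each phone number
print(phone.str.len().describe())

# Pattern frequency statistics: digit/letter shape of each value
shape = phone.str.replace(r"\d", "9", regex=True).str.replace(r"[A-Za-z]", "A", regex=True)
print(shape.value_counts().head())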

Applying Static Rules

Business rule statistics, also called table-level statistics, are used to apply static rules and predefined business rules to table columns.

Few static rules created are:

  • Approved loan amount should not be greater than requested loan amount.
  • The age column value should be consistent with the date of birth column in the table.
  • Gender column should have valid data like Male or Female.

The business rule analysis performed using the above static rules is shown in the below diagram:

applying-static-rules

Performing Correlation Analysis

The correlation analysis is used to explore the relationships and correlations in the data. It is used to highlight weak relationships between the data to find potential incorrect relationships. The correlation analysis between the cities and the states is shown in the below diagram:


Identifying Data Duplicates Using Match Analysis

The match analysis is used to assess the number of duplicates in the data. It estimates the number of groups of similar data on a table-set or a column-set basis. The match analysis, with column sets such as state and gender, is shown in the below diagrams:

match-analysis

match-analysis1

References

Kylo – Automatic Data Profiling and Search-based Data Discovery


Overview

Data profiling is the process of assessing data values and deriving statistics or business information about the data. It allows data scientists to validate data quality and business analysts to determine the usage of the existing data for different purposes. Kylo automatically generates profile statistics such as minimum, maximum, mean, standard deviation, variance, aggregates (count & sum), occurrence of null values, occurrence of uniqueness, occurrence of missing values, occurrence of duplicates, occurrence of top values, and occurrence of valid & invalid values.

Once the data has been ingested, cleansed, and persisted in data lake, the business analyst searches and finds out if the data can deliver business impact. Kylo allows users to build queries to access the data so as to build data products supporting analysis and to make data discovery simple.

In this blog, let us discuss automatic data profiling and search-based data discovery in Kylo.

Pre-requisites

To know about Kylo deployment requiring knowledge on different components/technologies, refer our previous blog on Kylo Setup for Data Lake Management.

To learn more about Kylo self-service data ingest, refer our previous blog on Kylo – Self-Service Data Ingestion, Cleansing, and Validation (No Coding Required!).

Data Profiling

Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection. Kylo’s data profiling routine generates statistics for each field in an incoming dataset. Profiling is used to validate data quality. The profiling statistics can be found in Feed Details page.

Feed Details

The feed ingestion using Kafka is shown in the below diagram:


Informative summaries about each field from the ingested data can be viewed under View option in Profile page.

String (user field in the sample dataset) and numeric data type (amount field in the sample dataset) profiling details are shown in the below diagrams:


Profiling Statistics

Kylo profiling jobs automatically calculate the basic numeric field statistics such as minimum, maximum, mean, standard deviation, variance, and sum. Kylo provides basic statistics for string field. The numeric field statistics for the amount field is shown in the below diagram:


The basic statistics for the string field (i.e. user field) is shown in the below diagram:


Standardization Rules

Predefined standardization rules are used to manipulate data into conventional or canonical formats (dates, stripping special characters) or data protection (masking credit cards, PII, and so on). Few standardization rules applied on the ingested data are as follows:


Kylo provides an extensible Java API to develop custom validation, custom cleansing, and standardization routines as per business needs. The standardization rules applied to the user, business, and address fields as per the configuration is shown in the below diagram:

select

Profiling Window

Kylo’s profiling window provides additional tabs such as valid and invalid to view both valid and invalid data after data ingestion. If validation rules fail, the data will be marked as invalid and will be shown under the Invalid tab with the reason for failure such as Range Validator Rule violation, not considered as timestamp, and so on.

select

The data is ingested from Kafka. During feed creation, the Kafka batch size is set to “10000”, which is the number of messages the producer will attempt to batch before sending them. To know more about batch size, refer to our previous blog on Kylo – Self-Service Data Ingestion, Cleansing, and Validation (No Coding Required!).
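As a point of reference, producer-side batching is usually tuned with batch size and linger settings. The sketch below uses the kafka-python client, which is an assumption for illustration and not how Kylo configures the feed; note that kafka-python's batch_size is measured in bytes rather than in messages.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # assumed broker address
    batch_size=16384,                     # kafka-python batches by size in bytes, not by message count
    linger_ms=50,                         # wait up to 50 ms to fill a batch before sending
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("userdata", {"user": "Bradley Martinez", "amount": 120.5})  # hypothetical topic and record
producer.flush()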

Profiling is applied to each batch of data, and an informative summary is available on the Profile page. The 68K records consumed from Kafka are shown in the below diagram:

select

Search-based Data Discovery

Kylo uses Elasticsearch to provide the index for search features such as free-form data and metadata search. It allows the business analysts to decide which fields need to be searchable and to enable the index option for those fields while creating the feed. The indexed “user” and “business” fields, searchable from Kylo Global Search, are shown in the below diagram:

select

Index Feed

The predefined “Index Feed” queries the index-enabled field data from the persisted Hive table and indexes the feed data into Elasticsearch. The “Index Feed” is automatically triggered as a part of the “Data Ingest” template. The index feed job status is highlighted in the below diagram:

select

If the index feed fails, search cannot be performed on the ingested data. As “user” is a reserved word in Hive, the search functionality for user and business fields failed due to the field name “user” as shown in the below diagram:

select

To resolve this, the “user” field name is modified as “customer_name” during feed creation.

Search Queries

The search query to return the matched documents from Elasticsearch is:

customer_name: “Bradley Martinez”

select

The Lucene search query to search data and metadata is:

business: “JP Morgan Chase & Co”

select
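Because the index feed stores the data in Elasticsearch, an equivalent Lucene-style query can also be issued directly against the index. The sketch below is a hedged example using the Python requests package; the index name kylo-data and the host are assumptions.

import requests

# query_string accepts the same Lucene syntax shown above.
query = {"query": {"query_string": {"query": 'business:"JP Morgan Chase & Co"'}}}
resp = requests.post("http://localhost:9200/kylo-data/_search", json=query)  # assumed index and host
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])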

Feed Lineage

Lineage is automatically maintained at the feed level by the Kylo framework, based on the sources and sinks identified by the template designer when registering the template.

select

Conclusion

In this blog, we discussed automatic data profiling and search-based data discovery in Kylo. We also discussed a few issues faced with the Index Feed and their solutions. Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection. It provides an extensible API to build custom validators and standardizers. Once the setup with the different technologies is in place, Kylo performs data profiling and discovery automatically in the background.

References

Sensor Data Quality Management using PySpark & Seaborn

$
0
0

Overview

Data Quality Management (DQM) is the process of analyzing, defining, monitoring, and continuously improving the quality of data. A few data quality dimensions widely used by data practitioners are Accuracy, Completeness, Consistency, Timeliness, and Validity. Various DQM rules are configured to apply DQM to the existing data. These rules are applied to clean up, repair, and standardize incoming data and to identify and correct invalid data.

In this blog, let us check the data for required values, validate data types, and detect integrity violations. DQM is applied to correct the data by providing default values, formatting numbers and dates, and removing missing values, null values, non-relevant values, duplicates, out-of-bounds values, referential integrity violations, and value integrity violations.

Pre-requisites

Install the following Python packages:

  • PySpark
  • XGBoost
  • Pandas
  • Matplotlib
  • Seaborn
  • NumPy
  • sklearn

Data Description

Sensor data from the pub-nub source is used as the source file.

  • Total Record Count: 6K
  • File Types: JSON and CSV
  • # of Columns: 11
  • # of Records: 600K
  • # of Duplicate Records: 3.5K
  • # of NA Values:
    • Ambient Temperature: 3370
    • Humidity: 345
    • Sensor IDs: 12

Sample Dataset

select

Use Case

Perform data quality management on sensor data using Python API – PySpark.

Data Quality Management Process

select

Synopsis

  • Data Integrity
  • Data Profiling
  • Data Cleansing
  • Data Transformation

Data Integrity

Data integrity is the process of guaranteeing the quality of the data in the database.

  • Analyzed input sensor data with
    • 11 columns
    • 6K records
  • Validated source metadata
  • Populated relationships for an entity

Data Profiling

Data profiling is the process of discovering and analyzing enterprise metadata to discover patterns, entity relationships, data structure, and business rules. It provides statistics or informative summaries of the data to assess data issues and quality.

Few data profiling analyses include:

  • Completeness Analysis – Analyze frequency of attribute population versus blank or null values.
  • Uniqueness Analysis – Analyze and find unique or distinct values and duplicate values for a given attribute across all records.
  • Values Distribution Analysis – Analyze and find the distribution of records across different values of a given attribute.
  • Range Analysis – Analyze and find minimum, maximum, median, and average values of a given attribute.
  • Pattern Analysis – Analyze and find character patterns and pattern frequency.

Generating Profile Reports

To generate profile reports, use either Pandas profiling or PySpark data profiling using the below commands:

Pandas Profiling

import pandas as pd
import pandas_profiling
import numpy as np

#Read the source file that contains sensor data details
df= pd.read_json('E:\sensor_data.json', lines=True)

#Preprocessing on data
df = df.replace(r'\s+', np.nan, regex=True)
df['ambient_temperature']= df['ambient_temperature'].astype(float)
df['humidity'] = df['humidity'].astype(float)

#Generate profile report using pandas_profiling
report = pandas_profiling.ProfileReport(df)

#convert profile report to an html file
report.to_file("E:\sensor_data.html")

PySpark Profiling

import pandas as pd
import spark_df_profiling
import numpy as np

#Initializing PySpark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

#Spark Config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)
sql = SQLContext(sc)

# Loading transaction Data
sensor_data_df = sql.read.format("com.databricks.spark.csv").option("header", "true").load("E:\spireon\Data\ganga\sensor_data.csv")
report = spark_df_profiling.ProfileReport(sensor_data_df)
report.to_file("E:\spireon\Data\ganga\pyspark_sensor_data_profiling_v2.html")

The profile report provides the following details:

  • Essentials – type, unique values, missing values
  • Quantile Statistics – minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive Statistics – mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram

Profile Report Overview

select

The sample profile report for a single attribute (ambient temperature) is as follows:

Ambient Temperature – Statistics

select

Ambient Temperature – Histogram

select

Ambient Temperature – Extreme Values

select

To view the complete profile report, see Reference section.

Data Cleansing

Data cleansing is the process of identifying incomplete, incorrect, inaccurate, duplicate, or irrelevant data and modifying, replacing, or deleting the dirty data.

select
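The dataset description above also notes about 3.5K duplicate records. A minimal pandas step for removing them, assuming df is the DataFrame read earlier, could look like this (this step is not part of the original flow):

# Drop exact duplicate rows before further cleansing.
df = df.drop_duplicates()
print(len(df))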

  • Analyzed the number of null (NaN) values in the dataset using the below command:
    df.isnull().sum()

The number of null values is as follows:

select

  • Deleted NaN values in String type columns using the below command:
df_v1 = df.dropna(subset=['sensor_id', 'sensor_name', 'sensor_uuid'], how='all')
df_v1.isnull().sum()
  • Imputed missing values using one of the below methods:

Method 1 – Impute package

Imputation is defined as the process of replacing the missing data with substituted values using any of the following options:

  • most_frequent: Columns of the dtype object (string) are imputed with the most frequent values in the column as mean or median cannot be found for this data type.
  • Mean: Ratio of the sum of elements to the number of elements in the list.
  • Median: Middle value of the sorted list (for an even number of elements, the average of the two middle values).

Note: If the missing values in the records are negligible, ignore those records.

In our use case, the most_frequent strategy is used for substituting the missing values using the below command:

from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imputer = imputer.fit(df_v1.ix[:, [2, 3, 4, 5, 6]])
df_v1.ix[:, [2, 3, 4, 5, 6]] = imputer.transform(df_v1.ix[:, [2, 3, 4, 5, 6]])

Method 2 – Linear Regression model

To replace the missing data with the substituted values using Linear Regression model, use the below commands:

from sklearn.linear_model import LinearRegression,LogisticRegression

# Split values into sets with known and unknown ambient_temperature values
df_v2 = df_v1[["ambient_temperature","humidity","photosensor","radiation_level"]]
knownTemperature = df_v2.loc[(df_v1.ambient_temperature.notnull())]
unknownTemperature = df_v2.loc[(df_v1.ambient_temperature.isnull())]

# All ambient_temperature values stored in a target array
Y = knownTemperature.values[:, 0]

# All the other values stored in the feature array
X = knownTemperature.values[:,1::]

# Create and fit a linear regression model
linear_regression = LinearRegression()
linear_regression.fit(X, Y)

# Use the fitted regression model to predict the missing values
predictedTemperature = linear_regression.predict(unknownTemperature.values[:, 1::])

# Assign those predicted values to the full data set
df_v1.loc[ (df_v1.ambient_temperature.isnull()), 'ambient_temperature' ] = predictedTemperature

Data Transformation

Data transformation deals with converting data from the source format into the required destination format.

select

  • Converted attributes such as ambient_temperature and humidity from object type to float type using the below command:
#Preprocessing on data transformation
df = df.replace(r'\s+', np.nan, regex=True)
df['ambient_temperature'] = df['ambient_temperature'].astype(float)
df['humidity'] = df['humidity'].astype(float)
  • Converted a non_numeric value of sensor_name into numeric data using the below command:
from sklearn.preprocessing import LabelEncoder

labelencoder_X = LabelEncoder()
labelencoder_X.fit(df_v1.ix[:, 6])
list(labelencoder_X.classes_)
df_v1.ix[:, 6] = labelencoder_X.transform(df_v1.ix[:, 6])
  • Converted a non_numeric sensor name into numeric data using the below command:
labelencoder_y = LabelEncoder()
labelencoder_y.fit(df_v1.ix[:, 4])
list(labelencoder_y.classes_)
df_v1.ix[:, 4] = labelencoder_y.transform(df_v1.ix[:, 4])
  • Converted a non_numeric value of sensor ID into numeric data using the below command:
labelencoder_z = LabelEncoder()
labelencoder_z.fit(df_v1.ix[:, 5])
list(labelencoder_z.classes_)
df_v1.ix[:, 5] = labelencoder_z.transform(df_v1.ix[:, 5])
  • Based on the above transformation, found feature importance using built-in function using the below commands:
# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot as plt

# split data into X and y
X = df_v1.ix[:,[0,1,2,3,4,5,6,7,10]]
Y = df_v1.ix[:,[10]]
plt.clf()

# fit model no training data
model = XGBClassifier()
model.fit(X, Y)

# plot feature importance
plot_importance(model)
plt.gcf().subplots_adjust(bottom=0.15)
plt.tight_layout()
plt.show()
Feature Importance Chart

select

From the above diagram, it is evident that photosensor feature has the highest importance and lat (latitude) feature has the lowest importance.

Correlation Analysis

Correlation analysis was performed to explore relationships in the data and to highlight weak relationships that may point to potentially incorrect ones. The correlation analysis between the sensor data variables is shown in the below diagram:

select

From the above diagram, it is evident that ambient_temperature is highly correlated with dewpoint and humidity, while latitude and longitude are negatively correlated.
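The blog's title mentions Seaborn; a correlation heatmap of this kind is typically produced with a few lines like the sketch below, assuming df_v1 holds the numeric sensor columns after the transformations above.

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations of the numeric columns, rendered as an annotated heatmap.
corr = df_v1.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.tight_layout()
plt.show()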

Reference

Predict Bad Loans with H2O Flow AutoML

$
0
0

Overview

Machine learning algorithms play a key role in accurately predicting the loan data of any bank. The greatest challenge is to employ the best models and algorithms to accurately predict the probability of loan default, so that both investors and borrowers can make sound financial decisions. H2O Flow, a web-based interactive computational environment, is used to combine text, code execution, and rich media into a single document.

H2O’s AutoML, an easy-to-use interface that also serves advanced users, automates the machine learning workflow, including training a large set of models. Stacked Ensembles are used to produce a top-performing, highly predictive ensemble model on the AutoML Leaderboard. In this blog, let us predict bad loan data in order to help borrowers make financial decisions and investors choose the best investment strategy.

Pre-requisites

  • Install Python 2.7 or 3.5+
  • Install H2O Flow with the following packages:
    • pip install requests
    • pip install tabulate
    • pip install scikit-learn
    • pip install colorama
    • pip install future
    • pip install http://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/2/Python/h2o-3.14.0.2-py2.py3-none-any.whl
  • On successfully installing H2O, check Cluster connection using h2o.init().

Data Description

Lending Club loan data from 2007-2011, with 163K rows and 15 columns, is used as the source file. Lending Club is a peer-to-peer lending platform for both investors and borrowers.

Sample Dataset
select

Dataset Variables

  1. loan_amnt
  2. term
  3. int_rate
  4. addr_state
  5. dti
  6. revol_util
  7. delinq_2yrs
  8. emp_length
  9. annual_inc
  10. home_ownership
  11. purpose
  12. total_acc
  13. longest_credit_length
  14. verification_status
  15. Dependent variable

Use Case

  • Analyze Lending Club’s loan data.
  • Predict bad loan data in the dataset by using the distributed random forest model and the stacked ensembles in AutoML based on the borrower loan amount approval or rejection.

Based on the percentage of bad loans, investors can easily decide whether or not to finance a borrower for new loans. For example, a loan is considered rejected if the bad loan value is 1.

Synopsis

  • Import data from source
  • View parsing data
  • View job details and dataset summary
  • Visualize labels
  • Impute data
  • Split Data
  • Run AutoML
  • View Leaderboard
  • Compute Variable Importance
  • View Output

Importing Data from Source

To import the data from the source, perform the following:

  • Open H2O Flow.
  • Click Data –> Import Files to import the source files into H2O Flow as shown in the below diagram:

select

select

After importing the files, a summary displays the results of the import.

Viewing Parsing Data

On successfully importing these files, click Parse these files to parse the files and to view the details of the source data as shown in the below diagram:

select

The parsed files contain the column names and data types of all features. The data types are assigned by default and can be changed if required. For example, in our use case, the data type of the response column (bad loan) is changed from numeric to factor (Enum). After making all changes, click Parse.

select

Viewing Job Details and Dataset Summary

After parsing the files, you can view the job details. Click View to view the summary of the DataFrame.

select

Loan Dataset Summary

select

From the above summary, the input columns show multiple label values. Each label's data can be visualized by clicking the corresponding column name.

Visualizing Labels

In this section, let us visualize data of loan amount and employee length columns.

Loan Amount Data

select

Employee Length Data

select

Imputing Data

Missing label values are imputed in place, with aggregates computed on the “na.rm’d” vector.
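For reference, the same in-place imputation can be done outside the Flow UI through the H2O Python API. The sketch below is only an approximation of the UI steps; the file name and the dti column are assumptions.

import h2o

h2o.init()
loans = h2o.import_file("loan.csv")          # assumed file name

# Impute a single column in place with its median (mirrors Frame/Column/Method in the UI).
col_index = loans.names.index("dti")         # "dti" is an assumed column choice
loans.impute(col_index, method="median", combine_method="interp")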

To impute the data, perform the following:

  • Choose the attribute with missing values.
  • Click Impute as shown in the below diagram:

select

  • Specify the following details:
    • Frame
    • Column
    • Method
    • Combine Method

select

On successfully imputing the column with the median values, the summary of the column will be displayed as shown in the below diagram:

select

Splitting Data

To split the dataset into a training set (70%) and a test set (30%), perform the following:

  • Click Assist Me and Split Frame (or click the Data drop-down and select Split Frame) to split the DataFrame.
    It automatically adjusts the ratio values to sum to one. On entering unsupported values, an error is displayed.
  • Click Create to view the split frames.

select

select

Running AutoML

To run AutoML, perform the following:

  • Select Model –> RunAutoML as shown in the below diagram:

select

  • Provide the following details as shown in the below diagram:
    • Training Frame – Select the dataset to build the model.
    • Response Column – Select the column to be used as a dependent variable. Required only for GLM, GBM, DL, DRF, Naïve Bayes (classification model).
    • Fold Column – (Optional in AutoML) Select the column with the cross-validation fold index assignment / observation.
    • Weight Column – Weights are per row observation weights and do not increase data size. During data training, rows with higher weights matter more due to the larger loss function pre-factor.
    • Validation Frame – (optional) Select the dataset to evaluate the model accuracy.
    • Leaderboard Frame – Specify the Leaderboard frame when configuring AutoML run. If not specified, the Leaderboard frame will be created from the Training Frame. The output models with best results will be displayed on the Leaderboard.
    • Max Models – (AutoML) Specify the maximum number of models to be built in an AutoML run.
    • Max Runtime Secs – Controls execution time of AutoML run (default time is 3600 seconds).
    • Stopping Rounds – Stops training based on a simple moving average when the stopping_metric does not improve for a specified number of training rounds. Specify 0 to disable this feature.
    • Stopping Tolerance – Specify the tolerance value to improve a model before training ceases.

select
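The same run can also be configured programmatically. The sketch below uses the H2O Python API and mirrors the walkthrough above (import, parse, 70/30 split, AutoML); the file path, the bad_loan column name, and the parameter values are assumptions, not the exact Flow configuration.

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Import and parse the loan data (path is an assumption).
loans = h2o.import_file("lending_club_2007_2011.csv")
loans["bad_loan"] = loans["bad_loan"].asfactor()      # response as a factor, as in the parse step

# 70/30 split, mirroring the Split Frame step.
train, test = loans.split_frame(ratios=[0.7], seed=42)

# Bounded AutoML run; stopping and runtime values are illustrative.
aml = H2OAutoML(max_models=10, max_runtime_secs=3600, stopping_rounds=3, seed=42)
aml.train(y="bad_loan", training_frame=train, leaderboard_frame=test)

print(aml.leaderboard.head())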

Viewing Leaderboard

The Leaderboard displays the models with the best results first as shown in the below diagram:

select

Model

select

ROC Curve – Training Metrics

select

Computing Variable Importance

The statistical significance of all variables affecting the model is computed depending on the algorithm and is listed in the order of most to least importance.
The percentage importance of all variables is scaled to 100. The scaled importance value of the variables is shown in the below diagram:

select

Viewing Output

Predicted Model of Loan Dataset

select

ROC Curve

select

Prediction Scores

select

Conclusion

In this blog, AutoML, the distributed random forest model, and the stacked ensembles are used to build and test the best model for predicting loan default. The data is analyzed to obtain a cut-off value. Investors use this cut-off value to decide on the best investment strategy and to determine which applicants get loans.

References

Crime Analysis Using H2O Autoencoders – Part 1

$
0
0

Overview

Nowadays, Deep Learning (DL) and Machine Learning (ML) are used to analyze and accurately predict data. Machine Learning models are used to accurately predict crimes. Crime prediction not only helps in crime prevention but also enhances public safety. Autoencoder, a simple, 3-layer neural network, is used for dimensionality reduction and for extracting key features from the model.

Data Engineers spend much of their time building analytic models with proper validation metrics in order to improve model performance, while Data Analysts spend a lot of time building data pipelines as part of Big Data Analytics. The Machine Learning models are developed within these pipelines with their own functionalities/features. On passing the models through the analytical pipeline, they can be easily deployed for real-time processing.

This blog is part one of a two-part series of Crime Analysis using H2O Autoencoders. In this blog, let us discuss building the analytical pipeline and applying Deep Learning to predict the arrest status of the crimes happening in Los Angeles (LA).

Pre-requisites

Install the following in R:

Dataset Description

Crime dataset of Los Angeles, from 2016-2017, with 224K records and 27 attributes is used as the source file. This dataset is an open data resource for governments, non-profit organizations, and NGOs.

Sample Dataset

select

Use Case

  • Predict the arrest status of the crimes happening in Los Angeles.
  • Achieve analytical pipeline.
  • Analyze the performance of Autoencoders.
  • Build deep learning and machine learning models.
  • Apply required mechanisms to increase the performance of the models.

Synopsis

  • Access data
  • Prepare data
    • Clean data
    • Preprocess data
  • Perform Exploratory Data Analysis (EDA)
  • Build Machine Learning model
    • Initialize H2O cluster
    • Impute data
    • Train model
  • Validate model
  • Execute model
    • Pre-trained supervised model

Accessing Data

The crime dataset is obtained from https://dev.socrata.com/ and imported into the database. The Socrata APIs provide rich query functionality through a query language called “Socrata Query Language” or “SoQL”.

The data structure is as follows:

select

Preparing Data

In this section, let us discuss data preparation for building a model.

Cleansing Data

Data cleansing is performed to find NA values in the dataset. These NA values should be either removed or imputed with some imputation techniques to get desired data.

To get the count of NA values and view the results, use the below commands:

select

Total Number of NA Values for Each Column

select

From the above diagram, it is evident that attributes such as crm_cd_2, crm_cd_3, crm_cd_4, cross_street, premis_cd, and weapon_used_cd are largely empty or redundant. These attributes are removed from the dataset.

Preprocessing Data

Data preprocessing, such as data type conversion, date conversion, derivation of month, year, and week from the date field, derivation of new attributes, and so on, is performed on the dataset. The date attribute is converted from a factor to a POSIXct object. The lubridate package is used to extract fields such as month, year, and week from this object, and the chron package is used along with the time attribute to derive the crime time interval (Morning, Afternoon, Midnight, and so on).

select

Performing Exploratory Data Analysis

EDA is performed on the crime dataset to gain useful insights into the data.

Top 20 Crimes in Los Angeles

select

Crime Timings

select

Month with Highest Crimes

select

Area with Highest Crime Percentage

select

Top 10 Descent Groups Getting Affected

select

Top 10 Frequently Used Weapons for Crime

select

Safest Living Places in Los Angeles

select

Building Machine Learning Model

In this section, let us discuss building the best Machine Learning model for our dataset using Machine Learning algorithms.

Initializing H2O Cluster

Before imputing the data, initiate an H2O cluster on port 12345 using h2o.init(). The cluster can be accessed at http://localhost:12345/flow/index.html#.

select

Imputing Data

In H2O, data imputation is performed using h2o.impute() to fill the NA values using default methods such as mean, median, and mode. The method is chosen based on the data type of each column. For example, factor or categorical columns are imputed using mode method.

select

The dependent variable is grouped based on the status codes of the crimes that occurred. The arrest status codes are grouped into Not Arrested and Arrested.

select

Training Model

The dataset is split into Train, Test, and Validation frames based on certain ratios specified using h2o.splitFrame(). Each frame is assigned to a separate variable using h2o.assign().

select

To train the model, perform the following:

  • Take the data pertaining to the year 2016 as the training set.
  • Take the data pertaining to the year 2017 as the test set.
  • Apply Deep Learning to the model.
  • Perform Unsupervised classification to predict the arrest status of the crimes.
  • Make the autoencoder model to learn the patterns of the input data irrespective of the given class labels.
  • Make the model to learn the status behavior based on the features.

Function Used to Apply Deep Learning to Our Data: h2o.deeplearning
@param x – features for our model.
@param training_frame – dataset on which the model is trained.
@param model_id – string representing our model, used to save and load it.
@param seed – for reproducibility.
@param hidden – sizes of the hidden layers.
@param epochs – number of iterations our dataset must go through.
@param activation – a string representing the activation function to be used.
@params stopping_rounds, stopping_metric, export_weights_and_biases – used for cross-validation purposes.
@param autoencoder – logical indicating whether autoencoders should be applied or not.

select

select

The above diagram shows the summary of our Autoencoders model and its performance for our training set.

A problem is encountered because a Gaussian distribution is applied to our model instead of a Binomial classification.

As the above results are not satisfactory, the dimensionality of our model is reduced to get better results. The features of one of the hidden layers are extracted and the results are plotted to classify the arrest status using the deep features functions in the H2O package.

select

From the above results, the arrest status of the crimes cannot be exactly determined.

select

So, dimensionality reduction with our autoencoder model alone is not sufficient to identify the arrest status in this dataset. The dimensionality representation of one of our hidden layers is used as features for Model Training. Supervised Classification is applied to the extracted features and the results are tested.

select

Validating Model

To validate the performance of our model, the cross-validation parameters used while building the model are used to plot the ROC curves and get the AUC value on our validation frames. A detailed overview of our model is obtained using the summary() function.

select

select

Executing Model

To predict the arrest status of the crimes, perform the following:

    • Apply the deep features to the dataset.
    • Use our model to predict the arrest status.

select

    • Plot the ROC curve with AUC values based on Sensitivity and Specificity.

select

    • Group the results based on the predicted and actual values with the total number of classes and its frequencies.
    • Decide the performance of our model on the arrest status of the crimes.

select

From the above diagram, the predicted number of Not Arrested cases is 28 and the predicted number of Arrested cases is 150. As these numbers are low, this model may cause problems in maintaining the historical records when used in real time.

Pre-trained Supervised Model

The autoencoder model is used as a pre-training input for a supervised model, and its weights are used for model fitting. The same training and validation sets are used for the supervised model. A parameter called pretrained_autoencoder is added to our model along with the autoencoder model name.

select
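In the Python API, the same idea is expressed through the pretrained_autoencoder argument. The sketch below is a hedged illustration; the response column name, the frames, and the layer sizes are assumptions and must match the autoencoder trained earlier.

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Supervised model that starts from the autoencoder's weights.
clf = H2ODeepLearningEstimator(
    pretrained_autoencoder="crime_model_auto",  # model_id of the autoencoder above
    hidden=[10, 2, 10],                         # must match the pretrained architecture
    epochs=50,
    seed=42,
)
clf.train(x=features, y="arrest_status", training_frame=train, validation_frame=valid)
predictions = clf.predict(test)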

This pre-trained model is used to predict the results of our new data and to find the probability of classes for our new data.

select

The results are grouped based on the actual and predicted values and the performance of our model is decided based on the arrest status of the crimes.

select

From the above results, it is evident that there are only minor changes compared to our previous results with the dimensionality representation. Let us plot the ROC curves and AUC values to compare both results.

select

select

Conclusion

In this blog, we discussed creating the analytical pipeline for the Los Angeles crime dataset, applying the Autoencoders to the dataset, performing both Unsupervised and Supervised Classifications, extracting the dimensionality representation of our model, and applying the Supervised model.

In our next blog on Crime Analysis Using H2O Autoencoders – Part 2, let us discuss deploying the model by converting it into POJO/MOJO objects with the help of H2O functions.

References

Streaming Analytics using Kafka SQL

$
0
0

Overview

Kafka SQL, a streaming SQL engine for Apache Kafka by Confluent, is used for real-time data integration, data monitoring, and data anomaly detection. KSQL is used to read, write, and process Citi Bike trip data in real-time, enrich the trip data with other station details, and find the number of trips started and ended in a day for a particular station. It is also used to publish the trip data from source to other destinations for further analysis.

In this blog, let us discuss enriching the Citi Bike trip data and finding the number of trips on a particular day to/from a particular station.

Pre-requisites

Install the following:

  • Scala
  • Apache Kafka
  • KSQL
  • JDK

Data Description

Trip dataset of Citi Bike March 2017 is used as the source data. It contains basic details such as trip duration, ride start time, ride end time, station ID, station name, station latitude, and station longitude.

select

Station dataset of Citi Bike is used for enriching trip details for further analysis after data consumption. It contains basic details such as availableBikes, availableDocks, statusValue, and totalDocks.

select

Use Case

  • Enrich Citi Bike trip data in real time using join and aggregation concepts.
  • Find the number of trips on the day to/from the particular station.
  • View trip details with station details & aggregate trip count of each station.

Synopsis

  • Produce station details
  • Join stream data and table data
  • Group data
  • Produce trip details
  • View output
    • View trip details with station details
    • View aggregate trip count of each station

Producing Station Details

To produce the station details using Scala, perform the following:

  • Create trip-details and station-details topics in Kafka using the below commands:
./bin/kafka-topics --create --zookeeper localhost:2181 --topic station-details --replication-factor 1 --partitions 1
./bin/kafka-topics --create --zookeeper localhost:2181 --topic trip-details --replication-factor 1 --partitions 1
select

select

    • Iterate the station list to produce JSON file using the below commands:

select

  • Produce the station data into the station-details topic via the below Scala command:
java -cp kafka_producer_consumer.jar com.treselle.kafka.core.Producer station-details localhost:9092 station_data
select
  • Iterate and produce the station details list in JSON format.
  • Check the produced and consumed station details using the below command:
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic station-details --from-beginning
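The producer in this walkthrough is a Scala jar; as a reference point, an equivalent hedged sketch with the kafka-python client is shown below. The client choice, the input file name, and the record layout are assumptions.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish each station record as a JSON message to the station-details topic.
with open("station_data.json") as f:          # assumed file of one JSON record per line
    for line in f:
        producer.send("station-details", json.loads(line))

producer.flush()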

Joining Stream Data and Table Data

To join the stream and table data, perform the following:

  • In KSQL console, create a table for the station details to join it with the trip details while producing the stream using the below commands:
CREATE TABLE
station_details_table
(
id BIGINT,
stationName VARCHAR,
availableDocks BIGINT,
totalDocks BIGINT,
latitude DOUBLE,
longitude DOUBLE,
statusValue VARCHAR,
statusKey BIGINT,
availableBikes BIGINT,
stAddress1 VARCHAR,
stAddress2 VARCHAR,
city VARCHAR,
postalCode VARCHAR,
location VARCHAR,
altitude VARCHAR,
testStation BOOLEAN,
lastCommunicationTime VARCHAR,
landMark VARCHAR
)
WITH
(
kafka_topic='station-details',
value_format='JSON'
);
select
  • In KSQL Console, create a stream for the trip details to enrich the data with the start station details and to find the trip count of each station for the day using the below commands:
CREATE STREAM
trip_details_stream
(
tripduration BIGINT,
starttime VARCHAR,
stoptime VARCHAR,
start_station_id BIGINT,
start_station_name VARCHAR,
start_station_latitude DOUBLE,
start_station_longitude DOUBLE,
end_station_id BIGINT,
end_station_name VARCHAR,
end_station_latitude DOUBLE,
end_station_longitude DOUBLE,
bikeid INT,
usertype VARCHAR,
birth_year VARCHAR,
gender VARCHAR
)
WITH
(
kafka_topic='trip-details',
value_format='DELIMITED'
);
select
  • Join the stream with the station details table to get fields such as availableBikes, totalDocks, and availableDocks using the station ID as the key.
  • In the select statement, extract the start time in date format as a timestamp, so that only the day can be derived from the start time when finding the count of trips started per day, using the below commands:
CREATE STREAM
citibike_trip_start_station_details WITH
(
value_format='JSON'
) AS
SELECT
a.tripduration,
a.starttime,
STRINGTOTIMESTAMP(a.starttime, 'yyyy-MM-dd HH:mm:ss') AS startime_timestamp,
a.start_station_id,
a.start_station_name,
a.start_station_latitude,
a.start_station_longitude,
a.bikeid,
a.usertype,
a.birth_year,
a.gender,
b.availableDocks AS start_station_availableDocks,
b.totalDocks AS start_station_totalDocks,
b.availableBikes AS start_station_availableBikes,
b.statusValue AS start_station_service_value
FROM
trip_details_stream a
LEFT JOIN
station_details_table b
ON
a.start_station_id=b.id;
select
  • Add the end station details with the trip details in another topic similar to the start station.
  • Extract end time field as a long timestamp using the below commands:
CREATE STREAM
citibike_trip_end_station_details WITH
(
value_format='JSON'
) AS
SELECT
a.tripduration,
a.stoptime,
STRINGTOTIMESTAMP(a.stoptime, 'yyyy-MM-dd HH:mm:ss') AS stoptime_timestamp,
a.end_station_id,
a.end_station_name,
a.end_station_latitude,
a.end_station_longitude,
a.bikeid,
a.usertype,
a.birth_year,
a.gender,
b.availableDocks AS end_station_availableDocks,
b.totalDocks AS end_station_totalDocks,
b.availableBikes AS end_station_availableBikes,
b.statusValue AS end_station_service_value
FROM
trip_details_stream a
LEFT JOIN
station_details_table b
ON
a.end_station_id=b.id;
select
  • Join the streamed trip details with the station details table as KSQL does not allow joining of two streams or two tables.

Grouping Data

To group data based on the station details and the date, perform the following:

  • Format date as YYYY-MM-DD from the long timestamp to group by date in the start trip details using the below commands:
CREATE STREAM
citibike_trip_start_station_details_with_date AS
SELECT
TIMESTAMPTOSTRING(startime_timestamp, 'yyyy-MM-dd') AS DATE,
starttime,
start_station_id,
start_station_name
FROM
citibike_trip_start_station_details;
select
  • Format date as YYYY-MM-DD from the long timestamp to group by date in the end trip details using the below commands:
CREATE STREAM
citibike_trip_end_station_details_with_date AS
SELECT
TIMESTAMPTOSTRING(stoptime_timestamp, 'yyyy-MM-dd') AS DATE,
stoptime,
end_station_id,
end_station_name
FROM
citibike_trip_end_station_details;
select
  • Create a table by grouping the data based on the date and the stations for finding the started trip counts and the ended trip counts of each station for the day using the below commands:
CREATE TABLE
start_trip_count_by_stations AS
SELECT
DATE,
start_station_id,
start_station_name,
COUNT(*) AS trip_count
FROM
citibike_trip_start_station_details_with_date
GROUP BY
DATE,
start_station_name,
start_station_id;
select
CREATE TABLE
end_trip_count_by_stations AS
SELECT
DATE,
end_station_id,
end_station_name,
COUNT(*) AS trip_count
FROM
citibike_trip_end_station_details_with_date
GROUP BY
DATE,
end_station_name,
end_station_id;
select
  • List the topics to check whether the topics are created for persistent queries or not.

select

Producing Trip Details

The trip details are produced into the topic trip-details using Scala, in the same way as the station details. To check the produced messages, consume them using the below command:

./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic trip-details --from-beginning

select

From the above console output, it is evident that a total of 727664 messages are produced for data enrichment at the stream.

Viewing Output

Viewing Trip Details with Station Details

To view the trip details with the station details, perform the following:

  • Consume the message using the topic CITIBIKE_TRIP_START_STATION_DETAILS to view the extra fields added to trip details from the station details table and to extract the long timestamp field from the start and end times using the below commands:
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic CITIBIKE_TRIP_START_STATION_DETAILS --from-beginning
select
  • Consume the message using the topic CITIBIKE_TRIP_END_STATION_DETAILS using the below commands:
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic CITIBIKE_TRIP_END_STATION_DETAILS --from-beginning
select

From the above console output, it is evident that the fields of the station details are added to the trip while producing the trip details.

Viewing Aggregate Trip Count of Each Station

To view the aggregate trip count of each station based on the date, perform the following:

  • Consume the message via the console to check the trip counts obtained on the stream using the below commands:
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic START_TRIP_COUNT_BY_STATIONS --from-beginning
select

From the above console output, it is evident that the trip counts are updated and added to the topic for each day when producing the message. So, this data can be filtered to the latest trip count in consumer for further analysis.

  • Obtain the end trip count details based on the stations using the below commands:
./bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic END_TRIP_COUNT_BY_STATIONS --from-beginning
select

Conclusion

In this blog, we discussed adding extra fields from the station details table, extracting date in the YYYY-MM-DD format, and grouping the details based on the station ID & the day for getting the start and end trip count details of the station.

References


Crime Analysis Using H2O Autoencoders – Part 2

$
0
0

Overview

This is the second part of a two-part series on Crime Analysis using H2O Autoencoders. In our previous blog on Crime Analysis Using H2O Autoencoders – Part 1, we discussed building the analytical pipeline and applying Deep Learning to predict the arrest status of the crimes happening in Los Angeles (LA). Our Machine Learning model can be deployed as a jar file using POJO and MOJO objects. H2O-generated POJO and MOJO models are easily embeddable in a Java environment based on the autogenerated h2o-genmodel.jar file.

In this blog, let us discuss deploying the H2O Autoencoders model into a real-time production environment by converting it into POJO objects using H2O functions. As Autoencoders do not support MOJO models, the POJO model is used in this blog.

Dataset Description

The Los Angeles crime dataset from 2016-2017, with 224K records and 27 attributes, is used as the source file. For a more detailed description, refer to our previous blog on Crime Analysis Using H2O Autoencoders – Part 1.

Sample Deployment Model

select

Use Case

Deploy the H2O Autoencoders model into the production environment.

Synopsis

  • Generate JAR File for H2O Autoencoder Model
  • Run model
  • Deploy model into production environment
  • Implement machine learning model (Java Spring)
    • Set up model execution project
    • Set up model deployment project
  • Perform overall production deployment

Generating JAR File for H2O Autoencoder Model

The Autoencoders model created from our previous analysis is as follows:

select

To generate the JAR file, perform the following:

    • Download the Autoencoders model using h2o.download_pojo() function in H2O package.
    • Execute the below syntax to create a Java file along with the JAR file:

select

  • Download the Java file along with the JAR file using a Java Decompiler as shown in the below diagram:

select

Note: If the downloaded dependency JAR file does not contain logic to implement the autoencoder model, an UnsupportedOperationException error will be thrown similar to the one shown in the below diagram:

select

The error can be viewed in the PredictCsv.java file as shown in the below diagram:

select

Similarly, you can view other models such as BinomialModelPrediction, MultinomialModelPrediction, and so on.

To overcome this exception error, perform the following:

select

  • View the new jar file downloaded from the external site containing logic for the Autoencoders as shown in the below diagram:

select

Running Model

You need the Java file generated from the POJO object, an input file, and the h2o-genmodel.jar file with its dependencies to run the model.

To run the model, perform the following:

  • Use test_input.csv as an input file and output.csv as an output file.
  • Run the model with all the dependencies using the below commands:
javac -cp h2o-genmodel.jar -J-Xmx2g crime_model_auto.java
java -cp .;* hex.genmodel.tools.PredictCsv --header --model crime_model_auto --input test_input.csv --output output.csv
Note: As the Autoencoders return reconstruction MSE error values for all columns for each class, the arrest status of the crimes cannot be predicted.
    • Download the already trained Supervised Classification model as the POJO object using the pre-trained autoencoder model to predict the values.
    • Create a separate folder named “pre-trained” for this process.
    • Append all the JAR files into this folder.
    • Copy and paste the dependency JAR files and inputs into this folder.
    • Compile and run the Java file using the below commands:

select

  • Obtain the output of our prediction model. The output looks similar to the one shown below:

select

From the above results, it is evident that our model works fine as a standalone Java file. Let us convert this model into a JAR file and move it into the production environment along with h2o-genmodel.jar and input files.

Deploying Model into Production Environment

To deploy the model into the production environment, perform the following:

  • Convert the model into the JAR file with all the class files using the below command:
jar cf crime_model.jar *.class
select
  • Place the above setup on any server and run the JAR file using the below command:
java -cp .;* hex.genmodel.tools.PredictCsv --header --model crime_pretrained --input test_input.csv --output output.csv

Implementing Machine Learning Model (Java Spring)

To implement the POJO model in the Java environment using the Spring Framework, set up a simple Spring WebService project and pass the input as a JSON payload through a POST call.
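From the client side, the payload ends up as a plain JSON POST. A minimal Python sketch of such a call is shown below; the endpoint URL and field names are hypothetical, not the actual Spring controller routes.

import requests

# One crime record serialized as JSON and sent to a hypothetical prediction endpoint.
record = {"area_id": 12, "crm_cd": 624, "vict_age": 31, "vict_sex": "M"}
resp = requests.post("http://localhost:8080/crime/predict", json=record)
print(resp.status_code, resp.json())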

Setting Up Model Execution Project

To set up a model execution project, perform the following:

  • Parse an input CSV file and convert it into required Java collection objects.
  • Convert the collection objects into JSON string to pass it as a JSON payload in the POST call.
  • Create a function to make the JSON string as a valid request for our API call and to make all necessary connection objects within it.

Project Setup

select

Few class files in the project setup are:

  • CrimeModelExecution.java – Makes all the required function calls and converts the input file string into a valid JSON string. It is the core file for our project.
  • CSVParser.java – Parses a CSV file and converts it into required Java collections.
  • URLExecution.java – Contains functions to make the JSON string as the valid request for our API call. It makes all necessary connection objects within it.
  • StringUtil.java – All Util functions are made in this class.

Setting Up Model Deployment Project

To set up model deployment project, perform the following:

  • Convert the execution project into the JAR file with all its dependencies.
  • Initiate a server to run all APIs containing necessary logic to apply prediction on the dataset.
  • Setup the project in a server environment and pass the required input files as parameters.

The project setup is as follows:

select

Few class files in the project setup are:

  • CrimeController.java – Contains all APIs required to apply Model Prediction for the datasets and to pass the input as JSON payload through POST call and as the File format in POST call.
  • UtilHelper.java – Performs basic string datatype conversions.

The project is implemented based on dependencies present in the h2o-genmodel.jar (PredictCSV.java) file. So, add this JAR to our classpath during implementation.

Performing Overall Production Deployment

The overall production deployment involves analyzing the input, implementing a model using R scripts, downloading the model into required Java Objects, and implementing these objects in the production environment.

The flow of moving the Machine Learning models into the production environment is as follows:

select

To deploy the model, perform the following:

  • Upload all the codes in a specified location.
  • Create separate batch files (in Windows environment) for implementing R Script.
  • Make the project execution JAR.
  • Deploy the model in the production environment as shown in the below diagram:

select

Conclusion

In this blog, we discussed setting up a simple Spring Webservice project in Java environment and deploying the Machine Learning model in the real-time production environment using the command prompt and the POJO model. In our use case, the setup was performed on Windows. But, the same can be followed in any real-time server setup. The h2o-genmodel.jar file contains all the dependencies and default functionalities required to build the model using Java.

To know about building the analytical pipeline and applying Deep Learning to predict the arrest status of the crimes happening in Los Angeles, consider our previous blog on Crime Analysis Using H2O Autoencoders – Part 1.

References

Ingest IoT Sensor Data into S3 with Raspberry Pi3 & StreamSets Data Collector Edge

$
0
0

Overview

Due to the increasing amount of data produced outside of source systems, enterprises face difficulties in reading, collecting, and ingesting data into a desired central database system. An edge pipeline runs on an edge device with limited resources, receives data from another pipeline or reads the data from the device, and controls the device based on the data.

StreamSets Data Collector (SDC) Edge, an ultra-lightweight agent, is used to create end-to-end data flow pipelines in StreamSets Data Collector and to run the pipelines to read and export data in and out of the systems. In this blog, StreamSets Data Collector Edge is used to read data of air pressure BMP180 sensor from IoT Device (Raspberry Pi3) and StreamSets Data Collector is used to load the data into Amazon Simple Storage Service (S3) via MQTT.

Pre-requisites

  • Install StreamSets
  • Raspberry Pi3
  • BMP180 Sensor
  • Amazon S3 Storage

Use Case

  • Read air pressure BMP180 sensor data with IoT Device (Raspberry Pi3) and send to MQTT
  • Use SDC to load the data into Amazon S3 via MQTT

Synopsis

  • Connect BMP180 temperature/pressure sensor with Raspberry Pi3
  • Create edge sending pipeline
  • Create data collector receiving pipeline

Flow Diagram

select

Connecting BMP180 Temperature/Pressure Sensor with Raspberry Pi3

The I2C bus, a communication protocol, is used by the Raspberry Pi3 to communicate with other embedded IoT devices such as temperature sensors, displays, accelerometers, and so on. The I2C bus has two wires called SCL and SDA, where SCL is a clock line that synchronizes all data transfers over the I2C bus and SDA is a data line. The devices are connected to the I2C bus via the SCL and SDA lines.

To enable I2C drivers on Raspberry Pi3, perform the following:

  • Run sudo raspi-config.
  • Choose Interfacing Options from the menu as shown in the below diagram:

select

  • Choose I2C as shown in the below diagram:

select

Note: If I2C is not available in the Interfacing Options, check Advanced Options for I2C availability.

  • Click Yes to enable the I2C driver.
  • Click Yes again to load the driver by default.
  • Add i2c-dev to /etc/modules using the below commands:
pi@raspberrypi:~$ sudo nano /etc/modules
i2c-bcm2708
i2c-dev
  • Install i2c-tools using the below command:
pi@raspberrypi:~$ sudo apt-get install python-smbus i2c-tools
  • Reboot the Raspberry Pi3 from the command line using the below command:
sudo reboot
  • Ensure that the I2C modules are loaded and made active using the below command:
pi@raspberrypi:~$ lsmod | grep i2c
  • Connect the Raspberry Pi3 with the BMP180 temperature/pressure sensor as shown in the below diagram:

select

  • Ensure that the hardware and software are working fine with i2cdetect using the below command:
pi@raspberrypi:~$ sudo i2cdetect -y 1
select
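As an optional sanity check from Python, the sensor can also be probed for its chip-id register. The sketch below assumes the smbus2 package is installed on the Pi; it is not part of the SDC Edge pipeline itself.

from smbus2 import SMBus

BMP180_ADDRESS = 0x77      # same I2C address used in the edge pipeline below
CHIP_ID_REGISTER = 0xD0

with SMBus(1) as bus:      # I2C bus 1 on the Raspberry Pi3
    chip_id = bus.read_byte_data(BMP180_ADDRESS, CHIP_ID_REGISTER)

# BMP180/BMP085 devices report 0x55 from the chip-id register.
print("BMP180 detected" if chip_id == 0x55 else "Unexpected chip id: 0x%02X" % chip_id)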

Building Edge Sending Pipeline

To build an edge sending pipeline for reading the sensor data, perform the following:

  • Create an SDC Edge Sending pipeline on StreamSets Data Collector.
  • Read the data directly from the device (using I2C Address) using “Sensor Reader” component.
  • Set the I2C address as “0x77”.
  • Use an Expression Evaluator to convert temperature from Celsius to Fahrenheit.
  • Publish data to MQTT topic as “bmp_sensor/data”.
  • Download the SDC Edge pipeline’s executable format (Linux) and move it to the device side (Raspberry Pi3), where the pipeline runs.
  • Start SDC Edge from the SDC Edge home directory on the edge device using the following command:
bin/edge --start=<pipeline_id>
For example:
bin/edge --start=sendingpipeline137e204d-1970-48a3-b449-d28e68e5220e
select

Building Data Collector Receiving Pipeline

To build a data collector receiving pipeline for storing the received data in Amazon S3, perform the following:

  • Create a receiving pipeline on the StreamSets Data Collector.
  • Use MQTT subscriber component to consume data from MQTT topic (bmp_sensor/data).
  • Use Amazon S3 destination component to load the data into Amazon S3.
  • Run the receiving pipeline in the StreamSets Data Collector.

select

The real-time air pressure data collected and stored is shown in the below diagram:

select

Conclusion

In this blog, we discussed reading the air pressure BMP180 sensor data from Raspberry Pi3 using StreamSets Data Collector Edge and loading the collected data into Amazon S3 via MQTT using StreamSets Data Collector.

Custom Partitioning and Analysis using Kafka SQL Windowing

$
0
0

Overview

Apache Kafka uses a round-robin fashion to produce messages to multiple partitions. The custom partitioning technique is used to produce a particular type of message to a defined partition and to have the produced messages consumed by a particular consumer. This technique gives us control over the produced messages. Windowing allows event-time-driven analysis and data grouping based on time limits. The three different types of windowing are Tumbling, Session, and Hopping.

In this blog, we will discuss processing Citibike trip data in the following ways:

  • Partitioning trip data based on user type using the custom partitioning technique.
  • Analyzing trip details at stream using Kafka SQL Windowing.

Pre-requisites

Install the following:

  • Scala
  • Java
  • Kafka
  • Confluent
  • KSQL

Data Description

Trip dataset of Citi Bike March 2017 is used as the source data. It contains basic details such as trip duration, start time, stop time, station name, station ID, station latitude, and station longitude.

Sample Dataset

select

Use Case

  • Process Citibike trip data to two different brokers by partitioning the messages according to user types (Subscriber or Customer).
  • Use Kafka SQL Windowing concepts to analyze the following details:
    • Number of trips started at particular time limits using Tumbling Window.
    • Number of trips started using advanced time intervals using Hopping Window.
    • Number of trips started with session intervals using Session Window.

Synopsis

  • Set up Kafka cluster
  • Produce and consume trip details using custom partitioning
  • Create trip data stream
  • Perform streaming analytics using Window Tumbling
  • Perform streaming analytics using Window Session
  • Perform streaming analytics using Window Hopping

Setting Up Kafka Cluster

To set up the cluster on the same server by changing the ports of the brokers in the cluster, perform the following steps:

  • Run ZooKeeper on default port 2181.
    The ZooKeeper data will be stored by default in /tmp/data.
  • Change the default path (/tmp/data) to another path with enough space for non-disrupted producing and consuming.
  • Edit the ZooKeeper configurations in zookeeper.properties file available in the confluent base path etc/kafka/zookeeper.properties as shown in the below diagram:

select

  • Start the ZooKeeper using the following command:
./bin/zookeeper-server-start etc/kafka/zookeeper.properties
You can view the below ZooKeeper startup screen:

select

  • Start 1st broker in the cluster by running default Kafka broker in port 9092 and setting broker ID as 0.
    The default log path is /tmp/kafka-logs.
  • Edit the default log path (/tmp/kafka-logs) for starting the 1st broker in the server.properties file available in the confluent base path.
    vi etc/kafka/server.properties.

select

  • Start the broker using the following command:
./bin/kafka-server-start etc/kafka/server.properties
 

You can view the 1st broker startup with broker ID 0 and port 9092:

select

  • Start 2nd broker in the cluster by copying server.properties as server1.properties under etc/kafka/ for configuring 2nd broker in cluster.
  • Edit server1.properties.
    vi etc/kafka/server1.properties.

select

  • Start the broker using the following command:
./bin/kafka-server-start etc/kafka/server1.properties
You can view the 2nd broker startup with broker ID 1 and port 9093:

select

  • List the brokers available in the cluster using the following command:
./bin/zookeeper-shell localhost:2181 ls /brokers/ids
You can view the brokers available in the cluster as shown in the below diagram:

select

In the above case, two brokers are started on the same node. If the brokers were on different nodes, parallel message processing would be faster, and memory issues from producing a large number of messages could be mitigated by sharing the messages across the nodes' memory.

Producing and Consuming Trip Details Using Custom Partitioning

To produce and consume trip details using custom partitioning, perform the following steps:

  • Create topic trip-data with two partitions using the following command:
./bin/kafka-topics --create --zookeeper localhost:2181 --topic trip-data --replication-factor 1 --partitions 2
select
  • Describe the topic to view the leaders of partitions created.

You can see broker 0 responsible for partition 0 and broker 1 responsible for partition 1 for message transfer as shown in the below diagram:

select

• Use the custom partitioner technique to produce messages.
• Create a CustomPartitioner class by overriding the partitioner interface using the below code:

override def partition(topic : String, key : Any, keyBytes : Array[Byte],value : Any, valueBytes : Array[Byte], cluster : Cluster) : Int = {
var partition = 0
val keyInt = Integer.parseInt(key.asInstanceOf[String])
val tripData = value.asInstanceOf[String]
//Gets the UserType from the message produced
val userType = tripData.split(",")(12)
//Assigns the partitions to the messages based on the user types
if("Subscriber".equalsIgnoreCase(userType)) {
partition = 0;
} else if ("Customer".equalsIgnoreCase(userType)){
partition = 1;
}
println("Partition for message "+value+" is "+partition)
partition
}

You can view the Subscriber user type messages produced into partition 0 and the Customer user type messages routed to partition 1.

  • Define the CustomPartitioner class in producer properties as shown below:
//Splits messages to particular partitions
props.put("partitioner.class", "com.treselle.core.CustomPartitioner");
  • Assign specific partitions of the topic to each consumer as shown below:
val topicPartition = new TopicPartition(TOPIC,partition)
consumer.assign(Collections.singletonList(topicPartition))
  • Pass the partition number as a command-line argument to the consumer when running multiple consumers, each listening to a different partition.
  • Start multiple consumers with different partitions.
  • Start Consumer1 using the below command:
java -cp custom_partitioner.jar com.treselle.core.ConsumerBasedOnPartition trip-data localhost:9092 0
  • Start Consumer2 using the below command:
java -cp custom_partitioner.jar com.treselle.core.ConsumerBasedOnPartition trip-data localhost:9092 1
  • Produce the trip details by defining the custom partitioner using the below command:
java -cp custom_partitioner.jar com.treselle.core.CustomPartionedProducer trip-data localhost:9092
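For reference, a minimal Scala sketch of the producer side is shown below. It assumes string keys and values and a hypothetical input file name (trip_data.csv); the object name is also illustrative. It only shows how partitioner.class ties the CustomPartitioner into the producer:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object CustomPartitionedProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092,localhost:9093")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // Splits messages to particular partitions
    props.put("partitioner.class", "com.treselle.core.CustomPartitioner")

    val producer = new KafkaProducer[String, String](props)
    // Each line of the trip details file becomes one message; the key is the record number
    // (trip_data.csv is a hypothetical file name)
    scala.io.Source.fromFile("trip_data.csv").getLines().zipWithIndex.foreach {
      case (line, index) =>
        producer.send(new ProducerRecord[String, String]("trip-data", index.toString, line))
    }
    producer.close()
  }
}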
You can see Consumer1 consuming only Subscriber messages from partition 0 and Consumer2 consuming only Customer messages from partition 1.

Consumer1

select

Consumer2

select

  • Check the memory used by the brokers after all the messages have been consumed by both consumers.

The memory shared between the brokers and the size of each broker's logs can be viewed in the below diagram:

select

Here, the Customer messages are handled by the broker at localhost:9092 and the Subscriber messages by the broker at localhost:9093. Since there are fewer Customer messages, less memory is occupied in its kafka-logs directory (localhost:9092).

Creating Trip Data Stream

In KSQL, there is no option to consume the messages based on the partitions. The messages are consumed from all the partitions in the given topic for stream or table creation.

To create trip data stream, perform the following steps:

  • Separate the Subscriber and Customer data using conditions for Window processing.
  • Create trip_data_stream with columns in trip data produced using the following command:
CREATE STREAM
trip_data_stream
(
tripduration BIGINT,
starttime VARCHAR,
stoptime VARCHAR,
start_station_id BIGINT,
start_station_name VARCHAR,
start_station_latitude DOUBLE,
start_station_longitude DOUBLE,
end_station_id BIGINT,
end_station_name VARCHAR,
end_station_latitude DOUBLE,
end_station_longitude DOUBLE,
bikeid INT,
usertype VARCHAR,
birth_year VARCHAR,
gender VARCHAR
)
WITH
(
kafka_topic='trip-data',
value_format='DELIMITED'
);
  • Extract a Unix TIMESTAMP from the trip start time for windowing.
  • Set the extracted start time Unix TIMESTAMP as the TIMESTAMP property of the stream, so that windowing is based on the trip start times instead of the message produce time.
  • Create the stream with the extracted Unix TIMESTAMP, keeping only the Subscriber messages, to find the trip details of the subscribers using the below command:
CREATE STREAM
subscribers_trip_data_stream
WITH
(
TIMESTAMP='startime_timestamp',
PARTITIONS=2
) AS
select
STRINGTOTIMESTAMP(starttime, 'yyyy-MM-dd HH:mm:ss') AS startime_timestamp,
tripduration,
starttime,
usertype
FROM TRIP_DATA_STREAM
where usertype='Subscriber';

Performing Streaming Analytics Using Window Tumbling

Window tumbling groups the data into non-overlapping, fixed-size windows of a given interval. It is useful, for example, for detecting anomalies in the stream over a certain time interval. As an example, consider tumbling with a time interval of 5 minutes.

select

To find the number of trips started by subscribers at the interval of 5 minutes, execute the following command:

SELECT
COUNT(*),
starttime
FROM subscribers_trip_data_stream
WINDOW TUMBLING (SIZE 5 MINUTE)
GROUP BY usertype;

select

From the above result, it is evident that 19 trips have been started by the end of the 4th minute, 25 trips by the end of the 9th minute, and 26 trips by the end of the 14th minute. Thus, the started trips are counted within each given interval of time.

Performing Streaming Analytics Using Window Session

In window session, data is grouped into sessions of activity. For example, when a session interval of 1 minute is set and no data arrives within 1 minute, a new session is started for grouping the data. Consider a session of 1 minute working as shown in the following diagram:

select

To group the subscribers' trip starts by session, set the session interval to 20 seconds using the below command:

SELECT
count(*),
starttime
FROM subscribers_trip_data_stream
WINDOW SESSION (20 SECOND)
GROUP BY usertype;

select

From the above result, it is evident that the data is grouped within each session. When no data arrives within a 20-second interval, a new session is started for grouping the data.

For example, consider the time range between 00:01:09 and 00:01:57. Between 00:01:09 and 00:01:33, there is no gap of 20 seconds or more, so the trip count keeps incrementing within the same session. Between 00:01:33 and 00:01:57, there is an inactivity gap of more than 20 seconds, so a new session is started at the 57th second.

Performing Streaming Analytics Using Window Hopping

In window hopping, data is grouped into overlapping windows of a given size that advance by a given interval. For example, consider a window size of 5 minutes with an advance interval of 1 minute, as shown in the below diagram:

select

To group the trip starts into 5-minute windows advancing by 1 minute, execute the following command for hopping window analysis:

SELECT
count(*),
starttime
FROM subscribers_trip_data_stream
WINDOW HOPPING (SIZE 5 MINUTE, ADVANCE BY 1 MINUTE)
GROUP BY usertype;

select

From the above result, it is evident that each record appears in 5 entries, since a 5-minute window advancing by 1 minute places every event in 5 overlapping windows. The number of entries varies based on the window size and the advance interval.

In the above example, consider the record at 00:02:12 to check how hopping works with a 5-minute window advancing by 1 minute. The 00:02:12 record has five entries with trip counts 7, 7, 7, 6, and 1. Within the first 2 minutes, only two 1-minute advances have been made, so the first three entries all cover the interval 00:00:00 to 00:02:12, which has 7 started trips. The 4th entry reflects an advance of 1 minute, covering 00:01:00 to 00:02:12 with 6 trips, and the 5th entry reflects another 1-minute advance, so the window from 00:02:00 to 00:02:12 has only 1 trip.

Conclusion

In this blog, we discussed a custom partitioning technique to split the trip details into two partitions based on user type. We also discussed KSQL windowing concepts such as window tumbling, window session, and window hopping, and how they work on the trip start times, to understand the differences between the types of windowing.


Customer Churn – Logistic Regression with R

$
0
0

Overview

In the customer management lifecycle, customer churn refers to a customer's decision to end the business relationship; it is also referred to as loss of clients or customers. Customer loyalty and customer churn always add up to 100%: if a firm has a loyalty rate of 60%, then its churn rate is 40%. As per the 80/20 customer profitability rule, 20% of customers generate 80% of the revenue. So, it is very important to predict which users are likely to churn and the factors affecting those decisions. In this blog post, we show how a logistic regression model built with R can be used to identify customer churn in a telecom dataset.

Learning/Prediction Steps

churn_lr_model_diagram

Data Description

The telecom dataset has details for 7000+ unique customers, where each customer is represented by a unique row. The structure of the dataset is shown below:

chrun_lr_dataframe

Input Variables: These variables are also called predictors or independent variables.

  • Customer Demographics (Gender and Senior citizenship)
  • Billing Information (Monthly and Annual charges, Payment method)
  • Product Services (Multiple line, Online security, Streaming TV, Streaming Movies, and so on)
  • Customer relationship variables (Tenure and Contract period)

Output Variables: These variables are also called response or dependent variables. Since the output variable (churn value) takes the binary form "0" or "1", this is a classification problem in supervised machine learning.

chrun_lr_head

Data Preprocessing

    • Data cleansing and preparation are done in this step. Transforming continuous variables into meaningful factor variables improves model performance and helps in understanding the insights of the data. For example, in this dataset, the tenure variable is converted into a factor variable with ranges in months, which helps in understanding how customer tenure relates to the churn decision.
    • As part of data cleansing, the missing values are identified using the missing map plot. The telecom dataset has a minimal number of records with missing values, and these are dropped from the analysis.

churn_lr_na chrun_lr_missing_plot

    • Custom logic is implemented to derive categorical variables from the tenure variable and other continuous variables (a minimal sketch is given at the end of this section). Since they do not affect the prediction, the customer id and raw tenure values are dropped from further processing.

chrun_lr_custom_logic

    • New categorical feature is created as mentioned above.

churn_lr_feature_head

    • A few categorical variables have duplicate reference values that refer to the same level. For example, the "MultipleLine" feature has the possible values "Yes", "No", and "No Phone Service". Since "No" and "No Phone Service" have the same meaning, these records are replaced with a single reference value.

churn_lr_categorical_var
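A minimal R sketch of the preprocessing described above is shown below. The data frame and column names (churn_data, tenure, MultipleLines, OnlineSecurity) and the bin boundaries are illustrative assumptions, not the exact code behind the figures:

# drop the few records with missing values identified in the missing map
churn_data <- na.omit(churn_data)

# derive a tenure interval factor from the continuous tenure value (bins are illustrative)
churn_data$tenure_interval <- cut(
  churn_data$tenure,
  breaks = c(-Inf, 6, 12, 24, 36, 48, 60, Inf),
  labels = c("0-6 Month", "6-12 Month", "12-24 Month", "24-36 Month",
             "36-48 Month", "48-60 Month", "> 60 Month")
)

# collapse duplicate reference levels such as "No phone service" to "No"
churn_data$MultipleLines  <- as.factor(gsub("No phone service", "No", churn_data$MultipleLines))
churn_data$OnlineSecurity <- as.factor(gsub("No internet service", "No", churn_data$OnlineSecurity))

# customer id and the raw tenure value do not help prediction, so drop them
churn_data$customerID <- NULL
churn_data$tenure     <- NULL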

Partitioning the Data & Logistic Regression

    • In predictive modeling, the data needs to be partitioned into train and test sets: 70% of the data is used for training and 30% for testing.
    • In this dataset, 4K+ customer records are used for training and 2K+ records for testing.
    • Classification algorithms such as Logistic Regression, Decision Tree, and Random Forest, available in R, Python, or Spark ML, can be used to predict churn.
    • Multiple models can be run on the telecom dataset to compare their performance and error rates and choose the best model. In this blog post, we use a logistic regression model built in R with the glm function (a minimal sketch follows the figure below). Future blogs will focus on other models and combinations of models.

churn_lr_train_test
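A minimal sketch of the 70/30 split and the logistic regression fit is shown below; the churn_data data frame and the Churn column name are assumptions carried over from the preprocessing sketch:

# 70% of the records for training, 30% for testing
set.seed(123)
train_rows <- sample(seq_len(nrow(churn_data)), size = floor(0.7 * nrow(churn_data)))
train_data <- churn_data[train_rows, ]
test_data  <- churn_data[-train_rows, ]

# logistic regression on all remaining predictors using the binomial family
churn_model <- glm(Churn ~ ., data = train_data, family = binomial(link = "logit"))
summary(churn_model)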

Model Summary

From the model summary, the response churn variable is affected by the tenure interval, contract period, paperless billing, senior citizen, and multiple line variables. The importance of each variable is indicated by the significance codes next to the coefficients (*** – high importance, * – medium importance, and . – the next level of importance). Rerunning the model with only these significant variables will impact the model performance and accuracy.

churn_lr_model

Prediction Accuracy

    • Models built using the train dataset are tested against the test dataset. Accuracy and error rate are used to understand how the models behave on the test dataset, and the best model is selected based on these measures.
    • Confusion Matrix / Misclassification Table: a table used to describe the performance of a classification model on test data. It cross-tabulates the actual values with the predicted values based on the counts of correctly classified and wrongly classified customers.

chrun_lr_cf_basics

    • The various measures derived from the confusion matrix are:

churn_lr_cf_derive

    • With logistic regression, the accuracy of this model is evaluated as 80% and the error rate as 20% (a minimal sketch of the calculation follows the figure below). The accuracy can be improved with other classification models such as decision tree and random forest with parameter tuning.

chrun_lr_cf_results
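A minimal sketch of scoring the test set and deriving accuracy and error rate from the confusion matrix is shown below; the 0.5 probability cut-off and the 0/1 coding of the Churn column are assumptions:

# predicted churn probabilities on the test set
pred_prob  <- predict(churn_model, newdata = test_data, type = "response")
# classify using a 0.5 cut-off (assumption)
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# confusion matrix: predicted vs actual, and the derived measures
conf_matrix <- table(Predicted = pred_class, Actual = test_data$Churn)
accuracy    <- sum(diag(conf_matrix)) / sum(conf_matrix)
error_rate  <- 1 - accuracy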


The post Customer Churn – Logistic Regression with R appeared first on treselle.com.

Embrace Relationships with Neo4J, R & Java

$
0
0

Introduction

Graphs are everywhere, used by everyone, for everything. Neo4j is one of the most popular graph databases and can be used to make recommendations, get social, find paths, uncover fraud, manage networks, and so on. A graph database can store any kind of data using Nodes (graph data records), Relationships (connections between nodes), and Properties (named data values).

A graph database can be used for connected data in ways that are not practical with relational or other NoSQL databases, as they lack native relationships and multi-depth traversals. Graph databases embrace relationships, which naturally form paths, and querying or traversing the graph involves following those paths. Because of the fundamentally path-oriented nature of the data model, the majority of path-based graph database operations are highly aligned with the way the data is laid out, making them extremely efficient.

Use Case

This use case is based on a modified version of a StackOverflow dataset. It shows a network of programming languages, the questions that refer to these languages, and the users who asked and answered these questions, and how these nodes are connected with relationships to find deeper insights in the Neo4j graph database, which is otherwise hard to achieve with a common relational database or other NoSQL databases.

What we want to do:

  • Prerequisites
  • Download StackOverflow Dataset
  • Data Manipulation with R
  • Create Nodes & Relationships file with Java
  • Create GraphDB with BatchImporter
  • Visualize Graph with Neo4J

Solution

Prerequisites

  • Download and Install Neo4j: We will be using the Neo4j 2.x version, and installing it on Windows is very easy. Follow the instructions at the below link to download and install.

Note: Neo4j 2.x requires JDK 1.7 and above.

http://www.neo4j.org/download/windows

  • Download and Install RStudio: We will be using R to perform some data manipulation on the StackOverflow dataset, which is available in RData format; this includes filtering, altering, and dropping columns, among others. This is done to show the power of R with respect to data manipulation, and the same can be done in other programming languages as well. Download the open source edition of RStudio from the below link.

http://www.rstudio.com/products/rstudio/#Desk

Download StackOverflow Dataset

  • Download Dataset: This use case is based on a modified version of a StackOverflow dataset, which is rather old and available in both CSV and RData formats. Follow the below links to download the dataset: the first link contains details about the various fields, and the second link downloads the RData file.

http://www.ics.uci.edu/~duboisc/StackOverflow

http://www.ics.uci.edu/~duboisc/StackOverflow/answers.Rdata

  • Understanding Dataset:

We will be mostly interested in the following fields which will be used to create nodes and relationships in Neo4j.

qid: Unique question id
i: User id of questioner
qs: Score of the question
tags: a comma-separated list of the tags associated with the question that refers to programming languages
qvc: Number of views of this question
aid: Unique answer id
j: User id of answerer
as: Score of the answer

 

Data Manipulation with R

We will reshape the dataset to fit our needs and appreciate the power of data manipulation with R. The actual RData contains around 250K rows, but this use case performs the following manipulation to keep it interesting and small.

  • Open RStudio and Set Working Directory: Open RStudio and set the working directory to where the RData file was downloaded.
  • Load and Perform Data Manipulation:




Note: Ignore the warning message
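A minimal sketch of the kind of manipulation described, assuming the RData file loads a data frame named answers and using purely illustrative filters and paths, might look like the following:

# working directory set to where answers.Rdata was downloaded (path is illustrative)
setwd("C:/so_neo4j")
load("answers.Rdata")

# keep only the fields needed to build nodes and relationships
so_data <- answers[, c("qid", "i", "qs", "tags", "qvc", "aid", "j", "as")]

# keep a small, interesting subset of questions tagged with a few popular languages (illustrative filter)
so_data <- subset(so_data, grepl("java|php|javascript|c#|python", tags, ignore.case = TRUE))

# write the reshaped data for the Java program that builds the node and relationship files
write.csv(so_data, "finaldata.csv", row.names = FALSE)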

Create Nodes and Relationship file with Java

We will write a Java program that takes the finaldata.csv generated from the above R program and creates multiple node files and a single relationship file that contains the relations between the nodes. Our node and relationship structure is as follows:

Nodes: question_nodes, answer_nodes, user_nodes, lang_nodes
Relationships: The following are the relationships

 

  • Details about the Java Program: This Java program is self-explanatory and simply creates node and relationship files in CSV format, as needed by the Neo4j Batch Importer program. A few things to keep in mind about the Java program:
    • The format of Nodes file is as follows:

       

       

    • The format of Relationship file is as follows:

       

       

    • lang_nodes is created manually as it is static. All other node files and the relationship file are generated programmatically.

       

       

    • finaldata.csv is renamed to sodata.csv (optional)
    • The dataset doesn’t come with the names of questioners and answerers, so we downloaded some fictional names and associated them with the user ids. This will make more sense when we view them in the Neo4j graphical interface. A file of around 1500 fictional names was created from http://homepage.net/name_generator/ and stored as “random_names.txt”.

       

       

  • Java Program to Create Nodes & Relationships:

Note: The below program depends only on the OpenCSV library, which can be downloaded from http://sourceforge.net/projects/opencsv/

    • Output of the Program: 

Run the above program from the command line or within Eclipse to create question_nodes.csv, answer_nodes.csv, user_nodes.csv, and rels.csv. Click here to download the nodes and relationships zip file to quickly run it through Batch Importer and create the graph DB.

Create GraphDB with Batch Importer

  • Download and Set up Batch Importer: The Batch Importer is a separate library that creates the graph DB data files needed by Neo4j. Its input is configured in the batch.properties file, which indicates which files to use as nodes and relationships. More details about the Batch Importer can be found in the readme at https://github.com/jexp/batch-import/tree/20

Download Link: https://dl.dropboxusercontent.com/u/14493611/batch_importer_20.zip

Note: Unzip to the location where the nodes and relationship files are created by the Java program.

      • Create batch.properties: Create the batch.properties file as shown below. The details of each property are better explained on the Batch Importer site. The highlighted properties are the most important ones, as they define the nodes and relationship input files.
      • Execute Batch Importer: Execute the batch importer program with import.bat within the Batch Importer directory and pass batch.properties and the name of the graph DB file to create.

Visualize Graph with Neo4j

  • Copy graph.db file: Create a new directory “data” under the root of the Neo4j installation directory and copy graph.db to the data directory. This is optional but recommended, so that graph.db is kept in the same location as Neo4j.
  • Start Neo4j: Execute the “neo4j-community” file under the bin directory of Neo4j to start Neo4j. You will be prompted to choose the location of the graph.db file.
  • Visualize Graphs:
    • Launch Neo4j Web Console: http://localhost:7474/browser/
  • Navigate to Graphs: Click on the bubbles on the left top and choose “*”
  • Customize Graph Attributes: Double click on “Java” node and choose “name” as the caption.
  • Explore Graphs: The below exploration shows the following:

Tracing the orange line shows that the user Trevor, who answered a Java question (aid_853052), also asked a PHP question (qid_865476). Tracing the red line shows that the user Audrey answered two Java questions (aid_853030 and aid_892379). It’s a lot of fun to work with a graph database as the traversals are limitless. Note that the user names are fictional and not real users.

 

Conclusion

  • Neo4j is one of the best graph databases around and comes with the powerful Cypher Query Language, which enables us to traverse the nodes via their relationships and node properties as well. We will be covering Cypher in our next blog post based on this graph data.
  • R is very handy in performing many data manipulation techniques to quickly cleanse, transform, and alter the data to our needs.
  • Neo4j also comes with Rest API to add nodes and relationships dynamically on the existing graph DB.


The post Embrace Relationships with Neo4J, R & Java appeared first on treselle.com.
