
Sales Data Analysis using Dataiku DSS


Overview

Dataiku Data Science Studio (DSS), a complete data science software platform, is used to explore, prototype, build, and deliver data products. It significantly reduces the time taken by data scientists, data analysts, and data engineers to perform data loading, data cleaning, data preparation, data integration, and data transformation when building powerful predictive applications.

It makes exploring data and performing data cleansing easy and user-friendly. It supports data sources such as Filesystem, FTP, HTTP, SSH, SFTP, cloud storage (S3), PostgreSQL, MySQL, Hadoop (HDFS), Oracle, MS SQL Server, analytic SQL databases (Vertica, Greenplum, Redshift, Teradata, and Exadata), and NoSQL stores (MongoDB, Cassandra, and Elasticsearch).

In this blog, let us discuss data cleansing, data transformation, and data visualization of a financial company's sales data using Dataiku DSS.

Pre-requisites

Download and install Dataiku DSS Version 4.0.4 on Ubuntu from the following link:
https://www.dataiku.com/dss/trynow/linux/

Importing Dataset

To import a dataset into Dataiku DSS, perform the following:

  • Open Dataiku DSS.
  • Create a new Project.
  • Click Add New Dataset and click Add a File to upload a new dataset.
  • Choose the required Filesystem and click Preview to view the added file.
    The dataset looks similar to the one below:


The storage type and meaning of the data are automatically detected from the content of the columns, where the “meaning” is a rich semantic type. For example, DSS automatically detects the meaning of a column containing email IDs and sets it to “E-mail address”.

Data Cleansing

Analyze the meaning of each column to explore the data and perform data cleansing.

For example, the E-mail address column has Valid, Invalid, and Empty data as shown in the below diagram:


Apply a filter to remove invalid email IDs.

For example, the Price column has both plain integer values and values containing commas (,), as shown in the below diagram:


Apply a filter to remove the values with commas as shown in the below diagram:


Data Transformation

Data Preparation Recipes

This recipe supports filtering and flagging rows, managing dates, sampling, and geographic processing.

To prepare data, perform the following:

  • Parse and format date columns.
  • Calculate the difference between the account created date and the last login date to derive the number of dormant days.
  • Convert currency values to the required currency type (for example, INR to USD).
  • Filter out unwanted columns by name.
  • Concatenate two column values with delimiters as shown in the below diagram:


  • Calculate GeoPoint by giving latitude and longitude as input as shown in the below diagram:


You can also extract latitude and longitude from the given GeoPoint.

Visual Recipes

Visual recipes are used to create new datasets by transforming existing datasets.

Filter Recipe

This recipe filters invalid rows/cells; filters rows/cells by date range, numerical range, or value; and filters rows/cells with a formula. It supports both filtering and flagging rows. Records that have not been accessed for a long period are filtered out using this recipe, as shown in the below diagram:


Split Recipe

This recipe splits the rows of one dataset into several other datasets based on defined rules. The dataset split on state, with “Ireland” dropped, is shown in the below diagram:


Grouping – Aggregating Data Recipe

This recipe allows you to perform aggregations on any dataset and is equivalent to a SQL “GROUP BY” statement. It offers visual tools to set up aggregations and post-filters. The rows aggregated by product, with the count and distinct count of state and country, are shown in the below diagram:


The rows after applying a post-filter requiring state_count to be at least 100 are shown in the below diagram:


Joining Datasets Recipe

To join two datasets, perform the following:

  • In the “Join” section of the recipe, click the “Add input” button to add a join.
  • Select the two datasets to be joined.
  • Select Join Type and choose the appropriate join type, such as “Inner Join”, “Outer Join”, or “Left Join”, as shown in the below diagram:


  • Click Conditions to add conditions.
    The inner join based on Transaction_ID and Product is shown in the below diagram:


  • On completing the join definition, go to the “Selected Columns” section of the recipe and select the required columns from each dataset.


The Original Price and Profit calculated using formulas are shown in the below diagram:


Stacking Datasets Recipe

This recipe merges several datasets into one dataset and is the equivalent of a SQL UNION ALL statement.

Data Visualization

The built datasets can be visualized in the form of charts on a Dashboard.

Average of Transaction ID Count by Payment Type


Profit by Country


Product Count by Country


Profit by Country and Product


Average of Profit by Year


Here is the flow created:


Conclusion

In this blog, we discussed importing datasets into Dataiku DSS, performing data transformation, and visualizing data in Dataiku DSS.

References


Apache Spark on YARN – Performance and Bottlenecks


Overview

Apache Spark 2.x ships with the second-generation Tungsten engine. This engine is built upon ideas from modern compilers to emit optimized code at runtime that collapses an entire query into a single function by using the “whole-stage code generation” technique, thereby eliminating virtual function calls and leveraging CPU registers for intermediate data. This optimization is applied only to Spark's high-level APIs such as DataFrame and Dataset, and not to the low-level RDD API.

Though Tungsten optimizes the Spark application code at runtime, Spark application performance can be further improved by tuning configuration parameters, parallelism, and the JVM, and by tuning the YARN configuration if the Spark application runs on YARN.

This is the first article of a four-part series about Apache Spark on YARN. In this blog, let us discuss high-level and low-level Spark API performance. The SFO Fire Department Calls-for-Service dataset and the YARN cluster manager are used to test as well as tune the application performance.

Our other articles of the four-part series are:

  • Apache Spark on YARN – Resource Planning
  • Apache Spark Performance Tuning – Degree of Parallelism
  • Apache Spark Performance Tuning – Straggler Tasks

About Dataset

The SFO Fire Calls-For-Service dataset includes responses of all fire units to calls. It has 34 columns and about 4.36 million rows and is updated on a daily basis. For more details about this dataset, refer to the SFO website (the link is provided in the reference section).

SFO Fire Department Dataset in HDFS

About Apache Hadoop Cluster

A two-node Apache Hadoop cluster is set up using the HDP 2.6 distribution, which ships with Spark 2.1 and is used for executing the Spark applications.

Instance details: m4.xlarge (4 cores, 16 GB RAM)

Cluster details: The summary of cluster setup is shown in the below diagram:

HDP Cluster Summary

Use Case

To understand the Spark performance and application tuning, a Spark application is created using RDD, DataFrame, Spark SQL, and Dataset APIs to answer the below questions from the SFO Fire department call service dataset.

  • How many types of calls were made to the fire department?
  • How many incidents of each call type were there?
  • How many years of fire service calls are in the data file?
  • How many service calls were logged in for the past 7 days?
  • Which neighborhood in SF generated the most calls last year?

To answer all the questions except the first one, data grouping must be performed (a data shuffle, in Spark terms).

Note: One Spark task can handle one partition (partition = data + computation logic).

Low-Level and High-Level API Implementation

In this section, let us discuss the low-level and high-level Spark API implementations used to answer the above questions. For more details about the APIs, please refer to the Spark website.

Resilient Distributed Dataset (RDD) Implementation (Low-Level API)

  • The RDD API, available in Spark since the 1.0 release, can process both structured and unstructured data easily and efficiently.
  • RDDs do not take advantage of Spark’s optimizers such as Catalyst and Tungsten; developers need to optimize each RDD based on its characteristics.
// NUMBER OF RECORDS IN THE FILE
val totalRecords = filteredFireServiceCallRDD.count()
println(s"Number of records in the data file: $totalRecords")

// Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?
println(s"Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?")
val distinctTypesOfCallsRDD = filteredFireServiceCallRDD.map(x => x(3))
distinctTypesOfCallsRDD.distinct().collect().foreach(println)

// Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?
println(s"Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?")
val distinctTypesOfCallsSortedRDD = distinctTypesOfCallsRDD.map(x => (x, 1)).reduceByKey((x, y) => (x + y)).map(x => (x._2, x._1)).sortByKey(false)
distinctTypesOfCallsSortedRDD.collect().foreach(println)

// Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?
println(s"Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?")
 val fireServiceCallYearsRDD = filteredFireServiceCallRDD.map(convertToYear).map(x => (x, 1)).reduceByKey((x, y) => (x + y)).map(x => (x._2, x._1)).sortByKey(false)
fireServiceCallYearsRDD.take(20).foreach(println)

// Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?
println(s"Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?")
val last7DaysServiceCallRDD = filteredFireServiceCallRDD.map(convertToDate).map(x => (x, 1)).reduceByKey((x, y) => (x + y)).sortByKey(false)
last7DaysServiceCallRDD.take(7).foreach(println)

// Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR? 
println(s"Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?")
val neighborhoodDistrictCallsRDD = filteredFireServiceCallRDD.filter(row => (convertToYear(row) == "2016")).map(x => x(31)).map(x => (x, 1)).reduceByKey((x, y) => (x + y)).map(x => (x._2, x._1)).sortByKey(false)
neighborhoodDistrictCallsRDD.collect().foreach(println)

DataFrame Implementation (High-Level API)

  • Introduced in Spark 1.3 to improve the performance and scalability of Spark.
  • Introduces the concept of a schema to describe the data and is radically different from the RDD API as it is an API for building a relational query plan that Spark’s Catalyst optimizer can execute.
  • Gains the advantage of Spark’s optimizers such as Catalyst and Tungsten.
// NUMBER OF RECORDS IN THE FILE
val totalRecords = fireServiceCallDF.count()
println(s"Number of records in the data file: $totalRecords")

// Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?
println(s"Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?")
val distinctTypesOfCallsDF = fireServiceCallDF.select("CallType").distinct()
distinctTypesOfCallsDF.collect().foreach(println)

// Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?
println(s"Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?")
val distinctTypesOfCallsSortedDF = fireServiceCallDF.select("CallType").groupBy("CallType").count().orderBy(desc("count"))
distinctTypesOfCallsSortedDF.collect().foreach(println)

// Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?
println(s"Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?")
val fireServiceCallYearsDF = fireServiceCallDF.select("CallYear").groupBy("CallYear").count().orderBy(desc("count"))
fireServiceCallYearsDF.show()

// Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?
println(s"Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?")
val last7DaysServiceCallDF = fireServiceCallDF.select("CallDateTS").groupBy("CallDateTS").count().orderBy(desc("CallDateTS"))
last7DaysServiceCallDF.show(7)

// Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?
println(s"Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?")
val neighborhoodDistrictCallsDF = fireServiceCallDF.filter("CallYear == 2016").select("NeighborhooodsDistrict").groupBy("NeighborhooodsDistrict").count().orderBy(desc("count"))
neighborhoodDistrictCallsDF.collect().foreach(println)

Spark SQL Implementation (High-Level API)

  • Spark SQL lets you query the data using SQL, both inside a Spark program and from external tools connected to Spark SQL through standard database connectors (JDBC/ODBC), such as Business Intelligence tools like Tableau.
  • It provides a DataFrame abstraction in Python, Java, and Scala to simplify working with structured datasets. DataFrames are similar to tables in a relational database.
  • Spark SQL gains the advantage of Spark’s optimizers such as Catalyst and Tungsten as its abstraction is DataFrame.
// NUMBER OF RECORDS IN THE FILE
val totalRecords = spark.sql("SELECT COUNT(*) from fireServiceCallsView")
println(s"Number of records in the data file")
totalRecords.show()

// Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?
println(s"Q1: HOW MANY DIFFERENT TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?")
val distinctTypesOfCallsDF = spark.sql("SELECT DISTINCT CallType from fireServiceCallsView")
distinctTypesOfCallsDF.collect().foreach(println)

// Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?
println(s"Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?")
val distinctTypesOfCallsSortedDF = spark.sql("SELECT CallType, COUNT(CallType) as count from fireServiceCallsView GROUP BY CallType ORDER BY count desc")
distinctTypesOfCallsSortedDF.collect().foreach(println)

// Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?
println(s"Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?")
val fireServiceCallYearsDF = spark.sql("SELECT CallYear, COUNT(CallYear) as count from fireServiceCallsView GROUP BY CallYear ORDER BY count desc")
fireServiceCallYearsDF.show()

// Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?
println(s"Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?")
val last7DaysServiceCallDF = spark.sql("SELECT CallDateTS, COUNT(CallDateTS) as count from fireServiceCallsView GROUP BY CallDateTS ORDER BY CallDateTS desc")
last7DaysServiceCallDF.show(7)

// Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?
println(s"Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?")
val neighborhoodDistrictCallsDF = spark.sql("SELECT NeighborhooodsDistrict, COUNT(NeighborhooodsDistrict) as count from fireServiceCallsView WHERE CallYear == 2016 GROUP BY NeighborhooodsDistrict ORDER BY count desc")
neighborhoodDistrictCallsDF.collect().foreach(println)

Dataset Implementation (High-Level API)

  • The Dataset API, released as an API preview in Spark 1.6, provides the best of both RDD and DataFrame.
  • Datasets offer two discrete API characteristics: strongly typed and untyped.
  • Datasets and DataFrames use Spark’s built-in encoders. The encoders provide on-demand access to individual attributes without deserializing an entire object and generate byte code to interact with off-heap data.
  • Dataset API gains the advantage of Spark’s optimizers such as Catalyst and Tungsten.
// NUMBER OF RECORDS IN THE FILE
val totalRecords = fireServiceCallDS.count()
println(s"Number of records in the data file: $totalRecords")

// Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?
println(s"Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?")
val distinctTypesOfCallsDS = fireServiceCallDS.select(col("CallType"))
distinctTypesOfCallsDS.distinct().collect().foreach(println)

// Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?
println(s"Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?")
val distinctTypesOfCallsSortedDS = fireServiceCallDS.select(col("CallType")).groupBy(col("CallType")).agg(count(col("CallType")).alias("count")).orderBy(desc("count"))
distinctTypesOfCallsSortedDS.collect().foreach(println)

// Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?
println(s"Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?")
val fireServiceCallYearsDS = fireServiceCallDS.select(col("CallYear")).groupBy(col("CallYear")).agg(count(col("CallYear")).alias("count")).orderBy(desc("count"))
fireServiceCallYearsDS.show()

// Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?
println(s"Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?")
val last7DaysServiceCallDS = fireServiceCallDS.select(col("CallDateTS")).groupBy(col("CallDateTS")).agg(count(col("CallDateTS")).alias("count")).orderBy(desc("CallDateTS"))
last7DaysServiceCallDS.show(7)

// Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?
println(s"Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?")
val neighborhoodDistrictCallsDS = fireServiceCallDS.filter(fireServiceCall => fireServiceCall.CallYear == 2016).select(col("NeighborhooodsDistrict")).groupBy(col("NeighborhooodsDistrict")).agg(count(col("NeighborhooodsDistrict")).alias("count")).orderBy(desc("count"))
neighborhoodDistrictCallsDS.collect().foreach(println)

Running Spark on YARN

There are two deployment modes, cluster and client, for launching Spark applications on YARN.

  • In cluster mode, the Spark driver runs inside an application master process managed by YARN on the cluster. The client goes away after initiating the application.
  • In client mode, the application master only requests resources from YARN, and the Spark driver runs in the client process.

Resource planning (executors, cores, and memory) is an essential part of running a Spark application, whether standalone, on YARN, or on Apache Mesos. On YARN especially, “memory overhead” is a vital configuration when planning Spark application resources.

Default Spark Configuration for YARN

Plenty of properties can be configured while submitting a Spark application on YARN. As part of resource planning, the following are important:

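As a rough guide, the resource-related properties involved are the driver and executor memory, cores, number of executors, and the YARN memory overheads. The Scala sketch below lists them; the values are assumptions for illustration, and in cluster mode the driver and AM settings must be passed at submit time (for example, via spark-submit --conf) rather than set inside the application.

// Minimal sketch with assumed example values (Spark 2.x property names).
import org.apache.spark.SparkConf

val resourceConf = new SparkConf()
  .set("spark.driver.memory", "1g")                 // driver heap (default 1 GB)
  .set("spark.driver.cores", "1")                   // cores for the driver / AM
  .set("spark.executor.memory", "1g")               // heap per executor
  .set("spark.executor.cores", "1")                 // concurrent tasks per executor
  .set("spark.executor.instances", "2")             // number of executors
  .set("spark.yarn.driver.memoryOverhead", "384")   // off-heap overhead, in MB
  .set("spark.yarn.executor.memoryOverhead", "384") // off-heap overhead, in MB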

Note: In cluster mode, the Spark driver runs inside a YARN Application Master (AM), which is launched with the resources allocated for the driver plus memory overhead. In client mode, the Spark driver runs in the client process, and resources for the YARN Application Master (AM) still need to be allocated. In both modes, executor resources should be planned and allocated.

Submitting Spark Application in YARN

Pre-requisites: Assemble all the Spark applications into a single JAR using SBT. We have launched the Spark applications on YARN in cluster mode with the default Spark configuration:

./bin/spark-submit --master yarn --deploy-mode cluster --class com.treselle.fscalls.analysis.FireServiceCallAnalysis /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

./bin/spark-submit --master yarn --deploy-mode cluster --class com.treselle.fscalls.analysis.FireServiceCallAnalysisDF /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

./bin/spark-submit --master yarn --deploy-mode cluster --class com.treselle.fscalls.analysis.FireServiceCallAnalysisDFSQL /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

./bin/spark-submit --master yarn --deploy-mode cluster --class com.treselle.fscalls.analysis.FireServiceCallAnalysisDS /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

Monitoring Driver and Executor Resource

On successfully submitting the Spark application, the below message is displayed in the console. The message states the amount of memory allocated for the Application Master (AM): 1408 MB, including 384 MB memory overhead, i.e., the default driver configuration.

spark.driver.memory + spark.yarn.driver.memoryOverhead = 1024 MB (1 GB) + 384 MB = 1408 MB.

Driver AM Resource
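By the same formula, each executor container is sized as the executor memory plus its overhead. Assuming the default executor overhead of max(384 MB, 10% of executor memory), the default 1 GB executor also maps to a 1408 MB container:

spark.executor.memory + spark.yarn.executor.memoryOverhead = 1024 MB (1 GB) + max(384 MB, 0.10 × 1024 MB) = 1024 MB + 384 MB = 1408 MB.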

Executor memory and cores can be monitored in both the Resource Manager UI and the Spark UI. The Executors tab in the Spark UI displays the number of executors and the resources allocated to them. The driver core count in the below diagram is ‘0’, though the default driver core count is 1; that 1 core is used by the YARN Application Master. Storage memory under each executor is shown as memory used / total memory available for storing data such as RDD partitions cached in memory.

Fire Service Analysis DF Executor Stats

Understanding Spark Internals

Spark constructs a Directed Acyclic Graph (DAG) using the DAGScheduler based on the transformations and actions used in the application. Jobs, stages, and tasks are the internal parts of Spark execution. To understand the Spark DAG and its internals, refer to our blog on Text Normalization with Spark – Part 2.
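A quick way to see these internals from code is to print an RDD's lineage with toDebugString; each shuffle (wide) dependency in the output marks a stage boundary in the DAG. This is a minimal sketch, assuming an existing SparkContext named sc:

// Minimal sketch, assuming an existing SparkContext `sc`.
val callTypes = sc.parallelize(Seq("Alarms", "Medical Incident", "Alarms"))
val callTypeCounts = callTypes.map(callType => (callType, 1)).reduceByKey(_ + _)

// The reduceByKey shuffle splits this job into two stages.
println(callTypeCounts.toDebugString)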

The jobs of the RDD implementation of the Spark application are shown in the below diagram. The Jobs view in the Spark UI provides a high-level overview of Spark application statistics such as the number of jobs, overall and individual job durations, number of stages, and total number of tasks.

Fire Service Analysis RDD Jobs Stats

RDD, DataFrame, Spark SQL, and Dataset implementation of the Spark Application Jobs statistics are as follows:


Note: The above statistics are based on the default Spark configuration for the different Spark API implementations in our use case scenario; no tuning has been applied. The performance bottlenecks are identified using the Stages view in the Spark UI.

RDD implementation of Stages View

Fire Service Analysis RDD Stages Stats

DataFrame Implementation of Stages View

Fire Service Analysis DF Stages Stats

Spark SQL Implementation of Stages View

Fire Service Analysis DF SQL Stages Stats

Dataset Implementation of Stages View

Fire Service Analysis DS Stages Stats

Low-Level and High-Level API Outputs

The results for the five questions in this use case are the same across the different Spark API implementations, but the durations taken by the different implementations vary.

  • The high-level API implementations of the application completed and provided results in 1.8 and 1.9 minutes.
  • The low-level RDD API implementation of the application completed in 22 minutes; even with Kryo serialization, it completed in 21 minutes.

The time difference is caused by Spark's optimizers, Catalyst and Tungsten, which are applied when the Spark application is written using the high-level APIs but not the low-level RDD API.
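One way to see these optimizers at work is to print the physical plan of a DataFrame query with explain(); in Spark 2.x, operators covered by whole-stage code generation are prefixed with an asterisk, while RDD code receives no such treatment. A minimal sketch, reusing the fireServiceCallDF DataFrame from the high-level implementation above:

// Operators marked with '*' in the printed physical plan are fused by
// whole-stage code generation (Tungsten) after Catalyst optimization.
fireServiceCallDF
  .groupBy("CallType")
  .count()
  .explain()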

Fire Service Call Output

Note: The results of these implementations and the source code have been uploaded to GitHub. Please look into the Reference section for the GitHub location and the dataset link.

Identifying Performance Bottlenecks

To do performance tuning, first identify the bottlenecks in the application. The following bottlenecks were identified during the RDD, DataFrame, Spark SQL, and Dataset API implementations of the Spark application:

Resource Planning (Executors, Cores, and Memory)

A balanced number of executors, cores, and memory significantly improves performance without any code changes to the Spark application while running on YARN.

Degree of Parallelism – Partition Tuning (Avoid Small Partition Problems)

In the Stages view of both the high-level and low-level API runs, a bunch of tasks (200) was found at a few stages. Digging deeper into those 200 tasks in the Event Timeline shows that task computation time is very low compared to the scheduler delay. The rule of thumb for partition size while running on YARN is ~128 MB.

Parallelism Bottleneck

Straggler Tasks (Long Running Tasks)

Straggler tasks can be identified in the Stages view as the tasks that take much longer than the others to complete. In this use case, the following straggler tasks took a long time:

RDD Implementation Straggler Task

RDD Straggler Task

DataFrame Implementation Straggler Task

DF Straggler Task

Conclusion

In this blog, we have discussed running a Spark application on YARN with the default configuration, implemented with both high-level and low-level APIs. All the implementations completed within the default resources allocated to the application for this use case, but this may not be the case for all use cases. Resource planning helps us decide on a balanced allocation of executors, cores, and memory.

The applications written with the high-level APIs completed in less time than the low-level API implementation, so programming with the high-level APIs is recommended when using Spark. Bottlenecks were identified during both the high-level and low-level API implementations. These bottlenecks and the performance tuning to eliminate them are covered in our upcoming blog posts listed below:

  • Apache Spark on YARN – Resource Planning
  • Apache Spark Performance Tuning – Degree of Parallelism
  • Apache Spark Performance Tuning – Straggler Tasks

After performance tuning and fixing the bottlenecks, the final time taken to complete the application with both the high-level and low-level APIs is shown in the below diagram:

Straggler Fix Output

The high-level API implementations of the application completed and provided results in 1.8 and 1.9 minutes; after performance tuning, the time was reduced to ~41 seconds. The low-level RDD API implementation completed in 22 minutes, and even with Kryo serialization it completed in 21 minutes; after performance tuning, the time was reduced to ~3 minutes.

References

Apache Spark on YARN – Resource Planning


Overview

This is the second article of a four-part series about Apache Spark on YARN. As Apache Spark is an in-memory distributed data processing engine, application performance is heavily dependent on the resources allocated to it, such as executors, cores, and memory. The resources needed by an application depend on its characteristics, such as storage and computation.

A few performance bottlenecks were identified in the SFO Fire Department call service dataset use case with the YARN cluster manager. One of the bottlenecks was improper usage of resources in the YARN cluster and execution of the application with the default Spark configuration.

To understand the use case and the performance bottlenecks identified, refer to our previous blog on Apache Spark on YARN – Performance and Bottlenecks. In this blog, let us discuss resource planning for the same use case and improve the performance of the Spark application used there.

Our other articles of the four-part series are:

  • Apache Spark on YARN – Performance and Bottlenecks
  • Apache Spark Performance Tuning – Degree of Parallelism
  • Apache Spark Performance Tuning – Straggler Tasks

Spark Resource Planning Principles

The general principles to be followed while deciding resource allocation for a Spark application are as follows:

  • The most granular level of resource allocation (the smallest possible executors) reduces application performance because it loses the ability to run multiple tasks inside a single executor; multiple tasks within an executor share cached data, which speeds up computation.
  • The least granular level of resource allocation (the biggest possible executors) also hurts application performance because it over-uses resources and ignores the memory overhead of the OS and other daemons.
  • Balanced resources (executors, cores, and memory), with memory overhead accounted for, improve the performance of a Spark application, especially when running on YARN.

Understanding Use Case Performance

The Spark application is executed in YARN cluster mode. The resource allocation for the use case Spark application is illustrated in the below table:


The observations from the Spark UI are as follows:

  • The high-level API implementations of the application completed and provided results in 1.8 and 1.9 minutes.
  • The low-level RDD API implementation completed in 22 minutes; even with Kryo serialization, it completed in 21 minutes.

Fire Service Call Output

Let us understand the YARN resources before performing Spark application resource tuning.

Understanding YARN Resource

A cluster is set up and the YARN resource availability from YARN configuration is illustrated in the below table:


The maximum memory and vcores available per node are 8 GB and 3 cores. In total, we have 16 GB and 6 cores, as shown in the below diagram:

YARN Cluster Metrics

If the underlying instance has more memory and cores, the above configuration can be increased; let us stick with this YARN configuration and tune the resources within it. If the resources allocated to the Spark application exceed these limits, the application is terminated with error messages such as the following:

Executor Memory Exceeds Cluster Memory Error

Executor Core Exceeds Cluster Vcore Error

Hopefully, you now understand the use case performance and the resources available in YARN.

Spark on YARN – Resource Planning

Let us find out the reasonable resources to execute the Spark application in YARN.

Memory available per node 8 GB
Core available per node 3

 

To find the number of executors, cores, and memory that works for our use case with a notable performance improvement, perform the following steps:

Step 1: Allocate 1 GB of memory and 1 core per node for the driver. The driver can be launched on any one of the nodes at run time. If an action returns more data (for example, more than 1 GB), the driver memory must be adjusted.

Memory available per node 7 GB
Core available per node 2

 

Step 2: Assign 1 GB of memory and 1 core per instance for OS and Hadoop daemon overhead. Let us look at the instance used to launch the cluster:

Instance details: m4.xlarge (4 cores, 16 GB RAM)

1 core and 8 GB RAM per instance are left outside YARN, which is configured with 8 GB RAM and 3 cores per node; the freed-up resources cover the OS and Hadoop daemon overhead. The memory and cores available per node remain unchanged after Step 2.

Step 3: Decide the number of cores per executor. As 2 cores per node are available, use 2 cores per executor.

Note: If you have more cores per instance (for example, 16 – 1 for overhead = 15), stick with 5 cores per executor while running on YARN with HDFS, to avoid degrading HDFS throughput.

Step 4: Find out the number of executors and the memory per executor.

Number of cores per executor: 2

Total cores = Number of nodes * Number of cores per node (after taking overhead) => 2 * 2 = 4

Total executors = Total cores / Number of cores per executor => 4 / 2 = 2

Number of executors per node = Total executors / Number of nodes => 2 / 2 = 1 (each node will have one executor)

Memory per executor = Memory per node / Number of executors per node => 7 / 1 => 7 GB (this must be adjusted as per the application payload)

 

This calculation works well for our use case except for the memory per executor: the input dataset is only about 1.5 GB, and allocating the full 7 GB per executor to process 1.5 GB over-uses the memory.

Executor memory was tested from 2 GB up to 7 GB per executor while executing the Spark application. 2 GB per executor was chosen, as there was no additional performance improvement when increasing the executor memory from 2 GB to 7 GB.

The resource allocation derived from the above steps for the use case Spark applications is illustrated in the below table:


Note: Different organizations have different workloads, and the above steps may not work well for all cases; they are intended to give an idea of how to calculate executors, cores, and memory.
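The arithmetic from the steps above can be summarized in a few lines of Scala; this is only an illustration of the calculation for this specific cluster (2 nodes, with 8 GB and 3 cores per node available to YARN), not a general-purpose sizing tool:

// Illustrative arithmetic only, encoding the steps above for this cluster.
val nodes = 2
val memoryPerNodeGB = 8 - 1    // reserve 1 GB per node for the driver / AM
val coresPerNode = 3 - 1       // reserve 1 core per node for the driver / AM

val coresPerExecutor = 2                                      // Step 3
val totalCores = nodes * coresPerNode                         // 2 * 2 = 4
val totalExecutors = totalCores / coresPerExecutor            // 4 / 2 = 2
val executorsPerNode = totalExecutors / nodes                 // 1 per node
val memoryPerExecutorGB = memoryPerNodeGB / executorsPerNode  // 7 GB upper bound

println(s"$totalExecutors executors, $coresPerExecutor cores each, up to $memoryPerExecutorGB GB each")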

Running Spark on YARN with Tuned Resource

DataFrame Implementation of Spark Application

The DataFrame implementation of the Spark application is executed at the most granular, least granular, and balanced (calculated above) resource levels.

./bin/spark-submit --name FireServiceCallAnalysisDataFrameTest2 --master yarn --deploy-mode cluster   --executor-memory 1g --executor-cores 1  --num-executors 7 --class com.treselle.fscalls.analysis.FireServiceCallAnalysisDF /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

FireService Call Analysis DataFrame Test2 Executors Stats

./bin/spark-submit --name FireServiceCallAnalysisDataFrameTest1 --master yarn --deploy-mode cluster   --executor-memory 7g --executor-cores 2  --num-executors 1 --class com.treselle.fscalls.analysis.FireServiceCallAnalysisDF /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

Fire Service Call Analysis DataFrame Test1 Executors Stats

./bin/spark-submit --name FireServiceCallAnalysisDataFrameTest --master yarn --deploy-mode cluster --executor-memory 2g --executor-cores 2 --num-executors 2 --class com.treselle.fscalls.analysis.FireServiceCallAnalysisDF /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

Fire Service Call Analysis DataFrame Test Executor Stats

The balanced resource allocation provides a notable performance improvement, from 1.8 minutes to 1.3 minutes.

FireServiceCallAnalysisSPTuneOutput

RDD Implementation of Spark Application

The RDD implementation of the Spark application is executed at the most granular, least granular, and balanced (calculated above) resource levels.

./bin/spark-submit --name FireServiceCallAnalysisRDDTest2 --master yarn --deploy-mode cluster  --executor-memory 1g --executor-cores 1  --num-executors 7 --class com.treselle.fscalls.analysis.FireServiceCallAnalysis /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

./bin/spark-submit --name FireServiceCallAnalysisRDDTest1 --master yarn --deploy-mode cluster   --executor-memory 7g --executor-cores 2  --num-executors 1 --class com.treselle.fscalls.analysis.FireServiceCallAnalysis /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

./bin/spark-submit --name FireServiceCallAnalysisRDDTest --master yarn --deploy-mode cluster --executor-memory 2g --executor-cores 2 --num-executors 2 --class com.treselle.fscalls.analysis.FireServiceCallAnalysis /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

The RDD implementation with balanced resource allocation is 2 times faster than the execution with the default Spark configuration: the default configuration produced results in 22 minutes, while after resource tuning the results are produced in 11 minutes.

FireServiceCallAnalysisRDDSPTuneOutput

Spark Applications with Default Configuration

Fire Service Call Output

Spark Application After Resource Tuning

FireServiceCallAnalysisRDDSPTuneOutput

FireServiceCallAnalysisSPTuneOutput

Conclusion

In this blog, we have discussed the Spark resource planning principles and examined the use case performance and the YARN resource configuration before doing resource tuning for the Spark application.

We followed certain steps to calculate resources (executors, cores, and memory) for Spark application. The results are as follows:

  • Significant performance improvement in the DataFrame implementation of the Spark application, from 1.8 minutes to 1.3 minutes.
  • The RDD implementation of the Spark application is 2 times faster, from 22 minutes to 11 minutes.

We covered one of the bottlenecks discussed in our previous blog Apache Spark on YARN – Performance and Bottlenecks. In the following blog posts, we will cover the other two bottlenecks and the performance tuning to eliminate them:

  • Apache Spark Performance Tuning – Degree of Parallelism
  • Apache Spark Performance Tuning – Straggler Tasks

After performance tuning and fixing the bottlenecks, the final durations to complete the application with both the high-level and low-level APIs are:

Straggler Fix Output

References

Apache Spark Performance Tuning – Degree of Parallelism


Overview

This is the third article of a four-part series about Apache Spark on YARN. Apache Spark allows developers to run multiple tasks in parallel across machines in a cluster or across multiple cores on a desktop. A partition, also known as a split, is a logical chunk of a distributed data set. Apache Spark builds a Directed Acyclic Graph (DAG) with jobs, stages, and tasks for the submitted application, and the number of tasks is determined by the number of partitions.

A few performance bottlenecks were identified in the SFO Fire Department call service dataset use case with the YARN cluster manager. To understand the use case and the performance bottlenecks identified, refer to our previous blog on Apache Spark on YARN – Performance and Bottlenecks. The resource planning bottleneck, and the notable performance improvements it brought to the use case Spark application, are discussed in our previous blog on Apache Spark on YARN – Resource Planning.

In this blog post, let us discuss the partition problem and tune the partitions of the use case Spark application.

Our other articles of the four-part series are:

  • Apache Spark on YARN – Performance and Bottlenecks
  • Apache Spark on YARN – Resource Planning
  • Apache Spark Performance Tuning – Straggler Tasks

Spark Partition Principles

The general principles to be followed when tuning partitions for a Spark application are as follows:

  • Too few partitions – Cannot utilize all cores available in the cluster.
  • Too many partitions – Excessive overhead in managing many small tasks.
  • Reasonable partitions – Helps us to utilize the cores available in the cluster and avoids excessive overhead in managing small tasks.

Understanding Use Case Performance

The performance duration (without any performance tuning) based on different API implementations of the use case Spark application running on YARN is shown in the below diagram:

SparkApplicationWithDefaultConfigurationPerformance

The performance duration (after performance tuning) after tuning the number of executors, cores, and memory for RDD and DataFrame implementation of the use case Spark application is shown in the below diagram:

SparkApplicationAfterResourceTuningPerformance

For tuning of the number of executors, cores, and memory for RDD and DataFrame implementation of the use case Spark application, refer our previous blog on Apache Spark on YARN – Resource Planning.

Let us understand the Spark data partitions of the use case application and decide on increasing or decreasing the partition using Spark configuration properties.

Understanding Spark Data Partitions

The two configuration properties in Spark to tune the number of partitions at run time are as follows:

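The two properties referred to here are spark.default.parallelism, which controls the default number of partitions for RDD shuffle operations such as reduceByKey, and spark.sql.shuffle.partitions, which controls the number of partitions used by DataFrame and Spark SQL shuffles (200 by default). A minimal Scala sketch of setting them when building the session; the value 23 is the one derived later in this post:

// Minimal sketch; spark.default.parallelism must be set before the context starts,
// while spark.sql.shuffle.partitions can also be changed at run time.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FireServiceCallAnalysis")
  .config("spark.default.parallelism", "23")     // partitions for RDD shuffles
  .config("spark.sql.shuffle.partitions", "23")  // partitions for DataFrame/SQL shuffles
  .getOrCreate()

// The SQL shuffle partition count can also be adjusted on an existing session:
spark.conf.set("spark.sql.shuffle.partitions", "23")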

The default parallelism and shuffle partition problems in both the RDD and DataFrame API based implementations are shown in the below diagram:

FireServiceCallAnalysisDataFrameTest1StagesStats

The count() action stage using the default parallelism (12 partitions) is shown in the below diagram:

SparkDefaultParallelism12_200Stats

From the Summary Metrics for Input Size/Records section, the Max partition size is ~128 MB.

Looking at the Event Timeline of those 200 shuffled partition tasks, there are tasks with more scheduler delay than computation time. This indicates that 200 tasks are not necessary here, and the shuffle partitions can be decreased to reduce the scheduler burden.

HighNumberOfTasksProblem

The Stages view in Spark UI indicates that most of the tasks are simply launched and terminated without any computation as shown in the below diagram:

NumberOfTasksProblem

Spark Partition Tuning

Let us first decide the number of partitions based on the input dataset size. The rule of thumb for partition size while working with HDFS is 128 MB. As our input dataset size is about 1.5 GB (1500 MB), going with 128 MB per partition, the number of partitions will be:

Total input dataset size / partition size => 1500 / 128 = 11.71 = ~12 partitions

This is equal to the Spark default parallelism (spark.default.parallelism) value. The metrics based on default parallelism is shown in the above section.

Now, let us perform a test by reducing the partition size and increasing number of partitions.

Consider partition size as 64 MB

Number of partitions = Total input dataset size / partition size => 1500 / 64 = 23.43 = ~23 partitions

DataFrame API implementation is executed using the below partition configurations:


The RDD API implementation is executed using the below partition configurations:


Note: spark.sql.shuffle.partitions property is not applicable for RDD API based implementation.
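For the RDD implementation, the shuffle width is instead controlled by spark.default.parallelism or by passing an explicit partition count to the wide transformation itself, for example the optional numPartitions argument of reduceByKey. A minimal sketch, reusing filteredFireServiceCallRDD from the earlier implementation:

// Minimal sketch: request 23 shuffle partitions explicitly on the RDD path.
val callTypeCountsRDD = filteredFireServiceCallRDD
  .map(x => (x(3), 1))                // (CallType, 1) pairs
  .reduceByKey((x, y) => x + y, 23)   // numPartitions = 23 instead of the default

println(callTypeCountsRDD.getNumPartitions)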

Running Spark on YARN with Partition Tuning

./bin/spark-submit --name FireServiceCallAnalysisDataFramePartitionTest --master yarn --deploy-mode cluster --executor-memory 2g --executor-cores 2 --num-executors 2 --conf spark.sql.shuffle.partitions=23 --conf spark.default.parallelism=23 --class com.treselle.fscalls.analysis.FireServiceCallAnalysisDF /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv
Note: Update the values of the spark.default.parallelism and spark.sql.shuffle.partitions properties, as the testing has to be performed with different numbers of partitions.

The Stages view based on spark.default.parallelism=23 and spark.sql.shuffle.partitions=23 is shown in the below diagram:

Partition23_23_StagesStats

Consider the Tasks: Succeeded/Total column in the above diagram. Both the default and shuffle partition settings are applied, and the number of tasks is 23.

The count() action stage using 23 partitions is shown in the below screenshot:

SparkParallelism23_23_Stats

From the Summary Metrics for Input Size/Records section, the max partition size is ~66 MB.

Looking into the shuffle stage tasks, the scheduler launched 23 tasks, and most of the time is occupied by shuffle read/write. There are no tasks without computation.

Partition23_23_ShuflleEventStats

Partition23_23_TasksStats

The output obtained after executing Spark application with different number of partitions is shown in the below diagram:

PartitionTuningOutput

Conclusion

In this blog, we discussed partition principles, examined the use case performance, decided the number of partitions, and tuned the partitions using Spark configuration properties.

The resource planning bottleneck and the notable performance improvements it brought are discussed in our previous blog on Apache Spark on YARN – Resource Planning; to understand the use case and the performance bottlenecks identified, refer to our previous blog on Apache Spark on YARN – Performance and Bottlenecks. Partition tuning alone, however, left the overall duration of the Spark application unchanged.

In our upcoming blog, let us discuss the final bottleneck of the use case in “Apache Spark Performance Tuning – Straggler Tasks”.

The final performance achieved after resource tuning, partition tuning, and straggler tasks problem fixing is shown in the below diagram:

Straggler Fix Output

References

Apache Spark Performance Tuning – Straggler Tasks


Overview

This is the last article of a four-part series about Apache Spark on YARN. Apache Spark distinguishes transformations into two types: “narrow” and “wide”. This distinction is important because of its strong implications for evaluating transformations and improving their performance. Spark depends heavily on the key/value pair paradigm for defining and parallelizing operations, especially wide transformations that require data to be redistributed between machines.

A few performance bottlenecks were identified in the SFO Fire Department call service dataset use case with the YARN cluster manager. To understand the use case and the performance bottlenecks identified, refer to our previous blog on Apache Spark on YARN – Performance and Bottlenecks.

The resource planning bottleneck and the notable performance improvements achieved in the use case Spark application are discussed in our previous blog on Apache Spark on YARN – Resource Planning.

To know about partition tuning in the use case Spark application, refer our previous blog on Apache Spark Performance Tuning – Degree of Parallelism.

In this blog, let us discuss the shuffle and straggler task problems so as to improve the performance of the use case application.

Our other articles of the four-part series are:

  • Apache Spark on YARN – Performance and Bottlenecks
  • Apache Spark on YARN – Resource Planning
  • Apache Spark Performance Tuning – Degree of Parallelism

Spark Shuffle Principles

Two primary techniques, “shuffle less” and “shuffle better”, help avoid the performance problems associated with shuffles:

  • Shuffle Less Often – To minimize the number of shuffles in a computation requiring several transformations, preserve partitioning across narrow transformations to avoid reshuffling data (see the sketch after this list).
  • Shuffle Better – Sometimes the computation cannot be completed without a shuffle; however, not all wide transformations and shuffles are equally expensive or prone to failure.
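As a minimal illustration of the first principle, key/value transformations that preserve the partitioner (such as mapValues) let a later wide operation reuse the existing partitioning, while re-keying with map drops the partitioner and forces the data to be redistributed again. The sketch below assumes an existing pair RDD named callTypeCounts of type RDD[(String, Int)]:

// Minimal sketch, assuming an existing pair RDD `callTypeCounts: RDD[(String, Int)]`.
import org.apache.spark.HashPartitioner

val partitioned = callTypeCounts.partitionBy(new HashPartitioner(23)) // one shuffle

// mapValues keeps the keys, so Spark preserves the partitioner.
val scaled = partitioned.mapValues(count => count * 2)
println(scaled.partitioner)   // Some(HashPartitioner)

// map may change the keys, so Spark drops the partitioner and a later
// key-based operation would have to shuffle the data again.
val rekeyed = partitioned.map { case (callType, count) => (callType, count * 2) }
println(rekeyed.partitioner)  // None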

Operations on the key/value pairs can cause:

  • Out-of-memory errors in the driver
  • Out-of-memory errors on the executor nodes
  • Shuffle failures
  • Straggler tasks or partitions, especially slow to compute

Memory errors in the driver are mainly caused by actions. The last three performance issues (out of memory on the executors, shuffle failures, and straggler tasks) are almost always caused by the shuffles associated with wide transformations.

Understanding Use Case Application Shuffle

The number of partitions, tuned based on the input dataset size, is explained in our previous blog on Apache Spark Performance Tuning – Degree of Parallelism. The DataFrame API implementation of the application, submitted with the following configuration, is shown in the below screenshot:

./bin/spark-submit --name FireServiceCallAnalysisDataFramePartitionTest --master yarn --deploy-mode cluster --executor-memory 2g --executor-cores 2 --num-executors 2 --conf spark.sql.shuffle.partitions=23 --conf spark.default.parallelism=23 --class com.treselle.fscalls.analysis.FireServiceCallAnalysisDF /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

Partition23_23_ShuflleUnderstanding

Looking at the Shuffle Read and Write columns, the shuffled data is in bytes and kilobytes (KB) across all the stages, so the “shuffle less” principle already holds in our use case application.

The input of ~849 MB is carried over in all the shuffle stages.

The “Executors” tab in the Spark UI provides the summary of input, shuffles read, and write as shown in the below diagram:

ExecutorSummary23_23Partition

The overall input size is 5.9 GB including original input of 1.5 GB and entire shuffle input of ~849 MB.

Detecting Stragglers Tasks in Use Case

“Stragglers” are tasks within a stage that take much longer to execute than other tasks.

The total time taken for the DataFrame API implementation is 1.3 minutes.

Looking at the stage-wise durations, Stages 0 and 2 consumed 10 s and 46 s respectively, 56 seconds in total (~1 minute).

StragglerDeduction23_23Partition

Internally, Spark does the following:

  • Spark optimizers such as Catalyst and Tungsten optimize the code at run time
  • Spark's high-level DataFrame and Dataset API encoders reduce the input size by encoding the data

By reducing input size and by filtering the data from input datasets in both low-level and high-level API implementation, the performance can be improved.

Low-Level and High-Level API Implementation

Our input dataset has 34 columns, and only 3 of them are used in the computations that answer the use case scenario questions.

The below updated RDD and DataFrame API implementation code improves performance by selecting only the data needed for this use case scenario:

val filteredFireServiceCallRDD = filteredFireServiceCallWithoutHeaderRDD.map(x => Array(x(3), x(4), x(31)))

The above line is added at the beginning of the RDD API implementation to keep 3 columns and drop the other 31 from the RDD, reducing the input size in all the shuffle stages.

The below code also does the same thing in DataFrame API implementation:

// FILTERING NEEDED COLUMNS FOR USE CASE SCENARIOS
val fireServiceCallDF = fireServiceCallYearAddedDF.select("CallType", "NeighborhooodsDistrict", "CallDateTS", "CallYear")

The code block of both RDD and DataFrame API implementations is given below:

// FILTER THE HEADER ROW AND SPLIT THE COLUMNS IN THE DATA FILE (EXCLUDE COMMA WITH IN DOUBLE QUOTES)
val filteredFireServiceCallWithoutHeaderRDD = fireServiceCallRawRDD.filter(row => row != header).map(x => x.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))

val filteredFireServiceCallRDD = filteredFireServiceCallWithoutHeaderRDD.map(x => Array(x(3), x(4), x(31)))

// CACHE/PERSIST THE RDD
filteredFireServiceCallRDD.setName("FireServiceCallsRDD").persist().take(10)

// NUMBER OF RECORDS IN THE FILE
val totalRecords = filteredFireServiceCallRDD.count()
    println(s"Number of records in the data file: $totalRecords")

// Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?
println(s"Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?")
val distinctTypesOfCallsRDD = filteredFireServiceCallRDD.map(x => x(0))
distinctTypesOfCallsRDD.distinct().collect().foreach(println)

// Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?
println(s"Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?")
val distinctTypesOfCallsSortedRDD = distinctTypesOfCallsRDD.map(x => (x, 1)).reduceByKey((x, y) => (x + y)).map(x => (x._2, x._1)).sortByKey(false)
distinctTypesOfCallsSortedRDD.collect().foreach(println)

// Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?
println(s"Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?")
val fireServiceCallYearsRDD = filteredFireServiceCallRDD.map(convertToYear).map(x => (x, 1)).reduceByKey((x, y) => (x + y)).map(x => (x._2, x._1)).sortByKey(false)
fireServiceCallYearsRDD.take(20).foreach(println)

// Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?
println(s"Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?")
val last7DaysServiceCallRDD = filteredFireServiceCallRDD.map(convertToDate).map(x => (x, 1)).reduceByKey((x, y) => (x + y)).sortByKey(false)
last7DaysServiceCallRDD.take(7).foreach(println)

// Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR? 
println(s"Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?")
val neighborhoodDistrictCallsRDD = filteredFireServiceCallRDD.filter(row => (convertToYear(row) == "2016")).map(x => x(2)).map(x => (x, 1)).reduceByKey((x, y) => (x + y)).map(x => (x._2, x._1)).sortByKey(false)
neighborhoodDistrictCallsRDD.collect().foreach(println)

// FILTERING NEEDED COLUMNS FOR USE CASE SCENARIOS
val fireServiceCallDF = fireServiceCallYearAddedDF.select("CallType", "NeighborhooodsDistrict", "CallDateTS", "CallYear")

// RE ARRANGE NUMBER OF PARTITION
fireServiceCallDF.cache().take(10)

// PRINT SCHEMA 
fireServiceCallDF.printSchema()

// LOOK INTO TOP 20 ROWS IN THE DATA FILE
fireServiceCallDF.show()

// NUMBER OF RECORDS IN THE FILE
val totalRecords = fireServiceCallDF.count()
println(s"Number of records in the data file: $totalRecords")

// Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?
println(s"Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?")
val distinctTypesOfCallsDF = fireServiceCallDF.select("CallType").distinct()
distinctTypesOfCallsDF.collect().foreach(println)

// Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?
println(s"Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?")
val distinctTypesOfCallsSortedDF = fireServiceCallDF.select("CallType").groupBy("CallType").count().orderBy(desc("count"))
distinctTypesOfCallsSortedDF.collect().foreach(println)

// Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?
println(s"Q3: HOW MANY YEARS OF FIRE SERVICE CALLS IS IN THE DATA FILES AND INCIDENTS PER YEAR?")
val fireServiceCallYearsDF = fireServiceCallDF.select("CallYear").groupBy("CallYear").count().orderBy(desc("count"))
fireServiceCallYearsDF.show()

// Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?
println(s"Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?")
val last7DaysServiceCallDF = fireServiceCallDF.select("CallDateTS").groupBy("CallDateTS").count().orderBy(desc("CallDateTS"))
last7DaysServiceCallDF.show(7)

// Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?
println(s"Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?")
val neighborhoodDistrictCallsDF = fireServiceCallDF.filter("CallYear == 2016").select("NeighborhooodsDistrict").groupBy("NeighborhooodsDistrict").count().orderBy(desc("count"))
neighborhoodDistrictCallsDF.collect().foreach(println)

Submitting Spark Application in YARN

The Spark submit command with partition tuning, used to execute the RDD and DataFrame API implementation in YARN, is as follows:

./bin/spark-submit --name FireServiceCallAnalysisRDDStragglerFixTest --master yarn --deploy-mode cluster --executor-memory 2g --executor-cores 2 --num-executors 2 --conf spark.default.parallelism=23 --class com.treselle.fscalls.analysis.SFOFireServiceCallAnalysis /data/SFFireServiceCall/SFFireServiceCallAnalysisPF.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

./bin/spark-submit --name FireServiceCallAnalysisDataFrameStragglerFixTest --master yarn --deploy-mode cluster --executor-memory 2g --executor-cores 2 --num-executors 2 --conf spark.sql.shuffle.partitions=23 --conf spark.default.parallelism=23 --class com.treselle.fscalls.analysis.SFOFireServiceCallAnalysisDF /data/SFFireServiceCall/SFFireServiceCallAnalysisPF.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

The input, shuffle read, and shuffle write of the DataFrame API implementation are monitored in the Stages view. The below diagram shows that the input size of the shuffle stages is now ~17 MB, compared to ~849 MB previously; the shuffle read and write sizes do not change much.

DataFrameStraggerFixStages

The “Executors” tab in the Spark UI provides the summary of input, shuffles read, and write as shown in the below diagram:

DataFrameStragglerFixExecutorsStats

The summary shows that the input size is 1.5 GB currently and 5.9 GB previously.

The time duration after reducing input size in RDD and DataFrame API implementation is shown in the below diagram:

Straggler Fix Output

Understanding Use Case Performance

The performance duration (without any performance tuning) based on different API implementation of the use case Spark application running on YARN is shown in the below diagram:

SparkApplicationWithDefaultConfigurationPerformance

For more details, refer our previous blog on Apache Spark on YARN – Performance and Bottlenecks.

We tuned the number of executors, cores, and memory for RDD and DataFrame implementation of the use case Spark application. The below diagram is based on the performance improvements after tuning the resources:

SparkApplicationAfterResourceTuningPerformance

For more details, refer our previous blog on Apache Spark on YARN – Resource Planning.

We tuned the default parallelism and shuffle partitions of both the RDD and DataFrame implementations in our previous blog on Apache Spark Performance Tuning – Degree of Parallelism. We did not achieve a performance improvement there, but we reduced the scheduler overhead.

Finally, after identifying the straggler tasks and reducing the input size, we got a 2x performance improvement in the DataFrame implementation and a 4x improvement in the RDD implementation.

StragglerPerformanceBenchmark

Conclusion

In this blog, we discussed shuffle principles, examined the use case application's shuffle behavior, detected the straggler tasks in the application, and reduced the input size to improve the performance of the different API implementations of the Spark application.

We achieved a 2x performance improvement in the DataFrame implementation and a 4x improvement in the RDD implementation, on top of the results of the resource and partition tuning.

References

Protractor with Cucumber


Overview

Protractor, an end-to-end testing framework, supports Jasmine and is specifically built for AngularJS applications. It is highly flexible and works with different Behavior-Driven Development (BDD) frameworks such as Cucumber, allowing specifications to be written based on the behavior of the application. The greatest feature of Protractor is that it waits until a page has loaded, which limits the number of explicit waits and sleeps needed in the test suite.

Cucumber, a BDD framework, is used for performing acceptance tests of web applications and provides a higher-level view of the suite's testing process. In this blog, let us discuss configuring Protractor with the Cucumber framework and passing input parameters using a property file.

Pre-requisites

  • Install Java Development Kit
  • Install Node.js (Latest version)
  • Install Protractor using the below command:
npm install -g protractor
  • Check the version of Protractor using the below command:
protractor --version
  • Update the browser driver binaries using the below command:
webdriver-manager update
  • Install Cucumber using the below command:
npm install -g cucumber
npm install --save-dev protractor-cucumber-framework

Use Case

We will discuss passing input parameters using a property file in Protractor with the Cucumber framework.

Setting Up Configuration in Protractor with Cucumber

Cucumber Configuration

To configure Cucumber, add the below cucumberOpts block to the Protractor configuration file:

cucumberOpts: {
 require: 'features/step_definitions/*.js',
 tags: false,
 format: ['pretty'],
 profile: false,
 'no-source': true,
 }

Running Script in Default Browser

To run the script in the default browser, use the below capabilities setting:

capabilities: {
 'browserName': 'chrome'
 },

Running Script in Other Browsers

To run the script against multiple browsers, use multiCapabilities (an array of capability objects) as shown below:

multiCapabilities: [{
 'browserName': 'firefox'
 }, {
 'browserName': 'safari'
 }]

Viewing Cucumber.conf.js File

The above Cucumber configuration can be combined into a single cucumber.conf.js file as shown below:

select

Setting Up Feature File in Protractor with Cucumber

The sample feature file used in this use case is as follows:

Feature: Angular Application Testing
  Scenario: Protractor and Cucumber Test
    Given Open the Application Login page
    When Enter the valid credentials
    Then Get the stock price of the given Ticker

Configuring Property File

To configure the property file, perform the following:

  • Install properties reader using the below command:
npm install properties-reader
  • Create a prop.properties file as shown in the below diagram:

select

  • Reference the property file in first_test.js using the below code:
var propertiesReader = require('properties-reader');
var inputproperties = propertiesReader('/path/properties_file/prop.properties');
  • Read the required input parameters from the property file, for example:
var website = inputproperties.get('Website');

Executing Test Script in Protractor with Cucumber

To execute the test script in Protractor, perform the following:

  • Before running the test, start the Selenium server using the below command:
webdriver-manager start
select
  • Open another command prompt and run the test using the below command:
protractor cucumber.conf.js
select

Viewing Test Result

Open results.json file to view the test result in JSON format as shown in the below diagram:

select

Conclusion

In this blog, we discussed installing and configuring Protractor with the Cucumber framework. We covered the basic structure of a feature file and passing input parameters using a property file in Protractor with the Cucumber framework.

References

Protractor-Cucumber: https://www.npmjs.com/package/protractor-cucumber

Distributed Load Testing using Apache JMeter


Overview

Distributed load testing is the process of simulating a very high workload from an enormous number of users using multiple systems. As a single system cannot generate a large number of threads (users), multiple systems are used for load testing, which helps to distribute both the tests and the load.

Apache JMeter, an open source testing tool, is used for load testing, performance testing, and functional testing. In JMeter, a master-slave configuration is used to achieve distributed load testing. Distributed load testing is a bit tricky and can produce inaccurate results if not configured properly.

In this blog, let us discuss setting up distributed testing with JMeter.

Pre-requisites

  • Download and install Apache JMeter from the following link:
    http://jmeter.apache.org/download_jmeter.cgi
  • Ensure that all the test machines are on the same subnet
  • Ensure that the same version of Apache JMeter is installed on all the machines
  • Ensure that the same version of Java is installed on all the machines
  • Disable the firewall or configure it to allow the RMI traffic used by JMeter for remote testing
  • Ensure correct system configurations such as RAM, processor, and so on

Use Case

A single Apache JMeter master instance is used to control multiple remote JMeter slave instances and to generate large volume of load on the test application.

The distributed test environment is as follows:

select

Performing Distributed Load Testing

To do distributed load testing, perform the following:

Starting JMeter Server in Master and Slave systems

To start the jmeter-server.bat in both master and slave systems, perform the following:

  • Click JMeter home directory –> Bin folder.
  • Run the batch file – jmeter-server.bat (for Windows) or jmeter-server (for Linux) as shown in the below diagram:

select

Note: If you are unable to run the test from the remote machine and get the below error, check whether the jmeter-server.bat file is running on the remote system:

select

Setting IP Addresses for Slave Systems

To set the IP addresses for slave systems, perform the following:

  • From the master system, open the properties file – jmeter.properties
  • Remove the current IP from the remote_hosts entry
    For example, remove the default IP address – 127.0.0.1
  • Specify the IP addresses of all the remote (slave) systems, separated by commas
    For example, remote_hosts=192.168.0.1,192.168.0.2 as shown below:

select

Starting Slave Systems Remotely

To remote start all the slave systems in JMeter, perform the following:

  • Open JMeter in the Master machine (on which properties files are edited)
  • Open your test script and Remote Start all the slave systems as shown below:

select

Creating Test Plan in JMeter

To create test plan in JMeter, perform the following:

  • Create a JMeter Thread Group and mention the number of threads, loop count, and ramp-up period as shown in the below diagram:

select

  • In the Thread Group, add the JMeter config element HTTP Request Defaults and provide the server URL and port number as shown in the below diagram:

select

  • Add HTTP Request in the test plan thread group and mention the tested URL followed by the specific path as shown in the below diagram:

select

  • Add the Duration Assertion to validate each response received within a given period.
  • Add the Response Assertion to verify different segments of the response such as text (response body), document (doc, PDF), response code (200, 404), response message (description of code), and response headers.

select

  • Add the listener to check the test plan results for all the formats.

Viewing Results

Table View

The results can be viewed in table format as shown in the below diagram:

select

Response Time Graph View

The results can be viewed in the form of chart as shown in the below diagram:

select

Conclusion

In this blog, we discussed about setting up distributed testing with JMeter, creating test plan, and viewing results in JMeter.

References

Data Normalization and Filtration Using Drools


Overview

Drools, a rule engine, is used to implement an expert system using a rule-based approach. It is used to convert both structured and unstructured data into transient data by applying the business logic for normalizing and filtering data in a DRL file.

In this blog, let us discuss normalizing and filtering data using Drools.

Pre-requisite

Download and install the following:

Use Case

Oil well drilling datasets from two US states, Arkansas (AR) and Oklahoma (OK), are taken as the input data and processed based on API numbers. The following steps are performed:

  • Filter invalid drill types data
  • Remove null date values
  • Normalize API numbers to have correct digits
  • Format dates to the required format
  • Remove duplicate well information by taking only maximum modified date value

Data Description

The oil well drilling datasets contain raw information about wells and their formation details, drill types, and production dates. The Arkansas dataset has 6040 records and the Oklahoma dataset has 2559 records.

The raw data contains invalid values such as nulls, invalid dates, invalid drill types, and duplicate or invalid well information with modified dates. This raw data is transformed from the source into MS SQL for further filtering and normalization.

Arkansas Dataset

Null values for date_of_1st_prod

select

Invalid values for initial_production

select

Incorrect Digits in Well API Numbers

select

Oklahoma Dataset

Duplicate well data

select

Invalid date values in test date

select

  • MS SQL is used to transform the input data into transient data.
  • Java Database Connectivity (JDBC) is used for interaction between Java and MS SQL in order to get input and to write output into MS SQL after transforming the data.
LOGGER.info("Creating MSSQL connection ................");
 Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
 conn = DriverManager.getConnection(url,userName,passWord);
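The original loader code is not shown in full here, but a minimal sketch of how the raw rows might be read over this connection into Java fact objects is shown below. The table and column names and the setter names are assumptions, inferred from the columns described above and the fields bound in the DRL rules later in this post:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class ArkanasFactLoader {

    // Reads the raw Arkansas rows and turns each one into an Arkanas fact object.
    // Table/column names are illustrative; the setters follow the fields used in the DRL.
    public List<Arkanas> loadFacts(Connection conn) throws Exception {
        List<Arkanas> facts = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT api_number, date_of_1st_prod, initial_production FROM arkansas_raw")) {
            while (rs.next()) {
                Arkanas fact = new Arkanas();
                fact.setApiNumber(rs.getString("api_number"));
                fact.setFirstProdDate(rs.getString("date_of_1st_prod"));
                fact.setInitialProcuction(rs.getString("initial_production"));
                facts.add(fact);
            }
        }
        return facts;
    }
}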

KIE File System

KIE File System is used to load DRL files and to reduce the dependency on the KIE module configuration file. It allows us to change the business logic without redeployment and to keep the DRL files separate from the JAR and deployment files.

All the rule files in the given path are loaded into the KIE session so that the rules can be applied to the inserted facts. Facts are inserted into the ksession for which the particular rules are loaded.

public void createKnowledgeSession(String rulePath) {
 //Gets the factory class for KIE.
 KieServices ks = KieServices.Factory.get();
 KieFileSystem kfs = ks.newKieFileSystem();
 KieRepository kr = ks.getRepository();
 File rulePathFile = new File(rulePath);
 File ruleFiles[] = null;
 if(rulePathFile.isDirectory()) {
 ruleFiles = rulePathFile.listFiles();
 }
 for(File drlFile : ruleFiles) {
 LOGGER.info("File path is "+drlFile.getAbsolutePath());
 kfs.write(ResourceFactory.newFileResource(drlFile));
 }
 KieBuilder kb = ks.newKieBuilder(kfs);
 kb.buildAll();
 KieContainer kContainer = ks.newKieContainer(kr.getDefaultReleaseId());
 this.kSession = kContainer.newKieSession();
 }
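Once the session is created, applying the rules is simply a matter of inserting the facts and firing the rules. A minimal sketch, continuing the same class and assuming the fact list comes from the JDBC step shown earlier, is given below:

 // Sketch: inserting the loaded facts and firing the loaded DRL rules.
 public void applyRules(List<Arkanas> facts) {
  for (Arkanas fact : facts) {
   kSession.insert(fact);   // facts become visible to the loaded rules
  }
  kSession.fireAllRules();  // rules retract invalid facts and normalize the rest
  kSession.dispose();       // release the session once processing is done
 }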

Applying Rules

Multiple rules were applied to both the datasets to process data, remove duplicate and invalid data, normalize data, and filter data.

Arkansas Dataset – Rules Applied

  1. Removing Invalid Values
  2. Setting Initial Production and Unifying Date Format

Oklahoma Dataset – Rules applied

  1. Applying Date Filter
  2. Getting Max Modify Date Values
  3. Filtering Max Modify Date Values

Applying Rules in Arkansas Dataset

Rule 1: Removing Invalid Values

This rule is applied to remove invalid production date and initial production values using the retract keyword.

rule "remove invalid initial production and first production values" salience 2
 when
 $arkanas : Arkanas( $first_prod_date : firstProdDate,$initial_production : initialProcuction )
eval( !StringUtil.isValidString($first_prod_date) || !StringUtil.isValidString($initial_production) || 
"X".equalsIgnoreCase($initial_production) || $initial_production.contains(",") || 
$initial_production.contains("See Remarks") )
 then
 retract($arkanas);
 end

Arkansas data filtered after removing invalid production date and initial production values is as follows:

select

Rule 2: Setting Initial Production and Unifying Date Format

This rule is applied to set the gas_vol value (the initial production value is treated as gas_vol) and to convert the production date from a string to the prescribed date type by supplying the date format.
Well API numbers are normalized by padding them with zeros to fourteen digits.

rule "set initial production to gas_vol and first_prod_date to correct format" salience 1
 when 
$arkanas : Arkanas( $initial_production : initialProcuction,$first_prod_date : firstProdDate,$api_number : apiNumber )
eval( CommonUtil.isValidDate($first_prod_date,"EEE MMM dd HH:mm:ss z yyyy") )
 then
 try {
$arkanas.setDateTime(CommonUtil.getDate($first_prod_date,"EEE MMM dd HH:mm:ss z yyyy"));
 $arkanas.setGasVol(Float.parseFloat($initial_production));
 $arkanas.setFilteredData(true);
 $arkanas.setApiNumber($api_number+"0000");
 } catch (Exception e) {
 e.printStackTrace();
 }
end

API numbers are normalized after applying the normalizing rule:

select

First production dates are converted into the prescribed format in DRL file and mapped with date_time column as shown in the below diagram:

select

Applying Rules in Oklahoma Dataset

Rule 1: Applying Date Filter

This rule is applied to keep only the records whose test date falls within the last 7 years and whose drill type is horizontal; all other records are retracted.

rule "data between given range"
 when
 $oklahoma : Oklahoma( $api_number : apiNumber,$test_date : testDate,$modify_date : 
 modifyDate,$drill_type : drillType)
eval( !(CommonUtil.isDateWithinRange(CommonUtil.get
Date("2010-01-01","yyyy-MM-dd"),new 
Date(),$test_date) && StringUtil.isValidString($drill_type) && ( 
$drill_type.toUpperCase().startsWith("HORIZONTAL") || 
$drill_type.toUpperCase().equalsIgnoreCase("H") ) ) )
 then
 retract($oklahoma);
 end

Data with a test date before 2010-01-01 is filtered out to remove invalid test date values, and the test date is mapped to the date_time column as mentioned in the data description section:

select

Rule 2: Getting Max Modify Date Values

This rule uses accumulate to compute the maximum modify date per API number and stores it in a temporary fact called MaxValue.

The accumulate function gets the maximum modify date by grouping the data by well API number.

rule "selecting ok data with max modify date"
 when
 $oklahoma : Oklahoma($api_number : getApiNumber()) and not MaxValue($api_number == apinumber)
accumulate(ok : Oklahoma(getApiNumber()==$api_number),$maxDateValue : 
max(ok.getModifyDate().getTime()))
 then
 insert(new MaxValue($api_number,$maxDateValue));
 end
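MaxValue here is a plain Java helper fact (UniqVal in the next rule is analogous). A minimal sketch, with field names inferred from the DRL patterns above and therefore assumptions rather than the original source, might look like:

// Sketch of the helper fact used by the Oklahoma rules (field names inferred from the DRL).
public class MaxValue {
    private String apinumber;       // matched by MaxValue($api_number == apinumber)
    private long max_modify_date;   // maximum modify date (epoch millis) for that API number

    public MaxValue(String apinumber, long max_modify_date) {
        this.apinumber = apinumber;
        this.max_modify_date = max_modify_date;
    }

    public String getApinumber() { return apinumber; }
    public long getMax_modify_date() { return max_modify_date; }
}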

Rule 3: Filtering Max Modify Date Values

This rule is applied to keep only the rows with the maximum modify date, storing the unique values in UniqVal facts so that data is not replicated.

rule "Filters max modify date values"
 when
 $oklahoma : Oklahoma($api_number : getApiNumber(),$max_date : getModifyDate().getTime()) 
and not UniqVal($api_number == apinumber,$max_date == max_modify_date)
$maxValue : MaxValue($api_number==apinumber && $max_date==max_modify_date)
 then
 insert(new UniqVal($api_number,$max_date));
 $oklahoma.setFilteredData(true);
 end

After removing the duplicate well information by grouping the API number with max modify date, the number of records is reduced to fifty:

select

Salience

Salience is used to set the order in which rules are applied, as certain rules need to be executed only after other rules have run.

For example, in the Arkansas dataset, the rule "removing invalid initial production and first production values" (salience 2) had to be executed before the rule that formats the date and gas_vol (salience 1).

rule "remove invalid initial production and first production values" salience 2
 when
 $arkanas : Arkanas( $first_prod_date : firstProdDate, $initial_production : initialProcuction )
eval( !StringUtil.isValidString($first_prod_date) || !StringUtil.isValidString($initial_production) || 
"X".equalsIgnoreCase($initial_production) || $initial_production.contains(",") || 
$initial_production.contains("See Remarks") )
 then
 retract($arkanas);
 end

Conclusion

Business rules are separated from the application code by placing the normalization and filtration logic in the DRL file. Thus, the business logic can be changed easily without redeployment.

In this blog, the test date was used to keep the last 7 years of data. A business analyst can change this date range in the future without changing code or performing a redeployment.

References


Building a RESTful API Using LoopBack


Overview

LoopBack, an easy-to-learn open-source Node.js framework, allows you to create end-to-end REST APIs with less code compared to Express and other frameworks. It automatically creates the basic routes when a model is added to the application.

Data can be accessed from multiple databases such as MySQL, MongoDB, Oracle, MS SQL, PostgreSQL, and so on. In this blog post, let us discuss building a RESTful API using LoopBack and accessing data from MongoDB.

Installing LoopBack

LoopBack can be installed in either of two ways: using API Connect or using StrongLoop. To install using StrongLoop, use the below command:

$ sudo npm install -g strongloop

Creating Sample Project

In this section, let us discuss about creating a sample project.

Creating Application

To create an application, use the below command:

$ slc loopback

Accept the default selection api-server to create the application.

Connecting to Database

To connect to a database such as MySQL or MongoDB, install the corresponding connector. For MongoDB, use the below command:

$ npm install loopback-connector-mongodb --save

Creating Model

The models are connected to databases via data sources providing create, retrieve, update, and delete (CRUD) functions.

select

Other backend services such as REST APIs, SOAP web services, storage services, and so on are generalized as data sources.

To create a model, use the below command:

$ slc loopback:model

Running Application

To run the application, use either of the below commands:

$ node server.js
 $ slc run

Creating Static HTML Page

To create a static HTML page, perform the following:

  • Comment the default root configuration from the following file:
server/boot/root.js
--- disable the code ---
 module.exports = function(server) { // Install a `/` route that returns server status
 var router = server.loopback.Router();
 router.get('/', server.loopback.status());
 server.use(router);
 };
  • Add root configuration from the below file path:
server/middleware.json
...
 "files": {
 "loopback#static": {
 "params": "$!../client"
 }
 },
 ...
The above lines define static middleware making the application serve files in the client directory as static content. The $! characters indicate that the path is relative to the location of middleware.json. The page looks similar to the one below:

select

MongoDB Collection

A collection called studentmodel is created in MongoDB as shown in the below diagram:

select

Creating Custom API

To create custom API to access data from MongoDB [Extend API], perform the following:

  • Use the below root configuration file:
common/models/studentmodel.js
  • Select data from MongoDB based on the API request using the below command:
Studentmodel.getName = function(shopId, cb) {
 Studentmodel.findById( shopId, function (err, instance) {
 var response = "Name of coffee shop is " + instance.name;
 cb(null, response);
 console.log(response);
 });
 }
Studentmodel.remoteMethod (
 'getName',
 {
 http: {path: '/getname', verb: 'get'},
 accepts: {arg: 'id', type: 'string', http: { source: 'query' } },
 returns: {arg: 'name', type: 'string'}
 }
 );
The screen looks similar to the one shown below:

select

  • Insert data into MongoDB based on the API.
api : /api/studentmodels/addstudent?name=cc&category=fresh
 /** To insert data into mongodb from api **/
Studentmodel.addStudent = function(stuname,stucateg, cb) {
 var newstu = {"name": stuname, "category": stucateg};
 Studentmodel.create( newstu, function (err) {
 var response = "Successfully inserted";
 cb(null, response);
 console.log(response);
 });
 }
Studentmodel.remoteMethod (
 'addStudent',
 {
 http: {path: '/addstudent', verb: 'get'},
 accepts: [{arg: 'name', type: 'string', http: { source: 'query' } },{arg: 'category', type: 'string',
http: { source: 'query' } }],
 returns: {arg: 'response', type: 'string'}
 }
 );
The screen looks similar to the one shown below:

select

The database table is as follows:

Before Inserting Data into Database

select

After Inserting Data into Database

select

  • Destroy model instance with the specified ID.
api: /api/studentmodels/removestudent?id=591edf53a593f55ec705595c
/** Destroy model instance with the specified ID **/
 Studentmodel.removeStudent = function(stuid, cb) {
 Studentmodel.destroyById( stuid, function (err) {
 var response = "Successfully removed";
 cb(null, response);
 console.log(response);
 });
 }
Studentmodel.remoteMethod (
 'removeStudent',
 {
 http: {path: '/removestudent', verb: 'get'},
 accepts: {arg: 'id', type: 'string', http: { source: 'query' } },
 returns: {arg: 'response', type: 'string'}
 }
 );
The screen looks similar to the one shown below:

select

The database table is as follows:

Before Deleting Data from Database

select

After Deleting Data from Database

select

  • Update model instance with specified ID.
    Replace attributes for the model instance whose ID is the first input argument, persist it into the data source, and perform validation before replacing.
api: /api/studentmodels/updatestudent?id=591eddaba593f55ec705595b&name=ccccc&category=manag
/** Update instance of model with the specified ID **/
 Studentmodel.updateStudent = function(stuid, stuname, stucateg, cb) {
 var newstu = {"name": stuname, "category": stucateg};
 Studentmodel.replaceById( stuid, newstu, function (err) {
 var response = "Successfully updated";
 cb(null, response);
 console.log(response);
 });
 }
Studentmodel.remoteMethod (
 'updateStudent',
 {
 http: {path: '/updatestudent', verb: 'get'},
 accepts: [{arg: 'id', type: 'string', http: { source: 'query' } }, {arg: 'name', type: 'string', http: {
source: 'query' } },{arg: 'category', type: 'string', http: { source: 'query' } }],
 returns: {arg: 'response', type: 'string'}
 }
 );
The screen looks similar to the one shown below:

select

The database table is as follows:

Before Updating Data into Database

select

After Updating Data into Database

select

References

Pivoting and Unpivoting Multiple Columns in MS SQL Server


Overview

MS SQL Server, a Relational Database Management System (RDBMS), is used for storing and retrieving data. Data integrity, data consistency, and data anomalies play a primary role when storing data in a database. Data is provided in different formats to create different visualizations for analysis. For this purpose, you need to pivot (rows to columns) and unpivot (columns to rows) your data.

A PIVOT relational operator is used to convert values of multiple rows into values of multiple columns. An UNPIVOT relational operator is used to convert values of multiple columns into values of multiple rows. In this blog, let us discuss converting values of rows into columns (PIVOT) and values of columns into rows (UNPIVOT) in MS SQL Server.

Pre-requisite

  • Install MS SQL SERVER 2012
  • Create the MovieLens database and table objects based on the data model, and load the sample data

Use Case

In this use case, let us convert row data into column data using custom logic and a temp table, and populate the aggregated data into the temp table.

Dataset Description

A sample dataset, containing information about movies and their user ratings, is used in this use case. For the sample dataset, please look into the References section.

Data modeling for the sample dataset is as follows:

select

Syntax for Pivot Clause

The syntax for pivot clause is as follows:

SELECT first_column AS <first_column_alias>,
 [pivot_value1], [pivot_value2], ... [pivot_value_n]
 FROM
 (<source_table>) AS <source_table_alias>
 PIVOT
 (
 aggregate_function(<aggregate_column>)
 FOR <pivot_column> IN ([pivot_value1], [pivot_value2], ... [pivot_value_n])
 ) AS <pivot_table_alias>;

Parameters or Arguments

The parameters or arguments used are as follows:

  • first_column – Column or expression displayed as first column in the pivot table.
  • first_column_alias – Column heading for the first column in the pivot table.
  • pivot_value1, pivot_value2, … pivot_value_n – List of values to pivot.
  • source_table – SELECT statement providing source data for the pivot table.
  • source_table_alias – Alias for source_table.
  • aggregate_function – Represents aggregate functions such as SUM, COUNT, MIN, MAX, or AVG.
  • aggregate_column – Column or expression used with the aggregate_function.
  • pivot_column – Column containing the pivot values.
  • pivot_table_alias – Alias for the pivot table.

Converting Single Row into Multiple Columns Using Pivot Operator

A PIVOT operator is used to transpose rows into columns.

To convert single row into multiple columns, perform the following:

  • Fetch data from database using the below query:
/* Getting table data */
WITH cte_result AS( 
 SELECT
 m.movieid ,m.title ,ROUND(r.rating,0) AS [rating],
 CAST(ROUND(r.rating,0) AS VARCHAR(5))+'_rating' AS [Star]
 FROM [movielens].[dbo].[rating] r
 JOIN [movielens].[dbo].[movie] m ON m.movieid=r.movieid )
SELECT * FROM (
 SELECT
 movieid AS [MovieId],
 title AS [Movie Name],
 CAST(COUNT(*) AS FLOAT) AS [noofuser],
 CAST(SUM(Rating) AS FLOAT) AS [sumofrating],
 CAST(AVG(Rating) AS FLOAT) AS [avgofrating],
 CASE WHEN star IS NULL THEN 't_rating' ELSE star END [RatingGrade]
 FROM cte_result WHERE MovieId <= 2 GROUP BY ROLLUP(movieid,title,star) )ratingfilter
 WHERE [Movie Name] IS NOT NULL ;
  • Get aggregated data using Pivot and convert single row into multiple columns using the below query:
/* Getting aggregated data using Pivot and converting rows to columns */
WITH cte_result AS(
SELECT 
 m.movieid ,m.title ,ROUND(r.rating,0) AS [rating], 
 CAST(ROUND(r.rating,0) AS VARCHAR(5))+'_rating' AS [Star]
FROM [movielens].[dbo].[rating] r 
JOIN [movielens].[dbo].[movie] m ON m.movieid=r.movieid )
SELECT 
[MovieId],
[Movie Name],
[1_rating],
[2_rating],
[3_rating],
[4_rating],
[5_rating],
[t_rating] FROM
(SELECT 
 movieid AS [MovieId] ,
 title AS [Movie Name],
 CAST(COUNT(*) AS FLOAT) AS [noofuser],
 CASE WHEN star IS NULL THEN 't_rating' ELSE star END [RatingGrade]
FROM cte_result GROUP BY ROLLUP(movieid,title,star))ratingfilter
PIVOT (SUM([noofuser]) FOR [RatingGrade] IN ([1_rating],[2_rating],[3_rating],[4_rating],[5_rating],[t_rating]))a 
WHERE [Movie Name] IS NOT NULL ORDER BY movieid
The single row transposed into multiple columns is shown in the below diagram:

select

The transposed ratings of the movies are graphically represented using MS Excel as follows:

select

Converting Multiple Rows into Multiple Columns Using Pivot Operator

The Pivot operator can also be used to convert multiple rows into multiple columns.

To convert multiple rows into multiple columns, perform the following:

  • Fetch data from database using the below query:
/* Getting table data */
WITH cte_result AS(
SELECT 
 m.movieid,
 m.title,
 ROUND(r.rating,0) AS rating,
 u.gender
FROM [movielens].[dbo].[rating] r 
JOIN [movielens].[dbo].[movie] m ON m.movieid=r.movieid
JOIN [movielens].[dbo].[user] u ON u.userid=r.userid
WHERE r.movieid <= 5 )
SELECT movieid,title,CAST(SUM(rating) AS FLOAT) AS rating,CAST(COUNT(*) AS FLOAT) AS nofuser,CAST(AVG(rating) AS FLOAT) avgr,gender FROM cte_result
GROUP BY movieid,title,gender
ORDER BY movieid,title,gender
  • Select rows for conversion into columns as shown in the below diagram:

select

Multiple rows can be converted into multiple columns by applying both UNPIVOT and PIVOT operators to the result.

  • Use UNPIVOT operator to fetch values from rating, nofuser, and avgr columns and to convert them into one column with multiple rows using the below query:
/* Getting aggregated data using Unpivot and converting column to row */
WITH cte_result AS(
SELECT 
 m.movieid,
 m.title,
 ROUND(r.rating,0) AS rating,
 u.gender
FROM [movielens].[dbo].[rating] r 
JOIN [movielens].[dbo].[movie] m ON m.movieid=r.movieid
JOIN [movielens].[dbo].[user] u ON u.userid=r.userid
WHERE r.movieid <= 5 )
SELECT movieid,title,gender+'_'+col AS col,value FROM (
SELECT movieid,title,CAST(SUM(rating) AS FLOAT) AS rating,CAST(COUNT(*) AS FLOAT) AS nofuser,CAST(AVG(rating) AS FLOAT) avgr,gender FROM cte_result GROUP BY movieid,title,gender) rt
unpivot ( value FOR col in (rating,nofuser,avgr))unpiv
ORDER BY movieid
Multiple columns converted into single column are shown in the below diagram:

select

The PIVOT operator is then applied to the obtained result to spread this single column back out into multiple columns.

  • Get aggregated data using Pivot and convert multiple rows into multiple columns using the below query:
/* Getting aggregated data using Pivot and converting Multiple rows to Multiple column */
WITH cte_result AS(
SELECT 
 m.movieid,
 m.title,
 ROUND(r.rating,0) AS rating,
 u.gender
FROM [movielens].[dbo].[rating] r 
JOIN [movielens].[dbo].[movie] m ON m.movieid=r.movieid
JOIN [movielens].[dbo].[user] u ON u.userid=r.userid
WHERE r.movieid <= 5 )
SELECT movieid,title,
[M_nofuser],[F_nofuser],
[M_rating],[F_rating],
[M_avgr],[F_avgr] 
FROM 
(
SELECT movieid,title,gender+'_'+col AS col,value FROM (
SELECT movieid,title,CAST(SUM(rating) AS FLOAT) AS rating,CAST(COUNT(*) AS FLOAT) AS nofuser,CAST(AVG(rating) AS FLOAT) avgr,gender FROM cte_result GROUP BY movieid,title,gender) rt
unpivot ( value FOR col in (rating,nofuser,avgr))unpiv )tp
pivot ( SUM(value) FOR col in ([M_rating],[M_nofuser],[M_avgr],[F_rating],[F_nofuser],[F_avgr])) piv 
ORDER BY movieid
Multiple rows converted into multiple columns are shown in the below diagram:

select

The transposed movie ratings and their users are graphically represented using MS Excel as follows:

select

Conclusion

In this blog, we discussed the PIVOT operator in MS SQL Server to transpose data from rows to columns and the UNPIVOT operator to transpose data from columns to rows.

References

Data Flow Pipeline using StreamSets


Overview

StreamSets Data Collector, an open-source, lightweight, powerful engine, is used to stream data in real time. It is a continuous big data ingestion and enterprise-grade infrastructure used to route and process data in your data streams. It accelerates time to analysis by bringing unique transparency and processing to data in motion.

In this blog, let us discuss generating a data flow pipeline using StreamSets.

Pre-requisites

  • Install Java 1.8
  • Install streamsets-datacollector-2.5.1.1

Use Case

Generating a data flow pipeline using StreamSets via JDBC connections.

What we need to do:

  1. Install StreamSets Data Collector
  2. Create JDBC Origin
  3. Create JDBC Lookup
  4. Create Dataflow Pipeline
  5. View Pipeline and Stage Statistics

Installing StreamSets Data Collector

The core software is developed in Java, while the web interface is developed in JavaScript/AngularJS, D3.js, HTML, and CSS.

To install StreamSets, perform the following:

Installing Java

To install Java, use the below command:

sudo apt-add-repository ppa:webupd8team/java
 sudo apt-get update
 sudo apt-get install oracle-java8-installer

Use the command whereis java to check the Java location.

Ensure that JAVA_HOME variable is set to:

/usr/lib/jvm/java-8-oracle

To set JAVA_HOME, use the below command:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

Installing StreamSets (from Tarball)

To install StreamSets, perform the following:

  • Create a directory as follows:
mkdir /home/streamsets
  • Extract the tar file using the below command:
tar -xzf streamsets-datacollector-core-2.5.1.1.tgz
  • Create a system user and group named sdc using the below commands:
sudo addgroup sdc
 sudo adduser --ingroup sdc sdc
  • Create the /etc/init.d directory (in root) using the below command:
# mkdir /etc/init.d
  • Copy /home/streamsets/streamsets-datacollector-2.5.1.1/initd/_sdcinitd_prototype to /etc/init.d directory and change ownership of the file to sdc using the below commands:
cp /home/streamsets/streamsets-datacollector-2.5.1.1/initd/_sdcinitd_prototype /etc/init.d/sdc
 chown sdc:sdc /etc/init.d/sdc
  • Edit /etc/init.d/sdc file and set $SDC_DIST and $SDC_HOME environment variables to the location from where tarball is extracted using the below commands:
export SDC_DIST="/home/ubuntu/streamsets/streamsets-datacollector-2.5.1.1/"
 export SDC_HOME="/home/ubuntu/streamsets/streamsets-datacollector-2.5.1.1/"
  • Make the sdc file executable using the below command:
chmod 755 /etc/init.d/sdc
  • Create the Data Collector configuration directory at /etc/sdc: (in root) using the below command:
# mkdir /etc/sdc
  • Copy all the files from the etc directory of the extracted tarball into the Data Collector configuration directory that you just created, using the below command:
cp -R etc/ /etc/sdc
  • Change the ownership of the /etc/sdc directory and all files in the directory to sdc:sdc using the below command:
chown -R sdc:sdc /etc/sdc
  • Provide ownership only permission for form-realm.properties file in the /etc/sdc directory using the below command:
chmod go-rwx /etc/sdc/form-realm.properties
  • Create the Data Collector log directory at /var/log/sdc and change the ownership to sdc:sdc using the below commands:
mkdir /var/log/sdc
 chown sdc:sdc /var/log/sdc
  • Create Data Collector data directory at the path – /var/lib/sdc and change the ownership to sdc:sdc using the below commands (in root):
mkdir /var/lib/sdc
 chown sdc:sdc /var/lib/sdc
  • Create Data Collector resources directory at /var/lib/sdc-resources and change the ownership to sdc:sdc using the below commands:
mkdir /var/lib/sdc-resources
 chown sdc:sdc /var/lib/sdc-resources
  • Start Data Collector as a service using the below command:
service sdc start
Note: If you get an error such as “sdc is died”, check the configured open-file limit for the current user using the below command:
ulimit -n
Set the session limit using the below command:
ulimit -u unlimited
  • Access the Data Collector console by entering the following URL in the address bar of browser:
http://<system-ip>:18630/
Note: the default username is “admin” and password is “admin”.

Creating JDBC Origin

To create JDBC origin, perform the following steps:

  • Click Create New Pipeline to create a pipeline.
  • Add Title for the pipeline as shown in the below diagram:

1

Note: In this analysis, Origin “JDBC Query consumer” is used.

  • Download JDBC origin Package Manager as shown in the below diagram:

2

Note: You can also import the package manually using the below command:

/home/streamsets/streamsets-datacollector-2.5.1.1/bin/streamsets stagelibs -install=streamsets-datacollector-jdbc-lib

  • Add configurations to JDBC Query Consumer origin.
  • Uncheck “Incremental Mode” in the configuration so that the Query Consumer does not expect the query to contain “where” and “order by” clauses with an offset column, as shown in the below diagram:

3

  • Add “where” and “order by” clause using offset value.

4

  • Click Validate to check the connection.

Note: If you are unable to connect to JDBC Query Consumer, move “mysql-connector-java-5.1.27-bin.jar” to the below path:

/home/streamsets/streamsets-datacollector-2.5.1.1/streamsets-libs/streamsets-datacollector-jdbc-lib/lib

Creating JDBC Lookup

To create a JDBC lookup, lookup columns are required from both the source table and the lookup table.
For example, use the ‘applicantId’ field in “applicant” (source) table to look up the ‘applicantId’ column in “application” (lookup) table using the below query:

SELECT * FROM application WHERE applicantId = '${record:value('/applicantId')}'

The query uses the value of “applicantId” column from the applicant (source) table. In this example, three tables are used for lookup.

5

The result of the above JDBC lookup is given as an input to next lookup table “loan_raw” by using the below query:

SELECT * FROM loan_raw WHERE applicationId = '${record:value('/applicationId')}'

Creating Dataflow Pipeline

Different “Processors” are used for creating dataflow pipeline.

Field Remover

It discards unnecessary fields in the pipeline.

8

Expression Evaluator

It performs calculations and writes the results to new or existing fields. It is also used to add or modify record header attributes and field attributes.

9

Stream Selector

It passes data to streams based on conditions and uses a default stream for records that do not match any user-defined condition. You can also define a condition for each stream of data.

7

Local FS is used to store the resultant data.

10

 

11

The full data flow pipeline is as follows:

12

Viewing Pipeline and Stage Statistics

A pipeline can be monitored while running it. Real-time summary and error statistics can be viewed for the pipeline and for the stages in the pipeline. By default, the Data Collector console displays pipeline monitoring information while running the pipeline. Any stage can be selected to view its statistics. Similarly, error information for the pipeline and its stages can be viewed.

Previewing Dataflow Pipeline

In the Data Collector pipeline, clicking Preview shows the input and output data at each stage.

13

Viewing Pipeline States

The pipeline state is the current condition of the pipeline, such as “running” or “stopped”. It is displayed in the All Pipelines list and in the Data Collector log.

viewing_pipeline_states

Viewing Pipeline Statistics

Record count, record and batch throughput, batch processing statistics, and heap memory usage are displayed for the pipeline as shown in the below diagram:

14

For a pipeline started with runtime parameters, the values of the parameters currently in use are displayed as shown in the below diagram:

15

Viewing Stage Statistics

Record and batch throughput, batch processing statistics, and heap memory usage are displayed for a stage as shown in the below diagram:

16

 

17

Conclusion

In this blog, we discussed configuring the JDBC Query Consumer, performing a JDBC lookup across more than one table, creating a dataflow pipeline, and monitoring the stage and pipeline statistics.

References

Database Performance Testing with Apache JMeter


Overview

Database performance testing is used to identify performance issues before deploying database applications for end users. Database load testing is used to test database applications for performance, reliability, and scalability under varying user load. Load testing involves simulating real-life user load on the target database applications to determine how they behave when multiple users hit them simultaneously.

Pre-Requisites

Use case

Let us perform database load testing to measure the performance of a database using Apache JMeter by configuring MySQL JDBC driver.

Building Database Test Plan

A test plan describes a series of steps to be executed by JMeter on running a database test. To construct a test plan, the following elements are needed:

  • Thread Group
  • JDBC Request
  • Summary Report

Adding Users

The first step in creating a JMeter test plan is to add a Thread Group element. The element provides details about the number of users to simulate, the frequency at which requests are sent, and the number of requests each user sends.

To add a Thread Group element, perform the following:

  • In the left pane, right click on Test Plan.
  • Select Add –> Threads (Users) –> Thread Group as shown in the below diagram:

select

  • Provide the Thread Group name as “JDBC Users”.
  • Modify the default properties as follows:
    • No. of Threads (users): 10
    • Ramp-Up Period (in seconds): 100
    • Loop Count: 10 as shown in the below diagram:

thread_properties

Note: The ramp-up period states the time taken to “ramp-up” to the full number of threads chosen.

As 10 threads are used in our use case and the ramp-up period is 100 seconds, JMeter will take 100 seconds to get all 10 threads up and running. Each thread will start 10 (100/10) seconds after the previous thread began. The query will therefore be executed 10 (threads) * 10 (loops) = 100 times, so the total number of samples is 100.

Adding JDBC Requests

To add a JDBC request, perform the following:

  • In the left pane, right click on Thread Group.
  • Select Add –> Config Element –> JDBC Connection Configuration.
  • Configure the following details:
    • Variable Name: myDatabase
      Note: This name needs to be unique as it is used by the JDBC Sampler to identify the configuration to be used
    • Database URL: jdbc:mysql://ipOfTheServer:3306/cloud
    • JDBC Driver class: com.mysql.jdbc.Driver
    • Username: username of database
    • Password: password for the database username as shown in the below diagram:

jdbc_connection_configuration
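Under the hood, each JDBC Request sampler borrows a connection created from exactly these settings. For reference, a minimal plain-JDBC sketch of the equivalent connection is shown below; the credentials and the query are placeholders, not values from the actual test:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcConnectionSketch {
    public static void main(String[] args) throws Exception {
        // Same driver class and URL as in the JDBC Connection Configuration element.
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://ipOfTheServer:3306/cloud", "username", "password");

        // Placeholder query; in JMeter this is the SQL Query string of the JDBC Request sampler.
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
        conn.close();
    }
}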

Adding Sampler

To add a sampler, perform the following:

  • In the left pane, right click on Thread Group.
  • Select Add –> Sampler –> JDBC Request.
  • Provide the following details:
    • Variable Name: ‘myDatabase’ (same as in the configuration element)
    • Enter SQL Query string field as shown in the below diagram:

jdbc_request

Adding Listener to View/Store Test results

A Listener is used to store test results of all JDBC requests in a file and to present the results.

To view the test results, perform the following:

  • In the left pane, right click on Thread Group.
  • Select Add –> Listener –> View Results Tree/Summary Report/Graph Results.
  • Save the test plan and click Run (Start or Ctrl + R) to run the test.

All the test results will be stored in the Listener.

Viewing Test results

Tree View

The results can be viewed in tree format as shown in the below diagram:

view_results_tree

Table View

The results can be viewed in table format as shown in the below diagram:

table_view

Graph View

The results can be viewed in graph format as shown in the below diagram:

graph_view

Response Time Graph View

The results can be viewed in graph format as shown in the below diagram:

response_time_graph_view

Performance Metrics

Different performance metrics viewable in JMeter are as follows:

Throughput

Hits per second, or the total number of requests per unit of time (seconds, minutes, hours) sent to the server during the test.
endTime = lastSampleStartTime + lastSampleLoadTime
startTime = firstSampleStartTime
conversion = unit time conversion value
Throughput = numRequests / ((endTime - startTime) * conversion)

In our use case, the Throughput is 61.566/min. A higher Throughput value indicates better performance.

Latency

The delay incurred in communicating a message. A lower Latency value indicates that a higher volume of information can be sent and received in the same time. In our use case, the latency for the first thread is 24446 ms.

Min/Max Load Time/Response Time/Sample Time

Difference between the request sent time and response received time. Response time is always greater than or equal to Latency.

In our use case, for all the samples, Response Time >= Latency

90% Line (90th Percentile)

Threshold value below which 90% of the samples fall. To calculate the 90th percentile value, sort the transaction instances by their value and remove the top 10% instances. The highest value left is the 90th percentile.

Similarly, the same calculation applies for 95% and 99% lines. After calculation, the values are 667, 666, and 664 ms, respectively.

Error

The total percentage of errors found in the sample requests. A value of 0.00% indicates that all requests completed successfully and the query performance is good.

Standard Deviation/Deviation

A lower standard deviation value indicates more consistent response times. The standard deviation should be less than or equal to half of the average time for a label; a higher value indicates that the results are not reliable. In our use case, it is 2881.

Minimum Time

The minimum time taken to serve a sample request for this label. In the Total row, the minimum equals the lowest value across all samples. In our use case, it is 0 ms.

Maximum Time

The maximum time taken to serve a sample request for this label. In the Total row, the maximum equals the highest value across all samples. In our use case, it is 23114 ms.

Average Time

The average response time taken for a request. In the Total row, the average is computed across all samples.

KB/sec

The throughput rate measured in kilobytes per second. In our use case, it is 0.

Samples

Total number of samples pushed to server. In our use case, it is 100.

Conclusion

In this blog, we discussed measuring the performance of a database using JMeter by adding a real-life user load, the connection details, and a sample MySQL query.

Different Listeners such as Graph Results, View Results Tree, Summary Report, and more can be added to view different metrics. These metrics are shared with the performance tuning engineer to identify performance bottlenecks. Similarly, more samplers can be added to test different databases at the same time.

References

Visualize IoT data with Kaa and MongoDB Compass


Overview

Kaa is a highly flexible, open source middleware platform for Internet of Things (IoT) product development. It provides a scalable, end-to-end IoT framework for large cloud-connected IoT networks. Kaa enables data management and real time bidirectional data exchange between the connected objects and backend infrastructure by providing server and endpoint SDK components.

The SDK components support development in multiple programming languages, client-server communication, authentication handling, and so on. The SDK components can be integrated with virtually any type and number of connected devices or microchips.

In this blog, let us discuss quickly installing the Kaa Sandbox in a VirtualBox environment and connecting a sample application with a MongoDB log appender.

Pre-requisites

Installing Kaa Sandbox

To install Kaa Sandbox, perform the following:

  • Download Sandbox image v0.10.0 from the download link as shown in the below diagram:

kaa_download

  • Download the .ova file and save it to a suitable location on the local system.
  • Download VirtualBox platform packages.
    In our use case, Windows hosts are used.

virtual_box

  • Keep the two downloaded files in one location for easier reference.

kaa_location

  • Open the VirtualBox and go to File –> Import Virtual Appliance.
  • Import the downloaded kaa-sandbox-0.10.0.1.ova file as shown in the below diagram:

select

  • On clicking the Next button, select the appropriate configuration details such as Name, OS Type, CPU, RAM, and so on.
    Note: A minimum of 4096 MB of RAM should be available to run the environment.
  • On verifying the configuration details, click Import to start the process.

select

  • On completing the import process, start Kaa Sandbox in VM.
  • Click Start arrow button to start the process as shown in the below diagram:

select

On starting Kaa Sandbox in VirtualBox, the localhost Kaa dashboard access interfaces will be provided as shown below:

kaa_sandbox_window

Screen 1 provides the local host for dashboard access and screen 2 provides login to this virtual machine.

  • Enter the login credentials (username “Kaa” and password “Kaa”) to log in.
  • Copy Kaa Sandbox Web UI host (http://127.0.0.1:9080) and paste it into browser to browse the URL.
    Now, you can successfully access Kaa dashboard as shown in the below diagram:

sandbox1

Demonstrating Sample Application

In this section, let us discuss about demonstrating a sample Kaa application in Kaa Sandbox.

This application sends temperature data at a preset interval from the endpoint to the server. A log appender is set up to collect data from the endpoint. In our use case, the MongoDB log appender is used.

The sample application is executed by:

  • Generating Java SDK
  • Creating log appenders
  • Launching the application
  • Fetching MongoDB logs from the appenders

Choosing Application

The sample application named Data Collection Java Demo is chosen.

Generating SDK

To generate SDK, perform the following:

  • Download the Java demo binary – DataCollectionDemo.jar – by clicking the Binary button shown in the below diagram:

data_collection_java_demo

The corresponding SDK component will be built or generated. You can also download the Objective-C, C++, and Android versions of the SDK in addition to the Java SDK.

  • Click the Source button shown in the above image to download the source code of the application as DataCollectionDemo.tar.gz.
  • Log in to Administration UI by entering the default credentials (username: admin and password: admin123).

kaa_admin_ui

You can add or delete the application and get the application token required to communicate with the endpoint and the server.

Note: Copy this application token; it is required in further steps to fetch the data from the logs.

Creating Log Appenders

Developer credentials (username: devuser and password: devuser123) are used for developing the applications by creating schemas, log appenders, client configurations, and so on.

kaa_admin_ui2

MongoDB log appender is used in this use case.

Note: You can also choose Rest, File, Cassandra, Couchbase, Kafka, Oracle NoSQL, and Flume.

To create log appender, perform the following:

  • Select the application name from the applications in the Administration UI.
  • Select Data collection demo –> Log appenders –> Add log appender –> MongoDB appender.

select

Launching Application

Launch the application (.jar file) in command prompt.

For example, C:\demo_app> java -jar DataCollectionDemo.jar

Kaa will be started automatically and the data will be sent to the logs by the log appender as shown in the below diagram:

kaa_log

Fetching Logs from MongoDB

To fetch logs from MongoDB, log in to Kaa Sandbox and use the below command:

$ mongo kaa
 > db.logs_<application_token>.find() ---- (enter the application token copied in the Generating SDK section here)

fetching_logs_from_mongodb
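The same collection can also be read programmatically. A minimal sketch using the MongoDB Java driver is shown below; the host, port, and the applicationToken variable are placeholders for the values of your own Sandbox and application:

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class KaaLogReaderSketch {
    public static void main(String[] args) {
        String applicationToken = args[0];  // the token copied in the Generating SDK section

        // Connect to the MongoDB instance inside the Kaa Sandbox (host/port are placeholders).
        try (MongoClient client = new MongoClient("127.0.0.1", 27017)) {
            MongoDatabase kaaDb = client.getDatabase("kaa");
            MongoCollection<Document> logs = kaaDb.getCollection("logs_" + applicationToken);

            // Print every log document collected by the MongoDB log appender.
            for (Document doc : logs.find()) {
                System.out.println(doc.toJson());
            }
        }
    }
}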

Visualizing Logs in MongoDB Compass

Connecting to Host

connecting_to_host

Selecting Application Token

selecting_application_token

selecting_application_token1

Applying Query

applying_query

Viewing Startup Log

viewing_startup_log

Conclusion

In this blog, we discussed the basic steps to install the Kaa Sandbox in a virtual machine for Kaa Internet of Things (IoT) projects. We also discussed executing the demo application by generating the Java SDK, creating log appenders, launching the application, and fetching the MongoDB logs written by the appender.

References

Nginx with GeoIP MaxMind Database to Fetch User Geolocation Data


Overview

The geolocation data of a user plays a significant role in business marketing. This data is used to promote or market a brand, product, or service in the specific area to which the user belongs. It also helps in enriching the user profile.

In this blog, let us discuss finding the geographical location of a user from the user's IP address just by configuring Nginx with the GeoIP MaxMind databases, without writing any code.

Nginx, an open source HTTP server and IMAP/POP3 proxy server, is used as a main web server or as a reverse proxy server in front of Apache. Its GeoIP module (ngx_http_geoip_module) uses precompiled MaxMind databases to set variables such as $geoip_country_name, $geoip_country_code, $geoip_city, and so on, with values depending on the client's IP address.

Pre-requisites

  • Ubuntu Platform (Ubuntu 16.04, 12.04, 11.04)

Use Case

Install Nginx on Ubuntu, configure Nginx with GeoIP MaxMind Databases, and find the geolocation of the user using IP address.

Synopsis

  • Installation and Configuration
  • Fetching Geolocation Data Using Client IP

Installation and Configuration

Installing and Configuring Nginx on Ubuntu

To install and configure Nginx on Ubuntu, perform the following:

  • Install Nginx team’s package signing key using the following command:
$ curl -s https://nginx.org/keys/nginx_signing.key | sudo apt-key add -
  • Add the repo to your apt sources using the following commands:
$ echo -e "deb https://nginx.org/packages/mainline/ubuntu/ `lsb_release -cs` nginx\ndeb-src https://nginx.org/packages/mainline/ubuntu/ `lsb_release -cs` nginx" | sudo tee /etc/apt/sources.list.d/nginx.list
  • Resynchronize and install the package index files using the following commands:
$ sudo apt-get update
$ sudo apt-get install nginx

Installing GeoIP Module

The GeoIP module is used to look up the geolocation of the IP address of a client machine connected to the server.

To install GeoIP module, perform the following steps:

  • Download and load the module to /usr/lib/nginx/modules using the following commands:
$ sudo add-apt-repository ppa:nginx/stable
$ sudo apt-get update
$ sudo apt-get install nginx-module-geoip
  • Open nginx.conf using the following command:
$ sudo nano /etc/nginx/nginx.conf
  • Add the below main context in the nginx.conf file:
load_module "modules/ngx_http_geoip_module.so";
Note: Skip the above steps if --with-http_geoip_module is already included in your version of Nginx.

To check the existence of the GeoIP module, use the below command:

$ nginx -V

Downloading GeoIP MaxMind GeoCity and GeoCountry Databases

To download and extract MaxMind GeoCity and GeoCountry databases in Ubuntu system, use the following commands:

mkdir -p /etc/nginx/geoip
cd /etc/nginx/geoip
wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz
gunzip GeoIP.dat.gz
wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
gunzip GeoLiteCity.dat.gz

Configuring Nginx with GeoIP MaxMind Databases

Nginx is configured with the GeoIP MaxMind GeoCity and GeoCountry databases to access the MaxMind geo variables.

To configure Nginx with the databases, use the below configuration in nginx.conf:

load_module "modules/ngx_http_geoip_module.so";
worker_processes 1;
events { worker_connections 1024; }
http {

geoip_country /etc/nginx/geoip/GeoIP.dat; # the country IP database
geoip_city /etc/nginx/geoip/GeoLiteCity.dat; # the city IP database
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log /var/log/nginx/access.log main;
#Set Geo variables in proxy headers
proxy_set_header X_COUNTRY_CODE $geoip_country_code;
proxy_set_header X_CITY_COUNTRY_CODE $geoip_city_country_code;
proxy_set_header X_REGION $geoip_region;
proxy_set_header X_CITY $geoip_city;
proxy_set_header X_POSTAL_CODE $geoip_postal_code;
server {
listen 80;
# All other traffic gets proxied right on through.
location / {
proxy_pass http://127.0.0.1:3000;
}
}
}

Fetching Geolocation Data Using Client IP

A sample web application (a Node.js application) is created to return the request headers in a JSON response. Custom geo fields are added to the request headers by Nginx and are therefore accessible from the application. The application is reverse proxied via Nginx.

To get geolocation data of the user, use the following code written in Node.js:

// This sample web application returns request headers in response
 const express = require('express')
 const app = express()
 var count = 1;

// Location "/show_my_identity" hosts the incoming request headers in response in JSON format
 app.get('/show_my_identity', function (req, res) {
 res.send(JSON.stringify(req.headers));
 console.log('Request',count++,'received from country : ',req.headers.x_country_code);
 })

// Default "/" message
 app.get('/', function (req, res) {
 res.send("Welcome to Treselle lab");
 })

// Application listening on port 3000
 app.listen(3000, function () {
 console.log('Treselle Lab - App listening on port 3000!')
 })

The output of the application with the geolocation data looks similar to the one as shown below:

select

Note: To run the sample Node.js application, Node.js should be installed with required modules.

The application log with the geolocation data looks similar to the one as shown below:

select

Conclusion

In this blog, we discussed installing Nginx on Ubuntu and configuring Nginx with the GeoIP MaxMind GeoCity and GeoCountry databases. We reverse proxied the sample web application through Nginx to find the geolocation of the user from the IP address. In our upcoming blog, we will discuss Nginx with the GeoIP2 MaxMind database to fetch user geolocation data.

References

Apache NiFi – Data Crawling from HTTPS Websites


Overview

Apache NiFi, a very effective, powerful, and scalable dataflow building platform, is used to process and distribute data and to automate data flow between systems.

In this blog, let us discuss crawling data from HTTPS websites using Apache NiFi.

Pre-requisites

Download and install the following from the below links:

Use Case

Crawling employment statistics data (about 27 years from 1990 to till date) from a website.

Synopsis

  • Setting and Configuring SSL
  • Accessing and Crawling HTTPS Website Using Apache NiFi

Setting and Configuring SSL

To extract data from HTTPS websites, an SSL context is required to call the HTTPS sites, and the cacerts file from the JDK is needed.

Setting SSL in Local Machine

To set up SSL in the local machine, perform the following:

  • Create the admin-cert.pem certificate and admin-private-key.pem private key using the below OpenSSL command:
req -x509 -newkey rsa:2048 -config "E:\OpenSSL\openssl-0.9.8e_X64\openssl.cnf" -keyout admin-private-key.pem -out admin-cert.pem -days 365 -subj "/CN=Admin Q. User/C=US/L=Seattle" -nodes
select
  • Bundle admin-cert.pem and admin-private-key.pem into the admin-q-user.pfx file using the below command:
pkcs12 -inkey admin-private-key.pem -in admin-cert.pem -export -out admin-q-user.pfx -passout pass:"SuperSecret"
select

After successful execution of the above commands, “admin-cert.pem”, “admin-private-key.pem”, and “admin-q-user.pfx” files will be created as shown in the below diagram:

select

Adding SSL Certificate to Browser

To add “admin-q-user.pfx” file to the browser, perform the following:

  • Go to browser Settings –> Show advanced settings –> HTTPS/SSL –> Manage certificates.
    The screen looks similar to the one shown below:

select

  • Click Import –> Browse and provide the path of the “admin-q-user.pfx” file.
  • Enter the password provided in the above command.
    For example, SuperSecret.
  • Enable the option – “Automatically select the certificate store based on the type of certificate” and click Next as shown in the below diagram:

select

  • Click Finish.

Creating KeyStore and TrustStore

After successfully adding the certificate to the browser, perform the following:

  • Go to the location path of keytool.exe.
    For example, C:\Program Files\Java\jdk1.8.0_121\bin.
  • Open the command prompt.
  • Create “server_keystore.jks” and “server_truststore.jks” files using the below commands:
keytool -genkeypair -alias nifiserver -keyalg RSA -keypass SuperSecret -storepass SuperSecret -keystore server_keystore.jks -dname "CN=Test NiFi Server" -noprompt
keytool -importcert -v -trustcacerts -alias admin -file admin-cert.pem -keystore server_truststore.jks -storepass SuperSecret -noprompt
After running the above commands, the KeyStore and TrustStore files will be automatically created in the same path as the keytool.exe file as shown in the below diagram:

select

Adding SSL Certificate to KeyStore

To add SSL certificate to the KeyStore, use the below command:

keytool -importcert -v -trustcacerts -alias admin -file E:\OpenSSL\openssl-0.9.8e_X64\bin\admin-cert.pem -keystore server_keystore.jks -storepass SuperSecret -noprompt

select

Testing SSL Certificate Added to KeyStore

After adding the certificate, check the certificate details such as issuer, validity, and so on using the following command:

keytool -list -v -keystore server_keystore.jks

select

Configuring SSL in Apache NiFi

After adding the certificate, configure SSL in Apache NiFi by adding the user details in both the “authorizations.xml” and “nifi.properties” files under the NiFi conf path.

For example, C:\Apache_NIFI\nifi-1.2.0-bin\nifi-1.2.0\conf as shown in the below diagram:

select

After adding the user details, the “authorizations.xml” file looks similar to the one below:

select

After adding the user details, the “nifi.properties” file looks similar to the one below:

select
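The exact values depend on your environment, but the security-related entries in “nifi.properties” typically point to the KeyStore and TrustStore created above. A sketch using the paths and passwords from this example is shown below; the HTTPS host and port are placeholders, not values taken from this setup:

# HTTPS host/port are placeholders; keystore and truststore paths come from the keytool steps above
nifi.web.https.host=localhost
nifi.web.https.port=8443
nifi.security.keystore=C:\Program Files\Java\jdk1.8.0_121\bin\server_keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=SuperSecret
nifi.security.keyPasswd=SuperSecret
nifi.security.truststore=C:\Program Files\Java\jdk1.8.0_121\bin\server_truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=SuperSecret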

Accessing and Crawling HTTPS Website Using Apache NiFi

Configuring GetHTTP

To crawl data from HTTPS website using Apache NiFi, perform the following:

  • Open Apache NiFi.
  • Select Processor –> GetHTTP as shown in the below diagram:

select

  • Configure “GetHTTP”.

select

  • Add HTTPS website link in URL part for Apache NiFi to crawl the data from the website.
    Note: Just the website URL alone is enough to crawl the data from the website.
  • Add and configure “StandardSSLContextService” as shown in the below diagram:

select

  • Click Go To icon on the SSL Context Service in the above diagram to configure the SSL Context Service.
  • Create a new controller service as shown in the below diagram:

select

  • Edit the newly added controller service to add required property details such as Keystore Filename, Keystore Password, Keystore Type, and SSL Protocol as shown in the below diagram:

select

  • Enable the newly added controller service as shown in the below diagram:

select

Configuring PutFile

To save the output data in the prescribed location, configure PutFile as shown in the below diagram:

select

The page looks similar to the one shown below:

select

After configuring PutFile, run the process to crawl the data from HTTPS website.

Crawled Output

The crawled output data looks similar to the one shown below:

select

References


Airflow to Manage Talend ETL Jobs


Overview

Airflow, an open source platform, is used to orchestrate workflows as Directed Acyclic Graphs (DAGs) of tasks in a programmatic manner. The Airflow scheduler is used to schedule workflows and data processing pipelines. The Airflow user interface allows easy visualization of pipelines running in the production environment, monitoring of the progress of the workflows, and troubleshooting of issues when needed. Rich command line utilities are used to perform complex surgeries on DAGs.

In this blog, let us discuss scheduling and executing Talend jobs with Airflow.

Pre-requisites

Use Case

Schedule and execute Talend ETL jobs with Airflow.

Synopsis

  • Author Talend jobs
  • Schedule Talend jobs
  • Monitor workflows in Web UI

Job Description

Talend ETL jobs are created by:

  • Joining applicant_loan_info and loan_info on application_id as shown in the below diagrams:

etl_jobs_1

etl_jobs_2

  • Loading matched data into loan_application_analysis table.
  • Applying a filter on the LoanDecisionType field in the loan_application_analysis table to segregate values as Approved, Denied, and Withdrawn as shown in the below diagram:

select

  • Applying another filter on the above segregated values to segregate LoanType as Personal, Auto, Credit, and Home.

The created Talend job is built and moved to the server location. A DAG named Loan_Application_Analysis.py is created with the corresponding path of the job scripts so that the flow can be executed as and when required, as sketched below.
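A minimal sketch of such a DAG is shown below. It uses Airflow's BashOperator to run the launcher script produced by the Talend build; the script path and schedule are placeholders and should be replaced with your own values:

# Loan_Application_Analysis.py - minimal DAG sketch (script path and schedule are placeholders)
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2017, 8, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('Loan_Application_Analysis', default_args=default_args,
          schedule_interval='@daily')

# Run the shell launcher generated by the Talend build.
# The trailing space prevents Airflow from treating the ".sh" string as a Jinja template file.
run_talend_job = BashOperator(
    task_id='run_loan_application_analysis',
    bash_command='sh /home/ubuntu/talend_jobs/Loan_Application_Analysis/Loan_Application_Analysis_run.sh ',
    dag=dag,
)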

Creating DAG Folder and Restarting Airflow Webserver

After installing Airflow, perform the following:

  • Create a DAG folder (/home/ubuntu/airflow/dags) in the Airflow path.
  • Move all the .py files into the DAG folder.
  • Restart the Airflow webserver and scheduler using the below commands to view this DAG in the UI list:
# Log in to the AIRFLOW_HOME path, for example /home/ubuntu/airflow
cd /home/ubuntu/airflow
# Restart the webserver
airflow webserver
# Restart the scheduler
airflow scheduler
After restarting the webserver, all the .py files (DAGs) in the folder will be picked up and loaded into the web UI DAG list.

Scheduling Jobs

The created Talend jobs can be scheduled using the Airflow scheduler. For the code, look into the Reference section.
Note: The job can also be manually triggered by clicking the Run button under the Links column as shown below:

scheduling_jobs_1

Both the auto scheduled and manually triggered jobs can be viewed in the UI as follows:

scheduling_jobs_2

Monitoring Jobs

On executing the jobs, the upstream and downstream tasks are started as defined in the DAG.

On clicking a particular DAG, the corresponding job status, such as success, failure, retry, queued, and so on, can be visualized in different ways in the UI.

Graph View

The statuses of the jobs are represented in a graphical format as shown below:

graph_view

Tree View

The statuses of the jobs along with execution dates of the jobs are represented in a tree format as shown below:

select

Gantt View

The statuses of the jobs along with execution dates of the jobs are represented in a Gantt format as shown below:

gannt_view

Viewing Task Duration

On clicking the Task Duration tab, you can view the task duration of the whole process or of individual DAGs in a graphical format as shown below:

viewing_task_duration

Viewing Task Instances

On clicking Browse –> Task Instances, you can view the instances on which the tasks are running as shown below:

viewing_task_instances

Viewing Jobs

On clicking Browse –> Jobs, you can view job details such as start time, end time, executors, and so on as shown in the below diagram:

viewing_jobs

Viewing Logs

On clicking Browse –> View Log, you can view the details of the logs as shown in the below diagram:

viewing_logs

Data Profiling

Airflow provides a simple SQL query interface to query the data and a chart UI to visualize the tasks.

To profile your data, click Admin –> Connections to select the database connection type as shown in the below diagram:

date_profiling

Ad Hoc Query

To write and query the data, click Data Profiling –> Ad Hoc Query.

ad_hoc_query

Charts

Different types of visualizations can be created for task duration, task status, and so on using charts.

To generate charts such as bar, line, area, and so on for a particular DAG using SQL query, click Data Profiling –> Charts –> DAG_id as shown in the below diagram:

charts

All the DAGs are graphically represented as shown in the below diagram:

charts1

Email Notification

Email notifications can be set to report job status using flags such as email_on_failure, email_on_retry, and so on.
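In the DAG definition, these flags are set through default_args; a minimal sketch is shown below (the email address is a placeholder):

# Email notification flags in the DAG's default_args (address is a placeholder)
default_args = {
    'owner': 'airflow',
    'email': ['alerts@example.com'],
    'email_on_failure': True,
    'email_on_retry': True,
}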

To enable the notification, perform the following:

  • Configure the SMTP settings in the airflow.cfg file under the AIRFLOW_HOME path as shown below:

select

  • In your Gmail account, set Settings –> Allow less secure apps –> ON to receive email alerts from Airflow.
    Note: You may get an authentication error if the email settings are not properly configured. To overcome this issue, confirm the login attempt in the “Gmail device review” prompt as “Yes, That Was Me”.

A job failure email is shown below:

email_notification1

On clicking the Log Link in the email, you will be redirected to the Logs page.

Conclusion

In this blog, we discussed authoring, scheduling, and monitoring the workflows from the web UI and triggering the Talend jobs on demand directly from the web UI using the bash operator. You can also transfer data from one database to another using the generic_transfer operator.

Hooks can be used to connect to MySQL, Hive, S3, Oracle, Pig, Redshift, and so on. Other operators such as docker_operator, hive_operator, hive_to_samba_operator, http_operator, jdbc_operator, mssql_to_hive, pig_operator, postgres_operator, presto_to_mysql, redshift_to_s3_operator, s3_file_transform_operator, and s3_to_hive_operator are also available.

References

Nginx with GeoIP2 MaxMind Database to Fetch User Geolocation Data


Overview

This is the second part of our series on fetching user geolocation data using Nginx and the MaxMind database. In our previous blog on Nginx with GeoIP MaxMind Database to Fetch User Geolocation Data, we discussed fetching user geolocation data using Nginx and the legacy (GeoIP) MaxMind database.

In this blog, let us discuss finding the geographical location of a user from the user’s IP address by just configuring Nginx with the GeoIP2 MaxMind databases, without writing any code. GeoIP2 has many features such as localized name data, country, country subdivisions, FIPS 10-4, custom country codes, represented country, user type, net speed, registered country, ISO 3166-2, and autonomous system number. The ngx_http_geoip2_module is used to create variables with values based on a specified source variable or, by default, the client’s IP address.

Pre-requisites

  • Ubuntu Platform (Ubuntu 16.04, 12.04, 11.04)

Use Case

Install Nginx on Ubuntu, configure Nginx with MaxMind Databases, and find the geolocation of the user using IP address.

Synopsis

  • Installation and Configuration
  • Fetching Geolocation Data Using Client IP

Installation and Configuration

Downloading Nginx

Download Nginx version 1.12.0 using the below command:

wget http://nginx.org/download/nginx-1.12.0.tar.gz
tar zxvf nginx-1.12.0.tar.gz
cd nginx-1.12.0

Note: Choose Nginx version 1.9 or above to work with this sample.

Installing libmaxminddb

This library is used to read MaxMind DB files and GeoIP2 databases. It enables faster lookup of IP addresses of a client.

To install libmaxminddb, perform the following:

  • Add PPA to APT sources using the below command:
$ sudo add-apt-repository ppa:maxmind/ppa
  • Install packages using the below command:
$ sudo aptitude update
 $ sudo aptitude install libmaxminddb0 libmaxminddb-dev mmdb-bin

Building ngx_http_geoip2_module as Dynamic Module (Nginx 1.9.11+)

As ngx_http_geoip_module does not support GeoIP2, a third-party module – ngx_http_geoip2_module is used. It supports Nginx streams and GeoIP2.

To include ngx_http_geoip2_module into Nginx, perform the following:

  • Download ngx_http_geoip2_module from GitHub repository
# Workspace location
 cd /opt/treselle/lab/nginx

 # Download 3rd party geoip2 module
 git clone --recursive https://github.com/leev/ngx_http_geoip2_module

 # Add dynamic module
 ./configure --add-dynamic-module=/opt/treselle/lab/nginx/ngx_http_geoip2_module

 # Build nginx from source using makefile
 make

 make install
The objs/ngx_http_geoip2_module.so module will be created automatically. You can copy it to the Nginx modules path manually, if needed.
  • Add the following line to nginx.conf file:
load_module modules/ngx_http_geoip2_module.so;

Downloading MaxMind GeoLite2 City and Country Databases

To download and extract the MaxMind GeoLite2 City and Country databases on an Ubuntu system, use the following commands:

mkdir -p /etc/nginx/geoip2

cd /etc/nginx/geoip2

wget http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.mmdb.gz

gunzip GeoLite2-City.mmdb.gz

wget http://geolite.maxmind.com/download/geoip/database/GeoLite2-Country.mmdb.gz

gunzip GeoLite2-Country.mmdb.gz

Testing MaxMind GeoCity and GeoCountry Databases Using mmdblookup

mmdblookup is used to look up geo information of an IP address from a MaxMind database file.

To test the MaxMind GeoCity and GeoCountry Databases, use the below command:

$ mmdblookup --file /etc/nginx/geoip2/GeoLite2-Country.mmdb --ip 8.8.8.8

{
 "country":
 {
 "geoname_id":
 6252001 <uint32>
 "iso_code":
 "US" <utf8_string>
 "names":
 {
 "de":
 "USA" <utf8_string>
 "en":
 "United States" <utf8_string>
  }
 }
}

Configuring Nginx with MaxMind Databases

Nginx is configured with the MaxMind GeoLite2 City and Country databases to access the MaxMind geo variables.

To configure Nginx with the databases, use the below configuration:

load_module "modules/ngx_http_geoip2_module.so";

worker_processes 1;

events { worker_connections 1024; }

http {
 geoip2 /etc/nginx/geoip2/GeoLite2-Country.mmdb {
  $geoip2_data_country_code default=US country iso_code;
  $geoip2_data_country_name country names en;
 }

 geoip2 /etc/nginx/geoip2/GeoLite2-City.mmdb {
  $geoip2_data_city_name city names en;
  $geoip2_data_postal_code postal code;
  $geoip2_data_latitude location latitude;
  $geoip2_data_longitude location longitude;
  $geoip2_data_state_name subdivisions 0 names en;
  $geoip2_data_state_code subdivisions 0 iso_code;
 }

 log_format main '$remote_addr - $remote_user [$time_local] "$request" '
  '$status $body_bytes_sent "$http_referer" '
  '"$http_user_agent" "$http_x_forwarded_for"';

 access_log /var/log/nginx/access.log main;

 # Set geo variables in proxy headers
 proxy_set_header X-GEO-CITY-NAME $geoip2_data_city_name;
 proxy_set_header X-GEO-POSTAL-CODE $geoip2_data_postal_code;
 proxy_set_header X-GEO-STATE-NAME $geoip2_data_state_name;
 proxy_set_header X-GEO-STATE-CODE $geoip2_data_state_code;
 proxy_set_header X-GEO-COUNTRY-CODE $geoip2_data_country_code;
 proxy_set_header X-GEO-COUNTRY-NAME $geoip2_data_country_name;

 server {
  listen 80;
  # All other traffic gets proxied right on through.
  location / {
   proxy_pass http://127.0.0.1:3000;
  }
 }
}

Fetching Geolocation Data Using Client IP

A sample web application (Node.js application) is created to return the request header parameters as a JSON response. Custom geo fields are added to the request headers and are made accessible from the application. The application is reverse proxied via Nginx.

To get geolocation data of the user, use the following code written in Node.js:
// This sample web application returns request headers in response
const express = require('express')
const app = express()
var count = 1;

// Location "/show_my_identity" hosts the incoming request headers in response in JSON format
app.get('/show_my_identity', function (req, res) {
  res.send(JSON.stringify(req.headers));
  // Node.js lowercases incoming header names, so X-GEO-COUNTRY-CODE arrives as x-geo-country-code
  console.log('Request', count++, 'received from country : ', req.headers['x-geo-country-code']);
})

// Default "/" message
app.get('/', function (req, res) {
  res.send("Welcome to Treselle lab");
})

// Application listening on port 3000
app.listen(3000, function () {
  console.log('Treselle Lab - App listening on port 3000!')
})

The output of the application with the geolocation data looks similar to the one shown below:

select

Note: To run the sample Node.js application, Node.js must be installed along with the required modules (such as express).
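To verify the end-to-end flow, a request can be sent through the Nginx proxy. A minimal sketch in Python using the requests library is shown below; it assumes Nginx is listening on port 80 of the same host, and it reads the header keys in lowercase as Node.js presents them:

import requests  # third-party library: pip install requests

# Hit the app through Nginx (port 80), not directly on port 3000,
# so that the X-GEO-* headers are injected before the request reaches Node.js.
response = requests.get('http://localhost/show_my_identity')
headers = response.json()
print(headers.get('x-geo-country-code'), headers.get('x-geo-city-name'))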

The application log with the geolocation data looks similar to the one shown below:

select

Conclusion

In this blog, we discussed installing the MaxMind GeoIP2 module with Nginx on Ubuntu and configuring Nginx with the MaxMind GeoLite2 City and Country databases. We reverse proxied the sample web application through Nginx to find the geolocation of a user from the IP address. For more details about the legacy (GeoIP) MaxMind database model, refer to our previous blog on Nginx with GeoIP MaxMind Database to Fetch User Geolocation Data.

References

MySQL to Amazon Aurora – Diverse Ways of Data Migration


Overview

Amazon Aurora, a simple and cost-effective relational database engine, is used to set up, operate, and scale MySQL deployments. It possesses the speed and reliability of high-end commercial databases. It provides faster recovery from instance failures and consistent performance with a lower impact on the primary replica. It is compatible with the InnoDB engine and uses an Aurora I/O mechanism of 16 KB reads and 4 KB writes, all of which can be batched if smaller.

In this blog, let us discuss launching an Amazon Aurora DB cluster and various ways of migrating data from MySQL to the Aurora DB cluster.

Pre-requisites

Create an Amazon Aurora account using the link: https://aws.amazon.com/

Use Case

Launching an Amazon Aurora DB cluster and analyzing various ways of migrating data from an existing MySQL database to the Aurora DB cluster.

Launching Amazon Aurora DB Cluster

To launch Amazon Aurora DB cluster, perform the following steps:

  • Sign in to Amazon RDS instance.
  • In Select Engine page, select Amazon Aurora as your DB engine.
  • In Specify DB Details page, specify the DB details such as Instance Specifications and Settings.
  • In Configure Advanced Settings page, provide network and security details such as VPC, subnet group, publicly accessible, availability zone, and VPC security group as shown in the below diagram:

select

  • Click Launch DB Instance to launch the instance.
    On successfully launching the DB instance, the page looks similar to the one shown below:

launch_db_instance

Migrating Data from MySQL to Amazon Aurora DB Cluster

A few ways of migrating data from MySQL to the Amazon Aurora DB cluster are:

  • Using Talend Extract-Transform-Load (ETL) Tool – Integrate Talend with Aurora and migrate data into the Aurora DB cluster.
  • Using MySQL Dump – Create MySQL data dump using mysqldump utility and import it into the Aurora DB cluster.
  • Using Amazon RDS MySQL DB Snapshot – Create a DB snapshot of an Amazon RDS MySQL DB instance and migrate it to the Aurora DB cluster.
  • Using Amazon AWS Database Migration Service (AWS DMS) – Connect AWS DMS with MySQL and migrate data from MySQL to the Aurora DB cluster.

Let us discuss all the above ways of data migration to help you choose the most suitable approach based on your specific needs.

Using ETL – Talend

Migrating data from MySQL to the Amazon Aurora DB cluster using an ETL tool is the easiest of all the migration methods. A few ETL tools used for migration are Pentaho, Kettle, Informatica, Talend, and so on.

In this section, let us discuss migrating data from MySQL to the Amazon Aurora DB cluster using Talend. Talend is best suited when migrating aggregated data from MySQL to Aurora; using aggregation functions, the data can be aggregated while it is migrated. After integrating MySQL and Aurora, drag and drop the required components to perform any functionality.

Pre-requisites

  • MySQL Version 5.6
  • Configure the tAmazonAuroraOutput component in Talend 6.1

Migrating to Amazon Aurora DB Cluster Using Talend

To migrate data from MySQL to Aurora using Talend, perform the following steps:

  • Open Talend.
  • In the Palette, search tMysqlInput component.
  • In the tMysqlInput component, provide sample RDS instance details.
  • Create a Talend job using tFlowMeter, tMap, tfilterRow, tAggregation, and tAmazonAuroraOutput.
  • Run the job to migrate the data from MySQL to Aurora.

tAggregation component used for migration is shown in the below diagram:

taggregation_component

tfilterRow component used for filtering data above a specified time period is shown in the below diagram:

tfilterrow_component

Note: MySQL engine type will be converted into InnoDB engine after migration.

Using MySQL Dump

This method is best suited for migrating data from MySQL to the Aurora DB cluster if the data size exceeds 6 TB.
The mysqldump utility is used to create a MySQL data dump file. The dump file is then imported into the Aurora DB cluster to migrate the data.

Migrating to Amazon Aurora DB Cluster Using MySQL Dump

To migrate data from MySQL to Aurora using MySQL dump, perform the following steps:

  • Create a MySQL dump file from the MySQL database using the mysqldump utility as shown in the below command:
nohup mysqldump -u username -p'password' -h hostname --port=3306 dstore > dstore_20170703.sql 2> dstore.log &
  • Import the dump file into the Aurora DB cluster using the below command:
nohup mysql -u username -p'password' -h hostname --port=3306 dstore_crawler < dstore_20170703.sql > dstore_crawler.log &
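Because Aurora is MySQL-compatible, the imported data can be queried with any standard MySQL client or driver. A minimal sketch using mysql-connector-python is shown below; the cluster endpoint, credentials, and table name are placeholders:

import mysql.connector  # pip install mysql-connector-python

# Placeholder cluster endpoint, credentials, and table name
connection = mysql.connector.connect(
    host='my-aurora-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com',
    user='username',
    password='password',
    database='dstore_crawler',
)
cursor = connection.cursor()
cursor.execute('SELECT COUNT(*) FROM some_table')  # compare against the source row count
print(cursor.fetchone()[0])
connection.close()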

Using Amazon RDS MySQL DB Snapshot

This method is best suited for migrating data from MySQL to the Aurora DB cluster if the data size is less than 6 TB. It is easy to migrate data across regions such as ap-northeast-1, ap-northeast-2, ap-south-1, and so on by just taking a DB snapshot. A DB snapshot of the Amazon RDS MySQL DB instance is taken and the data is migrated into the Aurora DB cluster.

As Aurora DB supports only InnoDB engine, any MyISAM engine tables already present will be converted into InnoDB during migration.

Pre-requisites

  • MySQL Version 5.6
  • Aurora DB Version 1.13
  • Amazon RDS console

Migrating to Amazon Aurora DB Cluster Using Amazon RDS MySQL DB Snapshot

To migrate data from MySQL to Aurora using Amazon RDS MySQL DB Snapshot, perform the following steps:

  • Open Amazon RDS console.
  • Click Instances.
  • Choose the RDS instance.
  • Click Instance Actions –> Migrate Latest Snapshot as shown in the below diagram:

migrate_latest_snapshot

  • Mention Instance Specifications as Aurora and set DB Instance Identifier as shown in the below diagram:

instance_specifications

  • Click Migrate to initiate the process of data migration as shown in the below diagram:

migrate

On clicking migrate, the process of migrating data from MySQL to Aurora will be initialized as shown in the below diagram:

migrating_data

The migration progress is shown in the below diagram:

select

The data migrated to Aurora is shown in the below diagram:

data_migration_to_aurora

Using Amazon AWS Database Migration Service (AWS DMS)

AWS Database Migration Service, a web service, is used to easily and securely migrate data between heterogeneous or homogeneous databases such as on-premises databases, RDS databases, SQL, NoSQL, and text-based targets with zero downtime. It is also used for continuous data replication with high availability.

This is a low-cost service; you pay only for the resources used and any additional log storage. AWS DMS is connected with MySQL to load data from MySQL into the Aurora DB cluster.

Migrating to Amazon Aurora DB Cluster Using Amazon AWS Database Migration Service

To migrate data from MySQL to Aurora using Amazon AWS DMS, perform the following steps:

  • Open Amazon DMS console.
  • Click Migrate –> Next to start the migration process as shown in the below diagram:

migrating_to_amazon_aurora

  • In the Create replication instance page, provide instance details and click Next as shown in the below diagram:

create_replication_instance

  • In the Connect source and target database endpoints page, provide source and target database connection details and click Next to create the replication instance as shown in the below diagram:

connect_source_and_target_database_endpoints

  • In the Create task page, provide the migration task details and click Create task to create the task as shown in the below diagram:

create_task

  • Under Guided tab, enter details of Selection rules as shown in the below diagram:

guided_tab

  • Under Guided tab, enter details of Transformation rules as shown in the below diagram:

transformation_rules

The data migrated from MySQL to the Aurora DB cluster is shown in the below diagram:

migrated_data

Data Migration Comparison Chart

The time taken to migrate data from MySQL to the Aurora DB cluster is graphically represented in the comparison chart as follows:

select

Amazon AWS DMS: It took 4-5 minutes to migrate a 1 GB file from MySQL to the Aurora DB cluster using Amazon AWS DMS with zero downtime. The migration time differs based on the AWS instance type.

Conclusion

In this blog article, we discussed the various ways of migrating data from MySQL to the Amazon Aurora DB cluster. The most suitable way of doing the data migration varies based on data size and requirements.

References

Drill Data with Apache Drill – Part 2


Overview

This is the second part of our series on drilling data with Apache Drill. Apache Drill is an open source, low latency SQL-on-Hadoop query engine for larger datasets. The latest version of Apache Drill is 1.10 with CTTAS, a web console, and JDBC connectivity. It can be integrated with several data sources and formats such as CSV, JSON, TSV, PSV, Avro, and Parquet, all of which can be operated on using a single query.

In this blog, let us discuss configuring and connecting different data sources, and querying data from the data sources with Apache Drill. To know more, refer to our previous blog post on Drill Data with Apache Drill.

Use Case

Persist data files in different data sources such as MySQL, HDFS, and Hive, query them on-the-fly, export query output in different file formats such as CSV, JSON, and TSV, and load the result sets into HDFS location.

Data Description

To demonstrate Apache Drill’s capability, financial stock data downloaded in parts in different formats such as JSON, CSV, and TSV is loaded into different data sources.

The files used are as follows:

  • energy_overview.tsv – MySQL
  • energy_technical.json – HDFS
  • stock_market_exchange.csv – Hive

To get a better understanding about the data files, refer our previous blog post on Drill Data with Apache Drill.

Pre-requisites

Install Apache Drill 1.10.0 in Distributed Mode on Linux.

wget http://apache.mirrors.hoobly.com/drill/drill-1.10.0/apache-drill-1.10.0.tar.gz
tar xf apache-drill-1.10.0.tar.gz

Note: Drill requires JDK 1.7 and above.

Synopsis

  • Configure and connect different data sources such as MySQL, HDFS, and Hive.
  • Query the data sources on the fly.
  • Export query output in different file formats such as CSV, JSON, and TSV.
  • Store the output file in HDFS location.

Configuring Different Data Sources

RDBMS Storage Plugins

Apache Drill is tested with MySQL’s mysql-connector-java-5.1.37-bin.jar driver.

To connect Apache Drill with MySQL, configure the RDBMS storage plugin in the Apache Drill web console as shown in the below diagram:

select
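The plugin configuration is a small JSON document; a sketch of a MySQL (jdbc) storage plugin is shown below, with placeholder host and credentials:

{
  "type": "jdbc",
  "driver": "com.mysql.jdbc.Driver",
  "url": "jdbc:mysql://localhost:3306",
  "username": "username",
  "password": "password",
  "enabled": true
}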

Hive Storage Plugins

To connect Apache Drill with Hive, enable the existing Hive plugin and update the configuration in Apache Drill console as shown in the below diagram:

select

File Storage Plugin

To connect Apache Drill with HDFS, replace file:/// with hdfs://hostname:port/ in the DFS storage plugin as shown in the below diagram:

select

Using Apache Drill in Distributed Mode

Start a Drillbit on each node in a cluster to use Apache Drill in the distributed mode.

Starting Drill

To start Apache Drill, use the below command:

drillbit.sh start

Stopping Drill

To stop Apache Drill, use the below command:

drillbit.sh stop

Verifying Drill Setup

To verify Apache Drill setup, use the below command:

cd apache-drill-1.10.0
bin/sqlline -u jdbc:drill:zk=local
!connect jdbc:drill:zk=local:2181
show databases;

select

To check Drillbits running on the cluster, use the below command:

0: jdbc:drill:zk=<zk1host>:<port>> SELECT * FROM sys.drillbits;

select

Querying Data Sources with Apache Drill

Apache Drill’s self-describing data exploration behavior allows users to query different data files from diverse data stores.

Simple Join

This query joins all the three files and retrieves the symbol, status, name, market cap, moving averages (ATR and SMA20), sector, and subsector.

select b.Symbol,b.Status,b.Name,b.MarketCap,a.ATR,a.SMA20,c.sector,c.subsector 
from dfs.`/user/tsldp/energy_technical/energy_technical.json` a,
hive.stock_market.`stock_market_exchange` b, 
mysqldb.energy_data.`energy_overview` c 
where a.Ticker = b.Symbol and a.Ticker = c.ticker limit 10;

select

Total Volume Based on Subsector

select sum(b.volume) as total_volume,a.subsector 
from hive.stock_market.`stock_market_exchange` b, 
mysqldb.energy_data.`energy_overview` a
where a.ticker = b.Symbol group by a.subsector limit 5;

select

Minimum Volume for Tickers

select a.ticker as ticker,min(b.volume) as min_volume 
from hive.stock_market.`stock_market_exchange` b, 
mysqldb.energy_data.`energy_overview` a 
where a.ticker = b.Symbol group by a.ticker 
order by a.ticker desc limit 10;

select

Subsector with ATR > 2

select b.subsector,a.ATR 
from dfs.`/user/tsldp/energy_technical/energy_technical.json` a,
mysqldb.energy_data.`energy_overview` b 
where a.ticker = b.ticker 
group by b.subsector,a.ATR having cast(a.ATR as float) > 2;

select

Tickers with w52high Between -20 & -100 with Their Last Price

select symbol,lastprice,change,changepercent 
from hive.stock_market.`stock_market_exchange` 
where symbol in 
(select ticker
 from dfs.`/user/tsldp/energy_technical/energy_technical.json`
 where cast(W52High as float) < -20 and cast(W52High as float) > -100 
order by ticker desc limit 10);

The query can also be executed in Apache Drill web console as shown below:

select

The query output is as follows:

select
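The same queries can also be submitted programmatically over Drill's REST API. A minimal Python sketch is shown below, assuming a Drillbit web server is reachable on port 8047 of the local machine:

import json

import requests  # pip install requests

# Submit a SQL query to the Drill REST API and print the returned rows.
query = "select symbol, lastprice from hive.stock_market.`stock_market_exchange` limit 5"
payload = {"queryType": "SQL", "query": query}
response = requests.post(
    "http://localhost:8047/query.json",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
for row in response.json().get("rows", []):
    print(row)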

 

Loading Output Files in HDFS

To load the result sets as files in HDFS, configure workspace in DFS storage plugin by changing the configuration as shown in the below diagram:

select

The configured workspaces appear in the database list as shown in the below diagram:

select

 

Exporting Output Files

The result sets of the join queries are exported into CSV file format using the csvOut workspace and into JSON file format using the jsonOut workspace in the HDFS storage plugin.

Switch to csvOut Workspace

To switch to csvOut workspace, use the below command:

use dfs.csvOut;

Change File Storage Format to CSV

To change file storage format to CSV, use the below command:

alter session set `store.format`='csv';

Exporting Output File into CSV Format

create table total_resultset as 
select a.ticker as ticker,sum(b.volume) as total_volume,a.subsector 
from hive.stock_market.`stock_market_exchange` b,
mysqldb.energy_data.`energy_overview` a 
where a.ticker = b.Symbol 
group by a.ticker,a.subsector;

select

The output files are exported and stored in the below location:

/user/tsldp/csv_output/total_resulset/total_volume.csv

select

References

Data Analysis Using Apache Hive and Apache Pig


Overview

Apache Hive, an open-source data warehouse system, is used with Apache Pig for loading and transforming unstructured, structured, or semi-structured data for data analysis and better business insights. Pig, a standard ETL scripting language, is used to export and import data into Apache Hive and to process large numbers of datasets. Pig can be used for ETL data pipelines and iterative processing.

In this blog, let us discuss loading and storing data in Hive with a Pig relation using HCatalog.

Pre-requisites

Download and configure the following:

Use Case

In this blog, let us discuss the below use case:

  • Loading unstructured data into Hive
  • Processing, transforming, and analyzing data in Pig
  • Loading structured data into a different table in Hive using Pig

Data Description

Two cricket data files with Indian Premier League data from 2008 to 2016 are used as the data source. The files are as follows:

  • matches.csv – Provides details about each match played
  • deliveries.csv – Provides details about consolidated deliveries of all the matches

These files are extracted and loaded into Hive. The data is further processed, transformed, and analyzed to get the winner of each season and the top 5 batsmen with the maximum runs in each season and across all seasons.

Synopsis

  • Create database and database tables in Hive
  • Import data into Hive tables
  • Call Hive SQL in Shell Script
  • View database architecture
  • Load and store Hive data into Pig relation
  • Call Pig script in Shell Script
  • Apply Pivot concept in Hive SQL
  • View Output

Creating Database and Database Tables in Hive

To create databases and database tables in Hive, save the below query as a SQL file (database_table_creation.sql):

select

Importing Data into Hive Tables

To load data from both the CSV files into Hive, save the below query as a SQL file (data_loading.sql):

select

Calling Hive SQL in Shell Script

To automatically create databases & database tables and to import data into Hive, call both the SQL files (database_table_creation.sql and data_loading.sql) using Shell Script.

select

Viewing Database Architecture

The database schema and tables created are as follows:

select

The raw matches.csv file loaded into Hive schema (ipl_stats.matches) is as follows:

select

The raw deliveries.csv file loaded into Hive schema (ipl_stats.deliveries) is as follows:

select

Loading and Storing Hive Data into Pig Relation

To load and store data from Hive into Pig relation and to perform data processing and transformation, save the below script as Pig file (most_run.pig):

select

Note: Create a Hive table before calling Pig file.

To write back the processed data into Hive, save the below script as a SQL file (most_run.sql):

select

Calling Pig Script in Shell Script

To automate ETL process, call files (most_run.pig, most_run.sql) using Shell Script.

select
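The same automation can also be driven from Python instead of a shell script. A minimal sketch using subprocess is shown below; it assumes pig and hive are available on the PATH and that HCatalog support is enabled for Pig:

import subprocess

# Run the Pig script with HCatalog support, then load the processed data back into Hive.
subprocess.run(['pig', '-useHCatalog', 'most_run.pig'], check=True)
subprocess.run(['hive', '-f', 'most_run.sql'], check=True)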

The data loaded into Hive using Pig script is as follows:

select

Applying Pivot Concept in Hive SQL

As the data loaded into Hive is in row form, the SQL pivot concept is used to convert rows into columns for more data clarity and better insights. The User Defined Aggregation Function (UDAF) technique is used to perform the pivot in Hive. In this use case, the pivot concept is applied to the season and run rows alone.

To use the collect UDAF, add the Brickhouse jar file to the Hive classpath.

The data for the top 5 batsmen with the most runs in each season, before applying the pivot, is shown below:

select

The data for the top 5 batsmen with the most runs in each season, after applying the pivot, is shown below:

select

Viewing Output

Viewing Winners of a Season

To view winners of each season, use the following Hive SQL query:

select

Viewing Top 5 Batsmen with Most Runs

To view the top 5 batsmen with the most runs, use the following Hive SQL query:

select

The top 5 batsmen with the most runs are shown graphically using MS Excel as follows:

select

Viewing Year-wise Runs of Top 5 Batsmen

To view year-wise runs of the top 5 batsmen, use the following Hive SQL query:

select

The year-wise runs of the top 5 batsmen are shown graphically using MS Excel as follows:

select

References
