Sales Data Analysis using Dataiku DSS
Overview Dataiku Data Science Studio (DSS), a complete data science software platform, is used to explore, prototype, build, and deliver data products. It significantly reduces the time taken by data...
Apache Spark on YARN – Performance and Bottlenecks
Overview Apache Spark 2.x ships with the second-generation Tungsten engine. This engine builds on ideas from modern compilers to emit optimized code at runtime that collapses the entire query...
Apache Spark on YARN – Resource Planning
Overview This is the second article of a four-part series about Apache Spark on YARN. As Apache Spark is an in-memory distributed data processing engine, the application performance is heavily...
Apache Spark Performance Tuning – Degree of Parallelism
Overview This is the third article of a four-part series about Apache Spark on YARN. Apache Spark allows developers to run multiple tasks in parallel across machines in a cluster or across multiple...
Apache Spark Performance Tuning – Straggler Tasks
Overview This is the last article of a four-part series about Apache Spark on YARN. Apache Spark distinguishes two types of “transformation” operations: “narrow” and “wide”. This...
Protractor with Cucumber
Overview Protractor, an end-to-end testing framework, supports Jasmine and is built specifically for AngularJS applications. It is highly flexible with different Behavior-Driven Development (BDD)...
Distributed Load Testing using Apache JMeter
Overview Distributed load testing is the process of simulating the very high workload of an enormous number of users using multiple systems. As a single system cannot generate a large number of threads (users),...
Data Normalization and Filtration Using Drools
Overview Drools, a Rule Engine, is used to implement an expert system using a rule-based approach. It is used to convert both structured and unstructured data into transient data by applying business...
Building a RESTful API Using LoopBack
Overview LoopBack, an open-source Node.js framework that is easy to learn and understand, allows you to create end-to-end REST APIs with less code than Express and other frameworks. It allows you to...
Pivoting and Unpivoting Multiple Columns in MS SQL Server
Overview MS SQL Server, a Relational Database Management System (RDBMS), is used for storing and retrieving data. Data integrity, data consistency, and data anomalies play a primary role when storing...
Data Flow Pipeline using StreamSets
Overview StreamSets Data Collector, an open-source, lightweight, powerful engine, is used to stream data in real time. It is an enterprise-grade infrastructure for continuous big data ingestion, used to...
Database Performance Testing with Apache JMeter
Overview Database performance testing is used to identify performance issues before deploying database applications for end users. Database load testing is used to test the database applications for...
Visualize IoT data with Kaa and MongoDB Compass
Overview Kaa is a highly flexible, open-source middleware platform for Internet of Things (IoT) product development. It provides a scalable, end-to-end IoT framework for large cloud-connected IoT...
Nginx with GeoIP MaxMind Database to Fetch User Geolocation Data
Overview Geolocation data of a user plays a significant role in business marketing. This data is used to promote or market any brand, product, or service in the specific area to which the user...
Apache NiFi – Data Crawling from HTTPS Websites
Overview Apache NiFi, a very effective, powerful, and scalable dataflow building platform, is used to process and distribute data and to automate data flow between systems. In this blog, let us discuss...
Airflow to Manage Talend ETL Jobs
Overview Airflow, an open-source platform, is used to orchestrate workflows programmatically as Directed Acyclic Graphs (DAGs) of tasks. An Airflow scheduler is used to schedule workflows and...
Nginx with GeoIP2 MaxMind Database to Fetch User Geolocation Data
Overview This is the second part about fetching user geolocation data using Nginx and the MaxMind Database. In our previous blog on Nginx with GeoIP MaxMind Database to Fetch User Geolocation Data, we...
MySQL to Amazon Aurora – Diverse Ways of Data Migration
Overview Amazon Aurora, a simple and cost-effective relational database engine, is used to set up, operate, and scale MySQL deployments. It possesses the speed and reliability of high-end commercial...
Drill Data with Apache Drill – Part 2
Overview This is the second part about drilling data with Apache Drill. Apache Drill is an open-source, low-latency SQL-on-Hadoop query engine for large datasets. The latest version of Apache Drill is 1.10...
Data Analysis Using Apache Hive and Apache Pig
Overview Apache Hive, an open-source data warehouse system, is used with Apache Pig for loading and transforming unstructured, structured, or semi-structured data for data analysis and getting better...