Apache Drill vs Amazon Athena – A Comparison on Data Partitioning
Overview Big data exploration in almost all fields has led to the development of multiple big data technologies such as Hadoop (Hive, HDFS, Pig, HBase), NoSQL databases (MongoDB), and so on for...
Amazon Athena & Tableau – Serverless Interactive Query Service and Business...
Overview Amazon Athena is a serverless, pay-per-query service used to analyze data in Amazon Simple Storage Service (S3) using standard SQL. It has a very high query...
Self Service Analytics using Dremio
Overview Dremio, a self-service data platform, helps data analysts and data scientists to determine, organize, accelerate, and share any data at any time irrespective of volume, velocity, location, or...
Data Quality Checks with StreamSets using Drift Rules
Overview In the world of big data, data drift has emerged as a critical technical challenge for data scientists and engineers in unleashing the power of data. It delays businesses from gaining...
Handle Class Imbalance Data with R
Overview Imbalanced data refers to classification problems where one class outnumbers the other class by a substantial proportion. Imbalanced classification occurs more frequently in binary classification...
API Response Tracking with StreamSets, Elasticsearch, and Kibana
Overview RESTful API JSON response data can be used to view various aspects such as pipeline configuration or monitoring information of the StreamSets Data Collector. This API response information can...
Import and Ingest Data into HDFS using Kafka in StreamSets
Overview StreamSets provides state-of-the-art data ingestion to easily and continuously ingest data from various origins such as relational databases, flat files, AWS, and so on, and write data to various...
Kylo – Self-Service Data Ingestion, Cleansing, and Validation (No Coding...
Overview Kylo, a feature-rich data lake platform, is built on Apache Hadoop and Apache Spark. Kylo provides a business-friendly data lake solution and enables self-service data ingestion, data...
Predict Lending Club Loan Default Using Seahorse and SparkR
Overview Data scientists are using Python and R to solve data problems due to the ready availability of these packages. These languages are often limited as the data is processed on a single machine,...
Data Quality Metrics using Talend Data Quality Management
Overview Data Quality is the process of examining data in different data sources according to predefined business goals. It helps to improve the quality of the data and collect statistics and...
Kylo – Automatic Data Profiling and Search-based Data Discovery
Overview Data profiling is the process of assessing data values and deriving statistics or business information about the data. It allows data scientists to validate data quality and business analysts...
Sensor Data Quality Management using PySpark & Seaborn
Overview Data Quality Management (DQM) is the process of continuously analyzing, defining, monitoring, and improving the quality of data. A few data quality dimensions widely used by data practitioners...
Predict Bad Loans with H2O Flow AutoML
Overview Machine learning algorithms play a key role in accurately predicting loan data of any bank. The greatest challenge in machine learning is to employ the best models and algorithms to accurately...
Crime Analysis Using H2O Autoencoders – Part 1
Overview Nowadays, Deep Learning (DL) and Machine Learning (ML) are used to analyze and accurately predict data. Machine Learning models are used to accurately predict crimes. Crime prediction not only...
Streaming Analytics using Kafka SQL
Overview Kafka SQL, a streaming SQL engine for Apache Kafka by Confluent, is used for real-time data integration, data monitoring, and data anomaly detection. KSQL is used to read, write, and process...
Crime Analysis Using H2O Autoencoders – Part 2
Overview This is the second part of a two-part series of Crime Analysis using H2O Autoencoders. In our previous blog on Crime Analysis Using H2O Autoencoders – Part 1, we discussed building the...
Ingest IoT Sensor Data into S3 with Raspberry Pi3 & StreamSets Data Collector...
Overview Due to the increasing amount of data produced outside of source systems, enterprises are facing difficulties in reading, collecting, and ingesting data into a desired, central database...
Custom Partitioning and Analysis using Kafka SQL Windowing
Overview Apache Kafka produces messages to multiple partitions in a round-robin fashion. A custom partitioning technique is used to produce a particular type of message to a defined partition and to...
Customer Churn – Logistic Regression with R
1 Overview 2 Learning/Prediction Steps 2.1 Data Description 2.2 Data Preprocessing 2.3 Partitioning the Data & Logistic Regression 2.4 Model Summary 2.5 Prediction Accuracy 3 References Overview...
Embrace Relationships with Neo4J, R & Java
2 Use Case 3 Solution 3.1 Prerequisites 3.2 Download StackOverflow Dataset 3.3 Data Manipulation with R 3.4 Create Nodes and Relationship file with Java 3.5 Create GraphDB with Batch Importer 3.6...