This is the era of Big Data. The words ‘Big Data’ imply big innovation and a competitive advantage for businesses. Apache Spark was designed to perform Big Data analytics at scale, so it comes equipped with the necessary algorithms and supports multiple programming languages. If you’re a technologist who wants to expand your knowledge to perform data science operations in Spark, or a data scientist who wants to understand how algorithms are implemented in Spark, or a newbie with minimal development experience, this book is for you.
- Consolidate, clean, & transform your data acquired from various data sources
- Perform statistical analysis of data to find hidden insights
- Explore graphical techniques to see what your data looks like
- Use machine learning techniques to build predictive models
- Build scalable data products & solutions
- Start programming using the RDD, DataFrame & Dataset APIs
- Become an expert by improving your data analytical skills
Spark has emerged as the most promising big data analytics engine for data science professionals. The true power and value of Apache Spark lie in its ability to execute data science tasks with speed and accuracy. This book is for novice and intermediate-level data science professionals and data analysts who want to solve data science problems with a distributed computing framework.
- Explore the topics of data mining, text mining, Natural Language Processing, information retrieval, & machine learning
- Solve real-world analytical problems w/ large data sets
- Address data science challenges w/ analytical tools on a distributed system like Spark
- Get hands-on experience w/ algorithms like classification, regression, & recommendation on real datasets using the Spark MLlib package
- Learn about numerical & scientific computing using NumPy & SciPy on Spark
- Use Predictive Model Markup Language (PMML) in Spark for statistical data mining models
When it comes to big data, regular data visualization tools with basic features become insufficient. This book covers the concepts and models used to visualize big data, with a focus on efficient visualizations.
- Understand how basic analytics is affected by big data
- Deep dive into effective & efficient ways of visualizing big data
- Get to know various approaches (using various technologies) to address the challenges of visualizing big data
- Comprehend the concepts & models used to visualize big data
- Know how to visualize big data in real time & for different use cases
- Understand how to integrate popular dashboard visualization tools such as Splunk & Tableau
- Get to know the value & process of integrating visual big data with BI tools such as Tableau
- Make sense of the visualization options for big data, based on the best-suited visualization techniques
Big Data forensics is an important type of digital investigation that involves the identification, collection, and analysis of large-scale Big Data systems. Hadoop is one of the most popular Big Data solutions, and forensically investigating a Hadoop cluster requires specialized tools and techniques. In this book, you’ll discover how to perform a complete forensic investigation of large-scale Hadoop clusters using the same tools and techniques employed by forensics experts.
- Understand Hadoop internals & file storage
- Collect & analyze Hadoop forensic evidence
- Perform complex forensic analysis for fraud & other investigations
- Use state-of-the-art forensic tools
- Conduct interviews to identify Hadoop evidence
- Create compelling presentations of your forensic findings
- Understand how Big Data clusters operate
- Apply advanced forensic techniques in an investigation, including file carving, statistical analysis, & more
Big Data analytics is the process of examining large and complex data sets that often exceed conventional computational capabilities. R is a leading programming language for data science, with powerful functions to tackle problems related to Big Data processing. The book begins with a brief introduction to the Big Data world and its current industry standards, before progressing to coverage of major R functions for data management and transformations.
- Learn about the current state of Big Data processing using the R programming language & its powerful statistical capabilities
- Deploy Big Data analytics platforms w/ selected Big Data tools supported by R in a cost-effective & time-saving manner
- Apply the R language to real-world Big Data problems on a multi-node Hadoop cluster
- Explore the compatibility of R with Hadoop, Spark, SQL & NoSQL databases, and the H2O platform
Handling big data is now a requirement. Most organizations produce huge amounts of data every day. With the arrival of Hadoop-like tools, it has become easier for everyone to solve big data problems with great efficiency and at minimal cost. Grasping machine learning techniques can help you greatly in building predictive models, and this book will give you insights into mastering Big Data via machine learning to make better business decisions.
- Install & maintain a Hadoop 2.X cluster & its ecosystem
- Write advanced MapReduce programs & understand design patterns
- Perform advanced data analysis using Hive, Pig, & MapReduce programs
- Import & export data from various sources using Sqoop & Flume
- Understand data storage in various file formats such as Text, SequenceFile, Parquet, ORC, & RCFile
- Discuss machine learning principles w/ libraries such as Mahout
- Explore Batch & Stream data processing using Apache Spark
If you are interested in building efficient business solutions using Hadoop, this is the book for you. Here, you’ll build six real-life, end-to-end solutions using the tools in the Hadoop ecosystem, thereby taking your knowledge of Hadoop to the next level.
- Learn about the evolution of Hadoop as the big data platform
- Understand the basics of Hadoop architecture
- Build a 360-degree view of your customer using Sqoop & Hive
- Create & run classification models on Hadoop using BigML
- Use Spark & Hadoop to build a fraud detection system
- Develop a churn detection system using Java & MapReduce
- Build an IoT-based data collection and visualization system
- Learn about the coexistence of NoSQL & In-Memory databases in the Hadoop ecosystem
This book will teach you how to deploy large-scale datasets in deep neural networks with Hadoop for optimal performance. Starting with an exploration of what deep learning is, and what the various models associated with deep neural networks are, this book will then show you how to set up the Hadoop environment for deep learning.
- Explore Deep Learning & various models associated with it
- Understand the challenges of implementing distributed deep learning w/ Hadoop & how to overcome them
- Implement a Convolutional Neural Network (CNN) w/ deeplearning4j
- Delve into the implementation of Restricted Boltzmann Machines (RBM)
- Discover the mathematical explanation for implementing Recurrent Neural Networks (RNN)
- Get hands-on practice w/ deep learning models & their implementation w/ Hadoop
Spark is one of the most widely used large-scale data processing engines and runs extremely fast. It is a framework with tools that are equally useful for application developers and data scientists. This book starts with the fundamentals of Spark 2 and covers the core data processing framework and API, installation, and application development setup. By the end of this book, you will have all the knowledge you need to develop efficient large-scale applications using Apache Spark.
- Get to know the fundamentals of Spark 2 & the Spark programming model using Scala and Python
- Know how to use Spark SQL & DataFrames using Scala and Python
- Get an introduction to Spark programming using R
- Perform Spark data processing, charting, & plotting using Python
- Get acquainted w/ Spark stream processing using Scala & Python
- Be introduced to machine learning using Spark MLlib
- Get started with graph processing using Spark GraphX
- Bring together all that you’ve learned & develop a complete Spark application
via Ashraf