Scalable data processing with Apache Spark

Designing and maintaining scalable data processing applications with Apache Spark


Scalable data processing with Apache Spark introduces you to the popular open-source processing framework that has come to dominate the Big Data landscape. From basic concepts through configuration and operations, you will learn how to model data processing algorithms with Spark's APIs (RDD, SQL, DataFrame, and Dataset), how to monitor, analyze, and optimize Spark's performance, and how to build and deploy Spark applications.

Intended Audience

This course is intended for individuals responsible for designing and implementing solutions with Apache Spark, namely Solutions Architects, SysOps Administrators, Data Scientists, and Data Engineers.

Prerequisites

We recommend that attendees of this course have the following prerequisites:

  • Proficiency in at least one of the following programming languages: Java 8 (including lambdas), Scala, or Python
  • Basic familiarity with the JVM: classes, JARs, and memory management
  • Basic familiarity with big data technologies, including Apache Hadoop, MapReduce, and HDFS
  • Basic understanding of data warehousing, relational database systems, and database design

Modules

  Module 1 - Introduction to Apache Spark
  • Overview of Apache Spark
  • Basic concepts: distributed processing and Map/Reduce
  • Word Count example explained (see the sketch after this module's outline)
  • Apache Spark application components: Driver, Master, Executor
  • Apache Spark deployment modes and local environment setup
  • Serialization and Shuffling
  • Spark internal processing model: Jobs, Stages and Tasks
  • Apache Spark libraries: Core, SQL, MLlib, GraphX and Streaming
  • Supported Languages: API comparison
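
  A minimal sketch of the Word Count example discussed above, written in Scala against the RDD API; the application name, master URL, and input path are illustrative placeholders, not course materials:

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local deployment mode, convenient for experimenting on a laptop
        val spark = SparkSession.builder()
          .appName("WordCount")
          .master("local[*]")
          .getOrCreate()

        // Read lines as an RDD, split into words, and count occurrences per word
        val counts = spark.sparkContext.textFile("data/input.txt")
          .flatMap(_.split("\\s+"))   // transformation: one record per word
          .map(word => (word, 1))     // transformation: pair each word with a 1
          .reduceByKey(_ + _)         // transformation: sum counts per word (triggers a shuffle)

        // Action: pulls the results back to the driver, so only use it on small outputs
        counts.collect().foreach(println)

        spark.stop()
      }
    }
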
  Module 2 - Programming and Optimizing Apache Spark Jobs
  • RDD API: transformations vs. actions
  • Modeling computations using RDD API
  • Caching, Broadcasting and Checkpointing
  • Common performance pitfalls: groupBy, collect, join
  • Accumulators as alternatives to actions
  • Word Count example revisited using the SQL, DataFrame, and Dataset APIs (sketched again after this module's outline)
  • Supported Data Formats: CSV, JSON, Parquet
  • API Comparison - which one should I use?
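
  The same Word Count revisited with the DataFrame API, as outlined above; the file paths are placeholders, and the cache() call is there only to illustrate the caching topic from this module:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{explode, split, desc}

    val spark = SparkSession.builder().appName("WordCountDF").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read lines as a Dataset[String]; the single column is named "value"
    val lines = spark.read.textFile("data/input.txt")

    val counts = lines
      .select(explode(split($"value", "\\s+")).as("word"))  // one row per word
      .groupBy("word")
      .count()

    counts.cache()                                           // cached because the result is reused twice below
    counts.orderBy(desc("count")).show(20)
    counts.write.mode("overwrite").parquet("data/word_counts")  // Parquet, one of the supported formats
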
  Module 3 - Apache Spark Operations: deployment, monitoring and integrations
  • Managing Spark Sessions: best practices (see the configuration sketch after this module's outline)
  • Apache Spark class loading: common mistakes and best practices
  • Monitoring Spark applications: Web UI, metric sinks
  • Resiliency: retried tasks and stages, and where resiliency fails
  • Logging and viewing history
  • Data storage integration: HDFS, S3, Cassandra and JDBC
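
  A small configuration-and-integration sketch in the spirit of this module; the memory setting, partition count, bucket name, and JDBC endpoint are hypothetical values, not recommendations:

    import org.apache.spark.sql.SparkSession

    // Configuration is attached to the session builder (values here are placeholders)
    val spark = SparkSession.builder()
      .appName("etl-job")
      .config("spark.executor.memory", "4g")
      .config("spark.sql.shuffle.partitions", "200")
      .getOrCreate()

    // Distributed storage is addressed through URI schemes (here: S3 via s3a)
    val events = spark.read.parquet("s3a://my-bucket/events/")

    // Relational databases are reached over JDBC
    events.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/analytics")
      .option("dbtable", "events")
      .option("user", "etl")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .mode("append")
      .save()

    spark.stop()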

Related Courses


Introduction to BigData and Cloud Technologies
BigData and Cloud explained with real-world examples in this intensive 1-day workshop

BigData on Amazon Web Services (AWS)
BigData processing on AWS with Hadoop, Spark, Redshift and more explained