Scalable data processing with Apache Spark

Designing and maintaining scalable data processing applications with Apache Spark


Scalable data processing with Apache Spark introduces you to the popular open-source processing framework that has come to dominate the Big Data landscape. From basic concepts through configuration and operations, you will learn how to model data processing algorithms with Spark's APIs (RDD, SQL, DataFrame, and Dataset), how to monitor, analyze, and optimize Spark's performance, and how to build and deploy Spark applications.

Intended Audience

This course is intended for individuals responsible for designing and implementing solutions with Apache Spark, namely Solutions Architects, SysOps Administrators, Data Scientists, and Data Engineers.

Prerequisites

We recommend that attendees of this course have the following prerequisites:

  • Proficiency in at least one of the following programming languages: Java 8 (including lambdas), Scala, or Python
  • Basic familiarity with the JVM: classes, JARs, and memory management
  • Basic familiarity with big data technologies, including Apache Hadoop, MapReduce, and HDFS
  • Basic understanding of data warehousing, relational database systems, and database design

Modules

  Module 1 - Introduction to Apache Spark
  • Overview of Apache Spark
  • Basic concepts: distributed processing and Map/Reduce
  • Word Count example explained (see the sketch after this module's outline)
  • Apache Spark application components: Driver, Master, Executor
  • Apache Spark deployment modes and local environment setup
  • Serialization and Shuffling
  • Spark internal processing model: Jobs, Stages and Tasks
  • Apache Spark libraries: Core, SQL, MLlib, GraphX and Streaming
  • Supported Languages: API comparison
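
  A minimal sketch of the Word Count example discussed above, written in Scala against the RDD API; the application name, master URL, and input path are illustrative placeholders, not course materials:

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local deployment mode, convenient for experimenting on a laptop
        val spark = SparkSession.builder()
          .appName("WordCount")
          .master("local[*]")
          .getOrCreate()

        // Read lines as an RDD, split into words, and count occurrences per word
        val counts = spark.sparkContext.textFile("data/input.txt")
          .flatMap(_.split("\\s+"))   // transformation: one record per word
          .map(word => (word, 1))     // transformation: pair each word with a 1
          .reduceByKey(_ + _)         // transformation: sum counts per word (triggers a shuffle)

        // Action: pulls the results back to the driver, so only use it on small outputs
        counts.collect().foreach(println)

        spark.stop()
      }
    }
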
  Module 2 - Programming and Optimizing Apache Spark Jobs
  • RDD API: transformations vs. actions
  • Modeling computations using RDD API
  • Caching, Broadcasting and Checkpointing
  • Common performance pitfalls: groupBy, collect, join
  • Accumulators as alternatives to actions
  • Word Count example revisited using the SQL, DataFrame, and Dataset APIs (sketched again after this module's outline)
  • Supported Data Formats: CSV, JSON, Parquet
  • API Comparison - which one should I use?
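
  The same Word Count revisited with the DataFrame API, as outlined above; the file paths are placeholders, and the cache() call is there only to illustrate the caching topic from this module:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{explode, split, desc}

    val spark = SparkSession.builder().appName("WordCountDF").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read lines as a Dataset[String]; the single column is named "value"
    val lines = spark.read.textFile("data/input.txt")

    val counts = lines
      .select(explode(split($"value", "\\s+")).as("word"))  // one row per word
      .groupBy("word")
      .count()

    counts.cache()                                           // cached because the result is reused twice below
    counts.orderBy(desc("count")).show(20)
    counts.write.mode("overwrite").parquet("data/word_counts")  // Parquet, one of the supported formats
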
  Module 3 - Apache Spark Operations: deployment, monitoring and integrations
  • Managing Spark Sessions: best practices (see the configuration sketch after this module's outline)
  • Apache Spark class loading: common mistakes and best practices
  • Monitoring Spark applications: Web UI, metric sinks
  • Resiliency: retried tasks and stages, and where resiliency fails
  • Logging and viewing history
  • Data storage integration: HDFS, S3, Cassandra and JDBC
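
  A small configuration-and-integration sketch in the spirit of this module; the memory setting, partition count, bucket name, and JDBC endpoint are hypothetical values, not recommendations:

    import org.apache.spark.sql.SparkSession

    // Configuration is attached to the session builder (values here are placeholders)
    val spark = SparkSession.builder()
      .appName("etl-job")
      .config("spark.executor.memory", "4g")
      .config("spark.sql.shuffle.partitions", "200")
      .getOrCreate()

    // Distributed storage is addressed through URI schemes (here: S3 via s3a)
    val events = spark.read.parquet("s3a://my-bucket/events/")

    // Relational databases are reached over JDBC
    events.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/analytics")
      .option("dbtable", "events")
      .option("user", "etl")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .mode("append")
      .save()

    spark.stop()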

Related Courses


Introduction to BigData and Cloud Technologies
BigData and Cloud explained with real-world examples in this intensive 1-day workshop

BigData on Amazon Web Services (AWS)
BigData processing on AWS with Hadoop, Spark, Redshift and more explained