Hands-on Apache Spark

Learn Apache Spark from basics to advanced topics in this hands-on course

Course details

  • Next courses: None currently
  • Duration: 3 days
  • 20
  • Format: Instructor-led, hands-on exercises
  • Language: Hebrew
  • Laptop: Bring your own (installation instructions will be sent prior to course start)
  • Included

Apache Spark is the main platform for deep BigData analysis. Companies across industries - Finance, AdTech, Cyber, Commerce, and Internet - use Spark in different modes for ETL, BI, machine learning, and stream processing. This developer course gives you hands-on experience with Spark's basic and advanced modules, and focuses on Spark DataFrames - Spark's data-optimized API. Teaching and exercises are done in a cloud environment (AWS EMR, S3, and Zeppelin).

Objectives

Participants will gain end-to-end familiarity with Apache Spark, and know how to:

  • Install and deploy Apache Spark.
  • Design Spark computations using transformations, actions, and the DataFrame API.
  • Use the Spark eco-system including SparkSQL, Spark Streaming, Spark.ML, and more.
  • Use best practices and debugging and monitoring tooling to produce production-ready deployments.
  • Make use of Spark in real-life scenarios, and trade off accuracy and performance where needed.

Prerequisites

At least 3 years of programming experience, and experience with Python, Java, or Scala.

Syllabus

  • Short Scala introduction for Java and Python programmers.
  • Functional Programming.
  • Getting to know the BigData ecosystem.
  • Apache Hadoop (HDFS, MapReduce) and Apache Spark.
  • Principles of MapReduce.
  • The foundation for BigData - data locality, partitioning, shuffling.
  • BigData tools and applications.
  • Hands-on AWS: EC2, connecting via SSH, S3, AWS CLI, EMR and HDFS.
  • Spark low level API - RDD.
  • SparkSession and SparkContext.
  • Transformations and actions (sketched in code after the syllabus).
  • Functional programming and distributed execution.
  • Working with files.
  • Distributed computation with DataFrames.
  • Reading files with DataFrames.
  • DataFrames API principles.
  • Data partitioning - hash/mod and full-order.
  • Grouping, sorting, and joining in a distributed system (sketched below).
  • Query plan and explain.
  • Spark cluster components.
  • Scheduling - jobs, stages, tasks.
  • Writing Spark applications in Scala, Java, and Python.
  • Using spark-submit - local and cluster mode (sketched below).
  • Monitoring execution via Spark UI.
  • Logging, writing and collecting.
  • SparkSQL components: HiveQL, MetaStore, Storage.
  • File formats - Parquet, csv, json.
  • Analytical functions.
  • IMDB example - SQL walk-through (sketched below).
  • IMDB example - hands-on using DataFrame syntax.
  • Creating a dynamic schema - hands-on exercise with legacy data.
  • Defining User Defined Functions (UDFs) - sketched below.
  • Streaming principles - event time, watermark, unbounded table.
  • DataFrame API for streaming.
  • Hands-on - reading web logs from Kafka and detecting bots (sketched below).
  • Smart sampling.
  • Bloom filter, linear counting, min-count.
  • Spark approximation functions (sketched below).
  • Machine learning terms - train, test, overfit, regularisation.
  • The Spark.ML data-frames API.
  • Recommendation system algorithm - ALS.
  • Hands-on example - movie recommendation (sketched below).
  • Hyper-parameters.
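To give a taste of the hands-on material, a few short sketches follow. They are illustrative only: every path, name, and threshold in them is a placeholder, not a course asset. First, transformations and actions on an RDD - note that nothing executes until the action runs:

    import org.apache.spark.sql.SparkSession

    object TransformationsDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("transformations-demo")
          .master("local[*]") // local mode for experimenting; omit on a cluster
          .getOrCreate()
        val sc = spark.sparkContext

        // Transformations are lazy: this only builds a plan.
        val counts = sc.textFile("s3://example-bucket/logs/*.txt") // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // The action triggers the actual distributed computation.
        counts.take(10).foreach(println)

        spark.stop()
      }
    }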
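A sketch of the DataFrame topics - reading files, grouping and joining (both shuffle operations), and inspecting the query plan with explain. Paths and column names are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("df-demo").getOrCreate()

    // Read CSV and Parquet into DataFrames (paths and columns are hypothetical).
    val ratings = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://example-bucket/imdb/ratings.csv")
    val titles = spark.read.parquet("s3://example-bucket/imdb/titles.parquet")

    // Grouping and joining both repartition data across the cluster (a shuffle).
    val topRated = ratings
      .groupBy(col("titleId"))
      .agg(avg(col("rating")).as("avgRating"), count("*").as("votes"))
      .join(titles, "titleId")
      .orderBy(desc("avgRating"))

    // Print the plan Catalyst chose - useful for spotting unexpected shuffles.
    topRated.explain()
    topRated.show(10)

Reading the explain() output is how you spot the exchanges introduced by groupBy and join.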
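A sketch of a submittable Spark application; the package, class, and jar names are placeholders, and the comment shows the spark-submit invocations for local versus cluster mode:

    // Minimal self-contained application, packaged into demo.jar (hypothetical).
    //   Local:   spark-submit --class demo.SubmitDemo --master local[*] demo.jar
    //   Cluster: spark-submit --class demo.SubmitDemo --master yarn --deploy-mode cluster demo.jar
    package demo

    import org.apache.spark.sql.SparkSession

    object SubmitDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("submit-demo").getOrCreate()
        spark.range(1000000L).selectExpr("sum(id)").show()
        spark.stop()
      }
    }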
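A sketch of SparkSQL over a temporary view, in the spirit of the IMDB walk-through; the path and columns are stand-ins:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sql-demo").getOrCreate()

    // Register a DataFrame as a temporary view, then query it with plain SQL.
    val titles = spark.read.parquet("s3://example-bucket/imdb/titles.parquet")
    titles.createOrReplaceTempView("titles")

    spark.sql("""
      SELECT genre, COUNT(*) AS n
      FROM titles
      GROUP BY genre
      ORDER BY n DESC
    """).show()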
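A sketch of defining and applying a User Defined Function; the normalization logic is an arbitrary example:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("udf-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Wrap ordinary Scala code as a UDF. UDFs are opaque to the Catalyst
    // optimizer, so prefer built-in functions when they exist.
    val normalizeTitle = udf((title: String) =>
      if (title == null) null else title.trim.toLowerCase)

    val df = Seq("  The Matrix ", "INCEPTION").toDF("title")
    df.select(normalizeTitle($"title").as("normalized")).show()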
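A sketch of the shape of the streaming exercise - reading an unbounded table from Kafka, watermarking event time, and counting per window. The broker, topic, log layout, and the 100-requests-per-minute threshold are all assumptions, and the Kafka source requires the spark-sql-kafka package:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("bots-demo").getOrCreate()

    // Web logs arrive from Kafka as an unbounded table (broker/topic hypothetical).
    val logs = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "weblogs")
      .load()
      .selectExpr("CAST(value AS STRING) AS line", "timestamp")

    // Naive bot heuristic: over 100 requests per IP per one-minute event-time window.
    // We assume the IP is the first whitespace-separated field of the log line.
    val bots = logs
      .withColumn("ip", split(col("line"), " ").getItem(0))
      .withWatermark("timestamp", "2 minutes") // bound state, tolerate late events
      .groupBy(window(col("timestamp"), "1 minute"), col("ip"))
      .count()
      .filter(col("count") > 100)

    bots.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()

outputMode("update") emits only windows whose counts changed, which pairs naturally with the watermark-bounded state.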
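A sketch of the approximation functions, comparing an exact distinct count against Spark's HyperLogLog++-based estimate on synthetic data:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("approx-demo").master("local[*]").getOrCreate()

    // Synthetic data: 10M events over ~350K distinct users.
    val events = spark.range(0L, 10000000L).withColumn("user", col("id") % 350000)

    // Exact distinct count: shuffles every distinct value.
    events.select(countDistinct(col("user"))).show()

    // HyperLogLog++ approximation: far cheaper, with a configurable error bound.
    events.select(approx_count_distinct(col("user"), rsd = 0.01)).show()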
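Finally, a sketch of the Spark.ML ALS recommendation flow; the ratings path is hypothetical, and the rank and regParam values are examples of the hyper-parameters discussed in class:

    import org.apache.spark.ml.recommendation.ALS
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("als-demo").getOrCreate()

    // Ratings with (userId, movieId, rating) columns; the path is hypothetical.
    val ratings = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://example-bucket/ml/ratings.csv")

    val Array(train, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 42)

    // rank and regParam are hyper-parameters, normally tuned rather than hard-coded.
    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")
      .setRank(10)
      .setRegParam(0.1)
      .setColdStartStrategy("drop") // avoid NaN predictions for unseen users/items

    val model = als.fit(train)
    model.transform(test).show(10)
    model.recommendForAllUsers(5).show(truncate = false)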

Ready to get started?
