Learning Spark — Introduction

When traditional storage systems and conventional programming models struggled to handle huge volumes of data, Google created GFS (Google File System), MapReduce, and Bigtable to address the hardware, storage, and parallel-programming challenges.

The ideas behind GFS and MapReduce led to HDFS and the MapReduce framework at Yahoo, which were later donated to the Apache Software Foundation. MapReduce and HDFS, together with Hadoop Common and Hadoop YARN, became Apache Hadoop, which saw widespread adoption beyond Yahoo thanks to a large open source community. While Hadoop handled large-scale general batch processing well, it was not suitable for machine learning, streaming, or interactive SQL queries.

Researchers at UC Berkeley recognized the inefficiencies of MapReduce and its steep learning curve for beginners. They developed Spark, which borrowed ideas from MapReduce and enhanced them, most notably with in-memory storage for fast, interactive map and reduce computations.

Spark is designed to make large-scale distributed data processing easier, simpler, and faster. Intermediate results are kept in memory, which limits disk I/O and drives speed. Spark offers the RDD (Resilient Distributed Dataset), the fundamental logical data structure upon which higher-level abstractions such as Spark DataFrames are built. This layering makes Spark easy to use.

Apache Spark Ecosystem

Spark is a unified processing engine that supports multiple programming languages: Scala, Java, Python, SQL, and R. It offers unified libraries such as Spark SQL, MLlib, and GraphX, and it can read data from many sources, including Apache Hive, HBase, relational databases (RDBMS), and cloud storage such as Amazon S3 and Azure Storage.

Apache Spark Ecosystem of connectors

In the Spark architecture, the core Apache Spark components are:

  1. Spark Application

A Spark application has a Spark driver program that orchestrates parallel operations on the Spark cluster. The driver accesses Spark components through a SparkSession.

The Spark driver is responsible for communication with the cluster manager: it requests resources from the cluster manager for Spark executors, transforms Spark operations into DAG computations, and schedules and distributes tasks.

SparkSession consolidates all Spark operations and data into a single entry point, streamlining the earlier entry points such as SparkContext, HiveContext, SQLContext, and StreamingContext, along with configuration via SparkConf.

The cluster manager manages and allocates resources for the cluster of nodes on which the Spark application runs. Spark supports the standalone cluster manager, Apache Hadoop YARN, Apache Mesos, and Kubernetes.

Spark executors run on each worker node in the cluster. They communicate with the Spark driver and execute tasks on the workers.

Apache Spark components and architecture

Apache Spark’s easy-to-use APIs suit data of all shapes and sizes across Scala, Java, Python, SQL, and R. Undoubtedly, Apache Spark is a developer’s delight.

Sources:

Learning Spark, 2nd Edition, by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee. Copyright 2020 Databricks, Inc. ISBN 978-1-492-05004-9.

https://databricks.com/p/ebook/learning-spark-from-oreilly

https://spark.apache.org/docs/latest/cluster-overview.html

https://databricks.com/spark/about
