Spark Essentials
Apache Spark is an open-source, fault-tolerant data processing framework for big data workloads which unifies:
Batch processing
Real-Time / Stream processing
Machine learning
Graph computation
Spark is largely used as a distributed batch computation engine that explicitly handles an entire workflow as a single job, making it a dataflow engine like Tez and Flink. Compared with MapReduce, this brings several advantages (a small example follows the list below):
Sorting only needs to be performed when it is actually required rather than by default between every Map and Reduce stage
No unnecessary map tasks - Map tasks can be incorporated into a preceding reduce operator
Locality optimizations since all joins and data dependencies in a workflow are explicitly declared - Tasks that consume the same data can be placed on the same machine to reduce network overhead
Intermediate state between operators can be kept in memory or written to local disk to reduce I/O to HDFS
Operators can execute once input is ready without waiting for the preceding stage to complete
Reduce startup overhead by reusing existing JVM processes to run new operators
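As an illustrative sketch of "one job for the whole workflow": the chain below (made-up data and column names) is only planned and executed when the final action is called, so Spark can optimise across all of the steps at once.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("dataflow-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up input: (user, product, amount).
val orders = Seq(("alice", "book", 12.0), ("bob", "pen", 2.5), ("alice", "pen", 3.0))
  .toDF("user", "product", "amount")

// Several map/reduce-like steps are declared up front...
val perUser = orders
  .filter($"amount" > 1.0)            // map-like step
  .groupBy($"user")                   // shuffle boundary
  .agg(sum($"amount").as("total"))    // reduce-like step
  .orderBy($"total".desc)             // sorting only where it is actually needed

// ...but nothing executes until an action is called; at that point Spark plans
// and runs the whole workflow as a single job.
perUser.show()

spark.stop()
```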
Spark provides fault tolerance by:
Tracking each RDD's lineage (how its partitions were derived) so that a lost partition can be recomputed
Using the DAG of transformations to rebuild the data flow across worker nodes upon failure
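The lineage that Spark would replay after a failure can be inspected with `toDebugString`; a minimal sketch (the data and transformations are arbitrary examples):

```scala
import org.apache.spark.sql.SparkSession

// Minimal local session purely for this illustration.
val spark = SparkSession.builder().appName("lineage-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Arbitrary example data and transformations.
val words  = sc.parallelize(Seq("spark", "rebuilds", "lost", "partitions", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// toDebugString prints the lineage (chain of dependencies) that Spark would
// replay to recompute a lost partition of `counts`.
println(counts.toDebugString)

spark.stop()
```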
The Spark driver unsurprisingly acts as the "driver" of your Spark application. The driver:
Controls the execution of the Spark application
Maintains all states of the Spark cluster (including its executors)
Interfaces with the Cluster Manager to negotiate for physical resources (i.e. Memory and Cores) to launch executors
Within the Driver Program sits a Spark Context or Session (Spark 2.0 onwards). The Context is used by the Driver to establish communication with the Cluster and Resource managers to coordinate and execute jobs.
An instance of a SparkContext / SparkSession must be instantiated in every application. For example, in Spark 2.0 onwards we can do the following when executing locally:
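The snippet below is a minimal sketch in Scala; the application name and the `local[*]` master (use all local cores) are illustrative choices.

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession that runs locally on all available cores.
val spark = SparkSession.builder()
  .appName("spark-essentials-demo")
  .master("local[*]")
  .getOrCreate()

// The underlying SparkContext remains accessible for RDD-based work.
val sc = spark.sparkContext
```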
Spark executors are processes that perform the tasks assigned by the Driver process: they run each task and report back its state and result (i.e. success or failure).
Cluster managers allocate physical resources (e.g. RAM and cores) to our Spark application based on its needs. Some common resource managers are detailed below:
A Spark application can be submitted in one of three execution modes:

| Execution mode | How it runs | Responsibility |
| --- | --- | --- |
| Cluster | The JAR or script is submitted to the Cluster Manager, which then launches the Driver process on a worker node within the cluster. | The Cluster Manager is responsible for maintaining the Spark application. |
| Client | Same as Cluster mode, except the Driver remains on the client machine that submitted the Spark application. | The client machine (gateway machine or edge node) is responsible for maintaining the Driver process, whilst the cluster manages the Executor processes. |
| Local | A complete departure from Cluster and Client modes: the entire Spark application runs on a single machine. | Achieves parallelism through multithreading. |
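In practice the resource requests that the cluster manager grants are most often passed as spark-submit flags, but they can equally be declared on the session builder. A minimal sketch, assuming a YARN-managed cluster and purely illustrative values:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative resource requests; the master URL and all values are assumptions.
// The cluster manager (YARN in this example) decides whether it can grant them.
val spark = SparkSession.builder()
  .appName("resource-demo")
  .master("yarn")                              // hand the application to a YARN cluster manager
  .config("spark.executor.instances", "4")     // number of executors to launch
  .config("spark.executor.memory", "4g")       // memory per executor
  .config("spark.executor.cores", "2")         // cores per executor
  .getOrCreate()
```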
When each structured API surfaces syntax and analysis errors:

| | Datasets | DataFrames | SQL |
| --- | --- | --- | --- |
| Syntax errors | Compile time | Compile time | Runtime |
| Analysis errors | Compile time | Runtime | Runtime |
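A sketch of the difference (the case class, column names, and data are made up): referencing a column that does not exist is only caught when the query runs for SQL and DataFrames, whereas a typed Dataset access fails to compile. The failing lines are commented out so the block itself runs.

```scala
import org.apache.spark.sql.SparkSession

// Made-up schema purely for illustration.
final case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("error-detection-demo").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(Person("alice", 34L), Person("bob", 29L)).toDS()
people.createOrReplaceTempView("people")

// SQL: a mistyped column name inside the string only fails when the query is analysed at runtime.
// spark.sql("SELECT namee FROM people").show()    // AnalysisException at runtime

// DataFrame: the Scala code compiles, but the unknown column is again a runtime AnalysisException.
// people.toDF().select("namee").show()            // AnalysisException at runtime

// Dataset: typed access to a field that does not exist is rejected by the compiler.
// people.map(p => p.namee)                        // does not compile

spark.stop()
```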
Sort-merge join is the default join strategy in Spark when neither side of an equi-join is small enough to broadcast.
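A sketch of checking which strategy the planner picks (table names, contents, and the join key are made up). Disabling the automatic broadcast threshold makes the default sort-merge join visible even on tiny example tables, while a `broadcast` hint switches it to a broadcast hash join:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("join-strategy-demo")
  .master("local[*]")
  // Disable auto-broadcasting so the default sort-merge join is visible
  // even for these tiny illustrative tables.
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()
import spark.implicits._

// Made-up example tables.
val orders    = Seq((1, "book"), (2, "pen"), (3, "ink")).toDF("customer_id", "product")
val customers = Seq((1, "alice"), (2, "bob"), (3, "carol")).toDF("customer_id", "name")

// The physical plan should show a SortMergeJoin node.
orders.join(customers, "customer_id").explain()

// An explicit broadcast hint requests a BroadcastHashJoin instead.
orders.join(broadcast(customers), "customer_id").explain()

spark.stop()
```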