Apache Spark, a short explanation
Open-source distributed computing software
Imagine that instead of having one laptop to yourself, you had 100 different laptops. This is rather nice. First, you get more compute power; second, you get more storage; third, you get the safety of falling back on another machine if one goes down; but most importantly, you now have a team of 100 laptops to do a task, not just one. That is what Apache Spark is, fundamentally: distributed computing software. Apache Spark is open source, i.e. you can use its software for free; no single company owns it.
Apache Spark has a Driver node (sort of the administrator) and Worker nodes (sort of the executors). You can think of a node as an abstraction of a single server or a single computer. One node acts as the administrator, and the Worker nodes are the ones that actually perform the tasks.
There can be only one Driver node, but multiple Worker nodes.
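To make the Driver/Worker split concrete, here is a minimal PySpark sketch. The Python script itself is what runs on the Driver node, while the work (here, counting a million rows) is farmed out to the Workers. The master URL spark://cluster-host:7077 is a hypothetical standalone-cluster address; you would replace it with your own cluster's URL, or with local[*] to run everything on one machine.

```python
from pyspark.sql import SparkSession

# This script is the Driver program. The master URL below is a placeholder
# for a standalone cluster; swap in your own, or use "local[*]" to run
# Driver and Workers together on a single machine.
spark = (
    SparkSession.builder
    .appName("driver-and-workers-demo")
    .master("spark://cluster-host:7077")
    .getOrCreate()
)

# The Driver builds the plan; the actual counting runs on the Workers.
df = spark.range(1_000_000)
print(df.count())

spark.stop()
```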
Both the Driver node and the Worker nodes, like any computer, have memory (RAM) assigned to them. This is what they use to run computations.
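As a rough sketch, these are the standard configuration keys for that memory; the 4g and 8g values are purely illustrative, not recommendations. Note that the Driver's memory generally has to be fixed before the Driver process starts (for example on the spark-submit command line), so setting it in code only takes effect when launching a new application.

```python
from pyspark import SparkConf

# Illustrative values only. spark.driver.memory usually needs to be set
# before the driver JVM starts (e.g. via spark-submit or spark-defaults.conf).
conf = (
    SparkConf()
    .set("spark.driver.memory", "4g")    # RAM available to the Driver node
    .set("spark.executor.memory", "8g")  # RAM available to each executor on a Worker
)
```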
Both the Driver node and the Worker nodes also have cores. Cores are individual computation units within a node's CPU, and each core can perform a task independently of the others.
Within a Worker node, Spark has a further abstraction that we call an Executor. Each executor is made up of a certain number of cores: it can have one core or more, though of course no more than the total number of cores on that particular node.
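Here is a sketch of how executors are sized with the standard configuration keys; the numbers are illustrative, and spark.executor.instances mainly applies when running under a cluster manager such as YARN or Kubernetes.

```python
from pyspark import SparkConf

# Illustrative sizing: ask for 3 executors, each with 4 cores and 8 GB of RAM.
# Each executor must fit on a single Worker node, so 4 cores here assumes the
# Workers have at least 4 cores each.
conf = (
    SparkConf()
    .set("spark.executor.instances", "3")  # how many executors to launch
    .set("spark.executor.cores", "4")      # cores per executor
    .set("spark.executor.memory", "8g")    # memory per executor
)
```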
Spark is about distributed computing and parallelism. It therefore matters a lot how many cores you have, how much memory each core has, and how many partitions your input data is split into. The minimum sweet spot is where each core can comfortably fit a data partition in memory. The best sweet spot is where the total number of cores equals the number of data partitions, because then all the data can be processed in parallel. In the former case, the minimum sweet spot, some data partitions have to wait until a core becomes free.
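A small illustration of the "cores equals partitions" idea, run in local mode for simplicity. local[8] simulates 8 cores on one machine (an assumption for the demo); on a real cluster the total would be the number of executors times the cores per executor.

```python
from pyspark.sql import SparkSession

# "local[8]" simulates a cluster with 8 cores on a single machine.
spark = (
    SparkSession.builder
    .master("local[8]")
    .appName("partitions-demo")
    .getOrCreate()
)

total_cores = 8

df = spark.range(0, 10_000_000)
print("partitions before:", df.rdd.getNumPartitions())

# Repartition so the number of partitions equals the number of cores:
# every partition can then be processed at the same time.
df = df.repartition(total_cores)
print("partitions after:", df.rdd.getNumPartitions())

spark.stop()
```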