Saturday 6 January 2018

Spark - I

In the last series of posts, we looked at Spark SQL. The first segment of the Spark SQL post is here in case you wish to look at it. In this series, we will look at the basics of Spark Core. We will use the Hortonworks Sandbox HDP 2.6.1 for all the work in this post.

While we do not wish to make this series an academic one, it will be useful just to touch on a few important points before we dive into the basics of Spark Core. 

SparkContext: The SparkContext is the entry point to a Spark application. A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast variables on that cluster. As of now, only one SparkContext can be active per JVM.
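
For reference, in a standalone Spark application (outside Zeppelin, which injects sc for us, as we will see later), a SparkContext is typically built from a SparkConf. Below is a minimal sketch in Scala; the application name and master URL are only placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder app name and master URL; adjust for your cluster
val conf = new SparkConf().setAppName("SparkCoreBasics").setMaster("local[*]")
val sc = new SparkContext(conf)

// ... create RDDs, accumulators and broadcast variables using sc ...

sc.stop()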

Resilient Distributed Datasets: Every Spark application consists of a driver program that defines the SparkContext, runs the user's main function and executes various parallel operations on a cluster. The driver program converts all data into the main abstraction provided by Spark, the resilient distributed dataset (RDD), and any further processing is in terms of RDDs. An RDD can be seen as an immutable collection of elements, partitioned across the nodes of the cluster, that can be operated on in parallel. RDDs can be created in two ways, as sketched after this list:

1) Using a file in the Hadoop file system (or any other Hadoop-supported file system)

2) Using an existing Scala collection in the driver program.
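
Here is a minimal sketch of both approaches; the file path is only an example:

%spark2

// 1) From a file in HDFS (or any other Hadoop-supported file system); the path is an example
val lines = sc.textFile("/tmp/sample.txt")

// 2) From an existing Scala collection in the driver program
val numbers = sc.parallelize(1 to 100)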

Users may also ask Spark to persist an RDD in memory using the persist (or cache) method, allowing it to be reused efficiently across parallel operations. Finally, RDDs have been architected to automatically recover from node failures.
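
As a small sketch of caching, assuming an RDD that will be reused across several operations:

%spark2

import org.apache.spark.storage.StorageLevel

val squares = sc.parallelize(1 to 1000).map(n => n * n)
squares.cache()                                   // shorthand for persist(StorageLevel.MEMORY_ONLY)
// squares.persist(StorageLevel.MEMORY_AND_DISK)  // an alternative storage level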

RDDs support two types of operations (a short sketch follows this list):

1) Transformations: create a new RDD from an existing one

2) Actions: run a computation on an RDD and return the result to the driver program
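
Below is a minimal sketch contrasting the two; transformations are lazy, so nothing is computed until an action is called:

%spark2

val numbers = sc.parallelize(1 to 10)        // RDD from a Scala collection
val doubled = numbers.map(_ * 2)             // transformation: returns a new RDD
val evens   = doubled.filter(_ % 4 == 0)     // transformation: returns a new RDD
evens.count()                                // action: triggers computation, returns 5
evens.collect()                              // action: returns Array(4, 8, 12, 16, 20)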

Given that Spark distributes the data as RDDs across the cluster and performs transformations and actions in parallel, there is a need to share variables among the parallel computations. Shared variables are the second abstraction that Spark provides. They come in two types, as sketched after this list:

1) Accumulators: variables that are only "added" to, such as counters and sums

2) Broadcast variables: read-only variables that are cached in memory on all nodes
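
A minimal sketch of both types, using the Spark 2.x accumulator and broadcast APIs:

%spark2

// Accumulator: tasks only add to it; the driver reads the final value
val errorCount = sc.longAccumulator("errorCount")
sc.parallelize(Seq("ok", "error", "ok", "error")).foreach { line =>
  if (line == "error") errorCount.add(1)
}
errorCount.value                             // 2

// Broadcast variable: read-only data cached in memory on every node
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
sc.parallelize(Seq(1, 2, 1)).map(id => lookup.value(id)).collect()   // Array(one, two, one)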

Once the sandbox is up, navigate to http://localhost:8888/. You will see the page below:

Click on the Launch Dashboard button. On the next page, enter maria_dev/maria_dev and click on Sign In:

On the Ambari Dashboard, verify that Spark2 and Zeppelin Notebook are running, as shown at the bottom left:

Then, navigate to http://localhost:9995/#/. You will see the "Welcome to Zeppelin!" page:

Click Create new note under Notebook on the above page. In the Create new note dialog, enter a Note Name. Note that the Default Interpreter is spark2. Click on the Create Note button:

In the new Notebook, click on Interpreter Binding on the right as shown:

On the Interpreter Binding page, make sure that spark2 is at the top of the list. Note that %spark2 is the default interpreter. Click on Cancel:

In the Spark - I notebook, enter the code below:

%spark2

// Display the Zeppelin-provided SparkContext and its attributes
sc
sc.appName
sc.applicationId
sc.defaultMinPartitions
sc.defaultParallelism
sc.deployMode
sc.isLocal
sc.master
sc.sparkUser
sc.startTime
sc.version


It will appear as shown below:

Click on the arrow next to READY at the top right to run the code. The results are shown below:

Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SQLContext), so we use the provided sc and do not create a new one. In the code above, we display the different attributes of sc (SparkContext).
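
As a quick check that the injected sc works, we can run a small sketch in a new paragraph of the same note:

%spark2

val sample = sc.parallelize(1 to 100)        // RDD built with the injected SparkContext
sample.count()                               // 100
sample.sum()                                 // 5050.0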


This concludes the first segment of the Spark series.