May 24, 2026

Spark – Your First Spark Program!

In this tutorial, you will learn how create a basic spark job. We’ll be using Spark Core. This is the base of the Spark project. It provides functionality for distributed task dispatching, scheduling and basic I/O operations.

We would write as simple spark program (Spark Job) that processes a text file. We would also see how to use the Spark Web GUI to view status of jobs, storage and other parameters. We would also demonstrate how the Spark RDD(Resilient Distributed Datasets) works.

Content

  1. Start Spark Shell and Create a Text File
  2. Create an RDD
  3. Execute Work Count Transformation
  4. Cache and Save the RDD
  5. View the Output

 

1. Start Spark Shell and Create a Text File

By now you should have installed the Spark and can start the Spark Shell.

Step 1 – Start the Spark Shell

Step 2 – Create a text file (I name it inputfile.txt) in the home directory where the spark shell started from

Have this content in the text file

People can be prejudiced not only toward those of another nationality, 
race, tribe, or language but also toward those of a different religion, 
gender, or social class. Some judge people negatively based on their age, 
education, disabilities, or physical appearance. 
Yet, they still feel that they are not prejudiced.

 

2. Create an RDD

So we would read this file using Spark-Scala API and create an RDD from it.

Execute the command below:

val fileRDD = sc.textFile("inputfile.txt")

 

This command reads a text file from the given path and creates an RDD named fileRDD. In this case, the file is expected to be in the current location.

 

3. Execute Word Count Transformation

We would like to have a word count of each word in the file. To do this we would use three functions:

flatmap() – split the content of the file by space

map() – get the word count

reduceByKey(_+_) – add values of similar keys

The complete command is given below:

val wordCounts = fileRDD.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_);

 

This command is a transformation applied to the fileRDD and creates a new RDD. If you want to see this new RDD (though it would not make much sense to you!), you can used the command below:

wordCounts.toDebugString

 

4. Cache and Save the RDD

Now we want to cache our transformed RDD and also persist it to storage.

The command below uses the cache() function(persist can also be used) to persist the RDD in memory

wordCounts.cache

 

Use the command below to apply an action to the RDD. Here, we want to save the output of the transformation to a text file. This is saved in a folder named outputDir.

wordCounts.saveAsTextFile("outputDir")

 

5. View the Output

Now, you want to view the ouptutfile. We also want to see the status of the Spark job in the GUI.

Step 1 – Open a new terminal and navigate into the outputDir folder

Step 2 – Use the the command below to view the content

ls -1

 

You will see the output

part-00000 
part-00001 
_SUCCESS

 

Step 3 – You can then use the cat command to view the content of each file

Finally, you can view the status on the Web GUI using the the link:

http://localhost:4040

 

The GUI output is given below:

Apache Spark GUI Showing Spark Jobs
Apache Spark GUI Showing Spark Jobs
0 0 vote
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments