{"id":160,"date":"2022-11-22T21:38:38","date_gmt":"2022-11-22T21:38:38","guid":{"rendered":"https:\/\/www.kindsonthegenius.com\/apache-spark\/?p=160"},"modified":"2022-11-22T21:38:38","modified_gmt":"2022-11-22T21:38:38","slug":"spark-your-first-spark-program","status":"publish","type":"post","link":"https:\/\/www.kindsonthegenius.com\/apache-spark\/spark-your-first-spark-program\/","title":{"rendered":"Spark &#8211; Your First Spark Program!"},"content":{"rendered":"<p>In this tutorial, you will learn how to create a basic Spark job. We&#8217;ll be using Spark Core, the base of the Spark project, which provides distributed task dispatching, scheduling and basic I\/O operations.<\/p>\n<p>We will write a simple Spark program (a Spark job) that processes a text file. We will also see how to use the Spark Web GUI to view the status of jobs, storage and other parameters, and demonstrate how a Spark RDD (Resilient Distributed Dataset) works.<\/p>\n<p><strong>Content<\/strong><\/p>\n<ol>\n<li><a href=\"#t1\">Start Spark Shell and Create a Text File<\/a><\/li>\n<li><a href=\"#t2\">Create an RDD<\/a><\/li>\n<li><a href=\"#t3\">Execute Word Count Transformation<\/a><\/li>\n<li><a href=\"#t4\">Cache and Save the RDD<\/a><\/li>\n<li><a href=\"#t5\">View the Output<\/a><\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t1\">1. 
Start Spark Shell and Create a Text File<\/strong><\/h4>\n<p>By now you should have installed Spark and be able to start the Spark Shell.<\/p>\n<p><strong>Step 1<\/strong> &#8211; Start the Spark Shell<\/p>\n<p><strong>Step 2<\/strong> &#8211; Create a text file (I named it <em>inputfile.txt<\/em>) in the directory from which you started the Spark Shell<\/p>\n<p>Put the following content in the text file:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">People can be prejudiced not only toward those of another nationality<span style=\"color: #333333;\">,<\/span> \r\nrace<span style=\"color: #333333;\">,<\/span> tribe<span style=\"color: #333333;\">,<\/span> or language but also toward those of a different religion<span style=\"color: #333333;\">,<\/span> \r\ngender<span style=\"color: #333333;\">,<\/span> or social class<span style=\"color: #333333;\">.<\/span> Some judge people negatively based on their age<span style=\"color: #333333;\">,<\/span> \r\neducation<span style=\"color: #333333;\">,<\/span> disabilities<span style=\"color: #333333;\">,<\/span> or physical appearance<span style=\"color: #333333;\">.<\/span> \r\nYet<span style=\"color: #333333;\">,<\/span> they still feel that they are not prejudiced<span style=\"color: #333333;\">.<\/span>\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t2\">2. Create an RDD<\/strong><\/h4>\n<p>Next, we will read this file using the Spark Scala API and create an RDD from it.<\/p>\n<p>Execute the command below:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-weight: bold;\">val<\/span> fileRDD <span style=\"color: #008800; font-weight: bold;\">=<\/span> sc<span style=\"color: #333333;\">.<\/span>textFile<span style=\"color: #333333;\">(<\/span><span style=\"background-color: #fff0f0;\">\"inputfile.txt\"<\/span><span style=\"color: #333333;\">)<\/span>\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>This command reads a text file from the given path and creates an RDD named fileRDD. 
In this case, the path is relative, so the file is expected to be in the directory where the Spark Shell was started.<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t3\">3. Execute Word Count Transformation<\/strong><\/h4>\n<p>We would like to count how many times each word occurs in the file. To do this we will use three functions:<\/p>\n<p><strong>flatMap()<\/strong> &#8211; split each line of the file on spaces into individual words<\/p>\n<p><strong>map()<\/strong> &#8211; pair each word with a count of 1<\/p>\n<p><strong>reduceByKey(_+_)<\/strong> &#8211; add up the values of identical keys, giving the total count for each word<\/p>\n<p>The complete command is given below:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-weight: bold;\">val<\/span> wordCounts <span style=\"color: #008800; font-weight: bold;\">=<\/span> fileRDD<span style=\"color: #333333;\">.<\/span>flatMap<span style=\"color: #333333;\">(<\/span>line <span style=\"color: #008800; font-weight: bold;\">=&gt;<\/span> line<span style=\"color: #333333;\">.<\/span>split<span style=\"color: #333333;\">(<\/span><span style=\"background-color: #fff0f0;\">\" \"<\/span><span style=\"color: #333333;\">)).<\/span>map<span style=\"color: #333333;\">(<\/span>word <span style=\"color: #008800; font-weight: bold;\">=&gt;<\/span> <span style=\"color: #333333;\">(<\/span>word<span style=\"color: #333333;\">,<\/span> <span style=\"color: #0000dd; font-weight: bold;\">1<\/span><span style=\"color: #333333;\">)).<\/span>reduceByKey<span style=\"color: #333333;\">(<\/span><span style=\"color: #008800; font-weight: bold;\">_<\/span><span style=\"color: #333333;\">+<\/span><span style=\"color: #008800; font-weight: bold;\">_<\/span><span style=\"color: #333333;\">)<\/span>\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>This command applies a chain of transformations to fileRDD and creates a new RDD. Transformations are lazy, so nothing is computed yet. 
If you want to inspect this new RDD (though the output may not mean much to you yet!), you can use the command below, which prints the RDD&#8217;s lineage:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">wordCounts<span style=\"color: #333333;\">.<\/span>toDebugString\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t4\">4. Cache and Save the RDD<\/strong><\/h4>\n<p>Now we want to cache our transformed RDD and also save it to storage.<\/p>\n<p>The command below uses the <em>cache()<\/em> function (<em>persist()<\/em> can also be used) to keep the RDD in memory:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">wordCounts<span style=\"color: #333333;\">.<\/span>cache\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>Use the command below to <strong>apply an action<\/strong> to the RDD. Here, we want to save the output of the transformation as text files. The output is saved in a folder named <em>outputDir<\/em>.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">wordCounts<span style=\"color: #333333;\">.<\/span>saveAsTextFile<span style=\"color: #333333;\">(<\/span><span style=\"background-color: #fff0f0;\">\"outputDir\"<\/span><span style=\"color: #333333;\">)<\/span>\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t5\">5. View the Output<\/strong><\/h4>\n<p>Now, you want to view the output files. 
We also want to see the status of the Spark job in the GUI.<\/p>\n<p><strong>Step 1<\/strong> &#8211; Open a new terminal and navigate into the outputDir folder<\/p>\n<p><strong>Step 2<\/strong> &#8211; Use the command below to list its contents<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">ls <span style=\"color: #333333;\">-<\/span><span style=\"color: #0000dd; font-weight: bold;\">1<\/span>\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>You will see output like this:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">part-00000 \r\npart-00001 \r\n<span style=\"color: #007020;\">_<\/span>SUCCESS\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Step 3<\/strong> &#8211; You can then use the cat command to view the content of each file<\/p>\n<p>Finally, you can view the status on the Web GUI using the link:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">http:\/\/localhost:4040\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>The GUI output is given below:<\/p>\n<figure id=\"attachment_162\" aria-describedby=\"caption-attachment-162\" style=\"width: 1708px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-content\/uploads\/sites\/13\/2022\/11\/Screenshot-2022-11-22-at-22.34.25.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-162\" src=\"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-content\/uploads\/sites\/13\/2022\/11\/Screenshot-2022-11-22-at-22.34.25.png\" alt=\"Apache Spark GUI Showing Spark Jobs\" width=\"1708\" height=\"1037\" srcset=\"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-content\/uploads\/sites\/13\/2022\/11\/Screenshot-2022-11-22-at-22.34.25.png 1708w, https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-content\/uploads\/sites\/13\/2022\/11\/Screenshot-2022-11-22-at-22.34.25-300x182.png 300w, https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-content\/uploads\/sites\/13\/2022\/11\/Screenshot-2022-11-22-at-22.34.25-1024x622.png 1024w, 
https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-content\/uploads\/sites\/13\/2022\/11\/Screenshot-2022-11-22-at-22.34.25-768x466.png 768w, https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-content\/uploads\/sites\/13\/2022\/11\/Screenshot-2022-11-22-at-22.34.25-1536x933.png 1536w\" sizes=\"auto, (max-width: 1708px) 100vw, 1708px\" \/><\/a><figcaption id=\"caption-attachment-162\" class=\"wp-caption-text\">Apache Spark GUI Showing Spark Jobs<\/figcaption><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>In this tutorial, you will learn how to create a basic Spark job. We&#8217;ll be using Spark Core. This is the base of the Spark project. &hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[37,32,38,35,9],"class_list":["post-160","post","type-post","status-publish","format-standard","hentry","category-spark","tag-apache","tag-hadoop","tag-rdd","tag-scala","tag-spark-core"],"_links":{"self":[{"href":"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-json\/wp\/v2\/posts\/160","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-json\/wp\/v2\/comments?post=160"}],"version-history":[{"count":2,"href":"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-json\/wp\/v2\/posts\/160\/revisions"}],"predecessor-version":[{"id":163,"href":"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-json\/wp\/v2\/posts\/160\/revisions\/163"}],"wp:attachment":[{"href":"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-json\/wp\/v2\/me
dia?parent=160"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-json\/wp\/v2\/categories?post=160"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/apache-spark\/wp-json\/wp\/v2\/tags?post=160"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}