Cloud Computing Lab on Apache Spark

1 Abstract

The goal of this lab is to give you experience in developing Spark programs.

2 Background

Refresh your understanding by re-reading the Big Data Lecture Notes before beginning the lab.

3 Lab Environment

The tasks of this lab can be carried out on many platforms: the WSU pmCluster, AWS, MS Azure, Chameleon, or even your own laptop.

  1. Choose a platform, and include a rationale for your choice in the lab report.
  2. Spark is installed on pmCluster, AWS, Azure, etc. Look up the exact pathnames and version details, and report them on Piazza.
  3. If you wish to install Spark on your own Ubuntu machine, run apt-get install spark.
  4. Installing Spark from the Apache downloads at http://spark.apache.org/ will give you a newer version.

4 Lab Experiment

  1. Choose one of Scala, Java, or Python for the tasks of this lab, and include a rationale for your choice in the lab report.

4.1 Task 1: Simple Examples

There are several short examples described in http://spark.apache.org/examples.html. Do them all yourself. Scala is preferred. Python and Java are acceptable.

Capture your work in a typescript and insert it into the lab report as pre-formatted text. Comment on any discrepancies between the output you observed and what is shown on the above web page.

4.2 Task 2: Spark Streaming

At the bottom of the page http://spark.apache.org/examples.html there are links to Spark streaming examples. Choose any one example, and run it.

Capture your work in a typescript and insert it into the lab report as pre-formatted text. Also include a few screenshots of the above in the report.

4.3 Task 3: World Wide Grades

  1. You are given a world-wide "database" (actually a not-so-tiny collection of individual files; see below) of graduate student grades. We need to compute the average grade per country. The average is taken across all years, terms, courses, and students.
  2. Sketch in detail, short of actual Spark/Scala/Java code, how this problem can be solved using Hadoop MapReduce. Explain your thinking. Make this a clearly labeled section in the lab report.
  3. Include the Spark source code as a separate file in the turnin. Include your invocation of it, and the results, in the report.
  4. File Format: Hundreds or thousands of files are given in a directory named D. A single file contains the grades of one course offering. It is a text file of two columns, one row (line) per student. The left column has the student id, a non-negative integer; the right column has the grade the student received, a floating point number in the range 0 to 5. The name of the file encodes several fields in the following order, separated from one another by a single hyphen: country code (4 digits), year the course was offered (4 digits), term the course was offered (4 digits), course "number" (4 letters immediately followed by 4 digits). Make further reasonable assumptions, if you must.
  5. Each file has anywhere from 5 to 100 student records. Each country has about 1000 course records. We have about 80 countries.
  6. Construct these files artificially, using a script or two in a language you are comfortable with (Bash shell, Python, Scala, Java, …). Each of you will contribute the files for ten countries. Use the international telephone prefixes, padded with leading zeroes, as the country codes for your files. Further management details of this subtask will be on Piazza.
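
The file-name and record layout described in item 4 can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not a Spark solution: the two columns are assumed to be separated by whitespace, file names are assumed to carry no extension, out-of-range grades are assumed to be skipped, and the names parse_name, parse_grades, and country_averages are hypothetical helpers introduced here. Keeping a (sum, count) pair per country, rather than a running average, is the same shape of intermediate value one would typically carry through a reduce step.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed file-name layout: CCCC-YYYY-TTTT-LLLLDDDD, with no extension.
NAME_RE = re.compile(r"^(\d{4})-(\d{4})-(\d{4})-([A-Za-z]{4}\d{4})$")


def parse_name(file_name):
    """Split a file name into (country, year, term, course), or return None."""
    match = NAME_RE.match(file_name)
    return match.groups() if match else None


def parse_grades(text):
    """Yield grades from 'student_id grade' lines (assumed whitespace-separated)."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        try:
            grade = float(fields[1])
        except ValueError:
            continue  # skip rows whose grade is not a number
        if 0.0 <= grade <= 5.0:  # assumption: out-of-range grades are dropped
            yield grade


def country_averages(directory):
    """Average grade per country over every well-named file in the directory."""
    totals = defaultdict(lambda: [0.0, 0])  # country -> [sum of grades, count]
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        parsed = parse_name(path.name)
        if parsed is None:
            continue  # ignore files that do not follow the naming scheme
        country = parsed[0]
        for grade in parse_grades(path.read_text()):
            totals[country][0] += grade
            totals[country][1] += 1
    return {c: s / n for c, (s, n) in totals.items() if n > 0}
```

For example, country_averages("D") would return a dictionary from 4-digit country code to mean grade for the directory D of this task. Note that the mean is taken over student records, not over per-file means, which matters because files differ in size.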

4.4 Task (Bonus) 4: (25 points) Hadoop

  1. Do the above task using Hadoop MapReduce.
  2. Compare the ease or difficulty of using Spark versus Hadoop.
  3. Measure the performance.
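
One simple way to approach the measurement in item 3 is to time several complete runs and report both the best and the mean wall-clock time. The following is a minimal sketch, assuming the run can be wrapped in a Python callable; time_runs and summarize are hypothetical helper names, and wall-clock timing is only one of several reasonable metrics.

```python
import time


def time_runs(job, repeats=3):
    """Run job() several times; return the wall-clock seconds of each run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (best, mean) of a non-empty list of durations in seconds."""
    return min(durations), sum(durations) / len(durations)
```

Here job could be any function that launches one full Spark or Hadoop run and waits for it to finish. Repeating the run helps separate start-up and caching effects from steady-state cost.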

4.5 Task (Bonus) 5: (25 points) KMeans

  1. Do the KMeans task of the Hadoop lab in Spark.
  2. Compare the ease or difficulty of using Spark versus Hadoop.
  3. Measure the performance.

4.6 Task (Bonus) 6: (n*5 points) Chameleon

  1. Do try the above tasks on Chameleon, and write a brief section on your experience. 5 points for each task.

4.7 Survey

[TBD This will be replaced by a Google Forms survey. Real Soon Now.]

  1. Your level of interest in this lab exercise (high, average, low).
  2. How challenging is this lab exercise? (high, average, low)
  3. How valuable is this lab as a part of the course? (high, average, low)
  4. Are the supporting materials and lectures helpful for you to finish the project? (very helpful, somewhat helpful, not helpful)
  5. How useful was this lab to your understanding of virtualization?
  6. How many hours (approximately) did you spend on Task 1? 2? 3? 4?
  7. Do you feel confident about applying the skills learned in the lab to solve other problems with Spark? (low, average, high)
  8. Write a paragraph on your experience with Chameleon.

4.8 Turn In

  1. L5Report.pdf should be written as a tech report. Devote one section to each of the above tasks. Use your judgement in what to include in these sections. Your overall goal is to convince any reader of your report that you have understood and carried out the tasks.
  2. ~ceg738000/turnin L5 ReadMe.txt myLabJournal.txt wwgrades.scala L5Report.pdf survey.txt

TBD Grading Sheet

5 References

  1. Prabhaker Mateti, Big Data Lecture Notes, 2015. Required Reading.

Copyright © 2015 pmateti@wright.edu www.wright.edu/~pmateti 2015-11-19