Cloud Computing Labs: Relational Data Analysis with Pig and Spark

The primary purpose of this assignment is to become familiar with Pig and Spark programming for analyzing relational data. The project consists of two parts: the first part works through the tutorials; the second part applies Pig Latin and Spark to answer a few queries on a given simple dataset.

Part 1: Getting Familiar with the Basics

1.1 Pig and Pig Latin

After you download, unzip, and copy the Pig binary to a directory, you need to check the following items.

1.2 Spark and Spark SQL

After you download, unzip, and copy the Spark binary to a directory, you need to do the following tasks.

Now, answer the following questions:

Question 1.1 Assume the input file is a text document. Please explain what the following piece of code does, line by line. What does this piece of code actually do? (hint: if you are not familiar with the syntax, try the code in the Pig shell and dump the intermediate results)

A = load './input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as token;
C = group B by token;
D = foreach C generate group, COUNT(B);
store D into './output';

Question 1.2 Convert the code in 1.1 to Spark (python), and paste the code here.

Question 1.3 Assume the input file is a text file. Please explain the following piece of PySpark code, line by line. What does this piece of code actually do?

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
A = lines.flatMap(lambda x: x.split(' '))
B = A.map(lambda x: (int(x), 1))
C = B.sortByKey()

Question 1.4 How much time did you spend on tasks 1.1-1.3? (in hours)

Question 1.5 How useful are these tasks to your understanding of the Pig Latin language and Spark APIs? (very useful, somewhat useful, not useful)


Part 2: Analyzing Book Sales Records

In this task, you will use Pig and Spark SQL to solve analytical problems over a set of linked tables. There are three data files with the following schema: customer(cid, name, age, city, sex), book(isbn, name), and purchase(year, cid, isbn, seller, price), where purchase.cid is a foreign key to customer.cid and purchase.isbn is a foreign key to book.isbn. The fields in the dataset are separated by "\t". You should assign the schema, i.e., field names and types, to the fields when you load the data.

Pig's local mode is recommended for better performance:

pig -x local [script_file]

Use Spark's Python API (PySpark) for the Spark SQL question.

Your code for the following questions should be well commented.

Question 2.1 How much did each seller earn? Paste the Pig code and the query answer here.

Question 2.2 Find the names of the books for which Amazon gives the lowest price among all sellers. Paste the Pig code and the query answer here.

Question 2.3 Who also bought ALL the books that Harry bought? Paste the Pig and Spark SQL code, and the query answer here.

Question 2.4 How much time did you spend on tasks 2.1-2.3? (hours)

Question 2.5 How useful are these tasks to your understanding of Pig and Spark programming? (very useful, somewhat useful, not useful)

Final Survey Questions

Question 3.1 What is your level of interest in this lab exercise? (high, average, low)

Question 3.2 How challenging is this lab exercise? (high, average, low)

Question 3.3 How valuable is this lab as a part of the course? (high, average, low)

Question 3.4 Are the supporting materials and lectures helpful for you to finish the project? (very helpful, somewhat helpful, not helpful)

Question 3.5 How much time in total did you spend completing the lab exercise? (hours)

Question 3.6 Do you feel confident applying the skills learned in this lab to solve other problems?

Deliverables

Turn in a PDF report that answers all the questions to Pilot, and also bring a hardcopy of the report to the Monday class.


This page, first created: Oct. 11, 2016