Cloud Computing Labs: Relational Data Analysis with Pig and Spark

The primary purpose of this assignment is to get familiar with Pig and Spark programming for analyzing relational data. The project consists of two parts: in the first part, you will work through the tutorials; in the second part, you will apply Pig Latin and Spark to answer a few queries on a simple dataset.

Part 1: Getting Familiar with the Basics

1.1 Pig and Pig Latin

After you download and unzip the Pig binary distribution and copy it to a directory, verify that the installation works, for example by confirming that pig -x local starts the Grunt shell.

1.2 Spark and Spark SQL

After you download and unzip the Spark binary distribution and copy it to a directory, verify that the installation works, for example by confirming that the pyspark shell starts. A minimal smoke test is sketched below.
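The sketch below builds a local Spark session and counts the lines of a text file; the file name ./input.txt is only a placeholder, so point it at any small text file you have. If the script prints a number, the installation is working:

from pyspark.sql import SparkSession

# Build a Spark session that runs locally, using all available cores.
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
sc = spark.sparkContext

# Count the lines of a small text file to confirm that jobs run end to end.
# "./input.txt" is a placeholder, not part of the assignment data.
print(sc.textFile("./input.txt").count())

spark.stop()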

Now, answer the following questions:

Question 1.1 Assume the input file is a text document. Please try to understand what the following piece of code does and comment the code line by line. What does this piece of code actually do? (Variable names are anonymized to A, B, C, etc. to remove side information. In practice, you should not use this kind of naming style.)

A = load './input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as token;
C = group B by token;
D = foreach C generate group, COUNT(B);
store D into './output';

Question 1.2 Assume the input file is a text file containing only numbers separated by whitespace, and that sc is an already-created SparkContext. Please try to understand the following piece of PySpark code and comment the code line by line. What does this piece of code actually do?

import sys

lines = sc.textFile(sys.argv[1])
A = lines.flatMap(lambda x: x.split(' '))
B = A.map(lambda x: (int(x), 1))
C = B.sortByKey()

Part 2: Analyzing Book Sales Records

In this task, you will use Pig and Spark to solve analytic problems on a set of linked tables. You are given three data files with the following schema: customer(cid, name, age, city, sex), book(isbn, name), and purchase(year, cid, isbn, seller, price), where purchase.cid is a foreign key to customer.cid and purchase.isbn is a foreign key to book.isbn. The fields in the data files are separated by "\t" (tab). You should assign a schema, i.e., field names and types, to the fields when you load the data; a PySpark loading sketch is given below.
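For instance, here is a minimal, hedged PySpark sketch for loading the purchase table with an explicit schema. The file path ./purchase.txt and the column types are assumptions; adjust them to match your actual data files. Loading the customer and book tables follows the same pattern, and in Pig you would similarly attach the schema with an AS clause in the LOAD statement.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

spark = SparkSession.builder.master("local[*]").appName("bookstore").getOrCreate()
sc = spark.sparkContext

# Explicit schema for purchase(year, cid, isbn, seller, price).
# The column types (e.g., isbn as a string, price as a double) are assumptions.
purchase_schema = StructType([
    StructField("year", IntegerType()),
    StructField("cid", IntegerType()),
    StructField("isbn", StringType()),
    StructField("seller", StringType()),
    StructField("price", DoubleType()),
])

# The fields are tab-separated, so set the delimiter accordingly.
# "./purchase.txt" is a placeholder for the actual data file.
purchase = spark.read.csv("./purchase.txt", sep="\t", schema=purchase_schema)
purchase.show(5)

# For the RDD-only questions, you can instead parse the file directly:
purchase_rdd = sc.textFile("./purchase.txt").map(lambda line: line.split("\t"))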

Pig's local mode is recommended for better performance on this small dataset:

pig -x local
and Spark's local mode (the master("local[*]") setting in the sketch above) is likewise recommended for easier setup. You can use PySpark or Scala Spark to answer the questions.

The code you submit for the following questions should be well commented.

Question 2.1 How much did each seller earn? Develop both the Pig and Spark RDD versions to solve this query ("Spark RDD" means using only RDD transformations and actions, not Spark SQL); a toy illustration of this style follows below.
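To be clear about what the RDD style looks like, here is a hedged, toy sketch that is not a solution to this question; the data values are invented, and sc is assumed to be an existing SparkContext:

# Toy key/value data, unrelated to the sales dataset.
pairs = sc.parallelize([("a", 1.0), ("b", 2.0), ("a", 3.0)])

# reduceByKey is a transformation: it lazily describes a per-key sum.
totals = pairs.reduceByKey(lambda x, y: x + y)

# collect is an action: it triggers execution and returns the results,
# e.g., [('a', 4.0), ('b', 2.0)] (the order may vary).
print(totals.collect())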

Question 2.2 Find the names of the books for which Amazon gives the lowest price among all sellers. Develop both the Pig and Spark RDD versions to solve this query.

Question 2.3 Assume that customers with the same last name are in the same family. Find the family that spent the most money on books. Develop both the Spark Dataset and DataFrame (Spark SQL) versions to solve this query; a toy illustration of the DataFrame and SQL styles follows below.
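As a hedged, toy illustration of the two Spark SQL styles (not a solution; the column names and values are invented, and spark is assumed to be an existing SparkSession). Note that the typed Dataset API is available only in Scala/Java, so the Dataset versions of Questions 2.3 and 2.4 require Scala Spark.

from pyspark.sql import functions as F

# Toy data, unrelated to the sales dataset.
df = spark.createDataFrame([("x", 1.0), ("y", 2.0), ("x", 3.0)], ["key", "amount"])

# DataFrame API style: express the aggregation with method calls.
df.groupBy("key").agg(F.sum("amount").alias("total")).show()

# SQL style: register a temporary view and run a SQL string against it.
df.createOrReplaceTempView("toy")
spark.sql("SELECT key, SUM(amount) AS total FROM toy GROUP BY key").show()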

Question 2.4 Which customers also bought ALL the books that Harry bought? Develop both the Spark Dataset and DataFrame versions to solve this query.

Question 2.5 How much time did you spend on Tasks 2.1-2.4 (in hours)?

Question 2.6 How useful are these tasks to your understanding of Pig and Spark programming? (very useful, somewhat useful, not useful)

Final Survey Questions

Question 3.1 Your level of interest in this lab exercise (high, average, low).

Question 3.2 How challenging is this lab exercise? (high, average, low)

Question 3.3 How valuable is this lab as a part of the course? (high, average, low)

Question 3.4 Are the supporting materials and lectures helpful for you to finish the project? (very helpful, somewhat helpful, not helpful)

Question 3.5 How much time in total did you spend completing the lab exercise?

Question 3.6 Do you feel confident about applying the skills learned in this lab to solve other problems?

Deliverables

Turn in two PDF files: (1) your answers to Questions 1.1, 1.2, and 2.1-2.4, and (2) your answers to the survey questions (2.5, 2.6, and 3.1-3.6).


This page was first created on Oct. 11, 2016 and last modified in Oct. 2018.