Cloud Computing Labs: Scalable Cloud Data Serving with Cassandra

The primary purpose of this assignment is to get familiar with one of the popular cloud data serving systems, Apache Cassandra. Students will learn to use Cassandra, including the commandline environment and basic programming skills, and to set up a Cassandra cluster on Amazon EC2.

Part 1: Working with the Cassandra environment.

This part has two goals. (1) We will start with a simple single-node installation and get familiar with the commandline environment. (2) We will learn to use the Python and Java interfaces for programming with Cassandra.

Setup. You will use your own Linux box for this task, which can be set up with VirtualBox (virtualbox.org) or Hyper-V on Windows machines. Let's work with the latest version of Ubuntu. You will need to set up Oracle Java before installing Cassandra.

Download the Cassandra code from cassandra.apache.org. Unzip it to the directory ~/cassandra.

mkdir -p ~/temp && cd ~/temp
wget http://apache.mirrors.ionfish.org/cassandra/2.1.7/apache-cassandra-2.1.7-bin.tar.gz
tar -xvzf apache-cassandra-2.1.7-bin.tar.gz
mv apache-cassandra-2.1.7 ~/cassandra

Create Cassandra's working directories

sudo mkdir /var/lib/cassandra
sudo mkdir /var/log/cassandra
sudo chown -R "$USER":"$(id -gn)" /var/lib/cassandra
sudo chown -R "$USER":"$(id -gn)" /var/log/cassandra

Add the following lines to ~/.profile, then run "source ~/.profile" to update the environment variables.

export CASSANDRA_HOME=~/cassandra
export PATH=$PATH:$CASSANDRA_HOME/bin

Start the Cassandra service

~/cassandra/bin/cassandra
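
Before moving on to cqlsh, it can help to confirm that the service is actually accepting connections. The following is a minimal Python sketch, not part of the original instructions. It assumes the node listens for CQL clients on 127.0.0.1 port 9042 (the usual native transport default; check cassandra.yaml if your setup differs), and the function name is_port_open is hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: 9042 is the CQL native transport port on this node.
    print("Cassandra reachable:", is_port_open("127.0.0.1", 9042))
```

Cassandra may take a little while to start, so a False result right after launching the service does not necessarily indicate a failure.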

Now, answer the following questions:

Question 1.1 Run the command

~/cassandra/bin/cqlsh
Do you see the message saying that you are connected to the cluster "Test Cluster"? (Yes or No) Also run the command
~/cassandra/bin/nodetool status
and copy the output to the report.

Question 1.2 How much time did you spend on this task?

Question 1.3 How difficult did you find this task? (very easy, appropriate, too difficult)

Commandline Environment and Data Modeling. We will use examples to learn how to work with the commandline environment and Cassandra data models. Cassandra is a key-value store: you can imagine your data stored as one big table, where each row is indexed and looked up by a key. This simple data model allows Cassandra to scale nearly linearly. You should check the lecture slides for more discussion of key-value data stores. Although Cassandra provides a SQL-like access method, it does not support join operations because of its fundamentally limited data model. In practice, if you have several relations modeled with the relational model, you have to denormalize them into ONE big table for easier processing. Otherwise, joins between tables have to be handled in your programs, which can be inefficient and complex. The disadvantages of denormalization include redundant data and update anomalies (check a database textbook for details on normalization and denormalization). These are, however, not a problem for data analytics, because rows are read-only once they have been inserted into the table.

Let's try a small example to show the method of denormalization in Cassandra data modeling. Assume we have three tables: Student(sid, sname), Course(cid, cname), and Register(sid, cid, semester, session), where sid is the primary key of the Student table, cid is the primary key of the Course table, and the four attributes of Register together serve as its key. Now we want to denormalize the tables into a single new table, RegisterAll, which simply joins the three tables on their keys. The result is RegisterAll(sid, cid, semester, session, sname, cname).
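
The denormalization described above can be sketched in a few lines of Python. This is only an illustration: the second student (w000yyy, Jane Roe) is made-up sample data, and the function name denormalize is hypothetical.

```python
def denormalize(students, courses, registrations):
    """Join Register with Student and Course into RegisterAll rows.

    students:      iterable of (sid, sname)
    courses:       iterable of (cid, cname)
    registrations: iterable of (sid, cid, semester, session)
    """
    sname_by_sid = {sid: sname for sid, sname in students}
    cname_by_cid = {cid: cname for cid, cname in courses}
    return [
        (sid, cid, semester, session, sname_by_sid[sid], cname_by_cid[cid])
        for sid, cid, semester, session in registrations
    ]


# Sample data; the second student is hypothetical.
students = [("w000xxx", "John Doe"), ("w000yyy", "Jane Roe")]
courses = [("CEG4360", "Distributed Systems and Cloud Computing")]
registrations = [
    ("w000xxx", "CEG4360", "Fall 2015", 1),
    ("w000yyy", "CEG4360", "Fall 2015", 1),
]

for row in denormalize(students, courses, registrations):
    print(row)
```

Note that the course name is repeated in every output row. This is the redundancy that denormalization accepts in exchange for reads that need no join.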

Now let's create this table in Cassandra. cqlsh is the commandline interface for accessing the Cassandra system. As you saw in the previous exercise, running cqlsh opens the commandline interface. First, you need to create a keyspace, a concept similar to "the database" in most database systems. Enter the following command

CREATE KEYSPACE studentdb WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 }; 
which tells the system which replication strategy to use for the keyspace and sets the replication factor to 2: each record gets two replicas distributed across the Cassandra cluster for fault tolerance. Next, run the command
use studentdb;
You will see that "studentdb" appears in the prompt.
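
To make the replication factor concrete, the following minimal Python sketch models how a SimpleStrategy-style placement might choose replicas on a token ring: the first replica is the node owning the key's token, and the remaining replicas are the next nodes clockwise. It is a simplified model under stated assumptions, not Cassandra's implementation. The node names, the small token values, and the function replicas_for are hypothetical.

```python
from bisect import bisect_left


def replicas_for(token, ring, replication_factor):
    """Pick replica nodes for a key token on a ring of (node_token, node_name).

    The first replica is the node with the smallest token >= the key's token
    (wrapping around the ring); further replicas are the next nodes clockwise.
    """
    ring = sorted(ring)
    tokens = [node_token for node_token, _ in ring]
    start = bisect_left(tokens, token) % len(ring)
    # A ring cannot hold more replicas than it has nodes.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical three-node ring with small tokens for readability.
ring = [(0, "node0"), (100, "node1"), (200, "node2")]
print(replicas_for(150, ring, 2))  # ['node2', 'node0']
```

In this model, a single-node ring yields only one replica even when the replication factor is 2, because there is no second node to hold the copy.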

You may create multiple tables in the same keyspace. However, as noted above, joins between related tables are not supported, so your design of multiple tables in the same keyspace should not rely on table joins. Now, let's create the RegisterAll table by entering:

CREATE TABLE RegisterAll(
sid varchar,
cid varchar,
semester varchar,
session int,
sname varchar,
cname varchar,
PRIMARY KEY(sid, cid, semester, session)
);
"varchar" is one of the Cassandra data types: a variable-length string. The complete list of Cassandra data types can be found in the CQL documentation. The keyword "PRIMARY KEY" defines the key of the table. Next, insert one record:
INSERT INTO RegisterAll (sid, cid, semester, session, sname, cname)
  VALUES ('w000xxx', 'CEG4360','Fall 2015', 1,
  'John Doe', 'Distributed Systems and Cloud Computing');

Now you can use a standard SQL-style SELECT statement to retrieve data. Note that if you want to search by conditions on non-key attributes, you need to create an index on that attribute. For example, the following search will not be allowed

SELECT * FROM RegisterAll
WHERE sname = 'John Doe';
unless you first create an index on "sname":
CREATE INDEX ON RegisterAll(sname);

The CQL syntax is very similar to standard SQL. For details, please check the CQL documentation.

Extended reading: Cassandra can be accessed from different programming languages such as Java, C++, and Python via the Cassandra drivers. Interested students can check the driver documentation for examples in different languages.
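
As one illustration of driver access, the sketch below uses the DataStax Python driver (assumed to be installed with "pip install cassandra-driver") against the studentdb keyspace and RegisterAll table created above. The helper names connect, add_registration, courses_of, and demo are hypothetical, and the example assumes a single node reachable at 127.0.0.1.

```python
INSERT_CQL = (
    "INSERT INTO RegisterAll (sid, cid, semester, session, sname, cname) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)
SELECT_CQL = "SELECT cid, cname FROM RegisterAll WHERE sid = %s"


def connect(host="127.0.0.1", keyspace="studentdb"):
    """Open a session on the given keyspace (needs the cassandra-driver package)."""
    # Imported lazily so the helpers below can be loaded without the driver.
    from cassandra.cluster import Cluster
    return Cluster([host]).connect(keyspace)


def add_registration(session, row):
    """Insert one (sid, cid, semester, session, sname, cname) tuple."""
    session.execute(INSERT_CQL, row)


def courses_of(session, sid):
    """Return (cid, cname) pairs for one student, looked up by the key sid."""
    return [(r.cid, r.cname) for r in session.execute(SELECT_CQL, (sid,))]


def demo():
    """Insert the sample record from the text and read it back."""
    session = connect()
    add_registration(
        session,
        ("w000xxx", "CEG4360", "Fall 2015", 1,
         "John Doe", "Distributed Systems and Cloud Computing"),
    )
    print(courses_of(session, "w000xxx"))
```

Calling demo() requires a running Cassandra node and the existing keyspace and table. The lookup filters on sid, the first component of the primary key, so it needs no secondary index.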


Now, answer the following questions:

Question 1.4 We have a relational database schema: Customer(cid, cname, age, city, gender), Book(isbn, bname), and Purchase(pid, cid, isbn, seller, price), where Purchase.cid is a foreign key referencing Customer.cid and Purchase.isbn is a foreign key referencing Book.isbn; cid, isbn, and pid are the primary keys of the three tables, respectively. Design a database "bookdb" for Cassandra that has one table, PurchaseAll, which is the denormalization of the three tables. Give the CQL commands to create the database and the table.

Question 1.5 Consider the content of the three tables Customer, Book, and Purchase, and use INSERT commands to insert records into PurchaseAll so that all the information in the three tables is preserved in the new database. Write down the INSERT commands in the report.

Question 1.6 Write the CQL commands for finding the books that are purchased by "Ellen Smith". You should test your commands to make sure they are executable and the result is correct. Write down the commands in the report.

Question 1.7 How much time did you spend on this task?

Question 1.8 How difficult did you find this task? (very easy, appropriate, too difficult)

Part 2: Setting up a Cassandra Cluster

In this task, you will learn to set up a Cassandra cluster with Amazon EC2 nodes. Note that this part assumes that students have done the AWS lab, so they have no difficulty setting up the AWS environment and starting multiple EC2 nodes. (This part is independent of Part 1. Students without knowledge of EC2 may skip it.)

You will need to start 3 virtual instances on EC2. The type of instance does not matter. Install the Oracle Java and Cassandra packages on each instance according to the instructions in Part 1. We assume Cassandra is again installed in the directory ~/cassandra. The key step is to configure each node so that the nodes connect to form a cluster. This can be done by editing the configuration file ~/cassandra/conf/cassandra.yaml on each node. You will need to change the following lines:

cluster_name: 'Name'
initial_token: Token
seed_provider:
    - seeds:  "Seed IP"
listen_address: Droplet's IP
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch
where "Seed IP" is the IP of the cluster's single seed node (the node that other nodes contact when joining; Cassandra has no master node), "Droplet's IP" is the specific node's own IP, and Token is set by a token space partitioning algorithm, illustrated in the following example.

Specifically, if you use node 0 as the seed node, you can set up the three nodes using the following settings.

Node 0
---------------
cluster_name: 'MySmallCluster'
initial_token: 0
seed_provider:
    - seeds:  "your_node0_instance_IP"
listen_address: your_node0_instance_IP
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

Node 1
---------------
cluster_name: 'MySmallCluster'
initial_token: 3074457345618258602
seed_provider:
    - seeds:  "your_node0_instance_IP"
listen_address: your_node1_instance_IP
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

Node 2
---------------
cluster_name: 'MySmallCluster'
initial_token: 6148914691236517205
seed_provider:
    - seeds:  "your_node0_instance_IP"
listen_address: your_node2_instance_IP
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch
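
The three initial_token values above are consistent with spacing the nodes at i * 2^63 / 3. The source does not say which algorithm produced them, so the first function below is just one formula that reproduces them. The second function is a commonly used alternative that spreads tokens over the full signed 64-bit range, assuming the Murmur3Partitioner is in use. Both function names are hypothetical.

```python
def handout_tokens(node_count):
    """Tokens spaced at i * 2**63 // node_count, matching the values above."""
    return [i * 2**63 // node_count for i in range(node_count)]


def full_range_tokens(node_count):
    """Tokens spread evenly over [-2**63, 2**63 - 1].

    Assumption: the cluster uses the Murmur3Partitioner, whose tokens
    are signed 64-bit integers.
    """
    return [i * 2**64 // node_count - 2**63 for i in range(node_count)]


print(handout_tokens(3))
# [0, 3074457345618258602, 6148914691236517205]
print(full_range_tokens(3))
# [-9223372036854775808, -3074457345618258603, 3074457345618258602]
```

Whichever scheme is used, each node must get a distinct token, and every node in the cluster must be configured with the same partitioner.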
After revising the configuration file on each node, start Cassandra on the seed node first, and then on each of the other nodes, by running
sudo ~/cassandra/bin/cassandra
Check whether the cluster works normally by running the following command on the seed node
~/cassandra/bin/nodetool status

After you finish these experiments, answer the following questions

Question 2.1 Repeat Questions 1.4 and 1.5 on the cluster. Then, check the location of each key (the pid attribute) by running the following command on the seed node

~/cassandra/bin/nodetool getendpoints bookdb PurchaseAll <key>
for each key in the table, and copy the output to the report.

Question 2.2 Remove one of the nodes, say node1, from the cluster by using

 ~/cassandra/bin/nodetool -h node1_IP decommission
and run the previous "nodetool getendpoints" command to check each key in the table. Did you see any change? Copy the output to the report.

Question 2.3 How much time did you spend on this task?

Question 2.4 How difficult did you find this task? (easy, appropriate, too difficult)

Final Survey Questions

After you finish all the tasks, answer the following questions.

Question 3.1 What is your level of interest in this lab exercise? (high, average, low)

Question 3.2 How challenging is this lab exercise? (high, average, low)

Question 3.3 How valuable is this lab as a part of the course? (high, average, low)

Question 3.4 Are the supporting materials and lectures helpful for you to finish the project? (very helpful, somewhat helpful, not helpful)

Question 3.5 How much time in total did you spend completing the lab exercise?

Deliverables

Turn in your answers as a single uncompressed PDF file to the Pilot project submission.

Make sure that you have terminated all EC2 instances after finishing your work! This can be easily done with the AWS web console.


This page, first created: Jun 2015; last updated: Jun 2015