Under the assumption of untrusted cloud providers, outsourcing data intensive computation such as database services and data mining to public clouds raises privacy concerns. The primary purpose of this assignment is to understand the data privacy problem with outsourced computation in public clouds and implement several methods to mitigate this problem. This project consists of two parts. (1) Multiplication perturbation methods for outsourced data mining; and (2) Crypto-index for outsourced database services.
The idea of data perturbation is to change the original values of a dataset that is prepared for outsourced data mining, so that the untrusted cloud provider cannot figure out any original value while working on the perturbed data. The basic principle is to preserve both data privacy and data utility - the data owner can still obtain high quality data mining models with the perturbed data. In this experiment, we will be looking at one convenient method - geometric data perturbation (GDP). GDP is designed to preserve the utility for Euclidean-distance based data mining models such as kNN classifiers, linear classifiers, Support Vector Machine (SVM) classifiers (with certain kernels), and clustering with Euclidean distance as the similarity measure. In this project, we consider only classification modeling.
Let's define GDP first. The data for mining is often a table with k columns and n rows, denated as a k x n matrix. A column is also called a dimension. Let x be any record of the table, which is represented as a k-element column vector. x should be normalized. You can use the standardizaiton method x_{i} = (x_{i}-m)/s, where x_{i} is the i-th dimension of the vector x, m is the mean of the i-th dimension, and s is the standard deviation of the i-th dimension. In some rare cases that a dimension contains only one identical value, you can set all values of that dimension to 0.
Let R be a secret random orthogonal k by k matrix (so that R times R's transpose is the identity matrix I). Note that R can be generated with simple methods like QR decomposition or eigendecomposition of random k by k matrices, which can be found in some Python or Java libraries. Let t be a secret random k-element column vector, which is drawn from some domain close to the normalized x, e.g., the normal distribution N(0,1). Both R and t should be shared by all records. Let d be a randomly generated k-element noise vector, individually for each record, drawn from a normal distribution N(0, σ^{2}), where the variance σ^{2} is determined by the user. Let y be the perturbed vector for the original vector x. The geometric data perturbation is defined as
y=Rx+t+d
Classification modeling is to find the best function that can predict the label of each record, by learning from a set of given examples in the form (record, label) - we call it training data. The given training datasets are in the SVM format. Each line of the data file represents a record, which looks like
label index:value index:value ...
When you apply GDP to protect classification modeling, you perturb only the records but keep the label unchanged. As a result, each row the outsourced training data looks like (perturbed record, label). In this way, the cloud side can still work on the perturbed records to train a classification model, which should have close accuracy as the one trained with the original data (the accuracy is dependent on the setting of sigma). However, the cloud provider cannot observe the original record, neither can they figure out the original classification model, as long as the perturbation parameters R and t are not disclosed (under certain security assumptions).
You will need to implement the GDP perturbation method and conduct some experiments to understand the important features. Specifically, you need to finish the following tasks.gdp sigma_setting original_data_file perturbed_data_file
./knnc training_data ./adaboost -o training_data ./svm-train -v 5 training_data
Answer the questions:
Question 1.1 Implement the GDP algorithm and paste your GDP implementation code here.
Question 1.2 Conduct the accuracy evaluation on the three datasets and report your experimental results with the following table for each dataset. Each cell represents the accuracy and standard deviation (e.g., in the form 0.82+/-0.02) for the corresponding classification method and the sigma setting. Note that SVM output does not have standard deviation, which is fine. Intepret the table and write down your observations and conclusions.
methods | no perturbation | Sigma=0.05 | Sigma=0.1 | Sigma=0.15 | Sigma=0.2 | Sigma=0.25 |
kNN | ||||||
SVM | ||||||
AdaBoost |
Question 1.3 The provided classificaiton programs have hidden the details of "cross-validation". The default setting is set to five folds. Find the definition of cross-validation from the Web. Describe the procedure of cross-validation with your own words, and answer why we need cross-validation.
Question 1.4 Under what assumptions the security of the GDP perturbation method is guaranteed? What are the potential attacks? - describe a few that you can think of.
Question 1.5 How much time did you spend on Question 1.1 and 1.2, respectively?
Question 1.6 How useful is this task to your understanding of the problem of privacy preserving outsourced mining in the cloud?
Database management is resource consuming. An appealing solution is to export database services to the cloud, which raises privacy concerns, too. Among the existing approaches, crypto-index is a simple solution (not so secure) that transforms a domain to a set of random IDs. It is easy to process queries on transformed data. However, the result may have low precision. In this task, you will implement a simple crypto-index method and experiment wtih single-dimensional range query on transformed data.
For simplicity, let's work on the continuous data. Assume one dimensional of the data, for example, the three datasets you have seen, is in the domain [min, max]. The crypto-index method partitions the domain into a few bins and use a random ID to represent each bin. Correspondingly, when you process a range query, say finding records with dimension i in the range [u, v], you will transform the query to the transformed data domain as "finding records with dimension i in the set of bin IDs [id1, id2,...]".
Let's assume the input data is in the SVM format. The following transformation program will transform one dimension of the data. Its incomplete python implementation is given as a reference, which uses the previously given readsvm.py. (You should revise test the code before using it, and you can implement your own Java version as well if you prefer Java).
crypto-index-transdata input_data dimenion_i bin_width bin_encoding_file output_data
start_1 end_1 random_ID1 start_2 end_2 random_ID2 ...
Question 2.1 Develop a query evaluation program, the usage of which is
crypto-index-query original_data transformed_data bin_encoding_file dimenion_i start end
Question 2.2 Develop a simple script that processes a number n, e.g., n=10, of randomly generated ranges for your queries. Note these ranges should be within the domain of dimension_i. Compute the average query result precision of the n ranges, for different bin_width settings: 0.02, 0.04, 0.06, 0.08, 0.1. You should try your script on the three datasets. Report the result precision and standard deviation (e.g., 0.82+/-0.05) in the following table, and also post your script here.
bin_width | 0.02 | 0.04 | 0.06 | 0.08 | 0.1 |
breast_cancer | |||||
diabetes | |||||
heart |
Question 2.3 Discuss the potential attacks to the crypto-index method that you can come up with.
Question 2.4 How much time did you spend on each of the tasks 2.1-2.3, respectively?
Question 3.1 Your level of interest in this lab exercise (high, average, low);
Question 3.2 How challenging is this lab exercise? (high, average, low);
Question 3.3 How valuable is this lab as a part of the course (high, average, low).
Question 3.4 Are the supporting materials and lectures helpful for you to finish the project? (very helpful, somewhat helpful, not helpful);
Question 3.5 How much time in total did you spend in completing the lab exercise;
Turn in the PDF report that answers all the questions, including the code for some questions.
This page, first created: 11/20/2014; last updated: 11/20/2014 |