Knowledge Transfer Oriented Data Mining with Focus on the Decision Trees Knowledge Type

Project funded by the National Science Foundation under award IIS-1044634 (EAGER)

August 2010 -- July 2012.

Description

This project studies knowledge transfer oriented data mining (KTDM). Given two data sets, the idea of KTDM is to discover models that are common to both data sets, as well as models that are unique to one of them. These common and unique models provide a tool for leveraging the already-understood properties of one data set to understand the other, probably less understood, data set. This EAGER project concentrates on models in the form of a diversified set of classification trees. The KTDM approach is useful for real-world applications in part because it lets users narrow down to particular models, guided by existing knowledge from another data set. It will help realize the transfer of knowledge and learning in various domains. The project will support a graduate student and will seek collaboration with experts in the medical domain, which will increase the impact of the project.
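As a rough illustration of the common/unique distinction described above, the following sketch sorts candidate models by whether they are accurate on both data sets or on only one. It is a toy under stated assumptions, not the project's method: the models are single-split rules rather than full classification trees, and the names `stump`, `split_common_unique`, and the 0.8 accuracy cutoff `min_acc` are hypothetical choices made for this example.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[Dict[str, float], int]   # (attribute values, class label)
Model = Callable[[Dict[str, float]], int]


def accuracy(model: Model, data: List[Sample]) -> float:
    """Fraction of samples whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def stump(attr: str, threshold: float) -> Model:
    """One-node tree: predict class 1 when attr exceeds threshold."""
    return lambda x: int(x[attr] > threshold)


def split_common_unique(models, data_a, data_b, min_acc=0.8):
    """Label each model as common, unique to A, unique to B, or neither.

    min_acc is a hypothetical cutoff for calling a model "good" on a data set.
    """
    result = {"common": [], "unique_a": [], "unique_b": [], "neither": []}
    for name, model in models.items():
        good_a = accuracy(model, data_a) >= min_acc
        good_b = accuracy(model, data_b) >= min_acc
        if good_a and good_b:
            key = "common"
        elif good_a:
            key = "unique_a"
        elif good_b:
            key = "unique_b"
        else:
            key = "neither"
        result[key].append(name)
    return result


# Toy data: attribute g1 separates the classes in both sets, g2 only in A.
data_a = [({"g1": 0.9, "g2": 0.8}, 1), ({"g1": 0.1, "g2": 0.2}, 0),
          ({"g1": 0.7, "g2": 0.9}, 1), ({"g1": 0.2, "g2": 0.1}, 0)]
data_b = [({"g1": 0.8, "g2": 0.1}, 1), ({"g1": 0.3, "g2": 0.9}, 0),
          ({"g1": 0.6, "g2": 0.2}, 1), ({"g1": 0.1, "g2": 0.7}, 0)]

models = {"g1>0.5": stump("g1", 0.5), "g2>0.5": stump("g2", 0.5)}
groups = split_common_unique(models, data_a, data_b)
print(groups)
```

In this toy setting the rule on `g1` lands in the common group and the rule on `g2` is unique to the first data set, which is the kind of contrast a user could then inspect when transferring knowledge from the better-understood data set.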

Publications

1. Guozhu Dong. Cross Domain Similarity Mining: Research Issues and Potential Applications Including Supporting Research by Analogy. ACM SIGKDD Explorations. June 2012.

Abstract: This paper defines the cross domain similarity mining (CDSM) problem, and motivates CDSM with several potential applications. CDSM has big potential in (1) supporting understanding transfer and (2) supporting research by analogy, since similarity is vital to understanding/meaning and to identifying analogy, and since analogy is a fundamental approach frequently used in hypothesis generation and in research. CDSM also has big potential in (3) advancing learning transfer since cross domain similarities can shed light on how to best adapt classifiers/clusterings across given domains and how to avoid negative transfer. CDSM can also be useful for (4) solving the schema/ontology matching problem. Moreover, this paper gives a list of potential research questions for CDSM, and compares CDSM with related studies. One purpose of this paper is to introduce the CDSM problem to the wide KDD community in order to quickly realize the full potential of CDSM.

2. Qian Han and Guozhu Dong. Using Attribute Behavior Diversity to Build Accurate Decision Tree Committees for Microarray Data. Journal of Bioinformatics and Computational Biology. Volume 10, Issue 4, 2012.

Abstract: DNA microarrays (gene chips), frequently used in biological and medical studies, measure the expressions of thousands of genes per sample. Using microarray data to build accurate classifiers for diseases is an important task. This paper introduces an algorithm, called Committee of Decision Trees by Attribute Behavior Diversity (CABD), to build highly accurate ensembles of decision trees for such data. Since a committee’s accuracy is greatly influenced by the diversity among its member classifiers, CABD uses two new ideas to “optimize” that diversity, namely (1) the concept of attribute behavior based similarity between attributes, and (2) the concept of attribute usage diversity among trees. The ideas are effective for microarray data, since such data have many features and behavior similarity between genes can be high. Experiments on microarray data for six cancers show that CABD outperforms previous ensemble methods significantly and outperforms SVM, and show that the diversified features used by CABD’s decision tree committee can be used to improve performance of other classifiers such as SVM. CABD has potential for other high-dimensional data, and its ideas may apply to ensembles of other classifier types.

3. Guozhu Dong and Qian Han. Mining Accurate Shared Decision Trees from Microarray Gene Expression Data for Different Cancers. The 14th International Conference on Bioinformatics and Computational Biology (BIOCOMP’13). July 22-25, 2013, Las Vegas, USA.

Abstract: This paper studies the problem of mining shared decision trees across multiple application domains, including multiple microarray gene expression datasets for different cancers. Shared knowledge structures capture similarity between application domains and have many useful applications. Given two datasets with classes, we focus on shared decision trees that are highly accurate in both datasets and whose nodes exhibit highly similar distribution of matching data for the classes of the two datasets. Algorithms are presented for mining high quality shared decision trees having high shared accuracy and high data distribution similarity. Experimental results on microarray datasets for medicine are reported to evaluate the algorithms.

4. Guozhu Dong and Qian Han. Mining Diversified Shared Decision Tree Sets for Discovering Cross Domain Similarities. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2014.

Abstract: This paper studies the problem of mining diversified sets of shared decision trees (SDTs). Given two datasets representing two application domains, an SDT is a decision tree that can perform classification on both datasets and it captures class-based population-structure similarity between the two datasets. Previous studies considered mining just one SDT. The present paper considers mining a small diversified set of SDTs having two properties: (1) each SDT in the set has high quality with regard to “shared” accuracy and population-structure similarity and (2) different SDTs in the set are very different from each other. A diversified set of SDTs can serve as a concise representative of the huge space of possible cross-domain similarities, thus offering an effective way for users to examine/select informative SDTs from that huge space. The diversity of an SDT set is measured in terms of the difference of the attribute usage among the SDTs. The paper provides effective algorithms to mine diversified sets of SDTs. Experimental results show that the algorithms are effective and can find diversified sets of high quality SDTs.

5. Supplementary paper. Contains additional information about shared decision trees mined from various pairs of datasets, including three microarray gene expression datasets for cancer and three microarray gene expression datasets for cancer treatment outcome.
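The abstracts of papers 3 and 4 above mention two quantities for shared decision trees: "shared" accuracy over two datasets, and the diversity of a tree set measured through differences in attribute usage. The sketch below shows one plausible way to compute such quantities and should not be read as the papers' definitions. It assumes that shared accuracy is the smaller of the two per-dataset accuracies and that attribute-usage difference is the Jaccard distance between the sets of attributes two trees test. The tuple-based tree encoding and all function names are hypothetical, and the node-level data distribution similarity described in paper 3 is not modeled here.

```python
from itertools import combinations

# A tree is either a class label (an int leaf) or a tuple
# (attribute, threshold, left_subtree, right_subtree).


def predict(tree, x):
    """Walk from the root to a leaf; go right when the attribute exceeds the threshold."""
    while not isinstance(tree, int):
        attr, threshold, left, right = tree
        tree = right if x[attr] > threshold else left
    return tree


def accuracy(tree, data):
    """Fraction of (attributes, label) samples the tree classifies correctly."""
    return sum(predict(tree, x) == y for x, y in data) / len(data)


def shared_accuracy(tree, data_a, data_b):
    """Assumption: score a tree by its weaker accuracy over the two datasets."""
    return min(accuracy(tree, data_a), accuracy(tree, data_b))


def attributes_used(tree):
    """Set of attributes tested anywhere in the tree."""
    if isinstance(tree, int):
        return set()
    attr, _, left, right = tree
    return {attr} | attributes_used(left) | attributes_used(right)


def usage_distance(tree_1, tree_2):
    """Assumption: Jaccard distance between the attribute sets of two trees."""
    a, b = attributes_used(tree_1), attributes_used(tree_2)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def set_diversity(trees):
    """Mean pairwise attribute-usage distance over a set of trees."""
    pairs = list(combinations(trees, 2))
    if not pairs:
        return 0.0
    return sum(usage_distance(s, t) for s, t in pairs) / len(pairs)


# Toy trees over hypothetical gene attributes g1, g2, g3.
t1 = ("g1", 0.5, 0, 1)
t2 = ("g1", 0.5, 0, ("g2", 0.5, 0, 1))
t3 = ("g3", 0.5, 0, 1)

data_a = [({"g1": 0.9, "g2": 0.8, "g3": 0.7}, 1),
          ({"g1": 0.1, "g2": 0.2, "g3": 0.6}, 0)]
data_b = [({"g1": 0.8, "g2": 0.1, "g3": 0.9}, 1),
          ({"g1": 0.2, "g2": 0.9, "g3": 0.1}, 0)]

for name, tree in [("t1", t1), ("t2", t2), ("t3", t3)]:
    print(name, shared_accuracy(tree, data_a, data_b), sorted(attributes_used(tree)))
print("diversity:", set_diversity([t1, t2, t3]))
```

Taking the minimum accuracy penalizes a tree that fits only one dataset, and a set-based distance ignores how often or how deep an attribute is used. Both are simplifications chosen to keep the example short.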

Page maintained by Guozhu Dong.