Project funded by the National Science Foundation under award IIS-1044634 (EAGER), August 2010 -- July 2012.
This project studies knowledge transfer oriented data mining (KTDM). Given two data sets, the idea of KTDM is to discover models that are common to both data sets, as well as models that are unique to one of them. These common and unique models provide a tool for leveraging the already-understood properties of one data set to understand the other, probably less understood, data set. This EAGER project concentrates on models in the form of a diversified set of classification trees. The KTDM approach is useful for real-world applications in part because it lets users narrow down to particular models, guided by existing knowledge from another data set. It will help realize transfer of knowledge and learning in various domains. The project will support a graduate student and will seek collaboration with experts in the medical domain, which will increase its impact.
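The distinction between common and unique models can be illustrated with a small sketch. It is not the project's algorithm: the accuracy threshold, the one-split trees, the attribute names `g1` and `g2`, and the helper names `accuracy`, `categorize`, and `stump` are all assumptions made for illustration. A model is called common if it classifies both data sets well, and unique to one data set if it classifies only that one well.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Example = Tuple[Row, str]          # (attribute values, class label)
Model = Callable[[Row], str]


def accuracy(model: Model, data: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    if not data:
        return 0.0
    correct = sum(1 for row, label in data if model(row) == label)
    return correct / len(data)


def categorize(model: Model, data_a: List[Example], data_b: List[Example],
               threshold: float = 0.8) -> str:
    """Label a model as common, unique to one data set, or neither.

    The threshold is an illustrative assumption, not a value from the project.
    """
    ok_a = accuracy(model, data_a) >= threshold
    ok_b = accuracy(model, data_b) >= threshold
    if ok_a and ok_b:
        return "common"
    if ok_a:
        return "unique to A"
    if ok_b:
        return "unique to B"
    return "neither"


def stump(attribute: str, cut: float, low: str, high: str) -> Model:
    """A one-split classification tree, used here only as a toy model."""
    def predict(row: Row) -> str:
        return low if row[attribute] <= cut else high
    return predict


# Two tiny hypothetical data sets over the same attributes.
data_a = [({"g1": 0.1, "g2": 0.9}, "neg"), ({"g1": 0.2, "g2": 0.1}, "neg"),
          ({"g1": 0.8, "g2": 0.8}, "pos"), ({"g1": 0.9, "g2": 0.2}, "pos")]
data_b = [({"g1": 0.3, "g2": 0.2}, "neg"), ({"g1": 0.1, "g2": 0.3}, "neg"),
          ({"g1": 0.7, "g2": 0.9}, "pos"), ({"g1": 0.6, "g2": 0.7}, "pos")]

tree_g1 = stump("g1", 0.5, "neg", "pos")
tree_g2 = stump("g2", 0.5, "neg", "pos")

print(categorize(tree_g1, data_a, data_b))  # common
print(categorize(tree_g2, data_a, data_b))  # unique to B
```

In this toy setting the tree on `g1` separates the classes in both data sets, while the tree on `g2` works only in the second one. A user who already understands the first data set could therefore start from the common tree when examining the second.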
Publications
1. Guozhu Dong. Cross Domain Similarity Mining: Research Issues and Potential Applications Including Supporting Research by Analogy. ACM SIGKDD Explorations. June 2012.
ABSTRACT: This paper defines the cross
domain similarity mining (CDSM) problem, and motivates CDSM with several
potential applications. CDSM has big potential in (1) supporting understanding transfer
and (2) supporting research by analogy, since similarity is vital to
understanding/meaning and to identifying analogy, and since analogy is a
fundamental approach frequently used in hypothesis generation and in research.
CDSM also has big potential in (3) advancing learning transfer since cross
domain similarities can shed light on how to best adapt classifiers/clusterings across given domains and how to avoid negative
transfer. CDSM can also be useful for (4) solving the schema/ontology matching problem.
Moreover, this paper gives a list of potential research questions for CDSM, and
compares CDSM with related studies. One purpose of this paper is to introduce
the CDSM problem to the wide KDD community in order to quickly realize the full
potential of CDSM.
2. Qian Han and Guozhu Dong. Using Attribute Behavior Diversity to Build Accurate Decision Tree Committees for Microarray Data. Journal of Bioinformatics and Computational Biology. Volume 10, Issue 4, 2012.
DNA microarrays (gene chips), frequently
used in biological and medical studies, measure the expressions of thousands of
genes per sample. Using microarray data to build accurate classifiers for
diseases is an important task. This paper introduces an algorithm, called Committee
of Decision Trees by Attribute Behavior Diversity (CABD), to build highly
accurate ensembles of decision trees for such data. Since a committee’s
accuracy is greatly influenced by the diversity among its member classifiers, CABD
uses two new ideas to “optimize” that diversity, namely (1) the concept of
attribute behavior based similarity between attributes, and (2) the concept of attribute
usage diversity among trees. The ideas are effective for microarray data, since
such data have many features and behavior similarity between genes can be high.
Experiments on microarray data for six cancers show that CABD outperforms
previous ensemble methods significantly and outperforms SVM, and show that the
diversified features used by CABD’s decision tree committee can be used to
improve performance of other classifiers such as SVM. CABD has potential for
other high-dimensional data, and its ideas may apply to ensembles of other
classifier types.
3. Guozhu Dong and Qian Han. Mining Accurate Shared Decision Trees from Microarray Gene Expression Data for Different Cancers. The 14th International Conference on Bioinformatics and Computational Biology (BIOCOMP’13). July 22-25, 2013, Las Vegas, USA.
Abstract. This
paper studies the problem of mining shared decision trees across multiple
application domains, including multiple microarray gene expression datasets for
different cancers. Shared knowledge structures capture similarity between
application domains and have many useful applications. Given two datasets with
classes, we focus on shared decision trees that are highly accurate in both
datasets and whose nodes exhibit highly similar distribution of matching data
for the classes of the two datasets. Algorithms are presented for mining high
quality shared decision trees having high shared accuracy and high data
distribution similarity. Experimental results on microarray datasets for medicine
are reported to evaluate the algorithms.
4. Guozhu Dong and Qian Han. Mining Diversified Shared Decision Tree Sets for Discovering Cross Domain Similarities. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2014.
Abstract. This
paper studies the problem of mining diversified sets of shared decision trees
(SDTs). Given two datasets representing two application domains, an SDT is a
decision tree that can perform classification on both datasets and it captures
class-based population-structure similarity between the two datasets. Previous
studies considered mining just one SDT. The present paper considers mining a
small diversified set of SDTs having two properties: (1) each SDT in the set
has high quality with regard to “shared” accuracy and population-structure
similarity and (2) different SDTs in the set are very different from each
other. A diversified set of SDTs can serve as a concise representative of the
huge space of possible cross-domain similarities, thus offering an effective
way for users to examine/select informative SDTs from that huge space. The diversity
of an SDT set is measured in terms of the difference of the attribute usage
among the SDTs. The paper provides effective algorithms to mine diversified
sets of SDTs. Experimental results show that the algorithms are effective and
can find diversified sets of high quality SDTs.
5. This supplementary paper gives additional information about shared decision trees mined from various pairs of datasets, including three microarray gene expression datasets for cancer and three microarray gene expression datasets for cancer treatment outcome.
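The abstract of publication 4 says that the diversity of an SDT set is measured by the difference in attribute usage among its trees. The sketch below shows one way such a measure could look. It is an assumption-laden illustration, not the paper's definition or algorithm: it represents each tree only by the set of attributes it uses, takes Jaccard distance as the difference measure, and selects trees greedily. The names `usage_distance`, `set_diversity`, and `select_diversified`, the quality scores, and the attribute names are all hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# Each candidate tree is summarized as (quality score, attributes it uses).
Candidate = Tuple[float, FrozenSet[str]]


def usage_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard distance between two attribute sets (assumed measure)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def set_diversity(attribute_sets: List[FrozenSet[str]]) -> float:
    """Average pairwise attribute-usage distance within a set of trees."""
    pairs = list(combinations(attribute_sets, 2))
    if not pairs:
        return 0.0
    return sum(usage_distance(a, b) for a, b in pairs) / len(pairs)


def select_diversified(candidates: Dict[str, Candidate], k: int,
                       min_quality: float) -> List[str]:
    """Greedily pick up to k sufficiently good trees that differ in attributes.

    Start from the highest-quality tree, then repeatedly add the tree whose
    smallest distance to the already chosen trees is largest.
    """
    eligible = {name: c for name, c in candidates.items() if c[0] >= min_quality}
    if not eligible or k <= 0:
        return []
    chosen = [max(eligible, key=lambda name: eligible[name][0])]
    while len(chosen) < min(k, len(eligible)):
        remaining = [name for name in eligible if name not in chosen]
        best = max(remaining, key=lambda name: min(
            usage_distance(eligible[name][1], eligible[c][1]) for c in chosen))
        chosen.append(best)
    return chosen


# Hypothetical candidate trees with made-up quality scores.
candidates: Dict[str, Candidate] = {
    "T1": (0.92, frozenset({"g1", "g2", "g3"})),
    "T2": (0.90, frozenset({"g1", "g2", "g4"})),
    "T3": (0.88, frozenset({"g5", "g6"})),
    "T4": (0.60, frozenset({"g7", "g8"})),
}

picked = select_diversified(candidates, k=2, min_quality=0.8)
print(picked)  # ['T1', 'T3']
print(set_diversity([candidates[name][1] for name in picked]))  # 1.0
```

Here `T2` is nearly as good as `T1` but shares two of its three attributes, so the greedy step prefers `T3`, which uses disjoint attributes. `T4` is excluded for low quality even though it would add diversity, reflecting the requirement that every tree in the set be of high quality.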
Page maintained by Guozhu Dong.