Liang Goh, Susan K. Murphy, Sayan Mukherjee, Terrence S. Furey
Abstract:
Motivation: Genes silenced by the aberrant
methylation of nearby CpG islands can contribute to the onset or
progression of cancer and represent potential biomarkers for diagnosis
and prognosis. Relatively few have thus far been validated as
hypermethylated in cancer among over 14,000 candidates with promoter
region CpG islands. A descriptive set of genes known to be
unmethylated in cancer does not exist. This lack of a negative set
and a large number of candidates necessitated the development of a new
approach to identify novel genes hypermethylated in cancer.
Results: We developed a general method, cluster_boost, that in an imbalanced data setting predicts new minority class members given limited known samples and a large set of unlabeled samples. Synthetic datasets modeled after the hypermethylated genes data show that cluster_boost can successfully identify minority samples within unlabeled data. Using genome sequence features, cluster_boost predicted candidate hypermethylated genes among 14,000 genes of unknown status. In primary ovarian cancers, we determined the methylation status for 15 genes with different levels of support for being hypermethylated. Results indicate cluster_boost can accurately identify novel genes hypermethylated in cancer.
Supplemental Files
Supplemental Table S1 (xls) - List of 63 genes previously reported to be hypermethylated in cancer.
Supplemental Table S2 (xls) - List of 64 sequence features that described promoter regions of known and potentially hypermethylated genes.
Supplemental Table S3 (xls) - Prediction results for 14,249 genes using the cluster_boost algorithm.
Supplemental Figure SF1 (pdf) - Distribution of genes at each classification threshold for 1%, 5%, 10%, and 20% new synthetic samples (SD2).
Software
cluster_boost.tgz - Matlab implementation of the cluster_boost algorithm.
readme.txt.
Datasets
Hypermethylated_vector.txt - Accessions, locations, and feature vectors for 63 known hypermethylated genes.
Unlabeled_vector.txt - Accessions, locations, and feature vectors for 14,249 genes of unknown methylation status.