Boston University | Center for Computational Science

"Inferring Common Origins from complete mtDNA sequences using PCA and robust clustering with applications to breast cancer progression"

Gyan Bhanot
Computational Biology Center
IBM Research - Yorktown Heights, N.Y.
April 28, 2006
 
The study of human origins using mtDNA suggests that present-day humans outside Africa originated from one or more migrations of small groups of individuals between 30,000 and 70,000 years before present (YBP). Coalescence theory shows that any collection of DNA fragments traces back to a common ancestor. Mutations fixed by genetic drift act as markers on the timeline from the common ancestor to the present and can be used to infer migration and founder events in ancestral populations. Most mutations seen in the data today are recent and carry no useful information about deep ancestry. Mutations useful as markers of deep ancestry are those that robustly distinguish large clusters of individuals. We present results from the analysis of 1737 complete mtDNA sequences from public databases to infer a robust set of SNPs that reveal the migration phylogeny and provide robust classification. Principal component analysis (PCA) clearly separates the samples into L, M and N clades. The PCA eigenvectors identify mutations useful in distinguishing the clades and revealing substructure. Unsupervised consensus ensemble k-clustering is used to devise a new algorithm which constructs the migration tree while avoiding the problems of sample size and selection bias. Our algorithm provides a robust classification of samples into haplogroups with accuracy > 90%.
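To make concrete how PCA eigenvectors can point to group-defining mutations, here is a minimal sketch on invented toy data. It is not the speaker's pipeline: the matrix `samples`, the helper names, and the use of power iteration to obtain the first principal component are all assumptions made for illustration. Shared, group-splitting sites should receive the largest loadings, while private (recent) mutations receive small ones.

```python
import math

# Toy data (invented): rows are samples, columns are SNP sites,
# 1 = variant present. Sites 0 and 1 split the two groups;
# sites 2-4 are private mutations carried by a single sample each.
samples = [
    [1, 1, 0, 0, 0],  # group A
    [1, 1, 1, 0, 0],  # group A
    [1, 1, 0, 0, 0],  # group A
    [0, 0, 0, 1, 0],  # group B
    [0, 0, 0, 0, 0],  # group B
    [0, 0, 0, 0, 1],  # group B
]


def center_columns(rows):
    """Subtract each column's mean so PCA describes variation, not frequency."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_principal_component(rows, iterations=200):
    """Leading eigenvector of X^T X (X = centered data) by power iteration.

    Returns (loadings per site, score per sample).
    """
    x = center_columns(rows)
    n_sites = len(x[0])
    v = [1.0 / (j + 1) for j in range(n_sites)]  # deterministic start vector
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]  # X v
        w = [sum(row[j] * s for row, s in zip(x, scores))           # X^T (X v)
             for j in range(n_sites)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    scores = [sum(a * b for a, b in zip(row, v)) for row in x]
    return v, scores


loadings, scores = first_principal_component(samples)

# Sites with the largest absolute loading contribute most to the separation.
ranked_sites = sorted(range(len(loadings)), key=lambda j: -abs(loadings[j]))
print("sites ranked by |loading|:", ranked_sites)
print("sample scores on PC1:", [round(s, 2) for s in scores])
```

In this toy case the sign of each sample's score on the first component separates group A from group B, and sites 0 and 1 rank highest. On real sequence data one would typically use a library PCA implementation and inspect several components.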
 
We find that the African clades L0/L1, L2 and L3 have the greatest heterogeneity, in agreement with their ancient ancestry. The structure of the M clade migration tree is consistent with prior literature, but our study reveals additional detail. For the N clade, we find that the NA and NRB haplogroups are too close, and the NRB and J/T/U/H groups too far apart, for the current mtDNA tree to represent migrations. Our migration tree for the N clade deviates from the standard genetic marker tree because it places the NRB samples close to NA and N9a rather than near J/H/U/T. Traditional distance-based methods (neighbor-joining, UPGMA) produce trees with a similar topology, while parsimony gives several possible trees with varying topology. Our result on the N clade migration raises new questions about the migratory events in the R mega-haplogroup. In addition, we have obtained detailed substructure and SNP patterns for all haplogroups, which can be used to classify samples with reliability greater than 90%.
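For readers unfamiliar with the distance-based methods named above, the following is a small sketch of UPGMA (average-linkage clustering) on pairwise Hamming distances. The sequences and sample labels are invented, and the function names `hamming` and `upgma` are hypothetical; this illustrates the general technique only, not the analysis reported in the talk.

```python
def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def upgma(names, seqs):
    """Build a nested-tuple tree by repeatedly merging the closest clusters.

    The distance from a merged cluster to any other cluster is the
    size-weighted average of its two parts' distances (the UPGMA rule).
    """
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (subtree, size)
    dist = {(i, j): float(hamming(seqs[i], seqs[j]))
            for i in range(len(seqs)) for j in range(i + 1, len(seqs))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair; ties go to the earliest key
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        # Drop every distance involving the two merged clusters.
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


# Invented toy sequences: s1/s2 differ at one site, s3/s4 differ at one site,
# and the two pairs differ from each other at three or more sites.
names = ["s1", "s2", "s3", "s4"]
seqs = ["AAAAAA", "AAAAAT", "TTTAAA", "TTTAAC"]

tree = upgma(names, seqs)
print(tree)
```

UPGMA assumes a constant rate of change along all lineages, which is one reason trees built this way can differ from trees built by parsimony.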
 
We also applied our methods to breast cancer microarray data, revealing that the progression of disease from "in situ" to "metastatic" proceeds along distinct pathways in different clusters of patients.
 
