Origins from complete mtDNA sequences using
PCA and robust clustering with applications
to breast cancer progression"
IBM Research - Yorktown
April 28, 2006
The study of human origins
using mtDNA suggests that present day humans
outside Africa originated from one or more migrations
of small groups of individuals between 30K and 70K
years before present (YBP). Coalescence theory shows that any collection
of DNA fragments traces back to a common ancestor.
Mutations fixed by genetic drift act as markers
on the timeline from the common ancestor to
the present and can be used to infer migration
and founder events in ancestral populations.
Most mutations seen in the data today are recent
and carry no useful information about deep ancestry.
Mutations useful as markers of deep ancestry
are those that robustly distinguish large clusters
of individuals. We present results from the
analysis of 1737 complete mtDNA sequences from
public databases to infer a robust set of SNPs
that reveal the migration phylogeny and provide
robust classification. Principal component analysis (PCA) clearly
separates the samples into L, M and N clades.
The PCA eigenvectors identify mutations useful
in distinguishing the clades and revealing substructure.
Unsupervised consensus ensemble k-clustering
is used to devise a new algorithm which constructs
the migration tree while avoiding the problems
of sample size and selection bias. Our algorithm yields
a robust classification of samples into haplogroups
with accuracy > 90%.
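
The consensus ensemble idea described above can be illustrated with a minimal sketch. This is not the algorithm presented in the talk: it assumes plain k-means with farthest-first seeding on binary SNP vectors, repeated on random subsamples, and it records how often each pair of samples lands in the same cluster. The function names, the toy data and all parameters (`runs`, `frac`, `seed`) are hypothetical.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance; equals Hamming distance for 0/1 vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, iters=20):
    # Farthest-first seeding: random first center, then the point
    # farthest from all centers chosen so far.
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def consensus_matrix(data, k, runs=50, frac=0.8, seed=0):
    # Fraction of subsampled runs in which samples i and j co-cluster,
    # counted only over runs where both were drawn.
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, int(frac * n))
    for _ in range(runs):
        idx = rng.sample(range(n), size)
        labels = kmeans([data[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Toy data (hypothetical): rows are samples, columns are SNP positions.
toy = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 1],
]
consensus = consensus_matrix(toy, k=2)
```

Pairs whose consensus value stays near 1 across resampled runs form clusters that do not depend on which samples happened to be drawn, which is the sense in which such a procedure can reduce sensitivity to sample size and selection.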
We find that the African
clades L0/L1, L2 and L3 have the greatest heterogeneity,
in agreement with their ancient ancestry. The
structure of the M clade migration tree is
consistent with prior literature but our study
reveals additional detail. For the N clade,
we find that the NA and NRB haplogroups are
too close and NRB and J/T/U/H groups too far
apart for the current mtDNA tree to represent
migrations. Our migration tree for the N clade
deviates from the standard genetic marker tree
because it places the NRB samples close to NA
and N9a rather than near J/H/U/T. Traditional
distance-based methods (neighbor-joining, UPGMA)
applied to the same data produce trees with a similar topology,
while parsimony gives several possible trees
with varying topology. Our result on the N clade
migration raises new questions about the migratory
events in the R mega-haplogroup. In addition,
we have obtained detailed substructure and SNP
patterns for all haplogroups which can be used
to classify samples with reliability greater
than 90%.
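
For readers unfamiliar with the distance-based tree methods mentioned above, the following is a minimal sketch of UPGMA on pairwise Hamming distances. It is a textbook version for illustration only, not the analysis pipeline used in the study; the sample names and sequences are hypothetical toy values.

```python
def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))

def upgma(names, seqs):
    # clusters maps a cluster id to (nested-tuple tree, number of leaves).
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}
    dist = {(i, j): float(hamming(seqs[i], seqs[j]))
            for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster is
        # the size-weighted average of the two old distances.
        new = {}
        for m in clusters:
            dim = dist[(min(i, m), max(i, m))]
            djm = dist[(min(j, m), max(j, m))]
            new[(m, next_id)] = (ni * dim + nj * djm) / (ni + nj)
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        dist.update(new)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    return next(iter(clusters.values()))[0]

# Toy sequences (hypothetical), two clearly separated pairs.
names = ["A1", "A2", "B1", "B2"]
seqs = ["AAAAAAAA", "AAAAAAAT", "TTTTAAAA", "TTTTCCAA"]
tree = upgma(names, seqs)
```

On this toy input the two A samples merge first, then the two B samples, and the root joins the two pairs.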
We also applied our methods
to analyze breast cancer microarray data to
reveal that the progression of disease from
"in situ" to "metastatic"
proceeds in different clusters of patients along