WHY DEEP ANCESTRY (Y-SNPs) IS IMPORTANT
Like most genealogists, I originally thought that Y-SNPs played a very minor role in genealogy. Several years ago, this may have been the case: the most common Y-DNA haplogroups were several thousand years old and divided all of mankind into only around 100 deep ancestral branches. The first and most widely known use of deep ancestry is using haplogroups as a quick method of separating genealogical groupings from each other. After all, if you do not share an ancestor at 4,000 years, you obviously will not share a common genealogical ancestor at 300 to 600 years. But most sponsors of Y-STR submissions probably believed that surname admins were just too lazy to manually separate submissions into valid genealogically related surname clusters, and most sponsors were not very supportive of ordering deep clade tests or individual SNP tests for more recent branches.
However, many surname admins (and some sponsors) have learned over time that many genealogical clusters were easy to isolate while other groupings of submissions just seemed too genetically diverse to be true genealogical clusters. Some groupings appeared to be multiple genealogical clusters that overlapped in some fashion. Many surname admins are not really aware of the root cause of this overlap, but many have learned that it is a result of common Y-STR marker values that show little net change over the last 2,000 years. These common marker values did mutate over the last 2,000 years, but the changes were primarily parallel mutations and back mutations, so many submissions arrived back at the same set of marker values they started out with 2,000 years ago. With the growing understanding of common DNA marker values and the number of Y-SNP haplogroups recently passing 500, surname admins are now encouraging sponsors to order deep clade tests and special order Y-SNP tests to assist with the separation of groupings with common DNA marker values.
The growing understanding of common DNA marker values has led to the second very useful application of deep ancestral testing for genealogical use - systematically determining how rare haplotypes are when compared to others. For genealogical clusters under the very deep ancestry R1b haplogroup, you can compare the MRCA of the R1b haplotype to the MRCA of the haplotype of your genealogical cluster. If you find only a few mutations, your genealogical cluster has very common DNA marker values that overlap with many haplogroups from over 1,000 years ago, making the testing of deep ancestry much more important. Some combinations of marker values are so uncommon that all Y-STR related submissions will be genealogically related regardless of surname. For other Y-STR related submissions that have extremely common combinations of DNA marker values, even close Y-STR matches with the same surname may not share haplogroups and therefore could not share common genealogical ancestors 300 to 400 years ago. Evaluating the rarity of your Y-STR haplotypes is critical to ruling out false genealogical matches, which are much more common than most researchers realize.
Y-SNPs ARE NOW APPROACHING GENEALOGICAL TIMES
With many haplogroups now having origins that are only 1,000 to 1,500 years old, newly discovered Y-SNPs are getting interestingly close to the genealogical time frame. When haplogroups are this close to the genealogical time frame, other major uses of deep ancestry become powerful analytic tools for genealogists. If you compare the MRCA haplotype of your haplogroup to the MRCA haplotype of your genealogical cluster, the differences give you a DNA fingerprint for your genealogical cluster. These are the mutations that occurred between your ancestor when the haplogroup originated and your ancestor when your genealogical cluster originated. When attempting to find out if more remotely related submissions are truly related, matching the DNA fingerprint (or coming close to it) is a very strong factor in determining the possibility of being related. Sharing common mutations from the haplogroup MRCA is a much more important criterion for determining a possible connection than genetic difference. If you find other distantly related genealogical clusters that share parts of this DNA fingerprint, then it is also evidence that these remotely related genealogical clusters could share a common ancestor. If you have possible NPE candidates that have strong geographical ties and are genetically close, discovering that they share common mutations from the MRCA of the haplogroup is additional genetic evidence supporting the possibility of an NPE connection.
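The fingerprint comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dna_fingerprint` is hypothetical, each haplotype is assumed to be a simple mapping from marker name to repeat count, and the marker values are made up for the example rather than taken from any real project.

```python
def dna_fingerprint(haplogroup_modal, cluster_modal):
    """Return the markers where the cluster MRCA differs from the haplogroup MRCA.

    Both arguments map a Y-STR marker name to its repeat count. The result maps
    each differing marker to a (haplogroup_value, cluster_value) pair.
    """
    return {
        marker: (haplogroup_modal[marker], cluster_modal[marker])
        for marker in haplogroup_modal
        if marker in cluster_modal
        and cluster_modal[marker] != haplogroup_modal[marker]
    }


# Hypothetical values for illustration only; not real project data.
haplogroup_mrca = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11, "DYS439": 12}
cluster_mrca = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 10, "DYS439": 12}

fingerprint = dna_fingerprint(haplogroup_mrca, cluster_mrca)
print(fingerprint)  # {'DYS390': (24, 25), 'DYS391': (11, 10)}
```

In this made-up case the fingerprint is two markers long. A very short fingerprint is the situation where the cluster sits close to the haplogroup's common marker values and Y-SNP testing matters most.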
Very few genealogical researchers are aware that DNA fingerprints of genealogical clusters can greatly enhance the genetic source documentation that supports genetic analysis. DNA fingerprints also provide far superior searches for possible relatives. Searching only by the number of mutations can miss genetically related submissions that mutated more than normal. Also, if your genealogical cluster has common marker values where mutational difference is less reliable due to major overlap with unrelated submissions, having common "off modal" mutations from the MRCA of the haplogroup can be a far more accurate test of relatedness. Searching Y-Search with your DNA fingerprint provides much better genetic matches than searching only by mutational difference. Having shared mutations and close genetic matches is a powerful combination. Having shared mutations, close genetic distance and a common surname is an even more powerful combination. If you only analyze the mutations below the MRCA of your genealogical cluster, you are not including important hidden mutations that occurred between the creation of your haplogroup and the creation of your genealogical cluster.
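To make the difference between the two kinds of search concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the fingerprint is assumed to be a mapping from marker name to the cluster's "off modal" value, genetic distance is simplified to a plain sum of step differences (real calculators treat multi-copy and fast markers differently), and the submissions are invented.

```python
def shared_off_modal(fingerprint, submission):
    """Count fingerprint markers where the submission carries the off-modal value."""
    return sum(
        1 for marker, value in fingerprint.items()
        if submission.get(marker) == value
    )


def genetic_distance(a, b):
    """Simplified distance: sum of step differences over markers both tested."""
    return sum(abs(a[m] - b[m]) for m in a if m in b)


def rank_candidates(fingerprint, cluster_modal, submissions):
    """Sort by most shared off-modal mutations first, then by smallest distance."""
    scored = [
        (name, shared_off_modal(fingerprint, hap), genetic_distance(cluster_modal, hap))
        for name, hap in submissions.items()
    ]
    return sorted(scored, key=lambda row: (-row[1], row[2]))


# Hypothetical data for illustration only.
cluster_modal = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 10, "DYS439": 12}
fingerprint = {"DYS390": 25, "DYS391": 10}  # off-modal values versus the haplogroup MRCA
submissions = {
    "A": {"DYS393": 13, "DYS390": 25, "DYS19": 15, "DYS391": 10, "DYS439": 13},
    "B": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11, "DYS439": 12},
    "C": {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 11, "DYS439": 12},
}

for row in rank_candidates(fingerprint, cluster_modal, submissions):
    print(row)
# ('A', 2, 2)
# ('C', 1, 1)
# ('B', 0, 2)
```

Submissions A and B are both two steps from the cluster modal, so a search by mutation count alone cannot tell them apart. A carries both fingerprint mutations while B simply matches the haplogroup's common values, which is the distinction the fingerprint search is meant to expose.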
When the age of the haplogroup gets very close to the genealogical time frame (600 to 800 years), the Y-SNP mutation that defines the haplogroup can reveal even more information. These Y-SNP mutations are called "near private" Y-SNPs by deep ancestry researchers. These mutations are always dominated by one or two surnames. If 80 % are one surname, 10 % are a second surname and the last 10 % are spread across 20 surnames, then you have probably discovered a very distant NPE connection between the two most common surnames. The last 10 % of surnames also become excellent NPE candidates, since the number of NPEs over a 600 to 800 year period should range between ten and twenty percent.
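The surname breakdown for a "near private" Y-SNP is easy to tabulate. The sketch below is a hypothetical illustration: the function names, the placeholder surnames and the 5 % cutoff used to separate dominant surnames from the scattered remainder are all assumptions chosen for the example, not an established rule.

```python
from collections import Counter


def surname_shares(surnames):
    """Return each surname's share of the SNP-positive testers, largest first."""
    counts = Counter(surnames)
    total = len(surnames)
    return {name: count / total for name, count in counts.most_common()}


def split_dominant(shares, threshold=0.05):
    """Split surnames into dominant ones (share >= threshold) and the remainder.

    The threshold is an arbitrary assumption for this illustration.
    """
    dominant = [name for name, share in shares.items() if share >= threshold]
    minor = [name for name, share in shares.items() if share < threshold]
    return dominant, minor


# Hypothetical testers mirroring the 80 % / 10 % / 10 % example in the text.
testers = ["Alpha"] * 160 + ["Beta"] * 20 + [f"Other{i:02d}" for i in range(20)]

shares = surname_shares(testers)
dominant, minor = split_dominant(shares)
print(dominant)    # ['Alpha', 'Beta']
print(len(minor))  # 20
```

Here the two dominant surnames would be examined for a very distant NPE connection, and the 20 scattered surnames would be treated as individual NPE candidates.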
FINDING Y-SNPs WITHIN GENEALOGICAL TIMES
Just like Y-STRs, Y-SNPs can mutate at any time, anywhere from 20,000 years ago to only 200 years ago. Any Y-SNP that mutates within the genealogical time frame is called a "private" Y-SNP and is extremely important to genealogical research. Y-STRs really only form clusters of related submissions. The vast majority of Y-STR mutations prove only that the submissions sharing those mutations must be more closely related. However, the connections between these clusters and the ages of these clusters are difficult to determine from Y-STR information alone. It is similar to a tree trunk and several big branches lying on the ground. You can have many branches, but you have no information on where to put the branches on the tree in proper chronological order. In addition, you have a lot of submissions that do not have any cluster defining mutations. These are like many parts of branches on the ground with no information on how they are connected together or where they belong on the tree. Here is a typical DNA descendancy chart of a well established surname cluster with no early Y-STR branch that splits the surname cluster into two early branches:
Scenario 1 - Many Y-STR branches - but no early branch that splits the cluster
If you are very lucky, you may discover a Y-STR mutation that happened just after the formation of your genealogical cluster. These older Y-STR mutations can form an early branch that divides the genealogical cluster into two large branches where all other branches are attached to one branch or the other. This kind of early branch is much more likely if your MRCA haplotype is much more recent (around 300 years). This kind of branch has major genealogical implications, as you can eliminate around half of the submissions as being less related and focus your genealogical research on the half of the submissions that belong to your branch. In this very special scenario, a Y-STR mutation not only shows connections between branches but must also mark a branch that is very old and formed just after the creation of the genealogical cluster. These kinds of cluster dividing branches are rare:
Scenario 2 - Several Y-STR branches - with early branch that splits the cluster
Only in the last year or two have "private" Y-SNPs become available for genealogical analysis. Very few genealogists are even aware of these extremely powerful "private" Y-SNP mutations. Unlike Y-STR branches, "private" Y-SNPs reveal new branches with clarity, provide connection information between all branches and provide the relative time frame of each branch. According to the November 2011 ISOGG Y-SNP summary, there are currently over 500 haplotree branches defined and over 150 "private" Y-SNPs that have been discovered. Probably 50 of these "private" Y-SNPs will later become new branches on the haplotree and another 50 will be "near private" Y-SNPs that will probably be added as branches to the haplotree. This leaves around 50 Y-SNPs that are probably "private" Y-SNPs in the genealogical time frame. Only a handful of these "private" Y-SNPs are probably being analyzed for genealogical purposes. There are currently around 30 new Y-SNPs being discovered every month and another 10 new "private" Y-SNPs found every month. Many scientists believe that there could be thousands (or perhaps tens of thousands) of "private" Y-SNPs that could be discovered over the next few years. These kinds of branches will become common and will help create a DNA descendancy chart that starts to resemble a traditional genealogical descendancy chart:
Scenario 3 - Many Y-STR branches - with one private SNP
So how do genealogists locate existing "private" Y-SNPs for their genealogical cluster, and how do you test for a new Y-SNP mutation? Finding existing "private" Y-SNPs that match your surname project is pretty tedious work and requires some research in the haplogroup projects. If you are lucky, deep ancestry researchers may alert the surname admin to these recently discovered "private" Y-SNPs. These "private" Y-SNPs are a very recent development in genetic genealogical research and are not well documented. Most newly discovered Y-SNPs are data mined from academic and scientific studies and are then made available for general testing by genealogists. FTDNA also offers a test that discovers new Y-SNPs (and finds one to three new Y-SNPs around 50 % of the time). Currently, new Y-SNP mutations are being discovered more than once every few days, and the rate of discovery is increasing as more researchers become involved and testing costs continue to decline (or the scope of testing continues to increase by scanning even more base pairs).
THE TESTING OF DNA HOLDS A BRIGHT FUTURE
The future of DNA analysis for genealogists is extremely promising but is evolving painfully slowly. Most submissions are random submissions, and submissions trickle in at a very slow pace due to the high costs of widespread testing. Many submissions have little or no accompanying traditional documentation and everyone is still learning how to analyze DNA submissions. Eventually there will be dozens of submissions for every genealogical cluster with one hundred or more markers to analyze. There is a biological limit to the number of Y-STRs that mutate in a manner compatible with genealogical needs. There are believed to be only 200 to 400 Y-STRs that will match genealogical needs. Over the next few years, Y-SNPs will grow from 500 haplotree branches and 150 "private" Y-SNPs to 5,000 haplotree branches and 50,000 "private" Y-SNPs. This massive growth will provide a wealth of new genetic information to analyze and will introduce some significant growing pains. Over the next few years, the size of autosomal databases will grow significantly and these tests will need to be integrated with Y-STR and Y-SNP genetic information.
In the next two to three years, the scope of genetic genealogical testing will include the entire genome instead of multiple tests for each individual being tested. The full genome tests will eventually decline to $1,000, then $500 and eventually $250. Once tested, you will have the test results sent to you, which will include: 1) all 400 Y-STRs; 2) millions of Y-SNPs that have the potential to mutate; 3) autosomal areas (probably 100 times the data which will provide some improvements); 4) health and gene information (if you select the option); 5) all mtDNA data; 6) X-DNA which may have some genealogical value; and 7) many common specialty tests (such as the special 464 test that reveals non-standard GATC values). This massive infusion of new data will create an unbelievable increase in genetic information to analyze and will certainly involve some very significant learning curves, as the data will become available faster than the genetic community can easily analyze it. The days of primarily manual analysis will give way to analysis assisted by advanced analysis tools.
The first complete scan of the first person's entire genome (every human DNA marker that exists) far exceeded $10,000,000 in 2004. Just four years later in 2008, many genome scans were conducted for less than $1,000,000 per individual. Just one year later in 2009, dozens of genomes were scanned for less than $100,000 per genome. Already in 2011, hundreds of genome scans have been conducted and the cost has been reduced again to under $10,000 per scan, and a new company just announced $5,000 full genome tests in August 2011. There is great excitement in the scientific and medical community that full genome tests under $1,000 will be available in the next year or two. Most believe the $1,000 test will be the cost threshold where massive medical testing will become feasible. The costs of future testing for genealogists will be primarily driven by the overhead of delivering this information to genealogists. Analyzing millions of markers for each submission will also require complex software, which will become a major expense.
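As a rough check on these figures, the implied average yearly price decline can be worked out with a short Python sketch. The function names are hypothetical, and the inputs are only the round-number bounds quoted above ("far exceeded $10,000,000" in 2004, "under $10,000" in 2011), so the result is an order-of-magnitude illustration rather than a measured rate or a forecast.

```python
import math


def annual_decline_rate(start_cost, end_cost, years):
    """Average yearly fractional price drop, assuming a constant yearly rate."""
    return 1 - (end_cost / start_cost) ** (1 / years)


def years_to_reach(current_cost, target_cost, rate):
    """Years until target_cost is reached if the same yearly rate continued."""
    return math.log(target_cost / current_cost) / math.log(1 - rate)


# Round-number bounds quoted in the text: 2004 and 2011.
rate = annual_decline_rate(10_000_000, 10_000, 2011 - 2004)
print(f"{rate:.1%}")  # 62.7%

# Naive extrapolation from $10,000 to the $1,000 threshold at that rate.
print(f"{years_to_reach(10_000, 1_000, rate):.1f}")  # 2.3
```

Under the simplifying assumption of a constant rate, the $1,000 test would arrive a little over two years after the $10,000 test, which is in the same range as the "next year or two" expectation mentioned above.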
DNA testing is in the early stages of the technology maturing cycle. Currently, the hardware costs of DNA scanners (and associated "consumable" supplies) are the dominant economic factor and the associated labor costs are not far behind. Software and software development expenses are currently in a distant third place. Software development is currently limited largely to simple MRCA calculators, web access to place orders and provide information, relatively small databases serving as repositories of DNA submissions and simple search engines to compare submissions.
This was the state of affairs for corporate data processing around 15 to 20 years ago. The costs of running corporate data centers shifted from hardware related costs to the labor costs of supporting these systems, because hardware costs were decreasing at staggering rates. Labor costs soared due to massive increases in software development to take advantage of massive amounts of computing power. Costs for software productivity tools greatly increased in order to reduce labor costs by increasing labor productivity. Today, the operational costs of the corporate data center are only a small fraction of software development costs. This same technology maturing cycle will be repeated in the DNA testing industry. Hopefully genealogists will be able to gain a free ride on many of the complex software analysis tools required by the medical industry. Future DNA testing companies will become very dependent on software analysis tools and database extraction tools to analyze the massive amount of data that will become available. This time of transition will require major investments in software development as well as increased skills among genetic researchers.
The number of Y-STR markers will increase with the availability of full genome scans. I cannot imagine that genealogists will not take advantage of another 100 to 300 additional Y-STRs when they become available at no additional charge. However, the emphasis will shift from Y-STR analysis to Y-SNP analysis. Y-STR markers are relatively fast mutating markers and only produce clusters of related submissions. Y-STRs rarely show how all the clusters are connected or the age of each cluster. Fortunately, Y-SNPs also form branches, and these have much less ambiguity, reveal how branches are connected and imply the relative age of each mutation. Just a few years ago, the deep ancestry researchers were discovering branches that occurred 2,000 to 4,000 years ago and most genealogists paid minimal attention to this research. However, the origins of Y-SNP branches of mankind are now approaching 1,000 years for many new haplogroups, and many are already under 500 years, where they will have a profound impact on genealogists.
There is another major limit ahead where DNA testing will eventually hit a brick wall. DNA testing only works where there is a reasonable amount of traditional documentation available to assign names, places and specific dates to genetic connections. DNA testing is a great complementary source of genealogical information in the 200 to 400 year time frame, where we can enhance our knowledge of the connections between our oldest proven ancestors within several generations of those ancestors. Eventually, our genealogical research will approach a time frame where 90 % of the evidence will be genetic and only 10 % of the evidence will be traditional documentation. We could discover how our distant ancestors are connected - but may not be able to assign names, specific dates and places due to the lack of supporting traditional documentation that provides this information. Of course, there will always be a set of lucky individuals who tie into wealthier ancestors that left a better paper trail behind. As genetic genealogical research travels further back in time, new brick walls will become limiting factors again due to the lack of any significant amount of traditional documentation to add genealogical meaning to our genetic family histories.
Just as technology greatly enhanced our ability to access, research and document our family histories, DNA testing is providing a new infusion of source documentation to complement our traditional documentation sources. Just as it took many years for many genealogists to embrace new computer technologies, it will probably take many years to develop and embrace new DNA technologies for genealogists. Previous generations saw other technology improvements that we take for granted today. Cars allowed us to be more mobile and visit remote courthouses, telephones allowed us to call our distant cousins and copiers enhanced our ability to duplicate source records to share. Personal computer technology and the internet databases supporting genealogists will also continue to improve over time, but the sheer magnitude of unreliable information continues to explode as well. It is naive to believe that DNA testing for genealogists will be the last major quantum leap in enhancing our genealogical research. If anyone has any ideas of other near term quantum leaps on the horizon (other than DNA related), drop me a note so that I can start preparing for these new opportunities to learn yet more new analysis skills.