How To - Basic DNA Concepts

Basic DNA Concepts

Home of Isham Brooks, purchased in the 1830s, photograph ca. 1978

SUMMARY OF TESTS USED BY GENEALOGISTS

This section gives a very brief overview of how DNA testing is used to extract more source documentation about our ancestors. YDNA topics are primarily covered in this section, as these tests present the best option for using DNA testing to connect our older ancestors together. Both YSTR testing and YSNP testing are used for the analysis in most YDNA Surname Projects. There are a myriad of other DNA tests available, but they address either ancestry that is too deep (mtDNA) or ancestry that is too recent (atDNA). However, atDNA tests are an excellent choice if you have brick walls in the recent past (post 1800).

The mutation rate (or net change of DNA per generation) is critical to why certain tests are selected for different audiences. Below is a very high level summary of how DNA tests are selected for different purposes:

Audience	Mutation Rate	Tests	Comments
Not used since there is very little variation	Almost never	Whole genome	99 % of human DNA is the same for all of mankind (excluding genes & DNA that recombines every generation)
Medical Community	Very Frequent	Whole genome	These sections are called genes and affect health and biological functions
Deep Ancestry research	Usually Once	YSNP (old)	Find new, older YSNP mutations for deep ancestry
Deep Ancestry research	Usually Once	mtDNA	Develop descendant charts of womankind (but with only 16K base pairs, only a few mutations will be recent)
Genealogical	Usually Once	YSNP (recent)	Since it is estimated that a YSNP mutation occurs about every three generations, many YSNP mutations are very recent (today's technology covers about 16M bp of YDNA)
Genealogical	Frequent	YSTR (slow)	Complement YSNPs in building haplotrees from surname creation to known ancestors
Genealogical	Extremely Frequent	YSTR (very fast)	Can assist with breaking brick walls that NGS and individual YSNP tests may not
Very recent genealogical	Extremely Frequent	atDNA	Great for recent pedigree brick walls & adoptions (primarily post 1800 but a few scenarios go back to early 1700s)

The vast majority of DNA either recombines every generation or controls our biology (medical). So 22 of the 23 chromosomes are either medically related or change at such a high rate that they are not reliable back to the 1700s. Using atDNA testing (Family Finder, etc.), it will be very hard to break down our brick walls before 1800. mtDNA, which exists as a DNA strand independent of the 23 chromosomes, is excellent for tracking the deep ancestry of our all female line. However, with only 16,000 base pairs, mtDNA tests will have very few genealogical applications since most mutations are in the distant past. This leaves us with only our sex chromosomes, XY (male) or XX (female). Female oriented DNA just does not seem to get a break for genealogical testing. Females have two DNA structures to test, but neither has the potential of YDNA. Unfortunately, biology messes things up for women: a daughter receives one X from her mother that is a random recombination of the mother's two X chromosomes, plus her father's single X, which is passed down essentially intact. So you can randomly walk up a largely female line, but it is not always mother to daughter. Analysis of the X chromosome from atDNA tests is worthwhile, but it is very technical and there is a lack of tools to assist. Since the X always has a random recombination effect similar to atDNA, its analysis has the same limits as atDNA but can go back twice as far when compared to atDNA.

This leaves only YDNA as the best vehicle to break down our brick walls in the 1700s and earlier. But even YDNA has issues. Even though the YDNA strand has 59,000,000 base pairs, the vast majority of YDNA cannot be used for genealogical research. You have to eliminate the massive gene structures and those areas where there is no detectable change going on. Current economical technology can only scan the 16,000,000 base pairs that have the potential to mutate at a rate that is useful to genealogists. Unfortunately, the current technology used in NGS tests does not read the strand sequentially but examines small portions of only 100 to 500 base pairs, making it incapable of reading long repetitive structures such as very long YSTRs. The Big Y cannot read around 10 % of the YSTRs, but Full Genomes Corporation's Elite 2.1 misses only around 2 or 3 % because it uses technology with longer read lengths. Within the next year or two, longer read lengths will not only be able to read all YSTR structures but will also be able to increase coverage of the YDNA by a small percentage. Also, we are already getting around 400 YSTR values with every NGS test, but this many YSTRs produce so much genetic variation that manual analysis would be impossible, and this would require many more submissions per cluster to separate very recent mutations from the mutations around the time frame of our oldest proven ancestors.

There is a possible economical alternative for YSTR testing that could assist our current methodology of YSNP testing with 67/111 markers. Testing companies could create two YSTR tests that differ from the tests existing today. One test could be the 111 marker test with the fastest mutating markers removed, bringing it down to around 75 markers. Many of the average speed mutating markers would be dropped as well. A second YSTR test would consist of around 35 fast mutating YSTR markers. The 75 marker test would be used to determine where you fit in the tree of mankind, such as L226. This test may not be needed for L226 testers, as a simple lower cost YSNP panel could already determine this. The 35 fast mutating markers could be used for connections below L226 and similarly aged haplogroups. These 35 markers would produce a lot more parallel and backwards mutations that would have to be sorted out, and would require at least five to ten tests of male descendants of the same proven ancestor to arrive at the haplotype of each ancestor being analyzed. So the fast mutating YSTR markers are similar to atDNA tests in that they would require you to test many male cousins to filter out all the noise produced between your donors and the common proven ancestor.

But you may say, I do not want to pay for five to ten 35 marker tests at $150 each. But NGS tests only provide a YSNP mutation every three or four generations, so the alternative is many more $575 NGS tests in the future. Keep in mind also that future WGS tests will provide all YSTRs at no additional cost, so there would be no need for NGS testers to order either the fast or slow marker tests. Having a gap every three or four generations for YSNPs is a biological brick wall. Without a complementary YSTR strategy, you cannot afford to have four missing generations in genealogical research. Having a four generation gap in YSNPs would mean you could not distinguish an ancestor from his sons, grandsons or great grandsons. Higher resolution NGS tests will reduce this three or four generation biological limitation by around 30 % (covering one additional generation per four generations). Also, this YSNP mutation rate varies wildly per genetic line. You may roll the dice and get a two, resulting in only one YSNP mutation in six generations. Or you may roll the dice and get a twelve, and get more YSNP mutations per generation for the critical years where you are trying to break through your brick walls.
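The dice analogy can be made concrete with a small sketch. It assumes, purely for illustration, that YSNP mutations arrive as a Poisson process averaging one mutation per three generations (the estimate quoted in the summary table). The Poisson model and the function name `prob_snp_count` are assumptions made for this example, not a published model.

```python
import math

def prob_snp_count(k, generations, rate_per_generation=1 / 3):
    """Poisson probability of exactly k YSNP mutations over a span of generations."""
    expected = generations * rate_per_generation
    return math.exp(-expected) * expected**k / math.factorial(k)

# Chance that a four generation stretch of a line carries no YSNP at all
p_gap = prob_snp_count(0, 4)
print(f"No YSNP in 4 generations: {p_gap:.0%}")

# Chance of at most one YSNP in six generations (an unlucky roll)
p_unlucky = prob_snp_count(0, 6) + prob_snp_count(1, 6)
print(f"At most one YSNP in 6 generations: {p_unlucky:.0%}")
```

Under these assumptions, gaps of several generations are far from rare, which is why a complementary YSTR strategy matters.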

YSTR mutation rates are averages among thousands of tests. YSTR mutation rates for your particular line will vary dramatically. For L226, I was fortunate to have rolled the dice and got tens and twelves: the FGC5647 branch has a very large seven YSTR mutations between L226 (1,400 years ago) and my cluster (600 to 1,000 years ago). Most currently known branches of L226 have a maximum of three L226 off modal mutations that define their entire cluster. DC1 submissions have rolled the dice and got twos and threes for genetic diversity. Not only do their signatures include only one or two L226 off modal mutations, but the markers included in these signatures are known to have five or ten independent mutations under L226 via parallel or backwards mutations. It is a bit of an irony that our royal line of Brian Boru has common marker values that vary little within L226. Or, from another viewpoint, the rest of us mutate a lot more, which does not seem a royal characteristic. It will be a lot more expensive to develop descendant charts for the DC1 lines due to the random nature of DNA mutation. However, although the DC1 YSTR values have not mutated sufficiently to break through brick walls, the verdict for YSNP mutation rates is not known yet. DC1 YSNPs may have mutated much faster than the average rate for the 10 or 20 generations that are of key interest to us.

Another valid test for genealogists is the autosomal test (atDNA). This is a very different test that will help solve the more recent brick walls that some genealogical researchers face. The FTDNA "Family Finder" test examines atDNA, which recombines from the DNA of both the mother and the father at a 50 % rate. Each child receives half of their atDNA from the mother and half from the father, so every generation sees a 50 % change in atDNA. Each child ends up with around 25 % of the DNA of each grandparent. Even though this "recombination" of DNA from father and mother is not really a mutation, it has an impact equivalent to a 50 % mutation rate. Several hundred thousand base pairs of atDNA are tested since the "mutation rate" is extremely high. After four or five generations, this test may reveal less information about any relationships since so little common DNA remains. There are exceptions beyond five generations, but only around ten percent of your recombined DNA will be intact at six or seven generations. This results in 90 % of ancestors being untraceable after six or seven generations - but a random 10 % of your DNA will remain intact enough to trace some of your ancestors.
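The halving described above can be written as a one-line calculation. This is a minimal sketch of the average case only; the function name `expected_share` is made up for the example. Real inheritance happens in chunks, so the share from any one ancestor varies around these averages.

```python
def expected_share(generations_back):
    """Average fraction of atDNA inherited from one ancestor n generations back."""
    return 0.5 ** generations_back

for n, label in enumerate(["parent", "grandparent", "great grandparent"], start=1):
    print(f"{label}: {expected_share(n):.1%} on average")
```

Because each step halves the average, the share from any single ancestor shrinks quickly, which is why atDNA works best for recent generations.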

There are two characteristics of atDNA tests that are useful for genealogists. The first is the raw amount of common DNA found between two submissions. Siblings will share around 50 % of their DNA, and first cousins will share around 12.5 % of their DNA. Therefore, the percentage of shared DNA can approximate the degree of the relationship. Since the split is not exactly 50 / 50 every generation, this characteristic of the test becomes less reliable after 4 or 5 generations. The second is the length of the shared segments. This DNA recombines and sometimes results in very long strings of DNA remaining intact. Children may have thousands of consecutive base pairs shared with each parent. These long strings become shorter and shorter with every generation, as the recombination process is random in nature. However, the longer strings of common DNA that remain intact over time reveal short lived atDNA signatures (segments) that can be used by genealogists to locate ancestors at 4 and 5 generations with high probabilities and locate ancestors at 6 and 7 generations with low probabilities. These tests are also very useful to genealogists with very recent adoptions, where researchers are attempting to locate possible close relatives. atDNA tests are also useful to researchers who have recent missing ancestors in their pedigree chart due to poor availability of source documentation, or to those just starting out with their genealogical research.

Most genealogists attempt to extend the timeframe by using several strategies that do extend the success rate back a generation or two. The most common strategy is to test multiple closer relatives to help sort out which segments belong to specific ancestral lines. My favorite approach is gaming the system for success. If you have an extensive genealogy, you can track down a cousin of interest that has a very high number of years per generation. I have one Brooks line that averages 52 years per generation back into the early 1800s. This doubles the timeframe for researching my Brooks. You could also luck out and get a "sticky" segment, where a long segment is passed virtually intact for several generations. Another way to game the system is to test individuals that have a lot of intermarriages. I recently had a solid atDNA match in the mid 1700s. This was aided by the fact that my Brooks line averages around 36 years per generation (making it a better candidate for atDNA testing) and the matching atDNA submission had my proven oldest known ancestor listed three times (not six times because only the mother was our common ancestor via two different husbands).

UNDERSTANDING YSTR MARKERS

YDNA strands are similar to extremely long ladders. The actual DNA strands are more complex than this, but the ladder analogy allows researchers to visualize how DNA changes from one generation to another. Each rung of the ladder has a position number assigned. Each rung of the ladder is a chemically bonded pair of molecules, creating a "base pair." The YSTR marker label refers to the location on the ladder where this DNA structure starts and stops. The number of rungs on this ladder is very predictable, but there are parts of the YDNA that have repeating strings of the same common molecules that bind the ladder together. Because some series of rungs never change over time, these series of rungs are used as reference points (primers) to locate other parts of the YDNA strand. These repeating strings have a repeating pattern of the same molecule pairs, and the number of these strings can vary over time (randomly adding and deleting strings). YSTRs are "Y"DNA "S"hort "T"andem "R"epeats. Many YSTRs have ideal mutation rates for genetic genealogical testing of individuals in the 300 to 900 year timeframe. Very slow mutating YSTRs are ideal for ancestral research, and faster mutating YSTRs are better for more recent genealogical time frames. However, if they mutate too fast, you have to order multiple tests for known proven ancestors to deduce the marker values of your ancestors. If they mutate too slowly, you seldom see change in the genealogical time frame.
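To make the ladder analogy concrete, here is a minimal sketch of how a YSTR marker value relates to the underlying sequence: the value is simply the number of back-to-back copies of a short motif. The motif, the flanking letters and the function name `count_tandem_repeats` are invented for illustration, and real laboratory calling is considerably more involved.

```python
def count_tandem_repeats(sequence, motif):
    """Return the longest run of back-to-back copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Keep stepping one motif length at a time while the motif repeats
        while sequence.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
    return best

# Hypothetical marker: fixed flanking rungs around a repeating GATA string
ancestor = "CCTG" + "GATA" * 12 + "TTCA"
descendant = "CCTG" + "GATA" * 13 + "TTCA"  # one string added over time

print(count_tandem_repeats(ancestor, "GATA"))    # marker value of the ancestor
print(count_tandem_repeats(descendant, "GATA"))  # marker value after one mutation
```

The fixed flanking letters play the role of the reference points, while the count in between is the value reported for the marker.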

Since YSTR structures both add and delete strings randomly over time, they can return to their original values (backwards mutations). Since YSTRs mutate at fairly high rates, it is not uncommon to discover multiple independent mutations to the same marker value within the genealogical time frame (parallel mutations). Within a haplogroup like L226, parallel mutations are far more common, but backwards mutations are present as well. If you analyze too many slow mutating YSTRs, the cost of current technology would result in a $500 test for 300 slow mutating markers, which is around the same price as a current NGS test that already includes 90 % of the slow mutating markers. If you add too many fast mutating markers, the number of parallel and backwards mutations would skyrocket (increasing tenfold or more). This would require testing five to ten different male descendants of every proven ancestor to filter out all the parallel and backwards mutations. So the selection of YSTRs is critical to understand.
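Backwards mutations are easier to picture with a toy simulation. The sketch below assumes a simple stepwise model in which a marker gains or loses one string with a fixed chance per generation. The model, the deliberately exaggerated 5 % rate and the function names are assumptions chosen for illustration, not measured values.

```python
import random

def simulate_marker(start_value, generations, mutation_rate, rng):
    """Track one YSTR marker down a single line, one step up or down per mutation."""
    values = [start_value]
    for _ in range(generations):
        value = values[-1]
        if rng.random() < mutation_rate:
            value += rng.choice([-1, 1])
        values.append(value)
    return values

def count_back_mutations(values):
    """Count the times the marker returns to its starting value after leaving it."""
    start = values[0]
    returns = 0
    for previous, current in zip(values, values[1:]):
        if previous != start and current == start:
            returns += 1
    return returns

rng = random.Random(226)  # fixed seed so the run is repeatable
lines = [simulate_marker(15, 40, 0.05, rng) for _ in range(1000)]
hidden = sum(1 for v in lines if v[-1] == 15 and count_back_mutations(v) > 0)
print(f"{hidden} of {len(lines)} simulated lines mutated and came back to the original value")
```

Lines counted as hidden look unchanged when only the living donor is tested, which is the reason several descendants of the same ancestor are needed to sort out fast markers.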

Many YSTRs are just too short (under eight repeats) to be reliable. Other YSTRs mutate extremely fast and are too volatile for genealogical usage. However, we may need to test the faster mutating YSTRs to have enough genetic information to connect our ancestors. Also, there are only around 400 to 500 YSTRs that could be useful to genealogists. Many scientists believe that 111 Y-STR markers may be an upper limit that can be safely used with accuracy, and they also believe that the accuracy of YSTR analysis is limited to only 300 to 900 years. YSTR tests include a process called YDNA enrichment. This enrichment process basically separates out the YDNA from the other 22 chromosomes and the X chromosome. Enrichment drives up the cost of YDNA tests. YDNA testing is not widely used by the medical community, so our genetic genealogy community gains very little from the massive medical testing. Within a year or two, all NGS tests and YSTR tests will be replaced with Whole Genome Sequencing (WGS), as WGS costs are not much higher than current low resolution NGS costs due to massive medical research driving down costs for WGS technology. As WGS testing costs continue downward, the 111 marker test by itself will also cost more than the WGS test. The WGS test covers all data that is interesting to genetic genealogy: high resolution NGS testing for YSNPs, 500 YSTRs, full mtDNA, XDNA and 10,000 times the atDNA coverage as well (which may not help much but would drive analysis costs beyond what is currently feasible).

VARIATIONS IN TYPES OF YDNA MARKERS

In order to understand YSTR markers, you need to know that there are several variations of YSTRs. YSTRs mutate at radically different rates, and marker values range from rare to common. It is always much better to discover a mutation of a YSTR that rarely mutates, since other lines are not as likely to have parallel mutations due to its slow mutation rate. Some marker values are found in less than one percent of your haplogroup. Having a very unique value also provides a unique signature associated with your part of the haplotree. There are also several different kinds of YSTR markers.

The first YSTR variation is called a multi-copy YSTR marker, where the strings of rungs randomly switch positions within a certain defined area. These markers are listed in low to high order of the number of repeated strings, as it is not possible to identify which Y-STR sequence is mutating. These multi-copy markers always have a small alphabetic letter appended at the end of the marker number (e.g., CDYa and CDYb or 464a, 464b, 464c and 464d). Multi-copy markers not only have this variable switching of positions but also have two other variations.
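The low to high reporting convention can be sketched in a couple of lines. The values and the function name `reported_values` are invented for illustration.

```python
def reported_values(copies):
    """Report the copies of a multi-copy marker in low to high order."""
    return sorted(copies)

# Two hypothetical physical arrangements of the four copies of marker 464
arrangement_1 = [15, 13, 17, 13]
arrangement_2 = [13, 17, 13, 15]

print("-".join(str(v) for v in reported_values(arrangement_1)))
print(reported_values(arrangement_1) == reported_values(arrangement_2))
```

Both arrangements produce the same report, so the a, b, c and d suffixes describe sorted positions rather than fixed locations on the strand.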

A second type of YSTR variation is another multi-copy only variation. Multi-copy markers can sometimes have extra YSTR sequences added (extra sequences of strings of rungs). YSTR markers 19 and 464 are the most common markers where extra YSTR sequences can appear. This is usually a temporary YDNA signature, as these extra YSTR sequences tend to be deleted after several generations but can remain for many generations. Your test could come back with four, six, eight and even twelve values for 464.

A third type of YSTR variation is yet another variation of multi-copy markers. This variation occurs when the chemical makeup (GATC) of these strings also changes. A special test is required to reveal this variation. Due to the myriad of changes possible, multi-copy markers can provide a wealth of information but are more complex to analyze. There is discussion that these complex multi-copy markers may be excluded in the future and could be replaced by less complicated YSTR markers as they become available over time. However, genealogists are unlikely to throw out information any time soon, as it is not in our nature to ignore any information. Nature provided us with a lot of these types of markers, so we may have to deal with these more complex variations for some time.

A fourth type of YSTR variation happens when the entire YSTR sequence is deleted. This means that the entire variable portion of the Y-STR gets deleted. When a missing YSTR sequence is discovered, the marker value is reported as zero. These missing YSTRs are usually short lived YDNA signatures and are somewhat rare. Over several generations, these missing YSTR sequences will always reappear. Missing YSTRs only happen to regular YSTR markers that have no suffix attached. These short lived missing YSTR sequences provide a very unique temporary YDNA signature that can last for a few generations or many generations. This kind of variation occurs in less than one percent of submissions, but when it does happen, its presence provides a very unique YDNA signature for as long as the YSTR deletion persists from generation to generation.

A fifth type of YSTR has two variable portions within one fixed region. These YSTRs have 1 and 2 added at the end of the marker number to distinguish the values reported for the two variable sections of the YSTR sequence. For some reason, the scientists count the number of strings in the first part as the first marker value and the number of strings in the entire YSTR as the second marker value. This creates an analysis issue for genealogists who are counting mutations. If the first section of the YSTR shows one added string, the reported value for the entire YSTR also increases by one, even though there is only one mutation. Genealogists have to adjust the number of true mutations for this duplicate counting of the same mutation. For analysis purposes, most researchers modify the second marker value by subtracting the first marker value. This results in no mutation being counted twice and helps estimate time frames between submissions more accurately.
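The subtraction adjustment is simple arithmetic, sketched below. The marker values and function names are hypothetical; only the rule itself (second value minus first value) comes from the text.

```python
def adjust_two_part(first, combined):
    """Split a two-part marker into independent counts for each section."""
    return first, combined - first

def count_mutations(a, b):
    """Count differences between two (first, combined) results after adjusting."""
    a1, a2 = adjust_two_part(*a)
    b1, b2 = adjust_two_part(*b)
    return abs(a1 - b1) + abs(a2 - b2)

ancestor = (13, 29)    # hypothetical reported values for parts 1 and 2
descendant = (14, 30)  # one string added in the first section only

naive = abs(ancestor[0] - descendant[0]) + abs(ancestor[1] - descendant[1])
adjusted = count_mutations(ancestor, descendant)
print(f"naive count: {naive}, adjusted count: {adjusted}")
```

The naive comparison counts the single added string twice, while the adjusted comparison counts it once.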

UNDERSTANDING IMPORTANT FACTORS IN YDNA ANALYSIS

The analysis of YDNA submissions cannot be summarized in a few paragraphs and has some very complex issues. For those just starting out, you should rely on others to get you up to speed. There are many misconceptions and some very high expectations associated with YDNA testing. YDNA analysis is primarily an exercise in mathematics - probability theory, statistics, logic, pattern recognition, rules of thumb, etc. Having similar YSTR values with 67 markers does not always equate to being related (some groupings of DNA submissions have such common DNA values that they can overlap with many other lines). The opposite is also true: having five or six mutations does not always mean that the two submissions are not related (both submissions may have just beaten the odds and randomly mutated more than the average). Finding similar DNA values between two submissions is a positive sign that the submissions could be related, but there is never a certainty of a close relationship. False positive matches run around ten percent of 67 marker matches, while false negatives are only around three or four percent of matches. If your particular line has common marker values, the average rate of false hits can increase to 95 %.

There are many factors in analyzing YSTR submissions - much more than seeing how similar DNA submissions are. Finding similar YSTR values only affects the initial phase of the analysis of any grouping of related submissions. Once this grouping phase is completed, the actual mutations become the primary focus of the analysis. There is way too much focus on comparing the number of mutational differences between two submissions, which is not always a reliable comparison. You really need at least five to ten submissions (usually dominated by one surname) to start any meaningful analysis. Comparing only two submissions is similar to rolling the dice once and expecting a seven every time. Having similar YSTR values (few mutations) is regularly not enough to determine how closely related YSTR submissions will be. This kind of comparison is only reliable if your submissions are relatively genetically isolated, which often is not the case. The combination of similar YSTR values (few mutations) and similar surnames (with surname variations allowed) is a powerful combination that must be used together. The surname should be used as a filter for similar YSTR values. For some common surnames that have many genetic origins, even the matching surname / similar YDNA combination may not be reliable.

The rarity of the YSTR marker values in the submissions is proving to be a very important parameter of YSTR analysis. The more markers that have rare values, the higher the quality of the YSTR signature that can be used to separate your surname cluster from other submissions. There is also a huge variation in the rate of mutation for each marker, and mutation rates affect how reliable connections may be. Very fast mutating markers reveal a lot more information as they mutate more often, but these fast moving markers can also introduce more parallel mutations within the same genealogical cluster (the possibility of two independent mutations to the same marker value in the same cluster). Very rare surnames may have only one or two genetic origins, clan based names may have only a handful of genetic origins, and common surnames based on trades or places can have dozens (or even hundreds) of genetic origins.

The YSNPs of each donor can also reveal a lot about the grouping of submissions and have a major impact on genealogical analysis. Haplogroups are groups of submissions that share the same YSNP mutations. YSNPs are excellent resources to quickly separate submissions into related groupings. Determining the haplotype of your haplogroup is also very useful. The haplotype of an ancestor is merely our estimate of the marker values associated with our common ancestor, based on donors who are descendants of our "M"ost "R"ecent "C"ommon "A"ncestor. MRCA is merely another term for our oldest proven ancestor that is shared by known descendants who have tested. Comparing the haplotype of the MRCA of your haplogroup to that of the MRCA of your genealogical cluster defines "off modal" mutations from your haplogroup. These "off modal" mutations provide a YSTR signature of your surname cluster and can be used to help define the haplotype of the progenitor of your surname cluster. Knowing the YSTR signature of your surname cluster is also an excellent filter that can validate NPE connections (related lines with different surnames). These YSTR signatures are by far the highest quality methodology for determining relationships between testers.
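As a rough illustration of the "off modal" idea, the sketch below estimates a cluster's haplotype as the most common value of each marker and then lists where it departs from the haplogroup's modal values. The marker values are invented, the function names are hypothetical, and real projects use far more markers and submissions.

```python
from collections import Counter

def modal_haplotype(submissions):
    """Most common value of each marker across a list of {marker: value} dicts."""
    markers = submissions[0].keys()
    return {m: Counter(s[m] for s in submissions).most_common(1)[0][0] for m in markers}

def off_modal(cluster_modal, haplogroup_modal):
    """Markers where the cluster's modal value differs from the haplogroup's."""
    return {
        m: (haplogroup_modal[m], value)
        for m, value in cluster_modal.items()
        if haplogroup_modal[m] != value
    }

# Invented values for three markers, for illustration only
haplogroup_modal = {"DYS390": 24, "DYS391": 11, "DYS439": 11}
cluster = [
    {"DYS390": 24, "DYS391": 10, "DYS439": 11},
    {"DYS390": 24, "DYS391": 10, "DYS439": 12},  # one recent private mutation
    {"DYS390": 24, "DYS391": 10, "DYS439": 11},
]

signature = off_modal(modal_haplotype(cluster), haplogroup_modal)
print(signature)  # marker -> (haplogroup value, cluster value)
```

Here the shared change on one marker becomes the cluster's signature, while the single donor's extra mutation is outvoted and drops out of the estimate.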

YDNA is only available from living donors, and YDNA analysis attempts to estimate the YDNA of our ancestors based on what was passed down to their descendants. Multiple submissions from every well proven line may be required in order to safely assign mutations to the time frame near our oldest proven ancestors. Only around 25 % of mutations found within any surname cluster will be genealogically significant. Analysis must separate genealogically significant mutations (those very close to oldest proven ancestors) from recent mutations (those that happened in recent times near the donor). It has to be remembered that submissions are not the YDNA of their oldest proven ancestor - they are the YDNA of the donor, who is a distant descendant of our oldest proven ancestors. Mutations occur randomly through time and can happen at any time frame - anywhere from the donor to the great grandfather of our oldest proven ancestor. Multiple tests of the same oldest proven ancestor can reveal the time frame of these mutations. Only descendants of different sons or grandsons of our oldest proven ancestors should be tested, as more recent cousins are unlikely to filter out recent mutations.
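The reasoning for testing descendants of different sons can be sketched as a simple rule: a marker value carried by descendants of two or more sons most likely dates to the common ancestor or earlier, while a value confined to one son's descendants arose later. The sketch below applies only that rule and ignores parallel mutations; the son names, marker values and function name are hypothetical.

```python
from collections import defaultdict

def split_by_time_frame(donors):
    """donors: list of (son, {marker: value}) pairs.

    Returns (ancestral, later): marker values seen under two or more sons,
    and marker values confined to a single son's descendants.
    """
    branches = defaultdict(set)  # (marker, value) -> sons whose descendants carry it
    for son, haplotype in donors:
        for marker, value in haplotype.items():
            branches[(marker, value)].add(son)
    ancestral = {key for key, sons in branches.items() if len(sons) >= 2}
    later = {key for key, sons in branches.items() if len(sons) == 1}
    return ancestral, later

# Three hypothetical donors descending from two sons of the same proven ancestor
donors = [
    ("John", {"DYS391": 10, "DYS439": 11}),
    ("John", {"DYS391": 10, "DYS439": 12}),
    ("William", {"DYS391": 10, "DYS439": 11}),
]

ancestral, later = split_by_time_frame(donors)
print("at or above the ancestor:", sorted(ancestral))
print("below the ancestor:", sorted(later))
```

With only close cousins from one son's line, every value would fall under a single branch and nothing could be assigned to the ancestor, which is the point the paragraph makes.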

YDNA analysis cannot be done without well proven traditional documentation. Mutations of donors are translated into our earlier generations based primarily on traditional genealogical documentation. It is extremely important to be realistic about your ancestors, since speculation presented as well proven connections can pollute the genetic analysis. A lot of speculative connections can be supported or rejected based on YDNA evidence as well. The volunteers who analyze YDNA submissions can barely keep up with genetic analysis and should not be expected to get involved in solving traditional genealogical documentation issues of lines that are not even related to their lines. However, these volunteers have an obligation to refute speculative connections based on YDNA evidence, which is regularly done. Many speculative ancestries have been proven wrong by YDNA evidence, and many speculative connections have been strengthened by YDNA evidence. Also, never take genetic evidence as 100 % accurate, as there are many special issues with genetics just as there are with traditional genealogical research. Sometimes probate records are proven incomplete, and Family Bibles are sometimes more reliable resources but can also be unreliable due to transcription errors.