How To - Determining Clusters

Casey Surname DNA Project - Determining Clusters

Family of James Alfred Casey and Annette (Tucker) Casey, ca. 1899. First Row (left to right): James Alfred Casey, Bonnie Casey, Annette (Tucker) Casey, Linda Casey, Lucinda Casey, Columbus Casey. Second Row (left to right): Louis Casey, Ella Casey, Perry Casey, Maude Casey

The first major step in the analysis of DNA submissions is separating them into groupings that have some chance of being related. For fairly uncommon surnames with very few origins, this is not a difficult task, as there are only a handful of related groupings. The Casey DNA Project, with over 50 submissions, has discovered only a few genetic sources. Two of these genetic sources have non-Irish deep ancestry and are believed to be non-Casey DNA introduced by NPE events (adoptions, births out of wedlock or name changes). Two others are genealogical clusters (200 to 300 year time frame) that are believed to share a common male ancestor who first used the Casey surname around 600 years ago. One grouping of submissions probably includes two or three genealogical clusters, but the DNA is not distinct enough to reliably split this grouping into true genealogical clusters. There will be only a handful of genealogical clusters based on the early Casey clan leaders who first used the surname, and there will eventually be another handful of Casey genetic lines that originated from NPE events. The Casey surname is a clan-based surname with very few genetic origins. The other extreme is found with more common surnames based on occupations and geographic features (Smith, Cook, Brooks, Hill, etc.). These surnames can have dozens of genetic origins and dozens of new genetic lines caused by adoptions, births out of wedlock, etc.

The most common methodology, and the best first pass at separating submissions into possible groupings, is to use haplogroups (deep ancestry) to define the initial groupings. There are over 200 Y-SNP deep ancestries (haplogroups) available for separating submissions into groupings that should be analyzed together. If submissions do not share the same deep ancestry 2,000 to 20,000 years ago, then they cannot share recent ancestors 200 to 400 years ago (the primary time frame that is relevant to most genealogists). FTDNA provides an estimate of each submission's haplogroup based on the Y-STR marker values. However, this estimate can only predict the first few branches of man's known deep ancestry, and a separate Y-SNP test (deep clade) or a special-order test of an individual Y-SNP is required to determine the more recent branches.
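As a rough illustration of this first pass, the sketch below buckets submissions by haplogroup label. The kit identifiers and the `group_by_haplogroup` helper are hypothetical, invented for this example; the haplogroup labels are ones that appear in the Casey project. It assumes every label is reported at the same level of detail, which will not hold if some submissions only have a predicted haplogroup.

```python
from collections import defaultdict


def group_by_haplogroup(submissions):
    """Return {haplogroup: [kit ids]} as a first-pass separation."""
    groups = defaultdict(list)
    for kit_id, haplogroup in submissions:
        groups[haplogroup].append(kit_id)
    return dict(groups)


# Hypothetical kit ids paired with haplogroup labels (illustrative data only).
submissions = [
    ("kit-001", "R1b1a2"),
    ("kit-002", "E1b1a"),
    ("kit-003", "R1b1a2"),
    ("kit-004", "J2"),
]

first_pass = group_by_haplogroup(submissions)
for haplogroup, kits in first_pass.items():
    print(haplogroup, kits)
```

Each resulting bucket is only a starting point: it still has to be examined for the genealogical clusters it may contain.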

Separating submissions into groupings based on haplogroups is a good start, but this level of separation only divides mankind as of 2,000 to 20,000 years ago, which is much too early for genealogical purposes. There could be several genealogically significant clusters within each haplogroup grouping that need to be separated for analysis. FTDNA also provides a tool for administrators (and individual sponsors) known as an MRCA (Most Recent Common Ancestor) calculator. If you compare any two submissions, this tool returns the probability of their being related within the number of generations of interest.

For most genealogists, the primary input is around 12 generations (around 300 years at 25 years per generation). You do not need to be too concerned about whether it should be 10 generations or 15 generations, or whether the per-generation time span is 23 or 27 years. This tool is only being used as a sanity check for being related in a genealogically significant time frame. A well-defined cluster will have most submissions at probabilities between 80% and 100%. There are also those few submissions that just beat the normal odds and mutated more, so having a few submissions between 40% and 80% is quite acceptable as long as they do not make up a large portion of the submissions being analyzed.
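The arithmetic and the sanity check can be sketched as follows. The probabilities are assumed to have been read off the MRCA calculator by hand and entered as fractions; no vendor interface is being called. The 25% cap on "marginal" submissions and the rule that any submission below 40% fails the check are assumptions made for this example, since the text only says "a large portion".

```python
def generations_to_years(generations, years_per_generation=25):
    """Convert a generation count to an approximate number of years."""
    return generations * years_per_generation


def looks_like_cluster(probabilities, max_marginal_share=0.25):
    """Sanity-check a grouping from hand-entered MRCA probabilities (0.0-1.0).

    Assumptions for this sketch: values below 0.40 fail the check outright,
    and at most `max_marginal_share` of the values may fall in 0.40-0.80.
    """
    if not probabilities:
        return False
    if any(p < 0.40 for p in probabilities):
        return False
    marginal = [p for p in probabilities if p < 0.80]
    return len(marginal) / len(probabilities) <= max_marginal_share


print(generations_to_years(12))  # genealogical time frame
print(generations_to_years(24))  # surname-adoption time frame
print(looks_like_cluster([0.95, 0.92, 0.88, 0.85, 0.60]))
```

Changing `years_per_generation` to 23 or 27 shifts the result by only a few decades, which is why the exact figure matters little for a check of this kind.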

There is yet another time frame that is also important to genealogists. This is the time frame when our ancestors first started using surnames around 600 years ago. It is highly unlikely that many of us would ever find traditional documentation to support this DNA evidence, but analyzing at 24 generations (which is around 600 years at 25 years per generation) will greatly assist genealogists in analyzing the submissions in the 300 year time frame. Although rare, there can be two or more genealogical clusters that could share a common male ancestor when our ancestors first started using surnames.

But why would genealogists care about connecting clusters together that probably will never be connected with supporting traditional documentation? It is important because it is sometimes difficult to predict the DNA marker values of our ancestors based only on donors who are living today. The most common methodology to determine the DNA marker values of the MRCA of the cluster is called the "majority rules" methodology: for each marker, the value held by the majority of submissions is taken as the ancestral value. This assumes that our ancestors consistently had three or more sons (true most of the time, but there are obvious exceptions). It also assumes that every male descendant had several sons in every generation and that we have traced every line equally with submissions to properly represent all branches (very unlikely).
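A minimal sketch of "majority rules" follows, under the assumption that every submission has been tested on the same markers. The marker values are invented for illustration and `modal_haplotype` is a hypothetical helper name. Where two values tie for first place, the sketch returns `None` rather than guessing.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Pick the most common value per marker; None where the top values tie."""
    modal = {}
    for marker in haplotypes[0]:
        counts = Counter(h[marker] for h in haplotypes).most_common()
        top_value, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        modal[marker] = None if tied else top_value
    return modal


# Illustrative values only; not taken from any real submission.
haplotypes = [
    {"DYS393": 13, "DYS390": 24, "DYS460": 12},
    {"DYS393": 13, "DYS390": 24, "DYS460": 13},
    {"DYS393": 13, "DYS390": 25, "DYS460": 12},
    {"DYS393": 14, "DYS390": 24, "DYS460": 13},
]

modal = modal_haplotype(haplotypes)
print(modal)
```

The tie on the last marker shows the weakness of the method: when submissions split evenly, the majority gives no answer and outside evidence is needed.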

If it is determined that two genealogical clusters could share a common male ancestor, this information can greatly help fine-tune the "majority rules" MRCA methodology. If they share a common ancestor, then the haplotype of that common ancestor would lie somewhere between the MRCA haplotypes of the two separate clusters. It is very likely that the South Carolina Casey cluster and the Munster, Ireland Casey cluster share a common male ancestor. Comparing the MRCA haplotypes of the two clusters allows some corrections to the "majority rules" methodology, which is prone to errors. It also allows those attempting to define branches to put these branches in the correct order in time (which branch occurred first).

One practical impact of analyzing two clusters sharing a common early male ancestor is determining the first major branch of the South Carolina Casey cluster. About half of the submissions have 460=12 and the other half have 460=13. Without a dominant value for marker 460, choosing which value represents the MRCA haplotype would be a guess. The Munster, Ireland cluster has 460=11, which implies that 460=12 is probably the marker value for the MRCA of the South Carolina cluster, since 12 is only one step from 11 while 13 is two steps away. Another supporting factor is the rarity of the 460 marker values for the South Carolina cluster. Marker value 460=13 is extremely rare and 460=12 is rare as well.
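The reasoning for marker 460 can be written as a small tie-breaker: among the candidate values, prefer the one closest to the value seen in the related cluster. This assumes that marker values usually change one step at a time, so fewer steps means a more likely ancestral value. `resolve_tie` is a hypothetical helper for this sketch, and it returns `None` when the related cluster does not favor either candidate.

```python
def resolve_tie(candidates, related_cluster_value):
    """Choose the candidate value nearest to the related cluster's value."""
    ranked = sorted(candidates, key=lambda v: abs(v - related_cluster_value))
    if len(ranked) > 1:
        best = abs(ranked[0] - related_cluster_value)
        runner_up = abs(ranked[1] - related_cluster_value)
        if best == runner_up:
            return None  # the related cluster does not break the tie
    return ranked[0]


# South Carolina cluster is split between 12 and 13; Munster cluster has 11.
ancestral_460 = resolve_tie([12, 13], related_cluster_value=11)
print(ancestral_460)
```

This is only one line of evidence. The rarity of the values involved, as discussed above, should be weighed alongside it.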

Many DNA Project administrators struggle with the time-intensive exercise of determining groupings of submissions that appear to be related. Most clusters are fairly obvious to define; however, many "groupings" of submissions are not true genealogical clusters. These groupings usually include submissions whose marker values are simply too common within their haplogroup to yield a DNA fingerprint for a distinct genealogical cluster. Even with 67 markers tested, many submissions appear to be more closely related than they actually are. Having similar DNA (few mutations between two submissions) does not always mean being closely related. If two submissions have very similar DNA but different haplogroups, they cannot be as closely related as the FTDNA MRCA utility suggests.
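The check described in the last sentence can be sketched as below. The distance here is a simplified step count over shared markers, not the exact calculation used by any testing company, and the `max_distance` cutoff of 4 is an arbitrary assumption for illustration. The haplogroup comparison also assumes both labels come from tests at the same level of detail. All marker values are invented.

```python
def genetic_distance(markers_a, markers_b):
    """Sum of step differences across the markers both submissions share."""
    shared = markers_a.keys() & markers_b.keys()
    return sum(abs(markers_a[m] - markers_b[m]) for m in shared)


def is_misleading_match(sub_a, sub_b, max_distance=4):
    """True when two submissions look close but have different haplogroups."""
    close = genetic_distance(sub_a["markers"], sub_b["markers"]) <= max_distance
    return close and sub_a["haplogroup"] != sub_b["haplogroup"]


# Illustrative submissions; values are not from real kits.
sub_a = {"haplogroup": "R1b1a2", "markers": {"DYS393": 13, "DYS390": 24, "DYS460": 11}}
sub_b = {"haplogroup": "J2", "markers": {"DYS393": 13, "DYS390": 25, "DYS460": 11}}
sub_c = {"haplogroup": "R1b1a2", "markers": {"DYS393": 13, "DYS390": 24, "DYS460": 11}}

print(genetic_distance(sub_a["markers"], sub_b["markers"]))
print(is_misleading_match(sub_a, sub_b))
print(is_misleading_match(sub_a, sub_c))
```

A pair flagged this way should be placed in separate groupings no matter how few mutations separate the two submissions.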

The scenario of having common DNA values but not being related has been labeled by Mark Jobling as "overlapping haplotypes." There are three approaches to separating submissions with common DNA marker values. First, you can test deep ancestry and separate submissions into multiple groupings on that basis. Submissions that do not have common deep ancestry cannot have common ancestors in the last 1,000 years, even if they differ by only a few mutations at 67 markers. A second approach is to continue to upgrade to 67 or 111 markers. Eventually, submissions with more markers tested will pick up more rare DNA marker values that give the genealogical cluster a more distinctive DNA fingerprint. The third approach is to recruit (or wait for) more submissions that will eventually match other submissions in the grouping.

The R1b1a2 Casey "grouping" has fewer rare marker values and does not appear to be a genealogical cluster. Existing 37 and 67 marker submissions should order deep ancestry (deep clade) tests or upgrade to 67 or 111 markers. This grouping needs more rare marker values or different haplogroups to better define any genealogical clusters. The R1b1a2 grouping simply has too many submissions that are somewhat related according to the MRCA utility to be a single genealogical grouping (many could belong to a common cluster, but it is hard to separate those that do not belong together without additional DNA testing).

There are definitely four well-defined genealogical clusters found in the Casey DNA Project: the South Carolina cluster and the Munster, Ireland cluster, as well as the two clusters with non-Irish deep ancestries (E1b1a and J2 haplogroups). It is amazing that all remaining submissions belong to the R1b1a2 grouping. This grouping has just too much variation of DNA marker values to be just one genealogical grouping but not enough variation to separate this grouping into multiple genealogical groupings. The submissions in the R1b1a2 grouping will remain challenging to analyze until more deep ancestry tests are ordered, more upgrades for markers are ordered (67 and 111 markers) and more submissions become available with a minimum of 37 markers tested.

There are two options for definition of clusters. I personally define "genealogical" clusters as clusters that have a reasonable chance of being connected in a time frame near our oldest proven ancestors. This means that "genealogical" clusters represent groupings of lines that could be connected within two or three generations earlier than our oldest proven ancestors. This is usually 200 or 300 years in the past (obviously some projects may be more fortunate to have proven ancestry before this time frame and may have to adjust the time frame for a genealogical cluster).

Another methodology for determining clusters could be to group together all submissions that could have descended from a common male ancestor when our Irish ancestors first started using surnames. I personally define these clusters as "genetic" clusters, as they could share a common genetic male ancestor from when our ancestors first started using surnames around 600 years ago. For genealogical research purposes, it is not practical to analyze submissions for connections within genetic clusters, as lines have very low odds of being connected six to eight generations prior to our oldest proven ancestors. We may prove genetic branches but would be very unlikely to put names on our ancestors in that time frame due to the lack of traditional documentation to enhance the DNA evidence.

Most researchers want groupings defined in terms that are significant to their genealogical research; therefore, I only create clusters that are genealogically significant. The advantage of this approach is that there will always be some possibility of determining genealogical relationships between the currently unconnected lines within a cluster. It also implies that lines in other genealogical clusters are not worth researching if you are attempting to connect your own line. Researchers of the South Carolina cluster should not really be interested in the research of the Munster, Ireland cluster, since the connection between these two clusters is five to ten generations prior to any of our oldest proven ancestors.

The approach of creating only clusters that have genealogical significance has significant drawbacks. For instance, the South Carolina cluster and the Munster, Ireland cluster appear to have a reasonable chance of being part of the same genetic cluster (having a common male ancestor with the same surname). This possible connection between these two clusters is extremely important to both clusters. If these clusters are genetically tied together, then they would have a common ancestor whose DNA marker values must descend in a logical manner from their common haplogroup. To further add to the confusion, it is very possible that, with today's limited number of DNA markers being analyzed, many clusters could overlap because too few markers are available (yes, this does mean we need more than 67 markers, and maybe 111 markers will reduce this problem).