This vignette will help you analyse haplotype data by dealing with the inflation of rare alleles due to low-frequency genotyping errors.
If you have any issue with it please let me know through my github page: https://github.com/alethere/hapgroup
There are two essential pieces of information necessary for this package to work:
ploidy
columns (so for a diploid, there will be two columns
per individual).There is a third piece of information that is quite useful, a “position” data.frame(), which contains marker information. This data.frame must have columns “POS”, “CHROM”, “REF” and “ALT”.
These three datasets are provided as example data:
exPhase
, exInds
, exPos
.
#Table of phases
knitr::kable(exPhase[1:5,1:10])
V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
#Vector of individual names
exInds[1:5]
## [1] "Ind0001" "Ind0002" "Ind0003" "Ind0004" "Ind0005"
#Position data.frame
knitr::kable(exPos[1:5,])
CHROM | POS | REF | ALT |
---|---|---|---|
chr1 | 42755 | T | C |
chr1 | 47251 | G | A |
chr1 | 47258 | C | T |
chr1 | 48755 | T | C |
chr1 | 48757 | A | G |
Haploblocks are defined as groups of markers that are “blocked” or
“joined together”, which combined can be turned into haplotype alleles.
These can be defined in many ways and you can choose your own way. The
hapBlock
object performs haplotyping one block at a time,
so defining which markers fall into a haploblock is totally up to you as
the user.
I do provide a simple function that should help you with some easy
definitions of haploblocks, the calcBlock()
functions. This
function can divide markers into blocks based on:
n
). Every
n
markers a new block is started.window
). Starting from the
first marker, all markers within a window of size window
are grouped into a block.You can choose to exclude blocks below a number of markers of
exclude.size
.
To compute these blocks you must provide a position data.frame as described above.
The result will be a list of data.frames (a split data.frame), with
an added column index
which can be useful to subset the
phase
data.
#For example, we can compute blocks with a window size
winBlocks <- calcBlocks(exPos, window = 20000, exclude.size = 2)
## A total of 37 haploblocks found.
## Aerage window size = 16526.73
## Average markers per block = 33.11
## 3 blocks with <3 markers excluded
winBlocks[1:3]
## $chr1.0
## CHROM POS REF ALT index
## 1 chr1 42755 T C 1
## 2 chr1 47251 G A 2
## 3 chr1 47258 C T 3
## 4 chr1 48755 T C 4
## 5 chr1 48757 A G 5
## 6 chr1 48763 A G 6
## 7 chr1 49459 C T 7
## 8 chr1 50326 G A 8
## 9 chr1 50679 G C 9
## 10 chr1 50855 C A 10
## 11 chr1 50856 C T 11
## 12 chr1 60772 C A 12
## 13 chr1 60776 G A 13
## 14 chr1 61157 T C 14
##
## $chr1.1
## CHROM POS REF ALT index
## 15 chr1 65992 G A 15
## 16 chr1 66801 C T 16
## 17 chr1 66804 A G 17
## 18 chr1 67120 C T 18
## 19 chr1 67123 A G 19
## 20 chr1 67521 T C 20
## 21 chr1 67541 G A 21
## 22 chr1 68254 G A 22
## 23 chr1 74706 A C 23
## 24 chr1 74717 C T 24
## 25 chr1 74727 C G 25
## 26 chr1 77829 A G 26
## 27 chr1 77850 G A 27
## 28 chr1 77861 G A 28
## 29 chr1 78974 C A 29
## 30 chr1 79379 A G 30
##
## $chr1.2
## CHROM POS REF ALT index
## 31 chr1 93353 C T 31
## 32 chr1 94496 C T 32
## 33 chr1 96931 C T 33
## 34 chr1 97580 T C 34
## 35 chr1 97585 T A 35
## 36 chr1 98586 A G 36
## 37 chr1 100710 A G 37
## 38 chr1 100713 A G 38
## 39 chr1 100980 G T 39
## 40 chr1 101082 T C 40
## 41 chr1 101103 G T 41
#We can also compute blocks specifying a specific number of markers within a window
#I also exclude all those below the chosen size
nBlocks <- calcBlocks(exPos, n = 10, exclude.size = 9)
## A total of 121 haploblocks found.
## Aerage window size = 6915.64
## Average markers per block = 10.00
## 3 blocks with <10 markers excluded
nBlocks[1:3]
## $chr1.0
## CHROM POS REF ALT index
## 1 chr1 42755 T C 1
## 2 chr1 47251 G A 2
## 3 chr1 47258 C T 3
## 4 chr1 48755 T C 4
## 5 chr1 48757 A G 5
## 6 chr1 48763 A G 6
## 7 chr1 49459 C T 7
## 8 chr1 50326 G A 8
## 9 chr1 50679 G C 9
## 10 chr1 50855 C A 10
##
## $chr1.1
## CHROM POS REF ALT index
## 11 chr1 50856 C T 11
## 12 chr1 60772 C A 12
## 13 chr1 60776 G A 13
## 14 chr1 61157 T C 14
## 15 chr1 65992 G A 15
## 16 chr1 66801 C T 16
## 17 chr1 66804 A G 17
## 18 chr1 67120 C T 18
## 19 chr1 67123 A G 19
## 20 chr1 67521 T C 20
##
## $chr1.2
## CHROM POS REF ALT index
## 21 chr1 67541 G A 21
## 22 chr1 68254 G A 22
## 23 chr1 74706 A C 23
## 24 chr1 74717 C T 24
## 25 chr1 74727 C G 25
## 26 chr1 77829 A G 26
## 27 chr1 77850 G A 27
## 28 chr1 77861 G A 28
## 29 chr1 78974 C A 29
## 30 chr1 79379 A G 30
You may notice that window sizes do not exaclty match what we asked
for: this is unavoidable. If you set a start at position x
and you take all markers from x
to x + 10000
the last marker will not be at a distance of 10000
from the
first marker (because the first marker is not at x
and the
last marker is not at x + 10000
).
The bread and butter of this package is the R6 class
hapBlock
. You may not have used R6 classes before but they
are quite simple to use (you can find more information here ). They
are basically a list, with some of its items being functions (which are
often called methods), and some hidden items (containing all the
provided data).
Let us look at a single, small haploblock.
b <- winBlocks[[4]]
#hapBlock objects (and R6 objects in general) are always started with the function new()
myBlock <- hapBlock$new(phase = exPhase[b$index,], inds = exInds, ploidy = 2, pos = b)
myBlock
## Haploblock object with 31 markers for 1229 individuals (with ploidy 2 = 2458 alleles)
This haploblock right now only contains the alleles as we have
defined them, no grouping has been performed yet. To perform the
grouping we must use the haplogroup()
method. This will
group alleles based on an expected error rate (by default 0.01 or
1%).
myBlock$haplogroup()
## With a sequencing error of 0.01 the expected number of mismatches is: 0.61 (CI: -0.91 , 2.13) out of 31
## Number of haplogroups detected: 12
We can check the expected number of mismatches given one (or more)
error rates using the mismatch()
method.
myBlock$mismatch(c(0.01,0.02,0.05,0.1))
## lc exp uc
## 1 -0.9064915 0.6138 2.134091
## 2 -0.9026561 1.2152 3.333056
## 3 -0.2548013 2.9450 6.144801
## 4 1.3874322 5.5800 9.772568
By default, the upper boundary of the CI is used.
For this block, this is 2.13 out of 31. We can override this if
necessary with the argument mismatch
myBlock$haplogroup(mismatch = 2)
## Using provided number of mismatches: 2 out of 31
## Number of haplogroups detected: 12
You can obtain some information about each haplogroup using the
function haplostats()
. The result provides a data.frame
with the following columns:
myBlock$haplostats()
## group meanD nvars groupFreq consensus
## 1 H01 2.780220 14 1658 1101001001111001011111100100111
## 2 H02 2.757576 12 428 0000000100000000000000000000000
## 3 H03 2.363636 12 132 1101001001011001000001000100011
## 4 H04 3.309942 19 111 111101111101101100010101N101011
## 5 H05 2.666667 9 48 1101001111011001100101000110011
## 6 H06 1.666667 4 25 100000100N0010010000000001000N0
## 7 H07 2.071429 8 23 1101001001111001000111100100N11
## 8 H08 1.600000 5 22 1101001111011001000101000100011
## 9 H09 1.333333 3 8 N0N1001001111001011111100100111
## 10 H10 0.000000 1 1 0000000100000000000101011000000
## 11 H11 0.000000 1 1 0001011111011001000101010101011
## 12 H12 0.000000 1 1 1101000101011001000000000000011
## freqVar freq
## 1 1101001001111001011111100100111 0.9517491
## 2 0000000100000000000000000000000 0.7640187
## 3 1101001001011001000001000100011 0.6893939
## 4 1111011111011011000101011101011 0.3783784
## 5 1101001111011001100101000110011 0.5208333
## 6 1000001001001001000000000100010 0.5200000
## 7 1101001001111001000111100100111 0.3913043
## 8 1101001111011001000101000100011 0.6818182
## 9 0001001001111001011111100100111 0.3750000
## 10 0000000100000000000101011000000 1.0000000
## 11 0001011111011001000101010101011 1.0000000
## 12 1101000101011001000000000000011 1.0000000
You can also visualize the grouping of the found variants using the
plotGrouping()
method. You may need to adjust the vertex
size:
par(mar = c(0,0,0,2))
myBlock$plotGrouping(vertex.size = 8)
Notice that haplogroups that represent a single variant are not plotted in the clustering graph plot.
Once groups have been established (or before), you may want to obtain the specific alleles for each individuals. You can obtain these alleles in several forms:
haplogroup()
. For
example: H01, H02, H03… Haplotype numbers are based on frequency (H01 is
always the most frequent).buildSequences()
. For example:
“AACCTTTG”, “AACCATTG”, “AACCATTC”…#Notice that each individual has two entries, corresponding to its two alleles
#These are alleles in binary code
myBlock$alleles("al")[1:8]
## Ind0001 Ind0001
## "1101001001111001011111100100011" "1101001001111001011111100100111"
## Ind0002 Ind0002
## "0000000100000000000000000000000" "0000000100000000000000000000000"
## Ind0003 Ind0003
## "1101001001111001011111100100111" "1101001001111001011111100100111"
## Ind0004 Ind0004
## "1101001001111001011111100100111" "1101001001111001011111100100111"
#In variant codes
myBlock$alleles("var")[1:20]
## Ind0001 Ind0001 Ind0002 Ind0002 Ind0003 Ind0003 Ind0004 Ind0004 Ind0005 Ind0005
## "V53" "V56" "V03" "V03" "V56" "V56" "V56" "V56" "V53" "V56"
## Ind0006 Ind0006 Ind0007 Ind0007 Ind0008 Ind0008 Ind0009 Ind0009 Ind0010 Ind0010
## "V56" "V56" "V56" "V56" "V56" "V56" "V56" "V56" "V56" "V56"
#In haplotype code
myBlock$alleles("hap")[1:20]
## Ind0001 Ind0001 Ind0002 Ind0002 Ind0003 Ind0003 Ind0004 Ind0004 Ind0005 Ind0005
## "H01" "H01" "H02" "H02" "H01" "H01" "H01" "H01" "H01" "H01"
## Ind0006 Ind0006 Ind0007 Ind0007 Ind0008 Ind0008 Ind0009 Ind0009 Ind0010 Ind0010
## "H01" "H01" "H01" "H01" "H01" "H01" "H01" "H01" "H01" "H01"
#In sequence code we cannot get it yet
myBlock$alleles("seq")[1:10]
## Warning in myBlock$alleles("seq"): Sequences cannot be computed. Please run the 'buildSequences' method.
## Returning variant alleles.
## Ind0001 Ind0001 Ind0002 Ind0002 Ind0003 Ind0003 Ind0004 Ind0004 Ind0005 Ind0005
## "V53" "V56" "V03" "V03" "V56" "V56" "V56" "V56" "V53" "V56"
To see the “sequence” alleles directly we must provide a template sequence. In this case we will just use a toy example full of “_” so that you can easily see what is happening under the hood.
start <- min(b$POS)
end <- max(b$POS)
seq <- paste(rep("_", end - start + 10), collapse = "")
myBlock$buildSequences(sequence = seq, start = start - 9)
## New sequences can be found in the 'codes' data.frame.
## Use alleles('sequence') to obtain allele sequences for each individual
Now we can obtain the sequence allele, you can see that simply, in each position the reference allele is changed by the alternative allele.
res <- myBlock$alleles("seq")[1]
#I split it because it is too long
res <- sapply(seq(1,to = nchar(res), by = 100),function(i) substr(res,i, i + 99))
cat(unname(res),sep = "\n")
## _________C_A___C____________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ________________________________________________C___________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ________________C_________________________________________G_________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## _____________________________________________________________G_______________________C______________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ______________T_T___________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## __________________A_____A___________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ___________________________________________________________________________________________G________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## _________________________________________________________C__________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ______________________________G_____________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________C_____G_____________T___________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## _________________________________________________________________________________________________A__
## _A__________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## _______________________________________________________________A____________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## _________________________________________________________T__TG______________________________________
## _______________________________________________________________________G__________________________T_
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## _______________________________________________________________A____________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________G_______________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## _____________________________________________________________________________________________C______
## _____________G______________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ____________________________________________________________________________________________________
## ________________________________T
There is something called an MST plot, or maximum spanning tree,
which is a type of graph often used in haplotype comparisons. I worked a
bit on the visualization, although I’m not very happy with it (thus why
it is not a part of the R6 hapBlock
class). You can use it
as shown below, if you give it some alleles you care about (I often
remove the low-frequency alleles).
Several of the parameters must be adjusted manually (especially
vertex.size
and label.size
). Play with it if
you’re not satisfied with the result (or find a different
visualization).
stats <- myBlock$haplostats()
alleles <- stats$freqVar[stats$groupFreq > 1]
names(alleles) <- stats$group[stats$groupFreq > 1]
freqs <- stats$freq[ stats$groupFreq > 1]
haploMST(alleles = alleles, allele.freq = freqs,vertex.size = 30, label.size = 0.8)
You can retrieve several pieces of information from the
hapBlock
object which may be useful to you if you are doing
analysis.
#This is the distance matrix between observed variants
myBlock$D[1:10,1:10]
## V01 V02 V03 V04 V05 V06 V07 V08 V09 V10
## V01 0 1 1 2 2 2 2 3 5 3
## V02 1 0 2 3 1 3 3 4 6 4
## V03 1 2 0 1 1 1 1 2 4 2
## V04 2 3 1 0 2 2 2 3 5 3
## V05 2 1 1 2 0 2 2 3 5 3
## V06 2 3 1 2 2 0 2 1 5 1
## V07 2 3 1 2 2 2 0 1 5 3
## V08 3 4 2 3 3 1 1 0 6 2
## V09 5 6 4 5 5 5 5 6 0 6
## V10 3 4 2 3 3 1 3 2 6 0
#The list of ploidy-repeated alleles
myBlock$inds[1:10]
## [1] "Ind0001" "Ind0001" "Ind0002" "Ind0002" "Ind0003" "Ind0003" "Ind0004"
## [8] "Ind0004" "Ind0005" "Ind0005"
#The codes dataframe which relates alleles to vairants (and other information)
#Here I truncate the sequence column, otherwise it's too long
codes <- myBlock$codes[1:10,]
codes$sequence <- substr(codes$sequence,1,15)
knitr::kable(codes)
allele | variant | group | haplo | variantFreq | sequence |
---|---|---|---|---|---|
0000000000000000000000000000000 | V01 | 4 | H02 | 65 | _________T_G___ |
0000000000000000000000000000010 | V02 | 4 | H02 | 2 | _________T_G___ |
0000000100000000000000000000000 | V03 | 4 | H02 | 327 | _________T_G___ |
0000000100000000000000000000001 | V04 | 4 | H02 | 3 | _________T_G___ |
0000000100000000000000000000010 | V05 | 4 | H02 | 13 | _________T_G___ |
0000000100000000000000000100000 | V06 | 4 | H02 | 5 | _________T_G___ |
0000000100000000000000100000000 | V07 | 4 | H02 | 4 | _________T_G___ |
0000000100000000000000100100000 | V08 | 4 | H02 | 1 | _________T_G___ |
0000000100000000000101011000000 | V09 | 10 | H10 | 1 | _________T_G___ |
0000000100001000000000000100000 | V10 | 4 | H02 | 1 | _________T_G___ |
#The phase data if you need it
knitr::kable(myBlock$phase[1:10,1:10])
V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | |
---|---|---|---|---|---|---|---|---|---|---|
42 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
43 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
44 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
45 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
46 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
48 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
49 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
51 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
Although I do not plan to develop this further at the moment, I leave these ideas here in case someone (or my future self) wants to develop additional things.
%dopar%
loop and returning the relevant results (not the
whole hapBlock
object.)