Supporting Online Tables for "Chromatin-associated periodicity in genetic variation downstream of transcriptional start sites."
Substitution Models.
For the bidirectional and transcribed-strand unidirectional substitution models, we counted the number of occurrences of each individual substitution in the alignments between the Hd-rR and HNI genomes and assigned that number to sub(LXR, LYR). The values of sub(LXR, LYR) for all pairs of LXR and LYR can be found in the following tables in CSV format. In the tables, the rows represent LXR and the columns represent LYR.
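As a rough illustration of the row/column layout described above, the following Python sketch reads such a table into a nested dictionary. It assumes a layout the text does not spell out: a header row listing the LYR labels and a first column listing the LXR labels. The function name load_sub_table and the toy counts are hypothetical and are not taken from the published tables.

```python
import csv
import io

def load_sub_table(handle):
    """Read a CSV table into {LXR: {LYR: count}}.

    Assumed layout (not confirmed by the source): the first row holds the
    LYR column labels after an empty corner cell, and the first cell of
    every later row holds the LXR row label.
    """
    reader = csv.reader(handle)
    header = next(reader)
    col_labels = header[1:]  # skip the corner cell
    table = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        lxr = row[0]
        table[lxr] = {lyr: int(v) for lyr, v in zip(col_labels, row[1:])}
    return table

# Toy data with made-up counts, for illustration only.
toy_csv = io.StringIO(
    ",AAA,ACA,AGA,ATA\n"
    "AAA,1000,5,20,4\n"
    "ACA,6,900,3,30\n"
)
sub = load_sub_table(toy_csv)
print(sub["AAA"]["AGA"])  # sub(AAA, AGA) in the toy table
```

With a real table, the io.StringIO object would be replaced by a file opened with open(path, newline="").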
Bidirectional substitution model ( |LXR|=3, |L|=|R|=1 )
Bidirectional substitution model ( |LXR|=5, |L|=|R|=2 )
Transcribed strand unidirectional substitution model ( |LXR|=3, |L|=|R|=1 )
Transcribed strand unidirectional substitution model ( |LXR|=5, |L|=|R|=2 )
To estimate the substitution rate at the position of X, denoted by subRate(LXR), the sum of sub(LXR, LYR) over all nucleotides Y other than X is divided by N(LXR), the number of occurrences of LXR and its reverse complement RXL (or, in the case of the transcribed strand unidirectional substitution model, the number of occurrences of LXR alone) in the alignments between the Hd-rR and HNI genomes:
subRate(LXR) = Σ { sub(LXR, LYR) | Y is a nucleotide other than X } / N(LXR).
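The ratio above can be sketched in Python as follows. This is a minimal illustration, assuming the counts are held in a nested dictionary {LXR: {LYR: count}} and that |L| = |R|, so that X is the central base. The function name sub_rate and the toy counts are hypothetical.

```python
def sub_rate(sub, lxr, n_lxr):
    """subRate(LXR): substitutions that change the central base X,
    divided by N(LXR).

    Assumes |L| = |R|, so X sits at the middle of the string.
    """
    mid = len(lxr) // 2
    left, x, right = lxr[:mid], lxr[mid], lxr[mid + 1:]
    # Sum sub(LXR, LYR) over the three nucleotides Y different from X.
    changed = sum(
        sub[lxr].get(left + y + right, 0) for y in "ACGT" if y != x
    )
    return changed / n_lxr

# Toy counts, for illustration only.
toy_sub = {"AAA": {"AAA": 971, "ACA": 5, "AGA": 20, "ATA": 4}}
rate = sub_rate(toy_sub, "AAA", n_lxr=1000)
print(rate)
```

Here N(LXR) is passed in explicitly, so the sketch does not depend on how that denominator is obtained.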
To compute N(LXR), we need to consider all alignments of LXR that may involve insertions, deletions, and substitutions; however, the combination of these mutations makes it difficult to enumerate the occurrences of LXR. To resolve this issue, we use the fact that the majority of these alignments represent perfect matches or substitutions of LXR, while indel frequencies are typically less than 1%. Thus, we approximate N(LXR) as
N(LXR) ≈ Σ { sub(LXR, M) | M is a nucleotide string of the same length as LXR }.
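In code, this approximation amounts to summing one row of the count table, including the entry for a perfect match. The sketch below assumes the row for LXR is stored as a dictionary {M: count}. The name approx_n and the toy counts are hypothetical, and the toy row lists only a few strings M rather than every string of the same length.

```python
def approx_n(sub, lxr):
    """Approximate N(LXR) as the sum of sub(LXR, M) over every M in the
    row for LXR, perfect matches (M == LXR) included."""
    return sum(sub[lxr].values())

# Toy counts, for illustration only.
toy_sub = {"AAA": {"AAA": 971, "ACA": 5, "AGA": 20, "ATA": 4}}
n_aaa = approx_n(toy_sub, "AAA")
print(n_aaa)  # 971 + 5 + 20 + 4
```

Because indels are excluded from the row, the sum slightly undercounts the true number of occurrences, which is the trade-off the text accepts.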
The values of subRate(LXR) for the bidirectional substitution model ( |L|=|R|=1 or 2 ) can be found in the following table:
subRate(LXR) for the bidirectional substitution model ( |LXR|=3, |L|=|R|=1 or |LXR|=5, |L|=|R|=2 ) (MS Excel)
Indel Models.
The bidirectional indel model employs the above transformations and calculates the frequency of all occurrences of
LR => LX*R and LX*R => LR
together with their complements
RL => RX*L and RX*L => RL
in the alignments between the Hd-rR and HNI genomes. The model assigns the frequency to n(LR, l), where l is the length of X*. The indel rate at a position where LR occurs is estimated as
Σ { n(LR, l) | 1 ≤ l } / N(LR),
where N(LR) is the number of occurrences of LR and its reverse complement RL in the alignments between the Hd-rR and HNI genomes. As before, we approximate N(LR) as
N(LR) ≈ Σ { sub(LR, M) | M is a nucleotide string of the same length as LR }.
The 1bp indel rate is estimated by setting l to 1; namely,
n(LR, 1) / N(LR).
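The two estimates above can be sketched in Python as follows, assuming the counts n(LR, l) for one context LR are stored as a dictionary {l: count}. The names indel_rate and one_bp_indel_rate and the toy counts are hypothetical. The lower bound on l is left as an explicit parameter rather than fixed in the code.

```python
def indel_rate(n_counts, n_lr, min_len):
    """Sum n(LR, l) over all lengths l >= min_len and divide by N(LR)."""
    total = sum(c for length, c in n_counts.items() if length >= min_len)
    return total / n_lr

def one_bp_indel_rate(n_counts, n_lr):
    """n(LR, 1) / N(LR): the rate of indels of length exactly 1."""
    return n_counts.get(1, 0) / n_lr

# Toy counts for a single context LR, for illustration only.
toy_n = {1: 30, 2: 12, 3: 8}
toy_n_lr = 10000

rate_all = indel_rate(toy_n, toy_n_lr, min_len=1)   # every indel length
rate_long = indel_rate(toy_n, toy_n_lr, min_len=2)  # lengths above 1 only
rate_1bp = one_bp_indel_rate(toy_n, toy_n_lr)
print(rate_all, rate_long, rate_1bp)
```

By construction, the rate over all lengths equals the 1bp rate plus the rate for longer indels.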
The indel rates at positions where LR occurs can be found in the following table:
Indel rates estimated by the bidirectional indel model ( |LR|=4 or 6 ) (MS Excel)
The k-mer motif indel model searches a local region around an indel for a contiguous k-mer string that occurs significantly often around indels in the entire genome. We generate the model by profiling indels and their neighboring k-mer strings in the whole genome. For each k-mer motif string M (e.g., ATAG), let indel(M, l, d) be the number of indels of length l at position d relative to the occurrences of M. The probability that an indel of arbitrary length appears at position d relative to the occurrence of M is estimated as
( Σ { indel(M, l, d) | 1 ≤ l } ) / sub(M, M).
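A minimal Python sketch of this estimate follows. It assumes the profile is stored as a dictionary keyed by (M, l, d) and that the perfect-match counts sub(M, M) are stored in a second dictionary keyed by M. The name motif_indel_prob, both data structures, and the toy counts are hypothetical. The lower bound on l is again an explicit parameter.

```python
def motif_indel_prob(indel, sub_mm, motif, d, min_len):
    """Sum indel(M, l, d) over lengths l >= min_len, divided by sub(M, M).

    indel  : {(M, l, d): count} profile of indels around motif occurrences
    sub_mm : {M: sub(M, M)} perfect-match counts used as the denominator
    """
    total = sum(
        count
        for (m, length, pos), count in indel.items()
        if m == motif and pos == d and length >= min_len
    )
    return total / sub_mm[motif]

# Toy profile, for illustration only.
toy_indel = {
    ("ATAG", 1, 0): 12,
    ("ATAG", 2, 0): 3,
    ("ATAG", 1, -1): 7,
    ("CCCC", 1, 0): 9,
}
toy_sub_mm = {"ATAG": 5000, "CCCC": 4000}

p_at_0 = motif_indel_prob(toy_indel, toy_sub_mm, "ATAG", d=0, min_len=1)
p_at_minus1 = motif_indel_prob(toy_indel, toy_sub_mm, "ATAG", d=-1, min_len=1)
print(p_at_0, p_at_minus1)
```

Evaluating the function for every motif and every distance d would fill one table of the kind described in the text, with motifs as rows and distances as columns.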
The above ratios for |M|=3 and 4 are available in the following table, in which the rows represent all 3- and 4-mer strings for M and the columns indicate the distance d relative to the occurrence of M. The tables present the probabilities for 1 ≤ l, 1 = l, or 1 < l.
k-mer motif indel model (k=3 or 4; 1 ≤ l, 1 = l, or 1 < l) (MS Excel)