Supporting Online Tables for "Chromatin-associated periodicity in genetic variation downstream of transcriptional start sites."

 

Substitution Models.

 

For the bidirectional and transcribed strand unidirectional substitution models, we counted the number of occurrences of each individual substitution in the alignments between the Hd-rR and HNI genomes and assigned that number to sub(LXR, LYR). The values of sub(LXR, LYR) for all pairs of LXR and LYR can be found in the following tables in CSV format. In the tables, the rows represent LXR and the columns represent LYR.

 

Bidirectional substitution model ( |LXR|=3, |L|=|R|=1 )

Bidirectional substitution model ( |LXR|=5, |L|=|R|=2 )

Transcribed strand unidirectional substitution model ( |LXR|=3, |L|=|R|=1 )

Transcribed strand unidirectional substitution model ( |LXR|=5, |L|=|R|=2 )
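As a minimal sketch of how one of these CSV tables might be read into memory, the following Python snippet builds a nested dictionary sub[LXR][LYR]. The file layout is an assumption (a header row of column strings preceded by an empty corner cell, then one row per LXR), the function name load_sub_table is hypothetical, and the counts in the example are made up for illustration rather than taken from the actual tables.

```python
import csv
import io


def load_sub_table(handle):
    """Read a substitution-count table into a nested dict sub[LXR][LYR].

    Assumed layout (not confirmed by the source): the first row holds the
    column strings after an empty corner cell, and each later row starts
    with its row string followed by integer counts.
    """
    reader = csv.reader(handle)
    header = next(reader)
    columns = header[1:]
    sub = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        sub[row[0]] = {col: int(val) for col, val in zip(columns, row[1:])}
    return sub


# Tiny made-up excerpt standing in for a real table.
example = io.StringIO(",ACA,AGA\nACA,90,3\nAGA,2,95\n")
sub = load_sub_table(example)
print(sub["ACA"]["AGA"])
```

With a real file, the io.StringIO object would be replaced by an open file handle, for example open(path, newline="").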

 

To estimate the substitution rate at the position of X, denoted by subRate(LXR), we sum sub(LXR, LYR) over all nucleotides Y other than X and divide the sum by N(LXR), the number of occurrences of LXR and its reverse complement RXL (or the number of occurrences of LXR alone in the case of the transcribed strand unidirectional substitution model) in the alignments between the Hd-rR and HNI genomes:

              subRate(LXR) = Σ { sub(LXR, LYR) | Y is a nucleotide other than X } / N(LXR).

To compute N(LXR), we need to consider all alignments of LXR that may involve insertions, deletions, and substitutions; however, the combination of these mutations makes it difficult to enumerate the occurrences of LXR. To resolve this issue, we utilize the fact that the majority of these alignments represent perfect matches or substitutions of LXR, while indel frequencies are typically less than 1%. Thus, we approximate N(LXR) as

              N(LXR) ~ Σ { sub(LXR, M) | M is a nucleotide string of the same length as LXR }.
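The two formulas above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the count table is assumed to be a nested dictionary sub[LXR][M], the function names n_occurrences and sub_rate are hypothetical, and the toy counts are invented.

```python
def n_occurrences(sub, lxr):
    """Approximate N(LXR) as the sum of sub(LXR, M) over all same-length M."""
    return sum(sub[lxr].values())


def sub_rate(sub, lxr):
    """subRate(LXR): substitutions at the central X only, divided by N(LXR)."""
    mid = len(lxr) // 2
    left, x, right = lxr[:mid], lxr[mid], lxr[mid + 1:]
    # Numerator: targets LYR that keep L and R and change only X.
    changed = sum(
        count
        for target, count in sub[lxr].items()
        if target[:mid] == left
        and target[mid + 1:] == right
        and target[mid] != x
    )
    return changed / n_occurrences(sub, lxr)


# Invented counts: 950 perfect matches, 40 substitutions at X,
# and 10 alignments in which the flanking base L changed instead.
toy_sub = {"ACA": {"ACA": 950, "AGA": 20, "ATA": 15, "AAA": 5, "CCA": 10}}
rate = sub_rate(toy_sub, "ACA")
print(rate)
```

Note that alignments such as ACA to CCA contribute to the denominator N(LXR) but not to the numerator, because the substituted base is not X.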

The values of subRate(LXR) for the bidirectional substitution model ( |L|=|R|=1 or 2 ) can be found in the following table:

 

subRate(LXR) for the bidirectional substitution model ( |LXR|=3, |L|=|R|=1 or |LXR|=5, |L|=|R|=2 )  (MS Excel)

 

 

Indel Models.

 

The bidirectional indel model calculates the frequency of all occurrences of the transformations

              LR => LX*R and LX*R => LR

together with their reverse complements

              RL => RX*L and RX*L => RL

in the alignments between the Hd-rR and HNI genomes. The model assigns the frequency to n(LR, l), where l is the length of X*. The indel rate at a position where LR occurs is estimated as

              Σ { n(LR, l) | 1 ≤ l } / N(LR),

where N(LR) is the number of occurrences of LR and its reverse complement RL in the alignments between the Hd-rR and HNI genomes. As before, we approximate N(LR) as

              N(LR) ~ Σ { sub(LR, M) | M is a nucleotide string of the same length as LR }.

The 1bp indel rate is estimated by setting l to 1; namely,

              n(LR, 1) / N(LR).

The indel rates at positions where LR occurs can be found in the following table:

 

Indel rates estimated by bidirectional indel model ( |LR|=4 or 6 ) (MS Excel)
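The indel rate of the bidirectional indel model can be sketched in Python as a sum of n(LR, l) over a chosen range of indel lengths, divided by N(LR). The data structures are assumptions made for illustration: n is a dictionary keyed by (LR, l), n_lr maps each LR to its approximate occurrence count N(LR), and the function name indel_rate and all counts are hypothetical.

```python
def indel_rate(n, n_lr, lr, min_len, max_len=None):
    """Sum n(LR, l) for min_len <= l <= max_len and divide by N(LR).

    max_len=None means no upper bound on the indel length.
    """
    total = sum(
        count
        for (context, length), count in n.items()
        if context == lr
        and length >= min_len
        and (max_len is None or length <= max_len)
    )
    return total / n_lr[lr]


# Invented counts for one 4-base context LR (|L| = |R| = 2).
toy_n = {("ACGT", 1): 6, ("ACGT", 2): 3, ("ACGT", 5): 1, ("TTAA", 1): 9}
toy_n_lr = {"ACGT": 2000, "TTAA": 3000}

rate_all = indel_rate(toy_n, toy_n_lr, "ACGT", min_len=1)
rate_1bp = indel_rate(toy_n, toy_n_lr, "ACGT", min_len=1, max_len=1)
print(rate_all, rate_1bp)
```

Keeping the length range as explicit parameters lets the same function produce both the overall rate and the 1bp rate n(LR, 1) / N(LR).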

 

 

The k-mer motif indel model searches a local region around an indel for a contiguous k-mer string that occurs significantly often around indels in the entire genome. We generate the model by profiling indels and their neighboring k-mer strings in the whole genome. For each k-mer motif string M (e.g., ATAG), let indel(M, l, d) be the number of indels of length l at position d relative to the occurrences of M. The probability that an indel of arbitrary length appears at position d relative to the occurrence of M is estimated as

              Σ { indel(M, l, d) | 1 ≤ l } / sub(M, M).

The above ratios for |M|=3 and 4 are available in the following tables, in which the rows list all 3-mer or 4-mer strings M and the columns indicate the distance d relative to the occurrence of M. The tables present the probabilities for 1 ≤ l, 1 = l, and 1 < l.

 

k-mer motif indel model (k=3 or 4; 1 ≤ l, 1 = l, or 1 < l)  (MS Excel)
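As a rough sketch of how one row of such a table could be derived, the following Python function computes, for a single motif M, the ratio of summed indel(M, l, d) to sub(M, M) at each distance d. The input structures are assumptions for illustration: indel is a dictionary keyed by (M, l, d), sub_mm maps M to sub(M, M), and the function name motif_indel_profile and the counts are hypothetical.

```python
def motif_indel_profile(indel, sub_mm, motif, length_ok=lambda l: l >= 1):
    """Return {d: probability} for one motif.

    length_ok selects which indel lengths l are summed, so the same
    function covers all lengths, 1bp indels only, or longer indels only.
    """
    totals = {}
    for (m, length, d), count in indel.items():
        if m == motif and length_ok(length):
            totals[d] = totals.get(d, 0) + count
    return {d: c / sub_mm[motif] for d, c in sorted(totals.items())}


# Invented counts: indels observed at distances -1 and 2 from ATAG.
toy_indel = {
    ("ATAG", 1, -1): 4,
    ("ATAG", 3, -1): 1,
    ("ATAG", 1, 2): 5,
    ("CCCC", 1, 0): 7,
}
toy_sub_mm = {"ATAG": 1000, "CCCC": 500}

profile_all = motif_indel_profile(toy_indel, toy_sub_mm, "ATAG")
profile_1bp = motif_indel_profile(
    toy_indel, toy_sub_mm, "ATAG", length_ok=lambda l: l == 1
)
print(profile_all, profile_1bp)
```

Each returned dictionary corresponds to one table row for M, with one entry per distance d at which an indel was counted.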