How to slice a genome

Introduction

In this document, we will see how to slice a genome. More specifically, we will extract all the aligned regions with a coverage greater than 0. This can be useful if we want to focus with genomic regions where some reads were aligned and to discard all the empty regions.

Dataset presentation

We will work with the packages pasillaBamSubset that has reads only on chr4.

suppressMessages(library(pasillaBamSubset))
bam_file <- untreated1_chr4()
bam_file
## [1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/pasillaBamSubset/extdata/untreated1_chr4.bam"

Most of the time, for this kind of analysis, we don't want to work with the whole genome at once. It's generally best to work with only one subset of the genome at a time to reduce memory usage. An easy way to get the chromosome size is to work with the Seqinfo function (if the genome of interest is supported):

suppressMessages(library(GenomicAlignments))
current_chr <- "chr4"
dm3 <- Seqinfo(genome = "dm3")
chr4_size <- seqlengths(dm3)[seqnames(dm3) == current_chr]

Extracting the coverage

The extraction of the coverage from a bam file is done in 2 steps:

  1. Extraction of the aligned read with the readGAlignments function.
  2. Conversion in coverage with the coverage function.

Since we want to extract only the reads aligned on the chromosome 4, we need to create a ScanBamParam to filter accordingly with the readGAlignments function:

which = GRanges(seqnames = current_chr, IRanges(1, chr4_size), seqinfo = dm3)
param <- ScanBamParam(which = which)
alignments <- readGAlignments(bam_file, param = param)
alignments
## GAlignments object with 204355 alignments and 0 metadata columns:
##            seqnames strand       cigar    qwidth     start       end
##               <Rle>  <Rle> <character> <integer> <integer> <integer>
##        [1]     chr4      -         75M        75       892       966
##        [2]     chr4      -         75M        75       919       993
##        [3]     chr4      +         75M        75       924       998
##        [4]     chr4      +         75M        75       936      1010
##        [5]     chr4      +         75M        75       949      1023
##        ...      ...    ...         ...       ...       ...       ...
##   [204351]     chr4      +         75M        75   1348268   1348342
##   [204352]     chr4      +         75M        75   1348268   1348342
##   [204353]     chr4      +         75M        75   1348268   1348342
##   [204354]     chr4      -         75M        75   1348449   1348523
##   [204355]     chr4      -         75M        75   1350124   1350198
##                width     njunc
##            <integer> <integer>
##        [1]        75         0
##        [2]        75         0
##        [3]        75         0
##        [4]        75         0
##        [5]        75         0
##        ...       ...       ...
##   [204351]        75         0
##   [204352]        75         0
##   [204353]        75         0
##   [204354]        75         0
##   [204355]        75         0
##   -------
##   seqinfo: 8 sequences from an unspecified genome

It's extremely easy to convert a GAlignments object in a coverage (SimpleRleList) object:

coverage <- coverage(alignments)
coverage
## RleList of length 8
## $chr2L
## integer-Rle of length 23011544 with 1 run
##   Lengths: 23011544
##   Values :        0
## 
## $chr2R
## integer-Rle of length 21146708 with 1 run
##   Lengths: 21146708
##   Values :        0
## 
## $chr3L
## integer-Rle of length 24543557 with 1 run
##   Lengths: 24543557
##   Values :        0
## 
## $chr3R
## integer-Rle of length 27905053 with 1 run
##   Lengths: 27905053
##   Values :        0
## 
## $chr4
## integer-Rle of length 1351857 with 122061 runs
##   Lengths:  891   27    5   12   13   45 ...    3  106   75 1600   75 1659
##   Values :    0    1    2    3    4    5 ...    6    0    1    0    1    0
## 
## ...
## <3 more elements>

The coverage is a SimpleRleList with one element for each seqlevels in the seqlevels.

Slice the chromosome

We can use the slice function to extract the regions with at least a coverage of 1. We need to specify a chromosome for this funciton to work:

views <- slice(coverage[[current_chr]], lower = 1)
views
## Views on a 1351857-length Rle subject
## 
## views:
##          start     end width
##    [1]     892    1109   218 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
##    [2]    1236    1326    91 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 ...]
##    [3]    1512    1586    75 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
##    [4]    2416    2613   198 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
##    [5]    4426    4500    75 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
##    [6]    4922    5640   719 [ 2  3  3  4  4  4  4  4  4  5  5  5  8 ...]
##    [7]    5670    6090   421 [1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 ...]
##    [8]    6121    6499   379 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
##    [9]    6501    7554  1054 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
##    ...     ...     ...   ... ...
## [2611] 1345567 1345706   140 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## [2612] 1345759 1345888   130 [ 1  1  1  1  4  7  9  9  9  9  9  9  9 ...]
## [2613] 1345908 1345990    83 [3 3 3 3 3 4 8 8 9 9 9 9 9 9 9 9 9 9 9 ...]
## [2614] 1346004 1347058  1055 [ 3  4  4  4  4  4  4  4  4  4  4  4  6 ...]
## [2615] 1347062 1347136    75 [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ...]
## [2616] 1347142 1347589   448 [2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 ...]
## [2617] 1347918 1348342   425 [ 2  2  2  2  3  4  4  4  5  5  8  9 10 ...]
## [2618] 1348449 1348523    75 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## [2619] 1350124 1350198    75 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]

It's possible to select each element in using the [[:

views[[2]]
## integer-Rle of length 91 with 3 runs
##   Lengths: 16 59 16
##   Values :  1  2  1
Written on April 24, 2015