How to slice a genome
Introduction
In this document, we will see how to slice a genome. More specifically, we will extract all the aligned regions with a coverage greater than 0. This can be useful if we want to focus with genomic regions where some reads were aligned and to discard all the empty regions.
Dataset presentation
We will work with the packages pasillaBamSubset
that has reads only on chr4.
suppressMessages(library(pasillaBamSubset))
bam_file <- untreated1_chr4()
bam_file
## [1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/pasillaBamSubset/extdata/untreated1_chr4.bam"
Most of the time, for this kind of analysis, we don't want to work with the whole genome at once. It's generally best to work with only one subset of the genome at a time to reduce memory usage. An easy way to get the chromosome size is to work with the Seqinfo
function (if the genome of interest is supported):
suppressMessages(library(GenomicAlignments))
current_chr <- "chr4"
dm3 <- Seqinfo(genome = "dm3")
chr4_size <- seqlengths(dm3)[seqnames(dm3) == current_chr]
Extracting the coverage
The extraction of the coverage from a bam file is done in 2 steps:
- Extraction of the aligned read with the
readGAlignments
function. - Conversion in coverage with the
coverage
function.
Since we want to extract only the reads aligned on the chromosome 4, we need to create a ScanBamParam
to filter accordingly with the readGAlignments
function:
which = GRanges(seqnames = current_chr, IRanges(1, chr4_size), seqinfo = dm3)
param <- ScanBamParam(which = which)
alignments <- readGAlignments(bam_file, param = param)
alignments
## GAlignments object with 204355 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] chr4 - 75M 75 892 966
## [2] chr4 - 75M 75 919 993
## [3] chr4 + 75M 75 924 998
## [4] chr4 + 75M 75 936 1010
## [5] chr4 + 75M 75 949 1023
## ... ... ... ... ... ... ...
## [204351] chr4 + 75M 75 1348268 1348342
## [204352] chr4 + 75M 75 1348268 1348342
## [204353] chr4 + 75M 75 1348268 1348342
## [204354] chr4 - 75M 75 1348449 1348523
## [204355] chr4 - 75M 75 1350124 1350198
## width njunc
## <integer> <integer>
## [1] 75 0
## [2] 75 0
## [3] 75 0
## [4] 75 0
## [5] 75 0
## ... ... ...
## [204351] 75 0
## [204352] 75 0
## [204353] 75 0
## [204354] 75 0
## [204355] 75 0
## -------
## seqinfo: 8 sequences from an unspecified genome
It's extremely easy to convert a GAlignments
object in a coverage (SimpleRleList
) object:
coverage <- coverage(alignments)
coverage
## RleList of length 8
## $chr2L
## integer-Rle of length 23011544 with 1 run
## Lengths: 23011544
## Values : 0
##
## $chr2R
## integer-Rle of length 21146708 with 1 run
## Lengths: 21146708
## Values : 0
##
## $chr3L
## integer-Rle of length 24543557 with 1 run
## Lengths: 24543557
## Values : 0
##
## $chr3R
## integer-Rle of length 27905053 with 1 run
## Lengths: 27905053
## Values : 0
##
## $chr4
## integer-Rle of length 1351857 with 122061 runs
## Lengths: 891 27 5 12 13 45 ... 3 106 75 1600 75 1659
## Values : 0 1 2 3 4 5 ... 6 0 1 0 1 0
##
## ...
## <3 more elements>
The coverage is a SimpleRleList
with one element for each seqlevels
in the seqlevels
.
Slice the chromosome
We can use the slice
function to extract the regions with at least a coverage of 1. We need to specify a chromosome for this funciton to work:
views <- slice(coverage[[current_chr]], lower = 1)
views
## Views on a 1351857-length Rle subject
##
## views:
## start end width
## [1] 892 1109 218 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## [2] 1236 1326 91 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 ...]
## [3] 1512 1586 75 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## [4] 2416 2613 198 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## [5] 4426 4500 75 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## [6] 4922 5640 719 [ 2 3 3 4 4 4 4 4 4 5 5 5 8 ...]
## [7] 5670 6090 421 [1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 ...]
## [8] 6121 6499 379 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## [9] 6501 7554 1054 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## ... ... ... ... ...
## [2611] 1345567 1345706 140 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## [2612] 1345759 1345888 130 [ 1 1 1 1 4 7 9 9 9 9 9 9 9 ...]
## [2613] 1345908 1345990 83 [3 3 3 3 3 4 8 8 9 9 9 9 9 9 9 9 9 9 9 ...]
## [2614] 1346004 1347058 1055 [ 3 4 4 4 4 4 4 4 4 4 4 4 6 ...]
## [2615] 1347062 1347136 75 [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ...]
## [2616] 1347142 1347589 448 [2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 ...]
## [2617] 1347918 1348342 425 [ 2 2 2 2 3 4 4 4 5 5 8 9 10 ...]
## [2618] 1348449 1348523 75 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
## [2619] 1350124 1350198 75 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...]
It's possible to select each element in using the [[
:
views[[2]]
## integer-Rle of length 91 with 3 runs
## Lengths: 16 59 16
## Values : 1 2 1