How to extract promoters positions
Introduction
In this post, I will show how easy it is to extract the genomic positions of every promoters of a specific genome build.
For this demo, you will need the TxDb.Hsapiens.UCSC.hg19.knownGene
package:
require(TxDb.Hsapiens.UCSC.hg19.knownGene)
# To avoid have to type the whole package name every time, we use the variable name txdb
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
TL;DR
promoters(genes(txdb), upstream = 1500, downstream = 500)
## GRanges object with 23056 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 [ 58873715, 58875714] - | 1
## 10 chr8 [ 18247255, 18249254] + | 10
## 100 chr20 [ 43279877, 43281876] - | 100
## 1000 chr18 [ 25756946, 25758945] - | 1000
## 10000 chr1 [244006387, 244008386] - | 10000
## ... ... ... ... ... ...
## 9991 chr9 [115095445, 115097444] - | 9991
## 9992 chr21 [ 35734823, 35736822] + | 9992
## 9993 chr22 [ 19109468, 19111467] - | 9993
## 9994 chr6 [ 90538119, 90540118] + | 9994
## 9997 chr22 [ 50964406, 50966405] - | 9997
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
TxDb
The trick is to use a type of packages that are known as TxDb
. TxDb
stands for Transcripts Database, and as the name implies, it contains information about the transcripts for a specific genome build of a given specie.
If you are lucky, a TxDb
package is already available on Bioconductor and extracting the promoter information will be very straighforward. Otherwise, it is possible to create your own TxDb
object, but this is beyond the scope of the current post.
The goal of this document is not to describe in details the inner workings of TxDb
objects. We will only show two helper functions that allow to easily extract relevant information from this type of object.
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-03-19 13:55:51 -0700 (Thu, 19 Mar 2015)
## # GenomicFeatures version at creation time: 1.19.32
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
promoters
and genes
The promoters
function can be used to extract the information for the promoters of every transcripts from a TxDb
object:
promoters_txdb <- promoters(txdb)
promoters_txdb
## GRanges object with 82960 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 [ 9874, 12073] + | 1 uc001aaa.3
## [2] chr1 [ 9874, 12073] + | 2 uc010nxq.1
## [3] chr1 [ 9874, 12073] + | 3 uc010nxr.1
## [4] chr1 [ 67091, 69290] + | 4 uc001aal.1
## [5] chr1 [319084, 321283] + | 5 uc001aaq.2
## ... ... ... ... ... ... ...
## [82956] chrY [27605479, 27607678] - | 78803 uc004fwx.1
## [82957] chrY [27606222, 27608421] - | 78804 uc022cpc.1
## [82958] chrY [27607233, 27609432] - | 78805 uc004fwz.3
## [82959] chrY [27635755, 27637954] - | 78806 uc022cpd.1
## [82960] chrY [59360655, 59362854] - | 78807 uc011ncc.1
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
The promoters function returns a GRanges
object corresponding to the positions of the promoters of every transcripts in the Txdb
object.
This returns 82960 promoter regions. But often we are only interested in the promoters of the genes and not of all the transcripts. This is where the genes
function becomes handy:
genes_txdb <- genes(txdb)
promoters_txdb <- promoters(genes_txdb)
promoters_txdb
## GRanges object with 23056 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 [ 58874015, 58876214] - | 1
## 10 chr8 [ 18246755, 18248954] + | 10
## 100 chr20 [ 43280177, 43282376] - | 100
## 1000 chr18 [ 25757246, 25759445] - | 1000
## 10000 chr1 [244006687, 244008886] - | 10000
## ... ... ... ... ... ...
## 9991 chr9 [115095745, 115097944] - | 9991
## 9992 chr21 [ 35734323, 35736522] + | 9992
## 9993 chr22 [ 19109768, 19111967] - | 9993
## 9994 chr6 [ 90537619, 90539818] + | 9994
## 9997 chr22 [ 50964706, 50966905] - | 9997
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Both function can also be nested to avoid the intermediate genes_txdb
object:
promoters_txdb <- promoters(genes(txdb))
promoters_txdb
## GRanges object with 23056 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 [ 58874015, 58876214] - | 1
## 10 chr8 [ 18246755, 18248954] + | 10
## 100 chr20 [ 43280177, 43282376] - | 100
## 1000 chr18 [ 25757246, 25759445] - | 1000
## 10000 chr1 [244006687, 244008886] - | 10000
## ... ... ... ... ... ...
## 9991 chr9 [115095745, 115097944] - | 9991
## 9992 chr21 [ 35734323, 35736522] + | 9992
## 9993 chr22 [ 19109768, 19111967] - | 9993
## 9994 chr6 [ 90537619, 90539818] + | 9994
## 9997 chr22 [ 50964706, 50966905] - | 9997
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
By default, the promoters
function will fetch the 2000 nucleotides before the transcription start site (TSS) and the 200 nucleotides after the TSS. This can be controled with the upstream
and downstream
parameters:
unique(width(promoters_txdb))
## [1] 2200
promoters_txdb <- promoters(genes(txdb), upstream = 1500, downstream = 500)
unique(width(promoters_txdb))
## [1] 2000