RNA-seq

Transcriptome, Counts

1. count reads per gene:

  • model with Poisson or neg binomial
  • methods: edgeR, DESeq

2. RNA -> cDNA:

  • rna transcript only contains exons.
  • introns spliced out
  • junctions not in genome

3. same gene multiple transcripts

  • gene not well defined
  • diff versions have diff exons
    (ie 13 diff versions each with diff exons)
  • can have same expression of gene but diff exons
  • methods try to determine exons (w/o reference) (Trinity Oases, Cufflinks, Scripture)

faq

Gene model Q1

According to the UCSC Genes track (it will be on top if you have the default arrangement), how many transcripts does RIT1 have? Note that the RefSeq Genes track below is a separate annotation of the same gene.

  • 3 transcripts shown in the UCSC Genes track
  • have different first and second exons (the thick part of the line)
  • first exons are on the right side because this gene is on the minus strand (it is transcribed from the right side to the left side)
  • We know the gene is on the minus strand, because the arrows in the thin part of the gene (the introns) point to the left.

Gene model Q2

According to UCSC Genes track, how many exons does the top transcript of RIT1 (uc031pqc.1) have?

  • 6, thick lines are exons

Gene model Q3

Look up a nearby gene, CCT3. How many transcripts does this gene have, according to the UCSC Genes track?

Quantification

Model for quant. Q1

If we sequence from these RNAs uniformly at a rate of 1 read per 1 basepair, how many total reads will we expect on average for this gene? We just need to multiply each transcripts copy number (k) with its length (l) and sum these.

4 * 3000 + 7 * 1000 + 1 * 5000 = 24000

  • changes in k(transcript length), will effect total count, when length of transcripts are different
  • weighted average transcript length: weight by estimate of percent of transcript expression
  • differential expression : WATL * sample_depth * change in total gene expression

expression estimates

w=1000
lengths = c(100,200,300,100,100)
mat = cbind(c(1,1,0,1,0),c(1,1,1,1,1),c(0,1,1,0,1))
#counts = c(125,350,300,125,100)
counts = c(60,320,420,60,140)
theta.hat = c(1, 2, 3) / 10000
#mat %*% theta.hat * lengths * w
LHS = counts/(lengths * w)
lm.fit(mat, LHS)$coefficients
##    x1    x2    x3 
## 2e-04 4e-04 1e-03
#tm= t(mat) %*% mat
#dim(tm); dim(mat)
#tmlhs = t(mat) %*% LHS
#heta=solve( tm %*% tmlhs ) 

Quantification: Ambiguity, Bias


Alignment

- cluster computing is normally used (not pc/laptop)

- FASTQ

  • human rna, asthma med, 8samples/runs
  • identifiers:
    1. SRR sequence read archive (look up FASTQ files)
    2. GSM identifier, links to geodatabase
    3. cell line
    4. treatment
  • SRR:
    • number of spots(reads),
    • number of bases,
    • size of compressed SRA_file -> 2 paired FASTQ files (1.6GB)
    • quality (41=max), histogram
    • 2 reads/spot => paired end

file SRR1039508_1.fastq, first read pair, 3 reads @SRR1039508.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130/1 CATTGCTGATACCAANNNNNNNNGCATTCCTCAAGGTCTTCCTCCTTCCCTTACGGAATTACA
+
HJJJJJJJJJJJJJJ########00?GHIJJJJJJJIJJJJJJJJJJJJJJJJJHHHFFFFFD @SRR1039508.2 HWI-ST177:290:C0TECACXX:1:1101:1311:2131/1 CCCTGGACTGCTTCTTGAAAAGTGCCATCCAAACTCTATCTTTGGGGAGAGTATGATAGAGAT
+
HJJJJJJJJJJJJJJJJJIIJIGHIJJJJJJJJJJJJJJJJJJJJJJGHHIDHIJJHHHHHHF @SRR1039508.3 HWI-ST177:290:C0TECACXX:1:1101:1414:2163/1 TCGATCCATCGATTGGAAGGCACTGATCTGGACTGTCAGGTTGGTGGTCTTATTTGCAAGTCC
+
HJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJGJJIGHIJJBGIJCGIAHIJHHHHHHHFFFFF

strand specific (gene+)

  • specific only see 2 reads, where gene is sequential A-B regardless if read1/2 in on +/- (2)
  • if nonspecific, then see 2 more reads, where A-B or B-A for reads1/2 on both +/- (4)

fastqc

plots

  • quality control before alignment
  • time fastqc –noextract seqfile1 seqfile2
  • fastqc report plots
    1. quality along read (phred) \(-10log_{10}p\)
    2. box plot of quality score at each position
    3. per base sequence content
    4. per base GC content compare to theoretical (skew?)
    5. duplicate sequences
    • oversequence common to see few transcripts
    • better check is genomeviewer to see if sequence is uniform over whole
    1. overrepresened
    2. kmers which are enriched

bp 9 lowest quality

STAR alignment

  • paper
  • generate genome
  • alignment
    • cluster with 12 threads
  • AWS

SAM/BAM

integrative genomics viewer


## Bioconductor version 3.1 (BiocInstaller 1.18.4), ?biocLite for help
## [1] '3.1'

Count Normalization

gene level anaylsis

Count Matrix

  • count RNA -seqfragments from a single sample on a single chromosome of the drosophila genome.
#paired endwise experiment  
library(pasillaBamSubset)
bam.file <- untreated3_chr4()
library(Rsamtools)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: XVector
## Loading required package: Biostrings
bf <- BamFile(bam.file)
bam.list<-BamFileList(bf)

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
library(GenomicFeatures)
#exons.by.gene 
ebg <- exonsBy(txdb, by="gene")
#head(ebg)
seqnames(ebg[1])
## RleList of length 1
## $FBgn0000003
## factor-Rle of length 1 with 1 run
##   Lengths:     1
##   Values : chr3R
## Levels(15): chr2L chr2R chr3L chr3R ... chr3RHet chrXHet chrYHet chrUextra
library(Rsubread)
#fc <- featureCounts(bf, annot.ext=gtf.file,
#                    isGTFAnnotationFile=TRUE, 
#                    isPaired=TRUE)
library(GenomicAlignments)
?summarizeOverlaps
chr4.idx <- all(seqnames(ebg) == "chr4")
ebg.sub <- ebg[chr4.idx]
s0<-summarizeOverlaps(ebg.sub,bam.list, ignore.strand=TRUE, singleEnd=FALSE, fragments=FALSE)
#rowData(s0)
assay(s0[1])
##             untreated3_chr4.bam
## FBgn0002521                 192

Transcriptome alignment RSEM1

library("lattice")
library("rafalib")
genes <- read.table("/rstudio/SRR1039508.genes.results", header=TRUE)
isoforms <- read.table("/rstudio/SRR1039508.isoforms.results", header=TRUE)

#confirm that the FPKM column in 'genes' is the sum of the FPKM column in 'isoforms'
fpkm.per.gene <- split(isoforms$FPKM, isoforms$gene_id)
head(sapply(fpkm.per.gene, sum))
## ENSG00000000457 ENSG00000000460 ENSG00000000938 ENSG00000000971 
##           26.92           11.64            0.16          279.82 
## ENSG00000001460 ENSG00000001461 
##           24.40          150.85
head(genes$FPKM)
## [1]  26.91  11.65   0.16 279.82  24.39 150.85
names(genes)
## [1] "gene_id"          "transcript_id.s." "length"          
## [4] "effective_length" "expected_count"   "TPM"             
## [7] "FPKM"
#log transform histogram of FPKM genes
#dev.off()
histogram(log(genes$FPKM))

#filter negative FPKM
genes2 <- genes[genes$FPKM > 0,]
genes2$gene_id <- droplevels(genes2$gene_id)
isoforms2 <- isoforms[isoforms$gene_id %in% genes2$gene_id,]
isoforms2$gene_id <- droplevels(isoforms2$gene_id)
#check factors
stopifnot(all(genes2$gene_id == levels(isoforms2$gene_id)))

#plot log_effective_count vs log_effective_length
mypar(1,1)
plot(log10(genes2$effective_length), log10(genes2$expected_count)  )