DESeq

DESeq is an R package for quantifying differential gene expression using count data from mapping experiments. To learn more about DESeq you can read the [|DESeq paper] in Genome Biology. Below is an outline of a possible work flow, using DESeq, for a differential expression analysis. For more details you can see the [|package vignette].

code format="bash" bwa samse database.fasta aln_sa.sai short_read.fastq > aln.sam
 * 1.** Map your raw reads to an assembled or existing genome or transcriptome. If your mapping program does not produce a sam file as its output transform the output to the sam format.
 * 1) convert single-end bwa output to sam file

bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam
 * 1) convert paired-end bwa output to sam file

samtools view -h -o out.sam in.bam code 2. If you are mapping paired-end reads, sort the sam file. code format="bash" sort -s -k 1,1 my_file.sam > my_file_sorted.sam code 3. Extract count data from the sam file. You can use either of the python scripts below or an existing program such as [|htseq-count]. [|counts.py] : counts mapped reads. [|counts-paired.py] : counts only reads for which both pairs map to the same scaffold and both pairs sequence successfully (neither pair is composed of all N's).
 * 1) convert bam file to sam file

The out put of both python scripts above has the following format.

File A: File B: To use this output in DESeq you will have to merge the count files for all datasets used for the differential expression analysis. You can do this in excel and save the merged tables as a .csv file or use the bash commands below. Either way your final file should have the following format: code format="bash" ( head -1 ${in:=my_counts.csv} ; tail -n +2 $in | sort -s -t, -k1,1 ) > my_counts_sorted.csv
 * > contig ||> File A ||
 * > gene1 ||> 0 ||
 * > gene2 ||> 100 ||
 * > contig ||> File B ||
 * > gene1 ||> 100 ||
 * > gene2 ||> 0 ||
 * > contig ||> File A ||> File B ||
 * > gene1 ||> 100 ||> 0 ||
 * > gene2 ||> 0 ||> 100 ||
 * 1) Sort the mapped read count files.

awk 'BEGIN{FS=","} FILENAME=="my_counts_sorted_1.csv" { dat[$1]=","$2 } ; FILENAME=="my_counts_sorted_2.csv" { print dat[$1]","$2 } 'my_counts_sorted_1.csv my_counts_sorted_2.csv code 4. Start an R session and import DESeq. Make sure you are running R version 2.14.0 or greater. code format="rsplus" source("http://www.bioconductor.org/biocLite.R") biocLite("DESeq") library( "DESeq" ) code 5. Read in the combined read counts file. code format="rsplus" countTable <- read.csv( "path/to/counts.csv", header=TRUE, row.names=1) head(countTable)
 * 1) Combine the mapped read count files. Example with 3 files.
 * paste my_counts_sorted_3.csv - > all_sorted.csv
 * 1) header=TRUE indicates that the first line contains column names and row.names=1 means that the
 * 2) first column should be used as row names.
 * 1) Print the top 7 lines of the file.

A     B     C gene1       6374   2758  1553 gene2      6908  42544    17 gene3     16225  20364   173 gene4         0      1     0 gene5       690    201   281 gene6       191    258     0 code You might want to filter out genes which have very low cumulative read mapping counts (low row sum, 10 in the example below) code format="rsplus" rs <- rowSums(countTable) use <- (rs > 10) countTableFilt <- countTable[ use, ]

code 6. Create a factor which describes the condition for each sample. For example if A is a weevil larva library, B is a weevil larve library and C is a weevil adult library then the conditions would be: larva, larva, adult. In the case where multiple libraries have the same condition, they are treated as biological replicates. code format="rsplus" conds <- factor( c( "larva", "larva", "adult", "adult" ) )
 * 1) replicates

conds < factor( c( "larva", "larva", "adult" ) )
 * 1) partial replicates

conds <- factor( c( "larva", "adult" ) ) code 7. Create a CountDataSet, the central data structure in the DESeq package. code format="rsplus" cds <- newCountDataSet( countTable, conds ) code 8. Estimate the effective library size for each sample. The following description is given for this step in the [|DESeq vignette] : "This information is called the “size factors” vector, as the package only needs to know the relative library sizes. So, if the counts of non-differentially expressed genes in one sample are, on average, twice as high as in another, the size factor for the first sample should be twice as large as the one for the other sample...If we divide each column of the count table by the size factor for this column, the count values are brought to a common scale, making them comparable. When called with normalized=TRUE, the counts accessor function [counts( cds, normalized=TRUE )] performs this calculation. This is useful, e.g., for visualization." code format="rsplus" cds <- estimateSizeFactors( cds ) sizeFactors( cds )
 * 1) no replicates

A        B         C 1.2395577 0.7021389 1.1811623 code 9. Estimate the dispersion/variance of the data.

From the DESeq vignette : "The inference in DESeq relies on an estimation of the typical relationship between the data’s variance and their mean, or, equivalently, between the data’s dispersion and their mean. The dispersion can be understood as the square of the coefficient of biological variation. So, if a gene’s expression typically differs from replicate to replicate sample by 20%, this gene’s dispersion is 0.22 = .04... Without any replicates at all, the estimateDispersions function will refuse to proceed unless we instruct it to ignore the condition labels and estimate the variance by treating all samples as if they were replicates of the same condition." code format="rsplus" cds <- estimateDispersions( cds )
 * 1) With biological replicates or with partial replicates.
 * 2) When you have partial replicates only the conditions with replicates will be used to estimate the dispersion.

cds <- estimateDispersions( cds, method="blind", sharingMode="fit-only" )
 * 1) without biological replicates

cds <- estimateDispersions(cds) Warning message: In parametricDispersionFit(means, disps) : Dispersion fit did not converge. cds <- estimateDispersions(cds, fitType="local")
 * 1) Correcting a warning message
 * 1) fitType="local" uses a numerical integration to fit the dispersion rather than the default parametric fit.

code 10. Test for differential expression. code format="rsplus" res <- nbinomTest( cds, "larva", "adult" ) head( res )

id  baseMean  baseMeanA  baseMeanB  foldChange  log2FoldChange      pval    padj 1 gene1   3008.14    8828.14      98.14      0.0111           -6.49  3.19e-08  0.0007 2 gene2   5760.50   16657.55     311.97      0.0187           -5.73  1.12e-07  0.0021 3 gene3  12969.76   37190.68     859.31      0.0231           -5.43  1.40e-07  0.0022 4 gene4  11404.17   32631.80     790.36      0.0242           -5.36  1.76e-07  0.0024 5 gene5  11896.77   34154.92     767.70      0.0224           -5.47  2.48e-07  0.0026 6 gene6   7586.69   21461.68     649.20      0.0302           -5.04  1.61e-06  0.0078 code Note: The pval column contains the p-value which corresponds to the significance of the gene's differential expression, the padj column contains adjusted p-values (often called q-values) controlling false discovery rate. You can read more about q-values and false discovery rates here.

11. Plot the log2 fold changes against the base means (sometimes call the MA-plot), coloring in red those genes that are significant at 10% false discovery rate. code format="rsplus" plotDE <- function( res, sig ) plot(       res$baseMean,        res$log2FoldChange,        log="x", pch=20, cex=.3,        col = ifelse( res$padj < sig, "red", "black" ), xlab="Base Mean", ylab=expression(Log[2]~Fold~Change) ) plotDE(res, .1) code
 * 1) Create a plotting function which takes your nbinomTest object and the desired percent false discovery rate as arguments.
 * 1) Plot the MA-plot using a 10% false discovery rate

12. **Optional:** Plot the distribution of p-values. code format="rsplus" hist(res$pval, breaks=100, main="", xaxt = "p-value")

code 13. At this point you may just want to export your table of genes and associates p-values and order and extract the information you need in excel. code format="rsplus" write.csv(res, "DESeq_genes.csv")

code Or you could order and filter your data in R and export each of these tables using the command above.

code format="rsplus" resSig <- res[ res$padj < 0.1, ] code code format="rsplus" resSig[order(resSig$pval)[1:10, length(resSig$id))],] code code format="rsplus" resSig[order(resSig$foldChange, -resSig$baseMean)[1:min(10, length(resSig$id))],] code code format="rsplus" resSig[order(-resSig$foldChange, -resSig$baseMean)[1:min(10, length(resSig$id))],] code 1. Transform the count data. code format="rsplus" cdsBlind <- estimateDispersions( cds, method="blind" ) vsd <- getVarianceStabilizedData( cdsBlind ) code 2. Order the genes by their significance values and display the top 100 genes. code format="rsplus" select <- order(res$pval)[1:100] code 3A. Set up the color pallet and plot the heat map. code format="rsplus" colors <- colorRampPalette(c("white","darkblue"))(100) heatmap( vsd[select,], col = colors, scale = "none")
 * Filter for significant genes, according to some chosen threshold for the false discovery rate:**
 * 1) Filter significant genes with a false discovery rate threshold less than 10%.
 * Print the top 10 differentially expressed genes with a 10% false discovery rate:**
 * Print the top 10 down regulated genes with a 10% false discovery rate:**
 * Print the top 10 up regulated genes with a 10% false discovery rate:**
 * 14.** **Optional: Visualizing differential expression using heat maps.**

code 3B. Plot the heat map using gplots. code format="rsplus" install.packages("gplots") library("gplots") heatmap.2( vsd[select,], col=rev(heat.colors(25)),trace="none") code Note: you can find more information about R color palettes here. You can also use a heat map to **summarize the euclidean distances between your samples**. 1. Transform the count data. code format="rsplus" cdsBlind <- estimateDispersions( cds, method="blind" ) vsd <- getVarianceStabilizedData( cdsBlind ) code 2. Create a distance matrix. code format="rsplus" dists <- dist( t(vsd)) dists C       B B  470.0339 A 611.5110 570.5806 code 3. Plot the distance matrix. code format="rsplus" heatmap( as.matrix( dists ), symm=TRUE, scale="none", margins=c(10,10), col = colorRampPalette(c("black","blue","lightblue"))(20)) code
 * 1) install gplots if it is not already installed.
 * 1) produce a heatmap using a reversed heat color palette.