Below is a collection of scripts and tools that have been useful for analyzing next-generation sequencing data. Some of the scripts were written for personal use and may be buggy or unintuitive. However, they are all free and can usually be coaxed into working, or can serve as springboards for accomplishing other tasks. Many of the python scripts require the screed library, which can be downloaded here.

Navigate to: | Read statistics | Quality trimming | Shuffle and subset paired end sequences | Assembly statistics | Differential Expression | BLAST | Orthology Assignment | Formatting Fasta Files

General Raw Read Scripts

General Statistics for manipulating raw reads:
Download the file in scripts. I will update this with the location of these scripts on the bigmac.
http://sfg.stanford.edu/

Read statistics


FastQC: A quality control tool for high-throughput sequence data.

Quality trimming


q-trim.py: Trims sequences in fastq files based on a quality threshold. Change the QSCORE variable in the script to suit your needs. Data generated using Illumina's pipeline CASAVA 1.8 and higher uses the standard Sanger ASCII encoding of Phred quality scores (ASCII characters 33-126, i.e. Phred score = ASCII code - 33). Set QSCORE to the ASCII character for your threshold, following the ASCII table here. For example, QSCORE = '5' (ASCII 53, Phred 20) should trim reads in your fastq file if their Phred < 21. Quality characters are compared by their ASCII order, so, as another example, ']' < 'a'. The second variable you need to change is INTERCRAP, which determines how many contiguous bases of low quality you are willing to ignore inside any given read (default is 5 bp). Lastly, MINLENGTH determines the minimum read length you want to retain (default = 30 bp).
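If the ASCII lookup is confusing, the conversion can be checked in a few lines of python. This is only a rough sketch and not the code in q-trim.py: the function names are made up for illustration, the trimming is a simplified version that only cuts low-quality bases off the 3' end, and it does not model INTERCRAP at all.

```python
PHRED_OFFSET = 33  # Sanger encoding, as used by CASAVA 1.8 and higher


def phred_from_char(ch):
    """Convert one quality character to its Phred score."""
    return ord(ch) - PHRED_OFFSET


def char_from_phred(q):
    """Convert a Phred score to its quality character."""
    return chr(q + PHRED_OFFSET)


def trim_trailing_low_quality(seq, qual, qscore="5", min_length=30):
    """Hypothetical, simplified trimmer (NOT q-trim.py's algorithm).

    Removes bases from the 3' end while their quality character is at or
    below qscore, then drops the read if it is shorter than min_length.
    """
    end = len(qual)
    while end > 0 and qual[end - 1] <= qscore:
        end -= 1
    if end < min_length:
        return None  # read is too short to keep
    return seq[:end], qual[:end]
```

With these helpers, phred_from_char('5') gives 20 and char_from_phred(20) gives '5', which is a quick way to pick the character for whatever Phred cutoff you want.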

both.py: Extracts only reads that have both R1 and R2 sequences. Input 2 files, one with the R1 and another with the R2 reads. Can be useful after quality trimming to make sure you only use sequences for which both reads exist in downstream applications. Written to work with data generated using Illumina's pipeline CASAVA 1.8 and higher.
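The idea behind keeping only complete pairs can be sketched as follows. This is not the code in both.py, just an illustration under some assumptions: the function names are hypothetical, records are plain 4-line fastq, and the read ID is taken to be the part of the header before the first space (in CASAVA 1.8 headers the R1/R2 flag comes after that space). All R2 reads are held in memory, so it would need rethinking for very large files.

```python
import io


def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from 4-line fastq records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def read_id(header):
    """Return the part of the header shared by both mates (before the space)."""
    return header.split(" ", 1)[0]


def keep_pairs(r1_handle, r2_handle):
    """Yield (r1_record, r2_record) only for reads present in both files."""
    r2_by_id = {read_id(rec[0]): rec for rec in read_fastq(r2_handle)}
    for rec in read_fastq(r1_handle):
        mate = r2_by_id.get(read_id(rec[0]))
        if mate is not None:
            yield rec, mate


# Tiny in-memory demo with made-up reads: only read2 is in both files.
r1 = io.StringIO("@read1 1:N:0:ACGT\nAAAA\n+\nIIII\n@read2 1:N:0:ACGT\nCCCC\n+\nIIII\n")
r2 = io.StringIO("@read2 2:N:0:ACGT\nGGGG\n+\nIIII\n@read3 2:N:0:ACGT\nTTTT\n+\nIIII\n")
pairs = list(keep_pairs(r1, r2))
```

Because the lookup is by ID, the two files do not have to be in the same order for this sketch to work.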

Shuffle and subset paired end sequences


ShuffleSequences_fastq.pl: Perl script from the Velvet developers. Input 2 files, one with the R1 and another with the R2 reads. The reads must be in corresponding order in both files. Reads that are not in both files are excluded.
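For anyone unsure what "shuffling" means here, it is interleaving: each R1 record is followed directly by its R2 mate in a single output file. A minimal python sketch of that idea is below. It is not a port of the Perl script; the function names are made up, and it simply assumes the two inputs are already in corresponding order without checking read names.

```python
from itertools import islice


def fastq_records(lines):
    """Group an iterable of lines into 4-line fastq records."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return  # end of input (or an incomplete trailing record)
        yield record


def interleave(r1_lines, r2_lines):
    """Yield lines alternating one R1 record, then the matching R2 record.

    Assumes both inputs are in the same order. zip() stops at the shorter
    input, so unmatched records at the end of the longer one are dropped.
    """
    for rec1, rec2 in zip(fastq_records(r1_lines), fastq_records(r2_lines)):
        yield from rec1
        yield from rec2
```

With real files this could be used as, for example, out.writelines(interleave(open(r1_path), open(r2_path))), where r1_path, r2_path and out are whatever paths and output handle you are working with.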

FastQ joiner: Galaxy tool. Input 2 files, one with the R1 and another with the R2 reads. The reads do not need to be in corresponding order in both files. Reads that are not in both files are excluded.

shuffle_resamp.py: This python script takes a file containing both R1 and R2 reads as input, and produces a file of shuffled paired end reads and 2 singleton files, one for R1 and another for R2 reads. This script can also produce random shuffled subsets of your data (only the paired end reads) given a specified percent cutoff.
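The random subset step can be pictured roughly like this. It is a sketch of the general idea only, not what shuffle_resamp.py does internally: the function name and the seed argument are hypothetical, and it assumes the pairs have already been collected into a list.

```python
import random


def subsample_pairs(pairs, percent, seed=None):
    """Return a random subset holding `percent` percent of the read pairs.

    The subset comes back in random order. Passing a seed makes the draw
    repeatable, which helps when comparing runs.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    k = int(len(pairs) * percent / 100)  # number of pairs to keep, rounded down
    return rng.sample(pairs, k)
```

Sampling whole pairs, rather than individual reads, keeps every R1 together with its R2.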

Assembly statistics


Count_fasta.pl: Obtain length histogram, GC-content, etc. for sequences in a fasta-format file. This script comes from the UC Davis bioinformatics wiki, which includes lots of other useful scripts.
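If you only need a couple of these numbers, they are easy to compute directly. The sketch below is not a port of Count_fasta.pl and makes no claim about how that script calculates or reports its values; the function names are made up, and N50 is included simply because it is a commonly quoted assembly statistic.

```python
def parse_fasta(lines):
    """Yield (name, sequence) from an iterable of fasta lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For a real assembly, something like lengths = [len(seq) for _, seq in parse_fasta(open(path))] gives the list to feed into n50, where path is your own fasta file.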

Differential Expression

...

BLAST

...

Orthology Assignment

minSTOP.py: Choose the reading frame with the fewest stop codons from a fasta file containing all possible translations for each sequence.
annotate_mcl.py: ...
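The frame selection that minSTOP.py is described as doing can be sketched as below. This is an illustration only, not the script itself: the function name is hypothetical, it assumes stop codons appear as '*' in the translated sequences, and it assumes the translations for one sequence have already been grouped together (how the fasta headers identify frames is not covered here).

```python
def fewest_stops(translations):
    """Pick the translation with the fewest stop codons.

    `translations` is an iterable of (frame_label, protein) tuples for ONE
    sequence, with stops written as '*'. On a tie, the first frame listed wins.
    """
    return min(translations, key=lambda item: item[1].count("*"))
```

For example, given the frames ("f1", "MK*L*"), ("f2", "MKLL") and ("f3", "*"), this picks "f2", which has no stops.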

Formatting Fasta Files

...