Here are some examples of how you can use UNIX tools to start looking at your next-gen data. We suggest you read about Input/Output Redirection and Job Control first.

Uncompressing your Data | FASTQ File Conventions | Counting Reads | Organizing your Data

Uncompressing your Data

So far we have received our data as tar archives from sequencing facilities. Uncompressing these archives is covered in Archived and Compressed Files. Before uncompressing all of your data, check how much space is available on the machine you are working on. A compressed fastq file can be approximately 400 MB. Your data can consist of up to 100 fastq files per lane, approximately 40 GB. Uncompressed, this can be over 100 GB. For comparison, my work computer, an iMac, only has 80 GB of space on its hard drive! Remember you can use the command df with the flag -h to see how much free space you have on your hard drive.
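The space check described above can be sketched as follows. The helper name total_size is made up for this illustration, and the archive pattern in the usage comment is an assumption about how your files are named.

```shell
# Free space on the filesystem that holds the current directory.
df -h .

# Total size of a set of files, e.g. the archives you are about to unpack.
# total_size is a hypothetical helper name, not a standard command.
total_size() {
    # du -c adds a final "total" line; tail keeps only that line.
    du -ch "$@" | tail -n 1
}

# Usage (assumes your archives end in .tar.gz):
#   total_size *.tar.gz
```

Comparing the total against the free space reported by df gives a rough idea of whether the uncompressed data will fit, keeping in mind that uncompressed data can be several times larger than the archive.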

FASTQ File Conventions

In general your fastq file will have a name that looks something like: lane3-6_GCCAAT_L003_R1_001.fastq
The underscore-separated parts of the name are, in order:
lane3-6: Lane and bar code index
GCCAAT: Bar code sequence
L003: Lane number
R1: Read number
001: File number
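If you want to pull these parts out of a file name in a script, a minimal sketch is shown below. It assumes the name follows the five-part, underscore-separated pattern above exactly. The function name parse_fastq_name is hypothetical.

```shell
# Split a fastq file name of the form
#   <index>_<barcode>_<lane>_<read>_<file number>.fastq
# into labelled parts. parse_fastq_name is a hypothetical helper name.
parse_fastq_name() {
    # basename strips any directory and the .fastq suffix;
    # awk then splits what is left on underscores.
    basename "$1" .fastq | awk -F_ '{
        printf "index=%s barcode=%s lane=%s read=%s file=%s\n", $1, $2, $3, $4, $5
    }'
}

parse_fastq_name lane3-6_GCCAAT_L003_R1_001.fastq
# prints: index=lane3-6 barcode=GCCAAT lane=L003 read=R1 file=001
```

Names with extra underscores (for instance in a sample label) would shift the fields, so check the output against a few of your own files before relying on it.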
This file will contain alternating lines with sequences and base quality scores. You can read more about the fastq file format here. An example fastq entry can be found below. Data generated using Illumina's pipeline CASAVA 1.8 and higher uses the standard Sanger ASCII encoding of Phred quality scores (ASCII characters 33-126).

@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG

The first line is prefixed by the “@” symbol and contains the read name.
The second line contains the sequence bases.
The third line is prefixed by a + symbol and sometimes repeats the read name. The read name is omitted in the minimal FASTQ case.
The fourth line contains the base qualities.
The header line is interpreted as follows: @<instrument-name>:<run ID>:<flowcell ID>:<lane-number>:<tile-number>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>
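As an illustration of the header layout, the sketch below splits a header on colons and the single space and labels each field. It assumes the CASAVA 1.8 style header shown above. The function name parse_fastq_header and the short labels are made up for this example.

```shell
# Label the fields of a CASAVA 1.8 style fastq header line.
# parse_fastq_header is a hypothetical helper name.
parse_fastq_header() {
    # Split on ":" or a space, then drop the leading "@" from the first field.
    printf '%s\n' "$1" | awk -F'[: ]' '{
        sub(/^@/, "", $1)
        printf "instrument=%s run=%s flowcell=%s lane=%s tile=%s x=%s y=%s read=%s filtered=%s control=%s barcode=%s\n", \
            $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11
    }'
}

parse_fastq_header '@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG'
# prints: instrument=EAS139 run=136 flowcell=FC706VJ lane=2 tile=5 x=1000 y=12850 read=1 filtered=Y control=18 barcode=ATCACG
```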

Note: In a paired end run, read 1 and read 2 will be in different fastq files.
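Because the two reads of a pair live in separate files, a quick sanity check is to confirm that both files hold the same number of records. The sketch below assumes four lines per record. The function name check_pairs is hypothetical.

```shell
# Compare the number of records in a read 1 file and a read 2 file.
# check_pairs is a hypothetical helper name.
check_pairs() {
    # wc -l < file prints only the line count; four lines make one record.
    r1=$(( $(wc -l < "$1") / 4 ))
    r2=$(( $(wc -l < "$2") / 4 ))
    if [ "$r1" -eq "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "Mismatch: $r1 reads in $1, $r2 reads in $2" >&2
        return 1
    fi
}

# Usage:
#   check_pairs lane3-6_GCCAAT_L003_R1_001.fastq lane3-6_GCCAAT_L003_R2_001.fastq
```

Equal counts do not prove that the reads are in corresponding order, only that no records are obviously missing from one file.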

Counting Reads

You may want to have an idea of the number of reads you obtained from your sequencing run before starting your analysis. Below are some commands you can use for counting reads on the command line. Alternatively, you can use FastQC to count reads and obtain other statistics.
echo `wc -l yourfile.fastq | cut -f1 -d' '` / 4 | bc
## Prints the number of sequences in yourfile.fastq. wc -l counts the number of lines in the file,
## cut grabs the number returned by wc, this number divided by 4 is the number of sequences in the fastq file,
## bc performs the division operation.
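On some systems wc pads its output with leading spaces, which can leave cut with an empty first field. A variant that avoids cut and bc is sketched below. It assumes a well-formed fastq file with exactly four lines per read. The function name count_reads is made up for this example.

```shell
# Count the reads in one fastq file.
# count_reads is a hypothetical helper name.
count_reads() {
    # Reading from stdin makes wc print only the number, without the file name.
    lines=$(wc -l < "$1")
    # Shell arithmetic ignores any padding around the number.
    echo $(( lines / 4 ))
}

# Usage:
#   count_reads yourfile.fastq
```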
You may need to count reads in multiple files. This can be accomplished in the following way.
##not run: extracting line counts from multiple files
echo `wc -l file1.fastq file2.fastq`
40000 file1.fastq 40000 file2.fastq 80000 total
##not run: grabbing the total line count. You should change the argument for the -f flag if you are counting
##sequences in more than 2 files to reflect the column the total is in, for instance with 3 files the total
##will be in the 7th column.
echo `wc -l file1.fastq file2.fastq` | cut -f5 -d' '
## Run: prints the number of sequences in fastq files 1 and 2.
echo $(echo `wc -l file1.fastq file2.fastq` | cut -f5 -d' ')/4 | bc
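If you do not want to work out which column holds the total, one alternative is to concatenate the files and count the lines once. This is a sketch under the same four-lines-per-read assumption. The function name count_reads_total is hypothetical.

```shell
# Count the reads across any number of fastq files.
# count_reads_total is a hypothetical helper name.
count_reads_total() {
    # cat streams all files into a single wc, so there is only one number to read.
    total_lines=$(cat "$@" | wc -l)
    echo $(( total_lines / 4 ))
}

# Usage:
#   count_reads_total file1.fastq file2.fastq
#   count_reads_total *.fastq
```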

Organizing your Data

You will need to combine your files before using them for assemblies or quality checking. There are several ways to do this. Keep in mind that some programs require their input data to be arranged in a particular way. For instance, Velvet/Oases needs the input file to contain alternating paired-end reads and expects paired-end reads to come from opposite strands facing each other, as in the traditional Sanger format. If you have paired-end reads produced from circularisation (i.e. from the same strand), it will be necessary to replace the first read in each pair by its reverse complement before running Velvet. Other assembly programs, such as the Trinity package, require paired-end reads to be in separate files in corresponding order.


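Combine all fastq files into a new fastq file and remove all files but the combined file. The !(all.fastq) pattern needs bash's extglob option, which must be switched on in an earlier line, and it matches every other file in the directory, not only fastq files, so run this in a directory that holds nothing but your fastq files.
shopt -s extglob
cat *.fastq > all.fastq && rm !(all.fastq)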
You can use a similar approach if you would only like to combine a subset of your fastq files. For example you can combine only files containing reads for pair 1 and only those containing reads for pair 2. Remember the read number is in the file name (lane3-6_GCCAAT_L003_R1_001.fastq).
mkdir temp && cat *R1*.fastq > temp/all_R1.fastq && cat *R2*.fastq > temp/all_R2.fastq &&
rm *.fastq && mv temp/* /your/working/directory && rm -r temp
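Since the commands above delete the original files, you may prefer a more cautious version that only deletes them once the combined file has been checked. The sketch below is one way to do that, not part of any tool mentioned here. The function name combine_fastq is hypothetical.

```shell
# Concatenate fastq files into one output file and delete the inputs
# only if the line counts match. combine_fastq is a hypothetical helper name.
combine_fastq() {
    out=$1
    shift
    # Refuse to overwrite an existing file, which could also be one of the inputs.
    if [ -e "$out" ]; then
        echo "Refusing to overwrite $out" >&2
        return 1
    fi
    before=$(( $(cat "$@" | wc -l) ))
    cat "$@" > "$out" || return 1
    after=$(( $(wc -l < "$out") ))
    if [ "$before" -eq "$after" ]; then
        rm -- "$@"
    else
        echo "Line counts differ ($before vs $after); inputs kept" >&2
        return 1
    fi
}

# Usage:
#   combine_fastq all_R1.fastq *R1*.fastq
#   combine_fastq all_R2.fastq *R2*.fastq
```

The shell expands the *R1*.fastq pattern before the output file is created, so the combined file is not among its own inputs on the first run. The overwrite check guards against a second run.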

Shuffle Sequences

The Shuffle script included with the Velvet distribution is described below. For other options see scripts.

If you have forward and reverse reads in two different FASTA files but in corresponding order, the ShuffleSequences Perl script bundled by the Velvet/Oases developers will merge the two files into one as appropriate. You can download this script from the Velvet GitHub page. Below are some usage examples.

In general you can use ShuffleSequences in the following way:
./ forward_reads.fastq reverse_reads.fastq output.fastq
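To show what merging two files "as appropriate" means, here is a small sketch that interleaves two fastq files record by record. It is not the Velvet script, only an illustration of the output layout. It assumes both files have four lines per record and list the pairs in the same order. The function name interleave_fastq is hypothetical.

```shell
# Interleave two fastq files: record 1 of the forward file, record 1 of the
# reverse file, record 2 of the forward file, and so on.
# interleave_fastq is a hypothetical helper name, not the Velvet script.
interleave_fastq() {
    awk -v mate="$2" '
        { print }                      # copy each line of the forward file
        NR % 4 == 0 {                  # after every complete forward record...
            for (i = 0; i < 4; i++) {  # ...copy one record from the reverse file
                if ((getline line < mate) > 0) print line
            }
        }
    ' "$1"
}

# Usage:
#   interleave_fastq forward_reads.fastq reverse_reads.fastq > output.fastq
```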