Getting Started With Your Data

Here are some examples of how you can use UNIX tools to start looking at your next-generation sequencing data. We suggest you read about **Input/Output Redirection** and **Job Control** first.


=Uncompressing your Data=
So far we have received our data as tar archives from sequencing facilities. Uncompressing these archives is covered in **Archived and Compressed Files**. Before uncompressing all of your data, check how much space is available on the machine you are working on. A compressed fastq file can be approximately 400 MB. Your data can consist of up to 100 fastq files per lane, approximately 40 GB in total. Uncompressed, this can be over 100 GB. For comparison, my work computer, an iMac, has only 80 GB of space on its hard drive! Remember you can use the command **df** with the flag **-h** to see how much free space you have on your hard drive.
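The space check described above can be sketched as a few small shell functions. This is only an illustration under some assumptions: the archive is gzip-compressed, and the function names (free_kb, unpacked_kb, fits_on_disk) are made up for this example and are not part of any tool.

```shell
# free_kb DIR: kilobytes still available on the file system that holds DIR.
# df -P prints one line per file system; column 4 is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# unpacked_kb FILE.gz: kilobytes FILE.gz would occupy once uncompressed.
# The data is streamed through wc, so nothing is written to disk.
unpacked_kb() {
    gzip -dc "$1" | wc -c | awk '{ print int(($1 + 1023) / 1024) }'
}

# fits_on_disk FILE.gz DIR: succeeds only if the uncompressed data
# is smaller than the free space in DIR.
fits_on_disk() {
    [ "$(unpacked_kb "$1")" -lt "$(free_kb "$2")" ]
}
```

For example, `fits_on_disk lane3.tar.gz . && tar -xzf lane3.tar.gz` (with a hypothetical archive name) would only unpack when there is room. Streaming a large archive through gzip takes a while, but it is cheaper than running out of disk space halfway through.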

=FASTQ File Conventions=
In general your fastq file will have a name that looks something like:

lane3-6_GCCAAT_L003_R1_001.fastq

This file will contain alternating lines with sequences and base quality scores. You can read more about the fastq file format **here**. An example fastq entry can be found below. Data generated using Illumina's pipeline CASAVA 1.8 and higher uses the standard Sanger ASCII encoding of Phred quality scores (**ASCII** characters 33-126).
The parts of the example file name are:
|| lane3-6 || Lane and barcode index ||
|| GCCAAT || Barcode sequence ||
|| L003 || Lane number ||
|| R1 || Read number ||
|| 001 || File number ||
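The naming convention in the table can be taken apart in the shell by splitting the file name on underscores. The sketch below assumes file names follow exactly the five-part pattern shown above; the function name parse_fastq_name and the output labels are hypothetical.

```shell
# parse_fastq_name FILE: split a name such as
# lane3-6_GCCAAT_L003_R1_001.fastq into its five underscore-separated parts.
# The body runs in a subshell ( ... ) so no variables leak into the caller.
parse_fastq_name() (
    base=${1##*/}          # drop any leading directory
    base=${base%.fastq}    # drop the .fastq extension
    IFS=_ read -r index barcode lane read_no file_no <<EOF
$base
EOF
    printf 'index=%s barcode=%s lane=%s read=%s file=%s\n' \
        "$index" "$barcode" "$lane" "$read_no" "$file_no"
)
```

Running `parse_fastq_name lane3-6_GCCAAT_L003_R1_001.fastq` prints `index=lane3-6 barcode=GCCAAT lane=L003 read=R1 file=001`.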

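@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
CATCATCATCATCATCATCATCATCATCATCATCAT
+
BBBBCCCC?

The header line of a CASAVA 1.8 entry has the form @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<barcode sequence>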
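To make the Sanger (Phred+33) encoding concrete, the sketch below converts a quality string into numeric Phred scores by subtracting 33 from the ASCII code of each character. The function name phred_scores is hypothetical, and the input used in the usage note is simply the quality characters shown in the example entry.

```shell
# phred_scores QUAL: print the Phred score of every character in QUAL,
# assuming the Sanger / CASAVA 1.8+ offset of 33.
phred_scores() (
    qual=$1
    out=
    while [ -n "$qual" ]; do
        rest=${qual#?}                    # everything after the first character
        char=${qual%"$rest"}              # the first character itself
        code=$(printf '%d' "'$char")      # ASCII code of that character
        out="$out${out:+ }$((code - 33))"
        qual=$rest
    done
    printf '%s\n' "$out"
)
```

`phred_scores 'BBBBCCCC?'` prints `33 33 33 33 34 34 34 34 30`. Data from pipelines older than CASAVA 1.8 may use a different offset, in which case the 33 would need to change.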

Note: In a paired-end run, read 1 and read 2 will be in different fastq files.

=Counting Reads=
You may want to know how many reads you obtained from your sequencing run before starting your analysis. Below are some commands you can use for counting reads on the command line. Alternatively, you can use FastQC to count reads and obtain other statistics.

code format="bash"
# Prints the number of sequences in yourfile.fastq.
# wc -l counts the number of lines in the file; reading the file from standard
# input (<) makes wc print only the number, without the file name.
# This number divided by 4 is the number of sequences in the fastq file;
# bc performs the division operation.
echo "$(wc -l < yourfile.fastq) / 4" | bc
code

You may need to count reads in multiple files. This can be accomplished in the following way.

code format="bash"
# Not run: extracting line counts from multiple files
echo `wc -l file1.fastq file2.fastq`
40000 file1.fastq 40000 file2.fastq 80000 total

# Not run: grabbing the total line count. You should change the argument for the -f flag if you are counting
# sequences in more than 2 files to reflect the column the total is in; for instance, with 3 files the total
# will be in the 7th column.
echo `wc -l file1.fastq file2.fastq` | cut -f5 -d' '
80000

# Run: prints the number of sequences in fastq files 1 and 2.
echo $(echo `wc -l file1.fastq file2.fastq` | cut -f5 -d' ')/4 | bc
20000
code

=Organizing your Data=
You will need to combine your files before using them for assemblies or quality checking. There are several ways to do this. Keep in mind that some programs require their input data to be arranged in a particular way. For instance, Velvet/Oases needs the input file to contain alternating paired-end reads and expects paired-end reads to come from opposite strands facing each other, as in the traditional Sanger format. If you have paired-end reads produced from circularisation (i.e. from the same strand), it will be necessary to replace the first read in each pair by its reverse complement before running Velvet. Other assembly programs, such as the Trinity package, require paired-end reads to be in separate files in corresponding order.
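Because assemblers that take paired-end reads expect both files of a pair to hold reads in corresponding order, one quick sanity check before assembly is that the two files contain the same number of reads. The sketch below reuses the line-count-divided-by-4 idea from Counting Reads; the function names count_reads and same_read_count are hypothetical, and the check only compares counts, it does not verify that read names match.

```shell
# count_reads FILE: number of fastq records in FILE (lines / 4).
# Reading from standard input makes wc print only the number.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# same_read_count R1_FILE R2_FILE: succeeds only if both files
# contain the same number of reads.
same_read_count() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}
```

For example, `same_read_count all_R1.fastq all_R2.fastq || echo "read counts differ"` prints a warning when the two files disagree.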

Cat
Combine all fastq files into a new fastq file and remove all files but the combined file. Note that !(all.fastq) is a bash extended glob: it has to be enabled with shopt -s extglob, and it matches every other file in the current directory, not only fastq files.

code format="bash"
shopt -s extglob
cat *.fastq > all.fastq && rm !(all.fastq)
code

You can use a similar approach if you would only like to combine a subset of your fastq files. For example, you can combine only files containing reads for pair 1 and only those containing reads for pair 2. Remember the read number is in the file name (lane3-6_GCCAAT_L003_**R1**_001.fastq).

code format="bash"
mkdir temp && cat *R1*.fastq > temp/all_R1.fastq && cat *R2*.fastq > temp/all_R2.fastq && rm *.fastq && mv temp/* /your/working/directory && rm -r temp
code
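Since the commands above delete the original files, it can be worth confirming that the combined file is complete before removing anything. The following is a minimal sketch of that idea, not a replacement for the commands above; the function name combine_fastq and the combined/ directory in the usage note are hypothetical.

```shell
# combine_fastq OUT IN1 [IN2 ...]: concatenate the input files into OUT and
# succeed only if OUT has as many lines as all inputs together.
# OUT should live outside the set of input files (for example in another
# directory) so that it is never picked up as an input.
combine_fastq() (
    out=$1
    shift
    cat "$@" > "$out" || exit 1
    expected=$(( $(cat "$@" | wc -l) ))
    actual=$(( $(wc -l < "$out") ))
    [ "$expected" -eq "$actual" ]
)
```

A possible use is `mkdir combined && combine_fastq combined/all_R1.fastq *R1*.fastq && rm *R1*.fastq`, which removes the originals only after the check has passed. The shell expands *R1*.fastq in sorted order, so running the same command with *R2*.fastq keeps the two combined files in corresponding order as long as the file names differ only in R1 and R2.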

Shuffle Sequences
The Shuffle script included with the Velvet distribution is described below. For other options see **scripts**.

If you have forward and reverse reads in two different FASTQ files but in corresponding order, a bundled Perl script from the Velvet/Oases developers called shuffleSequences_fastq.pl will merge the two files into one as appropriate. You can download this script from the **Velvet GitHub page**. Below are some usage examples.

In general you can use shuffleSequences_fastq.pl in the following way:

code
./shuffleSequences_fastq.pl forward_reads.fastq reverse_reads.fastq output.fastq
code
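To illustrate what interleaving means, here is a minimal sketch of the same idea using awk. It is not the Velvet script and makes no claim to match its behaviour in every detail; it assumes both files hold four-line fastq records in corresponding order, and the function name interleave_fastq is hypothetical.

```shell
# interleave_fastq FORWARD REVERSE: write alternating records to standard
# output: record 1 of FORWARD, record 1 of REVERSE, record 2 of FORWARD, ...
interleave_fastq() {
    awk -v rev="$2" '
        { print }                          # pass through each forward line
        NR % 4 == 0 {                      # a forward record is complete
            for (i = 0; i < 4; i++) {      # append the matching reverse record
                if ((getline line < rev) > 0) print line
            }
        }
    ' "$1"
}
```

It would be called as `interleave_fastq forward_reads.fastq reverse_reads.fastq > output.fastq`. If the two files do not contain the same number of reads, the output will not be properly paired, so it is worth counting the reads in both files first.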