File Formatting

Below are some examples of simple scripting solutions to commonly encountered file manipulation problems.


= Fastq to Fasta File =
To convert a fastq file to a fasta file, remove the quality header line (+) and the quality score line from each record, and replace the @ at the start of the sequence header with a >.

**Fastq File:**
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
CATCATCATCATCATCATCATCATCATCATCATCAT
+
BBBBCCCC?

**Fasta File:**
>EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
CATCATCATCATCATCATCATCATCATCATCATCAT

**Python:**
code format="python"
myFastq = open('myfile.fastq', 'r')  #open fastq file for reading
myFasta = open('myfile.fasta', 'w')  #open fasta file for writing

while True:  #loop until the end of the file is reached
    #read the 4 lines of one fastq record
    SequenceHeader = myFastq.readline()
    Sequence = myFastq.readline()
    QualityHeader = myFastq.readline()
    Quality = myFastq.readline()
    if SequenceHeader == '':  #exit loop when end of file is reached
        break
    #write output, dropping the leading '@' from the sequence header
    myFasta.write('>%s%s' % (SequenceHeader[1:], Sequence))

#close files
myFastq.close()
myFasta.close()
code

**Bash:**
code format="bash"
# grep for all sequence header lines and the following line (-A 1) in your fastq file. Delete the separator ('--')
# introduced by the grep search. Replace @ with >. The '|' character pipes the output from the previous command
# into the following command. The grep search relies on 'EAS' being common to all sequence headers in your fastq file.

grep -A 1 '^@EAS' myfile.fastq | sed '/^--$/d' | sed 's/^@/>/' > myfile.fasta
code

=Subset File=
Here is an example for sub-setting a fastq file containing 1000 sequences into 10 fastq files containing 100 sequences each.
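The same split can also be sketched in Python. This is a minimal sketch, assuming every fastq record occupies exactly 4 lines (no wrapped sequences); the function names `fastq_chunks` and `split_fastq` and the `new_` output prefix are illustrative choices, not part of any library.

```python
from itertools import islice


def fastq_chunks(lines, records_per_chunk=100):
    """Yield lists of lines, each holding up to records_per_chunk fastq records.

    Assumes every record is exactly 4 lines long.
    """
    lines = iter(lines)
    lines_per_chunk = 4 * records_per_chunk
    while True:
        chunk = list(islice(lines, lines_per_chunk))
        if not chunk:  # input exhausted
            return
        yield chunk


def split_fastq(in_path, records_per_chunk=100, prefix='new_'):
    """Write each chunk to prefix1.fastq, prefix2.fastq, ... (hypothetical naming)."""
    with open(in_path) as handle:
        for i, chunk in enumerate(fastq_chunks(handle, records_per_chunk), start=1):
            with open('%s%d.fastq' % (prefix, i), 'w') as out:
                out.writelines(chunk)
```

Calling `split_fastq('myfile.fastq')` on a 1000-sequence file would produce `new_1.fastq` through `new_10.fastq`. Unlike a fixed loop count, the sketch keeps going until the input runs out, so the last file is simply shorter when the total is not a multiple of 100.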

**Bash:**
code format="bash"
# Loop over the range of files you need to generate (1000/100 = 10).
# Create a variable j that keeps track of how many lines you have processed. Each fastq record
# is 4 lines long, so 100 sequences correspond to 400 lines.
# Pipe (|) the top j lines (head -n) from your file to the tail command to grab the last 400 lines (tail -n 400).
# Redirect (>) the lines grabbed by tail into a new file.

for ((i=1; i<=10; i=i+1)); do j=$((i*400)); head -n $j myfile.fastq | tail -n 400 > new_$i.fastq; done
code

=Replace Mac Line Breaks=
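Files saved with classic Mac line breaks end each line with a carriage return (\r) rather than a line feed (\n). As a rough Python sketch of the replacement, assuming Python 3 and a file small enough to read into memory (the names `mac_to_unix` and `convert_file` are made up for this example):

```python
def mac_to_unix(text):
    # Replace every carriage return with a line feed.
    return text.replace('\r', '\n')


def convert_file(in_path, out_path):
    # newline='' stops Python 3 from translating line endings itself,
    # so the carriage returns reach mac_to_unix() unchanged.
    with open(in_path, newline='') as src:
        text = src.read()
    with open(out_path, 'w', newline='') as dst:
        dst.write(mac_to_unix(text))
```

Note that a file with Windows line breaks (\r\n) would end up with two line feeds per line, so this is only suitable for files that contain bare carriage returns.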

**Bash:**
code format="bash"
# replace each carriage return (\r) with a line feed (\n); the result is printed to standard output
cat yourfile | tr '\r' '\n'
code

=Remove Line Breaks from Sequences=
Here's how to get a sequence that is split across several lines onto a single line.

**Line Breaks:**
>EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
CATCATCATCATCAT
CATCATCATCATCAT
CATCAT

**No Line Breaks:**
>EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
CATCATCATCATCATCATCATCATCATCATCATCAT

This approach will work for fasta files and will require some modification for fastq files.

**Python:**
code format="python"
myFasta = open('myfile.fasta','r')  #open fasta file for reading
NewFile = open('sameline.fasta','w')  #open new fasta file for writing

line = myFasta.readline()  #read first line in fasta file

while line:  #loop over records in fasta file
    NewFile.write(line)  #write header line to new file
    sequenceList = []  #initiate empty list for storing sequence lines
    line = myFasta.readline()  #read next line from fasta file
    while line and not line.startswith('>'):  #loop over sequence lines
        sequenceList.append(line.strip())  #strip line break from line and append to sequenceList
        line = myFasta.readline()  #read next line from fasta file
    NewFile.write('%s\n' % ''.join(sequenceList))  #write sequence to new file

#close files
myFasta.close()
NewFile.close()
code
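For fastq files, the modification mentioned above has to deal with quality lines that may themselves begin with @ or +, so a header test alone is not enough. One possible sketch, assuming well-formed records in which the quality string has the same length as the sequence, is to collect sequence lines up to the + line and then read quality lines until their combined length matches the sequence. The function name `unwrap_fastq` is hypothetical.

```python
def unwrap_fastq(lines):
    """Yield (header, sequence, quality) tuples with wrapped lines joined.

    Assumes each record's quality string is as long as its sequence.
    """
    lines = iter(lines)
    for header in lines:
        header = header.rstrip('\r\n')
        if not header:
            continue  # skip blank lines between records
        # collect sequence lines until the quality header ('+...') is reached
        seq_parts = []
        for line in lines:
            line = line.rstrip('\r\n')
            if line.startswith('+'):
                break
            seq_parts.append(line)
        sequence = ''.join(seq_parts)
        # collect quality lines until they cover the whole sequence
        qual_parts = []
        qual_len = 0
        while qual_len < len(sequence):
            line = next(lines, None)
            if line is None:  # file ended early
                break
            line = line.rstrip('\r\n')
            qual_parts.append(line)
            qual_len += len(line)
        yield header, sequence, ''.join(qual_parts)
```

Each tuple can then be written back out as four lines, for example with `out.write('%s\n%s\n+\n%s\n' % record)`.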