Making efficient use of the resources you rent from Amazon

At Amazon, as with any other cloud computing service, you pay for server time by the unit (usually by the hour). So while you're renting a machine, you want to use all the resources you're paying for, and only for as long as you really need them. Here are a few tips for speeding up your work and reducing the amount of time you rent a server.

A lot of tasks can be parallelized easily with GNU parallel. It is already installed on our AMI (check the README in the home directory to see what programs are installed). If you need it elsewhere, you can download it from here. Installation is as easy as untarring the bz2 archive and then issuing the following commands.
## run these in the directory that was created
## when you untarred the bz2 file
./configure
make
make install
## make install may need root privileges, depending on the install prefix
GNU parallel handles the scheduling for you: it keeps as many jobs running at once as you tell it to and starts the next one as soon as a slot frees up. Let's say you have 4 cores on a machine and you want to run the same script on 8 different datasets. You can write a simple batch text file (called batch.txt here) for quality trimming sequences in fastq files that looks as follows.
qtrimScript someFile1.fastq
qtrimScript someFile2.fastq
qtrimScript someFile3.fastq
qtrimScript someFile4.fastq
qtrimScript someFile5.fastq
qtrimScript someFile6.fastq
qtrimScript someFile7.fastq
qtrimScript someFile8.fastq
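If your files follow a numbering scheme like the one above, you don't have to type the batch file by hand. Here is a minimal sketch that writes the same eight lines to batch.txt. The file names and the qtrimScript command are just the placeholders from the example, so adjust the pattern to your own data.

```shell
# Write one qtrimScript command per dataset into batch.txt.
# Assumes the files are named someFile1.fastq ... someFile8.fastq.
for i in 1 2 3 4 5 6 7 8; do
    printf 'qtrimScript someFile%s.fastq\n' "$i"
done > batch.txt
```

A quick look at the result with cat batch.txt before handing it to GNU parallel is cheap insurance against a typo in the pattern.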
You could launch all of these commands at once, but that would actually slow the server down, since you'd be trying to run 8 instances of the script at the same time on only 4 cores. Running them one after the other, on the other hand, leaves cores idle and wastes your time and server resources. GNU parallel can take care of this for you.
parallel -j+0 < batch.txt
## This runs as many instances of the commands in batch.txt
## in parallel as you have cores (-j+0) and schedules the
## jobs in an efficient manner
Instead of waiting for the jobs you just started to finish, you can unmount the volume you mounted to your server and shut down the server automatically when they're done.
parallel < batch.txt && umount /path/to/volume && poweroff &
## umount and poweroff are only executed if parallel exits without error
(parallel < batch.txt ; umount /path/to/volume && poweroff) &
## unmounts the volume regardless of the exit status of parallel but doesn't
## shut down the server unless the volume unmounts cleanly; the parentheses
## put the whole sequence in the background, not just umount and poweroff
parallel < batch.txt && poweroff &
## the poweroff command should unmount your volume automatically
## use at your own risk -- adding umount is probably safer
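Since a mistyped one-liner can power off a machine at the wrong moment, it can help to rehearse the sequence first. The following is a rough sketch, not a tested production script: run_then_shutdown, cleanup, DRY_RUN, VOLUME and run.log are all names made up for this example. With DRY_RUN=1 (the default here) it only prints what it would do.

```shell
# Hypothetical wrapper: run a job, log its exit status, then unmount and power off.
# DRY_RUN=1 only prints the cleanup command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
VOLUME="${VOLUME:-/path/to/volume}"

cleanup() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: umount $VOLUME && poweroff"
    else
        umount "$VOLUME" && poweroff
    fi
}

run_then_shutdown() {
    # "$@" is the job to run, e.g. sh -c 'parallel -j+0 < batch.txt'
    rc=0
    "$@" || rc=$?
    echo "jobs finished with exit status $rc" >> run.log
    # Like the ';' variant above, cleanup runs regardless of the job's exit status.
    cleanup
    return "$rc"
}
```

Once the dry run prints what you expect, set DRY_RUN=0 and start it in the background with something like run_then_shutdown sh -c 'parallel -j+0 < batch.txt' &. The exit status ends up in run.log on whatever disk you start it from, so keep that file on a volume that survives the shutdown if you want to read it later.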
Following the above steps, you should be able to get your results more quickly while spending less time and money. There's no reason to pay for an idle server! Note that GNU parallel doesn't optimize RAM usage, so memory-hungry tasks such as assembly and mapping won't benefit much from it. Using the last commands I mentioned will still save money, for example by shutting down your instance as soon as your assembly is done.
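Because GNU parallel does not look at memory, one way to keep memory-hungry jobs from exhausting RAM is to cap the job count yourself with -j. The helper below is a hypothetical sketch, not part of GNU parallel: it picks the smaller of the core count and the number of jobs that fit in memory, based on a per-job memory estimate that you have to supply yourself.

```shell
# jobs_for CORES TOTAL_MB PER_JOB_MB
# Prints how many jobs to run at once: limited by cores and by memory,
# but never fewer than one. All three numbers are supplied by the caller.
jobs_for() {
    cores=$1
    by_mem=$(( $2 / $3 ))
    if [ "$by_mem" -lt 1 ]; then
        by_mem=1
    fi
    if [ "$by_mem" -lt "$cores" ]; then
        echo "$by_mem"
    else
        echo "$cores"
    fi
}
```

On a 4-core machine with roughly 15000 MB of RAM and jobs that need about 6000 MB each, parallel -j "$(jobs_for 4 15000 6000)" < batch.txt would run two jobs at a time. The memory figures here are invented for illustration, so measure your own jobs before relying on them.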

You can also use mailx to send yourself an email before the machine shuts down, letting you know that your data is ready to be picked up. I may get to an example of this at some point.
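In the meantime, a minimal sketch might look like the following. It assumes that mailx is installed and that the instance is actually able to send outgoing mail, which is not a given on a fresh server. The address and the function names build_message and notify are placeholders invented for this example.

```shell
# Hypothetical address; replace it with your own.
RECIPIENT="you@example.com"

# Build the body of the notification from the exit status passed as $1.
build_message() {
    printf 'Jobs finished on %s with exit status %s.\n' "$(hostname)" "$1"
}

# Send the notification. Assumes mailx is installed and outgoing mail works.
notify() {
    build_message "$1" | mailx -s "Data ready for pickup" "$RECIPIENT"
}
```

You would call it between the jobs and the shutdown, for example parallel < batch.txt; notify $?; umount /path/to/volume && poweroff. Depending on the mail setup, the message may still be sitting in the queue when poweroff runs, so it is worth testing once with a machine you are not about to lose.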