Spider is an R package for visualizing and analyzing barcode sequence data. This tutorial is a condensed version of the available Spider tutorial with an alternative method for plotting the barcoding gap. More information about this package can be found at the Spider website.




Figure (CO1_Trachylina.png): Furthest intraspecific vs. closest interspecific distance density plot, generated using the commands in step 10.





1. Align your sequences and export the alignment as a fasta file.
2. Get R: Download R from a selected CRAN mirror.
3. Start R
4. Get and load Spider:
install.packages("spider")
library(spider)
5. Import your alignment into R:
Aln <- read.dna("path/to/mySequences.fas", format="fasta")
6. Define genus and species vectors:
#These commands assume that the sequences in the alignment are labeled in the following way: genus_species_accession
SplitNames <- strsplit(dimnames(Aln)[[1]], split="_")
#Species Vector
Spp <- sapply(SplitNames, function(x) paste(x[1], x[2], sep="_"))
#Genus Vector
Gen <- sapply(SplitNames, function(x) x[1])
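Before running the commands above on a full alignment, it can help to try the splitting on a few labels. The sketch below uses made-up labels and hypothetical variable names (Labels, ToySpp, ToyGen, BadLabels). The check for labels with fewer than three fields is an added suggestion and is not part of the Spider tutorial.

```r
# Hypothetical labels in the genus_species_accession format
Labels <- c("Aglaura_hemistoma_AB001",
            "Aglaura_hemistoma_AB002",
            "Liriope_tetraphylla_CD003")
SplitNames <- strsplit(Labels, split = "_")
# Species vector: first two fields joined by an underscore
ToySpp <- sapply(SplitNames, function(x) paste(x[1], x[2], sep = "_"))
# Genus vector: first field only
ToyGen <- sapply(SplitNames, function(x) x[1])
# Labels with fewer than three fields do not follow the assumed format
BadLabels <- Labels[lengths(SplitNames) < 3]
print(ToySpp)
print(ToyGen)
print(BadLabels)
```

If BadLabels is not empty, the affected sequences should be relabeled before continuing, because their species names would contain "NA".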
7. Calculate summary statistics:
#This function will print summary statistics for your data, example below.
dataStat(Spp, Gen)
 
Genera Species  Min  Max  Median  Mean  Thresh
1      4        2    16   10      9     1
Genera: number of genera, Species: number of species, Min: the minimum number of individuals per species, Max: the maximum number of individuals per species, Median: the median number of individuals per species, Mean: the mean number of individuals per species, Thresh: how many species have fewer individuals than the threshold (default of 5)
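To make the column definitions concrete, the same quantities can be computed by hand in base R. This is only a sketch that follows the definitions listed above, using hypothetical vectors (ToySpp, ToyGen). It is not Spider's own implementation, and dataStat may round or format its output differently.

```r
# Hypothetical species vector with one entry per sequence
ToySpp <- c(rep("Aglaura_hemistoma", 6),
            rep("Liriope_tetraphylla", 2),
            rep("Geryonia_proboscidalis", 4))
# Genus is everything before the first underscore
ToyGen <- sub("_.*", "", ToySpp)
# Number of individuals per species
PerSpecies <- as.vector(table(ToySpp))
Summary <- c(Genera  = length(unique(ToyGen)),
             Species = length(PerSpecies),
             Min     = min(PerSpecies),
             Max     = max(PerSpecies),
             Median  = median(PerSpecies),
             Mean    = mean(PerSpecies),
             Thresh  = sum(PerSpecies < 5))  # species with fewer than 5 individuals
print(Summary)
```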
8. Generate a distance matrix using the Kimura 2-parameter model:
#model = "K80" is the Kimura 2-parameter model (it is also the default of dist.dna)
Dist <- dist.dna(Aln, model = "K80", pairwise.deletion = TRUE)
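For intuition about what the Kimura 2-parameter distance measures, it can be worked out by hand for a single pair of sequences. The sketch below uses two made-up, gap-free sequences and the standard formula d = -0.5 * ln((1 - 2P - Q) * sqrt(1 - 2Q)), where P is the proportion of transitions and Q the proportion of transversions. The variable names are hypothetical, and the code does not handle gaps or ambiguous bases the way dist.dna does.

```r
# Two hypothetical aligned sequences of equal length, without gaps
Seq1 <- strsplit("ACGTACGTACGTACGTACGT", "")[[1]]
Seq2 <- strsplit("GCGTACGTATGTACGTACGA", "")[[1]]
Purines <- c("A", "G")
# Sites where the two sequences differ
Differs <- Seq1 != Seq2
# A transition stays within purines or within pyrimidines
SameClass <- (Seq1 %in% Purines) == (Seq2 %in% Purines)
P <- mean(Differs & SameClass)    # proportion of transitions
Q <- mean(Differs & !SameClass)   # proportion of transversions
# Kimura 2-parameter distance
K2P <- -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))
print(c(P = P, Q = Q, K2P = K2P))
```

The resulting distance is somewhat larger than the raw proportion of differing sites (P + Q), because the model corrects for multiple substitutions at the same site.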
9. Generate distributions of the furthest intraspecific and the closest interspecific distances. The following explanation is taken from the Spider tutorial:
"The “barcoding gap” (Meyer & Paulay, 2005) is an important concept in DNA barcoding. It is the assumption that the amount of genetic variation within species is smaller than the amount of variation between species. This allows the two to be distinguished. As pointed out by Meier et al. (2008), the barcode gap should be calculated using the smallest, rather than the mean interspecific distances. Spider generates two statistics for each individual in the dataset, the furthest intraspecific distance among its own species—maxInDist() and the closest, non-conspecific (i.e., interspecific distance)—nonConDist()."
inter <- nonConDist(Dist, Spp)
intra <- maxInDist(Dist, Spp)
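The two statistics described in the quote can be illustrated on a small distance matrix in base R. This is a sketch of the idea only, with a made-up matrix and hypothetical names (ToySpp, ToyDist, ToyIntra, ToyInter). It does not reproduce Spider's code, and it assumes every species has at least two individuals.

```r
# Hypothetical species labels and symmetric distance matrix for five individuals
ToySpp <- c("sp_A", "sp_A", "sp_A", "sp_B", "sp_B")
ToyDist <- matrix(c(0.00, 0.01, 0.02, 0.10, 0.12,
                    0.01, 0.00, 0.03, 0.09, 0.11,
                    0.02, 0.03, 0.00, 0.08, 0.13,
                    0.10, 0.09, 0.08, 0.00, 0.04,
                    0.12, 0.11, 0.13, 0.04, 0.00),
                  nrow = 5, byrow = TRUE)
N <- length(ToySpp)
# Furthest distance from each individual to another member of its own species
ToyIntra <- sapply(seq_len(N), function(i) {
  max(ToyDist[i, ToySpp == ToySpp[i] & seq_len(N) != i])
})
# Closest distance from each individual to a member of a different species
ToyInter <- sapply(seq_len(N), function(i) {
  min(ToyDist[i, ToySpp != ToySpp[i]])
})
print(ToyIntra)
print(ToyInter)
# TRUE where the closest interspecific distance exceeds the furthest
# intraspecific distance, i.e. where a barcoding gap is present
print(ToyInter > ToyIntra)
```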
10. Plot density distributions of the furthest intraspecific and the closest interspecific distances (see the Spider tutorial for an alternative way of representing the barcoding gap) and save the plot as a pdf. You can also plot the distributions as overlapping histograms using the directions here.
#Open a pdf file
pdf("My_Barcode_Plot.pdf")
#Convert the count data associated with inter and intra into density distributions
DensityInter <- density(inter)
DensityIntra <- density(intra)
#Plot the intraspecific density distribution: xaxt='n' and yaxt='n' remove
#number values from the x and y axes, main sets the title of the plot,
#xlab sets the x-axis label.
plot(DensityIntra, xaxt ='n', yaxt ='n', main ='My Plot', xlab = "genetic distance")
#Color the intraspecific density distribution red
polygon(DensityIntra, col="red")
#Add the interspecific density distribution to the same plot
lines(DensityInter)
#Color the interspecific density distribution a transparent yellow
polygon(DensityInter, col=rgb(1,1,0,0.5))
#Add a legend to the plot: the first argument specifies the location,
#legend sets the legend text and fill sets the colors for each term in legend
legend('topright', legend=c('intra','inter'), fill=c('red',rgb(1,1,0,0.5)))
#Stop writing to the pdf file
dev.off()
11. Optional: Calculate the % overlap between the count data distributions of the furthest intraspecific and the closest interspecific distances.
#Store histogram data for intra and inter distributions in variables
IntraHist <- hist(intra, plot=FALSE)
InterHist <- hist(inter, plot=FALSE)
#Determine common bins for both histograms, using the bin width of the intra
#histogram (stored as BinWidth so that the distance matrix Dist is not overwritten)
BinWidth <- IntraHist$breaks[2] - IntraHist$breaks[1]
#Extend the upper limit by one bin so that every value falls inside the breaks
Breaks <- seq(min(IntraHist$breaks, InterHist$breaks), max(IntraHist$breaks, InterHist$breaks) + BinWidth, by=BinWidth)
#Store histogram data with common bins for intra and inter distributions
IntraHist <- hist(intra, breaks=Breaks, plot=FALSE)
InterHist <- hist(inter, breaks=Breaks, plot=FALSE)
#Extract maximum and minimum count values at each bin
MaxValues <- pmax(IntraHist$counts, InterHist$counts)
MinValues <- pmin(IntraHist$counts, InterHist$counts)
#Calculate total area
Area <- sum(MaxValues)
#Calculate percent overlap
(sum(MinValues)/Area)*100
Figure (overlap.png): Furthest intraspecific vs. closest interspecific distance histograms (generated from the same data as the density plot above). The area counted as overlap is shown in grey. See step 10 for the link to instructions for creating overlapping histograms.
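The overlap measure in step 11 can be checked by hand on a few bins. The sketch below uses made-up counts per bin (IntraCounts and InterCounts are hypothetical) and assumes both histograms already share the same bins.

```r
# Hypothetical counts in five common bins
IntraCounts <- c(5, 3, 1, 0, 0)
InterCounts <- c(0, 1, 2, 4, 3)
# Shared area: the smaller of the two counts in each bin
Shared <- sum(pmin(IntraCounts, InterCounts))
# Total area: the larger of the two counts in each bin
Total <- sum(pmax(IntraCounts, InterCounts))
# Percent overlap
PercentOverlap <- Shared / Total * 100
print(PercentOverlap)
```

Here the distributions share 2 counts out of a combined area of 17, so the overlap is about 11.8%. Note that the value depends on the bin width chosen for the histograms.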