Sequence Clustering

Check out the Westcott and Schloss (2015) PeerJ paper for orientation on the differences between closed-reference and de novo approaches. It is a very readable account and makes a number of comparisons in bacteria that we have recently tried in fungi. We stayed away from open reference because of that method's poor performance in their paper. In our recent fungal comparisons, which involved four NGS datasets from soil, live wood, dead wood, and live leaves, we found that closed-reference and de novo results were pretty much equivalent in OTU richness and composition for fungi from known groups. The caveat is that not every community is well represented in the database: in our case, the live wood fungal community contained many species that weren’t in UNITE.

As such, if you are working in a poorly studied habitat (e.g. fungi inside rocks), it seems best to use de novo. If you are working in a better-studied habitat (e.g. AM fungi in soil), you can much more “safely” use closed reference. The advantage of the closed approach is that you will be able to match your OTUs consistently across studies, and there are better alignment algorithms available, so the accuracy of your assignments is higher. (Closed reference is also faster in my experience, but speed is not likely to be the most limiting factor moving forward.) This is a fast-moving field, but my general feeling is that closed reference will only get more popular as UNITE and other databases get better.

For de novo, in terms of the specific clustering algorithm, we have seen that the results are broadly similar across the ones we tested (VSEARCH, USEARCH, CD-HIT, SWARM). This was part of the talk I gave at MSA 2016, which I will post on my lab website. The one important exception is that we see heavy OTU inflation with both SWARM and CD-HIT compared to the others: they made lots of additional OTUs with very few sequences in them. We found that if you eliminate all the OTUs with fewer than 10 sequences from your analyses, the four algorithms become functionally equivalent in terms of OTU richness. De novo gave more OTUs than closed reference, which is not surprising, but when we relativized the richness by ecological treatment, we found that the richness ratios were the same for closed versus de novo. Therefore I would lean towards using relative richness rather than absolute OTU counts in data analyses, as I think the absolute number of OTUs remains suspect in all NGS datasets.
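To make the two steps above concrete (dropping OTUs with fewer than 10 sequences, then comparing richness between treatments as a ratio), here is a minimal Python sketch. It is not the code we used. The table layout, the function names, the toy counts, and the choice to apply the 10-sequence cutoff to each OTU's total across the whole dataset are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical layout: OTU id -> {sample id -> read count}
OtuTable = Dict[str, Dict[str, int]]


def drop_rare_otus(table: OtuTable, min_reads: int = 10) -> OtuTable:
    """Keep only OTUs whose total read count across all samples is >= min_reads."""
    return {
        otu: counts
        for otu, counts in table.items()
        if sum(counts.values()) >= min_reads
    }


def richness_by_group(table: OtuTable, sample_groups: Dict[str, str]) -> Dict[str, int]:
    """Count the OTUs observed (count > 0) in at least one sample of each group."""
    richness = {group: 0 for group in set(sample_groups.values())}
    for counts in table.values():
        groups_seen = {
            sample_groups[sample]
            for sample, n in counts.items()
            if n > 0 and sample in sample_groups
        }
        for group in groups_seen:
            richness[group] += 1
    return richness


def richness_ratio(table: OtuTable, sample_groups: Dict[str, str],
                   numerator: str, denominator: str) -> float:
    """Richness of one treatment group relative to another."""
    richness = richness_by_group(table, sample_groups)
    return richness[numerator] / richness[denominator]


# Toy data, invented for this example only.
toy_table: OtuTable = {
    "OTU1": {"s1": 50, "s2": 30},
    "OTU2": {"s1": 20, "s2": 0},
    "OTU3": {"s1": 3, "s2": 0},   # rare: 3 reads in total
    "OTU4": {"s1": 1, "s2": 2},   # rare: 3 reads in total
    "OTU5": {"s1": 0, "s2": 12},
    "OTU6": {"s1": 15, "s2": 0},
}
toy_groups = {"s1": "control", "s2": "treated"}

filtered = drop_rare_otus(toy_table, min_reads=10)
richness = richness_by_group(filtered, toy_groups)
ratio = richness_ratio(filtered, toy_groups, "treated", "control")
print(sorted(filtered), richness, round(ratio, 3))
```

The idea is to run the same filter and ratio on the OTU table from each clustering method and compare the ratios rather than the raw OTU counts.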

We (Lauren Cline, Zewei Song, and I) have been working on this project with Dan Knights and Gabe Al-Ghalith at UMN, who are the authors of NINJA (Al-Ghalith et al. (2015) PLoS Computational Biology). NINJA seems to have a lot going for it, so I encourage anyone using closed reference to consider the NINJA algorithms.
