Sequence Filtering

One of the first big decisions is how far into your sequence pile you want to reach for things that are probably fungal (assuming the primer set used, in my case ITS1F and ITS2, does a pretty good job of biasing heavily towards fungi). We have used a cut-off of 0.6 (at least a 60% match to something in UNITE) for a sequence to be considered fungal. I have not looked at the sequences below that, but I imagine BLASTing them against NCBI would give poor matches to lots of different eukaryotes. A higher cut-off will naturally result in more sequences being removed from your pool early, but also gives more confidence that everything you include really is fungal in origin (although see notes on taxonomic assignment below). So far, we have worked in habitats where basal lineages are not dominant, so if basal lineages are more your target, you might want to lower your threshold to make sure important “novel” groups aren’t screened out at this first step.
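To make the cut-off concrete, here is a minimal sketch of the filtering step in Python. It assumes you have already reduced your UNITE search results to one best-match score per sequence, expressed as a fraction between 0 and 1. The function name, the dictionary layout, and the example IDs and scores are all hypothetical and only there for illustration; they are not the output format of any particular tool.

```python
from typing import Dict, List, Tuple


def split_by_match_score(
    scores: Dict[str, float], cutoff: float = 0.6
) -> Tuple[List[str], List[str]]:
    """Split sequence IDs into (kept, dropped) by best-match score.

    `scores` maps a sequence ID to its best match against the reference
    database as a fraction (0.0 to 1.0). This layout is an assumption
    made for the sketch, not a real tool's output format.
    """
    kept: List[str] = []
    dropped: List[str] = []
    for seq_id, score in scores.items():
        # A sequence sitting exactly on the cut-off is kept ("at least 60%").
        if score >= cutoff:
            kept.append(seq_id)
        else:
            dropped.append(seq_id)
    return kept, dropped


# Made-up scores, purely for illustration.
example_scores = {"seq1": 0.98, "seq2": 0.61, "seq3": 0.42, "seq4": 0.60}

kept, dropped = split_by_match_score(example_scores, cutoff=0.6)
strict_kept, strict_dropped = split_by_match_score(example_scores, cutoff=0.9)
```

Returning the dropped IDs as well as the kept ones makes it easy to go back and look at what a given threshold screened out, which matters if you suspect basal or poorly represented lineages are sitting just under the cut-off.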

More generally, the effort involved in clustering goes up as the number of unique sequences increases, so strong pre-filtering steps can speed up downstream clustering considerably. This is my understanding of one of the core differences between UPARSE and USEARCH (with UPARSE being a stronger early filter and USEARCH being one that relies on greater post-cluster filtering effort).
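As a rough illustration of why pre-filtering helps, the sketch below counts unique sequences before and after a filter has been applied. The read IDs, the sequences, and the list of retained IDs are invented for the example, and the helper name is hypothetical; the only point is that the number of unique sequences handed to clustering can only stay the same or shrink once reads are removed.

```python
from typing import Dict, Iterable


def count_unique_sequences(reads: Dict[str, str], ids: Iterable[str]) -> int:
    """Count distinct sequences among the reads whose IDs are in `ids`."""
    return len({reads[seq_id] for seq_id in ids})


# Made-up reads: seq1 and seq2 are identical, seq3 and seq4 are distinct.
reads = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGT",
    "seq3": "TTGACCAA",
    "seq4": "GGCCTTAA",
}

# Hypothetical result of an upstream filter: seq3 did not pass.
retained_ids = ["seq1", "seq2", "seq4"]

unique_before = count_unique_sequences(reads, reads.keys())
unique_after = count_unique_sequences(reads, retained_ids)

print(f"unique sequences before filtering: {unique_before}")
print(f"unique sequences after filtering:  {unique_after}")
```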

