Barcoding lichen-forming fungi using 454 pyrosequencing is challenged by artifactual and biological sequence variation

Although lichens (lichen-forming fungi) play an important role in the ecological integrity of many vulnerable landscapes, only a minority of lichen-forming fungi have been barcoded out of the currently accepted 18 000 species. Regular Sanger sequencing can be problematic when analyzing lichens since saprophytic, endophytic, and parasitic fungi live intimately admixed, resulting in low-quality sequencing reads. Here, high-throughput, long-read 454 pyrosequencing in a GS FLX+ Systemwas tested to barcode the fungal partner of 100 epiphytic lichen species from Switzerlandusing fungal-specificprimerswhenamplifying the full internal transcribed spacer region (ITS). Thepresent study shows the potential of DNAbarcoding using pyrosequencing, in that the expected lichen funguswas successfully sequenced for all samples except one. Alignment solutions such as BLAST were found to be largely adequate for the generated long reads. In addition, the NCBI nucleotide database—currently the most complete database for lichenforming fungi—can be used as a reference database when identifying common species, since themajority of analyzed lichenswere identified correctly to the species or at least to the genus level. However, several issueswere encountered, including a high sequencing error rate, multiple ITS versions in a genome (incomplete concerted evolution), and in some samples the presence of mixed lichen-forming fungi (possible lichen chimeras).


Introduction
Lichens are intimate and long-term symbiotic associations consisting of a heterotrophic fungal partner-also called the mycobiont-and photosynthetic algae or cyanobacteria-also called the photobiont (Nash 2008).In a systematic context, lichens are named after the fungal partner, and according to most recent estimates, about 18% of all fungal species are lichen-forming (Feuerer and Hawksworth 2007).While lichens include many bio-indicators for monitoring environmental qualityincluding air pollution and ecological integrity of forest landscapes (Nimis et al. 2002)-accurate identification of lichenized fungal species remains challenging (Lumbsch and Leavitt 2011).DNA-based specimen identification to a species level is useful in a system of well-circumscribed taxa and a high-quality reference database (Seifert 2009;Begerow et al. 2010).Recently, the internal transcribed spacer region (ITS) of the nuclear ribosomal RNA cistron was proposed as the primary fungal barcode marker (Schoch et al. 2012).However, only a minority of lichenforming fungi have been barcoded from the estimated 17 500 species (Feuerer and Hawksworth 2007).For example, the largest and most typical order of lichen-forming fungi (Lecanorales) consists of about 5700 species, but only about 29% of them have publicly available ITS sequences in sequence databases such as the National Institute of Health's (NIH) genetic sequence database, GenBank (http://www.ncbi.nlm.nih.gov/genbank/;accessed 28 September 2015), the Barcode of Life Data System (BOLD; www.boldsystems.org;Ratnasingham and Hebert 2007), and UNITE (https://unite.ut.ee/; accessed 24 November 2015; Abarenkov et al. 2010).
Barcoding lichens using Sanger sequencing has been successful in some groups of lichens (Kelly et al. 2011;Divakar et al. 2016), especially with foliose and fruticose lichens; however, in crustose lichens-that constitute by far the vast majority of lichenized species (Bergamini et al. 2005)-it often proves quite challenging (Flück 2012).Sampling difficulties occur where other very similar lichen species live mixed or close by, and many saprophytic, endophytic, and parasitic fungi also live intimately admixed with the lichen mycobiont, making the application of Sanger sequencing insufficient in many cases (Flück 2012;Orock et al. 2012).A limited number of studies are known to have successfully applied pyrosequencing to recover the identity of a lichen, when Sanger sequencing failed to produce usable results (Hodkinson and Lendemer 2013;Lücking et al. 2014a).
Recent advancements in pyrosequencing methods now allow the amplification of fragments up to 1000 base pairs (bp) in the GS FLX+ system of Roche/454 pyrosequencing.However, 454 pyrosequencing is notorious for its high indel (short insertions and deletions) error rate in homopolymeric regions (three or more identical nucleotites) and carry-forward-incomplete-extension (CAFIE) errors (Margulies et al. 2005;Huse et al. 2007; Gilles et al. 2011;Lücking et al. 2014b).In addition to artifactual sequence variation, biological sequence variation-such as intragenomic and (or) intramycelial (i.e., allelic heterozygosity) variation of this multicopy gene-may be possible (Wörheide et al. 2004;Simon and Weiß 2008;Lindner et al. 2013).
The aim of our study was to test whether lichen mycobiont can be successfully identified to a single species using DNA barcoding through pyrosequencing the ITS marker of 100 species of both foliose-fruticose and crustose lichens, using fungal-specific markers in the highthroughput 454 sequencing in the GS FLX+ system, while elucidating and quantifying artifactual and biological sequence variation.

Taxon sampling and morphology-based identification
One hundred lichen specimens, including 52 crustose species and 48 macrolichens (foliose and fruticose thalli), were collected from different locations in Switzerland in 2011 and 2014.The full list of specimens with voucher information is given in Table 1.Additional specimen data-such as the collection GPS coordinates, substrate, specimen photographs, the primary ITS barcodes-are available though a public BOLD dataset DOI: dx.doi.org/10.5883/DS-LICODE.All project data can also be accessed though the PlutoF cloud database under project "Barcoding Swiss lichens using 454 pyrosequencing" (https://plutof.ut.ee/#/study/view/ 31890).Specimen identifications of the collected material were based on morphological and chemical characters, following the nomenclature of Clerc and Truong (2012).Lichen chemistry was examined with standardized thin layer chromatography (TLC) using solvent systems A, B, and C (Culberson andAmmann 1979, White andJames 1985).Sorediate Bacidia and Lecanora species without apothecia, samples #6, #7, and #42, could not be identified to the species level and were left as Bacidina arnoldiana aggregate (aggr.)and Lecanora strobilina aggr., respectively.Specimen LC-088, determined as Lecidella sp., includes an undescribed chemotype for Lecidella (arthothelin, 2,4-dichloronorlichexanthone), and BC-47-1 Lecanora sp.morphologically resembles Haematomma ochroleucum or Biatora flavopunctata, with atranorin, usnic acid, and zeorin identified as secondary substances.Specimens with poorly expressed diagnostic characters were re-checked in light of barcoding data, and the determination of five specimens was corrected (LC-023 Lecanora cf.umbrina f Lecania cf.cyrtella, BC-027-1 Haematomma aff.ochroleucum or Biatora flavopunctata f Lecanora sp., BC-155-5 Lecidea nylanderi f Loxospora elatina, BC-161-5 Lecanora farinaria/expersa f Loxospora elatina, KM-03-02 Usnea florida f Usnea intermedia).All specimens are stored at the WSL Swiss Federal Research Institute at -20 °C.vegetative thallus, 1-2 apothecia were used for DNA extraction.Frozen lichen samples were lyophilized and disrupted with a stainless steel bead in a Retsch MM2000 mill (Düsseldorf, Germany) for 2 min at 30 Hz.The full genomic DNA was extracted using the Qiagen DNEasy Plant Mini Kit (QIAGEN, Hilden, Germany) following the manufacturer's Plant Tissue Mini Protocol and diluted in 50 L elution buffer.A two-step polymerase chain reaction (PCR) approach was used for unidirectional amplicon sequencing.First, the full ITS was amplified via PCR by using short fungal-specific primers, ITS1F (Gardes and Bruns 1993) and ITS4 (White et al. 1990).Each PCR reaction (25 L) consisted of 2× KAPA HiFi HotStart ReadyMix PCR mix (KAPA Biosystems), 5 mol/L of each of the forward and reverse primer, and 1 L of template DNA.Thermal cycle conditions included the initial denaturation at 95 °C for 2 min; 35 cycles of denaturation for 20 s at 98 °C, primer annealing for 15 s at 57 °C, and extension for 20 s at 72 °C; and final extension for 3 min at 72 °C.
The PCR products were visualized on 1.5% agarose gel and purified via Exo-SAP (Fermentas) treatment.In the second PCR step, the products were re-amplified with fulllength fusion primers designed by Microsynth AG (Balgach, Switzerland).The fusion primers contained the GS FLX Titanium A-and B-adapters of the kit Lib-L (Roche Diagnostics), the multiplex identifier (MID) sequence, four base library key sequence (TCAG), and the same sample-specific primers used in the first step.The MID tags were used only on the forward, A-adaptor for unidirectional sequencing.Amplifications were conducted in 50 L reaction volumes containing 39 L of mastermix with reagents 5× KAPA HiFi Buffer, 10 mmol/L KAPA dNTP Mix, 2U/100 L KAPA HiFi HotStart DNA Polymerase (KAPA HiFi HotStart PCR Kit, KAPA Biosystems), 4 mol/L of each of the forward and reverse primer, and 1 L of template DNA.The cycle conditions were as follows: initial denaturation at 95 °C for 3 min; 12 cycles of denaturation at 98 °C for 20 s, primer annealing at 56 °C for 30 s, and elongation at 72 °C for 30 s; and a final elongation step at 72 °C for 5 min.The aliquots were then purified via AMPure Beads XP (Beckman Coulter, Brea, Calif., USA).Concentrations of purified PCR products were quantified by fluorometry using the Quant-iT™ PicoGreen ® dsDNA Assay Kit (Molecular Probes, Eugene, Oreg., USA).Probes were pooled in four equimolar pools (25 each), purified in 1.5% preparative gel, and run on 4/16 of a sequencing plate using Titanium FLX+ reagents on a GS Roche Sequencer (454 technology, Roche Diagnostics).Pre-sequencing steps (starting from the second PCR), sequencing, raw data processing using #3 pipeline preconfigured by Roche for long amplicons, and sorting the resulting reads into samples based on their MID's (demultiplexing) were all carried out by Mircosynth AG.

Data processing and analyses
We followed a traditional and rather conservative approach when sorting and clustering sequences.Reads with a non-matching forward primer, that were shorter than 500 bp, included more than two ambiguous sites, and (or) with a mean quality score lower than 25 were disregarded using the programs Cutadapt v1.7 (Martin 2011) andPRINSEQ-lite v0.20.4 (Schmieder andEdwards 2011).The quality-sorted reads were screened for chimeras using the uchime_ref command in the sequence analysis tool USEARCH v8.0.1623 (Edgar et al. 2011), searching against the UNITE/INSDC reference database of fungal ITS sequences (Nilsson et al. 2015) available at http://unite.ut.ee/repository.php(accessed 26 July 2015).The sequence data were not denoised, and singleton sequences were included to better understand the artifactual and biological sequence variation.The remaining reads for each sample were sorted by length, then clustered with USEARCH at a 95% similarity threshold using the centroids function in the UCLUST algorithm (Edgar 2010).This approach assumes that the longest read of a cluster is the most appropriate and assures that the longest fragment of the full ITS marker will be used for downstream analyses.The less stringent clustering was opted for our data to compensate for a high sequence error rate in the 454 system and to better reflect biological units, based on preliminary analyses where the suggested 97% threshold (see Blaalid et al. 2013) was applied.The resulting centroid sequences were compared against the NCBI nucleotide database (NCBI Resource Coordinators 2013) using the blastn algorithm to obtain their initial taxonomic affiliations.
Even though the Roche GS FLX+ system is capable of sequencing fragments up to 1000-bp long, the average read length remains between 500 and 600 bp (Erguner et al. 2015).The quality of the barcodes becomes an issue towards the end of the read as the base-call quality drops, and the further away from the start, the more ambiguous the base calling becomes (Fig. 1).The full ITS region in fungi has an average length of 500 bp in ascomycetes (Porter and Golding 2011).Based on the available sequences of the studied fungi, the barcoded ITS region is usually more than 550-bp long, and depending on the presence and length of the group I intron, it can be more than 1000-bp long.Consequently, ITS barcodes of more than 500-600 bp get fewer full-length target reads with low-quality sequence ends.Furthermore, various sequencing and PCR errors, such as homopolymer indel errors, CAFIE errors, and chimeras, are frequent and not always detected using the quality scores (reviewed in, for example, Margulies et al. 2005;Huse et al. 2007;Gilles et al. 2011;Lücking et al. 2014b).
To obtain a more reliable species reference sequence, the reads of the target taxon that were concordant with the quality parameters specified above were aligned using the MAFFT v7.017 automatic algorithm (Katoh and Standley 2013) in the Geneious v7.1.6platform (Kearse et al. 2012).To gather the relevant sequences from a pool, we used the cluster identifications from a previous step.Sequences within the target species (the expected myco-biont of the observed and determined lichen species) cluster(s)-identified as the taxon and (or) closely related taxon (for closely related species groups) using the GenBank BLAST function-were gathered and aligned together with reference sequences from GenBank.Clear outliers in the alignment were removed, the primer binding sites were trimmed, and the remaining regionincluding the end of ribosomal RNA gene 18S, ITS1, 5.8S rRNA gene, ITS2, the beginning of the 28S rRNA gene, and in some species, also the group I intron at the end of the 18S gene-was designated as the barcode region for the species.Nucleotide diversity was estimated for alignments with removed reference sequences using DnaSP v5.10.1 (Librado and Rozas 2009).The consensus sequence of this region was assigned as the barcode for this species.The barcode and the representative centroid sequence-the longest read of the biggest cluster of the taxon-were aligned using the same alignment function in MAFFT, to estimate sequence similarity and the quantity of indels and nucleotide differences.Using the consensus sequence approach allowed us to ignore much of the random noise within sequences, but it is dependent on the accuracy of a given alignment.Alternatively to MAFFT, we applied the PaPaRa alignment (Berger and Stamatakis 2011) to better account for erroneous insertions.Finally, however, PaPaRa alignments were not used due to multiple reasons, discussed and exemplified in the supplementary data, File S12 .
We tested the identification of the generated barcodes using the NCBI nucleotide database, currently the most For personal use only.
complete database for lichen-forming fungi.The barcodes were compared against the database using the megaBLAST function in GenBank (http://www.ncbi.nlm.nih.gov/genbank/; accessed 11 January 2015; Madden 2002).To evaluate the success of specimen identification using DNA barcoding we only counted the identity of the best match in GenBank BLAST, ignoring the sequence similarity since no singe meaningful threshold can be applied over the species set, and for many of the taxa, too little or no information is known for accurate threshold estimation.We compared the morphological species determination with the best match (= with the highest bit score) from the barcode sequence megaBLAST search and recorded for each specimen if the identification via DNA barcoding resulted in concordant results at the species or the genus level (= correct species / = correct genus but different species or species not present in the database), divergent (= different genus and species), or neutral results (= unidentified or uncultured fungus).In cases where the hypothesized identity differed from the GenBank match, or the similarity between the GenBank sequences and our sequence was less than 97% (following Blaalid et al. 2013), a phylogeny-based identification was used.To achieve this, available closely related representative ITS sequences of the same genus or species group (depending on the known information of the species phylogeny) were aligned with the barcode using the program MAFFT v7 (Katoh and Standley 2013), where the G-INS-i alignment algorithm (Katoh and Toh 2008) with 1PAM / K = 2 scoring matrix was set.The alignments of ITS sequences were analyzed using a maximum likelihood (ML) criterion in the program RAxML v7.3.1 (Stamatakis 2006) where the evolutionary model was set to GTRGAMMA, and node support was assessed using 1000 "fastbootstrap" replicates (Stamatakis et al. 2008).
Where distinct nucleotide variation-possibly representing intragenomic variation of ITS sequences or a mixture of individuals of the same or closely related species-was noted, the target species alignments were studied carefully, and different ITS version frequencies were counted.In cases where (a) less-dominant version(s) of ITS was/were higher in frequency than 10% of the target reads, consensus barcode(s) for the less-dominant versions were also generated.The most frequent barcode version was assigned as the primary barcode for the specimen.The different barcode versions were aligned using the MAFFT v7.017 automatic algorithm (Katoh and Standley 2013), and sequence variation, nucleotide diversity, and distances (p-distances, with indels treated as missing data with pairwise deletion) were estimated using DnaSP v5.10.1 (Librado and Rozas 2009) and MEGA5 (Tamura et al. 2011).More distant barcodes of a samplepossibly resulting from a mixture of closely related species-were aligned with closely related representative ITS sequences (for the species list and references see Table S1 2 ), and phylogenetic trees were constructed us-ing the ML approach executed with RAxML v7.3.1 as described above.In some samples, unalignable, up to 10-bp-long regions were detected.Since these regions were usually within or after GC-rich areas, we suspected a problematic PCR accompanied by CAFIE errors, rather than genuine mutations (Dutton et al. 1993).A known issue for GC-rich islands is knot formations in the RNA fold.The RNA secondary structure for such problematic regions was estimated using RNAstructure web servers (available at: http://rna.urmc.rochester.edu/RNAstructureWeb;accessed 18 November 2015) utilizing the default parameters (Bellaousov et al. 2013).

Results
The sequencing of 100 lichens resulted in 128 449 reads.The number of reads per sample varied between 207 and 5473, with an average of 1285.A general overview of sequencing reads is given in Fig. 1.Through quality sorting and chimera checking, on average 21% of the reads were removed, varying from 9% to 57% among samples.Clustering of reads at 95% resulted in an average of 88 clusters per sample (minimum -mean±standard deviation -maximum; 5 -88±57 -231).Despite conservative quality sorting, high nucleotide variation per sample was present in the data.While a high proportion of variation resulted from contaminant fungi within samples, significant nucleotide variation within the target taxon was also present (Table 2).The mean number of clusters per target species was 8 (0 -8±7 -32), and the average nucleotide diversity within a target was 0.01128 ± 0.01862.The pairwise identity between the consensus sequence (barcode) and centroid sequence of the biggest cluster was on average 98.5%, with 9.65 as the mean number of differences per sequence pair.Comparing centroid sequences with consensus barcodes revealed a mean difference of 1.46%.About 90% of the differences were insertions (69%) and deletions (21%) that were especially prominent towards the end of the sequences, while the rest were nucleotide substitutions or CAFIE errors.
From 100 samples, we were able to recover the fungal ITS sequence of the lichen mycobiont for all sequenced species except one.Fuscidea arboricola sequences were not recovered in its sequence pool (#22).Full barcodes, including the end of 18S rRNA, ITS1, 5.8S rRNA, ITS2, and the beginning 28S rRNA, were generated for all species, except for the barcode of Anaptychia crinalis, which remained incomplete, with a partly missing ITS1 sequence due to two long group I introns towards the end of 18S rDNA (altogether 512 bp).To recover the end of the barcode, chimeric sequences without the introns were used.Introns at the end of 18S rDNA were identified in 37 target taxa (Table 1).The length of the intron varied from 60 to 594 bp.The primary barcode sequences have been deposited in the NCBI and BOLD databases under the accession codes as indicated in Table 1.All project dataspecimen voucher information, photographs, the complete  a Clustering conducted at 95% using centroids function in USEARCH.
b Quantity of transitions (TS), transversions (TV), and indels (Ind) between the generated consensus sequence (barcode) and representative centroid.
The barcodes from the targeted lichenized fungi were, in most cases, morphologically and molecularly identified to the same species (n = 69) or at least to the same genus (n = 18) using the NCBI nucleotide database as reference (Fig. 2; for details see Table S2 2 ).Nine of the species correctly identified using GenBank BLAST showed less than 97% similarity, which could indicate wider ITS genetic variation of the species or even the presence of cryptic species.ITS sequences of 14 of the studied species were previously not available in GenBank, this being one of the reasons for imprecise identifications.For eight species, the best match in GenBank was an "uncultured fungus", originating from environmental metabarcoding studies where morphology-based identifications and taxonomic assignments are often not performed.
Besides the expected mycobiont, many other fungi were recovered within our samples.For 22 samples, the most-sequenced fungus was not the expected mycobiont but another fungus instead.Many of these fungi were identified as lichen-associated (facultative parasites/ lichenicolous) or plant-associated (epi-or endophytes).However, the lichen-associated fungal diversity within our sequenced lichens is not in the scope of the present study and will not be discussed further.
We studied the nucleotide variation of samples where (a) less frequent version(s) of ITS constituted more than 10% of the expected mycobiont reads.Multiple barcodes per species were generated for 22 samples (Table 3; for the full list of barcodes of such samples refer to Table S3 2 ).The number of generated barcodes, and distances between them, varied considerably from species to species.Barcode versions with only indels as differences showed zero distance (Buellia arborea, Cladonia digitata, Hypotrachyna laevigata).However, since the indels in these species were present in the non-coding regions (group I intron, ITS1, or ITS2), they cannot be completely ruled out as genuine mutations.In Lecanora subcarpinea, Lepraria lobificans, and Menegazzia terebrata, a poorly alignable, up to 10-bp-long region was detected in ITS1.These regions were positioned within or at the end of sequence regions with high G + C content (GC%), where formations of strong knots in RNA folds are likely.In Lecanora subcaripinea, a region of six ambiguous bases occured at the end of a 38-bp region of 79.5% GC content.In Lepraria lobificans, five ambiguous bases occured at the end of a 44-bp region with a GC% of 82.6%.In Menegazzia terebrata, in a 60-bp region, with a GC% of 86.0%, an ambiguous region of nine bases occured within the sequencing reads.RNA secondary structure analyses revealed that the sequences in these areas form a secondary structure knot with probabilities of >95% for Lecanora, >99% for Lepraria, and >99% for Menegazzia. Figure S1 2 shows, by way of example, the ITS1 RNA secondary structure of the primary barcode version and ambiguous bases of this region in Menegazzia terebrata barcodes.
The dominant ITS version of Physica adscendens is highly similar to sequences of this species (99%; Table S2 2 ), while the second, less-frequent version included a high number of mostly unidirectional transitional mutations and was with low similarity to Physcia adscendens sequences (87%).

Fig. 2.
Comparison of morphological species determination and the results from sequence megaBLAST search in the NCBI nucleotide database for the studied 100 species.Only the best match accession identy was considered, and the sequence similarity results were ignored.Diagram proportions are indicated for concordant results at the species or the genus level (= correct species / = correct genus but different species or species not present), divergent (= different genus and species), or neutral results (= unidentified or uncultured fungus).For one species, no sequences were recovered within the quality-sorted reads; this is noted as "no target".3-5).
b Quantity of transitions (TS), transversions (TV), and indels (Ind), divided into exons and introns.c Variation in intron presence.For personal use only.
However, no better alignment with any other sequence in GenBank was found.Altogether, 92 transitional mutations were present among the two ITS versions (4 in 18S, 32 in the group I intron, 22 in ITS1, 9 in 5.8S, 18 in ITS2, and 7 in 28S).These mutations were almost exclusively unidirectional, being mainly G f A and C f T.
Besides artificial nucleotide variation caused by PCR/ sequencing errors, nucleotide variation can occur for biological reasons, such as (i) incomplete concerted evolution within the ITS and (ii) intraspecific variation.For 13 species, distinct nucleotide variation could not be explained by PCR or CAFIE errors only (Table 3).By comparing the variation within the reads of the mycobiont with available Sanger sequences from GenBank of the same species (where available), we also found the same variable bases in Sanger sequences in eight species.Therefore, it seems probable that intragenomic and (or) intra-mycelial (i.e., allelic heterozygosity) variation is present at least in the following samples: #8 Bryoria capillaris, #55 Lepraria rigidula, #64 Melanohalea exasperata, #70 Parmelia sulcata, #79 Physcia adscendens, #95 Usnea intermedia, and #96 Usnea lapponica.For eight species, such comparisons could not be made due to a lack of Sanger sequences.Sequence variation possibly resulting from intragenomic or intra-mycelial variation were additionally detected in #6 Bacidina arnoldiana, #12 Chaenotheca cf.stemonea, #19 Fellhanera bouteillei, #30 Lecania cf.cyrtella, #50 Lecidella scabra, and #51 Lecidella sp., of which the first three are seemingly also accompanied with intraspecific variation (Figs.3-5).The number of barcodes, however, might not necessarily represent the true intragenomic variation of a species.For example, seven variable nucleotide positions were found in Usnea intermedia with no clear segregation into distinct ITS types.In ITS1, three mutation positions (G N C, T N C, T N C) generated three types, while in ITS2, four mutation positions (T N C, G N A, A N C, T N A) generated four types, of which three were dominant.These types were combined with each, generating seven different barcode versions.Incomplete concerted evolution within ITS is therefore possible.However, inflation of ITS types should also be considered due to recombination or chimera formation between different ITS1 and ITS2 versions.The more similar the parents, the harder it is to detect and identify chimeras.Therefore, it could just as well be possible that fewer than seven different ITS versions are natural, while others are artificial formations.The number of barcodes can also be inflated by systematic CAFIE errors, which we have tried to account for when interpreting the reasons of base variation of some species (e.g., #69 Parmelia saxatilis).For the full list of the alternative barcodes and their hypothetical origin refer to Table S3 2 .
The highest number of variable sites, as well as the greatest genetic distances between barcodes, occurred in samples where multiple distinct lineages, probably representing different species, were identified in the pool.These samples were crustose species Bacidina arnoldiana aggr., Chaenotheca cf.stemonea, and Fellhanera bouteillei.
In Bacidina arnoldiana aggr.(sample #6), five different ITS versions from two Bacidina arnoldiana clades were identified (Fig. 4).The morphologically homogenous, yet variable species B. arnoldiana is therefore polyphyletic, comprising in itself multiple lineages and possibly other taxa with unresolved taxonomy.The five ITS versions from Bacidina arnoldiana belong in two major clusters, Bacidina arnoldiana 1 and Bacidina arnoldiana 3, and group together with other Bacidina arnoldiana sequences.Bacidina arnoldiana 2 includes only GenBank data and is more closely related to Bacidina arnoldiana 3. Four ITS versions cluster within Bacidina arnoldiana 1, of which two are dominant, A1 and B1.These are also almost equally present in the sequence pool (A1 in 65 reads, 42.6%; B1 in 45 reads, 41.7%).The subversions A1 and B1 have a pairwise identity of only 93.8%, and based on genetic distances, clade Bacidina arnoldiana 1 could therefore include cryptic species.However, no final conclusions can be made of our limited sample size and data based on ITS markers only.
In Chaenotheca cf.stemonea (sample #12), sequences from two Chaenotheca "species" were present (Fig. 5)-Chaenotheca cf.stemonea and Chaenotheca trichialis-xyloxena group.The first includes four highly similar (>98%) barcodes differing in multiple mutations and in intron presence/absence towards the end of the 18S rDNA.The Chaenotheca trichialisxylogena group includes six barcodes that cluster further into three groups: versions B together with a subgroup of C. trichialis, version A1 clusters within C. xyloxena, and version A2 with a second subgroup of C. trichialis at the base of the clade.The distances within subversions B are low (pairwise identity >98%), while between subversions B, A1, and A2, distances are larger (pairwise identities between 96% and 97%).
Fellhanera bouteillei (sample #19) included three ITS versions, of which two highly similar types (identity 99%), A1 and A2, were dominant, while the third, less-frequent version B was only present in three sequences.The lessdominant version is, however, significant as it represents the clade of Fellhanera beouteillei s.str.together with sequences available in GenBank (Fig. 5).The dominant types A1 and A2 clustered distantly outside of Fellhanera bouteillei and could represent a different species (pairwise identity between barcode versions A1 and B was 89%).

Barcoding lichens
In the scope of this study, we "barcoded" 99 out of 100 lichen mycobiont species using 454 pyrosequencing.Good success in generating barcodes shows the high potential of this method, especially concerning crustose lichens, which can be difficult to sequence with the Sanger technique.To our knowledge, this is the first attempt to apply pyrosequencing to lichen barcoding to such an extent.
The success of DNA-based identification depends greatly on the reference database, both in the quality and quantity of the barcodes included.Purely DNA-based identification of lichens-without considering morphological and chemical characters-is far from reach, especially regarding crustose lichens and samples from poorly investigated regions due to missing references and unknown diversity (Orock et al. 2012).Within the present study, we generated ITS barcodes for 12 described species whose sequences were previously not available in GenBank and for two morphologically newly characterized specimens (#41 Lecanora sp., #51 Lecidella sp.).Multiple species showed low intraspecific ITS sequence similarity to other available sequences in GenBank (e.g., #10 Bunodophoron melanocarpum, #12 Chaenotheca cf.stemonea, #19 Fellhanera bouteillei, #30 Lecania cf.cyrtella, #66 Micarea cinerea) or a high similarity to a different taxon (e.g., #8 Bryoria capillaris, #13-#16 Cladonia species, #36 Lecanora impudens, #47 Lecidella cf.elaeochroma, #47 Lecidella cf.leprothalla, #58-#59 Loxospora elatina, #74 Parmotrema perlatum, #91-#92 Usnea barbata; refer to Table S2 2 and to the project in PlutoF for further information).Before specimen identification via DNA barcoding can be confidently applied, taxonomic studies in these groups are needed to confirm the species monophyly or, in cases of non-monophyly, propose segregation of a taxon into multiple species.However, our results also suggest that the NCBI nucleotide da- tabase, currently the most complete database for lichenforming fungi, could be used as a reference database to identify common species, as the majority of the analyzed lichens were identified correctly to the species or, at least, to the genus level.
When generating barcodes, we applied a rather conservative approach by generating consensus sequences from target taxon alignments, rather than using centroid sequences.In most cases, centroid reads were sufficiently accurate for correct specimen identification to a species level.However, when producing a barcode database, even a few errors can cause systemic mistakes in future identifications.Even though it is more time consuming, this approach helps to better identify and avoid mistakes, such as those caused by sequencing, undetected chimeras, and multiple ITS copies.However, the propagation of errors cannot be avoided completely; they can originate directly from PCR amplification or be hindered by systematic sequencing errors, independent from data processing methods.

Reasons for nucleotide variation
The source of nucleotide variation within a species can be the PCR, the sequencing, or be of biological origin.While the first two sources generate artifactual variation, biological origins represent genuine mutations.
We identified three main sources of artifactual variation in our samples.
(1) Chimeras are hybrid products of PCR formed by different template sequences.Undetected chimeras lead to overestimated richness, artificial entities, and incorrect taxonomic identifications.It has been shown that chimera formation is especially strong in regions with alternating patterns between conserved and less-conserved regions, such as 16S rRNA genes (Shin et al. 2014) secondary structure hinders the access of PCR primers and enzyme work at elongation (Dutton et al. 1993).Even though high-temperature denaturation (98 °C) and high-fidelity heat-stable polymerase were used, the formation of strong knots in RNA structure that impede correct amplification cannot be ruled out and might have affected the ambiguous regions in the species Lecanora subcarpinea, Lepraria lobificans, and Menegazzia terebrata.
(3) 454 pyrosequencing is notorious for its high indel error rate in homopolymeric regions and carry-forwardincomplete-extension (CAFIE) errors (Margulies et al. 2005, Huse et al. 2007, Gilles et al. 2011).Incomplete extension creates shorter homopolymers due to insufficient dNTPs.Carry-forward errors are the result of incomplete dNTP flushing, which then causes a nucleotide from the end of the homoploymer to be read some bases later (e.g., a true sequence AAAATCG, can be read as AAATCGA).A comparison between the consensus barcodes and a representative read (the centroid sequence of the biggest cluster) showed a mean difference of 1.46%.About 90% of the errors were indels.These were most prominent in homopolymeric regions and were especially frequent towards the end of the sequence, while the rest were nucleotide substitutions.Considering that in closely related species complexes, less than 2% variation could already result in a different identification, the necessity to reliably separate artificial from genuine variation becomes clear.
Nucleotide variation in the ITS can represent genuine mutations when a species is characterized by incomplete concerted evolution (Feliner and Rosselló 2007).We have highlighted 12 species in Table 3 where detected nucleotide variation could represent intragenomic variation within ITS, or alternatively, allelic heterozygosity caused by differing nuclei within a single mycelium (Huang et al. 2010;Hyde et al. 2013).The latter is more likely when only two primary allele variants in approximatly equal ratios are observed (Lindner et al. 2013).Two more or less equally represented ITS paralogs were found in #1 Alectoria sarmentosa, #6 Bacidina arnoldiana, #8 Bryoria capillaris, #43 Lecanora subcarpinea, and #70 Parmelia sulcata, while one dominant ITS version together with relatively rare haplotypes was found in #5 Bacidia vermifera, #12 Chaenotheca cf.stemonea, #19 Fellhanera bouteillei, #30 Lecania cf.cyrtella, #50 Lecidella scabra, #51 Lecidella sp., #55 Lepraria rigidula, #64 Melanohalea exasperata, #95 Usnea intermedia, and #96 Usnea lapponica.The latter list includes representatives from a closely related species group in the genus Usnea (sect.Usnea; Mark et al. 2016), where low genetic variation and possible rapid radiation of species have been characterized (Truong et al. 2013;Kraichak et al. 2015).However, four other Usnea specimens from this closely related group were also included in our study, where such nucleotide variation was not detected (#91-#92 U. barbata, #94 U. intermedia, and #97 U. substerilis).Whether or not intragenomic variation can be detected via pyrosequencing likely also depends on the random effects of PCR and the proportions of different ITS copies in a genome.Xu et al. (2009) demonstrated a high level of intraindividual polymorphism in rDNA, including multiple functional genes, putative pseudo genes, and recombinants.They also found a systematic bias in the efficiency with which different sequences were amplified.In Physcia adscendens, a second ITS version with unidirectional transitions from cytocine to thymine and adenine to guanine was detected.The second ITS version of this species could represent a rare divergent ITS allele or a pseudogene, propagated to a higher frequency due to a PCR bias.Ribosomal DNA pseudogenes are characterized by low stability in the predicted secondary structure, a higher substitution rate, and deamination-driven stubstitutions (Buckler et al. 1997).A large multigene family of 5S rRNA, including pseudogenes, has also been identified in other filamentous fungi (Margolin et al. 1998;Rooney and Ward 2005).A duplication event of the ribosomal DNA cistron could be a plausible explanation for the detected nucleotide variation in Physcia adscendens and is in concordance with the findings of Lücking et al. (2012).
Intraspecific variation, resulting from possible lichen chimeras or microcosms-where diverse communities live mixed and alongside-is illustrated by the phylogenies for three crustose species.For example, in a single sample of Bacidina, constituting a minute quantity of lichen soredia, four distinct lineages of Bacidina were identified in a visually inseparable community.In these species, the problem of distinguishing between intraand interspecific variation arises.This variation within a specimen cannot be detected in Sanger sequencing environmental samples, and this could be one of the reasons for the taxonomic mess in the Bacidina arnoldiana aggregate, as also illustrated in Fig. 4. Further studies with more data are needed to investigate such phenomena.
Whatever the reason, nucleotide variation detected through 454 pyrosequencing must be interpreted with caution, both with respect to the acquired barcode database as well as answering biological questions.Using other, less error-prone squencing platforms and (or) combining a large number of reads per taxon could help to avoid random sequencing errors.Sanger references can aid in distinguishing between CAFIE errors and biological sequence variation.However, Sanger sequencing may remain insufficient in some cases, as it is unable to detect intragenomic sequence variation (e.g., to investigate species and genome evolution) and fails to give useable results in a mixture of templates.also wish to thank the editor and the reviewers for their useful comments, and Curtis Gautschi for linguistic corrections.This study was funded by the Rectors' Conference of the Swiss Universities (CRUS) through the SCIEX-NMS ch Scientific Exchange Programme (project SCIEX-LiCode), the Federal Office for the Environment (FOEN; grant to Silvia Stofer and C.S.), and SwissBOL (grant to C.S.).

Fig. 1 .
Fig. 1.General overview of pyrosequencing reads, with (A) number of reads per sample, (B) average read length, and (C) read quality distribution over read length.[Colour online.]

Fig. 3 .
Fig.3.Maximum likelihood ITS gene tree of available Bacidina arnoldiana aggr.sequences, with extended outgroup of closely related Bacidia species.Barcodes from Bacidina arnoldiana aggr.(sample #6) are indicated with a shaded background.The number of reads per barcode is given in brackets at the end of the name.Branches leading to strongly supported nodes (≥70%) are marked in bold.The scale bar shows the number of substitutions per site.[Colouronline.]

Fig. 4 .
Fig.4.Maximum likelihood ITS gene tree of available Chaenotheca sequences, including the barcodes from Chaenotheca cf.stemonea (sample #12, all with shaded background).The number of reads per barcode is given in brackets at the end of the name.Branches leading to strongly supported nodes (≥70%) are marked in bold.The scale bar shows the number of substitutions per site.[Colouronline.]

Table 1 .
Studied species with specimen voucher info, the primary barcode length for the sequenced ITS region, the position and length of an intron in the barcode (if present), and the primary barcode accession numbers in the NCBI and BOLD databases.

Table 2 .
Pyrosequencing results with nucleotide variation in the studied samples and comparison of consensus barcodes with representative sequencing reads for each sample.

Table 3 .
List of samples where minor barcode version(s) constituted more than 10% of the target reads, with characteristics of nucleotide variation between the barcodes and comparison with available Sanger sequences of the species (where possible, otherwise marked with "-").Nucleotide variation of specimens in bold more likely represents genuine mutations, while in others is probably due to PCR/sequencing errors.In presence of intraspecific variation, nucleotide variation within one of the "species" was compared: Bacidina arnoldiana subversions A1 & A2, Chaenotheca cf.stemonea subversions A1, A2, B & C, Fellhanera bouteillei subversions A1 & A2 (for reference see Figs.
Note: a Genome Downloaded from www.nrcresearchpress.com by Lib4RI -Library of Eawag, Empa, PSI & WSL on 10/03/16 Maximum likelihood ITS gene tree of available Fellhanera sequences with extended outgroup of closely related species, including the barcodes from Fellhanera boutellei (sample #19; all with shaded background).The number of reads per barcode is given in brackets at the end of the name.Branches leading to strongly supported nodes (≥70%) are marked in bold.The scale bar shows the number of substitutions per site.[Colouronline.]