A quantitative protocol for DNA metabarcoding of springtails (Collembola)

We developed a novel protocol with superior quantitative analysis results for DNA metabarcoding of Collembola, a major soil microarthropod order. Degenerate PCR primers were designed for conserved regions in the mitochondrial cytochrome c oxidase subunit I (mtCOI) and 16S ribosomal RNA (mt16S) genes based on published collembolan mitogenomes. The best primer pair was selected based on its ability to amplify each gene, irrespective of the species. DNA was extracted from 10 natural communities sampled in a temperate forest (with typically 25– 30 collembolan species per 10 soil samples) and 10 mock communities (with seven cultured collembolan species). The two gene regions were then amplified using the selected primers, ligated with adapters for 454 technology, and sequenced. Examination of the natural community samples showed that 32 and 36 operational taxonomic units (defined at a 90% sequence similarity threshold) were recovered from the mtCOI and mt16S data, respectively, which were comparable to the results of the microscopic identification of 25 morphospecies. Further, sequence abundances for each collembolan species from the mtCOI and mt16S data of the mock communities, after normalization by using a species as the internal control, showed good correlation with the number of individuals in the samples (R = 0.91–0.99), although relative species abundances within a mock community sample estimated from sequences were skewed from community composition in terms of the number of individuals or biomass of the species. Thus, this protocol enables the comparison of collembolan communities in a quantitative manner by metabarcoding.


Introduction
Springtails (Collembola) are a major group of soil microarthropods that are widely distributed in terrestrial ecosystems from tropical to polar regions and are abundant, especially in forests (Hopkin 1997). Typically, 60-80 species of Collembola and a total abundance of 10 4 -10 5 individuals m −2 are found in a forest site (Petersen and Luxton 1982). To date, more than 7,000 collembolan species have been described (Deharveng 2004). Most are omnivorous and mainly feed on detritus and microbes growing on it (e.g., Anderson and Healey 1972) and are preyed upon by predatory mites and larger arthropods (Johnson and Wellington 1980;Usher 1985;Lawrence and Wise 2000), forming an intermediate link in food webs during the decomposition process (Filser 2002).
Collembola are conventionally identified to the species level by the microscopic examination of their morphological features; therefore, the community assessment of this group has been time-consuming. Additionally, DNA barcoding (Hebert et al. 2003) of Collembola, using a standardized portion of the mitochondrial cytochrome c oxidase subunit 1 (mtCOI) gene, revealed that cryptic species were frequently found among morphologically identical specimens from different regions (e.g., Porco et al. 2012b), suggesting that conventional microscopic examination has confounded different species and underestimated their biodiversity. However, due to the required time and cost, it has not been practical to sequence every specimen by Sanger sequencing in community assessments where it is required to examine numerous microarthropods extracted from soil samples; typically, a 100 mL of bulk sample contains 100-1000 total animal specimens, including 10-30 species of Collembola (e.g.,

Page 4 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. Takeda 1987). These problems associated with taxonomic identification impede the comparison of collembolan communities between regions and thus the understanding of the global biodiversity of Collembola.
The application of next-generation sequencing (NGS) technology to eukaryote community assessment has emerged in recent years, called DNA metabarcoding (Taberlet et al. 2012), wherein DNA fragments are amplified and sequenced from environmental or bulk community samples for genes whose sequences could be used as diagnostic features for identifying specimens to the species level. Metabarcoding methods have been developed for various body sizes of animal assemblages from terrestrial and aquatic ecosystems, such as protists (Nolte et al. 2010), nematodes (Porazinska et al. 2009), aquatic meiofauna (Fonseca et al. 2010), river benthic macroinvertebrates (Hajibabaei et al. 2011), and insects (Yu et al. 2012), the assessment of whose communities had confronted a problem similar to that in Collembola.
For Collembola, Ramirez-Gonzalez et al. (2013) developed a metabarcoding protocol using the mtCOI gene and showed that metabarcoding is promising for diversity assessment in this taxon. However, they also reported that its quantitativity was poor. They metabarcoded mock DNA samples in which known and different concentrations of DNA from collembolan species were mixed and found no or only a weak correlation between the number of reads (output) and DNA amount across species within a sample, indicating that one could not determine the abundances species-dependent factors, such as copy number of the target gene per cell, body size (i.e., copy number per individual), and mismatches between PCR primers and binding sites (Deagle et al. 2014). Several studies found that number of reads differed by up to four orders of magnitude between species, even when a sample comprising species of similar body size was sequenced (Porazinska et al. 2010;Elbrecht and Leese 2015), indicating an effect of the degree of primer mismatch between species on PCR bias. The PCR protocol employed in NGS library preparation also may affect the extent of skewing in the final product ratio during PCR from mixed-species templates (Mathieu-Daudé et al. 1996;Polz and Cavanaugh 1998).
The genes used for metabarcoding may also be involved in the problem. The mtCOI gene region currently widely used for metazoans (Hebert et al. 2003) is a protein-coding sequence and thus harbors highly diversified nucleotides among species at the third positions of codons throughout the region owing to synonymous mutations, leading to difficulty in designing specific PCR primer sets that possess sufficient specificity and sensitivity to the target taxon (Deagle et al. 2014; see Leray et al. 2013 andBrandon-Mong et al. 2015 for efforts made to overcome this problem). This aspect of the mtCOI gene is associated with the problem of primer mismatches and thereby quantitativity in metabarcoding. Nuclear ribosomal RNA (rRNA) genes are also widely used for metabarcoding (e.g., Porazinska et al. 2009Porazinska et al. , 2010Fonseca et al. 2010;Hirai et al. 2015), and conserved regions in a taxonomic group may be found more easily in these than in the mtCOI gene (e.g., Machida and Knowlton 2012;Clarke et al. 2014). However, with respect to reference sequences required for taxonomic identification of metabarcoding sequences, a far larger database is available for the mtCOI gene (Ratnasingham and Hebert 2007) than for (nuclear or mitochondrial) rRNA genes. Thus, both coding and non-coding genes have pros and cons at present.

7
To develop a more quantitative method of metabarcoding for Collembola, we designed novel PCR primer sets for mtCOI and mitochondrial 16S rRNA (mt16S) genes and tested their performances in sequencing natural and mock collembolan community samples, employing a carefully designed laboratory protocol to minimize bias during PCR.
Based on the results, we discuss the specificity and sensitivity of our method and its quantitativity in community assessment. In quantitative terms, we examined 1) whether relative species abundances in sequences in a mock community sample reflect those in individuals and 2) the correlation between sequence and animal abundances per species among community samples and the effect of normalization methods (using relative abundance in individuals or biomass or an internal control).

Site description
The collembolans used in the present study were collected mainly from a temperate coniferous forest in the Kamigamo Experimental Forest of Kyoto University (225 m ASL, 135.77°E and 35.07°N), 10-km north of Kyoto, Japan, where the collembolan community has been extensively studied and more than 50 collembolan species have been recorded (e.g., Takeda 1987;Fujii and Takeda 2012;Saitoh et al. 2014). The mean annual temperature and annual cumulative precipitation during 2012 were 14.8°C and 1,485 mm, respectively. The canopy layer was dominated by Japanese cypress, Chamaecyparis obtusa. The soil was a moder type with a ca. 4-cm thick organic layer.

Page 7 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.

Collection of reference sequences
The collembolan specimens were collected by hand-sorting from soil microarthropods extracted from soil samples using Tullgren funnels as well as from culture populations of some species that were established from collected individuals. Individuals of some species were collected from the surfaces of soil, tree trunks, and mushrooms with an aspirator. Specimens were stored in vials with 99.5% ethanol at −20°C until use.
To construct a reference database linking metabarcoding sequences to morphological species, mtCOI and mt16S gene sequences were collected as follows: total DNA was extracted from collembolan specimens by a non-destructive method (Aoyama et al. 2015). After DNA extraction, the specimens were mounted on glass slides and identified to the species level based on morphological traits observed under a microscope according to the literature (Suma 2009;Tanaka 2010;Niijima and Hasegawa 2011;Hasegawa and Niijima 2012;Ichisawa 2012;Itoh et al. 2012;Hasegawa and Tanaka 2013;Furuno et al. 2014;Nakamori et al. 2014). From the DNA, mtCOI and mt16S gene regions were amplified and directly sequenced in both directions on an ABI 3130xl sequencer (Thermo Fisher Scientific, Waltham, Massachusetts, USA). The PCR primers and conditions used for construction of the reference database are listed in Table S1. When a sample was not amplifiable with an ordinary primer pair (Folmer et al. 1994) and the degenerate primers for Collembola (Nakamori 2013), amplification using other combinations of several available primers were tried so that the sequence of the gene region that was targeted in metabarcoding (to be described later) was obtained. We also sequenced the targeted gene regions from single specimens by multiplexing in a sequencing run of a 454 GS Junior sequencer (Roche, Basel, Switzerland) using a method (Aoyama et al. unpublished) analogous to that of Shokralla et al. (2014); the former method used ligation in

Page 8 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. library preparation to add adapters for 454 technology to amplified gene fragments, whereas the latter utilized fusion primers for the purpose.
In total, mitochondrial COI and 16S RNA gene sequences were collected from 302 collembolan specimens assigned to 34 morphological species. A total of 99 and 134 unique sequences of mtCOI and mt16S were found, respectively. Of these, mtCOI sequences from nine species were already published as part of the study by Aoyama et al. (2015). We found that the 454 sequencer frequently yielded multiple sequence types from a single specimen (76 and 103 specimens out of 190 total specimens sequenced for mtCOI and mt16S genes, respectively), which were considered to be unlikely as PCR or sequencing errors because they were found in both directions of reads and from multiple conspecific specimens whose PCR reactions were constructed independently.
Although these sequences presumably include potential pseudogenes termed as nuclear mitochondrial genes (numts) and true haplotypes, we could not yet completely determine which sequences are truly mitochondrial genes or numts; only a small fraction (7%) of the sequences contained symptom(s) for numt (i.e., insertion/deletion or stop codon) and were thereby considered pseudogenes. Translations were possible for most mtCOI sequences without a premature stop codon, and their sequences are often identical within a specimen. We considered that including numts into the reference database is useful for identification of their source organism when they are identified from metabarcoding data. Therefore, all unique sequences were included in our reference database.
Intra-and interspecific variation of the gene regions used for the metabarcoding method were evaluated using our reference database (Fig. S1). Dissimilarities between sequences were calculated by pairwise alignment (p-distance); insertions and deletions were considered as differences. As the size of our database was limited, published sequence Page 9 of 60 Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. data were combined with our data to conduct this and the following analyses. We used mtCOI sequence data from Porco et al. (2012a), who revealed sequence divergences between cryptic species within cosmopolitan morphospecies of Collembola; reproductively isolated lineages found in their study were regarded as different species (in total 188 sequences from 28 (crypto)species from 16 morphospecies). For mt16S data, sequences from published complete collembolan mitogenomes (10 sequences from 9 species; see later subsection for the accession numbers) were combined because no other sequence deposited in a public database completely overlapped the region utilized in our metabarcoding method.
The percentage of gaps in the distributions of sequence divergences were observed to be approximately 10% in both mtCOI and mt16S genes (Fig. S1). Sequences of the mtCOI gene from different species differed by at least 15% ( Fig. S1a,b), whereas those of mt16S differed by more than 17% (Fig. S1c,d). At most, 32.5% and 25.2% intraspecific distances were observed in the mtCOI and mt16S genes, respectively (Fig. S1b,d). Because considerable intraspecific distances were found in Onychiurus flavescens (17.0% in mtCOI and 18.6% in mt16S) and Neanuridae sp.1 (18.1% in mtCOL and 20.9% in mt16S), these morphospecies were suspected to include cryptic species (see below consideration on sequence divergence between collembolan species), although we found that in the other species, intraspecific variation >10% were in most cases due to potential numt sequences with large insertions and/or deletions.
Based on our reference database, threshold values for each gene were used for clustering sequences into operational taxonomic units (OTUs) in the data processing pipeline for metabarcoding data of the collembolan community (to be described later). The threshOpt function from the R package SPIDER (Brown et al. 2012) was

Page 10 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
used to evaluate the effect of the threshold value for delimitation of species on false positive (over-splitting) and false negative (over-lumping) rates for optimization of the threshold (Meyer and Paulay 2005). Over-splitting decreased with increasingly different threshold values, but occurred until thresholds of 31% and 20% in mtCOI ( Fig. S2a, b) and mt16S (Fig. S2c, d), respectively. Over-lumping of species occurred at a threshold value of ≥14% in mtCOI (Fig. S2a, b) and ≥15% in mt16S (Fig. S2c, d), respectively.
The success rate of taxonomic identification based on our database by the best close match method (Meier et al. 2006; using the bestCloseMatch function from the SPIDER package) was examined with a range of threshold values. A large proportion of sequences in the database were successfully identified by the other sequences in the same database at any threshold value. Misidentification of singleton sequences occurred at a threshold value of ≥15% difference in mtCOI (Fig. S3a, b) and ≥17% in mt16S (Fig. S3c, d), respectively. The same test was performed for identification to the family level, and it was revealed that no error occurred at ≤17% and ≤20% threshold values in mtCOI and mt16S, respectively (data not shown).
Based on these observations, the decision was taken to use a threshold of 90% sequence similarity to delineate collembolan sequences to OTUs so that sequences from different species might not be clustered into an OTU, although this threshold might divide sequences from the same currently recognized species into separate OTUs. We also decided to use a 90% sequence similarity for taxonomic assignment, as this value was considered safe from misidentification.

Page 11 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
DNA sequences of mtCOI and mt16S genes of 10 complete mitogenomes of Collembola were retrieved from the NCBI GenBank database (accessions AF272824, AY191995, Nardi et al. 2001Nardi et al. , 2003NC_006074, NC_006075, Cook et al. 2005; EU016192, EU016194-EU016196, EU084034, Carapelli et al. 2007Carapelli et al. , 2014EU124719, Torricelli et al. 2010), consisting of nine species from seven families and including the most distantly related families (Xiong et al. 2008). The sequences were aligned with ClustalW (Larkin et al. 2007) to identify conserved regions in Collembola to design candidate PCR primers for metabarcoding. Degenerate primers for the mtCOI gene were designed within the standard barcode region for animals so that public sequence databases can be used as the reference database in the taxonomic identification of the metabarcoding data, whereas primers for the mt16S genes were not allowed to completely overlap regions that had been used in previous studies (e.g. Schneider et al. 2011).
Primer nucleotides were determined according to Boyce et al. (2009), who suggested that only small parts of oligonucleotide primers should be degenerate (typically up to three of the nine bases at the 3′ ends become ambiguous) and that the remaining part should be determined based on the most abundant nucleotide at each position in the alignment. In total, 9 forward and 12 reverse primers for mtCOI and five forward and six reverse primers for mt16S were designed (data not shown).

Preparation of metabarcoding library for 454 technology
Bulk microarthropod community samples ( Fig. 1.1) to be analysed by this protocol were composed of specimens extracted from 100 mL of core-sampled soil using a Tullgren funnel, each of which typically consisted of 100-1,000 animals comprising primarily Collembola and mites. Natural community samples that were sampled in this manner and artificially comprised mock community samples were analysed (see later subsections); samples were stored in vials with ethanol as a preservative at −20°C until use. Samples were transferred to 1.5-mL PCR tubes, and ethanol was removed prior to DNA extraction. Total DNA was extracted from the community samples using DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions except that microarthropods were milled using a micro homogenizer (BioMasher II; NIP, Tokyo, Japan) in lysis buffer. The extracted DNA samples were quantified using a Qubit 2.0 Fluorometer with a Qubit dsDNA HS assay kit (Thermo Fisher Scientific). The target gene region (mtCOI or mt16S) was amplified for each community sample using the above primer pair tailed with four-base multiplex identifier (MID) tags at the 5′-end of each primer (listed in Table   S2). PCR was performed in a 20-µl reaction volume using a KAPA HiFi HotStart ReadyMix kit (2-ng DNA template, 10 µl of 2× KAPA HiFi HotStart ReadyMix, 0.3 µM of each primer; KAPA Biosystems, Wilmington, Massachusetts, USA). The PCR conditions were as follows: 95°C for 2 min; 15-25 cycles at 98°C for 20 s, 53°C for 30 s, and 72°C for 30 s; followed by a final extension step at 72°C for 1 min. To keep PCR amplification from reaching a plateau, the PCR cycles to be used were determined for each sample beforehand by a preliminary examination in which the band intensities of the PCR products from several distinct cycles (e.g., 15, 18, 21, and 24) were visually compared using electrophoresis to identify the range of cycles where PCR was yet to reach a plateau for the sample ( Fig. 1.2.1; we recommend fewer than three cycles of difference in the PCR cycles between the samples to be analysed simultaneously). The PCR products were purified by size selection for the expected amplicon size using an SPRIselect reagent kit (200-500 bp for mtCOI; 300-700 bp for mt16S; Beckman Coulter, Brea, California, USA). The purified amplicons were checked with an Agilent 2100 Bioanalyzer with High Sensitivity DNA kit (Agilent Technologies, Santa Clara, California, USA) for evaluation of their size distributions and thereby quantification of their molecular concentration by Agilent 2100 Expert Software ( Fig. 1.3.1). The amplicons to be sequenced simultaneously in a sequencing run were pooled into molar concentrations equal between samples. The pooled amplicons were ligated with adapters for 454 technology using a GS Rapid Library Preparation Kit (Roche) to prepare a single library ( Fig. 1.4.1). After further purification by size selection using the SPRIselect reagent kit, the library was quantified with KAPA Library Quantification Kit (Kapa Biosystems).
Sequencing of the library was performed with a 454 GS Junior sequencer (Roche).

Sequence data analysis
A scheme of the strategy to cluster read sequences from metabarcoding data in the present study is shown in

Page 14 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. Figure S4. An outline of the current data processing pipeline is as follows: As the present protocol added NGS adapters to amplicons by ligation, the target gene was sequenced in forward and reverse directions. After quality filtering of raw reads, calling of OTUs was separately conducted for forward and reverse reads using the open-reference method (QIIME pipeline version 1.8, Caporaso et al. 2010), in which reads were clustered with reference sequences using a threshold (97%), and novel OTUs were selected from remaining reads that were not clustered with the reference sequences. The following step merged forward and reverse reads belonging to identical OTUs. For OTUs clustered with reference sequences, this was conducted by sequence mapping of the representative sequences onto reference sequences; newly selected forward and reverse OTUs were merged by exploring similar sequences between them. As we observed that the first clustering by a 97% threshold often clustered identical sequences into separate OTUs due to sequencing errors, sequence variation generated by degenerate primers and MID-tag sequences on 3′-ends, a less conservative threshold (i.e., 94%) was applied in the above-mentioned merging steps to eliminate the artifacts. Finally, the merged OTUs were further clustered into 90% OTUs which we currently assumed as an analogous level to species, as described below.
Reads sequenced in the forward and reverse directions were separated according to the primer sequences. Using the QIIME pipeline, reads were processed (with script split_libraries.py) prior to the calling of OTUs: the primer sequences were removed; the low-quality bases were trimmed from the 3′-ends (using a sliding window of 50 bases, truncated from first base of window when the mean QV was <25); the numbers of allowed mismatches with primer sequence and allowed homopolymer lengths were two and 12 bases, respectively. After trimming, reads of length <200 bases were discarded.

Page 15 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
OTU calling was first performed separately for the forward and reverse reads. Open-reference OTU picking (for 97% of OTUs) was performed with reference sequences (see above; sequences were trimmed to insertion length used in metabarcoding) using the pick_open_reference_otus.py script.
The 97% OTUs were secondarily grouped into OTU groups that were analogous to 90% OTUs ( Fig. S4): (1) representative sequences of OTUs were mapped onto reference sequences using GMAP software (Wu and Watanabe 2005). An OTU whose representative sequence was aligned to a reference sequence (95% of sequence similarity, ≥150 bp of alignment, >70% of query coverage, and <60 bp of clipped length of query) was considered to be a member of a group represented by the reference sequence. Note that during this step, it was intended to limit sequence variation between the furthest members of the cluster to <6%. To achieve this empirically, a slightly stricter threshold (i.e., 95% similarity) was required when clustering was conducted by sequence mapping.
(2) For unmapped OTUs, similarities between their representative sequences were evaluated by pairwise alignment by ClustalW. Based on the result, these OTUs were clustered using the furthest neighbour method (Sørensen 1948) and grouped by threshold similarity of 94%. This grouping was made separately for forward and reverse OTUs at first, and forward and reverse groups were later merged when a reciprocal best alignment with 94% was found between the two groups. For each group, a representative sequence of the OTU with the most abundant member reads was selected as the representative sequence of the group. (3) Finally, representative sequences of these groups (constructed with GMAP or ClustalW) were pairwise aligned and clustered by the furthest-neighbour method. They were thereby further grouped into OTU groups by a threshold similarity of 90%. When an OTU group included a reference sequence, it was determined as the representative sequence. Otherwise, a representative sequence of the For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
OTU with most abundant member reads among the group was selected. These OTU groups by 90% similarity are here referred to as 90% OTUs.
Taxonomy of the OTUs was inferred via BLAST searches (E < 10 -12 , Zhang et al. 2000) against the NCBI nucleotide database as well as reference sequences collected in the present study. For OTUs with the best match to a collembolan sequence, we determined their taxonomy using the following (tentative) thresholds: a hit with query coverage >90% and sequence identity >90% was considered to belong to the same species; with query coverage >80% and sequence identity >85% as indicators of belonging to the same family. For OTUs with the best hit to a taxon other than Collembola, the orders were inferred when query coverage >80% and sequence identity >75%.
See earlier subsection "Collection of reference sequences" for origins of these threshold values.

Quality assessment using natural community samples
The metabarcoding method presented in this study was tested with respect to its selectivity of PCR primers and sensitivity to comprehend collembolan biodiversity. Topsoil was sampled in a plot (10 × 25 m) consisting of 10 adjacent subplots of 5 × 5 m, established in the Kamigamo Experimental Forest. Using cylindrical cores (4 cm in depth, 25 cm 2 in area, and 100 cm 3 in volume), a pair of soil samples was collected from a randomly determined point in each subplot (i.e., the two soil samples had been adjacent in the forest floor). Microarthropods were extracted from the soil samples via Tullgren funnels. Half of the community samples (one sample per subplot; in total 10 samples; samples N_01 to N_10) were analysed by our metabarcoding method, and the remaining half (C_01 to C_10) were mounted on glass slides and examined under a microscope to identify Collembola by species

Page 17 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. and count.
Species accumulation curves with increasing number of samples were drawn based on the number of morphospecies (C_01 to C_10) or 90% OTUs (N_01 to N_10) using specaccum function in the R package vegan (version 2.3-4; Oksanen et al. 2016). The correlation of relative species abundances (mean values of 10 samples) between the three methods were evaluated using the Pearson product-moment correlation coefficient. A total of 13 species found by all the three methods were included in this correlation analysis.

Quality assessment using mock community samples
Mock community samples were analysed by the metabarcoding method to test its quantitativity. A total of seven species of Collembola: Pseudosinella pseudolanuginosa, S. dubiosa, Sinella umesaoi (Entomobryidae), Hypogastrura reticulata (Hypogastruridae), O. flavescens (Onychiuridae), Arrhopalites japonicus (Katiannidae), and Folsomia candida (Isotomidae), were used to prepare mock community samples. The first six species were included in different numbers among the mock samples, while F. candida was included in the same number as an internal control (see below).
Individuals of these species were obtained from laboratory cultures maintained in Yokohama National University except for H. reticulata, of which individuals were collected from a wild population swarming on a mushroom in the Kamigamo Experimental Forest. Laboratory cultures of the other collembolan species were established from individuals collected from the forest, with the exception of F. candida, which were from Fuji-Yoshida (central Japan; Itoh et al. 1995). The mean dry weights of the collembolan specimens were indirectly estimated based on

Page 18 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. body length (from measurement of at least 20 specimens per species) according to Tanaka (1970) andPetersen (1975). Information for the specimens is presented in Table S3.
The species compositions of mock community samples were determined as follows: collembolan species (see above) were randomly permuted to be placed in an order (from 1 to 6) according to which their numbers of individuals were determined to be included in a sample (32,16,8,4, 2, or 1 individuals). To construct 1000 community matrices (each consisting of six species by 10 samples), this permutation was performed 1000 times per sample (thus, 10,000 times in total). Community matrices not satisfying the conditions that a species must be ranked any place (from 1 to 6) in at least one sample and ranked in the same place in at most three samples were discarded. One of the remaining matrices was randomly selected to examine the species compositions of the mock community samples by metabarcoding. According to this matrix (consisting of six species by 10 samples), each sample received different numbers of individuals from the six species and three individuals of F. candida as an internal standard. In total, each of the 10 mock community samples (S_01 to S_10) was composed of 66 individuals of Collembola (Table S4).
The correlation between the read counts and numbers of individuals across samples was calculated for each species, by three separate normalization methods: (1) both the read counts and number of individuals were normalized to relative abundance per sample; (2) the read counts were normalized to the relative abundance, while the number of individuals was converted to relative biomass; (3) the read counts were normalized to the internal control (i.e., divided by read counts per individual of the control species), while the number of individuals was left unchanged. The Pearson product-moment correlation coefficient was used to evaluate the correlations.

Page 19 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
individuals or normalized number of reads (by the internal control) from metabarcoding datasets. To ordinate samples based on the relative difference of abundance between samples, the data were standardized to a mean of 0 and standard deviation of 1 across samples for each species and input to this analysis.
The sequence data obtained in this study have been submitted to the DDBJ Sequence Read Archive (DRA) as accession no. DRA004529.

Page 20 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.

Sequencing throughput from metabarcoding libraries
Sequencing throughput using mtCOI or mt16S library from natural or mock community samples (in total four sequencing runs; 10 samples per sequencing run) is described in Table S5. From 107,043 to 187,573 reads were sequenced per library. The number of quality-filtered reads varied at most 7.7 times between samples in a sequencing run, and at least 1,690 quality-filtered reads were obtained per sample (Fig. S5). Of these, from 574 to 26,964 per sample (24,262 to 133,522 reads per library) were aligned to 97% OTUs (and finally to 90% OTUs).

Quality assessment using natural community samples
The soil microarthropod community in the study site was dominated by Collembola and Acari (Prostigmata, Mesostigmata, and Oribatida) (300 to 1000 animal individuals per 100 mL soil sample, identified under a microscope, Fig. 2a). The taxonomic composition of sequences obtained from the metabarcoding libraries (inferred via BLAST searches, Fig. 2b, c) showed that the largest part of the output sequences were derived from Collembola (47% and 65% reads from mtCOI and mt16S libraries were aligned to collembolan OTUs, respectively; Fig. S6a, c).
Nonetheless, animal taxa other than Collembola accounted for a considerable portion of reads in several samples (e.g., Oribatida in sample N_01 in mtCOI, Hymenoptera in N_01 and N_06 in mt16S, Coleoptera in N_08 sample in both libraries). From the 10 natural community samples, 32 and 36 collembolan OTUs (10-33 and 9-36 per sample) were found in the mtCOI and mt16S libraries, respectively, whereas a total of 25 species (6-13 per sample) were found by morphological identification (Fig. 3).

Page 21 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
Taxonomic assignment of these collembolan OTUs linked 18 mtCOI and 22 mt16S OTUs to 15 and 16 morphological species, respectively. Several of these OTUs were considered to be derived from the same morphological species. For example, three mtCOI and four mt16S OTUs were assigned as F. octoculata (see Table   1). When such OTUs were combined under their assigned species, the difference between the numbers of morphological species and OTUs became smaller (25 morphological species vs. 29 and 30 combined OTUs under assigned species).
The collembolan community assessed by microscopic examination as well as metabarcoding of paired samples is summarized by species in Table 1. Of 25 species found in the 10 natural community samples by microscopic examination, a total of 15 species had been barcoded in our current reference database (Table 1a). These barcoded species represented 97% of the collembolan individuals found. In the metabarcoding data, the OTUs identified to the species level (see above) accounted for 93 and 97% of collembolan reads from the mtCOI and mt16S libraries, respectively (Table 1b, c). A total of 13 collembolan species were found by both microscopic identification and metabarcoding using mtCOI and mt16S libraries. The relative abundances (mean values across samples) of these 13 species, which represented >90% of collembolan individuals or reads of any datasets, were positively correlated (microscope vs. mtCOI, R = 0.87, P < 0.001; microscope vs. mt16S, R = 0.91, P < 0.001; mtCOI vs. mt16S, R = 0.79, P < 0.01; Fig. S7).

Quality assessment using mock community samples
Almost all the reads were derived from target gene regions of Collembola when metabarcoding libraries of mock For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. community samples consisting of seven collembolan species, although a larger number of OTUs than species were found in both libraries (Fig. S6b, d). Thus, in addition to the seven major OTUs that had abundant reads and were considered to represent individual species, extra OTUs were found that were also considered to be derived from some of these species (92.6 to 95.9% sequence identity with the representative sequences) but accounted for only minor proportions of reads from the species (mtCOI data: 12, 7, and 0.5% in P. pseudolanuginosa, S. umesaoi, and H. reticulata, respectively; mt16S data: 4 and 0.7% in P. pseudolanuginosa and H. reticulata, respectively).
The relative species composition of each mock community sample is depicted in Figure 4 by the number of individuals, biomass and read counts in the mtCOI and mt16S datasets; the relative species compositions based on read counts reflected for the most part those based on the number of individuals and biomass (Fig. 4). The relative abundance of each species based on read count correlated to that of the number of individuals across samples, irrespective of genes or species (Fig. S8a, mtCOI: R = 0.81-0.98, P < 0.01 for each; Fig. S8c, mt16S: R = 0.84-0.98, P < 0.01 for each), except for F. candida (the species included in the same number in the samples).
Correlations of the relative abundance of reads to the relative portion of biomass were almost identical to those with the relative portion of individuals described above (Fig. S8b, mtCOI: R = 0.81-0.99, P < 0.01 for each; Fig.   S8d, mt16S: R = 0.82-0.98, P < 0.01 for each). When all the species were pooled, correlations were lower (mtCOI, R = 0.66 and 0.71, and mt16S, R = 0.74 and 0.70, to relative abundance and to relative portion of biomass, respectively) than those separately by species examined above.
Next, another method of normalization using a species as internal standard was assessed. We note that, although F. candida was originally intended to be used as the internal standard, this species was less abundantly sequenced

Page 23 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. from the mtCOI library (0-0.19% of total reads per sample) than from the mt16S library (0.79-3.5%). However, no read from F. candida was found in one of the samples (S_06) in the mtCOI data, rendering this species unsuitable for use as internal standard. Accordingly, P. pseudolanuginosa was selected as the internal standard in place of F. candida, after identification of the species yielding the highest correlation coefficients between the numbers of individuals and the reads when used as internal standard (Table S6) datasets, respectively) (Fig. 5). When data from all the samples and species (excluding internal control species) were pooled, the correlation between normalized count and number of individuals was moderate (R = 0.62, P < 0.001), as shown above for the other normalization methods. The PCA ordination analysis of mock community samples based on the number of individuals or reads showed that four species differentiated community samples along axis 1, whereas two other species differentiated community samples along axis 2, resulting in very similar ordinations of samples between the three datasets (Fig. 6).

Discussion
We designed two novel PCR primer pairs for metabarcoding so that target gene regions (mtCOI and mt16S) could be amplified from Collembola irrespective of species and employed a library preparation protocol to achieve improved quantitativity in sequencing output. The results from the assessment of the collembolan community of a

Page 24 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
forest site using the present metabarcoding method were comparable, in terms of the species composition and biodiversity, to those obtained from paired soil samples by the conventional method (taxonomy identification of each animal under a microscope). Further, sequencing using mock community samples by the metabarcoding method revealed that the abundance of sequence reads from a species showed a strong linear relationship (R = 0.91-0.99) to the number of the individuals included in samples, especially when normalized to an internal control.
These results indicated that this method is applicable to surveys of Collembola biodiversity.

Selectivity and sensitivity assessed using natural community samples
The present results from metabarcoding of natural community samples could be evaluated in terms of specificity for the target taxon (Collembola) as well as capability to comprehend the species within the taxonomic group.
Specificity of the metabarcoding primers is important in the sequencing of a bulk soil microarthropod community sample, extracted via Tullgren funnel, that includes a broad range of taxa of animals such as mites, small insects, and other microarthropods as well as Collembola (see Fig. 2a). If the selectivity of metabarcoding PCR primers for Collembola is poor, amplicons derived from non-target taxa included in such a sample may consume the capacity of a sequencer and reduce the space for Collembola. The present result indicated that the largest part of the sequences from metabarcoding libraries was successfully derived from Collembola. Nonetheless, sequences from other taxa (Oribatida, Coleoptera, or Hymenoptera) were dominant in a few samples, indicating that the primer pairs designed in the present study could amplify the target genes occasionally from these undesired taxa. We speculated, based on the sequences, that the oribatid sequences were from species abundant at the site

Page 25 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
(Tectocepheidae and Camisiidae; Hishi and Takeda 2008), whereas coleopteran and hymenopteran sequences were from (likely a larva of) Tenebrionidae and ants (Formicidae), respectively. Considering that, apart from oribatid mites, these taxa of insects were rarely found under a microscope (only two larvae of Coleoptera and three ants were found in the 10 samples by microscopic examination), their influence on sequencing output per individual was far larger than that of Collembola, likely because of their large body masses compared with mesofaunal taxa (1-20 µg vs. several mg per individual; Swift et al. 1979). If a specific organism is a frequent contaminant then a blocking primer (Vestheim and Jarman 2008) may be used to prevent it from being amplified, although this is not useful for most soil microarthropod community samples in which specific contaminant organisms cannot always be expected. Thus, if the capacity of a sequencer is limited, it is recommended to remove animals of these taxa (easily distinguishable from the others) from microarthropod community samples before DNA extraction.
The assessed collembolan community at the study site was comparable between microscopic identification and metabarcoding of paired samples: the numbers of morphological species and OTUs after taxonomy assignment were in agreement (25 vs. 29 or 30, Table 1). Although the current reference sequence database still did not fully cover the collembolan species in the study site, it covered most of the individuals in the community. Similarly, only approximately half of the collembolan OTUs were identified to the species level, but these OTUs accounted for most of the collembolan reads in both the mtCOI and mt16S datasets. These barcoded species and identified OTUs were largely consistent in terms of relative abundance (Fig. S7), although metabarcoding datasets are somewhat biased (see below); heterogeneity between paired samples (which could not be isolated in this study) also may have caused some differences between microscopic and molecular methods. The high species-dependent variation in the

Page 26 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
abundance of Collembola in the natural community (the exponential decrease in proportion to total individuals explained by a species with increasing rank order of the species based on abundance) may have masked the potential effect of PCR bias. These results nonetheless indicate that, comparable to conventional microscopic identification, the metabarcoding method in the present study could comprehend the biodiversity of a collembolan community in a survey of a natural community.
Our current sequence data-processing pipeline employs a threshold value of 90% sequence identity to delineate OTUs, which were assumed to be comparable to species in present study. While >3% of difference in COI sequence (i.e., <97% of identity) has been usually considered to indicate distinct species in various taxa (Hebert et al. 2003), greater than 3% difference within morphological species has been previously reported in Collembola (e.g., Porco et al. 2012aPorco et al. , 2014Anslan and Tedersoo 2015) and was found in the present study. As distributions of inter-and intraspecific sequence similarities overlapped (Fig. S1), observed sequence variations within a species may have comprised those by cryptic species as well as true intraspecific diversity, suggesting a need to determine a threshold to distinguish the two. Although Porco et al. (2014) suggested a threshold of 14% difference in the mtCOI gene for this purpose, we used a stricter value (10%) to cluster metabarcoding sequences, considering that it would be best to avoid including sequences from different species into an OTU, and afterwards to combine separate OTUs if they were assigned to the same species according to a reference database. Nonetheless, this threshold should be reconsidered when different morphological or cryptic collembolan species with a smaller genetic distance than the current threshold are found in future studies.
The current reference sequence database (of 34 species that were locally collected) and public databases (NCBI NT and BOLD databases) could identify only a few more than half of the collembolan OTU groups from metabarcoding of the natural community. This outcome was likely due to the poor coverage for rare species, each of which accounts for less than 1% of the total collembolan individuals in the community, as well as rare types not yet barcoded from dominant species. Further, of the collembolan OTUs that were identified to species, only one mtCOI was identified to a species barcoded elsewhere (Sminthurinus trinotatus, KJ155516), whereas the other identified OTUs were derived from those barcoded at the study site. Even for species that had been barcoded at other sites, their sequences were different from those collected from identical morphological species at the study site (i.e., <90% of sequence similarity). These results support the huge and hidden diversity in Collembola shown in previous studies (e.g., Porco et al. 2012a,b;Cicconardi et al. 2013) and suggest the need for generating more reference sequences of this group, more extensively at our study site as well as in various other geographic regions.
At present, the construction of a locally collected database for mtCOI and mt16S genes will be needed to identify collembolan species from metabarcoding data. Nevertheless, even in the absence of associated Linnaean names, our analysis pipeline would permit the quantification of biodiversity and the study of compositional heterogeneity among collembolan samples.
The observed intraspecific sequence diversity could be partly explained by potential pseudogenes co-amplified from the nuclear DNA (nuclear mitochondrial pseudogenes, numts; Song et al. 2008) along with the target genes.
By manual examination of representative sequences of OTUs, we found some sequences similar to known species but with substitutions, insertions, and/or deletions, resulting in >10% of difference from the known species, and accordingly, clustered with separate OTUs. Such OTUs had only small numbers of member reads. For example,

Page 28 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
one of three mtCOI OTUs from F. octoculata, which had a deletion and stop codon and was thus considered to be a numt, explained < 1% of reads derived from this species. A similar observation was reported by Ramirez-Gonzalez et al. (2013), who previously metabarcoded a collembolan community with 454 technology. We also often found multiple types of mtCOI and/or mt16S sequences from a single collembolan specimen when sequencing the amplicons of these genes by 454 (data not shown), supporting the possibility of co-amplification of pseudogenes by PCR. Indeed, it should be difficult (especially for mt16S, a non-coding gene) to determine based only on sequencing data from NGS whether a suspected sequence is numt or merely due to errors during PCR and/or sequencing. It is, of course, also necessary to consider carefully other possibilities: heteroplasmy, female sperm storage, gut content (Collembola often consume conspecific corpses and exuviae of others), and contamination. A patch for this problem could be to include (suspected) numts as well as definitely identified target genes in the reference database so that an analyst can determine the taxonomy of the source organism and thereby avoid overestimation of biodiversity.

Quantitativity assessed using mock community samples
As shown in the results from the metabarcoding of mock community samples, very high linearity between numbers of collembolan individuals and sequences per species (i.e., R = 0.91-0.99) was achieved by the present method when the read counts were normalized by using a species as the internal control. On the other hand, the slope of the regression line varied greatly depending on the both species and the gene region (mtCOI and mt16S; Fig. 5) used in metabarcoding. This coefficient was considered (from the method of normalization) to represent the ratio of the

Page 29 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
number of reads obtained per individual of the species to that of reads from the standard species, and thus, represent the biases between the species (including those by PCR and others) under the PCR conditions used in library preparation for NGS. The observed linearity between normalized read count and number of individuals of each species across samples suggests what may have occurred during PCR. That is, ratios of amplicons between species were, likely owing to different PCR efficiencies, skewed from those of their templates with increasing PCR cycles, though the extent was nonetheless similar among samples. Further, the observation that the slope varied with both species and gene (thus, primers) indicates the effect of different extents of mismatches between genomic sequences and the primers at the binding sites (Deagle et al. 2014) as well as that of copy number of the mitogenome in each individual.
These observations are, in some respects, consistent with previous studies in which metabarcoding methods were developed. Within a sample, different species yielded highly different numbers of sequences even between specimens with similar body weight (Porazinska et al. 2009;Hajibabaei et al. 2011;Yu et al. 2012;Elbrecht and Leese 2015); Elbrecht and Leese (2015) showed a high correlation between the numbers of reads and the biomass by metabarcoding a mock sample consisting of specimens of the same species but of different haplotypes and body weights. Similarly to the present study, Diaz-Real et al. (2015) assessed the correlation between the relative species abundances of read counts and the numbers of individuals using mock community samples and found strong correlations in two focal species but with different slopes of the regression lines.
Some of the authors of these previous reports concluded that metabarcoding might not be quantitative, from the observation that PCR bias strongly skewed the relative abundances between species and made it impossible to

Page 30 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. recover the composition of individuals (or biomass) within a sample from sequencing data (Porazinska et al. 2009;Yu et al. 2012;Elbrecht and Leese 2015). Accordingly, it is considered inappropriate to use the relative species abundances in read counts, in place of the numbers of individuals, to calculate alpha diversity indices of a community sample. Some authors, accordingly, recommended transformation of metabarcoding data into presence/absence prior to statistical analyses (Yu et al. 2012;Elbrecht and Leese 2015). From this point of view, the present method was also shown still to be non-quantitative, given that it could not recover relative species abundance in a sample, owing to observed bias between species. Nonetheless, compared with the previous study by Ramirez-Gonzalez et al. (2013) who had metabarcoded Collembola and reported no or weak correlation between relative species abundances in mock community samples with read counts from metabarcoding data (R = 0.03-0.47 when pooling all the species), the present method considerably improved the ability to recover relative species abundance (R = 0.66-0.74, Fig. S8). Furthermore, the present method may become possible in the future when slope values (ratio to a standard species) under a standard protocol are accumulated for various species. This idea is similar to that described in a recently published article by Thomas et al. (2016) who suggested the use of a coefficient for correction of the read abundance of focal species to standard species in metabarcoding a tissue mixture of multiple species.
Even though the relative species abundance within a sample could not be determined by metabarcoding, the linearity of normalized read counts to individuals across samples in the present method allows us to discuss the relative abundance of species among samples in a quantitative manner. Typically, the abundance of focal species can be compared between samples using normalized read counts, much as one does by the number of individuals.

Page 31 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
This comparison is important for ecologists who wish to evaluate the variation of population densities of the focal species among environmental factors, between treatments, and/or with time. Furthermore, by making use of linearity between individuals and read counts, ordination of community samples via principal component analysis (PCA) can be applied to a community matrix based on sequence data. As an example, the mock community samples were ordinated based on individuals and OTUs from read counts from metabarcoding data (Fig. 6). Prior to the analysis, the abundance (individuals or normalized read count) was standardized (to a mean value of 0 and standard deviation of 1) between samples per species in every dataset (that is, only information about relative difference between samples was conserved for each species). As a result, the PCAs showed almost identical ordinations of communities by respective principal components 1 and 2. Note that, since the species included in these mock community samples were fully overlapped, no ordination method was applicable when transformed into presence/absence data.
Selection of the species to be used as internal control is likely to be important. We had originally intended to use F. candida for the purpose, and included three individuals of this species per sample. However, this species was very infrequently sequenced, especially from the mtCOI library, and unfit for use as the standard. Accordingly, in place of F. candida, P. pseudolanuginosa was used as an internal control as this species showed the most preferable result in terms of the correlation coefficient per species across the samples. From the observation that P. pseudolanuginosa, the best internal control species, was the most frequently sequenced species in the mock community samples (see Fig. 4c, d), we speculate that the abundant templates and amplicons of this species minimized the effect of stochasticity during PCR (Polz and Cavanaugh 1998) and sequencing on the read count

Page 32 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. (Amend et al. 2010;Elbrecht and Leese 2015), resulting in the observed good performance as an internal control species. This observation suggests a recommendation for the use of sufficient numbers of individuals (tentatively, >10 per sample) of sufficiently amplifiable (and cultivable) species such as P. pseudolanuginosa as the internal standard.
The proportion of read counts of a species to the number of total reads (i.e., relative read abundance) is an often-used normalization standard for metabarcoding data (e.g., Hajibabaei et al. 2011;Hirai et al. 2015). Although, in the present results, this normalization method also showed high or moderate correlation (Fig. S8), its performance was inferior to that by internal control. This is likely because the relative read abundance of a species in a community sample was affected by other species included in the sample (Fig. 4). As pointed out by Amend et al. (2010), when a focal species with high PCR efficiency is sequenced together with another species with low efficiency, the abundance of the focal species may be overestimated, to an extent that may differ depending on the number of individuals of the less efficient species are included in the sample. Accordingly, even when the same number of individuals of a species is included in multiple samples, its relative abundance may vary between these samples.
The satisfactory results of the present study were most likely due to the laboratory protocol used, which was designed with care to minimize the effects of PCR amplification. We did not use fusion primers, which have gene-specific sequences as well as adapters for the NGS platform and could bias amplification in PCR (Berry et al. 2011). In addition, we had previously designed and examined fusion primers but observed frequent primer dimers and unstable amplification depending on incorporated MIDs in another project (Sunagawa et al. unpublished data).

Page 33 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
Accordingly, we instead used ligation to add NGS adapters to amplicons, similar to the practices of De Barba et al. (2014) and Hirai et al. (2015), the latter of whom reported good linearity between relative species abundances in read counts and those in biomass. Nonetheless, the effect of methods of adding NGS adapters on linearity is unclear, given that Diaz-Real et al. (2015) and Elbrecht and Leese (2015) also reported high linearity using fusion primers.
The present study employed a restricted number of PCR cycles (15-25, determined by preliminary examination for each sample; Fig. 1.2.1) to avoid a plateau during PCR, which could affect final proportions of amplicons from different templates (Mathieu-Daudé et al. 1996). Hirai et al. (2015) also employed small numbers of PCR cycles (22), whereas Diaz-Real et al. (2015) and Elbrecht and Leese (2015) used 40 (with touching down) and 30 cycles, respectively, but all of these authors reported good results. Thus, the effect of PCR cycles is also unclear. To make the present protocol more robust, PCR replication (Porazinska et al. 2010;Diaz-Real et al. 2015;Elbrecht and Leese et al. 2015) and longer extension time (30 s in the present study vs. 1 min or longer in other studies) could be incorporated to minimize the effect of stochasticity on rare templates (Polz and Cavanaugh 1998) and formation of chimeras (Lahr and Katz 2009), respectively, although these did not apparently interfere with the sequencing results in the present study.

Costs and prospective
The cost per sample (for reagents required for DNA extraction, library preparation and sequencing) using the present method was estimated to be approximately $100 for sequencing 10 collembolan community samples at once using the 454 GS Junior sequencer, a cost presumably acceptable for use in place of conventional microscopic

Page 34 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
identification. As NGS adapters were ligated in the final step of library preparation in the present method, the migration of our metabarcoding method from an NGS platform (i.e. 454 technology) to another would be possible by the substitution of the adapters of the target NGS platform for those used in the present protocol. Use of the present method as well as its adaptation to more powerful sequencing platforms such as the Illumina and Ion Torrent platforms may contribute to the understanding of the biodiversity of Collembola. In particular, 300 + 300 bp paired-end sequencing using Illumina MiSeq is presumably promising for both markers examined in the present study because it can fully sequence the gene fragments.
Results in the present study indicate that mtCOI and mt16S genes were comparable in terms of the ability of surveying collembolan biodiversity. Considering the availability of reference databases and online identification systems (i.e., the BOLD system), the use of mtCOI rather than mt16S has an advantage. Nonetheless, as these two markers were revealed to have different PCR biases depending on species, the combined use of both markers may help in understanding the collembolan community more comprehensively when sufficient reference sequences have been accumulated for both genes. Table 1. Community assessment of Collembola at the study site by microscopic examination and metabarcoding using mitochondrial cytochrome c oxidase subunit 1 (mtCOI) and 16S ribosomal RNA (mt16S) genes * Species whose mtCOI and/or mt16S genes were barcoded at the study site. † Number of individuals found per bulk microarthropod sample extracted from 100 ml of soil ‡ Relative species (or OTU) abundance per sample based on number of individuals (by microscopic examination) or reads (by metabarcoding) § Number of samples in which at least one individual or read of the species or OTU was found.

Table footnotes and figure legends:
|| Number of OTUs assigned to species after taxonomic identification of OTUs via reference database (see text).
Shading indicates species found by all of three methods (microscopic examination and metabarcoding by mtCOI and mt16S genes).
Note: Families of unidentified collembolan OTUs were inferred based on BLAST searches against the NCBI Nucleotide Collection database (see text for detail). For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. Table S2. PCR primer sets for metabarcoding used in the present study. Table S3. Taxonomy, body length, and estimated biomass of collembolan specimens used in mock community samples Values are mean ± SD (n > 20). †, Collected from a single cohort. ‡, Collected from a field population. For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record.
sample; this species could not be used as an internal control in mtCOI data because no read was found from this species in one of the samples. For species abbreviations, see Fig. 4.    Table 1). Vertical bars represent 95% confidence intervals.    Filled parts of the columns (in red) indicate within-species comparisons. Figure S2. Effect of the threshold value on false positive (over-splitting) and negative (over-lumping) rates in species delimitation by gene markers based on reference sequences. Figure S3. Effect of threshold value on the success rate of taxonomic identification using gene markers based on reference sequences. Figure S4. A schematic of the OTU clustering pipeline for metabarcoding data used in present study. Figure S5. Portion of quality-filtered reads per sample in multiplexing sequencing runs in present study. Labels around pie charts are 4-base MID tags.

Page 52 of 60
Genome Downloaded from www.nrcresearchpress.com by CORNELL UNIV on 07/24/16 For personal use only. This Just-IN manuscript is the accepted manuscript prior to copy editing and page composition. It may differ from the final official version of record. Figure S6. Taxonomy composition in metabarcoding data per sequencing run in present study. Figure S7. Relationships of relative species abundances in the natural collembolan community between morphological identification and metabarcoding using mtCOI and mt16S genes. Mean relative species abundances (in percentage; n = 10) were plotted for 13 species commonly identified by the three methods.    Table 1. Community assessment of Collembola at the study site by microscopic examination and metabarcoding using mitochondrial cytochrome c oxidase subunit 1 (mtCOI) and 16S ribosomal RNA (mt16S) genes * Species whose mtCOI and/or mt16S genes were barcoded at the study site. † Number of individuals found per bulk microarthropod sample extracted from 100 mL of soil ‡ Relative species (or OTU) abundance per sample based on number of individuals (by microscopic examination) or reads (by metabarcoding) § Number of samples in which at least one individual or read of the species or OTU was found. || Number of OTUs assigned to species after taxonomic identification of OTUs via reference database (see text). Shading indicates species found by all of three methods (microscopic examination and metabarcoding by mtCOI and mt16S genes). Note: Families of unidentified collembolan OTUs were inferred based on BLAST searches against the NCBI Nucleotide Collection database (see text for detail).