Jump to
section
Method Variation statistics Phylogeny Principal Component Analysis (PCA) ADMIXTURE Analysis

Method

DNA sample preparation and sequencing for resequencing

486 Camellia accessions were obtained from the sampling place, which include 15 Other Camellia accessions (other species in Camellia, not in Chazu), 21 Bloom Camellia accessions, 23 Oilseed Camellia accessions, 54 Wild tea accessions, 20 Hybrid accessions (C. sinensis var. assamica × C. sinensis var. sinensis, 246 Assamica accessions (C. sinensis var. assamica) and 107 Sinensis accessions (C. sinensis var. sinensis). In addition, 139 accessions (1 Oilseed Camellia accession, 13 Wild accessions, 5 Hybrid accessions, 37 Assamica accessions and 83 Sinensis accessions) in Wang et. al. paper[1] also were used for resequencing analysis. Young leaves were collected from the plants and snap-frozen in liquid nitrogen. Total DNA was extracted with the DNAsecure plant kit (Tiangen, Beijing). 2 µg genomic DNA from each accession was used to construct a sequencing library following the manufacturer’s instructions using NEBNext Ultra DNA Library Prep Kit (NEB, USA). Paired-end sequencing libraries with an insert size of approximately 400 bp were sequenced on an Illumina NovaSeq 6000 sequencer at Novogene-Beijing. Paired-end resequencing reads were filtered using NGSQCToolkit_v2.3.3[2]. This step removed reads containing adapter or poly-N, and low-quality reads (reads with >40% bases having Phred quality ≤ 20) from the raw data, reads shorter than 70bp were discarded, the yielding clean data for subsequent downstream analyses.

Variation calling and annotation

Paired-end resequencing reads were mapped to the YK10 reference genome with bwa-mem2 (Version: 2.0pre1)[3] using the default parameters. SAMtools (Version: 1.3.1)[4] software was used to convert mapping results into the BAM format. Duplicated reads were marked with the Picard package (picard.sourceforge.net, Version: 2.1.1) after sorting the BAM using Picard package. After BWA alignment, the reads around indels were realigned, realignment was performed with Genome Analysis Toolkit (GATK, version 3.3-0-g37228af)[5] in two steps. The first step used the RealignerTargetCreator package to identify regions where realignment was needed, and the second step used IndelRealigner to realign the regions found in the first step, which produced a realigned BAM file for each accession. The variation detection followed the best practice workflow recommended by GATK[5]. In brief, the variants were called for each accession by the GATK HaplotypeCaller[5]. A joint genotyping step for comprehensive variations union was performed on the gVCF files. In the filtering step, the SNP filter expression was set as “QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 5.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || QUAL < 30”, and the Indel filter expression was set as “QD < 2.0 || ReadPosRankSum < -20.0 || InbreedingCoeff < -0.8 || FS > 200.0 || SOR >10.0 || QUAL < 30”. Only insertions and deletions shorter than or equal to 40 bp were considered. Indels and SNPs with none bi-allelic were removed, which yielded the basic set. SNPs with MAF < 0.05 and >50% missing calls were further removed for phylogenetic tree structure, population structure analyses and IBS calculation (the core set).

Population genetics analysis

Whole-genome SNPs were used to construct the ML (Maximum likelihood method) phylogenetic tree with 100 bootstrap using SNPhylo[6] (Version: 20140701). C. cuspidata (KM6) was used to provide outgroup information at corresponding positions. The tool iTOL (http://itol.embl.de) was used to color the phylogenetic tree. SNPs in linkage disequilibrium was filtered using PLINK (Version v1.90b3.38)[7] with a window size of 50 SNPs (advancing 5 SNPs at a time) and an r2 threshold of 0.5. Population structure was analyzed using the ADMIXTURE (Version: 1.3)[8] program with a block-relaxation algorithm. To explore the convergence of individuals, we predefined the number of genetic clusters K from 2-8 and ran the cross-validation error (CV) procedure. Default methods and settings were used in the analyses.

Reference

1. Wang, X., et al., Population sequencing enhances understanding of tea plant evolution. Nat Commun, 2020. 11(1): p. 4447. 
2. Patel, R.K. and M. Jain, NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One, 2012. 7(2): p. e30619. 
3. Md, V., et al. Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. in 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2019. 
4. Li, H., et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078-9. 
5. McKenna, A., et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 2010. 20(9): p. 1297-303.
6.Lee, T.H., et al., SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics, 2014. 15: p. 162.
7. Purcell, S., et al., PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 2007. 81(3): p. 559-75.
8. Alexander, D.H., J. Novembre, and K. Lange, Fast model-based estimation of ancestry in unrelated individuals. Genome Res, 2009. 19(9): p. 1655-64.

Variation statistics

Variants Type Core Set
SNPs Total 83,800,168
Intergenic 76,998,171
Intronic 3,688,316
Exonic 846,520
5’-UTR 107,041
3’-UTR 203,833
UTR5;UTR3 409
upstream 1,022,277
downstream 901,863
upstream;downstream 20,783
Splicing 10,461
exonic;splicing 494
Indels Total 3,818,608
Intergenic 3,225,336
Intronic 351,047
Exonic 30,616
5’-UTR 15,079
3’-UTR 22,155
UTR5;UTR3 22
Splicing 1,075
upstream 90,326
downstream 80,894
upstream;downstream 2,042
exonic;splicing 16
# Indels £ 40 bp.

Phylogeny

Note

Principal Component Analysis (PCA)

Add Common name

ADMIXTURE Analysis


Add Common name