Method
Genome assembly: The genome was assembled in several steps. First, contigs were assembled automatically from PacBio data using Canu (version 1.8). The contigs were then polished with Pilon (version 1.23) using NGS data. Next, the HERA software was used with specific parameters for further assembly. The contigs were scaffolded using Saphyr optical mapping and the Solve software package. Redundancy in the genome was resolved using Redundans (redundans.py). The resulting sequences were clustered using Hi-C data and the 3D-DNA pipeline, and the assembly was manually reviewed and refined using Juicebox Assembly Tools. Finally, the genome was reassembled using 3D-DNA, resulting in 15 anchored chromosomes.
Repeat annotation: Tandem repeats in the genome were identified using Tandem Repeats Finder (TRF). Transposable elements (TEs) were identified using a combination of homology-based and de novo approaches. Homology-based prediction searched for known repeats using RepeatMasker and RepeatProteinMask against Repbase. De novo prediction used LTR_FINDER, LTRharvest, LTR_retriever, and RepeatModeler. The identified repeats were classified using TEsorter with the REXdb database.
Gene prediction and functional annotation: Gene prediction was performed using EVidenceModeler (EVM), which combined RNA-seq data, protein alignments, ab initio gene predictions, and homology-based predictions to generate the final gene set. Training data for the ab initio gene predictors were generated using PASA, and ab initio prediction used tools such as AUGUSTUS, GlimmerHMM, GENSCAN, and SNAP. Homology-based gene annotation used protein sequences from related species. Gene functions were assigned from the best-matching alignment using eggNOG-mapper against the eggNOG 5.0 database.
RNA-seq: The raw RNA-seq reads were stored in FASTQ format and processed with Trimmomatic (version 0.32). This step removed reads containing adapter sequences, reads containing poly-N, and low-quality reads, yielding clean data for downstream analyses. The trimmed clean reads were aligned to the corresponding reference genome using TopHat2 with default settings. Gene expression levels were calculated using Cufflinks v2.2.1. Fragments per kilobase of exon per million fragments mapped (FPKM) was used to normalize RNA-seq fragment counts and estimate the relative abundance of each gene. Differentially expressed genes (DEGs) were identified based on a P-value < 0.05 and at least a 2-fold change between the two FPKM values.
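The FPKM normalization and the DEG thresholds described above can be illustrated with a minimal Python sketch. It is not the Cufflinks implementation, which estimates abundances with a statistical model. It only shows the textbook FPKM formula and the two stated cutoffs. The function names, the example numbers, and the handling of zero FPKM values are assumptions made for illustration.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of exon per million mapped fragments.

    FPKM = fragments * 1e9 / (exon_length_bp * total_mapped_fragments)
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and total mapped fragments must be positive")
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


def is_deg(fpkm_a, fpkm_b, p_value, min_fold=2.0, alpha=0.05):
    """Apply the two cutoffs from the text: P-value < 0.05 and >= 2-fold change.

    Assumption (not stated in the source): a gene expressed in only one
    condition (one FPKM is zero) passes the fold-change cutoff, and a gene
    with zero FPKM in both conditions does not.
    """
    if p_value >= alpha:
        return False
    high, low = max(fpkm_a, fpkm_b), min(fpkm_a, fpkm_b)
    if low == 0:
        return high > 0
    return high / low >= min_fold


# Hypothetical gene: 500 fragments, 2,000 bp of exon, 20 million mapped fragments.
example = fpkm(500, 2000, 20_000_000)
print(example)                      # 12.5
print(is_deg(example, 5.0, 0.01))   # True: 2.5-fold change, P < 0.05
```

The P-value itself would come from the differential expression test; this sketch only applies the thresholds to values that are already computed.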