thesis on genome analysis

Loading metrics

Open Access

Peer-reviewed

Research Article

A versatile workflow to integrate RNA-seq genomic and transcriptomic data into mechanistic models of signaling pathways

Roles Methodology, Software, Writing – original draft

Affiliations Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), Hospital Virgen del Rocío, Sevilla, Spain, Departamento de Biología Celular, Fisiología e Inmunología, Universidad de Córdoba, Córdoba, Spain, Instituto Maimónides de Investigación Biomédica de Córdoba (IMIBIC), Córdoba, Spain, Hospital Universitario Reina Sofía, Córdoba, Spain

Roles Software

Roles Methodology, Software

Roles Conceptualization, Methodology

Affiliations Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), Hospital Virgen del Rocío, Sevilla, Spain, Computational Systems Medicine, Institute of Biomedicine of Seville (IBIS), Sevilla, Spain, Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), FPS, Hospital Virgen del Rocío, Sevilla, Spain

Roles Conceptualization

Affiliations Departamento de Biología Celular, Fisiología e Inmunología, Universidad de Córdoba, Córdoba, Spain, Instituto Maimónides de Investigación Biomédica de Córdoba (IMIBIC), Córdoba, Spain, Hospital Universitario Reina Sofía, Córdoba, Spain

Roles Conceptualization, Supervision

Roles Conceptualization, Funding acquisition, Investigation, Supervision, Writing – original draft, Writing – review & editing

* E-mail: [email protected]

Martín Garrido-Rodriguez,
Daniel Lopez-Lopez,
Francisco M. Ortuno,
María Peña-Chilet,
Eduardo Muñoz,
Marco A. Calzado,
Joaquin Dopazo

Published: February 11, 2021
https://doi.org/10.1371/journal.pcbi.1008748
Peer Review
Reader Comments

MIGNON is a workflow for the analysis of RNA-Seq experiments, which not only efficiently manages the estimation of gene expression levels from raw sequencing reads, but also calls genomic variants present in the transcripts analyzed. Moreover, this is the first workflow that provides a framework for the integration of transcriptomic and genomic data based on a mechanistic model of signaling pathway activities that allows a detailed biological interpretation of the results, including a comprehensive functional profiling of cell activity. MIGNON covers the whole process, from reads to signaling circuit activity estimations, using state-of-the-art tools, it is easy to use and it is deployable in different computational environments, allowing an optimized use of the resources available.

Author summary

Currently, RNA massive sequencing RNA-seq is the most extensively used technique for gene expression profiling in a single assay. The output of RNA-seq experiments contains millions of sequences, generated from cDNA libraries produced by the retro-transcription of RNA samples, that need to be processed by computational methods to be transformed into meaningful biological information. Thus, a number of bioinformatic workflows and pipelines have been proposed to produce different types of gene expression measurements, including in some cases, functional annotations to facilitate biological interpretation. While most pipelines focus exclusively on transcriptional data, the ultimate activity of the resulting gene product also depends critically on its integrity. Although traditional hybridization-based transcriptomics methodologies (microarrays) miss this information, RNA-seq data also contains information on variants present in the transcripts that can affect the function of the gene product, which is systematically ignored by current RNA-seq pipelines. MIGNON is the first workflow able to perform an integrative analysis of transcriptomic and genomic data in the proper functional context, provided by a mechanistic model of signaling pathway activity, making thus the most of the information contained in RNA-Seq data. MIGNON is easy to use and to deploy and may become a valuable asset in fields such as personalized medicine.

Citation: Garrido-Rodriguez M, Lopez-Lopez D, Ortuno FM, Peña-Chilet M, Muñoz E, Calzado MA, et al. (2021) A versatile workflow to integrate RNA-seq genomic and transcriptomic data into mechanistic models of signaling pathways. PLoS Comput Biol 17(2): e1008748. https://doi.org/10.1371/journal.pcbi.1008748

Editor: Mihaela Pertea, Johns Hopkins University, UNITED STATES

Received: May 28, 2020; Accepted: January 30, 2021; Published: February 11, 2021

Copyright: © 2021 Garrido-Rodriguez et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: MIGNON is available from https://github.com/babelomics/MIGNON and its documentation can be found at https://babelomics.github.io/MIGNON/ . Additionally, we have prepared a bash script to perform a dry run. The instructions can be found at https://babelomics.github.io/MIGNON/1_installation.html . The data used in the examples and figures of this manuscript is freely available at: https://figshare.com/articles/dataset/MIGNON_data/13286627/1 .

Funding: JD has received these grants: SAF2017-88908-R from the Ministerio de Economía y Competitividad and PT17/0009/0006 from the Instituto de Salud Carlos III, as well as an FP7 People Marie-Curie Actions 813533 and and Horizon 2020 Framework Programme 676559. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

This is a PLOS Computational Biology Software paper.

Introduction

Because of the plummeting in the cost of sequencing technologies during the last decade, RNA massive sequencing (RNA-seq) has become mainstream to study the transcriptome [ 1 ]. Currently, short-read sequencing technologies, typically producing outputs of 30 million reads per sample, are the most extensively used methodologies for gene expression profiling [ 2 ]. This pace of data generation requires computational processing to produce interpretable results. Thus, the use of pipelines to perform the different steps of transcriptomic data processing have become a widespread practice. The core of these is usually composed by spliced aligners as STAR [ 3 ], HISAT2 [ 4 ] or Rail-RNA [ 5 ], which map reads against a reference genome, or by pseudo-alignment tools as Salmon [ 6 ] or Kallisto [ 7 ], that directly obtain a quantification for the regions of interest using probabilistic models. Additionally, there are pipelines which are intended to be run by the user in local computers or high-performance environments, as QuickRNASeq [ 8 ], or interactively in cloud-based platforms, after uploading raw data to an external service, as BioJupies [ 9 ] or RaNA-Seq [ 10 ]. Typically, the interpretation of the experiment involves differential expression analysis, carried out using count based or linear models, with packages as edgeR [ 11 ], DESeq2 [ 12 ] or limma [ 13 ], followed by methods, such as over representation analysis [ 14 ] or the gene set enrichment analysis [ 15 ], to extract functional information from the obtained results.

Despite different pipelines to perform the aforementioned tasks are available (Tables 1 and 2 ), most of them present two major drawbacks. First, the genomic information contained in the RNA-Seq reads usually remains unused. However, genomic variants, which may contain crucial information about the functionality and potential activity of the resulting proteins in the different processes where they participate, can be retrieved from such sequences. In this sense, it is well known that RNA-Seq has some limitations for DNA variant calling. There are two main points to consider: (i) lowly expressed genes include lower depth, so variant calling is harder in those regions and (ii) the detection of heterozygous variants can be limited due to allele-specific gene expression [ 16 ]. Despite these limitations, it has been demonstrated that variants can be called even for low expressed genes in deeper RNA-Seq sequence samples. Moreover, some studies have shown that RNA-Seq variant calling is able to provide a good sensitivity of 99.7%-99.8% in both heterozygous and homozygous variants whereas precision still reaches 97.6% in homozygous but 90% in heterozygous [ 17 ]. The second major drawback is that conventional functional analysis strategies are mainly descriptive, and very limited in providing biological insights of the underlying molecular mechanisms that produce the observed phenotypic responses. Recently, a new generation of methods, known as mechanistic pathway analyses, are outperforming traditional approaches in both biological explanatory power and interpretability [ 18 ]. Here we present MIGNON ( M echanistic I nte G rative aNalysis O f r N a-seq), a complete and versatile workflow able to exploit all the information contained in RNA-Seq data and producing not only the conventional normalized gene expression matrix, but also an annotated VCF file per sample with the corresponding mutational profile. Moreover, MIGNON can combine both files to model signaling pathway activities through an integrative functional analysis using the mechanistic modeling algorithm Hipathia [ 19 ]. Signaling circuit outputs can further be easily linked to phenotypic features (e.g. disease outcome, drug response, etc.) [ 19 – 21 ]. Mechanistic modeling has been successfully applied to understand disease mechanisms in rare diseases [ 22 , 23 ], complex diseases [ 21 ], and, especially in cancer [ 19 , 24 – 26 ]. Specifically, the hiPathia algorithm has demonstrated to have a superior sensitivity and specificity than other similar algorithms available [ 27 ].

PPT PowerPoint slide
PNG larger image
TIFF original image

https://doi.org/10.1371/journal.pcbi.1008748.t001

https://doi.org/10.1371/journal.pcbi.1008748.t002

Design and implementation

Workflow implementation.

The complete pipeline was developed using the Workflow Description Language (WDL, https://github.com/openwdl/wdl ) due to its flexibility, human-readability and easy deployment. Thus, all the steps of the pipeline were wrapped into WDL tasks that were designed to be executed on an independent unit of containerized software through the use of docker containers, which prevent deployment issues using an independent environment for each execution. The workflow can be executed in personal computers or in high-performance computing (HPC) environments, both locally or in cloud-based services with cromwell ( https://github.com/broadinstitute/cromwell ), a Java based software that control and interpret WDL, using a JSON file as input. To run MIGNON, three dependencies are required: Java (v1.8.0), cromwell and an engine able to run the containerized software (i.e Docker or Singularity). The list of docker containers employed by MIGNON can be found at the S1 Table .

Quality control and alignment

First, using raw sequences as the input for the workflow, fastp (v0.20.0) [ 28 ] is applied to perform the quality trimming and filtering of reads using the default values for windows size and required mean quality and length. Then, FastQC (v0.11.5) can be used to create a quality report for each pre-processed read file. After the quality control step, five modes for the execution of the workflow can be selected (see Table 3 ). Each execution mode uses a different combination of “core” tools to perform the alignment or pseudo-alignment of pre-processed reads, as explained in the tool documentation (see also Fig 1 ). In brief, all of them make use of a combination of STAR (v2.7.2b), HISAT2 (v2.1.0), Salmon (v0.13.0) and FeatureCounts (v1.6.4) [ 29 ] to align (or pseudo-align) reads against a reference genome (or transcriptome) and subsequently obtain the counts per gene matrix. The hisat2 and star modes use a conventional counting strategy, employing FeatureCounts to summarize the number of sequences overlapping the genomic regions of interest (genes), as specified by a genome annotation file. On the other hand, the core component of salmon-hisat2 , salmon-star and salmon consist of the pseudo-aligner Salmon , which directly obtains transcript level quantification using a probabilistic model. Note that in the salmon-hisat2 and salmon-star modes, the execution of STAR or HISAT2 is still necessary to generate the alignment files that feed the variant calling sub-workflow.

Directed graph summarizing the tools employed by the workflow (blue boxes) and the strategy used by MIGNON to integrate genomic and transcriptomic information into signaling circuits. Gene expression and LoF variants are obtained from reads and integrated by doing an in-silico knockdown of genes that present a LoF variant. Then, this combined matrix is used as the input for hiPathia, that estimates the signaling circuits activation status by using expression values as proxies for protein signaling activities.

https://doi.org/10.1371/journal.pcbi.1008748.g001

https://doi.org/10.1371/journal.pcbi.1008748.t003

Variant calling and annotation

Genomic data for the expressed genes can be inferred from reads through variant calling. Due to the number of intermediate steps carried out during this process, it was encapsulated on an independent sub-workflow which is run at sample level. On it, the input material consists of the alignments generated with STAR or HISAT2 and the output is a list of variants in the variant call format (VCF). The whole process is performed using the Genome Analysis ToolKit (v4.1.3.0) [ 30 ], and it was designed following the GATK best practices for the variant calling from RNA-Seq data. Similar to germline variant discovery with DNA sequencing, this sub-workflow specifically includes a step to mark duplicate reads, which will help to reduce the direct dependency of the depth by gene expression. Additionally, the pipeline also includes other steps to specially deal with RNA-Seq peculiarities for variant calling. Thus, some aligned reads are reformatted in order to control the expansion produced by introns. Specifically, reads are split into separate reads when introns are identified inside, thus reducing artifacts in the downstream variant calling. Mapping qualities are also reassigned and adapted to match DNA conventions. Finally, in order to avoid variants called under low evidence, our sub-workflow includes a filter by depth step to only keep those variants found in at least a number of reads (by default >5) as recommended in the literature [ 16 ]. The output VCFs are then annotated with the Variant Effect Predictor (VeP v99) [ 31 ], a powerful tool for the prioritization of genomic variants that summarizes in two scores (Polyphen [ 32 ] and SIFT [ 33 ]) the predicted impact of variants on protein stability and functionality.

Normalization and differential expression analysis

The different execution modes converge at the counts per sample matrix, which is the output of FeatureCounts . On the other hand, for Salmon quantifications, the count matrix is generated with txImport (v1.10.0) [ 34 ] and a transcript-to-gene file. The lengthScaledTPM option is used to correct the estimated counts by both transcript length and library size. Then, RNA-seq gene level counts are normalized with the Trimmed mean of M values (TMM) method and conventional differential gene expression analysis can be performed with the edgeR package (v3.28.0) [ 11 ].

Integrative mechanistic signaling pathway activity analysis

The HiPathia R package (v2.2.0) [ 19 ] is used to perform the functional analysis, either using transcriptomic data alone, or integrating them with the genomic data. HiPathia implements a mechanistic model of signaling pathways that, using gene expression values as proxies of protein activities, infer signaling circuit activities and the corresponding functional profiles triggered by them. Since the model is mechanistic, it allows to infer the effect of an intervention (e.g., a knock-out) on the resulting signaling (and functional) profile [ 35 ], a concept that can easily be assimilated to a loss of function (LoF) [ 21 ]. In practical terms, MIGNON considers that a gene harbors a LoF if it presents at least one variant with a SIFT score < 0.05 and a PolyPhen score > 0.95 (default values that can be modified by the user). Then, an in-silico knock-down is simulated by multiplying the scaled normalized expression values by 0.01 only in the affected samples. Next, the HiPathia signal propagation algorithm is applied to obtain the signaling circuit activities. Finally, the profiles of signaling activities of the samples belonging to the groups of interest are compared using a Wilcoxon signed rank test. For more information about the HiPathia method, please refer to [ 19 ] and [ 21 ].

Modularity of the workflow

The choice of methods for the different steps of MIGNON was based on two recent benchmarking evaluations of the processes to perform the primary analysis of RNA-seq data [ 1 , 36 ]. However, the modular design of the pipeline makes it easy to replace any tool for another one providing it matches the input/output schema used. Thus, users can easily replace tools in the pipeline by making small changes to the MIGNON WDL code, as explained in the documentation ( https://babelomics.github.io/MIGNON/4_advanced.html#modularity ).

MIGNON integrative approach for the mechanistic interpretation of multi-omic information into a pathway framework

MIGNON is the first pipeline able to extract genomic and transcriptomic information from RNA-seq data and integrate them within a mechanistic framework. The ultimate protein activity is assessed from the transcriptional activity conditioned to the integrity of the gene. No matter its level of expression, a gene that harbors a deleterious mutation is in-silico knocked-down by the model to simulate the loss of function ( Fig 1 ). To evaluate how the proposed strategy affects the predicted signaling circuit activities, two different runs of MIGNON were carried out over 462 unrelated human lymphoblastoid cell line samples from the 1000 Genomes sample collection, corresponding to the CEU, FIN, GBR, TSI and YRI populations [ 37 ]. In the reference run, only transcriptomic information (raw) was used, while in the case example run the knock-down strategy was applied. Fig 2A and 2B clearly depicts how the knock-down due to LoF mutations interrupts the transduction of the signal in three circuit/sample pairs. Moreover, Fig 2C shows that the overall predicted signaling circuit activities are significantly lower (paired Wilcoxon signed-rank test P value < 2.2x10 -16 ) when the genomic information is integrated in the model. This example clearly shows how the use of transcriptomics data alone produced an incomplete picture of the real signaling activity and proves the usefulness of multi-omic data integration.

A) Network representation of three signaling circuits that contain genes with loss of function variants for three subjects from the 1000 genomes cohort. The node color indicates whether a gene contained in it has a loss of function variant (yellow) or not (black). Red and blue arrows indicate stimulations and inhibitions, respectively. B) Predicted signaling activity for three circuit/sample pairs on the sub-figure A. Color represents signaling circuit activity with and without considering the genomic information. C) Violin plots showing all the predicted signaling circuit activities with and without the genomic information for the 1000 genomes cohort (paired Wilcoxon signed-rank test P value < 2.2x10 -16 ).

https://doi.org/10.1371/journal.pcbi.1008748.g002

Workflow performance evaluation

To assess MIGNON performance and resource consumption, the workflow was executed over 6 different human datasets ( S2 Table ), comprising a total of 42 samples. It was tested with cromwell (v47) and singularity (v3.5), using 6 different CPU configurations on tasks allowing multi-threading. This analysis revealed that the slower components of the workflow are the aligners (HISAT2 and STAR) and the MarkDuplicates and HaplotypeCaller steps of the GATK sub-workflow. Fig 3 summarizes the time and memory consumption of the tools which allow multi-threading using 6 different CPU configurations. While HISAT2 is slower than STAR, the second one makes a more intensive use of available memory. Therefore, both aligners are available in MIGNON since this tradeoff should be considered if planning to deploy the workflow in cloud-computing based environments or, contrarily, in limited memory computing environments. Additionally, Fig 4 shows the time and memory consumption of the different steps that compose the GATK sub-workflow. Here, MarkDuplicates displays the highest memory consumption and HaplotypeCaller shows the longest runtime. Overall, the different tasks carried out by the workflow show a maximum memory usage under the 32 gigabytes, which makes the pipeline deployable under most computational environments. Finally, and due to the WDL, cromwell and docker combination, the workflow is something fast and easy to deploy and setup.

Multi-thread tasks. A ) Memory consumption by task. Each boxplot represents the maximum memory consumption in Gigabytes (y axis) for each CPU configuration (X axis) and each multi-thread task (facets). Dashed lines indicate the following memory configurations: 4, 8, 16 and 32 gigabytes. B ) Elapsed time by task. Each boxplot represents the elapsed time (Y axis) for each CPU configuration (X axis) and each task (facets). Dashed lines indicate time points: 30, 60, 120 and 240 minutes.

https://doi.org/10.1371/journal.pcbi.1008748.g003

A ) Memory consumption by task. Each boxplot represents the maximum memory consumption in Gigabytes (Y axis) for each task (X axis). Dashed lines indicate the following memory configurations: 4, 8, 16 and 32 gigabytes. B ) Elapsed time by task. Each boxplot represents the elapsed time (Y axis) for each task (X axis). Dashed lines indicate the following time points: 30, 60, 120 and 240 minutes.

https://doi.org/10.1371/journal.pcbi.1008748.g004

Functionality of current available workflows

In order to have a comprehensive list of available pipelines for RNA-seq data processing, only those published from 2015 onwards and able to use raw read files (fastq) as input data were considered. Nine workflows fulfilled these criteria: QuickRNASeq [ 8 ], SEPIA [ 38 ], Recount2 [ 39 ], RNACocktail [ 36 ], ARCHS4 [ 40 ], GREIN [ 41 ], VaP [ 17 ], DEWE [ 42 ] and RaNA-Seq [ 10 ]. Table 1 list the components implemented in each pipeline. Since their performances depend on their components, which are similar across them, a comparison of their respective functionalities is listed in Table 2 . The first noticeable aspect is that, although some of them can carry out variant calling (QuickRNASeq, SEPIA, RNACocktail and VaP), none of them provides a way to integrate the called variants with the gene expression data as MIGNON does. Among the workflows, only SEPIA provides an option for functional analysis of both omic results (obviously transcriptomic and genomic data are interpreted independently). Although the real usage level of these workflows is always difficult to estimate, Google Scholar citations can provide an approximate measurement of the relative impacts in terms of scientific document quotations. According to these observations, SEPIA displays a modest 6% of use among the available workflows. Conversely, Recount2 (36%), ARCHS4 (33%) and RNACocktail (16%) together account for 85% of the citations. Among these, only one (ARCHS4) provides functional analysis, by conventional enrichment analysis. Thus, a workflow capable, not only to extract transcriptomic and genomic information from RNA-seq reads, but also to integrate them and to provide a functional analysis in a sophisticated framework of mechanistic modeling of signaling pathways seems to be a good step forward.

Conclusions

In summary, MIGNON represents an innovative concept of RNA-Seq data analysis that automates the sequence of steps that leads from the uninformative raw reads to the ultimate sophisticated functional interpretation of the experiment, providing, for the first time, a user-friendly framework for integration of genomic and transcriptomic data.

MIGNON makes use of several popular methods to perform the initial processing of reads and utilize the HiPathia mathematical model to provide a mechanistic interpretation of the experiment in the context of human signaling. MIGNON has an enormous application potential in personalized medicine, especially in the analysis of cancer transcriptomes, given its ability to interpret putative driver mutations along with gene expression in the context of signaling activity, a process highly relevant in tumorigenesis.

MIGNON can be easily deployed in different computer environments making an optimal use of the resources. Additionally, the modularity with which the workflow has been designed makes its upgrade and maintenance a straightforward task.

Supporting information

S1 table. list of docker containers employed by mignon..

https://doi.org/10.1371/journal.pcbi.1008748.s001

S2 Table. List of datasets used to assess MIGNON performance.

https://doi.org/10.1371/journal.pcbi.1008748.s002

View Article
PubMed/NCBI
Google Scholar
10. Prieto C, Barrios D. RaNA-Seq: interactive RNA-Seq analysis from FASTQ files to functional analysis. Oxford University Press; 2020.

Essentials of Bioinformatics, Volume I pp 1–18 Cite as

Introduction to Bioinformatics

Babajan Banaganapalli 5 &
Noor Ahmad Shaik 6
First Online: 28 March 2019

3229 Accesses

This chapter offers insights into the interdisciplinary nature of bioinformatics and its contribution and relevance to modern biological research. Modern scientific disciplines like bioinformatics have become highly interdisciplinary. The release of the complete draft of the human genome has virtually revolutionized the shape of modern biological research and has allowed researchers to perceive and interpret complex molecular processes that sustain the life. The discipline of bioinformatics includes adopting diverse range of computational approaches to carry out sequence alignment, structural modeling, biological database design and development, structure prediction, molecular pathway prediction, and in silico gene prediction and mapping. Bioinformatics presently offers excellent highly cohesive data management platforms that work as a seamless interface between wet labs, clinical settings, and state-of-the-art software and database environments.

This is a preview of subscription content, log in via an institution .

Buying options

Available as PDF
Read on any device
Instant download
Own it forever
Available as EPUB and PDF
Durable hardcover edition
Dispatched in 3 to 5 business days
Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Akalin PK (2006) Introduction to bioinformatics. Mol Nutr Food Res 50(7):610–619. https://doi.org/10.1002/mnfr.200500273

Article CAS PubMed Google Scholar

Al-Abbasi FA, Mohammed K, Sadath S, Banaganapalli B, Nasser K, Shaik NA (2018) Computational protein phenotype characterization of IL10RA mutations causative to early onset inflammatory bowel disease (IBD). Front Genet 9:146. https://doi.org/10.3389/fgene.2018.00146

Article CAS PubMed PubMed Central Google Scholar

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410. https://doi.org/10.1016/S0022-2836(05)80360-2

Babajan, B., Chaitanya, M., Rajsekhar, C., Gowsia, D., Madhusudhana, P., Naveen, M., . . . Anuradha, C. M. (2011). Comprehensive structural and functional characterization of Mycobacterium tuberculosis UDP-NAG enolpyruvyl transferase (Mtb-MurA) and prediction of its accurate binding affinities with inhibitors. Interdiscip Sci , 3(3), 204–216. doi: https://doi.org/10.1007/s12539-011-0100-y

Banaganapalli B, Mohammed K, Khan IA, Al-Aama JY, Elango R, Shaik NA (2016) A computational protein phenotype prediction approach to analyze the deleterious mutations of human MED12 gene. J Cell Biochem 117(9):2023–2035. https://doi.org/10.1002/jcb.25499

Banaganapalli, B., Mulakayala, C., D, G., Mulakayala, N., Pulaganti, M., Shaik, N. A., . . . Chitta, S. K. (2013a). Synthesis and biological activity of new resveratrol derivative and molecular docking: dynamics studies on NFkB. Appl Biochem Biotechnol , 171(7), 1639–1657. doi: https://doi.org/10.1007/s12010-013-0448-z

Banaganapalli, B., Mulakayala, C., Pulaganti, M., Mulakayala, N., Anuradha, C. M., Suresh Kumar, C., . . . Gudla, D. (2013b). Experimental and computational studies on newly synthesized resveratrol derivative: a new method for cancer chemoprevention and therapeutics? OMICS , 17(11), 568–583. doi: https://doi.org/10.1089/omi.2013.0014

Banaganapalli, B., Rashidi, O., Saadah, O. I., Wang, J., Khan, I. A., Al-Aama, J. Y., . . . Elango, R. (2017). Comprehensive computational analysis of GWAS loci identifies CCR2 as a candidate gene for celiac disease pathogenesis. J Cell Biochem , 118(8), 2193–2207. doi: https://doi.org/10.1002/jcb.25864

Batzoglou S, Schwartz R (2014) Computational biology and bioinformatics. Bioinformatics 30(12):i1–i2. https://doi.org/10.1093/bioinformatics/btu304

Blekherman G, Laubenbacher R, Cortes DF, Mendes P, Torti FM, Akman S et al (2011) Bioinformatics tools for cancer metabolomics. Metabolomics 7(3):329–343. https://doi.org/10.1007/s11306-010-0270-3

Bodrossy L, Sessitsch A (2004) Oligonucleotide microarrays in microbial diagnostics. Curr Opin Microbiol 7(3):245–254. https://doi.org/10.1016/j.mib.2004.04.005

Bolger ME, Weisshaar B, Scholz U, Stein N, Usadel B, Mayer KF (2014) Plant genome sequencing - applications for crop improvement. Curr Opin Biotechnol 26:31–37. https://doi.org/10.1016/j.copbio.2013.08.019

Bork P (1997) Bioinformatics and molecular medicine--introduction and call for papers. J Mol Med (Berl) 75(1):3–4

CAS Google Scholar

Botstein D, Risch N (2003) Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet 33(Suppl):228–237. https://doi.org/10.1038/ng1090

Brown TA (2002) Genomes (Second Edition), Bios Scientific Publishers Ltd, Oxford; ISBN 1-85996-201-7

Google Scholar

Brzeski H (2002) An introduction to bioinformatics. Methods Mol Biol 187:193–208. https://doi.org/10.1385/1-59259-273-2:193

Burgess-Herbert SL, Cox A, Tsaih SW, Paigen B (2008) Practical applications of the bioinformatics toolbox for narrowing quantitative trait loci. Genetics 180(4):2227–2235. https://doi.org/10.1534/genetics.108.090175

Article PubMed PubMed Central Google Scholar

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421. https://doi.org/10.1186/1471-2105-10-421

Can T (2014) Introduction to bioinformatics. Methods Mol Biol 1107:51–71. https://doi.org/10.1007/978-1-62703-748-8_4

Article PubMed Google Scholar

Carlson CS, Eberle MA, Rieder MJ, Smith JD, Kruglyak L, Nickerson DA (2003) Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat Genet 33(4):518–521. https://doi.org/10.1038/ng1128

Cascorbi I, Henning S, Brockmoller J, Gephart J, Meisel C, Muller JM et al (2000) Substantially reduced risk of cancer of the aerodigestive tract in subjects with variant--463A of the myeloperoxidase gene. Cancer Res 60(3):644–649

CAS PubMed Google Scholar

Chandramouli K, Qian PY (2009) Proteomics: challenges, techniques and possibilities to overcome biological sample complexity. Hum Genomics Proteomics 2009:1. https://doi.org/10.4061/2009/239204

Article CAS Google Scholar

Chen XW, Gao JX (2016) Big Data Bioinformatics. Methods 111:1–2. https://doi.org/10.1016/j.ymeth.2016.11.017

Chicurel M (2002) Genome analysis at your fingertips. Nature 419:751. https://doi.org/10.1038/419751b

Di Tommaso P, Moretti S, Xenarios I, Orobitg M, Montanyola A, Chang JM et al (2011) T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension. Nucleic Acids Res 39(Web Server issue):W13–W17. https://doi.org/10.1093/nar/gkr245

Eddy SR (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform 23(1):205–211

PubMed Google Scholar

Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797. https://doi.org/10.1093/nar/gkh340

Edwards D, Batley J (2010) Plant genome sequencing: applications for crop improvement. Plant Biotechnol J 8(1):2–9. https://doi.org/10.1111/j.1467-7652.2009.00459.x

Erichsen HC, Chanock SJ (2004) SNPs in cancer research and treatment. Br J Cancer 90(4):747–751. https://doi.org/10.1038/sj.bjc.6601574

Frye SV, Jin J (2016) Novel therapeutics targeting epigenetics: new molecules, new methods. ACS Med Chem Lett 7(2):123. https://doi.org/10.1021/acsmedchemlett.6b00037

Global Burden of Disease Cancer Collabration, Fitzmaurice C, Akinyemiju TF, Al Lami FH, Alam T, Alizadeh-Navaei R et al (2018) Global, regional, and National Cancer Incidence, mortality, years of life lost, years lived with disability, and disability-adjusted life-years for 29 Cancer groups, 1990 to 2016: a systematic analysis for the global burden of disease study. JAMA Oncol. https://doi.org/10.1001/jamaoncol.2018.2706

Goldfeder RL, Parker SC, Ajay SS, Ozel Abaan H, Margulies EH (2011) A bioinformatics approach for determining sample identity from different lanes of high-throughput sequencing data. PLoS One 6(8):e23683. https://doi.org/10.1371/journal.pone.0023683

Greene CS, Tan J, Ung M, Moore JH, Cheng C (2014) Big data bioinformatics. J Cell Physiol 229(12):1896–1900. https://doi.org/10.1002/jcp.24662

Greene CS, Troyanskaya OG (2011) PILGRM: an interactive data-driven discovery platform for expert biologists. Nucleic Acids Res 39(Web Server issue):W368–W374. https://doi.org/10.1093/nar/gkr440

Gutmanas A, Oldfield TJ, Patwardhan A, Sen S, Velankar S, Kleywegt GJ (2013) The role of structural bioinformatics resources in the era of integrative structural biology. Acta Crystallogr D Biol Crystallogr 69.(Pt 5:710–721. https://doi.org/10.1107/S0907444913001157

Hickey G, Blanchette M (2011) A probabilistic model for sequence alignment with context-sensitive indels. J Comput Biol 18(11):1449–1464. https://doi.org/10.1089/cmb.2011.0157

Hinkson IV, Davidsen TM, Klemm JD, Kerlavage AR, Kibbe WA (2017) A comprehensive infrastructure for big data in Cancer research: accelerating Cancer research and precision medicine. Front Cell Dev Biol 5:83. https://doi.org/10.3389/fcell.2017.00083

Holford ME, McCusker JP, Cheung KH, Krauthammer M (2012) A semantic web framework to integrate cancer omics data with biological knowledge. BMC Bioinformatics 13( Suppl 1 ):S10. https://doi.org/10.1186/1471-2105-13-S1-S10

Hou J, Acharya L, Zhu D, Cheng J (2016) An overview of bioinformatics methods for modeling biological pathways in yeast. Brief Funct Genomics 15(2):95–108. https://doi.org/10.1093/bfgp/elv040

Jones J, Otu H, Spentzos D, Kolia S, Inan M, Beecken WD et al (2005) Gene signatures of progression and metastasis in renal cell cancer. Clin Cancer Res 11(16):5730–5739. https://doi.org/10.1158/1078-0432.CCR-04-2225

Jorge NA, Ferreira CG, Passetti F (2012) Bioinformatics of Cancer ncRNA in high throughput sequencing: present state and challenges. Front Genet 3:287. https://doi.org/10.3389/fgene.2012.00287

Katoh M, Katoh M (2006) Bioinformatics for cancer management in the post-genome era. Technol Cancer Res Treat 5(2):169–175. https://doi.org/10.1177/153303460600500208

Kihara C, Tsunoda T, Tanaka T, Yamana H, Furukawa Y, Ono K et al (2001) Prediction of sensitivity of esophageal tumors to adjuvant chemotherapy by cDNA microarray analysis of gene-expression profiles. Cancer Res 61(17):6474–6479

Kihara D, Yang YD, Hawkins T (2007) Bioinformatics resources for cancer research with an emphasis on gene function and structure prediction tools. Cancer Inform 2:25–35

PubMed PubMed Central Google Scholar

Koltes JE, Hu ZL, Fritz E, Reecy JM (2009) BEAP: the BLAST extension and alignment program- a tool for contig construction and analysis of preliminary genome sequence. BMC Res Notes 2:11. https://doi.org/10.1186/1756-0500-2-11

Konishi H, Ichikawa D, Arita T, Otsuji E (2016) Microarray technology and its applications for detecting plasma microRNA biomarkers in digestive tract cancers. Methods Mol Biol 1368:99–109. https://doi.org/10.1007/978-1-4939-3136-1_8

Laczny C, Leidinger P, Haas J, Ludwig N, Backes C, Gerasch A et al (2012) miRTrail--a comprehensive webserver for analyzing gene and miRNA patterns to enhance the understanding of regulatory mechanisms in diseases. BMC Bioinformatics 13:36. https://doi.org/10.1186/1471-2105-13-36

Li PC (2016) Overview of microarray technology. Methods Mol Biol 1368:3–4. https://doi.org/10.1007/978-1-4939-3136-1_1

Loy A, Bodrossy L (2006) Highly parallel microbial diagnostics using oligonucleotide microarrays. Clin Chim Acta 363(1–2):106–119. https://doi.org/10.1016/j.cccn.2005.05.041

Macgregor PF, Squire JA (2002) Application of microarrays to the analysis of gene expression in cancer. Clin Chem 48(8):1170–1177

Madden TL, Tatusov RL, Zhang J (1996) Applications of network BLAST server. Methods Enzymol 266:131–141

Mount DW, Pandey R (2005) Using bioinformatics and genome analysis for new therapeutic interventions. Mol Cancer Ther 4(10):1636–1643. https://doi.org/10.1158/1535-7163.MCT-05-0150

Mychaleckyj JC (2007) Genome mapping statistics and bioinformatics. Methods Mol Biol 404:461–488. https://doi.org/10.1007/978-1-59745-530-5_22

Need AC, Motulsky AG, Goldstein DB (2005) Priorities and standards in pharmacogenetic research. Nat Genet 37(7):671–681. https://doi.org/10.1038/ng1593

Neumann RS, Kumar S, Haverkamp TH, Shalchian-Tabrizi K (2014) BLASTGrabber: a bioinformatic tool for visualization, analysis and sequence selection of massive BLAST data. BMC Bioinformatics 15:128. https://doi.org/10.1186/1471-2105-15-128

Non AL, Thayer ZM (2015) Epigenetics for anthropologists: an introduction to methods. Am J Hum Biol 27(3):295–303. https://doi.org/10.1002/ajhb.22679

Pertsemlidis A, Fondon JW 3rd (2001) Having a BLAST with bioinformatics (and avoiding BLASTphemy). Genome Biol 2 (10):REVIEWS2002

CAS PubMed PubMed Central Google Scholar

Puhler A (2017) Bioinformatics solutions for big data analysis in life sciences presented by the German network for bioinformatics infrastructure. J Biotechnol 261:1. https://doi.org/10.1016/j.jbiotec.2017.08.025

Samish I, Bourne PE, Najmanovich RJ (2015) Achievements and challenges in structural bioinformatics and computational biophysics. Bioinformatics 31(1):146–150. https://doi.org/10.1093/bioinformatics/btu769

Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K et al (2012) Database resources of the National Center for biotechnology information. Nucleic Acids Res 40(Database issue):D13–D25. https://doi.org/10.1093/nar/gkr1184

Schadt EE (2006) Novel integrative genomics strategies to identify genes for complex traits. Anim Genet 37(Suppl 1):18–23. https://doi.org/10.1111/j.1365-2052.2006.01473.x

Shaik NA, Awan ZA, Verma PK, Elango R, Banaganapalli B (2018) Protein phenotype diagnosis of autosomal dominant calmodulin mutations causing irregular heart rhythms. J Cell Biochem. https://doi.org/10.1002/jcb.26834

Shaik NA, Kaleemuddin M, Banaganapalli B, Khan F, Shaik NS, Ajabnoor G et al (2014) Structural and functional characterization of pathogenic non- synonymous genetic mutations of human insulin-degrading enzyme by in silico methods. CNS Neurol Disord Drug Targets 13(3):517–532

Soejima H (2009) Epigenetics-related diseases and analytic methods. Rinsho Byori 57(8):769–778

Subramanian S, West RB, Corless CL, Ou W, Rubin BP, Chu KM et al (2004) Gastrointestinal stromal tumors (GISTs) with KIT and PDGFRA mutations have distinct gene expression profiles. Oncogene 23(47):7780–7790. https://doi.org/10.1038/sj.onc.1208056

Taylor JB, Triggle DJ (2007) Comprehensive medicinal chemistry II. Amsterdam; London: Elsevier

Thompson JD, Gibson TJ, Higgins DG (2002) Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics., Chapter 2, Unit 2.3. https://doi.org/10.1002/0471250953.bi0203s00

Varon A, Wheeler WC (2012) The tree alignment problem. BMC Bioinformatics 13:293. https://doi.org/10.1186/1471-2105-13-293

Vaser R, Adusumalli S, Leng SN, Sikic M, Ng PC (2016) SIFT missense predictions for genomes. Nat Protoc 11(1):1–9. https://doi.org/10.1038/nprot.2015.123

Waage J, Standl M, Curtin JA, Jessen LE, Thorsen J, Tian C et al (2018) Genome-wide association and HLA fine-mapping studies identify risk loci and genetic pathways underlying allergic rhinitis. Nat Genet 50(8):1072–1080. https://doi.org/10.1038/s41588-018-0157-1

Wang F, Kong J, Cooper L, Pan T, Kurc T, Chen W et al (2011) A data model and database for high-resolution pathology analytical image informatics. J Pathol Inform 2:32. https://doi.org/10.4103/2153-3539.83192

Wang Y, Zhang Y, Huang Q, Li C (2018) Integrated bioinformatics analysis reveals key candidate genes and pathways in breast cancer. Mol Med Rep 17(6):8091–8100. https://doi.org/10.3892/mmr.2018.8895

Webb B, Sali A (2017) Protein structure modeling with MODELLER. Methods Mol Biol 1654:39–54. https://doi.org/10.1007/978-1-4939-7231-9_4

Wilkinson GR (2005) Drug metabolism and variability among patients in drug response. N Engl J Med 352(21):2211–2221. https://doi.org/10.1056/NEJMra032424

Yalcin D, Hakguder ZM, Otu HH (2016) Bioinformatics approaches to single-cell analysis in developmental biology. Mol Hum Reprod 22(3):182–192. https://doi.org/10.1093/molehr/gav050

Yang MQ, Athey BD, Arabnia HR, Sung AH, Liu Q, Yang JY et al (2009) High-throughput next-generation sequencing technologies foster new cutting-edge computing techniques in bioinformatics. BMC Genomics 10(Suppl 1):I1. https://doi.org/10.1186/1471-2164-10-S1-I1

Zharikova AA, Mironov AA (2016) piRNAs: biology and bioinformatics. Mol Biol (Mosk) 50(1):80–88. https://doi.org/10.7868/S0026898416010225

Zienolddiny S, Skaug V (2012) Single nucleotide polymorphisms as susceptibility, prognostic, and therapeutic markers of nonsmall cell lung cancer. Lung Cancer (Auckl) 3:1–14. https://doi.org/10.2147/LCTT.S13256

Download references

Author information

Authors and affiliations.

Princess Al-Jawhara Center of Excellence in Research of Hereditary Disorders, Department of Genetic Medicine, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia

Babajan Banaganapalli

Department of Genetic Medicine, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia

Noor Ahmad Shaik

You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Babajan Banaganapalli .

Editor information

Editors and affiliations.

Department of Biological Sciences, King Abdulaziz University, Jeddah, Saudi Arabia

Khalid Rehman Hakeem

Ramu Elango

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter.

Banaganapalli, B., Shaik, N.A. (2019). Introduction to Bioinformatics. In: Shaik, N., Hakeem, K., Banaganapalli, B., Elango, R. (eds) Essentials of Bioinformatics, Volume I. Springer, Cham. https://doi.org/10.1007/978-3-030-02634-9_1

Download citation

DOI : https://doi.org/10.1007/978-3-030-02634-9_1

Published : 28 March 2019

Publisher Name : Springer, Cham

Print ISBN : 978-3-030-02633-2

Online ISBN : 978-3-030-02634-9

eBook Packages : Biomedical and Life Sciences Biomedical and Life Sciences (R0)

Share this chapter

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Publish with us

Policies and ethics

Find a journal
Track your research

Genome analysis of the diploid wild potato Solanum bukasovii

Downloadable content.

Bozan, Ilayda
Martina Stromvik (Supervisor)
La pomme de terre (Solanum tuberosum L.) a son origine des Andes en Amérique du Sud. Elle est une culture de base économiquement importante qui peut être cultivée avec succès dans diverses conditions et altitudes. Le génome de la pomme de terre est complexe, avec un vaste pool de gènes provenant de nombreuses espèces sauvages de niveaux de ploïdie variables allant des diploïdes aux hexaploïdes. Les efforts de sélection de la pomme de terre se sont traditionnellement concentrés sur la multiplication végétative de la pomme de terre tétraploïde en raison de sa popularité en tant que culture. Cependant, en raison des niveaux élevés d'hétérozygotie et des difficultés de la sélection tétraploïde, de nouveaux efforts d'amélioration se tournent de plus en plus vers les espèces et variétés diploïdes comme moyen d'introgresser les caractères dans les variétés de cultures. Les données de séquence du génome et les moyens d'analyser et de visualiser ainsi les données sont des outils cruciaux pour y parvenir.Un travail important a déjà été effectué en séquençant et en publiant différents génomes de référence de la pomme de terre; un double monoploïde (S. tuberosum Group Phureja - DM), deux génomes sauvages de référence (S. commersonii et S. chacoense clone M6), un génome de landrance (S. stenotomum subsp. goniocalyx) et deux vrais génomes diploïdes de S. tuberosum (S. tuberosum groupe Tuberosum RH89-039-16, S. tuberosum, Solyntus).La présente étude se concentre sur l'expansion des ressources génomiques pour les analyses du génome de la pomme de terre. Premièrement, l'étude présente une séquence du génome diploïde nouvellement séquencée et assemblée de S. bukasovii, que l'on pense être l'une des espèces sauvages les plus étroitement liées à la pomme de terre cultivée. Cette séquence génomique est comparée aux génomes de référence disponibles de la pomme de terre et les résultats montrent que la variation du nombre de copies (CNV) affecte les gènes qui ont des fonctions importantes telles que la résistance aux maladies et la biosynthèse des métabolites. Deuxièmement, un portail Web - le Potato Genome Diversity Portal (PGDP) - a été développé et mis en œuvre pour visualiser les génomes de pommes de terre publiés à l'aide de JBrowse et pour fournir un outil permettant d'étudier les alignements des génomes. En plus d'aider l'analyse des variations structurelles, le PGDP permet également aux chercheurs de mener des recherches et de partager des données
Potato (Solanum tuberosum L.) originated in the South American Andes and is an economically important staple crop that can be successfully grown in various conditions and altitudes. The potato genome is complex, with a large gene pool drawn from numerous wild species of varying ploidy levels ranging from diploids to hexaploids. Potato breeding efforts in North America and Europe have traditionally focused on vegetative propagation of tetraploid potato because of its higher yield. However, due to the high heterozygosity levels and difficulties of tetraploid breeding, new improvement efforts are increasingly looking to diploid species as a means of introgressing traits into crop varieties.Genome sequence data and means of analyzing and visualizing the data, are crucial to achieve this goal. Significant work has previously been done by sequencing and publishing different potato reference genomes, a double monoploid (S. tuberosum Group Phureja – DM), two wild reference genomes (S. commersonii and S. chacoense clone M6), a landrace genome (S. stenotomum subsp. goniocalyx), and two diploid S. tuberosum genomes (Solanum tuberosum group Tuberosum RH89-039-16, Solanum tuberosum, Solyntus).The present study focuses on expanding the genomic resources for potato genome analyses. First, this study presents a newly sequenced and assembled diploid genome of S. bukasovii, which is thought to be one of the wild species that is most closely related to the cultivated potato. This genome sequence is compared with the available potato reference genomes and the results show that Copy Number Variation (CNV) affect genes have important functions such as disease resistance and metabolite biosynthesis. Second, a web portal – the Potato Genome Diversity Portal (PGDP) - was developed and implemented to visualize the published potato genomes using JBrowse and to provide a tool to investigate the genome alignments along with aiding the structural variation analysis PGDP also enables researchers to conduct research and share data
Plant Science
McGill University
https://escholarship.mcgill.ca/concern/theses/fn1073736
All items in eScholarship@McGill are protected by copyright with all rights reserved unless otherwise indicated.
Department of Plant Science
Master of Science
Theses & Dissertations

Open access
Published: 07 November 2021

Genome-wide association study and transcriptome analysis dissect the genetic control of silique length in Brassica napus L.

Jia Wang 1 , 2 na1 ,
Yueling Fan 1 , 2 na1 ,
Lin Mao 1 , 2 ,
Cunmin Qu 1 , 2 ,
Kun Lu 1 , 2 ,
Jiana Li 1 , 2 &
Liezhao Liu 1 , 2

Biotechnology for Biofuels volume 14 , Article number: 214 ( 2021 ) Cite this article

3945 Accesses

7 Citations

2 Altmetric

Metrics details

Rapeseed is the third-largest oilseed crop after soybeans and palm that produces vegetable oil for human consumption and biofuel for industrial production. Silique length (SL) is an important trait that is strongly related to seed yield in rapeseed. Although many studies related to SL have been reported in rapeseed, only a few candidate genes have been found and cloned, and the genetic mechanisms regulating SL in rapeseed remain unclear. Here, we dissected the genetic basis of SL by genome-wide association studies (GWAS) combined with transcriptome analysis.

We identified quantitative trait locus (QTL) for SL using a recombinant inbred line (RIL) population and two independent GWAS populations. Major QTLs on chromosomes A07, A09, and C08 were stably detected in all environments from all populations. Several candidate genes related to starch and sucrose metabolism, plant hormone signal transmission and phenylpropanoid biosynthesis were detected in the main QTL intervals, such as BnaA9.CP12-2 , BnaA9.NST2 , BnaA7.MYB63 , and BnaA7.ARF17 . In addition, the results of RNA-seq and weighted gene co-expression network analysis (WGCNA) showed that starch and sucrose metabolism, photosynthesis, and secondary cell wall biosynthesis play an important role in the development of siliques.

Conclusions

We propose that photosynthesis, sucrose and starch metabolism, plant hormones, and lignin content play important roles in the development of rapeseed siliques.

Introduction

Brassica napus L., an amphidiploid species formed by natural hybridization of Brassica rapa and Brassica oleracea , is an important oilseed crop with strong adaptability, wide use and high economic value. As the important oilseed crop in the world, Brassica napus ( B. napus ) is cultivated worldwide and is increasingly used for animal feed, vegetable oil and biodiesel [ 1 ]. Therefore, increasing rapeseed yield is one of the important goals of B. napus breeding and cultivation. Silique length (SL) is one of the yield-determining traits in B. napus [ 2 , 3 ]. Silique plays an important role in the yield formation of B. napus . It is not only a sink organ for absorbing and accumulating photosynthetic products produced by leaves but also a source organ for seed development [ 4 ]. In the late stage of seed development, the functional leaf area of B. napus decreases rapidly, and photosynthesis of green silique becomes the main source of nutrition for seed development [ 5 ]. Long siliques generally have a large photosynthetic area to potentially produce more energy; they consume large amounts of energy for their development. Therefore, a proper silique length is needed to balance the processes of producing, transferring, and consuming energy in silique to optimize seed number and size [ 6 ]. In general, long siliques produce more or larger seeds than short siliques. Under the same planting density, silique number per plant, seed number per silique, and seed weight are the three direct components that determine the seed yield per plant [ 7 ]. Therefore, it is of great significance to determine the genetic basis of long silique and cultivate long siliques rapeseed varieties to improve the yield of rapeseed.

Most rapeseed agronomic traits including SL are complex quantitative traits controlled by multiple genes and are influenced by the environment. Genome-wide association analysis (GWAS) and quantitative trait locus (QTL) mapping are effective methods for dissecting complex traits. According to incomplete statistics, more than 100 QTLs for SL have been identified by linkage and association mapping, and QTL controlling SL are distributed on almost all chromosomes, although major QTLs have been found mainly on chromosomes A07, A09, C02, C08, and C09 [ 1 , 6 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 ]. Based on these results, some candidate genes have been reported. Liu et al. [ 16 ] successfully cloned the QTL gene ARF18 , which is a negative regulator that controls both silique length and seed weight in B. napus . One 55-aa deletion prevented ARF18 forming a homodimer, inhibited the activity of downstream auxin genes, and promoted silique elongation by prolonging the length of pericarp cells [ 16 ]. BnaA9.CYP78A9 was cloned in the major QTL region of chromosome A09 by fine mapping. A 3.7-kb insertion of a CACTA-like transposable element (TE) in the regulatory region functioned as an enhancer to stimulate the expression of BnaA9.CYP78A9 and silique elongation [ 10 ]. The molecular regulation of silique development in B. napus is largely unknown, although some important candidate genes have been reported and their relationship with SL has been analysed. Therefore, further understanding of the genetic architecture of rapeseed SL is needed.

With the development of high-throughput sequencing technologies and the continuous decrease in high-throughput sequencing costs, GWAS is increasingly widely used in the study of complex quantitative traits. Increasing the GWAS sample size is the most direct and effective way to improve the test efficiency; however, it incurs a huge workload and expensive experimental costs; and the GWAS of a single small population may not accurately capture the genetic variations. In addition, the GWAS results for the same trait are often inconsistent due to different experimental populations, genotyping, and analysis methods. Meta-analysis provides an attractive alternative to address the abovementioned challenges in the GWAS of single populations, and meta-GWAS have been utilized to detect genetic risk loci for various diseases in humans [ 17 , 18 ]. In plants, using meta-analyses of a genome-wide association study (meta-GWAS), Battenfield et al. [ 19 ] identified marker–trait associations, allele effects, candidate genes, and generated selection markers in bread wheat. Another research group integrated genetic information from 73 published studies including 17,556 soybean germplasm resources, and reported 393 unique peaks including 66 candidate genes across important traits, providing confirmation of many previously reported genes [ 20 ]. Fikere et al. [ 21 ] identified 79 genomic regions (674 SNPs) conferring potential resistance to canola blackleg by meta-analysis GWAS. Su et al. [ 22 ] reported that 3589 significant loci for three component traits and three loci for yield were detected by investigating the four traits of two rice hybrid populations in different environments using meta-GWAS.

In this study, we carried out a QTL analysis with a population of 172 F 9 RILs and GWAS with two independent populations. In addition, we performed WGCNA with transcriptome sequencing of developing pericarp from short and long silique lines. Specifically, we aimed to detect candidate genes associated with SL and to analyse the genetic bases contributing to the difference between short and long siliques. We suggest that selecting SL may be an effective strategy in breeding to improve rapeseed production.

Phenotypic evaluation showed wide variation widely

We measured the SL of 172 RILs from 2016 to 2019, with five replications performed each year, and measured the SL of 520 accessions during 2015 and 608 lines during 2017, with ten replications performed each year. The results showed that SL differed tremendously among rapeseed lines, with 76.26% broad-sense heritability, ranging from 4.81 to 10.89 cm in the RIL population and ranging from 3.39 to 12.74 cm in the GWAS population; 74.07% of lines were concentrated in the range of 5.00–8.00 cm (Fig. 1 a, Additional file 1 : Table S1, Additional file 2 : Fig. S1a, b). A correlation analysis showed a strong correlation between SL and thousand seed weight (TSW), seed yield per plant (YP), seed number per silique (SN), seed number per plant (SNPP), siliques per the main inflorescence (SMI), and harvest index (HI), and a weak correlation with seed oil content (SOC) (Additional file 3 : Fig. S2). The SL variation was further analysed in winter, spring, and semi-winter subgroups. The results indicated that there were differences among the three subgroups, but the differences were not significant. The SL of semi-winter cultivars is 6.07 ± 1.23 cm, while 5.51 ± 1.45 cm and 5.31 ± 1.12 cm in winter and spring accessions, respectively (Additional file 1 : Table S2).

Phenotypic characterization of the short- and long-silique rapeseed. a Silique length of different lines. The line of Z number is from RIL population, and the line of R number is from GWAS population. Bar = 1 cm. b Dynamic changes of silique lengths of S- and L-siliques in different developmental stages. c , d Microstructure observation of the outer pericarp of the S-silique materials and L-silique materials, bar = 100 μm. e , f Microstructure observation of the endocarp of the S-silique materials and L-silique materials, bar = 200 μm. g Cell length measurement of outer pericarp. h Cell width measurement of outer pericarp. i Cell length measurement of endocarp. j Cell width measurement of endocarp. Statistically significant differences were revealed using a Student’s t test: * p < 0.05; ** p < 0.01, *** p < 0.001

We measured the growth rates of siliques using two extreme lines Z068 (5.45 ± 0.29 cm) and Z191 (10.28 ± 1.01 cm), from the RIL population as representatives of short and long siliques, respectively. The results showed that there was no significant difference in SL between long and short siliques in the first 3 days after pollination (DAP). At 9 DAP, there was a significant difference in SL between long and short siliques. The length of the short siliques reached the maximum at 18 DAP, but the length of the long siliques did not stop growing until 27 DAP (Fig. 1 b). Further histological observation of the inner and outer epidermis of the long and short siliques at 27 DAP showed that the cell length of the long silique on the outer epidermis was significantly longer than that of the short silique, while the cell width of the long silique was significantly smaller than that of the short silique, and both the cell length and the cell width of the long silique on the inner epidermis were significantly larger than those of the short silique (Fig. 1 c–j).

RNA-seq found that plant hormones transportation and synthesis of carbohydrates involved in the development of rapeseed silique

After a stringent quality filtering process, 79.37 Gb of clean RNA-seq reads were obtained from 12 samples, with a Q30 percentage ≥ 94.80% (Additional file 4 : Table S2). The obtained 12 samples of clean reads were mapped to the reference genome sequence of B. napus , the percentages of mapped reads were similar among the 12 libraries (86.37–90.23%), and 82.71–86.29% of the reads were uniquely mapped (Additional file 5 : Table S3). Based on the mapped results, the FPKM of all genes was counted, the log 10 FPKM values of all samples fluctuated slightly (− 2.5 to 2.5) (Additional file 6 : Fig. S3a), and the peak values of most genes were between 0 and 1 (Additional file 6 : Fig. S3b). The results of the correlation analysis showed that there was a high correlation and good repeatability between biological replicates (Additional file 6 : Fig. S3c). After DEG analysis, 21,482 DEGs remained for further analysis (the number of DEGs was the sum of each DEG set) (Additional file 6 : Fig. S3d, e, Additional file 7 : Table S4).

To explore the metabolic pathways enriched for the DEGs, the related DEGs in 12 samples were subjected to KEGG metabolic pathway enrichment analysis. In “T1 vs. T2”, we found that the DEGs, whether from S-silique or l -silique, were significantly enriched in starch and sucrose metabolism, phenylalanine metabolism, and phenylpropanoid biosynthesis pathways (Additional file 8 : Fig. S4). The same enrichment result also appeared in “T1 vs. T3”, though the difference was that the plant hormone signal transduction pathway, which was only enriched in the L-silique in “T1 vs. T2”, was also enriched in the S-silique in “T1 vs. T3”. In “T2 vs. T3”, the DEGs of L-silique were still enriched in plant hormone signal transduction, starch and sucrose metabolism, phenylalanine metabolism, etc. (Additional file 8 : Fig. S4). In the T3 stage, the origin of DEGs was significantly different from the T1 and T2 stages, and the genes related to the cutin, suberine, and wax biosynthesis pathways were significantly up-regulated at this stage. Although there were significant differences in DEGs and enrichment pathways among the three stages, the genes related to starch and sucrose metabolism, plant hormone signal transduction, and phenylpropanoid biosynthesis were differentially expressed in different stages of silique development. These results indicated that the genes related to starch and sucrose metabolism, plant hormone signal transduction, and phenylpropanoid biosynthesis may play important roles in silique development.

Linkage mapping and GWAS co-located several major QTLs related to SL

Using linkage mapping, we identified 95 QTLs associated with SL, with a logarithm of the odds (LOD) value above 3.0. Of these, 95 QTLs with phenotypic variation explained (PVE) ranging from 0.02 to 67.06% were identified on nine chromosomes using 4 years, BLUE, and BLUP data (Additional file 9 : Table S5, Additional file 10 : Fig. S5). Three major QTL were located on A07, A09, and C08. qSLA7-2 on A07 was a major locus that was stably detected across all environments and explained 56.24–67.06% of the phenotypic variation. qSLA9-1 and qSLA9-3 on A09 were also major loci that were stably detected across all environments but explained only 0.83–5.33% of the phenotypic variation. qSLC8-2 on C08 was another major locus that was stably detected across all environments and explained 9.91–11.31% of the phenotypic variation. In addition, qSLA6-1 and qSLA6-2 on A06 were detected simultaneously in all environments, but they had a low PVE.

Trait–marker associations were performed using the FarmCPU, Blink, CMLM, GLM, and MLM models in two GWAS mapping populations. Similar results were obtained in the two mapping populations with the five models (Fig. 2 a, b). To facilitate further analysis, combined with Q–Q diagram, we finally chose the MLM model as the follow-up analysis model (Additional file 11 : Fig. S6a, b). In total, we identified 41 SNPs in the 60 K population on chromosomes A06 (1), A07 (6), A09 (15), A10 (1), and C08 (18), whereas 181 SNPs in the whole genome resequencing (WRG) population were identified on chromosomes A01 (9), A03 (9), A04 (11), A06 (10), A07 (52), A08 (5), A09 (54), A10 (4), C01 (1), C04 (1), C05 (5), C06 (1), C07 (4), C08 (9), and C09 (6) using a threshold of 5% after Bonferroni multiple test correction (Additional file 12 : Table S6). The SL was associated with three common significant regions located on chromosomes A07, A09, and C08. Among these, 40 SNPs forming a haplotype block on A09 (27.51–28.18 Mb) were located in the interval of known QTL for SL [ 6 ].

GWAS and QTL co-located of major loci on chromosome A09. a , b Circular Manhattan plots of the 60 K population and WGR population. From the inner ring to the outer ring are results of the FarmCPU, Blink, CMLM, GLM, and MLM models. c Scatterplot of association results from an MLM model analysis of SL on chromosome A09. Negative log10-transformed P values from the GWAS analysis are plotted against the genomic physical position. The green line indicates the threshold level log(1/ N ) = 5.58. d Major QTL loci on the chromosome A09 were repeatedly detected in multiple environments by QTL mapping in an RIL population. 16SL-cq, 17SL-cq, and 18SL-cq represent silique length from Chongqing in 2016, 2017, and 2018, respectively; 18SL-xa and 19SL-xa represent silique length from Xi’an in 2018 and 2019, respectively; BLUP and BLUE represent best linear unbiased predictions and best linear unbiased estimates, respectively. e Location of the reference genome region on A09 corresponding to the major effective loci and LD block analysis of this region. The red gene ID represents that the gene is an important candidate gene. f , i FPKM value and qRT–PCR validation of candidate genes in A09. The upper half represents qRT–PCR results, and the normalized levels ST1 were arbitrarily set to 1. The lower part is the FPKM value obtained by RNA-seq. Statistically significant differences were revealed using a Student’s t test: ns ≥ 0.05, * p < 0.05, ** p < 0.01, *** p < 0.001

Meta-GWAS and polymorphisms in the candidate region were associated with SL

The meta-analysis detected 85 SNPs associated with SL, of which nine SNPs were undetected in all of the single-population GWAS (Additional file 12 : Table S6). Fourteen and 31 SNPs identified in the spring and semi-winters subgroups were confirmed by meta-analysis, respectively, but only three SNPs identified in the winter subgroup were confirmed by meta-analysis. On chromosome A09, the confidence interval of a stable major QTL overlapped with the highly associated region detected by GWAS (Fig. 2 c, d). Forty SNPs identified in GWAS were confirmed by meta-analysis, and seven SNPs undetected in GWAS were mined by meta-analysis in this interval (Additional file 12 : Table S6). 528 gene symbols were found in this region. Among them, 98 were DEGs in RNA-seq, which were listed as candidate genes (Additional file 13 : Table S7). LD analysis showed that most peak SNPs were mainly concentrated in block 2 and block 6 (Fig. 2 e). The peak SNP (S9_28151819) was involved in a 35-kb LD block (block 6) that encompassed 19 SNP markers, and it was 0.91 kb away from BnaA9.CP12-2 was related to carbohydrate anabolism (Additional file 13 : Table S7). In addition, the gene BnaA9.NST2 is involved in secondary cell wall biogenesis [ 23 ] and the Aux/IAA family member BnaA9.IAA30 related to auxin signalling was also found in this block. The results of RNA-seq and qRT–PCR showed that the expression of BnaA9.NST2 in short silique was significantly higher than that in the long silique, especially in the T2 stage (18 DAP) (Fig. 2 f).

In block 2, there were two peak SNPs located inside the BnaA9.SK21 and BnaA9.TMP-C , respectively. BnaA9.SK21 and BnaA9.TMP-C had the same expression pattern in T1 and T2, but BnaA9.TMP-C had a higher expression level in long siliques in T3 (Fig. 2 g, h). Two nonsynonymous SNPs, S9_27782829 and S9_27788376, were present in BnaA9.SK21 and BnaA9.TMP-C , respectively. Further analysis of these two SNPs found that accessions with an AA genotype at S9_27782829 displayed, on average, 18.56% increased silique length compared to the accessions with a TT genotype. The average silique length of AT genotypes was between the AA and TT genotypes. The minor allele (A) was represented in only 17% of the 608 accessions (Additional file 14 : Fig. S7a). There was a significant difference between the GG and CC genotypes at the S9_27788376 ( p < 0.01), and the CC genotype increased silique length (Additional file 14 : Fig. S7b).

Similar results were found on chromosome A07, where a stable major QTL co-located with GWAS. Fourteen SNPs identified in GWAS were confirmed by meta-analysis, and four SNPs undetected in GWAS were mined by meta-analysis in this interval (Additional file 15 : Fig. S8). A total of 102 gene symbols were found in this region, of which 46 were DEGs in RNA-seq and listed as candidate genes (Additional file 13 : Table S7). BnaA7.MYB63 , a homologous gene MYB63 , is a transcriptional regulator specifically activating lignin biosynthetic genes during secondary wall formation in Arabidopsis thaliana [ 24 ] and was only 0.36 kb away from the significant SNP S7_16015077 with a low expression level during silique development (Fig. 2 i). The key lignin biosynthesis gene BnaA7.CCR2 was also detected in this QTL region and BnaA7.CCOAMT , another key gene of lignin biosynthesis, was detected in another highly associated region on A07. Three significant SNPs, S7_16214445, S7_16214995, and S7_16215169, are located inside BnaA7.ARF17 , which is a negative auxin response factor that inhibits downstream auxin-related genes. Further haplotype analysis was focused on the gene BnaA7.ARF17 , and four major haplotypes were observed, with low-frequency haplotypes (less than five accessions) being omitted (Fig. 3 a). We conducted multiple comparison tests of SL, and the results showed that Hap2 and Hap3 had shorter SL than Hap4 ( P < 0.05), while Hap1 was an intermediate type (Fig. 3 b).

Haplotype analysis and polymorphisms in the candidate region were associated with SL. a Haplotypes in 608 accessions (haplotypes with fewer than five accessions were omitted) according to SNP data from WGR population. b Violin plots showing the levels of SL from four haplotypes. c Genomic diversity of chr A09. d Genomic diversity of chr A07. The blue line represents pseudo-wild ancestral and the red line represents landrace rapeseed

We further analysed the sequence diversity of the major QTL region on Chr A07, A09, and C08 among landraces and pseudo-wild ancestral (European turnip and B. oleracea subspecies) genomes based on previously published data [ 25 ] (Fig. 3 c, d). The landraces had a lower π value than pseudo-wild ancestral in the major loci region of A09 (27.50–29.40 Mb). On A07, the highest π value of the major QTL region (from approx. 15.80 to 16.4 Mb) was in landraces (1.18 × 10 −3 ) and pseudo-wild ancestral (1.36 × 10 −3 ). The sequence diversity of the major loci region on chromosome C08 is consistent with A07 and A09 (Additional file 14 : Fig. S7c). These results suggested that these major QTLs might be domesticated and selected during the process of rapeseed domestication from wild type to cultivated rapeseed, resulting in a decrease in sequence diversity.

Co-expression network analysis reveals transcript level differences in photosynthesis and secondary cell wall biosynthesis in long and short siliques

To identify genes related to SL, we performed a weighted gene co-expression network analysis (WGCNA) using the union of non-redundant DEGs and putative candidate genes. After using a dynamic tree cutting algorithm, a total of 18 distinct co-expression modules containing 47 to 1787 genes per module were identified, and 1585 uncorrelated genes were assigned to a grey module, which was ignored in the following study (Fig. 4 a). An analysis of the module–trait relationships revealed that the “lightpink1” ( r = 0.97, p = 9e−08), “chocolate3” ( r = − 0.86, p = 4e−04), “darkgoldnrod4” ( r = − 0.75, p = 0.005), and “lightblue2” ( r = 0.73, p = 0.006) module were highly correlated with the SL in the 12 samples (Fig. 4 b). According to the heatmap of the top 20 genes with the high eigengene connectivity (KME) value in the four modules, the “lightpink1” module had an expression peak in the T2 stage of long silique development, similar to the long silique development pattern, while genes in the “chocolate3” module were highly expressed in the T3 stage of short silique development (Additional file 16 : Fig. S9a, b). The expression level of the top 20 genes in the “darkgoldnrod4” module was the highest in the T1 stage of short silique development, while the expression level of the “lightblue2” module was the highest in the T3 stage of long silique development (Additional file 16 : Fig. S9c, d).

Weighted gene co-expression network analysis. a Clustering dendrogram of genes and construction of modules. b Phenotype and module correlation analysis heat map. Red indicates that the correlation between module eigengenes and silique length is high. c Top 20 KEGG enriched pathways in the set of the lightpink1 module. d Gene co-expression network of the lightpink1 module. The node and edge size is proportional to the core

A Gene Ontology (GO) enrichment analysis of the “lightpink1” module genes identified ten significantly enriched GO terms, most of which are related to photosynthesis. Interestingly, photosynthesis-related pathways were also enriched in the “lightpink1” module by KEGG pathway enrichment analysis (Fig. 4 c, Additional file 17 : Table S8). There was no significant enrichment of GO terms in the “chocolate3” module. One and 13 GO terms were significantly enriched in the “darkgoldnrod4” module and “lightblue2” module, respectively. Among them, most of the enriched terms belonged to “molecular function” and “biological process”, including “monosaccharide transmembrane transporter activity” and “sucrose transport” (Additional file 17 : Table S8). Psby (KME = 0.993), encoding a protein in photosystem II, is one of the hub genes in the “lightpink1” module. In photosystem II, Psby is in close contact with Cytb559 , which can protect photosystem II from photoinhibition for photosynthesis and provide more material and energy for cell proliferation and expansion of the silique pericarp [ 26 ]. BnaC09g09210D (KME = 0.982), another hub gene of the “lightpink1” module, is the homologous gene of AtKNAT7 . In Arabidopsis thaliana , AtKNAT7 is a homologous domain transcription factor of the TALE gene family, which is involved in the regulation of secondary cell wall biosynthesis, and its expression is up-regulated by SND1 and MYB46 [ 27 ]. Genes with weight values between 0.8 and 1 were screened to construct a partial co-expression network around hub genes. In the network, multiple genes were involved in cell elongation and expansion, such as DFL2 , CSLD3 , TCH4 , and CESA6 (Fig. 4 d). These results suggest that photosynthesis and secondary cell wall biosynthesis may play important roles in the determination of silique length during development.

Lignin biosynthesis plays an important role in silique elongation

The important candidate gene BnaA9.NST2 , BnaA7.MYB63 , BnaA7.CCR2 , and BnaA7.CCOAMT , all of which are lignin biosynthesis related genes, were found in the major QTL regions of A07 and A09. Meanwhile, the pathway of “phenylpropanoid biosynthesis” was significantly enriched in different developmental stages of long silique and short silique (Additional file 18 : Fig. S10a–c). These results indicate that lignin may play an important role in the development of siliques. We compared the expression levels of genes related to the lignin biosynthesis pathway in different developmental stages of long silique and short silique. Results indicate that the expression levels of lignin biosynthesis-related genes in different stages of long and short silique development were significantly different, especially in the T2 stage, when the length difference between long and short siliques increased sharply. Lignin biosynthesis related genes were highly expressed in short siliques (Additional file 18 : Fig. S10d).

Similar results were observed of the silique pericarp tissue section and the determination of lignin content. The lignification degree increased gradually with the silique developmental process. Especially at the T2 stage, the lignification degree of the short silique pericarp was significantly higher than that of the long silique pericarp (Fig. 5 a–f). Correspondingly, the lignin content of the short silique was significantly higher than that of the long silique at the T1 and T2 stages, especially T2. However, there was no significant difference in lignin content between long and short siliques at the T3 stage (Fig. 5 g). Consistent with the RNA-seq results, the expression levels of the four genes in the long silique were significantly lower than those in the short silique, especially the T2 stage (Fig. 5 h). All these results suggest that lignin plays an important role in the formation of long and short siliques, especially during the rapid elongation period of the silique.

Evaluation of the phenotypic contribution of lignin to silique length. a – f Microstructure observation of the outer, middle and inner pericarp of the short and long silique. White arrows point to the outer pericarp, yellow arrows point to the middle pericarp, and red arrows point to the inner pericarp. a and d represent microscopic observations of short and long siliques at the T1 stage; b and e represent microscopic observations of short and long siliques at the T2 stage; c and f represent microscopic observations of short and long siliques at the T3 stage. Bar = 100 μm. g Determination of total lignin content in the silique wall of the short and long silique. h qRT–PCR verification of DEGs of key enzymes in the phenylpropanoid–lignin pathway in the short and long siliques at the T1, T2, and T3 stages. Statistically significant differences were revealed using a Student’s t test: ns ≥ 0.05, * p < 0.05, ** p < 0.01, *** p < 0.001

Population stratification is a common source of false positives in GWAS. When the population is stratified, the conventional method is to use a fixed effect covariate matrix ( Q matrix) or PCA to control the false positive caused by population structure. Meta-analysis can increase power, reduce false-positive findings, and even identify new genetic loci, so it can solve the shortcomings of single-population GWAS [ 28 ]. In this study, we tried to perform independent GWAS by subgroups and then perform meta-analysis on the significant SNPs obtained by independent analysis. The results show that the meta-analysis confirmed 71 SNPs identified in the single-population GWAS, of which 59 SNPs had greater p values than in the single-population GWAS. It is worth noting that 14 SNPs undetected in GWAS were identified by meta-analysis, and seven SNPs identified in the single-population GWAS were unconfirmed by meta-analysis. S9_27787423 is closely interlocked with S9_27788376 and located inside BnaA9.TMP-C . The presence of S9_27788376 may mask the effect of S9_27787423, resulting in a small contribution to the phenotype. Meta-analysis showed that S9_27787423 was a significant locus. Similar to this are S7_15927280 and S9_27929254. In brief, using the meta-analysis, missing SNPs that were undetected by the single-population GWAS were retrieved, and false-positive SNPs that were identified by the single-population GWAS were filtered. The advantage of meta-analysis is that it can integrate the results of different research groups to increase the sample size of GWAS analysis, and integrate results across subgroups to avoid the influence of population stratification.

Silique is an important storage organ for photosynthetic products in rapeseed. Moderately increasing the length of the silique can increase the sink capacity of the silique, which is conducive to the transportation of filling materials to seeds. RNA-seq data show that photosynthesis may play an important role during silique development. An interesting phenomenon is that CP12-2 , an important candidate gene related to carbohydrate synthesis and metabolism [ 29 ], was detected simultaneously within the main QTL confidence interval on A09 and C08. BnaA9.CP12-2 and BnaC8.CP12-2 had the same expression pattern, and there was no significant difference between long and short siliques in the T1 stage. In the T2/T3 stage, the expression in the long silique was significantly higher than that in the short silique, especially in the T2 stage when the difference in silique expanded sharply. In addition, KEGG enrichment analysis found that the synthesis and transport pathways of starch and sucrose were significantly enriched in both vertical and horizontal development stages. These results indicate that carbohydrate synthesis and metabolism may be involved in the regulation of silique development.

In this study, both linkage analysis and association analysis detected major QTL loci on A09. The QTL of SL on chromosome A09 has been a focus of research. Although major QTLs have been detected on A09 in many studies, there are differences in mapping results, indicating that there may be multiple genes controlling SL on A09. At present, two SL-related genes have been cloned on A09. ARF18 was the first cloned gene. This gene regulates SL by regulating the auxin signalling pathway and can change seed weight without changing the number of seeds per silique [ 16 ]. Shi et al. cloned BnaA9.CYP78A9 using map-based cloning [ 10 ]. This gene is widely expressed in rapeseed tissues and regulates silique development by affecting the auxin content. Its target gene ARF10/16/17 is similar to the mechanism of ARF18 , both are auxin negative response factors and inhibit downstream auxin-related genes. In our RNA-seq analysis, we found that the plant hormone signal transduction pathway was significantly enriched at different stages of silique development. In addition, linkage analysis and GWAS detected several candidate genes related to plant hormones, such as BnaA7.ARF17 and BnaA9.IAA30 . These results further suggest that plant hormones play an important role in silique development.

In previous studies, long and short siliques were often used as parental materials for gene mapping. Few studies have analysed the characteristics of long and short siliques growth rate. In this study, we found that there was a significant difference in the extension length between long and short siliques at 15 DAP, and the elongation time of short silique fruit was shorter. The results showed that the genes related to the lignin synthesis pathway were significantly differentially expressed between long and short siliques in the early, middle, and late stages of development. In addition, BnaA7.MYB63 and BnaA9.NST2 , the key genes of lignin synthesis, were candidate genes detected by linkage analysis and GWAS. Further determination of lignin content also showed that there was a significant difference in lignin content between long and short siliques. The results suggest that lignin may be one of the important factors determining the silique length. The differential expression of lignin biosynthesis-related genes leads to the differences in lignin content. Lignin accumulation inhibits the expansion of pericarp cells and finally affects silique length. For the same environment, the yield of short siliques rapeseed is often lower than that of long siliques rapeseed. In addition, the seed lignin content is significantly negatively correlated with the seed oil content [ 30 ]. The cost of degrading plant cell wall polysaccharides is high by enzymatic or chemical methods. In addition, lignin affects the processing and utilization of cell wall polysaccharides in many industrial and agricultural aspects. Therefore, strategically reducing lignin content may improve rapeseed yield and the efficiency of biomass utilization for bioenergy.

In the present study, we identified QTLs for SL using an RIL population and two independent GWAS populations. Major QTLs on chromosomes A07, A09, and C08 were stably detected in all environments from all populations. Combining RNA-seq and WGCNA, we found that carbohydrate synthesis and metabolism and plant hormones are involved in the regulation of silique development, and the difference in lignin accumulation may also be a key factor affecting silique length.

Materials and methods

Plant materials and trait measurement.

The genetic map that we developed earlier using a recombinant inbred line (RIL) mapping population with 172 lines was used for QTL mapping in this study [ 31 ]. A previously reported GWAS population with 520 accessions (referred to as the 60 K population) was used for association analysis of silique length [ 32 ], and another previously reported GWAS population with 588 accessions (referred to as the WGR population) was also used in this study [ 25 ]. There are 211 overlapping material between 60 K population and WGR population. The RIL population and the parents were grown in five environments: winter of 2016, 2017, and 2018 at Southwest University in Beibei, Chongqing, China (cq; 29.80° N, 106.40° E) and summer of 2018 and 2019 in Xian, Shanxi, China (xa; 34.27° N, 108.08° E). The 60 K population and WGR population were cultivated under natural growing conditions at the experimental farm of Southwest University for 2015 and 2017, respectively. All lines were arranged in a randomized complete block design with three replicates, and each line was planted in two rows of 10 plants per row, with 30 cm between rows and a distance of 20 cm between plants within each row. After the silique was mature, a ruler was used to measure the silique length (silique body length), and the silique beak length was not included in the silique length. Five siliques were randomly measured for each line of the RIL population, ten siliques were randomly measured for each accession of the GWAS population, and the average value was taken.

QTL mapping, GWAS and meta-GWAS

QTL mapping was performed by composite interval mapping (CIM) using the software WinQTL Cartographer 2.5 software [ 33 ], and LOD thresholds for QTL detection were set to P = 0.05 and were determined using permutation tests with 1000 permutations.

The GWAS was conducted using the general linear model (GLM), mixed linear model (MLM) [ 34 ], compressed mixed linear model (CMLM) [ 34 ], bayesian-information and linkage-disequilibrium iteratively nested keyway (BLINK) [ 35 ], and fixed and random model circulating probability unification (FarmCPU) [ 36 ] with a total of 31,839 and 670,028 SNPs (miss data < 20%, MAF > 0.05) in the GAPIT R package [ 37 ]. For meta-GWAS, we first divided the WGR population into three subgroups (spring, winter, and semi-winters), and then conducted GWAS independently. Then, pooled data from the GWAS results of three subgroups were used for meta-analysis. The meta-analysis was performed using METAL with the p value, β -coefficients, and standard errors from single-subgroup GWAS [ 38 ]. In the single-population GWAS and meta-analysis, the genome-wide significant (1/ N ) thresholds by Bonferroni correction, in which N is the number of SNPs, were used in the analysis.

An LD block was generated using Haploview v4.2 via the four-gamete rule [ 39 ]. The parameters were set as follows: the MAF was 0.05, the maximum number of Mendel errors was 1, the Hardy–Weinberg p value cutoff was 0.001, and the minimum genotype was 75%.

RNA-seq and WGCNA

Based on multi-year phenotypic observations, Z068 and Z191, from the RIL population, were used as representatives of the S- and L-siliques. Silique pericarps of five lines with S- and L-siliques were collected 9, 18, and 27 DAP with two replications, and the samples were immediately frozen in liquid nitrogen and stored at − 80 °C for RNA-Seq. ST1, ST2, and ST3 represent S-silique at 9, 18 and 27 DAP, respectively; LT1, LT2, and LT3 represent L-silique at 9, 18 and 27 DAP, respectively. The libraries were sequenced on the Illumina HiSeq 4000 platform. After the removal of low-quality reads, the clean reads were aligned to the B. napus reference genome. Fragments per kilobase million (FPKM) values were calculated to estimate gene expression levels. Differentially expressed genes (DEGs) were identified in each pair of samples using the criteria of an false discovery rate (FDR) < 0.01 and |log2(fold change)| ≥ 1.5. The genes with FPKM values in more than 12 samples lower than 0.3 were filtered out.

WGCNA was performed using averaged FPKM values and the WGCNA package in R software, the analysis method based on tutorial of WGCNA official website ( https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/ ) [ 40 ]. A total of 5809 non-redundant DEGs (RNA-seq) and putative candidate genes (QTL mapping and GWAS results) were included in the WGCNA workflow. Briefly, cluster analysis was first carried out to remove outliers, and the scale-free topology criterion was used to determine the soft threshold, which is defined as the similarity relationships between gene-pairs and is obtained by computing the unsigned Pearson’s correlation matrix, and threshold parameter beta selects the value at which the fitting curve first approaches 0.9. Subsequently, the network was constructed using a step-by-step method by turning the adjacency matrix into a topological overlap matrix (TOM) and calling the hierarchical clustering function, and network modules were identified using a dynamic tree cut algorithm with a minimum cluster size of 30 and merging the threshold function at 0.25. To identify hub genes within the modules, the module membership (MM) for each gene was calculated based on the Pearson correlation between the expression level and the module eigengene. Gene ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) were performed using the OmicShare tools, and network depictions were visualized using Cytoscape v3.2.1.

Candidate genes analysis and qRT–PCR

The putative candidate genes were searched within the interval of the associated SNPs (± 200 kb) based on the Darmor-bzh B. napus reference genome v4.1. Functional annotation was implemented to predict the function of candidate genes using a Blastp program against to the Arabidopsis thaliana TAIR10 protein database.

The expression levels of silique length-related genes were validated by real-time PCR (qRT–PCR) using a CFX96 Real-time System (BIO-RAD, USA). In brief, 1 µg of RNA from each sample was used for cDNA synthesis; the expression of the silique length-related genes in different rapeseed samples was evaluated using SYBR® Premix (TIANGEN, Beijing, China) and a CFX96 Real-time System (BIO-RAD, USA). Gene-specific primers are listed in Additional file 19 : Table S9. Average Cq values were calculated from three replicates, and expression levels were normalized to the reference gene Bna.Actin7 using the 2 −ΔΔCt method.

Microscopic observation and lignin quantification

Using OCT embedding tissue freezing slice technique, the microstructure of the fresh pericarp of 9, 18, and 27 DAP from extreme lines were observed. After successful embedding, all samples were cut continuously by the ice cutter with 30 μm. The sections with the intact structure were placed under the Nikon ECLIPSE E600 microscope, observed under ultraviolet light with 10× and 20× objective lenses, and photographed and recorded.

The acetyl bromide method is used to determine lignin content. Briefly, all pericarps were dried to constant weight in an oven at 70 °C, and the samples were ground to pass an 80-mesh screen in the microball mill prior to drying over P2O5 in a vacuum desiccator. The cell wall was isolated by four stage Soxhlet extraction. The dried cell wall samples about 2 mg were added to a 2 mL glass tube with 200 μL of 25% acetyl bromide in acetic acid. The tubes were tightly sealed and put in a 50 °C water bath for 2 h with shaking at 30 min intervals. After cooling the tubes to room temperature, the samples were transferred to 15 mL stoppered graduated centrifuge tube containing 800 μL 2 M NaOH and 140 μL 0.5 M hydroxylamine hydrochloride. Then acetic acid is used to rinse tubes and made up to 15 mL. The absorbance of the solutions was read at 280 nm using a Varian Cary 50 spectrophotometer. A blank was included to correct for background absorbance by the reagents. The lignin content was estimated according to the following equation: %ABSL = (abs/(Coeff × 1 cm)) * ((15 mL * 100%)/weight(mg)), where abs represents absorbance, Coeff represents extinction coefficient [ 41 ].

Availability of data and materials

The data sets supporting the conclusions of this article are included within the article and its additional files, and the raw RNA-Seq data have been deposited in the NCBI database under BioProject accession code PRJNA752824.

Wang X, Chen L, Wang A, Wang H, Tian J, Zhao X, et al. Quantitative trait loci analysis and genome-wide comparison for silique related traits in Brassica napus . BMC Plant Biol. 2016;16:1–15.

Google Scholar

Diepenbrock W. Yield analysis of winter oilseed rape ( Brassica napus L.): a review. F Crop Res. 2000;67:35–49.

Chay P, Thurling N. Identification of genes controlling pod length in spring rapeseed, Brassica napus L., and their utilization for yield improvement. Plant Breed. 1989;103:54–62.

Bennett EJ, Roberts JA, Wagstaff C. The role of the pod in seed development: strategies for manipulating yield. New Phytol. 2011;190:838–53.

PubMed Google Scholar

King SP, Lunn JE, Furbank RT. Carbohydrate content and enzyme metabolism in developing canola siliques. Plant Physiol. 1997;114:153–60.

CAS PubMed PubMed Central Google Scholar

Dong H, Tan C, Li Y, He Y, Wei S, Cui Y, et al. Genome-wide association study reveals both overlapping and independent genetic loci to control seed weight and silique length in Brassica napus . Front Plant Sci. 2018;9:1–9.

Wang S, Cao L, Wang H. Arabidopsis ubiquitin-conjugating enzyme UBC22 is required for female gametophyte development and likely involved in Lys11-linked ubiquitination. J Exp Bot. 2016;67:3277–88.

Fu Y, Wei D, Dong H, He Y, Cui Y, Mei J, et al. Comparative quantitative trait loci for silique length and seed weight in Brassica napus . Sci Rep. 2015;5:1–9.

Li N, Shi J, Wang X, Liu G, Wang H. A combined linkage and regional association mapping validation and fine mapping of two major pleiotropic QTLs for seed weight and silique length in rapeseed ( Brassica napus L.). BMC Plant Biol. 2014;14:114.

PubMed PubMed Central Google Scholar

Shi L, Song J, Guo C, Wang B, Guan Z, Yang P, et al. A CACTA-like transposable element in the upstream region of BnaA9.CYP78A9 acts as an enhancer to increase silique length and seed weight in rapeseed. Plant J. 2019;98:524–39.

CAS PubMed Google Scholar

Wang H, Zaman QU, Huang W, Mei D, Liu J, Wang W, et al. QTL and candidate gene identification for silique length based on high-dense genetic map in Brassica napus L. Front Plant Sci. 2019;10:1579.

Yang P, Shu C, Chen L, Xu J, Wu J, Liu K. Identification of a major QTL for silique length and seed weight in oilseed rape ( Brassica napus L.). Theor Appl Genet. 2012;125:285–96.

Yang Y, Shen Y, Li S, Ge X, Li Z. High density linkage map construction and QTL detection for three silique-related traits in orychophragmus violaceus derived Brassica napus population. Front Plant Sci. 2017;8:1512.

Zhao W, Zhang L, Chao H, Wang H, Ta N, Li H, et al. Genome-wide identification of silique-related traits based on high-density genetic linkage map in Brassica napus . Mol Breed. 2019;39:86.

Zhou X, Dai L, Wang P, Liu Y, Xie Z, Zhang H, et al. Mining favorable alleles for five agronomic traits from the elite rapeseed cultivar Zhongshuang 11 by QTL mapping and integration. Crop J. 2021. https://doi.org/10.1016/j.cj.2020.12.008 .

Article Google Scholar

Liu J, Hua W, Hu Z, Yang H, Zhang L, Li R, et al. Natural variation in ARF18 gene simultaneously affects seed weight and silique length in polyploid rapeseed. Proc Natl Acad Sci USA. 2015;112:E5123–32.

Zeggini E, Scott LJ, Saxena R, Voight BF, Marchini JL, Hu T, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet. 2008;40:638–45.

De Jager PL, Jia X, Wang J, De Bakker PIW, Ottoboni L, Aggarwal NT, et al. Meta-analysis of genome scans and replication identify CD6 , IRF8 and TNFRSF1A as new multiple sclerosis susceptibility loci. Nat Genet. 2009;41:776–82.

Battenfield SD, Sheridan JL, Silva LDCE, Miclaus KJ, Dreisigacker S, Wolfinger RD, et al. Breeding-assisted genomics: applying meta-GWAS for milling and baking quality in CIMMYT wheat breeding program. PLoS ONE. 2018;13:e0204757.

Shook JM, Zhang J, Jones SE, Singh A, Diers BW, Singh AK. Meta-GWAS for quantitative trait loci identification in soybean. G3 Genes|Genomes|Genetics. 2021;11:jkab117.

PubMed Central Google Scholar

Fikere M, Barbulescu DM, Malmberg MM, Spangenberg GC, Cogan NOI, Daetwyler HD. Meta-analysis of GWAS in canola blackleg ( Leptosphaeria maculans ) disease traits demonstrates increased power from imputed whole-genome sequence. Sci Rep. 2020;10:14300.

Su J, Xu K, Li Z, Hu Y, Hu Z, Zheng X, et al. Genome-wide association study and Mendelian randomization analysis provide insights for improving rice yield potential. Sci Rep. 2021;11:6894.

Zhong R, Ye ZH. The arabidopsis NAC transcription factor NST2 functions together with SND1 and NST1 to regulate secondary wall biosynthesis in fibers of inflorescence stems. Plant Signal Behav. 2015;10:e989746.

Zhou J, Lee C, Zhong R, Ye ZH. MYB58 and MYB63 are transcriptional activators of the lignin biosynthetic pathway during secondary cell wall formation in Arabidopsis . Plant Cell. 2009;21:248–66.

Lu K, Wei L, Li X, Wang Y, Wu J, Liu M, et al. Whole-genome resequencing reveals Brassica napus origin and genetic loci involved in its improvement. Nat Commun. 2019;10:1–12.

Von Sydow L, Schwenkert S, Meurer J, Funk C, Mamedov F, Schröder WP. The PsbY protein of Arabidopsis photosystem II is important for the redox control of cytochrome b559. Biochim Biophys Acta Bioenergy. 2016;1857:1524–33.

Liu Y, You S, Taylor-Teeples M, Li WL, Schuetz M, Brady SM, et al. BEL1-LIKE HOMEODOMAIN6 and KNOTTED ARABIDOPSIS THALIANA7 interact and regulate secondary cell wall formation via repression of REVOLUTA. Plant Cell. 2014;26:4843–61.

Zhou S, Ding R, Meng F, Wang X, Zhuang Z, Quan J, et al. A meta-analysis of genome-wide association studies for average daily gain and lean meat percentage in two Duroc pig populations. BMC Genom. 2021;22:12.

CAS Google Scholar

Fermani S, Trivelli X, Sparla F, Thumiger A, Calvaresi M, Marri L, et al. Conformational selection and folding-upon-binding of intrinsically disordered protein CP12 regulate photosynthetic enzymes assembly. J Biol Chem. 2012;287:21372–83.

Miao L, Chao H, Chen L, Wang H, Zhao W, Li B, et al. Stable and novel QTL identification and new insights into the genetic networks affecting seed fiber traits in Brassica napus . Theor Appl Genet. 2019;132:1761–75.

Liu L, Qu C, Wittkop B, Yi B, Xiao Y, He Y, et al. A high-density SNP map for accurate mapping of seed fibre QTL in Brassica napus L. PLoS ONE. 2013;8:e83052.

Wang J, Jian H, Wei L, Qu C, Xu X, Lu K, et al. Genome-wide analysis of seed acid detergent lignin (ADL) and hull content in rapeseed ( Brassica napus L.). PLoS ONE. 2015;10:e0145045.

Wang S, Basten CJ, Zeng Z-B. Windows QTL Cartographer 2.5. Raleigh: Department of Statistics, North Carolina State University; 2012.

Zhang Z, Ersoz E, Lai C-Q, Todhunter RJ, Tiwari HK, Gore MA, et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet. 2010;42:355–60.

Huang M, Liu X, Zhou Y, Summers RM, Zhang Z. BLINK: a package for the next level of genome-wide association studies with both individuals and markers in the millions. Gigascience. 2019;8:1–12.

Liu X, Huang M, Fan B, Buckler ES, Zhang Z. Iterative usage of fixed and random effect models for powerful and efficient genome-wide association studies. PLoS Genet. 2016;12:e1005767.

Tang Y, Liu X, Wang J, Li M, Wang Q, Tian F, et al. GAPIT version 2: an enhanced integrated tool for genomic association and prediction. Plant Genome. 2016. https://doi.org/10.3835/plantgenome2015.11.0120 .

Article PubMed Google Scholar

Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010;26:2190–1.

Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21:263–5.

Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinform. 2008;9:559.

Xue FC, Chandra R, Berleth T, Beatson RP. Rapid, microscale, acetyl bromide-based method for high-throughput determination of lignin content in Arabidopsis thaliana . J Agric Food Chem. 2008;56:6825–34.

Cai D, Xiao Y, Yang W, Ye W, Wang B, Younas M, et al. Association mapping of six yield-related traits in rapeseed ( Brassica napus L.). Theor Appl Genet. 2014;127:85–96.

Merk HL, Yarnes SC, Van Deynze A, Tong N, Menda N, Mueller LA, et al. Trait diversity and potential for selection indices based on variation among regionally adapted processing Tomato Germplasm. J Am Soc Hortic Sci. 2012;137:427–37.

Download references

Acknowledgements

We are grateful to the reviewers and editors for their constructive review and suggestions for this paper.

This research was supported by the National Natural Science Foundation of China (31771830, 31971902), China Agriculture Research System of MOF and MARA, and the “111” Project (B12006).

Author information

Jia Wang and Yueling Fan have contributed equally to this work

Authors and Affiliations

College of Agronomy and Biotechnology, Southwest University, Beibei, Chongqing, China

Jia Wang, Yueling Fan, Lin Mao, Cunmin Qu, Kun Lu, Jiana Li & Liezhao Liu

Academy of Agricultural Sciences, Southwest University, Beibei, Chongqing, China

You can also search for this author in PubMed Google Scholar

Contributions

LZL designed study; YLF collected the phenotypic data; JW and YLF conducted study, JW, YLF, and LM analyzed data; CMQ, KL, and JNL provided resources; JW wrote manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Liezhao Liu .

Ethics declarations

Competing interests.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: table s1..

Phenotypic variations in silique length (SL) of the B. napus accessions.

Additional file 2: Figure S1.

Distribution of silique length in the RIL and WGAS populations. (a) Representative RIL population. 16SL-cq, 17SL-cq, and 18SL-cq represent silique length from Chongqing in 2016, 2017, and 2018, respectively; 18SL-xa and 19SL-xa represent silique length from Xi’an in 2018 and 2019, respectively; BLUP and BLUE represent best linear unbiased predictions and best linear unbiased estimates, respectively. (b) Representative GWAS population.

Additional file 3: Figure S2.

Pearson’s correlation coefficients of silique length and yield-related traits. The upper triangle shows statistics (correlation coefficient), the lower triangle shows smooth spline regression, and diagonals are histograms. SL, silique length; TSW, thousand seed weight; SN, seed number per silique; YP, yield per plant; SNPP, seed number per plant; SMI, siliques per main inflorescence; HI, harvest index; SOC, seed oil content.

Additional file 4: Table S2.

Quality of sequencing data.

Additional file 5: Table S3.

Statistical table of sequence alignment results between sample sequencing data and selected reference genome.

Additional file 6: Figure S3.

Overview of RNA-seq. (a) Gene expression profile of each sample. (b) Comparison diagram of the FPKM density distribution of each sample. (c) Correlation heatmap of 12 samples. (d) Venn diagram of DEGs. (e) The number of up- and down-regulated DEGs.

Additional file 7: Table S4.

The number of differentially expressed genes (DEGs).

Additional file 8: Figure S4.

Top 20 KEGG enriched pathways in set of T1 vs. T2, T1 vs. T3, and T2 vs. T3 between short silique and long silique.

Additional file 9: Table S5.

QTL for silique length detected in the RILs across five environments, BLUE, and BLUP.

Additional file 10: Figure S5.

Genetic linkage map and QTL detection of the silique length in the RIL population. For simplicity, only the markers in the QTL confidence intervals, along with the terminal two markers at each end of the QTL-containing chromosomes, are shown.

Additional file 11: Figure S6.

Quantile–quantile plots of estimated − log10( P ) from association analysis of SL. (a) Quantile–quantile plots of the 60 K population; (b) quantile–quantile plots of the WGR population.

Additional file 12: Table S6.

Genome-wide significant association signals of SL using the single-population GWAS and meta-GWAS.

Additional file 13: Table S7.

Candidate genes within the linked genomic region of single nucleotide polymorphisms (SNPs) most highly associated with SL in B. napus .

Additional file 14: Figure S7.

Box plot of genotype frequency distribution of different SNP loci and polymorphisms in the candidate region. (a) S9_27782829 genotype frequency distribution. (b) S9_27788376 genotype frequency distribution. ** represents significance p < 0.01. (c) Genomic diversity of landrace (the red line) and pseudo-wild ancestral (the blue line) rapeseed on C08, respectively.

Additional file 15: Figure S8.

Mapping of QTL on A07 using the approach of linkage analysis and association analysis. The Manhattan plots of chromosome A07 are plotted in the top left graph, and the green line indicates the threshold level log(1/ N ) = 5.58. The QTL on A07 detected in the RIL population is shown in the top right graph. The diagram below these two graphs shows the location of the reference genome region on A07 corresponding to the QTL and LD block analysis of this region. The red gene ID represents that the gene is an important candidate gene.

Additional file 16: Figure S9.

Cluster analysis of the top twenty genes of KME values in modules. (a) Lightpink1 module. (b) Chocolate3 module. (c) Darkgoldnrod4 module. (d) Lightblue2 module.

Additional file 17: Table S8.

GO enrichments of modules.

Additional file 18: Figure S10.

Top 20 KEGG enriched pathways in the set of ST1 vs. LT1, ST2 vs. LT2, and ST3 vs. LT3 and the expression analysis of DEGs of key enzymes in the phenylpropanoid–lignin pathway. (a) ST1 vs. LT1. (b) ST2 vs. LT2. (c) ST3 vs. LT3. (d) The expression analysis of DEGs of key enzymes in phenylpropanoid–lignin pathway. PAL , phenylalanine ammonia-lyase; C4H , cinnamate 4-hydroxylase; 4CL , 4-coumarate acid: CoA ligase; HCT , hydroxycinnamoyl-CoA shikimate/quinate transferase; C3H , coumarate 3-hydroxylase; CCoACOMT , caffeoyl-CoA- O -methyltransferase; CCR , cinnamoyl CoA reductase; F5H , ferulate 5-hydroxylase; COMT , caffeic acid O -methyltransferase; CAD , cinnamyl alcohol dehydrogenase; PER , peroxidase; LAC , laccase.

Additional file 19: Table S9.

Primers used for qRT–PCR.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Wang, J., Fan, Y., Mao, L. et al. Genome-wide association study and transcriptome analysis dissect the genetic control of silique length in Brassica napus L.. Biotechnol Biofuels 14 , 214 (2021). https://doi.org/10.1186/s13068-021-02064-z

Download citation

Received : 03 September 2021

Accepted : 25 October 2021

Published : 07 November 2021

DOI : https://doi.org/10.1186/s13068-021-02064-z

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Brassica napus
Silique length

Biotechnology for Biofuels and Bioproducts

ISSN: 2731-3654

General enquiries: [email protected]

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

View all journals
My Account Login
Explore content
About the journal
Publish with us
Sign up for alerts
Open access
Published: 08 April 2024

A chromosomal-scale genome assembly of modern cultivated hybrid sugarcane provides insights into origination and evolution

Yixue Bao 1 na1 ,
Qing Zhang ORCID: orcid.org/0000-0002-1160-2715 1 , 2 na1 ,
Jiangfeng Huang ORCID: orcid.org/0000-0001-7784-4509 1 na1 ,
Shengcheng Zhang ORCID: orcid.org/0000-0002-3949-4269 2 na1 ,
Wei Yao 1 ,
Zehuai Yu 1 ,
Zuhu Deng 1 ,
Jiaxin Yu 2 ,
Weilong Kong 2 ,
Xikai Yu 1 ,
Shan Lu ORCID: orcid.org/0000-0001-5368-896X 1 ,
Yibin Wang 2 ,
Yuhan Song 2 ,
Chengwu Zou ORCID: orcid.org/0000-0002-2836-2930 1 ,
Yuzhi Xu 1 ,
Zongling Liu 1 ,
Fan Yu ORCID: orcid.org/0000-0003-4550-1936 1 ,
Jiaming Song 1 ,
Youzong Huang ORCID: orcid.org/0000-0002-0962-785X 1 ,
Jisen Zhang ORCID: orcid.org/0000-0003-1041-2757 1 ,
Haifeng Wang 1 ,
Baoshan Chen ORCID: orcid.org/0000-0001-5866-7033 1 ,
Xingtan Zhang ORCID: orcid.org/0000-0002-5207-0882 2 &
Muqing Zhang ORCID: orcid.org/0000-0003-3138-3422 1

Nature Communications volume 15 , Article number: 3041 ( 2024 ) Cite this article

1781 Accesses

1 Altmetric

Metrics details

Agricultural genetics
Comparative genomics
Evolutionary genetics
Plant breeding

Sugarcane is a vital crop with significant economic and industrial value. However, the cultivated sugarcane’s ultra-complex genome still needs to be resolved due to its high ploidy and extensive recombination between the two subgenomes. Here, we generate a chromosomal-scale, haplotype-resolved genome assembly for a hybrid sugarcane cultivar ZZ1. This assembly contains 10.4 Gb genomic sequences and 68,509 annotated genes with defined alleles in two sub-genomes distributed in 99 original and 15 recombined chromosomes. RNA-seq data analysis shows that sugar accumulation-associated gene families have been primarily expanded from the ZZSO subgenome. However, genes responding to pokkah boeng disease susceptibility have been derived dominantly from the ZZSS subgenome. The region harboring the possible smut resistance genes has expanded significantly. Among them, the expansion of WAK and FLS2 families is proposed to have occurred during the breeding of ZZ1. Our findings provide insights into the complex genome of hybrid sugarcane cultivars and pave the way for future genomics and molecular breeding studies in sugarcane.

A draft chromosome-scale genome assembly of a commercial sugarcane

Jeremy R. Shearman, Wirulda Pootakham, … Sithichoke Tangphatsornruang

Genomic insights into the recent chromosome reduction of autopolyploid sugarcane Saccharum spontaneum

Qing Zhang, Yiying Qi, … Jisen Zhang

The complex polyploid genome architecture of sugarcane

A. L. Healey, O. Garsmeur, … A. D’Hont

Introduction

Sugarcane, a member of the sub-tribe Saccharinae within the Andropogoneae tribe 1 , is arguably the most voluminous crop grown worldwide, with its weight surpassing that of staple food crops such as rice, maize, or wheat ( http://www.fao.org/faostat/en/#home ). As the global leader in sugar production and a prime candidate for bioenergy production, sugarcane accounts for over 80% of the world’s sugar and 40% of bioethanol yield, which gives it an estimated annual economic value of up to US $90 billion ( https://www.fao.org/faostat/zh/#data/QV ).

Modern sugarcane hybrids have their origins in interspecific crosses between the thick-stalked, high-sugar Saccharum officinarum and the wild, thin-stalked, low-sugar Saccharum spontaneum , with additional multiple backcrossing with S. officinarum 2 , 3 . This intricate hybridization process not only enhanced the vigor, robustness, tillering, disease resistance, and environmental adaptability of modern cultivated sugarcane hybrids but also escalated the complexity of their genome beyond that of their progenitors 4 . The resulting hybrid genome comprises a mix of aneuploid and homo(eo)logous chromosomes unevenly inherited from the two polyploid progenitor species, leading to a large genome size of ~10 Gb. The number of chromosomes in hybrids can range from 100 to 130 depending on the specific cross 5 , 6 . Approximately 70 to 80% of these chromosomes are derived from S. officinarum , 10 to 20% from S. spontaneum , and about 10% result from interspecific recombination 4 , 7 , 8 , 9 .

While decoding the auto-polyploid genomes of S. officinarum and S. spontaneum 10 , 11 has somewhat clarified the challenges of polyploid phasing through technological advancements and algorithmic innovations 12 , assembling the sugarcane hybrid genome presents an unprecedented challenge due to its heterologous aneuploid nature with gene loci consisting of 8–14 homo(eo)logous copies 13 , 14 , 15 . Published hybrid sugarcane genomes such as SP80-3280 16 , KK3 17 , and R570 18 only partially represent the complete information of hybrid sugarcane cultivars due to incomplete assembly. In addition, one of the most severe diseases, smut, caused 20 to 30% yield and sugar losses in China. Pokkah boeng disease (PBD), a century-old disease worldwide, is becoming more severe in China due to the overuse of nitrogen fertilizer and the warming of the climate.

This work presents a haplotype-resolved and chromosome-level de novo genome assembly of the hybrid sugarcane cultivar ZZ1. We reveal the expansion of genes related to sugar accumulation, smut resistance, and the origin of genes associated with PBD susceptibility.

Karyotype and genome assembly

The modern hybrid sugarcane cultivar originated from hybridization between two ancestral Saccharum species, namely S. spontaneum and S. officinarum , followed by several rounds of backcross with S. officinarum . The high ploidy and extensive recombination between the two subgenomes make the cultivated sugarcane an ultra-complex genome that still needs to be resolved. We initially investigated the genome features of the modern hybrid sugarcane cultivar ZZ1 using flow cytometry and karyotype analysis. The estimated genome size is ~9.0 Gb (Table 1 ), consistent with previous estimations of modern hybrid sugarcane cultivars 18 . Karyotype analysis using the Oligo probes designed from S. spontaneum specific abundant retrotransposons identified a total of 114 chromosomes, of which 68 originated from S. officinarum (So), 31 from S. spontaneum (Ss), and 15 represented recombination (Rec) between subgenomes (Fig. 1a ). Chromosome painting with chromosome-specific probes designed from the S. officinarum (LA-purple) monoploid assembly classified these 114 chromosomes into ten homeologous groups, following the same naming rule in the R570 18 (Supplementary Table 1 ).

a Karyotype analysis, chromosomes with different ancestors of SO (left), SS (right), and recombination between SO and SS (mid) were shown. b overview of sequencing and assembly strategy: (I) assembly, filter, and correct contigs, (II) separate contigs into ROC and YZ groups based on specific k -mer difference, (III) separate contigs into SO and SS groups with mapping quality and similarity, contigs from Rec chromosomes are collected based on Hi-C signals between SO- and SS-derived contigs, (IV) apply haplotype phasing and chromosome-scaffolding in all six groups respectively and Manually adjust. c genomic features of the ZZ1 genome, the reference genome at the center position is Sorghum (Sb), and the collinearity among six chromosome groups in ( b ) with Sb were drawn respectively.

To solve the assembly of this ultra-complex genome, we incorporated multiple sequencing technologies, including Illumina, PacBio CCS platforms, and proximity ligation approach, generating 656 Gb short, 325 Gb high-fidelity (HiFi) long, and 1.175 Tb high-throughput chromatin conformation capture (Hi-C) short reads, respectively (Table 1 and Supplementary Tables 2 , 3 ). To facilitate chromosome phasing, we also sequenced ~50× Illumina short reads for its two parental genomes (ROC25 and YZ89-7; Supplementary Fig. 1 ). The initial contigs were assembled using a widely applied HiFi assembler (i.e., hifiasm), resulting in 11.2 Gb sequences with the N50 of 527 kb (Supplementary Table 4 ). Those misjoined contigs were further identified based on abnormal chromatin interaction signals detected by Hi-C linked reads, refining 3.6% (5125/141,261) of total contigs. In addition, we identified 2.53 Mb artifactual sequences based on a whole-genome survey of artifactual k -mers and 835.53 Mb contamination, including organelle genomes and bacterial sequences. Removing these artifactual and contaminated sequences produces a high-quality contig-level assembly containing 10.4 Gb genomic sequences (Supplementary Table 4 ).

Pre-assigning contigs facilitated the scaffolding of this highly complex genome to different origins (see Methods; Fig. 1b ). We first assigned these refined contigs to maternal (ROC) and paternal (YZ) groups based on resequenced parental genomes (ROC25 and YZ89-7). The two groups contain similar sequences (4.3 Gb in ROC and 4.7 Gb in YZ), accounting for 87% of the assembled size (Supplementary Table 5 ). We also traced the ancestral origination of assembled contigs by investigating the biased sequencing depth and similarity using published genomes and resequencing data of S. spontaneum (Ss) and S. officinarum (So), which classified 58.7% (6.1 Gb/10.4 Gb) sequences into So group, and Ss-derived sequences account for 23.1% (2.4 Gb/10.4 Gb) of the assembled genome, highly consistent with the previous estimation 19 . Our karyotype analysis revealed frequent chromosome recombination (Rec) between the two ancestral subgenomes, and the recombined sequences were further identified by detecting significantly high Hi-C contact signals between So- and Ss-derived contigs (see Methods). It shows that 1905 Mb sequences contribute to the 15 Rec chromosomes. Chromosome phasing and scaffolding were independently applied on the six pre-assigned groups (i.e., ROC-So, ROC-Ss, ROC-Rec, YZ-So, YZ-Ss, and YZ-Rec) using 59,906,451 (77.66%) lib valid paired-end Hi-C reads (Supplementary Table 3 and Supplementary Fig. 2 ), resulting in 87% (9.1/10.4 Gb) sequences anchored onto 114 pseudo-chromosomes, which represents a significant increase compared to the incompletely assembled genomes of previously published sugarcane cultivars 16 , 17 , 18 (Supplementary Tables 5 – 7 ).

The quality of genome assembly was accessed using a series of approaches. Assessment using 1614 benchmarking universal single-copy orthologs (BUSCOs) showed that 99.7% of genes were completely recalled with 98.5% duplication (Supplementary Table 8 ). Alignment of 3.2 billion Illumina clean reads against the genome identified that all the genomic regions could be covered by at least five reads with a mapping rate of 99.94% (Supplementary Table 9 ). The synteny analysis revealed that most genes were ordered consistently with Sorghum and other published sugarcane genome annotation 10 , 11 , 16 , 17 , 18 (Fig. 1c and Supplementary Fig. 3 ). The chromatin contact heatmap showed that the genomic sequences were well-organized along with the diagonals (Supplementary Figs. 2 , 4 ). The LTR assembly index (LAI) calculation indicates that the genome assembly has a value of 12.27, qualifying it as a reference genome 20 .

Annotation and genomic features

We annotated a total of 6.9 Gb repetitive sequences, accounting for 66.54% of the assembled ZZ1 genome, among which 36.45% repetitive sequences are annotated in So, 10.84% from Ss and 11.15% from Rec chromosomes (Supplementary Table 10 ). As the majority type of repetitive sequences, the long terminal repeat (LTR) retrotransposons account for 45.63% of the assembled genome (Supplementary Table 10 and Supplementary Fig. 5 ), which is similar to the ratio in the published sugarcane and sorghum genomes, ~41% in S. spontaneum (Np-X) 11 and ~54% Sorghum 21 . The Kimura divergence calculation indicated that Ty1/Copia elements dominated recent transposon expansion events despite their lower frequency (15.65%) compared to Ty3/Gypsy elements (27.42%) throughout the genome (Supplementary Fig. 5 ).

We sequenced RNA samples from the ZZ1 tissues to annotate protein-coding genes. A total of 370,103 protein-coding genes, spanning a combined length of 1235.86 Mb, were annotated by protein-homology-predicted and RNA-seq-aligned methods, of which 92.14% (341,040) could be successfully validated by RNA-seq reads. Out of all the annotated genes, 30.03% (111,167) consisted solely of a single exon, while 14.26% (52,797) and 15.75% (58,308) genes had 5′UTR and 3′UTR annotations. The comprehensive approach to identifying allele genes within this ultra-complex genome relied on the monoploid genome-annotated gene sets from two sugarcane foundation species as reference (see Methods). After two rounds of protein alignment, we identified a total of 68,509 well-annotated genes with defined alleles into three subgenome groups (36,439 in the So subgenome, 27,760 in the Ss subgenome, and 4310 in the Rec chromosomes) (Supplementary Data 1 ). This annotation contained 26,650 (38.9%) genes with a single allele, 13,493 (19.7%) with two alleles, 9431 (13.8%) with three alleles, and 10.3, 7.6, 5.2, 2.8, 1.2, 0.4, and 0.1% with four to ten alleles, respectively. In addition, we identified 42 genes with 11 alleles and 20 genes with 12 allele counts. We further analyzed the homology of genes between two major subgenomes (So and Ss) in the ZZ1 genome. The statistical results showed that 23,670 genes were identified as homoeologous gene pairs of So-Ss subgenomes, while 12,769 and 3580 were detected as specific genes in the So and Ss, respectively (Supplementary Table 11 ).

The pedigree identification and analysis

Using the species-specific k -mers extracted from the So and Ss genomes, we identified 5.4 Gb (51.9%) genomic sequences from 68 SO-originated chromosomes (ZZSO) and 1.8 Gb (17.3%) from 31 SS-originated chromosomes (ZZSS). We also detected that 1.9 Gb (18.3%) sequences in 15 chromosomes probably originated from the interspecific recombination (Rec) between the two species (Fig. 2a,b ), which coincides with the results of the cytological study 19 . The differentiation in the pedigree of ZZ1 allows us to compare the transposable element (TE) content between ZZSO, ZZSS, and Rec region. We observed that, on average, 71.03% of the genome in ZZSO is comprised of TEs, which is significantly higher than the TE content in ZZSS (62.61%) and Rec (61.07%) (Supplementary Fig. 6 ). These findings indicate a distinct genome structure among the subgenomes, highlighting the impact of pedigree divergence on TE distribution.

A The pedigree of S. officinarum (SO, colored in orange) and S. spontaneum (SS, colored in green) are distributed in all the allelic chromosomes of the ZZ1 genome. The ambiguous segments are colored with blue boxes. The size of each allelic chromosome was indicated by the scale at the bottom. B The histogram displayed the ratio of the pedigree of S. officinarum and S. spontaneum in each homologous group. The x -axis represents the homologous groups from Chr01 to Chr10, and the y -axis represents the ratio of the pedigree. C The divergence of the pedigree of S. officinarum and S. spontaneum in ZZ1, SO: S. officinarum ; SS: S. spontaneum ; ZZSO: pedigree of S. officinarum ; ZZSS: pedigree of S. spontaneum ; x -axis represents the synonymous substitution rate and y -axis represent the tested gene pairs from the combination of comparison. The centerline in each box represents the median; the lower and upper hinges represent the 25th and 75th percentiles, respectively; and the whiskers represent 1.5× the interquartile range. The number of gene pairs from left to right of the x -axis in order is as follows: n = 191,651, 379,051, 165,435, and 69,131. p values were calculated by two-sided Wilcoxon test, *** p < 0.001, ** p < 0.01, * p < 0.05. D Total allelic expression of the chromosomes of ZZ1 in leaf and stem tested in seedling, pre-mature, and mature stages, respectively. The A–E represents the homologous chromosomes in each group, with red indicating pedigree from YZ and the black color from ROC. The pedigree of the homologous chromosomes from S. officinarum and S. spontaneum , or the recombination between them, were indicated with SO, SS, and Rec, respectively. ZBL: leaf in the pre-mature stage; ZBS: stem in the pre-mature stage; ZCL: leaf in the mature stage; ZCS: stem in the mature stage; ZFL: leaf in the tillering stage; ZFS: stem in the tillering stage; ZYL: leaf in the seedling stage; ZYS: stem in the seeding stage. Source data are provided as a Source Data file.

The 68 SO-originated chromosomes contained different numbers of haplotypes in each homologous chromosome group, ranging from five haplotypes in Chr02/04/10 to eight in Chr06 (Fig. 2b and Supplementary Table 6 ). The SS-originated chromosomes had an average number of 3.1 haplotypes in the ten homologous chromosome groups, with the Chr02 group containing the most SS-originated haplotypes (i.e., six) (Fig. 2b and Supplementary Table 6 ). In addition, the pedigree of the S. spontaneum was not distributed on Chr05 and Chr08, which led to the basic chromosome number changes from 10 to 8 in S. spontaneum 10 . It indicates that the S. spontaneum with x = 8 was used in ZZ1 sugarcane breeding rather than the accession with x = 10 11 .

To investigate the divergence between the pedigrees of S. officinarum and S. spontaneum in the ZZ1 genome, we calculated the synonymous substitution ( Ks ) of the ortholog pairs of SO-ZZSO, SS-ZZSS, ZZSS-ZZSO, and SO-SS (Fig. 2c ). The Ks analysis indicated that the ZZSO and ZZSS had the lowest divergence (Medium Ks = 0.020). In contrast, the SO-SS had the highest divergence (Medium Ks = 0.037) compared with other ortholog pairs, suggesting that recombining these two kinds of pedigrees that occurred in ZZ1 might alleviate the sequence divergence between them. The Ks results also exhibited that the divergence of SS-ZZSS (Medium Ks = 0.020) was much higher than that of SO-ZZSO (Medium Ks = 0.030), indicating the multiple generations of backcross with S. officinarum in the process of modern hybrid breeding might reduce the divergence between S. officinarum and its pedigrees of ZZ1. In addition, we have conducted a thorough analysis of the Ks values between the two subgenomes of ZZ1 and Miscanthus. The calculated Ks values for ZZSO-Miscanthus and ZZSS-Miscanthus are 0.0763 and 0.0764, respectively. These values indicate a similar level of divergence between the subgenomes, which aligns with the comparable divergence times observed (Supplementary Fig. 7 ).

Homoeolog expression dominance in ZZ1

We collected RNA-seq data of leaf and stem in seedling, pre-mature, and mature stages to compare the expression pattern of the homoeologous chromosomes in the ZZ1 genome. Most of the chromosomes showed similar average expression levels between the SO and SS subgenomes, including Chr02, Chr03, Chr04, Chr06, and Chr07. However, the asymmetric expression patterns were detected within some homoeologous chromosome groups (Fig. 2d ). For instance, it showed that the three homoeologous chromosomes derived from recombining the two foundational ancestral species were highly expressed compared with other chromosomes in the Chr01 group. The Chr09 group exhibits the dominant expression in the homoeologous chromosomes derived from SS (Fig. 2d ).

In addition, our results identified that a small proportion of homoeolog genes showed biased expression towards either S. officinarum pedigree (7.1% on average) or S. spontaneum pedigree (3.0% on average) in the RNA-seq samples examined (Fig. 3 ). This result indicates no significant global genome dominance between the two foundational species pedigree in ZZ1, consistent with the reported polyploid species, including Gossypium hirsutum 22 , Triticum aestivum 23 , Brassica juncea 24 , and Brassica napus 25 . GO enrichment analysis showed that those SO-dominant genes were mainly enriched in the photosynthesis-related biological processes, including “photosynthesis” and “response to cadmium ion.” However, the SS-dominant genes were primarily enriched in basic biological functions, including peptide metabolic and biosynthetic processes (Fig. 3d, g ).

A Histograms of genome-wide expression of homoeologous genes between the pedigree of S. officinarum and S. spontaneum among ZZ1 tissues and different developmental stages. ZBL: leaf in the pre-mature stage; ZBS: stem in the pre-mature stage; ZCL: leaf in the mature stage; ZCS: stem in the mature stage; ZFL: leaf in the tillering stage; ZFS: stem in the tillering stage; ZYL: leaf in the seedling stage; ZYS: stem in the seeding stage. B Histograms of the distribution of dominant expressed genes in each homoeologous group. The Venn diagram of the dominant expressed genes with SO over SS ( C ) and SS over SO ( F ) in leaf and stem tissues in different development stages. The GO enrichment for the dominant expressed genes with SO over SS ( D ) and SS over SO ( G ) in all tested tissues. The KEGG enrichment for the dominant expressed genes with SO over SS ( E ) and SS over SO ( H ) in all tested tissues. The color bar represents the scale of the corrected p value of enrichment terms. All the significance in GO and KEGG enrichment was tested by two-tailed Fisher’s exact test method. Source data are provided as a Source Data file.

The expansion of crucial gene families related to sugar transport and resistance genes

Sugar accumulation and disease resistance are the most vital agronomic traits constantly being studied when breeding modern sugarcane cultivars. To characterize the genetic basis of these traits in the ZZ1, we analyzed the core gene families related to sugar accumulation and disease resistance (Fig. 4 ).

The expression of gene alleles of pGlcT ( a ) and NBS ( b ) family putative derived from SO, SS, and Rec in leaf and stem of different development stages in ZZ1 is shown by the violin plot; the x -axis represented the three derived lineages and y -axis represented the FPKM value. The legend indicates that the leaf and stem are from different stages of ZZ1 development. ZBL: pre-mature leaf; ZBS: pre-mature stem; ZCL: mature leaf; ZCS: mature stem; ZFL: the leaf in the tillering stage; ZFS: the stem in the tillering stage; ZYL: seedling leaf; ZYS: seedling stem. Source data are provided as a Source Data file.

For the sugar transporter family, a total of 130 genes, including 1166 alleles, likely belong to the members of the sugar transporter superfamily consisting of polyol/monosaccharide transporters (PLT), vacuolar glucose transporter (VGT), early-responsive to dehydration protein (SFP), tonoplast monosaccharide transporters (TMT), sugar transporter protein (STP), plastidic glucose transporter (pGlcT), inositol transporter (INT), sucrose transporter (SUT), and the sugars will eventually be exported transporters (SWEET) subfamily (Table 2 ). Tracing of the allelic genes showed that 47.4% (SUT) to 68.2% (VGT) of alleles of these families were derived from SO, while only 7.0% (SUT) to 30.0% (INT) were from SS. In addition, we identified a substantial number of allelic genes from the reconstruction (Rec) of the two subgenomes, ranging from 13.6% of VGT to 45.6% of SUT. The expression profile of these alleles showed that the alleles derived from SO were significantly more abundant than those derived from SS across the different stages of leaf reconstruction (Fig. 4a ), indicating that the genetic basis for sugar transport in ZZ1 is significantly contributed by SO. In addition, ZZ1 has more members in the PLT, TMT, and SWEET families than SS, SO, and their relative species, suggesting that gene expansion occurred in these three families during the breeding of ZZ1.

The nucleotide-binding site (NBS) is a vital family of plant transcription factors that regulate plant disease resistance. We identified a total of 571 putative gene members of the NBS family, which comprised 5181 alleles with 51.63, 24.45, and 23.91% of the alleles derived from SO, SS, and their Rec regions, respectively (Table 2 ). Although the number of NBS gene alleles derived from SS was less than that from SO, they showed higher expression than that of the alleles derived from SO (Fig. 4b ), indicating that the lineage of SS might contribute to enhanced disease resistance through the crossing. We speculated that these genes might be involved in sugarcane resistance to the pathogen and a series of defense responses after infection.

The lineages dominate expression to participate in smut and PBD

Smut is caused by the Sporisorium scitamineum (Ssc) that enters the sugarcane plant through lateral buds to colonize the apical meristem tissue. Based on gene collinearity analysis, a homologous region for smut resistance was identified on the YZ-Rec-Chr06A chromosome of ZZ1 (see Methods), spanning approximately 33.35 Mb and containing 1902 genes. Collinearity plots, region lengths, and the number of genes indicated that ZZ1 had a significant expansion compared to modern sugarcane cultivar R570 18 . Some genes exhibit a 1:2 collinearity pattern compared to R570 (Fig. 5a ). The gene functional annotation reveals that in the ZZ1 genome, there is a significant presence of genes, such as flagellin sensing 2 ( FLS2 ), wall-associated kinase ( WAK ), associated with plant resistance to pathogen infection, as well as genes related to cell wall formation, such as glucuronic acid xylan and hemicellulose synthesis. In this context, the WAK family in ZZ1 has increased tenfold compared to R570. The copy number of FLS2 in ZZ1 also has increased eight-fold (Fig. 5b ). The substantial amplification of these functional genes may contribute to ZZ1’s high resistance to smut, providing a genetic foundation for such resistance. The analysis of expression patterns revealed that WAK family genes exhibited an upregulated response to Ssc infection and a sustained upregulation in the growth processes of sugarcane buds, roots and leaves (Fig. 5c ), indicating that WAK family genes play a crucial role in resisting pathogen infections and regulating cell proliferation and other processes.

a The QTL linking to sugarcane smut was analyzed using R/QTL v1.39 software. The red dotted line indicates the filter threshold (LOD = 7.119, p < 0.05), and the QTL linking to sugarcane smut was obtained at the R570 Chr06 chromosome. Using the R570 QTL region (7.74 Mb) for smut resistance as a reference, QTL regions for smut resistance were identified in ZZ1, with gene collinearity shown within the QTL region connected by gray curved lines. Depicted 1:2 collinearity pattern using bar graphs. The x -axis represents of ZZ1 blocks per R570 gene, while the y -axis represents the percentage of the genome. b Syntenic distribution of resistance-related gene families within the homologous region. Synteny is represented by gray curved lines, the WAK gene family is indicated in deep brown, and the FLS2 gene family is represented in yellow. c Expression patterns of the WAK gene family. Depicted using bar graphs. The x -axis represents different treatment times for root and bud, while the y -axis represents FPKM values. Data are presented as mean values ± SE, n = 3. Error bars represent the standard error. Source data are provided as a Source Data file.

PBD is one of the most severe diseases caused by Fusarium fujikuroi species Complex 26 , which resulted in catastrophic damage to the sugarcane industry. To elucidate the underlying mechanism of divergence in response to PBD among the two lineages ( S. officinarum and S. spontaneum ) in ZZ1, we implemented RNA-seq to survey the global transcriptomic responses to PBD in the leaf and stem tissues of ZZ1. There were 2794 and 2901 homoeologous genes from SO that showed lineage dominant expression (LDE) relative to those from SS in leaf and stem, which was greater than the number of dominantly expressed genes contributed by SS (1240 and 1234 LDE identified from leaf and stem, respectively) (Fig. 6a ), suggesting the divergent response mechanism between these two lineages in ZZ1. The further GO enrichment for these genes showed that the LDE genes derived from SO lineages were mainly enriched in the photosynthesis-related process (Fig. 6b ), including “photosynthetic electron transport chain,” “response to cadmium ion,” and “plastid organization.” In contrast, the ones derived from SS lineages are principally likely to participate in the process of responding to stimuli, such as “ribosomal large subunit biogenesis,” “peptide metabolic process,” “ribosome assembly,” and “tetrapyrrole metabolic process” (Fig. 6c ), indicating the complex combinatorial regulation of both two lineages in response to PBD.

a Barplot of a total number of dominant expressed genes derived from SO or SS in response to PBD in the leaf and stem of ZZ1. b , c The heatmap of GO enrichment terms for lineage dominant expressed (LDE) genes derived from SO and SS in leaf ( b ) and stem ( c ), respectively. The color bar represents the scale of the corrected p value of enrichment terms. Significance was tested by the two-tailed Fisher’s exact test method. d , e Heatmap of expression changes in the SO and SS lineages between the normal growth (control) and after infection of PBD in the leaf ( d ) and stem ( e ) of ZZ1. The log10 infection/control fold-change values were normalized and scaled by z -score in each gene pair. The gene cluster size and representative enriched functional classification terms are noted. Source data are provided as a Source Data file.

To unveil how the two lineages respond to PBD in ZZ1, we conducted a heatmap and clustering analysis for the LDE genes among the leaf and stem to reveal the regulation patterns in the two lineages (Fig. 6d,e ). As a result, the number of LDE genes that were upregulated in SO (cluster II: 2336 genes) was greater than that in SS (cluster IV: 265 genes) in the leaves, contrary to the number of LDE genes that were downregulated in SO (cluster I: 456 genes) less than SS (cluster III: 977 genes). However, the numbers of up-and down-regulated LDE genes identified in SO [2089 up- (cluster II) and 818 down-regulated (cluster I) genes] are all more than that in SS [828 up- (cluster IV) and 400 down-regulated (cluster III) genes] in stem tissues, indicating the divergent regulation patterns in response to PBD. GO enrichment showed that the downregulated LDE genes of SO were differently enriched from that of SS in the leaf, with the former mainly enriched in photosynthesis-related terms, including “photosynthesis, light reaction”, and “regulation of photosynthesis.” In contrast, the latter is enriched in amide metabolic-related processes, such as the “cellular amide metabolic process,” “amide biosynthetic process,” and “peptide metabolic process.” The downregulated LDE genes from these two lineages are enriched in the same terms (including “cellular amide metabolic process,” “amide biosynthetic process,” and “peptide metabolic process”) in the stem tissue, suggesting the divergent regulation patterns to response to PBD occurred in these two lineages in the two tissue. Furthermore, the upregulated LDE genes from SS in the leaf and stem are mainly enriched in the vital life process, including “mRNA metabolic process,” “nuclear-transcribed mRNA,” “protein-containing complex organization,” and “ribonucleoprotein complex assembly.” The upregulated LDE genes from SO were enriched in response to the stress-related process, including “response to metal ion,” “response to oxidative stress,” “response to water deprivation,” and “response to cadmium ion,” suggesting that the LDE genes from these two lineages might play a cooperative process to respond to PBD in the leaf and stem tissues of ZZ1.

Despite significant advancements in the haplotype phased genome of two ancestral Saccharum species 10 , 11 , which has dramatically advanced the field of sugarcane genomics, they are still unrepresentative of the genomic information of the modern cultivated sugarcane, which possesses many advantageous traits, such as the combination of high sugar content, super abiotic stress resistance, and other exceptional traits. The resolution of these traits is reliant on the complete deciphering of high-quality modern cultivated sugarcane genomes. However, the modern cultivated sugarcane genome is one of the most intricate and challenging worldwide. Over the past two decades, sugarcane genome research pioneers have expended tremendous efforts on the genome of sugarcane cultivars, yet progress has been limited 16 , 17 , 18 . Unlike allopolyploids such as B. napus 25 , wheat 27 , and cotton 28 , modern cultivated sugarcane originated from crosses between auto-polyploid parents, followed by multiple rounds of backcrossing 2 , 3 . Over a long history of breeding, the bloodlines of several parents within the Saccharum genus have been mixed, resulting in a homo(eo)aneuploid commercially used in asexual reproduction. Approximately 10 to 20% of the chromosomes in its genome originated from recombination between the parents, and it possesses 8–14 homo(eo)logous copies of most genes. Therefore, the challenges in assembling the genome of modern cultivated sugarcane include (1) distinguishing between homozygous and heterozygous contigs and achieving chromosome-level scaffolding and (2) assembling and scaffolding chromosomes involving recombination events between the original parental lineages.

To overcome the characteristics of polyploidy and extensive recombination among ancestral subgenomes in the modern sugarcane hybrid genomes, we herein proposed an innovative ‘dimension reduction’ assembly strategy, supplemented by a variety of sequencing technologies, to decipher the modern sugarcane hybrid “ZZ1” genome completely (Fig. 1b ). The quality of this ultra-complex genome, which benefited from innovations in assembly strategies and technological advances, is far superior to the previously published contig-levels SP80-3280 16 , draft chromosome-scale KK3 17 , and mosaic R570 18 monoploid genome, which not only provides a unique perspective on the origin and evolution of modern sugarcane hybrid but also helps to analyze the molecular mechanism of excellent traits, which is of great significance for future precision breeding of sugarcane.

Distant crosses can introduce many superior genes at once, which can significantly impact the creation of new species and improve existing varieties. To create superior sugarcane germplasm, the interspecific crosse of S. spontaneum (with high abiotic stress tolerance) and S. officinarum (with high sugar content) and one or more subsequent backcrosses of S. officinarum have produced many of the modern sugarcane hybrids, namely ROC25 and YZ89-7. Further, the crosses of this modern sugarcane hybrid produced some of the second-generation modern sugarcane cultivars, such as ZZ1. This ingenious hybrid breeding strategy gives modern sugarcane hybrid the advantage of being derived from both ancestral parents but also makes its genome more complex due to the uneven genetic contribution of both ancestral parents from the hybridization process, non-directional loss of chromosomes during the transmission, and the new generation of Rec chromosomes from both parents. The high sequence similarity between large numbers of haplotypes in the same homologous group introduces some misjoin errors in the contig assembly process, and Rec contigs on Rec chromosomes will generate Hi-C signals with contigs of both ancestral origins, which causes great difficulties in scaffolding. Fortunately, direct biparental and original ancestral parental material availability can help to distinguish these massive disordered contigs into different groups by sequence similarity, read depth bias, and biparental ancestral unique k -mers strategies (see Methods). Benefiting from direct biparental and ancestral genomes, 87% of contigs were fruitfully divided into six pre-assigned groups, anchoring to 68 So chromosomes, 31 Ss, and 15 Rec chromosomes, which is consistent with karyotype analysis on the number of ZZ1 chromosomes (Fig. 1a ). This multiple-grouping strategy has important implications for the resolution of other hyper-complex genomes. However, it does not exclude the fact that the availability of parents limits some hyper-complex genomic species.

The high-quality genome of ZZ1 demonstrates its chromosome origin and recombination in high resolution, expanding our knowledge of its chromosomes from the quantitative level to the specific sequence level. Previous studies have indicated the presence of three chromosome bases, namely 8, 9, and 10, in Ss 11 . However, it has remained uncertain that bases were used as parents in early sugarcane hybrid cultivars. Our genomic evidence now demonstrates that the ancestral parental chromosome base contributing to Ss consanguinity in ZZ1 is chromosome base 8, excluding bases 9 and 10. This conclusion is drawn from the observation that the Ss pedigree of the ZZ1 genome lacks intact ancestral chromosomes from Ss, specifically Chr05 and Chr08 homologous groups (Fig. 2a ). This finding may also play a significant role in explaining the formation of aneuploids in hybrid sugarcane during the breeding process. The high-quality ZZ1 genome, in isolation from the FISH technique, more precisely determined the proportional genetic contribution and sequence composition of the two original ancestral parents to the modern sugarcane hybrid, which is vital for us to explore the genetic basis of superior traits in modern sugarcane hybrid.

The gene family results reconfirmed the biased contribution of Ss and So to the high abiotic stress resistance and high sugar content traits in modern sugarcane hybrid. We highlighted many genes that may play important roles, such as NBS, PLT, TMT, and the SWEET family. The total alleles of gene sequences provide the fundamental basis for studying gene-dominant expression under stress conditions. Furthermore, gene expression biases in subgenomes have been extensively observed in allopolyploids, such as B. juncea 24 , wheat 27 , cotton 28 , and Nepenthes gracilis 29 . These biases play a role in shaping the evolution of novel genes and enhancing the development of biologically vital traits. However, our findings indicated that in ZZ1, the hybrid sugarcane, no significant global genome dominance was observed between the two foundational species. This lack of dominance could be attributed, at least in part, to the fact that the sugarcane hybrid is a recently formed homo(eo)aneuploid accompanied by recombination exchange between the two foundational lineages. It is important to note that the two foundational lineages in ZZ1 exhibit significantly different numbers of alleles at the gene level. Despite this, they contribute similar total expression abundance. This intriguing phenomenon may involve epigenomic modifications and siRNA regulation, which are implicated in dosage balancing. A comprehensive analysis combining transcriptomic, epigenomic, and three-dimensional genomic tools would be beneficial to explore this exciting phenomenon further.

ZZ1, derived from ROC25 and YZ89-7, is highly resistant to smut and susceptible to PBD. Compared to the smut-susceptible modern sugarcane cultivar, ZZ1 had better physical resistance to smut, including lower stomatal density and aperture on the outermost bud scale and lower alkanol contents to reduce the germination rates of Ssc teliospores on buds. In addition, the region of smut resistance has expanded significantly, and gene expansion occurred in WAK and FLS2 families during the breeding of ZZ1. These findings provide a genetic foundation for such resistance.

Sample preparation and genome sequencing

The Zhong Zhe No. 1 (ZZ1) plant was cultivated in the greenhouse at Guangxi University. The young leaves from the same individual were collected for DNA extraction and genome sequencing.

Genomic DNA was extracted using a QIAGEN DNeasy Plant Mini kit (Qiagen, Hilden, Germany, catalog number 69106) and subject to library construction with an insert size of 300–500 bp. DNA quality was visually assessed using agarose gel electrophoresis (0.75%), and concentration was estimated using a spectrophotometer (Multiskan Sky Microplate 1510-01307 C, Thermo Fisher Scientific, MA, USA). DNA libraries were sequenced on the Illumina NovaSeq platform with the model of paired-end (PE) 150 bp.

To construct the high-quality SMRTbell libraries (30–50 kb), we followed the manufacturer’s protocol ( https://support.10xgenomics.com/de-novo-assembly/library-prepr/doc/user-guide-chromium-genome-reagent-kit-v1-chemistry ) to isolate high-molecular-weight DNA (>50 kb), which were subsequently subject to size selection by BluePippin system. A total of 224.36 Gb HiFi reads were generated on a PacBio Sequel II platform.

The young leaves from ZZ1 were collected for Hi-C library construction and sequencing. Briefly, the young leaves were fixed with formaldehyde lysed, and then the cross-linked DNA was digested using Hind III over 48 h. The sticky ends attached by biotins are proximity-ligated to form chimeric joined DNA that was physically sheared into a size of 500–700 bp and further sequenced on an Illumina NovaSeq platform.

Spore suspension of Ssc (the causative agent of smut) (1 × 10 6 spores/mL) was inoculated at the roots and buds of ZZ9 (the same parents as ZZ1, derived from ROC25 and YZ89-7, is highly resistant to smut). Inoculated plants were placed in a constant temperature incubator at 28 °C in a medium moisturizing culture. Smut group samples were collected at 0 d, 1 d, 2 d, 3 d, and 4 d after inoculation for each of the above treatments with three duplicates. Similarly, spore suspension of Ssc was inoculated at the leaves of ZZ1. The samples were collected at 5 d and 20 d after inoculation (were collected every 5 d), with water as a control treatment. In addition, the different tissues and different developmental stages were collected from ZZ1, including “leaf in pre-mature stage (ZBL), stem in pre-mature stage (ZBS), leaf in mature stage (ZCL), stem in mature stage (ZCS), leaf in tillering stage (ZFL), stem in tillering stage (ZFS), leaf in seedling stage (ZYL), and stem in seeding stage (ZYS).” Total RNA was extracted from the above samples using RNAprep Pure plant Kit (Tiangen Biotech, Beijing, China, catalog numbers DP432) and subsequently used for cDNA library construction. The quality of the cDNA library was assessed on the Agilent Bioanalyzer 2100 system and sequenced on an Illumina Novaseq platform. The original data was quality controlled by Cutadapt, and the quality control data was compared to the ZZ1 genome using HISAT2 v2.1.0 software 30 . Using Cufflinks software, the expression levels of transcripts and genes were quantified through the position information of Mapped Reads on the genome. FPKM (fragments per kilobase of exon per million fragments mapped) was used as an index to measure the expression level of transcripts.

Contig assembly

We first used Hifiasm 31 to assemble PacBio HiFi reads with default parameters. The resulting contigs contain a total of 11.2 Gb sequences, which is much larger than the estimated genome size. It could be caused by assembly errors and contaminated sequences, introducing artifactual sequences that are not supposed to be present in the assembly. To provide a high-quality genome assembly, we identified the misjoined contigs based on abnormal Hi-C signals. The Hi-C reads aligned against the genome assembly using BWA-MEM algorithm 32 with ‘-SP5M’ parameters, allowing split alignments suitable for Hi-C reads mapping. We further constructed the chromatin contact map within contigs and identified chimeric errors if they show substantial differences between two adjacent bins in the Hi-C signal matrix, which follows the same correction algorithm in 3D-DNA 33 with re-compilation using Python for acceleration ( https://github.com/tangerzhang/ALLHiC/blob/master/bin/ALLHiC_pip.sh ). The artifactual sequences were detected and removed if they contained a large proportion (>40%) of k -mers present only in assembly but absent in the sequencing reads implemented in the Merqury program 34 . To identify the contaminated sequences derived from organelles and bacteria, the plant chloroplast and mitochondria genomes, along with bacterial genomes downloaded from NCBI (accessed on Oct. 2021), were aligned against the ZZ1 genome assembly using BLASTN program with an e -value of 10 −5 . Contigs that have more than >40% genomic regions overlapped with contaminated sequences were removed from the assemblies. The process of quality control resulted in a total of 10.4 Gb high-quality genome sequences in our assembly.

Chromosomal-level genome assembly

The basic idea to accomplish this ultra-complex genome is to reduce the scaffolding complexity through the pre-definition of homologous groups, followed by separating different haplotypes (Fig. 1 ). We first assigned these assembled contigs into two groups (ROC and YZ) representing sequences from two parental genomes. This step can be achieved by two strategies based on read depth and parental-specific k -mers. For the read depth-based strategy, we aligned the parental short reads (ROC25 and YZ89-7) against these assembled contigs using BWA-MEM algorithm 12 with default parameters, respectively. The normalized read depth across all the contigs was calculated using our previously developed CNV caller ( https://github.com/sc-zhang/popCNV ). Contigs with a 1.5× fold-change of depth biased toward ROC25 were assigned to the ROC-derived group and vice versa. In addition, the parental-specific k -mers were identified by comparing 21-mers between ROC25 and YZ89-7 genomic sequences. The parental-specific k -mers were traced back to these assembled contigs, which were subsequently assigned to the ROC or YZ groups based on a twofold change of parental-specific 21-mers.

The ancestor-derived contigs were identified based on genome assemblies and population resequencing of the two fundamental ancestral species, S. officinarum and S. spontaneum . We randomly selected five resequenced S. officinarum and five resequenced S. spontaneum individuals from our previously published data and mapped these reads against the assembled contigs 11 . Following the aforementioned procedure, the read depth for each contig was calculated and normalized using popCNV. We assigned these contigs to the SO-derived group if they showed uniformly and significantly higher read depth in S. officinarum resequenced samples than in S. spontaneum with the cutoff p value of 0.05 at the student’s t -test and vice versa. For those unassigned contigs that did not show a significant difference in read depth between the two ancestral species, we aligned these contigs against the previously published S. spontaneum AP85–441 10 and S. officinarum LA-purple 11 genomes using minimap2 35 and determined the origination for each contig based on the similarity with the two ancestral genomes. Contigs with higher mapping quality and similarity with the S. spontaneum AP85–441 genome were re-assigned to the SS-derived group and vice versa (SO-derived).

The basic idea that Hi-C technology can be used for chromosome-scaffolding is based on the observation that intra-chromosome sequences are highly interacted in comparison with inter-chromosome contigs. It suggests that the contigs involved in recombination between SO and SS subgenomes share a high Hi-C signal density. Following this idea, we mapped previously published Hi-C reads against assemblies in S. spontaneum AP85-441 10 and S. officinarum LA-purple 11 genomes using BWA-MEM algorithm 12 with ‘-SP5M’ parameters, respectively, and separated contigs into two groups SO- and SS-derived. We calculated and normalized the Hi-C signals between the SO- and SS-derived contigs with the following formula 1 :

This analysis identified the normalized Hi-C signal density of 0.4 as the threshold that can confidently distinguish whether the contigs are from recombined or non-recombined chromosomes (Supplementary Fig. 8 ). Based on the threshold, we got 1.9 Gb sequences from different ancestors that highly interacted with 0.78 Gb from ROC and 1.12 Gb from YZ. These contigs were considered sequences from the recombined chromosomes and thereafter assigned to ROC-Rec and YZ-Rec groups.

The aforementioned steps classified the assembled contigs into six groups: ROC-SO, ROC-SS, ROC-Rec, YZ-SO, YZ-SS, and YZ-Rec. Contigs within each group were subject to the ALLHiC haplotype-phasing pipeline independently, following the detailed description to assemble the auto-tetraploid sugarcane genome in Github ( https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-scaffolding-an-auto-polyploid-sugarcane-genome ). The resulting scaffolds were manually inspected and adjusted based on two pieces of evidence, chromatin interaction heatmap that was implemented in juicebox and synteny relationship with the monoploid assemblies of two ancestral genomes revealed by our developed tool CATG (Collinearity-based Assembly correcTor GUI). The CATG program is a GUI application that manually adjusts genome assembly according to the collinearity with a reference genome. The codes and a user-friendly manual are openly accessible ( https://gitee.com/tanger-lab_enterprise/CATG ).

The quality of genome assembly was assessed using a series of approaches. Initially, the completeness of genome assembly was evaluated based on 1614 benchmarking universal single-copy orthologs collected from the Embryophyta_odb10 database 36 . We also applied a k -mer-based strategy to investigate the assembly consensus (i.e., quality value or QV) and completeness of these genomes, which was implemented in the Merqury program 34 . The copy number spectrum plots show that single-copy k -mers are dominant across these genomes with a high level of k -mer completeness (95.99). The QV in the genome assembly is 50.27, corresponding to more than 99.99% single-base accuracy. The genome consistency was assessed by aligning the Illumina short reads, which shows that almost all the genomic regions can be covered by more than 99.9% of sequencing reads. The chromosomal-scale genome assembly was assessed using a chromatin contact map and synteny with the sorghum genome 21 .

Genome annotation

Repeated sequences were annotated in collaboration with RepeatModeler ( http://www.repeatmasker.org/RepeatModeler/ ) and RepeatMasker 37 . Briefly, RepeatModeler searches for repeated sequences and generates a library using the RECON and RepeatScout algorithms 38 . Customized for the task, this library of repetitive sequences contains a variety of consensus sequences, most of which belong to the TE family. This library is subsequently utilized as the entry data for the RepeatMasker.

The protein-encoding gene annotation of the ZZ1 genome was undertaken following the pipeline illustrated by GETA . The pipeline invoked various programs such as HiSAT2 30 , augustus 39 , trimmomatic 40 , and genewise 41 to provide evidence of homologous proteins and transcripts to support the results of the gene ab initio predictions. During the first step, we covered the ZZ1 assemblies using repetitive sequences, which could efficiently diminish the background noise generated by replicated sequences. HiSAT2 is then used to align the trimmomatic quality-controlled RNA-seq data to the assembly results. For reliable intron sequences and to validate the transcript data, GETA was performed to compute a threshold for the sequencing depth of each alignment region and filter out any transcripts below this threshold for depth. Residual high-quality transcripts were utilized by the TransDecoder software as predictive data for the ORF ( https://github.com/TransDecoder/TransDecoder ). Using Augustus software, a Hidden Markov Model was trained on the results based on the exon and intron predictions. Orthologous protein sequences, including those of Arabidopsis thaliana , Oryza sativa , and Sorghum bicolor , were gathered and presented in genewise software alongside those of the two ZZ1 ancestral parental genomes, represented by LA-Purple and AP85-441. We jointly obtained high-quality gene models derived from conserved homologous sequences, expression, and Augustus supporting scores. A homemade Perl pipeline, GetaFilter, was next employed to screen the predicted genes. To summarize, we used the Pfam database and the plant UniProt database to tailor the predictions and yield highly conserved protein sequences. Moreover, we measured the FPKM of RNA-seq data acquired from multiple tissues of ZZ1 to predict. We retained genes that met at least one of the following conditions: first, the presence of a structural domain in the Pfam database or a homologous protein in the plant UniProt database; second, FPKM above three and paired coverage above 80%; and third, Augustus support above 80.

The functional annotation of the protein-coding genes was performed based on the alignment of those deduced proteins against several databases, namely Gene Ontology (GO) 42 , Kyoto Encyclopedia of Genes and Genomes (KEGG) 43 , auxiliary proteins (TrEMBL), protein sequences (Swiss-Prot) 44 and Clusters of Orthologous Groups (KOG) 45 .

To distinguish homologous genes between subgenomes and accurately identify alleles among homologous chromosomes, we developed the following pipeline, which includes the three major steps:

Identification of monoploid genomes and representative genes from the two ancestral genomes. Our previous study generated the haplotype-resolved assemblies for the two ancestral genomes, S. spontaneum 10 (Ss, 2 n = 4 x = 32) and S. officinarum 11 (So, 2 n = 8 x = 80) genomes. Using the two fully phased genome assemblies, we first generated the monoploid genomes containing only one set of haplotypes by identifying and removing redundant sequences (i.e., allelic sequences) with high similarities, which was implemented in our previously developed Khaper program 46 . This analysis resulted in an 802-Mb monoploid genome for Ss and a 1.15-Gb monoploid genome for So. We further annotated the two monoploid genomes, leading to 34,010 representative protein-coding genes in Ss and 39,355 in So.

Detection of allelic genes in the ZZ1 genome based on ancestral monoploid genomes. To separate homologous genes originating from different subgenomes, we used the monoploid genomes of the two ancestors, with a total of 73,365 protein sequences, as reference. The predicted ZZ1 proteins were BLASTed against the reference sequences, and the best match were retained using the parameters “-evalue 1e-5 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -max_target_seqs 1”. A total of 97.20% (359,751/370,103) of the ZZ1 predicted protein sequences were assigned to any of those 52,866 ancestral genes. We further identified the allelic genes based on three strategies: protein similarity, synteny, and coordinates, following a similar approach 10 . Allelic genes were partitioned into the same allelic group if they were located in the same synteny blocks identified by MCScanX 47 and shared a high level of identity (≥70%) and coverage (≥60%). In addition, we also used the coordinating approach to determine allelic genes within the same loci among different haplotypes. The coding sequences of candidate alleles were aligned onto ZZ1 genomes, limiting the number of matches to the maximum number of the ploidy on the reference genome. This process was performed by minimap2 35 with parameters “-x splice -k 12 -a -N 12”. Subsequently, genes with more than 50% overlap within the same loci were considered candidate allelic genes. A total of 236,845 high-quality allelic genes were retained in the initial allelic table, which contains 45,209 allele-defined genes, with 20,665 in the Ss subgenome and 24,544 in the So subgenome. We further refined the allelic table by assigning un-anchor genes based on the second round of protein alignment. This led to 69,680 genes with defined alleles in the final allelic table.

Identification of homoeologous genes between the two subgenomes. We first used blastn 48 to search for optimal matches between So and Ss monoploid cds sequences. The parameter is “-evalue 1e-5-best_hit_score_edge 0.05-best_hit_overhang 0.25-max_target_seqs 1”. A total of 18,525 reciprocal blast hits (RBH), orthologs between Ss and So genomes, were identified. According to the correspondence of orthologs between two ancestral genomes, 10,254 genes were annotated with homoeologous genes between two subgenomes of the “ZZ1” genome in the alleles table, which was obtained in the previous step.

Homoeolog expression dominance analysis

The clean reads from RNA-seq after qualifying control by trimmomatic (v0.38) were mapped against the homoeologous gene pairs using bowtie2 49 . The expression level of each homoeologous gene was calculated by align_and_estimate_abundance.pl of Trinity package 50 . The differentially expressed homoeologous gene pairs with greater than a twofold change were defined as dominant expression gene pairs. The dominant expression genes were relatively higher expressed in the homoeologous pairs, while the lower ones were the subordinate genes. The homoeologous gene pairs that showed non-dominance were defined as neutral genes.

Analysis of synonymous substitution rates (Ks)

Paralogous and orthologous gene pairs were identified using the MCScanX software ( http://chibba.pgml.uga.edu/mcscan2/ ) based on the syntenic blocks with the defeat parameter. The number of synonymous substitutions per synonymous site ( Ks ) of each gene pair was calculated using the Nei-Gojobori method ( https://github.com/tanghaibao/bio-pipeline/blob/master/synonymous_calculation/synonymous_calc.py ).

Identification of homologous QTL for smut-resistance

We screened the large F1 population (c. 17,000 individual clones) of parental (ROC25 × YZ89-7) and constructed a smut-resistant/susceptible subpopulation. After field evaluation for three consecutive years, a total of 401 clones were verified, and each clone was assigned a resistance value (from 1 to 8, 1 representing the most resistant and 8 representing the most susceptible). Genotyping by Sequencing (GBS) technology was employed to sequence ROC25, YZ89-7, and 236 clones from the F1 population. This approach yielded 12,975,602 SNP sites, from which 1196 bin markers were derived after screening and filtering. Utilizing R570 18 as the reference genome, these bin markers were utilized to construct a high-density genetic map of sugarcane, spanning a total length of 701.855 cM, with individual linkage groups varying from 59.855 cM to 81.681 cM. Subsequently, simplified genome sequencing was conducted on 220 individuals exhibiting diverse resistance levels in field conditions. Based on constructing the high-density genetic map for sugarcane, permutation calculation used disease grade as phenotypic data and determined the screening threshold to be 7.119. Next, QTL positioning analysis was conducted using R/QTL v1.39 software, employing Composite Interval Mapping (CIM) to depict a curve graph based on the LOD value of each site. Results indicated the initial location of the smut QTL on the Chr06 chromosome of the R570 genome. The QTL confidence interval, spanning 5 cM from both ends of the peak and declining by 1.5 LOD values (Fig. 5a ), revealed a phenotypic variance explanation of 22.74%. The confidence interval length was 2.967 cM, spanning ~7.74 Mb, and containing 512 genes within the QTL interval.

As a reference, the ZZ1 genome was used as queries. The JCVI (python version MCScan) was employed to locate the chromosomes in ZZ1, each having the maximum number of collinear genes with the QTL region for smut resistance. Subsequently, the identified chromosomes were used as a query. JCVI was utilized to locate the regions with the most densely populated collinear genes as the QTL regions for smut resistance in R570 on homologous chromosomes of ZZ1, as well as used to construct collinearity maps between ZZ1 and the smut QTL regions of R570, and eggNOG-mapper was employed to functionally annotate genes within the QTL regions for smut resistance in all two modern sugarcane cultivar varieties.

Gene family identification

The Hidden Markov Model (HMM) associated with conserved structural domains of each tested gene family was found in the Pfam database. All the gene family members were searched in ZZ1, S. spontaneum AP85-44 10 , S. officinarum LA-Purple 11 , Sorghum 21 , Rice 51 , and maize 52 genome, respectively, using HMMER software with E -value <1E-5 base on the HMM model. All the identified gene family members were further blasted into the NCBI database for manual checking.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The raw sequence reads, genome assembly, and annotation data generated in this study have been deposited in the China National Center for Bioinformation Genome Warehouse under accession code GWHEQVP00000000 [ https://ngdc.cncb.ac.cn/gwh/Assembly/83532/show ]. The genome data of S. officinarum LA-Purple are available at NCBI under Bioproject accession PRJNA744175 . The genome data of KK3 were downloaded from NCBI under accession JALQSO000000000 . The genome data of SP80-3280 were downloaded from NCBI under accession GCA_009173535.1 . The genome data of R570 were downloaded from the sugarcane genome hub [ https://sugarcane-genome.cirad.fr/content/download ]. The genome data of AP85–441 were downloaded from NCBI under accession QVOL00000000 [ https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_003544955.1/ ]. The genome data of Np-X were available at the Sequence Read Archive under BioProject accession PRJNA721787 . The genome data of rice are available at the National Genomics Data Center under PRJCA005549 . The genome data of sorghum were downloaded from NCBI under accession ABXC00000000.3 . The genome data of maize were downloaded from NCBI under accession LPUQ00000000 . RNA-Seq raw data have been deposited in NCBI under accession PRJNA1083323 . The public databases used in this study include UniProt database ( http://www.uniprot.org/ ), GO database ( http://geneontology.org/ ), KEGG database ( https://www.kegg.jp ), Swiss-Prot database ( http://www.uniprot.org/downloads ), TrEMBL ( http://www.expasy.org/sprot ), KOG database ( ftp://ftp.ncbi.nih.gov/pub/COG/KOG/kyva ), and Pfam database ( http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release ). Source data are provided in this paper. Source data are provided with this paper.

Code availability

The Khaper algorithm is available at GitHub [ https://github.com/lardo/khaper ]. ALLHiC can also be found at GitHub [ https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-scaffolding-an-auto-polyploid-sugarcane-genome ]. Codes were also archived on Zenodo [ https://doi.org/10.5281/zenodo.4780792 ].

Paterson, A. H., Moore, P. H. & Tew, T. L. The gene pool of Saccharum species and their improvement. Genomics Saccharinae 11 , 43–71 (2013).

Article Google Scholar

Daniels, J. & Roach, B. in Sugarcane Improvement Through Breeding , Vol. 11 (ed. Heinz, D. J.) 7–84 (Elsevier,1987).

D’Hont, A. & Glaszmann, J. C. Sugarcane genome analysis with molecular markers: a first decade of research. Int. Soc. Sugar Cane Technol. Proc. XXIV Congr. 2 , 556–559 (2001).

Google Scholar

D’Hont, A. et al. Characterization of the double genome structure of modern sugarcane cultivars ( Saccharum spp.) by molecular cytogenetics. Mol. Gen. Genet. 250 , 405–413 (1996).

Article PubMed Google Scholar

Sauer, J. D & Simmonds, N. Evolution of Crop Plants (Longman Group Ltd, 1976).

Sreenivasan, T., Ahloowalia, B. & Heinz, D. Sugarcane callus tissue. Crop Sci. 17 , 814–816 (1987).

Piperidis, G. & D’hont, A. Chromosome composition analysis of various Saccharum interspecific hybrids by genomic in situ hybridization (GISH). Int. Soc. Sugar Cane Technol. Proc. XXIV Congr. 2 , 565–566 (2001).

Cuadrado, A. et al. Genome remodeling in three modern S. officinarum × S. spontaneum sugarcane cultivars. J. Exp. Bot. 55 , 847–854 (2004).

Article CAS PubMed Google Scholar

D’Hont, A. Unraveling the genome structure of polyploids using FISH and GISH; examples of sugarcane and banana. Cytogenet. Genome Res. 109 , 27–33 (2005).

Zhang, J. et al. Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nat. Genet. 50 , 1565–1573 (2018).

Zhang, Q. et al. Genomic insights into the recent chromosome reduction of autopolyploid sugarcane Saccharum spontaneum . Nat. Genet. 54 , 885–896 (2022).

Zhang, X. et al. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat. Plants 5 , 833–845 (2019).

Rossi, M. et al. Genomic distribution and characterization of EST-derived resistance gene analogs (RGAs) in sugarcane. Mol. Genet. Genomics 269 , 406–419 (2003).

Aitken, K. et al. QTL identified for yield components in a cross between a sugarcane ( Saccharum spp.) cultivar Q165A and an S. officinarum clone IJ76-514. In Proceedings for the 4th International Crop Science Congress , Poster 3 (Brisbane, Australia, 2004).

Aitken, K. S. et al. A comprehensive genetic map of sugarcane that provides enhanced map coverage and integrates high-throughput Diversity Array Technology (DArT) markers. BMC Genomics 15 , 1–12 (2014).

Souza, G. M. et al. Assembly of the 373k gene space of the polyploid sugarcane genome reveals reservoirs of functional diversity in the world’s leading biomass crop. Gigascience 8 , giz129 (2019).

Article PubMed PubMed Central Google Scholar

Shearman, J. R. et al. A draft chromosome-scale genome assembly of a commercial sugarcane. Sci. Rep. 12 , 20474 (2022).

Article ADS CAS PubMed PubMed Central Google Scholar

Garsmeur, O. et al. A mosaic monoploid reference sequence for the highly complex genome of sugarcane. Nat. Commun. 9 , 2638 (2018).

Article ADS PubMed PubMed Central Google Scholar

Huang, Y. et al. Species-specific abundant retrotransposons elucidate the genomic composition of modern sugarcane cultivars. Chromosoma 129 , 45–55 (2020).

Ou, S. et al. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 46 , e126 (2018).

PubMed PubMed Central Google Scholar

McCormick, R. F. et al. The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization. Plant J. 93 , 338–354 (2018).

Zhang, T. et al. Sequencing of allotetraploid cotton ( Gossypium hirsutum L. acc. TM-1) provides a resource for fiber improvement. Nat. Biotechnol. 33 , 531–537 (2015).

The International Wheat Genome Sequencing Consortium (IWGSC). A chromosome-based draft sequence of the hexaploid bread wheat ( Triticum aestivum ) genome. Science 345 , 1251788 (2014).

Yang, J. et al. The genome sequence of allopolyploid Brassica juncea and analysis of differential homoeolog gene expression influencing selection. Nat. Genet. 48 , 1225–1232 (2016).

Chalhoub, B. et al. Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome. Science 345 , 950–953 (2014).

Article ADS CAS PubMed Google Scholar

Bao, Y. X. et al. Genetic diversity and pathogenicity of Fusarium fujikuroi species complex (FFSC) causing sugarcane pokkah boeng disease in China. Plant Dis. 107 , 1299–1309 (2023).

Marcussen, T. et al. Ancient hybridizations among the ancestral genomes of bread wheat. Science 345 , 1250092 (2014).

Wang, M. et al. Asymmetric subgenome selection and cis-regulatory divergence during cotton domestication. Nat. Genet. 49 , 579–587 (2017).

Franziska, S. et al. Subgenome dominance shapes novel gene evolution in the decaploid pitcher plant Nepenthes gracilis . Nat. Plants 9 , 2000–2015 (2023).

Pertea, M. et al. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie, and Ballgown. Nat. Protoc. 11 , 1650–1667 (2016).

Article CAS PubMed PubMed Central Google Scholar

Cheng, H. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18 , 170–175 (2021).

Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25 , 1754–1760 (2009).

Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356 , 92–95 (2017).

Rhie, A. et al. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21 , 1–27 (2020).

Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34 , 3094–3100 (2018).

Simão, F. A. et al. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31 , 3210–3212 (2015).

Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinform. 4 , Unit 4.10 (2009).

Bao, Z. & Eddy, S. R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12 , 1269–1276 (2002).

Stanke, M. et al. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinform. 7 , 1–11 (2006).

Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30 , 2114–2120 (2014).

Birney, E. & Durbin, R. Using GeneWise in the Drosophila annotation experiment. Genome Res 10 , 547–548 (2000).

Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nat. Genet. 25 , 25–29 (2000).

Kanehisa, M. et al. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32 , D277–D280 (2004).

Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31 , 365–370 (2003).

Tatusov, R. L. et al. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28 , 33–36 (2000).

Zhang, X. et al. Haplotype-resolved genome assembly provides insights into evolutionary history of the tea plant Camellia sinensis . Nat. Genet . 53 , 1250–1259 (2021).

Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40 , e49–e49 (2012).

Altschul, S. F. et al. Basic local alignment search tool. J. Mol. Biol. 215 , 403–410 (1990).

Langmead, B. & Salzberg, S. Fast gapped-read alignment with bowtie 2. Nat. Methods 9 , 357–359 (2012).

Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29 , 644–652 (2011).

Song, J. M. et al. Two gap-free reference genomes and a global view of the centromere architecture in rice. Mol. Plant 14 , 1757–1767 (2021).

Jiao, Y. et al. Improved maize reference genome with single-molecule technologies. Nature 546 , 524–527 (2017).

Download references

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2021YFF1000900 to X.Z.), National Natural Science Foundation of China grant (No. 32222019 to X.Z., 32260517 to M.Z., 32072408 to B.C., and 32201794 to Q.Z.), China Agricultural Research Systems funded by MFA and MARA (CARS 170109 to M.Z.), Guangxi Key Research and Development Program (Guike AA22117001 to W.Y.), the fellowship of China National Postdoctoral Program for Innovative Talents (BX20220349 to Q.Z.), the grants from State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources(SKLCUSA-a01 and SKLCUSA-a07 to Q.Z.), the Sugarcane Research Foundation of Guangxi University (2022GZA003 to Y.B.).

Author information

These authors contributed equally: Yixue Bao, Qing Zhang, Jiangfeng Huang, Shengcheng Zhang.

Authors and Affiliations

State Key Laboratory for Conservation and Utilization of Subtropical Agri-Biological Resources & Guangxi Key Laboratory for Sugarcane Biology, Guangxi University, Nanning, 530005, China

Yixue Bao, Qing Zhang, Jiangfeng Huang, Wei Yao, Zehuai Yu, Zuhu Deng, Xikai Yu, Shan Lu, Ru Li, Chengwu Zou, Yuzhi Xu, Zongling Liu, Fan Yu, Jiaming Song, Youzong Huang, Jisen Zhang, Haifeng Wang, Baoshan Chen & Muqing Zhang

National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangzhou, 518120, China

Qing Zhang, Shengcheng Zhang, Jiaxin Yu, Weilong Kong, Yibin Wang, Yuhan Song & Xingtan Zhang

You can also search for this author in PubMed Google Scholar

Contributions

M.Z., X.Z. and B.C. conceived the project and coordinated research activities; M.Z., X.Z., and Y.B. designed the experiments; Y.B., J.H., W.Y., Z.D., R.L., C.Z., Y.X., Z.L. and Y.H. collected the sugarcane materials and performed the experiments; X.Z., Q.Z., S.Z., J.Y., Y.W., Y.S., X.Y. and J.S. contributed to genome assembly and data analysis; Y.B. and S.L. conducted transcriptome sequencing and analysis; Q.Z. and J.H. studied the genes relevant to the key characteristics in sugarcane; Z.Y. and F.Y. performed the oligo-FISH experiments; M.Z., X.Z., B.C., Y.B., Q.Z. and W.K. wrote the manuscript; M.Z., J.Z. and H.W. revised the manuscript. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Baoshan Chen , Xingtan Zhang or Muqing Zhang .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Communications thanks Aureliano Bombarely, Mike Butterfield and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information, peer review file, description of additional supplementary files, supplementary data 1, reporting summary, source data, source data, rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Bao, Y., Zhang, Q., Huang, J. et al. A chromosomal-scale genome assembly of modern cultivated hybrid sugarcane provides insights into origination and evolution. Nat Commun 15 , 3041 (2024). https://doi.org/10.1038/s41467-024-47390-6

Download citation

Received : 05 October 2023

Accepted : 31 March 2024

Published : 08 April 2024

DOI : https://doi.org/10.1038/s41467-024-47390-6

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

By submitting a comment you agree to abide by our Terms and Community Guidelines . If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Quick links

Explore articles by subject
Guide to authors
Editorial policies

Development and analysis of Genome-Scale metabolic reconstruction of Microchloropsis gaditana CCMP526

Campus location, principal supervisor, additional supervisor 1, additional supervisor 2, year of award, department, school or centre, additional institution or organisation, degree type, usage metrics.

Genetics not elsewhere classified
Cell metabolism
Systems biology
Bioinformatics and computational biology not elsewhere classified
Medical biotechnology not elsewhere classified

Open access
Published: 05 April 2024

Yak genome database: a multi-omics analysis platform

Hui Jiang 1 , 2 na1 ,
Zhi-Xin Chai 3 na1 ,
Xiao-Ying Chen 1 , 2 na1 ,
Cheng-Fu Zhang 1 , 2 ,
Yong Zhu 1 , 2 ,
Qiu-Mei Ji 1 , 2 &
Jin-Wei Xin 1 , 2

BMC Genomics volume 25 , Article number: 346 ( 2024 ) Cite this article

240 Accesses

Metrics details

The yak ( Bos grunniens ) is a large ruminant species that lives in high-altitude regions and exhibits excellent adaptation to the plateau environments. To further understand the genetic characteristics and adaptive mechanisms of yak, we have developed a multi-omics database of yak including genome, transcriptome, proteome, and DNA methylation data.

Description

The Yak Genome Database ( http://yakgenomics.com/ ) integrates the research results of genome, transcriptome, proteome, and DNA methylation, and provides an integrated platform for researchers to share and exchange omics data. The database contains 26,518 genes, 62 transcriptomes, 144,309 proteome spectra, and 22,478 methylation sites of yak. The genome module provides access to yak genome sequences, gene annotations and variant information. The transcriptome module offers transcriptome data from various tissues of yak and cattle strains at different developmental stages. The proteome module presents protein profiles from diverse yak organs. Additionally, the DNA methylation module shows the DNA methylation information at each base of the whole genome. Functions of data downloading and browsing, functional gene exploration, and experimental practice were available for the database.

This comprehensive database provides a valuable resource for further investigations on development, molecular mechanisms underlying high-altitude adaptation, and molecular breeding of yak.

Peer Review reports

Although single omics study provides information and insights into specific biological or molecular processes, it is hard to confirm the real molecular mechanisms underlying the functionality of an organism and the relationships between biological processes and environmental factors. Integrating and analyzing multiple omics data provide an effective and systematic approach to life science researchers. In general, genomics provides DNA sequence information, transcriptomics examines gene transcription patterns under specific conditions, proteomics explores the composition and expression levels of proteins in cells, and DNA methylation involves chemical modifications on DNA molecules [ 1 ]. Multi-omics analysis combines data at different levels to comprehensively explore biological processes. Multi-omics analysis reveals connections between genomics, transcriptomics, proteomics, and DNA methylation data, facilitating to understand how genomic variations impact gene transcription and protein expression, as well as the associations between DNA methylation and gene activities [ 2 ]. These pieces of information contribute novel information to the gene regulatory networks, which are important to molecular mechanisms underlying biological functions, development, metabolism, etiopathology, and environmental adaptation.

The yak ( Bos grunniens ) is a unique species in the Qinghai-Tibet Plateau, and widely distributes in high-altitude areas of Western China and neighboring regions. As a large mammal at the highest-altitude area, yak has survived and adapted to the harsh and cold environment after thousands of years of evolution [ 3 ]. Their unique biological features make them an ideal model for studying adaptive evolution and high-altitude ecosystems. Yak also plays important roles in agriculture and economic development. As a significant livestock species, yak provides meat, fur, and other economic resources. Their dung is also an important source of agricultural fertilizer and energy production. Moreover, yak positively impacts the ecological balance and vegetation restoration in the plateau grasslands through their grazing behaviors [ 4 ]. In recent years, we have analyzed yak using different omics approaches. These data preliminarily explored the yak genetic characteristics, gene transcription, protein expression, and DNA methylation patterns, as well as molecular regulatory mechanisms in response to different conditions [ 5 , 6 , 7 , 8 , 9 , 10 , 11 ], providing novel insights into the mechanisms underlying evolution, and high-altitude adaptation in yak.

Currently, the data resources of yak omics researches are generally stored in public databases in their raw data format, such as NCBI. These databases primarily provide storage and retrieval functions, but lack an integrated platform for data integration and in-depth analysis. Hu et al. [ 12 ] developed a yak genome database ( http://me.lzu.edu.cn/yak ), which incorporated genome sequences, predicted genes and associated annotations, non-coding RNA sequences, transposable elements, and single nucleotide variants of yak, as well as three-way whole-genome alignments between human, cattle and yak. However, this database did not include other omics datasets, such as transcriptome, proteome, and DNA methylation. Given the vast and diverse nature of omics data, the traditional database retrieval methods could not fully explore the relationship between different types of datasets [ 13 ]. Thus, an integrated platform of different omics data is crucial to facilitate data integration, interaction, and analysis. An integrated platform can also offer advanced data mining and machine learning algorithms to help researchers discover the complex relationships among yak genomics, transcriptomics, proteomics, and other omics levels, further deepening our understanding of biological processes and diseases in yak.

In this study, the Yak Genome Database ( http://yakgenomics.com/ ) was constructed, which successfully assembled a comprehensive yak fine-scale genome map at the chromosome level, using PacBio sequencing, Illumina sequencing, Bionano assembly, and Hi-C three-dimensional genome scaffolding. Moreover, this platform also integrated transcriptome, proteome, and DNA methylation data of yak, which were not available in Yak Genome Database developed by Hu et al. [ 12 ]. This database provides basic information for yak researches in future, such as molecular breeding, molecular evolution, disease prevention and control.

Construction and content

The Yak Genome Database was deployed in the Ubuntu 20.04 operation system using the AKKA 2.13 (web server), MySQL 8.0.30 (database server), Scala 2.13.2, and SBT 1.3.9. All data were managed and stored using the MySQL Database Management System. The query function was enforced based on Slick 3.3.2 middleware tier. The Jbrowse 1.16.11 was used to visualize the genome. The website interfaces were designed and implemented using the Bootstrap 4.6.0 and the Play Framework 2.8.7. The software versions and statistical tools used for data analyses and plot preparation have been presented in Xin et al. [ 6 , 7 , 8 , 9 , 10 , 11 ]. The boxplots, and heatmaps were prepared using R 4.2.1. The website has been tested in several popular web browsers, including Firefox, Google Chrome, and Internet Explorer.

Utility and discussion

The yak genome database content.

The multi-omics data in the Yak Genome Database are categorized into two central functional domains: data resources and navigation (Fig. 1 ). The data resources contain four main modules, including genome, transcriptome, proteome, and methylation information. The database contains 26,518 genes, 62 transcriptomes, 144,309 proteome spectra, and 22,478 methylation sites of yak. The navigation page consists of Browser, Jbrowse, Search and Blast functions. Currently, the database supports individual download of images and gene data. In the future, we will add functions such as one-click download of whole genome information.

The homepage of yak genome database

Genome module

The Genome module incorporates the complete genomic DNA sequence of yaks obtained by the third-generation high-throughput sequencing platform (PacBio RSII) [ 14 ]. The yak genome was sequenced at a coverage of 70X, with the second-generation sequencing data used to correct errors. The Bionano assisted assembly technology was used for high-quality assembly, and analysis. Next, a refined physical map of the yak chromosome was generated, providing a more readable and complete genome database than the fragmented information in another Yak Genome Database (BosGru_v2.0) [ 15 ], and contributing a novel genome tool to yak researchers.

When accessing the ‘Genome’ section on the homepage, a new page will display information of genes at all locations, such as Gene ID, Chromosome, Start Position, End Position, Strand, GO (Gene Ontology) terms, Interpro, KEGG (Kyoto Encyclopedia of genes and Genomes), Swissprot, and Trembl in a user-friendly table format (Fig. 2 A). When clicking each gene, users can access detailed information of this gene, including annotations, transcriptional levels, proteome data, Jbrowse page, and nucleotide sequences associated with the gene (Fig. 2 B- 2 D). The ‘Annotation’ tav provides comprehensive gene annotation information, including GO terms, KEGG pathways, and Interpro annotations, which can be further explored by clicking them. The ‘Expression’ tab displays gene expression levels across different cattle breeds and tissues, and users can download the images in various formats by selecting the menu in the upper right corner of the image. ‘Jbrowse’ is used to display integrated information from annotated genomic datasets, while ‘Seqs’ provides the coding sequence (CDS) and protein sequence on the selected gene.

Features of the genome module. ( A ) Genome browse. ( B ) Basic information and annotation of a gene. ( C ) Gene expression. ( D ) Gene Jbrowse and sequences

Transcriptome module

Previously, comparative transcriptome sequencing was performed on lung, gluteal muscle, and mammary gland tissues of low-altitude cattle (Sanjiang and Holstein cattle), Tibetan cattle (living at a moderate altitude), and yaks (living at a high altitude). In addition, these tissues of yaks at different ages (6, 30, 60, and 90 months) were also subjected to transcriptome sequencing. These analyses identified the functional genes involved in the major biochemical, metabolic, and signal transduction pathways involved in yak development and high-altitude adaptation [ 10 , 11 ]. These data are included in the transcriptome module on the website, providing a valuable transcriptome database for specific tissue biomarkers, molecular research, and breeding of yaks. After clicking the “Transcriptome” button, users can select the strain in the ‘Sample’ dialog box, enter the gene ID in the ‘Gene ID’ dialog box, and then click ‘Search’ (Fig. 3 A), and then the website will return the transcriptional levels of the selected genes in selected samples in the forms of data table, Boxplot, Lineplot, and Heatmap (Fig. 3 B and D).

Features of the transcriptome module. ( A ) Transcriptome browse. ( B ) Box plot, ( C ) Line plot and ( D ) Heatmap of gene expression

Proteome module

Using the liquid chromatography-mass spectrometry (LC-MS) method, proteomic analyses were conducted for four specific tissues from four different species (yak, Tibetan cattle, Sanjiang cattle, and Holstein cattle) [ 7 , 8 , 9 ]. All the animals were female and 60 months of age. The proteome module provides two input dialog boxes. Users can select two samples and then click the “search” button. Next, the website will return the comparison results of the expression levels of all genes in the two selected samples, including log2(fold change) and statistical parameters (Fig. 4 ).

Browse of the proteome module

Methylation module

DNA methylation is a critical epigenetic modification that occurs in both animals and plants, playing pivotal roles in chromosome structure, gene expression and regulation [ 16 ]. The establishment of a comprehensive DNA methylation database for yak can significantly advance the comprehension of cellular gene expression and regulation, and provide deeper insights into the spatiotemporal specificity of DNA methylation across various developmental stages and organs [ 17 ]. The DNA methylation database of yak presents single-base methylation maps and tissue-specific methylation maps. The single-base methylation maps include: 1) DNA methylation levels at the single-base resolution, 2) DNA methylation levels specific to different base types, 3) DNA methylation levels specific to different gene structures, 4) DNA methylation levels in repetitive sequences, and 5) DNA methylation levels in non-coding sequences and regulatory regions. The tissue-specific methylation maps involve three tissues: mammary gland, lung, and muscle [ 6 ]. On the website, users can select ‘Sample’ and ‘Chromosome’ in the Methylation module, set the ‘Start Position’ and ‘End Position,’ and finally click ‘Search’ to obtain the corresponding DNA methylation results on the selected sequences (Fig. 5 ).

Features of the methylation module. ( B ) Box plot, ( C ) Line plot and ( D ) Heatmap of methylated gene expression

‘Browse’ allows users to read the yak genome directly. ‘JBrowse’ is a next-generation genome browser built with JavaScript and HTML5. The Jbrowse of Yak Genome Database includes tracks describing gene, gene sequence, mRNAs, structure, and other gene-related features, and provides a graphical display of annotations on the yak genome (Fig. 6 ). Users can browse gene models on chromosomes and unanchored contigs. For example, if user set the genomic region from 4,454,001 bp to 5,878,000 bp on Chr1 for browsing, all genes in this region will appear in order (Fig. 6 A). When clicking on ‘BmuPB021145’, an extra layer will appear with the detailed information, such as mRNAs, CDS and other features (Fig. 6 B). For more operational details, users can click the ‘Help’ button, which provides comprehensive instructions and guidance.

Regional view of the genome using Jbrowse. ( A ) A graphic view of the region 4,454,001 bp to 5,878,000 bp on Chr1. ( B ) The interface after clicking on ‘BmuPB021145’.

The ‘Search’ tab supplies users with two methods (search by gene ID or range) for genome searching. When users click on ‘Blast’, three options ‘Blastn Gene’, ‘Blastn Genome’ and ‘Blastp’ will display. Users can select the Blast type and enter a DNA or protein sequence, and set the parameters of ‘Evalue’, ‘Word size’ and ‘Max target seqs’. After clicking the ‘Search’ button, the nucleotide or protein sequence complying the search conditions will display and could be downloaded by the users.

Additional tools

The Yak Genome Database also provides users with several convenient online tools, including Primer designer, GO and KEGG enrichment. The ‘Primer designer’ tool offers primer design function to amplify a selected sequence. The ‘GO enrichment’ and ‘KEGG enrichment’ tools facilitate the users to obtain the GO and KEGG enrichment results of a set of genes.

Maintenance of the yak genome database in future

To ensure continuous operation of the Yak Genome Database, we would assign an administrator to manage the website regularly. We would keep omics studies on yak in future, and all the omics data we obtained would be uploaded to this database. In addition, we would keep cooperations with other investigators and find more cooperators who work on yak. Next, all the progresses on yak omics would also be encouraged to supplement in this database.

Conclusions

The Yak Genome Database is a comprehensive platform of genomic physical map, which integrates genome, transcriptome, proteome, and DNA methylation data. Information in the database can be downloaded, and shared through the Internet. Users who want to upload their own data can contact the administrator of the website. By providing timely updates on yak research progress, the Yak Genome Database enables efficient and interactive sharing of existing scientific data among researchers worldwide who are interested in yak, cattle, livestock, ruminant animals, and even medical research. Comparative analysis of multidimensional data from key yak tissues aims to uncover the mechanisms underlying high-altitude adaptation, disease resistance, cold tolerance, and starvation resistance of large animals in the plateau. These findings contribute to molecular breeding of livestock animals and the understanding of human responses to harsh environments.

Data availability

The datasets generated and analyzed in the current study are freely available on the Download page of Yak database with the web link: http://yakgenomics.com/ .

Abbreviations

Coding Sequence

Gene Ontology

Kyoto Encyclopedia of Genes and Genomes

National Center for Biotechnology Information

Manzoni C, Kia DA, Vandrovcova J, Hardy J, Wood NW, Lewis PA, et al. Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences. Brief Bioinform. 2018;19(2):286–302.

Article CAS PubMed Google Scholar

Liao Y, Wang J, Zou J, Liu Y, Liu Z, Huang Z. Multi-omics analysis reveals genomic, clinical and immunological features of SARS-CoV-2 virus target genes in pan-cancer. Front Immunol. 2023;14:1112704.

Article CAS PubMed PubMed Central Google Scholar

Ge Q, Guo Y, Zheng W, Zhao S, Cai Y, Qi X. Molecular mechanisms detected in yak lung tissue via transcriptome-wide analysis provide insights into adaptation to high altitudes. Sci Rep. 2021;11(1):7786.

Ayalew W, Chu M, Liang C, Wu X, Yan P. Adaptation mechanisms of Yak ( Bos grunniens ) to high-Altitude Environmental stress. Animals. 2021;11(8):2344.

Article PubMed PubMed Central Google Scholar

Gao X, Wang S, Wang YF, Li S, Wu SX, Yan RG, et al. Long read genome assemblies complemented by single cell RNA-sequencing reveal genetic and cellular mechanisms underlying the adaptive evolution of yak. Nat Commun. 2022;13(1):4887.

Xin J, Chai Z, Zhang C, Zhang Q, Zhu Y, Cao H, et al. Methylome and transcriptome profiles in three yak tissues revealed that DNA methylation and the transcription factor ZGPAT co-regulate milk production. BMC Genom. 2020;21(1):731.

Article CAS Google Scholar

Xin JW, Chai ZX, Zhang CF, Zhang Q, Zhu Y, Cao HW, et al. Signature of high altitude adaptation in the gluteus proteome of the yak. J Exp Zool B Mol Dev Evol. 2020;334(6):362–72.

Xin JW, Chai ZX, Zhang CF, Zhang Q, Zhu Y, Cao HW, et al. Differences in proteomic profiles between yak and three cattle strains provide insights into molecular mechanisms underlying high-altitude adaptation. J Anim Phys Anim Nutr. 2022;106(3):485–93.

Xin JW, Chai ZX, Zhang CF, Yang YM, Zhang Q, Zhu Y, et al. Comparative analysis of Skeleton muscle Proteome Profile between Yak and cattle provides insight into high-altitude adaptation. Curr Proteom. 2021;18(1):62–70.

Xin JW, Chai ZX, Zhang CF, Zhang Q, Zhu Y, Cao HW, et al. Transcriptome profiles revealed the mechanisms underlying the adaptation of yak to high-altitude environments. Sci Rep. 2019;9(1):7558.

Xin JW, Chai ZX, Zhang CF, Zhang Q, Zhu Y, Cao HW, et al. Comparisons of lung and gluteus transcriptome profiles between yaks at different ages. Sci Rep. 2019;9(1):14213.

Hu Q, Ma T, Wang K, Xu T, Liu J, Qiu Q. The yak genome database: an integrative database for studying yak biology and high-altitude adaption. BMC Genomics. 2012;13:600.

Tarazona S, Arzalluz-Luque A, Conesa A. Undisclosed, unmet and neglected challenges in multi-omics studies. Nat Comp Sci. 2021;1(6):395–402.

Article Google Scholar

Ji QM, Xin JW, Chai ZX, Zhang CF, Dawa Y, Luo S, et al. A chromosome-scale reference genome and genome-wide genetic variations elucidate adaptation in yak. Mol Ecol Res. 2021;21(1):201–11.

Jiangfeng F, Yuzhu L, Sijiu Y, Yan C, Gengquan X, Libin W, et al. Transcriptional profiling of two different physiological states of the yak mammary gland using RNA sequencing. PLoS ONE. 2018;13(7):e0201628.

Lucibelli F, Valoroso MC, Aceto S, Plant DNA, Methylation. An epigenetic Mark in Development, Environmental interactions, and evolution. Int j mol sci. 2022;23(15):8299.

Chai Z, Wu Z, Ji Q, Wang J, Wang J, Wang H, et al. Genome-wide DNA methylation and hydroxymethylation changes revealed epigenetic regulation of Neuromodulation and Myelination in Yak Hypothalamus. Front Genet. 2021;12:592135.

Download references

This work was supported by the Program of Provincial Department of Finance of the Tibet Autonomous Region, the Major Special Projects of Tibet Autonomous Region (XZ202101ZD0002N-01), the Second Tibetan Plateau Scientific Expedition and Research Program (2019QZKK0501), and the program National Beef Cattle and Yak Industrial Technology System (CARS-37).

Author information

Hui Jiang, Zhi-Xin Chai and Xiao-Ying Chen contributed equally to this work.

Authors and Affiliations

State Key Laboratory of Hulless Barley and Yak Germplasm Resources and Genetic Improvement, 850000, Lhasa, Tibet, China

Hui Jiang, Xiao-Ying Chen, Cheng-Fu Zhang, Yong Zhu, Qiu-Mei Ji & Jin-Wei Xin

Institute of Animal Science and Veterinary, Tibet Academy of Agricultural and Animal Husbandry Sciences, 850000, Lhasa, Tibet, China

Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization, Sichuan Province and Ministry of Education, Southwest Minzu University, 610041, Chengdu, Sichuan, China

Zhi-Xin Chai

You can also search for this author in PubMed Google Scholar

Contributions

HJ and ZXC performed the analysis. XYC conducted the database. CFZ and YZ wrote the paper. QMJ supervised the database and JWX revised the manuscript. All authors have approved the final article.

Corresponding authors

Correspondence to Qiu-Mei Ji or Jin-Wei Xin .

Ethics declarations

Ethics approval and consent to participate.

All procedures and experiments involving animals followed the guidelines for the Care and Use of Laboratory Animals. The Ethics Committee at Institute of Animal Science and Veterinary, Tibet Academy of Agricultural and Animal Husbandry Sciences (Permit Number: 2015 − 216) approved this study.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Jiang, H., Chai, ZX., Chen, XY. et al. Yak genome database: a multi-omics analysis platform. BMC Genomics 25 , 346 (2024). https://doi.org/10.1186/s12864-024-10274-6

Download citation

Received : 30 October 2023

Accepted : 31 March 2024

Published : 05 April 2024

DOI : https://doi.org/10.1186/s12864-024-10274-6

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Multi-omics
Plateau environment

BMC Genomics

ISSN: 1471-2164

Submission enquiries: [email protected]
General enquiries: [email protected]

Open access
Published: 02 April 2024

DNA methylation remodeling and the functional implication during male gametogenesis in rice

Xue Li 1 na1 ,
Bo Zhu 1 na1 ,
Feng Zhao 1 ,
Qian Liu 1 ,
Jiahao Wang 1 ,
Miaomiao Ye 1 ,
Siyuan Chen 1 ,
Junwei Nie 3 ,
Lizhong Xiong 1 ,
Yu Zhao 1 ,
Changyin Wu 1 &
Dao-Xiu Zhou ORCID: orcid.org/0000-0002-1540-0598 1 , 4

Genome Biology volume 25 , Article number: 84 ( 2024 ) Cite this article

778 Accesses

Metrics details

Epigenetic marks are reprogrammed during sexual reproduction. In flowering plants, DNA methylation is only partially remodeled in the gametes and the zygote. However, the timing and functional significance of the remodeling during plant gametogenesis remain obscure.

Here we show that DNA methylation remodeling starts after male meiosis in rice, with non-CG methylation, particularly at CHG sites, being first enhanced in the microspore and subsequently decreased in sperm. Functional analysis of rice CHG methyltransferase genes CMT3a and CMT3b indicates that CMT3a functions as the major CHG methyltransferase in rice meiocyte, while CMT3b is responsible for the increase of CHG methylation in microspore. The function of the two histone demethylases JMJ706 and JMJ707 that remove H3K9me2 may contribute to the decreased CHG methylation in sperm. During male gametogenesis CMT3a mainly silences TE and TE-related genes while CMT3b is required for repression of genes encoding factors involved in transcriptional and translational activities. In addition, CMT3b functions to repress zygotic gene expression in egg and participates in establishing the zygotic epigenome upon fertilization.

Collectively, the results indicate that DNA methylation is dynamically remodeled during male gametogenesis, distinguish the function of CMT3a and CMT3b in sex cells, and underpin the functional significance of DNA methylation remodeling during rice reproduction.

DNA cytosine methylation is a hallmark for repression of transposable elements (TE) and related sequences in complex genomes such as flowering plants and vertebrates. In flowering plants, DNA cytosine methylation occurs in the context of CG, CHG and CHH (where H is A, C, or T) sequences. CG methylation is maintained during cell division by DNA methyltransferase1 (MET1), which recognizes and methylates hemi-methylated CG sites in the newly replicated DNA. Non-CG (i.e. CHG and CHH) methylation in heterochromatin is maintained by plant-specific Chromomethylases3 (CMT3, at CHG sites) and CMT2 (at CHH and CHG sites), which bind to the histone methylation mark H3K9me2, while CHH methylation in euchromatin regions is maintained by Domains Rearranged Methyltransferase2 (DRM2) guided by related siRNA [ 1 ]. DRM2 also mediates de novo DNA methylation regardless of sequence contexts [ 1 ]. All three methylation sites are found in TE and TE-like sequences and in about 10–20% of genes depending on plant species. DNA methylation in positive regulatory sequences impairs gene transcription and causes gene silencing, but enhances gene activity when occurring in repressive DNA elements [ 2 ]. CG methylation is also common within the transcribed regions of genes, where is associated with gene activity [ 3 ]. In rice, DNA methylation displays some differences in methylation levels and genomic distribution compared with Arabidopsis. For instance, rice mCHH is enriched at euchromatin regions and many genes are found to be methylated at CHH sites, especially around TSS [ 4 , 5 ]. There are also quite a number of genes methylated at CHG sites which are related to allelic-specific expression in rice hybrids [ 6 ].

Epigenetic marks are reprogrammed in the gametes and after fertilization. In mammals, there are two distinct phases of epigenetic reprogramming to prevent inheritance of ancestral epigenetic signatures. The first phase consists of a genome-wide erasure of DNA methylation in the primordial germ cells (PGCs), the gamete precursors, followed by the reestablishment of epigenetic signatures to enable gamete maturation and function [ 7 , 8 ]. Global DNA methylation is once again erased post-fertilization in early embryos, followed by another round of global de novo methylation during embryo development [ 9 , 10 , 11 ]. In contrast with mammals, DNA methylation is not globally erased in gametes in flowering plants [ 12 , 13 , 14 , 15 , 16 , 17 ]. Unlike mammalian germ lines that are defined already at an early stage of embryogenesis, thus before meiosis, the plant male and female sexual lineages initiate as diploid meiocytes from somatic cells, which give rise to haploid microspores after meiosis. The male microspores subsequently undergo mitosis, producing the vegetative and generative cells. The generative cell is further divided to produce two sperm cells in the mature pollen grain, the male gametophyte. Previous studies showed that plant sperm DNA methylation is remodeled and shows variation relative to somatic tissues [ 12 , 14 , 16 , 17 , 18 , 19 ]. However, it remains unclear whether the remodeling process is initiated in male meiocytes before meiosis or at the subsequent steps after meiosis and whether DNA methylation remodeling is required for male gametogenesis.

In this work, we analyzed DNA methylation in rice male meiocyte, microspore and sperm and studied the function of a set of chromatin regulators during the process. The results indicate that DNA methylation remodeling starts after meiosis of the male meiocytes and that non-CG methylation, particularly at CHG sequences, is dynamically remodeled during rice male gametogenesis. The work reveals distinct function of DNA methyltransferases and histone demethylase in the remodeling of CHG methylation and suggests that the CHG methylation remodeling has functional significance for male gametogenesis and fertilization.

Non-CG methylation is dynamically remodeled during rice male gametogenesis

To study DNA methylation dynamics during male gametogenesis in rice, we manually isolated male meiocytes, unicellular microspores and sperms of the Zhonghua11 (ZH11) variety as previously described [ 17 , 20 , 21 , 22 ]. About 400 meiocytes, 300 unicellular microspores and 100 sperm cells were collected for bisulfite sequencing (BS-seq) analysis using a protocol developed for small numbers of cells (Additional file 1 : Fig. S1a, b) [ 23 ]. Data with two biological replicates were obtained (Additional file 1 : Fig. S1c, Additional file 2 : Table S1). Violin plots of the BS-seq data revealed that the overall CG methylation (mCG) levels (especially in TE and TE-related genes, TEG) gradually increased during male gametogenesis (Fig. 1 a), whereas CHG methylation (mCHG) levels were first increased in unicellular microspore (UM) but subsequently decreased in sperm (S) to the lowest levels (Fig. 1 a). Density plots confirmed the mCHG variations between microspore (UM) and meiocyte (Me) and between sperm (S) and microspores (UM) (Fig. 1 b). A similar trend of CHH methylation (mCHH) variation was also observed (Fig. 1 b). Scanning of differentially methylated regions (DMRs, defined within 100-bp windows, see Methods) between microspore and meiocyte (UM-Me) detected more hyper than hypo CG (876 hyper versus 646 hypo), CHG (10,720 hyper versus 3,764 hypo), and CHH (31,501 hyper versus 20,983 hypo) DMRs, indicating a clear gain of non-CG methylation in microspore (Fig. 1 c, d). In sperm relative to microspore, there were more hypo- than hyper-DMRs at non-CG, especially CHG context (30,455 hypo- compared to 1,727 hyper-DMRs), confirming a clear decrease of mCHG in sperm. The mCHG levels in sperm were the lowest when compared with those in egg and zygote or somatic tissues of the same rice variety (Additional file 1 : Fig. S2a, see below), which could be also observed in other rice varieties (Additional file 1 : Fig. S2b) [ 4 , 16 , 17 , 24 ]. Analysis of sex cell methylomes obtained from the Nipponbare (NIP) variety indicated that sperm mCHG and mCHH levels were lower than pollen vegetative cell levels, but comparable to central cells of the female gametophyte (Additional file 1 : Fig. S2b) [ 16 , 24 ]. About 2/3 (5,221/7,775) of the increased mCHG (hyper DMRs) in microspore (relative to meiocyte) were erased in sperm, the other 1/3 (2,554/7,775) microspore hyper DMRs remained hyper-methylated in sperm (Additional file 1 : Fig. S3a, clusters A and B). These two clusters of the microspore hyper DMRs were located mainly in genic and intergenic regions (Additional file 1 : Fig. S3b), and showed enrichment of euchromatin histone marks (H3K36me3, H3K9ac) (Additional file 1 : Fig. S3d). By contrast, many sperm hypo DMRs (12,588) that showed no change between microspore and meiocyte (Additional file 1 : Fig. S3a), which corresponded mainly to TE and TEG (Additional file 1 : Fig. S3b), and were enriched for the heterochromatin mark H3K9me2 (Additional file 1 : Fig. S3d). The data indicate that during male gametogenesis mCHG and mCHH at both euchromatin and heterochromatin loci are dynamically remodeled to the lowest levels in sperm.

DNA cytosine methylation in rice meiocytes, unicellular microspores, sperms. a Violin plots showing overall cytosine methylation levels (mCG, mCHG, and mCHH) in transposable elements (TE), transposable gene (TEG) and protein coding gene (Gene) of rice Zhonghua 11 (ZH11) meiocyte (Me), unicellular microspore (UM) and sperm (S). Values of the methylomes are averages from the two replicates. The average methylation levels (white dots) and median values (black bars) are indicated. b Density plot showing the frequency distribution of fractional methylation difference between the indicated samples. c Numbers of differentially methylated regions (DMRs) of between the indicated comparisons, distributed in TE (> 500 bp), TEG, gene, and intergenic regions. DMRs located in TE (red), gene (light green), intergenic region (pink), and TEG (yellow) are shown. d Genome browser screenshots of mCG, mCHG, and mCHH in meiocytes (Me), unicellular microspore (UM), sperm (S). Differentially methylated regions are grey colored

From cluster A of the microspore hyper DMRs, 315 genes were identified. These genes are enriched for translation, ribonucleoprotein complex, and cellular protein metabolic functions, implying that mCHG in protein translation and RNA-binding pathway genes was particularly dynamic during meiocyte to sperm development (Additional file 1 : Fig. S3e, Additional file 4 : Table S3). Several genes showed lower expression in microspore than meiocyte and/or sperm (Additional file 1 : Fig. S4a), suggesting that hypermethylation might play a role in their repression in microspore.

CMT3a and CMT3b function during male gametogenesis

CMT3a is a major CHG methyltransferase gene expressed at high levels throughout the sporophytic development in rice [ 25 , 26 ]. However, its expression became low or undetectable in sperm in several rice varieties (Fig. 2 a). By contrast, CMT3b expression was low or undetectable in vegetative tissues/organs but showed expression in reproductive cells including meiocyte, microspore, and sperm (Fig. 2 a). To study the function of CMT3 genes during male gametogenesis, we produced cmt3a and cmt3b knockout (KO) plants in the ZH11 background using the CRISPR technique and two independent lines for each gene were obtained (Additional file 1 : Fig. S5a). The cmt3a mutants produced mainly defective pollens and were infertile (Fig. 2 b). Cytology sections revealed that the cmt3a pollen development was arrested likely at the bicellular microspore stage (Additional file 1 : Fig. S5b).

Effects of cmt3a and cmt3b mutations on DNA methylation in meiocyte, microspore and sperm. a Transcript levels in FPKM of rice CMT3a and CMT3b in seedling (Se), roots (Ro), meiocyte (Me), unicellular microspore (UM), sperm (S), egg (E), zygote (Z), endosperm nuclei (En, 1.5 days after fertilization) and globular embryo (GE, 3 days after fertilization) from RNA-seq data. The sperm (Kit-S) in Kitaake background was reported by Anderson et al., (2013). b The pollen grains of wild type and cmt3a and cmt3b mutants were I2-KI stained. Bars = 50 μm. c Violin plots comparing overall cytosine methylation levels of wild type and cmt3a and cmt3b mutant meiocyte (Me), unicellular microspore (UM) and sperm (S). The average methylation levels (white dots) and median values (black bars) in transposable elements (TE) are shown. Values of the methylomes are averages from the two replicates. d Number of differential methylated regions (DMR) in cmt3a and cmt3b relative to wild type. Relative portions in TE (> 500 bp), TEG, gene, and Intergenic regions are indicated by different colors. e Venn diagrams showing overlapping of hypo-CHG DMRs in cmt3a and cmt3b meiocyte (left) and sperm (right) relative to wild type cells. f Box plots of DNA methylation levels of hypo-CHG DMRs in meiocyte (Me) versus microspore (UM) (upper) and sperm (S) relative to microspore (UM) (lower) in wild type, cmt3a (3a) and cmt3b (3b) cells. The significance was calculated with multiple comparison tests. Different letters on top of the bars indicate a significant difference ( p < 0.05). g Genome Browser screen captures showing high CHG methylation sites in microspore relative to meiocyte and sperm decreased in cmt3b mutants (highlighted by grey)

By contrast, the cmt3b mutants showed no clear plant morphological phenotype. To check the effects of cmt3 mutations on DNA methylation during male gametogenesis, we performed BS-seq of meiocyte, unicellular microspore, and sperm isolated from two independent CRISPR/Cas9-free lines of cmt3a and/or cmt3b at T3-4 generation (as well as a tissue culture-regenerated wild type line) (Additional file 2 : Table S1). Violin plots of the data revealed that mCHG was almost absent from cmt3a meiocyte (Fig. 2 c, Additional file 1 : Fig. S6a), consistent with the CMT3a loss-of-function effects in somatic tissues [ 25 ]. To a lesser extent, mCG was also reduced in cmt3a meiocyte. However, cmt3a sperm mCHG (as well as mCHH) levels became higher than the mutant meiocyte (Fig. 2 c), suggesting that additional activities partially restored mCHG and/or ectopically mediated mCHH in the mutant sperm. DRM2 and CMT2 being highly expressed in rice sperm (Additional file 1 : Fig. S7a), the residual mCHG level in cmt3a sperm could be maintained by RdDM or CMT2. This hypothesis is supported by the observation that the drm2/cmt2 mutations also reduced the mCHG levels of those loci in leaves (Additional file 1 : Fig. S7b).

By contrast, the cmt3b mutation led to a clear loss of overall mCHG in microspore and sperm but the mutation effect was less clear in meiocyte (Fig. 2 c, Additional file 1 : Fig. S6a). The cmt3b mutation also resulted in some increases of mCHH in sperm. Density plots confirmed the observations (Additional file 1 : Fig. S6b). The increases of mCHH in cmt3a/b sperm might be of indirect effects to compensate mCHG loss in the mutants. Nearly all of the cmt3b hypo-CHG DMRs in meiocyte and sperm overlapped with those of cmt3a (Fig. 2 d, e), indicating that CMT3b functioned to maintain mCHG on a fraction of the CMT3a targets. The cmt3b mutation resulted in a large number (33,412) of hypo-CHG DMRs in microspore (Fig. 2 d), and diminished the mCHG difference between microspore and sperm observed in wild type (Additional file 1 : Fig. S6c). In fact, the methylation levels of hyper-CHG DMRs in wild type microspore versus meiocyte were decreased to the meiocyte levels in cmt3b microspore, and the methylation levels of the hypo-CHG DMRs in wild type sperm versus microspore were decreased to the sperm levels in cmt3b microspore (Fig. 2 f, g). For the three clusters of DMRs shown in Additional file 1 : Fig. S3a, the cmt3b mutation largely reduced the mCHG levels in microspore (Additional file 1 : Fig. S3c). In addition, the 315 genes (from cluster A) showed higher mCHG in exons than introns in microspore (Additional file 1 : Fig. S4b), while the cmt3b mutation reduced the mCHG levels from both exons and introns, suggesting that CMT3b may preferentially target gene exons in microspore (Additional file 1 : Fig. S4a, b). The analysis indicates that CMT3b is required for the increase of mCHG in microspore.

Histone demethylases JMJ706 and JMJ707 reduce CHG methylation

As CMT3a is not or lowly expressed in sperm, mCHG diluted by mitosis may not be maintained in sperm. Alternatively, active DNA demethylation may be involved, as mutations of DNA demethylases locally remodeled DNA methylation in sperm [ 17 ]. Since mCHG is linked to H3K9me2 through a positive feedback loop [ 27 , 28 ], we investigated whether H3K9me2 demethylases were also involved in the decease of mCHG in sperm. JMJ706 was shown to function as a H3K9 demethylase in rice [ 29 ]. JMJ707 is closely related to JMJ706 [ 29 ], of which JMJ707 showed high expression in sperm (Additional file 1 : Fig. S8a). To test whether the genes were involved in mCHG during male gametogenesis, we produced jmj706 and jmj707 double knockout (KO) plants and obtained two independent lines (Additional file 1 : Fig. S8b). The KO lines ( j67 ) showed a reduced pollen viability and seed setting rate (Additional file 1 : Fig. S8c, d). We analyzed the DNA methylome of male meiocyte and sperm of the mutant lines (Cas9-free at T3-4 generation) and found that the mutations had no drastic effect on the overall methylation in the cells (Fig. 3 a, Additional file 1 : Fig. S9a). However, the methylation levels of the hypo-DMRs in wild type meiocyte versus microspore were increased in j67 meiocyte (Fig. 3 b, c, d). Similarly, the methylations levels of the hypo- DMRs in wild type sperm versus microspore were augmented in j67 sperm (Fig. 3 b, c, d). The jmj706/7 mutations clearly elevated mCHG levels of cluster B DMRs in meiocyte and cluster C in sperm (Additional file 1 : Fig. S3c). There was no overlap between cmt3b and jmj706/7 -affected DMRs (Additional file 1 : Fig. S9b). The analysis indicated that JMJ706/707 play a role to reduce mCHG at a set of genomic loci (mainly TE or TEG) in male meiocyte and sperm.

Effects of jmj706/707 mutations on DNA methylation in male sex cells. a Comparison of overall TE methylation levels in wild type and jmj706/707 mutant meiocyte (Me) and sperm (S). Values of the methylomes are averages from the two replicates. The average methylation levels (white dots) and median values (black bars) are shown. b Density plots of CHG methylation differences between jmj706/707 mutant and wild-type meiocyte (upper) and sperm (lower) (black lines). The red traces are density plots confined to the hypo DMRs between meiocyte and microspore (Me-UM) (upper) or between sperm and microspore (S-UM) (lower). c Box plots showing DNA methylation levels of 50-bp hypo-CHG methylation regions in wild type meiocyte (Me) (upper) and sperm (S) (lower) relative to microspore (UM) and in jmj706/707 meiocyte (j67-Me) and sperm (j67-S). The significance was calculated with multiple comparison tests. Different letters on top of the bars indicate a significant difference (p < 0.05). d Genome browser screen shots showing low CHG methylation sites in meiocyte (upper) or sperm (lower) relative to microspore but elevated in j67 mutants. Grey illustrates differentially methylated regions

Effect of the cmt3a/b and jmj706/7 mutations on sexual lineage-specific methylation

It is shown that Arabidopsis male sex cells show sexual-lineage-specific methylation (SLM) or sexual-lineage-hypermethylation (SLH) [ 18 ]. Using the published methods [ 18 ], we identified 555 SLH loci in rice meiocyte, microspore and sperm relative to somatic cells (seedling) (Additional file 1 : Fig. S10a). The cmt3b mutations reduced the SLH levels in all three male sex cell types, especially in microspore (Additional file 1 : Fig. S10b, c). By contrast, the jmj706/7 had no clear effect on SLH in the sex cells (Additional file 1 : Fig. S10b). Further analysis could divide the 555 SLH into 340 canonical SLH and 215 SLM loci (Additional file 1 : Fig. S11a) [ 18 ]. The canonical SLH loci corresponded mainly to TEs, while the SLM loci were located mainly in genes (body and promoter regions) (Additional file 1 : Fig. S11b), suggesting that SLM mainly targets genic regions during male gametophyte and sperm development. In total, 132 genes were targeted by SLM, which are enriched for translational function (Additional file 1 : Fig. S11c, Additional file 5 : Table S4). The 132 genes appeared to be repressed in sperm compared to meiocyte or microspore (Additional file 1 : Fig. S11d, e). The cmt3b mutation reduced SLM and increased expression of some of the genes in sperm (Additional file 1 : Fig. S11d, e), suggesting that CMT3b may be involved in SLM and repression of some of the genes in sperm.

Function of CMT3a and CMT3b in egg and zygote DNA methylation

Unlike in sperm, CMT3a is highly expressed in egg and zygote. CMT3b transcripts are detected in Egg and zygote (Fig. 2 a). To study CMT3 function in egg and zygote, we compared wild type, cmt3a and/or cmt3b egg and zygote methylomes by BS-seq analysis (Additional file 2 : Table S1). In wild type, egg and zygote mCHG levels were higher than sperm (Fig. 4 a, Additional file 1 : Fig. S12a). Because cmt3a was infertile, we only analyzed cmt3a egg methylome. As in male meiocyte and somatic tissues [ 25 ], the cmt3a mutation eliminated almost all mCHG in egg (Fig. 2 c and Fig. 4 a, Additional file 1 : Fig. S12a). The cmt3b mutation had a limited effect on overall mCHG in egg, but caused a clear decrease of mCHG in zygote (Fig. 4 a, Additional file 1 : Fig. S12a). The cmt3b mutation produced more hypo-CHG DMRs (23,460) in zygote than egg (13,249) or sperm (17,576) (Fig. 4 b). About 24% (5,623/23,460) of the hypo-DMRs in cmt3b zygote overlapped with the hyper-DMRs in wild type zygote versus sperm (Fig. 4 c). In fact, the cmt3b mutation reduced the methylation differences of the DMRs between wild type zygote and sperm (Z-S) or egg (Z-E) (Fig. 4 d, e). Together, the data indicate that CMT3b participates in reestablishing mCHG methylation at a subset of the Z-S and Z-E DMRs in the zygote.

Effect of cmt3a and cmt3b mutations on DNA methylation in zygote and/or egg cells. a Comparison of overall TE methylation levels in sperm (S), egg (E) and zygote (Z) of wild type and cmt3a, cmt3b mutants. Values of the methylomes are averages from the two replicates. The average methylation levels (white dots) and median values (black bars) are shown. b Number of differential methylated regions (DMR) in the indicated mutant cells relative to wild type. Different colors indicate the distribution of DMR in TE (red), gene (light green), intergenic region (pink), and TEG (yellow). c Venn diagrams showing overlapping of hyper-CHG DMRs in zygote relative to sperm (Z-S) (upper) or egg (Z-E) (lower) and hypo-CHG DMRs in cmt3b zygote relative to wild type zygote (3bZ-Z). d Box plots showing DNA methylation levels of Z-S (upper) or Z-E (lower) hyper-CHG DMR of the indicated cell type. The significance was calculated with multiple comparison tests. Different letters on top of the bars indicate a significant difference ( p < 0.05). e Screenshots of CHG methylation levels of 5 representative genes in the indicated cell

Function of rice CMT3b in gene expression in reproductive cells

To study the cmt3 mutation effects on gene expression in sex cells, we first performed RNA-seq of wild type and cmt3b male meiocyte and sperm. Three replicates (two replicates for WT sperm) were performed (Additional file 1 : Fig. S13a, Additional file 3 : Table S2). PCA analysis indicated a high reproducibility of the replicates (Additional file 1 : Fig. S13b). The WT meiocyte transcriptome showed high correlation with previously published rice meiocyte transcriptomes ( r > 0.85) (Additional file 1 : Fig. S13c). In cmt3b meiocyte, 2,259 and 1,639 genes were respectively up- and downregulated (Fig. 5 a). Similar numbers of differentially expressed genes (DEGs) were detected in the mutant sperm (Fig. 5 a). Upregulated genes in cmt3b meiocyte were enriched for gene transcription function, while upregulated genes in cmt3b sperm were mainly enriched for translational function (Additional file 1 : Fig. S14a, b). A small number of upregulated DEGs were found to associate with hypo-DMRs in the mutant cells (Fig. 5 b, c; Additional file 6 : Table S5). The analysis suggests that CMT3b plays a role in shutdown of transcriptional activity in meiocyte for preparation of meiosis and may contribute to the low translational activity in sperm [ 30 ].

The cmt3b mutation affected non-TE gene expression in meiocyte and sperm. a Number of differentially expressed non-TE genes (pink) and TE-related genes (TEG) (red) in cmt3b mutant meiocyte and sperm relative to wild type (padj < 0.01, FC > 2). The highly expressed genes (TPM > 10) in sperm were filtrated for comparison. b Number of upregulated genes overlapping with hypo-CHG methylated genes (DMG) detected in cmt3b meiocyte and sperm relative to the wild type cells. c Genome browser screenshots of the methylation and expression levels of representative genes of the 148 (left) and 122 (right) genes shown in ( b )

In parallel, we analyzed the egg and zygote transcriptomes of wild type and cmt3a/b plants. Because cmt3a was infertile, we only analyzed cmt3a egg transcriptome. PCA analysis indicated high levels of reproducibility of the replicates. The cmt3b egg and zygote transcriptomes were close to, but distinct from, the wild type (Additional file 1 : Fig. S13b). By contrast, the cmt3a egg transcriptome was largely distal from the wild type (Additional file 1 : Fig. S13b), consistent with the drastic effect of cmt3a mutation on mCHG in egg. There were in total 4,868 upregulated genes (> two-fold, Q < 0.01), of which 1,648 were hypomethylated at CHG sites in cmt3a egg (Additional file 1 : Fig. S12b, c). Interestingly, among the up-regulated genes, 1,461 were TEGs, of which 982 (67.2%) were hypomethylated at CHG sites in cmt3a egg (Additional file 1 : Fig. S12b, c). The analysis indicates that CMT3a-mediated mCHG is required mainly for TEGs repression in egg, consistent with previous results showing that cmt3a mutation resulted in burst of TE expression [ 26 ]. The cmt3b mutation resulted in totally 3,022 upregulated genes (> two-fold, Q < 0.01) in egg, of which few were TEGs and only 120 (including 26 TEG) were identified as hypo-CHG methylated genes in the mutant egg (Additional file 1 : Fig. S12b, c), indicating that unlike cmt3a , the cmt3b mutation de-repressed mainly non-TE-related genes, which was likely independent of a clear loss of mCHG. However, about 40% of the upregulated genes in cmt3b overlapped with those detected in cmt3a eggs (Additional file 1 : Fig. S12d), suggesting that both CMT3 genes are required for gene (mainly non-TEGs) repression in egg. The cmt3b mutation resulted in upregulation of 2,179 and downregulation of 1,870 genes in zygote (Additional file 1 : Fig. S12b). Similar to that observed in cmt3b egg, relatively few upregulated genes were TEGs or showed hypo mCHG in cmt3b zygote (Additional file 1 : Fig. S12c).

CMT3b represses zygotic genes in egg cells

In wild type zygote we identified 1804 down- and 2628 upregulated (> two-fold, Q < 0.01) genes relative to egg (Fig. 6 a). Among the upregulated genes, 416 overlapped with previously identified genes expressed in rice zygotes (Fig. 6 b) [ 31 ]. The 416 genes were enriched for chromatin replication and cell division functions (Fig. 6 c), consistent with zygotic genome activation to promote zygote cell division in plants [ 32 ]. Within the 416 zygotic genes, 83 were upregulated in cmt3b egg. By contrast, although the cmt3a mutation caused a larger number of upregulated genes in egg, only 39 were among the 416 zygotic genes (Fig. 6 d). Among the 83 zygotic genes upregulated in cmt3b egg were those encode histones, chromatin proteins (HMGs, SMC2, TOP2), DNA methyltransferases (MET1 and CMT3a), transcription factors (E2F, HAP3, NAC), and cell division-related proteins (Fig. 6 e, f, g, Additional file 7 : Table S6). The analysis suggested that CMT3b has a function to repress zygote gene expression program in egg.

The cmt3b mutation de-represses zygote-expressed genes in egg. a Number of differentially expressed genes in wild type zygote relative to egg (FC > 2, p < 0.01). b Overlaps of high zygotic expression gene relative to egg in NIP, DJ and ZH11 varieties (Anderson et al., 2017; Zhou et al., 2021). c GO enrichment of zygote-expressed genes detected in three cultivars. d Transcript heatmaps of the 416 zygote-expressed genes in cmt3b egg (3b-E) and zygote (3b-Z) compared with wild type egg (E) and zygote (Z). e Transcript heatmaps of several representative zygote genes upregulated in cmt3b egg. f GO enrichment of the 83 upregulated genes in cmt3b egg shown in ( d ). g Integrative Genomics Viewer screenshots of six examples of the genes described in ( d )

Non-CG methylation dynamics during rice male gametogenesis

Previous results showed that DNA methylation in plants is partially remodeled or reconfigured in male and female gametes [ 12 , 13 , 14 , 15 , 16 , 17 , 19 ]. In this work, we provided evidence that non-CG (mainly CHG) methylations are dynamically remodeled during rice male gametogenesis with the highest levels observed in microspore and the lowest levels in sperm. It is known that during male germline development there is a cell cycle arrest [ 33 ], which could account for the reduced mCHG and mCHH in sperm. Unlike in Arabidopsis meiocyte that has higher mCG but lower mCHH than in somatic tissues [ 18 ], DNA methylation levels in rice male meiocyte were similar to somatic tissues (Additional file 1 : Fig. S15), suggesting that the remodeling process likely starts after meiosis in rice. Although the underlying function of the dynamic remodeling of mCHG (i.e. first increased in microspore then reduced in sperm) remains to be further explored, association of substantial numbers of genes with hypo mCHG in meiocyte and sperm relative to microspore of hundreds of genes (enriched for translational function) suggest that the remodeling is linked to gene expression dynamics during male gametogenesis. This is supported by the identification SLM, which targets mainly genes that are repressed in sperm. Alternatively, the remodeling may associate with chromatin changes during male gametogenesis.

The increase of mCHG observed in microspore was unexpected, as the methylation levels would be reduced after the meiotic cell divisions. The enhanced mCHG may have functional significance in microspore development. It remains to be studied whether the increased mCHG is related to chromatin reorganization of the haploid genome that was shown to be activated before the bicellular microspore stage in cereals [ 34 ]. As hyper-methylated genes are enriched for ribonucleoprotein complex (such as mRNA-binding) and translation proteins in microspore, the mCHG increase may be also related to repression of genes involved in transcriptional and translational activities that are shutdown during meiosis or male gametogenesis [ 26 ].

The decrease of mCHG in sperm may be required for sperm development and/or reshaping the sperm chromatin that lacks the compact silent center (CSC) found in egg and zygote chromatin 3D structures [ 21 ], and contains specific histone variants and modification patterns [ 35 , 36 , 37 , 38 , 39 ]. Alternatively, the decreased mCHG might facilitate the decondensation of the sperm chromatin upon fusion with the egg cell nucleus after fertilization, allowing initiation of transcription from the paternal genome [ 40 ].

The remodeling of non-CG methylation in rice sperm seemed to differ from Arabidopsis sperm where mCHH is lost or largely reduced [ 12 ]. This may be due to a different mCHH landscape in the rice genome, which is mainly scattered in genic regions [ 4 , 5 ]. In fact, in rice sperm mCHH was found to be reduced in genic regions, but appeared to be enhanced in TE-rich pericentromeric regions (Additional file 1 : Fig. S16a, b). The increased mCHH at pericentromeric region in sperm may be indirectly promoted by DNA demethylation at TE and repetitive sequence in the pollen vegetative cell in rice [ 16 ]. Alternatively, siRNAs produced by microspore may be inherited into sperm, in which they target heterochromatin LTR retrotransposon silencing through the RdDM pathway. In addition, the reduced genic mCHG and mCHH may also involve DNA demethylases that were shown to function in rice sperm [ 17 ].

Distinct functions of CMT3a /b in reproductive cells

CMT3a is the major CHG methyltransferase in rice. CMT3a expression was detected in meiocyte, but barely in sperm (Fig. 2 a), which may contribute to the low mCHG levels in sperm cells. However, CMT3a-mediated mCHG appeared essential for post-meiotic development, as the cmt3a mutation that eliminated mCHG in meiocyte stopped the pollen development at bicellular microspore stage. The expression pattern and mutation effects indicate that CMT3b plays an important role in increasing mCHG in the microspore in addition to complementing CMT3a for mCHG maintenance in meiocyte and sperm. However, the present data suggest that CMT3b preferentially targets protein-coding genes for methylation and is required for repression of transcriptional and/or translational genes during male gametogenesis.

The present work indicated CMT3a/b also have distinct functions in egg. CMT3a plays the major role in silencing TEGs, whereas CMT3b is required for repression of zygote gene expression program in egg. The repression of cell vision genes by CMT3b in egg is consistent with previous results showing that rice and maize egg cells are almost devoid of transcripts encoding histone proteins [ 31 , 41 ]. The cmt3b effects on zygote mCHG and gene expression indicate that CMT3b participates in zygotic genome reprograming by reestablishing mCHG methylation and by regulating gene expression. In conclusion, the work reveals non-CG DNA methylation dynamics during male gametogenesis and distinguishes CMT3a/b functions in mCHG and gene expression in reproductive cells in rice.

Conclusions

The work shows that during male gametogenesis mCHG level is first enhanced in microspores but subsequently reduced to the lowest level in sperm. CMT3a is the main CHG methyltransferase in somatic cells and meiocyte to silence TE and TE-related genes, while CMT3b is required for the surge of mCHG in microspore to repress transcription and translation-related genes, and is involved in the establishment of the zygotic epigenome after fertilization. The histone H3K9me2 demethylases JMJ706 and JMJ707 contribute to the reduction of mCHG in sperm. This study reveals a dynamic remodeling of mCHG during male gametogenesis, which has functional significance for pollen development and fertilization.

Plant materials and growth conditions

Rice variety Zhonghua11 (ZH11) ( Oryza sativa spp. japonica ) was used for transformation of CMT3a , CMT3b , and JMJ706/707 CRISPR/Cas9 vectors in this study. Single-guide RNAs (sgRNA) of CRISPR/Cas9 system were designed as previously reported [ 42 ]. The sgRNA target sequences of CMT3a , CMT3b , and JMJ706/707 as well as primers for genotyping are listed in Additional file 8: Table S7. Mutations in CMT3a , CMT3b and JMJ706/707 were decoded by DSDecodeM ( http://skl.scau.edu.cn/dsdecode/ ) [ 43 , 44 ]. Cas9-free transgenic plants from T3-T4 segregating populations were utilized for phenotypic analysis and isolation of reproductive cells. For field growth, germinated rice seedlings were planted in Wuhan from May to October. For greenhouse growth, germinated rice seedlings were planted in soil-filled boxes under a 14-h light/10-h dark cycle at temperatures of 32 °C (in light) and 26 °C (in dark).

Sperm, meiocyte and unicellular microspore isolation

Rice sperm, meiocyte and unicellular microspore were isolated from anthers of Cas9-free transgenic plants and tissue culture-regenerated wild type plants. For sperm isolation, mature anthers were soaked in 45% (w/v) sucrose and then transferred into 15% (w/v) sucrose to release sperm pairs. For isolation of meiocytes, panicles of about 5 cm length were chosen and middle parts of the panicles were collected. Intact anthers were carefully selected using dissecting needles, placed onto an RNAase-free slide, and suspended in a PBS solution containing 1% proteinase inhibitors. The outer wall of anthers was then removed by capillary to release meiocytes. Isolated meiocytes were checked under a light microscope (Additional file 1 : Fig. S1a). For isolation of unicellular microspores, panicles of about 8 cm length were harvested and the middle parts of the panicles were collected. Anthers were dissected out and made a hole with a capillary tube in the same solution as used for meiocyte isolation. Squeeze gently from the opposite side of the hole to release unicellular microspore. Isolated microspores were checked under a light unicellular microscope (Additional file 1 : Fig. S1b). All isolated cells were collected by a micromanipulator system (Eppendorf, TransferMan® 4r).

Egg and unicellular zygote isolation

Rice eggs and unicellular zygotes were isolated from ovules before and after fertilization of Cas9-free transgenic plants and tissue culture-regenerated wild type [ 17 , 21 ]. Briefly, ovaries of non-pollinated and pollinated florets (about 6.5 h after pollination) were manually collected in RNAase-free water under a dissection microscope. The collected ovules were transferred to a solution containing 0.53 M mannitol and 1% proteinases inhibitors to release eggs or zygotes. All isolated cells were stained by FDA (Fluoresceinc diacetate, Sangon, 596–09-8) and collected by a micromanipulator system (Eppendorf, TransferMan® 4r).

RNA-seq and BS-seq library construction

For RNA-seq library construction, mRNA isolated from different types of cells were reverse transcribed and amplified by using a Single Cell Full Length mRNA Amplification Kit (Vazyme, Cat.# N712) according to manufacturer’s instruction. cDNAs were sheared into 200–400 bp DNA fragments followed by purification using VAHTS DNA Clean Beads (Vazyme, Cat.# N411). The rest steps were performed using the TruePrep DNA Library Prep Kit V2 for Illumina (Vazyme, Cat.# TD502). About 3000 sperm cells and 400 meiocytes were used to construct the transcriptome libraries. Fifty cells of egg or unicellular zygote were used for each replicate of the transcriptome library construction. BS-seq libraries were constructed using reported protocol [ 23 ]. About 300–400 cells were pooled for each replicate for sperm, meicoyte and unicellular microspore bisulfite seq library construction and fifty cells of egg or unicellular zygote were used to construct bisulfite seq library.

Semi-thin section

Proper florets were gathered and 50% FAA (50 ml absolute ethanol, 10 ml 37% formaldehyde solution, 5 ml glacial acetic acid, add double distilled water to 100 ml) was used to fix samples. Semi-thin embedding were performed using Herau Kulzer Technovit 7100 resin. Briefly, the materials were first transferred to a gradient alcohol solution (70% for 4 h at 4° C, 85% for 1 h at room temperature, 95% for 1 h at room temperature, 100% for 2 h at room temperature), then immersed separately in a pre-infiltration solution (equal parts of 96% or absolute ethanol and base liquid Technovit 7100) and an infiltration solution (1 g hardener I dissolved in 100 ml base liquid) for two hours. The treated materials were embedded in a 65° C film stand with polymerisation solution (1 ml hardener is added with the help of a pipette and stirred into 15 ml of preparation solution) and then sliced.

RNA-seq data analysis

First, raw RNA-seq data were cleaned using fastp software to remove connectors and filter low-quality reads [ 45 ]. Clean read was then matched to the MSU7.0 rice reference genome using hisat2 software [ 46 ]. FeatureCounts were used for quantitative analysis and the differentially expressed genes (Fold change > 2, q < 0.01) were calculated by DESeq2 [ 47 ].

BS-seq data analysis

Fastp software was used to remove connectors and filter low-quality reads from the acquired raw BS-seq data. Clean reads were mapped to the MSU7.0 rice genome. Bismark software was used to match, deduplicate and extract methylation sites [ 48 ]. To obtain more loci for analysis, we combined the two biological replicates. Duplicates were removed and uniquely mapped reads were retained and each cytosine covered by at least five reads for further analysis. To avoid methylation being strongly influenced by single cytosine sites, the methylation level of each cytosine is calculated separately and then averaged for all cytosines to represent the methylation level for each bin (bin methylation level = (sum of individual cytosine methylations) / (number of cytosines within 100 bp).

To identify differential methylated regions (DMRs), the whole genome was divided into 100-bp bins. Bins that contained at least five cytosines each and every cytosine with at least a five-fold coverage were retained, absolute methylation difference of 0.5, 0.3, and 0.1 for CG, CHG, and CHH, respectively, and P values < 0.01 (Fisher’s exact test) were considered as DMRs. Neighboring DMRs within 200 bp were merged.

TE gene and gene Annotation information was downloaded from the Rice Genome Annotation project ( http://rice.uga.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/all.locus_brief_info.7.0 ). Differentially methylated genes (DMG) were identified using bedtools software [ 49 ], filtrated by > 80% of overlap between DMRs and genes.

Density plots show the frequency distribution of DNA methylation differences between 50-bp window of two samples with at least 20 informative sequenced cytosines in both samples and 70% CG, 30% CHG, or 10% CHH methylation in either of the samples as previously described [ 16 ].

Identification of sexual lineage-specific methylation loci

The previously reported method [ 18 ] was used to identify SLH and SLM in rice male sex cells. Briefly, Average sex cell mCG and mCHH levels within 100 bp bins were calculated from meiocytes, microspores and sperm, and average sex cell mCHG levels were calculated from meiocytes and microspores. Fractional methylation in 100 bp windows across the genome was compared between an average of selected sex cells (SexAV) and somatic tissues (Seedling) (Diff = SexAV—Seedling). The total methylation level was significantly different (Fisher's exact test p -value < 0.01), and the methylation level of all sex cell replicates was higher than that of all somatic tissues and selected windows meeting the following criterion: Diff_CG > 0 & Diff_CHG > 0.05 & Diff_CHH > 0.1 & (Diff_CG + Diff_CHG + Diff_CHH) > 0.4.

The refined list of SLHs (555 loci) was then separated into two groups based on the level of CHH/G methylation in somatic tissues: 1) SLMs with CHH and CHG methylation lower than 0.05 and 0.1, respectively, in somatic tissues (215 loci). 2) canonical SLHs with CHH methylation higher than 0.05 or CHG methylation higher than 0.1, in somatic tissues (340 loci) [ 18 ].

Availability of data and materials

Genes sequence data from this article can be found in the Rice Genome Annotation Project website under the following accession numbers: CMT3a, LOC_Os10g01570, CMT3b, LOC_Os03g12570, JMJ706, LOC_Os10g42690, JMJ707, LOC_Os02g46930. All high throughput data in support of the finding of this study deposited to the Gene Expression Omnibus (GEO) under the accession number GSE235680 [ 50 ]. The transcription profiles of reported rice sperm were downloaded from Gene Expression Omnibus (accession no. GSE50777) [ 51 ]. The rice central cells and vegetative cells BS-seq data was under the accession number GSE89789 [ 52 ] and GSE126791 [ 53 ]. The rice cmt2 and drm2 mutant BS-seq data was under the accession number GSE138705 [ 54 ]. No other scripts and software were used other than those mentioned in the Methods section.

Law JA, Jacobsen SE. Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nat Rev Genet. 2010;11:204–20.

Article CAS PubMed PubMed Central Google Scholar

Kim MY, Zilberman D. DNA methylation as a system of plant genomic immunity. Trends Plant Sci. 2014;19:320–6.

Article CAS PubMed Google Scholar

Lloyd JPB, Lister R. Epigenome plasticity in plants. Nat Rev Genet. 2022;23:55–68.

Tan F, Zhou C, Zhou Q, Zhou S, Yang W, Zhao Y, Li G, Zhou DX. Analysis of chromatin regulators reveals specific features of rice DNA methylation pathways. Plant Physiol. 2016;171:2041–54.

Zemach A, McDaniel IE, Silva P, Zilberman D. Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science. 2010;328:916–9.

Ma X, Xing F, Jia Q, Zhang Q, Hu T, Wu B, Shao L, Zhao Y, Zhang Q, Zhou DX. Parental variation in CHG methylation is associated with allelic-specific expression in elite hybrid rice. Plant Physiol. 2021;186:1025–41.

Guo F, Yan L, Guo H, Li L, Hu B, Zhao Y, Yong J, Hu Y, Wang X, Wei Y, et al. The transcriptome and DNA methylome landscapes of human primordial germ cells. Cell. 2015;161:1437–52.

Seisenberger S, Andrews S, Krueger F, Arand J, Walter J, Santos F, Popp C, Thienpont B, Dean W, Reik W. The dynamics of genome-wide DNA methylation reprogramming in mouse primordial germ cells. Mol Cell. 2012;48:849–62.

Bergman Y, Cedar H. DNA methylation dynamics in health and disease. Nat Struct Mol Biol. 2013;20:274–81.

Li C, Fan Y, Li G, Xu X, Duan J, Li R, Kang X, Ma X, Chen X, Ke Y, et al. DNA methylation reprogramming of functional elements during mammalian embryonic development. Cell Discov. 2018;4:41.

Article PubMed PubMed Central Google Scholar

Wang X, Bhandari RK. DNA methylation dynamics during epigenetic reprogramming of medaka embryo. Epigenetics. 2019;14:611–22.

Calarco JP, Borges F, Donoghue MT, Van Ex F, Jullien PE, Lopes T, Gardner R, Berger F, Feijó JA, Becker JD, Martienssen RA. Reprogramming of DNA methylation in pollen guides epigenetic inheritance via small RNA. Cell. 2012;151:194–205.

Ibarra CA, Feng X, Schoft VK, Hsieh TF, Uzawa R, Rodrigues JA, Zemach A, Chumak N, Machlicova A, Nishimura T, et al. Active DNA demethylation in plant companion cells reinforces transposon methylation in gametes. Science. 2012;337:1360–4.

Hsieh PH, He S, Buttress T, Gao H, Couchman M, Fischer RL, Zilberman D, Feng X. Arabidopsis male sexual lineage exhibits more robust maintenance of CG methylation than somatic tissues. Proc Natl Acad Sci U S A. 2016;113:15132–7.

Li C, Xu H, Fu FF, Russell SD, Sundaresan V, Gent JI. Genome-wide redistribution of 24-nt siRNAs in rice gametes. Genome Res. 2020;30:173–84.

Kim MY, Ono A, Scholten S, Kinoshita T, Zilberman D, Okamoto T, Fischer RL. DNA demethylation by ROS1a in rice vegetative cells promotes methylation in sperm. Proc Natl Acad Sci U S A. 2019;116:9652–7.

Zhou S, Li X, Liu Q, Zhao Y, Jiang W, Wu A, Zhou DX. DNA demethylases remodel DNA methylation in rice gametes and zygote and are required for reproduction. Mol Plant. 2021;14:1569–83.

Walker J, Gao H, Zhang J, Aldridge B, Vickers M, Higgins JD, Feng X. Sexual-lineage-specific DNA methylation regulates meiosis in Arabidopsis. Nat Genet. 2018;50:130–7.

Liu Q, Ma X, Li X, Zhang X, Zhou S, Xiong L, Zhao Y, Zhou D-X. Paternal DNA methylation is remodeled to maternal levels in rice zygote. Nat Commun. 2023;14:6571.

Collado-Romero M, Alós E, Prieto P. Unravelling the proteomic profile of rice meiocytes during early meiosis. Front Plant Sci. 2014;5:356.

Zhou S, Jiang W, Zhao Y, Zhou DX. Single-cell three-dimensional genome structures of rice gametes and unicellular zygotes. Nat Plants. 2019;5:795–800.

Jiang P, Lian B, Liu C, Fu Z, Shen Y, Cheng Z, Qi Y. 21-nt phasiRNAs direct target mRNA cleavage in rice male germ cells. Nat Commun. 2020;11:5191.

Clark SJ, Smallwood SA, Lee HJ, Krueger F, Reik W, Kelsey G. Genome-wide base-resolution mapping of DNA methylation in single cells using single-cell bisulfite sequencing (scBS-seq). Nat Protoc. 2017;12:534–47.

Park K, Kim MY, Vickers M, Park JS, Hyun Y, Okamoto T, Zilberman D, Fischer RL, Feng X, Choi Y, Scholten S. DNA demethylation is initiated in the central cells of Arabidopsis and rice. Proc Natl Acad Sci U S A. 2016;113:15138–43.

Hu D, Yu Y, Wang C, Long Y, Liu Y, Feng L, Lu D, Liu B, Jia J, Xia R, et al. Multiplex CRISPR-Cas9 editing of DNA methyltransferases in rice uncovers a class of non-CG methylation specific for GC-rich regions. Plant Cell. 2021;33:2950–64.

Cheng C, Tarutani Y, Miyao A, Ito T, Yamazaki M, Sakai H, Fukai E, Hirochika H. Loss of function mutations in the rice chromomethylase OsCMT3a cause a burst of transposition. Plant J. 2015;83:1069–81.

Lister R, O’Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell. 2008;133:523–36.

Fang J, Jiang J, Leichter SM, Liu J, Biswal M, Khudaverdyan N, Zhong X, Song J. Mechanistic basis for maintenance of CHG DNA methylation in plants. Nat Commun. 2022;13:3877.

Sun Q, Zhou DX. Rice jmjC domain-containing gene JMJ706 encodes H3K9 demethylase required for floral organ development. Proc Natl Acad Sci U S A. 2008;105:13679–84.

Idler RK, Hennig GW, Yan W. Bioinformatic identification of novel elements potentially involved in messenger RNA fate control during spermatogenesis. Biol Reprod. 2012;87:138.

Anderson SN, Johnson CS, Chesnut J, Jones DS, Khanday I, Woodhouse M, Li C, Conrad LJ, Russell SD, Sundaresan V. The zygotic transition is initiated in unicellular plant zygotes with asymmetric activation of parental genomes. Dev Cell. 2017;43:349-358 e344.

Dresselhaus T, Jürgens G. Comparative embryogenesis in angiosperms: activation and patterning of embryonic cell lineages. Annu Rev Plant Biol. 2021;72:641–76.

Borges F, Donoghue MTA, LeBlanc C, Wear EE, Tanurdzic M, Berube B, Brooks A, Thompson WF, Hanley-Bowdoin L, Martienssen RA. Loss of Small-RNA-Directed DNA Methylation in the Plant Cell Cycle Promotes Germline Reprogramming and Somaclonal Variation. Curr Biol. 2021;31:591-600 e594.

Nelms B, Walbot V. Gametophyte genome activation occurs at pollen mitosis I in maize. Science. 2022;375:424–9.

Houben A, Kumke K, Nagaki K, Hause G. CENH3 distribution and differential chromatin modifications during pollen development in rye (Secale cereale L.). Chromosome Res. 2011;19:471–80.

Borg M, Jacob Y, Susaki D, LeBlanc C, Buendia D, Axelsson E, Kawashima T, Voigt P, Boavida L, Becker J, et al. Targeted reprogramming of H3K27me3 resets epigenetic memory in plant paternal chromatin. Nat Cell Biol. 2020;22:621–9.

Huang X, Sun MX. H3K27 methylation regulates the fate of two cell lineages in male gametophytes. Plant Cell. 2022;34:2989–3005.

Ingouff M, Rademacher S, Holec S, Soljic L, Xin N, Readshaw A, Foo SH, Lahouze B, Sprunck S, Berger F. Zygotic resetting of the HISTONE 3 variant repertoire participates in epigenetic reprogramming in Arabidopsis. Curr Biol. 2010;20:2137–43.

Buttress T, He S, Wang L, Zhou S, Saalbach G, Vickers M, Li G, Li P, Feng X. Histone H2B.8 compacts flowering plant sperm through chromatin phase separation. Nature. 2022;611:614–22.

Scholten S, Lörz H, Kranz E. Paternal mRNA and protein synthesis coincides with male chromatin decondensation in maize zygotes. Plant J. 2002;32:221–31.

Chen J, Strieder N, Krohn NG, Cyprys P, Sprunck S, Engelmann JC, Dresselhaus T. Zygotic genome activation occurs shortly after fertilization in maize. Plant Cell. 2017;29:2106–25.

He Y, Zhang T, Yang N, Xu M, Yan L, Wang L, Wang R, Zhao Y. Self-cleaving ribozymes enable the production of guide RNAs from unlimited choices of promoters for CRISPR/Cas9 mediated genome editing. J Genet Genomics. 2017;44:469–72.

Liu W, Xie X, Ma X, Li J, Chen J, Liu YG. DSDecode: A Web-Based Tool for Decoding of Sequencing Chromatograms for Genotyping of Targeted Mutations. Mol Plant. 2015;8:1431–3.

Xie X, Ma X, Liu YG. Decoding Sanger Sequencing Chromatograms from CRISPR-Induced Mutations. Methods Mol Biol. 2019;1917:33–43.

Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90.

Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37:907–15.

Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.

Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011;27:1571–2.

Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.

Li X, Zhu B, Lu Y, Zhao F, Liu Q, Wang J, Ye M, Chen S, Nie J, Xiong L, et al: DNA methylation remodeling and the functional implication during male gametogenesis in rice. 2024, GSE235680. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/gds/?term=GSE235680

Anderson SN, Johnson CS, Jones DS, Conrad LJ, Gou X, Russell SD, Sundaresan V: Transcriptomes of isolated Oryza sativa gametes characterized by deep sequencing: evidence for distinct sex-dependent chromatin and epigenetic states before fertilization. 2013, GSE50777. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/gds/?term=GSE50777 .

Park K, Kim MY, Vickers M, Park JS, Hyun Y, Okamoto T, Zilberman D, Fischer RL, Feng X, Choi Y, Scholten S: DNA demethylation is initiated in the central cells of Arabidopsis and rice. 2016, GSE89789 . Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/gds/?term=GSE89789 .

Kim MY, Ono A, Scholten S, Kinoshita T, Zilberman D, Okamoto T, Fischer RL: DNA demethylation by ROS1a in rice vegetative cells promotes methylation in sperm. 2019, GSE126791. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/gds/?term=GSE126791 .

Hu D, Yu Y, Wang C, Long Y, Liu Y, Feng L, Lu D, Liu B, Jia J, Xia R, et al: Multiplex CRISPR-Cas9 editing of DNA methyltransferases in rice uncovers a class of non-CG methylation specific for GC-rich regions. Plant Cell 2021, GSE138705. Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/gds/?term=GSE138705 .

Download references

Acknowledgements

We thank Mr. Hao Liu from the National Key Laboratory of Crop Genetic Improvement for essential help in managing the high-throughput computing clusters.

Peer review information

Wenjing She was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history

The review history is available as Additional file 9 .

The work was supported by National Natural Science Foundation of China [32070563, 31730049], the Fundamental Research Funds for the Central Universities [2662023SKPY002], and the Agence Nationale de la Recherche (LANDSREC, ANR- 21-CE20-0012–01).

Author information

Xue Li and Bo Zhu contributed equally to this work.

Authors and Affiliations

National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, 430070, China

Xue Li, Bo Zhu, Feng Zhao, Qian Liu, Jiahao Wang, Miaomiao Ye, Siyuan Chen, Lizhong Xiong, Yu Zhao, Changyin Wu & Dao-Xiu Zhou

Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, College of Agriculture, Yangzhou University, Yangzhou, 225009, China

Vazyme Biotech Co., Ltd, Nanjing, 210000, China

Institute of Plant Science Paris-Saclay (IPS2), CNRS, INRAE, Université Paris-Saclay, 91405, Orsay, France

Dao-Xiu Zhou

You can also search for this author in PubMed Google Scholar

Contributions

XL produced the mutants, isolated the cells and participated in data production, BZ produced and analyzed the data; YL, FZ and QL participated in data analysis; JW, MY and SC participated in material production; LX, CW participated in the project design and YZ participated in supervision and management of the project; DXZ coordinated the project, analyzed the data and wrote the paper.

Corresponding author

Correspondence to Dao-Xiu Zhou .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

Junwei Nie claims competing interests, the remaining authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: figure s1..

Isolation of rice meiocytes and unicellular microspores and quality control of rice male germ cells BS-seq. Figure S2. DNA methylation levels in reproduction cells compared with somatic tissues. Figure S3. Analysis of microspore hyper DMRs relative to meiocyte and sperm. Figure S4. Analysis of mCHG genes in unicellular microspore. Figure S5. Effects of cmt 3 a and cmt 3 b in pollen development. Figure S6. CMT3b maintains high mCHG level in microspore. Figure S7. Analysis had no effect on mCHG in cmt3a mutant sperm cells. Figure S8. Production and phenotypic analysis of jmj706/707 double mutants. Figure S9. Average methylation level of TEG and genes in jmj706/707 meiocyte and sperm cells compared with wild type. Figure S10. Effect of the cmt3a/b and jmj706/7 mutations on sexual lineage-specific methylation. Fig. S11. Analyze the Canonical SLH and lineage-specific methylation (SLM). Figure S12. Effects of cmt3a and cmt3b mutations in zygote and/ or egg DNA methylation. Figure S13. Transcriptome data analysis. Figure S14. Enrichment of upregulated genes in cmt3b meiocyte and sperm. Figure S15. DNA methylation levels in rice reproductive cells and seedlings. Figure S16. DNA CHH methylation landscape in rice reproductive cells and seedlings.

Additional file 2: Table S1.

Summary of BS-seq data.

Additional file 3: Table S2.

Summary of RNA-seq data.

Additional file 4: Table S3.

315 genes annotated by Cluster A.

Additional file 5: Table S4.

132 genes annotated by SLM.

Additional file 6: Table S5.

CMT3b regulates gene in meiocyte and sperm.

Additional file 7: Table S6.

83 zygotic genes upregulated in cmt3b egg.

Additional file 8: Table S7.

Primers used in this study.

Additional file 9.

Review history.

Rights and permissions

Reprints and permissions

About this article

Cite this article.

Li, X., Zhu, B., Lu, Y. et al. DNA methylation remodeling and the functional implication during male gametogenesis in rice. Genome Biol 25 , 84 (2024). https://doi.org/10.1186/s13059-024-03222-w

Download citation

Received : 09 August 2023

Accepted : 25 March 2024

Published : 02 April 2024

DOI : https://doi.org/10.1186/s13059-024-03222-w

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Rice male gametogenesis
DNA methylation

Genome Biology

ISSN: 1474-760X

Submission enquiries: [email protected]
General enquiries: [email protected]

Office of Neuroscience Research

Thesis Defense: Ellie Wilson (Molecular Genetics and Genomics Program) – “Integrating DNA methylation and 3D-genome architecture to identify functional regulatory sequences in IDH mutant AML”

“Integrating DNA methylation and 3D-genome architecture to identify functional regulatory sequences in IDH mutant AML”

Thesis lab: David Spencer (WashU Medicine)

For inquiries contact Ellie at [email protected] .

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Publications
Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

Advanced Search
Journal List
v.5(4); Jul-Aug 2020

Bactopia: a Flexible Pipeline for Complete Analysis of Bacterial Genomes

Robert a. petit, iii.

a Division of Infectious Diseases, Department of Medicine, Emory University School of Medicine, Atlanta, Georgia, USA

Timothy D. Read

Associated data.

Bactopia Analysis Pipeline workflow. Download FIG S1, PDF file, 0.1 MB .

Results returned after querying ENA for Lactobacillus . Download Data Set S1, TXT file, 2.4 MB .

SRA/ENA accession numbers of experiments processed by Bactopia. Download Data Set S2, TXT file, 0.02 MB .

Nextflow runtime report for Lactobacillus genomes processed by Bactopia. Download Data Set S3, PDF file, 0.1 MB .

Sequencing quality ranks per year from 2011 to 2019 of Lactobacillus genome projects. Genome projects were grouped into three ranks of increasing quality: bronze, silver, and gold. The rank was based on coverage, read length, per-read quality, and total assembled contigs. The highest rank, gold, represented 62% ( n = 967) of the available Lactobacillus genome projects. Of the remaining genomes, 25% ( n = 386) were ranked silver, and 13% ( n = 205) were ranked bronze. Between the years of 2011 and 2019, gold-ranked samples consistently outnumbered silver- and bronze-ranked sampled except for the years 2011 and 2015. However, it is likely that the total number of gold-ranked samples is underrepresented due coverage reduction being based on the estimated genome size. Download FIG S2, PDF file, 0 MB .

Lactobacillus samples excluded from analysis. Download Table S1, DOCX file, 0.02 MB .

Comparison of estimated genome size and assembled genome size. The assembled genome size ( y axis) and estimated genome size ( x axis) were plotted for each sample. The color of the dot is determined by the rank of the sample. The solid black line represents a 1:1 ratio between the assembled genome size and the estimated genome size. The genome size was estimated for each sample by Mash ( 17 ) using the raw sequences. Download FIG S3, PDF file, 0.02 MB .

Assembled genome sizes are consistent within species. Download FIG S4, PDF file, 0 MB .

Samples with a non- Lactobacillus taxonomic classification. Download Table S2, DOCX file, 0.02 MB .

Raw Illumina sequences of Lactobacillus samples used in this study were acquired from experiments submitted under BioProject accession numbers PRJDB1101 , PRJDB1726 , PRJDB4156 , PRJDB4955 , PRJDB5065 , PRJDB5206 , PRJDB6480 , PRJDB6495 , PRJEB10572 , PRJEB11980 , PRJEB14693 , PRJEB18589 , PRJEB19875 , PRJEB21025 , PRJEB21680 , PRJEB22112 , PRJEB22252 , PRJEB23845 , PRJEB24689 , PRJEB24698 , PRJEB24699 , PRJEB24700 , PRJEB24701 , PRJEB24713 , PRJEB24715 , PRJEB25194 , PRJEB2631 , PRJEB26638 , PRJEB2824 , PRJEB29398 , PRJEB29504 , PRJEB2977 , PRJEB3012 , PRJEB3060 , PRJEB31213 , PRJEB31289 , PRJEB31301 , PRJEB31307 , PRJEB5094 , PRJEB8104 , PRJEB8721 , PRJEB9718 , PRJNA165565 , PRJNA176000 , PRJNA176001 , PRJNA183044 , PRJNA184888 , PRJNA185359 , PRJNA185406 , PRJNA185584 , PRJNA185632 , PRJNA185633 , PRJNA188920 , PRJNA188921 , PRJNA212644 , PRJNA217366 , PRJNA218804 , PRJNA219157 , PRJNA222257 , PRJNA224116 , PRJNA227106 , PRJNA227335 , PRJNA231221 , PRJNA234998 , PRJNA235015 , PRJNA235017 , PRJNA247439 , PRJNA247440 , PRJNA247441 , PRJNA247442 , PRJNA247443 , PRJNA247444 , PRJNA247445 , PRJNA247446 , PRJNA247452 , PRJNA254854 , PRJNA255080 , PRJNA257137 , PRJNA257138 , PRJNA257139 , PRJNA257141 , PRJNA257142 , PRJNA257182 , PRJNA257185 , PRJNA257853 , PRJNA257876 , PRJNA258355 , PRJNA258500 , PRJNA267549 , PRJNA269805 , PRJNA269831 , PRJNA269832 , PRJNA269860 , PRJNA269905 , PRJNA270961 , PRJNA270962 , PRJNA270963 , PRJNA270964 , PRJNA270965 , PRJNA270966 , PRJNA270967 , PRJNA270968 , PRJNA270969 , PRJNA270970 , PRJNA270972 , PRJNA270973 , PRJNA270974 , PRJNA272101 , PRJNA272102 , PRJNA283920 , PRJNA289613 , PRJNA29003 , PRJNA291681 , PRJNA296228 , PRJNA296248 , PRJNA296274 , PRJNA296298 , PRJNA296309 , PRJNA296751 , PRJNA296754 , PRJNA298448 , PRJNA299992 , PRJNA300015 , PRJNA300023 , PRJNA300088 , PRJNA300119 , PRJNA300123 , PRJNA300179 , PRJNA302242 , PRJNA303235 , PRJNA303236 , PRJNA305242 , PRJNA306257 , PRJNA309616 , PRJNA312743 , PRJNA315676 , PRJNA316969 , PRJNA322958 , PRJNA322959 , PRJNA322960 , PRJNA322961 , PRJNA336518 , PRJNA342061 , PRJNA342757 , PRJNA347617 , PRJNA348789 , PRJNA376205 , PRJNA377666 , PRJNA379934 , PRJNA381357 , PRJNA382771 , PRJNA388578 , PRJNA392822 , PRJNA397632 , PRJNA400793 , PRJNA434600 , PRJNA436228 , PRJNA474823 , PRJNA474907 , PRJNA476494 , PRJNA477598 , PRJNA481120 , PRJNA484967 , PRJNA492883 , PRJNA493554 , PRJNA496358 , PRJNA50051 , PRJNA50053 , PRJNA50055 , PRJNA50057 , PRJNA50059 , PRJNA50061 , PRJNA50063 , PRJNA50067 , PRJNA50115 , PRJNA50117 , PRJNA50125 , PRJNA50133 , PRJNA50135 , PRJNA50137 , PRJNA50139 , PRJNA50141 , PRJNA50159 , PRJNA50161 , PRJNA50163 , PRJNA50165 , PRJNA50167 , PRJNA50169 , PRJNA50173 , PRJNA504605 , PRJNA504734 , PRJNA505088 , PRJNA52105 , PRJNA52107 , PRJNA52121 , PRJNA525939 , PRJNA530250 , PRJNA533291 , PRJNA533837 , PRJNA542049 , PRJNA542050 , PRJNA542054 , PRJNA543187 , PRJNA544527 , PRJNA547620 , PRJNA552757 , PRJNA554696 , PRJNA554698 , PRJNA557339 , PRJNA562050 , PRJNA563077 , PRJNA573690 , PRJNA577465 , PRJNA578299 , PRJNA68459 , and PRJNA84 .

Links for the websites and software used in this study are as follows: Bactopia website and documentation, https://bactopia.github.io/ ; Github, https://www.github.com/bactopia/bactopia/ ; Zenodo Snapshot, https://doi.org/10.5281/zenodo.3926909 ; Bioconda, https://bioconda.github.io/recipes/bactopia/README.html ; and the containers Docker, https://cloud.docker.com/u/bactopia/ , and Singularity, https://cloud.sylabs.io/library/rpetit3/bactopia .

It is now relatively easy to obtain a high-quality draft genome sequence of a bacterium, but bioinformatic analysis requires organization and optimization of multiple open source software tools. We present Bactopia, a pipeline for bacterial genome analysis, as an option for processing bacterial genome data. Bactopia also automates downloading of data from multiple public sources and species-specific customization. Because the pipeline is written in the Nextflow language, analyses can be scaled from individual genomes on a local computer to thousands of genomes using cloud resources. As a usage example, we processed 1,664 Lactobacillus genomes from public sources and used comparative analysis workflows (Bactopia Tools) to identify and analyze members of the L. crispatus species.

Sequencing of bacterial genomes using Illumina technology has become such a standard procedure that often data are generated faster than can be conveniently analyzed. We created a new series of pipelines called Bactopia, built using Nextflow workflow software, to provide efficient comparative genomic analyses for bacterial species or genera. Bactopia consists of a data set setup step (Bactopia Data Sets [BaDs]), which creates a series of customizable data sets for the species of interest, the Bactopia Analysis Pipeline (BaAP), which performs quality control, genome assembly, and several other functions based on the available data sets and outputs the processed data to a structured directory format, and a series of Bactopia Tools (BaTs) that perform specific postprocessing on some or all of the processed data. BaTs include pan-genome analysis, computing average nucleotide identity between samples, extracting and profiling the 16S genes, and taxonomic classification using highly conserved genes. It is expected that the number of BaTs will increase to fill specific applications in the future. As a demonstration, we performed an analysis of 1,664 public Lactobacillus genomes, focusing on Lactobacillus crispatus , a species that is a common part of the human vaginal microbiome. Bactopia is an open source system that can scale from projects as small as one bacterial genome to ones including thousands of genomes and that allows for great flexibility in choosing comparison data sets and options for downstream analysis. Bactopia code can be accessed at https://www.github.com/bactopia/bactopia .

IMPORTANCE It is now relatively easy to obtain a high-quality draft genome sequence of a bacterium, but bioinformatic analysis requires organization and optimization of multiple open source software tools. We present Bactopia, a pipeline for bacterial genome analysis, as an option for processing bacterial genome data. Bactopia also automates downloading of data from multiple public sources and species-specific customization. Because the pipeline is written in the Nextflow language, analyses can be scaled from individual genomes on a local computer to thousands of genomes using cloud resources. As a usage example, we processed 1,664 Lactobacillus genomes from public sources and used comparative analysis workflows (Bactopia Tools) to identify and analyze members of the L. crispatus species.

INTRODUCTION

Sequencing a bacterial genome, an activity that once required the infrastructure of a dedicated genome center, is now a routine task that even a small laboratory can undertake. Many open-source software tools have been created to handle various parts of the process of using raw read data for functions such as single nucleotide polymorphism (SNP) calling and de novo assembly. As a result of dedicated community efforts, it has recently become much easier to locally install these bioinformatic tools through package managers (Bioconda [ 1 ] and Brew [ 2 ]) or through the use of software containers (Docker and Singularity). Despite these advances, producers of bacterial sequence data face a bewildering array of choices when considering how to perform analysis, particularly when large numbers of genomes are involved and processing efficiency and scalability become major factors.

Efficient bacterial multigenome analysis has been hampered by three missing functionalities. First is the need to have workflows of workflows' that can integrate analyses and provide a simplified way to start with a collection of raw genome data, remove low-quality sequences, and perform the basic analytic steps of de novo assembly, mapping to reference sequence, and taxonomic assignment. Second is the desire to incorporate user-specific knowledge of the species into the input of the main genome analysis pipeline. While many microbiologists are not expert bioinformaticians, they are experts in the organisms they study. Third is the need to create an output format from the main pipeline that could be used for future customized downstream analysis such as pan-genome analysis and basic visualization of phylogenies.

Here, we introduce Bactopia, an integrated suite of workflows primarily designed for flexible analysis of Illumina genome sequencing projects of bacteria from the same taxon. Bactopia is based on Nextflow workflow software ( 3 ) and is designed to be scalable, allowing projects as small as a single genome to be run on a local desktop or projects including many thousands of genomes to be run as a batch on a cloud infrastructure. Running multiple tasks on a single platform standardizes the underlying data quality used for gene and variant calling between projects run in different laboratories. This structure also simplifies the user experience. In Bactopia, complex multigenome analysis can be run in a small number of commands. However, there are myriad options for fine-tuning data sets used for analysis and the functions of the system. The underlying Nextflow structure ensures reproducibility. To illustrate the functionality of the system, we performed a Bactopia analysis of 1,664 public genome samples of the Lactobacillus genus, an important component of the microbiome of humans and animals.

Design and implementation.

Bactopia links together open-source bioinformatics software, available from Bioconda ( 1 ), using Nextflow ( 3 ). Nextflow was chosen for its flexibility: Bactopia can be run locally, on clusters, or on cloud platforms with simple parameter changes. It also manages the parallel execution of tasks and creates checkpoints allowing users to resume jobs. Nextflow automates installation of the component software of the workflow through integration with Bioconda. For ease of deployment, Bactopia can be installed either through Bioconda, a Docker container, or a Singularity container. All of the software programs used by Bactopia (version 1.4.0) described in the manuscript are listed in Table 1 with their individual version numbers.

TABLE 1

List of bioinformatic tools used by the Bactopia Analysis Pipeline, version 1.4.0

There are three main components of Bactopia ( Fig. 1 ; see also Fig. S1 in the supplemental material). Bactopia Data Sets (BaDs) is a framework for formatting organism-specific data sets to be used by the downstream analysis pipeline. The Bactopia Analysis Pipeline (BaAP) is a customizable workflow for the analysis of individual bacterial genome projects that is an extension and generalization of the previously published Staphylococcus aureus -specific Staphopia Analysis Pipeline (StAP) ( 4 ). The inputs to BaAP are FASTQ files from bacterial Illumina sequencing projects, either imported from the National Centers for Biotechnology Information (NCBI) Short Read Archive (SRA) database or provided locally, and any reference data in the BaDs. Bactopia Tools (BaTs) is a set of workflows that use the output files from a BaAP project to run genomic analysis on multiple genomes. For this project we used BaTs to (i) summarize the results of running multiple bacterial genomes through BaAP, (ii) extract 16S gene sequences and create a phylogeny, (iii) assign taxonomic classifications with the Genome Taxonomy Database (GTDB) ( 5 ), (iv) determine subsets of Lactobacillus crispatus samples by average nucleotide identity (ANI) with FastANI ( 6 ), and (v) run pan-genome analysis for L. crispatus using Roary ( 7 ) and create a core-genome phylogeny.

An external file that holds a picture, illustration, etc.
Object name is mSystems.00190-20-f0001.jpg

Bactopia overview. (a) A general overview of the Bactopia workflow. (b) A detailed diagram of processing pathways within the Bactopia Analysis Pipeline showing optional data set inputs.

FIG S1

Comparison to similar open-source software..

At the time of writing (February 2020), we knew of only three other actively maintained open-source generalist bacterial genomic workflow software programs that encompassed a similar range of functionality to Bactopia: ASA 3 P ( 8 ), TORMES ( 9 ), and the currently unpublished Nullarbor ( 10 ). The versions of these programs used many of the same component software programs (e.g., Prokka, SPAdes, BLAST+, and Roary) but differed in the philosophies underlying their design ( Table 2 ). This made head-to-head runtime comparisons somewhat meaningless as each was aimed at a different analysis scenario and produced a different output. Bactopia was the most open-ended and flexible, allowing the user to customize input databases and providing a platform for downstream analysis by different BaTs rather than built-in pangenome and phylogeny creation. Bactopia also had some features not implemented in the other programs, such as SRA/ENA search and download and automated reference genome selection for identifying variants. Both Bactopia and ASA 3 P are highly scalable, and each can be seamlessly executed on local, cluster, and cloud environments with little effort required by the user. ASA 3 P was the only program to implement long-read assembly of multiple projects. TORMES was the only program to include a user-customizable RMarkdown for reporting and to have optional analyses specifically for Escherichia and Salmonella . Nullarbor was the only program to implement a prescreening method for filtering out potential biological outliers prior to full analysis.

TABLE 2

A comparison of bacterial genome analysis workflows

Use case: the Lactobacillus genus.

We performed a Bactopia analysis of publicly available raw Illumina data labeled as belonging to the Lactobacillus genus. Lactobacillus is an important component of the human microbiome, and cultured samples have been sequenced by several research groups over the past few years. Lactobacillus crispatus and other species are often the majority bacterial genus of the human vagina and are associated with low pH and reduction in pathogen burden ( 11 ). Samples of the genus are used in the food industry for fermentation in the production of yoghurt, kimchi, kombucha, and other common items. Lactobacillus is a common probiotic although recent genome-based transmission studies showed that bloodstream infections can follow after ingestion by immunocompromised patients ( 12 ).

In November 2019, we initiated Bactopia analysis using the following three commands:

An external file that holds a picture, illustration, etc.
Object name is mSystems.00190-20-m0001.jpg

The “bactopia search” subcommand produced a list of accession numbers for 2,030 experiments that had been labeled as “ Lactobacillus ” (taxonomy identifier [taxon ID]: 1578) ( Data Set S1 ). After filtering for only Illumina sequencing, 1,664 accession numbers for experiments remained ( Data Set S2 ).

DATA SET S1

Data set s2.

An external file that holds a picture, illustration, etc.
Object name is mSystems.00190-20-m0003.jpg

Samples were processed on a 96-core SLURM cluster with 512 GB of available RAM. Analysis took approximately 2.5 days to complete, with an estimated runtime of 30 min per sample (determined by adding up the median process runtime, for 17 different processes in total, in BaAP). No individual process used more than 8 GB of memory, with all but five using less than 1 GB. Nextflow ( 3 ) recorded detailed statistics on resource usage, including CPU, memory, job duration, and input-output (I/O). ( Data Set S3 ).

DATA SET S3

Analysis of lactobacillus genomes using bats..

The BaAP outputted a directory of directories named after the unique experiment accession number for each sample. Within each sample directory were subdirectories for the output of each analysis run. These data structures were recognized by BaTs for subsequent analysis.

We used BaT “summary” to generate a summary report of our analysis. The report includes an overview of sequence quality, assembly statistics, and predicted antimicrobial resistances and virulence factors. It also outputs a list of samples that fail to meet minimum sequencing depth and/or quality thresholds.

An external file that holds a picture, illustration, etc.
Object name is mSystems.00190-20-m0008.jpg

BaT “summary” grouped samples as gold, silver, bronze, exclude, or unprocessed, based on BaAP completion, minimum sequencing coverage, per-read sequencing mean quality, minimum mean read length, and assembly quality ( Table 3 ; Fig. S2 ). To be placed in a group, a sample had to meet each cutoff. Cutoffs were based on those used by the Staphopia Analysis Pipeline (StAP) ( 4 ) with the addition of a contig count cutoff. For this analysis we used the default values for these cutoffs to group our samples. Gold samples were defined as those having greater than 100× coverage, per-read mean quality greater than Q30, mean read length greater than 95 bp, and an assembly with fewer than 100 contigs. Silver samples were defined as those having greater than 50× coverage, per-read mean quality greater than Q20, mean read length greater than 75 bp, and an assembly with less than 200 contigs. Bronze samples were defined as those having greater than 20× coverage, per-read mean quality greater than Q12, mean read length greater than 49 bp, and an assembly with fewer than 500 contigs. A total of 106 samples (the exclude and unprocessed groups) were excluded from further analysis ( Table S1 ). Forty-eight samples that failed to meet the minimum thresholds for bronze quality were assigned to the exclude group. Fifty-eight samples that were not processed by BaAP due to sequencing-related errors or because of the estimated genome sizes were grouped as unprocessed. Of these, one (SRA accession no. SRX4526092 ) was labeled as paired end but did not have both sets of reads, one (SRA accession no. SRX1490246 ) was identified to be an assembly converted to FASTQ format, and 14 had insufficient sequencing depth. The remaining 42 samples, unprocessed by BaAP, had an estimated genome size which exceeded 4.2 Mbp (set at runtime). We queried these samples against available GenBank and RefSeq sketches using Mash screen and Sourmash lca gather. There were 36 samples that contained evidence for Lactobacillus but also sequences for other bacterial species, phage, virus, and plant genomes. There were six samples that contained no evidence for Lactobacillus , four of which had matches to multiple bacterial species, and two of which had matches only to Saccharomyces cerevisiae .

TABLE 3

Summary of Lactobacillus genome sequencing projects quality and coverage a

FIG S2

Table s1.

There were 1,558 samples with gold, silver, or bronze quality ( Table 3 ) that were used for further analysis. For these we found that, on average, the assembled genome size was about 12% smaller than the estimated genome size ( Table 3 ; Fig. S3 ). If we assume that the assembled genome size is a better indicator of a sample’s genome size, the average coverage before quality control (QC) increased from 220× to 268×. In this use case, the Lactobacillus genus, it was necessary to estimate genome sizes, but in dealing with samples from a single species, it may be better to provide a known genome size.

FIG S3

For visualization of the phylogenetic relationships of the samples, we used the “phyloflash” and “gtdb” BaTs.

An external file that holds a picture, illustration, etc.
Object name is mSystems.00190-20-m0009.jpg

Maximum-likelihood phylogeny from reconstructed 16S rRNA genes. A phylogenetic representation of 1,470 samples using IQ-Tree ( 28 , – 30 ). (a) A tree of the full set of samples. The outer ring represents the genus assigned by GTDB-Tk, as indicated. (b) The same tree as shown in panel a, but with the non- Lactobacillus clade collapsed. Major groups of Lactobacillus species (indicated with a letter) and the most sequenced Lactobacillus species have been labeled. The inner ring represents the average nucleotide identity (ANI), determined by FastANI ( 6 ), of samples to L. crispatus . The tree was built from a multiple-sequence alignment ( 31 ) of 16S genes reconstructed by phyloFlash ( 25 ) with 1,281 parsimony-informative sites. The likelihood score for the consensus tree constructed from 1,000 bootstrap trees was −54,698. Taxonomic classifications were assigned by GTDB-Tk ( 21 ).

A recent analysis of completed genomes in the NCBI found 239 discontinuous de novo Lactobacillus species using a 94% ANI cutoff ( 33 ). Based on GTDB taxonomic classification, which applies a 95% ANI cutoff, we identified 161 distinct Lactobacillus species in 1,554 samples. The five most sequenced Lactobacillus species, accounting for 45% of the total, were L. rhamnosus ( n = 225), L. paracasei ( n = 180), L. gasseri ( n = 132), L. plantarum ( n = 86), and L. fermentum ( n = 80). Within these five species the assembled genomes sizes were remarkably consistent ( Fig. S4 ). There were 58 samples that were not classified as Lactobacillus , of which 34 were classified as Streptococcus pneumoniae by both 16S gene sequencing and GTDB ( Table S2 ).

FIG S4

Table s2.

We found that 505 (∼33%) of 1,554 taxonomic classifications by 16S gene and GTDB were in conflict with the taxonomy according to the NCBI SRA, illustrating the importance of an unbiased approach to understanding sample context. In samples that had both a 16S and GTDB taxonomic classification, there was disagreement in 154 out of 1,467 samples. Of these, 47% were accounted for by the recently described L. paragasseri ( 34 ) ( n = 72). This possibly highlights a lag in the reclassification of assemblies in the NCBI Assembly database.

Analysis of the pangenome of the entire genus using a tool such as Roary ( 7 ) would return only a few core genes, owing to sequence divergence of evolutionarily distant species. However, because the “roary” BaT can be supplied with a list of individual samples, it is possible to isolate the analysis to the species level. As an example of using BaTs to focus on a particular group within the larger set of results, we chose L. crispatus , a species commonly isolated from the human vagina and also found in the guts/feces of poultry.

An external file that holds a picture, illustration, etc.
Object name is mSystems.00190-20-m0013.jpg

ANI analysis revealed 38 samples as having >96.1% ANI to L. crispatus , with no other sample greater than 83.1%. Four completed L. crispatus genomes were also included in the analysis ( Table 4 ), for a total of 42 genomes. The pan-genome of L. crispatus was revealed to have 7,037 gene families and 972 core genes ( Fig. 3 ). Similar to a recent analysis by Pan et al. ( 40 ), L. crispatus was separated into two main phylogenetic groups, one associated with human vaginal isolates and the other having more mixed provenance and including chicken, turkey, and human gut isolates.

TABLE 4

Lactobacillus crispatus genomes used in pan-genome analysis a

An external file that holds a picture, illustration, etc.
Object name is mSystems.00190-20-f0003.jpg

Core-genome maximum-likelihood phylogeny of Lactobacillus crispatus . A core-genome phylogenetic representation using IQ-Tree ( 28 , – 30 ) of 42 L. crispatus samples. The putatively recombinant positions predicted using ClonalFrameML ( 37 ) were removed from the alignment with maskrc-svg ( 38 ). The tree was built from 972 core genes identified by Roary with 9,209 parsimony-informative sites. The log-likelihood score for the consensus tree constructed from 1,000 bootstrap trees was −1,418,106.

Last, we looked at patterns of antibiotic resistance across the genus using a table, generated by the “summary” BaT, of resistance genes and loci called by AMRFinder+ ( 41 ). Only 79 out of 1,496 Lactobacillus samples defined by GTDB-Tk ( 21 ) were found to have predicted resistance using AMRFinder+. The most common resistance categories were tetracyclines (67 samples), followed by macrolides, lincosamides, and aminoglycosides (16, 15, and 11 samples, respectively). Species with the highest proportion of resistance included L. amylovorus (12/14 tetracycline resistant) and L. crispatus (10/42 tetracycline resistant). Only three genomes of L. amylophilus were included in the study, but each contained matches to genes for macrolide, lincosamide, and tetracycline resistance. The linking thread between these species is that they are each commonly isolated from agricultural animals. The high proportion of L. crispatus samples isolated from chickens that were tetracycline resistant has been previously observed ( 42 , 43 ) ( Fig. 3 ).

A recent analysis of 184 Lactobacillus type strain genomes by Campedelli et al. ( 44 ) found a higher percentage of type strains with aminoglycoside (20/184), tetracycline (18/184), erythromycin (6/184), and clindamycin (60/184) resistance. Forty-two of the type strains had chloramphenicol resistance genes whereas, here, AMRFinder+ returned only 1/1,467 genes. These differences probably reflect a combination of the different sampling biases of the studies and the strategy of Campedelli et al. to use a relaxed threshold for hits to maximize sensitivity (blastp matches against the CARD database with acid sequence identity of 30% and query coverage of 70% [ 44 ]). Resistance is probably undercalled by both methods because of a lack of well-characterized resistance loci from the Lactobacillus genus to use for comparison.

Bactopia is a flexible workflow for bacterial genomics. It can be run on a laptop for a single bacterial sample, but, critically, the underlying Nextflow framework allows it to make efficient use of large clusters and cloud-computing environments to process the many thousands of genomes that are currently being generated. For users that are not familiar with bacterial genomic tools and/or who require a standardized pipeline, Bactopia is a one-stop shop that can be easily deployed using conda, Docker, and Singularity containers. For researchers with particular interest in individual species or genera, BaDs can be highly customized with taxon-specific databases.

The current version of Bactopia has only minimal support for long-read data, but this is an area that we plan to expand in the future. We also plan to implement more comparative analyses in the form of additional BaTs. With a framework set in place for developing BaTs, it should be possible to make a toolbox of workflows that not only can be used for all bacteria but are also customized for annotating genes and loci specific for particular species.

MATERIALS AND METHODS

Bactopia data sets..

The Bactopia pipeline can be run without downloading and formatting Bactopia Data Sets (BaDs). However, providing them enriches the downstream analysis. Bactopia can import specific existing public data sets, as well as accessible user-provided data sets in the appropriate format. A subcommand (“bactopia datasets”) was created to automate downloading, building, and (or) configuring these data sets for Bactopia.

BaDs can be grouped into those that are general and those that are user supplied. General data sets include a Mash ( 17 ) sketch of the NCBI RefSeq ( 16 ) and PLSDB ( 20 ) databases and a Sourmash ( 19 ) signature of microbial genomes (including viral and fungal) from the NCBI GenBank ( 18 ) database. Ariba ( 13 ), a software program for detecting genes in raw read (FASTQ) files, uses a number of default reference databases for virulence and antibiotic resistance. The available Ariba data sets include ARG-ANNOT ( 45 ), CARD ( 15 ), MEGARes ( 46 ), NCBI Reference Gene Catalog ( 47 ), plasmidfinder ( 48 ), resfinder ( 49 ), SRST2 ( 50 ), VFDB ( 14 ), and VirulenceFinder ( 51 ).

When an organism name is provided, additional data sets are set up. If a multilocus sequence typing (MLST) schema is available for the species, it is downloaded from PubMLST.org ( 52 ) and set up for BLAST+ ( 53 ) and Ariba. Each RefSeq completed genome for the species is downloaded using ncbi-genome-download ( 35 ). A Mash sketch is created from the set of downloaded completed genomes to be used for automatic reference selection for variant calling. Protein sequences are extracted from each genome with BioPython ( 54 ), clustered using CD-HIT ( 55 , 56 ), and formatted to be used by Prokka ( 36 ) for annotation. Users may also provide their own organism-specific reference data sets to be used for BLAST+ alignment, short-read alignment, or variant calling.

Bactopia Analysis Pipeline.

The Bactopia Analysis Pipeline (BaAP) takes input FASTQ or preassembled genomes as FASTA files and optional user-specified BaDs and performs a number of workflows that are based on either de novo whole-genome assembly, reference mapping, or sequence decomposition (i.e., k -mer-based approaches) ( Fig. 1b ). BaAP has incorporated numerous existing bioinformatic tools ( Table 1 ) into its workflow ( Fig. 1b ; see also Fig. S1 in the supplemental material). For each tool, many of the input parameters are exposed to the user, allowing for fine-tuning analysis.

BaAP: acquiring FASTQs.

Bactopia provides multiple ways for users to provide their FASTQ-formatted sequences. Input FASTQs can be local or downloaded from public repositories or preassembled genomes as FASTA files. There is also an option for hybrid assembly of Illumina and long-read data.

Local sequences can be processed one at a time or in batches. To process a single sample, the user provides the path to the FASTQ(s) and a sample name. For multiple samples, this method does not make efficient use of Nextflow’s queue system. Alternatively, users can provide a “file of filenames” (FOFN), which is a tab-delimited file with information about samples and paths to the corresponding FASTQ(s). By using the FOFN method, Nextflow queues each sample and makes efficient use of available resources. A subcommand (“bactopia prepare”) was created to automate the creation of an FOFN.

Raw sequences available from public repositories (e.g., European Nucleotide Archive [ENA], Sequence Read Archive [SRA], DNA Data Bank of Japan [DDBJ], or NCBI Assembly) can also be processed by Bactopia. Sequences associated with a provided experiment accession number (e.g., DRX, ERX, or SRX prefix) or NCBI Assembly accession number (e.g., GCF or GCA prefix) are downloaded and processed exactly as local sequences would be. A subcommand (“bactopia search”) was created which allows users to query ENA to create a list of experiment accession numbers from the ENA Data Warehouse API ( 57 ) associated with a BioProject accession number, taxon ID, or organism name.

BaAP: validating FASTQs.

The path for input FASTQ(s) is validated, and, if necessary, sequences from public repositories are downloaded using fastq-dl ( 58 ). If a preassembled genome is provided as an input, 2- by 250-bp paired-end reads are simulated using ART ( 59 ). Once validated, the FASTQ input(s) is tested to determine if it meets a minimum threshold for continued processing. All BaAP steps expect to use Illumina sequence data, which represent the great majority of genome projects currently generated. FASTQ files that are explicitly marked as non-Illumina or have properties that suggest that they are non-Illumina (e.g., read length or error profile) are excluded. By default, input FASTQs must exceed 2,241,820 bases (20× coverage of the smallest bacterial genome, Nasuia deltocephalinicola [ 60 ]) and 7,472 reads (minimum required base pairs/300 bp, the longest available reads from Illumina). If estimated, the genome size must be between 100,000 bp and 18,040,666 bp, which is based on the range of known bacterial genome sizes ( N. deltocephalinicola , NCBI accession no. GCF_000442605, 112,091 bp ; Minicystis rosea , NCBI accession no. GCF_001931535, 16,040,666 bp). Failure to pass these requirements excludes the samples from further subsequent analysis. The threshold values can be adjusted by the user at runtime.

BaAP: FastQ quality control and generation of pFASTQ.

Input FASTQs that pass the validation steps undergo quality control steps to remove poor-quality reads. BBDuk, a component of BBTools ( 61 ), removes Illumina adapters and phiX contaminants and filters reads based on length and quality. Base calls are corrected using Lighter ( 62 ). At this stage, the default procedure is to downsample the FASTQ file to an average 100× g enome coverage (if over 100×) with Reformat (from BBTools). This step, which was used in StAP ( 4 ), significantly saves computing time at little final cost to assembly or SNP calling accuracy. The genome size for coverage calculation is either provided by the user or estimated based on the FASTQ data by Mash ( 17 ). The user can provide their own value for downsampling FASTQs or disable it completely. Summary statistics before and after QC are created using FastQC ( 63 ) and fastq-scan ( 64 ). After QC, the original FASTQs are no longer used, and only the processed FASTQs (pFASTQ) are used in subsequent analysis.

BaAP: assembly, reference mapping, and decomposition.

BaAP uses Shovill ( 65 ) to create a draft de novo assembly with MEGAHIT ( 66 ), SKESA ( 67 ) (default), SPAdes ( 26 ), or Velvet ( 68 ) and makes corrections using Pilon ( 69 ) from the pFASTQ. Alternatively, if long reads were provided with paired-end pFASTQ, a hybrid assembly is created with Unicycler ( 70 ). The quality of the draft assembly is assessed by QUAST ( 71 ) and CheckM ( 72 ). Summary statistics for the draft assembly are created using assembly scan ( 73 ). If the total size of the draft assembly fails to meet a user-specified minimum size, further assembly-based analyses are discontinued. Otherwise, a BLAST+ ( 53 ) nucleotide database is created from the contigs. The draft assembly is also annotated using Prokka ( 36 ). If available at runtime, Prokka will first annotate with a clustered RefSeq protein set, followed by its default databases. The annotated genes and proteins are then subjected to antimicrobial resistance prediction with AMRFinder+ ( 47 ).

For each pFASTQ, sketches are created using Mash ( k = 21,31) and Sourmash ( 19 ) ( k = 21,31,51). McCortex ( 74 ) is used to count 31-mers in the pFASTQ.

BaAP: optional steps.

At runtime, Bactopia checks for BaDs specified by the command line (if any) and adjusts the settings of the pipeline accordingly. Examples of processes executed only if a BaDs is specified include Ariba ( 13 ) analysis for each available reference data set, sequence containment estimation against RefSeq ( 16 ) with mash screen ( 75 ) and against GenBank ( 18 ) with sourmash lca gather ( 19 ), and PLSDB ( 20 ), with mash screen and BLAST+. The sequence type (ST) of the sample is determined with BLAST+ and Ariba. The nearest reference RefSeq genome, based on mash ( 17 ) distance, is downloaded with ncbi-genome-download ( 35 ), and variants are called with Snippy ( 76 ). Alternatively, one or more reference genomes can be provided by the user. Users can also provide sequences for sequence alignment with BLAST+ and per-base coverage with BWA ( 77 , 78 ) and Bedtools ( 79 ).

Bactopia tools.

After BaAP has successfully finished, it will create a directory for each strain with subdirectories for each analysis result. The directory structure is independent of the project or options chosen. Bactopia Tools (BaTs) are a set of comparative-analysis workflows written using Nextflow that take advantage of the predictable output structure from BaAP. Each BaT is created from the same framework and a subcommand (“bactopia tools create”) is available to simplify the creation of future BaTs.

Five BaTs were used for analyses in this article. The “summary” BaT outputs a summary report of the set of samples and a list of samples that failed to meet thresholds set by the user. This summary includes basic sequence and assembly stats as well as technical (pass/fail) information. The “roary” BaT creates a pan-genome of the set of samples with Roary ( 7 ), with the option to include RefSeq ( 16 ) completed genomes. The “fastani” BaT determines the pairwise average nucleotide identity (ANI) for each sample with FastANI ( 6 ). The “phyloflash” BaT reconstructs 16S rRNA gene sequences with phyloFlash ( 25 ). The “gtdb” BaT assigns taxonomic classifications from the Genome Taxonomy Database (GTDB) ( 5 ) with GTDB-tk ( 21 ). Each Bactopia tool has a separate Nextflow workflow with its own conda environment, Docker image, and Singularity image. Additional BaTs are currently available for eggNOG-mapper ( 80 , 81 ), ISMapper ( 82 ), Mashtree ( 83 ), and PIRATE ( 84 ).

Data availability.

Supplementary material, acknowledgments.

We thank Torsten Seemann, Oliver Schwengers, Narciso Quijada, Michelle Su, Michelle Wright, Matt Plumb, Sean Wang, Ahmed Babiker, and Monica Farley for their helpful suggestions and feedback. We also acknowledge our gratitude to the many scientists and their funders who provided genome sequences to the public domain, to ENA and SRA for storing and organizing the data, and to the authors of the open source software tools and data sets used in this work.

Support for this project came from an Emory Public Health Bioinformatics Fellowship funded by the CDC Emerging Infections Program (U50CK000485) PPHF/ACA: Enhancing Epidemiology and Laboratory Capacity.

The review history of this article can be read here .

BIOLOGICAL SCIENCES MAJOR

Class Schedules

Winter 2024 Class Schedule

First year seminars, distribution courses, core courses, 300 level courses, winter 2022 course descriptions, first year seminars & 100 level courses, biol_sci 101.8: the genetic basis of disease.

We will study the alterations to the genome that are responsible for various human diseases. Students will learn about traditional and potential experimental targeted treatment (gene-editing) of the diseases. We will discuss the impact of these diseases on healthcare as well as their social implications. Discussions will center on scientific studies and literature. The course is structured to increase the basic understanding of human genetics.

Registration Requirement: First-years only.

BIOL_SCI 116.6: First Year Seminar - Science Research Preparation

This course will provide students with skills and training to be successful in research environments. Under the guidance of faculty, graduate mentors and peer facilitators, students are expected to develop an independent research project, write a research funding proposal, give an oral presentation on their project, and develop basic laboratory skills.

Registration Requirements: enrollment open only to participants in the Bioscientist program. Students should contact Dr. Flores to obtain a permission number.

BIOL_SCI 150: Human Genetics

This class will examine basic principles of human inheritance and the role of genetic variation in human biology. The course will progress from simple Mendelian genetics to the study of complex traits controlled by multiple genes. We will examine how genetic variation affects disease, learn how genes are mapped in humans, and discuss the implications of the human genome project and gene editing in medicine and society.

Natural Sciences Distro Area

Core Courses

Biol_sci 203: genetics & evolution.

This course will present the fundamentals of genetics and evolution. From the rules of heredity to the complex genetics of humans, we will cover the methods and logic of genetics as applied to inheritance, development, neurobiology, and populations. These concepts will transition to the process and tempo of evolution. From natural selection to speciation, we will cover evolution with an emphasis on how genetics plays a critical role.

Prerequisite: Students must have completed, with a C- or better, BIOL_SCI 202-0 or 219-0 to register for this course. Must be taken concurrently with BIOL_SCI 233-0.

BIOL_SCI 233: Genetics & Molecular Processes Laboratory

This is the second course in a three-quarter sequence of introductory biology laboratory, which meets once a week for four hours and includes an online lecture component. The course is designed to provide students with an authentic laboratory experience that investigates relevant scientific research and teaches scientific inquiry skills such as experimental design, writing research proposals, data collection, data analysis/interpretation, and the presentation of results. The experimental model revolves around aggregate prone proteins in nematodes and how RNA interference (RNAi) can be used to affect protein folding and the clearance of protein aggregates. Students will learn and become proficient at various cell and molecular biology techniques.

Prerequisite: Students must have completed BIOL_SCI 232-0. Must be taken concurrently with BIOL_SCI 203-0. Credit not allowed for both BIOL_SCI 220-0 and BIOL_SCI 233-0.

ISP COurses

Biol_sci 241: biochemistry, molecular and cell biology - 2 for isp.

Extending the biochemistry segment from Biol_Sci 240, t his class seeks to provide deeper understanding of select topics in biochemistry, including the structure and function of macromolecules, biological transport and signaling, chemical logic of metabolic reactions and select cellular pathways. The course str ongly emphasizes conceptual understanding and aims to develop and integrated understanding that allows students to apply their knowledge to solve complex problems.

Pre-requistites: General Chemistry 171, 172 Organic Chemistry Chem 212-1 BiolSci 240 (Instructor Permission required if missing).

300 level courses

Biol_sci 310: human physiology.

Biol_Sci 310 explores the functions of the human body with an emphasis on homeostatic mechanisms, interdependence of organs and organ systems and the influence of modulatory systems. Topics include: nervous, cardiovascular, respiratory, renal, and digestive systems as well as endocrine application in these systems. Readings and activities focusing on the contributions of scientists of color to the advancement of physiology, and examples of social injustice that have occurred during the pursuit of physiology research, will be included.

Prerequisites: Students must have completed BIOL_SCI 201-0 or BIOL_SCI 215-0, BIOL_SCI 202-0 or BIOL_SCI 219-0, and CHEM 132-0, CHEM 152-0, or CHEM 172-0. Credit not allowed for both BIOL_SCI 310-0 and BIOL_SCI 217-0.

BIOL_SCI 315: Advanced Cell Biology

Current themes and experimental approaches in cell biology will be discussed through readings of text and original research articles. Discu ssion sections will focus on experimental approaches , including controls and statistical significance , and understanding data as presented in primary literature.

Pre-requisites: BIOL_SCI 215 or BIOL_SCI 201, BIOL_SCI 219 or 202; and BIOL_SCI 301.

BIOL_SCI 319: Biology of Animal Viruses

This course will introduce students to animal viruses, their replication mechanisms, their interactions with hosts, triggering immunity and pathogenesis.

Prerequisites: Students must have completed BIOL_SCI 202-0, BIOL_SCI 203-0, and BIOL_SCI 301-0 to register for this course.

BIOL_SCI 323: Bioinformatics

In a knowledge - based economy, critical thinking and coding skills are paramount for success. This course will prepare students to address informatics challenges in academia and industry. The course will explore through case studies and classroom discussions, the principles and practical applications of comp utational tools in contemporary molecular and structural biology research. Besides gaining an appreciation for the algorithmic aspects of these tools and their limitations, students will learn to code in Python, design and perform experiments in silico, an d critically evaluate results.

Pre-requisities: BIOL SCI 241, BIOL SCI 301, OR equivalent; BIOL SCI 361 OR equivalent recommended but not required. Aptitude for computers and software is required; coding experience would be advantageous but not required.

BIOL_SCI 325: Animal Physiology

Bio 325 is a lecture/group discussion course designed to explore advanced concepts regarding the physiology of the major organ systems, with an emphasis on comparisons between vertebrate groups, and between vertebrates and invertebrates.

Prerequisite: Students must have completed BIOL_SCI 310-0 to register for this course.

BIOL_SCI 349: Community & Population Ecology

Community ecology investigates the composition, structure, functioning, and dynamics of ecological communities. Readings, discussions, lectures, and activities will address how communities are organized, how they interact with their biotic and abiotic environments, how they are studied, and how they are influenced by anthropogenic impacts like climate change.

Pre-requisite: Students must have completed BIOL_SCI 203-0, or BIOL_SCI 215-0, or BIOL_SCI 339-0, or BIOL_SCI 341, or BIOL_SCI 342-0, or ENVR_SCI 202-0 to register for this course.

BIOL_SCI 350: Plant Diversity & Evolution Laboratory

This course is an introduction to the diversity and evolutionary history of land plants for advanced undergraduates and graduate students. It will introduce principles of plant structure, classification, phylogeny, and paleontology in an evolutionary framework. Morphological, anatomical, molecular and fossil evidence for the evolutionary history and relationships of each group will be presented. Laboratories will focus on diversity and structural characteristics of each group and their fossils. Field trips will complement lecture and laboratory activities. In addition to lecture and lab, students will prepare an annotated bibliography on a topic of their choosing (subject to approval).

BIOL_SCI 378: Functional Genomics

The sequencing and assembly of genomes has sparked a new era in biomedical science, in which analyses of very large datasets drive new understanding of fundamental biological phenomena. This course will introduce students to the fundamentals of genome sequencing and assembly, analysis of important genome features, and large-scale identification of genes and regulatory elements. Moreover, it will cover genome-scale "transcriptomic" experiments that identify important gene expression patterns, proteomic analysis that seeks to define the dynamic molecular machines underlying life processes, and analysis of genes in complex functional networks. The course will introduce key concepts in bioinformatics and molecular evolution and will teach students to use computational analyses to derive interesting information from large datasets.

Prerequisites: Students must have completed BIOL_SCI 202-0 and BIOL_SCI 203-0 to register for this course.

BIOL_SCI 381: Stem Cells & Regeneration

The use of stem cells for growth, repair, and maintenance of tissue is widespread throughout the animal kingdom. In addition, species vary in their natural abilities of repair tissue in adult hood, from wound healing and scar formation to complete cell/tissue/organ regeneration after damage. What are the molecular processes that imbue stem cells with their unique abilities, how are these controlled by the organism, and how can they be harnessed therapeutically? This course takes a comparative approach to explore this fascinating problem by critically examining classic and modern scientific literature about the developmental and molecular biology of regeneration and both embryonic and adult stem cells.

Prerequisites: BIOL_SCI 203-0 or BIOL_SCI 215-0 and BIOL_SCI 202-0 or BIOL_SCI 219-0, to register for this course.

BIOL_SCI 397: Honors Colloquium

A student intending to write a Thesis in Biological Sciences must register for Senior Thesis Colloquium (BIOL SCI 397) during Winter Quarter of the Senior Year. I t is in the context of this class that Senior or Honors Theses are written. D o not also register for a 399 that quarter; for the Winter, BIOL SCI 397 replaces 399 with regard to both your research and its write - up.

Pre-requisities: At least one BIOL SCI 398 or 399 registration must have preceded BIOL SCI 397. Do not register for a 398 or 399 during the same quarter as 397. Please contact the instructor for a permission number to register for this course.

Back to top

Developing a Thesis for Rhetorical Analysis: Strategies and Examples

This essay about the development of a thesis for rhetorical analysis provides a comprehensive exploration of strategies and examples drawn from various rhetorical discourses. It emphasizes the importance of identifying rhetorical devices, understanding contextual nuances, and maintaining clarity and specificity in thesis formulation. Through examples such as Martin Luther King Jr.’s “I Have a Dream” speech and contemporary political rhetoric, the essay illustrates how effective theses encapsulate the essence of discourse and its persuasive intent. It underscores the significance of honing analytical skills to navigate the complexities of rhetoric and shape discourse with precision.

How it works

As a diligent student of rhetoric, I find myself perpetually immersed in the intricate art of persuasion. Central to this endeavor is the crafting of a compelling thesis for rhetorical analysis. In this essay, I aim to dissect the strategies and exemplify the process of developing such a thesis, drawing from the rich tapestry of rhetorical discourse.

At its core, a rhetorical analysis thesis serves as the fulcrum upon which the entire analysis pivots. It encapsulates the essence of the discourse, delineating the rhetorical strategies employed by the author to convey their message effectively.

Crafting such a thesis requires a meticulous approach, intertwining keen observation with insightful interpretation.

One strategy essential to the formulation of a robust thesis is the identification of the rhetorical devices employed within the text. These devices serve as the building blocks of persuasion, enabling the author to wield language with precision and efficacy. From ethos, pathos, and logos to metaphor, simile, and irony, the rhetorical arsenal is vast and multifaceted. As a discerning student, it is imperative to unravel these devices, discerning their purpose and impact on the audience.

For instance, consider a thesis centered on Martin Luther King Jr.’s iconic “I Have a Dream” speech. By dissecting the text, one may identify King’s adept use of pathos through emotive language and vivid imagery. A thesis could thus assert: “Through the strategic deployment of pathos, Martin Luther King Jr. invokes a profound emotional resonance, galvanizing his audience towards the pursuit of racial equality.” Here, the thesis not only identifies the rhetorical strategy employed but also hints at its broader implications within the socio-political context of the Civil Rights Movement.

Furthermore, a nuanced understanding of rhetorical context is indispensable in thesis development. Context encompasses the myriad factors surrounding the discourse, including the historical backdrop, the intended audience, and the author’s overarching purpose. By contextualizing the text within its temporal and socio-cultural milieu, one can glean deeper insights into the rhetorical strategies at play.

Returning to our example of King’s speech, a contextualized thesis might elucidate: “Against the backdrop of pervasive racial injustice in 1960s America, Martin Luther King Jr. strategically harnesses the power of rhetoric to catalyze a movement for social change.” Here, the thesis not only acknowledges the historical context but also underscores the transformative potential of rhetorical discourse in precipitating societal shifts.

Moreover, a successful thesis for rhetorical analysis transcends mere identification of rhetorical elements; it delves into their cumulative effect on the audience and the broader discourse. This necessitates a nuanced analysis of tone, argumentative structure, and the author’s stance vis-à-vis the subject matter.

Consider the formulation of a thesis pertaining to a contemporary political speech. By dissecting the speaker’s tone, one may discern underlying nuances indicative of their rhetorical intent. A thesis could thus posit: “Through a combination of assertive rhetoric and appeals to national identity, the speaker constructs a compelling argument aimed at garnering public support for their policy agenda.” Here, the thesis not only scrutinizes the rhetorical strategies employed but also interprets their persuasive implications within the realm of public discourse.

In addition to strategic formulation, the efficacy of a rhetorical analysis thesis hinges on its clarity and specificity. A well-crafted thesis not only outlines the overarching rhetorical strategies but also articulates a clear analytical stance. Ambiguity or vagueness can dilute the potency of the thesis, rendering the subsequent analysis unfocused and disjointed.

To illustrate, consider a thesis that lacks specificity: “The author employs rhetorical devices to convey their message.” While technically accurate, this thesis falls short in providing a discernible analytical standpoint. Conversely, a refined thesis might assert: “Through the juxtaposition of statistical evidence and anecdotal narratives, the author constructs a persuasive argument in favor of healthcare reform.” Here, the thesis not only identifies the rhetorical strategies at play but also offers a precise interpretation of their persuasive intent.

Ultimately, the process of developing a thesis for rhetorical analysis is a dynamic interplay between observation, interpretation, and synthesis. As a diligent student of rhetoric, I am continually inspired by the transformative power of persuasive discourse. By honing the craft of thesis development, we equip ourselves with the analytical tools necessary to navigate the labyrinthine realm of rhetoric and shape discourse with precision and efficacy.

Cite this page

Developing a Thesis for Rhetorical Analysis: Strategies and Examples. (2024, Apr 14). Retrieved from https://papersowl.com/examples/developing-a-thesis-for-rhetorical-analysis-strategies-and-examples/

"Developing a Thesis for Rhetorical Analysis: Strategies and Examples." PapersOwl.com , 14 Apr 2024, https://papersowl.com/examples/developing-a-thesis-for-rhetorical-analysis-strategies-and-examples/

PapersOwl.com. (2024). Developing a Thesis for Rhetorical Analysis: Strategies and Examples . [Online]. Available at: https://papersowl.com/examples/developing-a-thesis-for-rhetorical-analysis-strategies-and-examples/ [Accessed: 14 Apr. 2024]

"Developing a Thesis for Rhetorical Analysis: Strategies and Examples." PapersOwl.com, Apr 14, 2024. Accessed April 14, 2024. https://papersowl.com/examples/developing-a-thesis-for-rhetorical-analysis-strategies-and-examples/

"Developing a Thesis for Rhetorical Analysis: Strategies and Examples," PapersOwl.com , 14-Apr-2024. [Online]. Available: https://papersowl.com/examples/developing-a-thesis-for-rhetorical-analysis-strategies-and-examples/. [Accessed: 14-Apr-2024]

Don't let plagiarism ruin your grade

Hire a writer to get a unique paper crafted to your needs.

Our writers will help you fix any mistakes and get an A+!

Please check your inbox.

You can order an original essay written according to your instructions.

Trusted by over 1 million students worldwide

1. Tell Us Your Requirements

2. Pick your perfect writer

3. Get Your Paper and Pay

Hi! I'm Amy, your personal assistant!

Don't know where to start? Give me your paper requirements and I connect you to an academic expert.

short deadlines

100% Plagiarism-Free

Certified writers

IMAGES

Genome sequencing guide: An introductory toolbox to whole‐genome
Pan-genome analysis. Clustering of genomes based on the...
PPT
(PDF) Initial Sequencing and Analysis of the Human Genome
Comprehensive Mapping of Long-Range Interactions Reveals Folding
PPT

VIDEO

Pharmacogenomics (2010)
Genetic Genealogy
Conclusion
Conclusion
Conclusion
Gian Sepulveda Thesis Defense

COMMENTS

(PDF) Whole Genome Sequencing Analysis: Computational Pipelines and
The purpose of genome analysis is to explore the effect of genomic variants on wild-type gene function and further on overall. individual health. The progress of WGS has been yielding an ...
PDF Whole Genome Investigation of Genomic Aberrations of Rare Genetic
Analysis utilizes genomic information of the proband, the index individual, as well as their unaffected family members, often first degree, to examine genome wide patterns of inheritance. A comparison between the DNA of the proband and each parent can elucidate the possible effect of genetic variants in the proband (Kovesdi & Patocs, 2019).
Genome Annotation and Analysis
Chapter 5. Genome Annotation and Analysis. In the preceding chapter, we gave a brief overview of the methods that are commonly used for identification of protein-coding genes and analysis of protein sequences. Here, we turn to one of the main subjects of this book, namely, how these methods are applied to the task of primary analysis of genomes ...
PDF Lecture 10 : Whole genome sequencing and analysis
After DNA fragments (reads) are sequenced we want to assemble then together to reconstruct the entire target sequence. Most fragment assembly algorithms include the following 3 steps: Overlap - finding potentially overlapping fragments. Layout - finding the order of the fragments. Consensus - deriving DNA sequence from the layout.
A complete reference genome improves analysis of human genetic ...
Science. For the past 20 years, the human reference genome (GRCh38) has served as the bedrock of human genetics and genomics ( 1 - 3 ). One of the central applications of the human reference genome, and of reference genomes in general, has been to serve as a substrate for clinical, comparative, and population genomic analyses.
Genome sequencing guide: An introductory toolbox to whole‐genome
Whole genome sequencing analysis. (a) Example four lines of each read in a FASTQ file. Components in the FASTQ file are labeled with a text box of the same color, which include the sequence ID, nucleotide sequence, and quality score. (b) Example reads mapped to a reference genome (black). An example of 1x coverage (left) and 5x coverage (right ...
PDF Bioinformatic Analysis of Whole Genome Sequencing Data
The aim of this thesis project has been to carry out extensive bioinformatic analyses of whole genome sequencing data to reveal SNPs, InDels and selective sweeps in the chicken, pig and dog genome. Pig genome sequencing revealed loci under selection for elongation of back and
A versatile workflow to integrate RNA-seq genomic and ...
The whole process is performed using the Genome Analysis ToolKit (v4.1.3.0) , and it was designed following the GATK best practices for the variant calling from RNA-Seq data. Similar to germline variant discovery with DNA sequencing, this sub-workflow specifically includes a step to mark duplicate reads, which will help to reduce the direct ...
PDF Chapter 1 Introduction to Bioinformatics
1.3 The Role of Bioinformatics in Gene/Genome Mapping 5 1.4 Role of Bioinformatics in Sequence Alignment and Similarity Search 6 ... During the Human Genome Project, comprehensive analysis of Human DNA was performed from multiple perspectives. The project started with the sequencing of all the base pairs, discovery of genes (more than 20,000 ...
PDF Statistical Methods for Gene Differential Expression Analysis of RNA
A model on gene expression is fit for each gene. Two common parameterizations for RNA-Seq include a linear model on log-transformed counts and a negative binomial model on discrete counts. limma19and sleuth20use a linear model; edgeR21and DESeq22/DESeq223use the negative binomial model.
PDF Bioinformatic analysis of next-generation sequencing data
Master`s Thesis Bioinformatics Masters Degree Programme, ... Genome browser, Ensemble, microRNA.org and Vista. To evaluate the pathogenicity of the variants, three tolerance predictor programs were used: Mutation Taster, ... analysis was conducted for the prostate cancer gene set using WebGestalt2. iv Results: ...
PDF Genome-wide analysis of
Genome-wide analysis of. Genome-wide analysis of trimethylated lysine 4 of histone H3 (H3K4me3) in Aspergillus niger. Christina Sawchyn. A Thesis in The Department of Biology. Presented in Partial Fulfillment of the Requirements for the Degree of Master of Science (Biology) at Concordia University Montreal, Quebec, Canada.
(PDF) Next Generation Sequencing
Genome Project (Altshuler et al. 2012), and Exome Aggregation Consortium (ExAC) (Lek et al. 2016 ) to decipher population-risk variants and to predict disorders linked to rare mutations.
Thesis
Second, a web portal - the Potato Genome Diversity Portal (PGDP) - was developed and implemented to visualize the published potato genomes using JBrowse and to provide a tool to investigate the genome alignments along with aiding the structural variation analysis PGDP also enables researchers to conduct research and share data
Genome-wide association studies in plant pathosystems: success or
Our meta-analysis of genome-wide association (GWA) studies (GWAS) in plant pathosystems highlights the power of GWA mapping to characterize thoroughly the genetic architecture of plant responses to a wide range of pathogens, subsequently leading to the identification of novel defense mechanisms. GWAS in pathogens revealed fewer, but nonetheless ...
Whole-genome sequence analysis through online web interfaces: a review
Abstract. The recent development of whole-genome sequencing technologies paved the way for understanding the genomes of microorganisms. Every whole-genome sequencing (WGS) project requires a considerable cost and a massive effort to address the questions at hand. The final step of WGS is data analysis.
PDF Genome-wide Association Studies, False Positives, and How We Interpret Them
A genome-wide association study (GWAS) is a study design used to find associations between a given trait or disease and loci across the entire genome. -wide During annotation of genome association studies, false positive results can mislead and misdirect researchers, leading to wasted resources, time, and money.
Genome-wide association study and transcriptome analysis dissect the
The meta-analysis was performed using METAL with the p value, β-coefficients, and standard errors from single-subgroup GWAS . In the single-population GWAS and meta-analysis, the genome-wide significant (1/N) thresholds by Bonferroni correction, in which N is the number of SNPs, were used in the analysis.
A chromosomal-scale genome assembly of modern cultivated ...
This analysis resulted in an 802-Mb monoploid genome for Ss and a 1.15-Gb monoploid genome for So. We further annotated the two monoploid genomes, leading to 34,010 representative protein-coding ...
PDF Dissertation Genome-scale Metabolic Modeling of Cyanobacteria: Network
discovery, prediction of cellular phenotypes, analysis of biological network properties, multi-species interactions, engineering of microbes for product synthesis, and studying evolutionary processes. This thesis is concerned with the development and application of metabolic network modeling to cyanobacteria as well as E. coli.
[PDF] Genome analysis of Taraxacum kok-saghyz Rodin provides new
The draft TKS genome assembly has a length of 1.29 Gb, containing 46 731 predicted protein-coding genes and 68.56% repeats, in which the LTR-RT elements predominantly contribute to the genome enlargement. We analyzed the heterozygous regions/genes, suggesting its possible involvement in inbreeding depression. ... Genome analysis of Taraxacum ...
Development and analysis of Genome-Scale metabolic ...
This thesis developed a pathway database and genome-scale metabolic model for an oleaginous microalga, Microchloropsis gaditana, which is a potential source for biodiesel production. The database covered 141 metabolic pathways with reactions, associated enzymes, genes, and metabolites. The genome-scale metabolic model for M. gaditana, constituting 720 reactions, represented the primary ...
New investigation of encoding secondary metabolites gene by genome
Genome features. The BBR56 strain was isolated from seawater, and the genome size was 5.5 Mb and included chromosome 1 (4.2 Mbp) and chromosome 2 (1.3 Mbp); these contained 61 pseudogenes, 4 noncoding RNAs, 113 tRNAs, 31 rRNAs, 4,505 coding DNA sequences, 4 clustered regularly interspaced short palindromic repeats, and 4,444 coding genes and had a 49.5% GC content.
Dissecting Meta-Analysis in GWAS Era: Bayesian Framework for Gene
Over the past decades, advanced high-throughput technologies have continuously contributed to genome-wide association studies (GWASs). GWAS meta-analysis has been increasingly adopted, has cross-ancestry replicability, and has power to illuminate the genetic architecture of complex traits, informing about the reliability of estimation effects and their variability across human ancestries.
Yak genome database: a multi-omics analysis platform
Background The yak (Bos grunniens) is a large ruminant species that lives in high-altitude regions and exhibits excellent adaptation to the plateau environments. To further understand the genetic characteristics and adaptive mechanisms of yak, we have developed a multi-omics database of yak including genome, transcriptome, proteome, and DNA methylation data. Description The Yak Genome Database ...
DNA methylation remodeling and the functional implication during male
Epigenetic marks are reprogrammed during sexual reproduction. In flowering plants, DNA methylation is only partially remodeled in the gametes and the zygote. However, the timing and functional significance of the remodeling during plant gametogenesis remain obscure. Here we show that DNA methylation remodeling starts after male meiosis in rice, with non-CG methylation, particularly at CHG ...
Thesis Defense: Ellie Wilson (Molecular Genetics and Genomics Program
Office of Neuroscience Research. MSC 8111-96-07-7122. 4370 Duncan Ave. St. Louis, Missouri 63110. [email protected]
Bactopia: a Flexible Pipeline for Complete Analysis of Bacterial
The Bactopia Analysis Pipeline (BaAP) is a customizable workflow for the analysis of individual bacterial genome projects that is an extension and generalization of the previously published Staphylococcus aureus-specific Staphopia Analysis Pipeline (StAP) . The inputs to BaAP are FASTQ files from bacterial Illumina sequencing projects, either ...
Winter 2024 Class Schedule: Biological Sciences Major
This course will introduce students to the fundamentals of genome sequencing and assembly, analysis of important genome features, and large-scale identification of genes and regulatory elements. ... A student intending to write a Thesis in Biological Sciences must register for Senior Thesis Colloquium (BIOL SCI 397) ...
Developing a Thesis for Rhetorical Analysis: Strategies and Examples
Central to this endeavor is the crafting of a compelling thesis for rhetorical analysis. In this essay, I aim to dissect the strategies and exemplify the process of developing such a thesis, drawing from the rich tapestry of rhetorical discourse. At its core, a rhetorical analysis thesis serves as the fulcrum upon which the entire analysis pivots.

A versatile workflow to integrate RNA-seq genomic and transcriptomic data into mechanistic models of signaling pathways

Author summary

Introduction

Design and implementation

Quality control and alignment

Variant calling and annotation

Normalization and differential expression analysis

Integrative mechanistic signaling pathway activity analysis

Modularity of the workflow

MIGNON integrative approach for the mechanistic interpretation of multi-omic information into a pathway framework

Workflow performance evaluation

Functionality of current available workflows

Conclusions

Supporting information

S2 Table. List of datasets used to assess MIGNON performance.

Introduction to Bioinformatics

Buying options

Author information

Corresponding author

Editor information

Rights and permissions

Copyright information

About this chapter

Download citation

Share this chapter

Genome analysis of the diploid wild potato Solanum bukasovii

Genome-wide association study and transcriptome analysis dissect the genetic control of silique length in Brassica napus L.

Conclusions

Introduction

Phenotypic evaluation showed wide variation widely

RNA-seq found that plant hormones transportation and synthesis of carbohydrates involved in the development of rapeseed silique

Linkage mapping and GWAS co-located several major QTLs related to SL

Meta-GWAS and polymorphisms in the candidate region were associated with SL

Co-expression network analysis reveals transcript level differences in photosynthesis and secondary cell wall biosynthesis in long and short siliques

Lignin biosynthesis plays an important role in silique elongation

Materials and methods

QTL mapping, GWAS and meta-GWAS

RNA-seq and WGCNA

Candidate genes analysis and qRT–PCR

Microscopic observation and lignin quantification

Availability of data and materials

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Additional information

Supplementary Information

Additional file 2: Figure S1.

Additional file 3: Figure S2.

Additional file 4: Table S2.

Additional file 5: Table S3.

Additional file 6: Figure S3.

Additional file 7: Table S4.

Additional file 8: Figure S4.

Additional file 9: Table S5.

Additional file 10: Figure S5.

Additional file 11: Figure S6.

Additional file 12: Table S6.

Additional file 13: Table S7.

Additional file 14: Figure S7.

Additional file 15: Figure S8.

Additional file 16: Figure S9.

Additional file 17: Table S8.

Additional file 18: Figure S10.

Additional file 19: Table S9.

Rights and permissions

About this article

Share this article

Biotechnology for Biofuels and Bioproducts

A chromosomal-scale genome assembly of modern cultivated hybrid sugarcane provides insights into origination and evolution

Similar content being viewed by others

A draft chromosome-scale genome assembly of a commercial sugarcane

Genomic insights into the recent chromosome reduction of autopolyploid sugarcane Saccharum spontaneum

The complex polyploid genome architecture of sugarcane

Introduction

Karyotype and genome assembly

Annotation and genomic features

The pedigree identification and analysis