What is Gene Annotation in Bioinformatics?

Posted by Biolyse | Nov 3, 2018 | Bioinformatics

Over the years, scientists and researchers have worked through invention and innovation to make life better. Bioinformatics, as an interdisciplinary field, has created numerous opportunities for scientific advancement. One of its milestone developments concerns the most basic level of life: genes. Previously, the ability to identify and distinguish genes was limited, which hindered scientific manipulation and diagnostic procedures. With a clear understanding of gene sequences, we can make real progress in managing various conditions and in maintaining a healthy population. Gene annotation has brought this within reach.

What is gene annotation?

In molecular biology, the genome is an organism's basic genetic material and typically consists of DNA. The genome includes both genes (coding regions) and non-coding regions. The coding regions are of most interest here because they actively influence basic life processes: genes contain the biological information required to build and maintain an organism. Gene annotation can be defined simply as the process of making a nucleotide sequence meaningful. In practice, however, it is a much more complex process encompassing several procedures and a broad range of activities.

Gene annotation takes the raw DNA sequence produced by genome-sequencing projects and adds the layers of analysis and interpretation needed to extract biologically significant information and place it in context. Bioinformatics software exists to perform these complex procedures. The first gene annotation software system was developed in 1995 at The Institute for Genomic Research, where it was used to analyze the genes of the bacterium Haemophilus influenzae.

By identifying gene locations and coding regions, gene annotation gives insight into what these genes do in the body: it establishes their structural features and relates them to the functions of different proteins. Today the process is largely automated, and the National Center for Biomedical Ontology maintains a database of records that enables comparison.

How is gene annotation performed?

Gene annotation can be manual or automated, the latter with the aid of tools developed by a range of organizations. The downsides of manual annotation are that it is time-consuming and its throughput is low. However, it remains useful alongside automated prediction and thus serves a complementary function. There are three main steps in the process of gene annotation:

Identification of the non-coding regions of the genome, that is, the portions that do not code for proteins. This limits the range of analysis to the essential components, since there is little point in doing tedious work on portions that yield little or no biological information.

Gene prediction. Also referred to as gene finding, this step identifies the regions of genomic DNA that encode genes and gives an overview of the amino acid sequences they encode and the roles of those elements. It can be done with empirical (evidence-based) methods or ab initio methods.

Establishing a connection between the identified elements and the available biological information. This is how biological functions are linked to sequence data.
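The gene-finding step can be illustrated with a deliberately naive open reading frame (ORF) scan. This is only a sketch under simplifying assumptions: it looks at the forward strand only, treats every ATG as a possible start, and ignores introns, so it is not how real ab initio predictors work. The function name and the `min_codons` parameter are hypothetical.

```python
# Toy ORF scan: a stand-in for ab initio gene finding, not a real predictor.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs for ATG...stop ORFs in the three forward frames.

    Coordinates are 0-based and end-exclusive, and the stop codon is included.
    min_codons counts every codon in the ORF, including start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame one codon at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # remember the first start codon in this open stretch
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # resume scanning after the stop codon
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGCC"))
```

A real pipeline would also scan the reverse complement and score candidates with statistical models rather than accept every ATG-to-stop stretch.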

Homology-based tools such as BLAST have hugely simplified gene annotation, removing much of the hassle of manual methods that depend on human expertise.

Modalities of gene annotation

Genomics is a broad field that can be subdivided into structural genomics, functional genomics, and comparative genomics. Similarly, gene annotation has two phases: structural gene annotation and functional gene annotation.

Structural annotation

This is the initial phase of gene annotation and involves identifying genomic elements by their structural characteristics. It yields crucial information such as coding regions, gene structures, ORFs and their locations, and regulatory motifs, all of which inform how genes are identified and distinguished. The accuracy of this process can be evaluated with two parameters: sensitivity and specificity. Sensitivity is the percentage of true signals that are correctly predicted among all true signals, while specificity is the proportion of predicted signals that are correct.
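As a small worked example of the two accuracy measures, the sketch below compares a set of predicted signal positions against a set of known ones. Note that specificity as defined here (correct predictions over all predictions) is what other fields call precision. The positions are invented purely for illustration, and the function name is hypothetical.

```python
def sensitivity_specificity(predicted, actual):
    """Compute gene-prediction style sensitivity and specificity.

    sensitivity = TP / (TP + FN): share of true signals that were predicted.
    specificity = TP / (TP + FP): share of predicted signals that are true.
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # predicted and real
    fn = len(actual - predicted)   # real but missed
    fp = len(predicted - actual)   # predicted but not real
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Made-up start-codon positions, for illustration only.
    predicted_sites = [100, 250, 400, 900]
    true_sites = [100, 250, 600]
    sn, sp = sensitivity_specificity(predicted_sites, true_sites)
    print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

A predictor that calls everything a gene scores high on sensitivity and low on specificity, which is why the two are reported together.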

Functional annotation

This is the process of relating biological functions to the genetic elements identified in the structural annotation step. Biochemical function, physiological function, the regulation and interactions involved, and expression are among the properties most often considered in DNA annotation.

The above steps can involve biological experiments as well as in silico analysis. A newer approach, proteogenomics, seeks to improve genome annotation by using information from expressed proteins, typically obtained by mass spectrometry.

Essential components

Gene annotation is a purposeful process. The information we seek to extract includes coding sequences (CDS), mRNAs, pseudogenes, promoters and poly-A signals, and non-coding RNAs (ncRNA), among others. Such elements are small, and identifying them can be laborious. Scientists have developed software to aid the process; frequently used tools include ORF detectors, promoter detectors, and start/stop codon identifiers. Automation has enhanced accuracy, and because gene annotation is a fast-moving field, there can be large discrepancies between automated results and manually conducted ones.

After a successful gene annotation process, it is expected that the obtained information should be published, stored in the database and shared for research purposes.

Gene annotation is still a young and exceedingly promising field: much remains to be uncovered, and many potentially beneficial areas remain to be explored. Fortunately, many groups have invested in gene annotation, and new developments arise constantly. Ongoing gene annotation projects include Ensembl, GENCODE, and GeneRIF, among others. New literature on this topic is published daily, so it is prudent to keep up to date.

DNA annotation reveals much of the information contained in genomes. A complete gene annotation is therefore descriptive of an organism's biology, and it remains a milestone achievement.

Harvey Cushing/John Hay Whitney Medical Library

Bioinformatics Tools: Gene Prediction/ Annotation

Visualization / Genome Browsers

Genome browsers integrate genomic sequence and annotation data from different sources and provide an interface for users to browse, search, retrieve and analyze these data. These are the main genome browsers:

University of California Santa Cruz genome browser

Ensembl genome browser

NCBI's Genome Browser

NCBI's Genome Workbench

The Vertebrate Genome Annotation (VEGA) database is a repository for high-quality gene models produced by the manual annotation of vertebrate genomes.

Genome Databases

The NCBI's Genome database organizes information on genomes including sequences, maps, chromosomes, assemblies, and annotations.

Genomes Online Database (GOLD) is a World Wide Web resource for comprehensive access to information regarding genome and metagenome sequencing projects, and their associated metadata, around the world.

Ab initio and Gene Prediction Tools

GENEID is a program to predict genes, exons, splice sites and other signals along a DNA sequence.

JIGSAW is a program that predicts gene models using the output from other annotation software. It uses a statistical algorithm to identify patterns of evidence corresponding to gene models.

AUGUSTUS is an open source program that predicts genes in eukaryotic genomic sequences. It has a protein profile extension (PPX), which uses protein family-specific conservation to identify members of a protein family given by a block profile, together with their exon-intron structure. By incorporating mRNA alignments, EST alignments, conservation, and other sources of information, it can predict alternative splicing and alternative transcripts, as well as the 5'UTR and 3'UTR, including introns.

EuGene is an open integrative gene finder for eukaryotic and prokaryotic genomes. It is characterized by its ability to integrate arbitrary sources of information into its prediction process, including RNA-Seq, protein similarities, homologies, and various statistical sources of information.

PseudoPipe is a standalone computational pipeline for pseudogene annotation.

Peak Calling

Genome wide Event finding and Motif discovery (GEM) links binding event discovery and motif discovery with positional priors in the context of a generative probabilistic model of ChIP data and genome sequence, and resolves ChIP data into explanatory motifs and binding events at high spatial resolution. GEM reciprocally improves motif discovery using binding event locations, and binding event predictions using discovered motifs.

SPP is an R package designed for the analysis of ChIP-seq data from Illumina platforms.

Biology LibreTexts

7.13B: Annotating Genomes

Genome annotation is the identification and understanding of the genetic elements of a sequenced genome.

LEARNING OBJECTIVES

Define genome annotation

Key Takeaways

Genome projects are scientific endeavors that ultimately aim to determine the complete genome sequence of an organism (be it an animal, a plant, a fungus, a bacterium, an archaean, a protist, or a virus). They annotate protein-coding genes and other important genome-encoded features. The genome sequence of an organism includes the collective DNA sequences of each chromosome in the organism. For a bacterium containing a single chromosome, a genome project will aim to map the sequence of that chromosome.

Once a genome is sequenced, it needs to be annotated to make sense of it. An annotation (irrespective of the context) is a note added by way of explanation or commentary. Since the 1980s, molecular biology and bioinformatics have created the need for DNA annotation. DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do.

Genome annotation is the process of attaching biological information to sequences. It consists of two main steps: identifying elements on the genome, a process called gene prediction, and attaching biological information to these elements. Automatic annotation tools try to perform all of this by computer analysis, as opposed to manual annotation (a.k.a. curation) which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline (process). The basic level of annotation is using BLAST for finding similarities, and then annotating genomes based on that. However, nowadays more and more additional information is added to the annotation platform. The additional information allows manual annotators to deconvolute discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases rely on both curated data sources as well as a range of different software tools in their automated genome annotation pipeline.

Structural annotation consists of the identification of genomic elements: ORFs and their localization, gene structure, coding regions, and the location of regulatory motifs. Functional annotation consists of attaching biological information to genomic elements: biochemical function, biological function, involved regulation and interactions, and expression.

These steps may involve both biological experiments and in silico analysis. Proteogenomics-based approaches utilize information from expressed proteins, often derived from mass spectrometry, to improve genomics annotations. A variety of software tools have been developed to permit scientists to view and share genome annotations. Genome annotation is the next major challenge for the Human Genome Project, now that the genome sequences of human and several model organisms are largely complete. Identifying the locations of genes and other genetic control elements is often described as defining the biological “parts list” for the assembly and normal operation of an organism. Scientists are still at an early stage in the process of delineating this parts list and in understanding how all the parts “fit together.”

Genome Annotation

Whole genome annotation also includes the identification of rRNA (ribosomal RNA) and tRNA (transfer RNA) sequences as well as IS (insertion sequence) elements.

From: Advances in Applied Microbiology, 2002

Josep F. Abril, Sergi Castellano, in Encyclopedia of Bioinformatics and Computational Biology, 2019

Genome annotation is the process of identifying functional elements along the sequence of a genome, thus giving meaning to it. It is necessary because the sequencing of DNA produces sequences of unknown function. In the last three decades, genome annotation has evolved from the computational annotation of long protein-coding genes on single genomes (one per species), and the experimental annotation of short regulatory elements on a small number of them, into the population annotation of sole nucleotides on thousands of individual genomes (many per species). This increased resolution and inclusiveness of genome annotations (from genotypes to phenotypes) is leading to precise insights into the biology of species, populations and individuals alike.

Bioinformatics and biological data mining

Aditya Harbola, ... Rajesh Kumar Kesharwani, in Bioinformatics, 2022

27.7.2 Annotation of gene/protein structure and function

Genome annotation is the process of deriving the structural and functional information of a protein or gene from a raw data set using different analysis, comparison, estimation, precision, and other mining techniques. Genome annotation is essential because the sequencing of the genome or DNA generates sequence information without its functional role. After the genome is sequenced, it must be annotated to bring more logical information about its structural features and functional roles (Salzberg, 2019). It consists of three major steps:

recognizing portions of the genome that do not code for proteins;

identifying elements on the genome, a procedure called gene prediction; and

attaching biological information to these elements.

Genome sequence and annotation information is stored in files. Common file formats include FASTA, GFF3, and GenBank. There are different file formats for representing sequence, structure, and pathway information related to genes and proteins, and online databases let users select and download a particular file.
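To make the GFF3 format concrete, here is a minimal sketch of parsing one feature line. GFF3 lines have nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with 1-based inclusive coordinates and `key=value` attributes separated by semicolons. The example record and the function name are hypothetical, and the sketch does not handle the percent-escaping that the format allows.

```python
GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines and for comment or directive lines
    (those starting with '#').
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive in GFF3.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Split 'ID=...;Name=...' into a dictionary.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    record["attributes"] = attributes
    return record


if __name__ == "__main__":
    # Invented feature line, for illustration only.
    example = "chr1\texample\tgene\t1000\t2500\t.\t+\t.\tID=gene0001;Name=hypothetical_gene"
    print(parse_gff3_line(example))
```

For real work, an established parser is a safer choice than hand-rolled splitting, since production files contain escaped characters, multi-valued attributes, and parent-child feature hierarchies.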

Using gene annotation approaches, the genes or proteins encoded by a particular genome sequence can be predicted. Functional annotation of these new genes or proteins can be done by searching for similarity to experimentally verified sequences available in the databases. For example, if an unknown gene A shows 85% sequence similarity to another gene B whose structure, function, and related protein information are known, then that information can be tentatively assigned to gene A.
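The similarity-based transfer described in this paragraph can be sketched as follows. The sketch assumes the two sequences are already aligned (gaps written as `-`) and uses percent identity as a simple proxy for similarity. The 85% threshold comes from the example in the text, while the function names, sequences, and the "putative" label convention are assumptions made for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of non-gap alignment columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def transfer_annotation(query_aln, reference_aln, reference_function, threshold=85.0):
    """Assign the reference function to the query if identity meets the threshold."""
    identity = percent_identity(query_aln, reference_aln)
    if identity >= threshold:
        return f"putative {reference_function}", identity
    return "hypothetical protein", identity


if __name__ == "__main__":
    # Invented sequences: gene B is "known", gene A is the unknown query.
    gene_a = "ATGGCCATTGTAATGGGCTA"
    gene_b = "ATGGCCATTGTAATGGGCCG"
    print(transfer_annotation(gene_a, gene_b, "kinase"))
```

In practice the alignment and scoring would come from a tool such as BLAST, and transferred annotations are usually flagged as inferred rather than experimentally confirmed.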

Next-Generation Sequencing and Data Analysis

Pablo H.C.G. de Sá, ... Rommel T.J. Ramos, in Omics Technologies and Bio-Engineering, 2018

11.3.3 Genome Annotation

Genome annotation consists of describing the function of the product of a predicted gene (through an in silico approach). This can be achieved using bioinformatics software with specific features, including (1) signal sensors (e.g., for TATA box, start and stop codon, or poly-A signal detection), (2) content sensors (e.g., for G+C content, codon usage, or dicodon frequency detection), and (3) similarity detection (e.g., between proteins from closely related organisms, mRNA from the same organism, or reference genomes) (Stein, 2001).
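The first two sensor types can be illustrated with a few lines of Python. This is a simplified sketch: the signal sensor does exact string matching against a single motif (here `TATAAA`, used as a stand-in for a TATA box), whereas real tools use position weight matrices or probabilistic models. All function names and sequences are hypothetical.

```python
from collections import Counter


def gc_content(seq):
    """Content sensor: fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(cds):
    """Content sensor: count codons in frame 0, ignoring a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def find_signal(seq, motif):
    """Signal sensor: 0-based start positions of every exact motif match."""
    seq, motif = seq.upper(), motif.upper()
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i)
        i = seq.find(motif, i + 1)  # step by one so overlapping hits are kept
    return positions


if __name__ == "__main__":
    print(gc_content("ATGC"))
    print(codon_usage("ATGAAAAAATAA"))
    print(find_signal("GGTATAAATATAAA", "TATAAA"))
```

Gene finders combine many such sensors, typically weighting their outputs in a single statistical model instead of applying them one at a time.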

However, the method for predicting gene and genome structures (e.g., tRNAs, rRNAs, promoter regions) is associated with the applied assembly strategies and sequencing platforms (Chen et al., 2013).

Genome annotation can be divided into three basic categories. The first is a nucleotide-level annotation, which seeks to identify the physical location of DNA sequences to determine where components such as genes, RNAs, and repetitive elements are located. Sequencing and/or assembly errors at this stage can result in false pseudogenes through indels. The second is a protein-level annotation, which seeks to determine the possible functions of genes, identifying which ones a given organism does or does not have. The third is a process-level annotation, which aims to identify the pathways and processes in which different genes interact, assembling an efficient functional annotation. In the last two levels, sequencing and/or assembly errors may compromise the inference of the true gene function because of reduced similarity (Miller et al., 2010; Reeves et al., 2009; Stein, 2001).

Genome Annotation: Perspective From Bacterial Genomes

Alan Christoffels, Peter van Heusden, in Encyclopedia of Bioinformatics and Computational Biology, 2019

Stepwise Approach to Genome Annotation

Genome annotation is preceded by genome assembly, using either a reference genome-based method or a de novo approach. The annotation of the assembled genome (Fig. 1) starts with identifying and masking RNA genes using RNAmmer (Lagesen et al., 2007) and tRNAscan-SE (Schattner et al., 2005). Gene-finding tools such as Prodigal (Hyatt et al., 2010), GeneMark (Besemer et al., 2001), and MetaGeneAnnotator (Noguchi et al., 2008) are used to identify open reading frames (ORFs) in the genome sequence. These ORFs are BLAST searched against databases such as GenBank and UniProt to identify putative functions and protein evidence. The ORFs are mapped to metabolic pathways using the KEGG database. Protein domains are identified through InterProScan searches. This search assigns GO terms to each of the protein domains, and these features are later used to carry out functional enrichment analyses. ORFs are also searched against the Conserved Domain Database (Marchler-Bauer et al., 2013), which includes COGs, to identify corresponding orthologs.

Sushma Naithani, ... June B. Nasrallah, in Handbook of Biologically Active Peptides (Second Edition), 2013

The SCR-like (SCRL) gene family in plants

In most genome annotations of sequenced plants, genes encoding small peptides are routinely ignored. The difficulty in identifying these genes, including genes that encode SCR-like (SCRL) secreted peptides, stems primarily from the fact that gene-finding algorithms often ignore small ORFs (encoding < 50 amino acids) for which empirical evidence of expression is lacking. The identification of SCRLs in particular presents two additional difficulties. The first relates to the gene structure of SCRL genes, in which the signal sequence, with its initiating ATG codon, is separated from the rest of the coding region by an intron. The second difficulty results from the fact that SCR alleles exhibit a high degree of sequence polymorphism (Fig. 1), so standard searches based on sequence homology are not suitable for identifying related sequences. Nevertheless, 28 SCRL genes that fall into 7 groups on the basis of sequence similarity (Fig. 3) were identified in the A. thaliana genome by iterative searches with the tBLASTN program using sequences of the seven most diverse SCR alleles and three pollen coat proteins from Brassica species (ref. 34). Most A. thaliana SCRLs are predicted to encode ORFs containing a signal peptide, the conserved cysteine residues, the glycine in the GxC2 motif, and the aromatic amino acid in the C3xxxY/F motif found in most SCRs (Fig. 1). Three SCRLs, however, lack some of these conserved residues and are inferred to be nonfunctional. A substantial fraction of the SCRL genes are arranged in tandem in the genome, and these closely linked genes share relatively high sequence similarity, suggesting that they may have redundant functions.

FIGURE 3. Grouping of A. thaliana SCRL genes based on phylogenetic relationships. The multiple sequence alignment of the predicted full-length SCRL proteins was generated using ClustalW2 (http://www.ebi.ac.uk/Tools/msa/clustalw2/), and the tree was plotted using TreeView (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html). SCRL sequences were retrieved from the TAIR database (http://www.arabidopsis.org/).

Of 25 potentially functional SCRLs, none have been assigned a biological function. Furthermore, not much information is currently available regarding their expression patterns and the cells in which they are expressed. A search of ESTs, cDNAs, and microarray datasets available in the public domain identifies cDNAs and ESTs for six SCRL genes (At1g60986, At1g60987, At1g60985, At1g60989, At4g15735, At3g27503), most of which exhibit flower-specific expression patterns. Among these genes, only At1g60985 is represented on the ATH1 whole genome array, and expression data again indicate that this gene is expressed preferentially in floral organs, with the highest expression in carpels. Another SCRL gene, At1g65113, is represented only by ESTs in the NCBI database (http://www.ncbi.nlm.nih.gov/). A search of the MPSS database (http://mpss.udel.edu/at/), which harbors sequence information for small RNAs generated using Massively Parallel Signature Sequencing (MPSS), confirmed the expression of several SCRL genes (At1g65113, At1g60987, At1g60989, At1g60986) and additionally identified four SCRL genes (At4g10115, At4g32717, At4g22105, and At2g25685) for which no expression data were previously known. As shown in Fig. 4, these results show that 11 of the 25 apparently functional SCRL genes exhibit differential expression in various tissues, with the majority being predominantly expressed in flowers and thus possibly functioning in some aspect of reproduction. Furthermore, several SCRL genes have overlapping expression profiles, suggesting possible functional redundancy. Additional information related to gene expression, subcellular localization of the gene product, and mutant phenotypes is required to elucidate the biological function of the SCRL peptides.

FIGURE 4. Analysis of SCRL gene expression using MPSS data (ref. 8). Hierarchical clustering of gene expression patterns is based on Pearson correlation. The highest levels of upregulation and downregulation are indicated in black and white shading, respectively. SAP – sup/ap1 inflorescence; AP1 – ap1-10 inflorescence; INS – inflorescence, signature MPSS; S52 – leaves, 52 h after salicylic acid treatment; GSE – germinating seedlings; AP3 – inflorescence; S04 – leaves, 4 h after salicylic acid treatment; AGM – agamous inflorescence; INF – inflorescence buds, classic MPSS; LES – leaves, 21 day, untreated; CAS – callus, actively growing, signature MPSS; CAF – callus, actively growing, classic MPSS; ROS – root, 21 day, untreated; ROF – root, 21 day, untreated, classic MPSS; SIS – silique, 24–48 h postfertilization, signature MPSS; SIF – silique, 24–48 h postfertilization, classic MPSS.

The Enzymes

Robert J. Bastidas, Maria E. Cardenas, in The Enzymes, 2010

IX Targeting the Tor Pathway: A Novel Therapeutic Antifungal Approach

Advances in genome sequencing and annotation technologies have become an invaluable tool in aiding our understanding of organismal biology. Capitalizing on this genomic revolution, the Fungal Genome Initiative has produced and analyzed the sequence of over 25 fungal organisms that are important to medicine, agriculture, and industry. These include fungi that are pathogens of humans (e.g., C. albicans, C. neoformans, Aspergillus fumigatus) and plants (e.g., Magnaporthe grisea and Ustilago maydis). Comparative genomics between closely related organisms has emerged as an important tool for understanding phenotypic differences, such as pathogenicity, and has facilitated the identification of conserved molecular pathways that can serve as targets for the development of broad-spectrum antimicrobial drugs.

Comparative genome analysis has now demonstrated a remarkable conservation of the Tor molecular cascade throughout the fungal kingdom. The Tor kinase, TORC1 and TORC2 constituents, and their regulators and effectors have been identified in the genomes of representative species of medical relevance (C. albicans, C. neoformans), and in particular in basal lineages such as the zygomycetes Rhizopus oryzae and Mucor circinelloides (Table 11.2; C. Shertz et al., unpublished results). Both R. oryzae and M. circinelloides are common etiological agents of mucormycosis, an aggressive and invasive human fungal disease.

Table 11.2. Tor Cascade Signature Components and Putative Homologs in Pathogenic Fungi

Tor pathway signaling homologs in pathogenic fungi identified through reciprocal best-hit BLASTp searches against characterized S. cerevisiae and S. pombe components.
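The reciprocal best-hit criterion mentioned in this table note can be sketched in a few lines. Two genes are called putative orthologs when each is the other's top-scoring hit in searches run in both directions. The sketch assumes the search results have already been reduced to `(query, subject) -> score` dictionaries; the gene identifiers on the pathogen side and all scores are invented for illustration, and the function names are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores is a dict of {(query, subject): score}, e.g. BLAST bit scores.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


if __name__ == "__main__":
    # Invented scores: yeast proteins searched against a pathogen proteome...
    yeast_vs_pathogen = {
        ("TOR1", "geneX"): 900,
        ("TOR1", "geneY"): 300,
        ("KOG1", "geneY"): 500,
        ("LST8", "geneX"): 200,
    }
    # ...and the pathogen proteins searched back against yeast.
    pathogen_vs_yeast = {
        ("geneX", "TOR1"): 880,
        ("geneY", "TOR1"): 310,
        ("geneY", "KOG1"): 490,
    }
    print(reciprocal_best_hits(yeast_vs_pathogen, pathogen_vs_yeast))
```

Here `LST8` is dropped because its best hit, `geneX`, points back to `TOR1` instead. Requiring agreement in both directions is what makes the criterion stricter than a one-way search.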

Remarkably, our own analysis and recent findings reveal a lack of a Tor homolog and all known Tor signaling components in the microsporidian pathogen Encephalitozoon cuniculi, representing the first eukaryote examined to date in which the entire Tor signaling cascade has been lost (C. Shertz et al., unpublished results; [99]). Phylogenetic classification of these species has been controversial and ambiguous due to their sparse and small genomes and rapidly evolving genes. While at first thought to be an ancient eukaryotic lineage closely related to fungi, recent studies provide evidence that they are true fungi that descended from a zygomycete ancestor and therefore represent a new and distinct basal fungal lineage [100]. Given that Tor controls essential processes in the cell, including protein synthesis, ribosomal biogenesis, autophagy, and cytoskeletal organization, it is unprecedented that a eukaryotic organism could survive in the absence of this essential signal transduction cascade. Strikingly, many other protein kinases and pathways involved in sensing nutrients and generating energy are absent from the E. cuniculi genome, and this is a reflection of the rampant gene loss that sculpted its 2.9 Mb genome, the smallest known for any eukaryote [99, 101]. The striking loss of this suite of kinases presumably arose during E. cuniculi's streamlined and specialized adaptation as an obligate parasite, since the Tor cascade is also present in the intracellular pathogen Trypanosoma cruzi, one of the most ancient and evolutionarily divergent eukaryotes [102]. Within its parasitophorous vacuole, E. cuniculi relies on the host cell for acquisition of energy, nutrients, and for an osmotically stabilized environment that must be homeostatic relative to the changing environments of free-living fungi. Whole genome sequences for the microsporidian species Enterocytozoon bieneusi and Antonospora locustae will soon be available and it will be interesting to query whether these species have lost the Tor pathway as well.

Conservation of the Tor signaling signature network among pathogenic basal fungal lineages and its presence in trypanosomes suggests that this pathway arose early on in eukarya, in accord with its conservation in plants and metazoans (C. Shertz et al., unpublished results; [102]). This evolutionary conservation serves as a platform for the design of novel antifungal therapies, which can also be applied to basal fungal pathogens. Over the last decade, the incidence and types of life-threatening fungal infections have risen due to the increasing number of immunocompromised individuals (resulting from HIV infection, neutropenia induced by chemotherapy, organ transplantation, and from the use of broad spectrum antibiotics and glucocorticosteroids), who are at risk for acquiring fungal infections. The present drug portfolio employed for treating systemic fungal infections consists of the polyene amphotericin B and its liposomal variants, as well as the azoles, allylamines, thiocarbamates, and fluorocytosine [103]. The need for new and broad spectrum antifungal agents with novel modes of action continues due to severe toxic side effects, fungistatic modes of action, and emergence of resistance to the current drug armamentarium.

The Tor kinase has received wide attention as an antifungal target due to its inhibition by the natural product rapamycin. Indeed, rapamycin was first identified for its potent antimicrobial activity against C. albicans [104, 105]. In comparison with amphotericin B, the mainstay antifungal used for combating fungal disease, rapamycin remains one of the most potent anti-Candida drugs ever identified [106]. Subsequently, rapamycin was shown to have robust antifungal activity against several human fungal pathogens, including Candida stelloidea, C. neoformans, A. fumigatus, Fusarium oxysporum, and several pathogenic Penicillium species [107, 108]. However, the antifungal potential of rapamycin has been overshadowed by its potent immunosuppressive activity, which makes this compound less attractive as a therapeutic agent for treatment of fungal infections. Nevertheless, less immunosuppressive rapamycin analogs have been synthesized that retain antifungal activity against pathogenic Candida species as well as C. neoformans [109, 110].

The problem of systemic fungal infections will continue to grow as the number of individuals requiring immunosuppressive therapy increases. Less immunosuppressive rapamycin analogs offer new options in antifungal therapy. Topical applications and targeted delivery of these analogs are novel treatments that can also be explored for therapeutic use and can circumvent the immunosuppressive effect of rapamycin. Moreover, the use of rapamycin as an antifungal agent in an in vivo setting was reported to improve survival of mice with invasive aspergillosis [111] . Recent reports show that rapamycin encapsulated in lipid micelles retains high levels of potency in vitro [112] . In combination with solubilized amphotericin B and 5-flucytosine (5-FC), rapamycin synergistically increased the in vitro drug susceptibility of C. albicans isolates [112] . The synergistic activity of rapamycin in conjunction with amphotericin B and 5-FC combinations is encouraging as micelle encapsulation reduces the poor solubility of rapamycin in most drug vehicles and increases its compatibility with antifungal drugs. Furthermore, these in vitro results have promising therapeutic value since combinatorial therapy resulting in inhibition of multiple pathways simultaneously enhances efficacy of individual drugs by limiting exposure to toxic side effects and decreasing emergence of drug resistance. The challenge remains to exploit such combinatorial therapy by avoiding the immunosuppressive effects of rapamycin. The potential use of rapamycin and its analogs as antifungals appears promising and further development of new analogs is warranted.

Mimiviridae

In Virus Taxonomy, 2012

Genome organization and replication

The initial mimivirus genome annotation predicted 911 protein-coding genes and 6 tRNAs (Figure 3). More recent data obtained through transcriptome sequencing (RNA-Seq) and deep genome resequencing allowed the identification of a total of 1018 genes, including 979 protein-coding genes, 6 tRNAs and 33 non-coding RNAs. The latest genome sequence and the most current annotation (including the location of identified promoter signals and known 5′-end and 3′-end transcript boundaries) are available in the RefSeq database under accession number NC_014649.1, and in GenBank under accession number HQ336222.

Figure 3. Map of the mimivirus chromosome. The predicted protein coding sequences are shown on both strands and colored according to the function category of their matching COG. Genes with no COG match are shown in gray. Abbreviations for the COG functional categories are as follows: E, amino acid transport and metabolism; F, nucleotide transport and metabolism; J, translation; K, transcription; L, replication, recombination, and repair; M, cell wall/membrane biogenesis; N, cell motility; O, posttranslational modification, protein turnover, and chaperones; Q, secondary metabolites biosynthesis, transport, and catabolism; R, general function prediction only; S, function unknown. Small red arrows indicate the location and orientation of tRNAs. The A+C excess profile is shown on the innermost circle, exhibiting a peak around position 380,000.

The penetration of the particle inner core within the host cytoplasm is followed by a complete eclipse phase that lasts approximately two hours in Acanthamoeba castellanii (ATCC 30010), after which time mimivirus virion factories become visible. Mimivirus replication takes place entirely in the cytoplasm of the host Acanthamoeba cell, through the successive expression of early (from 0 to 3 h post-infection), intermediate (from 3 h to 6 h post-infection) and late (after 6 h post-infection) transcripts, each gene class representing approximately one-third of the mimivirus genome. The virion factories develop from the core of individual uncoated virus particles (seeds). The earliest viral transcripts are detected as soon as 15 minutes post-infection, most likely produced by the viral transcription machinery within the uncoated particles. Most of the genes involved in nucleotide synthesis and DNA replication are transcribed from 3 h to 6 h post-infection. Late genes (after 6 h) include virion structural components, as well as most of the virally-encoded transcription apparatus components. This expression pattern suggests that the early and intermediate mimivirus transcripts detected before the appearance of fully mature cytoplasmic virion factories are generated by the transcription apparatus associated with the virion core. Mimivirus particles (at least one thousand per infected cell) are continually produced for up to 12 h by the growing virion factories (up to 6 µm in diameter) (Figure 4). Mature mimivirus particles increasingly fill the host cytoplasm and are progressively released from the dying cell. No budding or sudden cell bursts are seen.

Figure 4. The distinctive giant mimivirus virion factory in full production (8 h post-infection in Acanthamoeba castellanii). The dark circle (about 4.5 µm in diameter) is the virion factory from which mimivirus particles can be seen emerging, first empty, then filled with a dense core, then covered with their outer fiber layer (transmission electron microscopy).

Genome sequence assembly and annotation

Nachimuthu Saraswathy, Ponnusamy Ramalingam, in Concepts and Techniques in Genomics and Proteomics, 2011

Review questions and answers

What is genome annotation?

Genes and other features in the genome sequence have to be identified and named, and their functions have to be assigned. This process is known as genome annotation.

What is the draft genome sequence?

A draft genome sequence is characterized by the presence of gaps, i.e. the genomic DNA is represented as supercontigs rather than as single chromosomes. It also contains base ambiguities and has low accuracy, that is, errors in the sequence and misalignment in the ordering of contigs.

Why are there gaps in the genome assembly?

There are two types of gaps: physical gaps and sequence gaps. Gaps arise for two reasons: a particular clone may not have been picked up in sequencing, or a particular stretch of DNA may not be present in the library.

What is a contig or Bactig?

A contig is the assembly of overlapping clones without a gap, i.e. the unbroken series of clones assembled using overlapping sequences. Bactigs are contigs prepared from BAC clones.

Reconstruction of Genome-Scale Metabolic Networks

Hooman Hefzi, ... Nathan E. Lewis, in Handbook of Systems Biology, 2013

Stage 2: Manual Curation

For most organisms, genome annotation is done primarily through homology methods. Reconstructions based solely on genome annotation may therefore contain many incorrect enzymatic activities, and will be missing reactions whose associated enzymes were missed in the annotation process. For this reason, great care is taken to ensure that the reconstruction is accurate and complete for the organism of interest, i.e. efforts are made to verify that all reactions and genes included are actually present in the organism and that all known reactions and genes in the organism are included in the reconstruction. In addition, the cellular composition is determined. That is, the amounts of metabolites needed for cell growth and maintenance are determined. For example, the total amounts of proteins, mRNA, DNA, lipids, etc. are measured. Much of this information is organism specific. Thus the primary resources in this stage include either new experimental measurements or organism-specific databases (e.g., EcoCyc [21], AraCyc [22], SGD [23], etc.), textbooks, publications, and experts.

mRNA 3' End Processing and Metabolism

Austin E. Gillen, ... J. Matthew Taliaferro, in Methods in Enzymology, 2021

2.1 Filtering transcripts

LABRAT takes in a genome annotation in GFF format. From this annotation it derives the 3′ ends of transcripts to be quantified. However, it does not consider every transcript. In many annotations, there are dubious transcripts that result from incomplete transcript assemblies, old idiosyncratic ESTs, RNAs that haven't yet been fully processed, and other error-prone sources. Because these may negatively impact the accuracy of APA quantification, LABRAT uses a set of filters to remove these transcripts.

Some of these filters utilize specific transcript tags found in the supplied annotation. These tags may not be found in every annotation, but are always found in Gencode GFF annotations. Because Gencode annotations are only offered for human and mouse genomes, this restricts the species that can be analyzed with LABRAT. To ameliorate this limitation, we wrote specific versions of LABRAT that are compatible with Ensembl annotations for rat and Drosophila genomes.

The first filter used ensures that the transcript is protein coding. Although APA may regulate noncoding transcripts including lncRNAs, a large fraction of the undesired, spurious transcripts are not protein coding. To filter these, LABRAT selects transcripts that have the “protein_coding” attribute.

Transcripts whose 3′ end is not well defined have the potential to induce artifacts in APA quantification. These transcripts often arise from degraded or partial transcripts, yet still end up in many genome annotations. To remove these transcripts from the analysis, LABRAT filters out transcripts that contain the attribute “mRNA_end_NF.”
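The two filters described above can be sketched in a few lines of Python. This is only an illustration of the idea, not LABRAT's own code: it assumes Gencode-style GFF3 lines in which column 9 carries `transcript_type=...` and a comma-separated `tag=...` attribute, and the function names and example records are hypothetical.

```python
def parse_attributes(column9):
    """Turn 'key=value;key=value' from GFF3 column 9 into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def keep_transcript(gff_line):
    """Return True for protein-coding transcripts whose 3' end is defined."""
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) != 9 or columns[2] != "transcript":
        return False
    attrs = parse_attributes(columns[8])
    # Filter 1: keep only protein-coding transcripts.
    if attrs.get("transcript_type") != "protein_coding":
        return False
    # Filter 2: drop transcripts tagged as having an undefined mRNA end.
    tags = attrs.get("tag", "").split(",")
    return "mRNA_end_NF" not in tags


# Hypothetical records, made up for illustration only.
example_lines = [
    "chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tID=tx1;transcript_type=protein_coding;tag=basic",
    "chr1\tHAVANA\ttranscript\t100\t700\t.\t+\t.\tID=tx2;transcript_type=protein_coding;tag=mRNA_end_NF,cds_start_NF",
    "chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tID=tx3;transcript_type=lncRNA;tag=basic",
]
kept = [
    parse_attributes(line.split("\t")[8])["ID"]
    for line in example_lines
    if keep_transcript(line)
]
print(kept)
```

Only `tx1` passes both filters here: `tx2` carries the `mRNA_end_NF` tag and `tx3` is not protein coding.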

The NCBI Eukaryotic Genome Annotation Pipeline

The NCBI Eukaryotic Genome Annotation Pipeline provides content for various NCBI resources including Nucleotide, Protein, BLAST, Gene and the Genome Data Viewer genome browser.

This page provides an overview of the annotation process. Please refer to the Eukaryotic Genome Annotation chapter of the NCBI Handbook for algorithmic details.

The pipeline uses a modular framework for the execution of all annotation tasks, from the fetching of raw and curated data from public repositories (sequence and Assembly databases) to the alignment of sequences and the prediction of genes, to the submission of the accessioned annotation products to public databases. Core components of the pipeline are alignment programs (Splign and ProSplign) and an HMM-based gene prediction program (Gnomon) developed at NCBI.

Important features of the pipeline include:

The products of an annotation run (chromosome, scaffolds and model transcripts and proteins) are labeled with an Annotation Name. There are two formats for the Annotation Name, which is used throughout NCBI as a way to uniquely identify annotation products originating from the same annotation run.

Source of genome assemblies

Transcript alignments, transcriptomics long read alignments, RNA-Seq read alignments, protein alignments, model prediction, curated RefSeq genomic sequence alignments, choosing the best models for a gene, protein naming and determination of locus type, assignment of GeneIDs, annotation of small RNAs, annotation of transcription start sites (TSS), special considerations, annotation of multiple assemblies, re-annotation, annotation quality, annotation products, data availability.

Please see The Eukaryotic Genome Annotation chapter in the NCBI Handbook for more details about the algorithms.

The figure below provides an overview of the annotation process. The genomic sequences are masked (grey) and transcripts (blue), proteins (green) and RNA-Seq reads and, if available in SRA, long reads transcriptomes and Cap Analysis Gene Expression (CAGE) data (orange) are aligned to the genome. If available for the organism being annotated, curated RefSeq genomic sequences are also aligned (pink). Gene model prediction based on transcript and protein alignments is then performed (brown). The best models are selected among the RefSeq and the predicted models, named and accessioned (purple). Finally, the annotation products are formatted and deployed to public resources (yellow).

The RefSeq assemblies that are annotated by NCBI are copies of the genome assemblies that are public in INSDC (DDBJ, ENA and GenBank). Unplaced scaffolds with length below 1000 bases may not be included in the RefSeq copy of the assembly if the INSDC assembly contains more than 300,000 unplaced scaffolds and more than 25,000 of them are below 1000 bases. Both RefSeq and GenBank assemblies are further described in the Assembly resource.
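The size rule for unplaced scaffolds can be written as a small predicate. The sketch below only encodes the two thresholds stated above; the function name and the toy scaffold lengths are hypothetical, and the rule says such scaffolds "may" be left out, so a True result means the condition for exclusion is met, not that exclusion happens.

```python
def short_scaffolds_may_be_dropped(
    unplaced_lengths,
    min_length=1000,
    max_unplaced=300_000,
    max_short=25_000,
):
    """Check whether an assembly meets the stated condition for omitting
    unplaced scaffolds shorter than `min_length` bases."""
    short_count = sum(1 for length in unplaced_lengths if length < min_length)
    return len(unplaced_lengths) > max_unplaced and short_count > max_short


# Toy assemblies, made up for illustration only.
fragmented = [500] * 30_000 + [5_000] * 280_000   # 310,000 unplaced, 30,000 short
compact = [500] * 100 + [50_000] * 2_000          # few unplaced scaffolds

print(short_scaffolds_may_be_dropped(fragmented))  # both thresholds exceeded
print(short_scaffolds_may_be_dropped(compact))     # neither threshold exceeded
```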

Masking is done using RepeatMasker or WindowMasker. Human and mouse are masked with RepeatMasker using their respective Dfam libraries, while genomes from other species are masked with WindowMasker.

The set of transcripts selected for alignment to the genome varies by species, and may include transcripts from other organisms. This set generally includes:

Sequences highly likely to be mitochondrial or to have cloning vector or IS element contamination, and sequences identified as low quality by RefSeq curation staff are screened out.

RefSeq transcripts and non-RefSeq transcripts that pass the contamination screen are aligned locally to the genome using BLAST to identify the location(s) at which transcripts align. Global re-alignment at these locations is performed with Splign to refine the identification of splice sites. Alignments are then ranked and filtered based on customizable criteria (such as coverage, identity, rank). Typically, only the best-placed (rank 1) alignment for a given query is selected for use in the downstream steps.

Transcriptomics reads from SRA generated using long read sequencing technologies such as PacBio or Oxford Nanopore are aligned to the genome using Minimap2. Each transcript's best-placed (rank 1) alignment is selected for use in the downstream steps, if above 85% identity.
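Selecting the best-placed (rank 1) alignment per query, with an optional identity cutoff, might look like the sketch below. The ranking key used here (identity first, then coverage) is an assumption for illustration; the text only says the criteria are customizable. The `Alignment` record, the function name and the example values are hypothetical.

```python
from collections import namedtuple

# Hypothetical alignment record: identity and coverage are fractions in [0, 1].
Alignment = namedtuple("Alignment", "query location identity coverage")


def best_placed(alignments, min_identity=0.0):
    """Keep one rank-1 alignment per query, then apply an identity cutoff."""
    best = {}
    for aln in alignments:
        current = best.get(aln.query)
        # Assumed ranking: higher identity wins, coverage breaks ties.
        if current is None or (aln.identity, aln.coverage) > (
            current.identity,
            current.coverage,
        ):
            best[aln.query] = aln
    # "Above 85% identity" is read as a strict comparison.
    return {q: a for q, a in best.items() if a.identity > min_identity}


alignments = [
    Alignment("read1", "chr1:100-900", 0.97, 0.99),
    Alignment("read1", "chr5:40-800", 0.91, 0.80),
    Alignment("read2", "chr2:10-500", 0.80, 0.95),
]
selected = best_placed(alignments, min_identity=0.85)
print(selected)
```

In this toy input, `read1` keeps its chr1 placement and `read2` is dropped because its best alignment falls below the cutoff.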

RNA-Seq reads for the species or closely related species are aligned to the genome. When a very large number of samples and reads (multiple billions) are available in SRA, projects with samples spanning the widest range of tissues and developmental stages are chosen over others, with a preference for untreated or non-diseased samples. RNA-Seq reads are aligned to the genome with STAR. To address the short length, redundancy and abundance of the reads, alignments with the same splice structure and the same or similar start and end points are collapsed into a single representative alignment. Information is recorded about the samples and number of reads represented by each alignment, so the level of support can be used to filter alignments and evaluate gene predictions. Alignments representing very rare introns likely to be background noise are filtered out.

The set of proteins selected for alignment to the genome varies by species, and may include proteins from other organisms. This set generally includes:

Highly repetitive sequences are removed from the set. Proteins are aligned locally to the genome with BLAST and re-aligned globally using ProSplign. Alignments are then ranked and filtered based on customizable criteria.

Protein, transcript, transcriptomics and RNA-Seq read alignments are passed to Gnomon for gene prediction. Gnomon first chains together non-conflicting alignments into putative models. In a second step, Gnomon extends predictions missing a start or a stop codon or internal exon(s) using an HMM-based algorithm. Gnomon additionally creates pure ab initio predictions where open reading frames of sufficient length but with no supporting alignment are detected.

This first set of predictions is further refined by alignment against a subset of the nr (non-redundant) database of protein sequences. The additional alignments are added to the initial alignments, and the chaining and ab initio extension steps are repeated. The results constitute the set of Gnomon predictions.

Gnomon predictions may include deletions or insertions of Ns with respect to the genomic sequence. These differences are introduced to compensate for frameshifts or stop codons in the literal translation of the genome, when the aligning protein provides evidence of an intact ORF.

For some organisms, a set of genomic sequences is curated (RefSeq accessions with NG_ prefixes). These sequences represent non-transcribed pseudogenes, manually annotated gene clusters that are difficult to annotate via automated methods, or human RefSeqGene records. They are aligned to the genome, and their best placement is identified.

The final set of annotated features comprises, in order of preference, pre-existing RefSeq sequences and a subset of well-supported Gnomon-predicted models. It is built by evaluating together at each locus the known RefSeq transcripts, the features projected from curated RefSeq genomic alignments and the models predicted by Gnomon.

1. Models based on known and curated RefSeq

RefSeq transcripts are given precedence over overlapping Gnomon models with the same splice pattern. Alignments of known same-species RefSeq transcripts or curated genomic sequences are used directly to annotate the gene, RNA and CDS features on the genome. Since the RefSeq sequence may not align perfectly or completely to the genomic sequence, a consequence of this rule is that the annotated product may differ from the conceptual translation of the genome. Differences between the RefSeq transcripts and the genome are provided in a note on the RefSeq genomic record (scaffold or chromosome).

2. Models based on Gnomon predictions

Gnomon predictions are included in the final set of annotations if they do not share all splice sites with a RefSeq transcript and if they meet certain quality thresholds including:

3. Integrating RefSeq and Gnomon annotations

As a result of the model selection process, a gene may be represented by multiple splice variants, with some of them known RefSeq and others model RefSeq (originating from Gnomon predictions).

Gnomon predictions selected for the final annotation set are assigned model RefSeq accessions with XM_ or XR_ prefixes for transcripts and XP_ prefixes for proteins to distinguish them from known RefSeq with NM_/NR_ and NP_ prefixes. Model RefSeq can be searched in Entrez with the query “srcdb_refseq_model[properties]” while known RefSeq sequences can be obtained with the query “srcdb_refseq_known[properties]”.
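The prefix convention above lends itself to a simple lookup. The sketch below maps only the six prefixes named in this paragraph; the function name is hypothetical and the accession numbers are placeholders made up for illustration.

```python
# Prefix -> (status, molecule), restricted to the prefixes named in the text.
REFSEQ_PREFIXES = {
    "NM_": ("known", "transcript"),
    "NR_": ("known", "transcript"),
    "NP_": ("known", "protein"),
    "XM_": ("model", "transcript"),
    "XR_": ("model", "transcript"),
    "XP_": ("model", "protein"),
}


def classify_accession(accession):
    """Return (status, molecule) for a RefSeq accession, or None if the
    prefix is not one of those listed above."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


# Placeholder accessions, not real records.
print(classify_accession("XM_000000.1"))
print(classify_accession("NP_000000.1"))
print(classify_accession("HQ336222"))
```

Other RefSeq prefixes exist (for example the NG_ genomic records mentioned earlier); they are deliberately left out of this table and return None.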

Genes in the final set of models are assigned GeneIDs in NCBI's Gene database.

Starting with software release 9.0, Cap Analysis Gene Expression (CAGE) data that is available in SRA for the species are aligned to the genome with Splign and used for annotating transcription start sites.

When multiple assemblies of good quality are available for a given organism, annotation of all is done in coordination. To ensure that matching regions across assemblies are annotated the same way, assemblies are aligned to each other before the annotation.

Assembly-assembly alignments are available through the NCBI Genome Remapping Service.

Organisms are periodically re-annotated when new evidence is available (e.g. RNA-Seq) or when a new assembly is released. Special attention is given to tracking of models and genes from one release of the annotation to the next. Previous and current models annotated at overlapping genomic locations are identified and the locus type and GeneID of the previous models are taken into consideration when assigning GeneIDs to the new models. If the assembly was updated between the two rounds of annotation, the assemblies are aligned to each other and the alignments used to match previous and current models in mapped regions.

The quality of the annotation is assessed prior to publishing, based on the intrinsic characteristics of the annotated models and on the expectations for the species. Indicators of a low quality annotation may disqualify a genome from being included in RefSeq. These indicators are: high count of coding genes that lack near-full coverage by alignments of experimental evidence, high count of partial coding genes (lacking a start or stop codon, or internal exons), high count of low-quality genes with suspected frameshifts or premature stop codons, low BUSCO completeness score (see below), and, for vertebrates, low count of genes with orthologs to a reference species.

BUSCO run in "protein" mode provides an estimate of the completeness of the gene set. The BUSCO models (single-copy marker genes) for the most fitting lineage based on NCBI Taxonomy are searched against the longest protein for each annotated coding gene. Results are reported in BUSCO notation (C:complete [S:single-copy, D:duplicated], F:fragmented, M:missing, n:number of genes used).
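A BUSCO summary in this notation can be parsed with a regular expression. The sketch below assumes the common one-line form with percentages, such as `C:98.1%[S:96.9%,D:1.2%],F:0.7%,M:1.2%,n:9226`; the example string is made up for illustration, and the function name is hypothetical.

```python
import re

BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Parse a one-line BUSCO summary into a dict of percentages plus n."""
    match = BUSCO_PATTERN.search(summary.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    scores = {key: float(value) for key, value in match.groupdict().items() if key != "n"}
    scores["n"] = int(match.group("n"))
    return scores


# Made-up summary string for illustration only.
scores = parse_busco("C:98.1%[S:96.9%,D:1.2%],F:0.7%,M:1.2%,n:9226")
print(scores)
```

Complete (C) is the sum of single-copy (S) and duplicated (D), so a quick consistency check is that S + D matches C within rounding.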

The sequence records for scaffolds, chromosomes and predicted transcripts and proteins for NCBI Pongo abelii Annotation Release 103 contain the following comment:

##Genome-Annotation-Data-START##
Annotation Provider         :: NCBI
Annotation Status           :: Full annotation
Annotation Name             :: Pongo abelii Annotation Release 103
Annotation Version          :: 103
Annotation Pipeline         :: NCBI eukaryotic genome annotation pipeline
Annotation Software Version :: 8.0
Annotation Method           :: Best-placed RefSeq; Gnomon
Features Annotated          :: Gene; mRNA; CDS; ncRNA
##Genome-Annotation-Data-END##

The sequence records for scaffolds, chromosomes and predicted transcripts and proteins for NCBI GCF_016801865.2-RS_2022_12 contain the following comment:

##Genome-Annotation-Data-START##
Annotation Provider         :: NCBI RefSeq
Annotation Status           :: Full annotation
Annotation Name             :: GCF_016801865.2-RS_2022_12
Annotation Pipeline         :: NCBI eukaryotic genome annotation pipeline
Annotation Software Version :: 10.1
Annotation Method           :: Gnomon; cmsearch; tRNAscan-SE
Features Annotated          :: Gene; mRNA; CDS; ncRNA
##Genome-Annotation-Data-END##
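Structured comments of this kind are easy to read programmatically. The sketch below assumes the comment is laid out with one `key :: value` field per line between the START and END markers, as it typically appears in a sequence record; the function name is hypothetical, and the sample text reuses the values quoted above.

```python
def parse_annotation_comment(text):
    """Parse a Genome-Annotation-Data comment block into a dict."""
    fields = {}
    inside = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "##Genome-Annotation-Data-START##":
            inside = True
        elif line == "##Genome-Annotation-Data-END##":
            break
        elif inside and "::" in line:
            key, value = line.split("::", 1)
            fields[key.strip()] = value.strip()
    return fields


comment = """
##Genome-Annotation-Data-START##
Annotation Provider         :: NCBI RefSeq
Annotation Status           :: Full annotation
Annotation Name             :: GCF_016801865.2-RS_2022_12
Annotation Pipeline         :: NCBI eukaryotic genome annotation pipeline
Annotation Software Version :: 10.1
Annotation Method           :: Gnomon; cmsearch; tRNAscan-SE
Features Annotated          :: Gene; mRNA; CDS; ncRNA
##Genome-Annotation-Data-END##
"""
fields = parse_annotation_comment(comment)
methods = fields["Annotation Method"].split("; ")
print(fields["Annotation Name"], methods)
```

Splitting the semicolon-separated values, as done for `Annotation Method`, gives the list of tools that contributed to a given annotation run.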

The data produced by the annotation pipeline is available in various resources:

Last updated: 2023-01-23T19:56:22Z

Bioinformatics: databasing and gene annotation

"Omics" experiments amass large amounts of data requiring integration of several data sources for data interpretation. For instance, microarray, metabolomic, and proteomic experiments may at most yield a list of active genes, metabolites, or proteins, respectively. More generally, the experiments yield active features that represent subsequences of the gene, a chemical shift within a complex mixture, or peptides, respectively. Thus, in the best-case scenario, the investigator is left to identify the functional significance, but more likely the investigator must first identify the larger context of the feature (e.g., which gene, metabolite, or protein is being represented by the feature). To completely annotate function, several different databases are required, including sequence, genome, gene function, protein, and protein interaction databases. Because of the limited coverage of some microarrays or experiments, biological data repositories may be consulted, in the case of microarrays, to complement results. Many of the data sources and databases available for gene function characterization, including tools from the National Center for Biotechnology Information, Gene Ontology, and UniProt, are discussed.

Essential Bioinformatics

8 - Gene Prediction

Published online by Cambridge University Press:  05 June 2012

With the rapid accumulation of genomic sequence information, there is a pressing need to use computational approaches to accurately predict gene structure. Computational gene prediction is a prerequisite for detailed functional annotation of genes and genomes. The process includes detection of the location of open reading frames (ORFs) and delineation of the structures of introns as well as exons if the genes of interest are of eukaryotic origin. The ultimate goal is to describe all the genes computationally with near 100% accuracy. The ability to accurately predict genes can significantly reduce the amount of experimental verification work required.

However, this may still be a distant goal, particularly for eukaryotes, because many problems in computational gene prediction are still largely unsolved. Gene prediction, in fact, represents one of the most difficult problems in the field of pattern recognition. This is because coding regions normally do not have conserved motifs. Detecting coding potential of a genomic region has to rely on subtle features associated with genes that may be very difficult to detect.

Through decades of research and development, much progress has been made in prediction of prokaryotic genes. A number of gene prediction algorithms for prokaryotic genomes have been developed with varying degrees of success. Algorithms for eukaryotic gene prediction, however, have yet to reach satisfactory results. This chapter describes a number of commonly used prediction algorithms, their theoretical basis, and limitations.

Cell Type Annotation Model Selection: General-Purpose vs. Pattern-Aware Feature Gene Selection in Single-Cell RNA-Seq Data

Vasighizaker, A.; Trivedi, Y.; Rueda, L. Cell Type Annotation Model Selection: General-Purpose vs. Pattern-Aware Feature Gene Selection in Single-Cell RNA-Seq Data. Genes 2023 , 14 , 596. https://doi.org/10.3390/genes14030596

Genome annotation

What is genome annotation in bioinformatics?

The technique of linking biological information to genome sequences is termed genome annotation. Gene annotation is the method of identifying gene locations and coding regions. It helps us understand what these genes do in the organism by establishing their structural characteristics and linking them to the activities of the proteins they encode.

The importance of genome annotation

Genome projects are scientific undertakings that try to determine an organism's full genome sequence. Once a genome has been sequenced, it must be annotated before its meaning can be understood. Genome annotation has been a core task in molecular biology and bioinformatics since the 1980s. When a genome is annotated, researchers identify all protein-coding genes and assign each protein a function. Even though the deoxyribonucleic acid (DNA) nucleotide sequences of over a thousand individual humans (The 100,000 Genomes Project, UK) and of some model organisms are now complete, genome annotation remains a key hurdle for scientists exploring the human genome.

The diagrammatic representation of genome annotation of a DNA sample is shown in the figure.

Manual curation and automatic annotation

In contrast to manual annotation, also known as curation, which requires human expertise, automatic annotation tools try to carry out these processes by computer analysis. Ideally, the two approaches coexist and complement one another in the same annotation workflow. Computational methods can be used to generate gene models and functional predictions, although they are prone to errors. According to Terry Gaasterland and Christoph Sensen, annotating gene sequences manually could take up to one person-year per megabase. In light of later genome annotation experience, researchers now feel that this estimate is inflated by a factor of five or six. Nonetheless, genome annotation has become the limiting stage in most genome studies. Humans, after all, tend to be inconsistent and prone to making mistakes. As a result, there are financial incentives to automate as much of the annotation process as possible.

Genome annotation databases

In recent years, a variety of genome annotation databases have been built to accommodate the growing volume of genomic data collected for commercial and public use, whether industrial, educational, or governmental. These databases make it possible to find and annotate genes as well as their functions. This can be done automatically, but users can also annotate genes manually. Some examples of genome annotation databases are Mouse Genome Informatics (MGI), WormBase (a nematode information resource), and FlyBase (the Drosophila database).

How does genome annotation operate?

The two main steps involved in genome annotation are:

Structural annotation (gene prediction) : Structural annotation determines which parts of the genome encode proteins and which do not. It involves gene prediction or gene finding, which is the process of recognizing functional elements in the genome.

Functional annotation : This involves assigning biological information to these recognized elements.

Structural genome annotation

To begin, we must first identify the genomic structures that encode proteins. The term ‘structural annotation’ refers to this step of the annotation process. It includes information on the identification and positioning of open reading frames (ORFs), gene architecture and coding sequences, and regulatory motifs. There are numerous tools in bioinformatics to annotate structure. Augustus (for eukaryotes) and Glimmer 3 (for prokaryotes) are two tools used in bioinformatics for gene prediction.

Gene prediction or gene finding

The process of discovering the sections of the genome that encode genes is known as gene finding or gene prediction. This comprises both protein-coding genes and RNA (ribonucleic acid)-coding genes, as well as the prediction of other functional elements like regulatory regions. Once a species' genome has been sequenced, discovering genes is one of the first and most crucial steps in comprehending it.
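The simplest form of gene finding, locating open reading frames (ORFs), can be sketched in a few lines. The following is a minimal illustration only, assuming a forward-strand scan with the standard start codon ATG and stop codons TAA, TAG, and TGA. The function name `find_orfs` and the `min_codons` parameter are hypothetical, and real predictors such as AUGUSTUS or Glimmer use statistical models rather than this naive scan.

```python
# Naive ORF scan: forward strand only, standard start/stop codons.
# Illustrative simplification; not how production gene predictors work.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return sorted (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the A in ATG; end is the index just past
    the stop codon. min_codons counts the start codon but not the stop.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)


example = "CCATGAAATTTGGGTAACC"
for start, end, frame in find_orfs(example):
    print(start, end, frame, example[start:end])
```

A fuller version would also scan the reverse complement and, for eukaryotes, account for introns, which is why dedicated gene prediction tools are used in practice.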

Structural annotation tools for genes

AUGUSTUS: This is a free program that predicts genes in eukaryotic genome sequences. Its protein profile extension (PPX) uses protein family-specific conservation, supplied as a block profile, to recognize members of a protein family and their exon-intron organization. Alternative splicing and alternative transcripts can be predicted using mRNA (messenger RNA) alignments, EST (expressed sequence tag) alignments, conservation, and other sources of information.

GENEID: This is a program that predicts genes, untranslated regions, splice sites, and other signals in genomic DNA.

RepeatMasker: RepeatMasker is a program that screens DNA (deoxyribonucleic acid) for interspersed repeats and low-complexity sequences.

Codon Usage Database (Kazusa): The Codon Usage Database has codon usage tables for a variety of species.

AtGDB GeneSeqer Web server: The AtGDB GeneSeqer Web server is for determining splice junctions in Arabidopsis sequences.

GENEMARK: GeneMark is a collection of algorithms for predicting genes in genomic DNA, offered by the Georgia Institute of Technology's Bioinformatics Group.

TSSP-TCM (TSSplant-transductive confidence machine): TSSP-TCM offers plant promoter identification.

WISE2: WISE2 aligns a protein sequence to a genomic DNA sequence, accounting for introns and frameshift errors.

Functional genome annotation

The term ‘functional gene annotation’ refers to the description of a protein's biochemical and biological activity. Functional annotation analyses include similarity searches, identification of transmembrane domains in polypeptide sequences, prediction of secondary metabolite gene clusters, and searches for Gene Ontology terms. For similarity searches, researchers use BLASTP, the protein-protein search program of the NCBI BLAST+ (Basic Local Alignment Search Tool) suite, to locate similar proteins in a protein database.

Functional annotation tools

Blast2GO (used to find Gene Ontology, or GO, annotation terms), WoLF PSORT (used for predicting the subcellular localization of eukaryotic proteins), and TMHMM (Transmembrane Helices; Hidden Markov Model, used to find transmembrane domains in protein sequences) are some examples of functional annotation tools used in bioinformatics. Using BLAST to detect similarities and then annotating genome sequences based on those similarities is the most basic level of annotation in bioinformatics. However, annotation platforms now draw on an increasing amount of supplementary information. Manual annotators can use this additional information to resolve differences between genes that share the same annotation.
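The most basic, similarity-based level of annotation described above can be sketched as a best-hit transfer. The sketch below assumes the hits arrive in BLAST's default 12-column tabular layout (`-outfmt 6`). The e-value cutoff, the function names `best_hits` and `transfer_annotations`, and all sequence identifiers and functions in the sample data are hypothetical and chosen only for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-5):
    """Keep the highest-bitscore subject per query below an e-value cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(FIELDS, row))
        if float(hit["evalue"]) > max_evalue:
            continue  # hit not significant enough to transfer from
        bitscore = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def transfer_annotations(hits, known_functions):
    """Copy the function of each best-hit subject onto the query gene."""
    return {query: known_functions.get(subject, "hypothetical protein")
            for query, subject in hits.items()}


# Hypothetical hits: geneA has two hits, geneB only a non-significant one.
sample = (
    "geneA\trefProt1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t400\n"
    "geneA\trefProt2\t60.0\t180\t72\t0\t5\t185\t3\t183\t1e-30\t150\n"
    "geneB\trefProt2\t35.0\t90\t58\t1\t1\t90\t10\t99\t0.5\t30\n"
)
functions = {"refProt1": "DNA polymerase", "refProt2": "ABC transporter"}
print(transfer_annotations(best_hits(sample), functions))
```

Queries without a significant hit, such as geneB here, are left out of the result, which mirrors why supplementary evidence and manual curation are needed beyond this basic level.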

The diagrammatic representation of structural annotation is shown in the figure.

Context and Applications

This topic is significant in the exams at school, graduate, and post-graduate levels, especially for Bachelors in Zoology/Genetics/Biotechnology and Masters in Zoology/Genetics/Biotechnology.

Practice Problems

Question 1 : Which of the following is used as a tool in gene prediction in genome annotation?

Answer: Option a is correct.

Explanation: AUGUSTUS is a tool for gene prediction; the other options are annotation databases.

Question 2: Which of the following is used for plant promoter identification?

Answer: Option b is correct.

Explanation: TSSP-TCM (TSSplant-transductive confidence machine) is a structural annotation tool. It offers plant promoter identification.

Question 3: NCBI BLAST+BLASTP is used for _____.

Explanation: Researchers use BLASTP from the NCBI BLAST+ suite to locate similar proteins in a protein database for similarity searches.

Question 4: What is the function of structural genome annotation?

Answer: Option d is correct.

Explanation: The annotation process involves identifying and positioning open reading frames (ORFs), gene architecture and coding sequences, and regulatory motifs.

Question 5: Which of the following is an example of the database used to find and annotate genes and their functions?

Explanation: WormBase is an example of an annotation database; the other options are gene prediction tools.

mRNA Analysis Pipeline

Introduction

The GDC mRNA quantification analysis pipeline measures gene level expression with STAR as raw read counts. Subsequently the counts are augmented with several transformations including Fragments per Kilobase of transcript per Million mapped reads (FPKM), upper quartile normalized FPKM (FPKM-UQ), and Transcripts per Million (TPM). These values are additionally annotated with the gene symbol and gene bio-type. These data are generated through this pipeline by first aligning reads to the GRCh38 reference genome and then by quantifying the mapped reads. To facilitate harmonization across samples, all RNA-Seq reads are treated as unstranded during analyses.

Data Processing Steps

The mRNA Analysis pipeline begins with the Alignment Workflow, which is performed using a two-pass method with STAR. STAR aligns each read group separately and then merges the resulting alignments into one. Following the methods used by the International Cancer Genome Consortium, ICGC (GitHub), the two-pass method includes a splice junction detection step, which is used to generate the final alignment. This workflow outputs a genomic BAM file, which contains both aligned and unaligned reads. Quality assessment is performed pre-alignment with FASTQC and post-alignment with Picard Tools.

Files that were processed after Data Release 14 have associated transcriptomic and chimeric alignments in addition to the genomic alignment detailed above. This only applies to aliquots with at least one set of paired-end reads. The chimeric BAM file contains reads that were mapped to different chromosomes or strands (fusion alignments). The genomic alignment files contain chimeric and unaligned reads to facilitate the retrieval of all original reads. The transcriptomic alignment reports aligned reads with transcript coordinates rather than genomic coordinates. The transcriptomic alignment is also sorted differently to facilitate downstream analyses. BAM index file pairing is not supported by this method of sorting, which does not allow for BAM slicing on these alignments. The splice-junction file for these alignments is also available.

Files that were processed after Data Release 25 will have associated gene fusion files.

As of Data Release 32 the reference annotation will be updated to GENCODE v36 and HT-Seq will no longer be used.

RNA Alignment Pipeline

Note that version numbers may vary in files downloaded from the GDC Data Portal due to ongoing pipeline development and improvement.

The primary counting data is generated by STAR and includes a gene ID, unstranded, and stranded counts data. Following alignment, the raw counts files produced by STAR are augmented with commonly used counts transformations (FPKM, FPKM-UQ, and TPM) along with basic annotations as part of the RNA Expression Workflow. These data are provided in a tab-delimited format. GENCODE v36 was used for gene annotation.
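As a rough illustration of working with such tab-delimited counts, the sketch below parses a small counts table. It assumes a layout of four tab-separated columns (gene ID, unstranded count, and two stranded counts) with no header line, and summary rows whose first field starts with `N_`, as in STAR's per-gene counts output. The column names used in the code, the function name `parse_star_counts`, and the gene IDs in the sample are hypothetical, and files downloaded from the GDC may carry extra columns or header lines.

```python
def parse_star_counts(text):
    """Split a STAR-style counts table into per-gene and summary records.

    Assumed layout (an assumption, not confirmed by the source): four
    tab-separated columns, no header, summary rows prefixed with "N_".
    """
    counts, summary = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        gene_id, unstranded, first, second = line.split("\t")[:4]
        record = {
            "unstranded": int(unstranded),
            "stranded_first": int(first),
            "stranded_second": int(second),
        }
        # Rows such as N_unmapped summarise reads not assigned to any gene.
        target = summary if gene_id.startswith("N_") else counts
        target[gene_id] = record
    return counts, summary


# Hypothetical example data.
sample = (
    "N_unmapped\t120\t120\t120\n"
    "N_multimapping\t300\t300\t300\n"
    "geneA\t1000\t480\t520\n"
    "geneB\t0\t0\t0\n"
)
gene_counts, summary_rows = parse_star_counts(sample)
print(gene_counts["geneA"]["unstranded"], sorted(summary_rows))
```

Because the GDC treats all RNA-Seq reads as unstranded, the unstranded column is the one a downstream normalization step would typically read.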

Note that the STAR counting results will not count reads that are mapped to more than one different gene. Below are two files that list genes that are completely encompassed by other genes and will likely display a value of zero.

mRNA Expression Transformation

RNA-Seq expression level read counts produced by the workflow are normalized using three commonly used methods: FPKM, FPKM-UQ, and TPM. Normalized values should be used only within the context of the entire gene set. Users are encouraged to normalize raw read count values if a subset of genes is investigated.

The fragments per kilobase of transcript per million mapped reads (FPKM) calculation aims to control for transcript length and overall sequencing quantity.

The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in which the protein coding gene in the 75th percentile position is substituted for the sequencing quantity. This is thought to provide a more stable value than including the noisier genes at the extremes.

The transcripts per million calculation is similar to FPKM, but the difference is that all transcripts are normalized for length first. Then, instead of using the total overall read count as a normalization for size, the sum of the length-normalized transcript values are used as an indicator of size.
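The dependence of normalized values on the gene set can be seen in a small sketch of the TPM calculation as described above: counts are first length-normalized, then scaled by the sum of the length-normalized values. The gene names, counts, and lengths are made up for illustration, and the function name `tpm` is hypothetical.

```python
def tpm(read_counts, lengths_bp):
    """Transcripts per million over the genes present in read_counts."""
    # Step 1: length-normalize each count (reads per kilobase).
    per_kb = {g: read_counts[g] * 1000 / lengths_bp[g] for g in read_counts}
    # Step 2: scale by the sum of length-normalized values, not total reads.
    scale = sum(per_kb.values())
    return {g: value * 1_000_000 / scale for g, value in per_kb.items()}


# Hypothetical toy gene set.
counts = {"geneA": 1000, "geneB": 500, "geneC": 250}
lengths = {"geneA": 3000, "geneB": 1500, "geneC": 500}

full = tpm(counts, lengths)
subset_counts = {g: counts[g] for g in ("geneA", "geneB")}
subset = tpm(subset_counts, lengths)

print(round(sum(full.values())))              # full set sums to one million
print(round(full["geneA"]), round(subset["geneA"]))  # value shifts on subsetting
```

The same gene receives a different TPM once the gene set changes, which is why raw read counts should be re-normalized when only a subset of genes is investigated.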

FPKM Calculations

Note: The read count is multiplied by a scalar (10^9) during normalization to account for the kilobase and 'million mapped reads' units.

Sample 1: Gene A

FPKM for Gene A = 1,000 * 10^9 / (3,000 * 50,000,000) = 6.67

FPKM-UQ for Gene A = 1,000 * 10^9 / (3,000 * 2,000 * 19,029) = 8.76

TPM for Gene A = (1,000 * 1,000 / 3,000) * 1,000,000 / (9,000,000) = 37.04
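The three sample calculations can be reproduced with a short script. The meaning assigned to each operand is an assumption inferred from the definitions in the previous section: 1,000 reads for Gene A, a gene length of 3,000 bp, 50,000,000 total mapped reads, an upper-quartile gene read count of 2,000, 19,029 protein-coding genes, and 9,000,000 as the sum of length-normalized values. All function and parameter names are hypothetical.

```python
def fpkm(read_count, length_bp, total_mapped_reads):
    """Fragments per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def fpkm_uq(read_count, length_bp, upper_quartile_count, n_genes):
    """FPKM with the upper-quartile term replacing the total read count."""
    return read_count * 1e9 / (length_bp * upper_quartile_count * n_genes)


def tpm_for_gene(read_count, length_bp, sum_length_normalized):
    """TPM for one gene, given the sum of length-normalized values."""
    per_kb = read_count * 1000 / length_bp
    return per_kb * 1_000_000 / sum_length_normalized


# Operands taken from the Sample 1 calculations for Gene A.
print(round(fpkm(1_000, 3_000, 50_000_000), 2))         # 6.67
print(round(fpkm_uq(1_000, 3_000, 2_000, 19_029), 2))   # 8.76
print(round(tpm_for_gene(1_000, 3_000, 9_000_000), 2))  # 37.04
```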

Fusion Pipelines

The GDC uses two pipelines for the detection of gene fusions.

The GDC gene fusion pipeline uses the STAR-Fusion v1.6 algorithm to generate gene fusion data. The STAR-Fusion pipeline processes the output generated by the STAR aligner to map junction reads and spanning reads to a junction annotation set. It uses the chimeric junction file produced by the STAR aligner and generates a tab-delimited gene fusion prediction file. The prediction file provides fused gene names, junction read counts, and breakpoint information.

The Arriba gene fusion pipeline uses Arriba v1.1.0 to detect gene fusions from the RNA-Seq data of tumor samples.

scRNA-Seq Pipeline (single-nuclei)

The GDC processes single-cell RNA-Seq (scRNA-Seq) data using the Cell Ranger pipeline to calculate gene expression followed by Seurat for secondary expression analysis.

The gene expression pipeline, which uses Cell Ranger, generates three files:

The analysis pipeline, which uses the Seurat software, generates three files from an input of Filtered counts matrix:

When the input RNA was extracted from nuclei instead of cytoplasm, a slightly modified quantification method is implemented to include introns. Currently, these single-nuclei RNA-Seq (snRNA-Seq) analyses share the same experimental strategy (scRNA-Seq) in the Data Portal, and can be filtered by querying for aliquot.analyte_type = "Nuclei RNA".

To facilitate the use of harmonized data in user-created pipelines, RNA-Seq gene expression is accessible in the GDC Data Portal at several intermediate steps in the pipeline. Below is a description of each type of file available for download in the GDC Data Portal.

COMMENTS

  1. What is Gene Annotation in Bioinformatics?

    Gene annotation involves the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding layers of analysis and interpretation necessary to extracting biologically significant information and placing such derived details into context.

  2. Bioinformatics Tools: Gene Prediction/ Annotation

    GENEID a program to predict genes, exons, splice sites and other signals along a DNA sequence. JIGSAW a program that predicts gene models using the output from other annotation software. It uses a statistical algorithm to identify patterns of evidence corresponding to gene models.

  3. 7.13B: Annotating Genomes

    Genome annotation is the process of attaching biological information to sequences. It consists of two main steps: identifying elements on the genome, a process called gene prediction, and attaching biological information to these elements.

  4. Genome Annotation

    Genome annotation is the process of deriving the structural and functional information of a protein or gene from a raw data set using different analysis, comparison, estimation, precision, and other mining techniques.

  5. Gene Ontology

    Annotation [ edit] Genome annotation encompasses the practice of capturing data about a gene product, and GO annotations use terms from the GO to do so. Annotations from GO curators are integrated and disseminated on the GO website, where they can be downloaded directly or viewed online using AmiGO. [9]

  6. Chapter 21 quiz Flashcards

    What is bioinformatics? A.a procedure that uses software to order DNA sequences in a variety of comparable ways B. a software program available from NIH to design genes C. a technique using 3-D images of genes in order to predict how and when they will be expressed

  7. Liftoff: accurate mapping of gene annotations

    Here, we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript and gene. We show ...

  8. What is gene annotation in bioinformatics?

    Gene annotation is the process of giving meaning to the nucleotide sequence. It encompasses a broad range of activities. It goes from finding the genes on a nucleotide sequence all the way to associating those genes with function.

  9. BUSCO: assessing genome assembly and annotation completeness with

    Abstract. Motivation: Genomics has revolutionized biological research, but quality assessment of the resulting assembled sequences is complicated and remains mostly limited to technical measures like N50. Results: We propose a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content.

  10. Genome annotation 1

    Genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining ...

  11. The NCBI Eukaryotic Genome Annotation Pipeline

    The NCBI Eukaryotic Genome Annotation Pipeline The NCBI Eukaryotic Genome Annotation Pipeline provides content for various NCBI resources including Nucleotide, Protein, BLAST, Gene and the Genome Data Viewer genome browser. This page provides an overview of the annotation process.

  12. Gene prediction

    Gene prediction is one of the key steps in genome annotation, following sequence assembly, the filtering of non-coding regions and repeat masking. Gene prediction is closely related to the so-called 'target search problem' investigating how DNA-binding proteins (transcription factors) locate specific binding sites within the genome.

  13. Bioinformatics: databasing and gene annotation

    To completely annotate function, several different databases are required, including sequence, genome, gene function, protein, and protein interaction databases. Because of the limited coverage of some microarrays or experiments, biological data repositories may be consulted, in the case of microarrays, to complement results.

  14. DNA annotation

    DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. An annotation (irrespective of the context) is a note added by way of explanation or commentary. Once a genome is sequenced, it needs to be annotated to make sense of it. [1]

  15. Gene Prediction (Chapter 8)

    Summary. With the rapid accumulation of genomic sequence information, there is a pressing need to use computational approaches to accurately predict gene structure. Computational gene prediction is a prerequisite for detailed functional annotation of genes and genomes. The process includes detection of the location of open reading frames (ORFs ...

  16. ch 21 Flashcards

    What is gene annotation in bioinformatics? A) finding transcriptional start and stop sites, RNA splice sites, and ESTs in DNA sequences B) assigning names to newly discovered genes C) describing the functions of noncoding regions of the genome D) matching the corresponding phenotypes of different species d Bioinformatics includes _____.

  17. Genes

    With the advances in high-throughput sequencing technology, an increasing amount of research in revealing heterogeneity among cells has been widely performed. Differences between individual cells' functionality are determined based on the differences in the gene expression profiles. Although the observations indicate a great performance of clustering methods, manual annotation of the ...

  18. Genome annotation

    Gene annotation is the method of identifying gene locations and coding sections. It helps us understand what these genes are doing in the body through establishing structural characteristics and linking them to the actions of various proteins. The importance of genome annotation

  19. Genes, Proteins, & Sequence Analysis

    Bioinformatics; Genes, Proteins, & Sequence Analysis; Search this Guide Search. Bioinformatics. Resources for those interested in the subject of bioinformatics, the interdisciplinary science that uses information technology to solve molecular biology problems. ... Comprehensive resource for protein sequence and annotation data. Help. more ...

  20. Bioinformatics Pipeline: mRNA Analysis

    GENCODE v36 was used for gene annotation. Note that the STAR counting results will not count reads that are mapped to more than one different gene. Below are two files that list genes that are completely encompassed by other genes and will likely display a value of zero. Overlapped Genes (stranded) Overlapped Genes (unstranded)

  21. Bioinformatics

    Bioinformatics ... although the exact sequence found in these regions can vary between genes. Genome annotation can be classified into three levels: the nucleotide, protein, and process levels. Gene finding is a chief aspect of nucleotide-level annotation. For complex genomes, the most successful methods use a ...