In silico gene prediction tools
Authors: Deepak V Pawar1, Kishor U Tribhuvan1, Jyoti Singh1
1ICAR-NRCPB, I.A.R.I, New Delhi-12

Identification of specific genes is fundamental to their isolation and cloning, the elucidation of their function, and their utilization for the development of products and/or services for human welfare. Prior to the era of genome sequencing, gene detection and isolation involved a series of cumbersome and technically demanding experiments using living cells and organisms. These methods analyzed genomic DNA clones and cDNA libraries with a variety of sophisticated techniques and were suitable for the detection of individual genes. As complete and virtually error-free genome sequences became available, the technology for genome-wide in silico gene searches was rapidly developed and refined. These endeavors have produced powerful computational resources that greatly facilitate gene identification through the analysis of genome sequences. Some of the commonly used tools and database servers dedicated to gene prediction are listed in Table 1. The development of these efficient computer programs for gene prediction is considered one of the most important developments to have facilitated the functional analysis of genomes.

Sr No.  Name              Function and prediction method
1       GENOMESCAN        Prediction of the locations of exon–intron boundaries in genome sequences
2       ATGpr             Identification of translation initiation sites in cDNA sequences
3       AUGUSTUS          Prediction of genes in eukaryotic genome sequences
4       ORF FINDER        A graphical analysis tool for prediction of open reading frames
5       BGF               A program for hidden Markov model-based ab initio gene prediction
6       GENIUS            The predicted genes from complete genome sequences are linked to the known protein 3D structures listed in the database
7       GENEID            This server predicts genes, signal sequences, and exons
8       GENEPARSER        Detection of introns and exons in the genes predicted from genome sequences
9       GeneMark          A family of gene prediction programs; based on a modified GeneScan algorithm
10      GeneMark.hmm      A gene prediction program for genome sequences of prokaryotes and eukaryotes
11      NIX               Web tool for gene prediction based on combining results from different programs
12      VEIL              A server using a hidden Markov model for finding genes in vertebrate DNA
13      Splice Predictor  This program identifies potential splice sites in plant pre-mRNA using Bayesian methods
14      GENESCAN          Gene prediction using Fourier transform
15      Fgenesh           The fastest and most accurate ab initio gene prediction program for eukaryotic genome sequences
16      NNPP              Promoter prediction by neural networks
17      NNSPLICE          Splice site prediction using a neural network
18      GrailEXP          Predicts exons, genes, promoters, poly-As, CpG islands, and repetitive elements within DNA sequences
19      EuGène            Gene detection in eukaryotic genomes; uses probabilistic models to discriminate between coding and noncoding sequences and to distinguish between effective splice sites and false splice sites
20      tRNAscan-SE       Prediction of tRNA encoding genes
Table 1. A list of some important and widely used gene prediction servers and tools

Gene prediction or gene finding

Gene prediction, or gene finding, refers to the identification, by analysis of genome sequences, of genomic regions that function as genes, i.e., that encode proteins or various types of RNA species. Gene prediction is the first step in genome annotation, which is taken up after the genome sequence has been assembled and checked for errors. Genome annotation is the process of identifying genes, their 5’ and 3’ regulatory sequences, and their functions. In addition, mobile genetic elements and repetitive sequence families are identified and characterized. Thus, genome annotation involves not only the identification of protein and RNA encoding genes and their regulatory sequences, but also the detection and description of other functional elements that have regulatory functions or are otherwise relevant to genome organization and function. In short, in silico gene prediction is one of the first and most important steps toward understanding the genome organization and function of a species through a detailed analysis of its genome sequence. The findings from in silico analyses are subsequently validated by suitably designed in vitro and in vivo studies.

The first step in the identification of a protein coding gene from a DNA sequence is the determination of the correct reading frame. A reading frame is the arrangement of sets of three bases, each representing a codon, beginning at a specific nucleotide in a DNA sequence. Since a frame can begin at the first, second, or third nucleotide, three reading frames are possible for each strand of a DNA molecule, giving six in all. The correct reading frame is therefore determined by carrying out a six-frame translation of the given DNA sequence. The longest reading frame that is not interrupted by a translation termination or nonsense codon (TAA, TAG, or TGA) is presumed to be the correct reading frame; such reading frames are generally known as open reading frames (ORFs). An ORF has an initiation codon (typically ATG) at its beginning and a termination codon at its end.

The 3’ ends of ORFs are easier to determine than their 5’ ends, since the ATG codon can also occur at internal sites of genes. Therefore, additional criteria have to be used to locate the 5’ ends of ORFs, e.g., the presence of a Kozak sequence (consensus approximately GCCRCCATGG, where R is a purine) that includes the ATG codon. The 5’ ends of many vertebrate genes have characteristic CpG islands, and analysis of codon usage may provide helpful indications. However, sequencing errors may hamper the correct identification of ORFs.
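The six-frame translation and ORF search described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the standard genetic code, ATG as the sole start codon, and ORFs defined as the stretch from the first ATG to the next in-frame stop codon. The function names (six_frame_translation, find_orfs, longest_orf) are made up for this sketch and do not correspond to any of the tools in Table 1.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(dna):
    """Translate all three frames of both strands."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def find_orfs(dna, min_length=0):
    """List ORFs as (strand, frame, start, end, sequence).

    Coordinates are 0-based on the scanned strand; 'end' is exclusive
    and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            start = None
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_length:
                        orfs.append((strand, offset, start, i + 3,
                                     seq[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs


def longest_orf(dna, min_length=0):
    """Return the longest ORF found in any of the six frames, or None."""
    return max(find_orfs(dna, min_length),
               key=lambda orf: len(orf[4]), default=None)
```

For a prokaryotic genome, a call such as find_orfs(genome, min_length=300) would apply a minimum-length cutoff; real ORF finders additionally handle alternative start codons and genetic codes, which this sketch ignores.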

Protein coding genes

Protein coding genes are usually identified by using a computer program to inspect the genome sequence for features that are specific to genes. For example, protein-coding genes, as a rule, comprise ORFs, and ORF detection is very effective for gene identification in bacteria. In general, the longer an ORF, the greater the chance that it represents a gene. However, several features of eukaryotic genes make a direct search for genes on the basis of ORFs very difficult. For example, most eukaryotic genes comprise alternating exons (coding regions) and introns (noncoding regions) in place of continuous ORFs. Further, the genes in humans and other eukaryotes are often widely spaced; this increases the chances of finding “false” genes in the long intergenic regions. The newer versions of ORF scanning software for eukaryotic genomes account for these features and enable efficient scanning for genes in eukaryotic genomes.

Strategies for the detection of genes

There are two main strategies for the detection of genes from genome sequences. The first strategy is based on the nucleotide sequences of already identified genes, cDNAs, and ESTs, and the amino acid sequences of known proteins, available in various databases. These sequences are used to search for homologous sequences in the given genome sequence using tools like BLAST. The sequences used for the homology search may belong to the same species, a related species, or even a distant species. This relaxed requirement is possible because coding sequences have usually been highly conserved during evolution. For example, sequences of the Mlo gene family from Arabidopsis thaliana have been used for detecting genes in the genome sequences of soybean, rice, sorghum, wheat, etc. This approach can be used to identify specific genes and genes belonging to particular gene families, but it cannot be used to search for all the genes present in the genome of a given organism.
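The idea behind a homology search can be illustrated with a local alignment score. The sketch below uses the Smith-Waterman dynamic programming algorithm with a linear gap penalty; it is not how BLAST works internally (BLAST uses a faster seed-and-extend heuristic), and the scoring values and the function name local_alignment_score are arbitrary choices made for illustration.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (best_score, end), where 'end' is the 0-based exclusive
    position in 'subject' at which the best local alignment ends.
    Only two rows of the scoring matrix are kept in memory.
    """
    query, subject = query.upper(), subject.upper()
    prev = [0] * (len(subject) + 1)
    best, best_end = 0, 0
    for i in range(1, len(query) + 1):
        curr = [0]
        for j in range(1, len(subject) + 1):
            same = query[i - 1] == subject[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best, best_end = score, j
        prev = curr
    return best, best_end


# Toy example: a short known coding fragment searched against a
# hypothetical genomic fragment.
known_gene = "ATGGCC"
genome_fragment = "TTTTATGGCCTTTT"
score, end = local_alignment_score(known_gene, genome_fragment)
```

A high score relative to the query length suggests a candidate homolog at that position; production tools also report statistical significance (E-values), which this sketch does not attempt.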

In the second approach to gene prediction, specialized software is used to search the genome sequences for the presence of genes; this is termed ab initio gene prediction. This is relatively easy and quite efficient in the case of prokaryotes. Computer programs like GeneMark.hmm and GLIMMER are capable of identifying all types of genes in prokaryotic genome sequences; these programs can detect even overlapping genes. GeneMarkS is a self-training program and is suitable for gene prediction from novel genomes. MetaGeneMark is designed for the analysis of metagenomic sequences. Several sophisticated tools for gene prediction from eukaryotic genome sequences, e.g., GeneMark-E, GeneMark.hmm-E, AUGUSTUS, GENESCAN, EuGène, Fgenesh, etc., are now available (Table 1). Some of these programs are designed for gene finding in a specific species or group of species; e.g., the program EuGène was developed for A. thaliana. GeneMark-ES is a self-training program and is suitable for use with novel eukaryotic genomes. Fgenesh is perhaps the fastest program for gene prediction from eukaryotic genomes, and it is also considered to be the most accurate of such programs. Some programs serve specific functions, e.g., NNPP performs promoter prediction using neural networks, Splice Predictor identifies potential splice sites in plant pre-mRNA using Bayesian methods, GENEPARSER detects introns and exons in the genes predicted from genomic sequences, etc. (Table 1).

Gene prediction programs search for gene-specific features, such as promoters, splice sites, and polyadenylation sites, or for pertinent gene content like ORFs. Many of the currently available gene search programs combine different search criteria, and their sensitivities vary widely. The identification of ORFs, usually those exceeding 300 nucleotides, is sufficient to find most genes in prokaryotic genomes. However, such a simple search criterion will miss smaller genes and overlapping genes. These problems are resolved by using algorithms that consider differences in base composition between genes and noncoding DNA, e.g., in programs like GeneMark. The gene prediction programs used for eukaryotes combine the output from several algorithms to generate a whole gene model. In this model, a gene is defined as a series of exons that are coordinately transcribed. The various features of eukaryotic genes, including transcriptional and translational control elements like the TATA box, cap site, Kozak sequence, and polyadenylation sites, are recognized during the gene detection process. But problems arise because the TATA box is missing in ~70% of human genes, and polyadenylation signal sequences can differ considerably from the consensus sequence AATAAA. Further, the above criteria identify only the first and the last exon of a given gene. Therefore, additional features have been included in modern gene search tools; these include 5’ and 3’ splice sites, differences in base composition between coding and noncoding DNA, etc.
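The use of base composition differences between coding and noncoding DNA can be illustrated with a simple log-odds score. The sketch below estimates in-frame codon frequencies from example coding sequences and from example noncoding sequences, then scores a candidate region by summing the log ratio of the two frequencies. It is a deliberately simplified stand-in, not the model actually used by GeneMark or any other program in Table 1; the function names, the pseudocount, and the toy training sequences are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(sequences, pseudocount=1.0):
    """Estimate frequencies of non-overlapping triplets (read in frame 0).

    A pseudocount keeps every codon at a non-zero frequency so that
    the log ratio below is always defined.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):      # skip ambiguous bases
                counts[codon] += 1
    total = sum(counts.values()) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_log_odds(seq, coding_freq, background_freq):
    """Sum of log(P_coding / P_background) over the in-frame codons.

    Positive scores mean the region looks more like the coding
    training set; negative scores mean it looks more like background.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:
            score += math.log(coding_freq[codon] / background_freq[codon])
    return score


# Toy training data (hypothetical sequences, far too small for real use).
coding_freq = codon_frequencies(["ATGGCCGCTGCCGGCGCCTAA"])
background_freq = codon_frequencies(["ATATTTAAATATTTTAATATA"])
candidate_score = coding_log_odds("GCCGCC", coding_freq, background_freq)
```

In practice such scores are computed from large training sets, often with higher-order Markov models, and are evaluated in all six reading frames alongside signal features such as splice sites.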

Determination of gene function

Once the genes are predicted, their functions can be determined as follows. The simplest method for identifying the function of a new gene is to translate its base sequence into the amino acid sequence it is expected to encode. This protein sequence is then compared with a protein database like PDB (the Protein Data Bank); a program like BLASTX, which translates a nucleotide query in all six reading frames and searches the translations against a protein database, performs both these operations. If the predicted protein is homologous to a protein in the database, this suggests the gene function and supports the identification of a new gene. Alternatively, the gene sequence may be compared with the genes present in the syntenic genomic region of a related species that has rich genomic resources. Homology with a gene in the syntenic region would indicate the most likely function of the gene. Several types of RNA species are noncoding, e.g., rRNA, tRNA, and a variety of small RNA species. Of these, the genes encoding rRNA are the easiest to detect; this is done by sequence similarity search, since their sequence is highly conserved across species. The program tRNAscan-SE searches for tRNA encoding genes.

About Author / Additional Info:
I am a PhD research scholar at IARI, New Delhi, in the discipline of Molecular Biology and Biotechnology. I am working on blast disease resistance in O. sativa.