Overlapping Gene Topology and Function
What Are Overlapping Genes?
To the surprise of many, overlap between genes, open reading frames, and even coding sequences has proved to be common and functionally integrated in prokaryotic, eukaryotic, and viral genomes. This was revealed by modern genome-scale approaches for discovering novel genes, such as proteogenomics and ribosome profiling.
In addition, bioengineering can exploit the constraints that overlapping regions place on genomic sequences and their evolution to build more robust synthetic strains and constructs. This review addresses the identification, topology, and biogenesis of overlapping protein-coding and RNA-coding genes within the context of their genome biology.
We highlight novel applications of sequence overlap for translation control, compression of synthetic genetic constructs, and mutation resistance.
Until Frederick Sanger’s 1977 sequencing of the first DNA genome, scientists had been puzzled by a discrepancy. Earlier work on the proteins made during infection by bacteriophage φX174 suggested that they required longer coding sequences (CDSs) than the phage genome could accommodate.
Analysis of the genome sequence resolved the puzzle: the coding regions overlapped substantially, with the internal scaffolding gene overlapping the genome replication gene and the lysis gene lying entirely within the external scaffolding gene. The compact arrangement of these viral genes suggested that the genome might contain further, unidentified sites of polypeptide synthesis.
Further refinement of the φX174 gene model revealed an alternative start site inside the genome replication gene A. It produces a truncated protein whose CDS matches the C-terminus of the A protein but which carries out a different function. Overlapping genes have therefore been observed since the very beginning of sequencing and genomics.
Since then, overlapping genes, and more specifically overlapping open reading frames (ORFs) and CDSs, have emerged as a frequent feature documented during viral genome annotation, notably within the SARS-CoV-2 genome.
Outside viral genomics, however, their true abundance and significance were underestimated, and their identification and annotation in cellular genomes have often been treated as a peculiarity.
Owing to the rapid development of genome-scale methods for measuring proteins and RNA and of more sophisticated prediction algorithms (Box 1), the field is currently experiencing a renaissance.
These developments have revealed a large number of overlapping genes and ORFs in cellular genomes. Estimates of overlapping features in the human genome are substantially larger than previously anticipated and include 26% of all protein-coding genes. This estimate will probably rise as more previously annotated human genes are found to contain small ORFs (sORFs) encoding microproteins.
Overlapping Gene Topology and Function
Analysis of overlapping genes in cellular and viral genomes reveals several topological patterns of overlap that differ in occurrence between prokaryotes and eukaryotes. These patterns result from more frequent biogenesis of particular topologies, from evolutionary selection favoring the retention of particular topologies, or from a mix of the two.
There is currently no agreement on the relative importance of these two factors, creation versus retention. At least six mechanisms are thought to give rise to overlap; two of them are sequence extension or rearrangement of existing genes and de novo origination of a gene or ORF inside an existing gene.
There are three directional overlap topologies. Genes encoded on the same strand form unidirectional overlaps, which can be further subdivided according to the relative reading frames of the overlapping ORFs. The other two topologies, convergent and divergent, occur between genes on opposite strands.
Divergent and convergent overlaps are more common in eukaryotic genomes, whereas unidirectional overlaps are more common in the genomes of bacteria and viruses. The relationship between the two genes can be categorized as either partially overlapping, where only a portion of each gene’s sequence occupies the same genomic region, or nested, where the larger gene completely encloses the smaller gene.
‘Internal-external’ and ‘mother-daughter’ are alternative terms used to describe the relationship between a nested gene and the gene that encloses it.
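The topology categories described above can be sketched in a few lines of Python. This is a minimal illustration, not a published classifier: the `Gene` record, the `classify_overlap` function, and the coordinate convention (0-based, half-open, given on the forward strand) are all assumptions made for the example. The reading-frame offset is only meaningful when the coordinates are CDS boundaries, and the sketch leaves the direction unlabeled for nested genes on opposite strands, where convergent versus divergent is not well defined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 0-based, half-open, forward-strand."""
    name: str
    start: int   # leftmost position (inclusive)
    end: int     # rightmost position (exclusive)
    strand: str  # "+" or "-"


def classify_overlap(a: Gene, b: Gene) -> Optional[dict]:
    """Classify the overlap between two genes, or return None if they do not overlap."""
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return None

    # Extent: nested if one gene lies entirely inside the other, else partial.
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    extent = "nested" if (a_in_b or b_in_a) else "partial"

    frame_offset = None
    if a.strand == b.strand:
        direction = "unidirectional"
        # Relative reading frame of b with respect to a (0, 1, or 2),
        # measured from the 5' end of each CDS on the shared strand.
        if a.strand == "+":
            frame_offset = (b.start - a.start) % 3
        else:
            frame_offset = (a.end - b.end) % 3
    elif extent == "nested":
        # Assumption of this sketch: leave opposite-strand nested pairs unlabeled.
        direction = None
    else:
        # For a partial overlap the starts differ, so a "left" gene exists.
        left = a if a.start < b.start else b
        # Left gene on "+" and right gene on "-" point toward each other (3' ends overlap).
        direction = "convergent" if left.strand == "+" else "divergent"

    return {
        "direction": direction,
        "extent": extent,
        "shared_bp": shared,
        "frame_offset": frame_offset,
    }
```

For example, `classify_overlap(Gene("A", 0, 300, "+"), Gene("B", 250, 600, "-"))` reports a partial, convergent overlap of 50 bp, because the two genes are transcribed toward each other and share their 3′ ends.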
Estimates of the relative frequency of overlap types in prokaryotes and eukaryotes may have been skewed by the different ways genes are defined for these organisms in the literature. For instance, in the literature on prokaryotes and viruses, genes are counted as overlapping only when their CDSs overlap, whereas in the literature on eukaryotes, overlaps are frequently assessed between primary transcript boundaries.
Because of these differing definitions, some overlap types appear to occur more frequently in eukaryotes than in prokaryotes, but these apparent differences might vanish if the same definitions were applied to both. For instance, overlaps between 5′ and 3′ UTRs are not subject to the constraints on relative reading frame and sequence composition that apply to overlaps between CDSs.
We examine prokaryotic and eukaryotic gene overlap in terms of both their distinctive characteristics and, where present, their commonalities. However, the limitations imposed by the way overlapping genes are reported in the literature must be kept in mind.
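The effect of the definition on the reported count can be illustrated with a small sketch. The four annotations below are invented toy coordinates, not real genes, and the names `Annot`, `intervals_overlap`, and `overlapping_pairs` are hypothetical helpers introduced for the example. Strand is ignored here to keep the focus on the choice of boundary.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Annot(NamedTuple):
    """Hypothetical annotation with transcript and CDS bounds (0-based, half-open)."""
    name: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int


def intervals_overlap(s1: int, e1: int, s2: int, e2: int) -> bool:
    """True if two half-open intervals share at least one position."""
    return max(s1, s2) < min(e1, e2)


def overlapping_pairs(genes: List[Annot], by: str) -> List[Tuple[str, str]]:
    """List gene pairs that overlap under the chosen definition: 'cds' or 'transcript'."""
    if by not in ("cds", "transcript"):
        raise ValueError("by must be 'cds' or 'transcript'")
    pairs = []
    for g, h in combinations(genes, 2):
        if by == "cds":
            hit = intervals_overlap(g.cds_start, g.cds_end, h.cds_start, h.cds_end)
        else:
            hit = intervals_overlap(g.tx_start, g.tx_end, h.tx_start, h.tx_end)
        if hit:
            pairs.append((g.name, h.name))
    return pairs


# Toy data (made up for illustration): UTRs extend each transcript beyond its CDS.
toy_genes = [
    Annot("g1", 0, 1000, 100, 800),
    Annot("g2", 900, 2000, 1000, 1900),
    Annot("g3", 1950, 3000, 1960, 2900),
    Annot("g4", 2800, 4000, 2850, 3900),
]

cds_pairs = overlapping_pairs(toy_genes, by="cds")
tx_pairs = overlapping_pairs(toy_genes, by="transcript")
print(len(cds_pairs), "CDS-level overlaps;", len(tx_pairs), "transcript-level overlaps")
```

In this toy set, the same four annotations yield one overlapping pair under the CDS-based definition and three under the transcript-based definition, since two of the pairs overlap only through their untranslated regions.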