Brief background information and descriptions of the inner workings of the database can be found in the headings below; for thorough information on how the website works and why it was made, see the preprint.
During maturation of many eukaryotic RNAs, including nearly all pre-mRNAs and lncRNAs, many sequences, known as introns, are excised from the primary transcripts. The remaining sequences, called exons, are joined together to form functional mature RNAs. This process is called RNA splicing, and is carried out by the spliceosome, a complex of hundreds of proteins and an assortment of small nuclear RNAs. For a thorough review of the spliceosome, see Papasaikas and Valcárcel (2016).
Most introns found in eukaryotic pre-mRNA transcripts are removed by the major, or U2-dependent, spliceosome (named for one of its snRNA components). In many eukaryotes, there is also a minor, or U12-dependent, spliceosome that removes a tiny fraction (often less than 1%) of introns in pre-mRNAs. Although U12-dependent introns are very rare, they occur in key genes of critical developmental pathways in a wide variety of evolutionarily diverse organisms. Initially, it was thought that the primary difference between U2- and U12-dependent introns was the identity of the terminal dinucleotides: introns starting with GT and ending with AG were U2-dependent, and those starting with AT and ending with AC were U12-dependent. Eventually, it was discovered that there are U2-dependent introns with AT-AC terminal dinucleotides and U12-dependent introns with GT-AG terminal dinucleotides. It is instead larger regions at both splice sites and the branch point region that determine which spliceosome is used to excise a given intron. For a thorough review of the minor spliceosome, see Turunen et al. (2013). For a discussion of the conservation of the role of U12-dependent splicing in eukaryotic development, see Gault et al. (2017).
The core splice site recognition sequences are often too variable for the spliceosome to unambiguously define splice sites. Many other sequences in the primary transcript serve as recognition sites for trans-acting splicing factors that play a major role in determining which of many possible splice sites are used in a given primary transcript. Consequently, the same primary transcript may be spliced into a variety of mature RNAs, which may encode different proteins or carry out different functions as RNAs. This phenomenon, known as alternative splicing, occurs as a part of normal regulation of gene expression, but many diseases arise from disruptions of splicing regulation that ultimately result in pathogenic alternative splicing events. For a comprehensive review of the role of aberrant splicing in human diseases, see Daguenet et al. (2015). For a review of diseases associated with errors in minor splicing in particular, see Padgett (2012).
Nearly all of the data in the database was generated by intronIC, a program that uses a whole-genome sequence file and an annotation file to annotate all introns in that genome with an intron class assignment, along with whatever other information is available in the annotation file. In brief, intronIC evaluates certain sequences of every intron using a position-weight matrix developed from consensus U12-dependent introns and assigns each intron a score quantifying its distance from the consensus U12 5' splice site and branch-point sequences. For a detailed description of how intronIC works, see (insert link to paper). The whole-genome FASTA and GTF files used as input to intronIC were downloaded from the Ensembl FTP servers for the latest releases of each division (Ensembl release 93 and Ensembl Plants, Fungi, and Metazoa release 39).
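To make the idea of scoring a sequence against a position-weight matrix more concrete, the following is a minimal Python sketch. It is not intronIC's code: the function names, the pseudocount, the uniform background frequency, and the toy motif sequences are all assumptions made for illustration, and intronIC's actual scoring procedure is described in its own paper.

```python
import math

BASES = "ACGT"

def build_pwm(sequences, pseudocount=1.0, background=0.25):
    """Build a log-odds position-weight matrix from equal-length sequences.

    Illustrative only: assumes every sequence contains only A, C, G, T and
    that the background frequency of each base is uniform.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    pwm = []
    for pos in range(length):
        # Start every count at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in BASES}
        for seq in sequences:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {base: math.log2((counts[base] / total) / background) for base in BASES}
        )
    return pwm

def score_sequence(pwm, seq):
    """Sum the per-position log-odds scores for a candidate sequence."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match the matrix length")
    return sum(column[base] for column, base in zip(pwm, seq))

# Toy motif set (hypothetical, not taken from the database).
toy_motifs = ["GTATCCTT", "ATATCCTT", "GTATCCTT", "GTATCCTA"]
toy_pwm = build_pwm(toy_motifs)
```

With a matrix like this, a candidate that matches the motif at most positions receives a higher score than one that diverges from it, which is the sense in which a score quantifies distance from a consensus.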
A custom R script used the biomaRt R API for the Ensembl BioMart to obtain gene symbols for every intron represented in the database. A custom Python script then combined the output of intronIC, the list of gene symbols, and the list of orthologous clusters (described in the "Ortholog Search Engine" section below) into a PostgreSQL database.
The main search engine allows users to search for introns using criteria such as organism name, genome assembly, gene symbol, Ensembl gene or transcript ID, and intron class. These fields were chosen because they were believed to be the most common criteria by which users would want to search for an intron. The advanced search engine allows users to submit queries that use any field in the database as a search criterion. Similarly, the columns available in the results table on the main search page were chosen because they were believed to be the most commonly desired output fields, while the advanced search page allows users to include any combination of fields in the results table.
The ortholog search engine searches a distinct table in the PostgreSQL database that contains only one column: a list of intron IDs corresponding to a group of orthologous introns. Pairs of orthologous introns were determined by reciprocal pairwise alignment of all introns in regions of the analyzed genomes that had strong alignment as annotated by BLASTp.
Users can use an intron's unique intron ID (typically obtained from a search on one of the other search engines) to obtain a list of all intron sequences in each ortholog cluster. Because introns vary in length, the sequence shown for each intron contains 15 nucleotides of the upstream exon, a vertical bar denoting the 5' splice site, the first 10 nucleotides of the intron, the most likely branch point region, the last 5 nucleotides of the intron, a vertical bar denoting the 3' splice site, and 15 nucleotides of the downstream exon. These sequences were chosen because they were felt to concisely represent the most evolutionarily important regions of each intron. The intron ID on each intron's individual page is a link to the results of an ortholog search for that intron.
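The abbreviated sequence layout described above can be sketched in Python as follows. The function name and the "..." separator between the intron segments are assumptions (the description does not say how the segments are joined), and the branch point region is passed in by the caller rather than predicted here.

```python
def format_intron_display(upstream_exon, intron, downstream_exon,
                          branch_point_region, separator="..."):
    """Assemble an abbreviated intron display string.

    Layout: last 15 nt of the upstream exon, '|', first 10 nt of the intron,
    the branch point region, last 5 nt of the intron, '|', first 15 nt of the
    downstream exon. The separator between intron segments is hypothetical.
    """
    upstream_flank = upstream_exon[-15:]
    downstream_flank = downstream_exon[:15]
    intron_part = separator.join(
        [intron[:10], branch_point_region, intron[-5:]]
    )
    return upstream_flank + "|" + intron_part + "|" + downstream_flank

# Toy sequences (made up for illustration).
example = format_intron_display(
    upstream_exon="C" * 5 + "A" * 15,
    intron="GTATCCTTAA" + "CCCC" + "TCCTTAAC" + "GGGG" + "TTCAG",
    downstream_exon="T" * 15 + "G" * 5,
    branch_point_region="TCCTTAAC",
)
```

Because only fixed-length pieces of each intron are shown, introns of very different lengths produce strings with the splice sites in comparable positions, which makes a cluster easier to compare by eye.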
The U12 search engine provides a simple interface for searching only the U12-dependent introns contained within the database. The search engine conducts full text searches against only the genome assembly name, taxonomic name, common name, gene symbol, Ensembl gene ID, Ensembl transcript ID, and terminal dinucleotides columns of a table containing all U12-dependent introns in the database. For details about how PostgreSQL full text search queries work, see the PostgreSQL documentation on full text search.
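For readers unfamiliar with PostgreSQL full text search, the sketch below builds a query of the general kind described above. The table name, the column names, and the choice of the 'simple' text search configuration are hypothetical, since the actual schema and query used by the website are not given here; to_tsvector, plainto_tsquery, and the @@ match operator are standard PostgreSQL features.

```python
# Hypothetical column names standing in for the seven searched fields.
SEARCHED_COLUMNS = [
    "genome_assembly",
    "taxonomic_name",
    "common_name",
    "gene_symbol",
    "ensembl_gene_id",
    "ensembl_transcript_id",
    "terminal_dinucleotides",
]

def build_u12_search_query(table="u12_introns", columns=SEARCHED_COLUMNS):
    """Return a parameterized SQL string for a full text search.

    The searched columns are concatenated into one text document, converted
    to a tsvector, and matched against the user's search terms. The %s
    placeholder is meant to be filled in by a database driver, not by
    string formatting, so that user input is never pasted into the SQL.
    """
    document = " || ' ' || ".join(f"coalesce({col}, '')" for col in columns)
    return (
        f"SELECT * FROM {table} "
        f"WHERE to_tsvector('simple', {document}) "
        f"@@ plainto_tsquery('simple', %s)"
    )

query = build_u12_search_query()
```

A driver that uses %s placeholders would then be called with the query and the user's search terms as a separate parameter, for example a gene symbol together with a species name.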
BED and FASTA files containing most of the annotation and sequence information for every intron in the database are available on the downloads page. There are two files of each type for each genome assembly represented in the database: one containing only the U2-dependent introns and one containing only the U12-dependent introns. Files containing the orthology data are also available for download. Each orthology file has one row for each intron in that genome and class that has an annotated ortholog, with the intron ID in the first column and a comma-delimited list of all intron IDs orthologous to that intron (but not necessarily orthologous to every other intron in the list) in the second.
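As an illustration of how the orthology files might be read, here is a small Python sketch. It assumes the two columns are separated by a tab, which is not stated above, and the intron IDs in the example are invented placeholders rather than real database IDs.

```python
def parse_orthology_lines(lines, column_sep="\t"):
    """Map each intron ID to the list of intron IDs listed as its orthologs.

    Assumes two columns per row: the intron ID, then a comma-delimited list
    of ortholog IDs. The tab column separator is an assumption.
    """
    orthologs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        intron_id, ortholog_field = line.split(column_sep, 1)
        orthologs[intron_id] = [x for x in ortholog_field.split(",") if x]
    return orthologs

def listed_as_orthologs(orthologs, intron_a, intron_b):
    """True if intron_b appears in intron_a's own ortholog list."""
    return intron_b in orthologs.get(intron_a, [])

# Made-up rows in the assumed layout.
toy_rows = [
    "intronA\tintronB,intronC\n",
    "intronB\tintronA\n",
]
toy_orthologs = parse_orthology_lines(toy_rows)
```

Because the introns in one row are not necessarily orthologous to each other, a pairwise check should consult the row for the intron of interest rather than infer orthology from two IDs appearing in the same list. In the toy data, intronB and intronC both appear in the row for intronA, but intronC is not in the row for intronB.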