Thursday, November 28, 2013

How to Get Transcripts (also Exons & Introns) of a Gene using Ensembl API

As part of my project, I need to obtain the exons and introns of certain genes. These are human genes selected for a specific reason that I will describe later when I explain the project. For now, I want to share how to obtain this information using the (Perl) Ensembl API. Note that Ensembl has started a beautiful new way of getting data, the Ensembl REST API, but it is still in beta and it doesn't provide exon / intron information. So we have to use the Perl API.

If you haven't installed the Ensembl API yet, see my Ensembl (Perl) API installation post.

To begin with, I want to point to the tutorial set by Ensembl, which I used to learn the API. The tutorials are really useful, so for more detail about the API, please see the filmed API workshop. The Doxygen Perl documentation also describes the classes of the API.

First, let's load the registry so that we can get adaptors from it:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;
my $registry = "Bio::EnsEMBL::Registry";
# Connect to the public Ensembl server as the anonymous user
$registry->load_registry_from_db(-host => 'ensembldb.ensembl.org', -user => 'anonymous');

Basically, in this API (specifically the Ensembl core database API) there are "adaptors" (for genes, transcripts, ...) and "objects" (genes, transcripts, ...). Adaptors are used to retrieve objects from the Ensembl database. So, if you want information about a gene, you first have to get a gene adaptor from the registry. Then you call its "fetch_by_stable_id" method with an Ensembl gene ID (e.g. ENSG00000198590) as the argument, and you get back the gene object.

my $gene_adaptor  = $registry->get_adaptor('Homo sapiens', 'Core', 'Gene');
my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000198590');

This gene object is then used to get the transcripts, and each transcript is used to get its exons and introns. We don't have to request more adaptors ourselves, because once we have the gene object the API fetches the transcripts for us. The same is true for exons (introns don't have an adaptor because they are not stored separately in the Ensembl databases). So, to get the transcripts, we call the "get_all_Transcripts" method of the gene object, which returns an array reference:

foreach my $transcript (@{ $gene->get_all_Transcripts }) {
}

Inside the foreach loop above, exons and introns can be retrieved with the "get_all_Exons" and "get_all_Introns" methods of the transcript object. Each exon / intron is then reached by looping in the same way:

foreach my $exon (@{ $transcript->get_all_Exons }) {
}

foreach my $intron (@{ $transcript->get_all_Introns }) {
}
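
Putting the three loops together, a minimal sketch could look like the one below. The sub name list_features is my own and not part of the API. It only assumes that the object passed in answers "get_all_Transcripts", that each transcript answers "stable_id", "get_all_Exons" and "get_all_Introns" as described above, and that exons and introns answer "start" and "end".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): returns one
# tab-separated line per exon and intron of the given gene object.
sub list_features {
    my ($gene) = @_;
    my @lines;

    foreach my $transcript (@{ $gene->get_all_Transcripts }) {
        my $transcript_id = $transcript->stable_id;

        foreach my $exon (@{ $transcript->get_all_Exons }) {
            push @lines, join("\t", $transcript_id, 'exon', $exon->start, $exon->end);
        }

        # Introns are built from the exons, so a transcript with a
        # single exon gives an empty list here.
        foreach my $intron (@{ $transcript->get_all_Introns }) {
            push @lines, join("\t", $transcript_id, 'intron', $intron->start, $intron->end);
        }
    }
    return @lines;
}
```

With the $gene object fetched earlier, it would be used as: print "$_\n" for list_features($gene);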

I suggest you check that you actually got an object at every step, because for some genes the Ensembl databases return nothing (undef in Perl), and if you call any method on that you get an error. So do the checks with an if clause:

if ($gene) {
# get its transcripts
}

if ($exon) {
# get its sequence
}
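
The same check can be wrapped in a small sub so that a missing gene is reported and skipped instead of crashing the script. This is only a sketch: fetch_gene_checked is a name I made up, and the only API call it relies on is "fetch_by_stable_id" on the gene adaptor shown earlier.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not part of the Ensembl API): fetch a gene and
# return nothing, with a warning, when no object comes back for the ID.
sub fetch_gene_checked {
    my ($gene_adaptor, $stable_id) = @_;

    my $gene = $gene_adaptor->fetch_by_stable_id($stable_id);
    unless (defined $gene) {
        warn "No gene found for '$stable_id', skipping\n";
        return;
    }
    return $gene;
}
```

The caller can then write: my $gene = fetch_gene_checked($gene_adaptor, 'ENSG00000198590') or exit;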

Once you have an object, there are other methods to obtain its ID, sequence, location, etc. Here I will give the methods for ID and sequence retrieval, but you can always refer to the Doxygen Perl (core API) documentation for more information.

To print the ID of genes, transcripts and exons (not introns, because they don't have one), we use the "stable_id" method of the objects:

print $gene->stable_id, "\n";
print $transcript->stable_id, "\n";
print $exon->stable_id, "\n";

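To print the sequence of the objects:

print $exon->seq->seq(), "\n";
print $intron->seq(), "\n";

Please note the small difference when printing intron sequences: for an exon, "seq" returns a sequence object, so a second "seq" call is needed to get the string, while for an intron "seq" returns the string directly.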
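
Since the goal is FASTA output, here is a minimal sketch of how a sequence string could be turned into a FASTA record in plain Perl. The sub name to_fasta, the header layout and the default line width of 60 are my assumptions for illustration, not necessarily what the full script does.

```perl
use strict;
use warnings;

# Hypothetical helper: format a header and a sequence string as one
# FASTA record, wrapping the sequence at $width characters per line.
sub to_fasta {
    my ($header, $sequence, $width) = @_;
    $width = 60 unless defined $width;

    my $fasta = ">$header\n";
    # Append the sequence in chunks of at most $width characters.
    for (my $pos = 0; $pos < length($sequence); $pos += $width) {
        $fasta .= substr($sequence, $pos, $width) . "\n";
    }
    return $fasta;
}
```

For an exon this could be called as: print to_fasta($exon->stable_id, $exon->seq->seq). Introns have no stable ID, so the header for an intron has to be built from something else, for example the transcript ID plus a counter.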

So that's all. The complete script that I use to get the exons and introns of a gene in FASTA format is available here in my GitHub repo. You run the script by supplying the gene ID as an argument:

gungor@gungor:~$ perl projects/eiban/eiSingleGet.pl ENSG00000198590
