Thursday, November 28, 2013

How to Get Transcripts (also Exons & Introns) of a Gene using Ensembl API

As part of my project, I need to obtain the exons and introns of certain genes. These are human genes selected for a specific reason that I will describe later when I explain my project. For now, I want to share how to obtain this information using the (Perl) Ensembl API. Note that Ensembl has introduced a nicer way of getting data (the Ensembl REST API), but it is still in beta and doesn't provide exon / intron information. So we have to use the Perl API.

If you haven't installed the Ensembl API yet, see my Ensembl (Perl) API installation post.

To begin with, I want to point to the tutorials provided by Ensembl, which I used to learn the API. They are really useful, so for more detailed information about the API, please see the filmed API workshop. The Doxygen Perl documentation also describes the classes of the API.

First, let's create a registry to be able to use adaptors:

use Bio::EnsEMBL::Registry;
my $registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_db(-host => 'ensembldb.ensembl.org', -user => 'anonymous');

Basically, this API (specifically the Ensembl core database API) has "adaptors" (for genes, transcripts, ...) and "objects" (genes, transcripts, ...). Adaptors are used to retrieve objects from the Ensembl database. So, to get information about a gene (an object), you first have to get a gene adaptor. Then you call its "fetch_by_stable_id" method with an Ensembl gene ID (e.g. ENSG00000198590) as the argument to obtain the gene object.

my $gene_adaptor  = $registry->get_adaptor('Homo sapiens', 'Core', 'Gene');
my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000198590');

This gene object is then used to get the transcripts, and each transcript is used to get its exons and introns. We don't have to create more adaptors ourselves because the API takes care of that when we ask a gene object for its transcripts. The same goes for exons (introns don't have an adaptor because they are not stored separately in the Ensembl databases). So, to get the transcripts, we use the "get_all_Transcripts" method of the gene object:

foreach my $transcript (@{ $gene->get_all_Transcripts }) {
    # work with each $transcript here
}

Inside the foreach loop above, exons and introns can be retrieved with the "get_all_Exons" and "get_all_Introns" methods of the transcript object. Each exon / intron can then be reached by looping in the same way.

foreach my $exon (@{ $transcript->get_all_Exons }) {
}

foreach my $intron (@{ $transcript->get_all_Introns }) {
}

I suggest you check that each object is actually defined, because for some genes the Ensembl databases return nothing (an undefined value), and calling any method on that gives an error. So do the checks with an if clause:

if ($gene) {
# get its transcripts
}

if ($exon) {
# get its sequence
}

Once you have an object, there are other methods to obtain its ID, sequence, location, etc. Here I will give the methods for ID and sequence retrieval, but you can always refer to the Doxygen Perl (core API) documentation for more information.

So, to print the IDs of genes, transcripts and exons (not introns, because they don't have one), we use the "stable_id" method of the objects:

print $gene->stable_id;
print $transcript->stable_id;
print $exon->stable_id;

To print sequence of objects:

print $exon->seq->seq();
print $intron->seq();

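Please note the small difference: for an exon, "seq" returns a sequence object, so we call "seq" on it once more to get the sequence string, whereas for an intron "seq" gives the string directly.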

So that's all. The complete script that I use to get the exons and introns of a gene in FASTA format is available here in my GitHub repo. You run the script by supplying a gene ID as an argument:

gungor@gungor:~$ perl projects/eiban/eiSingleGet.pl ENSG00000198590
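
If you have several genes, a small shell loop can call the script once per ID. This is only a sketch: it assumes the script prints its FASTA output to standard output, and the helper name run_for_genes and the output folder name are made up for illustration.

```shell
# run_for_genes SCRIPT OUT_DIR GENE_ID...
# Hypothetical helper: runs SCRIPT once per gene ID and saves whatever it
# prints to OUT_DIR/<gene ID>.fa (assumes the script prints to standard output).
run_for_genes() (
    script="$1"
    out_dir="$2"
    shift 2
    mkdir -p "$out_dir" || exit 1
    for gene_id in "$@"; do
        if ! perl "$script" "$gene_id" > "$out_dir/$gene_id.fa"; then
            echo "failed: $gene_id" >&2
        fi
    done
)

# Example call; add more gene IDs at the end of the line as needed:
# run_for_genes projects/eiban/eiSingleGet.pl fasta_out ENSG00000198590
```

The function body runs in a subshell (round brackets instead of curly ones), so its variables do not leak into your terminal session.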

Tuesday, November 12, 2013

Install Ensembl API and BioPerl 1.2.3 on Your System

I'm going to work on a project that requires lots of queries on the Ensembl databases, so I wanted to install the Ensembl API to begin with. Since the API is written in Perl, I will be using Perl in this project.

There is a nice tutorial on the Ensembl website for installing the API. Here I will describe the main steps.

1. Download the API and BioPerl

Go to the Ensembl FTP site ftp://ftp.ensembl.org/pub/ and download "ensembl-api.tar.gz"

Go to the BioPerl downloads page http://bioperl.org/DIST/old_releases/ and download the stable 1.2.3 version (in tar.gz)

2. Place them in a source directory

On Ubuntu, you can use the commands below to create your source folder, extract the downloads and then move their contents to your source folder:

mkdir ~/src
tar xvfz ensembl-api.tar.gz
mv ensembl ~/src/ensembl
mv ensembl-compara ~/src/ensembl-compara
mv ensembl-functgenomics ~/src/ensembl-functgenomics
mv ensembl-tools ~/src/ensembl-tools
mv ensembl-variation ~/src/ensembl-variation
tar xvfz BioPerl-1.2.3.tar.gz
mv BioPerl-1.2.3 ~/src/bioperl-1.2.3

3. Set your environment variables so that Perl 5 can find the source directory and the files inside

gedit ~/.bashrc

Add the following lines at the end of .bashrc:

PERL5LIB=${HOME}/src/bioperl-1.2.3
PERL5LIB=${PERL5LIB}:${HOME}/src/ensembl/modules
PERL5LIB=${PERL5LIB}:${HOME}/src/ensembl-compara/modules
PERL5LIB=${PERL5LIB}:${HOME}/src/ensembl-variation/modules
PERL5LIB=${PERL5LIB}:${HOME}/src/ensembl-functgenomics/modules
export PERL5LIB

Save the file, then reload it in your current terminal:

source ~/.bashrc
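
The PERL5LIB value can also be built in a loop that skips directories that are missing, which makes a typo in a path easier to spot. A minimal sketch, assuming the directory layout from step 2; build_perl5lib is a name I made up for illustration and is not part of the Ensembl API.

```shell
# build_perl5lib SRC_DIR
# Hypothetical helper: prints a PERL5LIB value made of the module
# directories that actually exist under SRC_DIR.
build_perl5lib() (
    src_dir="$1"
    result=""
    for dir in \
        "bioperl-1.2.3" \
        "ensembl/modules" \
        "ensembl-compara/modules" \
        "ensembl-variation/modules" \
        "ensembl-functgenomics/modules"
    do
        if [ -d "${src_dir}/${dir}" ]; then
            # Add a colon separator only when result already has an entry.
            result="${result:+${result}:}${src_dir}/${dir}"
        else
            echo "warning: ${src_dir}/${dir} not found, skipping" >&2
        fi
    done
    printf '%s\n' "$result"
)

# Print the value that would be exported for the layout used in this post.
build_perl5lib "${HOME}/src"
```

To use it in .bashrc instead of the fixed lines, you could write PERL5LIB=$(build_perl5lib "${HOME}/src") followed by export PERL5LIB.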

4. Check if the installation is successful

perl ~/src/ensembl/misc-scripts/ping_ensembl.pl

If you get "Installation is good. Connection to Ensembl works and you can query the human core database", it's done.
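
If the ping script fails, it can help to check first whether Perl finds the modules at all, before suspecting the database connection. A minimal sketch: check_perl_module is a hypothetical helper name, and Bio::Seq and Bio::EnsEMBL::Registry are just two modules I picked as representatives of BioPerl and the Ensembl core API.

```shell
# check_perl_module MODULE
# Hypothetical helper: succeeds only if Perl can load MODULE
# with the current PERL5LIB.
check_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

for module in Bio::Seq Bio::EnsEMBL::Registry; do
    if check_perl_module "$module"; then
        echo "ok: $module"
    else
        echo "MISSING: $module (check PERL5LIB in ~/.bashrc)"
    fi
done
```

A "MISSING" line usually points at a wrong path in step 3 or at a terminal that was opened before .bashrc was changed.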

For more information, and for the installation steps on Mac and Windows, see the original tutorial.