Thursday, November 28, 2013

How to Get Transcripts (also Exons & Introns) of a Gene using Ensembl API

As a part of my project, I need to obtain exons and introns of certain genes. These genes are actually human genes that are determined for a specific reason that I will describe later when I explain my project. But for now, I want to share the way to obtain this information using (Perl) Ensembl API. Note that Ensembl has started a beautiful way (Ensembl REST API) of getting data but it is beta and it doesn't provide exons / introns information. So we have to use Ensembl API.

If you haven't installed Ensembl API, visit my Ensembl (Perl) API installation post.

To begin with, I want to share a tutorial set by Ensembl, which I used to learn the API. Tutorials are really useful so for more detailed information about the API, please visit filmed API workshop. Also, Doxygen Perl documentation provides information about classes of the API.

First, let's create a registry to be able to use adaptors:

use Bio::EnsEMBL::Registry;
my $registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_db(-host => '', -user => 'anonymous');

Basically, in this API (specifically Ensembl core database API) we have "adaptors" (of genes, transcripts, ...) and "objects" (of genes, transcripts, ...). Adaptors are used to retrieve objects from Ensembl database. So, if you want to get a gene (object) information, first you have to generate a gene adaptor. Then, using "fetch_by_stable_id" method passing an argument as Ensembl gene ID (e.g. ENSG00000198590) gene object is obtained.

my $gene_adaptor  = $registry->get_adaptor('Homo sapiens', 'Core', 'Gene');
my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000198590');

This gene object, then will be used to get transcripts and each transcript will be used to get exons and introns. We don't have to generate more adaptors because when we obtain gene object, transcript adaptor is automatically generated. This is the same for exons (introns don't have an adaptor because they are not stored separately in Ensembl databases). So, to get transcripts, we need to use "get_all_Transcripts" method in gene object:

foreach my $transcript (@{ $gene->get_all_Transcripts }) {

In foreach loop above, exons and introns can be retrieved by "get_all_Exons" and "get_all_Introns" methods in transcript object. And of course, each exon / intron can be obtained by looping in the same way.

foreach my $exon (@{ $transcript->get_all_Exons }) {

foreach my $intron (@{ $transcript->get_all_Introns }) {

I suggest you check if you have a non-empty object for all because for some genes, Ensemble databases return null objects and if you try to use any method over them, you get errors. So do checks using an if clause:

if ($gene) {
# get its transcripts

if ($exon) {
# get its sequence

After you get each object there are other methods to obtain its ID, sequence, location, etc. Here I will give methods for ID and sequence retrieval but you can always refer to Doxygen Perl (core API) documentation for more information.

So to print ID of genes, transcripts and exons (not introns because they don't have...), we need to use "stable_id" method in objects:

print $gene->stable_id;
print $exon->stable_id;

To print sequence of objects:

print $exon->seq->seq();
print $intron->seq();

Please note the small difference in printing intron sequences.

So that's all. The complete script that I use to get exons and introns of a gene in FASTA format is available here in my GitHub repo. You run the script by supplying gene ID as an argument:

gungor@gungor:~$ perl projects/eiban/ ENSG00000198590

Tuesday, November 12, 2013

Install Ensembl API and BioPerl 1.2.3 on Your System

I'm going to work on a project that requires lots of queries on Ensembl databases so I wanted to install Ensembl API to begin with. Since it's programmed in Perl, I will be using Perl in this project.

There is a nice tutorial on Ensembl website for API installation. Here I will describe some steps.

1. Download the API and BioPerl

Go to Ensembl FTP and download "ensembl-api.tar.gz" or click here

Go to BioPerl downloads page download stable 1.6.1 1.2.3 version (in tar.gz)

2. Place your in a source directory

On Ubuntu, you can use the code below to generate your source folder, extract the downloads and then move the content to your source folder

mkdir ~/src
tar xvfz ensembl-api.tar.gz
mv ensembl ~/src/ensembl
mv ensembl-compara ~/src/ensembl-compara
mv ensembl-functgenomics ~/src/ensembl-functgenomics
mv ensembl-tools ~/src/ensembl-tools
mv ensembl-variation ~/src/ensembl-variation
tar xvfz BioPerl-1.2.3.tar.gz
mv BioPerl-1.2.3 ~/src/bioperl-1.2.3

3. Set your environment variables so tat Perl 5 can find the source directory and files inside

gedit ~/.bashrc

Add following lines to at the end of .bashrc;

export PERL5LIB

source ~/.bashrc

4. Check if the installation is successful

perl ~/src/ensembl/misc-scripts/

If you get "Installation is good. Connection to Ensembl works and you can query the human core database", it's done.

For more information and the steps in installation on Mac and Windows see the original tutorial.

Monday, September 2, 2013

Last Submissions to the Challenge

Today, I submitted in silico and experimental data network inference results on Synapse for the next leaderboard on this Wednesday.

For experimental part, I had to exclude edges with FGFR1 and FGFR3 because the data lacks phosphorylated forms of these proteins and networks must be constructed using only phosphoproteins in the data.

Since there was an update for in silico part, I had to modify the script and resubmit the results. I will see the score for this part on this Wednesday as well.

Also, I added the scripts to a public GitHub repo called "netinf-bigcat".

Thursday, August 29, 2013

Network Visualization Using Cytoscape

Cytoscape is a nice tool to visualize network for better understanding and delivery. I used it for in silico data network visualization and the result was really pretty. Now, I have networks constructed using experimental data from HPN-DREAM Challenge.

In this post, I want to demonstrate how to visualize a network with scores. I'm using Cytoscape 2.8 on Ubuntu 12.

First, the network will be read from a SIF file which is default format of Cytoscape for networks. But, to have scores of edges in the network, this SIF file will be different. This SIF file has 4 columns: parent node, relation, child node and score. These columns will be named not to confuse us during visualization. Below, you can find an example.

Next, by clicking on "File" -> "Import" -> "Network from Table (Text/MS Excel)...", we will open a window to read data properly. On the window, we will open additional options by clicking on "Show Text File Import Options", then "Transfer first line as attribute names...". This option will help use column names in the network. Next, we will define source (parent), interaction (relation) and target (child) and also click on ''Score" column to have it as an edge attribute. After all this, we should have this kind of selections:

Then, we click on "Import".

To have a better look of imported network, we can use "Force-directed layout" option. To change the visualization, we go to VizMapper tab of Control Panel on the left. From the list, we select mappings we want to change. Every change will be seen on the next side.

In VizMapper, for each mapping, we define the attribute, and mapping type. Attributes are the columns that we provide in SIF file. There are three mapping types: Continuous, passthrough and discrete mapper. Continuous mapper is used for numbers and provides some kind of gradient look for the mapping. For example, in the example below, edge thickness is done using this option and width is determined by its score. Passthrough mapper puts directly what each row has according to the attribute. Examples are edge scores as edge labels and node names as node labels. In discrete mapper, we select different options for different elements. For example, there are only two relation types and I want "inhibition" (-1) relation to have red colored edge with T-shaped end and "activating" (1) relation to have green colored edge with arrow-shaped end.

Also, we change "Defaults" from Control Panel for node color, font size, background color.

Below, you can see the result for this network.

"Cytoscape is an open source software platform for visualizing complex networks and integrating these with any type of attribute data." Find out more on their website.

Tuesday, August 27, 2013

Plotting Expression Curves for Experimental Data

As I can plot expression curves for in silico data. I moved on experimental data which is more complex and larger. This data is the result of RPPA experiments on different breast cancer cell lines and it includes protein abundance measurements for about 45 phophoproteins. These phosphoproteins are treated with different inhibitors and stimuli and by comparing their expressions, I will try to infer relations between them.

Before moving on inferring part, I want to have a script that can plot the graphs so that I can see particular results for specific cases. For that, I used the same script as in in silico part but I has to modify it a lot to handle experimental data. I mentioned some of my progress on this in the previous post. Now, I have completed it by ordering the data according to pre-defined fashion, handling duplicates (taking average of them), putting NAs for the missing data points and estimating data points which are in the middle (taking average of neighboring data points in the series). I do this estimation because they are not many and most of them are missing only one point. And in this way, I'm able to have more edges in networks.

Then, the network is inferred and the edges are scored. Since second and third inhibitors act on two proteins, before showing the results, the script splits those with two targets and keeps the scores and relations the same. If there is already an edge after the split, based on their score, the one with higher score is written in the result.

In this inference, I consider each inhibitor with its target(s) and construct relation between its target(s) and phophoproteins in the files. And these targets are actually in the data but since they are given in unphophorylated forms, they are not exactly the same as in the ones in the data. For example, "GSK690693" is given as "AKT inhibitor" and when I use it for inferring networks, I find relations between "AKT" and other phophoproteins. "AKT" is present in the data in several phosphorylated forms such as: "AKT_pS473" and "AKT_pT308". So, I have a relation between "AKT" and "AKT_pS473", for example. This is not I want and not what it's asked because in the network I should have only phosphoproteins from the files. Therefore, I might do this: I can duplicate each edge with "AKT" and name them as the forms of AKT present in the file, which means I will assume that "GSK690693" inhibits all phosphorylated forms of AKT. This is true for "MEK", too. I haven't seen the other inhibited proteins "FGFR1" and "FGFR3" in the data, which is strange. I will go over it later.

Above, there is a plot from experimental data. Cell line is "MCF7" and stimulus is "Serum". The antibody being investigated is "AKT_pT308". The invervention condition is "Serum__GSK690693", which means I'm investigating relation of  "AKT" forms on themselves. This plot shows a confident result of inhibiting relation. So we can conclude that "AKT" forms inhibit themselves. Because, under inhibition, its expression increases.

It was run using these parameters:
bigcat@bigcat-shut-01:~/gungor/netinf$ ./netInfPlotter_experimental.R MCF7 AKT_pT308 Serum Serum__GSK690693

Actually, there are not many things I can do anymore because I'm leaving in two weeks and also the challenge will end soon. In the following days, I will try to learn more on statistics and try to improve that part of network inference. I will also try to visualize the networks in Cytoscape and write another post on it.

Wednesday, August 14, 2013

Experimental Data Optimization for Network Inference

As I mentioned in my previous post, experimental data from the challenge has missing data values that create problems during analyses. To solve it, first thing I did was to optimize data, which includes detecting missing conditions and putting NAs for data values and sorting them if necessary.

I wrote two functions in the script. First one ranks the data according to the fashion and sorts it based on these ranks. The second one gets sorted list and looks for missing data values and puts NAs as data values.

Above there is the list of conditions which are designed to represent "fashion" in the data. Also, it's the result of the functions I mentioned above and for this time point, there are 3 missing data values. Actually, "NRG1__GSK690693_GSK1120212" and "PBS__GSK690693_GSK1120212" are missing for all time points. So I won't be able to make any inference using these conditions. "IGF1__GSK690693_GSK1120212" is the only data value missing among the others. There are some other missing data values in different time points and all of these can be estimated.

The first approach I will use for the estimation is calculating formula for regression line and finding corresponding data value using time point. There are several others and I will mention them later.

But before moving on to estimation part, I want to be sure that this optimization approach is working for all cell lines. I applied this approach for MCF7 cell line and it wasn't working as expected. Because in MCF7 cell line data DMSO treatments (no inhibitors) have duplicates, which cannot be handled by this version of script. This is also similar for UACC812 and BT549 cell line.

I will modify the script to work with these and then move to estimation part.

Working with Experimental Data from Network Inference Challenge

As I almost finished with in silico data, I moved on to analyses of experimental data using the same script. But since the characteristics of data is somehow different, before inferring network, I need to modify the script to be able to read experimental data files.

These differences include missing data values for some conditions. This makes analyses difficult because I have to estimate a value for them and this will decrease the confidence score of edges.

I'm working with BT20 (breast cancer) cell line data for the beginning. It has 234 rows of conditions and some of them (zeroth time points) are duplicates. Zero time points do not have stimulus treatment so for now I will exclude them and start with 5 seconds which is the second time point. So I have 6 time points (5, 15, 30, 60, 120, 240).

Stimuli are FGF1, Insulin, EGF, IGF1, HGF, Serum, NRG1, PBS and inhibitors are AKT, AKT & MEK and FGFR1 & FGFR3. And there are 48 phosphoproteins in BT20 cell line data. These make 192 different conditions and time points and 9216 data values.

In R, I defined a fashion for the data and I'm checking the data if it follows the expected pattern. If not, I create an empty data value for that condition and time point. Later, I'll try to estimate values for them. Also, I realized for 240 seconds rows, data does not follow the expected pattern. So now, I have to fix this and start estimation.

After these estimations, I will have a complete dataset as in in silico part and will use the same scripts for plotting the graphs of expression profiles and inferring and scoring networks.

Monday, August 12, 2013

In silico Network Inference Last Improvements and Visualization of Result in Cytoscape

I'm almost done with the analysis of in silico data, although I need to decide if I need further analysis with the inhibiting parent nodes in the network. Last, I couldn't filter out duplicate edges, which were scored differently. Now, with some improvements in the script, low scores duplicates are filtered and there is a better final list of edges which is ready to be visualized.

I also tried visualizing it on Cytoscape. This is not required in the challenge but better to have to see the result. Below, there is a visualization of the result. It has colored edges and shaped edge ends (red and T-shaped edge is inhibiting, green and arrow-shaped edge is activating) according to type of interaction. Each edge has a score and it is indicated numerically and visually by having different thickness. The thicker the edge, the larger score it has relatively.

In silico data-derived network from HPN-DREAM Challenge. Click on the network to enlarge it

I will decide to do further analysis for in silico part according to the rank in the Leaderboard 1B. I have already started modifying the script to read and process experimental data. I will define relations between inhibitors, stimuli and antibodies for experimental data as I did for in silico data, then infer the network using the same script as in in silico part.

Thursday, August 8, 2013

Some String Functions in R, String Manipulation in R

I have programmed with Perl, Python, and PHP before, and string manipulation was more direct and easier in them than in R. But still there are useful functions for string manipulation in R. I'm not an expert in R but I've been dealing with it for a while and I've learned some good functions for this purpose.

Concatenate strings

Concatenation is done with paste function. It gets concatenated strings as arguments separated bu comma and also separator character(s). Default for separator character is one white space. It directly returns output as string and starting strings remain the same.

str1 <- "Gungor"
str2 <- "Budak"
paste(str1, str2, sep="_") # Output: [1] "Gungor_Budak"

More on the documentation for paste

Extract character(s) from strings

This can be done with substr function. It gets string, start and stop indices for the extraction. It directly returns output as string and starting string remains the same.

str1 <- "Gungor"
substr(str1, 1, 3) # Output: [1] "Gun"

More on the documentation for substr

Split characters from strings

One function is strsplit which gets string and the character(s) from which it will be split as arguments. It returns a list and a list of splits in that list. So, to get a direct list of splits. You can use unlist.

str1 <- "Gungor"
unlist(strsplit(str1, "ng")) # Output: [1] "Gu" "or"

More on the documentation for strsplit

Check if character(s) are present in strings

Here, you can use grep functions. grepl takes pattern and string as arguments and returns TRUE if it matches and FALSE otherwise.

str1 <- "Gungor"
grepl("A", str1) # Output: [1] FALSE

More on the documentation for grep

Replace character(s) in strings

gsub is the function I use for this. It takes pattern, replacement and string as arguments. It returns changed string directly.

str1 <- "Gungor"
gsub("ng", "gn", str1) # Output: [1] "Gugnor"

More on the documentation for gsub

These are the ones I use most, but there are many more and I will be adding them to this post in the future.

Latest Progress on Network Inference and Edge Scoring

I have improved network inference part of the script slightly by changing the way of comparing intervention (presence of inhibitor and stimulus) and no intervention (presence of stimulus) data from in silico part.

Now, I'm using a function (simp) from an R package called StreamMetabolism, which gets time points and data values and (does integration) calculates the area under the curve (Sefick, 2009). I do this integration for both condition and then I compare them.

I also set a threshold for edge scoring. First, I set it as 0.05 and looked at the graphs of low scores. The curves from the graphs were not significantly different for these low scores. So I made it 0.1 and again observed low scores. In this case, the curves were looking different enough to be compared and to be included as an edge in the networks.

I'm inferring the network in this part for these conditions:

Inhibitors Stimuli Targets
INH1_loLIG1 loLIG1 AB12_L1
INH1_hiLIG1 hiLIG1 AB12_H1
INH2_loLIG2 loLIG2 AB5_L2
INH2_hiLIG2 loLIG2 AB5_H2
INH3_loLIG1 loLIG1 AB8_L1
INH3_hiLIG1 hiLIG1 AB8_H1
INH3_loLIG2 loLIG2 AB8_L2
INH3_hiLIG2 hiLIG2 AB8_H2

So each condition is scored separately but in networks characters like "L1" are stripped and only target names are written with the antibodies. Because of this, in networks I can get multiple results for the same kind of edge. Of course, they differ in score and I will only write the result with highest score.

Below, you can see the output from latest script.

Somehow, I need to find more edges with considering targets (AB12, AB5 and AB8) as inhibitors or stimuli. For example, according to the output above, AB8 causes an inhibition in AB2. So the new analysis should be done using AB8 as inhibitor and stimuli for AB8 as stimuli for AB2. This will give another set of results of relations between AB8 and the rest. In this way, there can be some more subsequent analysis to obtain enough edge to construct a complete network.


Sefick, Stephen. 2009. Stream Metabolism-A package for calculating single station metabolism from diurnal Oxygen curves. From

Friday, August 2, 2013

Scoring Edges Network Inference HPN-DREAM Challenge

Yesterday, I managed to infer a network for some part of in silico data from the challenge. Since the challenge also asks for scoring the edges in networks, I developed the script further and add a function for that.

edgeScorer function gets data object of average time points for each curve in intervention/no-intervention sets and scores each edge for each set of conditions. For this, first, it looks for the largest difference among the sets and set it as maxDifference and later, it stores differences divided by maxDifference in another data object. In this way, the edge with maximum difference is scored as 1 and the rest is scored relative to it.

I will use these scores for network inference. I will determine a threshold and the network printed will be consisting only of the edges with a score above the threshold.

Thursday, August 1, 2013

Determining Edges More Progress on Network Inference

Lately, I have been writing an R script to infer network using in silico data. Last version of the script was reading MIDAS file and plotting expression profiles. I have modified it and now it reads MIDAS file, does some analyses and prints causal relations to a file. This file is a SIF file as required.

This dataset is generated with 20 antibodies but only 3 of them are perturbed. Also, for one, stimulus is missing. So, right now I can only determine causal edges between 2 of them and the rest.

In script, midasReader functions gets the content and stores each experiment in data object. And dataOptimizer reads the data object and makes it ready for inference. It does some kind of curve comparison (the curves obtained using dataPlotter, for example) and stores the results in another data object. Finally, networkInferrer gets this data object and compares related experiments and decides which kind of edge and prints in into a SIF file.

dataOptimizer does its job by taking average of all time points that belong to the same kind of experiment (the same stimulus and/or inhibitor set). Decision will be made by comparing the average of values obtained by intervention and the ones by no-intervention.

networkInferrer compares intervention and no-intervention averages and sets "relation" variable accordingly. If intervention average value is smaller than no-intervention average value, then the "relation" variable is set to ''1", because inhibition of parent node causes a decrease in the amount of child node, so the conclusion is that there is an excitatory (1) relationship between parent node and child node.

The image above shows an example for two different sets. First, inhibition of AB12 with INH1 and stimulation with low amount of LIG1. Second, there is a change in the amount of stimulus. Both resulted the same kind of relation that is excitatory (1). Notice that these relations are defined between AB1 and AB12 because INH1 and LIG1 target AB12.

The same kind of inference can be done for AB5 which is the other antibody targeted by an inhibitor and stimulus. But since there is only inhibitor for AB8, I don't know how to analyze it yet.

In this approach, I can only look for relations between these 3 antibodies and the rest. But, since inhibition of a child node might cause something on another grandchild node, with this kind of subsequent analysis, relations for all antibodies can be determined with each other.

Wednesday, July 24, 2013

Plotting Expression Profiles Data Analysis for Network Inference

For in silico data network inference I decided to develop a script because the existing tools have bugs and they are not compatible with the data. At the same time, I will try to report bugs and the compatibility issues to developers.

in silico data has 660 experiment results of 20 antibodies, 4 kinds of stimuli and 3 kinds of inhibitors. Antibodies are treated with a stimulus, say at t_0 and in the case of inhibitors, say at t_i, antibodies are pre-incubated for some time (t_pre) and then, treated with a stimulus. So, the relation is:

t_i = t_0 - t_pre

The causal edges between nodes (antibodies) in the network will be determined by the expression profile of any node, in the presence of intervention on parent node. That is, when AB12 is inhibited, if AB1 expression decreases over time, then we can draw (excitatory) edge between AB12 and AB1.

In the script for network inference, right now there are two functions: midasReader and dataPlotter.

midasReader takes MIDAS file path and child node as the arguments, reads MIDAS file into a data frame and then according to properties of data, it returns a data frame that can be used to plot expression profile graphs. Child node argument is used to get only related (expression of child node in different cases) data from whole dataset. In in silico data case, each experiment is done three times for each treatment (there are 20 different treatments) and each time point (there are 11 time points). So, there are 220 experiments, which are returned by midasReader in a data frame in three columns (treatments, timePoints, data).

Since there are three data values for each case, I took average of those for this time. But there must be more accurate ways of using them. I have to work on that more.

After the data frame is created, dataPlotter, gets three arguments to plot the graph of expression profiles, First argument is the data frame. The others are no intervention and intervention conditions (i.e. hiLIG1 and INH1+hiLIG1). The relations of inhibitors, stimuli and their targets are given in the table below.

Relationship between ligands, stimuli and targets for the in silico model, Taken from:!Wiki:syn1720047/ENTITY/56061

So, when I give hiLIG1 and INH1+hiLIG1 as input and select the child node as AB1, I'm looking for relation between AB12 and AB1. Below, there is the graph for this case. I ran the script as shown below on terminal:

bigcat@bigcat-shut-01:~/gungor/netinf$ ./netInf.R MD_insilico.csv AB1 hiLIG1 INH1+hiLIG1

In this graph, there is no clear difference between expression profiles, so we might not tell if there is a relation looking at this graph. More accurate analyses should be done. After being sure about the accuracy of these analyses, I will move on constructing networks using these graphs.

Another example which shows clear difference is below:

bigcat@bigcat-shut-01:~/gungor/netinf$ ./netInf.R MD_insilico.csv AB1 loLIG1 INH1+loLIG1

Here, we can say intervention on AB12 in the case of (INH1+loLIG1) causes a descrease in AB1 expression so in low amount of stimulus, AB12 is triggering AB1 amount.

However, as I mentioned above, the analysis is not efficient and should be developed. And then, more accurate results can be obtained.

Thursday, July 18, 2013

Webinar on HPN-DREAM Breast Cancer Network Inference Challenge

DREAM8 organizers plan a webinar about HPN-DREAM Breast Cancer Network Inference Challenge on July 19, at 10:30 - 11:30 (PDT / UTC -7). General setup of the challenge, demo submissions to the leaderboard will be discussed and also questions about the challenge will be accepted during webinar. The number of the participants to the challenge is also announced: 138.

Registration to the webinar is done using this form. There are limited number of "seats", but later recordings will be published.

Network Inference Challenge in silico Data

I had a meeting with BiGCaT this week and we discussed DREAM Breast Cancer Challenge. I presented the challenge and also some ways that I have found to solve the first sub-challenge network inference. Tina, from BiGCaT, suggested starting with in silico data which is much simpler than breast cancer data. Later, I can use the methods I develop for in silico data in experimental data.

in silico data contains 20 antibodies, 3 inhibitors and 2 ligand stimuli with 2 different concentration for each. And data for 10 time-points post-stimulus is provided. Inhibitors work by blocking the kinase domain of a protein. Any phosphorylated kinase is considered to be active and an inhibition on it also affects its downstream kinase activity. Inhibitors sequester away the unphosphorylated form of the target kinase and thus limit the amount of phosphorylated forms.

I used CellNOptR and CNORfeeder tools for network inference. CNORfeeder is a tool with several methods for network inference and it has its FEED method that creates boolean tables using the data and then it is possible to infer network based only on the data (i.e. data-derived network).

I plotted the network above using only expression data from the challenge in CNORfeeder. For this, mapBTables2model function that converts boolean table to network model should be called without a model or "model=NULL". I had to convert treatment headers in MIDAS files into a different format because there is a small error in reading MIDAS and converting into CNOlist object: "TR:ABC:Stimuli" to "TR:ABC" and "TR:ABC:Inhibitors" to "TR:ABCi". And a new form of makeBTables function CNORfeeder because there was no stimuli/no inhibitors condition in the data, which is not suitable for the old version.

CellNOptR and CNORfeeder have been developed by CNO developers from European Bioinformatics Institute (EBI) more information can be found on their official website.

Thursday, July 4, 2013

First Impressions and Thoughts on Rosalind Project

Actually, I signed up 8 months ago, I didn't really play around with it. But last week, in a BiGCaT science cafe, after I learnt it, I was more interested than before and I just started solving problems.

In each problem, you have a description about the context and also about the problem. Also, there is a sample input and output. Sometimes there are hints about the solution. What I did was to write a solution that works for the sample and hopefully for the problem. I wrote the scripts in Python, because the project says it is optimized for that and also I want to learn more about Python.

After submitting the solution, if it is successful, you can see the suggested solutions coming from other users. Each solution is really unique and worth checking out. In this way, you can also learn new approaches. And, the project directs you the next question which is available and good to be solved after that solution, which is nice, too. Moreover, there are user success statistics, such as top 100 and by country. This might help users to be motivated by competition. From Turkey, there are only seven users and they don't look like solving problems right now, but I hope there will be more.

There are many badges that can be earned after certain amount of successful solutions and they are motivating you to continue.

My username on Rosalind is gungorbudak. I have solved all the problems in "Python Village" and some from "Bioinformatics Stronghold". I will go on solving more in my spare time.

Rosalind is a good way to learn programming and basics in biology, genetics, statistics, computer science and bioinformatics.

Playing around with CellNOptR Tool and MIDAS File

With CellNOptR, we will try to construct network models for the challenge. For this, the tool needs two inputs. First one is a special data object called CNOlist that stores vectors and matrices of data. Second one is a .SIF file that contains prior knowledge network which can be obtained from pathway database and analysis tools.

CNOlist contains following fields: namesSignals, namesCues, namesStimuli and namesInhibitors, which are vectors storing the names of measurements. Another vector called timeSignals stores time points. The field valueSignals contains protein abundance measurements and the fi elds valueCues, valueStimuli and valueInhibitors) are boolean matrices with a 1 or 0, for each condition (row) when the corresponding cue (column) is present or absent, respectively[4].

A. Experimental setup; B. Data obtained from the experiment; C. MIDAS description
Taken from Bioinformatics (2008) 24 (6): 840-847.doi: 10.1093/bioinformatics/btn018

We are given MIDAS files and each row in a MIDAS file belongs to a single experimental sample  and each column is one sample attribute, such as identity, treatment condition, or value obtained from an experimental assay. The column header has two values: a two-letter code defining the type of column, (e.g. TR for treatment, DV for data value), and a short column name (e.g. a small molecule inhibitor added or a protein assayed). Each cell stores the corresponding value for the row (sample) such as a presence or absence of stimulus/inhibitor, time point, or data value[1].

CellNOptR has a function that can read a MIDAS file and with another one it is possible to convert it to CNOlist and work on it. readMIDAS gets the content of CSV file to a data object and using that data object, makeCNOlist makes a CNOlist[2]. After I did it and look at the CNOlist, it worked but there were some issues with naming. Stimuli vector was named as inhibitors and inhibitors vector was named as stimuli.  It also stores variances of data values. The variances are not zero if replicates are found[3].

So I was able to read experimental data as an object with a small problem in naming that can be solved easily by changing the names of the vectors.

This was the first input required for CellNOptR tool. Second one is the prior knowledge network and in the following days, I will work on it.

1. Flexible informatics for linking experimental data to mathematical models via DataRail
2. Training of boolean logic models of signalling networks using prior knowledge networks and perturbation data with CellNOptR,
3. Package ‘CellNOptR’,
4. In Silico Systems Biology: Scripting with CellNOptR,

Tuesday, July 2, 2013

Progress on Network Inference Sub-Challenge

This sub-challenge has several requirements:

- Directed and causal edges on the models (32 models - 4 cell lines × 8 stimuli)
- Edges should be scored (normalizing to range between 0 and 1) that will show confidence
- Nodes will be phosphoproteins from the data
- Prior knowledge network (that can be constructed using pathway databases) is allowed to be used (actually this is a must for some network inference tools)

First thing was to look for existing tools. ddepn seemed a good option but it didn't work for us. We checked for different tools on Bioconductor for our purpose. There was a tool called CellOptNR, which I thought we could use it for the second sub-challenge. Actually, on Synapse, the second sub-challenge has been modified recently, so right now I'm sure about it. But for the first sub-challenge, CellOptNR and a tool related to it called CNORfeeder will be useful.

This tool gets two input. One is data from microarray experiments, which is simply protein abundance measurements under several stimuli/inhibitors and the other is a prior knowledge network (PKN) that can be constructed using different pathway sources such as WikiPathways, KEGG. It uses various inference methods to integrate the data and validate the network models.

So, we need to construct a PKN with the phosphoproteins in the data and then infer the network models. Next, we need to score each edge on the models and store the results in SIF and EDA files.

This sub-challenge also has in silico data part where there are similar requirements.

I tried an example (from DREAM 4 Challenge) given in vignette of CNORfeeder and it worked as expected. In vignette, a data-drive network is shown without an integration with any PKN but how it is created is not given.

Thursday, June 27, 2013

Retrieving Data with AJAX using jQuery, PHP and MySQL

Last semester, I took a course from Informatics Institute at METU called "Biological Databases and Data Analysis Tools" where first we learned what is a database and how to do queries on it. Also, the technology behind databases are taught. Then, we learned many biological databases and data analysis tools available. These include gene, protein and pathway databases, tools for creating databases.

As a final project, we were asked to create an online tool that can search a database and get the data and display it on any web browsers. For that, we were given a table and using that table and some given conditions, we retrieved another table from Biomart Ensembl and created the database. Then, in searching and displaying the data, we created a user interface.

MySQL database was used and PHP was the choice of programming language, which is powerful and good in web programming.

For our task, we implemented AJAX using jQuery (a JavaScript library). The purpose was to make the search process easy and fast. In this way, the search was triggered when the first three letters from the query were entered and at the same time, the result was displayed on the page without page refresh.

The project is available online on this website.

To do this, as I said, we used AJAX calls. AJAX works on client-side and provides asynchronous calls from the server without any intervention on the current state on the page. That is, to get the data, we don't have to stop viewing the page and get it, then refresh the page with the data. So, in this way, without page refresh it is possible to get any data from the database.

The method includes the use of jQuery method "ajax". This method gets the necessary information from user, sends it to a scripting language that will work on server such as PHP, and at the end, retrieves the result from the server-side language and shows it to the user.

In the function below, the script gets the value in the text field with id "query" and also the names of check boxes (which determine which column to get from the database) and stores them in an object called "data". Then, when the value of "query" is greater than 2, it executes "ajax" method where submission type is set as "post", the script that weill interact with the database is set as "/process.php", data is given and a callback function which will insert the result into a div with id "results".

function getResults() {
var data = {};
data['query'] = $("#query").val().trim();
var boxes = $("input[name=options]:checked");

$.each(boxes, function(key, value){
data[key] = $(value).val();

if (data['query'].length > 2) {

url: '/process.php',
type: 'post',
data: data,
success: function(response) {

And in process.php file, the data coming from AJAX is accepted and used to query the database. Basically, while the data is being retrieved the html codes for the insertion is generated and echoed out at the end. This is how jQuery gets it and displays.

This JavaScript function is able to get the results but it should be somehow executed. There are meny possibilities for this. Typing letters into text field, pasting something into text field, pressing enter, and changing search options are the ones I did. There are great jQuery methods for these, and really simple. We used change(), bind(), on() and keypress() methods.

For example, below you can see how ENTER key (the number 13 indicates it) is used to trigger the function. And note that here we prevent the key from submitting form by returning false.

$("#query").keypress(function(action) {

if(action.which == 13) {
return false;

The use of others can be found on jQuery documentation.

If you have any question about this post, please leave a comment below.

Using Online Tools for Teaching Bioinformatics

I attended one of science cafe meetings of BiGCaT group today and we discussed use of online tools for teaching bioinformatics.

Andra Waagmeester (PhD student form BiGCaT) introduced Rosalind Project as a teaching tool. This project mainly focuses on bioinformatics solutions. Various questions about bioinformatics are asked on the website. Actually, those are various problems that can be seen in any bioinformatics research and by solving them, it helps you learn bioinformatics.

On the website, it is possible to start a class (with a faculty member account) and generate a curriculum with the desired content from the project. It is also possible to post new problems. There is also discussion part where one can ask questions about problems and look for help. The replies to those can be up or downvoted so it can generally be useful.

On Rosalind, problems can be solved using any programming language but they said it's optimized to Python, so Python should be better choice. Among problems, one set is about learning Python, which is good. There are also two sets Bioinformatics Stronghold and Bioinformatics Armory where you build up bioinformatics knowledge.

There is also Code Academy where both learning and teaching programming & coding are possible. It doesn't focus on bioinformatics specifically but it might be used to teach bioinformatics so it's worth checking out.

The use of online tools as a studying platform in classes is really novel and I think it should be done by including this study to grading (which might raise the interest of students to the study). However, there are also some issues about it, which are discussed in the meeting. One is the stability and availability of the tools for whole course time. Second is the availability of solutions on web such as on GitHub. There can be these problems of using these tools for teaching.

Network Inference DREAM Breast Cancer Challenge

The inference of causal edges are described as the change on a node seen after the intervention of another node. If the curves obtained over time overlap (under intervention or no intervention), then there is no relation. Otherwise, we can draw an edge between those nodes and according to the level, up or down, the edge will be activating or inhibiting. These causal edges are context-specific so in different cell line data, we may have different relations.

Also, edge confidence scores should be obtained. Right now, I have no idea how to get them but we will discuss.

The relations and scores will be stored in SIF and EDA files and submitted to the competition.

This can be done by writing scripts specific to the task. However, before that I have looked for existing tools. I have found some. There is an R package called RPPAnalyzer which is designed to read RPPA result and compare the samples and plot a graph at the end but this is not exactly what we need in this challenge (See its CRAN page). Another R package specifically written for constructing signaling networks is present, and it is called ddepn (Dynamic Deterministic Effects Propagation Networks). It infers signalling networks for time-course RPPA data (See its R-Forge page).

So I started with ddepn. I installed the package (R version 3.0.1). And before using our data, I used the example described in its vignette. A similar example is also present in its documentation. However, I got an error before plotting the network. When I attempted to run ddepn function to apply genetic algorithm, it gives  "Error in get("envDDEPN") : object 'envDDEPN' not found". So I have to find a way to solve it, then I can move on doing the same for our data.

It looks like the inference I got from this step will be necessary for the next step with the data files. Because for the next step the use of CellNOptR package (See its official page) is suggested and it needs noth network and data to make predictions.

DREAM Breast Cancer Sub-challenges

I have been going over the sub-challenges before attempting to solve them. As I mentioned, there are three sub-challenges and somehow they are connected.

First, using given data and other possible data sources such as pathway databases, the causal signaling network of the phosphoproteins. There are 4 cell lines and 8 stimulus so they make total 32 networks at the end. Nodes are phosphoproteins and edges should be directed and causal (activator or inhibitor).

4 different treatments are applied to samples before stimulation, these are inhibition treatments and one of them is vehicle control (DMSO). After that, the samples are stimulated and their levels are measured in different time points.

This sub-challenge has also another part where in silico data is provided and only one network inference using the data is asked. The characteristics of this data is different. The training dataset has time points for 20 phosphoproteins under various stimuli and inhibition of nodes (See Sub-challenge 1: Network Inference for more).

Second, predictions on phosphoprotein trajectories should be made. Also, it's asked to propose a model that can cover beyond of this data (breast cancer proteomics and in silico datasets) (See Sub-challenge 2: Time-course Prediction for more).

Third, visualization of the data should be made to be interpreted in meaningful ways. This is only for breast cancer proteomics dataset (See Sub-challenge 3: Visualization for more).

Wednesday, June 26, 2013

HPN-DREAM Breast Cancer Network Inference Challenge

Understanding signaling networks might bring more insights on cancer treatment because cells respond to their environment by activating these networks and phosphorylation reactions play important roles in these networks.

The goal of this challenge is to advance our ability and knowledge on signaling networks inference and protein phosphorylation dynamics prediction. Also, we are asked to develop a visualization method for the data.

The dataset provided is extensive and a result of RPPA (reverse-phase protein array) experiments. It has four (breast cancer) cell lines, each has proteomics data obtained under 3 different inhibitors and one control (DMSO) and 8 different stimuli over 7 time points. And each contains levels of about 45 phosphoproteins. There is also additional dataset with all proteins measured (phosphorylated forms and total proteins) later time points. Moreover, there is an in silico data with similar characteristics (See Data Description on Synapse).

RPPA is a method to quantitate protein levels in lysates from cells or tissues. A video about this technique can be watched on this link.

Using this data, we are asked to complete three sub-challenges.

(1) Network Inference: Modeling causal signaling networks from training data
(2) Time-course Prediction: Prediction of trajectories of protein levels following inhibitor perturbation(s) not seen in the training data
(3) Visualization: Designing a visualization strategy for high-dimensional molecular time-course data sets such as the ones used in this challenge

More information can be found their official website.

And more about the sub-challenges and how we approach to solve them are coming soon.

Dream Challenge

This year, 8th Dream Challenge takes place and I will be working on this project as my internship job in BiGCaT, Bioinformatics, UM. The challenge brings scientists to catalyze the interaction between experiment and theory in the area of cellular network inference and quantitative model building in systems biology (as said on their webpage).

In this competition, I will work on a specific challenge about network modeling, dynamic response predictions and data visualization. The name of this specific challenge is "HPN-DREAM breast cancer network inference challenge" and the information can be found on this Synapse page.

In this blog, I will write the progression of the project, try to explain the steps and the tools and methods I use and explain more about the challenge.