Wednesday, July 24, 2013

Plotting Expression Profiles Data Analysis for Network Inference

For network inference on the in silico data, I decided to develop my own script because the existing tools have bugs and are not compatible with the data. At the same time, I will try to report the bugs and the compatibility issues to the developers.

The in silico data has 660 experiment results for 20 antibodies, 4 kinds of stimuli and 3 kinds of inhibitors. Antibodies are treated with a stimulus at some time, say t_0. When an inhibitor is used, antibodies are first pre-incubated with it for some time (t_pre), starting at t_i, and then treated with the stimulus. So, the relation is:

t_i = t_0 - t_pre

The causal edges between nodes (antibodies) in the network will be determined from the expression profile of a node in the presence of an intervention on its parent node. That is, if AB1 expression decreases over time when AB12 is inhibited, then we can draw an (excitatory) edge from AB12 to AB1.
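The rule above can be written down as a small comparison of two time courses. This is only a Python sketch, not the code in my script: the function name, the threshold and the example values are all made up for illustration, and a mean difference is just one simple way to compare the profiles.

```python
from statistics import mean


def infer_edge_sign(control, inhibited, threshold=0.1):
    """Compare a child's time course with and without inhibition of a parent.

    control, inhibited: child expression values at the same time points.
    threshold: hypothetical minimum mean difference needed to call an edge.
    Returns 1 (excitatory), -1 (inhibitory) or 0 (no call).
    """
    if len(control) != len(inhibited):
        raise ValueError("time courses must have the same length")
    diff = mean(c - i for c, i in zip(control, inhibited))
    if diff > threshold:
        return 1   # child drops when the parent is inhibited
    if diff < -threshold:
        return -1  # child rises when the parent is inhibited
    return 0


# Made-up AB1 values, not taken from the challenge data
ab1_stimulus_only = [1.0, 1.4, 1.9, 2.3]
ab1_with_inhibitor = [1.0, 1.1, 1.2, 1.2]
edge = infer_edge_sign(ab1_stimulus_only, ab1_with_inhibitor)
print(edge)
```

Here a positive result would correspond to drawing an excitatory edge from the inhibited parent to the child.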

In the script for network inference, there are two functions right now: midasReader and dataPlotter.

midasReader takes the MIDAS file path and the child node as arguments, reads the MIDAS file into a data frame and then, according to the properties of the data, returns a data frame that can be used to plot expression profile graphs. The child node argument is used to select only the related data (the expression of the child node in different cases) from the whole dataset. In the in silico data, each experiment is done three times for each treatment (there are 20 different treatments) and each time point (there are 11 time points). So, there are 220 distinct experiments (20 × 11), which midasReader returns in a data frame with three columns (treatments, timePoints, data).

Since there are three data values for each case, I took their average for now. But there must be more accurate ways of using them, and I have to work on that more.

After the data frame is created, dataPlotter takes three arguments to plot the graph of expression profiles. The first argument is the data frame. The others are the no-intervention and intervention conditions (e.g. hiLIG1 and INH1+hiLIG1). The relations of inhibitors, stimuli and their targets are given in the table below.
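To make the idea of midasReader more concrete, here is a rough Python sketch of the same kind of reader. It is not my R function. The sample rows are invented, and the column names (a "DA:ALL" time column and "DV:" data columns) are assumptions about the file layout rather than something copied from the challenge file.

```python
import csv
import io

# Invented MIDAS-like sample; real files have many more columns and rows
SAMPLE = """TR:INH1:Inhibitors,TR:loLIG1:Stimuli,DA:ALL,DV:AB1,DV:AB12
0,1,0,1.00,0.50
0,1,5,1.60,0.90
1,1,0,1.00,0.40
1,1,5,1.10,0.20
"""


def midas_reader(handle, child_node, time_col="DA:ALL"):
    """Return one record per row with the treatment label, time and child value."""
    reader = csv.DictReader(handle)
    tr_cols = [c for c in reader.fieldnames if c.startswith("TR:")]
    value_col = "DV:" + child_node
    records = []
    for row in reader:
        # Join the names of all treatments switched on in this row
        active = [c.split(":")[1] for c in tr_cols if row[c] == "1"]
        records.append({
            "treatments": "+".join(active),
            "timePoints": float(row[time_col]),
            "data": float(row[value_col]),
        })
    return records


rows = midas_reader(io.StringIO(SAMPLE), "AB1")
for r in rows:
    print(r)
```

With this layout, a row with both INH1 and loLIG1 set to 1 gets the label INH1+loLIG1, which is the same kind of label I pass to the script on the command line.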
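As a small step beyond a plain average, the spread of the replicates can be kept next to the mean. The Python sketch below is only an illustration with made-up numbers; the function name is hypothetical and the standard deviation is just one possible measure of spread.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_replicates(rows):
    """Group (treatment, time point, value) rows and return mean and spread."""
    groups = defaultdict(list)
    for treatment, time_point, value in rows:
        groups[(treatment, time_point)].append(value)
    summary = {}
    for key, values in groups.items():
        # The sample standard deviation needs at least two values
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[key] = (mean(values), spread)
    return summary


# Three made-up replicates for one treatment and one time point
replicates = [("hiLIG1", 0, 1.0), ("hiLIG1", 0, 2.0), ("hiLIG1", 0, 3.0)]
summary = summarize_replicates(replicates)
print(summary)
```

The spread could later be drawn as error bars on the expression profile graphs, so that two profiles are only called different when they are clearly separated.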

Relationship between ligands, stimuli and targets for the in silico model. Taken from: https://www.synapse.org/#!Wiki:syn1720047/ENTITY/56061

So, when I give hiLIG1 and INH1+hiLIG1 as input and select AB1 as the child node, I'm looking for a relation between AB12 and AB1. Below is the graph for this case. I ran the script in the terminal as shown below:

bigcat@bigcat-shut-01:~/gungor/netinf$ ./netInf.R MD_insilico.csv AB1 hiLIG1 INH1+hiLIG1


In this graph, there is no clear difference between the expression profiles, so we cannot tell whether there is a relation just by looking at it. More accurate analyses should be done. Once I am sure about the accuracy of these analyses, I will move on to constructing networks using these graphs.

Another example, which shows a clear difference, is below:

bigcat@bigcat-shut-01:~/gungor/netinf$ ./netInf.R MD_insilico.csv AB1 loLIG1 INH1+loLIG1


Here, we can say that the intervention on AB12 (INH1+loLIG1) causes a decrease in AB1 expression, so at a low amount of stimulus, AB12 increases the amount of AB1.

However, as I mentioned above, the analysis is not yet adequate and should be developed further. Then, more accurate results can be obtained.

Thursday, July 18, 2013

Webinar on HPN-DREAM Breast Cancer Network Inference Challenge

The DREAM8 organizers plan a webinar about the HPN-DREAM Breast Cancer Network Inference Challenge on July 19, at 10:30 - 11:30 (PDT / UTC -7). The general setup of the challenge and demo submissions to the leaderboard will be discussed, and questions about the challenge will also be accepted during the webinar. The number of participants in the challenge has also been announced: 138.

Registration for the webinar is done using this form. There is a limited number of "seats", but recordings will be published later.

Network Inference Challenge in silico Data

I had a meeting with BiGCaT this week and we discussed the DREAM Breast Cancer Challenge. I presented the challenge and also some ways that I have found to solve the first sub-challenge, network inference. Tina, from BiGCaT, suggested starting with the in silico data, which is much simpler than the breast cancer data. Later, I can apply the methods I develop for the in silico data to the experimental data.

The in silico data contains 20 antibodies, 3 inhibitors and 2 ligand stimuli with 2 different concentrations each. Data for 10 time points post-stimulus is provided. Inhibitors work by blocking the kinase domain of a protein. Any phosphorylated kinase is considered to be active, and inhibiting it also affects its downstream kinase activity. Inhibitors sequester away the unphosphorylated form of the target kinase and thus limit the amount of the phosphorylated form.

I used the CellNOptR and CNORfeeder tools for network inference. CNORfeeder is a tool with several methods for network inference. Its FEED method creates boolean tables from the data, which makes it possible to infer a network based only on the data (i.e. a data-derived network).


I plotted the network above in CNORfeeder using only the expression data from the challenge. For this, the mapBTables2model function, which converts a boolean table to a network model, should be called without a model, that is, with "model=NULL". I had to convert the treatment headers in the MIDAS files into a different format because there is a small error in reading MIDAS and converting it into a CNOlist object: "TR:ABC:Stimuli" to "TR:ABC" and "TR:ABC:Inhibitors" to "TR:ABCi". I also needed a new form of the makeBTables function of CNORfeeder because there was no stimuli/no inhibitors condition in the data, which is not suitable for the old version.

CellNOptR and CNORfeeder have been developed by the CNO developers from the European Bioinformatics Institute (EBI). More information can be found on their official website.
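The header conversion described above is mechanical, so it can be sketched in a few lines. This is a Python illustration of the two rewrites only (the function name is made up); it is not part of CellNOptR or CNORfeeder.

```python
def convert_treatment_header(header):
    """Rewrite 'TR:name:Stimuli' to 'TR:name' and 'TR:name:Inhibitors' to 'TR:namei'."""
    parts = header.split(":")
    if len(parts) == 3 and parts[0] == "TR":
        if parts[2] == "Stimuli":
            return "TR:" + parts[1]
        if parts[2] == "Inhibitors":
            return "TR:" + parts[1] + "i"
    # Any other column (for example a DV column) is left unchanged
    return header


headers = ["TR:ABC:Stimuli", "TR:ABC:Inhibitors", "DV:AB1"]
converted = [convert_treatment_header(h) for h in headers]
print(converted)
```

Applying this to the first line of a MIDAS file before reading it is enough; the data rows stay the same.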

Thursday, July 4, 2013

First Impressions and Thoughts on Rosalind Project

Actually, I signed up for Rosalind.info 8 months ago, but I didn't really play around with it. Last week, after I learned more about it in a BiGCaT science cafe, I became more interested than before and started solving problems.

Each problem comes with a description of the context and of the problem itself. There is also a sample input and output, and sometimes there are hints about the solution. What I did was to write a solution that works for the sample and hopefully for the problem. I wrote the scripts in Python, because the project says it is optimized for that and I also want to learn more about Python.

After you submit a solution, if it is successful, you can see the solutions of other users. Each solution is really unique and worth checking out. In this way, you can also learn new approaches. The project then directs you to the next available question that is good to solve after that one, which is nice, too. Moreover, there are user success statistics, such as the top 100 and rankings by country. This might help motivate users through competition. From Turkey, there are only seven users and they don't seem to be solving problems right now, but I hope there will be more.

There are many badges that can be earned after a certain number of successful solutions, and they motivate you to continue.

My username on Rosalind is gungorbudak. I have solved all the problems in "Python Village" and some from "Bioinformatics Stronghold". I will go on solving more in my spare time.

Rosalind is a good way to learn programming and basics in biology, genetics, statistics, computer science and bioinformatics.

Playing around with CellNOptR Tool and MIDAS File

With CellNOptR, we will try to construct network models for the challenge. For this, the tool needs two inputs. The first one is a special data object called CNOlist that stores vectors and matrices of data. The second one is a .SIF file that contains a prior knowledge network, which can be obtained from pathway databases and analysis tools.

CNOlist contains the following fields: namesSignals, namesCues, namesStimuli and namesInhibitors, which are vectors storing the names of measurements. Another vector called timeSignals stores the time points. The field valueSignals contains protein abundance measurements, and the fields valueCues, valueStimuli and valueInhibitors are boolean matrices with a 1 or 0 for each condition (row) when the corresponding cue (column) is present or absent, respectively[4].
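To picture how these fields fit together, here is a small Python mirror of the structure. It is not the R object from CellNOptR. The field names come from the list above, but the layout of valueSignals (one matrix per time point) and the way cues are built from stimuli followed by inhibitors are my assumptions, and the values are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CNOListSketch:
    namesSignals: List[str]
    namesStimuli: List[str]
    namesInhibitors: List[str]
    timeSignals: List[float]
    valueStimuli: List[List[int]]       # rows: conditions, columns: stimuli
    valueInhibitors: List[List[int]]    # rows: conditions, columns: inhibitors
    valueSignals: List[List[List[float]]]  # assumed: [time][condition][signal]

    @property
    def namesCues(self):
        # Assumption: cues are stimuli followed by inhibitors
        return self.namesStimuli + self.namesInhibitors

    @property
    def valueCues(self):
        return [s + i for s, i in zip(self.valueStimuli, self.valueInhibitors)]


# Two made-up conditions: stimulus alone, and stimulus plus inhibitor
cno = CNOListSketch(
    namesSignals=["AB1"],
    namesStimuli=["loLIG1"],
    namesInhibitors=["INH1"],
    timeSignals=[0.0, 5.0],
    valueStimuli=[[1], [1]],
    valueInhibitors=[[0], [1]],
    valueSignals=[[[1.0], [1.0]], [[1.6], [1.1]]],
)
print(cno.namesCues, cno.valueCues)
```

Each row of valueCues describes one condition, so the second row here says that both loLIG1 and INH1 were present.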

A. Experimental setup; B. Data obtained from the experiment; C. MIDAS description
Taken from Bioinformatics (2008) 24 (6): 840-847.doi: 10.1093/bioinformatics/btn018

We are given MIDAS files. Each row in a MIDAS file belongs to a single experimental sample and each column is one sample attribute, such as identity, treatment condition, or a value obtained from an experimental assay. The column header has two values: a two-letter code defining the type of column (e.g. TR for treatment, DV for data value), and a short column name (e.g. a small molecule inhibitor added or a protein assayed). Each cell stores the corresponding value for the row (sample), such as the presence or absence of a stimulus/inhibitor, a time point, or a data value[1].

CellNOptR has a function that can read a MIDAS file, and with another one it is possible to convert it to a CNOlist and work on it. readMIDAS reads the content of the CSV file into a data object and, using that data object, makeCNOlist makes a CNOlist[2]. After I did this and looked at the CNOlist, it worked, but there were some issues with naming: the stimuli vector was named as inhibitors and the inhibitors vector was named as stimuli. The CNOlist also stores the variances of the data values. The variances are not zero if replicates are found[3].
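The header convention can be illustrated with a short Python sketch that splits each header into its two-letter code and its name. The function names and the example headers are invented for illustration, and only the TR and DV codes mentioned above are used.

```python
def parse_midas_header(header):
    """Split a MIDAS column header into its two-letter code and column name."""
    code, _, name = header.partition(":")
    if len(code) != 2 or not name:
        raise ValueError("not a MIDAS-style header: " + header)
    return code, name


def group_columns(headers):
    """Collect column names under their two-letter code."""
    groups = {}
    for header in headers:
        code, name = parse_midas_header(header)
        groups.setdefault(code, []).append(name)
    return groups


# Invented headers in the style described above
example_headers = ["TR:INH1", "TR:loLIG1", "DV:AB1", "DV:AB12"]
columns = group_columns(example_headers)
print(columns)
```

Grouping the columns this way makes it easy to see which columns describe the treatments and which ones hold the measured values.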

So I was able to read the experimental data as an object, with a small problem in naming that can be solved easily by changing the names of the vectors.

This was the first input required for the CellNOptR tool. The second one is the prior knowledge network, and I will work on it in the following days.

1. Flexible informatics for linking experimental data to mathematical models via DataRail, http://bioinformatics.oxfordjournals.org/content/24/6/840.full
2. Training of boolean logic models of signalling networks using prior knowledge networks and perturbation data with CellNOptR, http://www.bioconductor.org/packages/release/bioc/vignettes/CellNOptR/inst/doc/CellNOptR-vignette.pdf
3. Package ‘CellNOptR’, http://www.bioconductor.org/packages/release/bioc/manuals/CellNOptR/man/CellNOptR.pdf
4. In Silico Systems Biology: Scripting with CellNOptR, http://www.cellnopt.org/cytocopter/resources/tutorial_wtac_2013.pdf

Tuesday, July 2, 2013

Progress on Network Inference Sub-Challenge

This sub-challenge has several requirements:

- Directed and causal edges on the models (32 models - 4 cell lines × 8 stimuli)
- Edges should be scored with a confidence value (normalized to the range between 0 and 1)
- Nodes will be phosphoproteins from the data
- Prior knowledge network (that can be constructed using pathway databases) is allowed to be used (actually this is a must for some network inference tools)
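For the scoring requirement, one simple way to bring raw edge scores into the range between 0 and 1 is min-max scaling. The Python sketch below is only an illustration: the function name and the raw scores are made up, and the challenge does not prescribe this particular normalization.

```python
def normalize_scores(scores):
    """Min-max scale a dict of edge -> raw score into the range [0, 1]."""
    values = list(scores.values())
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All edges are equally supported; an arbitrary but valid choice
        return {edge: 1.0 for edge in scores}
    span = highest - lowest
    return {edge: (value - lowest) / span for edge, value in scores.items()}


# Made-up raw scores for three directed edges (parent, child)
raw = {("AB12", "AB1"): 4.0, ("AB5", "AB1"): 1.0, ("AB7", "AB3"): 2.5}
confidence = normalize_scores(raw)
print(confidence)
```

Note that min-max scaling always gives the weakest edge a score of 0, so dividing by the largest score instead may be preferable when every listed edge should keep some confidence.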

The first thing was to look for existing tools. ddepn seemed a good option but it didn't work for us. We checked different tools on Bioconductor for our purpose. There was a tool called CellNOptR, which I thought we could use for the second sub-challenge. Actually, the second sub-challenge has been modified recently on Synapse, so right now I'm not sure about it. But for the first sub-challenge, CellNOptR and a related tool called CNORfeeder will be useful.

This tool takes two inputs. One is data from microarray experiments, which is simply protein abundance measurements under several stimuli/inhibitors, and the other is a prior knowledge network (PKN) that can be constructed using different pathway sources such as WikiPathways and KEGG. It uses various inference methods to integrate the data and validate the network models.

So, we need to construct a PKN with the phosphoproteins in the data and then infer the network models. Next, we need to score each edge on the models and store the results in SIF and EDA files.
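As a rough idea of the last step, the Python sketch below turns a list of scored edges into the lines of a SIF file and an EDA file. It assumes the Cytoscape-style layout ("source interaction target" for SIF, and an attribute name on the first line followed by "source (interaction) target = value" for EDA). The interaction label "1", the attribute name and the edges are placeholders, so the exact format expected by the challenge should be checked on Synapse.

```python
def to_sif_lines(edges, interaction="1"):
    """One 'source interaction target' line per edge."""
    return [f"{src} {interaction} {dst}" for src, dst, _ in edges]


def to_eda_lines(edges, interaction="1", attribute="EdgeScore"):
    """Attribute name first, then 'source (interaction) target = score' lines."""
    lines = [attribute]
    for src, dst, score in edges:
        lines.append(f"{src} ({interaction}) {dst} = {score}")
    return lines


# Made-up scored edges: (parent, child, confidence)
scored_edges = [("AB12", "AB1", 1.0), ("AB7", "AB3", 0.5)]
sif = to_sif_lines(scored_edges)
eda = to_eda_lines(scored_edges)
print("\n".join(sif))
print("\n".join(eda))
```

Writing the two lists to files with the same base name keeps the network and its edge scores together.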

This sub-challenge also has an in silico data part with similar requirements.

I tried an example (from the DREAM 4 Challenge) given in the vignette of CNORfeeder and it worked as expected. In the vignette, a data-driven network is shown without integration with any PKN, but how it is created is not explained.