Thursday, August 29, 2013

Network Visualization Using Cytoscape

Cytoscape is a nice tool to visualize network for better understanding and delivery. I used it for in silico data network visualization and the result was really pretty. Now, I have networks constructed using experimental data from HPN-DREAM Challenge.

In this post, I want to demonstrate how to visualize a network with scores. I'm using Cytoscape 2.8 on Ubuntu 12.

First, the network will be read from a SIF file which is default format of Cytoscape for networks. But, to have scores of edges in the network, this SIF file will be different. This SIF file has 4 columns: parent node, relation, child node and score. These columns will be named not to confuse us during visualization. Below, you can find an example.


Next, by clicking on "File" -> "Import" -> "Network from Table (Text/MS Excel)...", we will open a window to read data properly. On the window, we will open additional options by clicking on "Show Text File Import Options", then "Transfer first line as attribute names...". This option will help use column names in the network. Next, we will define source (parent), interaction (relation) and target (child) and also click on ''Score" column to have it as an edge attribute. After all this, we should have this kind of selections:


Then, we click on "Import".


To have a better look of imported network, we can use "Force-directed layout" option. To change the visualization, we go to VizMapper tab of Control Panel on the left. From the list, we select mappings we want to change. Every change will be seen on the next side.

In VizMapper, for each mapping, we define the attribute, and mapping type. Attributes are the columns that we provide in SIF file. There are three mapping types: Continuous, passthrough and discrete mapper. Continuous mapper is used for numbers and provides some kind of gradient look for the mapping. For example, in the example below, edge thickness is done using this option and width is determined by its score. Passthrough mapper puts directly what each row has according to the attribute. Examples are edge scores as edge labels and node names as node labels. In discrete mapper, we select different options for different elements. For example, there are only two relation types and I want "inhibition" (-1) relation to have red colored edge with T-shaped end and "activating" (1) relation to have green colored edge with arrow-shaped end.


Also, we change "Defaults" from Control Panel for node color, font size, background color.


Below, you can see the result for this network.


"Cytoscape is an open source software platform for visualizing complex networks and integrating these with any type of attribute data." Find out more on their website.

Tuesday, August 27, 2013

Plotting Expression Curves for Experimental Data

As I can plot expression curves for in silico data. I moved on experimental data which is more complex and larger. This data is the result of RPPA experiments on different breast cancer cell lines and it includes protein abundance measurements for about 45 phophoproteins. These phosphoproteins are treated with different inhibitors and stimuli and by comparing their expressions, I will try to infer relations between them.

Before moving on inferring part, I want to have a script that can plot the graphs so that I can see particular results for specific cases. For that, I used the same script as in in silico part but I has to modify it a lot to handle experimental data. I mentioned some of my progress on this in the previous post. Now, I have completed it by ordering the data according to pre-defined fashion, handling duplicates (taking average of them), putting NAs for the missing data points and estimating data points which are in the middle (taking average of neighboring data points in the series). I do this estimation because they are not many and most of them are missing only one point. And in this way, I'm able to have more edges in networks.

Then, the network is inferred and the edges are scored. Since second and third inhibitors act on two proteins, before showing the results, the script splits those with two targets and keeps the scores and relations the same. If there is already an edge after the split, based on their score, the one with higher score is written in the result.

In this inference, I consider each inhibitor with its target(s) and construct relation between its target(s) and phophoproteins in the files. And these targets are actually in the data but since they are given in unphophorylated forms, they are not exactly the same as in the ones in the data. For example, "GSK690693" is given as "AKT inhibitor" and when I use it for inferring networks, I find relations between "AKT" and other phophoproteins. "AKT" is present in the data in several phosphorylated forms such as: "AKT_pS473" and "AKT_pT308". So, I have a relation between "AKT" and "AKT_pS473", for example. This is not I want and not what it's asked because in the network I should have only phosphoproteins from the files. Therefore, I might do this: I can duplicate each edge with "AKT" and name them as the forms of AKT present in the file, which means I will assume that "GSK690693" inhibits all phosphorylated forms of AKT. This is true for "MEK", too. I haven't seen the other inhibited proteins "FGFR1" and "FGFR3" in the data, which is strange. I will go over it later.



Above, there is a plot from experimental data. Cell line is "MCF7" and stimulus is "Serum". The antibody being investigated is "AKT_pT308". The invervention condition is "Serum__GSK690693", which means I'm investigating relation of  "AKT" forms on themselves. This plot shows a confident result of inhibiting relation. So we can conclude that "AKT" forms inhibit themselves. Because, under inhibition, its expression increases.

It was run using these parameters:
bigcat@bigcat-shut-01:~/gungor/netinf$ ./netInfPlotter_experimental.R MCF7 AKT_pT308 Serum Serum__GSK690693

Actually, there are not many things I can do anymore because I'm leaving in two weeks and also the challenge will end soon. In the following days, I will try to learn more on statistics and try to improve that part of network inference. I will also try to visualize the networks in Cytoscape and write another post on it.

Wednesday, August 14, 2013

Experimental Data Optimization for Network Inference

As I mentioned in my previous post, experimental data from the challenge has missing data values that create problems during analyses. To solve it, first thing I did was to optimize data, which includes detecting missing conditions and putting NAs for data values and sorting them if necessary.

I wrote two functions in the script. First one ranks the data according to the fashion and sorts it based on these ranks. The second one gets sorted list and looks for missing data values and puts NAs as data values.


Above there is the list of conditions which are designed to represent "fashion" in the data. Also, it's the result of the functions I mentioned above and for this time point, there are 3 missing data values. Actually, "NRG1__GSK690693_GSK1120212" and "PBS__GSK690693_GSK1120212" are missing for all time points. So I won't be able to make any inference using these conditions. "IGF1__GSK690693_GSK1120212" is the only data value missing among the others. There are some other missing data values in different time points and all of these can be estimated.

The first approach I will use for the estimation is calculating formula for regression line and finding corresponding data value using time point. There are several others and I will mention them later.

But before moving on to estimation part, I want to be sure that this optimization approach is working for all cell lines. I applied this approach for MCF7 cell line and it wasn't working as expected. Because in MCF7 cell line data DMSO treatments (no inhibitors) have duplicates, which cannot be handled by this version of script. This is also similar for UACC812 and BT549 cell line.

I will modify the script to work with these and then move to estimation part.

Working with Experimental Data from Network Inference Challenge

As I almost finished with in silico data, I moved on to analyses of experimental data using the same script. But since the characteristics of data is somehow different, before inferring network, I need to modify the script to be able to read experimental data files.

These differences include missing data values for some conditions. This makes analyses difficult because I have to estimate a value for them and this will decrease the confidence score of edges.

I'm working with BT20 (breast cancer) cell line data for the beginning. It has 234 rows of conditions and some of them (zeroth time points) are duplicates. Zero time points do not have stimulus treatment so for now I will exclude them and start with 5 seconds which is the second time point. So I have 6 time points (5, 15, 30, 60, 120, 240).

Stimuli are FGF1, Insulin, EGF, IGF1, HGF, Serum, NRG1, PBS and inhibitors are AKT, AKT & MEK and FGFR1 & FGFR3. And there are 48 phosphoproteins in BT20 cell line data. These make 192 different conditions and time points and 9216 data values.

In R, I defined a fashion for the data and I'm checking the data if it follows the expected pattern. If not, I create an empty data value for that condition and time point. Later, I'll try to estimate values for them. Also, I realized for 240 seconds rows, data does not follow the expected pattern. So now, I have to fix this and start estimation.

After these estimations, I will have a complete dataset as in in silico part and will use the same scripts for plotting the graphs of expression profiles and inferring and scoring networks.

Monday, August 12, 2013

In silico Network Inference Last Improvements and Visualization of Result in Cytoscape

I'm almost done with the analysis of in silico data, although I need to decide if I need further analysis with the inhibiting parent nodes in the network. Last, I couldn't filter out duplicate edges, which were scored differently. Now, with some improvements in the script, low scores duplicates are filtered and there is a better final list of edges which is ready to be visualized.

I also tried visualizing it on Cytoscape. This is not required in the challenge but better to have to see the result. Below, there is a visualization of the result. It has colored edges and shaped edge ends (red and T-shaped edge is inhibiting, green and arrow-shaped edge is activating) according to type of interaction. Each edge has a score and it is indicated numerically and visually by having different thickness. The thicker the edge, the larger score it has relatively.

In silico data-derived network from HPN-DREAM Challenge. Click on the network to enlarge it

I will decide to do further analysis for in silico part according to the rank in the Leaderboard 1B. I have already started modifying the script to read and process experimental data. I will define relations between inhibitors, stimuli and antibodies for experimental data as I did for in silico data, then infer the network using the same script as in in silico part.

Thursday, August 8, 2013

Some String Functions in R, String Manipulation in R

I have programmed with Perl, Python, and PHP before, and string manipulation was more direct and easier in them than in R. But still there are useful functions for string manipulation in R. I'm not an expert in R but I've been dealing with it for a while and I've learned some good functions for this purpose.

Concatenate strings

Concatenation is done with paste function. It gets concatenated strings as arguments separated bu comma and also separator character(s). Default for separator character is one white space. It directly returns output as string and starting strings remain the same.

str1 <- "Gungor"
str2 <- "Budak"
paste(str1, str2, sep="_") # Output: [1] "Gungor_Budak"

More on the documentation for paste

Extract character(s) from strings

This can be done with substr function. It gets string, start and stop indices for the extraction. It directly returns output as string and starting string remains the same.

str1 <- "Gungor"
substr(str1, 1, 3) # Output: [1] "Gun"

More on the documentation for substr

Split characters from strings

One function is strsplit which gets string and the character(s) from which it will be split as arguments. It returns a list and a list of splits in that list. So, to get a direct list of splits. You can use unlist.

str1 <- "Gungor"
unlist(strsplit(str1, "ng")) # Output: [1] "Gu" "or"

More on the documentation for strsplit

Check if character(s) are present in strings

Here, you can use grep functions. grepl takes pattern and string as arguments and returns TRUE if it matches and FALSE otherwise.

str1 <- "Gungor"
grepl("A", str1) # Output: [1] FALSE

More on the documentation for grep

Replace character(s) in strings

gsub is the function I use for this. It takes pattern, replacement and string as arguments. It returns changed string directly.

str1 <- "Gungor"
gsub("ng", "gn", str1) # Output: [1] "Gugnor"

More on the documentation for gsub

These are the ones I use most, but there are many more and I will be adding them to this post in the future.

Latest Progress on Network Inference and Edge Scoring

I have improved network inference part of the script slightly by changing the way of comparing intervention (presence of inhibitor and stimulus) and no intervention (presence of stimulus) data from in silico part.

Now, I'm using a function (simp) from an R package called StreamMetabolism, which gets time points and data values and (does integration) calculates the area under the curve (Sefick, 2009). I do this integration for both condition and then I compare them.

I also set a threshold for edge scoring. First, I set it as 0.05 and looked at the graphs of low scores. The curves from the graphs were not significantly different for these low scores. So I made it 0.1 and again observed low scores. In this case, the curves were looking different enough to be compared and to be included as an edge in the networks.

I'm inferring the network in this part for these conditions:

Inhibitors Stimuli Targets
INH1_loLIG1 loLIG1 AB12_L1
INH1_hiLIG1 hiLIG1 AB12_H1
INH2_loLIG2 loLIG2 AB5_L2
INH2_hiLIG2 loLIG2 AB5_H2
INH3_loLIG1 loLIG1 AB8_L1
INH3_hiLIG1 hiLIG1 AB8_H1
INH3_loLIG2 loLIG2 AB8_L2
INH3_hiLIG2 hiLIG2 AB8_H2

So each condition is scored separately but in networks characters like "L1" are stripped and only target names are written with the antibodies. Because of this, in networks I can get multiple results for the same kind of edge. Of course, they differ in score and I will only write the result with highest score.

Below, you can see the output from latest script.


Somehow, I need to find more edges with considering targets (AB12, AB5 and AB8) as inhibitors or stimuli. For example, according to the output above, AB8 causes an inhibition in AB2. So the new analysis should be done using AB8 as inhibitor and stimuli for AB8 as stimuli for AB2. This will give another set of results of relations between AB8 and the rest. In this way, there can be some more subsequent analysis to obtain enough edge to construct a complete network.

Reference

Sefick, Stephen. 2009. Stream Metabolism-A package for calculating single station metabolism from diurnal Oxygen curves. From http://CRAN.R-project.org/package=StreamMetabolism

Friday, August 2, 2013

Scoring Edges Network Inference HPN-DREAM Challenge

Yesterday, I managed to infer a network for some part of in silico data from the challenge. Since the challenge also asks for scoring the edges in networks, I developed the script further and add a function for that.

edgeScorer function gets data object of average time points for each curve in intervention/no-intervention sets and scores each edge for each set of conditions. For this, first, it looks for the largest difference among the sets and set it as maxDifference and later, it stores differences divided by maxDifference in another data object. In this way, the edge with maximum difference is scored as 1 and the rest is scored relative to it.

I will use these scores for network inference. I will determine a threshold and the network printed will be consisting only of the edges with a score above the threshold.

Thursday, August 1, 2013

Determining Edges More Progress on Network Inference

Lately, I have been writing an R script to infer network using in silico data. Last version of the script was reading MIDAS file and plotting expression profiles. I have modified it and now it reads MIDAS file, does some analyses and prints causal relations to a file. This file is a SIF file as required.

This dataset is generated with 20 antibodies but only 3 of them are perturbed. Also, for one, stimulus is missing. So, right now I can only determine causal edges between 2 of them and the rest.

In script, midasReader functions gets the content and stores each experiment in data object. And dataOptimizer reads the data object and makes it ready for inference. It does some kind of curve comparison (the curves obtained using dataPlotter, for example) and stores the results in another data object. Finally, networkInferrer gets this data object and compares related experiments and decides which kind of edge and prints in into a SIF file.

dataOptimizer does its job by taking average of all time points that belong to the same kind of experiment (the same stimulus and/or inhibitor set). Decision will be made by comparing the average of values obtained by intervention and the ones by no-intervention.

networkInferrer compares intervention and no-intervention averages and sets "relation" variable accordingly. If intervention average value is smaller than no-intervention average value, then the "relation" variable is set to ''1", because inhibition of parent node causes a decrease in the amount of child node, so the conclusion is that there is an excitatory (1) relationship between parent node and child node.


The image above shows an example for two different sets. First, inhibition of AB12 with INH1 and stimulation with low amount of LIG1. Second, there is a change in the amount of stimulus. Both resulted the same kind of relation that is excitatory (1). Notice that these relations are defined between AB1 and AB12 because INH1 and LIG1 target AB12.

The same kind of inference can be done for AB5 which is the other antibody targeted by an inhibitor and stimulus. But since there is only inhibitor for AB8, I don't know how to analyze it yet.

In this approach, I can only look for relations between these 3 antibodies and the rest. But, since inhibition of a child node might cause something on another grandchild node, with this kind of subsequent analysis, relations for all antibodies can be determined with each other.