Wednesday, September 16, 2015

Generating 2D SVG Images of MOL Files using RDKit Transparent Background

The latest RDKit release (2015-03) can generate SVG images with a few lines of code, but by default the generated SVG has a white background. Digging into the sources didn't solve my problem, as I couldn't find any option for making the background transparent.

An example of SVG image generation can be found in the RDKit blog post called New Drawing Code.

In [3] of that post shows the SVG generation; it returns the SVG file content as XML. When you open this file in a text editor, you'll see a rect element with a style attribute whose fill is #ffffff, i.e. white. If you change this fill to none, the whole SVG background becomes transparent.

So, if you obtain the SVG content as XML using the moltosvg function from the blog post, just pass it through the following function to make its background transparent.

import xml.etree.ElementTree as ET

def transparentsvg(svg):
    # Make the white background transparent
    tree = ET.fromstring(svg)
    rect = tree.find('rect')
    rect.set('style', rect.get('style').replace('#ffffff', 'none'))
    # Recover some missing attributes for correct browser rendering
    tree.set('version', '1.1')
    tree.set('xmlns', '')
    tree.set('xmlns:rdkit', '')
    tree.set('xmlns:xlink', '')
    return '<?xml version="1.0" encoding="UTF-8"?>' + ET.tostring(tree).strip()

You can write this to an SVG file easily:

svg = transparentsvg(svg)
with open('path/to/svg/file', 'w') as f:
    f.write(svg)

Install Cairo Graphics and PyCairo on Ubuntu 14.04 / Linux Mint 17

Cairo is a 2D graphics library written in the C programming language, but if you'd like to use it from Python, you should also install the Python bindings for Cairo.

This guide goes through the installation of the Cairo graphics library version 1.14.2 (the most recent at the time of writing) and the py2cairo Python bindings version 1.10.1 (also the most recent).

1. Install Cairo

It's very easy with the following PPA. Just add it, update your package lists and install.

sudo add-apt-repository ppa:ricotz/testing
sudo apt-get update
sudo apt-get install libcairo2-dev

2. Install py2cairo

cd ~
git clone git://

Check what your Python prefix is:

python -c "import sys; print sys.prefix"

Install dependencies

sudo apt-get install automake pkg-config libtool


cd ~/py2cairo
./ --prefix=/usr
sudo make
sudo make install

3. Verify

>>> import cairo
>>> cairo.cairo_version_string()
>>> cairo.version

Now you can use up-to-date versions of this software on your computer.

Install RDKit 2015-03 Build on Ubuntu 14.04 / Linux Mint 17

RDKit is an open source toolkit for cheminformatics. It has many functions for working with chemical files.

Follow the guide below to install the RDKit 2015-03 build on an Ubuntu 14.04 / Linux Mint 17 computer. Since the Ubuntu packages for trusty don't have the latest RDKit, you have to build RDKit from source.

1. Install Dependencies

sudo apt-get install flex bison build-essential python-numpy cmake python-dev sqlite3 libsqlite3-dev libboost1.54-all-dev

2. Download the Build

cd /usr/local
sudo wget
sudo tar -xzf RDKit_2015_03_1.tgz
sudo mv rdkit-Release_2015_03_1/ rdkit
cd rdkit/
sudo mkdir build

3. Set Environment Variables

vim ~/.bashrc
# Enter following three lines at the end of .bashrc
export RDBASE="/usr/local/rdkit"
export LD_LIBRARY_PATH="$RDBASE/lib:$LD_LIBRARY_PATH"
export PYTHONPATH="$RDBASE:$PYTHONPATH"
source ~/.bashrc

4. Download InChI API (Optional; if you skip this, remove the corresponding argument in the next step)

sudo bash

5. Build

cd $RDBASE/build
sudo cmake -DRDK_BUILD_INCHI_SUPPORT=ON ..
sudo make
sudo make install

6. Verify

>>> import rdkit
>>> rdkit.rdBase.rdkitVersion

Please comment if you have any issue with this installation.

Sunday, August 30, 2015

Generating 2D Images of Molecules from MOL Files using Open Babel

Open Babel is a tool for working with molecular data in many ways: converting one format to another, analysis, molecular modeling, etc. It also has a method to convert MOL files into SVG or PNG images to represent them in 2D.

Install Open Babel on Linux as follows, or go to their page for other operating systems:

sudo apt-get install openbabel

Open Babel uses the same command to generate SVG or PNG and recognizes the output format from the filename given to the output option (-O). It's also possible to generate transparent images in SVG by passing the option "-xb" with the value "none"; this doesn't work for PNGs. There are other options too, one of which is "--title", which writes the name of the molecule on the image. Leave it blank if you don't want any title.

An example PNG generation from the MOL file of the benzene molecule:

gungor@gungors-mint ~/Desktop $ obabel benzene.mol -O benzene.png --title Benzene
1 molecule converted

To see all other options for the available image formats, see the Image formats page of the Open Babel documentation.

Simple Way of Python's subprocess.Popen with a Timeout Option

The subprocess module in Python provides a variety of methods to start a process from a Python script. We may use these methods to run external commands / programs, collect their output and manage them. An example use might be the following:

from subprocess import Popen, PIPE 
p = Popen(['ls', '-l'], stdout=PIPE, stderr=PIPE)
stdout, stderr = p.communicate()
print stdout, stderr

These lines run the ls -l command and collect the output (standard output and standard error) into the stdout and stderr variables using the communicate method of the process.

However, if the external command never ends, we will never get any output because the program just hangs. This happens with some programs, e.g. trying to generate a 3D molecule from a 2D MOL file using Open Babel when the MOL file is not correctly formed. Since Open Babel just tries to generate the 3D structure without checking whether the MOL file is okay, we never get an output from that run, so both the program and our script hang.

Python has many different options to solve this problem, such as the multiprocessing, threading or signal modules, but here I'll describe a very simple method that worked fine for me. It's tested on Linux with Python 2.7.

from time import sleep
from subprocess import Popen, PIPE

def popen_timeout(command, timeout):
    p = Popen(command, stdout=PIPE, stderr=PIPE)
    # Poll once per second until the process exits or the timeout is hit
    for t in xrange(timeout):
        if p.poll() is not None:
            return p.communicate()
        sleep(1)
    # Still running after the timeout: kill it and give up
    p.kill()
    return False

print popen_timeout(['python',
                    '/home/gungor/Desktop/'], 25)

The external program used in the examples below:

import time

for i in xrange(10):
    print "Gungor", i
    time.sleep(2)  # 10 iterations x 2 s = 20 s total

Example run 1 where the external program doesn't exceed the timeout threshold

gungor@gungors-mint ~/Desktop $ python
('Gungor 0\nGungor 1\nGungor 2\nGungor 3\nGungor 4\nGungor 5\nGungor 6\nGungor 7\nGungor 8\nGungor 9\n', '')

In this example, I ran with a 25-second timeout on an external program which runs for 20 seconds and prints lines to standard output; these are collected with the communicate method.

Example run 2 where the external program would take longer than the timeout threshold

gungor@gungors-mint ~/Desktop $ python

The script just returns False, because the external program was still running when the timeout was reached, and it was killed afterwards.

You can use this method to control the execution of the external programs in Python.

Saturday, June 27, 2015

Convert XLS/XLSX to CSV in Bash

In most modern Linux distributions, LibreOffice is available, and it can be used to convert XLS or XLSX file(s) to CSV file(s) in bash.

For XLS file(s):

for i in *.xls; do libreoffice --headless --convert-to csv "$i"; done

For XLSX file(s):

for i in *.xlsx; do libreoffice --headless --convert-to csv "$i"; done

You may get the following warning, but it still works fine:

javaldx: Could not find a Java Runtime Environment!
Warning: failed to read path from javaldx

You'll see standard output similar to the following when the conversion succeeds:

convert /home/gbudak/Downloads/cmap/cmap_instances_02.xls -> /home/gbudak/Downloads/cmap/cmap_instances_02.csv using Text - txt - csv (StarCalc)

Sunday, June 7, 2015

Obtaining Molecule Description using Open Babel / PyBel

Open Babel is a great tool to analyze and investigate molecular data (.MOL, .SDF files). Its Python API is particularly nice if you are already familiar with Python. In this post, I'll demonstrate how to obtain molecule descriptors such as molecular weight, HBA, HBD, logP, formula and number of chiral centers using PyBel.


$ sudo apt-get install openbabel python-openbabel

Usage for MW, HBA, HBD, logP

After reading the .MOL file, we use the calcdesc method with the descnames argument to get the descriptors. Getting the formula works differently, though.

gungor@gungors-mint ~/Desktop $ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybel import *
>>> mol = readfile('mol', 'benzene.mol').next()
>>> desc = mol.calcdesc(descnames=['MW', 'logP', 'HBA1', 'HBD'])
>>> desc['formula'] = mol.formula
>>> print desc
{'HBA1': 0.0, 'HBD': 0.0, 'MW': 78.11184, 'logP': 1.6865999999999999, 'formula': 'C6H6'}

Usage for Number of Chiral Centers

This is not included in PyBel (or I couldn't find it), but Open Babel comes with the obchiral terminal command, which reports the chiral atoms found in the molecules. We can parse its output to get the number of chiral centers.

>>> from subprocess import Popen, PIPE
>>> p = Popen(['obchiral', 'brg_flap_1.mol'], stdout=PIPE, stderr=PIPE)
>>> stdout, stderr = p.communicate()
>>> chiral_num = stdout.count('Is Chiral')
>>> print chiral_num

So you can use the code above to get molecule descriptors including molecular weight, HBA, HBD, logP, formula and number of chiral centers.

Wednesday, May 6, 2015

How to Get (or Load) NCBI GEO Microarray Data into R using GEOquery Package from Bioconductor

R, especially with its many Bioconductor packages, provides nice tools to load, manage and analyze microarray data. If you are trying to load NCBI GEO data into R, use the GEOquery package. Here I'll describe how to start with it, and in future posts I'll probably cover more.

First load the package:

library(GEOquery)

Download a dataset by accession:

gds <- getGEO("GDS5072")

or load a previously downloaded SOFT file:

gds <- getGEO(filename="path/to/GDS5072.soft.gz")

The getGEO function returns a GDS object (a complex S4 class) which contains the complete dataset. For example, to obtain the sample organism information:

organism <- gds@header$sample_organism
[1] "Homo sapiens"

Obtain sample IDs:

sample <- gds@dataTable@columns$sample
[1] GSM1095883 GSM1095886 GSM1095877 GSM1095878 GSM1095879 GSM1095880
[7] GSM1095881 GSM1095882 GSM1095884 GSM1095885 GSM1095876
11 Levels: GSM1095876 GSM1095877 GSM1095878 GSM1095879 ... GSM1095886

Obtain gene IDs:

genes <- gds@dataTable@table$IDENTIFIER
31596 Levels: ADAM32 AFG3L1P AK9 ALG10 ARMCX4 ATP6V1E2 ...

Obtain levels for the first sample:

sample_1 <- gds@dataTable@table$GSM1095883
[1] "3362.6" "400" "70.9" "362.8" "5" "849"

GEOquery loads all parts of the data in a good format so that you can start your analysis directly. For more information visit package's GitHub vignettes.

Wednesday, January 21, 2015

Salmonella - Host Interaction Network - A Detailed, Better Visualization

We're almost done with the analyses and we're making the final visualization of the network. As I previously posted, the network was clustered and visualized by time points. After that, we have done several more analyses and here I report how we visualized them. I'm going to post more about how we did the analyses separately.

First, the nodes are grouped into experimental and non-experimental (PCSF) nodes. This can easily be done by parsing the experimental network output and the network outputs of PCSF. Second, a Salmonella-human host interactome is used to find out which proteins are host proteins in Salmonella infection. These are imported as network tables into Cytoscape and visualized as shown below.

Next, another set of analyses (not yet visualized) has been done to find "the most important" proteins in the network. Four features are used: path damaging percentages, in-degree centralities, out-degree centralities and betweenness centralities. I used the Python package NetworkX again to compute these features. After obtaining a value of each feature for each protein, I imported them as a network table into Cytoscape as well.

Here is how I visualized each feature:

For time points, a protein carries all four colors when it's seen at all time points; colors change according to the time point(s) at which a protein is seen.

Below is a subgraph of the newly visualized network:

The features that are not visualized yet will also be added and reported soon.

Friday, January 16, 2015

Finding k-cores and Clustering Coefficient Computation with NetworkX

Assume you have a large network and you want to find the k-core of each node and also compute the clustering coefficient of each one. The Python package NetworkX comes with very nice methods to do these easily.

A k-core is a maximal subgraph whose nodes all have degree at least k [1]. To find k-cores:

Add all the edges of your network to a NetworkX graph and use the core_number method, which takes the graph as its single input and returns node / k-core number pairs.

cores = networkx.core_number(G)

If you also need the clustering coefficient of each node, use the clustering method, which takes the graph and a node as inputs (for the unweighted case) and returns the clustering coefficient of the node as a float.

cc = networkx.clustering(G, node)

1. Preventing Unraveling in Social Networks: The Anchored k-Core Problem

Below is an example script that does these operations. Read the code and prepare your input accordingly.
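A minimal example putting both methods together; the toy edge list is my own, and in practice you'd read your own network, e.g. with nx.read_edgelist:

import networkx as nx

# Toy network: a triangle A-B-C with a pendant node D
G = nx.Graph()
G.add_edges_from([('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'D')])

# core_number maps each node to the largest k for which it is in the k-core
cores = nx.core_number(G)

# clustering returns the clustering coefficient of a single node
for node in G.nodes():
    print('%s\tk-core: %d\tclustering: %.3f'
          % (node, cores[node], nx.clustering(G, node)))

Here A, B and C form a 2-core while D only belongs to the 1-core, and C's clustering coefficient is 1/3 since only one of the three pairs among its neighbors is connected.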

Thursday, January 15, 2015

GO Enrichment of Network Clusters

In my previous post, I mentioned how I clustered the network we obtained. For functional annotation, gene ontology (GO) enrichment has been done on these clusters.

There were 20 clusters. The HGNC names were obtained separately for each cluster, and using the DAVID functional annotation tool API, GO and pathway annotations were collected per cluster and saved separately.

,GOTERM_CC_FAT,GOTERM_MF_FAT,BBID,BIOCARTA,KEGG_PATHWAY&ids=HGNC_NAME1,HGNC_NAME2,HGNC_NAME3,...

The above URL was used to obtain the chart report for some GO and pathway chart records. The "ids" query was changed per cluster with the HGNC names coming from each.

Then the downloaded chart reports for each cluster are read and a matrix is created to show the p value of each chart record per cluster. The matrix columns are the 20 clusters and the rows are 1344 chart records. If the p value of a chart record in a cluster is smaller than or equal to 0.05, it's stored in the corresponding cell; otherwise 0 is written as the p value.
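The matrix-building step can be sketched as follows. The in-memory reports dictionary stands in for the downloaded chart-report files, and the term names and p values are made up for illustration:

# Per cluster, a list of (chart record, p value) pairs; hypothetical data
reports = {
    'cluster1': [('GO:0006915~apoptotic process', 0.001),
                 ('hsa04010:MAPK signaling pathway', 0.2)],
    'cluster2': [('GO:0006915~apoptotic process', 0.04)],
}

# Rows of the matrix: every chart record seen in any cluster
terms = sorted({term for records in reports.values() for term, p in records})

# A cell keeps the p value if it is <= 0.05, otherwise 0
matrix = {}
for term in terms:
    row = {}
    for cluster, records in reports.items():
        p = dict(records).get(term)
        row[cluster] = p if p is not None and p <= 0.05 else 0
    matrix[term] = row

print(matrix['GO:0006915~apoptotic process'])
# {'cluster1': 0.001, 'cluster2': 0.04}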

Network Clustering with NeAT - RNSC Algorithm

From the experimental data we obtained proteins at different time points; we then found intermediate nodes (from the human interactome) using the PCSF algorithm, and finally, with a special matrix derived from the network that PCSF created, we validated the edges and determined edge directions using a divide-and-conquer (ILP) approach for the construction of large-scale signaling networks from PPI data. The resulting network is a directed network and will be used and visualized for further analyses.

The first step was to cluster the network. RNSC (Restricted Neighbourhood Search Cluster Algorithm) was used to cluster the network into 20 clusters; these were imported into Cytoscape as a network table and the network was manually split into the 20 clusters based on the RNSC results.

Next, the nodes in the network are colored based on the time points at which they are found to be significantly changed. In the previously visualized network, some nodes seen at more than one time point were not shown correctly; in this one, nodes seen at more than one time point are colored with the corresponding colors of all their time points.

Tuesday, January 13, 2015

Searching Open Reading Frames (ORF) in DNA sequences - ORF Finder

Open reading frames (ORFs) are regions of DNA that are translated into protein. They lie between start and stop codons and they are usually long.

The Python script below searches for ORFs in six frames and returns the longest one. It doesn't treat the start codon as a delimiter and only splits the sequence by stop codons, so an ORF can start with any codon but always ends with a stop codon (TAG, TGA, TAA). The six frames are the sequence itself and its reverse complement, each read in three frames obtained by shifting the start one nucleotide to the right, twice.
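A sketch of such a six-frame search, following the rules above (this is my own reimplementation, not the original script):

def revcomp(seq):
    # Reverse complement (uppercase A/C/G/T only)
    comp = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return ''.join(comp[b] for b in reversed(seq))

def codons(frame):
    # Non-overlapping codons; an incomplete trailing codon is dropped
    return [frame[i:i + 3] for i in range(0, len(frame) - 2, 3)]

def longest_orf(seq):
    # Six frames: the sequence and its reverse complement, each shifted
    # by 0, 1 and 2 nucleotides. An ORF may start with any codon but
    # must end at a stop codon; the stop itself is not included in the
    # returned ORF here.
    stops = {'TAG', 'TGA', 'TAA'}
    best = ''
    for strand in (seq, revcomp(seq)):
        for offset in range(3):
            current = []
            for codon in codons(strand[offset:]):
                if codon in stops:
                    candidate = ''.join(current)
                    if len(candidate) > len(best):
                        best = candidate
                    current = []
                else:
                    current.append(codon)
    return best

print(longest_orf('ATGAAATAA'))  # ATGAAA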

There are also local and online tools that perform the same task. My suggestions are:

NCBI ORF Finder (Online)
STAR: Orf by MIT (Online / Local)

Reconstructed Salmonella Signaling Network Visualized and Colored

After the fold changes were obtained and HGNC names were found for each phosphopeptide, these were used to construct the Salmonella signaling network using PCSF. Then, including the nodes that PCSF found, we generated a matrix with nodes in the rows and time points in the columns, each cell showing the presence of the corresponding protein at the corresponding time point(s).

The matrix has 658 nodes (proteins) and 4 time points as indicated before: 2 min, 5 min, 10 min and 20 min.

Öykü Eren Özsoy and Tolga Can from METU have developed a method which takes this matrix and an interactome and reconstructs a signaling network. This method enhances the interactions of the nodes in the network. Below, see this network in Cytoscape.

Then, to understand which nodes come from the data and which belong to which time points, I colored the nodes according to a two-column table with nodes in the first column and their time points in the second.

Note that there are nodes seen at multiple time points; in this network they are not specially colored. They will be handled later.

Python: Get Longest String in a List

Here is a quick Python trick you might use in your code.

Assume you have a list of strings and you want to get the longest one in the most efficient way:

>>> l = ["aaa", "bb", "c"]
>>> longest_string = max(l, key=len)

Monday, January 12, 2015

Python: defaultdict(list) Dictionary of Lists

Most of the time when you work on large data, you'll need dictionaries in Python. Dictionaries of lists are very useful for storing large data in an organized way. You can always initiate them by creating empty lists inside an empty dictionary, but when you don't know how many you'll end up with and want an easier option, use defaultdict(list). You just need to import it first:

from collections import defaultdict

And then initiate it anywhere (avoid naming it dict, which would shadow the built-in):

d = defaultdict(list)
You can always collect data in this way:
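For example, grouping values under keys without creating the lists first; the gene/sample pairs here are made up for illustration:

from collections import defaultdict

# Hypothetical (gene, sample) pairs
pairs = [('TP53', 'GSM1'), ('BRCA1', 'GSM2'), ('TP53', 'GSM3')]

samples_by_gene = defaultdict(list)
for gene, sample in pairs:
    # A missing key is created with an empty list automatically
    samples_by_gene[gene].append(sample)

print(dict(samples_by_gene))
# {'TP53': ['GSM1', 'GSM3'], 'BRCA1': ['GSM2']}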

Python: extend() Append Elements of a List to a List

When you append a list to a list using the append() method, the whole list is appended as a single nested element:

>>> l1 = ["a"]
>>> l2 = ["a", "b"]
>>> l1.append(l2)
>>> l1
['a', ['a', 'b']]

If you want to append the elements of the list directly, without creating a nested list, use the extend() method:

>>> l1 = ["a"]
>>> l2 = ["a", "b"]
>>> l1.extend(l2)
>>> l1
['a', 'a', 'b']

Salmonella Data Preprocessing for PCSF Algorithm

This post describes data preprocessing in Salmonella project for Prize-Collecting Steiner Forest Problem (PCSF) algorithm.

The Salmonella data, taken from Table S6 in Phosphoproteomic Analysis of Salmonella-Infected Cells Identifies Key Kinase Regulators and SopB-Dependent Host Phosphorylation Events by Rogers, LD et al., has been converted from its original XLS file to a tab-delimited TXT file for easy reading in Python.

The data had to be separated into time point files (2, 5, 10 and 20 minutes), each containing the corresponding phosphoproteins and their fold changes. The data also has another dimension, location (membrane, nucleus, cytosol). When we looked at each location and time point, we saw there were not many overlapping proteins between different locations at the same time point, so we decided to merge the location data for each protein at each time point.

Number of overlapping proteins in each dimension (2N: 2 min. & nucleus; 10C: 10 min. & cytosol).

The data table is read for each time point row by row; if a row has data satisfying the conditions (significance 0.05 and variance 15) in one or more locations, it is stored, and the maximum fold change among the locations is kept together with the HGNC name of the protein. HGNC names are obtained by converting the IPI IDs of the peptides to the HGNC names of the proteins (DAVID is used for this conversion). After collecting all HGNC names and maximum fold changes, each HGNC name and its corresponding maximum fold change(s) are written as output. If there are multiple maximum fold changes for an HGNC name, because the same protein appears in multiple rows, their maximum is taken and written next to that HGNC name.
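The merging rule can be sketched as follows; the row layout, the direction of the threshold comparisons and the example values are my assumptions for illustration, not the project's actual code:

# Toy rows standing in for the tab-delimited data table
rows = [
    # (HGNC name, location, fold change, p value, variance)
    ('AKT1', 'membrane', 2.1, 0.01, 5),
    ('AKT1', 'nucleus',  3.4, 0.04, 10),
    ('CDK1', 'cytosol',  1.2, 0.20, 5),   # not significant, dropped
]

best = {}
for hgnc, location, fold, p, var in rows:
    # Keep only rows passing the significance and variance conditions
    if p <= 0.05 and var <= 15:
        # Merge locations (and duplicate rows) by keeping the maximum
        # fold change per protein
        best[hgnc] = max(best.get(hgnc, float('-inf')), fold)

print(best)
# {'AKT1': 3.4}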

The outputs of this preprocessing are four TXT files of HGNC name / fold change pairs; each file contains the proteins found significant and most activated or deactivated at one time point.

Saturday, January 10, 2015

UPGMA Algorithm Described - Unweighted Pair-Group Method with Arithmetic Mean

UPGMA is an agglomerative, ultrametric clustering algorithm (it assumes a molecular clock, i.e. all lineages evolve at a constant rate) developed by Sokal and Michener in 1958.

The idea is to iterate until only one cluster remains; at each iteration, the two nearest clusters are joined into a higher-level cluster. The distance between any two clusters is calculated by averaging the distances between the elements of each cluster.

To understand better, see UPGMA worked example by Dr Richard Edwards.
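The joining loop can be sketched in Python; the distance matrix below is a toy example, and a real implementation would also record branch lengths for the resulting tree:

from itertools import combinations

# Toy symmetric distance matrix over four elements
dist = {('A', 'B'): 2.0, ('A', 'C'): 4.0, ('B', 'C'): 4.0,
        ('A', 'D'): 6.0, ('B', 'D'): 6.0, ('C', 'D'): 6.0}

def d(x, y):
    return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

def cluster_distance(c1, c2):
    # Average distance over all element pairs, one from each cluster
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(d(x, y) for x, y in pairs) / len(pairs)

# Start from singleton clusters and join the two nearest each round
clusters = [('A',), ('B',), ('C',), ('D',)]
while len(clusters) > 1:
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: cluster_distance(clusters[ij[0]],
                                               clusters[ij[1]]))
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
    clusters.append(merged)
    print(clusters)

With this matrix, A and B are joined first (distance 2), then C joins the (A, B) cluster (average distance 4), and D is merged last.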