Wednesday, May 25, 2016

Computing Significance of Overlap between Two Sets using Hypergeometric Test

There are many cases where we have two sets (e.g. under two different conditions) of things such as transcripts, genes or proteins and we want to compute the significance of the overlap between them. Hypergeometric test is very simple and widely used option for such cases.

I'll use the phyper function in R but you can use the same idea in SciPy (Python).

Let's say you have from 200 genes (A);

  • 10 genes common or overlapping (set B ∩ set C)
  • 25 genes in set B
  • 50 genes in set C
  • 135 genes not in set B or set C

To compute the significance of overlap use;

phyper(10, 50, 200 - 50, 25, lower.tail = FALSE)
[1] 0.0214406

So, if your threshold for p-value is 0.05 (or 5%), then you can say the overlap is significant.

Tuesday, May 10, 2016

ODTÜ Enformatik Enstitüsü'nün 20. Yılı Etkinliği

ODTÜ Enformatik Enstitüsü kuruluşunun 20. yılını bir bilim festivaliyle kutluyor. 16 Mayıs 2016'da, ODTÜ Kültür ve Kongre Merkezi'nde gerçekleştirilecek olan bilim festivaline herkes davetlidir!

Bilime, sanat ve müziğin de eşlik edeceği bu festivalde aşağıdaki ana konuşmacılar yer alacaktır:

  • Prof. Dr. Jennifer Hayes
    New England Microsoft Araştırma ve New York Microsoft Araştırma yönetici ve eş kurucu
  • Assoc. Prof. Claudio Ferretti
    Milano-Bicocca Üniversitesi, Bilgisayar Bilimi, Sistemleri ve İletişimi
  • Dr. Christian Borgs
    Araştırmacı, New England Microsoft Araştırma vekil yönetici ve eş kurucu

Etkinlik programı

Saturday, January 16, 2016

Mann Whitney U Test (Wilcoxon Rank-Sum Test) Javascript Implementation

Currently Javascript is really poor in statistical methods compared to Python (SciPy) and R. There are several efforts to fill this gap, most notably from jStat. However, still many functions, distributions and tests are missing in this library. In one of my projects, I had to implement a Javascript version of Mann Whitney U test (or also called Wilcoxon rank-sum test). Here, I'm giving a link to its source code and describing how it works.


Mann Whitney U test is a nonparametric test with a null hypothesis that two samples belong to the same population. Consider you have two groups of numbers, they don't follow any known distribution and you want to test if they are different. In such cases, you'd use Mann Whitney U test.

The above plots show that between two groups (A/A and G/G, G/A) minimal CD4 levels are significantly different (Kobayashi et al., Jpn. J. Infect. Dis., 55, 131-133, 2002). And their significance are shown at the top as p-values.

This implementation is adapted from SciPy and R source codes and tested in both with several datasets.

GitHub Gist for Mann Whitney U test Javascript implementation (mannwhitneyu.js).

How to use

Just download mannwhitneyu.js file and add a script tag to your HTML pointing the downloaded file and call the test with your datasets. Create a file called index.html and paste the following in it and save. Also make sure you place mannwhitneyu.js next to it.

    <title>Mann Whitney U test</title>
    <script type="text/javascript" src="mannwhitneyu.js"></script>
    <script type="text/javascript">
        var x = [2, 4, 6, 2, 3, 7, 5, 1],
            y = [8, 10, 11, 14, 20, 18, 19, 9];
        var t = mannwhitneyu.test(x, y, alternative = 'less');

Open index.html, and look at Console (Ctrl + Shift + J), you'll see the result.

Object {U: 0, p: 0.0004654861357875073}

This result shows that numbers in x are significantly different and smaller than the ones in y. The alternative argument can also be greater, which is again a one-sided test, testing if the first group has numbers that are significantly different and greater. Also, you may give two-sided as alternative, which will compute a two-sided test.

Soon, I'll send this code to jStat and hopefully it'll be available there. I'm also considering making it as a Node module so that developers can easily include it in their Node projects.

Wednesday, September 16, 2015

Generating 2D SVG Images of MOL Files using RDKit Transparent Background

The latest release of RDKit (2015-03) can generate SVG images with several lines of codes but by default the generated SVG image has a white background. The investigations on sources didn't solve my problem as I couldn't find any option for setting background to transparent background.

An example of SVG image generation can be found on RDKit blog post called New Drawing Code.

In [3] shows the SVG image generation and it returns the SVG file content in XML. When you open this file in a text editor, you'll see there is a rect element with a style attribute having fill #ffffff which is white. If you make this none, the whole SVG background becomes transparent.

So, if you obtain the SVG file content as XML from using moltosvg function in the blog post, just use the following function to make its background transparent.

import xml.etree.ElementTree as ET

def transparentsvg(svg):
    # Make the white background transparent
    tree = ET.fromstring(svg)
    rect = tree.find('rect')
    rect.set('style', rect.get('style').replace('#ffffff', 'none'))
    # Recover some missing attributes for correct browser rendering
    tree.set('version', '1.1')
    tree.set('xmlns', '')
    tree.set('xmlns:rdkit', '')
    tree.set('xmlns:xlink', '')
    return '<?xml version="1.0" encoding="UTF-8"?>' + ET.tostring(tree).strip()

You can write this to an SVG file easily

svg = transparentsvg(svg)
with open('path/to/svg/file', 'w') as f:

Install Cairo Graphics and PyCairo on Ubuntu 14.04 / Linux Mint 17

Cairo is a 2D graphics library implemented as a library written in the C programming language but if you'd like to use Python programming language, you should also install Python bindings for Cairo.

This guide will go through installation of Cairo Graphics library version 1.14.2 (most recent) and py2cairo Python bindings version 1.10.1 (also most recent).

1. Install Cairo

It's very easy with the following repository. Just add it, update your packages and install.

sudo add-apt-repository ppa:ricotz/testing
sudo apt-get update
sudo apt-get install libcairo2-dev

2. Install py2cairo

cd ~
git clone git://

See what's your prefix

python -c "import sys; print sys.prefix"

Install dependencies

sudo apt-get install automake pkg-config libtool


cd ~/py2cairo
./ --prefix=/usr
sudo make
sudo make install

3. Verify

>>> import cairo
>>> cairo.cairo_version_string()
>>> cairo.version

Now, you can use up-to-date versions of these softwares in your computer.

Install RDKit 2015-03 Build on Ubuntu 14.04 / Linux Mint 17

RDKit is an open source toolkit for cheminformatics. It has many functionalities to work with chemical files.

Follow the below guide to install RDKit 2015-03 build on an Ubuntu 14.04 / Linux Mint 17 computer. Since Ubuntu packages don’t have the latest RDKit for trusty, you have to build RDKit from its source.

1. Install Dependencies

sudo apt-get install flex bison build-essential python-numpy cmake python-dev sqlite3 libsqlite3-dev libboost1.54-all-dev

2. Download the Build

cd /usr/local
sudo wget
sudo tar -xzf RDKit_2015_03_1.tgz
sudo mv rdkit-Release_2015_03_1/ rdkit
cd rdkit/
sudo mkdir build

3. Set Environment Variables

vim ~/.bashrc
# Enter following three lines at the end of .bashrc
export RDBASE="/usr/local/rdkit"
source ~/.bashrc

4. Download InChi API (Optional, remove the arg in the next step if you skip)

sudo bash

5. Build

cd $RDBASE/build
sudo make
sudo make install

6. Verify

>>> import rdkit
>>> rdkit.rdBase.rdkitVersion

Please comment if you have any issue with this installation.

Sunday, August 30, 2015

Generating 2D Images of Molecules from MOL Files using Open Babel

Open Babel is a tool to work with molecular data in any way from converting one type to another, analyzing, molecular modeling, etc. It also has a method to convert MOL files into SVG or PNG images to represent them as 2D images.

Install Open Babel in Linux as following or go to their page for different operating systems

sudo apt-get install openbabel

Open Babel uses the same command to generate SVG or PNG and recognizes the file format using the given filename to as the output option (-O). Also, it's possible to generate transparent images in SVG using option "-xb" with a value "none". This doesn't work for PNGs. There are also other options, one of which is "--title" to write the name of the molecule to the image. Leave it blank if you don't want to see any title.

An example PNG generation from MOL file of benzene molecule.

gungor@gungors-mint ~/Desktop $ obabel benzene.mol -O benzene.png --title Benzene
1 molecule converted

To see all other options for available image formats, follow the Open Babel documentation Image formats page.

Simple Way of Python's subprocess.Popen with a Timeout Option

subprocess module in Python provides us a variety of methods to start a process from a Python script. We may use these methods to run an external commands / programs, collect their output and manage them. An example use of it might be as following:

from subprocess import Popen, PIPE 
p = Popen(['ls', '-l'], stdout=PIPE, stderr=PIPE)
stdout, stderr = p.communicate()
print stdout, stderr

These lines can be used to run ls -l command in Terminal and collect the output (standard output and standard error) in stdout and stderr variables using communicate method defined in the process.

However, if the ls command never ends, we will never get any output because the external program just hangs. This happens sometimes with some programs e.g. trying to generate 3D molecule from 2D MOL file using Open Babel when the MOL file is not correctly formed. As the Open Babel just tries to generate 3D and doesn't check if the MOL file is okay, we never get an output from that run. So the program and our script hang.

Python has many different options to solve this problem like using multiprocessing or threading module, signal module etc. But here I'll describe a very simple method that worked for me fine. This is tested in Linux environment with Python version 2.7.

from time import sleep
from subprocess import Popen, PIPE 
def popen_timeout(command, timeout):
    p = Popen(command, stdout=PIPE, stderr=PIPE)
    for t in xrange(timeout):
        if p.poll() is not None:
            return p.communicate()
    return False 
print popen_timeout(['python',
                    '/home/gungor/Desktop/'], 25)

import time 
for i in xrange(10):
    print "Gungor", i

Example run 1 where the external program doesn't exceed the timeout threshold

gungor@gungors-mint ~/Desktop $ python
('Gungor 0\nGungor 1\nGungor 2\nGungor 3\nGungor 4\nGungor 5\nGungor 6\nGungor 7\nGungor 8\nGungor 9\n', '')

In this example, I ran with 25 seconds timeout on an external program ( which runs for 20 seconds and outputs lines of strings to the standard output which are collected with communicate method by

Example run 2 where the external program would take longer than the timeout threshold

gungor@gungors-mint ~/Desktop $ python

The just returns False. Because the external program was still running when the timeout has been achieved and it has been killed afterwards.

You can use this method to control the execution of the external programs in Python.

Saturday, June 27, 2015

Convert XLS/XLSX to CSV in Bash

In most of the modern Linux distributions, Libre Office is available and it can be used to convert XLS or XLSX file(s) to CSV file(s) in bash.

For XLS file(s):

for i in *.xls; do libreoffice --headless --convert-to csv "$i"; done

For XLSX file(s):

for i in *.xlsx; do libreoffice --headless --convert-to csv "$i"; done

You may get following warning but it still works fine:

javaldx: Could not find a Java Runtime Environment!
Warning: failed to read path from javaldx

You'll see a standard output similar to following when the conversion is successful:

convert /home/gbudak/Downloads/cmap/cmap_instances_02.xls -> /home/gbudak/Downloads/cmap/cmap_instances_02.csv using Text - txt - csv (StarCalc)

Sunday, June 7, 2015

Obtaining Molecule Description using Open Babel / PyBel

Open Babel is a great tool to analyze and investigate molecular data (.MOL, .SDF files). Its Python API is particularly very nice if you are familiar with Python already. In this post, I'll demonstrate how you can obtain molecule description such as molecular weight, HBA, HBD, logP, formula, number of chiral centers using PyBel.


$ sudo apt-get install openbabel python-openbabel

Usage for MW, HBA, HBD, logP

After reading .MOL file, we need to use calcdesc method with descnames argument for getting the descriptions. Getting formula is different, though.

gungor@gungors-mint ~/Desktop $ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybel import *
>>> mol = readfile('mol', 'benzene.mol').next()
>>> desc = mol.calcdesc(descnames=['MW', 'logP', 'HBA1', 'HBD'])
>>> desc['formula'] = mol.formula
>>> print desc
{'HBA1': 0.0, 'HBD': 0.0, 'MW': 78.11184, 'logP': 1.6865999999999999, 'formula': 'C6H6'}

Usage for Number of Chiral Centers

This is not included in PyBel, or I couldn't find it but Open Babel comes with obchiral terminal command which returns the chiral atoms found in the molecules. We can parse this output to get the number of chiral centers.

>>> from subprocess import Popen, PIPE
>>> p = Popen(['obchiral', 'brg_flap_1.mol'], stdout=PIPE, stderr=PIPE)
>>> stdout, stderr = p.communicate()
>>> chiral_num = stdout.count('Is Chiral')
>>> print chiral_num

So you can use above codes to get molecule description including molecular weight, HBA, HBD, logP, formula, number of chiral centers.