Monday, December 22, 2014

Structural Superimposition of Local Sequence Alignment using BioPython

This task was given to me as a homework in one of my courses at the university and I wanted to share my solution as I saw there is no such entry on the Internet.

Objectives here are;
  • Download (two) PDB files automatically from the server
  • Do the pairwise alignment after getting their amino acid sequences
  • Superimpose them and report RMSD
Bio.PDB module from BioPython works very well in this case. It's what I used in this solution.

Wednesday, December 3, 2014

How to Install openpyxl on Windows

openpyxl is a Python library to read/write Excel 2007 xlsx/xlsm files. To download and install on Windows:

Download it from Python Packages

Then to install, extract the tar ball you downloaded, open up CMD, navigate to the folder that you extracted and run the following:

C:\Users\Gungor>cd Downloads\openpyxl-2.1.2.tar\dist\openpyxl-2.1.2\openpyxl-2.1.2
C:\Users\Gungor\Downloads\openpyxl-2.1.2.tar\dist\openpyxl-2.1.2\openpyxl-2.1.2>python install

It's going to install everything and will report any error. If there is nothing that seems like an error. You're good to go. Verify the installation by opening Python (command line) and running the following line in Python CMD:

>>> from openpyxl import Workbook

openpyxl documentation is a great place to start with it.

Wednesday, November 26, 2014

How to Install Numpy Python Package on Windows

Numpy (Numerical Python) is a great Python package that you should definitely make use of if you're doing scientific computing

Installing it on Windows might be difficult if you don't know how to do it via command line. There are unofficial Windows binaries for Numpy for Windows 32 and 64 bit which make it super easy to install.

Go to the link below and download the one for your system and Python version:

Then, just double click and do usual Windows installation steps.

Taken from

Friday, September 19, 2014

Data Preprocessing II for Salmon Project

So in our Multi-dimensional Modeling and Reconstruction of Signaling Networks in Salmonella-infected Human Cells project, we have several methods to construct the networks so the data is still needed to be preprocessed so that it can be ready to be analyzed with these methods.

One method needed to have a matrix first row as protein name and time series (2 min, 5 min, 10 min, 20 min), and the values of the proteins in each time series were to be 1 or 0 according to variance, significance and the size of fold change. So, if the variance was less than 15% and the significance was less than 0.05, the script collected values of peptides belonging to the same protein in all time series and all locations (N, M, C) and then got the maximum value from all and determined at which time the maximum value was observed. According to this, for each protein, "1" was put, under the time where it had maximum fold change among all time points and locations. And "0" was put for the rest of the times.

Proteins with variance NA, significance NA or fold change NA were ignored. And if there were multiple proteins for one row, there were split and each got the same 1 and 0s.

Yet another method needs to have such matrix but here values of the proteins can be multiple 1s because it requires all observed maximum values from all time points satisfying the conditions. Here, since the data is multi-dimensional (it has time series and locations), we need to decide how we arrange the data according to locations. For now, we haven't done it yet, so I'm going ro mention it later.

The script is written in Python and will be published soon.

Thursday, September 18, 2014

How to Convert PED to FASTA

You may need the conversion of PED files to FASTA format in your studies for further analyses. Use below script for this purpose.

PED to FASTA converter on GitHub

Gets first 6 columns of each line as header line and the rest as the sequence replacing 0s with Ns and organizes it into a FASTA file.

Note 0s are for missing nucleotides defined by default in PLINK

How to run: perl "C:\path\to\PED_File"

Data Preprocessing I for Salmon Project

Since we'll be using R for most of the analyses, we converted XLS data file to CSV using MS Office Excel 2013 and then we had to fix several lines using Sublime Text 2 because three colums in these lines were left unquoted which later created a problem reading in RStudio.

The data contains phosphorylation data of 8553 peptides. There are many missing data points for many peptides and since IPI IDs were used for peptides and these are not supported now, we had to convert IPI IDs to HGNC approved symbols although data had these symbols as names but they looked outdated.

To convert these IDs, first these IPI IDs were saved into a TXT files each in a new line. Then, conversion tool from Database for Annotation, Visualization and Integrated Discovery (DAVID) was used to map IPI IDs to UniProt ACCs (57.125 lines of maps were obtained). Next, the obtained UniProt ACCs were used to map them to HGNC IDs using ID Mapping from UniProt (10.925 lines of maps were obtained). Finally, to obtain HGNC approved symbols, all protein coding gene data from HGNC was downloaded with costumized colums allowing conversion of HGNC IDs to HGNC approved symbols (43.127 lines of maps were obtained). After collecting these three files, in R, the files were read and after column names were fixed, the union of these three files by UniProt ACC and HGNC ID was obtained (37.316 lines of maps were obtained). Later, we included these HGNC approved symbols in the data as a column to use them readily.

Later, I'll publish the R code I used for the conversion in a GitHub repo.

Multi-dimensional Modeling and Reconstruction of Signaling Networks in Salmonella-infected Human Cells

In this study, we're going to use a phosphorylation data from a research paper on phosphoproteomic analysis of related cells.

The idea is to use and compare existing methods and develop these methods to be able to better understand the nature of signaling events in these cells and to find key proteins that might be targets for disease diagnosis, prevention and treatment.

This study will be submitted as a research paper so I'm not going to publish any results here for now but I'll mention the struggles I have and solutions I try to solve them.

Sunday, April 13, 2014

Download Human Reference Genome (HG19 - GRCh37)

Many variation calling tools and many other methods in bioinformatics require a reference genome as an input so may need to download human reference genome or sequences. There are several sources that freely and publicly provide the entire human genome and I'll describe how to download complete human genome from University of California, Santa Cruz (UCSC) webpage.

Index to the gzip-compressed FASTA files of human chromosomes can be found here at the UCSC webpage.

This is Feb 2009 human reference genome (GRCh37 - Genome Reference Consortium Human Reference 37). You can find more information about it in the page. Note that there are the chromosomes we know such as chr1, chrM (mitochondrial), chrX but also chr11_gl000202_random. These strange ones are scaffolds and patches that include the region that have not been fully understood and placed into an actual chromosome. They are safe to include them in your reference genome. Also, they are very small relative to the actual chromosome files.

The codes I'll present below will be valid for Terminal in Linux OS or MacOS. In Windows, you may use Cygwin that comes with below commands or you may setup a virtual machine and install a Linux distro (You'll love it).

1. Download all (GZ) files - chromosomes

Create a directory that will store the downloaded files:

$ mkdir ~/Downloads/hg19

Change directory to "hg19/":

$ cd ~/Downloads/hg19

Download all files under hg19 chromosomes:

$ wget --timestamping '*'

Or to download individual files (replace * with the file name) - if you don't want all of them:

$ wget --timestamping '' -O chr1.fa.gz

2. Uncompress each GZ file - chromosome in the directory

Create a directory that will store the uncompressed files:

$ mkdir uncompressed

Uncompress each using following single line loop into directory "uncompressed/" in Terminal:

$ for a in *.gz; do gunzip -c $a > uncompressed/`echo $a | sed s/.gz//`; done

3. Merge all chromosomes (1, 2, 3, ..., X, Y) in one FASTA file

Change directory to "uncompressed/"

$ cd uncompressed

Concatanate all FASTA files into one:

$ cat *.fa > hg19_ref_genome.fa

Or only actual chromosomes without scaffolds and patches:

$ cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrM.fa chrX.fa chrY.fa > hg19_ref_genome.fa

So, in this way, you will find hg19_ref_genome.fa (around 3.2 GB - containing whole genome of human) file in your "~/Downloads/hg19/uncompressed" directory.

Wednesday, March 26, 2014

JointSNVMix Installation on Linux Mint 16 Cython, Pysam Included

JointSNVMix is a software package that consists of a number of tools for calling somatic mutations in tumour/normal paired NGS data.

It requires Python (>= 2.7), Cython (>= 0.13) and Pysam (== 0.5.0).

Python must be installed by default ona Linux machine so I will describe the installation of others and JointSNVMix.

Note this guide may become outdated after some time so please make sure before following all.

Install Cython

You may need Python development headers for these Python packages so make sure you have them:

$ sudo apt-get update
$ sudo apt-get install python-dev
$ sudo apt-get install libevent-dev

To install Cython:

$ cd ~/Downloads
$ wget
$ tar xzvf Cython-0.20.1.tar.gz
$ cd Cython-0.20.1
$ sudo python install

Install Pysam

Pysam installation requires ez_setup that requires distribute.

To install distribute:

$ cd ~/Downloads
$ wget
$ unzip -d distribute-0.7.3
$ cd distribute-0.7.3
$ sudo python install

To install ez_setup:

$ cd ~/Downloads
$ wget
$ tar xzvf ez_setup-0.9.tar.gz
$ cd ez_setup-0.9
$ sudo python install 

Ready to install Pysam now:

$ cd ~/Downloads
$ wget
$ tar xzvf pysam-0.5.tar.gz
$ cd pysam-0.5
$ sudo python install

Now, install JointSNVMix:

$ cd ~/Downloads
$ wget
$ tar xzvf JointSNVMix-0.7.5.tar.gz
$ cd JointSNVMix-0.7.5
$ sudo python install


usage: JointSNVMix [-h] {train,classify} ...
JointSNVMix: error: too few arguments

Follow running guide for all arguments and options.

Tuesday, March 25, 2014

ClipCrop Installation on Linux Mint 16 nvm, Node, npm Included

ClipCrop is a tool for detecting structural variations from SAM files. And it's built with Node.js.

ClipCrop uses two softwares internally so they should be installed first.

Install SHRiMP2

SHRiMP is a software package for aligning genomic reads against a target genome.

$ mkdir ~/software
$ cd ~/software
$ wget
$ tar xzvf SHRiMP_2_2_3.lx26.x86_64.tar.gz
$ cd SHRiMP_2_2_3
$ file bin/gmapper

Install BWA

BWA is a software package for mapping low-divergent sequences against a large reference genome.

BWA requires zlib so install it first

$ sudo apt-get install zlib1g-dev

Download latest BWA software from SourceForge

$ cd ~/Downloads
$ tar xjvf bwa-0.7.7.tar.bz2
$ cd bwa-0.7.7
$ make
$ sudo mv ~/Downloads/bwa-0.7.7 /usr/local/bin
$ sudo ln -s /usr/local/bin/bwa-0.7.7 /usr/local/bin/bwa
$ sudo nano /etc/profile

Copy-paste this at the end the document


Save it pressing CTRL + X, then Y and hit ENTER

Now, NVM and Node should be installed but before there are some dependencies.

$ sudo apt-get install build-essential
$ sudo apt-get install libssl-dev curl

Install NVM

NVM is Node Version Manager and allows you to use different versions of Node.js

$ git clone git:// ~/.nvm
$ source ~/.nvm/
$ nvm install v0.6.1
$ nvm use v0.6.1
Now using node v0.6.1

This installation also sets up Node.js v0.6.1, check:

$ node -v

To install ClipCrop we need one last thing which is NPM - Node Package Manager.

Install NPM

$ sudo apt-get install -y python-software-properties python g++ make
$ sudo add-apt-repository ppa:richarvey/nodejs
$ sudo apt-get update
$ sudo apt-get install nodejs npm

$ node -v

This last step will also install a Node.js version but before using ClipCrop, using NVM, Node.js version will be set to v0.6.1 like this:

$ nvm use v0.6.1
Now using node v0.6.1

If it says "No command 'nvm' found":

$ source ~/.nvm/

Install ClipCrop

$ cd ~
$ npm install clipcrop
$ cd ~/node_modules/clipcrop
$ npm link


$ clipcrop

This will tell you how to use the package and what options it has. More is available here.

Some pages to refer

Installation notes for BWA version 0.7.5a-r405,

Installing Node.js using NVM on Ubuntu,

How to install node.js and npm in ubuntu or mint,

Installing Node.js via package manager,

npm documentation,

Saturday, March 15, 2014

Set Up Google Cloud SDK on Windows using Cygwin

Windows isn't the best environment for software development I believe but if you have to use it there are nice softwares to make it easy for you. Cygwin here will help us to use Google Cloud tools but installation requires certain things that you should be aware of beforehand.

You'll need:
Python latest 2.7.x
Google Cloud SDK
Cygwin 32-bit (i.e. setup-x86.exe - note only this one works)
openssh, curl and latest 2.7.x python Cygwin packages
Note:  You'll need to select these packages during Cygwin installation. If you already have Cygwin 32-bit, just rerun the installer and make sure you select them all and later install all dependencies when you're asked.

To set up:
Open up Cygwin Terminal by right clicking and choosing "Run as administrator"
Navigate to the folder that has "google-cloud-sdk" (what's in GCloud SDK download so move it somewhere like "C:\")
Run "./google-cloud-sdk/"
Follow instructions

Hopefully, you won't have any error and will get it working.

Last note is to be able to run GCloud tools in Cygwin Terminal, you'll always have to run it "Run as administrator", or you'll get "Permission denied" errors.

Saturday, March 1, 2014

Super Long Introns of Euarchontoglires

There was another weird result I got about my exon/intron boundaries analysis research. To less diverse species' genes, intron lengths are shown to increase. However, according to my findings, at a point of Euarchontoglires or Supraprimates, this increase is very sharp and seems unexpected. So, I looked at exon/intron length each gene in each taxonomic rank and try to see what makes Euarchontoglires genes with that long introns.

As you see in the graph above, Euarchontoglires introns are very long compared to the rest. So I got the Euarchontoglires genes having more than 10000 bp long introns in average which are;

ENSG00000176124 (61886 bp)
ENSG00000255470 (48283 bp)
ENSG00000233611 (43231 bp)
ENSG00000253161 (23128 bp)
ENSG00000056487 (13482 bp)

When I checked their summaries on Ensembl, I saw that most of them have transcripts that are not protein coding so they tend to have longer introns relative to protein coding transcripts' introns.

So a solution might be retrieving biotypes of transcripts and filtering the ones that are not protein coding. Because in this project, we're focusing on the protein coding genes.

Remember Ensembl API, getting biotypes is really easy. All I need to do is add following to my script;


So I got my data with its biotype information and filtered out the ones that are not protein coding. Later, when I repeated the analysis with the new data, the unexpected peak at Euarchontoglires introns was gone.

There is still a lot to be done of course, but for this particular issue, I solved it in this way.

Thursday, February 27, 2014

An Exon of Length 2 Appeared in Ensembl

I want to share an interesting finding about our research on exon/intron analysis of human evolutionary history.

So I had the genes that emerged at each pass point of human history and I was using Ensembl API to get exons and introns of these genes to perform further analyses.

There was one gene (ENSG00000197568 - HERV-H LTR-associating 3 - HHLA3) with a surprise. Because it's one transcript (ENST00000432224) had an exon (ENSE00001707577) of length 2. At first I couldn't realize the oddness but later in group discussions it was obvious that an exon with only 2 bases cannot occur.

So we checked different databases (NCBI, UCSC Genome Browser) for the same gene and realized that that exon was not there and their gene finding algorithms placed those 2 bases as a part of an intron and the transcript has one less exon compared to the one in Ensembl databases.

This shows gene finding algorithms are still not in their best forms and different sources need to be checked before going into a conclusion about exons/introns.

Monday, February 24, 2014

How to Convert PLINK Binary Formats into Non-binary Formats

PLINK is a whole genome association analysis toolset and to save time and space, you need to convert your data files to binary formats (BED, FAM, BIM) but of course when you need to view the files, you have to convert them back to non-binary formats (PED, MAP) to be able to open them in your text editor such as Notepad on Windows OS.

This operation is really easy. It requires PLINK of course, and the following line of code written to DOS window (Run -> type cmd; hit ENTER) in the directory of PLINK:

plink --bfile YOUR_BINARY_FILE --recode --out YOUR_NON-BINARY_FILE

First, you need to install PLINK if you don't have.

Note this tut is for Windows OS.

Go to Download section and download the correct version for your system. For Windows OS, it's MS-DOS.

Then, extract it to "C:" folder in your Computer. Make sure that you have plink.exe in the extracted folder. That's it.

To convert your files, start a new DOS window and navigate to your PLINK directory which is "C:\plink-1.07-dos". To do that type:

cd c:\plink-1.07-dos

When you changed the directory to PLINK's dir, you are ready to start conversion.

Not to confuse, it's better to create a folder inside "C:\plink-1.07-dos", say, "files". Then, move BED, FAM and BIM files inside this folder. Then with the code below, you can convert these files into non-binary forms.

plink --bfile files/YOUR_BINARY_FILE_NAME --recode --out files/YOUR_NON-BINARY_FILE_NAME

Change "YOUR_BINARY_FILE_NAME" with the name of your files (they have the same name except for the extension). And change "YOUR_NON-BINARY_FILE_NAME" with anything you want.

Next, hit ENTER and wait for the analysis. After it's done you'll see:

"Analysis finished: CURRENT DATE"

You can navigate to your files folder (C:\plink-1.07-dos\files) and see your non-binary forms PED and MAP.

More about PLINK and information for other operating systems can be found on PLINK website.