Wednesday, March 26, 2014

JointSNVMix Installation on Linux Mint 16 Cython, Pysam Included

JointSNVMix is a software package that consists of a number of tools for calling somatic mutations in tumour/normal paired NGS data.

It requires Python (>= 2.7), Cython (>= 0.13) and Pysam (== 0.5.0).
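Before installing anything, it can help to see which of these requirements are already present. The following is only a minimal sketch: the `check_module` helper and the `PYTHON` variable are names made up for this example, and it only tests whether a module can be imported, not whether its version matches.

```shell
# Python interpreter to test; override with e.g. PYTHON=python2.7 (assumption).
PYTHON="${PYTHON:-python}"

# Print "<module>: importable" or "<module>: missing" for one Python module.
check_module() {
    if "$PYTHON" -c "import $1" >/dev/null 2>&1; then
        echo "$1: importable"
    else
        echo "$1: missing"
    fi
}

# Show the interpreter version, then probe the two required packages.
"$PYTHON" --version 2>&1 || echo "$PYTHON: not found"
check_module Cython
check_module pysam
```

Anything reported as missing is covered by the installation steps below.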

Python should already be installed by default on a Linux machine, so I will describe the installation of the others and of JointSNVMix itself.

Note that this guide may become outdated after some time, so please check the current versions before following every step.

Install Cython

You may need Python development headers for these Python packages so make sure you have them:

$ sudo apt-get update
$ sudo apt-get install python-dev
$ sudo apt-get install libevent-dev

To install Cython:

$ cd ~/Downloads
$ wget
$ tar xzvf Cython-0.20.1.tar.gz
$ cd Cython-0.20.1
$ sudo python setup.py install

Install Pysam

Pysam installation requires ez_setup that requires distribute.

To install distribute:

$ cd ~/Downloads
$ wget
$ unzip -d distribute-0.7.3
$ cd distribute-0.7.3
$ sudo python setup.py install

To install ez_setup:

$ cd ~/Downloads
$ wget
$ tar xzvf ez_setup-0.9.tar.gz
$ cd ez_setup-0.9
$ sudo python setup.py install

Ready to install Pysam now:

$ cd ~/Downloads
$ wget
$ tar xzvf pysam-0.5.tar.gz
$ cd pysam-0.5
$ sudo python setup.py install

Now, install JointSNVMix:

$ cd ~/Downloads
$ wget
$ tar xzvf JointSNVMix-0.7.5.tar.gz
$ cd JointSNVMix-0.7.5
$ sudo python setup.py install


Running JointSNVMix without arguments should then print its usage message:

usage: JointSNVMix [-h] {train,classify} ...
JointSNVMix: error: too few arguments

See the running guide for all arguments and options.

Tuesday, March 25, 2014

ClipCrop Installation on Linux Mint 16 nvm, Node, npm Included

ClipCrop is a tool for detecting structural variations from SAM files, and it's built with Node.js.

ClipCrop uses two software packages internally, so they should be installed first.

Install SHRiMP2

SHRiMP is a software package for aligning genomic reads against a target genome.

$ mkdir ~/software
$ cd ~/software
$ wget
$ tar xzvf SHRiMP_2_2_3.lx26.x86_64.tar.gz
$ cd SHRiMP_2_2_3
$ file bin/gmapper

Install BWA

BWA is a software package for mapping low-divergent sequences against a large reference genome.

BWA requires zlib, so install it first:

$ sudo apt-get install zlib1g-dev

Download the latest BWA release from SourceForge into ~/Downloads, then:

$ cd ~/Downloads
$ tar xjvf bwa-0.7.7.tar.bz2
$ cd bwa-0.7.7
$ make
$ sudo mv ~/Downloads/bwa-0.7.7 /usr/local/bin
$ sudo ln -s /usr/local/bin/bwa-0.7.7 /usr/local/bin/bwa
$ sudo nano /etc/profile

Copy and paste this at the end of the file:


Save it by pressing CTRL + X, then Y, and then ENTER.
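Changes to /etc/profile only take effect in new login shells. Assuming the line added there extends PATH with the BWA directory (the exact line is not shown above, so this is an assumption), a small sketch like the following can confirm it after logging in again. The `on_path` helper is a name made up for this example.

```shell
# Return success if directory $1 is one of the entries in PATH.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# /usr/local/bin/bwa is the symlink created in the steps above.
if on_path /usr/local/bin/bwa; then
    echo "bwa directory is on PATH"
else
    echo "bwa directory is not on PATH yet"
fi
```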

Next, NVM and Node need to be installed, but first there are some dependencies.

$ sudo apt-get install build-essential
$ sudo apt-get install libssl-dev curl

Install NVM

NVM is the Node Version Manager; it allows you to use different versions of Node.js.

$ git clone git:// ~/.nvm
$ source ~/.nvm/
$ nvm install v0.6.1
$ nvm use v0.6.1
Now using node v0.6.1

This installation also sets up Node.js v0.6.1. To check:

$ node -v

To install ClipCrop we need one last thing: NPM, the Node Package Manager.

Install NPM

$ sudo apt-get install -y python-software-properties python g++ make
$ sudo add-apt-repository ppa:richarvey/nodejs
$ sudo apt-get update
$ sudo apt-get install nodejs npm

$ node -v

This last step also installs its own Node.js version, but before using ClipCrop, set the Node.js version to v0.6.1 with NVM like this:

$ nvm use v0.6.1
Now using node v0.6.1

If it says "No command 'nvm' found":

$ source ~/.nvm/
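Since two Node.js installations now exist side by side, it can be useful to see which tools the current shell actually knows about. This is a minimal sketch; `check_tools` is a helper name made up for this example, and it relies only on the standard `command -v` lookup, which also reports shell functions such as nvm.

```shell
# Report whether each named command is visible in the current shell.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: not found"
        fi
    done
}

check_tools nvm node npm
```

If nvm shows up as not found, source it again as described above and rerun the check.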

Install ClipCrop

$ cd ~
$ npm install clipcrop
$ cd ~/node_modules/clipcrop
$ npm link


$ clipcrop

This will tell you how to use the package and what options it has. More is available here.

Some pages to refer to

Installation notes for BWA version 0.7.5a-r405

Installing Node.js using NVM on Ubuntu

How to install node.js and npm in ubuntu or mint

Installing Node.js via package manager

npm documentation

Saturday, March 15, 2014

Set Up Google Cloud SDK on Windows using Cygwin

I believe Windows isn't the best environment for software development, but if you have to use it, there is nice software that makes it easier. Here, Cygwin will help us use the Google Cloud tools, but the installation requires certain things that you should be aware of beforehand.

You'll need:
Python latest 2.7.x
Google Cloud SDK
Cygwin 32-bit (i.e. setup-x86.exe - note only this one works)
openssh, curl and latest 2.7.x python Cygwin packages
Note: You'll need to select these packages during the Cygwin installation. If you already have Cygwin 32-bit, just rerun the installer, make sure you select them all, and install all dependencies when you're asked.
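Once Cygwin is installed, the Python requirement can be checked from the Cygwin Terminal. The sketch below is only an illustration; `is_python27` is a helper name made up for this example, and it simply matches the version string that `python --version` prints.

```shell
# Return success if the given version string looks like "Python 2.7.x".
is_python27() {
    case "$1" in
        "Python 2.7."*) return 0 ;;
        *) return 1 ;;
    esac
}

# Python 2 prints its version on stderr, hence the 2>&1.
if is_python27 "$(python --version 2>&1)"; then
    echo "python 2.7.x found"
else
    echo "python 2.7.x not found; rerun the Cygwin installer and select it"
fi
```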

To set up:
Open up Cygwin Terminal by right clicking and choosing "Run as administrator"
Navigate to the folder that contains "google-cloud-sdk" (this is the folder from the GCloud SDK download, so move it somewhere like "C:\" first)
Run "./google-cloud-sdk/"
Follow instructions

Hopefully, you won't have any error and will get it working.

One last note: to run the GCloud tools in Cygwin Terminal, you'll always have to start it with "Run as administrator", or you'll get "Permission denied" errors.

Saturday, March 1, 2014

Super Long Introns of Euarchontoglires

I got another weird result in my exon/intron boundary analysis. Intron lengths have been shown to increase towards the genes of less diverse species. However, according to my findings, at Euarchontoglires (also called Supraprimates) this increase is very sharp and seems unexpected. So I looked at the exon/intron lengths of each gene in each taxonomic rank to see what gives Euarchontoglires genes such long introns.

As you see in the graph above, Euarchontoglires introns are very long compared to the rest. So I pulled out the Euarchontoglires genes whose introns are longer than 10000 bp on average:

ENSG00000176124 (61886 bp)
ENSG00000255470 (48283 bp)
ENSG00000233611 (43231 bp)
ENSG00000253161 (23128 bp)
ENSG00000056487 (13482 bp)
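The selection above can be sketched as a simple threshold filter. This is only an illustration, not the script used for the analysis: the two-column layout (gene ID, average intron length in bp) and the `filter_long_introns` helper are assumptions made for this example, and the rows are the five genes listed above.

```shell
# Print gene IDs whose second column (average intron length, bp) exceeds $1.
filter_long_introns() {
    awk -v min="$1" '$2 + 0 > min + 0 { print $1 }'
}

genes='ENSG00000176124 61886
ENSG00000255470 48283
ENSG00000233611 43231
ENSG00000253161 23128
ENSG00000056487 13482'

printf '%s\n' "$genes" | filter_long_introns 10000
```

Raising the threshold narrows the list; for example, `filter_long_introns 40000` keeps only the three longest.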

When I checked their summaries on Ensembl, I saw that most of them have transcripts that are not protein coding so they tend to have longer introns relative to protein coding transcripts' introns.

So a solution might be retrieving the biotypes of the transcripts and filtering out the ones that are not protein coding, because in this project we're focusing on the protein coding genes.

Remember the Ensembl API? Getting biotypes with it is really easy. All I need to do is add the following to my script:


So I got my data with its biotype information and filtered out the ones that are not protein coding. Later, when I repeated the analysis with the new data, the unexpected peak at Euarchontoglires introns was gone.
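The filtering step itself could look something like the sketch below. It is not the original script: the tab-separated layout (gene, transcript, biotype), the `keep_protein_coding` helper, and the example rows are all placeholders made up for illustration. Only the biotype label `protein_coding` is the value Ensembl uses.

```shell
# Print only rows whose third tab-separated column is "protein_coding".
keep_protein_coding() {
    awk -F '\t' '$3 == "protein_coding"'
}

# Placeholder rows (made up for illustration): gene, transcript, biotype.
printf 'geneA\ttranscriptA1\tprotein_coding\ngeneA\ttranscriptA2\tprocessed_transcript\ngeneB\ttranscriptB1\tlincRNA\n' |
    keep_protein_coding
```

Only the first placeholder row passes the filter; intron lengths would then be recomputed from the rows that remain.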

There is still a lot to be done of course, but for this particular issue, I solved it in this way.