Friday, September 19, 2014

Data Preprocessing II for Salmon Project

In our Multi-dimensional Modeling and Reconstruction of Signaling Networks in Salmonella-infected Human Cells project, we use several methods to construct the networks, so the data still needs to be preprocessed into the input format each of these methods expects.

One method requires a matrix whose header row holds the protein name followed by the time points (2 min, 5 min, 10 min, 20 min), with each value set to 1 or 0 according to variance, significance and the size of the fold change. If the variance was less than 15% and the significance was less than 0.05, the script collected the values of the peptides belonging to the same protein across all time points and all locations (N, M, C), took the maximum of these values and determined the time point at which it was observed. For each protein, a "1" was then placed under the time point where it had its maximum fold change among all time points and locations, and a "0" under the remaining time points.

Proteins with variance NA, significance NA or fold change NA were ignored. If a row listed multiple proteins, they were split and each protein got the same 1s and 0s.
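The rules above can be sketched in a few lines of Python. This is only an illustration, not the script mentioned in this post: the record layout (a dict per peptide measurement with `protein`, `time`, `location`, `fold_change`, `variance` and `significance` keys), the `;` separator for rows with multiple proteins, and the variance being stored in percent units are all assumptions. It also splits multi-protein rows before taking the maximum, which may differ from the order used in the real script.

```python
import math

# Time points used as matrix columns (taken from the description above).
TIME_POINTS = ["2 min", "5 min", "10 min", "20 min"]


def is_missing(value):
    """Treat None and NaN as NA."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_binary_matrix(rows, max_variance=15.0, max_significance=0.05, sep=";"):
    """Return {protein: [0/1 per time point]} with a single 1 per protein.

    rows: iterable of dicts with hypothetical keys protein, time, location,
    fold_change, variance (assumed to be in percent) and significance.
    """
    # best[protein] = (largest fold change seen so far, time point it was seen at)
    best = {}
    for row in rows:
        values = (row["variance"], row["significance"], row["fold_change"])
        if any(is_missing(v) for v in values):
            continue  # rows with NA values are ignored
        if not (row["variance"] < max_variance
                and row["significance"] < max_significance):
            continue  # row fails the variance or significance filter
        # A row may list several proteins; each one is handled separately.
        for protein in row["protein"].split(sep):
            protein = protein.strip()
            if not protein:
                continue
            current = best.get(protein)
            # Locations are pooled: only the time point of the maximum is kept.
            if current is None or row["fold_change"] > current[0]:
                best[protein] = (row["fold_change"], row["time"])
    return {
        protein: [1 if t == when else 0 for t in TIME_POINTS]
        for protein, (_, when) in best.items()
    }
```

On ties, this sketch keeps the first time point at which the maximum was seen; the post does not say how ties should be handled.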

Another method needs a similar matrix, but here a protein can have multiple 1s, because the method requires all observed maximum values from all time points that satisfy the conditions. Since the data is multi-dimensional (it has both time points and locations), we need to decide how to arrange the data by location. We haven't done that yet, so I'm going to come back to it in a later post.

The script is written in Python and will be published soon.

Thursday, September 18, 2014

How to Convert PED to FASTA

You may need to convert PED files to FASTA format for further analyses in your studies. Use the script below for this purpose.

PED to FASTA converter on GitHub

The script takes the first 6 columns of each line as the header line and the rest as the sequence, replacing 0s with Ns, and writes the result to a FASTA file.

Note: 0s stand for missing nucleotides; this is the default missing-value code in PLINK.

How to run: perl "C:\path\to\PED_File"
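For readers who prefer Python, the same transformation can be sketched as follows. This is not the Perl script linked above, just a minimal illustration of the behavior described in this post. Joining the six header columns with an underscore and writing each sequence on a single line are my assumptions; the function names are made up for this example.

```python
def ped_line_to_fasta(line, header_sep="_"):
    """Convert one PED line into a two-line FASTA record.

    The first 6 whitespace-separated columns become the header; every
    remaining column is an allele call, with "0" (missing) replaced by "N".
    """
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("expected 6 leading columns followed by allele calls")
    # Zeros in the header columns (e.g. unknown parents) are left untouched.
    header = header_sep.join(fields[:6])
    sequence = "".join("N" if allele == "0" else allele for allele in fields[6:])
    return f">{header}\n{sequence}\n"


def ped_to_fasta(lines):
    """Convert an iterable of PED lines to FASTA text, skipping blank lines."""
    return "".join(ped_line_to_fasta(line) for line in lines if line.strip())
```

To convert a whole file, pass an open file object to `ped_to_fasta` and write the returned text to the output file.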

Data Preprocessing I for Salmon Project

Since we'll be using R for most of the analyses, we converted the XLS data file to CSV using MS Office Excel 2013. We then had to fix several lines by hand in Sublime Text 2, because three columns in these lines were left unquoted, which later caused problems when reading the file in RStudio.

The data contains phosphorylation measurements for 8553 peptides, and many peptides have missing data points. The peptides are identified by IPI IDs, which are no longer supported, so we had to convert the IPI IDs to HGNC approved symbols. The data already had gene symbols as names, but they looked outdated.

To convert these IDs, the IPI IDs were first saved into a TXT file, one ID per line. Then, the conversion tool from the Database for Annotation, Visualization and Integrated Discovery (DAVID) was used to map IPI IDs to UniProt ACCs (57,125 lines of mappings were obtained). Next, the UniProt ACCs were mapped to HGNC IDs using ID Mapping from UniProt (10,925 lines of mappings were obtained). Finally, to obtain HGNC approved symbols, all protein-coding gene data was downloaded from HGNC with customized columns allowing conversion of HGNC IDs to HGNC approved symbols (43,127 lines of mappings were obtained). After collecting these three files, they were read in R, the column names were fixed, and the three files were combined by UniProt ACC and HGNC ID (37,316 lines of mappings were obtained). Later, we added these HGNC approved symbols to the data as a column so that they are readily available.
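The chaining of the three mapping files can be illustrated with a small Python sketch. The actual conversion was done in R and is not shown here; this only demonstrates the idea of following IPI ID to UniProt ACC to HGNC ID to approved symbol. The function names and the input shape (lists of ID pairs already parsed from the three files) are assumptions made for the example.

```python
from collections import defaultdict


def index_pairs(pairs):
    """Build {source_id: set(target_ids)} from (source, target) pairs.

    Mappings can be one-to-many, so each source ID keeps a set of targets.
    Pairs with an empty side are skipped.
    """
    index = defaultdict(set)
    for source, target in pairs:
        if source and target:
            index[source].add(target)
    return index


def chain_mappings(ipi_to_uniprot, uniprot_to_hgnc, hgnc_to_symbol):
    """Follow IPI ID -> UniProt ACC -> HGNC ID -> approved symbol.

    Each argument is an iterable of (source, target) pairs. Returns
    {ipi_id: sorted list of symbols}; IPI IDs that cannot be followed
    through all three steps are left out.
    """
    step1 = index_pairs(ipi_to_uniprot)
    step2 = index_pairs(uniprot_to_hgnc)
    step3 = index_pairs(hgnc_to_symbol)
    result = {}
    for ipi_id, accessions in step1.items():
        symbols = set()
        for acc in accessions:
            for hgnc_id in step2.get(acc, ()):
                symbols |= step3.get(hgnc_id, set())
        if symbols:
            result[ipi_id] = sorted(symbols)
    return result
```

An IPI ID can end up with more than one symbol, so a decision is still needed on which one to put in the data column; the sketch leaves that open.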

Later, I'll publish the R code I used for the conversion in a GitHub repo.

Multi-dimensional Modeling and Reconstruction of Signaling Networks in Salmonella-infected Human Cells

In this study, we're going to use phosphorylation data from a research paper on the phosphoproteomic analysis of these cells.

The idea is to use and compare existing methods, and to develop them further, so that we can better understand the nature of signaling events in these cells and find key proteins that might be targets for disease diagnosis, prevention and treatment.

This study will be submitted as a research paper, so I'm not going to publish any results here for now, but I'll write about the problems I run into and the solutions I try.