Coordinating and Bioinformatics Unit
The Coordinating and Bioinformatics unit is responsible for the creation of the software and informatics infrastructure for the consortium as well as facilitating the efforts of the mouse engineering centers. This page provides information about the infrastructure created for the consortium as well as any software created for the scientific community.

Research Areas
Research interests include bioinformatics, automation, autoimmunity, and diabetes. The research efforts focus on building the computing infrastructure for management of microarray data and looking at the temporal gene expression changes during the etiology of diabetes in rodent and human populations.
Lab Personnel
Sarabjot Pabla
PhD Student
Simarjot Pabla
PhD Student
Mike Aufiero
Systems Analyst
Shan Bai
Systems Analyst
Danilo Guesela
Systems Analyst
Please click the links below for more information.
Research Projects
Development of Molecular Networks from High Throughput Data
This area of my lab develops molecular networks from both microarray and proteomics data to assess the global molecular signatures that reflect disease states. Projects include literature mining approaches as well as the analysis of microarray data for both mRNA and miRNA.

Diabetic Complications Consortium
Overall Goals: The Diabetic Complications Consortium (DiaComp) brings together a number of projects representing a diverse set of disciplines and technologies, with the goal of improving or creating mouse models of human diabetes complications.

Mouse Metabolic Phenotyping Centers
Overall Goals: My laboratory is the Coordinating and Bioinformatics Unit for the MMPC. The mission of the MMPC is to advance medical and biological research by providing the scientific community with standardized, high-quality metabolic and physiologic phenotyping services for mouse models of diabetes, diabetic complications, obesity, and related disorders.

PANDA The Prospective Assessment of Newborns for Diabetes Autoimmunity
Overall Goals: This project is a newborn screening program for type 1 diabetes. Individuals are screened at birth for their risk of developing type 1 diabetes and subsequently monitored semi-annually for molecular markers of the disease.

TEDDY The Environmental Determinants of Diabetes in the Young
Overall Goals: This project is a newborn screening program for type 1 diabetes. Individuals are screened at birth for their risk of developing type 1 diabetes and subsequently monitored semi-annually for molecular markers of the disease.

Genetic control of autoimmune exocrinopathy in NOD mice
Overall Goals: This project investigates the molecular and immunological mechanisms underlying exocrinopathies in NOD mice.

Development of Microarray-based Biomarkers for Type 1 Diabetes
Overall Goals: This project uses microarrays to develop biomarkers for the prediction of type 1 diabetes pathogenesis.

Infrastructure Information
MMPC IT Infrastructure
Our programming paradigm is to develop software systems based on an n-tier architecture, in which the presentation layer, business logic, and data layer are built as separate software systems. These systems were developed to minimize maintenance while providing a robust, scalable model for future growth and for interactions at the national level with other organism databases. They were designed using the Unified Modeling Language (UML), and the designs are available to the general public. The two UML modeling tools we use are Rational Rose and PowerDesigner.
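As a rough illustration of the n-tier separation described above, the following Python sketch keeps the data, business, and presentation tiers in separate units. It is not the consortium's code (which is written in C#); the class names, fields, and sample strains are hypothetical, and an in-memory dictionary stands in for the database.

```python
from dataclasses import dataclass


@dataclass
class Strain:
    """Plain record passed between tiers (hypothetical fields)."""
    strain_id: int
    name: str


class StrainDataLayer:
    """Data tier: the only code that knows how records are stored."""

    def __init__(self):
        # Assumption: an in-memory dict stands in for the relational database.
        self._rows = {1: Strain(1, "NOD"), 2: Strain(2, "C57BL/6J")}

    def fetch_all(self):
        return list(self._rows.values())


class StrainService:
    """Business tier: applies rules, never touches storage or formatting."""

    def __init__(self, data_layer):
        self._data = data_layer

    def strains_sorted_by_name(self):
        return sorted(self._data.fetch_all(), key=lambda s: s.name)


def render_strain_list(service):
    """Presentation tier: turns business objects into display text."""
    lines = [f"{s.strain_id}: {s.name}" for s in service.strains_sorted_by_name()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_strain_list(StrainService(StrainDataLayer())))
```

Because each tier only talks to the one beneath it, the storage or the display could in principle be swapped out without touching the business rules, which is the maintenance benefit the paragraph above refers to.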

MMPC WebServices
MMPC has a broad array of WebServices that implement methods related to client accounts, the order process, and catalog services. The following documentation is provided to help the MMPCs understand the nomenclature and usage of these services. Please click here to view the PDF document.

MMPC Data Model
The core relational data model for the MMPC was created using SQL Server 2000 and was based on a number of existing schemas containing our key subject areas: animal models, genotypes (including array experiment data), histopathology, and phenotype assays. The Mouse Models of Human Cancer Consortium (MMHCC) and the Jackson Labs were particularly helpful and shared several successful models. The DiaComp data model has since been migrated to SQL Server 2005 and modified to include the MMPC (National Mouse Metabolic Phenotyping Centers) data schema. The current version of the database addresses several domains, including DiaComp and MMPC administration, models, strains, publications, external database references, experiments, phenotype assays, microarray data, histology, images, and dataset persistence. The current data model has 250 tables, 55 functions, 994 stored procedures, 141 data views, and a total of 9344 lines of code.

MMPC Administration Data Model

MMPC Science Data Model

* Note: The above links require Internet Explorer version 5.0 or above to view the data model with zoom capability. Please also accept the ActiveX warning to start the viewer. The viewer lists the different data schemas in its navigation drop-down box; click Go next to a link to load that schema.

MMPC Object Model
The MMPC Object Model (MMPC-OM) created for the consortium fully describes the activities of the MMPC and provides an object-oriented API to access the data generated by the consortium. The MMPC-OM was designed using PowerDesigner and UML, written in C#, and compiled as a .NET DLL. The object model contains both administrative and domain-specific classes; however, only the data-centric classes are available to the public. The domain classes comprise object-specific classes (e.g. Model, Strain, Experiment, Protocol) as well as the DataManager and SearchCriteria classes used to retrieve data from the system. The DataManager classes are specific to each of the data types maintained by MMPC. For example, the StrainMgr class provides methods to retrieve strain-specific data. The SearchCriteria classes are also data-type specific and are used by the DataManager classes to query the database with type-specific parameters. For example, the StrainSearchCriteria class provides queryable properties specific to the Strain data in the system.
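The DataManager and SearchCriteria pattern described above can be sketched in Python as follows. The class names StrainMgr and StrainSearchCriteria come from the description; everything else (the `find` method, the `name_contains` and `background` properties, and the sample strains) is an assumption made for illustration and does not reflect the actual MMPC-OM API, which is a C# library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Strain:
    """Object-specific class (fields are hypothetical)."""
    name: str
    background: str


@dataclass
class StrainSearchCriteria:
    """Type-specific query parameters; None means 'do not filter on this'."""
    name_contains: Optional[str] = None
    background: Optional[str] = None


class StrainMgr:
    """DataManager for strains: applies a criteria object to the data store."""

    def __init__(self, store: List[Strain]):
        # Assumption: a list stands in for the database behind the object model.
        self._store = store

    def find(self, criteria: StrainSearchCriteria) -> List[Strain]:
        results = []
        for strain in self._store:
            wanted = criteria.name_contains
            if wanted is not None and wanted.lower() not in strain.name.lower():
                continue
            if criteria.background is not None and criteria.background != strain.background:
                continue
            results.append(strain)
        return results


if __name__ == "__main__":
    mgr = StrainMgr([
        Strain("NOD/ShiLtJ", "NOD"),
        Strain("NOD.B10", "NOD"),
        Strain("C57BL/6J", "B6"),
    ])
    for hit in mgr.find(StrainSearchCriteria(background="NOD")):
        print(hit.name)
```

Keeping the query parameters in their own class means a caller can build up a search without knowing how the manager executes it, and new filters can be added without changing the manager's method signature.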

The object model base was modified to add the MMPC (National Mouse Metabolic Phenotyping Centers) schema. The common object model now contains classes that serve both the DiaComp and MMPC consortium web portals.

To provide the broadest access to the data, we are also creating a WebService that exposes specific portions of the MMPC-OM to the public. Specifically, the WebService will provide access to all the object-specific classes as well as the DataManager and SearchCriteria classes. This provides a mechanism for programmers to create local MMPC-OM objects in other languages. The current version of the MMPC-OM has 185 object classes.

Software Applications
ParaKMeans is a high-performance parallel implementation of the K-means clustering algorithm. We designed the software so that it can be deployed on most Windows operating systems. The applications are written for the .NET Framework v1.1 in the C# programming language. The parallelism comes from the use of a web service to perform the distance calculations and cluster assignments. Because we use a web service, at least one computer must have Internet Information Services (IIS v5 or later) installed and running. The parallel K-means algorithm used in this application is based on the work of Bin Zhang, Meichun Hsu, and George Forman. Documentation is available here.
If you make use of the program presented here, please cite the following article:

Kraj P, Sharma A, Garge N, Podolsky R, McIndoe RA: ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use. BMC Bioinformatics 2008;9:200.
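The division of labor described for ParaKMeans (workers compute distances and cluster assignments, a coordinator combines the results) can be sketched in Python as below. This is only an illustration under stated assumptions: a thread pool stands in for the web service nodes, the function names are hypothetical, and the sketch does not reproduce the published implementation. Each worker returns per-cluster sums and counts for its share of the data, and the coordinator merges them into new centroids.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_chunk(chunk, centroids):
    """Worker step: assign each point to its nearest centroid and return
    per-cluster coordinate sums and counts for this chunk only."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        nearest = min(range(len(centroids)),
                      key=lambda k: squared_distance(point, centroids[k]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]
    return sums, counts


def parallel_kmeans(points, centroids, n_workers=2, max_iter=100):
    """Coordinator: split the data, farm out assignment, merge the partial
    sums into new centroids, and stop when the centroids no longer move."""
    chunks = [points[i::n_workers] for i in range(n_workers)]
    centroids = [list(c) for c in centroids]
    # Assumption: threads stand in for remote web service workers.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(max_iter):
            partials = list(pool.map(lambda ch: assign_chunk(ch, centroids), chunks))
            new_centroids = []
            for k, old in enumerate(centroids):
                total = sum(counts[k] for _, counts in partials)
                if total == 0:
                    new_centroids.append(old)  # leave an empty cluster in place
                    continue
                new_centroids.append([
                    sum(sums[k][d] for sums, _ in partials) / total
                    for d in range(len(old))
                ])
            if new_centroids == centroids:
                break
            centroids = new_centroids
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(parallel_kmeans(data, [(0, 0), (10, 10)]))
```

Only the small per-cluster sums and counts travel back to the coordinator on each iteration, so the amount of data exchanged does not grow with the number of points a worker holds.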
Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high I/O costs involved and the large distance matrices calculated, most clustering algorithms fail on large datasets (30,000+ genes/200+ arrays). We propose a new two-stage algorithm that partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm, and the second stage is a conventional k-means clustering technique. Because the first stage traverses the data in a single scan, speed increases substantially. The data reduction accomplished in the first stage also reduces the memory requirements, allowing us to cluster 44,460 genes without failure, and significantly decreases the time to completion compared to popular k-means programs. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. The software was written in C# (.NET 1.1).
If you make use of the program presented here, please cite the following article:

Sharma A, Podolsky R, Zhao J, McIndoe RA: A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets. Bioinformatics 2009;25:1152-1157.
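The two-stage idea behind HPCluster (a single-scan data reduction followed by conventional k-means) can be illustrated with the simplified Python sketch below. It is not the published algorithm: the first stage here is a flat, threshold-based summarization that keeps only a count and a linear sum per summary, which is a cut-down stand-in for BIRCH clustering features and omits both the CF-tree and the hyperplane partitioning. The function names and the `threshold` parameter are assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def single_scan_summaries(points, threshold):
    """Stage 1: one pass over the data. Each summary is [count, linear_sum];
    a point joins the nearest summary if it lies within `threshold` of that
    summary's centroid, otherwise it starts a new summary."""
    summaries = []
    for p in points:
        best, best_dist = None, None
        for s in summaries:
            centroid = [v / s[0] for v in s[1]]
            dist = squared_distance(p, centroid)
            if best_dist is None or dist < best_dist:
                best, best_dist = s, dist
        if best is not None and best_dist <= threshold ** 2:
            best[0] += 1
            best[1] = [a + b for a, b in zip(best[1], p)]
        else:
            summaries.append([1, list(p)])
    return summaries


def weighted_kmeans(summaries, centroids, max_iter=100):
    """Stage 2: conventional k-means over the summary centroids, with each
    summary weighted by the number of points it absorbed."""
    items = [(n, [v / n for v in ls]) for n, ls in summaries]
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        dim = len(centroids[0])
        sums = [[0.0] * dim for _ in centroids]
        weights = [0] * len(centroids)
        for n, c in items:
            k = min(range(len(centroids)),
                    key=lambda j: squared_distance(c, centroids[j]))
            weights[k] += n
            for d in range(dim):
                sums[k][d] += n * c[d]
        updated = [
            [s / w for s in sums[k]] if w else centroids[k]
            for k, w in enumerate(weights)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    reduced = single_scan_summaries(data, threshold=2)
    print(len(reduced), weighted_kmeans(reduced, [(0, 0), (10, 10)]))
```

The second stage only ever sees the summaries, so its memory use and running time depend on how many summaries the first pass produces rather than on the number of genes.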
Significance Analysis of Microarrays (SAM) is a permutation-based method that determines significance by estimating the false discovery rate (FDR). SAM is freely available as an Excel plug-in and as an R package. However, for large datasets the memory requirements are high and the algorithm fails. To overcome the memory limitations, we have developed a parallelized version of the SAM algorithm called ParaSAM. This high-performance multithreaded application does not require programming experience to run and is designed to provide the general scientific community with an easy-to-use, manageable client-server Windows application. The parallelism comes from the use of web services to perform the permutations. The software is written in C# (.NET 1.1) and is designed in a modular fashion to provide flexibility in both deployment and the user interface. Our results indicate that ParaSAM is not only faster than the serial versions but can also analyze extremely large datasets that cannot be analyzed on a single PC.
If you make use of the program presented here, please cite the following article:

Sharma A, Zhao J, Podolsky R, McIndoe RA: ParaSAM: A parallelized version of the significance analysis of microarrays algorithm. Bioinformatics 2010.
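Because each label permutation in a SAM-style analysis is independent of the others, the permutations are a natural unit of parallel work. The Python sketch below illustrates that idea only; it is a simplified stand-in, not ParaSAM or the full SAM procedure. It uses a fixed cutoff on the absolute statistic instead of SAM's delta-based comparison against expected order statistics, a fixed `s0` constant, and a thread pool in place of web services. All function and parameter names are assumptions.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def relative_difference(row, labels, s0=0.1):
    """SAM-style statistic: difference in group means divided by
    (pooled standard error + s0). Labels are 0 or 1."""
    a = [x for x, g in zip(row, labels) if g == 0]
    b = [x for x, g in zip(row, labels) if g == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    pooled = ((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2)) ** 0.5
    return (mean_b - mean_a) / (pooled + s0)


def count_called(data, labels, threshold):
    """Number of genes whose absolute statistic reaches the cutoff."""
    return sum(1 for row in data
               if abs(relative_difference(row, labels)) >= threshold)


def permuted_count(job):
    """One unit of parallel work: shuffle the labels, then count calls."""
    data, labels, threshold, seed = job
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return count_called(data, shuffled, threshold)


def estimate_fdr(data, labels, threshold, n_perm=100, n_workers=4):
    """Estimate the FDR as the median number of calls under permuted labels
    divided by the number of calls with the observed labels."""
    observed = count_called(data, labels, threshold)
    if observed == 0:
        return 0.0, observed
    jobs = [(data, labels, threshold, seed) for seed in range(n_perm)]
    # Assumption: threads stand in for the web services that run permutations.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        null_counts = list(pool.map(permuted_count, jobs))
    return min(1.0, statistics.median(null_counts) / observed), observed


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    genes = [
        [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0],  # clearly different groups
        [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # no difference
    ]
    print(estimate_fdr(genes, labels, threshold=5.0, n_perm=20))
```

Each job carries its own seed, so the result does not depend on which worker runs which permutation or in what order they finish.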