As part of the ongoing research within the Genetics and Genomics division of the Arthritis Research UK Epidemiology Unit, we are developing tools and methods to analyse the wealth of data that is now available. These focus mainly on the analysis of genome-wide association studies (GWAS), with the aim of refining their results and directing further research. Below are some tools which are being developed:
There is now a wealth of information available from the Encyclopaedia of DNA Elements (ENCODE) international consortium (Birney et al. 2007; ENCODE Project Consortium 2004), hosted by the University of California Santa Cruz (UCSC) through their Genome Browser (Kent et al. 2002). These data have been generated from wet lab experiments, including Chromatin ImmunoPrecipitation Sequencing (ChIP-Seq), DNase hypersensitivity and histone modification studies, and thus provide better evidence of putative function than the predictive algorithms previously used to infer function at a locus. An enormous amount of data is available, including studies in different cell lines and different cell compartments, but the user cannot easily interrogate all of these data sets simultaneously.
Written in Perl, ASSIMILATOR retrieves, queries and processes information for the desired SNPs from the UCSC Genome Browser's public MySQL database and displays this in a simplified, user-friendly manner. All available ENCODE tracks are queried in addition to predefined tracks, such as mRNAs, ESTs and CpG islands. Two features have been designed to improve the efficiency of data retrieval: an XML-based track database, which minimises the number of database queries, and multi-threading support, which queries multiple SNPs simultaneously and so reduces processing time with minimal reduction in individual query performance.
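The retrieval strategy described above can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not ASSIMILATOR's actual code. The XML layout, the function names, the `wgEncodeExampleDnase` table name and the `run_query` callback are all hypothetical, and the query assumes BED-style columns (`chrom`, `chromStart`, `chromEnd`), which many but not all UCSC track tables use.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local track cache; the real ASSIMILATOR XML schema is not
# described in the source. "wgEncodeExampleDnase" is a made-up table name.
TRACK_XML = """<tracks>
  <track table="wgEncodeExampleDnase" label="DNase hypersensitivity"/>
  <track table="cpgIslandExt" label="CpG islands"/>
</tracks>"""


def load_tracks(xml_text):
    """Read (table, label) pairs from the XML cache, avoiding a database
    round trip just to discover which tracks exist."""
    root = ET.fromstring(xml_text)
    return [(t.get("table"), t.get("label")) for t in root.iter("track")]


def overlap_query(table, chrom, pos):
    """Build a parameterised query for features covering a 1-based SNP
    position, assuming 0-based, half-open BED-style coordinates."""
    # Table names cannot be bound as parameters, so accept only identifiers.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    sql = (f"SELECT * FROM {table} "
           "WHERE chrom = %s AND chromStart < %s AND chromEnd >= %s")
    return sql, (chrom, pos, pos)


def annotate_snp(snp, tracks, run_query):
    """Query every cached track for one SNP given as (name, chrom, pos)."""
    name, chrom, pos = snp
    hits = {}
    for table, label in tracks:
        sql, params = overlap_query(table, chrom, pos)
        hits[label] = run_query(sql, params)
    return name, hits


def annotate_all(snps, tracks, run_query, max_workers=4):
    """Annotate several SNPs concurrently, one worker thread per SNP."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda s: annotate_snp(s, tracks, run_query), snps))
```

In real use, `run_query` would wrap a cursor from a MySQL client library connected to the UCSC public server, ideally with one connection per worker thread. Here it is left as a callback so that the sketch runs without network access.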
The output can be viewed in a standard web browser and allows the user to quickly identify SNPs which could be functionally important. To add extra functionality, the ability to view selected SNPs in NCBI's dbSNP (Sherry et al. 2001) and in the UCSC Genome Browser has been incorporated into the output. The user interface has been designed to allow further mining of the output to display information from the multiple cell types and links to external data. ASSIMILATOR automatically queries any new tracks appearing from the ENCODE project on UCSC and includes these in the analysis.
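As an illustration of the external links mentioned above, the following Python sketch builds a dbSNP URL and a UCSC Genome Browser URL for one SNP. It is an assumption about how such links could be formed, not a description of ASSIMILATOR's output code. The function names, the `hg19` assembly default and the 500 bp flanking window are hypothetical choices.

```python
from urllib.parse import urlencode


def dbsnp_url(rsid):
    """Return the dbSNP page URL for an rs identifier such as 'rs12345'."""
    if not (rsid.startswith("rs") and rsid[2:].isdigit()):
        raise ValueError(f"not an rs identifier: {rsid!r}")
    return f"https://www.ncbi.nlm.nih.gov/snp/{rsid}"


def ucsc_url(chrom, pos, db="hg19", flank=500):
    """Return a Genome Browser URL centred on a SNP position.

    The assembly (db) and window size (flank) are illustrative defaults.
    """
    start = max(1, pos - flank)
    end = pos + flank
    query = urlencode({"db": db, "position": f"{chrom}:{start}-{end}"})
    return f"https://genome.ucsc.edu/cgi-bin/hgTracks?{query}"
```

For example, `ucsc_url("chr1", 1000)` yields a link to the region chr1:500-1500 on the assumed assembly, with the colon percent-encoded by `urlencode`.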
The program can be downloaded from here.
Birney, E. et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447, 799-816.
ENCODE Project Consortium (2004) The ENCODE (ENCyclopaedia Of DNA Elements) Project. Science, 306, 636-640.
Kent, W. J. et al. (2002) The human genome browser at UCSC. Genome Res., 12, 996-1006.
Sherry, S. T. et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308-311.