Microarray Data Analysis with R/Bioconductor
Zongfu Cao (曹宗富), CapitalBio Corporation (博奥生物有限公司)
2011-05-28

Outline
  Introduction to Microarray
  Introduction to R/Bioconductor
  Expression Profiling Analysis using R/Bioconductor

Introduction to Microarray
  DNA
    Array-based SNP Detection
    Array-based CNV Detection
    DNA Methylation Microarray
  RNA
    Gene Expression Profiling Microarray
    MicroRNA Microarray
  Protein
  Cell
  Application
    Human health: prediction, prevention, personalization
    Species identification: pathogens, bacteria
    Breeding

Introduction to Microarray
  Sample, image, data analysis

Introduction to Microarray Data
  Quality assessment
  Background adjustment
    non-specific hybridization
    noise in the optical detection system
  Normalization
    different efficiencies of reverse transcription, labeling, or hybridization reactions
    physical problems with the arrays
    reagent batch effects
    laboratory conditions
  Summarization
    multiple probes
  Non-specific filtering
  Differentially expressed genes
  Multiple testing
  Heatmap

Introduction to R
  Robert C. Gentleman, Ross Ihaka
  R vs. S, SAS, Matlab, Stata
  Started in 1992, first emerged in 1996
  Free, open-source program
  R and perl, C, Java
  http://www.r-project.org/

Robert C. Gentleman
  2009.9-present: Senior Director, Bioinformatics and Computational Biology, Genentech
  2004-2009.8: Adjunct Professor, Department of Statistics, University of Washington, Seattle, WA
  2005-2008: Adjunct Associate Professor, Department of Biostatistics, Harvard University, Boston, MA
  2005-2006: Visiting Professor, University of Ghent, Ghent, Belgium
  2000-2004: Associate Professor, Dana-Farber Cancer Institute and Harvard University, Department of Biostatistics
  2001: Bioconductor project, NIH
  1999-2000: Visiting Scholar, Harvard University, School of Public Health, Department of Biostatistics
  1998-2000: Senior Research Fellow, University of Auckland, Clinical Trials Research Unit, Department of Medicine
  1996-2000: Senior Lecturer, University of Auckland, Department of Statistics
  1992-1996: Lecturer, University of Auckland, Department of Mathematics and Statistics (developed R)
  1988-1992: Assistant Professor, University of Waterloo, Department of Statistics and Actuarial Science

Introduction to Bioconductor
  R/Bioconductor: http://www.bioconductor.org
  The Bioconductor project started in 2001 and is overseen by a core team, based primarily at the Fred Hutchinson Cancer Research Center, and by other members coming from US and international institutions.
  It gained widespread exposure in a 2004 Genome Biology paper.

  Background
  Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data.
  Bioconductor uses the R statistical programming language, and is open source and open development.
  It has two releases each year, more than 460 packages, and an active user community.

Introduction to Bioconductor
  Bioconductor Books
    Bioinformatics and Computational Biology Solutions Using R and Bioconductor
    R Programming for Bioinformatics
    Bioconductor Case Studies

Install Bioconductor Packages
  Install R
  Install a selection of core Bioconductor packages
    source("http://bioconductor.org/biocLite.R")
    biocLite()
  Install a particular package, e.g., limma
    biocLite("limma")
    biocLite(c("GenomicFeatures", "AnnotationDbi"))

Bioconductor Mailing Lists
  Search Mailing Lists
  bioconductor@r-project.org

User Guides and Package Vignettes
  http://svitsrv25.epfl.ch/R-doc/doc/html/packages.html

Expression Profiling Analysis
  Preprocessing: Oligonucleotide Arrays
    library("affy")
    ReadAffy()    # input data
    expresso()    # background adjustment, normalization, summarization
    justRMA()     # more efficient
    exprs()       # extract expression values
    library(simpleaffy)
    ampli.eset <- call.exprs(cel, "mas5", sc = target)
    qcs <- qc(cel, ampli.eset)

Expression Profiling Analysis
  Preprocessing: Two-Color Spotted Arrays
    library(limma)
    read.maimages()            # input data
    backgroundCorrect()        # background adjustment
    normalizeWithinArrays()    # normalize within arrays
    normalizeBetweenArrays()   # normalize between arrays
    exprs.MA()                 # extract expression values
    avereps()                  # summary
    plotMA()                   # MA plot

Expression Profiling Analysis
  Non-specific filtering
    Intensity-based
    Variability across samples
    Fraction of Present calls
    R packages: genefilter

Expression Profiling Analysis
  Differentially expressed genes
    library(samr)
    samr()             # Significance Analysis of Microarrays
    library(multtest)
    mt.rawp2adjp()     # adjusted p-values for simple multiple testing procedures
    library(limma)
    lmFit()            # linear model for series of arrays
    eBayes()           # empirical Bayes statistics for differential expression

Expression Profiling Analysis
  Clustering and visualization
    library(amap)
    hcluster()     # hierarchical clustering, more efficient than hclust()
    dist()         # distance matrix computation
    library(ctc)
    r2gtr()        # write to gtr, atr, cdt file formats for Treeview
    r2atr()
    r2cdt()
    library("gplots")
    heatmap.2()    # extensions to the standard R heatmap()

Expression Profiling Analysis
  Workflow
    Integration
    Independence
    Methods
      Write R scripts/functions for each step
      Call the scripts according to the analysis demand
        DOS: R CMD BATCH SAM.r
        perl
        etc.

Expression Profiling Analysis
  Efficiency
    Time: 8h vs. 24h
    Cost: machine vs. people
    Accuracy: reduce human error
    Experience: slaves and slave owners

Thank you!
Questions?
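
Backup: Normalization sketch

The Normalization step in the data-analysis overview can be illustrated with quantile normalization, one common between-array method. This is a minimal base R sketch on simulated intensities. It is not the affy or limma implementation, the helper name quantile_normalize is hypothetical, and ties are broken by order of appearance rather than averaged.

```r
# Hypothetical helper: quantile-normalize a probes x arrays matrix so that
# every array (column) ends up with the same intensity distribution.
# Simplification: ties are broken by order of appearance, not averaged.
quantile_normalize <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x), nrow(x) > 1)
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each probe within its array
  sorted <- apply(x, 2, sort)                         # each array sorted independently
  target <- rowMeans(sorted)                          # reference distribution
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Simulated raw intensities for three arrays with different overall brightness.
set.seed(1)
raw <- cbind(array1 = rexp(100, rate = 1),
             array2 = rexp(100, rate = 0.5),
             array3 = rexp(100, rate = 2))

log_raw  <- log2(raw + 1)
log_norm <- quantile_normalize(log_raw)

# Before: medians differ between arrays. After: they coincide.
print(apply(log_raw, 2, median))
print(apply(log_norm, 2, median))
```

For real data the packaged routines listed on the preprocessing slides (for example normalizeBetweenArrays() in limma) would normally be used instead of a hand-written helper.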
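
Backup: Filtering and multiple testing sketch

The non-specific filtering, differential expression, and multiple testing steps can be chained as follows. This is a sketch under stated assumptions: the expression matrix is simulated, the filter keeps the more variable half of the genes (an arbitrary cutoff), and base R t.test() and p.adjust() stand in for the genefilter, samr, multtest, and limma functions named on the slides.

```r
set.seed(42)

# Simulated log2 expression: 500 genes, 4 control and 4 treated samples.
n_genes <- 500
group <- factor(rep(c("control", "treated"), each = 4))
expr <- matrix(rnorm(n_genes * 8, mean = 8, sd = 0.5), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Simulated effect: the first 25 genes are shifted up in the treated samples.
expr[1:25, group == "treated"] <- expr[1:25, group == "treated"] + 3

# Non-specific filtering: keep the more variable half of the genes,
# without looking at the group labels.
gene_sd  <- apply(expr, 1, sd)
keep     <- gene_sd >= median(gene_sd)
filtered <- expr[keep, , drop = FALSE]

# Differential expression: one Welch two-sample t-test per remaining gene.
raw_p <- apply(filtered, 1, function(v) t.test(v ~ group)$p.value)

# Multiple testing: Benjamini-Hochberg adjustment of the raw p-values.
adj_p    <- p.adjust(raw_p, method = "BH")
de_genes <- names(adj_p)[adj_p < 0.05]

cat(sum(keep), "genes kept after filtering;",
    length(de_genes), "genes with BH-adjusted p < 0.05\n")
```

Filtering before testing reduces the number of hypotheses, which is why it is done without reference to the group labels.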
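
Backup: Clustering and heatmap sketch

The clustering and visualization slide lists amap, ctc, and gplots. The same idea can be tried with base R only, using dist(), hclust(), and heatmap() from the stats package. The data below are simulated, and the output file name heatmap.pdf is an arbitrary choice for this sketch.

```r
set.seed(7)

# Simulated log2 expression: 40 genes x 6 samples,
# with the first 20 genes up-regulated in samples 4 to 6.
expr <- matrix(rnorm(40 * 6), nrow = 40,
               dimnames = list(paste0("gene", 1:40), paste0("sample", 1:6)))
expr[1:20, 4:6] <- expr[1:20, 4:6] + 4

# Hierarchical clustering of samples (columns) and genes (rows)
# on Euclidean distances with average linkage.
sample_tree <- hclust(dist(t(expr)), method = "average")
gene_tree   <- hclust(dist(expr), method = "average")

# Cut the sample tree into two groups.
sample_cluster <- cutree(sample_tree, k = 2)
print(sample_cluster)

# Draw the heatmap into a PDF in the session's temporary directory,
# reusing the trees computed above.
heatmap_file <- file.path(tempdir(), "heatmap.pdf")
pdf(heatmap_file)
heatmap(expr,
        Rowv = as.dendrogram(gene_tree),
        Colv = as.dendrogram(sample_tree),
        scale = "row")
dev.off()
```

Saved as a script, this can be run non-interactively in the way the workflow slide describes, for example with R CMD BATCH.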