HEREDITAS (Beijing), June 2012, 34(6): 773-783
ISSN 0253-9772
Technique and Method

Received: 2011-10-10; Revised: 2011-12-04
Funding: grant Nos. 31050110432, 68075049
About the author: Tel: 022-23500736; E-mail: gao_
Corresponding author: Tel: 022-23501449; E-mail:
Published online: 2012-4-5 11:00:30
URL: http:/
DOI: 10.3724/SP.J.1005.2012.00773

Processing and analysis of ChIP-seq data (下一代测序中 ChIP-seq 数据的处理与分析)

GAO Shan (高山) 1, ZHANG Ning (张宁) 1, LI Bo (李勃) 2, XU Shuo (徐硕) 3, YE Yan-Bo (叶彦波) 4, RUAN Ji-Shou (阮吉寿) 1

1. College of Mathematics, Nankai University, Tianjin 300071, China;
2. College of Bioengineering, Chongqing University, Chongqing 400044, China;
3. Institute of Scientific and Technical Information of China, Beijing 100038, China;
4. Wuhan Institute of Virology, Chinese Academy of Sciences, Wuhan 430071, China

Abstract: Chromatin immunoprecipitation coupled with next-generation high-throughput sequencing (ChIP-seq) has become a key technology in functional genomics, particularly for studying the regulation of gene expression. The massive volume of data generated by ChIP-seq experiments poses new challenges for bioinformatics researchers. Because data-processing methods in this field lag far behind the experimental techniques, a systematic review of every aspect of ChIP-seq data processing is needed so that more researchers can enter the field and design or improve the relevant algorithms. Using worked examples, this paper describes the complete ChIP-seq data-processing workflow in detail and discusses its main problems and key steps, giving researchers in this field a rapid yet thorough understanding.

Keywords: next-generation sequencing; ChIP-seq; data processing; transcriptional regulation; epigenetics

Next-generation sequencing (Next generation sequencing, NGS) [1-3] is now used in many applications [4], including chromatin immunoprecipitation sequencing (ChIP-seq) [5], RNA sequencing (RNA-seq) [6], and methylation sequencing (Methyl-seq) [7, 8]. This paper reviews ChIP-seq data processing.

1 ChIP-seq

Chromatin immunoprecipitation (Chromatin immunoprecipitation, ChIP) [9, 10] is a technique for studying the interaction between proteins and DNA [11]. ChIP-seq combines ChIP with next-generation sequencing [5]. The ChIP procedure is outlined in Figure 1: proteins are crosslinked to DNA (Crosslink), the chromatin is fragmented, and the protein-DNA complexes are immunoprecipitated with a specific antibody; the crosslinks are then reversed (Reverse crosslink) to release the DNA, which is purified and amplified (for example by PCR). ChIP exists in two forms, native ChIP (N-ChIP) [12] and crosslinked ChIP (X-ChIP) [10].

Figure 1 Schematic of the principle of the ChIP-seq experiment [13]

Sequencing the recovered DNA fragments yields short sequences called reads (Read). ChIP-seq is used to locate the sites where proteins bind DNA (Binding sites), for example cis-acting elements (Cis-acting element) [14], among other applications [15, 16].

This paper uses data from 2 projects. Project 1 studies NRSF (Neuron-restrictive silencer factor) [17] (see also [18, 19]); its datasets are listed in Table 1. Besides NRSF, binding-site data are available for 4 other factors (CTCF [20], STAT1 [21], GABP [22], FoxA1 [22, 23]). The NRSF, CTCF, and STAT1 data can be downloaded in BED format from http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/.

Table 1 NRSF raw datasets (after sequencing) [17]

            1            2
    4 756 090    5 108 543
    2 126 823    3 100 468
    3 661 543    3 834 288
    1 697 893    2 319 582
    3 661 543    3 661 543
    1 697 893    1 697 893

Table 2 Raw datasets in project 2 (after sequencing) [24]

Cell type   Target            Reads
ES cells    pan-H3      1    4490474
ES cells    H3K4me3     2    8398790
ES cells    H3K9me3     2    4411447
ES cells    H3K27me3    2    7211279
ES cells    H3K36me3    2    7217118
ES cells    H4K20me3    2    5139339
ES cells    RNAP II     1    2736500
NPCs        H3K4me3     2    6995068
NPCs        H3K9me3     2    4614191
NPCs        H3K27me3    2    8166774
NPCs        H3K36me3    2    7899115
MEFs        H3K4me3     2   11371374
MEFs        H3K9me3     2    4468908
MEFs        H3K27me3    2   12208145
MEFs        H3K36me3    2   10315848

Project 2 profiles 7 targets in 3 cell types to produce chromatin state profiles (Chromatin state profile), also called chromatin state maps (Chromatin state map) [24]. The 7 targets are RNA polymerase II and 6 histone targets: trimethylated histone H3 lysine 4 (Trimethylated histone H3 lysine 4, H3K4me3), H3K9me3, H3K27me3, H3K36me3, H4K20me3, and pan-H3. The 3 cell types are embryonic stem cells (Embryonic stem cells, ES cells), neural progenitor cells (Neural progenitor cells, NPCs), and embryonic fibroblasts (Embryonic fibroblasts, MEFs).

2 ChIP-seq data processing workflow

ChIP-seq data processing is illustrated with the 2 projects: project 1 [17] and project 2 [24]. Subsequent analyses include GO [25] and Pathway [26, 27] analysis.
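Related analysis is described in [28]; the overall workflow is shown in Figure 2.

Figure 2 ChIP-seq data processing workflow

3 ChIP-seq

3.1 Read mapping

A read (Read) is a short sequence obtained from the 5' end of a DNA fragment, typically 25-50 nt long. Read mapping, or alignment (Alignment), places each read on a reference genome such as hg18.

3.1.1 Main factors affecting read mapping

Two factors are illustrated in Figure 3. Figure 3A concerns the site (Site) within a read: on the Illumina/Solexa platform, for example, sequencing errors concentrate toward the 3' end [29]. Figure 3B concerns the base level.

Figure 3 Main factors affecting read mapping [13]
A: site within the read; B: base level (Base level).

3.1.2 Commonly used read mapping algorithms and tools

Read mapping tools [30] fall into 3 classes:
(1) hash-table (seed) based tools, such as Maq [31] and ELAND;
(2) tools based on the Burrows-Wheeler (B-W) transform, such as Bowtie [32], BWA, and SOAP2;
(3) tools based on the Smith-Waterman algorithm, such as BFAST and SHRiMP.
Of these, Bowtie and BWA are discussed further.

3.1.3 Preprocessing before read mapping

Reads in project 2 are 36 nt long. Because base quality tends to fall toward the 3' end, low-quality bases (for example, below Q20) can be trimmed or filtered before mapping. Sequences introduced by PCR and other contaminating DNA or RNA are also removed. This step is called data cleaning (Data clean).

3.1.4 Quality control and output selection for mapping

Mismatches between a read and the reference can arise from sequencing errors or from true variation (such as SNPs), so a limited number of mismatches is allowed during mapping. Bowtie offers two alignment modes: one limits the total number of mismatches in the read (the v mode), and the other limits the number of mismatches in the seed region (the n mode).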
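Bowtie's v mode accepts an alignment when the read differs from the reference at no more than a given number of positions. The acceptance rule can be sketched as follows, assuming ungapped placement and a brute-force scan of a short reference string. This is not how Bowtie searches (it uses a Burrows-Wheeler index), and the function names `count_mismatches`, `find_hits`, and `unique_best_hit` are hypothetical.

```python
from typing import List, Optional, Tuple


def count_mismatches(read: str, window: str) -> int:
    """Hamming distance between a read and an equally long reference window."""
    if len(read) != len(window):
        raise ValueError("read and window must have the same length")
    return sum(1 for a, b in zip(read, window) if a != b)


def find_hits(read: str, reference: str, max_mismatches: int = 2) -> List[Tuple[int, int]]:
    """Return (position, mismatches) for every ungapped placement within the limit."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mm = count_mismatches(read, reference[pos:pos + len(read)])
        if mm <= max_mismatches:
            hits.append((pos, mm))
    return hits


def unique_best_hit(hits: List[Tuple[int, int]]) -> Optional[Tuple[int, int]]:
    """Keep a read only if its lowest-mismatch placement is unambiguous."""
    if not hits:
        return None
    best = min(mm for _, mm in hits)
    best_hits = [h for h in hits if h[1] == best]
    return best_hits[0] if len(best_hits) == 1 else None


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"
    print(find_hits("TTGAC", reference, max_mismatches=1))
    print(find_hits("TTGAG", reference, max_mismatches=1))
    print(unique_best_hit(find_hits("ACGT", "ACGTACGT", max_mismatches=0)))
```

Reporting only reads with a single best placement is one possible output choice; whether to keep or discard multiply mapped reads depends on the analysis.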
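Because PCR amplification can produce many identical reads, reads mapped to the same position are often reduced to 1.

3.2 ChIP-seq data visualization

Mapping results are stored in formats such as SAM, BAM, and BED and can be visualized in the UCSC Genome Browser or on Web platforms such as Galaxy. In project 2, 2 marks (H3K4me3 and H3K27me3) are compared across the 3 cell types. In ES cells, some promoters carry both marks, forming the bivalent mark (H3K4me3/H3K27me3). Promoters carrying only H3K4me3 are active (Active), whereas promoters carrying only H3K27me3 are repressed (Figure 4).

Figure 4 Distribution of bivalent marks in 3 types of promoter regions [24]
a: H3K4me3 at Polm in the 3 cell types; b: Olig1 carries the bivalent mark H3K4me3/H3K27me3 in ES cells, H3K4me3 in NPCs, and H3K27me3 in MEFs; c: Neurog1; d: Pparg; e: Fabp7; f: Foxp2, high-CpG promoter (High CpG promoters, HCP); g: Foxp2, low-CpG promoter (Low CpG promoters, LCP).

3.3 ChIP-seq binding-site identification

ChIP DNA fragments are 100-500 nt long (mostly 200-300 nt) and are sequenced from the 5' end. Reads therefore accumulate around binding sites and form enrichment regions (Enrichment region), also called "peaks" (Peak) or tag clusters (Tag cluster) (Figure 5). Algorithms that locate "peaks" are called peak-finding algorithms (Peak finding algorithms) and are a key step in ChIP-seq data processing.

3.3.1 Principle of binding-site identification

Locating a "peak" relies on overlapped tags (Overlapped tags) and on the bimodal pattern (Bimodal pattern) that tags from the two strands form around a binding site.

Figure 5 Finding binding sites ("peaks") from the bimodal pattern [33]
A: tags from the two DNA strands form two clusters separated by a distance d.

Around a binding site, a "peak" consists of a sense tag cluster (Sense tag cluster) and an antisense tag cluster (Antisense tag cluster) (Figure 5), also called Watson tags and Crick tags.

Each candidate "peak" is assigned a score (Score). Scores may be based on tag density (Tag density) or fold enrichment (Fold enrichment), or on statistics such as the p-value, q-value, or t-value. "Peaks" are then selected by rank (for example, keeping the Top 10%) or by a threshold, such as a p-value score of 7.5 or a false discovery rate (False discovery rate, FDR) of 0.01 (Equation 1).

$$
\frac{N_E(s) + b}{N_C(s) + b} \tag{1}
$$

In Equation (1), N_E(s) and N_C(s) are the counts associated with s for "peaks" in the experimental (E) and control (C) samples, respectively, and b is a constant that prevents a value of 0; reference [13] uses b = 0.5.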
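The ratio (N_E(s) + b) / (N_C(s) + b) with b = 0.5 can be evaluated directly. A minimal sketch follows, assuming N_E(s) and N_C(s) are available as non-negative counts for the experimental and control samples; the function name `pseudocount_ratio` is hypothetical.

```python
def pseudocount_ratio(n_e: float, n_c: float, b: float = 0.5) -> float:
    """Compute (N_E(s) + b) / (N_C(s) + b).

    n_e: count for the experimental (ChIP) sample, N_E(s)
    n_c: count for the control sample, N_C(s)
    b:   positive constant that keeps the ratio defined when n_c is 0
    """
    if n_e < 0 or n_c < 0:
        raise ValueError("counts must be non-negative")
    if b <= 0:
        raise ValueError("b must be positive to keep the denominator non-zero")
    return (n_e + b) / (n_c + b)


if __name__ == "__main__":
    # Without b, the first two cases would divide by zero.
    for n_e, n_c in [(10, 0), (0, 0), (3, 1)]:
        print(n_e, n_c, pseudocount_ratio(n_e, n_c))
```

The constant matters most when the control count is small: with N_C(s) = 0, the ratio stays finite and still grows with N_E(s).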