"Genome Science and Information" Training Workshop

Basic Data Processing for Second-Generation Sequencing
杜政霖 (Du Zhenglin)

1. Basic data processing

Resequencing
  Reads mapping
  SNPs, indels, structural variations
  Exon capture: NimbleGen, SureSelect

de novo sequencing
  Reads assembly
  Genome / transcriptome

2. Second-generation sequencing platform data

Illumina HiSeq 2000 (Solexa)
  Read length: 50-100 bp
  Format: fastq
AB SOLiD
  Read length: 50 bp
  Format: csfasta
Roche GS FLX (454)
  Read length: 400 bp
  Format: sff/fasta

2.1 Solexa fastq format

Each record carries a read ID, the read sequence, and the read quality string. The read ID contains these fields: flowcell ID, lane number, tile number, X, Y, and read 1/2.

The standard Sanger variant, otherwise known as the Phred quality score, assesses the reliability of a base call from its error probability p: Q = -10 log10(p). The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used a different mapping, encoding the odds ratio p/(1-p) instead of the probability p: Q = -10 log10(p/(1-p)).

The quality value of a base is decoded from its quality character as
  Qv = ASCII(char) - 64   (older Solexa/Illumina pipelines)
or
  Qv = ASCII(char) - 33   (standard Sanger encoding)

2.2 SOLiD csfasta format

Reads sequence
Reads quality

2.3 fasta format

Sequence ID
Sequence

3. Reads alignment

Reads are aligned to a reference sequence. The alignment is then used to detect SNPs, indels, and structural variations, and to build a consensus sequence.

3.1 Reads alignment software

3.2 Solexa reads mapping

Software:
  BWA       http:/bio-
  SAMtools  http:/
  SOAP2     http:/
  SOAPsnp   http:/
Requirements: Linux, 64-bit CPU, 4 GB memory

BWA

Index the reference sequences (choose one algorithm with -a; "is" for references smaller than 2 Gb, "bwtsw" for larger genomes):
  bwa index -a is/bwtsw ref.fa
Mapping:
  bwa aln ref.fa short_read.fq > aln_sa.sai
Output alignments in the SAM format, single-end:
  bwa samse ref.fa aln_sa.sai short_read.fq > aln.sam
Output alignments in the SAM format, paired-end:
  bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam

SAM format: http:/bio-

SOAP2

Index the reference sequences:
  2bwt-builder ref.fa
Mapping, single-end:
  soap -a <reads_a> -D <index.files> -o <output>
Mapping, paired-end:
  soap -a <reads_a> -b <reads_b> -D <index.files> -o <PE_output> -2 <SE_output> -m <min_insert_size> -x <max_insert_size>

3.3 SOLiD reads mapping

BioScope

3.4 454 reads mapping

newbler:
  runMapping -o outputdir ref.fa 1.sff
Output files include 454ReadStatus.txt.

4. de novo sequencing

Reads correction
Assembly
  Short reads: Illumina
  Long reads: 3730, 454 reads
  Hybrid reads: short + long reads
Scaffolding
Gap filling

de novo assembly: reads are assembled into contigs.

Scaffolding: contigs are ordered and linked using SOLiD mate-pair (MP) reads.

Contig links:
  ContigID  ContigID  #Links  Insert length (bp)
  ContigA   ContigB   1000    5000
  ContigA   ContigC   200     7000
Resulting contig order: A, B, C

4.1 De novo software

4.2 Solexa assembly

Software:
  Correction tool for SOAPdenovo  http:/
  SOAPdenovo                      http:/
  Velvet                          http://www.ebi.ac.uk/zerbino/velvet/
  ABySS                           http://www.bcgsc.ca/platform/bioinfo/software/abyss
Requirements: Linux, 64-bit CPU, 4 GB to 256 GB memory

SOAPdenovo config file:
  max_rd_len=100
  [LIB]
  avg_ins=200
  reverse_seq=0
  asm_flags=3
  rd_len_cutoff=80
  rank=1
  pair_num_cutoff=3
  map_len=32
  q1=fastqA_read_1.fq
  q2=fastqA_read_2.fq
  [LIB]
  avg_ins=2000
  reverse_seq=1
  asm_flags=2
  rank=2
  pair_num_cutoff=5
  map_len=35
  q1=fastqB_read_1.fq
  q2=fastqB_read_2.fq

Run the assembly:
  soapdenovo all -s config_file -o output_prefix

SOAPdenovo output:
  *.contig   Contigs file
  *.scafSeq  Scaffolds file
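The SOAPdenovo config file in 4.2 has a simple shape: one global max_rd_len line, then one library block per sequencing library. As a minimal sketch, the same file could be generated from Python. The helper name build_soapdenovo_config and the dictionary layout are illustrative assumptions, not part of SOAPdenovo; the keys and values are the ones listed in 4.2, and each library block is opened with a [LIB] header line.

```python
def build_soapdenovo_config(max_rd_len, libraries):
    """Return config text: a global max_rd_len line, then one [LIB] block per library.

    Hypothetical helper for illustration; it only formats key=value lines.
    """
    lines = [f"max_rd_len={max_rd_len}"]
    for library in libraries:
        lines.append("[LIB]")
        # Dictionaries keep insertion order, so keys are written as listed.
        for key, value in library.items():
            lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Short-insert paired-end library (values from the config in 4.2).
short_insert = {
    "avg_ins": 200, "reverse_seq": 0, "asm_flags": 3, "rd_len_cutoff": 80,
    "rank": 1, "pair_num_cutoff": 3, "map_len": 32,
    "q1": "fastqA_read_1.fq", "q2": "fastqA_read_2.fq",
}
# Long-insert library (values from the config in 4.2).
long_insert = {
    "avg_ins": 2000, "reverse_seq": 1, "asm_flags": 2,
    "rank": 2, "pair_num_cutoff": 5, "map_len": 35,
    "q1": "fastqB_read_1.fq", "q2": "fastqB_read_2.fq",
}

config_text = build_soapdenovo_config(100, [short_insert, long_insert])
print(config_text)
```

The printed text can be saved as config_file and passed to soapdenovo with -s.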
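As a worked example for the quality encoding in 2.1, the sketch below decodes a fastq quality string with either ASCII offset (33 or 64) and converts a score back to an error probability under both mappings: the Phred mapping on p and the earlier Solexa mapping on the odds ratio p/(1-p). The function names and the sample quality strings are assumptions made for illustration.

```python
def decode_qualities(quality_string, offset=33):
    """Convert each quality character to Qv = ASCII(char) - offset."""
    return [ord(char) - offset for char in quality_string]


def phred_error_probability(q):
    """Invert Q = -10 * log10(p): returns the error probability p."""
    return 10 ** (-q / 10)


def solexa_error_probability(q):
    """Invert Q = -10 * log10(p / (1 - p)): returns the error probability p."""
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)


# Offset 33 (standard Sanger encoding); made-up quality string.
print(decode_qualities("II5!", offset=33))   # [40, 40, 20, 0]
# Offset 64 (older Solexa/Illumina pipelines); made-up quality string.
print(decode_qualities("hT@", offset=64))    # [40, 20, 0]

# A score of 20 corresponds to roughly a 1% error probability in both
# mappings; the two differ mainly at low scores.
print(phred_error_probability(20))
print(solexa_error_probability(20))
```

Choosing the wrong offset shifts every score by 31, so it is worth checking which encoding a fastq file uses before filtering reads by quality.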