Bioinformatics lecture slides, part 5 (bioinformatics_2011_G5.ppt)

Introduction to Bioinformatics (Lecture 5)
Li Jinming (李金明), Professor, Department of Bioinformatics, School of Basic Medical Sciences
E-Mail: Tel. 6164 8279

Main topics
- Gene finding
- Motif finding

We have the human genome sequence... now what?
So, what is the problem? Well...
- We don't know how many genes there are!
- We don't know where they are!
- We don't know what they do!
- We don't know how they interact with each other!
- We don't know ...

The cellular machinery recognizes genes without access to GenBank, SwissProt or computers. Can we?

Needles Hiding in Genome Haystacks
- Genes are embedded in the genome sequence.
- Protein-coding regions constitute only 1-2% of the human genome.
- Can we distinguish the gene features from the background?

AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAG

GCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAG
GAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAA
CATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCT
CACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCC
TCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCT
AAGCTTTTTCTTATTCCCCCTTATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGA
GATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAA
TTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTC
CTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGC

TAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG

Can U spot the Gin?
Can U spot the Gene?

Needles Hiding in Human Genome Haystacks
Intron-exon structure of genes

- Large introns (average 3365 bp)
- Small exons (average 145 bp)
- Long genes (average 27 kb)
Cartegni, L. et al. (2002) "Listening to silence and understanding nonsense: exonic mutations that affect splicing", Nature Reviews Genetics 3(4): 285-; HMG p. 291-294

A challenge to automated annotation
- How widespread is it?
- Is it always functional?
- How does it evolve?

Human Genome by Numbers
- Presently estimated gene number: 24,000
- Average gene size: 27 kb
- Largest gene: dystrophin, 2.4 Mb, 0.6% coding, 16 hours to transcribe
- Shortest gene: tRNA-Tyr, 100% coding
- Largest exon: ApoB exon 26, 7.6 kb; smallest: 10 bp
- Average exon number: 9
- Largest exon number: titin, 363; smallest: 1
- Largest intron: WWOX intron 8, 800 kb; smallest: tens of bp
- Intronless genes: mitochondrial genes, many RNA genes, interferons, histones, ...
- Almost all (99.9%) nucleotide bases are exactly the same in all people.
- The functions are unknown for over 50% of discovered genes.

World's shortest intron-containing gene
ATGCCGTCTAGGTAA

Introduction
- Gene finding is about detecting coding regions and inferring gene structure.
- It is a tough job:
  - DNA sequence signals have low information content (degenerate and highly unspecific).
  - It is difficult to discriminate real signals from background noise.
  - Sequencing errors add further noise.

Gene finding strategies
- Ab initio method
  - Requires two types of information: compositional information and signal information
  - HMM or ANN (artificial neural network) methods
- Homology method
  - Gene structure can be deduced by homology
  - Requires a not-too-distant homologous sequence
  - Similarity to existing ESTs, genes, ...

ORF Scanning
- An ORF (open reading frame) usually starts with ATG and ends with TAA, TAG or TGA.
- Thus, the simplest gene prediction algorithm is to search for ORFs that begin with ATG and end with a stop codon (TAA, TAG or TGA): ORF scanning.
- Three frames are scanned in each of the two directions, six reading frames in total.
- ORF scanning works well for bacterial genomes: short intergenic sequences, few overlapping genes, etc.
- Eukaryotic genes contain introns, which complicates ORF scanning.
- ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html

Long vs. Short ORFs
- A long open reading frame may be a gene.
- At random, we should expect one stop codon about every 64/3 ≈ 21 codons, because 3 of the 64 codons are stop codons.
- However, genes are usually much longer than this.
- A basic approach is to scan for ORFs whose length exceeds a certain threshold.
- This is naive, because some genes (e.g. some neural and immune system genes) are relatively short.

Testing ORFs: Codon Usage
- Create a 64-element hash table and count the frequencies of codons in an ORF.
- Amino acids typically have more than one codon, but in nature certain codons are used more often than others.
- Uneven use of the codons may characterize a real gene.

Codon Usage in Human Genome

What is codon bias
Codon bias is the probability that a given codon will be used to code for an amino acid rather than a different codon that codes for the same amino acid.

Codon bias: Codon Usage in Mouse Genome
Source: Kazusa DNA Research Institute (http://www.kazusa.or.jp/codon/)

AA   Codon  /1000  Frac
Ser  TCG     4.31  0.05
Ser  TCA    11.44  0.14
Ser  TCT    15.70  0.19
Ser  TCC    17.92  0.22
Ser  AGT    12.25  0.15
Ser  AGC    19.54  0.24
Pro  CCG     6.33  0.11
Pro  CCA    17.10  0.28
Pro  CCT    18.31  0.30
Pro  CCC    18.42  0.31

AA   Codon  /1000  Frac
Leu  CTG    39.95  0.40
Leu  CTA     7.89  0.08
Leu  CTT    12.97  0.13
Leu  CTC    20.04  0.20
Ala  GCG     6.72  0.10
Ala  GCA    15.80  0.23
Ala  GCT    20.12  0.29
Ala  GCC    26.51  0.38
Gln  CAG    34.18  0.75
Gln  CAA    11.51  0.25

Organisms usually have their own preference for codon usage
Although the genetic code is universal, organisms usually have their own preference for codon usage.

Codon Usage and ORF
- An ORF is more "valid" than another if it has more "likely" codons.
- Do sliding-window calculations to find the ORF that has the best "likely" codon usage.
- This allows for higher precision in identifying true ORFs and is much better than merely testing for length.

Gene Finding for Prokaryotes
- Prokaryotic genomes have short intergenic regions and a simple gene structure, and lack introns.
- Several highly conserved sequence patterns are found in the promoter region and around the start sites of transcription and translation.
- Gene finding is therefore relatively easy.
- HMMs (Hidden Markov Models) are the most attractive methods for prokaryotic gene finding.
- Two microbial gene finding programs:
  - GeneMark.HMM: http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
  - Glimmer: http://bioinfo.hku.hk/glimmerintro.html and http://www.tigr.org/software/glimmer/
- High accuracy and reliable prediction

Eukaryotic Gene Structure
[Figure: a gene drawn 5' to 3' with promoter, UTR, exon 1, intron 1, exon 2, UTR and terminator; labelled steps: transcription, splice, poly A, translation, protein]

Genes and Signals

Splicing signal
- Most introns start with the sequence GU and end with the sequence AG (in the 5' to 3' direction). These are referred to as the splice donor and splice acceptor sites, respectively.
- However, the sequences at the two sites are not sufficient to signal the presence of an intron.
- Another important sequence is the branch site, located 20-50 bases upstream of the acceptor site. The consensus sequence of the branch site is "CU(A/G)A(C/U)", where A is conserved in all genes.
- In over 60% of cases, the exon sequence is (A/C)AG at the donor site and G at the acceptor site.

Eukaryotic Gene Prediction
- Prediction relies on the integration of several gene features:
  - promoters
  - translational start and stop codons (ORFs)
  - intron splice sites
  - alternative splicing
  - codon bias
  - CpG islands
  - GC content
- Methods and programs based on HMMs are well suited for gene prediction.

Difficulties in Eukaryotic Gene Finding
- Weak feature signals
- Low gene density and complex gene structure
- Alternative splicing
- Pseudogenes: sequences of genomic DNA so similar to normal genes that they are regarded as non-functional copies or close relatives of genes

Eukaryotic Gene Prediction Software
- GENSCAN (HMM; C. Burge) ... (Ueberbacher et al.) http://grail.lsd.ornl.gov/grailexp/
- MZEF (M. Zhang, 1997): http://rulai.cshl.org/tools/genefinder/
- ...

Hidden Markov Models (HMM)
- A statistical tool useful for modeling a sequence of events
- First widely applied to the analysis of speech patterns
- Can be used to study sequences of DNA or amino acids
- Examples: multiple alignment, identifying protein families (profile HMMs), gene finding, identifying transmembrane domains, predicting protein secondary structure, genetic linkage analysis, ...

Homology method of Gene Finding
- Similarity-based: for example, many human genes are similar to genes in mice, chickens, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.
- cDNA and EST sequences can be used to find genes with the homology method.

Expressed Sequence Tag
- EST: a short cDNA sequence prepared from mRNA extracted from cells under particular conditions or at specific developmental phases (e.g. Arabidopsis thaliana 2-week-old shoots or Valencia orange seeds).
- An EST can act as an identifier of a gene, but it does not cover the entire coding sequence of the gene.

Why EST sequencing?
- Systematic sampling of the transcribed portion of the genome (the "transcriptome")
- Provides "sequence tags" allowing unique identification of genes
- Provides experimental evidence for the positions of exons
- Provides regions coding for potentially new proteins
- Provides clones for DNA microarrays

dbEST
- GenBank has a separate section for EST sequences.
- ESTs are the most abundant entries in GenBank (60%).

dbEST summary (2006)
Organism                           EST entries
Homo sapiens (human)                 7,596,977
Mus musculus + domesticus (mouse)    4,690,536
Bos taurus (cattle)                    837,648
Rattus sp. (rat)                       812,662
Danio rerio (zebrafish)                689,613
Triticum aestivum (wheat)              600,205
Gallus gallus (chicken)                588,739
Sus scrofa (pig)                       536,842
Arabidopsis thaliana (thale cress)     421,027
Oryza sativa (rice)                    407,545
Drosophila melanogaster (fruit fly)    383,407
...
dbEST release 012006, Summary by Organism, January 20, 2006. Number of public entries: 32,889,225. Species: 1020.

dbEST Summary (2010)
Organism                                             EST entries
Homo sapiens (human)                                   8,300,249
Mus musculus + domesticus (mouse)                      4,852,146
Zea mays (maize)                                       2,018,798
Bos taurus (cattle)                                    1,558,493
Sus scrofa (pig)                                       1,538,636
Arabidopsis thaliana (thale cress)                     1,527,299
Danio rerio (zebrafish)                                1,481,930
Glycine max (soybean)                                  1,422,982
Xenopus (Silurana) tropicalis (western clawed frog)    1,271,375
Oryza sativa (rice)                                    1,249,110
Triticum aestivum (wheat)                              1,067,304
Rattus norvegicus + sp. (rat)                          1,009,820
Drosophila melanogaster (fruit fly)                      821,005
...
dbEST release 012910, Summary by Organism, January 29, 2010. Number of public EST entries: 64,727,557. Species: 1950.

Problems with raw EST databases
- The databases are highly redundant: e.g. 8.3 x 10^6 human sequences for 25,000 genes. Moreover, only about 60-80% of these 25,000 human genes are represented in dbEST (human).
- The error rates are high in individual ESTs.
- For most ESTs, there is no indication as to the gene from which they were derived.

Characteristics of ESTs
- Highly redundant
- Low sequence quality
- Cheap
- Reflect expressed genes
- May be tissue/stage specific

EST data is fragmented, but there is lots of it
- A database of all genes and/or all gene transcripts does not yet exist.
- The database of ESTs continues to grow rapidly.

DNA Motif Finding
- A DNA motif is a sequence pattern that occurs repeatedly in a group of related DNA sequences. A sequence motif has, or is conjectured to have, biological significance.
- Example: regulatory sequence motifs in the upstream regions of some co-regulated genes.
- More about sequence motifs: http://en.wikipedia.org/wiki/Sequence_motif

DNA Motif
Example: daf-19 binding sites in C. elegans (Peter Swoboda)
che-2    GTTGTCATGGTGAC
daf-19   GTTTCCATGGAAAC
osm-1    GCTACCATGGCAAC
osm-6    GTTACCATAGTAAC
F02D8.3  GTTTCCATGGTAAC
[Figure: the sites are drawn on an upstream axis running from -150 to -1]

Sequence Logo
daf-19 binding sites in C. elegans
http://weblogo.berkeley.edu/logo.cgi
Crooks GE, Hon G, Chandonia JM, Brenner SE (2004) WebLogo: a sequence logo generator. Genome Research 14:1188-1190.

Popular Motif Finding Programs
Computer tools for discovering motifs in a group of related DNA or protein sequences:
- MEME: http://meme.sdsc.edu/
- BioProspector (Liu Xiao-Le): http://bioprospector.stanford.edu/
- Motif Sampler: http://www.esat.kuleuven.ac.be/thijs/bbcDemo/MotifSampler.html
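
The five 14-bp daf-19 binding sites quoted on the DNA Motif slide can be summarized column by column, which is the kind of summary a sequence logo draws. The following is a minimal Python sketch, not the method used by WebLogo or by any of the programs listed above. The function names are hypothetical, the pairing of each site with a gene name assumes the slide lists genes and sites in the same order, and the information content uses the plain formula 2 + sum(p * log2(p)) with no small-sample correction.

```python
import math
from collections import Counter

BASES = "ACGT"

# Sites copied from the slide. Assumption: genes and sites appear in the same order.
SITES = {
    "che-2": "GTTGTCATGGTGAC",
    "daf-19": "GTTTCCATGGAAAC",
    "osm-1": "GCTACCATGGCAAC",
    "osm-6": "GTTACCATAGTAAC",
    "F02D8.3": "GTTTCCATGGTAAC",
}


def count_matrix(sites):
    """Return one Counter of base counts per alignment column."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(length)]


def consensus(matrix):
    """Most frequent base per column; ties go to the first base in ACGT order."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in matrix)


def information_content(col):
    """Column information in bits: 2 + sum(p * log2(p)) over observed bases."""
    n = sum(col.values())
    bits = 2.0
    for base in BASES:
        p = col[base] / n
        if p > 0:
            bits += p * math.log2(p)
    return bits


if __name__ == "__main__":
    matrix = count_matrix(list(SITES.values()))
    print("consensus:", consensus(matrix))
    for pos, col in enumerate(matrix, start=1):
        counts = " ".join(f"{b}={col[b]}" for b in BASES)
        print(f"{pos:2d}  {counts}  {information_content(col):.2f} bits")
```

Fully conserved columns score 2 bits and variable columns score less, which is what gives the tall and short letter stacks in a logo. With only five sites the estimates are rough, so the numbers are best read as an illustration.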

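
The ORF scanning and codon counting steps from the gene-finding part of the lecture (three frames in each of the two directions, a length threshold, and a 64-element codon table) can be sketched in a few lines of Python. This is an illustrative sketch under simple assumptions, not the algorithm of ORF Finder or any other tool named in the slides: all function names are hypothetical, the input is assumed to contain only A, C, G and T, an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon, and coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) for each ATG...stop ORF in one frame.

    end is exclusive and includes the stop codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def find_orfs(seq, min_codons=0):
    """Scan all six reading frames.

    Returns (strand, frame, start, end, orf_sequence) tuples for ORFs with at
    least min_codons codons, counting the stop codon. Coordinates on the "-"
    strand are positions in the reverse-complemented sequence.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if (end - start) // 3 >= min_codons:
                    found.append((strand, frame, start, end, s[start:end]))
    return found


def codon_counts(orf):
    """Count codons in an ORF: the hash table of up to 64 entries."""
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGGG"  # made-up sequence for illustration
    for strand, frame, start, end, orf in find_orfs(toy):
        print(strand, frame, start, end, orf, dict(codon_counts(orf)))
```

Raising `min_codons` well above the roughly 21 codons expected between random stop codons filters out most chance ORFs, at the cost of missing genuinely short genes. The codon counts could then be compared with a species codon usage table, such as the mouse table above, to rank the surviving ORFs.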