Introduction to Bioinformatics (Lecture 5)
Li Jinming, Professor, Department of Bioinformatics, School of Basic Medical Sciences
E-Mail: Tel. 6164 8279

Main Topics
Gene Finding
Motif Finding

We Have the Human Genome Sequence... Now What?
So, what is the problem? Well...
We don't know how many genes there are!
We don't know where they are!
We don't know what they do!
We don't know how they interact with each other!
We don't know ...
The cellular machinery recognizes genes without access to GenBank, SwissProt or computers. Can we?

Needles Hiding in Genome Haystacks
Genes are embedded in the genome sequence.
Protein-coding regions constitute only 1-2% of the human genome.
Can we distinguish the gene features
from the background?

AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAG
GCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAG
GAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAA
CATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCT
CACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCC
TCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCT
AAGCTTTTTCTTATTCCCCCTTATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGA
GATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAA
TTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTC
CTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGC
TAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG

Can you spot the gene?

Needles Hiding in Human Genome Haystacks
Intron-exon structure of genes
Large introns (average 3365 bp)
Small exons (average 145 bp)
Long genes (average 27 kb)
Cartegni, L. et al. (2002) "Listening to silence and understanding nonsense: exonic mutations that affect splicing", Nature Reviews Genetics 3(4): 285-; HMG p291-294

A challenge to automated annotation.
How widespread is it? Is it always functional? How does it evolve?

Human Genome by Numbers
Presently estimated gene number: 24,000
Average gene size: 27 kb
Largest gene: Dystrophin, 2.4 Mb, 0.6% coding, 16 hours to transcribe
Shortest gene: tRNA-Tyr, 100% coding
Largest exon: ApoB exon 26, 7.6 kb; smallest: 10 bp
Average exon number: 9
Largest exon number: Titin, 363; smallest: 1
Largest intron: WWOX intron 8, 800 kb; smallest: tens of bp
Intronless genes: mitochondrial genes, many RNA genes, interferons, histones
Almost all (99.9%) nucleotide bases are exactly the same in all people.
The functions are unknown for over 50% of discovered genes.

World's shortest intron-containing gene
ATGCCGTCTAGGTAA

Introduction
Gene finding is about detecting coding regions and inferring gene structure.
It is a tough job:
DNA sequence signals have low information content (degenerate and highly unspecific).
It is difficult to discriminate real signals from background noise.
Sequencing errors add further noise.

Gene finding strategies
Ab initio method
Requires two types of information: compositional information and signal information.
HMM or ANN (Artificial Neural Network) methods.
Homology method
Gene structure can be deduced by homology.
Requires a not too distant homologous sequence.
Similarity to existing ESTs, genes, ...

ORF Scanning
An ORF (Open Reading Frame) usually starts with ATG and ends with TAA, TAG or TGA.
Thus, the simplest gene prediction algorithm, ORF scanning, is to search for ORFs that begin with an ATG and end with one of these Txx stop codons.
There are three reading frames in each of the two directions.
ORF scanning works well for bacterial genomes: short intergenic sequences, no overlapping genes, etc.
Eukaryotic genes contain introns, which complicates ORF scanning.
ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html

Long vs. Short ORFs
A long open reading frame may be a gene.
At random, we should expect one stop codon every 64/3 ≈ 21 codons.
However, genes are usually much longer than this.
A basic approach is to scan for ORFs whose length exceeds a certain threshold.
This is naive because some genes (e.g. some neural and immune system genes) are relatively short.
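The ORF scan with a length threshold can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used by ORF Finder: the function names, the `min_codons` parameter and its default of 100 are assumptions chosen for the example. Coordinates reported for the "-" strand refer to the reverse-complemented sequence.

```python
# Illustrative six-frame ORF scan (hypothetical helper names, not from the lecture).
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for ATG...stop ORFs in one frame (0-based, end exclusive)."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first ATG opens the ORF
        elif start is not None and codon in STOP_CODONS:
            if (i + 3 - start) // 3 >= min_codons:   # length counts the stop codon
                yield start, i + 3
            start = None                   # look for the next ATG


def find_orfs(seq, min_codons=100):
    """Scan three frames on each strand; return (strand, frame, start, end, orf)."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end, s[start:end]))
    return results
```

Because a random sequence hits a stop codon about once every 21 codons, a threshold well above that (here 100 codons by default) filters most chance ORFs, at the cost of missing genuinely short genes.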
Testing ORFs: Codon Usage
Create a 64-element hash table and count the frequencies of codons in an ORF.
Amino acids typically have more than one codon, but in nature certain codons are used more often.
Uneven use of the codons may characterize a real gene.

Codon Usage in Human Genome

What is codon bias?
Codon bias is the probability that a given codon will be used to code for an amino acid rather than a different codon that codes for the same amino acid.

Codon bias: Codon Usage in Mouse Genome
Source: Kazusa DNA Research Institute (http://www.kazusa.or.jp/codon/)

AA   codon  /1000  frac
Ser  TCG     4.31  0.05
Ser  TCA    11.44  0.14
Ser  TCT    15.70  0.19
Ser  TCC    17.92  0.22
Ser  AGT    12.25  0.15
Ser  AGC    19.54  0.24
Pro  CCG     6.33  0.11
Pro  CCA    17.10  0.28
Pro  CCT    18.31  0.30
Pro  CCC    18.42  0.31

AA   codon  /1000  frac
Leu  CTG    39.95  0.40
Leu  CTA     7.89  0.08
Leu  CTT    12.97  0.13
Leu  CTC    20.04  0.20
Ala  GCG     6.72  0.10
Ala  GCA    15.80  0.23
Ala  GCT    20.12  0.29
Ala  GCC    26.51  0.38
Gln  CAG    34.18  0.75
Gln  CAA    11.51  0.25

Organisms usually have their own preference for codon usage
Although the genetic code is universal, organisms usually have their own preference for codon usage.

Codon Usage and ORF
An ORF is more “valid” than another if it has more “likely” codons.
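One way to make "more likely codons" concrete is to count the codons of an ORF in a hash table and average the log of each codon's usage fraction. The sketch below uses the "frac" values from the mouse table in this lecture (only the codons listed there); the function names and the mean-log-fraction score are assumptions made for illustration, not a scoring scheme given in the lecture.

```python
import math
from collections import Counter

# "frac" column of the mouse codon table above (a subset of the 64 codons).
MOUSE_FRAC = {
    "TCG": 0.05, "TCA": 0.14, "TCT": 0.19, "TCC": 0.22, "AGT": 0.15, "AGC": 0.24,  # Ser
    "CCG": 0.11, "CCA": 0.28, "CCT": 0.30, "CCC": 0.31,                            # Pro
    "CTG": 0.40, "CTA": 0.08, "CTT": 0.13, "CTC": 0.20,                            # Leu (partial)
    "GCG": 0.10, "GCA": 0.23, "GCT": 0.29, "GCC": 0.38,                            # Ala
    "CAG": 0.75, "CAA": 0.25,                                                      # Gln
}


def count_codons(orf):
    """Count in-frame codons of an ORF (a hash table with at most 64 keys)."""
    orf = orf.upper()
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))


def codon_score(orf, frac=MOUSE_FRAC):
    """Mean log(frac) over the ORF's codons that are present in `frac`.

    Higher (closer to 0) means the ORF uses more of the preferred codons.
    Returns None if none of the ORF's codons are in the table.
    """
    known = {c: n for c, n in count_codons(orf).items() if c in frac}
    total = sum(known.values())
    if total == 0:
        return None
    return sum(n * math.log(frac[c]) for c, n in known.items()) / total
```

For example, an ORF built from CAG (frac 0.75) scores higher than one built from CAA (frac 0.25), although both encode glutamine.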
Do sliding-window calculations to find the ORF that has the best “likely” codon usage.
This allows for higher precision in identifying true ORFs; it is much better than merely testing for length.

Gene Finding for Prokaryotes
Prokaryotes have short intergenic regions and simple gene structure, and lack introns.
Several highly conserved sequence patterns are found in the promoter region and around the start sites of transcription and translation.
Relatively easy.
HMMs (Hidden Markov Models) are the most attractive methods for prokaryotic gene finding.
Two microbial gene finding programs:
GeneMark.HMM http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
Glimmer http://bioinfo.hku.hk/glimmerintro.html and http://www.tigr.org/software/glimmer/
High accuracy and reliable prediction.

Eukaryotic Gene Structure
[Figure: 5' - Promoter, Exon1, Intron1, Exon2, Terminator - 3'; labels: UTR, splice, splice, UTR; transcription; Poly A; translation; protein]

Genes and Signals

Splicing signal
Most introns start with the sequence GU and end with the sequence AG (in the 5' to 3' direction). These are referred to as the splice donor and splice acceptor sites, respectively. However, the sequences at the two sites are not sufficient to signal the presence of an intron. Another important sequence is the branch site, located 20-50 bases upstream of the acceptor site. The consensus sequence of the branch site is “CU(A/G)A(C/U)”, where the A is conserved in all genes. In over 60% of cases, the exon sequence is (A/C)AG at the donor site, and G at the acceptor site.
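The three splicing signals can be combined into a simple candidate-intron filter. The following Python sketch works on DNA (GT/AG instead of GU/AG) and reports every GT...AG span that contains a branch-site match whose conserved A lies 20-50 bases upstream of the acceptor AG. The function name, the `min_len`/`max_len` limits, and the choice to measure the distance from the conserved A are assumptions for illustration; the flanking exon preferences, (A/C)AG and G, are not checked.

```python
import re

# DNA form of the branch-site consensus CU(A/G)A(C/U); lookahead allows overlapping hits.
BRANCH_SITE = re.compile(r"(?=CT[AG]A[CT])")


def candidate_introns(seq, min_len=60, max_len=10000):
    """Return (start, end) spans (0-based, end exclusive) that look like introns.

    A span qualifies if it starts with GT, ends with AG, has a length within
    [min_len, max_len], and contains a branch-site A located 20-50 bases
    upstream of the acceptor AG.
    """
    seq = seq.upper()
    donors = [m.start() for m in re.finditer(r"(?=GT)", seq)]
    acceptors = [m.start() for m in re.finditer(r"(?=AG)", seq)]
    # position of the conserved A is the 4th base of each branch-site match
    branch_as = [m.start() + 3 for m in BRANCH_SITE.finditer(seq)]

    hits = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d
            if length < min_len or length > max_len:
                continue
            if any(b >= d + 2 and 20 <= a - b <= 50 for b in branch_as):
                hits.append((d, a + 2))
    return hits
```

Even with all three signals required, such a filter returns many false candidates on real genomic DNA, which is why gene finders combine these signals with compositional evidence.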
Eukaryotic Gene Prediction
Prediction relies on the integration of several gene features: promoters, translational start and stop codons (ORFs), intron splice sites, alternative splicing, codon bias, CpG islands, GC content.
Methods and programs based on HMMs are well suited for gene prediction.

Difficulties in Eukaryotic Gene Finding
Weak feature signals.
Low gene density and complex gene structure.
Alternative splicing.
Pseudogenes: sequences of genomic DNA with such similarity to normal genes that they are regarded as non-functional copies or close relatives of genes.

Eukaryotic Gene Prediction Software
GENSCAN (HMM; C. Burge) ...
GRAIL (Ueberbacher et al.) http://grail.lsd.ornl.gov/grailexp/
MZEF (M. Zhang, 1997) http://rulai.cshl.org/tools/genefinder/
...

Hidden Markov Models (HMM)
A statistical tool useful for modeling a sequence of events.
Originally developed to analyze speech patterns.
Can be used to study sequences of DNA or amino acids.
Examples: multiple alignment; identifying protein families (profile HMMs); gene finding; identifying transmembrane domains; predicting protein secondary structure; genetic linkage analysis; ...

Homology method of Gene Finding
Similarity-based: for example, many human genes are similar to genes in mice, chickens, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.
cDNA and EST sequences can be used to find genes with the homology method.

Expressed Sequence Tag
EST: a short cDNA sequence prepared from mRNA extracted from a cell under particular conditions or in specific developmental phases (e.g. Arabidopsis thaliana 2-week-old shoots or Valencia orange seeds).
An EST can act as an identifier of a gene, but it does not cover the entire coding sequence of the gene.

Why EST sequencing?
Systematic sampling of the transcribed portion of the genome (“transcriptome”).
Provides “sequence tags” allowing unique identification of genes.
Provides experimental evidence for the positions of exons.
Provides regions coding for potentially new proteins.
Provides clones for DNA microarrays.

dbEST
GenBank has a separate section for EST sequences.
ESTs are the most abundant entries in GenBank (60%).

dbEST summary (2006)
dbEST release 012006, Summary by Organism, January 20, 2006
Number of public entries: 32,889,225
Species: 1020

Homo sapiens (human)                    7,596,977
Mus musculus + domesticus (mouse)       4,690,536
Bos taurus (cattle)                       837,648
Rattus sp. (rat)                          812,662
Danio rerio (zebrafish)                   689,613
Triticum aestivum (wheat)                 600,205
Gallus gallus (chicken)                   588,739
Sus scrofa (pig)                          536,842
Arabidopsis thaliana (thale cress)        421,027
Oryza sativa (rice)                       407,545
Drosophila melanogaster (fruit fly)       383,407
...

dbEST Summary (2010)
dbEST release 012910, Summary by Organism, January 29, 2010
Number of public EST entries: 64,727,557
Species: 1950

Homo sapiens (human)                                  8,300,249
Mus musculus + domesticus (mouse)                     4,852,146
Zea mays (maize)                                      2,018,798
Bos taurus (cattle)                                   1,558,493
Sus scrofa (pig)                                      1,538,636
Arabidopsis thaliana (thale cress)                    1,527,299
Danio rerio (zebrafish)                               1,481,930
Glycine max (soybean)                                 1,422,982
Xenopus (Silurana) tropicalis (western clawed frog)   1,271,375
Oryza sativa (rice)                                   1,249,110
Triticum aestivum (wheat)                             1,067,304
Rattus norvegicus + sp. (rat)                         1,009,820
Drosophila melanogaster (fruit fly)                     821,005
...

Problems with raw EST databases
The databases are highly redundant: e.g. 8.3 x 10^6 human sequences for 25,000 genes. Moreover, only about 60-80% of these 25,000 human genes are represented in dbEST (human).
The error rates are high in individual ESTs.
For most ESTs, there is no indication as to the gene from which they were derived.

Characteristics of ESTs
Highly redundant.
Low sequence quality.
Cheap.
Reflect expressed genes.
May be tissue- or stage-specific.

EST Data is Fragmented, but there is lots of it
A database of all genes and/or all gene transcripts does not yet exist.
The database of ESTs continues to grow rapidly.
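The redundancy figures quoted for the human EST set can be turned into a quick back-of-the-envelope calculation. The sketch below only rearranges the numbers given in the lecture (8.3 x 10^6 human ESTs, 25,000 genes, 60-80% of genes represented); the function name and return layout are assumptions for illustration.

```python
def est_redundancy(n_ests=8_300_000, n_genes=25_000, coverage_pct=(60, 80)):
    """Average ESTs per gene, overall and per represented gene.

    Returns (overall, {pct: (represented_genes, ests_per_represented_gene)}).
    """
    overall = n_ests / n_genes
    per_represented = {}
    for pct in coverage_pct:
        represented = n_genes * pct // 100      # genes with at least one EST
        per_represented[pct] = (represented, n_ests / represented)
    return overall, per_represented


overall, per_represented = est_redundancy()
print(f"average ESTs per gene: {overall:.0f}")
for pct, (genes, ratio) in per_represented.items():
    print(f"{pct}% coverage: {genes} genes, about {ratio:.0f} ESTs per represented gene")
```

On these numbers each gene is covered by several hundred ESTs on average, which is why raw EST data are usually clustered before use.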
DNA Motif Finding
A DNA motif is a sequence pattern that occurs repeatedly in a group of related DNA sequences. A sequence motif has, or is conjectured to have, biological significance.
Example: regulatory sequence motifs in the upstream region of some co-regulated genes.
More about sequence motifs: http://en.wikipedia.org/wiki/Sequence_motif

DNA Motif
Motif example: daf-19 binding sites in C. elegans (Peter Swoboda)

GTTGTCATGGTGAC  che-2
GTTTCCATGGAAAC  daf-19
GCTACCATGGCAAC  osm-1
GTTACCATAGTAAC  osm-6
GTTTCCATGGTAAC  F02D8.3

[Figure: positions of the sites within the upstream region, axis from -150 to -1]

Sequence Logo
daf-19 binding sites in C. elegans
http://weblogo.berkeley.edu/logo.cgi
Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: A sequence logo generator. Genome Research 14:1188-1190 (2004)

Popular Motif Finding Programs
Computer tools for discovering motifs in a group of related DNA or protein sequences:
MEME http://meme.sdsc.edu/
BioProspector http://bioprospector.stanford.edu/ (Liu Xiao-Le)
Motif Sampler http://www.esat.kuleuven.ac.be/thijs/bbcDemo/MotifSampler.html
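To see what a sequence logo summarizes, the five daf-19 binding sites listed in this lecture can be turned into per-position base counts, a consensus string, and an information content in bits. This is a minimal sketch under stated assumptions: the function names are hypothetical, ties in the consensus are broken alphabetically, and no small-sample correction is applied. It describes already-aligned sites; it does not perform de novo motif discovery as MEME, BioProspector or Motif Sampler do.

```python
import math

# The five 14-bp daf-19 binding sites shown in the lecture.
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]


def column_counts(sites):
    """Return one {base: count} dict per alignment column."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [{b: sum(s[i] == b for s in sites) for b in "ACGT"} for i in range(length)]


def consensus(sites):
    """Most frequent base per column; max() keeps the first maximum in 'ACGT'."""
    return "".join(max("ACGT", key=col.get) for col in column_counts(sites))


def information_content(sites):
    """Bits per column: 2 minus the Shannon entropy of the observed base frequencies."""
    n = len(sites)
    bits = []
    for col in column_counts(sites):
        entropy = -sum((c / n) * math.log2(c / n) for c in col.values() if c)
        bits.append(2.0 - entropy)
    return bits


print(consensus(SITES))
print([round(b, 2) for b in information_content(SITES)])
```

Fully conserved columns reach the maximum of 2 bits, while variable columns score lower; these per-column values correspond to the stack heights drawn in a sequence logo.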