An Algorithm of Automatic Discovery of Protein Motif

CAI Zhen-hui(1,2), GE Xiao-fei(3), HU Lei(3), HUANG Xiao(4)
(1. Computer Department of Soochow University, Suzhou 215006, China;
2. Nantong Post Bureau of Jiangsu Province, Nantong 226000, China;
3. Computer School of Wuhan University, Wuhan 430072, China;
4. Nanjing Branch, Jiangsu Mobile Communication Co., Ltd., Nanjing 210003, China)

Abstract: With the development of molecular biology, motif identification has become a method of learning valuable information from biological sequences. This paper first introduces the definition of a motif and then discusses in detail the basis of MEME, the EM (expectation maximization) algorithm. Building on EM, it introduces the MEME algorithm and discusses basic issues such as its time complexity and performance, and it also explains the limitations of the algorithm and possible improvements. The results indicate that MEME is a good algorithm that can identify single or multiple motifs in protein or DNA sequences with great flexibility.

Key words: motif; algorithm; EM algorithm; MEME algorithm

CLC number: TP301.6    Document code: A    Article ID: 1005-3751(2004)10-0123-04
Received: 2004-02-26
About the author: born 1971 [...]

0 Introduction
With the development of molecular biology, motif identification has become a method of learning valuable information from biological sequences. [...] Several methods for motif discovery already exist, for example the HMMER, MEME and META-MEME methods [...].
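Microcomputer Development, Vol. 14 No. 10, Oct. 2004

The rest of this paper introduces MEME, a method for the automatic discovery of protein motifs.

1 Introduction to MEME (Multiple EM for Motif Elicitation)

1.1 Motifs
A motif is a pattern that occurs in a group of related protein or DNA sequences [...]. A motif is commonly written as a pattern with the following notation: X(P,Q) stands for between P and Q arbitrary amino acids; [PQ] stands for either amino acid P or amino acid Q; adjacent elements are separated by "-". For example, C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H is a motif that reads as follows: a C, followed by 2 to 4 arbitrary amino acids, then a C, then 3 arbitrary amino acids, then one of L, I, V, M, F, Y, W or C, then 8 arbitrary amino acids, and finally an H.

1.2 About MEME
MEME is a tool for discovering motifs in a group of related protein or DNA sequences. It was developed by Timothy L. Bailey and Charles Elkan [4] and consists of two parts, MEME and MAST. MEME discovers motifs, while MAST uses the motifs found by MEME to search sequence databases. [...]

2 The MEME Algorithm

2.1 The EM Algorithm
The expectation maximization (EM) algorithm consists of two steps, the E-step and the M-step. EM is iterative: in each iteration the E-step computes an expectation and the M-step maximizes it [1]. In the motif-discovery problem the dataset is X = (X_1, X_2, ..., X_n) and the missing data are Z_ij^(t) (at iteration t, the probability that the motif occurs in sequence X_i at position j). The likelihood L(θ, λ | X, Z) is maximized to obtain the parameters for iteration t+1 from θ^(t) and λ^(t), and the E-step and M-step are repeated until convergence. EM can only guarantee convergence to a local maximum of the likelihood, so the result depends on the choice of starting point [...].
The MEME algorithm is built on EM and extends it in the following ways: the choice of starting points; allowing a motif to occur several times in a sequence rather than exactly once; and finding more than one motif in a dataset.

2.2 Algorithm Description
Input:
    a set of sequences (the dataset) [...]
    PASSES (the number of motifs to search for)
    NITER (the number of EM iterations)
    MAXP ([...] number of motif occurrences in the dataset)
Algorithm:
    for motif = 1 to PASSES
        for each subsequence in the dataset
            derive a starting point from the subsequence and run EM for NITER iterations
        choose the starting point θ^(0) [...]
        run EM from θ^(0) until convergence
        output the motif
        erase the occurrences of the motif from the dataset
In this algorithm MEME selects the starting points itself. Starting points are derived from the subsequences of the dataset, and the one most likely to lead to the maximum likelihood is kept [...].

2.3 Time Complexity
[...] The time complexity of the MEME algorithm is O(n^2), where n is the number of characters in the dataset.

2.4 Performance Experiment
To test its performance, MEME is used to discover the motifs of a protein family, and MAST is then used to search a database with the discovered motifs. [...] The data flow of the experiment is shown in Fig. 1.

Fig. 1 Data flow in the experiment

2.5 Dataset
The training set consists of 30 short-chain alcohol dehydrogenase sequences [...], taken from Prosite and Swissprot [4]. The test dataset is summarized in Table 1.

Table 1 Test dataset
Database     Sequences    Positive    Negative    Training
Swissprot    90180        254         89926       30

2.6 Evaluation of the MEME Algorithm
Let S be a sequence, S+ the set of sequences that belong to the family, and S- the set of sequences that do not. Let f be the classifier: for a given motif, if sequence S matches the motif then f(S) = true, otherwise f(S) = false. The following quantities are defined:
(1) fp (false positives): S ∈ S-, f(S) = true;
(2) fn (false negatives): S ∈ S+, f(S) = false;
(3) tp (true positives): S ∈ S+, f(S) = true;
(4) tn (true negatives): S ∈ S-, f(S) = false.
The performance of MEME is measured by precision = tp/(tp+fp) and recall = tp/(tp+fn) [5].

3 Limitations and Improvements

3.1 Limitations
As described above, MEME is built on EM, so the limitations of EM apply to it as well. [...]

3.2 Improvements
Besides MEME there are related tools such as META-MEME and HMMER, and MEME can be used in conjunction with them [...].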
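As a concrete illustration of the pattern notation in Section 1.1 and the evaluation measures in Section 2.6, the following is a minimal Python sketch rather than part of MEME or MAST. It assumes that the example pattern is written with its residue class in square brackets, translates it into a regular expression, uses a regular-expression match as the classifier f(S), and counts tp, fp, fn and tn. The function names (`prosite_to_regex`, `make_classifier`, `evaluate`) and the toy sequences are hypothetical and chosen only for this example; only a small subset of the pattern notation is handled.

```python
import re

# One pattern element: wildcard X, a single residue, or a bracketed class,
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"^([xX]|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate a (subset of a) PROSITE-style pattern into a regex string."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.match(element.strip())
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        symbol, low, high = match.groups()
        residue = "." if symbol in ("x", "X") else symbol
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(residue + repeat)
    return "".join(parts)


def make_classifier(pattern):
    """Return f(S): True if the pattern occurs anywhere in sequence S."""
    regex = re.compile(prosite_to_regex(pattern))
    return lambda sequence: regex.search(sequence) is not None


def evaluate(f, positives, negatives):
    """Count tp, fn, fp, tn and derive precision and recall."""
    tp = sum(1 for s in positives if f(s))
    fn = len(positives) - tp
    fp = sum(1 for s in negatives if f(s))
    tn = len(negatives) - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn,
            "precision": precision, "recall": recall}


if __name__ == "__main__":
    f = make_classifier("C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H")
    # Toy data, invented for illustration only.
    positives = ["CAAC" + "AAA" + "L" + "A" * 8 + "H",
                 "GG" + "CDEFGC" + "KKK" + "F" + "G" * 8 + "H" + "TT",
                 "CAC" + "AAA" + "L" + "A" * 8 + "H"]  # spacing too short
    negatives = ["MKTAYIAKQR",
                 "CTTC" + "PPP" + "W" + "Q" * 8 + "H",  # matches by chance
                 "HHHHCCCC"]
    print(evaluate(f, positives, negatives))
```

A regular expression gives a hard yes/no decision, which matches the definition of f(S) above; MAST itself scores sequences against the motifs produced by MEME, so this sketch only mirrors the bookkeeping of the evaluation, not the search method.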
[...] The comparison covers MEME, MEME-MAST and HMMER [...].

4 Conclusion
[...] MEME is a flexible motif-discovery algorithm that can discover motifs in protein and DNA sequences [...]. The algorithm still has some limitations, however, and needs further improvement.

References:
[1] Bailey T L, Elkan C. Unsupervised Learning of Multiple Motifs in Biopolymers using Expectation Maximization [R]. UCSD Technical Report CS93-302, University of California at San Diego, 1993.
[2] Bailey T L, Elkan C P. Fitting a mixture model by expectation maximization to discover motifs in biopolymers [A]. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology [C]. Menlo Park, California: AAAI Press, 1994. 28-36.
[3] Bailey T L, Elkan C P. The value of prior knowledge in discovering motifs with MEME [A]. Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology [C]. Menlo Park, California: AAAI Press, 1995. 21-29.
[4] Bailey T L, Baker M E, Elkan C P. An artificial intelligence approach to motif discovery in protein sequences: application to steroid dehydrogenases [J]. J Steroid Biochem Mol Biol, 1997, 62(1): 29-44.
[5] Durbin R, Eddy S, Krogh A, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids [M]. [...], 2000.