Essential Tutorial on the Protein Structure Analysis Workflow


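Experimental data

Many kinds of experimental data can aid the structure prediction process, including:

o Disulphide bonds, which fix the spatial positions of cysteines
o Spectroscopic data, which can give the secondary structure content of the protein
o Site-directed mutagenesis studies, which can reveal residues in active or binding sites
o Protease cleavage sites and post-translational modifications such as phosphorylation or glycosylation, which indicate that the residues must be exposed
o Others

Keep all of the available data in mind when making predictions. Always ask: is the prediction consistent with the experimental results? If not, the approach needs to be revised.

Protein sequence data

Some preliminary analysis of the protein sequence is worthwhile. For example, if the protein comes directly from a gene prediction, it may contain multiple domains. More seriously, it may contain regions that are unlikely to be globular or soluble. This flowchart assumes that your protein is soluble, is probably a single domain, and contains no non-globular regions. Consider the following:

Is it a transmembrane protein, or does it contain transmembrane segments? There are many methods for predicting these segments, including:

o TMAP (EMBL)
o PredictProtein (EMBL/Columbia)
o TMHMM (CBS, Denmark)
o TMpred (Baylor College)
o DAS (Stockholm)

Does it contain coiled-coils? You can predict coiled coils at the COILS server or download the COILS program (recently rewritten; note that the GCG package includes a version of COILS).

Does the protein contain low-complexity regions? Proteins frequently contain runs of poly-glutamine or poly-serine, which do not predict well. You can check for these with SEG (a version of SEG is included in the GCG package).

If any of the above applies, you should break the sequence into fragments, or ignore particular regions of the sequence, and so on. This problem is closely related to that of locating domains.

Searching sequence databases

The obvious first step in analysing any new sequence is to search sequence databases for homologues. Such searches can be run anywhere and on almost any computer. In addition, many web servers perform these searches: you type or paste a sequence into the server and receive the results interactively.

There are many methods for sequence searching; by far the best known is the BLAST program. Versions that run locally are easy to obtain (from NCBI or Washington University), and many web pages let you compare a protein or DNA sequence against databases of gene or protein sequences. To name just a few:

o National Center for Biotechnology Information (USA) Searches
o European Bioinformatics Institute (UK) Searches
o BLAST search through SBASE (domain database; ICGEB, Trieste)
o Many more sites

An important recent advance in sequence comparison is the development of gapped BLAST and PSI-BLAST (position-specific iterated BLAST). Both make BLAST more sensitive. PSI-BLAST builds a profile from the results of a search and then uses that profile to search the database again for further homologues (the process can be repeated until no new sequences are found), which lets it detect very distant homologues. Before using the methods in the sections below, it is very important to compare your protein sequence against the database with PSI-BLAST to see whether a known structure exists.

Other methods for comparing a single sequence to a database include:

o The FASTA package (William Pearson, University of Virginia, USA)
o SCANPS (Geoff Barton, European Bioinformatics Institute, UK)
o BLITZ (Compugen's fast Smith-Waterman search)
o Other methods

It is also possible to use multiple sequence information to perform more sensitive searches. This involves building a profile from some kind of multiple sequence alignment. A profile gives a score for each type of amino acid at each position in the sequence, and generally makes searches more sensitive. Tools for doing this include:

o PSI-BLAST (NCBI, Washington)
o ProfileScan Server (ISREC, Geneva)
o HMMER, hidden Markov models (Sean Eddy, Washington University)
o Wise package (Ewan Birney, Sanger Centre; for comparing protein against DNA)
o Other methods

A different approach for incorporating multiple sequence information into a database search is to use a MOTIF. Instead of giving every amino acid some kind of score at every position in an alignment, a motif ignores all but the most invariant positions in an alignment, and just describes the key residues that are conserved and define the family. Sometimes this is called a "signature". For example, "H-[FW]-x-[LIVM]-x-G-x(5)-H-x(3)-" describes a family of DNA binding proteins. It can be translated as "histidine, followed by either a phenylalanine or tryptophan, followed by any amino acid (x), followed by leucine, isoleucine, valine or methionine, followed by any amino acid (x), followed by glycine, ...".

PROSITE (ExPASy, Geneva) contains a huge number of such patterns, and several sites allow you to search these data:

o ExPASy
o EBI

It is best to search a few different databases in order to find as many homologues as possible.

A very important thing to do, and one which is sometimes overlooked, is to compare any new sequence to a database of sequences for which 3D structure information is available. Whether or not your sequence is homologous to a protein of known 3D structure is not obvious in the output from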
many searches of large sequence databases. Moreover, if the homology is weak, the similarity may not be apparent at all during the search through a larger database.

One last thing to remember is that you can save a lot of time by using pre-prepared protein alignments. Many of these alignments are hand-edited by experts on the particular protein families, and thus probably represent the best alignment obtainable from the data they contain (though they are not always as up to date as the most recent sequence databases). These databases include:

o SMART (Oxford/EMBL)
o PFAM (Sanger Centre/Wash-U/Karolinska Institutet)
o COGS (NCBI)
o PRINTS (UCL/Manchester)
o BLOCKS (Fred Hutchinson Cancer Research Centre, Seattle)
o SBASE (ICGEB, Trieste)

There are generally many ways to compare a protein sequence against these databases, and they are very useful for identifying domains.

Locating domains

If you have a sequence of more than about 500 amino acids, you can be nearly certain that it will be divided into discrete functional domains. If possible, it is preferable to split such large proteins up and consider each domain separately. You can predict the location of domains in a few different ways. The methods below are given (approximately) from most to least confident.

If homology to other
sequences occurs only over a portion of the probe sequence and the other sequences are whole (i.e. not partial sequences), then this provides the strongest evidence for domain structure. You can either do database searches yourself or make use of well-curated, pre-defined databases of protein domains. Searches of these databases (see links below) will often assign domains easily.

o SMART (Oxford/EMBL)
o PFAM (Sanger Centre/Wash-U/Karolinska Institutet)
o COGS (NCBI)
o PRINTS (UCL/Manchester)
o BLOCKS (Fred Hutchinson Cancer Research Centre, Seattle)
o SBASE (ICGEB, Trieste)

You can also find domain descriptions in the annotations in SWISSPROT.

Regions of low complexity often separate domains in multidomain proteins. Long stretches of repeated residues, particularly proline, glutamine, serine or threonine, often indicate linker sequences and are usually a good place to split proteins into domains. Low-complexity regions can be defined using the program SEG, which is available in most BLAST distributions or web servers (a version of SEG is also contained within the GCG suite of programs).

Transmembrane segments are also very good dividing points, since they can easily separate
extracellular from intracellular domains. There are many methods for predicting these segments, including:

o TMAP (EMBL)
o PredictProtein (EMBL/Columbia)
o TMHMM (CBS, Denmark)
o TMpred (Baylor College)
o DAS (Stockholm)

Something else to consider is the presence of coiled-coils. These unusual structural features sometimes (but not always) indicate where proteins can be divided into domains. You can predict coiled coils at the COILS server or you can download the COILS program (recently re-written by me of all people; a version of COILS is also contained within the GCG suite of programs).

Secondary structure prediction methods (see below) will often predict regions of proteins to have different protein structural classes. For example, one region of sequence may be predicted to contain only alpha helices and another to contain only beta sheets. These can often, though not always, suggest likely domain structure (e.g. an all-alpha domain and an all-beta domain).

If you have separated a sequence into domains, then it is very important to repeat all the database searches and alignments using the domains separately. Searches with sequences containing several domains may not find all sub-homologies, particularly if the domains are abundant in the database (e.g. kinases, SH2 domains, etc.). There may also be "hidden" domains. For example, if there is a stretch of 80 amino acids with few homologues nested in between a kinase and an SH2 domain, then you may miss matches to it when searching the whole sequence against a database.

Multiple sequence alignment

Regardless of the outcome of your searches, you will want a multiple sequence alignment containing your sequence and all the homologues you have found above. Some sites for performing multiple alignment:

o EBI (UK) Clustalw Server
o IBCP (France) Multalin Server
o IBCP (France) Clustalw Server
o IBCP (France) Combined Multalin/Clustalw
o MSA (USA) Server
o BCM Multiple Sequence Alignment ClustalW Server (USA)

If you are going to do a lot of alignments, then it is probably best to get your own copy of one of the many programs. FTP sites for some of these are:

o HMMer (HMM method, Wash U)
o SAM (HMM method, Santa Cruz)
o ClustalW (EBI, UK)
o ClustalW (USA)
o MSA (USA)
o AMPS (UK)

Note that PileUp is contained within the GCG commercial package. Most institutions with people doing this sort of work will have access to this software, so ask around if you want to use it.

Probably the most important advance since these pages first appeared is the use of Hidden Markov Models for sequence alignment. Several methods are listed above.

Alignments can provide:

o Information as to protein domain structure
o The location of residues likely to be involved in protein function
o Information on residues likely to be buried in the protein core or exposed to solvent
o More information than a single sequence for applications like homology modelling and secondary structure prediction

Some tips

Don't just take everything found in the searches and feed it directly into the alignment program. Searches will almost always return matches that do not indicate a significant sequence similarity. Look through the output carefully and throw things out if they don't appear to be members of the sequence family. Inclusion
of non-members in your alignment will confuse things and likely lead to errors later.

Remember that the programs for aligning sequences aren't perfect, and do not always provide the best alignment. This is particularly so for large families of proteins with low sequence identities. If you can see a better way of aligning the sequences, then by all means edit the alignment.

Comparative or homology modelling

If your protein sequence shows significant similarity to another protein of known three-dimensional structure, then a fairly accurate 3D structure of your protein can be obtained by homology modelling. It is also possible to build models if you have found a suitable fold via fold recognition and are happy with the alignment of sequence to structure (
Note that the accuracy of models constructed in this manner has not been assessed properly, so treat with caution).

It is now possible to generate models automatically using the very useful SWISSMODEL server. Some other sites useful for homology modelling include:

o WHAT IF (G. Vriend, EMBL, Heidelberg)
o MODELLER (A. Sali, Rockefeller University)
o MODELLER Mirror FTP site

Sequence alignments, particularly those involving proteins with low percent sequence identities, can be inaccurate. If this is the case, then a model built using the alignment will obviously be wrong in some places. I would suggest that you look over the alignment carefully before building a model.

Note that when using SWISSMODEL it is possible to send in a protein sequence only. For the above reasons, I would only recommend doing this if the degree of sequence identity is high (50% or greater). It is best, particularly if you have edited an alignment, to send an alignment directly to the server.

Once you have a three-dimensional model, it is useful to look at protein 3D structures. There are numerous free programs for doing this, including:

o GRASP (Anthony Nicholls, Columbia, USA)
o MolMol (Reto Koradi, ETH, Zurich, CH)
o Prepi (Suhail Islam, ICRF, UK)
o RasMol (Roger Sayle, Glaxo, UK)

Most places with groups studying structural biology also have commercial packages, such as Quanta, SYBYL or Insight, which contain more features than the visualisation packages described above. Crystallographers also tend to use O and FRODO, though these require a lot of experience to use.

Secondary structure prediction methods and links

There are many web servers for structure prediction; here is a brief summary:

o PSI-pred (PSI-BLAST profiles used for prediction; David Jones, Warwick)
o JPRED Consensus prediction (includes many of the methods given below; Cuff & Barton, EBI)
o DSC (King & Sternberg; this server)
o PREDATOR (Frishman & Argos, EMBL)
o PHD home page (Rost & Sander, EMBL, Germany)
o ZPRED server (Zvelebil et al., Ludwig, UK)
o nnPredict (Cohen et al., UCSF, USA)
o BMERC PSA Server (Boston University, USA)
o SSP (Nearest-neighbor) (Solovyev and Salamov, Baylor College, USA)

With no homologue of known structure from which to make a 3D model, a logical next step is to predict secondary structure. Although the methods differ, the aim of secondary structure prediction is always to locate the alpha helices and beta strands within a protein or protein family.

Single-sequence methods

Secondary structure prediction has been around for about a quarter of a century. The early methods suffered from a lack of data: predictions were made on single sequences rather than on families of homologous sequences, and relatively few known 3D structures were available from which to derive parameters. The best known early methods are those of Chou & Fasman, Garnier, Osguthorpe & Robson (GOR) and Lim. Although the authors originally claimed quite high accuracies (70-80%), careful examination showed the methods to be only 56 to 60% accurate (Kabsch & Sander, 1983; see below). An early problem in secondary structure prediction was the inclusion of structures used to derive parameters in the set of structures used to assess the accuracy of the method.

Some good references on the subject:

Early methods on single sequences

o Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
o Lim, V.I. (1974). Journal of Molecular Biology, 88, 857-872.
o Garnier, J., Osguthorpe, D.J. & Robson, B. (1978). Journal of Molecular Biology, 120, 97-120.
o Kabsch, W. & Sander, C. (1983). FEBS Letters, 155, 179-182. (An assessment of the above methods)

Later methods on single sequences

o Deleage, G. & Roux, B. (1987). Protein Engineering, 1, 289-294. (DPM)
o Presnell, S.R., Cohen, B.I. & Cohen, F.E. (1992). Biochemistry, 31, 983-993.
o Holley, H.L. & Karplus, M. (1989). Proceedings of the National Academy of Sciences, 86, 152-156.
o King, R. & Sternberg, M.J.E. (1990). Journal of Molecular Biology, 216, 441-457.
o Kneller, D.G., Cohen, F.E. & Langridge, R. (1990). Improvements in Protein Secondary Structure
Prediction by an Enhanced Neural Network. Journal of Molecular Biology, 214, 171-182. (NNPRED)

Recent improvements

The availability of large families of homologous sequences revolutionised secondary structure prediction. Traditional methods, when applied to a family of proteins rather than a single sequence, proved much more accurate at identifying core secondary structure elements. The combination of sequence data with sophisticated computing techniques such as neural networks has led to accuracies well in excess of 70%. Though this seems a small percentage increase, these predictions are actually much more useful than those for a single sequence, since they tend to predict the core accurately. Moreover, the limit of 70-80% may be a function of secondary structure variation within homologous proteins.

Automated methods

There are numerous automated methods for predicting secondary structure
from multiply aligned protein sequences. Some good references on the subject include (the acronyms in parentheses given after each reference refer to the associated WWW servers, given below):

o Zvelebil, M.J.J.M., Barton, G.J., Taylor, W.R. & Sternberg, M.J.E. (1987). Prediction of Protein Secondary Structure and Active Sites Using the Alignment of Homologous Sequences. Journal of Molecular Biology, 195, 957-961. (ZPRED)
o Rost, B. & Sander, C. (1993),
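
The accuracy percentages quoted for secondary structure prediction in this section (56 to 60% for the early single-sequence methods, above 70% for alignment-based ones) can be illustrated with a short sketch. The text does not name the measure, so this assumes they are per-residue three-state accuracies (commonly written Q3): the fraction of residues whose predicted state (helix H, strand E or coil C) matches the observed one. The function name and the example strings are invented for illustration.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the fraction of residues whose predicted state matches the observed state.

    Both arguments are strings over the three-state alphabet H (helix),
    E (strand) and C (coil), one character per residue.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    # Count positions where the two assignments agree.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical 17-residue example: three residues are mispredicted as coil.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHCCCCEEEECCC"

q3 = q3_accuracy(predicted, observed)
print(f"Q3 = {q3:.3f}")  # prints "Q3 = 0.824" (14 of 17 residues agree)
```

A score computed this way is only meaningful if the proteins being scored were not among those used to derive the method's parameters, which is the early problem in accuracy assessment described earlier in this section.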
