Slide 1: BASIC TECHNIQUES IN STATISTICAL NLP (September 2003)
Word prediction; n-grams; smoothing

Slide 2: Statistical Methods in NLE
Two characteristics of natural language make it desirable to endow programs with the ability to LEARN from examples of past use:
- VARIETY (no programmer can really take into account all possibilities)
- AMBIGUITY (need to have ways of choosing between alternatives)
In a number of NLE applications, statistical methods are very common. The simplest application: WORD PREDICTION.

Slide 3: We are good at word prediction
- Stocks plunged this morning, despite a cut in interest ...
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
- Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began ...

Slide 4: Real Spelling Errors
- They are leaving in about fifteen minuets to go to her house.
- The study was conducted mainly be John Black.
- The design an construction of the system will take more than one year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of this problem.
- He is trying to fine out.

Slide 5: Handwriting recognition
From Woody Allen's Take the Money and Run (1969): Allen (a bank robber) walks up to the teller and hands over a note that reads "I have a gun. Give me all your cash." The teller, however, is puzzled, because he reads "I have a gub." "No, it's gun," Allen says. "Looks like gub to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.

Slide 6: Applications of word prediction
- Spelling checkers
- Mobile phone texting
- Speech recognition
- Handwriting recognition
- Disabled users

Slide 7: Statistics and word prediction
The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of a spelling error. I.e., to compute
P(w | W1 ... Wn-1)
for all words w, and predict as the next word the one for which this (conditional) probability is highest.

Slide 8: Using corpora to estimate probabilities
But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY. The simplest method: the Maximum Likelihood Estimate (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered. "Maximum" because it doesn't waste any probability on events not in the corpus.

Slide 9: Maximum Likelihood Estimation for conditional probabilities
In order to estimate P(w | W1 ... Wn-1), we can use the relative frequency instead:
P(w | W1 ... Wn-1) = C(W1 ... Wn-1 w) / C(W1 ... Wn-1)
Cfr.: P(A | B) = P(A & B) / P(B)
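The relative-frequency estimate on slide 9 can be made concrete with a short sketch; the Python code and toy corpus below are illustrative additions, not part of the original slides.

```python
from collections import Counter

def mle_bigram_probs(tokens):
    """MLE bigram estimates: P(w | prev) = C(prev, w) / C(prev)."""
    prev_counts = Counter(tokens[:-1])                     # denominators C(prev)
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))  # numerators C(prev, w)
    return {(prev, w): c / prev_counts[prev]
            for (prev, w), c in bigram_counts.items()}

# Toy corpus (invented); a real estimate would come from a large corpus such as BERP.
corpus = "I want to eat Chinese food I want to eat dinner".split()
probs = mle_bigram_probs(corpus)
print(probs[("want", "to")])      # 1.0 -- "to" follows every occurrence of "want"
print(probs[("eat", "Chinese")])  # 0.5 -- "eat" is followed by "Chinese" once out of twice
```

Note that the MLE leaves no probability mass at all for unseen sequences, which is exactly what the smoothing methods later in the deck address.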

Slide 10: Aside: counting words in corpora
Keep in mind that it's not always so obvious what a word is (cfr. yesterday).
- In text: "He stepped out into the hall, was delighted to encounter a brother." (From the Brown corpus.)
- In speech: "I do uh main- mainly business data processing"
- LEMMAS: cats vs. cat
- TYPES vs. TOKENS
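As a small illustration of the TYPES vs. TOKENS distinction (the example sentence is mine, not from the slides):

```python
tokens = "the cat sat on the mat and the cat slept".split()
types = set(tokens)
print(len(tokens))  # 10 tokens (running words)
print(len(types))   # 7 types (distinct word forms); lemmatization would merge e.g. cat/cats further
```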

Slide 11: The problem: sparse data
In principle, we would like the n of our models to be fairly large, to model long-distance dependencies such as:
- Sue SWALLOWED the large green ...
However, in practice, most events of encountering sequences of words of length greater than 3 hardly ever occur in our corpora! (See below.)
(Part of the) solution: we APPROXIMATE the probability of a word given all previous words.

Slide 12: The Markov Assumption
The probability of being in a certain state only depends on the previous state:
P(Xn = sk | X1 ... Xn-1) = P(Xn = sk | Xn-1)
This is equivalent to the assumption that the next state only depends on the previous m inputs, for m finite.
(N-gram models / Markov models can be seen as probabilistic finite-state automata.)

Slide 13: The Markov assumption for language: n-gram models
Making the Markov assumption for word prediction means assuming that the probability of a word only depends on the previous n-1 words (N-GRAM model).
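In code, the Markov assumption amounts to keying the model on a truncated history; a minimal sketch (function name and example are mine):

```python
def markov_context(history, n):
    """Under an n-gram Markov assumption, only the last n-1 words of the history matter."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

history = "Just the other day I saw a".split()
print(markov_context(history, 2))  # ('a',)        -> a bigram model conditions on one word
print(markov_context(history, 3))  # ('saw', 'a')  -> a trigram model conditions on two words
```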

Slide 14: Bigrams and trigrams
Typical values of n are 2 or 3 (BIGRAM or TRIGRAM models):
P(Wn | W1 ... Wn-1) ≈ P(Wn | Wn-2, Wn-1)
P(W1 ... Wn) ≈ ∏i P(Wi | Wi-2, Wi-1)
What the bigram model means in practice: instead of P(rabbit | Just the other day I saw a), we use P(rabbit | a).
- Unigram: P(dog)
- Bigram: P(dog | big)
- Trigram: P(dog | the, big)

Slide 15: The chain rule
So how can we compute the probability of sequences of words longer than 2 or 3? We use the CHAIN RULE:
P(W1 ... Wn) = P(W1) P(W2 | W1) P(W3 | W1, W2) ... P(Wn | W1 ... Wn-1)
E.g., P(the big dog) = P(the) P(big | the) P(dog | the big)
Then we use the Markov assumption to reduce this to manageable proportions, e.g. for bigrams:
P(W1 ... Wn) ≈ ∏i P(Wi | Wi-1)
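A sketch of the chain rule under the bigram approximation; the probabilities below are invented toy values (the worked BERP numbers follow on the next slides), and '<s>' is a hypothetical sentence-start marker.

```python
def bigram_sentence_prob(words, bigram_p):
    """P(w1 ... wn) ~= product over i of P(wi | wi-1), with '<s>' as the first context."""
    prob = 1.0
    for prev, w in zip(["<s>"] + words[:-1], words):
        prob *= bigram_p[(prev, w)]
    return prob

# Toy bigram probabilities, not estimated from any corpus.
bigram_p = {("<s>", "the"): 0.2, ("the", "big"): 0.1, ("big", "dog"): 0.3}
print(round(bigram_sentence_prob(["the", "big", "dog"], bigram_p), 6))  # 0.2 * 0.1 * 0.3 = 0.006
```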

Slide 16: Example: the Berkeley Restaurant Project (BERP) corpus
BERP is a speech-based restaurant consultant. The corpus contains user queries; examples include:
- I'm looking for Cantonese food
- I'd like to eat dinner someplace nearby
- Tell me about Chez Panisse
- I'm looking for a good place to eat breakfast

Slide 17: Computing the probability of a sentence
Given a corpus like BERP, we can compute the probability of a sentence like "I want to eat Chinese food". Making the bigram assumption and using the chain rule, the probability can be approximated as follows:
P(I want to eat Chinese food) ≈ P(I | "sentence start") P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)

Slide 18: Bigram counts

Slide 19: How the bigram probabilities are computed
Example for P(I | I):
C("I", "I") = 8
C("I") = 8 + 1087 + 13 + ... = 3437
P("I" | "I") = 8 / 3437 = .0023

Slide 20: Bigram probabilities

Slide 21: The probability of the example sentence
P(I want to eat Chinese food) ≈ P(I | "sentence start") * P(want | I) * P(to | want) * P(eat | to) * P(Chinese | eat) * P(food | Chinese) = .25 * .32 * .65 * .26 * .002 * .60 = .000016
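The arithmetic on slide 21 is easy to reproduce; only the six probability values below are taken from the slides, the code itself is an added sketch.

```python
# Bigram probabilities quoted on the slides for "I want to eat Chinese food".
p = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "Chinese"): 0.002,
    ("Chinese", "food"): 0.60,
}

prob = 1.0
for value in p.values():
    prob *= value
print(f"{prob:.6f}")  # 0.000016, matching the slide
```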

Slide 22: Examples of actual bigram probabilities computed using BERP

Slide 23: Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method
For unigrams:
- Choose a random value r between 0 and 1
- Print out the word w whose cumulative probability interval contains r
For bigrams:
- Choose a random bigram P(w | <s>) to start the sentence
- Then pick bigrams to follow as before
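A minimal sketch of the Shannon/Miller/Selfridge sampling procedure; the toy conditional distributions and the '<s>'/'</s>' markers are invented for illustration (a real run would use distributions estimated from a corpus such as Shakespeare, as on the next slides).

```python
import random

# Toy conditional bigram distributions P(w | prev); invented, not estimated from data.
bigram_dist = {
    "<s>":  {"I": 0.6, "Tell": 0.4},
    "I":    {"want": 0.7, "am": 0.3},
    "want": {"to": 1.0},
    "to":   {"eat": 1.0},
    "eat":  {"</s>": 1.0},
    "Tell": {"me": 1.0},
    "me":   {"</s>": 1.0},
    "am":   {"</s>": 1.0},
}

def generate(bigram_dist, max_len=20):
    """Sample a sentence word by word from the bigram distributions."""
    word, out = "<s>", []
    for _ in range(max_len):
        candidates, weights = zip(*bigram_dist[word].items())
        word = random.choices(candidates, weights=weights)[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate(bigram_dist))  # e.g. "I want to eat"
```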

Slide 24: The Shannon/Miller/Selfridge method trained on Shakespeare

Slide 25: Approximating Shakespeare, cont'd

Slide 26: A more formal evaluation mechanism
- Entropy
- Cross-entropy

Slide 27: The downside
The entire Shakespeare oeuvre consists of:
- 884,647 tokens (N)
- 29,066 types (V)
- 300,000 bigrams
All of Jane Austen's novels (on Manning and Schuetze's website):
- N = 617,091 tokens
- V = 14,585 types

Slide 28: Comparing Austen n-grams: unigrams

Slide 29: Comparing Austen n-grams: bigrams

Slide 30: Comparing Austen n-grams: trigrams
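A quick back-of-the-envelope check of how sparse the Shakespeare bigram data is, using only the figures quoted on slide 27 (the code itself is an added sketch):

```python
# Figures quoted on the slide for the Shakespeare corpus (884,647 tokens in total).
types = 29_066
observed_bigrams = 300_000

possible_bigrams = types ** 2
print(possible_bigrams)                     # 844832356 possible bigrams
print(observed_bigrams / possible_bigrams)  # ~0.00036: well under 0.1% of them are ever observed
```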

Slide 31: Maybe with a larger corpus?
Words such as "ergativity" are unlikely to be found outside a corpus of linguistic articles. More in general: Zipf's law.

Slide 32: Zipf's law for the Brown corpus

Slide 33: Addressing the zeroes
SMOOTHING re-evaluates some of the zero-probability and low-probability n-grams, assigning them non-zero probabilities:
- Add-one
- Witten-Bell
- Good-Turing
BACK-OFF uses the probabilities of lower-order n-grams when higher-order ones are not available:
- Backoff
- Linear interpolation

Slide 34: Add-one (Laplace's Law)
Add one to every count and renormalize; for bigrams: P*(Wn | Wn-1) = (C(Wn-1 Wn) + 1) / (C(Wn-1) + V)

Slide 35: Effect on BERP bigram counts

Slide 36: Add-one bigram probabilities

Slide 37: The problem

Slide 38: The problem
Add-one has a huge effect on probabilities: e.g., P(to | want) went from .65 to .28! Too much probability gets removed from n-grams actually encountered (more precisely: the discount factor is too large).
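A sketch of add-one smoothing for bigrams under the formulation on slide 34; the toy corpus and function name are mine, not the BERP tables.

```python
from collections import Counter

def addone_bigram_prob(prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) estimate: (C(prev, w) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + vocab_size)

tokens = "I want to eat Chinese food I want to eat dinner".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)

print(addone_bigram_prob("want", "to", bigrams, unigrams, V))       # ~0.33: a seen bigram, pulled down from its MLE of 1.0
print(addone_bigram_prob("want", "Chinese", bigrams, unigrams, V))  # ~0.11: an unseen bigram now gets non-zero probability
```

The same pull-down effect is what slide 38 complains about: when V is large, seen n-grams lose a great deal of probability mass.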

Slide 39: Witten-Bell Discounting
How can we get a better estimate of the probabilities of things we haven't seen? The Witten-Bell algorithm is based on the idea that a zero-frequency N-gram is just an event that hasn't happened yet. How often do these events happen? We model this by the probability of seeing an N-gram for the first time (we just count the number of times we first encountered a type).

Slide 40: Witten-Bell: the equations
Total probability mass assigned to zero-frequency N-grams: T / (N + T)
(NB: T is the number of OBSERVED types, not V.)
So each zero-count N-gram gets the probability T / (Z (N + T)), where Z is the number of N-grams with zero count.

Slide 41: Witten-Bell: why discounting
Now of course we have to take away something (the "discount") from the probability of the events seen more than once: each seen N-gram with count c gets c / (N + T).

Slide 42: Witten-Bell for bigrams
We relativize the types to the previous word: each unseen bigram starting with Wn-1 gets T(Wn-1) / (Z(Wn-1) (N(Wn-1) + T(Wn-1))), where N(Wn-1), T(Wn-1) and Z(Wn-1) are the bigram tokens, observed continuation types, and unseen continuation types for Wn-1.
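A sketch of the Witten-Bell scheme as summarized on slides 40-41 (variable names and toy counts are mine): seen events get c / (N + T), and the reserved mass T / (N + T) is split evenly over the Z unseen events.

```python
from collections import Counter

def witten_bell(counts, unseen_events):
    """Witten-Bell smoothing over one distribution of n-grams."""
    N = sum(counts.values())   # observed n-gram tokens
    T = len(counts)            # observed n-gram types
    Z = len(unseen_events)     # n-grams with zero count
    probs = {e: c / (N + T) for e, c in counts.items()}          # discounted seen events
    probs.update({e: T / (Z * (N + T)) for e in unseen_events})  # share of the reserved mass
    return probs

bigram_counts = Counter({("want", "to"): 2, ("to", "eat"): 2, ("eat", "Chinese"): 1})
unseen = [("eat", "British"), ("want", "Thai")]
p = witten_bell(bigram_counts, unseen)
print(sum(p.values()))  # 1.0 -- the discount taken from seen bigrams exactly covers the unseen ones
```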

Slide 43: Add-one vs. Witten-Bell discounts for unigrams in the BERP corpus

Slide 44: One last discounting method ...
The best-known discounting method is GOOD-TURING (Good, 1953). Basic insight: re-estimate the probability of N-grams with zero counts by looking at the number of N-grams that occurred once. For example, the revised count for bigrams that never occurred is estimated by dividing N1, the number of bigrams that occurred once, by N0, the number of bigrams that never occurred.
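A sketch of Good-Turing count re-estimation, assuming the standard formulation c* = (c + 1) * N(c+1) / N(c), of which the slide's N1 / N0 estimate for unseen bigrams is the c = 0 case; the counts-of-counts below are invented.

```python
def good_turing_revised_count(c, Nc):
    """Good-Turing re-estimated count: c* = (c + 1) * N_{c+1} / N_c,
    where N_c is the number of n-gram types that occurred exactly c times."""
    return (c + 1) * Nc[c + 1] / Nc[c]

# Invented counts-of-counts: N0 never-seen bigrams, N1 singletons, N2 doubletons, ...
Nc = {0: 1_000_000, 1: 40_000, 2: 15_000, 3: 8_000}

print(good_turing_revised_count(0, Nc))  # 0.04 -- the slide's N1 / N0 estimate for unseen bigrams
print(good_turing_revised_count(1, Nc))  # 0.75 -- singletons are discounted: 2 * N2 / N1
```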

Slide 45: Combining estimators
A method often used (generally in combination with discounting methods) is to use lower-order estimates to help with higher-order ones:
- Backoff (Katz, 1987)
- Linear interpolation (Jelinek and Mercer, 1980)
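A sketch of simple linear interpolation, one of the two combining strategies named above; the lambda weights and toy estimates are invented, and tuning the lambdas (e.g. by deleted interpolation) is left out.

```python
def interpolated_prob(w, prev, unigram_p, bigram_p, lambdas=(0.7, 0.3)):
    """Linear interpolation: P(w | prev) = l1 * P_bigram(w | prev) + l2 * P_unigram(w)."""
    l1, l2 = lambdas
    return l1 * bigram_p.get((prev, w), 0.0) + l2 * unigram_p.get(w, 0.0)

# Toy MLE estimates (invented); a zero in the bigram table no longer zeroes out the result.
unigram_p = {"food": 0.01, "Chinese": 0.005}
bigram_p = {("Chinese", "food"): 0.60}

print(interpolated_prob("food", "Chinese", unigram_p, bigram_p))  # 0.7*0.60 + 0.3*0.01 ≈ 0.423
print(interpolated_prob("Chinese", "food", unigram_p, bigram_p))  # unseen bigram: 0.3*0.005 ≈ 0.0015
```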

Slide 46: Backoff: the basic idea

Slide 47: Backoff with discounting

Slide 48: Readings
- Jurafsky and Martin, chapter 6
- The Statistics Glossary
- Word prediction: for mobile phones; for disabled users
- Further reading: Manning and Schuetze, chapter 6 (Good-Turing)

Slide 49: Acknowledgments
Some of the material in these slides was taken from lecture notes by Diane Litman & James Martin.
