Naive Bayes

Pros: still effective when training data is scarce; handles multi-class problems.
Cons: sensitive to how the input data is prepared.
Suitable data type: nominal values.

Bayes' rule: p(ci|w) = p(w|ci) * p(ci) / p(w). It is what lets us classify documents with naive Bayes.

The general naive Bayes workflow:
(1) Collect the data: any method works; this article uses an RSS feed.
(2) Prepare the data: numeric or Boolean values are required.
(3) Analyze the data: with many features, plotting them individually reveals little; histograms work better.
(4) Train the algorithm: compute the conditional probability of each independent feature.
(5) Test the algorithm: compute the error rate.
(6) Use the algorithm: document classification is a common naive Bayes application, but the classifier works in any classification setting, not just text.

Prepare the data: building word vectors from text

The example below is adapted from Machine Learning in Action. Six posts, each followed by its label:

my, dog, has, flea, problems, help, please (label 0)
maybe, not, take, him, to, dog, park, stupid (label 1)
my, dalmation, is, so, cute, I, love, him (label 0)
stop, posting, stupid, worthless, garbage (label 1)
mr, licks, ate, my, steak, how, to, stop, him (label 0)
quit, buying, worthless, dog, food, stupid (label 1)

A label of 0 marks a normal post; a label of 1 marks an abusive one. By looking at how often each word appears in abusive versus normal posts, we can identify the abusive words.
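Before any code, the counting idea in the paragraph above can be sketched directly. A minimal illustration over the six posts, using raw frequencies only (the smoothing introduced later in this article is deliberately left out):

```python
# The six toy posts and their labels (1 = abusive, 0 = normal), as listed above.
posts = [
    ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
    ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
    ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
    ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
]
labels = [0, 1, 0, 1, 0, 1]

def word_given_class(word, cls):
    # P(word | class) as a raw frequency: occurrences of `word` among
    # all word tokens of posts labeled `cls`.
    tokens = [w for post, c in zip(posts, labels) if c == cls for w in post]
    return tokens.count(word) / len(tokens)

p_abusive = sum(labels) / len(labels)   # P(class = 1) = 3/6 = 0.5
print(word_given_class('stupid', 1))    # 3/19: frequent in abusive posts
print(word_given_class('stupid', 0))    # 0.0: absent from normal posts
```

This is exactly the signal the classifier will exploit: 'stupid' is strong evidence for the abusive class.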
Add the following code to bayes.py:

```python
# coding=utf-8
from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive wording, 0 = normal speech
    return postingList, classVec

# Build the vocabulary: the deduplicated union of all words in all documents.
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

# Turn a document into a 0/1 vector over the vocabulary (set-of-words model).
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word: %s is not in my Vocabulary!' % word)
    return returnVec
```
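The two helpers can be sanity-checked on their own. A standalone sketch (it inlines equivalent definitions, minus the warning print, so it runs outside bayes.py, and uses just the first two posts):

```python
def createVocabList(dataSet):
    # Deduplicated union of all words across all documents.
    vocabSet = set()
    for document in dataSet:
        vocabSet |= set(document)
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    # 1 at position i if vocabList[i] occurs in the document, else 0.
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid']]
vocab = createVocabList(posts)
vec = setOfWords2Vec(vocab, posts[0])
print(len(vocab))   # 14: the two posts share only 'dog' (7 + 8 - 1)
print(sum(vec))     # 7: the first post hits 7 vocabulary entries
```

The vector length always equals the vocabulary size, while its sum counts the distinct known words in the document.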
6、计算概率python view plaincopy1. # 朴素贝叶斯分类器训练函数 2. # trainMatrix: 文档矩阵, trainCategory: 由每篇文档类别标签所构成的向量 3. def trainNB0(trainMatrix, trainCategory): 4. numTrainDocs = len(trainMatrix) 5. numWords = len(trainMatrix0) 6. pAbusive = sum(trainCategory) / float(numTrainDocs) 7. p0Num = zeros(numWords); 8. p1Nu
7、m = zeros(numWords); 9. p0Denom = 0.0; 10. p1Denom = 0.0; 11. for i in range(numTrainDocs): 12. if trainCategoryi = 1: 13. p1Num += trainMatrixi 14. p1Denom += sum(trainMatrixi) 15. else: 16. p0Num += trainMatrixi 17. p0Denom += sum(trainMatrixi) 18. p1Vect = p1Num / p1Denom 19. p0Vect = p0Num / p1D
8、enom 20. return p0Vect, p1Vect, pAbusive 运行结果:测试算法:根据现实情况修改分类器上一节中的 trainNB0 函数中修改几处:p0Num = ones(numWords);p1Num = ones(numWords);p0Denom = 2.0;p1Denom = 2.0;p1Vect = log(p1Num / p1Denom)p0Vect = log(p0Num / p1Denom)python view plaincopy1. # 朴素贝叶斯分类器训练函数 2. # trainMatrix: 文档矩阵, trainCategory: 由每篇文档
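Both fixes are easy to motivate with numbers. The sketch below shows the zero-count problem (addressed by initializing counts to ones) and the underflow problem (addressed by the log transform):

```python
import math

# Zero-count problem: a single unseen word zeroes the entire product.
zero_product = 1.0
for p in [0.05, 0.02, 0.0, 0.04]:
    zero_product *= p
print(zero_product)   # 0.0, regardless of the other factors

# Underflow problem: 1000 modest factors multiply to exactly 0.0 in
# double precision, because the true value 1e-3000 is below float range.
tiny_product = 1.0
for p in [0.001] * 1000:
    tiny_product *= p
print(tiny_product)   # 0.0

# The equivalent sum of logs is perfectly representable, and since log
# is monotonic it preserves which class score is larger.
log_sum = sum(math.log(p) for p in [0.001] * 1000)
print(log_sum)        # about -6907.76
```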
The updated training function, together with the classification function and a convenience test function:

```python
# Naive Bayes classifier training function.
# trainMatrix: matrix of document word vectors
# trainCategory: vector of the class labels, one per document
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords)   # initialize counts to 1 ...
    p1Num = ones(numWords)
    p0Denom = 2.0            # ... and denominators to 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))

    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))

    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
```

Running testingNB() classifies ['love', 'my', 'dalmation'] as 0 and ['stupid', 'garbage'] as 1.
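The full pipeline (vectorize, train with smoothing and logs, classify) can be reproduced in one self-contained script. This is a pure-Python sketch of the same algorithm, with the numpy arrays replaced by plain lists so it runs on its own:

```python
from math import log

posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [0, 1, 0, 1, 0, 1]

vocab = sorted({w for post in posts for w in post})

def to_vec(words):
    # Set-of-words vector: 1 if the vocabulary word occurs in `words`.
    return [1 if v in words else 0 for v in vocab]

def train(matrix, cats):
    # Mirrors trainNB0: counts start at 1, denominators at 2.0,
    # and the probability vectors are returned as logs.
    n_words = len(matrix[0])
    p_abusive = sum(cats) / len(cats)
    p0_num, p1_num = [1.0] * n_words, [1.0] * n_words
    p0_den, p1_den = 2.0, 2.0
    for row, cat in zip(matrix, cats):
        if cat == 1:
            p1_num = [a + b for a, b in zip(p1_num, row)]
            p1_den += sum(row)
        else:
            p0_num = [a + b for a, b in zip(p0_num, row)]
            p0_den += sum(row)
    p0v = [log(x / p0_den) for x in p0_num]
    p1v = [log(x / p1_den) for x in p1_num]
    return p0v, p1v, p_abusive

def classify(vec, p0v, p1v, p_class1):
    # Mirrors classifyNB: compare the two log posterior scores.
    s1 = sum(v * w for v, w in zip(vec, p1v)) + log(p_class1)
    s0 = sum(v * w for v, w in zip(vec, p0v)) + log(1.0 - p_class1)
    return 1 if s1 > s0 else 0

mat = [to_vec(post) for post in posts]
p0v, p1v, pa = train(mat, labels)
print(classify(to_vec(['love', 'my', 'dalmation']), p0v, p1v, pa))  # 0
print(classify(to_vec(['stupid', 'garbage']), p0v, p1v, pa))        # 1
```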
Prepare the data: the bag-of-words document model

Set-of-words model: only records whether each word occurs; a word counts at most once per document. Bag-of-words model: a word may be counted more than once.

```python
# Naive Bayes bag-of-words model
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
```

Example: filtering spam e-mail with naive Bayes

(1) Collect the data: text files are provided.
(2) Prepare the data: parse the text files into token vectors.
(3) Analyze the data: inspect the tokens to make sure parsing is correct.
(4) Train the algorithm: use the trainNB0() function built earlier.
(5) Test the algorithm: use classifyNB(), and build a new test function that computes the error rate over a document set.
(6) Use the algorithm: build a complete program that classifies a set of documents and prints the misclassified ones.

Prepare the data: tokenizing text. Split sentences with a regular expression.

Test the algorithm: hold-out cross-validation with naive Bayes

```python
# Takes a big string and parses it into a list of token strings.
# Tokens with fewer than three characters are dropped, and all
# tokens are converted to lower case.
def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

# Complete spam test function
def spamTest():
    docList = []
    classList = []
    fullText = []
    # load and parse the text files
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)

        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)

    vocabList = createVocabList(docList)
    trainingSet = list(range(50))
    testSet = []
    # randomly build the test set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])

    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])

    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    # classify the held-out test set
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print('classification error', docList[docIndex])
    print('the error rate is:', float(errorCount) / len(testSet))
```

Because the e-mails in the test set are chosen at random, the output may differ from run to run.
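Since the regular expression drives everything downstream, the tokenizer is worth checking in isolation. A standalone sketch (the sample sentence is only an illustration):

```python
import re

def text_parse(big_string):
    # Split on runs of non-alphanumeric characters, drop tokens shorter
    # than three characters ('a', 'of', stray punctuation), lower-case the rest.
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

sample = 'This book is the BEST book on Python, or M.L., I have ever read!'
print(text_parse(sample))
# ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'read']
```

Common short stop-words ('is', 'on', 'or') and bare initials ('M', 'L', 'I') disappear, which is usually what we want for spam filtering.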