A STATISTICAL APPROACH FOR SEX IDENTIFICATION IN CHAT MEDIUMS

Cemal KÖSE, Vasif NABİYEV, Özcan ÖZYURT

Department of Computer Engineering, Faculty of Engineering, Karadeniz Technical University, 61080 Trabzon, TURKEY
{ckose, vasif, oozyurt}@ktu.edu.tr

Abstract. Chat mediums are becoming an important part of human life in societies and provide quite useful information about people, such as their current interests, habits, social behaviors and tendencies. In this study, we present an identification system designed to identify the sex of a person in a Turkish chat medium. Here, sex identification is taken as a base study in information mining in chat mediums. The system acquires data from a chat medium, automatically detects each chatter's sex from the information exchanged between chatters, and compares the result with the known identities of the chatters.
To do this task, a simple discrimination function is proposed. The system has achieved an accuracy of over 80% in sex identification in a real chat medium.

1. Introduction

A chat medium contains a vast amount of information that is potentially relevant to a society's current interests, habits, social behaviors, crime and other tendencies. Users may spend a large portion of their time searching for information in chat mediums. An intelligent system may help users find the information they are interested in [1, 2, 3, 6, 8]. In a chat conversation, a chatter considers the corresponding chatter's sex, and the course and contents of the conversation may be shaped according to the corresponding person's sexual identity. Therefore, an example identification system is implemented to determine chatters' sex identity in chat mediums. To do this, many conversations are acquired from chat mediums designed for this purpose, and statistical results are then obtained from the conversations [4, 5, 7]. These results are used to determine the weighting coefficients of the proposed discrimination function. The proposed function includes some important parameters representing groups of words and signs such as
abbreviations, interjections, shouting, and sex- and interest-related words. Each weighting coefficient of the proposed function is determined with respect to the usage frequency of the words in a group and the determinative characteristic of each word group.

In this paper, we present an intelligent identification system that collects information from chat mediums and evaluates it for sex identification [5, 7]. The system and its discrimination function are evaluated on data acquired from a purposely designed chat system. The performance of the system is also measured in real chat mediums.

The rest of this paper is organized as follows. The proposed discrimination function for sex identification is presented in Section 2, together with a detailed description of the methods used in the system. The implementation and results are discussed in Section 3. The conclusion and future work are given in Section 4.

2. The Identification System

To evaluate the identification system, real information is collected and extracted from chat mediums. Statistical data collected from a specially designed medium are also used to evaluate the discrimination function. The most frequently used signs from the specially designed chat medium (SDCM) and from mIRC, the real Internet medium (RIM), are also used to evaluate the system.

2.1. Words and Word Groups

In a chat medium, many word groups may be defined to identify a chatter's sex in a dialogue. In this study, eight word groups are defined to
cover as many sex-related concepts and subjects as possible in a chat medium. These groups are abbreviations and signs, slang and jargon words, politeness and delicacy words, interjections and shouting, sex- and age-related words, question words, particle and conjunction words, and other word groups.

Table 1. Some of the most frequently used words in each word group

| No | Abbreviations and signs | Slang and jargon words | Politeness and delicacy words | Interjections and shouting words |
|----|-------------------------|------------------------|-------------------------------|----------------------------------|
| 1 | Hi (Slm) | My son! (Oglum) | Nice (Güzel) | Hey!/Man! (Yaw) |
| 2 | Answer (Cvp) | Man! (Lan) | Thanks (Tk) | Hmm (Hmm) |
| 3 | What is the news (Nbr) | Uncle! (Dayı!) | Well done (Aferin) | And, so (Ee) |
| 4 | You! (u) | Go away! (Defol) | Yes! (Efendim) | Oh! (Aa) |
| 5 | Thank you (tk) | Repentance! (Tövbe!) | You (Siz) | Well (İi) |

| No | Particle and conjunction words | Age and sexuality related words | Question words | Other words |
|----|--------------------------------|---------------------------------|----------------|-------------|
| 1 | Such/so/that (Öyle) | Age (Yaş) | What (for)? (Niye?) | You (Sen) |
| 2 | If not/otherwise (Yoksa) | Sexuality (Cinsiyet) | Why? (Neden?) | I/me (Ben) |
| 3 | In order to (Diye) | My love (Aşkım) | Which? (Hangi?) | If only (Olsun) |
| 4 | Another/Other (Başka) | My lady (Bayanım) | Where? (Nerde?) | You (Seni) |
| 5 | Thus/so/such (Böyle) | My man/gent. (Erkeğim) | Where are you? (Nerdesin?) | Look (Bak) |

These groups and some important words in the groups are listed in Table 1. The weight coefficient of each word group is assigned according to the usage frequencies and determinative power of the words in the group.

2.2. Statistical Sex Identification

A simple discrimination function is designed to identify the sex of a person in a chat medium. This function considers each word in conversations separately and collectively. Therefore, statistical information related to each chosen word is collected from the purposely designed and Internet chat mediums. By using the statistical information, a weight coefficient is determined for each word in each group. Practically, the weight coefficient of any word is determined by equation (1).

$$\alpha_f = \frac{x_f}{x_f + x_m} \qquad (1)$$

where $x_f$ is the usage frequency of a word by female chatters, $x_m$ is the usage frequency of the word by male chatters, $\alpha_f$ is the weight coefficient of a word in a word group for female, and $\alpha_m$ is the weight coefficient of a word in a word group for male. Each weight coefficient is normalized into the interval from 0 to 1, and then each word in a group is also normalized by the number of words in the group that exist in the conversation. If a word is female dominant, $\alpha_f$ varies from 0.5 to 1, but if the word is male dominant, $\alpha_f$ varies from 0 to 0.5. For each conceptually related word group, a sexual identity value is calculated by equation (2).

$$g_i = \frac{\alpha_{i1} w_1 + \alpha_{i2} w_2 + \dots + \alpha_{ik} w_k}{\theta_i} \qquad (2)$$

$$\theta_i = w_1 + w_2 + \dots + w_k \qquad (3)$$

where $g_i$ varies from 0 to 1 and determines the chatter's sexual identity as female or male for the i-th word group, $\alpha_{ij}$ is the weight coefficient of the j-th word in the i-th word group and varies with the number of words in the text of interest, $w_j$ represents the existence of the j-th word in the text of interest (if the word exists in the text, then $w_j = 1.0$, else $w_j = 0.0$), $k$ is the number of words in the i-th word group, and $\theta_i$ is the normalization divider for the current number of existing words in the i-th word group, calculated by equation (3).

As explained before, words are also classified into several groups considering their conceptual relations. Thus, the importance of some word groups can be emphasized collectively. So, several word groups are defined considering the words acquired from the conversations in the chat mediums. A weight coefficient is also determined for each word group. Then, the proposed discrimination function for sex identification is formed as equation (4). The equation can be used to determine the sex identity of any
chatter in a conversation.

$$\gamma = \frac{\lambda_{g1} g_1 + \lambda_{g2} g_2 + \dots + \lambda_{gn} g_n}{\theta} \qquad (4)$$

where $\gamma$ varies from 0 to 1 and determines the chatter's sexual identity as female or male, $\lambda_{gi}$ is the weight coefficient for the i-th female or male word group, and $\theta$ is the normalization divider for the current number of existing groups in a conversation, calculated by equation (5),

$$\theta = \lambda_{g1} + \lambda_{g2} + \dots + \lambda_{gn} \qquad (5)$$

where $\lambda_{gi}$ is the weight coefficient of the i-th group. Hence, the weight coefficient of each group is determined according to the dominant sexual identity of the group. Then, the sex of a chatter may be identified as female when $\gamma$ is between 0.5 and 1. On the other hand, a chatter may be identified as male when $\gamma$ is between 0.0 and 0.5. The confidence of the result increases as $\gamma$ approaches 0.0 for male or 1.0 for female.

3. Results

In this paper, we have presented a full-scale implementation of a chat system to
collect information from conversations and a method to identify chatters' profiles. This method describes how to use a discrimination function for sex identification in the medium. About two hundred conversations have been collected from the specially designed chat medium and real mediums. Forty-nine of the conversations, including ninety-eight chatters (forty-four female and fifty-four male), are chosen as the training set and used for testing. Experimental results indicate that the proposed discrimination function has sufficient discriminative power for sex identification in chat mediums. We also find that the system can quite accurately predict the chatters' sex in the mediums.

Table 2. The general results of sex identification for the specially designed medium (SDCM) and mIRC

| | Male chatters (SDCM) | Male chatters (mIRC) | Female chatters (SDCM) | Female chatters (mIRC) |
|---|---|---|---|---|
| Number of chatters | 54 | 19 | 44 | 8 |
| Number of correct decisions | 45 | 15 | 36 | 5 |
| Number of wrong decisions | 6 | 4 | 8 | 3 |
| Number of undecided results | 3 | 0 | 0 | 0 |
| Percentage of correct decisions | 83.3% | 78.9% | 81.8% | 62.5% |
| Percentage of wrong decisions | 11.1% | 21.1% | 18.2% | 37.5% |
| Percentage of undecided results | 5.6% | 0.0% | 0.0% | 0.0% |

Table 2 presents the sex classification results for the conversations between chatters in the
medium. The decision accuracy of the system reaches 83.3%.

4. Conclusions and Future Work

Nowadays, chat mediums are becoming an important part of human life and provide quite useful information about people in a society. In this paper, a simple discrimination function is defined for sex identification. The identification system with the discrimination function achieves an accuracy of over 80% in sex identification in the mediums.

In future work, a neuro-fuzzy method that considers the intersection of the word groups can be employed to determine the weighting coefficients of
the proposed discrimination function. Then, the weighting coefficients of the proposed discrimination function would be calculated more precisely, and the accuracy of the identification system could be improved.

References

1. Baumgartner R., Eiter T., Gottlob G., Herzog M., Koch C., Information extraction for the Semantic Web, Reasoning Web, Lecture Notes in Computer Science, Vol. 3564, pp. 275-289, 2005.
2. Gao Xiaoying, Zhang Mengjie, Learning knowledge bases for information extraction from multiple text based Web sites, IEEE/WIC International Conference on Intelligent Agent Technology, pp. 119-125, 2003.
3. Iiritano S., Ruffolo M., Managing the knowledge contained in electronic documents: a clustering method for text mining, 12th International Workshop on Database and Expert Systems Applications, pp. 454-458, 2001.
4. Kaban A., Wang Xin, Context based identification of user communities from Internet chat, IEEE International Joint Conference on Neural Networks, Vol. 4, pp. 3287-3292, 2004.
5. Khan Faisal M., Fisher Todd A., Shuler Lori, Wu Tianhao, Pottenger William M., Mining Chatroom Conversations for Social and Semantic Interactions, Lehigh University Technical Report LU-CSE-02-011, 2002.
6. Pazzani M., Billsus D., Learning and revising user profiles: The identification of interesting Web sites, Machine Learning, 27 (3), pp. 313-331, 1997.
7. Wu Tianhao, Khan Faisal M., Fisher Todd A., Shuler Lori A., Pottenger William M., Error-Driven Boolean-Logic-Rule-Based Learning for Mining Chat-room Conversations, Lehigh University Technical Report LU-CSE-02-008, 2002.
8. Nabiyev V.V., Artificial Intelligence: Problems, Methods, Algorithms, Second Edition, Seçkin Publishing, Ankara, 2005, 764 pp. (in Turkish).
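
As an illustration of the discrimination function described in Section 2.2, the following is a minimal Python sketch of equations (1) to (5). It is not the authors' implementation. The function names, the sample frequencies and the group weights are hypothetical, the divider of a word group is assumed to be the number of group words that occur in the text, and the "undecided" outcome for a score of exactly 0.5 or for a text with no known words is also an assumption.

```python
from typing import Dict, Iterable, Optional


def word_weight(female_freq: float, male_freq: float) -> float:
    """Equation (1): female weight of a word, x_f / (x_f + x_m)."""
    total = female_freq + male_freq
    if total == 0:
        raise ValueError("the word was never observed")
    return female_freq / total


def group_score(weights: Dict[str, float],
                text_words: Iterable[str]) -> Optional[float]:
    """Equations (2)-(3): mean weight of the group words found in the text.

    Assumption: the divider is the count of group words present in the text.
    Returns None when no word of the group occurs.
    """
    present = [w for w in set(text_words) if w in weights]
    if not present:
        return None
    return sum(weights[w] for w in present) / len(present)


def discriminate(scores: Dict[str, Optional[float]],
                 group_weights: Dict[str, float]) -> Optional[float]:
    """Equations (4)-(5): weighted mean of the scores of the existing groups."""
    numerator = 0.0
    divider = 0.0
    for name, score in scores.items():
        if score is None:  # group has no word in this conversation
            continue
        numerator += group_weights[name] * score
        divider += group_weights[name]
    return numerator / divider if divider > 0 else None


def classify(gamma: Optional[float]) -> str:
    """Female above 0.5, male below 0.5, otherwise undecided (assumed)."""
    if gamma is None or gamma == 0.5:
        return "undecided"
    return "female" if gamma > 0.5 else "male"


# Hypothetical usage frequencies (female, male) for two word groups.
slang = {"lan": word_weight(2, 8), "oglum": word_weight(1, 9)}
polite = {"siz": word_weight(6, 4), "efendim": word_weight(7, 3)}
group_weights = {"slang": 2.0, "polite": 1.0}  # hypothetical group weights

text = "slm lan siz".split()
scores = {"slang": group_score(slang, text),
          "polite": group_score(polite, text)}
gamma = discriminate(scores, group_weights)
print(gamma, classify(gamma))
```

In this sketch a group that has no word in the conversation is left out of both the numerator and the divider, which is one way to read "the current number of existing groups" in the description of equation (5).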
