Obfuscated Malicious Javascript Detection using Classification Techniques

Peter Likarish, Eunjin (EJ) Jung
Dept. of Computer Science
The University of Iowa
Iowa City, IA 52242
{plikaris, ejjung}@cs.uiowa.edu

Insoon Jo
Distributed Computing Systems Lab
School of Computer Science and Engineering
Seoul National University
ischo@dcslab.snu.ac.kr

Abstract

As the World Wide Web expands and more users join, it becomes an increasingly attractive means of distributing malware. Malicious javascript frequently serves as the initial infection vector for malware. We train several classifiers to detect malicious javascript and evaluate their performance. We propose features focused on detecting obfuscation, a common technique to bypass traditional malware detectors. As the classifiers show a high detection rate and a low false alarm rate, we propose several uses for the classifiers, including selectively suppressing potentially malicious javascript based on the classifiers' recommendations, achieving a compromise between usability and security.

1. Introduction

Malware distributors on the web have a large number of attack vectors available, including drive-by download sites, fake codec installation requests, malicious advertisements, and spam messages on blogs or social network sites. Most common attack methods use malicious javascript during part of the attack, including cross-site scripting [20] and web-based malware distribution. Javascript may be used to redirect a user to a website hosting malicious software, to create a window recommending users download a fake codec, to detect what software versions the user has installed and select a compatible exploit, or to directly execute an exploit.

Malicious javascript often utilizes obfuscation to hide known exploits and prevent rule-based or regular expression (regex)-based anti-malware software from detecting the attack. The complexity of obfuscation techniques has increased, raising the resources necessary to deobfuscate the attacks. For instance, attacks often include references to legitimate companies to disguise their purpose and include context-sensitive information in their obfuscation algorithm. Our detector takes advantage of the ubiquity of this obfuscation. Fig. 1 shows the clear difference between obfuscated javascript and a benign script. Even though the difference is easily discernible by the human eye, obfuscation detection is not trivial. We investigate automating the detection of malicious javascript using classifiers trained on features present in obfuscated scripts collected from the internet. Of course, some benign javascript is also obfuscated, and some malicious javascript is not. Our results show that we detect the vast majority of malicious scripts while detecting very few benign scripts as malicious. We further address this in Section 5.1.

In the next section, we discuss prior research on malicious javascript detection. Then, we describe the system we used to collect both malicious and benign javascripts for training and testing machine learning classifiers. We follow this with a performance evaluation of four classifiers and conclude with recommendations based on our findings as well as detailing future work.

2. Related work

Javascript has become so widespread that nearly all users allow it to execute without question. To protect users, current browsers use sandboxing: limiting the resources javascript can access. At a high level, javascript exploits occur when malicious code circumvents this sandboxing or utilizes legitimate instructions in an unexpected manner in order to fool users into taking insecure actions. For an overview of javascript attacks and defenses, readers are referred to [11].

[Figure 1. Example scripts: (a) obfuscated javascript; (b) benign javascript]

2.1. Disabling javascript

NoScript, an extension for Mozilla's Firefox web browser, selectively allows javascript [13]. NoScript disables javascript, java, flash and other plugin content types by default and only allows script execution from a website in a user-managed whitelist. However, many attacks, especially from user-generated content, are hosted at reputable websites and may bypass this whitelist check. For example, Symantec reported that many of 808,000 unique domains hosting malicious javascript were mainstream websites [19].

2.2. Automated deobfuscation of javascript

As mentioned in Section 1, obfuscation is a common technique to bypass malware detectors. Several projects aid anti-malware researchers by automating the deobfuscation process. Caffeine Monkey [6] is a customized version of Mozilla's SpiderMonkey [14] designed to automate the analysis of obfuscated malicious javascript. Wepawet is an online service to which users can submit javascript, flash or pdf files. Wepawet automatically generates a useful report, checking for known exploits, providing deobfuscation and capturing network activity [3]. Jsunpack from iDefense [8] and "The Ultimate Deobfuscator" from WebSense [1] are two additional tools to automate the process of deobfuscating malicious javascript.

2.3. Detecting and disabling potentially malicious javascript

Egele et al. mitigate drive-by download attacks by detecting the presence of shellcode in javascript strings using x86 emulation (shellcode is used during heap spray attacks) [5].

Hallaraker et al. designed a browser-based auditing mechanism that can detect and disable javascript that carries out suspicious actions, such as opening too many windows or accessing a cookie for another domain. The auditing code compares javascript execution to high-level policies that specify suspicious actions [7].

BrowserShield [16] uses known vulnerabilities to detect malicious scripts and dynamically rewrites them in order to transform web content into a safe equivalent. The authors argue that when an exploit is found, a policy can be quickly generated to rewrite exploit code before the software is patched. Others have proposed a similar javascript rewriting approach as well [23].

Finally, in 2008 Seifert et al. proposed a set of features combining HTTP requests and page content (including the presence and size of iFrames and the use of escaped characters) and used that to generate a decision tree [18]. There is little overlap between the features we evaluate here and those proposed in [18], and it may be possible to combine the two sets to improve detection. In addition, we examine additional classifiers and determined that classifiers using very different approaches perform similarly.

2.4. Cross site scripting attacks

One of the most common web-based attack methods is cross-site scripting (XSS). An XSS attack begins with code injection into a webpage. When a victim views this webpage, the injected code is executed without their knowledge. Potential results of the attack include impersonation/session hijacking, privileged code execution, and identity theft.

Ismail et al. have detailed an XSS vulnerability detection mechanism that manipulates HTTP request and response headers [10]. In their system, a local proxy manipulates the headers, checks if a website is vulnerable to an XSS attack, and alerts the user.

Noxes, by Kirda et al., is a rule-based, client-side mechanism intended to defeat XSS attacks. The authors propose it as an application-level proxy/firewall with manual and automatically generated allow/deny rules [12].

Vogt et al. evaluate a client-side tool that combines static analysis and dynamic data tainting to determine if the user is transferring data to a third party [21]. If so, their Firefox extension asks the user if they wish to allow the transfer. An interesting question raised by this work is whether users could distinguish between a false positive and an actual attack.

2.5. Comparison to our approach

We largely view the related work summarized here as complementary to our work. If a classifier can successfully detect malicious javascript, it may simplify the problem of developing policies or conducting taint analysis. For instance, one could develop a policy based on the results from a classifier that could take a number of actions, including disabling the malicious script, sending it to a central repository for further analysis, or rewriting the malicious script to be benign. Most deobfuscation tools use dynamic analysis, which may slow down the web browsing experience more than static analysis, especially on websites with many scripts. Classifiers can also assist in identifying potentially malicious scripts so that deobfuscation tools can focus solely on them. XSS attack detection can also benefit from malicious javascript detection, as XSS frequently uses malicious javascript as part of the attack [20].

The advantage of using a classifier over the rule-based approaches is that a classifier will detect previously unseen instances of malicious scripts as long as they more closely resemble the malicious training set than the benign training set. If a script uses a previously unknown exploit but is obfuscated, it is still likely to be detected even though a specific policy or rule has not been generated (potentially at a lower overhead cost than dynamically rewriting code or keeping track of tainted data streams). Policies could even aid the browser in allowing benign javascript misclassified as malicious (false positives generated by the classifier) to execute a subset of "safe" instructions, potentially allowing the user to proceed unimpeded even when the classifier has labeled a script as potentially malicious.
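As a minimal illustration of such a policy layer, the sketch below dispatches on a classifier's confidence score. The thresholds, function names, and actions are hypothetical, not from the paper; it only shows how the actions discussed above could hang off a single score:

```python
# Hypothetical policy layer over a classifier's output. Thresholds and
# action names are illustrative, not the paper's design.

def apply_policy(script_source: str, p_malicious: float) -> str:
    """Map a classifier's confidence to one of the actions discussed above."""
    if p_malicious >= 0.9:
        # High confidence: suppress the script and submit it to a central
        # repository for further (manual or dynamic) analysis.
        submit_for_analysis(script_source)
        return "disable"
    elif p_malicious >= 0.5:
        # Uncertain region: allow only a restricted subset of "safe"
        # instructions so a false positive does not block the user outright.
        return "sandbox-safe-subset"
    else:
        return "allow"

def submit_for_analysis(script_source: str) -> None:
    # Placeholder: a deployment would send the script to a repository;
    # here it is simply appended to a local log.
    with open("suspicious_scripts.log", "a") as log:
        log.write(script_source + "\n---\n")
```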

3. Machine learning and malicious javascript

This section provides an overview of the process of using machine learning classifiers to effectively distinguish between malicious and benign javascript. As mentioned in [4], the majority of malicious javascript is obfuscated, and the obfuscation is becoming more and more sophisticated. We aim to detect obfuscated javascripts with a high degree of accuracy and precision, so that we can selectively disable them or otherwise protect the user against online infections. The two essential phases of the process are data collection and feature extraction.

3.1. Data collection

The performance of any classifier is closely related to the quality of the data set used to train that classifier. The training dataset should be a representative sample of both benign and malicious javascripts so that the distribution of samples reflects the distribution in the Internet.

Benign javascript collection. We conducted a crawl of a portion of the web using the Alexa 500 most popular websites as the initial seeds. The crawl was conducted using the Heritrix web crawler, the open source web crawler developed and used by the Internet Archive to capture snapshots of the internet [9]. Details of the crawl are available in Table 1. We based this crawl on a template provided with Heritrix and extended the template to only download textual content while ignoring media and binary content. All told, the crawl gathered content from 95,606 domains.

  Start date              January 26th, 2009
  End date                February 3rd, 2009
  Initial seeds           Alexa 500
  Pages downloaded        9,028,469
  Total domains           95,606
  Data collected          340GB (compressed)
  Est. number of scripts  63,000,000

Table 1. Benign javascript crawl details

Examining a subset of the corpus using a python script, we observed an average of 7 external scripts per page, leading to our estimate that our corpus contains over 63 million scripts. Although this is a modest crawl by modern standards, this amount of information was more than sufficient to train the open source classifiers we used.
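The per-page average above is simple to reproduce. A minimal sketch, assuming the crawled pages are stored as local HTML files (the directory name is hypothetical), using only the standard-library HTML parser to count external <script src=...> references:

```python
# Count external <script src=...> tags per page, as in the corpus estimate
# above. Standard library only; assumes pages saved as .html files.
from html.parser import HTMLParser
from pathlib import Path

class ScriptCounter(HTMLParser):
    """Counts <script> tags that reference an external file via src=."""
    def __init__(self):
        super().__init__()
        self.external_scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script" and any(name == "src" for name, _ in attrs):
            self.external_scripts += 1

def average_external_scripts(html_dir: str) -> float:
    counts = []
    for path in Path(html_dir).glob("*.html"):
        counter = ScriptCounter()
        counter.feed(path.read_text(errors="ignore"))
        counts.append(counter.external_scripts)
    return sum(counts) / len(counts) if counts else 0.0

print(average_external_scripts("crawl_sample"))  # the authors observed ~7
```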

Malicious javascript collection. While collecting examples of benign javascript is relatively straightforward, collecting examples of malicious javascript is far more complicated, primarily because malicious scripts are short-lived. The authors of malicious scripts have no interest in revealing their attack techniques, and website operators have a vested interest in removing malicious scripts before site visitors are exposed. In order to collect live examples of malicious scripts, we created the system detailed in Fig. 2.

[Figure 2. Malicious javascript workflow]

In step 1, we fed the Heritrix web crawler with URLs that had been blacklisted by anti-malware groups, such as lists from http:/ and http:/. In step 2, Heritrix crawls these websites and saves the results in Heritrix's ARC (archive) format. The crawls typically resulted in between 5 and 7 megabytes of data, although by the time of the crawl, most of the exploit code had been removed. In step 3, we used python scripts to extract individual scripts from the ARCs, and in step 4 we conducted a manual review of the scripts (we quickly discovered that most virus scanners do not detect web exploits). This review involved deobfuscation of each malicious script in a clean VM using Venkman's javascript debugger [17]. Scripts we identified as malicious were added to the collection of malicious javascript. Over the course of several crawls conducted during February and March of 2009, we identified 62 malicious scripts. All but one of these scripts utilized a large amount of obfuscation.
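The extraction scripts from step 3 are not included in the paper; the sketch below is one plausible reconstruction, assuming the third-party warcio library (which can iterate legacy Heritrix ARC records) and a deliberately naive regex for inline script bodies:

```python
# Hypothetical reconstruction of step 3: pull inline <script> bodies out of
# a Heritrix ARC file. Assumes the third-party warcio library; the regex is
# intentionally simple and misses edge cases a real HTML parser would catch.
import re
from warcio.archiveiterator import ArchiveIterator

SCRIPT_RE = re.compile(r"<script[^>]*>(.*?)</script>", re.IGNORECASE | re.DOTALL)

def extract_scripts(arc_path: str) -> list[str]:
    scripts = []
    with open(arc_path, "rb") as stream:
        # arc2warc=True lets warcio expose legacy ARC records as WARC records.
        for record in ArchiveIterator(stream, arc2warc=True):
            if record.rec_type != "response":
                continue
            body = record.content_stream().read().decode("utf-8", errors="ignore")
            scripts.extend(m.group(1) for m in SCRIPT_RE.finditer(body))
    return scripts

# Step 4 (manual review) would start from this list.
for script in extract_scripts("malicious_crawl.arc.gz"):
    print(script[:80])
```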

Combined Data Set. From the benign corpus, we extracted 50,000 scripts at random. To this set we added the 62 malicious scripts to form the data set we used to train and test the classifiers.
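As an illustration of how the combined set feeds the training phase: this excerpt names neither the toolkit nor the four classifiers evaluated, so the scikit-learn Naive Bayes model below is an assumption, and extract_features is a placeholder for the features developed in Section 3.2:

```python
# Minimal sketch of training and testing on the combined data set. The
# classifier choice (Naive Bayes via scikit-learn) is an assumption, not
# the paper's stated setup; extract_features stands in for Section 3.2.
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def extract_features(script: str) -> list[float]:
    # Placeholder features; see the obfuscation-oriented ones in Section 3.2.
    return [len(script), script.count(" ") / max(len(script), 1)]

def train_and_evaluate(benign: list[str], malicious: list[str]) -> GaussianNB:
    X = [extract_features(s) for s in benign + malicious]
    y = [0] * len(benign) + [1] * len(malicious)
    # Stratify the split: with 62 malicious vs. 50,000 benign samples, the
    # tiny positive class must be preserved in both partitions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    clf = GaussianNB().fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))
    return clf
```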

3.2. Feature Extraction

The second major phase of our project consisted of identifying features based on an in-depth examination of javascript and a comparison of instances of benign and malicious javascript. As Fig. 1 reveals, it is simple for a human to visually discern the difference between the two classes. The challenge is codifying these differences as features that allow the classifiers to distinguish between them as well.

The simplest approach is to tokenize the script into unigrams or bigrams and track the number of times each appears in benign scripts and in malicious scripts. This approach has worked well with documents written in natural language, as the success of Bayesian classifiers and SVMs in spam filtering shows. However, javascript, a structured language with keywords, has a very different distribution of tokens from natural language. Tokenizing the scripts into unigrams (or bigrams) results in a huge number of features that only rarely appear in either benign or malicious javascript, resulting in a huge feature set with few meaningful features.
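For concreteness, the n-gram baseline dismissed above looks roughly like the following sketch (the token regex is an assumption; any reasonable javascript tokenizer exhibits the same sparsity):

```python
# Sketch of the n-gram baseline: per-class token counts. The token pattern
# is illustrative; the point is the sparse, mostly-singleton feature space.
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z_$][\w$]*|\d+|\S")

def ngram_counts(script: str, n: int = 2) -> Counter:
    tokens = TOKEN_RE.findall(script)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

benign, malicious = Counter(), Counter()
for s in ["var x = 1;"]:                  # stand-in for the benign corpus
    benign.update(ngram_counts(s))
for s in ["eval(unescape('%75%6e'))"]:    # stand-in for the malicious set
    malicious.update(ngram_counts(s))

# The vocabulary balloons while most n-grams occur once or twice -- the
# "huge feature set with few meaningful features" described above.
print(len(set(benign) | set(malicious)))
```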

Using unigrams and bigrams also ignores the structural differences between the benign and malicious scripts and does not take advantage of the knowledge an expert might use to determine whether or not a script is malicious. For instance, we note that obfuscation often utilizes a non-standard encoding for strings or numeric variables (e.g. large amounts of unicode symbols or hexadecimal numberings). In turn, this tends to increase the length of variables and strings, as well as decreasing the proportion of the script that is whitespace. Examination of the malicious javascript also revealed a lack of comments. We observed that malicious javascript contains a much smaller perc
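These observations map directly onto computable measurements. A minimal sketch of such features, with names of my own choosing rather than the paper's published list:

```python
# Obfuscation-oriented features mirroring the observations above: string
# length, whitespace proportion, unicode/hex escape density, and comments.
# Feature names are illustrative, not the paper's own.
import re

STRING_RE = re.compile(r"'[^']*'|\"[^\"]*\"")
ESCAPE_RE = re.compile(r"\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}|%[0-9a-fA-F]{2}")

def obfuscation_features(script: str) -> dict[str, float]:
    length = max(len(script), 1)
    strings = STRING_RE.findall(script)
    return {
        "whitespace_ratio": sum(c.isspace() for c in script) / length,
        "avg_string_len": sum(map(len, strings)) / max(len(strings), 1),
        "escape_count": len(ESCAPE_RE.findall(script)),
        "has_comments": float("//" in script or "/*" in script),
    }

print(obfuscation_features("eval(unescape('%75%6e%65%73'))"))
print(obfuscation_features("// add two numbers\nfunction add(a, b) { return a + b; }"))
```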
