
UAM CorpusTool Version 3.0
Tutorial Introduction (June, 2013)
Mick O'Donnell
michael.odonnell@uam.es

About this Document

This document provides a tutorial introduction to UAM CorpusTool 3.0 (henceforth: UAMCT3). For more detailed information about the options in each screen and menu of UAMCT3, please see the UAMCT3 User Manual.

About UAM CorpusTool 3.0

UAM CorpusTool is a set of tools for the linguistic annotation of text. Core concepts include:

- The user defines a project, which is a set of files and a set of analyses which are applied to each of these files.
- All the files of a project are stored in a single folder: the original texts (the corpus), the annotations on this text and the coding schemes (the tags applied to the texts).
- Each analysis can be seen as a layer of annotation.

CorpusTool currently allows two types of annotation:

1. Document Coding: the text as a whole is assigned features. For instance, these features could represent the register of the document (field, tenor, mode), or text-type.
2. Segment Coding: the user can select segments within a file, and assign features to each of these segments. Segments are specified by dragging the mouse over a span of text, and the user is then prompted to specify the features of this segment.

Annotation can be manual (the user swipes text and chooses categories for it) or automatic (the program does the annotation for you). Sometimes annotation is mixed: for instance, you can have the program recognise clause or
noun-phrase segments, but it is up to you to code them.

CorpusTool is available from: http:/
See that site for instructions on how to install CorpusTool on your machine.

Tutorial 1: Starting a new project

1 Launch UAM CorpusTool

Once UAM CorpusTool is installed on your machine, you can begin working with it. The first thing to do is to create a new “project”:

- Windows: When installing CorpusTool, you had the option to place an icon on the desktop. Click on this icon to launch CorpusTool. Alternatively, there should be a UAM CorpusTool icon in the Programs menu in the Start menu on the Windows Toolbar. Select this to launch CorpusTool.
- Macintosh: The installation of CorpusTool placed the application in your Applications folder. Double-click on the application to launch it. You might find it useful to place the application in the Dock for easy access.

If you have already created a project, you can open it simply by double-clicking the .cp3 file in the Project folder. This file has its own icon on MacOSX and on Windows.

The Opening Window

A window should appear as in Figure 1.1. This window provides, amongst other information, the version number you are using (useful if you need to report bugs).

Figure 1.1: The Opening Window

The window offers several options:

- Start New Project: create a new project from scratch.
- Open Project: continue with a project you have already started. You will be prompted to select one.
- Import Project from UAMCT 2: if you have a project from UAMCT 2, you can use the “Import Project from UAMCT 2” button to make a copy of your project in the UAMCT 3 format.
- Open SomeProjectName: if you have opened a project previously on this machine, there will also be a button to open the last project opened.

2 Click on the “Start New Project” button.

After clicking this button, a “Create Project Wizard” will appear, which will lead you through the steps needed to create your project:

1. Provide a name for the new project.
2. Specify the folder where your new project's folder is to be stored. For instance, choose the Desktop folder on your machine.

When you click the “Finalise” button, CorpusTool will create your project, which is a folder containing all the details related to your project, including the corpus and the annotation files. It also contains an icon which can be used to launch your project directly (the .ct3 file).

Once you have finished with the Create Project Wizard, the CorpusTool Main Window will open, showing the File pane. See Figure 1.2. This pane is where you add files to or remove files from your project, or open a file for annotation.

Figure 1.2: The File Management pane

The buttons at the top of the pane allow you to switch between the different panes of CorpusTool: Files (Tutorial 2), Layers (Tutorial 3), Search (Tutorial 5), Autocode (Tutorial 6), Statistics (Tutorial 7), Explore (Tutorial 8), Options and Help. We will assume for now that the “File” pane is selected. The name of your project is shown in the title bar of
the Project window. In the space below is a box showing all the files in the project (initially empty), and for each file, one button for each of the possible analyses of that file.

This ends the first tutorial. The next tutorial will show how to add content to your project.

Tutorial 2: Adding text files to your project

The next step is to add some files to the project.

1 Save Documents as plain text

UAMCT 3 deals only with plain text files. If your files are in MS Word format or PDF, you need to save them as plain text. If you are on Windows, and your texts are in languages with non-western characters (e.g., Cyrillic, Chinese, Korean, etc.), then it is better to open your .docx document with WordPad, and use the “Save as” option there, as it can save as a Unicode file.

2 Click on “Extend Corpus”

Click on the Extend Corpus button in UAMCT. A window will appear to guide you through the process of adding files. You are given a choice between:

- Add a single file: You will be asked to select a file to add to the corpus. Additionally, you will be asked to specify a “subcorpus” for the text file. Texts in UAMCT are stored within subcorpora (folders within the Corpus folder). For instance, you might have one subcorpus for native texts, and another for learner texts.
- Add a folder of files: You will be asked to select a folder to add to the corpus. This folder could be:
  o A folder of plain text files: the folder will be added as a “subcorpus” of the project.
  o A folder of folders of plain text files: each folder will be added as a “subcorpus” of the project.
- Paste from the Clipboard: you will be given a space into which to paste text. This is a useful way to take texts from the internet into UAMCT.

In the first two cases, the files you select will be copied from where
they are into the Corpus folder of your project. The originals are left untouched.

For this tutorial, let's use the “Paste From Clipboard” option. Copy the following paragraphs of text and follow the instructions below:

Obama is like Apple, Google and Facebook: a once hip brand tainted by Prism

Among the guests at the fabled Bilderberg meeting, held this weekend just outside London, are the top brass of Google, Amazon and Microsoft. How appropriate they should be there, alongside luminaries of the US political and military establishment. For this was the week that seemed to confirm all the old bug-eyed conspiracy theories about governments and corporations colluding to enslave the rest of us. The Guardian revealed that the US National Security Agency has cracked open our online lives, that it can rifle through your emails, listen to your calls on Skype, watching “your ideas form as you type”, as a US intelligence officer put it, apparently in cahoots with the corporate titans of the web.

1. Select “I want to paste from the clipboard” (Figure 2.1), then press “Next”.

Figure 2.1

2. Paste the text into the space (edit it here if you want).

Figure 2.2

3. Type in a filename for the file (e.g., “Obama1.txt”).
4. Leave “Subcorpus” set to “Add new subcorpus”.
5. Press “Next”. You will be prompted for the name of the subcorpus to add the file to. Type “News” and then press OK.
6. Press “Finalise”. The file you added should now be displayed in the Project window (see Figure 2.3).

Figure 2.3: Files window after adding a text file

The newly added files are under the caption “Files in corpus but not incorporated in project”. UAMCT makes a distinction between “incorporated” files, which have buttons to annotate at all available levels, and “unincorporated” files, which are in the corpus but not yet opened for annotation. This distinction makes it easy to keep track of the files you have started editing, as distinct from those you may wish to add later. If you have 100 files in the corpus, but have only annotated five, then you want the five with annotations to be
clearly indicated. This allows for a gradual expansion of your corpus over time, but lets you get results at each point.

3 Incorporating Files

To incorporate a file into the project, making it available for annotation, you can either:

- Click on the “Incorporate All” button to incorporate all unincorporated files, or
- Click on the “Action” button next to a file and select “Incorporate file” from the menu. This will incorporate just the single file.

If you do either of the above, you will be presented with a window asking for some metadata regarding the file or files (see Figure 2.4). This includes:

- Language: which language is the text written in? This field is used to determine which language resources to use for the document. These resources include lexicons (for concordance searching, calculation of lexical density, etc.), parsers (for automatic segmentation) and taggers. Currently, only English is really supported, but soon lexical resources for other languages will be provided.
- Encoding: text files are stored in a particular text encoding. You can tell CorpusTool what encoding your file is in by selecting from this field. The default option offered by UAMCT is a guess of what it should be, but if the text does not display properly, you may need to change it. To find out what encoding the document is in, try right-clicking on the document, select “Open with” (or the MacOSX equivalent) and open the text with MS Word, which may help you choose the best encoding. Otherwise, using “Open with”, select a browser, look for the “Encoding” or “Character Encoding” menu item, and see which encoding the browser assigned to the text.
- Display Font: choose here the font family and size you want to use to display your text in the annotation windows. Some fonts cope better with non-western writing systems, e.g., some fonts are designed to display Chinese. However, many modern fonts should display any writing system.

Figure 2.4: File Metadata Window

After incorporating the file, the Project Window appears as in Figure 2.5.

Figure 2.5: The Files Window after incorporating
files.

Tutorial 3: Adding a manual annotation layer to your project

The next thing we need to do is to specify what analyses you want in the project. Let's start by adding just one layer. A “Layer” is a type of analysis of the text files. We can add layers for coding clauses, for coding groups, for the register of the whole text, for appraisal analysis, etc. For this example, we will assume we are adding a layer for analysing noun phrases (NPs) in terms of both their content (what they express), and their form (proper, common, pronominal).

1 Change to the Layers pane

Click on the “Layers” button at the top of the window.

2 Click on the “Add Layer” button.

When you click on “Add Layer”, a window will pop up asking several questions. Use the Next button to move between questions:

1. Layer Name: the name given to the layer. Put “Entity”.
2. Automatic or Manual Annotation: choose Manual.
3. Scheme: choose “Design Your own”. The other options allow you to use one of the schemes supplied with UAMCT3, or to use a scheme from another project you have created.
4. Kind of Segment: here you specify whether you want to assign features to a text as a whole, e.g., its register or text type (Whole Document), or whether you want to assign features to subsegments in the text, e.g., clauses. Let's assume that we are interested in the second, so click on “Segments within a Document”.
5. Special Layer: this window offers options for special kinds of annotation. Error annotation layers provide a special slot on the coding interface for you to provide the correction of the error. RST annotation provides a special interface for annotating the “rhetorical structure” of the text. For now, just select “No”.
6. Automatic Segmentation: here you can specify whether the text should be segmented for you, recognising paragraphs, sentences, or words. For English texts, automatic recognition of clauses and NPs is also possible. For this tutorial, select “No”.
7. After following these steps, you will see a final window displaying your choices, as in Figure 3.1. If any of the settings differ from what you intended, use the Back button to go back and change them. Then press “Create Layer” to return to the main window.

Figure 3.1: Last pane of the Add Layer Assistant

Figure 3.2 shows the Layers window with one layer added. The Layer space provides some information about the layer. The Layer control panel has the following buttons:

- Edit Scheme: this button will open a window to allow you to edit the coding scheme. We will come back to this in the next tutorial.
- Delete Layer: this will delete the layer, and all analyses of text files performed on this layer. Press this only before you begin coding of
the layer, or if you really want to delete the layer.
- Edit Details: this button is currently disabled, but will in the future allow you to change the characteristics of the layer (e.g., manual/automatic, auto-segment, etc.). Currently, you need to delete the layer and add it again to change the characteristics.

Figure 3.2: The Layers Window with one Layer added

2.1 Return to the Files pane

If you click on the Files button, you will see that the display has changed slightly. The entry for the “Obama1.txt” file now has an “Entity” button next to it. You can click on this button to edit this file at this layer. The annotation buttons are colour coded to indicate their degree of completeness:

- Light: Not yet coded
- Medium: Partially coded
- Dark: Coded to a high degree

Don't open the annotation window just yet, though. First, we need to specify the coding scheme for the Entity layer. The next tutorial will deal with this process.

Tutorial 4: Editing the Coding Scheme

1 Opening the Scheme Editor

Before annotating files for a given layer, you need to define the annotation scheme for the layer. The first step here is to open the scheme editor. Change to the Layers pane, and click on the “Edit Scheme” button for the layer. This tutorial assumes we are working on the “Entity” layer defined in the previous tutorial.

Figure 4.1: The Entity Scheme before editing

A window like Figure 4.1 will pop up. It shows a small “system network” (a hierarchy of features), with “entity” as the most basic concept, and a choice between entity-1 and entity-2.

2 Editing the Scheme

These features have been automatically generated, and we will change them to more informative names.
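
For readers who find it easier to think about a system network as data, the sketch below models the starting scheme as a small tree in Python and renames the two generated features. It is only an illustration of the idea of a feature hierarchy: the Feature class, its methods, and the replacement names “proper” and “common” are assumptions made for this example, and it does not reflect how UAMCT3 stores schemes internally.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Feature:
    """One node in a system network: a feature plus its more specific choices."""

    name: str
    choices: list[Feature] = field(default_factory=list)

    def rename(self, old: str, new: str) -> bool:
        """Rename the first feature called `old` at or below this node."""
        if self.name == old:
            self.name = new
            return True
        return any(child.rename(old, new) for child in self.choices)

    def paths(self, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
        """Return every root-to-leaf path, i.e. every complete feature selection."""
        here = prefix + (self.name,)
        if not self.choices:
            return [here]
        result: list[tuple[str, ...]] = []
        for child in self.choices:
            result.extend(child.paths(here))
        return result


# The automatically generated scheme: "entity" with a choice of two features.
scheme = Feature("entity", [Feature("entity-1"), Feature("entity-2")])

# Hypothetical, more informative names (illustration only).
scheme.rename("entity-1", "proper")
scheme.rename("entity-2", "common")

print(scheme.paths())  # [('entity', 'proper'), ('entity', 'common')]
```

Each path printed at the end corresponds to one full set of features that could be assigned to a segment, from the most basic concept down to the most specific choice.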
