Concordance Overview

Contents

Overview - how concordances are useful
Corpora - what concordances use as texts/data
Corpus Preparation - how to prepare texts/data
Concordance Usage and Options - how to use the concordance program; with screen dump

In its simplest form, a Keyword-in-Context (KWIC) Concordance is a listing of some or all of the words in a text or set of texts, surrounded by the text that they are embedded in. Here is a section of a concordance of just the first sentence of this page:

surrounded by the text	that	they are embedded in. H
sting of some or all of	the	words in a text, or set
of texts, surrounded by	the	text that they are embedd

Typically, the concordance lines would show more of the surrounding text, so the user could more clearly understand how the words are used.
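
The idea can be sketched in a few lines of Python. This is only an illustrative sketch, not the code behind the concordance programs described on this page; the function name kwic and its context parameter are made up for the example, and whole-word, case-insensitive matching is an assumption.

```python
import re

def kwic(text, keyword, context=25):
    """Return one concordance line for each whole-word occurrence of keyword."""
    lines = []
    # Assumption: match whole words only and ignore upper/lower case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - context):m.start()]
        right = text[m.end():m.end() + context]
        # Right-align the left context so the keywords line up in one column.
        lines.append(f"{left:>{context}}\t{m.group()}\t{right}")
    return lines

sentence = ("In its simplest form, a Keyword-in-Context (KWIC) Concordance is a "
            "listing of some or all of the words in a text or set of texts, "
            "surrounded by the text that they are embedded in.")

for line in kwic(sentence, "the"):
    print(line)
```

Raising the context value shows more of the surrounding text on each line.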

The purpose of a concordance is to study how words are used in a language, and to allow us to acquire a deeper understanding of meaning and usage than can be obtained from a dictionary. As an example, consider the words tan and auburn. Both can be used to mean a color; both indicate a brownish hue. This much you can find in a dictionary. But in a dictionary, you would not find that auburn is used frequently to describe hair color but never to describe skin color. Nor would you find that tan is not used to describe hair. But a concordance which uses a large amount of text from the target language could show you many occurrences of these two words at a glance (and other meanings as well, of course, such as the use of tan as an abbreviation of a trigonometric tangent). In this way you could infer how native speakers use the words, and how these usages may be limited to specific situations.

Acquiring this sense of how words are actually used (as opposed to just what they mean) will help in creating the best possible translations. For example, if you were reading an English story in which someone's skin was described as auburn, you would immediately know that something unusual was intended: perhaps, for example, it is used for comic effect. Your translation, then, would attempt to accomplish the intended comic effect in the target language. If you didn't know that auburn was not normally used to describe skin, but only knew that it is a brownish-red color, you would probably just translate the word to the target-language equivalent and lose the intended comic effect.

Corpora

Clearly, the more text that goes into a concordance, the more useful the results will be. Doing a concordance on a sentence or paragraph cannot tell you very much about patterns of usage in the language. Most European languages already have electronic versions of tens or even hundreds of megabytes of text which are publicly available. A single such collection is called a corpus - plural corpora.

Some corpora consist of a broad selection of materials from the language - novels, plays, newspaper articles, transcriptions of authentic speech, and so on. Others are specialized - religious documents or political writings or the works of a single author, etc. Clearly, if you are studying the works of Shakespeare you would want his collected plays and poems in a corpus and nothing more. If you are interested in the language of current events, you would want newspapers or political writings. But if you are interested in written language in general, you would want as broad a selection as possible.

The less commonly taught languages generally do not have prepared corpora. For this course, some medium-sized corpora have been prepared for your use. With the concordance programs supplied with this course, you will also be able to create your own corpora by downloading documents from the internet or obtaining electronic versions of texts in other ways. Most likely these corpora will not reach megabyte size, but using your own data with our concordance programs should give you a good feel for the way that concordances can be used.

In fact, the concordance programs used in this course were not developed to process megabytes of text data. There are professional concordance programs available that can do that, and there are web sites (for some European languages, including English) that can access huge corpora.

Our programs, on the other hand, are free and can be used by any computer with an internet connection and a web browser. They can use texts that you supply, not just already prepared texts. However, they will probably choke if you feed them too much data. (At this writing, 240 printed pages or a bit less than half a megabyte of text have been processed without causing a crash.)

Corpus Preparation

If you want to create your own corpus, you can do so by any convenient means... but you must ensure that the data you create is pure text. By that, I mean a text with no hidden format codes, font information, or other information in it. Here is one suggested workflow to create a text file and use it in our concordance programs:

go to an internet site and select and copy some text to the clipboard
open Wordpad and paste in the text
save the file using Save as Type: "Text Document" or (for Thai) "Unicode Text Document"
now anytime you want to use this text as your corpus, open the file (using Wordpad), select the text, copy it, and then paste it into the concordance textbox. Be sure you click the "Paste text to use" radio button.
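
If you prefer a script to Wordpad, the same "pure text" result can be approximated in Python. This is a rough sketch under stated assumptions, not part of the course software: it assumes you already have the page's HTML in a string, the names TextExtractor, html_to_plain_text, and save_corpus are invented for the example, and saving as UTF-8 is an assumption about what the concordance textbox will accept.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # how deep we are inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_plain_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())

def save_corpus(text, path):
    """Write the text as a plain UTF-8 file with no formatting codes."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

print(html_to_plain_text("<p>Her <b>auburn</b> hair</p><script>x=1</script>"))
```

The saved file can then be opened in any text editor and pasted into the concordance textbox as described above.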

Special Notes for Thai:

1. The Thai must be Unicode Thai. Some web sites use older encodings that will not work.

2. Someone must manually insert spaces between words. The software cannot guess at word divisions. Sorry!
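
One way to check whether a text really is Unicode Thai is to count the characters that fall in the Unicode Thai block (U+0E00 to U+0E7F). The Python sketch below is only an illustration: the function name count_thai_chars is invented, and TIS-620 is used as one example of an older Thai encoding; this page does not say which encodings the problem web sites use.

```python
def count_thai_chars(text):
    """Count characters in the Unicode Thai block (U+0E00 to U+0E7F)."""
    return sum(1 for ch in text if "\u0e00" <= ch <= "\u0e7f")

# Two Thai words already separated by a space, stored in an older encoding.
raw = "ภาษา ไทย".encode("tis_620")

wrong = raw.decode("latin-1")   # bytes read with the wrong encoding
right = raw.decode("tis_620")   # bytes converted properly to Unicode

print(count_thai_chars(wrong))  # no Thai characters are recognized
print(count_thai_chars(right))  # every letter is recognized as Thai
print(right.split())            # words are found only where spaces were typed
```

The last line shows why the spaces matter: a simple program can only split Thai text into words at the spaces someone has inserted.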

Concordance Usage and Options

Step 1: Provide a Document to Use

Choose a file to use: you must click on one of the files below. This file is a prepared text (corpus) that will be used as the data for the concordance.
Paste text to use: paste some text (your corpus) into the text box. The concordance will use this as its data.

Step 2: Choose Concordance Type

Display all words: this choice will do a concordance with all the words in the text, creating one line for each word each time it occurs in the text. Note that this will produce a huge web page if the text is more than a few paragraphs. Transmitting this page to your computer and displaying it may take several minutes or more, depending on the actual size.
Enter a single word for display: this choice will produce a concordance on just the word (or pattern - see below) that you specify. You will normally use this option repeatedly as you study each word of interest. You must, of course, enter the word you are interested in (or the pattern - see below).
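
To get a feel for why "Display all words" grows so quickly, here is a rough back-of-the-envelope estimate in Python. It is a sketch only: the function name estimate_output_chars is invented, words are assumed to be separated by spaces, and the estimate ignores any HTML markup the real program adds, so actual pages may be larger.

```python
def estimate_output_chars(text, context=100):
    """Rough upper bound on the characters in an all-words concordance.

    Each word gets one line holding the word itself plus `context`
    characters before it and `context` characters after it.
    """
    words = text.split()
    return sum(2 * context + len(word) for word in words)

sample = "the quick brown fox " * 1000   # a 4,000-word text
print(estimate_output_chars(sample, context=100))
```

Even though the sample text is only about 20,000 characters long, the estimate is many times larger, because every word repeats its surrounding context.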

Step 3: Choose Matching Type (only used for single word display)

Whole word match: will only list whole words. If you supply "bat" as the word, "batter" will not be listed in the concordance output.
Partial word match: will match any occurrence of the word even if it is inside a larger word. If you supply "bat" as the word, "batter" will be listed in the concordance output.
Regular expression match: using a rather arcane but powerful syntax (called Regular Expressions), you can specify a pattern to be used as the "word" to be listed. For example, typing ke\w*an will generate a concordance on all words beginning with ke- and ending with -an. Further instructions on how to form these patterns can be found in the Regular Expression Help.

Note that there is a link on the concordance screen to Regular Expression Help.
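
The three matching types can be imitated with Python's re module. This is a sketch for illustration only; the function name build_pattern and the mode names are invented, and the server's actual matching rules may differ (for example in how it treats upper and lower case, or in whether it adds word boundaries around a regular expression).

```python
import re

def build_pattern(word, mode):
    """Turn the user's word into a compiled pattern for the chosen mode."""
    if mode == "whole":
        # \b marks a word boundary, so "bat" will not match inside "batter".
        return re.compile(r"\b" + re.escape(word) + r"\b")
    if mode == "partial":
        return re.compile(re.escape(word))
    if mode == "regex":
        # Assumption: the pattern is used exactly as typed.
        return re.compile(word)
    raise ValueError("unknown mode: " + mode)

text = "The batter swung the bat at a wombat."
whole = build_pattern("bat", "whole").findall(text)
partial = build_pattern("bat", "partial").findall(text)
print(len(whole), len(partial))   # 1 whole-word match, 3 partial matches

found = build_pattern(r"ke\w*an", "regex").findall("kebakaran keadaan makan kelas")
print(found)                      # ['kebakaran', 'keadaan']
```

In this sketch a regular expression typed without \b could also match inside a longer word, so adding \b at each end is the safer way to ask for whole words.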

Step 4: Enter how many characters to display before and after the word

Enter a number (normally between 40 and 200) in the "Context Size" box.

Finally:

Click on the Submit button to create and display the concordance results. The actual work is done on the SEAsite server and the results are sent back to your computer and displayed on the screen by your browser. This may take anywhere from a second to several minutes, depending on the amount of data to be sent back and forth and processed on the server.

The concordance output consists of

the original text processed in a scrollable text box
zero or more concordance lines (the % number in the right-most column is the approximate position of the line in the text)
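
The % number can be thought of as "how far into the text the word was found". The Python sketch below shows one plausible way to compute it; the function name position_percent and the exact formula (character offset divided by text length, rounded) are assumptions, since this page does not say how the program calculates the figure.

```python
def position_percent(text, offset):
    """Approximate position of a match as a percentage of the whole text."""
    if not text:
        return 0
    return round(100 * offset / len(text))

text = "one two three four five six seven eight nine ten"
offset = text.find("six")
print(position_percent(text, offset))   # 50: "six" starts halfway through
```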

You can now

examine the results on-screen
save the results to disk (in Internet Explorer, use File/Save and specify Web page, complete (*.htm, *.html); use an Encoding of Western European (Windows), or Unicode (UTF-8) for Thai)
print the results from your browser

And then...

You can use your browser's Back button to go back and run a new concordance on a new word or corpus. When you return to that screen, the Reset button will clear all fields to default values.