Pattern Specification using Regular Expressions

Overview
    Try It Out
Regular Expression Syntax
    Pattern Specifiers
    Repetition Specifiers
    Specifying Alternatives
    Grouping
Examples
    Indonesian
    Tagalog
    Thai

Overview

The concordance programs available on this web site allow you to specify a pattern in place of a single word. The concordance results will then be generated on any text that matches the pattern you specify rather than just one particular word. Possible patterns could be:

a multi-word phrase or collocation, e.g. fiscal year
non-contiguous text, e.g. neither one thing nor the other
inflected forms, e.g. all words starting with ke- and ending with -an

You may be familiar with a common computer notation to specify a pattern. For example,

delete *.txt

will delete all files with a .txt extension. The * is called a wildcard and matches any sequence of characters.

Regular expressions are much more powerful and flexible and allow you to specify almost any imaginable pattern. Fully mastering the syntax of regular expressions is difficult - but learning how to use enough of it to be useful is easy.

A regular expression is a combination of regular characters and special characters and symbols that together indicate the pattern. For example, the following specifies the pattern "all words starting with ke- and ending with -an":

\bke\w*an\b

\b = word boundary (the next part of the pattern, the ke, must begin a word: it is preceded by a space, tab, line break, or punctuation mark, or it starts the text)
ke = letters to be matched literally
\w* = any number of letters (word characters)
an = letters to be matched literally
\b = word boundary (the previous part of the pattern, the an, must end a word: it is followed by a space, tab, line break, or punctuation mark, or it ends the text)

Try It Out

Before (and after) you study the following material, try out the mechanics of searching a text for a pattern.

Start the Concordance Program
Select the Paste text to use button
Select the Enter a single word to display button
Select the Regular Expression match button
Now type in some text in the text box: for example, to use the example above, type or copy and paste from here:

Kebakaran Hutan.
Hutan dan lahan yang terbakar sudah mencapai 155.611,58 hektar.
Kerugian kebakaran hingga awal April ini mencapai Rp 2.672.880.600.000.

Now type or copy from here into the Step 2 word box the pattern: \bke\w*an\b
Click the Submit button.
You will see the results with 2 occurrences of kebakaran and 1 of kerugian.
Hit the back button and modify the pattern to \b\w*bakar\w*\b
Submit again. You see three occurrences of words with bakar in them.

You get the idea. You will learn how to form such pattern specifiers below. This is just to get the mechanics down, so you can practice creating and testing your own patterns when you are ready.
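
The same walkthrough can be reproduced offline. The following is a minimal sketch that assumes Python's standard re module; the concordance program's own regular expression engine is not named here and may differ in details. The text is lowercased first, mirroring what the concordance is described as doing internally.

```python
import re

# Sample text from the walkthrough, lowercased as the concordance does internally.
text = """Kebakaran Hutan.
Hutan dan lahan yang terbakar sudah mencapai 155.611,58 hektar.
Kerugian kebakaran hingga awal April ini mencapai Rp 2.672.880.600.000.""".lower()

# All words starting with ke and ending with an.
ke_an_words = re.findall(r"\bke\w*an\b", text)

# All words containing bakar anywhere.
bakar_words = re.findall(r"\b\w*bakar\w*\b", text)

print(ke_an_words)   # ['kebakaran', 'kerugian', 'kebakaran']
print(bakar_words)   # ['kebakaran', 'terbakar', 'kebakaran']
```

The raw-string prefix (r"...") keeps Python from interpreting the backslashes before the pattern reaches the matcher.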

Regular Expression Syntax

Pattern Specifiers

Here is a table of some of the special characters and symbols that can be used in a regular expression.

Note: at present, all text submitted to the concordance is internally converted to lower case, so any pattern which searches for an explicit uppercase letter or letters will fail.

Pattern Specifier	Characters it matches	Example
\d	Any digit from 0..9	\d\d matches 72 but not aa or 7a
\D	Any character that is not a digit	\D\D\D matches jim but not 123
\w	Any word character: A..Z, a..z, 0..9, underscore	\w\w\w\w matches Ab_2 but not $%& or Ab_&
\W	Any non-word character	\W matches & but not m
\s	Any space, tab, or line break	form\sfeed matches form feed, or form at the end of one line with feed starting the next, but not formfeed
\S	Any character that is not a space, tab, or line break	i.e. every visible character
.	Any single character except a line break	a.c matches abc or a-c but not ac
[...]	Any one of the characters shown between the []	[abc] will match a single a or b or c but nothing else. [a-z] will match any one lower case letter
[^...]	Any character except one of those inside the []	[^abc] will match any character except a or b or c; A or B or C would match. [^a-z] will match any one character that is not a lower case letter
\b	Any word boundary	Usually at a space, but also at a tab, line break, or punctuation mark, or at the start or end of the text
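
The examples in the table can be checked mechanically. This is a sketch using Python's re module (an assumption; it is not necessarily the engine behind the concordance). The helper name matches is made up for illustration, and re.fullmatch is used so that a pattern must account for the whole test string.

```python
import re

def matches(pattern, s):
    """Return True when the whole string s matches pattern (illustrative helper)."""
    return re.fullmatch(pattern, s) is not None

# (pattern, test string) -> expected outcome, taken from the table rows.
checks = {
    (r"\d\d", "72"): True,
    (r"\d\d", "7a"): False,
    (r"\D\D\D", "jim"): True,
    (r"\D\D\D", "123"): False,
    (r"\w\w\w\w", "ab_2"): True,
    (r"\w\w\w\w", "ab_&"): False,
    (r"form\sfeed", "form feed"): True,
    (r"form\sfeed", "form\nfeed"): True,   # line break between the words
    (r"form\sfeed", "formfeed"): False,
    (r"[abc]", "b"): True,
    (r"[^abc]", "b"): False,
    (r"[^abc]", "d"): True,
}

results = {key: matches(*key) for key in checks}
print(results == checks)   # True
```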

Repetition Specifiers

Often you want to specify a pattern where certain character classes can repeat, as in the \w example above, where four word characters are wanted and \w had to be written four times. There are some shortcut notations for this.

Repetition Character	Meaning	Example
{n}	match n of the previous item	9{3} matches 999 but not 9 or 99 or 9999
{n,}	match n or more of the previous items	a{2,} matches aa or aaa or aaaa, etc. but not a
{n,m}	match at least n but no more than m of the previous item. If n is 0, the item is optional	x{2,4} matches xx or xxx or xxxx but not x or xxxxx
?	match the previous item 0 or 1 times, making it optional	123-?456 matches 123-456 or 123456
+	match the previous item 1 or more times	x+ matches x or xx or xxx ...
*	match the previous item 0 or more times	x* matches any number of x's (0 ... many). \w* matches any number of word characters.
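
A quick way to see each repetition specifier at work is to filter a few candidate strings. The sketch below assumes Python's re module; keep_full_matches is a hypothetical helper name, not part of any library.

```python
import re

def keep_full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(keep_full_matches(r"9{3}", ["9", "99", "999", "9999"]))
# ['999']
print(keep_full_matches(r"a{2,}", ["a", "aa", "aaa", "aaaa"]))
# ['aa', 'aaa', 'aaaa']
print(keep_full_matches(r"x{2,4}", ["x", "xx", "xxx", "xxxx", "xxxxx"]))
# ['xx', 'xxx', 'xxxx']
print(keep_full_matches(r"123-?456", ["123-456", "123456", "123--456"]))
# ['123-456', '123456']
print(keep_full_matches(r"x+", ["", "x", "xx"]))
# ['x', 'xx']
print(keep_full_matches(r"x*", ["", "x", "xx"]))
# ['', 'x', 'xx']
```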

Specifying Alternatives

Using the [...] notation, we can specify alternative single characters. Combining this with the {n} notation, we could specify a number of repetitions of alternative single characters. For example, [123]{3} would match 111, 112, 113, 121, 122, 123, 131, 132, 133, 211, 212, 213, 221, 222, 223, 231, 232, 233, 311, 312, 313, 321, 322, 323, 331, 332, 333.
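
The list above has 27 entries (3 choices for each of 3 positions). As a sketch, assuming Python's re and itertools modules, the same list can be generated by testing every three-digit string against the pattern:

```python
import re
from itertools import product

# Every three-character string of digits, from 000 to 999.
three_digit_strings = ["".join(p) for p in product("0123456789", repeat=3)]

# Keep only those made up entirely of the characters 1, 2, and 3.
matched = [s for s in three_digit_strings if re.fullmatch(r"[123]{3}", s)]

print(len(matched))    # 27
print(matched[:4])     # ['111', '112', '113', '121']
print(matched[-1])     # 333
```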

The alternation character (|) allows you to provide broader choices. For example, Jim|James would match either Jim or James. Combined with grouping, this feature is even more useful.

Grouping

You can specify a group with ().

It is often necessary to use a group with alternation to express what you want. Suppose you have the following text (this example uses all lower case because of the Note above):

some stuff...
mr homes
dr watson ...

and you want to pick out all the title-name pairs. You might try:

mr|mrs|ms|dr [a-z]*

intending to say "match any title followed by a space, followed by any number of lower case letters". Note that this would not correctly capture mr o'hare. We could fix that:

mr|mrs|ms|dr [a-z]\'?[a-z]*

(The \'? means 0 or 1 single quotes. A \ is required before a number of special punctuation marks, such as ' and * and ?, when they are to be matched literally.)

But there is another problem. If the pattern matcher tries to find all the matches in the text (as the concordance programs do), then it will first match the mr alone, because the | says that mr is acceptable by itself as a match. It sees all of dr [a-z]\'?[a-z]* as the last "or" alternative. So the matches will be:

mr
dr watson

We can fix this very simply with grouping. Make it clear that the titles are a group and that one of the items in the group must match before the rest can be checked. Similarly, make a second group to say "a name with one letter, a single quote, and then some more letters OR just a series of letters":

(mr|mrs|ms|dr) ([a-z]\'?[a-z]*|[a-z]*)

The matches now will be:

mr homes
dr watson

You can see that this can get tricky. Play with simple examples first, and work your way to more complex expressions/patterns. Studying the examples below should also help.
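
The difference between the ungrouped and grouped patterns can be observed directly. This is a sketch assuming Python's re module, whose treatment of alternation agrees with the behavior described above; the concordance program's engine is assumed, not known, to behave the same way.

```python
import re

text = "some stuff...\nmr homes\ndr watson ..."

# Without grouping, "mr" is a complete alternative on its own.
ungrouped = r"mr|mrs|ms|dr [a-z]\'?[a-z]*"
# With grouping, a title must be followed by a space and a name.
grouped = r"(mr|mrs|ms|dr) ([a-z]\'?[a-z]*|[a-z]*)"

# m.group(0) is the whole matched text, regardless of any groups in the pattern.
ungrouped_matches = [m.group(0) for m in re.finditer(ungrouped, text)]
grouped_matches = [m.group(0) for m in re.finditer(grouped, text)]
ohare_matches = [m.group(0) for m in re.finditer(grouped, "mr o'hare")]

print(ungrouped_matches)   # ['mr', 'dr watson']
print(grouped_matches)     # ['mr homes', 'dr watson']
print(ohare_matches)       # ["mr o'hare"]
```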

Examples

The rules and examples shown above must be studied and memorized, but the best way to see how they work and learn to use them is by example. The sections below give some language specific examples that will help you understand how to form useful patterns for linguistic study, and will provide some ideas for your own explorations.

Indonesian

A quick syntax review:

\b matches a word boundary
\w* matches 0 to many word characters (e.g. letters)
\w+ matches 1 to many word characters
(xx|yy) matches "xx" or "yy"
. matches any one character
.{0,15} matches any combination of 0 to 15 characters. These are "curly braces". Note the '.' as the previous character
.{8} matches exactly 8 characters. Note the '.'
(one|two|three) - groups alternatives

You can string these together. Here are some examples:

Regular Expression Pattern	Description/Example
\bme\w*\b	all words starting with me
\b(meng|meny)\w*\b	all words starting with meng or meny. Note the ().
\b(meng|meny)\w*an\b	all words starting with meng or meny and ending with an
\b(me(ng|m|n|ny))\w*\b	all words starting with meng or mem or men or meny
\bke\w*an\b	all ke..an words
\b\w*y\w*\b	all words with a y anywhere
\b\w+y\w*\b	all words with a y where y is not the first letter
\ba multiword phrase\b	matches that exact phrase: a multiword phrase
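\bword1 .{1,15} word2\b	matches "word1" followed by 1 to 15 "any" characters, followed by "word2" (note the spaces after word1 and before word2)
\bbaik .{1,20} maupun\b	matches baik, then 1 to 20 "any" characters, then maupun, e.g. baik orang yang miskin maupun in the text baik orang yang miskin maupun orang yang kaya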
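
Several of the patterns in the table can be tried on a small sample. The sketch below assumes Python's re module. The word list is an illustrative sample invented for this example, not text from the concordance site, and find_all is a hypothetical helper; the phrase is the one quoted in the last table row.

```python
import re

def find_all(pattern, text):
    """Return the full text of every match (finditer sidesteps findall's
    habit of returning only the groups when a pattern contains them)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Illustrative word list (an assumption), already in lower case.
words = "mengambil menyanyi membaca menulis melihat kebakaran yang"
# Phrase quoted in the table.
phrase = "baik orang yang miskin maupun orang yang kaya"

me_words = find_all(r"\bme\w*\b", words)
meng_meny = find_all(r"\b(meng|meny)\w*\b", words)
prefixed = find_all(r"\b(me(ng|m|n|ny))\w*\b", words)
any_y = find_all(r"\b\w*y\w*\b", words)
inner_y = find_all(r"\b\w+y\w*\b", words)
baik_maupun = find_all(r"\bbaik .{1,20} maupun\b", phrase)

print(me_words)      # ['mengambil', 'menyanyi', 'membaca', 'menulis', 'melihat']
print(meng_meny)     # ['mengambil', 'menyanyi']
print(prefixed)      # ['mengambil', 'menyanyi', 'membaca', 'menulis']
print(any_y)         # ['menyanyi', 'yang']
print(inner_y)       # ['menyanyi']
print(baik_maupun)   # ['baik orang yang miskin maupun']
```

The text between baik and maupun here is 17 characters long, so a limit of {1,15} would find nothing in this phrase while {1,20} succeeds.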

Tagalog

Thai