Regular Expressions Primer

Regular Expressions Primer Overview

The Regular Expressions Primer is a tutorial for those completely new to regular expressions. To familiarize you with regular expressions, this primer starts with the simple building blocks of the syntax and, through examples, builds up to complex expressions.

Regular expressions are embedded in programs to parse text. For example, an Expect Tcl script might contain the following command to log in anonymously to an FTP server:

  expect -re "([Ll]ogin|Username):.*" {
      send "anonymous\r"
  }

The ([Ll]ogin|Username) regular expression matches the following strings: "Login", "login", and "Username". The trailing :.* then matches the colon and the rest of the prompt. When one of these prompts is matched, Expect sends the string "anonymous" to the FTP server. Note that the -re flag must precede every regular expression passed to the expect command. Without -re, expect defaults to matching glob patterns. In glob pattern matching, the wildcards are "*" (which matches any sequence of characters) and "?" (which matches a single character). A literal "*", "?" or "\" can be matched by escaping it with a backslash. See the Tcl Glob Manual Page for more information.
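
As a rough illustration outside Expect, the same pattern can be tried in Python. This is only a sketch: Python's re and fnmatch modules stand in for Tcl's regular expression and glob matching, the sample prompt strings are invented for the example, and the two languages differ in some details.

```python
import re
from fnmatch import fnmatchcase

# The prompt pattern from the Expect example, compiled with Python's re module.
prompt = re.compile(r"([Ll]ogin|Username):.*")

# Hypothetical server prompts, made up for this illustration.
samples = ["Login: ", "login:", "Username: guest", "Password:"]
regex_hits = [s for s in samples if prompt.search(s)]
print(regex_hits)   # ['Login: ', 'login:', 'Username: guest']

# Glob matching: "?" stands for one character, "*" for any sequence.
glob_hits = [s for s in samples if fnmatchcase(s, "?ogin:*")]
print(glob_hits)    # ['Login: ', 'login:']
```

The glob pattern cannot express "Login or Username" in one pattern, which is why the Expect example switches to a regular expression.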

About Regular Expressions

Regular expressions describe patterns of characters that are matched against text strings. They can be used as a tool to search for and replace text, manipulate data, or test for a certain condition in a string of characters. Many everyday tasks can be accomplished with regular expressions, such as checking for the occurrence of a specific word or phrase in the body of an e-mail message, or finding specific file types, such as .txt files, in a folder or directory. Regular expressions are also called "regex", "regexes", "regexps", or "RE". This primer uses the terms "regular expressions", "regex", and "regexes" interchangeably.

About Regex Syntax

Regular expressions use syntax elements composed of alphanumeric characters and symbols. For example, the regex (2) searches for the number 2, while the regex ([1-9][0-9]{2}-[0-9]{4}) matches a regular 7-digit phone number.

There are many flavors and types of regular expression syntax. These variations are found in various tools, languages and operating systems. For example, Tcl, Perl, Python, grep, sed, VI, and Unix all use variations on standard regex syntax. This primer focuses on standard regex patterns not tied to a specific language or tool. This standard syntax can be later applied to the specific language, tool or application of your choice.

Building Simple Patterns

Complete regular expressions are constructed using characters as small building block units. Each building block is in itself simple, but since these units can be combined in an infinite number of ways, knowing how to combine them to achieve a goal takes some practice. This section shows you how to build regexes through examples ranging from the simple to the more complex.

Matching Simple Strings

The simplest and most common type of regex is an alphanumeric string that matches itself, called a "literal text match". A literal text regex matches anywhere along a string. For example, a literal string matches itself when placed alone, and at the beginning, middle, or end of a larger string. Literal text matches are case sensitive.

Using regexes to search for simple strings.

Example 1: Search for the string "at".

Regex:
```
    at
   
```
Matches:
```
    at
    math
    hat
    ate
   
```
Doesn't Match:
```
    it
    a-t
    At
   
```

Example 2: Search for the string "email".

Regex:
```
    email
   
```

Matches:

    email
    emailing
    many_emails

Doesn't Match:
```
    Email
    EMAILing
    e-mails
   
```

Example 3: Search for the alphanumeric string "abcdE567".

Regex:
```
    abcdE567
   
```

Matches:

    abcdE567
    AabcdE567ing
    text_abcdE567

Doesn't Match:
```
    SPAMabCdE567
    ABCDe567
   
```

Note: Regular expressions are case sensitive unless case is deliberately modified.
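
The literal matches above can be checked with a short script. This is a minimal sketch that assumes Python's re module (the primer itself is not tied to a language); re.search is used because it looks for the pattern anywhere in the string, and the re.IGNORECASE flag is Python's way of deliberately modifying case sensitivity.

```python
import re

candidates = ["at", "math", "hat", "ate", "it", "a-t", "At"]

# re.search looks for the pattern anywhere in the string.
matches = [s for s in candidates if re.search("at", s)]
print(matches)             # ['at', 'math', 'hat', 'ate']

# Case sensitivity is the default; a flag relaxes it deliberately.
relaxed = [s for s in candidates if re.search("at", s, re.IGNORECASE)]
print(relaxed)             # ['at', 'math', 'hat', 'ate', 'At']
```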

Searching with Wildcards

In the previous examples, regular expressions are constructed with literal characters that match themselves. There are other characters in regex syntax that match in a more generalized way. These are called "metacharacters". Metacharacters do not match themselves, but rather perform a specific task when used in a regular expression. One such metacharacter is the dot ".", or wildcard. When used in a regular expression, the wildcard can match any single character.

Using the wildcard to match any character.

Example 1: Use a wildcard to search for any one character before the string "ubject:".

Regex:
```
    .ubject:
   
```

Matches:

    Subject:
    subject:
    Fubject:

Doesn't Match:
```
    Subject
    subject
   
```

Example 2: Use three dots "..." to search for any three characters within a string.

Regex:
```
    t...s
   
```

Matches:

    trees
    tEENs
    t345s
    t-4-s

Doesn't Match:
```
   Trees
   twentys
   t1234s
   
```

Example 3: Use several wildcards to match characters throughout a string.

Regex:
```
    .a.a.a
   
```

Matches:

    Canada
    alabama
    banana
    3a4a5a

Doesn't Match:
```
    aaa
   
```
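
The wildcard examples can be tried as follows. This sketch assumes Python's re module; the note about newlines describes Python's default behavior and may differ in other tools.

```python
import re

words = ["trees", "tEENs", "t-4-s", "Trees", "twentys", "t1234s"]

# Each dot must consume exactly one character, so "t...s" spans five characters.
three_any = [w for w in words if re.search(r"t...s", w)]
print(three_any)           # ['trees', 'tEENs', 't-4-s']

# In Python the dot skips newlines unless re.DOTALL is given.
print(re.search(r"a.b", "a\nb") is None)              # True
print(re.search(r"a.b", "a\nb", re.DOTALL) is None)   # False
```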

Searching for Special Characters

In regular expression syntax, many non-alphanumeric characters are treated as special characters. These characters, called "metacharacters", include asterisks, question marks, dots, backslashes, etc. To search for a metacharacter without invoking its special meaning, precede it with a backslash "\" to change it into a literal character. For example, to build a regex that searches for a .txt file, precede the dot with a backslash, \.txt, to suppress the dot's special function as a wildcard. The backslash, called an "escape character" in regex terminology, turns metacharacters into literal characters.

Precede the following metacharacters with a backslash "\" to search for them as literal characters:

^ $ + * ? . | ( ) { } [ ] \

Using the backslash "\" to escape special characters in a regular expression.

Example 1: Escape the dollar sign "$" to find the string "$100".

Regex:
```
    \$100
   
```
Matches:
```
    $100
    $1000
   
```
Doesn't Match:
```
    $10
    100
   
```

Example 2: Use the dot "." as a literal character to find a file called "email.txt".

Regex:
```
    email\.txt
   
```
Matches:
```
    email.txt
   
```
Doesn't Match:
```
    email
    txt
    email_txt
   
```

Example 3: Escape the backslash "\" character to search for a Windows file.

Regex:
```
    c:\\readme\.txt
   
```
Matches:
```
    c:\readme.txt
   
```

Doesn't Match:

    c:\\readme.txt
    d:\readme.txt
    c:/readme.txt
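
Escaping can be sketched in Python as shown here. The example assumes Python's re module; raw strings (the r"..." prefix) keep Python itself from interpreting the backslashes, and re.escape is a Python helper that adds the escapes for a literal string.

```python
import re

filename = r"c:\readme.txt"      # raw string: one literal backslash
pattern = r"c:\\readme\.txt"     # escaped backslash and escaped dot

print(bool(re.search(pattern, filename)))          # True
print(bool(re.search(pattern, "c:/readme.txt")))   # False

# re.escape builds the escaped form of a literal string automatically.
escaped = re.escape("$100")
print(escaped)                                     # \$100
print(bool(re.search(escaped, "Price: $1000")))    # True
```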

Ranges and Repetition

Regex syntax includes metacharacters that specify the number of times a particular character or string must match. This group of metacharacters is called "quantifiers"; they influence the quantity of matches found. A quantifier acts on the element immediately preceding it, which can be a digit, a letter, a space, or another metacharacter such as the dot ".". This section demonstrates how quantifiers search using ranges and repetition.

Ranges, {min,max}

Ranges are considered "counting quantifiers" in regular expressions because they specify the minimum number of matches to find and the maximum number of matches to allow. Use ranges in regex searches when a bound, or limit, should be placed on the number of repetitions. For example, the range {3,5} matches an item at least 3 times, but not more than 5 times. Applied to the letter "a", the regex a{3,5} matches the strings "aaa", "aaaa", and "aaaaa". If only a single number is given within the curly braces, as in {3}, the pattern matches exactly that many items. For example, the regex b{3} matches the string "bbb".

Using ranges to identify search patterns.

Example 1: Match the preceding "0" at least 3 times with a maximum of 5 times.

Regex:
```
    60{3,5} years
   
```

Matches:

    6000 years
    60000 years
    600000 years

Doesn't Match:

    60 years
    600 years
    6003 years
    6000000 years

Example 2: Use the "." wildcard to match any character sequence two or three characters long.

Regex:
```
    .{2,3}
   
```
Matches:
```
    404
    44
    com
    w3
   
```
Doesn't Match:
```
    4
    a
   
```

Example 3: Match the preceding "e" exactly twice.

Regex:
```
    be{2}t
   
```
Matches:
```
    beet
   
```
Doesn't Match:
```
    bet
    beat
    eee
   
```

Example 4: Match the preceding "w" exactly three times.

Regex:
```
    w{3}\.mydomain\.com
   
```
Matches:
```
    www.mydomain.com
   
```

Doesn't Match:

    web.mydomain.com
    w3.mydomain.com

Repetition, ?*+

Unlike range quantifiers, the repetition quantifiers (question mark "?", asterisk "*", and plus "+") place few limits on a regex search, and they are greedy. This is significant because a greedy quantifier is satisfied by its minimum number of required matches, but always attempts to match as many times as possible, up to the maximum allowed. The question mark "?" matches the preceding element 0 or 1 times, the asterisk "*" matches the preceding element 0 or more times, and the plus "+" matches the preceding element 1 or more times. Use repetition quantifiers when an element is optional or when its number of repetitions has no fixed upper bound.

Using repetition to search for repeated characters with few limits.

Example 1: Use "?" to match the "u" character 0 or 1 times.

Regex:
```
    colou?r
   
```
Matches:
```
    colour
    color
   
```
Doesn't Match:
```
    colouur
    Colour
   
```

Example 2: Use "*" to match the preceding item 0 or more times; use "." to match any character.

Regex:
```
    www\.my.*\.com
   
```

Matches:

    www.mysite.com
    www.mypage.com
    www.my.com

Doesn't Match:
```
    www.oursite.com
    mypage.com
   
```

Example 3: Use "+" to match the preceding "5" at least once.

Regex:
```
    bob5+@foo\.com
   
```

Matches:

    bob5@foo.com
    bob5555@foo.com

Doesn't Match:

    bob@foo.com
    bob65555@foo.com
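
Greedy behavior is easiest to see by printing what a quantifier actually consumed. The sketch below assumes Python's re module; the sample text with two host names on one line is invented for the illustration.

```python
import re

# Hypothetical text holding two host names on one line.
text = "www.my.site.com and www.mypage.com"

# Greedy ".*" runs to the last ".com" it can reach, not the first.
greedy = re.search(r"www\.my.*\.com", text).group()
print(greedy)              # www.my.site.com and www.mypage.com

# "?" makes the "u" optional; "+" takes every "5" available.
optional_u = [w for w in ["color", "colour", "colouur"] if re.search(r"colou?r", w)]
print(optional_u)          # ['color', 'colour']
print(re.search(r"bob5+", "bob5555@foo.com").group())   # bob5555
```

Because ".*" takes as much as it can, a pattern like this one may span more text than intended when several candidates share a line.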

Quantifier Summary

The following table defines the various regex quantifiers. Each quantifier sets its own minimum and maximum number of repetitions for the element it follows.

Quantifier	Description
{num}	Matches the preceding element num times.
{min,max}	Matches the preceding element at least min times, but not more than max times.
?	Matches the preceding element 0 or 1 times.
*	Matches the preceding element 0 or more times.
+	Matches the preceding element 1 or more times.
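
The table can be reproduced empirically. The following sketch assumes Python's re module and uses re.fullmatch, which requires the whole string to fit the pattern, so that only the repetition count decides the outcome. The letter "a" and the upper limit of four repetitions are arbitrary choices for the demonstration.

```python
import re

quantifiers = ["a{2}", "a{1,3}", "a?", "a*", "a+"]

# For each quantifier, list which repeat counts (0 to 4) of "a" it accepts.
accepted = {q: [n for n in range(5) if re.fullmatch(q, "a" * n)] for q in quantifiers}
for q, counts in accepted.items():
    print(q, counts)
# a{2} [2]
# a{1,3} [1, 2, 3]
# a? [0, 1]
# a* [0, 1, 2, 3, 4]
# a+ [1, 2, 3, 4]
```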

Using Conditional Expressions

Conditional expressions help qualify and restrict regex searches, increasing the probability of a desirable match. The vertical bar "|", meaning "OR", tells the regex to match either the expression on its left or the expression on its right. Because the regex has a list of alternate choices to evaluate, this technique is called "alternation". To search for either one string or another, insert a vertical bar "|" between the alternatives.

Example 1: Use "|" to alternate a search for various spellings of a string.

Regex:
```
    gray|grey
   
```
Matches:
```
    gray
    grey 
   
```
Doesn't Match:
```
    GREY
    Gray
   
```

Example 2: Use "|" to alternate a search for either email or Email or EMAIL or e-mail.

Regex:
```
    email|Email|EMAIL|e-mail
   
```

Matches:

    email
    Email
    EMAIL
    e-mail

Doesn't Match:
```
    EmAiL
    E-Mail
   
```
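
Alternation can be exercised with the same word list as Example 2. This is a minimal sketch assuming Python's re module.

```python
import re

spellings = re.compile(r"email|Email|EMAIL|e-mail")
words = ["email", "Email", "EMAIL", "e-mail", "EmAiL", "E-Mail"]

# A word passes if any one of the four alternatives occurs in it.
hits = [w for w in words if spellings.search(w)]
print(hits)                # ['email', 'Email', 'EMAIL', 'e-mail']

# group() reports which alternative was actually found.
print(re.search(r"gray|grey", "a grey sky").group())   # grey
```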

Grouping Similar Items in Parentheses

Use parentheses to enclose a group of related search elements. Parentheses limit the scope of alternation and create subpatterns that quantifiers and other metacharacters can act on as a unit. For example, use parentheses to group the expression (abc), then apply the range quantifier {3} to find instances of the string "abcabcabc".

Along with grouping expressions into subpatterns, parentheses also capture each part of a matched string. It should be noted that this further parsing requires additional time and resources to complete.

Using parentheses to group regular expressions.

Example 1: Use parentheses and a range quantifier to find instances of the string "abcabcabc".

Regex:
```
    (abc){3}
   
```
Matches:
```
    abcabcabc
    abcabcabcabc
   
```
Doesn't Match:
```
    abc
    abcabc
   
```

Example 2: Use parentheses to limit the scope of alternative matches on the words gray and grey.

Regex:
```
    gr(a|e)y
   
```
Matches:
```
    gray
    grey
   
```
Doesn't Match:
```
    gry
    graey
   
```

Example 3: Use parentheses and "|" to locate past correspondence in a mail-filtering program. This regex finds a 'To:' or a 'From:' header immediately followed (with no space in between) by either the word 'Smith' or the word 'Chan'.

Regex:
```
    (To:|From:)(Smith|Chan)
   
```

Matches:

    To:Smith
    To:Chan
    From:Smith
    To:Smith, Chan
    To:Smithe
    From:Channel4News

Doesn't Match:

    To:smith
    To:All
    To:Schmidt
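
Both roles of parentheses, scoping and capturing, show up in a short experiment. The sketch assumes Python's re module, where group(0) is the whole match and group(1), group(2) are the captured subpatterns; other tools expose captures differently.

```python
import re

header = re.compile(r"(To:|From:)(Smith|Chan)")
m = header.search("From:Channel4News")
print(m.group(0))   # From:Chan
print(m.group(1))   # From:
print(m.group(2))   # Chan

# Without parentheses the alternation splits the whole pattern in two:
# "gra|ey" means "gra" OR "ey", so it matches inside "key".
print(bool(re.search(r"gra|ey", "key")))      # True
print(bool(re.search(r"gr(a|e)y", "key")))    # False

# A quantifier after a group repeats the whole group.
print(bool(re.search(r"(abc){3}", "abcabc")))  # False
```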

Matching Sequences

You can build a regular expression to match any one of a set of characters. These sets, called "character classes", are written by placing characters side-by-side within square brackets "[]". An item in a character class can be either an ordinary character, representing itself, or a metacharacter, performing a special function. This section covers how to build simple character classes, prevent matches with character classes, and construct compound character classes with metacharacters.

Building Simple Character Classes

The most basic type of character class is a set of characters placed side-by-side within square brackets "[]". For example, the regular expression [bcr]at matches the words "bat", "cat", and "rat" because its first element is a character class that accepts "b", "c", or "r". A character class matches only a single character unless a quantifier is placed after the closing bracket. For examples using quantifiers with character classes, see Compound Character Classes. The following examples show how to use simple character classes in regex searches.

Note: When placed inside a character class, the hyphen "-" metacharacter denotes a continuous range of letters or numbers. For example, [a-d] denotes the continuous sequence a, b, c, and d. When a hyphen is used elsewhere in a regex, it matches a literal hyphen.

Using simple character classes to perform regex searches.

Example 1: Use a character class to match either case of the letter "s".

Regex:
```
    Java[Ss]cript
   
```
Matches:
```
    JavaScript
    Javascript
   
```
Doesn't Match:
```
    javascript
    javaScript
   
```

Example 2: Use a character class to limit the scope of alternative matches on the words gray and grey.

Regex:
```
    gr[ae]y
   
```
Matches:
```
    gray
    grey
   
```
Doesn't Match:
```
    gry
    graey
   
```

Example 3: Use a character class to match any one digit in the list.

Regex:
```
    [0123456789]
   
```
Matches:
```
    5 
    0
    9
   
```
Doesn't Match:
```
    x
    ?
    F
   
```

Example 4: To simplify the previous example, use a hyphen "-" within a character class to denote a range for matching any one digit in the list.

Regex:
```
    [0-9]
   
```
Matches:
```
    5
    0
    9
   
```
Doesn't Match:
```
    x
    ?
    F
   
```

Example 5: Use a hyphen "-" within a character class to denote an alphabetic range for matching various words ending in "mail".

Regex:
```
    [A-Z]mail
   
```
Matches:
```
    Email
    Xmail
    Zmail
   
```
Doesn't Match:
```
    email
    mail
   
```

Example 6: Match any three or more digits listed in the character class. The open-ended range {3,} sets a minimum of three repetitions and no maximum.

Regex:
```
    [0-9]{3,}
   
```

Matches:

    012
    1234
    555
    98754378623

Doesn't Match:
```
    10
    7
   
```
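
Simple character classes can be tried against a line of text. This sketch assumes Python's re module; re.findall returns every non-overlapping match, which makes it easy to see what each class picks out. The sample strings are made up for the example.

```python
import re

# One character from the set, then the literal "at".
print(re.findall(r"[bcr]at", "bat cat hat rat"))          # ['bat', 'cat', 'rat']

# A range inside the brackets, followed by a quantifier on the class.
print(re.findall(r"[0-9]{3,}", "10 555 7 98754378623"))   # ['555', '98754378623']

# Ranges are case sensitive like any other literal.
print(re.findall(r"[A-Z]mail", "Email email Xmail"))      # ['Email', 'Xmail']
```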

Preventing Matches with Character Classes

Previous examples used character classes to specify exact sequences to match. Character classes can also be used to prevent, or negate, matches with undesirable strings. To prevent a match, use a leading caret "^" (meaning NOT), within square brackets, [^...]. For example, the regex [^a] matches any single character except the letter "a".

Note: The caret symbol must be the first character within the square brackets to negate a character class.

Using character classes to prevent a sequence from matching.

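Example 1: Prevent a match on any numeric string. Use "+" to match an item 1 or more times, so that at least one non-digit character must be present. (With "*", which allows 0 repetitions, the regex could match the empty string and would therefore succeed on every input.)

Regex:
```
    [^0-9]+
```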

Matches:

    abc
    c
    Mail
    u-see
    a4a

Doesn't Match:
```
    1
    42
    100
    23000000
   
```

Example 2: Search for a text file beginning with any character not a lower-case letter.

Regex:
```
    [^a-z]\.txt
   
```
Matches:
```
    A.txt
    4.txt
    Z.txt
   
```
Doesn't Match:
```
    r.txt
    a.txt
    Aa.txt
   
```

Example 3: Prevent a match on the numbers "10" and "12".

Regex:
```
    1[^02]
   
```
Matches:
```
    13
    11
    19
    17
    1a
   
```
Doesn't Match:
```
    10
    12
    42
    a1
   
```
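
Negated classes can be checked in the same way. The sketch below assumes Python's re module; it also shows two details that are easy to miss: a negated class with "*" can match zero characters, and a caret that is not the first character in the brackets is an ordinary character.

```python
import re

tests = ["13", "1a", "10", "12", "a1"]
print([t for t in tests if re.search(r"1[^02]", t)])   # ['13', '1a']

# One or more non-digits: each run between digits is found separately.
print(re.findall(r"[^0-9]+", "abc123def"))             # ['abc', 'def']

# With "*", zero characters also count, so an all-digit string still
# yields a (zero-length) match.
print(repr(re.search(r"[^0-9]*", "42").group()))       # ''

# A caret that is not first in the brackets is an ordinary character.
print(re.findall(r"[a^]", "a^b"))                      # ['a', '^']
```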

Compound Character Classes

Character classes are a versatile tool when combined with various pieces of the regex syntax. Compound character classes can help clarify and define sophisticated searches, test for certain conditions in a program, and filter wanted e-mail from spam. This section uses compound character classes to build meaningful expressions with the regex syntax.

Using compound character classes with the regex syntax.

Example 1: Find a partial e-mail address. Use a character class to denote a match for any number between 0 and 9. Use a range to restrict the number of times a digit matches.

Regex:
```
    smith[0-9]{2}@
   
```
Matches:
```
    smith44@
    smith42@
   
```
Doesn't Match:
```
    Smith34
    smith6
    Smith0a
   
```

Example 2: Search an HTML file to find each instance of a header tag. Allow matches on whitespace after the tag but before the ">".

Regex:
```
    (<[Hh][1-6] *>)
   
```

Matches:

     <H1>
     <h6>
     <H3  >
     <h2    >

Doesn't Match:
```
    <H1
    <   h2>
    <a1>
   
```

Example 3: Match a regular 7-digit phone number. Prevent the digit "0" from leading the string.

Regex:
```
    ([1-9][0-9]{2}-[0-9]{4})
   
```
Matches:
```
     555-5555
     123-4567 
   
```

Doesn't Match:

    555.5555
    1234-567
    023-1234

Example 4: Match a web-style protocol prefix such as "http://". Escape the two forward slashes.

Regex:
```
    [a-z]+:\/\/
   
```

Matches:

    http://
    ftp://
    tcl://
    https://

Doesn't Match:
```
    http
    http:
    1a3:// 
   
```

Example 5: Match a simple e-mail address made up of lower-case letters, digits, underscores, hyphens, and dots.

Regex:

    [a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+

Matches:

    j_smith@foo.com
    j.smith@bc.canada.ca
    smith99@foo.co.uk
    1234@mydomain.net

Doesn't Match:

    @foo.com
    smith.@foo.org
    www.myemail.com
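
When a pattern like the e-mail regex is used for validation, it usually has to fit the entire input rather than a part of it. The sketch below assumes Python's re module, whose fullmatch method imposes that requirement; the helper name is_email is hypothetical, and the pattern is the primer's simplified one, not a complete e-mail grammar.

```python
import re

EMAIL = re.compile(r"[a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+")

def is_email(text):
    """Return True only when the whole of text fits the pattern."""
    return EMAIL.fullmatch(text) is not None

print(is_email("j.smith@bc.canada.ca"))   # True
print(is_email("smith.@foo.org"))         # False
print(is_email(".smith@foo.net"))         # False

# A plain search is more lenient: it finds the valid part of the string.
print(EMAIL.search(".smith@foo.net").group())   # smith@foo.net
```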

Character Class Summary

The following table defines various character class sequences. Use these alphanumeric patterns to simplify your regex searches.

Character Class	Description
`[0-9]`	Matches any digit from 0 to 9.
`[a-zA-Z]`	Matches any alphabetic character.
`[a-zA-Z0-9]`	Matches any alphanumeric character.
`[^0-9]`	Matches any non-digit.
`[^a-zA-Z]`	Matches any non-alphabetic character.

Matching Locations within a String

At times, the pattern to be matched appears at either the very beginning or end of a string. In these cases, use a caret "^" to match a desired pattern at the beginning of a string, and a dollar sign "$" for the end of the string. For example, the regular expression email matches anywhere along the following strings: "email", "emailing", "bogus_emails", and "smithsemailaddress". However, the regex ^email only matches the strings "email" and "emailing". The caret "^" in this example is used to effectively anchor the match to the start of the string. For this reason, both the caret "^" and dollar sign "$" are referred to as anchors in the regex syntax.

Note: The caret "^" has more than one meaning in regular expressions. Its function is determined by its context. The caret can be used as an anchor to match patterns at the beginning of a string, for example: (^File). The caret can also be used as a logical "NOT" to negate the contents of a character class, for example: [^...].

Using anchors to match at the beginning or end of a string.

Example 1: Use "$" to match the ".com" pattern at the end of a string.

Regex:
```
    .*\.com$
   
```
Matches:
```
    mydomain.com 
    a.b.c.com
   
```

Doesn't Match:

    mydomain.org 
    mydomain.com.org

Example 2: Use "^" to match "inter" at the beginning of a string, "$" to match "ion" at the end of a string, and ".*" to match any number of characters within the string.

Regex:
```
    ^inter.*ion$
   
```

Matches:

    internationalization
    internalization

Doesn't Match:
```
    reinternationalization
   
```

Example 3: Use "^" inside parentheses to match "To" and "From" at the beginning of the string.

Regex:
```
    (^To:|^From:)(Smith|Chan)
   
```

Matches:

    From:Chan
    To:Smith
    From:Smith 
    To:Chan

Doesn't Match:

    From: Chan
    from:Smith
    To Chan

Example 4: Perform a search similar to Example 3, this time placing the caret "^" outside the parentheses so that it anchors every alternative in the group.

Regex:

    ^(From|Subject|Date):(Smith|Chan|Today)

Matches:

    From:Smith
    Subject:Chan 
    Date:Today

Doesn't Match:
```
    X-Subject:
    date:Today
   
```
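
The anchor examples can be confirmed with a brief script. This sketch assumes Python's re module. The re.MULTILINE flag, which lets "^" match at the start of every line rather than only at the start of the string, is a Python feature shown here because mail headers usually arrive as several lines; the header text is invented for the illustration.

```python
import re

words = ["email", "emailing", "bogus_emails", "smithsemailaddress"]
print([w for w in words if re.search(r"^email", w)])   # ['email', 'emailing']

print(bool(re.search(r".*\.com$", "mydomain.com.org")))             # False
print(bool(re.search(r"^inter.*ion$", "reinternationalization")))   # False

# Hypothetical multi-line header block; with re.MULTILINE the caret
# anchors at the beginning of each line.
headers = "Date:Today\nFrom:Smith\nX-Subject:Chan"
pairs = re.findall(r"^(From|Subject|Date):(Smith|Chan|Today)", headers, re.MULTILINE)
print(pairs)   # [('Date', 'Today'), ('From', 'Smith')]
```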

More Tcl Regex Resources

ActiveState Programmer Network (ASPN)

re_syntax, Tcl regular expressions syntax
regexp, match a regular expression against a string
regsub, perform substitutions based on regular expression pattern matching

Tcl Developer Xchange

O'Reilly Books

Exploring Expect, "Chapter 5: Regular Expressions", by Don Libes
Mastering Regular Expressions, by Jeffrey Friedl