Leverage Regular Expressions
Use .NET regular expressions to simplify and optimize your text-processing applications.
by Francesco Balena
January 2003 Issue
Technology Toolbox: VB.NET, C#, VB6
Regular expressions provide a simple and efficient way—far beyond the capabilities most conventional programming languages offer—to search and replace text. However, relatively few developers are familiar with this powerful tool. I'll explain how you can leverage regular expressions from VB.NET to solve common parsing problems, such as extracting words, numbers, and dates out of a text document. Most of what you'll learn applies to previous Visual Basic versions too (see the sidebar, "Use Regular Expressions From VB6").
A regular expression is a search pattern that identifies what you're looking for in a source string. You're already familiar with the simplest forms of search patterns, such as the ones you can use at the command prompt when you copy or delete files (for example, DEL A??.*). You can also use regular-expression–like patterns with SQL's LIKE operator in WHERE clauses, or VB's Like operator:
If name Like "[AEIOU]???" Then
    ' name is a 4-char word that
    ' starts with an uppercase vowel
End If
A typical .NET regular-expression operation begins by passing the search pattern to the Regex object's constructor:
Dim re As New Regex("[AEIOU]\w\w\w")
All the regular-expression classes are in the System.Text.RegularExpressions namespace, so all code snippets in this article assume you've used a proper Imports statement. The previous regular expression searches for a string that starts with an uppercase vowel and is followed by three alphanumerical characters (the \w sequence). This first example gives you an idea of why most developers consider regular expressions cryptic and why their use tends to be confined to specialized text-processing languages such as Perl and Awk. The truth is, you can become familiar with regular expressions in a matter of a few hours.
After you've created the Regex object, you can enumerate all occurrences of the searched pattern in a source string by iterating over the collection of results the Matches method returns:
Dim text As String = "Anne Bob Eric"
Dim m As Match
For Each m In re.Matches(text)
    Console.WriteLine("{0} at index " _
        & "{1}", m.Value, m.Index)
Next
The previous code displays these lines in the console window:
Anne at index 0
Eric at index 9
The System.Text.RegularExpressions namespace doesn't expose a large number of classes and methods (see Figure 1). The challenge in putting regular expressions to good use is learning how to build the search pattern. Regular expressions in .NET are a superset of the regular expressions that VBScript 5.0 and later support, so you can leverage your prior knowledge on this topic. I've built an application that lets you test a regular expression against a text file (see Figure 2; the code is available for download).
Become Familiar With Regex Hierarchy
The most common and useful constructs you can use in the search pattern fall into a small number of categories (see Table 1). Character escapes provide a means to insert nonprintable characters in the search string; for example, you can use this sequence to search for the string "Visual Basic .NET" preceded by a tab character and followed by a carriage-return/line-feed pair:
\tVisual Basic \.NET\r\n
Notice that you need to escape the dot character, because unescaped dots have a special meaning inside regular expressions and match any character. Constructs in the character-class category are also simple to grasp. For example, this sequence matches a nonalphanumerical character, followed by a letter, two more alphanumerical characters, a digit, and a white space, such as the ",ndx2 " or " A123 " sequences:
\W[A-Za-z]\w\w\d\s
Notice that the leading nonalphanumerical character and the trailing white space are part of the match—a detail that becomes important if you want to replace the found substring. (I'll cover replace functions later.)
Atomic zero-width assertions are special: They specify where the match must appear in the source string, yet they don't match any character; to put it differently, they don't consume any character in the source text. The most common construct of this type is \b, which marks a word boundary without matching the characters that immediately precede or follow the word. For example, this regular expression finds all five-character words in the source string, without matching the characters before and after each word:
\b\w\w\w\w\w\b
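Here's a minimal sketch of the five-character search in action. The sample sentence and the module name are invented for illustration; the brackets in the output make it easy to see that the surrounding spaces aren't part of the match:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WordBoundaryDemo
    Sub Main()
        ' Sample sentence is invented for illustration.
        Dim text As String = "Seven brave llamas crossed the river"
        Dim re As New Regex("\b\w\w\w\w\w\b")
        For Each m As Match In re.Matches(text)
            ' Value holds only the word; \b consumes no characters.
            Console.WriteLine("[{0}] at index {1}", m.Value, m.Index)
        Next
        ' Displays:
        ' [Seven] at index 0
        ' [brave] at index 6
        ' [river] at index 31
    End Sub
End Module
```

Longer words such as "llamas" are skipped because no word boundary falls after their fifth character.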
The ^ and $ assertions stand for the beginning and the end of the source string. You often use them with the Regex.IsMatch method to test whether a field's contents coincide with the expected format (instead of containing the search string). For example, suppose TextBox1 is expected to contain a three-digit number; you can validate the control's value with regular expressions like this:
Dim re As New Regex("^\d\d\d$")
If re.IsMatch(TextBox1.Text) Then
    ' value is ok
End If
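To see why the anchors matter for validation, you can compare the same pattern with and without them. This is only a sketch; the three sample values stand in for whatever the user might type in the text box:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AnchorDemo
    Sub Main()
        Dim unanchored As New Regex("\d\d\d")
        Dim anchored As New Regex("^\d\d\d$")
        ' Sample values are invented for illustration.
        For Each value As String In New String() {"123", "12345", "a123"}
            Console.WriteLine("{0}: contains={1}, whole field={2}", _
                value, unanchored.IsMatch(value), anchored.IsMatch(value))
        Next
        ' Displays:
        ' 123: contains=True, whole field=True
        ' 12345: contains=True, whole field=False
        ' a123: contains=True, whole field=False
    End Sub
End Module
```

Without ^ and $, IsMatch reports success as soon as three consecutive digits appear anywhere in the value.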
The ^ and $ assertions have a slightly different meaning when you use the regular expression in multiline mode. In that case, they match the beginning and the end of individual lines. By default, the Regex object considers the source text as a continuous stream of characters, as is the case with a text document. You can use multiline mode when you parse a document line-by-line, such as a file containing one record per line, with each field delimited by tab or comma characters. You specify multiline mode by passing an enumerated value to the Regex constructor's second argument:
' find all 4-char words at the
' beginning of each line
Dim re As New Regex("^\w\w\w\w\b", _
RegexOptions.Multiline)
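The difference between the two modes is easy to check with a short experiment. In this sketch, the three-line sample text is an assumption made up for the purpose; the pattern is the one shown above:

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module MultilineDemo
    Sub Main()
        ' Three-line sample text, invented for illustration.
        Dim text As String = "Jane Smith" & vbLf & "Bob Jones" & vbLf & "Mary Brown"
        Dim pattern As String = "^\w\w\w\w\b"
        Dim singleLine As New Regex(pattern)
        Dim multiLine As New Regex(pattern, RegexOptions.Multiline)
        ' Default mode: ^ matches only at the start of the whole string.
        Console.WriteLine(singleLine.Matches(text).Count)  ' => 1 ("Jane")
        ' Multiline mode: ^ also matches at the start of each line.
        Console.WriteLine(multiLine.Matches(text).Count)   ' => 2 ("Jane" and "Mary")
    End Sub
End Module
```

"Bob" isn't counted in either mode because it has only three characters.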
The search pattern can include quantifiers too. A quantifier specifies how many times the construct that precedes it can be repeated in the source string. For example, this regular expression matches all words with three, four, or five characters:
\b\w{3,5}\b
The * construct indicates zero or more occurrences; ? stands for zero or one occurrence, so you can use it for optional elements in the matched string; + stands for one or more occurrences. For example, the \w+ regular expression matches any word, so you can count the words in a source string with just a few statements:
Dim text As String = "Here is a sample sentence"
Dim re As New Regex("\w+")
Console.WriteLine( _
re.Matches(text).Count) ' => 5
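The ? and + quantifiers combine naturally. As a small sketch (the sample string is invented), this pattern finds integers with an optional leading minus sign:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuantifierDemo
    Sub Main()
        ' Sample text is invented for illustration.
        Dim text As String = "Offsets: 12, -7, 0 and +3"
        ' -? is an optional minus sign; \d+ is one or more digits.
        Dim re As New Regex("-?\d+")
        For Each m As Match In re.Matches(text)
            Console.WriteLine(m.Value)
        Next
        ' Displays 12, -7, 0, and 3 on separate lines.
    End Sub
End Module
```

The plus sign before the last number isn't part of the match, because the pattern makes no provision for it.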
Search for Date Values
This example shows how you can search for date values in the m/d/yyyy format (where month and day numbers may have one or two digits) anywhere in the source text:
Dim re As New _
    Regex("\b1?\d/[123]?\d/\d{4}\b")
Dim m As Match
For Each m In re.Matches(text)
    Console.WriteLine(m.Value)
Next
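Keep in mind that this pattern checks only the shape of the text, so it also accepts impossible dates. One way to weed them out, sketched here with invented sample text, is to pass each match to Date.TryParseExact (available in .NET Framework 2.0 and later); this extra step is a suggestion, not part of the search itself:

```vbnet
Imports System
Imports System.Globalization
Imports System.Text.RegularExpressions

Module DateSearchDemo
    Sub Main()
        ' Sample text is invented; 19/39/2003 has the right shape but isn't a real date.
        Dim text As String = "Shipped 1/15/2003, billed 12/3/2002, typo 19/39/2003"
        Dim re As New Regex("\b1?\d/[123]?\d/\d{4}\b")
        For Each m As Match In re.Matches(text)
            Dim parsed As Date
            ' The regex finds candidates; TryParseExact confirms they are real dates.
            Dim isRealDate As Boolean = Date.TryParseExact(m.Value, "M/d/yyyy", _
                CultureInfo.InvariantCulture, DateTimeStyles.None, parsed)
            Console.WriteLine("{0} -> valid date: {1}", m.Value, isRealDate)
        Next
        ' Displays:
        ' 1/15/2003 -> valid date: True
        ' 12/3/2002 -> valid date: True
        ' 19/39/2003 -> valid date: False
    End Sub
End Module
```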
The | and ( ) constructs let you specify alternative matching strings. This code uses them to find variable declarations in a VB source file. (For simplicity's sake, it assumes the variable's type doesn't contain dots.) Notice the IgnoreCase option for performing case-insensitive searches:
Dim pattern As String = _
    "\b(Dim|Private|Public)\s+\w+" & _
    "\s+As\s+\w+"
Dim re As New Regex(pattern, _
    RegexOptions.Multiline Or _
    RegexOptions.IgnoreCase)
Dim m As Match
' source is the VB code to parse
For Each m In re.Matches(source)
    Console.WriteLine(m.Value)
Next
The ( ) construct is also useful for applying quantifiers to a subsequence of characters. For example, you can modify the previous search pattern to account for the optional New keyword in variable declarations:
Dim pattern As String = _
"\b(Dim|Private|Public)\s+\w+" & _
"\s+As\s+(New\s+)?\w+"
Interestingly, the ( ) construct implicitly defines a numbered matching group, that is, a subsequence in the found string. Groups are numbered starting with 1 (group 0 is the entire match) and let you refer to portions of the match. You can use groups to improve the search for variable declarations in a VB listing:
Dim pattern As String = "\b" & _
"(Dim|Private|Public)\s+(\w+)" & _
"\s+As\s+(New\s+)?(\w+)"
This pattern defines four groups: the keyword used to declare the variable, the variable's name, the optional New keyword, and the variable's type. You can use these groups to extract additional information from each Match object:
Dim m As Match
For Each m In re.Matches(source)
    Console.WriteLine( _
        "variable {0} of type {1}", _
        m.Groups(2).Value, _
        m.Groups(4).Value)
Next
In this example, the third group would contain the "New " string if this keyword appears in the match; otherwise, it would contain an empty string. You can make your code more readable by assigning a name to a group because you can reference the group later by its name instead of its position. This technique avoids problems arising from nested pairs of parentheses, because nested-parentheses pairs alter group numbering.
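Here's how the variable-declaration search might look with named groups. The group names (scope, name, new, type) and the two-line sample source are my own choices for this sketch; any valid names would do:

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module NamedGroupDemo
    Sub Main()
        ' Sample source text, invented for illustration.
        Dim source As String = "Dim count As Integer" & vbCrLf & _
            "Private items As New ArrayList"
        ' (?<name>...) assigns a name to a group; the names are hypothetical.
        Dim pattern As String = "\b(?<scope>Dim|Private|Public)\s+(?<name>\w+)" & _
            "\s+As\s+(?<new>New\s+)?(?<type>\w+)"
        Dim re As New Regex(pattern, RegexOptions.IgnoreCase)
        For Each m As Match In re.Matches(source)
            ' Groups are retrieved by name; Success tells whether an optional group matched.
            Console.WriteLine("variable {0} of type {1} (New: {2})", _
                m.Groups("name").Value, m.Groups("type").Value, _
                m.Groups("new").Success)
        Next
        ' Displays:
        ' variable count of type Integer (New: False)
        ' variable items of type ArrayList (New: True)
    End Sub
End Module
```

If you later add or remove a pair of parentheses, the code that reads the groups keeps working, because it never depends on their positions.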
Named or numbered groups are especially useful when you're reading comma- or tab-delimited text files such as those you can export from Microsoft Excel and many database apps. For example, consider a text file containing employee names and salaries:
"John", "Doe", 55000
"Mary Ann", "Smith" , 58125.50
This format poses a few challenges that might not be apparent immediately. For example, each element can be preceded or followed by optional spaces or tab characters; also, both the first and the last name can contain multiple words, so you can't search for them with the \w+ sequence (which doesn't match the space). Similarly, the salary information might or might not contain a decimal portion, so you can't match it against the \d+ sequence. Here's an admittedly nontrivial regular expression based on named groups that fits the bill:
^\s*"(?<first>[^"]+)"\s*,
\s*"(?<last>[^"]+)"\s*,
\s*(?<salary>\d+(\.\d\d)?)\s*$
You can match the first and the last name correctly by looking for any character sequence that doesn't contain the double quote, and you can match the salary portion by adding a nested pair of parentheses that makes the decimal portion optional. The ^ and $ symbols ensure that the pattern spans an entire line in the source file, from its first character to its last; the several \s* constructs serve to ignore any additional space and tab characters surrounding the commas (see Listing 1).
Make Your Match
The code in Listing 1 also shows an interesting technique that's especially appropriate with large source strings. In such cases, the Matches method would return only when all matches have been found, an operation that might take several seconds. A better approach is to use the Regex.Match method to find the first match, then the Match.NextMatch method to search for all subsequent occurrences of the string, until the Match.Success property returns False to indicate there are no more matches. This approach is also effective when you don't need to iterate over all the occurrences—for example, when you want to retrieve a given employee's salary and can ignore all the employees after him or her.
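The following is not Listing 1 itself, just a minimal sketch of the Match and NextMatch loop described above, applied to the two sample employee records. It assumes the Multiline option and keeps the records in a string rather than reading them from a file:

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module NextMatchDemo
    Sub Main()
        ' The two sample records, joined into one multiline string.
        Dim source As String = _
            """John"", ""Doe"", 55000" & vbCrLf & _
            """Mary Ann"", ""Smith"" , 58125.50"
        ' Same pattern as above; double quotes are doubled inside a VB string.
        Dim pattern As String = _
            "^\s*""(?<first>[^""]+)""\s*," & _
            "\s*""(?<last>[^""]+)""\s*," & _
            "\s*(?<salary>\d+(\.\d\d)?)\s*$"
        Dim re As New Regex(pattern, RegexOptions.Multiline)
        ' Find the first record, then ask for the next one until Success is False.
        Dim m As Match = re.Match(source)
        Do While m.Success
            Console.WriteLine("{0} {1}: {2}", m.Groups("first").Value, _
                m.Groups("last").Value, m.Groups("salary").Value)
            m = m.NextMatch()
        Loop
        ' Displays:
        ' John Doe: 55000
        ' Mary Ann Smith: 58125.50
    End Sub
End Module
```

To stop at a given employee, you would simply exit the loop as soon as the first and last groups contain the name you're after.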
Numbered or named groups are useful for referring to a previous match in the pattern itself and provide a means to tell the regular-expression engine to "match that substring again." Consider the problem of parsing a set of strings that can be enclosed in either single or double quotes. The problem: You want to search for a closing quote that's identical to the opening one. You can find such strings with this pattern:
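("|').+?\1
The ("|') group matches the initial single or double quote; the lazy sequence .+? matches as few characters as possible, so it stops at the first quote of the same kind; finally, the \1 backreference matches the closing quote, be it single or double. (A backreference isn't recognized inside a character class, so you can't exclude the opening quote with a construct such as [^\1].)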
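Here's a minimal sketch of the quoted-string search. It pairs the ("|') group with a lazy .+? quantifier and the \1 backreference, relying on the fact that a backreference is honored only outside square brackets. The sample text is invented, and it deliberately nests one kind of quote inside the other:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuoteDemo
    Sub Main()
        ' Sample text: say "it's fine" or 'a "b" c'
        Dim text As String = "say ""it's fine"" or 'a ""b"" c'"
        ' Group 1 captures the opening quote; \1 requires the same quote to close.
        Dim re As New Regex("(""|').+?\1")
        For Each m As Match In re.Matches(text)
            Console.WriteLine(m.Value)
        Next
        ' Displays:
        ' "it's fine"
        ' 'a "b" c'
    End Sub
End Module
```

The apostrophe in the first string and the double quotes in the second don't end the match, because neither equals the quote that opened it.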
Regular expressions can replace text too. To do this, you use the Regex.Replace method, which replaces each occurrence of the search pattern passed in the first argument with the string passed in the second argument:
' Replace all Windows version names
' with "Windows XP"
Dim re As New Regex("Windows (95|98|NT|2000)")
' "text" is the source string
Console.WriteLine(re.Replace(text, "Windows XP"))
Replace patterns can include a reference to a numbered or named group defined in the search pattern. This is useful for arranging portions of the searched string in a different order. For example, this code searches for all names in the format "title firstname lastname" and converts them to the "title lastname, firstname" format:
Dim txt As String = _
    "Mr. Joe Doe and Mrs. Anne Smith"
Dim re As New Regex( _
    "(Mr\.|Mrs\.)\s+(\w+)\s+(\w+)")
Console.WriteLine( _
    re.Replace(txt, "$1 $3, $2"))
'=> Mr. Doe, Joe and Mrs. Smith, Anne
You can achieve the same result by using named groups:
Dim re As New Regex( _
"(?<title>Mr\.|Mrs\.)\s+" & _
"(?<first>\w+)\s+(?<last>\w+)")
Console.WriteLine(re.Replace(txt, _
"${title} ${last}, ${first}"))
The Replace method offers overloaded versions to support additional arguments, such as the starting index and the highest number of allowed substitutions. Another overloaded version takes a delegate to a function; this function is invoked for each match in the source string and is expected to return the replacement string for that occurrence. The function receives a Match object, so your code can retrieve the found occurrence and any subgroup inside it. This feature can be powerful (see Listing 2).
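As a rough illustration of the delegate-based overload (this is not Listing 2), the sketch below doubles the number in each name=amount pair, something a plain replacement string can't do. The DoubleAmount function, the group names, and the sample text are all hypothetical:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EvaluatorDemo
    ' Hypothetical callback: invoked once per match, returns the replacement text.
    Function DoubleAmount(ByVal m As Match) As String
        Dim amount As Integer = Integer.Parse(m.Groups("amount").Value)
        Return m.Groups("name").Value & "=" & (amount * 2).ToString()
    End Function

    Sub Main()
        Dim re As New Regex("(?<name>\w+)=(?<amount>\d+)")
        ' MatchEvaluator is the delegate type that Regex.Replace expects.
        Console.WriteLine(re.Replace("joe=100, ann=250", _
            New MatchEvaluator(AddressOf DoubleAmount)))
        ' => joe=200, ann=500
    End Sub
End Module
```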
Interestingly, the Regex class exposes several shared methods that let you perform most search-and-replace functions without instantiating a Regex object. For example, you can replace a pattern in a string with the Replace shared method:
Regex.Replace(source, pattern, replacement)
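A few of the shared methods are enough to show the idea. In this sketch the source string is invented, and the patterns are the ones used earlier in the article:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SharedMethodsDemo
    Sub Main()
        ' Sample text is invented for illustration.
        Dim source As String = "Windows 95, Windows NT and Windows 2000"
        Dim pattern As String = "Windows (95|98|NT|2000)"
        ' No Regex instance is created explicitly in any of these calls.
        Console.WriteLine(Regex.IsMatch(source, pattern))
        ' => True
        Console.WriteLine(Regex.Replace(source, pattern, "Windows XP"))
        ' => Windows XP, Windows XP and Windows XP
        Console.WriteLine(Regex.Match(source, "\d+").Value)
        ' => 95
    End Sub
End Module
```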
Regardless of whether you use instance or shared methods, the Regex object parses the regular expression pattern and creates a sort of tokenized version that's interpreted efficiently at run time. This tokenized version is usually faster than the most efficient program you can write in VB.NET or C#, but you can achieve even faster code by specifying the Compiled option in the Regex constructor:
Dim re As New Regex(pattern, _
RegexOptions.Compiled)
In this case, the regular expression is converted to Microsoft Intermediate Language (MSIL), which is compiled eventually to super-fast native code the first time you call the IsMatch, Match, Matches, or Replace methods. The only downsides to this technique are that the first call to the method takes slightly longer and that the native code hangs around in memory until the application shuts down. For these reasons, you should use the Compiled option only if you expect to reuse the same Regex object often during the application's lifetime. The Regex class even exposes the CompileToAssembly shared method, which lets you compile one or more regular expressions into a DLL, so that you can reuse those precompiled regular expressions later without any overhead at run time.
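One simple way to follow this advice is to create the compiled Regex object once and keep it in a field, so every call reuses the same instance. The module, field, and function names below are invented for this sketch, and the code makes no claim about how much time you'll save in practice:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CompiledDemo
    ' Created once and reused by every call; this is the scenario
    ' in which the one-time compilation cost is easiest to justify.
    Private ReadOnly WordRegex As New Regex("\w+", RegexOptions.Compiled)

    ' Hypothetical helper that counts the words in a string.
    Function CountWords(ByVal text As String) As Integer
        Return WordRegex.Matches(text).Count
    End Function

    Sub Main()
        Console.WriteLine(CountWords("Here is a sample sentence"))       ' => 5
        Console.WriteLine(CountWords("Compile once, match many times"))  ' => 5
    End Sub
End Module
```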
Regular expressions are also useful in conjunction with other portions of the .NET Framework. For example, you use them together with the RegularExpressionValidator ASP.NET control to validate a control's contents against a regular expression. Conveniently, Visual Studio .NET offers a set of regular expressions for the most common data types—such as phone numbers, Internet URLs, and e-mail addresses—in a dialog box (see Figure 3). If nothing else, this dialog box provides additional examples of useful regular expressions you can reuse in your apps (not necessarily ASP.NET apps).
Regular expressions can also be useful in many other situations (see Additional Resources). For example, you can use them to parse log files, extract data from pages retrieved from Web sites (the so-called screen scraping), read and process comma-delimited text files, and create add-ins that analyze your code. Regular expressions are a fascinating and complex topic, and this article has merely scratched the surface.
About the Author
Francesco Balena is the editor in chief of Visual Basic Journal, VSM's Italian licensee; the author of Programming Microsoft Visual Basic 6.0 and Programming Microsoft Visual Basic .NET (Microsoft Press); and a regular speaker at VSLive! and other developer conferences. He is the principal of Code Architects Srl and the founder of the www.vb2themax.com Web site.