Parse Text Files With Regular Expressions
Learn to parse fixed-length files and delimited text files, detect when a key combination is pressed, and change the style of the Web control that has the input focus.
by Francesco Balena and Marco Bellinaso
September 13, 2004
Technology Toolbox: VB.NET, C#, ASP.NET
One of the great things about being a book and magazine writer and the founder of a Web site is that I can keep myself in touch with thousands of developers. And even when I don't receive e-mails, I can see which articles on our Web site developers visit most frequently (see the sidebar, "The 2TheMax Family of Sites"). It's surprising to see that so many developers spend so much time on a relatively small set of problems. It's another form of the famous 80/20 rule: Programmers spend 80 percent of their time solving the recurring 20 percent of all possible problems. With this new .NET 2 the Max column, we hope to help you deliver better applications, faster, by making the solutions to these recurring problems more widely known. —Francesco Balena
Parse Fixed-Length Fields in Text Files
XML has become the standard technology in information exchange, but many applications still use more primitive ways to import and export data. One such technique is based on text files containing fixed-width fields. Consider these text lines:
John  Smith   New York
Ann   Doe     Los Angeles
Each text line contains information about the first name (six characters), last name (eight characters), and city. The largest city has 11 characters, but usually you can assume that the last field will take all the characters up to the end of the current line.
Building a program that reads individual fields isn't difficult at all. Your app simply reads a line, then uses the String.Substring method to extract individual fields. However, I want to illustrate a different approach, based on regular expressions. Consider this regular expression:
^(?<first>.{6})(?<last>.{8})(?<city>.+)$
The dot (.) represents "any character" (except a newline). Therefore, .{6} means "any six characters." The expression (?<first>.{6}) creates a group named "first" that corresponds to these initial six characters. Likewise, (?<last>.{8}) creates a group named "last" that corresponds to the next eight characters. Finally, (?<city>.+) creates a group for all the remaining characters on the line and names it "city." The ^ and $ characters represent the beginning and end of the line, respectively. You can easily write short VB and C# routines built on this regular expression to parse a file (see Listing 1). The code for parsing fixed-length fields in text files is available for download.
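Listing 1 is not reproduced in this text. To give a rough idea of what such a routine might look like, here is a minimal C# sketch. The class name FixedWidthParser, the ParseLine helper, and the in-memory sample lines (padded to the field widths described above) are assumptions made for illustration, not the author's code.

```csharp
using System;
using System.Text.RegularExpressions;

static class FixedWidthParser
{
    // The pattern from the text: 6 characters, 8 characters, then the rest of the line.
    const string Pattern = @"^(?<first>.{6})(?<last>.{8})(?<city>.+)$";

    // Returns the trimmed first name, last name, and city,
    // or null when the line does not match the pattern.
    public static string[] ParseLine(string line)
    {
        Match m = Regex.Match(line, Pattern);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups["first"].Value.Trim(),
            m.Groups["last"].Value.Trim(),
            m.Groups["city"].Value.Trim()
        };
    }

    static void Main()
    {
        // Hypothetical sample lines, padded with spaces to the field widths.
        string[] lines = { "John  Smith   New York", "Ann   Doe     Los Angeles" };
        foreach (string line in lines)
        {
            string[] fields = ParseLine(line);
            if (fields != null)
                Console.WriteLine("{0} | {1} | {2}", fields[0], fields[1], fields[2]);
        }
    }
}
```

A real routine would read the lines from a file, for instance with StreamReader.ReadLine, and pass each one to ParseLine. Trimming the captured values removes the padding that fills out each fixed-width field.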
The beauty of this approach based on regular expressions is that it is unbelievably easy to adapt the code to different field widths and to work with delimited fields. For example, if the fixed-width fields are separated by semicolons, you simply modify the regular expression without touching the remaining code:
^(?<first>.{6});(?<last>.{8});(?<city>.+)$
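To illustrate that only the pattern needs to change, here is a small C# sketch that runs one extraction routine against both expressions. The class name PatternSwapDemo, the Describe helper, and the sample lines are hypothetical and chosen only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternSwapDemo
{
    // The two patterns from the text: plain fixed-width fields,
    // and fixed-width fields separated by semicolons.
    const string FixedWidth = @"^(?<first>.{6})(?<last>.{8})(?<city>.+)$";
    const string WithSemicolons = @"^(?<first>.{6});(?<last>.{8});(?<city>.+)$";

    // Joins the three named groups with slashes,
    // or returns null when the line does not match the pattern.
    public static string Describe(string line, string pattern)
    {
        Match m = Regex.Match(line, pattern);
        if (!m.Success) return null;
        return string.Format("{0}/{1}/{2}",
            m.Groups["first"].Value.Trim(),
            m.Groups["last"].Value.Trim(),
            m.Groups["city"].Value.Trim());
    }

    static void Main()
    {
        // Same routine, different pattern; both calls print John/Smith/New York.
        Console.WriteLine(Describe("John  Smith   New York", FixedWidth));
        Console.WriteLine(Describe("John  ;Smith   ;New York", WithSemicolons));
    }
}
```

Because the code looks up groups by name rather than by position, it does not care how the pattern locates each field.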
Once you understand how regular expressions work, creating and maintaining your parser routines becomes child's play. —F.B.
Use Regular Expressions With Delimited Text Files
Let's assume you want to write a program to parse a common (albeit primitive, according to today's standards) exchange format: delimited text files. Each field is separated from the next by a comma, a semicolon, a tab, or another special character. To further complicate things, such files usually allow values enclosed in single or double quotes. In this case, you can't use the Split method of the String type to do the parsing, because your result would be bogus if a quoted value happens to include the delimiter (as in "Doe, John").
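The problem with Split is easy to reproduce. The following C# sketch splits a hypothetical two-field record on commas; the class name SplitPitfall, the NaiveSplit helper, and the sample record are assumptions made for illustration.

```csharp
using System;

static class SplitPitfall
{
    // Naive approach: split on every comma, paying no attention to quotes.
    public static string[] NaiveSplit(string line)
    {
        return line.Split(',');
    }

    static void Main()
    {
        // Hypothetical record with two fields: a quoted name and a city.
        string line = "\"Doe, John\",New York";
        string[] parts = NaiveSplit(line);

        // Prints 3, although the record contains only two fields.
        Console.WriteLine(parts.Length);
        foreach (string part in parts)
            Console.WriteLine("[" + part + "]");
    }
}
```

The quoted name is cut in two at its embedded comma, and the quote characters end up in separate pieces, so the caller can no longer tell where the field began or ended.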
Regular expressions are a real lifesaver in such cases. You can use the parsing code (see Listing 1) for these purposes, provided that you use a different regular expression that accounts for delimited fields. Let's start with the simplified assumption that there are no quoted strings in the file:
John , Smith, New York
Ann, Doe, Los Angeles