Latest entries in the Euricom Team blog
Written by: Euricom
5/15/2009 2:51 PM 

Regular expressions are the max !

But sometimes, writing them can be a serious pain in the ... you know where. Just the other day I had to write a little tool that does some heavy find-and-replace work in generated code. At first I thought this would be a fairly easy job using regular expressions, but I ended up stuck for quite some time on two issues.

I was able to find 'the bugs' in my regex with a very simple yet useful tool for testing regular expressions, which I downloaded from here: http://www.regular-expressions.info/download/csharpregexdemo.zip

To demonstrate the issue, you can use the following input string:

testa
atest
testa
atest

... with the following regular expression pattern: test$ and the following replacement string: HELLO

When you hit the 'Replace' button, you get the result as expected: Only the last test in the string gets replaced.
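
For readers who would rather reproduce this in code than in the demo tool, here is a minimal sketch. It uses Python's re module rather than the .NET Regex class (a swap made for illustration, not what the post used); for this input the default behaviour of $ is the same. The variable names and the choice of \r\n line endings in the input are assumptions.

```python
import re

# The sample input from the post, assumed here to use Windows line endings.
SAMPLE = "testa\r\natest\r\ntesta\r\natest"

# Without the multiline flag, $ matches only at the very end of the input
# (or just before a final trailing newline), so one replacement happens.
default_result = re.sub(r"test$", "HELLO", SAMPLE)

print(repr(default_result))
```

Only the final atest changes; the three lines before it are left alone.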

Now, most of the time when you work with regular expressions, you work one line at a time. But in my case I had to work across multiple lines, so I used RegexOptions.Multiline. This option changes the way "^" and "$" work, so that they match at the beginning and end of each line in the input string instead of only at the beginning and end of the whole string the regex pattern is applied to. To demonstrate RegexOptions.Multiline in the tool, set the flag for the option '^ and $ match at embedded newlines'.

Now one might expect the first atest in my input string to be replaced by aHELLO as well ... but it isn't !

Now the sample I gave you is really straightforward, but the RegEx I was working on was fairly complex. So I'm gonna spare you the details on how long it took me to find the reason for this behavior and how many times I cursed at my computer during my quest, but eventually it all boiled down to this.

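On Windows, each line ends with a carriage return followed by a line feed (\r\n), so the second line of my input is really atest\r\n. But RegexOptions.Multiline works on ... you guessed it ... the newline character (\n) only, ignoring the good old carriage return altogether: $ matches just before the \n, and the \r sitting between test and that position makes the match fail. So if you want the replacement to work correctly, the pattern has to account for the \r itself, for example by allowing an optional \r in front of the $ with something like test(?=\r?$).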
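
The effect can be sketched in code. The snippet below uses Python's re.MULTILINE as a stand-in for RegexOptions.Multiline (an assumption made for illustration; in both engines $ in multiline mode matches only directly before \n). The two workarounds shown, a lookahead for an optional \r and normalizing the line endings up front, are common approaches and not necessarily what the original tool did. All names are hypothetical.

```python
import re

# Sample input from the post, assumed to use Windows line endings (\r\n).
SAMPLE_CRLF = "testa\r\natest\r\ntesta\r\natest"

# Multiline mode: $ matches at the end of the input and directly before "\n".
# After the first "atest" comes "\r", not "\n", so that line is skipped.
naive = re.sub(r"test$", "HELLO", SAMPLE_CRLF, flags=re.MULTILINE)

# Workaround A: allow an optional "\r" before the line end. The lookahead
# keeps the "\r" out of the match, so the line ending survives the replace.
with_lookahead = re.sub(r"test(?=\r?$)", "HELLO", SAMPLE_CRLF, flags=re.MULTILINE)

# Workaround B: normalize line endings to "\n" first, then use the plain pattern.
normalized = SAMPLE_CRLF.replace("\r\n", "\n")
after_normalizing = re.sub(r"test$", "HELLO", normalized, flags=re.MULTILINE)

print(repr(naive))
print(repr(with_lookahead))
print(repr(after_normalizing))
```

Workaround A leaves the \r\n line endings untouched, which matters when the output has to stay in Windows format; workaround B is simpler but changes the line endings of the whole text.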

The second problem I ran into was that my RegEx pattern started with ^(\s*), which was necessary to capture the leading whitespace of a line in a capture group so that I could reuse it in the replacement string. It took me quite a while to realise that \s also matches newline characters, so the group sometimes swallowed one or more line breaks along with the indentation, which resulted in unwanted extra empty lines in the replacement result.
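
A small sketch of this second pitfall, again in Python as a stand-in for .NET (\s matches \r and \n in both). The input text, the inserted "// generated" line, and the [ \t]* alternative are hypothetical choices for illustration, not the original pattern from the tool.

```python
import re

# A blank line sits between "foo" and the indented "bar".
TEXT = "foo\n\n    bar\n"

def add_comment(pattern: str, text: str) -> str:
    # Insert a comment line above "bar", reusing the captured indentation twice.
    def repl(m: re.Match) -> str:
        indent = m.group(1)
        return f"{indent}// generated\n{indent}bar"
    return re.sub(pattern, repl, text, flags=re.MULTILINE)

# ^ already matches at the start of the blank line, and \s* then swallows
# that line's "\n" together with the spaces, so the "indentation" contains
# a line break and is emitted twice.
greedy = add_comment(r"^(\s*)bar", TEXT)

# Restricting the class to spaces and tabs keeps line breaks out of the group.
strict = add_comment(r"^([ \t]*)bar", TEXT)

print(repr(greedy))
print(repr(strict))
```

With \s* an extra empty line appears between the comment and bar; with [ \t]* the group holds only the four spaces.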

By David Stroobants, .Net Solutions Architect

3 comment(s) so far...

Re: Regular expressions are the max !

A couple of other useful free RegEx authoring tools that help you to build expressions and analyse their behaviour:

- Regular Expression Workbench (which includes source code): code.msdn.microsoft.com/RegexWorkbench

- Regular Expression Designer from Rad Software (which has a very handy RegEx reference):
www.radsoftware.com.au/regexdesigner/

By Nick Verschueren on   5/16/2009 12:05 PM

Re: Regular expressions are the max !

And another one:

- The Regex Coach:
www.weitz.de/regex-coach/

By Hans De Smedt on   5/18/2009 7:15 AM

Re: Regular expressions are the max !

Regex Coach is a tool that I actually use, and it is a really intuitive and great tool. The only disadvantage is that it doesn't support the .NET flavour of the RegEx engine, so you always need to translate your RegEx to the .NET syntax.

In fact, the RegEx that led to this problem was originally written with RegEx Coach, but that tool doesn't suffer from the carriage return / new line character issue. If you try the sample from above in RegEx Coach, you'll see that it actually indicates a positive match on the second line.

By David Stroobants on   5/19/2009 7:30 AM

Copyright (c) 2010 Euricom