Below is a comment I started in response to this blog entry on grep speed. I decided it was better posted here. Enjoy…
Yeah, the link above is to a photo of GNU grep author Mike Haertel. Trust me when I say that it's fairly unlikely you'd beat him in any kind of fair test of this kind of coding—he is utterly brilliant. Three stories…
1. Mike calls me over to his terminal, and shows me an inner loop of some version of grep he's working on. It consists of two unbounded loops, each with a goto into the other loop, and a return. 15 minutes later, Mike's convinced me that this is far and away the cleanest way to code the algorithm variant he's working on.
2. I'm sitting and doing some UNIX thing or other with Mike watching over my shoulder. He asks, "Why are you using the -y flag to grep? The -i flag does the same thing!" I immediately start into a giant rant about "taking a baseball bat to the idiot who broke the old V7 -y flag." Of course, who I was talking about didn't occur to me until Mike half-jokingly says "I'm sorry!"
I mention this story because it's maybe the only time I've ever won a UNIX argument with Mike. I explain that in the "good old days" the -y flag would match uppercase in the pattern to uppercase in the target, but lowercase in the pattern to mixed case in the target. Mike says "no it didn't!" Fortunately, the means to settle it are at hand: Mike logs onto the PDP-8 emulator running V6 UNIX at his house, and we dig through the original source code. Point for me! Mike feels bad, because the original -y behavior is obviously cooler, and he helped get the lame behavior into the POSIX standard.
3. Mike explains to me that the grep backreference stuff in the POSIX standard is bad. Really bad. The problem is that it's a bit too powerful. Turns out that you can write an arbitrary Diophantine equations as an grep backreference expression, and then try to match it against an infinite regular string to find solutions to the equation. Since Diophantine equations are undecidable in general, there's no way to build a POSIX-compliant grep.
I miss you, Mike. Move back to Portland. (B)