Using Regexps to parse XML?

A common task for a programmer is to extract XML data, transform it, and insert it into a database. The obvious and easiest tool for this is usually a DOM XML parser, which parses the XML into a tree that is easy to work with.

Some common and easy to use tools for this are:

It is also somewhat common for XML files to contain malformed XML. It is very easy to include unescaped data where it’s not allowed, and while this is perfectly fine for HTML, XML parsers will fail with errors. In 2005 the Google Reader team reported that 7% of all public RSS and Atom feeds had XML errors in them. The numbers are much higher for private feeds exchanged under business-to-business agreements.

This stops you from proceeding until the data is fixed. The only correct way to fix it is to contact the person who generated the XML, which usually results in a series of e-mails educating that person on how to generate valid XML.

But when you are trying to finish your work within a reasonable timeframe, it’s not always acceptable to wait for new data to be delivered.

Alternatively, you can make the XML parser tolerate the invalid XML, or you can write a filter that fixes the data before the XML parser sees it. Neither is a fun situation.
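To make the filter idea concrete, here is a minimal sketch in PHP. It assumes one particular flaw, a bare "&" in text content, which is only one of many ways a feed can be broken. The function name escape_bare_ampersands and the sample string are made up for illustration.

```php
<?php
// Hypothetical pre-parse filter: escape "&" characters that do not start
// an entity or character reference, then hand the result to the parser.
function escape_bare_ampersands(string $xml): string
{
    // Negative lookahead: leave "&name;", "&#123;" and "&#x1F;" untouched.
    $entity = '(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);';
    return preg_replace('/&(?!' . $entity . ')/', '&amp;', $xml);
}

// Made-up sample: the first "&" is bare, the second is already escaped.
$broken = '<Title>Fish & Chips &amp; more</Title>';

libxml_use_internal_errors(true);          // collect errors instead of printing warnings
var_dump(simplexml_load_string($broken));  // bool(false): the parser rejects the bare "&"

$fixed = escape_bare_ampersands($broken);
$title = simplexml_load_string($fixed);
echo (string) $title, "\n";                // Fish & Chips & more
```

A filter like this only repairs the one flaw it was written for, so every new kind of breakage in the feed means another filter.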

Regexps

If you go the route of writing regexps to fix the data, you have taken the path of writing a parser that deals not with generic XML data but with data generated by a specific (broken) program.

But if we’re using regexps to fix specific flaws in the XML we’re working with, then how useful is it really to have a complex XML parsing library when all it seems to be doing is getting in the way?

If you’re working directly against a single feed, then it might be worth bending the rules and using an old and tested method:

But here’s the kicker: If you’re writing a parser for one specific feed, you can write a parser that parses only that format with regexps.

For example, let’s assume this feed:

<Article>
  <Title>Hello link</Title>
  <Body>
    Hello! <a href="/link">link</a>.
  </Body>
</Article>
<Article>
  <Title>Hello link</Title>
  <Body>
    Hello! <a href="/world">link</a>.
  </Body>
</Article>

The <a> tag is not allowed inside the <Body> tag. We can assume that their code looks like this:

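<?php
// Assumed generator: the body is interpolated as-is, with no escaping
// (for example via htmlspecialchars()), so raw markup ends up in the XML.
foreach ($articles as $article) {
print <<<XML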
<Article>
  <Body>
    {$article['body']}
  </Body>
</Article>
XML;
}
?>

So instead of thinking about it as an XML document, you can think of it as a data file in which each article has a prefix of "<Article>\n  <Body>\n    " and a suffix of "\n  </Body>\n</Article>". Or, as a regexp: |<Article>\n  <Body>\n    (.*?)\n  </Body>\n</Article>|
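As a rough sketch of how that could look in PHP, the snippet below pulls the article bodies out with preg_match_all. It assumes the generator emits exactly the layout from the PHP snippet above (two-space indentation, Unix line endings, no <Title> element). The function name extract_bodies and the added "s" modifier are choices made for this example.

```php
<?php
// Sample input in the layout the assumed generator emits.
$feed = <<<'FEED'
<Article>
  <Body>
    Hello! <a href="/link">link</a>.
  </Body>
</Article>
<Article>
  <Body>
    Hello! <a href="/world">link</a>.
  </Body>
</Article>
FEED;

// Prefix, lazy capture, suffix, with "|" as the delimiter. The "s" modifier
// is an addition here so that "." also matches newlines in multi-line bodies.
$pattern = "|<Article>\n  <Body>\n    (.*?)\n  </Body>\n</Article>|s";

// Returns the raw body text of every article found in the feed.
function extract_bodies(string $feed, string $pattern): array
{
    preg_match_all($pattern, $feed, $matches);
    return $matches[1];   // capture group 1 of each match
}

print_r(extract_bodies($feed, $pattern));
```

The bodies come back exactly as they were written, embedded <a> tags included, so they can go straight into the transform step without the XML parser ever seeing them.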

And this is exactly what regexps were designed to parse! You could write a parser that’s tied specifically to that format, and as long as the generator doesn’t change, you’ll be the hero who did the job in one hour without any complaint e-mails, instead of spending several days waiting for new feeds to be generated.
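Since everything hinges on the generator not changing, it can be worth checking for that cheaply. One possible guard, sketched below, removes every article the pattern recognises and verifies that only whitespace is left. The function name feed_matches_format and both sample strings are hypothetical.

```php
<?php
// Hypothetical guard: strip every article the pattern recognises and make
// sure nothing but whitespace is left over.
function feed_matches_format(string $feed, string $pattern): bool
{
    $leftover = preg_replace($pattern, '', $feed);
    return $leftover !== null && trim($leftover) === '';
}

// Assumed layout: two-space indentation, Unix line endings, no <Title>.
$pattern = "|<Article>\n  <Body>\n    (.*?)\n  </Body>\n</Article>|s";

$ok      = "<Article>\n  <Body>\n    Hello!\n  </Body>\n</Article>\n";
$changed = "<Article>\n  <Title>Hi</Title>\n  <Body>\n    Hello!\n  </Body>\n</Article>\n";

var_dump(feed_matches_format($ok, $pattern));       // bool(true)
var_dump(feed_matches_format($changed, $pattern));  // bool(false)
```

If the check fails, the import can stop with an error instead of silently skipping articles.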

Posted on 11 May 2009 by Morgan Christiansson.