Professional PHP

PHP Programming, Web Development, PHP Advocacy and PHP Best Practices.

The Problem with Markup Languages

March 14th, 2007

Chris Shiflett has a post today, Allowing HTML and Preventing XSS. The problem is how to allow users to format their contributed content without introducing security vulnerabilities. The answer is usually some sort of markup language or filtering and sanitization of HTML.

BBCode was designed for this purpose. There is no actual standard, but the core syntax seems fairly uniform. It’s good for those used to forums, where it seems to be the norm.

HTML markup is nice because it is a standard, even if varying subsets are supported. Learning a little HTML isn’t going to hurt anyone, at least for the next 20 years or so. The problem is that HTML was never intended to be edited by hand. The syntax is not the most inviting, and the various HTML-like markup languages handle whitespace differently from the HTML standard.

Wiki markup syntaxes were designed to be human friendly. The main problem I have with wiki syntax is that there is no standard. It seems like every wiki has a different way to formulate a link, for example. I guess there is some progress with Wiki Creole, but I still have a bad taste in my mouth.

The other problem I have with wiki markup is that I find it to be non-deterministic. When I edit any given wiki and try to use more than basic formatting, I never know what I am going to get. Most of the markup processing engines for these wikis are impenetrable morasses of regular expressions. It can be hard to gauge interactions. Are you really sure they are secure?
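To make the interaction problem concrete, here is a minimal sketch of two rules that each look harmless alone. It is written in Python for brevity and is not taken from any real wiki engine; the patterns, function names, and URL are all invented for illustration.

```python
import re

# Two hypothetical formatting rules, each reasonable in isolation.
EMPHASIS = re.compile(r"_([A-Za-z]+)_")      # _word_ becomes <em>word</em>
AUTOLINK = re.compile(r"(https?://\S+)")     # bare URLs become links


def emphasize(text):
    return EMPHASIS.sub(r"<em>\1</em>", text)


def autolink(text):
    return AUTOLINK.sub(r'<a href="\1">\1</a>', text)


source = "Docs: http://example.com/my_cool_page"

# Run the passes in either order. The emphasis rule sees "_cool_" inside
# the URL, so the href attribute ends up with markup embedded in it.
emphasis_first = autolink(emphasize(source))
autolink_first = emphasize(autolink(source))
```

Both orderings leave `<em>` tags inside the `href` attribute. Neither rule is wrong in isolation; the trouble comes from each pass rescanning text that another rule owns, and every added rule multiplies the cases to check.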

Speaking of impenetrable morasses of regular expressions, have you ever looked at WordPress’s input path? I’m sure everyone with a WordPress blog who likes to blog about PHP code knows that it is a code eater. I’ve been particularly disappointed with WordPress in this area. Most of the “code formatting” plugins still have problems protecting code from WordPress’s heavy hand.

But the WordPress preg_replace gauntlet doesn’t just mangle code. I have a post which has been sitting in draft mode for several weeks because I can’t figure out how to give it the proper markup. WordPress is somehow taking my perfectly balanced input markup and producing “unbalanced” output markup. I haven’t yet tracked down the problem to either submit a fix or to do a good bug report. Frankly, I’m not looking forward to trudging through all those regular expressions.

In Chris’ post, he takes the regular expression approach. Folks in the comments have pointed out a few problems with his approach, including the problem of interleaved tags. If you can’t tell by now, I am not a fan of the regular expression gauntlet approach to markup languages. I prefer a defined syntax and a traditional computer science style parser (which may use regular expressions).
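As a rough illustration of the parser approach, here is a minimal sketch in Python (used only for brevity). It is not Chris’s code or any library’s API: the two-tag BBCode-style subset, the `MarkupError` class, and the function names are assumptions made up for this example. A regular expression is used only to find tokens; a stack decides whether the nesting is valid.

```python
import html
import re

# Hypothetical whitelist: BBCode-style tag name -> HTML element to emit.
ALLOWED = {"b": "strong", "i": "em"}
TOKEN = re.compile(r"\[(/?)([a-z]+)\]")


class MarkupError(ValueError):
    """Raised when tags are interleaved, stray, or left unclosed."""


def tokenize(source):
    """Yield ('text', s), ('open', name) or ('close', name) tokens."""
    pos = 0
    for match in TOKEN.finditer(source):
        closing, name = match.groups()
        if name not in ALLOWED:
            continue  # unknown tags stay part of the surrounding text
        if match.start() > pos:
            yield ("text", source[pos:match.start()])
        yield ("close" if closing else "open", name)
        pos = match.end()
    if pos < len(source):
        yield ("text", source[pos:])


def render(source):
    """Return HTML for balanced input; raise MarkupError otherwise."""
    out, stack = [], []
    for kind, value in tokenize(source):
        if kind == "text":
            out.append(html.escape(value))  # user text is always escaped
        elif kind == "open":
            stack.append(value)
            out.append("<%s>" % ALLOWED[value])
        else:
            if not stack or stack[-1] != value:
                raise MarkupError("unexpected [/%s]" % value)
            stack.pop()
            out.append("</%s>" % ALLOWED[value])
    if stack:
        raise MarkupError("unclosed [%s]" % stack[-1])
    return "".join(out)
```

With this sketch, `render("[b]bold[/b] & <script>")` returns `<strong>bold</strong> &amp; &lt;script&gt;`, while the interleaved input `[b]one [i]two[/b] three[/i]` raises `MarkupError` instead of producing unbalanced output. The caller can then ask the user to correct the markup or fall back to showing it fully escaped.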

The other must-have is a preview option. With so much variation in markup languages, not having a preview leaves the user to play Russian roulette with their submitted content. I’ve talked about that before in the usability of input filtering. This is another area where WordPress leaves the user high and dry.

The complex input path in WordPress combined with its reliance on global variables seems to leave it unable to do an in-page preview. The admin area preview is an IFRAME so that it launches a separate request. The various live preview plugins are JavaScript based and don’t work when JavaScript is disabled. They also don’t pass the input through the same input path that WordPress uses, so they are not a true preview.
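One way to get a true preview is to keep the input path in a single function with no global state, and have both the preview and the save call it. The following is a minimal sketch in Python under that assumption; `format_comment`, `preview`, `save`, and the list used as a store are hypothetical names for illustration and say nothing about how WordPress is organized.

```python
import html


def format_comment(raw):
    """Hypothetical input path: escape everything, then wrap paragraphs."""
    paragraphs = [p.strip() for p in raw.split("\n\n") if p.strip()]
    return "\n".join("<p>%s</p>" % html.escape(p) for p in paragraphs)


def preview(raw):
    # Preview: run the input path and store nothing.
    return format_comment(raw)


def save(raw, store):
    # Save: run the very same function, then persist its result.
    rendered = format_comment(raw)
    store.append({"raw": raw, "html": rendered})
    return rendered
```

Because `preview` and `save` share one function that depends only on its argument, what the user sees before submitting is exactly what gets stored.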

I don’t mean for this to be a WordPress rant; on the whole, I like WordPress. Rather, I just wanted to point out how hard it can be to do good input filtering that is safe, reliable, deterministic, and usable.

Filed Under

  • PHP, Software Design, Usability

12 Responses to “The Problem with Markup Languages”

  1. metapundit says:
    3/14/2007 at 10:59 am

    What about using markdown and disabling embedded html? A filter to strip all html tags before passing the text to a markdown converter ought to work. It still leaves you with links, formatting (bold, italic, etc), code blocks and lists all with a consistent and readable syntax…

  2. Chris Shiflett says:
    3/14/2007 at 11:27 am

    In Chris’s post, he takes the regular expression approach. Folks in the comments have pointed out a few problems with his approach, including the problem of interleaved tags.

    Actually, that’s the only problem, and it’s easy to resolve.

    The issue is that there is one specific case where I allow invalid XHTML: improperly nested tags. They still display and function as a user would expect, but I should force the user to correct the markup, else leave it escaped. I’ll be addressing this issue as well as any others that arise.

    I think you’ll find that the user experience on my blog is very good, and I’m still preventing XSS and producing XHTML. Not bad for a dirt-simple approach, even if it does use regular expressions. :-) The theory is sound, and that’s what makes me confident that the implementation will mature into a solid solution to this problem.

    For those seeking a more sophisticated approach, I think Edward Yang has done a great job with HTML Purifier:

    http://hp.jpsband.org/

  3. Jeff says:
    3/14/2007 at 3:20 pm

    Chris,

    I do think the user experience on your blog is good. I really like the new design. I didn’t mean to imply otherwise. You’re using s9y, aren’t you? (Not WordPress)

    This post started out as a quick comment on your blog and then moved here when I started to go off on WordPress. :)

  4. Nate K says:
    3/15/2007 at 6:31 am

    I like Shiflett’s approach, and I don’t think all uses of regular expressions are bad. It is the best way to get exactly what you are expecting – without having to program so much business logic. My big issue is – do people really NEED to have all HTML tags available to them? I am working through this with a site right now (a non-tech site), and I will only allow italics and bold (I’ll parse the paragraphs). I just don’t see a reason for trying to give everyone so much power with editing HTML.

    We know what happens when you give a user too much control, just have a look at any given myspace page. I want to make sure the sites I work on stay clean and have valid code – therefore you really need to review (on a needs basis), what control do your users really need to have?

    And – I agree with everything you said about WP :)

  5. Pat O says:
    3/19/2007 at 2:44 pm

    Re: Wiki Markup Code, is there any website or page that compares markup across wiki software? I have started a few wikis, using MediaWiki and PmWiki, and am really new at this, but I now realize the complexity of these markups. I would prefer not to go down two widely divergent paths. So far, PmWiki seems much easier to install and to add content to; however, MediaWiki is obviously more widely implemented. And, in the future, I would prefer to operate wikis that more people can add content to.

  6. dgx says:
    3/21/2007 at 7:36 am

    Hello, I have posted a longer comment on Chris’s blog, so briefly: the well-known markup formatters (Markdown, Textile) are the roulette players, as you write. Writing a bulletproof solution for comments was a challenge, so I wrote Texy.

    It allows you to use HTML tags and ensures the well-formedness of the resulting code (demo).

    And there is a plugin for WordPress. In our country, WordPress with Texy is a very popular combination.

  7. dgx says:
    3/21/2007 at 7:37 am

    …and some additional features: it supports typography rules, can be combined with a syntax highlighter, and is highly configurable.

  8. Jack Teese says:
    5/2/2007 at 8:42 am

    Well, I stumbled over a site today that has some nice syntax highlighting: http://php-coding-practices.com

    I wonder what WP plugin they use.

    I agree that regexp should be used for such things – it’s just the way to go.

  9. Aaron Saray says:
    6/16/2007 at 5:31 pm

    Just a note: I’ve had the same issue with my WordPress and some code. It actually would cause a 503 error whenever I posted a tutorial with a certain segment of PHP code. I finally figured out it was a spacing issue… I had to put a few extra spaces in some of my code – specifically in a .= block and one other place that I can’t remember… It was just far easier to mangle the code than to go fix the regular expressions ;)

  10. Wayne Whitty says:
    10/7/2008 at 5:24 am

    I just think that a lot of people don’t want to have to worry about HTML being submitted by their users. A mixture of fear and laziness, I suppose.

    Wayne


  • About

    My name is Jeff Moore. I'm a PHP programmer living in San Francisco and working for a startup.
