Professional PHP

PHP Programming, Web Development, PHP Advocacy and PHP Best Practices.

The Problem with Markup Languages

March 14th, 2007

Chris Shiflett has a post today, Allowing HTML and Preventing XSS. The problem is how to allow users to format their contributed content without introducing security vulnerabilities. The answer is usually some sort of markup language or filtering and sanitization of HTML.

BBCode was designed for this purpose. There is no actual standard, but the core syntax seems fairly uniform. It’s good for those used to forums, where it seems to be the norm.

HTML markup is nice because it is a standard, even if varying subsets are supported. Learning a little HTML isn’t going to hurt anyone, at least for the next 20 years or so. The problem is that HTML was never intended to be hand-edited. The syntax is not the most inviting, and different HTML-like markup languages handle whitespace differently from the HTML standard.

Wiki markup syntaxes were designed to be human friendly. The main problem I have with wiki syntax is that there is no standard. It seems like every wiki has a different way to formulate a link, for example. I guess there is some progress with Wiki Creole, but I still have a bad taste in my mouth.

The other problem I have with wiki markup is that I find it to be non-deterministic. When I edit any given wiki and try to use more than basic formatting, I never know what I am going to get. Most of the markup processing engines for these wikis are impenetrable morasses of regular expressions. It can be hard to gauge interactions. Are you really sure they are secure?
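To make the interaction problem concrete, here is a minimal sketch in Python. The two rules are hypothetical and are not taken from any real wiki engine; the only assumption is that each substitution runs over the output of the previous one, which is how a chain of regular expression passes typically works.

```python
import re

# Two hypothetical wiki-style rules, applied one after the other.
RULES = [
    (re.compile(r"\*(.+?)\*"), r"<b>\1</b>"),  # *bold*
    (re.compile(r"/(.+?)/"), r"<i>\1</i>"),    # /italic/
]


def render(text: str) -> str:
    """Apply each substitution in order to the output of the previous one."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    # The input contains no italic markup at all, yet the italic rule
    # fires on the slashes of the closing tags the bold rule just produced.
    print(render("*a* and *b*"))  # <b>a<<i>b> and <b>b<</i>b>
```

Each rule looks reasonable in isolation. The mangled output only appears when two bold spans share a line, which is the kind of case that is easy to miss when reading the patterns one at a time.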

Speaking of impenetrable morasses of regular expressions, have you ever looked at WordPress’s input path? I’m sure everyone with a WordPress blog who likes to blog about PHP code knows that it is a code eater. I’ve been particularly disappointed with WordPress in this area. Most of the “code formatting” plugins still have problems protecting code from WordPress’s heavy hand.

But the WordPress preg_replace gauntlet doesn’t just mangle code. I have a post which has been sitting in draft mode for several weeks because I can’t figure out how to give it the proper markup. WordPress is somehow taking my perfectly balanced input markup and producing “unbalanced” output markup. I haven’t yet tracked down the problem to either submit a fix or to do a good bug report. Frankly, I’m not looking forward to trudging through all those regular expressions.

In Chris’ post, he takes the regular expression approach. Folks in the comments have pointed out a few problems with his approach, including the problem of interleaved tags. If you can’t tell by now, I am not a fan of the regular expression gauntlet approach to markup languages. I prefer a defined syntax and a traditional computer science style parser (which may use regular expressions).
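As a rough illustration of the parser style, here is a minimal Python sketch. The whitelist, the `MarkupError` name, and the decision to reject bad nesting outright are all assumptions made for the example; this is not Chris's implementation or any library's API. A single regular expression is used only to split the input into tokens, and a stack tracks open tags so that interleaved tags are detected.

```python
import html
import re

# Assumed whitelist, chosen only for illustration.
ALLOWED = {"b", "i", "em", "strong", "code"}

# Tokenizer: a bare open/close tag, or a run of text, or a lone "<".
TOKEN = re.compile(r"<(/?)([a-zA-Z]+)>|[^<]+|<")


class MarkupError(ValueError):
    """Raised when allowed tags are unbalanced or improperly nested."""


def render(source: str) -> str:
    out = []
    stack = []  # names of currently open allowed tags
    for match in TOKEN.finditer(source):
        closing, name = match.group(1), match.group(2)
        if name is None or name.lower() not in ALLOWED:
            # Plain text, a stray "<", or a tag outside the whitelist:
            # escape it so it is displayed rather than interpreted.
            out.append(html.escape(match.group(0)))
            continue
        name = name.lower()
        if not closing:
            stack.append(name)
            out.append(f"<{name}>")
        else:
            if not stack or stack[-1] != name:
                raise MarkupError(f"unexpected </{name}>")
            stack.pop()
            out.append(f"</{name}>")
    if stack:
        raise MarkupError(f"unclosed <{stack[-1]}>")
    return "".join(out)


if __name__ == "__main__":
    print(render("<b>bold</b> & <script>x</script>"))
    try:
        render("<b><i>x</b></i>")  # interleaved tags
    except MarkupError as err:
        print("rejected:", err)
```

Because tags with attributes never match the tag token, something like `<a href="...">` falls through to the escaping branch. Supporting links would mean adding an explicit grammar rule for them rather than loosening a pattern.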

The other must-have is a preview option. With so much variation in markup languages, not having a preview leaves the user to play Russian roulette with their submitted content. I’ve talked about that before in the usability of input filtering. This is another area where WordPress leaves the user high and dry.

The complex input path in WordPress combined with its reliance on global variables seems to leave it unable to do an in-page preview. The admin area preview is an IFRAME so that it launches a separate request. The various live preview plugins are JavaScript based and don’t work when it is disabled. They also don’t pass the input through the same input path that WordPress uses, so they are not a true preview.
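A true preview is easiest when the input path is one function with no dependence on global state. The sketch below is in Python and is only an illustration of that idea. The function names and the trivial paragraph filter are hypothetical, and none of this reflects how WordPress is actually structured.

```python
import html


def render_comment(raw: str) -> str:
    """The single input path: escape, then wrap blank-line-separated blocks in <p>."""
    paragraphs = [p.strip() for p in raw.split("\n\n") if p.strip()]
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)


def preview(raw: str) -> str:
    """Preview calls exactly the same function that saving does."""
    return render_comment(raw)


def save(raw: str, store: list) -> str:
    """Render and persist; `store` stands in for a database table."""
    rendered = render_comment(raw)
    store.append(rendered)
    return rendered


if __name__ == "__main__":
    comments = []
    text = "a <b>\n\nc"
    # What the user sees in the preview is what ends up stored.
    assert preview(text) == save(text, comments)
    print(comments[0])
```

Since `render_comment` takes its input as an argument and returns a string, it can be called from the page that displays the form as easily as from the handler that stores the comment.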

I don’t mean for this to be a WordPress rant; on the whole, I like WordPress. Rather, I just wanted to point out how hard it can be to do input filtering that is safe, reliable, deterministic, and usable.

Filed Under

  • PHP, Software Design, Usability


16 Responses to “The Problem with Markup Languages”

  1. metapundit says:
    3/14/2007 at 10:59 am

    What about using markdown and disabling embedded html? A filter to strip all html tags before passing the text to a markdown converter ought to work. It still leaves you with links, formatting (bold, italic, etc), code blocks and lists all with a consistent and readable syntax…

  2. Chris Shiflett says:
    3/14/2007 at 11:27 am

    In Chris’s post, he takes the regular expression approach. Folks in the comments have pointed out a few problems with his approach, including the problem of interleaved tags.

    Actually, that’s the only problem, and it’s easy to resolve.

    The issue is that there is one specific case where I allow invalid XHTML: improperly nested tags. They still display and function as a user would expect, but I should force the user to correct the markup, else leave it escaped. I’ll be addressing this issue as well as any others that arise.

    I think you’ll find that the user experience on my blog is very good, and I’m still preventing XSS and producing XHTML. Not bad for a dirt-simple approach, even if it does use regular expressions. :-) The theory is sound, and that’s what makes me confident that the implementation will mature into a solid solution to this problem.

    For those seeking a more sophisticated approach, I think Edward Yang has done a great job with HTML Purifier:

    http://hp.jpsband.org/

  3. Jeff says:
    3/14/2007 at 3:20 pm

    Chris,

    I do think the user experience on your blog is good. I really like the new design. I didn’t mean to imply otherwise. You’re using s9y, aren’t you? (Not WordPress)

    This post started out as a quick comment on your blog and then moved here when I started to go off on WordPress. :)

  4. Nate K says:
    3/15/2007 at 6:31 am

    I like Shiflett’s approach, and I don’t think all uses of regular expressions are bad. It is the best way to get exactly what you are expecting – without having to program so much business logic. My big issue is – do people really NEED to have all HTML tags available to them? I am working through this with a site right now (a non-tech site), and I will only allow italics and bold (I’ll parse the paragraphs). I just don’t see a reason for trying to give everyone so much power with editing HTML.

    We know what happens when you give a user too much control, just have a look at any given myspace page. I want to make sure the sites I work on stay clean and have valid code – therefore you really need to review (on a needs basis), what control do your users really need to have?

    And – I agree with everything you said about WP :)

  5. Pat O says:
    3/19/2007 at 2:44 pm

    Re: Wiki Markup Code, is there any website or page that compares markup for wiki software? I have started a few wikis using MediaWiki and PMWiki and am really new at this, but I now realize the complexity of these markups. I would prefer not to go down two widely divergent paths. So far, PMWiki seems much easier to install and to add content to; however, MediaWiki is obviously more widely implemented. And, in the future, I would prefer to operate wikis that more people can add content to.

  6. dgx says:
    3/21/2007 at 7:36 am

    Hello, I have posted a longer comment on Chris’s blog, so briefly: the well-known markup formatters (Markdown, Textile) are the roulette players, as you write. Writing a bulletproof solution for comments was a challenge, so I wrote Texy.

    It allows you to use HTML tags and ensures the well-formedness of the resulting code (demo).

    And there is a plugin for WordPress. In our country, WordPress with Texy is a very popular combination.

  7. dgx says:
    3/21/2007 at 7:37 am

    …and some additional features: it supports typography rules, can be combined with a syntax highlighter, and is highly configurable.

  8. Jack Teese says:
    5/2/2007 at 8:42 am

    Well I stumbled over a site today that got some nice syntax highlighting: http://php-coding-practices.com

    I wonder what WP plugin they use.

    I agree that regexp should be used for such things – it’s just the way to go.

  9. Aaron Saray says:
    6/16/2007 at 5:31 pm

    Just a note: I’ve had the same issue with my WordPress and some code. It would actually cause a 503 error whenever I posted a tutorial with a certain segment of PHP code. I finally figured out it was a spacing issue… I had to put a few extra spaces in some of my code – specifically in a .= block and one other place that I can’t remember… It was just far easier to mangle the code than to go fix the regular expressions ;)

  10. Wayne Whitty says:
    10/7/2008 at 5:24 am

    I just think that a lot of people don’t want to have to worry about HTML being submitted by their users. A mixture of fear and laziness, I suppose.

    Wayne


  • About

    My name is Jeff Moore. I'm a PHP programmer living in San Francisco and working for a startup.
