WYSIWYG

http://kufli.blogspot.com
http://github.com/karthik20522

Friday, June 12, 2009

HTML Strip [ RegEx vs String ]

Since I work on an extremely user-driven content website, I have to make sure that no user-entered HTML on the page breaks the CSS or the layout. So we had to build HTML-stripping functionality to strip out the HTML on the fly. We initially used the obvious RegEx technique. But as traffic increased, page load time started to climb. So we enabled tracing on the page to find the most expensive operation, and while refactoring the code we realized that the HTML stripping was adding to the page load time.
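For reference, a RegEx-based stripper along these lines can be sketched as follows. This is only an illustration: the class name RegexHtmlStripper, the method name StripTags, and the pattern itself are assumptions for the example, not the exact code we had in production.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class RegexHtmlStripper
{
    // Matches '<', then any run of characters that are not '>', then '>'.
    // Compiled once and reused across calls.
    private static readonly Regex TagPattern =
        new Regex("<[^>]*>", RegexOptions.Compiled);

    public static string StripTags(string source)
    {
        // Treat null or empty input as "nothing to strip".
        if (string.IsNullOrEmpty(source))
        {
            return string.Empty;
        }

        // Replace every tag-like match with an empty string.
        return TagPattern.Replace(source, string.Empty);
    }
}
```

Calling RegexHtmlStripper.StripTags("<p>Hello <b>world</b></p>") returns "Hello world". The pattern engine has to run over every piece of user content on every request, which is where the cost shows up under load.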

While digging around the internet for optimized stripping code, I came across two sites: 1) Dot Net Perls [http://dotnetperls.com/remove-html-tags] and 2) StackOverflow [http://stackoverflow.com/questions/473087/string-benchmarks-in-c-refactoring-for-speed-maintainability].

Both of these sites discuss string operations vs RegEx, so I decided to implement the technique they describe. Following is the result from Page.Trace:




As you can see from the data above, there is a huge speed difference (though it's on the order of milliseconds!). The string version is still much faster, and it uses fewer objects (strings), which is good for memory usage.

The original string-index code worked wonders, but since we are optimizing for performance (both speed and memory usage), we can refactor it to use a StringBuilder (better memory management).
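A minimal sketch of what such a StringBuilder version might look like is below. The class name HtmlStripper and the method name StripTags are made up for this example, and the logic is my own simplified take on the character-scanning idea rather than the exact code from either site.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class HtmlStripper
{
    public static string StripTags(string source)
    {
        // Treat null or empty input as "nothing to strip".
        if (string.IsNullOrEmpty(source))
        {
            return string.Empty;
        }

        // The output can never be longer than the input, so sizing the
        // buffer up front avoids reallocations while appending.
        StringBuilder result = new StringBuilder(source.Length);
        bool insideTag = false;

        foreach (char c in source)
        {
            if (c == '<')
            {
                // Start of a tag: skip characters until the closing '>'.
                insideTag = true;
            }
            else if (c == '>' && insideTag)
            {
                // End of the current tag: resume copying.
                insideTag = false;
            }
            else if (!insideTag)
            {
                // Ordinary text outside any tag.
                result.Append(c);
            }
        }

        return result.ToString();
    }
}
```

HtmlStripper.StripTags("<p>Hello <b>world</b></p>") returns "Hello world". It makes a single pass over the input and builds one output buffer. Note that this sketch treats every '<' as the start of a tag, so text like "a < b" loses everything after the '<' until the next '>'.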
