Thursday, February 19, 2009

ASP.NET parse HTML string HtmlDocument

Here is a quick tutorial that shows how a string containing HTML can be parsed and navigated using the HtmlDocument object.
You need this namespace (and a reference to the System.Windows.Forms assembly) for the WebBrowser and HtmlDocument objects:

using System.Windows.Forms;

First, let's say you have flat HTML in a string variable like this:

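string strHTML = "[some raw html]";
WebBrowser browser = new WebBrowser();
browser.ScriptErrorsSuppressed = true;
// A new WebBrowser can have a null Document until it navigates somewhere
// (see the first comment below), so load a blank page first.
browser.Navigate("about:blank");
HtmlDocument htmlDocument = browser.Document.OpenNew(true);
htmlDocument.Write(strHTML);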

I recommend setting ScriptErrorsSuppressed = true to avoid script error dialogs caused by JavaScript problems while the HTML loads.
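One thing the snippet above does not show: WebBrowser wraps an ActiveX control, so it has to be created on a single-threaded apartment (STA) thread, which ASP.NET request threads normally are not. Here is a minimal sketch of one way to handle that. The HtmlStringParser class and its Parse method are hypothetical names of mine, and the sketch assumes (as the first comment below reports) that navigating to about:blank is enough to get a non-null Document.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

public static class HtmlStringParser
{
    // Hypothetical helper: parses `html` on a dedicated STA thread and
    // returns whatever `work` extracts from the resulting HtmlDocument.
    public static T Parse<T>(string html, Func<HtmlDocument, T> work)
    {
        T result = default(T);
        Exception error = null;

        Thread thread = new Thread(() =>
        {
            try
            {
                using (WebBrowser browser = new WebBrowser())
                {
                    browser.ScriptErrorsSuppressed = true;
                    // Assumption: this makes browser.Document non-null.
                    browser.Navigate("about:blank");
                    HtmlDocument doc = browser.Document.OpenNew(true);
                    doc.Write(html);
                    result = work(doc);
                }
            }
            catch (Exception ex)
            {
                // Carry the failure back to the calling thread.
                error = ex;
            }
        });

        // WebBrowser refuses to instantiate on a non-STA thread.
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();

        if (error != null)
        {
            throw new InvalidOperationException("Parsing failed.", error);
        }
        return result;
    }
}
```

A call would look something like string title = HtmlStringParser.Parse(strHTML, doc => doc.Title);. It is probably safest to pull plain values such as strings out inside the callback, because the HtmlElement objects belong to the browser on that thread and the browser is disposed when the callback returns.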

Once your HtmlDocument object is ready, you have these methods (similar to JavaScript) at your disposal:

htmlDocument.GetElementById(string id)
htmlDocument.GetElementsByTagName(string tagName)
htmlDocument.GetElementFromPoint(System.Drawing.Point point)

All these methods return either an HtmlElement or an HtmlElementCollection, and here are some useful members for navigating through elements:

htmlElement.Parent

htmlElement.NextSibling

htmlElement.FirstChild

htmlElement.InnerHtml

htmlElement.InnerText

htmlElement.Children

htmlElement.GetElementsByTagName(string tagName)
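To give an idea of how these members fit together, here is a small sketch that collects every link in a document and dumps the tag tree under an element. DomWalker, GetLinks and DumpTree are hypothetical names used only for illustration. The sketch assumes you already have a populated HtmlDocument, and it also uses GetAttribute and TagName, which HtmlElement provides in addition to the members listed above.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;

public static class DomWalker
{
    // Hypothetical helper: returns (link text, href) pairs for every <a> element.
    public static List<KeyValuePair<string, string>> GetLinks(HtmlDocument doc)
    {
        List<KeyValuePair<string, string>> links =
            new List<KeyValuePair<string, string>>();

        foreach (HtmlElement anchor in doc.GetElementsByTagName("a"))
        {
            string href = anchor.GetAttribute("href");
            if (string.IsNullOrEmpty(href))
            {
                continue; // named anchors and the like have no href
            }
            // InnerText can be null for an element without text.
            string text = anchor.InnerText ?? string.Empty;
            links.Add(new KeyValuePair<string, string>(text, href));
        }
        return links;
    }

    // Hypothetical helper: appends an indented outline of tag names,
    // walking the tree depth-first through the Children collection.
    public static void DumpTree(HtmlElement element, int depth, StringBuilder output)
    {
        output.Append(' ', depth * 2).AppendLine(element.TagName);
        foreach (HtmlElement child in element.Children)
        {
            DumpTree(child, depth + 1, output);
        }
    }
}
```

For example, DomWalker.DumpTree(htmlDocument.Body, 0, new StringBuilder()) would start the outline at the body element.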

As you can see, this closely mirrors the JavaScript DOM, so anybody with experience working with the DOM will be right at home.

It would be nice to have something like jQuery on the server side to parse the document. If you know of a better way of parsing, or a library dedicated to it, feel free to add a comment.

3 comments:

Anonymous said...

I don't know if it is different in ASP.NET. I tried this solution in a standard C# application in .NET 2.0, but browser.Document is null, so you can't call the OpenNew method. I found a workaround: set the Url of "browser" to "about:blank". This initializes everything needed for browser.Document to be available, so you can overwrite it with your string.

WebBrowser wb = new WebBrowser();
wb.ScriptErrorsSuppressed = true;
wb.Url = new Uri("about:blank");
HtmlDocument doc = wb.Document.OpenNew(true);
doc.Write(myString);

Deegii said...

I tried this using ASP.NET MVC 2.0, but couldn't find the WebBrowser and HtmlDocument classes. Please help me?

Fizzled said...

@Deegii -

"You must use this namespace for HtmlBrowser and HtmlDocument objects

namespace System.Windows.Forms"

Though HtmlBrowser is a typo, and should be WebBrowser.