July 14th, 2010

Using HtmlAgilityPack to parse a website.

In my line of work I’m often asked to come up with unique data sources to power some of our information visualizations. Often this data is not available through an API or RSS feed; sometimes it’s only available as HTML. Until recently I’ve shied away from parsing HTML with regular expressions or XML parsers because they can be really processor intensive and give mixed results.

This past week I once again needed an HTML page as a data source, this time for a WPF application. Since I needed to make it work, I did a quick search on the internet for HTML parsers. What I found was amazingly useful. The HtmlAgilityPack is a .NET library that gives you all the tools to parse your HTML and retrieve the data you’re looking for. It supports LINQ to Objects and has an XPath implementation to let you find what you need easily. Plus, because it’s on CodePlex, it’s open source and free! Fantastic!

Here is a little code snippet in case you ever wanted to know how many Twitter Bird image results are in a Bing.com search. Note that I’m using XPath syntax to find the <span> tag with an ID value of ‘sw_ptc’ and then reading the InnerText from that node. HtmlAgilityPack has lots of XPath and XSLT options, so you’ll be able to find whatever you’re looking for with minimal effort.

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class BingCounter
{
    private WebClient webClient;

    public BingCounter()
    {
        Uri _feedUri = new Uri("http://www.bing.com/images/search?q=twitter+bird");
        webClient = new WebClient();
        webClient.DownloadDataCompleted += new DownloadDataCompletedEventHandler(stringDataCompletedEvent);
        // Download asynchronously; stringDataCompletedEvent runs when the request finishes.
        webClient.DownloadDataAsync(_feedUri);
    }

    private void stringDataCompletedEvent(object sender, DownloadDataCompletedEventArgs e)
    {
        if (e.Error == null)
        {
            try
            {
                byte[] htmlBytes = e.Result;
                HtmlDocument doc = new HtmlDocument();
                doc.Load(new StreamReader(new MemoryStream(htmlBytes)));
                string results = "";
                // SelectNodes returns null (not an empty collection) when nothing matches.
                HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span[@id='sw_ptc']");
                if (spans != null)
                {
                    foreach (HtmlNode span in spans)
                    {
                        results = span.InnerText;
                    }
                }
                // Print the text of the result-count span (empty if it was not found).
                Console.WriteLine(results);
            }
            catch (Exception exp)
            {
                // exp.Data is a dictionary; Message holds the readable error text.
                Console.Write(exp.Message);
            }
        }
    }
}
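
If you want to try the lookup without hitting Bing at all, here is a minimal sketch that runs the same XPath query against an inline HTML string, next to a LINQ to Objects version of the same search. The class name SpanTextSketch, the helper names ByXPath and ByLinq, and the sample markup are all made up for illustration; the real Bing page will look different and its markup may change at any time.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class SpanTextSketch
{
    // Hypothetical helper: text of the first <span> with the given id, or null if none.
    // Assumes the id contains no single quote, since it is pasted into the XPath string.
    public static string ByXPath(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectSingleNode returns null when nothing matches.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[@id='" + id + "']");
        return node == null ? null : node.InnerText.Trim();
    }

    // Same lookup expressed with LINQ to Objects over the parsed nodes.
    public static string ByLinq(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode
                  .Descendants("span")
                  .Where(s => s.GetAttributeValue("id", "") == id)
                  .Select(s => s.InnerText.Trim())
                  .FirstOrDefault(); // null when no span matches
    }

    static void Main()
    {
        // Made-up markup standing in for the downloaded page.
        const string html =
            "<html><body><span id='sw_ptc'> 1-20 of 1,000 results </span></body></html>";
        Console.WriteLine(ByXPath(html, "sw_ptc"));
        Console.WriteLine(ByLinq(html, "sw_ptc"));
    }
}
```

Both helpers return null rather than throwing when the span is missing, which is handy when the page layout changes underneath you.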

Have fun and happy coding.