Sunday, July 21, 2013

Parsing HTML Links in Web Pages in C#

Step 1: Download Html Agility Pack and add a reference to it in your project

Download Html Agility Pack from the following link:

http://htmlagilitypack.codeplex.com/

Then add a reference to the library in your project.

Step 2: Download the HTML web page to parse

Use the following code to download the web page.


// Requires: using System; using System.ComponentModel; using System.Net;
private void Download(string url)
{
    WebClient client = new WebClient();
    // Raised when the asynchronous download finishes, fails, or is cancelled.
    client.DownloadFileCompleted += new AsyncCompletedEventHandler(client_DownloadFileCompleted);
    // Returns immediately; the page is saved to c:\temp.html in the background.
    client.DownloadFileAsync(new Uri(url), @"c:\temp.html");
}

void client_DownloadFileCompleted(object sender, AsyncCompletedEventArgs e)
{
    //do something here
}

Step 3: Parse the downloaded web page for <a href="..."></a> links using Html Agility Pack

Use the following code in client_DownloadFileCompleted() to enumerate all links in the web page.

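// Requires: using System.Text; using HtmlAgilityPack;
void client_DownloadFileCompleted(object sender, AsyncCompletedEventArgs e)
{
    // Stop if the download failed or was cancelled; temp.html may be missing or incomplete.
    if (e.Error != null || e.Cancelled) return;

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(@"c:\temp.html", Encoding.UTF8);

    // Select every <a> element that has an href attribute.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
    if (links == null) return;

    foreach (HtmlNode link in links)
    {
        string url = link.Attributes["href"].Value;
        Console.WriteLine("Url in temp.html: {0}", url);
    }
}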
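
The href values printed above are written exactly as they appear in the page, so some of them will probably be relative (for example "page2.html"). Below is a minimal sketch of turning such values into absolute URLs with System.Uri. The class name LinkResolver, the method ToAbsolute and the example.com addresses are assumptions made for illustration only; they are not part of Html Agility Pack or of the code above.

```csharp
using System;

static class LinkResolver
{
    // Hypothetical helper: combines the address of the downloaded page
    // with an href value taken from one of its links.
    // Returns null when the href cannot be turned into a URI.
    public static string ToAbsolute(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri result;
        return Uri.TryCreate(baseUri, href, out result) ? result.AbsoluteUri : null;
    }
}

static class Program
{
    static void Main()
    {
        // Placeholder page address, not taken from the article.
        string page = "http://example.com/docs/index.html";

        // A relative link is resolved against the folder of the page.
        Console.WriteLine(LinkResolver.ToAbsolute(page, "page2.html"));
        // "../" moves one folder up before resolving.
        Console.WriteLine(LinkResolver.ToAbsolute(page, "../about.html"));
        // A link that is already absolute is returned unchanged.
        Console.WriteLine(LinkResolver.ToAbsolute(page, "http://other.example.org/"));
    }
}
```

Inside the foreach loop, a call such as LinkResolver.ToAbsolute(pageUrl, url) could replace the raw value, where pageUrl is the address that was passed to Download().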
