How to use HTML Agility pack

By Alex Kataev · Aug 4, 2024
TLDR

HTML Agility Pack is a .NET library for parsing and editing HTML. Add it to your project via NuGet (Install-Package HtmlAgilityPack). To load HTML, start with var htmlDoc = new HtmlDocument(); then call htmlDoc.LoadHtml(html);. To home in on the data you want, use XPath: var nodes = htmlDoc.DocumentNode.SelectNodes("//div[@class='info']");. SelectNodes returns null, not an empty collection, when nothing matches, so guard the loop like this:

    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            string infoContent = node.InnerText.Trim();
            // Use infoContent here; the world is your oyster... full of data.
        }
    }

The code above pulls out the trimmed text of every <div> element whose class attribute is exactly info.

Fluent motions with HTML Agility Pack

Decoding HTML entities

Use the HtmlEntity.DeEntitize() method to turn HTML entities such as &amp; and &lt; back into plain characters. This is particularly handy when a node's text contains HTML-encoded special characters that you need to process, or simply read, as regular text.

    // InnerText leaves entities such as &amp; encoded, so decode them explicitly.
    string textWithEntities = htmlNode.InnerText;
    string decodedText = HtmlEntity.DeEntitize(textWithEntities);
    // Presto, no more entities!

Spontaneous HTML loading using streams

LoadHtml() takes a string, but the Load() method also accepts a file path, a Stream, or a TextReader, so you can parse HTML from other sources. For example, load content straight from a MemoryStream:

    // Requires using System.IO; and using System.Text;
    using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
    {
        // Pass the encoding explicitly so the bytes are decoded as UTF-8.
        htmlDoc.Load(stream, Encoding.UTF8);
        // htmlDoc is charged up and ready to roll!
    }

Visor down: XPath for precise targeting

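XPath is the query language HTML Agility Pack uses to locate nodes, so it pays to learn its syntax (the W3Schools XPath Tutorial is a good starting point). An expression such as //div[@class='info'] describes which elements you want and which attributes they must have, leading you straight to your data treasure chest.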
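As a rough illustration of a few common XPath patterns, here is a minimal sketch. The sample markup, the class name XPathExamples, and the helper CountMatches are hypothetical and exist only for this example; SelectNodes and SelectSingleNode are the library's own methods.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathExamples
{
    // Hypothetical sample markup, used only for illustration.
    public const string SampleHtml = @"
<div id='main'>
  <div class='info'>First</div>
  <div class='info extra'>Second</div>
  <a href='/docs'>Docs</a>
  <a>No link</a>
</div>";

    // Hypothetical helper: number of nodes matching an XPath expression.
    // SelectNodes returns null when nothing matches, hence the null check.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Exact attribute match: class="info" only, not class="info extra".
        Console.WriteLine(CountMatches(SampleHtml, "//div[@class='info']"));

        // Substring match on the class attribute: both inner divs.
        // Note that contains() would also match a class like "information".
        Console.WriteLine(CountMatches(SampleHtml, "//div[contains(@class, 'info')]"));

        // Only anchors that actually carry an href attribute.
        Console.WriteLine(CountMatches(SampleHtml, "//a[@href]"));

        // SelectSingleNode returns the first match, or null if there is none.
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        var main = doc.DocumentNode.SelectSingleNode("//div[@id='main']");
        Console.WriteLine(main != null ? main.Id : "not found");
    }
}
```

The exact match versus contains() distinction matters in practice, because real pages often put several class names in one attribute.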

Dodging the feared NullReference

Stop NullReferenceException in its tracks by confirming that a node exists before you read its attributes. Use the ?. operator or an explicit null check; GetAttributeValue also takes a default value, which it returns when the attribute is missing.

    // node may be null if the earlier query found nothing.
    var attributeValue = node?.GetAttributeValue("href", null);
    if (attributeValue != null)
    {
        // The node exists, has an href attribute, and is ready to party!
    }

Tackling erroneous and imperfect HTML

Wrestling with poorly structured HTML

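HTML Agility Pack is designed to take on flawed HTML without breaking a sweat: instead of throwing, it builds the best tree it can and records the problems it found in the document's ParseErrors collection. Switch on OptionFixNestedTags before loading when you are dealing with stubborn, badly nested tags.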
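A minimal sketch of how this might look, assuming a small piece of deliberately broken markup. The class name MalformedHtmlExample, the helper LoadLenient, and the sample HTML are hypothetical; OptionFixNestedTags and ParseErrors are members of HtmlDocument.

```csharp
using System;
using HtmlAgilityPack;

public static class MalformedHtmlExample
{
    // Hypothetical helper: parse HTML with nested-tag fixing switched on.
    public static HtmlDocument LoadLenient(string html)
    {
        var doc = new HtmlDocument
        {
            // Must be set before the document is loaded.
            OptionFixNestedTags = true
        };
        doc.LoadHtml(html);
        return doc;
    }

    public static void Main()
    {
        // Hypothetical broken markup: <b> and the second <li> are never closed.
        var doc = LoadLenient("<ul><li>One</li><li><b>Two</ul>");

        // Parsing does not throw; any problems found are listed here.
        foreach (var error in doc.ParseErrors)
        {
            Console.WriteLine(
                $"Line {error.Line}, column {error.LinePosition}: {error.Reason}");
        }

        // The tree is still queryable despite the broken input.
        var items = doc.DocumentNode.SelectNodes("//li");
        Console.WriteLine($"List items found: {items?.Count ?? 0}");
    }
}
```

Inspecting ParseErrors is optional, but logging it can help explain why a query returns fewer nodes than expected on a messy page.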

Coding an error-resistant armor

Errors are inevitable, but they're not undefeatable. Wrap critical parsing code, especially file and network access, in try-catch blocks, log the errors, and keep your project running as smooth as butter.
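One way this could be sketched is shown below. The class SafeParsing and the helper TryExtract are hypothetical names, and Console.Error stands in for whatever logger your project uses. The sketch assumes two likely failure modes: the file cannot be read (an IOException) and the XPath expression is malformed (an XPathException).

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.XPath;
using HtmlAgilityPack;

public static class SafeParsing
{
    // Hypothetical helper: returns the trimmed text of every node matching
    // xpath in the file at path, or an empty array if anything goes wrong.
    public static string[] TryExtract(string path, string xpath)
    {
        try
        {
            var doc = new HtmlDocument();
            doc.Load(path);

            var nodes = doc.DocumentNode.SelectNodes(xpath);
            if (nodes == null)
            {
                // No match is not an error; SelectNodes just returns null.
                return Array.Empty<string>();
            }
            return nodes.Select(n => n.InnerText.Trim()).ToArray();
        }
        catch (IOException ex)
        {
            // Covers missing files and directories as well as read failures.
            Console.Error.WriteLine($"Could not read '{path}': {ex.Message}");
        }
        catch (XPathException ex)
        {
            // Raised when the XPath expression itself is invalid.
            Console.Error.WriteLine($"Invalid XPath '{xpath}': {ex.Message}");
        }
        return Array.Empty<string>();
    }

    public static void Main()
    {
        // Hypothetical file name; the call logs the problem and carries on.
        var texts = TryExtract("does-not-exist.html", "//p");
        Console.WriteLine($"Extracted {texts.Length} item(s).");
    }
}
```

Catching specific exception types rather than a bare Exception keeps genuine bugs visible while still handling the failures you expect.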

Pulling HTML straight from the internet's heart

Why write the HTTP request code yourself when HtmlWeb can do the job for you? It fetches the page and parses it in one step. Load HTML using LoadFromWebAsync and make life easy.

    // Must run inside an async method.
    var web = new HtmlWeb();
    var htmlDoc = await web.LoadFromWebAsync("http://example.com");
    // htmlDoc is now loaded with the HTML content of the webpage, web-scrapers unite!