
Scraping html tables into R data frames using the XML package

r
data-manipulation
html-scraping
r-package
by Alex Kataev · Dec 8, 2024
TLDR

readHTMLTable() from the XML package in R readily converts an HTML table into a data frame. Here’s the quick rundown:

# Load those fancy R packages
library(XML)
library(RCurl)

# Specify your URL here and no, you can't use WebMD
url <- "http://www.example.com/table.html"

# Get HTML content but beware, this ain't a pirate!
html <- getURL(url, .opts = list(ssl.verifypeer = FALSE))

# Read all tables, yes, even the pesky hidden ones!
tables <- readHTMLTable(html)

# We are going for gold here, so let's get the longest table!
# (readHTMLTable() can return NULL entries, so guard against them)
tableIndex <- which.max(sapply(tables, function(t) if (is.null(t)) 0L else nrow(t)))

# Extracting our prize table into df, and voila!
df <- tables[[tableIndex]]
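Once df lands, a quick sanity check never hurts — a tiny sketch assuming the snippet above ran cleanly:

# Peek at what actually got scraped
dim(df)      # rows x columns
str(df)      # column types (readHTMLTable often hands back character columns)
head(df, 3)  # first few rows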

Remember to install the necessary packages (XML, RCurl) if you haven't already, and handle SSL verification with care: setting ssl.verifypeer = FALSE is convenient for a quick scrape, but it skips certificate checks, which matters for data security on HTTPS-enabled websites.
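If either package is missing, installing both from CRAN is a one-liner:

# One-time setup: grab the scraping dependencies from CRAN
install.packages(c("XML", "RCurl"))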

Learning the ropes: HTML table extraction in R

Complex HTML scenarios: Bring it on!

HTML tables can sometimes be as complex as your ex's emotions!

  • To tackle nested tables, xpathSApply() is your perfect wingman: it extracts data from specific parts of the HTML document, even deeply nested ones.

  • For tables without headers, you can use underworld tactics with XML::getNodeSet() and XPath expressions to pull out the raw table nodes, then manually patch column names onto your data frame. We leave no table unturned!

  • Caught up in the web of advanced data manipulation? The rlist package can easily stack a list of scraped tables into one data frame. Keep your tools razor sharp, mate — the sketch after this list pulls these tricks together.
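Here's a minimal sketch of those three tricks in one go. The URL is a placeholder and the column names are purely hypothetical:

# Parse once, then carve up the document with XPath
library(XML)
library(RCurl)
library(rlist)

html <- getURL("http://www.example.com/table.html",
               .opts = list(ssl.verifypeer = FALSE))
doc  <- htmlParse(html)

# xpathSApply() cherry-picks pieces of the document, e.g. every table caption
captions <- xpathSApply(doc, "//table/caption", xmlValue)

# getNodeSet() grabs the raw <table> nodes, nested or not
table_nodes <- getNodeSet(doc, "//table")

# Read each node on its own; header = FALSE keeps headerless tables intact
raw_tables <- lapply(table_nodes, readHTMLTable, header = FALSE)

# Drop anything that failed to parse, patch names on by hand,
# then let rlist stack the frames that share the same shape
raw_tables <- list.clean(raw_tables, fun = is.null, recursive = FALSE)
raw_tables <- lapply(raw_tables, setNames, c("col_a", "col_b"))  # hypothetical headers
combined   <- list.rbind(raw_tables)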

Unruly HTML chaos: Tidying up

Let's clean up this dumped data which can be as filthy as a pirate ship cabin:

  • Call in the laptop-wielding CSI: use gsub() to clean up individual entries — stray whitespace, footnote markers, currency symbols — because who likes junk, right?

  • If data is scattered over multiple tables, bring some order to your battlefield by using as.data.frame() and rbind(). Wrangle that rowdy data into a single, sober data frame — see the sketch after this list.
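A small clean-up sketch along those lines: tables and tableIndex come from the TLDR snippet, while the Price column is purely hypothetical:

df <- tables[[tableIndex]]

# gsub() strips everything that isn't a digit or a decimal point,
# so the column can become properly numeric
df$Price <- as.numeric(gsub("[^0-9.]", "", df$Price))

# When the data is split across several tables of the same shape,
# coerce each piece and stack them with rbind()
parts    <- Filter(Negate(is.null), tables)
combined <- do.call(rbind, lapply(parts, as.data.frame))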

Going advanced: rvest package to the rescue

If you found XML helpful, you'll find the rvest package a superhero with just the right powers:

# Load rvest package, time to get super
library(rvest)

# Load HTML like the Matrix
webpage <- read_html(url)

# Find those hiding table nodes, we've got you cornered!
table_nodes <- html_nodes(webpage, "table")

# Tidy up those tables, it's cleanup time
cleaned_tables <- lapply(table_nodes, html_table)

# Pick the longest table again (indices here may not match the XML run)
tableIndex <- which.max(sapply(cleaned_tables, nrow))

# Load it into the data frame, it's home time
df <- cleaned_tables[[tableIndex]]

Handling oversize tables: Call in the Matrix

When dealing with monstrously large tables, a matrix can be your getaway vehicle:

# Measuring up the big gun (unlist() walks a data frame column by column,
# so byrow = FALSE keeps every value in its original column)
big_table <- matrix(unlist(tables[[tableIndex]]),
                    ncol = ncol(tables[[tableIndex]]),
                    byrow = FALSE)

# Loading it into the data frame. Mission accomplished!
df <- as.data.frame(big_table)
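One caveat: the matrix detour drops the original headers and coerces every value to character, so patch both back afterwards. A small sketch — the Score column is purely hypothetical:

# Restore the headers lost in the matrix round-trip
names(df) <- names(tables[[tableIndex]])

# Everything is character now; convert numeric columns back as needed
df$Score <- as.numeric(df$Score)  # hypothetical column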