
Scraping html tables into R data frames using the XML package

r
data-manipulation
html-scraping
r-package
by Alex Kataev · Dec 8, 2024
TLDR

readHTMLTable() from the XML package in R readily converts an HTML table into a data frame. Here’s the quick rundown:

# Load those fancy R packages
library(XML)
library(RCurl)

# Specify your URL here and no, you can't use WebMD
url <- "http://www.example.com/table.html"

# Get HTML content but beware, this ain't a pirate!
html <- getURL(url, .opts = list(ssl.verifypeer = FALSE))

# Read all tables, yes, even the pesky hidden ones!
tables <- readHTMLTable(html)

# We are going for gold here, so let's get the longest table!
# (readHTMLTable() can return NULL entries, so guard against them)
tableIndex <- which.max(sapply(tables, function(t) if (is.null(t)) 0L else nrow(t)))

# Extracting our prize table into df, and voila!
df <- tables[[tableIndex]]
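Once df lands, a quick sanity check never hurts — a tiny sketch assuming the snippet above ran cleanly:

# Peek at what actually got scraped
dim(df)      # rows x columns
str(df)      # column types (readHTMLTable often hands back character columns)
head(df, 3)  # first few rows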

Remember to install the necessary packages (XML, RCurl) if you haven't already, and handle SSL verification with care: setting ssl.verifypeer = FALSE is convenient for a quick scrape, but it skips certificate checks, which matters for data security on HTTPS-enabled websites.
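If either package is missing, installing both from CRAN is a one-liner:

# One-time setup: grab the scraping dependencies from CRAN
install.packages(c("XML", "RCurl"))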

Learning the ropes: HTML table extraction in R

Complex HTML scenarios: Bring it on!

HTML tables can sometimes be as complex as your ex's emotions!

  • To tackle nested tables, xpathSApply() is your perfect wingman: it extracts data from specific parts of the HTML document, even deeply nested ones.

  • For tables without headers, you can use underworld tactics with XML::getNodeSet() and XPath expressions to pull out the raw table nodes, then manually patch column names onto your data frame. We leave no table unturned!

  • Caught up in the web of advanced data manipulation? The rlist package can easily stack a list of scraped tables into one data frame. Keep your tools razor sharp, mate — the sketch after this list pulls these tricks together.
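Here's a minimal sketch of those three tricks in one go. The URL is a placeholder and the column names are purely hypothetical:

# Parse once, then carve up the document with XPath
library(XML)
library(RCurl)
library(rlist)

html <- getURL("http://www.example.com/table.html",
               .opts = list(ssl.verifypeer = FALSE))
doc  <- htmlParse(html)

# xpathSApply() cherry-picks pieces of the document, e.g. every table caption
captions <- xpathSApply(doc, "//table/caption", xmlValue)

# getNodeSet() grabs the raw <table> nodes, nested or not
table_nodes <- getNodeSet(doc, "//table")

# Read each node on its own; header = FALSE keeps headerless tables intact
raw_tables <- lapply(table_nodes, readHTMLTable, header = FALSE)

# Drop anything that failed to parse, patch names on by hand,
# then let rlist stack the frames that share the same shape
raw_tables <- list.clean(raw_tables, fun = is.null, recursive = FALSE)
raw_tables <- lapply(raw_tables, setNames, c("col_a", "col_b"))  # hypothetical headers
combined   <- list.rbind(raw_tables)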

Unruly HTML chaos: Tidying up

Let's clean up this dumped data which can be as filthy as a pirate ship cabin:

  • Call in the laptop-wielding CSI: use gsub() to clean up individual entries — stray whitespace, footnote markers, currency symbols — because who likes junk, right?

  • If data is scattered over multiple tables, bring some order to your battlefield by using as.data.frame() and rbind(). Wrangle that rowdy data into a single, sober data frame — see the sketch after this list.
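A small clean-up sketch along those lines: tables and tableIndex come from the TLDR snippet, while the Price column is purely hypothetical:

df <- tables[[tableIndex]]

# gsub() strips everything that isn't a digit or a decimal point,
# so the column can become properly numeric
df$Price <- as.numeric(gsub("[^0-9.]", "", df$Price))

# When the data is split across several tables of the same shape,
# coerce each piece and stack them with rbind()
parts    <- Filter(Negate(is.null), tables)
combined <- do.call(rbind, lapply(parts, as.data.frame))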

Going advanced: rvest package to the rescue

If you found XML helpful, you'll find the rvest package a superhero with just the right powers:

# Load rvest package, time to get super
library(rvest)

# Load HTML like the Matrix
webpage <- read_html(url)

# Find those hiding table nodes, we've got you cornered!
table_nodes <- html_nodes(webpage, "table")

# Tidy up those tables, it's cleanup time
cleaned_tables <- lapply(table_nodes, html_table)

# Pick the longest table again (indices here may not match the XML run)
tableIndex <- which.max(sapply(cleaned_tables, nrow))

# Load it into the data frame, it's home time
df <- cleaned_tables[[tableIndex]]

Handling oversize tables: Call in the Matrix

When dealing with monstrously large tables, a matrix can be your getaway vehicle:

# Measuring up the big gun (unlist() walks a data frame column by column,
# so byrow = FALSE keeps every value in its original column)
big_table <- matrix(unlist(tables[[tableIndex]]),
                    ncol = ncol(tables[[tableIndex]]),
                    byrow = FALSE)

# Loading it into the data frame. Mission accomplished!
df <- as.data.frame(big_table)
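One caveat: the matrix detour drops the original headers and coerces every value to character, so patch both back afterwards. A small sketch — the Score column is purely hypothetical:

# Restore the headers lost in the matrix round-trip
names(df) <- names(tables[[tableIndex]])

# Everything is character now; convert numeric columns back as needed
df$Score <- as.numeric(df$Score)  # hypothetical column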