
How can I scrape a page with dynamic content (created by JavaScript) in Python?

python
scraping
javascript-rendering
selenium
by Anton Shumikhin · Feb 10, 2025
TLDR

To scrape dynamic content in Python, leverage the Selenium webdriver:

from selenium import webdriver

# Initialize WebDriver for Chrome (Selenium 4.6+ locates the driver binary automatically)
driver = webdriver.Chrome()

# Navigate to the webpage
driver.get('http://example.com')

# Wait up to 10 seconds to allow the content to load
driver.implicitly_wait(10)

# Extract the rendered HTML and print it
print(driver.page_source)

# Close the browser
driver.quit()

Selenium renders the JavaScript on a webpage just as a real user's browser would, exposing the dynamic content for extraction.

Balance of power: Tools and their trade-offs

When choosing a tool for scraping dynamic content, consider the balance between usability, performance, and long-term maintenance. Here's a quick overview:

  • Requests-HTML - Simple and efficient for rendering JavaScript without heavy lifting.

  • Splash - A little more complex and requires Docker, but it's OS-independent and brings superpowers in JS rendering.

  • Selenium - The gold standard for browser automation, offering extensive control but at the expense of system resources.

Beware of unmaintained packages like dryscrape and phantomjs, which are no longer supported.

Handling time in a dynamic world

Wait times are just as dynamic as the data you're scraping. driver.implicitly_wait() covers most waits after page actions, but sometimes explicit pauses are necessary:

import time
from selenium.webdriver.common.by import By

# Trigger an action that might require a buffer for JS execution
driver.find_element(By.ID, 'that_little_red_button').click()

# Pause the script for a few seconds, allowing the page to catch up
time.sleep(5)  # time spent thinking about the meaning of life

# Now, scrape the data
data = driver.find_element(By.ID, 'fresh_content').text

This strategy suits pages whose elements take a while to appear due to animations or deferred loading.
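When a fixed sleep feels too blunt, an explicit wait for visibility reacts the moment the element is actually shown. A minimal sketch, assuming a hypothetical animated_panel ID:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the (hypothetical) animated panel is visible, not merely present in the DOM
panel = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, 'animated_panel'))
)
data = panel.text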

Secondary sources: unforeseen goldmines

Dynamic websites often load data from secondary URLs via AJAX. Using the Network tab in your browser's developer tools, you can detect these calls and access the data directly in simple JSON or XML formats.

import requests

# Let's assume you've uncovered the sneaky secondary URL JS is interacting with
response = requests.get('http://example.com/data_endpoint')

# You can now extract the data directly
youve_got_mail = response.json()  # And just like that, an outlaw in the wild wild web

This is often more efficient and definitely a better conversation starter compared to rendering the entire page.
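Endpoints often expect the same query parameters and headers the page itself sends; copy them from the Network tab. A sketch with hypothetical parameter and header values:

import requests

# Hypothetical endpoint, parameters, and headers lifted from the Network tab
params = {'page': 1, 'sort': 'newest'}
headers = {
    'User-Agent': 'Mozilla/5.0',           # some endpoints reject the default requests UA
    'X-Requested-With': 'XMLHttpRequest',  # commonly sent alongside AJAX calls
}

response = requests.get('http://example.com/data_endpoint', params=params, headers=headers)
response.raise_for_status()
items = response.json()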

Going incognito with headless mode

Modern browsers like Chrome and Firefox have a headless mode, i.e., they can operate without a visible user interface:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set the headless option
chrome_options = Options()
chrome_options.add_argument("--headless")

# Fire up Chrome in headless mode
driver = webdriver.Chrome(options=chrome_options)

# Happy scraping!
driver.get('http://example.com')

Ideal when working in server environments where a GUI isn't available or when you wish to minimize browser overhead.
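Firefox follows the same pattern. A minimal sketch of the Firefox equivalent, assuming Selenium 4's built-in driver management:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

firefox_options = Options()
firefox_options.add_argument("--headless")

# Fire up Firefox without a visible window
driver = webdriver.Firefox(options=firefox_options)
driver.get('http://example.com')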

Armed to the teeth with Scrapy and Splash

For complex tasks and sites loaded with JavaScript, Scrapy combined with Splash offers a comprehensive solution. Splash adds full JavaScript rendering to Scrapy's power in an easy-to-use package.

# Inside your Scrapy settings.py

# Splash service URL
SPLASH_URL = 'http://localhost:8050'

# Add the Splash middlewares to the downloader pipeline
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

This combines the efficiency of Scrapy with Splash's JavaScript rendering for heavy-duty scraping tasks.
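With those settings in place, a spider swaps plain Requests for SplashRequests. A minimal sketch, assuming a Splash instance is listening on localhost:8050:

import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = 'js_spider'

    def start_requests(self):
        # 'wait' gives the page's JavaScript time to run before the HTML is returned
        yield SplashRequest('http://example.com', self.parse, args={'wait': 2})

    def parse(self, response):
        # response.text now contains the rendered HTML
        yield {'title': response.css('title::text').get()}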

Light-weight rendering with Requests-HTML

For lower computation overhead, Requests-HTML performs light-weight JavaScript rendering:

from requests_html import HTMLSession

# Initiate the session
session = HTMLSession()

# Request the page
r = session.get('http://example.com')

# Render JavaScript
r.html.render()

# Now, dive into the content
print(r.html.text)

A solid choice when comprehensive browser features are overkill.
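For slow pages, render() accepts tuning parameters such as sleep (pause after the initial load) and timeout (bound on the whole operation). A sketch, where the #content selector is a hypothetical placeholder:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://example.com')

# Give slow pages more room: pause after rendering and raise the overall timeout
r.html.render(sleep=2, timeout=20)

# '#content' is a hypothetical selector; substitute your target element
print(r.html.find('#content', first=True).text)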

Pinpointing elements with selectors

To extract data from a page after the JS has been rendered, Selenium provides fine-grained control:

from selenium.webdriver.common.by import By

# By ID
content_by_id = driver.find_element(By.ID, 'content')

# By XPath
content_by_xpath = driver.find_element(By.XPATH, '//div[@class="content"]')

# By CSS Selector
content_by_css = driver.find_element(By.CSS_SELECTOR, 'div.content')

The right selector can make all the difference; it's akin to choosing the right path in a forest.

Ensuring full load before scraping

Ensure that the webpage is fully loaded before scraping:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for a specific element to load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dynamic-element"))
)

This call waits up to 10 seconds for an element with the ID dynamic-element to appear in the DOM.
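If the element never arrives, WebDriverWait raises a TimeoutException, which you can catch to fail gracefully:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "dynamic-element"))
    )
    print(element.text)
except TimeoutException:
    # The page never produced the element; bail out cleanly
    print("Dynamic content did not load in time")
    driver.quit()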