
How can I scrape a page with dynamic content (created by JavaScript) in Python?

python
scraping
javascript-rendering
selenium
by Anton Shumikhin · Feb 10, 2025
TLDR

To scrape dynamic content in Python, leverage the Selenium webdriver:

from selenium import webdriver

# Initialize WebDriver for Chrome (Selenium 4.6+ locates the driver binary automatically)
driver = webdriver.Chrome()

# Navigate to the webpage
driver.get('http://example.com')

# Wait up to 10 seconds to allow the content to load
driver.implicitly_wait(10)

# Extract the rendered HTML and print it
print(driver.page_source)

# Close the browser
driver.quit()

Selenium renders the JavaScript on a webpage just as a real user's browser would, exposing the dynamic content for extraction.

Balance of power: Tools and their trade-offs

When choosing a tool for scraping dynamic content, consider the balance between usability, performance, and long-term maintenance. Here's a quick overview:

  • Requests-HTML - Simple and efficient for rendering JavaScript without heavy lifting.

  • Splash - A little more complex and requires Docker, but it's OS-independent and brings superpowers in JS rendering.

  • Selenium - The gold standard for browser automation, offering extensive control but at the expense of system resources.

Beware of unmaintained packages like dryscrape and phantomjs, which are no longer supported.

Handling time in a dynamic world

Wait times are just as dynamic as the data you're scraping. driver.implicitly_wait() covers most waits after page actions, but sometimes explicit pauses are necessary:

import time
from selenium.webdriver.common.by import By

# Trigger an action that might require a buffer for JS execution
driver.find_element(By.ID, 'that_little_red_button').click()

# Pause the script for a few seconds, allowing the page to catch up
time.sleep(5)  # time spent thinking about the meaning of life

# Now, scrape the data
data = driver.find_element(By.ID, 'fresh_content').text

This strategy suits pages whose elements take a while to appear due to animations or deferred loading.
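When a fixed sleep feels too blunt, an explicit wait for visibility reacts the moment the element is actually shown. A minimal sketch, assuming a hypothetical animated_panel ID:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the (hypothetical) animated panel is visible, not merely present in the DOM
panel = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, 'animated_panel'))
)
data = panel.text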

Secondary sources: unforeseen goldmines

Dynamic websites often load data from secondary URLs via AJAX. Using the Network tab in your browser's developer tools, you can detect these calls and access the data directly in simple JSON or XML formats.

import requests

# Let's assume you've uncovered the sneaky secondary URL JS is interacting with
response = requests.get('http://example.com/data_endpoint')

# You can now extract the data directly
youve_got_mail = response.json()  # And just like that, an outlaw in the wild wild web

This is often more efficient and definitely a better conversation starter compared to rendering the entire page.
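Endpoints often expect the same query parameters and headers the page itself sends; copy them from the Network tab. A sketch with hypothetical parameter and header values:

import requests

# Hypothetical endpoint, parameters, and headers lifted from the Network tab
params = {'page': 1, 'sort': 'newest'}
headers = {
    'User-Agent': 'Mozilla/5.0',           # some endpoints reject the default requests UA
    'X-Requested-With': 'XMLHttpRequest',  # commonly sent alongside AJAX calls
}

response = requests.get('http://example.com/data_endpoint', params=params, headers=headers)
response.raise_for_status()
items = response.json()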

Going incognito with headless mode

Modern browsers like Chrome and Firefox have a headless mode, i.e., they can operate without a visible user interface:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set the headless option
chrome_options = Options()
chrome_options.add_argument("--headless")

# Fire up Chrome in headless mode
driver = webdriver.Chrome(options=chrome_options)

# Happy scraping!
driver.get('http://example.com')

Ideal when working in server environments where a GUI isn't available or when you wish to minimize browser overhead.
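Firefox follows the same pattern. A minimal sketch of the Firefox equivalent, assuming Selenium 4's built-in driver management:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

firefox_options = Options()
firefox_options.add_argument("--headless")

# Fire up Firefox without a visible window
driver = webdriver.Firefox(options=firefox_options)
driver.get('http://example.com')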

Armed to the teeth with Scrapy and Splash

For complex tasks and sites loaded with JavaScript, Scrapy combined with Splash offers a comprehensive solution. Splash adds full JavaScript rendering to Scrapy's power in an easy-to-use package.

# Inside your Scrapy settings.py

# Splash service URL
SPLASH_URL = 'http://localhost:8050'

# Add the Splash middlewares to the downloader pipeline
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

This combines the efficiency of Scrapy with Splash's JavaScript rendering for heavy-duty scraping tasks.
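With those settings in place, a spider swaps plain Requests for SplashRequests. A minimal sketch, assuming a Splash instance is listening on localhost:8050:

import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = 'js_spider'

    def start_requests(self):
        # 'wait' gives the page's JavaScript time to run before the HTML is returned
        yield SplashRequest('http://example.com', self.parse, args={'wait': 2})

    def parse(self, response):
        # response.text now contains the rendered HTML
        yield {'title': response.css('title::text').get()}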

Light-weight rendering with Requests-HTML

For lower computation overhead, Requests-HTML performs light-weight JavaScript rendering:

from requests_html import HTMLSession

# Initiate the session
session = HTMLSession()

# Request the page
r = session.get('http://example.com')

# Render JavaScript
r.html.render()

# Now, dive into the content
print(r.html.text)

A solid choice when comprehensive browser features are overkill.
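For slow pages, render() accepts tuning parameters such as sleep (pause after the initial load) and timeout (bound on the whole operation). A sketch, where the #content selector is a hypothetical placeholder:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://example.com')

# Give slow pages more room: pause after rendering and raise the overall timeout
r.html.render(sleep=2, timeout=20)

# '#content' is a hypothetical selector; substitute your target element
print(r.html.find('#content', first=True).text)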

Pinpointing elements with selectors

To extract data from a page after the JS has been rendered, Selenium provides fine-grained control:

from selenium.webdriver.common.by import By

# By ID
content_by_id = driver.find_element(By.ID, 'content')

# By XPath
content_by_xpath = driver.find_element(By.XPATH, '//div[@class="content"]')

# By CSS Selector
content_by_css = driver.find_element(By.CSS_SELECTOR, 'div.content')

The right selector can make all the difference; it's akin to choosing the right path in a forest.

Ensuring full load before scraping

Ensure that the webpage is fully loaded before scraping:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for a specific element to load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dynamic-element"))
)

This call waits up to 10 seconds for an element with the ID dynamic-element to appear in the DOM.
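If the element never arrives, WebDriverWait raises a TimeoutException, which you can catch to fail gracefully:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "dynamic-element"))
    )
    print(element.text)
except TimeoutException:
    # The page never produced the element; bail out cleanly
    print("Dynamic content did not load in time")
    driver.quit()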