How to read HTML from a URL in Python 3
Here's how to use requests to quickly read HTML from a URL. Use requests.get() to make a GET request to the server, then access the .text attribute to retrieve the HTML content:
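In code, that looks like this (assuming the requests library is installed; installation is covered further down):

```python
import requests

url = "http://example.com"
response = requests.get(url)   # send a GET request to the server
html_content = response.text   # the decoded HTML as a string
print(html_content)
```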
This gets you the HTML content of http://example.com in a jiffy. Quick and to the point! Now for the in-depth walkthrough.
Check your Python version
# If Python were a coffee, this code would tell you how strong it is.
import sys
print(sys.version)
The requests library works gracefully with Python 3.4 and above. Check your Python version to ensure compatibility.
Install requests
# Knock, knock. Who's there? Requests.
pip install requests
Make sure you have the requests module installed. It simplifies fetching HTML content, making your life a whole lot easier.
Catching exceptions
# Sometimes, life throws you exceptions. You gotta catch 'em all!
import requests

url = 'http://example.com'

try:
    response = requests.get(url)
    response.raise_for_status()  # Throws a fit if the HTTP request failed
    html_content = response.text
    print(html_content)
except requests.exceptions.RequestException as e:
    print(f"A wild error appeared: {e}")
Build error tolerance into any code that talks to the web: handle unexpected road bumps like network failures, incorrect URLs, and unanticipated server responses.
In Bytes We Trust
# When life gives you bytes, decode them!
import urllib.request

url = "http://example.com"

with urllib.request.urlopen(url) as response:
    html_bytes = response.read()
    html = html_bytes.decode("utf8")
    print(html)
With urllib, HTML content comes back as bytes. Decode those bytes using decode(), then store your HTML content as a plain string. Open the connection with a with statement so it's closed for you automatically; that's good housekeeping.
The urllib alternative
# When requests are too demanding, I turn to urllib.
import urllib.request

url = 'http://example.com'

with urllib.request.urlopen(url) as response:
    raw_html = response.read().decode('utf8')
    print(raw_html)
When requests isn't available, urllib is a solid fallback option: it ships with the standard library. The with statement closes the connection for you, managing resources responsibly.
Top-notch error handling
# URL validation with Python, because URLs need love too!
Always validate the URL format before fetching. It's preventive medicine: it catches common mistakes before they turn into failed HTML retrievals.
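One lightweight way to do that is the standard library's urllib.parse.urlparse. The helper below is a sketch (the function name is my own): it accepts a URL only if it has an http(s) scheme and a host.

```python
from urllib.parse import urlparse

def is_valid_url(url):
    """Return True if the URL has an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(is_valid_url("http://example.com"))  # True
print(is_valid_url("example.com"))         # False: no scheme
```

Run the check before calling requests.get() and you can fail fast with a clear message instead of a confusing connection error.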
Resource management
# Nice coders clean up after themselves.
Just fetch your data and move along. Don't manually assemble HTML or clean up data. Leverage built-in methods and libraries whenever possible.
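For instance, requests can do the cleanup for you: a Session used as a context manager closes its pooled connections when the block exits. A minimal sketch, assuming requests is installed:

```python
import requests

url = "http://example.com"

# The Session closes its pooled connections automatically on exit,
# even if an exception is raised inside the block.
with requests.Session() as session:
    response = session.get(url)
    html = response.text
```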
Read the docs
# The best advice any programmer can give you is “RTFM”.
Learning to fetch and read HTML in Python is just the tip of the iceberg. Look beyond and discover Python's core documentation, the full potential of the requests library, and parsing libraries like Beautiful Soup and lxml.
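Even before reaching for third-party parsers, you can get a taste of parsing with the standard library's html.parser. This TitleParser sketch (the class name is my own) pulls the text out of a page's <title> element:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed("<html><head><title>Example Domain</title></head></html>")
print(parser.title)  # Example Domain
```

For anything beyond toy examples, though, Beautiful Soup or lxml will save you from writing this kind of state machine by hand.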