Extract protocol and host name from URL

by Nikita Barsukov · Feb 7, 2025

Here's a quick way to extract protocol and host from a URL using Python's urlparse():

from urllib.parse import urlparse

# The good ol' standard method
url = "http://www.example.com"
parsed_url = urlparse(url)
result = f"{parsed_url.scheme}://{parsed_url.netloc}/"
print(result)  # Ahoy! It prints: http://www.example.com/

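scheme gives you the protocol; netloc gives you the network location, which is the host plus any port or credentials that appear in the URL. Use parsed_url.hostname when you need the bare host name.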

URL parsing - the nitty-gritty

Python's urlparse function does the heavy lifting here. It breaks down the URL into digestible elements by returning a named tuple containing its six components: scheme, netloc, path, params, query, and fragment.
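To see all six components at once, here is a small sketch. The URL is a made-up example, chosen only so that every field has something in it:

```python
from urllib.parse import urlparse

# Hypothetical URL, picked so that no component is empty
sample = "https://user:secret@www.example.com:8443/docs/page;v=2?q=python&lang=en#intro"
parts = urlparse(sample)

print(parts.scheme)    # https
print(parts.netloc)    # user:secret@www.example.com:8443
print(parts.path)      # /docs/page
print(parts.params)    # v=2
print(parts.query)     # q=python&lang=en
print(parts.fragment)  # intro

# Convenience attributes derived from netloc
print(parts.hostname)  # www.example.com
print(parts.port)      # 8443
```

Note how netloc carries the credentials and port along with the host, while hostname and port pull them apart for you.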

Error management and other methods

The safety net: try-except

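urlparse is lenient and rarely raises, but some malformed inputs (an unbalanced bracket in an IPv6 host, for instance) trigger a ValueError. To keep that from ruining our day, let's wrap the URL parsing in a try-except block: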

# Preventing the parade from getting rained on
try:
    parsed_url = urlparse(url)
    result = f"{parsed_url.scheme}://{parsed_url.netloc}/"
    print(result)  # Hop in joy, it worked!
except ValueError as e:
    print(f"Whoopsie Daisy, an error: {e}")  # Oh well, at least we caught it!
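Catching ValueError only covers inputs that urlparse refuses outright. A string with no scheme parses without complaint and simply comes back with empty fields. The sketch below handles both cases; base_url is a hypothetical helper name, and returning None on failure is just one possible convention:

```python
from urllib.parse import urlparse

def base_url(url):
    """Return 'scheme://netloc/' or None if the URL is unusable (hypothetical helper)."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # Raised for inputs such as an unbalanced '[' in the host
        return None
    if not parsed.scheme or not parsed.netloc:
        # Parsed fine, but there is no protocol or host to report
        return None
    return f"{parsed.scheme}://{parsed.netloc}/"

print(base_url("http://www.example.com/some/path"))  # http://www.example.com/
print(base_url("www.example.com/some/path"))         # None
print(base_url("http://[::1/broken"))                # None
```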

The backup dancer: urlsplit

urlsplit, the cousin of urlparse, is also an adept performer at dissecting URLs. It behaves similarly, splitting the URL string into components:

from urllib.parse import urlsplit

# URL shenanigans with urlsplit
split_url = urlsplit(url)
result = f"{split_url.scheme}://{split_url.netloc}/"  # Voilà, we have the result!

The last-man-standing: string methods

When the standard parsing functions are off the table, plain string operations are an option. However, that's more akin to a tightrope walk without a safety net, since nothing checks that the input is a well-formed URL:

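# Cutting strings, the caveman way
if '//' in url:
    protocol, rest = url.split('//', 1)  # protocol keeps its colon, e.g. 'http:'
else:
    protocol, rest = '', url  # no scheme found
host = rest.split('/', 1)[0]  # still includes any port or credentials
result = f"{protocol}//{host}/"  # Here's hoping the URL was well-formed!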

Advanced URL parsing

Battle of the methods: urlparse vs tldextract

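urlparse treats the host as one opaque string, so it can't tell you where the subdomain ends and the registered domain begins (think www.example.co.uk). For that, check out the third-party tldextract package, which splits the host into subdomain, domain, and suffix using the Public Suffix List. It doesn't parse the scheme, so pair it with urlparse:

from urllib.parse import urlparse

import tldextract

# Extracting, Now in HD
extracted = tldextract.extract(url)  # fields: subdomain, domain, suffix
scheme = urlparse(url).scheme  # tldextract has no scheme attribute
result = f"{scheme}://{extracted.domain}.{extracted.suffix}/"  # subdomain (e.g. www) is dropped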

urlparse vs urlsplit: The standoff

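urlparse and urlsplit look like twins, but they differ in one respect: urlparse moves the parameters of the last path segment (the part after a ;) into their own params field, while urlsplit leaves them in path and returns five components instead of six. Both handle the query and fragment the same way. Refer to the official Python docs for a heart-to-heart comparison.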
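A quick side-by-side makes the comparison concrete. The practical difference is how path parameters (the ;key=value bit) are treated, and the URL below is a made-up example that includes one:

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with a ;parameter on the last path segment
sample = "http://www.example.com/report;format=pdf?year=2025#summary"
parsed = urlparse(sample)
split = urlsplit(sample)

print(len(parsed), len(split))        # 6 5
print(parsed.path, parsed.params)     # /report format=pdf
print(split.path)                     # /report;format=pdf
print(parsed.netloc == split.netloc)  # True
```

For pulling out just the protocol and host, the two are interchangeable, since scheme and netloc come out identical.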

The secret sauce: Good practices

URL Verification: Check that the parsed result has both a scheme and a netloc before relying on it.
Data sense: Reach for tldextract when you need the registered domain rather than the full host.
Benchmark fanatic: Compare different methods for efficiency in performance-critical applications.
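For the benchmarking tip, a rough sketch with the standard timeit module could look like the following. The sample URL, the function names, and the iteration count are all arbitrary choices, and the timings will vary from machine to machine:

```python
import timeit
from urllib.parse import urlparse, urlsplit

url = "http://www.example.com/some/path?x=1"  # hypothetical sample URL

def with_urlparse():
    p = urlparse(url)
    return f"{p.scheme}://{p.netloc}/"

def with_urlsplit():
    p = urlsplit(url)
    return f"{p.scheme}://{p.netloc}/"

def with_strings():
    # Assumes the URL contains '//', as the sample does
    protocol, rest = url.split("//", 1)
    return f"{protocol}//{rest.split('/', 1)[0]}/"

timings = {}
for func in (with_urlparse, with_urlsplit, with_strings):
    # Total seconds for 100,000 calls of each approach
    timings[func.__name__] = timeit.timeit(func, number=100_000)
    print(f"{func.__name__}: {timings[func.__name__]:.3f}s")
```

Keep in mind that urllib.parse caches recently parsed URLs, so timing one URL over and over flatters the standard functions. For a fairer picture, feed each method a varied list of inputs.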
