Extract domain from URL in Python
The article is maintained by the team at commabot.
Extracting the domain from a URL in Python can be accomplished using several methods, ranging from utilizing the standard library to leveraging third-party packages for more complex URL parsing. Below is a comprehensive guide that covers different approaches to achieve this task.
urllib
The urllib.parse module provides functions for manipulating URLs and extracting the different parts of a URL, including the domain.
```python
from urllib.parse import urlparse

def extract_domain(url):
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    return domain

url = "https://www.example.com/path/page.html?query=argument"
domain = extract_domain(url)
print(domain)  # Output: www.example.com
```
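Note that netloc keeps any port number (and userinfo) attached to the host. If you want just the host, the parsed result's hostname attribute returns it lowercased and without the port. A quick sketch with an illustrative URL:

```python
from urllib.parse import urlparse

# netloc preserves case and the port; hostname lowercases and strips both
parsed = urlparse("https://WWW.Example.com:8080/path")
print(parsed.netloc)    # WWW.Example.com:8080
print(parsed.hostname)  # www.example.com
```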
Extracting Domain Without Subdomains
If you need just the second-level domain (SLD) and top-level domain (TLD) without subdomains, you can use the tldextract library, which handles various edge cases and is more reliable for complex URLs.
pip install tldextract
Use it as follows:
```python
import tldextract

def extract_sld_tld(url):
    tld_ext = tldextract.extract(url)
    domain = f"{tld_ext.domain}.{tld_ext.suffix}"
    return domain

url = "https://subdomain.example.com/path/page.html?query=argument"
domain = extract_sld_tld(url)
print(domain)  # Output: example.com
```
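The reason to reach for tldextract rather than simple string splitting is that it consults the Public Suffix List. A stdlib-only sketch (with illustrative hostnames) of why taking the last two dot-separated labels is not enough:

```python
from urllib.parse import urlparse

def naive_sld_tld(url):
    # Naive approach: keep only the last two labels of the hostname
    host = urlparse(url).hostname
    return ".".join(host.split(".")[-2:])

print(naive_sld_tld("https://sub.example.com/"))  # example.com (correct)
print(naive_sld_tld("https://www.bbc.co.uk/"))    # co.uk (wrong: the registered domain is bbc.co.uk)
```

Multi-label suffixes such as co.uk are exactly the cases tldextract resolves correctly.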
Regular Expressions
For simple URL structures, you might consider using regular expressions (regex). However, this method is less reliable for complex URLs and might not handle all edge cases well.
```python
import re

def extract_domain_regex(url):
    pattern = r'^(?:http[s]?://)?([^/]+)'
    match = re.search(pattern, url)
    if match:
        return match.group(1)
    return None

url = "http://www.example.com/path/page.html"
domain = extract_domain_regex(url)
print(domain)  # Output: www.example.com
```
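Because the pattern simply captures everything up to the first slash, userinfo and port numbers leak into the result. Two illustrative URLs show where it breaks down:

```python
import re

def extract_domain_regex(url):
    match = re.search(r'^(?:http[s]?://)?([^/]+)', url)
    return match.group(1) if match else None

# Userinfo and ports are not stripped by the naive pattern
print(extract_domain_regex("https://user:pass@example.com/page"))  # user:pass@example.com
print(extract_domain_regex("http://www.example.com:8080/page"))    # www.example.com:8080
```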
Handling Edge Cases
When extracting domains, consider the following edge cases:
- URLs without a scheme (e.g., www.example.com/path).
- Internationalized domain names.
- URLs with port numbers (e.g., www.example.com:8080/path).
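One way to cover the first and third cases with urlparse alone (a sketch, not the only option) is to prepend // when no scheme is present, so the host lands in the network-location part, and then read hostname to drop any port:

```python
from urllib.parse import urlparse

def extract_host(url):
    # Without a scheme, urlparse treats the whole string as a path;
    # prefixing "//" makes the leading part parse as the network location
    if "://" not in url:
        url = "//" + url
    return urlparse(url).hostname

print(extract_host("www.example.com/path"))       # www.example.com
print(extract_host("www.example.com:8080/path"))  # www.example.com
```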
tldextract handles all three cases gracefully. With urllib.parse, two caveats apply: a URL without a scheme is parsed as a path (so netloc comes back empty), and netloc retains any port number (use the hostname attribute to get the bare host). Even so, both tools are preferable to hand-rolled parsing for most applications.
For most use cases, urllib.parse or tldextract is recommended due to their robustness and ability to handle a wide range of URL formats. Regular expressions can work for simpler tasks but require careful handling to avoid common pitfalls.