Python: How to make absolute links in BeautifulSoup
Real HTML data obtained from the web often contains relative internal links, which are inconvenient for further parsing. So whenever you parse a new HTML page to find its internal links, it helps to convert them all to absolute URLs first. This is done by adding the scheme and domain name and resolving the path against the URL of the page.
All of this can be done with standard Python tools.
For this example we will analyse the page https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. If you look at the source code of this page, you will find many links in the following format:
href="/wiki/ISO_3166-2"
Relative links of this kind should be converted. For example, the link above should become: https://en.wikipedia.org/wiki/ISO_3166-2
Downloading HTTP data
The page can be downloaded with the requests library.
I always advise sending the web server a proper browser name (the User-Agent header) and a proper referrer (the Referer header). The referrer can simply be the page itself.
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'
headers = {
    # The long User-Agent string is split over two lines for readability
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) '
                  'Gecko/20100101 Firefox/20.0',
    # Use the page itself as the referrer
    'Referer': url
}
# Download the HTML page at the given URL
response = requests.get(url, headers=headers)
# Go ahead and analyze it if the response is OK (code 200)
if response.status_code == 200:
    # Parse the HTML data into a soup structure
    soup = BeautifulSoup(response.content, 'lxml')
    # Convert relative links into absolute ones, if any
    # (MakeAbsoluteLinks is defined in the next section)
    MakeAbsoluteLinks(url, soup)
else:
    # The download failed
    print(f"Can't read url = '{url}',\nError code = {response.status_code}")
Convert relative path to absolute in BeautifulSoup
Relative links can be found in the following HTML tags:
- a
- link
- img
To convert a relative link to its absolute form we can use the urljoin function from the urllib.parse module. It joins the URL of the current page with a link found on that page to produce an absolute URL. If the link is already absolute, urljoin returns it unchanged.
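To illustrate how urljoin resolves different kinds of links, here is a small sketch. The base URL is the page from this article. The example.org links and the fragment name are made up for the illustration.

```python
from urllib.parse import urljoin

# Base URL: the page on which the links were found
base = 'https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'

# Root-relative link: its path replaces the path of the base URL
root_relative = urljoin(base, '/wiki/ISO_3166-2')
# → 'https://en.wikipedia.org/wiki/ISO_3166-2'

# Document-relative link: resolved against the "directory" of the base URL (/wiki/)
doc_relative = urljoin(base, 'ISO_3166-2')
# → 'https://en.wikipedia.org/wiki/ISO_3166-2'

# Protocol-relative link (hypothetical host): only the scheme is taken from the base
protocol_relative = urljoin(base, '//example.org/logo.png')
# → 'https://example.org/logo.png'

# Absolute link (hypothetical): returned unchanged
absolute = urljoin(base, 'https://example.org/page')
# → 'https://example.org/page'

# Fragment-only link (hypothetical name): appended to the base URL
fragment = urljoin(base, '#section')
# → 'https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes#section'
```

Because absolute links pass through unchanged, it is safe to call urljoin on every link without first checking whether it is relative.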
from urllib.parse import urljoin

def MakeAbsoluteLinks(url, soup):
    '''
    Convert all links in a BeautifulSoup object to absolute URLs.
    url - base URL (the page that was downloaded)
    soup - soup object (modified in place)
    '''
    # Find all a tags
    for a in soup.find_all('a'):
        # If this tag has a non-empty href attribute
        if a.get('href'):
            # Make the link absolute
            a['href'] = urljoin(url, a['href'])
    # Find all link tags
    for link in soup.find_all('link'):
        # If this tag has a non-empty href attribute
        if link.get('href'):
            # Make the link absolute
            link['href'] = urljoin(url, link['href'])
    # Find all img tags
    for img in soup.find_all('img'):
        # If this tag has a non-empty src attribute
        if img.get('src'):
            # Make the link absolute
            img['src'] = urljoin(url, img['src'])
After this function has run, all links in these tags will be in absolute format. The tags are modified in place, so the original soup object itself is changed and nothing needs to be returned.
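To try the conversion without downloading anything, the same idea can be run against a small inline HTML string. This is only a sketch under a few assumptions: the HTML snippet, the example.org link, the URL_ATTRIBUTES mapping and the make_absolute_links name are invented for the illustration. It uses Python's built-in 'html.parser' instead of 'lxml' so that no extra parser has to be installed.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical mapping: tag name -> attribute that holds the URL
URL_ATTRIBUTES = {'a': 'href', 'link': 'href', 'img': 'src'}

def make_absolute_links(url, soup):
    '''Rewrite the URL attribute of every a, link and img tag in place.'''
    for tag_name, attr in URL_ATTRIBUTES.items():
        for tag in soup.find_all(tag_name):
            value = tag.get(attr)
            # Skip tags where the attribute is missing or empty
            if value:
                tag[attr] = urljoin(url, value)

# Made-up HTML fragment containing several kinds of links
html = '''
<html><head><link rel="stylesheet" href="/static/site.css"></head>
<body>
<a href="/wiki/ISO_3166-2">ISO 3166-2</a>
<a href="https://example.org/page">external</a>
<a name="anchor-without-href">no link</a>
<img src="images/flag.png">
</body></html>
'''

base = 'https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'
soup = BeautifulSoup(html, 'html.parser')
make_absolute_links(base, soup)

# Collect the converted links to inspect the result
hrefs = [a['href'] for a in soup.find_all('a', href=True)]
print(hrefs)
print(soup.find('link')['href'])
print(soup.find('img')['src'])
```

Keeping the tag-to-attribute pairs in one dictionary is a design choice for this sketch: other tags that carry URLs could be covered by adding one entry instead of another loop.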
Published: 2022-05-13 14:16:11
Updated: 2022-05-13 14:17:35