Python: How to make absolute links in BeautifulSoup

Real HTML data obtained from the web often contains relative internal links. This is inconvenient for further parsing. That is why, whenever you have new HTML data that you need to parse for internal links, it is worth converting all of those links to absolute URLs first. This is done by adding the scheme and domain name and resolving the path against the URL of the page.

Everything can be done with Python tools.

For this example we will analyse the page https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. If you look at the source code of this page, you will find many links whose href contains only a path, without a scheme or domain name, for example /wiki/ISO_3166-2.

Relative links of this kind should be converted. For example, the link above should become https://en.wikipedia.org/wiki/ISO_3166-2

Downloading HTTP data

HTTP data can be downloaded with the requests library.

I always advise sending the web server a proper browser name (the User-Agent header) and a proper referrer (the Referer header). The page itself can be given as the referrer.


import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'
headers = {
    # Present ourselves as a regular browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0',
    # Use the requested page itself as the referrer
    'Referer': url,
}
# Download the HTML file at the given URL
response = requests.get(url, headers=headers)
# Continue only if the response is OK (status code 200)
if response.status_code == 200:
    # Parse the HTML data into a soup structure
    soup = BeautifulSoup(response.content, 'lxml')
    # Convert relative links into absolute ones, if any
    # (MakeAbsoluteLinks is defined in the next section)
    MakeAbsoluteLinks(url, soup)
else:
    # The download failed
    print(f"Can't read url = '{url}',\nError code = {response.status_code}")

Convert relative path to absolute in BeautifulSoup

Relative links can be found in the following HTML tags:

  • a
  • link
  • img

To convert a relative link to its absolute value we can use the urljoin function from the urllib.parse module. This function joins the URL of the current page with a link found on that page to produce an absolute URL. If the link is already absolute, urljoin returns it unchanged.
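
To illustrate how urljoin treats different kinds of links, here is a minimal sketch. The base URL is the page from this article; the sample links are made up for illustration and are not taken from the real page source.

```python
from urllib.parse import urljoin

# URL of the page the links were found on
base = 'https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'

# Illustrative sample links (assumptions, not copied from the real page)
examples = [
    '/wiki/ISO_3166-2',                    # path relative to the site root
    'ISO_3166-2',                          # path relative to the current "directory"
    '//upload.wikimedia.org/example.png',  # scheme-relative link
    'https://www.iso.org/',                # already absolute
    '#History',                            # fragment on the same page
]

# Resolve every sample link against the base URL
resolved = {link: urljoin(base, link) for link in examples}

for link, absolute in resolved.items():
    print(f'{link!r:40} -> {absolute}')
```

The first two links both resolve to https://en.wikipedia.org/wiki/ISO_3166-2, the scheme-relative link inherits https from the base URL, and the absolute link is returned unchanged.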


from urllib.parse import urljoin

def MakeAbsoluteLinks(url, soup):
    '''
        Convert all links in a BeautifulSoup object to absolute URLs
        url - base url (the url the page was downloaded from)
        soup - soup object (modified in place)
    '''
    # Find all a tags
    for a in soup.find_all('a'):
        # If this tag has a non-empty href attribute
        if a.get('href'):
            # Make the link absolute
            a['href'] = urljoin(url, a['href'])
    # Find all link tags
    for link in soup.find_all('link'):
        # If this tag has a non-empty href attribute
        if link.get('href'):
            # Make the link absolute
            link['href'] = urljoin(url, link['href'])
    # Find all img tags
    for img in soup.find_all('img'):
        # If this tag has a non-empty src attribute
        if img.get('src'):
            # Make the link absolute
            img['src'] = urljoin(url, img['src'])

After this function has run, all links in these tags are in absolute format. The function modifies the tags in place, so the original soup object is changed and nothing needs to be returned.
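
As a quick offline check of this behaviour, the same idea can be sketched in a more compact, table-driven form and run against a small HTML fragment. The names make_absolute_links and URL_ATTRIBUTES and the sample HTML are hypothetical, introduced only for this illustration; the built-in html.parser is used here so the sketch does not depend on lxml.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical lookup table: tag name -> attribute that holds a URL
URL_ATTRIBUTES = {'a': 'href', 'link': 'href', 'img': 'src'}

def make_absolute_links(url, soup):
    """Rewrite the URL attributes of a, link and img tags in place."""
    for tag_name, attr in URL_ATTRIBUTES.items():
        for tag in soup.find_all(tag_name):
            # Skip tags where the attribute is missing or empty
            if tag.get(attr):
                tag[attr] = urljoin(url, tag[attr])

# Made-up HTML fragment for testing without a network connection
html = '''<html><head><link rel="stylesheet" href="/static/site.css"></head>
<body><a href="/wiki/ISO_3166-2">ISO 3166-2</a>
<a name="top">anchor without href</a>
<img src="images/flag.png"></body></html>'''

soup = BeautifulSoup(html, 'html.parser')
make_absolute_links('https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes', soup)

print(soup.a['href'])     # https://en.wikipedia.org/wiki/ISO_3166-2
print(soup.link['href'])  # https://en.wikipedia.org/static/site.css
print(soup.img['src'])    # https://en.wikipedia.org/wiki/images/flag.png
```

Keeping the tag and attribute pairs in one dictionary makes it easy to cover further tags later, for example script with src, without repeating the loop.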


Published: 2022-05-13 14:16:11
Updated: 2022-05-13 14:17:35

© 2020 MyCoding.uk - My blog about coding and further learning. This blog was written in pure Perl, and the front-end output is rendered with TemplateToolkit.