    Python Fuzzy Multi-format Redirect Builder

    Chris

    The Fuzzy Multi-format Redirect Builder is a Python script built to take the pain out of redirect mapping. Every SEO knows that redirect mapping can be a tedious and very time-consuming task.

    This script helps you speed up the mapping process, saving yourself, your team, and your clients time and money.

    Python Libraries
    Below are the Python libraries used in this script. You may need to install some of them; in particular, you will likely need to install the fuzzywuzzy library.
    Code:
    !pip install fuzzywuzzy[speedup]  # run once in Colab (drop the leading ! if installing from a terminal)
    import csv
    import re
    import io
    import pandas as pd
    from google.colab import files
    from fuzzywuzzy import fuzz, process

    Upload your CSV file with the URLs
    This part of the script is for uploading your list of URLs. I used Google Colab to build the script, so you may need to tweak the code if you are using another IDE such as Jupyter Notebook. Note: there should be two columns in the CSV file that you are uploading. Column A should be named "source", and Column B should be named "destination".

    Code:
    # upload your csv file - recommended max limit 100k per batch
    uploaded = files.upload()
    
    for fn in uploaded.keys():
      print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
    
    file_name = list(uploaded.keys())[0]  # Adjust this if you expect multiple files
    df = pd.read_csv(io.StringIO(uploaded[file_name].decode('utf-8')))
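
    As a quick sanity check (my addition, not part of the original script), you could confirm the two expected columns are present straight after the upload:

    Code:
    # optional check: confirm the expected "source" and "destination" columns exist
    expected_columns = {"source", "destination"}
    missing = expected_columns - set(df.columns)
    if missing:
        raise ValueError(f"Input CSV is missing columns: {missing}")
    print(df.head())  # preview the first few rows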

    Preprocessing the URLs and Executing the Script
    What makes this script slightly different from other Python redirect mapping scripts is that I have added a number of preprocessing rules. The preprocessing rules allow you to modify the URLs that you input. In my opinion, this is a real time-saver, as the preprocessing rules can filter out pagination URLs, media formats, protocols, etc. To switch on a preprocessing rule, simply uncomment it.
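
    To give you an idea of what the rules do, here is a quick standalone sketch (the URL is made up) with the lowercase, protocol/www, querystring, and trailing-slash rules switched on:

    Code:
    import re

    url = "HTTPS://www.Example.com/Blog/Seo-Tips/?utm_source=newsletter"  # made-up URL for illustration
    url = url.lower()  # force lowercase
    url = re.sub(r'^https?://(www\.)?', '', url)  # strips out protocol and www.
    url = re.sub(r'\?.*$', '', url)  # removes parameters and querystrings
    url = re.sub(r'/$', '', url)  # removes trailing slash
    print(url)  # example.com/blog/seo-tips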

    There is also a function in the script to set your best match threshold; the script relies on the fuzzywuzzy library for this. The higher the threshold, the more reliable the matches in the output; the lower the threshold, the less reliable a match becomes. URLs that score below the threshold will not be matched, which means you may have to remap those URLs manually.
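
    If you want a feel for the scores before picking a threshold, you can test a couple of made-up URL pairs directly against one of the fuzzywuzzy scorers:

    Code:
    from fuzzywuzzy import fuzz

    # made-up pairs: a close rename versus a completely different page
    print(fuzz.token_sort_ratio("blog/seo-tips", "blog/seo-tips-2023"))  # similar URLs - likely to clear a threshold of 60
    print(fuzz.token_sort_ratio("blog/seo-tips", "shop/blue-widgets"))   # unrelated URLs - likely to fall below it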

    Once you are satisfied with your preprocessing configuration and have set a threshold limit, run the script. Just make sure that your CSV file name matches the file_name variable.

    Code:
    file_name = "urls-input.csv" # Make sure the input CSV file name matches
    
    # create dataframe from the uploaded csv file
    def read_csv(file_name):
        df = pd.read_csv(file_name)
        return df
    
    ### Uncomment the preprocessing rules that you require for each column
    
    def preprocess_url1(url):
        ## Preprocessing rules for the first (source) column
        # url = url.lower()  # force lowercase
        # url = re.sub(r'^https?://', '', url)  # strips out protocol
        # url = re.sub(r'^www\.', '', url)  # strips out www.
        # url = re.sub(r'^https?://(www\.)?', '', url)  # strips out protocol and www.
        # url = re.sub(r'^https?://[^/]+', '', url)  # remove domain name including protocol
        # url = re.sub(r'\?.*$', '', url)  # removes parameters and querystrings
        # url = re.sub(r'/$', '', url)  # removes trailing slash
        # url = re.sub(r'\.(php|htm|html|asp)$', '', url)  # remove file extensions
        # url = re.sub(r'\.(css|json|js)$', '', url)  # remove asset files
        # url = re.sub(r'\.(webp|png|jpe?g|gif|bmp|svg|tiff?)$', '', url)  # remove image formats
        # url = re.sub(r'\.(pdf|docx?|csv|xlsx?|pptx?|zip|rar|tar|gz|7z|mp3|wav|ogg|avi|mp4|mov|mkv|flv|wmv)$', '', url)  # remove common download formats

        return url
    
    
    def preprocess_url2(url):
        ## Preprocessing rules for the second (destination) column
        # url = url.lower()  # force lowercase
        # url = re.sub(r'^https?://', '', url)  # strips out protocol
        # url = re.sub(r'^www\.', '', url)  # strips out www.
        # url = re.sub(r'^https?://(www\.)?', '', url)  # strips out protocol and www.
        # url = re.sub(r'^https?://[^/]+', '', url)  # remove domain name including protocol
        # url = re.sub(r'\?.*$', '', url)  # removes parameters and querystrings
        # url = re.sub(r'/$', '', url)  # removes trailing slash
        # url = re.sub(r'\.(php|htm|html|asp)$', '', url)  # remove file extensions
        # url = re.sub(r'\.(css|json|js)$', '', url)  # remove asset files
        # url = re.sub(r'\.(webp|png|jpe?g|gif|bmp|svg|tiff?)$', '', url)  # remove image formats
        # url = re.sub(r'\.(pdf|docx?|csv|xlsx?|pptx?|zip|rar|tar|gz|7z|mp3|wav|ogg|avi|mp4|mov|mkv|flv|wmv)$', '', url)  # remove common download formats
        # url = re.sub(r'\?(page|pagenumber|start)=[^&]*(&|$)', '', url)  # ignore pagination URLs

        return url
    
    
    def get_best_match(url, url_list):
        scorers = [fuzz.token_sort_ratio, fuzz.token_set_ratio, fuzz.partial_token_sort_ratio]
        best_match_data = max((process.extractOne(url, url_list, scorer=scorer) for scorer in scorers), key=lambda x: x[1])
        best_match, best_score = best_match_data[0], best_match_data[1]
        return best_match, best_score
    
    # threshold score is set at 60. URLs that do not meet threshold will need to be manually mapped
    def compare_urls(df, column1, column2, threshold=60):
        result = []
    
        preprocessed_url1_list = [preprocess_url1(url) for url in df[column1]]
        preprocessed_url2_list = [preprocess_url2(url) for url in df[column2]]
    
        for url, preprocessed_url1 in zip(df[column1], preprocessed_url1_list):
            best_match, best_score = get_best_match(preprocessed_url1, preprocessed_url2_list)
    
            if best_score < threshold:
                best_match = 'No Match'
    
            result.append({'Source URL': url, 'Best Match Destination URL': best_match, 'Match Score': best_score})
    
        return result
    
    # build the dataframe
    def compare_urls_in_csv(file_name, column1, column2):
        df = read_csv(file_name)
        result = compare_urls(df, column1, column2)
        result_df = pd.DataFrame(result)
        result_df.to_csv("result.csv", index=False)
        print("Comparison results saved to result.csv")
        return result_df
    
    
    # save the results to a csv file and keep them in a callable dataframe
    column1 = "source"      # Make sure the input CSV file has a column named "source"
    column2 = "destination" # Make sure the input CSV file has a column named "destination"
    result_df = compare_urls_in_csv(file_name, column1, column2)

    Show the results
    The final part is to show the result_df dataframe and download it as a CSV file. It is entirely up to you how you structure the script; I personally find it better to run the dataframe display and the file download in their own notebook cell.

    Code:
    files.download('result.csv')
    result_df
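
    If you also want the URLs that fell below the threshold in their own file for manual remapping, one way (my addition, not part of the original script) is to filter the dataframe before downloading:

    Code:
    # collect the URLs that scored below the threshold and still need manual mapping
    unmatched_df = result_df[result_df['Best Match Destination URL'] == 'No Match']
    unmatched_df.to_csv('unmatched.csv', index=False)
    files.download('unmatched.csv')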


    Google Colab Link
    Here is a copy of the full script in Google Colab. Feel free to make a copy of the script and tweak it to your own requirements.
    Link: https://colab.research.google.com/drive/1iwtWNk63W4Nyv4JK03OPcS5gt6LHWBNk?usp=sharing

    Create URLs for Testing
    To put the Python redirect mapper to the test, I created a tool to generate random URLs. If you're interested in putting the script through its paces, create a bunch of random URLs just as I did using this tool: https://chrisleverseo.com/tools/random-url-generator/


    Redirects Generator
    Another tool that I have created to speed up the whole remapping and redirecting process is the Redirects Generator. It creates the redirect rules from your remapped URLs, and you can export the rules to htaccess, YAML, and NGINX formats. Link to the tool: https://chrisleverseo.com/tools/redirects-generator/


    That's a wrap
    If you have any problems running the Python redirect builder script, comment below and I'll try to help you debug your issues. It should be fine if you've followed the guidance above.

    Hopefully, you will find it a useful script to add to your collection of tools. It certainly beats using Microsoft Excel's fuzzy lookup, which I find crashes a lot and doesn't have the luxury of the preprocessing rules that I have added to this script.

    Thanks for reading
     