    Python Fuzzy Multi-format Redirect Builder

    Chris

    The Fuzzy Multi-format Redirect Builder is a Python script built to take the pain out of redirect mapping. Every SEO knows that redirect mapping can be a tedious and very time-consuming task.

    This script helps you speed up the mapping process, saving yourself, your team, and your clients time and money.

    Python Libraries
    Below are the Python libraries used in this script. You may need to install some of them; in particular, you will likely need to install the fuzzywuzzy library.
    Code:
    !pip install fuzzywuzzy[speedup]  # run once in Colab (drop the leading ! if installing from a terminal)
    import csv
    import re
    import io
    import pandas as pd
    from google.colab import files
    from fuzzywuzzy import fuzz, process

    Upload your CSV file with the URLs
    This part of the script is for uploading your list of URLs. I used Google Colab to build the script, so you may need to tweak the code if you are using another IDE such as Jupyter Notebook. Note: there should be two columns in the CSV file that you are uploading. Column A should be named "source", and Column B should be named "destination".

    Code:
    # upload your csv file - recommended max limit 100k per batch
    uploaded = files.upload()
    
    for fn in uploaded.keys():
      print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
    
    file_name = list(uploaded.keys())[0]  # Adjust this if you expect multiple files
    df = pd.read_csv(io.StringIO(uploaded[file_name].decode('utf-8')))
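
    As a quick sanity check (my addition, not part of the original script), you could confirm the two expected columns are present straight after the upload:

    Code:
    # optional check: confirm the expected "source" and "destination" columns exist
    expected_columns = {"source", "destination"}
    missing = expected_columns - set(df.columns)
    if missing:
        raise ValueError(f"Input CSV is missing columns: {missing}")
    print(df.head())  # preview the first few rows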

    Preprocessing the URLs and Executing the Script
    What makes this script slightly different from other Python redirect mapping scripts is that I have added a number of preprocessing rules. The preprocessing rules allow you to modify the URLs that you input. In my opinion, this is a real time-saver, as the preprocessing rules can filter out pagination URLs, media formats, protocols, etc. To switch on a preprocessing rule, simply uncomment it.
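
    To give you an idea of what the rules do, here is a quick standalone sketch (the URL is made up) with the lowercase, protocol/www, querystring, and trailing-slash rules switched on:

    Code:
    import re

    url = "HTTPS://www.Example.com/Blog/Seo-Tips/?utm_source=newsletter"  # made-up URL for illustration
    url = url.lower()  # force lowercase
    url = re.sub(r'^https?://(www\.)?', '', url)  # strips out protocol and www.
    url = re.sub(r'\?.*$', '', url)  # removes parameters and querystrings
    url = re.sub(r'/$', '', url)  # removes trailing slash
    print(url)  # example.com/blog/seo-tips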

    There is also a function in the script to set your best match threshold; the script relies on the fuzzywuzzy library for this. The higher the threshold, the more reliable the matches in the output; the lower the threshold, the less reliable a match becomes. URLs that score below the threshold will not be matched, which means you may have to remap those URLs manually.
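
    If you want a feel for the scores before picking a threshold, you can test a couple of made-up URL pairs directly against one of the fuzzywuzzy scorers:

    Code:
    from fuzzywuzzy import fuzz

    # made-up pairs: a close rename versus a completely different page
    print(fuzz.token_sort_ratio("blog/seo-tips", "blog/seo-tips-2023"))  # similar URLs - likely to clear a threshold of 60
    print(fuzz.token_sort_ratio("blog/seo-tips", "shop/blue-widgets"))   # unrelated URLs - likely to fall below it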

    Once you are satisfied with your preprocessing configuration and have set a threshold limit, run the script. Just make sure that your CSV file name matches the file_name variable.

    Code:
    file_name = "urls-input.csv" # Make sure the input CSV file name matches
    
    # create dataframe from the uploaded csv file
    def read_csv(file_name):
        df = pd.read_csv(file_name)
        return df
    
    ### Uncomment the preprocessing rules that you require for each column
    
    def preprocess_url1(url):
        ## Preprocessing rules for the first (source) column
        # url = url.lower()  # force lowercase
        # url = re.sub(r'^https?://', '', url)  # strips out protocol
        # url = re.sub(r'^www\.', '', url)  # strips out www.
        # url = re.sub(r'^https?://(www\.)?', '', url)  # strips out protocol and www.
        # url = re.sub(r'^https?://[^/]+', '', url)  # remove domain name including protocol
        # url = re.sub(r'\?.*$', '', url)  # removes parameters and querystrings
        # url = re.sub(r'/$', '', url)  # removes trailing slash
        # url = re.sub(r'\.(php|htm|html|asp)$', '', url)  # remove file extensions
        # url = re.sub(r'\.(css|json|js)$', '', url)  # remove asset files
        # url = re.sub(r'\.(webp|png|jpe?g|gif|bmp|svg|tiff?)$', '', url)  # remove image formats
        # url = re.sub(r'\.(pdf|docx?|csv|xlsx?|pptx?|zip|rar|tar|gz|7z|mp3|wav|ogg|avi|mp4|mov|mkv|flv|wmv)$', '', url)  # remove common download formats

        return url
    
    
    def preprocess_url2(url):
        ## Preprocessing rules for the second (destination) column
        # url = url.lower()  # force lowercase
        # url = re.sub(r'^https?://', '', url)  # strips out protocol
        # url = re.sub(r'^www\.', '', url)  # strips out www.
        # url = re.sub(r'^https?://(www\.)?', '', url)  # strips out protocol and www.
        # url = re.sub(r'^https?://[^/]+', '', url)  # remove domain name including protocol
        # url = re.sub(r'\?.*$', '', url)  # removes parameters and querystrings
        # url = re.sub(r'/$', '', url)  # removes trailing slash
        # url = re.sub(r'\.(php|htm|html|asp)$', '', url)  # remove file extensions
        # url = re.sub(r'\.(css|json|js)$', '', url)  # remove asset files
        # url = re.sub(r'\.(webp|png|jpe?g|gif|bmp|svg|tiff?)$', '', url)  # remove image formats
        # url = re.sub(r'\.(pdf|docx?|csv|xlsx?|pptx?|zip|rar|tar|gz|7z|mp3|wav|ogg|avi|mp4|mov|mkv|flv|wmv)$', '', url)  # remove common download formats
        # url = re.sub(r'\?(page|pagenumber|start)=[^&]*(&|$)', '', url)  # ignore pagination URLs

        return url
    
    
    def get_best_match(url, url_list):
        scorers = [fuzz.token_sort_ratio, fuzz.token_set_ratio, fuzz.partial_token_sort_ratio]
        best_match_data = max((process.extractOne(url, url_list, scorer=scorer) for scorer in scorers), key=lambda x: x[1])
        best_match, best_score = best_match_data[0], best_match_data[1]
        return best_match, best_score
    
    # threshold score is set at 60. URLs that do not meet threshold will need to be manually mapped
    def compare_urls(df, column1, column2, threshold=60):
        result = []
    
        preprocessed_url1_list = [preprocess_url1(url) for url in df[column1]]
        preprocessed_url2_list = [preprocess_url2(url) for url in df[column2]]
    
        for url, preprocessed_url1 in zip(df[column1], preprocessed_url1_list):
            best_match, best_score = get_best_match(preprocessed_url1, preprocessed_url2_list)
    
            if best_score < threshold:
                best_match = 'No Match'
    
            result.append({'Source URL': url, 'Best Match Destination URL': best_match, 'Match Score': best_score})
    
        return result
    
    # build the dataframe
    def compare_urls_in_csv(file_name, column1, column2):
        df = read_csv(file_name)
        result = compare_urls(df, column1, column2)
        result_df = pd.DataFrame(result)
        result_df.to_csv("result.csv", index=False)
        print("Comparison results saved to result.csv")
        return result_df
    
    
    # save the results to a csv file and keep them in a callable dataframe
    column1 = "source"      # Make sure the input CSV file has a column named "source"
    column2 = "destination" # Make sure the input CSV file has a column named "destination"
    result_df = compare_urls_in_csv(file_name, column1, column2)

    Show the results
    The final part is to show the result_df dataframe and download it as a CSV file. It is entirely up to you how you structure the script; I personally find it better to run the dataframe display and the file download in their own notebook cell.

    Code:
    files.download('result.csv')
    result_df
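
    If you also want the URLs that fell below the threshold in their own file for manual remapping, one way (my addition, not part of the original script) is to filter the dataframe before downloading:

    Code:
    # collect the URLs that scored below the threshold and still need manual mapping
    unmatched_df = result_df[result_df['Best Match Destination URL'] == 'No Match']
    unmatched_df.to_csv('unmatched.csv', index=False)
    files.download('unmatched.csv')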


    Google Colab Link
    Here is a copy of the full script in Google Colab. Feel free to make a copy of the script and tweak it to your own requirements.
    Link: https://colab.research.google.com/drive/1iwtWNk63W4Nyv4JK03OPcS5gt6LHWBNk?usp=sharing

    Create URLs for Testing
    To put the Python redirect mapper to the test, I created a tool to generate random URLs. If you're interested in putting the script through its paces, create a bunch of random URLs just as I did using this tool: https://chrisleverseo.com/tools/random-url-generator/


    Redirects Generator
    Another tool that I have created to speed up the whole remapping and redirecting process is the Redirects Generator. It creates the redirect rules from your remapped URLs, and you can export the rules to htaccess, YAML, and NGINX formats. Link to the tool: https://chrisleverseo.com/tools/redirects-generator/


    That's a wrap
    If you have any problems running the Python redirect builder script, comment below and I'll try to help you debug your issues. It should be fine if you've followed the guidance above.

    Hopefully, you will find it a useful script to add to your collection of tools. It certainly beats using Microsoft Excel's fuzzy lookup, which I find crashes a lot and doesn't have the luxury of the preprocessing rules that I have added to this script.

    Thanks for reading
     