SQL Fuzzy Matching Guide

SQL Fuzzy Matching Guide

Overview

Have you wondered how Google approximates your search queries? How can their algorithm find relatively similar search queries without 100% matching the exact string you inputted? Although Google’s search algorithm involves more than fuzzy matching SQL, this matching technique is an integral part of making your search experience seamless and relevant.

Besides relevancy, other problems include scaling the fuzzy matching algorithms within large databases in SQL. A simple, less accurate fuzzy matching algorithm can be easier to develop in the short run but might be inefficient or hard to maintain in the long run. Below, we look at fuzzy matching in SQL and the basic concepts we can learn in order to generate a better approach to meet our needs.

What is Fuzzy Matching?

Fuzzy matching, which can also more aptly be called Approximate String Matching, is a method that enhances the ability to provide predictive search functions, detect misspellings, and return approximate values.

Fuzzy matching is a practical application of “fuzzy logic.” Essentially, while most algorithms stem from a binary perspective (i.e., having 1 or 0 as return values), fuzzy logic returns numerical values that can determine “truthiness” or “falseness” (i.e., “degrees of truth”).

What this logic implies is that the in-betweens are taken into consideration. Applying the logic to fuzzy matching, the results are not only “match” and “non-match” (binary outputs) but can also determine the similarity score of two strings (relevance).

The approximate, “in-between” values do not have to 100% match the string input but must meet a certain threshold to be considered “similar enough.”

Benefits of Fuzzy Matching

Fuzzy matching in SQL has many benefits outside of matching and search prediction. Just a few of those applications can be found in the following areas:

  • Fuzzy matching can arrive at matches despite irregularities, especially regarding unstandardized formatting. Moreover, because of the algorithms employed in fuzzy matching, it can generate matches despite mismatched cases, unreliable spacing, incorrect word order, and more.
  • It enables you to link and clean unclear data that one would otherwise toss out.
  • Instead of a data scientist manually cleaning and comparing data elements, fuzzy matching automates this process.
  • Fuzzy matching can help you generate a numerical value (i.e., a match score) to determine how similar one string is to another. Typically, a score of 1.0 means a perfect match, and a score of 0.0 is a no-match.

How Do You Fuzzy Match in SQL?

Finding exact matches is an easy task in SQL, but the same thing cannot be said for generating approximate matches.

While fuzzy matching may seem simple, there are many methods you can employ in SQL to do a fuzzy match, making it daunting to design the best course of action. While one approach may be objectively better than others for a particular situation, there is no one correct way to fuzzy match in SQL.

Here are the fuzzy matching methods we will take a look at:

  • Basic SQL Fuzzy Matching
  • Using the Levenshtein distance
  • Through SOUNDEX

Fuzzy Matching Strings: Basic SQL Fuzzy Matching

While more complex fuzzy matching methods exist, sometimes simpler ones work wonders. This fuzzy matching technique is relatively quick, easy to implement, and easy to understand. Moreover, this fuzzy matching method is easy to use for beginners!

To start, we will use the LIKE operator in SQL for this method.

Problem Set: Given a table of pupils, determine which pupils have a name similar to Smith

Select StudentName, LEN(StudentName) - LEN('Smith') from Students
WHERE StudentName like '%Smith%'
Order By LEN(StudentName) - LEN('Smith')

Example output should look like the following:

Peterson Smith 9
Smithson Ford 8
Smith Jay 4
Smith 0

Let’s break the code down. The LEN function returns the length of a string; through this, we can generate a pseudo-similarity algorithm that bases its output score on the differences in string length.

Using the WHERE clause, we can select only those strings in which Smith is placed on the original string’s left or right-hand side. This restriction is determined by the (%) set at the start and end of Smith.

Using The Levenshtein Distance

The Levenshtein distance, also known as the edit distance, is an algorithm that counts the differences between two strings based on their mismatches. For example, given the word ‘Hello’ and ‘Mellow’, you will need to change ’M’ to ‘H’ and remove ‘W’, which means the two words have an edit distance of two.

This algorithm is not necessarily easy to implement but is still commonly deployed in SQL and other coding languages. Learning this approach will give you flexibility across a range of situations, and is one of the best methods to learn in constructing a fuzzy match.

Let’s look at the code:

DELIMITER $$
CREATE FUNCTION levenshtein( str1 VARCHAR(255), str2 VARCHAR(255) )
    RETURNS INT
    DETERMINISTIC
    BEGIN
        DECLARE str1_len, str2_len, i, j, c, c_temp, cost INT;
        DECLARE str1_char CHAR;
        DECLARE cv0, cv1 VARBINARY(256);

        SET str1_len = CHAR_LENGTH(str1), str2_len = CHAR_LENGTH(str2), cv1 = 0x00, j = 1, i = 1, c = 0;

        IF str1 = str2 THEN
            RETURN 0;
        ELSEIF str1_len = 0 THEN
            RETURN str2_len;
        ELSEIF str2_len = 0 THEN
            RETURN str1_len;
        ELSE
            WHILE j <= str2_len DO
                SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
            END WHILE;
            WHILE i <= str1_len DO
                SET str1_char = SUBSTRING(str1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
                WHILE j <= str2_len DO
                    SET c = c + 1;
                    IF str1_char = SUBSTRING(str2, j, 1) THEN
                        SET cost = 0; ELSE SET cost = 1;
                    END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                    IF c > c_temp THEN SET c = c_temp; END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                    IF c > c_temp THEN
                        SET c = c_temp;
                    END IF;
                    SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
                END WHILE;
                SET cv1 = cv0, i = i + 1;
            END WHILE;
        END IF;
        RETURN c;
    END$$
DELIMITER ;

Implementation from Artful Software

This implementation of the Levenshtein algorithm may be pretty intimidating to look at due to the nested ‘WHILE’ loops, but let’s take a quick tour of the code:

In this algorithm, we utilize hex conversions to generate our distance return value, denoted by the variable c. Let’s shift our focus from the rest of the code and focus on the nested ‘WHILE’ loops, as this is where the magic happens.

Going off with this block:

WHILE i <= str1_len DO
                SET str1_char = SUBSTRING(str1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
                WHILE j <= str2_len DO
                    SET c = c + 1;
                    IF str1_char = SUBSTRING(str2, j, 1) THEN
                        SET cost = 0; ELSE SET cost = 1;
                    END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                    IF c > c_temp THEN SET c = c_temp; END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                    IF c > c_temp THEN
                        SET c = c_temp;
                    END IF;

We can see that a temporary variable, str1_char, and str2_char, has been made to store the current characters to be compared. We set str1_char to the current character we are pointing at in the string. For example, where i=2, in the word HELLO, we are pointing to the character E.

We set a similar variable to our second character, denoted by str2_char. We then compare if these two are the same. If they are the same, the cost variable will be set to zero; else, we will set it to one. The cost variable will be tallied to generate the distance between the strings.

The process will then be done iteratively until we are done comparing the two strings.

Using SOUNDEX

Soundex is a phonetic algorithm that both generates and indexes strings based on their sound. Soundex can help match strings despite the differences in spelling, and it mainly concerns itself with consonants.

What makes Soundex the most popular algorithm of its kind is its general standardization and, nowadays, wide-reaching implementation in SQL database systems.

To implement Soundex, we use the following code:

Select StudentName, 
DIFFERENCE (Soundex(StudentName), Soundex('Smith'))
from Students
WHERE StudentName like %Smith%
Order By DIFFERENCE (Soundex(StudentName), Soundex('Smith')) DESC

Let’s break down the code. We generate the differences in Soundex values between two strings using the DIFFERENCE (Soundex(str_x, str_y)) function to create a numerical value that can help differentiate or contrast these two strings phonetically.

Of course, we will only include those records with %Smith% in the actual string. However, one can change this method dramatically to become more lenient.

For example, instead of requiring the presence of the string Smith inside the word, it can also be set to return a match where the Soundex values between two strings must not be of this degree.

Nevertheless, the approach above also works and is an easy and straightforward way to make a Fuzzy Matching algorithm using SOUNDEX.

How Do You Fuzzy Match large datasets?

Fuzzy matching is essential for most workloads, but as your databases grow, simple fuzzy matching algorithms may take minutes to hours before they can process your query. In this section, we will discuss the details of an advanced fuzzy matching algorithm: Term Frequency, Inverse Document Frequency (or tf-idf for short).

Before we proceed with the main algorithm, let us understand the core concepts of fuzzy matching large datasets:

  • The first concept we need to learn is simplification or deduplication, which means we will need to append, simplify, extend, or compress words. For example, First St. will turn into First Street, while Los-Reyes will turn into Los Reyes. We will need to do this to simplify our algorithm’s workload and still deliver accurate and helpful results.

  • We also need to conduct record linkage, wherein we will link two or more elements with no unique identities but which are similar nonetheless. For example, we might want to link “Burma” and “Myanmar” despite the two words having a poor Levenshtein distance score.

Basic Term Frequency, Inverse Document Frequency Introduction

The tf-idf algorithm is used in Natural Language Processing fields (NLP) and involves dividing the text into n-grams (or chunks) and creating numerical values based on their rarity. To keep things clean, common words such as “are,” “is,” “and,” “I,” and “you” are filtered out.

By applying the algorithm, we can compare close matches throughout the text that appear less common. For example, uncommon keywords are valued more due to their rarity of occurrence, thus creating a better, more accurate algorithm.

Sample SQL Fuzzy Matching Interview Questions

The question below is a possible SQL interview question that you may encounter about Fuzzy Matching:

Student Tests, from Interview Query

Let’s say there are two tables, students and tests.

The tests table does not have a student id. However, it has both the first and last name, as well as the date of birth, which we should consider as possibly inaccurate since these details are entered manually.

What process would you use to determine which student in the student rosters took which exam?

Notes: You can assume that you can utilize a human to support the review of your matches, and that you need to evaluate thousands of rows of data for this project.

Input:

students table

Columns Type
id INTEGER
firstname VARCHAR
lastname VARCHAR
date_of_birth DATETIME

Output:

tests table

Columns Type
firstname VARCHAR
lastname VARCHAR
date_of_birth DATETIME
test_score INTEGER
test_date DATETIME

Most popular answer from Interview Query:

I would start with the assumption that making a mistake on more than one column is unlikely. I would select a candidates table where each test has several rows, each consisting of students with at least two matching columns. To select from the candidates, I would use a dictionary of common typographical mistakes for the names and dates. If all three match, I would select that option. Otherwise, go with a measure of difference between the values.

If two identically named students are transposed, then the only recourse would be manual selection.

Another user-submitted answer showcased the following:

Other than exact matches for firstname + lastname + dateofbirth, I think you’d want to look for matches to 23 and have a human review the third field to see if it approximately matches.

For example, for Jon Doe born on 5/7/1990, you might match Jonathan Doe 5/7/1990 or Jon Doe 6/7/1990.

I’d expect an exact match on 33 fields to net ~80% of matches, then need 23 matching to get another ~15%.

Finally, if it’s only thousands of rows, you should be able to hand-match the remaining ~5%. I’d sort by last name, since it’s more likely to be unique than first name and is less susceptible to a fat-finger mistake than DOB. (Even if you slightly misspell a last name, it’s going to look similar enough that a human can spot a match.)

Using the second solution, let us build a pseudo code SQL query to accomplish precisely that.

  • Concatenate the firstname, secondname, and date_of_birth columns from the Students_table, and store this concatenation into a new column. For ease of reference, we can call this column compare_A. By aggregating our relevant data, we can easily compare each student’s respective information.
  • Do the same, but this time from the Tests_table: concatenate the firstname, secondname, and date_of_birth columns. For ease of reference, we can call this concatenated column ‘compare_B’
  • Fuzzy match an element from compare_A to all the elements of compare_B using the Levenshtein Distance algorithm.
  • To get all the values that match 66% (23) of our original string, get the length of the longer string (i.e. if an element from compare_B is longer, use that element’s length).
  • Using the length, divide the Levenshtein distance’s result by the length. If a resulting calculation for two elements is greater than 33% (i.e., 13), exclude them from the list.
  • Use manual review to match the two tables’ results from each other.