How to Use Different Fuzzy Matching Algorithms Effectively
Have you ever tried to match data in two tables? It is hard to match values across tables when the data is not identical, for example when names contain spelling mistakes or an extra space between two words.
Analytics teams face this problem whenever they have to normalize two sets of similar data with differing variables in order to create a common record for modeling. Most of the time, the errors in the tables come from manual entry. Forms filled in by users and spreadsheets maintained by employees are the most likely to contain incorrect data. Small organizations with limited data might not feel the problem, but large businesses with huge volumes of data know how painful it is to match or merge multiple tables.
So the big question is: how do you solve this problem?
The solution to this problem is fuzzy matching. It is a technique used in computer-assisted translation as a special case of record linkage. The technique works with matches that are less than 100% perfect when searching for correspondences between segments of text and entries in tables of previous translations. It is designed to reliably match two strings that are similar but not identical.
The table below shows two data sets that contain the same organization, spelled in slightly different ways.
|Data Set 1: Organization Name|Data Set 1: Sales|Data Set 2: Organization Name|Data Set 2: Number of Customers|
|EasyMorphInc|$300|Sally Harper Cntr|10|
|Sally Harper Center|$500|Domino’s Pizza|100|
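As a rough illustration of how the two organization columns above could be reconciled, the sketch below scores each pair of names with Python's standard `difflib.SequenceMatcher`. The helper names `similarity` and `best_match` and the 0.8 threshold are assumptions made for this example, not part of any particular tool.

```python
from difflib import SequenceMatcher

# Organization names taken from the two data sets in the table above.
data_set_1 = ["EasyMorphInc", "Sally Harper Center"]
data_set_2 = ["Sally Harper Cntr", "Domino's Pizza"]


def similarity(a, b):
    """Return a score between 0 and 1; 1 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if nothing reaches the threshold."""
    score, candidate = max((similarity(name, c), c) for c in candidates)
    return candidate if score >= threshold else None


for name in data_set_1:
    print(name, "->", best_match(name, data_set_2))
```

With this threshold, "Sally Harper Center" pairs with "Sally Harper Cntr", while "EasyMorphInc" finds no counterpart. The right threshold depends on the data and usually needs tuning.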
Types of Fuzzy Matching Algorithms
To help businesses in different domains get the most out of matching or merging data, open source libraries provide a range of fuzzy matching algorithms. You can only choose the most suitable algorithm if you know the available options. The main algorithms for data matching are listed below:
Soundex
This is a phonetic algorithm that indexes names by sound, as pronounced in English. Homophones are encoded to the same representation, so names can be matched despite minor spelling differences.
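The phonetic encoding described above corresponds to the classic Soundex scheme. The following is a minimal sketch of American Soundex, assuming plain English letters as input. The function name `soundex` is chosen for this example.

```python
# Consonant groups that share a Soundex digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Encode a name as one letter followed by three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # vowels reset the previous code to ""
    return (encoded + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Both names produce the same code, which is what lets a lookup keyed on the Soundex value find either spelling.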
N-gram
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. Depending on the application, the items can be phonemes, letters, words, syllables or base pairs. An n-gram model is a probabilistic language model that predicts the next item in a sequence in the form of an (n-1)-order Markov model.
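For matching purposes, n-grams are often used more simply than as a language model: two strings are split into character n-grams and the overlap between the two sets is measured. The sketch below takes that approach using the Jaccard index. This is an assumption about how n-grams would be applied here, and the function names are illustrative.

```python
def char_ngrams(text, n=2):
    """Return the list of contiguous character n-grams in text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_similarity(a, b, n=2):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 1.0  # both strings are shorter than n
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(char_ngrams("night"))
print(ngram_similarity("Sally Harper Center", "Sally Harper Cntr"))
```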
Levenshtein distance Algorithm
Levenshtein distance is a string metric for measuring the difference between two sequences. In simple terms, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
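The definition above translates directly into a dynamic-programming routine. The following is a minimal sketch that keeps only two rows of the table in memory. The function name is chosen for this example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("Center", "Cntr"))
```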
BK Tree
A BK-tree is a metric tree adapted to discrete metric spaces, such as a set of words compared by edit distance. It was suggested by Walter Austin Burkhard and Robert M. Keller.
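To make the idea concrete, here is a minimal BK-tree sketch that uses Levenshtein distance as the metric. Each child is stored under its distance to the parent, so a search only needs to descend into children whose edge value lies within `max_distance` of the query's distance to the current node. The class and method names are assumptions for this example.

```python
def levenshtein(a, b):
    """Compact edit distance used as the tree's metric."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is a (word, {edge_distance: child}) pair

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def search(self, query, max_distance):
        """Return sorted (distance, word) pairs within max_distance of query."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_distance:
                results.append((d, word))
            for edge, child in children.items():
                # The triangle inequality rules out every other subtree.
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


tree = BKTree(levenshtein)
for w in ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]:
    tree.add(w)
print(tree.search("bok", 1))
```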
Bitap algorithm with modifications
This is an approximate string matching algorithm. It tells whether a given text contains a substring that is “approximately equal” to a given pattern. Approximate equality is defined in terms of Levenshtein distance: the algorithm considers a substring and the pattern equal if they are within a given distance k of each other.
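The sketch below shows one way the approximate search described above can be implemented with bit operations, following the commonly described Wu-Manber extension of shift-and matching. It is an illustrative version, not the code of any specific library, and the function name is an assumption.

```python
def bitap_fuzzy_contains(text, pattern, k):
    """True if some substring of text is within Levenshtein distance k of pattern."""
    m = len(pattern)
    if m <= k:
        return True  # the empty substring is already within k edits
    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    full = (1 << m) - 1
    accept = 1 << (m - 1)
    # state[d], bit i: pattern[:i + 1] matches a suffix of the text read so far
    # with at most d errors.
    state = [(1 << d) - 1 for d in range(k + 1)]
    for ch in text:
        mask = masks.get(ch, 0)
        prev_old = state[0]
        state[0] = ((state[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            old = state[d]
            state[d] = (
                (((old << 1) | 1) & mask)     # characters match
                | prev_old                    # extra character in the text
                | ((prev_old << 1) | 1)       # substituted character
                | ((state[d - 1] << 1) | 1)   # pattern character missing from the text
            ) & full
            prev_old = old
        if state[k] & accept:
            return True
    return False


print(bitap_fuzzy_contains("Sally Harper Cntr", "Center", 2))
print(bitap_fuzzy_contains("Sally Harper Cntr", "Center", 1))
```

The first call reports a match because "Cntr" is two edits away from "Center"; the second does not, since no substring is within one edit.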
Damerau-Levenshtein distance
This is a string metric for measuring the distance between two finite sequences of symbols, given by the minimum number of operations needed to transform one string into the other. An operation is the insertion, deletion or substitution of a single character, or the transposition of two adjacent characters.
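A minimal sketch of this metric follows. Note that it implements the restricted variant usually called optimal string alignment, in which no substring is edited more than once. That variant is simpler than the unrestricted distance and can give a larger value in rare cases, as the last line of the example shows. The function name is an assumption.

```python
def osa_distance(a, b):
    """Edit distance with insertions, deletions, substitutions and adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(osa_distance("recieve", "receive"))  # one swap of adjacent letters
print(osa_distance("ca", "abc"))           # restricted variant: 3, unrestricted: 2
```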
Tips to use fuzzy matching effectively for the best results
Limit matching records: Comparing the entire database with itself makes the process slower and harder to review. To get final data with the fewest errors, limit the number of records retrieved for comparison. For example, you can exclude deceased entries and non-person entities, or restrict the comparison to data modified in the past 2-3 days. This prevents the tool from checking unnecessary data.
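One common way to limit the comparison is blocking: records are grouped by a cheap key and only records that share a key are compared. The sketch below is illustrative. The key used here (the first letter of the name) is a deliberately simple assumption, and real data usually calls for a more careful choice.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name):
    """Hypothetical blocking key: the first letter of the trimmed, lowercased name."""
    return name.strip().lower()[:1]


def candidate_pairs(records):
    """Yield only the pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


records = ["Sally Harper Center", "Sally Harper Cntr",
           "Domino's Pizza", "EasyMorphInc"]
print(list(candidate_pairs(records)))
```

Comparing every record with every other would produce six pairs for these four names; blocking leaves a single pair for the more expensive fuzzy comparison.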
Work with short strings: Fuzzy matching gives the best results on short strings. Decide which fields you really want to match on, such as addresses, customer names or email IDs, and limit the comparison to those. Within an address, you can narrow the comparison further by matching a single field such as Street Line 1 instead of the complete address.
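As a small illustration of matching on individual short fields, the sketch below compares two records one field at a time after collapsing case and extra whitespace. The record layout, the field names `name` and `street_line_1` and the sample values are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase the value and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def field_similarity(record_a, record_b, field):
    """Similarity between 0 and 1 for a single field of two records."""
    return SequenceMatcher(None, normalize(record_a[field]),
                           normalize(record_b[field])).ratio()


# Made-up records for illustration only.
record_a = {"name": "Sally  Harper", "street_line_1": "12 Main Street"}
record_b = {"name": "sally harper", "street_line_1": "12 Main St"}

for field in ("name", "street_line_1"):
    print(field, round(field_similarity(record_a, record_b, field), 2))
```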
Use simple match queries: Keep the match query itself simple, for example by returning only the IDs of the matching records. Extra columns such as date of birth and address are useful because they help users decide whether a certain match is a true duplicate, but keep that part of the query separate from the match itself. Putting the match in a WITH clause is an easy and effective way to do this.
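The sketch below shows one possible shape for such a query, using Python's built-in `sqlite3` module. The `customers` table, its columns, the sample rows and the distance threshold of 2 are all hypothetical. The edit-distance function is registered from Python because SQLite does not provide one by default.

```python
import sqlite3


def levenshtein(a, b):
    """Compact edit distance exposed to SQL below."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, date_of_birth TEXT);
    INSERT INTO customers VALUES
        (1, 'Sally Harper', '1980-04-02'),
        (2, 'Sally Harpr',  '1980-04-02'),
        (3, 'Tom Baker',    '1975-11-30');
""")

# The WITH clause holds only the match (pairs of IDs); the outer query adds
# the columns a reviewer needs to judge whether the pair is a true duplicate.
query = """
    WITH matches AS (
        SELECT a.id AS id_a, b.id AS id_b
        FROM customers AS a
        JOIN customers AS b ON a.id < b.id
        WHERE levenshtein(a.name, b.name) <= 2
    )
    SELECT m.id_a, m.id_b, a.date_of_birth, b.date_of_birth
    FROM matches AS m
    JOIN customers AS a ON a.id = m.id_a
    JOIN customers AS b ON b.id = m.id_b
"""
rows = conn.execute(query).fetchall()
print(rows)
```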
Whether you call it string matching, near matching or fuzzy matching, the technique is very helpful for businesses. Apart from matching strings, it helps reduce marketing postage costs by removing duplicate records from the database. It also helps keep the data up to date with fewer errors, which improves the company’s reputation in the market. Customers are happier when their names are spelled correctly and they receive only one copy of each promotional campaign. Use the tips in this article to get the most out of the technique and make your marketing campaigns and business operations more profitable.