Saturday, February 19, 2011

Fuzzy String Matching

*Update* Check out Google Refine for cleaning dirty data.

I wanted to generate numbers that represented a sort of "sameness" or "matchiness" of two strings; that's how I was thinking about it. This led me to learn a little about fuzzy string matching using a measure called Levenshtein distance. I didn't need to understand how to write the algorithms to do this; I just wanted to use the tools.
ArcGIS has adopted Python as its scripting language. I downloaded, compiled, and installed a module called pylevenshtein, which computes the Levenshtein edit distance as well as other measures for comparing two strings. (I had to install Microsoft Visual Studio 2008 Express; it's free.) Once this is set up, it's rather easy to use the various string comparison algorithms in the field calculator.
I'm now figuring out which function, or combination of functions, to use to see if it's any better than my current method. The Jaro distance is looking good.
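
To give a feel for what an edit distance measures, here is a minimal pure-Python sketch. It is not pylevenshtein's code; the function names `levenshtein` and `similarity` are made up for illustration, and the 0-to-1 score is just one simple way to normalise the distance (the module's own ratio may be computed differently).

```python
def levenshtein(a, b):
    """Count the single-character inserts, deletes, and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical 0..1 'matchiness' score: 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("SMITH JOHN", "SMITH JON")` is 1 (one dropped letter), so two spellings of the same owner score close to 1.0 while unrelated names score much lower.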
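
For reference, the Jaro measure can be sketched in plain Python as follows. This is a textbook-style version written for illustration, not the code inside pylevenshtein, and the function name `jaro` here is simply my own label. It returns 1.0 for identical strings and 0.0 when no characters match.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are this close in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters in order and count those out of sequence.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3
```

Unlike an edit distance, this score is already scaled between 0 and 1, which makes it easier to pick a cut-off value when comparing old and new names.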

update:
I've decided that the method I was already using works best for me. Perhaps I can write a script to automate some of the steps involved.
Step 1: select where old name <> new name
Step 2: export this selection
Step 3: delete all unnecessary fields (to reduce dataset size and ease the manual review steps)
Step 4: create two new text fields, 4 characters in length
Step 5: populate the new text fields with left(old name, 4) and left(new name, 4)
Step 6: select where left4_old = left4_new
Step 7: scan selected parcels > X acres (depending on size of dataset)
Step 8: delete the manually corrected selection of left4_old = left4_new
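
The selection logic in the steps above could be sketched in plain Python like this. It is only a rough outline under my own assumptions: parcels are represented as dictionaries, the keys `old_name`, `new_name`, and `acres` and both function names are hypothetical, and the manual review in steps 7 and 8 still has to be done by a person.

```python
def split_name_changes(parcels, prefix_len=4):
    """Steps 1, 5, and 6: keep parcels whose name changed, then split them
    by whether the first few characters of the old and new names agree."""
    changed = [p for p in parcels if p["old_name"] != p["new_name"]]
    same_prefix, different_prefix = [], []
    for p in changed:
        if p["old_name"][:prefix_len] == p["new_name"][:prefix_len]:
            same_prefix.append(p)       # probably a respelling of the same owner
        else:
            different_prefix.append(p)  # probably a real ownership change
    return same_prefix, different_prefix


def review_candidates(same_prefix, min_acres):
    """Step 7: large parcels in the same-prefix group to scan by hand."""
    return [p for p in same_prefix if p["acres"] > min_acres]


# Made-up sample records, for illustration only.
parcels = [
    {"old_name": "SMITH JOHN", "new_name": "SMITH JOHN", "acres": 2.0},
    {"old_name": "SMITH JOHN", "new_name": "SMITH JON", "acres": 5.0},
    {"old_name": "SMITH JOHN", "new_name": "SMITHSON LLC", "acres": 40.0},
    {"old_name": "JONES MARY", "new_name": "BROWN TOM", "acres": 1.5},
]
same_prefix, different_prefix = split_name_changes(parcels)
to_scan = review_candidates(same_prefix, min_acres=10)
```

Here the third record shares the first four characters but is a genuine change, which is the kind of case the scan of larger parcels in step 7 is meant to catch.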

On spot-checking this method, the results are very good. I found it difficult to turn the output of the various edit distance algorithms in the pylevenshtein package into something meaningful that would improve the accuracy and speed of developing change data. The exercise was to try to find a better, faster way, and I ended up going back to my original method. Maybe it's because my method is easier to comprehend.

One goal for generating the data is to highlight ownership changes over time on a map. This could also be useful when looking at older subdivisions and gauging the regeneration rate.

2 comments:

  1. It would be interesting to see a video time lapse animation of a region that shows a ping upon ownership change along with a timeline indicator at the bottom or something. Pretty cool stuff here!

  2. now I'm thinking of how to do that. ideally we'd need exact day or at least the month of owner change so all changes don't flash all at once at the end of each year...hmmm
