HOWTO: Speed up string match lookups

When you have a large number of patterns (dozens) to scan to find out which one matches a given string, there are a few things you can do to speed up the job.

If the patterns are hard-coded, there are of course any number of ways that you can be clever. But often you do not know what the patterns look like beforehand, which is the case when you're trying to match input strings against patterns in GlobalStrings.lua using a formatstring-to-regex utility like BabbleLib's Deformat function.

The approach below works by making lists of words used by patterns, and then looking at words in the input strings to determine which list(s) to look for matches in.

The process takes two passes. The first pass counts how often each word occurs across all patterns. The second pass files each pattern under its LEAST commonly used word, and only those words are used for lookups.


 * Note: The example contains a very simplistic "MyDeformatterFunc" for converting "%s" to "(.*)". It will not work for locales other than English. Do not use it in the real world, please.

    -- Written for Lua 5.1 or later (uses pairs and string.gmatch)

    -- Functions that we want called for different string matches
    function RoughPokeFunc(v1,v2) print("RoughPokeFunc "..v1.." "..v2); end
    function SoftPokeFunc(v1,v2) print("SoftPokeFunc "..v1.." "..v2); end
    function SoftNudgeFunc(v1,v2) print("SoftNudgeFunc "..v1.." "..v2); end
    function ChickenFunc(v1,v2) print("ChickenFunc "..v1.." "..v2); end

    -- Strings to match mapped to functions that we want called
    MatchStrings = {
      ["%s roughly pokes %s"] = RoughPokeFunc,
      ["%s softly pokes %s"] = SoftPokeFunc,
      ["%s softly nudges %s"] = SoftNudgeFunc,
      ["%s gets nudged by %s and runs away screaming"] = ChickenFunc,
    }

    -- VERY simplistic deformatter function.
    -- You probably want a real deformatting library for this.
    function MyDeformatterFunc(str)
      return (string.gsub(str, "%%s", "(.*)"));
    end

    -- First run: count how many occurrences there are of each word
    WordCounts = {}
    for str,func in pairs(MatchStrings) do
      for word in string.gmatch(str, "[^ ]+") do
        if(string.find(word, "^%%")) then
          -- ignore format strings
        else
          WordCounts[word] = (WordCounts[word] or 0) + 1;
        end
      end
    end

    -- Second run: for each string, pick the least common word
    -- and place string in that hash bucket
    MatchStringsHash = {}
    for str,func in pairs(MatchStrings) do
      local bestword, num;
      for word in string.gmatch(str, "[^ ]+") do
        if(string.find(word, "^%%")) then
          -- ignore format strings
        else
          if(not num or WordCounts[word] < num) then
            num = WordCounts[word];
            bestword = word;
          end
        end
      end
      assert(bestword);
      if(not MatchStringsHash[bestword]) then
        MatchStringsHash[bestword] = {};
      end
      MatchStringsHash[bestword][MyDeformatterFunc(str)] = func;
    end
    WordCounts = nil;  -- now we don't need the counts anymore

    -- Dump our MatchStringsHash on-screen so we can see what it looks like!
    print "Examining hash buckets"
    print "--"
    for word,strings in pairs(MatchStringsHash) do
      print("  "..word..":");
      for str,func in pairs(strings) do
        print("    \""..str.."\"");
      end
    end

    -- Function that scans for matches and calls the resulting function
    function ScanForMatch(str)
      local bDone = false;
      local nCompares = 0;
      for word in string.gmatch(str, "[^ ]+") do
        if(MatchStringsHash[word]) then
          for pattern,func in pairs(MatchStringsHash[word]) do
            nCompares = nCompares + 1;
            local success,_,v1,v2,v3,v4 = string.find(str, pattern);
            if(success) then
              func(v1,v2,v3,v4);
              bDone=true;
              break;
            end
          end
        end
        if(bDone) then break; end
      end
      print(" \""..str.."\": "..nCompares.." string.finds actually executed\n");
    end

    print("");
    print("Executing!");
    print("--");
    ScanForMatch("Alice roughly pokes Bob");
    ScanForMatch("Bob softly pokes Charles");
    ScanForMatch("Charles softly nudges Denise");
    ScanForMatch("Denise gets nudged by Eve and runs away screaming");
    ScanForMatch("This string does not exist");

Running the above produces the following output:

    Examining hash buckets
    --
      roughly:
        "(.*) roughly pokes (.*)"
      nudges:
        "(.*) softly nudges (.*)"
      gets:
        "(.*) gets nudged by (.*) and runs away screaming"
      softly:
        "(.*) softly pokes (.*)"

    Executing!
    --
    RoughPokeFunc Alice Bob
     "Alice roughly pokes Bob": 1 string.finds actually executed

    SoftPokeFunc Bob Charles
     "Bob softly pokes Charles": 1 string.finds actually executed

    SoftNudgeFunc Charles Denise
     "Charles softly nudges Denise": 2 string.finds actually executed

    ChickenFunc Denise Eve
     "Denise gets nudged by Eve and runs away screaming": 1 string.finds actually executed

     "This string does not exist": 0 string.finds actually executed
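As noted earlier, MyDeformatterFunc is deliberately simplistic. To give an idea of what a slightly more careful conversion involves, here is a minimal sketch. SketchDeformat is a hypothetical name made up for this example; it is not BabbleLib's Deformat and makes no claim to work like it. It assumes the format strings contain only %s and %d, and it escapes pattern magic characters in the literal text so that, for example, a trailing "." only matches a real period. It still does not handle a literal "%%" or positional specifiers such as "%1$s", so it is not a replacement for a real deformatting library.

```lua
-- Hypothetical helper for illustration only (not BabbleLib's Deformat).
-- Assumes the format string uses only %s and %d.
function SketchDeformat(fmt)
  -- Escape Lua pattern magic characters (all except "%") in the literal text
  local pattern = string.gsub(fmt, "([%^%$%(%)%.%[%]%*%+%-%?])", "%%%1")
  -- Turn the supported format specifiers into captures
  pattern = string.gsub(pattern, "%%s", "(.+)")
  pattern = string.gsub(pattern, "%%d", "(%%d+)")
  -- Anchor the pattern so it has to match the whole input string
  return "^" .. pattern .. "$"
end

local pattern = SketchDeformat("%s hits %s for %d.")
print(pattern)  -- ^(.+) hits (.+) for (%d+)%.$

local _, _, who, target, amount = string.find("Alice hits Bob for 12.", pattern)
print(who, target, amount)  -- Alice, Bob and 12 (all captures are strings)
```

Anchoring with "^" and "$" is a design choice of this sketch: it stops a short pattern from matching somewhere in the middle of a longer input string.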

Problems with this approach
There is no guarantee as to the order in which the string matches will be attempted, because the traversal order of a Lua table is not specified.

For example, assume these two patterns:
 * 1) "%s hits %s."
 * 2) "%s hits %s hard."

Now, given the input string "Alice hits Bob.", only #1 will match, and all is good.

But with the input string "Alice hits Bob hard.", there is NO guarantee which string will match. You can get #1 with the arguments "Alice", "Bob hard". Or you can get #2 with the arguments "Alice", "Bob".
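The ambiguity is easy to reproduce. The snippet below uses the two patterns as the simplistic MyDeformatterFunc would produce them and shows that both match the same input. TryLongestFirst is a hypothetical helper written for this example; trying the longer pattern first is just one possible workaround and an assumption here, not something the approach above does.

```lua
-- The two patterns as the simplistic deformatter produces them
-- (note that the trailing "." is an unescaped wildcard here)
local p1 = "(.*) hits (.*)."
local p2 = "(.*) hits (.*) hard."
local input = "Alice hits Bob hard."

-- Both patterns match, but with different captures
local _, _, a1, b1 = string.find(input, p1)
local _, _, a2, b2 = string.find(input, p2)
print(a1, b1)  -- Alice and "Bob hard"
print(a2, b2)  -- Alice and "Bob"

-- Hypothetical workaround: try patterns with more text first.
-- Returns the matching pattern and its first two captures, or nil.
function TryLongestFirst(str, patterns)
  local ordered = {}
  for _, p in ipairs(patterns) do
    table.insert(ordered, p)
  end
  table.sort(ordered, function(a, b)
    return string.len(a) > string.len(b)
  end)
  for _, p in ipairs(ordered) do
    local found, _, v1, v2 = string.find(str, p)
    if found then
      return p, v1, v2
    end
  end
  return nil
end

print(TryLongestFirst(input, { p1, p2 }))              -- p2 wins: Alice, Bob
print(TryLongestFirst("Alice hits Bob.", { p1, p2 }))  -- p1 wins: Alice, Bob
```

Pattern length is only a rough stand-in for "more specific", so treat this as a starting point to experiment with.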