It is striking how often we spend time trying to optimize something when we would be better off just picking a better algorithm. There is the old story about the mathematician Gauss who, when in school, was given busy work: add the integers from 1 to 100. While the other students laboriously added each number, Gauss realized that 100 + 1 is 101 and 99 + 2 is also 101. Guess what 98 + 3 is? Of course, 101. So you can easily find that there are 50 pairs that add up to 101 and know the answer is 5,050. No matter how fast you can add, you aren’t likely to beat someone who knows that algorithm. So here’s a question: You have a large body of text and you want to search it for a string. What’s the best way?
Of course, that’s a loaded question. Best can mean many things and will depend on the kind of data you are dealing with and even the type of machine you are using. If you are just looking for a string, you could, of course, do the brute force algorithm. Let’s say we are looking for the word “convict” in the text of War and Peace:
1. Start with the first letter of War and Peace.
2. If the current letter isn’t the same as the current letter of “convict” then move to the next letter, reset the current letter in “convict” and go back to step 2 until there are no more letters.
3. If the current letters are the same, move to the next letter of “convict” and, without forgetting the current letter of the text, compare it to the next letter. If it is the same, keep repeating this step until there are no more letters in “convict” (at which point you have a match). If not the same, reset the current letter of “convict” and also go back to the original current letter of the text and then move to the next letter, going back to step 2.
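The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used for the measurements later on, and the function name `naive_search` is made up for the example.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every index where pattern starts in text (brute force)."""
    matches = []
    if not pattern:
        return matches
    # Try each possible starting position in the text.
    for start in range(len(text) - len(pattern) + 1):
        offset = 0
        # Keep advancing through the pattern while the letters agree.
        while offset < len(pattern) and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == len(pattern):
            matches.append(start)
        # On a mismatch we simply fall through: the pattern position resets
        # and the search resumes one letter past the original start.
    return matches


print(naive_search("the other then", "the"))  # [0, 5, 10]
```

Note that “other” and “then” both produce hits here, because the search looks for the letters, not for whole words.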
That’s actually hard to describe in English. But, in other words, just compare the text with the search string, character by character until you find a match. That works and, actually, with some modern hardware you could write some fast code for that. Can we do better?
Again, it really depends on your definition of better. Let’s assume that the text contains many strings that are almost what we are looking for, but not quite. For example, War and Peace probably has many occurrences of the word “the” within it. But it also has “there,” “then,” and “other” all of which contain our target word. For the word “the” that isn’t a big deal because it is short, but what if you were sifting through large search strings? (I don’t know — DNA genome data or something.) You’d spend a lot of time chasing down dead ends. When you discover that the current text has 199 of the 200 characters you are looking for, it is going to be disappointing.
There’s another disadvantage. While it is easy to tell where the string matches and, therefore, where it doesn’t match, it is hard to figure out if there has been just a small insertion or deletion when it doesn’t match. This is important for tools like rsync, which don’t just want to know what matched; they want to understand why things don’t match.
In fact, it was looking at rsync that led me to see how it compares two files using a rolling checksum. While it might not be for every application, it is something interesting to have in your bag of tricks. Obviously, one of the best uses of this “rolling checksum” algorithm is exactly how rsync uses it. That is, it finds when files are different very quickly but can also do a reasonable job of figuring out when they go back to being the same. By rolling the frame of reference, rsync can detect that something was inserted or deleted and make the appropriate changes remotely, saving network bandwidth.
In Search Of
However, you can use the same strategy for handling large text searches. To do this, you need a hashing algorithm that can put in and take out items easily. For example, suppose the checksum algorithm was dead simple. Just add the ASCII codes for each letter together. So the string “AAAB” hashes to 65 + 65 + 65 + 66 or 261. Now suppose the next character is a C, that is, “AAABC”. We can compute the checksum starting at the second position by subtracting the first A (65) and adding a C (67), which gives 263. Silly with this small data set, of course, but instead of adding hundreds of numbers each time you want to compute a hash, you can now do it with one addition and one subtraction.
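Here is that arithmetic as a small Python sketch. The helper names `additive_hash` and `roll` are invented for illustration; the only thing taken from the text is the “sum of character codes” checksum.

```python
def additive_hash(s: str) -> int:
    """Dead-simple checksum: the sum of the character codes."""
    return sum(ord(ch) for ch in s)


def roll(hash_value: int, outgoing: str, incoming: str) -> int:
    """Slide the window one place: drop one character, take in the next."""
    return hash_value - ord(outgoing) + ord(incoming)


text = "AAABC"
width = 4

first = additive_hash(text[:width])          # "AAAB" -> 261
second = roll(first, text[0], text[width])   # "AABC" -> 261 - 65 + 67 = 263

print(first, second)  # 261 263
```

The rolled value is the same as hashing “AABC” from scratch, but it costs one subtraction and one addition no matter how wide the window is.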
We can then compute the hash for our search string and start computing the hashes of the file for the same length. If the hash codes don’t match, we know there is no match and we move on. If they do match, we probably need to verify the match since hashes are, generally, inexact. Two strings might have the same hash value.
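A rough sketch of that search loop, again using the simple additive checksum, might look like the following. The function name `rolling_hash_search` is hypothetical and this is not the code used for the measurements below.

```python
def rolling_hash_search(text: str, pattern: str) -> list[int]:
    """Find pattern in text using a rolling sum-of-codes checksum."""
    n, m = len(text), len(pattern)
    matches = []
    if m == 0 or m > n:
        return matches

    target = sum(ord(c) for c in pattern)    # hash of the search string
    window = sum(ord(c) for c in text[:m])   # hash of the first window

    for start in range(n - m + 1):
        # Only bother with a real comparison when the hashes agree,
        # because two different strings can share a hash value.
        if window == target and text[start:start + m] == pattern:
            matches.append(start)
        # Roll the window: drop the outgoing letter, add the incoming one.
        if start + m < n:
            window += ord(text[start + m]) - ord(text[start])
    return matches


# "eht" has the same checksum as "the", so it triggers a verification
# that fails; only the real match at index 4 is reported.
print(rolling_hash_search("eht the", "the"))  # [4]
```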
There are, however, a few problems with this. If you are just looking for a single string, computing the hash is expensive. In the worst case, you’ll have to do a compare, an add, and a subtract for each character, plus maybe some tests when you have a hash collision: two strings with the same hash that don’t actually match. With the normal scheme, you’ll just have to do a test for each character along with some wasted tests for false positives.
To optimize the hash algorithm, you can do fancier hashing. But that is also more expensive to compute, making the overhead even worse. However, what if you were looking for a bunch of similar strings all with the same length? Then you could compute the hash once and save it. Each search after that would be very fast because you won’t waste time investigating many dead ends only to backtrack.
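One way to picture the “compute once and save it” idea is to keep the window hashes in a list and reuse that list for every search string of the same length. This is just a sketch under that assumption; `window_hashes` and `search_saved` are made-up names.

```python
def window_hashes(text: str, width: int) -> list[int]:
    """Checksum of every width-sized window in text, computed by rolling."""
    if width <= 0 or width > len(text):
        return []
    h = sum(ord(c) for c in text[:width])
    hashes = [h]
    for start in range(1, len(text) - width + 1):
        # Add the letter entering the window, drop the one leaving it.
        h += ord(text[start + width - 1]) - ord(text[start - 1])
        hashes.append(h)
    return hashes


def search_saved(text: str, hashes: list[int], pattern: str) -> list[int]:
    """Search using saved hashes; pattern must match the saved width."""
    target = sum(ord(c) for c in pattern)
    m = len(pattern)
    return [i for i, h in enumerate(hashes)
            if h == target and text[i:i + m] == pattern]


text = "the other then"
saved = window_hashes(text, 3)           # pay the rolling cost once
print(search_saved(text, saved, "the"))  # [0, 5, 10]
print(search_saved(text, saved, "her"))  # [6]
```

Each later search is then one integer comparison per position, plus a real string comparison only where the hashes agree.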
My hash algorithm is very simple, but not very good. For example, you can see in the example that there is one false positive that will cause an extra comparison. Of course, better hash algorithms exist, but there’s always a chance of a collision.
How much difference does this hashing strategy make? Well, I decided to write a little code to find out. I decided to ignore the cost of computing the search pattern hash and the initial part of the rolling hash as those will zero out over enough iterations.
If you search for the word “convict” in the text of War and Peace from Project Gutenberg, you’ll find it only occurs four times in 3.3 million characters. A normal search had to make about 4.4 million comparisons to figure that out. The hash algorithm wins, though only narrowly, with just under 4.3 million. But the hash computation ruins it. If you count the add and subtract as the same cost as two comparisons, that adds about 5.8 million pseudo comparisons to the total.
Is that typical? There probably aren’t too many false positives for “convict.” If you run the code with the word “the” which should have a lot of false hits, the conventional algorithm takes about 4.5 million comparisons and the adjusted total for the hash algorithm is about 9.6 million. So you can see how false positives affect the normal algorithm.
You’ll note that my lackluster hashing algorithm also results in a large number of false hash positives, which erode some of the benefits. A more complex algorithm would help, but would also cost some upfront computation so it doesn’t help as much as you might think. Nearly any hashing algorithm for an arbitrary string will have some collisions. Of course, for small search strings, the hash could be the search string and that would be perfect, but it isn’t feasible in the general case.
The code doesn’t save the hashes, but suppose it did and suppose the false positive rate of the first search is about average. That means we save a little more than 100,000 comparisons per search once the hashes are precomputed. So once you have to search for 60 or so strings, you break even. If you search for 600 strings — but don’t forget, they all have to be the same size — you can save quite a bit over the easy comparison code.
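The break-even estimate is easy to check with the rounded figures quoted above. The numbers below are those approximations, not fresh measurements, and the names are just for illustration.

```python
# Rounded figures from the text (approximations, not new measurements).
HASH_OVERHEAD = 5_800_000      # pseudo comparisons spent computing hashes once
SAVING_PER_SEARCH = 100_000    # comparisons saved per search after that


def net_saving(searches: int) -> int:
    """Comparisons saved overall after a given number of same-size searches."""
    return searches * SAVING_PER_SEARCH - HASH_OVERHEAD


print(HASH_OVERHEAD / SAVING_PER_SEARCH)  # 58.0, so "60 or so" searches
print(net_saving(600))                    # 54200000
```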
I didn’t actually time things, because I didn’t want to optimize each bit of code. In general, fewer operations are going to be better than more operations. There are plenty of ways to bump the code’s efficiency up and also some heuristics you could apply if you analyze the search string a little bit. But I just wanted to verify my gut-level feel for how much each algorithm spent on searching the text.
I originally started thinking about this after reading the code for rsync and the backup program kup. Turns out there is a name for this: the Rabin-Karp algorithm. There are some better hash functions that can reduce false positives and get a few extra points of efficiency.
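As one example of a better hash, Rabin-Karp is commonly described with a polynomial hash taken modulo a large prime, which makes the hash depend on letter order as well as letter values. The sketch below assumes a base of 256 and a modulus of 1,000,000,007; both are arbitrary choices for illustration and not what rsync or kup use.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, modulus: int = 1_000_000_007) -> list[int]:
    """Rabin-Karp search with a polynomial rolling hash (illustrative)."""
    n, m = len(text), len(pattern)
    matches = []
    if m == 0 or m > n:
        return matches

    # Weight of the leftmost character in a window of m characters.
    high = pow(base, m - 1, modulus)

    target = window = 0
    for i in range(m):
        target = (target * base + ord(pattern[i])) % modulus
        window = (window * base + ord(text[i])) % modulus

    for start in range(n - m + 1):
        # Collisions are rarer than with a plain sum, but still possible,
        # so a hash match is always verified against the text.
        if window == target and text[start:start + m] == pattern:
            matches.append(start)
        if start + m < n:
            # Remove the outgoing character, shift, and add the incoming one.
            window = (window - ord(text[start]) * high) % modulus
            window = (window * base + ord(text[start + m])) % modulus
    return matches


print(rabin_karp("eht the", "the"))  # [4]
```

With this hash, “eht” and “the” no longer collide, so the anagram never reaches the verification step. The price is a multiply and a modulo per character instead of a single add and subtract.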
What’s my point? I’m not suggesting that an RK search is your best approach for things. You really need a big data set with a lot of fixed-size searches to get an advantage out of it. If you think about something like rsync, it is really using the hashes to search for places where two very long strings might be equal. But I think there are cases where these oddball algorithms might make sense, so it is worth knowing about them. It is also fun to challenge your intuition by writing a little code and getting some estimates of just how much better or worse one algorithm is than another.