Install this package:
$ jget get spell
Jake's shitty spelling corrector
This is a fun little toy project that I think might be useful. I was primarily motivated to do this by stockroom-terminal's search function. Currently, the filter requires an exact substring match to find an item. In other words, if you're looking for "minecraft:iron_ingot", you need to enter an exact substring of that name to find it.
But that's not ideal - hitting the wrong key is pretty common. It would be great if it could find "minecraft:iron_ingot" even if the query was "mniecraft:irn_ingot" or "igut".
There are plenty of open questions as to how to actually implement such a behaviour, but all of them involve some kind of corrector; so that's what I'm going to try and build today.
Peter Norvig's wonderful article on spelling correctors is a great introduction on how spelling correction works at a basic level. My approach is going to differ slightly, but the broad strokes will be the same.
To quote directly from the article, we want to compute: argmax c in Candidates: P(c) P(w|c) / P(w) where c is a candidate and w is the query.
Just as for Peter, P(w) is the same for every candidate, so we can factor it out. We are then looking for: argmax c in Candidates: P(c) P(w|c)
We need to work out the 4 components:
- selection mechanism (i.e. argmax) -> how to choose the best correction
- candidate model (i.e. choice of c) -> how to choose which corrections to consider
- language model (i.e. P(c)) -> how likely is a particular candidate?
- error model (i.e. P(w|c)) -> how likely is this particular misspelling, given a candidate?
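To make the structure concrete, here is a minimal sketch of how the pieces might fit together; `prior` (P(c)) and `error_prob` (P(w|c)) are placeholder names for the language and error models discussed below, not the actual functions in this package:

```lua
-- Minimal sketch: pick the candidate c that maximises P(c) * P(w|c).
-- `prior` and `error_prob` are stand-ins for the models discussed below.
local function correct(query, candidates, prior, error_prob)
  local best, best_score = nil, -math.huge
  for _, c in ipairs(candidates) do
    local score = prior(c) * error_prob(query, c)
    if score > best_score then
      best, best_score = c, score
    end
  end
  return best
end
```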
Selection mechanism and candidate generation
Argmax remains relatively easy to implement, even when we have to do it ourselves in Lua. One difference is that, as this is an information retrieval setting, we may want to select not just the single best fit but the best n fits. That's what I've implemented.
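As a rough sketch (not necessarily the exact code in this package), selecting the best n fits just means scoring every candidate, sorting, and keeping the first n:

```lua
-- Sketch: return the n highest-scoring candidates instead of a single argmax.
-- `score` maps a candidate to a number, e.g. P(c) * P(w|c) for the current query.
local function top_n(candidates, score, n)
  local scored = {}
  for _, c in ipairs(candidates) do
    scored[#scored + 1] = { item = c, score = score(c) }
  end
  table.sort(scored, function(a, b) return a.score > b.score end)

  local result = {}
  for i = 1, math.min(n, #scored) do
    result[i] = scored[i].item
  end
  return result
end
```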
The candidate generator is a bit more interesting. My dictionary is extremely constrained; we only want to consider items in the storage system. At most, we might be looking at perhaps 200 different types of items - which is quite a tiny dictionary. Because this is such a small space, it isn't prohibitively expensive to use the whole thing as our candidate set for every query.
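In code, the candidate model is therefore almost a no-op. Assuming the storage system exposes its contents as a table mapping item IDs to counts (the `stock` table here is hypothetical), it is just:

```lua
-- Sketch: the candidate set is simply every item name currently in storage.
-- `stock` is an assumed table of item ID -> count, however the storage
-- system actually exposes it.
local function candidates(stock)
  local names = {}
  for name in pairs(stock) do
    names[#names + 1] = name
  end
  return names
end
```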
This doesn't make sense for Peter, as the number of words in his dictionary is much bigger: on the order of tens of thousands. In that context, it's actually less work to generate candidates the way he does - looking only at words 1 or 2 edits away from the query. For us, especially for longer item names, that edit set would grow larger than our entire dictionary, so it would probably be worse than what we do currently.
Another important consideration is that this should be a "live" corrector - meaning it should offer corrections for incomplete words as they are typed. Peter's version only works on complete words. In practice, this means the filtering he does - keeping only those candidates that appear in the dictionary - is infeasible for us: in many cases, it would take more than 1 or 2 edits to turn a partial query into a valid item name.
Sometimes a completed query doesn't even have this property - it's very common when using stockroom to omit the "minecraft:" part of the item ID, as it doesn't really help discriminate between items. Our corrector should be able to handle this - and that means that a simple 1 or 2 edit candidate set just won't cut it.
Language model
The language model is also quite interesting in this application, because it's hard to work out what a good prior P(c) for each item would be.
A simple idea we could exploit is to reason that a query is more likely if we have more of the item in the system. For this, we could just use the item count in the storage system. A little more thought should convince you that this probably isn't a great solution: by far the most common item in the system is frequently cobblestone, but that isn't necessarily the most common retrieval request.
We need the system to encode how useful an item is to us. And unfortunately, there aren't really any operations we can do on the names of items alone to generate this. The only real way of knowing is to measure.
For now, I'm going to just assume everything is equally likely. But eventually I would like to do a bit better, so I'm going to add a logger to VATM that will record withdrawals. Hopefully we can use this data later to learn a better language model.
To do this, I am adding telemetry to stockroom-terminal. It's going to record the number of times a given item is requested.
I'm also going to add Laplace smoothing. This is so that odd items that aren't requested very often don't get totally ignored by the system, but simply de-prioritised.
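As a sketch, assuming the logger eventually gives us a table of withdrawal counts per item, a Laplace-smoothed prior could look something like this (the function and parameter names are mine, not the package's):

```lua
-- Sketch: Laplace-smoothed language model P(c) built from withdrawal counts.
-- `counts` maps item names to how often they've been withdrawn, `vocab` is
-- every item name in the system, and `alpha` is the smoothing constant
-- (alpha = 1 gives classic add-one smoothing).
local function make_prior(counts, vocab, alpha)
  alpha = alpha or 1
  local total = 0
  for _, item in ipairs(vocab) do
    total = total + (counts[item] or 0) + alpha
  end
  return function(item)
    return ((counts[item] or 0) + alpha) / total
  end
end
```

With no recorded withdrawals at all, every item gets the same probability, which matches the "everything is equally likely" starting point above.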
Error model
The error model for our system needs to capture the probability that a given query really represents a given candidate. Peter has different classes of candidates with varying likelihoods. We don't have that - our candidate generator doesn't encode any information like that. Instead, we need a metric that can somehow discern how likely a given query-candidate pair is.
For this, we will use string similarity - in this case, a similarity derived from Levenshtein distance, though raw edit distance would be just as good.
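As a sketch of what that could look like: the standard dynamic-programming Levenshtein distance, normalised by length into a similarity in [0, 1] that can stand in for P(w|c) (the normalisation here is one reasonable choice, not necessarily the one this package uses):

```lua
-- Sketch: classic two-row dynamic-programming Levenshtein distance.
local function levenshtein(a, b)
  local la, lb = #a, #b
  local prev, cur = {}, {}
  for j = 0, lb do prev[j] = j end
  for i = 1, la do
    cur[0] = i
    for j = 1, lb do
      local cost = (a:sub(i, i) == b:sub(j, j)) and 0 or 1
      cur[j] = math.min(prev[j] + 1,        -- deletion
                        cur[j - 1] + 1,     -- insertion
                        prev[j - 1] + cost) -- substitution
    end
    prev, cur = cur, prev
  end
  return prev[lb]
end

-- One possible way to turn the distance into a similarity in [0, 1]:
-- identical strings score 1, completely different strings score near 0.
local function similarity(w, c)
  local d = levenshtein(w, c)
  return 1 - d / math.max(#w, #c, 1)
end
```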