JGET


spell

Downloads
19
Created On
1/27/2025
Last Updated
10/14/2025

Install this package:

$ jget get spell

Jake's shitty spelling corrector

This is a fun little toy project that I think might be useful.

I decided to write a little domain-specific spelling corrector (just for fun). I was primarily motivated to do this by stockroom-terminal's search function. Previously, the filter required a perfect substring match to find an item. In other words, if you're looking for "minecraft:iron_ingot", you need to enter an exact substring of that term to find it.

But that's not ideal - hitting the wrong key is pretty common. It would be great if it could find "minecraft:iron_ingot" even if the query was "mniecraft:irn_ingot" or even just "igut".

There are plenty of open questions as to how to actually implement such a behaviour, but all of them involve some kind of corrector; so that's what I'm going to try and explain today.

Peter Norvig's wonderful article on spelling correctors is a great introduction to how spelling correction works at a basic level. In fact, if all you care about is how to build spelling correctors, you probably don't even need to read this article at all. Peter's is much better, and my approach differs only slightly.

In particular, I've built a domain-specific spelling corrector. There are certain assumptions I can make that yield a more effective (read: accurate) corrector in this particular use-case. These assumptions don't make sense in a more general setting; the data they rely on might not exist, or they might make for a less accurate corrector in a general case.

So that's the first thing I'd like to focus on: exactly how I've tailored a spell checker to this specific use case might be of interest.

I'm also going to look into how a spell checker can be integrated into an actual working system. There are lots of ways to do this, and lots of potential advantages. For one, we can look at extracting data from the existing system to help make our corrector better. But we can also use these kinds of mechanisms to help evaluate the product we build - and that's also important.


Now, on with the show. To quote directly from the article, we want to compute:

$$ \text{argmax}_{c \in \text{Candidates}}\ P(c|w) $$

Where $c$ is a candidate, and $w$ is the query. In other words, we are looking for the best (read: most likely) candidate in our set of potential candidates, given our query $w$.

We can use Bayes' rule to break this down into some smaller parts:

$$ \text{argmax}_{c \in \text{Candidates}}\ \frac{P(c) \cdot P(w|c)}{P(w)} $$

And of course, $P(w)$ (the probability of any given query) is the same for all candidates, so we can factor it out. We are then looking for:

$$ \text{argmax}_{c \in \text{Candidates}}\ P(c) \cdot P(w|c) $$

There are four components in this equation that we need to work out and implement:

  1. A selection mechanism (i.e. $\text{argmax}$) -> how to choose the best candidate
  2. A candidate model (i.e. the choice of $c$) -> how to choose which candidates to consider
  3. A language model (i.e. $P(c)$) -> how likely is a particular candidate in general?
  4. An error model (i.e. $P(w|c)$) -> how likely is a particular misspelling, given a candidate?
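The project itself is written in Lua, but the shape of the computation is easy to sketch. Here's a minimal Python illustration of how the four components fit together; `language_model`, `error_model`, and the toy substring scorer are all placeholder names, not part of the actual package:

```python
# Sketch of the overall corrector: score each candidate c by
# P(c) * P(w|c) and select the best one (the argmax).

def correct(query, candidates, language_model, error_model):
    """Return the candidate c maximising P(c) * P(w|c) for query w."""
    return max(candidates, key=lambda c: language_model(c) * error_model(query, c))

# Toy models for illustration: a uniform language model, and a crude
# error model that only checks for an exact substring match.
items = ["minecraft:iron_ingot", "minecraft:gold_ingot", "minecraft:cobblestone"]
uniform = lambda c: 1.0 / len(items)
substring = lambda w, c: 1.0 if w in c else 0.0

print(correct("iron", items, uniform, substring))  # → minecraft:iron_ingot
```

The rest of the article is about replacing those two toy models with something better.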

Selection mechanism and candidate generation

Lua does not have a built-in argmax function. Fortunately, argmax is pretty easy to implement, even when we have to do it ourselves. If you want the top $n$ items from a sorted list, the skipTake function from the utils package can give you exactly that (just set skip to 0 and take to $n$).

For this reason, I decided to just return the whole list of candidates, sorted by score. This is probably OK, as I don't expect the list to ever get that long. It's also probably more useful this way; it's possible a caller will want to see more than just the single best match.
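In Python terms (again just a sketch; the real implementation is Lua, and `char_overlap` is a made-up stand-in for a real scoring function), "return everything sorted by score" looks like this, with "take the top $n$" reduced to a slice:

```python
def ranked_candidates(query, candidates, score):
    """Return ALL candidates sorted best-first, rather than just the
    argmax. Taking the top n is then a simple slice, which plays the
    same role as skipTake(0, n) in the Lua version."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)

# Hypothetical scorer: fraction of query characters present in the candidate.
def char_overlap(w, c):
    return sum(1 for ch in w if ch in c) / max(len(w), 1)

items = ["minecraft:iron_ingot", "minecraft:cobblestone"]
best = ranked_candidates("igut", items, char_overlap)[:1]  # top-1 slice
```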

The candidate generator is a bit more interesting. My dictionary is extremely constrained; we only want to consider items in the storage system. At most, we might be looking at perhaps 200 different types of items - which is quite a tiny dictionary. Because this is such a small space, it isn't prohibitively expensive to use the whole thing as our candidate set for every query.

This doesn't make sense for Peter, as the number of words in his dictionary is much bigger; on the order of several thousand. In that context, it's actually less work to generate candidates in the manner that he does - looking only at words 1 or 2 edits away. For us, especially for longer words, that approach would probably be worse than what we have currently.

Another important consideration is that this should be a "live" corrector - meaning it should provide corrections for incomplete words. Peter's version works on completed words. In practice, this means that the filtering he does - considering only candidates that appear in the dictionary - is infeasible for us. In many cases, it would take more than 1 or 2 edits to turn a partial query into a valid item name.

Sometimes a completed query doesn't even have this property - it's very common when using stockroom to omit the "minecraft:" part of the item ID, as it doesn't really help discriminate between very many items. Our corrector should be able to handle this - and that means that a simple 1-or-2-edit candidate set just won't cut it.

A more interesting question is whether we should consider the "minecraft:" part of the word at all.
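One hedged way to sidestep that question is to not decide at all: score the query against both the full item ID and the part after the namespace, and keep whichever is more similar. A Python sketch, using the standard library's `difflib.SequenceMatcher` as a stand-in for whatever similarity the corrector actually uses:

```python
import difflib

def ratio(a, b):
    """Stand-in string similarity in [0, 1] from the Python stdlib."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def best_similarity(query, item_id):
    """Score the query against both the full ID and the name after the
    namespace, so a query like "irn_ingot" still matches
    "minecraft:iron_ingot" well even though the prefix was omitted."""
    short = item_id.split(":", 1)[-1]  # e.g. "iron_ingot"
    return max(ratio(query, item_id), ratio(query, short))
```

This costs one extra similarity computation per candidate, which is fine at our dictionary sizes.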

Language model

The language model is also quite interesting in this application, because it's hard to work out what a good value would be.

A simple idea we could exploit is to reason that a query is more likely if we have more of that item in the system; for this, we could just use the item count in the storage system. A little more thought should convince you that this probably isn't a great solution. The most common item in the system is frequently cobblestone, but that doesn't make it the most common retrieval request.

We need the system to encode how useful an item is to us. And unfortunately, there aren't really any operations we can do on the names of items alone to generate this. The only real way of knowing is to measure.

For now, I'm going to just assume everything is equally likely. But in the future, I would like the corrector to learn and get a bit better. Therefore, I'm going to add a logger to VATM that will record withdrawals. We can hopefully use this data to learn a better language model.

To do this, I am adding telemetry to stockroom-terminal. It's going to record the number of times a given item is requested.

I'm also going to add Laplace smoothing. This is so that unusual items that are rarely requested don't get totally ignored by the system, but simply de-prioritised.
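A count-based language model with add-one (Laplace) smoothing is only a few lines. A Python sketch, assuming the telemetry gives us a table of request counts (the names here are illustrative, not the package's API):

```python
def language_model(counts, vocabulary):
    """Estimate P(c) from request counts with add-one smoothing, so
    never-requested items get a small non-zero probability."""
    total = sum(counts.get(c, 0) for c in vocabulary)
    denom = total + len(vocabulary)  # +1 for every vocabulary entry
    return {c: (counts.get(c, 0) + 1) / denom for c in vocabulary}

vocab = ["minecraft:iron_ingot", "minecraft:cobblestone", "minecraft:podzol"]
counts = {"minecraft:iron_ingot": 8, "minecraft:cobblestone": 1}
probs = language_model(counts, vocab)
# Probabilities sum to 1; podzol is de-prioritised but not zeroed out.
```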

Error model

The error model for our system needs to capture the probability that a given query really represents a given candidate. Peter has different classes of candidates with varying likelihoods. We don't have that - our candidate generator does not encode information like that. Instead, we need a metric that can somehow discern how likely a given query-candidate pair is.

For this, we will use string similarity. In this case, Levenshtein similarity - but the raw edit distance would be just as good.
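Levenshtein distance is the classic dynamic-programming edit distance, and a similarity can be derived from it by normalising against the longer string's length. A Python sketch of one common formulation (the real package is Lua; this is illustrative):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (one-row DP formulation)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Normalise edit distance into a similarity in [0, 1]: identical
    strings score 1.0, completely different strings approach 0."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Either `similarity` or the raw `levenshtein` score can stand in for $P(w|c)$ here; since we only ever compare scores across candidates for the same query, any monotone transform of the distance ranks them identically.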