Fine-Tuning an LLM to Generatively Solve Cryptic Crossword Clues (PART 1 / 2)

Armen Bodossian
8 min read · Aug 4, 2023
Text-generated AI image from hotpot.ai

Recently, my interest in solving Cryptic Crosswords has admittedly become something of an obsession — even if they cause me a fair bit of struggle! One day I decided to experiment and see whether I could solve them using deep-learning techniques in lieu of my brain — which is where this miniseries comes in.

The series is split into two parts:

PART 1: Fine-tuning google/t5-small-ssm, a small (77M-parameter) instance of the T5 LLM available on HuggingFace and suited to generative, closed-book Q&A. Training is conducted with 90k cryptic crossword clues for 50 epochs on Google Colab using a single T4 GPU, taking approx. 6hr 40min. Evaluation is performed on a holdout test set of 10k examples.

PART 2 (COMING SOON!): We extend this research to train a much larger model, using a cutting-edge quantised representation that allows us to still train using just a single GPU.

Furthermore, if you wish to replicate or even improve on my results, you can find the code and detailed explanations of the methodology in my GitHub repository.

Let’s delve into it!

Example Cryptic Crossword Clue (for those less familiar)

Cryptic Crosswords look the same as ordinary crosswords but behave rather differently.

Here’s an example from The Telegraph “Big Book of Cryptic Crosswords 8, #34”:

12. One criticised about nothing becomes cut off (8)

The answer is ISOLATED. The reasoning for this is set out:

  • The definition is usually found at the start or the end of a cryptic clue; in this case it is “cut off”
  • ‘One’ written in Roman numerals is ‘I’
  • Another word for ‘criticised’ is ‘SLATED’
  • ‘Nothing’ is equivalent to zero, which can be written as the letter ‘O’
  • Therefore we have ‘I’ + ‘SLATED’ about ‘O’ (‘about’ can also mean ‘around’), giving I + S(O)LATED = ISOLATED, which means “cut off”

As you can see, getting even one answer is very involved and often requires multiple steps. As such, lots of practice is needed before one can start to become comfortable solving clues without constantly flipping to the answers at the back of the book. The more you try, the more you start noticing patterns in the way the clues are structured, and the easier and more rewarding clues become to solve.

Thus, solving cryptic clues doesn’t just rely on good general knowledge, as regular crosswords do. The set of patterns, however, is limited, and that is what motivates fine-tuning a model to see if it can learn them.

Using an Intelligent Solution

Few-shot prompts fed to OpenAI’s ChatGPT

As many of us would nowadays try first, I tested ChatGPT with a few-shot prompt to see how well it would perform. Each example in the prompt contained a Clue, an Answer and a Reason for the answer. Despite reporting that it understood the examples, ChatGPT generally didn’t perform well — though it did a good job of replicating the output format given in the examples. It was particularly bad at respecting the rule that the number in brackets specifies the number of letters in the desired answer.
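To give a flavour, here is a rough sketch of how such a few-shot prompt could be assembled in Python. The instruction text and the example entry are illustrative placeholders, not the exact prompt I used:

# Build a few-shot prompt in the Clue / Answer / Reason format described above.
# The wording and example entries are illustrative, not the actual prompt used.
examples = [
    {
        "clue": "One criticised about nothing becomes cut off (8)",
        "answer": "ISOLATED",
        "reason": "I (one) + SLATED (criticised) around O (nothing), defined by 'cut off'.",
    },
]

parts = ["Solve the cryptic crossword clue. The number in brackets is the answer length."]
for ex in examples:
    parts.append(f"Clue: {ex['clue']}\nAnswer: {ex['answer']}\nReason: {ex['reason']}")
parts.append("Clue: <new clue to solve> (8)\nAnswer:")

prompt = "\n\n".join(parts)
print(prompt)  # paste into ChatGPT, or send via the OpenAI API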

This was a quick play and certainly not scientific. Had I wanted to, I could have designed a more rigorous prompt-engineering process, presumably using the OpenAI API and buying some tokens, which might have produced better results.

Instead, I decided on a methodology whereby I would feed a pre-trained LLM capable of generative Q&A with examples from a large Cryptic Crossword Dataset (more below), to see if I could fine-tune its output.

This is not the first example of applying deep learning to crossword problems. A computer scientist teamed up with a professor at the University of California, Berkeley, to solve (non-cryptic) crosswords as a pastime during lockdown — their program is called Dr. Fill. There is also an iOS app called Crossword Genius, developed by Unlikely AI, where you can scan crosswords with your phone camera and the mascot dog “Ross” helps solve and explain the clues. I tried it, and though the camera didn’t capture all of the clues on the page correctly (you can edit them manually), the clue solving is very impressive.

Note that these examples didn’t just solve individual clues; they also took the crossword grid shape into account. For simplicity, this was left out of scope here. There is a Python library which can generate a crossword UI from a text file containing clues, which could be interesting to explore in future work.

The Dataset

Cryptic Crossword Dataset viewer (link)

There is an open-source dataset containing over half a million cryptic crossword clues, collected from various sources. As well as the clue and answer, the dataset provides the definition extracted from the clue itself. This will be handy as an input feature for our model, providing some context.

As the website’s datasheet explains, there are cases of nulls and errors, so the repo includes some simple preprocessing to remove rows with corrupt data. The GitHub source behind this dataset also contains a limited number of annotations that actually explain the answer (similar to my explanation of the clue above), but permission is required to obtain access.
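As a rough illustration, the cleaning and splitting could look something like the sketch below. I am assuming a CSV export with clue, answer and definition columns; the file name, column names and filtering rule are placeholders rather than the repo’s exact code:

# Minimal sketch: drop corrupt rows, sample 100k clues and split 81k / 9k / 10k.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cryptic_crossword_clues.csv")
df = df.dropna(subset=["clue", "answer", "definition"])   # remove null rows
df = df[df["answer"].str.match(r"^[A-Za-z ]+$")]          # discard garbled answers

df = df.sample(n=100_000, random_state=42)
train_val, test = train_test_split(df, test_size=10_000, random_state=42)
train, val = train_test_split(train_val, test_size=9_000, random_state=42)
print(len(train), len(val), len(test))                    # 81000 9000 10000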

Methodology

I previously mentioned that this is a closed-book Q&A exercise. This means we require a model that produces generative, rather than discriminative, answers. HuggingFace has a good tutorial on discriminative, open-book Q&A — this is similar to a reading-comprehension exercise, where the model learns to find an answer within a block of text called a ‘context’.

Considering we neither have nor want to feed the model explanations for the crossword clues, our task is closed-book. That said, we can still provide the dataset’s definition column as a form of context during the training phase.
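Concretely, each training example can be serialised into a single input string, with the clue acting as the question and the definition as a short context. The template and field names below are my own assumptions, roughly following the usual T5 Q&A convention:

# Sketch of serialising one example for seq2seq Q&A; the template is an assumption.
def to_t5_example(clue: str, definition: str, answer: str) -> dict:
    return {
        "input_text": f"question: {clue}  context: {definition}",
        "target_text": answer,
    }

ex = to_t5_example(
    clue="One criticised about nothing becomes cut off (8)",
    definition="cut off",
    answer="ISOLATED",
)
print(ex["input_text"])   # question: One criticised about nothing ... context: cut off
print(ex["target_text"])  # ISOLATED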

Given these reasons, Google’s T5 encoder-decoder model is a suitable candidate for our purposes. The full model available on HuggingFace for Q&A is 11B parameters and a whopping 45.2GB in size, which means hosting and training on a single GPU on a Colab instance is infeasible. Luckily, a smaller model exists at a more reasonable 300MB and 77M parameters.
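Loading the smaller checkpoint with the transformers library is straightforward, and a quick parameter count confirms it fits comfortably on a single Colab GPU:

# Load the small closed-book Q&A checkpoint and sanity-check its size (~77M params)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/t5-small-ssm")
model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-small-ssm")

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")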

Diagrammatic example of T5 Pre-training and Fine-Tuning (source)

Training

The HuggingFace transformers library provides example scripts for fine-tuning and evaluation on the SQuAD comprehension dataset. We take these examples and adapt them slightly for our closed-book use case (differences detailed in the GitHub repo), including saving the outputs from the prediction phase.

We ran the training on a dataset of 100k preprocessed crossword clues, split into 81k training, 9k validation and 10k holdout examples. Some of the hyperparameters we used are shown below:

python run_seq2seq_qa.py \
  --model_name_or_path 'google/t5-small-ssm' \
  ...
  --learning_rate 2e-3 \
  --num_train_epochs 25 \
  --per_device_train_batch_size 192 \
  --max_seq_length 64 \
  --version_2_with_negative \
  ...
  • (Initial) Learning rate: 2e-3. The trainer reduces the rate per epoch
  • # epochs = 25. This was tuned to run the process within a reasonable timeframe
  • Batch size = 192. This was tuned to use a large amount of the available GPU memory (average 13.6/15GB available for T4 GPU on Colab), thus further reducing training time
  • Max sequence length = 64 (often longer for Q&A but crossword clues are short)
  • version_2_with_negative. This selects the SQuAD v2 evaluation process from the evaluate Python module. The output includes the exact-match score (the % of answers that are exactly correct) and the F1 score, both given as percentages.

Running this took 3hr 20min (I used a little hack I found to stop Colab from timing out; check out the repo README). As seen in the plot below, the training loss was still falling at a decent rate, indicating that more epochs, whilst taking longer, could produce a better result.

Training metrics for T5 fine-tuning: 25 epochs

I therefore decided to run a second round of training for a further 25 epochs, using the previous model as the starting checkpoint. It took the same time to run, and the loss fell by a factor of 10.
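With training finished, a quick sanity check is to load the fine-tuned weights and generate an answer for a single clue. The output directory and input template below mirror the earlier sketch and are assumptions about my setup rather than exact paths:

# Generate an answer for one clue with the fine-tuned model (paths are assumed)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_dir = "./output/t5-small-ssm-cryptic"   # assumed output directory of the run
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

inputs = tokenizer(
    "question: One criticised about nothing becomes cut off (8)  context: cut off",
    return_tensors="pt",
)
output_ids = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))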

Results

The results of the prediction looked like this:

{
...
"predict_samples": 10000,
"test_exact": 32.24,
"test_f1": 33.73559473859474,
...
}

The model answered 3,224 (32.24%) of the 10,000 clues in the test set correctly, with an F1 score of 33.7%. Considering the challenging nature of cryptic clues, getting over 3 in 10 clues correct actually seems encouraging to me. A second encouragement is that the length of the predicted string is correct 85% of the time, meaning the model is learning to make use of the numbers in brackets in the clue.

In NLP tasks, the F1 score measures the token overlap between the prediction and the reference answer — the low score here implies that when the prediction is not exactly right, the model is usually guessing a very different word or phrase.
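For what it’s worth, the exact-match and length-match figures above are simple to recompute from the saved predictions. The sketch below assumes a JSON file of answer/prediction pairs; the file actually written by the script may be laid out differently:

# Assumed format: a list of {"clue": ..., "answer": ..., "prediction": ...} records
import json

with open("predict_predictions.json") as f:
    rows = json.load(f)

exact = sum(r["prediction"].strip().upper() == r["answer"].strip().upper() for r in rows)
length_ok = sum(len(r["prediction"].strip()) == len(r["answer"].strip()) for r in rows)

print(f"Exact match:  {100 * exact / len(rows):.2f}%")      # ~32% in my run
print(f"Length match: {100 * length_ok / len(rows):.2f}%")  # ~85% in my run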

Qualitative checks show varied results for the incorrect answers. Some can easily be linked to the clue: in the examples below, row 2 is incorrect, but I can see that ESCAPE could be another word for vacation, so the model simply picked the wrong definition. Similarly, in row 3 it was unlucky, giving an answer that is off by just one letter. For rows 4 and 5, I am not convinced I know why it gave the predictions it did!

Conclusion / Part 2

Based on this initial experiment, it seems that a generative approach could become better at solving clues if we spend more time tuning the model. This could be achieved by:

  • Increasing the number of training examples
  • Increasing the number of epochs
  • Using a larger LLM with more parameters

All of these options come at a cost, as more time and compute resources are required. However, for point 3, we could avoid this by utilising a more powerful LLM that has been adapted to train on fewer resources through a technique called quantisation. We will explore this in Part 2, to see if it improves on these results. Stay tuned!

(If you got this far, congratulations! Here is a bonus sneak peek of what’s to come: 🤫)
