I've often found myself uneasy when LLMs (Large Language Models) are asked to quote the Bible. While they can provide insightful discussions about faith, their tendency to hallucinate responses raises concerns when dealing with scripture, which we regard as the inspired Word of God.
To explore these concerns, I created a benchmark to evaluate how accurately LLMs can recall scripture word for word. Here's a breakdown of my methodology and the test results.
Methodology
To ensure consistent and fair evaluation, I tested each model on six scenarios designed to measure its ability to recall scripture accurately. For readers interested in the technical details, the source code for the tests is available here. All tests were conducted with a temperature setting of 0, and I gave the models some slack by making the pass check case- and whitespace-insensitive.
A temperature of 0 ensures the models generate the most statistically probable response at each step, minimising creativity or variability and prioritising accuracy. This approach is particularly important when evaluating fixed reference material like the Bible, where precise wording matters.
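For concreteness, here is a minimal sketch of the kind of pass check described above. It is not the benchmark's actual code, and `ask_model` is just a hypothetical stand-in for whichever provider SDK is being called with temperature 0.

```python
import re

def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so the comparison is
    case- and whitespace-insensitive."""
    return re.sub(r"\s+", " ", text).strip().lower()

def passes(model_output: str, expected_verse: str) -> bool:
    """Word-for-word match after normalisation."""
    return normalise(model_output) == normalise(expected_verse)

# Hypothetical usage -- ask_model is a placeholder for a real provider call:
# expected_niv = "For God so loved the world that he gave his one and only Son, ..."
# response = ask_model("Quote John 3:16 from the NIV translation.", temperature=0)
# print(passes(response, expected_niv))
```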
Test 1: Popular Scripture Recall
Model | Pass |
---|---|
Llama 3.1 405B | ✅ |
Llama 3.1 70B | ✅ |
Llama 3.1 8B | ✅ |
Llama 3.3 70B | ⚠️ |
GPT 4o | ✅ |
GPT 4o mini | ✅ |
Gemini 1.5 Pro | ✅ |
Gemini 1.5 Flash | ✅ |
Gemini 2.0 Flash | ✅ |
Claude 3.5 Haiku | ✅ |
Claude 3.5 Sonnet | ✅ |
When asked to recall John 3:16 in the NIV translation, the only model that did not reproduce the verse word for word was Llama 3.3 70B. Even then, it was only a very slight translation mismatch: it recalled "only begotten Son", whereas the actual NIV verse does not include "begotten", despite the word appearing in other translations.
Test 2: Obscure Verse Recall
Model | Pass |
---|---|
Llama 3.1 405B | ✅ |
Llama 3.1 70B | ⚠️ |
Llama 3.1 8B | ❌ |
Llama 3.3 70B | ❌ |
GPT 4o | ✅ |
GPT 4o mini | ⚠️ |
Gemini 1.5 Pro | ⚠️ |
Gemini 1.5 Flash | ⚠️ |
Gemini 2.0 Flash | ⚠️ |
Claude 3.5 Haiku | ⚠️ |
Claude 3.5 Sonnet | ✅ |
Many models struggled to recall Obadiah 1:16 in the NIV word for word, often mixing in wording from other translations. In these cases I marked the result as partial, since the verse was recalled correctly in some translation, just not the one requested. The models that clearly succeeded were the very large ones: Llama 3.1 405B, GPT 4o, and Claude 3.5 Sonnet.
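The pass/partial/fail distinction can be stated precisely. The sketch below is a hypothetical illustration (building on the `normalise` helper from the earlier sketch, not how the results were actually graded): a response counts as partial when it matches some translation other than the one requested.

```python
def grade(model_output: str, requested: str, translations: dict[str, str]) -> str:
    """`translations` maps a translation name (e.g. "NIV") to that verse's text."""
    if normalise(model_output) == normalise(translations[requested]):
        return "pass"  # word-for-word in the requested translation
    if any(normalise(model_output) == normalise(text)
           for name, text in translations.items() if name != requested):
        return "partial"  # correct in some translation, just not the one asked for
    return "fail"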
Test 3: Verse Continuation
Model | Pass |
---|---|
Llama 3.1 405B | ✅ |
Llama 3.1 70B | ✅ |
Llama 3.1 8B | ❌ |
Llama 3.3 70B | ✅ |
GPT 4o | ✅ |
GPT 4o mini | ⚠️ |
Gemini 1.5 Pro | ✅ |
Gemini 1.5 Flash | ❌ |
Gemini 2.0 Flash | ✅ |
Claude 3.5 Haiku | ⚠️ |
Claude 3.5 Sonnet | ✅ |
When quoting 2 Chronicles 11:13 to the model (without specifying where in the Bible it is found) and asking it to produce the immediately following verse, the results were a much more mixed bag. Many medium-to-large models got this correct, but the smaller ones hallucinated parts of, or even all of, the verse. Claude 3.5 Haiku almost recalled the verse, but referred to the Levites as "they", a wording that does not appear in any of the better-known translations and suggests the model substituted the intent of the word rather than the exact text.
Test 4: Verse Block Recall
Model | Pass |
---|---|
Llama 3.1 405B | ✅ |
Llama 3.1 70B | ✅ |
Llama 3.1 8B | ❌ |
Llama 3.3 70B | ✅ |
GPT 4o | ✅ |
GPT 4o mini | ✅ |
Gemini 1.5 Pro | ✅ |
Gemini 1.5 Flash | ⚠️ |
Gemini 2.0 Flash | ✅ |
Claude 3.5 Haiku | ⚠️ |
Claude 3.5 Sonnet | ✅ |
When asked to recall Lamentations chapter 3, verses 19 through 24, the models did very well. Only the smallest model, Llama 3.1 8B, outright failed here, recalling the beginning of the chapter instead. The two warnings were only slight translation mismatches of a few words, but the essence of the passage was preserved.
Test 5: Query Based Lookup
Model | Pass |
---|---|
Llama 3.1 405B | ✅ |
Llama 3.1 70B | ✅ |
Llama 3.1 8B | ✅ |
Llama 3.3 70B | ✅ |
GPT 4o | ✅ |
GPT 4o mini | ✅ |
Gemini 1.5 Pro | ✅ |
Gemini 1.5 Flash | ✅ |
Gemini 2.0 Flash | ✅ |
Claude 3.5 Haiku | ✅ |
Claude 3.5 Sonnet | ✅ |
When asked, "What's that verse in the Bible about the Earth being filled with knowledge of God's glory?", every model successfully identified it as Habakkuk 2:14. Verse lookup is clearly a strong suit, even for smaller models.
Test 6: Entire Chapter Recall
Model | Pass |
---|---|
Llama 3.1 405B | ✅ |
Llama 3.1 70B | ✅ |
Llama 3.1 8B | ❌ |
Llama 3.3 70B | ✅ |
GPT 4o | ✅ |
GPT 4o mini | ✅ |
Gemini 1.5 Pro | ✅ |
Gemini 1.5 Flash | ✅ |
Gemini 2.0 Flash | ✅ |
Claude 3.5 Haiku | ✅ |
Claude 3.5 Sonnet | ✅ |
When asked for the entire contents of Romans 6 in the KJV translation, almost all of the models recalled all 23 verses accurately. Even in the failing case, Llama 3.1 8B recalled over 98% of the words correctly, with only 9 incorrect words.
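To give a sense of how a figure like "9 incorrect words" can be counted, here is a rough word-level comparison using Python's standard difflib. It is only an illustration, not the scoring used in the benchmark.

```python
import difflib

def word_diff(model_output: str, reference: str) -> tuple[int, float]:
    """Return (words differing from the reference, fraction of reference words matched)."""
    out_words = model_output.lower().split()
    ref_words = reference.lower().split()
    matched = sum(block.size for block in
                  difflib.SequenceMatcher(None, ref_words, out_words).get_matching_blocks())
    return len(ref_words) - matched, matched / len(ref_words)
```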
Conclusions
If you want to lean on an LLM for textually accurate Bible verses in popular translations, you should use higher parameter count (i.e. larger) models. Llama 3.1 405B, OpenAI GPT 4o and Claude 3.5 Sonnet all had perfect scores. Smaller models (in the 7B range) will often mix up translations, and in some cases mangle or hallucinate verses altogether. Medium-sized models (in the 70B range) usually preserve the intention of the verses, although the text may be a blend of several translations, and in some cases is paraphrased a little by the LLM.
You can certainly still use smaller models for discussion that references scripture by Book/Chapter/Verse, but it is important to lean on an actual copy of the Bible for the correct text in these cases.
Looking into the future, we may very well see smaller models perform better on these benchmarks, but there is surely a limitation to how much information can be encoded into such small models.
For full test results, see the results file here, including the raw prompts for each test. If you feel like I missed a crucial test, feel free to submit an issue on GitHub.