Can LLMs Accurately Recall the Bible?

I've often found myself uneasy when LLMs (Large Language Models) are asked to quote the Bible. While they can provide insightful discussions about faith, their tendency to hallucinate responses raises concerns when dealing with scripture, which we regard as the inspired Word of God.

To explore these concerns, I created a benchmark to evaluate how accurately LLMs can recall scripture word for word. Here's a breakdown of my methodology and the test results.

Methodology

To ensure consistent and fair evaluation, I tested each model on six scenarios designed to measure how accurately it can recall scripture. For readers interested in the technical details, the source code for the tests is available here. All tests were run with a temperature setting of 0, and I gave the models some slack by making the pass check case- and whitespace-insensitive. In the tables below, ✅ marks a word-for-word pass, ⚠️ a partial pass, and ❌ a fail.

A temperature of 0 makes a model generate the most probable token at each step, minimising creativity and variability in favour of accuracy. This is particularly important when evaluating fixed reference material like the Bible, where precise wording matters.
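The full harness is in the linked repo, but as a rough sketch of what that pass check amounts to (the function names here are illustrative, not the repo's), it could be as simple as:

```python
import re

def normalize(text: str) -> str:
    """Lower-case and collapse runs of whitespace, making the
    comparison case- and whitespace-insensitive."""
    return re.sub(r"\s+", " ", text.strip().lower())

def passes(response: str, expected: str) -> bool:
    """A response passes only if it matches the expected verse
    text word for word after normalisation."""
    return normalize(response) == normalize(expected)
```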

Test 1: Popular Verse Recall

| Model | Pass |
| --- | --- |
| Llama 3.1 405B | ✅ |
| Llama 3.1 70B | ✅ |
| Llama 3.1 8B | ✅ |
| Llama 3.3 70B | ⚠️ |
| GPT 4o | ✅ |
| GPT 4o mini | ✅ |
| Gemini 1.5 Pro | ✅ |
| Gemini 1.5 Flash | ✅ |
| Gemini 2.0 Flash | ✅ |
| Claude 3.5 Haiku | ✅ |
| Claude 3.5 Sonnet | ✅ |

When asked to recall John 3:16 in the NIV, the only model that failed to reproduce the verse word for word was Llama 3.3 70B, and even then the miss was a slight translation mismatch: it recalled "only begotten Son", whereas the NIV does not include "begotten", although other translations such as the KJV do.
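For a concrete picture of how a test like this might be issued at temperature 0 (the harness covers several providers; this sketch uses the OpenAI SDK, and the prompt wording is my illustration, not the exact one from the repo):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # pick the most probable token at each step
    messages=[
        {"role": "user",
         "content": "Quote John 3:16 from the NIV, word for word."},
    ],
)
print(response.choices[0].message.content)
```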

Test 2: Obscure Verse Recall

| Model | Pass |
| --- | --- |
| Llama 3.1 405B | ✅ |
| Llama 3.1 70B | ⚠️ |
| Llama 3.1 8B | ❌ |
| Llama 3.3 70B | ❌ |
| GPT 4o | ✅ |
| GPT 4o mini | ⚠️ |
| Gemini 1.5 Pro | ⚠️ |
| Gemini 1.5 Flash | ⚠️ |
| Gemini 2.0 Flash | ⚠️ |
| Claude 3.5 Haiku | ⚠️ |
| Claude 3.5 Sonnet | ✅ |

Many models struggled to recall Obadiah 1:16 in the NIV word for word, often mixing in wording from other translations. I marked these cases as partial (⚠️): the verse was recalled correctly in some translation, just not the requested one. The models that clearly succeeded were the very large ones: Llama 3.1 405B, GPT 4o, and Claude 3.5 Sonnet.
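That grading scheme is mechanical enough to sketch: check against the requested translation first, then fall back to the others. Reusing the passes helper from the methodology sketch (the grade function and its signature are my own illustration, not the repo's code):

```python
def grade(response: str, requested: str, translations: dict[str, str]) -> str:
    """Grade a response against the requested translation, falling back
    to any other translation for a partial result.

    `translations` maps a translation name (e.g. "NIV") to the verse text.
    """
    if passes(response, translations[requested]):
        return "pass"
    if any(passes(response, text)
           for name, text in translations.items() if name != requested):
        return "partial"
    return "fail"
```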

Test 3: Verse Continuation

| Model | Pass |
| --- | --- |
| Llama 3.1 405B | ✅ |
| Llama 3.1 70B | ✅ |
| Llama 3.1 8B | ❌ |
| Llama 3.3 70B | ✅ |
| GPT 4o | ✅ |
| GPT 4o mini | ⚠️ |
| Gemini 1.5 Pro | ✅ |
| Gemini 1.5 Flash | ✅ |
| Gemini 2.0 Flash | ✅ |
| Claude 3.5 Haiku | ⚠️ |
| Claude 3.5 Sonnet | ✅ |

When given the text of 2 Chronicles 11:13 (without being told where in the Bible it is found) and asked to produce the verse that immediately follows, the results were much more mixed. Most medium-to-large models got it right, but the smaller ones hallucinated part or all of the verse. Claude 3.5 Haiku came close, but referred to the Levites as "they", a wording that does not appear in any of the more well-known translations; the model appears to have substituted the sense of the word rather than the word itself.
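The shape of this prompt is what makes the test interesting: the model only sees the verse text, never the reference, so it must locate the passage from memory before continuing it. Roughly (my illustrative wording, not the repo's exact prompt, and for a real test the NIV text should be copied from a trusted source rather than quoted from memory as here):

```python
# NIV text of 2 Chronicles 11:13, quoted from memory for illustration.
verse_text = ("The priests and Levites from all their districts "
              "throughout Israel sided with him.")

prompt = (
    f'"{verse_text}"\n\n'
    "The line above is a verse from the Bible. Quote the verse that "
    "immediately follows it, word for word."
)
```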

Test 4: Verse Block Recall

| Model | Pass |
| --- | --- |
| Llama 3.1 405B | ✅ |
| Llama 3.1 70B | ✅ |
| Llama 3.1 8B | ❌ |
| Llama 3.3 70B | ✅ |
| GPT 4o | ✅ |
| GPT 4o mini | ✅ |
| Gemini 1.5 Pro | ✅ |
| Gemini 1.5 Flash | ⚠️ |
| Gemini 2.0 Flash | ✅ |
| Claude 3.5 Haiku | ⚠️ |
| Claude 3.5 Sonnet | ✅ |

When asked to recall Lamentations chapter 3, verses 19 through 24, the models did very well. Only the smallest model, Llama 3.1 8B, outright failed, recalling the beginning of the chapter instead. The two partial marks were slight translation mismatches of a few words, but the essence of the passage was preserved.

Test 5: Query Based Lookup

| Model | Pass |
| --- | --- |
| Llama 3.1 405B | ✅ |
| Llama 3.1 70B | ✅ |
| Llama 3.1 8B | ✅ |
| Llama 3.3 70B | ✅ |
| GPT 4o | ✅ |
| GPT 4o mini | ✅ |
| Gemini 1.5 Pro | ✅ |
| Gemini 1.5 Flash | ✅ |
| Gemini 2.0 Flash | ✅ |
| Claude 3.5 Haiku | ✅ |
| Claude 3.5 Sonnet | ✅ |

When asked, "What's that verse in the bible about the Earth being filled with knowledge of God's glory?", every model successfully identified Habakkuk 2:14. Verse lookup is clearly a strong suit, even for smaller models.
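Scoring this test is a reference match rather than a text match. A tolerant check might accept both "Habakkuk 2:14" and "Habakkuk 2 verse 14" styles; an illustrative helper (not the repo's code):

```python
import re

def cites_reference(response: str, book: str, chapter: int, verse: int) -> bool:
    """True if the response cites the expected reference, in either
    'Habakkuk 2:14' or 'Habakkuk 2 verse 14' form."""
    pattern = rf"{re.escape(book)}\s+{chapter}\s*(?::|verse)\s*{verse}\b"
    return re.search(pattern, response, re.IGNORECASE) is not None

# e.g. cites_reference("That's Habakkuk 2:14.", "Habakkuk", 2, 14) -> True
```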

Test 6: Entire Chapter Recall

| Model | Pass |
| --- | --- |
| Llama 3.1 405B | ✅ |
| Llama 3.1 70B | ✅ |
| Llama 3.1 8B | ❌ |
| Llama 3.3 70B | ✅ |
| GPT 4o | ✅ |
| GPT 4o mini | ✅ |
| Gemini 1.5 Pro | ✅ |
| Gemini 1.5 Flash | ✅ |
| Gemini 2.0 Flash | ✅ |
| Claude 3.5 Haiku | ✅ |
| Claude 3.5 Sonnet | ✅ |

When asked for the entire contents of Romans 6 in the KJV, almost all of the models recalled all 23 verses accurately. Even the one failure, Llama 3.1 8B, recalled over 98% of the words correctly, getting only 9 words wrong.
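A figure like "98% of words correct" implies scoring at word granularity. One way to compute such a number is a word-level diff, reusing normalize from the methodology sketch (a sketch of the idea, not necessarily the exact metric behind the published figures):

```python
import difflib

def word_accuracy(response: str, expected: str) -> float:
    """Fraction of the expected words reproduced, in order, in the
    response, via a word-level matching-blocks diff."""
    expected_words = normalize(expected).split()
    response_words = normalize(response).split()
    matcher = difflib.SequenceMatcher(None, expected_words, response_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(expected_words)
```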

Conclusions

If you want to lean on an LLM for textually accurate Bible verses in popular translations, use higher-parameter-count (i.e. larger) models. These include Llama 3.1 405B, GPT 4o, and Claude 3.5 Sonnet, which all had perfect scores. Smaller models (the 7-8B range) will often mix up translations, and in some cases mangle or hallucinate verses altogether. Medium-sized models (the 70B range) usually preserve the intent of a verse, though the wording may be a blend of several translations, and sometimes lightly paraphrased.

You can certainly still use smaller models for discussion that references scripture by book, chapter, and verse, but in those cases it is important to check the wording against an actual copy of the Bible.

Looking to the future, smaller models may well improve on these benchmarks, but there is surely a limit to how much information can be encoded in so few parameters.

For full test results, including the raw prompts for each test, see the results file here. If you feel I missed a crucial test, feel free to submit an issue on GitHub.

Benjamin Kaiser

Software Engineer working on the SharePoint team at Microsoft. I'm passionate about open source, slurpees, and Jesus.
Gold Coast, Australia