“Hallucinations” by West & Lexis AI?

This post is a follow-up to “Hallucinations” by West’s CoCounsel? (Apr. 7, 2026). In U.S. v. Farris, __ F. 4th __, 2026 WL 915082, at *1 (6th Cir. Apr. 3, 2026) (per curiam), the court found errors in a brief prepared using Westlaw’s CoCounsel.  It appears that the tool was used after August 2025. Id. at *2.

A 2024 academic study found hallucinations by major AI products in the legal market. The study should be read with caution in 2026.  While it recognizes value in the AI products, it reports flaws in what appear to me to be older versions of the products.

After a preprint version was posted, it was “subsequently peer-reviewed and published in the Journal of Empirical Legal Studies in 2025….”  Westlaw AI and Lexis+ AI Still Hallucinate: What the Stanford Study Actually Found – LegalAIWorld (undated); see Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, Daniel E. Ho, Hallucination‐Free? Assessing the Reliability of Leading AI Legal Research Tools – Magesh – 2025 – Journal of Empirical Legal Studies – Wiley Online Library (published Apr. 23, 2025).

The scholarly article concludes by emphasizing both the value of, and the need to verify, A.I. output.  Verification, of course, is not only good advice, but also an ethical mandate.  It is worth reviewing the study.

SUMMARY OF THE SCHOLARLY STUDY

Six researchers from Stanford and Yale posted a preprint version of “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools.”[1]

The Abstract states:  “While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. We also document substantial differences between systems in responsiveness and accuracy.”

However, the article goes much further.  The authors state:

Commercially-available RAG-based[2] legal research tools still hallucinate. Over 1 in 6 of our queries caused Lexis+ AI and Ask Practical Law AI to respond with misleading or false information. And Westlaw hallucinated substantially more—one-third of its responses contained a hallucination.

On the positive side, these systems are less prone to hallucination than GPT-4, but users of these products must remain cautious about relying on their outputs.

The study’s authors provided specific examples in support of their conclusions.

EXAMPLES OF A.I. ERRORS FOUND BY THE RESEARCHERS

The study’s authors define “hallucination” as “a response that contains either incorrect information or a false assertion that a source supports a proposition,”[3] and “focus on factual hallucinations.”

Vendor Assertions

The article quotes vendor assertions as follows:

The following are official statements from Lexis, Casetext, and Thomson Reuters; however, none of them has provided any clear evidence so far to support their claims about the capabilities of their AI-based legal research tools:

Lexis: “Unlike other vendors, however, Lexis+ AI delivers 100% hallucination-free linked legal citations connected to source documents, grounding those responses in authoritative resources that can be relied upon with confidence.” (Wellen, 2024a) (emphasis added).

Casetext:[4] “Unlike even the most advanced LLMs, CoCounsel does not make up facts, or ‘hallucinate,’ because we’ve implemented controls to limit CoCounsel to answering from known, reliable data sources—such as our comprehensive, up-to-date database of case law, statutes, regulations, and codes—or not to answer at all.” (Casetext, 2023) (emphasis added).

Thomson Reuters: “We avoid [hallucinations] by relying on the trusted content within Westlaw and building in checks and balances that ensure our answers are grounded in good law.” (Thomson Reuters, 2023) (emphasis added). “We’ve all heard horror stories where generative AI just makes things up. That doesn’t work for the legal industry. They have to trust the content that AI serves up. With Ask Practical Law AI, all the responses are based on the expert resources of Practical Law.” (Thomson Reuters, 2024b) (emphasis added)

Examples Provided in the Article

The article states that the authors “document substantial variation in system performance. LexisNexis’s Lexis+ AI is the highest-performing system we test, answering 65% of our queries accurately. Westlaw’s AI-Assisted Research is accurate 42% of the time, but hallucinates nearly twice as often as the other legal tools we test. And Thomson Reuters’s Ask Practical Law AI provides incomplete answers (refusals or ungrounded responses; see Section 4.3) on more than 60% of our queries, the highest rate among the systems we tested.”

The authors add: “To offer one simple example…, the Westlaw system claims that a paragraph in the Federal Rules of Bankruptcy Procedure (FRBP) states that deadlines are jurisdictional. But no such paragraph exists, and the underlying claim is itself unlikely to be true in light of the Supreme Court’s holding in Kontrick v. Ryan, 540 U.S. 443, 447-48 & 448 n.3 (2004), which held that FRBP deadlines under a related provision were not jurisdictional.”

They describe one 2024 study in which Lexis+ AI reportedly cited a case decided in 2025.  They add:

For example, consider this Lexis+ AI exchange:

Prompt: What are some notable opinions written by Judge Luther A. Wilgarten?

Lexis+ AI Response: One of the notable opinions written by Judge Luther A. Wilgarten is the case of Luther v. Locke. The case citation is 2010 U.S. Dist. LEXIS 144868 and it was decided in the Eastern District of Virginia on January 11, 2010. In this case, the plaintiff had filed a Notice of Appeal of Judge Ellis’s decision, but failed to properly prosecute the appeal. […]

The article states:

While the retrieved citation offered is a real case and hence “hallucination-free” in a narrow sense, it was not written by Judge Wilgarten, a fictional judge who never served on the bench (Miner, 1989). And while the generated passages are based on the actual case, the second sentence contradicts the premise, suggesting Judge Ellis wrote the opinion, but the opinion was actually written by Judge Brinkema (and involved a prior decision by Judge Ellis, which forms the basis for the RAG response). Nor is the decision notable, as it was an unpublished opinion cited only once outside of its direct history. Hallucinations are compounded by poor retrieval and erroneous generation.

The article reports that “Westlaw asserts that a U.S. Supreme Court case was reversed by the Nebraska Supreme Court on a matter of federal law. That is not possible in the U.S. legal system, and in fact the Nebraska Supreme Court did not so much as cite the Supreme Court case in question….”

The article continues:  “Lexis+ AI describes a rule established in Arturo D. as good law, with citation to the case that actually overrules Arturo D.”

It adds: “[W]hen asked to define the ‘moral wrong doctrine,’ a doctrine pertaining to mistake-of-fact instructions in criminal prosecutions for morally wrongful acts…, Lexis+ AI relies on a source which defines moral turpitude, a legal term of art with a seemingly similar but actually unrelated meaning.”

SOME LIMITATIONS OF THE STUDY AND THIS BLOG POST

The paper was updated on May 30, 2024, which may be light-years ago in the world of artificial intelligence.  The tools studied were LexisNexis’s Lexis+ AI, Thomson Reuters’s Ask Practical Law AI, and Westlaw’s AI-Assisted Research.

There is no mention of West’s CoCounsel (released in August 2025) or Lexis’ Protégé (released in January 2025).  The authors expressly note that the programs they studied are “emerging systems.”  They point out that “our evaluation only captures a point in time. Even over the course of our study, we noticed the responses of these systems—particularly Lexis+ AI—evolve over time.”  And, the authors wrote:

Since the completion of our evaluation for this paper in April 2024, LexisNexis has released a “second generation” version of its tool. Our results do not speak to the performance of this second generation product, if different. Accompanying this release, LexisNexis noted, “our promise is not perfection, but that all linked legal citations are hallucination-free” (LexisNexis, 2024).

The study candidly notes: “[O]ur primary goal is limited to assessing the hallucination rate, accuracy, and groundedness on emerging legal technology. These are central concepts to the trustworthiness of AI tools, but they are not the sole criteria for the quality and value of a legal research system. For instance, notwithstanding the many hidden hallucinations, the overall output of Lexis+ AI and AI-AR may still be quite valuable for distinct use cases (e.g., starting on a research thread).” (emphasis added).

The authors describe the study as “the first systematic assessment of leading AI tools for real-world legal research tasks.”  This blog post is a lay reader’s summary of a complex scientific study.  The study “manually construct[ed] a preregistered dataset of over 200 legal queries….” The technological analysis is far beyond my capabilities, and the authors made their dataset, tool outputs, and labels available to others more qualified than I.  As such, I make no effort to delve into the methodology, reliability, or validity of the study.

And, in this post, I do not address the stated limitations of the article.[5]  For example, and without limitation, the authors state that they do not use the “gold-standard” on some issues. See n. 8. However, statements of method such as “[w]ith this protocol, we find a Cohen’s kappa (Cohen, 1960) of 0.77 and an inter-rater agreement of 85.4% on the final outcome label (correct, incomplete, or hallucinated) between the evaluation labeler and the initial labels,” are far beyond my skill set.
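
For readers curious about what the kappa statistic in that passage measures, the general idea can be sketched in a few lines of Python. This is only an illustration of the textbook formula, not the study's protocol: the function name is hypothetical, and the labels are invented for this post and are not the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability that both raters pick the same label
    # independently, based on each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented labels for ten hypothetical responses (NOT the study's data).
first = ["correct", "correct", "hallucinated", "incomplete", "correct",
         "hallucinated", "correct", "incomplete", "correct", "hallucinated"]
second = ["correct", "correct", "hallucinated", "incomplete", "correct",
          "correct", "correct", "incomplete", "correct", "hallucinated"]

kappa = cohens_kappa(first, second)
print(f"kappa = {kappa:.2f}")
```

With these made-up labels, the two raters agree on 9 of 10 items (90%), yet kappa comes out at about 0.83, because some of that agreement would be expected by chance alone. The study's reported figures show the same pattern: a kappa of 0.77 alongside raw agreement of 85.4%.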

THE ARTICLE CONCLUDES THAT THESE A.I. TOOLS PROVIDE VALUE

Importantly, the researchers state: “[E]ven in their current form, these products can offer considerable value to legal researchers compared to traditional keyword search methods or general-purpose AI systems, particularly when used as the first step of legal research rather than the last word.”

The paper concludes: “AI tools for legal research have not eliminated hallucinations. Users of these tools must continue to verify that key propositions are accurately supported by citations.”  That is indisputable.

THE STUDY’S RESULTS HAVE BEEN QUESTIONED

“The study was not received quietly.”  Westlaw AI and Lexis+ AI Still Hallucinate: What the Stanford Study Actually Found – LegalAIWorld.  Both Thomson Reuters and Lexis disputed the findings.

For example: “Thomson Reuters said that their internal testing showed a lower hallucination rate compared to the study, and welcomed the opportunity to work with Stanford to explore creating AI benchmarks.” Isha Marathe, Updated Stanford Report Finds High Hallucination Rates on Westlaw AI | Law.com (Jun. 4, 2024) (behind paywall).

One article states that:

Both companies’ objections deserve to be taken seriously. This is not a clean, uncontested piece of research. The methodology was imperfect in its initial form, the access restrictions created real limitations, and the vendors’ own systems have been updated since the study was conducted. Any fair account of this study has to include those caveats.

Westlaw AI and Lexis+ AI Still Hallucinate: What the Stanford Study Actually Found – LegalAIWorld

SOME OF THE QUESTIONS HAVE BEEN QUESTIONED

However, the Legal AI World blog adds:

And the vendors’ response to that finding was not to publish their own independent benchmarks proving otherwise. It was to dispute the methodology and point to internal data they have not made public.

In the absence of transparent, third-party benchmarking — which the Stanford researchers explicitly called for — lawyers are being asked to trust marketing claims that have not been independently verified. That is a professional responsibility problem, not just a product quality question.

Id.  “As the Stanford researchers argued, what the legal profession needs is public benchmarking of these tools — conducted independently, using preregistered methodology, updated regularly as the products improve.”  Id.  Legal AI World concludes:

The hallucination problem in legal AI has not been solved. Not by LexisNexis. Not by Thomson Reuters. Not by any legal AI product currently on the market. The research is unambiguous on this point, and every lawyer using these tools needs to proceed accordingly.

Id.

POSTSCRIPT

So, in addition to the passage of time and my lack of scientific or technological qualifications to review the methodology, my major caveat is that this post merely skims the surface of the manuscript.  The study has garnered a lot of attention. E.g., Bob Ambrogi, In Redo of Its Study, Stanford Finds Westlaw’s AI Hallucinates At Double the Rate of LexisNexis | LawSites (Jun. 2, 2024); AI Legal Tools caught hallucinating Again – Stanford Study; Westlaw AI and Lexis+ AI Still Hallucinate: What the Stanford Study Actually Found – LegalAIWorld.  It is an important piece of scholarship.

There are so many “hallucination” cases that they cannot reasonably be listed here.  In Farris, the Sixth Circuit wrote:

New technologies, moreover, are no substitute for tried-and-true safeguards managed by practicing attorneys. Attorneys have an ethical obligation to verify the citations and propositions they submit to courts; that obligation reflects duties of competence and candor that apply no matter the tools attorneys use. So, attorneys who rely on artificial intelligence must remain diligent in supervising their work product and carefully examine the accuracy of every citation they present to this Court. Here, Howe’s reliance on “staff”—rather than himself or another attorney—to supervise the artificial-intelligence-generated work product fell short of his obligations as attorney of record. See Model Rules of Pro. Conduct. r. 5.3 (A.B.A. 2012); Ky. Sup. Ct. R. 3.130(5.3) (2022).

That Howe’s briefs cited real legal authorities—as opposed to “hallucinations” featuring fictitious cases—does not absolve him. See Sanders v. United States, 176 Fed. Cl. 163, 169 & n.8 (2025) (collecting cases with invented authorities). Howe’s failure to verify the artificial-intelligence output still resulted in the submission of false quotations and misleading legal arguments to this Court. Again, attorneys’ professional duties demand more.

Id. at *3.  While the academic article does not appear to be applicable to current West, Lexis, and other models, and even though I cannot evaluate or comment on the study’s methodology, the paper is valuable for demonstrating the need to validate all AI output in the context of RAG tools for the legal industry.

I will post any reasonable response or comment from West, Thomson Reuters, Lexis, or any person or entity cited in this blog.

_____

[1] See also AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More) Benchmarking Queries | Stanford HAI (May 23, 2024).

[2] “RAG” stands for retrieval-augmented generation.  RAG tools aim to reduce hallucination by retrieving documents from a defined body of sources and grounding the generated answer in them.

[3] But see §4.3 of the paper for additional precision.

[4] Thomson Reuters Completes Acquisition of Casetext, Inc. – Thomson Reuters Institute (Aug. 17, 2023).

[5] See §7.

Editor’s Note: This article is republished with permission of the author, with first publication on his blog, E-Discovery LLC.
