11 June 2024

Legal Generative AI Needs Improvement

By George Morell


“Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools” is a recent and controversial Stanford University study that analyzed legal research tools with generative artificial intelligence (AI) features (from LexisNexis and Thomson Reuters, among others) and concluded that the tools leave considerable room for improvement, despite what is advertised.

After the study was extended, the tools compared were:

– Lexis+ AI

– Westlaw AI-Assisted Research

– Ask Practical Law AI

GPT-4 was also included, to see how the specialized tools compared against a general-purpose model.

Bear in mind that ChatGPT is estimated to hallucinate between 58% and 82% of the time on legal questions, a serious problem if it is used without supervision. All the more so when recent studies indicate that between 15% and 35% of lawyers in the Anglo-Saxon legal sector are already using generative AI, some of them weekly.

The specialized tools now appearing in the legal sector claim to prevent 100% of hallucinations by using “retrieval-augmented generation”, or RAG, a technique sold as the great solution for applying generative AI to specific fields of knowledge.

The “trick” behind RAG consists of inserting two intermediate steps between the prompt (the instructions given) and the result: retrieval on one side, generation on the other. The key peculiarity is that retrieval draws on domain-specific documents, not the generative AI’s general training dataset.

That is, the prompt (for example, “what is the ruling that created the right to be forgotten in Europe?”) is first used to search Westlaw for documents relevant to the question (as in a normal search).

Then the prompt plus those documents are sent to the LLM to produce the result (the second phase, “generation”), grounded not in its nebulous, generic training data but in a corpus that is in theory far more suitable, thanks to the documents retrieved in the first phase. It is like feeding the system relevant, subject-specific information before it generates the response.

That is why it is said that RAG should largely eliminate hallucinations.
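The two phases described above can be sketched in a few lines of Python. This is a toy illustration, not any vendor’s actual pipeline: retrieval here is naive word-overlap scoring over a tiny in-memory corpus, and the final LLM call is omitted (the function names and the example corpus are invented for the sketch).

```python
def retrieve(prompt, corpus, top_k=2):
    """Phase 1, "retrieval": score each document by word overlap with the
    prompt and return the most relevant ones. Real tools would query a
    proper search engine (e.g. Westlaw's index) instead."""
    prompt_words = set(prompt.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(prompt_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_augmented_prompt(prompt, documents):
    """Phase 2, "generation": the retrieved documents are prepended to the
    user's question before it is sent to the LLM, so the model answers
    grounded in them rather than only in its training data."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {prompt}"

# Hypothetical mini-corpus standing in for a legal database.
corpus = [
    "Google Spain v AEPD (C-131/12) established the right to be forgotten in Europe.",
    "Roe v Wade concerned abortion rights in the United States.",
    "The GDPR codified the right to erasure in Article 17.",
]
prompt = "What ruling generated the right to be forgotten in Europe?"
docs = retrieve(prompt, corpus)
augmented = build_augmented_prompt(prompt, docs)
# `augmented` would now be sent to the LLM in place of the bare prompt.
```

The point of the sketch is the shape of the pipeline: the model never sees the bare question alone, only the question packaged with whatever the retrieval step happened to find, which is also why poor retrieval still produces hallucinations.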

The Stanford study finds that RAG improves on the results of tools like GPT-4, but that hallucination is not “100% eliminated” as advertised; in fact, it is sometimes considerable.

In that sense, it is important to note that the study defines hallucinations not only as false answers but also as answers that falsely claim a source supports a claim. It also counts incomplete responses: those that are refusals or ungrounded.

For example, when testing the different tools, one of the questions asked what some of the most notable opinions of Judge Luther A. Wilgarten were. The Lexis+ AI tool responded by citing a case from 2010, where it was decided, and what happened on appeal.

The problem is that, although the cited case is real, it was not written by Judge Luther A. Wilgarten, who does not actually exist and was invented for the test. Furthermore, the response contradicted itself by incorrectly citing another judge. And as if that were not enough, it failed the premise of the question, since that opinion was not considered one of the notable ones by Judge Brinkema, who had actually written it.

In short, the answer was a compendium of hallucinations and errors.

That said, which tool gave the highest percentage of correct answers according to the study?

– Lexis+ AI -> 65% correct

– GPT-4 -> 49% correct

– Westlaw AI-Assisted Research -> 42% correct

– Ask Practical Law AI -> 20% correct

The study analyzes the results in much more detail, but draws conclusions that are good to keep in mind:

– RAG makes the legal generative AIs fail less in general than GPT-4, yet GPT-4 still ranks second overall.

– The tools tend to err more on questions involving time, jurisdiction and, above all, false premises: questions that embed a misunderstanding of the law on the part of the person asking.

– The longer the answer a tool offers, the more errors it tends to contain.

– The fewer and lower-quality the documents a tool has in the “retrieval” phase, the more errors the legal generative AI tool produces.

– The answers still frequently contain basic failures of legal understanding: misidentifying the parties or the hierarchy of judicial bodies, for example.

In conclusion, it is clear that these tools are a good first step and that the technique used to reduce hallucinations (RAG) helps, but they still require as much or more supervision than plain ChatGPT, and they are certainly not error-free as advertised, which is undoubtedly problematic.

It seems that the Skynet lawyer will still have to wait.