Technology

A legal AI chatbot, a conscious architecture

AI solutions that help organizations work more efficiently are here to stay. But how can we use AI in a transparent and up-to-date way within a legal text publisher? The goal: to support users in their daily work with an intelligent, context-aware chatbot. By unlocking complex information in an understandable way, questions that would normally require hours of research can be answered within seconds.

Smart access to legal knowledge with RAG

Because this project is about legal information, explainability is essential. Users must be able to trust the answers and understand what those answers are based on. In addition, laws and regulations change constantly, so it is crucial that the chatbot always works with the most up-to-date information.

Based on these requirements, Retrieval Augmented Generation (RAG) was the obvious choice. This approach combines the power of generative AI with a dynamic database. Where a traditional Large Language Model (LLM) relies solely on its training data, a RAG model retrieves its knowledge in real time from a reliable and up-to-date data source.

Lemon does not position itself as an innovation agency, but as an organization that focuses on applying existing, extensively tested technologies. In this context, LangChain is the most suitable solution for RAG-oriented chatbots, and the project also contributes to the broader standardization of LLM applications. Within LangChain, a RAG chatbot is defined as an AI chatbot with an integrated Retrieval Tool.
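
To make this concrete, below is a minimal sketch in Python. For brevity it wires the retriever directly into a chain rather than exposing it as an agent tool, and the model choices, toy fragments and in-memory store are purely illustrative; the actual project stores its data in MongoDB Atlas, as described further on.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Toy fragments standing in for the legal source texts (hypothetical content).
fragments = [
    "Article 7:658 obliges the employer to provide a safe workplace.",
    "Article 7:610 defines the employment contract.",
]

store = InMemoryVectorStore.from_texts(fragments, OpenAIEmbeddings(model="text-embedding-3-small"))
retriever = store.as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the context below and cite it.\n\n{context}"),
    ("human", "{question}"),
])

def format_docs(docs) -> str:
    return "\n\n".join(doc.page_content for doc in docs)

# Retrieval feeds the relevant fragments into the prompt; generation produces the answer.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

print(chain.invoke("What does the law say about workplace safety?"))
```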

The user question is asked in natural language, while the data source requires a strictly structured question.

Such a system consists of two components that can be optimized: retrieval and generation. The first challenge lies in the retrieval phase: the user question is asked in natural language, while the data source requires a strictly structured query. This discrepancy can be bridged by means of vector search.

Vector Search

Vector search shifts the central question from “How do I turn a natural language question into a structured query?” to “When are two fragments of text similar?” By making this similarity quantitative, each source text can be ranked by its relevance to the question.

Classic methods such as ROUGE and BLEU scores exist, but this project focuses on text embeddings. A text embedder converts text into numerical vectors. This allows us to calculate the distance between two vectors and to normalize that score, so that any vector can be compared with any other. The effectiveness of a text embedder lies in its training: semantically equivalent texts are mapped to vectors that lie close together.
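
As an illustration, the sketch below embeds a question and two fragments and ranks them by cosine similarity. The embedding model and the example texts are assumptions for the example, not the project's actual configuration.

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings  # embedder choice is illustrative

embedder = OpenAIEmbeddings(model="text-embedding-3-small")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "Is the employer liable for a workplace accident?"
fragment_1 = "The employer is liable for damage suffered by the employee at work."
fragment_2 = "The purchase agreement must be signed by both parties."

q = embedder.embed_query(question)
# Scores closer to 1 indicate semantic similarity; fragment_1 should rank above fragment_2.
for text in (fragment_1, fragment_2):
    print(round(cosine_similarity(q, embedder.embed_query(text)), 3), text)
```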

The database technology used for this purpose is MongoDB Atlas, because this storage provider offers built-in support for vector search indices. Of course, this search method has several parameters that can be optimized.
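
A sketch of how such a setup might look with LangChain's MongoDB integration. The connection string, index name and field names are placeholders, and the index definition in the comment follows the generic Atlas Vector Search format rather than the project's actual configuration.

```python
from pymongo import MongoClient
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings

# Connection string, database and collection names are placeholders.
client = MongoClient("mongodb+srv://<user>:<password>@cluster.example.mongodb.net")
collection = client["legal"]["fragments"]

# Assumes an Atlas Vector Search index named "fragment_index" on the "embedding" field,
# e.g. {"fields": [{"type": "vector", "path": "embedding",
#                   "numDimensions": 1536, "similarity": "cosine"}]}
store = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    index_name="fragment_index",
    text_key="text",
    embedding_key="embedding",
)

# Retrieve the k most similar fragments for a user question.
for doc in store.similarity_search("notice period for employment contracts", k=5):
    print(doc.page_content[:80])
```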

When can an AI chatbot be used?

This brings us to the next challenge: when is the system good enough? Here we use the classification terms recall and precision. Their definitions remain inherently the same in each subsystem, but their meaning is context-dependent:

  • Recall: Of all actual positives, how many were marked as positive?
  • Precision: Of all elements marked as positive, how many are actual positives?

For the retrieval phase, recall and precision were defined as follows:

  • Recall → context recall: Have all the essential text fragments been retrieved?
  • Precision → context precision: How much irrelevant information was included?

Since this project works with exact text, it can be checked for each character whether a retrieved fragment is relevant. To achieve this, the exact source text is needed as a reference in our testbench dataset.
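
A simplified sketch of this idea: a retrieved character counts as relevant when its fragment occurs verbatim in the reference source text. The helper function and the example texts are illustrative and not the project's actual test bench.

```python
def context_metrics(reference: str, retrieved_fragments: list[str]) -> tuple[float, float]:
    """Character-level context recall and precision (simplified sketch).

    A retrieved character is relevant when its fragment occurs verbatim in the
    reference source text, which is possible because the project works with exact text.
    """
    relevant_chars = sum(len(f) for f in retrieved_fragments if f in reference)
    retrieved_chars = sum(len(f) for f in retrieved_fragments)

    recall = relevant_chars / len(reference) if reference else 0.0            # how much of the reference was found
    precision = relevant_chars / retrieved_chars if retrieved_chars else 0.0  # how much of the retrieval was relevant
    return recall, precision


reference = "The employer must compensate the damage the employee suffers at work."
fragments = [
    "The employer must compensate the damage",     # relevant: appears in the reference
    "A purchase agreement requires two parties",   # irrelevant
]
print(context_metrics(reference, fragments))
```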

By retrieving small fragments and applying a technique we call neighborhood retrieval (also retrieving the fragments surrounding the identified piece), a recall of 80% was achieved with a precision of 40%. We consciously optimized for recall, because internal analyses showed that the chatbot itself is capable of filtering out irrelevant information.
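
The idea behind neighborhood retrieval can be sketched in a few lines: once vector search has identified a fragment, the surrounding fragments of the same document are added to the context. The function and example below are illustrative.

```python
def neighborhood_retrieval(fragments: list[str], hit_index: int, window: int = 1) -> list[str]:
    """Return the matched fragment together with its neighbours in the source text.

    `fragments` are the ordered chunks of one document, `hit_index` is the chunk
    found by vector search, and `window` controls how many chunks on each side are added.
    """
    start = max(0, hit_index - window)
    end = min(len(fragments), hit_index + window + 1)
    return fragments[start:end]


chunks = ["§1 Scope", "§2 Definitions", "§3 Liability of the employer", "§4 Exceptions"]
print(neighborhood_retrieval(chunks, hit_index=2, window=1))
# ['§2 Definitions', '§3 Liability of the employer', '§4 Exceptions']
```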

For the generation phase, the metrics were defined as follows:

  • Recall → Answer correctness: How many statements from the example answer appear in the generated response?
  • Precision → Answer faithfulness: How many statements from the generated response match the retrieved context?

Comparing these generated texts requires a more sophisticated approach. Here, texts are split at the sentence level. A vector is then generated for each statement. When the vectors of two statements are close enough together, they are considered semantically the same.
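
A sketch of this comparison: both texts are split into sentences, each sentence is embedded, and a statement counts as matched when its closest counterpart is similar enough. The embedder, the threshold, the naive sentence splitter and the example texts are assumptions for illustration.

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings  # embedder choice is illustrative

embedder = OpenAIEmbeddings(model="text-embedding-3-small")
THRESHOLD = 0.85  # illustrative cut-off for "semantically the same"

def split_statements(text: str) -> list[str]:
    # Naive sentence splitter, sufficient for the sketch.
    return [s.strip() for s in text.split(".") if s.strip()]

def matched_fraction(statements: str, reference: str) -> float:
    """Fraction of statements whose closest reference sentence is similar enough."""
    stmt_vecs = [np.asarray(embedder.embed_query(s)) for s in split_statements(statements)]
    ref_vecs = [np.asarray(embedder.embed_query(s)) for s in split_statements(reference)]
    hits = 0
    for sv in stmt_vecs:
        sims = [float(sv @ rv / (np.linalg.norm(sv) * np.linalg.norm(rv))) for rv in ref_vecs]
        hits += max(sims, default=0.0) >= THRESHOLD
    return hits / len(stmt_vecs) if stmt_vecs else 0.0


example_answer = "The employer is liable. The employee must report the accident."
generated_response = "The employer is liable for the accident."
retrieved_context = "Article 7:658: the employer is liable for workplace accidents."

# Answer correctness: statements from the example answer found in the generated response.
print("correctness:", matched_fraction(example_answer, generated_response))
# Answer faithfulness: statements from the generated response found in the retrieved context.
print("faithfulness:", matched_fraction(generated_response, retrieved_context))
```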

Testing showed that the basic chatbot is naturally very faithful: 95%. Only minimal prompt engineering was needed to achieve this. In addition, the correctness of an answer is strongly linked to the performance of the retrieval system. In examples from the dataset where retrieval scored excellently, the chatbot also performed well on correctness. This underlines that a good retrieval system does 80% of the work.

Follow-up improvements

The engineering methodology requires that all decisions are data-driven. To monitor and compare our experiments, we chose LangSmith, a hosted service from the LangChain team that covers three aspects:

1. Evaluation
2. Prompt Engineering
3. Observability

Initially, we used LangSmith for its extensive evaluation options, with clear graphs and simple side-by-side comparisons. The Prompt Playground also proved very useful for quickly testing various prompts and analyzing the impact of chatbot parameters before conducting large-scale experiments. Both functions contributed significantly to improving response quality.
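
For reference, an evaluation run in the LangSmith Python SDK roughly takes the shape below. The dataset name, the target function and the toy evaluator are hypothetical placeholders, not the project's actual test bench.

```python
from langsmith.evaluation import evaluate

def target(inputs: dict) -> dict:
    # Placeholder: in the real system this would invoke the RAG chatbot.
    return {"answer": f"(answer to: {inputs['question']})"}

def answer_not_empty(run, example) -> dict:
    # Toy evaluator; real evaluators would score correctness and faithfulness.
    return {"key": "answer_not_empty", "score": int(bool(run.outputs.get("answer")))}

# "legal-qa-testset" is a hypothetical LangSmith dataset name.
results = evaluate(
    target,
    data="legal-qa-testset",
    evaluators=[answer_not_empty],
    experiment_prefix="neighborhood-retrieval",
)
```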

In turn, the observability platform provides valuable insight into system performance, so bottlenecks are easily identified and easy wins are realized. The platform can also serve as a handy debugging tool, so that *console.log* statements are needed far less in the code. However, for reasons of cost efficiency, it was decided not to use this in the production environment.
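
Turning tracing on is mostly configuration; the sketch below shows the usual environment variables for the Python SDK. The API key and project name are placeholders.

```python
import os

# Enabling LangSmith tracing for a LangChain application: every chain and LLM call
# is then logged as a trace that can be inspected in the LangSmith UI, which
# replaces much of the ad-hoc console.log / print debugging.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"   # placeholder
os.environ["LANGCHAIN_PROJECT"] = "legal-rag-chatbot"          # project name is illustrative

# Any chain invoked after this point is traced automatically; no code changes needed.
```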

After the experimental optimization, carried out on a small but representative subset of the entire data source, the indexing of the full dataset followed: download, split, embed and save. The last step proved to be a significant problem.
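
The split-embed-save part of that pipeline could look roughly like this; the chunk sizes, field names and connection string are illustrative assumptions.

```python
from pymongo import MongoClient
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholders: connection string, collection names and chunk sizes are illustrative.
collection = MongoClient("mongodb+srv://<user>:<password>@cluster.example.mongodb.net")["legal"]["fragments"]
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
embedder = OpenAIEmbeddings(model="text-embedding-3-small")

def index_document(doc_id: str, text: str) -> None:
    """Split one downloaded source text, embed each fragment and save it."""
    chunks = splitter.split_text(text)
    vectors = embedder.embed_documents(chunks)  # one vector per fragment
    collection.insert_many([
        {"doc_id": doc_id, "chunk_index": i, "text": chunk, "embedding": vector}
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ])
```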

Optimise what's essential and make the rest possible.

The data source contained an enormous amount of information: at least 10,000 books of 200-300 pages, which amounts to approximately 500 million vectors. Storing all of these would be far too expensive, and this concerned only one of the three source types. To keep the project scalable, vector search was retained for the central source types, while this large source was switched to keyword search.

The implementation of MongoDB Atlas' Lucene search index resulted in a solid and cost-efficient solution. In addition, this implementation makes it possible to compare vector search with classic keyword search. Experimental data show that keyword search has a significantly higher precision, but unfortunately at the expense of recall.
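
A keyword query against such a Lucene index uses MongoDB's $search aggregation stage; the connection string, index name, field path and query below are placeholders.

```python
from pymongo import MongoClient

# Connection string, index name and field path are placeholders.
collection = MongoClient("mongodb+srv://<user>:<password>@cluster.example.mongodb.net")["legal"]["fragments"]

pipeline = [
    {"$search": {
        "index": "fragment_text_index",
        "text": {"query": "notice period employment contract", "path": "text"},
    }},
    {"$limit": 5},
    {"$project": {"text": 1, "score": {"$meta": "searchScore"}}},
]
for doc in collection.aggregate(pipeline):
    print(round(doc["score"], 2), doc["text"][:80])
```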

Smart, efficient, transparent and up to date

The result: a chatbot that makes legal knowledge accessible, communicates correctly, and is designed with the care that befits the legal world. Thanks to the optimization process, the system remains efficient despite the enormous size of the information sources involved. In other words, RAG is a powerful building block for developing explainable AI solutions in knowledge-intensive sectors.

With each new implementation, we refine our approach and help organizations make complex information accessible in a way that really works.



Do you also need such a user-friendly solution?

We build digital solutions that really work. User-friendly, efficient and completely tailored to your needs. Ready to grow together?