
A legal AI chatbot, a deliberate architecture

AI solutions that help organizations work more efficiently are here to stay. But how can we implement AI in a transparent and up-to-date way at a legal text publisher? The goal: to support users in their daily work with the help of an intelligent, context-aware chatbot. By making complex information more accessible, it becomes possible to get answers in seconds to questions that would normally require hours of research.

Smart access to legal knowledge with RAG

Because this project concerns legal information, explainability is essential. Users must be able to trust the answers and understand their basis. In addition, laws and regulations are constantly changing. It is therefore crucial that the chatbot always works with the most up-to-date information.

Based on these requirements, Retrieval Augmented Generation (RAG) was the solution. This approach combines the power of generative AI with a dynamic database. Where a traditional Large Language Model (LLM) relies solely on pre-trained data, a RAG model retrieves its knowledge in real time from a reliable and up-to-date data source.

Lemon does not position itself as an innovation agency, but as an organization that focuses on applying existing and extensively tested technologies. In this context, LangChain is the most suitable solution when it comes to RAG-oriented chatbots. This project also contributes to the standardization of LLM applications in the wider field. Within LangChain, a RAG chatbot is defined as an AI chatbot with an integrated retrieval tool.
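
As an illustration, a minimal sketch of that pattern in LangChain's JavaScript/TypeScript packages might look as follows. The model name, prompt wording and in-memory vector store are placeholders, and exact import paths differ between LangChain versions.

```typescript
// Sketch only: model, prompt and documents are placeholders.
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

// An in-memory store stands in for the real legal-text database.
const vectorStore = await MemoryVectorStore.fromTexts(
  ["Article 1: ...", "Article 2: ..."],                 // hypothetical fragments
  [{ source: "statute-a" }, { source: "statute-a" }],
  new OpenAIEmbeddings()
);

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "Answer using only the retrieved context:\n\n{context}"],
  ["human", "{input}"],
]);

// Generation component: stuffs the retrieved fragments into the prompt.
const combineDocsChain = await createStuffDocumentsChain({
  llm: new ChatOpenAI({ model: "gpt-4o-mini" }),        // placeholder model
  prompt,
});

// Retrieval component: the vector store exposed as the chatbot's retrieval tool.
const ragChain = await createRetrievalChain({
  retriever: vectorStore.asRetriever({ k: 4 }),
  combineDocsChain,
});

const result = await ragChain.invoke({ input: "What does Article 1 regulate?" });
console.log(result.answer);
```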


The system consists of two optimizable components: retrieval and generation. The first challenge lies in the retrieval phase. The question is asked in natural language, whereas the data source requires a strictly structured query. This discrepancy is bridged using vector search.

Vector Search

Vector search shifts the central question “How do I formulate a natural language query?” to “When are two fragments of text similar?” By quantifying this similarity, each source text can be ranked by relevance to the question.

Classic lexical metrics such as ROUGE and BLEU scores exist, but this project focuses on text embedding. A text embedder converts text into numerical vectors, so the distance between two vectors can be calculated and normalized into a comparable similarity score. The effectiveness of the text embedder lies in its ability to generate vectors that lie close together for semantically equivalent texts.
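
The comparison itself usually boils down to cosine similarity between the two vectors. A minimal sketch, with hard-coded vectors standing in for real embedding output:

```typescript
// Sketch: the vectors below stand in for the output of a text-embedding model.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // 1 = same direction (semantically close), 0 = unrelated, -1 = opposite.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const questionVector = [0.12, -0.03, 0.88];   // embedding of the user question
const fragmentVector = [0.1, -0.01, 0.91];    // embedding of a source fragment
console.log(cosineSimilarity(questionVector, fragmentVector)); // close to 1 → likely relevant
```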

The database technology used for this purpose is MongoDB Atlas, because this storage provider offers native support for vector search indexes. This search method offers several parameters for fine-tuning and optimization.
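
As a sketch of how such a store could be wired up with LangChain's MongoDB integration: the database, collection and index names below are hypothetical, and the Atlas vector search index itself has to be created separately with a dimension matching the embedding model.

```typescript
// Sketch: database, collection and index names are hypothetical.
import { MongoClient } from "mongodb";
import { OpenAIEmbeddings } from "@langchain/openai";
import { MongoDBAtlasVectorSearch } from "@langchain/mongodb";

const client = new MongoClient(process.env.MONGODB_ATLAS_URI ?? "");
const collection = client.db("legal").collection("fragments");

const vectorStore = new MongoDBAtlasVectorSearch(new OpenAIEmbeddings(), {
  collection,
  indexName: "vector_index",   // Atlas vector search index on this collection
  textKey: "text",             // field that holds the fragment text
  embeddingKey: "embedding",   // field that holds the vector
});

// Tunable retrieval parameter: how many fragments to return per question.
const retriever = vectorStore.asRetriever({ k: 10 });
```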

When can an AI chatbot be used?

This brings us to the next challenge: “When is the system good enough?” Here we make use of the metrics recall and precision. While their fundamental definitions remain the same in each subsystem, their application is context-dependent:

  • Recall: Of all actual positives, how many elements are marked as positive?
  • Precision: Of all elements marked as positive, how many are actual positives?

For the retrieval phase, recall and precision were defined as follows:

  • Recall → context recall: Have all the essential text fragments been retrieved?
  • Precision → context precision: How much irrelevant information was included?

Because this project works with exact source text, relevance can be verified at the character level: for every character of a retrieved fragment, we can check whether it belongs to the relevant source passage. To achieve this, the exact source text is required as a reference in our test bench dataset.
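
One simple way to implement this character-level scoring, assuming the test bench provides the exact reference passage per question, is sketched below; the overlap test here is deliberately naive.

```typescript
// Sketch: naive character-level context recall and precision against one reference passage.
function overlapLength(fragment: string, reference: string): number {
  // Characters of the fragment that literally occur inside the reference passage.
  return reference.includes(fragment) ? fragment.length : 0;
}

function contextScores(retrievedFragments: string[], reference: string) {
  const relevantChars = retrievedFragments.reduce(
    (sum, fragment) => sum + overlapLength(fragment, reference),
    0
  );
  const retrievedChars = retrievedFragments.reduce((sum, f) => sum + f.length, 0);
  return {
    contextRecall: Math.min(1, relevantChars / reference.length), // did we find everything?
    contextPrecision: relevantChars / retrievedChars,             // how much noise came along?
  };
}
```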

By retrieving small fragments and applying a technique we called "neighborhood retrieval" (also retrieving the fragments surrounding an identified piece), we reached a recall of 80% with a precision of 40%. We deliberately optimized for recall, as internal analysis showed the chatbot is capable of filtering out irrelevant information.
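
A sketch of the neighborhood idea, under the assumption that fragments are stored with a document id and a sequential position so that neighbors can be looked up cheaply:

```typescript
// Sketch: expand each vector-search hit with its surrounding fragments.
interface Fragment {
  docId: string;     // which source document the fragment belongs to
  position: number;  // sequential position of the fragment within that document
  text: string;
}

function withNeighborhood(
  hits: Fragment[],                        // fragments returned by vector search
  fragmentsByDoc: Map<string, Fragment[]>, // all fragments, grouped per document
  radius = 1                               // how many neighbors to add on each side
): Fragment[] {
  const expanded = new Map<string, Fragment>();
  for (const hit of hits) {
    for (const fragment of fragmentsByDoc.get(hit.docId) ?? []) {
      if (Math.abs(fragment.position - hit.position) <= radius) {
        expanded.set(`${fragment.docId}:${fragment.position}`, fragment);
      }
    }
  }
  // Return the expanded set in reading order.
  return [...expanded.values()].sort(
    (a, b) => a.docId.localeCompare(b.docId) || a.position - b.position
  );
}
```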

For the generation phase, the metrics were defined as follows:

  • Recall → answer correctness: How many statements from the example answer appear in the generated response?
  • Precision → answer faithfulness: How many statements from the generated response match the retrieved context?

Comparing these generated texts requires a more sophisticated approach. Here, texts are decomposed at the sentence level. A vector is generated for each statement. When the vectors of two statements are close, they are considered semantically identical.
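
A sketch of that statement-level matching, reusing the cosineSimilarity helper from earlier; the embedText callback stands in for the real embedding model and the 0.85 threshold is purely illustrative:

```typescript
// Sketch: fraction of statements from `source` that semantically reappear in `target`.
type EmbedFn = (text: string) => Promise<number[]>;

const splitStatements = (text: string): string[] =>
  text.split(/(?<=[.!?])\s+/).filter((s) => s.trim().length > 0);

async function matchedFraction(
  source: string,      // statements we want to find back
  target: string,      // text we search them in
  embedText: EmbedFn,  // wraps the text-embedding model
  threshold = 0.85     // illustrative similarity cut-off
): Promise<number> {
  const sourceVectors = await Promise.all(splitStatements(source).map(embedText));
  const targetVectors = await Promise.all(splitStatements(target).map(embedText));
  const matched = sourceVectors.filter((s) =>
    targetVectors.some((t) => cosineSimilarity(s, t) >= threshold)
  );
  return matched.length / sourceVectors.length;
}

// Answer correctness  = matchedFraction(referenceAnswer, generatedAnswer, embedText)
// Answer faithfulness = matchedFraction(generatedAnswer, retrievedContext, embedText)
```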

Testing showed that the base chatbot is naturally highly faithful, scoring 95%. Consequently, only minimal prompt engineering was required. In addition, the correctness of an answer is strongly linked to the performance of the retrieval system. Datasets with high retrieval scores showed a direct correlation with high chatbot accuracy. This underlines that a good retrieval system completes 80% of the work.

Continuous improvements

The engineering methodology requires data-driven decision-making. To monitor and compare our experiments, we chose LangSmith, a hosted service from the LangChain team that covers three aspects:

1. Evaluation
2. Prompt Engineering
3. Observability

Initially, we used LangSmith for its extensive evaluation options, with clear graphs and simple side-by-side comparisons. The prompt playground also proved very useful for quickly testing various prompts and analyzing the impact of chatbot parameters before conducting large-scale experiments. Both functions have contributed significantly to improving response quality.

The observability platform provides insights into system performance, allowing us to identify bottlenecks and realize "easy wins." Furthermore, the platform serves as a powerful debugging tool, reducing the need for *console.log* in the code. However, for reasons of cost efficiency, this is currently not yet implemented in the production environment.
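
Enabling tracing per environment can be as simple as setting the documented LangSmith environment variables outside production; the project name below is hypothetical and the API key is assumed to be supplied by the environment.

```typescript
// Sketch: tracing switched on outside production only, for cost reasons.
if (process.env.NODE_ENV !== "production") {
  process.env.LANGCHAIN_TRACING_V2 = "true";                   // send traces to LangSmith
  process.env.LANGCHAIN_PROJECT = "legal-chatbot-experiments"; // hypothetical project name
  // LANGCHAIN_API_KEY is expected to be present in the environment.
}
// With tracing enabled, every chain invocation shows up as a trace in LangSmith,
// replacing ad-hoc console.log debugging during development.
```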

After the experimental optimization, carried out on a small but representative subset of the entire data source, the indexation of the entire data set followed: download, split, embed and save. This final step, however, presented a significant challenge.
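
The split-embed-save step itself can be sketched with LangChain's text splitter and the Atlas vector store from earlier; chunk sizes and metadata fields below are illustrative, and the splitter's import path varies between LangChain versions.

```typescript
// Sketch: split a downloaded book into small fragments, embed them and save them.
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,    // small fragments, later widened again by neighborhood retrieval
  chunkOverlap: 50,
});

async function indexBook(bookId: string, fullText: string): Promise<void> {
  const docs = await splitter.createDocuments(
    [fullText],
    [{ docId: bookId }]                       // metadata reused for neighborhood lookups
  );
  docs.forEach((doc, position) => {
    doc.metadata.position = position;         // sequential position within the book
  });
  await vectorStore.addDocuments(docs);       // embeds and saves in one step
}
```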

Optimise the essentials and enable the rest

The data source contained a vast amount of information—at least 10,000 books of 200–300 pages each—totaling approximately 500 million vectors. Storing this at scale would have incurred prohibitive costs. Furthermore, this represented only one of the three source types. To maintain scalability, we retained vector search for core sources while switching to keyword search for others.

The implementation of MongoDB Atlas's Lucene-based search index resulted in a solid and cost-efficient solution. In addition, this implementation allows for a direct comparison between vector search and traditional keyword search. Experimental data shows that keyword search has significantly higher precision, but unfortunately at the expense of recall.
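
A sketch of such a keyword query through a standard aggregation pipeline, with hypothetical index and field names and reusing the collection handle from earlier:

```typescript
// Sketch: Lucene-backed keyword search through an Atlas Search index.
const keywordHits = await collection
  .aggregate([
    {
      $search: {
        index: "keyword_index",                                          // hypothetical index name
        text: { query: "termination of an employment contract", path: "text" },
      },
    },
    { $limit: 10 },
    { $project: { text: 1, score: { $meta: "searchScore" } } },
  ])
  .toArray();
```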

Smart, efficient, transparent and up-to-date

The result: a chatbot that makes legal knowledge accessible, communicates accurately, and is designed with the care befitting the legal world. Thanks to the optimization process, we now have an efficient system, despite the enormous volume of the information sources involved. In other words, RAG is a powerful building block for developing explainable AI solutions in knowledge-intensive sectors.

With each new implementation, we refine our approach and help organizations unlock complex information in a way that truly delivers value.
