RAG Retrieval Augmented Generation

Herve Blanc
Feb 14, 2024
6 min read

Updated: Mar 12, 2024

What’s RAG? Yet another acronym the tech industry is fond of. RAG means Retrieval Augmented Generation. It is a key Generative AI framework for taking advantage of LLMs with your own company’s data (LLM stands for Large Language Model; ChatGPT runs on top of an LLM). RAG is part of LLMOps (Large Language Model Operations), or MLOps for LLMs: a new set of tools and best practices to manage the lifecycle of LLM-powered applications, including development, deployment, and maintenance.

LLMs are huge AI models with billions of parameters. Depending on the model, the company behind it, and the budget at its disposal, a model might be retrained once or twice a year. LLM training requires loads of GPUs to learn from billions of tokens: pretty much the whole content of the internet. It is quite an expensive task, as you have to rent or own a supercomputer equipped with thousands of GPUs and run training jobs for weeks.

As such, LLMs are not updated daily and they suffer from a “knowledge cutoff”: they have been trained on internet content only up to the date their training started on a given dataset. Any subject that emerged after that date is basically unknown to the model. Another sure thing is that the model knows little or nothing about your company’s data, since you keep key proprietary information secured behind your firewall.

Questioning LLMs about things they don’t know leads to an undesirable behavior called hallucination. LLMs, being the best at “predicting the next token” in answer to your input prompt, will likely generate an irrelevant or random sequence of tokens when asked about unknown information. The end user can quickly spot the erroneous output if familiar with the subject. Otherwise it can mean trouble, for example “Judge sanctions lawyers for brief written by A.I. with fake citations”.

Luckily, if you give the LLM some “context” that is relevant to your users’ questions, it can answer by extracting information from your company’s data instead of hallucinating. Retrieval of the company’s data is indeed a key step of RAG (see RAG diagram below). The system uses embeddings to encode the user’s question and searches for all documents, or parts of documents, whose encoding is similar to that of the question. This is called similarity search.

Even a simple RAG application entails tuning many different parameters, components, and models.

Source: GradientFlow Techniques, Challenges, and Future of Augmented Language Models

This implies that all your company’s data has previously been processed and stored in a vector store: a database of document chunks indexed by their encoding. Embeddings are produced by models of their own, which generate vector encodings that make searching content very efficient. The encoding (a vector of numbers) captures meaning: for example, “king” would sit close to “man” and “queen” (this is very different from keyword search). Some embedding models use vectors of 4096 numbers to represent a piece of text. These 4096-dimensional vectors are compared to find the nearest neighbors in this large mathematical space when searching for answers to our questions.

Just for the sake of understanding, we could reduce the number of dimensions to only 2 and use a 2D representation to show how similar concepts might be encoded close together by a simpler embedding model.

semantic search 2D vector simplified representation

source: Cohere What is semantic search
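To make the similarity search idea concrete, here is a minimal sketch that sticks to the 2D simplification. The sentences, their coordinates, and the function names are all made up for illustration; a real system would get its vectors from an embedding model and keep them in a proper vector store.

```python
import math

# Hypothetical 2D "embeddings": these coordinates are invented for the example.
# A real embedding model would produce hundreds or thousands of dimensions.
VECTOR_STORE = [
    ("The king addressed the court.", (0.90, 0.80)),
    ("The queen signed the decree.", (0.85, 0.90)),
    ("Bananas are rich in potassium.", (0.10, -0.70)),
    ("Apples keep well in a cool cellar.", (0.20, -0.60)),
]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_search(query_vector, store, k=2):
    """Return the k chunks whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

# Pretend the embedding model mapped "Who rules the country?" to this point.
query = (0.88, 0.85)
print(similarity_search(query, VECTOR_STORE))
```

The two royalty sentences come back first even though they share no keyword with the question, which is the whole point of searching by meaning rather than by words.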

Note that LLMs have limitations on the context size they can process: OpenAI’s GPT-3 model, for instance, had a context window limited to 2048 tokens. This explains why documents have to be chunked into pieces during the indexing step. An LLM prompt generally includes the user query and the context. This context itself includes carefully crafted instructions that the designer tested to support the application goal (this is called prompt engineering). The context may also include a few examples that teach the LLM what to do (few-shot learning; with no examples at all, this is called zero-shot). More advanced techniques like Chain-of-Thought prompting might be used if the application is an AI agent. All this “context” (user input + prompt engineering + retrieved document chunks) is passed to the LLM for processing, then the LLM provides its answer (the LLM “predicts the next best tokens” that most likely follow the input prompt).
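Here is a rough sketch of the two steps just described: chunking a document at indexing time, and assembling the prompt at query time. It is an illustration under simplifying assumptions, not a reference implementation: chunk sizes are counted in words rather than in real tokens, and the function names, prompt layout, and sample texts are hypothetical.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of at most chunk_size words with overlapping edges.

    Words stand in for tokens here; a real pipeline would count tokens
    with the tokenizer of the target model.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def build_prompt(instructions, examples, retrieved_chunks, question):
    """Assemble instructions, few-shot examples, retrieved context and the question."""
    parts = [instructions, ""]
    for example_question, example_answer in examples:
        parts.append(f"Q: {example_question}\nA: {example_answer}\n")
    parts.append("Context:")
    for number, chunk in enumerate(retrieved_chunks, start=1):
        parts.append(f"[{number}] {chunk}")
    parts.append(f"\nQ: {question}\nA:")
    return "\n".join(parts)

document = " ".join(f"word{i}" for i in range(120))
chunks = chunk_words(document)
print(len(chunks), "chunks")

prompt = build_prompt(
    instructions="Answer using only the context. Say so if the answer is not there.",
    examples=[("What is our refund period?", "30 days, see [1].")],
    retrieved_chunks=chunks[:2],
    question="Which words appear in the document?",
)
print(prompt)
```

The overlap between consecutive chunks is a common precaution so that a sentence cut at a chunk boundary still appears whole in at least one chunk.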

The retrieval results are also key to enhancing the LLM’s response: this can be achieved by clearly referencing the documents that served as the basis for the answer. This gives the end user the opportunity to verify the sourcing of the LLM’s answer and perform fact checking. As mentioned earlier, LLMs suffer from hallucinations whenever they are asked to generate text about subjects they have neither seen during training nor received as context. If you give your users a link to the part of the document that was retrieved and processed by the LLM, they can read that paragraph and verify for themselves that the LLM provided a correct answer. Hallucinations are reduced by RAG, but one cannot ensure there won’t be any, so the referenced documents are really key to the trustworthiness of the system, with a human in the loop.
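One simple way to surface those references is to keep track of where each retrieved chunk came from and append that list to the answer. The sketch below assumes a hypothetical `RetrievedChunk` record, and the file names and answer text are invented for the example; your own retrieval layer will have its own metadata fields.

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    text: str
    document: str  # hypothetical file name or URL of the source document
    location: str  # hypothetical pointer inside it, e.g. a page or section

def answer_with_sources(answer, chunks):
    """Append a numbered list of the chunks that were passed to the LLM."""
    lines = [answer, "", "Sources:"]
    for number, chunk in enumerate(chunks, start=1):
        lines.append(f"[{number}] {chunk.document} ({chunk.location})")
    return "\n".join(lines)

chunks = [
    RetrievedChunk("Refunds are accepted within 30 days.", "refund-policy.pdf", "page 2"),
    RetrievedChunk("Refunds are paid to the original card.", "billing-faq.docx", "section 4"),
]
print(answer_with_sources("Refunds are accepted within 30 days [1].", chunks))
```

With the source list next to the answer, the reader can open the cited page and check the claim instead of taking the LLM's word for it.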

For the user experience to be complete, you need to provide a user interface for signaling that the LLM did not answer properly. This is the opportunity for the designers to learn from their users and collect data that can later be used to evolve the application. New demands might emerge from your users that the designers had not taken into consideration. Collecting positive feedback from your end users is useful too, as you most definitely need to create a complete set of tests before deploying your next app or LLM version (through few-shot learning or fine-tuning). OpenAI’s user interface provides buttons close to the answer window so users can easily give feedback.
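Behind such thumbs-up and thumbs-down buttons, the application only needs to store a small record per rating. Here is one possible shape for it, as a sketch: the field names and the JSON Lines file are assumptions, and any database would do just as well.

```python
import json
from datetime import datetime, timezone

def feedback_record(question, answer, thumbs_up, comment=""):
    """Bundle one user rating with the exchange it refers to."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "thumbs_up": thumbs_up,
        "comment": comment,
    }

def to_json_line(record):
    """Serialize a record as one line of JSON (JSON Lines format)."""
    return json.dumps(record, ensure_ascii=False)

def append_feedback(path, record):
    """Append the record to a log file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(to_json_line(record) + "\n")

record = feedback_record(
    question="What is our refund period?",
    answer="Refunds are accepted within 90 days.",
    thumbs_up=False,
    comment="The policy says 30 days.",
)
print(to_json_line(record))
```

Negative records point at answers to investigate, and positive ones can be replayed as regression tests before you ship the next version.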

Now that you understand what RAG is and how it works, let’s figure out why you would want to invest in such a system so your employees can interact with your company’s data. Maybe I should have started with that, right? Well, I already posted about Generative AI and productivity some time ago, so I won’t go there again. I just want to stress one important matter here, which has been dubbed “shadow AI”. Shadow AI is a term describing unsanctioned or ad-hoc generative AI use within an organization that’s outside IT governance. In other words, if you don’t provide your employees with a RAG system, they will use ChatGPT or its competitors, and they will extract the information manually and paste it in so the chatbot can provide meaningful help. This poses a serious risk of your confidential information leaking out to the companies owning those free-to-use chatbots.

Under the hood, RAG has been popularized by several open-source tools. Probably the most famous one is the LangChain framework. It allows you to implement RAG (and more advanced AI agents) by interfacing the different components described above. First of all, it comes with an impressive list of document loaders supporting data ingestion from all the sources you can think of (csv, office documents, pdfs, notion, slack, …). That alone makes LangChain the first place to check when it comes to RAG connectors. Then LangChain supports working with many different LLMs (GPT, Llama, …), whether running behind APIs or deployed “locally” on your server instances, or more likely on your cloud infrastructure. Using just one LLM API may create a SPOF risk (single point of failure). Lots of people realized this during the OpenAI “coup” in the agitated month of November 2023. The same goes for embeddings: you’ll find them all in LangChain (Cohere, HuggingFace, Mistral, … to cite others, not just OpenAI). Most vector stores are also integrated (qdrant, milvus, pinecone, weaviate, … to name a few). This is a component you likely want to select carefully based on the amount of information you’ll have to index in your RAG. Retrieval is going to be key to the response time perceived by your end users.
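To illustrate the single-point-of-failure concern, here is a minimal fallback sketch in plain Python. It deliberately does not use LangChain's own API: `primary_llm` and `backup_llm` are hypothetical stand-ins for whatever client calls your application makes, and the outage is simulated.

```python
def ask_with_fallback(prompt, providers):
    """Try each (name, call) pair in order and return the first successful answer."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as error:  # a real app would catch narrower error types
            errors.append(f"{name}: {error}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

def primary_llm(prompt):
    # Stand-in for a hosted LLM API that is currently unreachable.
    raise ConnectionError("service unavailable")

def backup_llm(prompt):
    # Stand-in for a second provider or a locally deployed model.
    return f"(backup) answer to: {prompt}"

provider_name, answer = ask_with_fallback(
    "Summarize our refund policy.",
    [("primary", primary_llm), ("backup", backup_llm)],
)
print(provider_name, "->", answer)
```

Keep in mind that different models react differently to the same prompt, so a fallback model deserves its own round of testing before you rely on it.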

That's it, I hope this got you motivated for getting RAG working with your company's data ASAP.

And don't forget to spread the word if you enjoyed this blog post: just click on the social network buttons below. Sharing is caring :-)