There's a hot new topic in AI. Frankly, there are many hot new topics in AI any day of the week, but we're going to focus on one right now: Retrieval Augmented Generation. This is meant to be a pretty straightforward article on what it is and why it exists. I found RAG hard to understand at first through all the jargon and technical detail, so I'll do my best to make this approachable without assuming any prior knowledge.
RAG marries one of the oldest great technological masterpieces, Search, with one of the newest, Generative AI. So there are two different arguments for why RAG is great:
Generative AI is amazing, but giving it access to more information makes it better.
Search is amazing, but adding automatic summarization and analysis makes it better.
Let's start with the first one. It's the one that introduced me to RAG, and it's also the easier one to explain.
For this we'll turn around and go through the concepts of RAG backwards:
Generation: What specifically am I talking about when I say GenAI?
Augmented: Why does GenAI need to be augmented? How do you augment it?
Retrieval: What is Retrieval? How does it come into GenAI?
Enough people have [spoken] [for] [Generative] [AI] that I'm not going to rewrite the wheel here, and will assume you're all at least a little convinced that generation is pretty cool.
Instead let's get a high level understanding of how LLMs work, as pertains to this discussion. I'm abstracting away a lot of detail, so if you're comfortable with your existing knowledge, just skim through the pictures.
Let's talk about [Large Language Models]. LLMs are a specific subset of GenAI that generate text. They took the world by storm in 2023 because they generate good text. This is a small but important distinction from previous types of AI.
To get an LLM, you first take the internet.
Then you take your LLM architecture and [pre-train] it: you run the internet through a lot of math, and end up with a working LLM - a computer program that holds in itself a deep understanding of the information and language structure of its input data (the internet).
Now you have an LLM.
To use an LLM, you provide some input text - the prompt. The LLM then keeps writing from where you left off by intelligently working out what should come next, hence Generative Artificial Intelligence.
In the basic format, this gives text completion similar to playing the Story Game. That's not the easiest way of interaction, so most end-user LLMs today have been adjusted for chat instead of completion; you can converse with it in question-answer format.
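To make that distinction a bit more concrete, here's a tiny sketch of raw text completion versus chat-style interaction. It uses the Hugging Face transformers library and a small open model purely for illustration; nothing in the rest of this article depends on it.

```python
from transformers import pipeline

# Completion-style: the model keeps writing from wherever you left off,
# just like playing the Story Game
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time, in a land of ice and fire,", max_new_tokens=30)
print(result[0]["generated_text"])

# Chat-tuned models wrap the same mechanism in a question-answer format:
# instead of raw text you pass a list of messages,
# e.g. [{"role": "user", "content": "Tell me a story."}],
# and get a reply back rather than a continuation.
```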
Here's GenAI (LLM iteration) as experienced by most users today; a smart and knowledgeable system you can converse with in plain English[2].
Pre-training: Everything that was included in the training data when the LLM was first created
- Always known by the LLM in all interactions
- Can contain a LOT of data, including pretty much all common knowledge and more
- Unable to be updated once the LLM is trained

In-context: Anything that's included in the prompt when the LLM is used
- Only known within the current interaction
- Limited amount of data
- New and different info can be added whenever desired
Pre-training data is a wonderful thing, with a truly impressive amount of knowledge squanched into the LLM. It's the best choice for scope, remembrance, and understanding. Unfortunately, it's a one-time-only type of deal for each LLM: pre-training is expensive and time consuming.
This is a problem, because the diagram above is a lie.
In reality, only part of the internet is included in pre-training (for many reasons[3]). Hence a lot will be missing from an LLM's knowledge bank, and it will be silent or misinformed if it needs to talk on such subjects.
Any casual LLM user will run into these constraints.
Ask anything personalized, about current events, or generally about something outside the model's knowledge? Many LLMs will [hallucinate]. Those more aware of their own limitations will ask you for more info.
Here we see one of the annoying problems that comes with LLM generation; but we also start to see hints of the solution.
The Problem:
There are some things that LLMs don't know. Sometimes this information is needed in order to get useful responses.
👁️🗨️ Tip
You can test this out with the Mistral Jon Snow Hugging Face Assistant. He gives examples on how to use OpenAI's chat-completion Python library.
Unfortunately, every example is wrong.
OpenAI deprecated the openai.Completion.create call in November 2023, after [Mistral] had finished training. Because his knowledge is out of date, the OpenAI calls given by Jon Snow will all return errors.
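To make that concrete, here's roughly what the deprecated call versus the current one looks like; a sketch assuming the openai Python package v1 or later, with the model names as placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# What Jon Snow will suggest (pre-November-2023 style). In openai>=1.0
# this raises an error, since openai.Completion was removed:
#   openai.Completion.create(model="text-davinci-003", prompt="Hello")

# The current equivalent goes through the chat completions endpoint:
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```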
Babies babble. Have you ever been around one? Babies make lots of cooing, gurgling noises, which can be very cute but completely uninformative. This makes sense, since babies aren't particularly informed on how language works. Adults (generally) have that information, and thus can communicate significantly more effectively.
Even with adults, you easily see the difference between informed and uninformed conversation (cough generation cough). Just try asking directions from a tourist[1] instead of a local, and you'll feel the difference that information makes.
This is exactly the problem we've run into with LLMs; they're very useful on familiar ground, but large chunks of the map are empty, ready for you to walk right off a cliff. Fortunately, ChatGPT was already telling us how to start solving this earlier:
ChatGPT: To find out how far away Easter is, we'll need to know the current date. Could you please provide today's date?
We just need to tell it the information it's missing!
Any LLM user has probably followed this dance through before:
You: How far away is Easter?
LLM: To find out how far away Easter is, we'll need to know the current date. Could you please provide today's date?
You: Today is 12 Feb 2024
LLM: Easter in 2024 falls on April 7th. To find out how far away that is from February 12th, we can simply subtract the two dates: April 7th - February 12th = 54 days. So, Easter in 2024 is 54 days away from February 12th.
Once you've done this enough, you start to anticipate it:
You: Today is 12 Feb 2024. How far away is Easter?
LLM: Easter in 2024 falls on April 7th. To find out how far away that is from February 12th, we can simply subtract the two dates: April 7th - February 12th = 54 days. So, Easter in 2024 is 54 days away from February 12th.
Conversations with LLMs suddenly become a lot easier once you start telling it upfront the information it needs to know in order to contribute usefully[4].
'Context' is essentially an LLM's short-term memory. This is limited; the Context Window is the amount of text or information the model can consider at one time when generating or understanding language.
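If you're curious how much of that short-term memory a given prompt takes up, you can count tokens (context windows are measured in tokens, not pages or words). A quick sketch using the tiktoken library, which is my choice for illustration rather than anything this article requires:

```python
import tiktoken

# cl100k_base is the tokenizer used by many recent OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

text = "Today is 12 Feb 2024. How far away is Easter?"
print(len(enc.encode(text)), "tokens")  # a tiny fraction of the context window
```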
This is the second type of information source [discussed above]; the LLM context. To fix the problem of missing information, you make the LLM actively aware of it in the current conversation.
Doing this is easy, as you've seen earlier - just add the info in the input text given to the LLM. It does generally have to be textual[5], but other than that, you can just bung it in right before the user input or questions[6].
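As a sketch of what that bunging looks like in practice (the documentation snippet and wording here are placeholders, not the exact prompt any of the Assistants use):

```python
# Extra information the LLM doesn't know from pre-training
docs_excerpt = (
    "To generate text, call client.chat.completions.create() with a list "
    "of messages..."
)

user_question = "How do I generate text with the OpenAI Python library?"

# Bung the info in right before the user input
prompt = docs_excerpt + "\n\nUse the documentation above to answer.\n\n" + user_question
```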
👁️🗨️ Tip
You can test this out with this Ygritte Hugging Face Assistant. Like Jon Snow, Ygritte gives examples on how to use OpenAI's chat-completion Python library.
However, the difference is in the system prompt. For Ygritte, I included a couple paragraphs from one of OpenAI's guides to text generation and one additional line saying "Use the documentation above to answer."
This change to the input information means that Ygritte will now correctly use client.chat.completions.create to access the OpenAI completion API. Augmenting the context with additional information has given us an Assistant that's actually useful for its purpose[spoiler!].
A simple enough solution, telling the LLM what it needs to know ahead of time. All the same, it's quite a powerful little trick. ChatGPT seems to inject the current date [into the system prompt], which is a clean way to resolve our issue.
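A sketch of that date trick (OpenAI hasn't published the exact wording, so this is just illustrative):

```python
from datetime import date

# Inject today's date into the system prompt so the model always knows it
messages = [
    {"role": "system", "content": f"The current date is {date.today().isoformat()}."},
    {"role": "user", "content": "How far away is Easter?"},
]
```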
The Problem:
There are some things that LLMs don't know. Sometimes this information is needed in order to get useful responses.
Nothing is ever that simple. The problem with this easy augmentation comes from two sources:
There's a lot of missing info which may be needed
Each LLM context window (short-term memory) is limited
This can be represented by correcting the earlier diagram to be somewhat closer to reality:
Clearly the missing information does not all fit into the LLM context. To an extent, your potentially-relevant information set might be limited from the start. In the ongoing example of the Hugging Face Assistants for OpenAI API usage, the dataset they need is the OpenAI documentation. Unfortunately, even that dataset is going to be too large for the ~50 page context window of Mistral once you consider all the how-to docs, API references, forum questions, release notes, etc.
We're saved by a third point:
Only a small amount of the entire missing information will actually be relevant to any specific LLM interaction
This means we can reduce the information given to the LLM context in two steps: narrowing down what's potentially relevant, then finding what's actually relevant.
Both of these happen at different times, with different parameters.
It's now that we start thinking of how LLMs can be used in applications. The new AI world isn't just about people talking directly to the LLMs; it's about using them to support other programs, or adjusting them for specific purposes. We've seen this throughout in the Hugging Face Assistants; these are AI [agents] which are (nominally) meant for the specific purpose of helping people use the OpenAI offerings.
So narrowing down the potentially relevant info happens when an AI agent is first being created, and is determined by what this specific agent needs to know in order to do its job. Finding the relevant info, compressed enough to be ingested by the LLM context, happens during use and depends on what questions the user asks.
For an OpenAI assistant, this would look like:
Purpose: Answer questions about how to use OpenAI's offerings clearly and well
Potentially Relevant Info: All available OpenAI documentation, examples, forum posts, updates, etc.
The first step - working out what data is potentially relevant for your agent's purpose - is usually fairly straightforward. Once you're there, the problem is how to extract the actually relevant information for every interaction.
The Problem:
Once a user asks a question, how do you find the useful data from a large dataset?
The Solution:
Search.
Get the user input, and search the dataset for the most relevant material. Everything everywhere does this, including the very OpenAI documentation itself:
Here we finally see the full RAG pipeline:
1. Get the user input
2. Retrieve info from your dataset relevant to the user input
3. Augment the LLM context with this info
4. Generate an answer with this added knowledge
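In code, the skeleton of the pipeline looks something like this; the three helper functions are placeholders that get fleshed out in the worked example further down:

```python
def rag_answer(user_input):
    # Retrieval: search the dataset for info relevant to the user input
    retrieved_data = retrieve_relevant_info(user_input)
    # Augmentation: add that info to the LLM context alongside the question
    messages = augment_context(retrieved_data, user_input)
    # Generation: let the LLM answer with the added knowledge
    return generate_answer(messages)
```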
👁️🗨️ Tip
You can test this out with the Varys Hugging Face Assistant. Unlike the earlier assistants, he can answer a question about any OpenAI offering.
When Varys is asked a question, he first does a search on the openai.com domain to retrieve relevant information. This information then augments the LLM context in order to inform the answer generation.
Thanks to this implementation of the full RAG pipeline, Varys is actually useful for purpose!
This is very simple in theory, and can also be simple in practice. Or not. Search is a [massive industry], with many different implementation options.
There's two basic types of search to start the RAG conversation: string search and embedding search.
You don't need to be a programmer to know string search, you just need to use Google. This is the familiar concept of taking the words in the input and finding the sections of your dataset that have the same words. A broader category than just [exact substring matching], this can include [fuzzy matching], [approximate string matching], and [semantic search]. Every programming language has a function for at least the first and likely libraries for the rest, and any computer science graduate will have implemented an entire algorithm from scratch.
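As a minimal illustration of the string-search end of the spectrum - nothing fancier than counting shared words:

```python
def keyword_search(query, documents, top_k=3):
    # Score each document by how many of the query's words it contains
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(doc.lower().split())), doc)
        for doc in documents
    ]
    # Return the best matches, highest word overlap first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]
```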
[Embedding search] is another recent[8] gift from the AI overlords. Words or phrases are converted into numerical vectors ([embeddings]) in multi-dimensional space, allowing for semantic relationships to be captured. These embeddings are then used to compare and match similar words or phrases in a more meaningful way than traditional string matching techniques. In short, they're a more complex form of semantic search with a much bigger brain.
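And a sketch of the embedding-search equivalent, here using OpenAI's embeddings endpoint with cosine similarity; the model name is just one option among many:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # Convert text into numerical vectors that capture semantic meaning
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def embedding_search(query, documents, top_k=3):
    doc_vectors = embed(documents)
    query_vector = embed([query])[0]
    # Cosine similarity: higher means semantically closer
    scores = doc_vectors @ query_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    return [documents[i] for i in np.argsort(scores)[::-1][:top_k]]
```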
I won't get further into the details of embeddings; that's another article for another day. Hopefully you're convinced enough that search is something that has been conquered before (many times before) and can be conquered again.
Let's create an itsy baby RAG system for a News Broadcaster AI agent. This will use the dataset of current events to give informed answers to user questions.
This will use the newsapi as dataset and retrieval mechanism in one. When the user asks a question, we search for relevant current events using this API.
```python
import requests

def retrieve_relevant_news(input_text):
    # Pull keywords out of the user's question to build the search query
    keywords = get_keywords(input_text)
    newsapi_url = (
        'https://newsapi.org/v2/everything?q=' + '&q='.join(keywords)
        + '&from=2024-03-10&sortBy=relevancy&searchIn=title,description'
        + '&apiKey=' + newsapi_apikey
    )
    # Search the news dataset for articles relevant to the question
    response = requests.get(newsapi_url).json()['articles']
    # Keep only results that actually include data, limited to the top 10
    useful_results = [
        item['title'] + "\n" + item['description']
        for item in response
        if item['source']['id']
    ][:10]
    return '\n'.join(useful_results)
```
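The get_keywords() helper isn't shown; per footnote [9] it has absolutely no cleverness, so a naive stdlib-only placeholder along these lines is all that's assumed:

```python
def get_keywords(input_text):
    # Naive keyword extraction: drop short filler words, keep the rest
    stopwords = {"the", "a", "an", "is", "are", "what", "how", "of", "in", "on", "with"}
    return [word for word in input_text.lower().split()
            if word.isalpha() and word not in stopwords]
```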
Note that the results are being trimmed to only those that actually include data, and limited to 10.
This is a search mechanism that relies pretty much entirely on an external service, although there may be some cleverness in the get_keywords() function[9]. However, it does work:
Both the external retrieved information and the user question should be put into the context. Most LLMs use a message array for this.
```python
def augment_context(retrieved_data, input_text):
    # Retrieved news goes into the system prompt; the user's question
    # stays as the user message
    augmented_context = [
        {
            "role": "system",
            "content": "You are a news broadcaster. Always let the user know "
                       "about the details in the following news:\n" + retrieved_data
        },
        {
            "role": "user",
            "content": input_text
        },
    ]
    return augmented_context
```
Massaging the prompt associated with the retrieved data is an easy way to start getting better answers.
With your message array prepped and augmented, getting an answer is a simple matter of a call to your LLM of choice. Here I'm using the OpenAI chat API - the same one that all the Hugging Face Assistants are there to help with.
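Here's roughly what that call looks like, sketched against the same API; the model choice is a placeholder, and the full runnable version lives in the gist mentioned below:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_answer(augmented_context):
    # Hand the augmented message array to the LLM and return its reply
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=augmented_context,
    )
    return response.choices[0].message.content

# Tying the itsy baby RAG system together:
question = "What's happening in AI news this week?"
news = retrieve_relevant_news(question)
print(generate_answer(augment_context(news, question)))
```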
I've added a little more juice to make it command line runnable, and you can find it as a gist. Here's a demo of it in action, but please do copy it over and run it yourself as well!
RAG is one of those broadly-used terms that seems as though it should have a well-spec'd definition, but is really just vibes.
Here are some general principles to work from:
RAG is generally an LLM-specific term, despite "generation" being pretty broad. For instance, I've yet to see any 'RAG image creation' projects.
Retrieval generally means 'search of a specific known subset of documents'.
Linked to the above, it's questionable whether tool usage is considered RAG. More specifically, web search retrieval generally isn't, from what I've seen, called RAG.
The above is all flexible and very dependent on the hype train of the day.
Popular things are popular because they're great, but it then becomes difficult to do something great that nobody else has thought of. However, there are a couple of different spaces which allow for a distinctive edge.
First, dataset. If you can bring a specialized, useful dataset that other people don't have access to (or don't know very well), you can build something novel. Web search, useful as it is, has been done before.
Second, retrieval system. This is where a lot of the heavy lifting happens. The search element is the entire point of a RAG system, but as anyone who uses Google knows, it can be difficult to actually find the right results. Hrishi's written a good series on ways to improve.
Finally, generation. This is the realm of prompts, priming, possibly even fine-tuning. Massage the output to be something people like reading!
1: This might be somewhat unfair, especially if you want to go to a popular tourist attraction - some travellers study the local map very thoroughly (my Dad) and some locals only know the five back streets they take often (me).↩
2: Or other language of your choice, although LLMs are generally most conversant in English.↩
3: Only text data can be used, knocking out all of YouTube and other multimedia sites. Some internet data isn't easily accessible, some isn't legally usable, much is of lower quality than desired. Even then, it's just not worth the time or money to send everything through once the LLM's already been trained with most of the useful stuff - basic 80/20 rule.↩
4: This applies to people, too! "When do you want to get dinner? I'm busy on Thursday, but the rest of the week is free."↩
5: Adding multimedia generally requires the intermediate step of transforming it to text. By now there's a bunch of AI that will describe your diagrams/articles/rag/videos/pdfs as text on your behalf.↩
6: You'll get significantly better results if you do a bit more formatting, but straight up bunging works surprisingly well.↩
spoiler!: Unfortunately, Ygritte's usefulness is limited. She's only updated with one specific function update - if you try and ask about any other feature that is new or changed since 2021, she'll still give you deprecated knowledge.↩
7: Preferably your augmenting data will be even smaller than needed for the context window - overstuffing the context doesn't always end well.↩
8: While it's questionable whether 'in the last two years' counts as recent in the tech world, embeddings are definitely significantly more recent than substring matching.↩
9: There is absolutely no cleverness, it's quite useless, except for how it doesn't require any library installations. Please do rewrite.↩