Welcome to brdata-rag-tools’s documentation!

Tutorial

Introduction:

Welcome to the brdata-rag-tools tutorial. In this brief introduction, I will guide you through using the library with a simple example.

Installation:

To install the package run:

# create virtual environment
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate

# install the package
python3 -m pip install brdata-rag-tools

Basic Usage

Choose the language model (LLM) you want by instantiating the LLM class with a value from the LLMConfig enum.

All LLMs from brdata-rag-tools are connected via an API. The library serves as a wrapper to make the models more easily accessible.

from brdata_rag_tools.models import LLM, LLMConfig

llm = LLM(model_name=LLMConfig.GPT35TURBO)

Select the desired LLM, such as GPT 3.5 Turbo, GPT 3, GPT 4, IGEL, or Google’s Bison models.

All GPT models may be used by anyone with an API token; the IGEL and Bison models are only accessible from BR Data’s infrastructure.

Next, we set the environment variable holding OpenAI’s access token:

import os

os.environ["OPENAI_TOKEN"] = "YOUR TOKEN HERE"

For IGEL, set the environment variable “IGEL_TOKEN”; for the Bison model, set “GOOGLE_TOKEN”.
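
For example (the token values are placeholders):

# Use whichever variable applies to the model family you picked.
os.environ["IGEL_TOKEN"] = "YOUR TOKEN HERE"
os.environ["GOOGLE_TOKEN"] = "YOUR TOKEN HERE"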

The LLM class merely holds the token and the actual model class.

The model class holds the actual logic and connections to interact with the language model endpoints.

Now, interact with the model using the prompt method:

joke = llm.prompt("Please tell me a joke.")
print(joke)

Chat with the model

Models of the GPT family also support chat functionality – this means the model is aware of prompts sent earlier in the conversation. Use the chat method:

answer = llm.chat("Please return 'test'")
print(answer)
answer = llm.chat("What did I tell you in the last message?")
print(answer)

For a new chat and to make the model forget earlier messages, use the new_chat method:

llm.new_chat()
answer = llm.chat("What did I tell you in the last message?")
print(answer)

Databases

We do not only want to talk to our LLM; we want to augment its prompt. This means we want to query a database for relevant content.

This is done using so-called semantic (or vector) search. For semantic search, the searchable content is transformed into a numerical representation, a vector embedding.

To retrieve relevant content, the user’s prompt is also transformed into a vector. The prompt vector is then compared to all vectors in the database, and the most similar vectors are retrieved.
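
To make the idea concrete, here is a minimal, library-independent sketch using cosine similarity on toy vectors. The vectors and document names are made up for illustration; in practice the embeddings come from the embedding models described later:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means identical direction, 0.0 means orthogonal vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; in practice these come from an embedding model.
prompt_vector = np.array([0.9, 0.1, 0.0])
documents = {
    "podcast about music": np.array([0.8, 0.2, 0.1]),
    "podcast about crime": np.array([0.1, 0.9, 0.3]),
}

# Rank the documents by similarity to the prompt vector.
for name, vector in sorted(documents.items(),
                           key=lambda item: cosine_similarity(prompt_vector, item[1]),
                           reverse=True):
    print(name, cosine_similarity(prompt_vector, vector))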

You can choose between two different database types:

  1. PgVector, a Postgres-based database with an extension for vector search. It is a good choice if you plan to build production services. You need to deploy the database yourself.

  2. SQLite with FAISS is a good choice if you just want to try things out. While FAISS is a very capable library, its usage in this library is not optimized for production.

FAISS and SQLite

Create your database by importing and instantiating it. Without any parameters it will be a memory-only database. This means that if you stop your program, the data will be lost.
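
For example, a purely in-memory instance (FAISS is imported from the databases module, as used later in this tutorial):

from brdata_rag_tools.databases import FAISS

# In-memory only: all data is lost when the program stops.
database = FAISS()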

To write your database to disk, use the database parameter and pass it the path to your database file. If it does not exist, it will be created in the specified directory. You may use absolute or relative paths.

database = FAISS(database="FAISS.db")

PGVector

The easiest way to run PGVector, if you’re not on the BR Data infrastructure, is via Docker.

The following command is not safe for a production environment! Don’t use trust mode in critical applications.

docker run -p 5432:5432 -e POSTGRES_HOST_AUTH_METHOD=trust ankane/pgvector

Connect to the database with psql if you want to make sure pgvector is up and running:

psql -U postgres -h localhost -p 5432

Follow the instructions to set a password or trust all hosts.
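
If you prefer to verify from Python instead, here is a minimal sketch using SQLAlchemy directly. The connection details are assumptions matching the Docker command above in trust mode, and the sketch assumes SQLAlchemy 2.x plus a Postgres driver such as psycopg2 are installed:

from sqlalchemy import create_engine, text

# Trust mode: no password needed for the default "postgres" user and database.
engine = create_engine("postgresql+psycopg2://postgres@localhost:5432/postgres")

with engine.connect() as connection:
    # Enable the extension (idempotent) and list installed extensions.
    connection.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
    connection.commit()
    extensions = connection.execute(text("SELECT extname FROM pg_extension")).scalars().all()
    print(extensions)  # should include 'vector'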

If you’re on the BR data infrastructure, simply add pgvector as database type to your project’s config.yaml file and forward port 5432 to localhost.

Once you have your pgvector instance running, instantiate the PGVector class and supply it with the database’s password.

from brdata_rag_tools.databases import PGVector
database = PGVector(password="PASSWORD")

Populate your database

To search for relevant content, you first need to ingest it in the database.

You therefore need a table in the database into which to ingest your data. You get a bare minimum of such a table with the following method:

embedding_table = database.create_abstract_embedding_table()

This method returns an abstract database table. These tables always contain the following columns:

  • id (string)

  • embedding (Vector)

  • embedding_source (string)

The Embedding Type

The embedding column will be generated by the database from the content in embedding_source. The id needs to be unique for each row.

To actually use it, you need to inherit from the abstract table. In the following example, we will build a little search for podcast recommendations.

The table needs to know which kind of embedding you want to use. The most universal embedding type is Sentence Transformers, which is fine-tuned for cosine-similarity comparison of German texts.

from brdata_rag_tools.embeddings import EmbeddingConfig

embedding = EmbeddingConfig.SENTENCE_TRANSFORMERS

embedding_table = database.create_abstract_embedding_table(embed_type=embedding)

The database table

The returned abstract table is an SQLAlchemy table object. You may add your own columns to it to store additional data beyond the three columns mentioned above.

Give the table any name you like using the __tablename__ attribute. This is the only required field. Other columns, like title and url in the example below, are introduced using the SQLAlchemy logic.

For more information on this topic, please refer to the SQLAlchemy tutorial. A list of types to use in your mapped_column attributes is available in the SQLAlchemy documentation.

Next, create the tables in the database:

from sqlalchemy import String
from sqlalchemy.orm import Mapped, mapped_column

# Define database table
class Podcast(embedding_table):
    __tablename__ = "podcast"
    title: Mapped[str] = mapped_column(String)
    url: Mapped[str] = mapped_column(String)
    # Inherited from parent class:
    # id: str
    # embedding_source: str
    # embedding: Vector

# Create tables
database.create_tables()

# Fill with content
podcast1 = Podcast(title="TRUE CRIME - Under Suspicion",
                   id="1",
                   url="example.com",
                   embedding_source="Who is rightfully, who is wrongly suspected here? What if people are wrongly convicted, and no one believes them? Or vice versa: If the true perpetrator goes unpunished? Under Suspicion - In the 7th season of the successful BAYERN 3 True Crime Podcast, defense attorney Dr. Alexander Stevens and BAYERN 3 host Jacqueline Belle discuss new exciting criminal cases. This time, it's about people who have come under suspicion. Who is guilty? Who is lying, who is telling the truth? And in the end, are the right ones always convicted?")

podcast2 = Podcast(title="SCHOENHOLTZ - The Orchestra Podcast",
                   id="2",
                   url="br24.de",
                   embedding_source="How does an orchestra work? How do you get in? And why do orchestra musicians always wear black? Who could answer these questions better than an orchestra musician herself! Anne Schoenholtz is a violinist in the Bavarian Radio Symphony Orchestra, BRSO - an orchestra that has just been voted the third best in the world. As the host of the orchestra podcast, Anne takes us behind the scenes of the BRSO and elicits intimate confessions and funny stories from her colleagues about orchestra life. In the third season, we find out how concert programs are created, what ails musicians typically, and why there are so many jokes about violas. Sir Simon Rattle, the new chief conductor of the BRSO, answers community questions at the end of each episode.")

podcast3 = Podcast(title="Crime Scene History – True Crime meets History",
                   id="3",
                   url="bla.com",
                   embedding_source="In Crime Scene History, Niklas Fischer and Hannes Liebrandt, two historians from Ludwig Maximilian University in Munich, leave the lecture hall and travel back to exciting crimes from the past: a mysterious water corpse in the Berlin Landwehr Canal, young Stalin as the leader of a bloody robbery, or the hunt for a war criminal halfway around the world. True crime from history discussed in an entertaining way. The focus is on the question of what this actually has to do with us today. Crime Scene History is a podcast from Bayern 2 in collaboration with the Georg von Vollmar Academy.")

# Write to database
database.write_rows([podcast1, podcast2, podcast3])

Since we are using SQLAlchemy’s Table classes, these tables are an exact representation of what will be stored in our database, and we will interact with the content of the vector store only through these Table classes.

Right now, we only have content in our tables and no embeddings yet. The embeddings are computed automatically when you write your rows to the database.

Write SQLAlchemy tables to the database

To create a normal table using SQLAlchemy without an embedding column, follow the normal SQLAlchemy procedure.

Import the Base class from the databases module, not from SQLAlchemy itself.

# Import Base from databases module
from brdata_rag_tools.databases import Base

# Use Base as parent class for your table
class Person(Base):
    __tablename__ = "person"
    id: Mapped[str] = mapped_column(String, primary_key=True, unique=True)
    name: Mapped[str] = mapped_column(String)

# Create tables
database.create_tables()

# Create a row
person = Person(
    id = "123",
    name = "John Doe"
)

# Set create_embeddings=False to write row to DB
database.write_rows([person], create_embeddings=False)

Querying the database

Remember the following line:

embedding_table = database.create_abstract_embedding_table(embed_type=embedding)

Here we’ve specified the embedding type for the table. The embeddings are created with the type specified in this line and sent to the vector store. Now we can query the database for content. Via database.session() we may also interact with it like a normal database via SQLAlchemy.

from sqlalchemy import text

with database.session() as session:
    response = session.execute(text("SELECT * from podcast;")).all()

for row in response:
    print(row.title)

This statement prints out all three podcasts in the database. In the same way, you can write your own SQL queries to filter the results.

To select only those podcasts hosted on br24.de, you would write

with database.session() as session:
    response = session.execute(text("SELECT * from podcast where url = 'br24.de';")).all()

for row in response:
    print(row.title)

Alternatively, you may use the SQLAlchemy ORM syntax to query the database:

from sqlalchemy import select

with database.session() as session:
    response = session.execute(select(Podcast).where(Podcast.url == 'br24.de')).scalars().all()

for row in response:
    print(row.title)

Finding similar results

But conventional queries are not the strength of vector databases. We want to find content that is similar to a user query and use it to augment our prompts to the LLM.

To do so, we query the database with a question, using the retrieve_similar_content method. To find some podcasts on music, we simply ask for them:

context = database.retrieve_similar_content("Please show me some podcasts on music.",
                                            table=Podcast,
                                            embedding_type=embedding)

The returned context object is a list of dictionaries, with the table name as the key for the content and the key cosine_dist, which indicates the distance between the search term’s vector and the content’s vector.

The smaller cosine_dist is, the more similar query and result are.

for row in context:
    print(row["cosine_dist"], row["Podcast"].title)

Adding context to your prompt

Now we may augment our prompt to the LLM. To do so, we need to write a prompt template:

prompt_template = ("You are the podcast expert at Bayerischer Rundfunk. "
                   "A user asks you the following question:\n"
                   "{}\n"
                   "Here are some podcasts that you can recommend:\n"
                   "-{}\n"
                   "Limit yourself to the selection of podcasts and do not invent new ones. Recommend only one podcast from the list and briefly justify your decision. "
                   "Write in the second person, addressing the user directly.")

In the template we see two placeholders. The first one is for the user question: if we were developing an app, this would be the prompt given to us by the user. For now, we just write it ourselves:

user_prompt = "Please recommend me some podcasts on music."

The second placeholder is for the context we retrieved from our database. We just need to restructure it as a human-readable list.

context = [x["Podcast"].title + ": " + x["Podcast"].embedding_source for x in context]

Then we put everything together using a Python format string and send it to the LLM:

prompt = prompt_template.format(user_prompt, "\n-".join(context))
response = llm.prompt(prompt)

print(response)

LLMs usually accept only a limited number of tokens per prompt. If you run your RAG application on the server, there is a little helper function to make sure you don’t exceed the token limit. You pass it your template and the user prompt as a string, and the context as a list of strings.

If the context is too long, the function will pop the last elements of your context until it fits the context window.

context = llm.model.fit_to_context_window(prompt_template + user_prompt, context)

Simply use this function before you pass the context to the LLM.
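
For intuition only, the trimming behaves roughly like the following sketch. This is not the library’s implementation; count_tokens is a hypothetical helper using a crude characters-per-token estimate:

def count_tokens(text: str) -> int:
    # Very rough heuristic: assume about four characters per token.
    return len(text) // 4

def trim_context(prompt: str, context: list, token_limit: int) -> list:
    # Drop trailing context entries until prompt plus context fits the limit.
    while context and count_tokens(prompt + "\n".join(context)) > token_limit:
        context = context[:-1]
    return context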

Registering your own Language Model

If you have your own Language Model deployed, you may want to use it with this library.

This library assumes that a language model is available through a REST API. In principle you may also run it locally. You may need to fill in some dummy values in the connection-related fields in the upcoming example.

To register your own model you need to inherit from the Generator class in the models package.

The simplest way to do so is as follows. Your init method needs the parameters model and auth_token. Those parameters are then passed to the init function of the super (parent) class:

from brdata_rag_tools.models import Generator

class Bison(Generator):
    def __init__(self, model, auth_token):
        super().__init__(model=model,
                         auth_token=auth_token)

Those parameters are needed internally: model will hold the LLMConfig value and auth_token the token to connect to the REST API. Optional parameters you may pass are the following:

  • temperature: float

  • max_new_tokens: int

  • top_p: float

  • top_k: int

  • length_penalty: float

  • number_of_responses: int

  • max_token_length: int

Those are only used by your service, so you don’t have to stick too closely to the definitions, but for the sake of reusability it is advised not to overload these parameters. You can always introduce your own parameters if you need to.

Each Generator needs a prompt method. In this method you query your service with the parameters you’ve specified above.

The prompt() method is usually just a wrapper around your REST API. If you choose to run the model locally, you may also query it directly from here.

The prompt() method should take a string as input and should return a string.

import time

import requests

class Bison(Generator):
    def __init__(self, model, auth_token):
        super().__init__(model=model,
                         temperature=1.0,
                         max_new_tokens=256,
                         top_p=0.9,
                         length_penalty=1.0,
                         auth_token=auth_token)

    def prompt(self, prompt: str) -> str:

        headers = {
            'accept': 'application/json',
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.auth_token}'
        }

        json_data = {
            'id': str(time.time()),
            'prompt': prompt,
            'model_name': 'bison001',
            'max_tokens': self.max_new_tokens,
            'temperature': self.temperature,
            'top_k': self.top_k,
            'top_p': self.top_p,
        }

        response = requests.post('https://google-models-proxy.brdata-dev.de/v1/bison', headers=headers,
                                 json=json_data)

        return response.json()["response"]

If your model supports chat functionality, you can also implement a chat() method.

If you choose not to, the generic chat method will be used by adding the chat history to the end of your prompt. This will not produce the best results and may also confuse the model. Use at your own risk.
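
If you do implement chat() yourself, one possible approach – purely illustrative, building on the Bison example above – is to keep a message history on the instance and fold it into each call to your prompt() method. The history attribute is a hypothetical addition, not part of the Generator base class:

class Bison(Generator):
    def __init__(self, model, auth_token):
        super().__init__(model=model, auth_token=auth_token)
        self.history = []  # hypothetical: stores earlier messages of the chat

    # prompt() as defined in the example above ...

    def chat(self, prompt: str) -> str:
        # Naive sketch: prepend earlier exchanges to the new prompt.
        full_prompt = "\n".join(self.history + [prompt])
        answer = self.prompt(full_prompt)
        self.history.extend([prompt, answer])
        return answer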

Now that you’ve created your model, you need to register it as a last step. Use the register method from models.

from brdata_rag_tools.models import register
register(Bison, name="bison001", max_input_tokens=8192)

You pass the function your model class and specify the name of your model and the maximum number of tokens the model can handle as input.

After registration you may use your model just as you would with any pre-registered model:

language_model = models.LLM(model=models.LLMConfig.BISON001, auth_token=os.environ.get("BISON_TOKEN"))

response = language_model.prompt("Mighty language model, what is your name?")
print(response)

Registering your own Embedding Models

Registering your own embedding model follows the same principles as with language models. You create your own embedding class by inheriting from the Embedder parent class.

In this example we will not query an endpoint but only return a dummy value of [1, 2, 3] for each row.

The parent class expects two parameters which you need to pass:

  1. The endpoint under which the service is available. Since we don’t call an external service here, we simply fill in a dummy value.

  2. The auth_token used for authentication to your service. We leave this as None here, as we don’t call an actual endpoint.

Each embedder needs two methods:

  1. create_embedding(text) which takes a string as input and returns the embedding as numpy array. This method is used to create the embedding for your user prompt, which is used as an input to the database.

  2. create_embedding_bulk(rows) which takes a list of SQLAlchemy table classes as input and assigns the created embedding directly to the class’s embedding attribute. This method is used for ingesting your data into the database. These separate methods exist to let you optimize for high throughput during ingest and to minimize the number of requests to your service.

import numpy as np

# Assumption: the Embedder parent class is importable from the embeddings module.
from brdata_rag_tools.embeddings import Embedder

class Test(Embedder):
    def __init__(self):
        super().__init__(endpoint="example.com", auth_token=None)

    def create_embedding_bulk(self, rows):
        """
        Takes a list of SQLAlchemy table classes as input and returns them with embeddings assigned.
        """
        for row in rows:
            row.embedding = np.array([1, 2, 3])

        return rows

    def create_embedding(self, text: str) -> np.array:
        return np.array([1, 2, 3])

After you’ve created your class, you may register it using the register function from embeddings:

from brdata_rag_tools.embeddings import register

register(Test, name="test", dimensions=3)

You need to pass the name of your embedding model and the dimensionality to the register function.

Then you can use it just as you would with the pre-registered embedding methods. The name of your Embedder is always stored in all-caps in the selection Enum.

embed_type = EmbeddingConfig.TEST

database = databases.FAISS()
EmbeddingTable = database.create_abstract_embedding_table(embed_type=embed_type)

Models

class models.Generator(model=None, auth_token: str | None = None, temperature: float | None = None, max_new_tokens: int | None = None, top_p: float | None = None, top_k: int | None = None, length_penalty: float | None = None, number_of_responses: int | None = None, max_token_length: int | None = None)
This class represents a generator for text generation using language models.

Parameters:
  • model (LLMConfig) – The language model to be used for text generation.

  • auth_token (str, optional) – The API auth_token to access the language model. If not provided, the auth_token will be fetched based on the model value.

  • temperature (float, optional) – The temperature parameter for text generation. A higher value (e.g., 1.0) makes the output more random, while a lower value (e.g., 0.2) makes it more focused and deterministic. If not provided, the default model’s temperature will be used.

  • max_new_tokens (int, optional) – The maximum number of new tokens to generate. If not provided, the default model’s maximum new tokens value will be used.

  • top_p (float, optional) – The top-p probability threshold for text generation. Only tokens with cumulative probability less than or equal to the threshold will be considered. If not provided, the default model’s top-p value will be used.

  • top_k (int, optional) – The top-k number of tokens to consider for text generation. Only the k most probable tokens will be considered. If not provided, the default model’s top-k value will be used.

  • length_penalty (float, optional) – The length penalty factor. It determines how much influence the length of the generated text has on the probability distribution. A lower value (e.g., 0.8) encourages generating shorter text, while a higher value (e.g., 1.2) encourages longer text. If not provided, the default model’s length penalty value will be used.

  • number_of_responses (int, optional) – The number of responses to generate. If set, the generator will return a list of responses instead of a single response. If not provided, the generator will return a single response.

Variables:
  • model (LLMName) – The language model to be used for text generation.

  • auth_token (str) – The API auth_token to access the language model.

  • temperature (float) – The temperature parameter for text generation.

  • max_new_tokens (int) – The maximum number of new tokens to generate.

  • top_p (float) – The top-p probability threshold for text generation.

  • top_k (int) – The top-k number of tokens to consider for text generation.

  • length_penalty (float) – The length penalty factor.

  • number_of_responses (int) – The number of responses to generate.

fit_to_context_window(prompt: str, context: List[str], requested_completion_length: int | None = None) → List[str]

Reduces a list of semantic search results to fit the context window of the given LLM.

Token lengths are estimated and may differ from the real token vector’s length.

Guesses for OpenAI models are more accurate than for other models.

The requested_completion_length parameter aligns with OpenAI’s max_new_tokens parameter. The length of the generated response is added to the original prompt; if the answer gets too long, OpenAI returns an error.

Parameters:
  • prompt – Your prompt for the LLM

  • context – The context retrieved by the semantic search.

Returns:

The reduced context

Raises:

ValueError – If the prompt is too long to fit any context

get_token(token: str) → str

Returns the given auth_token or retrieves the appropriate auth_token based on the model value.

Parameters:

token (str) – The auth_token to be used for authentication.

Returns:

The retrieved auth_token or the given auth_token.

Return type:

str

Raises:

ValueError – If no auth_token is provided for the model value.

prompt(prompt: str) → str

Prompts the language model with the given string and returns the generated response.

Parameters:

prompt (str) – The prompt string to send to the model.

Returns:

The model’s response as a string.

Return type:

str

class models.IGEL(model, auth_token=None)

A class representing the IGEL generator.

The IGEL generator is a type of generator that uses the LLMName.IGEL model to generate text. It accepts an auth_token for authentication and various parameters that control the generation process.

Args:

auth_token (str, optional): An auth_token for authentication. Defaults to None.

Attributes:

  • model (LLMConfig): The model used by the IGEL generator.

  • auth_token (str): The auth_token used for authentication.

  • temperature (float): The temperature parameter for generation.

  • max_new_tokens (int): The maximum number of new tokens to generate.

  • top_p (float): The top-p parameter for generation.

  • length_penalty (float): The length penalty parameter for generation.

Methods:

prompt(prompt: str) -> str: Generates text based on the provided prompt.

prompt(prompt: str)

Prompts the language model with the given string and returns the generated response.

Parameters:

prompt (str) – The prompt string to send to the model.

Returns:

The model’s response as a string.

Return type:

str

class models.LLM(model: LLMConfig, auth_token: str | None = None)

Class representing a Language Model.

Args:

  • model_name (LLMConfig): The name of the language model.

  • auth_token (str, optional): The API auth_token for the language model. Defaults to None.

Attributes:

  • model_name (LLMConfig): The name of the language model.

  • auth_token (str): The API auth_token for the language model.

  • model (Generator): The language model generator.

Methods:

prompt: Generate text based on the given prompt.

prompt(prompt: str) → str

Prompt the model with a given prompt and return the generated output.

Parameters:

prompt (str) – The prompt string to provide to the model.

Returns:

The generated output from the model in response to the given prompt.

Return type:

str

class models.LLMConfig(value=<no_arg>, names=None, module=None, qualname=None, type=None, start=1, boundary=None)

An enumeration.

class models.OpenAi(model: LLMConfig, auth_token: str)

Create text from OpenAi.

Parameters:
  • model (LLMConfig) – The name of the language model to use.

  • auth_token (str) – The API auth_token for accessing the OpenAi API.

prompt(prompt: str) → str

Generate text with the GPT model family.

class models.Role(value=<no_arg>, names=None, module=None, qualname=None, type=None, start=1, boundary=None)

An enumeration.

Databases

Databases are vector databases for similarity search. Right now, PGVector and FAISS (with SQLite) databases are supported.

Embeddings