Build Your Own LLM - Data Ingestion

Patrick Deziel | Monday, Jan 15, 2024 |  DIY LLM Python LLM

2023 was the year of generative AI: services like ChatGPT, built on large language models (LLMs), and image generators like Stable Diffusion gained mainstream attention. In this series, learn about the architecture behind LLMs and how to build your own custom LLM!

Note: This tutorial requires Python >= 3.9

ChatGPT has an API, why do I need my own LLM?

We’ve come a long way since Microsoft’s Clippy.

ChatGPT is arguably the most advanced chatbot ever created, and the range of tasks it can perform on behalf of the user is impressive. However, several aspects make it risky for organizations to rely on as a permanent solution.

  1. It’s controlled by a third party who can disable features or change its functionality and API at any time.
  2. It’s a SaaS, so your users might experience latency issues that you can’t address directly.
  3. The per-request cost model can get expensive.
  4. It may require a significant amount of prompt engineering to get right (and sometimes it’s just wrong).
  5. It introduces the risk of prompt injection by malicious users.

Building your own LLM from scratch certainly has disadvantages. LLMs notoriously take a long time to train, and you have to figure out how to collect enough training data and pay for compute time in the cloud.

But guess what, you don’t have to start from scratch.

Transfer learning, in which you fine-tune an existing open source language model on your own data, is an efficient and practical way to produce a custom LLM. If your organization has specific data tenancy needs or requires a model with knowledge about a more specific domain or task, it may be worth it to build your own LLM!

Creating a Movie Reviews Corpus

In order to train or fine-tune an LLM, you need a training corpus. Obtaining a representative corpus is sneakily the most difficult part of modeling text.

If you’re aiming to build intellectual property, this is where you’d want to bring in your in-house document cache — news feeds, blog posts, emails, documents, Sharepoints, Powerpoints, all the points. Make sure you budget enough time for this step; it takes longer than you think.

For this post, we’ll save time by using a pre-curated dataset of 100k IMDB movie reviews that was originally created for sentiment analysis. We will fine-tune an existing open source model using this specialized corpus, and then we will test it to see if it performs better than the base GPT, BERT, or RoBERTa models on NLP tasks involving user movie reviews.

Ingesting the Data

A corpus is a collection of documents. In our example, a document is a single movie review. In Ensign, creating a corpus of documents is equivalent to publishing a series of events to a topic. Before ingesting anything, you need to create an Ensign project.

  1. Create an Ensign account at rotational.app.
  2. Create a new Ensign project (e.g. movie-assistant).
  3. Create a topic within the new project (e.g. movie-reviews-text).
  4. Create an API key with publish and subscribe permissions (don’t forget to download it).
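
Once the API key is downloaded, one way to keep the Client ID and Client Secret out of the notebook itself is to read them from environment variables. The sketch below is a minimal helper under that assumption; the function name `load_credentials` and the variable names `ENSIGN_CLIENT_ID` and `ENSIGN_CLIENT_SECRET` are illustrative choices made here, so adjust them to match your own setup.

```python
import os


def load_credentials(env=None):
    """Return the API key parts as keyword arguments for the client.

    The variable names are assumptions for this sketch, not something
    the tutorial prescribes.
    """
    env = os.environ if env is None else env
    names = {
        "client_id": "ENSIGN_CLIENT_ID",
        "client_secret": "ENSIGN_CLIENT_SECRET",
    }
    # Fail early with a readable message if either value is absent or empty
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Example with an explicit mapping instead of the real environment
creds = load_credentials({
    "ENSIGN_CLIENT_ID": "example-id",
    "ENSIGN_CLIENT_SECRET": "example-secret",
})
print(sorted(creds))  # ['client_id', 'client_secret']
```

The returned dictionary uses the keyword names `client_id` and `client_secret`, so it could be unpacked into the client constructor, e.g. `Ensign(**load_credentials())`.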

The next step is to install the package dependencies. The datasets package is a helper for downloading datasets from Hugging Face, and pyensign is the Ensign Python SDK.

$ pip install datasets
$ pip install "pyensign[ml]"

Next we’ll write some Python code to stream the documents into Ensign. Each document will be a separate event, and we’ll include the sentiment label as well as the split name in case we need them later. The code below is meant to be run within a Python notebook like Jupyter, where await can be used at the top level of a cell. Remember to substitute the Client ID and Client Secret from your API key!

import json
from datasets import load_dataset
from pyensign.events import Event
from pyensign.ensign import Ensign

# Helper generator that yields one Event per review in the given split
def write_events(split):
    for record in load_dataset("imdb", split=split, streaming=True):
        data = {
            "text": record["text"],
            "label": record["label"],
            "split": split,
        }
        yield Event(
            json.dumps(data).encode("utf-8"),
            mimetype="application/json",
            schema_name="reviews-raw",
            schema_version="0.1.0"
        )

# Replace the placeholder strings with the values from your API key
ensign = Ensign(
    client_id="<your client ID>",
    client_secret="<your client secret>"
)

TEXT_TOPIC = "movie-reviews-text"
for event in write_events("train"):
    await ensign.publish(TEXT_TOPIC, event)

for event in write_events("test"):
    await ensign.publish(TEXT_TOPIC, event)

for event in write_events("unsupervised"):
    await ensign.publish(TEXT_TOPIC, event)
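
Each event carries its review as UTF-8 encoded JSON, so anything that later reads the topic has to reverse that encoding. A minimal sketch of the round trip follows; `encode_review` and `decode_review` are hypothetical helper names, and the sample record is made up for illustration.

```python
import json


def encode_review(record, split):
    """Build the same JSON payload bytes that write_events publishes."""
    data = {
        "text": record["text"],
        "label": record["label"],
        "split": split,
    }
    return json.dumps(data).encode("utf-8")


def decode_review(payload):
    """Turn event payload bytes back into a dictionary."""
    return json.loads(payload.decode("utf-8"))


# Made-up record shaped like a row of the IMDB dataset
sample = {"text": "A charming film; the café scene is lovely.", "label": 1}
payload = encode_review(sample, "train")
review = decode_review(payload)
print(review["split"], review["label"])  # train 1
```

Non-ASCII characters such as the "é" above survive the round trip because `json.dumps` escapes them by default and `json.loads` restores them.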

After running the code, you should see 100k events published to Ensign on the topic dashboard. Another way to confirm that the ingestion worked is to run a query and convert the events into a DataFrame.

from pyensign.ml.dataframe import DataFrame

cursor = await ensign.query("SELECT * FROM movie-reviews-text")
df = await DataFrame.from_events(cursor)
df.head()

The DataFrame should contain the three splits in the dataset (train, test, unsupervised).

print("train", len(df[df["split"] == "train"]))
print("test", len(df[df["split"] == "test"]))
print("unsupervised", len(df[df["split"] == "unsupervised"]))

train 25000
test 25000
unsupervised 50000
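
The same check can be written as a single pass that compares every split against the size you expect. The sketch below assumes the split labels have already been pulled out of the events into a plain list (for example from `df["split"]`); `check_split_counts` is a hypothetical helper, and the expected sizes are the ones printed above.

```python
from collections import Counter

EXPECTED = {"train": 25000, "test": 25000, "unsupervised": 50000}


def check_split_counts(splits, expected=EXPECTED):
    """Return {split: (found, expected)} for every split whose count is off."""
    counts = Counter(splits)
    names = set(expected) | set(counts)
    return {
        name: (counts.get(name, 0), expected.get(name, 0))
        for name in names
        if counts.get(name, 0) != expected.get(name, 0)
    }


# Simulated labels standing in for the real column, one review short
labels = ["train"] * 25000 + ["test"] * 25000 + ["unsupervised"] * 49999
print(check_split_counts(labels))  # {'unsupervised': (49999, 50000)}
```

An empty dictionary means all three splits arrived in full. Counts above the expected size would suggest that some events were published more than once.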

To be continued…

Now that you have a custom text corpus in Ensign, in the next module you will train an LLM using semi-supervised learning!

Image generated by DALL-E
