Build Your Own LLM - Data Ingestion

Patrick Deziel | Monday, Jan 15, 2024 | DIY LLM, Python, LLM

2023 was the year of large language models (LLMs), with generative AI services like ChatGPT and Stable Diffusion grabbing mainstream attention. In this series, learn about the architecture behind LLMs and how to build your own custom LLM!

Note: This tutorial requires Python >= 3.9

ChatGPT has an API, why do I need my own LLM?

We’ve come a long way since Microsoft’s Clippy.

ChatGPT is arguably the most advanced chatbot ever created, and the range of tasks it can perform on behalf of the user is impressive. However, there are aspects that make it risky for organizations to rely on it as a permanent solution.

  1. It’s controlled by a third party who can disable features or change its functionality and API at any time.
  2. It’s a SaaS, so your users might experience latency issues that you can’t address directly.
  3. The per-request cost model can get expensive.
  4. It may require a significant amount of prompt engineering to get right (and sometimes it’s just wrong).
  5. It introduces the risk of prompt injection by malicious users.

There are certainly disadvantages to building your own LLM from scratch: LLMs notoriously take a long time to train, you have to figure out how to collect enough training data, and you have to pay for compute time in the cloud.

But guess what, you don’t have to start from scratch.

Transfer learning is an efficient and practical way to fine-tune an existing open source language model and produce a custom LLM. If your organization has specific data tenancy needs, or requires a model with knowledge of a more specific domain or task, it may be worth it to build your own LLM!

Creating a Movie Reviews Corpus

In order to train or fine-tune an LLM, you need a training corpus. Obtaining a representative corpus is sneakily the most difficult part of modeling text.

If you’re aiming to build intellectual property, this is where you’d want to bring in your in-house document cache — news feeds, blog posts, emails, documents, SharePoints, PowerPoints, all the points. Make sure you budget enough time for this step; it takes longer than you think.

For this post, we’ll save time by using a pre-curated dataset of 100k IMDB movie reviews, originally created for sentiment analysis. We will fine-tune an existing open source model on this specialized corpus, and then test whether it performs better than the base GPT, BERT, or RoBERTa models on NLP tasks involving user movie reviews.

Ingesting the Data

A corpus is a collection of documents. In our example, a document is a single movie review. In Ensign, creating a corpus of documents is equivalent to publishing a series of events to a topic. Before ingesting anything, you need to create an Ensign project.

  1. Create an Ensign account at rotational.app.
  2. Create a new Ensign project (e.g. movie-assistant).
  3. Create a topic within the new project (e.g. movie-reviews-text).
  4. Create an API key with publish and subscribe permissions (don’t forget to download it).

The next step is to install the package dependencies: datasets is a helper for downloading datasets from HuggingFace, and pyensign is the Ensign Python SDK.

$ pip install datasets
$ pip install "pyensign[ml]"

Next we’ll write some Python code to stream the documents into Ensign. Each document will be a separate event, and we’ll include the sentiment label and the dataset split in case we need them later. The code below is meant to be run in a Python notebook like Jupyter, which supports top-level await. Remember to substitute the Client ID and Client Secret from your API key!

import json
from datasets import load_dataset
from pyensign.events import Event
from pyensign.ensign import Ensign

# Helper function to write events to the topic store
def write_events(split):
    for record in load_dataset("imdb", split=split, streaming=True):
        data = {
            "text": record["text"],
            "label": record["label"],
            "split": split,
        }
        yield Event(
            json.dumps(data).encode("utf-8"),
            mimetype="application/json",
            schema_name="reviews-raw",
            schema_version="0.1.0"
        )

ensign = Ensign(
    client_id="<your client ID>",
    client_secret="<your client secret>"
)

TEXT_TOPIC = "movie-reviews-text"
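# Publish each dataset split as its own stream of events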
for event in write_events("train"):
    await ensign.publish(TEXT_TOPIC, event)

for event in write_events("test"):
    await ensign.publish(TEXT_TOPIC, event)

for event in write_events("unsupervised"):
    await ensign.publish(TEXT_TOPIC, event)
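A quick aside: hardcoding credentials in a notebook makes them easy to leak. A minimal alternative, assuming you’ve exported the values yourself (the environment variable names below are just examples, not anything PyEnsign requires), is to read them with os.environ before constructing the client:

import os

from pyensign.ensign import Ensign

# Read the API key from environment variables you've exported yourself;
# the variable names are arbitrary examples.
ensign = Ensign(
    client_id=os.environ["ENSIGN_CLIENT_ID"],
    client_secret=os.environ["ENSIGN_CLIENT_SECRET"],
)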

After running the code, you should see 100k events published to Ensign on the topic dashboard. Another way to confirm that the ingestion worked is to run a query and convert the events into a DataFrame.

from pyensign.ml.dataframe import DataFrame

cursor = await ensign.query("SELECT * FROM movie-reviews-text")
df = await DataFrame.from_events(cursor)
df.head()

[Figure: preview of the movie reviews DataFrame]

The DataFrame should contain the three splits in the dataset (train, test, unsupervised).

print("train", len(df[df["split"] == "train"]))
print("test", len(df[df["split"] == "test"]))
print("unsupervised", len(df[df["split"] == "unsupervised"]))

train 25000
test 25000
unsupervised 50000
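It’s also worth eyeballing a few raw reviews to confirm the text survived the JSON round trip. Assuming from_events expands the payload into text, label, and split columns (which the split counts above suggest), a quick sample looks like this:

# Print the first few training reviews (truncated) along with their labels
sample = df[df["split"] == "train"].head(3)
for label, text in zip(sample["label"], sample["text"]):
    print(label, text[:100])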

To be continued…

Now that you have a custom text corpus in Ensign, in the next module you will train an LLM using semi-supervised learning!
