Data Curation: A Whale of a Problem

Once you become an AI/ML practitioner, you quickly realize that the machine learning work is often the least challenging step in the pipeline. So what’s still really, really hard? Getting good data! In this post, we’ll explore why that is and introduce a new open source tool for data curation, Baleen.

Practical use of AI has grown rapidly: AI companies and startups raised $33 billion during 2020.¹ Companies like Urbint use AI and real-world data to predict and prevent incidents that threaten critical infrastructure, workers, the community, and the environment.² Similarly, DataRobot is an AI Cloud leader that connects AI to production for organizations by providing a platform for all types of data, users, and environments.³ Natural language processing applications have also seen a surge; 60% of technology leaders report that their NLP budgets grew by at least 10% in 2021, while 33% reported a 30% increase, and 15% reported they had more than doubled.⁴

Although this accelerated movement toward AI, NLP, and machine learning has added value to how we interact with and perceive the world, we are regularly reminded to be cautious about the acceleration of technology. The fears of some detractors are difficult to take seriously, but even some of the most brilliant minds have expressed concerns:

We cannot quite know what will happen if a machine exceeds our own intelligence, so we can’t know if we’ll be infinitely helped by it, or ignored by it and sidelined, or conceivably destroyed by it. -Stephen Hawking ⁵

So what is it that we’re worried about? Data.

Why Data is the Real AI/ML Problem

Many machine learning models are built on general rather than custom datasets. GPT-3 (Generative Pre-trained Transformer 3) is an NLP model trained on a large text corpus to generate human-like text.⁶ It caused a lot of excitement, and newspapers like The Guardian published articles written by GPT-3.⁷ GPT-3 was considered the next big thing in the AI and NLP world because its uses seemed unlimited: generating code and regular expressions, cloning websites, generating object use cases, creating charts and plots, playing games, identifying paintings, producing quizzes, making memes, and more.⁸

However, GPT-3 soon sparked controversy: independent researchers found that, given the word “Muslim” as a prompt, 60% of the sentences GPT-3 generated included bombs and violence. Similarly, negative words were more strongly associated with “Black people”. OpenAI addressed the issue by stating that:

GPT-3, like all large language models trained on internet corpora, will generate stereotyped or prejudiced content⁹

The goal of training GPT-3 on a large, general dataset may have been to imbue the model with general intelligence, but unfortunately its creators also unknowingly fed toxic elements into the machine learning model. As a result, GPT-3 is ultimately limited by the data used to teach it to “speak”, much of it rife with racism, sexism, and religious prejudice. This is one of the key pitfalls of megamodels trained on uncurated text.

Are models better when trained on more specific data? GitHub’s Copilot may be one of the most interesting recent examples of AI trained on custom data. Copilot (powered by OpenAI Codex) is designed to help developers write code faster. It works by “understanding” the previous lines of code and instantly suggesting individual lines and whole functions¹⁰, and it is trained on code repositories on GitHub.¹¹

Getting Better Data for AI/ML

To make proper use of NLP, AI, and deep learning, we need to build quality models trained on quality data.

So how can you curate your own data to ensure your machine learning models have the best quality input? At Rotational, we approach NLP (and other ML problems) by first generating and storing custom datasets that allow for better machine learning models. This is where projects like Baleen can fill the gap.

Baleen is an example of an automated ingestion service that uses RSS feeds to construct a corpus for NLP research. Written in Golang, the ingestion system fetches RSS feeds and stores the raw data in S3. Baleen also collects data quality measurements in the form of language statistics, such as total words, vocabulary (unique words), hapaxes (words that occur only once in the corpus), rate of corpus growth, number of entities, etc.¹² Users can supply Baleen with their own custom list of RSS feeds. Note that because publishers and authors of the original articles hold the copyright to their works, data collected via Baleen should not be used for commercial purposes, only for educational purposes. Nonetheless, Baleen presents an example that others can use as a model for data collection and curation.
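
To make a few of those language statistics concrete, here is a minimal Python sketch that computes total words, vocabulary, and hapaxes for a tiny in-memory corpus. It is an illustration only, not Baleen's implementation (Baleen is written in Go): the `corpus_stats` function, the simple regex tokenizer, and the two sample sentences are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical tokenizer for illustration: runs of lowercase letters and
# apostrophes. A real pipeline would likely use a proper NLP tokenizer.
TOKEN_RE = re.compile(r"[a-z']+")


def corpus_stats(documents):
    """Compute simple language statistics over a list of text documents."""
    counts = Counter()
    for text in documents:
        counts.update(TOKEN_RE.findall(text.lower()))

    # Hapaxes are words that occur exactly once across the whole corpus.
    hapaxes = sorted(word for word, n in counts.items() if n == 1)

    return {
        "total_words": sum(counts.values()),  # every token, repeats included
        "vocabulary": len(counts),            # number of unique words
        "hapaxes": hapaxes,
    }


# Made-up sample documents standing in for ingested articles.
docs = [
    "The whale swims in the sea.",
    "A whale is not a fish.",
]

stats = corpus_stats(docs)
print(stats["total_words"])  # 12
print(stats["vocabulary"])   # 9
print(stats["hapaxes"])      # ['fish', 'in', 'is', 'not', 'sea', 'swims']
```

Tracking numbers like these as a corpus grows is one way to notice when new documents stop adding new vocabulary, which can be a sign that the feed list needs more variety.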

Photo by Art of Backpacking on Flickr Commons