Simple Data Analysis and Artificial Intelligence 🤖
Welcome to this new project where we will use AI (more specifically, large language models like ChatGPT, Gemini, and Claude) to help us analyze data! These models are remarkably good at extracting, reformatting, cleaning, and categorizing data.
Back in March 2025, I attended a great session at NICAR by Ben Welsh and Derek Willis. They showed us how to categorize claim expenses using these fascinating tools.
Their project (in Python) was eye-opening. So, I decided to implement new methods in the Simple Data Analysis (SDA) library (give it a ⭐!) to make great use of LLMs. I’ll show you how to use these methods with a Google AI Studio API key, but they also work with Vertex AI and Ollama.
Ben kindly agreed to let me reproduce his tutorial, but this time in TypeScript! We will use Deno and VS Code. Check the Setup lesson if needed. However, you can do this with the JS/TS runtime and code editor of your choice!
If you get stuck at any point in this project, it might be helpful to review the previous lessons explaining the basics of SDA.
Let’s code!
Setup
To set up everything we need, let’s use setup-sda as in previous lessons.
Create a new folder, open it with VS Code, and run: `deno -A jsr:@nshiab/setup-sda`. The `-A` flag grants Deno all permissions.
Getting an API key
In this lesson, we are going to use the Google LLMs. To interact with them from our code, we need an API key. Go to Google AI Studio, create an account if you don't already have one, and click on `Get API key`.
Then click on `Create an API key`. I don't know why my page is half in French, but it should work anyway. 😅
Follow the suggested steps. If asked about using a Google Cloud project, select Gemini if available; otherwise, create a new one.
Once you have copied your new API key, create a new `.env` file in your project and paste it there. Also, add the model we will use, which is `gemini-2.0-flash-lite`.
```
AI_KEY=your_api_key_here
AI_MODEL=gemini-2.0-flash-lite
```
`.env` is a file we usually use for environment variables, which are often credentials that should not be shared. When you set everything up with `setup-sda`, this file is automatically added to `.gitignore`, so it won't be committed by Git and won't be pushed to GitHub.
Now, we just need to load these variables into memory.
Install the `@std/dotenv` library from the Deno team by running `deno add jsr:@std/dotenv`.
Then add `import "@std/dotenv/load";` at the top of your `sda/main.ts` file. Now, every time you run this code, Deno will automatically load the variables in your `.env` file!
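If you want to quickly confirm that the variables are loaded, here's a minimal sanity check (not part of the project code) using `Deno.env.get`:

```ts
// A minimal sanity check (not part of the tutorial's code): if the .env
// file was loaded, this prints the model name from your .env file.
import "@std/dotenv/load";

console.log(Deno.env.get("AI_MODEL")); // e.g. "gemini-2.0-flash-lite"
```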
And we are ready to go! 🚀
About this API…
Before we actually run some code, there is something I want to show you.
In Google AI Studio, after clicking on `Get API key`, you have access to your `API Plan Billing`. Select the model we are going to use: Gemini 2.0 Flash Lite.
There, you’ll see that you have a free account that allows you to make 30 requests per minute with this model (this might have changed since I wrote this).
And you’ll also notice that… the data you are going to send will be used by Google! So, be extra careful: don’t send sensitive data through this API!
As you might already know, nothing is truly free in life… 🥲
If you have a pro or enterprise account, they state that they won’t use your data to improve their products. But if you need total privacy, I’ll show you at the end how to use Ollama to run models locally on your machine.
The data
The goal of the project is to categorize claim expenses. We are going to use the CSV sample data from Ben’s tutorial. It’s hosted on GitHub and contains 250 rows.
Ben has manually categorized the `payee` column. We will ask our model to do the same exercise, and then we will compare the human and AI results.
To retrieve the data, we create a new table `data` and cache the results of `loadData`, which fetches the data from GitHub. Since the data won't change, caching it makes sense. It will be stored locally in the `.sda-cache` folder. When you set everything up with `setup-sda`, this hidden folder is added to your `.gitignore`.
We also log the table to the terminal.
Run `deno task sda` in your terminal to execute and watch this code. Each time we modify and save the code, it will automatically run. Handy!
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB();
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.logTable();
await sdb.done();
If you want to delete the cache, you can manually remove `.sda-cache` or run `deno task clean` in your terminal.
Sending data to the model
To send data and instructions to the model, we need to use the `aiRowByRow` method on our table. If you're curious, you'll find the documentation here.
Its usage is quite simple:
- First, pass the column containing the data you want to send to the AI (here, it's `payee`).
- Then, provide the name of the new column that will be created to store the AI results (here, we name it `categoryAI`).
- And finally, type in your prompt. Here, we ask the LLM to categorize the `payee`.
The fourth argument is optional. It’s an object with a few options:
- `rateLimitPerMinute` will ensure we respect our API quota. If we send requests too fast, the method will wait the necessary duration to avoid requests being rejected.
- `verbose` will log extra information while processing the data.
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB();
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.logTable();
await sdb.done();
If you run this code, you’ll see your prompt plus extra instructions added by the method.
The `verbose` option adds useful extra information. The tokens represent the amount of information sent to and received from the AI. Based on this, the method gives you an estimated cost for the request if you have a pay-as-you-go account.
Right now, the data is being classified by the AI model, but… it’s a bit slow.
Because we have 250 rows of data and we can send only 30 requests per minute, classifying the whole dataset will take over 8 minutes (250 rows ÷ 30 requests per minute ≈ 8.3 minutes)!
We can surely speed this up!
Speeding up the process
Batches
Instead of sending one row at a time, we could send a batch of them!
In the code below, we use the `batchSize` option to send 10 rows at a time. That's only 25 requests for our 250 rows, so processing the whole dataset now takes less than a minute!
This is great, but we can push this a bit further…
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB();
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
batchSize: 10,
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.logTable();
await sdb.done();
Concurrency
One of the many amazing features of TypeScript and JavaScript is concurrency. Instead of making one request to the API at a time, you send many of them simultaneously and then wait for the results in parallel.
Taking advantage of this feature of the language can provide an incredible speed boost to our code!
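Here is a minimal sketch of the idea, outside of SDA (the URLs are made up):

```ts
// A minimal sketch of concurrency in JavaScript/TypeScript (the URLs are
// hypothetical). Sequential requests take the sum of all response times;
// concurrent requests take roughly the time of the slowest one.
const urls = ["https://example.com/a", "https://example.com/b"];

// Sequential: one request at a time.
for (const url of urls) {
  await fetch(url);
}

// Concurrent: fire all requests at once, then wait for all of them.
await Promise.all(urls.map((url) => fetch(url)));
```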
Since we have 250 rows of data, we could send batches of 17 rows with 15 concurrent requests (15 × 17 = 255, enough to cover all 250 rows in a single concurrent wave) and process all of our data in less than… 2 seconds! These 15 requests also keep us below the quota limit of 30 requests per minute, which is important.
One piece of advice: to run this code, make sure you haven’t sent requests for at least a minute; otherwise, you might hit the quota limit of the free plan.
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB({ logDuration: true });
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
batchSize: 17,
concurrent: 15,
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.logTable();
await sdb.done();
Caching
There is one last trick we can use to make our code more efficient: caching.
We could store the results sent by the AI in files locally. If we rerun our code and if the prompt/data have not changed, we could use what we stored on our machine instead of waiting again for the AI model.
This is exactly what the `cache` option allows you to do.
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB({ logDuration: true });
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
batchSize: 17,
concurrent: 15,
cache: true,
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.logTable();
await sdb.done();
If you run this code, you'll notice a new hidden folder, `.journalism-cache`. The files in it are JSON files storing the AI responses. It works like the previous `.sda-cache` folder but is created by the `askAI` function from my `journalism` library, which the `aiRowByRow` method uses to interact with the AI model.
On the first run, the requests are made, and the script takes the same time to run as before.
But if you rerun this script (`CTRL` + `S` in `main.ts` should do the trick), it's now much faster! It takes only a few milliseconds! Instead of making the requests, the code retrieves the stored data on your machine.
Of course, if you change the prompt or the data sent to the model, the method will know it has to make new requests and store new results. But if you don't touch the `aiRowByRow` arguments that modify the requests, you can safely work on other things at full speed, without burning through your API quota!
If you want to delete the cache, you can manually remove `.journalism-cache` or run `deno task clean` in your terminal.
Testing the output
Tests
While they are getting better at an incredible pace, LLMs don’t always understand your instructions properly. And when you are making dozens, hundreds, or thousands of requests, there is always a chance they could return something unexpected.
This is why testing is important.
In the code below, we use a `test` (a simple function) to check whether the returned category for each entry falls into the expected categories. If there is an unexpected category, we throw an error, and with the `retry` option, we specify that we want to retry five times before giving up and stopping everything.
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB({ logDuration: true });
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
batchSize: 17,
concurrent: 15,
cache: true,
test: (dataPoint) => {
if (typeof dataPoint !== "string") {
throw new Error("Not a string");
}
if (!["Restaurant", "Bar", "Hotel", "Other"].includes(dataPoint)) {
throw new Error("Unexpected category");
}
},
retry: 5,
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.logTable();
await sdb.done();
On my end, no errors have been thrown. This is looking good!
Accuracy
Since we tested the results, we know we have the proper categories, but have they been properly attributed?
It’s time to check the accuracy of the model!
Since Ben already classified the payee in the `category` column and we put the AI results in the `categoryAI` column, let's use a few SDA methods to do that!
First, let’s create a new column to check if the AI returned the correct answer, assuming Ben was right.
Using this column, we can then log all occurrences of different answers between Ben and the AI.
We can also calculate the proportion of correct answers from the AI.
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB({ logDuration: true });
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
batchSize: 17,
concurrent: 15,
cache: true,
test: (dataPoint) => {
if (typeof dataPoint !== "string") {
throw new Error("Not a string");
}
if (!["Restaurant", "Bar", "Hotel", "Other"].includes(dataPoint)) {
throw new Error("Unexpected category");
}
},
retry: 5,
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.addColumn("correctCategory", "boolean", `category === categoryAI`);
await data.logTable({
conditions: `correctCategory === false`,
nbRowsToLog: "all",
});
await data.summarize({
categories: "correctCategory",
});
await data.proportionsVertical("count", "perc");
await data.logTable();
await sdb.done();
And… here’s the result!
If we check the rows with different answers between Ben and the AI, we can notice that… Ben made a few mistakes!
For example, El POLLO INKA and ELLA DINNING ROOM should be restaurants. And FAIRMONT BATTERY WHARF should be a hotel. (If you don’t see these errors, it might be because Ben has corrected them!)
For the rest, since we don’t have the addresses, we’ll give Ben the benefit of the doubt.
This means that the model achieved an accuracy rate greater than 94%! Not bad considering how simple our instructions were and how fast it was processed.
Depending on what you are working on, maybe this level of accuracy is enough if you disclose it in your publication. One thing is certain, though: doing it by hand would have been much slower (or just impossible if there were millions of rows), and you would have made mistakes too!
If you are not satisfied with the accuracy, you can tweak your prompt, give examples in your instructions, preprocess the data, or just switch to another model. But at least, by checking the accuracy, you won’t be making changes blindly! You’ll know when things get better and when they get worse.
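For example, one common tweak is to add a few labeled examples to your instructions (few-shot prompting). A hypothetical variation could look like this (the categories come from the tutorial; the example payees are made up):

```ts
// A hypothetical few-shot version of the prompt. The categories are from
// the tutorial; the example payees are invented for illustration.
const prompt = `Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other.
Examples:
- "JOE'S PIZZA" => Restaurant
- "HILTON GARDEN INN" => Hotel
- "THE TIPSY CROW" => Bar
- "OFFICE DEPOT" => Other`;
```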
Conclusion
Here, we classified data with a Gemini model. But you can do much more than that, like cleaning, formatting, and extracting data with LLMs!
And by using the SDA `aiRowByRow` method, you can do this easily, in a few lines of code, at scale!
If you have sensitive data that you don't want to send to Google or another AI provider, you can also use Ollama. Just start it and pass the following variables in your `.env` file. SDA will automatically switch to the local model installed on your machine to process the data.
```
OLLAMA=true
AI_MODEL=gemma3:4b
```
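Note that Ollama must be installed and running on your machine, with the model pulled locally (for example with `ollama pull gemma3:4b`), before you run the script.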
Sometimes, lightweight open-source models have more difficulty directly returning arrays (which is the expected response format for `aiRowByRow`). For such cases, I implemented the `clean` option, which lets you directly modify the LLM response, for example by restructuring it into an array.
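As a rough sketch, assuming `clean` receives the raw model response and should return the cleaned value (check the `aiRowByRow` documentation for the exact signature), it could be used like this, reusing the `data` table from earlier:

```ts
// A hypothetical sketch of the `clean` option. We assume here that it
// receives the raw model response as a string and must return the cleaned
// value; check the aiRowByRow documentation for the exact signature.
await data.aiRowByRow(
  "payee",
  "categoryAI",
  `Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other.`,
  {
    batchSize: 10,
    clean: (response: string) => {
      // Small models sometimes wrap the array in extra text, so we
      // extract the first [...] block before parsing it as JSON.
      const match = response.match(/\[[\s\S]*\]/);
      if (!match) {
        throw new Error("No array found in the response");
      }
      return JSON.parse(match[0]);
    },
  },
);
```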
While LLMs are an amazing technology, remember that their results are not guaranteed. And, in my projects, the larger the batches, the worse the results tend to be…
But still, they can significantly boost your research and analysis. They are incredibly fast and easy to use.
So, if you try them in one of your projects, let me know!
And thank you again to Ben Welsh and Derek Willis who created the initial version of this tutorial.