Simple Data Analysis and Artificial Intelligence 🤖
Welcome to this new project where we will use AI (more specifically, large language models like ChatGPT, Gemini, and Claude) to help us analyze data! These models are remarkably good at extracting, reformatting, cleaning, and categorizing data.
Back in March 2025, I attended a great session at NICAR by Ben Welsh and Derek Willis. They showed us how to categorize claim expenses using these fascinating tools.
Their project (in Python) was eye-opening. So, I decided to implement new methods in the Simple Data Analysis (SDA) library (give it a ⭐!) to make great use of LLMs. I’ll show you how to use these methods with a Google AI Studio API key, but they also work with Vertex AI and Ollama.
Ben kindly agreed to let me reproduce his tutorial, but this time in TypeScript! We will use Deno and VS Code. Check the Setup lesson if needed. However, you can do this with the JS/TS runtime and code editor of your choice!
If you get stuck at any point in this project, it might be helpful to review the previous lessons explaining the basics of SDA.
Let’s code!
Setup
To set up everything we need, let’s use setup-sda as in previous lessons.
Create a new folder, open it with VS Code, and run: `deno -A jsr:@nshiab/setup-sda`. The `-A` flag grants Deno all permissions.
Getting an API key
In this lesson, we are going to use the Google LLMs. To interact with them from our code, we need an API key. Go to Google AI Studio, create an account if you don't already have one, and click on `Get API key`.
Then click on `Create an API key`. I don't know why my page is half in French, but it should work anyway. 😅
Follow the suggested steps. If asked about using a Google Cloud project, select Gemini if available; otherwise, create a new one.
Once you have copied your new API key, create a new `.env` file in your project and paste it there. Also, add the model we will use, which is `gemini-2.0-flash-lite`.
```
AI_KEY=your_api_key_here
AI_MODEL=gemini-2.0-flash-lite
```
`.env` is a file we usually use for environment variables, which are often credentials that should not be shared. When you set everything up with `setup-sda`, this file is automatically added to `.gitignore`, so it won't be committed by Git and won't be pushed to GitHub.
Now, we just need to load these variables into memory.
Install the `@std/dotenv` library from the Deno team by running `deno add jsr:@std/dotenv`.
Then add `import "@std/dotenv/load";` at the top of your `sda/main.ts` file. Now, every time you run this code, Deno will automatically load the variables in your `.env` file!
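If you want to quickly confirm that the variables are loaded, here's a minimal sanity check (not part of the project code) using `Deno.env.get`:

```ts
// A minimal sanity check (not part of the tutorial's code): if the .env
// file was loaded, this prints the model name from your .env file.
import "@std/dotenv/load";

console.log(Deno.env.get("AI_MODEL")); // e.g. "gemini-2.0-flash-lite"
```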
And we are ready to go! 🚀
About this API…
Before we actually run some code, there is something I want to show you.
In Google AI Studio, after clicking on `Get API key`, you have access to your `API Plan Billing`. Select the model we are going to use: Gemini 2.0 Flash Lite.
There, you’ll see that you have a free account that allows you to make 30 requests per minute with this model (this might have changed since I wrote this).
And you’ll also notice that… the data you are going to send will be used by Google! So, be extra careful: don’t send sensitive data through this API!
As you might already know, nothing is truly free in life… 🥲
If you have a pro or enterprise account, they state that they won’t use your data to improve their products. But if you need total privacy, I’ll show you at the end how to use Ollama to run models locally on your machine.
The data
The goal of the project is to categorize claim expenses. We are going to use the CSV sample data from Ben’s tutorial. It’s hosted on GitHub and contains 250 rows.
Ben has manually categorized the `payee` column. We will ask our model to do the same exercise, and then we will compare the human and AI results.
To retrieve the data, we create a new table `data` and cache the results of `loadData`, which fetches the data from GitHub. Since the data won't change, caching it makes sense. It will be stored locally in the `.sda-cache` folder. When you set everything up with `setup-sda`, this hidden folder is added to your `.gitignore`.
We also log the table to the terminal.
Run `deno task sda` in your terminal to execute and watch this code. Each time we modify and save the code, it will automatically run. Handy!
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB();
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.logTable();
await sdb.done();
If you want to delete the cache, you can manually remove `.sda-cache` or run `deno task clean` in your terminal.
Sending data to the model
To send data and instructions to the model, we need to use the `aiRowByRow` method on our table. If you're curious, you'll find the documentation here.
Its usage is quite simple:
- First, pass the column containing the data you want to send to the AI (here, it's `payee`).
- Then, provide the name of the new column that will be created to store the AI results (here, we name it `categoryAI`).
- And finally, type in your prompt. Here, we ask the LLM to categorize the `payee`.
The fourth argument is optional. It’s an object with a few options:
- `rateLimitPerMinute` will ensure we respect our API quota. If we send requests too fast, the method will wait the necessary duration to avoid requests being rejected.
- `verbose` will log extra information while processing the data.
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB();
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.logTable();
await sdb.done();
If you run this code, you’ll see your prompt plus extra instructions added by the method.
The `verbose` option adds useful extra information. The tokens represent the amount of information sent to and received from the AI. Based on this, the method gives you an estimated cost for the request if you have a pay-as-you-go account.
Right now, the data is being classified by the AI model, but… it’s a bit slow.
Because we have 250 rows of data and we can send only 30 requests per minute, classifying the whole dataset will take over 8 minutes (250 rows ÷ 30 requests per minute ≈ 8.3 minutes)!
We can surely speed this up!
Speeding up the process
Batches
Instead of sending one row at a time, we could send a batch of them!
In the code below, we use the `batchSize` option to send 10 rows at a time. That's only 25 requests for our 250 rows, so processing the whole dataset now takes less than a minute!
This is great, but we can push this a bit further…
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB();
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
batchSize: 10,
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.logTable();
await sdb.done();
Concurrency
One of the many amazing features of TypeScript and JavaScript is concurrency. Instead of making one request to the API at a time, you send many of them simultaneously and then wait for the results in parallel.
Taking advantage of this feature of the language can provide an incredible speed boost to our code!
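Here is a minimal sketch of the idea, outside of SDA (the URLs are made up):

```ts
// A minimal sketch of concurrency in JavaScript/TypeScript (the URLs are
// hypothetical). Sequential requests take the sum of all response times;
// concurrent requests take roughly the time of the slowest one.
const urls = ["https://example.com/a", "https://example.com/b"];

// Sequential: one request at a time.
for (const url of urls) {
  await fetch(url);
}

// Concurrent: fire all requests at once, then wait for all of them.
await Promise.all(urls.map((url) => fetch(url)));
```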
Since we have 250 rows of data, we could send batches of 17 rows with 15 concurrent requests (15 × 17 = 255, enough to cover all 250 rows in a single concurrent wave) and process all of our data in less than… 2 seconds! These 15 requests also keep us below the quota limit of 30 requests per minute, which is important.
One piece of advice: to run this code, make sure you haven’t sent requests for at least a minute; otherwise, you might hit the quota limit of the free plan.
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB({ logDuration: true });
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
batchSize: 17,
concurrent: 15,
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.logTable();
await sdb.done();
Caching
There is one last trick we can use to make our code more efficient: caching.
We could store the results sent by the AI in files locally. If we rerun our code and if the prompt/data have not changed, we could use what we stored on our machine instead of waiting again for the AI model.
This is exactly what the `cache` option allows you to do.
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB({ logDuration: true });
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
batchSize: 17,
concurrent: 15,
cache: true,
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.logTable();
await sdb.done();
If you run this code, you'll notice a new hidden folder, `.journalism-cache`. The files in it are JSON files storing the AI responses. It works like the previous `.sda-cache` folder but is created by the `askAI` function from my `journalism` library, which the `aiRowByRow` method uses to interact with the AI model.
On the first run, the requests are made, and the script takes the same time to run as before.
But if you rerun this script (`CTRL` + `S` in `main.ts` should do the trick), it's now much faster! It takes only a few milliseconds! Instead of making the requests, the code retrieves the stored data on your machine.
Of course, if you change the prompt or the data sent to the model, the method will know it has to make new requests and store new results. But if you don't touch the `aiRowByRow` arguments that modify the requests, you can safely work on other things at full speed, without burning through your API quota!
If you want to delete the cache, you can manually remove `.journalism-cache` or run `deno task clean` in your terminal.
Testing the output
Tests
While they are getting better at an incredible pace, LLMs don’t always understand your instructions properly. And when you are making dozens, hundreds, or thousands of requests, there is always a chance they could return something unexpected.
This is why testing is important.
In the code below, we use a `test` (a simple function) to check whether the returned category for each entry falls into the expected categories. If there is an unexpected category, we throw an error, and with the `retry` option, we specify that we want to retry five times before giving up and stopping everything.
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB({ logDuration: true });
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
batchSize: 17,
concurrent: 15,
cache: true,
test: (dataPoint) => {
if (typeof dataPoint !== "string") {
throw new Error("Not a string");
}
if (!["Restaurant", "Bar", "Hotel", "Other"].includes(dataPoint)) {
throw new Error("Unexpected category");
}
},
retry: 5,
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.logTable();
await sdb.done();
On my end, no errors have been thrown. This is looking good!
Accuracy
Since we tested the results, we know we have the proper categories, but have they been properly attributed?
It’s time to check the accuracy of the model!
Since Ben already classified the payee in the `category` column and we put the AI results in the `categoryAI` column, let's use a few SDA methods to do that!
First, let’s create a new column to check if the AI returned the correct answer, assuming Ben was right.
Using this column, we can then log all occurrences of different answers between Ben and the AI.
We can also calculate the proportion of correct answers from the AI.
import "@std/dotenv/load";
import { SimpleDB } from "@nshiab/simple-data-analysis";
const sdb = new SimpleDB({ logDuration: true });
const data = sdb.newTable("data");
await data.cache(async () => {
await data.loadData(
"https://raw.githubusercontent.com/palewire/first-llm-classifier/refs/heads/main/_notebooks/sample.csv",
);
});
await data.aiRowByRow(
"payee",
"categoryAI",
`Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other. `,
{
batchSize: 17,
concurrent: 15,
cache: true,
test: (dataPoint) => {
if (typeof dataPoint !== "string") {
throw new Error("Not a string");
}
if (!["Restaurant", "Bar", "Hotel", "Other"].includes(dataPoint)) {
throw new Error("Unexpected category");
}
},
retry: 5,
rateLimitPerMinute: 30,
verbose: true,
},
);
await data.addColumn("correctCategory", "boolean", `category === categoryAI`);
await data.logTable({
conditions: `correctCategory === false`,
nbRowsToLog: "all",
});
await data.summarize({
categories: "correctCategory",
});
await data.proportionsVertical("count", "perc");
await data.logTable();
await sdb.done();
And… here’s the result!
If we check the rows with different answers between Ben and the AI, we can notice that… Ben made a few mistakes!
For example, El POLLO INKA and ELLA DINNING ROOM should be restaurants. And FAIRMONT BATTERY WHARF should be a hotel. (If you don’t see these errors, it might be because Ben has corrected them!)
For the rest, since we don’t have the addresses, we’ll give Ben the benefit of the doubt.
This means that the model achieved an accuracy rate greater than 94%! Not bad considering how simple our instructions were and how fast it was processed.
Depending on what you are working on, maybe this level of accuracy is enough if you disclose it in your publication. One thing is certain, though: doing it by hand would have been much slower (or just impossible if there were millions of rows), and you would have made mistakes too!
If you are not satisfied with the accuracy, you can tweak your prompt, give examples in your instructions, preprocess the data, or just switch to another model. But at least, by checking the accuracy, you won’t be making changes blindly! You’ll know when things get better and when they get worse.
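For example, one common tweak is to add a few labeled examples to your instructions (few-shot prompting). A hypothetical variation could look like this (the categories come from the tutorial; the example payees are made up):

```ts
// A hypothetical few-shot version of the prompt. The categories are from
// the tutorial; the example payees are invented for illustration.
const prompt = `Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other.
Examples:
- "JOE'S PIZZA" => Restaurant
- "HILTON GARDEN INN" => Hotel
- "THE TIPSY CROW" => Bar
- "OFFICE DEPOT" => Other`;
```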
Conclusion
Here, we classified data with a Gemini model. But you can do much more than that, like cleaning, formatting, and extracting data with LLMs!
And by using the SDA `aiRowByRow` method, you can do this easily, in a few lines of code, at scale!
If you have sensitive data that you don't want to send to Google or another AI provider, you can also use Ollama. Just start it and pass the following variables in your `.env` file. SDA will automatically switch to the local model installed on your machine to process the data.
```
OLLAMA=true
AI_MODEL=gemma3:4b
```
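Note that Ollama must be installed and running on your machine, with the model pulled locally (for example with `ollama pull gemma3:4b`), before you run the script.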
Sometimes, lightweight open-source models have more difficulty directly returning arrays (which is the expected response format for `aiRowByRow`). For such cases, I implemented the `clean` option, which lets you directly modify the LLM response, for example by restructuring it into an array.
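As a rough sketch, assuming `clean` receives the raw model response and should return the cleaned value (check the `aiRowByRow` documentation for the exact signature), it could be used like this, reusing the `data` table from earlier:

```ts
// A hypothetical sketch of the `clean` option. We assume here that it
// receives the raw model response as a string and must return the cleaned
// value; check the aiRowByRow documentation for the exact signature.
await data.aiRowByRow(
  "payee",
  "categoryAI",
  `Categorize the payee into one of the following categories: Restaurant, Bar, Hotel or Other.`,
  {
    batchSize: 10,
    clean: (response: string) => {
      // Small models sometimes wrap the array in extra text, so we
      // extract the first [...] block before parsing it as JSON.
      const match = response.match(/\[[\s\S]*\]/);
      if (!match) {
        throw new Error("No array found in the response");
      }
      return JSON.parse(match[0]);
    },
  },
);
```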
While LLMs are an amazing technology, remember that their results are not guaranteed. And, in my projects, the larger the batches, the worse the results tend to be…
But still, they can significantly boost your research and analysis. They are incredibly fast and easy to use.
So, if you try them in one of your projects, let me know!
And thank you again to Ben Welsh and Derek Willis who created the initial version of this tutorial.