Web scraping with GitHub Actions
GitHub is a platform to store your code, organize your work, and collaborate with fellow coders, but… it’s also a place where you can run code! 🤩
GitHub Actions are mostly used to automatically run tests, builds, and deployments on projects. But for data wranglers like us, they can also be used to scrape data.
In this lesson, we reuse the code that fetches and extracts the latest temperature in Montreal from the previous lessons. But this time, we’ll use GitHub Actions to trigger it every hour and accumulate the data — for free!
If you want to see the final product, head over here.
I strongly encourage you to complete the two previous lessons (What is Git? and What is GitHub?) before diving into this one. I’ll assume you have a GitHub account and know the basic Git commands.
Setup
Create a new folder and open it with VS Code.
In it, create a main.ts file with the code below. Here’s what it does:
- It fetches the latest weather data in Montreal from the Meteorological Service of Canada API.
- The data is in XML format. We use the library `fast-xml-parser` to convert it to JSON.
- Then we extract the temperature and the time from the nested JSON. We also create a variable with the current time of the scrape.
- Finally, we append the scraped data to `data.csv`.
Each time this script runs, a new row is added to the CSV. So the data is accumulated over time.
import { XMLParser } from "fast-xml-parser";

// Fetch the latest observations for the Montreal station (XML format).
const response = await fetch(
  "https://dd.weather.gc.ca/today/observations/swob-ml/latest/CWTQ-AUTO-minute-swob.xml",
);
const xml = await response.text();

// Convert the XML to JSON. We keep the attributes because they hold the values we need.
const json = new XMLParser({ ignoreAttributes: false }).parse(xml);

// Extract the observation from the nested JSON.
const observation =
  json["om:ObservationCollection"]["om:member"]["om:Observation"];

// The time of the observation.
const resultTime =
  observation["om:resultTime"]["gml:TimeInstant"]["gml:timePosition"];
console.log(resultTime);

// The temperature is stored as an attribute of one of the result elements.
const elements = observation["om:result"]["elements"]["element"];
type element = {
  "@_name": string;
  "@_value": string;
};
const temp = elements
  .find((d: element) => d["@_name"] === "air_temp")["@_value"];
console.log(temp);

// The current time, when this script scraped the data.
const scrapeTime = new Date().toISOString();
console.log(scrapeTime);

// Append a new row to data.csv.
await Deno.writeTextFile(
  "./data.csv",
  `\n${scrapeTime},${resultTime},${temp}`,
  {
    append: true,
  },
);

main.ts needs a data.csv file, so let’s create one in the project folder.
scrapeTime,resultTime,temp

We will push this project to GitHub, so let’s add a README.md to the project to make it more presentable.
# Montreal temperature scraper
Every hour, this project retrieves the latest temperature in Montreal
from the [Meteorological Service of Canada API](https://eccc-msc.github.io/open-data/msc-data/obs_station/readme_obs_insitu_swobdatamart_en/)
and appends a new row to the `data.csv` file.

Let’s take this for a run!
First, install fast-xml-parser:
deno add npm:fast-xml-parser
Then run main.ts:
deno -A main.ts
And you should see the temperature being logged and a row added to the data.csv file!
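For reference, the output will look something like this (the values below are made up for illustration; yours will depend on the weather and the time of your run):

2025-01-15T17:58:00.000Z
-8.2
2025-01-15T18:00:12.345Z

And data.csv will now hold a second line:

scrapeTime,resultTime,temp
2025-01-15T18:00:12.345Z,2025-01-15T17:58:00.000Z,-8.2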

Since this screenshot was taken, the URL for the Meteorological Service of Canada API has changed. Make sure to use the URL provided in the code block at the beginning of the lesson.
Creating a new GitHub repository
Let’s initialize a new Git repository and push it to GitHub.
First, we initialize and make a first commit:
git init
git add -A
git commit -m "First commit"
Then, create a new repository on GitHub.

Give it a name and a description (optional). You can make it private or public, as you wish.
You don’t need a README because we created one. You don’t need a .gitignore and you don’t need a license for now.
Then hit Create repository.

Copy the commands to push an existing repository from the command line.

And run these commands in your project terminal.

Since this screenshot was taken, the URL for the Meteorological Service of Canada API has changed. Make sure to use the URL provided in the code block at the beginning of the lesson.
Now, if you refresh your GitHub repository page, you’ll see your code and the README.md below it.

Adding a workflow
While it’s called GitHub Actions, you are not creating actions but workflows. And you can see your workflows in the Actions section.

Right now, there are no workflows in our repository. GitHub suggests a few for us. Feel free to explore, but we are going to add one manually.

For GitHub to automatically detect workflows in the repository, we need to create YAML files in .github/workflows/.
YAML stands for YAML Ain’t Markup Language — but to me, it still feels a bit like markup… just stricter. Indentation matters, for example. It’s very common for configuration files. It’s not complicated and we will write some together, so don’t worry about it.
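Here’s a tiny, generic example (not our workflow yet, just an illustration of the syntax): the indentation tells YAML that port and hosts belong to server, and the dashes mark items in a list.

server:
  port: 8080
  hosts:
    - example.com
    - example.org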
Let’s create new folders and a new file .github/workflows/scrape.yml. Let’s keep it empty for now.
Commit it and push it to GitHub:
git add -A
git commit -m "Adding scrape.yml"
git push

After a few seconds, if you refresh the Actions page of your repo, you’ll see your workflow automatically detected by GitHub! But since the file is empty, it’s failing.

Writing a workflow
Let’s do that step by step.
Name and events
When you create a workflow, you first need to give it a name and decide when it should be triggered.
Below, we tell GitHub to trigger this workflow in three situations:
- `push` means when we push new code to the `main` branch. Note the indentation below: it indicates hierarchy and grouping.
- `workflow_dispatch` means we will be able to click on a button to manually trigger the workflow. Always handy.
- `schedule` sets a cron job, which is a way to run scripts on a schedule. Here, `0 * * * *` means the workflow will run at the start of every hour (minute `0`).
name: Scrape Montreal temperature
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"

Cron jobs are widely used, but their syntax is not intuitive. I usually use crontab.guru to make sense of them. Also, it’s important to note that cron jobs in GitHub workflows are not perfectly accurate. When your workflow needs to run, GitHub pushes it to a queue where it waits for a virtual machine to become available. So it might actually run a few minutes late. For most web scraping tasks, that’s fine. But if you need to trigger critical scripts at very specific times, GitHub Actions might not be for you.
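If you ever need a different schedule, here are a few common patterns (just examples, not part of our workflow; note that GitHub Actions cron times are in UTC):

schedule:
  # Every 30 minutes
  - cron: "*/30 * * * *"
  # Every day at 06:00 UTC
  - cron: "0 6 * * *"
  # Every Monday at 09:00 UTC
  - cron: "0 9 * * 1"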
Jobs
Now we need to describe the jobs we want the workflow to accomplish. You can have multiple jobs in a workflow. Here, we have just one. We name it scrape.
We tell GitHub we want it to run on a virtual machine with the latest Ubuntu distribution. Ubuntu machines are the cheapest option, which is perfectly fine for us. But you have other choices if you ever need something different.
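For example, GitHub also provides macOS and Windows machines. If you ever need one, you could swap in another runner label (be aware that macOS and Windows runners use up your free minutes faster):

runs-on: macos-latest # or windows-latest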
Our main.ts script writes the freshly scraped data to data.csv, so we need to give the workflow permission to write and modify the files of the repository.
name: Scrape Montreal temperature
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write

Steps
The only thing left to do is to tell GitHub the steps of this job. Each step has a name:
- `Checking out the repo` tells the workflow to actually access the repository. We use the pre-coded `actions/checkout@v4` for that. There are a lot of actions available for you to use, especially for common tasks. It’s nice not to have to code everything yourself! 🥳
- `Set up Deno` installs Deno. We use an available action for that too. We specify that we want version 2 of Deno.
- `Run main.ts` runs our script, just like we would locally.
name: Scrape Montreal temperature
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checking out the repo
        uses: actions/checkout@v4
      - name: Set up Deno
        uses: denoland/setup-deno@v2
        with:
          deno-version: v2.x
      - name: Run main.ts
        run: deno -A main.ts

Saving the data
Now, if you commit your workflow and push it to GitHub, it will fetch the data and write it to data.csv, but… it won’t actually save data.csv!
Remember: this is running on a virtual machine. The workflow runs and writes a new row to data.csv, but then the machine is destroyed and removed. Anything written to its disk disappears with it…
This is why we have a final step to commit the changes. We create a Git user named Automated that commits and pushes the changes to data.csv. Isn’t that brilliant? 🙌
name: Scrape Montreal temperature
on:
  push:
    branches:
      - main
  workflow_dispatch:
  schedule:
    - cron: "0 * * * *"
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checking out the repo
        uses: actions/checkout@v4
      - name: Set up Deno
        uses: denoland/setup-deno@v2
        with:
          deno-version: v2.x
      - name: Run main.ts
        run: deno -A main.ts
      - name: Commit changes
        run: git config user.name "Automated" && git config user.email "actions@users.noreply.github.com" && git add data.csv && git commit -m "Automated update" && git push

Save the file and push it to GitHub:
git add -Agit commit -m "Updating workflow"git push
And now, if you refresh your repository’s Actions page, you’ll see your workflow running properly!

If you click on it, you’ll see the jobs. Here, there’s just one: scrape.

And if you click on the job, you’ll see its steps and can click on them. This is useful when something is failing — you’ll be able to see the error messages here.

If you go back to the home page of the repository, you’ll see that the latest commit has been made by the Automated Git user we set up in the workflow.

And if you check the data.csv file, you’ll see a second row has been added!

You can even go one step further and check the commit to see the actual change.
By the way, this is a really great way to check what changed in web pages, data files, and more. If you scrape entire files, GitHub will be able to tell you what changed and when. You’ll have a full, detailed history thanks to commits.
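You can explore this history from the command line too. Here’s a quick sketch using standard Git commands (run from your project folder):

git log --oneline -- data.csv
git diff HEAD~1 HEAD -- data.csv

The first command lists every commit that touched data.csv; the second shows exactly what the latest commit changed.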

So far, the workflow has run because we pushed to main. But from now on, it will also run automatically every hour thanks to our cron job!
Before GitHub Actions, you would have needed to pay for and set up a server to scrape this information on a schedule. But now, you can do it for free with just one simple YAML file! 💃🕺
Retrieving the data
The committed data is on GitHub. To retrieve it in your local repo, just pull it:
git pull
Note that I went for lunch before taking this screenshot — so I have three lines! The scraper ran while I was away! 😄

If your repository is public, you can also fetch the CSV data easily. This is useful when you want to work with up-to-date data in another project. Click on the Raw button to get a URL to the actual file.

Now you can directly fetch this URL.
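For a quick test in a script, a plain fetch is enough. Here’s a minimal sketch (using the URL of my repository; swap in your own):

const response = await fetch(
  "https://raw.githubusercontent.com/nshiab/temperature-scraper/refs/heads/main/data.csv",
);
const csv = await response.text();
console.log(csv);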

You can use this URL to fetch the latest data and process it — for example, with the Simple Data Analysis library.
import { SimpleDB } from "@nshiab/simple-data-analysis";

const sdb = new SimpleDB();

// Load the always-up-to-date CSV straight from GitHub.
const temperatures = sdb.newTable();
await temperatures.loadData(
  "https://raw.githubusercontent.com/nshiab/temperature-scraper/refs/heads/main/data.csv",
);

// Log an overview of the table, then close the database.
await temperatures.logTable();
await sdb.done();
Manually triggering a workflow
Right now, our scrape is triggered when we push new code to main and every hour, at the top of the hour.
But since we added workflow_dispatch in our configuration, we can also trigger it manually by clicking on it from the Actions page.

Disabling a workflow
To remove a workflow, you need to actually delete the YAML file from the repository and commit/push the change.
But sometimes, you just want to temporarily disable it — for example, while fixing something in the code, because you don’t need the data at the moment, or because you’ve hit the limit of your free GitHub account! 😬
To disable a workflow while keeping the configuration file in the repository, click on the workflow in the left sidebar, then on the ... button in the top right corner, and finally on Disable workflow.
Don’t worry — you’ll be able to enable it again later without any problem.

Conclusion
Congratulations! You now know how to create simple workflows. I made my repository public, so feel free to fork it for your future projects.
You can use this to scrape data — but also for so many other things! This is code, so the possibilities are endless! With what you learned in this lesson, you’ll be able to understand and replicate more sophisticated workflows you find around.
I hope you found this lesson interesting. But we haven’t explored everything GitHub has to offer… In the next lesson, we’ll discover GitHub Pages, which allows you to create websites for… wait for it… free! And since you now know about GitHub Actions, you’ll be able to republish your web pages automatically each time you push code to the main branch. 🤓
See you in the next lesson!