Bulk Data

Ingesting bulk data for a dataset ID to the local filesystem

This notebook shows how to pull bulk data for a given dataset ID.

Prerequisites

  • The carbonarc SDK and other dependencies installed in a local Python environment:
python3.10 -m venv .venv
source .venv/bin/activate
pip install git+https://github.com/Carbon-Arc/carbonarc
pip install python-dotenv

Setup

  1. Create a .env file in the same directory as this notebook, or export the API_AUTH_TOKEN environment variable.
  2. If using .env, add the following line to it:
API_AUTH_TOKEN=<api auth token from (https://platform.carbonarc.co/profile)>

Import required dependencies

import os
from datetime import datetime
from dotenv import load_dotenv
from carbonarc import CarbonArcClient

# Load environment variables
load_dotenv()

API_AUTH_TOKEN = os.getenv("API_AUTH_TOKEN")

# Create API client
client = CarbonArcClient(API_AUTH_TOKEN)
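
Because os.getenv returns None when the variable is not set, a missing token would only surface later as an authentication error. A minimal sketch of an early check is shown below; require_token is a hypothetical helper written for this tutorial, not part of the carbonarc SDK.

```python
import os


def require_token(name: str = "API_AUTH_TOKEN") -> str:
    """Return the token stored in the environment variable `name`.

    Hypothetical helper (not part of the carbonarc SDK): raises a
    RuntimeError if the variable is missing or blank.
    """
    token = (os.getenv(name) or "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set; add it to .env or export it before running."
        )
    return token
```

With this in place, the client could be created as CarbonArcClient(require_token()), after load_dotenv() has run.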

List available datasets

datasets = client.data.get_datasets()

3 Get information for a given dataset

# Replace CA0000 with your dataset ID, e.g. CA0028 (Card - US Detailed Panel)
dataset = client.data.get_dataset_information(dataset_id="CA0000")

4 Fetch data manifest for incremental ingestion

You typically only want to fetch data created since the last ingestion:

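# Placeholder: in practice, use the timestamp of your previous ingestion run.
# Passing the current time will likely match few or no files.
last_ingest_time = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')
print(last_ingest_time)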

manifest = client.data.get_data_manifest(
    dataset_id="CA0000",  # Replace with your dataset ID, e.g. CA0028 (Card - US Detailed Panel)
    created_since=last_ingest_time,
)
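
For incremental ingestion to work across runs, the last ingestion time has to be stored somewhere between runs. The sketch below keeps it in a small text file, using the same timestamp format as above. The helper names and the state file are assumptions made for illustration and are not part of the carbonarc SDK.

```python
from datetime import datetime
from pathlib import Path

# Same format as the created_since value built above.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def load_last_ingest_time(state_file: Path, default: str) -> str:
    """Return the stored timestamp, or `default` if no run has been recorded."""
    if state_file.exists():
        return state_file.read_text().strip()
    return default


def save_last_ingest_time(state_file: Path, when: datetime) -> None:
    """Record `when` as the time of the most recent ingestion run."""
    state_file.write_text(when.strftime(TIMESTAMP_FORMAT))
```

A typical run would call load_last_ingest_time before fetching the manifest and save_last_ingest_time only after the download has finished, so that a failed run is retried from the same point.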

Manifest file structure

Each entry in the manifest's "datasources" list generally looks like this:

{
  "url": "https://storage.carbonarc.co/path/to/file.parquet",
  "format": "parquet",
  "records": 1000,
  "size_bytes": 123456789,
  "modification_time": "2025-04-15T23:04:44",
  "price": 123.45
}
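
Since each file carries a size and a price, it can be useful to total these before purchasing. The sketch below assumes every entry has the fields shown above and treats any missing field as zero. summarize_datasources is a hypothetical helper, not an SDK function.

```python
def summarize_datasources(datasources: list[dict]) -> dict:
    """Total the file count, records, bytes, and price across manifest entries.

    Assumes entries shaped like the example above; missing fields count as 0.
    """
    return {
        "files": len(datasources),
        "records": sum(d.get("records", 0) for d in datasources),
        "size_bytes": sum(d.get("size_bytes", 0) for d in datasources),
        "price": sum(d.get("price", 0.0) for d in datasources),
    }
```

It would be called as summarize_datasources(manifest["datasources"]), and the result inspected before moving on to the purchase step.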

5 Purchase the data files in the manifest

file_urls = [x["url"] for x in manifest["datasources"]]

order = client.data.buy_data(
    dataset_id="CA0000",  # Use the same dataset ID as in the manifest request
    file_urls=file_urls,
)

6 Download the purchased files

# Download the first purchased file into the current directory
client.data.download_file(
    file_id=order["files"][0],
    directory="./",
)
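
The call above fetches only the first file in the order. To fetch every purchased file, the same call can be repeated in a loop. The sketch below assumes order["files"] is a list of file IDs, as the indexing above suggests. download_all is a hypothetical helper, not part of the SDK.

```python
from pathlib import Path


def download_all(client, order: dict, directory: str = "./downloads") -> list:
    """Download every file in `order["files"]` into `directory`.

    Uses the same client.data.download_file call as the step above and
    returns the file IDs in the order they were requested.
    """
    Path(directory).mkdir(parents=True, exist_ok=True)  # create target folder
    downloaded = []
    for file_id in order["files"]:
        client.data.download_file(file_id=file_id, directory=directory)
        downloaded.append(file_id)
    return downloaded
```

Calling download_all(client, order) with the client and order from the earlier steps would place all purchased files under ./downloads.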