Accessing Bulk Datasets via the Python SDK

This guide walks you through authenticating, retrieving, and downloading bulk data files from Carbon Arc using the Python SDK.


Prerequisites

Make sure you have:

  • Python 3.10+
  • The carbonarc SDK and python-dotenv installed
python3.10 -m venv .venv  
source .venv/bin/activate
pip install git+https://github.com/Carbon-Arc/carbonarc
pip install python-dotenv

Environment Setup

Create a .env file in your working directory, or export API_AUTH_TOKEN in your environment. You can find your API auth token at https://platform.carbonarc.co/profile.


API_AUTH_TOKEN=<your API auth token>
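If you prefer not to use a .env file, the same variable can be exported in your shell instead (the token value below is a placeholder):

```shell
# Export the token for the current shell session;
# replace the placeholder with your real token.
export API_AUTH_TOKEN="your-token-here"
```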

Initialize the Client

Import the required dependencies and create a client instance:

# Import required dependencies
import os
from dotenv import load_dotenv
from carbonarc import CarbonArcClient
import pandas as pd

# Load API_AUTH_TOKEN from .env (or the existing environment)
load_dotenv()
HOST = "https://api.carbonarc.co"
API_AUTH_TOKEN = os.getenv("API_AUTH_TOKEN")

ca = CarbonArcClient(host=HOST, token=API_AUTH_TOKEN)

Browse available bulk datasets and retrieve information

List all available bulk datasets:

## List datasets
datasets = ca.data.get_datasets()
datasets
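The exact shape of the response depends on the SDK version, so inspect your own output first. As a sketch, assuming each entry is a dict with `dataset_id` and `name` keys (an assumption, not confirmed by the SDK docs), you can filter the list in plain Python:

```python
# Illustrative data only -- a stand-in for what get_datasets() might return;
# check the real response in your session to confirm the keys.
datasets = [
    {"dataset_id": "CA0042", "name": "Card - EU Detailed Panel"},
    {"dataset_id": "CA0007", "name": "Example Dataset"},
]

# Pick out datasets whose name mentions "Card"
card_datasets = [d for d in datasets if "Card" in d.get("name", "")]
print(card_datasets)
```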
Retrieve information for a given dataset ID.

Select a bulk dataset and retrieve its information. This example shows the Card - EU Detailed Panel dataset.

## Get information for a given dataset
dataset = ca.data.get_dataset_information(dataset_id="CA0042")
dataset

Retrieve manifest for dataset

Retrieve the manifest of available files for a dataset. To filter by logical_date, pass a tuple with an operator and the date value.

Example below uses logical_date to view a manifest of files in 01/2025

dm = ca.data.get_data_manifest("CA0042", logical_date=("==", "202501"))
dm

The logical_date filter supports the following:

  • logical_date (tuple): A tuple in the format (operator, YYYYMM).
  • Supported operators: ==, <, <=, >, >=
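As a sketch, a small helper (not part of the SDK) can validate the tuple before you pass it to get_data_manifest:

```python
VALID_OPS = {"==", "<", "<=", ">", ">="}

def check_logical_date(filter_tuple):
    """Hypothetical helper: validate an (operator, "YYYYMM") tuple."""
    op, value = filter_tuple
    if op not in VALID_OPS:
        raise ValueError(f"unsupported operator: {op!r}")
    if len(value) != 6 or not value.isdigit():
        raise ValueError(f"expected a YYYYMM string, got: {value!r}")
    return filter_tuple

# Valid filters pass through unchanged
print(check_logical_date((">=", "202501")))
```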

Purchase a specific file from a Dataset

If you have already inspected the manifest and picked a specific file to download, you can use its file_url to purchase that file only.

dataset_id = "CA0042"
file_url = "/CA0042/00211-815-51afdecf-c129-4a6c-969a-e4dbd5c857fb-00002?drop_partition_id=1750267738&logical_week=202504"

Buy the data

This function submits the request and purchases the data from the Carbon Arc API. The order is also logged in your order history.


ca.data.buy_data(dataset_id=dataset_id, file_urls=[file_url])

An example output looks like the following:

{
  'order_id': 'YOUR ORDER ID',
  'total_price': 58.0,
  'total_records_count': 30350,
  'file_urls': ['https://api.carbonarc.co/v2/library/data/files/YOURID']
}
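Using the sample response above, the file IDs needed for download can be pulled out of file_urls (the last path segment of each URL):

```python
# Sample order response (mirrors the example output above)
order = {
    "order_id": "YOUR ORDER ID",
    "total_price": 58.0,
    "total_records_count": 30350,
    "file_urls": ["https://api.carbonarc.co/v2/library/data/files/YOURID"],
}

# The last path segment of each URL is the file ID passed to download_file
file_ids = [url.split("/")[-1] for url in order["file_urls"]]
print(file_ids)  # ['YOURID']
```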

Download the Data

If you have a specific file you would like to download, the code below downloads it and converts it to a CSV.

import os
import pandas as pd

# List of full download URLs from your purchase
file_urls = ['https://api.carbonarc.co/v2/library/data/files/YOURIDHERE']

# Output directory
output_dir = "./downloads"
os.makedirs(output_dir, exist_ok=True)

# Loop through each file and download → convert
for url in file_urls:
    # Extract the file ID from the URL
    file_id = url.split("/")[-1]

    # Download the file using your client
    parquet_path = ca.data.download_file(file_id=file_id, directory=output_dir)

    # Convert to CSV
    df = pd.read_parquet(parquet_path)
    csv_path = parquet_path.replace(".parquet", ".csv")
    df.to_csv(csv_path, index=False)

    print(f"✅ Converted to CSV: {csv_path}")
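The str.replace call above works for simple file names; as an alternative, pathlib expresses the extension swap more explicitly:

```python
from pathlib import Path

# Illustrative path; in the download loop this would be the value
# returned by ca.data.download_file(...)
parquet_path = Path("./downloads/00211-815-example.parquet")

# with_suffix swaps only the final extension
csv_path = parquet_path.with_suffix(".csv")
print(csv_path.name)  # 00211-815-example.csv
```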

Alternatively, you can purchase and download one or more specific files from the manifest, for example with the following code:

import os
import pandas as pd

# Step 1: Set dataset ID and file(s) to buy (from manifest)
dataset_id = "CA0042"  # example dataset ID
file_urls_to_buy = [
    "/INSERT YOUR PARQUET FILE URL HERE",
    # Add more file paths from the manifest if needed
]

# Step 2: Purchase the data
order = ca.data.buy_data(dataset_id=dataset_id, file_urls=file_urls_to_buy)

# Step 3: Extract full download URLs
download_urls = order["file_urls"]

# Step 4: Prepare output directory
output_dir = "./downloads"
os.makedirs(output_dir, exist_ok=True)

# Step 5: Download each file and convert to CSV
for file_url in download_urls:
    try:
        # Extract file ID from full URL
        file_id = file_url.split("/")[-1]

        # Download to local disk
        parquet_path = ca.data.download_file(file_id=file_id, directory=output_dir)

        # Convert Parquet → CSV
        df = pd.read_parquet(parquet_path)
        csv_path = parquet_path.replace(".parquet", ".csv")
        df.to_csv(csv_path, index=False)

        print(f"✅ Converted to CSV: {csv_path}")

    except Exception as e:
        print(f"❌ Error processing {file_url}: {e}")

If you would like to buy and download every file in the manifest, you can use the example code below. It purchases all the files and downloads the Parquet files without converting them to CSV.

import os

# Step 1: Get manifest
dataset_id = "CA0042"
dm = ca.data.get_data_manifest(dataset_id)

# Step 2: Extract file_urls from the manifest
manifest_files = dm.get("manifest", [])
file_urls_to_buy = [entry["file_url"] for entry in manifest_files]

# Step 3: Buy all files
order = ca.data.buy_data(dataset_id=dataset_id, file_urls=file_urls_to_buy)

# Step 4: Extract download URLs
download_urls = order["file_urls"]

# Step 5: Set up output directory
output_dir = "./downloads"
os.makedirs(output_dir, exist_ok=True)

# Step 6: Download all Parquet files
for file_url in download_urls:
    try:
        file_id = file_url.split("/")[-1]
        parquet_path = ca.data.download_file(file_id=file_id, directory=output_dir)
        print(f"✅ Downloaded Parquet file: {parquet_path}")
    except Exception as e:
        print(f"❌ Error downloading {file_url}: {e}")
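After the loop finishes, a quick directory listing confirms what landed on disk (using the same output_dir as above):

```python
import os

output_dir = "./downloads"
os.makedirs(output_dir, exist_ok=True)

# Collect the Parquet files now present in the output directory
parquet_files = sorted(
    f for f in os.listdir(output_dir) if f.endswith(".parquet")
)
print(f"{len(parquet_files)} Parquet file(s) in {output_dir}")
```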

For more resources, visit: