Accessing Bulk Datasets via the Python SDK
This guide walks you through authenticating, retrieving, and downloading bulk data files from Carbon Arc using the Python SDK.
Prerequisites
Make sure you have:
- Python 3.10+
- The carbonarc SDK and python-dotenv installed
python3.10 -m venv .venv
source .venv/bin/activate
pip install git+https://github.com/Carbon-Arc/carbonarc
pip install python-dotenv
Environment Setup
Create a .env file in your working directory, or export API_AUTH_TOKEN in your environment. You can generate a token at https://platform.carbonarc.co/profile.
API_AUTH_TOKEN=<your API auth token>
Import Dependencies
Use the following to import the required dependencies and create the client:
# Import required dependencies
import os
from dotenv import load_dotenv
from carbonarc import CarbonArcClient
import pandas as pd
load_dotenv()
HOST = "https://api.carbonarc.co"
API_AUTH_TOKEN = os.getenv("API_AUTH_TOKEN")
ca = CarbonArcClient(host=HOST, token=API_AUTH_TOKEN)
Browse available bulk datasets and retrieve information
List all available bulk datasets:
## List datasets
datasets = ca.data.get_datasets()
datasets
Retrieve information for a given dataset ID.
Select a bulk dataset and retrieve its information. This example shows the Card - EU Detailed Panel data.
## Get information for a given dataset
dataset = ca.data.get_dataset_information(dataset_id="CA0042")
dataset
Retrieve manifest for dataset
Retrieve the manifest of available files for a dataset. To filter by logical_date, pass a tuple with an operator and the date value. The example below uses logical_date to view a manifest of the files for January 2025 (202501).
dm = ca.data.get_data_manifest("CA0042", logical_date=("==", "202501"))
dm
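The exact manifest schema can vary by dataset; as a sketch, assuming each entry carries a file_url key (the shape this guide relies on later when buying files), you could collect the file URLs like this. The entries below are hypothetical stand-ins for a real response:

```python
# Hypothetical manifest response, shaped like the structure used later in this guide
dm = {
    "manifest": [
        {"file_url": "/CA0042/part-00001.parquet", "logical_date": "202501"},
        {"file_url": "/CA0042/part-00002.parquet", "logical_date": "202501"},
    ]
}

# Collect the file URLs, e.g. to inspect or purchase later
file_urls = [entry["file_url"] for entry in dm.get("manifest", [])]
print(file_urls)
```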
The logical_date filter uses the following format:
- logical_date (tuple): A tuple in the format (operator, YYYYMM).
- Supported operators: ==, <, <=, >, >=
Purchase a specific file from a Dataset
If you have already inspected the manifest and picked a specific file, you can pass its file_url to purchase and download only that file.
dataset_id = "CA0042"
file_url = "/CA0042/00211-815-51afdecf-c129-4a6c-969a-e4dbd5c857fb-00002?drop_partition_id=1750267738&logical_week=202504"
Buy the data
This function submits the request and purchases the data from the Carbon Arc API. It also logs the order in your order history.
ca.data.buy_data(dataset_id=dataset_id, file_urls=[file_url])
An example response looks like the following:
{
    'order_id': 'YOUR ORDER ID',
    'total_price': 58.0,
    'total_records_count': 30350,
    'file_urls': ['https://api.carbonarc.co/v2/library/data/files/YOURID']
}
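The download steps below derive a file ID from each returned URL by taking its last path segment. Using the example response above ('YOURID' is a placeholder, not a real ID):

```python
# Example purchase response from the guide above ('YOURID' is a placeholder)
order = {
    "order_id": "YOUR ORDER ID",
    "file_urls": ["https://api.carbonarc.co/v2/library/data/files/YOURID"],
}

# The file ID is the last path segment of each download URL
file_ids = [url.split("/")[-1] for url in order["file_urls"]]
print(file_ids)  # ['YOURID']
```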
Download the Data
If you have a specific file you would like to download, use the code below to download it and convert it to a CSV.
import os
import pandas as pd
# List of full download URLs from your purchase
file_urls = ['https://api.carbonarc.co/v2/library/data/files/YOURIDHERE']
# Output directory
output_dir = "./downloads"
os.makedirs(output_dir, exist_ok=True)
# Loop through each file and download → convert
for url in file_urls:
    # Extract the file ID from the URL
    file_id = url.split("/")[-1]
    # Download the file using your client
    parquet_path = ca.data.download_file(file_id=file_id, directory=output_dir)
    # Convert to CSV
    df = pd.read_parquet(parquet_path)
    csv_path = parquet_path.replace(".parquet", ".csv")
    df.to_csv(csv_path, index=False)
    print(f"✅ Converted to CSV: {csv_path}")
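One caveat with the filename handling above: str.replace swaps every occurrence of ".parquet" in the path, not just the final extension. A slightly more robust sketch uses os.path.splitext:

```python
import os

parquet_path = "./downloads/data.parquet"
# splitext splits off only the final extension, leaving the rest of the path intact
csv_path = os.path.splitext(parquet_path)[0] + ".csv"
print(csv_path)  # ./downloads/data.csv
```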
Alternatively, you can purchase and download one or more files from the manifest in a single script:
import os
import pandas as pd
# Step 1: Set dataset ID and file(s) to buy (from manifest)
dataset_id = "CA0042"  # example dataset ID
file_urls_to_buy = [
    "/INSERT YOUR PARQUET FILE URL HERE",
    # Add more file paths from the manifest if needed
]
# Step 2: Purchase the data
order = ca.data.buy_data(dataset_id=dataset_id, file_urls=file_urls_to_buy)
# Step 3: Extract full download URLs
download_urls = order["file_urls"]
# Step 4: Prepare output directory
output_dir = "./downloads"
os.makedirs(output_dir, exist_ok=True)
# Step 5: Download each file and convert to CSV
for file_url in download_urls:
    try:
        # Extract file ID from full URL
        file_id = file_url.split("/")[-1]
        # Download to local disk
        parquet_path = ca.data.download_file(file_id=file_id, directory=output_dir)
        # Convert Parquet → CSV
        df = pd.read_parquet(parquet_path)
        csv_path = parquet_path.replace(".parquet", ".csv")
        df.to_csv(csv_path, index=False)
        print(f"✅ Converted to CSV: {csv_path}")
    except Exception as e:
        print(f"❌ Error processing {file_url}: {str(e)}")
If you would like to buy and download all the files from the manifest, you can use the example code below. It buys all the files and downloads the Parquet files without converting them to CSVs.
import os
# Step 1: Get manifest
dataset_id = "CA0042"
dm = ca.data.get_data_manifest(dataset_id)
# Step 2: Extract file_urls from the manifest
manifest_files = dm.get("manifest", [])
file_urls_to_buy = [entry["file_url"] for entry in manifest_files]
# Step 3: Buy all files
order = ca.data.buy_data(dataset_id=dataset_id, file_urls=file_urls_to_buy)
# Step 4: Extract download URLs
download_urls = order["file_urls"]
# Step 5: Set up output directory
output_dir = "./downloads"
os.makedirs(output_dir, exist_ok=True)
# Step 6: Download all Parquet files
for file_url in download_urls:
    try:
        file_id = file_url.split("/")[-1]
        parquet_path = ca.data.download_file(file_id=file_id, directory=output_dir)
        print(f"✅ Downloaded Parquet file: {parquet_path}")
    except Exception as e:
        print(f"❌ Error downloading {file_url}: {str(e)}")