Bulk Data
Ingesting bulk data for a data identifier to the local filesystem
This notebook walks through pulling data for a given dataset ID.
Prerequisites
carbonarc SDK and other dependencies installed in a local Python environment:
python3.10 -m venv .venv
source .venv/bin/activate
pip install git+https://github.com/Carbon-Arc/carbonarc
pip install python-dotenv
Setup
- Create a .env file in the same directory as this notebook, or export the API_AUTH_TOKEN environment variable.
- If using .env, add the following line to it:
API_AUTH_TOKEN=<api auth token from https://platform.carbonarc.co/profile>
1 Import required dependencies
import os
from datetime import datetime
from dotenv import load_dotenv
from carbonarc import CarbonArcClient
# Load environment variables
load_dotenv()
API_AUTH_TOKEN = os.getenv("API_AUTH_TOKEN")
# Create API client
client = CarbonArcClient(API_AUTH_TOKEN)
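If load_dotenv() does not find your .env file, API_AUTH_TOKEN will be None and every API call will fail to authenticate. A minimal sanity check (not part of the SDK) before using the client:

# Fail fast if the token was not loaded from .env or the environment
if not API_AUTH_TOKEN:
    raise RuntimeError("API_AUTH_TOKEN is not set; see the Setup section above.")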
2 List available datasets
datasets = client.data.get_datasets()
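To find the dataset ID used in the next steps, you can inspect the response; the exact structure may vary by SDK version, so this is just a quick look:

# Print the catalog of available datasets and pick out the dataset ID you need
print(datasets)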
3 Get information for a given dataset
dataset = client.data.get_dataset_information(dataset_id="CA0000")  # Insert your dataset ID, e.g. CA0028 (Card - US Detailed Panel)
4 Fetch data manifest for incremental ingestion
You typically only want to fetch data created since the last ingestion:
last_ingest_time = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')  # placeholder: use the timestamp of your previous ingestion in a real pipeline
print(last_ingest_time)
manifest = client.data.get_data_manifest(
    dataset_id="CA0000",  # Insert your dataset ID, e.g. CA0028 (Card - US Detailed Panel)
    created_since=last_ingest_time
)
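datetime.now() above is just a placeholder. In a real incremental pipeline you would persist the timestamp of each successful run and read it back on the next one. A minimal sketch of that bookkeeping, assuming a local state file (the file name and helper functions are illustrative, not part of the SDK):

import json
from pathlib import Path

STATE_FILE = Path("ingest_state.json")  # hypothetical local state file

def load_last_ingest_time(default: str) -> str:
    # Return the timestamp recorded by the previous run, or a default for the first run
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_ingest_time"]
    return default

def save_last_ingest_time(timestamp: str) -> None:
    # Record this run's timestamp so the next run only fetches newer files
    STATE_FILE.write_text(json.dumps({"last_ingest_time": timestamp}))

# Fetch everything created since the last run, then record the current time
last_ingest_time = load_last_ingest_time(default="1970-01-01T00:00:00")
manifest = client.data.get_data_manifest(
    dataset_id="CA0000",
    created_since=last_ingest_time
)
save_last_ingest_time(datetime.now().strftime('%Y-%m-%dT%H:%M:%S'))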
Manifest file structure
Each entry in the manifest's datasources list generally looks like this:
{
    "url": "https://storage.carbonarc.co/path/to/file.parquet",
    "format": "parquet",
    "records": 1000,
    "size_bytes": 123456789,
    "modification_time": "2025-04-15T23:04:44",
    "price": 123.45
}
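Because each entry carries price, size_bytes, and records, you can summarize the manifest before committing to a purchase. A small sketch, assuming manifest["datasources"] is the list of entries shown above:

# Summarize volume and cost of the files in the manifest before buying
datasources = manifest["datasources"]
total_records = sum(d["records"] for d in datasources)
total_bytes = sum(d["size_bytes"] for d in datasources)
total_price = sum(d["price"] for d in datasources)
print(f"{len(datasources)} files, {total_records} records, "
      f"{total_bytes / 1e9:.2f} GB, estimated price {total_price:.2f}")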
5 Purchase the data files in the manifest
file_urls = [x["url"] for x in manifest["datasources"]]
order = client.data.buy_data(
    dataset_id="CA0000",  # Insert your dataset ID, e.g. CA0028 (Card - US Detailed Panel)
    file_urls=file_urls
)
6 Download the purchased files
client.data.download_file(
    file_id=order["files"][0],  # download the first purchased file
    directory="./"
)
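The call above downloads only the first purchased file. Assuming order["files"] lists all purchased file IDs, you can loop over it to pull every file into a target directory:

# Download every purchased file into a local directory
for file_id in order["files"]:
    client.data.download_file(
        file_id=file_id,
        directory="./bulk_data"  # hypothetical target directory
    )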