A command-line tool to interact with Hugging Face datasets and migrate them to Couchbase, with support for streaming data.
pip install -r requirements.txt
python setup.py install
The CLI provides the following commands:
Lists all available configurations for a dataset.
hf_to_cb_dataset_migrator list-configs --path dataset
Flags:
- --path: Path or name of the dataset (required)
- --revision: Version of the dataset script to load
- --download-config: Specific download configuration parameters
- --download-mode: Download mode (reuse_dataset_if_exists or force_redownload)
- --dynamic-modules-path: Path to dynamic modules
- --data-files: Path(s) to source data file(s)
- --token: Authentication token for private datasets
- --json-output: Output the configurations in JSON format
- --debug: Enable debug output
- --trust-remote-code: Allow loading arbitrary code from the dataset repository
Lists all available splits for a dataset.
hf_to_cb_dataset_migrator list-splits --path dataset
Flags:
- --path: Path or name of the dataset (required)
- --name: Configuration name of the dataset
- --data-files: Path(s) to source data file(s)
- --download-config: Specific download configuration parameters
- --download-mode: Download mode (reuse_dataset_if_exists or force_redownload)
- --revision: Version of the dataset script to load
- --token: Authentication token for private datasets
- --json-output: Output the splits in JSON format
- --debug: Enable debug output
- --trust-remote-code: Allow loading arbitrary code from the dataset repository
Lists all fields (columns) in a dataset.
hf_to_cb_dataset_migrator list-fields --path dataset
Flags:
- --path: Path or name of the dataset (required)
- --name: Name of the dataset configuration
- --data-files: Paths to source data files
- --download-config: Specific download configuration parameters
- --revision: Version of the dataset script to load
- --token: Hugging Face token for private datasets
- --split: Which split of the data to load
- --json-output: Output the fields in JSON format
- --debug: Enable debug output
- --trust-remote-code: Allow loading arbitrary code from the dataset repository
Migrates data from Hugging Face to Couchbase.
hf_to_cb_dataset_migrator migrate \
--path dataset \
--id-fields id_field \
--cb-url couchbase://localhost \
--cb-username user \
--cb-password pass \
--cb-bucket my_bucket \
--cb-scope my_scope \
--cb-collection my_collection
Flags:
- --path: Path or name of the dataset (required)
- --id-fields: Comma-separated list of field names to use as document ID (required)
- --cb-url: Couchbase cluster URL (required)
- --cb-username: Couchbase username (required)
- --cb-password: Couchbase password (required)
- --cb-bucket: Couchbase bucket name (required)
- --cb-scope: Couchbase scope name (required)
- --cb-collection: Couchbase collection name
- --name: Configuration name of the dataset
- --data-files: Path(s) to source data file(s)
- --split: Which split of the data to load
- --cache-dir: Cache directory for datasets
- --download-config: Specific download configuration parameters
- --download-mode: Download mode (reuse_dataset_if_exists or force_redownload)
- --verification-mode: Verification mode (no_checks, basic_checks, or all_checks)
- --keep-in-memory: Keep dataset in memory
- --save-infos: Save dataset information
- --revision: Version of the dataset script to load
- --token: Authentication token for private datasets
- --no-streaming: Disable streaming mode
- --num-proc: Number of processes to use
- --storage-options: Storage options for remote filesystems
- --trust-remote-code: Allow loading arbitrary code from the dataset repository
- --cb-batch-size: Number of documents to insert per batch (default: 1000)
- --debug: Enable debug output
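The --cb-batch-size flag implies that documents are written in fixed-size groups, which pairs naturally with streaming because only one batch needs to be held in memory at a time. The following is a minimal Python sketch of that idea, not the tool's actual implementation: the batched helper and the sample records are hypothetical, and the write to Couchbase is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items, consuming the input lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for a streamed dataset: a generator of dict records.
stream = ({"id": i, "text": f"row {i}"} for i in range(2500))

# With the default batch size of 1000, 2500 records form three batches.
batch_sizes = [len(batch) for batch in batched(stream, 1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

A larger batch size means fewer round trips at the cost of more memory per batch.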
- List configurations for a public dataset:
hf_to_cb_dataset_migrator list-configs --path dataset
- List configurations for a private dataset:
hf_to_cb_dataset_migrator list-configs --path my-dataset --token YOUR_HF_TOKEN
- List splits for a dataset with specific configuration:
hf_to_cb_dataset_migrator list-splits --path dataset --name config-name
- List fields in JSON format:
hf_to_cb_dataset_migrator list-fields --path dataset --json-output
- List fields for a specific split:
hf_to_cb_dataset_migrator list-fields --path dataset --split train
- List fields with download configuration:
hf_to_cb_dataset_migrator list-fields \
--path dataset \
--download-config '{"force_download": true}' \
--trust-remote-code
- Migrate a dataset with multiple ID fields:
hf_to_cb_dataset_migrator migrate \
--path dataset \
--id-fields field1,field2 \
--cb-url couchbase://localhost \
--cb-username user \
--cb-password pass \
--cb-bucket my_bucket \
--cb-scope my_scope \
--cb-collection my_collection
- Migrate a specific split (streaming is on unless --no-streaming is passed):
hf_to_cb_dataset_migrator migrate \
--path dataset \
--split train \
--id-fields id_field \
--cb-url couchbase://localhost \
--cb-username user \
--cb-password pass \
--cb-bucket my_bucket \
--cb-scope my_scope
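When --id-fields names several fields, as in the field1,field2 example above, their values have to be combined into a single document ID. This document does not say how the tool combines them, so the Python sketch below assumes a plain join with an underscore separator. The function name build_document_id, the separator, and the sample record are all hypothetical.

```python
from typing import Any, Mapping, Sequence


def build_document_id(
    record: Mapping[str, Any], id_fields: Sequence[str], sep: str = "_"
) -> str:
    """Join the values of id_fields into one ID (assumed scheme, not the tool's)."""
    missing = [name for name in id_fields if name not in record]
    if missing:
        raise KeyError(f"record is missing ID field(s): {', '.join(missing)}")
    return sep.join(str(record[name]) for name in id_fields)


# Split the comma-separated flag value into individual field names.
id_fields = [name.strip() for name in "field1,field2".split(",")]
record = {"field1": "abc", "field2": 42, "text": "hello"}
print(build_document_id(record, id_fields))  # abc_42
```

Whatever the actual scheme, the chosen fields should identify a record uniquely; otherwise later records could overwrite earlier ones that share the same ID.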
The CLI will exit with a non-zero status code if an error occurs during execution. Error messages will be displayed on stderr.
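Because failures are signalled through the exit status and stderr, a calling script can detect them without parsing regular output. The Python sketch below shows one way to do that with the standard library. The run_command helper is hypothetical, and a stand-in command that fails on purpose is used so the example runs without the CLI installed.

```python
import subprocess
import sys
from typing import List, Tuple


def run_command(argv: List[str]) -> Tuple[int, str, str]:
    """Run a command and return (exit_code, stdout, stderr) as text."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=False)
    return completed.returncode, completed.stdout, completed.stderr


# Stand-in for a failing CLI call. In practice argv would be something like
# ["hf_to_cb_dataset_migrator", "list-configs", "--path", "dataset"].
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom\\n'); sys.exit(2)"]

code, out, err = run_command(failing)
if code != 0:
    print(f"command failed with status {code}: {err.strip()}")
```

Keeping stdout and stderr separate also means the output of --json-output can be parsed from stdout without error text mixed in, assuming the CLI writes its JSON there.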
- Use the --debug flag with any command to enable debug-level logging
- JSON output options are available for machine-readable output
- Progress information is displayed during migration
- For private Hugging Face datasets, use the --token option
- Couchbase credentials are required for migration operations
- Credentials can be provided via command-line options