[Bug] Bark examples not working out of the box? #2781

Closed
FeatureSpitter opened this issue Jul 19, 2023 · 21 comments · Fixed by idiap/coqui-ai-TTS#253
Labels
bug Something isn't working

Comments

@FeatureSpitter

Describe the bug

I have been following this tutorial: https://tts.readthedocs.io/en/dev/models/bark.html#example-use

To Reproduce

But this is the result I got:

(.venv) nemewsys@nemewsys-Legion-5-15ACH6H:~/voice-to-text$ tree bark_voices/
bark_voices/
└── ljspeech
    └── speaker.wav

1 directory, 1 file
(.venv) nemewsys@nemewsys-Legion-5-15ACH6H:~/voice-to-text$ tts --model_name  tts_models/multilingual/multi-dataset/bark --text "This is an example." --out_path "output.wav" --voice_dir bark_voices/ --speaker_idx "ljspeech" --progress_bar True
 > tts_models/multilingual/multi-dataset/bark is already downloaded.
 > Using model: bark
 > Text: This is an example.
 > Text splitted to sentences.
['This is an example.']
Downloading HuBERT custom tokenizer
Downloading (…)rt_base_ls960_14.pth: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 104M/104M [00:02<00:00, 39.6MB/s]
Traceback (most recent call last):
  File "/home/nemewsys/voice-to-text/.venv/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/nemewsys/voice-to-text/.venv/lib/python3.10/site-packages/TTS/bin/synthesize.py", line 447, in main
    wav = synthesizer.tts(args.text, speaker_name=args.speaker_idx)
  File "/home/nemewsys/voice-to-text/.venv/lib/python3.10/site-packages/TTS/utils/synthesizer.py", line 365, in tts
    outputs = self.tts_model.synthesize(
  File "/home/nemewsys/voice-to-text/.venv/lib/python3.10/site-packages/TTS/tts/models/bark.py", line 218, in synthesize
    history_prompt = load_voice(self, speaker_id, voice_dirs)
  File "/home/nemewsys/voice-to-text/.venv/lib/python3.10/site-packages/TTS/tts/layers/bark/inference_funcs.py", line 81, in load_voice
    generate_voice(audio=audio_path, model=model, output_path=output_path)
  File "/home/nemewsys/voice-to-text/.venv/lib/python3.10/site-packages/TTS/tts/layers/bark/inference_funcs.py", line 134, in generate_voice
    hubert_manager.make_sure_tokenizer_installed(model_path=model.config.LOCAL_MODEL_PATHS["hubert_tokenizer"])
  File "/home/nemewsys/voice-to-text/.venv/lib/python3.10/site-packages/TTS/tts/layers/bark/hubert/hubert_manager.py", line 31, in make_sure_tokenizer_installed
    huggingface_hub.hf_hub_download(repo, model, local_dir=model_dir, local_dir_use_symlinks=False)
  File "/home/nemewsys/voice-to-text/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/nemewsys/voice-to-text/.venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1379, in hf_hub_download
    os.makedirs(os.path.dirname(local_dir_filepath), exist_ok=True)
  File "/usr/lib/python3.10/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.10/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.10/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 1 more time]
  File "/usr/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/root/.local'

Expected behavior

For it to produce the output.wav with the voice in the bark_voices folder

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3070 Laptop GPU"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1+cu117",
        "TTS": "0.15.6",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.6",
        "version": "#83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023"
    }
}

Additional context

No response

@FeatureSpitter FeatureSpitter added the bug Something isn't working label Jul 19, 2023
@erogol
Member

erogol commented Jul 19, 2023

Do you have write access to the folder? Seems like you don't.

@FeatureSpitter
Author

Do you have write access to the folder? Seems like you don't.

Should I even need it?

Anyway, even with chmod 777 it still fails.

@erogol
Member

erogol commented Jul 19, 2023

This is your error PermissionError: [Errno 13] Permission denied: '/root/.local'
I don't have a different explanation than the one above. Sorry.

@FeatureSpitter
Author

This is your error PermissionError: [Errno 13] Permission denied: '/root/.local' I don't have a different explanation than the one above. Sorry.

I used the jfk.zip example from this other post: #2745

And it worked fine. I think it has to do with the folder structure, or the file types. I've tried to make them equal but still get that .local permission error with my wav files.

@xAIxxxxNAGI

@FeatureSpitter

Hi, I also ran into the same issue yesterday.
I could run Bark generation without voice cloning out of the box, but hit the same error when generating with voice cloning.

I found that the download path for the HuBERT custom tokenizer is not set in the current implementation.

This is the value of model.config.LOCAL_MODEL_PATHS at https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/layers/bark/inference_funcs.py#L134:

{'text': '/Users/<myname>/Library/Application Support/tts/tts_models--multilingual--multi-dataset--bark/text_2.pt', 'coarse': '/Users/<myname>/Library/Application Support/tts/tts_models--multilingual--multi-dataset--bark/coarse_2.pt', 'fine': '/Users/<myname>/Library/Application Support/tts/tts_models--multilingual--multi-dataset--bark/fine_2.pt', 'hubert_tokenizer': '/root/.local/share/tts/suno/bark_v0/tokenizer.pth', 'hubert': '/root/.local/share/tts/suno/bark_v0/hubert.pt'}

I think the other model paths are set at https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/models/bark.py#L270, but the hubert and hubert_tokenizer paths are not, so they still point to /root, which is read-only.

I think you can fix it either by hard-coding a different hubert_tokenizer model path instead of /root, or by manually downloading the hubert_tokenizer to /root/.local/share/tts/suno/bark_v0/ (this path may be different in your setup).

I fixed this issue by adding the following lines at https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/models/bark.py#L270:

        self.config.LOCAL_MODEL_PATHS["text"] = text_model_path
        self.config.LOCAL_MODEL_PATHS["coarse"] = coarse_model_path
        self.config.LOCAL_MODEL_PATHS["fine"] = fine_model_path

        # Workaround: not a clean solution, but it works for now.
        self.config.LOCAL_MODEL_PATHS["hubert_tokenizer"] = os.path.join(checkpoint_dir, "hubert_tokenizer.pth")
        self.config.LOCAL_MODEL_PATHS["hubert"] = os.path.join(checkpoint_dir, "hubert.pt")

I'm not sure whether this helps in your situation, but I'm sharing what worked for me.
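
The same idea can be sketched outside the library: take every entry of LOCAL_MODEL_PATHS and re-root it under the directory the checkpoints were actually downloaded to. This is only an illustration on a plain dict. The helper name localize_model_paths and the /tmp/bark directory are made up for the example, and unlike the patch above it keeps the original file names (tokenizer.pth rather than hubert_tokenizer.pth).

```python
import os


def localize_model_paths(local_paths, checkpoint_dir):
    """Return a copy of local_paths with every file re-rooted under checkpoint_dir.

    Hypothetical helper: it keeps each file's base name and only swaps the
    directory, so no entry is left pointing at a directory the user cannot write to.
    """
    return {
        name: os.path.join(checkpoint_dir, os.path.basename(path))
        for name, path in local_paths.items()
    }


# The two entries that were left pointing at /root in the dump above.
paths = {
    "hubert_tokenizer": "/root/.local/share/tts/suno/bark_v0/tokenizer.pth",
    "hubert": "/root/.local/share/tts/suno/bark_v0/hubert.pt",
}

# "/tmp/bark" stands in for the real checkpoint directory.
fixed = localize_model_paths(paths, "/tmp/bark")
print(fixed)
```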

@isaac-mcfadyen

Any update on this? Just ran into this issue out-of-the-box myself. It seems that it's trying to download something to /root which doesn't work given that /root is only writable by root, not a non-superuser/non-sudo.
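
A quick way to confirm this before running tts is to check whether the target file could be created at all. The sketch below walks up from the configured path to the nearest directory that already exists and asks whether the current user may write there. The function names are hypothetical and nothing here is part of the TTS package. Note that when run as root, os.access reports everything as writable.

```python
import os


def nearest_existing_parent(path):
    """Walk up from path until a directory that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def can_create(path):
    """Return True if the current user could create path (and any missing parents)."""
    parent = nearest_existing_parent(os.path.dirname(os.path.abspath(path)))
    # Creating an entry in a directory needs write and search permission on it.
    return os.access(parent, os.W_OK | os.X_OK)


print(can_create("/root/.local/share/tts/suno/bark_v0/tokenizer.pth"))
```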

@reopio

reopio commented Aug 26, 2023

I encountered this problem too. After tracing through the code, I found that the problem comes from the Bark config file, config.json. In my case it is located at
~/.local/share/tts/tts_models--multilingual--multi-dataset--bark/config.json:

{
	"model": "bark",
    "output_path": "output",
    "logger_uri": null,
    "run_name": "run",
    "project_name": null,
    "run_description": "\ud83d\udc38Coqui trainer run.",
    "print_step": 25,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": null,
    "save_step": 10000,
    "save_n_checkpoints": 5,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": null,
    "print_eval": false,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 1000,
    "batch_size": 32,
    "eval_batch_size": 16,
    "grad_clip": 0.0,
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "radam",
    "optimizer_params": null,
    "lr_scheduler": null,
    "lr_scheduler_params": {},
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "num_loader_workers": 0,
    "num_eval_loader_workers": 0,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "win_length": 1024,
        "hop_length": 256,
        "frame_shift_ms": null,
        "frame_length_ms": null,
        "stft_pad_mode": "reflect",
        "sample_rate": 22050,
        "resample": false,
        "preemphasis": 0.0,
        "ref_level_db": 20,
        "do_sound_norm": false,
        "log_func": "np.log10",
        "do_trim_silence": true,
        "trim_db": 45,
        "do_rms_norm": false,
        "db_level": null,
        "power": 1.5,
        "griffin_lim_iters": 60,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": null,
        "spec_gain": 20,
        "do_amp_to_db_linear": true,
        "do_amp_to_db_mel": true,
        "pitch_fmax": 640.0,
        "pitch_fmin": 1.0,
        "signal_norm": true,
        "min_level_db": -100,
        "symmetric_norm": true,
        "max_norm": 4.0,
        "clip_norm": true,
        "stats_path": null
    },
    "use_phonemes": false,
    "phonemizer": null,
    "phoneme_language": null,
    "compute_input_seq_cache": false,
    "text_cleaner": null,
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": null,
    "add_blank": false,
    "batch_group_size": 0,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": Infinity,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_energy": false,
    "compute_linear_spec": false,
    "precompute_num_workers": 0,
    "start_by_longest": false,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "",
            "dataset_name": "",
            "path": "",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "",
            "phonemizer": "",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [],
    "eval_split_max_size": null,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "num_chars": 0,
    "semantic_config": {
        "block_size": 1024,
        "input_vocab_size": 10048,
        "output_vocab_size": 10048,
        "n_layer": 12,
        "n_head": 12,
        "n_embd": 768,
        "dropout": 0.0,
        "bias": true
    },
    "fine_config": {
        "block_size": 1024,
        "input_vocab_size": 10048,
        "output_vocab_size": 10048,
        "n_layer": 12,
        "n_head": 12,
        "n_embd": 768,
        "dropout": 0.0,
        "bias": true,
        "n_codes_total": 8,
        "n_codes_given": 1
    },
    "coarse_config": {
        "block_size": 1024,
        "input_vocab_size": 10048,
        "output_vocab_size": 10048,
        "n_layer": 12,
        "n_head": 12,
        "n_embd": 768,
        "dropout": 0.0,
        "bias": true
    },
    "CONTEXT_WINDOW_SIZE": 1024,
    "SEMANTIC_RATE_HZ": 49.9,
    "SEMANTIC_VOCAB_SIZE": 10000,
    "CODEBOOK_SIZE": 1024,
    "N_COARSE_CODEBOOKS": 2,
    "N_FINE_CODEBOOKS": 8,
    "COARSE_RATE_HZ": 75,
    "SAMPLE_RATE": 24000,
    "USE_SMALLER_MODELS": false,
    "TEXT_ENCODING_OFFSET": 10048,
    "SEMANTIC_PAD_TOKEN": 10000,
    "TEXT_PAD_TOKEN": 129595,
    "SEMANTIC_INFER_TOKEN": 129599,
    "COARSE_SEMANTIC_PAD_TOKEN": 12048,
    "COARSE_INFER_TOKEN": 12050,
    "REMOTE_MODEL_PATHS": {
        "text": {
            "path": "https://huggingface.co/erogol/bark/tree/main/text_2.pt",
            "checksum": "54afa89d65e318d4f5f80e8e8799026a"
        },
        "coarse": {
            "path": "https://huggingface.co/erogol/bark/tree/main/coarse_2.pt",
            "checksum": "8a98094e5e3a255a5c9c0ab7efe8fd28"
        },
        "fine": {
            "path": "https://huggingface.co/erogol/bark/tree/main/fine_2.pt",
            "checksum": "59d184ed44e3650774a2f0503a48a97b"
        }
    },
    "LOCAL_MODEL_PATHS": {
        "text": "/root/.local/share/tts/suno/bark_v0/text_2.pt",
        "coarse": "/root/.local/share/tts/suno/bark_v0/coarse_2.pt",
        "fine": "/root/.local/share/tts/suno/bark_v0/fine_2.pt",
        "hubert_tokenizer": "/root/.local/share/tts/suno/bark_v0/tokenizer.pth",
        "hubert": "/root/.local/share/tts/suno/bark_v0/hubert.pt"
    },
    "SMALL_REMOTE_MODEL_PATHS": {
        "text": {
            "path": "https://huggingface.co/erogol/bark/tree/main/text.pt"
        },
        "coarse": {
            "path": "https://huggingface.co/erogol/bark/tree/main/coarse.pt"
        },
        "fine": {
            "path": "https://huggingface.co/erogol/bark/tree/main/fine.pt"
        }
    },
    "CACHE_DIR": "/root/.local/share/tts/suno/bark_v0"
}

You can resolve the problem by changing the config file to the following:

{
	"model": "bark",
    "output_path": "output",
    "logger_uri": null,
    "run_name": "run",
    "project_name": null,
    "run_description": "\ud83d\udc38Coqui trainer run.",
    "print_step": 25,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": null,
    "save_step": 10000,
    "save_n_checkpoints": 5,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": null,
    "print_eval": false,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 1000,
    "batch_size": 32,
    "eval_batch_size": 16,
    "grad_clip": 0.0,
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "radam",
    "optimizer_params": null,
    "lr_scheduler": null,
    "lr_scheduler_params": {},
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "num_loader_workers": 0,
    "num_eval_loader_workers": 0,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "win_length": 1024,
        "hop_length": 256,
        "frame_shift_ms": null,
        "frame_length_ms": null,
        "stft_pad_mode": "reflect",
        "sample_rate": 22050,
        "resample": false,
        "preemphasis": 0.0,
        "ref_level_db": 20,
        "do_sound_norm": false,
        "log_func": "np.log10",
        "do_trim_silence": true,
        "trim_db": 45,
        "do_rms_norm": false,
        "db_level": null,
        "power": 1.5,
        "griffin_lim_iters": 60,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": null,
        "spec_gain": 20,
        "do_amp_to_db_linear": true,
        "do_amp_to_db_mel": true,
        "pitch_fmax": 640.0,
        "pitch_fmin": 1.0,
        "signal_norm": true,
        "min_level_db": -100,
        "symmetric_norm": true,
        "max_norm": 4.0,
        "clip_norm": true,
        "stats_path": null
    },
    "use_phonemes": false,
    "phonemizer": null,
    "phoneme_language": null,
    "compute_input_seq_cache": false,
    "text_cleaner": null,
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": null,
    "add_blank": false,
    "batch_group_size": 0,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": Infinity,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_energy": false,
    "compute_linear_spec": false,
    "precompute_num_workers": 0,
    "start_by_longest": false,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "",
            "dataset_name": "",
            "path": "",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "",
            "phonemizer": "",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [],
    "eval_split_max_size": null,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "num_chars": 0,
    "semantic_config": {
        "block_size": 1024,
        "input_vocab_size": 10048,
        "output_vocab_size": 10048,
        "n_layer": 12,
        "n_head": 12,
        "n_embd": 768,
        "dropout": 0.0,
        "bias": true
    },
    "fine_config": {
        "block_size": 1024,
        "input_vocab_size": 10048,
        "output_vocab_size": 10048,
        "n_layer": 12,
        "n_head": 12,
        "n_embd": 768,
        "dropout": 0.0,
        "bias": true,
        "n_codes_total": 8,
        "n_codes_given": 1
    },
    "coarse_config": {
        "block_size": 1024,
        "input_vocab_size": 10048,
        "output_vocab_size": 10048,
        "n_layer": 12,
        "n_head": 12,
        "n_embd": 768,
        "dropout": 0.0,
        "bias": true
    },
    "CONTEXT_WINDOW_SIZE": 1024,
    "SEMANTIC_RATE_HZ": 49.9,
    "SEMANTIC_VOCAB_SIZE": 10000,
    "CODEBOOK_SIZE": 1024,
    "N_COARSE_CODEBOOKS": 2,
    "N_FINE_CODEBOOKS": 8,
    "COARSE_RATE_HZ": 75,
    "SAMPLE_RATE": 24000,
    "USE_SMALLER_MODELS": false,
    "TEXT_ENCODING_OFFSET": 10048,
    "SEMANTIC_PAD_TOKEN": 10000,
    "TEXT_PAD_TOKEN": 129595,
    "SEMANTIC_INFER_TOKEN": 129599,
    "COARSE_SEMANTIC_PAD_TOKEN": 12048,
    "COARSE_INFER_TOKEN": 12050,
    "REMOTE_MODEL_PATHS": {
        "text": {
            "path": "https://huggingface.co/erogol/bark/tree/main/text_2.pt",
            "checksum": "54afa89d65e318d4f5f80e8e8799026a"
        },
        "coarse": {
            "path": "https://huggingface.co/erogol/bark/tree/main/coarse_2.pt",
            "checksum": "8a98094e5e3a255a5c9c0ab7efe8fd28"
        },
        "fine": {
            "path": "https://huggingface.co/erogol/bark/tree/main/fine_2.pt",
            "checksum": "59d184ed44e3650774a2f0503a48a97b"
        }
    },
    "LOCAL_MODEL_PATHS": {
        "text": "~/.local/share/tts/suno/bark_v0/text_2.pt",
        "coarse": "~/.local/share/tts/suno/bark_v0/coarse_2.pt",
        "fine": "~/.local/share/tts/suno/bark_v0/fine_2.pt",
        "hubert_tokenizer": "~/.local/share/tts/suno/bark_v0/tokenizer.pth",
        "hubert": "~/.local/share/tts/suno/bark_v0/hubert.pt"
    },
    "SMALL_REMOTE_MODEL_PATHS": {
        "text": {
            "path": "https://huggingface.co/erogol/bark/tree/main/text.pt"
        },
        "coarse": {
            "path": "https://huggingface.co/erogol/bark/tree/main/coarse.pt"
        },
        "fine": {
            "path": "https://huggingface.co/erogol/bark/tree/main/fine.pt"
        }
    },
    "CACHE_DIR": "~/.local/share/tts/suno/bark_v0"
}

I have opened a pull request on the Hugging Face model repository erogol/bark to resolve this.
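
The edit can also be scripted rather than done by hand. The sketch below is an assumption-laden illustration, not an official tool: rehome_bark_config is a hypothetical name, and it writes an absolute home directory (via os.path.expanduser) instead of a literal ~, because nothing in a JSON file expands the tilde. Python's json module reads and writes the bare Infinity values in this config by default.

```python
import json
import os


def rehome_bark_config(config_path, old_prefix="/root", new_prefix=None):
    """Rewrite LOCAL_MODEL_PATHS and CACHE_DIR so they no longer point under old_prefix.

    new_prefix defaults to the current user's home directory as an absolute path.
    """
    new_prefix = new_prefix or os.path.expanduser("~")

    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    def fix(value):
        # Only touch strings that really start with "<old_prefix>/".
        if isinstance(value, str) and value.startswith(old_prefix + "/"):
            return new_prefix + value[len(old_prefix):]
        return value

    config["LOCAL_MODEL_PATHS"] = {
        name: fix(path) for name, path in config["LOCAL_MODEL_PATHS"].items()
    }
    config["CACHE_DIR"] = fix(config["CACHE_DIR"])

    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)
    return config
```

It would be called with the config location mentioned above, for example rehome_bark_config(os.path.expanduser("~/.local/share/tts/tts_models--multilingual--multi-dataset--bark/config.json")). Back up the file first, since it is overwritten in place.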

@erogol
Member

erogol commented Aug 26, 2023

Should be fixed by #2894

@erogol erogol closed this as completed Aug 26, 2023
@w41g87

w41g87 commented Jan 4, 2024

Same bug encountered as of v0.22.0 for the github version of TTS.

@storuky

storuky commented Feb 7, 2024

Same bug with 0.22.0

@illtellyoulater

illtellyoulater commented Feb 8, 2024

@erogol not fixed, you should reopen this.

@illtellyoulater

illtellyoulater commented Feb 8, 2024

@reopio there is a small but important error in your fix!

Basically, what you did was change this

    "LOCAL_MODEL_PATHS": {
        "text": "/root/.local/share/tts/suno/bark_v0/text_2.pt",
        "coarse": "/root/.local/share/tts/suno/bark_v0/coarse_2.pt",
        "fine": "/root/.local/share/tts/suno/bark_v0/fine_2.pt",
        "hubert_tokenizer": "/root/.local/share/tts/suno/bark_v0/tokenizer.pth",
        "hubert": "/root/.local/share/tts/suno/bark_v0/hubert.pt"
    },
    "SMALL_REMOTE_MODEL_PATHS": {
        "text": {
            "path": "https://huggingface.co/erogol/bark/tree/main/text.pt"
        },
        "coarse": {
            "path": "https://huggingface.co/erogol/bark/tree/main/coarse.pt"
        },
        "fine": {
            "path": "https://huggingface.co/erogol/bark/tree/main/fine.pt"
        }
    },
    "CACHE_DIR": "/root/.local/share/tts/suno/bark_v0"
}

into this

    "LOCAL_MODEL_PATHS": {
        "text": "~/.local/share/tts/suno/bark_v0/text_2.pt",
        "coarse": "~/.local/share/tts/suno/bark_v0/coarse_2.pt",
        "fine": "~/.local/share/tts/suno/bark_v0/fine_2.pt",
        "hubert_tokenizer": "~/.local/share/tts/suno/bark_v0/tokenizer.pth",
        "hubert": "~/.local/share/tts/suno/bark_v0/hubert.pt"
    },
    "SMALL_REMOTE_MODEL_PATHS": {
        "text": {
            "path": "https://huggingface.co/erogol/bark/tree/main/text.pt"
        },
        "coarse": {
            "path": "https://huggingface.co/erogol/bark/tree/main/coarse.pt"
        },
        "fine": {
            "path": "https://huggingface.co/erogol/bark/tree/main/fine.pt"
        }
    },
    "CACHE_DIR": "~/.local/share/tts/suno/bark_v0"
}

replacing /root with ~.

So you correctly identified the problem, but you didn't consider that Python, unlike bash, does NOT automatically expand ~ into /home/username.

Running "tts" after your change will cause the code to create a subdirectory named ~ in whatever directory the "tts" command is run from, which in turn causes #3567 (I stand corrected, that's unrelated, but it is still something that needs to be addressed, as it has to do with Hugging Face models not being downloaded correctly).

The correct solution would be using the expanduser function, like this:

my_dir = os.path.expanduser("~/some_dir")
# my_dir => "/home/username/some_dir"

I hope you can add this correction to your erogol/bark PR :)
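
The literal-directory behaviour described above is easy to reproduce in isolation. This small sketch uses a throwaway temporary directory in place of the directory "tts" is run from, and the path ~/some_dir is just the example name used above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Python passes "~" to the OS unchanged, so this creates a directory
    # that is literally named "~" inside workdir.
    os.makedirs(os.path.join(workdir, "~/some_dir"))
    created = os.listdir(workdir)

print(created)  # ['~']

# expanduser is what turns the tilde into the real home directory.
print(os.path.expanduser("~/some_dir"))
```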

@illtellyoulater

illtellyoulater commented Feb 8, 2024

Ouch, I forgot that this is a JSON file! 🤦‍♂😅 Well, then the change has to be made in the model loading functions in /TTS/utils/synthesizer.py, for example this one: self._load_tts_from_dir(model_dir, use_cuda)
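
As a rough idea of what such a change could look like, the sketch below expands ~ in the path entries right after the config has been read. It works on a plain dict as parsed from config.json. The real loader in synthesizer.py uses its own config object, so the function name expand_config_paths and the place it would be called from are assumptions, not the library's actual code.

```python
import os


def expand_config_paths(config):
    """Return a copy of a config dict with "~" expanded in the local path entries."""
    fixed = dict(config)
    fixed["LOCAL_MODEL_PATHS"] = {
        name: os.path.expanduser(path)
        for name, path in config.get("LOCAL_MODEL_PATHS", {}).items()
    }
    if "CACHE_DIR" in config:
        fixed["CACHE_DIR"] = os.path.expanduser(config["CACHE_DIR"])
    return fixed


example = {
    "LOCAL_MODEL_PATHS": {
        "hubert": "~/.local/share/tts/suno/bark_v0/hubert.pt",
    },
    "CACHE_DIR": "~/.local/share/tts/suno/bark_v0",
}
print(expand_config_paths(example))
```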

@arthurwolf

arthurwolf commented Mar 7, 2024

Same problem here. What is the recommended fix?

I made the recommended changes (edited ~/.local/share/tts/tts_models--multilingual--multi-dataset--bark/config.json and changed /root/ to /home/myuser/), and now I get this error:

╰─(base) ⠠⠵ tts --model_name  tts_models/multilingual/multi-dataset/bark --text "Hey look, she's awake! I can't believe she's awake, that's crazy." --out_path /tmp/output.wav --progress_bar True --voice_dir /ram/ --speaker_idx "tommy"                      on dev|✔
 > tts_models/multilingual/multi-dataset/bark is already downloaded.
 > Using model: bark
/home/arthur/.anaconda3/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17.7k/17.7k [00:00<00:00, 76.1MiB/s]
Traceback (most recent call last):
  File "/home/arthur/.anaconda3/bin/tts", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/arthur/dev/ai/TTS/TTS/bin/synthesize.py", line 423, in main
    synthesizer = Synthesizer(
                  ^^^^^^^^^^^^
  File "/home/arthur/dev/ai/TTS/TTS/utils/synthesizer.py", line 109, in __init__
    self._load_tts_from_dir(model_dir, use_cuda)
  File "/home/arthur/dev/ai/TTS/TTS/utils/synthesizer.py", line 164, in _load_tts_from_dir
    self.tts_model.load_checkpoint(config, checkpoint_dir=model_dir, eval=True)
  File "/home/arthur/dev/ai/TTS/TTS/tts/models/bark.py", line 281, in load_checkpoint
    self.load_bark_models()
  File "/home/arthur/dev/ai/TTS/TTS/tts/models/bark.py", line 50, in load_bark_models
    self.semantic_model, self.config = load_model(
                                       ^^^^^^^^^^^
  File "/home/arthur/dev/ai/TTS/TTS/tts/layers/bark/load_model.py", line 121, in load_model
    checkpoint = torch.load(ckpt_path, map_location=device)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arthur/.anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1040, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arthur/.anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1258, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: invalid load key, '<'.

@illtellyoulater

illtellyoulater commented Mar 7, 2024

@arthurwolf: redownloading the models may fix it (as advised in #3567) but keep in mind you are likely to encounter other bugs, as this codebase is no longer officially maintained.
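
On the `invalid load key, '<'` message itself: it means the first byte of the checkpoint file is `<`, which usually indicates that an HTML page was saved where a binary checkpoint was expected. Before redownloading, a file can be checked with a few lines like the sketch below. The helper name looks_like_html is hypothetical and the check is only a heuristic.

```python
def looks_like_html(path, probe_bytes=512):
    """Heuristic: True if the file starts with '<' (after optional whitespace).

    A PyTorch checkpoint is binary and should not start with '<',
    whereas a saved web page typically does.
    """
    with open(path, "rb") as f:
        head = f.read(probe_bytes)
    return head.lstrip().startswith(b"<")
```

Running it over the files listed under LOCAL_MODEL_PATHS (text_2.pt, coarse_2.pt, fine_2.pt) would show which of them need to be deleted and fetched again.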

If Bark is what you were after, you can install it from its official repository featuring updated code working out of the box.

On the other hand, if you were looking for XTTS, you could try AllTalk. It's a newer XTTSv2 implementation coming with an API, DeepSpeed support, and other interesting additional features.

@isaac-mcfadyen

as this codebase is no longer officially maintained

Is this something that should be mentioned in the README.md? Until you said this I was not aware of this fact 🙂

@arthurwolf

@illtellyoulater

If Bark is what you were after,

I've been trying to get bark to work for weeks (generating speech then trying to use other methods to change the voice to match a sample), came to coqui-ai looking for an alternative (as it seemed to be able to do both tts and voice at the same time), and then coqui-ai docs say "hey if you want to do that use our version of bark" ...

You might actually know how to do what I'm looking for.

I need to either do text to speech with a custom voice (from a sample), or even just convert existing speech to have a different voice (from a custom sample).

What would you currently recommend as the best way to get there?

I'll look at https://github.com/erew123/alltalk_tts/ thanks a lot for that.

@illtellyoulater

illtellyoulater commented Mar 8, 2024

@arthurwolf

I need to either do text to speech with a custom voice (from a sample), or even just convert existing speech to have a different voice (from a custom sample).

Bark is an amazing open-source TTS model from Suno, but the version released by Suno is not very practical, as it will only let you generate 16 words per run. Also, I think that recreating voices with it is a bit more convoluted than with XTTS, at least with the original Suno code, and I haven't researched other third-party implementations enough to be able to suggest one.

However, the good news is that what you are trying to do is exactly what Coqui XTTS excels at! In fact, it only needs a 7-10 second audio file to learn to speak with approximately the same voice.
Another great feature of XTTS is that, regardless of the original speaker's language, the copied voice is automatically available in ~16 different languages, all sounding good and natural!

There is only one small problem: after Coqui's shutdown and the release of the model as open source, the codebase needed to run it (this repository) became unmaintained, with some parts now broken. This is exactly where third-party reimplementations like AllTalk come into play, essentially providing an updated, refined and enhanced version of it. By the way, if you need an even easier-to-use alternative to AllTalk, take a look at github.com/daswer123/xtts-webui, as you will be able to run all the steps I described above entirely from the browser UI it comes with!

That's all! I hope this clears up your doubts and helps you get on track.
Keep going, you're almost there! 🗣🎶

PS: for an added bonus I'll just leave this here: dozens of free voices ready to be downloaded and used in XTTS, enjoy!

https://aiartes.com/voiceai

@illtellyoulater

illtellyoulater commented Mar 8, 2024

@isaac-mcfadyen

Is this something that should be mentioned in the README.md? Until you said this I was not aware of this fact 🙂

Not another word on this please! It brings back... mixed memories ;) (#3569 (comment))

@arthurwolf

arthurwolf commented Mar 8, 2024

@illtellyoulater thank you so much for the help. I was stuck for a long time trying to get projects to work that I now realize were completely outdated/abandoned. I got AllTalk running and went from 20% to 90% of the way to what I want, absolutely amazing. Thank you again.

Do you know if there's any way to get it to generate whispering or shouting? Some kind of keyword or prompting trick? Or some other project that would be able to do that? I searched a lot and didn't have much luck. Bark is able to do it a little bit some of the time, but not with a custom voice...

@eginhard
Contributor

This is now fixed in our fork, available via pip install coqui-tts
