Current script:
import torch
import whisperx

# Path to the 16 kHz wave file (placeholder)
file_path = "audio.wav"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load WhisperX model
whisper_model = whisperx.load_model("large-v2", device=device, compute_type="float32")

# Transcribe the audio file with WhisperX
transcription_result = whisper_model.transcribe(file_path, language="en", batch_size=16)

# Load and process audio
audio_data = whisperx.load_audio(file_path)
audio_tensor = torch.from_numpy(audio_data)

# Align transcription with timestamps
model_a, metadata = whisperx.load_align_model(language_code="en", device=device)
aligned_transcription = whisperx.align(
    transcript=transcription_result["segments"],
    model=model_a,
    align_model_metadata=metadata,
    audio=audio_tensor,  # pass the tensor rather than the file path
    device=device,
)

# Load speaker diarization model in WhisperX
# (a Hugging Face token is usually required for the pyannote model)
diarization_model = whisperx.DiarizationPipeline(use_auth_token=None, device=device)
diarization_result = diarization_model(file_path)

# Assign speaker labels to the aligned transcription
result_with_speakers = whisperx.assign_word_speakers(diarization_result, aligned_transcription)
Wave file: 16 kHz
It seems that with two people we get errors in the output during diarization when forcing the speaker count:
diarization_result = diarization_model(file_path, num_speakers=2)
However, with num_speakers set to 6+ or with the default settings:
diarization_result = diarization_model(file_path)
we are able to differentiate between speakers, but the script believes there are more than two speakers in the audio file. There are 2 speakers, yet it reports 6+.
Example output with the 2-speaker setting:
SPEAKER_01: Okay.
Error processing word {'start': 1741.207, 'end': 1741.287, 'text': 'Okay.', 'words': [{'word': 'Okay.', 'start': np.float64(1741.207), 'end': np.float64(1741.287), 'score': np.float64(0.0)}]}: 'speaker'
Error processing word {'start': 2009.363, 'end': 2009.443, 'text': 'Okay.', 'words': [{'word': 'Okay.', 'start': np.float64(2009.363), 'end': np.float64(2009.443), 'score': np.float64(0.001)}]}: 'speaker'
Error processing word {'start': 2009.463, 'end': 2009.663, 'text': 'All right.', 'words': [{'word': 'All', 'start': np.float64(2009.463), 'end': np.float64(2009.523), 'score': np.float64(0.026)}, {'word': 'right.', 'start': np.float64(2009.543), 'end': np.float64(2009.663), 'score': np.float64(0.172)}]}: 'speaker'
SPEAKER_00: Well, Toyo, I can't say that they're the same, exactly the same model, but our dealership here in Alamos has their I think you were talking about the whole idea of the MPs helping us with our investigations. You have the same system as we do at our headquarters with the swipe cards.
The run with default settings produces no "Error processing word" errors. The defaults do separate the speakers within the audio file, but they cannot determine how many speakers there actually are.
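One thing I was planning to try, based on the bounded-speaker example in the WhisperX README, is giving the pipeline a range instead of a hard count; this is only a sketch (min_speakers / max_speakers are the parameter names from the README), and whether it behaves better than num_speakers here is exactly my question:
# Sketch: bound the speaker count instead of forcing it
diarization_result = diarization_model(file_path, min_speakers=2, max_speakers=2)
result_with_speakers = whisperx.assign_word_speakers(diarization_result, aligned_transcription)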
Are there any workarounds?
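As a stopgap on my side, I'm considering skipping words that come back without a speaker label. The printing loop that produces those "Error processing word" lines is my own code and not part of the script above, so this is only a rough sketch of the fallback:
# Fall back to a placeholder label instead of raising KeyError: 'speaker'
for segment in result_with_speakers["segments"]:
    for word in segment.get("words", []):
        speaker = word.get("speaker", "SPEAKER_UNKNOWN")
        print(speaker, word.get("word", ""))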
WhisperX also seems to be using a lot of CPU. I have configured it to use the GPU as shown above, yet my machine still sits at roughly 80% CPU usage. I'm very interested in the project and would like to contribute if I can.
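For reference, this is the quick sanity check I run to confirm CUDA is actually being picked up (a sketch; some CPU load from audio decoding and preprocessing is probably expected either way, and nvidia-smi while the script runs should show the python process holding GPU memory):
import torch
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))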