Current script:
import torch
import whisperx

# Path to the 16 kHz wave file (placeholder)
file_path = "audio.wav"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load WhisperX model
whisper_model = whisperx.load_model("large-v2", device=device, compute_type="float32")

# Transcribe the audio file with WhisperX
transcription_result = whisper_model.transcribe(file_path, language="en", batch_size=16)

# Load and process audio
audio_data = whisperx.load_audio(file_path)
audio_tensor = torch.from_numpy(audio_data)

# Align transcription with timestamps
model_a, metadata = whisperx.load_align_model(language_code="en", device=device)
aligned_transcription = whisperx.align(
    transcript=transcription_result["segments"],
    model=model_a,
    align_model_metadata=metadata,
    audio=audio_tensor,  # pass the tensor rather than the file path
    device=device,
)

# Load speaker diarization model in WhisperX
# (a Hugging Face token is usually required for the pyannote model)
diarization_model = whisperx.DiarizationPipeline(use_auth_token=None, device=device)
diarization_result = diarization_model(file_path)

# Assign speaker labels to the aligned transcription
result_with_speakers = whisperx.assign_word_speakers(diarization_result, aligned_transcription)
Wave file: 16 kHz
It seems that with two people we get errors in the output during diarization when forcing the speaker count:
diarization_result = diarization_model(file_path, num_speakers=2)
However, with num_speakers set to 6+ or with the default settings:
diarization_result = diarization_model(file_path)
we are able to differentiate between speakers, but the script believes there are more than two speakers in the audio file. There are 2 speakers, yet it reports 6+.
Example output with the 2-speaker setting:
SPEAKER_01: Okay.
Error processing word {'start': 1741.207, 'end': 1741.287, 'text': 'Okay.', 'words': [{'word': 'Okay.', 'start': np.float64(1741.207), 'end': np.float64(1741.287), 'score': np.float64(0.0)}]}: 'speaker'
Error processing word {'start': 2009.363, 'end': 2009.443, 'text': 'Okay.', 'words': [{'word': 'Okay.', 'start': np.float64(2009.363), 'end': np.float64(2009.443), 'score': np.float64(0.001)}]}: 'speaker'
Error processing word {'start': 2009.463, 'end': 2009.663, 'text': 'All right.', 'words': [{'word': 'All', 'start': np.float64(2009.463), 'end': np.float64(2009.523), 'score': np.float64(0.026)}, {'word': 'right.', 'start': np.float64(2009.543), 'end': np.float64(2009.663), 'score': np.float64(0.172)}]}: 'speaker'
SPEAKER_00: Well, Toyo, I can't say that they're the same, exactly the same model, but our dealership here in Alamos has their I think you were talking about the whole idea of the MPs helping us with our investigations. You have the same system as we do at our headquarters with the swipe cards.
The run with default settings produces no "Error processing word" errors. The defaults do separate the speakers within the audio file, but they cannot determine how many speakers there actually are.
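One thing I was planning to try, based on the bounded-speaker example in the WhisperX README, is giving the pipeline a range instead of a hard count; this is only a sketch (min_speakers / max_speakers are the parameter names from the README), and whether it behaves better than num_speakers here is exactly my question:
# Sketch: bound the speaker count instead of forcing it
diarization_result = diarization_model(file_path, min_speakers=2, max_speakers=2)
result_with_speakers = whisperx.assign_word_speakers(diarization_result, aligned_transcription)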
Are there any workarounds?
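As a stopgap on my side, I'm considering skipping words that come back without a speaker label. The printing loop that produces those "Error processing word" lines is my own code and not part of the script above, so this is only a rough sketch of the fallback:
# Fall back to a placeholder label instead of raising KeyError: 'speaker'
for segment in result_with_speakers["segments"]:
    for word in segment.get("words", []):
        speaker = word.get("speaker", "SPEAKER_UNKNOWN")
        print(speaker, word.get("word", ""))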
WhisperX also seems to be using a lot of CPU. I have configured it to use the GPU as shown above, yet my machine still sits at roughly 80% CPU usage. I'm very interested in the project and would like to contribute if I can.
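For reference, this is the quick sanity check I run to confirm CUDA is actually being picked up (a sketch; some CPU load from audio decoding and preprocessing is probably expected either way, and nvidia-smi while the script runs should show the python process holding GPU memory):
import torch
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))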