finnish-parliament-scripts

Scripts for retrieving and aligning speech recordings and meeting transcripts from the web portal of the Parliament of Finland (https://www.eduskunta.fi).

Dependencies:

sox
avconv
sclite
python3
python3-lxml
wget

An ASR system is also required to produce the first-pass hypotheses.

Download videos and meeting transcripts and save into DATA-FOLDER:

retrieve/retrieve_sessions.py DATA-FOLDER

Four different files will be saved for each session:

*.mp4 - video of the session
*.wav - audio file in WAV format (16 kHz, mono)
*.transcript - meeting transcript with speaker information for each paragraph
*.metadata - metadata file containing date information and links to the original video and meeting transcript

EDIT: Currently the retrieval of the meeting transcripts fails because the publishing format has changed.
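Since transcript retrieval may fail, it can be useful to check which sessions in DATA-FOLDER have all four files. The following is a minimal sketch, not part of the repository; it assumes that the four files of a session share the same base name and differ only in extension. The function name missing_files is hypothetical.

```python
from pathlib import Path

# The four per-session extensions listed above.
EXPECTED_EXTENSIONS = (".mp4", ".wav", ".transcript", ".metadata")


def missing_files(data_folder):
    """Map each incomplete session base name to its missing extensions.

    Assumption: all files of one session share a base name, e.g.
    NAME.mp4, NAME.wav, NAME.transcript and NAME.metadata.
    """
    found = {}
    for path in Path(data_folder).iterdir():
        if path.is_file() and path.suffix in EXPECTED_EXTENSIONS:
            found.setdefault(path.stem, set()).add(path.suffix)

    # Keep only sessions where at least one expected file is absent.
    return {
        session: [ext for ext in EXPECTED_EXTENSIONS if ext not in extensions]
        for session, extensions in sorted(found.items())
        if len(extensions) < len(EXPECTED_EXTENSIONS)
    }
```

Calling missing_files("DATA-FOLDER") returns an empty dictionary when every session is complete. Sessions without a transcript cannot be aligned in the later steps.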

Produce first-pass recognition output with an ASR system, preferably using a language model (LM) biased toward the meeting transcripts.

Store recognition output in the following format:

start-time-in-seconds end-time-in-seconds word
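Before running the alignment, the recognition output can be checked against this format. The sketch below is not part of the repository; it assumes one word per line with whitespace-separated fields. The names TimedWord and parse_asr_lines are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class TimedWord(NamedTuple):
    start: float  # start time in seconds
    end: float    # end time in seconds
    word: str


def parse_asr_lines(lines: Iterable[str]) -> List[TimedWord]:
    """Parse lines of the form 'start-time end-time word'.

    Assumption: fields are separated by whitespace and each line holds
    exactly one word. Blank lines are skipped.
    """
    words = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        start, end = float(fields[0]), float(fields[1])
        if end < start:
            raise ValueError(f"line {line_number}: end time precedes start time")
        words.append(TimedWord(start, end, fields[2]))
    return words
```

To validate a file, pass an open file object, for example parse_asr_lines(open("asr-output", encoding="utf-8")). A ValueError points to the first malformed line.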

Align the first-pass output with the meeting transcript using sclite:

align/asr_align_2_elan.py asr-output transcript-file metadata-filename elan-filename

The output is written in the ELAN EAF format.
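EAF is an XML format, so the aligned segments can be inspected with the Python standard library. The sketch below is not part of the repository; it relies only on the general EAF structure (TIME_SLOT elements with millisecond values, and ALIGNABLE_ANNOTATION elements inside TIER elements). How the alignment script names and organises its tiers is not documented here, so the tier ID is returned as-is. The function name read_eaf_segments is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_eaf_segments(root):
    """Return (tier_id, start_seconds, end_seconds, text) for each annotation.

    `root` is the ANNOTATION_DOCUMENT element of a parsed EAF file.
    """
    # TIME_SLOT elements map slot IDs to times in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }

    segments = []
    for tier in root.iter("TIER"):
        for annotation in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(annotation.get("TIME_SLOT_REF1"))
            end = slots.get(annotation.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without resolved time slots
            text = annotation.findtext("ANNOTATION_VALUE", default="")
            segments.append(
                (tier.get("TIER_ID"), start / 1000.0, end / 1000.0, text.strip())
            )
    return segments
```

For a file on disk, parse it first: read_eaf_segments(ET.parse("elan-filename").getroot()).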

Test the alignment script with example files:

align/asr_align_2_elan.py test/session_79_2008.asr test/session_79_2008.transcript test/session_79_2008.metadata test/session_79_2008.eaf

Extract individual speech segments from a list of EAF-files:

extract/elan_wav_extractor.py eaf-list wav-segment-dir

Stores both an audio file (.wav) and a transcript (.trn) for each segment.

Extract individual speech segments from a list of metadata files:

extract/corpus_extractor.py metadata-file-list

Stores an audio file (.wav) for each segment.

André Mansikkaniemi, andre.mansikkaniemi@aalto.fi
