Releases · vistec-AI/dataset-releases

25 Jan 07:19

5fa3288

THAI SER

The AI Research Institute of Thailand (AIResearch), a collaboration between Vidyasirimedhi Institute of Science and Technology (VISTEC) and the Digital Economy Promotion Agency (depa), in cooperation with the Department of Computer Engineering, Faculty of Engineering, and the Department of Dramatic Arts, Faculty of Arts, Chulalongkorn University, has published THAI SER, an open Thai speech emotion recognition dataset sponsored by Advanced Info Services Public Company Limited (AIS).

This dataset consists of 5 main emotions assigned to actors: Neutral, Anger, Happiness, Sadness, and Frustration. The recordings total 41 hours 36 minutes (27,854 utterances). They were performed by 200 professional actors (112 female, 88 male) and directed by students, alumni, and professors from the Faculty of Arts, Chulalongkorn University.

THAI SER contains 100 recordings, separated into two main categories: Studio and Zoom. Studio recordings were made in two environments: Studio A, a controlled studio room with soundproof walls, and Studio B, a normal room without soundproofing or noise control. The recording environments can be summarized as follows:

StudioA (noise controlled, soundproof wall)
└─ studio001
└─ studio002
...
└─ studio018

StudioB (Normal room without soundproof wall)
└─ studio019
└─ studio020
...
└─ studio080

Zoom (Recorded online via Zoom and Zencastr)
└─ zoom001
└─ zoom002
...
└─ zoom020
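
The ranges in the tree above can be turned into a small lookup. The following is a minimal sketch, not part of the dataset tooling; the function name `recording_environment` and the `kind` argument are hypothetical.

```python
def recording_environment(kind: str, number: int) -> str:
    """Map a recording number to its environment, using the ranges listed above.

    `kind` is "studio" or "zoom" (a hypothetical argument for this sketch).
    """
    if kind == "studio":
        if 1 <= number <= 18:
            return "StudioA"  # noise controlled, soundproof wall
        if 19 <= number <= 80:
            return "StudioB"  # normal room without soundproof wall
    elif kind == "zoom":
        if 1 <= number <= 20:
            return "Zoom"  # recorded online
    raise ValueError(f"unknown recording: {kind}{number:03d}")


if __name__ == "__main__":
    print(recording_environment("studio", 19))
```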

Each recording is separated into two sessions: Script Session and Improvisation Session.

To map each utterance to an emotion, we take the majority vote of the answers from 3-8 annotators, collected through crowdsourcing (wang.in.th).
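
As an illustration of majority voting over annotator answers, here is a minimal sketch. Two points are assumptions rather than documented behavior: ties return no label, and the agreement score is taken to be the share of votes for the winning label.

```python
from __future__ import annotations

from collections import Counter
from typing import Optional


def majority_emotion(votes: list[str]) -> tuple[Optional[str], float]:
    """Return (majority label, agreement) for one utterance.

    Assumptions for this sketch: a tie for first place yields no label (None),
    and agreement is the fraction of votes given to the most common label.
    """
    if not votes:
        return None, 0.0
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(votes)
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, agreement  # tie between two or more labels
    return top_label, agreement


if __name__ == "__main__":
    print(majority_emotion(["Angry", "Angry", "Frustrated"]))
```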

Script session

In the script session, the actor was assigned three sentences:

sentence 1: พรุ่งนี้มันวันหยุดราชการนะรู้รึยัง หยุดยาวด้วย
            (Did you know that tomorrow is a public holiday? It's a long one, too.)
sentence 2: อ่านหนังสือพิมพ์วันนี้รึยัง รู้ไหมเรื่องนั้นกลายเป็นข่าวใหญ่ไปแล้ว
            (Have you read today's newspaper? Did you know that story has become big news?)
sentence 3: ก่อนหน้านี้ก็ยังเห็นทำตัวปกติดี ใครจะไปรู้หล่ะ ว่าเค้าคิดแบบนั้น
            (He/She seemed to be acting normally until now; who would have thought that he/she was thinking like that?)

The actor was asked to speak each sentence twice for each emotion at two emotional intensity levels (normal, strong), plus an additional neutral expression.

Improvisation session

For the Improvisation session, two actors were asked to improvise according to a provided emotion and scenario.

Scenario	Actor A	Actor B
1	(Neutral) A hotel receptionist trying to explain things to and serve the customer	(Angry) An angry customer who is dissatisfied with the hotel's services
2	(Happy) A person excitedly talking with B about his/her marriage plans	(Happy) A person happily talking with A and helping him/her plan the ceremony
3	(Sad) A patient feeling depressed	(Neutral) A doctor attempting to talk with A neutrally
4	(Angry) A furious boss talking with an employee	(Frustrated) A frustrated person attempting to argue with his/her boss
5	(Frustrated) A person talking in frustration about another person's actions	(Sad) A person feeling guilty and sad about his/her actions
6	(Happy) A happy hotel staff member	(Happy) A happy customer
7	(Sad) A sad person who feels insecure about an upcoming marriage	(Frustrated) A person who is frustrated by the other person's insecurity
8	(Frustrated) A frustrated patient	(Neutral) A doctor talking with the patient
9	(Neutral) A worker assigned to tell his/her co-worker about the company's bad situation	(Sad) An employee feeling sad after listening
10	(Angry) A person raging about another person's behavior	(Angry) A person who feels blamed by the other person
11	(Frustrated) A director who is dissatisfied with a co-worker	(Frustrated) A frustrated person who is trying their best at the job
12	(Happy) A person who gets a new job or promotion	(Sad) A person who is in despair about his/her job
13	(Neutral) A patient inquiring for information	(Happy) A happy doctor giving his/her patient more information
14	(Angry) A person who is upset with his/her work	(Neutral) A calm friend listening to the other person's problem
15	(Sad) A person sadly telling another person about a relationship	(Angry) A person who feels angry after hearing about the other person's bad relationship
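
For programmatic use, the assigned emotions in the scenario table can be transcribed into a lookup. This is a sketch only; the names `SCENARIO_EMOTIONS` and `assigned_emotion` are hypothetical and do not come from the dataset.

```python
# Assigned emotions transcribed from the scenario table above:
# scenario number -> (Actor A emotion, Actor B emotion).
SCENARIO_EMOTIONS = {
    1: ("Neutral", "Angry"),
    2: ("Happy", "Happy"),
    3: ("Sad", "Neutral"),
    4: ("Angry", "Frustrated"),
    5: ("Frustrated", "Sad"),
    6: ("Happy", "Happy"),
    7: ("Sad", "Frustrated"),
    8: ("Frustrated", "Neutral"),
    9: ("Neutral", "Sad"),
    10: ("Angry", "Angry"),
    11: ("Frustrated", "Frustrated"),
    12: ("Happy", "Sad"),
    13: ("Neutral", "Happy"),
    14: ("Angry", "Neutral"),
    15: ("Sad", "Angry"),
}


def assigned_emotion(scenario: int, role: str) -> str:
    """Return the emotion assigned to role "A" or "B" in a given scenario."""
    actor_a, actor_b = SCENARIO_EMOTIONS[scenario]
    if role == "A":
        return actor_a
    if role == "B":
        return actor_b
    raise ValueError(f"role must be 'A' or 'B', got {role!r}")


if __name__ == "__main__":
    print(assigned_emotion(4, "B"))
```

Counting the entries shows that each of the five emotions is assigned three times to Actor A and three times to Actor B.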

File naming convention

Each file has a unique filename and is provided in .flac format with a sample rate of about 44.1 kHz. The filename consists of a 5- to 6-part identifier (e.g., s002_clip_actor003_impro1_1.flac, s002_clip_actor003_script1_1_1a.flac). These identifiers define the stimulus characteristics and are described under Filename identifiers.

File Directory Management

studio (e.g., studio1-10)
└─ <studio-num> (studio1, studio2, ...)
    └─ <mic-type> (con, clip, middle)
        └─ <audio-file> (.flac)

zoom (e.g., zoom1-10)
└─ <zoom-num> (zoom1, zoom2, ...)
    └─ <mic-type> (mic)
        └─ <audio-file> (.flac)
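
Given the layout above, audio files can be grouped by microphone type by reading each file's parent folder name. This is a minimal sketch that assumes an archive has been extracted so that every .flac file sits directly inside a mic-type folder; the function name `index_by_mic` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def index_by_mic(root):
    """Group .flac paths under `root` by the name of their parent folder.

    Assumes the <recording>/<mic-type>/<audio-file> layout shown above.
    """
    by_mic = defaultdict(list)
    for path in sorted(Path(root).rglob("*.flac")):
        by_mic[path.parent.name].append(path)
    return dict(by_mic)


if __name__ == "__main__":
    # "studio1-10" is a placeholder for wherever the archive was extracted.
    for mic, paths in index_by_mic("studio1-10").items():
        print(mic, len(paths))
```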

Filename identifiers

Recording ID (s = studio recording, z = zoom recording)
- Number of recording (e.g., s001, z001)
Microphone type (clip, con, middle, mic)
- Zoom recording session
  - mic = An actor's microphone of choice
- Studio recording session
  - con = Condenser microphone (cardioid polar pattern) placed 0.5 m from the actor
  - clip = Lavalier microphone (omnidirectional pattern) attached to the actor's shirt collar
  - middle = Condenser microphone (figure-8 polar pattern) placed between the actors
Actor ID (actor001 to actor200; in the improvisation session, odd-numbered actors are Actor A and even-numbered actors are Actor B)
Session ID (impro = Improvisation Session, script = Script Session)
- Script Session (e.g., _script1_1_1a)
  - Sentence ID (script1-script3)
  - Repetition (1 = 1st repetition, 2 = 2nd repetition)
  - Emotion (1 = Neutral, 2 = Angry, 3 = Happy, 4 = Sad, 5 = Frustrated)
  - Emotional intensity (a = Normal, b = Strong)
- Improvisation Session (e.g., _impro1_1)
  - Scenario ID (impro1-impro15)
  - Utterance no. (e.g., _impro1_1, _impro1_2)

Filename example: s002_clip_actor003_impro1_1.flac

Studio recording number 2 (s002)
Recording by Lavalier microphone (clip)
3rd Actor (actor003)
Improvisation session, scenario 1 (impro1)
1st utterance of scenario recording (1)
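
The naming convention can be decoded with a regular expression. The sketch below is one possible parser written against the two example filenames given in this section; the function name `parse_filename` and the keys of the returned dictionary are hypothetical, and filenames that deviate from the documented pattern are rejected.

```python
import re

EMOTIONS = {"1": "Neutral", "2": "Angry", "3": "Happy", "4": "Sad", "5": "Frustrated"}
INTENSITIES = {"a": "Normal", "b": "Strong"}

# Common prefix, then either a script suffix or an improvisation suffix.
_PATTERN = re.compile(
    r"^(?P<env>[sz])(?P<rec>\d+)_(?P<mic>clip|con|middle|mic)_actor(?P<actor>\d+)_"
    r"(?:script(?P<sentence>\d+)_(?P<rep>\d+)_(?P<emo>[1-5])(?P<intensity>[ab])"
    r"|impro(?P<scenario>\d+)_(?P<utt>\d+))\.flac$"
)


def parse_filename(name: str) -> dict:
    """Split a THAI SER filename into its documented identifier parts."""
    m = _PATTERN.match(name)
    if m is None:
        raise ValueError(f"filename does not follow the documented pattern: {name}")
    actor = int(m["actor"])
    info = {
        "environment": "studio" if m["env"] == "s" else "zoom",
        "recording": int(m["rec"]),
        "mic": m["mic"],
        "actor": actor,
    }
    if m["scenario"] is not None:
        info.update(
            session="impro",
            scenario=int(m["scenario"]),
            utterance=int(m["utt"]),
            # Odd-numbered actors are Actor A, even-numbered actors are Actor B.
            role="A" if actor % 2 == 1 else "B",
        )
    else:
        info.update(
            session="script",
            sentence=int(m["sentence"]),
            repetition=int(m["rep"]),
            emotion=EMOTIONS[m["emo"]],
            intensity=INTENSITIES[m["intensity"]],
        )
    return info


if __name__ == "__main__":
    print(parse_filename("s002_clip_actor003_impro1_1.flac"))
    print(parse_filename("s002_clip_actor003_script1_1_1a.flac"))
```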

Other Files

emotion_label.json - a dictionary for recording id, assigned emotion (assigned_emo), majority emotion (emotion_emo), annotated emotions from crowdsourcing (annotated), and majority agreement score (agreement)
actor_demography.json - a dictionary that contains information about the age and sex of actors.
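
A common use of the label fields is to keep only utterances whose crowd label matches the assigned emotion with sufficient agreement. The sketch below works on a single record and rests on several assumptions: that a record is a dictionary holding the keys `assigned_emo`, `emotion_emo`, and `agreement` named above, that the two emotion fields are directly comparable, and that `agreement` is numeric. The function name `keep_record` and the default threshold are hypothetical; check the actual JSON layout before relying on this.

```python
def keep_record(record: dict, min_agreement: float = 0.7) -> bool:
    """Return True if the majority label matches the assigned emotion
    and the agreement score reaches `min_agreement`.

    Assumes `record` holds the keys named in the description above and that
    `agreement` is a number; the 0.7 default is an arbitrary example value.
    """
    same_label = record["emotion_emo"] == record["assigned_emo"]
    return same_label and float(record["agreement"]) >= min_agreement


if __name__ == "__main__":
    # A made-up record for illustration only, not taken from the dataset.
    example = {"assigned_emo": "Sad", "emotion_emo": "Sad", "agreement": 0.8}
    print(keep_record(example))
```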

Version

Version 1 (26 March 2021): The Thai speech emotion recognition dataset THAI SER contains 100 recordings (80 studio and 20 Zoom) totaling 41 hours 36 minutes and 27,854 utterances, all of which are labeled.

Dataset statistics

Recording environment	Session	Number of utterances	Duration (hrs)
Zoom (20)	Script	2,398	4.0279
	Improvisation	3,606	5.8860
Studio (80)	Script	9,582	13.6903
	Improvisation	12,268	18.0072
Total (100)		27,854	41.6114
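
The totals in the table can be cross-checked against the headline figures (27,854 utterances, 41 hours 36 minutes). The following sketch simply re-adds the four rows and converts decimal hours to hours and minutes; the names `ROWS`, `totals`, and `to_hours_minutes` are hypothetical.

```python
# Rows transcribed from the table above:
# (environment, session, number of utterances, duration in hours).
ROWS = [
    ("Zoom", "Script", 2398, 4.0279),
    ("Zoom", "Improvisation", 3606, 5.8860),
    ("Studio", "Script", 9582, 13.6903),
    ("Studio", "Improvisation", 12268, 18.0072),
]


def totals(rows):
    """Return (total utterances, total hours) over all rows."""
    return sum(r[2] for r in rows), sum(r[3] for r in rows)


def to_hours_minutes(hours: float):
    """Convert decimal hours to (hours, minutes), dropping leftover seconds."""
    whole_minutes = int(hours * 60)
    return divmod(whole_minutes, 60)


if __name__ == "__main__":
    n_utterances, n_hours = totals(ROWS)
    print(n_utterances, round(n_hours, 4), to_hours_minutes(n_hours))
```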

Dataset sponsorship and license

Advanced Info Services Public Company Limited

This work is published under a Creative Commons BY-SA 4.0 license.

Assets 15

File	Size	Timestamp
actor_demography.json	9.99 KB	2021-03-03T10:41:28Z
emotion_label.json	15.3 MB	2021-03-31T09:18:29Z
README_TH-1.md	17.1 KB	2021-03-27T03:17:32Z
studio1-10.zip	1.45 GB	2021-01-27T11:42:11Z
studio11-20.zip	1.38 GB	2021-01-26T09:10:30Z
studio21-30.zip	1.27 GB	2021-01-25T12:31:41Z
studio31-40.zip	1.37 GB	2021-01-28T21:32:52Z
studio41-50.zip	1.33 GB	2021-01-28T21:16:49Z
studio51-60.zip	1.35 GB	2021-02-09T05:49:19Z
studio61-70.zip	1.3 GB	2021-04-04T11:47:02Z
Source code (zip)		2020-06-04T14:07:28Z
Source code (tar.gz)		2020-06-04T14:07:28Z

22 Jun 09:45

lalital

scb-mt-en-th-2020_v1.0

5fa3288

scb-mt-en-th-2020 - v1.0

The AI Research Institute of Thailand (AIResearch), a collaboration between Vidyasirimedhi Institute of Science and Technology (VISTEC) and the Digital Economy Promotion Agency (depa), has published scb-mt-en-th-2020, an open English-Thai machine translation dataset sponsored by Siam Commercial Bank (SCB). The dataset contains parallel sentences from various sources such as task-based conversations, organization websites, Wikipedia articles, and government documents.

To obtain parallel sentences, we hired professional and crowdsourced translators and built a module to automatically align parallel sentence pairs from documents, articles, and web pages.

AIResearch also shares pre-trained models for both the Thai→English and English→Thai directions as baseline models. For more information, see Thai-English Machine Translation Model.

Dataset statistics

The English-Thai machine translation dataset scb-mt-en-th-2020 version 1.0 comprises 1,001,752 segment pairs. The data comes from 12 different sources (CSV files) as follows:

Method	Sub-dataset	Description	Number of segment pairs
Professional Translators	task_master_1	Task-based dialogs from the Taskmaster-1 dataset, translated to Thai by professional translators.	222,733
	generated_review_translator	Machine-generated product reviews in English, translated to Thai by professional translators.	133,330
Crowd-sourced Translators	nus_sms	SMS messages in English from the NUS SMS corpus, translated to Thai by crowdsourced translators.	43,750
	msr_paraphrase	Sentences from the Microsoft Research Paraphrase Corpus, translated to Thai by crowdsourced translators.	10,371
	mozilla_common_voice	English transcripts from the Common Voice dataset, translated to Thai by crowdsourced translators.	33,797
	generated_review_crowd	Machine-generated product reviews in English, translated to Thai by crowdsourced translators.	24,587
Annotation by Translators	generated_review_yn	Machine-generated product reviews in English, translated to Thai by Google Translate API (v3) in May 2020 and verified by translators.	280,208
Sentence Alignment on PDF Documents	assorted_government	Aligned segments obtained from Thai government PDF documents.	25,398
Sentence Alignment on Web-crawled Data	thai_websites	Aligned segments from web-crawl data from the top-500 domains in Thailand ranked by alexa.com in May 2020.	120,280
	paracrawl	Aligned segments from web-crawl data from the domains listed in ParaCrawl Corpus v5.	60,039
	wikipedia	Aligned segments from parallel English-Thai Wikipedia articles.	33,756
	apdf	Aligned segments from a news site, Asia Pacific Defense Forum.	13,503
Total			1,001,752
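
As a quick consistency check, the per-source counts in the table can be summed and grouped by collection method. The sketch below only re-adds the figures listed above; the names `SEGMENT_PAIRS` and `pairs_per_method` are hypothetical.

```python
# Counts transcribed from the table above: sub-dataset -> (method, segment pairs).
SEGMENT_PAIRS = {
    "task_master_1": ("Professional Translators", 222_733),
    "generated_review_translator": ("Professional Translators", 133_330),
    "nus_sms": ("Crowd-sourced Translators", 43_750),
    "msr_paraphrase": ("Crowd-sourced Translators", 10_371),
    "mozilla_common_voice": ("Crowd-sourced Translators", 33_797),
    "generated_review_crowd": ("Crowd-sourced Translators", 24_587),
    "generated_review_yn": ("Annotation by Translators", 280_208),
    "assorted_government": ("Sentence Alignment on PDF Documents", 25_398),
    "thai_websites": ("Sentence Alignment on Web-crawled Data", 120_280),
    "paracrawl": ("Sentence Alignment on Web-crawled Data", 60_039),
    "wikipedia": ("Sentence Alignment on Web-crawled Data", 33_756),
    "apdf": ("Sentence Alignment on Web-crawled Data", 13_503),
}


def pairs_per_method(table):
    """Sum segment pairs for each collection method."""
    result = {}
    for method, count in table.values():
        result[method] = result.get(method, 0) + count
    return result


if __name__ == "__main__":
    by_method = pairs_per_method(SEGMENT_PAIRS)
    total = sum(by_method.values())
    for method, count in by_method.items():
        print(f"{method}: {count:,} ({count / total:.1%})")
    print(f"Total: {total:,}")
```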

More statistics on the dataset, including the number of words/sentences and examples of parallel sentence pairs, can be seen in a notebook via Google Colaboratory.

Version

Version 1.0 (23 June 2020): English-Thai machine translation dataset scb-mt-en-th-2020 version 1.0, containing 1,001,752 segment pairs.

Sponsorship and license

Siam Commercial Bank PCL has published the dataset to the public under the Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0), except for the English-Thai sentence pairs from Mozilla Common Voice, which are released under CC0 (No Rights Reserved).

The AI Research Institute of Thailand (AIResearch), a collaboration between Vidyasirimedhi Institute of Science and Technology (VISTEC) and the Digital Economy Promotion Agency (depa), has released to the public an English-Thai parallel sentence dataset of more than 1 million sentence pairs, with support from Siam Commercial Bank PCL (SCB), under the name scb-mt-en-th-2020. The dataset was collected from a variety of sources, such as sentences from conversations, news or organization websites with content in both languages, Wikipedia articles, and government documents.

The sentence pairs were obtained both by hiring translators and by using an algorithm to automatically align Thai and English sentences (sentence alignment) from documents, articles, and websites.

The dataset is model-ready, that is, it can be used immediately to train machine translation models. The research center has also released pre-trained models for direct use and as baseline models (for more, see Thai-English Machine Translation Model).

Dataset statistics

The English-Thai parallel sentence dataset scb-mt-en-th-2020 version 1.0 contains 1,001,752 sentence pairs in total. The data is divided into 12 sources (.csv files) as follows:

Method	Sub-dataset	Description	Number of sentence pairs
Translation by professional translators	task_master_1	Dialogs in English from the Taskmaster-1 dataset, translated to Thai by professional translators	222,733
	generated_review_translator	Model-generated product reviews, translated to Thai by professional translators	133,330
Translation by translators from a crowdsourcing platform	nus_sms	SMS messages in English from the NUS SMS dataset, translated to Thai by translators from a crowdsourcing platform	43,750
	msr_paraphrase	Sentences in English from the Microsoft Research Paraphrase Identification dataset, translated to Thai by translators from a crowdsourcing platform	10,371
	mozilla_common_voice	Spoken transcripts in English from the Common Voice project, translated to Thai by translators from a crowdsourcing platform	33,797
	generated_review_crowd	Model-generated product reviews, translated to Thai by translators from a crowdsourcing platform	24,587
Verification by translators	generated_review_yn	Model-generated product reviews in English, sent to Google Translate API (v3) in May 2020 for translation to Thai and checked for correctness by translators	280,208
Sentence alignment from PDF documents	assorted_government	Sentence pairs aligned from a collection of Thai government documents in PDF format	25,398
Sentence alignment from website data	thai_websites	Sentence pairs aligned from website data, based on the top 500 domain names in Thailand as ranked by alexa.com in May 2020	120,280
	paracrawl	Sentence pairs aligned from website data, based on the domain names that appear in the ParaCrawl v5 dataset	60,039
	wikipedia	Sentence pairs aligned from...
