<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="generator" content="Hugo 0.66.0" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,600" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">
<link rel="stylesheet" href="../css/normalize.css">
<link rel="stylesheet" href="../css/skeleton.css">
<link rel="stylesheet" href="../css/custom.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@4.0.0/dist/css/bootstrap.min.css"
integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">
<link rel="alternate" href="index.xml" type="application/rss+xml" title="Speech Research">
<link rel="shortcut icon" href="favicon.png" type="image/x-icon" />
<title>U-Diffusion Vision Transformer for Text-to-Speech - Speech Research</title>
</head>
<body rightmargin="150" leftmargin="150" topmargin="100" bottommargin="100" style="line-height:160%">
<style>
.img-container figcaption {
text-align: center;
}
</style>
<style>
table {
width: 100%;
border-collapse: collapse;
}
td {
padding: 10px;
text-align: center;
font-size: 80%
}
img {
max-width: 100%;
max-height: 100%;
object-fit: contain;
}
</style>
<font size="5">
<p> </p>
<p> </p>
<div class="container"><header role="banner"></header>
<article><br />
<h1 align="center"><span style="font-size: 120%;">U-Diffusion Vision Transformer for Text-to-Speech</span></h1>
<br />
<p style="line-height: 1;" align="center"><strong> Xin Jing<sup>1</sup>, Yi Chang<sup>2</sup>, Zijiang Yang<sup>1</sup>, Andreas Triantafyllopoulos<sup>1</sup>, Bjoern Schuller<sup>1</sup> </strong></p>
<p style="line-height: 0.6;" align="center"><sup>1</sup>University of Augsburg, Augsburg, Germany</p>
<p style="line-height: 0.6;" align="center"><sup>2</sup>Imperial College London, London, UK</p>
<section><br />
<div class="container"><center>
<p><a href="https://arxiv.org/abs/2305.13195">[Paper on arXiv]</a> <a href="https://github.com/EIHW/u-dit-tts/tree/main">[Code on GitHub]</a></p>
</center></div>
<!-- <h2 id="under" align="center"><img src="img/icons/noun-construction-2085884.png" alt="Image description" width="60px">Still Under Construction....</h2> -->
<h2 id="abstract">Abstract</h2>
<p style="text-align: justify; font-size: 80%;">Recently, Score-based Generative Models (SGMs), also known as Diffusion Probabilistic Models (DPMs), have gained traction due to their ability to produce high-quality speech in neural synthesis systems. In SGMs, the U-Net architecture and its variants have dominated as the backbone since their first successful adoption. In this research, we propose the U-DiT architecture, exploring the potential of the vision transformer (ViT) architecture as the core component of the diffusion model in a text-to-speech (TTS) system. The proposed U-DiT TTS system combines the strengths of U-Net and ViT, which allows for great scalability and versatility across different data scales, and uses a pretrained HiFi-GAN as the vocoder. The objective (i.e., Fréchet distance) and MOS results demonstrate that our U-DiT TTS system achieves competitive performance on the single-speaker dataset LJSpeech. Our demos are publicly available at: <a href="https://eihw.github.io/u-dit-tts/">https://eihw.github.io/u-dit-tts/</a></p>
<table>
<tr>
<td><img src="img/framework.png" alt="Framework" width="90%"></td>
<td><img src="img/udit.png" alt="U-DiT" width="80%"></td>
</tr>
<tr>
<td>Framework</td>
<td>U-DiT</td>
</tr>
</table>
<h2 id="samples">TTS Samples</h2>
<p style="font-size: 80%; color:gray;"> 1. <i>The poorer prisoners were not in abject want, as in other prisons,</i></p>
<table class="table" style="table-layout: fixed; word-break: break-word;" align="center">
<thead>
<tr>
<td scope="col" width="25%">Ground Truth</td>
<td scope="col" width="25%">Ground Truth mel</td>
<td scope="col" width="25%">U-DiT</td>
<td scope="col" width="25%">Grad-TTS</td>
</tr>
</thead>
<tbody>
<tr>
<td scope="row"><audio controls="controls" style="width: 100%;">
<source src="./data/gt/LJ002-0261.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/gt-mel/LJ002-0261.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/s/LJ002-0261.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/g/LJ002-0261.wav" autoplay="autoplay" />
Your browser does not support the audio element.
</audio></td>
</tr>
</tbody>
</table>
<!-- #2 -->
<p style="font-size: 80%; color:gray;"> 2. <i>In eighteen fifty-five</i></p>
<table class="table" style="table-layout: fixed; word-break: break-word;" align="center">
<thead>
<tr>
<td scope="col" width="25%">Ground Truth</td>
<td scope="col" width="25%">Ground Truth mel</td>
<td scope="col" width="25%">U-DiT</td>
<td scope="col" width="25%">Grad-TTS</td>
</tr>
</thead>
<tbody>
<tr>
<td scope="row"><audio controls="controls" style="width: 100%;">
<source src="./data/gt/LJ018-0218.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/gt-mel/LJ018-0218.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/s/LJ018-0218.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/g/LJ018-0218.wav" autoplay="autoplay" />
Your browser does not support the audio element.
</audio></td>
</tr>
</tbody>
</table>
<!-- #3 -->
<p style="font-size: 80%; color:gray;"> 3. <i>seems necessary to produce the same result of justice and right conduct</i></p>
<table class="table" style="table-layout: fixed; word-break: break-word;" align="center">
<thead>
<tr>
<td scope="col" width="25%">Ground Truth</td>
<td scope="col" width="25%">Ground Truth mel</td>
<td scope="col" width="25%">U-DiT</td>
<td scope="col" width="25%">Grad-TTS</td>
</tr>
</thead>
<tbody>
<tr>
<td scope="row"><audio controls="controls" style="width: 100%;">
<source src="./data/gt/LJ021-0026.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/gt-mel/LJ021-0026.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/s/LJ021-0026.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/g/LJ021-0026.wav" autoplay="autoplay" />
Your browser does not support the audio element.
</audio></td>
</tr>
</tbody>
</table>
<!-- #4 -->
<p style="font-size: 80%; color:gray;"> 4. <i>And there may be only nine.</i></p>
<table class="table" style="table-layout: fixed; word-break: break-word;" align="center">
<thead>
<tr>
<td scope="col" width="25%">Ground Truth</td>
<td scope="col" width="25%">Ground Truth mel</td>
<td scope="col" width="25%">U-DiT</td>
<td scope="col" width="25%">Grad-TTS</td>
</tr>
</thead>
<tbody>
<tr>
<td scope="row"><audio controls="controls" style="width: 100%;">
<source src="./data/gt/LJ024-0019.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/gt-mel/LJ024-0019.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/s/LJ024-0019.wav" autoplay="autoplay" />Your browser does not support the audio element.
</audio></td>
<td><audio controls="controls" style="width: 100%;">
<source src="./data/g/LJ024-0019.wav" autoplay="autoplay" />
Your browser does not support the audio element.
</audio></td>
</tr>
</tbody>
</table>
<div class="container"> </div>
</section>
</article>
</div>