What is the feature detection story for MediaStreamTrack support? #126
|
#2 sounds like a pretty straightforward option that doesn't require any changes to the shape of the API. In Chromium, on-device speech recognition support and MediaStreamTrack support are both behind the same feature flag, so they will launch at the same time. This feature association isn't necessarily guaranteed by other browsers in the future though. |
This worries me indeed.
If we go down this road, would it make sense to have a […]? Moreover, the spec should say whether […] |
We could have a source attribute, but then it would make the setSource(MediaStreamTrack) redundant as it would only be needed for feature detection purposes. What do you all think about renaming start(MediaStreamTrack) to startWithMediaStreamTrack(MediaStreamTrack)? |
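A minimal sketch of why the rename helps: if the method ships under the proposed name, feature detection reduces to a prototype check. The method name below is the one proposed in this thread, not a shipped API.

```js
// Sketch: detect support for the proposed startWithMediaStreamTrack() rename.
// Assumes the rename lands; the name is only a proposal at this point.
const Recognition = self.SpeechRecognition || self.webkitSpeechRecognition;
const supportsAudioTrackSource =
  !!Recognition && "startWithMediaStreamTrack" in Recognition.prototype;

if (supportsAudioTrackSource) {
  // recognition.startWithMediaStreamTrack(audioTrack) can be called safely.
}
```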
I'm fine with either Option 1 […].

3. Add a readonly attribute […]
|
Can we not use the standard […]? |
The following seems a bit too much to me, but why not?

4. Add a static isSourceSupported() method […]
|
gentle ping |
This is extensible, potentially providing an avenue to offer offline transcription ("faster-than-realtime") in the future, by adding a third source type (e.g. WHATWG Streams).
Considering the speed at which my voice-to-text transcription experiments (with various software and hardware) have been running, and the ongoing rate of progress of software and hardware, I think this is a good feature to consider. |
One thing I just realized is that […] |
I prototyped this in Chromium to see how this would work with https://chromium-review.googlesource.com/c/chromium/src/+/6179415/1 and it looks like Blink bindings are catching this type error before we can actually return a promise:

```js
await webkitSpeechRecognition.isSourceSupported(undefined);
// > {source: undefined, supported: true}

await webkitSpeechRecognition.isSourceSupported(1);
// > Uncaught TypeError: Failed to execute 'isSourceSupported' on 'SpeechRecognition':
// > The provided value is not of type '(MediaStreamTrack or undefined)'.
```

Is this pattern still good enough?

```js
try {
  const { supported } = await webkitSpeechRecognition.isSourceSupported(myAudioTrack);
  if (supported) {
    // MediaStreamTrack is supported by the Web Speech API.
  } else {
    throw Error('"¯\_(ツ)_/¯"');
  }
} catch (error) {
  // MediaStreamTrack is NOT supported by the Web Speech API.
}
```
|
Try modeling this after Web Codecs; this works: https://w3c.github.io/webcodecs/#audiodecoder-interface. |
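For context, the Web Codecs pattern referenced here is a static, promise-returning support query. A minimal usage example of that real API (the config values below are arbitrary):

```js
// Web Codecs support query: a static method takes a config dictionary and
// resolves with { supported, config }.
const { supported } = await AudioDecoder.isConfigSupported({
  codec: "opus",        // arbitrary example codec
  sampleRate: 48000,
  numberOfChannels: 1,
});
if (supported) {
  // The UA can decode this configuration.
}
```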
This model uses dictionary strings instead of `typedef (MediaStreamTrack or undefined)`.

```webidl
dictionary SpeechRecognitionSource {
  required DOMString source; // "mediastreamtrack" or "microphone"
};

dictionary SpeechRecognitionSourceSupport {
  boolean supported;
  SpeechRecognitionSource source;
};

partial interface SpeechRecognition {
  static Promise<SpeechRecognitionSourceSupport> isSourceSupported(SpeechRecognitionSource source);
};
```

```js
if ('isSourceSupported' in webkitSpeechRecognition) {
  const { supported } = await webkitSpeechRecognition.isSourceSupported({ source: "mediastreamtrack" });
  if (supported) {
    // MediaStreamTrack is supported by the Web Speech API.
  }
}
```
|
This is in line with what I see in APIs "around" our API here, such as media decoding and encoding, and WebRTC. Thanks a lot for taking the time to revise the proposal multiple times. Here are some simple changes to this interface: lower-case strings are indeed the recommended way to do "enums" on the Web, but we can make it explicit, e.g. we'd have:

```webidl
enum SpeechRecognitionSource {
  "microphone",
  "mediastreamtrack",
  "streams" // future extension w/ WHATWG Streams
};

dictionary SpeechRecognitionOptions {
  // Do we need more members here?
  required SpeechRecognitionSource source;
};

dictionary SpeechRecognitionSourceSupport {
  boolean supported;
  SpeechRecognitionOptions options;
};

partial interface SpeechRecognition {
  static Promise<SpeechRecognitionSourceSupport> isSourceSupported(SpeechRecognitionOptions options);
};
```

This shape works if we foresee more options being added (otherwise, we can cut a layer). I'm thinking of a language identifier, or a model size, etc. Do we also want to add provision besides […]? I understand it's a bit verbose, but I think the verbosity isn't unwarranted in this instance:
|
Non-recognized enum values will still throw a TypeError though, right?
To be on the safe side, I'm happy to add […] |
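A minimal sketch of what that guard could look like from the caller's side, assuming the enum-based isSourceSupported() shape proposed above (not shipped anywhere). Per WebIDL, an unrecognized enum value in the dictionary surfaces as a TypeError rejection, since the operation returns a promise.

```js
// Probe a source value the UA may not recognize. With the proposed
// enum-based dictionary, an unknown value rejects with a TypeError.
async function sourceSupported(source) {
  try {
    const { supported } = await SpeechRecognition.isSourceSupported({ source });
    return supported;
  } catch (error) {
    // e.g. TypeError for an enum value this UA doesn't know about.
    return false;
  }
}

// Usage: await sourceSupported("mediastreamtrack");
```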
IMO, while this approach offers the most extensibility, it seems a little over-engineered for future use cases that may never materialize and comes at the cost of a more complicated API surface. My preference would be for a simpler approach with either the readonly attribute or the new startWithMediaStreamTrack name. Given the historical rate of progress on the Web Speech API, I'm skeptical about if and when we'll need to extend the functionality of this API. But if we do want to add new functionality with additional sources and options, we may want to consider developing a new, modernized speech recognition API with Promises instead of callbacks. I've filed issue #130 to track this discussion.
PR #132 updates the specification to define the method for detecting on-device speech recognition availability as asynchronous. What do you all think? |
The WebIDL may seem complex at first glance, but from a web developer's perspective, it's actually quite straightforward. I recommend using either […] |
I was wrong indeed; I checked, and other specs use a […]. Firefox would implement faster-than-real-time transcription rather quickly, I assume, so I would still sit on the side of an extensible API. |
Let me summarize this then. We either go with the following:

```webidl
partial interface SpeechRecognition {
  undefined startWithMediaStreamTrack(MediaStreamTrack audioTrack);
};
```

Or the future-proof version:

```webidl
dictionary SpeechRecognitionOptions {
  required DOMString source; // "mediastreamtrack", "microphone", or "streams" in the future
};

dictionary SpeechRecognitionSourceSupport {
  boolean supported;
  SpeechRecognitionOptions options;
};

partial interface SpeechRecognition {
  static Promise<SpeechRecognitionSourceSupport> isSourceSupported(SpeechRecognitionOptions options);
};
```
|
I support the second API, thanks for shepherding this through. |
@eric-carlson Do you have opinions on this? |
I have a slight preference for the first API, but not enough to cause a fuss if you two prefer the second one :) If we do go with the second one, do we still need it to be asynchronous? Checking if the browser supports MediaStreamTrack (or streams) doesn't necessarily need to be tied to on-device speech recognition, though I suppose making it asynchronous provides browsers with the flexibility to do whatever they want to make that determination. |
Usually, we do not tend to add a feature detection API if there is a way to do it in JS. For instance, calling […]. It would be nice to have a more precise algorithm written in the spec, to get the exact order of the checks. And it would probably make sense to clarify that […] |
Thank you for jumping into the discussion @youennf! I've started playing with assumed exceptions at https://web-speech-mediastreamtrack.glitch.me/issue-126.html and it seems like we have plenty of work to do interop-wise if we're going this way ;) |
There's one thing that, in my opinion, prevents us from using exceptions: failure due to […] is reported asynchronously via an error event. It means we have to wait for an amount of time before deciding if an error event happened or not. This is bad in my opinion. The […]

For info, I've updated https://web-speech-mediastreamtrack.glitch.me/issue-126.html with plenty of tests to discover what we could do. The following screenshot showed Safari, Chrome, and Chrome Canary with MediaStreamTrack support (screenshot omitted).

And if you didn't want to dig in the code, here's the snippet I used:

```js
(async () => {
  const iframe = document.createElement("iframe");
  iframe.setAttribute("allow", "microphone 'none'");
  iframe.src = URL.createObjectURL(new Blob([], { type: "text/html" }));
  document.body.appendChild(iframe);

  const recognition = new iframe.contentWindow.webkitSpeechRecognition();
  const ac = new AudioContext();
  const mediaStreamDestination = ac.createMediaStreamDestination();
  const audioTrack = mediaStreamDestination.stream.getAudioTracks()[0];

  try {
    recognition.start(audioTrack);
    const errorPromise = new Promise((resolve) => {
      recognition.onerror = ({ error }) => {
        resolve(error);
      };
    });
    const timeoutPromise = new Promise((resolve) => setTimeout(resolve, 1000));
    const errorEvent = await Promise.race([errorPromise, timeoutPromise]);
    if (errorEvent) {
      console.log("Fail! recognition.start(audioTrack) fired error event");
    } else {
      console.log("Success! recognition.start(audioTrack) did not fire error event");
    }
  } catch (error) {
    console.log("Fail! recognition.start(audioTrack) should succeed");
  }
  iframe.parentNode.removeChild(iframe);
})();
```
|
w/ my Firefox implementer hat on, if you happen to be in a position in which it's easy for you to list somewhere the differences you see, we're happy to add them to our roadmap, and align. Our implementation has drifted some, and if we're doing work on it as part of local recognition, we might as well make it interoperable. |
Oh, I just found out Firefox had an implementation behind a flag.
@padenot What do you think of #126 (comment)? |
It's not enabled for a reason :-). We'll make sure it's compatible before shipping, w/ WPT and all.
As said previously, I support an […] |
Thanks for doing this!
Not really, we just need to check whether […]. This approach can probably support streams support detection with a locked ReadableStream. This feature detection will usually be asynchronous (we need to wait for the blob URL iframe to be loaded if the current frame can get microphone access). |
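A sketch of that asynchronous form, reusing the same iframe/start(0) probe as the snippets in this thread but waiting for the blob-URL iframe to finish loading first. The function name is only illustrative.

```js
// Async variant: wait for the blob-URL iframe (with microphone disallowed)
// to load, then probe whether start() rejects a non-MediaStreamTrack argument.
function isMediaStreamTrackSupportedAsync() {
  return new Promise((resolve) => {
    const iframe = document.createElement("iframe");
    iframe.setAttribute("allow", "microphone 'none'");
    iframe.src = URL.createObjectURL(new Blob([], { type: "text/html" }));
    iframe.onload = () => {
      const recognition = new iframe.contentWindow.webkitSpeechRecognition();
      let result;
      try {
        recognition.start(0);
        result = false;
      } catch (error) {
        result = error.name === "TypeError";
      }
      iframe.remove();
      resolve(result);
    };
    document.body.appendChild(iframe);
  });
}
```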
This approach looks good indeed @youennf!

```js
function isMediaStreamTrackSupported() {
  const iframe = document.createElement("iframe");
  iframe.setAttribute("allow", "microphone 'none'");
  iframe.src = URL.createObjectURL(new Blob([], { type: "text/html" }));
  document.body.appendChild(iframe);

  const recognition = new iframe.contentWindow.webkitSpeechRecognition();
  let result;
  try {
    recognition.start(0);
    result = false;
  } catch (error) {
    result = error.name == "TypeError";
  } finally {
    iframe.remove();
    return result;
  }
}
```

Note that Firefox's non-spec-compliant implementation also throws a TypeError, as it expects a MediaStream, but they will not ship this version.
Would MediaStreamTrack support detection conflict with ReadableStream at some point? Maybe it's a problem for the future only... |
|
Web Speech supports only audio-kind MediaStreamTracks, but I get what you meant. We would "just" pass a MediaStreamTrack, not 0, eventually. If @padenot is fine with the outcome, I think I can close this issue. |
gentle ping @padenot |
I strongly prefer an explicit and clear API that aligns with what the rest of the Web Platform does, especially around the same space, and considering that it is likely we'll add more features. I can live with doing this the hacky way, but it's nowhere near as nice. |
Do you have an example of an API that would be similar to this particular case? I know that, in many cases, the decision has been made not to expose this kind of API as long as there is a way for web pages to feature-detect it. |
Web Codecs |
These APIs are mostly targeted at exposing what the OS supports: the codecs, the camera capabilities... As an example, there is no API to tell a web page whether a UA implements https://w3c.github.io/webcodecs/#dom-videodecoderconfig-optimizeforlatency. I think the situation would be different if an OS restriction prevented UAs from implementing MediaStreamTrack+SpeechRecognition support while still allowing them to implement MediaStreamTrack and SpeechRecognition independently. |
An example of feature detection for each of the items I listed (non-exhaustive), to detect whether a UA implements something; these have nothing to do with OS support, and everything to do with what a UA at a particular version and on a particular platform supports: […]
Talked about at length in this thread, and we agreed to do it explicitly, in another API, because not being able to do it is a problem: […]
For the case at hand, recent advances in the field make it generally possible, but it requires significant engineering effort to integrate well into the web browser, so we expect some delay before it's available everywhere. It can also depend on the hardware and software, so we're very much in the situation you describe. An example would be that a cheap Windows laptop is going to have a hard time performing speech recognition in real time, but can do it offline without issue. As evoked in this thread, a WHATWG Streams API for speech recognition would then be supported, but not the MediaStreamTrack one. To come back to your last point, it's been the case for the longest time that one couldn't implement Web Speech API + […]. Case in point, Firefox already has […] |
Again, there is no API introduced to tell a web page whether the UA implements […]. This is unneeded, as the web page has other ways to know whether […]. The examples you give above are extension points, which warrant dedicated APIs. I'd be fine investigating how to solve this issue in a more generic way: determining overloads, optional method parameters... BTW, the detection script can probably be further simplified by removing the iframe before calling start(). |
You're right, the following detection script works, BUT I thought "detached" was not a concept in the specs world.

```js
function isMediaStreamTrackSupported() {
  const iframe = document.createElement("iframe");
  iframe.src = URL.createObjectURL(new Blob([], { type: "text/html" }));
  document.body.appendChild(iframe);

  const recognition = new iframe.contentWindow.webkitSpeechRecognition();
  iframe.remove();
  try {
    recognition.start(0);
    return false;
  } catch (error) {
    return error.name == "TypeError";
  }
}
```
|
Do we need setting […]? |
|
Setting […]
TIL "fully active", thanks! We should add this to the spec then.
@youennf @padenot @evanbliu It would be nice to find an agreement on whether we expose a new API to detect source support in the Web Speech API. Given that web developers can use the following snippet for now, I'm not opposed to delaying the addition of the future-proof isSourceSupported() API.

```js
function isMediaStreamTrackSupported() {
  const frame = document.body.appendChild(document.createElement("iframe"));
  const recognition = new frame.contentWindow.webkitSpeechRecognition();
  frame.remove();
  try {
    recognition.start(0);
    return false;
  } catch (error) {
    return error.name == "TypeError";
  }
}
```
|
It's not the most elegant solution, but I'm fine with deferring this for now; we can always add an isSourceSupported() in the future if/when we add new capabilities. |
Shall I close this issue for now? |
As a web developer, I wish there was a way to feature-detect before calling start(audioTrack), following the #118 changes. This is not possible for now, even after calling it. May I suggest some ideas?

1. Rename start(MediaStreamTrack audioTrack) (bikeshedding is welcome)
2. Tie MediaStreamTrack support to other properties

It may be appropriate to assume some of the changes to the Web Speech API will come with MediaStreamTrack support. I'm not sure yet which ones, though. Maybe SpeechRecognitionMode mode from #122?

FYI @evanbliu