What is the feature detection story for MediaStreamTrack support? #126

Open · beaufortfrancois opened this issue Jan 8, 2025 · 47 comments

@beaufortfrancois (Contributor) commented Jan 8, 2025

As a web developer, I wish there were a way to feature-detect support before calling start(audioTrack) following the #118 changes. This is not currently possible, even after calling it.

May I suggest some ideas?

1. Rename start(MediaStreamTrack audioTrack) - bikeshedding is welcome

     undefined start();
-    undefined start(MediaStreamTrack audioTrack);
+    undefined startWithMediaStreamTrack(MediaStreamTrack audioTrack);

if ('startWithMediaStreamTrack' in webkitSpeechRecognition.prototype) {
  // MediaStreamTrack is supported by the Web Speech API.
}

2. Tie MediaStreamTrack support to other properties

It may be appropriate to assume that some of the changes to the Web Speech API will come with MediaStreamTrack support. I'm not sure yet which ones, though. Maybe SpeechRecognitionMode mode from #122?

if ('mode' in webkitSpeechRecognition.prototype) {
  // MediaStreamTrack is supported by the Web Speech API.
}

FYI @evanbliu

@padenot (Member) commented Jan 8, 2025

setSource(MediaStreamTrack), then start() would also allow detection.
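
For illustration, a rough sketch of how that could be feature-detected (setSource() is purely hypothetical here, nothing is specified yet):

// Hypothetical: assumes a setSource() instance method were added to SpeechRecognition.
if ('setSource' in webkitSpeechRecognition.prototype) {
  // MediaStreamTrack is supported by the Web Speech API.
  const recognition = new webkitSpeechRecognition();
  recognition.setSource(myAudioTrack); // myAudioTrack obtained elsewhere
  recognition.start();
}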

@evanbliu (Collaborator) commented Jan 8, 2025

Option 2 sounds like a pretty straightforward approach that doesn't require any changes to the shape of the API. In Chromium, on-device speech recognition support and MediaStreamTrack support are both behind the same feature flag, so they will launch at the same time. This feature association isn't necessarily guaranteed by other browsers in the future though.

@beaufortfrancois (Contributor Author) commented Jan 9, 2025

This feature association isn't necessarily guaranteed by other browsers in the future though.

This worries me indeed.

setSource(MediaStreamTrack), then start() would also allow detection.

If we go down this road, would it make sense to have a source attribute reflected as well?

Moreover, the spec should say whether setSource(MediaStreamTrack) can be called after start() or not. I believe it should not be possible to call it successfully after speech recognition has started.

@evanbliu (Collaborator) commented Jan 9, 2025

We could have a source attribute, but then it would make the setSource(MediaStreamTrack) redundant as it would only be needed for feature detection purposes. What do you all think about renaming start(MediaStreamTrack) to startWithMediaStreamTrack(MediaStreamTrack)?

@beaufortfrancois (Contributor Author) commented Jan 10, 2025

I'm fine with either Option 1, startWithMediaStreamTrack(MediaStreamTrack) (modulo the name), or a source attribute only, as described below.

3. Add a readonly attribute source

partial interface SpeechRecognition {
    readonly attribute (MediaStreamTrack or undefined) source;
};

if ('source' in webkitSpeechRecognition.prototype) {
  // MediaStreamTrack is supported by the Web Speech API.
  
  const recognition = new webkitSpeechRecognition();
  recognition.start(myAudioTrack);
  console.log(recognition.source); // MediaStreamTrack myAudioTrack
}

Note that if we expect to support more sources, we'll have the same issue next time when detecting whether a "newSource" is supported or not.

@padenot (Member) commented Jan 13, 2025

Can we not use the standard isConfigSupported pattern?
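
For reference, the WebCodecs pattern being referred to looks roughly like this (a real API, shown only as a shape comparison):

// WebCodecs: the static isConfigSupported() resolves with { supported, config }.
const { supported, config } = await AudioDecoder.isConfigSupported({
  codec: 'opus',
  sampleRate: 48000,
  numberOfChannels: 2,
});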

@beaufortfrancois (Contributor Author)

Can we not use the standard isConfigSupported pattern?

The following seems a bit too much to me but why not?

4. Add static isSourceSupported(source)

dictionary SpeechRecognitionSourceSupport {
  boolean supported;
  SpeechRecognitionSource source;
};

typedef (MediaStreamTrack or undefined) SpeechRecognitionSource;

partial interface SpeechRecognition {
  static Promise<SpeechRecognitionSourceSupport> isSourceSupported(SpeechRecognitionSource source);
};

if ('isSourceSupported' in webkitSpeechRecognition) {
  const {supported} = await webkitSpeechRecognition.isSourceSupported(myAudioTrack);
  if (supported) {
    // MediaStreamTrack is supported by the Web Speech API.
  }
}

@beaufortfrancois (Contributor Author)

gentle ping

@padenot (Member) commented Jan 20, 2025

This is extensible, potentially providing an avenue to offer offline transcription ("faster-than-realtime") in the future, by adding a third SpeechRecognitionSource that could be a ReadableStream.

MediaStreamTracks are always tied to a clock domain, be it an audio device's clock or the system clock, and cannot work faster than real time.

Considering the speed at which my voice-to-text transcription experiments (with various software and hardware) have been running, and the ongoing rate of progress of software and hardware, I think this is a good feature to consider.

@beaufortfrancois (Contributor Author)

One thing I just realized is that isSourceSupported(myReadableStream) would, IIUC, throw a TypeError right away because of the typedef (MediaStreamTrack or undefined) SpeechRecognitionSource. Is that what we want, or is it going to fail gracefully and resolve with supported = false?

@beaufortfrancois (Contributor Author) commented Jan 20, 2025

I prototyped this in Chromium to see how this would work with https://chromium-review.googlesource.com/c/chromium/src/+/6179415/1 and it looks like Blink bindings are catching this type error before we can actually return a promise with supported: false:

await webkitSpeechRecognition.isSourceSupported(undefined);
// > {source: undefined, supported: true}
await webkitSpeechRecognition.isSourceSupported(1);
// > Uncaught TypeError: Failed to execute 'isSourceSupported' on 'SpeechRecognition': 
// > The provided value is not of type '(MediaStreamTrack or undefined)'.

Is this pattern still good enough?

try {
  const { supported } = await webkitSpeechRecognition.isSourceSupported(myAudioTrack);
  if (supported) {
    // MediaStreamTrack is supported by the Web Speech API.
  } else {
    throw Error('"¯\_(ツ)_/¯"');
  }
} catch(error) {
  // MediaStreamTrack is NOT supported by the Web Speech API.
}

@padenot (Member) commented Jan 20, 2025

Try to model this after Web Codecs, this works: https://w3c.github.io/webcodecs/#audiodecoder-interface.

@beaufortfrancois (Contributor Author) commented Jan 20, 2025

This model uses dictionary strings instead of "typedef (MediaStreamTrack or undefined)".
Would it be better this way?

dictionary SpeechRecognitionSource {
  required DOMString source; // "mediastreamtrack" or "microphone"
};

dictionary SpeechRecognitionSourceSupport {
  boolean supported;
  SpeechRecognitionSource source;
};

partial interface SpeechRecognition {
  static Promise<SpeechRecognitionSourceSupport> isSourceSupported(SpeechRecognitionOptions options);
};

if ('isSourceSupported' in webkitSpeechRecognition) {
  const {supported} = await webkitSpeechRecognition.isSourceSupported({source: "mediastreamtrack"});
  if (supported) {
    // MediaStreamTrack is supported by the Web Speech API.
  }
}

@beaufortfrancois (Contributor Author)

@padenot @evanbliu What do you think?

@padenot (Member) commented Jan 22, 2025

This is in line with what I see in APIs "around" our API here, such as media decoding and encoding, and WebRTC; thanks a lot for taking the time to revise the proposal multiple times. Here are some simple changes to this interface: lower-case strings are indeed the recommended way to do "enums" on the Web, but we can make it explicit, e.g. we'd have:

enum SpeechRecognitionSource {
  "microphone",
  "mediastreamtrack",
  "streams" // future extension w/ WhatWG Streams
};

dictionary SpeechRecognitionOptions {
  // Do we need more members here?
  required SpeechRecognitionSource source;
};

dictionary SpeechRecognitionSourceSupport {
  boolean supported;
  SpeechRecognitionOptions options;
};

partial interface SpeechRecognition {
  static Promise<SpeechRecognitionSourceSupport> isSourceSupported(SpeechRecognitionOptions options);
};

if we foresee more options to be added (otherwise, we can cut a layer). I'm thinking of a language identifier, or a model size, etc. Do we also want to add a provision besides supported, e.g. "supported, but we'd have to download a big model"? I'm not entirely sure this is warranted.

I understand it's a bit verbose, but I think the verbosity isn't unwarranted in this instance:

  • asynchronicity is necessary, because it is possible to need e.g. cross-process calls or checking the presence of a model on disk before answering
  • using an enum allows extensibility in the future
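
For concreteness, a usage sketch of the shape proposed above (all names are still proposals, nothing here is shipped):

// Sketch only: SpeechRecognitionOptions / isSourceSupported() as proposed above.
const { supported } = await SpeechRecognition.isSourceSupported({ source: 'mediastreamtrack' });
if (supported) {
  const recognition = new SpeechRecognition();
  recognition.start(myAudioTrack); // myAudioTrack obtained elsewhere, per the #118 overload
}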

@beaufortfrancois (Contributor Author) commented Jan 22, 2025

Non-recognized enum values will still throw a TypeError though, right?

Uncaught TypeError: Failed to execute 'isSourceSupported' on 'SpeechRecognition': Failed to read the 'source' property from 'SpeechRecognitionOptions': The provided value 'foo' is not a valid enum value of type SpeechRecognitionSource.

To be on the safe side, I'm happy to add SpeechRecognitionOptions. See https://chromium-review.googlesource.com/c/chromium/src/+/6179415/2

@evanbliu (Collaborator)

enum SpeechRecognitionSource {
  "microphone",
  "mediastreamtrack",
  "streams" // future extension w/ WhatWG Streams
};

dictionary SpeechRecognitionOptions {
  // Do we need more members here?
  required SpeechRecognitionSource source;
};

dictionary SpeechRecognitionSourceSupport {
  boolean supported;
  SpeechRecognitionOptions options;
};

partial interface SpeechRecognition {
  static Promise<SpeechRecognitionSourceSupport> isSourceSupported(SpeechRecognitionOptions options);
};

IMO, while this approach offers the most extensibility, it seems a little over-engineered for future use cases that may never materialize and comes at the cost of a more complicated API surface. My preference would be for a simpler approach with either the readonly attribute or the new startWithMediaStreamTrack name. Given the historical rate of progress on the Web Speech API, I'm skeptical about if and when we'll need to extend the functionality of this API. But if we do want to add new functionality with additional sources and options, we may want to consider developing a new, modernized speech recognition API with Promises instead of callbacks. I've filed issue #130 to track this discussion.

asynchronicity is necessary, because it is possible to need e.g. cross-process calls or checking the presence of a model on disk before answering

This issue tracks the feature detection of MediaStreamTrack support, which can function independently of on-device speech recognition. Offline/on-device speech recognition support is tracked separately in issue #108.

PR #132 updates the specification to define the method for detecting on-device speech recognition availability as asynchronous.

What do you all think?

@beaufortfrancois (Contributor Author)

The WebIDL may seem complex at first glance, but from a web developer's perspective, it's actually quite straightforward. I recommend using either startWithMediaStreamTrack() or the latest version of isSourceSupported(). Hopefully, we can reach an agreement on this soon.

@padenot (Member) commented Jan 23, 2025

Non-recognized enum values will still throw a TypeError though, right?

Indeed, I was wrong; I checked, and other specs use a DOMString, sorry about that. E.g. https://www.w3.org/TR/mediacapture-streams/#webidl-1647796506; [[kind]] is fairly close to what we're doing here.

Firefox would implement faster-than-real-time transcription rather quickly, I assume, so I would still lean toward an extensible API.

@beaufortfrancois (Contributor Author) commented Jan 23, 2025

Let me summarize this then. We either go with the following startWithMediaStreamTrack():

partial interface SpeechRecognition {
    undefined startWithMediaStreamTrack(MediaStreamTrack audioTrack);
};

Or future-proof isSourceSupported() (WIP CL):

dictionary SpeechRecognitionOptions {
  required DOMString source; // "mediastreamtrack", "microphone", or "streams" in the future
};

dictionary SpeechRecognitionSourceSupport {
  boolean supported;
  SpeechRecognitionOptions options;
};

partial interface SpeechRecognition {
  static Promise<SpeechRecognitionSourceSupport> isSourceSupported(SpeechRecognitionOptions options);
};

@padenot (Member) commented Jan 23, 2025

I support the second API, thanks for shepherding this through.

@beaufortfrancois (Contributor Author)

@eric-carlson Do you have opinions on this?

@evanbliu (Collaborator)

I have a slight preference for the first API, but not enough to cause a fuss if you two prefer the second one :)

If we do go with the second one, do we still need it to be asynchronous? Checking if the browser supports MediaStreamTrack (or streams) doesn't necessarily need to be tied to on-device speech recognition, though I suppose making it asynchronous provides browsers with the flexibility to do whatever they want to make that determination.

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Jan 23, 2025
@youennf commented Jan 24, 2025

Usually, we do not tend to add a feature detection API if there is a way to do it in JS.
It seems feature detection can be done with the current API by looking at the exceptions start can throw.

For instance, calling start(videoTrack) in a blob-URL iframe that is disallowed from using the microphone by permission policy will always throw.
Either it will throw an InvalidStateError if videoTrack is checked (the UA supports the new API), or it will throw a NotAllowedError due to the permission policy (the UA does not support the new API).

It would be nice to have a more precise algorithm written in the spec, to get the exact order of the checks.
For instance, Safari throws in the case of a detached iframe, which could be another way to do feature detection.

And it would probably make sense to clarify that the microphone permission policy is needed for start() but not for start(audioTrack).

@beaufortfrancois (Contributor Author) commented Jan 24, 2025

Thank you for jumping into the discussion @youennf!

I've started playing with assumed exceptions at https://web-speech-mediastreamtrack.glitch.me/issue-126.html and it seems like we have plenty of work to do interop wise if we're going this way ;)

@beaufortfrancois (Contributor Author) commented Jan 27, 2025

There's one thing that, in my opinion, prevents us from using exceptions: failure due to the microphone permission policy does NOT trigger a NotAllowedError DOMException. It actually fires an error event with not-allowed.

It means we have to wait for an amount of time before deciding if an error event happened or not. This is bad in my opinion. The isSourceSupported() method seems cleaner now, and also in the future if we want to expose options when starting speech recognition with a MediaStreamTrack and maybe a ReadableStream.

For info, I've updated https://web-speech-mediastreamtrack.glitch.me/issue-126.html with plenty of tests to discover what we could do. The following screenshot shows Safari, Chrome, and Chrome Canary with MediaStreamTrack support:

[screenshot]

And if you didn't want to dig into the code, here's the snippet I used:

(async () => {
  const iframe = document.createElement("iframe");
  iframe.setAttribute("allow", "microphone 'none'");
  iframe.src = URL.createObjectURL(new Blob([], { type: "text/html" }));
  document.body.appendChild(iframe);

  const recognition = new iframe.contentWindow.webkitSpeechRecognition();
  const ac = new AudioContext();
  const mediaStreamDestination = ac.createMediaStreamDestination();
  const audioTrack = mediaStreamDestination.stream.getAudioTracks()[0];

  try {
    recognition.start(audioTrack);

    const errorPromise = new Promise((resolve) => {
      recognition.onerror = ({ error }) => {
        resolve(error);
      };
    });
    const timeoutPromise = new Promise((resolve) => setTimeout(resolve, 1000));
    const errorEvent = await Promise.race([errorPromise, timeoutPromise]);

    if (errorEvent) {
      console.log("Fail! recognition.start(audioTrack) fired error event");
    } else {
      console.log("Success! recognition.start(audioTrack) did not fire error event");
    }
  } catch (error) {
    console.log("Fail! recognition.start(audioTrack) should succeed");
  }
  iframe.parentNode.removeChild(iframe);
})();

@padenot (Member) commented Jan 27, 2025

I've started playing with assumed exceptions at web-speech-mediastreamtrack.glitch.me/issue-126.html and it seems like we have plenty of work to do interop wise if we're going this way ;)

w/ my Firefox implementer hat on, if you happen to be in a position in which it's easy for you to list somewhere the differences you see, we're happy to add them to our roadmap, and align. Our implementation has drifted some, and if we're doing work on it as part of local recognition, we might as well make it interoperable.

@beaufortfrancois (Contributor Author) commented Jan 27, 2025

I've started playing with assumed exceptions at web-speech-mediastreamtrack.glitch.me/issue-126.html and it seems like we have plenty of work to do interop wise if we're going this way ;)

w/ my Firefox implementer hat on, if you happen to be in a position in which it's easy for you to list somewhere the differences you see, we're happy to add them to our roadmap, and align. Our implementation has drifted some, and if we're doing work on it as part of local recognition, we might as well make it interoperable.

Oh I just found out Firefox had an implementation behind media.webspeech.recognition.enable and media.webspeech.recognition.force_enable preferences! Here are the current results:

FAIL: recognition.onerror should fire 'not-allowed' error if microphone 'none'
FAIL: recognition.start(audioTrack) should fail with InvalidStateError if microphone 'none'
FAIL: recognition.start() should fail with UnknownError if microphone 'none' and detached iframe (this behavior is not in the spec)
FAIL: recognition.start() should fail with UnknownError if detached iframe (this behavior is not in the spec)
SUCCESS: recognition.start() succeeds
FAIL: recognition.start() should succeed - error: TypeError: SpeechRecognition.start: Argument 1 does not implement interface MediaStream.
FAIL: recognition.start(undefined) should fail

@padenot What do you think of #126 (comment)?

@padenot (Member) commented Jan 27, 2025

Oh I just found out Firefox had an implementation behind media.webspeech.recognition.enable and media.webspeech.recognition.force_enable preferences! Here are the current results:

It's not enabled for a reason :-). We'll make sure it's compatible before shipping, w/ WPT and all.

@padenot What do you think of #126 (comment)?

As said previously, I support an isSourceSupported approach, or at least more work in this direction.

@youennf commented Jan 27, 2025

I've started playing with assumed exceptions at https://web-speech-mediastreamtrack.glitch.me/issue-126.html

Thanks for doing this!
That seems like worthwhile input for WPT.
I agree we should try to converge and probably make the spec more algorithmic to clarify the intended behavior.

It means we have to wait for an amount of time before deciding if an error event happened or not.

Not really, we just need to check whether start(videoTrack) will throw InvalidStateError synchronously or not.
Checking whether start(0) throws a TypeError seems even better for detecting the start overload.
If there is no such exception, we can deduce that there is no MediaStreamTrack support.

This approach can probably support streams support detection with a locked ReadableStream.
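
For example, something along these lines (purely hypothetical, since no start(ReadableStream) overload exists today; reusing the recognition instance from the snippets above):

// Hypothetical: assumes a future start(ReadableStream) overload.
const stream = new ReadableStream();
stream.getReader(); // lock the stream so no audio could ever be consumed
try {
  recognition.start(stream);
} catch (error) {
  // With the overloads in place, a TypeError would mean the UA rejected the
  // ReadableStream argument, i.e. it only implements the other start() overloads.
}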

This feature detection will usually be asynchronous (need to wait for the blob URL iframe being loaded if the current frame can get microphone access).
I am not sure though that this warrants a new API that will become useless as UAs catch up with the spec.

@beaufortfrancois (Contributor Author)

This approach looks good indeed @youennf!

function isMediaStreamTrackSupported() {
  const iframe = document.createElement("iframe");
  iframe.setAttribute("allow", "microphone 'none'");
  iframe.src = URL.createObjectURL(new Blob([], { type: "text/html" }));
  document.body.appendChild(iframe);

  const recognition = new iframe.contentWindow.webkitSpeechRecognition();
  let result;

  try {
    recognition.start(0);
    result = false;
  } catch (error) {
    result = error.name == "TypeError";
  } finally {
    iframe.remove();
    return result;
  }
}

Note that Firefox's non-spec-compliant implementation also throws a TypeError, as it expects a MediaStream, but they will not ship this version.

This approach can probably support streams support detection with a locked ReadableStream.

Would MediaStreamTrack support detection conflict with ReadableStream at some point? Maybe it's a problem for the future only...

@youennf commented Jan 27, 2025

Would MediaStreamTrack support detection conflict with ReadableStream at some point?

start(canvas.captureStream().getVideoTracks()[0]) would specifically check for MediaStreamTrack support.

@beaufortfrancois (Contributor Author)

Would MediaStreamTrack support detection conflict with ReadableStream at some point?

start(canvas.captureStream().getVideoTracks()[0]) would specifically check for MediaStreamTrack support.

Web Speech supports only audio kinds of MediaStreamTrack, but I get what you mean. We would eventually "just" pass a MediaStreamTrack, not 0.

If @padenot is fine with the outcome, I think I can close this issue.

@beaufortfrancois (Contributor Author)

gentle ping @padenot

@padenot (Member) commented Jan 29, 2025

I strongly prefer an explicit and clear API that aligns with what the rest of the Web Platform does, especially around the same space, and considering that it is likely we'll add more features.

I can live with doing this the hacky way, but it's nowhere near as nice.

@youennf commented Jan 29, 2025

I strongly prefer an explicit and clear API that aligns with what the rest of the Web Platform does

Do you have an example of an API that would be similar to this particular case?

I know that, in many cases, the decision has been made not to expose this kind of API as long as there is a way for web pages to feature-detect it.

@padenot (Member) commented Jan 29, 2025

Web Codecs' isConfigSupported, WebRTC's constrainable pattern, the media element's canPlayType, and MediaCapabilities are examples of clear APIs that allow one to understand whether it is possible to do something media-related on the web platform in its current state, without trying it out and seeing if it fails.

@youennf commented Jan 29, 2025

These APIs are mostly targeted at exposing what the OS supports: the codecs, the camera capabilities...
The intent is not to expose whether a UA implements a particular API like we are talking here.

As an example, there is no API to tell a web page whether a UA implements https://w3c.github.io/webcodecs/#dom-videodecoderconfig-optimizeforlatency.
It is up to the website to feature detect it (and it can).

I think the situation is different if an OS restriction would disallow UAs from implementing MediaStreamTrack+SpeechRecognition support, while allowing them to implement MediaStreamTrack and SpeechRecognition independently.

@padenot (Member) commented Jan 29, 2025

These APIs are mostly targeted at exposing what the OS supports: the codecs, the camera capabilities...
The intent is not to expose whether a UA implements a particular API like we are talking here.

An example of feature detection for each of the items I listed (non-exhaustive), showing that detecting whether a UA implements something has nothing to do with OS support and everything to do with what a UA at a particular version and on a particular platform supports:

  • Web Codecs' isConfigSupported: used to understand if one can do SVC encoding. In Firefox, it's been implemented on Windows and macOS, not on Linux, not on Android. This is feature detection: nothing prevents us from implementing it on Linux and Android, it's actually underway, but it's not shipped yet. On Chrome, you can use it to understand if you can encode H.264 level 6.2: Firefox can, Chrome cannot, Safari cannot. This is feature detection, as it means you can work with 8K video. Nothing blocks anybody from implementing this. This has nothing to do with OS support either; it has to do with prioritization and/or the policy of each vendor.
  • constrainable: lots of properties aren't implemented consistently yet across OSes and UAs; this is feature detection. One example: you can't constrain on the sample rate of an input device in Firefox. Nothing prevents us from implementing it.
  • MediaCapabilities: lots of UAs haven't implemented hw decoding and encoding on some combinations of platform and hardware that do support hardware and software decoding. This is feature detection; it's frequent that we implement hw decoding and encoding for new hw and software combinations.
  • canPlayType (and its encoding counterpart isTypeSupported()): Chrome implements h264-in-webm playback and recording, which is nonsensical, but it's a thing regardless; nobody else does. Same for mkv (vs. just the webm subset). Nothing prevents anybody from implementing those, and so authors use canPlayType for feature detection (sketch below).
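
A minimal example of that last pattern (real APIs; whether this particular codec string reports as supported varies by browser):

// canPlayType-based feature detection: returns "", "maybe" or "probably".
const video = document.createElement('video');
const canPlayH264InWebm = video.canPlayType('video/webm; codecs="avc1.42E01E"') !== '';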

As an example, there is no API to tell a web page whether a UA implements w3c.github.io/webcodecs#dom-videodecoderconfig-optimizeforlatency.
It is up to the website to feature detect it (and it can).

Talked about at length in this thread, and we agreed to do it explicitly, in another API, because not being able to do it is a problem:
w3c/webcodecs#206 (comment)

I think the situation is different if an OS restriction would disallow UAs from implementing MediaStreamTrack+SpeechRecognition support, while allowing them to implement MediaStreamTrack and SpeechRecognition independently.

For the case at hand, recent advances in the field make it generally possible, but it requires significant engineering effort to integrate well in the browser, and so we expect some delay before it's available everywhere. It can also depend on the hardware and software, so we're very much in the situation you describe. An example would be that a cheap Windows laptop is going to have a hard time performing speech recognition in real time, but can do it offline without issue. As evoked in this thread, a WHATWG Streams API for speech recognition would then be supported, but not the MediaStreamTrack variant. Alternatively, it can vary for a single system that has enabled or disabled its discrete GPU for power-efficiency reasons.

To come back to your last point, it's been the case for the longest time that one couldn't implement Web Speech API + MediaStreamTrack, because the OS APIs would only work with the microphone, or because the implementation relied on calling an external API, increasing the latency in a way that wouldn't work with MediaStreamTrack.

Case in point, Firefox already has MediaStreamTrack, and implements SpeechRecognition from the microphone only today (behind a flag). We are dependent on the OS APIs for this, so we cannot implement SpeechRecognition + MediaStream together. To solve this, we are forgoing OS APIs and relying on free software and models that we can use.

@youennf commented Jan 29, 2025

Again, there is no API introduced to tell a web page whether the UA implements optimizeForLatency or not.
What is proposed here is something like: bool isPropertyUnderstoodByIsConfigSupported(DOMString).

This is unneeded as the web page has other ways to know whether optimizeForLatency support is implemented by a UA.

The examples you give above are extension points, which warrant dedicated APIs.
In our case, this is not an extension point, there will be a very restricted set of values, probably 2, maybe 3.

I'd be fine with investigating how to solve this issue in a more generic way: detecting overloads, optional method parameters...

BTW, the detection script can probably be further simplified by removing the iframe before calling start(); there is no need to use the allow trick. If the iframe is detached, it will likely not proceed with a prompt and not throw a TypeError.

@beaufortfrancois (Contributor Author)

BTW, the detection script can probably be further simplified by removing the iframe before calling start(); there is no need to use the allow trick. If the iframe is detached, it will likely not proceed with a prompt and not throw a TypeError.

You're right, the following detection script works, BUT I thought "detached" was not a concept in specs world.

function isMediaStreamTrackSupported() {
  const iframe = document.createElement("iframe");
  iframe.src = URL.createObjectURL(new Blob([], { type: "text/html" }));
  document.body.appendChild(iframe);

  const recognition = new iframe.contentWindow.webkitSpeechRecognition();
  iframe.remove();

  try {
    recognition.start(0);
    return false;
  } catch (error) {
    return error.name == "TypeError";
  }
}

@youennf commented Jan 30, 2025

Do we need to set iframe.src given we are doing the test synchronously?

@youennf commented Jan 30, 2025

I thought "detached" was not a concept in specs world.

active might be the right concept. I would not expect start to work in this case. getUserMedia is for instance rejecting explicitly.

@beaufortfrancois (Contributor Author)

Do we need to set iframe.src given we are doing the test synchronously?

Setting iframe.src can be omitted, yes.

I thought "detached" was not a concept in specs world.

active might be the right concept. I would not expect start to work in this case. getUserMedia is for instance rejecting explicitly.

TIL fully active thanks! We should add this to the spec then.

@beaufortfrancois (Contributor Author)

@youennf @padenot @evanbliu It would be nice to reach an agreement on whether we expose a new API to detect source support in the Web Speech API.

Given that web developers can use the following snippet for now, I'm not opposed to delaying the addition of the future-proof isSourceSupported() method. I'll defer to the WebAudio folks.

function isMediaStreamTrackSupported() {
  const frame = document.body.appendChild(document.createElement("iframe"));
  const recognition = new frame.contentWindow.webkitSpeechRecognition();
  frame.remove();

  try {
    recognition.start(0);
    return false;
  } catch (error) {
    return error.name == "TypeError";
  }
}

@evanbliu (Collaborator) commented Feb 4, 2025

It's not the most elegant solution, but I'm fine with deferring this for now; we can always add an isSourceSupported() in the future if/when we add new capabilities.

@beaufortfrancois (Contributor Author)

Shall I close this issue for now?
