Replies: 10 comments 62 replies
-
From a quick search it looks like CoreML can be used through Objective-C. If we can write a simple Obj-C wrapper function that takes a mel segment, runs the CoreML-based Whisper encoder and outputs the resulting encoder embeddings, I think I can easily plug it in. Given the performance numbers that you observe, this will be a game changer! I will probably take a look into this at some point, but since you already have things going, I would appreciate any help on this.
-
Check out the …
-
Hey folks! Awesome work :) I was made aware of this thread by @ggerganov after a conversation we had on Twitter. Long story short, I've optimized both Whisper's encoder and decoder to run on Apple's Neural Engine a couple of weeks back, and have hacked flexible-sized inputs for the decoder (though not recommended lol). I've done this twice, once on top of Hugging Face's implementation of Whisper, and I've published a version built on top of OpenAI's implementation. I can validate @rsomani95's benchmarks, as I too get similar fp32 encoder prediction performance :)

Speeding up the current encoder
Quantizing to fp16 and using the standard LLM data format of (batch, seq, embed_dim) actually slows down prediction time, so with a few changes we can get even more performance out of @wangchou's idea! The current implementation uses the standard LLM data format of (batch, seq, embed_dim), but the Neural Engine's most conducive data format is 4D and channels-first. We also want the last axis to be the sequence, since the last axis of the ANE buffer isn't packed and must be contiguous and aligned to 64 bytes. This only applies to the last axis, and since we're quantizing to fp16 the Neural Engine actually pads it up to 64 bytes, which results in 32 times the memory cost at 16-bit precision. TL;DR: by switching to (batch, embed_dim, 1, seq) we can further improve the speed of the encoder (see the sketch below).

Decoder & kv-caching
Decoding a (1, 1) token with an optimized ANE decoder model ran prediction in at best 16 ms, which is still slower than the ~7 ms currently achieved on CPU. I've spent a good amount of time attempting to figure out a solution to the kv-caching problem; the fundamental issue is that CoreML models are unable to branch, which makes this difficult. We could export two versions of the decoder, one that doesn't expect a kv-cache for the first token and another that handles the kv-cache case, but that's pretty gross.

Quantization
I actually haven't noticed any performance gains by quantizing to fp16 from fp32; the prediction speed is roughly equivalent. Using fp16 instead of fp32 actually slows down compilation time by roughly 2x in all my tests. I suspect this has to do with how the …
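To make the layout point concrete, here is a minimal sketch (illustrative dims only, not code from any of the repos mentioned above): a Linear over (batch, seq, embed_dim) is equivalent to a 1x1 Conv2d over (batch, embed_dim, 1, seq), which is the channels-first, sequence-last form the Neural Engine prefers.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (roughly the small-model encoder width and audio context).
embed_dim, seq_len = 768, 1500
x_bse = torch.randn(1, seq_len, embed_dim)            # standard (B, S, E) layout

linear = nn.Linear(embed_dim, embed_dim)

# The same projection expressed as a 1x1 convolution over the ANE-friendly (B, E, 1, S) layout.
conv = nn.Conv2d(embed_dim, embed_dim, kernel_size=1)
conv.weight.data = linear.weight.data[:, :, None, None]
conv.bias.data = linear.bias.data

x_be1s = x_bse.transpose(1, 2).unsqueeze(2)           # (B, S, E) -> (B, E, 1, S)
y_conv = conv(x_be1s)
y_lin = linear(x_bse).transpose(1, 2).unsqueeze(2)    # same result, rearranged

print(torch.allclose(y_conv, y_lin, atol=1e-4))       # True
```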
-
We got some feedback from the author of the Apple Neural Engine optimizations for Transformers on our GitHub issue in CoreMLTools, who offered some advice for using …
-
As a founder of a venture-backed ML team building in this space: this is some highly encouraging work, fellas! Email is open: colin@supernormal.com
-
Did anyone try to run the Core ML model on iOS?
I'm hitting a crash when ANECompilerService exceeds the CPU usage limit during the first run.
It doesn't seem directly related to the model's size; the same crash happens with the base and small models.
-
The CoreML compilation result can be triggered once and loaded manually via some additional CoreML runtime calls. But yes, this is annoying as hell, and I think there's an Xcode bug which triggers multiple re-compilations even on successive runs with no changes. AFAIK, building and deploying for devices should NOT incur the ANE compilation step if you build your app correctly. See https://developer.apple.com/documentation/coreml/mlmodel/3931182-compilemodel and …
-
What is the process folks are using to convert to CoreML models? I tried both https://github.com/wangchou/callCoreMLFromCpp and https://gist.github.com/RobertRiachi/d75bf6946bb8f1cea391c3c03a4ba4db and they both throw an assert: assert x.shape[1:] == self.positional_embedding.shape[::-1], "incorrect audio shape". And the output files don't recognize correctly (only gets a single word on the JFK example). I'm just looking to produce CoreML versions of the medium and large models, since I have a Mac Studio with 128GB of RAM to process them. Happy to upload them once I get something working.
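For what it's worth, here is a minimal encoder-only conversion sketch showing the general approach (it is not the exact code from either of the linked scripts, and the output may still need glue code on the whisper.cpp side). Shapes are derived from the loaded checkpoint rather than hardcoded, since a mismatch between the traced mel shape and the checkpoint's dimensions is one possible cause of that "incorrect audio shape" assert.

```python
import torch
import whisper
import coremltools as ct

model = whisper.load_model("medium").cpu()
encoder = model.encoder.eval()

dims = model.dims
# The encoder expects a (batch, n_mels, 2 * n_audio_ctx) mel spectrogram, e.g. (1, 80, 3000).
mel = torch.zeros(1, dims.n_mels, dims.n_audio_ctx * 2)

traced = torch.jit.trace(encoder, mel)

mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="mel", shape=mel.shape)],
    compute_units=ct.ComputeUnit.ALL,          # allow CoreML to place ops on the ANE
    compute_precision=ct.precision.FLOAT16,
)
mlmodel.save("whisper-medium-encoder.mlpackage")
```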
-
Nope, I had it installed last time. I was just quickly trying to get the same module set installed, based on the list in the last message.
With openai-whisper it got further, but changing:
whisper = load_model("small").cpu()
to
whisper = load_model("medium").cpu()
caused it to error out on the conversion:
RuntimeError: Given groups=1, weight of size [1024, 1024, 1, 1], expected input[1, 768, 1, 1500] to have 1024 channels, but got 768 channels instead
Are you actually able to convert the medium model? That is all I am shooting for.
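(For context on that error, and as an observation rather than a verified diagnosis of the script: 768 is the encoder width, n_audio_state, of the small checkpoint, while medium uses 1024, so the traceback reads as if part of the conversion is still sized for small. The widths can be read off the loaded checkpoints directly.)

```python
import whisper

for name in ("small", "medium"):
    dims = whisper.load_model(name, device="cpu").dims
    print(name, dims.n_audio_state)   # small: 768, medium: 1024
```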
…On Thu, Apr 13, 2023 at 7:27 PM Robert Riachi ***@***.***> wrote:
You're missing the whisper package from openai:
pip install -U openai-whisper
GitHub repo if you need more instructions: https://github.com/openai/whisper
-
Thanks flexchar for uploading those generated Core ML models. It seems to be running faster than 3x, I hope at least; gotta wait till a 30+ hour audiobook finishes to be sure. I'm using the medium.en model and it took a little over 16 hours to run it the first time. Now it's processing a batch of audiobooks, so we'll see just how fast it actually is. Before, I was getting 2x-3x on a Mac M1 with 8GB RAM. I doubt I have enough RAM for the large model, though, and that's OK, since this is using 82% of RAM currently. The small model wasn't accurate enough for me for locations, cities, etc.
-
For the small Whisper model, I observed a 6x speedup on the encoder when running it on the Apple Neural Engine.
Encoder time per run:
whisper.cpp: 1030 ms (CPU, 4 threads)
CoreML model: 174 ms (Apple Neural Engine)
I wonder if it is a good idea to use a CoreML model in whisper.cpp.
PS: The decoder part of whisper.cpp is much faster than CoreML because of kv_cache.
Tested on a MacBook Air M1 with 16GB of RAM.
Python Conversion Script
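For anyone who wants to reproduce rough encoder timings on their own converted model from Python, here is a small latency-check sketch (the 174 ms above was measured with the author's own setup, not this snippet; the .mlpackage path and the "mel" input name are placeholders that depend on how the model was converted):

```python
import time
import numpy as np
import coremltools as ct

# Placeholder path and input name; adjust to your converted encoder.
mlmodel = ct.models.MLModel("encoder-small.mlpackage",
                            compute_units=ct.ComputeUnit.ALL)

mel = np.random.randn(1, 80, 3000).astype(np.float32)

mlmodel.predict({"mel": mel})   # first call triggers ANE compilation/caching

runs = 10
t0 = time.time()
for _ in range(runs):
    mlmodel.predict({"mel": mel})
print(f"encoder time per run: {(time.time() - t0) / runs * 1000:.0f} ms")
```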