---
title: "The AI tools for Art Newsletter - Issue 1"
thumbnail: /blog/assets/ai_art_newsletter_1/thumbnail.png
authors:
---
The AI space is moving so fast it’s hard to believe that a year ago we still struggled to generate people with the correct number of fingers 😂.
The last couple of years have been pivotal for open source models and tools for artistic usage. AI tools for creative expression have never been more accessible, and we’re only scratching the surface. Join us as we look back at the key milestones, tools, and breakthroughs in AI & Arts from 2024, and look ahead to what’s to come in 2025 (spoiler 👀: we’re starting a new monthly roundup 👇).
<iframe src="https://multimodalaiart.substack.com/embed" width="480" height="320" style="border:1px solid #EEE; background:white;" frameborder="0" scrolling="no"></iframe>

- Major Releases of 2024
  - Image Generation
  - Video Generation
  - Audio Generation
- Creative Tools that Shined in 2024
- What should we expect for AI & Art in 2025?
- Starting off strong - Open source releases of January '25
## Major Releases of 2024

What were the standout releases of creative AI tools in 2024? We'll highlight the major releases across creative and artistic fields, with a particular focus on open-source developments in popular tasks like image and video generation.
### Image Generation

More than two years after the OG Stable Diffusion made waves in open source image generation, it’s now safe to say that when it comes to image generation from text, image editing, and controlled generation, open source models are giving closed source models a run for their money.
Part of that progress came with a shift in paradigm - from UNet-based diffusion models to Diffusion Transformers (DiT) trained with flow matching objectives. TL;DR - diffusion models and Gaussian flow matching are equivalent: flow matching proposes a vector field parametrization of the network output that is different from the ones commonly used in diffusion models previously.
- We recommend this great blog by Google DeepMind if you’re interested in learning more about flow matching and its connection to diffusion models
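To make the equivalence a bit more concrete, here is a minimal sketch of the flow matching objective in its simplest (rectified flow / linear interpolation) form - generic notation, not any specific paper's. Data \\(x_0\\) and noise \\(\epsilon\\) are blended linearly, and the network \\(v_\theta\\) is trained to predict a velocity rather than the noise:

$$
x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\big\|\,v_\theta(x_t, t) - (\epsilon - x_0)\,\big\|^2\right]
$$

Since the target \\(\epsilon - x_0\\) is a linear combination of the same quantities a noise-predicting diffusion model already works with, this is a reparametrization of the familiar objective rather than a fundamentally different model family.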
Back to practice: the first to announce the shift was Stability AI with Stable Diffusion 3; however, it was HunyuanDiT that became the first open source model with a DiT architecture.
This trend continued with the releases of AuraFlow, Flux.1 and Stable Diffusion 3.5.
Among the many pivotal moments in the (not so long) history of open source image generation models, it’s safe to say that the release of Flux.1 was one of them. Flux [dev] achieved a new state of the art, surpassing popular closed source models like Midjourney v6.0 and DALL·E 3 (HD) on various benchmarks.
A positive side effect of advancements in image models is the significant improvement in personalization techniques for text-to-image models and controlled generation.
Back in August 2022, transformative works like Textual Inversion and DreamBooth enhanced our ability to teach and introduce new concepts to text-to-image models, drastically expanding what could be done with them. These works opened the door to a stream of improvements and enhancements building on top of them (such as LoRA for diffusion models).
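To get a feel for how lightweight this is with today's tooling, here's a minimal sketch of loading a LoRA on top of SDXL with 🤗 diffusers - the LoRA repo id and trigger prompt are placeholders, not a specific recommended checkpoint:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL base model
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Apply personalization weights trained with LoRA
# ("your-username/your-sdxl-lora" is a placeholder repo id)
pipe.load_lora_weights("your-username/your-sdxl-lora")

# The trigger word depends on how the LoRA was trained
image = pipe("a photo of sks dog in a bucket").images[0]
image.save("lora_sample.png")
```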
However, the quality of a fine-tuned model is naturally upper-bounded by the base model it was fine-tuned from. In that sense, we can’t neglect Stable Diffusion XL, which was also a significant milestone for personalization of open source image generation models. A testament to that is that even now, many of the popular techniques and models for personalization and controlled generation are based on SDXL. The advanced abilities of SDXL (and of similar-quality models released since), together with the growing understanding of the semantic roles of different components in the diffusion model architecture, raise the question -
what can we achieve without further optimization?
Cue the rain of zero-shot techniques - 2024 was definitely the year when generating high quality portraits from reference photos became possible with as little as a single reference image and without any optimization. Training-free techniques like IP-Adapter FaceID, InstantID, PhotoMaker and more came out, demonstrating competitive if not superior abilities to those of fine-tuned models.
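To illustrate, here's a rough sketch of zero-shot, image-prompted generation with a plain IP-Adapter in 🤗 diffusers; the FaceID variants follow the same pattern but additionally condition on face embeddings (e.g., from insightface). The reference image path is a placeholder:

```python
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Attach the IP-Adapter weights - no fine-tuning involved
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
)
pipe.set_ip_adapter_scale(0.6)  # how strongly the reference steers generation

reference = load_image("reference.png")  # a single reference image (placeholder)
image = pipe(
    prompt="a watercolor portrait",
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]
```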
Similarly, image editing and controlled generation - such as image generation with canny edge / depth / pose constraints - made progress too, thanks both to the growing quality of the base models and to the community’s growing understanding of the semantic roles different components play (InstantStyle, B-LoRA).
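Controlled generation looks much the same in code. A minimal sketch of canny-constrained generation with an SDXL ControlNet (the reference image path is a placeholder):

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# Turn a reference image into a canny edge map to constrain the layout
reference = np.array(load_image("reference.png"))  # placeholder path
edges = cv2.Canny(reference, 100, 200)
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The generated image follows the edge structure of the reference
image = pipe("a futuristic city at dusk", image=edges).images[0]
```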
So what’s next? Since the paradigm shift to DiT and flow matching objectives, additional models have come out trying to utilize DiT-based models like Flux and SD3.5 for similar purposes, but so far they haven't quite beaten the quality of the SDXL-based ones, despite the superior quality of the underlying base models. This could be attributed to our relatively limited understanding of the semantic roles of the DiT's components, compared to the UNet's. 2025 could be the year when we identify those roles in DiTs as well, unlocking more possibilities with the next generation of image generation models.
### Video Generation

As opposed to image generation, with video we still have a way to go. But it’s safe to say that we’re very far from where we were a year ago. While we’re all about open source, the credit for (some of) the significant leap in AI video generation goes to OpenAI’s Sora, for changing our expectations of video model capabilities quite radically. And as fofr put it nicely in AI video is having its Stable Diffusion moment (which we recommend reading 🙂) - it made everyone realize what is possible.
The recent surge of open-source video generation models, including CogVideoX, Mochi, Allegro, LTX Video, and HunyuanVideo, has also been noteworthy. Video generation is inherently more challenging than image generation due to the need for motion quality, coherence, and consistency. It also requires substantial computational and memory resources, leading to significant generation latency. This often hinders local usage, making many new open video models inaccessible to community hardware without extensive memory optimizations and quantization approaches that impact both inference latency and the quality of the generated videos. Nevertheless, the open source community has made remarkable progress, which was recently covered in this blog on the state of open video generation models.
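To give an idea of what those optimizations look like in practice, here's a rough sketch of running one of the open models (CogVideoX) with 🤗 diffusers, trading inference speed for a much smaller memory footprint:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)

# Memory optimizations: stream weights from CPU to GPU as needed,
# and decode the video latents in tiles instead of all at once
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt="a panda playing guitar in a bamboo forest",
    num_frames=49,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "panda.mp4", fps=8)
```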
While this implies that most community members are still unable to experiment and develop with open-source video models, it also suggests that we can expect significant advancements in 2025.
### Audio Generation

Audio generation has progressed significantly in the past year, going from simple sounds to complete songs with lyrics. This is despite real challenges: audio signals are complex and multifaceted, they require more sophisticated modeling than text or images, and training data is quite scarce. 2024 saw open source releases like OuteTTS and IndicParlerTTS for text-to-speech and OpenAI’s Whisper large v3 turbo for automatic speech recognition. The year 2025 is already shaping up to be a breakthrough year for audio models, with a remarkable number of releases in January alone. We've seen the release of three new text-to-speech models - Kokoro, Llasa TTS and OuteTTS 0.3 - as well as two new music models: JASCO and YuE. At this pace, we can expect even more exciting developments in the audio space throughout the year.
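Many of these models are a few lines away with 🤗 transformers; for example, a minimal sketch of speech recognition with Whisper large v3 turbo (the audio file is a placeholder):

```python
import torch
from transformers import pipeline

# Whisper large v3 turbo: a fast, open model for speech recognition
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device=0,  # GPU index; use device="cpu" to run on CPU
)

print(asr("speech.wav")["text"])  # placeholder audio file
```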
This song👇 was generated with YuE 🤯