The ESP32 has plenty of number-crunching capacity, that is for sure. MP3 decoding is pretty easy to do on the ESP32, even at high bit rates. But add MP3 licensing to the picture and your commercial application gets a lot more complicated.
The alternative? Xiph’s Vorbis project! Vorbis is free and is very well supported by multimedia devices. There are a ton of applications and devices out there that support Vorbis along with MP3.
So how about Ogg Vorbis on ESP32?
First, the hardware we are proud of!
Well, we packed everything into a tiny system-on-module and are testing it at the moment. So we figured – why not try some Ogg Vorbis on ESP32-PICO-D4?
We now have the modules in stock and ready to purchase!
AudioSOM32 Audio Module
Why Ogg Vorbis?
The audio quality is really good, comparable to MP3 even when encoded/decoded with integer arithmetic (not floating point arithmetic). The codec has a variety of ports, including ones for platforms with low memory but high processing power (sounds like ESP32!).
The biggest advantage to using Ogg Vorbis is the fact that it sports a BSD license and not something else that would make you go through licensing formalities.
The codec works well with high bit rate audio streams and is meant for online distribution with variable bit rate. This places it closer to AAC (the codec usually carried in MP4 files) and easily above MP3.
Tremor is an integer-only decoder that is fully compatible with the Ogg Vorbis format and can be run on embedded devices (well, not the very simple ones). There is a low-memory branch made available by Xiph.org that works well on devices with little RAM.
However, like most codecs that use windows for decoding, the tradeoff is between memory usage and processor grunt. If you want to keep memory usage low, you need a good processor to make up for it.
Is it feasible to run the decoder on ESP32?
The readily available code from the low-memory branch of Tremor seems to take up about 30k words of RAM at run time. This is without much optimization. The speed is good enough to decode any stream that you can throw at an I2S audio codec.
The header of the Ogg Vorbis file takes up the bulk of the file size if your audio content is really short. Because the header is so complex, decoding it takes a long time. An easy way to improve performance is to restrict the window size to 2048 (or 2048 and 256) only. This will still play virtually all media very efficiently, as other window sizes are quite rare. The time required to decode headers is significantly reduced by doing this, even though the RAM footprint remains effectively the same.
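To make the window-size restriction concrete, here is a minimal sketch of how a player could reject unsupported streams up front. It assumes you already have the raw Vorbis identification header packet in memory; the helper name vorbis_blocksizes_supported is hypothetical and is not part of Tremor. The Vorbis I specification stores the two block sizes as 4-bit exponents packed into a single byte, short block in the low nibble.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of the block-size byte inside a Vorbis identification header:
 * 1 (packet type) + 6 ("vorbis") + 4 (version) + 1 (channels)
 * + 4 (sample rate) + 3 * 4 (bitrate fields) = 28. */
#define VORBIS_ID_BLOCKSIZE_OFFSET 28u
#define VORBIS_ID_HEADER_LEN       30u

/* Hypothetical helper (not a Tremor function): returns true only if the
 * stream uses a 256-sample short window and a 2048-sample long window. */
bool vorbis_blocksizes_supported(const uint8_t *pkt, size_t len)
{
    if (pkt == NULL || len < VORBIS_ID_HEADER_LEN) {
        return false;
    }
    /* Packet type 0x01 followed by the ASCII tag "vorbis". */
    if (pkt[0] != 0x01 || memcmp(pkt + 1, "vorbis", 6) != 0) {
        return false;
    }

    /* Vorbis packs bits LSb first: low nibble is the short block exponent,
     * high nibble is the long block exponent. */
    uint8_t packed = pkt[VORBIS_ID_BLOCKSIZE_OFFSET];
    uint32_t short_block = 1u << (packed & 0x0Fu);
    uint32_t long_block = 1u << (packed >> 4);

    return short_block == 256u && long_block == 2048u;
}
```

A check like this lets the firmware skip a file early instead of discovering an unsupported window size halfway through header setup.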
Running Tremor in its low accuracy configuration does not cause a deal-breaking reduction in audio quality. It does tend to reduce the dynamic range a little bit. But typically, you will not notice it unless you compare the decoded streams on good headphones.
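Why would a low accuracy mode cost dynamic range? The sketch below is an illustration of the general idea, not Tremor's actual source: a full-precision fixed-point multiply needs a 64-bit intermediate, while a cheaper variant throws away low-order bits of both operands so the product fits in 32 bits. The function names and the 9/23 bit split are assumptions chosen for the example.

```c
#include <stdint.h>

/* Full-precision fixed-point multiply: form the 64-bit product and keep
 * the top 32 bits. */
int32_t mult32_full(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 32);
}

/* Reduced-precision variant: drop the low bits of both operands first so
 * the product fits in 32 bits. 9 + 23 = 32, so the result has the same
 * scale as mult32_full, but y keeps only about 8 significant bits.
 * Note: right-shifting negative values is implementation-defined in C;
 * GCC and Clang perform an arithmetic shift. */
int32_t mult32_low(int32_t x, int32_t y)
{
    return (x >> 9) * (y >> 23);
}
```

The two functions agree when the discarded bits are zero and drift apart by a fraction of a percent otherwise, which is the kind of small error that shows up as a slightly raised noise floor rather than as obvious distortion.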
Data input and output
We did not write the code for fetching media files from a server or memory card. This was just an experiment to see what happens when you decode Ogg Vorbis on ESP32 (dedicated core for Tremor decoder). So we used a simple data array from flash instead of getting a “real” file off the network.
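For readers who want to try the same shortcut, here is a rough sketch of feeding a decoder from a flash-resident array. It assumes the decoder accepts an fread-style read callback, as the ov_callbacks structure passed to ov_open_callbacks does in libvorbisfile and mainline Tremor; the low-memory branch may differ. The mem_source_t type and mem_read function are hypothetical names made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor for an Ogg Vorbis file stored as a const array. */
typedef struct {
    const uint8_t *data; /* start of the encoded file in flash */
    size_t size;         /* total length in bytes */
    size_t pos;          /* current read position */
} mem_source_t;

/* fread-style callback: copies up to nmemb items of 'size' bytes each and
 * returns the number of whole items copied (0 at end of data). */
size_t mem_read(void *ptr, size_t size, size_t nmemb, void *datasource)
{
    mem_source_t *src = (mem_source_t *)datasource;
    if (size == 0 || nmemb == 0) {
        return 0;
    }

    size_t remaining = src->size - src->pos;
    size_t items = remaining / size;
    if (items > nmemb) {
        items = nmemb;
    }

    memcpy(ptr, src->data + src->pos, items * size);
    src->pos += items * size;
    return items;
}
```

Swapping the array for a network or SD card source later only means replacing this one callback, which is why we did not bother with a real file for the experiment.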
The ESP32 does deliver an average data rate of 8 Mbps over WiFi, and that should be enough for a lot of applications without using any RAM to cache the stream. If you really need caching, the ESP32-WROVER module with PSRAM can do the job.
The output (decoded audio data) was fed into the SGTL5000 with a setting of 48 kHz (stereo, 16 bits per sample). The DMA engine takes care of playing the data from memory. The SGTL5000 makes the decoded stream sound comparable to any other WAV file that you would play.
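To put numbers on the output side, the small sketch below works out the PCM data rate for the 48 kHz, stereo, 16-bit setting and how long a DMA buffer lasts at that rate. The function names and the 4096-byte buffer size used in the usage note are assumptions for illustration, not values from our firmware.

```c
#include <stdint.h>

/* Bytes of PCM the decoder must produce per second of audio. */
uint32_t pcm_bytes_per_second(uint32_t sample_rate_hz, uint32_t channels,
                              uint32_t bits_per_sample)
{
    return sample_rate_hz * channels * (bits_per_sample / 8u);
}

/* How long a buffer of the given size keeps the I2S output fed,
 * in microseconds (rounded down). */
uint32_t buffer_duration_us(uint32_t buffer_bytes, uint32_t bytes_per_second)
{
    return (uint32_t)(((uint64_t)buffer_bytes * 1000000u) / bytes_per_second);
}
```

For 48 kHz stereo at 16 bits per sample this gives 192000 bytes per second (about 1.5 Mbps of raw PCM), so a hypothetical 4096-byte DMA buffer holds roughly 21 ms of audio. The decoder has to refill buffers at least that often to avoid gaps.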
Looking for ESP32 audio solutions?
We get to work a lot with audio applications built around the ESP32 (including the ESP32-D2WD and ESP32-PICO-D4). Be it a simple audio codec driver or recording 4 studio-quality streams simultaneously with the ESP32 – we have tried it all in hardware and firmware!
If you need help with one of your designs, feel free to contact us. We are friendly and we love embedded electronics!