Ogg Vorbis on ESP32 – Feasible?

The ESP32 has some number-crunching capacity, that is for sure. MP3 decoding is pretty easy to do on the ESP32, even at high data rates. But bring in the problem of MP3 licensing and your commercial application gets a lot more complicated.

The alternative? Xiph's Vorbis project! Vorbis is free and is very well supported by multimedia devices. There are a ton of applications and devices out there that support Vorbis along with MP3.
So how about Ogg Vorbis on ESP32?

First, the hardware we are proud of!

audioSOM32 vorbis ESP32-PICO-D4 audio codec module

Well, we packed everything into a tiny system-on-module and are testing it at the moment. So we figured - why not try some Ogg Vorbis on ESP32-PICO-D4?

If you would like to have one of these modules (and base-board) from the very first lot, leave us a quick contact message from the page footer!

Why Ogg Vorbis?

Why not?

The audio quality is really good, comparable to MP3 even when encoded/decoded with integer arithmetic (not floating point). The codec has a variety of ports, including some for platforms with low memory but high processing power (sounds like the ESP32!).

The biggest advantage of Ogg Vorbis is that it comes with a BSD-style license, so there are no licensing formalities to go through.

The codec works well with high data rate audio streams and is designed for online distribution with variable bit rate. This places it closer to MP4 (AAC) audio and comfortably above MP3.

Tremor

Tremor is an integer-only decoder that is fully compatible with the Ogg Vorbis format and can run on embedded devices (well, not the very simple ones). Xiph.org also provides a low-memory branch that works well on devices with little RAM.

However, as with most codecs that use windows for decoding, the tradeoff is between memory usage and processor grunt. If you want to keep memory usage low, you need a capable processor to make up for it.

Is it feasible to run the decoder on ESP32?

Feasible?

The readily available code from the low-memory branch of Tremor seems to take up about 30k words of RAM at run time. This is without much optimization. The speed is good enough to decode any stream that you can throw at an I2S audio codec.

The header of an Ogg Vorbis file makes up the bulk of the file size if your audio content is really short. Because the header is so complex, decoding it takes a long time. An easy way to improve performance is to restrict the window size to 2048 only (or 2048 and 256). This still plays virtually all media, as other window sizes are quite rare. The time required to decode headers drops significantly, even though the RAM footprint stays effectively the same.
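If a decoder build only supports those window sizes, it helps to reject other streams up front. The following is a minimal sketch of such a check, not the change made to Tremor in this experiment. It assumes the byte layout of the 30-byte identification header packet from the Vorbis I specification, and that `pkt` points at the start of the Vorbis packet (past the Ogg page header). The function name `vorbis_blocksizes_supported` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Identification header packet: 1 type byte, "vorbis", 4-byte version,
 * 1-byte channel count, 4-byte sample rate, three 4-byte bitrate fields,
 * then one byte holding both block size exponents, then a framing byte. */
#define VORBIS_ID_HEADER_LEN   30
#define VORBIS_BLOCKSIZE_BYTE  28

/* Hypothetical helper: returns true when the stream uses only the block
 * sizes this build is tuned for (long = 2048, short = 256 or 2048).
 * The parsed sizes are written to short_bs/long_bs when those are non-NULL. */
bool vorbis_blocksizes_supported(const uint8_t *pkt, size_t len,
                                 unsigned *short_bs, unsigned *long_bs)
{
    if (pkt == NULL || len < VORBIS_ID_HEADER_LEN) {
        return false;
    }
    /* Packet type 1 followed by the ASCII tag "vorbis". */
    if (pkt[0] != 0x01 || memcmp(pkt + 1, "vorbis", 6) != 0) {
        return false;
    }

    /* Low nibble: short block exponent. High nibble: long block exponent. */
    unsigned bs0 = 1u << (pkt[VORBIS_BLOCKSIZE_BYTE] & 0x0F);
    unsigned bs1 = 1u << (pkt[VORBIS_BLOCKSIZE_BYTE] >> 4);

    if (short_bs != NULL) {
        *short_bs = bs0;
    }
    if (long_bs != NULL) {
        *long_bs = bs1;
    }

    return bs1 == 2048 && (bs0 == 256 || bs0 == 2048);
}
```

A stream that fails the check could be skipped or handed to a slower, general-purpose code path.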

Running Tremor in its low-accuracy configuration causes no deal-breaking reduction in audio quality. It does tend to reduce the dynamic range a little. But typically, you will not notice it unless you compare the decoded streams on good headphones.
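To see where the lost precision comes from, here is a rough illustration of the general trade-off. It is not Tremor's actual code: the two function names are hypothetical, and the shift amounts are chosen only so that both versions produce results on the same scale. The reduced version avoids a 64-bit intermediate by discarding low-order bits before multiplying. It assumes `>>` on negative values is an arithmetic shift, which is the case on GCC.

```c
#include <stdint.h>

/* Full precision: 64-bit intermediate, keep the top 32 bits of the product. */
int32_t mul_q32_full(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* Reduced precision: drop low-order bits of both operands first so the
 * product fits in 32 bits. The shifts add up to 32, matching the scale of
 * mul_q32_full(), but y keeps only about 8 significant bits. */
int32_t mul_q32_low(int32_t x, int32_t y)
{
    return (x >> 9) * (y >> 23);
}
```

Both functions agree when the discarded bits are zero. Otherwise the reduced version is slightly off, which is the kind of small error that shows up as a reduced dynamic range rather than obvious distortion.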

Data input and output

We did not write the code for fetching media files from a server or memory card. This was just an experiment to see what happens when you decode Ogg Vorbis on the ESP32, with one core dedicated to the Tremor decoder. So we used a simple data array in flash instead of getting a "real" file off the network.
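For readers who want to try the same approach, a data array can be presented to a decoder through fread/fseek/ftell-style callbacks. The sketch below shows one possible set. It is not the code used in this experiment, and all names (`mem_source_t`, `mem_read`, `mem_seek`, `mem_tell`) are hypothetical. The signatures are modelled on the callback set that the vorbisfile-style API (`ov_open_callbacks()`) is understood to accept, with `int64_t` standing in for the library's own 64-bit offset type. Check the headers of the branch you build against before wiring them in.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */
#include <string.h>

/* Hypothetical cursor over an Ogg Vorbis file stored as a const array. */
typedef struct {
    const uint8_t *data;
    size_t size;
    size_t pos;
} mem_source_t;

/* fread()-style read: copies whole items only, returns the item count. */
size_t mem_read(void *ptr, size_t size, size_t nmemb, void *datasource)
{
    mem_source_t *src = datasource;
    if (size == 0 || nmemb == 0) {
        return 0;
    }
    size_t items = (src->size - src->pos) / size;
    if (items > nmemb) {
        items = nmemb;
    }
    memcpy(ptr, src->data + src->pos, items * size);
    src->pos += items * size;
    return items;
}

/* fseek()-style seek: returns 0 on success, -1 if the target is out of range. */
int mem_seek(void *datasource, int64_t offset, int whence)
{
    mem_source_t *src = datasource;
    int64_t base;
    switch (whence) {
    case SEEK_SET: base = 0; break;
    case SEEK_CUR: base = (int64_t)src->pos; break;
    case SEEK_END: base = (int64_t)src->size; break;
    default: return -1;
    }
    int64_t target = base + offset;
    if (target < 0 || target > (int64_t)src->size) {
        return -1;
    }
    src->pos = (size_t)target;
    return 0;
}

/* ftell()-style position query. */
long mem_tell(void *datasource)
{
    const mem_source_t *src = datasource;
    return (long)src->pos;
}
```

Because the array lives in flash, no close callback is needed, and the same structure can be reset with `mem_seek(&src, 0, SEEK_SET)` to replay the clip.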

The ESP32 does deliver an average data rate of 8 Mbps over WiFi, and that should be enough for a lot of applications without setting aside any RAM for caching the stream. If you really need a cache, the ESP32-WROVER module with PSRAM can do the job.

The output (decoded audio data) was fed into the SGTL5000 with a setting of 48 kHz (stereo, 16 bits per sample). The DMA engine takes care of playing the data from memory. The SGTL5000 makes the decoded stream sound comparable to any other WAV file that you would play.
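The output format fixes how fast the decoder has to produce PCM data and how long one DMA buffer lasts. A small worked calculation, using hypothetical helper names and a 4096-byte buffer that is only an example size, not the one used here:

```c
#include <stdint.h>

/* Bytes per second of interleaved PCM audio. */
uint32_t pcm_bytes_per_second(uint32_t sample_rate_hz, uint32_t channels,
                              uint32_t bits_per_sample)
{
    return sample_rate_hz * channels * (bits_per_sample / 8u);
}

/* Playback time, in microseconds, covered by one DMA buffer. */
uint32_t dma_buffer_duration_us(uint32_t buf_bytes, uint32_t bytes_per_second)
{
    return (uint32_t)(((uint64_t)buf_bytes * 1000000u) / bytes_per_second);
}
```

For 48 kHz stereo at 16 bits per sample, `pcm_bytes_per_second(48000, 2, 16)` gives 192000 bytes per second, and a 4096-byte buffer then lasts about 21.3 ms. The decoder has to refill each buffer within that time to avoid gaps in playback.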

Looking for ESP32 audio solutions?

We get to work a lot with audio applications built around the ESP32 (including the ESP32-D2WD and ESP32-PICO-D4). Be it a simple audio codec driver or recording 4 studio-quality streams simultaneously with the ESP32 - we have tried it all in hardware and firmware!

If you need help with one of your designs, feel free to contact us. We are friendly and we love embedded electronics!

4 thoughts on “Ogg Vorbis on ESP32 – Feasible?”

  1. Hi, nice idea. Seeing as you asked, I’m interested in an ESP32 based audio solution with roughly the following requirements.

    state : DEEP_SLEEP
    movement detected,
    state : LISTENING
    wake up device from deep sleep and wait for sound (above volume limit ‘x’),
    state : RECORDING + APPENDING_FILE
    create/append to file (SD card no doubt) until vol drops below ‘x’ for ‘y’ minutes
    state : LISTENING
    stop recording
    [repeat ‘n’ times until movement detected > 5mins goto state : DEEP_SLEEP

    Then :
    When paired BT device ‘d’ is within range, start up webserver, serve the sound files over Wifi.
    Delete downloaded files on request.
    When prompted over Wifi, set off short timer and shut back down, then start listening again.

    I’m confident I can achieve most of this (eventually), just really interested from someone with superior knowledge whether it is actually feasible using an ESP32, and if so, which board would manage some of these jobs for me?

    So DEEP_SLEEP is with BT acting as some kinda beacon I guess.

    All ears for any other tips on mics, deep sleep etc, oh, yeah, battery powered too.

    As you can see, all a bit theoretic atm, but glad I found your post, I never thought of using Ogg-vorbis before.

    1. Paul,
      The application is definitely feasible. The practicality of the idea will depend on how often you want to check whether the BT device is in range. The ESP32 can run at as low as ~80mA when one core is turned off. The ULP processor can help as well (ULP-only consumption is in the range of a few milliamps). With this strategy, a 180mAh cell might last for a week or more.
      Here is my proposed solution:
      1. Use a MEMS microphone and ULP co-processor to monitor sound levels. This is much better than continuously processing an I2S codec output data stream.
      2. Use an external RP-SMA antenna for better RF reception. Also make sure the antenna is properly matched for high sensitivity.
      3. The MEMS mic can be used for recording, or a separate condenser/boom mic can be used if range is critical. I2S codec is used only for audio recording.

      There is no particular board for this that I know of. But you can use our upcoming audio development board for working with this! We would start selling around next month. 🙂
      Also, here is what you can already get for I2S (+ open driver for ESP32):
      https://github.com/IoTBits/ESP32_SGTL5000_driver/
