Topic: SNAC - Multi-Scale Neural Audio Codec

SNAC - Multi-Scale Neural Audio Codec

Hi,

Further to the mention in the "TSAC: Very Low Bitrate Audio Compression" thread, I think this impressive new(-ish) neural codec deserves some more attention.

Github page with demos: https://github.com/hubertsiuzdak/snac

Three models are available:
- snac_24khz - 0.98 kbps for 24 kHz mono speech content - 19.8 M parameter model (79.5 MB model download)
- snac_32khz - 1.9 kbps for 32 kHz mono music content - 54.5 M parameter model (218 MB model download)
- snac_44khz - 2.6 kbps for 44 kHz mono music content - 54.5 M parameter model (218 MB model download)
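
As a rough sanity check on the download sizes in the list above: if the weights are stored as 32-bit floats (4 bytes per parameter, which is my assumption and not stated anywhere here), the parameter count times four lands very close to the quoted file sizes. A minimal sketch:

```python
# Rough check: parameter count x 4 bytes vs. the quoted download size.
BYTES_PER_PARAM = 4  # assumed float32 storage, not confirmed by the project

models = {
    # name: (parameters in millions, quoted download size in MB)
    "snac_24khz": (19.8, 79.5),
    "snac_32khz": (54.5, 218.0),
    "snac_44khz": (54.5, 218.0),
}

def estimated_size_mb(params_millions: float) -> float:
    """Estimated weight size in MB (1 MB = 10**6 bytes)."""
    return params_millions * BYTES_PER_PARAM

for name, (params_m, quoted_mb) in models.items():
    print(f"{name}: estimated {estimated_size_mb(params_m):.1f} MB, quoted {quoted_mb} MB")
```

The small gap on the 24 kHz model would be explained by file metadata or a different megabyte convention, but that is a guess.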

"Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. Residual Vector Quantization (RVQ) has become the standard technique for neural audio compression using a cascade of VQ codebooks. This paper proposes the Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions. By applying a hierarchy of quantizers at variable frame rates, the codec adapts to the audio structure across multiple timescales. This leads to more efficient compression, as demonstrated by extensive objective and subjective evaluations."

ArXiv link: https://arxiv.org/abs/2410.14411
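
To make the multi-scale idea in the abstract concrete, here is a small sketch of how the bitrate adds up when each quantizer level runs at its own frame rate. The specific numbers are assumptions on my part, based on my reading of the project README rather than anything in this thread: a 4096-entry codebook (12 bits per token), a hop of 512 samples for the 24 kHz model, and three levels that each run at twice the rate of the previous one. The function name is made up for illustration.

```python
import math

def multiscale_bitrate_bps(token_rates_hz, codebook_size):
    """Sum of (tokens per second x bits per token) over all quantizer levels."""
    bits_per_token = math.log2(codebook_size)
    return sum(rate * bits_per_token for rate in token_rates_hz)

# Assumed figures for snac_24khz (not taken from this thread).
finest_rate_hz = 24000 / 512  # assumed hop of 512 samples -> 46.875 Hz
rates = [finest_rate_hz / 4, finest_rate_hz / 2, finest_rate_hz]  # coarse to fine

bitrate = multiscale_bitrate_bps(rates, codebook_size=4096)
print(f"multi-scale: {bitrate / 1000:.2f} kbps")  # about 0.98 kbps

# For comparison: the same three codebooks all running at the finest rate,
# as a plain RVQ stack would.
flat = multiscale_bitrate_bps([finest_rate_hz] * 3, codebook_size=4096)
print(f"single-rate: {flat / 1000:.2f} kbps")
```

Under those assumptions the multi-scale layout reproduces the 0.98 kbps listed for snac_24khz, while running every level at the finest rate would cost noticeably more.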

The demos look very impressive - I've been trying to get it to work but the code provided is quite bare.

From what I understand, you need to convert your input audio to mono at the appropriate sample rate, load it as a tensor, and feed that tensor to the model in torch, which produces tensors of codes that can be written to a file. It's a bit of a "draw the rest of the owl" moment! Maybe minds greater than mine can get this to work.
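
The first steps described above (get to mono, then to floating-point samples) can be sketched with the Python standard library alone. This is only a sketch under assumptions: it handles 16-bit PCM WAV only, it does not resample, and the helper name `wav_to_mono_floats` is made up here. It is not the project's own loader.

```python
import io
import struct
import wave

def wav_to_mono_floats(wav_file):
    """Read a 16-bit PCM WAV and return (sample_rate, mono samples in [-1.0, 1.0))."""
    with wave.open(wav_file, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wf.getnchannels()
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    # Samples are interleaved little-endian signed 16-bit integers.
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    mono = []
    for i in range(0, len(ints), channels):
        frame = ints[i:i + channels]
        # Average the channels of this frame and scale to float.
        mono.append(sum(frame) / (channels * 32768.0))
    return rate, mono

# Tiny demonstration: build a 2-frame stereo WAV in memory and downmix it.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(44100)
    wf.writeframes(struct.pack("<4h", 16384, -16384, 8192, 8192))
buf.seek(0)
rate, mono = wav_to_mono_floats(buf)
print(rate, mono)  # 44100 [0.0, 0.25]
```

From there the samples would still need resampling to the model's rate and wrapping in a torch tensor; as far as I can tell from the README, the model expects a shape of (batch, 1, samples).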

The Issues page shows this, suggesting code from Descript Audio Codec could be used as a basis to create a CLI encoder / decoder:
https://github.com/hubertsiuzdak/snac/issues/28
rc55.com - nothing going on

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #1
Looks promising, but I don't know how to build it or how to make it accept lossless input files.

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #2
OK, I managed to create the required environment on WSL2 Ubuntu on Windows 10.
It also accepts stereo .wav files now, by encoding and decoding each channel separately.
It yielded 143,856 (ch1) + 144,101 (ch2) bytes (7.672 kbps), taking 2 min 54.037 s (1.723x realtime) on a 300-second stereo 44.1 kHz lossless input.
It is amazing for compression under 8 kbps, though female and male vocals, strings, etc. tend to degrade.
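
For anyone who wants to check these figures, the arithmetic is sketched below. It assumes the input is exactly 300 seconds long and that 1 kbps means 1000 bits per second. With those assumptions the bitrate comes out a hair above the quoted 7.672 kbps, which would be consistent with the file running slightly longer than 300 seconds.

```python
# Figures reported above; the exact 300 s duration is an assumption.
ch1_bytes = 143_856
ch2_bytes = 144_101
duration_s = 300.0
processing_s = 2 * 60 + 54.037  # 2 min 54.037 s for encode + decode combined

total_bits = (ch1_bytes + ch2_bytes) * 8
bitrate_kbps = total_bits / duration_s / 1000
realtime_factor = duration_s / processing_s

print(f"bitrate: {bitrate_kbps:.3f} kbps")
print(f"speed:   {realtime_factor:.3f}x realtime")
```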


Here are my encoded sample and the decoded stereo WAV. The music is experiencia.wv (Latin), from https://www.rarewares.org/test_samples/

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #3
Awesome quality for such a bitrate! Thanks, Kamedo2, for sharing the decoded sample :)
Wavpack Hybrid -c4hx6

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #4
Quote from Reply #2:
"OK, I managed to create the required environment on WSL2 Ubuntu on Windows 10.
It also accepts stereo .wav files now, by encoding and decoding each channel separately.
It yielded 143,856 (ch1) + 144,101 (ch2) bytes (7.672 kbps), taking 2 min 54.037 s (1.723x realtime) on a 300-second stereo 44.1 kHz lossless input.
It is amazing for compression under 8 kbps, though female and male vocals, strings, etc. tend to degrade."

Thanks Kamedo2.
You mention 300 seconds of input, but the decoded WAV file in the attachment is 3 MB; 300 seconds of audio (2 ch, 44.1 kHz, 16 bit) is about 50 MB. Also, what does the processing time cover? It is not clear whether it is for compressing one or both channels, or for decoding one or both channels.
Also, the sample in the attachment cannot give us accurate information because it is too noisy. Can we see the exact total input size, encoding time and decoding time? And that is not counting the 256 MB model.

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #5
Quote from Reply #4: "300 seconds of input is mentioned but the decoded wav file in the attachment is 3 mb."

The 300-second figure is from a longer test WAV file, not the attached 22ex_(experiencia), which is 20 seconds long. This forum forbids sharing a music file longer than 30 seconds.
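
The sizes are easy to verify for uncompressed 16-bit stereo PCM at 44.1 kHz: a 20-second clip is about 3.5 MB, which fits the roughly 3 MB attachment, while 300 seconds is about 53 MB. A quick sketch, ignoring the WAV header and taking 1 MB as 10^6 bytes (the function name is just for illustration):

```python
def pcm_size_bytes(seconds, sample_rate=44_100, channels=2, bits=16):
    """Size of raw PCM audio in bytes, ignoring any container header."""
    return seconds * sample_rate * channels * (bits // 8)

short_clip = pcm_size_bytes(20)    # the attached 20-second sample
long_file = pcm_size_bytes(300)    # the 300-second test file

print(f"20 s:  {short_clip / 1e6:.1f} MB")
print(f"300 s: {long_file / 1e6:.1f} MB")
```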

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #6
Quote from Reply #4: "It is not clear whether it is single/dual channel compression or single/dual channel decoding."

For a 300-second stereo 44.1 kHz 16-bit lossless input, it took 174.037 seconds to compress and decode both channels on an AMD Ryzen 7 5700X and an NVIDIA GeForce RTX 3060 12GB (which is far slower than TSAC on the same GPU).

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #7
Here's macabre, the Orchestral sample from http://www.rarewares.org/test_samples/
While the decoded sound is surprisingly coherent and consistent for 8kbps, the balance is different from the original. Just "different".

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #8
I cannot show you the encoding speed and decoding speed separately, because the two are combined in one program and run as one batch. The total encode and decode time for macabre is 6.166 s (2.833x realtime).
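
If the two stages were ever timed separately, it could look roughly like the sketch below, with each stage wrapped in its own timer. The `encode` and `decode` functions here are hypothetical stand-ins, not the real model calls, and on a GPU one would also need to wait for the device to finish before reading the clock, which this sketch does not do.

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical stand-ins for the real encoder and decoder calls.
def encode(samples):
    return [round(s * 7) for s in samples]

def decode(codes):
    return [c / 7 for c in codes]

audio = [0.0, 0.5, -0.5, 1.0]
codes, encode_s = timed(encode, audio)
decoded, decode_s = timed(decode, codes)
print(f"encode: {encode_s:.6f} s, decode: {decode_s:.6f} s")
```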

The core lines:
Spoiler (click to show/hide)

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #9
Thanks Kamedo for this second sample. Still impressive. In comparison, USAC at 12 kbps is significantly worse. Experiencia at 12 kbps is better than Macabre.

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #10
Quote from Reply #6: "For stereo 44.1kHz 16bit 300 seconds lossless input, it took 174.037 seconds to compress and decode both channels, on AMD Ryzen 7 5700X and NVIDIA GeForce RTX 3060 12GB. (which is far slower than TSAC on the same GPU)"
I don't think it is using only the GPU. If this is the processing speed with the CPU and GPU mentioned, it is definitely not suitable for practical use. Also, I don't think going below 128 kbps is worth the risk nowadays.

@guruboolez Unfortunately, the files cannot be played.


Re: SNAC - Multi-Scale Neural Audio Codec

Reply #12
@guruboolez thanks a lot. Unfortunately I don't have a USAC decoder. Judging from the results, USAC at 12 kbps looks really good, and it is a much faster and more practical solution than the trained neural solutions.
In real life there is not much difference between 12 kbps and 8 kbps in terms of space saving. In that case, processing speed will always be the deciding factor.

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #13
@Kamedo2 I just wanted to thank you for creating that working encoder - that was very kind of you!