I came across this new audio codec: https://github.com/descriptinc/descript-audio-codec
Here are some samples of audios encoded with this: https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a68f814d5
Unfortunately I can't do a proper listening test at the moment because my headphones aren't very good, but honestly I didn't notice any difference between the original audio and the audio encoded with this codec (I listened to the music sample on the demonstration page).
What do you think?
This isn't meant for acoustic audio compression; it's meant to feed simplified data into AI algorithms, so I'd imagine it sounds pretty good to a neural network and pretty bad to a human ear.
It's definitely meant for acoustic audio compression, and the demos sound pretty good to me.
I notice they make no mention of how long it takes to encode or decode the audio.
This is really impressive.
Although the provided samples are rather simple, and only in mono. With these samples I can't quickly tell where to focus to hear a difference.
(Worth noting: the source audio samples have apparently already gone through some sort of lossy compression - but that doesn't necessarily make it easier to mask further losses.)
I'll try to install it, hopefully it doesn't require a huge GPU to work.
Has anyone figured out whether it can only work in hard CBR mode, or is there some flexibility?
For example, can it use less bandwidth during periods with relatively simple signal, or periods of complete silence?
Other interesting "AI-based" audio compression resources:
https://github.com/forart/HyMPS/blob/main/AIaudio.md#codecs-
I had a quick play - had to install CUDA 11.7 and a large amount of python stuff to get this working.
If you're on Windows and struggling with the "PyTorch not compiled with CUDA" error, remove all Nvidia CUDA software and Nvidia drivers, then reinstall both using the CUDA 11.7 installer.
Encoded a WAV file (CD Audio, 1h16m51s) - 69 seconds to encode, 145 seconds to decode on a 3080.
Command line:
python -m dac encode in.wav --output .\
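For scale, those timings are well above realtime. A quick back-of-the-envelope check (pure arithmetic, using the 1h16m51s duration and the 69 s / 145 s figures reported above):

```python
# Realtime factors for the reported RTX 3080 run.
audio_seconds = 1 * 3600 + 16 * 60 + 51   # 1h16m51s = 4611 s
encode_seconds = 69
decode_seconds = 145

encode_speed = audio_seconds / encode_seconds   # ~66.8x realtime
decode_speed = audio_seconds / decode_seconds   # ~31.8x realtime
print(f"encode: {encode_speed:.1f}x realtime, decode: {decode_speed:.1f}x realtime")
```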
I tried to install it with pipx on my Debian system, but the download was taking so long that I aborted after 10 minutes; I was also afraid it would install a lot of stuff that would be difficult to remove afterwards. That's not a real problem, though: I can use LXC to make a clean installation without messing with the system. I'll give it a try when I have some spare time.
My PC is from 2017 and my GPU is an Nvidia GT 1030, which is roughly 12-15x slower than a 3080. My question is whether decoding will be proportionally slower too.
There are some royalty-free FLAC recordings on archive.org: https://archive.org/search?query=royalty+free+flac
I'll try converting some of them to .dac and post the results here.
Note that HydrogenAudio user Kamedo2 posted some audio samples for blind listening tests in the following thread, which (I think, given that this is a noncommercial, kind-of-research study) you could use as well:
https://hydrogenaud.io/index.php/topic,98003.0.html
Chris
I successfully installed this encoder in a Python virtual environment (venv); it consumed 9.8 GB of disk space.
I succeeded in converting a .wav file to this format, but I couldn't decode it due to an error in the codec (or maybe my GPU is unsupported).
I'm using an AMD Ryzen 5 1400 and an Nvidia GT 1030 graphics card; it took 55 seconds to encode an audio file 6:12 in duration.
Maybe this codec will become more performant in the future.
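Putting the two timings reported in this thread side by side gives a rough answer to the "how much slower" question (assuming encode speed scales similarly for both files):

```python
# Rough realtime-factor comparison from the two encode timings in this thread.
# GT 1030: 6:12 of audio encoded in 55 s; RTX 3080: 1h16m51s encoded in 69 s.
gt1030_speed = (6 * 60 + 12) / 55                 # ~6.8x realtime
rtx3080_speed = (1 * 3600 + 16 * 60 + 51) / 69    # ~66.8x realtime
ratio = rtx3080_speed / gt1030_speed              # ~9.9x, roughly in line with
print(f"3080 is ~{ratio:.1f}x faster")            # the 12-15x gaming estimate
```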
Sounds amazing at 8kbps.
Definitely the best sounding music compression I've heard at that bitrate.
Amazing at 8 kbps... Can it be decoded without that 9.8 GB of disk space? ;]
That's what I wanted to ask - does it have some simple decoder, or does it need full power of AI computing to reconstruct it?
I had a brief experiment with using "Auto Py To Exe" to compile the dac.py script to an executable, and it produced a 40MB exe with over 4GB of support files in a subfolder (the vast majority is PyTorch).
This does not include the weights file, which is ~300MB. The weights file contains the model data used to encode and decode the audio and is a necessity, and the model is not interchangeable: you have to use the same model to decode any encoded file.
I expect there is plenty of scope to optimise the software size, but it's leaning heavily on the PyTorch baggage. As operating-system support for AI models matures, there might be a standard for using models at the OS level, so hooking into a model would be no different from making sure you have the latest version of DirectX on Windows.
I'm trying to be careful not to violate TOS #8 - the MUSHRA scores on the GitHub page should suffice, but I'd just like to comment that the performance is profoundly good. Mods - feel free to redact this last paragraph if I've made a mistake here.
When I buy decent headphones, I'll comment on the quality.
I hope that its developers optimize the code for less GPU usage in the future.
... it's made a 40MB exe ...
This does not include the weights file which is ~300MB. The weights file contains the model data used to encode and decode the audio and is a necessity, and the model is not interchangeable. You have to use the same model to decode any encoded file.
Thanks for the analysis! Out of curiosity: could you 7zip (preset Ultra) that 40MB exe and 300MB weight file and let us know what file size comes out? That would be a rough estimate of how much room for reduction there is.
Thanks,
Chris
That would be a rough estimate of how much room for reduction there is.
You forgot 4 GB of support files there, Chris :)
No, I didn't; those apparently represent the Python/PyTorch installation itself and could be avoided in software written in, e.g., C or C++.
Chris
Python is slower than compiled languages such as C++ or Go: https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/python3-go.html
But efforts have been made for speeding up Python, such as Codon compiler: https://github.com/exaloop/codon
I don't know whether Python code would be as slow on a GPU as it is on a CPU, but it would be awesome to have this codec compiled through LLVM.
Codon still can't compile 100% of Python code, but the developers are working to implement the missing features, such as metaclass support.
Hmm... 9GB to store the decoder, or 9GB to store files with a slightly less efficient compression and lightweight demands on hardware. Tricky...
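The trade-off can be put in numbers with a back-of-the-envelope break-even calculation (the 64 kbps figure for a conventional codec is purely illustrative, and the 9.8 GB is the venv footprint reported above, not a minimal decoder):

```python
# How much audio must you store before the bitrate savings pay for
# a ~9.8 GB decoder install? Assumed: DAC at 8 kbps vs. a conventional
# codec at 64 kbps - both bitrates are illustrative.
install_bits = 9.8e9 * 8            # disk cost of the decoder, in bits
savings_bps = (64 - 8) * 1000       # bits saved per second of audio
break_even_hours = install_bits / savings_bps / 3600
print(f"break-even: ~{break_even_hours:.0f} hours of stored audio")  # ~389 hours
```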
AI is notorious for making things up in a believable way. The output might sound beautiful, but is it true?
90x smaller than wav, huh? Assuming they mean 16/44.1 PCM wav files, that'd be somewhere in the ballpark of 16 kbps, right?
I'm going to go out on a limb and say it either sounds bad or is completely impractical for most use cases. Or requires licensing over 9000 patents to implement.
Yeah, except: the sample files are mono.
So that means CDDA would encode at 16 kbps as dual mono, without any stereo decorrelation strategy. I have not bothered to look up whether they have any stereo decorrelation algorithm (yet), but obviously that is room for improvement - and also an opportunity to spend more processing power.
I'm going to go out on a limb and say it either sounds bad
Well you can test it ... ? Although the samples are not that interesting ...
or is completely impractical for most use cases.
As of now? Sure.
Thanks for the analysis! Out of curiosity: could you 7zip (preset Ultra) that 40MB exe and 300MB weight file and let us know what file size comes out?
Happy to oblige!
dac.exe 41,865,797 bytes
dac.7z 41,456,443 bytes
weights.pth 306,720,768 bytes
weights.7z 278,740,892 bytes
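From those reported sizes, the compression ratios are easy to compute - the exe is essentially incompressible, while the weights shed only about 9%:

```python
# Compression ratios from the reported 7-Zip (Ultra) results above.
exe_ratio = 41_456_443 / 41_865_797        # ~0.990: nearly incompressible
weights_ratio = 278_740_892 / 306_720_768  # ~0.909: ~9% redundancy in the weights
print(f"exe: {exe_ratio:.3f}, weights: {weights_ratio:.3f}")
```

So there isn't much low-hanging fruit in the weights file itself; size reductions would have to come from quantisation or a smaller model rather than generic compression.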
I'm going to go out on a limb and say it either sounds bad
Well you can test it ... ? Although the samples are not that interesting ...
Ever since the "64 kbps WMA sounds as good as 128 kbps MP3" stuff, I don't trust codec developers not to cherry-pick a subpar implementation of the competing format, cherry-pick samples, or otherwise be somewhat dishonest about things like this.
And it sounds like you currently need an Nvidia GPU to work with this codec? That's something I don't have.
You're always right to distrust first-party benchmarks.
The following is vague guessing because I only have a vague awareness of the tech, so take it with a pinch of salt:
There appear to be CPU and GPU modes, so you can run it on the CPU, but the GPU mode is likely CUDA, which is proprietary Nvidia lock-in tech. It may be possible for AMD GPUs to run the CUDA code (or a reasonably simple port of it to HIP) using ROCm. On the other hand, the repo is mostly Python (albeit with references to CUDA), and the readme mentions torchrun, which is presumably PyTorch; I know PyTorch works on AMD, so maybe it wouldn't take much to get AMD GPUs working. Intel dGPUs I have less of a clue about - they have oneAPI, which they're pushing as an interoperable standard, and apparently they can also run PyTorch. If you have an AMD or Intel GPU and want to try, then godspeed; it's likely the way of pain even if it is possible.
I am still using a 2015 GTX 950, which is about 1.88x the speed of a GT 1030 according to some gaming benchmarks, but still slow. I mean, once this thing becomes feasible on consumer-level hardware it could be used by streaming services and such - or before that happens I can just plug myself into The Matrix.
It would probably be much more interesting if they could make every 64 kbps WMA file sound like lossless without cheating - i.e., without looking up the same song in some existing lossless music catalog.
DAC-JAX: A JAX Implementation of the Descript Audio Codec
https://arxiv.org/abs/2405.11554
We present an open-source implementation of the Descript Audio Codec (DAC) using Google's JAX ecosystem of Flax, Optax, Orbax, AUX, and CLU. Our codebase enables the reuse of model weights from the original PyTorch DAC, and we confirm that the two implementations produce equivalent token sequences and decoded audio if given the same input. We provide a training and fine-tuning script which supports device parallelism, although we have only verified it using brief training runs with a small dataset. Even with limited GPU memory, the original DAC can compress or decompress a long audio file by processing it as a sequence of overlapping "chunks." We implement this feature in JAX and benchmark the performance on two types of GPUs. On a consumer-grade GPU, DAC-JAX outperforms the original DAC for compression and decompression at all chunk sizes. However, on a high-performance, cluster-based GPU, DAC-JAX outperforms the original DAC for small chunk sizes but performs worse for large chunks.
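The overlapping-"chunks" scheme the abstract describes can be illustrated with a minimal, self-contained sketch (this is not the actual DAC or DAC-JAX code; the chunk size, overlap, and overlap-discard stitching policy here are arbitrary illustrations of the general technique):

```python
# Minimal sketch of processing a long signal as overlapping chunks and
# stitching the results by discarding each chunk's leading overlap region.
def process_in_chunks(samples, chunk_size, overlap, fn):
    """Apply fn to overlapping windows; keep only each window's 'new' part."""
    hop = chunk_size - overlap
    out = []
    for start in range(0, len(samples), hop):
        window = samples[start:start + chunk_size]
        processed = fn(window)
        # Keep everything from the first chunk; later chunks drop the
        # leading `overlap` samples already produced by the previous chunk.
        keep_from = 0 if start == 0 else overlap
        out.extend(processed[keep_from:])
    return out

# Identity round-trip: stitched output equals the input signal.
signal = list(range(20))
assert process_in_chunks(signal, chunk_size=8, overlap=4, fn=lambda w: w) == signal
```

In the real codec the overlap exists because the convolutional model needs context on both sides of a frame; this toy version only shows the bookkeeping.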