SNAC - Multi-Scale Neural Audio Codec

SNAC - Multi-Scale Neural Audio Codec

2025-03-07 13:22:52

Hi,

Following on from the mention in the "TSAC: Very Low Bitrate Audio Compression" thread, I think this impressive new(-ish) neural codec deserves some more attention.

GitHub page with demos: https://github.com/hubertsiuzdak/snac

Three models are available:
- snac_24khz - 0.98 kbps for 24 kHz mono speech content - 19.8 M parameter model (79.5 MB model download)
- snac_32khz - 1.9 kbps for 32 kHz mono music content - 54.5 M parameter model (218 MB model download)
- snac_44khz - 2.6 kbps for 44 kHz mono music content - 54.5 M parameter model (218 MB model download)

"Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. Residual Vector Quantization (RVQ) has become the standard technique for neural audio compression using a cascade of VQ codebooks. This paper proposes the Multi-Scale Neural Audio Codec, a simple extension of RVQ where the quantizers can operate at different temporal resolutions. By applying a hierarchy of quantizers at variable frame rates, the codec adapts to the audio structure across multiple timescales. This leads to more efficient compression, as demonstrated by extensive objective and subjective evaluations."

ArXiv link: https://arxiv.org/abs/2410.14411

The demos look very impressive - I've been trying to get it to work but the code provided is quite bare.

From what I understand, you need to convert your input audio to mono at the appropriate sample rate, load that audio as a tensor, then feed the tensor to the model in torch, which returns a list of code tensors that can be written to a file. It's a bit of a "draw the rest of the owl" moment! Maybe greater minds than mine can get this to work.
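The steps described above can be sketched roughly as follows. This is a minimal sketch rather than the project's own tooling: `to_model_input` and `roundtrip` are hypothetical helper names, the input is assumed to be already resampled to the model's sample rate, and `roundtrip` assumes the `snac` package's `SNAC.from_pretrained`, `encode` and `decode` calls as shown in the project README.

```python
import numpy as np


def to_model_input(samples):
    """Downmix a (channels, T) array to mono and shape it as (1, 1, T)."""
    samples = np.atleast_2d(np.asarray(samples, dtype=np.float32))
    mono = samples.mean(axis=0)  # average the channels
    return mono[np.newaxis, np.newaxis, :]  # add batch and channel dimensions


def roundtrip(batch, model_name="hubertsiuzdak/snac_44khz"):
    """Encode and decode one mono batch (assumption: snac README API)."""
    # Heavy dependencies are imported here so that the helper above
    # can be used without torch or snac installed.
    import torch
    from snac import SNAC

    model = SNAC.from_pretrained(model_name).eval()
    with torch.inference_mode():
        codes = model.encode(torch.from_numpy(batch))  # list of code tensors
        audio_hat = model.decode(codes)  # reconstructed waveform
    return codes, audio_hat


if __name__ == "__main__":
    stereo = np.zeros((2, 44_100))  # one second of silence as a stand-in
    print(to_model_input(stereo).shape)
```

The list returned by `encode` holds one tensor per quantizer level, and it is this list that would need to be written to disk to get an actual compressed file.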

An issue on the GitHub page suggests that code from Descript Audio Codec could be used as a basis for a CLI encoder/decoder:
https://github.com/hubertsiuzdak/snac/issues/28

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #1 – 2025-03-08 06:11:39

Looks promising, but I don't know how to build it or make it accept lossless input files.

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #2 – 2025-03-08 17:02:59

OK, I managed to set up the required environment on WSL2 Ubuntu on Windows 10.
The script now also accepts stereo .wav files, encoding and decoding each channel separately.
It produced 143,856 (ch1) + 144,101 (ch2) bytes (7.672 kbps) and took 2 min 54.037 s (1.723x realtime) on a 300-second stereo 44.1 kHz lossless input.
That is amazing for compression under 8 kbps, though female and male vocals, strings, etc. tend to degrade.
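For anyone wanting to check figures like these, here is a small sketch of the arithmetic. The function names are made up for illustration, and the duration is assumed to be exactly 300 seconds, so the results may differ from the reported values in the last digit if the real file is slightly longer or shorter.

```python
def bitrate_kbps(total_bytes, duration_s):
    """Average bitrate in kbps for a payload of total_bytes over duration_s."""
    return total_bytes * 8 / duration_s / 1000


def realtime_factor(duration_s, processing_s):
    """How many seconds of audio are processed per second of wall-clock time."""
    return duration_s / processing_s


total_bytes = 143_856 + 144_101  # both channel files from the post above
print(f"{bitrate_kbps(total_bytes, 300.0):.3f} kbps")
print(f"{realtime_factor(300.0, 174.037):.3f}x realtime")
```

Note that the byte counts are for the compressed code files only and do not include the model download.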

Create a directory for the program files and place snac-encode.py (listed below) in it.

user@DESKTOP-XXXXXXX:~$ mkdir snac && cd snac

Refresh the local package index and upgrade the installed packages.

user@DESKTOP-XXXXXXX:~/snac$ sudo apt update && sudo apt upgrade

Install pip.

user@DESKTOP-XXXXXXX:~/snac$ sudo apt install python3-pip

Install the venv module, which creates virtual environments for Python 3 libraries.

user@DESKTOP-XXXXXXX:~/snac$ sudo apt install python3.12-venv

Create a virtual environment for SNAC and its required libraries, separate from the system's Python packages.

user@DESKTOP-XXXXXXX:~/snac$ python3 -m venv ~/venv

Activate the virtual environment. Libraries you install from now on will be placed in this separate environment.

user@DESKTOP-XXXXXXX:~/snac$ source ~/venv/bin/activate

Install the torch and torchaudio Python libraries, which are used for audio processing.

(venv) user@DESKTOP-XXXXXXX:~/snac$ pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

Install SNAC and the other required Python libraries.

(venv) user@DESKTOP-XXXXXXX:~/snac$ pip install pandas numpy snac

Confirm that torch is installed.

(venv) user@DESKTOP-XXXXXXX:~/snac$ python -c "import torch; print(torch.__version__)"

Check that CUDA is available (this should print True).

(venv) user@DESKTOP-XXXXXXX:~/snac$ python -c "import torch; print(torch.cuda.is_available())"

Install SoX and FFmpeg, which are used to read and write .wav files.

(venv) user@DESKTOP-XXXXXXX:~/snac$ sudo apt install sox ffmpeg
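Before running the script, the whole setup can be checked in one go. The following is a minimal sketch and not part of the original instructions: `check_environment` is a hypothetical helper that only tests whether the Python modules can be found and whether the command-line tools are on the PATH. It does not verify versions or CUDA support.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "torchaudio", "numpy", "snac"),
                      tools=("sox", "ffmpeg")):
    """Return a dict mapping each module or tool name to True if it was found."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on the PATH
        report[name] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:12s} {'found' if found else 'MISSING'}")
```

Run it inside the activated virtual environment, since the modules are only visible there.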

Usage:

(venv) user@DESKTOP-XXXXXXX:~/snac$ time python snac-encode.py -i input.wav -o encoded.snac -d decoded.wav

Source code (snac-encode.py)

#!/usr/bin/env python3
import torch
import torchaudio
import argparse
import os
import numpy as np
from snac import SNAC

def parse_arguments():
    parser = argparse.ArgumentParser(description='SNAC Audio Encoder/Decoder')
    parser.add_argument('-i', '--input', required=True, help='Input WAV file')
    parser.add_argument('-o', '--output', required=True, help='Output SNAC file')
    parser.add_argument('-d', '--decoded', required=True, help='Decoded (reconstructed) WAV file')
    parser.add_argument('-s', '--sample_rate', type=int, default=44100, 
                        help='Target sample rate (default: 44100)')
    parser.add_argument('--device', type=str, default='cuda' if torch.cuda.is_available() else 'cpu',
                        help='Device to use (cuda/cpu)')
    parser.add_argument('--model', type=str, default='hubertsiuzdak/snac_44khz',
                        help='SNAC model to use')
    return parser.parse_args()

def load_audio(file_path, target_sr=44100):
    """Load audio file and resample if needed."""
    waveform, sample_rate = torchaudio.load(file_path)
    
    # Resample if needed
    if sample_rate != target_sr:
        print(f"Resampling from {sample_rate}Hz to {target_sr}Hz")
        resampler = torchaudio.transforms.Resample(sample_rate, target_sr)
        waveform = resampler(waveform)
    
    return waveform, target_sr

def process_audio(model, audio, device):
    """Process audio through SNAC model."""
    # Move audio to the appropriate device
    audio = audio.to(device)
    
    with torch.inference_mode():
        # Process through SNAC model
        audio_hat, codes = model(audio)
    
    return audio_hat, codes

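def save_codes(codes, file_path):
    """Save SNAC codes to a file."""
    # Convert codes to numpy arrays and save
    codes_data = {}
    for i, code_sequence in enumerate(codes):
        codes_data[f'layer_{i}'] = code_sequence.cpu().numpy()
    
    # Save using numpy's compressed format. Write through an open file handle:
    # given a path string, numpy appends ".npz" to the name, so the file on disk
    # would not match the name printed below or the one listed in the manifest.
    with open(file_path, 'wb') as f:
        np.savez_compressed(f, **codes_data)
    print(f"Saved compressed codes to {file_path}")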

def load_codes(file_path):
    """Load SNAC codes from a file."""
    data = np.load(file_path)
    codes = []
    
    # Convert back to torch tensors
    for i in range(len(data.files)):
        key = f'layer_{i}'
        if key in data:
            codes.append(torch.from_numpy(data[key]))
    
    return codes

def save_audio(audio, file_path, sample_rate):
    """Save audio tensor to WAV file."""
    torchaudio.save(file_path, audio.cpu(), sample_rate)
    print(f"Saved reconstructed audio to {file_path}")

def main():
    args = parse_arguments()
    device = args.device
    
    # Load SNAC model
    print(f"Loading SNAC model from {args.model}")
    model = SNAC.from_pretrained(args.model).eval().to(device)
    
    # Load and preprocess audio
    print(f"Loading audio from {args.input}")
    waveform, sample_rate = load_audio(args.input, args.sample_rate)
    num_channels = waveform.shape[0]
    
    if num_channels > 2:
        print(f"Warning: Audio has {num_channels} channels. Only the first two will be processed.")
        waveform = waveform[:2]
        num_channels = 2
    
    # Process each channel separately
    reconstructed_channels = []
    all_codes = []
    
    for ch in range(num_channels):
        print(f"Processing channel {ch+1}/{num_channels}")
        # Extract single channel and add batch dimension (B, 1, T)
        channel_audio = waveform[ch:ch+1].unsqueeze(0)
        
        # Process through SNAC
        audio_hat, codes = process_audio(model, channel_audio, device)
        
        # Store results
        reconstructed_channels.append(audio_hat.squeeze(0))
        all_codes.append(codes)
    
    # Combine reconstructed channels
    if num_channels == 2:
        reconstructed_audio = torch.cat(reconstructed_channels, dim=0)
    else:
        reconstructed_audio = reconstructed_channels[0]
    
    # Save reconstructed audio
    save_audio(reconstructed_audio, args.decoded, sample_rate)
    
    # Save compressed codes
    output_base, _ = os.path.splitext(args.output)
    
    if num_channels == 1:
        # Single channel, just save directly
        save_codes(all_codes[0], args.output)
    else:
        # For stereo, save each channel with suffix
        for ch in range(num_channels):
            channel_output = f"{output_base}_ch{ch+1}.snac"
            save_codes(all_codes[ch], channel_output)
        
        # Create a manifest file with channel info
        with open(args.output, 'w') as f:
            f.write(f"channels: {num_channels}\n")
            for ch in range(num_channels):
                f.write(f"channel_{ch+1}: {os.path.basename(output_base)}_ch{ch+1}.snac\n")
    
    print("Processing complete!")

if __name__ == "__main__":
    main()
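A side note on the storage format used by save_codes and load_codes: when np.savez_compressed is given a path string that does not end in .npz, numpy appends that extension, so "encoded.snac" would land on disk as "encoded.snac.npz". Writing through an open file handle keeps the exact name. The sketch below shows the round trip with made-up code arrays; `save_layers` and `load_layers` are hypothetical helpers, not part of SNAC.

```python
import os
import tempfile

import numpy as np


def save_layers(layers, file_path):
    """Write one array per layer to file_path, keeping the exact file name."""
    arrays = {f"layer_{i}": np.asarray(a) for i, a in enumerate(layers)}
    with open(file_path, "wb") as f:
        np.savez_compressed(f, **arrays)


def load_layers(file_path):
    """Read the layers back in order."""
    with np.load(file_path) as data:
        return [data[f"layer_{i}"] for i in range(len(data.files))]


# Made-up stand-ins for the code sequences, one per temporal resolution
layers = [np.arange(4).reshape(1, 4), np.arange(8).reshape(1, 8)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "encoded.snac")
    save_layers(layers, path)
    restored = load_layers(path)
    names = os.listdir(tmp)  # only "encoded.snac", no ".npz" suffix added

print(names, [a.shape for a in restored])
```

np.load recognises the archive from its contents, so the .snac extension is not a problem when reading the file back.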

Here are my encoded sample and the decoded stereo WAV. The music is experiencia.wv (Latin) from https://www.rarewares.org/test_samples/

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #3 – 2025-03-08 17:10:14

Awesome quality for such a bitrate! Thanks Kamedo2 for sharing the decoded sample.

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #4 – 2025-03-08 17:52:14

Quote from: Kamedo2 on 2025-03-08 17:02:59

OK, I managed to set up the required environment on WSL2 Ubuntu on Windows 10.
The script now also accepts stereo .wav files, encoding and decoding each channel separately.
It produced 143,856 (ch1) + 144,101 (ch2) bytes (7.672 kbps) and took 2 min 54.037 s (1.723x realtime) on a 300-second stereo 44.1 kHz lossless input.
That is amazing for compression under 8 kbps, though female and male vocals, strings, etc. tend to degrade.

Thanks Kamedo2.
300 seconds of input is mentioned, but the decoded WAV file in the attachment is 3 MB. 300 seconds of audio (2 ch, 44.1 kHz, 16 bit) is about 50 MB. Also, what does the processing time cover? It is not clear whether it is single/dual channel compression or single/dual channel decoding.
Also, the sample in the attachment cannot give us accurate information because it is too noisy. Can we see the exact total input size, encoding time and decoding time? We are not including the 256 MB model.

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #5 – 2025-03-08 18:10:02

Quote from: genuine on 2025-03-08 17:52:14

300 seconds of input is mentioned, but the decoded WAV file in the attachment is 3 MB.

The 300-second input is a separate, longer test WAV file, not the attached 22ex_(experiencia) sample, which is 20 seconds long. This forum forbids sharing a music file longer than 30 seconds.
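The file sizes are consistent with those durations. As a quick sketch of the arithmetic (`pcm_size_bytes` is a made-up helper, and the WAV header is ignored):

```python
def pcm_size_bytes(duration_s, sample_rate=44_100, channels=2, bits_per_sample=16):
    """Size of uncompressed PCM audio in bytes, without any container header."""
    frames = int(duration_s * sample_rate)
    return frames * channels * (bits_per_sample // 8)


short_clip = pcm_size_bytes(20)    # the 20-second attached sample
long_clip = pcm_size_bytes(300)    # the 300-second test file
encoded = 143_856 + 144_101        # byte counts reported earlier in the thread

print(f"20 s:  {short_clip / 1e6:.1f} MB")
print(f"300 s: {long_clip / 1e6:.1f} MB")
print(f"compression ratio on the 300 s file: about {long_clip / encoded:.0f}:1")
```

So a 20-second stereo clip comes to roughly 3.5 MB, in line with the attachment size, while the 300-second file is about 53 MB.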

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #6 – 2025-03-08 18:21:54

Quote from: genuine on 2025-03-08 17:52:14

It is not clear whether it is single/dual channel compression or single/dual channel decoding.

For a 300-second stereo 44.1 kHz 16-bit lossless input, it took 174.037 seconds to compress and decode both channels on an AMD Ryzen 7 5700X and an NVIDIA GeForce RTX 3060 12GB (which is far slower than TSAC on the same GPU).

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #7 – 2025-03-08 18:31:59

Here's macabre, the orchestral sample from http://www.rarewares.org/test_samples/
While the decoded sound is surprisingly coherent and consistent for 8 kbps, the balance is different from the original. Just "different".

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #8 – 2025-03-08 18:40:17

I cannot show you the encoding and decoding speeds separately, because the two are combined in one program and run in a single pass. The total encode and decode time for macabre is 6.166 s (2.833x realtime).

The core lines:

    for ch in range(num_channels):
        print(f"Processing channel {ch+1}/{num_channels}")
        # Extract single channel and add batch dimension (B, 1, T)
        channel_audio = waveform[ch:ch+1].unsqueeze(0)
        
        # Process through SNAC
        audio_hat, codes = process_audio(model, channel_audio, device)
        
        # Store results
        reconstructed_channels.append(audio_hat.squeeze(0))
        all_codes.append(codes)
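If separate numbers were wanted, one option would be to call the encoder and decoder individually and time each call. The sketch below is only an illustration under assumptions: `timed` is a hypothetical helper, the fake encode/decode functions are stand-ins so the example runs without a model, and with SNAC the calls would be `model.encode` and `model.decode` as in the project README. On a GPU, `torch.cuda.synchronize` would be passed as `sync`, because CUDA work is queued asynchronously and the clock would otherwise stop too early.

```python
import time


def timed(fn, *args, sync=None):
    """Run fn(*args) and return (result, elapsed seconds).

    sync is an optional callable invoked before starting and before stopping
    the clock, e.g. torch.cuda.synchronize when running on a GPU.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn(*args)
    if sync is not None:
        sync()
    return result, time.perf_counter() - start


# Stand-ins for model.encode and model.decode (assumption, for illustration only)
def fake_encode(samples):
    return [s // 2 for s in samples]


def fake_decode(codes):
    return [c * 2 for c in codes]


codes, t_enc = timed(fake_encode, list(range(1000)))
audio, t_dec = timed(fake_decode, codes)
print(f"encode: {t_enc:.6f} s, decode: {t_dec:.6f} s")
```

Model loading time would still need to be measured separately, since it is paid once per run rather than per file.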

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #9 – 2025-03-08 18:43:16

Thanks Kamedo for this second sample. Still impressive. In comparison, USAC at 12 kbps is significantly worse. Experiencia at 12 kbps is better than Macabre.

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #10 – 2025-03-08 19:49:24

Quote from: Kamedo2 on 2025-03-08 18:21:54

For a 300-second stereo 44.1 kHz 16-bit lossless input, it took 174.037 seconds to compress and decode both channels on an AMD Ryzen 7 5700X and an NVIDIA GeForce RTX 3060 12GB (which is far slower than TSAC on the same GPU).

I don't think it's just using the GPU. If the processing speed with the mentioned CPU and GPU is like this, it is definitely not suitable for practical use. Also, I think it is not worth risking some things by going below 128 kbps nowadays.

@guruboolez Unfortunately, the files cannot be played.

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #11 – 2025-03-08 19:55:10

@genuine: do you have a USAC decoder? It must also support SBR decoding.
I have attached decoded files to this message.

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #12 – 2025-03-08 20:25:37

@guruboolez thanks a lot. Unfortunately I don't have a USAC decoder. When I look at the results USAC 12 kbps looks really good. And this is a much faster and more practical solution than other trained neural solutions.
In real life there is not much difference between 12 kbps and 8 kbps in terms of space saving. In this case, processing speed is always the reason for preference.

Re: SNAC - Multi-Scale Neural Audio Codec

Reply #13 – 2025-03-09 21:30:57

@Kamedo2 I just wanted to thank you for creating that working encoder - that was very kind of you!
