BigVGAN v2 24kHz 100band 256x: A High-Performance Neural Vocoder for Realistic Speech and Audio Generation

In the rapidly evolving world of speech synthesis, voice cloning, and AI-driven audio generation, the quality of the final waveform output determines the overall user experience. While acoustic models generate mel spectrograms or intermediate representations, it is the vocoder that converts those features into realistic, natural-sounding audio. Among the most advanced neural vocoders available today is nvidia/bigvgan_v2_24khz_100band_256x, part of the BigVGAN v2 family developed by NVIDIA.

BigVGAN v2 is a large-scale, GAN-based universal neural vocoder designed to generate high-fidelity audio across diverse domains, including multilingual speech, environmental sounds, and musical instruments. Built for scalability and performance, it introduces optimized CUDA kernels, improved discriminator strategies, and large-scale training methodologies that make it suitable for both research and production environments.

This article explores BigVGAN v2 24kHz in depth, including its architecture, features, performance optimizations, pretrained model configurations, installation steps, and practical use cases.

What Is BigVGAN?

BigVGAN is a universal neural vocoder based on Generative Adversarial Networks (GANs). It synthesizes time-domain waveforms from mel spectrogram inputs. The original research paper, BigVGAN: A Universal Neural Vocoder with Large-Scale Training, introduced a scalable framework for high-quality waveform generation trained on large and diverse datasets.

Unlike traditional signal-processing vocoders, BigVGAN leverages deep convolutional architectures combined with adversarial training to achieve highly realistic audio synthesis. The v2 release further improves both sound quality and inference speed.

The specific model bigvgan_v2_24khz_100band_256x generates 24kHz audio using 100 mel frequency bands and a 256x upsampling ratio.

Key Features of BigVGAN v2 24kHz

1. High-Fidelity 24kHz Audio Output

The model operates at a 24kHz sampling rate, delivering high-resolution audio suitable for:

  • Text-to-speech systems
  • Audiobooks
  • Virtual assistants
  • Voice cloning
  • Multilingual speech synthesis

Compared to lower sampling rates such as 16kHz, 24kHz provides clearer high-frequency detail and improved perceptual quality.

2. 100 Mel Band Configuration

The model uses 100 mel frequency bands, allowing it to capture fine-grained spectral detail. This improves:

  • Prosody naturalness
  • Harmonic richness
  • High-frequency fidelity, with fewer artifacts

This configuration balances computational efficiency and audio quality.

3. 256x Upsampling Ratio

The 256x upsampling ratio means the generator expands each mel frame into 256 waveform samples. At 24 kHz, this corresponds to a mel frame rate of 93.75 Hz, from which the model reconstructs the full time-domain signal.
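The frame/sample arithmetic implied by the model name can be sketched directly. This assumes a hop size of 256 samples at 24 kHz, matching the 256x upsampling ratio:

```python
# Frame/sample bookkeeping for a 256x vocoder at 24 kHz.
SAMPLING_RATE = 24_000
HOP_SIZE = 256  # one mel frame per 256 audio samples

def mel_frames_to_samples(n_frames: int) -> int:
    """Number of waveform samples the vocoder produces for n mel frames."""
    return n_frames * HOP_SIZE

def seconds_to_mel_frames(seconds: float) -> int:
    """Whole mel frames covering a given duration (truncated)."""
    return int(seconds * SAMPLING_RATE / HOP_SIZE)

# One second of 24 kHz audio corresponds to 93.75 mel frames,
# so 94 frames yield slightly more than one second of output.
print(seconds_to_mel_frames(1.0))   # 93
print(mel_frames_to_samples(94))    # 24064
```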

4. Improved Multi-Scale Discriminator

BigVGAN v2 introduces a multi-scale sub-band CQT discriminator and a multi-scale mel spectrogram loss. These enhancements improve:

  • Training stability
  • Audio realism
  • High-frequency accuracy, with reduced distortion

This design allows the model to generalize across speech, environmental sounds, and instruments.

5. Custom CUDA Kernel for Faster Inference

One of the major advancements in BigVGAN v2 is the fused CUDA kernel for anti-aliased activation. This combines:

  • Upsampling
  • Activation
  • Downsampling

into a single optimized CUDA operation. NVIDIA reports 1.5x to 3x faster inference speed on an A100 GPU when using the custom kernel.
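Conceptually, the fused operation applies the Snake activation at a temporarily raised sample rate so the nonlinearity introduces less aliasing. A simplified eager-mode sketch, using linear resampling in place of the real low-pass FIR filters and the fused kernel:

```python
import torch
import torch.nn.functional as F

def snake(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Snake activation: x + sin^2(alpha * x) / alpha."""
    return x + torch.sin(alpha * x) ** 2 / alpha

def anti_aliased_snake(x: torch.Tensor, ratio: int = 2) -> torch.Tensor:
    """Upsample -> activate -> downsample, as three separate ops.

    x: [B, C, T]. BigVGAN v2 fuses these steps into one CUDA kernel and
    uses proper low-pass filters; linear interpolation here is only a
    stand-in to show the data flow.
    """
    _, _, t = x.shape
    up = F.interpolate(x, scale_factor=ratio, mode="linear", align_corners=False)
    act = snake(up)                 # nonlinearity at the higher rate
    down = F.interpolate(act, size=t, mode="linear", align_corners=False)
    return down
```

Running the activation at the higher rate keeps the harmonics it generates below the Nyquist limit of the intermediate signal, which is why fusing the three steps into one kernel pays off so heavily at inference time.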

This makes BigVGAN v2 suitable for:

  • Real-time speech synthesis
  • Large-scale batch inference
  • Production-level TTS systems

Technical Specifications

| Feature          | Specification                   |
| ---------------- | ------------------------------- |
| Model Name       | bigvgan_v2_24khz_100band_256x   |
| Sampling Rate    | 24 kHz                          |
| Mel Bands        | 100                             |
| Upsampling Ratio | 256x                            |
| Parameters       | ~112M                           |
| Framework        | PyTorch                         |
| License          | MIT                             |
| Training Steps   | 5M                              |
| Dataset          | Large-scale diverse compilation |

The MIT license makes this model commercially friendly and suitable for enterprise deployment.

Installation and Setup

To use the pretrained model from Hugging Face:

git lfs install
git clone https://huggingface.co/nvidia/bigvgan_v2_24khz_100band_256x

Basic usage example:

import torch
import bigvgan
import librosa
from meldataset import get_mel_spectrogram  # ships with the BigVGAN repository

device = 'cuda'

# Load the pretrained generator from Hugging Face
model = bigvgan.BigVGAN.from_pretrained(
    'nvidia/bigvgan_v2_24khz_100band_256x',
    use_cuda_kernel=False
)

# Remove weight norm and switch to inference mode
model.remove_weight_norm()
model = model.eval().to(device)

# Load audio at the model's sampling rate (24 kHz)
wav, sr = librosa.load("audio.wav", sr=model.h.sampling_rate, mono=True)
wav = torch.FloatTensor(wav).unsqueeze(0)  # shape [1, T]

# Compute the 100-band mel spectrogram the model expects
mel = get_mel_spectrogram(wav, model.h).to(device)

# Synthesize the waveform; output shape is [1, 1, T]
with torch.inference_mode():
    wav_gen = model(mel)
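Before writing the result to disk, the floating-point output in [-1, 1] is typically converted to 16-bit PCM. A minimal helper (the function name is my own):

```python
import torch

def to_int16_pcm(wav_gen: torch.Tensor) -> torch.Tensor:
    """Convert vocoder output of shape [1, 1, T] in [-1, 1] to int16 PCM [T]."""
    wav = wav_gen.squeeze(0).squeeze(0).cpu()   # -> [T]
    wav = wav.clamp(-1.0, 1.0)                  # guard against overshoot
    return (wav * 32767.0).to(torch.int16)
```

The resulting tensor can then be written with, e.g., `scipy.io.wavfile.write(path, model.h.sampling_rate, to_int16_pcm(wav_gen).numpy())`.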

To enable the fast CUDA kernel:

model = bigvgan.BigVGAN.from_pretrained(
    'nvidia/bigvgan_v2_24khz_100band_256x',
    use_cuda_kernel=True
)

On first run, the kernel is compiled just-in-time using nvcc and ninja; a CUDA 12.1-compatible toolchain is recommended.

Pretrained Model Variants

BigVGAN v2 supports multiple configurations:

  • 44kHz 128-band 512x
  • 44kHz 128-band 256x
  • 24kHz 100-band 256x
  • 22kHz 80-band 256x

The 24kHz 100-band 256x version offers a strong balance between performance, quality, and computational efficiency, making it a popular choice for speech applications.
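The variants above follow a consistent naming scheme, so checkpoint selection can be reduced to a lookup. The repo IDs below follow NVIDIA's published naming on Hugging Face but should be verified before use:

```python
# Hugging Face repo IDs for the BigVGAN v2 variants listed above,
# keyed by (rate label, mel bands, upsampling ratio).
BIGVGAN_V2_VARIANTS = {
    ("44khz", 128, 512): "nvidia/bigvgan_v2_44khz_128band_512x",
    ("44khz", 128, 256): "nvidia/bigvgan_v2_44khz_128band_256x",
    ("24khz", 100, 256): "nvidia/bigvgan_v2_24khz_100band_256x",
    ("22khz",  80, 256): "nvidia/bigvgan_v2_22khz_80band_256x",
}

def pick_variant(rate_label: str, n_mels: int, ratio: int = 256) -> str:
    """Return the Hugging Face repo ID for a given configuration."""
    key = (rate_label, n_mels, ratio)
    if key not in BIGVGAN_V2_VARIANTS:
        raise ValueError(f"No BigVGAN v2 checkpoint for {key}")
    return BIGVGAN_V2_VARIANTS[key]
```

For example, `pick_variant("24khz", 100)` returns the repo ID used throughout this article.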

Use Cases for BigVGAN v2

1. Text-to-Speech Systems

BigVGAN integrates seamlessly with acoustic models that output mel spectrograms, making it ideal for TTS pipelines.

2. Voice Cloning and Multi-Speaker Synthesis

Its large-scale training improves generalization across speakers and languages.

3. Audio Language Models

When paired with tokenizers and semantic models, BigVGAN can serve as the final waveform decoder in speech foundation models.

4. Environmental and Instrumental Audio Synthesis

Because it was trained on diverse datasets, BigVGAN is not limited to speech.

5. Real-Time Applications

The optimized CUDA kernel enables near real-time inference for:

  • Conversational AI
  • Interactive demos
  • Streaming audio generation

Why BigVGAN v2 Stands Out

Compared to earlier neural vocoders such as HiFi-GAN or WaveGlow, BigVGAN v2 provides:

  • Larger training scale
  • Improved adversarial training strategies
  • Better high-frequency modeling
  • Faster GPU inference
  • Commercial-friendly MIT license

Its universal design allows it to work across different domains, rather than being specialized only for speech datasets.

Conclusion

The nvidia/bigvgan_v2_24khz_100band_256x model represents one of the most advanced neural vocoders currently available. Developed by NVIDIA and trained at scale, BigVGAN v2 delivers high-fidelity 24kHz waveform generation with optimized inference performance.

With its improved discriminator design, multi-scale loss functions, fused CUDA kernel acceleration, and MIT license, BigVGAN v2 is suitable for both academic research and commercial deployment. Whether building a production-grade text-to-speech system, experimenting with audio language models, or deploying real-time voice synthesis applications, BigVGAN v2 offers a robust and scalable solution.

As speech AI continues to evolve, high-quality neural vocoders like BigVGAN will remain essential components of modern audio generation pipelines.
