EveryVoice TTS Toolkit E2E Configuration
Version: 0.1
Schema URL
Properties
EveryVoice requires a contact name and email to help prevent misuse. Please read our Guide https://docs.everyvoice.ca/latest/ to understand more about the importance of misuse prevention with TTS.
The number of samples to include in each batch when training. If you are running out of memory, consider lowering your batch_size.
The number of checkpoints to save.
The interval (in steps) for saving a checkpoint. By default, checkpoints are saved every epoch via the 'ckpt_epochs' hyperparameter.
The interval (in epochs) for saving a checkpoint. You can also save checkpoints every n steps via 'ckpt_steps'.
How often to check the validation set. Pass a float in the range [0.0, 1.0] to check after a fraction of the training epoch. Pass an int to check after a fixed number of training batches.
Run validation after every n epochs. Defaults to 1; if you have a small dataset, consider increasing this to speed up training.
Stop training after this many epochs.
Stop training after this many steps.
Automatically resume training from a checkpoint loaded from this path.
The path to a filelist containing samples belonging to your training set.
The path to a filelist containing samples belonging to your validation set.
Advanced. The function to use to load the filelist.
The number of CPU workers to use when loading data during validation.
The number of CPU workers to use when loading data during training.
Definitions
The initial learning rate to use
Advanced. The value of optimizer constant Epsilon, used for numerical stability.
Advanced. The values of the Adam Optimizer beta coefficients.
[
0.9,
0.98
]
The name of the optimizer to use.
The initial learning rate to use
Advanced. The value of optimizer constant Epsilon, used for numerical stability.
Advanced. The values of the AdamW Optimizer beta coefficients.
[
0.9,
0.98
]
The name of the optimizer to use.
The path of a preprocessing configuration file.
The path of a preprocessing configuration file.
The preprocessing configuration, including information about audio settings.
The path of a preprocessing configuration file.
The symbol(s) used to indicate silence.
[
"<SIL>"
]
EveryVoice will combine punctuation and normalize it into a set of five permissible types of punctuation to help keep training tractable.
{}
The minimum length of an audio sample in seconds. Audio shorter than this will be ignored during preprocessing.
The maximum length of an audio sample in seconds. Audio longer than this will be ignored during preprocessing. Increasing the max_audio_length will result in larger memory usage. If you are running out of memory, consider lowering the max_audio_length.
Advanced. The maximum value allowed to be in your wav files. For 16-bit audio, this should be (2**16)/2 - 1.
The sampling rate describes the number of samples per second of audio. The 'input_sampling_rate' is with respect to your vocoder, or spec-to-wav model. This means that the spectrograms predicted by your text-to-spec model will also be calculated from audio at this sampling rate. If you change this value, your audio will automatically be re-sampled during preprocessing.
Advanced. The sampling rate describes the number of samples per second of audio. The 'output_sampling_rate' is with respect to your vocoder, or spec-to-wav model. This means that the wav files generated by your vocoder or spec-to-wav model will be at this sampling rate. If you change this value, you will also need to change the upsample rates in your vocoder. Your audio will automatically be re-sampled during preprocessing.
Advanced. The sampling rate describes the number of samples per second of audio. The 'alignment_sampling_rate' describes the sampling rate used when training an alignment model. If you change this value, your audio will automatically be re-sampled during preprocessing.
Advanced. This is the bit depth of each sample in your audio files.
Advanced. This is the number of bins used by the Fast Fourier Transform (FFT).
Advanced. This is the window size used by the Fast Fourier Transform (FFT).
Advanced. This is the hop size for calculating the Short-Time Fourier Transform (STFT) which calculates a sequence of spectrograms from a single audio file. Another way of putting it is that the hop size is equal to the amount of non-intersecting samples from the audio in each spectrogram.
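The hop size determines how many spectrogram frames a given stretch of audio produces. A worked example of that relationship, assuming torch.stft-style framing with center padding; the 22050 Hz sampling rate and 256-sample hop are illustrative values, not necessarily this configuration's defaults:

```python
def n_stft_frames(n_samples: int, hop_size: int) -> int:
    # torch.stft-style framing with center padding yields one frame
    # per hop, plus one: every hop_size samples of audio produces one
    # spectrogram column.
    return n_samples // hop_size + 1

# One second of audio at 22050 Hz with a 256-sample hop:
print(n_stft_frames(22050, 256))  # 87 frames
```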
Advanced. This is the minimum frequency for the lowest frequency bin when calculating the spectrogram.
Advanced. This is the maximum frequency for the highest frequency bin when calculating the spectrogram.
Advanced. This is the number of filters in the Mel-scale spaced filterbank.
Advanced. Defines how to calculate the spectrogram. 'mel' uses the TorchAudio implementation for a Mel spectrogram. 'mel-librosa' uses Librosa's implementation. 'linear' calculates a non-Mel linear spectrogram and 'raw' calculates a complex-valued spectrogram. 'linear' and 'raw' are not currently supported by EveryVoice. We recommend using 'mel-librosa'.
Advanced. The vocoder, or spec-to-wav model is trained by sampling random fixed-size sections of the audio. This value specifies the number of samples in those sections.
The number of layers in the Conformer.
The number of heads in the multi-headed attention modules.
The number of hidden dimensions in the input. The input_dim value declared in the encoder and decoder modules must match the input_dim value declared in each variance predictor module.
The number of dimensions in the feedforward layers.
The size of the kernel in each convolutional layer of the Conformer.
The amount of dropout to apply.
The name of the contact person or organization responsible for answering questions related to this model.
The email address of the contact person or organization responsible for answering questions related to this model.
EveryVoice requires a contact name and email to help prevent misuse. Please read our Guide https://docs.everyvoice.ca/latest/ to understand more about the importance of misuse prevention with TTS.
The path of a preprocessing configuration file.
The path of a preprocessing configuration file.
The preprocessing configuration, including information about audio settings.
The path of a preprocessing configuration file.
The symbol(s) used to indicate silence.
[
"<SIL>"
]
EveryVoice will combine punctuation and normalize it into a set of five permissible types of punctuation to help keep training tractable.
{}
The number of dimensions in the LSTM layers.
The number of dimensions in the convolutional layers.
The number of samples to include in each batch when training. If you are running out of memory, consider lowering your batch_size.
The number of checkpoints to save.
The interval (in steps) for saving a checkpoint. By default, checkpoints are saved every epoch via the 'ckpt_epochs' hyperparameter.
The interval (in epochs) for saving a checkpoint. You can also save checkpoints every n steps via 'ckpt_steps'.
How often to check the validation set. Pass a float in the range [0.0, 1.0] to check after a fraction of the training epoch. Pass an int to check after a fixed number of training batches.
Run validation after every n epochs. Defaults to 1; if you have a small dataset, consider increasing this to speed up training.
Stop training after this many epochs.
Stop training after this many steps.
Automatically resume training from a checkpoint loaded from this path.
The path to a filelist containing samples belonging to your training set.
The path to a filelist containing samples belonging to your validation set.
Advanced. The function to use to load the filelist.
The number of CPU workers to use when loading data during validation.
The number of CPU workers to use when loading data during training.
Optimizer configuration settings.
Whether to use a binned length sampler.
The maximum number of steps to plot
The alignment extraction algorithm to use. 'beam' will be quicker but possibly less accurate than 'dijkstra'.
A label for the source of data
An attestation that permission has been obtained to use this data. You may not use EveryVoice to build a TTS system with data that you do not have permission to use, and there can be serious consequences for doing so. Finding data online does not constitute permission. The speaker should be aware of and consent to their data being used in this way.
The path to the directory with your audio files.
The path to your dataset's filelist.
Advanced. The file-loader function to use to load your dataset's filelist.
Advanced. A list of SoX effects to apply to your audio prior to preprocessing. Run python -c 'import torchaudio; print(torchaudio.sox_effects.effect_names())' to see a list of supported effects.
[
[
"channels",
"1"
]
]
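The default effect list above downmixes the audio to a single channel. A sketch of sanity-checking that nested-list shape before preprocessing; the shape (a list of [effect_name, *string_args] lists) is what torchaudio's SoX bindings expect, but this validator itself is illustrative, not part of EveryVoice:

```python
def validate_sox_effects(effects):
    # Each effect must be a non-empty list of strings: the effect name
    # followed by its arguments, all given as strings.
    for effect in effects:
        if not effect or not all(isinstance(arg, str) for arg in effect):
            raise ValueError(f"Malformed SoX effect: {effect!r}")
    return effects

# The default from this config: downmix to a single channel.
validate_sox_effects([["channels", "1"]])
```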
The number of samples to include in each batch when training. If you are running out of memory, consider lowering your batch_size.
The number of checkpoints to save.
The interval (in steps) for saving a checkpoint. By default, checkpoints are saved every epoch via the 'ckpt_epochs' hyperparameter.
The interval (in epochs) for saving a checkpoint. You can also save checkpoints every n steps via 'ckpt_steps'.
How often to check the validation set. Pass a float in the range [0.0, 1.0] to check after a fraction of the training epoch. Pass an int to check after a fixed number of training batches.
Run validation after every n epochs. Defaults to 1; if you have a small dataset, consider increasing this to speed up training.
Stop training after this many epochs.
Stop training after this many steps.
Automatically resume training from a checkpoint loaded from this path.
The path to a filelist containing samples belonging to your training set.
The path to a filelist containing samples belonging to your validation set.
Advanced. The function to use to load the filelist.
The number of CPU workers to use when loading data during validation.
The number of CPU workers to use when loading data during training.
EveryVoice requires a contact name and email to help prevent misuse. Please read our Guide https://docs.everyvoice.ca/latest/ to understand more about the importance of misuse prevention with TTS.
The path of a model configuration file.
The path of a training configuration file.
The preprocessing configuration, including information about audio settings.
The path of a preprocessing configuration file.
The path of a text configuration file.
Configuration for energy, duration, and pitch variance predictors.
Whether to jointly learn alignments using a monotonic alignment search module (see Badlani et al. 2021: https://arxiv.org/abs/2108.10447). If set to False, you will have to provide text/audio alignments separately before training a text-to-spec (feature prediction) model.
The maximum length (i.e. number of symbols) for text inputs.
Whether to use a postnet module.
Whether to train a multilingual model. For this to work, your filelist must contain a column/field for 'language' with values for each utterance.
Whether to train a multispeaker model. For this to work, your filelist must contain a column/field for 'speaker' with values for each utterance.
The number of samples to include in each batch when training. If you are running out of memory, consider lowering your batch_size.
The number of checkpoints to save.
The interval (in steps) for saving a checkpoint. By default, checkpoints are saved every epoch via the 'ckpt_epochs' hyperparameter.
The interval (in epochs) for saving a checkpoint. You can also save checkpoints every n steps via 'ckpt_steps'.
How often to check the validation set. Pass a float in the range [0.0, 1.0] to check after a fraction of the training epoch. Pass an int to check after a fixed number of training batches.
Run validation after every n epochs. Defaults to 1; if you have a small dataset, consider increasing this to speed up training.
Stop training after this many epochs.
Stop training after this many steps.
Automatically resume training from a checkpoint loaded from this path.
The path to a filelist containing samples belonging to your training set.
The path to a filelist containing samples belonging to your validation set.
Advanced. The function to use to load the filelist.
The number of CPU workers to use when loading data during validation.
The number of CPU workers to use when loading data during training.
Whether to use a sampler which oversamples from the minority language or speaker class for balanced training.
The optimizer to use during training.
{
"learning_rate": 0.001,
"eps": 1e-8,
"weight_decay": 1e-6,
"betas": [
0.9,
0.999
],
"name": "noam",
"warmup_steps": 1000
}
Multiply the spec loss by this weight
Multiply the postnet loss by this weight
Multiply the pitch loss by this weight
Multiply the energy loss by this weight
Multiply the duration loss by this weight
Multiply the Attention CTC loss by this weight
Multiply the Attention Binarization loss by this weight
Scale the Attention Binarization loss by (current_epoch / attn_bin_loss_warmup_epochs) until the number of epochs defined by attn_bin_loss_warmup_epochs is reached.
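The warmup scaling described above can be sketched as a simple ramp; a minimal illustration, not EveryVoice's actual training loop:

```python
def attn_bin_loss_scale(current_epoch: int, warmup_epochs: int) -> float:
    # Ramp the binarization loss weight linearly from 0 to 1 over
    # warmup_epochs, then hold it at 1.
    return min(1.0, current_epoch / warmup_epochs)

print(attn_bin_loss_scale(50, 100))   # 0.5
print(attn_bin_loss_scale(200, 100))  # 1.0
```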
The path of a model configuration file.
The path of a training configuration file.
The preprocessing configuration, including information about audio settings.
The path of a preprocessing configuration file.
The path of a text configuration file.
EveryVoice requires a contact name and email to help prevent misuse. Please read our Guide https://docs.everyvoice.ca/latest/ to understand more about the importance of misuse prevention with TTS.
The path of a model configuration file.
The path of a training configuration file.
The preprocessing configuration, including information about audio settings.
The path of a preprocessing configuration file.
Which resblock to use. See Kong et al. 2020: https://arxiv.org/abs/2010.05646
The stride of each convolutional layer in the upsampling module.
[
8,
8,
2,
2
]
The kernel size of each convolutional layer in the upsampling module.
[
16,
16,
4,
4
]
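A consistency check common to HiFi-GAN-style vocoders: each spectrogram frame must be upsampled to exactly hop_size * (output_sampling_rate / input_sampling_rate) waveform samples, so the upsampling strides must multiply out to that value. A sketch of the check; the 256-sample hop is an assumed value for illustration, not necessarily this configuration's default:

```python
import math

def check_upsample_rates(upsample_rates, hop_size, input_sr, output_sr):
    # Each spectrogram frame covers hop_size input-rate samples, so the
    # generator must expand one frame into this many output samples.
    expected = hop_size * output_sr // input_sr
    if math.prod(upsample_rates) != expected:
        raise ValueError(f"prod({upsample_rates}) != {expected}; adjust the strides")

# The defaults above: 8 * 8 * 2 * 2 = 256, matching a 256-sample hop
# when input and output sampling rates are equal.
check_upsample_rates([8, 8, 2, 2], hop_size=256, input_sr=22050, output_sr=22050)
```

This is why changing 'output_sampling_rate' also requires changing the vocoder's upsample rates, as noted earlier in this document.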
The number of dimensions to project the Mel inputs to before being passed to the resblock.
The kernel size of each convolutional layer in the resblock.
[
3,
7,
11
]
The dilations of each convolution in each layer of the resblock.
[
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
]
The activation function to use.
Whether to predict phase and magnitude values and use an inverse Short-Time Fourier Transform instead of predicting a waveform directly. See Kaneko et al. 2022: https://arxiv.org/abs/2203.02395
The number of layers to use in the Multi-Scale Discriminator.
The size of each layer in the Multi-Period Discriminator.
[
2,
3,
5,
7,
11
]
The number of samples to include in each batch when training. If you are running out of memory, consider lowering your batch_size.
The number of checkpoints to save.
The interval (in steps) for saving a checkpoint. By default, checkpoints are saved every epoch via the 'ckpt_epochs' hyperparameter.
The interval (in epochs) for saving a checkpoint. You can also save checkpoints every n steps via 'ckpt_steps'.
How often to check the validation set. Pass a float in the range [0.0, 1.0] to check after a fraction of the training epoch. Pass an int to check after a fixed number of training batches.
Run validation after every n epochs. Defaults to 1; if you have a small dataset, consider increasing this to speed up training.
Stop training after this many epochs.
Stop training after this many steps.
Automatically resume training from a checkpoint loaded from this path.
The path to a filelist containing samples belonging to your training set.
The path to a filelist containing samples belonging to your validation set.
Advanced. The function to use to load the filelist.
The number of CPU workers to use when loading data during validation.
The number of CPU workers to use when loading data during training.
The number of steps to run through before activating the discriminators.
The type of GAN to use. Can be set to either 'original' for a vanilla GAN, or 'wgan' for a Wasserstein GAN that clips gradients.
Configuration settings for the optimizer.
The gradient clip value when gan_type='wgan'.
Whether to use a sampler which oversamples from the minority language or speaker class for balanced training.
Whether to read spectrograms from 'preprocessed/synthesized_spec' instead of 'preprocessed/spec'. This is used when finetuning a pretrained spec-to-wav (vocoder) model using the outputs of a trained text-to-spec (feature prediction network) model.
The logger configures all the information needed for where to store your experiment's logs and checkpoints.
The name of the experiment.
The directory to save your checkpoints and logs to.
The function that generates a string to name your runs; by default this is a timestamp.
The version of your experiment.
The initial learning rate to use
Advanced. The value of optimizer constant Epsilon, used for numerical stability.
Advanced. The values of the Adam Optimizer beta coefficients.
[
0.9,
0.98
]
The name of the optimizer to use.
The number of steps to increase the learning rate before starting to decrease it.
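The optimizer name 'noam' in the defaults suggests the Noam schedule from Vaswani et al. 2017: a linear warmup for warmup_steps, followed by decay proportional to step**-0.5. A sketch of that schedule under that assumption; the model_dim of 512 is an assumed value, not taken from this configuration:

```python
def noam_lr(step: int, warmup_steps: int, model_dim: int = 512) -> float:
    # Linear warmup until warmup_steps, then inverse-square-root decay.
    step = max(step, 1)
    return model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The learning rate rises until warmup_steps, then decays.
assert noam_lr(500, 1000) < noam_lr(1000, 1000) > noam_lr(2000, 1000)
```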
The name of the dataset.
The amount of the dataset to use for training. The rest will be used as validation. Hold some of the validation set out for a test set if you are performing experiments.
The seed to use when splitting the dataset into train and validation sets.
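The split proportion and seed work together: the same seed always yields the same train/validation partition. An illustrative sketch of that idea (shuffle deterministically, then slice), not EveryVoice's actual splitting code:

```python
import random

def split_filelist(entries, train_split=0.9, seed=42):
    # Shuffle a copy deterministically with the configured seed, then
    # slice off the training proportion.
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    n_train = int(len(entries) * train_split)
    return entries[:n_train], entries[n_train:]

train, val = split_filelist([f"sample_{i}" for i in range(100)], train_split=0.9)
print(len(train), len(val))  # 90 10
```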
The directory to save preprocessed files to.
The path to an audio configuration file.
A list of datasets.
Exclamation punctuation symbols used in your datasets. These symbols are normalized to a single internal symbol.
[
"!",
"¡"
]
Question/interrogative punctuation symbols used in your datasets. These symbols are normalized to a single internal symbol.
[
"?",
"¿"
]
Quotation-mark punctuation symbols used in your datasets. These symbols are normalized to a single internal symbol.
[
"\"",
"'",
"“",
"”",
"«",
"»"
]
Punctuation symbols indicating a 'big break' used in your datasets. These symbols are normalized to a single internal symbol.
[
".",
":",
";"
]
Punctuation symbols indicating a 'small break' used in your datasets. These symbols are normalized to a single internal symbol.
[
",",
"-",
"—"
]
Punctuation symbols indicating an ellipsis used in your datasets. These symbols are normalized to a single internal symbol.
[
"…"
]
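The symbol groups above can be sketched as a lookup table. The replacement tokens below (<EXCL>, <QINT>, and so on) are hypothetical placeholders for illustration; the actual internal symbols are not shown in this document:

```python
# Placeholder category tokens mapped to the symbol groups from this
# configuration's defaults.
CATEGORIES = {
    "<EXCL>": ["!", "¡"],
    "<QINT>": ["?", "¿"],
    "<QUOTE>": ['"', "'", "“", "”", "«", "»"],
    "<BB>": [".", ":", ";"],
    "<SB>": [",", "-", "—"],
    "<EPS>": ["…"],
}
# Invert into a symbol -> token lookup table.
NORMALIZE = {sym: token for token, syms in CATEGORIES.items() for sym in syms}

def normalize_punct(text: str) -> str:
    return "".join(NORMALIZE.get(ch, ch) for ch in text)

print(normalize_punct("¿Hola?"))  # <QINT>Hola<QINT>
```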
The initial learning rate to use
Advanced. The value of optimizer constant Epsilon, used for numerical stability.
Advanced. The value of the RMSProp optimizer's alpha smoothing constant.
The name of the optimizer to use.
The symbol(s) used to indicate silence.
[
"<SIL>"
]
EveryVoice will combine punctuation and normalize it into a set of five permissible types of punctuation to help keep training tractable.
The symbol(s) used to indicate silence.
[
"<SIL>"
]
EveryVoice will combine punctuation and normalize it into a set of five permissible types of punctuation to help keep training tractable.
{}
The loss function to use when calculating the variance loss. Either 'mse' or 'mae'.
The number of layers in the variance predictor module.
The kernel size of each convolutional layer in the variance predictor module.
The amount of dropout to apply.
The number of hidden dimensions in the input. This must match the input_dim value declared in the encoder and decoder modules.
The number of bins to use in the variance predictor module.
Whether to use depthwise separable convolutions.
The loss function to use when calculating the variance loss. Either 'mse' or 'mae'.
The number of layers in the variance predictor module.
The kernel size of each convolutional layer in the variance predictor module.
The amount of dropout to apply.
The number of hidden dimensions in the input. This must match the input_dim value declared in the encoder and decoder modules.
The number of bins to use in the variance predictor module.
Whether to use depthwise separable convolutions.
The level for the variance predictor to use. 'frame' will make predictions at the frame level. 'phone' will average predictions across all frames in each phone.
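The 'phone'-level averaging described above can be sketched with per-phone durations given in frames; a minimal illustration, not EveryVoice's implementation:

```python
def average_by_phone(frame_values, durations):
    # Average the frame-level predictions over each phone's span of
    # frames, where durations[i] is the number of frames in phone i.
    out, i = [], 0
    for d in durations:
        out.append(sum(frame_values[i:i + d]) / d)
        i += d
    return out

# Three phones spanning 2, 1, and 3 frames:
print(average_by_phone([1.0, 3.0, 5.0, 2.0, 2.0, 2.0], [2, 1, 3]))
# [2.0, 5.0, 2.0]
```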
The path of a model configuration file.
The path of a training configuration file.
The preprocessing configuration, including information about audio settings.
The path of a preprocessing configuration file.