EveryVoice TTS Toolkit Feature Prediction Configuration (SchemaStore) JSON Schema

Type	object
File match	`everyvoice-text-to-spec.yaml` `everyvoice-text-to-spec.json`
Schema URL	https://catalog.lintel.tools/schemas/schemastore/everyvoice-tts-toolkit-feature-prediction-configuration/latest.json
Source	https://raw.githubusercontent.com/EveryVoiceTTS/everyvoice/main/everyvoice/.schema/everyvoice-text-to-spec-0.3.json

Versions

0.1 0.2 0.3

Validate with Lintel

npx @lintel/lintel check

Type: object

Properties

contact required

EveryVoice requires a contact name and email to help prevent misuse. Please read our Guide https://docs.everyvoice.ca/latest/ to understand more about the importance of misuse prevention with TTS.

All of: ContactInformation object

VERSION string

Default: "1.1"

model

The model configuration settings.

All of: FastSpeech2ModelConfig object

path_to_model_config_file string | null

The path of a model configuration file.

Default: null

training

The training configuration hyperparameters.

All of: FastSpeech2TrainingConfig object

path_to_training_config_file string | null

The path of a training configuration file.

Default: null

preprocessing

The preprocessing configuration, including information about audio settings.

All of: PreprocessingConfig object

path_to_preprocessing_config_file string | null

The path of a preprocessing configuration file.

Default: null

text

The text configuration.

All of: TextConfig object

path_to_text_config_file string | null

The path of a text configuration file.

Default: null
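
As a rough illustration of the top-level shape described above, the sketch below builds a minimal configuration as a Python dictionary and serializes it to JSON. Only `contact` is marked required at the top level. The contact values and the helper name `missing_required` are made up for illustration, and the hand-rolled check is not a substitute for validating against the schema itself.

```python
import json

# Hypothetical minimal configuration; only "contact" is required at the top level.
config = {
    "contact": {
        "contact_name": "Example Person",       # placeholder value
        "contact_email": "person@example.com",  # placeholder value
    },
    "VERSION": "1.1",                   # schema default
    "path_to_model_config_file": None,  # schema default (null)
}


def missing_required(cfg):
    """Return the required keys that are absent (not a full schema validation)."""
    missing = [key for key in ("contact",) if key not in cfg]
    contact = cfg.get("contact", {})
    missing += [
        f"contact.{key}"
        for key in ("contact_name", "contact_email")
        if key not in contact
    ]
    return missing


print(missing_required(config))
print(json.dumps(config, indent=2))
```

The printed JSON could be saved under a name matching the file patterns listed at the top of this page, such as `everyvoice-text-to-spec.json`.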

Definitions

AudioConfig object

min_audio_length number

The minimum length of an audio sample in seconds. Audio shorter than this will be ignored during preprocessing.

Default: 0.4

max_audio_length number

The maximum length of an audio sample in seconds. Audio longer than this will be ignored during preprocessing. Increasing the max_audio_length will result in larger memory usage. If you are running out of memory, consider lowering the max_audio_length.

Default: 11.0
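
The two length limits can be pictured as a simple duration filter. The sketch below is only an illustration: the function name `keep_sample` is hypothetical, and treating both bounds as inclusive is an assumption, not something the schema states.

```python
def keep_sample(num_samples, sampling_rate=22050,
                min_audio_length=0.4, max_audio_length=11.0):
    """Return True if a clip's duration falls within the configured limits.

    Inclusive bounds are an assumption made for this sketch.
    """
    duration = num_samples / sampling_rate  # duration in seconds
    return min_audio_length <= duration <= max_audio_length


print(keep_sample(22050))       # a 1-second clip
print(keep_sample(4410))        # a 0.2-second clip
print(keep_sample(22050 * 12))  # a 12-second clip
```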

max_wav_value number

Advanced. The maximum value allowed to be in your wav files. For 16-bit audio, this should be (2**16)/2 - 1.

Default: 32767.0
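
The 16-bit formula above generalizes to other bit depths. A minimal sketch, assuming signed integer PCM audio (the function name is hypothetical):

```python
def max_wav_value(bit_depth):
    """Largest positive sample value for signed integer PCM at this bit depth."""
    return float((2 ** bit_depth) // 2 - 1)


print(max_wav_value(16))  # matches the schema default of 32767.0
```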

input_sampling_rate integer

The sampling rate describes the number of samples per second of audio. The 'input_sampling_rate' is with respect to your vocoder, or spec-to-wav model. This means that the spectrograms predicted by your text-to-spec model will also be calculated from audio at this sampling rate. If you change this value, your audio will automatically be re-sampled during preprocessing.

Default: 22050

output_sampling_rate integer

Advanced. The sampling rate describes the number of samples per second of audio. The 'output_sampling_rate' is with respect to your vocoder, or spec-to-wav model. This means that the wav files generated by your vocoder or spec-to-wav model will be at this sampling rate. If you change this value, you will also need to change the upsample rates in your vocoder. Your audio will automatically be re-sampled during preprocessing.

Default: 22050

alignment_sampling_rate integer

Advanced. The sampling rate describes the number of samples per second of audio. The 'alignment_sampling_rate' describes the sampling rate used when training an alignment model. If you change this value, your audio will automatically be re-sampled during preprocessing.

Default: 22050

target_bit_depth integer

Advanced. This is the bit depth of each sample in your audio files.

Default: 16

n_fft integer

Advanced. This is the number of bins used by the Fast Fourier Transform (FFT).

Default: 1024

fft_window_size integer

Advanced. This is the window size used by the Fast Fourier Transform (FFT).

Default: 1024

fft_hop_size integer

Advanced. This is the hop size for calculating the Short-Time Fourier Transform (STFT), which calculates a sequence of spectrograms from a single audio file. Put another way, the hop size is the number of non-overlapping samples from the audio in each spectrogram.

Default: 256
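
To make the relationship between hop size, sampling rate, and spectrogram length concrete, the sketch below uses the usual STFT frame-count formulas. Whether EveryVoice pads ("centers") the signal before the transform is an assumption here, so both variants are shown; the function names are hypothetical.

```python
def frames_per_second(sampling_rate=22050, fft_hop_size=256):
    """Spectrogram frames produced per second of audio."""
    return sampling_rate / fft_hop_size


def num_frames(num_samples, fft_hop_size=256, n_fft=1024, center=True):
    """Frame count of an STFT over num_samples samples.

    center=True assumes the signal is padded by n_fft // 2 on each side,
    as many STFT implementations do by default.
    """
    if center:
        return 1 + num_samples // fft_hop_size
    if num_samples < n_fft:
        return 0
    return 1 + (num_samples - n_fft) // fft_hop_size


# An 11-second clip (the default max_audio_length) at 22050 Hz:
samples = int(11.0 * 22050)
print(frames_per_second())
print(num_frames(samples), num_frames(samples, center=False))
```

With the default window size of 1024 and hop size of 256, consecutive windows overlap by 768 samples.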

f_min integer

Advanced. This is the minimum frequency for the lowest frequency bin when calculating the spectrogram.

Default: 0

f_max integer

Advanced. This is the maximum frequency for the highest frequency bin when calculating the spectrogram.

Default: 8000
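
A quick sanity check for `f_min` and `f_max` is to compare them against the Nyquist frequency, which is half the sampling rate. The sketch below is an illustrative check only; the function name is hypothetical and EveryVoice may or may not enforce these constraints itself.

```python
def check_frequency_range(f_min, f_max, sampling_rate):
    """Return a list of human-readable problems with the frequency range."""
    nyquist = sampling_rate / 2  # highest frequency representable at this rate
    problems = []
    if f_min < 0:
        problems.append("f_min must not be negative")
    if f_max <= f_min:
        problems.append("f_max must be greater than f_min")
    if f_max > nyquist:
        problems.append(f"f_max exceeds the Nyquist frequency ({nyquist} Hz)")
    return problems


print(check_frequency_range(0, 8000, 22050))   # the schema defaults
print(check_frequency_range(0, 12000, 22050))  # too high for 22050 Hz audio
```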

n_mels integer

Advanced. This is the number of filters in the Mel-scale spaced filterbank.

Default: 80

spec_type AudioSpecTypeEnum | string

Advanced. Defines how to calculate the spectrogram. 'mel' uses the TorchAudio implementation for a Mel spectrogram. 'mel-librosa' uses Librosa's implementation. 'linear' calculates a non-Mel linear spectrogram and 'raw' calculates a complex-valued spectrogram. 'linear' and 'raw' are not currently supported by EveryVoice. We recommend using 'mel-librosa'.

Default: "mel-librosa"

vocoder_segment_size integer

Advanced. The vocoder, or spec-to-wav model is trained by sampling random fixed-size sections of the audio. This value specifies the number of samples in those sections.

Default: 8192
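
The segment size is expressed in audio samples, so it can be converted to spectrogram frames and seconds using the hop size and sampling rate. A small sketch with the schema defaults (the function name is hypothetical):

```python
def segment_info(vocoder_segment_size=8192, fft_hop_size=256, sampling_rate=22050):
    """Express a vocoder segment in frames and seconds."""
    frames, remainder = divmod(vocoder_segment_size, fft_hop_size)
    return {
        "frames": frames,        # whole spectrogram frames per segment
        "remainder": remainder,  # non-zero means the segment is not a whole number of hops
        "seconds": vocoder_segment_size / sampling_rate,
    }


print(segment_info())
```

Keeping the segment size a multiple of the hop size means each segment lines up with a whole number of frames; treat that as a rule of thumb rather than a documented requirement.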

AudioSpecTypeEnum string

ConformerConfig object

layers integer

The number of layers in the Conformer.

Default: 4

heads integer

The number of heads in the multi-headed attention modules.

Default: 2

input_dim integer

The number of hidden dimensions in the input. The input_dim value declared in the encoder and decoder modules must match the input_dim value declared in each variance predictor module.

Default: 256

feedforward_dim integer

The number of dimensions in the feedforward layers.

Default: 1024

conv_kernel_size integer

The size of the kernel in each convolutional layer of the Conformer.

Default: 9

dropout number

The amount of dropout to apply.

Default: 0.2
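
The `input_dim` constraint above (encoder and decoder values must match each variance predictor's value) can be checked before training. The sketch below works on a plain dictionary shaped like the `model` section; the helper name is hypothetical and this is not part of EveryVoice.

```python
def check_input_dims(model):
    """Return the modules whose input_dim differs from the encoder's."""
    expected = model["encoder"]["input_dim"]
    mismatched = []
    if model["decoder"]["input_dim"] != expected:
        mismatched.append("decoder")
    for name, predictor in model["variance_predictors"].items():
        if predictor["input_dim"] != expected:
            mismatched.append(f"variance_predictors.{name}")
    return mismatched


model = {
    "encoder": {"input_dim": 256},
    "decoder": {"input_dim": 256},
    "variance_predictors": {
        "energy": {"input_dim": 256},
        "duration": {"input_dim": 256},
        "pitch": {"input_dim": 128},  # deliberately inconsistent
    },
}
print(check_input_dims(model))
```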

ContactInformation object

contact_name string required

The name of the contact person or organization responsible for answering questions related to this model.

contact_email string required

The email address of the contact person or organization responsible for answering questions related to this model.

format=email

Dataset object

label string

A label for the source of data

Default: "YourDataSet"

permissions_obtained boolean

An attestation that permission has been obtained to use this data. You may not use EveryVoice to build a TTS system with data that you do not have permission to use and there are serious possible consequences for doing so. Finding data online does not constitute permission. The speaker should be aware and consent to their data being used in this way.

Default: false

data_dir string

The path to the directory with your audio files.

Default: "/please/create/a/path/to/your/dataset/data"

format=path

filelist string

The path to your dataset's filelist.

Default: "/please/create/a/path/to/your/dataset/filelist"

format=path

filelist_loader string

Advanced. The file-loader function to use to load your dataset's filelist.

sox_effects array

Advanced. A list of SoX effects to apply to your audio prior to preprocessing. Run python -c 'import torchaudio; print(torchaudio.sox_effects.effect_names())' to see a list of supported effects.

Default:

[
  [
    "channels",
    "1"
  ]
]
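
As the default shows, `sox_effects` is a list of effects, and each effect is itself a list of strings: the effect name followed by its arguments. The sketch below checks that shape only; it does not verify that an effect name is actually supported, and the helper name is hypothetical.

```python
def check_sox_effects(effects):
    """Return True if effects is a list of non-empty lists of strings."""
    if not isinstance(effects, list):
        return False
    for effect in effects:
        if not isinstance(effect, list) or not effect:
            return False
        if not all(isinstance(part, str) for part in effect):
            return False
    return True


print(check_sox_effects([["channels", "1"]]))  # the schema default
print(check_sox_effects([["channels", 1]]))    # arguments must be strings
```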

FastSpeech2ModelConfig object

encoder

The configuration of the encoder module.

All of: ConformerConfig object

decoder

The configuration of the decoder module.

All of: ConformerConfig object

variance_predictors

Configuration for energy, duration, and pitch variance predictors.

All of: VariancePredictors object

target_text_representation_level

Default: "characters"

All of: TargetTrainingTextRepresentationLevel string

learn_alignment boolean

Whether to jointly learn alignments using a monotonic alignment search module (see Badlani et al. 2021: https://arxiv.org/abs/2108.10447). If set to False, you will have to provide text/audio alignments separately before training a text-to-spec (feature prediction) model.

Default: true

use_global_style_token_module boolean

Whether to use the Global Style Token (GST) module from "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis" (https://arxiv.org/abs/1803.09017).

Default: false

max_length integer

The maximum length (i.e. number of symbols) for text inputs.

Default: 1000

mel_loss

The loss function to use when calculating Mel spectrogram loss.

Default: "mse"

All of: VarianceLossEnum string

use_postnet boolean

Whether to use a postnet module.

Default: true

multilingual boolean

Whether to train a multilingual model. For this to work, your filelist must contain a column/field for 'language' with values for each utterance.

Default: false

multispeaker boolean

Whether to train a multispeaker model. For this to work, your filelist must contain a column/field for 'speaker' with values for each utterance.

Default: false
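
Because `multilingual` and `multispeaker` both depend on columns being present in the filelist, a quick header check can catch a mismatch early. The sketch below assumes a pipe-separated filelist with a header row (suggested by the `.psv` extension in the default filelist paths); the other column names and the helper name are hypothetical.

```python
import csv
import io


def missing_columns(filelist_text, multilingual=False, multispeaker=False):
    """Return the columns required by the enabled options but absent from the header."""
    reader = csv.DictReader(io.StringIO(filelist_text), delimiter="|")
    fields = set(reader.fieldnames or [])
    needed = set()
    if multilingual:
        needed.add("language")
    if multispeaker:
        needed.add("speaker")
    return sorted(needed - fields)


# Hypothetical filelist contents: a header row followed by one utterance.
text = "basename|speaker|characters\nutt1|spk1|hello\n"
print(missing_columns(text, multilingual=True, multispeaker=True))
```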

FastSpeech2TrainingConfig object

batch_size integer

The number of samples to include in each batch when training. If you are running out of memory, consider lowering your batch_size.

Default: 16

save_top_k_ckpts integer

The number of checkpoints to save.

Default: 5

ckpt_steps integer | null

The interval (in steps) for saving a checkpoint. By default, checkpoints are saved every epoch using the 'ckpt_epochs' hyperparameter.

Default: null

ckpt_epochs integer | null

The interval (in epochs) for saving a checkpoint. You can also save checkpoints after n steps by using 'ckpt_steps'

Default: 1

val_check_interval integer | number | null

How often to check the validation set. Pass a float in the range [0.0, 1.0] to check after a fraction of the training epoch. Pass an int to check after a fixed number of training batches.

Default: 500
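
The int-versus-float behaviour described above can be sketched as follows. This illustrates the rule as stated in the description, not the training framework's actual code; the function name and the rounding choice are assumptions.

```python
def validation_interval_in_batches(val_check_interval, batches_per_epoch):
    """Translate val_check_interval into a number of training batches."""
    if val_check_interval is None:
        return None  # behaviour for null is not described in the schema
    if isinstance(val_check_interval, bool):
        raise TypeError("val_check_interval must be an int or a float")
    if isinstance(val_check_interval, int):
        return val_check_interval  # fixed number of training batches
    if isinstance(val_check_interval, float):
        if not 0.0 <= val_check_interval <= 1.0:
            raise ValueError("a float val_check_interval must be in [0.0, 1.0]")
        # Fraction of an epoch; rounding down is an assumption.
        return max(1, int(batches_per_epoch * val_check_interval))
    raise TypeError("val_check_interval must be an int or a float")


print(validation_interval_in_batches(500, 2000))
print(validation_interval_in_batches(0.25, 2000))
```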

check_val_every_n_epoch integer | null

Run validation after every n epochs. Defaults to 1, but if you have a small dataset, you should increase this value to speed up training.

Default: null

max_epochs integer

Stop training after this many epochs

Default: 1000

max_steps integer

Stop training after this many steps

Default: 100000

finetune_checkpoint string | null

Automatically resume training from a checkpoint loaded from this path.

Default: null

training_filelist string

The path to a filelist containing samples belonging to your training set.

Default: "path/to/your/preprocessed/training_filelist.psv"

format=path

validation_filelist string

The path to a filelist containing samples belonging to your validation set.

Default: "path/to/your/preprocessed/validation_filelist.psv"

format=path

filelist_loader string

Advanced. The function to use to load the filelist.

logger

The configuration for the logger.

All of: LoggerConfig object

val_data_workers integer

The number of CPU workers to use when loading data during validation.

Default: 0

train_data_workers integer

The number of CPU workers to use when loading data during training.

Default: 4

use_weighted_sampler boolean

Whether to use a sampler which oversamples from the minority language or speaker class for balanced training.

Default: false
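
One common way to oversample a minority class is to weight each sample by the inverse of its class frequency, so that every class contributes equally in expectation. The sketch below shows that idea only; it is an assumption about the general technique and not necessarily how EveryVoice computes its weights.

```python
from collections import Counter


def sample_weights(labels):
    """Give each sample a weight inversely proportional to its class size."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Three utterances from speaker "a" and one from speaker "b" (made-up labels).
labels = ["a", "a", "a", "b"]
print(sample_weights(labels))
```

Weights of this kind could be handed to a weighted random sampler so that the single "b" utterance is drawn about as often as all "a" utterances combined.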

optimizer

The optimizer to use during training.

Default:

{
  "learning_rate": 0.001,
  "eps": 1e-8,
  "weight_decay": 1e-6,
  "betas": [
    0.9,
    0.999
  ],
  "name": "noam",
  "warmup_steps": 1000
}

All of: NoamOptimizer object

vocoder_path string | null

Default: null

mel_loss_weight number

Multiply the spec loss by this weight

Default: 1.0

postnet_loss_weight number

Multiply the postnet loss by this weight

Default: 1.0

pitch_loss_weight number

Multiply the pitch loss by this weight

Default: 0.1

energy_loss_weight number

Multiply the energy loss by this weight

Default: 0.1

duration_loss_weight number

Multiply the duration loss by this weight

Default: 0.1

attn_ctc_loss_weight number

Multiply the Attention CTC loss by this weight

Default: 0.1

attn_bin_loss_weight number

Multiply the Attention Binarization loss by this weight

Default: 0.1

attn_bin_loss_warmup_epochs integer

Scale the Attention Binarization loss by (current_epoch / attn_bin_loss_warmup_epochs) until the number of epochs defined by attn_bin_loss_warmup_epochs is reached.

Default: 100

min=1
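
The warmup rule for the Attention Binarization loss can be written out directly. The sketch below follows the description above (scale by `current_epoch / attn_bin_loss_warmup_epochs` until the warmup is reached); holding the scale at 1.0 afterwards and multiplying it with `attn_bin_loss_weight` are assumptions, and the function names are hypothetical.

```python
def attn_bin_scale(current_epoch, attn_bin_loss_warmup_epochs=100):
    """Warmup factor in [0, 1] for the Attention Binarization loss."""
    if attn_bin_loss_warmup_epochs < 1:
        raise ValueError("attn_bin_loss_warmup_epochs must be at least 1")
    return min(current_epoch / attn_bin_loss_warmup_epochs, 1.0)


def weighted_attn_bin_loss(raw_loss, current_epoch,
                           attn_bin_loss_weight=0.1,
                           attn_bin_loss_warmup_epochs=100):
    """Apply both the configured weight and the warmup factor to a raw loss value."""
    scale = attn_bin_scale(current_epoch, attn_bin_loss_warmup_epochs)
    return raw_loss * attn_bin_loss_weight * scale


for epoch in (0, 50, 100, 250):
    print(epoch, attn_bin_scale(epoch))
```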

LoggerConfig object

The logger configures all the information needed for where to store your experiment's logs and checkpoints. The structure of your logs will then be: <name> / <version> / <sub_dir>. <sub_dir> will be generated by calling <sub_dir_callable> each time the LoggerConfig is constructed.

name string

The name of the experiment. The structure of your logs will be <name> / <version> / <sub_dir>.

Default: "BaseExperiment"

save_dir string

The directory to save your checkpoints and logs to.

Default: "logs_and_checkpoints"

format=path

sub_dir_callable string

The function that generates the name of each run's sub-directory; by default this is a timestamp. The structure of your logs will be <name> / <version> / <sub_dir>, where <sub_dir> is a timestamp.

version string

The version of your experiment. The structure of your logs will be <name> / <version> / <sub_dir>.

Default: "base"

NoamOptimizer object

learning_rate number

The initial learning rate to use

Default: 0.0001

eps number

Advanced. The value of optimizer constant Epsilon, used for numerical stability.

Default: 1e-8

weight_decay number

Default: 0.01

betas array

Advanced. The values of the Adam Optimizer beta coefficients.

Default:

[
  0.9,
  0.98
]

minItems=2, maxItems=2

name string

The name of the optimizer to use.

Default: "noam"

warmup_steps integer

The number of steps to increase the learning rate before starting to decrease it.

Default: 1000
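
The name "noam" conventionally refers to the warmup schedule from the original Transformer paper: the learning rate rises linearly for `warmup_steps` steps and then decays with the inverse square root of the step count. The sketch below shows a normalized form of that schedule in which the peak equals `learning_rate`. How EveryVoice actually combines `learning_rate` with the schedule is an assumption here, and the function name is hypothetical.

```python
def noam_lr(step, learning_rate=0.0001, warmup_steps=1000):
    """Normalized Noam schedule: linear warmup, then inverse-square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    scale = min(step / warmup_steps, (warmup_steps / step) ** 0.5)
    return learning_rate * scale


for step in (1, 500, 1000, 4000):
    print(step, noam_lr(step))
```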

PreprocessingConfig object

dataset string

The name of the dataset.

Default: "YourDataSet"

train_split number

The amount of the dataset to use for training. The rest will be used as validation. Hold some of the validation set out for a test set if you are performing experiments.

Default: 0.9

min=0.0, max=1.0

dataset_split_seed integer

The seed to use when splitting the dataset into train and validation sets.

Default: 1234
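
Together, `train_split` and `dataset_split_seed` describe a reproducible random split. The sketch below shows the general idea with Python's standard library; the exact shuffling and rounding EveryVoice uses may differ, so the resulting partition is not expected to match the toolkit's.

```python
import random


def split_dataset(items, train_split=0.9, dataset_split_seed=1234):
    """Shuffle deterministically, then cut into training and validation lists."""
    if not 0.0 <= train_split <= 1.0:
        raise ValueError("train_split must be between 0.0 and 1.0")
    shuffled = list(items)
    random.Random(dataset_split_seed).shuffle(shuffled)  # same seed, same order
    cut = round(len(shuffled) * train_split)
    return shuffled[:cut], shuffled[cut:]


train, val = split_dataset(range(100))
print(len(train), len(val))
```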

save_dir string

The directory to save preprocessed files to.

Default: "preprocessed/YourDataSet"

format=path

audio

Configuration settings for audio.

All of: AudioConfig object

path_to_audio_config_file string | null

The path to an audio configuration file.

Default: null

source_data Dataset[]

A list of datasets.

Punctuation object

exclamations string[]

Exclamation punctuation symbols used in your datasets. EveryVoice replaces these symbols with a single internal token.

Default:

[
  "!",
  "¡"
]

question_symbols string[]

Question/interrogative punctuation symbols used in your datasets. EveryVoice replaces these symbols with a single internal token.

Default:

[
  "?",
  "¿"
]

quotemarks string[]

Quotemark punctuation symbols used in your datasets. EveryVoice replaces these symbols with a single internal token.

Default:

[
  "\"",
  "'",
  "“",
  "”",
  "«",
  "»"
]

parentheses string[]

Punctuation symbols indicating parentheses, brackets, or braces. EveryVoice replaces these symbols with a single internal token.

Default:

[
  "(",
  ")",
  "[",
  "]",
  "{",
  "}"
]

periods string[]

Punctuation symbols indicating a 'period' used in your datasets. EveryVoice replaces these symbols with a single internal token.

Default:

[
  "."
]

colons string[]

Punctuation symbols indicating a 'colon' used in your datasets. EveryVoice replaces these symbols with a single internal token.

Default:

[
  ":"
]

semi_colons string[]

Punctuation symbols indicating a 'semi-colon' used in your datasets. EveryVoice replaces these symbols with a single internal token.

Default:

[
  ";"
]

hyphens string[]

Punctuation symbols indicating a 'hyphen' used in your datasets. * is a hyphen by default since unidecode decodes middle-dot punctuation as an asterisk. EveryVoice replaces these symbols with a single internal token.

Default:

[
  "-",
  "—",
  "*"
]

commas string[]

Punctuation symbols indicating a 'comma' used in your datasets. EveryVoice replaces these symbols with a single internal token.

Default:

[
  ","
]

ellipses string[]

Punctuation symbols indicating ellipses used in your datasets. EveryVoice replaces these symbols with a single internal token.

Default:

[
  "…"
]
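
The punctuation categories above amount to a many-to-one mapping: every symbol listed in a category is collapsed into one internal token for that category. The sketch below illustrates the idea with a few of the default symbol lists. The token names it generates (for example `<EXCLAMATIONS>`) are invented for this example and are not the tokens EveryVoice uses internally.

```python
# A subset of the default categories from the schema.
punctuation = {
    "exclamations": ["!", "¡"],
    "question_symbols": ["?", "¿"],
    "periods": ["."],
    "commas": [","],
}


def build_table(categories):
    """Map each symbol to a placeholder token named after its category (hypothetical names)."""
    return {
        symbol: f"<{category.upper()}>"
        for category, symbols in categories.items()
        for symbol in symbols
    }


def normalize(text, table):
    """Replace punctuation characters with their category token; keep the rest."""
    return [table.get(char, char) for char in text]


table = build_table(punctuation)
print(normalize("¡Hi!", table))
```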

Symbols object

silence string[]

The symbol(s) used to indicate silence.

Default:

[
  "<SIL>"
]

punctuation

EveryVoice will combine punctuation and normalize it into a set of five permissible types of punctuation to help keep training tractable.

All of: Punctuation object

TargetTrainingTextRepresentationLevel string

TextConfig object

symbols object

2 nested properties

silence string[]

The symbol(s) used to indicate silence.

Default:

[
  "<SIL>"
]

punctuation

EveryVoice will combine punctuation and normalize it into a set of five permissible types of punctuation to help keep training tractable.

All of: Punctuation object

to_replace Record<string, string>

Default:

{}

cleaners string[]

g2p_engines Record<string, string>

User defined or external G2P engines. See https://github.com/EveryVoiceTTS/everyvoice_g2p_template_plugin to implement your own G2P.

Default:

{}

Examples: {"fr": "everyvoice_plugin_g2p4example.g2p"}

VarianceLevelEnum string

VarianceLossEnum string

VariancePredictorBase object

loss

The loss function to use when calculating variance loss. Either 'mse' or 'mae'.

Default: "mse"

All of: VarianceLossEnum string

n_layers integer

The number of layers in the variance predictor module.

Default: 5

kernel_size integer

The kernel size of each convolutional layer in the variance predictor module.

Default: 3

dropout number

The amount of dropout to apply.

Default: 0.5

input_dim integer

The number of hidden dimensions in the input. This must match the input_dim value declared in the encoder and decoder modules.

Default: 256

n_bins integer

The number of bins to use in the variance predictor module.

Default: 256

depthwise boolean

Whether to use depthwise separable convolutions.

Default: true

VariancePredictorConfig object

loss

The loss function to use when calculating variance loss. Either 'mse' or 'mae'.

Default: "mse"

All of: VarianceLossEnum string

n_layers integer

The number of layers in the variance predictor module.

Default: 5

kernel_size integer

The kernel size of each convolutional layer in the variance predictor module.

Default: 3

dropout number

The amount of dropout to apply.

Default: 0.5

input_dim integer

The number of hidden dimensions in the input. This must match the input_dim value declared in the encoder and decoder modules.

Default: 256

n_bins integer

The number of bins to use in the variance predictor module.

Default: 256

depthwise boolean

Whether to use depthwise separable convolutions.

Default: true

level

The level for the variance predictor to use. 'frame' will make predictions at the frame level. 'phone' will average predictions across all frames in each phone.

Default: "phone"

All of: VarianceLevelEnum string
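
The difference between the two levels can be illustrated with a small averaging routine: at the 'phone' level, per-frame values are averaged over the frames that belong to each phone. The sketch below assumes phone durations are given as frame counts; the function name and the handling of zero-length phones are assumptions.

```python
def average_per_phone(frame_values, durations):
    """Average frame-level values over each phone's span of frames."""
    if sum(durations) != len(frame_values):
        raise ValueError("durations must add up to the number of frames")
    averages, start = [], 0
    for duration in durations:
        segment = frame_values[start:start + duration]
        # A zero-length phone has no frames; 0.0 is an arbitrary placeholder.
        averages.append(sum(segment) / duration if duration else 0.0)
        start += duration
    return averages


# Six frames of a made-up pitch contour, spread over three phones.
print(average_per_phone([1, 3, 2, 2, 2, 10], [2, 3, 1]))
```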

VariancePredictors object

energy

The variance predictor for energy

All of: VariancePredictorConfig object

duration

The variance predictor for duration

All of: VariancePredictorBase object

pitch

The variance predictor for pitch

All of: VariancePredictorConfig object