Type: object
File match: everyvoice-shared-data.yaml, everyvoice-shared-data.json
Schema URL: https://catalog.lintel.tools/schemas/schemastore/everyvoice-tts-toolkit-data-configuration/latest.json
Source: https://raw.githubusercontent.com/EveryVoiceTTS/everyvoice/main/everyvoice/.schema/everyvoice-shared-data-0.3.json

Validate with Lintel

npx @lintel/lintel check

Properties

dataset string

The name of the dataset.

Default: "YourDataSet"
train_split number

The proportion of the dataset to use for training. The rest will be used for validation. Hold out some of the validation set as a test set if you are performing experiments.

Default: 0.9
min=0.0, max=1.0
dataset_split_seed integer

The seed to use when splitting the dataset into train and validation sets.

Default: 1234
save_dir string

The directory to save preprocessed files to.

Default: "preprocessed/YourDataSet"
format=path
audio

Configuration settings for audio.

All of: AudioConfig object
path_to_audio_config_file string | null

The path to an audio configuration file.

Default: null
source_data Dataset[]

A list of datasets.
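Putting the top-level properties together, a minimal everyvoice-shared-data.yaml might look like the following sketch. The dataset name and paths here are illustrative placeholders, not values prescribed by the schema:

```yaml
# Illustrative everyvoice-shared-data.yaml; names and paths are placeholders.
dataset: MyDataset
train_split: 0.9                 # 90% train, 10% validation
dataset_split_seed: 1234
save_dir: preprocessed/MyDataset
path_to_audio_config_file: null  # audio settings given inline under "audio" instead
source_data:
  - label: MyDataset
    permissions_obtained: true
    data_dir: /path/to/your/dataset/data
    filelist: /path/to/your/dataset/filelist
```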

Definitions

AudioConfig object
min_audio_length number

The minimum length of an audio sample in seconds. Audio shorter than this will be ignored during preprocessing.

Default: 0.4
max_audio_length number

The maximum length of an audio sample in seconds. Audio longer than this will be ignored during preprocessing. Increasing the max_audio_length will result in larger memory usage. If you are running out of memory, consider lowering the max_audio_length.

Default: 11.0
max_wav_value number

Advanced. The maximum value allowed to be in your wav files. For 16-bit audio, this should be (2**16)/2 - 1.

Default: 32767.0
input_sampling_rate integer

The sampling rate describes the number of samples per second of audio. The 'input_sampling_rate' is with respect to your vocoder, or spec-to-wav model. This means that the spectrograms predicted by your text-to-spec model will also be calculated from audio at this sampling rate. If you change this value, your audio will automatically be re-sampled during preprocessing.

Default: 22050
output_sampling_rate integer

Advanced. The sampling rate describes the number of samples per second of audio. The 'output_sampling_rate' is with respect to your vocoder, or spec-to-wav model. This means that the wav files generated by your vocoder or spec-to-wav model will be at this sampling rate. If you change this value, you will also need to change the upsample rates in your vocoder. Your audio will automatically be re-sampled during preprocessing.

Default: 22050
alignment_sampling_rate integer

Advanced. The sampling rate describes the number of samples per second of audio. The 'alignment_sampling_rate' describes the sampling rate used when training an alignment model. If you change this value, your audio will automatically be re-sampled during preprocessing.

Default: 22050
target_bit_depth integer

Advanced. This is the bit depth of each sample in your audio files.

Default: 16
n_fft integer

Advanced. This is the number of bins used by the Fast Fourier Transform (FFT).

Default: 1024
fft_window_size integer

Advanced. This is the window size used by the Fast Fourier Transform (FFT).

Default: 1024
fft_hop_size integer

Advanced. This is the hop size used when calculating the Short-Time Fourier Transform (STFT), which turns a single audio file into a sequence of spectrogram frames. In other words, the hop size is the number of samples between the starts of consecutive frames, i.e. the non-overlapping samples contributed by each frame.

Default: 256
f_min integer

Advanced. This is the minimum frequency for the lowest frequency bin when calculating the spectrogram.

Default: 0
f_max integer

Advanced. This is the maximum frequency for the highest frequency bin when calculating the spectrogram.

Default: 8000
n_mels integer

Advanced. This is the number of filters in the Mel-scale spaced filterbank.

Default: 80
spec_type AudioSpecTypeEnum | string

Advanced. Defines how to calculate the spectrogram. 'mel' uses the TorchAudio implementation of a Mel spectrogram, 'mel-librosa' uses Librosa's implementation, 'linear' calculates a non-Mel linear spectrogram, and 'raw' calculates a complex-valued spectrogram. 'linear' and 'raw' are not currently supported by EveryVoice. We recommend using 'mel-librosa'.

Default: "mel-librosa"
vocoder_segment_size integer

Advanced. The vocoder, or spec-to-wav model is trained by sampling random fixed-size sections of the audio. This value specifies the number of samples in those sections.

Default: 8192
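As a sketch of how the AudioConfig fields above appear under the top-level audio property, the following overrides a few of the documented defaults. The 44100 Hz sampling rates are an illustrative choice, not a recommendation:

```yaml
audio:
  min_audio_length: 0.4
  max_audio_length: 11.0
  input_sampling_rate: 44100   # audio is re-sampled to this rate during preprocessing
  output_sampling_rate: 44100  # changing this also requires matching vocoder upsample rates
  n_fft: 1024
  fft_window_size: 1024
  fft_hop_size: 256
  f_min: 0
  f_max: 8000
  n_mels: 80
  spec_type: mel-librosa
```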
AudioSpecTypeEnum string

One of "mel", "mel-librosa", "linear", or "raw" (see spec_type above).
Dataset object
label string

A label for the source of data.

Default: "YourDataSet"
permissions_obtained boolean

An attestation that permission has been obtained to use this data. You may not use EveryVoice to build a TTS system with data that you do not have permission to use, and doing so can have serious consequences. Finding data online does not constitute permission. The speaker should be aware of and consent to their data being used in this way.

Default: false
data_dir string

The path to the directory with your audio files.

Default: "/please/create/a/path/to/your/dataset/data"
format=path
filelist string

The path to your dataset's filelist.

Default: "/please/create/a/path/to/your/dataset/filelist"
format=path
filelist_loader string

Advanced. The file-loader function to use to load your dataset's filelist.

sox_effects array

Advanced. A list of SoX effects to apply to your audio prior to preprocessing. Run python -c 'import torchaudio; print(torchaudio.sox_effects.effect_names())' to see a list of supported effects.

Default:
[
  [
    "channels",
    "1"
  ]
]
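As a sketch, additional effects can be appended after the default. "channels" comes from the schema default above; "norm" is a standard SoX gain-normalization effect, but confirm it appears in your torchaudio effect list (via the command above) before relying on it:

```yaml
sox_effects:
  - ["channels", "1"]  # downmix to mono (the schema default)
  - ["norm"]           # normalize level; verify support in your SoX/torchaudio build
```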