
IndexTeam/IndexTTS Series Model Extended Parameter Description

This document describes the extended parameters of the IndexTeam/IndexTTS series of TTS models (e.g., IndexTeam/IndexTTS-2), and applies only when these models are called via /v1/audio/speech on the UModelverse platform.

  • For basic call methods, paths, authentication methods, etc., please refer to the OpenAI TTS API Call Document.
  • For custom voice (voice_id) related content, please refer to the Custom Voice Management API Document.
  • Some parameters introduced here are not part of the official OpenAI TTS protocol; they are extended capabilities that UModelverse provides for IndexTTS models, and they take effect only on this platform.

I. Relationship with the OpenAI Standard Protocol

  • The basic fields of /v1/audio/speech (such as model, input, voice, response_format, speed, instructions, etc.) are fully compatible with OpenAI’s TTS protocol. For specific meanings, please refer to the quick start document.
  • On this basis, for IndexTTS series models, UModelverse additionally supports a set of extended fields for more granular control:
    • Sample rate and volume
    • Emotional control method and weight
    • Emotional vector / text
    • Sentence segmentation and silence behavior
    • Whether to return in streaming form, etc.
  • These extended fields:
    • Are effective only when calling IndexTeam/IndexTTS series models on UModelverse;
    • Are not recognized by the official OpenAI endpoints and are not guaranteed to be passed through;
    • Use the model-side default behavior when omitted.

Suggestion: if you need to stay compatible with the official OpenAI endpoint (so you can switch back without code changes), use only the basic fields; if you want to take full advantage of the IndexTTS model's advanced controls, use the extended fields described here in the UModelverse environment.


II. IndexTTS Extended Field Overview

The following fields are on the same level as the basic fields in the JSON request body, for example:

{ "model": "IndexTeam/IndexTTS-2", "input": "Hello, welcome to Modelverse TTS.", "voice": "jack_cheng", // Basic field "sample_rate": 24000, // Extended field "gain": 1.0, // Extended field "emo_control_method": 1, // Extended field "emo_weight": 0.8, // Extended field "emo_text": "Joyful", // Extended field "interval_silence": true // Extended field }

Note: All extended fields are optional and default settings will be used if they are not filled.

2.1 Sample Rate and Volume

| Field Name | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| speed | float64 | No | 1 | Speech playback speed, range 0.25-4. Defaults to the model's default speed if not set. |
| sample_rate | int | No | 22050 | Target audio sample rate. The supported values are defined by the provider, e.g. 16000, 22050, 24000. Defaults to the model's default sample rate if not set. |
| gain | float64 | No | 1 | Output volume gain factor, used to amplify or attenuate the synthesized speech. Suggested range is (0, 10]; note that 0 is mute. Defaults to the model's default volume if not set. |

2.2 Emotional Control

The following fields can be used together with custom emotional audio / text prompts for finer-grained emotional control. The specific numerical semantics are defined by the IndexTTS provider.

| Field Name | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| emo_control_method | int | No | 0 | Enum selecting the emotional control strategy. 0: no emotional control; 1: based on emotional audio; 2: based on emotional vector; 3: based on emotional text. Defaults to the model's default emotional control method if not set. |
| emo_weight | float64 | No | 1.0 | Emotional control weight, i.e. how strongly the emotional reference audio, emotional vector, or emotional text affects the output. Valid range is 0.0 to 1.0, default 1.0 (100%). In textual emotion mode, a value around 0.6 or lower is recommended for more natural results. |
| emo_vec | float64[] | No | [0, 0, 0, 0, 0, 0, 0, 0] | Emotional vector for fine-grained control in vector space, ordered as [Happy, Angry, Sad, Scared, Disgusted, Depressed, Surprised, Calm]. Each dimension ranges over [0, 1.2], and the sum of all dimensions should not exceed 1.5. |
| emo_text | string | No | "" | Natural-language description of the desired emotion, such as "Joyful", "Calm", or "Excited", used by the provider as textual input for emotional control. |
| emo_random | bool | No | false | Whether to introduce randomness into emotional control, adding diversity and avoiding identical emotional expression in every sentence. The exact effect is implemented by the provider. |
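The `emo_vec` constraints stated above (eight dimensions, each in [0, 1.2], total at most 1.5) can be checked on the client before sending a request. This is an illustrative sketch, not a provider-supplied validator.

```python
# Illustrative check of the documented emo_vec constraints:
# 8 dimensions, each in [0, 1.2], with a total sum of at most 1.5.

EMO_DIMS = ["Happy", "Angry", "Sad", "Scared",
            "Disgusted", "Depressed", "Surprised", "Calm"]

def validate_emo_vec(vec: list[float]) -> None:
    if len(vec) != len(EMO_DIMS):
        raise ValueError(f"emo_vec must have {len(EMO_DIMS)} dimensions")
    for name, v in zip(EMO_DIMS, vec):
        if not (0 <= v <= 1.2):
            raise ValueError(f"{name} must be in [0, 1.2], got {v}")
    if sum(vec) > 1.5:
        raise ValueError("sum of all dimensions must not exceed 1.5")

validate_emo_vec([0.8, 0, 0.3, 0, 0, 0, 0.2, 0])  # valid: each <= 1.2, sum 1.3
```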

2.3 Sentence Segmentation and Silence Control

| Field Name | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| interval_silence | int | No | 200 | Duration of silence, in milliseconds, inserted between sentences when synthesizing multi-sentence text. 200 ms is recommended. Defaults to the model's default silence strategy if not set. |
| max_text_tokens_per_sentence | int | No | 120 | Maximum token count per sentence when segmenting text, used to control the internal segmentation strategy in long-text scenarios. 120 is recommended. Defaults to the model's default segmentation strategy if not set. |
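To illustrate the idea behind `max_text_tokens_per_sentence`, the sketch below splits input text into sentences and flags any sentence over the budget. A whitespace word count stands in for the model's real tokenizer, which this sketch does not have; the splitting rule is illustrative, not the provider's actual segmentation strategy.

```python
# Illustrative client-side sentence segmentation. A whitespace word
# count approximates the token count; the real tokenizer lives on the
# model side.
import re

def split_sentences(text: str, max_tokens: int = 120) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    for s in sentences:
        if len(s.split()) > max_tokens:
            raise ValueError(f"sentence exceeds {max_tokens} tokens: {s[:40]}...")
    return sentences

chunks = split_sentences("First sentence. Second one! Third?")
```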

III. Example: IndexTTS Call with Extended Parameters

The following example demonstrates how to pass extended parameters to the IndexTTS model while maintaining the OpenAI call style.

3.1 curl Example

The fields from `sample_rate` onward are IndexTTS extended parameters, effective only on this platform for IndexTTS series models:

```shell
curl https://api.umodelverse.ai/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MODELVERSE_API_KEY" \
  -d '{
    "model": "IndexTeam/IndexTTS-2",
    "input": "Hello, this is a joyful emotional speech example.",
    "voice": "jack_cheng",
    "sample_rate": 24000,
    "gain": 1.0,
    "emo_control_method": 3,
    "emo_weight": 0.8,
    "emo_text": "Joyful",
    "interval_silence": 200,
    "max_text_tokens_per_sentence": 120
  }' \
  --output speech-indextts.wav
```

3.2 Using Custom Voice + Extended Parameters (Illustration)

```shell
VOICE_ID="uspeech:xxxx-xxxx-xxxx-xxxx"  # Obtain via /v1/audio/voice/upload

curl https://api.umodelverse.ai/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MODELVERSE_API_KEY" \
  -d '{
    "model": "IndexTeam/IndexTTS-shizhenfei",
    "input": "Hello, I am a custom voice example with emotions.",
    "voice": "'"$VOICE_ID"'",
    "emo_control_method": 2,
    "emo_weight": 0.6,
    "emo_random": true,
    "interval_silence": 200,
    "max_text_tokens_per_sentence": 120
  }' \
  --output speech-indextts-custom.wav
```

Note:

  • The extended parameter values in the above examples are illustrative only; the actual recommended ranges and semantics are defined by the provider's documentation and should be verified by listening to the output.
  • When no extended fields are passed, the IndexTTS model will adopt the default inference configuration, with behavior consistent with the examples in the quick start document.
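The same request can also be issued from Python using only the standard library. This sketch mirrors the first curl example; the send step is commented out so the request can be inspected without network access or an API key.

```python
# Python equivalent of the curl example above, built with the standard
# library. The request object is constructed but not sent.
import json
import os
import urllib.request

def build_request(payload: dict) -> urllib.request.Request:
    return urllib.request.Request(
        "https://api.umodelverse.ai/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('MODELVERSE_API_KEY', '')}",
        },
    )

req = build_request({
    "model": "IndexTeam/IndexTTS-2",
    "input": "Hello, this is a joyful emotional speech example.",
    "voice": "jack_cheng",
    "emo_control_method": 3,
    "emo_text": "Joyful",
    "interval_silence": 200,
})

# To actually send the request and save the audio:
#   with urllib.request.urlopen(req) as resp:
#       open("speech-indextts.wav", "wb").write(resp.read())
```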

IV. Relationship with Other Documents

  • Quick Start Document: The OpenAI TTS API Call Document is suitable for users who wish to quickly integrate and use only standard parameters.
  • Custom Voice Document: The Custom Voice Management API Document focuses on the upload and management of voice_id.
  • This Document: aimed at users who need fine-grained control over the behavior of the IndexTTS series models. It describes the non-OpenAI-standard extended fields available in the UModelverse environment, without affecting the basic quick start document.