Custom Voice Management API Documentation

This document describes only the request and response formats of the custom voice management interfaces (upload, list, delete) and is intended as a reference for users with development experience. For how to use custom voices in /v1/audio/speech, see the section “Using Custom Voices (Optional)” in the OpenAI TTS API Call Documentation.

Overview

Domain Example: https://api.umodelverse.ai
Authentication Method: All interfaces require the request header Authorization: Bearer <MODELVERSE_API_KEY>.
Organization Isolation:
- Custom voices are isolated by organization; all sub-accounts in the same organization share that organization's custom voices.
- Custom voices cannot be shared between organizations.
Lifecycle: Custom voices are kept for 7 days by default, after which a background task deletes them. For long-term storage, contact the business team for evaluation.
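
Because voices are removed after the default retention period, a client may want to record when each voice was uploaded and estimate when it will disappear. The following is a minimal sketch, assuming the 7-day window starts at upload time (this document does not state the exact cleanup schedule); `RETENTION`, `expected_expiry`, and `is_probably_expired` are illustrative names, not part of the API.

```python
from datetime import datetime, timedelta, timezone

# Default retention stated in this document; the exact cleanup time is an assumption.
RETENTION = timedelta(days=7)

def expected_expiry(uploaded_at: datetime) -> datetime:
    """Return the earliest time at which the voice may be cleaned up."""
    return uploaded_at + RETENTION

def is_probably_expired(uploaded_at: datetime, now: datetime) -> bool:
    """True once the default retention window has elapsed."""
    return now >= expected_expiry(uploaded_at)

# Example: a voice uploaded at midnight UTC on 1 January 2025.
uploaded = datetime(2025, 1, 1, tzinfo=timezone.utc)
deadline = expected_expiry(uploaded)
```

Re-uploading the sample before the estimated deadline is one way to keep a voice available without a long-term storage arrangement.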

1. Upload Custom Voice

HTTP Method: POST
Path: /v1/audio/voice/upload
Content-Type:
- Recommended: multipart/form-data (direct file upload)
- Also supported: form fields carrying a Base64 string or a remote URL

1.1 Request Parameters

Common Fields

Field	Type	Required	Description
name	string	Yes	The name of the voice, used for list display, e.g., “Gentle Female Voice”, “Customer Service Voice A”.
model	string	Yes	The name of the TTS model this voice will be used with, e.g., `IndexTeam/IndexTTS-2`. It should match the `model` in subsequent `/v1/audio/speech` requests.

Speaker (Voice Sample Audio - Choose One, Required)

Field	Type	Required	Description
speaker_file	file	Yes (choose one)	Local audio file (recommended method), uploaded via `multipart/form-data`.
speaker_file_base64	string	Yes (choose one)	The Base64 string of `speaker_file`, passed through a standard form field.
speaker_url	string	Yes (choose one)	A publicly accessible URL pointing to the voice audio file.

Notes:

Provide one of the three speaker_* fields; at least one is required;

If multiple are provided, the priority is: speaker_file → speaker_file_base64 → speaker_url;

If none of the three are provided, the request will be rejected (error code: missing_speaker).

Emotion (Emotion Sample Audio - Choose One, Optional)

Field	Type	Required	Description
emotion_file	file	No (choose one)	Emotion sample audio file, uploaded via `multipart/form-data`.
emotion_file_base64	string	No (choose one)	The Base64 string of `emotion_file`, passed through a standard form field.
emotion_url	string	No (choose one)	A publicly accessible URL pointing to the emotion sample audio file.

Notes:

The emotion_* fields are entirely optional and can be omitted;

If multiple are provided, the priority is: emotion_file → emotion_file_base64 → emotion_url;

If no emotion_* field is provided, the voice characteristics are built from the Speaker audio alone.
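
The priority rules for the speaker_* and emotion_* fields can be mirrored on the client to predict which source the server will use. This is a sketch under assumptions: the helper names are hypothetical, empty values are treated as "not provided", and the order in which the server checks `name` and the speaker fields is not documented.

```python
from typing import Mapping, Optional

# Field suffixes in the priority order stated in the notes above.
PRIORITY = ("file", "file_base64", "url")

def effective_source(fields: Mapping[str, object], prefix: str) -> Optional[str]:
    """Return the field name expected to take effect, or None if none is set.

    `prefix` is "speaker" or "emotion".
    """
    for suffix in PRIORITY:
        key = f"{prefix}_{suffix}"
        if fields.get(key):  # missing or empty values count as not provided
            return key
    return None

def check_upload_fields(fields: Mapping[str, object]) -> Optional[str]:
    """Return the documented error code the request would likely hit, or None."""
    if not fields.get("name"):
        return "missing_name"
    if effective_source(fields, "speaker") is None:
        return "missing_speaker"
    return None  # emotion_* fields are optional, so they are not checked here
```

For example, a request carrying both speaker_file_base64 and speaker_url resolves to speaker_file_base64, so the URL would be ignored.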

1.2 Audio File Constraints

The following constraints apply to the uploaded speaker_* and emotion_* audio:

Format: Only MP3 and WAV are supported.
Size: Each audio file must be ≤ 20 MB.
Duration: 5–30 seconds.
Sample Rate: 16kHz or higher.

If any of these conditions is not met, the interface returns a 4xx error, with the specific reason given in error.code (e.g., file_too_large, duration_out_of_range, sample_rate_too_low).
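
Checking a file locally before upload avoids a round trip that is bound to fail. The sketch below covers PCM WAV files only, using Python's standard `wave` module (the standard library has no MP3 decoder). It assumes that "20MB" means 20 MiB and that the duration bounds are inclusive; neither detail is specified here, and `check_wav` is an illustrative name.

```python
import os
import wave
from typing import Optional

MAX_BYTES = 20 * 1024 * 1024       # assumption: "20MB" interpreted as 20 MiB
MIN_SECONDS, MAX_SECONDS = 5.0, 30.0
MIN_SAMPLE_RATE = 16_000

def check_wav(path: str) -> Optional[str]:
    """Return the documented error code the file would likely trigger, or None."""
    if os.path.getsize(path) > MAX_BYTES:
        return "file_too_large"
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            duration = wav.getnframes() / rate
    except (wave.Error, EOFError):
        # Not a readable PCM WAV file; the server may still accept it if it is MP3.
        return "unsupported_audio_format"
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        return "duration_out_of_range"
    if rate < MIN_SAMPLE_RATE:
        return "sample_rate_too_low"
    return None
```

The server's own checks remain authoritative; a local pass only means the file is unlikely to be rejected for these four reasons.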

1.3 Request Example (Recommended: multipart file upload)


curl -X POST "https://api.umodelverse.ai/v1/audio/voice/upload" \
  -H "Authorization: Bearer $MODELVERSE_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "name=Gentle Female Voice" \
  -F "model=IndexTeam/IndexTTS-2" \
  -F "speaker_file=@/path/to/speaker.wav" \
  -F "emotion_file=@/path/to/emotion.wav"

1.4 Successful Response Example


{
  "id": "uspeech:xxxx-xxxx-xxxx-xxxx"
}

id: The custom voice ID. Pass it in the voice field of subsequent /v1/audio/speech requests (e.g., "voice": "uspeech:xxxx-xxxx-xxxx-xxxx").
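
To illustrate how the returned ID is carried into a speech request, here is a hedged sketch that only builds the JSON body. The `model` and `voice` fields come from this document; the `input` field name is an assumption borrowed from the OpenAI speech API, and the `uspeech:` prefix check is inferred from the example IDs shown here. Refer to the TTS call documentation for the authoritative request format.

```python
import json

def speech_payload(voice_id: str, model: str, text: str) -> str:
    """Build a JSON body for /v1/audio/speech that references a custom voice."""
    if not voice_id.startswith("uspeech:"):
        # Prefix taken from the example IDs in this document; treat as an assumption.
        raise ValueError("expected a custom voice ID such as 'uspeech:...'")
    body = {
        "model": model,     # should match the model given at upload time
        "voice": voice_id,  # the id returned by the upload interface
        "input": text,      # assumed field name, following the OpenAI speech API
    }
    return json.dumps(body, ensure_ascii=False)
```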

1.5 Failed Response Example

All error responses use a unified format:


{
  "error": {
    "message": "Error description",
    "type": "invalid_request_error",
    "code": "missing_speaker",
    "param": "<Request ID or Parameter Name>"
  }
}

Common error code examples:

missing_name: The name field was not provided;
missing_speaker: None of the speaker_* fields was provided;
invalid_speaker_base64: speaker_file_base64 could not be decoded;
unsupported_audio_format: The audio format is not MP3 or WAV;
file_too_large / duration_out_of_range / sample_rate_too_low: The audio does not meet the size, duration, or sample rate requirements.
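
Since every error uses the same envelope, a client can parse it once and branch on error.code. The following is a minimal sketch; `ApiError`, `parse_error`, and `AUDIO_ERRORS` are illustrative names, and the code only relies on the fields shown in the example above.

```python
import json
from typing import NamedTuple, Optional

class ApiError(NamedTuple):
    code: Optional[str]
    message: str
    type: Optional[str]
    param: Optional[str]

def parse_error(body: str) -> Optional[ApiError]:
    """Return an ApiError if `body` matches the unified error format, else None."""
    try:
        data = json.loads(body)
    except ValueError:  # not JSON at all
        return None
    err = data.get("error") if isinstance(data, dict) else None
    if not isinstance(err, dict):
        return None
    return ApiError(err.get("code"), err.get("message", ""),
                    err.get("type"), err.get("param"))

# Codes a caller can resolve by changing the uploaded audio (taken from the list above).
AUDIO_ERRORS = {
    "unsupported_audio_format",
    "file_too_large",
    "duration_out_of_range",
    "sample_rate_too_low",
}
```

A caller might re-encode the sample and retry when `parse_error(body).code` is in `AUDIO_ERRORS`, and surface the message to the user otherwise.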

2. Query Custom Voice List

HTTP Method: GET
Path: /v1/audio/voice/list

2.1 Request Description

No request body is required; only the authentication header needs to be included.
The system returns the list of custom voices for the organization that the current API Key belongs to (top_org_id).
To ensure interface performance, a maximum of 1000 records is returned per call.

2.2 Response Example


{
  "list": [
    { "id": "uspeech:xxxx", "name": "Gentle Female Voice" },
    { "id": "uspeech:yyyy", "name": "Steady Male Voice" }
  ]
}

Field Explanation:

Field	Type	Description
list	array	A list of custom voices.
list[].id	string	Custom voice ID, which can be referenced in the `voice` field of `/v1/audio/speech`.
list[].name	string	The name of the voice entered during creation, intended for display only.
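
A typical use of the list response is to look up a voice ID by its display name. The sketch below makes no assumption that names are unique (this document does not say whether they are), so each name maps to a list of IDs. `voices_by_name` and `may_be_truncated` are illustrative helpers; the truncation check relies on the 1000-record cap described in 2.1.

```python
from typing import Dict, List

MAX_RECORDS = 1000  # per-call cap stated in section 2.1

def voices_by_name(response: dict) -> Dict[str, List[str]]:
    """Map each display name to the IDs that carry it (names may repeat)."""
    index: Dict[str, List[str]] = {}
    for item in response.get("list", []):
        index.setdefault(item["name"], []).append(item["id"])
    return index

def may_be_truncated(response: dict) -> bool:
    """True if the response hit the cap, so some voices may be missing."""
    return len(response.get("list", [])) >= MAX_RECORDS
```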

3. Delete Custom Voice

HTTP Method: POST
Path: /v1/audio/voice/delete
Content-Type: application/json

3.1 Request Parameters

Field	Type	Required	Description
id	string	Yes	ID of the custom voice to be deleted, i.e., the `id` returned by the upload interface.

3.2 Request Example


curl -X POST "https://api.umodelverse.ai/v1/audio/voice/delete" \
  -H "Authorization: Bearer $MODELVERSE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "uspeech:xxxx"
  }'
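
The same request can be assembled with Python's standard library. This sketch only builds the request object and leaves sending to the caller; `BASE_URL` uses the domain example from the Overview, and `build_delete_request` is an illustrative name.

```python
import json
import urllib.request

BASE_URL = "https://api.umodelverse.ai"  # domain example from the Overview

def build_delete_request(api_key: str, voice_id: str) -> urllib.request.Request:
    """Build the POST request for /v1/audio/voice/delete without sending it."""
    body = json.dumps({"id": voice_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/voice/delete",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_delete_request(key, "uspeech:xxxx")) as resp:
#       result = json.load(resp)
# Note that urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
```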

3.3 Successful Response Example


{
  "success": true
}

Note: After a successful deletion, the voice ID can no longer be used in /v1/audio/speech requests. Confirm that it is no longer in use before deleting it.

3.4 Possible Error Codes

missing_id: The id field was not provided in the request body;
invalid_voice_id: The specified id does not exist under the current organization or has already been deleted;
server_error (other cases): Internal errors or object storage anomalies; investigate using the returned message and request ID.

Together, these three interfaces cover the full lifecycle of custom voices:

Create a voice with the upload interface and obtain its ID;
Reference that ID in the voice field of TTS calls;
Manage existing voices through the list and delete interfaces, and control storage costs in conjunction with the 7-day retention policy.