| Version | Date | Description |
|---|---|---|
| 2.5 | 2025-01-15 | Added VAD (Voice Activity Detection) feature, supporting both auto and manual modes to automatically detect the end of user speech. |
| 2.4.1 | 2025-08-11 | Added input parameter input_audio_format, supporting raw Opus input (2-byte length prefix + 60ms frame); audio is converted to PCM before ASR; documentation updated. |
| 2.4 | 2025-07-15 | Added SPEAK message type, allowing characters to directly speak specified content. |
| 2.3 | 2025-07-10 | Changed connection timeout to 5 minutes and authentication timeout to 5 seconds; heartbeat packets no longer recommended; added connection close operation. |
| 2.2 | 2025-04-23 | Added audio format selection feature, supporting both PCM and Opus formats. |
| 2.1 | 2025-03-15 | Added support for the new model AnyoneV2.1. |
| 1.32 | 2025-01-03 | Added optional parameters for authentication messages and new prompt receipt message. |
| 1.31 | 2024-12-29 | Changed message start/end markers to ##START and ##END, and revised misleading descriptions in the documentation. |
| 1.30 | 2024-12-27 | Unified message structure and added new END_FRAME frame type. |
# Client sends heartbeat
##START\x0500000000[0000]##PING##END
# Server responds
##START\x0500000000[0000]##INFO:PONG##END

- Message type: `\x05`.
- Task ID: `00000000`.
- Sequence number: `0000`.

##START\x0500000000[0000]##DISCONNECT##END

##START\x0500000000[0000]##INFO:DISCONNECT 3 seconds##END

##START[MessageType][TaskID][SequenceNumber][MessageContent]##END

The brackets `[]` are used only to illustrate the format and are not included in the actual data.

| Type Name | Hex Value | Description |
|-------------|-----------|-------------|
| AUTH | `\x01` | Authentication message |
| AUDIO_FRAME | `\x02` | Audio-frame data |
| END_FRAME | `\x03` | End-of-stream marker |
| TEXT | `\x04` | Text message |
| STATUS | `\x05` | Status message (can be used for client heartbeats) |
| MCP | `\x06` | MCP message (optional extension) |
| SPEAK | `\x07` | Speak specified content (added in v2.4) |

- Start marker: `b'##START'`
- End marker: `b'##END'`
- Message type: `b'\x01'` – `b'\x05'`.
- Task ID: `00000000` (reserved for system messages). Examples:
  - `00000000` – system message
  - `task0001` – ordinary task
  - `abcd1234` – ordinary task
- Sequence number: `0000`–`9999`; `0000` is used for status messages.

##START\x0100000000[0000]JWT_TOKEN##[Param1]:[Value1]##[Param2]:[Value2]...##END

| Parameter | Type | Default | Description |
|---|---|---|---|
| voiceid | string | Character default voice | Voice ID |
| emotion_status (WIP) | string | true/false | Whether to return emotion status |
| lang | string | "default" | Language setting. Natural-language values accepted: "default", "auto", "中文", "English", "JA", "kr", etc. |
| format | string | pcm | Output audio format. Affects audio data returned by the server. Allowed: pcm, opus |
| input_audio_format | string | pcm | Input audio format. Affects how the server parses upstream audio. Allowed: pcm, opus (raw stream: 2-byte length prefix + 60 ms frame) |
| in_rate | int | 16000 | Input sample rate (logged/validated; server always resamples to 16 kHz) |
| in_channels | int | 1 | Input channel count (logged/validated; server always converts to mono) |
| in_frame_ms | int | 60 | Input frame length in milliseconds; recommended frame size for Opus input |
| mode | string | manual | Audio processing mode (new in v2.5). Allowed: manual (manual mode), auto (automatic VAD mode) |
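
The parameter table above maps directly onto the AUTH message layout. The following is a minimal sketch, not the official client: the helper names `build_frame` and `build_auth` are hypothetical, the 8-character task ID length is inferred from the examples in this document, and UTF-8 encoding of the text body is an assumption.

```python
START = b"##START"
END = b"##END"
AUTH = b"\x01"  # AUTH message type from the type table


def build_frame(msg_type: bytes, task_id: str, seq: int, content: bytes = b"") -> bytes:
    """Assemble ##START + type + task ID + 4-digit sequence + content + ##END."""
    if len(msg_type) != 1:
        raise ValueError("message type must be exactly one byte")
    if len(task_id) != 8:  # assumption: task IDs are always 8 characters
        raise ValueError("task ID must be exactly 8 characters")
    if not 0 <= seq <= 9999:
        raise ValueError("sequence number must be in 0000-9999")
    header = msg_type + task_id.encode("ascii") + f"{seq:04d}".encode("ascii")
    return START + header + content + END


def build_auth(token: str, **params: str) -> bytes:
    """Build an AUTH frame: the token followed by optional ##key:value pairs."""
    body = token + "".join(f"##{key}:{value}" for key, value in params.items())
    # AUTH uses the system task ID 00000000 and sequence 0000.
    return build_frame(AUTH, "00000000", 0, body.encode("utf-8"))
```

For example, `build_auth("JWT_TOKEN", mode="auto", input_audio_format="opus")` yields the same bytes as the auto-mode Opus example in this section. Parameters are emitted in the order they are passed.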
# Basic auth (token only)
##START\x01000000000000JWT_TOKEN##END
# Single parameter
##START\x01000000000000JWT_TOKEN##voiceid:voice1##END
# Multiple parameters
##START\x01000000000000JWT_TOKEN##voiceid:voice1##stage_mode:true##fast_mode:true##END
# Set downstream audio format to Opus
##START\x01000000000000JWT_TOKEN##format:opus##END
# Set upstream audio format to raw Opus
##START\x01000000000000JWT_TOKEN##input_audio_format:opus##END
# Enable auto mode (VAD auto-detection) – new in v2.5
##START\x01000000000000JWT_TOKEN##mode:auto##input_audio_format:opus##END
# Auto mode with PCM input
##START\x01000000000000JWT_TOKEN##mode:auto##input_audio_format:pcm##END

##START\x0500000000[0000]##INFO:Authentication succeeded, NPCID: <npcid>, mode: <mode>##END

##START\x0500000000[0000]##ERROR:token error##END

- `END_FRAME` to mark the end of speech.
- `input_audio_format:opus` is recommended for lower bandwidth.
- `END_FRAME`.

# Start listening
##START\x05[TaskID][0000]##LISTEN:{"session_id":"<session_id>","type":"listen","state":"start","mode":"auto"}##END
# Stop listening
##START\x05[TaskID][0000]##LISTEN:{"session_id":"<session_id>","type":"listen","state":"stop","mode":"auto"}##END
# Noise-filter hint
##START\x05[TaskID][0000]##INFO:Noise or silence detected, still listening##END

- `format`/`input_audio_format` (or explicitly set both to pcm).
- `END_FRAME`.

1. Text frame:
##START\x04[TaskID][0000][text content]##END
2. End frame:
##START\x03[TaskID][0001]##END

1. Prompt receipt:
##START\x05[TaskID][0000]##INFO:prompt: [PromptContent]##END
2. Response text:
##START\x04[TaskID][0000][ResponseContent]##END
3. Audio data (if any):
##START\x02[TaskID][0001][AudioData1]##END
##START\x02[TaskID][0002][AudioData2]##END
...
4. End-of-stream frame:
##START\x03[TaskID][LastSeqNo+1]##END

##START\x02[TaskID][0000][AudioData1]##END
##START\x02[TaskID][0001][AudioData2]##END
...
##START\x03[TaskID][LastSeqNo+1]##END

##START\x04[TaskID][0000][ResponseText]##END
##START\x02[TaskID][0001][AudioData1]##END
##START\x02[TaskID][0002][AudioData2]##END
...
##START\x03[TaskID][LastSeqNo+1]##END

- `format`: pcm or opus
- opus = raw stream: 60 ms per frame, preceded by a 2-byte big-endian length header; several complete frames may be concatenated in one network packet; the server tries to pack as many full frames as possible into each TCP frame
- `input_audio_format`: pcm or opus
- With opus, the client must send the same raw Opus frame stream as used for output (2-byte length prefix + 60 ms frame, 16 kHz mono). The server decodes it to PCM before feeding it to ASR
- Set `format:opus` during auth
- Set `input_audio_format:opus` during auth
- Send `END_FRAME` after the utterance; the server aggregates all upstream frames before recognition
- If `input_audio_format` is missing or invalid, the server treats the stream as PCM

##START\x07[TaskID][SeqNo][TextContent]##END

1. Audio data frames:
##START\x02[TaskID][0001][AudioData1]##END
##START\x02[TaskID][0002][AudioData2]##END
...
2. End frame:
##START\x03[TaskID][LastSeqNo+1]##END
3. Status message (optional):
##START\x05[TaskID][0000]##INFO:TTS completed##END

##START\x05[TaskID][0000]##ERROR:Error description##END

# Authentication (format not explicitly set; both upstream and downstream default to PCM)
C -> S: ##START\x01000000000000JWT_TOKEN##voiceid:voice1##END
S -> C: ##START\x05000000000000##INFO:Authentication successful, NPCID: <npcid>##END
# Text conversation
C -> S: ##START\x04123456780000Hello##END
C -> S: ##START\x03123456780001##END
S -> C: ##START\x04123456780000Hello, nice to meet you##END
S -> C: ##START\x02123456780001[AudioData1]##END
S -> C: ##START\x02123456780002[AudioData2]##END
S -> C: ##START\x03123456780003##END

# Authentication (Opus enabled for both downstream and upstream)
C -> S: ##START\x01000000000000JWT_TOKEN##voiceid:voice1##format:opus##input_audio_format:opus##END
S -> C: ##START\x05000000000000##INFO:Authentication successful, NPCID: <npcid>, mode: manual##END
# Uplink audio (frame stream: 2-byte length prefix + 60 ms frames, multiple frames/packet allowed)
C -> S: ##START\x02task00010000[len+frame][len+frame]...##END
...
C -> S: ##START\x03task00010001##END
# Downlink audio (server aggregates and returns as many frames as possible per packet)
S -> C: ##START\x02task00010001[AudioData1]##END
S -> C: ##START\x02task00010002[AudioData2]##END
S -> C: ##START\x03task00010003##END

# Authentication (enable auto mode + Opus format)
C -> S: ##START\x01000000000000JWT_TOKEN##mode:auto##input_audio_format:opus##format:opus##END
S -> C: ##START\x05000000000000##INFO:Authentication successful, NPCID: <npcid>, mode: auto##END
# Server proactively sends "start listening"
S -> C: ##START\x05000000000000##LISTEN:{"session_id":"00000000","type":"listen","state":"start","mode":"auto"}##END
# Client starts streaming audio (no END_FRAME needed)
C -> S: ##START\x02task00010000[len+frame][len+frame]...##END
C -> S: ##START\x02task00010001[len+frame][len+frame]...##END
...
# VAD detects end-of-speech; server sends "stop listening"
S -> C: ##START\x05task00010000##LISTEN:{"session_id":"task0001","type":"listen","state":"stop","mode":"auto"}##END
# Server processes and returns response
S -> C: ##START\x05task00010000##INFO:prompt: what the user said##END
S -> C: ##START\x04task00010000AI's reply text##END
S -> C: ##START\x02task00010001[AudioData1]##END
S -> C: ##START\x02task00010002[AudioData2]##END
S -> C: ##START\x03task00010003##END
# After response finishes, server resumes listening
S -> C: ##START\x05000000000000##LISTEN:{"session_id":"00000000","type":"listen","state":"start","mode":"auto"}##END
# Loop: client keeps sending audio, server auto-detects...
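
On the client side, the auto-mode loop above comes down to splitting each frame into its fields and reacting to the `LISTEN` status payload. The sketch below is illustrative only: `Frame`, `parse_frame` and `listen_state` are hypothetical names, the 8-character task ID and 4-character sequence widths are inferred from the examples, and the input is assumed to be one complete frame that has already been separated from the stream.

```python
import json
from typing import NamedTuple, Optional

START = b"##START"
END = b"##END"
STATUS = 0x05  # STATUS message type from the type table


class Frame(NamedTuple):
    msg_type: int
    task_id: str
    seq: str
    content: bytes


def parse_frame(raw: bytes) -> Frame:
    """Split one complete frame into type, task ID, sequence and content."""
    if not (raw.startswith(START) and raw.endswith(END)):
        raise ValueError("missing ##START or ##END marker")
    body = raw[len(START):-len(END)]
    if len(body) < 13:  # 1 type byte + 8 task ID chars + 4 sequence digits
        raise ValueError("frame too short for type, task ID and sequence")
    return Frame(body[0], body[1:9].decode("ascii"), body[9:13].decode("ascii"), body[13:])


def listen_state(frame: Frame) -> Optional[str]:
    """Return the 'state' field of a LISTEN status message, else None."""
    prefix = b"##LISTEN:"
    if frame.msg_type != STATUS or not frame.content.startswith(prefix):
        return None
    payload = json.loads(frame.content[len(prefix):].decode("utf-8"))
    return payload.get("state")
```

A client might start streaming audio when `listen_state` returns `"start"` and pause when it returns `"stop"`. Other status messages, such as `##INFO:` lines, return `None` here and would need their own handling.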
# Force-end dialogue example
C -> S: ##START\x0500000000[0000]##STOP_VAD##END
S -> C: ##START\x0500000000[0000]##INFO:Forcibly ending dialogue, processing current audio##END
S -> C: ##START\x0500000000[0000]##LISTEN:{"session_id":"00000000","type":"listen","state":"start","mode":"auto"}##END

# Client sends audio
C -> S: ##START\x02task00020000[ambient noise]##END
C -> S: ##START\x02task00020001[ambient noise]##END
# VAD detects "end-of-speech" but STT result is invalid
S -> C: ##START\x05task00020000##LISTEN:{"session_id":"task0002","type":"listen","state":"stop","mode":"auto"}##END
S -> C: ##START\x05task00020000##INFO:Noise or silence detected, continuing to listen##END
# Server immediately resumes listening without user intervention
S -> C: ##START\x05task00020000##LISTEN:{"session_id":"task0002","type":"listen","state":"start","mode":"auto"}##END

- ##START) and end delimiter (##END).
- `upload_image`. Equip the character with a vision function and it will automatically call the latest image memory during the dialogue.
- `upload_image` according to your scenario, e.g., when the call button is pressed, or every 5 seconds.
- `input_audio_format:opus` during auth and pack according to “Uplink Opus Frame Specification (6.2.4.1)”
- `input_audio_format:opus`)

# Force-stop the current dialogue (valid in Auto mode only)
##START\x0500000000[0000]##STOP_VAD##END
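
The raw Opus stream layout described in this document (a 2-byte big-endian length prefix before each 60 ms frame, with several complete frames allowed per packet) can be sketched as below. The helper names `pack_opus_frames` and `unpack_opus_frames` are hypothetical, and the frames are assumed to be already Opus-encoded by a separate encoder. The result is the payload that would go inside an AUDIO_FRAME message.

```python
import struct
from typing import Iterable, List


def pack_opus_frames(frames: Iterable[bytes]) -> bytes:
    """Prefix each encoded Opus frame with its length as a 2-byte big-endian integer."""
    out = bytearray()
    for frame in frames:
        if len(frame) > 0xFFFF:
            raise ValueError("frame does not fit a 2-byte length prefix")
        out += struct.pack(">H", len(frame)) + frame
    return bytes(out)


def unpack_opus_frames(payload: bytes) -> List[bytes]:
    """Split a payload of length-prefixed frames back into individual frames."""
    frames = []
    offset = 0
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from(">H", payload, offset)
        offset += 2
        if offset + length > len(payload):
            raise ValueError("truncated Opus frame")
        frames.append(payload[offset:offset + length])
        offset += length
    return frames
```

The same unpacking routine should apply to downstream audio when `format:opus` is set, since the document states that both directions use the same frame stream.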