Adventists - API Doc

Connect with TCP

AI Adventist Faction TCP Communication Protocol Documentation#

Version 2.5
Last Updated: 2025-07-15

Table of Contents#

1. Document Information
2. System Architecture
3. Connection Management
   3.1 Connection Establishment
   3.2 Connection Maintenance
   3.3 Connection Termination
4. Message Protocol
5. Authentication Mechanism
6. Data Transmission
7. Error Handling
8. Best Practices
9. Examples
10. FAQ

1. Document Information#

1.1 Version History#

| Version | Date | Description |
|---------|------|-------------|
| 2.5 | 2025-01-15 | Added VAD (Voice Activity Detection) feature, supporting both auto and manual modes to automatically detect the end of user speech. |
| 2.4.1 | 2025-08-11 | Added input parameter input_audio_format, supporting raw Opus input (2-byte length prefix + 60 ms frame); audio is converted to PCM before ASR; documentation updated. |
| 2.4 | 2025-07-15 | Added SPEAK message type, allowing characters to directly speak specified content. |
| 2.3 | 2025-07-10 | Changed connection timeout to 5 minutes and authentication timeout to 5 seconds; heartbeat packets no longer recommended; added connection close operation. |
| 2.2 | 2025-04-23 | Added audio format selection feature, supporting both PCM and Opus formats. |
| 2.1 | 2025-03-15 | Added support for the new model AnyoneV2.1. |
| 1.32 | 2025-01-03 | Added optional parameters for authentication messages and new prompt receipt message. |
| 1.31 | 2024-12-29 | Changed message start/end markers to ##START and ##END, and revised misleading descriptions in the documentation. |
| 1.30 | 2024-12-27 | Unified message structure and added new END_FRAME frame type. |

2. System Architecture#

2.1 Server Configuration#

Default listening address: ai.depthsdata.com
V2.1 production port: 8007
V2.2 test port: 8010

2.2 Performance Parameters#

Authentication timeout: 5 seconds
Session timeout: 300 seconds (5 minutes)
Maximum message size: 64 KB

3. Connection Management#

3.1 Connection Establishment#

1. Client initiates TCP connection.
2. Server accepts the connection.
3. Client must send an authentication message within 5 seconds.
4. Session is established upon successful authentication.

3.2 Connection Maintenance#

Session timeout: 300 seconds (5 minutes).
Heartbeat mechanism:
# Client sends heartbeat
##START\x0500000000[0000]##PING##END

# Server responds
##START\x0500000000[0000]##INFO:PONG##END
Note:
1. Heartbeat packets use the STATUS type (\x05).
2. Use the system task ID (00000000).
3. Use a fixed sequence number (0000).
4. Heartbeat functionality remains available but is no longer the recommended way to keep the connection alive.

3.3 Connection Termination#

The client may either actively close the connection or wait for the server to close it after timeout.

3.3.1 Active Close#

The client can actively disconnect by sending a close message:
##START\x0500000000[0000]##DISCONNECT##END

3.3.2 Close Response#

After receiving the close message, the server will respond:
##START\x0500000000[0000]##INFO:DISCONNECT 3 seconds##END

3.3.3 Close Flow#

1.
Client sends close message.
2.
Server sends close response.
3.
Server closes the TCP connection after a 3-second delay.
4.
Client detects the connection closure.

3.3.4 Abnormal Close#

If the client drops the TCP connection directly, the server will detect it and clean up the associated resources.
If the server closes unexpectedly, the client should implement a reconnection mechanism.

4. Message Protocol#

4.1 Basic Message Structure#

All messages follow a unified format:
##START[MessageType][TaskID][SequenceNumber][MessageContent]##END
Note: Square brackets [] are used only to illustrate the format and are not included in the actual data.
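As an illustrative sketch (the helper name is our own; the byte layout follows this section and the field sizes in 4.3), a message can be assembled like this:

```python
# Build a protocol message: ##START + type (1 byte) + task ID (8 bytes)
# + sequence number (4 bytes) + content + ##END.
# Helper name is illustrative; the wire format is from section 4.1.
def build_message(msg_type: bytes, task_id: bytes, seq: bytes, content: bytes = b"") -> bytes:
    assert len(msg_type) == 1, "message type is a single byte"
    assert len(task_id) == 8, "task ID is exactly 8 ASCII bytes"
    assert len(seq) == 4, "sequence number is exactly 4 ASCII bytes"
    return b"##START" + msg_type + task_id + seq + content + b"##END"

# A TEXT message (\x04) for task "task0001", sequence 0000:
msg = build_message(b"\x04", b"task0001", b"0000", b"Hello")
# -> b'##START\x04task00010000Hello##END'
```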

4.2 Message Type Definitions (1 byte)#

| Type Name   | Hex Value | Description |
|-------------|-----------|-------------|
| AUTH        | `\x01`    | Authentication message |
| AUDIO_FRAME | `\x02`    | Audio-frame data |
| END_FRAME   | `\x03`    | End-of-stream marker |
| TEXT        | `\x04`    | Text message |
| STATUS      | `\x05`    | Status message (can be used for client heartbeats) |
| MCP         | `\x06`    | MCP message (optional extension) |
| SPEAK       | `\x07`    | Speak specified content (added in v2.4) |

4.3 Message Field Description#

4.3.1 Frame Markers#

Start marker: b'##START'
End marker: b'##END'

4.3.2 Message Type (1 byte)#

Encoded as a single byte.
Valid range: b'\x01' – b'\x07'.

4.3.3 Task ID (8 bytes)#

Fixed-length ASCII string of 8 bytes.
Special value: 00000000 (reserved for system messages).
Any other value: client-defined, must be exactly 8 bytes.
Recommended charset: letters and digits.
Examples:
00000000 – system message
task0001 – ordinary task
abcd1234 – ordinary task

4.3.4 Sequence Number (4 bytes)#

Fixed-length ASCII string of 4 bytes.
Range: 0000–9999.
Incremented sequentially per sender.
Special value: 0000 (used for status messages).
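A minimal helper for incrementing the 4-digit ASCII sequence number might look like this (wrap-around at 9999 is an assumption; the spec only states the 0000–9999 range):

```python
# Increment a 4-digit ASCII sequence number, keeping fixed width.
# Wrap-around behaviour (9999 -> 0000) is assumed, not specified.
def next_seq(seq: str) -> str:
    return f"{(int(seq) + 1) % 10000:04d}"

next_seq("0000")  # '0001'
next_seq("9999")  # '0000' (assumed wrap-around)
```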

5. Authentication Mechanism#

5.1 Authentication Message Format#

##START\x0100000000[0000]JWT_TOKEN##[Param1]:[Value1]##[Param2]:[Value2]...##END

5.2 Optional Parameters#

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| voiceid | string | Character default voice | Voice ID |
| emotion_status (WIP) | string | true/false | Whether to return emotion status |
| lang | string | "default" | Language setting. Natural-language values accepted: "default", "auto", "中文", "English", "JA", "kr", etc. |
| format | string | pcm | Output audio format. Affects audio data returned by the server. Allowed: pcm, opus |
| input_audio_format | string | pcm | Input audio format. Affects how the server parses upstream audio. Allowed: pcm, opus (raw stream: 2-byte length prefix + 60 ms frame) |
| in_rate | int | 16000 | Input sample rate (logged/validated; server always resamples to 16 kHz) |
| in_channels | int | 1 | Input channel count (logged/validated; server always converts to mono) |
| in_frame_ms | int | 60 | Input frame length in milliseconds; recommended frame size for Opus input |
| mode | string | manual | New in v2.5. Audio processing mode. Allowed: manual (manual mode), auto (automatic VAD mode) |

5.3 Authentication Example#

# Basic auth (token only)
##START\x01000000000000JWT_TOKEN##END

# Single parameter
##START\x01000000000000JWT_TOKEN##voiceid:voice1##END

# Multiple parameters
##START\x01000000000000JWT_TOKEN##voiceid:voice1##stage_mode:true##fast_mode:true##END

# Set downstream audio format to Opus
##START\x01000000000000JWT_TOKEN##format:opus##END

# Set upstream audio format to raw Opus
##START\x01000000000000JWT_TOKEN##input_audio_format:opus##END

# Enable auto mode (VAD auto-detection) – new in v2.5
##START\x01000000000000JWT_TOKEN##mode:auto##input_audio_format:opus##END

# Auto mode with PCM input
##START\x01000000000000JWT_TOKEN##mode:auto##input_audio_format:pcm##END
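The examples above can be generated programmatically. This sketch assembles an AUTH message (\x01) with optional key:value parameters from section 5.2; `JWT_TOKEN` is a placeholder and the helper name is our own:

```python
# Assemble an AUTH message: system task ID 00000000, sequence 0000,
# token followed by optional ##key:value parameter pairs.
def build_auth(token: str, **params: str) -> bytes:
    body = token + "".join(f"##{k}:{v}" for k, v in params.items())
    return b"##START\x01" + b"00000000" + b"0000" + body.encode("utf-8") + b"##END"

build_auth("JWT_TOKEN", mode="auto", input_audio_format="opus")
# -> b'##START\x01000000000000JWT_TOKEN##mode:auto##input_audio_format:opus##END'
```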

5.4 Authentication Response#

Success:
##START\x0500000000[0000]##INFO:Authentication succeeded, NPCID: <npcid>, mode: <mode>##END
Failure:
##START\x0500000000[0000]##ERROR:token error##END

5.5 VAD Mode Description (new in v2.5)#

5.5.1 Manual Mode (default)#

Client has full control over when audio capture starts and stops.
Must send END_FRAME to mark the end of speech.
Fully backward-compatible with existing clients.
Best for push-to-talk or manually-controlled scenarios.

5.5.2 Auto Mode (VAD automatic detection)#

Server uses Voice-Activity Detection (VAD) to automatically discover the start and end of user speech.
Supports both PCM and Opus: input_audio_format:opus is recommended for lower bandwidth.
Client is not required to send END_FRAME.
Server proactively pushes listening-state messages.
Ideal for hands-free, natural conversations.

5.5.3 Listening-State Messages (Auto-mode only)#

The server pushes the following status messages to the client:
# Start listening
##START\x05[TaskID][0000]##LISTEN:{"session_id":"<session_id>","type":"listen","state":"start","mode":"auto"}##END

# Stop listening
##START\x05[TaskID][0000]##LISTEN:{"session_id":"<session_id>","type":"listen","state":"stop","mode":"auto"}##END

# Noise-filter hint
##START\x05[TaskID][0000]##INFO:Noise or silence detected, still listening##END
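The payload after `##LISTEN:` is ordinary JSON, so it can be handled with a standard JSON library. A minimal sketch, using the example payload from above:

```python
import json

# Parse a LISTEN status payload (the text after "##LISTEN:") pushed in auto mode.
payload = '{"session_id":"task0001","type":"listen","state":"start","mode":"auto"}'
event = json.loads(payload)
# event["state"] is 'start' (begin streaming audio) or 'stop' (server is processing).
```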

5.6 Quick-Start with PCM (recommended for first integration)#

Use PCM for the simplest first-time integration:
During auth, omit format/input_audio_format (or explicitly set both to pcm).
Send upstream 16 kHz, mono, 16-bit PCM bytes (you may chunk every 60 ms, but it is optional).
After each utterance, always send an END_FRAME.
Downstream you will receive PCM audio frames—play them directly or buffer as needed.
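The upstream half of this quick start can be sketched as a helper that splits one PCM utterance into AUDIO_FRAME messages plus the trailing END_FRAME (the function name and the optional 60 ms chunking are our own choices; 60 ms at 16 kHz mono 16-bit is 1920 bytes):

```python
# Split a PCM utterance into AUDIO_FRAME (\x02) messages followed by an
# END_FRAME (\x03). Chunking PCM every 60 ms (1920 bytes) is optional.
def pcm_utterance(task_id: bytes, pcm: bytes, chunk: int = 1920) -> list:
    msgs, seq = [], 0
    for i in range(0, len(pcm), chunk):
        msgs.append(b"##START\x02" + task_id + b"%04d" % seq + pcm[i:i + chunk] + b"##END")
        seq += 1
    msgs.append(b"##START\x03" + task_id + b"%04d" % seq + b"##END")  # END_FRAME
    return msgs
```

Each returned message can be written to the socket in order; the END_FRAME tells the server the utterance is complete.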

6. Data Transmission#

6.1 Text-Message Transmission#

6.1.1 Client → Server#

1. Text frame:
##START\x04[Task ID][0000][text content]##END

2. End frame:
##START\x03[Task ID][0001]##END

6.1.2 Server Response#

1. Prompt receipt:
##START\x05[TaskID][0000]##INFO:prompt: [PromptContent]##END

2. Response text:
##START\x04[TaskID][0000][ResponseContent]##END

3. Audio data (if any):
##START\x02[TaskID][0001][AudioData1]##END
##START\x02[TaskID][0002][AudioData2]##END
...

4. End-of-stream frame:
##START\x03[TaskID][LastSeqNo+1]##END

6.2 Audio Message Transmission#

6.2.1 Client → Server#

1.
Audio frame sequence:
##START\x02[TaskID][0000][AudioData1]##END
##START\x02[TaskID][0001][AudioData2]##END
...
2.
End-of-stream frame:
##START\x03[TaskID][LastSeqNo+1]##END

6.2.2 Server Response#

1.
Text response:
##START\x04[TaskID][0000][ResponseText]##END
2.
Audio response sequence:
##START\x02[TaskID][0001][AudioData1]##END
##START\x02[TaskID][0002][AudioData2]##END
...
3.
End-of-stream frame:
##START\x03[TaskID][LastSeqNo+1]##END

6.2.3 Audio Format Control (input / output separated)#

Output format is set by format: pcm or opus
opus = raw stream: 60 ms per frame, preceded by a 2-byte big-endian length header; several complete frames may be concatenated in one network packet; the server tries to pack as many full frames as possible into each TCP frame
Input format is set by input_audio_format: pcm or opus
When opus, the client must send the same raw Opus frame stream as used for output (2-byte length prefix + 60 ms frame, 16 kHz mono). The server decodes it to PCM before feeding it to ASR

6.2.4 Advanced Guide: Opus (optional)#

Use-case: mobile networks, bandwidth-limited or cost-sensitive scenarios
Enable:
Down-link: add format:opus during auth
Up-link: add input_audio_format:opus during auth
Packing: 60 ms frames, 2-byte big-endian length, multiple frames can be chained; each network packet may contain any integer number of complete frames
Alignment rule: never split a single “length-prefix + frame” unit across packets—keep each frame intact
6.2.4.1 Up-stream Opus Frame Stream Specification (input_audio_format=opus)#
Encoding & frame size: Opus, 60 ms/frame (16 kHz mono, ~960 samples/frame)
Frame boundary: 2-byte big-endian unsigned length immediately followed by the Opus payload
Payload assembly: concatenate any number of “length + frame” tuples
Network packing:
Strategies: 1 frame/packet or many frames per packet are both allowed
Recommended: keep each AUDIO_FRAME payload ≤ ~1 KB to reduce fragmentation and sticky-packet risk
Never split a single “length + frame” structure across packets
End marker: send END_FRAME after the utterance; the server will aggregate all up-stream frames before recognition
Fallback: if input_audio_format is missing or invalid, the server treats the stream as PCM
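The packing rules above can be sketched as follows. This assumes already-encoded Opus payloads as input (the helper name and the flush strategy are our own); each output packet respects the ~1 KB recommendation and never splits a "length + frame" unit:

```python
import struct

# Pack Opus frames (2-byte big-endian length prefix + payload) into
# AUDIO_FRAME payloads of at most max_bytes, never splitting a unit.
def pack_opus(frames: list, max_bytes: int = 1024) -> list:
    packets, cur = [], b""
    for frame in frames:
        unit = struct.pack(">H", len(frame)) + frame
        if cur and len(cur) + len(unit) > max_bytes:
            packets.append(cur)  # flush: next unit would exceed the limit
            cur = b""
        cur += unit
    if cur:
        packets.append(cur)
    return packets
```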

6.3 SPEAK Message Transmission (new in v2.4)#

6.3.1 Client → Server#

Ask the character to speak given text directly (bypasses LLM inference, goes straight to TTS):
##START\x07[TaskID][SeqNo][TextContent]##END
[TaskID]: 8 bytes, defined by the client
[SeqNo]: 4 bytes, recommended to start from 0000
[TextContent]: the words to be spoken by the character
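A one-line builder for the SPEAK message, following the layout above (the helper name is illustrative):

```python
# Build a SPEAK message (\x07): the text is synthesised directly by TTS,
# bypassing LLM inference.
def build_speak(task_id: bytes, seq: bytes, text: str) -> bytes:
    return b"##START\x07" + task_id + seq + text.encode("utf-8") + b"##END"

build_speak(b"task0001", b"0000", "Welcome!")
# -> b'##START\x07task00010000Welcome!##END'
```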

6.3.2 Server Response#

After receiving the SPEAK message, the server directly converts the text content into an audio stream and returns it in the same format as a regular audio response.
1. Audio data frames:
##START\x02[TaskID][0001][AudioData1]##END
##START\x02[TaskID][0002][AudioData2]##END
...

2. End frame:
##START\x03[TaskID][LastSeqNo+1]##END

3. Status message (optional):
##START\x05[TaskID][0000]##INFO:TTS completed##END

6.3.3 Usage Scenarios#

When you need the character to directly announce a piece of text (e.g., system prompts, external commands, etc.)
No inference involved; direct playback
Suitable for external control, customized announcements, and similar scenarios

7. Error Handling#

7.1 Error Message Format#

##START\x05[TaskID][0000]##ERROR:Error description##END

7.2 Error Types#

1. Authentication Errors
   TOKEN_ERROR: Invalid token
   AUTH_TIMEOUT: Authentication timeout
   INVALID_NPCID: Invalid NPCID
2. Protocol Errors
   INVALID_FORMAT: Malformed message
   SEQUENCE_ERROR: Incorrect sequence number
   FRAME_INCOMPLETE: Incomplete frame
3. Business Errors
   AUDIO_PROCESS_ERROR: Audio processing failed
   TEXT_PROCESS_ERROR: Text processing failed
   RESOURCE_ERROR: Resource unavailable

8. Best Practices#

8.1 Implementation Tips#

1. Use frame delimiters to correctly identify message boundaries.
2. Implement automatic reconnection on disconnect.
3. Keep heartbeat detection available if needed (optional since v2.3; no longer the recommended keep-alive method).
4. Set reasonable timeout values.

8.2 Performance Optimization#

1. Control audio frame size.
2. Use message queues.
3. Enable message compression.
4. Optimize memory usage.

9. Examples#

9.1 Complete Session Flow (PCM, recommended for getting started)#

# Authentication (format not explicitly set; both upstream and downstream default to PCM)
C -> S: ##START\x01000000000000JWT_TOKEN##voiceid:voice1##END
S -> C: ##START\x05000000000000##INFO:Authentication successful, NPCID: <npcid>##END

# Text conversation
C -> S: ##START\x04123456780000Hello##END
C -> S: ##START\x03123456780001##END

S -> C: ##START\x04123456780000Hello, nice to meet you##END
S -> C: ##START\x02123456780001[AudioData1]##END
S -> C: ##START\x02123456780002[AudioData2]##END
S -> C: ##START\x03123456780003##END

9.2 Advanced Example (Opus Manual Mode)#

# Authentication (Opus enabled for both downstream and upstream)
C -> S: ##START\x01000000000000JWT_TOKEN##voiceid:voice1##format:opus##input_audio_format:opus##END
S -> C: ##START\x05000000000000##INFO:Authentication successful, NPCID: <npcid>, mode: manual##END

# Uplink audio (frame stream: 2-byte length prefix + 60 ms frames, multiple frames/packet allowed)
C -> S: ##START\x02task00010000[len+frame][len+frame]...##END
...
C -> S: ##START\x03task00010001##END

# Downlink audio (server aggregates and returns as many frames as possible per packet)
S -> C: ##START\x02task00010001[AudioData1]##END
S -> C: ##START\x02task00010002[AudioData2]##END
S -> C: ##START\x03task00010003##END

9.3 Auto Mode Example (VAD auto-detection) – added in v2.5#

# Authentication (enable auto mode + Opus format)
C -> S: ##START\x01000000000000JWT_TOKEN##mode:auto##input_audio_format:opus##format:opus##END
S -> C: ##START\x05000000000000##INFO:Authentication successful, NPCID: <npcid>, mode: auto##END

# Server proactively sends "start listening"
S -> C: ##START\x05000000000000##LISTEN:{"session_id":"00000000","type":"listen","state":"start","mode":"auto"}##END

# Client starts streaming audio (no END_FRAME needed)
C -> S: ##START\x02task00010000[len+frame][len+frame]...##END
C -> S: ##START\x02task00010001[len+frame][len+frame]...##END
...

# VAD detects end-of-speech; server sends "stop listening"
S -> C: ##START\x05task00010000##LISTEN:{"session_id":"task0001","type":"listen","state":"stop","mode":"auto"}##END

# Server processes and returns response
S -> C: ##START\x05task00010000##INFO:prompt: what the user said##END
S -> C: ##START\x04task00010000AI's reply text##END
S -> C: ##START\x02task00010001[AudioData1]##END
S -> C: ##START\x02task00010002[AudioData2]##END
S -> C: ##START\x03task00010003##END

# After response finishes, server resumes listening
S -> C: ##START\x05000000000000##LISTEN:{"session_id":"00000000","type":"listen","state":"start","mode":"auto"}##END

# Loop: client keeps sending audio, server auto-detects...

# Force-end dialogue example
C -> S: ##START\x0500000000[0000]##STOP_VAD##END
S -> C: ##START\x0500000000[0000]##INFO:Forcibly ending dialogue, processing current audio##END
S -> C: ##START\x0500000000[0000]##LISTEN:{"session_id":"00000000","type":"listen","state":"start","mode":"auto"}##END

9.4 Auto Mode Noise-Handling Example#

# Client sends audio
C -> S: ##START\x02task00020000[ambient noise]##END
C -> S: ##START\x02task00020001[ambient noise]##END

# VAD detects "end-of-speech" but STT result is invalid
S -> C: ##START\x05task00020000##LISTEN:{"session_id":"task0002","type":"listen","state":"stop","mode":"auto"}##END
S -> C: ##START\x05task00020000##INFO:Noise or silence detected, continuing to listen##END

# Server immediately resumes listening without user intervention
S -> C: ##START\x05task00020000##LISTEN:{"session_id":"task0002","type":"listen","state":"start","mode":"auto"}##END

10. FAQ#

10.1 Connection Issues#

Q: How to handle connection timeout?
A: The server will drop the connection after 300 s of inactivity. The client should
1. Implement a heartbeat to keep the connection alive (optional).
2. Automatically reconnect when the connection is lost.
3. Re-send the authentication message after reconnecting.
4. Or actively send a close message to disconnect gracefully.

10.2 Message Integrity#

Q: How to guarantee message integrity?
A: Use the frame start delimiter (##START) and end delimiter (##END).
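A typical approach is to accumulate received bytes in a buffer and extract only complete frames. This sketch (our own, not from the spec) returns the inner payloads of complete messages and keeps any partial frame in the buffer; it assumes binary payloads do not themselves contain the delimiter bytes, which delimiter-based framing cannot rule out:

```python
# Extract complete messages from a TCP receive buffer using ##START / ##END.
# Returns (payloads, remaining_buffer); a partial frame stays buffered.
def extract_frames(buf: bytes):
    msgs = []
    while True:
        start = buf.find(b"##START")
        if start < 0:
            return msgs, buf
        end = buf.find(b"##END", start)
        if end < 0:
            return msgs, buf[start:]  # incomplete frame: keep for next recv
        msgs.append(buf[start + 7:end])  # strip ##START and ##END
        buf = buf[end + 5:]
```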

10.3 Security#

Q: How to secure the communication?
A: Use JWT authentication.

10.4 Video Chat#

Q: How to start a video conversation?
A: Upload the most recent image memory via upload_image. Equip the character with a vision function and it will automatically call the latest image memory during the dialogue.
The upload timing is decoupled from the conversation itself; you can decide when to call upload_image according to your scenario—e.g., when the call button is pressed, or every 5 seconds.

10.5 Choosing an Audio Format#

Q: Which audio format should I use?
A:
PCM is lossless, high quality but large
Opus is compressed, smaller but requires decoding
Use PCM when bandwidth is plentiful
Use Opus when bandwidth is limited
On mobile devices prefer Opus to save data
If you also need Opus uplink, set input_audio_format:opus during auth and pack according to “Uplink Opus Frame Specification (6.2.4.1)”

10.6 Selecting a VAD Mode (new in 2.5)#

Q: Should I choose Manual or Auto mode?
A:
Manual
Best for push-to-talk
Client fully controls recording
Widest compatibility, all formats supported
Good when precise control is required
Auto
Best for hands-free conversation
More natural user experience
Supports PCM and Opus (recommend input_audio_format:opus)
Auto-filters ambient noise
Ideal for smart speakers, in-car systems, etc.

10.7 VAD-Related Issues#

Q: How does Auto mode cope with network latency?
A: VAD runs on the server; latency may affect response time. Tips:
Use a stable connection
Tune VAD parameters (contact support)
For ultra-low latency consider Manual mode
Q: Which audio formats does Auto mode support?
A: PCM and Opus
PCM: lossless, simple processing, good for LAN
Opus: better compression, less delay, good for mobile
Opus is recommended for real-time performance
Q: Auto mode triggers noise detection too often—what to do?
A: Possible causes & fixes:
Loud environment → improve recording conditions
Mic gain too high → lower client recording level
VAD threshold too low → contact support for tuning
Q: How to force-stop the current dialogue in Auto mode?
A: Send a special STATUS message to abort:
# Force-stop the current dialogue (valid in Auto mode only)
##START\x0500000000[0000]##STOP_VAD##END
Function Description:
Immediately processes any currently accumulated audio data (if present)
Resets the VAD state
Clears the audio buffer
Restarts listening
Valid only in Auto mode; in Manual mode, an informational message is returned
Use Cases:
User wants to end the current utterance immediately
System detects an abnormal condition requiring a reset
Client needs to take active control of the dialogue flow
Modified on 2025-09-15 03:48:22