AI Adventist Faction TCP Communication Protocol Documentation

Version 2.4
Last Updated: 2025-07-15

1. Document Information

1.1 Version History

| Version | Date | Description |
|---------|------|-------------|
| 2.5 | 2025-01-15 | Added VAD (Voice Activity Detection) feature, supporting both auto and manual modes to automatically detect the end of user speech. |
| 2.4.1 | 2025-08-11 | Added input parameter `input_audio_format`, supporting raw Opus input (2-byte length prefix + 60 ms frame); audio is converted to PCM before ASR; documentation updated. |
| 2.4 | 2025-07-15 | Added `SPEAK` message type, allowing characters to directly speak specified content. |
| 2.3 | 2025-07-10 | Changed connection timeout to 5 minutes and authentication timeout to 5 seconds; heartbeat packets no longer recommended; added connection close operation. |
| 2.2 | 2025-04-23 | Added audio format selection feature, supporting both PCM and Opus formats. |
| 2.1 | 2025-03-15 | Added support for the new model AnyoneV2.1. |
| 1.32 | 2025-01-03 | Added optional parameters for authentication messages and new prompt receipt message. |
| 1.31 | 2024-12-29 | Changed message start/end markers to `##START` and `##END`, and revised misleading descriptions in the documentation. |
| 1.30 | 2024-12-27 | Unified message structure and added new `END_FRAME` frame type. |

2. System Architecture

2.1 Server Configuration

Default listening address: ai.depthsdata.com

V2.1 production port: 8007

V2.2 test port: 8010

2.2 Performance Parameters

Authentication timeout: 5 seconds

Session timeout: 300 seconds (5 minutes)

Maximum message size: 64 KB

3. Connection Management

3.1 Connection Establishment

Client initiates TCP connection.

Server accepts the connection.

Client must send an authentication message within 5 seconds.

Session is established upon successful authentication.

3.2 Connection Maintenance

Session timeout: 300 seconds (5 minutes).

Heartbeat mechanism:

# Client sends heartbeat
##START\x0500000000[0000]##PING##END

# Server responds
##START\x0500000000[0000]##INFO:PONG##END

Note:

Heartbeat packets use the STATUS type (\x05).

Use the system task ID (00000000).

Use a fixed sequence number (0000).

Heartbeat functionality remains available but is no longer the recommended way to keep the connection alive.

3.3 Connection Termination

The client may either actively close the connection or wait for the server to close it after timeout.

3.3.1 Active Close

The client can actively disconnect by sending a close message:

##START\x0500000000[0000]##DISCONNECT##END

3.3.2 Close Response

After receiving the close message, the server will respond:

##START\x0500000000[0000]##INFO:DISCONNECT 3 seconds##END

3.3.3 Close Flow

Client sends close message.

Server sends close response.

Server closes the TCP connection after a 3-second delay.

Client detects the connection closure.
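
The close flow above can be sketched as follows. This is a minimal illustration, not a reference client: the function name `graceful_close` and the `wait_seconds` parameter are assumptions introduced here, and the socket is assumed to be already connected and authenticated.

```python
import socket

# Close message from section 3.3.1 (STATUS type, system task ID, sequence 0000).
DISCONNECT = b"##START\x05" + b"00000000" + b"0000" + b"##DISCONNECT##END"

def graceful_close(sock: socket.socket, wait_seconds: float = 5.0) -> bytes:
    """Send the close message, then read until the server closes the connection.

    Returns whatever the server sent before closing (normally the close response).
    wait_seconds is a per-recv timeout and is an assumption; it only needs to be
    longer than the server's 3-second delay.
    """
    sock.sendall(DISCONNECT)
    sock.settimeout(wait_seconds)
    received = b""
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # empty read: the peer closed the TCP connection
                break
            received += chunk
    except socket.timeout:
        pass  # server did not close in time; close locally anyway
    finally:
        sock.close()
    return received
```

Reading until the peer closes, rather than closing right after sending, lets the client observe the close response and confirm that the server ended the session.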

3.3.4 Abnormal Close

If the client drops the TCP connection directly, the server will detect it and clean up the associated resources.

If the server closes unexpectedly, the client should implement a reconnection mechanism.

4. Message Protocol

4.1 Basic Message Structure

All messages follow a unified format:

##START[MessageType][TaskID][SequenceNumber][MessageContent]##END

Note: Square brackets [] are used only to illustrate the format and are not included in the actual data.
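
As a rough illustration of the layout above, the following sketch assembles one frame from its fields. The helper name `build_message` and its validation rules are assumptions made for this example; only the field order and sizes come from this document.

```python
START = b"##START"
END = b"##END"

def build_message(msg_type: int, task_id: str, seq: int, content: bytes = b"") -> bytes:
    """Assemble ##START + type (1 byte) + task ID (8) + sequence (4) + content + ##END."""
    if not 0x01 <= msg_type <= 0x07:
        raise ValueError("message type must be in the range 0x01-0x07")
    task = task_id.encode("ascii")
    if len(task) != 8:
        raise ValueError("task ID must be exactly 8 ASCII bytes")
    if not 0 <= seq <= 9999:
        raise ValueError("sequence number must be in the range 0000-9999")
    # The sequence number is sent as 4 ASCII digits, zero-padded.
    return START + bytes([msg_type]) + task + f"{seq:04d}".encode("ascii") + content + END

# The heartbeat from section 3.2, rebuilt with the helper.
ping = build_message(0x05, "00000000", 0, b"##PING")
```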

4.2 Message Type Definitions (1 byte)

| Type Name   | Hex Value | Description |
|-------------|-----------|-------------|
| AUTH        | `\x01`    | Authentication message |
| AUDIO_FRAME | `\x02`    | Audio-frame data |
| END_FRAME   | `\x03`    | End-of-stream marker |
| TEXT        | `\x04`    | Text message |
| STATUS      | `\x05`    | Status message (can be used for client heartbeats) |
| MCP         | `\x06`    | MCP message (optional extension) |
| SPEAK       | `\x07`    | Speak specified content (added in v2.4) |

4.3 Message Field Description

4.3.1 Frame Markers

Start marker: b'##START'

End marker: b'##END'

4.3.2 Message Type (1 byte)

Encoded as a single byte.

Valid range: b'\x01' – b'\x07' (see the type table in 4.2).

4.3.3 Task ID (8 bytes)

Fixed-length ASCII string of 8 bytes.

Special value: 00000000 (reserved for system messages).

Any other value: client-defined, must be exactly 8 bytes.

Recommended charset: letters and digits.

Examples:

00000000 – system message

task0001 – ordinary task

abcd1234 – ordinary task

4.3.4 Sequence Number (4 bytes)

Fixed-length ASCII string of 4 bytes.

Range: 0000–9999.

Incremented sequentially per sender.

Special value: 0000 (used for status messages).
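
The fixed field widths make a complete frame easy to split by position. The sketch below assumes it is given exactly one frame, markers included; `Message` and `parse_message` are names chosen for this example.

```python
from typing import NamedTuple

class Message(NamedTuple):
    msg_type: int   # 1 byte, e.g. 0x04 for TEXT
    task_id: str    # 8 ASCII bytes
    seq: int        # 4 ASCII digits
    content: bytes  # everything between the header and ##END

def parse_message(frame: bytes) -> Message:
    """Split one complete frame into its four fields."""
    start, end = b"##START", b"##END"
    if not (frame.startswith(start) and frame.endswith(end)):
        raise ValueError("missing ##START or ##END marker")
    body = frame[len(start):len(frame) - len(end)]
    if len(body) < 13:  # 1 (type) + 8 (task ID) + 4 (sequence number)
        raise ValueError("frame too short for type, task ID and sequence number")
    return Message(
        msg_type=body[0],
        task_id=body[1:9].decode("ascii"),
        seq=int(body[9:13].decode("ascii")),
        content=body[13:],
    )
```

For example, `parse_message(b"##START\x04" + b"12345678" + b"0000" + b"Hello##END")` yields type 4, task ID `12345678`, sequence 0 and content `b"Hello"`.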

5. Authentication Mechanism

5.1 Authentication Message Format

##START\x0100000000[0000]JWT_TOKEN##[Param1]:[Value1]##[Param2]:[Value2]...##END

5.2 Optional Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| voiceid | string | Character default voice | Voice ID |
| emotion_status (WIP) | string | true/false | Whether to return emotion status |
| lang | string | "default" | Language setting. Natural-language values accepted: `"default"`, `"auto"`, `"中文"`, `"English"`, `"JA"`, `"kr"`, etc. |
| format | string | pcm | Output audio format. Affects audio data returned by the server. Allowed: `pcm`, `opus` |
| input_audio_format | string | pcm | Input audio format. Affects how the server parses upstream audio. Allowed: `pcm`, `opus` (raw stream: 2-byte length prefix + 60 ms frame) |
| in_rate | int | 16000 | Input sample rate (logged/validated; server always resamples to 16 kHz) |
| in_channels | int | 1 | Input channel count (logged/validated; server always converts to mono) |
| in_frame_ms | int | 60 | Input frame length in milliseconds; recommended frame size for Opus input |
| mode | string | manual | New in v2.5. Audio processing mode. Allowed: `manual` (manual mode), `auto` (automatic VAD mode) |

5.3 Authentication Example

# Basic auth (token only)
##START\x01000000000000JWT_TOKEN##END

# Single parameter
##START\x01000000000000JWT_TOKEN##voiceid:voice1##END

# Multiple parameters
##START\x01000000000000JWT_TOKEN##voiceid:voice1##stage_mode:true##fast_mode:true##END

# Set downstream audio format to Opus
##START\x01000000000000JWT_TOKEN##format:opus##END

# Set upstream audio format to raw Opus
##START\x01000000000000JWT_TOKEN##input_audio_format:opus##END

# Enable auto mode (VAD auto-detection) – new in v2.5
##START\x01000000000000JWT_TOKEN##mode:auto##input_audio_format:opus##END

# Auto mode with PCM input
##START\x01000000000000JWT_TOKEN##mode:auto##input_audio_format:pcm##END
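
The examples above all follow one pattern, which a small helper can generate. This is a sketch under assumptions: the name `build_auth_message` is invented here, and UTF-8 is assumed for parameter values such as `lang`, since the document does not state the text encoding.

```python
def build_auth_message(token: str, **params: str) -> bytes:
    """Build an AUTH frame: token first, then ##name:value for each optional parameter."""
    # AUTH type (\x01), system task ID, sequence number 0000.
    header = b"##START\x01" + b"00000000" + b"0000"
    body = token.encode("ascii")
    for name, value in params.items():
        # UTF-8 is an assumption; it matters only for non-ASCII values.
        body += b"##" + f"{name}:{value}".encode("utf-8")
    return header + body + b"##END"

# Auto mode with Opus in both directions.
auth = build_auth_message("JWT_TOKEN", mode="auto", input_audio_format="opus", format="opus")
```

Keyword arguments keep their order in Python 3.7 and later, so the parameters appear on the wire in the order they are passed.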

5.4 Authentication Response

Success:
##START\x0500000000[0000]##INFO:Authentication succeeded, NPCID: <npcid>, mode: <mode>##END

Failure:
##START\x0500000000[0000]##ERROR:token error##END
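
A client usually needs the NPCID and the negotiated mode from the success message. The sketch below works on the message content only (the part after the sequence number). It assumes that any `##INFO:` content in reply to authentication means success and that the `NPCID:` and `mode:` labels appear as shown above; the function name and the returned dictionary shape are illustrative.

```python
import re

def parse_auth_response(content: bytes) -> dict:
    """Interpret the content of the STATUS message sent in reply to AUTH."""
    text = content.decode("utf-8")
    if text.startswith("##ERROR:"):
        return {"ok": False, "error": text[len("##ERROR:"):]}
    if not text.startswith("##INFO:"):
        raise ValueError("unexpected authentication response")
    # Both fields are optional here so that responses without "mode" still parse.
    npcid = re.search(r"NPCID:\s*([^,]+)", text)
    mode = re.search(r"mode:\s*(\w+)", text)
    return {
        "ok": True,
        "npcid": npcid.group(1).strip() if npcid else None,
        "mode": mode.group(1) if mode else None,
    }
```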

5.5 VAD Mode Description (new in v2.5)

5.5.1 Manual Mode (default)

Client has full control over when audio capture starts and stops.

Must send END_FRAME to mark the end of speech.

Fully backward-compatible with existing clients.

Best for push-to-talk or manually-controlled scenarios.

5.5.2 Auto Mode (VAD automatic detection)

Server uses Voice-Activity Detection (VAD) to automatically discover the start and end of user speech.

Supports both PCM and Opus: input_audio_format:opus is recommended for lower bandwidth.

Client is not required to send END_FRAME.

Server proactively pushes listening-state messages.

Ideal for hands-free, natural conversations.

5.5.3 Listening-State Messages (Auto-mode only)

The server pushes the following status messages to the client:

# Start listening
##START\x05[TaskID][0000]##LISTEN:{"session_id":"<session_id>","type":"listen","state":"start","mode":"auto"}##END

# Stop listening
##START\x05[TaskID][0000]##LISTEN:{"session_id":"<session_id>","type":"listen","state":"stop","mode":"auto"}##END

# Noise-filter hint
##START\x05[TaskID][0000]##INFO:Noise or silence detected, still listening##END
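
The `LISTEN` content is a JSON object behind a fixed prefix, so a client can read the state with the standard `json` module. A minimal sketch, assuming the content has already been separated from the frame header; `parse_listen_state` is a name chosen here.

```python
import json
from typing import Optional

def parse_listen_state(content: bytes) -> Optional[str]:
    """Return "start" or "stop" for a LISTEN status message, or None for other content."""
    prefix = b"##LISTEN:"
    if not content.startswith(prefix):
        return None  # e.g. the ##INFO: noise hint
    event = json.loads(content[len(prefix):].decode("utf-8"))
    return event.get("state")
```

A client might use the returned state to drive a microphone indicator, while treating the noise hint as informational only.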

5.6 Quick-Start with PCM (recommended for first integration)

Use PCM for the simplest first-time integration:

During auth, omit format/input_audio_format (or explicitly set both to pcm).

Send upstream 16 kHz, mono, 16-bit PCM bytes (you may chunk every 60 ms, but it is optional).

After each utterance, always send an END_FRAME.

Downstream you will receive PCM audio frames—play them directly or buffer as needed.
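
The upstream half of these steps can be sketched as below. At 16 kHz, mono, 16-bit, a 60 ms chunk is 960 samples, or 1920 bytes. The function name `pcm_to_messages` is an assumption, and since the document does not say what happens after sequence number 9999, the sketch simply refuses longer utterances.

```python
from typing import List

# 16000 samples/s * 0.060 s = 960 samples; 2 bytes per sample = 1920 bytes.
FRAME_BYTES = 16000 * 60 // 1000 * 2

def pcm_to_messages(pcm: bytes, task_id: str) -> List[bytes]:
    """Split one utterance into AUDIO_FRAME messages followed by an END_FRAME."""
    task = task_id.encode("ascii")
    if len(task) != 8:
        raise ValueError("task ID must be exactly 8 ASCII bytes")
    messages = []
    seq = 0
    for offset in range(0, len(pcm), FRAME_BYTES):
        if seq > 9998:  # leave room for the END_FRAME; wrap-around is unspecified
            raise ValueError("utterance too long for 4-digit sequence numbers")
        chunk = pcm[offset:offset + FRAME_BYTES]
        messages.append(b"##START\x02" + task + f"{seq:04d}".encode("ascii") + chunk + b"##END")
        seq += 1
    # END_FRAME carries the next sequence number and no content.
    messages.append(b"##START\x03" + task + f"{seq:04d}".encode("ascii") + b"##END")
    return messages
```

One second of audio (32000 bytes) produces 17 audio messages, the last one shorter than the rest, plus the `END_FRAME`.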

6. Data Transmission

6.1 Text-Message Transmission

6.1.1 Client → Server

1. Text frame:
##START\x04[TaskID][0000][TextContent]##END

2. End frame:
##START\x03[TaskID][0001]##END

6.1.2 Server Response

1. Prompt receipt:
##START\x05[TaskID][0000]##INFO:prompt: [PromptContent]##END

2. Response text:
##START\x04[TaskID][0000][ResponseContent]##END

3. Audio data (if any):
##START\x02[TaskID][0001][AudioData1]##END
##START\x02[TaskID][0002][AudioData2]##END
...

4. End-of-stream frame:
##START\x03[TaskID][LastSeqNo+1]##END
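
On the client side, the response sequence above can be gathered into a prompt, a reply text and an audio buffer. The sketch assumes the frames have already been split out of the TCP stream and all belong to one task; `collect_response` is a hypothetical helper, not part of the protocol.

```python
from typing import Iterable, Optional, Tuple

def collect_response(frames: Iterable[bytes]) -> Tuple[Optional[str], str, bytes]:
    """Gather prompt receipt, response text and audio until the end-of-stream frame."""
    prompt: Optional[str] = None
    text = ""
    audio = b""
    for frame in frames:
        body = frame[len(b"##START"):-len(b"##END")]
        msg_type, content = body[0], body[13:]  # skip type + task ID + sequence number
        if msg_type == 0x05 and content.startswith(b"##INFO:prompt:"):
            prompt = content[len(b"##INFO:prompt:"):].decode("utf-8").strip()
        elif msg_type == 0x04:
            text += content.decode("utf-8")
        elif msg_type == 0x02:
            audio += content  # PCM bytes, or length-prefixed Opus frames
        elif msg_type == 0x03:
            break  # END_FRAME: the response is complete
    return prompt, text, audio
```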

6.2 Audio Message Transmission

6.2.1 Client → Server

Audio frame sequence:

##START\x02[TaskID][0000][AudioData1]##END
##START\x02[TaskID][0001][AudioData2]##END
...

End-of-stream frame:

##START\x03[TaskID][LastSeqNo+1]##END

6.2.2 Server Response

Text response:

##START\x04[TaskID][0000][ResponseText]##END

Audio response sequence:

##START\x02[TaskID][0001][AudioData1]##END
##START\x02[TaskID][0002][AudioData2]##END
...

End-of-stream frame:

##START\x03[TaskID][LastSeqNo+1]##END

6.2.3 Audio Format Control (input / output separated)

Output format is set by format: pcm or opus

opus = raw stream: 60 ms per frame, preceded by a 2-byte big-endian length header; several complete frames may be concatenated in one network packet; the server tries to pack as many full frames as possible into each TCP frame

Input format is set by input_audio_format: pcm or opus

When opus, the client must send the same raw Opus frame stream as used for output (2-byte length prefix + 60 ms frame, 16 kHz mono). The server decodes it to PCM before feeding it to ASR
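
To decode downstream Opus audio, a client first has to split each `AUDIO_FRAME` payload back into individual frames. The sketch below does only that splitting, using the 2-byte big-endian length header described above; decoding the frames themselves needs an Opus library and is out of scope. The function name is an assumption.

```python
import struct
from typing import List

def unpack_opus_frames(payload: bytes) -> List[bytes]:
    """Split a payload of [2-byte big-endian length][Opus frame] units into frames."""
    frames = []
    offset = 0
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from(">H", payload, offset)
        offset += 2
        if offset + length > len(payload):
            raise ValueError("truncated Opus frame")
        frames.append(payload[offset:offset + length])
        offset += length
    return frames
```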

6.2.4 Advanced Guide: Opus (optional)

Use-case: mobile networks, bandwidth-limited or cost-sensitive scenarios

Enable:

Down-link: add format:opus during auth

Up-link: add input_audio_format:opus during auth

Packing: 60 ms frames, 2-byte big-endian length, multiple frames can be chained; each network packet may contain any integer number of complete frames

Alignment rule: never split a single “length-prefix + frame” unit across packets—keep each frame intact

6.2.4.1 Up-stream Opus Frame Stream Specification (input_audio_format=opus)

Encoding & frame size: Opus, 60 ms/frame (16 kHz mono, 960 samples/frame)

Frame boundary: 2-byte big-endian unsigned length immediately followed by the Opus payload

Payload assembly: concatenate any number of “length + frame” tuples

Network packing:

Strategies: 1 frame/packet or many frames per packet are both allowed

Recommended: keep each AUDIO_FRAME payload ≤ ~1 KB to reduce fragmentation and sticky-packet risk

Never split a single “length + frame” structure across packets

End marker: send END_FRAME after the utterance; the server will aggregate all up-stream frames before recognition

Fallback: if input_audio_format is missing or invalid, the server treats the stream as PCM
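
The packing rules above can be sketched as a small grouping function. It takes already-encoded Opus frames (the encoder itself is not shown), prefixes each with its length, and groups whole units into payloads of roughly 1 KB. The name `pack_opus_payloads` and the `max_payload` parameter are assumptions for illustration.

```python
import struct
from typing import Iterable, List

def pack_opus_payloads(frames: Iterable[bytes], max_payload: int = 1024) -> List[bytes]:
    """Group length-prefixed Opus frames into AUDIO_FRAME payloads without splitting a unit."""
    payloads: List[bytes] = []
    current = b""
    for frame in frames:
        if len(frame) > 0xFFFF:
            raise ValueError("frame does not fit a 2-byte length prefix")
        unit = struct.pack(">H", len(frame)) + frame  # big-endian unsigned length
        # Start a new payload rather than split this unit across two messages.
        if current and len(current) + len(unit) > max_payload:
            payloads.append(current)
            current = b""
        current += unit
    if current:
        payloads.append(current)
    return payloads
```

Each returned payload becomes the content of one `AUDIO_FRAME` message. A single unit larger than `max_payload` is still sent whole, since the limit is a recommendation while the no-split rule is mandatory.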

6.3 SPEAK Message Transmission (new in v2.4)

6.3.1 Client → Server

Ask the character to speak given text directly (bypasses LLM inference, goes straight to TTS):

##START\x07[TaskID][SeqNo][TextContent]##END

[TaskID]: 8 bytes, defined by the client

[SeqNo]: 4 bytes, recommended to start from 0000

[TextContent]: the text to be spoken by the character

6.3.2 Server Response

After receiving the SPEAK message, the server directly converts the text content into an audio stream and returns it in the same format as a regular audio response.

1. Audio data frames:
##START\x02[TaskID][0001][AudioData1]##END
##START\x02[TaskID][0002][AudioData2]##END
...

2. End frame:
##START\x03[TaskID][LastSeqNo+1]##END

3. Status message (optional):
##START\x05[TaskID][0000]##INFO:TTS completed##END

6.3.3 Usage Scenarios

When you need the character to directly announce a piece of text (e.g., system prompts, external commands, etc.)

No inference involved; direct playback

Suitable for external control, customized announcements, and similar scenarios

7. Error Handling

7.1 Error Message Format

##START\x05[TaskID][0000]##ERROR:Error description##END
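
Errors share the STATUS type with other notifications, so a client has to look at the content prefix to tell them apart. The sketch below splits STATUS content into a kind (`ERROR`, `INFO`, `LISTEN`, `PING`, ...) and a detail string. The function name is an assumption, and the detail is returned as-is because the document does not specify how free-text descriptions map to the error codes in 7.2.

```python
from typing import Tuple

def parse_status(content: bytes) -> Tuple[str, str]:
    """Split STATUS content such as b"##ERROR:token error" into ("ERROR", "token error")."""
    text = content.decode("utf-8")
    if not text.startswith("##"):
        raise ValueError("STATUS content is expected to start with ##")
    # Split on the first colon only, so JSON details keep their own colons.
    kind, _, detail = text[2:].partition(":")
    return kind, detail
```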

7.2 Error Types

Authentication Errors

TOKEN_ERROR: Invalid token

AUTH_TIMEOUT: Authentication timeout

INVALID_NPCID: Invalid NPCID

Protocol Errors

INVALID_FORMAT: Malformed message

SEQUENCE_ERROR: Incorrect sequence number

FRAME_INCOMPLETE: Incomplete frame

Business Errors

AUDIO_PROCESS_ERROR: Audio processing failed

TEXT_PROCESS_ERROR: Text processing failed

RESOURCE_ERROR: Resource unavailable

8. Best Practices

8.1 Implementation Tips

Use frame delimiters to correctly identify message boundaries

Implement automatic reconnection on disconnect

Optionally maintain heartbeat detection (no longer the recommended keep-alive since v2.3; see 3.2)

Set reasonable timeout values
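
Because TCP is a byte stream, one `recv` call may return half a message or several messages at once. The following sketch buffers incoming bytes and cuts them at the frame delimiters. It is a simplified illustration with one known limitation: the document defines no escaping, so binary audio that happens to contain the bytes `##END` would be cut short by this approach. The class name `Deframer` is an assumption.

```python
from typing import List

class Deframer:
    """Accumulates TCP bytes and yields complete ##START...##END frames."""

    HEADER_LEN = 1 + 8 + 4  # type + task ID + sequence number

    def __init__(self) -> None:
        self._buffer = b""

    def feed(self, data: bytes) -> List[bytes]:
        self._buffer += data
        frames = []
        while True:
            start = self._buffer.find(b"##START")
            if start < 0:
                # Keep a short tail in case ##START is split across two reads.
                self._buffer = self._buffer[-6:]
                break
            # Search for ##END only after the fixed-size header.
            end = self._buffer.find(b"##END", start + 7 + self.HEADER_LEN)
            if end < 0:
                self._buffer = self._buffer[start:]  # wait for more data
                break
            frames.append(self._buffer[start:end + 5])
            self._buffer = self._buffer[end + 5:]
        return frames
```

A receive loop would call `feed` with each chunk returned by `recv` and handle every frame it returns.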

8.2 Performance Optimization

Control audio frame size

Use message queues

Enable message compression

Optimize memory usage

9. Examples

9.1 Complete Session Flow (PCM, recommended for getting started)

# Authentication (format not explicitly set; both upstream and downstream default to PCM)
C -> S: ##START\x01000000000000JWT_TOKEN##voiceid:voice1##END
S -> C: ##START\x05000000000000##INFO:Authentication successful, NPCID: <npcid>##END

# Text conversation
C -> S: ##START\x04123456780000Hello##END
C -> S: ##START\x03123456780001##END

S -> C: ##START\x04123456780000Hello, nice to meet you##END
S -> C: ##START\x02123456780001[AudioData1]##END
S -> C: ##START\x02123456780002[AudioData2]##END
S -> C: ##START\x03123456780003##END

9.2 Advanced Example (Opus Manual Mode)

# Authentication (Opus enabled for both downstream and upstream)
C -> S: ##START\x01000000000000JWT_TOKEN##voiceid:voice1##format:opus##input_audio_format:opus##END
S -> C: ##START\x05000000000000##INFO:Authentication successful, NPCID: <npcid>, mode: manual##END

# Uplink audio (frame stream: 2-byte length prefix + 60 ms frames, multiple frames/packet allowed)
C -> S: ##START\x02task00010000[len+frame][len+frame]...##END
...
C -> S: ##START\x03task00010001##END

# Downlink audio (server aggregates and returns as many frames as possible per packet)
S -> C: ##START\x02task00010001[AudioData1]##END
S -> C: ##START\x02task00010002[AudioData2]##END
S -> C: ##START\x03task00010003##END

9.3 Auto Mode Example (VAD auto-detection) – added in v2.5

# Authentication (enable auto mode + Opus format)
C -> S: ##START\x01000000000000JWT_TOKEN##mode:auto##input_audio_format:opus##format:opus##END
S -> C: ##START\x05000000000000##INFO:Authentication successful, NPCID: <npcid>, mode: auto##END

# Server proactively sends "start listening"
S -> C: ##START\x05000000000000##LISTEN:{"session_id":"00000000","type":"listen","state":"start","mode":"auto"}##END

# Client starts streaming audio (no END_FRAME needed)
C -> S: ##START\x02task00010000[len+frame][len+frame]...##END
C -> S: ##START\x02task00010001[len+frame][len+frame]...##END
...

# VAD detects end-of-speech; server sends "stop listening"
S -> C: ##START\x05task00010000##LISTEN:{"session_id":"task0001","type":"listen","state":"stop","mode":"auto"}##END

# Server processes and returns response
S -> C: ##START\x05task00010000##INFO:prompt: what the user said##END
S -> C: ##START\x04task00010000AI's reply text##END
S -> C: ##START\x02task00010001[AudioData1]##END
S -> C: ##START\x02task00010002[AudioData2]##END
S -> C: ##START\x03task00010003##END

# After response finishes, server resumes listening
S -> C: ##START\x05000000000000##LISTEN:{"session_id":"00000000","type":"listen","state":"start","mode":"auto"}##END

# Loop: client keeps sending audio, server auto-detects...

# Force-end dialogue example
C -> S: ##START\x0500000000[0000]##STOP_VAD##END
S -> C: ##START\x0500000000[0000]##INFO:Forcibly ending dialogue, processing current audio##END
S -> C: ##START\x0500000000[0000]##LISTEN:{"session_id":"00000000","type":"listen","state":"start","mode":"auto"}##END

9.4 Auto Mode Noise-Handling Example

# Client sends audio
C -> S: ##START\x02task00020000[ambient noise]##END
C -> S: ##START\x02task00020001[ambient noise]##END

# VAD detects "end-of-speech" but STT result is invalid
S -> C: ##START\x05task00020000##LISTEN:{"session_id":"task0002","type":"listen","state":"stop","mode":"auto"}##END
S -> C: ##START\x05task00020000##INFO:Noise or silence detected, continuing to listen##END

# Server immediately resumes listening without user intervention
S -> C: ##START\x05task00020000##LISTEN:{"session_id":"task0002","type":"listen","state":"start","mode":"auto"}##END

10. FAQ

10.1 Connection Issues

Q: How to handle connection timeout?
A: The server will drop the connection after 300 s of inactivity. The client should do one or more of the following:

Implement a heartbeat to keep the connection alive (optional)

Automatically reconnect when the connection is lost

Re-send the authentication message after reconnecting

Or actively send a close message to disconnect gracefully

10.2 Message Integrity

Q: How to guarantee message integrity?
A: Use the frame start delimiter (##START) and end delimiter (##END).

10.3 Security

Q: How to secure the communication?
A: Use JWT authentication.

10.4 Video Chat

Q: How to start a video conversation?
A: Upload the most recent image memory via upload_image. Equip the character with a vision function and it will automatically call the latest image memory during the dialogue.
The upload timing is decoupled from the conversation itself; you can decide when to call upload_image according to your scenario—e.g., when the call button is pressed, or every 5 seconds.

10.5 Choosing an Audio Format

Q: Which audio format should I use?
A:

PCM is lossless, high quality but large

Opus is compressed, smaller but requires decoding

Use PCM when bandwidth is plentiful

Use Opus when bandwidth is limited

On mobile devices prefer Opus to save data

If you also need Opus uplink, set input_audio_format:opus during auth and pack frames according to "Up-stream Opus Frame Stream Specification" (6.2.4.1)

10.6 Selecting a VAD Mode (new in 2.5)

Q: Should I choose Manual or Auto mode?
A:

Manual

Best for push-to-talk

Client fully controls recording

Widest compatibility, all formats supported

Good when precise control is required

Auto

Best for hands-free conversation

More natural user experience

Supports PCM and Opus (recommend input_audio_format:opus)

Auto-filters ambient noise

Ideal for smart speakers, in-car systems, etc.

Q: How does Auto mode cope with network latency?
A: VAD runs on the server; latency may affect response time. Tips:

Use a stable connection

Tune VAD parameters (contact support)

For ultra-low latency consider Manual mode

Q: Which audio formats does Auto mode support?
A: PCM and Opus

PCM: lossless, simple processing, good for LAN

Opus: better compression, less delay, good for mobile

Opus is recommended for real-time performance

Q: Auto mode triggers noise detection too often—what to do?
A: Possible causes & fixes:

Loud environment → improve recording conditions

Mic gain too high → lower client recording level

VAD threshold too low → contact support for tuning

Q: How to force-stop the current dialogue in Auto mode?
A: Send a special STATUS message to abort:

# Force-stop the current dialogue (valid in Auto mode only)
##START\x0500000000[0000]##STOP_VAD##END

Function Description:

Immediately processes any currently accumulated audio data (if present)

Resets the VAD state

Clears the audio buffer

Restarts listening

Valid only in Auto mode; in Manual mode, an informational message is returned

Use Cases:

User wants to end the current utterance immediately

System detects an abnormal condition requiring a reset

Client needs to take active control of the dialogue flow
