Streaming Support

Experience real-time AI responses with token-by-token streaming. Reduce perceived latency and create engaging, interactive experiences for your users.

Reduced Perceived Latency

Users see responses appear immediately as tokens are generated

Real-time Feedback

Immediate indication that the AI is processing and responding

Better User Experience

Interactive, engaging interface similar to modern chat applications

Live Tool Execution

See tool calls and results as they happen in real-time

How Streaming Works

1. Request Initiated
   Client sends a request with streaming enabled: POST /api/chat with stream: true.

2. Connection Established
   Server opens a Server-Sent Events (SSE) connection: HTTP response with Content-Type: text/event-stream.

3. Token Generation
   The AI model generates tokens progressively and streams each one as soon as it is produced.

4. Real-time Delivery
   Each token is sent to the client immediately as a data: {"token": "word", "done": false} event.

🚀 Performance Impact

Without Streaming:

User waits 5-10 seconds → Complete response appears

With Streaming:

Response starts in <1 second → Text appears progressively

Implementation Examples

JavaScript/Fetch API

const response = await fetch('/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer your-api-key'
  },
  body: JSON.stringify({
    message: "Explain quantum computing",
    model: "claude",
    stream: true
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // stream: true correctly handles multi-byte characters split across chunks
  buffer += decoder.decode(value, { stream: true });

  // The last element may be a partial line; keep it for the next chunk
  const lines = buffer.split('\n');
  buffer = lines.pop();

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      if (data.token) {
        // Update UI with the new token
        appendToResponse(data.token);
      }
    }
  }
}
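
Note that the browser's built-in EventSource API only supports GET requests, so a POST endpoint like this one is consumed with fetch and a ReadableStream reader, as shown above.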

React Hook Implementation

import { useState, useCallback } from 'react';

const useStreamingChat = () => {
  const [response, setResponse] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);

  const sendMessage = useCallback(async (message) => {
    setIsStreaming(true);
    setResponse('');

    try {
      const res = await fetch('/api/chat', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Authorization': 'Bearer your-api-key'
        },
        body: JSON.stringify({ message, stream: true })
      });

      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let buffer = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });

        // Keep any partial line in the buffer for the next chunk
        const lines = buffer.split('\n');
        buffer = lines.pop();

        for (const line of lines) {
          if (line.startsWith('data: ')) {
            const data = JSON.parse(line.slice(6));
            if (data.token) {
              setResponse(prev => prev + data.token);
            }
          }
        }
      }
    } finally {
      // Reset even if the stream errors or the connection drops
      setIsStreaming(false);
    }
  }, []);

  return { response, isStreaming, sendMessage };
};
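
For reference, a minimal component using the hook might look like this (the ChatBox name and markup are illustrative):

function ChatBox() {
  const { response, isStreaming, sendMessage } = useStreamingChat();

  return (
    <div>
      <button
        onClick={() => sendMessage('Explain quantum computing')}
        disabled={isStreaming}
      >
        {isStreaming ? 'Streaming…' : 'Send'}
      </button>
      <p>{response}</p>
    </div>
  );
}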

Model Streaming Support

Model             Provider    Streaming   Performance
GPT-4 Turbo       OpenAI      Supported   Fast
GPT-3.5 Turbo     OpenAI      Supported   Very Fast
Claude 3 Opus     Anthropic   Supported   Fast
Claude 3 Sonnet   Anthropic   Supported   Fast
Claude 3 Haiku    Anthropic   Supported   Very Fast
Gemini 1.5 Pro    Google      Supported   Fast
NeuroSwitch Mix   Fusion      Supported   Variable

NeuroSwitch Streaming

When using NeuroSwitch, streaming performance depends on the routed model. The intelligent routing system maintains streaming compatibility across all supported providers.
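
Because each token event includes a model field (see Streaming Event Format below), a client can surface which model NeuroSwitch routed the request to. A minimal sketch, reusing the line-parsing loop from the examples above (showRoutedModel is a hypothetical UI helper):

for (const line of lines) {
  if (line.startsWith('data: ')) {
    const data = JSON.parse(line.slice(6));
    // Each token event reports the model that produced it
    if (data.model) showRoutedModel(data.model);
    if (data.token) appendToResponse(data.token);
  }
}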

Streaming Event Format

Standard Token Event

data: {
  "token": "Hello",
  "done": false,
  "model": "claude",
  "timestamp": "2024-01-15T10:30:00Z"
}

Tool Execution Event

data: {
  "tool_name": "web_search",
  "tool_input": {"query": "latest AI news"},
  "tool_result": "Found 10 articles...",
  "token": "Based on the search results...",
  "done": false
}

Stream Completion

data: {
  "done": true,
  "total_tokens": 1250,
  "finish_reason": "stop",
  "response_time": 4.2
}

Error Event

data: {
  "error": "Rate limit exceeded",
  "error_code": "RATE_LIMIT",
  "done": true,
  "retry_after": 60
}
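
Putting the four event types together, a client-side dispatcher might look like this. This is a sketch based on the fields documented above; the handler names (scheduleRetry, showToolActivity, appendToResponse, finishResponse) are illustrative:

function handleStreamEvent(data) {
  // Error events end the stream and may carry a retry hint (in seconds)
  if (data.error) {
    console.error(`${data.error_code}: ${data.error}`);
    if (data.retry_after) scheduleRetry(data.retry_after * 1000);
    return;
  }

  // Tool execution events can arrive alongside regular tokens
  if (data.tool_name) {
    showToolActivity(data.tool_name, data.tool_input, data.tool_result);
  }

  // Token events carry incremental text
  if (data.token) appendToResponse(data.token);

  // Completion event: final metadata for the whole response
  if (data.done) {
    finishResponse({
      totalTokens: data.total_tokens,
      finishReason: data.finish_reason,
      responseTime: data.response_time
    });
  }
}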

Streaming Best Practices

Frontend Implementation

⚡ Performance

  • Debounce UI updates for smooth rendering
  • Use virtual scrolling for long responses
  • Implement proper error boundaries
  • Handle connection timeouts gracefully

🎨 User Experience

  • Show typing indicators during streaming
  • Provide stop/cancel functionality (see the sketch below)
  • Display connection status clearly
  • Handle reconnection scenarios
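
Cancellation can be wired up with an AbortController. A minimal sketch, assuming the fetch-based client shown earlier (the stopButton element is illustrative):

const controller = new AbortController();

const response = await fetch('/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer your-api-key'
  },
  body: JSON.stringify({ message: 'Explain quantum computing', stream: true }),
  // Passing the signal lets the request be cancelled mid-stream
  signal: controller.signal
});

// Wire the controller to a Stop button
stopButton.addEventListener('click', () => controller.abort());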

Backend Considerations

🔧 Technical

  • Set appropriate connection timeouts
  • Implement proper SSE headers (see the sketch after this list)
  • Handle client disconnections
  • Monitor stream health and performance
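
On the server side, a minimal Node/Express sketch illustrating SSE headers and client-disconnect handling. This is not the actual Fusion implementation; generateTokens() stands in for the upstream model stream:

const express = require('express');
const app = express();
app.use(express.json());

app.post('/api/chat', async (req, res) => {
  // Proper SSE headers: event-stream content type, no caching
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive'
  });

  // Stop generating if the client disconnects mid-stream
  let closed = false;
  req.on('close', () => { closed = true; });

  // generateTokens() is a placeholder for the model's token stream
  for await (const token of generateTokens(req.body.message)) {
    if (closed) return;
    res.write(`data: ${JSON.stringify({ token, done: false })}\n\n`);
  }

  res.write(`data: ${JSON.stringify({ done: true })}\n\n`);
  res.end();
});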

📊 Monitoring

  • Track streaming success rates
  • Monitor token delivery latency
  • Log connection drop patterns
  • Measure user engagement metrics

Enable Streaming Today

Transform your AI application's user experience with real-time streaming. Just add stream: true to your requests.