- What Speech-to-Text Does
- Where Speech-to-Text Is Configured
- Supported Speech-to-Text Providers
- Browser Permissions and User Consent
- How Speech Input Is Processed
- Speech-to-Text and Chatbot Workflows
- Using STT in Real-Time Chatbots
- Moderation and Validation After Transcription
- Accuracy Considerations
- Language Handling
- Performance and Cost
- What Speech-to-Text Does Not Do
- Common Mistakes
- Best Practices
- Summary
Speech-to-Text (STT) in Aimogen allows chatbot users to speak instead of typing, with their voice input converted into text before being processed by the chatbot. This feature is designed to integrate cleanly with existing chatbot logic, workflows, and AI reasoning, without changing how conversations are evaluated internally.
Speech-to-Text affects input only. Everything after transcription behaves exactly like typed text.
What Speech-to-Text Does #
When enabled, Speech-to-Text:
- captures user voice input through the browser
- converts audio into text using an AI speech model
- sends the transcribed text to the chatbot
- triggers the same logic as manual text input
The chatbot never “hears” audio. It only receives text.
Where Speech-to-Text Is Configured #
Speech-to-Text is configured at the chatbot level.
Path:
Aimogen → Chatbots → Edit Chatbot → Voice / Audio Settings
STT can be enabled or disabled per chatbot. There is no forced global setting.
Supported Speech-to-Text Providers #
Speech-to-Text uses AI providers that expose transcription APIs.
Once the provider API key is entered under:
Settings → API Keys
the available speech-to-text models appear automatically in the chatbot settings.
No additional toggles are required beyond selecting the transcription model.
Browser Permissions and User Consent #
Speech-to-Text requires:
- microphone access in the user’s browser
- explicit user permission
If permission is denied:
- voice input is unavailable
- text input remains fully functional
Aimogen does not bypass browser security or consent prompts.
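The consent flow above follows the standard browser microphone API. A minimal sketch, assuming a widget that probes for permission before showing a microphone button (the function name and fallback behavior are illustrative, not Aimogen internals):

```javascript
// Probe microphone availability using the standard getUserMedia API.
async function canUseVoiceInput() {
  // Outside a browser (or on an insecure origin) the API is absent entirely.
  if (typeof navigator === "undefined" || !navigator.mediaDevices?.getUserMedia) {
    return false;
  }
  try {
    // This call triggers the browser's consent prompt.
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    stream.getTracks().forEach((track) => track.stop()); // probe only; release the mic
    return true;  // show the microphone button
  } catch {
    return false; // permission denied: hide voice, keep text input working
  }
}
```

A widget would call this once and render the microphone button only when it resolves to `true`; typed input stays available either way.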
How Speech Input Is Processed #
The STT flow is strictly ordered:
1. the user presses the microphone button
2. the browser records audio
3. the audio is sent to the transcription API
4. the transcribed text is returned
5. the text is processed by the chatbot
Triggers, workflows, moderation, and validation apply after transcription.
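The ordered flow above can be sketched as a single async pipeline, with the provider call injected as a function. All names here are illustrative stand-ins, not Aimogen APIs:

```javascript
// Transcription happens first; the chatbot only ever sees text.
async function handleVoiceMessage(audioBlob, transcribe, processText) {
  const text = await transcribe(audioBlob); // audio → text (provider API call)
  return processText(text);                 // same path as typed input
}

// Stub transcriber and chatbot to illustrate the ordering:
const stubTranscribe = async () => "what are your opening hours";
const stubChatbot = (text) => `chatbot received: ${text}`;

handleVoiceMessage(new Uint8Array(), stubTranscribe, stubChatbot)
  .then((reply) => console.log(reply)); // chatbot received: what are your opening hours
```

Because the chatbot callback receives plain text, every downstream step is identical to the typed-input path.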
Speech-to-Text and Chatbot Workflows #
Speech-to-Text does not change:
- trigger evaluation
- conditional workflows
- hardcoded messages
- appended system prompts
- conversation termination rules
Voice input is treated exactly like typed input once transcribed.
Using STT in Real-Time Chatbots #
In real-time chatbots:
- voice input is captured continuously or per message
- transcription happens immediately
- responses may be spoken back via Text-to-Speech
- latency depends on provider speed
This enables fully voice-driven conversations.
Moderation and Validation After Transcription #
All moderation and validation rules apply to the transcribed text, not the audio.
This includes:
- content moderation
- prompt injection protection
- keyword triggers
- validation logic
Speech does not bypass safety layers.
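A minimal sketch of this principle: whatever checks run on typed text run unchanged on the transcript. The banned-word list below is a hypothetical stand-in for real moderation:

```javascript
// Moderation operates on the transcript, never on the audio.
function moderateTranscript(text, bannedWords) {
  const lower = text.toLowerCase();
  const hit = bannedWords.find((word) => lower.includes(word));
  return hit
    ? { allowed: false, reason: `blocked term: ${hit}` }
    : { allowed: true, text };
}

console.log(moderateTranscript("tell me a joke", ["spam"]).allowed);  // true
console.log(moderateTranscript("buy my SPAM now", ["spam"]).allowed); // false
```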
Accuracy Considerations #
Speech-to-Text accuracy depends on:
- microphone quality
- background noise
- speaker clarity
- language support
- provider model quality
For best results:
- encourage short utterances
- avoid noisy environments
- use supported languages
STT is probabilistic, not perfect.
Language Handling #
Speech-to-Text supports multiple languages depending on the provider.
If language detection is enabled, the model detects the spoken language automatically. If not, the language must be predefined.
Mismatch between spoken language and model settings reduces accuracy.
Performance and Cost #
Speech-to-Text:
- adds an extra API call per voice input
- introduces slight latency
- increases cost per interaction
For high-traffic chatbots, STT should be enabled selectively.
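For capacity planning, a rough estimate helps decide where STT is worth enabling. The $0.006-per-audio-minute rate below is an assumption for illustration only; check your provider's actual pricing:

```javascript
// Back-of-the-envelope monthly transcription cost (30-day month assumed).
function monthlySttCostUsd(voiceMessagesPerDay, avgSecondsPerMessage, ratePerMinuteUsd) {
  const minutesPerDay = (voiceMessagesPerDay * avgSecondsPerMessage) / 60;
  return minutesPerDay * 30 * ratePerMinuteUsd;
}

// 500 voice messages/day at ~8 s each, $0.006 per audio minute:
console.log(monthlySttCostUsd(500, 8, 0.006).toFixed(2)); // 12.00
```

Even modest volumes add up, which is why enabling STT only on chatbots where voice adds value keeps costs predictable.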
What Speech-to-Text Does Not Do #
Speech-to-Text does not:
- change AI reasoning
- store audio permanently
- record conversations silently
- bypass consent
- replace text input
- improve chatbot intelligence
It converts speech to text, nothing more.
Common Mistakes #
- enabling STT without providing text fallback
- ignoring microphone permission UX
- using long, complex voice prompts
- assuming perfect transcription
- skipping moderation after transcription
Voice input still needs structure.
Best Practices #
Use Speech-to-Text where voice adds real value: real-time chatbots, accessibility use cases, hands-free environments, and conversational assistants. Keep voice interactions concise, always provide text fallback, and test transcription quality before deploying at scale.
Summary #
Speech-to-Text setup in Aimogen enables chatbot users to speak instead of typing, converting voice input into text before it enters the chatbot logic. Configured per chatbot and powered by supported AI transcription providers, STT integrates seamlessly with triggers, workflows, moderation, and validation. When used intentionally, it unlocks natural voice interaction without compromising control, safety, or predictability.