Top WordPress Chatbots with Image and Speech-to-Text Input

Why Most WordPress Chatbots Still Can’t See or Hear

Here’s the thing about WordPress chatbots in 2026: most of them are still stuck in text-only mode. You ask a question, you get an answer. Simple stuff.

But what if your visitors want to upload a photo and ask “What is this product?” Or speak their question instead of typing? That’s where multimodal AI chatbots come in, and surprisingly few WordPress plugins actually support these features.

After researching the WordPress chatbot landscape, I found that only 3-4 plugins genuinely support both image input (vision) and speech-to-text (voice input). The rest either don’t have these features, or they use misleading marketing. “Multimodal” often just means text plus AI-generated images, not analyzing photos your visitors upload.

If you’re still deciding which AI provider to use, check out our Mistral vs Gemini vs ChatGPT comparison for a detailed breakdown of costs and capabilities. For a broader comparison of chatbot options, see our guide to the best chatbot plugins for WordPress.


The AI Models That Power Vision and Voice

Your chatbot is only as smart as the AI behind it. Here’s what’s powering multimodal capabilities right now:

Vision-Capable Models

GPT-5 (released August 2025) brought significant improvements in visual perception. GPT-5.2 (December 2025) pushed things further with 86.3% accuracy on the ScreenSpot-Pro benchmark for understanding software interfaces and diagrams (source).

On Google’s side, Gemini 3 Flash (December 2025) scored 81.2% on MMMU Pro (source), which tests multimodal understanding and reasoning. It’s also 3x faster than Gemini 2.5 Pro while being cheaper to run.

Speech-to-Text Models

Modern LLM providers now offer native audio understanding as part of their multimodal capabilities. In plugins like AI Chat & Search Pro, speech-to-text works through each provider’s own system:

  • OpenAI: Uses the Whisper API (/v1/audio/transcriptions)
  • Gemini: Uses Google’s native speech recognition (all Gemini models are multimodal)
  • Mistral: Uses Voxtral (source), their multimodal audio model, which Mistral says outperforms Whisper at less than half the cost

This means your chatbot’s voice input quality depends on which AI provider you’ve configured, not a separate transcription service.


WordPress Plugins with Real Multimodal Support

1. AI Chat & Search Pro

Pricing: $59 one-time purchase

For image input, users click an image button, select their file, and it gets converted to base64 before being sent to the AI API. Works with GPT-5, GPT-5.2, Gemini 3 Pro, and Gemini 3 Flash.
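The base64 step is easy to picture with a few lines of code. This is a minimal sketch in Python, not the plugin’s actual implementation (which isn’t published here). The helper name `to_data_url` and the choice of a data URL as the wire format are my assumptions for illustration.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Stand-in payload: just the 8-byte PNG signature, not a full image.
sample = b"\x89PNG\r\n\x1a\n"
data_url = to_data_url(sample, "image/png")
```

Base64 turns every 3 bytes into 4 characters, so the encoded payload is roughly a third larger than the original file. That is worth keeping in mind when you compare image sizes against a provider’s upload limit.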

For speech-to-text, users tap the microphone button and their audio is transcribed using each provider’s native capabilities: OpenAI’s Whisper API, Gemini’s multimodal audio, or Mistral’s Voxtral.

Images and audio are sent directly to the AI provider (OpenAI, Gemini, or Mistral) and are never stored on your WordPress server, which keeps your hosting clean and reduces privacy liability. All uploads go through magic bytes validation, which checks the actual file type at the binary level and prevents users from uploading malicious files disguised with fake extensions.
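If “magic bytes validation” sounds abstract, here is roughly what it means. The sketch below is in Python and is not the plugin’s code. The function name `sniff_image_type` and the short list of formats are assumptions for illustration, while the byte signatures themselves are the standard ones for PNG, JPEG, GIF, and WebP.

```python
from typing import Optional

# Leading byte signatures for a few common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return a MIME type based on the file's first bytes, or None (hypothetical helper)."""
    for signature, mime in SIGNATURES.items():
        if data.startswith(signature):
            return mime
    # WebP files start with "RIFF", a 4-byte size field, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


# A PHP script renamed to photo.jpg still fails the check,
# because the check never looks at the file name.
disguised = b"<?php echo 'hi'; ?>"
print(sniff_image_type(disguised))
```

The point of the design is that the extension and the browser-reported MIME type are both supplied by the uploader, while the leading bytes come from the file itself.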


2. AI Engine (Meow Apps)

Pricing: Free + $59/year Pro

The free version includes multi-file upload support for vision. You can enable “vision without query,” meaning users just drop an image and get analysis without typing anything.

Works with GPT-5, GPT-5.2, Gemini 3, Claude, and 50+ models via OpenRouter. Has a 25MB file limit (OpenAI’s API limitation).

For speech, the free version uses the browser’s Web Speech API (Chrome and Safari only). The Pro version unlocks the Realtime Audio Chatbot using OpenAI’s Realtime API.


3. Aimogen Pro (CodeCanyon)

Pricing: $249 one-time

Supports GPT-5 Vision and Gemini Vision models, plus an “AI Vision OmniBlock” for custom workflows. Speech-to-text uses OpenAI’s transcription models with a realtime chatbot option that includes Google TTS for spoken responses.


4. WPBot Pro (QuantumCloud)

Pricing: Base $59-199 + addons

Image input works through the Conversational Forms Pro module, not free-form chat. Voice requires separate addons ($21-22/year each). Total cost: around $142/year plus API costs.


Plugin Comparison Table

Plugin | License | Base Price | Image | Voice | Best For
AI Chat & Search Pro | One-time | Pro license | ✅ | ✅ | No recurring fees
AI Engine | Subscription | Free / $59/yr | ✅ Free | ⚠️ Pro | Large community
Aimogen Pro | One-time | $249 | ✅ | ✅ | All-in-one toolkit
WPBot Pro | Subscription | ~$142/yr | ⚠️ Forms | ⚠️ Addons | WooCommerce

API Costs to Expect

Beyond the plugin price, you’ll pay for API usage. Costs vary by provider:

Provider | Transcription Cost
OpenAI Whisper | $0.006/minute
Mistral Voxtral | $0.001/minute
Gemini 3 Flash | Included in token pricing
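
To turn those per-minute rates into a monthly figure, multiply by the audio you expect to process. Here is a rough sketch in Python. The rates come from the table above, but the traffic numbers (1,000 voice messages a month at 30 seconds each) are made-up assumptions, and providers may round or bill in different increments.

```python
from decimal import Decimal

# Per-minute transcription rates quoted in the table above (USD).
RATE_PER_MINUTE = {
    "OpenAI Whisper": Decimal("0.006"),
    "Mistral Voxtral": Decimal("0.001"),
}


def monthly_transcription_cost(provider: str, messages: int, avg_seconds: int) -> Decimal:
    """Estimate monthly transcription spend in USD, rounded to cents (hypothetical helper)."""
    minutes = Decimal(messages * avg_seconds) / Decimal(60)
    return (minutes * RATE_PER_MINUTE[provider]).quantize(Decimal("0.01"))


# Assumed traffic: 1,000 voice messages per month, 30 seconds each (500 minutes).
for name in RATE_PER_MINUTE:
    print(name, monthly_transcription_cost(name, 1000, 30))
```

At that volume the estimate works out to $3.00 a month for Whisper and $0.50 for Voxtral, so for most small sites transcription is a minor line item next to the chat model itself.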

Privacy and GDPR Considerations

When users upload images or speak to your chatbot, that data flows from their device to your WordPress server to the AI provider and back.

What you should do:

  • Get explicit opt-in consent before capturing audio or images
  • Clearly disclose that data gets sent to third-party AI services
  • Ensure chat histories with media can be deleted upon request

Some plugins include built-in GDPR tools. AI Engine offers a “Privacy First” option with IP hashing and consent controls.
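
IP hashing in general looks something like the sketch below. To be clear, this is not AI Engine’s implementation, which I haven’t inspected. The helper name `hash_ip` and the use of a keyed SHA-256 (HMAC) are my assumptions about one reasonable way to do it.

```python
import hashlib
import hmac


def hash_ip(ip_address: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of an IP address (hypothetical helper)."""
    return hmac.new(secret_key, ip_address.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key and a documentation-range IP address, for illustration only.
key = b"replace-with-a-random-site-secret"
visitor = hash_ip("203.0.113.7", key)
```

The secret key matters here. There are only about four billion IPv4 addresses, so a plain unkeyed hash can be reversed by simply hashing every possible address. A keyed hash still lets you recognize a repeat visitor in your logs without keeping the raw address around.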


FAQ

Which plugin should I choose for both image and voice support?

If you want no recurring fees, go with AI Chat & Search Pro or Aimogen Pro. If you want frequent updates and a large community, go with AI Engine Pro at $59/year.

Does voice input work on all browsers?

It depends. In AI Chat & Search Pro, yes. Plugins that rely only on the basic Web Speech API work just in Chrome and Safari. For broader support, you need a plugin that falls back to OpenAI Whisper.

Is HTTPS required?

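For voice input, yes. Browsers only allow microphone access in a secure context, which for a live site means HTTPS (localhost is the usual exception during development). For image upload, HTTPS isn’t technically required but is strongly recommended.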
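
Browsers gate microphone access on what they call a secure context. As a rough sketch of that rule in Python, you could pre-check a page URL like this. The function name `allows_microphone` is hypothetical, and the logic is a simplified approximation of what browsers do, not the full specification.

```python
from urllib.parse import urlparse

# Hosts that browsers generally treat as trustworthy even over plain HTTP.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def allows_microphone(page_url: str) -> bool:
    """Approximate whether a page URL counts as a secure context (hypothetical helper)."""
    parts = urlparse(page_url)
    host = parts.hostname or ""
    if parts.scheme == "https":
        return True
    is_local = host in LOCAL_HOSTS or host.endswith(".localhost")
    return parts.scheme == "http" and is_local


print(allows_microphone("https://example.com/shop"))
print(allows_microphone("http://example.com/shop"))
```

A check like this can decide whether to show the microphone button at all, so visitors on an HTTP page see no button instead of one that silently fails.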

Can I train the chatbot on my own content?

Yes. Most plugins support custom training. See our guide on how to train an AI chatbot on your WordPress knowledge base for a step-by-step walkthrough.


Wrapping Up

True multimodal chatbots on WordPress are still rare. You’re essentially choosing between AI Engine (subscription, very large feature set), AI Chat & Search Pro or Aimogen Pro (one-time purchases), or piecing together WPBot Pro with addons.

The underlying AI models have gotten incredibly capable. GPT-5.2 and Gemini 3 Flash can genuinely understand images and transcribe speech with high accuracy. The bottleneck isn’t the AI anymore. It’s finding WordPress plugins that expose these capabilities properly.

Whatever you choose, make sure your site runs HTTPS, prepare your privacy disclosures, and budget for API costs. Your visitors will appreciate being able to show, not just tell, what they need help with.

If you’re running a WooCommerce store, our best AI chatbot for WooCommerce guide covers product-specific considerations.

Purethemes