Google: Gemma 3n 4B
Google: Gemma 3n 4B is a text-generation model for vision-language understanding. It combines multimodal input handling and audio processing with a 33K-token context window and a low-cost profile. Use it for audio understanding and multimodal input when latency, cost, and throughput matter. It is a practical choice for teams that need reliable output, flexible deployment, and room to scale.
Input: $0.06/1M tokens
Output: $0.12/1M tokens
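
As a rough sketch of how the input and output rates listed above translate into a per-request cost, the Python snippet below multiplies token counts by those prices. It assumes "1M" means one million tokens and that billing is linear in token count; the helper name `estimate_cost` and the constant names are hypothetical and not part of any official SDK.

```python
from decimal import Decimal

# Listed rates in USD per 1M tokens (copied from the pricing above).
INPUT_PRICE_PER_M = Decimal("0.06")
OUTPUT_PRICE_PER_M = Decimal("0.12")
TOKENS_PER_M = Decimal(1_000_000)  # assumption: "1M" = one million tokens


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (Decimal(input_tokens) * INPUT_PRICE_PER_M
             + Decimal(output_tokens) * OUTPUT_PRICE_PER_M)
    return total / TOKENS_PER_M
```

Under these assumptions, a request with 30,000 input tokens and 2,000 output tokens would come to about $0.00204. `Decimal` is used instead of floats so that very small per-request amounts add up without rounding drift.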