AI Model Selection Guide
Comprehensive guide to selecting the optimal AI model for your NPCs based on use case, performance, and cost considerations.
Available Models (2025)
General Purpose Models
Llama 3.3 70B Instruct (Fast) ⚡
- Model ID: `@cf/meta/llama-3.3-70b-instruct-fp8-fast`
- Performance: 2-4x faster than previous versions
- Context Window: 8,192 tokens
- Best For: Chat conversations, fast responses, complex tasks
- Cost: Higher (8x baseline)
- Recommended Use: Premium NPCs requiring high-quality, fast responses
Llama 3.1 8B Instruct (Default)
- Model ID: `@cf/meta/llama-3.1-8b-instruct`
- Performance: Balanced
- Context Window: 8,192 tokens
- Best For: General-purpose tasks, educational content
- Cost: Medium (2x baseline)
- Recommended Use: Default choice for most NPCs
Llama 3.2 3B Instruct (Cost-Effective) 💰
- Model ID: `@cf/meta/llama-3.2-3b-instruct`
- Performance: Fast, efficient
- Context Window: 8,192 tokens
- Best For: Quick responses, simple chat, cost optimization
- Cost: Low (1x baseline)
- Recommended Use: High-volume NPCs where cost is a concern
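In a Worker, switching between any of the models above is just a matter of passing a different model ID to the AI binding. A minimal sketch (the `askNpc` helper and the persona prompt are illustrative, not part of any shipped API; in a real Worker, `ai` is the `env.AI` binding):

```typescript
// Shape of the AI binding as used in this sketch.
interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }
interface AiBinding {
  run(modelId: string, input: { messages: ChatMessage[] }): Promise<{ response: string }>;
}

// Ask an NPC a question using whichever model ID was configured for it.
async function askNpc(
  ai: AiBinding,
  modelId: string,
  persona: string,
  question: string,
): Promise<string> {
  const result = await ai.run(modelId, {
    messages: [
      { role: "system", content: persona },
      { role: "user", content: question },
    ],
  });
  return result.response;
}
```

Because the model is just an ID string, upgrading or downgrading an NPC is a one-line configuration change.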
Specialized Models
Qwen2.5 Coder 32B 💻
- Model ID: `@cf/qwen/qwen2.5-coder-32b-instruct`
- Performance: Code generation specialist
- Context Window: 32,768 tokens
- Best For: Code-related NPCs, programming tutors
- Cost: Higher (6x baseline)
- Capabilities: Competitive with GPT-4o on coding tasks
- Recommended Use: Programming mentor NPCs, code review assistants
QwQ 32B Reasoning 🧠
- Model ID: `@cf/qwen/qwq-32b`
- Performance: Deep analytical reasoning
- Context Window: 32,768 tokens
- Best For: Complex reasoning, analytical tasks, problem-solving
- Cost: Higher (6x baseline)
- Capabilities: Competitive with DeepSeek-R1
- Recommended Use: Logic puzzles, math tutors, strategic advisors
Mistral Small 3.1 24B
- Model ID: `@cf/mistralai/mistral-small-3.1-24b-instruct`
- Performance: State-of-the-art
- Context Window: 128,000 tokens
- Best For: Vision tasks, tool calling, advanced tasks
- Cost: Higher (5x baseline)
- Capabilities: Vision + tool calling
- Recommended Use: NPCs that need to process images or use tools
Multimodal & Vision Models
Llama 4 Scout 17B 🎨
- Model ID: `@cf/meta/llama-4-scout-17b-16e-instruct`
- Performance: Multimodal (text + images)
- Context Window: 131,072 tokens
- Best For: Image understanding, multimodal interactions
- Cost: Higher (4x baseline)
- Capabilities: Native text and image understanding
- Recommended Use: Art NPCs, visual learning assistants, image analysis
Gemma 3 12B IT 🌍
- Model ID: `@cf/google/gemma-3-12b-it`
- Performance: Multilingual + vision
- Context Window: 128,000 tokens
- Best For: Multilingual NPCs, global audience, vision tasks
- Cost: Higher (~5x baseline, similar to Mistral Small)
- Capabilities: 140+ languages, vision support
- Recommended Use: International NPCs, language tutors, multicultural characters
Use Case Recommendations
By NPC Type
| NPC Type | Recommended Model | Reasoning |
|---|---|---|
| Quest Giver | Llama 3.1 8B | Balanced cost/quality for general dialogue |
| Combat Trainer | Llama 3.3 70B Fast | Fast responses critical for combat scenarios |
| Programming Mentor | Qwen2.5 Coder 32B | Specialized for code generation and explanation |
| Logic Puzzle Master | QwQ 32B | Superior reasoning for complex puzzles |
| Art Teacher | Llama 4 Scout 17B | Multimodal for discussing and analyzing art |
| Language Tutor | Gemma 3 12B | Multilingual support (140+ languages) |
| Shopkeeper | Llama 3.2 3B | Cost-effective for high-volume interactions |
| Story Narrator | Llama 3.1 8B | Good balance for creative content |
| Math Tutor | QwQ 32B | Reasoning capabilities for problem-solving |
| Character Co-Host | Llama 3.3 70B Fast | Premium experience for main game characters |
By Use Case Priority
Prioritize Speed:
- Primary: Llama 3.3 70B Fast ⚡
- Budget: Llama 3.2 3B 💰
Prioritize Quality:
- Complex Tasks: Llama 3.3 70B Fast
- Reasoning: QwQ 32B 🧠
- Code: Qwen2.5 Coder 32B 💻
Prioritize Cost:
- Best Value: Llama 3.2 3B 💰
- Balanced: Llama 3.1 8B (Default)
Special Requirements:
- Multimodal: Llama 4 Scout 17B 🎨
- Multilingual: Gemma 3 12B 🌍
- Vision: Mistral Small 3.1 24B or Gemma 3 12B
- Large Context: Llama 4 Scout (131K), Gemma 3 (128K), or Mistral (128K)
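The priority list above reduces to a simple lookup. A hedged sketch (the `modelForPriority` helper is illustrative, not part of any shipped API; the fallback mirrors the guide's default recommendation):

```typescript
// Map a primary priority to a recommended model ID, per the guide above.
function modelForPriority(priority: string): string {
  switch (priority) {
    case "speed":
    case "quality":
      return "@cf/meta/llama-3.3-70b-instruct-fp8-fast";
    case "reasoning":
      return "@cf/qwen/qwq-32b";
    case "code":
      return "@cf/qwen/qwen2.5-coder-32b-instruct";
    case "multimodal":
      return "@cf/meta/llama-4-scout-17b-16e-instruct";
    case "multilingual":
      return "@cf/google/gemma-3-12b-it";
    case "cost":
      return "@cf/meta/llama-3.2-3b-instruct";
    default:
      return "@cf/meta/llama-3.1-8b-instruct"; // balanced default
  }
}
```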
Cost Considerations
Relative Cost Matrix
Based on Llama 3.2 3B as baseline (1x):
| Model | Relative Cost | Best Use Case |
|---|---|---|
| Llama 3.2 3B | 1x | High-volume, simple interactions |
| Llama 3.1 8B | ~2x | General-purpose NPCs |
| Llama 4 Scout 17B | ~4x | Multimodal experiences |
| Mistral Small 3.1 24B | ~5x | Vision + tool calling |
| Gemma 3 12B | ~5x | Multilingual + vision |
| QwQ 32B | ~6x | Complex reasoning |
| Qwen2.5 Coder 32B | ~6x | Code generation |
| Llama 3.3 70B | ~8x | Premium chat experiences |
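The multipliers above make back-of-the-envelope budgeting straightforward: relative cost is just interactions times the multiplier. A sketch (the table of multipliers is from this guide; the baseline unit cost is a placeholder you would calibrate against your own billing data):

```typescript
// Relative cost multipliers from the matrix above (Llama 3.2 3B = 1x).
const RELATIVE_COST: Record<string, number> = {
  "@cf/meta/llama-3.2-3b-instruct": 1,
  "@cf/meta/llama-3.1-8b-instruct": 2,
  "@cf/meta/llama-4-scout-17b-16e-instruct": 4,
  "@cf/mistralai/mistral-small-3.1-24b-instruct": 5,
  "@cf/qwen/qwq-32b": 6,
  "@cf/qwen/qwen2.5-coder-32b-instruct": 6,
  "@cf/meta/llama-3.3-70b-instruct-fp8-fast": 8,
};

// Relative monthly cost in "baseline interaction" units.
// Unknown models are assumed to cost about as much as the 8B default.
function relativeMonthlyCost(modelId: string, interactionsPerMonth: number): number {
  return (RELATIVE_COST[modelId] ?? 2) * interactionsPerMonth;
}
```

For example, moving a 10,000-interaction NPC from the 70B model to the 3B model cuts its relative cost by a factor of eight.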
Cost Optimization Strategies
- Default to Efficient Models: Use Llama 3.1 8B or 3.2 3B for most NPCs
- Reserve Premium for Key NPCs: Only use 70B for main storyline characters
- Match Complexity: Don't use reasoning models for simple chat
- Monitor Usage: Track per-NPC interaction costs
- Batch Operations: For statement generation, use cost-effective models
Performance Characteristics
Speed Comparison
Fastest to Slowest:
- Llama 3.3 70B Fast ⚡ (the fp8-optimized variant stays fast despite its size)
- Llama 3.2 3B π°
- Llama 3.1 8B
- Smaller specialized models (12B-24B)
- Larger specialized models (32B)
Context Window Comparison
| Model | Context Window | Use Case |
|---|---|---|
| Llama 4 Scout | 131,072 tokens | Very long conversations, extensive lore |
| Gemma 3 12B | 128,000 tokens | Long-form educational content |
| Mistral Small 3.1 | 128,000 tokens | Complex multi-turn dialogues |
| QwQ 32B | 32,768 tokens | Extended reasoning chains |
| Qwen2.5 Coder | 32,768 tokens | Large code context |
| Llama 3.x (all) | 8,192 tokens | Standard conversations |
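The 8K windows on the Llama 3.x models mean long conversations must be trimmed before each call. A minimal sketch, assuming a rough ~4-characters-per-token estimate (a heuristic, not the model's real tokenizer; the helper names are illustrative):

```typescript
interface Msg { role: string; content: string; }

// Crude token estimate: ~4 characters per token for English text.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Drop the oldest messages until the history fits the model's context
// window, keeping headroom for the reply. Always keeps at least one message.
function trimToContext(history: Msg[], contextWindow: number, replyBudget: number): Msg[] {
  const limit = contextWindow - replyBudget;
  const trimmed = [...history];
  let total = trimmed.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  while (trimmed.length > 1 && total > limit) {
    const removed = trimmed.shift()!;
    total -= estimateTokens(removed.content);
  }
  return trimmed;
}
```

In practice you would also pin the system prompt rather than letting it be trimmed away; this sketch treats all messages uniformly for brevity.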
Migration Guide
Upgrading Existing NPCs
From Llama 3.1 8B:
- For faster responses: Upgrade to Llama 3.3 70B Fast
- For cost savings: Downgrade to Llama 3.2 3B
- For specialized tasks: Switch to appropriate specialist model
Note: Legacy models (DialoGPT, Mistral 7B v0.1, Qwen 1.5, OpenChat 3.5) have been deprecated and automatically upgraded to modern equivalents as of v1.4.0.
Testing New Models
- Create test NPC with new model
- Compare response quality with existing model
- Monitor response times and costs
- Gather user feedback
- Roll out gradually to production NPCs
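The gradual-rollout step works best when bucketing is deterministic, so a given NPC sees the same model on every request. A sketch (the simple string hash is illustrative; any stable hash works):

```typescript
// Deterministic 0-99 bucket for an NPC ID: same input, same bucket.
function rolloutBucket(npcId: string): number {
  let hash = 0;
  for (const ch of npcId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % 100;
}

// True if this NPC falls inside the current rollout percentage.
function useNewModel(npcId: string, rolloutPercent: number): boolean {
  return rolloutBucket(npcId) < rolloutPercent;
}
```

Raising `rolloutPercent` from 10 to 50 to 100 migrates NPCs in stable increments, and any regression can be rolled back by lowering the percentage again.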
Advanced Configuration
Temperature Settings by Model
Llama 3.3 70B Fast:
- Chat: 0.7-0.9 (creative)
- Tasks: 0.3-0.5 (focused)
QwQ 32B (Reasoning):
- Use lower temperatures (0.1-0.3) for precise reasoning
- Avoid high temperatures that can break reasoning chains
Qwen2.5 Coder:
- Code generation: 0.2-0.4 (precise)
- Code explanation: 0.5-0.7 (balanced)
Multimodal Models:
- Image analysis: 0.3-0.5 (accurate)
- Creative tasks: 0.7-1.0 (expressive)
Max Tokens by Use Case
- Quick responses: 256-512 tokens
- Standard chat: 512-1024 tokens
- Detailed explanations: 1024-2048 tokens
- Long-form content: 2048-4096 tokens
- Maximum (large context models): Up to context window limit
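The temperature and token guidance above can be centralized as named presets. A sketch (preset names and exact values are illustrative starting points drawn from the ranges above, not fixed requirements):

```typescript
interface SamplingPreset { temperature: number; max_tokens: number; }

// Presets derived from the temperature and max-token guidance above.
const PRESETS: Record<string, SamplingPreset> = {
  chat:       { temperature: 0.8, max_tokens: 1024 }, // creative dialogue
  reasoning:  { temperature: 0.2, max_tokens: 2048 }, // QwQ-style precise chains
  code:       { temperature: 0.3, max_tokens: 2048 }, // Qwen Coder generation
  quickReply: { temperature: 0.7, max_tokens: 512 },  // short responses
  longForm:   { temperature: 0.9, max_tokens: 4096 }, // large-context models
};

// Unknown use cases fall back to the general chat preset.
function presetFor(useCase: string): SamplingPreset {
  return PRESETS[useCase] ?? PRESETS.chat;
}
```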
Best Practices
- Start with defaults: Use Llama 3.1 8B unless you have specific needs
- Match model to task: Use specialized models for their strengths
- Monitor costs: Track usage and optimize based on data
- Test thoroughly: Validate quality before switching production NPCs
- Consider user experience: Premium models for critical interactions
- Plan for scale: Cost-effective models for high-volume NPCs
- Stay updated: New models released regularly, review quarterly
Troubleshooting
Model Not Available
- Verify model ID is correct
- Check Cloudflare Workers AI status
- Fallback to default model (Llama 3.1 8B)
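The fallback step can be wired directly into the model call. A minimal sketch (here `run` stands in for the AI binding call, e.g. `env.AI.run` in a Worker; the helper name is illustrative):

```typescript
const DEFAULT_MODEL = "@cf/meta/llama-3.1-8b-instruct";

// Try the configured model first; on failure, retry once with the default.
async function runWithFallback<T>(
  run: (modelId: string) => Promise<T>,
  preferredModel: string,
): Promise<T> {
  try {
    return await run(preferredModel);
  } catch (err) {
    if (preferredModel === DEFAULT_MODEL) throw err; // nothing left to fall back to
    return await run(DEFAULT_MODEL);
  }
}
```

A production version would also log the failure so that repeatedly unavailable models can be reconfigured rather than silently downgraded forever.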
Poor Response Quality
- Try a higher-tier model (e.g., 3.1 8B → 3.3 70B)
- Adjust temperature settings
- Improve system prompt
- Consider specialized model for task
Slow Responses
- Switch to Llama 3.3 70B Fast ⚡
- Use Llama 3.2 3B for simpler tasks
- Reduce max_tokens if excessive
High Costs
- Analyze per-NPC usage
- Downgrade non-critical NPCs to 3.2 3B
- Reserve premium models for key characters
- Implement usage quotas
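A usage quota can be as simple as a per-NPC counter checked before each call. A minimal in-memory sketch (a production version would persist counters, e.g. in KV or Durable Objects, and reset them daily; names here are illustrative):

```typescript
// Cap interactions per NPC; callers skip the model call when this returns false.
class NpcQuota {
  private counts = new Map<string, number>();

  constructor(private dailyLimit: number) {}

  // Consume one interaction for this NPC if its quota allows it.
  tryConsume(npcId: string): boolean {
    const used = this.counts.get(npcId) ?? 0;
    if (used >= this.dailyLimit) return false;
    this.counts.set(npcId, used + 1);
    return true;
  }
}
```

Pairing tighter quotas with premium models and looser quotas with the 3B model keeps worst-case spend bounded per NPC.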
Version History
- v1.3.6: Added 6 new 2025 models (Llama 4 Scout, Llama 3.3 70B, etc.)
- v1.3.0: Initial multi-provider AI support
- v1.0.0: Single Cloudflare model support