AI Model Selection Guide

Comprehensive guide to selecting the optimal AI model for your NPCs based on use case, performance, and cost considerations.

Available Models (2025)

General Purpose Models

Llama 3.3 70B Instruct (Fast) ⚑

  • Model ID: @cf/meta/llama-3.3-70b-instruct-fp8-fast
  • Performance: 2-4x faster than previous versions
  • Context Window: 8,192 tokens
  • Best For: Chat conversations, fast responses, complex tasks
  • Cost: Higher (8x baseline)
  • Recommended Use: Premium NPCs requiring high-quality, fast responses

Llama 3.1 8B Instruct (Default)

  • Model ID: @cf/meta/llama-3.1-8b-instruct
  • Performance: Balanced
  • Context Window: 8,192 tokens
  • Best For: General-purpose tasks, educational content
  • Cost: Medium (2x baseline)
  • Recommended Use: Default choice for most NPCs

Llama 3.2 3B Instruct (Cost-Effective) πŸ’°

  • Model ID: @cf/meta/llama-3.2-3b-instruct
  • Performance: Fast, efficient
  • Context Window: 8,192 tokens
  • Best For: Quick responses, simple chat, cost optimization
  • Cost: Low (1x baseline)
  • Recommended Use: High-volume NPCs where cost is a concern

Specialized Models

Qwen2.5 Coder 32B πŸ’»

  • Model ID: @cf/qwen/qwen2.5-coder-32b-instruct
  • Performance: Code generation specialist
  • Context Window: 32,768 tokens
  • Best For: Code-related NPCs, programming tutors
  • Cost: Higher (6x baseline)
  • Capabilities: Matches GPT-4o for coding tasks
  • Recommended Use: Programming mentor NPCs, code review assistants

QwQ 32B Reasoning 🧠

  • Model ID: @cf/qwen/qwq-32b
  • Performance: Deep analytical reasoning
  • Context Window: 32,768 tokens
  • Best For: Complex reasoning, analytical tasks, problem-solving
  • Cost: Higher (6x baseline)
  • Capabilities: Competitive with DeepSeek-R1
  • Recommended Use: Logic puzzles, math tutors, strategic advisors

Mistral Small 3.1 24B

  • Model ID: @cf/mistralai/mistral-small-3.1-24b-instruct
  • Performance: State-of-the-art in its size class
  • Context Window: 128,000 tokens
  • Best For: Vision tasks, tool calling, advanced tasks
  • Cost: Higher (5x baseline)
  • Capabilities: Vision + tool calling
  • Recommended Use: NPCs that need to process images or use tools

Multimodal & Vision Models

Llama 4 Scout 17B 🎨

  • Model ID: @cf/meta/llama-4-scout-17b-16e-instruct
  • Performance: Multimodal (text + images)
  • Context Window: 131,072 tokens
  • Best For: Image understanding, multimodal interactions
  • Cost: Higher (4x baseline)
  • Capabilities: Native text and image understanding
  • Recommended Use: Art NPCs, visual learning assistants, image analysis

Gemma 3 12B IT 🌍

  • Model ID: @cf/google/gemma-3-12b-it
  • Performance: Multilingual + vision
  • Context Window: 128,000 tokens
  • Best For: Multilingual NPCs, global audience, vision tasks
  • Cost: Higher (similar to Mistral)
  • Capabilities: 140+ languages, vision support
  • Recommended Use: International NPCs, language tutors, multicultural characters

Use Case Recommendations

By NPC Type

| NPC Type | Recommended Model | Reasoning |
| --- | --- | --- |
| Quest Giver | Llama 3.1 8B | Balanced cost/quality for general dialogue |
| Combat Trainer | Llama 3.3 70B Fast | Fast responses critical for combat scenarios |
| Programming Mentor | Qwen2.5 Coder 32B | Specialized for code generation and explanation |
| Logic Puzzle Master | QwQ 32B | Superior reasoning for complex puzzles |
| Art Teacher | Llama 4 Scout 17B | Multimodal for discussing and analyzing art |
| Language Tutor | Gemma 3 12B | Multilingual support (140+ languages) |
| Shopkeeper | Llama 3.2 3B | Cost-effective for high-volume interactions |
| Story Narrator | Llama 3.1 8B | Good balance for creative content |
| Math Tutor | QwQ 32B | Reasoning capabilities for problem-solving |
| Character Co-Host | Llama 3.3 70B Fast | Premium experience for main game characters |
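As a sketch, the table above can be encoded as a simple lookup. The `MODEL_BY_NPC_TYPE` map and `modelForNpcType` helper are illustrative names invented for this guide, not part of any API:

```typescript
// Illustrative mapping from NPC archetype to the model ID recommended above.
const MODEL_BY_NPC_TYPE: Record<string, string> = {
  "quest-giver": "@cf/meta/llama-3.1-8b-instruct",
  "combat-trainer": "@cf/meta/llama-3.3-70b-instruct-fp8-fast",
  "programming-mentor": "@cf/qwen/qwen2.5-coder-32b-instruct",
  "logic-puzzle-master": "@cf/qwen/qwq-32b",
  "art-teacher": "@cf/meta/llama-4-scout-17b-16e-instruct",
  "language-tutor": "@cf/google/gemma-3-12b-it",
  "shopkeeper": "@cf/meta/llama-3.2-3b-instruct",
  "story-narrator": "@cf/meta/llama-3.1-8b-instruct",
  "math-tutor": "@cf/qwen/qwq-32b",
  "character-co-host": "@cf/meta/llama-3.3-70b-instruct-fp8-fast",
};

// Unknown archetypes fall back to the default model (Llama 3.1 8B).
function modelForNpcType(npcType: string): string {
  return MODEL_BY_NPC_TYPE[npcType] ?? "@cf/meta/llama-3.1-8b-instruct";
}
```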

By Use Case Priority

Prioritize Speed:

  • Primary: Llama 3.3 70B Fast ⚑
  • Budget: Llama 3.2 3B πŸ’°

Prioritize Quality:

  • Complex Tasks: Llama 3.3 70B Fast
  • Reasoning: QwQ 32B 🧠
  • Code: Qwen2.5 Coder 32B πŸ’»

Prioritize Cost:

  • Best Value: Llama 3.2 3B πŸ’°
  • Balanced: Llama 3.1 8B (Default)

Special Requirements:

  • Multimodal: Llama 4 Scout 17B 🎨
  • Multilingual: Gemma 3 12B 🌍
  • Vision: Mistral Small 3.1 24B or Gemma 3 12B
  • Large Context: Llama 4 Scout (131K), Gemma 3 (128K), or Mistral (128K)
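The priority rules above can be sketched as a small chooser. The `Requirements` shape, its field names, and the precedence order (specialist needs before cost) are assumptions made for illustration, not a documented API:

```typescript
// Illustrative requirement flags for picking a model.
interface Requirements {
  needsCode?: boolean;
  needsReasoning?: boolean;
  needsMultilingual?: boolean;
  needsVision?: boolean;
  prioritizeCost?: boolean;
}

// Specialist requirements win first; cost pressure next; otherwise default.
function recommendModel(req: Requirements): string {
  if (req.needsCode) return "@cf/qwen/qwen2.5-coder-32b-instruct";
  if (req.needsReasoning) return "@cf/qwen/qwq-32b";
  if (req.needsMultilingual) return "@cf/google/gemma-3-12b-it";
  if (req.needsVision) return "@cf/meta/llama-4-scout-17b-16e-instruct";
  if (req.prioritizeCost) return "@cf/meta/llama-3.2-3b-instruct";
  return "@cf/meta/llama-3.1-8b-instruct"; // balanced default
}
```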

Cost Considerations

Relative Cost Matrix

Based on Llama 3.2 3B as baseline (1x):

| Model | Relative Cost | Best Use Case |
| --- | --- | --- |
| Llama 3.2 3B | 1x | High-volume, simple interactions |
| Llama 3.1 8B | ~2x | General-purpose NPCs |
| Llama 4 Scout 17B | ~4x | Multimodal experiences |
| Mistral Small 3.1 24B | ~5x | Vision + tool calling |
| QwQ 32B | ~6x | Complex reasoning |
| Qwen2.5 Coder 32B | ~6x | Code generation |
| Llama 3.3 70B | ~8x | Premium chat experiences |
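A minimal sketch of comparing workloads with the multipliers above. The numbers are the approximate ratios from the matrix, not billing figures, and `relativeCost` is a hypothetical helper:

```typescript
// Approximate cost multipliers from the matrix (Llama 3.2 3B = 1x).
const RELATIVE_COST: Record<string, number> = {
  "@cf/meta/llama-3.2-3b-instruct": 1,
  "@cf/meta/llama-3.1-8b-instruct": 2,
  "@cf/meta/llama-4-scout-17b-16e-instruct": 4,
  "@cf/mistralai/mistral-small-3.1-24b-instruct": 5,
  "@cf/qwen/qwq-32b": 6,
  "@cf/qwen/qwen2.5-coder-32b-instruct": 6,
  "@cf/meta/llama-3.3-70b-instruct-fp8-fast": 8,
};

// Relative cost of a workload: interactions x cost multiplier.
// Unknown models are treated like the 2x default (Llama 3.1 8B).
function relativeCost(model: string, interactions: number): number {
  return (RELATIVE_COST[model] ?? 2) * interactions;
}
```

For example, moving 1,000 shopkeeper interactions from the 70B model to Llama 3.2 3B drops the relative cost from 8,000 to 1,000 units.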

Cost Optimization Strategies

  1. Default to Efficient Models: Use Llama 3.1 8B or 3.2 3B for most NPCs
  2. Reserve Premium for Key NPCs: Only use 70B for main storyline characters
  3. Match Complexity: Don’t use reasoning models for simple chat
  4. Monitor Usage: Track per-NPC interaction costs
  5. Batch Operations: For statement generation, use cost-effective models

Performance Characteristics

Speed Comparison

Fastest to Slowest:

  1. Llama 3.3 70B Fast ⚑ (fp8 quantization keeps it fast despite its size)
  2. Llama 3.2 3B πŸ’°
  3. Llama 3.1 8B
  4. Smaller specialized models (12B-24B)
  5. Larger specialized models (32B)

Context Window Comparison

| Model | Context Window | Use Case |
| --- | --- | --- |
| Llama 4 Scout | 131,072 tokens | Very long conversations, extensive lore |
| Gemma 3 12B | 128,000 tokens | Long-form educational content |
| Mistral Small 3.1 | 128,000 tokens | Complex multi-turn dialogues |
| QwQ 32B | 32,768 tokens | Extended reasoning chains |
| Qwen2.5 Coder | 32,768 tokens | Large code context |
| Llama 3.x (all) | 8,192 tokens | Standard conversations |
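A context-budget guard based on the windows listed above can look like this. `fitsContext` is an illustrative helper and assumes you estimate prompt token counts yourself:

```typescript
// Context windows from the table above, keyed by model ID.
const CONTEXT_WINDOW: Record<string, number> = {
  "@cf/meta/llama-4-scout-17b-16e-instruct": 131072,
  "@cf/google/gemma-3-12b-it": 128000,
  "@cf/mistralai/mistral-small-3.1-24b-instruct": 128000,
  "@cf/qwen/qwq-32b": 32768,
  "@cf/qwen/qwen2.5-coder-32b-instruct": 32768,
  "@cf/meta/llama-3.1-8b-instruct": 8192,
};

// True if the prompt plus the reserved output budget fits the window.
// Unknown models are assumed to have the smallest (8,192-token) window.
function fitsContext(model: string, promptTokens: number, maxOutput: number): boolean {
  const window = CONTEXT_WINDOW[model] ?? 8192;
  return promptTokens + maxOutput <= window;
}
```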

Migration Guide

Upgrading Existing NPCs

From Llama 3.1 8B:

  • For faster responses: Upgrade to Llama 3.3 70B Fast
  • For cost savings: Downgrade to Llama 3.2 3B
  • For specialized tasks: Switch to appropriate specialist model

Note: Legacy models (DialoGPT, Mistral 7B v0.1, Qwen 1.5, OpenChat 3.5) have been deprecated and automatically upgraded to modern equivalents as of v1.4.0.

Testing New Models

  1. Create test NPC with new model
  2. Compare response quality with existing model
  3. Monitor response times and costs
  4. Gather user feedback
  5. Roll out gradually to production NPCs
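Step 2 above (comparing response quality) can be sketched as a small A/B harness. `ask` is a stand-in for your actual model call, not a real client API:

```typescript
// Run the same prompts through two models and collect replies for
// side-by-side review. Latency/cost tracking (steps 3-4) would hang off
// the same loop.
async function compareModels(
  ask: (model: string, prompt: string) => Promise<string>,
  modelA: string,
  modelB: string,
  prompts: string[],
): Promise<Array<{ prompt: string; a: string; b: string }>> {
  const results: Array<{ prompt: string; a: string; b: string }> = [];
  for (const prompt of prompts) {
    results.push({ prompt, a: await ask(modelA, prompt), b: await ask(modelB, prompt) });
  }
  return results;
}
```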

Advanced Configuration

Temperature Settings by Model

Llama 3.3 70B Fast:

  • Chat: 0.7-0.9 (creative)
  • Tasks: 0.3-0.5 (focused)

QwQ 32B (Reasoning):

  • Use lower temperatures (0.1-0.3) for precise reasoning
  • Avoid high temperatures that can break reasoning chains

Qwen2.5 Coder:

  • Code generation: 0.2-0.4 (precise)
  • Code explanation: 0.5-0.7 (balanced)

Multimodal Models:

  • Image analysis: 0.3-0.5 (accurate)
  • Creative tasks: 0.7-1.0 (expressive)
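One way to encode the ranges above is a per-mode default, using roughly the midpoint of each suggested range. The `Mode` type and the exact values are assumptions for illustration, not vendor recommendations:

```typescript
// Interaction modes corresponding to the guidance above.
type Mode = "chat" | "task" | "reasoning" | "code" | "vision";

// Rough midpoints of the suggested temperature ranges.
function defaultTemperature(mode: Mode): number {
  switch (mode) {
    case "reasoning": return 0.2; // keep reasoning chains stable
    case "code": return 0.3;      // precise code generation
    case "task": return 0.4;      // focused task completion
    case "vision": return 0.4;    // accurate image analysis
    case "chat": return 0.8;      // creative conversation
  }
}
```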

Max Tokens by Use Case

  • Quick responses: 256-512 tokens
  • Standard chat: 512-1024 tokens
  • Detailed explanations: 1024-2048 tokens
  • Long-form content: 2048-4096 tokens
  • Maximum (large context models): Up to context window limit
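These budgets can be captured in a small lookup; the use-case keys and the choice of each range's upper bound are illustrative:

```typescript
// Upper bounds of the token budgets listed above.
const MAX_TOKENS: Record<string, number> = {
  quick: 512,
  standard: 1024,
  detailed: 2048,
  longform: 4096,
};

// Unknown use cases default to the standard chat budget.
function maxTokensFor(useCase: string): number {
  return MAX_TOKENS[useCase] ?? 1024;
}
```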

Best Practices

  1. Start with defaults: Use Llama 3.1 8B unless you have specific needs
  2. Match model to task: Use specialized models for their strengths
  3. Monitor costs: Track usage and optimize based on data
  4. Test thoroughly: Validate quality before switching production NPCs
  5. Consider user experience: Premium models for critical interactions
  6. Plan for scale: Cost-effective models for high-volume NPCs
  7. Stay updated: New models released regularly, review quarterly

Troubleshooting

Model Not Available

  • Verify model ID is correct
  • Check Cloudflare Workers AI status
  • Fallback to default model (Llama 3.1 8B)
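The fallback advice above can be sketched as a wrapper that retries with the default model on any failure. `runModel` is a placeholder for your actual AI client call:

```typescript
// Try the preferred model first; on any error (bad model ID, outage),
// retry once with the default model.
async function runWithFallback(
  runModel: (model: string) => Promise<string>,
  preferred: string,
  fallback = "@cf/meta/llama-3.1-8b-instruct",
): Promise<string> {
  try {
    return await runModel(preferred);
  } catch {
    return await runModel(fallback);
  }
}
```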

Poor Response Quality

  • Try higher-tier model (e.g., 3.1 8B β†’ 3.3 70B)
  • Adjust temperature settings
  • Improve system prompt
  • Consider specialized model for task

Slow Responses

  • Switch to Llama 3.3 70B Fast ⚑
  • Use Llama 3.2 3B for simpler tasks
  • Reduce max_tokens if excessive

High Costs

  • Analyze per-NPC usage
  • Downgrade non-critical NPCs to 3.2 3B
  • Reserve premium models for key characters
  • Implement usage quotas
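The quota suggestion above can be sketched as a per-NPC counter; the class name and limit semantics are illustrative assumptions:

```typescript
// Minimal in-memory usage quota: each NPC gets a fixed number of
// interactions before further calls are refused.
class UsageQuota {
  private counts = new Map<string, number>();
  constructor(private limitPerNpc: number) {}

  // Records one interaction and returns true if the NPC was under quota.
  tryConsume(npcId: string): boolean {
    const used = this.counts.get(npcId) ?? 0;
    if (used >= this.limitPerNpc) return false;
    this.counts.set(npcId, used + 1);
    return true;
  }
}
```

In production you would likely back the counter with durable storage rather than process memory.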

Version History

  • v1.3.6: Added 6 new 2025 models (Llama 4 Scout, Llama 3.3 70B, etc.)
  • v1.3.0: Initial multi-provider AI support
  • v1.0.0: Single Cloudflare model support

PadawanForge v1.4.1