AI Model Selection Guide

Comprehensive guide to selecting the optimal AI model for your NPCs based on use case, performance, and cost considerations.

Available Models (2025)

General Purpose Models

Llama 3.3 70B Instruct (Fast) ⚑

  • Model ID: @cf/meta/llama-3.3-70b-instruct-fp8-fast
  • Performance: 2-4x faster than previous versions
  • Context Window: 8,192 tokens
  • Best For: Chat conversations, fast responses, complex tasks
  • Cost: Higher (8x baseline)
  • Recommended Use: Premium NPCs requiring high-quality, fast responses

Llama 3.1 8B Instruct (Default)

  • Model ID: @cf/meta/llama-3.1-8b-instruct
  • Performance: Balanced
  • Context Window: 8,192 tokens
  • Best For: General-purpose tasks, educational content
  • Cost: Medium (2x baseline)
  • Recommended Use: Default choice for most NPCs

Llama 3.2 3B Instruct (Cost-Effective) πŸ’°

  • Model ID: @cf/meta/llama-3.2-3b-instruct
  • Performance: Fast, efficient
  • Context Window: 8,192 tokens
  • Best For: Quick responses, simple chat, cost optimization
  • Cost: Low (1x baseline)
  • Recommended Use: High-volume NPCs where cost is a concern

Specialized Models

Qwen2.5 Coder 32B πŸ’»

  • Model ID: @cf/qwen/qwen2.5-coder-32b-instruct
  • Performance: Code generation specialist
  • Context Window: 32,768 tokens
  • Best For: Code-related NPCs, programming tutors
  • Cost: Higher (6x baseline)
  • Capabilities: Matches GPT-4o for coding tasks
  • Recommended Use: Programming mentor NPCs, code review assistants

QwQ 32B Reasoning 🧠

  • Model ID: @cf/qwen/qwq-32b
  • Performance: Deep analytical reasoning
  • Context Window: 32,768 tokens
  • Best For: Complex reasoning, analytical tasks, problem-solving
  • Cost: Higher (6x baseline)
  • Capabilities: Competitive with DeepSeek-R1
  • Recommended Use: Logic puzzles, math tutors, strategic advisors

Mistral Small 3.1 24B

  • Model ID: @cf/mistralai/mistral-small-3.1-24b-instruct
  • Performance: State-of-the-art in its size class
  • Context Window: 128,000 tokens
  • Best For: Vision tasks, tool calling, advanced tasks
  • Cost: Higher (5x baseline)
  • Capabilities: Vision + tool calling
  • Recommended Use: NPCs that need to process images or use tools

Multimodal & Vision Models

Llama 4 Scout 17B 🎨

  • Model ID: @cf/meta/llama-4-scout-17b-16e-instruct
  • Performance: Multimodal (text + images)
  • Context Window: 131,072 tokens
  • Best For: Image understanding, multimodal interactions
  • Cost: Higher (4x baseline)
  • Capabilities: Native text and image understanding
  • Recommended Use: Art NPCs, visual learning assistants, image analysis

Gemma 3 12B IT 🌍

  • Model ID: @cf/google/gemma-3-12b-it
  • Performance: Multilingual + vision
  • Context Window: 128,000 tokens
  • Best For: Multilingual NPCs, global audience, vision tasks
  • Cost: Higher (similar to Mistral)
  • Capabilities: 140+ languages, vision support
  • Recommended Use: International NPCs, language tutors, multicultural characters

Use Case Recommendations

By NPC Type

| NPC Type | Recommended Model | Reasoning |
| --- | --- | --- |
| Quest Giver | Llama 3.1 8B | Balanced cost/quality for general dialogue |
| Combat Trainer | Llama 3.3 70B Fast | Fast responses critical for combat scenarios |
| Programming Mentor | Qwen2.5 Coder 32B | Specialized for code generation and explanation |
| Logic Puzzle Master | QwQ 32B | Superior reasoning for complex puzzles |
| Art Teacher | Llama 4 Scout 17B | Multimodal for discussing and analyzing art |
| Language Tutor | Gemma 3 12B | Multilingual support (140+ languages) |
| Shopkeeper | Llama 3.2 3B | Cost-effective for high-volume interactions |
| Story Narrator | Llama 3.1 8B | Good balance for creative content |
| Math Tutor | QwQ 32B | Reasoning capabilities for problem-solving |
| Character Co-Host | Llama 3.3 70B Fast | Premium experience for main game characters |
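As a sketch, the table above can be encoded as a simple lookup. The `MODEL_BY_NPC_TYPE` map and `modelForNpcType` helper are illustrative names invented for this guide, not part of any API:

```typescript
// Illustrative mapping from NPC archetype to the model ID recommended above.
const MODEL_BY_NPC_TYPE: Record<string, string> = {
  "quest-giver": "@cf/meta/llama-3.1-8b-instruct",
  "combat-trainer": "@cf/meta/llama-3.3-70b-instruct-fp8-fast",
  "programming-mentor": "@cf/qwen/qwen2.5-coder-32b-instruct",
  "logic-puzzle-master": "@cf/qwen/qwq-32b",
  "art-teacher": "@cf/meta/llama-4-scout-17b-16e-instruct",
  "language-tutor": "@cf/google/gemma-3-12b-it",
  "shopkeeper": "@cf/meta/llama-3.2-3b-instruct",
  "story-narrator": "@cf/meta/llama-3.1-8b-instruct",
  "math-tutor": "@cf/qwen/qwq-32b",
  "character-co-host": "@cf/meta/llama-3.3-70b-instruct-fp8-fast",
};

// Unknown archetypes fall back to the default model (Llama 3.1 8B).
function modelForNpcType(npcType: string): string {
  return MODEL_BY_NPC_TYPE[npcType] ?? "@cf/meta/llama-3.1-8b-instruct";
}
```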

By Use Case Priority

Prioritize Speed:

  • Primary: Llama 3.3 70B Fast ⚑
  • Budget: Llama 3.2 3B πŸ’°

Prioritize Quality:

  • Complex Tasks: Llama 3.3 70B Fast
  • Reasoning: QwQ 32B 🧠
  • Code: Qwen2.5 Coder 32B πŸ’»

Prioritize Cost:

  • Best Value: Llama 3.2 3B πŸ’°
  • Balanced: Llama 3.1 8B (Default)

Special Requirements:

  • Multimodal: Llama 4 Scout 17B 🎨
  • Multilingual: Gemma 3 12B 🌍
  • Vision: Mistral Small 3.1 24B or Gemma 3 12B
  • Large Context: Llama 4 Scout (131K), Gemma 3 (128K), or Mistral (128K)
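The priority rules above can be sketched as a small chooser. The `Requirements` shape, its field names, and the precedence order (specialist needs before cost) are assumptions made for illustration, not a documented API:

```typescript
// Illustrative requirement flags for picking a model.
interface Requirements {
  needsCode?: boolean;
  needsReasoning?: boolean;
  needsMultilingual?: boolean;
  needsVision?: boolean;
  prioritizeCost?: boolean;
}

// Specialist requirements win first; cost pressure next; otherwise default.
function recommendModel(req: Requirements): string {
  if (req.needsCode) return "@cf/qwen/qwen2.5-coder-32b-instruct";
  if (req.needsReasoning) return "@cf/qwen/qwq-32b";
  if (req.needsMultilingual) return "@cf/google/gemma-3-12b-it";
  if (req.needsVision) return "@cf/meta/llama-4-scout-17b-16e-instruct";
  if (req.prioritizeCost) return "@cf/meta/llama-3.2-3b-instruct";
  return "@cf/meta/llama-3.1-8b-instruct"; // balanced default
}
```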

Cost Considerations

Relative Cost Matrix

Based on Llama 3.2 3B as baseline (1x):

| Model | Relative Cost | Best Use Case |
| --- | --- | --- |
| Llama 3.2 3B | 1x | High-volume, simple interactions |
| Llama 3.1 8B | ~2x | General-purpose NPCs |
| Llama 4 Scout 17B | ~4x | Multimodal experiences |
| Mistral Small 3.1 24B | ~5x | Vision + tool calling |
| QwQ 32B | ~6x | Complex reasoning |
| Qwen2.5 Coder 32B | ~6x | Code generation |
| Llama 3.3 70B | ~8x | Premium chat experiences |
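A minimal sketch of comparing workloads with the multipliers above. The numbers are the approximate ratios from the matrix, not billing figures, and `relativeCost` is a hypothetical helper:

```typescript
// Approximate cost multipliers from the matrix (Llama 3.2 3B = 1x).
const RELATIVE_COST: Record<string, number> = {
  "@cf/meta/llama-3.2-3b-instruct": 1,
  "@cf/meta/llama-3.1-8b-instruct": 2,
  "@cf/meta/llama-4-scout-17b-16e-instruct": 4,
  "@cf/mistralai/mistral-small-3.1-24b-instruct": 5,
  "@cf/qwen/qwq-32b": 6,
  "@cf/qwen/qwen2.5-coder-32b-instruct": 6,
  "@cf/meta/llama-3.3-70b-instruct-fp8-fast": 8,
};

// Relative cost of a workload: interactions x cost multiplier.
// Unknown models are treated like the 2x default (Llama 3.1 8B).
function relativeCost(model: string, interactions: number): number {
  return (RELATIVE_COST[model] ?? 2) * interactions;
}
```

For example, moving 1,000 shopkeeper interactions from the 70B model to Llama 3.2 3B drops the relative cost from 8,000 to 1,000 units.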

Cost Optimization Strategies

  1. Default to Efficient Models: Use Llama 3.1 8B or 3.2 3B for most NPCs
  2. Reserve Premium for Key NPCs: Only use 70B for main storyline characters
  3. Match Complexity: Don’t use reasoning models for simple chat
  4. Monitor Usage: Track per-NPC interaction costs
  5. Batch Operations: For statement generation, use cost-effective models

Performance Characteristics

Speed Comparison

Fastest to Slowest:

  1. Llama 3.3 70B Fast ⚑ (fp8 quantization keeps it fast despite its size)
  2. Llama 3.2 3B πŸ’°
  3. Llama 3.1 8B
  4. Smaller specialized models (12B-24B)
  5. Larger specialized models (32B)

Context Window Comparison

| Model | Context Window | Use Case |
| --- | --- | --- |
| Llama 4 Scout | 131,072 tokens | Very long conversations, extensive lore |
| Gemma 3 12B | 128,000 tokens | Long-form educational content |
| Mistral Small 3.1 | 128,000 tokens | Complex multi-turn dialogues |
| QwQ 32B | 32,768 tokens | Extended reasoning chains |
| Qwen2.5 Coder | 32,768 tokens | Large code context |
| Llama 3.x (all) | 8,192 tokens | Standard conversations |
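A context-budget guard based on the windows listed above can look like this. `fitsContext` is an illustrative helper and assumes you estimate prompt token counts yourself:

```typescript
// Context windows from the table above, keyed by model ID.
const CONTEXT_WINDOW: Record<string, number> = {
  "@cf/meta/llama-4-scout-17b-16e-instruct": 131072,
  "@cf/google/gemma-3-12b-it": 128000,
  "@cf/mistralai/mistral-small-3.1-24b-instruct": 128000,
  "@cf/qwen/qwq-32b": 32768,
  "@cf/qwen/qwen2.5-coder-32b-instruct": 32768,
  "@cf/meta/llama-3.1-8b-instruct": 8192,
};

// True if the prompt plus the reserved output budget fits the window.
// Unknown models are assumed to have the smallest (8,192-token) window.
function fitsContext(model: string, promptTokens: number, maxOutput: number): boolean {
  const window = CONTEXT_WINDOW[model] ?? 8192;
  return promptTokens + maxOutput <= window;
}
```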

Migration Guide

Upgrading Existing NPCs

From Llama 3.1 8B:

  • For faster responses: Upgrade to Llama 3.3 70B Fast
  • For cost savings: Downgrade to Llama 3.2 3B
  • For specialized tasks: Switch to appropriate specialist model

Note: Legacy models (DialoGPT, Mistral 7B v0.1, Qwen 1.5, OpenChat 3.5) have been deprecated and automatically upgraded to modern equivalents as of v1.4.0.

Testing New Models

  1. Create test NPC with new model
  2. Compare response quality with existing model
  3. Monitor response times and costs
  4. Gather user feedback
  5. Roll out gradually to production NPCs
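Step 2 above (comparing response quality) can be sketched as a small A/B harness. `ask` is a stand-in for your actual model call, not a real client API:

```typescript
// Run the same prompts through two models and collect replies for
// side-by-side review. Latency/cost tracking (steps 3-4) would hang off
// the same loop.
async function compareModels(
  ask: (model: string, prompt: string) => Promise<string>,
  modelA: string,
  modelB: string,
  prompts: string[],
): Promise<Array<{ prompt: string; a: string; b: string }>> {
  const results: Array<{ prompt: string; a: string; b: string }> = [];
  for (const prompt of prompts) {
    results.push({ prompt, a: await ask(modelA, prompt), b: await ask(modelB, prompt) });
  }
  return results;
}
```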

Advanced Configuration

Temperature Settings by Model

Llama 3.3 70B Fast:

  • Chat: 0.7-0.9 (creative)
  • Tasks: 0.3-0.5 (focused)

QwQ 32B (Reasoning):

  • Use lower temperatures (0.1-0.3) for precise reasoning
  • Avoid high temperatures that can break reasoning chains

Qwen2.5 Coder:

  • Code generation: 0.2-0.4 (precise)
  • Code explanation: 0.5-0.7 (balanced)

Multimodal Models:

  • Image analysis: 0.3-0.5 (accurate)
  • Creative tasks: 0.7-1.0 (expressive)
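One way to encode the ranges above is a per-mode default, using roughly the midpoint of each suggested range. The `Mode` type and the exact values are assumptions for illustration, not vendor recommendations:

```typescript
// Interaction modes corresponding to the guidance above.
type Mode = "chat" | "task" | "reasoning" | "code" | "vision";

// Rough midpoints of the suggested temperature ranges.
function defaultTemperature(mode: Mode): number {
  switch (mode) {
    case "reasoning": return 0.2; // keep reasoning chains stable
    case "code": return 0.3;      // precise code generation
    case "task": return 0.4;      // focused task completion
    case "vision": return 0.4;    // accurate image analysis
    case "chat": return 0.8;      // creative conversation
  }
}
```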

Max Tokens by Use Case

  • Quick responses: 256-512 tokens
  • Standard chat: 512-1024 tokens
  • Detailed explanations: 1024-2048 tokens
  • Long-form content: 2048-4096 tokens
  • Maximum (large context models): Up to context window limit
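These budgets can be captured in a small lookup; the use-case keys and the choice of each range's upper bound are illustrative:

```typescript
// Upper bounds of the token budgets listed above.
const MAX_TOKENS: Record<string, number> = {
  quick: 512,
  standard: 1024,
  detailed: 2048,
  longform: 4096,
};

// Unknown use cases default to the standard chat budget.
function maxTokensFor(useCase: string): number {
  return MAX_TOKENS[useCase] ?? 1024;
}
```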

Best Practices

  1. Start with defaults: Use Llama 3.1 8B unless you have specific needs
  2. Match model to task: Use specialized models for their strengths
  3. Monitor costs: Track usage and optimize based on data
  4. Test thoroughly: Validate quality before switching production NPCs
  5. Consider user experience: Premium models for critical interactions
  6. Plan for scale: Cost-effective models for high-volume NPCs
  7. Stay updated: New models released regularly, review quarterly

Troubleshooting

Model Not Available

  • Verify model ID is correct
  • Check Cloudflare Workers AI status
  • Fallback to default model (Llama 3.1 8B)
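The fallback advice above can be sketched as a wrapper that retries with the default model on any failure. `runModel` is a placeholder for your actual AI client call:

```typescript
// Try the preferred model first; on any error (bad model ID, outage),
// retry once with the default model.
async function runWithFallback(
  runModel: (model: string) => Promise<string>,
  preferred: string,
  fallback = "@cf/meta/llama-3.1-8b-instruct",
): Promise<string> {
  try {
    return await runModel(preferred);
  } catch {
    return await runModel(fallback);
  }
}
```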

Poor Response Quality

  • Try higher-tier model (e.g., 3.1 8B β†’ 3.3 70B)
  • Adjust temperature settings
  • Improve system prompt
  • Consider specialized model for task

Slow Responses

  • Switch to Llama 3.3 70B Fast ⚑
  • Use Llama 3.2 3B for simpler tasks
  • Reduce max_tokens if excessive

High Costs

  • Analyze per-NPC usage
  • Downgrade non-critical NPCs to 3.2 3B
  • Reserve premium models for key characters
  • Implement usage quotas
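The quota suggestion above can be sketched as a per-NPC counter; the class name and limit semantics are illustrative assumptions:

```typescript
// Minimal in-memory usage quota: each NPC gets a fixed number of
// interactions before further calls are refused.
class UsageQuota {
  private counts = new Map<string, number>();
  constructor(private limitPerNpc: number) {}

  // Records one interaction and returns true if the NPC was under quota.
  tryConsume(npcId: string): boolean {
    const used = this.counts.get(npcId) ?? 0;
    if (used >= this.limitPerNpc) return false;
    this.counts.set(npcId, used + 1);
    return true;
  }
}
```

In production you would likely back the counter with durable storage rather than process memory.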

Version History

  • v1.3.6: Added 6 new 2025 models (Llama 4 Scout, Llama 3.3 70B, etc.)
  • v1.3.0: Initial multi-provider AI support
  • v1.0.0: Single Cloudflare model support

PadawanForge v1.4.1