Monitoring & Observability System
Overview
PadawanForge includes a monitoring and observability system that tracks application performance, errors, and system health. It is built around the EnhancedMonitoring class and reports request, database, and service-health metrics in real time.
Architecture
Core Components
- Request Metrics Tracking: Unique request IDs and performance monitoring
- Database Query Monitoring: SQL query performance and error tracking
- Health Checks: Automated health monitoring for all services
- Alert System: Configurable alert rules with severity levels
- System Metrics: Aggregated performance and health data
Key Features
- Request Tracing: Every request gets a unique ID for end-to-end tracing
- Performance Monitoring: Response times, memory usage, and throughput tracking
- Error Categorization: Automatic error classification and severity assessment
- Health Monitoring: Real-time health checks for database, KV, and AI services
- Alert Management: Configurable alert rules with cooldown periods
Implementation
EnhancedMonitoring Class
import { EnhancedMonitoring } from '@/lib/utils/monitoring';
// Get singleton instance
const monitoring = EnhancedMonitoring.getInstance();
// Start tracking a request; the returned metrics carry the request ID
const metrics = monitoring.startRequest(request, env);
const requestId = metrics.requestId;
// Track database queries
const startTime = Date.now(); // capture before running the query
monitoring.trackDatabaseQuery(
  requestId,
  'SELECT * FROM players WHERE uuid = ?',
  startTime,
  true,      // query succeeded
  undefined, // no error message
  1          // rows returned
);
// Track errors
monitoring.trackError(requestId, error, 'database');
// End request tracking
const finalMetrics = monitoring.endRequest(requestId, response);
Request Metrics Interface
interface RequestMetrics {
  requestId: string;
  method: string;
  endpoint: string;
  startTime: number;
  endTime?: number;
  duration?: number;
  statusCode?: number;
  userId?: string;
  userAgent?: string;
  country?: string;
  colo?: string;
  errorCount: number;
  retryCount: number;
  databaseQueries: DatabaseQuery[];
  memoryUsage?: number;
  responseSize?: number;
}
Health Check System
The system automatically monitors the health of critical services:
// Database health check
const dbHealth = await monitoring.checkDatabaseHealth(db);
// KV health check
const kvHealth = await monitoring.checkKVHealth(kv);
// AI service health check
const aiHealth = await monitoring.checkAIHealth(ai);
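The internals of these checks are not shown here. As a rough sketch of how a check of this kind could work, the code below times a probe call and maps the outcome to a status. The `HealthCheckResult` shape, the `runHealthCheck` and `classifyHealth` helpers, the 'healthy' and 'degraded' status names, and the 500 ms threshold are all assumptions for illustration, not part of EnhancedMonitoring.

```typescript
// Hypothetical result shape; only the 'unhealthy' status appears in this document.
type HealthStatus = 'healthy' | 'degraded' | 'unhealthy';

interface HealthCheckResult {
  service: string;
  status: HealthStatus;
  latencyMs: number;
  error?: string;
}

// Maps a probe outcome to a status. The 500 ms default is an arbitrary example threshold.
function classifyHealth(ok: boolean, latencyMs: number, degradedAfterMs: number = 500): HealthStatus {
  if (!ok) return 'unhealthy';
  return latencyMs > degradedAfterMs ? 'degraded' : 'healthy';
}

// Runs a probe, measures how long it took, and never throws:
// a failing probe is reported as an 'unhealthy' result instead.
async function runHealthCheck(
  service: string,
  probe: () => Promise<unknown>
): Promise<HealthCheckResult> {
  const start = Date.now();
  try {
    await probe();
    const latencyMs = Date.now() - start;
    return { service, status: classifyHealth(true, latencyMs), latencyMs };
  } catch (err) {
    const latencyMs = Date.now() - start;
    const message = err instanceof Error ? err.message : String(err);
    return { service, status: 'unhealthy', latencyMs, error: message };
  }
}
```

A database probe could be as small as `() => db.prepare('SELECT 1').all()`, reusing the query style from the examples in this document. Returning a result object rather than throwing keeps one failing service from interrupting checks on the others.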
Alert Rules
Alert rules are configurable. Each rule pairs a condition on the system metrics with a severity, a message, and a cooldown:
// Example alert rule
{
  name: 'High Error Rate',
  condition: (metrics) => metrics.errorRate > 0.05,
  severity: 'high',
  message: 'Error rate exceeds 5%',
  cooldown: 15 // minutes
}
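To make the cooldown behavior concrete, here is a minimal sketch of how rules like the one above could be evaluated. It assumes a rule fires when its condition holds and it has not fired within its cooldown window. The `AlertRule` and `MetricsSnapshot` types, the 'low' severity, and the `evaluateAlerts` function are hypothetical and may not match how `checkAlerts` actually works.

```typescript
// Hypothetical metric and rule shapes, modeled on the example rule above.
interface MetricsSnapshot {
  errorRate: number;           // fraction of failed requests, 0.05 = 5%
  averageResponseTime: number; // milliseconds
}

interface AlertRule {
  name: string;
  condition: (metrics: MetricsSnapshot) => boolean;
  severity: 'low' | 'medium' | 'high' | 'critical';
  message: string;
  cooldown: number; // minutes
}

// Returns the rules that fire at time `now` (ms), skipping rules still in cooldown.
// `lastFired` maps rule name to the timestamp (ms) of its last firing and is updated in place.
function evaluateAlerts(
  rules: AlertRule[],
  metrics: MetricsSnapshot,
  lastFired: Record<string, number>,
  now: number
): AlertRule[] {
  const fired: AlertRule[] = [];
  for (const rule of rules) {
    if (!rule.condition(metrics)) continue;
    const last = lastFired[rule.name];
    const cooldownMs = rule.cooldown * 60000;
    if (last !== undefined && now - last < cooldownMs) continue;
    lastFired[rule.name] = now;
    fired.push(rule);
  }
  return fired;
}

const exampleRules: AlertRule[] = [
  {
    name: 'High Error Rate',
    condition: (metrics) => metrics.errorRate > 0.05,
    severity: 'high',
    message: 'Error rate exceeds 5%',
    cooldown: 15,
  },
];

const lastFired: Record<string, number> = {};
const degraded: MetricsSnapshot = { errorRate: 0.08, averageResponseTime: 300 };

const atStart = evaluateAlerts(exampleRules, degraded, lastFired, 0);               // fires
const afterFiveMin = evaluateAlerts(exampleRules, degraded, lastFired, 5 * 60000);  // suppressed
const afterFifteen = evaluateAlerts(exampleRules, degraded, lastFired, 15 * 60000); // fires again
```

Passing `now` in as a parameter, instead of calling `Date.now()` inside the function, keeps the cooldown logic easy to test.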
Usage Examples
Basic Request Monitoring
// In middleware or API route
const metrics = monitoring.startRequest(request, env);
try {
  // Your application logic; processRequest is assumed to return the Response
  const response = await processRequest(request);
  // End request with success
  monitoring.endRequest(metrics.requestId, response);
  return response;
} catch (error) {
  // Track error
  monitoring.trackError(metrics.requestId, error, 'application');
  // End request with error
  const errorResponse = new Response('Internal Server Error', { status: 500 });
  monitoring.endRequest(metrics.requestId, errorResponse);
  return errorResponse;
}
Database Query Monitoring
const startTime = Date.now();
try {
  const result = await db.prepare('SELECT * FROM players').all();
  monitoring.trackDatabaseQuery(
    requestId,
    'SELECT * FROM players',
    startTime,
    true,
    undefined,
    result.results.length
  );
} catch (error) {
  // A caught value is not guaranteed to be an Error instance
  const message = error instanceof Error ? error.message : String(error);
  monitoring.trackDatabaseQuery(
    requestId,
    'SELECT * FROM players',
    startTime,
    false,
    message
  );
}
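The try/catch pattern above repeats for every query, so it may be worth wrapping. The sketch below shows one possible helper. `trackedQuery` and the `QueryTracker` interface are hypothetical. The `trackDatabaseQuery` parameter order is inferred from the calls in this document, not from the class definition.

```typescript
// Minimal stand-in for the monitoring call used above; parameter order is inferred
// from the examples in this document.
interface QueryTracker {
  trackDatabaseQuery(
    requestId: string,
    query: string,
    startTime: number,
    success: boolean,
    error?: string,
    rowCount?: number
  ): void;
}

// Runs a query, records success or failure with the tracker, and rethrows on failure
// so callers still see the original error.
async function trackedQuery<T>(
  tracker: QueryTracker,
  requestId: string,
  sql: string,
  run: () => Promise<T[]>
): Promise<T[]> {
  const startTime = Date.now();
  try {
    const rows = await run();
    tracker.trackDatabaseQuery(requestId, sql, startTime, true, undefined, rows.length);
    return rows;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    tracker.trackDatabaseQuery(requestId, sql, startTime, false, message);
    throw err;
  }
}
```

With the objects from the example above, a call might look like `trackedQuery(monitoring, requestId, sql, async () => (await db.prepare(sql).all()).results)`, assuming `monitoring` satisfies the `QueryTracker` shape.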
Health Check Integration
// Periodic health checks
setInterval(async () => {
  const systemMetrics = monitoring.getCurrentMetrics();
  // Check for alerts
  monitoring.checkAlerts(systemMetrics);
  // Log system health
  console.log('System Health:', {
    requestCount: systemMetrics.requestCount,
    errorRate: systemMetrics.errorRate,
    averageResponseTime: systemMetrics.averageResponseTime,
    databaseHealth: systemMetrics.databaseHealth.status
  });
}, 60000); // Every minute
Configuration
Environment Variables
# Monitoring configuration
MONITORING_ENABLED=true
MONITORING_MAX_HISTORY=1000
MONITORING_ALERT_COOLDOWN=15
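Environment variables arrive as strings, so they need parsing and sensible fallbacks. The sketch below shows one way to do that. It assumes that the defaults mirror the values listed above, that monitoring is enabled when the flag is unset, and that the cooldown is in minutes (as in the alert rule example). The `MonitoringConfig` type and `parseMonitoringConfig` function are hypothetical.

```typescript
// Hypothetical typed view of the three variables above.
interface MonitoringConfig {
  enabled: boolean;
  maxHistory: number;
  alertCooldownMinutes: number; // assumed to be minutes
}

// Parses string-valued environment variables, falling back to defaults
// when a value is missing, non-numeric, or negative.
function parseMonitoringConfig(env: Record<string, string | undefined>): MonitoringConfig {
  const toInt = (value: string | undefined, fallback: number): number => {
    const parsed = parseInt(value ?? '', 10);
    return isNaN(parsed) || parsed < 0 ? fallback : parsed;
  };
  return {
    enabled: (env['MONITORING_ENABLED'] ?? 'true').toLowerCase() === 'true',
    maxHistory: toInt(env['MONITORING_MAX_HISTORY'], 1000),
    alertCooldownMinutes: toInt(env['MONITORING_ALERT_COOLDOWN'], 15),
  };
}

const parsedConfig = parseMonitoringConfig({
  MONITORING_MAX_HISTORY: '250',
  MONITORING_ALERT_COOLDOWN: 'abc', // invalid, so the default of 15 applies
});
// parsedConfig is { enabled: true, maxHistory: 250, alertCooldownMinutes: 15 }
```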
Alert Configuration
// Configure alert rules
const alertRules = [
  {
    name: 'High Response Time',
    condition: (metrics) => metrics.averageResponseTime > 2000,
    severity: 'medium',
    message: 'Average response time exceeds 2 seconds'
  },
  {
    name: 'Database Unhealthy',
    condition: (metrics) => metrics.databaseHealth.status === 'unhealthy',
    severity: 'critical',
    message: 'Database service is unhealthy'
  }
];
Metrics Dashboard
The metrics dashboard groups its data into three areas:
System Metrics
- Request Count: Total requests processed
- Error Rate: Fraction of requests that failed (0.05 means 5%, matching the alert rule thresholds)
- Average Response Time: Mean response time across all requests
- Memory Usage: Current memory consumption
- Active Connections: Number of active WebSocket connections
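As an illustration of how the first three metrics could be derived from the request history, here is a small aggregation sketch. The `CompletedRequest` type is a hypothetical subset of `RequestMetrics`. Counting a request as failed when it has a 5xx status or a non-zero `errorCount` is an assumption, not a documented rule.

```typescript
// Hypothetical subset of RequestMetrics for requests that have finished.
interface CompletedRequest {
  duration: number;   // milliseconds
  statusCode: number;
  errorCount: number;
}

interface AggregatedMetrics {
  requestCount: number;
  errorRate: number;           // fraction in [0, 1]
  averageResponseTime: number; // milliseconds
}

// Summarizes a window of completed requests. An empty window yields zeros
// rather than NaN from dividing by zero.
function aggregate(history: CompletedRequest[]): AggregatedMetrics {
  if (history.length === 0) {
    return { requestCount: 0, errorRate: 0, averageResponseTime: 0 };
  }
  let failed = 0;
  let totalDuration = 0;
  for (const request of history) {
    // Assumption: server errors and requests with tracked errors count as failed.
    if (request.statusCode >= 500 || request.errorCount > 0) failed++;
    totalDuration += request.duration;
  }
  return {
    requestCount: history.length,
    errorRate: failed / history.length,
    averageResponseTime: totalDuration / history.length,
  };
}

const summary = aggregate([
  { duration: 100, statusCode: 200, errorCount: 0 },
  { duration: 200, statusCode: 200, errorCount: 0 },
  { duration: 300, statusCode: 500, errorCount: 1 },
  { duration: 400, statusCode: 200, errorCount: 0 },
]);
// summary is { requestCount: 4, errorRate: 0.25, averageResponseTime: 250 }
```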
Service Health
- Database: Connection status and query performance
- KV Storage: Read/write performance and availability
- AI Services: Model availability and response times
Performance Trends
- Request Volume: Requests per minute/hour
- Error Trends: Error rate over time
- Response Time Trends: Performance over time
- Resource Usage: Memory and CPU utilization
Best Practices
1. Request ID Propagation
Always propagate request IDs through your application:
// Add request ID to all downstream calls
const headers = {
  'X-Request-ID': requestId,
  'Authorization': `Bearer ${token}`
};
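Propagation works best when a service reuses an incoming request ID instead of always minting a new one. The sketch below shows that idea with plain objects. The helper names, the lowercase header-key lookup, and the fallback ID format are all assumptions. The IDs produced by `startRequest` may look different.

```typescript
// Hypothetical fallback ID format: base-36 timestamp plus a random suffix.
function generateRequestId(): string {
  return Date.now().toString(36) + '-' + Math.random().toString(36).slice(2, 10);
}

// Reuses the caller's request ID when present so one ID spans the whole call chain.
// Assumes header names have already been lowercased into a plain object.
function resolveRequestId(incoming: Record<string, string | undefined>): string {
  const existing = incoming['x-request-id'];
  return existing !== undefined && existing.trim() !== '' ? existing.trim() : generateRequestId();
}

// Builds headers for a downstream call, always carrying the request ID.
function downstreamHeaders(requestId: string, token?: string): Record<string, string> {
  const headers: Record<string, string> = { 'X-Request-ID': requestId };
  if (token) headers['Authorization'] = `Bearer ${token}`;
  return headers;
}

const inheritedId = resolveRequestId({ 'x-request-id': 'abc-123' }); // 'abc-123'
const freshId = resolveRequestId({});                                // newly generated
const outgoing = downstreamHeaders(inheritedId, 'secret-token');
```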
2. Error Context
Provide rich context for errors:
monitoring.trackError(requestId, error, 'database', {
  query: 'SELECT * FROM players',
  params: [playerId],
  duration: queryDuration
});
3. Performance Thresholds
Set appropriate performance thresholds:
// Record slow database queries (over 1 second) so they show up in error metrics
if (queryDuration > 1000) {
  monitoring.trackError(requestId, new Error('Slow query'), 'database');
}
4. Health Check Frequency
Configure health checks based on your needs:
// More frequent checks for critical services
// (checkServices is an application-defined helper, not shown here)
const criticalServices = ['database', 'ai'];
const standardServices = ['kv', 'r2'];
// Check critical services every 30 seconds
setInterval(() => checkServices(criticalServices), 30000);
// Check standard services every 2 minutes
setInterval(() => checkServices(standardServices), 120000);
Troubleshooting
Common Issues
- High Memory Usage
  - Check for memory leaks in long-running processes
  - Monitor object creation and cleanup
  - Review WebSocket connection management
- Slow Response Times
  - Analyze database query performance
  - Check external API response times
  - Monitor AI model inference times
- High Error Rates
  - Review error categorization
  - Check service dependencies
  - Monitor external service health
Debug Mode
Enable debug mode for detailed logging:
// Enable debug logging
if (process.env.NODE_ENV === 'development') {
  monitoring.enableDebugMode();
}
Integration with External Tools
Logging Integration
The monitoring system can integrate with external logging services:
// Send metrics to external logging service
monitoring.onMetricsUpdate((metrics) => {
  externalLogger.send(metrics);
});
Alert Integration
Configure alert notifications:
// Send alerts to external notification service
monitoring.onAlert((alert) => {
  notificationService.send(alert);
});
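When several notification sinks subscribe, a failure in one should not stop delivery to the rest. The sketch below shows one way a callback registry like `onAlert` could behave. The `AlertBus` class, the `Alert` shape, and the 'low' severity are hypothetical and are not taken from the EnhancedMonitoring source.

```typescript
// Hypothetical alert shape; the real alert object may carry more fields.
interface Alert {
  name: string;
  severity: 'low' | 'medium' | 'high' | 'critical';
  message: string;
}

type AlertListener = (alert: Alert) => void;

class AlertBus {
  private listeners: AlertListener[] = [];

  // Registers a listener and returns a function that removes it again.
  onAlert(listener: AlertListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Delivers the alert to every listener and returns how many listeners threw.
  emit(alert: Alert): number {
    let failures = 0;
    for (const listener of this.listeners) {
      try {
        listener(alert);
      } catch (_err) {
        failures++; // one failing sink must not block the others
      }
    }
    return failures;
  }
}

const bus = new AlertBus();
const receivedNames: string[] = [];
bus.onAlert(() => { throw new Error('sink down'); });
const unsubscribe = bus.onAlert((alert) => { receivedNames.push(alert.name); });

const failedSinks = bus.emit({
  name: 'Database Unhealthy',
  severity: 'critical',
  message: 'Database service is unhealthy',
});
// failedSinks is 1, and receivedNames is ['Database Unhealthy']
```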
This monitoring system provides the foundation for maintaining high availability and performance in the PadawanForge platform.