Monitoring & Observability System

Overview

PadawanForge includes a monitoring and observability system that tracks application performance, errors, and system health. It is built around the EnhancedMonitoring class and reports on application behavior in real time.

Architecture

Core Components

  • Request Metrics Tracking: Unique request IDs and performance monitoring
  • Database Query Monitoring: SQL query performance and error tracking
  • Health Checks: Automated health monitoring for all services
  • Alert System: Configurable alert rules with severity levels
  • System Metrics: Aggregated performance and health data

Key Features

  • Request Tracing: Every request gets a unique ID for end-to-end tracing
  • Performance Monitoring: Response times, memory usage, and throughput tracking
  • Error Categorization: Automatic error classification and severity assessment
  • Health Monitoring: Real-time health checks for database, KV, and AI services
  • Alert Management: Configurable alert rules with cooldown periods
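
As a rough illustration of the error categorization feature, the sketch below maps an arbitrary thrown value to a category and a severity. It is not the EnhancedMonitoring implementation: the function name categorizeError, the keyword patterns, and the 'network', 'unknown', and 'low' labels are assumptions made for this example. Only 'database', 'application', 'medium', 'high', and 'critical' appear elsewhere in this document.

```typescript
// Hypothetical sketch: names and keyword rules are assumptions, not the
// actual EnhancedMonitoring logic.
type ErrorCategory = 'database' | 'application' | 'network' | 'unknown';
type Severity = 'low' | 'medium' | 'high' | 'critical';

interface CategorizedError {
  category: ErrorCategory;
  severity: Severity;
  message: string;
}

function categorizeError(error: unknown): CategorizedError {
  // Thrown values are not guaranteed to be Error instances.
  const message = error instanceof Error ? error.message : String(error);
  const text = message.toLowerCase();

  if (/sql|database|constraint/.test(text)) {
    return { category: 'database', severity: 'high', message };
  }
  if (/fetch|network|timeout/.test(text)) {
    return { category: 'network', severity: 'medium', message };
  }
  if (error instanceof Error) {
    return { category: 'application', severity: 'medium', message };
  }
  // Non-Error values with no recognizable keyword.
  return { category: 'unknown', severity: 'low', message };
}
```

Keyword matching is a deliberately simple stand-in; a real classifier would more likely inspect error types or codes.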

Implementation

EnhancedMonitoring Class

import { EnhancedMonitoring } from '@/lib/utils/monitoring';

// Get singleton instance
const monitoring = EnhancedMonitoring.getInstance();

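// Start tracking a request; the returned metrics carry the request ID
const metrics = monitoring.startRequest(request, env);
const requestId = metrics.requestId;

// Track database queries
const startTime = Date.now(); // captured just before the query runs
monitoring.trackDatabaseQuery(
  requestId,
  'SELECT * FROM players WHERE uuid = ?',
  startTime, // query start timestamp
  true,      // success flag
  undefined, // error message (none on success)
  1          // number of rows returned
);

// Track errors (error is the value caught in a catch block)
monitoring.trackError(requestId, error, 'database');

// End request tracking (response is the Response being returned)
const finalMetrics = monitoring.endRequest(requestId, response);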

Request Metrics Interface

interface RequestMetrics {
  requestId: string;
  method: string;
  endpoint: string;
  startTime: number;
  endTime?: number;
  duration?: number;
  statusCode?: number;
  userId?: string;
  userAgent?: string;
  country?: string;
  colo?: string;
  errorCount: number;
  retryCount: number;
  databaseQueries: DatabaseQuery[];
  memoryUsage?: number;
  responseSize?: number;
}
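
RequestMetrics refers to a DatabaseQuery type that this document does not define. The sketch below guesses at its shape from the arguments passed to trackDatabaseQuery and shows how duration could be derived when a request ends. The field names on DatabaseQuery and the helpers finalizeRequest and totalQueryTime are assumptions for illustration only.

```typescript
// Assumed shape: fields mirror the trackDatabaseQuery arguments used in this
// document; the real DatabaseQuery type may differ.
interface DatabaseQuery {
  query: string;
  startTime: number;
  duration: number; // milliseconds
  success: boolean;
  error?: string;
  rowCount?: number;
}

// Only the RequestMetrics fields this sketch needs.
interface TimedRequest {
  startTime: number;
  endTime?: number;
  duration?: number;
  statusCode?: number;
  databaseQueries: DatabaseQuery[];
}

// Return a copy with endTime, duration, and statusCode filled in.
function finalizeRequest(metrics: TimedRequest, statusCode: number, now: number): TimedRequest {
  return { ...metrics, endTime: now, duration: now - metrics.startTime, statusCode };
}

// Total time spent in database queries for one request, in milliseconds.
function totalQueryTime(metrics: TimedRequest): number {
  return metrics.databaseQueries.reduce((sum, q) => sum + q.duration, 0);
}
```

Comparing totalQueryTime with duration gives a quick view of how much of a request was spent waiting on the database.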

Health Check System

The system automatically monitors the health of critical services:

// Database health check
const dbHealth = await monitoring.checkDatabaseHealth(db);

// KV health check
const kvHealth = await monitoring.checkKVHealth(kv);

// AI service health check
const aiHealth = await monitoring.checkAIHealth(ai);
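
The document does not show what a health check does internally. One plausible shape, sketched below, is to run a cheap probe against the service, time it, and convert a thrown error into an unhealthy result. The names runHealthCheck and HealthResult, the latencyMs field, and the 'healthy' status value are assumptions; only a status of 'unhealthy' is confirmed by the examples in this document.

```typescript
// Hypothetical result shape for a single service check.
interface HealthResult {
  service: string;
  status: 'healthy' | 'unhealthy';
  latencyMs: number;
  error?: string;
}

// Run a probe and convert its outcome into a HealthResult. The probe is any
// async function that throws when the service is unavailable. The clock is
// injectable so the function can be tested deterministically.
async function runHealthCheck(
  service: string,
  probe: () => Promise<unknown>,
  now: () => number = Date.now,
): Promise<HealthResult> {
  const started = now();
  try {
    await probe();
    return { service, status: 'healthy', latencyMs: now() - started };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { service, status: 'unhealthy', latencyMs: now() - started, error: message };
  }
}
```

For the database, a probe might be as small as () => db.prepare('SELECT 1').all(), using the same db.prepare call shown later in this document.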

Alert Rules

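Alert rules are configurable. Each rule pairs a condition on the system metrics with a severity, a message, and, in this example, a cooldown in minutes: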

// Example alert rule
{
  name: 'High Error Rate',
  condition: (metrics) => metrics.errorRate > 0.05,
  severity: 'high',
  message: 'Error rate exceeds 5%',
  cooldown: 15 // minutes
}
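
To make the cooldown behavior concrete, here is a minimal sketch of how rules of this shape could be evaluated. It assumes a rule that fired recently is suppressed until its cooldown has elapsed and that rules without a cooldown fall back to a default of 15 minutes. The evaluateAlerts function, the lastFired map, and the trimmed SystemMetrics type are hypothetical and not taken from the EnhancedMonitoring source.

```typescript
// Trimmed to the fields used by the example rules in this document.
interface SystemMetrics {
  errorRate: number;
  averageResponseTime: number;
}

interface AlertRule {
  name: string;
  condition: (metrics: SystemMetrics) => boolean;
  severity: 'medium' | 'high' | 'critical';
  message: string;
  cooldown?: number; // minutes
}

// Return the rules that fire at time `now` (milliseconds), skipping any rule
// still inside its cooldown window. `lastFired` is updated for fired rules.
function evaluateAlerts(
  rules: AlertRule[],
  metrics: SystemMetrics,
  lastFired: Map<string, number>,
  now: number,
  defaultCooldownMinutes = 15, // assumed default
): AlertRule[] {
  const fired: AlertRule[] = [];
  for (const rule of rules) {
    if (!rule.condition(metrics)) continue;
    const cooldownMs = (rule.cooldown ?? defaultCooldownMinutes) * 60_000;
    const last = lastFired.get(rule.name);
    if (last !== undefined && now - last < cooldownMs) continue;
    lastFired.set(rule.name, now);
    fired.push(rule);
  }
  return fired;
}
```

Keeping the last-fired timestamps outside the function keeps it free of hidden state, which makes the cooldown logic easy to test with fixed timestamps.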

Usage Examples

Basic Request Monitoring

// In middleware or API route
const metrics = monitoring.startRequest(request, env);

try {
  // Your application logic (processRequest is assumed to return a Response)
  const response = await processRequest(request);

  // End request with success
  monitoring.endRequest(metrics.requestId, response);
  return response;
} catch (error) {
  // Track error
  monitoring.trackError(metrics.requestId, error, 'application');

  // End request with error
  const errorResponse = new Response('Internal Server Error', { status: 500 });
  monitoring.endRequest(metrics.requestId, errorResponse);
  return errorResponse;
}

Database Query Monitoring

const startTime = Date.now();
try {
  const result = await db.prepare('SELECT * FROM players').all();
  
  monitoring.trackDatabaseQuery(
    requestId,
    'SELECT * FROM players',
    startTime,
    true,
    undefined,
    result.results.length
  );
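} catch (error) {
  // The caught value is not guaranteed to be an Error instance
  const message = error instanceof Error ? error.message : String(error);
  monitoring.trackDatabaseQuery(
    requestId,
    'SELECT * FROM players',
    startTime,
    false,
    message
  );
}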

Health Check Integration

// Periodic health checks
setInterval(async () => {
  const systemMetrics = monitoring.getCurrentMetrics();
  
  // Check for alerts
  monitoring.checkAlerts(systemMetrics);
  
  // Log system health
  console.log('System Health:', {
    requestCount: systemMetrics.requestCount,
    errorRate: systemMetrics.errorRate,
    averageResponseTime: systemMetrics.averageResponseTime,
    databaseHealth: systemMetrics.databaseHealth.status
  });
}, 60000); // Every minute

Configuration

Environment Variables

# Monitoring configuration
MONITORING_ENABLED=true
MONITORING_MAX_HISTORY=1000
MONITORING_ALERT_COOLDOWN=15
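
The document does not describe how these variables are read. A minimal sketch of parsing them into a typed configuration object follows. The parseMonitoringConfig function, the MonitoringConfig type, the fallback values (copied from the sample above), and the reading of MONITORING_ALERT_COOLDOWN as minutes are all assumptions.

```typescript
interface MonitoringConfig {
  enabled: boolean;
  maxHistory: number;
  alertCooldownMinutes: number; // unit assumed from the alert rule examples
}

// Parse monitoring settings from an environment-like object. Missing or
// invalid numeric values fall back to the sample values shown above.
function parseMonitoringConfig(env: Record<string, string | undefined>): MonitoringConfig {
  const toPositiveInt = (value: string | undefined, fallback: number): number => {
    const parsed = parseInt(value ?? '', 10);
    // parseInt yields NaN for non-numeric input, and NaN > 0 is false.
    return parsed > 0 ? parsed : fallback;
  };

  return {
    enabled: (env.MONITORING_ENABLED ?? 'true').toLowerCase() === 'true',
    maxHistory: toPositiveInt(env.MONITORING_MAX_HISTORY, 1000),
    alertCooldownMinutes: toPositiveInt(env.MONITORING_ALERT_COOLDOWN, 15),
  };
}
```

Validating once at startup keeps string parsing out of the request path and makes bad values fail in a predictable way.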

Alert Configuration

// Configure alert rules
const alertRules = [
  {
    name: 'High Response Time',
    condition: (metrics) => metrics.averageResponseTime > 2000,
    severity: 'medium',
    message: 'Average response time exceeds 2 seconds'
  },
  {
    name: 'Database Unhealthy',
    condition: (metrics) => metrics.databaseHealth.status === 'unhealthy',
    severity: 'critical',
    message: 'Database service is unhealthy'
  }
];

Metrics Dashboard

The metrics dashboard reports the following:

System Metrics

  • Request Count: Total requests processed
  • Error Rate: Fraction of requests that failed (0.05 means 5%, as in the alert rule examples)
  • Average Response Time: Mean response time across all requests
  • Memory Usage: Current memory consumption
  • Active Connections: Number of active WebSocket connections
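
The first three system metrics can be derived from a history of completed requests. The sketch below shows one way to do that. The aggregate function, the CompletedRequest type, and the rule for what counts as a failed request (a 5xx status or at least one tracked error) are assumptions, since the document does not define them.

```typescript
// Subset of RequestMetrics needed for aggregation.
interface CompletedRequest {
  duration: number; // milliseconds
  statusCode: number;
  errorCount: number;
}

interface AggregatedMetrics {
  requestCount: number;
  errorRate: number; // fraction between 0 and 1
  averageResponseTime: number; // milliseconds
}

function aggregate(requests: CompletedRequest[]): AggregatedMetrics {
  const requestCount = requests.length;
  if (requestCount === 0) {
    // Avoid dividing by zero when no requests have completed yet.
    return { requestCount: 0, errorRate: 0, averageResponseTime: 0 };
  }
  // Assumed definition of failure: server error status or any tracked error.
  const failed = requests.filter((r) => r.statusCode >= 500 || r.errorCount > 0).length;
  const totalDuration = requests.reduce((sum, r) => sum + r.duration, 0);
  return {
    requestCount,
    errorRate: failed / requestCount,
    averageResponseTime: totalDuration / requestCount,
  };
}
```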

Service Health

  • Database: Connection status and query performance
  • KV Storage: Read/write performance and availability
  • AI Services: Model availability and response times

Performance Trends

  • Request Volume: Requests per minute/hour
  • Error Trends: Error rate over time
  • Response Time Trends: Performance over time
  • Resource Usage: Memory and CPU utilization

Best Practices

1. Request ID Propagation

Always propagate request IDs through your application:

// Add request ID to all downstream calls
const headers = {
  'X-Request-ID': requestId,
  'Authorization': `Bearer ${token}`
};
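
Propagation has two halves: reusing an ID that arrived with the incoming request, and attaching it to outgoing calls. The helpers below sketch both using plain header objects. The names readRequestId and withRequestId are hypothetical; only the X-Request-ID header name comes from the example above.

```typescript
const REQUEST_ID_HEADER = 'X-Request-ID';

// Find an incoming request ID, treating header names case-insensitively.
function readRequestId(headers: Record<string, string>): string | undefined {
  const wanted = REQUEST_ID_HEADER.toLowerCase();
  for (const name of Object.keys(headers)) {
    if (name.toLowerCase() === wanted) return headers[name];
  }
  return undefined;
}

// Return a copy of the headers with the request ID attached.
function withRequestId(
  requestId: string,
  headers: Record<string, string> = {},
): Record<string, string> {
  return { ...headers, [REQUEST_ID_HEADER]: requestId };
}
```

A downstream call could then pass withRequestId(requestId, { Authorization: `Bearer ${token}` }) as its headers, so every log line for that call can be matched to the originating request.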

2. Error Context

Provide rich context for errors:

monitoring.trackError(requestId, error, 'database', {
  query: 'SELECT * FROM players',
  params: [playerId],
  duration: queryDuration
});

3. Performance Thresholds

Set appropriate performance thresholds:

// Alert on slow database queries
if (queryDuration > 1000) {
  monitoring.trackError(requestId, new Error('Slow query'), 'database');
}
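
The threshold check above needs a measured duration. A small wrapper can time any async operation and flag it as slow in one place, as sketched below. The timeQuery helper, the TimedResult type, and the injectable clock are assumptions for illustration; the 1000 ms default matches the threshold used above.

```typescript
interface TimedResult<T> {
  result: T;
  durationMs: number;
  slow: boolean;
}

// Time an async operation and report whether it exceeded the threshold.
// The clock is injectable so the behavior can be tested deterministically.
async function timeQuery<T>(
  run: () => Promise<T>,
  thresholdMs = 1000,
  now: () => number = Date.now,
): Promise<TimedResult<T>> {
  const started = now();
  const result = await run();
  const durationMs = now() - started;
  return { result, durationMs, slow: durationMs > thresholdMs };
}
```

The caller can then decide what to do with a slow result, for example recording it through monitoring.trackError as in the snippet above.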

4. Health Check Frequency

Configure health checks based on your needs:

// More frequent checks for critical services
const criticalServices = ['database', 'ai'];
const standardServices = ['kv', 'r2'];

// Check critical services every 30 seconds
setInterval(() => checkServices(criticalServices), 30000);

// Check standard services every 2 minutes
setInterval(() => checkServices(standardServices), 120000);
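
checkServices is not defined anywhere in this document. One possible shape, sketched below, looks up a probe for each named service and runs them concurrently. Unlike the calls above, this version takes the probe registry as an explicit second argument so the example stays self-contained; that parameter, the Probe type, and the ServiceStatus shape are assumptions.

```typescript
type Probe = () => Promise<unknown>;

interface ServiceStatus {
  service: string;
  status: 'healthy' | 'unhealthy';
}

// Check the named services concurrently. A service with no registered probe,
// or whose probe throws, is reported as unhealthy.
async function checkServices(
  names: string[],
  probes: Record<string, Probe>,
): Promise<ServiceStatus[]> {
  return Promise.all(
    names.map(async (service): Promise<ServiceStatus> => {
      const probe = probes[service];
      if (!probe) return { service, status: 'unhealthy' };
      try {
        await probe();
        return { service, status: 'healthy' };
      } catch {
        return { service, status: 'unhealthy' };
      }
    }),
  );
}
```

Because each probe catches its own failure, one unavailable service does not prevent the others from being reported.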

Troubleshooting

Common Issues

  1. High Memory Usage

    • Check for memory leaks in long-running processes
    • Monitor object creation and cleanup
    • Review WebSocket connection management
  2. Slow Response Times

    • Analyze database query performance
    • Check external API response times
    • Monitor AI model inference times
  3. High Error Rates

    • Review error categorization
    • Check service dependencies
    • Monitor external service health

Debug Mode

Enable debug mode for detailed logging:

// Enable debug logging
if (process.env.NODE_ENV === 'development') {
  monitoring.enableDebugMode();
}

Integration with External Tools

Logging Integration

The monitoring system can integrate with external logging services:

// Send metrics to external logging service
monitoring.onMetricsUpdate((metrics) => {
  externalLogger.send(metrics);
});

Alert Integration

Configure alert notifications:

// Send alerts to external notification service
monitoring.onAlert((alert) => {
  notificationService.send(alert);
});
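
Callback hooks such as onMetricsUpdate and onAlert are typically backed by a small listener registry. The sketch below shows one way such a registry might work, including isolating a failing listener so that an outage in an external service does not stop other listeners from running. The Emitter class is hypothetical and is not the EnhancedMonitoring implementation.

```typescript
type Listener<T> = (payload: T) => void;

class Emitter<T> {
  private listeners: Listener<T>[] = [];

  // Register a listener and return a function that removes it again.
  on(listener: Listener<T>): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Deliver the payload to every listener and return how many succeeded.
  // A throwing listener is skipped so it cannot block the others.
  emit(payload: T): number {
    let delivered = 0;
    for (const listener of this.listeners.slice()) {
      try {
        listener(payload);
        delivered += 1;
      } catch {
        // Deliberately ignored in this sketch; real code would log the failure.
      }
    }
    return delivered;
  }
}
```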

This monitoring system provides the foundation for maintaining high availability and performance in the PadawanForge platform.

PadawanForge v1.4.1