Our goal is to provide highly accurate, near real-time status reporting that users can depend on. To achieve this, we use a custom-designed, multi-layered monitoring infrastructure that balances speed, accuracy, and resilience to firewall blocking.
Monitoring Infrastructure & Process
Our automated checks follow a strict queue execution protocol. We send lightweight, standardized HTTP requests that mimic the connection handshake of a clean web browser. This keeps standard rate limiters from blocking our checks as false positives and ensures we do not overload the target servers.
1. Starvation-Free Queue Scheduling
Our backend scheduler uses starvation-free, oldest-checked-first queue logic: the site whose last check is oldest is always processed next. Every monitored website in our database is therefore checked regularly in rotation. We do not prioritize high-traffic domains over smaller sites, so update cycles stay fair and consistent across the entire database.
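As an illustration only, oldest-checked-first scheduling can be sketched with a min-heap keyed on the time of the last check. The names `Target`, `OldestFirstScheduler`, `next_batch`, and `mark_checked` are hypothetical and are not taken from our production code.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Target:
    last_checked: float                  # Unix timestamp of the previous check (0.0 = never)
    url: str = field(compare=False)      # excluded from ordering; only the timestamp counts


class OldestFirstScheduler:
    """Always hands out the targets that have waited longest since their last check."""

    def __init__(self, urls):
        # Never-checked targets start at 0.0, so they are served first.
        self._heap = [Target(0.0, url) for url in urls]
        heapq.heapify(self._heap)

    def next_batch(self, size):
        """Pop up to `size` of the least recently checked targets."""
        count = min(size, len(self._heap))
        return [heapq.heappop(self._heap) for _ in range(count)]

    def mark_checked(self, target, now):
        """Re-queue a target with its new timestamp, moving it to the back of the rotation."""
        target.last_checked = now
        heapq.heappush(self._heap, target)
```

Because a target only returns to the heap with a newer timestamp than everything still waiting, no site can be skipped indefinitely, regardless of its traffic.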
2. WAF-Resilient Parallel Probing
We initiate checks using high-performance parallel cURL requests. To keep Web Application Firewalls (such as Cloudflare, Akamai, or Sucuri) from dropping our requests as suspected bots, we curate our request headers to match a clean, standard HTTP connection handshake and omit unnecessary browser headers that trigger firewall false positives.
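The general shape of parallel probing with a small, curated header set might look like the sketch below. It uses Python's standard library (`urllib` and a thread pool) rather than cURL, and the header values, the `ExampleStatusBot` user agent, and the function names are all assumptions made for illustration; the headers we actually send are not listed here.

```python
import http.client
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed header set for illustration; not the real production headers.
MINIMAL_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; ExampleStatusBot/1.0)",
    "Accept": "*/*",
}


def build_request(url):
    """Create a GET request carrying only the curated headers."""
    return urllib.request.Request(url, headers=MINIMAL_HEADERS, method="GET")


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code, or None if no HTTP response was received."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code      # the server answered, but with an error status
    except (OSError, http.client.HTTPException):
        return None          # DNS failure, refused connection, timeout, TLS or protocol error


def probe_all(urls, fetch=fetch_status, max_workers=20):
    """Probe every URL concurrently and return {url: status_or_None}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        return dict(zip(urls, pool.map(fetch, urls)))
```

The `fetch` parameter is injectable so the fan-out logic can be exercised without touching the network, for example `probe_all(["a", "b"], fetch=lambda u: 200)`.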
3. Multi-Check Failure Verification
To avoid false-positive downtime reports, we never declare a site DOWN based on a single failed connection. If a parallel check fails, our engine triggers a synchronous, multi-retry validation sequence:
- Immediate Recheck: The target is immediately re-queued for checking from an alternative connection hook.
- Exponential Backoff: If the second check fails, we perform 3 consecutive synchronous checks with exponentially increasing backoff delays (1s, 2s, 4s).
- Status Confirmation: Only if all retry checks fail is the target marked as DOWN in our database and logged as an outage.
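The sequence above can be sketched roughly as follows. `check` and `recheck` are hypothetical callables that return True when the target responds, with `recheck` standing in for the alternative connection hook. Sleeping before each retry is an assumption about where the 1s, 2s, and 4s delays fall.

```python
import time

BACKOFF_DELAYS = (1, 2, 4)  # seconds, matching the delays listed above


def confirm_down(check, recheck, delays=BACKOFF_DELAYS, sleep=time.sleep):
    """Return True only if the immediate recheck and every backoff retry fail.

    Intended to be called after the initial parallel check has already failed.
    """
    if recheck():
        return False         # immediate recheck succeeded: treat as a transient failure
    for delay in delays:
        sleep(delay)         # assumed placement: wait, then retry
        if check():
            return False     # any success cancels the DOWN verdict
    return True              # all retries failed: mark DOWN and log the outage
```

Under these assumptions a target is marked DOWN only after five failed attempts in total (the initial check, the recheck, and three retries), at the cost of roughly seven seconds of added delay before an outage is recorded.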
Performance Metrics
We calculate the following metrics for every monitored target:
- Response Latency: The round-trip connection time in milliseconds from our checking nodes to the target server.
- Uptime Percentage: The number of successful checks divided by the total number of checks, expressed as a percentage, over 24-hour, 7-day, and 30-day windows.
- Downtime Incidents: Logs containing start time, end time, and total outage duration.
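To make the uptime and incident definitions concrete, here is a minimal sketch that derives both from a list of check records. The `Check` record and the function names are hypothetical. Treating an incident as running from the first failed check to the next successful one is also an assumption, not a statement of how our logs are built.

```python
from dataclasses import dataclass

DAY = 86_400  # seconds; windows are then DAY, 7 * DAY, and 30 * DAY


@dataclass
class Check:
    timestamp: float     # Unix seconds when the check completed
    ok: bool             # True if the target was judged UP
    latency_ms: float    # round-trip time in milliseconds


def uptime_percentage(checks, now, window_seconds):
    """Successful checks divided by total checks in the trailing window, as a percentage."""
    recent = [c for c in checks if now - window_seconds <= c.timestamp <= now]
    if not recent:
        return None      # no data in the window; avoids dividing by zero
    return 100.0 * sum(c.ok for c in recent) / len(recent)


def downtime_incidents(checks):
    """Group consecutive failed checks into (start, end, duration_seconds) tuples.

    An incident still open at the end of the data is reported with end and
    duration set to None.
    """
    incidents, start = [], None
    for c in sorted(checks, key=lambda c: c.timestamp):
        if not c.ok and start is None:
            start = c.timestamp                        # first failure opens an incident
        elif c.ok and start is not None:
            incidents.append((start, c.timestamp, c.timestamp - start))
            start = None                               # recovery closes it
    if start is not None:
        incidents.append((start, None, None))
    return incidents
```

For example, `uptime_percentage(checks, now, 7 * DAY)` gives the 7-day figure for whatever records are passed in.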