Most business owners find out about a website or server outage from clients: through chat complaints, dropping sales, or a call at the worst time. This is more than an inconvenience; it is direct reputational and financial damage. Properly configured monitoring alerts you before the first client notices the problem. Let's break down what to monitor and how.

What needs to be monitored

System resources

  • CPU — warning threshold: 70%, critical: 90%. Sustained load above 80% signals a need to scale or optimize.

  • RAM — warning threshold: 75%, critical: 90%. Memory reaching 100% leads to swap usage, severe slowdowns, and process crashes.

  • Disk — warning threshold: 75%, critical: 85%. A full disk means database stoppage, lost logs, and unpredictable failures. Also monitor IOPS and write speed.

  • Network — bandwidth, connection count, packet loss, latency. A sudden traffic spike may indicate a DDoS attack or application problem.
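The warning and critical thresholds above can be written down as a small lookup table. The following is a minimal Python sketch, not tied to any particular monitoring tool: the names THRESHOLDS and classify are made up for illustration, and the percentages are simply the ones listed above.

```python
# Warning/critical thresholds in percent, taken from the list above.
# THRESHOLDS and classify() are illustrative names, not part of any tool.
THRESHOLDS = {
    "cpu": (70.0, 90.0),
    "ram": (75.0, 90.0),
    "disk": (75.0, 85.0),
}


def classify(metric: str, used_percent: float) -> str:
    """Return "ok", "warning" or "critical" for a usage percentage."""
    warning, critical = THRESHOLDS[metric]
    if used_percent >= critical:
        return "critical"
    if used_percent >= warning:
        return "warning"
    return "ok"
```

For example, classify("disk", 80.0) returns "warning", because 80% is above the 75% warning threshold but below the 85% critical one.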

Services and applications

  • HTTP/HTTPS availability — 200 status, response time

  • Database health — MySQL/PostgreSQL connections, query time, replication lag

  • Queues (Redis, RabbitMQ) — queue length, consumer count

  • SSL certificate — expiry date (alerts at 30 and 7 days)

  • Processes — nginx, php-fpm, mysql must be running
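As an illustration of the SSL certificate check, here is a rough Python sketch that uses only the standard library. The function names are hypothetical, and the 30-day and 7-day levels are the alert points mentioned in the list above.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field of the certificate served by host:port.

    Note: with the default context, an already expired or invalid
    certificate raises ssl.SSLCertVerificationError instead of returning;
    a caller should treat that as a critical alert in itself.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_left(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and the notAfter string."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def expiry_level(days: int) -> str:
    """Map remaining days to the alert levels assumed above (30 and 7 days)."""
    if days <= 7:
        return "critical"
    if days <= 30:
        return "warning"
    return "ok"
```

A scheduled job could call expiry_level(days_left(fetch_not_after("example.com"), datetime.now(timezone.utc))) once a day; example.com is a placeholder host.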

Business metrics

The most advanced level is monitoring business metrics alongside technical ones: orders in the last 15 minutes, registrations, transaction volume. A sharp drop in these figures can indicate a problem even when all technical metrics are green.
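One simple way to detect such a drop is to compare the current 15-minute count with a typical value for the same window. The sketch below is only an assumption about how this could be done: the function name, the choice of the median as the baseline, and the 50% ratio are all illustrative and should be tuned to real traffic.

```python
from statistics import median


def is_sharp_drop(current: int, baseline: list[int], drop_ratio: float = 0.5) -> bool:
    """True when `current` is below `drop_ratio` times the median baseline.

    `baseline` is assumed to hold counts for the same 15-minute window on
    previous days; an empty or all-zero baseline never triggers an alert.
    """
    if not baseline:
        return False
    typical = median(baseline)
    return typical > 0 and current < typical * drop_ratio
```

With a baseline of [20, 22, 19, 21] orders, 3 orders in the current window counts as a sharp drop, while 18 does not.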

Monitoring tools

Zabbix: the enterprise standard

Zabbix is a free open-source infrastructure monitoring platform. It supports thousands of metrics out of the box, has a powerful trigger and alerting engine, and includes ready-made templates for Linux, Windows, MySQL, Nginx, Apache, and hundreds of other systems.

  • Advantages: free, powerful, stores data locally, flexible

  • Drawbacks: more complex to configure, requires a dedicated server

  • Best for: companies with 5+ servers, data residency requirements

Prometheus + Grafana: the modern stack

Prometheus collects metrics via a pull model: it scrapes its targets (exporters and instrumented applications) over HTTP on its own schedule. Grafana displays the collected data in dashboards. Together they are the de facto standard for cloud and containerized environments (Kubernetes).

  • Advantages: excellent Docker/Kubernetes integration, powerful PromQL query language, large community

  • Drawbacks: long-term data storage is more complex (requires Thanos or VictoriaMetrics)

  • Best for: DevOps teams, microservice architectures, cloud-native projects
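To make the pull model concrete: the monitored side only has to expose its current values over HTTP in the Prometheus text format, and Prometheus fetches them. The following is a simplified standard-library sketch with hypothetical names (render_gauge, collect, serve) and a made-up metric; in practice you would normally use an existing exporter or the official prometheus_client library rather than hand-writing this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_gauge(name: str, help_text: str, value: float, labels: dict[str, str]) -> str:
    """Render one gauge in the Prometheus text exposition format.

    Simplified: label values are not escaped, so they must not contain
    quotes, backslashes or newlines.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    series = f"{name}{{{label_str}}}" if label_str else name
    return f"# HELP {name} {help_text}\n# TYPE {name} gauge\n{series} {value}\n"


def collect() -> str:
    # Hypothetical static value; a real exporter would read it from the app.
    return render_gauge("queue_length", "Jobs waiting in the queue", 12, {"queue": "mail"})


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = collect().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on /metrics (port is an assumption)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; Prometheus would then be configured to scrape this host and port at its own interval.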

UptimeRobot / Better Uptime: external availability monitoring

These services check your site's availability every 1–5 minutes from various points around the world. The key advantage is that the check runs from outside your infrastructure: if your server goes down and Zabbix goes down with it, UptimeRobot will still send an alert.

  • UptimeRobot — free plan up to 50 monitors, checks every 5 minutes

  • Better Uptime — cleaner interface, client status pages, on-call scheduling

  • Uptime Kuma — self-hosted alternative, free, deployed on your own server
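Under the hood, an external availability check boils down to requesting a URL from another machine and looking at the status code and response time. The sketch below shows the idea in plain Python; it does not reflect how any of the services above are actually implemented, and the names and the 2-second limit are assumptions.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    up: bool
    status: Optional[int]  # None when no HTTP response was received
    seconds: float


def evaluate(status: Optional[int], seconds: float, max_seconds: float = 2.0) -> CheckResult:
    """A site counts as up only with a 200 status within the time limit."""
    up = status == 200 and seconds <= max_seconds
    return CheckResult(up, status, seconds)


def check_http(url: str, timeout: float = 5.0) -> CheckResult:
    """Fetch `url` once and evaluate the result (redirects are followed)."""
    started = time.monotonic()
    status: Optional[int] = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except OSError:
        status = None  # DNS failure, refused connection, timeout, TLS error
    return evaluate(status, time.monotonic() - started)
```

The important part is where this runs: the script has to live on a machine outside the monitored infrastructure, otherwise it goes down together with the server it is checking.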

Alerting: how to notify correctly

Notification channels

Telegram is the most convenient channel for teams. A bot sends formatted messages with incident details, a link to the graph, and acknowledgment buttons. It is configured via the Telegram Bot API, directly in Zabbix or Prometheus Alertmanager.

Email is suited to non-critical alerts and reports. It should not be the primary channel for P1 incidents, because email can be delayed or land in spam.

SMS or a phone call is for critical overnight incidents, when a Telegram message might be missed. Tools: PagerDuty, Opsgenie, or a simple SMS gateway integration.
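For custom scripts that are not covered by the built-in integrations, a Telegram alert is a single call to the Bot API's sendMessage method. Here is a minimal sketch using only the standard library; the function names are hypothetical, and the token and chat ID are placeholders you would get from your own bot and chat.

```python
import json
import urllib.request


def build_telegram_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Prepare a sendMessage call to the Telegram Bot API."""
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(token: str, chat_id: str, text: str, timeout: float = 10.0) -> bool:
    """Send the message; returns the "ok" flag from Telegram's JSON reply.

    Raises urllib.error.URLError (or HTTPError) on network or API errors,
    which the caller should handle, for example by falling back to SMS.
    """
    request = build_telegram_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bool(json.load(response).get("ok", False))
```

A call might look like send_alert(BOT_TOKEN, CHAT_ID, "Disk /var/lib/mysql 89% on db01.example.com"), where BOT_TOKEN and CHAT_ID are your own values.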

Good alerting rules

  • No alert fatigue — if there are too many alerts, the team starts ignoring them. Set thresholds so an alert means a real action is needed

  • Escalation — if the on-call engineer hasn't responded in 10 minutes, the alert goes to the next person

  • Context in the alert — "Disk /var/lib/mysql 89% on db01.example.com" is far more useful than "Disk alert"

  • Maintenance silence — suppressing routine alerts during planned maintenance
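Three of the rules above (context, escalation, maintenance silence) are easy to express in code. The sketch below is one possible way to do it, not how Zabbix or Alertmanager implement these features internally; all function names and the on-call names are hypothetical, and the 10-minute step comes from the escalation rule above.

```python
from datetime import datetime, timedelta


def format_alert(check: str, target: str, value: str, host: str) -> str:
    """Build an alert line that carries its own context."""
    return f"{check} {target} {value} on {host}"


def current_responder(
    fired_at: datetime,
    now: datetime,
    chain: list[str],
    step: timedelta = timedelta(minutes=10),
) -> str:
    """Who should be notified about a still unacknowledged alert.

    Every `step` without acknowledgment moves one person down the chain;
    the last person stays responsible once the chain is exhausted.
    """
    level = max(0, int((now - fired_at) / step))
    return chain[min(level, len(chain) - 1)]


def is_silenced(now: datetime, windows: list[tuple[datetime, datetime]]) -> bool:
    """True when `now` falls inside any planned maintenance window."""
    return any(start <= now < end for start, end in windows)
```

For an alert fired at 03:00 with the chain ["alice", "bob", "carol"], current_responder returns "alice" at 03:04, "bob" at 03:12, and "carol" from 03:20 onward.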

Dashboards: seeing the full picture

A good dashboard answers "is everything OK?" in 3 seconds. Minimum panel set: uptime for critical services, CPU/RAM/Disk across all servers as a heat map, API response time, error statistics for the last 24 hours.

Monitoring is not a luxury; it is insurance. A quality monitoring system costs 500–3,000 UAH/month. One hour of e-commerce downtime on a Friday evening costs far more.