You Don't Need the Enterprise Stack
Every few months, a "complete guide to production monitoring" makes the rounds on Hacker News. It recommends Datadog for metrics, PagerDuty for on-call, Atlassian Statuspage for incident communication, Grafana plus Prometheus for dashboards, Sentry for error tracking, and maybe Honeycomb for observability. The total bill? Somewhere between $300 and $500 per month for a small team, scaling quickly into the thousands.
These guides are written by and for companies with 200+ engineers, dedicated SRE teams, and the operational complexity to justify that spend. If you are a startup with fewer than 50 engineers, most of this stack is waste. Not "nice to have" waste. Actual waste: tools nobody configures properly, dashboards nobody checks, alerts nobody responds to, and integrations that break when you upgrade something else.
This post is the guide I wish I had when we were a five-person team trying to figure out monitoring. It is opinionated on purpose.
Stage 1: Pre-Product-Market-Fit (1-5 Engineers)
At this stage, your job is to ship product and talk to users. Every hour spent configuring monitoring infrastructure is an hour not spent finding product-market fit. Your monitoring stack should take less than 30 minutes to set up and require near-zero ongoing maintenance.
What you actually need:
- Uptime monitoring -- Know when your site or API goes down. HTTP checks every 60 seconds from multiple regions. That is it. (A minimal sketch of such a check follows this list.)
- A status page -- When something breaks, your users need a single URL to check. This builds trust and cuts down on inbound support requests during an outage.
- Email or Slack alerts -- Route downtime notifications to wherever your team already communicates.
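To make this concrete, here is roughly what the whole Stage 1 stack amounts to if you scripted it yourself. This is a sketch only: it assumes Node 18+ for the built-in fetch, the target URLs are placeholders, and the Slack incoming-webhook URL is read from an environment variable. A hosted tool does the same thing with retries, multi-region checks, and a status page on top.

```typescript
// uptime-check.ts -- run on a schedule (cron, a scheduled CI job, etc.).
// The target URLs and the Slack webhook are placeholders, not real endpoints.

const TARGETS = [
  "https://example.com",            // homepage
  "https://api.example.com/health", // API health endpoint
];
const SLACK_WEBHOOK = process.env.SLACK_WEBHOOK_URL ?? "";

async function checkTarget(url: string): Promise<string | null> {
  try {
    // Fail fast: anything slower than 10 seconds is treated as down.
    const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
    return res.ok ? null : `${url} returned HTTP ${res.status}`;
  } catch (err) {
    return `${url} is unreachable: ${(err as Error).message}`;
  }
}

async function main() {
  const failures = (await Promise.all(TARGETS.map(checkTarget))).filter(
    (f): f is string => f !== null,
  );
  if (failures.length === 0) return;

  // Post one consolidated alert to Slack instead of one message per failure.
  if (SLACK_WEBHOOK) {
    await fetch(SLACK_WEBHOOK, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: `Uptime check failed:\n${failures.join("\n")}` }),
    });
  }
  process.exitCode = 1; // non-zero exit also surfaces the failure in CI schedulers
}

main();
```

Run it every minute and you have covered "know when it is down." The point of a hosted tool is that you never have to babysit this script or the machine it runs on.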
What you do not need:
- On-call scheduling. With 1-5 engineers, everyone is on call all the time. You do not need software to formalize that.
- APM or distributed tracing. Your application is not complex enough for traces to tell you anything `console.log` cannot.
- Custom dashboards. You do not have enough traffic or services to justify a Grafana setup.
- Log aggregation. Your cloud provider's built-in log viewer is fine for now. CloudWatch, Google Cloud Logging, or `fly logs` will do.
Recommended setup: A free monitoring tool with a built-in status page. Alert24's free tier gives you 10 monitors, a hosted status page, and incident management at no cost. UptimeRobot's free plan is another option, though it lacks a status page and incident workflows.
Monthly cost: $0
Stage 2: Post-PMF and Growing (5-15 Engineers)
You have paying customers. You have SLAs, or at least informal uptime expectations. Your team is big enough that "everyone is on call" means nobody is on call. Incidents happen at 2 AM and nobody notices until a customer emails at 9 AM.
This is where most startups make their first monitoring mistake: they buy five different tools to solve five different problems. One for monitoring, one for alerting, one for on-call, one for incident management, one for status pages. Each tool has its own login, its own alert rules, its own integration config. Each one is $20 to $50 per month. And now you have a fragmented system where the monitoring tool detects the problem, but the alert goes to the wrong person because the on-call tool is not synced, and the status page stays green because nobody thought to update it.
What you actually need:
- Uptime monitoring with multiple check types -- HTTP, DNS, SSL certificate expiry, keyword checks. Monitor your API endpoints, not just your homepage.
- On-call scheduling with escalation policies -- Define who gets the alert first. Define what happens if they do not respond in 10 minutes. This alone prevents most "nobody noticed" incidents.
- Incident management -- A structured workflow for acknowledging, investigating, and resolving incidents. Bonus if it connects to your status page automatically.
- A branded status page -- Your customers are paying you money now. A status page on a custom domain with your logo is table stakes.
- SSL and domain monitoring -- SSL certificates expire. Domains expire. Both cause outages. Automated monitoring for these is trivially cheap insurance.
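To illustrate how cheap that insurance is, here is a rough sketch of an SSL expiry check using Node's built-in tls module; the hostnames and the 14-day threshold are placeholder values. A monitoring platform runs the equivalent of this for you and wires it into the same alerting path as everything else.

```typescript
// ssl-expiry-check.ts -- warn when a certificate is close to expiring.
// Hostnames and the 14-day threshold are illustrative values.
import * as tls from "node:tls";

const HOSTS = ["example.com", "api.example.com"];
const WARN_DAYS = 14;

function daysUntilExpiry(host: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const socket = tls.connect({ host, port: 443, servername: host }, () => {
      // valid_to is the certificate's expiry timestamp as a string.
      const cert = socket.getPeerCertificate();
      socket.end();
      const expires = new Date(cert.valid_to).getTime();
      resolve((expires - Date.now()) / (1000 * 60 * 60 * 24));
    });
    socket.on("error", reject);
  });
}

async function main() {
  for (const host of HOSTS) {
    const days = await daysUntilExpiry(host);
    if (days < WARN_DAYS) {
      console.error(`WARNING: certificate for ${host} expires in ${Math.floor(days)} days`);
      process.exitCode = 1;
    }
  }
}

main();
```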
What you still do not need:
- APM. Unless you are debugging specific performance problems daily, you do not need a $100+/month APM tool running in the background. Use it on demand if you must -- services like New Relic offer limited free tiers you can activate when you need to investigate something.
- Synthetic monitoring with complex user flows. Simple HTTP checks cover 90% of outage detection. Save the Playwright-based synthetic checks for later.
- Custom metrics and dashboards. Your cloud provider's built-in metrics (CPU, memory, request count) are accessible for free. Building custom Grafana dashboards is satisfying, but it is not what will keep your startup alive.
Recommended setup: A single platform that covers monitoring, on-call, incidents, and status pages. Alert24 Pro at 3 units ($54/month) covers a team of 3 with 15 monitors per unit, 3 status pages, on-call scheduling, SMS and phone alerts, Slack and Teams integration, and incident management. Better Stack's starter plan ($24/month for 1 seat) is comparable for monitoring and status pages but charges separately for on-call. Alternatively, PagerDuty's free tier (up to 5 users) plus UptimeRobot Pro ($7/month) works, but now you are managing two tools.
Monthly cost: $24-54
Stage 3: Scaling (15-50 Engineers)
Multiple teams own different services. You have microservices, or at least service-oriented architecture. Customer-facing SLAs are contractual, not informal. You need audit trails. Compliance is a real requirement, not a theoretical one.
Now -- and only now -- does the broader observability toolkit start earning its keep.
What to add:
- APM / distributed tracing -- When a request touches five services before returning a response, you need traces to find the bottleneck. Datadog, New Relic, and Grafana Cloud all offer this. Budget $100-200/month. (A minimal instrumentation sketch follows this list.)
- Log aggregation -- Centralized, searchable logs across all services. Datadog Logs, Grafana Loki, or Axiom are options. Cloud-native solutions like CloudWatch Logs Insights can also work if you are AWS-only.
- Synthetic monitoring -- Multi-step browser checks that simulate real user flows: login, checkout, API sequences. Checkly and Grafana's synthetic monitoring are good options.
- Advanced alerting -- Composite alerts that trigger when multiple conditions are met. Alert grouping to prevent notification storms. These features matter when you have 100+ monitors.
- Compliance and audit logging -- SOC 2 and ISO 27001 require evidence of monitoring and incident response. Your tools need to produce audit trails.
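For a sense of what the tracing item involves on the application side, the sketch below shows a typical OpenTelemetry bootstrap for a Node service. The package names are the standard OpenTelemetry ones; the service name and OTLP endpoint are placeholders, and any of the vendors above can ingest OTLP data.

```typescript
// tracing.ts -- load this before the rest of the app (for example via
// node --require or --import). Assumes the standard OpenTelemetry Node
// packages are installed; the endpoint below is a placeholder for whatever
// backend you choose (Grafana Cloud, Datadog's OTLP intake, a collector).
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "checkout-service", // placeholder service name
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
  }),
  // Auto-instrumentation covers http, express, pg, redis, and similar common libraries.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush spans on shutdown so the last trace of a terminating process is not lost.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

The instrumentation itself is mostly boilerplate like this; the ongoing cost is in the backend that stores and queries the traces, which is why it belongs in Stage 3 and not earlier.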
What you probably still do not need:
- AIOps. The promise of AI-powered root cause analysis sounds great. In practice, these tools are expensive, noisy, and require months of historical data to produce anything useful. At 15-50 engineers, a good runbook and a structured incident process will outperform any AIOps tool.
- Service graphs and topology maps. These are visually impressive in demos but rarely consulted during actual incidents. Engineers debug with logs, traces, and metrics, not topology diagrams.
- Multi-cloud observability platforms. Unless you are genuinely running production workloads across AWS and GCP simultaneously (you are probably not), you do not need a tool that unifies multi-cloud metrics.
- Custom metrics dashboards for everything. The instinct to "dashboard all the things" is strong. Resist it. Every dashboard you build is a dashboard you have to maintain. Build dashboards for the 5-10 metrics that actually predict customer-facing problems. Ignore the rest.
Monthly cost: $150-400
The Real Danger of Over-Tooling Early
Monitoring tools are not free, even when they are free-tier. Every tool you add to your stack has hidden costs:
Integration maintenance. Your monitoring tool posts to Slack via a webhook. Your on-call tool reads from a different Slack channel. Your status page has its own API integration. When Slack changes its API, or you switch from Slack to Discord, you are updating three configurations instead of one.
Configuration drift. Alert thresholds in your monitoring tool do not match escalation rules in your on-call tool. One tool monitors your new API endpoint; the other does not know about it. Nobody notices until the next incident.
Alert fatigue. More tools means more sources of alerts. Each tool has its own default alert rules, its own severity levels, its own notification preferences. Without careful tuning -- which nobody at a startup has time for -- the result is a constant low-grade stream of notifications that everyone learns to ignore. When the real outage happens, the alert drowns in the noise.
Context switching. During an incident, you do not want to be jumping between four different dashboards in four different browser tabs. You want a single pane of glass. Consolidation is not just a cost play; it is an incident-response-speed play.
The simplest way to avoid all of this: use one tool that covers as much of the surface area as possible, and only break out specialized tools when you have a specific, recurring pain that the consolidated tool cannot solve.
Cost Comparison: Enterprise Stack vs. Startup Stack
Here is what the monitoring bill looks like at different levels of complexity, for a team of roughly 10 engineers:
The "I read the enterprise guide" stack:
- Datadog Pro (5 hosts): ~$75/month
- PagerDuty Business (10 users): ~$210/month
- Atlassian Statuspage (Startup plan): ~$79/month
- Sentry Team: ~$26/month
- Total: ~$390/month ($4,680/year)
The startup stack (consolidated):
- Alert24 Pro (10 units): $180/month -- covers monitoring, on-call, incidents, and status pages
- Sentry free tier or cloud provider error logging: $0
- Total: ~$180/month
The minimum viable stack:
- Alert24 Free: $0 -- 10 monitors, a status page, incident management
- Total: $0/month
The difference is not $210/month. It is $210/month plus the engineering hours spent configuring, integrating, and maintaining four separate tools instead of one. At a startup where every engineer's time is your scarcest resource, those hours matter more than the dollars.
When to Upgrade Your Stack
Stay at your current level until you hit one or more of these signals. Do not upgrade preemptively.
Move from Stage 1 to Stage 2 when:
- You have paying customers with uptime expectations
- You have more than 3 engineers and nobody knows who is "on call"
- You have had an incident where the team did not find out until a customer reported it
- You need to monitor more than just "is the homepage up" (APIs, background jobs, SSL certs)
Move from Stage 2 to Stage 3 when:
- You have contractual SLAs with financial penalties
- Debugging production issues regularly takes more than 30 minutes because you cannot trace requests across services
- You are blocked on compliance requirements (SOC 2, HIPAA, ISO 27001)
- Your team has grown past 15 engineers and multiple teams own different services
- Simple uptime checks no longer catch the classes of failures you are experiencing (partial degradations, slow queries, background job backlogs)
Do not upgrade because:
- You saw a conference talk about observability and felt guilty about your setup
- A vendor cold-emailed you with a "free trial" that requires instrumenting your entire codebase
- Your competitor's status page looks fancier than yours
- You think you might need distributed tracing someday
If you are not experiencing the pain, you do not need the tool. Monitoring should solve today's problems, not tomorrow's hypothetical ones.
The Bottom Line
The best monitoring stack is the one your team actually uses. A single tool with 10 well-configured monitors, clear escalation rules, and an up-to-date status page will catch more incidents than a $500/month observability platform with 47 dashboards that nobody looks at.
Start with uptime monitoring and a status page. Add on-call and incident management when your team grows. Add APM and log aggregation when your architecture demands it. At each stage, prefer a single consolidated platform over a collection of point solutions.
Your startup has a hundred problems to solve. Choosing monitoring tools should not be one of them.
