The four golden signals
1. Latency — how long do requests take?
2. Traffic — how many requests per second?
3. Errors — what fraction are failing?
4. Saturation — how full is the system?
Start there. Everything else is detail.
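A minimal sketch of computing the four signals for one observation window. The names here (`Request`, `golden_signals`, `in_use`, `capacity`) are hypothetical, not from any monitoring library. The 95th percentile for latency and a simple used-over-capacity ratio for saturation are assumptions; pick whatever fits your system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    duration_ms: float  # how long the request took
    failed: bool        # True if the request ended in an error


def golden_signals(requests: Sequence[Request], window_s: float,
                   in_use: float, capacity: float) -> dict:
    """Summarize one observation window as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    n = len(durations)
    # Nearest-rank 95th percentile: ceil(0.95 * n), done in integer arithmetic.
    rank = max(1, (95 * n + 99) // 100)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": n / window_s,
        "error_rate": sum(r.failed for r in requests) / n,
        # Assumed definition: share of a limited resource currently in use.
        "saturation": in_use / capacity,
    }
```

Latency is reported as a percentile rather than a mean because a handful of slow requests can hide behind a healthy average.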
What to alert on
Alert only on things that need a human awake right now. If you cannot articulate what the on-call person should do when the alert fires, the alert should not exist.
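One way to enforce that rule is to make the on-call action a required part of the alert definition. This is a sketch only: `Alert`, `should_page`, and the example threshold are hypothetical and not tied to any real alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    condition: str         # human-readable trigger, e.g. "error_rate > 0.05 for 5m"
    action: str            # what the on-call person should do when it fires
    needs_human_now: bool  # False means it can wait for working hours


def should_page(alert: Alert) -> bool:
    """Page only when a human must act now and the action is written down."""
    return alert.needs_human_now and bool(alert.action.strip())


# Hypothetical example: the threshold and action are illustrative only.
high_errors = Alert(
    name="high-error-rate",
    condition="error_rate > 0.05 for 5m",
    action="Roll back the most recent deploy",
    needs_human_now=True,
)
```

An alert with an empty `action` never pages, which forces the "what should I do?" question to be answered when the alert is written, not at 3 a.m.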