
Logging Sucks
Logging sucks. And here's how to make it better.

Your logs are lying to you. Not maliciously. They're just not equipped to tell the truth.

You've probably spent hours grepping through logs trying to understand why a user couldn't check out, why that webhook failed, or why your p99 latency spiked at 3am. You found nothing useful. Just timestamps and vague messages that mock you with their uselessness.

This isn't your fault. Logging, as it's commonly practiced, is fundamentally broken. And no, slapping OpenTelemetry on your codebase won't magically fix it.

Let me show you what's wrong, and more importantly, how to fix it.

The Core Problem

Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally. Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue. Your logs are still acting like it's 2005.

Here's what a typical logging setup looks like:

That's 13 log lines for a single successful request. Now multiply that by 10,000 concurrent users. If each of them makes one request per second, you've got 130,000 log lines per second. Most of them saying absolutely nothing useful.

But here's the real problem: when something goes wrong, these logs won't help you. They're missing the one thing you need: context.

Why String Search is Broken

When a user reports "I can't complete my purchase," your first instinct is to search your logs. You type their email, or maybe their user ID, and hit enter.

String search treats logs as bags of characters. It has no understanding of structure, no concept of relationships, no way to correlate events across services.

When you search for "user-123", you might find it logged 47 different ways across your codebase:

user-123
user_id=user-123
{"userId": "user-123"}
[USER:user-123]
processing user: user-123

And those are just the logs that include the user ID. What about the downstream service that only logged the order ID? Now you need a second search. And a third.
You're playing detective with one hand tied behind your back.

The fundamental problem: logs are optimized for writing, not for querying.

Developers write console.log("Payment failed") because it's easy in the moment. Nobody thinks about the poor soul who'll be searching for this at 2am during an outage.

Let's Define Some Terms

Before I show you the fix, let me define some terms. These get thrown around a lot, often incorrectly.

Structured Logging: Logs emitted as key-value pairs (usually JSON) instead of plain strings. {"event": "payment_failed", "user_id": "123"} instead of "Payment failed for user 123". Structured logging is necessary but not sufficient.

Cardinality: The number of unique values a field can have. user_id has high cardinality (millions of unique values). http_method has low cardinality (GET, POST, PUT, DELETE, etc.). High cardinality fields are what make logs actually useful for debugging.

Dimensionality: The number of fields in your log event. A log with 5 fields has low dimensionality. A log with 50 fields has high dimensionality. More dimensions = more...
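
To make the three terms concrete, here is a minimal sketch in plain JavaScript. It is not the article's own code: the sample events, the order_id field, and the helper names (cardinality, dimensionality, findByField) are all assumptions made up for illustration. It emits a handful of events as JSON lines, then compares a field lookup with a raw substring search.

```javascript
// Hypothetical sample events, invented for illustration only.
// The last one mimics a downstream service that logged an order ID
// but no user ID.
const events = [
  { event: "payment_failed", user_id: "123", http_method: "POST" },
  { event: "checkout_started", user_id: "456", http_method: "POST" },
  { event: "cart_viewed", user_id: "789", http_method: "GET" },
  { event: "cart_viewed", user_id: "123", http_method: "GET" },
  { event: "order_shipped", order_id: "ord-1123", http_method: "POST" },
];

// Structured logging: each event becomes one JSON line of key-value pairs.
const lines = events.map((fields) => JSON.stringify(fields));

// Cardinality: how many unique values a field takes across the log lines.
function cardinality(logLines, field) {
  const values = new Set();
  for (const line of logLines) {
    const parsed = JSON.parse(line);
    if (field in parsed) values.add(parsed[field]);
  }
  return values.size;
}

// Dimensionality: how many fields a single log event carries.
function dimensionality(line) {
  return Object.keys(JSON.parse(line)).length;
}

// Field lookup: match on the parsed value, not on raw characters.
function findByField(logLines, field, value) {
  return logLines.filter((line) => JSON.parse(line)[field] === value);
}

// String search: treats every line as a bag of characters.
const substringHits = lines.filter((line) => line.includes("123"));
const fieldHits = findByField(lines, "user_id", "123");

console.log(cardinality(lines, "user_id"));     // 3 unique users
console.log(cardinality(lines, "http_method")); // 2 unique methods
console.log(dimensionality(lines[0]));          // 3 fields
console.log(fieldHits.length);                  // 2
console.log(substringHits.length);              // 3 (also matches "ord-1123")
```

In this toy data set the substring search returns an unrelated order line alongside the two real matches, while the field lookup returns only the events for that user. Neither finds a link between the order and the user, because the order event never recorded a user ID.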