All of those query-time histogram aggregations make pretty subtle trade-offs that make analysis fraught.
Disclaimer: I am a co-author.
Do I read right that circllhist has a pretty large number of bins and isn't configurable (except that they're sparse, so storage may still be small on disk)?
I've found myself using high-cardinality Prometheus metrics where I can only afford 10-15 distinct histogram buckets. So I end up
(1) plugging live system data from normal operations and from outage periods into various numeric algorithms that propose optimal bucket boundaries. These algorithms tell me that I could get great accuracy if I chose thousands of buckets, which, thanks for rubbing it in about my space problems :(. Then I write some more code to collapse those into 15 buckets while minimizing error at the places I care about (like p50, p95, p99, p999 under normal operations and under irregular operations).
(2) making sure I have an explicit bucket boundary at any target that represents a business objective (if my service promises no more than 1% of requests will take >2500ms, setting a bucket boundary at 2500ms gives me perfectly precise info about whether p99 falls above or below 2500ms; see the sketch after this list)
(3) forgetting to tune this and leaving a bunch of bad defaults in place, which often leads to people saying "well, our graph shows a big spike up to 10000ms, but that's just because we forgot to tune our histogram bucket boundaries before the outage; actually we have to refer to logs to see the timeouts at 50 sec"
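To make (2) concrete: with the Prometheus Go client, hand-picked boundaries look roughly like this (metric name and values are made up; the points that matter are the explicit 2.5s edge and buckets out past the 50s timeouts, so an outage doesn't just get clamped into the top bucket):

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// 12 hand-picked buckets: dense where normal-operation latencies live, an
// explicit boundary at 2.5s so the ">2500ms" objective is measured exactly
// rather than interpolated, and buckets past the 50s timeouts so outage
// latencies don't pile silently into the largest bucket. Values illustrative.
var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "myservice_request_duration_seconds",
	Help: "Request latency in seconds.",
	Buckets: []float64{
		0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60,
	},
})

func init() {
	prometheus.MustRegister(requestDuration)
}

func main() {
	requestDuration.Observe(0.123) // record one request taking 123ms
}
```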
I think it's a great addition to the product and will excitedly use it, but it's a pretty big difference from a histogram-centric DB like Circonus's or a schema'd one like Monarch.
In practice none of the implementations seem to provide that. Within each set of buckets for a given log base you have reasonable precision at that magnitude. If your metric is oscillating around 1e6 you shouldn't care much about the variance at 1e2, and with this scheme you don't have to tune anything to provide for that.
As for the circllhist: There are no knobs to turn. It uses base 10 and two decimal digits of precision. In the last 8 years I have not seen a single use-case in the operational domain where this was not appropriate.
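For anyone who hasn't looked at the scheme: base 10 with two significant decimal digits means every bucket boundary has the form m * 10^e with m between 10 and 99, so a value is recorded to two significant digits no matter its magnitude. A rough sketch of that bucketing idea (not the actual circllhist code):

```go
package main

import (
	"fmt"
	"math"
)

// bucketBound returns the lower bound of the log-linear bucket containing v,
// assuming base-10 buckets with two significant decimal digits: every bucket
// lower bound has the form m * 10^e with 10 <= m <= 99. Only an illustration
// of the idea, not how circllhist is actually implemented.
func bucketBound(v float64) float64 {
	if v <= 0 {
		return 0 // real implementations handle zero/negative values separately
	}
	exp := math.Floor(math.Log10(v)) - 1          // exponent so the mantissa lands in [10, 100)
	mantissa := math.Floor(v / math.Pow(10, exp)) // keep two significant digits
	return mantissa * math.Pow(10, exp)
}

func main() {
	for _, v := range []float64{3.7, 1234, 987654} {
		fmt.Printf("%10.3f -> bucket [%g, ...)\n", v, bucketBound(v))
	}
}
```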
Thanks for making and open-sourcing circllhist. I've been close to the whole "what's the OTel histogram going to be" discussion for many months now and learned a lot from it. That discussion is what introduced me to circllhist and got me using them.
In practice, this still allows for highly precise percentile calculations (<0.1% error), see the evaluation in the Circllhist paper.
I tried applying it to a service with much lower traffic and found the bucketing to be extremely fussy.
Leaving the world of a single numeric type for each datum will influence the next generation of open-source metrics DBs.
Granted, it looks like Monarch supports a more cleanly-defined schema for distributions, whereas Prometheus just relies on you to define the buckets yourself and follow the convention of using a "le" label to expose them. But the underlying representation (an empirical CDF) seems to be the same, and so the accuracy tradeoffs should also be the same.
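To connect the dots on the accuracy point: given cumulative "le" buckets, a percentile estimate is a linear interpolation inside whichever bucket contains the target rank, so it can only ever be as good as the boundary placement. A toy version of that interpolation (not the real PromQL histogram_quantile, which handles more edge cases):

```go
package main

import "fmt"

// quantileFromBuckets estimates the q-quantile from cumulative ("le"-style)
// bucket counts by linear interpolation inside the bucket containing the
// target rank -- the same basic idea behind PromQL's histogram_quantile().
// Accuracy therefore depends entirely on where the boundaries sit.
func quantileFromBuckets(q float64, uppers []float64, cum []float64) float64 {
	total := cum[len(cum)-1]
	rank := q * total
	lower, below := 0.0, 0.0
	for i, c := range cum {
		if c >= rank {
			if c == below {
				return uppers[i]
			}
			return lower + (uppers[i]-lower)*(rank-below)/(c-below)
		}
		lower, below = uppers[i], c
	}
	return uppers[len(uppers)-1]
}

func main() {
	uppers := []float64{0.1, 0.5, 2.5, 10}  // bucket upper bounds (seconds)
	cum := []float64{900, 985, 996, 1000}   // cumulative counts, "le" style
	fmt.Println(quantileFromBuckets(0.99, uppers, cum)) // p99 lands in the 0.5..2.5 bucket
}
```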
I guess it's nice to publish at least the conceptual design so that others can implement it for the "rest of the world" case. Working with OSS can be painful, slow, and time-consuming, so this seems like a reasonable middle ground (although selfishly I do wish all of this was source-available).
The query language is MQL, which closely resembles the internal Python-based query language: https://cloud.google.com/monitoring/mql
Sometimes I feel many open source systems do not give a shit about productivity.
1. A pull-based system can pull less when it is overloaded, which means a pulled service needs to keep historical stats. For instance, the endpoint `/metric` needs to keep previous gauge values or the accumulated counters. That said, a push-based metrics library can keep history too; indeed, that is exactly what the Micrometer library does. (See the sketch after this list.)
2. Don't let the metrics system get overloaded. This sounds like hyperbole, but it is what companies do in practice: the telemetry system is so foundational and critical to an internet company that it should always run smoothly.
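To illustrate (1) with the Prometheus Go client, as one example: the scrape endpoint serves cumulative counters, so a throttled or skipped scrape costs resolution rather than data, and rates are derived at query time from successive samples. A minimal sketch (names made up):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal is cumulative: if the server skips or delays a scrape, no
// information is lost -- the next scrape still sees the running total, and
// rate() is computed at query time from successive samples.
var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "myapp_requests_total",
	Help: "Total requests handled.",
})

func main() {
	prometheus.MustRegister(requestsTotal)

	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Inc()
		w.Write([]byte("ok"))
	})
	http.Handle("/metrics", promhttp.Handler()) // scraped endpoint
	http.ListenAndServe(":8080", nil)
}
```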
We're diving into OTel, and the registration/discovery challenges don't seem to have any kind of best-practice consensus out there. We're looking at NodeRED (the Telegraf agent can query it at startup), but it brings its own challenges.
I haven't read the full paper, but do you know if the push model was revisited mostly for auto-registration / discovery, or performance bottlenecks at the server, or some other concern?
Typically for us, once we've got the hard part - an entity registered - we're happy with pull only. A no-response from a prod end-point is an automatic critical. I guess at their scale there's more nuance around one or more agents being non-responsive.
EDIT: Oh, there's not much in the paper on the subject, as it happens. And yes, it's vanilla discovery woes.
"Push-based data collection improves system robustness while simplifying system architecture. Early versions of Monarch discovered monitored entities and “pulled” monitoring data by querying the monitored entity.
"This required setting up discovery services and proxies, complicating system architecture and negatively impacting overall scalability. Push-based collection, where entities simply send their data to Monarch, eliminates these dependencies."
The way the "pull" collection worked: there was an external process-discovery mechanism that the leaf used to find the entities it was monitoring; the leaf backend processes would connect to an endpoint that each entity's collection library listened on, and those collection libraries would stream the metric measurements according to the schedules that the leaves sent.
First, the leaf-side data structures and TCP connections become very expensive. If a leaf process is connecting to many, many thousands of monitored entities, TCP buffers aren't free, keep-alives aren't free, and neither are a host of other data structures. Eventually this became an...interesting...fraction of the CPU and RAM on these leaf processes.
Second, this implies a service discovery mechanism so that the leaves can find the entities to monitor. This was a combination of code in Monarch and an external discovery service. This was a constant source of headaches and outages, as the appearance and disappearance of entities is really spiky and unpredictable. Any burp in operation of the discovery service could cause a monitoring outage as well. Relatedly, the technical "powers that be" decided that the particular discovery service, of which Monarch was the largest user, wasn't really something that was suitable for the infrastructure at scale. This decision was made largely independently of Monarch, but required Monarch to move off.
Third, Monarch does replication, up to three ways. In the pull-based system, it wasn't possible to guarantee that the measurement each replica sees is the same measurement with the same microsecond timestamp. This was a huge data quality issue that made the distributed queries much harder to make correct and performant. Also, the clients had to pay both in persistent TCP connections on their side and in RAM, state machines, etc., for this replication, as a connection would be made from each backend leaf process holding a replica for a given client.
Fourth, persistent TCP connections and load balancers don't really play well together.
Fifth, not everyone wants to accept incoming connections in their binary.
Sixth, if the leaf process doesn't need to know the collection policies for all the clients, those policies don't have to be distributed and updated to all of them. At scale this matters for both machine resources and reliability. This can be made a separate service, pushed to the "edge", etc.
Switching from a persistent connection to the clients pushing measurements in distinct RPCs as they were recorded eventually solved all of these problems. It was a very intricate transition that took a long time. A lot of people worked very hard on this, and should be very proud of their work. I hope some of them jump in to the discussion! (At the very least they'll add things I missed/didn't remember... ;^)
We're using prom + cortex/mimir. With ~30-60k hosts, plus at least that figure again for other endpoints (k8s, SNMP, etc.), we can get away with semi-manual sharding (OS, geo, env, etc.). We're happy with 1m polling, which is still maybe 50 packets per query, but no persistent connections are held open to agents.
I'm guessing your TCP issues were exacerbated by a much higher polling-frequency requirement? You come back to persistent connections a lot, so this sounds like a bespoke agent, and/or the problem was not (mostly) a connection establish/tear-down performance issue?
The external discovery service - I assume an in-house, and now long disappeared and not well publicly described system? ;) We're looking at NodeRED to fill that gap, so it also becomes a critical component, but the absence only bites at agent restart. We're pondering wrapping some code around the agents to be smarter about dealing with a non-responsive config service. (During a major incident we have to assume a lot of things will be absent and/or restarting.)
The concerns around incoming conns to their apps: it sounds like those same teams you were dealing with ended up having to instrument their code with something from you anyway -- was it the DoS risk they were concerned about?
Amusingly, in the pre-web 1990s at Telstra (an Australian telco) we also developed and implemented a custom performance-monitoring library that was integrated into in-house applications.
The phrase 'you aren't Google' is true for 99.9% of us. We all get to fix the problems in front of us, was my point. And at that scale you've got unique problems, but also an architecture, imperative, and most importantly an ethos that lets you solve them in this fashion.
I was more reflecting on the (actually pretty fine) tools available to SREs caring for off-the-shelf OS's and products, and a little on the whole 'we keep coming full circle' thing.
Anyway, I very much appreciate the insights.
What are some problems (or peculiarities that otherwise didn't exist) with the push based setup?
At another BigCloud, pull/push made for tasty design discussions as well, given the absurd scale of it all.
General consensus was: the smaller fleet always pulls from its downstream; push only if downstream and upstream both have similar scaling characteristics.
Thus there was no queue like a pubsub or Kafka in front of Monarch.
At scale this required a "smoothness of flow". What I mean is that, at the scale the system was operating at, the extent and shape of the latency long tail began to matter. If there are many, many thousands of RPCs flowing through servers in the intermediate routing layers, any pause at that layer, or at the leaf layer below, that extended even a few seconds could cause queueing problems at the routing layer that could impact flows to leaf instances that were not delayed. This would impact quality of collection.
Even something as simple as updating a range map table at the routing layer had to be done carefully to avoid contention during the update, so as not to disturb the flow; in practice that could mean updating two copies of the data structure in a manner analogous to a blue/green deployment.
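Loosely, that style of update looks like the following in Go (nothing like the real code, just the shape of the idea): build the new copy off to the side, then publish it with a single atomic pointer swap so readers on the hot ingest path never wait on an update.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rangeTable stands in for the routing layer's range map; the entry details
// are made up and irrelevant to the point.
type rangeTable struct {
	entries []string
}

// active is the currently published copy; readers do one atomic load, no locks.
var active atomic.Pointer[rangeTable]

// publish builds a fresh copy off to the side and swaps it in atomically,
// the blue/green-style update described above. In-flight readers keep using
// whichever copy they already loaded.
func publish(entries []string) {
	active.Store(&rangeTable{entries: entries})
}

func main() {
	publish([]string{"keys a-m -> leaf-1", "keys n-z -> leaf-2"})
	fmt.Println(active.Load().entries) // hot-path read: a single atomic load
}
```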
At the leaf backends this required decoupling (making eventual) many ancillary updates to data structures that were consulted in the ingest path, and eventually getting to the point where queries and ingest shared no locks.
On top of that, there is also less risk that a herd of misbehaving clients DoSes the monitoring system, usually at exactly the moments when you need such a system the most. This of course wouldn't be a problem with a more scalable solution that separates ingestion from querying, like Monarch.
That's why Google spent all that money to build Monarch. At the end of the day Monarch is vastly cheaper in person time and resources than manually-configured Borgmon/Prometheus. And there is much less friction in trying new queries, etc.
It's not about comparisons; every tool has its own place and feature set that may be right for you depending on what you're doing. But if you've reached the end of the road with Prometheus due to scale, and you need massive scale and perfect compatibility... then Mimir stands out.
> Federation allows a Prometheus server to scrape selected time series from another Prometheus server
It basically does the opposite of what every scalable system does.
To get HA you double your number of pollers.
To scale your queries you aggregate them into other prometheii.
If this is scalability, then everything is scalable.
High Availability always requires duplication of effort. Scaling queries always requires sharding and aggregation at some level.
I've deployed stock Prometheus at global scale, O(100k) targets, with great success. You have to understand and buy into Prometheus' architectural model, of course.
It does not, itself, have highly scalable properties built in.
It does not do sharding, it does not do proxying, it does not do batching, it does not do anything that would allow it to run multiple servers and query over multiple servers.
Look. I'm not saying that it doesn't work; but when I read about Borgmon and Prometheus, I understood the design goal was intentionally not to solve these hard problems, and instead to use them as primitive time series systems that can be deployed with a small footprint basically everywhere (and individually queried).
I submit to you, I could also have an influxdb in every server and get the same “scalability”.
Difference being that I can actually run a huge influxdb cluster with a dataset that exceeds the capabilities of a single machine.
The fact that it could theoretically ingest an infinite amount of data that it cannot thereafter query is not very interesting.
Scalability means running a single workload across multiple machines.
Prometheus intentionally does not scale this way.
I’m not being mean, it is fact.
It has made engineering design trade-offs, and one of those means it is not built to scale. This is fine; I'm not here pooping on your baby.
You can build scalable systems on top of things which do not individually scale.
Sorry for being rude, but this level of ignorance is extremely frustrating.
Scalability is defined differently depending on context; in this context (a monitoring/time series solution) it means being able to hold a dataset larger than a single machine can, by scaling horizontally.
Downsampling the data or transforming it does not meet that criteria, since that’s no longer the original data.
The way Prometheus "scales" today is a bolt-on passthrough with federation. It's not designed for it at all, which means that your query will use other nodes as data sources until it runs out of RAM evaluating the query. Or not.
The most common method of "scaling" Prometheus is making a tree; you can do that with anything, so it is not inherent to the technology and thus not a defining characteristic. If everything can be defined the same way then nothing can be; the term ceases to have meaning: https://valyala.medium.com/measuring-vertical-scalability-fo...
I'll tell you how Influx scales: your data is horizontally sharded across nodes, and queries are conducted across shards.
That’s what scalability of the database layer is.
Not fetching data from other nodes and putting it together yourself.
Rehydrating from many datasets is not the storage system scaling: the collector layer doing the hydration is the thing that is scaling.
If you sold me a solution that used Prometheus underneath but was distributed across all nodes, perhaps we could talk.
But scalability is not a nebulous concept.
You should refer to your own docs if you think Prometheus isn't a database; it certainly contains one: https://prometheus.io/docs/prometheus/latest/storage/
I should add (and extremely frustratedly): if you’re not lying and you’re a core Prometheus maintainer, you should know this. I’m deeply embarrassed to be telling you this.
This just isn't true :shrug: Horizontal scaling is one of many strategies.
While Prometheus supports sharding queries when a user sets it up, my understanding is that this has to be done manually, which is definitely less convenient. This is better than a hypothetical system that doesn't allow this at all, but still not the same as something that handles scaling magically.
Monitoring as a service has a lot of advantages.
It scales so well that many aggregations are set up and computed for every service across the whole company (CPU, memory usage, error rates, etc.). For basic monitoring you can run a new service in production and go and look at a basic dashboard for it without doing anything else to set up monitoring.
Requiring query authors to understand the arrangement of their aggregation layer seems like a reasonable idea but is in fact quite ridiculous.
Why it exists is laid out quite plainly.
The pain of it is we're all jumping on Prometheus (Borgmon) without considering why Monarch exists. Monarch doesn't have a good analogue outside of Google.
Maybe some weird mix of TimescaleDB backed by CockroachDB with a Prometheus push gateway.
Disclaimer: I work at vmware on an unrelated thing.
If you think about Bigtable, a key observation that the Monarch team made very early on is that, if you can support good materialized views (implemented as periodic standing queries) written back to the memtable, and the memtable can hold the whole data set needed to drive alerting, this can work even if much of Google's infrastructure is having problems. It also allows Monarch to monitor systems like Bigtable, Colossus, etc., as it doesn't use them as serving dependencies for alerting or recent dashboard data.
It's a question of optimizing for graceful degradation in the presence of failure of the infrastructure around the monitoring system. The times the system will experience its heaviest and most unpredictable load will be when everyone's trying to figure out why their service isn't working.
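A toy way to picture the standing-query idea (very loosely; nothing like the real implementation): an aggregation evaluated on a timer whose output is written back into the same in-memory table the alert evaluator reads, so alerting never has to reach out to other storage systems at evaluation time.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

// memTable stands in for the in-memory store that both ingest and alerting
// read. Everything here is made up purely to illustrate the shape of the idea.
type memTable struct {
	mu     sync.Mutex
	series map[string][]float64
}

// materialize plays the role of a periodic standing query: every interval it
// aggregates the raw per-task series and writes the result back as a derived
// series, so alert evaluation only ever reads precomputed, in-memory data.
func (t *memTable) materialize(interval time.Duration) {
	for range time.Tick(interval) {
		t.mu.Lock()
		var total float64
		for key, points := range t.series {
			if strings.HasPrefix(key, "task:") && len(points) > 0 {
				total += points[len(points)-1] // latest error count per task
			}
		}
		t.series["service:errors:sum"] = append(t.series["service:errors:sum"], total)
		t.mu.Unlock()
	}
}

func main() {
	t := &memTable{series: map[string][]float64{
		"task:0:errors": {3}, "task:1:errors": {5},
	}}
	go t.materialize(100 * time.Millisecond)
	time.Sleep(250 * time.Millisecond)

	t.mu.Lock()
	fmt.Println(t.series["service:errors:sum"]) // e.g. [8 8]
	t.mu.Unlock()
}
```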