Yelp open-sources Clusterman, a cluster autoscaler for Kubernetes and Mesos


Earlier this year, I wrote a blog post showing off some cool features of our in-house compute cluster autoscaler,
Clusterman (our Cluster Manager). This time, I’m back with two announcements that I’m really excited about! First, in
the last few months, we’ve added another supported backend to Clusterman, so it can now scale Kubernetes clusters as
well as Mesos clusters. Second,
Clusterman is now open-source on GitHub so that you, too, can benefit from
advanced autoscaling techniques for your compute clusters. If you prefer to just read the code, you can head there now
to find some examples and documentation on how to use it; and if you’d like to know a bit more about the new features
and why we’ve built them, read on!

Going from Mesos to Kubernetes

Over the last five years, we’ve talked (and written) a lot about our compute stack at Yelp; we’ve gone from our
monolithic yelp_main repo to a fully-distributed, service-oriented architecture running in the cloud on top of Apache
Mesos and our in-house platform-as-a-service, PaaSTA. And, truthfully, without that move, we wouldn’t have been able to
grow to the scale we’re at now. We’ve been hard at work this year preparing our infrastructure for even more growth,
and realized that the best way to achieve this is to move away from Mesos and onto Kubernetes.

Kubernetes allows us to run workloads (Flink, Cassandra, Spark, and Kafka, among others) that were once difficult to
manage under Mesos (due to local state requirements). We strongly believe that managing these workloads under a common
platform (PaaSTA) will boost our infrastructure engineers’ output by an order of magnitude (can you imagine spinning up
a new Cassandra cluster with just a few lines of YAML? We can!).

In addition, we’re migrating all of our existing microservices and batch workloads onto Kubernetes. This was a point of
discussion at Yelp, but we eventually settled on this approach as both a way to reduce the overhead of maintaining two
competing schedulers (Mesos and Kubernetes), and to take advantage of the fast-moving Kubernetes ecosystem. Thanks to
the abstractions that PaaSTA provides, we’ve been able to do this migration seamlessly! Our feature developers don’t
know their service is running on top of an entirely different compute platform.

Of course, to make this migration possible, we needed to build support for Kubernetes into all the tooling around our
compute clusters, including our very important autoscaler, Clusterman. Due to Clusterman’s modular design, this was
easy! We simply defined a new connector class that conforms to the interface the autoscaler expects. This connector
knows how to talk to the Kubernetes API server to retrieve metrics and statistics about the state of the Kubernetes
cluster it’s scaling. These metrics are then saved in our metrics data store and passed to the signals and
autoscaling engine, which determines how to add or remove compute resources.
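To make the connector idea concrete, here is a minimal sketch of what such an interface could look like. Every name in it (`ClusterConnector`, `FakeKubernetesConnector`, `Node`, `collect_metrics`) is hypothetical and chosen for illustration; it is not Clusterman’s actual API. The fake connector reads nodes from a callable so the example runs offline, where a real one would query the Kubernetes API server.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Node:
    """One machine in the cluster (hypothetical shape, for illustration only)."""
    name: str
    cpus_total: float
    cpus_allocated: float


class ClusterConnector(ABC):
    """Hypothetical interface the autoscaler talks to, whatever the backend."""

    @abstractmethod
    def reload_state(self) -> None:
        """Refresh the connector's view of the cluster."""

    @abstractmethod
    def get_resource_total(self, resource: str) -> float:
        """Total amount of `resource` across all known nodes."""

    @abstractmethod
    def get_resource_allocation(self, resource: str) -> float:
        """Amount of `resource` currently allocated to workloads."""


class FakeKubernetesConnector(ClusterConnector):
    """Stand-in for a Kubernetes backend.

    A real connector would call the Kubernetes API server in reload_state();
    this one takes a callable returning nodes so the sketch has no dependencies.
    """

    def __init__(self, fetch_nodes):
        self._fetch_nodes = fetch_nodes
        self._nodes = []

    def reload_state(self) -> None:
        self._nodes = list(self._fetch_nodes())

    def get_resource_total(self, resource: str) -> float:
        return sum(getattr(n, f"{resource}_total") for n in self._nodes)

    def get_resource_allocation(self, resource: str) -> float:
        return sum(getattr(n, f"{resource}_allocated") for n in self._nodes)


def collect_metrics(connector: ClusterConnector) -> dict:
    """Gather the numbers that would be written to a metrics store."""
    connector.reload_state()
    total = connector.get_resource_total("cpus")
    allocated = connector.get_resource_allocation("cpus")
    return {
        "cpus_total": total,
        "cpus_allocated": allocated,
        # Guard against an empty cluster to avoid dividing by zero.
        "cpus_utilization": allocated / total if total else 0.0,
    }
```

Because `collect_metrics` only depends on the abstract interface, a Mesos-backed connector could be swapped in without touching the code that consumes the metrics, which is the property the modular design relies on.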

Why Clusterman? Why Now?

We’re big proponents of open-source software at Yelp; we benefit from the efforts of many other open-source projects and
release what we can back into the community. Ever since Clusterman’s inception, we’ve had the dream of open-sourcing
it, and now that it has support for Kubernetes, there’s no better time to do so!

Whenever a project like this is released, the first question people ask is, “Why should I use your product instead of
this other, established one?” Two such products are AWS Auto Scaling for Spot Fleet and the Kubernetes Cluster
Autoscaler. So let’s compare and contrast Clusterman with them:

| Clusterman | Auto Scaling for Spot Fleet | Kubernetes Cluster Autoscaler |
| --- | --- | --- |
| Supports any type of cloud resource (ASGs, spot fleets, etc.) | Only for Spot Fleets | Only supports homogeneous cloud resources (all compute resources must be identical) |
| Pluggable signal architecture | Three different scaling choices: target tracking, step functions, or time-based | Scales the cluster when pods are waiting to be scheduled |
| Can proactively autoscale to account for delays in node bootstrapping time | No proactive scaling | Waits for nodes to join the cluster before continuing |
| Basic Kubernetes support | No knowledge of Kubernetes | Supports advanced features like node and pod affinity |
| Can simulate autoscaling decisions on production data | No simulator | No simulator |
| Extensible (open-source) | Closed-source API | Extensible (open-source) |

A few highlights we’d like to call out: first, note that Clusterman is the only one of these three autoscalers that can
support a mixture of cloud resources (Spot Fleets, Auto Scaling Groups, etc.), and it can even handle this within the
same cluster! This allows for a very flexible infrastructure design.
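As a rough illustration of what mixing resource types in one cluster could mean in practice, the sketch below treats each cloud resource as a generic “resource group” and spreads a desired capacity across all of them. The `ResourceGroup` class and `set_pool_capacity` function are hypothetical names, and the even-split policy is an assumption made for the example; Clusterman’s real balancing logic may differ.

```python
from dataclasses import dataclass


@dataclass
class ResourceGroup:
    """Hypothetical wrapper around one cloud resource (ASG, Spot Fleet, ...)."""
    kind: str
    target_capacity: int = 0

    def modify_target_capacity(self, new_target: int) -> None:
        # A real implementation would call the cloud provider's API here.
        self.target_capacity = new_target


def set_pool_capacity(groups: list, desired_total: int) -> None:
    """Spread desired_total as evenly as possible over every group."""
    if not groups:
        raise ValueError("a pool needs at least one resource group")
    base, remainder = divmod(desired_total, len(groups))
    for index, group in enumerate(groups):
        # The first `remainder` groups each take one extra unit.
        group.modify_target_capacity(base + (1 if index < remainder else 0))


# One pool backed by two different kinds of cloud resource.
pool = [ResourceGroup("asg"), ResourceGroup("spot_fleet"), ResourceGroup("spot_fleet")]
set_pool_capacity(pool, 10)
```

The autoscaler only ever calls `modify_target_capacity`, so it does not need to know which kind of resource sits behind each group.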

Moreover, Clusterman’s pluggable signal architecture lets you write any type of scaling signal you can imagine (and
write in code). At Yelp, we generally believe that the Kubernetes Cluster Autoscaler approach (scale up when pods are
waiting) is right for “most use cases,” but having the flexibility to create more complex autoscaling behavior is really
important to us. One example of how we’ve benefitted from this capability is Jolt, an internal tool for running unit and
integration tests. The Jolt cluster runs millions of tests every day, and has a very predictable workload; thus, we
wrote a custom signal that allows us to scale up and down before pods get queued up in the “waiting” state, which saves
our developers a ton of time running tests! To put it another way, the Kubernetes Cluster Autoscaler is reactive, but
Clusterman has enough flexibility to be proactive and scale up before resources are required.
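The difference between reactive and proactive signals can be sketched in a few lines. This is not the Jolt signal or Clusterman’s real signal interface: the `Signal` base class, both subclasses, the metric keys, and the `headroom` parameter are all assumptions made for illustration. The reactive signal only asks for more CPUs once pods are pending, while the proactive one looks up the load expected in the next hour from a (hypothetical) table of historical demand.

```python
import math
from abc import ABC, abstractmethod


class Signal(ABC):
    """Hypothetical plug-in interface: metrics in, requested resources out."""

    @abstractmethod
    def value(self, metrics: dict, hour_of_day: int) -> dict:
        """Return the resources the cluster should have, e.g. {"cpus": 120}."""


class PendingPodsSignal(Signal):
    """Reactive: request more CPUs only once pods are already waiting."""

    def value(self, metrics: dict, hour_of_day: int) -> dict:
        return {"cpus": metrics["cpus_allocated"] + metrics["cpus_pending"]}


class ScheduledWorkloadSignal(Signal):
    """Proactive: request the CPUs expected in the *next* hour."""

    def __init__(self, expected_cpus_by_hour: dict, headroom: float = 1.25):
        self.expected = expected_cpus_by_hour
        self.headroom = headroom

    def value(self, metrics: dict, hour_of_day: int) -> dict:
        next_hour = (hour_of_day + 1) % 24
        forecast = self.expected.get(next_hour, 0) * self.headroom
        # Never request less than what is already in use right now.
        return {"cpus": math.ceil(max(forecast, metrics["cpus_allocated"]))}


# Same cluster state, two different answers at 08:00.
metrics = {"cpus_allocated": 100, "cpus_pending": 0}
reactive = PendingPodsSignal().value(metrics, hour_of_day=8)
proactive = ScheduledWorkloadSignal({9: 400}).value(metrics, hour_of_day=8)
```

With nothing pending at 08:00, the reactive signal sees no reason to grow the cluster, while the scheduled signal already requests capacity for the 09:00 peak so that nodes can finish bootstrapping before the work arrives.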

To be fair, not everyone needs the ability to make complex autoscaling decisions; many users will be just fine using
something like the AWS Spot Fleet Autoscaler or Kubernetes Cluster Autoscaler. Fortunately for these users, Clusterman
can be easily swapped in as needed. For example, it can be configured to read all of the same node labels that the
Kubernetes Cluster Autoscaler does, and behave appropriately. Also note that the Kubernetes Cluster Autoscaler does
support some Kubernetes features that Clusterman doesn’t (yet) know about, like pod affinity and anti-affinity. But
we’re constantly adding new features to Clusterman, and of course, pull requests are always welcome!

Want to Know More?

If you’re as excited as we are about this release, we encourage you to head over to our GitHub and check it out! Give
it a star if you like it, and if you have any questions about getting Clusterman set up in your environment, feel free
to open an issue or send us an email! Also, we’d love to hear any success stories you have about autoscaling with
Clusterman, or Kubernetes in general; you can reach us on Twitter (@YelpEngineering) or on Facebook (@yelpengineers).


David is going to be at KubeCon 2019 and will happily talk your ear off about Clusterman and Kubernetes; ping him on
Twitter or find him in the hallway track.


