I’ve written mostly blog posts and whitepapers, largely for employers. I’ve included links here in case they might be interesting.
- (2024, Netflix) Introducing Netflix’s TimeSeries Data Abstraction Layer: A post I contributed to, written by my colleagues Rajiv Shringi, Vinay Chella, Kaidan Fullerton, and Oleksii Tkachuk for Netflix. The post lays out how we were able to handle 15 million+ writes to immutable event datasets, including the concepts and architecture, which I co-designed with Rajiv, that were needed to hit this scale.
- (2024, Netflix) Introducing Netflix’s Key-Value Data Abstraction Layer: A post I co-authored with Vidhya Arvind, Rajasekhar Ummadisetty, and Vinay Chella for Netflix that shared how we built our Key-Value Data Abstraction Layer on the Data Gateway platform. Goes into some great API design tips for making reliable stateful systems.
- (2024, InfoQ) How Netflix Ensures Highly-Reliable Online Stateful Systems: An article published on InfoQ on how to design stateful systems for reliability, so that they absorb load spikes and handle failure gracefully. It is a review article of the same talk from QConSF 2023, and is also included in PDF form in the Architecture Through Different Lenses eMag.
- (2024, Netflix) Enhancing Netflix Reliability with Service-Level Prioritized Load Shedding: A post I co-authored with Anirudh Mendiratta, Kevin Wang, Javier Fernandez-Ivern, and Benjamin Fedorka for Netflix that shared how we implemented quality-of-service prioritization techniques at the service layer to keep Netflix up even under sudden traffic spikes.
- (2024, Netflix) Data Gateway - A Platform For Growing and Protecting the Data Tier: A post I co-authored with Shahar Zimmerman, Vidhya Arvind, and Vinay Chella for Netflix that shared the architecture of our Data Gateway Platform. The Data Gateway Platform at Netflix hosts Data Abstraction Layers (DALs) that shield developers from complex and backwards-incompatible database API changes.
- (2020, IEEE) Towards Practical Self-Healing Distributed Databases: A paper published in the proceedings of the 2020 IEEE Infrastructure Conference about how one can build self-healing databases out of existing software and hardware without replacing the entire database engine.
- (2019, Netflix) Garbage Collecting Unhealthy JVMs, a proactive approach: A post I authored with Josh Snyder for Netflix that introduced and explained how we use the jvmquake agent to rescue our distributed databases written in Java from JVM death spirals.
- (2018, whitepaper) Cassandra Availability with Virtual Nodes: A whitepaper I authored with Josh Snyder that attempted to formally model Cassandra’s availability under different numbers of tokens per node. TLDR: use no more than 4 tokens if you want high availability in a Dynamo style database. The paper is based on this notebook.
- (2017, Yelp) Taking Zero Downtime Load Balancing even Further: In this post I showed how Yelp had evolved their highly available and scalable service mesh based on SmartStack to use NGINX and HAProxy and get the best of all worlds.
- (2016, Yelp) Monitoring Cassandra at Scale: A post I wrote for Yelp about how they monitored their distributed Cassandra deployments, taking into account full ring health. Also includes some helpful examples of how to interact with Cassandra’s JMX interface from Python.
- (2015, Yelp) True Zero Downtime HAProxy Reloads: A post I wrote for Yelp about how they reloaded HAProxy without any downtime to requests using Linux queueing disciplines. A super hacky way to achieve the end that has since been superseded by better techniques. It was pretty novel at the time though.
- (2014, Yelp) Scaling Elasticsearch to Hundreds of Developers: A blog post I wrote for Yelp about the Apollo data gateway, which acted as a proxy tier to NoSQL databases such as Elasticsearch (and later Cassandra). Basically this was an API gateway for datastores, which was ridiculously useful and helped Yelp’s distsys-data team scale datastores and upgrade them all the time.