
[ep 8] Geo-distribution, Data Residency, App Performance & Compliance | The Cockroach Hour

In this episode of The Cockroach Hour, we discuss some important concepts in distributed systems, including geo-distributed data, data residency, app performance, and compliance. Whether data is distributed across a single region, the country, or across the globe, WHERE it resides can be incredibly important. The speed of light, data privacy, and compliance with data regulations (like the GDPR) are all things you may need to consider.

But, how do you control data residency? Do you leave it to the application layer? We believe the database should control where data lives. Why? Because with distributed systems this is of the utmost importance.

So, we built geo-partitioning in CockroachDB to help you configure this on each table at the row level. And while this helps you get better performance in geo-distributed deployments, it also has the added benefit of helping organizations comply with data privacy regulations.
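A rough sketch of what that row-level control looks like in CockroachDB SQL (geo-partitioning is an enterprise feature; the users table, region values, and zone constraints here are hypothetical, not a schema from the episode):

    CREATE TABLE users (
        region STRING NOT NULL,   -- e.g. 'germany' or 'usa'; drives placement
        id UUID NOT NULL DEFAULT gen_random_uuid(),
        email STRING,
        PRIMARY KEY (region, id)  -- the locality column leads the key
    ) PARTITION BY LIST (region) (
        PARTITION germany VALUES IN ('germany'),
        PARTITION usa VALUES IN ('usa')
    );

    -- Pin each partition's replicas to nodes started with a matching locality.
    ALTER PARTITION germany OF TABLE users
        CONFIGURE ZONE USING constraints = '[+region=europe-west1]';
    ALTER PARTITION usa OF TABLE users
        CONFIGURE ZONE USING constraints = '[+region=us-east1]';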

  • 00:00 Introduction to geo-partitioning in global applications (and to the speakers)
  • 6:26 What is geo-partitioning?
  • 10:31 How does CockroachDB partition data at the row level?
  • 12:55 Using a Key-Value Store
  • 13:23 Primary Keys in CockroachDB
  • 16:44 Covering index
  • 17:30 How automatic sharding relieves operational burden
  • 18:03 What happens when you have a user that moves? What pattern keeps data close? ('Follow the user' use-case)
  • 19:12 CockroachDB was built to solve latency problems
  • 19:45 Using data partitioning for compliance with data regulations
  • 22:42 Data partitioning demo begins with a distributed 9-node cluster
  • 28:40 How to improve database latency (online schema change)
  • 33:00 P99 and P90 latency improvements
  • 37:35 Design to Survive. Don't design for failure.
  • 41:15 Is there an easy way to add a 3rd data center that participates in quorum? And why do we need 3?
  • 46:45 Can CockroachDB run in Kubernetes?
  • 49:35 How to solve hot spots 

Jim Walker:

Alrighty. Good morning, good afternoon, good evening, everybody, wherever you are. And thank you again; I'm seeing some familiar names here, people that have joined us in the past. But thank you everybody for joining us today. Today's topic of The Cockroach Hour is something that is very near and dear to my heart. Something I love about Cockroach, something I believe in, and that's a feature that we call geo-partitioning. And we're going to talk about it in terms of how do you think about global applications and how do you think about surviving things? How do you think about your data privacy and compliance with data regulations across the whole planet and whatnot. This is a feature that our engineers built, and I think the value of it goes well beyond where we started with it.

Jim Walker:

And so we're going to dive into that. But first, before I get into everything, a bit of housekeeping. So there is a QA panel. Please do ask questions there. I know there's many members of our team that are engaged here. Tim who's with me today, and Keith, both can answer in the QA. Feel free to engage in the chat as well. And the chat is a little bit more public. I like the chat channel, actually. In fact, again, many of our technical resources are there. We've had conversations in there where it's just the whole time people are chatting, which is just fantastic. A recording of this event will be available. I know our team gets this up on our YouTube channel. Absolutely this afternoon, it's usually pretty quick before these things are up there.

Jim Walker:

So those are some quick items before we get started. So we often get asked, "Is this session beginner, intermediate, advanced?" Just to set the table up front, this is one of those feedback items that somebody gave us on our surveys that we run at the end. And by the way, at the end of this there is going to be a survey. So we'd love to get your feedback in terms of how we can make these things better. This is kind of one of those things that we learned, people do like to understand if this is beginner, intermediate, or advanced. We've had a couple advanced, we've had a lot of beginners. This one's a really nice intermediate. We're going to talk about geo-partitioning, which is definitely a feature unique to CockroachDB in the way we do it. We're going to talk about latency, the 100 millisecond rule, and what that actually means to people and to transactions as you do these things across an entire planet and deploy applications in that way.

Jim Walker:

And we're going to get into code. We're going to actually get into product today a little bit. My friend Keith is going to actually run a little bit of a demo here too. So I'm looking forward to this, but I am joined... Well actually, last thing is, I guess I have to do my little advertisement. So if you ask good questions, you get one of these mugs. So that lasts like 40 seconds. I was running to get my coffee from the kitchen. So Keith and Tim, do you want to join me onscreen?

Tim Veil:

Morning [crosstalk 00:02:58]

Jim Walker:

Uh-oh, Keith's got some beta issues going on, so I'm sure he'll be fine though. Good morning, Tim. So if anybody has joined us for these before, you'll probably be pretty familiar with my friend, Tim Veil. Tim, do you want to introduce yourself and Keith? I think you guys come together as a team, right?

Tim Veil:

We do. So Tim Veil, Head of Solutions Engineering here at Cockroach Labs. Still haven't replaced the ceiling tiles in my office, and that is not a propeller on my head, that is just my ceiling fan. And by the way, I still don't have a mug. So maybe I'll start asking the questions today. But congratulations to Nicholas. Nicholas is out there welcoming everybody. I love the enthusiasm. Get that guy a mug.

Jim Walker:

That's right. That's right. And then Keith, you joined us as well. Are you guys sales engineers, solution engineers? You guys work a lot with our customers, right? So Keith, what is your role here? And please say hello.

Keith McClellan:

So I am on the solutions engineering team here at Cockroach Labs. I largely work with our partner and technology ecosystem. I also don't have a mug, so maybe I should ask some good questions today. And then both Tim and I will get a package from JP. I'm looking forward to participating today.

Jim Walker:

You've been so incredibly helpful throughout these things, you deserve a mug anyway. So y'all, Keith is going to give us a demo a little bit later and actually show us how we can partition data and tie data to locations, and we'll have some really cool conversations. And I personally thank Keith, because he's helped move me in some directions in terms of how I think about these things as well. When I look at CockroachDB and what we're doing and other solutions that are out there, I've been joking lately with people and saying things like, "Man, can you imagine the pandemic without Cassandra?" It's kind of a funny joke, but not really, because honestly, think about Netflix, think about Apple, think about these large consumers; a lot of the stuff people have entertained themselves with at home, Cassandra is driving a lot of that.

Jim Walker:

And Keith comes to us with a world of Cassandra experience. And I think if you're rooted in that world, understanding some of the concepts in CockroachDB makes sense. And I think geo-partitioning is one of those things that you can do in Cassandra. I remember I asked Keith how to do it, and I felt like I was asking him what time it was and he explained to me how to build a clock, because it's tough. It can be complex, and we try to make these things as easy as possible. And hopefully in this conversation, we'll get into that a little bit. So anything else, guys, before I get into it? We had a basic flow here.

Tim Veil:

Looking forward to it as always.

Jim Walker:

Neither of you disagreed with my comment that I think the pandemic would be horrible without Cassandra.

Tim Veil:

You threw me there for a minute. Not going to lie, but I got it now.

Jim Walker:

I know, my job is to throw you off a little bit, Tim. That's kind of my...

Tim Veil:

Easy to do.

Jim Walker:

I show up every day thinking how I do that.

Keith McClellan:

The pandemic would be worse without CockroachDB too, by the way.

Jim Walker:

Well, that's true too.

Keith McClellan:

I've had lots of food delivery from DoorDash.

Jim Walker:

That's right. Exactly. And we'll get into that one too. And the importance of geo-partitioning in the context of scale and resilience and all that. So let's just start with, I've said this word geo-partitioning, people think about it. I think it makes sense to me. Do either of you want to take a crack at it? Just, what is your definition of geo-partitioning? Tim came off mute first. So there you go, buddy.

Tim Veil:

Yeah, sure. So the way I think of it is, you have to go back and fundamentally understand what CockroachDB is. And I think probably most folks who've come to this understand, but we are, at heart, a distributed OLTP engine. Well, what does that word distributed mean? It means, fundamentally, we're going to be storing data across many different nodes. Most distributed systems are doing that, right? They're placing data around a cluster of hardware, virtual or otherwise. But geo-partitioning really takes that a step further, at least it does for us, in that you have a lot of control at your disposal over where data lives in the cluster. And that becomes very, very important as the topology of your Cockroach cluster gets farther and farther apart, to the point where you could have nodes contributing to a Cockroach cluster that are in different parts of the world. And so we'll get into all the ins and outs and benefits today. But for me, geo-distributed really means we're taking distributed to the extreme and moving the endpoints far out to the edge, but still performing, in our world, very, very well.

Jim Walker:

That's right. That's right. Keith, do you want to add anything to that? What was missing? What do you think?

Tim Veil:

Yeah Keith, give them the real answer now.

Keith McClellan:

Well, so for me, the way I like to think about this is that we distribute the authority to act on the data that's in CockroachDB across all of the nodes in the cluster. So any one node at any one location doesn't necessarily have the authority to act on all of the data, but it owns some portion of the data. And what that allows us to do is distribute the data in a highly available fashion, allows us to guarantee consistency because we're not necessarily having every node involved in every single query. But it also gives us this great ability to be performant and distributed at the same time. So that would be the only thing I would add to that.

Jim Walker:

Yeah. And I think performant, distributed, and survivable, Keith, just to add. And just to build on something that both of you said, Tim, the concept of geo-distributed versus geo-partitioning: the fact that we are geo-distributed, the fact that we can actually distribute data across broad geographies or even in a single data center, is what required us to build a feature called geo-partitioning. And I think when you think about the data model as you build an application, one of those unique things that you have to think about with CockroachDB is... Well, I usually think about, I don't know, referential integrity or data types. That's what you're thinking about when you're defining tables. With Cockroach, it's a little bit different, because you've got to start thinking about resilience and latency, and where data lives in a distributed environment contributes to what that looks like.

Jim Walker:

And it's at the table level. This isn't the whole database; we are doing this at the row level, actually, which I find quite unique. Tim Veil, you know this person. There was a guy named Sean Connolly who I look up to as one of these really brilliant minds. And when I was first looking at Cockroach, he's like, "Jim, it's right." Because his premise is that all data in the future will need time and location, because it's the space-time continuum. And I agree. And the space, where things are, is I think always a challenge.

Jim Walker:

And I think, why do it in the application? Let's let the database deal with that, because that's the layer where it should be abstracted out. So our engineers have been building these features into Cockroach for some time, but underneath the covers there's a whole bunch of stuff going on. At a cocktail party, Keith, if you had to go and explain this to somebody, how we actually do this... Because look, there's some smart people on the phone, but from a code point of view, ultimately under the covers, how do you explain how we do geo-partitioning without slides?

Keith McClellan:

Well, at a cocktail party, if someone's asking me this question, usually I'm going to offer them another cocktail and maybe try to change the topic of conversation a little bit. But at a high level...

Jim Walker:

There's no alcohol in this, by the way, I swear.

Keith McClellan:

What we try to do is, and from an engine level, what we do is we have a KV store. Currently, we use something called RocksDB, but we're moving to an in-house key-value store that we built called Pebble in our next release, which is a really cool advancement. We can talk a little bit about that later here. And we have a SQL engine that runs... It's the user-facing query interface. And then we have this KV engine that transforms rows and columns into this key-value structure. And the novel thing about using this key-value store effectively as our storage engine for CockroachDB is it allows us to put partitioning and placement information in the key, which allows for deterministic placement and retrieval of data in a way that previously wouldn't have been possible. So that enables a fully ACID-compliant experience with rich SQL and the scale-out attributes that we've had with KV stores and other NoSQL databases for the last 10, 15 years.

Jim Walker:

Yeah. And ultimately, Keith, the trick is basically, look, we're going to expose the database as a fully relational, transactional database. To the developer, that's what this looks like. Underneath the covers, we're using something called Pebble. And by the way, Keith, I've just flipped the switch in my mind on Pebble, actually. So I think in 20.1, any current customers, we're actually going to flip over to full use of Pebble underneath the covers. And there was a blog post actually on our blog that gets into Pebble, if anybody's interested in actually reading about that; it was pretty cool stuff. But ultimately in our store, we're storing things as KV, where we're using key-value pairs.

Jim Walker:

But the key, the K in the KV, is a big ordered set. And if we can embed a code or whatever into that, we can actually sort things correctly and then pick things off and place them where we want to. So I think it's an incredible piece. And I think one of the first challenges people struggle with is primary keys in CockroachDB. So Tim, you get a lot of conversations with customers on primary keys. What do those conversations look like? What should people be thinking about as they think about their primary key for each table?

Tim Veil:

Yeah, it's a great question. So we have a number of best practices, and we have some anti-patterns; maybe while I'm talking, somebody can post the link into the chat, because I think it's really good reading, actually, for folks who are starting out with CockroachDB. But for us, primary keys, in terms of best practice, tend to be something like a GUID or UUID, as opposed to some monotonically increasing thing that could potentially lead to hotspots. But as we're having a conversation about geo-partitioning, one of the things that you can begin to do, and I know we'll explore this in greater detail, is begin to add some kind of locality-type key as part of a composite primary key. And this opens up a lot of interesting possibilities for us. So a great example of this, and again, I assume we'll get into greater detail, would be a primary key that is composed in part of a UUID, as well as, say, for example, a US state indicator. And this has all sorts of far-reaching positive impacts on the kinds of things that you can do.
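A minimal sketch of that composite-key pattern (the orders table is hypothetical):

    CREATE TABLE orders (
        state STRING NOT NULL,                       -- locality indicator, e.g. 'CA'
        id UUID NOT NULL DEFAULT gen_random_uuid(),  -- random, so writes spread evenly
        total DECIMAL,
        PRIMARY KEY (state, id)
    );
    -- By contrast, a monotonically increasing key (e.g. SERIAL) funnels all
    -- new writes into the last range of the table and can create a hotspot.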

Jim Walker:

All right. So Tim, you would have UUID and then the state ID, or let's say country code.

Tim Veil:

Yeah, country code. Sure.

Jim Walker:

Or even just Germany, US. And basically the KV, basically the storage layer, is sorting everything alphabetically, just think about it. Well, lexicographically, but you know what I mean. So basically you have UUID, and so basically everything German is in one section of the table. Everything US is in another section of the table. And as long as we're creating ranges and creating these replica sets... and if you guys are familiar, if you want to actually dive into this, there is the Architecture of CockroachDB webinar we did last week that will actually get into this. But it's like the first half of the table lives on this server. The second half of the table is going to live on that server. We can do that now because Raft and our replicas allow us to actually do that. So I think the primary key is a key piece of this. Sorry for the, it's a key piece, get it? Keith, come on, man. It was good. Come on.

Tim Veil:

Have you seen Keith smile? He seems very angry about the topic [crosstalk 00:00:43].

Jim Walker:

He's very angry today.

Tim Veil:

Come on.

Jim Walker:

But I think it's a key piece and we basically just encode that in the KV store. So I think underneath the covers, it's actually pretty important to go into that. People can go wrong too. Choosing the wrong primary key can get you in trouble. And I think it's a really critical thing. I think in the last release, we also introduced ... we can change these things, right?

Keith McClellan:

Yes, you absolutely can. I'll answer; I'll say the same words that Tim was going to say while he was on mute. So one of those things that can get you into trouble, quite frankly, with a lot of the NoSQL databases that are out there on the market right now, is that once you've picked a key pattern, you have to do a lot of work to get yourself into a healthy situation if you have to make a change there. Whereas we can do those changes live, either by changing the primary key on the table, or... We also have something called a covering index, which allows us to have indexes that can respond to queries, that are distributed differently than the base table, but are also guaranteed to be consistent.

Keith McClellan:

Between those two patterns, we can generally handle any partitioning structure that we're going to need to survive a failure or to give you performant queries in a lot of different structures. We're actually going to cover that later in the demo today.

Jim Walker:

Yeah. And I think it's important. Online schema change is a big deal. If you think about it, this is just basically sharding based on location. That's really what we're doing here, right? We can shard for lots of different reasons, and our ranges allow us to do that. And if anybody's ever had to manually shard a database and then reshard that database and then reshard it again, we just basically deal with all that. And the online schema changes will help you move that data to various different locations. So in our example, we had German records at the top and then US records. What if, I don't know, let's just do Albania, and that came in at the top. Albania, Germany, USA, whatever. We can actually alter a table, and all those records that were pushed all over the place would automatically, magically be spun up on, or reside on, these other locations.
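A sketch of that kind of online repartitioning, reusing the hypothetical users table from the earlier sketch (repartitioning replaces the table's whole partition list):

    ALTER TABLE users PARTITION BY LIST (region) (
        PARTITION albania VALUES IN ('albania'),
        PARTITION germany VALUES IN ('germany'),
        PARTITION usa VALUES IN ('usa')
    );

    -- Changing the primary key itself is also an online operation (v20.1+):
    -- ALTER TABLE users ALTER PRIMARY KEY USING COLUMNS (region, id);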

Keith McClellan:

More challenging of a type of a problem is what happens if you have a user that moves. So a pretty common use case, is somebody is moving across the country or moving to a new country. And all of a sudden the partitioning decisions you were making for that user previously no longer make sense. Or even if you're just traveling to a new space for a while. If I'm on vacation and I decide to head out to the West coast, I'm based on the East coast, but let's say I want to head to California for a vacation. I don't want to have bad performance on my apps for the next two weeks because my data is distributed so that it's living on the East coast, we should be able to sense that and change the way we're distributing the data dynamically. So I only have to take that performance hit the first time that I'm working in a particular app or whatnot from a new location. Those are the types of advanced patterns that are really hard to do in pretty much every database on the planet except for CockroachDB.

Jim Walker:

Yeah. And the original premise here was it was about latency, right? When the team built this, it was largely due to solving these types of issues. And again, another architectural consideration: if you think about Raft and what a Raft leader is, if you're familiar with that, I mean, we're actually putting a Raft leader for each replica in the right place too, so we can get low-latency transactions yet still survive regional failures. There's lots of different things that we can do there. But I think the follow-the-user use case is one of them. People use this for compliance as well, right, Tim?

Tim Veil:

They absolutely do. So all of the kinds of things we were indicating, "Hey, I don't want to have bad performance when I kind of go across the country for vacation," certainly the geo-partitioning can support that, but compliance is kind of a whole other thing. I live in Germany and my data must stay in servers local to that country and can't exist in a data center that I have in London or Paris or the US. And so the fact that we can do this comes from the fundamental core building blocks, as you've talked about, kind of at the key-value store.

Tim Veil:

And then also, we haven't really talked about it yet, but kind of how you start a node and the kind of tags that you put on a node. Anyway, these core building blocks allow us, or users of CockroachDB, to adhere to compliance or regulatory standards. GDPR was mentioned in a typed question; there are others. California here in the States is moving in that direction, potentially. And so these kinds of fundamental building blocks, partitioning data at the row level, tying that data to physical hardware, if need be, in certain geographies, can absolutely meet many of the compliance challenges that are out there.

Jim Walker:

That's right. And I find it interesting too, Tim. I think the way that people used to do this, if you want to comply with some sort of compliance law that says, "You need data to live in a particular place." You're doing that in application code typically, right? That was basically how people were doing this, right?

Tim Veil:

Yeah. Or worse. And we're seeing a lot of folks whose eyes have been opened, if you will, to Cockroach. What they were having to do is essentially duplicate these entire stacks, both data and application, in all of these regions that they needed to be present in. And so with Cockroach, what we're finding, for this reason and certainly many others, is you can really simplify your architecture.

Tim Veil:

You can have this kind of geo distributed database, as we've discovered and talked about many times, has all this neat stuff, but the data can remain local if it needs to. So again, not only kind of simplifying application architecture, but really, deployment architectures.

Jim Walker:

And costs associated with that and everything else. Spin up a node, you have access to that data. I could access that data from a node here in the US even if it lived in Germany; that's the beauty of it. It's logically one database. It's not two logical systems. It's one thing, right? So, I don't know, Keith, are you ready with the demo stuff?

Keith McClellan:

Yeah, absolutely. And so we have a couple of live questions about indexing strategies and geo-replication that we're actually going to show in the demo. So hopefully, everybody will understand what we're going about here. So, I've got a nine-node cluster of CockroachDB running with three nodes in each region. And I've got a load generator for our internal ride-sharing app, MovR.

Jim Walker:

Keith, you've spun this up. This is shipped in our binary. So if somebody wanted to actually do this, there's cockroach demo, right? And that's where you got this. It isn't like Keith built this. It's actually part of the binary already, right?

Keith McClellan:

That is 100% true. And the script that I'm using on the backend, to not have to type while we're all talking, is also out and available on the internet. So, what we have going on here: I spun up this nine-node cluster. It is running across three regions in the United States, US East one, US Central one, and US West one. These are all in Google, although I could have put some of this in AWS and some of this in Azure as well. I did not in this particular case. We are currently running a couple hundred queries per second. Latency isn't all that good, but we do have some work to do on that. So I'm going to go ahead and, while we're talking here, I'm going to kill one of the nodes in the database.
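For anyone who wants to reproduce a setup like this, a rough sketch using the MovR workload that ships in the cockroach binary (the connection string is a placeholder):

    # Load the MovR schema and seed data, then drive traffic at the cluster.
    cockroach workload init movr 'postgresql://root@localhost:26257?sslmode=disable'
    cockroach workload run movr 'postgresql://root@localhost:26257?sslmode=disable' --duration=60m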

Keith McClellan:

So this is a really common scenario where you have a hard drive failure, or let's say you're in the cloud and you have to restart one of your instances, or Amazon or Google restarts the node that your instance is running on and you get migrated, those kinds of things. And what happens from an application perspective is we notice that that node is no longer available, and we start routing that traffic to other nodes in the database.

Keith McClellan:

And then when that node comes back online, we will get it caught back up with the rest of the cluster. So as you can see here, one of my nodes in US East went down. It's been down for about a minute. I'm going to go ahead and let it restart. And what'll happen is the remaining nodes will reseed the appropriate data to that node in the cluster. If that node was out of band for too long, we will mark it as dead. All that means is that node will not automatically rejoin the cluster, and an administrator will have to go in and say, "Hey, this node is now in a good state. Let's go ahead and add it back in."

Keith McClellan:

So this is one of those tricky bits of being distributed. How do you keep the database up and running? So while that node was down, and while everything was going on, the database stayed up and running. We continued to have queries running in the database. I seem to be having some performance issues with my-

Jim Walker:

Of course, Keith.

Keith McClellan:

With my personal internet; maybe I should stop streaming my video. So the database continued to run. In this particular case, we have data loading in Central and in the East. And we continued to run queries even while the node was down.

Jim Walker:

Yeah. And you're still getting around the same P99 latency. And we're going to come back to that and tune it, so you can see how we actually tune these things. One of the questions that had come in a little bit earlier, Keith, was how does this work? When we put nodes in a location, sure, we spun them up in a different data center, but underneath the covers, we had to label each node. And can you just show me the nodes and how we labeled them here? I know we did it when we configured them, but you can actually show the nodes, right? Here we go.

Keith McClellan:

Yeah. So a node, when it joins the cluster, announces its location with this locality flag, and that's a hierarchy of where it is, so it can have multiple levels. So in this case, the first level was cloud, which is Google Compute Engine. Then we're using three regions in GCE. So we've got US West one, US Central one, US East one. And then if you go down a level further, you see that we have one node running in three different availability zones. In our case, in the East it's East one B, East one D, and East one C. If we had more nodes than three per location, if I'd spun up a bigger cluster, you would see that you'd have multiple nodes per zone. I didn't feel the need to do that for this particular demonstration.
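The locality hierarchy Keith describes is declared when each node starts. A sketch of the start command for one of the East nodes (paths, addresses, and certificate directory are placeholders):

    cockroach start \
      --locality=cloud=gce,region=us-east1,zone=us-east1-b \
      --store=/mnt/cockroach-data \
      --join=host1:26257,host2:26257,host3:26257 \
      --certs-dir=certs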

Jim Walker:

Yeah. And I just asked that, Keith, because we're going to get into some zone configurations. And really, when you spin up that node is when you're naming it and putting it into that hierarchy, right? And so it's kind of key to think about that as you're spinning up and deploying new hardware in Cockroach, right?

Keith McClellan:

That is absolutely correct. So right now the database is humming along; it's running about five, six hundred queries per second right now. Our P99 latency is not ideal. So our worst-case latency in the cluster right now is about half a second, which is higher than that kind of 100 millisecond rule that we have for real-time interactions with the database. If we go ahead and take a look at what our P90 query time is, we're running in that 60, 70 millisecond timeframe. I really would like that to be better. So the first thing I'm going to do is apply some geo-partitioning rules and build some secondary indexes in the database. So you're going to see the performance degrade slightly as we're building these indexes and shuffling the data a little bit, and then we're going to see our performance increase dramatically. So this is an example of that online schema change that we were talking about, where I'm redistributing the data while the cluster is live. This will take a few minutes to kind of settle in. So-

Jim Walker:

Of course, it's got to move a bunch of data around, right?

Keith McClellan:

Hopefully not too much data, but it is going to move some records to be primarily domiciled in other data centers, largely because we're using a replication factor of three and we have three regions. Each region already has a replica of the data. So what we're doing is more optimizing leaseholder placement and the range partitions, to make it easier for us to make some of those types of changes downstream. So what you'll see if we go ahead and pull up the tables, we look into something like promo codes... and promo codes is not the best. I seem to be having some rather bad Internet connectivity issues today, so I apologize for that.

Jim Walker:

It is slow.

Keith McClellan:

It's just running a little slow.

Jim Walker:

It's all those Cassandra workloads running Netflix around your neighborhood.

Keith McClellan:

I think it has probably more to do with the fact that I was still running my camera at the same time. So I'm going to disable that for the time being.

Jim Walker:

Yeah.

Keith McClellan:

So the base table is here, create table rides. We've got some user IDs. We've got some city information. We made some changes; we added some partitions to the index. Our current ride-share app operates in three cities: New York, Chicago, and Seattle. We have altered the partitions to make sure that the leaseholders are optimized to be in the locations that are most appropriate for each of those three cities. What we'll see over the next minute or so is, our performance is going to increase dramatically. Whereas before, our worst-case queries, our P99 latency, was in that 400 millisecond range, now we're at less than 40 milliseconds, worst-case scenario, which is a 10X performance improvement. If we were to go ahead and take a look at the P90, nine out of every 10 queries in the environment is now running under two milliseconds, except for the one data center that's not quite caught up yet. But even that's not as good as I would like it to be.
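A sketch of the kind of partition zone configuration Keith is describing for the rides table (the partition and region names are assumed, not copied from the demo):

    -- Keep a replica and the leaseholder for New York rows in the East.
    ALTER PARTITION new_york OF TABLE movr.rides
        CONFIGURE ZONE USING
            constraints = '{"+region=us-east1": 1}',
            lease_preferences = '[[+region=us-east1]]';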

Jim Walker:

Where are you at now on the P99? Down around 60?

Keith McClellan:

We're at 35.

Jim Walker:

Pretty good.

Keith McClellan:

Which is pretty good. We can run a couple of queries in that kind of 100 millisecond window. I want to improve that, so I've got one table, promo codes, that's going to be used in all of our geographies. That's reference data. I want it to be locally, consistently available. I don't want to have to reach across regions for the authority to act on that data. So I'm going to create some partitioned covering indexes, which we talked a little bit about earlier, that allow us to respond from the local replica while still guaranteeing consistency. The trade-off is that there's a bit of a write penalty there.

Keith McClellan:

I'm going to go ahead and make a change now, to that promo codes table. And what we're going to see on the back end is our P99 latency is going to improve further, which will allow us to do multiple queries in that 100 millisecond window. Because now for that reference data that is going to be accessed from all of our different locations, all of those locations are going to have local access to that data.

Jim Walker:

So where are we going to get ... the P99's going to come down to, what's your target here?

Keith McClellan:

I think the P99 will hit somewhere in the 10 millisecond range.

Jim Walker:

Pretty awesome.

Keith McClellan:

The P90 is going to be two milliseconds.

Jim Walker:

Yeah.

Keith McClellan:

Right, which means, worst case, you can run 10 queries before a user would notice any lag in a customer-facing app.

Jim Walker:

Yeah.

Keith McClellan:

That's more than enough time to have multiple interactions with the database and have your app do some app processing.

Jim Walker:

Yeah, that's right. And I mean, Keith, this is across a distributed environment. Yeah, it's pseudo what we're doing here, but 32 milliseconds across a distributed environment is pretty, pretty awesome. If you just do a raw MySQL, you're going to be under 10 milliseconds, right? And you're saying we can actually tune this to do that, yet still be distributed, based on all these tuning policies that you can apply, right?

Keith McClellan:

That's correct.

Jim Walker:

Yeah, which is just awesome. I think it's one of those things, it's like once you understand how data works and where data lives, you can do some really cool things.

Keith McClellan:

Yeah, so what we did here, we made this change to the promo codes table where I modified the primary index to be in US Central, just because it's quote-unquote centrally located for the cluster that we have provisioned here. Then I created two secondary indexes, each pinned to one of the other two data centers. That's what allows us to have super fast read performance. Like most things in a distributed system, there are trade-offs.
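A sketch of that pinned-secondary-index pattern on the promo codes table (index names, stored columns, and region constraints are illustrative):

    CREATE INDEX promo_codes_idx_east ON promo_codes (code)
        STORING (description, rules);
    CREATE INDEX promo_codes_idx_west ON promo_codes (code)
        STORING (description, rules);

    -- The primary index's leaseholder stays central; each secondary serves its coast.
    ALTER TABLE promo_codes CONFIGURE ZONE USING
        lease_preferences = '[[+region=us-central1]]';
    ALTER INDEX promo_codes@promo_codes_idx_east CONFIGURE ZONE USING
        lease_preferences = '[[+region=us-east1]]';
    ALTER INDEX promo_codes@promo_codes_idx_west CONFIGURE ZONE USING
        lease_preferences = '[[+region=us-west1]]';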

Jim Walker:

[crosstalk 00:05:05].

Keith McClellan:

But by default, we're going to distribute the data in the most highly available pattern possible, and we're going to take our best guess at where you're going to be interacting with that data. But we provide the controls to allow an application developer or a DBA to override those decisions for application design reasons. So in this case, I'm accepting, because I don't have ... I'm not publishing a new promo code every 10 milliseconds. If it's $10 off my ride, it's probably, that coupon code is probably going to be valid for the next seven days. I'm willing to accept that a write to this table might take twice as long-

Jim Walker:

That's okay.

Keith McClellan:

... as it would to another table, to get super fast reads. Whereas with the other interactions with the database, I need a much better read-write performance balance.

Jim Walker:

Well, and that's right, too. It's like, how often are we adding a new promo code? Depending on what the table is doing, you have different types of access. And that's also not really a ... the read side of the promo code table is absolutely consumer-facing, but the write side is an internal function, right? You have a little bit more leeway with that, right?

Keith McClellan:

You can see my P99 latency is floating somewhere between five and, right now it looks like, about 11 milliseconds. If we go ahead and take a look at our average worst case, that P90 number, we're looking at two milliseconds per query. Our current worst-case query is eight milliseconds.

Jim Walker:

You can do better, Keith.

Keith McClellan:

It's hard to optimize much further than that, right?

Jim Walker:

Exactly.

Keith McClellan:

From there, we would be talking about scaling out to be able to handle more transactions per second, rather than optimizing for individual query performance.

Jim Walker:

This is really cool, Keith. How does this typically work with a company? You know what I mean? You just went through a couple different steps of how you were able to squeeze Cockroach more and more to get to this sub-10 millisecond latency. How does it typically work with companies? Do they just go out, implement, not care about these things, and then start to tune and get deeper and deeper? What's the typical process, or the way that people think about these things?

Keith McClellan:

Yeah, so generally speaking, when I'm working with customers to do design and implementation of CockroachDB, the first thing that I want to talk to them about is designing to survive rather than designing for failure.

Jim Walker:

That's right.

Keith McClellan:

So traditionally, when we're talking about designing a highly available or disaster-resistant cluster, you're talking about, "Well, when failure X happens, what do we do?" Instead, the conversation we're having is, "What do we need to be able to survive?" Most of our customers are looking to be able to survive a full site failure.

Jim Walker:

Right.

Keith McClellan:

A region in AWS or GCE, a full data center failure. They want to survive that without having a service outage. They might be willing to accept a performance penalty in that scenario. But they're certainly not going to accept a total failure of the system. Because we're largely doing things like financial transactions and inventory management, things where correctness is really important.

Jim Walker:

Right.

Keith McClellan:

So you don't want to be in a scenario where you've lost data or you've lost state, because then you're going to have unhappy customers.

Jim Walker:

Yeah, and it's-

Keith McClellan:

Um-

Jim Walker:

Sorry, go on.

Keith McClellan:

That's usually where we start the conversation. So, "What do we need to be able to survive? Do we need to be able to survive the entire US going down? Do we need to be able to survive a single cloud provider fat-fingering the configuration for a critical service that takes down multiple regions at once?" Or maybe I have some other application limitations elsewhere in my stack, and if I lose more than one data center, I'm going to lose everything anyway. So I only really need to design my database to survive a full site failure or something along those lines.

Jim Walker:

Right.

Keith McClellan:

So-

Jim Walker:

And geo-partitioning comes into that conversation, correct? You're basically doing what you can to help with that.

Keith McClellan:

Yeah.

Jim Walker:

But Cockroach is dealing with that.

Keith McClellan:

I don't even necessarily talk about it as geo-partitioning. I just talk about it as partitioning. Right?

Jim Walker:

That's right.

Keith McClellan:

Geo-partitioning is, the net effect is that we're going to distribute the authority to act and the data itself across multiple locations. But partitioning can be used for other types of things, like the compliance use case we were talking about earlier.

Jim Walker:

Yep.

Keith McClellan:

So I usually go from, "What do I need to survive?" To, "What does my local performance need to look like?" Then I would generally go to, "What are my compliance and regulatory types of constraints?"

Jim Walker:

Right.

Keith McClellan:

From there, we're going to get a candidate architecture for a CockroachDB deployment.

Jim Walker:

Right.

Keith McClellan:

Tim, did I miss anything? Were there any other items that you usually highlight with customers?

Tim Veil:

No, I think that's great. One of the things I was going to point out is, and Jim, keep me honest here on time, so we've got a lot of great questions.

Keith McClellan:

Yeah.

Tim Veil:

I've been typing like a madman over here, trying to get to some of them, but some of them, I just can't get to. I want to save a little bit of time where we can maybe just go down the list and answer some of these live.

Jim Walker:

Yeah, and I think that's where I was going with this now, Tim. I think there's some really good things, and Keith, as you were presenting, there were a couple of good things in there. But Tim, do you want to just start going through some of this stuff?

Tim Veil:

Yeah. Let me go scroll up here. I thought there was an interesting one; it's one of our favorite topics, Keith. From Rigoberto Rojas, essentially: is there an easy way to add a third data center that participates in quorum, if adding capacity in a third data center is not affordable? It's really kind of hitting at this two data center, three data center thing. I think this actually also hits at something that [Shayon 00:11:44], I believe I'm pronouncing these right, asked: "Why do we need three, and why is that important, and what are some of the trade-offs and reasons why you would or would not have more or less?" I know that was a big nasty question, but-

Keith McClellan:

Yeah, do we have another hour? I thought this was the Cockroach Hour, not the Cockroach Two Hours, but-

Tim Veil:

And this is, yeah, you've got like-

Jim Walker:

Well, exactly.

Tim Veil:

You've got [inaudible 00:42:11] seconds.

Keith McClellan:

The way I usually like to think about this, there's three components. Because we're distributing the authority to act across the cluster, it's important to remember that adding a third site does not necessarily mean adding extra cost, right? If I'm running a third of my nodes across three regions in AWS, or half of my nodes across two regions in AWS, my costs are roughly going to be the same. Actually, because we're designing to survive rather than designing for failure, rather than having to have twice the amount of infrastructure to be able to survive a failure and meet my performance SLAs, I only need a fraction more additional infrastructure to meet those SLAs. So in a lot of cases, being distributed across three sites can be less expensive for customers than running across two sites.

Keith McClellan:

Beyond all of that, from an architectural perspective, we do synchronous replication across sites. For us to be able to make automated decisions about correctness, we have to have a quorum for any given range of data, which is what we call a shard. So that's, right now, I think it's a 256 meg chunk of the ... of any given table or database. We need to have at least 51% of the replicas responding. The actual number of replicas is configurable. By default, it's going to be three. We recommend five or seven in some use cases, for sure. It all comes down to what you need to be able to survive.

Keith McClellan:

Do I need to be able to survive a full site failure and an arbitrary node failure? Well, if you do, then you're definitely going to need at least a replication factor of five. It's those types of things that we need to talk about. But the net effect is, you can actually save money by being in three sites, at least if you're in the cloud. I understand for customers that are talking about being installed in a legacy on-prem data center, that there are hard costs for maintaining a third site. But assuming you can use the cloud, whether it's for some of your sites or all of your sites, you can dramatically save money by doing this.
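For reference, the replication factor Keith mentions is a zone configuration. A sketch of raising it from the default of three to five:

    -- Cluster-wide default:
    ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;

    -- Or for a single database:
    ALTER DATABASE movr CONFIGURE ZONE USING num_replicas = 5;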

Tim Veil:

Yeah, and I think one of the follow-up subjects, I'll just note this from Rigoberto, was around Galera. I know one of our dear friends and associates is an expert in Galera, so happy to follow up; if he'd like to reach out, we can have a more detailed conversation about some differences there.

Tim Veil:

A question came up from one of our more active chat participants, Bailey, who was interested in Rust and a whole bunch of other things. But I think Bailey was also asking about Linode, and I know you've had some conversations there. Any thoughts or updates on where we are with some of the Linode integrations? I know it had been part of the conversation. Any insight you can provide there?

Keith McClellan:

Yeah. So we did some initial work with the Linode team. They were very supportive of getting CockroachDB running in their environment. Things generally went pretty well. We're not in their marketplace catalog, but assuming that you have sufficient resources available in their environment, then CockroachDB will run just fine. The one thing that I will mention, and this is true wherever you run it: we're a database, so we're going to be dependent on certain performance characteristics of your environment. So, if I'm running 16 vCPUs and 32 to 64 gigs of memory, I need a couple thousand IOPS on my disk to be able to actually use the CPU and memory the way you expect to use it. For test and development, something like a Linode or similar can be a great environment. For production, I'd have to go back and look; we were able to get a performance cluster running with some of their larger instance types. So, it's certainly feasible. [crosstalk 00:46:40]

Tim Veil:

I'll just cut you off there because we've got a couple more. So, speaking of where we run, Illya has asked, "Is it okay to use Cockroach inside of K8s?" What do we say to that?

Keith McClellan:

Yes. As a matter of fact, I spend most of my time working with Kubernetes distributions in my full-time job as the partner solutions engineer. We are architected from the ground up to run properly and run performantly in a cloud native environment. That means that we're, frankly, pretty easy to run in Kubernetes. We've gotten away with not even having a Kubernetes operator for three years. And I think 40% of our customers that manage CockroachDB themselves, and 100% of our customers that are in CockroachCloud, are running on Kubernetes.

Keith McClellan:

We are in the process of building an operator just to make it easier for people to manage the day two operations. These are things like, "Oh, I need to manage backups," "I want my key management to be automatically synchronized between Kubernetes and CockroachDB," "I want my encryption at rest to be turned on properly from the get-go, rather than me having to go figure out how to do it," those types of things. So we are in the process of shipping an operator, and we have a beta operator available out in a GitHub repo right now. [crosstalk 00:48:14] But the fact is, we do it very well.

Tim Veil:

Well, and the other thing to mention here, I don't think we've talked a lot about CockroachCloud, but as most folks know, we offer a fully managed and hosted solution called CockroachCloud. What does CockroachCloud run on underneath the hood?

Keith McClellan:

It runs on Kubernetes.

Jim Walker:

Kubernetes, man.

Tim Veil:

Kubernetes.

Keith McClellan:

Yes.

Tim Veil:

[crosstalk 00:48:34] You guys as a company have a tremendous amount of faith in that. [crosstalk 00:03:38].

Jim Walker:

The very first time I ever saw CockroachDB was at OpenStack Summit. It was about four years ago. And they had the president of OpenStack Summit, and then you had Alex Polvi from CoreOS; the CoreOS team drove a whole lot of the early Kubernetes adoption. And they needed a distributed app to run on top of Kubernetes to show how it would survive. They used CockroachDB, because it was purpose-built for this. It isn't like taking a legacy database and trying to run it in this new world. It was built for this. And there's a great video about that, and it just basically shows the same thing we showed today: kill a node and there it comes back. So, the pod comes back, the database comes back. It's great, right?

Tim Veil:

Yep.

Jim Walker:

So, it's built for that world.

Tim Veil:

Another question I think we should touch on: there's a question from [Rui 00:58:49] about, "What if, essentially, data access becomes much more frequent in a certain region? Do the partitioning schemes get skewed? Does it influence latency?" Maybe, Keith, you could touch on, or we could touch on, two things that we provide. One is follow-the-workload, as a mechanism for potentially avoiding range hotspotting, as well as follower reads. I thought those would be two things we should quickly touch on, because I think they help answer his question here.

Keith McClellan:

Yeah. So, there's a third topic I want to add in there, and that's dynamic range splitting. Because each range is a sorted segment of the total key space of a table, sometimes, particularly if you aren't careful when you're picking your partitioning keys, you could get some natural hotspots in the data. So one of the things that CockroachDB can do in the background, other than just being able to live-change the partition structure, which is something we did during the demo, is the database itself will notice that a large number of transactions against a given table are going to a relatively few number of the ranges, and it will split those ranges for you and redistribute them across the cluster as best as it can, to avoid that hotspotting.

Keith McClellan:

One of the other things that we're going to do is, we're going to move the authority to act around, as appropriate. So each range has a leaseholder that's the Raft leader, effectively. The leaseholder has a couple of special things on top of being the Raft leader. So not only are they always involved in every write, because they're the coordinator for writes, but because they're the coordinator for writes, they know that they have the latest version of the data locally, so they can respond to reads without doing a quorum check.

Keith McClellan:

So what we will do in a scenario where we've got hotspotting is, we will try to remove as many of the other bottlenecks as we can, so that we can distribute that load as evenly across the cluster as possible, given whatever rules were set. Because we're serializably isolated, which means that we guarantee that any two mutations against the same key will operate as if they came in one at a time, that's effectively what serializable means, we do have scenarios where long-running queries potentially could block other transactions and create some performance issues.

Keith McClellan:

So we have something called a follower read, which allows you to get a snapshot of the data based on a particular time, rather than the latest version of the record. So this is useful for OLAP-type queries. It's very similar to the way a number of OLTP systems, including Oracle, run by default, because it gives you a snapshot as of a particular point in time, rather than automatically, effectively locking a record while it's being changed. And it also allows you to query from the local replica rather than the leaseholder. So, that's the other component. So now, instead of having one replica that's responding to reads for any given range of the data, now it's three or five or seven, depending on what my replication factor is.
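A sketch of a follower read as it looked in this era of CockroachDB (an enterprise feature; the function was later renamed to follower_read_timestamp()):

    -- Serve a slightly stale snapshot from the nearest replica
    -- instead of routing the read to the leaseholder.
    SELECT * FROM promo_codes
        AS OF SYSTEM TIME experimental_follower_read_timestamp();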

Jim Walker:

Hey. Tim, I don't know if you have like one more. We're at three minutes or two minutes left, so if there's something quick.

Tim Veil:

Yeah. Christian just asked, "Hey, do you have a strategy to keep on-prem and cloud clusters in sync?" And I think, just to answer this really quickly, I mean, one of the nice hidden features here, of course, is that you can set up a cluster that spans these environments simultaneously.

Jim Walker:

Multi-clouding.

Tim Veil:

So you could have a single Cockroach cluster that has nodes participating from your on-prem data center, plus maybe two things that you've started up in the cloud. And of course, as Keith mentioned, when a transaction happens in Cockroach, those nodes, or a majority or quorum of nodes in a range, are updated within the scope of a transaction, synchronously. So there's not this kind of "replicated to the cloud later" situation.

Tim Veil:

By the way, this is a really interesting strategy for cloud migration. So folks have spun up nodes on-prem, spun up nodes in the cloud, let the data synchronize, in effect, and then begun shutting down nodes on-prem. So a lot of really good hybrid architectures are enabled with Cockroach.
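A sketch of the wind-down step in that migration pattern (the node ID and flags are placeholders); decommissioning moves a node's replicas to the rest of the cluster before the node is shut down:

    cockroach node decommission 4 --certs-dir=certs --host=host4:26257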

Jim Walker:

And Tim, I mean, if you wanted to keep certain data on-prem and only have certain data in cloud, you could use geo-partitioning to do that, right?

Tim Veil:

Oh, yeah.

Jim Walker:

I think it's one of those cool things about Cockroach. I mean, the feature goes really far. Keith, you and I have had conversations about, "Well, could this help with confidentiality?" So, like certain data lives on certain servers, maybe even in the same data center. You know what I mean?

Keith McClellan:

Yeah.

Jim Walker:

So, there's a whole bunch of things that I think when you start to think about geo-partitioning, thinking about your tables and your rows and different ways, and how data actually physically resides, you don't have to do in the application layer anymore.

Keith McClellan:

Yeah. I will add that if you do have two separate clusters, or let's say you have another system... Maybe it's not two CockroachDB environments; maybe you're taking data from CockroachDB and feeding it into a data warehouse. Right?

Jim Walker:

That's right.

Keith McClellan:

We do provide change data capture feeds. We publish to a change data capture feed; I think right now we support flat-file sinks and Kafka-based sinks for that. So we can publish data to a message broker or a feed that can then be consumed by another database for [crosstalk 00:55:32] as well.
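A sketch of such a changefeed (CDC is an enterprise feature; the Kafka URI is a placeholder):

    CREATE CHANGEFEED FOR TABLE rides
        INTO 'kafka://broker.example.com:9092'
        WITH updated, resolved;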

Jim Walker:

Yeah. So, lots of information on that, actually. We ran a whole Cockroach Hour on CDC. But I would be remiss, Tim, if I had not mentioned our docs team and the [crosstalk 00:55:46].

Tim Veil:

We almost did a whole-

Jim Walker:

I've almost done a whole session, dude.

Tim Veil:

What?

Jim Walker:

Right? All of this stuff that we talked about, the CDC stuff that Keith was just mentioning, if you all want to get more information, our docs, and what Jesse and team do, it's phenomenal. It's almost laughable how awesome it is. And so, if you really want to get into how primary keys work or you want to talk about geo... There's some really, really great stuff. And you could download the binary of CockroachDB Core and actually run workloads as... The one that Keith had run is cockroach demo. It's a real simple thing. Look it up in our docs. You could actually do all that. Or you could spin up a cluster right now in CockroachCloud, and we'll get you a code so you could actually run a free cluster for a month. So just reach out to us and we'll make sure you get up and running if you want to actually try that as well.

Jim Walker:

But lots of different ways you can go out and actually play with things. You can absolutely call. You can try this for free for a month with CockroachCloud. Actually, that's the easiest way to start playing with some of these things, and start learning these things yourself. Our Slack channel is also very popular, well-trafficked. Lots of people out there answering questions and talking through some of these things as well, and our team is always happy to help out. But I always point people in our docs first.

Jim Walker:

So guys, we are a minute past. Thank you for doing this today. I know you guys are really, really busy. Tim, thank you for helping out with everything. And it's great insight. And Keith, man, I love that demo, buddy. So thank you guys. Anything else?

Tim Veil:

I mean, I could go on and on, but we'll wrap it up. We've got questions coming in fast and furious, but we can't get to them all.

Jim Walker:

Yeah.

Tim Veil:

Hey, I'm on LinkedIn. You can always send me a message. I'm happy to answer questions. I don't disappear completely after this is over. [crosstalk 00:57:36]

Jim Walker:

And like I said, [crosstalk 00:57:38] community, join our public Slack channel. Yeah, we're happy to engage, and the more we can help people understand these things, the better. I'm a believer in it, and obviously Keith and Tim here are believers in some of the things that we're doing here. And it's all thanks to some incredible software engineering that's been going on with this development team over the last five years to actually make this work.

Tim Veil:

Hey, can we talk about that for a second?

Jim Walker:

Yeah.

Tim Veil:

Join the team. [crosstalk 00:58:06] Our engineering team is out there looking for folks left, right, and center. So if you're on this call and interested in a new challenge or different challenge, I know we'd love to hear from you.

Jim Walker:

Yeah. Yeah, absolutely. So thank you, Tim. That's the best going away we've had.

Tim Veil:

That's it.

Jim Walker:

All right, everybody. Thank you. It's Wednesday. It's almost October of 2020. I think we went like January, February, March, April, October, November, literally.

Keith McClellan:

It's [March-tember eleventy-seven 00:13:41].

Jim Walker:

All right, you all. Thank you. We'll put a little timestamp on this one. All right, guys. Thank you very much, you all.

Tim Veil:

See you all.