Kubernetes isn’t just a platform—it’s a revolution. On this episode of Fork Around and Find Out, Justin and Autumn sit down with Kubernetes co-creator Brian Grant to explore the origins of this game-changing technology. From Google’s internal tooling to the cloud-native juggernaut it is today, Brian takes us behind the scenes of Kubernetes’ evolution, including its roots in Borg and the creation of CNCF.
Brian also opens up about his fascinating career, from debugging GPUs at PeakStream to improving Google’s threading systems. Along the way, he shares his candid thoughts on Terraform, GitOps, and the future of infrastructure management. We’re talking insider stories, tech critiques, and the cyclical nature of trends like AI—all packed into one unmissable episode.
Brian is a visionary who’s shaped the cloud-native ecosystem as we know it. We can’t wait for you to hear his story and insights!
Show Highlights
(0:00) Intro
(0:31) Tremolo Security sponsor read
(2:42) Brian’s background
(6:20) What it’s like working on something great but it not being the right time
(9:17) How Brian’s work from the 2000s is still important today
(11:16) Why Brian said ‘yes’ to Google after previously turning them down
(12:59) The history of the FDIV bug
(16:49) What Brian was doing when his old company was bought by Google
(20:51) How Brian’s education helped him get started down this path
(23:47) Brian’s jump from Borg to Kubernetes
(32:27) The effect Kubernetes has had on the landscape of infrastructure and applications
(35:48) Tremolo Security sponsor read
(36:47) Times Brian has been frustrated at how people use Kubernetes
(41:05) The patterns Brian notices thanks to his years in the tech industry
(48:04) What Brian expects to see next as manual actions to make providers work to make a comeback
(54:58) Reflecting on Brian’s serendipitous journey through the tech world
(1:02:18) Where you can find more from Brian
About Brian Grant
Brian Grant is the CTO and co-founder of ConfigHub, pioneering a new approach to provisioning, deploying, and operating cloud applications and infrastructure. As the original lead architect of Kubernetes, Brian created its declarative configuration model (KRM) and tools like kubectl apply and kustomize. With over 30 years in high-performance and distributed computing, he’s held pivotal roles, including tech lead for Google’s Borg platform and founder of the Omega R&D project. A Kubernetes Steering Committee and CNCF Technical Oversight Committee member, Brian also boasts 15+ patents and a Ph.D. in Computer Science, shaping the future of cloud and computing innovation.
Links Referenced
Sponsor
Tremolo: http://fafo.fm/tremolo
Sponsor the FAFO Podcast!
1
00:00:00,140 --> 00:00:03,530
From my perspective, one of the things that it did is it created
2
00:00:03,680 --> 00:00:06,979
an infrastructure ecosystem that was broader than any single cloud.
3
00:00:12,940 --> 00:00:16,340
Welcome to Fork Around and Find Out, the podcast about
4
00:00:16,340 --> 00:00:19,450
building, running, and maintaining software and systems.
5
00:00:31,490 --> 00:00:34,949
Managing role-based access control for Kubernetes isn’t the
6
00:00:34,949 --> 00:00:38,000
easiest thing in the world, especially as you have more clusters,
7
00:00:38,030 --> 00:00:41,989
and more users, and more services that want to use Kubernetes.
8
00:00:42,349 --> 00:00:45,200
OpenUnison helps solve those problems by bringing
9
00:00:45,200 --> 00:00:47,580
single sign-on to your Kubernetes clusters.
10
00:00:47,730 --> 00:00:53,000
This extends Active Directory, Okta, Azure AD and other sources as
11
00:00:53,000 --> 00:00:56,699
your centralized user management for your Kubernetes access control.
12
00:00:56,980 --> 00:01:00,100
You can forget managing all those YAML files to give someone access
13
00:01:00,100 --> 00:01:03,970
to the cluster, and centrally manage all of their access in one place.
14
00:01:04,170 --> 00:01:06,509
This extends to services inside the cluster
15
00:01:06,530 --> 00:01:09,179
like Grafana, Argo CD and Argo Workflows.
16
00:01:09,379 --> 00:01:12,519
OpenUnison is a great open-source project, but relying
17
00:01:12,520 --> 00:01:15,340
on open-source without any support for something as
18
00:01:15,340 --> 00:01:18,830
critical as access management may not be the best option.
19
00:01:18,990 --> 00:01:20,929
Tremolo Security offers support for OpenUnison
20
00:01:21,820 --> 00:01:24,600
and other features around identity and security.
21
00:01:24,700 --> 00:01:27,909
Tremolo provides open-source and commercial support for OpenUnison
22
00:01:27,959 --> 00:01:31,580
in all of your Kubernetes clusters, whether in the cloud or on-prem.
23
00:01:31,810 --> 00:01:35,780
So, check out Tremolo Security for your single sign-on needs in Kubernetes.
24
00:01:36,070 --> 00:01:37,670
You can find them at fafo.fm/tremolo.
25
00:01:40,960 --> 00:01:43,819
That’s T-R-E-M-O-L-O.
26
00:01:50,039 --> 00:01:52,220
Welcome to Fork Around and Find Out.
27
00:01:52,570 --> 00:01:54,880
On this episode today, we are reaching a quorum.
28
00:01:54,890 --> 00:01:55,759
We have three of us.
29
00:01:55,760 --> 00:01:56,580
We have Brian Grant.
30
00:01:56,580 --> 00:01:58,140
Thank you so much for coming on the show.
31
00:01:58,679 --> 00:01:58,949
Hi.
32
00:01:58,950 --> 00:01:59,719
Thanks for inviting me.
33
00:02:00,219 --> 00:02:02,029
It’s so weird that you didn’t say, ‘ship
34
00:02:02,030 --> 00:02:04,029
it.’ Like, it’s just, like, we have made—
35
00:02:04,440 --> 00:02:05,279
We have forked.
36
00:02:05,520 --> 00:02:06,670
We forked the—
37
00:02:08,080 --> 00:02:08,170
Oh—
38
00:02:10,520 --> 00:02:10,620
—podcast.
39
00:02:10,620 --> 00:02:10,721
—we did fork.
40
00:02:10,721 --> 00:02:10,822
Oh, my God.
41
00:02:10,822 --> 00:02:12,280
[laugh]. This is brand new open-source.
42
00:02:12,280 --> 00:02:13,210
It’s a brand new fork.
43
00:02:13,210 --> 00:02:15,720
And again, we’re reaching, I don’t know, the [unintelligible] consensus here,
44
00:02:15,720 --> 00:02:19,780
Autumn and I have elected Brian as the leader of this episode [laugh]. And so—
45
00:02:20,270 --> 00:02:21,889
Poor Brian’s like, “I just got here.
46
00:02:21,900 --> 00:02:25,570
How did they, like, put me onto this new responsibility?” Like—
47
00:02:25,570 --> 00:02:28,500
Look, Brian is responsible for that joke, right?
48
00:02:28,500 --> 00:02:29,465
I’m putting all the—
49
00:02:29,750 --> 00:02:33,419
Oh, you should have seen the dad jokes that happened before you got here, okay?
50
00:02:33,870 --> 00:02:36,810
We were trying to figure out the YouTube Short thing, and I was
51
00:02:36,810 --> 00:02:39,560
like, “We don’t even have any shorts.” And Justin’s like, “It’s cold.
52
00:02:39,560 --> 00:02:41,920
I have pants on.” And I was like, “Oh, my God.”
53
00:02:42,209 --> 00:02:46,250
For people not familiar, Brian is one of the three founders,
54
00:02:46,260 --> 00:02:48,970
or people that started Kubernetes within Google, and—
55
00:02:49,510 --> 00:02:52,330
Well, actually we had roughly five.
56
00:02:52,349 --> 00:02:56,680
So, there was Joe Beda and Brendan Burns on the cloud side,
57
00:02:56,750 --> 00:03:01,369
and Tim Hockin and I on the internal infrastructure side.
58
00:03:01,660 --> 00:03:03,190
Does that mean that we, like—that people
59
00:03:03,190 --> 00:03:05,130
blame you when they scream at Kubernetes?
60
00:03:05,410 --> 00:03:09,260
Oh, I am to blame for a lot of things in Kubernetes.
61
00:03:10,429 --> 00:03:11,300
We could talk about that.
62
00:03:13,219 --> 00:03:16,060
Probably more things than any other single person, maybe.
63
00:03:16,209 --> 00:03:18,239
Maybe Tim now because I’ve been away for a while.
64
00:03:18,730 --> 00:03:20,870
Yeah, were you one of the first people that wrote it in Java?
65
00:03:20,990 --> 00:03:21,630
Was that the first—
66
00:03:21,660 --> 00:03:23,600
No, that was the prototype.
67
00:03:23,610 --> 00:03:24,170
Brendan—
68
00:03:24,530 --> 00:03:25,720
Brendan wrote the first one in Java?
69
00:03:25,720 --> 00:03:26,849
That sounded scary.
70
00:03:27,170 --> 00:03:31,390
[Beda] was very early, and worked on the initial implementation.
71
00:03:31,410 --> 00:03:33,720
Also Ville Aikas, who’s at Chainguard now.
72
00:03:33,920 --> 00:03:38,410
But he didn’t really carry over once we started building it for real.
73
00:03:38,620 --> 00:03:39,380
He did something else.
74
00:03:39,670 --> 00:03:42,870
And then there was Craig McLuckie on our product side.
75
00:03:42,910 --> 00:03:45,260
But when we started, we weren’t in our location.
76
00:03:45,270 --> 00:03:47,720
We didn’t have a manager, you know?
77
00:03:47,720 --> 00:03:48,760
We just started working on it.
78
00:03:48,920 --> 00:03:51,519
And you were coming from internal infrastructure.
79
00:03:51,520 --> 00:03:53,830
You were doing Borg, and Omega, and everything
80
00:03:53,830 --> 00:03:56,098
else inside of Google and kind of shifted—
81
00:03:56,098 --> 00:03:56,830
What’s Borg and Omega?
82
00:03:56,830 --> 00:03:57,230
You got to tell—
83
00:03:57,230 --> 00:03:58,415
That’s what—I was just going to go into that.
84
00:03:58,415 --> 00:03:58,950
That was great.
85
00:03:58,950 --> 00:03:59,310
Okay good.
86
00:03:59,780 --> 00:04:03,590
Because I feel like sometimes when, like, you work in a certain type of
87
00:04:03,730 --> 00:04:07,680
software, we forget that everybody doesn’t know, like, necessarily what that is.
88
00:04:07,690 --> 00:04:09,279
So, tell us all the things.
89
00:04:10,020 --> 00:04:10,200
Yeah.
90
00:04:10,200 --> 00:04:13,580
So Google, like many large companies and even a lot of small
91
00:04:13,580 --> 00:04:16,770
companies, built all of its own infrastructure tooling.
92
00:04:17,240 --> 00:04:20,839
So, it had an internal container platform called Borg, which was actually the
93
00:04:20,839 --> 00:04:24,580
reason… the motivation for adding cgroups to the kernel, to the Linux kernel.
94
00:04:25,780 --> 00:04:30,270
So, that was just added and rolling out at the time I joined Google in 2007.
95
00:04:31,220 --> 00:04:34,140
So, that was just like brand new in Borg.
96
00:04:34,140 --> 00:04:36,940
And the reason Google created Borg was they had two previous
97
00:04:36,940 --> 00:04:40,669
systems, Babysitter—which Borg ran on when I started on
98
00:04:40,670 --> 00:04:45,559
the Borg project—and Work Queue, which ran MapReduce jobs.
99
00:04:45,929 --> 00:04:50,610
So, they had one, kind of, batch queuing system and one service running system.
100
00:04:51,170 --> 00:04:54,270
And what they found was that they didn’t have
101
00:04:54,270 --> 00:04:57,040
enough resources for all the batch workloads.
102
00:04:57,360 --> 00:05:00,130
There were a lot of idle resources in the serving workloads,
103
00:05:00,460 --> 00:05:04,219
especially during certain parts of day, in certain regions.
104
00:05:04,619 --> 00:05:06,580
So, they wanted to make a system that could run both
105
00:05:06,600 --> 00:05:09,960
types of workloads to achieve better resource utilization.
106
00:05:10,290 --> 00:05:14,630
Resources were scarce for years and years and years,
107
00:05:14,700 --> 00:05:17,140
you know, even though they had a vast fleet of machines.
108
00:05:17,320 --> 00:05:19,250
It’s funny to think you’re like, “Oh yeah, no, we have hundreds
109
00:05:19,250 --> 00:05:21,500
of thousands of nodes, and we have not enough resources.” [laugh].
110
00:05:21,549 --> 00:05:21,999
Yeah.
111
00:05:22,510 --> 00:05:26,640
Yeah because the main services were run all the time, they were
112
00:05:26,720 --> 00:05:30,950
adding new services, they were moving services that were previously
113
00:05:30,950 --> 00:05:34,479
not on Borg onto Borg, they were moving acquisitions onto Borg.
114
00:05:34,479 --> 00:05:36,220
There just weren’t enough resources.
115
00:05:36,490 --> 00:05:39,869
Borg project kicked off around the beginning of 2004, so before
116
00:05:39,870 --> 00:05:43,030
I joined, around the time that Tim joined Google, I think.
117
00:05:43,520 --> 00:05:44,020
And—
118
00:05:44,590 --> 00:05:45,680
Yeah, Tim just hit 20 years.
119
00:05:45,680 --> 00:05:46,299
That’s amazing.
120
00:05:46,410 --> 00:05:46,790
Yeah.
121
00:05:47,390 --> 00:05:47,910
Yeah, yeah.
122
00:05:47,940 --> 00:05:49,609
So, I was only there 17 years.
123
00:05:50,559 --> 00:05:51,070
[laugh]. Slacker.
124
00:05:51,370 --> 00:05:51,815
Come on, Brian.
125
00:05:51,815 --> 00:05:55,179
You said, “Only,” like, it was not that long [laugh].
126
00:05:55,179 --> 00:05:57,739
But you joined directly to the Borg team?
127
00:05:58,119 --> 00:05:58,479
No.
128
00:05:59,359 --> 00:06:04,740
Actually that was one of the teams that—so I came into an acquisition before.
129
00:06:05,110 --> 00:06:08,860
I did something that was not of interest at the time, which,
130
00:06:08,860 --> 00:06:10,960
if you read my LinkedIn page, you may know what that is, but
131
00:06:11,000 --> 00:06:14,130
I did high performance computing on GPUs way, way too early…
132
00:06:15,240 --> 00:06:15,270
[laugh].
133
00:06:15,330 --> 00:06:16,600
Like 2005.
134
00:06:17,820 --> 00:06:18,610
Nobody cared.
135
00:06:18,610 --> 00:06:19,390
That was the problem [laugh].
136
00:06:20,220 --> 00:06:22,710
What’s it like working on something and knowing that
137
00:06:22,710 --> 00:06:25,330
it’s going to be great, but it not being the right time?
138
00:06:25,330 --> 00:06:26,790
Like, is that so frustrating?
139
00:06:26,800 --> 00:06:27,420
Because, like—
140
00:06:27,950 --> 00:06:29,469
Yeah, I’ve done that a few times.
141
00:06:29,849 --> 00:06:31,459
You said a few times [laugh]. Not once,
142
00:06:31,459 --> 00:06:33,300
but he’s like, I’ve been in that struggle.
143
00:06:34,000 --> 00:06:39,070
The startup was PeakStream, and the challenge was that there
144
00:06:39,070 --> 00:06:42,260
weren’t people who needed extremely high performance computing
145
00:06:42,290 --> 00:06:44,345
in something that was not, like, a Cray supercomputer.
146
00:06:44,360 --> 00:06:46,580
Because I did supercomputing in the ’90s and worked for
147
00:06:46,580 --> 00:06:49,219
a national lab and things like that as well, and you
148
00:06:49,219 --> 00:06:52,570
know, they had their own kind of big, metal machines.
149
00:06:52,609 --> 00:06:54,389
But there were people who needed high performance
150
00:06:54,389 --> 00:06:57,840
computing, but not of that scale or cost.
151
00:06:57,920 --> 00:07:02,099
The people who did were willing and able—and able is
152
00:07:02,099 --> 00:07:05,720
an important part—to actually hire experts to squeeze
153
00:07:05,730 --> 00:07:08,489
every last cycle of whatever chip they were using.
154
00:07:08,849 --> 00:07:11,080
There are also a bunch of other challenges, like—I
155
00:07:11,080 --> 00:07:14,909
mean, at the time we—it was before Nvidia launched CUDA.
156
00:07:15,249 --> 00:07:18,120
So, Nvidia was working on CUDA.
157
00:07:18,220 --> 00:07:20,590
Our founder was from Nvidia.
158
00:07:21,000 --> 00:07:24,429
They called it GPGPU back in those days, General Purpose Computing on GPUs.
159
00:07:24,429 --> 00:07:27,359
So, that was starting to attract interest, somebody wrote a book.
160
00:07:27,900 --> 00:07:30,770
PeakStream was one of the companies, kind of,
161
00:07:30,789 --> 00:07:32,919
starting in that area, one of the earliest.
162
00:07:33,410 --> 00:07:37,530
And back in those days, the chips, I mean, they weren’t designed for it.
163
00:07:37,530 --> 00:07:39,829
They were designed for graphics, right, so they
164
00:07:39,830 --> 00:07:42,369
didn’t really have a normal computing model.
165
00:07:42,370 --> 00:07:45,050
They didn’t do IEEE floating point; they did
166
00:07:45,080 --> 00:07:47,749
something that was sort of floating point-ish.
167
00:07:48,250 --> 00:07:50,270
And they didn’t do integer computation
168
00:07:50,360 --> 00:07:53,420
either because the shaders didn’t need it.
169
00:07:53,790 --> 00:07:56,710
So, they didn’t do 32-bit integer computation.
170
00:07:56,710 --> 00:07:57,670
They did some simple computations.
171
00:07:57,690 --> 00:08:03,090
Like, indexing into memory, normally, the way that works in a CPU is the
172
00:08:03,099 --> 00:08:07,439
memory unit has an adder that takes an integer memory address of whatever
173
00:08:07,440 --> 00:08:11,210
the word size is on the machine, like, 64 bits these days, on those chips,
174
00:08:11,430 --> 00:08:15,989
and it does an add of an index, an add and a shift to scale the index.
175
00:08:15,990 --> 00:08:19,940
So, if you’re loading something that’s four bytes or eight bytes or
176
00:08:19,960 --> 00:08:23,210
one byte, you know, it does the shift appropriately and adds the index.
177
00:08:23,240 --> 00:08:24,659
So, that’s an integer computation.
178
00:08:24,670 --> 00:08:27,040
You get the memory location that you want.
179
00:08:27,160 --> 00:08:28,220
These chips didn’t do that.
180
00:08:28,270 --> 00:08:31,020
They actually indexed into memory using floating point.
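The address calculation described here can be sketched in a few lines. This is only an illustration: the function names `effective_address` and `to_float32`, the base address, and the element sizes are assumptions made for the example, not anything taken from PeakStream or from a specific chip. The second half assumes 32-bit floats, to hint at why using floating point as a memory index is awkward.

```python
import struct


def effective_address(base: int, index: int, element_size: int) -> int:
    """Integer address computation: base + (index << shift)."""
    # Element sizes of 1, 2, 4 or 8 bytes correspond to shifts of 0, 1, 2 or 3.
    if element_size not in (1, 2, 4, 8):
        raise ValueError("element_size must be 1, 2, 4 or 8")
    shift = element_size.bit_length() - 1
    return base + (index << shift)


def to_float32(value: float) -> float:
    """Round a Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


base = 0x1000  # hypothetical base address of an array

# Element 3 of a 4-byte array, then element 3 of an 8-byte array.
print(hex(effective_address(base, 3, 4)))  # 0x100c
print(hex(effective_address(base, 3, 8)))  # 0x1018

# A 32-bit float has a 24-bit significand, so not every large integer
# index survives the conversion (assumption: the index is a 32-bit float).
print(to_float32(16777216.0) == 16777216.0)  # True
print(to_float32(16777217.0) == 16777217.0)  # False
```

With an integer adder and shifter every index maps to a distinct address; with a 32-bit float as the index, whole numbers above 2^24 start to collide.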
181
00:08:31,020 --> 00:08:33,870
And I make—can I just say, like, already, off the bat, we’re less than
182
00:08:33,870 --> 00:08:36,860
ten minutes into this episode, and we’ve already explained [laugh],
183
00:08:37,059 --> 00:08:41,789
like, deep chip, like, addition, for how these things are working.
184
00:08:41,799 --> 00:08:44,179
And also, like, we got our first, like, ‘um, actually,’ which
185
00:08:44,179 --> 00:08:46,480
I feel like I need, like, a sound bite for when you were
186
00:08:46,480 --> 00:08:50,060
correcting me on the founding of Kubernetes, which was amazing.
187
00:08:50,190 --> 00:08:51,060
I love this already.
188
00:08:51,060 --> 00:08:52,110
This is going to be such a good show.
189
00:08:52,120 --> 00:08:54,490
Well, also that, but like, for context, when you’re
190
00:08:54,490 --> 00:08:57,819
talking about this chip work, how long ago was that, Brian?
191
00:08:58,449 --> 00:08:58,759
This was 2005.
192
00:09:00,139 --> 00:09:02,729
And think about how relevant this is today, and how
193
00:09:02,730 --> 00:09:05,280
much people want to get all they can out of chips.
194
00:09:05,370 --> 00:09:10,810
I looked at the APIs for XLA and some of the recent machine-learning GPU
195
00:09:10,820 --> 00:09:16,755
interfaces, and they’re very, very eerily similar to what we did [crosstalk].
196
00:09:16,755 --> 00:09:17,015
That’s what I’m saying.
197
00:09:17,080 --> 00:09:20,830
Like, how crazy—y’all didn’t see Brian’s face when we said, “How many
198
00:09:20,830 --> 00:09:24,470
times have you worked on stuff that, you know, was too early,” but
199
00:09:24,570 --> 00:09:28,160
talk about how relevant that is into what people are doing right now.
200
00:09:28,270 --> 00:09:28,560
Yeah.
201
00:09:28,560 --> 00:09:31,779
Well, the thing I did before PeakStream was another interesting
202
00:09:31,820 --> 00:09:34,810
hardware-software thing, which was a company called Transmeta.
203
00:09:35,370 --> 00:09:36,140
Do you sleep, Brian?
204
00:09:36,570 --> 00:09:38,160
Have you slept in the last 20 years?
205
00:09:38,160 --> 00:09:40,090
Like, what [laugh]—wait, how long have you been
206
00:09:40,090 --> 00:09:42,260
in tech, total because you’ve done some things.
207
00:09:42,610 --> 00:09:43,960
We’re only ten minutes in.
208
00:09:44,910 --> 00:09:45,880
More than 30 years.
209
00:09:46,389 --> 00:09:47,630
You’re the whole dotcom bubble.
210
00:09:47,630 --> 00:09:48,139
That’s awesome.
211
00:09:48,360 --> 00:09:53,480
Well, I went to grad school during the dotcom bubble, mostly, in Seattle.
212
00:09:53,490 --> 00:09:55,870
So like, a lot of students were dropping out to go
213
00:09:55,870 --> 00:09:59,220
to Amazon and things like that, in the mid-’90s.
214
00:09:59,430 --> 00:10:01,579
One of the first Web crawlers was designed by
215
00:10:01,620 --> 00:10:03,880
another student at University of Washington.
216
00:10:04,150 --> 00:10:08,479
So yeah, I was watching that, and I don’t know, I didn’t really feel the pull
217
00:10:08,480 --> 00:10:14,930
of startups at that time, but when I did approach finishing my PhD, I considered
218
00:10:14,930 --> 00:10:19,520
both industry research and startups; I didn’t really think about academia.
219
00:10:19,680 --> 00:10:22,099
And the startups did appeal to me more.
220
00:10:22,330 --> 00:10:24,889
I did turn down a startup you may have heard
221
00:10:24,889 --> 00:10:27,040
of—which is Google—when it was 80 people.
222
00:10:27,920 --> 00:10:28,789
80 people [laugh]?
223
00:10:29,089 --> 00:10:31,460
You turned down Google at 80 people?
224
00:10:32,660 --> 00:10:33,290
That’s a startup.
225
00:10:33,290 --> 00:10:34,030
You don’t want to go there.
226
00:10:34,070 --> 00:10:36,050
It’s who knows what the future looks like for that.
227
00:10:36,340 --> 00:10:38,480
It didn’t work well for me compared to the other search
228
00:10:38,480 --> 00:10:41,000
engines, so I was like, I don’t want to move to California.
229
00:10:41,280 --> 00:10:42,250
Alta Vista was awesome.
230
00:10:42,250 --> 00:10:42,629
Yeah, I know.
231
00:10:42,630 --> 00:10:43,690
Alta Vista was awesome.
232
00:10:43,740 --> 00:10:46,549
And I searched for, like, I need a preschool for my daughter.
233
00:10:46,670 --> 00:10:48,109
Like, how do I search for that?
234
00:10:48,109 --> 00:10:50,180
And I just—it was just awful.
235
00:10:50,370 --> 00:10:52,010
So, I was not impressed by that.
236
00:10:52,010 --> 00:10:55,840
And I also was interested in the technical area I was working, which
237
00:10:55,840 --> 00:11:00,419
was dynamic compilers, which is, you know what Transmeta was all about.
238
00:11:01,359 --> 00:11:03,599
PeakStream also used that, a dynamic compiler.
239
00:11:03,610 --> 00:11:07,250
So, I built three dynamic compilers in my career: one in grad school; I
240
00:11:07,250 --> 00:11:10,610
worked on one at Transmeta, as well as a static compiler; and PeakStream.
241
00:11:11,230 --> 00:11:13,140
Because everybody does that on a Tuesday.
242
00:11:13,190 --> 00:11:14,170
Like [laugh].
243
00:11:14,920 --> 00:11:15,030
[laugh].
244
00:11:15,030 --> 00:11:16,099
That is awesome.
245
00:11:16,099 --> 00:11:20,319
So, what finally led you back to Google to say yes the second time?
246
00:11:20,500 --> 00:11:21,329
I was acquired.
247
00:11:22,059 --> 00:11:22,399
Oh, right.
248
00:11:22,410 --> 00:11:23,539
You’re right, PeakStream was acquired.
249
00:11:23,769 --> 00:11:24,123
PeakStream was acquired.
250
00:11:24,123 --> 00:11:24,980
You were like, “I didn’t even have a choice.”
251
00:11:24,980 --> 00:11:26,859
So like, now they’re like, “We still want Brian.
252
00:11:27,040 --> 00:11:28,339
We’re going to buy the whole company to get you in here.” [laugh].
253
00:11:28,339 --> 00:11:30,280
They were going to get you at, like, some point.
254
00:11:30,880 --> 00:11:34,990
I did the pitch, so you know, it’s not completely involuntary.
255
00:11:35,200 --> 00:11:38,979
The other potential acquirer was… well, I won’t say
256
00:11:38,980 --> 00:11:41,440
that, but we did have other potential acquirers.
257
00:11:41,520 --> 00:11:44,120
But you know, we hadn’t found product-market fit because
258
00:11:44,740 --> 00:11:50,090
customers like high-performance trading or seismic analysis,
259
00:11:50,110 --> 00:11:53,380
you know, these kind of they could hire the high performance
260
00:11:53,420 --> 00:11:55,740
computing engineers to actually build what they needed.
261
00:11:55,750 --> 00:12:00,470
So, in addition to all the exotic hardware bugs, which I could talk about
262
00:12:00,470 --> 00:12:04,770
for a long time if we wanted to do that because that was super fun, but like,
263
00:12:04,790 --> 00:12:13,750
the 1U and 2U server boxes would put in the cards in an orientation such
264
00:12:13,750 --> 00:12:19,230
that the fans on the GPUs would get in the way, and even if they widened the
265
00:12:19,230 --> 00:12:23,170
space between the slots, it would then blow into the motherboard and melt it.
266
00:12:23,260 --> 00:12:23,600
Sure.
267
00:12:23,920 --> 00:12:26,255
So, like, this is a thing that happened.
268
00:12:26,255 --> 00:12:28,660
[laugh]. A small problem.
269
00:12:28,710 --> 00:12:29,930
What do you mean, just melting [laugh]?
270
00:12:31,099 --> 00:12:31,479
Yeah.
271
00:12:31,620 --> 00:12:33,080
There’s a lot of heat.
272
00:12:33,130 --> 00:12:34,749
It would melt solder, it would melt plastic.
273
00:12:34,779 --> 00:12:38,200
Well, you’re probably at, like, 300 watts, 400 watts of GPU, even more.
274
00:12:38,220 --> 00:12:39,520
Like, that heat got to go somewhere.
275
00:12:39,570 --> 00:12:45,589
Yeah, and the quality was also a problem because, for graphics,
276
00:12:45,590 --> 00:12:47,839
they’re like, just most of the pixels need to be right.
277
00:12:47,860 --> 00:12:51,229
If one pixel doesn’t compute the right value, make it zero, and
278
00:12:51,229 --> 00:12:54,920
it will be black, and nobody will notice in 1/24th of a second.
279
00:12:55,000 --> 00:12:58,710
So, their bar for correctness was not the same as Intel.
280
00:12:58,940 --> 00:13:00,920
Like, after the FDIV bug, Intel was just,
281
00:13:00,920 --> 00:13:03,240
like, super paranoid about correctness.
282
00:13:03,240 --> 00:13:04,150
And so, [crosstalk]—
283
00:13:04,210 --> 00:13:06,480
Oh man, I was just reading about that bug.
284
00:13:06,730 --> 00:13:07,640
That was so big.
285
00:13:07,650 --> 00:13:09,049
I completely forgot about that.
286
00:13:09,059 --> 00:13:11,080
Can we give the listeners some context?
287
00:13:11,109 --> 00:13:13,800
What was the FDIV bug, guys?
288
00:13:13,800 --> 00:13:16,620
The FDIV bug, there was a bug in the floating point division unit where it
289
00:13:16,630 --> 00:13:22,176
sometimes would give the wrong result, and that was not considered acceptable.
290
00:13:22,176 --> 00:13:24,839
The computer didn’t math, and this was a problem.
291
00:13:24,840 --> 00:13:26,290
And it was in the chip itself.
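For context, the widely circulated check for this flaw divides 4195835 by 3145727 and multiplies the result back. A correct floating point unit leaves a residual of zero, while affected Pentium chips were reported to return 256. The sketch below runs that check; the function name `fdiv_residual` is made up for illustration, and it assumes ordinary IEEE 754 double-precision arithmetic, which is what Python floats use on current hardware.

```python
def fdiv_residual(x: float = 4195835.0, y: float = 3145727.0) -> float:
    """Divide, multiply back, and return what is left over."""
    # With correct double-precision division the product rounds back to x
    # exactly for this pair of inputs, so the residual is 0.0.
    return x - (x / y) * y


print(fdiv_residual())  # 0.0 on hardware with a correct divider
```

The check only says anything about the machine it runs on; it does not reproduce the flaw itself.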
292
00:13:26,290 --> 00:13:27,399
Actually, there was a Bluesky thread.
293
00:13:27,410 --> 00:13:29,459
I will find it, and we will put it in the [show notes]. Because it
294
00:13:29,459 --> 00:13:33,200
was an amazing—they had, like, they decapped the chip and looking
295
00:13:33,200 --> 00:13:36,079
at the trace, like, here’s where the bug is, physically on the chip.
296
00:13:36,260 --> 00:13:38,110
Y’all are missing Justin’s very excited face—
297
00:13:38,110 --> 00:13:38,970
I love it.
298
00:13:39,050 --> 00:13:42,000
—because his face, it’s like Christmas morning, and it’s crazy.
299
00:13:42,009 --> 00:13:44,780
Yeah, so that was—at Transmeta, we had a lot of that
300
00:13:44,780 --> 00:13:47,159
because the industry was undergoing a lot of change.
301
00:13:47,170 --> 00:13:48,970
First of all, we were changing everything.
302
00:13:48,970 --> 00:13:50,650
We had a new approach.
303
00:13:50,650 --> 00:13:55,480
We were doing dynamic binary translation in software from x86 to a custom VLIW.
304
00:13:56,520 --> 00:13:58,090
Like, not emulation layer?
305
00:13:58,090 --> 00:13:58,630
Like…
306
00:13:58,800 --> 00:13:59,610
In software.
307
00:13:59,860 --> 00:14:00,240
Okay.
308
00:14:00,500 --> 00:14:02,680
It was an emulation in the software, in a hidden
309
00:14:02,680 --> 00:14:05,989
virtual machine that the end-user could not access.
310
00:14:06,299 --> 00:14:06,900
What could go wrong?
311
00:14:07,230 --> 00:14:08,690
Actually, all that worked, awesome.
312
00:14:08,690 --> 00:14:10,350
[laugh]. The hardware was the problem.
313
00:14:10,850 --> 00:14:14,145
Well, the industry was transitioning from 130 nanometer to
314
00:14:14,145 --> 00:14:17,480
90 nanometer, which the leakage characteristics just changed
315
00:14:17,480 --> 00:14:20,940
dramatically, and from aluminum wires to copper wires.
316
00:14:21,339 --> 00:14:25,199
And we changed our fab to TSMC, a little fab that nobody had ever heard of.
317
00:14:25,630 --> 00:14:28,000
And month after month, we were looking at these photos
318
00:14:28,030 --> 00:14:30,720
of, from an electron scanning microscope, saying, you
319
00:14:30,720 --> 00:14:33,140
know, this is the reason the chips don’t work this month.
320
00:14:33,220 --> 00:14:34,154
There’s a thing called the vias.
321
00:14:34,154 --> 00:14:36,670
So, the chips are multiple layers, alternating silicon
322
00:14:36,670 --> 00:14:40,539
and metal, and the metal is the wire layers that connect
323
00:14:40,539 --> 00:14:43,040
all the gates together, all the transistors together.
324
00:14:43,490 --> 00:14:45,790
The metal layers all need to be connected because the
325
00:14:45,850 --> 00:14:49,790
electricity comes in on the pins on one surface of the
326
00:14:49,790 --> 00:14:52,630
chip and needs to flow through all the metal on the chip.
327
00:14:52,639 --> 00:14:56,630
So, there’s a thing called vias, which is holes in the chip and
328
00:14:56,639 --> 00:14:58,979
the metal needs to drip down through as part of the process of
329
00:14:58,980 --> 00:15:03,650
manufacturing these things, at microscopic, like, atomic-level scales.
330
00:15:03,940 --> 00:15:07,760
So, there’s all kinds of things in the viscosity of the metal, where, if
331
00:15:07,760 --> 00:15:11,740
it’s not exactly right, it won’t go through the hole because it’s so small.
332
00:15:11,790 --> 00:15:14,999
So, if you can imagine, like, raindrops collecting on a sheet of
333
00:15:15,000 --> 00:15:18,410
plastic, or something like that, and not falling off, kind of like that.
334
00:15:18,670 --> 00:15:20,560
So, we would see these pictures of, oh, this via
335
00:15:20,560 --> 00:15:22,220
didn’t go through, that via didn’t go through.
336
00:15:22,440 --> 00:15:24,700
Oh, this one actually went through, and splattered
337
00:15:24,700 --> 00:15:26,540
across, and shorted a bunch of wires together.
338
00:15:26,750 --> 00:15:29,570
So, we had a bunch of photos like that for, I forget
339
00:15:29,590 --> 00:15:31,280
how many months, like, six months or something.
340
00:15:31,280 --> 00:15:33,470
It was a long time for somebody trying to get a product out.
341
00:15:33,830 --> 00:15:34,130
Yeah.
342
00:15:34,150 --> 00:15:35,389
So, that was exciting.
343
00:15:35,389 --> 00:15:39,420
Then once we got the chips back for the 90 nanometer generation, which was the
344
00:15:39,460 --> 00:15:45,160
second generation chip design—and I just started at a—fortuitously the week that
345
00:15:45,200 --> 00:15:48,920
project kicked off, so I was there from the beginning on that chip generation.
346
00:15:49,360 --> 00:15:52,390
The software was all new, the static compiler was new, the dynamic
347
00:15:52,400 --> 00:15:55,209
compiler was new, the boards were new, the chips were new, like,
348
00:15:55,390 --> 00:15:58,160
the fab was new, the process was new, like, everything was new.
349
00:15:58,410 --> 00:15:59,709
So, of course, nothing worked, right?
350
00:16:00,060 --> 00:16:00,400
Yeah.
351
00:16:00,400 --> 00:16:03,860
I was going to say this then trying to figure out what’s wrong is just—
352
00:16:03,910 --> 00:16:04,130
Yeah.
353
00:16:04,130 --> 00:16:08,500
So, we had a 24-hour bring-up rotation, so there’s always people
354
00:16:08,520 --> 00:16:12,040
in the lab trying to figure out what’s wrong and working around it.
355
00:16:12,170 --> 00:16:16,220
So, my parts, eventually, after the hardcore bring-up
356
00:16:16,220 --> 00:16:18,459
lab, where it’s like, well, we don’t have a clock signal.
357
00:16:18,630 --> 00:16:19,650
Why don’t we have a clock signal?
358
00:16:19,660 --> 00:16:21,260
Well, the phase-locked loop has a problem.
359
00:16:21,630 --> 00:16:25,070
Well, what can we do to electrically make the phase-locked loop work?
360
00:16:25,349 --> 00:16:28,809
Once I got to the point where they could kind of run, I got a board on my
361
00:16:28,809 --> 00:16:32,520
desk with a socket I could just open and close, and there were balls on the
362
00:16:32,520 --> 00:16:35,979
chips, rather than pins, so I could actually just get a tray of chips and slap
363
00:16:35,980 --> 00:16:41,090
one in and close the socket and turn it on and try to debug what was going on.
364
00:16:41,120 --> 00:16:43,540
Because different chips had different characteristics.
365
00:16:43,549 --> 00:16:45,310
Probabilistically, there’s a distribution.
366
00:16:45,759 --> 00:16:47,310
If this is not interesting, by the way, you can stop me any time.
367
00:16:47,320 --> 00:16:47,540
No—
368
00:16:47,720 --> 00:16:49,000
No, it’s so interesting.
369
00:16:49,050 --> 00:16:52,350
I did not expect this to go this direction, and I absolutely love it.
370
00:16:52,370 --> 00:16:55,075
But also, we have so much other stuff I want to talk about.
371
00:16:55,075 --> 00:16:57,810
This is, like, 20 years ago, and at some point Google bought the company.
372
00:16:57,980 --> 00:17:00,439
Why did Google buy it and what were you doing when you joined?
373
00:17:00,440 --> 00:17:02,480
Because you said you weren’t on the Borg team originally.
374
00:17:02,940 --> 00:17:06,179
Honestly, we had the same investors as Google, Kleiner and
375
00:17:06,289 --> 00:17:10,300
Sequoia, and actually, when we started PeakStream, I worked
376
00:17:10,300 --> 00:17:12,983
in the back of the Sequoia office for a few months before we
377
00:17:12,983 --> 00:17:15,319
found our office, and got a company name, and things like that.
378
00:17:15,660 --> 00:17:18,230
I was the third engineer there, but not a co-founder.
379
00:17:18,410 --> 00:17:23,810
Effectively, there was some hope that maybe the technology could be useful.
380
00:17:24,029 --> 00:17:28,090
And actually, my investigation into the data centers and Borg
381
00:17:28,160 --> 00:17:30,730
was one of the things that convinced me it was going to be quite
382
00:17:30,730 --> 00:17:34,630
challenging, but also we didn’t find a customer within Google
383
00:17:34,639 --> 00:17:38,409
for that, for dense floating point computation at that time.
384
00:17:38,480 --> 00:17:41,839
Like, the computations were more sparse for
385
00:17:41,879 --> 00:17:43,440
the types of things they were doing back then.
386
00:17:44,050 --> 00:17:46,450
So yeah, we spent a few months talking to lots of people
387
00:17:46,450 --> 00:17:49,210
in the company and tried to find something useful, but
388
00:17:49,210 --> 00:17:51,490
then said, well, it wasn’t going to be actually useful.
389
00:17:51,500 --> 00:17:57,846
So, then I pivoted the team that was brought over to focus on something that was
390
00:17:58,070 --> 00:18:03,280
a problem, which was Google’s—about half of its code was C++ and half was Java.
391
00:18:03,690 --> 00:18:07,910
And this, 17 years ago, it was just at the beginning
392
00:18:07,910 --> 00:18:12,250
of the NPTL, the new POSIX threading library in Linux.
393
00:18:12,550 --> 00:18:14,049
Before that, there was this thing called
394
00:18:14,059 --> 00:18:16,849
LinuxThreads that was terrible and not really usable.
395
00:18:16,859 --> 00:18:21,270
So, when Google started—and there was no C++ threading standard, right, so
396
00:18:21,270 --> 00:18:24,919
you had to write your own threading primitives, effectively, to do stuff.
397
00:18:24,940 --> 00:18:28,000
And you know how memory safe C is, right?
398
00:18:28,630 --> 00:18:31,430
So, Google had developed all of its own threading primitives.
399
00:18:32,940 --> 00:18:34,269
They were pretty low level though.
400
00:18:34,340 --> 00:18:40,850
And the first engineer hired into Google decided, well, we’re
401
00:18:40,920 --> 00:18:43,580
scraping the entire web; we need throughput, which was true.
402
00:18:44,130 --> 00:18:47,210
And the chips at the time were the Pentium 4, which was what Transmeta
403
00:18:47,450 --> 00:18:51,720
was competing against, which—well, anyway, I won’t go back into CPUs,
404
00:18:52,080 --> 00:18:54,979
but the determination was made that the most efficient thing to do would
405
00:18:54,990 --> 00:18:57,440
be to write a single-threaded event loop and run everything that way.
406
00:18:57,780 --> 00:19:02,540
And that was true at the time, but very shortly
407
00:19:02,590 --> 00:19:06,870
after 1998, when Google started, multi-core happened.
408
00:19:07,059 --> 00:19:08,310
Chips changed everything.
409
00:19:08,310 --> 00:19:13,250
Now, multi-threading is good for CPU utilization and latency.
410
00:19:13,990 --> 00:19:18,600
Java had a very strong threading model from very early on, so all the
411
00:19:18,610 --> 00:19:23,630
Java code was actually in pretty good shape, but the C code was not.
412
00:19:23,630 --> 00:19:25,329
There’s a lot of single-threaded code in
413
00:19:25,330 --> 00:19:29,110
Google, so I started an initiative to fix that.
414
00:19:29,230 --> 00:19:31,530
So, one opportunity was the Borg team.
415
00:19:31,960 --> 00:19:35,639
The other opportunity was to make everything on Borg run better.
416
00:19:36,070 --> 00:19:38,580
So, I ended up doing the latter.
417
00:19:38,580 --> 00:19:42,260
I started a bunch of projects to help the new POSIX threading
418
00:19:42,260 --> 00:19:47,270
library roll out in the fleet, to develop some new, easier-to-use
419
00:19:47,270 --> 00:19:49,449
threading primitives, and to develop documentation.
420
00:19:49,470 --> 00:19:52,650
I mean, back in those days in Google, it’s like, engineering was maybe 10,000.
421
00:19:53,090 --> 00:19:54,520
Biggest I’d ever worked for at the time.
422
00:19:54,520 --> 00:19:56,300
You know, I’d done two startups before that.
423
00:19:56,620 --> 00:20:00,239
So, I thought, “Oh, man, Google’s so huge.” Little did
424
00:20:00,240 --> 00:20:02,360
I know 17 years later, it would be 20 times bigger.
425
00:20:02,740 --> 00:20:05,610
It’s so crazy that you were almost the 80th employee now that,
426
00:20:05,610 --> 00:20:08,290
like, a thousand that probably seemed so big in context, but—
427
00:20:08,290 --> 00:20:08,640
Ten thousand.
428
00:20:09,170 --> 00:20:09,950
Oh, ten thousand.
429
00:20:09,960 --> 00:20:13,050
But then, like, it’s just massive now.
430
00:20:13,120 --> 00:20:14,280
Hundreds of thousands, yeah.
431
00:20:14,700 --> 00:20:16,339
It was big, but in those days, I could do
432
00:20:16,340 --> 00:20:20,559
things like having a company-wide tech talk.
433
00:20:20,710 --> 00:20:20,980
So, I did.
434
00:20:21,490 --> 00:20:24,419
I started an initiative called the multi-core initiative to
435
00:20:24,640 --> 00:20:28,800
actually promote threading in C++ and to make it work better.
436
00:20:29,160 --> 00:20:34,180
So, we built a multi-threaded HTTP server, which is still in use—HTTP server—and
437
00:20:34,240 --> 00:20:38,710
some threading primitives, and worked on documentation and thread profiling
438
00:20:38,710 --> 00:20:42,990
tools, some annotations in the compiler that would, kind of similar to
439
00:20:42,990 --> 00:20:47,110
annotations in Java, where you could identify areas that are supposed to be
440
00:20:47,110 --> 00:20:51,210
locked and ensure that the mutex was used properly, and things like that.
441
00:20:51,830 --> 00:20:54,495
This is a random question, but what was your PhD in?
442
00:20:54,530 --> 00:20:58,950
Because how did you get started in, like, GPU and, like, these very in-depth—
443
00:20:59,219 --> 00:21:00,710
My background was systems.
444
00:21:00,719 --> 00:21:03,470
So, as an undergrad, I worked on networking
445
00:21:03,510 --> 00:21:06,240
and operating systems and supercomputing.
446
00:21:06,720 --> 00:21:11,870
And I also started grad school working in the supercomputing area.
447
00:21:11,870 --> 00:21:14,240
I did three summers at Lawrence Livermore National Lab.
448
00:21:14,480 --> 00:21:17,460
And I worked on the climate model, and some group communication
449
00:21:17,460 --> 00:21:21,020
primitives, and porting to MPI, which was brand new back in those days.
450
00:21:21,020 --> 00:21:25,290
I actually went to one of the MPI spec development meetings.
451
00:21:25,640 --> 00:21:28,280
That’s the Message Passing Interface: MPI.
452
00:21:28,790 --> 00:21:32,130
So, I transitioned into compilers because there was an interesting
453
00:21:32,589 --> 00:21:37,830
project doing runtime partial evaluation, and that’s where
454
00:21:37,830 --> 00:21:42,939
you take some runtime values in the program and use that to
455
00:21:42,940 --> 00:21:46,350
generate specialized code that just works for those values.
456
00:21:46,690 --> 00:21:49,099
You know, so in some cases, you could get kind of
457
00:21:49,130 --> 00:21:52,529
dramatic speed-ups from the code from doing that.
458
00:21:52,539 --> 00:21:52,809
So—
459
00:21:52,809 --> 00:21:55,450
You’re doing, like, dynamic tuning for the code?
460
00:21:55,450 --> 00:21:55,970
Or was it like—
461
00:21:56,300 --> 00:21:58,140
It’s not tuning; it’s compiling.
462
00:21:58,140 --> 00:22:03,709
So, you know, if you have some computation that uses some value like an
463
00:22:03,719 --> 00:22:07,300
integer, there are standard compiler—if that’s a constant, like, five,
464
00:22:07,480 --> 00:22:11,530
there are standard compiler optimizations, like, constant folding that
465
00:22:11,699 --> 00:22:14,740
will take that value and pre-compute any values that can be pre-computed.
466
00:22:15,490 --> 00:22:17,540
If that value is input at runtime, you don’t know what
467
00:22:17,560 --> 00:22:19,820
the value is, then you just have to generate all the code.
468
00:22:19,820 --> 00:22:22,360
And if there are conditionals based on that code, you have to
469
00:22:22,450 --> 00:22:25,149
generate branches and evaluate those conditionals, et cetera.
470
00:22:25,520 --> 00:22:29,864
If there were certain values, even data structures that were known
471
00:22:30,000 --> 00:22:34,590
to be constant, you could potentially do some pretty impressive
472
00:22:34,630 --> 00:22:39,080
optimizations, like unrolling loops, which allows you to pre-compute
473
00:22:39,080 --> 00:22:42,160
even more things and reduce the amount of code dramatically.
474
00:22:42,160 --> 00:22:43,500
So, the biggest speed-ups we would get
475
00:22:43,650 --> 00:22:46,130
were, like, 10x speed-ups from doing that.
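As an illustration of the specialization described here, the following is a minimal Python sketch, not the research system itself. It assumes a weight vector that only becomes known at run time, then generates code for that specific vector: the loop is unrolled, zero weights produce no code at all, and weights of 1 skip the multiplication. The names `dot_generic` and `specialize_dot` are hypothetical and chosen only for this example.

```python
def dot_generic(weights, xs):
    """General version: loops over and multiplies every element on each call."""
    total = 0
    for w, x in zip(weights, xs):
        total += w * x
    return total


def specialize_dot(weights):
    """Generate a function specialized for one weight vector known at run time.

    The loop is unrolled, zero weights are dropped entirely, and weights
    equal to 1 are emitted without a multiplication.
    """
    terms = []
    for i, w in enumerate(weights):
        if w == 0:
            continue  # this term contributes nothing, so no code is emitted
        elif w == 1:
            terms.append(f"xs[{i}]")
        else:
            terms.append(f"{w!r} * xs[{i}]")
    body = " + ".join(terms) if terms else "0"
    source = f"def dot_specialized(xs):\n    return {body}\n"
    namespace = {}
    exec(source, namespace)  # compile the generated source at run time
    return namespace["dot_specialized"], source


weights = [2, 0, 1, 0, 3]
dot_specialized, source = specialize_dot(weights)
print(source)
```

The generated function gives the same answers as the generic one for that weight vector, but it contains no loop and no work for the zero entries. How much this helps in practice depends on how often the specialized code is reused.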
476
00:22:46,320 --> 00:22:49,550
So, for example, if you had an interpreter, an interpreter would
477
00:22:49,550 --> 00:22:52,850
normally have an execution loop where it would read some operation
478
00:22:52,850 --> 00:22:57,449
to interpret, you know, dispatch to something to evaluate it.
479
00:22:57,450 --> 00:23:00,660
Like, if it’s an add, it will go add, and return back.
480
00:23:00,700 --> 00:23:03,360
So, if you actually gave the interpreter a
481
00:23:03,360 --> 00:23:06,570
program as a constant value, what could you do?
482
00:23:06,580 --> 00:23:09,530
Well, effectively, you can compile it into
483
00:23:09,530 --> 00:23:11,830
the code instead of interpreting it, right?
484
00:23:11,840 --> 00:23:14,350
So, that was sort of the most impressive use case.
485
00:23:14,350 --> 00:23:18,409
Not super realistic, but there were some cases like that that could be done.
486
00:23:18,410 --> 00:23:22,919
So, to do this, what you have to do is analyze where all the values
487
00:23:22,969 --> 00:23:26,060
flowed to that you wanted to take advantage of, and split the
488
00:23:26,200 --> 00:23:29,280
program into two: one piece would be a compiler that would do all
489
00:23:29,280 --> 00:23:31,320
the pre-computation, and the other piece would be the code that
490
00:23:31,650 --> 00:23:34,289
would be emitted once you had the values that were pre-computed.
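The interpreter case can be sketched in a few lines of Python. This is a hand-written illustration under assumptions: the tiny two-operation language and the names `interpret` and `compile_program` are invented for the example, and the two-way split that the research system derived by analyzing where values flow is done manually here. Stage one runs once, when the program becomes known, and does all the dispatch; stage two is the code it emits, which only touches the input that is still unknown.

```python
def interpret(program, x):
    """Ordinary interpreter: dispatches on every operation at every call."""
    acc = x
    for op, arg in program:
        if op == "add":
            acc += arg
        elif op == "mul":
            acc *= arg
        else:
            raise ValueError(f"unknown op: {op}")
    return acc


def compile_program(program):
    """Stage 1: runs once, when the program is known.

    All dispatch on operation names happens here. The result (stage 2)
    is straight-line code with no loop and no dispatch left in it.
    """
    expr = "x"
    for op, arg in program:
        if op == "add":
            expr = f"({expr} + {arg!r})"
        elif op == "mul":
            expr = f"({expr} * {arg!r})"
        else:
            raise ValueError(f"unknown op: {op}")
    source = f"def run(x):\n    return {expr}\n"
    namespace = {}
    exec(source, namespace)  # emit the stage-2 code
    return namespace["run"], source


program = [("add", 3), ("mul", 4), ("add", -1)]
run, source = compile_program(program)
print(source)
```

Calling `run(x)` gives the same result as `interpret(program, x)`, but the program itself has effectively been compiled away.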
491
00:23:34,559 --> 00:23:38,230
And over and over again, it seems—I mean, like your graduates,
492
00:23:39,049 --> 00:23:40,739
your doctorate, all this stuff, like, you’ve been doing this
493
00:23:40,770 --> 00:23:43,790
optimization over and over again, and now you’re at Google,
494
00:23:43,790 --> 00:23:47,120
you’re doing this C++ multi-core, we got to do this thing.
495
00:23:47,400 --> 00:23:50,620
Let’s fast forward to, like, where do you go from, you’re
496
00:23:50,620 --> 00:23:53,290
doing Borg stuff to, like, this Kubernetes thing comes out?
497
00:23:53,300 --> 00:23:55,200
Like, this is, like, hey, we want to do something else
498
00:23:55,200 --> 00:23:57,640
that’s going to—we want to open-source it, we want to do
499
00:23:57,730 --> 00:24:00,370
something generalized from the stuff we were doing internally.
500
00:24:00,629 --> 00:24:01,120
Okay, yeah.
501
00:24:01,120 --> 00:24:05,649
I mean, after about a year-and-a-half, I got about as far as I could on
502
00:24:06,070 --> 00:24:08,680
making things multi-threaded, so I transitioned to the Borg team in 2009.
503
00:24:09,790 --> 00:24:14,450
In short order—Borg was about five years old at that point in time—it
504
00:24:14,900 --> 00:24:17,669
was clear that it was being used in ways it wasn’t really designed for.
505
00:24:17,670 --> 00:24:20,780
So, I came up with an idea for how to rearchitect it.
506
00:24:20,970 --> 00:24:25,990
I started the project called Omega, which was an R&D project to rearchitect it.
507
00:24:26,130 --> 00:24:27,940
Worked on that for a while.
508
00:24:29,090 --> 00:24:31,060
After a couple of years, cloud became a priority.
509
00:24:31,060 --> 00:24:35,839
I mean, before that, it was not really a priority for the whole company.
510
00:24:36,860 --> 00:24:39,050
You know, the Cloud team was a pretty small team in Seattle,
511
00:24:39,650 --> 00:24:42,730
and most of Google was down in Mountain View in the Bay Area.
512
00:24:43,140 --> 00:24:46,805
Google had App Engine for a couple of years, but kind of core cloud product that
513
00:24:46,920 --> 00:24:50,740
people would think about would be the IaaS product, the Google Compute Engine.
514
00:24:52,240 --> 00:24:53,890
And my understanding was, App Engine was basically, like,
515
00:24:53,890 --> 00:24:56,350
a customer front to run jobs on Borg directly, right?
516
00:24:56,350 --> 00:24:59,910
It was so restricted that it didn’t have a layer in between.
517
00:24:59,920 --> 00:25:00,750
Did have a layer.
518
00:25:01,030 --> 00:25:01,360
Okay, s—
519
00:25:01,380 --> 00:25:03,230
It had a pretty elaborate layer, actually.
520
00:25:03,780 --> 00:25:06,530
And Cloud Run shares some DNA with that.
521
00:25:07,030 --> 00:25:10,859
But the restrictions behind App Engine were because of that platform
522
00:25:10,859 --> 00:25:13,920
layer between because it was just, like, you have to architect
523
00:25:13,920 --> 00:25:16,760
your application in a specific way to make it run here, and we
524
00:25:16,760 --> 00:25:18,860
will take care of all the infrastructure side of it for you.
525
00:25:18,870 --> 00:25:21,920
Yeah, a big part of that is—well, there’s multiple parts of it.
526
00:25:22,080 --> 00:25:24,450
One is sandboxing, so you can actually run stuff
527
00:25:24,450 --> 00:25:29,740
multi-tenant before we had hardware virtualization primitives
528
00:25:30,139 --> 00:25:32,250
that could be used for sandboxing, like, in gVisor.
529
00:25:32,290 --> 00:25:33,720
I mean, eventually it moved to gVisor.
530
00:25:33,750 --> 00:25:36,480
But before there was gVisor, there was like
531
00:25:36,480 --> 00:25:38,190
a Ptrace sandbox or something like that.
532
00:25:38,349 --> 00:25:41,830
But all the networking stuff in Google is different and exotic.
533
00:25:41,910 --> 00:25:44,210
You know, like things don’t communicate by HTTP.
534
00:25:44,500 --> 00:25:47,059
You know, it had some RPC system that sort of
535
00:25:47,949 --> 00:25:51,650
predated use of HTTP as a standard networking layer.
536
00:25:51,850 --> 00:25:55,790
None of your normal naming DNS, service discovery,
537
00:25:56,320 --> 00:25:59,829
proxying, load balancing, none of that stuff works, right,
538
00:25:59,870 --> 00:26:02,110
because they have all their own internal stuff for that.
539
00:26:02,110 --> 00:26:02,259
Yeah.
540
00:26:02,259 --> 00:26:05,429
Yeah, all the internet things are just like, “Ah, that’s not for us.”
541
00:26:05,990 --> 00:26:08,290
And the compute layer, you know, they did want the sandboxing to
542
00:26:08,300 --> 00:26:11,679
be really strong, so, yeah there are a bunch of reasons for the
543
00:26:11,680 --> 00:26:14,660
restrictions, but they’re, you know, based on the technology at the time.
544
00:26:14,830 --> 00:26:16,845
And it’s before Docker containers, things like that.
545
00:26:16,860 --> 00:26:17,169
Right.
546
00:26:17,210 --> 00:26:21,039
Yeah, but in this so you have this research project, basically internal,
547
00:26:21,070 --> 00:26:25,629
like, to rearchitect Borg into this Omega thing, and then what happens there?
548
00:26:25,630 --> 00:26:26,730
Like, where’s that transition?
549
00:26:27,320 --> 00:26:33,340
I mean, it turned out to be not worth it and somewhat infeasible to roll it out.
550
00:26:33,340 --> 00:26:36,719
We did partially roll out pieces of it, internally, and
551
00:26:37,349 --> 00:26:40,000
it kind of made things more complex operationally during
552
00:26:40,000 --> 00:26:43,370
the transition, and it just didn’t provide enough value.
553
00:26:43,420 --> 00:26:46,200
The install base was really big, and it was growing faster
554
00:26:46,200 --> 00:26:50,400
than we could write new code, and that was a time when the
555
00:26:50,400 --> 00:26:54,950
Borg ecosystem was just exploding internally, and new things
556
00:26:54,950 --> 00:26:58,980
were being added at all the layers at a very rapid pace.
557
00:26:59,359 --> 00:27:03,940
And changing the user interface, which was actually one of the most
558
00:27:03,950 --> 00:27:06,970
problematic parts of it, was also pretty much a non-starter because
559
00:27:07,380 --> 00:27:09,920
there was like, zillions and zillions of lines of configuration
560
00:27:09,920 --> 00:27:13,669
files and about a thousand clients calling the APIs directly.
561
00:27:14,210 --> 00:27:15,320
So, it was just too much.
562
00:27:15,320 --> 00:27:18,820
So, some of the ideas were folded back into Borg, like labels
563
00:27:18,820 --> 00:27:22,040
and watch, which, if you know Kubernetes, may sound familiar.
564
00:27:22,090 --> 00:27:24,330
And other parts were turned down, but
565
00:27:24,450 --> 00:27:27,020
Kubernetes—you know, as cloud became important.
566
00:27:27,030 --> 00:27:31,519
GCE, Google Compute Engine, GA’ed, at the end of 2013.
567
00:27:31,949 --> 00:27:35,939
And that was also when Joe Beda kind of discovered Docker, and said,
568
00:27:35,939 --> 00:27:38,939
“Hey, look at this Docker thing.” Management directors and above
569
00:27:38,940 --> 00:27:41,570
were kind of trying to figure out, how can we apply our internal
570
00:27:41,570 --> 00:27:44,870
infrastructure expertise to cloud, now that it’s becoming a priority?
571
00:27:44,870 --> 00:27:50,540
So, I shifted off in that direction, and we started exploring, well, there’s
572
00:27:50,540 --> 00:27:53,839
this group put together by a couple of directors called the Unified Compute
573
00:27:53,859 --> 00:27:58,600
Working Group, and actually the original motivation, nominally, was to
574
00:27:58,780 --> 00:28:02,670
produce a cloud platform that Google could actually use itself someday.
575
00:28:03,190 --> 00:28:08,129
Because App Engine was considered too restrictive, and
576
00:28:08,280 --> 00:28:12,260
Google Compute Engine was VMs, and Google had never used VMs.
577
00:28:12,720 --> 00:28:14,169
Like, it just skipped that.
578
00:28:14,169 --> 00:28:17,570
It used containers, more or less, processes, Unix processes from
579
00:28:17,570 --> 00:28:20,140
the beginning, so there was no way they were going to use VMs.
580
00:28:20,140 --> 00:28:23,340
They’re, like, way too inefficient, they’re too opaque,
581
00:28:23,559 --> 00:28:25,710
they’re hard to manage. It had to be container-based.
582
00:28:25,719 --> 00:28:29,519
So, you know, some of the original things were, yeah, it should be like, Borg.
583
00:28:29,520 --> 00:28:30,720
And I’m like, wait, wait, wait, wait, wait.
584
00:28:30,969 --> 00:28:32,899
We just spent years trying to [unintelligible] Borg.
585
00:28:32,900 --> 00:28:34,460
Let’s not do it just like Borg.
586
00:28:34,780 --> 00:28:37,690
Kubernetes actually ended up being open-source Omega, more
587
00:28:37,690 --> 00:28:41,960
or less, based on a lot of the architectural ideas, and
588
00:28:42,040 --> 00:28:44,340
some specific features, even, like, scheduling features.
589
00:28:44,540 --> 00:28:49,120
So, some of the more unusual terminology was just lifted whole cloth from
590
00:28:49,120 --> 00:28:54,080
Omega, like taints and tolerations, for example, as just one example.
591
00:28:54,400 --> 00:28:57,400
So, there were a bunch of things from Omega we just simplified.
592
00:28:57,580 --> 00:28:59,570
Wasn’t the pod aspects in Omega?
593
00:28:59,610 --> 00:29:00,320
Like, the grouping?
594
00:29:00,330 --> 00:29:01,120
It was, yeah.
595
00:29:01,120 --> 00:29:03,889
So, the pod was one thing I felt was really important,
596
00:29:04,520 --> 00:29:07,220
and I tried to introduce it into Borg around 2012.
597
00:29:07,820 --> 00:29:09,330
That was super hard to introduce at the
598
00:29:09,330 --> 00:29:11,530
core layer, since the ecosystem was so big.
599
00:29:11,679 --> 00:29:15,949
But Borg’s model was, it had a concept called an alloc, which was an
600
00:29:15,950 --> 00:29:21,209
array of resources across machines, and the idea was that, you know,
601
00:29:21,210 --> 00:29:23,609
it’s kind of your own virtual cluster that you can schedule into.
602
00:29:24,129 --> 00:29:26,139
But nobody, almost nobody, used it that way.
603
00:29:26,570 --> 00:29:31,990
What teams did was they had a set of processes they wanted to deploy
604
00:29:31,990 --> 00:29:37,359
together, usually an application and a bunch of sidecars for logging
605
00:29:37,360 --> 00:29:42,110
and other things, and they wanted those things deployed in sets.
606
00:29:42,900 --> 00:29:45,730
So, you know, I talked to the SREs, and they said, “Ah, we just
607
00:29:45,730 --> 00:29:48,469
want this.” That led to the concept, which, at the time, was
608
00:29:48,470 --> 00:29:52,990
called Scheduling Unit in Omega, and for the experiment in Borg.
609
00:29:53,630 --> 00:29:58,779
And that was just what in Borg was a set of tasks from jobs.
610
00:29:58,850 --> 00:30:01,840
And tasks weren’t even a first-class resource in Borg.
611
00:30:02,170 --> 00:30:04,860
Jobs were the first-class resource, and jobs were arrays of tasks.
612
00:30:04,860 --> 00:30:05,000
So, you had
613
00:30:07,610 --> 00:30:10,800
this weird, challenging model where you had an array
614
00:30:10,809 --> 00:30:13,630
of resources across machines, and you had multiple
615
00:30:13,639 --> 00:30:16,369
arrays of tasks that you wanted to pack into those.
616
00:30:17,010 --> 00:30:20,510
So, if you needed to horizontally scale, you needed to
617
00:30:20,510 --> 00:30:24,820
grow your alloc first, and then grow your jobs after.
618
00:30:24,910 --> 00:30:27,490
And if you wanted to scale down, you had to do it in the opposite order.
619
00:30:27,960 --> 00:30:30,250
And technically, you could do it in either order, and things
620
00:30:30,250 --> 00:30:33,340
would just go pending and not schedule into the allocs, but
621
00:30:33,750 --> 00:30:36,550
that created a lot of confusion, so people tried to avoid that.
622
00:30:37,030 --> 00:30:39,730
But the pod or scheduling unit primitive was
623
00:30:39,970 --> 00:30:42,360
just a lot easier for how people were using it.
624
00:30:42,380 --> 00:30:43,260
I have this set of things.
625
00:30:43,260 --> 00:30:46,010
I want those deployed together, just as if they were on one machine.
626
00:30:46,010 --> 00:30:46,659
Just do that.
627
00:30:46,659 --> 00:30:48,830
If you want to scale, that’s a unit to scale by.
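The ordering problem described here can be made concrete with a toy Python model. Everything in it is a simplifying assumption for illustration: `AllocSet`, `Job`, `PodSet`, and `place` are hypothetical names, slots are treated as interchangeable, and nothing here reflects Borg's or Kubernetes' real APIs. In the first model, two job arrays have to be packed into a separately sized array of slots, so scaling takes several coordinated steps. In the second, the co-deployed set is itself the unit that gets replicated.

```python
from dataclasses import dataclass


@dataclass
class AllocSet:
    """Hypothetical stand-in for a reserved array of resource slots."""
    slots: int


@dataclass
class Job:
    """Hypothetical stand-in for a job: an array of identical tasks."""
    name: str
    tasks: int


def place(alloc_set, jobs):
    """Pack task i of each job into slot i; return pending task counts."""
    return {job.name: max(0, job.tasks - alloc_set.slots) for job in jobs}


@dataclass
class PodSet:
    """Hypothetical pod-style model: the co-located set is the scaling unit."""
    containers: tuple
    replicas: int

    def scale(self, replicas):
        self.replicas = replicas

    def instances(self):
        return [(i, name) for i in range(self.replicas)
                for name in self.containers]


# Array-of-slots model: growing the jobs before the slots leaves tasks pending.
allocs = AllocSet(slots=3)
jobs = [Job("app", 5), Job("logger", 5)]
pending_before = place(allocs, jobs)
allocs.slots = 5  # the slots have to be grown as a separate step
pending_after = place(allocs, jobs)

# Pod-style model: one number scales the application and its sidecar together.
pods = PodSet(containers=("app", "logger"), replicas=3)
pods.scale(5)
print(pending_before, pending_after, len(pods.instances()))
```

In the first model, correctness depends on the order of operations across three objects. In the second there is a single replica count and no ordering to get wrong.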
628
00:30:49,530 --> 00:30:52,410
I remember, like, in like, Mesos time, like, it was always
629
00:30:52,410 --> 00:30:54,270
like, oh, well, don’t try to schedule things together.
630
00:30:54,270 --> 00:30:55,300
Just write better code.
631
00:30:55,390 --> 00:30:57,380
And I’m like, that’s not how the real world works.
632
00:30:57,400 --> 00:30:58,260
Like, that’s—[laugh]
633
00:30:59,030 --> 00:31:01,770
Yeah, we had a bunch of cases, like, we have very complex
634
00:31:01,770 --> 00:31:04,320
fat clients for interacting with storage systems that
635
00:31:04,320 --> 00:31:09,159
were just super challenging to rewrite in all the languages.
636
00:31:09,160 --> 00:31:11,799
And Google restricted the languages you could use in production.
637
00:31:11,830 --> 00:31:15,160
For a long time, it was C++ and Java.
638
00:31:15,820 --> 00:31:18,310
Python was added, but it wasn’t as widely
639
00:31:18,310 --> 00:31:20,990
used, not for serving workloads anyway.
640
00:31:20,990 --> 00:31:22,919
It was used more for tooling.
641
00:31:23,280 --> 00:31:27,100
Eventually Go came around, but you know, that was decades later.
642
00:31:27,440 --> 00:31:32,899
But rewriting the Colossus client to interact with the
643
00:31:32,900 --> 00:31:35,459
distributed file system, for example, you know, if that’s tens of
644
00:31:35,460 --> 00:31:37,720
thousands of lines of code, you don’t want to do that multiple times.
645
00:31:38,200 --> 00:31:41,839
So, how those things evolved over the years, I mean, eventually there was a
646
00:31:41,840 --> 00:31:47,660
way that was created for running those things without the normal sidecar model.
647
00:31:47,740 --> 00:31:50,330
They would all do it in the same container, effectively.
648
00:31:50,330 --> 00:31:55,000
But there were a bunch of reasons to have sidecars.
649
00:31:55,070 --> 00:31:58,769
And for anyone that wants to read more about this—I mean, we’ll link your
650
00:31:58,770 --> 00:32:02,239
blog posts in the show notes—“The Technical History of Kubernetes” is a
651
00:32:02,250 --> 00:32:06,089
collection of a lot of your old Twitter threads that gather a lot of these
652
00:32:06,210 --> 00:32:09,833
pieces together, which was a great combination of them all in one spot.
653
00:32:09,910 --> 00:32:13,560
“The Road to 1.0” post also has kind of a different perspective.
654
00:32:13,560 --> 00:32:16,120
It was more once the Kubernetes project started,
655
00:32:16,830 --> 00:32:18,899
how did it evolve for the first couple of years.
656
00:32:19,320 --> 00:32:21,570
Now, I’m going to make a jump here.
657
00:32:21,920 --> 00:32:23,330
Kubernetes is 10, going to be 11 years old.
658
00:32:23,330 --> 00:32:24,750
I mean, for me, it is more than 11 years old.
659
00:32:25,279 --> 00:32:25,299
Yeah,
660
00:32:27,719 --> 00:32:27,929
exactly.
661
00:32:28,069 --> 00:32:28,966
You’ve been on it for a while.
662
00:32:29,030 --> 00:32:32,129
But like, as far as, like, an open-source, the official, you know,
663
00:32:32,130 --> 00:32:38,449
stance, it hit ten in 2024, what has this shift for what Google
664
00:32:38,450 --> 00:32:41,830
was trying to do with Kubernetes initially, and the open-sourcing
665
00:32:42,000 --> 00:32:44,905
of everyone else also using this in various places, what has
666
00:32:44,950 --> 00:32:48,140
that done to the landscape of infrastructure and applications?
667
00:32:48,620 --> 00:32:52,030
From my perspective, one of the things that it did is it created
668
00:32:52,170 --> 00:32:55,489
an infrastructure ecosystem that was broader than any single cloud.
669
00:32:56,099 --> 00:32:58,629
Because at the time we started Kubernetes, there was
670
00:32:58,629 --> 00:32:59,289
the AWS ecosystem, and that was pretty much it, right?
671
00:33:03,229 --> 00:33:08,270
Like, obviously Google had before GCE was GA’ed, it had
672
00:33:08,309 --> 00:33:11,580
pretty negligible usage, no measurable market share, I think.
673
00:33:11,730 --> 00:33:13,910
And that was the time that the Kubernetes project started.
674
00:33:14,230 --> 00:33:17,759
Even Azure wasn’t really very, very present.
675
00:33:17,760 --> 00:33:21,540
And even now, ecosystem-wise, I look at Infrastructure as
676
00:33:21,540 --> 00:33:23,540
Code tools, for example, there are a bunch that work for
677
00:33:23,540 --> 00:33:28,439
AWS only, and there aren’t very many things that work only
678
00:33:28,490 --> 00:33:31,889
on the other clouds in the open-source ecosystem, at least.
679
00:33:32,509 --> 00:33:35,509
But Kubernetes sort of created its own island, where
680
00:33:36,049 --> 00:33:38,570
you could have this rich ecosystem that works pretty
681
00:33:38,570 --> 00:33:40,930
much anywhere, it works on-prem, it works on any cloud.
682
00:33:41,950 --> 00:33:43,880
People have differing opinions on whether it’s a good thing or a bad
683
00:33:43,880 --> 00:33:47,219
thing, but I view it as mostly a good thing that you have a large
684
00:33:47,219 --> 00:33:51,819
ecosystem of tools that work everywhere, and that was not the case before.
685
00:33:53,080 --> 00:33:56,440
And especially for the people who are on-prem, the
686
00:33:56,450 --> 00:34:00,600
thing that was available before was Mesos and OpenStack.
687
00:34:01,510 --> 00:34:06,850
And Mesos, in my opinion, it’s kind of overly complicated.
688
00:34:06,860 --> 00:34:10,020
The scheduling model just didn’t work at a theoretical
689
00:34:10,020 --> 00:34:15,739
level, and the open-source ecosystem was not as strong.
690
00:34:15,739 --> 00:34:19,089
Like, a lot of the big users just built their own frameworks and
691
00:34:19,089 --> 00:34:23,459
didn’t open-source them, and that’s sort of death to the ecosystem.
692
00:34:23,460 --> 00:34:26,949
But, you know, even those who did, the tooling was not
693
00:34:26,960 --> 00:34:31,190
compatible across frameworks, so it’s just super fragmented.
694
00:34:31,429 --> 00:34:34,060
So, it didn’t really have the potential to grow
695
00:34:34,560 --> 00:34:36,459
this sort of ecosystem that Kubernetes did.
696
00:34:36,480 --> 00:34:40,699
And then when we created the CNCF, you know, taking inspiration
697
00:34:40,699 --> 00:34:43,822
from what happened in the JavaScript area, where there was
698
00:34:43,822 --> 00:34:47,139
the Node.js Foundation, and I forget what the foundation
699
00:34:47,139 --> 00:34:49,940
was before they unified, but there was another foundation.
700
00:34:49,969 --> 00:34:54,800
And a couple of things like Express went into the Node.js Foundation,
701
00:34:54,800 --> 00:34:58,404
but most other projects were not accepted into that foundation, so they
702
00:34:58,639 --> 00:35:03,209
had to find a home in some other foundation, and that was really awkward.
703
00:35:03,210 --> 00:35:07,720
So, one thing I wanted to do with CNCF was ensure
704
00:35:07,720 --> 00:35:09,490
there was a home for all those other projects.
705
00:35:09,850 --> 00:35:13,970
Before CNCF was really ready, the Kubernetes project itself kind of
706
00:35:13,980 --> 00:35:16,610
became an umbrella project and took on a bunch of those projects.
707
00:35:16,969 --> 00:35:21,300
Like Kubespray, for example, for setting up Kubernetes clusters with Ansible.
708
00:35:21,550 --> 00:35:24,639
But, you know, after we created the—initially, it was
709
00:35:24,779 --> 00:35:27,029
called Inception, I think, but then, you know, after it became the
710
00:35:27,030 --> 00:35:31,860
Sandbox, then kind of the doors really opened to all those projects.
711
00:35:31,870 --> 00:35:35,190
So, I think that’s been very positive for
712
00:35:35,190 --> 00:35:37,080
experimentation and developing of new things.
713
00:35:37,080 --> 00:35:39,952
You know, it does give you a paradox of choice, it makes things a
714
00:35:39,952 --> 00:35:43,930
little bit hard for figuring out what you should actually use versus
715
00:35:43,930 --> 00:35:48,000
what’s available, but overall, I see it as a very healthy development.
716
00:35:51,610 --> 00:35:54,109
Running Kubernetes at scale is challenging.
717
00:35:54,480 --> 00:35:58,390
Running Kubernetes at scale securely is even more challenging.
718
00:35:58,390 --> 00:35:58,587
Are you struggling with access management and user management?
719
00:35:58,609 --> 00:36:01,450
Access management and user management are some of the most
720
00:36:01,450 --> 00:36:04,690
important tools that we have today to be able to secure
721
00:36:04,840 --> 00:36:07,700
your Kubernetes cluster and protect your infrastructure.
722
00:36:07,890 --> 00:36:12,439
Using Tremolo Security with OpenUnison is the easiest way, whether it be
723
00:36:12,469 --> 00:36:16,889
on prem or in the cloud, to simplify access management to your clusters.
724
00:36:17,020 --> 00:36:21,140
It provides single sign-on and helps you with its robust security
725
00:36:21,140 --> 00:36:24,210
features to secure your clusters and automate your workflows.
726
00:36:24,600 --> 00:36:28,850
So check out Tremolo Security for your single sign-on needs in Kubernetes.
727
00:36:29,179 --> 00:36:31,070
You can find them at fafo.fm
728
00:36:33,330 --> 00:36:34,620
slash Tremolo.
729
00:36:34,920 --> 00:36:40,259
That's T-R-E-M-O-L-O.
730
00:36:47,200 --> 00:36:52,049
I feel like you guys did a great job with almost unifying a lot of things and
731
00:36:52,049 --> 00:36:57,580
just kind of having, I don’t know—were you and has anybody ever done anything
732
00:36:57,580 --> 00:37:01,260
with Kubernetes that you were just, like, almost offended by that it’s so—
733
00:37:01,260 --> 00:37:01,370
[laugh].
734
00:37:01,370 --> 00:37:01,480
[laugh].
735
00:37:01,590 --> 00:37:02,389
This is your baby.
736
00:37:02,389 --> 00:37:05,389
You’ve seen it go from so many—I have, like,
737
00:37:05,389 --> 00:37:07,019
three questions, but I want to start here [laugh].
738
00:37:08,210 --> 00:37:11,810
Well certainly, there were a lot of things I was very—that
739
00:37:11,810 --> 00:37:14,460
I didn’t really imagine that I was very happy about.
740
00:37:14,560 --> 00:37:17,980
Retail edge was one of these scenarios where I
741
00:37:17,980 --> 00:37:20,650
wanted to make sure Kubernetes could scale down.
742
00:37:20,679 --> 00:37:23,669
Borg, I think the minimum footprint is, like, 300 machines
743
00:37:23,670 --> 00:37:26,060
or something at the time I worked on it, so there’s no
744
00:37:26,060 --> 00:37:28,230
way it could scale down to something you could just run.
745
00:37:28,789 --> 00:37:30,840
And Mesos kind of had that problem, too.
746
00:37:30,840 --> 00:37:33,210
It had a lot of components, multiple stateful components.
747
00:37:33,520 --> 00:37:34,084
Cloud Foundry required a bunch of components.
748
00:37:34,084 --> 00:37:34,430
So, I wanted it to be
749
00:37:37,390 --> 00:37:39,810
able to scale down to one node, so it just
750
00:37:39,940 --> 00:37:42,106
has one stateful component, which is etcd.
751
00:37:42,106 --> 00:37:44,640
It doesn’t have, like, a separate message bus, you know,
752
00:37:44,640 --> 00:37:47,050
although that was a design that could be considered.
753
00:37:47,340 --> 00:37:50,009
But the reason was for doing kind of local development,
754
00:37:50,100 --> 00:37:52,770
like, Minikube or Kind type things, mostly.
755
00:37:53,000 --> 00:37:55,550
Retail edge was sort of really fun that, you know, it’s
756
00:37:55,550 --> 00:37:59,120
like in every Target store, been on spacecraft, and ships,
757
00:37:59,120 --> 00:38:02,380
and all kinds of other places I never really imagined.
758
00:38:02,380 --> 00:38:04,970
In terms of offended, you know, I remember one time—
759
00:38:04,970 --> 00:38:06,720
Like, have they ever made it, like, overly complicated when
760
00:38:06,720 --> 00:38:08,430
you were trying to make it simple or just something that
761
00:38:08,430 --> 00:38:11,220
you’re just like, “Dude, I was trying so hard to prevent this.”
762
00:38:11,890 --> 00:38:12,390
There is.
763
00:38:12,390 --> 00:38:15,700
I mean, early on, I was very concerned about fragmentation,
764
00:38:15,710 --> 00:38:18,350
which is why I helped create the conformance program.
765
00:38:18,520 --> 00:38:21,970
So, all the attempts to sort of fork it and do something a
766
00:38:21,980 --> 00:38:25,515
little bit different, and there were some cases like that where
767
00:38:25,590 --> 00:38:28,730
some people said, oh, I just want to run the pod executor.
768
00:38:28,730 --> 00:38:31,500
I just want to run Kubelet, but I need to make changes.
769
00:38:31,510 --> 00:38:32,290
No, no, no.
770
00:38:32,560 --> 00:38:36,430
You actually need to make sure that the API works.
771
00:38:36,509 --> 00:38:38,440
When Kubernetes was sort of young and
772
00:38:38,440 --> 00:38:41,069
vulnerable, I think that was a big concern I had.
773
00:38:41,410 --> 00:38:43,320
Or other cases, like, the Virtual Kubelets,
774
00:38:44,230 --> 00:38:48,250
you know, I didn’t want to fork the ecosystems.
775
00:38:48,250 --> 00:38:49,870
Like, oh, only certain things work with Virtual
776
00:38:50,360 --> 00:38:51,840
Kubelets, or only certain things work with Windows.
777
00:38:51,849 --> 00:38:56,420
So, on Virtual Kubelet, I kind of started sketching a bar
778
00:38:56,420 --> 00:39:00,120
for what I think compatibility would need to be required.
779
00:39:00,120 --> 00:39:01,179
Minimum Kubelet [laugh].
780
00:39:01,530 --> 00:39:06,100
I honestly think that your work in that aspect really shows, though because
781
00:39:06,550 --> 00:39:10,290
even when people say that Kubernetes is difficult, there’s a reason why so many
782
00:39:10,290 --> 00:39:17,070
people use it because it really does have that whole ecosystem that is really,
783
00:39:17,139 --> 00:39:21,200
kind of—I think open-source can be so political, and the fact that there’s
784
00:39:21,200 --> 00:39:25,130
so many different projects, but they all kind of align is really impressive.
785
00:39:25,590 --> 00:39:27,630
Were you involved in the naming because, like,
786
00:39:27,710 --> 00:39:30,509
Kubernetes naming, like, just cracks me up.
787
00:39:30,650 --> 00:39:30,740
No.
788
00:39:30,759 --> 00:39:33,430
Honestly, the naming was outsourced.
789
00:39:33,719 --> 00:39:36,239
There’s, like, a search for potential names
790
00:39:36,240 --> 00:39:37,876
and a trademark search, and things like that.
791
00:39:38,070 --> 00:39:39,250
That aspect is pretty boring.
792
00:39:39,270 --> 00:39:40,390
Lawyers got involved.
793
00:39:40,390 --> 00:39:40,923
And [laugh]—
794
00:39:40,923 --> 00:39:41,216
Yes.
795
00:39:41,509 --> 00:39:41,539
[laugh].
796
00:39:41,539 --> 00:39:45,249
You know, it couldn't be named what the code name
797
00:39:45,250 --> 00:39:48,590
was, so, you know, that was never a contender.
798
00:39:48,820 --> 00:39:51,529
But did you have any influence on the fact that it’s Greek, right?
799
00:39:51,559 --> 00:39:52,680
Like, all the different—
800
00:39:53,080 --> 00:39:54,559
I mean, it did start a trend.
801
00:39:54,610 --> 00:39:58,670
Istio, for example, for a while, everything was getting a Greek name.
802
00:39:59,130 --> 00:40:01,070
I now work at a startup that has a Greek name.
803
00:40:01,070 --> 00:40:01,800
This is
804
00:40:03,960 --> 00:40:04,092
how this works [laugh].
805
00:40:04,092 --> 00:40:04,100
That’s what I’m saying.
806
00:40:04,100 --> 00:40:07,110
Like, I feel like, just the continuity of the naming started, kind of, a lot
807
00:40:07,110 --> 00:40:09,940
of the way that people start choosing to name their open-source projects.
808
00:40:09,940 --> 00:40:13,930
And kind of, you almost make sure you could relate the fact
809
00:40:13,930 --> 00:40:17,240
that these projects were related by their naming, you know?
810
00:40:17,240 --> 00:40:17,986
I thought that was cool.
811
00:40:18,130 --> 00:40:20,100
It seems like Kubernetes was the first to really do that.
812
00:40:20,420 --> 00:40:21,300
Docker did it as well.
813
00:40:21,300 --> 00:40:22,740
There were a bunch of shipping analogies
814
00:40:22,740 --> 00:40:25,880
and… and Helm sort of followed that pattern.
815
00:40:26,340 --> 00:40:28,970
I mean, themes were big for any technology.
816
00:40:28,970 --> 00:40:31,830
Like, config management, you had Puppet, and you had Chef, and you had
817
00:40:31,830 --> 00:40:34,795
all these, like, words that, like, oh, it has to be the cookbook and the—
818
00:40:34,795 --> 00:40:36,204
My opinion, Salt took it to an extreme.
819
00:40:36,204 --> 00:40:37,040
Kubernetes had so many though.
820
00:40:37,320 --> 00:40:37,420
And the fact—
821
00:40:37,590 --> 00:40:40,780
Salt with the pillars and everything else, yeah, you’re right.
822
00:40:40,860 --> 00:40:40,960
[laugh]. That’s true.
823
00:40:42,260 --> 00:40:45,600
Okay, so with your experience, right, you’ve gone through the
824
00:40:45,600 --> 00:40:50,330
chips, you’ve gone through supercomputing early, you were in the,
825
00:40:50,350 --> 00:40:54,190
you know, C and Java, and now, with people wanting to rewrite
826
00:40:54,200 --> 00:40:57,120
everything—you saw when they wanted to rewrite everything in Java,
827
00:40:57,120 --> 00:40:59,840
right, now, everybody wants to rewrite everything in Rust, right?
828
00:40:59,860 --> 00:41:04,430
You saw supercomputing before it was cool, and now everything is chip boom, AI.
829
00:41:05,119 --> 00:41:08,040
Are there patterns that you see that, like, either you’re
830
00:41:08,040 --> 00:41:11,250
excited about or alarmed about, or is it weird seeing
831
00:41:11,250 --> 00:41:13,470
it go from where you started with all these things?
832
00:41:13,520 --> 00:41:16,339
And it’s kind of like the same but different?
833
00:41:16,339 --> 00:41:19,490
[Kind of] same, but different aspect is, you know, I think what keeps
834
00:41:19,490 --> 00:41:23,245
software engineers employed, so I can’t argue too much with that,
835
00:41:23,480 --> 00:41:27,710
but redoing the same things over and over in slightly different and
836
00:41:27,720 --> 00:41:31,850
hopefully better ways is, I think, something that will continue to exist.
837
00:41:31,850 --> 00:41:34,470
Like, now, everything with AI, right?
838
00:41:34,559 --> 00:41:38,700
So, it’s very reminiscent of the dotcom bubble in that
839
00:41:38,700 --> 00:41:41,889
sense, where everything’s like a retail store, but dotcom.
840
00:41:42,059 --> 00:41:46,240
Mostly, there were a few big winners there, like, you know,
841
00:41:46,240 --> 00:41:50,470
Amazon, eBay, but you know, most of the companies did not succeed.
842
00:41:50,470 --> 00:41:54,810
A lot of the kind of existing companies got their act together and
843
00:41:54,810 --> 00:41:59,150
put together a web storefront, right, and now that’s easier than ever.
844
00:41:59,670 --> 00:42:03,470
So, I think AI will kind of be similar where, you know, there’s a
845
00:42:03,470 --> 00:42:07,210
bunch of startups that are experimenting in cases where they are
846
00:42:07,210 --> 00:42:10,480
sort of doing something that people already do, but just with AI.
847
00:42:10,820 --> 00:42:12,060
Sprinkle a little AI on it.
848
00:42:12,060 --> 00:42:12,550
Yep [laugh].
849
00:42:13,310 --> 00:42:14,640
That will probably end up being a product feature.
850
00:42:14,860 --> 00:42:17,439
In the positive case for them, it will end up being
851
00:42:17,440 --> 00:42:20,379
an acquisition that makes it into an existing product.
852
00:42:20,680 --> 00:42:23,589
It is super challenging for big companies to innovate,
853
00:42:23,910 --> 00:42:26,830
certainly a challenge that Google has, I think.
854
00:42:27,330 --> 00:42:28,509
Honestly, Google always had it.
855
00:42:28,510 --> 00:42:31,580
So, if you think about what are the big products at Google, a lot
856
00:42:31,580 --> 00:42:34,554
of them are acquisitions, even things you think of, of Google is
857
00:42:34,554 --> 00:42:37,530
all about ads, I mean, most of that technology is acquisitions.
858
00:42:37,810 --> 00:42:39,370
Yeah, DoubleClick, and—yeah.
859
00:42:40,130 --> 00:42:43,070
I always find it interesting where it’s like, it’s not that you can’t innovate
860
00:42:43,140 --> 00:42:46,230
at a large company, it’s that it’s really hard to get that to actually have
861
00:42:46,240 --> 00:42:49,900
impact because I know so many cool, innovative internal projects that have
862
00:42:49,900 --> 00:42:53,710
been at all these big companies, but the only way they get it to be an impact
863
00:42:53,719 --> 00:42:56,780
at the company is they have to leave, go make a startup, and they get bought
864
00:42:56,780 --> 00:42:59,990
by the company, and [laugh] now they have a say of like, oh, now it’s the
865
00:42:59,990 --> 00:43:04,150
innovative thing that I was doing here ten years ago, but you didn’t believe me.
866
00:43:04,219 --> 00:43:07,949
That’s also how we reward certain innovation.
867
00:43:07,950 --> 00:43:09,450
Like, people are always trying to figure
868
00:43:09,450 --> 00:43:11,409
out the projects that go in their promo doc.
869
00:43:11,469 --> 00:43:15,779
And if you don’t reward a certain type of innovation, you’re—
870
00:43:15,880 --> 00:43:16,150
Yeah.
871
00:43:16,330 --> 00:43:17,580
—almost strangling it.
872
00:43:17,580 --> 00:43:20,670
That system is very rigged for a certain type of innovation.
873
00:43:20,670 --> 00:43:22,950
And it’ll be, like, the dumbest projects that they waste the
874
00:43:22,960 --> 00:43:26,560
stupidest amount of money on, and it has absolutely no value,
875
00:43:26,560 --> 00:43:29,250
and then—when people talk about empire-building, you know what I
876
00:43:29,250 --> 00:43:32,819
mean?—and then somebody actually built something that’s helpful and
877
00:43:32,820 --> 00:43:36,290
cool, and they have to go [laugh] [unintelligible] and come back.
878
00:43:36,630 --> 00:43:37,705
I mean, like, even look at Meta.
879
00:43:37,860 --> 00:43:40,490
Like, it most—look at all the acquisitions they’ve done.
880
00:43:40,570 --> 00:43:43,890
I mean, a lot of times, when these things start, like PeakStream,
881
00:43:43,900 --> 00:43:47,069
for example, it’s not clear that something is going to be—whether
882
00:43:47,070 --> 00:43:49,319
it’s going to succeed, whether it’s going to be important.
883
00:43:49,630 --> 00:43:50,410
It’s a risk, right?
884
00:43:50,410 --> 00:43:53,750
Like Nvidia played a really long bet on compute on GPUs.
885
00:43:55,620 --> 00:43:57,300
ATI, at the time, decided not to do that,
886
00:43:57,490 --> 00:44:00,410
and they ended up getting acquired by AMD.
887
00:44:00,410 --> 00:44:01,629
And AMD doubled down on graphics.
888
00:44:01,630 --> 00:44:04,899
And they actually won all the consoles, laptops, mobile
889
00:44:04,909 --> 00:44:08,330
phone deals, like, all of them away from Nvidia at that time.
890
00:44:08,550 --> 00:44:11,149
For a long time, basically the national labs were the customers of that
891
00:44:11,150 --> 00:44:14,450
stuff, but now, it’s everybody, so the long bet has really paid off.
892
00:44:14,490 --> 00:44:18,200
But that really requires a lot of faith, I think.
893
00:44:18,710 --> 00:44:22,789
It’s crazy how—you know how, at one point, Apple invested
894
00:44:22,799 --> 00:44:25,560
in—wait, was it Windows invested in Apple, right?
895
00:44:25,600 --> 00:44:29,390
And then how AMD was doing better than Nvidia at one point, you know?
896
00:44:29,390 --> 00:44:31,920
Like, just the way that the—just, it’s so hard
897
00:44:31,920 --> 00:44:33,940
to know what is going to work out, you know?
898
00:44:33,940 --> 00:44:36,380
Like, look at where we’re talking about the dotcom, and remember when
899
00:44:36,380 --> 00:44:39,094
we had Rich on Ship It, and he was talking about how huge WebMD was—
900
00:44:39,099 --> 00:44:39,879
WebMD, yeah.
901
00:44:39,940 --> 00:44:40,370
—right?
902
00:44:40,889 --> 00:44:43,420
And then we were just talking about Amazon versus eBay.
903
00:44:43,470 --> 00:44:45,450
Who even buys stuff on eBay, anymore [laugh]?
904
00:44:45,450 --> 00:44:47,435
I just bought stuff on eBay.
905
00:44:47,435 --> 00:44:48,255
What are you talking about [laugh]?
906
00:44:48,260 --> 00:44:52,320
You and, like, five other people [laugh]. You know what I mean?
907
00:44:52,320 --> 00:44:57,280
Like, Yahoo was so big, and now nobody uses that, and it’s just crazy.
908
00:44:57,360 --> 00:45:01,170
And I feel like I haven’t even been involved in tech that long,
909
00:45:01,180 --> 00:45:04,720
and I can’t even imagine the things you’ve seen in 30 years, Brian.
910
00:45:04,890 --> 00:45:05,856
Like, you’ve seen it go—
911
00:45:05,856 --> 00:45:09,659
So, as far as doing things too early, multiple times, Transmeta’s chips
912
00:45:09,679 --> 00:45:14,420
were low power, general purpose computing chips, and they went into
913
00:45:15,059 --> 00:45:22,390
devices like ultra-light laptops, tablets, wearables, smartphones, in 2000.
914
00:45:22,950 --> 00:45:23,760
The year 2000.
915
00:45:23,760 --> 00:45:24,310
No way.
916
00:45:24,660 --> 00:45:27,500
Did you bet on anything or really believe in anything, and then nobody
917
00:45:27,500 --> 00:45:30,920
thought it was cool, and then now you’re like, see, [laugh] like, I told you.
918
00:45:31,360 --> 00:45:33,710
Well, so in Transmeta, yeah.
919
00:45:34,449 --> 00:45:36,870
And I really liked what Transmeta was doing.
920
00:45:36,880 --> 00:45:40,009
And that was kind of my dream job because in school I had
921
00:45:40,010 --> 00:45:42,350
electrical engineering classes, and computing classes, and
922
00:45:42,350 --> 00:45:44,850
things like that, but I started programming when I was ten.
923
00:45:45,099 --> 00:45:47,840
The first computer was a kit computer that my
924
00:45:47,840 --> 00:45:51,220
dad built, a 6502-based KIM-1 kit computer.
925
00:45:51,220 --> 00:45:54,860
And it had no persistent memory, no persistent
926
00:45:54,870 --> 00:45:57,600
disks, nothing, and no ROM with firmware.
927
00:45:58,660 --> 00:46:00,790
So, every time you turn on the power, it’s a clean slate.
928
00:46:00,870 --> 00:46:01,520
There’s nothing.
929
00:46:01,730 --> 00:46:02,769
There’s no assembler.
930
00:46:03,090 --> 00:46:03,619
There’s nothing.
931
00:46:03,619 --> 00:46:05,610
It just had an LED display and a hex keypad.
932
00:46:05,679 --> 00:46:07,989
So, I would have to type in the program from
933
00:46:08,000 --> 00:46:09,460
scratch every time you turn on the power.
934
00:46:09,630 --> 00:46:12,600
And back in those days, Byte magazine would have
935
00:46:12,850 --> 00:46:16,780
6502 assembly programs, and I would have to manually—
936
00:46:16,890 --> 00:46:17,590
Flip them all [laugh]?
937
00:46:17,960 --> 00:46:20,310
—manually assemble them, and type in the
938
00:46:20,310 --> 00:46:23,180
hexadecimal machine code and then run it.
939
00:46:23,429 --> 00:46:26,150
But anyway, when we got an Apple II, we’d turn
940
00:46:26,150 --> 00:46:28,100
on the power, and there would be a prompt, right?
941
00:46:28,100 --> 00:46:30,939
There would be a program running, and that was just so amazing for me.
942
00:46:30,940 --> 00:46:34,810
So, you know, Transmeta, I really learned, from the time you
943
00:46:34,810 --> 00:46:37,690
turn on the power, what happens, how does the computer work?
944
00:46:38,360 --> 00:46:42,090
Like, I worked on the code that decompressed
945
00:46:42,090 --> 00:46:44,089
the firmware out of the ROM, for example.
946
00:46:44,099 --> 00:46:45,759
I worked on frequency-voltage scaling.
947
00:46:46,400 --> 00:46:48,209
I worked on the static compiler.
948
00:46:49,369 --> 00:46:54,570
So, we had software TLB handlers that ran through my static compiler.
949
00:46:54,810 --> 00:46:58,079
Like I dealt with things all at, like, this crazy, super low level.
950
00:46:58,530 --> 00:47:02,130
If the instructions didn’t get scheduled right, the chip had no interlocks.
951
00:47:02,280 --> 00:47:05,389
What an interlock does is, if you have one instruction that writes the
952
00:47:05,389 --> 00:47:08,435
register, and another instruction that reads from that register, an
953
00:47:08,620 --> 00:47:15,490
interlock will stall the CPU pipeline until that register value is written.
954
00:47:15,910 --> 00:47:17,930
There’s like a scoreboard that keeps track of these things.
955
00:47:18,400 --> 00:47:21,769
Transmeta chips, in order to be low power, is trying
956
00:47:21,770 --> 00:47:24,560
to cut circuit count, so it didn’t have interlocks.
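A toy sketch of the interlock idea described here, not Transmeta's actual design: the instruction format, register names, and latencies below are made up for illustration. `run_with_interlock` models a scoreboard that stalls an instruction until its source registers have been written. `find_hazards` models a chip with no interlocks, where an instruction that issues too early simply reads a stale value, so the compiler's scheduler has to get the ordering right.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record, for illustration only."""
    name: str
    dest: Optional[str]          # register written, if any
    srcs: Tuple[str, ...] = ()   # registers read
    latency: int = 1             # cycles until dest holds the new value


def run_with_interlock(program: List[Instr]) -> Tuple[List[int], int]:
    """In-order issue with a scoreboard: stall until all sources are ready.

    Returns (issue cycle of each instruction, total stall cycles).
    """
    ready_at: Dict[str, int] = {}   # scoreboard: register -> cycle its value is ready
    cycle = 0
    issue_cycles: List[int] = []
    stalls = 0
    for ins in program:
        earliest = max([ready_at.get(r, 0) for r in ins.srcs], default=0)
        issue = max(cycle, earliest)   # the interlock holds the pipeline here
        stalls += issue - cycle
        issue_cycles.append(issue)
        if ins.dest is not None:
            ready_at[ins.dest] = issue + ins.latency
        cycle = issue + 1
    return issue_cycles, stalls


def find_hazards(program: List[Instr]) -> List[str]:
    """No interlocks: one instruction issues per cycle no matter what.

    Returns the names of instructions that would read a register too early.
    """
    ready_at: Dict[str, int] = {}
    hazards: List[str] = []
    for cycle, ins in enumerate(program):
        if any(ready_at.get(r, 0) > cycle for r in ins.srcs):
            hazards.append(ins.name)
        if ins.dest is not None:
            ready_at[ins.dest] = cycle + ins.latency
    return hazards


program = [
    Instr("load r1", dest="r1", latency=3),
    Instr("add r2", dest="r2", srcs=("r1",)),
    Instr("mul r3", dest="r3", srcs=("r2",), latency=2),
]
issue_cycles, stalls = run_with_interlock(program)
hazards = find_hazards(program)
```

With the interlock, the add waits for the load to finish. Without it, the same program is flagged, and a static scheduler would have to reorder instructions or pad with no-ops to make it safe.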
957
00:47:25,000 --> 00:47:27,129
That leads me right into, like, the last thing I want to talk about here
958
00:47:27,140 --> 00:47:31,250
because we have this—Kubernetes thing exists, we have this extensible API that
959
00:47:31,250 --> 00:47:34,569
you helped make it conformant so it is consistent for everyone in whatever
960
00:47:34,570 --> 00:47:39,990
environment they’re in, and in one of the ways that we’ve been seeing with that
961
00:47:40,010 --> 00:47:44,930
is this notion of using that API and this notion of control loops to do more
962
00:47:44,950 --> 00:47:51,109
infrastructure management, things like Config Connector at GCP, ACK at AWS.
963
00:47:51,109 --> 00:47:53,150
And they’re reimplementing some of that, like you mentioned
964
00:47:53,160 --> 00:47:55,410
in, like, very cloud specific, like, this is my cloud
965
00:47:55,410 --> 00:47:57,190
implementation of this thing because I know the APIs.
966
00:47:57,309 --> 00:48:01,310
And in most cases, those are now generated from the APIs, right?
967
00:48:01,310 --> 00:48:03,710
Like, we’re not manually writing this stuff out again.
968
00:48:04,150 --> 00:48:08,700
Like, with Terraform, we had to do a lot of manual stuff to make providers work.
969
00:48:09,000 --> 00:48:13,909
And there’s this new wave of Terraform-like things that are happening,
970
00:48:14,670 --> 00:48:17,389
which is also, again, you started taking a risk there and looking
971
00:48:17,389 --> 00:48:20,940
into this more, and what do you see coming in that area next?
972
00:48:21,300 --> 00:48:26,450
Well, for the Kubernetes-based controllers, and in general, what I’ve
973
00:48:26,450 --> 00:48:31,209
seen, I came up with the idea for what became Config Connector around
974
00:48:31,210 --> 00:48:35,619
the end of 2017, when Kubernetes initially had third-party resources,
975
00:48:35,619 --> 00:48:41,010
and then that was redesigned to Custom Resource Definitions, CRDs.
976
00:48:41,710 --> 00:48:43,560
CRDs were in beta for a really long time.
977
00:48:43,560 --> 00:48:46,939
It had a lot of features that were hard to get to the GA level.
978
00:48:46,969 --> 00:48:49,330
But it was starting to become popular at that time.
979
00:48:49,340 --> 00:48:51,310
People were writing controllers to manage,
980
00:48:51,310 --> 00:48:53,729
like, S3 buckets and individual cloud resources.
981
00:48:53,920 --> 00:48:57,640
I saw it as a way to solve a couple of problems for Google.
982
00:48:57,640 --> 00:49:00,180
And Google had a Deployment Manager product that had a
983
00:49:00,180 --> 00:49:03,779
bunch of technical and non-technical challenges at the time.
984
00:49:03,790 --> 00:49:05,900
Kubernetes and Terraform started at the same
985
00:49:05,920 --> 00:49:09,859
time, so Terraform is still pretty early in 2017.
986
00:49:09,929 --> 00:49:14,270
You know, Ansible was way more used at that time than Terraform.
987
00:49:14,800 --> 00:49:18,020
We did have a team that had started to maintain Terraform,
988
00:49:18,870 --> 00:49:23,570
and it had a semi… I would say, semi-automatic, ability
989
00:49:23,570 --> 00:49:26,219
to generate the Terraform providers from the APIs.
990
00:49:26,649 --> 00:49:28,150
And that still remains true.
991
00:49:28,150 --> 00:49:30,420
It’s still semi-automatic, it’s not fully automatic.
992
00:49:30,610 --> 00:49:32,250
And I actually wrote a blog post about some of the
993
00:49:32,250 --> 00:49:34,939
challenges with APIs that make it hard to automate.
994
00:49:35,460 --> 00:49:38,370
And I don’t think Google’s APIs are the only ones that have these issues.
995
00:49:38,690 --> 00:49:42,764
Kubernetes was growing a lot by the end of 2017.
996
00:49:42,799 --> 00:49:47,530
I think that’s when AWS launched EKS, and VMware, and, you know, pretty
997
00:49:47,530 --> 00:49:51,590
much everybody, even Mesosphere, had, like, a Kubernetes product.
998
00:49:52,020 --> 00:49:54,950
So, it seemed like with a Kubernetes-centric universe, maybe
999
00:49:54,950 --> 00:49:57,220
it would be something you would want to do, and it would
1000
00:49:57,220 --> 00:50:00,930
provide that more consistent API that you couldn’t get from
1001
00:50:01,309 --> 00:50:04,149
the providers, so something you could build tooling against.
1002
00:50:04,580 --> 00:50:08,510
You know, there are some big Google Cloud customers that adopted
1003
00:50:08,510 --> 00:50:13,609
it, but overall, not remotely as many as have adopted Terraform.
1004
00:50:13,910 --> 00:50:19,699
And it’s much less popular, especially for—even amongst GKE customers, it’s
1005
00:50:19,700 --> 00:50:22,289
not nearly as popular, and most of those platform teams know Terraform.
1006
00:50:23,240 --> 00:50:25,790
And they’re used to Terraform, so they manage infrastructure with Terraform.
1007
00:50:26,610 --> 00:50:30,859
I think the one potentially sweet spot for it is for resources
1008
00:50:30,860 --> 00:50:33,319
that application developers would need to interact with, like,
1009
00:50:33,440 --> 00:50:36,690
a database, or a Redis instance, or message queue or something like
1010
00:50:36,700 --> 00:50:39,209
that, from the cloud provider where you could, in theory, provision
1011
00:50:39,210 --> 00:50:42,280
it using the same sort of tooling that's used to deploy your app.
1012
00:50:42,480 --> 00:50:45,150
Although, you know, these days—people used to love Kubernetes
1013
00:50:45,150 --> 00:50:47,230
in the early days, that was always very gratifying.
1014
00:50:47,400 --> 00:50:50,710
Some users would say, you know, it changed their lives and things like that.
1015
00:50:51,020 --> 00:50:53,320
These days, with the larger number of people using
1016
00:50:53,320 --> 00:50:55,509
it, you get some people who don’t love it as much.
1017
00:50:55,520 --> 00:50:57,759
You know, anything widely used has that.
1018
00:50:57,759 --> 00:50:58,809
Terraform has that, too.
1019
00:50:58,840 --> 00:50:59,590
Helm has that.
1020
00:50:59,799 --> 00:51:05,390
But yeah, it just hasn’t really materialized, people managing resources there.
1021
00:51:05,430 --> 00:51:09,000
Crossplane is probably the most prevalent way, although, you
1022
00:51:09,000 --> 00:51:12,690
know, not on GCP because GCP customers want to use something
1023
00:51:12,690 --> 00:51:16,089
that’s supported, and GCP endorses and things like that.
1024
00:51:16,150 --> 00:51:20,460
So, ACK and the Azure Service Operator, I’d be interested
1025
00:51:20,460 --> 00:51:22,500
to know how many users there are, but just looking
1026
00:51:22,500 --> 00:51:25,730
at, kind of, social media posts and things like that.
1027
00:51:25,880 --> 00:51:28,709
I feel like it kind of came out of this notion, especially in, like,
1028
00:51:28,710 --> 00:51:32,790
the serverless worlds, where once you deploy a Lambda function, you’re
1029
00:51:32,790 --> 00:51:36,290
like, oh, I need my queuing system, and my S3 bucket, and my database,
1030
00:51:36,420 --> 00:51:40,040
and I want them all deployed from the same CloudFormation stack.
1031
00:51:40,190 --> 00:51:43,270
And people were like, oh, I could replicate the same thing with containers,
1032
00:51:43,410 --> 00:51:46,699
and get that same sort of feeling of, I don’t care about the infrastructure,
1033
00:51:46,700 --> 00:51:49,740
but someone has to care about how that infrastructure got there, and who
1034
00:51:49,740 --> 00:51:53,129
runs those controllers, and how they’re authenticated, and where they go.
1035
00:51:53,129 --> 00:51:56,280
And usually that used to be a service of something like CloudFormation,
1036
00:51:56,590 --> 00:52:00,670
and now it’s something that, oh, the platform team has to run 87 different
1037
00:52:00,670 --> 00:52:04,049
controllers for every different connection that we want [laugh] to put in there.
1038
00:52:04,150 --> 00:52:04,840
Right, yeah.
1039
00:52:04,910 --> 00:52:09,190
And upgrading controllers and CRDs is still pretty challenging.
1040
00:52:09,349 --> 00:52:12,653
I actually wrote a blog post about using KRM, the Kubernetes
1041
00:52:12,660 --> 00:52:15,720
Resource Model, for provisioning cloud infrastructure as well.
1042
00:52:15,720 --> 00:52:18,850
There are a bunch of challenges with using the Kubernetes tooling,
1043
00:52:18,850 --> 00:52:23,160
like a lot of the cloud APIs are designed so that you call one,
1044
00:52:23,160 --> 00:52:26,360
it gets provisioned, some IP addresses are allocated or something.
1045
00:52:26,360 --> 00:52:27,520
You get that back in a result.
1046
00:52:27,820 --> 00:52:30,930
That may take 20 minutes, it may take a long time, then you need
1047
00:52:31,030 --> 00:52:33,650
to take those values and pass them as inputs to another call.
1048
00:52:34,480 --> 00:52:39,270
And that requires orchestration at a level that—you know, in
1049
00:52:39,270 --> 00:52:41,325
Kubernetes, everything—the controllers are all designed so
1050
00:52:41,400 --> 00:52:43,479
you just apply everything and the controllers sort it out.
1051
00:52:44,250 --> 00:52:49,190
And if you don’t design your infrastructure controllers to do the same
1052
00:52:49,190 --> 00:52:52,009
thing, the Kubernetes controlling functionality doesn’t actually work.
1053
00:52:52,330 --> 00:52:59,020
So, like, if you deploy a set of resources with Helm, and you can’t actually
1054
00:52:59,440 --> 00:53:02,730
provision one thing until the other thing is already provisioned, and your
1055
00:53:02,730 --> 00:53:05,590
controller doesn’t do the waiting, Helm’s not going to do the waiting.
1056
00:53:05,920 --> 00:53:06,920
Like, you’re just hosed.
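To make the "apply everything and the controllers sort it out" point concrete, here is a minimal sketch, assuming a toy in-memory cluster rather than any real Kubernetes client or cloud API. `Resource`, `reconcile`, and `run_until_converged` are hypothetical names. Each reconcile pass checks whether a resource's dependencies are ready and, if not, asks to be retried later, which is the waiting that the deployment tool will not do on the controller's behalf.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    """Hypothetical declared resource; not a real Kubernetes type."""
    name: str
    depends_on: List[str] = field(default_factory=list)
    ready: bool = False
    outputs: Dict[str, object] = field(default_factory=dict)


def reconcile(res: Resource, cluster: Dict[str, Resource]) -> bool:
    """One reconcile pass for one resource.

    Returns True if the resource should be retried later (requeued).
    """
    if res.ready:
        return False
    for dep_name in res.depends_on:
        dep = cluster.get(dep_name)
        if dep is None or not dep.ready:
            return True   # dependency not provisioned yet: wait and retry
    # All dependencies are ready, so their outputs can be used as inputs.
    res.outputs = {
        "id": f"{res.name}-id",
        "inputs": {d: cluster[d].outputs["id"] for d in res.depends_on},
    }
    res.ready = True
    return False


def run_until_converged(cluster: Dict[str, Resource], max_rounds: int = 10) -> int:
    """Reconcile every resource repeatedly; return the number of rounds used."""
    for round_no in range(1, max_rounds + 1):
        requeue = [reconcile(res, cluster) for res in cluster.values()]
        if not any(requeue):
            return round_no
    raise RuntimeError("resources did not converge")


# Everything is "applied" at once, deliberately in the wrong order.
cluster = {
    "app": Resource("app", depends_on=["database"]),
    "database": Resource("database", depends_on=["network"]),
    "network": Resource("network"),
}
rounds = run_until_converged(cluster)
```

Because each reconcile pass tolerates missing dependencies and retries, the order in which the resources were applied does not matter. A controller that instead failed outright on a missing dependency would need an external orchestrator to sequence the calls.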
1057
00:53:07,100 --> 00:53:11,260
So, you could actually do that, you know, if you want to design the
1058
00:53:11,420 --> 00:53:14,540
controllers to work like the built-in controllers in Kubernetes.
1059
00:53:14,540 --> 00:53:17,860
That’s a lot more work because the APIs don’t work that way.
1060
00:53:17,860 --> 00:53:21,289
If you wrap the Terraform providers, they don’t work that way, right?
1061
00:53:21,290 --> 00:53:25,310
So, that’s another big layer that you would have to build in your, sort of,
1062
00:53:25,440 --> 00:53:29,140
meta controller over the underlying controllers to actually make that work.
1063
00:53:29,310 --> 00:53:32,680
And you know, there ends up being this demand, for the people who do adopt
1064
00:53:32,680 --> 00:53:36,220
it, to have every infrastructure resource they want to use covered by it.
1065
00:53:36,469 --> 00:53:39,369
So, all the work just goes into that, and the work
1066
00:53:39,370 --> 00:53:41,629
doesn’t go into, like, fixing the usability problems.
1067
00:53:42,069 --> 00:53:45,340
So, I think Crossplane has at least a partial solution to that,
1068
00:53:45,340 --> 00:53:49,040
but you have to do it in their composition layer, so the user of
1069
00:53:49,070 --> 00:53:52,600
Crossplane has to specify those dependencies, at least in some cases.
1070
00:53:53,059 --> 00:53:55,299
That just makes it feel more like Terraform again.
1071
00:53:56,030 --> 00:53:56,320
Yeah.
1072
00:53:56,420 --> 00:53:59,089
You’re basically just making a new module, right?
1073
00:53:59,100 --> 00:54:00,589
It’s just, like, a module in a different form.
1074
00:54:01,100 --> 00:54:03,100
Honestly, I don’t think it’s going to be dramatically
1075
00:54:03,100 --> 00:54:06,820
more popular ever than it is right now to do it that way.
1076
00:54:06,820 --> 00:54:08,320
There’s just not enough benefits.
1077
00:54:08,470 --> 00:54:13,900
There are some benefits, but they are kind of killed by how people use it.
1078
00:54:13,910 --> 00:54:19,944
So, for example, the composition layer in Crossplane, effectively is a
1079
00:54:20,000 --> 00:54:24,740
templating layer, so now you can’t just go change the managed resources
1080
00:54:24,740 --> 00:54:27,600
directly because it will create drift with the composition layer.
1081
00:54:27,690 --> 00:54:33,320
And if you need to template the composition resources using Helm,
1082
00:54:33,400 --> 00:54:36,790
now you’re storing it in a Git repo in some templating form,
1083
00:54:36,800 --> 00:54:40,190
Go template format or whatever, and that’s hard to change and
1084
00:54:40,200 --> 00:54:42,629
hard to write, right, so you can’t build tooling on top of that.
1085
00:54:42,629 --> 00:54:46,610
The big benefit of using KRM could be that you could actually
1086
00:54:46,620 --> 00:54:50,650
build controllers or tooling that actually just automates
1087
00:54:50,900 --> 00:54:53,390
the generation and editing of those resources for you.
1088
00:54:53,850 --> 00:54:56,129
The way people use it, they pretty much destroy
1089
00:54:56,130 --> 00:54:58,130
that potential benefit of using a control plane.
1090
00:54:58,700 --> 00:54:59,540
I have a question.
1091
00:54:59,920 --> 00:55:04,130
So, you know how you said that chip job was your dream job at the time, right?
1092
00:55:04,870 --> 00:55:08,175
What’s it like having a career as long as yours, and doing the
1093
00:55:08,190 --> 00:55:11,019
things that you’ve done, do you just keep getting the next dream job?
1094
00:55:11,080 --> 00:55:14,073
And what was your favorite out of those dream jobs, you know?
1095
00:55:14,130 --> 00:55:17,120
Yeah, it was pretty serendipitous.
1096
00:55:17,120 --> 00:55:21,340
I wish I could say, like, I really planned my career, but I really didn’t.
1097
00:55:21,580 --> 00:55:23,010
I loved all the jobs.
1098
00:55:23,010 --> 00:55:25,010
I loved Transmeta and PeakStream.
1099
00:55:25,030 --> 00:55:27,940
They were amazing and awesome.
1100
00:55:27,940 --> 00:55:31,740
I learned a lot, and it was very exciting for a while.
1101
00:55:31,780 --> 00:55:35,300
And then, you know, Google, working on Borg, and especially
1102
00:55:35,300 --> 00:55:37,270
Kubernetes, you know, Kubernetes is definitely the most
1103
00:55:37,270 --> 00:55:40,930
industry impact—and CNCF—anybody could pretty much ask for.
1104
00:55:41,090 --> 00:55:44,569
So, for the next thing I’m planning to do, that was definitely a
1105
00:55:44,570 --> 00:55:49,530
consideration when I spent six to nine months deciding what I wanted to do.
1106
00:55:49,690 --> 00:55:52,910
The opportunity to have industry impact again, will it be as big as
1107
00:55:52,910 --> 00:55:58,270
Kubernetes, mmm, maybe not, but it could become—you know, has that potential.
1108
00:55:58,420 --> 00:56:01,615
We just went down this whole deep path of, if anyone doesn’t know what
1109
00:56:01,670 --> 00:56:04,060
Crossplane is, and doesn’t know what Kubernetes is, and doesn’t know
1110
00:56:05,170 --> 00:56:08,620
what ACK and Config Connector are, I’m sorry we didn’t explain that very well.
1111
00:56:08,870 --> 00:56:11,299
But basically these are all—the book I wrote at the time, we call
1112
00:56:11,300 --> 00:56:13,390
it Infrastructure as Software, where it’s basically like Terraform
1113
00:56:13,670 --> 00:56:17,509
in a for loop that keeps applying something or driving to a state.
1114
00:56:17,809 --> 00:56:20,609
And what you were describing was all of the pains that I’ve
1115
00:56:20,639 --> 00:56:24,030
lived over the last decade of trying to template Helm and
1116
00:56:24,450 --> 00:56:25,890
all these other things of, like, oh well, you know what?
1117
00:56:25,940 --> 00:56:29,089
Like, at some point templates aren’t good enough, for all the reasons
1118
00:56:29,090 --> 00:56:32,940
you just—like, the configuration drifts, and the ability to do
1119
00:56:32,970 --> 00:56:34,970
complex things, and all that stuff just becomes really difficult.
1120
00:56:34,970 --> 00:56:36,759
But as a user, I just want the template.
1121
00:56:36,790 --> 00:56:39,210
I just want the, give me some sane defaults, and I
1122
00:56:39,240 --> 00:56:41,600
just give you a little more data for what I want.
1123
00:56:41,910 --> 00:56:46,660
But, in my head, what you just kind of described—and my last question
1124
00:56:46,660 --> 00:56:49,089
here is, what I’m kind of curious about is, what you were describing,
1125
00:56:49,090 --> 00:56:52,560
of all of these problems, how that relates to something like System
1126
00:56:52,560 --> 00:56:55,679
Initiative, where System Initiative took a different approach of it’s
1127
00:56:55,690 --> 00:57:00,809
not Terraform, it’s a direct model to database, sort of—the UI, the
1128
00:57:00,809 --> 00:57:04,430
GUI on top of it is a representation of the actual infrastructure,
1129
00:57:04,430 --> 00:57:07,560
based on the actual API calls and what’s actually in the database.
1130
00:57:07,770 --> 00:57:10,520
And being able to modify those things directly
1131
00:57:11,150 --> 00:57:13,769
is one of its strong points, from what I’ve seen.
1132
00:57:14,020 --> 00:57:15,200
Is that what you’ve seen as well?
1133
00:57:15,200 --> 00:57:18,430
Is that something that you think is the actual ultimate goal?
1134
00:57:18,650 --> 00:57:21,360
Well, I definitely think that Infrastructure as Code
1135
00:57:21,360 --> 00:57:24,520
as we know it has reached a dead end, more or less.
1136
00:57:25,000 --> 00:57:28,899
I think in my entire career of more than 30 years, what we’re
1137
00:57:28,900 --> 00:57:32,430
doing today feels very similar to what I did in the late ’80s.
1138
00:57:32,880 --> 00:57:36,250
It’s, you have some build-like process that generates some stuff
1139
00:57:36,279 --> 00:57:39,850
that you apply to some system, and the actual details of the
1140
00:57:39,850 --> 00:57:42,400
syntax, and the tools, and whatever has changed a little bit,
1141
00:57:42,400 --> 00:57:45,100
but it feels pretty much the same as what I did in college.
1142
00:57:45,100 --> 00:57:48,250
So, I understand the reasons for how we got there.
1143
00:57:48,250 --> 00:57:49,460
It’s pretty expedient.
1144
00:57:49,840 --> 00:57:51,379
I don’t mean it in a disparaging way.
1145
00:57:51,380 --> 00:57:54,540
I actually mean it in a very complimentary way, but
1146
00:57:54,990 --> 00:57:58,060
Infrastructure as Code tools were easy to build.
1147
00:57:59,040 --> 00:58:01,690
They really hit a sweet spot in terms of making it easy.
1148
00:58:01,710 --> 00:58:05,009
For example, Terraform, the orchestration it does is pretty simple, the
1149
00:58:05,010 --> 00:58:09,219
compilation it does is pretty simple, the model is pretty straightforward.
1150
00:58:09,230 --> 00:58:10,980
The providers are pretty easy to write.
1151
00:58:11,440 --> 00:58:13,560
They don’t ask too much of the provider author.
1152
00:58:13,750 --> 00:58:15,965
And even for using it, it feels like scripting.
1153
00:58:16,320 --> 00:58:19,460
Need to provision a few resources, you can write some Terraform.
1154
00:58:20,430 --> 00:58:25,490
Once you learn the language, except for some baffling decisions like
1155
00:58:25,850 --> 00:58:30,189
deleting stuff by default, it works in a mostly predictable way, right?
1156
00:58:30,190 --> 00:58:33,759
So, it’s pretty expedient, you know, it’s a pretty useful tool.
1157
00:58:33,770 --> 00:58:34,680
It got pretty far.
1158
00:58:35,330 --> 00:58:39,630
But at scale and for some people, it’s not that easy to use.
1159
00:58:40,120 --> 00:58:43,269
And actually adding that kind of scripting layer on top of the
1160
00:58:43,339 --> 00:58:47,489
APIs, much like Crossplane and the other [unintelligible] -based
1161
00:58:47,850 --> 00:58:50,090
tools where people are, you know, using Helm on top of it,
1162
00:58:50,090 --> 00:58:53,280
compositions on top, kind of takes away the power of the APIs.
1163
00:58:54,200 --> 00:58:59,780
So, APIs as the source of truth is what enables interoperable
1164
00:58:59,780 --> 00:59:05,810
ecosystems of clients and tools to interact with those APIs, right?
1165
00:59:05,810 --> 00:59:09,410
You publish an API, and you can build a GUI on top, and a CLI on
1166
00:59:09,410 --> 00:59:13,620
top, and automation tools on top, terminal consoles, and all kinds of
1167
00:59:13,620 --> 00:59:16,599
cool things, ChatOps, whatever, like, you can build all that on top.
1168
00:59:17,070 --> 00:59:18,896
And if you wrap it and say, “No, no, no, you have to go out
1169
00:59:18,896 --> 00:59:21,390
to Terraform, and check it into Git, and get it reviewed,”
1170
00:59:21,960 --> 00:59:25,040
then you’re saying, “No, you can’t do that anymore,” right?
1171
00:59:25,040 --> 00:59:28,710
And I think that’s a huge limitation to
1172
00:59:28,720 --> 00:59:30,273
what we can do with Infrastructure as Code.
1173
00:59:30,273 --> 00:59:31,629
And it’s not just Terraform.
1174
00:59:32,529 --> 00:59:33,749
That’s just the most popular one.
1175
00:59:33,760 --> 00:59:37,640
The same is true of Pulumi, and anything else out there.
1176
00:59:38,120 --> 00:59:42,130
And that was just some, like, deep-seated, GitOps is not the
1177
00:59:42,130 --> 00:59:44,700
answer you’re looking for, sort of like, vibes there [laugh].
1178
00:59:44,730 --> 00:59:47,810
Yeah, I think GitOps—I have a couple of blog posts about GitOps.
1179
00:59:47,810 --> 00:59:50,860
GitOps, I think, solved certain problems.
1180
00:59:50,860 --> 00:59:57,509
The core benefit that I see from GitOps—I mean, retail edge,
1181
00:59:57,509 --> 01:00:00,640
it has a networking benefit, so there’s, like, a specialized
1182
01:00:00,640 --> 01:00:04,340
benefit there, and if you have a large number of targets, you need
1183
01:00:04,340 --> 01:00:06,750
something that retries better than a pipeline and stuff like that.
1184
01:00:06,760 --> 01:00:11,829
But what GitOps does is it creates a one-to-one binding between the resources
1185
01:00:11,849 --> 01:00:15,099
that are provisioned or created in Kubernetes, if you’re talking about
1186
01:00:15,099 --> 01:00:20,100
GitOps for Kubernetes, and the source of truth for that configuration, right?
1187
01:00:20,100 --> 01:00:22,540
So, I think there’s value in that, especially in the world where
1188
01:00:22,590 --> 01:00:26,009
you’re saying you have to go change that configuration to do anything.
1189
01:00:26,410 --> 01:00:31,290
The unidirectionality of it, where if you want to make a change, you
1190
01:00:31,290 --> 01:00:34,969
have to change your configuration generator, program, or template, or
1191
01:00:34,969 --> 01:00:38,200
you have to change the input variables, you have to check that into Git,
1192
01:00:38,510 --> 01:00:44,780
go through your CI pipeline to deploy it, and that is… very restrictive.
1193
01:00:44,809 --> 01:00:45,769
It’s very slow.
1194
01:00:46,599 --> 01:00:47,910
It creates a lot of toil.
1195
01:00:48,120 --> 01:00:52,010
Why do I have to go edit Infrastructure as Code by hand, right?
1196
01:00:52,010 --> 01:00:56,120
So, different people are exploring different solutions for
1197
01:00:56,120 --> 01:00:58,430
not writing the Infrastructure as Code by hand, like, you have
1198
01:00:58,430 --> 01:01:01,790
the Infrastructure from Code tools that are generating it.
1199
01:01:02,509 --> 01:01:03,910
I don’t really think that’s the answer.
1200
01:01:04,050 --> 01:01:07,520
You have System Initiative, which is kind of interesting, although
1201
01:01:07,860 --> 01:01:10,470
kind of challenging to sort of understand exactly what it is.
1202
01:01:11,049 --> 01:01:16,240
But I do think it’s good that folks are exploring alternatives.
1203
01:01:16,240 --> 01:01:19,455
I don’t think just kind of building more generation layers that
1204
01:01:19,720 --> 01:01:22,880
still have the same overall properties, like the unidirectional flow,
1205
01:01:23,290 --> 01:01:27,419
are going to provide dramatic benefits over what we’re doing now.
1206
01:01:27,969 --> 01:01:30,710
Like, people are, of course, trying to use AI to
1207
01:01:30,710 --> 01:01:32,950
generate Terraform and other Infrastructure as Code.
1208
01:01:33,040 --> 01:01:35,130
Like, I’ve tried… doing that.
1209
01:01:35,130 --> 01:01:38,030
It works kind of okay for CloudFormation.
1210
01:01:38,070 --> 01:01:40,770
Works less okay for Terraform, in my experience.
1211
01:01:41,490 --> 01:01:42,679
That could be another whole podcast.
1212
01:01:43,750 --> 01:01:47,260
But I don’t think that ultimately changes sort
1213
01:01:47,260 --> 01:01:49,119
of the overall math in the equation, right?
1214
01:01:49,119 --> 01:01:52,100
Like, you still have to have humans that understand it, that
1215
01:01:52,100 --> 01:01:55,009
can review it, and make sure it’s correct and not hallucinated.
1216
01:01:55,010 --> 01:01:55,040
And—
1217
01:01:55,570 --> 01:01:58,550
You need some experts that have more context than the system itself.
1218
01:01:58,580 --> 01:02:00,640
Like, so there’s someone outside of the system
1219
01:02:00,640 --> 01:02:03,000
that knows, is this safe or the right way to do it.
1220
01:02:03,050 --> 01:02:04,730
Which, everybody’s plan to automate it is going
1221
01:02:04,730 --> 01:02:07,169
to make it harder for those humans to have that.
1222
01:02:07,219 --> 01:02:07,469
Yeah.
1223
01:02:07,469 --> 01:02:09,620
Then you also have to deal with configuration drift, and you
1224
01:02:09,620 --> 01:02:11,610
know, all the other problems that are kind of independent of
1225
01:02:11,700 --> 01:02:15,060
the configure—Infrastructure as Code tool that you’re using.
1226
01:02:15,760 --> 01:02:17,470
Brian, this has been awesome.
1227
01:02:17,470 --> 01:02:18,730
Thank you so much for coming on the show.
1228
01:02:18,730 --> 01:02:21,220
Where should people find you online if they want to reach out, if they want
1229
01:02:21,280 --> 01:02:24,549
to ask you more questions, if they want to, I don’t know, like, get in touch?
1230
01:02:25,180 --> 01:02:27,040
Yeah, I a—thanks for having me on.
1231
01:02:27,140 --> 01:02:31,710
I’m BGrant0607—it’s a trivia question, what the numbers
1232
01:02:31,710 --> 01:02:38,130
stand for—but on LinkedIn, Twitter, BlueSky, Medium.
1233
01:02:38,570 --> 01:02:42,030
And I mean, I’m also on Hachyderm and some other
1234
01:02:42,530 --> 01:02:44,450
Mastodon things, but that seems a lot more fragmented.
1235
01:02:44,860 --> 01:02:50,610
And also still on Kubernetes Slack and CNCF Slack, as just Brian Grant, I think.
1236
01:02:51,410 --> 01:02:52,450
Well, thanks again so much.
1237
01:02:52,740 --> 01:02:56,089
Anyone that has questions or wants to reach out, we actually
1238
01:02:56,090 --> 01:02:59,339
don’t have a Slack instance for Fork Around and Find Out.
1239
01:02:59,340 --> 01:03:01,880
We’re not doing any, sort of like, real-time chat for this.
1240
01:03:02,240 --> 01:03:04,970
BlueSky is like—social media is kind of where I’m trying
1241
01:03:04,970 --> 01:03:06,890
to gravitate towards for these sorts of conversations,
1242
01:03:06,890 --> 01:03:08,660
if you have other feedback or want to reach out.
1243
01:03:08,950 --> 01:03:12,495
I don’t want to check another chat system and log into another system for it.
1244
01:03:12,500 --> 01:03:13,400
Like, I’m already there.
1245
01:03:13,500 --> 01:03:14,360
Autumn and I are both there.
1246
01:03:14,360 --> 01:03:17,979
We have the Fork Around and Find Out BlueSky handle which will be posting
1247
01:03:17,980 --> 01:03:21,870
these episodes, so feel free to leave comments and send us messages on there.
1248
01:03:21,880 --> 01:03:24,210
And yeah, we will talk to you all again soon.
1249
01:03:39,710 --> 01:03:42,720
Thank you for listening to this episode of Fork Around and Find Out.
1250
01:03:43,010 --> 01:03:45,150
If you like this show, please consider sharing it with
1251
01:03:45,150 --> 01:03:48,329
a friend, a coworker, a family member, or even an enemy.
1252
01:03:48,440 --> 01:03:50,539
However we get the word out about this show
1253
01:03:50,750 --> 01:03:52,970
helps it to become sustainable for the long term.
1254
01:03:53,250 --> 01:03:59,510
If you want to sponsor this show, please go to fafo.fm/sponsor, and reach out
1255
01:03:59,510 --> 01:04:02,700
to us there about what you’re interested in sponsoring, and how we can help.
1256
01:04:03,960 --> 01:04:07,160
We hope your systems stay available and your pagers stay quiet.
1257
01:04:07,670 --> 01:04:08,859
We’ll see you again next time.