Transcript
1
00:00:00,310 --> 00:00:01,980
There's a lot of convenience that comes
2
00:00:01,980 --> 00:00:04,160
with cloud, but you definitely pay for it.
3
00:00:04,340 --> 00:00:06,279
And you don't necessarily pay for it in the
4
00:00:06,300 --> 00:00:07,940
things that you expect to pay for it in.
5
00:00:07,949 --> 00:00:10,150
Like, you don't expect, ah, you're gonna charge a markup
6
00:00:10,190 --> 00:00:13,840
on this EC2 instance based off of how powerful it is.
7
00:00:13,840 --> 00:00:16,949
You end up paying most of it in, like, kind of hidden places.
8
00:00:22,705 --> 00:00:26,115
Welcome to Fork Around and Find Out, the podcast about
9
00:00:26,115 --> 00:00:29,215
building, running, and maintaining software and systems.
10
00:00:41,825 --> 00:00:47,105
Welcome to Fork Around and Find Out, the PLC DID of Is the Website Still Up?
11
00:00:47,425 --> 00:00:50,354
I am Justin Garrison, and with me is Autumn Nash, and
12
00:00:50,355 --> 00:00:53,834
today we have Jazz, a software engineer at Bluesky.
13
00:00:53,835 --> 00:00:54,875
Welcome to the show, Jazz.
14
00:00:55,485 --> 00:00:56,575
Hi, glad to be here.
15
00:00:57,090 --> 00:00:58,550
So excited for you to be here.
16
00:00:58,550 --> 00:01:01,300
I have been looking forward to talking about the infrastructure
17
00:01:01,470 --> 00:01:04,230
around Bluesky and what you've all been doing for a very long time.
18
00:01:04,780 --> 00:01:08,789
Jazz's radio voice just totally kicked Justin's podcast voice.
19
00:01:08,789 --> 00:01:10,229
Voice is absolutely better than mine.
20
00:01:10,240 --> 00:01:10,399
Like,
21
00:01:10,629 --> 00:01:13,070
like, as your friend, I want to have your back.
22
00:01:15,970 --> 00:01:16,889
That's fire.
23
00:01:16,890 --> 00:01:17,735
I'm not sure if I'm going to
24
00:01:17,735 --> 00:01:21,380
be demonstrating that for the podcast yet.
25
00:01:21,410 --> 00:01:22,410
But, you know.
26
00:01:22,619 --> 00:01:24,009
Can we hire you in Skittles?
27
00:01:24,179 --> 00:01:26,359
We could, you could DM me, we could talk about it.
28
00:01:26,520 --> 00:01:29,619
We'll just, we'll send you Ikea plushies as payment.
29
00:01:32,460 --> 00:01:32,739
That's right.
30
00:01:32,740 --> 00:01:33,509
Pay me in gum.
31
00:01:34,389 --> 00:01:36,219
All valid cryptocurrencies for 2025.
32
00:01:39,660 --> 00:01:42,539
I guarantee you, if we took Trump coins and
33
00:01:42,539 --> 00:01:44,919
Ikea plushies, one has a better resale value.
34
00:01:46,535 --> 00:01:47,155
Oh, wow.
35
00:01:47,155 --> 00:01:49,574
We're, we are three minutes into this episode.
36
00:01:49,574 --> 00:01:50,455
Welcome to the show.
37
00:01:50,455 --> 00:01:50,985
Everyone.
38
00:01:51,085 --> 00:01:51,845
It's been a week.
39
00:01:52,384 --> 00:01:53,095
It's definitely been a week.
40
00:01:53,964 --> 00:01:55,304
Just for context, for anyone listening to
41
00:01:55,304 --> 00:01:57,294
this, we are recording this on January 23rd.
42
00:01:57,324 --> 00:01:59,104
It is Thursday, still in January.
43
00:01:59,104 --> 00:02:01,524
This episode is coming out in February, second week of February.
44
00:02:01,524 --> 00:02:03,845
So I don't know what the future holds, but.
45
00:02:04,190 --> 00:02:05,289
Godspeed to you all.
46
00:02:05,300 --> 00:02:06,479
Y'all, uh, it is,
47
00:02:06,779 --> 00:02:09,329
we were like 2025 will get better.
48
00:02:09,329 --> 00:02:11,680
And then halfway through January, we're like, whoa, whoa.
49
00:02:11,690 --> 00:02:12,890
We want a refund.
50
00:02:12,930 --> 00:02:13,209
Like,
51
00:02:14,899 --> 00:02:16,579
I don't know if the store does that anymore.
52
00:02:16,880 --> 00:02:16,989
I think
53
00:02:16,989 --> 00:02:18,209
the bad place, what
54
00:02:21,689 --> 00:02:25,059
we just survived tech like recession.
55
00:02:25,489 --> 00:02:27,889
And now we just, we don't even know.
56
00:02:27,909 --> 00:02:28,309
Okay.
57
00:02:28,309 --> 00:02:28,939
We like,
58
00:02:29,049 --> 00:02:31,069
we definitely served a lot of video this week.
59
00:02:31,069 --> 00:02:31,950
I can tell you that much.
60
00:02:31,950 --> 00:02:32,250
I mean.
61
00:02:32,475 --> 00:02:35,765
Speaking of surviving, Bluesky is like a male.
62
00:02:36,065 --> 00:02:37,925
It's just, it is, and that is
63
00:02:37,984 --> 00:02:40,535
you are saving the world right now.
64
00:02:40,555 --> 00:02:42,375
Cause like, I don't even know where to go.
65
00:02:42,405 --> 00:02:44,055
I've deleted my Instagram three times.
66
00:02:44,065 --> 00:02:47,074
The only reason why I have a Facebook is because it's so confusing.
67
00:02:47,074 --> 00:02:48,215
I can't get rid of it.
68
00:02:48,475 --> 00:02:52,255
Like, I swear Meta was like, I'm going to make this horrible.
69
00:02:52,265 --> 00:02:53,385
So they can't delete it.
70
00:02:53,385 --> 00:02:56,105
And I'm like, I just won't post and I'll just delete it from my phone.
71
00:02:56,144 --> 00:02:59,040
Like, yeah, they had to ask a UI/UX, like, designer
72
00:02:59,079 --> 00:03:01,090
how to make it as insufferable as possible.
73
00:03:01,090 --> 00:03:04,270
Like they, not to make it better, but how to make it worse.
74
00:03:04,599 --> 00:03:06,040
It's a lot of dark patterns out there.
75
00:03:06,180 --> 00:03:06,380
Yeah.
76
00:03:06,990 --> 00:03:09,290
And then people are like, oh, TikTok is bad.
77
00:03:09,299 --> 00:03:10,260
Don't rock TikTok.
78
00:03:10,279 --> 00:03:12,689
You know, you shouldn't give your information
79
00:03:12,689 --> 00:03:15,279
to foreign like, okay, but this is like, okay.
80
00:03:15,880 --> 00:03:18,720
This is the Boston Tea Party of data.
81
00:03:18,890 --> 00:03:21,420
They were like, okay, you want to take my data?
82
00:03:21,470 --> 00:03:25,000
And like, you want to take my TikTok and say it's like Chinese government ware?
83
00:03:25,179 --> 00:03:28,260
We will throw it into, like, RedNote.
84
00:03:28,310 --> 00:03:30,670
Like it is the Tea Party of data.
85
00:03:30,730 --> 00:03:32,429
They were like, F your data rules.
86
00:03:32,459 --> 00:03:35,620
And then they gave it to the, there was a video of this woman
87
00:03:35,620 --> 00:03:39,049
saying that they told her to verify her identity for fraud.
88
00:03:39,314 --> 00:03:41,504
On RedNote, and she was like, I'm giving the Chinese
89
00:03:41,504 --> 00:03:44,224
government my ID. What now, U.S. government?
90
00:03:44,224 --> 00:03:46,934
And I was like, Oh, sweet Lord, what are we doing?
91
00:03:49,424 --> 00:03:50,824
Me and Jazz are going to be besties.
92
00:03:51,254 --> 00:03:51,684
I don't know.
93
00:03:51,684 --> 00:03:53,194
I hope more good places show up.
94
00:03:53,484 --> 00:03:54,784
Blue Sky is all we got.
95
00:03:54,904 --> 00:03:57,354
I think there were four AT Proto-based TikTok clones
96
00:03:57,354 --> 00:03:59,824
that were like starting up in the past week or two.
97
00:04:00,014 --> 00:04:02,024
So let's go back a little bit first and.
98
00:04:02,660 --> 00:04:05,090
How did you get into software infrastructure?
99
00:04:05,110 --> 00:04:05,940
What's kind of your background?
100
00:04:06,120 --> 00:04:10,210
Where did you go from doing something to, like, part of Bluesky, AT Proto?
101
00:04:10,640 --> 00:04:11,750
I started in hardware.
102
00:04:11,770 --> 00:04:14,730
I started as like a, as a repair tech, uh, when I was
103
00:04:14,730 --> 00:04:18,390
like 14 at a computer repair shop in my local town.
104
00:04:18,740 --> 00:04:22,000
Support desk life, it is like you are help desk and yeah,
105
00:04:22,159 --> 00:04:24,040
yeah, I was very good at taking stuff apart.
106
00:04:24,060 --> 00:04:25,690
I wasn't very good at putting things back together.
107
00:04:25,700 --> 00:04:28,270
And then as I got older, I got better at putting things back together.
108
00:04:29,360 --> 00:04:30,159
Well, I don't know.
109
00:04:30,179 --> 00:04:33,900
There's, there's, I feel like there's, I feel like there's a disease you get
110
00:04:33,900 --> 00:04:36,750
where you just want to like take everything apart and figure out how it works.
111
00:04:36,780 --> 00:04:37,919
And so I was that kid, you have
112
00:04:37,919 --> 00:04:38,659
to see the parts.
113
00:04:38,679 --> 00:04:39,899
You have to know what's going on.
114
00:04:40,010 --> 00:04:40,359
Yeah.
115
00:04:40,360 --> 00:04:40,729
Yeah.
116
00:04:40,729 --> 00:04:42,159
I didn't, I didn't know how like solder joints
117
00:04:42,159 --> 00:04:42,739
worked.
118
00:04:42,780 --> 00:04:44,419
I learned that the hard way after like
119
00:04:44,419 --> 00:04:46,119
breaking a few too many solder joints and like.
120
00:04:46,119 --> 00:04:47,239
It's not going back together.
121
00:04:47,239 --> 00:04:48,299
What the hell does it work?
122
00:04:48,299 --> 00:04:48,399
A new
123
00:04:48,399 --> 00:04:49,119
skill today.
124
00:04:51,259 --> 00:04:52,689
The oven to try to solder it again.
125
00:04:53,069 --> 00:04:53,530
Yeah.
126
00:04:53,530 --> 00:04:54,099
Yeah.
127
00:04:54,169 --> 00:04:59,529
That's just, and then evolved from that to doing tech support at a local PC
128
00:04:59,529 --> 00:05:03,739
repair shop, and then that paid awfully and they billed a lot for my time.
129
00:05:03,739 --> 00:05:04,560
So I was like, okay, cool.
130
00:05:04,570 --> 00:05:06,009
Let me do this independently.
131
00:05:06,010 --> 00:05:07,990
Um, so I went solo for a little bit.
132
00:05:08,000 --> 00:05:11,520
And then when I was in like high school, got in the hackathon scene
133
00:05:11,570 --> 00:05:14,180
in London, in the UK, right after I moved there in high school.
134
00:05:14,525 --> 00:05:15,615
That was really cool.
135
00:05:15,735 --> 00:05:17,085
I was going to these hackathons.
136
00:05:17,115 --> 00:05:19,485
I was like, well, technically not old enough to go
137
00:05:19,485 --> 00:05:21,455
to some of the hackathons so I could win the prizes.
138
00:05:21,465 --> 00:05:22,495
Terrible in London.
139
00:05:22,495 --> 00:05:23,255
Or is it good?
140
00:05:23,425 --> 00:05:24,865
If you get the right food, it's good.
141
00:05:24,874 --> 00:05:25,834
British food is bad.
142
00:05:25,924 --> 00:05:26,174
Okay.
143
00:05:26,214 --> 00:05:28,690
Uh, I probably shouldn't have said that on the podcast, but British food is bad.
144
00:05:28,690 --> 00:05:29,634
They know it.
145
00:05:29,635 --> 00:05:30,074
Um,
146
00:05:30,094 --> 00:05:30,525
they know it.
147
00:05:30,525 --> 00:05:31,664
Just like they know it.
148
00:05:31,664 --> 00:05:32,054
It's cool.
149
00:05:32,734 --> 00:05:33,484
They know, they know it.
150
00:05:33,485 --> 00:05:36,245
They know the good British food is like Nando's, but that's like.
151
00:05:36,544 --> 00:05:39,905
South African slash Portuguese slash British.
152
00:05:39,924 --> 00:05:42,844
And then obviously there's like really good Indian food
153
00:05:42,954 --> 00:05:45,174
and there's really good continental European foods.
154
00:05:45,174 --> 00:05:47,015
If you want like good Italian food or good French
155
00:05:47,015 --> 00:05:50,115
food, those are some really good eats to get in London.
156
00:05:50,484 --> 00:05:50,854
Yeah.
157
00:05:50,875 --> 00:05:53,044
So it was in the hackathon scene, started doing
158
00:05:53,044 --> 00:05:55,824
software engineering as like a part time thing.
159
00:05:55,824 --> 00:05:59,745
I think my junior year of high school into my senior year of high school.
160
00:05:59,745 --> 00:06:00,204
And then.
161
00:06:00,640 --> 00:06:01,319
moved back to the U.S.
162
00:06:01,319 --> 00:06:04,979
for college, uh, worked through college 39 and a half hours
163
00:06:04,979 --> 00:06:08,789
a week doing contracting, and then graduated, uh, early 2020,
164
00:06:09,049 --> 00:06:12,259
was thrust into the tech market in the middle of a pandemic.
165
00:06:12,490 --> 00:06:13,129
So it was interesting.
166
00:06:13,129 --> 00:06:15,449
So I spent some time working at a financial company, um,
167
00:06:15,449 --> 00:06:18,429
doing like infrastructure for their engineering teams,
168
00:06:18,429 --> 00:06:20,599
building a platform as a service on top of Kubernetes.
169
00:06:20,699 --> 00:06:23,209
And then I spent some time working at a social media
170
00:06:23,209 --> 00:06:26,650
company doing infrastructure for their research teams.
171
00:06:26,650 --> 00:06:28,240
So turning research projects Was it like a really evil one or
172
00:06:28,280 --> 00:06:29,610
just like a kind of evil one?
173
00:06:30,039 --> 00:06:31,299
It was, yeah, it was at Facebook.
174
00:06:31,299 --> 00:06:33,610
I was at, I was at Facebook for briefly for about a year.
175
00:06:33,669 --> 00:06:35,350
Um, fourth as a production engineer
176
00:06:35,355 --> 00:06:35,804
you got out.
177
00:06:35,985 --> 00:06:38,084
So I I got out, subscribe, got out.
178
00:06:38,090 --> 00:06:38,470
I'm just trying
179
00:06:38,470 --> 00:06:40,120
to think how hard it was to get out of the company.
180
00:06:40,150 --> 00:06:40,689
Like this is . Yeah.
181
00:06:41,005 --> 00:06:41,204
Yeah.
182
00:06:42,340 --> 00:06:45,220
I was working, um, production engineering at Facebook Reality Labs,
183
00:06:45,220 --> 00:06:48,039
so I spent time working with a bunch of... That sounds like a cool job.
184
00:06:48,159 --> 00:06:49,059
Researchers.
185
00:06:49,120 --> 00:06:49,390
Yeah.
186
00:06:49,390 --> 00:06:50,500
They were building really cool stuff.
187
00:06:50,500 --> 00:06:51,939
The problem is there were like a few thousand
188
00:06:51,939 --> 00:06:53,980
researchers and there were about 20 production engineers.
189
00:06:54,239 --> 00:06:55,249
So it was just like
190
00:06:55,369 --> 00:06:57,789
Meta Quest, like VR... This is like Meta Quest.
191
00:06:57,790 --> 00:06:58,150
This is like
192
00:06:58,249 --> 00:06:59,199
MetaHorizons.
193
00:06:59,219 --> 00:07:01,089
This is all sorts of this is all the hardware project.
194
00:07:01,249 --> 00:07:01,939
They're working on.
195
00:07:02,379 --> 00:07:03,289
Do you know the legs
196
00:07:03,289 --> 00:07:03,890
on the models?
197
00:07:04,619 --> 00:07:06,109
I was, that's what I was going to ask.
198
00:07:06,109 --> 00:07:08,650
I was going to be like, why don't they have hands and legs?
199
00:07:08,659 --> 00:07:09,859
Like, do you have the teeth?
200
00:07:10,390 --> 00:07:12,609
Like this man was really trying to tell us that we
201
00:07:12,619 --> 00:07:14,780
all need to be more masculine and all this stuff.
202
00:07:14,789 --> 00:07:16,249
And I was like, bro, you can't build hands.
203
00:07:16,259 --> 00:07:16,759
Sit down.
204
00:07:16,829 --> 00:07:18,459
I don't know why there are no legs.
205
00:07:18,499 --> 00:07:19,489
I do.
206
00:07:19,520 --> 00:07:20,559
Yeah, I do know.
207
00:07:20,559 --> 00:07:20,789
Like.
208
00:07:20,844 --> 00:07:23,885
There were lots of really cool projects going on around
209
00:07:23,905 --> 00:07:27,864
the AR glasses that debuted at a recent Meta event.
210
00:07:28,265 --> 00:07:30,184
So the, the really cool, the like time machine
211
00:07:30,184 --> 00:07:33,314
glasses, the real thick ones, those are awesome.
212
00:07:33,324 --> 00:07:35,875
The amounts of engineering that went into.
213
00:07:36,559 --> 00:07:39,520
Every single component in that pair of glasses is crazy.
214
00:07:39,530 --> 00:07:42,630
Like everything in it is custom silicon. Designing that custom silicon
215
00:07:42,630 --> 00:07:46,449
and designing the optics, those silicon carbide optics that are like
216
00:07:46,469 --> 00:07:50,370
actually just like a rock that was manufactured specifically to like
217
00:07:50,370 --> 00:07:53,749
do all these crazy waveguides and stuff, that requires an insane
218
00:07:53,749 --> 00:07:56,859
amount of simulation and insane amount of like physics and engineering.
219
00:07:56,979 --> 00:07:57,749
And I like.
220
00:07:57,975 --> 00:07:58,284
Sure.
221
00:07:58,294 --> 00:08:00,775
Helped the team with their simulation cluster or something.
222
00:08:00,775 --> 00:08:03,265
I have no idea how the math works, but that was cool stuff to work on.
223
00:08:04,434 --> 00:08:07,044
And now that I think about it, Jazz, like I can't see your legs now either.
224
00:08:07,044 --> 00:08:08,164
So I don't even know if you have legs.
225
00:08:08,174 --> 00:08:11,384
So you
226
00:08:11,385 --> 00:08:12,324
can tell us point twice.
227
00:08:13,005 --> 00:08:14,338
No, I, uh, yeah.
228
00:08:14,338 --> 00:08:15,054
So that was.
229
00:08:15,214 --> 00:08:16,375
That was a fun chapter.
230
00:08:16,405 --> 00:08:21,755
After that, I kind of like, I went to, I went to a tiny, I went to a
231
00:08:21,755 --> 00:08:25,934
tiny six person startup that was, that was doing like solar cellular
232
00:08:25,974 --> 00:08:29,314
camera networks around cities for determining parking occupancy.
233
00:08:29,314 --> 00:08:30,104
It was very weird.
234
00:08:30,324 --> 00:08:32,504
What made you want to go from Meta to that?
235
00:08:32,604 --> 00:08:35,275
I wanted a small, like small team startup vibe thing.
236
00:08:35,334 --> 00:08:36,824
And the CEO was a friend of mine from high
237
00:08:36,824 --> 00:08:39,214
school, but they didn't really have any engineers.
238
00:08:39,244 --> 00:08:43,194
And I kind of, I built a product stack and burned out pretty quickly
239
00:08:43,304 --> 00:08:47,034
and then went to work at Planet Labs where I was for about two years.
240
00:08:47,124 --> 00:08:49,574
That was kind of my ethical turning point where I was like,
241
00:08:49,574 --> 00:08:52,014
Hey, I want to go build technology that helps the world.
242
00:08:52,204 --> 00:08:57,175
Uh, Planet builds a tiny CubeSat constellation that images the world every day.
243
00:08:57,540 --> 00:09:00,650
They sell that imagery to farmers and agricultural industry
244
00:09:00,680 --> 00:09:04,210
and all sorts of like NGOs and other, other, uh, organizations
245
00:09:04,220 --> 00:09:06,990
so they can get like really fast real time imagery.
246
00:09:07,230 --> 00:09:10,150
My role there was like billing infrastructure when I got
247
00:09:10,150 --> 00:09:12,820
my foot in the door and then it turned into, uh, I wrote a
248
00:09:12,850 --> 00:09:15,329
charter and built out their internal developer experience team.
249
00:09:15,510 --> 00:09:18,450
But come, you know, 18 plus months into my career
250
00:09:18,450 --> 00:09:20,560
at Planet, my friend invites me to Bluesky.
251
00:09:20,650 --> 00:09:23,310
Uh, it was like usually like 20,000 or something like that.
252
00:09:23,430 --> 00:09:25,479
I check out this cool protocol that they're working on.
253
00:09:25,755 --> 00:09:28,084
It's very, very interesting because they just have a
254
00:09:28,084 --> 00:09:30,789
public firehose and I was like, holy crap, that's awesome.
255
00:09:30,789 --> 00:09:33,329
I've never really seen a public firehose for a social network.
256
00:09:33,339 --> 00:09:35,399
So I figure out how to consume the firehose.
257
00:09:35,459 --> 00:09:37,149
I noticed like, Hey, there's this Paul guy.
258
00:09:37,159 --> 00:09:40,669
Who's like everywhere all the time, responding to everybody.
259
00:09:40,729 --> 00:09:41,719
Everyone mentions him.
260
00:09:41,819 --> 00:09:41,949
Always the
261
00:09:42,129 --> 00:09:43,769
first thing everybody notices.
262
00:09:44,359 --> 00:09:44,799
Yeah.
263
00:09:44,799 --> 00:09:47,539
So I was like, who's this Paul guy and how, how much has he mentioned?
264
00:09:47,549 --> 00:09:48,329
So I wrote like,
265
00:09:48,329 --> 00:09:49,419
he's the MySpace Tom of Bluesky.
266
00:09:49,759 --> 00:09:50,209
Yeah.
267
00:09:50,419 --> 00:09:50,909
Yeah.
268
00:09:51,109 --> 00:09:53,639
I wrote some code and I was like, how often is Paul mentioned?
269
00:09:53,639 --> 00:09:56,219
And like, how many different people are talking to Paul?
270
00:09:56,399 --> 00:09:57,889
And so that was the initial idea was like
271
00:09:57,930 --> 00:10:00,859
tracking how popular Paul was on this platform.
272
00:10:00,859 --> 00:10:04,269
And then that evolved into my social graph visualization, which was, Hey,
273
00:10:04,269 --> 00:10:08,069
let's graph all of the interactions between users on Bluesky and try to find
274
00:10:08,079 --> 00:10:12,009
like clusters of, of new users popping up that have common features and stuff.
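[Editor's sketch] The mention counting and interaction graph described here could look roughly like the following. This is only a minimal illustration: the event shape (`author`, `mentions`) and the sample handles are hypothetical and are not the actual firehose record format, and connected components stand in for whatever clustering the original visualization used.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified events; real firehose records look different.
events = [
    {"author": "alice", "mentions": ["paul"]},
    {"author": "bob", "mentions": ["paul", "alice"]},
    {"author": "alice", "mentions": ["paul"]},
    {"author": "carol", "mentions": []},
]

def mention_stats(events, handle):
    """Return (total mentions of handle, number of distinct authors mentioning it)."""
    total = 0
    authors = set()
    for event in events:
        hits = event["mentions"].count(handle)
        if hits:
            total += hits
            authors.add(event["author"])
    return total, len(authors)

def interaction_graph(events):
    """Build an undirected weighted graph; edge weight = number of interactions."""
    edges = Counter()
    for event in events:
        for target in event["mentions"]:
            if target != event["author"]:  # ignore self-mentions
                edges[frozenset((event["author"], target))] += 1
    return edges

def clusters(edges):
    """Group users into connected components of the interaction graph."""
    neighbors = defaultdict(set)
    for pair in edges:
        a, b = tuple(pair)
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, groups = set(), []
    for start in neighbors:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

Calling `mention_stats(events, "paul")` answers both questions from the conversation at once: how often a handle is mentioned and by how many different people.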
275
00:10:12,159 --> 00:10:12,629
Very cool
276
00:10:12,629 --> 00:10:13,349
hobbies.
277
00:10:13,890 --> 00:10:14,069
Thank
278
00:10:14,069 --> 00:10:14,239
you.
279
00:10:14,240 --> 00:10:16,190
So that was, that was really fun.
280
00:10:16,190 --> 00:10:17,879
And I realized I was spending about 30 hours
281
00:10:17,879 --> 00:10:21,229
a week on miscellaneous AT Proto stuff.
282
00:10:21,630 --> 00:10:23,450
And then 40 hours a week at work.
283
00:10:23,470 --> 00:10:26,370
And I was like, I definitely like one of these a lot more than the other.
284
00:10:26,500 --> 00:10:29,340
So I went to a, one of the Bluesky user meetups in the Bay Area.
285
00:10:29,360 --> 00:10:31,940
And I met, uh, some members of the team at the time.
286
00:10:31,959 --> 00:10:33,999
And they recognized me from the projects I was doing
287
00:10:33,999 --> 00:10:36,109
on the network of sharing all of this as I was building
288
00:10:36,110 --> 00:10:38,299
it open source, like, Hey, check out this cool graph.
289
00:10:38,329 --> 00:10:40,000
Oh, look, all these new users showed up and they're from
290
00:10:40,000 --> 00:10:42,420
this area and they speak this language or whatever it is.
291
00:10:42,560 --> 00:10:43,830
We chatted for a couple hours and like, cool.
292
00:10:43,840 --> 00:10:45,670
Do you want to like come work here?
293
00:10:45,720 --> 00:10:47,070
And I was like, Oh, do I?
294
00:10:47,560 --> 00:10:48,540
I was like, yeah, I think I do.
295
00:10:49,390 --> 00:10:52,290
That's actually really helpful data for a startup though.
296
00:10:52,640 --> 00:10:53,449
Like you were doing meaningful work.
297
00:10:53,450 --> 00:10:54,080
Yeah, I mean,
298
00:10:54,170 --> 00:10:56,550
yeah, at the point I had more dashboards than the
299
00:10:56,560 --> 00:10:58,850
company did of like what was going on on the network.
300
00:10:58,850 --> 00:11:01,820
I had like a better idea of who their users were than they did.
301
00:11:01,959 --> 00:11:05,230
Obviously it's evolved a whole lot since then, but I was, I was basically
302
00:11:05,230 --> 00:11:07,740
working at the company before I started working at the company just because
303
00:11:07,740 --> 00:11:10,430
everything was open source and everything was all the data was open.
304
00:11:10,440 --> 00:11:11,095
You could just do
305
00:11:11,095 --> 00:11:11,650
whatever you want.
306
00:11:11,650 --> 00:11:12,204
I was gonna
307
00:11:12,204 --> 00:11:12,574
say you
308
00:11:12,574 --> 00:11:14,340
were, you just rolled in with insights.
309
00:11:14,854 --> 00:11:16,334
Yeah, it was, it was super cool.
310
00:11:16,334 --> 00:11:19,394
Like I've never, I've never had like a, an experience like that
311
00:11:19,394 --> 00:11:22,594
where you can watch the evolution of a social network, like from
312
00:11:22,604 --> 00:11:25,734
basically first principles, totally in the public and totally
313
00:11:25,734 --> 00:11:28,314
in the open and just build all sorts of stuff on top of it.
314
00:11:28,314 --> 00:11:31,324
And the developer community around it got really psyched.
315
00:11:31,395 --> 00:11:31,834
Yeah.
316
00:11:31,915 --> 00:11:35,224
That, and that was back in, I joined the team back in July of 2023.
317
00:11:35,385 --> 00:11:38,564
So I've been, been around for about like 18 months now.
318
00:11:38,880 --> 00:11:40,780
And it's been an absolutely insane ride.
319
00:11:40,780 --> 00:11:42,310
It's felt like it's been a decade.
320
00:11:42,630 --> 00:11:45,340
You primarily have focused on the infrastructure side of it, right?
321
00:11:45,400 --> 00:11:46,850
Like as far as, yeah,
322
00:11:46,900 --> 00:11:50,169
my roles and responsibilities are mostly around infrastructure and scaling.
323
00:11:50,290 --> 00:11:53,529
So when I joined, we had around a hundred thousand users. This
324
00:11:53,540 --> 00:11:57,439
weekend we'll probably be pushing on 30 million users.
325
00:11:57,614 --> 00:11:59,814
So pretty significant increase in scale.
326
00:12:00,214 --> 00:12:00,795
It's a little bit 18 months.
327
00:12:00,854 --> 00:12:03,624
We need a HugOps meme for Jazz.
328
00:12:03,764 --> 00:12:06,364
You are the real MVP because the amount of
329
00:12:06,374 --> 00:12:09,444
people that are social media refugees right now.
330
00:12:09,814 --> 00:12:11,924
It's not just me on the, on the infra side of things.
331
00:12:11,924 --> 00:12:13,384
We have a, we probably have.
332
00:12:13,774 --> 00:12:16,554
Five or six people who are like kind of core
333
00:12:16,584 --> 00:12:18,474
infrastructure, like on call rotation now.
334
00:12:18,514 --> 00:12:21,514
But back in the day, it was, it was not quite that big.
335
00:12:21,614 --> 00:12:23,000
We were, we were a really tiny team.
336
00:12:23,000 --> 00:12:23,310
Five is
337
00:12:23,310 --> 00:12:26,784
still a lot for that many users and then doing mostly on prem.
338
00:12:26,874 --> 00:12:28,194
Yeah, it's, it's a bit crazy.
339
00:12:28,204 --> 00:12:33,064
We, we built out our data center locations in like November of 2024.
340
00:12:33,284 --> 00:12:35,004
And so before that, we were all on cloud.
341
00:12:35,094 --> 00:12:37,284
Things were kind of falling over with a hundred thousand users.
342
00:12:37,544 --> 00:12:38,184
Oh, that's interesting.
343
00:12:38,184 --> 00:12:39,504
So you did start in the cloud.
344
00:12:40,145 --> 00:12:40,475
And then,
345
00:12:40,475 --> 00:12:41,675
yeah, it all started in the cloud.
346
00:12:41,675 --> 00:12:42,805
Like general overview.
347
00:12:42,875 --> 00:12:44,645
What does the infrastructure look like today?
348
00:12:44,725 --> 00:12:46,245
Like where, as I know, like some pieces in
349
00:12:46,245 --> 00:12:48,454
the cloud, some on prem, some different areas.
350
00:12:48,465 --> 00:12:49,155
Like what is that?
351
00:12:49,235 --> 00:12:50,064
How does that break down?
352
00:12:50,504 --> 00:12:52,785
So we have three tiers of infrastructure, I guess.
353
00:12:52,814 --> 00:12:57,074
You'd have like singleton one off services, which are kind of smaller, lower
354
00:12:57,074 --> 00:13:01,334
load services that we want a replicated Postgres database for or something.
355
00:13:01,354 --> 00:13:03,204
And those we stick in a cloud provider.
356
00:13:03,324 --> 00:13:06,265
We have our like core data services, which are really
357
00:13:06,265 --> 00:13:08,435
high compute scale, really high storage requirements.
358
00:13:08,999 --> 00:13:10,389
Uh, and those we run on prem.
359
00:13:10,599 --> 00:13:12,489
And so we have two, two POPs, two physical
360
00:13:12,489 --> 00:13:14,319
POPs that we have our own hardware in.
361
00:13:14,609 --> 00:13:15,439
Uh, that we co locate.
362
00:13:15,439 --> 00:13:16,869
We, you know, get a cage in a data center
363
00:13:16,879 --> 00:13:18,899
somewhere and go throw your servers in it.
364
00:13:19,149 --> 00:13:21,800
And then we have, our third tier is kind of like bare metal
365
00:13:21,800 --> 00:13:25,569
providers, which is different providers, but they give us
366
00:13:25,809 --> 00:13:29,399
basically a full machine in their data center somewhere.
367
00:13:29,489 --> 00:13:31,449
And then we run like the PDSs.
368
00:13:31,459 --> 00:13:35,450
So if you've, you're, the personal data servers that have all of our users'
369
00:13:35,450 --> 00:13:39,610
canonical data on it is stored on bare metal through bare metal providers.
370
00:13:39,630 --> 00:13:42,030
And that lets us kind of scale those a lot more easily than we
371
00:13:42,030 --> 00:13:45,409
can scale our own physical hardware and then smaller one off
372
00:13:45,410 --> 00:13:48,220
services or things that need to be in the cloud are in the cloud.
373
00:13:48,250 --> 00:13:52,700
And then all of our kind of like really high compute intensive or network
374
00:13:52,700 --> 00:13:57,080
intensive or storage intensive stuff runs on our own hardware because.
375
00:13:57,575 --> 00:14:00,715
bandwidth in a data center is a lot cheaper than bandwidth in a cloud.
376
00:14:00,815 --> 00:14:03,235
Storage in a data center is a lot cheaper than storage in the cloud.
377
00:14:03,375 --> 00:14:05,415
Do you have any extra backup storage in the
378
00:14:05,415 --> 00:14:07,945
cloud just in case things get super crazy?
379
00:14:08,195 --> 00:14:10,375
Yeah, so there are all sorts of like different tiers of
380
00:14:10,394 --> 00:14:13,205
backups based on what kind of data it is and where it is.
381
00:14:13,205 --> 00:14:15,995
So like canonical data, like your PDS data is backed up
382
00:14:16,025 --> 00:14:18,375
a couple different ways in a couple different places.
383
00:14:18,425 --> 00:14:22,535
But like our global index of all of the data in the Atmosphere,
384
00:14:22,535 --> 00:14:25,994
we run two fully independent copies, uh, indexing the Atmosphere.
385
00:14:25,994 --> 00:14:30,774
So each data center, uh, fully indexes, um, the firehose on its own.
386
00:14:31,074 --> 00:14:34,514
Um, so they both contain two independent sets of the same data.
387
00:14:34,844 --> 00:14:36,544
Um, so if there were some kind of.
388
00:14:36,889 --> 00:14:38,109
outage or anything like that.
389
00:14:38,269 --> 00:14:40,379
Uh, we have at least a copy of that somewhere.
390
00:14:40,409 --> 00:14:42,609
Uh, and we have the ability to shift all of our traffic to one of
391
00:14:42,609 --> 00:14:45,719
the data centers so that it can, it can handle the production load.
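[Editor's sketch] The setup described here, two sites that each index the same stream independently, with the option to shift all traffic to one of them, can be illustrated as follows. The class, the site names, and the weighting scheme are hypothetical and only show the idea; they are not Bluesky's actual tooling.

```python
from dataclasses import dataclass, field

@dataclass
class DataCenter:
    name: str
    healthy: bool = True
    index: dict = field(default_factory=dict)

    def ingest(self, event_id, record):
        # Each site builds its own index from the same stream.
        self.index[event_id] = record

def replicate(stream, sites):
    """Feed every event to every site so each holds a full, independent copy."""
    for event_id, record in stream:
        for site in sites:
            site.ingest(event_id, record)

def route(sites, weights):
    """Return the traffic share per site, shifting load away from unhealthy sites."""
    live = {s.name: weights[s.name] for s in sites if s.healthy}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy data center available")
    return {s.name: live.get(s.name, 0) / total for s in sites}

# Hypothetical site names and a toy stream.
pop_a, pop_b = DataCenter("pop-a"), DataCenter("pop-b")
replicate([(1, "post"), (2, "like")], [pop_a, pop_b])
normal = route([pop_a, pop_b], {"pop-a": 1, "pop-b": 1})
pop_a.healthy = False
failover = route([pop_a, pop_b], {"pop-a": 1, "pop-b": 1})
```

Because both sites ingest the full stream, marking one unhealthy moves its share of traffic to the other without any data having to be copied first.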
392
00:14:45,899 --> 00:14:49,049
It makes my heart so happy when people use cloud and on prem
393
00:14:49,050 --> 00:14:52,800
correctly and don't just think either of them are the end all be all.
394
00:14:52,879 --> 00:14:56,474
And when people are redundant properly, like it just makes me so happy.
395
00:14:56,694 --> 00:14:58,915
We, we really like commoditized cloud products.
396
00:14:58,925 --> 00:15:01,624
So something like block storage is like super commoditized.
397
00:15:01,624 --> 00:15:02,324
It's super cheap.
398
00:15:02,364 --> 00:15:04,305
There's so many different people that provide it and the
399
00:15:04,305 --> 00:15:07,314
like SLAs on it are very industry standard at this point.
400
00:15:07,324 --> 00:15:09,295
So it's much easier to get cheap.
401
00:15:09,614 --> 00:15:10,525
Block storage.
402
00:15:10,574 --> 00:15:13,724
So we don't mind. Building a petabyte-scale, uh, storage
403
00:15:13,724 --> 00:15:17,405
cluster on, like, metal is, is kind of challenging.
404
00:15:17,405 --> 00:15:18,515
It's, it's expensive.
405
00:15:18,574 --> 00:15:21,354
It's error prone, depending on your latency requirements
406
00:15:21,354 --> 00:15:23,034
and stuff, you might need to be running flash for that.
407
00:15:23,045 --> 00:15:24,484
In which case it's a lot more expensive.
408
00:15:24,604 --> 00:15:27,044
And if you're running hard drives, you have failure rates, which means you
409
00:15:27,045 --> 00:15:30,045
need somebody to like the bigger scale your cluster is, the more often you have
410
00:15:30,045 --> 00:15:33,265
to send somebody down to go swap hard drives, whereas block storage is like.
411
00:15:33,809 --> 00:15:36,479
Honestly, really economical up to, I think it's somewhere in
412
00:15:36,479 --> 00:15:38,599
the, in the like four to five petabyte range, at which point
413
00:15:38,599 --> 00:15:41,209
it makes sense to start just running your own storage clusters.
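[Editor's sketch] The break-even reasoning here can be written as a small calculation: cloud block storage is billed purely per terabyte, while a self-run cluster has a fixed monthly cost (hardware amortization, hands-on drive swaps) plus a lower per-terabyte cost. All prices below are made-up placeholders, chosen only so that the crossover falls in the four-to-five-petabyte range mentioned; they are not real quotes.

```python
def monthly_cost_cloud(tb, price_per_tb):
    """Cloud block storage: pay only for what is stored."""
    return tb * price_per_tb

def monthly_cost_self_hosted(tb, fixed_monthly, price_per_tb):
    """Own cluster: fixed overhead plus a (lower) marginal cost per TB."""
    return fixed_monthly + tb * price_per_tb

def break_even_tb(cloud_per_tb, fixed_monthly, own_per_tb):
    """Capacity at which both options cost the same, or None if cloud is never pricier."""
    if cloud_per_tb <= own_per_tb:
        return None
    return fixed_monthly / (cloud_per_tb - own_per_tb)

# Placeholder numbers (hypothetical): $20/TB cloud, $8/TB own, $54,000/month fixed.
crossover = break_even_tb(20.0, 54000.0, 8.0)  # in TB
```

Below the crossover the fixed overhead dominates and cloud storage is cheaper; above it, the lower marginal cost of owned hardware wins.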
414
00:15:41,399 --> 00:15:44,419
But I love that you guys actually did the numbers and you looked at
415
00:15:44,449 --> 00:15:48,229
each, you know, like the, all of your storage is very well placed.
416
00:15:48,379 --> 00:15:50,109
Well, and the, the really funny thing is, I
417
00:15:50,109 --> 00:15:52,009
mean, you, you said you started these in 2023.
418
00:15:53,539 --> 00:15:56,689
Alright, like end of 2024, you were growing a million users a day.
419
00:15:56,689 --> 00:15:59,079
So whatever math you thought you had in 2023
420
00:15:59,089 --> 00:16:01,669
was not the math you were doing in 2024.
421
00:16:01,919 --> 00:16:03,519
You would be surprised.
422
00:16:03,569 --> 00:16:08,050
We've got some spreadsheets that were written in 2023,
423
00:16:08,050 --> 00:16:11,168
very early 2024.
425
00:16:11,744 --> 00:16:14,074
They were wildly ambitious when they were written
426
00:16:14,164 --> 00:16:16,384
that go month by month, like user numbers.
427
00:16:16,444 --> 00:16:18,844
We missed a ton of the marks on them.
428
00:16:18,874 --> 00:16:20,064
And then we caught up.
429
00:16:20,305 --> 00:16:22,035
Were y'all just predicting the future?
430
00:16:22,334 --> 00:16:24,844
Like, did, did someone know if Elon was
431
00:16:24,844 --> 00:16:26,354
breaking up or getting back with girlfriends?
432
00:16:28,324 --> 00:16:30,894
Our previous infrastructure lead, Jake, Justin, who I
433
00:16:30,935 --> 00:16:33,464
think you talked to briefly on the website at some point.
434
00:16:33,665 --> 00:16:36,074
He wrote up this spreadsheet that like, I think he, I
435
00:16:36,084 --> 00:16:38,665
think he based it off of Instagram numbers or something.
436
00:16:38,665 --> 00:16:41,145
He got, he got a bunch of different numbers from different social medias
437
00:16:41,175 --> 00:16:45,125
that like had, they gave you a whole like six data points of their
438
00:16:45,125 --> 00:16:48,234
user numbers over the course of their like 10 year history and then
439
00:16:48,235 --> 00:16:52,004
extrapolated between them to try and find what successful growth looks like.
440
00:16:52,025 --> 00:16:54,094
And then we built that into the spreadsheet and
441
00:16:54,094 --> 00:16:56,255
then we said, hey, let's plan for success, because
442
00:16:56,804 --> 00:16:58,484
if you plan for failure, you're not going to succeed.
443
00:16:58,564 --> 00:17:02,934
Which is amazing because just the, the environment in which Instagram
444
00:17:02,974 --> 00:17:08,144
and Facebook and most places and social media grew is not like
445
00:17:08,659 --> 00:17:09,639
what's happening right now.
446
00:17:09,649 --> 00:17:12,659
Like, this is a very crazy time in social media.
447
00:17:12,949 --> 00:17:16,079
When you, when you have so much market saturation and there's so many
448
00:17:16,079 --> 00:17:19,489
incumbents and everybody is already fully subscribed on the social medias
449
00:17:19,489 --> 00:17:24,659
that they want to be on, it is so hard to pull people away from the platform
450
00:17:24,659 --> 00:17:27,390
that they're on and bring them to something new and show them something new.
451
00:17:27,649 --> 00:17:32,100
And we saw that for six months last year, we had like basically flat growth.
452
00:17:32,120 --> 00:17:36,510
We were like between three and four thousand new users a day for six months.
453
00:17:36,919 --> 00:17:40,710
And then in November of 2024, we were doing over
454
00:17:40,710 --> 00:17:43,010
a million users a day for three days in a row.
455
00:17:43,020 --> 00:17:43,550
We'll never
456
00:17:43,550 --> 00:17:46,209
know why in November specifically that happened.
457
00:17:46,700 --> 00:17:48,000
The whiplash is crazy.
458
00:17:48,010 --> 00:17:49,679
You go from like no growth at all.
459
00:17:49,810 --> 00:17:52,719
Oh, we, I can't believe we spent so much time focusing on scaling.
460
00:17:52,720 --> 00:17:56,429
Why did we waste all that time and money on filling out these data centers?
461
00:17:56,615 --> 00:17:59,905
Oh my gosh, like at, in the last six months, was there like, you don't have
462
00:17:59,905 --> 00:18:02,524
to give us specifics, obviously we don't need numbers, but like, was there
463
00:18:02,534 --> 00:18:06,254
ever a point where you were like, oh my goodness, like what's going on?
464
00:18:06,264 --> 00:18:07,985
Like, or how are we going to sustain this?
465
00:18:08,575 --> 00:18:11,155
Brazil was insane, Brazil was, I was, I thought
466
00:18:11,155 --> 00:18:14,145
it was going to be another thing because I like watched the Brazil blip,
467
00:18:14,155 --> 00:18:18,105
but I guess in the U.S. it didn't make the same, it didn't seem as much.
468
00:18:18,225 --> 00:18:21,154
It didn't catch on as much in the U.S., but it was
469
00:18:21,185 --> 00:18:24,454
like one and a half million users in a weekend, right?
470
00:18:24,485 --> 00:18:27,645
Which for us coming from having like no growth for six months
471
00:18:27,645 --> 00:18:30,085
to suddenly picking up a million and a half users in a weekend
472
00:18:30,085 --> 00:18:33,325
was, it was like 30 percent growth of our network in like a week.
473
00:18:33,635 --> 00:18:34,915
Which was nuts for us.
474
00:18:35,185 --> 00:18:38,935
I was like on a plane to London to go see a friend of mine for his birthday.
475
00:18:39,264 --> 00:18:42,604
And I like bought the in flight wifi and was like trying to get my
476
00:18:42,605 --> 00:18:45,634
VPN to work so that I could like connect to dashboards and everything.
477
00:18:45,645 --> 00:18:46,245
And it was.
478
00:18:46,764 --> 00:18:47,935
I was so terrified.
479
00:18:47,935 --> 00:18:49,965
I ended up, like, working the entire weekend
480
00:18:49,965 --> 00:18:54,294
I was in London, because I was just like, we've never seen any load like this.
481
00:18:54,304 --> 00:18:57,015
It was like five or six times higher, like firehose
482
00:18:57,015 --> 00:19:00,144
throughput and request throughput than we've ever seen before.
483
00:19:00,284 --> 00:19:02,874
You've mentioned a couple of components and I don't want to go deep into
484
00:19:02,874 --> 00:19:05,574
the AT Protocol stuff, but like, could you just give a general overview?
485
00:19:05,574 --> 00:19:09,054
Like the PDS, the firehose, the AppView, the indexes.
486
00:19:09,360 --> 00:19:12,960
All of those have different constraints and how do they tie
487
00:19:12,960 --> 00:19:15,550
together or just a general overview of like what BlueSky is
488
00:19:15,550 --> 00:19:18,260
offering as a service is a bunch of things underneath it.
489
00:19:18,870 --> 00:19:21,780
I'll steal the Paulism, which is everybody's a website.
490
00:19:21,839 --> 00:19:25,359
So you as a user on BlueSky, every time you like something,
491
00:19:25,360 --> 00:19:27,500
every time you create a post, every time you follow somebody,
492
00:19:27,804 --> 00:19:31,124
uh, every time you block somebody, every time you repost something, you
493
00:19:31,134 --> 00:19:35,415
are writing a little document, a JSON document effectively to your website.
494
00:19:35,485 --> 00:19:38,475
You're putting a JSON document in your canonical data
495
00:19:38,475 --> 00:19:42,194
store that lives on your PDS, on your personal data server.
496
00:19:42,274 --> 00:19:45,624
For the vast majority of our users, that means they are writing it to a PDS
497
00:19:45,634 --> 00:19:50,384
that we operate, but there are also thousands of independently operated PDSs.
498
00:19:50,665 --> 00:19:50,955
I'm one
499
00:19:50,955 --> 00:19:51,225
of them.
500
00:19:51,425 --> 00:19:52,685
Yeah, Justin's one of them.
501
00:19:52,685 --> 00:19:52,795
He
502
00:19:52,795 --> 00:19:54,405
broke himself for a while, and I couldn't reply to him.
503
00:19:54,415 --> 00:19:55,065
I was broken for a
504
00:19:55,065 --> 00:19:56,055
little while, but I'm back.
505
00:19:56,274 --> 00:19:57,825
It's been stable for a couple weeks now.
506
00:19:57,825 --> 00:19:58,017
It
507
00:19:58,017 --> 00:19:59,552
was like he picked on me specifically, Jazz.
508
00:19:59,552 --> 00:20:01,095
Like, I couldn't even reply to him.
509
00:20:01,105 --> 00:20:04,044
And he kept tagging me in taco and license plate things.
510
00:20:04,044 --> 00:20:04,925
Like, that's just mean.
511
00:20:04,935 --> 00:20:06,295
First of all, I'm hungry.
512
00:20:07,204 --> 00:20:09,014
And then I couldn't even reply.
513
00:20:09,274 --> 00:20:10,204
That's unfortunate.
514
00:20:10,205 --> 00:20:12,475
The nature of a distributed network is, is you've got
515
00:20:12,475 --> 00:20:14,425
all these documents that you write into your personal
516
00:20:14,425 --> 00:20:16,575
data store, whether it's hosted by us or somebody else.
517
00:20:17,100 --> 00:20:19,299
They get aggregated into one giant firehose.
518
00:20:19,309 --> 00:20:22,699
So your PDS emits an event stream for all of the repos hosted on it.
519
00:20:22,709 --> 00:20:26,620
So for our users, it's usually, right now it's like 500,000 users per PDS.
520
00:20:26,789 --> 00:20:29,509
And so if you're on like Amanita, that's a,
521
00:20:29,600 --> 00:20:31,749
all of our PDSs are named after mushrooms.
522
00:20:31,870 --> 00:20:33,049
Um, so if you're on Amanita,
523
00:20:33,434 --> 00:20:38,144
you've got 499,999 of your closest friends on Amanita with
524
00:20:38,144 --> 00:20:41,084
you, and every time you post, you are writing to Amanita, to
525
00:20:41,559 --> 00:20:44,684
a SQLite database that exists just for you on Amanita.
526
00:20:44,864 --> 00:20:45,824
Each user gets one.
527
00:20:45,824 --> 00:20:46,159
SQLite. We
528
00:20:46,159 --> 00:20:48,644
like database Justin, or do you think we're the same?
529
00:20:48,884 --> 00:20:49,245
We were.
530
00:20:49,249 --> 00:20:49,399
That was.
531
00:20:50,085 --> 00:20:51,855
Well, so that was the problem, actually, Autumn, because
532
00:20:51,855 --> 00:20:53,875
someone else pointed that out to me where you and I
533
00:20:53,875 --> 00:20:57,415
were on the same PDS originally before I migrated off.
534
00:20:57,875 --> 00:21:00,835
And then when I migrated off, they were the real MVP.
535
00:21:01,725 --> 00:21:03,864
I don't remember which one it was, but when I migrated off to my
536
00:21:03,865 --> 00:21:08,584
own, my account deactivation didn't fully happen on the hosted PDS.
537
00:21:08,584 --> 00:21:12,975
So people on my PDS couldn't reply to me until I went through and did my full.
538
00:21:14,215 --> 00:21:17,324
So it just so happened we were neighbors and, uh, and then when I
539
00:21:17,794 --> 00:21:20,125
moved out of the neighborhood, why'd you do that?
540
00:21:20,135 --> 00:21:21,824
Like just worst friend ever.
541
00:21:21,875 --> 00:21:22,175
Like
542
00:21:22,545 --> 00:21:24,325
You and your neighbors all chat all you want.
543
00:21:24,325 --> 00:21:26,924
You write all of these, these documents to your own
544
00:21:26,924 --> 00:21:29,025
little SQLites living on your mushroom with you.
545
00:21:29,285 --> 00:21:33,974
And then the mushroom itself, it sequences all of the events for its users.
546
00:21:33,975 --> 00:21:36,955
So you and all the other people on there are writing.
547
00:21:37,395 --> 00:21:41,225
Generally, we see somewhere between 5 and 20 events a second per PDS.
548
00:21:41,295 --> 00:21:43,665
So, all those writes get written into one
549
00:21:43,685 --> 00:21:46,135
sequencer database, which is a SQLite as well.
550
00:21:46,465 --> 00:21:48,504
Uh, and then once they get sequenced, they get given like
551
00:21:48,504 --> 00:21:50,744
a sequence number and they get emitted out of the firehose.
552
00:21:50,774 --> 00:21:52,865
So each, each mushroom has its own little firehose.
553
00:21:53,234 --> 00:21:56,165
And then we have something called the relay, which is in the network that
554
00:21:56,175 --> 00:22:00,645
sucks from all of the mushrooms and turns into a gigantic firehose that
555
00:22:00,645 --> 00:22:05,215
does, you know, anywhere from 1,000 to 2,000 events per second these days.
556
00:22:05,364 --> 00:22:10,865
And so that giant firehose is running in our, right now it runs in our on prem
557
00:22:10,865 --> 00:22:15,064
for us that merges all of the disparate event streams into one giant event
558
00:22:15,064 --> 00:22:19,784
stream, which makes consuming the network a lot easier and a lot less complex.
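To make the sequencing and merging just described concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Bluesky's code: the class names `PdsSequencer` and `Relay`, the event fields, the second PDS name `morel`, and the `did:example:` identifiers are all hypothetical. Each PDS stamps its own writes with a local sequence number, and the relay re-stamps everything it ingests so that consumers see one ordered stream.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class PdsSequencer:
    """Hypothetical per-PDS sequencer: stamps each write with a local sequence number."""
    name: str
    _next_seq: count = field(default_factory=lambda: count(1))

    def emit(self, repo: str, record: dict) -> dict:
        # Every event leaving this PDS carries the PDS-local sequence number.
        return {"pds": self.name, "seq": next(self._next_seq), "repo": repo, "record": record}


class Relay:
    """Hypothetical relay: merges many PDS streams into one ordered stream."""

    def __init__(self) -> None:
        self._next_seq = count(1)
        self.firehose = []

    def ingest(self, event: dict) -> dict:
        # Keep the PDS-local number for reference and assign a relay-wide one.
        merged = {**event, "pds_seq": event["seq"], "seq": next(self._next_seq)}
        self.firehose.append(merged)
        return merged


amanita = PdsSequencer("amanita")
morel = PdsSequencer("morel")  # hypothetical second PDS
relay = Relay()

relay.ingest(amanita.emit("did:example:alice", {"text": "hello"}))
relay.ingest(morel.emit("did:example:bob", {"text": "hi"}))
relay.ingest(amanita.emit("did:example:alice", {"text": "again"}))
```

A consumer that reads `relay.firehose` only has to track one cursor (the relay-wide `seq`) instead of one cursor per PDS, which is the simplification described above.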
559
00:22:19,864 --> 00:22:23,144
And so that one big event stream then gets crawled by firehose
560
00:22:23,174 --> 00:22:26,105
consumers, a couple hundred of those. Jetstream is connected to
561
00:22:26,105 --> 00:22:27,985
that, which is like a lightweight version of the firehose that
562
00:22:27,995 --> 00:22:29,935
has a couple hundred consumers that connect to it as well.
563
00:22:30,004 --> 00:22:32,024
But everybody, everybody consumes this Firehose and
564
00:22:32,024 --> 00:22:34,855
then the firehose has, like, hey, this person created
565
00:22:35,179 --> 00:22:39,519
this record with this ID, and here's the content of that record, and then
566
00:22:39,529 --> 00:22:42,870
here's a proof of this operation so that you can check and make sure that
567
00:22:42,870 --> 00:22:45,889
they actually created this record, like it's signed with their private key.
568
00:22:46,059 --> 00:22:49,229
Can I just work at Blue Sky for like a day and then just re architect
569
00:22:49,239 --> 00:22:51,989
all of your architecture with mushrooms, like just little mushroom
570
00:22:51,989 --> 00:22:55,719
databases and like just magical streams, you know, like it was
571
00:22:55,719 --> 00:22:58,769
just like the fire hose will be like magical and then like it'll
572
00:22:58,779 --> 00:23:01,719
be like each thing will be like a mushroom and it'll be adorable.
573
00:23:02,119 --> 00:23:04,169
There's a legendary drawing that I found in the
574
00:23:04,169 --> 00:23:06,289
developer discord, the third party dev discord.
575
00:23:06,399 --> 00:23:11,509
Somebody drew an architectural diagram of Bluesky, but where each node in the
576
00:23:11,509 --> 00:23:15,689
network is like a forest creature, and gave them, like, very interesting names.
577
00:23:15,720 --> 00:23:18,600
And so there, there is like kind of headcanon of, oh yeah, this,
578
00:23:18,649 --> 00:23:21,520
this is, this component is like an anteater and this component
579
00:23:21,530 --> 00:23:24,370
is like a hedgehog and this component, you know, whatever.
580
00:23:24,510 --> 00:23:26,889
Each database should be a mushroom or like
581
00:23:26,890 --> 00:23:29,850
each, each data... the mycosphere, we call it, uh,
582
00:23:29,850 --> 00:23:32,679
all of the, the PDSs make up the mycosphere.
583
00:23:32,919 --> 00:23:36,260
So that gets indexed internally, and then we have a big
584
00:23:36,669 --> 00:23:39,360
database in each of the POPs right now that runs
585
00:23:39,360 --> 00:23:42,389
Scylla, which is kind of a C++ rewrite of Cassandra.
586
00:23:42,439 --> 00:23:45,159
So it's a big NoSQL key, key-value store.
587
00:23:45,249 --> 00:23:49,350
And that's where we actually persist the global index of data on the network.
588
00:23:49,399 --> 00:23:53,460
So your PDS only knows about what your users have, what its users have.
589
00:23:53,889 --> 00:23:58,040
I love watching Scylla and Cassandra fight and then Scylla's like, C is like
590
00:23:58,040 --> 00:24:01,330
faster because we don't like compile and then Cassandra's like, but we're
591
00:24:01,330 --> 00:24:04,970
faster and it just pretend like it doesn't suck to manage us and then they
592
00:24:04,970 --> 00:24:08,929
fight back and forth and it's the best nerd fight you've ever seen in your life.
593
00:24:09,020 --> 00:24:10,710
Is that where the AppView pulls from?
594
00:24:10,720 --> 00:24:12,429
It's not going directly from the.
595
00:24:12,509 --> 00:24:12,759
Yeah.
596
00:24:12,760 --> 00:24:15,500
So the AppView pulls from its local... there's
597
00:24:15,500 --> 00:24:17,710
a data service that I wrote called Atlantis,
598
00:24:17,969 --> 00:24:20,929
which is our, like, data plane or whatever that talks to Scylla. That's
599
00:24:20,929 --> 00:24:24,039
what writes things into Scylla, that's what reads things out of Scylla. It
600
00:24:24,039 --> 00:24:27,409
also handles some, like, caching tiers, it handles some request coalescing,
601
00:24:27,419 --> 00:24:30,939
things like that. And so that is where the global index of data is. So when
602
00:24:30,939 --> 00:24:34,699
you load your timeline, when you load a thread, when you look at the number
603
00:24:34,699 --> 00:24:38,249
of likes on a post, that's all coming out of Scylla, that's coming from our
604
00:24:38,289 --> 00:24:42,359
big data store. And then in terms of scale for that, like, the actual amount
605
00:24:42,359 --> 00:24:45,509
of data that's on the network is, like, it's like a couple of terabytes.
606
00:24:45,559 --> 00:24:48,209
If you don't include images and you don't include video or anything like
607
00:24:48,209 --> 00:24:52,219
that, the actual like, record data, the JSON is like a couple terabytes.
608
00:24:52,229 --> 00:24:53,569
So it's not huge.
609
00:24:53,699 --> 00:24:55,159
The timelines are really big though.
610
00:24:55,219 --> 00:24:57,449
Timelines are a really weird workload, which is like every time
611
00:24:57,449 --> 00:25:00,349
you post, we send out your post to all the people that follow you.
612
00:25:00,469 --> 00:25:02,709
So if you have 20,000 followers and you post something,
613
00:25:02,709 --> 00:25:05,269
we're going to go insert 20,000 references to your
614
00:25:05,269 --> 00:25:08,509
post into the timelines of the people that follow you.
615
00:25:08,679 --> 00:25:09,979
And then we keep a... That sounds very complex.
616
00:25:09,979 --> 00:25:14,804
It's... it was a big architectural shift from what we did before, but the
617
00:25:14,824 --> 00:25:18,574
timelines themselves, like the timelines table is like over 100 billion rows.
618
00:25:18,684 --> 00:25:21,624
We trim it so like there's a maximum length of your timeline, but when you
619
00:25:21,624 --> 00:25:25,034
have 30 million users and you want to keep like a few thousand timeline
620
00:25:25,034 --> 00:25:27,964
items in there that quickly balloons to like hundreds of billions of rows.
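The fan-out-on-write pattern described here, with a capped timeline per user, can be sketched in a few lines of Python. This is only an illustration: the `Timelines` class, its method names, and the cap of three items are assumptions chosen for the demo (the real maximum timeline length is not stated), and in-memory deques stand in for the timelines table.

```python
from collections import defaultdict, deque

MAX_TIMELINE_ITEMS = 3  # tiny cap for the demo; the real limit is an unknown


class Timelines:
    """Hypothetical in-memory stand-in for a fan-out-on-write timelines table."""

    def __init__(self, max_items=MAX_TIMELINE_ITEMS):
        self.followers = defaultdict(set)
        # deque(maxlen=...) silently drops the oldest entry once the cap is hit,
        # which plays the role of trimming each timeline to a maximum length.
        self.rows = defaultdict(lambda: deque(maxlen=max_items))

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def fan_out(self, author, post_uri):
        """Insert one reference to the post per follower; return the write count."""
        for follower in self.followers[author]:
            self.rows[follower].append(post_uri)
        return len(self.followers[author])


timelines = Timelines()
for name in ("bob", "carol", "dave"):
    timelines.follow(name, "alice")

# Five posts to three followers costs fifteen timeline writes.
total_writes = sum(timelines.fan_out("alice", f"post-{i}") for i in range(5))
bob_timeline = list(timelines.rows["bob"])
```

The write cost grows with followers times posts, while the stored rows are bounded by users times the cap, which is why a few thousand items across tens of millions of users lands in the hundreds of billions of rows.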
621
00:25:27,965 --> 00:25:31,564
Wasn't it that the Bluesky account had to, like, you had to post to tens of
622
00:25:31,564 --> 00:25:33,945
thousands of people, wait five minutes, right, to let that propagate?
623
00:25:34,125 --> 00:25:36,995
There was a moment where we had, there was only
624
00:25:36,995 --> 00:25:39,615
one work queue or whatever for dealing with stuff.
625
00:25:39,655 --> 00:25:42,245
And the fan out job was also in that same work queue.
626
00:25:42,334 --> 00:25:44,735
And so it, like you get sharded into a work queue
627
00:25:44,735 --> 00:25:46,074
based off of your DID and all that kind of stuff.
628
00:25:46,084 --> 00:25:47,654
But the bsky.app account would...
629
00:25:47,675 --> 00:25:50,314
It would create a post, and then it would start fanning out
630
00:25:50,314 --> 00:25:53,064
the post, and the creation of the next post in the thread
631
00:25:53,064 --> 00:25:55,324
would get blocked, because it would be waiting for the fanout
632
00:25:55,344 --> 00:25:57,374
to finish before it would create the next post in the thread.
633
00:25:57,374 --> 00:26:00,064
And now, those are two separate queues, so fanout jobs
634
00:26:00,064 --> 00:26:02,014
can happen in the background, and they don't block
635
00:26:02,014 --> 00:26:04,694
the, like, persisting of the actual thread post itself.
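The change described here, moving fan-out onto its own queue so it no longer blocks persisting the next post, can be sketched as follows. This is a single-threaded simulation under assumptions: the queue names, the worker functions, and the `thread-post-N` identifiers are hypothetical, and plain lists stand in for the post store and the timelines table.

```python
from collections import deque

persist_queue = deque()  # (post_uri, follower_ids) waiting to be stored
fanout_queue = deque()   # same tuples, waiting for timeline fan-out
persisted = []           # stands in for the post store
timeline_writes = []     # stands in for the timelines table


def run_persist_worker():
    """Store every queued post, handing fan-out off instead of waiting on it."""
    while persist_queue:
        post_uri, follower_ids = persist_queue.popleft()
        persisted.append(post_uri)
        fanout_queue.append((post_uri, follower_ids))


def run_fanout_worker():
    """Drain fan-out jobs in the background, one timeline write per follower."""
    while fanout_queue:
        post_uri, follower_ids = fanout_queue.popleft()
        for follower in follower_ids:
            timeline_writes.append((follower, post_uri))


followers = ["bob", "carol"]
for n in range(1, 4):
    persist_queue.append((f"thread-post-{n}", followers))

run_persist_worker()
pending_fanout_jobs = len(fanout_queue)  # whole thread stored, nothing fanned out yet
run_fanout_worker()
```

With a single shared queue, the second post of the thread would sit behind the first post's fan-out job; with two queues, the whole thread is stored before any fan-out work has to finish.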
636
00:26:04,930 --> 00:26:08,960
Okay, do you have a different flow for users that have a bunch of followers
637
00:26:08,960 --> 00:26:11,860
versus users that don't have a bunch because there was a certain point
638
00:26:11,910 --> 00:26:15,269
in like Twitter where they had to re architect for like Justin Bieber
639
00:26:15,760 --> 00:26:18,770
versus a regular person and that's one of my favorite data stories
640
00:26:18,800 --> 00:26:22,379
because it just shows you how scale can just be completely ridiculous.
641
00:26:22,379 --> 00:26:25,270
Like he would get so many followers a day and then when he would
642
00:26:25,290 --> 00:26:29,490
tweet it would like mess up everything and it's just so interesting.
643
00:26:29,540 --> 00:26:30,650
We haven't done that yet.
644
00:26:30,779 --> 00:26:33,560
But that is absolutely, like a hybrid timeline architecture is
645
00:26:33,560 --> 00:26:36,919
absolutely probably where we'll go as we get bigger and bigger.
646
00:26:36,989 --> 00:26:38,629
Because right now, every time bsky.app
647
00:26:38,719 --> 00:26:40,320
posts a thread, it's getting fanned out
648
00:26:40,320 --> 00:26:42,599
to, I think, 22 million people's timelines.
649
00:26:42,809 --> 00:26:43,599
That's a lot of writes.
650
00:26:43,769 --> 00:26:45,729
And if they post a five post thread, that's like
651
00:26:45,739 --> 00:26:47,570
a hundred mil, over a hundred million writes.
652
00:26:47,659 --> 00:26:49,799
The, the guy who wrote Data-Intensive
653
00:26:49,799 --> 00:26:50,869
Applications
654
00:26:50,870 --> 00:26:52,399
is on Bluesky.
655
00:26:53,354 --> 00:26:53,905
Yes.
656
00:26:54,175 --> 00:26:59,614
And he's so rad and nice and that, dude, that's my favorite book.
657
00:26:59,695 --> 00:26:59,754
It is.
658
00:27:00,725 --> 00:27:01,364
The boar book.
659
00:27:01,364 --> 00:27:02,144
But when I found
660
00:27:02,304 --> 00:27:04,395
him, I was like, Oh my God, you're real.
661
00:27:05,455 --> 00:27:05,764
Yeah.
662
00:27:06,455 --> 00:27:10,205
Martin is actually a technical advisor of Blue Sky.
663
00:27:10,464 --> 00:27:10,484
To
664
00:27:10,485 --> 00:27:13,485
be like, so smart, you know, like you'd think that he would
665
00:27:13,495 --> 00:27:15,844
be like, Oh, I'm too smart and I won't talk to people.
666
00:27:15,844 --> 00:27:16,834
And he's so nice.
667
00:27:17,389 --> 00:27:18,139
He's a teacher.
668
00:27:18,139 --> 00:27:19,980
I feel like he gets a lot of human interaction.
669
00:27:19,990 --> 00:27:22,490
He's not like locked in a cave doing like research.
670
00:27:22,540 --> 00:27:24,679
So I think he ends up interacting with humans a
671
00:27:24,679 --> 00:27:27,459
lot more than some, uh, some CS researchers do.
672
00:27:27,629 --> 00:27:30,050
Also, I think the way that that book is written, you can
673
00:27:30,050 --> 00:27:32,820
almost tell that he must have taught something, because it's
674
00:27:33,095 --> 00:27:37,514
much more digestible than a lot of just dense, horrible data.
675
00:27:38,695 --> 00:27:39,475
Martin is great.
676
00:27:39,514 --> 00:27:41,575
We meet with him fairly regularly to just
677
00:27:41,584 --> 00:27:43,135
talk about like issues that we're having.
678
00:27:43,135 --> 00:27:43,325
I'm a
679
00:27:43,325 --> 00:27:43,794
fan.
680
00:27:43,794 --> 00:27:44,244
Okay.
681
00:27:44,314 --> 00:27:46,685
Tell him I'm a fan girl over all of his data books.
682
00:27:46,685 --> 00:27:47,934
And that's my favorite data book.
683
00:27:47,935 --> 00:27:49,115
And I talk about it way too much.
684
00:27:49,115 --> 00:27:51,854
And people are probably so tired of me bringing up that one book.
685
00:27:51,935 --> 00:27:52,845
One of my favorite.
686
00:27:52,949 --> 00:27:54,550
We have, we do have a lot of like internal memes.
687
00:27:54,570 --> 00:27:56,169
I think we've shared a couple of them on the network.
688
00:27:56,169 --> 00:27:57,479
We do have memes
689
00:27:58,320 --> 00:28:00,719
and we need to, like, plushie pics, Jazz.
690
00:28:00,749 --> 00:28:01,639
I just blue skies.
691
00:28:01,639 --> 00:28:01,870
Yeah.
692
00:28:02,009 --> 00:28:02,409
You're right.
693
00:28:03,759 --> 00:28:06,059
Plushie pics
694
00:28:06,060 --> 00:28:09,349
and, and Designing Data-Intensive Applications memes. Martin used to come to us
695
00:28:09,349 --> 00:28:11,999
with like, we'd, we'd go to him and we'd ask like all these questions and he'd
696
00:28:11,999 --> 00:28:14,540
give us like, Oh yeah, here's a great, here's a great way to solve that problem.
697
00:28:14,540 --> 00:28:16,300
And nowadays we go to him and every time we
698
00:28:16,310 --> 00:28:17,860
like, we're like, Hey, we have this problem.
699
00:28:18,290 --> 00:28:19,870
And he'd be like, ah, that's a tough one.
700
00:28:20,209 --> 00:28:20,510
And it's like,
701
00:28:21,080 --> 00:28:23,080
we're getting out of the realm of easy, of
702
00:28:23,080 --> 00:28:25,030
easy answers that are like well explored.
703
00:28:25,030 --> 00:28:27,419
And we're into the like, yep, that's, that's a challenge at scale.
704
00:28:27,590 --> 00:28:29,709
Can I come be a free technical consultant
705
00:28:29,709 --> 00:28:31,509
just so I can talk to Martin and get memes?
706
00:28:31,579 --> 00:28:33,389
Like, can I be paid in memes?
707
00:28:33,689 --> 00:28:35,229
You can talk to Martin on the network, and I think
708
00:28:35,229 --> 00:28:37,379
Martin also goes to a good number of conferences as well.
709
00:28:37,800 --> 00:28:39,709
You could, you could end up at a... I'm gonna not stalk him in
710
00:28:39,709 --> 00:28:42,879
a creepy way, but in a very nice, professional way.
711
00:28:42,879 --> 00:28:43,389
Yeah.
712
00:28:43,459 --> 00:28:44,986
Get him to come to one
713
00:28:44,986 --> 00:28:46,686
of the concerts we just did.
714
00:28:46,686 --> 00:28:49,919
With that general overview, I remember back when you were scaling
715
00:28:49,920 --> 00:28:52,629
a million users a day, you had to go rack some servers, right?
716
00:28:52,630 --> 00:28:53,920
Like, there was a point where you were like, hey,
717
00:28:53,920 --> 00:28:57,410
we need to scale up, and, and it's not in the PDS
718
00:28:57,910 --> 00:29:00,850
with, with millions of users coming. Even though that's growing, you
719
00:29:00,850 --> 00:29:04,330
can still scale that the, the bare metal, because those are rentals.
720
00:29:04,330 --> 00:29:06,450
That's a, it's a provider that say, give me another one.
721
00:29:06,700 --> 00:29:07,320
We'll provision it.
722
00:29:07,320 --> 00:29:08,160
It'll come into the network.
723
00:29:08,349 --> 00:29:10,549
And then on the other side of like some cloud services running,
724
00:29:10,549 --> 00:29:12,440
like those scale, but like somewhere in there you had to rack.
725
00:29:12,460 --> 00:29:14,270
And that's mostly, you said for the fire hose
726
00:29:14,309 --> 00:29:17,830
for that kind of global index, so that the data service that does
727
00:29:17,830 --> 00:29:20,420
all the querying to the database, the database cluster itself.
728
00:29:20,550 --> 00:29:24,000
And a couple of other, like the discover feed and stuff that run on prem.
729
00:29:24,040 --> 00:29:27,870
So those all require machines to run on and we don't have a
730
00:29:27,870 --> 00:29:31,810
magic, uh, like I can't change a number in a Pulumi deploy and
731
00:29:31,810 --> 00:29:34,990
then magically have more hardware available in the data center.
732
00:29:35,249 --> 00:29:36,269
It's a whole process.
733
00:29:36,270 --> 00:29:37,959
You've got to go through the acquisition process.
734
00:29:37,960 --> 00:29:38,970
You've got to find a vendor.
735
00:29:38,970 --> 00:29:40,010
You've got to talk to a vendor.
736
00:29:40,010 --> 00:29:42,600
You've got to, you know, spend some money on some new machines.
737
00:29:42,600 --> 00:29:43,160
They get shipped.
738
00:29:43,160 --> 00:29:44,210
You have to go to the data center.
739
00:29:44,210 --> 00:29:45,000
You have to receive them.
740
00:29:45,000 --> 00:29:48,340
You have to unbox everything, rack it, hook it up, network it, burn
741
00:29:48,340 --> 00:29:50,760
it in, provision it, and then you can figure out, all right, how are
742
00:29:50,760 --> 00:29:53,650
we going to like migrate the workload to this, to this new hardware?
743
00:29:53,750 --> 00:29:57,400
That's why I think that, like, you have to have that happy medium between cloud
744
00:29:57,750 --> 00:30:02,090
and, like, on prem. Like, everybody acts like either is some magical solution, and
745
00:30:02,090 --> 00:30:05,719
I'm just like, we're going to pretend that we forgot the lead time and all the
746
00:30:05,720 --> 00:30:09,440
stuff you have to do to get something on prem? Like, it is cheaper and it does
747
00:30:09,440 --> 00:30:12,580
need to be used a lot more, because putting everything in the cloud is just
748
00:30:12,580 --> 00:30:17,080
not cost efficient, but I think people forgot how long it takes to get stuff
749
00:30:17,190 --> 00:30:20,750
on prem, and then the fact that you have to go fix that when it burns out.
750
00:30:20,910 --> 00:30:22,560
There's a lot of convenience that comes
751
00:30:22,560 --> 00:30:24,690
with cloud, but you definitely pay for it.
752
00:30:24,930 --> 00:30:26,890
And you don't necessarily pay for it in the
753
00:30:26,890 --> 00:30:28,530
things that you expect to pay for it in.
754
00:30:28,540 --> 00:30:30,730
Like, you don't expect, ah, you're gonna charge a markup
755
00:30:30,760 --> 00:30:34,430
on this EC2 instance based off of how powerful it is.
756
00:30:34,430 --> 00:30:38,089
You end up paying most of it in, like, kind of hidden places, like, you
757
00:30:38,089 --> 00:30:42,369
know, in egress fees or in, like, WAF requests or something like that.
758
00:30:42,370 --> 00:30:44,019
You're also kind of beholden to
759
00:30:44,019 --> 00:30:46,380
them and their decision making, you know.
760
00:30:46,660 --> 00:30:49,210
Yeah, a lot of a lot of cloud providers haven't really passed
761
00:30:49,210 --> 00:30:53,410
down cost savings of like more efficient hardware to consumers.
762
00:30:53,480 --> 00:30:56,520
So like the cost of an EC2 instance per like vCore hasn't
763
00:30:56,530 --> 00:31:00,690
really, or vCPU hasn't really gone down much over time.
764
00:31:00,720 --> 00:31:03,659
And the number of vCPUs you can pack into a single machine
765
00:31:03,659 --> 00:31:06,745
and that you can, the amount of compute you get per watt
766
00:31:06,895 --> 00:31:10,195
in a data center has had insane leaps in the past 10 years.
767
00:31:10,255 --> 00:31:12,675
I'm really interested to see where that goes, right?
768
00:31:12,705 --> 00:31:15,285
Like eventually they're going to have to figure out
769
00:31:15,335 --> 00:31:18,154
how to compete with on prem, you know what I mean?
770
00:31:18,585 --> 00:31:22,025
And it's just interesting the way that they've made cuts in certain areas.
771
00:31:22,025 --> 00:31:24,915
And I'm like, bro, you're making cuts for the most expensive stuff that you
772
00:31:24,915 --> 00:31:28,085
run, but not the stuff that you get for the cheapest, which is very interesting.
773
00:31:28,105 --> 00:31:29,635
I mean, almost all of them have doubled
774
00:31:29,635 --> 00:31:31,794
down on their investments in custom silicon.
775
00:31:31,940 --> 00:31:34,210
And so they all say like, Oh, we're going to, the,
776
00:31:34,220 --> 00:31:37,760
the AWS play is Graviton is more efficient per watt.
777
00:31:37,760 --> 00:31:38,820
And so you should go to Graviton.
778
00:31:38,830 --> 00:31:39,790
You should use our
779
00:31:39,790 --> 00:31:44,540
About like the bad place, but Graviton is kind of fire.
780
00:31:44,650 --> 00:31:50,029
Now, is that a good excuse to ha for what we're talking about?
781
00:31:50,049 --> 00:31:52,850
No, but I think that is going to be one of the best
782
00:31:52,850 --> 00:31:55,650
things that has come out of the bad place in a long time.
783
00:31:55,870 --> 00:31:57,550
So for you, looking back on
784
00:31:58,000 --> 00:32:02,060
these separate places that pieces of infrastructure run, and putting
785
00:32:02,060 --> 00:32:03,940
things on prem and having to go through that we have to scale
786
00:32:03,940 --> 00:32:06,390
this thing up was, do you think that was still a good decision?
787
00:32:07,020 --> 00:32:07,660
Absolutely.
788
00:32:07,690 --> 00:32:08,120
Yeah.
789
00:32:08,300 --> 00:32:11,140
I mean, the way that we approached it was, hey, let's build
790
00:32:11,140 --> 00:32:16,190
out, let's way overbuild our on prem solution and then we'll be.
791
00:32:16,720 --> 00:32:20,200
Ready for, you know, insane overheads if something crazy happens.
792
00:32:20,270 --> 00:32:22,500
And then even now, we like, we just, we recently
793
00:32:22,500 --> 00:32:24,960
finished an expansion in our, in our on prem POPs.
794
00:32:25,020 --> 00:32:27,729
And even that was like a, it was a preemptive measure.
795
00:32:27,730 --> 00:32:29,989
It was a cool, we're not near the limits of the hardware
796
00:32:29,989 --> 00:32:32,990
we have right now, but if we want to keep really healthy
797
00:32:33,019 --> 00:32:35,509
overhead in our POPs, we should probably do some expansion.
798
00:32:35,589 --> 00:32:38,380
And so a lot of this comes from like planning for a couple orders
799
00:32:38,380 --> 00:32:41,060
of magnitude, and then making sure that in the time it would
800
00:32:41,060 --> 00:32:43,890
take to grow by a couple orders of magnitude, you, you can.
801
00:32:44,200 --> 00:32:45,960
get hardware where it needs to be in time.
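A rough sketch of that planning rule, assuming constant compound growth: compare how long the current headroom lasts against how long it takes to get hardware racked. The function names, the growth rate, the lead time, and the safety margin are all hypothetical, not figures from the conversation.

```python
import math


def months_until_full(current_load: float, capacity: float, monthly_growth: float) -> float:
    """Months until load reaches capacity at a constant compound growth rate."""
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    if current_load >= capacity:
        return 0.0
    return math.log(capacity / current_load) / math.log(1.0 + monthly_growth)


def should_order_now(current_load: float, capacity: float, monthly_growth: float,
                     lead_time_months: float, safety_margin_months: float = 2.0) -> bool:
    """True when the remaining runway is shorter than hardware lead time plus a margin."""
    runway = months_until_full(current_load, capacity, monthly_growth)
    return runway <= lead_time_months + safety_margin_months


# Hypothetical example: 10x headroom, 25% growth per month, 3 month lead time.
runway = months_until_full(current_load=50_000, capacity=500_000, monthly_growth=0.25)
print(f"runway: {runway:.1f} months")
print("order now:", should_order_now(50_000, 500_000, 0.25, lead_time_months=3))
```

The point of the margin is that a sudden spike can shorten the runway much faster than a smooth growth curve predicts.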
802
00:32:46,100 --> 00:32:49,220
Whatever you're doing, the planning is very well placed.
803
00:32:49,390 --> 00:32:50,230
You're doing a great job.
804
00:32:50,450 --> 00:32:52,820
A lot of it was kind of scarily instinct.
805
00:32:52,850 --> 00:32:54,820
Like, the most recent expansion that we did, I
806
00:32:54,829 --> 00:32:58,250
was like, this was after Brazil, you know, we saw
807
00:32:58,289 --> 00:33:02,699
Which is such a weird, like I almost wonder if y'all should just pay,
808
00:33:02,710 --> 00:33:05,680
like, Elon at this point, or like, send him a gift, because like,
809
00:33:05,710 --> 00:33:08,260
every time like, there for a while, every time he said something
810
00:33:08,260 --> 00:33:11,330
stupid or did something stupid, it would just be like, Spike?
811
00:33:12,700 --> 00:33:16,025
Like, you could tell what, like, you'd just be like, what did Elon do today?
812
00:33:16,025 --> 00:33:18,005
Cause there's so many new people, you can
813
00:33:18,065 --> 00:33:18,745
see them on graphs.
814
00:33:19,045 --> 00:33:21,425
They're pretty noticeable and pretty sharp on the graph, make
815
00:33:21,425 --> 00:33:22,584
a meme of the graph.
816
00:33:22,584 --> 00:33:22,884
Right.
817
00:33:22,925 --> 00:33:24,385
And then put his head on each.
818
00:33:27,285 --> 00:33:29,995
You're like, we'd mark our, all of our graphs with our deploys.
819
00:33:30,055 --> 00:33:32,864
And instead you have all these marks of like news articles, a
820
00:33:32,865 --> 00:33:35,055
little blip of like the dumb thing he did that
821
00:33:35,055 --> 00:33:37,595
day, you know, like talked crap to Brazil.
822
00:33:39,375 --> 00:33:43,365
So much of our planning and everything is like, we, we don't have control
823
00:33:43,365 --> 00:33:46,464
over how many people are going to decide to use our website today.
824
00:33:46,464 --> 00:33:46,614
There
825
00:33:46,615 --> 00:33:50,124
was like a rumor at Tesla that when his girlfriend changed
826
00:33:50,124 --> 00:33:52,945
the color of her hair, or that like if they had a fight, like
827
00:33:52,945 --> 00:33:56,395
if they saw them walk out, the handlers of the Elon would
828
00:33:56,395 --> 00:33:59,735
like panic and then figure out how to like make it the least.
829
00:34:00,290 --> 00:34:02,340
Wild outcome of that.
830
00:34:02,370 --> 00:34:06,760
Like, can you imagine this dude is like CEO of a company,
831
00:34:06,810 --> 00:34:10,450
like, and they have handlers because they're worried about,
832
00:34:10,450 --> 00:34:14,220
like, what will result after this argument or hair color?
833
00:34:14,220 --> 00:34:15,809
Like, can you imagine that environment?
834
00:34:15,809 --> 00:34:16,710
And you know what I mean?
835
00:34:16,839 --> 00:34:18,740
And now it's affecting a whole nother company.
836
00:34:18,740 --> 00:34:20,970
And now we're just like, let's try it with the country.
837
00:34:20,970 --> 00:34:22,619
It's going to be great.
838
00:34:22,620 --> 00:34:26,540
I mean, well, when, when all of Brazil loses access to Twitter, like overnight.
839
00:34:26,610 --> 00:34:28,140
That was an insane moment.
840
00:34:28,160 --> 00:34:32,560
That was like, we were, I think some numbers I can talk about, like, which
841
00:34:32,560 --> 00:34:36,349
are fun numbers, is like total request throughput across the PDSs. So like,
842
00:34:36,350 --> 00:34:39,859
that's kind of our, our big, how much load is going on right now number.
843
00:34:40,100 --> 00:34:42,790
And before Brazil, we were doing like three and
844
00:34:42,790 --> 00:34:46,319
a half K, 4,000 requests a second peak a day.
845
00:34:46,510 --> 00:34:48,829
And then Brazil happened and we shot to
846
00:34:48,830 --> 00:34:52,390
25,000 requests a second across all of the PDSs.
847
00:34:52,460 --> 00:34:53,080
And then.
848
00:34:53,655 --> 00:34:56,015
In November, we hit our new kind of record,
849
00:34:56,015 --> 00:34:58,115
which was like 50,000 requests a second.
850
00:34:58,195 --> 00:35:02,245
So we're still like way above Brazil's peak on like a daily basis now.
851
00:35:02,305 --> 00:35:04,544
But it is insane to me that that was, that
852
00:35:04,545 --> 00:35:07,484
was like a 10x event for us, which is crazy.
853
00:35:07,684 --> 00:35:10,755
And now that has become normal in like a few months.
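For scale, the growth factors can be worked out from the rounded peaks quoted above. This is just arithmetic on the numbers as spoken; on those figures the Brazil jump comes out at roughly 6 to 7x rather than a literal 10x, and November is about 12x the pre-Brazil peak. The constant names are illustrative.

```python
def growth_factor(before_rps: float, after_rps: float) -> float:
    """How many times larger the new peak is than the old one."""
    return after_rps / before_rps


# Rounded peak requests per second across the PDSs, as quoted in the conversation.
PRE_BRAZIL_RPS = 4_000
BRAZIL_PEAK_RPS = 25_000
NOVEMBER_PEAK_RPS = 50_000

print("Brazil vs before:   ", growth_factor(PRE_BRAZIL_RPS, BRAZIL_PEAK_RPS))
print("November vs Brazil: ", growth_factor(BRAZIL_PEAK_RPS, NOVEMBER_PEAK_RPS))
print("November vs before: ", growth_factor(PRE_BRAZIL_RPS, NOVEMBER_PEAK_RPS))
```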
854
00:35:10,765 --> 00:35:12,395
It's like, yeah, that's just what we deal with every day now.
855
00:35:12,395 --> 00:35:13,565
We were running around like chickens
856
00:35:13,565 --> 00:35:15,185
with our heads cut off when Brazil happened.
857
00:35:15,455 --> 00:35:17,695
And then November came along and that was like.
858
00:35:18,140 --> 00:35:19,830
An even worse version of it, it was like
859
00:35:19,880 --> 00:35:21,890
four Brazils or something crazy like that.
860
00:35:21,970 --> 00:35:24,000
After Brazil happened, we were like, alright, how
861
00:35:24,000 --> 00:35:26,260
the heck are we going to plan for a 10x of this?
862
00:35:26,360 --> 00:35:28,250
But we, we did everything we could to like, alright,
863
00:35:28,250 --> 00:35:30,579
can we prepare for a 10x of what that just was?
864
00:35:30,599 --> 00:35:32,419
And now November happened and we're like, alright, how do
865
00:35:32,420 --> 00:35:34,680
we prepare, how do we prepare for a 10x of what that was?
866
00:35:34,869 --> 00:35:36,859
Okay, like low key though, did you think
867
00:35:36,859 --> 00:35:38,650
like it was going to hit the fan in November?
868
00:35:38,880 --> 00:35:40,589
Or like, did you, like, did you anticipate it at all?
869
00:35:40,589 --> 00:35:40,990
Yeah, we were
870
00:35:41,250 --> 00:35:41,510
prepared.
871
00:35:41,550 --> 00:35:43,640
I don't think we, anybody expected that we were
872
00:35:43,640 --> 00:35:47,459
gonna like triple our user base in like three weeks.
873
00:35:47,630 --> 00:35:51,090
We had 10 million users leading up to the election roughly, right?
874
00:35:51,220 --> 00:35:54,049
And we're, a few months after that, like, I think we hit
875
00:35:54,050 --> 00:35:58,020
25 million users within like a month of the election.
876
00:35:58,395 --> 00:36:02,285
For the small issues you had, they were very well handled.
877
00:36:02,285 --> 00:36:03,655
Like y'all were
878
00:36:03,695 --> 00:36:06,665
just, I would, that's actually, so I, we asked questions or I asked
879
00:36:06,665 --> 00:36:09,585
him questions on Bluesky, like, Hey, anyone have questions to ask?
880
00:36:09,585 --> 00:36:11,065
And one of them was about incident management.
881
00:36:11,505 --> 00:36:12,495
How does that, how does that work?
882
00:36:12,495 --> 00:36:14,785
How do you learn from some of those incidents you've been having?
883
00:36:14,794 --> 00:36:17,944
Like there's, there's always something going on, um, between
884
00:36:17,945 --> 00:36:21,105
all the planning and, and the other, the normal things
885
00:36:21,105 --> 00:36:23,865
you have to do to like deploy and make software better.
886
00:36:24,055 --> 00:36:25,495
How do you handle those incidents?
887
00:36:30,715 --> 00:36:32,765
A lot of metrics, a lot of dashboards.
888
00:36:32,895 --> 00:36:35,315
That's, that's kind of the most important thing is like, if
889
00:36:35,315 --> 00:36:38,145
you are not measuring something, it is very hard to improve it.
890
00:36:38,235 --> 00:36:39,874
And so we lean really heavily into Can you say that
891
00:36:39,874 --> 00:36:41,544
louder for the people in the back, Jazz?
892
00:36:41,675 --> 00:36:45,265
Observability and monitoring is important, engineers.
893
00:36:46,885 --> 00:36:50,105
If you, if you can't measure it, you, you can't meaningfully improve it.
894
00:36:50,255 --> 00:36:51,565
Or at least you can't prove that you improved it.
895
00:36:51,735 --> 00:36:54,545
So when things were going crazy in November,
896
00:36:54,575 --> 00:36:57,174
we had what I call like the 11 days from hell.
897
00:36:57,495 --> 00:37:03,615
Which was 11 days of 16 hours a day in a situation room from like the moment you
898
00:37:03,615 --> 00:37:07,695
wake up to the moment you go to bed and then like Wake up, check some graphs in
899
00:37:07,695 --> 00:37:11,465
bed, line is still going up, get ready as quickly as you can, get downstairs,
900
00:37:11,465 --> 00:37:13,765
log into the situation room, and figure out what's on fire this morning.
901
00:37:13,875 --> 00:37:14,935
Tell me there was coffee.
902
00:37:15,065 --> 00:37:16,865
I drink Monster, but yeah, there was, there's,
903
00:37:16,905 --> 00:37:18,705
there's a lot of, uh, a lot of Dang, see, that's why
904
00:37:18,705 --> 00:37:20,975
that, that's why their infrastructure never goes down.
905
00:37:21,155 --> 00:37:24,204
That was, there, there were so many, like, so many different components hit.
906
00:37:24,270 --> 00:37:28,610
I guess you would call them like early scaling limits, not that they were at the
907
00:37:28,610 --> 00:37:31,600
maximum of their design, but that they've never been pushed that hard before.
908
00:37:31,600 --> 00:37:34,390
And so we were shaking out bugs all over the place, like scaling,
909
00:37:34,449 --> 00:37:37,210
scaling issues or like some concurrency bug or something like that
910
00:37:37,210 --> 00:37:39,800
that was falling out from so many different systems all at once.
911
00:37:39,819 --> 00:37:40,359
Because when you.
912
00:37:40,795 --> 00:37:43,945
You drive a truck over a bridge, and if the truck is really heavy,
913
00:37:44,165 --> 00:37:46,195
and it's like too heavy, and you have this really old, like,
914
00:37:46,195 --> 00:37:49,965
bolt in the bridge, the bolt could, like, get broken, or shear,
915
00:37:49,965 --> 00:37:52,595
or fall off, and that reduces some of the stress of the bridge.
916
00:37:52,765 --> 00:37:54,954
Like, it starts swaying, and then a bolt
917
00:37:54,954 --> 00:37:56,865
fires off, and then it stops swaying as much.
918
00:37:56,884 --> 00:37:58,504
And that kind of, that's kind of how you, like,
919
00:37:58,505 --> 00:38:01,115
release tension in a bridge when it's under stress.
920
00:38:01,405 --> 00:38:06,215
But if you land, like, an AC-130 on the bridge, and you're, like, taking, like,
921
00:38:06,215 --> 00:38:11,355
a giant jumbo jet, or, like, some kind of massive 747 landed on the bridge all
922
00:38:11,355 --> 00:38:15,475
at once and a bunch of bolts pop loose all at once and you're like Oh, crap.
923
00:38:15,535 --> 00:38:17,315
Which one do we go fix first?
924
00:38:17,335 --> 00:38:20,785
Which one is like structurally important to the success of the bridge?
925
00:38:20,875 --> 00:38:23,624
So when you scale insanely fast in a really short period of
926
00:38:23,625 --> 00:38:26,835
time, you have a lot of systems that hit these early limits
927
00:38:26,845 --> 00:38:30,195
or that, that shoot these bugs out like bolts off of a bridge.
928
00:38:30,214 --> 00:38:33,274
And you have to figure out through your metrics, figure out,
929
00:38:33,275 --> 00:38:36,195
okay, which services are okay, which services are not okay.
930
00:38:36,475 --> 00:38:38,545
And then dig into the services that are not okay and
931
00:38:38,545 --> 00:38:40,455
figure out, all right, where are we running into problems?
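A minimal sketch of that triage step, assuming each service exposes an error rate and a p99 latency: flag anything over a threshold and look at the worst offender first. The service names, thresholds, and numbers below are made up for illustration and are not Bluesky's actual dashboards.

```python
from typing import Dict, List, NamedTuple


class ServiceStats(NamedTuple):
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def triage(stats: Dict[str, ServiceStats],
           max_error_rate: float = 0.01,
           max_p99_ms: float = 500.0) -> List[str]:
    """Return the names of unhealthy services, worst error rate first."""
    unhealthy = [
        (name, s) for name, s in stats.items()
        if s.error_rate > max_error_rate or s.p99_latency_ms > max_p99_ms
    ]
    unhealthy.sort(key=lambda item: item[1].error_rate, reverse=True)
    return [name for name, _ in unhealthy]


# Hypothetical snapshot of a dashboard during a spike.
snapshot = {
    "pds": ServiceStats(error_rate=0.002, p99_latency_ms=120.0),
    "handle-resolver": ServiceStats(error_rate=0.40, p99_latency_ms=2500.0),
    "timelines": ServiceStats(error_rate=0.005, p99_latency_ms=900.0),
}
print(triage(snapshot))
```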
932
00:38:40,785 --> 00:38:44,405
One of the craziest issues we had was like everybody's handles suddenly
933
00:38:44,405 --> 00:38:48,949
started becoming invalid because we ran into the limits of public DNS resolvers.
934
00:38:49,210 --> 00:38:52,670
We were like hitting Google Public DNS Resolver and
935
00:38:52,680 --> 00:38:55,130
Cloudflare's Public DNS Resolver so heavily they started
936
00:38:55,130 --> 00:38:57,699
rate limiting us and we just couldn't do DNS queries anymore.
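One common way to take pressure off an upstream resolver is to cache answers for a short time, so repeated lookups of the same name never leave the box. The sketch below shows that idea only; the conversation does not say how the team actually fixed it, and `CachingResolver`, the TTL, and the fake lookup function are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Wrap a lookup function with a simple time-to-live cache."""

    def __init__(self, lookup: Callable[[str], str], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}
        self.upstream_queries = 0  # how many lookups actually went upstream

    def resolve(self, name: str) -> str:
        now = self._clock()
        hit = self._cache.get(name)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, no upstream query needed
        self.upstream_queries += 1
        answer = self._lookup(name)
        self._cache[name] = (now + self._ttl, answer)
        return answer


def fake_upstream(name: str) -> str:
    """Stand-in for a real DNS query; returns a made-up answer."""
    return f"answer-for-{name}"


resolver = CachingResolver(fake_upstream, ttl_seconds=60.0)
for _ in range(1000):
    resolver.resolve("alice.example.com")
print("upstream queries:", resolver.upstream_queries)
```

A thousand lookups of the same name inside the TTL cost a single upstream query, which is the property that matters when the upstream is rate limiting you.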
937
00:38:57,860 --> 00:38:58,929
Okay, can we just talk though?
938
00:38:58,930 --> 00:39:00,050
Like, why is it always DNS?
939
00:39:00,050 --> 00:39:04,979
DNS finds new ways to like, just ruin people's lives.
940
00:39:04,979 --> 00:39:07,159
Like, it wakes up in the morning and it's like, how
941
00:39:07,159 --> 00:39:10,039
can I be difficult in a way that they'll never expect?
942
00:39:10,039 --> 00:39:12,140
Like, it's never something that's easily figured out.
943
00:39:12,140 --> 00:39:14,540
You gotta go down the whole rabbit hole, figure
944
00:39:14,540 --> 00:39:16,710
out some way that you've never heard of before.
945
00:39:17,245 --> 00:39:20,305
Justin's problem also somehow tied to DNS.
946
00:39:20,305 --> 00:39:24,845
Like, it's always every time and it's always, it's never like a normal error
947
00:39:24,845 --> 00:39:28,425
that like makes you think, okay, it's this, it's always something ridiculous.
948
00:39:28,425 --> 00:39:29,934
That's just this rabbit hole.
949
00:39:30,215 --> 00:39:34,294
Every error message in every application for every log
950
00:39:34,294 --> 00:39:36,854
everywhere should probably just end with, it might be DNS.
951
00:39:36,855 --> 00:39:37,465
No, seriously.
952
00:39:37,655 --> 00:39:39,375
It should be like, go hit this line.
953
00:39:39,705 --> 00:39:41,155
This thing's mad at you.
954
00:39:41,185 --> 00:39:44,065
But also if this fails, is it DNS?
955
00:39:44,255 --> 00:39:44,945
Segfault.
956
00:39:44,965 --> 00:39:45,795
Maybe it's DNS.
957
00:39:45,855 --> 00:39:46,175
I don't know.
958
00:39:46,225 --> 00:39:49,185
And then Kubernetes was like, hey, what if we put DNS everywhere?
959
00:39:49,275 --> 00:39:51,855
What if we wove DNS through the entire stack?
960
00:39:51,955 --> 00:39:53,575
Actually, that's a good question because you said
961
00:39:53,575 --> 00:39:55,205
you were doing Kubernetes at previous startups.
962
00:39:55,224 --> 00:39:57,075
You don't have any Kubernetes in the stack now, right?
963
00:39:57,144 --> 00:39:57,455
We have
964
00:39:57,455 --> 00:39:58,575
no Kubernetes. It's all
965
00:39:58,905 --> 00:40:00,995
VMs and it's still containerized.
966
00:40:01,310 --> 00:40:02,130
It's containerized.
967
00:40:02,140 --> 00:40:04,890
It is containerized, but it is not a lot of VMs,
968
00:40:04,930 --> 00:40:07,490
even honestly, it's just like SSH into the box.
969
00:40:07,500 --> 00:40:09,720
It's kind of running, you know, Linux right on top
970
00:40:09,720 --> 00:40:11,760
of the bare metal and then it's running Docker.
971
00:40:11,870 --> 00:40:13,490
So no traditional orchestrator.
972
00:40:13,580 --> 00:40:15,749
No, no traditional orchestrator at the moment.
973
00:40:15,910 --> 00:40:17,619
It's like Ansible jobs, Docker run.
974
00:40:18,420 --> 00:40:22,630
Yeah, Ansible jobs, Docker compose and a couple of tweaks to make things faster.
975
00:40:22,640 --> 00:40:25,390
We're not using like the Docker logging because Docker logging
976
00:40:25,390 --> 00:40:28,460
is not very good if you have really really high throughput logs.
977
00:40:28,595 --> 00:40:32,075
So using like, we're using svlogd, which is in runit.
978
00:40:32,175 --> 00:40:35,765
And so svlogd lets you just log to a directory and it kind of
979
00:40:35,765 --> 00:40:38,814
cycles through files and then you can use like Promtail to
980
00:40:39,090 --> 00:40:39,900
scrape those directories.
981
00:40:39,900 --> 00:40:44,120
So every container gets its own logging directory and then it just pipes
982
00:40:44,120 --> 00:40:47,349
it to svlogd and svlogd is really lightweight and it handles all the log
983
00:40:47,350 --> 00:40:50,400
management without having to do like standard out piping or anything like that.
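To make the "log to a directory and cycle through files" idea concrete, here is a small sketch of a size-based rotating directory logger. It only imitates the general behavior described above; it is not svlogd's implementation, and the class name, file naming scheme, and limits are assumptions for illustration.

```python
import os
import tempfile
import time


class DirLogger:
    """Append lines to <directory>/current and rotate by size (hypothetical helper)."""

    def __init__(self, directory: str, max_bytes: int = 1_000_000, keep: int = 10):
        if keep < 1:
            raise ValueError("keep must be at least 1")
        os.makedirs(directory, exist_ok=True)
        self.directory = directory
        self.max_bytes = max_bytes
        self.keep = keep
        self._rotations = 0
        self._current = os.path.join(directory, "current")
        self._fh = open(self._current, "ab")
        self._size = os.path.getsize(self._current)

    def write_line(self, line: str) -> None:
        data = line.rstrip("\n").encode("utf-8") + b"\n"
        # Rotate first if this line would push a non-empty file past the limit.
        if self._size > 0 and self._size + len(data) > self.max_bytes:
            self._rotate()
        self._fh.write(data)
        self._fh.flush()
        self._size += len(data)

    def _rotate(self) -> None:
        self._fh.close()
        self._rotations += 1
        name = f"{time.time_ns():020d}-{self._rotations:06d}.log"
        os.rename(self._current, os.path.join(self.directory, name))
        self._fh = open(self._current, "ab")
        self._size = 0
        # Keep only the newest `keep` rotated files; names sort chronologically.
        rotated = sorted(f for f in os.listdir(self.directory) if f.endswith(".log"))
        for stale in rotated[:-self.keep]:
            os.remove(os.path.join(self.directory, stale))

    def close(self) -> None:
        self._fh.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        log_dir = os.path.join(root, "container-01")  # one directory per container
        logger = DirLogger(log_dir, max_bytes=64, keep=3)
        for i in range(50):
            logger.write_line(f"request {i} ok")
        logger.close()
        print(sorted(os.listdir(log_dir)))
```

Because each container owns its own directory, a scraper only has to watch files on disk rather than sit in the container's stdout path.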
984
00:40:50,469 --> 00:40:53,930
Every user is a website, a SQLite database, and a svlogd.
985
00:40:54,600 --> 00:40:55,440
Yeah, exactly.
986
00:40:55,620 --> 00:40:56,010
Exactly.
987
00:40:56,310 --> 00:40:57,280
It's a whole stack right there.
988
00:40:57,910 --> 00:40:59,290
It works surprisingly well.
989
00:40:59,550 --> 00:41:01,920
Uh, you also want to make sure that you're not like doing user
990
00:41:01,920 --> 00:41:04,890
space Docker NAT, because user space Docker NAT is how you
991
00:41:04,970 --> 00:41:07,590
make your high throughput services be very low throughput.
992
00:41:07,659 --> 00:41:09,789
Well, you're not running everything like network host though, right?
993
00:41:10,029 --> 00:41:10,480
Uh, no.
994
00:41:10,489 --> 00:41:13,560
I mean, you can, you can run kernel level NAT, which is,
995
00:41:13,569 --> 00:41:17,320
which is a lot less, uh, messy than user level NAT for Docker.
996
00:41:17,470 --> 00:41:19,830
It's not CPU intensive, I guess I would say.
997
00:41:20,170 --> 00:41:22,320
Uh, there's less, less packet copying going on.
998
00:41:22,510 --> 00:41:23,930
But that's one of the reasons we don't, didn't want to run
999
00:41:23,930 --> 00:41:26,480
Kubernetes is we've got these really cool bare metal machines.
1000
00:41:26,580 --> 00:41:29,229
We don't want to add so many layers of virtualization on top of them that.
1001
00:41:29,960 --> 00:41:32,863
We lose a lot of the, like, benefit of being close to the metal.
1002
00:41:32,863 --> 00:41:33,329
You're gonna hide all that
1003
00:41:33,330 --> 00:41:34,930
performance under abstractions.
1004
00:41:34,930 --> 00:41:35,320
Yeah,
1005
00:41:35,710 --> 00:41:37,200
yeah, exactly, exactly.
1006
00:41:37,220 --> 00:41:39,040
Say goodbye to your, your cache locality.
1007
00:41:39,070 --> 00:41:41,530
Say goodbye to, I don't know, whatever it is you're, you're trying to do
1008
00:41:41,610 --> 00:41:44,710
because your, your container is being preempted because the Kubernetes,
1009
00:41:44,770 --> 00:41:47,080
the Kubelet needs to come in and do something or whatever it might be.
1010
00:41:47,179 --> 00:41:48,670
I mean, you can tune Kubernetes for performance
1011
00:41:48,670 --> 00:41:50,410
and you can run it in a high performance way.
1012
00:41:50,450 --> 00:41:51,600
We don't have the expertise to do that.
1013
00:41:51,790 --> 00:41:54,540
But what we, we do know is, yeah, you can just And
1014
00:41:54,890 --> 00:41:57,140
a lot of this, I mean, a lot of the orchestrators are
1015
00:41:57,140 --> 00:41:59,670
typically, you have a dynamic infrastructure, right?
1016
00:41:59,670 --> 00:42:01,610
Like you have machines coming and going frequently.
1017
00:42:01,610 --> 00:42:04,649
You need to reshuffle things or reallocate things.
1018
00:42:04,649 --> 00:42:06,309
And in a lot of your case, at least half
1019
00:42:06,309 --> 00:42:08,070
of your infrastructure is fairly static.
1020
00:42:08,240 --> 00:42:11,229
It's like we have a bunch of machines over here that are running PDSs,
1021
00:42:11,229 --> 00:42:14,720
a bunch of machines over here running all the app view and database.
1022
00:42:15,085 --> 00:42:18,185
Flows and everything and and you can define that that's a spreadsheet.
1023
00:42:18,235 --> 00:42:19,255
That's not an orchestrator
1024
00:42:19,335 --> 00:42:21,675
It's all very static and and you buy the
1025
00:42:21,675 --> 00:42:23,625
capacity when you buy the machines, right?
1026
00:42:23,695 --> 00:42:25,015
You can use as much or as little of it as you
1027
00:42:25,015 --> 00:42:26,704
want to you've already paid for it Basically,
1028
00:42:26,915 --> 00:42:29,385
do you think Bluesky will somehow figure
1029
00:42:29,385 --> 00:42:32,054
a way to incorporate video and images more?
1030
00:42:32,375 --> 00:42:34,694
So that way we don't have to go to any of the bad places
1031
00:42:35,225 --> 00:42:35,895
I think so.
1032
00:42:35,895 --> 00:42:36,265
Yeah.
1033
00:42:36,295 --> 00:42:38,515
I mean, I think recently we launched video feeds.
1034
00:42:38,525 --> 00:42:42,095
So feeds can describe themselves as like primarily a video feed and
1035
00:42:42,095 --> 00:42:45,374
then they'll go into that kind of video vertical scrolling mode.
1036
00:42:45,425 --> 00:42:47,605
That was like a six day project by the front end team
1037
00:42:47,605 --> 00:42:50,235
that was actually like kind of insane turnaround on that.
1038
00:42:50,284 --> 00:42:54,395
So we have a couple of things where we do very hackathon mindset and,
1039
00:42:54,405 --> 00:42:56,755
and we're like, cool, how quickly can we get something that is like.
1040
00:42:57,180 --> 00:42:59,330
Of our quality standards shipped to production.
1041
00:42:59,390 --> 00:43:01,370
When you're at a tiny company, you know, you've got
1042
00:43:01,370 --> 00:43:03,910
like 20 something people and you're dealing with tens of
1043
00:43:03,910 --> 00:43:06,259
millions of users, there's a lot of priority juggling.
1044
00:43:06,350 --> 00:43:10,320
And so you've got like stuff that's easy to do and stuff that is important.
1045
00:43:10,490 --> 00:43:12,529
There's stuff that's like fast and easy and stuff that's important.
1046
00:43:12,730 --> 00:43:14,880
And if it's in that quadrant, you've, you kind of just do it.
1047
00:43:15,155 --> 00:43:17,065
immediately drop whatever you're doing, go do that thing.
1048
00:43:17,175 --> 00:43:19,475
And then you have stuff that's like a little bit harder to do
1049
00:43:19,495 --> 00:43:22,015
and it's important, and that's work that you try to schedule.
1050
00:43:22,075 --> 00:43:25,265
And then you have work that is stuff that's like hard to do and unimportant.
1051
00:43:25,445 --> 00:43:27,865
And that's stuff that falls to the, kind of the bottom of your priority list.
1052
00:43:27,865 --> 00:43:30,414
And then there's stuff that is easy to do, but unimportant, and.
1053
00:43:30,720 --> 00:43:32,320
If you need extra dopamine and there's nothing on the
1054
00:43:32,320 --> 00:43:34,640
easy important list to do, you gotta do that stuff.
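The four buckets described here can be written down as a tiny sorting rule. This is only a sketch of the prioritization as described; the bucket labels and the example tasks are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    name: str
    easy: bool
    important: bool


# Buckets in the order they get attention.
ORDER = ["do now", "schedule", "filler", "bottom of the list"]


def bucket(task: Task) -> str:
    if task.important and task.easy:
        return "do now"              # drop everything and do it
    if task.important:
        return "schedule"            # harder but important: plan the work
    if task.easy:
        return "filler"              # easy but unimportant: only when idle
    return "bottom of the list"      # hard and unimportant


def prioritize(tasks: List[Task]) -> List[Task]:
    """Stable sort of tasks by bucket order."""
    return sorted(tasks, key=lambda t: ORDER.index(bucket(t)))


# Hypothetical backlog.
backlog = [
    Task("rewrite internal wiki theme", easy=False, important=False),
    Task("rename a dashboard panel", easy=True, important=False),
    Task("split noisy workload into its own cluster", easy=False, important=True),
    Task("raise a rate limit that is dropping logins", easy=True, important=True),
]
for task in prioritize(backlog):
    print(f"{bucket(task):>18}: {task.name}")
```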
1055
00:43:35,110 --> 00:43:37,010
Speaking of, of possibly important, I'm
1056
00:43:37,010 --> 00:43:38,550
going back to some of the questions here.
1057
00:43:39,170 --> 00:43:41,759
Someone's asking about like expansion outside the U.S. What
1058
00:43:41,760 --> 00:43:44,990
does that look like in your network, which is mostly static?
1059
00:43:44,990 --> 00:43:48,489
Are you going to, are you planning on doing some like, Oh, these
1060
00:43:48,490 --> 00:43:50,930
users really care about data locality or this country does.
1061
00:43:50,940 --> 00:43:53,810
So we have to put the PDSs or the whole stack in
1062
00:43:53,810 --> 00:43:56,090
that environment in their country within the borders.
1063
00:43:56,815 --> 00:44:00,865
I'm not up to date on the legal side of any of that or like
1064
00:44:00,865 --> 00:44:04,345
the regulatory side of that from a just a purely architectural
1065
00:44:04,345 --> 00:44:08,504
standpoint, it should be something doable is like run the PDS in
1066
00:44:08,505 --> 00:44:11,314
another country and then your canonical data lives in that country.
1067
00:44:11,465 --> 00:44:14,885
And then the other side, like if we wanted to run a POP in another country or
1068
00:44:14,885 --> 00:44:17,815
something like that, we could we could go set it up and move our hardware there.
1069
00:44:18,040 --> 00:44:19,910
Some countries are easier to do that in than others.
1070
00:44:19,980 --> 00:44:22,000
And then the connectivity of that country is also important.
1071
00:44:22,000 --> 00:44:23,800
It's like, cool, can we get a lot of bandwidth cheap?
1072
00:44:23,860 --> 00:44:25,030
Is it going to reach our customers?
1073
00:44:25,090 --> 00:44:28,360
There are a couple of considerations that go into where we place infrastructure.
1074
00:44:28,660 --> 00:44:29,840
Right now, it's mostly in the
1075
00:44:29,840 --> 00:44:31,350
U.S. just because that's the easiest place to put it.
1076
00:44:31,389 --> 00:44:34,099
When it comes to delivering like images and video, we, we work with
1077
00:44:34,099 --> 00:44:37,630
a CDN partner and the CDN, they've got, you know, a whole distributed
1078
00:44:37,630 --> 00:44:41,650
network of their POPs and their local caches and nodes and stuff.
1079
00:44:42,055 --> 00:44:46,025
Going back to the, the hardware, not going into super specific details,
1080
00:44:46,025 --> 00:44:49,125
but as far as like, how did you decide what to pick for hardware?
1081
00:44:49,125 --> 00:44:50,285
Where were you looking at?
1082
00:44:50,285 --> 00:44:51,874
What were the kind of the qualifications?
1083
00:44:52,085 --> 00:44:55,485
I can talk about like the chips and stuff that we're running because
1084
00:44:55,525 --> 00:44:59,814
we, we wanted to run AMD because current generation AMD in, in the
1085
00:44:59,814 --> 00:45:03,575
data center is just at a scale that it is hard to push Intel to.
1086
00:45:03,795 --> 00:45:06,545
It runs higher performance per watts and
1087
00:45:06,845 --> 00:45:08,355
you just get better density out of them.
1088
00:45:08,435 --> 00:45:11,425
That was kind of our decision on AMD versus Intel for that.
1089
00:45:11,745 --> 00:45:15,445
And also we were very interested in, uh, the X, the
1090
00:45:15,595 --> 00:45:18,305
3D V-Cache, uh, chips that AMD is coming out with.
1091
00:45:18,355 --> 00:45:21,565
And so Genoa-X CPUs, we've got, like, some of our
1092
00:45:21,565 --> 00:45:25,155
machines are spec'd with two of the 96 core, 192 thread
1093
00:45:25,165 --> 00:45:29,335
Genoa-X series CPUs that each have 768 megs of L3 cache.
1094
00:45:29,385 --> 00:45:29,395
I
1095
00:45:29,685 --> 00:45:31,215
mean, you're over 300 cores.
1096
00:45:31,225 --> 00:45:31,755
Holy crap.
1097
00:45:31,975 --> 00:45:35,515
Yeah, so it's uh, a gig and a half of uh, L3 cache in a
1098
00:45:35,515 --> 00:45:38,754
single box across two chips, which is absolutely absurd.
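The per-box totals follow directly from the per-chip figures as quoted in the conversation (two sockets, 96 cores and 192 threads each, 768 MB of L3 each). The sketch below just does that arithmetic; the constant names are illustrative and the inputs are the spoken numbers, not a spec sheet.

```python
SOCKETS = 2
CORES_PER_SOCKET = 96
THREADS_PER_CORE = 2          # SMT
L3_MB_PER_SOCKET = 768        # as quoted in the conversation

total_cores = SOCKETS * CORES_PER_SOCKET
total_threads = total_cores * THREADS_PER_CORE
total_l3_mb = SOCKETS * L3_MB_PER_SOCKET
total_l3_gb = total_l3_mb / 1024
l3_mb_per_core = L3_MB_PER_SOCKET / CORES_PER_SOCKET

print(f"physical cores per box: {total_cores}")
print(f"logical cores per box:  {total_threads}")
print(f"L3 cache per box:       {total_l3_mb} MB ({total_l3_gb} GB)")
print(f"L3 cache per core:      {l3_mb_per_core} MB")
```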
1099
00:45:38,814 --> 00:45:39,184
Yeah.
1100
00:45:39,605 --> 00:45:40,825
That's more than my first computer.
1101
00:45:41,464 --> 00:45:44,185
It's like all total RAM and that's cache.
1102
00:45:44,285 --> 00:45:46,554
Yeah, so you can get insane amounts of cache.
1103
00:45:46,554 --> 00:45:49,975
You can get these like really, really high core density machines.
1104
00:45:50,155 --> 00:45:51,935
You could, you could pack a ton of RAM into a box.
1105
00:45:51,945 --> 00:45:53,455
Like if you're, if you're just buying.
1106
00:45:53,880 --> 00:45:55,150
Your own box.
1107
00:45:55,150 --> 00:45:57,210
You can stick a couple of terabytes of RAM into it.
1108
00:45:57,360 --> 00:46:00,320
You can't get a couple of terabytes of RAM in a cloud VM.
1109
00:46:00,650 --> 00:46:02,410
You can, but you're going to pay for it.
1110
00:46:02,949 --> 00:46:04,220
I mean, you probably have to like
1111
00:46:04,220 --> 00:46:06,999
break like 16 different pieces of glass and like talk
1112
00:46:07,000 --> 00:46:09,539
to like 30 different account reps before they'll let
1113
00:46:09,540 --> 00:46:11,969
you get like a node with two terabytes of RAM in it.
1114
00:46:12,210 --> 00:46:14,860
Which is where cloud is not fun when like, it's
1115
00:46:14,860 --> 00:46:17,510
cool when you can get an instance in seconds.
1116
00:46:17,520 --> 00:46:20,280
It's not when you have to break glass and ask permission.
1117
00:46:20,510 --> 00:46:23,090
Yeah, we can buy hardware that is very kind
1118
00:46:23,090 --> 00:46:25,009
of tailored to the workloads that we're doing.
1119
00:46:25,019 --> 00:46:28,999
So ScyllaDB is a big distributed horizontally scalable database.
1120
00:46:29,009 --> 00:46:31,900
It's got a shard per core architecture, so you can throw a bunch
1121
00:46:31,900 --> 00:46:34,190
more cores at it and it will just kind of scale horizontally.
1122
00:46:34,280 --> 00:46:36,760
But what it does want is a lot of RAM and a lot of NVMe.
1123
00:46:36,840 --> 00:46:37,520
And so.
1124
00:46:37,930 --> 00:46:39,170
NVMe is cheap these days.
1125
00:46:39,170 --> 00:46:42,930
You can get like a 15 terabyte enterprise NVMe drive for like two grand.
1126
00:46:43,050 --> 00:46:45,680
Is it as hard to manage as Cassandra is?
1127
00:46:45,900 --> 00:46:48,900
It's been, when we've been using it correctly, it's
1128
00:46:48,900 --> 00:46:51,159
been totally quiet and we've had no issues with it.
1129
00:46:51,240 --> 00:46:54,309
We do have the timelines workload that is doing those like.
1130
00:46:54,680 --> 00:46:59,120
Many, many, many writes a second to timelines is not the best
1131
00:46:59,130 --> 00:47:02,710
fit for like an LSM tree with, with size-tiered compaction.
1132
00:47:02,930 --> 00:47:05,690
So we were running into performance issues there that were really annoying.
1133
00:47:05,810 --> 00:47:08,240
We've got past some of them by kind of
1134
00:47:08,310 --> 00:47:10,400
segmenting that workload into its own cluster.
1135
00:47:10,620 --> 00:47:14,370
And now it no longer has an impact on like P99 latencies
1136
00:47:14,370 --> 00:47:17,460
for every other operation that goes on on the website.
1137
00:47:17,830 --> 00:47:19,450
Uh, but it was all in one big cluster.
1138
00:47:19,570 --> 00:47:19,890
I think
1139
00:47:19,900 --> 00:47:21,480
that's kind of the secret of databases.
1140
00:47:21,490 --> 00:47:23,540
Cause everyone thinks that NoSQL or
1141
00:47:24,070 --> 00:47:27,850
using one or the other is going to be some sort of magical thing because they
1142
00:47:27,850 --> 00:47:31,110
think it doesn't have to be structured, or it doesn't have
1143
00:47:31,110 --> 00:47:35,130
to be relational, but with all of them you have to use the right
1144
00:47:35,130 --> 00:47:38,170
tool for the job and then the right access patterns and all kinds of stuff.
1145
00:47:38,170 --> 00:47:38,409
So, I
1146
00:47:38,410 --> 00:47:40,090
mean, I think the secret of databases, everyone
1147
00:47:40,099 --> 00:47:42,509
has to use it wrong the first time, right?
1148
00:47:42,509 --> 00:47:45,409
And then, and then you figure out, Oh, this one's different.
1149
00:47:45,729 --> 00:47:50,669
There is, there is no database that will support wildly different workloads.
1150
00:47:51,250 --> 00:47:54,980
on the same instance, on the same cluster, basically, is what we've learned.
1151
00:47:55,040 --> 00:47:58,030
You can design your database as, as heavily as you want to, but
1152
00:47:58,030 --> 00:48:00,869
like, if you have a really noisy neighbor, it's gonna thrash your
1153
00:48:00,870 --> 00:48:03,680
caches, and you're gonna have really bad performance, or it's gonna,
1154
00:48:03,719 --> 00:48:06,330
like, cause a bunch of compactions to kick off, and you're gonna
1155
00:48:06,330 --> 00:48:09,049
be wasting a bunch of CPU time in compactions that could have been
1156
00:48:09,049 --> 00:48:11,900
serving requests, and your latencies are gonna be all over the place.
1157
00:48:11,920 --> 00:48:14,990
So, so when we bought hardware, we were like, okay, cool, let's buy hardware
1158
00:48:15,070 --> 00:48:19,685
to run a Scylla cluster, and let's buy hardware to run A couple of really
1159
00:48:19,695 --> 00:48:24,135
highly concurrent Go processes and then some more generic hardware to run
1160
00:48:24,175 --> 00:48:28,014
more generic things like a bunch of TypeScript containers and stuff like that.
1161
00:48:28,085 --> 00:48:31,635
So the core data service I was talking to you about in November was running
1162
00:48:31,635 --> 00:48:35,825
on 16 containers across two physical machines in both of our data centers.
1163
00:48:35,835 --> 00:48:38,645
So two in each DC, eight
1164
00:48:38,645 --> 00:48:41,945
containers. Those machines had 384 logical cores.
1165
00:48:41,955 --> 00:48:46,325
So with SMT, 384 cores, and so each Go process was getting
1166
00:48:46,700 --> 00:48:48,150
a couple dozen cores. And
1167
00:48:48,160 --> 00:48:50,530
still, when I think of that scale and you're literally talking
1168
00:48:50,530 --> 00:48:53,770
about four physical servers, and I think if I wanted to
1169
00:48:53,770 --> 00:48:57,629
replicate that in a cloud architecture, that is at least 30
1170
00:48:57,710 --> 00:49:01,479
VMs somewhere with a couple of queues and something else, and like
1171
00:49:01,480 --> 00:49:06,439
that complexity. Four physical servers handling, across all four of them,
1172
00:49:06,439 --> 00:49:10,210
in the neighborhood of 700,000 requests a second from the AppView
1173
00:49:10,630 --> 00:49:14,410
and querying a database around four and a half million times a second.
1174
00:49:14,735 --> 00:49:18,845
Your experience being a hardware engineer and a software engineer
1175
00:49:18,845 --> 00:49:22,385
really meshes well with you working in infrastructure because if
1176
00:49:22,385 --> 00:49:25,185
you didn't know hardware as well you probably wouldn't be able to
1177
00:49:25,565 --> 00:49:29,295
go and pick the right, like, everything. It seems like you have a
1178
00:49:29,485 --> 00:49:32,845
really good knack for right-sizing and picking the right things.
1179
00:49:32,875 --> 00:49:35,345
And I think people struggle with that so much.
1180
00:49:35,655 --> 00:49:36,865
They're all tools, right?
1181
00:49:36,885 --> 00:49:39,875
But how do you go and use that tool efficiently, right?
1182
00:49:39,905 --> 00:49:43,144
And the fact that you worked with bare metal and you worked with hardware
1183
00:49:43,144 --> 00:49:46,725
and let's be real, it's easier to figure out cloud because there's a lot more
1184
00:49:46,825 --> 00:49:49,855
kind of tutorials and information out there to go figure that out, right?
1185
00:49:49,915 --> 00:49:53,515
You came with the hard stuff and then you get to meld that together.
1186
00:49:54,125 --> 00:49:56,145
I feel like a lot of it is instinct at this point,
1187
00:49:56,145 --> 00:49:58,465
or it's like, I feel like I'm guessing really often.
1188
00:49:58,775 --> 00:50:01,635
When you are, like, right-sizing for hardware, you're
1189
00:50:01,635 --> 00:50:04,315
never gonna make a decision with as much data as you want.
1190
00:50:04,355 --> 00:50:07,525
You'll never reach a point where every decision that you make is fully
1191
00:50:07,525 --> 00:50:10,394
informed, and you're like, Ah, yes, this is clearly the obvious decision
1192
00:50:10,395 --> 00:50:12,745
because I have all the information I need to make this decision.
1193
00:50:13,055 --> 00:50:15,695
So I will just make the correct decision.
1194
00:50:15,985 --> 00:50:18,760
What you're left with is, like, what do you know?
1195
00:50:18,790 --> 00:50:20,280
What do you have experience with?
1196
00:50:20,350 --> 00:50:22,960
And then, what does your gut say?
1197
00:50:23,150 --> 00:50:24,940
A lot of times that's almost more important.
1198
00:50:24,960 --> 00:50:28,040
I've learned through working at different companies that sometimes
1199
00:50:28,040 --> 00:50:31,599
it's more like what your engineers know and what they're good at
1200
00:50:31,860 --> 00:50:34,920
and then finding the best tool that they have experience with.
1201
00:50:35,205 --> 00:50:36,915
Rather than just picking the best tool, like they
1202
00:50:36,915 --> 00:50:39,635
all have to be counted in and like accounted for.
1203
00:50:39,745 --> 00:50:43,035
Making the decision, and making the correct decision, is hard.
1204
00:50:43,055 --> 00:50:45,785
Choosing when to make a decision is another really
1205
00:50:45,825 --> 00:50:49,125
important role that takes a lot of experience to get.
1206
00:50:49,144 --> 00:50:51,575
I don't have a ton of that experience right now.
1207
00:50:51,644 --> 00:50:53,915
Jake, our previous infra lead,
1208
00:50:53,955 --> 00:50:56,345
made a lot of these decisions where I was like,
1209
00:50:56,775 --> 00:50:57,475
Are you sure?
1210
00:50:57,475 --> 00:50:59,855
Like, I don't know, like, is this going to work?
1211
00:50:59,865 --> 00:51:02,845
And a lot of those have very clearly panned out.
1212
00:51:02,855 --> 00:51:05,645
And I've bowed to his wisdom on a lot of that.
1213
00:51:05,705 --> 00:51:07,454
And now I'm in the position where I'm like,
1214
00:51:07,810 --> 00:51:09,480
I hope I know what I'm doing.
1215
00:51:09,670 --> 00:51:12,330
Like, I have no idea what I'm doing, but you know, we're still alive.
1216
00:51:12,330 --> 00:51:13,420
So I must be doing something right.
1217
00:51:13,450 --> 00:51:17,859
And choosing when to make a decision is also very important because delaying
1218
00:51:17,859 --> 00:51:21,580
decisions until you have more information is good if you really don't
1219
00:51:21,580 --> 00:51:25,399
have enough information to make a decision, but being indecisive can cause
1220
00:51:25,399 --> 00:51:28,430
you to slow down or it can cause problems or it can make more work for you.
1221
00:51:28,680 --> 00:51:30,620
And so you have to, like, constantly be
1222
00:51:30,950 --> 00:51:33,950
doing this trade-off between, should I just make a decision and
1223
00:51:33,950 --> 00:51:36,680
go with it and commit to it, because we'll get more done that way
1224
00:51:36,740 --> 00:51:39,010
if the decision isn't super high stakes? Or, if it's a really high
1225
00:51:39,010 --> 00:51:42,840
stakes decision, how do I wait just the right amount of time so that
1226
00:51:42,840 --> 00:51:45,440
we have enough information, but we're also not missing the boat?
1227
00:51:45,530 --> 00:51:49,075
Looking back over the last 18 months, were there any decisions you regret
1228
00:51:49,135 --> 00:51:53,485
that either you made at the wrong time or you, you just decided that I'm just
1229
00:51:53,485 --> 00:51:56,034
trying, I'm asking, you know, there's a lot of learning experiences here,
1230
00:51:57,864 --> 00:52:01,035
Any decisions that I regret? I don't think I can fault
1231
00:52:01,065 --> 00:52:03,545
any of our major decisions that we've made because
1232
00:52:03,545 --> 00:52:05,094
we
1233
00:52:05,175 --> 00:52:08,765
haven't, well, yeah, we haven't fallen over. Nobody could
1234
00:52:08,765 --> 00:52:12,475
have possibly predicted the ridiculous trajectory that we're
1235
00:52:12,475 --> 00:52:14,885
on, like, except for Jake when he wrote that spreadsheet.
1236
00:52:14,895 --> 00:52:15,495
But like,
1237
00:52:15,645 --> 00:52:18,664
if you could have predicted all of this, then we should pay you for like
1238
00:52:18,705 --> 00:52:23,315
predicting the election and a bunch of like, some other really unstable world.
1239
00:52:24,335 --> 00:52:26,675
These have all been very heavily outside influence.
1240
00:52:27,695 --> 00:52:29,765
I do kind of firmly believe that, like, from
1241
00:52:29,765 --> 00:52:32,005
an infrastructure standpoint, we have made
1242
00:52:32,130 --> 00:52:33,890
the best decision that we could with the information
1243
00:52:33,890 --> 00:52:35,500
that we had pretty much across the board.
1244
00:52:35,550 --> 00:52:37,750
And having more information, we wouldn't have believed
1245
00:52:37,750 --> 00:52:40,079
it. If I, like, sent myself back from the future
1246
00:52:40,079 --> 00:52:42,100
and was like, hey, you have to prepare for this scale,
1247
00:52:42,100 --> 00:52:43,279
I would have been like, you're insane.
1248
00:52:43,360 --> 00:52:43,920
Get out of here.
1249
00:52:43,920 --> 00:52:44,050
I
1250
00:52:44,060 --> 00:52:46,019
saw a post like that on Bluesky today.
1251
00:52:46,020 --> 00:52:50,399
It was like, if someone had told me that it was something like random about
1252
00:52:50,399 --> 00:52:53,700
like where we are now, versus 10 years ago, it was like, if I went back in
1253
00:52:53,700 --> 00:52:58,460
2004 and I got put in like a mental asylum for telling people what's going
1254
00:52:58,470 --> 00:53:02,145
on in the future, the future's like, And I was like, they're not wrong.
1255
00:53:02,285 --> 00:53:03,665
Like, they're so not wrong.
1256
00:53:03,675 --> 00:53:08,105
Back in November of 2023, we re-architected the entire backend.
1257
00:53:08,114 --> 00:53:11,465
So the entire backend was on one big Postgres instance, uh, or like a bunch
1258
00:53:11,465 --> 00:53:15,585
of Postgres replicas. The PDS and the AppView were merged into one big thing.
1259
00:53:15,595 --> 00:53:18,424
It was all just one giant Postgres serving a hundred thousand users.
1260
00:53:18,495 --> 00:53:19,875
We broke those roles apart.
1261
00:53:20,105 --> 00:53:24,485
And then we moved to the V2 architecture, which is, hey, Scylla-based.
1262
00:53:24,925 --> 00:53:29,205
Rewrite the entire data schema, build it all from scratch, and design
1263
00:53:29,205 --> 00:53:32,255
it to support up to 100 million users, which at the time was the goal.
1264
00:53:32,435 --> 00:53:34,714
And we had 100,000 users, and we were like, cool, we're
1265
00:53:34,714 --> 00:53:37,155
going to build for three orders of magnitude from only
1266
00:53:37,155 --> 00:53:40,155
having information of, you know, operating at 100,000 users.
1267
00:53:40,345 --> 00:53:42,035
None of us had any idea what the hell we were doing.
1268
00:53:42,154 --> 00:53:46,455
Like, this was all way pie-in-the-sky architect engineering stuff.
1269
00:53:46,545 --> 00:53:49,045
We got some idea of what it was going to look like and then I went
1270
00:53:49,065 --> 00:53:53,365
head down for like six weeks from like Christmas to the end of January.
1271
00:53:53,545 --> 00:53:57,365
And just wrote out our entire new data architecture and
1272
00:53:57,365 --> 00:54:00,105
then implemented it and got it running on our hardware.
1273
00:54:00,165 --> 00:54:02,085
I hope you guys are going to a beach in Mexico
1274
00:54:02,085 --> 00:54:04,235
at some point because you'll be working some
1275
00:54:04,565 --> 00:54:04,655
hours.
1276
00:54:05,324 --> 00:54:08,795
Right before the public launch back in February of last year, five days
1277
00:54:08,805 --> 00:54:15,465
before that, we silently shifted the entire backend from being in the cloud
1278
00:54:15,900 --> 00:54:20,290
on top of a big Postgres to running on our own hardware, and nobody
1279
00:54:20,290 --> 00:54:23,500
noticed. And so we had, like, we'd backfilled all the data, we had it
1280
00:54:23,500 --> 00:54:26,540
all running for a while, for a couple of days before everything switched
1281
00:54:26,540 --> 00:54:29,569
over, and then we just slowly moved one PDS at a time and pointed it
1282
00:54:29,570 --> 00:54:32,239
at the new architecture. And so over the course of, like, an hour, we
1283
00:54:32,239 --> 00:54:35,220
shifted 100 percent of traffic onto the on-prem loadout. And that was,
1284
00:54:35,220 --> 00:54:37,890
like, that was the moment where I was like, I can't believe we just did
1285
00:54:37,890 --> 00:54:41,480
that. I was like, we went to a cave and wrote this whole thing.
1286
00:54:41,480 --> 00:54:43,350
And then like, all right, I hope it works.
1287
00:54:43,450 --> 00:54:45,950
We'll see what happens when it like actually gets users on it.
1288
00:54:45,950 --> 00:54:47,320
And then it just frigging worked.
1289
00:54:47,330 --> 00:54:48,569
And it was like, you're kidding me.
1290
00:54:49,020 --> 00:54:50,159
Like we had like two bugs.
1291
00:54:50,570 --> 00:54:52,890
And, like, a tiny, tiny, tiny percentage of people
1292
00:54:52,890 --> 00:54:54,720
noticed it, and we fixed those within a day or two.
1293
00:54:54,940 --> 00:54:56,000
And I was like, alright, what's next?
1294
00:54:56,260 --> 00:54:59,650
I feel like someone tried to explain what an SRE was the
1295
00:54:59,650 --> 00:55:02,629
other day on Blue Sky to like, people that were not technical.
1296
00:55:02,629 --> 00:55:06,640
And it's wild because like, nobody knows what you're doing until you mess it up.
1297
00:55:06,985 --> 00:55:08,915
And then they know what you're doing, you know what I mean?
1298
00:55:08,915 --> 00:55:11,455
So like, it's what, like, that's such a huge
1299
00:55:11,455 --> 00:55:13,855
achievement for you to do that much of a data switch.
1300
00:55:13,885 --> 00:55:18,285
And like, to know you did it right is because nobody noticed, you know?
1301
00:55:18,515 --> 00:55:21,785
Yeah, that was one of the very high stakes moments.
1302
00:55:21,815 --> 00:55:24,404
We've had a couple of those since then. Like, turning on video
1303
00:55:24,830 --> 00:55:26,540
was like, I have no idea.
1304
00:55:26,550 --> 00:55:29,320
Video, like, the backend for video is all custom.
1305
00:55:29,350 --> 00:55:33,390
It's all, like, I wrote up our entire kind of video processing pipeline.
1306
00:55:33,500 --> 00:55:36,450
I architected it and set it up. It just runs
1307
00:55:36,450 --> 00:55:39,130
on a bunch of machines that we don't operate.
1308
00:55:39,230 --> 00:55:41,729
And I was like, I think this should be horizontally scalable.
1309
00:55:41,759 --> 00:55:42,660
Like, I've done,
1310
00:55:43,135 --> 00:55:47,235
I've run it in Docker Compose on my, like, work machine, and I've scaled
1311
00:55:47,235 --> 00:55:50,435
it to like, however many, you know, hits a second and it worked fine.
1312
00:55:50,455 --> 00:55:53,534
It should probably be okay, but our only way of like
1313
00:55:53,534 --> 00:55:55,454
figuring it out was like, all right, turn the dial and
1314
00:55:55,454 --> 00:55:58,114
actually let users use it and see if it's going to happen.
1315
00:55:58,114 --> 00:55:59,034
And this was right after Brazil.
1316
00:55:59,034 --> 00:55:59,875
So Brazil happened.
1317
00:55:59,885 --> 00:56:02,285
We had 10x the number of users we
1318
00:56:02,295 --> 00:56:04,695
expected to have. I had been building video
1319
00:56:05,140 --> 00:56:06,750
for the previous number of users.
1320
00:56:06,750 --> 00:56:09,940
But I was like, I want it to be able to scale to a billion horizontally.
1321
00:56:10,100 --> 00:56:13,739
And then Brazil came on, and Paul was like, can we still do video?
1322
00:56:14,250 --> 00:56:15,570
And I was like, give me a week.
1323
00:56:15,580 --> 00:56:18,149
Like, yeah, give me a week.
1324
00:56:18,150 --> 00:56:19,860
Let me update some spreadsheets to
1325
00:56:19,860 --> 00:56:21,050
figure out what the costs are going to look like.
1326
00:56:21,050 --> 00:56:22,810
And then give me a week and then yeah, let's do video.
1327
00:56:22,910 --> 00:56:25,150
We had a last-minute architectural change with video as well.
1328
00:56:25,150 --> 00:56:25,830
That was insane.
1329
00:56:25,839 --> 00:56:29,080
It was the morning of the video launch.
1330
00:56:29,150 --> 00:56:31,149
We had a transcoding partner that was
1331
00:56:31,150 --> 00:56:33,720
going to do, like, half of our video encoding for us,
1332
00:56:33,740 --> 00:56:35,870
and a big chunk of the workflow.
1333
00:56:35,970 --> 00:56:38,840
We submitted some jobs to their queues that morning,
1334
00:56:38,850 --> 00:56:38,960
like,
1335
00:56:39,460 --> 00:56:42,730
through their API, and it took, like, an hour to process the video. And I
1336
00:56:42,730 --> 00:56:45,780
was like, what? This was, like, working just fine. Like, last night it was
1337
00:56:45,790 --> 00:56:48,260
happening in seconds. And they said, oh, you know, there's a really
1338
00:56:48,260 --> 00:56:52,790
big backlog right now. And I was like, I can't ship that to, like, millions
1339
00:56:52,790 --> 00:56:56,289
of users. That's not acceptable. Like, people can't upload
1340
00:56:56,299 --> 00:56:58,850
videos if it's gonna take an hour to process a 60-second video. That
1341
00:56:58,850 --> 00:57:03,620
makes no sense. So in about 14 hours of insanity, I, like, rewrote their
1342
00:57:03,620 --> 00:57:07,660
entire part of that stack into the existing job system that I built.
1343
00:57:07,880 --> 00:57:09,020
And I was like, cool, I'm just going to replace your
1344
00:57:09,020 --> 00:57:12,087
product and I'm just going to shove these into an S3 bucket.
1345
00:57:12,087 --> 00:57:14,669
What kind of Monster do you drink?
1346
00:57:14,860 --> 00:57:16,280
Goodness, Paul
1347
00:57:16,280 --> 00:57:17,260
drinks Red Bull, doesn't he?
1348
00:57:17,410 --> 00:57:19,260
It's like between Red Bull and Monster.
1349
00:57:19,780 --> 00:57:21,130
Paul needs a fridge of Red Bull.
1350
00:57:21,130 --> 00:57:22,020
I think I ate that
1351
00:57:22,020 --> 00:57:23,180
night, briefly.
1352
00:57:23,309 --> 00:57:23,829
Yeah.
1353
00:57:23,940 --> 00:57:25,090
Me and Divey.
1354
00:57:25,130 --> 00:57:27,689
Divey was like, I was like, Hey, I think this is how this can work.
1355
00:57:27,700 --> 00:57:30,769
Can you figure out how to get the CDN to front this, like, S3
1356
00:57:30,770 --> 00:57:33,580
bucket, or, like, S3-compatible bucket, this block store bucket?
1357
00:57:33,830 --> 00:57:37,315
And then I will do everything I can to get us to encode these
1358
00:57:37,315 --> 00:57:39,590
HLS streams and get them into that block storage bucket.
1359
00:57:39,955 --> 00:57:42,605
And then hopefully it should just work, maybe.
1360
00:57:42,865 --> 00:57:45,085
Um, and we literally launched the next day.
1361
00:57:45,265 --> 00:57:48,804
You're like, oh, Jake did this and like, oh, I didn't do anything big.
1362
00:57:48,805 --> 00:57:50,645
And I'm like, are you listening to the
1363
00:57:50,655 --> 00:57:52,485
words that are coming out of your own mouth?
1364
00:57:52,825 --> 00:57:53,805
It was a lot.
1365
00:57:53,885 --> 00:57:54,045
It was a
1366
00:57:54,655 --> 00:57:57,365
bajillion times.
1367
00:57:57,365 --> 00:57:58,975
And like, it was no big deal though.
1368
00:57:58,975 --> 00:58:00,495
I just did it with a Monster.
1369
00:58:01,165 --> 00:58:04,055
The secret to video encoding is everybody's just calling FFmpeg.
1370
00:58:04,265 --> 00:58:04,845
It doesn't matter.
1371
00:58:04,845 --> 00:58:05,915
It doesn't matter how big of a company.
1372
00:58:05,955 --> 00:58:07,444
I mean, maybe if you're like Google scale or
1373
00:58:07,444 --> 00:58:09,205
something, you're not doing it anymore at that point.
1374
00:58:09,215 --> 00:58:09,515
But.
1375
00:58:10,085 --> 00:58:12,724
It's so much just like, yeah, you're calling FFmpeg.
1376
00:58:12,725 --> 00:58:14,095
Disney? FFmpeg.
1377
00:58:14,185 --> 00:58:15,635
It's just legit.
1378
00:58:15,835 --> 00:58:18,345
Like, yeah, there's some hardware that's specialized to It's
1379
00:58:18,345 --> 00:58:20,295
so phenomenal that Disney didn't fall over in itself.
1380
00:58:20,434 --> 00:58:22,615
Also, like, can we talk, like, with the amount of times
1381
00:58:22,615 --> 00:58:26,295
that we saw the Twitter whale in early Twitter scale days?
1382
00:58:26,585 --> 00:58:27,655
Y'all are killing it.
1383
00:58:27,735 --> 00:58:31,725
The secret is we're a distributed system, so we're never fully down.
1384
00:58:31,875 --> 00:58:33,595
We only ever have partial outages.
1385
00:58:34,595 --> 00:58:36,355
We only ever have service degradations.
1386
00:58:36,595 --> 00:58:38,874
So occasionally the website goes into read only mode,
1387
00:58:38,895 --> 00:58:40,904
and you can't like things or anything, and they all get
1388
00:58:40,914 --> 00:58:43,374
backed up in a queue somewhere, but you can still scroll.
1389
00:58:43,385 --> 00:58:43,649
You can still scroll.
1390
00:58:43,660 --> 00:58:45,340
Scrolla nine, that nine CAS.
1391
00:58:45,340 --> 00:58:48,779
If your
1392
00:58:48,779 --> 00:58:53,270
system is distributed enough, you're never fully down.
1393
00:58:53,320 --> 00:58:57,039
Your bugs will always be 10 times worse because you have to figure out where
1394
00:58:57,039 --> 00:59:01,009
you went wrong, but it'll be up and it looks like it's great for customers.
1395
00:59:01,889 --> 00:59:02,399
Exactly.
1396
00:59:02,419 --> 00:59:05,602
All of your, all of your bugs
1397
00:59:05,602 --> 00:59:06,779
are Heisenbugs.
1398
00:59:06,780 --> 00:59:07,000
What's next?
1399
00:59:07,810 --> 00:59:09,480
What's next for BlueSky for infrastructure?
1400
00:59:09,480 --> 00:59:10,510
What are you, what are you looking at?
1401
00:59:12,200 --> 00:59:14,290
We just did some hardware scaling, which was exciting.
1402
00:59:14,665 --> 00:59:17,815
Um, we're probably going to do some more of that in the future, depending
1403
00:59:17,815 --> 00:59:20,925
on how growth goes this year. You know, like, we were at 100,000 users
1404
00:59:20,925 --> 00:59:25,245
18 months ago, and we're sitting at just shy of 30 million users today.
1405
00:59:25,375 --> 00:59:28,485
There's a lot of maturing our data architecture that we have to do.
1406
00:59:28,555 --> 00:59:32,264
There's a lot of, like, low-hanging fruit in how to do caches
1407
00:59:32,265 --> 00:59:35,325
better, how to coalesce requests better, how to, you know, do hybrid
1408
00:59:35,325 --> 00:59:39,255
timeline fan-out stuff for celebrities. There's so many different
1409
00:59:39,255 --> 00:59:43,190
things. If we stretched this, you know, this past six-month period
1410
00:59:43,410 --> 00:59:47,310
over the course of two years, it would have gone totally differently.
1411
00:59:47,320 --> 00:59:48,880
Everything would have been perfectly smooth.
1412
00:59:48,920 --> 00:59:50,560
Like, we would have no tech debt.
1413
00:59:50,580 --> 00:59:53,000
It would have been great because we would have scaled at a rate
1414
00:59:53,010 --> 00:59:56,569
that like, you can see what's going to be a problem slightly ahead
1415
00:59:56,569 --> 00:59:59,350
of time and you can anticipate it and go do something about it.
1416
00:59:59,649 --> 01:00:01,380
But where we're at now is like, problems are
1417
01:00:01,400 --> 01:00:03,850
either on fire or they're not high enough priority.
1418
01:00:03,930 --> 01:00:05,820
And so that was in November.
1419
01:00:05,830 --> 01:00:09,020
And now we've bought ourselves some more breathing room.
1420
01:00:09,040 --> 01:00:11,350
And so I'm starting to look at how do we do service discovery?
1421
01:00:11,520 --> 01:00:13,974
We have a bunch of services that are like,
1422
01:00:14,225 --> 01:00:16,915
Here's a static list of instances to go try to talk to.
1423
01:00:17,055 --> 01:00:19,275
And if one of those instances goes down and I can't bring it back
1424
01:00:19,295 --> 01:00:21,445
up because it had some load-bearing Bloom filters or something
1425
01:00:21,445 --> 01:00:23,875
like that and we're in peak traffic, everything gets mad.
1426
01:00:23,875 --> 01:00:25,915
I have to go redeploy all of the services that talk
1427
01:00:25,915 --> 01:00:27,785
to it to tell it, hey, don't try to talk to this one.
1428
01:00:27,904 --> 01:00:29,555
So there's some kind of like dynamic configuration
1429
01:00:29,555 --> 01:00:31,245
and service discovery that we want to get rolling.
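A minimal sketch of the difference between a static instance list and a dynamically updated one, assuming a purely in-process registry. The class name `InstanceRegistry`, its methods, and the host names are hypothetical and are not Bluesky's actual tooling; a real setup would feed the registry from a configuration or discovery service rather than from direct method calls.

```python
import threading


class InstanceRegistry:
    """Hypothetical registry of backend instances that can change at runtime."""

    def __init__(self, instances):
        self._lock = threading.Lock()
        self._healthy = list(instances)  # instances currently eligible for traffic
        self._cursor = 0                 # round-robin position

    def mark_down(self, instance):
        # Stop routing to an instance without redeploying the callers.
        with self._lock:
            if instance in self._healthy:
                self._healthy.remove(instance)

    def mark_up(self, instance):
        # Put an instance back into rotation.
        with self._lock:
            if instance not in self._healthy:
                self._healthy.append(instance)

    def pick(self):
        # Round-robin over whatever is healthy right now.
        with self._lock:
            if not self._healthy:
                raise RuntimeError("no healthy instances")
            instance = self._healthy[self._cursor % len(self._healthy)]
            self._cursor += 1
            return instance


registry = InstanceRegistry(["data-1:8080", "data-2:8080", "data-3:8080"])
registry.mark_down("data-2:8080")           # instance died during peak traffic
picked = [registry.pick() for _ in range(4)]
print(picked)                                # data-2 no longer appears
```

With a static list, removing `data-2` means redeploying every caller; here the callers keep running and simply stop picking it.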
1430
01:00:31,945 --> 01:00:33,695
Lots of caching infrastructure changes.
1431
01:00:33,815 --> 01:00:36,115
Maybe writing a custom database for timelines.
1432
01:00:36,415 --> 01:00:40,215
That's one thing that's been on my mind: an LSM tree is not a
1433
01:00:40,215 --> 01:00:44,900
great fit for this, like, circular-buffer-style timeline where, like, you've
1434
01:00:44,900 --> 01:00:48,110
got a fixed length of references you want to put in everybody's timelines,
1435
01:00:48,120 --> 01:00:49,840
then you want to kind of overwrite the oldest one
1436
01:00:49,850 --> 01:00:52,060
when a new one comes in. It feels a lot like a circular buffer.
1437
01:00:52,060 --> 01:00:52,839
And I'm like, okay, cool.
1438
01:00:52,870 --> 01:00:53,670
Can we do something with that?
1439
01:00:53,670 --> 01:00:55,619
Can I go write a database for timelines
1440
01:00:55,630 --> 01:00:57,969
that is just going to be super specially built
1441
01:00:57,980 --> 01:01:00,769
for this workload and just really efficient and scale
1442
01:01:01,050 --> 01:01:02,880
way farther than I need it to right now?
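The circular-buffer timeline described here can be sketched roughly as follows. This is an illustrative in-memory version only, not the database being discussed: the class name `TimelineRing`, the capacity, and the post reference strings are all assumptions made for the example.

```python
class TimelineRing:
    """Hypothetical fixed-length timeline: the newest reference overwrites the oldest."""

    def __init__(self, capacity):
        self.slots = [None] * capacity  # preallocated, never grows
        self.next = 0                   # index of the slot to overwrite next
        self.count = 0                  # number of slots filled so far

    def push(self, ref):
        # Overwrite in place: no tombstones and no compaction, unlike an LSM tree.
        self.slots[self.next] = ref
        self.next = (self.next + 1) % len(self.slots)
        self.count = min(self.count + 1, len(self.slots))

    def newest_first(self):
        # Walk backwards from the most recently written slot.
        n = len(self.slots)
        return [self.slots[(self.next - 1 - i) % n] for i in range(self.count)]


timeline = TimelineRing(capacity=3)
for ref in ["post-a", "post-b", "post-c", "post-d"]:
    timeline.push(ref)
print(timeline.newest_first())  # ['post-d', 'post-c', 'post-b']
```

Each write touches exactly one slot, so storage per timeline stays constant however many posts fan out to it, which is the property that makes this shape attractive for the workload described.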
1443
01:01:02,980 --> 01:01:04,580
So, yeah, writing some databases.
1444
01:01:04,660 --> 01:01:06,140
I did that with a graph database last year.
1445
01:01:06,140 --> 01:01:09,539
Yeah, like that's totally no big deal.
1446
01:01:09,540 --> 01:01:10,930
Because everybody does that.
1447
01:01:10,980 --> 01:01:13,080
I'm just going to change the way that, like, you
1448
01:01:13,080 --> 01:01:15,370
know, AT Protocol and social media does data.
1449
01:01:15,460 --> 01:01:18,609
Hey, if you limit the scope of your problem, any
1450
01:01:18,610 --> 01:01:20,370
problem can be tackleable if you limit the scope hard enough.
1451
01:01:20,390 --> 01:01:23,540
The next time you go for a job interview or write a bio, call us.
1452
01:01:23,900 --> 01:01:24,790
This is your new resume.
1453
01:01:24,810 --> 01:01:28,640
Yeah, you just, you're not doing, like, what you do justice, okay?
1454
01:01:28,790 --> 01:01:30,260
It's, yeah, I don't know.
1455
01:01:30,690 --> 01:01:33,390
You wear so many hats on a
1456
01:01:33,390 --> 01:01:37,130
tiny team that like, I forget what I do a month afterwards
1457
01:01:37,130 --> 01:01:40,160
because the, the, the past month is like, because you left, or
1458
01:01:40,160 --> 01:01:40,725
eighth of that month?
1459
01:01:43,040 --> 01:01:45,530
The past month is like a whole, a whole, like
1460
01:01:45,530 --> 01:01:47,240
every month is like we're in a whole new league.
1461
01:01:47,300 --> 01:01:47,900
Oh crap.
1462
01:01:47,930 --> 01:01:48,920
Now we're in a whole new league.
1463
01:01:48,950 --> 01:01:49,370
Oh crap.
1464
01:01:49,370 --> 01:01:50,390
Now we're in a whole new league.
1465
01:01:50,390 --> 01:01:50,900
And it's like your poor
1466
01:01:50,900 --> 01:01:55,040
brain hasn't had the time to turn off and like register the memory.
1467
01:01:56,350 --> 01:01:59,890
I took some time off over the winter holidays.
1468
01:01:59,890 --> 01:02:02,190
I got, like, a week or two off there, which
1469
01:02:02,900 --> 01:02:03,810
gave me some breathing room.
1470
01:02:03,850 --> 01:02:05,010
I slept for eight hours.
1471
01:02:05,110 --> 01:02:05,810
It was okay.
1472
01:02:06,480 --> 01:02:09,910
Jazz, thank you so much for coming on the podcast, explaining all of this.
1473
01:02:09,920 --> 01:02:14,490
The rollercoaster of Bluesky over the last year and a half has been phenomenal.
1474
01:02:14,490 --> 01:02:15,840
I've been enjoying it thoroughly.
1475
01:02:15,930 --> 01:02:18,475
I've been trying to play with the new things you've been
1476
01:02:18,475 --> 01:02:21,735
putting out with PDSs and whoever I want to, you know,
1477
01:02:21,735 --> 01:02:23,895
poke at a fire hose and whatnot and see what's going on.
1478
01:02:23,895 --> 01:02:23,965
We are sorry
1479
01:02:23,965 --> 01:02:26,265
that Justin does hoodrat stuff with your infrastructure.
1480
01:02:26,265 --> 01:02:26,845
We apologize.
1481
01:02:26,905 --> 01:02:27,225
I
1482
01:02:27,245 --> 01:02:28,525
definitely am one of those abusers.
1483
01:02:28,995 --> 01:02:31,894
Just like, look, just we're, we're going to send, just make a
1484
01:02:31,894 --> 01:02:34,795
like little like page where we can send you coffee every time
1485
01:02:34,795 --> 01:02:37,565
Justin gets a bright idea and then post about it to encourage
1486
01:02:37,585 --> 01:02:40,265
other people to get said bright idea and do hoodrat stuff.
1487
01:02:40,835 --> 01:02:42,905
If a well-intended dev can cause issues,
1488
01:02:42,905 --> 01:02:44,685
then we've, we've got work to do, right?
1489
01:02:44,685 --> 01:02:46,355
Justin's your chaos engineering.
1490
01:02:46,395 --> 01:02:47,915
He's your, like, chaos goblin.
1491
01:02:48,015 --> 01:02:50,985
Retroid is definitely another one of our chaos engineers in the community.
1492
01:02:50,985 --> 01:02:53,834
If you, if you follow Retroid, he's, since the early days,
1493
01:02:53,845 --> 01:02:57,085
has been helping us find, uh, bugs in unlikely places.
1494
01:02:57,524 --> 01:02:58,314
That's a way to describe.
1495
01:02:58,870 --> 01:02:59,700
that relationship.
1496
01:02:59,860 --> 01:03:01,710
That was such a nice way of doing it.
1497
01:03:02,530 --> 01:03:04,110
So everyone, thank you for listening.
1498
01:03:04,110 --> 01:03:06,410
If you're on Bluesky, go look up Jazz.
1499
01:03:06,680 --> 01:03:09,570
They're on the network, obviously very active
1500
01:03:09,599 --> 01:03:11,829
posting and sharing your knowledge and everything.
1501
01:03:11,830 --> 01:03:13,309
And so that's, that's been fantastic just to
1502
01:03:13,309 --> 01:03:15,429
follow along and everyone that's listening.
1503
01:03:15,460 --> 01:03:16,210
Thank you so much.
1504
01:03:16,210 --> 01:03:17,279
We will talk to you again next week.
1505
01:03:17,340 --> 01:03:18,150
Thank you for having me.
1506
01:03:33,460 --> 01:03:36,430
Thank you for listening to this episode of Fork Around and Find Out.
1507
01:03:36,760 --> 01:03:38,920
If you like this show, please consider sharing it with
1508
01:03:38,920 --> 01:03:42,100
a friend, a coworker, a family member, or even an enemy.
1509
01:03:42,160 --> 01:03:44,290
However we get the word out about this show
1510
01:03:44,500 --> 01:03:46,750
helps it to become sustainable for the long term.
1511
01:03:46,990 --> 01:03:53,110
If you wanna sponsor this show, please go to fafo.fm/sponsor and reach
1512
01:03:53,110 --> 01:03:56,410
out to us there about what you're interested in sponsoring and how we can help.
1513
01:03:57,725 --> 01:04:00,895
We hope your systems stay available and your pagers stay quiet.
1514
01:04:01,425 --> 01:04:02,605
We'll see you again next time.