Feb. 14, 2025

Predicting Bluesky’s Scale with Jaz

Bluesky has been on a roller coaster of growth for over a year. The team went from the early days of figuring out a new distributed social protocol, the AT Protocol, to actually building it and inviting 30 million of their closest friends. Not only has the site gone through tremendous growth, the team has been optimizing, re-architecting, and adding features the entire time.

Jaz is a software engineer focused on the infrastructure at Bluesky, and they share how they achieved exponential growth without exponential costs. We cover some of the key components of the protocol and how that affects the architecture.

There’s some amazing advice from the trenches we know you’ll enjoy.

Show Highlights
(0:00) Intro
(5:00) Jaz’s background
(12:30) Bluesky Infrastructure
(17:00) Predicting the future
(20:00) What is a PDS?
(22:30) Relay and firehose
(26:00) Work queues
(30:00) Scaling physical servers
(37:00) How do you handle incidents?
(41:00) Where’s Kubernetes?
(43:30) How video changes
(45:00) Data locality
(46:30) Hardware decisions
(53:00) What bad decisions?
(57:00) Launching video
(1:00:00) What’s next?

About Jaz

Jaz is a software engineer who learned from on-the-job experience. They have a background in hardware, which makes them better with software. If they’re not drinking Monster they’re building a single-purpose database, or maybe they’re doing both. Jaz went from building with the AT Protocol to building the AT Protocol in a matter of months. They also have an impressive collection of plushies and power tools.

Sponsor the FAFO Podcast!

http://fafo.fm/sponsor

Transcript
1
00:00:00,310 --> 00:00:01,980
There's a lot of convenience that comes

2
00:00:01,980 --> 00:00:04,160
with cloud, but you definitely pay for it.

3
00:00:04,340 --> 00:00:06,279
And you don't necessarily pay for it in the

4
00:00:06,300 --> 00:00:07,940
things that you expect to pay for it in.

5
00:00:07,949 --> 00:00:10,150
Like, you don't expect, ah, you're gonna charge a markup

6
00:00:10,190 --> 00:00:13,840
on this EC2 instance based off of how powerful it is.

7
00:00:13,840 --> 00:00:16,949
You end up paying most of it in, like, kind of hidden places.

8
00:00:22,705 --> 00:00:26,115
Welcome to Fork Around and Find Out, the podcast about

9
00:00:26,115 --> 00:00:29,215
building, running, and maintaining software and systems.

10
00:00:41,825 --> 00:00:47,105
Welcome to Fork Around and Find Out, the PLC DID of Is the Website Still Up?

11
00:00:47,425 --> 00:00:50,354
I am Justin Garrison, and with me is Autumn Nash, and

12
00:00:50,355 --> 00:00:53,834
today we have Jaz, a software engineer at Bluesky.

13
00:00:53,835 --> 00:00:54,875
Welcome to the show, Jaz.

14
00:00:55,485 --> 00:00:56,575
Hi, glad to be here.

15
00:00:57,090 --> 00:00:58,550
So excited for you to be here.

16
00:00:58,550 --> 00:01:01,300
I have been looking forward to talking about the infrastructure

17
00:01:01,470 --> 00:01:04,230
around Bluesky and what you all have been doing for a very long time.

18
00:01:04,780 --> 00:01:08,789
Jaz's radio voice just totally kicked Justin's podcast voice.

19
00:01:08,789 --> 00:01:10,229
Voice is absolutely better than mine.

20
00:01:10,240 --> 00:01:10,399
Like,

21
00:01:10,629 --> 00:01:13,070
like, as your friend, I want to have your back.

22
00:01:15,970 --> 00:01:16,889
That's fire.

23
00:01:16,890 --> 00:01:17,735
I'm not sure if I'm going to

24
00:01:17,735 --> 00:01:21,380
be demonstrating that for the podcast yet.

25
00:01:21,410 --> 00:01:22,410
But, you know.

26
00:01:22,619 --> 00:01:24,009
Can we hire you in Skittles?

27
00:01:24,179 --> 00:01:26,359
We could, you could DM me, we could talk about it.

28
00:01:26,520 --> 00:01:29,619
We'll just, we'll send you Ikea plushies as payment.

29
00:01:32,460 --> 00:01:32,739
That's right.

30
00:01:32,740 --> 00:01:33,509
Pay me in gum.

31
00:01:34,389 --> 00:01:36,219
All valid cryptocurrencies for 2025.

32
00:01:39,660 --> 00:01:42,539
I guarantee you, if we took Trump coins and

33
00:01:42,539 --> 00:01:44,919
Ikea plushies, one has a better resale value.

34
00:01:46,535 --> 00:01:47,155
Oh, wow.

35
00:01:47,155 --> 00:01:49,574
We're, we are three minutes into this episode.

36
00:01:49,574 --> 00:01:50,455
Welcome to the show.

37
00:01:50,455 --> 00:01:50,985
Everyone.

38
00:01:51,085 --> 00:01:51,845
It's been a week.

39
00:01:52,384 --> 00:01:53,095
It's definitely been a week.

40
00:01:53,964 --> 00:01:55,304
Just for context, for anyone listening to

41
00:01:55,304 --> 00:01:57,294
this, we are recording this on January 23rd.

42
00:01:57,324 --> 00:01:59,104
It is Thursday, still in January.

43
00:01:59,104 --> 00:02:01,524
This episode is coming out in February, second week of February.

44
00:02:01,524 --> 00:02:03,845
So I don't know what the future holds, but.

45
00:02:04,190 --> 00:02:05,289
Godspeed to you all.

46
00:02:05,300 --> 00:02:06,479
Y'all, uh, it is,

47
00:02:06,779 --> 00:02:09,329
we were like 2025 will get better.

48
00:02:09,329 --> 00:02:11,680
And then halfway through January, we're like, whoa, whoa.

49
00:02:11,690 --> 00:02:12,890
We want a refund.

50
00:02:12,930 --> 00:02:13,209
Like,

51
00:02:14,899 --> 00:02:16,579
I don't know if the store does that anymore.

52
00:02:16,880 --> 00:02:16,989
I think

53
00:02:16,989 --> 00:02:18,209
the bad place, what

54
00:02:21,689 --> 00:02:25,059
we just survived tech like recession.

55
00:02:25,489 --> 00:02:27,889
And now we just, we don't even know.

56
00:02:27,909 --> 00:02:28,309
Okay.

57
00:02:28,309 --> 00:02:28,939
We like,

58
00:02:29,049 --> 00:02:31,069
we definitely served a lot of video this week.

59
00:02:31,069 --> 00:02:31,950
I can tell you that much.

60
00:02:31,950 --> 00:02:32,250
I mean.

61
00:02:32,475 --> 00:02:35,765
Speaking of surviving, Bluesky is like a male.

62
00:02:36,065 --> 00:02:37,925
It's just, it is, and that is

63
00:02:37,984 --> 00:02:40,535
you are the saving the world right now.

64
00:02:40,555 --> 00:02:42,375
Cause like, I don't even know where to go.

65
00:02:42,405 --> 00:02:44,055
I've deleted my Instagram three times.

66
00:02:44,065 --> 00:02:47,074
The only reason why I have a Facebook is because it's so confusing.

67
00:02:47,074 --> 00:02:48,215
I can't get rid of it.

68
00:02:48,475 --> 00:02:52,255
Like, I swear Meta was like, I'm going to make this horrible.

69
00:02:52,265 --> 00:02:53,385
So they can't delete it.

70
00:02:53,385 --> 00:02:56,105
And I'm like, I just won't post and I'll just delete it from my phone.

71
00:02:56,144 --> 00:02:59,040
Like, yeah, they had to ask a UI/UX, like, designer,

72
00:02:59,079 --> 00:03:01,090
how to make it as insufferable as possible.

73
00:03:01,090 --> 00:03:04,270
Like they, not to make it better, but how to make it worse.

74
00:03:04,599 --> 00:03:06,040
It's a lot of dark patterns out there.

75
00:03:06,180 --> 00:03:06,380
Yeah.

76
00:03:06,990 --> 00:03:09,290
And then people are like, Oh, TikTok is bad.

77
00:03:09,299 --> 00:03:10,260
Don't rock TikTok.

78
00:03:10,279 --> 00:03:12,689
You know, you shouldn't give your information

79
00:03:12,689 --> 00:03:15,279
to foreign like, okay, but this is like, okay.

80
00:03:15,880 --> 00:03:18,720
This is the Boston Tea Party of data.

81
00:03:18,890 --> 00:03:21,420
They were like, okay, you want to take my data?

82
00:03:21,470 --> 00:03:25,000
And like, you want to take my TikTok and say it's like Chinese government ware?

83
00:03:25,179 --> 00:03:28,260
We will throw it into like RedNote.

84
00:03:28,310 --> 00:03:30,670
Like it is the Tea Party of data.

85
00:03:30,730 --> 00:03:32,429
They were like, F your data rules.

86
00:03:32,459 --> 00:03:35,620
And then they gave it to the, there was a video of this woman

87
00:03:35,620 --> 00:03:39,049
saying that they told her to verify her identity for fraud.

88
00:03:39,314 --> 00:03:41,504
on RedNote, and she was like, I'm giving the Chinese

89
00:03:41,504 --> 00:03:44,224
government my ID. What now, U.S. government?

90
00:03:44,224 --> 00:03:46,934
And I was like, Oh, sweet Lord, what are we doing?

91
00:03:49,424 --> 00:03:50,824
Me and Jaz are going to be besties.

92
00:03:51,254 --> 00:03:51,684
I don't know.

93
00:03:51,684 --> 00:03:53,194
I hope more good places show up.

94
00:03:53,484 --> 00:03:54,784
Bluesky is all we got.

95
00:03:54,904 --> 00:03:57,354
I think there were four atproto-based TikTok clones

96
00:03:57,354 --> 00:03:59,824
that were like starting up in the past week or two.

97
00:04:00,014 --> 00:04:02,024
So let's go back a little bit first and.

98
00:04:02,660 --> 00:04:05,090
How did you get into software infrastructure?

99
00:04:05,110 --> 00:04:05,940
What's kind of your background?

100
00:04:06,120 --> 00:04:10,210
Where did you go from doing something to like part of Bluesky, atproto?

101
00:04:10,640 --> 00:04:11,750
I started in hardware.

102
00:04:11,770 --> 00:04:14,730
I started as like a, as a repair tech, uh, when I was

103
00:04:14,730 --> 00:04:18,390
like 14 at a computer repair shop in my local town.

104
00:04:18,740 --> 00:04:22,000
Support desk life, it is like you are help desk and yeah,

105
00:04:22,159 --> 00:04:24,040
yeah, I was very good at taking stuff apart.

106
00:04:24,060 --> 00:04:25,690
I wasn't very good at putting things back together.

107
00:04:25,700 --> 00:04:28,270
And then as I got older, I got better at putting things back together.

108
00:04:29,360 --> 00:04:30,159
Well, I don't know.

109
00:04:30,179 --> 00:04:33,900
There's, there's, I feel like there's, I feel like there's a disease you get

110
00:04:33,900 --> 00:04:36,750
where you just want to like take everything apart and figure out how it works.

111
00:04:36,780 --> 00:04:37,919
And so I was that kid, you have

112
00:04:37,919 --> 00:04:38,659
to see the parts.

113
00:04:38,679 --> 00:04:39,899
You have to know what's going on.

114
00:04:40,010 --> 00:04:40,359
Yeah.

115
00:04:40,360 --> 00:04:40,729
Yeah.

116
00:04:40,729 --> 00:04:42,159
I didn't, I didn't know how like solder joints

117
00:04:42,159 --> 00:04:42,739
worked.

118
00:04:42,780 --> 00:04:44,419
I learned that the hard way after like

119
00:04:44,419 --> 00:04:46,119
breaking a few too many solder joints and like.

120
00:04:46,119 --> 00:04:47,239
It's not going back together.

121
00:04:47,239 --> 00:04:48,299
What the hell does it work?

122
00:04:48,299 --> 00:04:48,399
A new

123
00:04:48,399 --> 00:04:49,119
skill today.

124
00:04:51,259 --> 00:04:52,689
The oven to try to solder it again.

125
00:04:53,069 --> 00:04:53,530
Yeah.

126
00:04:53,530 --> 00:04:54,099
Yeah.

127
00:04:54,169 --> 00:04:59,529
That's just, and then evolved from that to doing tech support at a local PC

128
00:04:59,529 --> 00:05:03,739
repair shop, and then that paid awfully and they billed a lot for my time.

129
00:05:03,739 --> 00:05:04,560
So I was like, okay, cool.

130
00:05:04,570 --> 00:05:06,009
Let me do this independently.

131
00:05:06,010 --> 00:05:07,990
Um, so I went solo for a little bit.

132
00:05:08,000 --> 00:05:11,520
And then when I was in like high school, got in the hackathon scene

133
00:05:11,570 --> 00:05:14,180
in London, in the UK, right after I moved there in high school.

134
00:05:14,525 --> 00:05:15,615
That was really cool.

135
00:05:15,735 --> 00:05:17,085
I was going to these hackathons.

136
00:05:17,115 --> 00:05:19,485
I was like, well, technically not old enough to go

137
00:05:19,485 --> 00:05:21,455
to some of the hackathons so I could win the prizes.

138
00:05:21,465 --> 00:05:22,495
Terrible in London.

139
00:05:22,495 --> 00:05:23,255
Or is it good?

140
00:05:23,425 --> 00:05:24,865
If you get the right food, it's good.

141
00:05:24,874 --> 00:05:25,834
British food is bad.

142
00:05:25,924 --> 00:05:26,174
Okay.

143
00:05:26,214 --> 00:05:28,690
Uh, I probably shouldn't have said that on the podcast, but British food is bad.

144
00:05:28,690 --> 00:05:29,634
They know it.

145
00:05:29,635 --> 00:05:30,074
Um,

146
00:05:30,094 --> 00:05:30,525
they know it.

147
00:05:30,525 --> 00:05:31,664
Just like they know it.

148
00:05:31,664 --> 00:05:32,054
It's cool.

149
00:05:32,734 --> 00:05:33,484
They know, they know it.

150
00:05:33,485 --> 00:05:36,245
They know the good British food is like Nando's, but that's like.

151
00:05:36,544 --> 00:05:39,905
South African slash Portuguese slash British.

152
00:05:39,924 --> 00:05:42,844
And then obviously there's like really good Indian food

153
00:05:42,954 --> 00:05:45,174
and there's really good continental European foods.

154
00:05:45,174 --> 00:05:47,015
If you want like good Italian food or good French

155
00:05:47,015 --> 00:05:50,115
food, those are some really good eats to get in London.

156
00:05:50,484 --> 00:05:50,854
Yeah.

157
00:05:50,875 --> 00:05:53,044
So it was in the hackathon scene, started doing

158
00:05:53,044 --> 00:05:55,824
software engineering as like a part time thing.

159
00:05:55,824 --> 00:05:59,745
I think my junior year of high school into my senior year of high school.

160
00:05:59,745 --> 00:06:00,204
And then.

161
00:06:00,640 --> 00:06:01,319
moved back to the U.S.

162
00:06:01,319 --> 00:06:04,979
for college, uh, worked through college 39 and a half hours

163
00:06:04,979 --> 00:06:08,789
a week doing contracting, and then graduated, uh, early 2020,

164
00:06:09,049 --> 00:06:12,259
was thrust into the tech market in the middle of a pandemic.

165
00:06:12,490 --> 00:06:13,129
So it was interesting.

166
00:06:13,129 --> 00:06:15,449
So I spent some time working at a financial company, um,

167
00:06:15,449 --> 00:06:18,429
doing like infrastructure for their engineering teams,

168
00:06:18,429 --> 00:06:20,599
building a platform as a service on top of Kubernetes.

169
00:06:20,699 --> 00:06:23,209
And then I spent some time working at a social media

170
00:06:23,209 --> 00:06:26,650
company doing infrastructure for their research teams.

171
00:06:26,650 --> 00:06:28,240
So turning research projects Was it like a really evil one or

172
00:06:28,280 --> 00:06:29,610
just like a kind of evil one?

173
00:06:30,039 --> 00:06:31,299
It was, yeah, it was at Facebook.

174
00:06:31,299 --> 00:06:33,610
I was at, I was at Facebook for briefly for about a year.

175
00:06:33,669 --> 00:06:35,350
Um, fourth as a production engineer

176
00:06:35,355 --> 00:06:35,804
you got out.

177
00:06:35,985 --> 00:06:38,084
So I I got out, subscribe, got out.

178
00:06:38,090 --> 00:06:38,470
I'm just trying

179
00:06:38,470 --> 00:06:40,120
to think how hard it was to get out of the company.

180
00:06:40,150 --> 00:06:40,689
Like this is . Yeah.

181
00:06:41,005 --> 00:06:41,204
Yeah.

182
00:06:42,340 --> 00:06:45,220
I was working, um, production engineering at Facebook Reality Labs,

183
00:06:45,220 --> 00:06:48,039
so I spent time working with a bunch of That sounds a cool job.

184
00:06:48,159 --> 00:06:49,059
Researchers.

185
00:06:49,120 --> 00:06:49,390
Yeah.

186
00:06:49,390 --> 00:06:50,500
They were building really cool stuff.

187
00:06:50,500 --> 00:06:51,939
The problem is there were like a few thousand

188
00:06:51,939 --> 00:06:53,980
researchers and there were about 20 production engineers.

189
00:06:54,239 --> 00:06:55,249
So it was just like

190
00:06:55,369 --> 00:06:57,789
MetaQuest, like VR This is like MetaQuest.

191
00:06:57,790 --> 00:06:58,150
This is like

192
00:06:58,249 --> 00:06:59,199
MetaHorizons.

193
00:06:59,219 --> 00:07:01,089
This is all sorts of this is all the hardware project.

194
00:07:01,249 --> 00:07:01,939
They're working on.

195
00:07:02,379 --> 00:07:03,289
Do you know the legs

196
00:07:03,289 --> 00:07:03,890
on the models?

197
00:07:04,619 --> 00:07:06,109
I was, that's what I was going to ask.

198
00:07:06,109 --> 00:07:08,650
I was going to be like, why don't they have hands and legs?

199
00:07:08,659 --> 00:07:09,859
Like, do you have the teeth?

200
00:07:10,390 --> 00:07:12,609
Like this man was really trying to tell us that we

201
00:07:12,619 --> 00:07:14,780
all need to be more masculine and all this stuff.

202
00:07:14,789 --> 00:07:16,249
And I was like, bro, you can't build hands.

203
00:07:16,259 --> 00:07:16,759
Sit down.

204
00:07:16,829 --> 00:07:18,459
I don't know why there are no legs.

205
00:07:18,499 --> 00:07:19,489
I do.

206
00:07:19,520 --> 00:07:20,559
Yeah, I do know.

207
00:07:20,559 --> 00:07:20,789
Like.

208
00:07:20,844 --> 00:07:23,885
There were lots of really cool projects going on around

209
00:07:23,905 --> 00:07:27,864
the AR glasses that debuted at a recent Meta event.

210
00:07:28,265 --> 00:07:30,184
So the, the really cool, the like time machine

211
00:07:30,184 --> 00:07:33,314
glasses, the real thick ones, those are awesome.

212
00:07:33,324 --> 00:07:35,875
The amounts of engineering that went into.

213
00:07:36,559 --> 00:07:39,520
Every single component in that pair of glasses is crazy.

214
00:07:39,530 --> 00:07:42,630
Like everything in it is custom silicon. Designing that custom silicon

215
00:07:42,630 --> 00:07:46,449
and designing the optics, those silicon carbide optics that are like

216
00:07:46,469 --> 00:07:50,370
actually just like a rock that was manufactured specifically to like

217
00:07:50,370 --> 00:07:53,749
do all these crazy waveguides and stuff, that requires an insane

218
00:07:53,749 --> 00:07:56,859
amount of simulation and insane amount of like physics and engineering.

219
00:07:56,979 --> 00:07:57,749
And I like.

220
00:07:57,975 --> 00:07:58,284
Sure.

221
00:07:58,294 --> 00:08:00,775
Helped the team with their simulation cluster or something.

222
00:08:00,775 --> 00:08:03,265
I have no idea how the math works, but that was cool stuff to work on.

223
00:08:04,434 --> 00:08:07,044
And now that I think about it, Jaz, like I can't see your legs now either.

224
00:08:07,044 --> 00:08:08,164
So I don't even know if you have legs.

225
00:08:08,174 --> 00:08:11,384
So you

226
00:08:11,385 --> 00:08:12,324
can tell us point twice.

227
00:08:13,005 --> 00:08:14,338
No, I, uh, yeah.

228
00:08:14,338 --> 00:08:15,054
So that was.

229
00:08:15,214 --> 00:08:16,375
That was a fun chapter.

230
00:08:16,405 --> 00:08:21,755
After that, I kind of like, I went to, I went to a tiny, I went to a

231
00:08:21,755 --> 00:08:25,934
tiny six person startup that was, that was doing like solar cellular

232
00:08:25,974 --> 00:08:29,314
camera networks around cities for determining parking occupancy.

233
00:08:29,314 --> 00:08:30,104
It was very weird.

234
00:08:30,324 --> 00:08:32,504
What made you want to go from Meta to that?

235
00:08:32,604 --> 00:08:35,275
I wanted a small, like small team startup vibe thing.

236
00:08:35,334 --> 00:08:36,824
And the CEO was a friend of mine from high

237
00:08:36,824 --> 00:08:39,214
school, but they didn't really have any engineers.

238
00:08:39,244 --> 00:08:43,194
And I kind of, I built a product stack and burned out pretty quickly

239
00:08:43,304 --> 00:08:47,034
and then went to work at Planet Labs where I was for about two years.

240
00:08:47,124 --> 00:08:49,574
That was kind of my ethical turning point where I was like,

241
00:08:49,574 --> 00:08:52,014
Hey, I want to go build technology that helps the world.

242
00:08:52,204 --> 00:08:57,175
Uh, Planet builds tiny CubeSat constellation that images the world every day.

243
00:08:57,540 --> 00:09:00,650
They sell that imagery to farmers and agricultural industry

244
00:09:00,680 --> 00:09:04,210
and all sorts of like NGOs and other, other, uh, organizations

245
00:09:04,220 --> 00:09:06,990
so they can get like really fast real time imagery.

246
00:09:07,230 --> 00:09:10,150
My role there was like billing infrastructure when I got

247
00:09:10,150 --> 00:09:12,820
my foot in the door and then it turned into, uh, I wrote a

248
00:09:12,850 --> 00:09:15,329
charter and built out their internal developer experience team.

249
00:09:15,510 --> 00:09:18,450
But come, you know, 18 plus months into my career

250
00:09:18,450 --> 00:09:20,560
at Planet, my friend invites me to Bluesky.

251
00:09:20,650 --> 00:09:23,310
Uh, it was like usually like 20,000 or something like that.

252
00:09:23,430 --> 00:09:25,479
I check out this cool protocol that they're working on.

253
00:09:25,755 --> 00:09:28,084
It's very, very interesting because they just have a

254
00:09:28,084 --> 00:09:30,789
public firehose and I was like, holy crap, that's awesome.

255
00:09:30,789 --> 00:09:33,329
I've never really seen a public firehose for a social network.

256
00:09:33,339 --> 00:09:35,399
So I figure out how to consume the firehose.

257
00:09:35,459 --> 00:09:37,149
I noticed like, Hey, there's this Paul guy.

258
00:09:37,159 --> 00:09:40,669
Who's like everywhere all the time, responding to everybody.

259
00:09:40,729 --> 00:09:41,719
Everyone mentions him.

260
00:09:41,819 --> 00:09:41,949
Always the

261
00:09:42,129 --> 00:09:43,769
first thing everybody notices.

262
00:09:44,359 --> 00:09:44,799
Yeah.

263
00:09:44,799 --> 00:09:47,539
So I was like, who's this Paul guy and how, how much has he mentioned?

264
00:09:47,549 --> 00:09:48,329
So I wrote like,

265
00:09:48,329 --> 00:09:49,419
he's the MySpace Tom of Bluesky.

266
00:09:49,759 --> 00:09:50,209
Yeah.

267
00:09:50,419 --> 00:09:50,909
Yeah.

268
00:09:51,109 --> 00:09:53,639
I wrote some code and I was like, how often is Paul mentioned?

269
00:09:53,639 --> 00:09:56,219
And like, how many different people are talking to Paul?

270
00:09:56,399 --> 00:09:57,889
And so that was the initial idea was like

271
00:09:57,930 --> 00:10:00,859
tracking how popular Paul was on this platform.
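
The mention-counting idea described above can be sketched in a few lines. This is only an illustration, not Jaz's actual code: it assumes the firehose has already been decoded into simple post events, and the `PostEvent` shape and the `mention_stats` name are hypothetical.

```python
from typing import Iterable, NamedTuple


class PostEvent(NamedTuple):
    """Simplified, hypothetical shape of an already-decoded post event."""
    author: str                # account that wrote the post
    mentions: tuple[str, ...]  # accounts mentioned in the post


def mention_stats(events: Iterable[PostEvent], target: str) -> tuple[int, int]:
    """Return (posts mentioning target, distinct authors who mentioned target)."""
    posts = 0
    authors: set[str] = set()
    for event in events:
        if target in event.mentions:
            posts += 1                 # "how often is Paul mentioned?"
            authors.add(event.author)  # "how many different people are talking to Paul?"
    return posts, len(authors)
```

Feeding it an iterable of events answers both questions from the episode in a single pass over the stream.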

272
00:10:00,859 --> 00:10:04,269
And then that evolved into my social graph visualization, which was, Hey,

273
00:10:04,269 --> 00:10:08,069
let's graph all of the interactions between users on Bluesky and try to find

274
00:10:08,079 --> 00:10:12,009
like clusters of, of new users popping up that have common features and stuff.
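
One simple way to find groups in an interaction graph is to compute connected components. The episode does not say which method Jaz used, so treat this as a stand-in technique; the `interaction_clusters` name and the `(user, user)` pair format are assumptions made for the sketch.

```python
from collections import defaultdict, deque
from typing import Iterable


def interaction_clusters(interactions: Iterable[tuple[str, str]]) -> list[set[str]]:
    """Group users into connected components of an undirected interaction graph."""
    graph: dict[str, set[str]] = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)

    seen: set[str] = set()
    clusters: list[set[str]] = []
    for start in list(graph):
        if start in seen:
            continue
        # Breadth-first search collects everyone reachable from `start`.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            user = queue.popleft()
            for neighbor in graph[user]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        clusters.append(component)
    # Largest clusters first.
    return sorted(clusters, key=len, reverse=True)
```

On a real network most users end up in one giant component, so a community-detection algorithm would likely be needed to separate meaningful clusters; connected components just keeps the example short.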

275
00:10:12,159 --> 00:10:12,629
Very cool

276
00:10:12,629 --> 00:10:13,349
hobbies.

277
00:10:13,890 --> 00:10:14,069
Thank

278
00:10:14,069 --> 00:10:14,239
you.

279
00:10:14,240 --> 00:10:16,190
So that was, that was really fun.

280
00:10:16,190 --> 00:10:17,879
And I realized I was spending about 30 hours

281
00:10:17,879 --> 00:10:21,229
a week on miscellaneous atproto stuff.

282
00:10:21,630 --> 00:10:23,450
And then 40 hours a week at work.

283
00:10:23,470 --> 00:10:26,370
And I was like, I definitely like one of these a lot more than the other.

284
00:10:26,500 --> 00:10:29,340
So I went to a, one of the Bluesky user meetups in the Bay Area.

285
00:10:29,360 --> 00:10:31,940
And I met, uh, some members of the team at the time.

286
00:10:31,959 --> 00:10:33,999
And they recognized me from the projects I was doing

287
00:10:33,999 --> 00:10:36,109
on the network of sharing all of this as I was building

288
00:10:36,110 --> 00:10:38,299
it open source, like, Hey, check out this cool graph.

289
00:10:38,329 --> 00:10:40,000
Oh, look, all these new users showed up and they're from

290
00:10:40,000 --> 00:10:42,420
this area and they speak this language or whatever it is.

291
00:10:42,560 --> 00:10:43,830
We chatted for a couple hours and like, cool.

292
00:10:43,840 --> 00:10:45,670
Do you want to like come work here?

293
00:10:45,720 --> 00:10:47,070
And I was like, Oh, do I?

294
00:10:47,560 --> 00:10:48,540
I was like, yeah, I think I do.

295
00:10:49,390 --> 00:10:52,290
That's actually really helpful data for a startup though.

296
00:10:52,640 --> 00:10:53,449
Like you were doing meaningful work.

297
00:10:53,450 --> 00:10:54,080
Yeah, I mean,

298
00:10:54,170 --> 00:10:56,550
yeah, at the point I had more dashboards than the

299
00:10:56,560 --> 00:10:58,850
company did of like what was going on on the network.

300
00:10:58,850 --> 00:11:01,820
I had like a better idea of who their users were than they did.

301
00:11:01,959 --> 00:11:05,230
Obviously it's evolved a whole lot since then, but I was, I was basically

302
00:11:05,230 --> 00:11:07,740
working at the company before I started working at the company just because

303
00:11:07,740 --> 00:11:10,430
everything was open source and everything was all the data was open.

304
00:11:10,440 --> 00:11:11,095
You could just do

305
00:11:11,095 --> 00:11:11,650
whatever you want.

306
00:11:11,650 --> 00:11:12,204
I was gonna

307
00:11:12,204 --> 00:11:12,574
say you

308
00:11:12,574 --> 00:11:14,340
were, you just rolled in with insights.

309
00:11:14,854 --> 00:11:16,334
Yeah, it was, it was super cool.

310
00:11:16,334 --> 00:11:19,394
Like I've never, I've never had like a, an experience like that

311
00:11:19,394 --> 00:11:22,594
where you can watch the evolution of a social network, like from

312
00:11:22,604 --> 00:11:25,734
basically first principles, totally in the public and totally

313
00:11:25,734 --> 00:11:28,314
in the open and just build all sorts of stuff on top of it.

314
00:11:28,314 --> 00:11:31,324
And the developer community around it got really psyched.

315
00:11:31,395 --> 00:11:31,834
Yeah.

316
00:11:31,915 --> 00:11:35,224
That, and that was back in, I joined the team back in July of 2023.

317
00:11:35,385 --> 00:11:38,564
So I've been, been around for about like 18 months now.

318
00:11:38,880 --> 00:11:40,780
And it's been an absolutely insane ride.

319
00:11:40,780 --> 00:11:42,310
It's felt like it's been a decade.

320
00:11:42,630 --> 00:11:45,340
You primarily have focused on the infrastructure side of it, right?

321
00:11:45,400 --> 00:11:46,850
Like as far as, yeah,

322
00:11:46,900 --> 00:11:50,169
my roles and responsibilities are mostly around infrastructure and scaling.

323
00:11:50,290 --> 00:11:53,529
So when I joined, we had around a hundred thousand users this

324
00:11:53,540 --> 00:11:57,439
weekend where we'll probably be pushing on 30 million users.

325
00:11:57,614 --> 00:11:59,814
So pretty significant increase in scale.
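
To put the figures just quoted in perspective (about 100,000 users to about 30 million in roughly 18 months), here is a small worked calculation. It assumes smooth compound growth, which is a simplification since real sign-ups arrive in bursts; the `growth_summary` helper is hypothetical.

```python
import math


def growth_summary(start_users: float, end_users: float, months: float) -> tuple[float, float, float]:
    """Return (overall factor, equivalent monthly growth rate, doubling time in months)."""
    factor = end_users / start_users
    monthly_rate = factor ** (1 / months) - 1          # compound rate per month
    doubling_months = math.log(2) / math.log(1 + monthly_rate)
    return factor, monthly_rate, doubling_months


factor, rate, doubling = growth_summary(100_000, 30_000_000, 18)
print(f"{factor:.0f}x overall, about {rate:.0%} per month, doubling every {doubling:.1f} months")
```

Under that smoothing assumption, the user base grew 300-fold, equivalent to roughly 37% per month, or a doubling a little more often than every two and a half months.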

326
00:12:00,214 --> 00:12:00,795
It's a little bit 18 months.

327
00:12:00,854 --> 00:12:03,624
We need a HugOps meme for Jaz.

328
00:12:03,764 --> 00:12:06,364
You are the real MVP because the amount of

329
00:12:06,374 --> 00:12:09,444
people that are social media refugees right now.

330
00:12:09,814 --> 00:12:11,924
It's not just me on the, on the infra side of things.

331
00:12:11,924 --> 00:12:13,384
We have a, we probably have.

332
00:12:13,774 --> 00:12:16,554
Five or six people who are like kind of core

333
00:12:16,584 --> 00:12:18,474
infrastructure, like on call rotation now.

334
00:12:18,514 --> 00:12:21,514
But back in the day, it was, it was not quite that big.

335
00:12:21,614 --> 00:12:23,000
We were, we were a really tiny team.

336
00:12:23,000 --> 00:12:23,310
Five is

337
00:12:23,310 --> 00:12:26,784
still a lot for that many users and then doing mostly on prem.

338
00:12:26,874 --> 00:12:28,194
Yeah, it's, it's a bit crazy.

339
00:12:28,204 --> 00:12:33,064
We, we built out our data center locations in like November of 2024.

340
00:12:33,284 --> 00:12:35,004
And so before that, we were all on cloud.

341
00:12:35,094 --> 00:12:37,284
Things were kind of falling over with a hundred thousand users.

342
00:12:37,544 --> 00:12:38,184
Oh, that's interesting.

343
00:12:38,184 --> 00:12:39,504
So you did start in the cloud.

344
00:12:40,145 --> 00:12:40,475
And then,

345
00:12:40,475 --> 00:12:41,675
yeah, it all started in the cloud.

346
00:12:41,675 --> 00:12:42,805
Like general overview.

347
00:12:42,875 --> 00:12:44,645
What does the infrastructure look like today?

348
00:12:44,725 --> 00:12:46,245
Like where, as I know, like some pieces in

349
00:12:46,245 --> 00:12:48,454
the cloud, some on prem, some different areas.

350
00:12:48,465 --> 00:12:49,155
Like what is that?

351
00:12:49,235 --> 00:12:50,064
How does that break down?

352
00:12:50,504 --> 00:12:52,785
So we have three tiers of infrastructure, I guess.

353
00:12:52,814 --> 00:12:57,074
You'd have like singleton one off services, which are kind of smaller, lower

354
00:12:57,074 --> 00:13:01,334
load services that we want a replicated Postgres database for or something.

355
00:13:01,354 --> 00:13:03,204
And those we stick in a cloud provider.

356
00:13:03,324 --> 00:13:06,265
We have our like core data services, which are really

357
00:13:06,265 --> 00:13:08,435
high compute scale, really high storage requirements.

358
00:13:08,999 --> 00:13:10,389
Uh, and those we run on prem.

359
00:13:10,599 --> 00:13:12,489
And so we have two, two POPs, two physical

360
00:13:12,489 --> 00:13:14,319
POPs that we have our own hardware in.

361
00:13:14,609 --> 00:13:15,439
Uh, that we co locate.

362
00:13:15,439 --> 00:13:16,869
We, you know, get a cage in a data center

363
00:13:16,879 --> 00:13:18,899
somewhere and go throw your servers in it.

364
00:13:19,149 --> 00:13:21,800
And then we have, our third tier is kind of like bare metal

365
00:13:21,800 --> 00:13:25,569
providers, which is different providers, but they give us

366
00:13:25,809 --> 00:13:29,399
basically a full machine in their data center somewhere.

367
00:13:29,489 --> 00:13:31,449
And then we run like the PDSs.

368
00:13:31,459 --> 00:13:35,450
So if you've, you're the personal data servers that have All of our users

369
00:13:35,450 --> 00:13:39,610
canonical data on it is stored on bare metal through bare metal providers.

370
00:13:39,630 --> 00:13:42,030
And that lets us kind of scale those a lot more easily than we

371
00:13:42,030 --> 00:13:45,409
can scale our own physical hardware and then smaller one off

372
00:13:45,410 --> 00:13:48,220
services or things that need to be in the cloud are in the cloud.

373
00:13:48,250 --> 00:13:52,700
And then all of our kind of like really high compute intensive or network

374
00:13:52,700 --> 00:13:57,080
intensive or storage intensive stuff runs on our own hardware because.

375
00:13:57,575 --> 00:14:00,715
bandwidth in a data center is a lot cheaper than bandwidth in a cloud.

376
00:14:00,815 --> 00:14:03,235
Storage in a data center is a lot cheaper than storage in the cloud.
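
The three-tier placement described above can be summarized as a small decision rule. This is a sketch of the reasoning as stated in the conversation, not Bluesky's actual tooling; the `Service` fields, the `Tier` names, and the `place` function are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CLOUD = "cloud provider"
    COLOCATED = "own hardware in colocated data centers"
    BARE_METAL = "machines rented from bare-metal providers"


@dataclass(frozen=True)
class Service:
    name: str
    holds_canonical_user_data: bool = False  # e.g. a PDS
    resource_intensive: bool = False         # heavy compute, network, or storage


def place(service: Service) -> Tier:
    """Pick a tier following the rough rules given in the episode."""
    if service.holds_canonical_user_data:
        return Tier.BARE_METAL   # easier to scale out than owned hardware
    if service.resource_intensive:
        return Tier.COLOCATED    # bandwidth and storage are cheaper there
    return Tier.CLOUD            # small one-off services, managed databases
```

For example, `place(Service("pds", holds_canonical_user_data=True))` returns `Tier.BARE_METAL`, while a small singleton service with default flags lands in `Tier.CLOUD`.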

377
00:14:03,375 --> 00:14:05,415
Do you have any extra backup storage in the

378
00:14:05,415 --> 00:14:07,945
cloud just in case things get super crazy?

379
00:14:08,195 --> 00:14:10,375
Yeah, so there are all sorts of like different tiers of

380
00:14:10,394 --> 00:14:13,205
backups based on what kind of data it is and where it is.

381
00:14:13,205 --> 00:14:15,995
So like canonical data, like your PDS data is backed up

382
00:14:16,025 --> 00:14:18,375
a couple different ways in a couple different places.

383
00:14:18,425 --> 00:14:22,535
But like our global index of all of the data in the Atmosphere,

384
00:14:22,535 --> 00:14:25,994
we run two fully independent copies, uh, indexing the Atmosphere.

385
00:14:25,994 --> 00:14:30,774
So each data center, uh, fully indexes, um, the firehose on its own.

386
00:14:31,074 --> 00:14:34,514
Um, so they both contain two independent sets of the same data.

387
00:14:34,844 --> 00:14:36,544
Um, so if there were some kind of.

388
00:14:36,889 --> 00:14:38,109
outage or anything like that.

389
00:14:38,269 --> 00:14:40,379
Uh, we have at least a copy of that somewhere.

390
00:14:40,409 --> 00:14:42,609
Uh, and we have the ability to shift all of our traffic to one of

391
00:14:42,609 --> 00:14:45,719
the data centers so that it can, it can handle the production load.
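
The redundancy pattern described here, two sites that each build a full index from the same event stream, with traffic shifted to whichever site is healthy, can be modeled in miniature. The `Site`, `fan_out`, and `route` names are hypothetical and the in-memory dictionary stands in for a real index.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass
class Site:
    name: str
    healthy: bool = True
    index: dict[str, str] = field(default_factory=dict)

    def apply(self, key: str, value: str) -> None:
        """Index one event; each site does this on its own."""
        self.index[key] = value


def fan_out(sites: Iterable[Site], events: Iterable[tuple[str, str]]) -> None:
    """Deliver every event to every site so each holds a complete copy."""
    site_list = list(sites)
    for key, value in events:
        for site in site_list:
            site.apply(key, value)


def route(sites: Iterable[Site], key: str) -> tuple[str, Optional[str]]:
    """Serve a read from the first healthy site, returning (site name, value)."""
    for site in sites:
        if site.healthy:
            return site.name, site.index.get(key)
    raise RuntimeError("no healthy site available")
```

Because both sites hold the same data, marking one as unhealthy changes which site answers but not the answer itself.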

392
00:14:45,899 --> 00:14:49,049
It makes my heart so happy when people use cloud and on prem

393
00:14:49,050 --> 00:14:52,800
correctly and don't just think either of them are the end all be all.

394
00:14:52,879 --> 00:14:56,474
And when people are redundant properly, like it just makes me so happy.

395
00:14:56,694 --> 00:14:58,915
We, we really like commoditized cloud products.

396
00:14:58,925 --> 00:15:01,624
So something like block storage is like super commoditized.

397
00:15:01,624 --> 00:15:02,324
It's super cheap.

398
00:15:02,364 --> 00:15:04,305
There's so many different people that provide it and the

399
00:15:04,305 --> 00:15:07,314
like SLAs on it are very industry standard at this point.

400
00:15:07,324 --> 00:15:09,295
So it's much easier to get cheap

401
00:15:09,614 --> 00:15:10,525
block storage.

402
00:15:10,574 --> 00:15:13,724
So we don't mind. Building a petabyte scale, uh, storage

403
00:15:13,724 --> 00:15:17,405
cluster on, like, metal is, is kind of challenging.

404
00:15:17,405 --> 00:15:18,515
It's, it's expensive.

405
00:15:18,574 --> 00:15:21,354
It's error prone, depending on your latency requirements

406
00:15:21,354 --> 00:15:23,034
and stuff, you might need to be running flash for that.

407
00:15:23,045 --> 00:15:24,484
In which case it's a lot more expensive.

408
00:15:24,604 --> 00:15:27,044
And if you're running hard drives, you have failure rates, which means you

409
00:15:27,045 --> 00:15:30,045
need somebody to like the bigger scale your cluster is, the more often you have

410
00:15:30,045 --> 00:15:33,265
to send somebody down to go swap hard drives, whereas block storage is like.

411
00:15:33,809 --> 00:15:36,479
Honestly, really economical up to, I think it's somewhere in

412
00:15:36,479 --> 00:15:38,599
the, in the like four to five petabyte range, at which point

413
00:15:38,599 --> 00:15:41,209
it makes sense to start just running your own storage clusters.

414
00:15:41,399 --> 00:15:44,419
But I love that you guys actually did the numbers and you looked at

415
00:15:44,449 --> 00:15:48,229
each, you know, like the, all of your storage is very well placed.

416
00:15:48,379 --> 00:15:50,109
Well, and the, the really funny thing is, I

417
00:15:50,109 --> 00:15:52,009
mean, you, you said you started these in 2023.

418
00:15:53,539 --> 00:15:56,689
Alright, like end of 2024, you were growing a million users a day.

419
00:15:56,689 --> 00:15:59,079
So whatever math you thought you had in 2023

420
00:15:59,089 --> 00:16:01,669
was not the math you were doing in 2024.

421
00:16:01,919 --> 00:16:03,519
You would be surprised.

422
00:16:03,569 --> 00:16:08,050
We've got some spreadsheets that were written in 2023,

423
00:16:08,050 --> 00:16:11,168
very early 2024.

425
00:16:11,744 --> 00:16:14,074
They were wildly ambitious when they were written

426
00:16:14,164 --> 00:16:16,384
that go month by month, like user numbers.

427
00:16:16,444 --> 00:16:18,844
We missed a ton of the marks on them.

428
00:16:18,874 --> 00:16:20,064
And then we caught up.

429
00:16:20,305 --> 00:16:22,035
Were y'all just predicting the future?

430
00:16:22,334 --> 00:16:24,844
Like, did, did someone know if Elon was

431
00:16:24,844 --> 00:16:26,354
breaking up or getting back with girlfriends?

432
00:16:28,324 --> 00:16:30,894
Our previous infrastructure lead, Jake, Justin, who I

433
00:16:30,935 --> 00:16:33,464
think you talked to briefly on the website at some point.

434
00:16:33,665 --> 00:16:36,074
He wrote up this spreadsheet that like, I think he, I

435
00:16:36,084 --> 00:16:38,665
think he based it off of Instagram numbers or something.

436
00:16:38,665 --> 00:16:41,145
He got, he got a bunch of different numbers from different social medias

437
00:16:41,175 --> 00:16:45,125
that like had, they gave you a whole like six data points of their

438
00:16:45,125 --> 00:16:48,234
user numbers over the course of their like 10 year history and then

439
00:16:48,235 --> 00:16:52,004
extrapolated between them to try and find what successful growth looked like.
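A rough sketch of that kind of extrapolation, assuming a handful of (month, total users) points and geometric growth between them. The data points and function names below are hypothetical placeholders, not figures from Bluesky or any other network.

```python
import math

# Hypothetical (months since launch, total users) points; NOT real figures.
HYPOTHETICAL_POINTS = [(0, 1_000), (12, 100_000), (24, 1_000_000), (48, 10_000_000)]


def interpolate_users(points, month):
    """Log-linear interpolation of user counts between sparse known points."""
    if month <= points[0][0]:
        return float(points[0][1])
    if month >= points[-1][0]:
        return float(points[-1][1])
    for (m0, u0), (m1, u1) in zip(points, points[1:]):
        if m0 <= month <= m1:
            t = (month - m0) / (m1 - m0)
            # Interpolating in log space assumes steady percentage growth
            # between the two known points.
            return math.exp(math.log(u0) + t * (math.log(u1) - math.log(u0)))


def monthly_targets(points, start, end):
    """Month-by-month user targets, like rows in a planning spreadsheet."""
    return [(m, round(interpolate_users(points, m))) for m in range(start, end + 1)]
```

Calling monthly_targets(HYPOTHETICAL_POINTS, 0, 48) gives one target per month, which is the shape of the month-by-month plan described here.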

440
00:16:52,025 --> 00:16:54,094
And then we built that into the spreadsheet and

441
00:16:54,094 --> 00:16:56,255
then we said, Hey, let's plan for success, because

442
00:16:56,804 --> 00:16:58,484
if you plan for failure, you're not going to succeed.

443
00:16:58,564 --> 00:17:02,934
Which is amazing because just the, the environment in which Instagram

444
00:17:02,974 --> 00:17:08,144
and Facebook and most places and social media grew is not like

445
00:17:08,659 --> 00:17:09,639
what's happening right now.

446
00:17:09,649 --> 00:17:12,659
Like, this is a very crazy time in social media.

447
00:17:12,949 --> 00:17:16,079
When you, when you have so much market saturation and there's so many

448
00:17:16,079 --> 00:17:19,489
incumbents and everybody is already fully subscribed on the social medias

449
00:17:19,489 --> 00:17:24,659
that they want to be on, it is so hard to pull people away from the platform

450
00:17:24,659 --> 00:17:27,390
that they're on and bring them to something new and show them something new.

451
00:17:27,649 --> 00:17:32,100
And we saw that for six months last year, we had like basically flat growth.

452
00:17:32,120 --> 00:17:36,510
We were like between three and four thousand new users a day for six months.

453
00:17:36,919 --> 00:17:40,710
And then in November of 2024, we were doing over

454
00:17:40,710 --> 00:17:43,010
a million users a day for three days in a row.

455
00:17:43,020 --> 00:17:43,550
We'll never

456
00:17:43,550 --> 00:17:46,209
know why in November specifically that happened.

457
00:17:46,700 --> 00:17:48,000
The whiplash is crazy.

458
00:17:48,010 --> 00:17:49,679
You go from like no growth at all.

459
00:17:49,810 --> 00:17:52,719
Oh, we, I can't believe we spent so much time focusing on scaling.

460
00:17:52,720 --> 00:17:56,429
Why did we waste all that time and money on filling out these data centers?

461
00:17:56,615 --> 00:17:59,905
Oh my gosh, like at, in the last six months, was there like, you don't have

462
00:17:59,905 --> 00:18:02,524
to give us specifics, obviously we don't need numbers, but like, was there

463
00:18:02,534 --> 00:18:06,254
ever a point where you were like, oh my goodness, like what's going on?

464
00:18:06,264 --> 00:18:07,985
Like, or how are we going to sustain this?

465
00:18:08,575 --> 00:18:11,155
Brazil was insane, Brazil was, I was, I thought

466
00:18:11,155 --> 00:18:14,145
it was going to be another thing because I like watched the Brazil blip,

467
00:18:14,155 --> 00:18:18,105
but I guess in the U.S. it didn't make the same, it didn't seem as much.

468
00:18:18,225 --> 00:18:21,154
It didn't catch on as much in the U.S., but it was

469
00:18:21,185 --> 00:18:24,454
like one and a half million users in a weekend, right?

470
00:18:24,485 --> 00:18:27,645
Which for us coming from having like no growth for six months

471
00:18:27,645 --> 00:18:30,085
to suddenly picking up a million and a half users in a weekend

472
00:18:30,085 --> 00:18:33,325
was, it was like 30 percent growth of our network in like a week.

473
00:18:33,635 --> 00:18:34,915
Which was nuts for us.

474
00:18:35,185 --> 00:18:38,935
I was like on a plane to London to go see a friend of mine for his birthday.

475
00:18:39,264 --> 00:18:42,604
And I like bought the in flight wifi and was like trying to get my

476
00:18:42,605 --> 00:18:45,634
VPN to work so that I could like connect to dashboards and everything.

477
00:18:45,645 --> 00:18:46,245
And it was.

478
00:18:46,764 --> 00:18:47,935
I was so terrified.

479
00:18:47,935 --> 00:18:49,965
I ended up, like, working the entire weekend.

480
00:18:49,965 --> 00:18:54,294
I was in London because I was just like, we've never seen any load like this.

481
00:18:54,304 --> 00:18:57,015
It was like five or six times higher, like firehose

482
00:18:57,015 --> 00:19:00,144
throughput and request throughput than we've ever seen before.

483
00:19:00,284 --> 00:19:02,874
You've mentioned a couple of components and I don't want to go deep into

484
00:19:02,874 --> 00:19:05,574
the AT Protocol stuff, but like, could you just give a general overview?

485
00:19:05,574 --> 00:19:09,054
Like the PDS, the firehose, the AppView, the indexes.

486
00:19:09,360 --> 00:19:12,960
All of those have different constraints and how do they tie

487
00:19:12,960 --> 00:19:15,550
together or just a general overview of like what Bluesky is

488
00:19:15,550 --> 00:19:18,260
offering as a service is a bunch of things underneath it.

489
00:19:18,870 --> 00:19:21,780
I'll steal the Paulism, which is everybody's a website.

490
00:19:21,839 --> 00:19:25,359
So you as a user on Bluesky, every time you like something,

491
00:19:25,360 --> 00:19:27,500
every time you create a post, every time you follow somebody.

492
00:19:27,804 --> 00:19:31,124
Uh, every time you block somebody, every time you repost something, you

493
00:19:31,134 --> 00:19:35,415
are writing a little document, a JSON document effectively to your website.

494
00:19:35,485 --> 00:19:38,475
You're putting a JSON document in your canonical data

495
00:19:38,475 --> 00:19:42,194
store that lives on your PDS, on your personal data server.
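As a rough illustration of "writing a little JSON document": the record type names below follow the public app.bsky lexicons, but the ToyRepo class, its counter-based record keys, and the did:example identifier are assumptions made up for this sketch, not how a real PDS stores or keys records.

```python
import json
from datetime import datetime, timezone


def _timestamp(when=None):
    """UTC timestamp in ISO 8601 form with a trailing Z."""
    when = when or datetime.now(timezone.utc)
    return when.isoformat().replace("+00:00", "Z")


def make_post(text, when=None):
    return {"$type": "app.bsky.feed.post", "text": text, "createdAt": _timestamp(when)}


def make_like(subject_uri, subject_cid, when=None):
    return {
        "$type": "app.bsky.feed.like",
        "subject": {"uri": subject_uri, "cid": subject_cid},
        "createdAt": _timestamp(when),
    }


class ToyRepo:
    """Hypothetical stand-in for one user's canonical data store on a PDS."""

    def __init__(self, did):
        self.did = did
        self.records = {}  # "collection/rkey" -> JSON text
        self._next = 0     # toy record key; an assumption, not the real scheme

    def put(self, record):
        self._next += 1
        key = f"{record['$type']}/{self._next}"
        self.records[key] = json.dumps(record)
        return f"at://{self.did}/{key}"
```

Every like, post, follow, block, or repost becomes one more small document in that user's store.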

496
00:19:42,274 --> 00:19:45,624
For the vast majority of our users, that means they are writing it to a PDS

497
00:19:45,634 --> 00:19:50,384
that we operate, but there are also thousands of independently operated PDSs.

498
00:19:50,665 --> 00:19:50,955
I'm one

499
00:19:50,955 --> 00:19:51,225
of them.

500
00:19:51,425 --> 00:19:52,685
Yeah, Justin's one of them.

501
00:19:52,685 --> 00:19:52,795
He

502
00:19:52,795 --> 00:19:54,405
broke himself for a while, and I couldn't reply to him.

503
00:19:54,415 --> 00:19:55,065
I was broken for a

504
00:19:55,065 --> 00:19:56,055
little while, but I'm back.

505
00:19:56,274 --> 00:19:57,825
It's been stable for a couple weeks now.

506
00:19:57,825 --> 00:19:58,017
It

507
00:19:58,017 --> 00:19:59,552
was like he picked on me specifically, Jaz.

508
00:19:59,552 --> 00:20:01,095
Like, I couldn't even reply to him.

509
00:20:01,105 --> 00:20:04,044
And he kept tagging me in taco and license plate things.

510
00:20:04,044 --> 00:20:04,925
Like, that's just mean.

511
00:20:04,935 --> 00:20:06,295
First of all, I'm hungry.

512
00:20:07,204 --> 00:20:09,014
And then I couldn't even reply.

513
00:20:09,274 --> 00:20:10,204
That's unfortunate.

514
00:20:10,205 --> 00:20:12,475
The nature of a distributed network is, is you've got

515
00:20:12,475 --> 00:20:14,425
all these documents that you write into your personal

516
00:20:14,425 --> 00:20:16,575
data store, whether it's hosted by us or somebody else.

517
00:20:17,100 --> 00:20:19,299
They get aggregated into one giant firehose.

518
00:20:19,309 --> 00:20:22,699
So your PDS emits an event stream for all of the repos hosted on it.

519
00:20:22,709 --> 00:20:26,620
So for our users, right now it's usually like 500,000 users per PDS.

520
00:20:26,789 --> 00:20:29,509
And so if you're on like Amanita, that's a,

521
00:20:29,600 --> 00:20:31,749
all of our PDSs are named after mushrooms.

522
00:20:31,870 --> 00:20:33,049
Um, so if you're on Amanita.

523
00:20:33,434 --> 00:20:38,144
You've got 499,999 of your closest friends on Amanita with

524
00:20:38,144 --> 00:20:41,084
you, and every time you post, you are writing to Amanita, to

525
00:20:41,559 --> 00:20:44,684
a SQLite database that exists just for you on Amanita.

526
00:20:44,864 --> 00:20:45,824
Each user gets one

527
00:20:45,824 --> 00:20:46,159
SQLite. We

528
00:20:46,159 --> 00:20:48,644
like database Justin, or do you think we're the same?

529
00:20:48,884 --> 00:20:49,245
We were.

530
00:20:49,249 --> 00:20:49,399
That was.

531
00:20:50,085 --> 00:20:51,855
Well, so that was the problem, actually, Autumn, because

532
00:20:51,855 --> 00:20:53,875
someone else pointed that out to me where you and I

533
00:20:53,875 --> 00:20:57,415
were on the same PDS originally before I migrated off.

534
00:20:57,875 --> 00:21:00,835
And then when I migrated off, they were the real MVP.

535
00:21:01,725 --> 00:21:03,864
I don't remember which one it was, but when I migrated off to my

536
00:21:03,865 --> 00:21:08,584
own, my account deactivation didn't fully happen on the hosted PDS.

537
00:21:08,584 --> 00:21:12,975
So people on my PDS couldn't reply to me until I went through and did my full.

538
00:21:14,215 --> 00:21:17,324
So it just so happened we were neighbors and, uh, and then when I

539
00:21:17,794 --> 00:21:20,125
moved out of the neighborhood, why'd you do that?

540
00:21:20,135 --> 00:21:21,824
Like just worst friend ever.

541
00:21:21,875 --> 00:21:22,175
Like

542
00:21:22,545 --> 00:21:24,325
you and your neighbors all chat, all you want.

543
00:21:24,325 --> 00:21:26,924
You write all of these, these documents to your own

544
00:21:26,924 --> 00:21:29,025
little SQLites living on your mushroom with you.

545
00:21:29,285 --> 00:21:33,974
And then the mushroom itself, it sequences all of the events for its users.

546
00:21:33,975 --> 00:21:36,955
So you and all the other people on there are writing.

547
00:21:37,395 --> 00:21:41,225
Generally, we see somewhere between 5 and 20 events a second per PDS.

548
00:21:41,295 --> 00:21:43,665
So, all those writes get written into one

549
00:21:43,685 --> 00:21:46,135
sequencer database, which is a SQLite as well.

550
00:21:46,465 --> 00:21:48,504
Uh, and then once they get sequenced, they get given like

551
00:21:48,504 --> 00:21:50,744
a sequence number and they get emitted out of the firehose.

552
00:21:50,774 --> 00:21:52,865
So each, each mushroom has its own little firehose.
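A minimal sketch of that layout, assuming in-memory SQLite databases and made-up table names: one database per user for their records, plus one sequencer database that hands out sequence numbers and feeds the per-PDS firehose. ToyPDS and its schema are hypothetical, not the actual PDS implementation.

```python
import json
import sqlite3


class ToyPDS:
    """Toy model: one SQLite database per user plus one sequencer database."""

    def __init__(self):
        self.user_dbs = {}  # did -> connection (in-memory here for the demo)
        self.sequencer = sqlite3.connect(":memory:")
        self.sequencer.execute(
            "CREATE TABLE events ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, did TEXT, payload TEXT)"
        )

    def _db_for(self, did):
        # Lazily create the user's own database on first write.
        if did not in self.user_dbs:
            db = sqlite3.connect(":memory:")
            db.execute(
                "CREATE TABLE records ("
                "rkey INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT)"
            )
            self.user_dbs[did] = db
        return self.user_dbs[did]

    def write(self, did, record):
        """Store the record for the user, then sequence it for the firehose."""
        body = json.dumps(record)
        db = self._db_for(did)
        db.execute("INSERT INTO records (body) VALUES (?)", (body,))
        db.commit()
        cur = self.sequencer.execute(
            "INSERT INTO events (did, payload) VALUES (?, ?)", (did, body)
        )
        self.sequencer.commit()
        return cur.lastrowid  # the sequence number assigned to this event

    def firehose(self, since=0):
        """Yield sequenced events after the given cursor, in order."""
        rows = self.sequencer.execute(
            "SELECT seq, did, payload FROM events WHERE seq > ? ORDER BY seq",
            (since,),
        )
        for seq, did, payload in rows:
            yield {"seq": seq, "did": did, "record": json.loads(payload)}
```

The since cursor is what lets a consumer reconnect and resume from the last sequence number it saw.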

553
00:21:53,234 --> 00:21:56,165
And then we have something called the relay, which is in the network that

554
00:21:56,175 --> 00:22:00,645
sucks from all of the mushrooms and turns into a gigantic firehose that

555
00:22:00,645 --> 00:22:05,215
does, you know, anywhere from 1,000 to 2,000 events per second these days.

556
00:22:05,364 --> 00:22:10,865
And so that giant firehose is running in our right now it runs in our on prem

557
00:22:10,865 --> 00:22:15,064
for us that merges all of the disparate event streams into one giant event

558
00:22:15,064 --> 00:22:19,784
stream, which makes consuming the network a lot easier and a lot less complex.
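The merge step can be pictured roughly like this, assuming each upstream event carries its own per-PDS sequence number. ToyRelay, its field names, and the example host names are hypothetical; this shows only the idea of re-sequencing many small streams into one.

```python
from itertools import count


class ToyRelay:
    """Toy relay: merges per-PDS event streams into one sequenced stream."""

    def __init__(self):
        self._seq = count(1)  # relay-level sequence numbers
        self.cursors = {}     # pds host -> last upstream seq consumed
        self.log = []         # the merged stream

    def ingest(self, host, event):
        """Accept one event from one PDS; ignore anything already seen."""
        if event["seq"] <= self.cursors.get(host, 0):
            return None
        self.cursors[host] = event["seq"]
        merged = {
            "seq": next(self._seq),
            "host": host,
            "upstream_seq": event["seq"],
            "record": event["record"],
        }
        self.log.append(merged)
        return merged

    def subscribe(self, since=0):
        """Consumers read one stream with one cursor, whatever the PDS count."""
        return [e for e in self.log if e["seq"] > since]
```

A consumer then tracks a single cursor instead of one per PDS, which is the simplification being described.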

559
00:22:19,864 --> 00:22:23,144
And so that one big event stream then gets crawled by firehose

560
00:22:23,174 --> 00:22:26,105
consumers, a couple hundred of those. Jetstream is connected to

561
00:22:26,105 --> 00:22:27,985
that, which is like a lightweight version of the firehose that

562
00:22:27,995 --> 00:22:29,935
has a couple hundred consumers that connect to it as well.

563
00:22:30,004 --> 00:22:32,024
But everybody, everybody consumes this Firehose and

564
00:22:32,024 --> 00:22:34,855
then the Firehose has like, hey, this person created.

565
00:22:35,179 --> 00:22:39,519
This record with this ID and here's the content of that record and then

566
00:22:39,529 --> 00:22:42,870
here's a proof of this operation so that you can check and make sure that

567
00:22:42,870 --> 00:22:45,889
they actually created this record, like it's signed with their private key.

568
00:22:46,059 --> 00:22:49,229
Can I just work at Bluesky for like a day and then just re architect

569
00:22:49,239 --> 00:22:51,989
all of your architecture with mushrooms, like just little mushroom

570
00:22:51,989 --> 00:22:55,719
databases and like just magical streams, you know, like it was

571
00:22:55,719 --> 00:22:58,769
just like the fire hose will be like magical and then like it'll

572
00:22:58,779 --> 00:23:01,719
be like each thing will be like a mushroom and it'll be adorable.

573
00:23:02,119 --> 00:23:04,169
There's a legendary drawing that I found in the

574
00:23:04,169 --> 00:23:06,289
developer discord, the third party dev discord.

575
00:23:06,399 --> 00:23:11,509
Somebody drew an architectural diagram of Bluesky, but where each node in the

576
00:23:11,509 --> 00:23:15,689
network is like a forest creature and gave them like very interesting names.

577
00:23:15,720 --> 00:23:18,600
And so there, there is like kind of headcanon of, oh yeah, this,

578
00:23:18,649 --> 00:23:21,520
this is, this component is like an anteater and this component

579
00:23:21,530 --> 00:23:24,370
is like a hedgehog and this component, you know, whatever.

580
00:23:24,510 --> 00:23:26,889
Each database should be a mushroom or like

581
00:23:26,890 --> 00:23:29,850
each, each data, the mycosphere, we call it, uh,

582
00:23:29,850 --> 00:23:32,679
all of the, the PDSes make up the mycosphere.

583
00:23:32,919 --> 00:23:36,260
So that gets indexed internally and then we have a big

584
00:23:36,669 --> 00:23:39,360
database in each of the POPs right now that runs

585
00:23:39,360 --> 00:23:42,389
Scylla, which is kind of a C++ rewrite of Cassandra.

586
00:23:42,439 --> 00:23:45,159
So it's a big NoSQL key-value store.

587
00:23:45,249 --> 00:23:49,350
And that's where we actually persist the global index of data on the network.

588
00:23:49,399 --> 00:23:53,460
So your PDS only knows about what your users have, what its users have.

589
00:23:53,889 --> 00:23:58,040
I love watching Scylla and Cassandra fight and then Scylla's like, C is like

590
00:23:58,040 --> 00:24:01,330
faster because we don't like compile and then Cassandra's like, but we're

591
00:24:01,330 --> 00:24:04,970
faster and it just pretend like it doesn't suck to manage us and then they

592
00:24:04,970 --> 00:24:08,929
fight back and forth and it's the best nerd fight you've ever seen in your life.

593
00:24:09,020 --> 00:24:10,710
Is that where the AppView pulls from?

594
00:24:10,720 --> 00:24:12,429
It's not going directly from the.

595
00:24:12,509 --> 00:24:12,759
Yeah.

596
00:24:12,760 --> 00:24:15,500
So the AppView pulls from its local, there's

597
00:24:15,500 --> 00:24:17,710
a data service that I wrote called Atlantis,

598
00:24:17,969 --> 00:24:20,929
which is our, like, data plane or whatever that talks to Scylla. That's

599
00:24:20,929 --> 00:24:24,039
what writes things into Scylla, that's what reads things out of Scylla. It

600
00:24:24,039 --> 00:24:27,409
also handles some, like, caching tiers, it handles some request coalescing,

601
00:24:27,419 --> 00:24:30,939
things like that. And so that is where the global index of data is. So when

602
00:24:30,939 --> 00:24:34,699
you load your timeline, when you load a thread, when you look at the number

603
00:24:34,699 --> 00:24:38,249
of likes on a post, that's all coming out of Scylla. That's coming from our

604
00:24:38,289 --> 00:24:42,359
big data store. And then in terms of scale for that, like the actual amount

605
00:24:42,359 --> 00:24:45,509
of data that's on the network is like, it's like a couple of terabytes.

606
00:24:45,559 --> 00:24:48,209
If you don't include images and you don't include video or anything like

607
00:24:48,209 --> 00:24:52,219
that, the actual like, record data, the JSON is like a couple terabytes.

608
00:24:52,229 --> 00:24:53,569
So it's not huge.
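Request coalescing, mentioned above as one of the data service's jobs, can be sketched as follows: when many callers ask for the same key at once, only the first one queries the database and the rest wait for its answer. The Coalescer class is a generic illustration under that assumption, not the actual data service code.

```python
import threading


class Coalescer:
    """Collapse concurrent requests for the same key into one backend fetch."""

    def __init__(self, fetch):
        self._fetch = fetch      # function that actually hits the database
        self._lock = threading.Lock()
        self._inflight = {}      # key -> shared state for the in-flight fetch

    def waiting(self, key):
        """Number of callers currently piggybacking on an in-flight fetch."""
        with self._lock:
            entry = self._inflight.get(key)
            return entry["waiters"] if entry else 0

    def get(self, key):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "value": None,
                         "error": None, "waiters": 0}
                self._inflight[key] = entry
            else:
                entry["waiters"] += 1
        if leader:
            try:
                entry["value"] = self._fetch(key)
            except Exception as exc:
                entry["error"] = exc
            finally:
                # Clear the slot first so later requests start a fresh fetch.
                with self._lock:
                    del self._inflight[key]
                entry["done"].set()
        else:
            entry["done"].wait()
        if entry["error"] is not None:
            raise entry["error"]
        return entry["value"]
```

For a hot post that thousands of clients load in the same instant, this turns thousands of identical reads into one.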

609
00:24:53,699 --> 00:24:55,159
The timelines are really big though.

610
00:24:55,219 --> 00:24:57,449
Timelines are a really weird workload, which is like every time

611
00:24:57,449 --> 00:25:00,349
you post, we send out your post to all the people that follow you.

612
00:25:00,469 --> 00:25:02,709
So if you have 20,000 followers and you post something,

613
00:25:02,709 --> 00:25:05,269
we're going to go insert 20,000 references to your

614
00:25:05,269 --> 00:25:08,509
post into the timelines of the people that follow you.

615
00:25:08,679 --> 00:25:09,979
And then we keep a That sounds very complex.

616
00:25:09,979 --> 00:25:14,804
It's It was a big architectural shift from what we did before, but the

617
00:25:14,824 --> 00:25:18,574
timelines themselves, like the timelines table is like over 100 billion rows.

618
00:25:18,684 --> 00:25:21,624
We trim it so like there's a maximum length of your timeline, but when you

619
00:25:21,624 --> 00:25:25,034
have 30 million users and you want to keep like a few thousand timeline

620
00:25:25,034 --> 00:25:27,964
items in there that quickly balloons to like hundreds of billions of rows.
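A toy version of that fan-out-on-write timeline with trimming. The class, the tiny cap, and the 4,000-item figure in the estimate are assumptions for illustration; the transcript only says "a few thousand" items and about 30 million users.

```python
from collections import defaultdict, deque

MAX_TIMELINE_ITEMS = 3  # tiny cap for the demo; the real cap is not stated


class Timelines:
    """Fan-out on write: each post is copied into every follower's timeline."""

    def __init__(self, max_items=MAX_TIMELINE_ITEMS):
        self.followers = defaultdict(set)
        # deque(maxlen=...) drops the oldest entry when a new one is pushed,
        # which stands in for trimming the timeline to a maximum length.
        self.timelines = defaultdict(lambda: deque(maxlen=max_items))

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def fan_out(self, author, post_uri):
        """Insert one reference per follower; returns the number of writes."""
        for follower in self.followers[author]:
            self.timelines[follower].appendleft(post_uri)
        return len(self.followers[author])


def max_timeline_rows(users, items_per_timeline):
    """Upper bound on timeline rows if every user's timeline is full."""
    return users * items_per_timeline
```

With 30 million users and an assumed cap of 4,000 items, max_timeline_rows gives 120 billion rows, which is in line with the "over 100 billion" figure.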

621
00:25:27,965 --> 00:25:31,564
Wasn't the Bluesky account had to like you had to post tens of

622
00:25:31,564 --> 00:25:33,945
thousands of people wait five minutes right to let that propagate?

623
00:25:34,125 --> 00:25:36,995
There was a moment where we had, there was only

624
00:25:36,995 --> 00:25:39,615
one work queue or whatever for dealing with stuff.

625
00:25:39,655 --> 00:25:42,245
And the fan out job was also in that same work queue.

626
00:25:42,334 --> 00:25:44,735
And so it, like you get sharded into a work queue

627
00:25:44,735 --> 00:25:46,074
based off of your DID and all that kind of stuff.

628
00:25:46,084 --> 00:25:47,654
But the bsky.app account would,

629
00:25:47,675 --> 00:25:50,314
It would create a post, and then it would start fanning out

630
00:25:50,314 --> 00:25:53,064
the post, and the creation of the next post in the thread

631
00:25:53,064 --> 00:25:55,324
would get blocked, because it would be waiting for the fanout

632
00:25:55,344 --> 00:25:57,374
to finish before it would create the next post in the thread.

633
00:25:57,374 --> 00:26:00,064
And now, those are two separate queues, so fanout jobs

634
00:26:00,064 --> 00:26:02,014
can happen in the background, and they don't block

635
00:26:02,014 --> 00:26:04,694
the, like, persisting of the actual thread post itself.
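The effect of splitting the queues can be shown with a small simulated clock. The cost units are made up, and the model assumes one worker per queue; it only illustrates why a shared queue delays the next post in a thread.

```python
PERSIST_COST = 1   # hypothetical time units to persist one post
FANOUT_COST = 100  # hypothetical time units to fan one post out


def single_queue(num_posts):
    """One shared queue: each fan-out blocks the next post's persistence."""
    clock = 0
    persisted_at = []
    for _ in range(num_posts):
        clock += PERSIST_COST
        persisted_at.append(clock)
        clock += FANOUT_COST  # the next post has to wait for this
    return persisted_at


def split_queues(num_posts):
    """Separate queues: persistence never waits on fan-out."""
    persist_clock = 0
    fanout_clock = 0
    persisted_at, fanned_out_at = [], []
    for _ in range(num_posts):
        persist_clock += PERSIST_COST
        persisted_at.append(persist_clock)
        # A fan-out job starts once its post exists and the fan-out
        # worker is free, whichever comes later.
        fanout_clock = max(fanout_clock, persist_clock) + FANOUT_COST
        fanned_out_at.append(fanout_clock)
    return persisted_at, fanned_out_at
```

In this model a three-post thread is persisted at times 1, 102, 203 with a shared queue, and at 1, 2, 3 with separate queues; the fan-out work takes just as long but no longer holds up the thread.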

636
00:26:04,930 --> 00:26:08,960
Okay, do you have a different flow for users that have a bunch of followers

637
00:26:08,960 --> 00:26:11,860
versus users that don't have a bunch because there was a certain point

638
00:26:11,910 --> 00:26:15,269
in like Twitter where they had to re architect for like Justin Bieber

639
00:26:15,760 --> 00:26:18,770
versus a regular person and that's one of my favorite data stories

640
00:26:18,800 --> 00:26:22,379
because it just shows you how scale can just be completely ridiculous.

641
00:26:22,379 --> 00:26:25,270
Like he would get so many followers a day and then when he would

642
00:26:25,290 --> 00:26:29,490
tweet it would like mess up everything and it's just so interesting.

643
00:26:29,540 --> 00:26:30,650
We haven't done that yet.

644
00:26:30,779 --> 00:26:33,560
But that is absolutely, like a hybrid timeline architecture is

645
00:26:33,560 --> 00:26:36,919
absolutely probably where we'll go as we get bigger and bigger.

646
00:26:36,989 --> 00:26:38,629
Because right now, every time

647
00:26:38,719 --> 00:26:40,320
bsky.app posts a thread, it's getting fanned out

648
00:26:40,320 --> 00:26:42,599
to, I think, 22 million people's timelines.

649
00:26:42,809 --> 00:26:43,599
That's a lot of writes.

650
00:26:43,769 --> 00:26:45,729
And if they post a five post thread, that's like

651
00:26:45,739 --> 00:26:47,570
a hundred mil, over a hundred million writes.
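The arithmetic here is 22 million followers times 5 posts, or 110 million timeline writes. A hybrid timeline avoids that by pushing posts from small accounts at write time and pulling posts from very large accounts at read time. The sketch below is a generic version of that idea with a made-up threshold; the transcript says Bluesky has not built this yet.

```python
from collections import defaultdict

FANOUT_THRESHOLD = 2  # hypothetical follower cutoff; tiny for the demo


def fanout_writes(followers, posts):
    """Timeline writes needed under pure fan-out on write."""
    return followers * posts


class HybridTimelines:
    def __init__(self, threshold=FANOUT_THRESHOLD):
        self.threshold = threshold
        self.followers = defaultdict(set)       # author -> followers
        self.following = defaultdict(set)       # follower -> authors
        self.pushed = defaultdict(list)         # follower -> [(ts, uri)]
        self.author_posts = defaultdict(list)   # author -> [(ts, uri)]

    def follow(self, follower, author):
        self.followers[author].add(follower)
        self.following[follower].add(author)

    def is_large(self, author):
        return len(self.followers[author]) > self.threshold

    def post(self, author, ts, uri):
        """Returns how many timeline writes this post caused."""
        self.author_posts[author].append((ts, uri))
        if self.is_large(author):
            return 0  # large accounts are merged in at read time instead
        for follower in self.followers[author]:
            self.pushed[follower].append((ts, uri))
        return len(self.followers[author])

    def timeline(self, user):
        items = set(self.pushed[user])
        for author in self.following[user]:
            if self.is_large(author):
                items.update(self.author_posts[author])
        # Newest first; the set removes duplicates if an account's posts
        # were both pushed earlier and pulled now.
        return [uri for _, uri in sorted(items, reverse=True)]
```

The trade is fewer writes for large accounts in exchange for more work on every timeline read.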

652
00:26:47,659 --> 00:26:49,799
The, the guy who wrote Designing

653
00:26:49,799 --> 00:26:50,869
Data-Intensive Applications

654
00:26:50,870 --> 00:26:52,399
is on Bluesky.

655
00:26:53,354 --> 00:26:53,905
Yes.

656
00:26:54,175 --> 00:26:59,614
And he's so rad and nice and that, dude, that's my favorite book.

657
00:26:59,695 --> 00:26:59,754
It is.

658
00:27:00,725 --> 00:27:01,364
The boar book.

659
00:27:01,364 --> 00:27:02,144
But when I found

660
00:27:02,304 --> 00:27:04,395
him, I was like, Oh my God, you're real.

661
00:27:05,455 --> 00:27:05,764
Yeah.

662
00:27:06,455 --> 00:27:10,205
Martin is actually a technical advisor of Bluesky.

663
00:27:10,464 --> 00:27:10,484
To

664
00:27:10,485 --> 00:27:13,485
be like, so smart, you know, like you'd think that he would

665
00:27:13,495 --> 00:27:15,844
be like, Oh, I'm too smart and I won't talk to people.

666
00:27:15,844 --> 00:27:16,834
And he's so nice.

667
00:27:17,389 --> 00:27:18,139
He's a teacher.

668
00:27:18,139 --> 00:27:19,980
I feel like he gets a lot of human interaction.

669
00:27:19,990 --> 00:27:22,490
He's not like locked in a cave doing like research.

670
00:27:22,540 --> 00:27:24,679
So I think he ends up interacting with humans a

671
00:27:24,679 --> 00:27:27,459
lot more than some, uh, some CS researchers do.

672
00:27:27,629 --> 00:27:30,050
Also, I think the way that that book is written, you can

673
00:27:30,050 --> 00:27:32,820
almost tell that he must have taught something because it's.

674
00:27:33,095 --> 00:27:37,514
Much more digestible than a lot of just dense, horrible data.

675
00:27:38,695 --> 00:27:39,475
Martin is great.

676
00:27:39,514 --> 00:27:41,575
We meet with him fairly regularly to just

677
00:27:41,584 --> 00:27:43,135
talk about like issues that we're having.

678
00:27:43,135 --> 00:27:43,325
I'm a

679
00:27:43,325 --> 00:27:43,794
fan.

680
00:27:43,794 --> 00:27:44,244
Okay.

681
00:27:44,314 --> 00:27:46,685
Tell him I'm a fan girl over all of his data books.

682
00:27:46,685 --> 00:27:47,934
And that's my favorite data book.

683
00:27:47,935 --> 00:27:49,115
And I talk about it way too much.

684
00:27:49,115 --> 00:27:51,854
And people are probably so tired of me bringing up that one book.

685
00:27:51,935 --> 00:27:52,845
One of my favorite.

686
00:27:52,949 --> 00:27:54,550
We have, we do have a lot of like internal memes.

687
00:27:54,570 --> 00:27:56,169
I think we've shared a couple of them on the network.

688
00:27:56,169 --> 00:27:57,479
We do have memes

689
00:27:58,320 --> 00:28:00,719
and we need, like, plushie pics, Jaz.

690
00:28:00,749 --> 00:28:01,639
I just blue skies.

691
00:28:01,639 --> 00:28:01,870
Yeah.

692
00:28:02,009 --> 00:28:02,409
You're right.

693
00:28:03,759 --> 00:28:06,059
Plushie pics

694
00:28:06,060 --> 00:28:09,349
and, and Designing Data-Intensive Applications memes. Martin used to come to us

695
00:28:09,349 --> 00:28:11,999
with like, we'd, we'd go to him and we'd ask like all these questions and he'd

696
00:28:11,999 --> 00:28:14,540
give us like, Oh yeah, here's a great, here's a great way to solve that problem.

697
00:28:14,540 --> 00:28:16,300
And nowadays we go to him and every time we

698
00:28:16,310 --> 00:28:17,860
like, we're like, Hey, we have this problem.

699
00:28:18,290 --> 00:28:19,870
And he'd be like, ah, that's a tough one.

700
00:28:20,209 --> 00:28:20,510
And it's like,

701
00:28:21,080 --> 00:28:23,080
we're getting out of the realm of easy, of

702
00:28:23,080 --> 00:28:25,030
easy answers that are like well explored.

703
00:28:25,030 --> 00:28:27,419
And we're into the like, yep, that's, that's a challenge at scale.

704
00:28:27,590 --> 00:28:29,709
Can I come be a free technical consultant

705
00:28:29,709 --> 00:28:31,509
just so I can talk to Martin and get memes?

706
00:28:31,579 --> 00:28:33,389
Like, can I be paid in memes?

707
00:28:33,689 --> 00:28:35,229
You can talk to Martin on the network, and I think

708
00:28:35,229 --> 00:28:37,379
Martin also goes to a good number of conferences as well.

709
00:28:37,800 --> 00:28:39,709
You could, you could end up at a I'm gonna not stalk him in

710
00:28:39,709 --> 00:28:42,879
a creepy way, but in a very nice, professional way.

711
00:28:42,879 --> 00:28:43,389
Yeah.

712
00:28:43,459 --> 00:28:44,986
Get him to come to one

713
00:28:44,986 --> 00:28:46,686
of the concerts we just did.

714
00:28:46,686 --> 00:28:49,919
With that general overview, I remember back when you were scaling

715
00:28:49,920 --> 00:28:52,629
a million users a day, you had to go rack some servers, right?

716
00:28:52,630 --> 00:28:53,920
Like, there was a point where you were like, hey,

717
00:28:53,920 --> 00:28:57,410
we need to scale up, and, and it's not in the PDS.

718
00:28:57,910 --> 00:29:00,850
with, with millions of users coming, even though that's growing, you

719
00:29:00,850 --> 00:29:04,330
can still scale that the, the bare metal, because those are rentals.

720
00:29:04,330 --> 00:29:06,450
That's a, it's a provider that say, give me another one.

721
00:29:06,700 --> 00:29:07,320
We'll provision it.

722
00:29:07,320 --> 00:29:08,160
It'll come into the network.

723
00:29:08,349 --> 00:29:10,549
And then on the other side of like some cloud services running,

724
00:29:10,549 --> 00:29:12,440
like those skip, but like somewhere in there you had to rack.

725
00:29:12,460 --> 00:29:14,270
And that's mostly, you said for the fire hose

726
00:29:14,309 --> 00:29:17,830
for that kind of global index, so that the data service that does

727
00:29:17,830 --> 00:29:20,420
all the querying to the database, the database cluster itself.

728
00:29:20,550 --> 00:29:24,000
And a couple of other, like the discover feed and stuff that run on prem.

729
00:29:24,040 --> 00:29:27,870
So those all require machines to run on and we don't have a

730
00:29:27,870 --> 00:29:31,810
magic, uh, like I can't change a number in a Pulumi deploy and

731
00:29:31,810 --> 00:29:34,990
then magically have more hardware available in the data center.

732
00:29:35,249 --> 00:29:36,269
It's a whole process.

733
00:29:36,270 --> 00:29:37,959
You've got to go through the acquisition process.

734
00:29:37,960 --> 00:29:38,970
You've got to find a vendor.

735
00:29:38,970 --> 00:29:40,010
You've got to talk to a vendor.

736
00:29:40,010 --> 00:29:42,600
You've got to, you know, spend some money on some new machines.

737
00:29:42,600 --> 00:29:43,160
They get shipped.

738
00:29:43,160 --> 00:29:44,210
You have to go to the data center.

739
00:29:44,210 --> 00:29:45,000
You have to receive them.

740
00:29:45,000 --> 00:29:48,340
You have to unbox everything, rack it, hook it up, network it, burn

741
00:29:48,340 --> 00:29:50,760
it in, provision it, and then you can figure out, all right, how are

742
00:29:50,760 --> 00:29:53,650
we going to like migrate the workload to this, to this new hardware?

743
00:29:53,750 --> 00:29:57,400
That's why I think that like you have to have that happy medium between cloud

744
00:29:57,750 --> 00:30:02,090
and, like, on prem. Like, everybody acts like either is some magical solution and

745
00:30:02,090 --> 00:30:05,719
I'm just like we're going to pretend that we forgot the leeway and all the

746
00:30:05,720 --> 00:30:09,440
stuff you have to do to get something on prem like it is cheaper and it does

747
00:30:09,440 --> 00:30:12,580
need to be used a lot more because putting everything in the cloud is just

748
00:30:12,580 --> 00:30:17,080
not cost efficient but I think people forgot how long it takes to get stuff

749
00:30:17,190 --> 00:30:20,750
on prem and then the fact that you have to go fix that when it burns out.

750
00:30:20,910 --> 00:30:22,560
There's a lot of convenience that comes

751
00:30:22,560 --> 00:30:24,690
with cloud, but you definitely pay for it.

752
00:30:24,930 --> 00:30:26,890
And you don't necessarily pay for it in the

753
00:30:26,890 --> 00:30:28,530
things that you expect to pay for it in.

754
00:30:28,540 --> 00:30:30,730
Like, you don't expect, ah, you're gonna charge a markup

755
00:30:30,760 --> 00:30:34,430
on this EC2 instance based off of how powerful it is.

756
00:30:34,430 --> 00:30:38,089
You end up paying most of it in, like, kind of hidden places, like, you

757
00:30:38,089 --> 00:30:42,369
know, in egress fees or in, like, WAF requests or something like that.

758
00:30:42,370 --> 00:30:44,019
You're also kind of beholden to

759
00:30:44,019 --> 00:30:46,380
them and their decision making, you know.

760
00:30:46,660 --> 00:30:49,210
Yeah, a lot of a lot of cloud providers haven't really passed

761
00:30:49,210 --> 00:30:53,410
down cost savings of like more efficient hardware to consumers.

762
00:30:53,480 --> 00:30:56,520
So like the cost of an EC2 instance per like vCore hasn't

763
00:30:56,530 --> 00:31:00,690
really, or vCPU hasn't really gone down much over time.

764
00:31:00,720 --> 00:31:03,659
And the number of vCPUs you can pack into a single machine

765
00:31:03,659 --> 00:31:06,745
and the amount of compute you get per watt

766
00:31:06,895 --> 00:31:10,195
in a data center has had insane leaps in the past 10 years.

767
00:31:10,255 --> 00:31:12,675
I'm really interested to see where that goes, right?

768
00:31:12,705 --> 00:31:15,285
Like eventually they're going to have to figure out

769
00:31:15,335 --> 00:31:18,154
how to compete with on prem, you know what I mean?

770
00:31:18,585 --> 00:31:22,025
And it's just interesting the way that they've made cuts in certain areas.

771
00:31:22,025 --> 00:31:24,915
And I'm like, bro, you're making cuts for the most expensive stuff that you

772
00:31:24,915 --> 00:31:28,085
run, but not the stuff that you get for the cheapest, which is very interesting.

773
00:31:28,105 --> 00:31:29,635
I mean, almost all of them have doubled

774
00:31:29,635 --> 00:31:31,794
down on their investments in custom silicon.

775
00:31:31,940 --> 00:31:34,210
And so they all say like, Oh, we're going to, the,

776
00:31:34,220 --> 00:31:37,760
the AWS play is Graviton is more efficient per watt.

777
00:31:37,760 --> 00:31:38,820
And so you should go to Graviton.

778
00:31:38,830 --> 00:31:39,790
You should use our

779
00:31:39,790 --> 00:31:44,540
About like the bad place, but Graviton is kind of fire.

780
00:31:44,650 --> 00:31:50,029
Now, is that a good excuse to ha for what we're talking about?

781
00:31:50,049 --> 00:31:52,850
No, but I think that is going to be one of the best

782
00:31:52,850 --> 00:31:55,650
things that has come out of the bad place in a long time.

783
00:31:55,870 --> 00:31:57,550
So for you, looking back on

784
00:31:58,000 --> 00:32:02,060
these separate places that pieces of infrastructure run, and putting

785
00:32:02,060 --> 00:32:03,940
things on prem and having to go through that we have to scale

786
00:32:03,940 --> 00:32:06,390
this thing up was, do you think that was still a good decision?

787
00:32:07,020 --> 00:32:07,660
Absolutely.

788
00:32:07,690 --> 00:32:08,120
Yeah.

789
00:32:08,300 --> 00:32:11,140
I mean, the way that we approached it was, hey, let's build

790
00:32:11,140 --> 00:32:16,190
out, let's way overbuild our on prem solution and then we'll be

791
00:32:16,720 --> 00:32:20,200
ready for, you know, insane overheads if something crazy happens.

792
00:32:20,270 --> 00:32:22,500
And then even now, we recently

793
00:32:22,500 --> 00:32:24,960
finished an expansion in our on prem POPs.

794
00:32:25,020 --> 00:32:27,729
And even that was like a, it was a preemptive measure.

795
00:32:27,730 --> 00:32:29,989
It was a cool, we're not near the limits of the hardware

796
00:32:29,989 --> 00:32:32,990
we have right now, but if we want to keep really healthy

797
00:32:33,019 --> 00:32:35,509
overhead in our POPs, we should probably do some expansion.

798
00:32:35,589 --> 00:32:38,380
And so a lot of this comes from like planning for a couple orders

799
00:32:38,380 --> 00:32:41,060
of magnitude, and then making sure that in the time it would

800
00:32:41,060 --> 00:32:43,890
take to grow by a couple orders of magnitude, you can

801
00:32:44,200 --> 00:32:45,960
get hardware where it needs to be in time.
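The planning rule described here, comparing how fast load could grow against how long hardware takes to arrive, can be sketched as a small calculation. This is a minimal illustration only: the doubling time and lead time are made-up inputs, and the function names are hypothetical, not anything from Bluesky's tooling.

```python
import math


def days_to_grow(factor, doubling_days):
    """Days for load to multiply by `factor` if it doubles every `doubling_days`."""
    return doubling_days * math.log2(factor)


def can_expand_in_time(factor, doubling_days, hardware_lead_days):
    """True if new hardware can arrive before load has grown by `factor`."""
    return hardware_lead_days <= days_to_grow(factor, doubling_days)


if __name__ == "__main__":
    # Assumed inputs: plan for 100x growth, load doubling every 30 days,
    # and a 90 day lead time to get servers racked.
    print(round(days_to_grow(100, 30), 1))
    print(can_expand_in_time(100, 30, 90))
```

If the second call returns False for your own numbers, the overhead has to be bought before the growth starts rather than during it.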

802
00:32:46,100 --> 00:32:49,220
Whatever you're doing, the planning is very well placed.

803
00:32:49,390 --> 00:32:50,230
You're doing a great job.

804
00:32:50,450 --> 00:32:52,820
A lot of it was kind of scarily instinct.

805
00:32:52,850 --> 00:32:54,820
Like, the most recent expansion that we did, I

806
00:32:54,829 --> 00:32:58,250
was like, this was after Brazil, you know, we saw

807
00:32:58,289 --> 00:33:02,699
Which is such a weird, like I almost wonder if y'all should just pay,

808
00:33:02,710 --> 00:33:05,680
like, Elon at this point, or like, send him a gift, because like,

809
00:33:05,710 --> 00:33:08,260
every time like, there for a while, every time he said something

810
00:33:08,260 --> 00:33:11,330
stupid or did something stupid, it would just be like, Spike?

811
00:33:12,700 --> 00:33:16,025
Like, you could tell, like, you'd just be like, what did Elon do today?

812
00:33:16,025 --> 00:33:18,005
Cause there's so many new people, you can

813
00:33:18,065 --> 00:33:18,745
see them on graphs.

814
00:33:19,045 --> 00:33:21,425
They're pretty noticeable and pretty sharp on the graph, make

815
00:33:21,425 --> 00:33:22,584
a meme of the graph.

816
00:33:22,584 --> 00:33:22,884
Right.

817
00:33:22,925 --> 00:33:24,385
And then put his head on each.

818
00:33:27,285 --> 00:33:29,995
You're like, we'd mark our, all of our graphs with our deploys.

819
00:33:30,055 --> 00:33:32,864
And instead you have all these marks of like news articles, a

820
00:33:32,865 --> 00:33:35,055
little blip of like the dumb thing he did that

821
00:33:35,055 --> 00:33:37,595
day, you know, like talked crap to Brazil.

822
00:33:39,375 --> 00:33:43,365
So much of our planning and everything is like, we, we don't have control

823
00:33:43,365 --> 00:33:46,464
over how many people are going to decide to use our website today.

824
00:33:46,464 --> 00:33:46,614
There

825
00:33:46,615 --> 00:33:50,124
was like a rumor at Tesla that when his girlfriend changed

826
00:33:50,124 --> 00:33:52,945
the color of her hair, or that like if they had a fight, like

827
00:33:52,945 --> 00:33:56,395
if they saw them walk out, the handlers of the Elon would

828
00:33:56,395 --> 00:33:59,735
like panic and then figure out how to like make it the least

829
00:34:00,290 --> 00:34:02,340
wild outcome of that.

830
00:34:02,370 --> 00:34:06,760
Like, can you imagine this dude is like CEO of a company,

831
00:34:06,810 --> 00:34:10,450
like, and they have handlers because they're worried about,

832
00:34:10,450 --> 00:34:14,220
like, what will result after this argument or hair color?

833
00:34:14,220 --> 00:34:15,809
Like, can you imagine that environment?

834
00:34:15,809 --> 00:34:16,710
And you know what I mean?

835
00:34:16,839 --> 00:34:18,740
And now it's affecting a whole nother company.

836
00:34:18,740 --> 00:34:20,970
And now we're just like, let's try it with the country.

837
00:34:20,970 --> 00:34:22,619
It's going to be great.

838
00:34:22,620 --> 00:34:26,540
I mean, well, when, when all of Brazil loses access to Twitter, like overnight.

839
00:34:26,610 --> 00:34:28,140
That was an insane moment.

840
00:34:28,160 --> 00:34:32,560
That was like, we were, I think some numbers I can talk about, like, which

841
00:34:32,560 --> 00:34:36,349
are fun numbers, is like total request throughput across the PDSes. So like,

842
00:34:36,350 --> 00:34:39,859
that's kind of our big "how much load is going on right now" number.

843
00:34:40,100 --> 00:34:42,790
And before Brazil, we were doing like three and

844
00:34:42,790 --> 00:34:46,319
a half K, 4,000 requests a second at peak each day.

845
00:34:46,510 --> 00:34:48,829
And then Brazil happened and we shot to

846
00:34:48,830 --> 00:34:52,390
25,000 requests a second across all of the PDSes.

847
00:34:52,460 --> 00:34:53,080
And then,

848
00:34:53,655 --> 00:34:56,015
in November, we hit our new kind of record,

849
00:34:56,015 --> 00:34:58,115
which was like 50,000 requests a second.

850
00:34:58,195 --> 00:35:02,245
So we're still like way above Brazil's peak on like a daily basis now.
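For a sense of scale, the round numbers quoted above can be turned into growth multiples. The figures are taken as spoken in the conversation, so treat the output as rough.

```python
# Peak requests per second across the PDSes, as quoted in the conversation.
before_low, before_high = 3_500, 4_000
brazil_peak = 25_000
november_peak = 50_000


def growth(new, old):
    """How many times larger `new` is than `old`."""
    return new / old


if __name__ == "__main__":
    print(round(growth(brazil_peak, before_high), 2))  # 6.25
    print(round(growth(brazil_peak, before_low), 2))   # 7.14
    print(growth(november_peak, brazil_peak))          # 2.0
```

On these numbers the Brazil spike was roughly a 6x to 7x jump in peak request rate, and the November record doubled that peak again.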

851
00:35:02,305 --> 00:35:04,544
But it is insane to me that that was, that

852
00:35:04,545 --> 00:35:07,484
was like a 10x event for us, which is crazy.

853
00:35:07,684 --> 00:35:10,755
And now that has become normal in like a few months.

854
00:35:10,765 --> 00:35:12,395
It's like, yeah, that's just what we deal with every day now.

855
00:35:12,395 --> 00:35:13,565
We were running around like chickens

856
00:35:13,565 --> 00:35:15,185
with our heads cut off when Brazil happened.

857
00:35:15,455 --> 00:35:17,695
And then November came along and that was like

858
00:35:18,140 --> 00:35:19,830
an even worse version of it, it was like

859
00:35:19,880 --> 00:35:21,890
four Brazils or something crazy like that.

860
00:35:21,970 --> 00:35:24,000
After Brazil happened, we were like, alright, how

861
00:35:24,000 --> 00:35:26,260
the heck are we going to plan for a 10x of this?

862
00:35:26,360 --> 00:35:28,250
But we, we did everything we could to like, alright,

863
00:35:28,250 --> 00:35:30,579
can we prepare for a 10x of what that just was?

864
00:35:30,599 --> 00:35:32,419
And now November happened and we're like, alright, how do

865
00:35:32,420 --> 00:35:34,680
we prepare, how do we prepare for a 10x of what that was?

866
00:35:34,869 --> 00:35:36,859
Okay, like low key though, did you think

867
00:35:36,859 --> 00:35:38,650
like it was going to hit the fan in November?

868
00:35:38,880 --> 00:35:40,589
Or like, did you, like, did you anticipate it at all?

869
00:35:40,589 --> 00:35:40,990
Yeah, we were

870
00:35:41,250 --> 00:35:41,510
prepared.

871
00:35:41,550 --> 00:35:43,640
I don't think anybody expected that we were

872
00:35:43,640 --> 00:35:47,459
gonna like triple our user base in like three weeks.

873
00:35:47,630 --> 00:35:51,090
We had 10 million users leading up to the election roughly, right?

874
00:35:51,220 --> 00:35:54,049
And we're, a few months after that, like, I think we hit

875
00:35:54,050 --> 00:35:58,020
25 million users within like a month of the election.

876
00:35:58,395 --> 00:36:02,285
For the small issues you had, they were very well handled.

877
00:36:02,285 --> 00:36:03,655
Like y'all were

878
00:36:03,695 --> 00:36:06,665
just, I would, that's actually, so I, we asked questions or I asked

879
00:36:06,665 --> 00:36:09,585
him questions on Bluesky, like, Hey, anyone have questions to ask?

880
00:36:09,585 --> 00:36:11,065
And one of them was about incident management.

881
00:36:11,505 --> 00:36:12,495
How does that, how does that work?

882
00:36:12,495 --> 00:36:14,785
How do you learn from some of those incidents you've been having?

883
00:36:14,794 --> 00:36:17,944
Like there's, there's always something going on, um, between

884
00:36:17,945 --> 00:36:21,105
all the planning and, and the other, the normal things

885
00:36:21,105 --> 00:36:23,865
you have to do to like deploy and make software better.

886
00:36:24,055 --> 00:36:25,495
How do you handle those incidents?

887
00:36:30,715 --> 00:36:32,765
A lot of metrics, a lot of dashboards.

888
00:36:32,895 --> 00:36:35,315
That's, that's kind of the most important thing is like, if

889
00:36:35,315 --> 00:36:38,145
you are not measuring something, it is very hard to improve it.

890
00:36:38,235 --> 00:36:39,874
And so we lean really heavily into... Can you say that

891
00:36:39,874 --> 00:36:41,544
louder for the people in the back, Jaz?

892
00:36:41,675 --> 00:36:45,265
Observability and monitoring is important, engineers.

893
00:36:46,885 --> 00:36:50,105
If you, if you can't measure it, you, you can't meaningfully improve it.

894
00:36:50,255 --> 00:36:51,565
Or at least you can't prove that you improved it.

895
00:36:51,735 --> 00:36:54,545
So when things were going crazy in November,

896
00:36:54,575 --> 00:36:57,174
we had what I call like the 11 days from hell.

897
00:36:57,495 --> 00:37:03,615
Which was 11 days of 16 hours a day in a situation room from like the moment you

898
00:37:03,615 --> 00:37:07,695
wake up to the moment you go to bed, and then, like, wake up, check some graphs in

899
00:37:07,695 --> 00:37:11,465
bed, line is still going up, get ready as quickly as you can, get downstairs,

900
00:37:11,465 --> 00:37:13,765
log into the situation room, and figure out what's on fire this morning.

901
00:37:13,875 --> 00:37:14,935
Tell me there was coffee.

902
00:37:15,065 --> 00:37:16,865
I drink Monster, but yeah, there was, there's,

903
00:37:16,905 --> 00:37:18,705
there's a lot of, uh, a lot of Dang, see, that's why

904
00:37:18,705 --> 00:37:20,975
that, that's why their infrastructure never goes down.

905
00:37:21,155 --> 00:37:24,204
That was, there were so many, like, so many different components hit,

906
00:37:24,270 --> 00:37:28,610
I guess you would call them, like, early scaling limits, not that they were at the

907
00:37:28,610 --> 00:37:31,600
maximum of their design, but that they've never been pushed that hard before.

908
00:37:31,600 --> 00:37:34,390
And so we were shaking out bugs all over the place, like scaling,

909
00:37:34,449 --> 00:37:37,210
scaling issues or like some concurrency bug or something like that

910
00:37:37,210 --> 00:37:39,800
that was falling out from so many different systems all at once.

911
00:37:39,819 --> 00:37:40,359
Because when you

912
00:37:40,795 --> 00:37:43,945
drive a truck over a bridge, and if the truck is really heavy,

913
00:37:44,165 --> 00:37:46,195
and it's like too heavy, and you have this really old, like,

914
00:37:46,195 --> 00:37:49,965
bolt in the bridge, the bolt could, like, get broken, or shear,

915
00:37:49,965 --> 00:37:52,595
or fall off, and that reduces some of the stress of the bridge.

916
00:37:52,765 --> 00:37:54,954
Like, it starts swaying, and then a bolt

917
00:37:54,954 --> 00:37:56,865
fires off, and then it stops swaying as much.

918
00:37:56,884 --> 00:37:58,504
And that kind of, that's kind of how you, like,

919
00:37:58,505 --> 00:38:01,115
release tension in a bridge when it's under stress.

920
00:38:01,405 --> 00:38:06,215
But if you land, like, an AC-130 on the bridge, and you're, like, taking, like,

921
00:38:06,215 --> 00:38:11,355
a giant jumbo jet, or, like, some kind of massive 747 landed on the bridge all

922
00:38:11,355 --> 00:38:15,475
at once and a bunch of bolts pop loose all at once and you're like Oh, crap.

923
00:38:15,535 --> 00:38:17,315
Which one do we go fix first?

924
00:38:17,335 --> 00:38:20,785
Which one is like structurally important to the success of the bridge?

925
00:38:20,875 --> 00:38:23,624
So when you scale insanely fast in a really short period of

926
00:38:23,625 --> 00:38:26,835
time, you have a lot of systems that hit these early limits

927
00:38:26,845 --> 00:38:30,195
or that, that shoot these bugs out like bolts off of a bridge.

928
00:38:30,214 --> 00:38:33,274
And you have to figure out through your metrics, figure out,

929
00:38:33,275 --> 00:38:36,195
okay, which services are okay, which services are not okay.

930
00:38:36,475 --> 00:38:38,545
And then dig into the services that are not okay and

931
00:38:38,545 --> 00:38:40,455
figure out, all right, where are we running into problems?

932
00:38:40,785 --> 00:38:44,405
One of the craziest issues we had was like everybody's handles suddenly

933
00:38:44,405 --> 00:38:48,949
started becoming invalid because we ran into the limits of public DNS resolvers.

934
00:38:49,210 --> 00:38:52,670
We were like hitting Google Public DNS Resolver and

935
00:38:52,680 --> 00:38:55,130
Cloudflare's Public DNS Resolver so heavily they started

936
00:38:55,130 --> 00:38:57,699
rate limiting us and we just couldn't do DNS queries anymore.
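One common way to reduce pressure on upstream resolvers is to cache answers locally for a short time. The sketch below is a generic illustration of that idea, not a description of what Bluesky did; the conversation does not say how the issue was fixed. The class name, the `resolve` callable, and the TTL value are all hypothetical.

```python
import time


class CachingResolver:
    """Wrap a lookup function and reuse answers until they expire."""

    def __init__(self, resolve, ttl_seconds=300.0, clock=time.monotonic):
        self._resolve = resolve      # hypothetical function: name -> answer
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}             # name -> (expires_at, answer)
        self.upstream_queries = 0    # how often we actually hit the resolver

    def lookup(self, name):
        now = self._clock()
        hit = self._cache.get(name)
        if hit is not None and hit[0] > now:
            return hit[1]            # still fresh, no upstream query
        self.upstream_queries += 1
        answer = self._resolve(name)
        self._cache[name] = (now + self._ttl, answer)
        return answer
```

With a cache like this, repeated lookups of the same name inside the TTL cost one upstream query instead of one per request, which is the kind of reduction that keeps a client under a public resolver's rate limit.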

937
00:38:57,860 --> 00:38:58,929
Okay, can we just talk though?

938
00:38:58,930 --> 00:39:00,050
Like, why is it always DNS?

939
00:39:00,050 --> 00:39:04,979
DNS finds new ways to like, just ruin people's lives.

940
00:39:04,979 --> 00:39:07,159
Like, it wakes up in the morning and it's like, how

941
00:39:07,159 --> 00:39:10,039
can I be difficult in a way that they'll never expect?

942
00:39:10,039 --> 00:39:12,140
Like, it's never something that's easily figured out.

943
00:39:12,140 --> 00:39:14,540
You gotta go down the whole rabbit hole, figure

944
00:39:14,540 --> 00:39:16,710
out some way that you've never heard of before.

945
00:39:17,245 --> 00:39:20,305
Justin's problem also somehow tied to DNS.

946
00:39:20,305 --> 00:39:24,845
Like, it's always every time and it's always, it's never like a normal error

947
00:39:24,845 --> 00:39:28,425
that like makes you think, okay, it's this, it's always something ridiculous.

948
00:39:28,425 --> 00:39:29,934
That's just this rabbit hole.

949
00:39:30,215 --> 00:39:34,294
Every error message in every application for every log

950
00:39:34,294 --> 00:39:36,854
everywhere should probably just end with, it might be DNS.

951
00:39:36,855 --> 00:39:37,465
No, seriously.

952
00:39:37,655 --> 00:39:39,375
It should be like, go hit this line.

953
00:39:39,705 --> 00:39:41,155
This thing's mad at you.

954
00:39:41,185 --> 00:39:44,065
But also if this fails, is it DNS?

955
00:39:44,255 --> 00:39:44,945
Segfault.

956
00:39:44,965 --> 00:39:45,795
Maybe it's DNS.

957
00:39:45,855 --> 00:39:46,175
I don't know.

958
00:39:46,225 --> 00:39:49,185
And then Kubernetes was like, hey, what if we put DNS everywhere?

959
00:39:49,275 --> 00:39:51,855
What if we wove DNS through the entire stack?

960
00:39:51,955 --> 00:39:53,575
Actually, that's a good question because you said

961
00:39:53,575 --> 00:39:55,205
you were doing Kubernetes at previous startups.

962
00:39:55,224 --> 00:39:57,075
You don't have any Kubernetes in the stack now, right?

963
00:39:57,144 --> 00:39:57,455
We have

964
00:39:57,455 --> 00:39:58,575
no Kubernetes. It's all

965
00:39:58,905 --> 00:40:00,995
VMs and it's still containerized.

966
00:40:01,310 --> 00:40:02,130
It's containerized.

967
00:40:02,140 --> 00:40:04,890
It is containerized, but it is not a lot of VMs,

968
00:40:04,930 --> 00:40:07,490
even honestly, it's just like SSH into the box.

969
00:40:07,500 --> 00:40:09,720
It's kind of running, you know, Linux right on top

970
00:40:09,720 --> 00:40:11,760
of the bare metal and then it's running Docker.

971
00:40:11,870 --> 00:40:13,490
So no traditional orchestrator.

972
00:40:13,580 --> 00:40:15,749
No, no traditional orchestrator at the moment.

973
00:40:15,910 --> 00:40:17,619
It's like Ansible jobs, Docker run.

974
00:40:18,420 --> 00:40:22,630
Yeah, Ansible jobs, Docker Compose and a couple of tweaks to make things faster.

975
00:40:22,640 --> 00:40:25,390
We're not using like the Docker logging because Docker logging

976
00:40:25,390 --> 00:40:28,460
is not very good if you have really really high throughput logs.

977
00:40:28,595 --> 00:40:32,075
So we're using svlogd, which is in runit.

978
00:40:32,175 --> 00:40:35,765
And so svlogd lets you just log to a directory and it kind of

979
00:40:35,765 --> 00:40:38,814
cycles through files and then you can use like Promtail to

980
00:40:39,090 --> 00:40:39,900
scrape those directories.

981
00:40:39,900 --> 00:40:44,120
So every container gets its own logging directory and then it just pipes

982
00:40:44,120 --> 00:40:47,349
it to svlogd and svlogd is really lightweight and it handles all the log

983
00:40:47,350 --> 00:40:50,400
management without having to do like standard out piping or anything like that.
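The "log to a directory and cycle through files" pattern can be sketched in a few lines. This is a toy model in the spirit of what is described, not svlogd itself: the class name, the rotated file naming, and the default limits are assumptions made for illustration.

```python
import os
import tempfile


class DirLogger:
    """Append lines to <log_dir>/current and rotate once it passes max_bytes."""

    def __init__(self, log_dir, max_bytes=1_000_000, keep=10):
        self.log_dir = log_dir
        self.max_bytes = max_bytes
        self.keep = keep  # number of rotated files to retain (assumed >= 1)
        self.seq = 0      # hypothetical naming scheme for rotated files
        os.makedirs(log_dir, exist_ok=True)
        self.current = os.path.join(log_dir, "current")

    def write(self, line):
        with open(self.current, "a", encoding="utf-8") as f:
            f.write(line.rstrip("\n") + "\n")
        if os.path.getsize(self.current) >= self.max_bytes:
            self._rotate()

    def _rotate(self):
        # Move the full file aside, then delete the oldest rotated files.
        self.seq += 1
        rotated = os.path.join(self.log_dir, f"rotated.{self.seq:08d}")
        os.rename(self.current, rotated)
        old = sorted(n for n in os.listdir(self.log_dir) if n.startswith("rotated."))
        for name in old[:-self.keep]:
            os.remove(os.path.join(self.log_dir, name))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = DirLogger(os.path.join(d, "my-container"), max_bytes=20, keep=2)
        for _ in range(9):
            log.write("x" * 10)
        print(sorted(os.listdir(log.log_dir)))
```

Because each container writes to its own directory with bounded size, a scraper only needs to watch directories, and disk usage per container stays capped.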

984
00:40:50,469 --> 00:40:53,930
Every user is a website, a SQLite database, and a svlogd.

985
00:40:54,600 --> 00:40:55,440
Yeah, exactly.

986
00:40:55,620 --> 00:40:56,010
Exactly.

987
00:40:56,310 --> 00:40:57,280
It's a whole stack right there.

988
00:40:57,910 --> 00:40:59,290
It works surprisingly well.

989
00:40:59,550 --> 00:41:01,920
Uh, you also want to make sure that you're not like doing user

990
00:41:01,920 --> 00:41:04,890
space Docker NAT, because user space Docker NAT is how you

991
00:41:04,970 --> 00:41:07,590
make your high throughput services be very low throughput.

992
00:41:07,659 --> 00:41:09,789
Well, you're not running everything like network host though, right?

993
00:41:10,029 --> 00:41:10,480
Uh, no.

994
00:41:10,489 --> 00:41:13,560
I mean, you can, you can run kernel level NAT, which is,

995
00:41:13,569 --> 00:41:17,320
which is a lot less, uh, messy than user level NAT for Docker.

996
00:41:17,470 --> 00:41:19,830
It's not CPU intensive, I guess I would say.

997
00:41:20,170 --> 00:41:22,320
Uh, there's less, less packet copying going on.

998
00:41:22,510 --> 00:41:23,930
But that's one of the reasons we didn't want to run

999
00:41:23,930 --> 00:41:26,480
Kubernetes is we've got these really cool bare metal machines.

1000
00:41:26,580 --> 00:41:29,229
We don't want to add so many layers of virtualization on top of them that

1001
00:41:29,960 --> 00:41:32,863
we lose a lot of the, like, benefit of being close to the metal.

1002
00:41:32,863 --> 00:41:33,329
You're gonna hide all that

1003
00:41:33,330 --> 00:41:34,930
performance under abstractions.

1004
00:41:34,930 --> 00:41:35,320
Yeah,

1005
00:41:35,710 --> 00:41:37,200
yeah, exactly, exactly.

1006
00:41:37,220 --> 00:41:39,040
Say goodbye to your, your cache locality.

1007
00:41:39,070 --> 00:41:41,530
Say goodbye to, I don't know, whatever it is you're, you're trying to do

1008
00:41:41,610 --> 00:41:44,710
because your, your container is being preempted because the Kubernetes,

1009
00:41:44,770 --> 00:41:47,080
the Kubelet needs to come in and do something or whatever it might be.

1010
00:41:47,179 --> 00:41:48,670
I mean, you can tune Kubernetes for performance

1011
00:41:48,670 --> 00:41:50,410
and you can run it in a high performance way.

1012
00:41:50,450 --> 00:41:51,600
We don't have the expertise to do that.

1013
00:41:51,790 --> 00:41:54,540
But what we, we do know is, yeah, you can just And

1014
00:41:54,890 --> 00:41:57,140
a lot of this, I mean, a lot of the orchestrators are

1015
00:41:57,140 --> 00:41:59,670
typically, you have a dynamic infrastructure, right?

1016
00:41:59,670 --> 00:42:01,610
Like you have machines coming and going frequently.

1017
00:42:01,610 --> 00:42:04,649
You need to reshuffle things or reallocate things.

1018
00:42:04,649 --> 00:42:06,309
And in a lot of your case, at least half

1019
00:42:06,309 --> 00:42:08,070
of your infrastructure is fairly static.

1020
00:42:08,240 --> 00:42:11,229
It's like we have a bunch of machines over here that are running PDSs,

1021
00:42:11,229 --> 00:42:14,720
a bunch of machines over here running all the AppView and database

1022
00:42:15,085 --> 00:42:18,185
flows and everything, and you can define that. That's a spreadsheet.

1023
00:42:18,235 --> 00:42:19,255
That's not an orchestrator.

1024
00:42:19,335 --> 00:42:21,675
It's all very static and you buy the

1025
00:42:21,675 --> 00:42:23,625
capacity when you buy the machines, right?

1026
00:42:23,695 --> 00:42:25,015
You can use as much or as little of it as you

1027
00:42:25,015 --> 00:42:26,704
want to. You've already paid for it, basically.

1028
00:42:26,915 --> 00:42:29,385
Do you think Bluesky will somehow figure

1029
00:42:29,385 --> 00:42:32,054
a way to incorporate video and images more,

1030
00:42:32,375 --> 00:42:34,694
so that way we don't have to go to any of the bad places?

1031
00:42:35,225 --> 00:42:35,895
I think so.

1032
00:42:35,895 --> 00:42:36,265
Yeah.

1033
00:42:36,295 --> 00:42:38,515
I mean, I think recently we launched video feeds.

1034
00:42:38,525 --> 00:42:42,095
So feeds can describe themselves as like primarily a video feed and

1035
00:42:42,095 --> 00:42:45,374
then they'll go into that kind of video vertical scrolling mode.

1036
00:42:45,425 --> 00:42:47,605
That was like a six day project by the front end team

1037
00:42:47,605 --> 00:42:50,235
that was actually like kind of insane turnaround on that.

1038
00:42:50,284 --> 00:42:54,395
So we have a couple of things where we do very hackathon mindset and,

1039
00:42:54,405 --> 00:42:56,755
and we're like, cool, how quickly can we get something that is like

1040
00:42:57,180 --> 00:42:59,330
of our quality standards shipped to production?

1041
00:42:59,390 --> 00:43:01,370
When you're at a tiny company, you know, you've got

1042
00:43:01,370 --> 00:43:03,910
like 20 something people and you're dealing with tens of

1043
00:43:03,910 --> 00:43:06,259
millions of users, there's a lot of priority juggling.

1044
00:43:06,350 --> 00:43:10,320
And so you've got like stuff that's easy to do and stuff that is important.

1045
00:43:10,490 --> 00:43:12,529
There's stuff that's like fast and easy and stuff that's important.

1046
00:43:12,730 --> 00:43:14,880
And if it's in that quadrant, you kind of just do it

1047
00:43:15,155 --> 00:43:17,065
immediately, drop whatever you're doing, go do that thing.

1048
00:43:17,175 --> 00:43:19,475
And then you have stuff that's like a little bit harder to do

1049
00:43:19,495 --> 00:43:22,015
and it's important, and that's work that you try to schedule.

1050
00:43:22,075 --> 00:43:25,265
And then you have work that is stuff that's like hard to do and unimportant.

1051
00:43:25,445 --> 00:43:27,865
And that's stuff that falls to the, kind of the bottom of your priority list.

1052
00:43:27,865 --> 00:43:30,414
And then there's stuff that is easy to do, but unimportant, and

1053
00:43:30,720 --> 00:43:32,320
if you need extra dopamine and there's nothing on the

1054
00:43:32,320 --> 00:43:34,640
easy important list to do, you gotta do that stuff.

1055
00:43:35,110 --> 00:43:37,010
Speaking of, of possibly important, I'm

1056
00:43:37,010 --> 00:43:38,550
going back to some of the questions here.

1057
00:43:39,170 --> 00:43:41,759
Someone's asking about like expansion outside the U.S. What

1058
00:43:41,760 --> 00:43:44,990
does that look like in your network, which is mostly static?

1059
00:43:44,990 --> 00:43:48,489
Are you going to, are you planning on doing some like, Oh, these

1060
00:43:48,490 --> 00:43:50,930
users really care about data locality or this country does.

1061
00:43:50,940 --> 00:43:53,810
So we have to put the PDSs or the whole stack in

1062
00:43:53,810 --> 00:43:56,090
that environment in their country within the borders.

1063
00:43:56,815 --> 00:44:00,865
I'm not up to date on the legal side of any of that or like

1064
00:44:00,865 --> 00:44:04,345
the regulatory side of that from a just a purely architectural

1065
00:44:04,345 --> 00:44:08,504
standpoint, it should be something doable is like run the PDS in

1066
00:44:08,505 --> 00:44:11,314
another country and then your canonical data lives in that country.

1067
00:44:11,465 --> 00:44:14,885
And then the other side, like if we wanted to run a POP in another country or

1068
00:44:14,885 --> 00:44:17,815
something like that, we could go set it up and move our hardware there.

1069
00:44:18,040 --> 00:44:19,910
Some countries are easier to do that in than others.

1070
00:44:19,980 --> 00:44:22,000
And then the connectivity of that country is also important.

1071
00:44:22,000 --> 00:44:23,800
It's like, cool, can we get a lot of bandwidth cheap?

1072
00:44:23,860 --> 00:44:25,030
Is it going to reach our customers?

1073
00:44:25,090 --> 00:44:28,360
There are a couple of considerations that go into where we place infrastructure.

1074
00:44:28,660 --> 00:44:29,840
Right now, it's mostly in the U.S.

1075
00:44:29,840 --> 00:44:31,350
just because that's the easiest place to put it.

1076
00:44:31,389 --> 00:44:34,099
When it comes to delivering like images and video, we, we work with

1077
00:44:34,099 --> 00:44:37,630
a CDN partner and the CDN, they've got, you know, a whole distributed

1078
00:44:37,630 --> 00:44:41,650
network of their POPs and their local caches and nodes and stuff.

1079
00:44:42,055 --> 00:44:46,025
Going back to the, the hardware, not going into super specific details,

1080
00:44:46,025 --> 00:44:49,125
but as far as like, how did you decide what to pick for hardware?

1081
00:44:49,125 --> 00:44:50,285
Where were you looking at?

1082
00:44:50,285 --> 00:44:51,874
What were the kind of the qualifications?

1083
00:44:52,085 --> 00:44:55,485
I can talk about like the chips and stuff that we're running because

1084
00:44:55,525 --> 00:44:59,814
we, we wanted to run AMD because current generation AMD in, in the

1085
00:44:59,814 --> 00:45:03,575
data center is just at a scale that it is hard to push Intel to.

1086
00:45:03,795 --> 00:45:06,545
It runs higher performance per watt and

1087
00:45:06,845 --> 00:45:08,355
you just get better density out of them.

1088
00:45:08,435 --> 00:45:11,425
That was kind of our decision on AMD versus Intel for that.

1089
00:45:11,745 --> 00:45:15,445
And also we were very interested in, uh, the X, the

1090
00:45:15,595 --> 00:45:18,305
3D V-Cache, uh, chips that AMD is coming out with.

1091
00:45:18,355 --> 00:45:21,565
And so Genoa-X CPUs, we've got, like, some of our

1092
00:45:21,565 --> 00:45:25,155
machines are spec'd with two of the 96 core, 192 thread

1093
00:45:25,165 --> 00:45:29,335
Genoa-X series CPUs that each have 768 megs of L3 cache.

1094
00:45:29,385 --> 00:45:29,395
I

1095
00:45:29,685 --> 00:45:31,215
mean, you're over 300 cores.

1096
00:45:31,225 --> 00:45:31,755
Holy crap.

1097
00:45:31,975 --> 00:45:35,515
Yeah, so it's uh, a gig and a half of uh, L3 cache in a

1098
00:45:35,515 --> 00:45:38,754
single box across two chips, which is absolutely absurd.
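The per-box totals follow directly from the per-CPU figures. The numbers below are the ones quoted in the conversation and have not been checked against a spec sheet.

```python
# Per-CPU figures as quoted in the conversation (not verified against a datasheet).
sockets = 2
cores_per_cpu = 96
threads_per_cpu = 192
l3_mb_per_cpu = 768

total_cores = sockets * cores_per_cpu          # physical cores per box
total_threads = sockets * threads_per_cpu      # hardware threads per box
total_l3_gb = sockets * l3_mb_per_cpu / 1024   # L3 cache per box, in GB

if __name__ == "__main__":
    print(total_cores, total_threads, total_l3_gb)  # 192 384 1.5
```

That is where "a gig and a half" comes from, and the "over 300" figure matches the 384 hardware threads rather than the 192 physical cores.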

1099
00:45:38,814 --> 00:45:39,184
Yeah.

1100
00:45:39,605 --> 00:45:40,825
That's more than my first computer.

1101
00:45:41,464 --> 00:45:44,185
It's like all total RAM and that's cache.

1102
00:45:44,285 --> 00:45:46,554
Yeah, so you can get insane amounts of cache.

1103
00:45:46,554 --> 00:45:49,975
You can get these like really, really high core density machines.

1104
00:45:50,155 --> 00:45:51,935
You could, you could pack a ton of RAM into a box.

1105
00:45:51,945 --> 00:45:53,455
Like if you're just buying

1106
00:45:53,880 --> 00:45:55,150
your own box,

1107
00:45:55,150 --> 00:45:57,210
you can stick a couple of terabytes of RAM into it.

1108
00:45:57,360 --> 00:46:00,320
You can't get a couple of terabytes of RAM in a cloud VM.

1109
00:46:00,650 --> 00:46:02,410
You can, but you're going to pay for it.

1110
00:46:02,949 --> 00:46:04,220
I mean, you probably have to like

1111
00:46:04,220 --> 00:46:06,999
break like 16 different pieces of glass and like talk

1112
00:46:07,000 --> 00:46:09,539
to like 30 different account reps before they'll let

1113
00:46:09,540 --> 00:46:11,969
you get like a node with two terabytes of RAM in it.

1114
00:46:12,210 --> 00:46:14,860
Which is where cloud is not fun when like, it's

1115
00:46:14,860 --> 00:46:17,510
cool when you can get an instance in seconds.

1116
00:46:17,520 --> 00:46:20,280
It's not when you have to break glass and ask permission.

1117
00:46:20,510 --> 00:46:23,090
Yeah, we can buy hardware that is very kind

1118
00:46:23,090 --> 00:46:25,009
of tailored to the workloads that we're doing.

1119
00:46:25,019 --> 00:46:28,999
So ScyllaDB is a big distributed horizontally scalable database.

1120
00:46:29,009 --> 00:46:31,900
It's got a shard per core architecture, so you can throw a bunch

1121
00:46:31,900 --> 00:46:34,190
more cores at it and it will just kind of scale horizontally.

1122
00:46:34,280 --> 00:46:36,760
But what it does want is a lot of RAM and a lot of NVMe.

1123
00:46:36,840 --> 00:46:37,520
And so,

1124
00:46:37,930 --> 00:46:39,170
NVMe is cheap these days.

1125
00:46:39,170 --> 00:46:42,930
You can get like a 15 terabyte enterprise NVMe drive for like two grand.

1126
00:46:43,050 --> 00:46:45,680
Is it as hard to manage as Cassandra?

1127
00:46:45,900 --> 00:46:48,900
It's been, when we've been using it correctly, it's

1128
00:46:48,900 --> 00:46:51,159
been totally quiet and we've had no issues with it.

1129
00:46:51,240 --> 00:46:54,309
We do have the timelines workload that is doing those, like,

1130
00:46:54,680 --> 00:46:59,120
many, many, many writes a second to timelines, which is not the best

1131
00:46:59,130 --> 00:47:02,710
fit for like an LSM tree with size-tiered compaction.
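To see why a very write-heavy workload is costly under size-tiered compaction, here is a deliberately simplified model: every flush creates one table of size 1, and whenever a tier holds `min_threshold` tables of the same size they are merged into one larger table, which means rewriting all of that data. Real implementations group tables of similar, not identical, size and also drop overwritten rows; the threshold of 4 and the function name are illustrative assumptions, and this is not ScyllaDB's actual code.

```python
from collections import Counter


def simulate_size_tiered(num_flushes, min_threshold=4):
    """Return (tables_by_size, units_rewritten) after num_flushes flushes."""
    tiers = Counter()  # table size -> number of tables of that size
    rewritten = 0      # units of data written again by compaction
    for _ in range(num_flushes):
        tiers[1] += 1  # each flush adds one table of size 1
        size = 1
        # Merging a full tier may fill the next tier up, so keep cascading.
        while tiers[size] >= min_threshold:
            tiers[size] -= min_threshold
            merged = size * min_threshold
            tiers[merged] += 1
            rewritten += merged
            size = merged
    tables = {s: n for s, n in tiers.items() if n}
    return tables, rewritten


if __name__ == "__main__":
    print(simulate_size_tiered(16))  # ({16: 1}, 32)
```

In this toy model, 16 units of flushed data cause 32 units of compaction writes on top of the original flushes, and that ratio keeps growing as more tiers form. That background disk and cache traffic is the sort of load that can spill over onto other workloads sharing the cluster.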

1132
00:47:02,930 --> 00:47:05,690
So we've run into performance issues there that were really annoying.

1133
00:47:05,810 --> 00:47:08,240
We've got past some of them by kind of

1134
00:47:08,310 --> 00:47:10,400
segmenting that workload into its own cluster.

1135
00:47:10,620 --> 00:47:14,370
And now it no longer has an impact on like P99 latencies

1136
00:47:14,370 --> 00:47:17,460
for every other operation that goes on on the website.

1137
00:47:17,830 --> 00:47:19,450
Uh, but it was all in one big cluster.

1138
00:47:19,570 --> 00:47:19,890
I think

1139
00:47:19,900 --> 00:47:21,480
that's kind of the secret of databases.

1140
00:47:21,490 --> 00:47:23,540
Cause everyone thinks that NoSQL or

1141
00:47:24,070 --> 00:47:27,850
using one or the other is going to be some sort of magical thing because they

1142
00:47:27,850 --> 00:47:31,110
think it doesn't have to be structured or it doesn't have

1143
00:47:31,110 --> 00:47:35,130
to be relational, but you have to use the right

1144
00:47:35,130 --> 00:47:38,170
tool for the job and then the right access patterns and all kinds of stuff.

1145
00:47:38,170 --> 00:47:38,409
So, I

1146
00:47:38,410 --> 00:47:40,090
mean, I think the secret of databases, everyone

1147
00:47:40,099 --> 00:47:42,509
has to use it wrong the first time, right?

1148
00:47:42,509 --> 00:47:45,409
And then, and then you figure out, Oh, this one's different.

1149
00:47:45,729 --> 00:47:50,669
There is, there is no database that will support wildly different workloads

1150
00:47:51,250 --> 00:47:54,980
on the same instance, on the same cluster, basically, is what we've learned.

1151
00:47:55,040 --> 00:47:58,030
You can design your database as, as heavily as you want to, but

1152
00:47:58,030 --> 00:48:00,869
like, if you have a really noisy neighbor, it's gonna thrash your

1153
00:48:00,870 --> 00:48:03,680
caches, and you're gonna have really bad performance, or it's gonna,

1154
00:48:03,719 --> 00:48:06,330
like, cause a bunch of compactions to kick off, and you're gonna

1155
00:48:06,330 --> 00:48:09,049
be wasting a bunch of CPU time in compactions that could have been

1156
00:48:09,049 --> 00:48:11,900
serving requests, and your latencies are gonna be all over the place.

1157
00:48:11,920 --> 00:48:14,990
So, so when we bought hardware, we were like, okay, cool, let's buy hardware

1158
00:48:15,070 --> 00:48:19,685
to run a Scylla cluster, and let's buy hardware to run a couple of really

1159
00:48:19,695 --> 00:48:24,135
highly concurrent Go processes and then some more generic hardware to run

1160
00:48:24,175 --> 00:48:28,014
more generic things like a bunch of TypeScript containers and stuff like that.

1161
00:48:28,085 --> 00:48:31,635
So the, the core data service I was talking to you about in November was running

1162
00:48:31,635 --> 00:48:35,825
on 16 containers across two physical machines in both of our data centers.

1163
00:48:35,835 --> 00:48:38,645
So two in each, in each DC, eight, eight

1164
00:48:38,645 --> 00:48:41,945
containers, those machines had 384 logical cores.

1165
00:48:41,955 --> 00:48:46,325
So with, with SMT, 384 cores, and so each Go process was getting

1166
00:48:46,700 --> 00:48:48,150
a couple dozen cores. And

1167
00:48:48,160 --> 00:48:50,530
still, when I think of that scale and you're literally talking

1168
00:48:50,530 --> 00:48:53,770
about four physical servers, and I think if I wanted to

1169
00:48:53,770 --> 00:48:57,629
replicate that in a cloud architecture, that is at least 30

1170
00:48:57,710 --> 00:49:01,479
VMs somewhere with a couple of queues and something else, and like

1171
00:49:01,480 --> 00:49:06,439
that complexity. Four physical servers handling, across all four of them,

1172
00:49:06,439 --> 00:49:10,210
in the neighborhood of 700,000 requests a second from the AppView

1173
00:49:10,630 --> 00:49:14,410
and querying a database around four and a half million times a second.
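A rough back-of-the-envelope check on those figures, as a minimal Python sketch. The numbers are the approximate ones quoted in the conversation. Treating every database query as belonging to an AppView request is an assumption made only for this arithmetic, and the function and constant names are hypothetical.

```python
# Approximate figures as spoken in the episode.
APPVIEW_REQUESTS_PER_SEC = 700_000   # across all four servers
DB_QUERIES_PER_SEC = 4_500_000       # "around four and a half million"
PHYSICAL_SERVERS = 4


def per_server(total_per_sec: float, servers: int) -> float:
    """Average load per machine, assuming an even split (an assumption)."""
    return total_per_sec / servers


def queries_per_request(db_per_sec: float, requests_per_sec: float) -> float:
    """Average database queries issued per incoming request."""
    return db_per_sec / requests_per_sec


requests_per_server = per_server(APPVIEW_REQUESTS_PER_SEC, PHYSICAL_SERVERS)
fan_out = queries_per_request(DB_QUERIES_PER_SEC, APPVIEW_REQUESTS_PER_SEC)

print(f"{requests_per_server:,.0f} requests/s per server")
print(f"{fan_out:.2f} database queries per request")
```

Under those assumptions each machine handles roughly 175,000 requests a second, and each request fans out to a little over six database queries.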

1174
00:49:14,735 --> 00:49:18,845
Your experience being a hardware engineer and a software engineer

1175
00:49:18,845 --> 00:49:22,385
really meshes well with you working in infrastructure because if

1176
00:49:22,385 --> 00:49:25,185
you didn't know hardware as well you probably wouldn't be able to

1177
00:49:25,565 --> 00:49:29,295
go and pick the right, like everything, it seems like you have a

1178
00:49:29,485 --> 00:49:32,845
really good knack for right sizing and picking the right things.

1179
00:49:32,875 --> 00:49:35,345
And I think people struggle with that so much.

1180
00:49:35,655 --> 00:49:36,865
They're all tools, right?

1181
00:49:36,885 --> 00:49:39,875
But how do you go and use that tool efficiently, right?

1182
00:49:39,905 --> 00:49:43,144
And the fact that you worked with bare metal and you worked with hardware

1183
00:49:43,144 --> 00:49:46,725
and let's be real, it's easier to figure out cloud because there's a lot more

1184
00:49:46,825 --> 00:49:49,855
kind of tutorials and information out there to go figure that out, right?

1185
00:49:49,915 --> 00:49:53,515
You came with the hard stuff and then you get to meld that together.

1186
00:49:54,125 --> 00:49:56,145
I feel like a lot of it is instinct at this point,

1187
00:49:56,145 --> 00:49:58,465
or it's like, I feel like I'm guessing really often.

1188
00:49:58,775 --> 00:50:01,635
When you are, like, right sizing for hardware, you're

1189
00:50:01,635 --> 00:50:04,315
never gonna make a decision with as much data as you want.

1190
00:50:04,355 --> 00:50:07,525
You'll never reach a point where every decision that you make is fully

1191
00:50:07,525 --> 00:50:10,394
informed, and you're like, Ah, yes, this is clearly the obvious decision

1192
00:50:10,395 --> 00:50:12,745
because I have all the information I need to make this decision.

1193
00:50:13,055 --> 00:50:15,695
So I will just make the correct decision.

1194
00:50:15,985 --> 00:50:18,760
What you're left with is like, what do you know?

1195
00:50:18,790 --> 00:50:20,280
What do you have experience with?

1196
00:50:20,350 --> 00:50:22,960
And then, what does your gut say?

1197
00:50:23,150 --> 00:50:24,940
A lot of times that's almost more important.

1198
00:50:24,960 --> 00:50:28,040
I've learned through working at different companies that sometimes

1199
00:50:28,040 --> 00:50:31,599
it's more like what your engineers know and what they're good at

1200
00:50:31,860 --> 00:50:34,920
and then finding the best tool that they have experience with.

1201
00:50:35,205 --> 00:50:36,915
Rather than just picking the best tool, like they

1202
00:50:36,915 --> 00:50:39,635
all have to be counted in and like accounted for.

1203
00:50:39,745 --> 00:50:43,035
Making the decision is like, and making the correct decision is hard.

1204
00:50:43,055 --> 00:50:45,785
Choosing when to make a decision is another really

1205
00:50:45,825 --> 00:50:49,125
important role that takes a lot of experience to get.

1206
00:50:49,144 --> 00:50:51,575
I don't have a ton of that experience right now.

1207
00:50:51,644 --> 00:50:53,915
Jake, our previous, our previous infra lead,

1208
00:50:53,955 --> 00:50:56,345
made a lot of these decisions that I was like.

1209
00:50:56,775 --> 00:50:57,475
Are you sure?

1210
00:50:57,475 --> 00:50:59,855
Like, I don't know, like, is this going to work?

1211
00:50:59,865 --> 00:51:02,845
And a lot of those have like very clearly panned out.

1212
00:51:02,855 --> 00:51:05,645
And I, I've bowed to his wisdom on a lot of that.

1213
00:51:05,705 --> 00:51:07,454
And now I'm in the position where I'm like.

1214
00:51:07,810 --> 00:51:09,480
I hope I know what I'm doing.

1215
00:51:09,670 --> 00:51:12,330
I like, I have no idea what I'm doing, but you know, we're still alive.

1216
00:51:12,330 --> 00:51:13,420
So I must be doing something right.

1217
00:51:13,450 --> 00:51:17,859
And choosing when to make a decision is also very important because delaying

1218
00:51:17,859 --> 00:51:21,580
decisions until you have more information is, is good if you really don't

1219
00:51:21,580 --> 00:51:25,399
have enough information to make a decision, but being indecisive can cause

1220
00:51:25,399 --> 00:51:28,430
you to slow down or it can cause problems or it can make more work for you.

1221
00:51:28,680 --> 00:51:30,620
And so you have to like constantly be

1222
00:51:30,950 --> 00:51:33,950
doing this trade off between should I just make a decision and

1223
00:51:33,950 --> 00:51:36,680
go with it and commit to it because we'll get more done that way

1224
00:51:36,740 --> 00:51:39,010
if the decision isn't super high stakes? Or if it's a really high

1225
00:51:39,010 --> 00:51:42,840
stakes decision, how do I wait just the right amount of time so that

1226
00:51:42,840 --> 00:51:45,440
we have enough information, but we're also not missing the boat?

1227
00:51:45,530 --> 00:51:49,075
Looking back over the last 18 months, were there any decisions you regret

1228
00:51:49,135 --> 00:51:53,485
that either you made at the wrong time or you, you just decided that I'm just

1229
00:51:53,485 --> 00:51:56,034
trying, I'm asking, you know, there's a lot of learning experiences here,

1230
00:51:57,864 --> 00:52:01,035
Any decisions that I regret? I don't think I can fault

1231
00:52:01,065 --> 00:52:03,545
any of our major decisions that we've made because

1232
00:52:03,545 --> 00:52:05,094
we

1233
00:52:05,175 --> 00:52:08,765
haven't, well, we, yeah, we haven't fallen over. Nobody could

1234
00:52:08,765 --> 00:52:12,475
have possibly predicted the ridiculous trajectory that we're

1235
00:52:12,475 --> 00:52:14,885
on, like, except for Jake when he wrote that spreadsheet.

1236
00:52:14,895 --> 00:52:15,495
But like,

1237
00:52:15,645 --> 00:52:18,664
if you could have predicted all of this, then we should pay you for like

1238
00:52:18,705 --> 00:52:23,315
predicting the election and a bunch of like, some other really unstable world.

1239
00:52:24,335 --> 00:52:26,675
These have all been very heavily outside influence.

1240
00:52:27,695 --> 00:52:29,765
I do kind of firmly believe that, like, from

1241
00:52:29,765 --> 00:52:32,005
an infrastructure standpoint, we have made

1242
00:52:32,130 --> 00:52:33,890
the best decision that we could with the information

1243
00:52:33,890 --> 00:52:35,500
that we had pretty much across the board.

1244
00:52:35,550 --> 00:52:37,750
And having more information, we wouldn't have believed

1245
00:52:37,750 --> 00:52:40,079
it if I, if I like sent myself back from the future

1246
00:52:40,079 --> 00:52:42,100
and was like, Hey, you have to prepare for this scale.

1247
00:52:42,100 --> 00:52:43,279
I would have been like, you're insane.

1248
00:52:43,360 --> 00:52:43,920
Get out of here.

1249
00:52:43,920 --> 00:52:44,050
I

1250
00:52:44,060 --> 00:52:46,019
saw a post like that on Bluesky today.

1251
00:52:46,020 --> 00:52:50,399
It was like, if someone had told me that it was something like random about

1252
00:52:50,399 --> 00:52:53,700
like where we are now, versus 10 years ago, it was like, if I went back in

1253
00:52:53,700 --> 00:52:58,460
2004 and I got put in like a mental asylum for telling people what's going

1254
00:52:58,470 --> 00:53:02,145
on in the future, the future's like, And I was like, they're not wrong.

1255
00:53:02,285 --> 00:53:03,665
Like, they're so not wrong.

1256
00:53:03,675 --> 00:53:08,105
Back in November of 2023, we re-architected the entire backend.

1257
00:53:08,114 --> 00:53:11,465
So the entire backend was on one big Postgres instance, uh, or like a bunch

1258
00:53:11,465 --> 00:53:15,585
of Postgres replicas, the PDS and the AppView were merged into one big thing.

1259
00:53:15,595 --> 00:53:18,424
It was all just one giant Postgres serving a hundred thousand users.

1260
00:53:18,495 --> 00:53:19,875
We broke those roles apart.

1261
00:53:20,105 --> 00:53:24,485
And then we moved to the V2 architecture, which is, hey, Scylla-based.

1262
00:53:24,925 --> 00:53:29,205
Rewrite the entire data schema, build it all from scratch, and design

1263
00:53:29,205 --> 00:53:32,255
it to support up to 100 million users at the time was the goal.

1264
00:53:32,435 --> 00:53:34,714
And we had 100,000 users, and we were like, cool, we're

1265
00:53:34,714 --> 00:53:37,155
going to build for three orders of magnitude from only

1266
00:53:37,155 --> 00:53:40,155
having information of, you know, operating at 100,000 users.

1267
00:53:40,345 --> 00:53:42,035
None of us had any idea what the hell we were doing.

1268
00:53:42,154 --> 00:53:46,455
Like, this was all way pie in the sky architect engineering stuff.

1269
00:53:46,545 --> 00:53:49,045
We got some idea of what it was going to look like and then I went

1270
00:53:49,065 --> 00:53:53,365
head down for like six weeks from like Christmas to the end of January.

1271
00:53:53,545 --> 00:53:57,365
And just wrote out our entire new data architecture and

1272
00:53:57,365 --> 00:54:00,105
then implemented it and got it running and on our hardware.

1273
00:54:00,165 --> 00:54:02,085
I hope you guys are going to a beach in Mexico

1274
00:54:02,085 --> 00:54:04,235
at some point because you'll be working some

1275
00:54:04,565 --> 00:54:04,655
hours.

1276
00:54:05,324 --> 00:54:08,795
Right before the public launch back in February of last year, five days

1277
00:54:08,805 --> 00:54:15,465
before that, we silently shifted the entire backend from being in cloud,

1278
00:54:15,900 --> 00:54:20,290
on top of a big Postgres, to running on our own hardware, and nobody

1279
00:54:20,290 --> 00:54:23,500
noticed. And so we had, like, we backfilled all the data, we had it

1280
00:54:23,500 --> 00:54:26,540
all running for a while, for a couple days before everything switched

1281
00:54:26,540 --> 00:54:29,569
over, and then we just slowly moved one PDS at a time and pointed it

1282
00:54:29,570 --> 00:54:32,239
at the new architecture. And so over the course of like an hour we

1283
00:54:32,239 --> 00:54:35,220
shifted 100 percent of traffic onto the on-prem loadout. And that was

1284
00:54:35,220 --> 00:54:37,890
like, that was the moment where I was like, I can't believe we just did

1285
00:54:37,890 --> 00:54:41,480
that. I was like, we went to a cave and wrote this whole thing.

1286
00:54:41,480 --> 00:54:43,350
And then like, all right, I hope it works.

1287
00:54:43,450 --> 00:54:45,950
We'll see what happens when it like actually gets users on it.

1288
00:54:45,950 --> 00:54:47,320
And then it just frigging worked.

1289
00:54:47,330 --> 00:54:48,569
And it was like, you're kidding me.

1290
00:54:49,020 --> 00:54:50,159
Like we had like two bugs.

1291
00:54:50,570 --> 00:54:52,890
And like, tiny, tiny, tiny percentage of people

1292
00:54:52,890 --> 00:54:54,720
noticed it, and we fixed those within a day or two.

1293
00:54:54,940 --> 00:54:56,000
And I was like, alright, what's next?
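The cutover described here, moving one PDS at a time onto the new backend, can be sketched roughly as follows. This is a minimal illustration and not Bluesky's actual tooling: `cut_over`, `point_at_new_backend` and `is_healthy` are hypothetical names, and stopping at the first unhealthy host is an assumption about how such a rollout might be guarded.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def cut_over(
    pds_hosts: Iterable[str],
    point_at_new_backend: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Repoint hosts one at a time, halting at the first unhealthy one.

    Returns the hosts moved successfully and the host that failed, if any.
    """
    moved: List[str] = []
    for host in pds_hosts:
        point_at_new_backend(host)   # switch this one host to the new backend
        if not is_healthy(host):     # check before touching the next host
            return moved, host
        moved.append(host)
    return moved, None
```

Moving a single host per step keeps the blast radius small: if something breaks, only the most recently moved PDS is affected and the rest are still on the old path.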

1294
00:54:56,260 --> 00:54:59,650
I feel like someone tried to explain what an SRE was the

1295
00:54:59,650 --> 00:55:02,629
other day on Bluesky to like, people that were not technical.

1296
00:55:02,629 --> 00:55:06,640
And it's wild because like, nobody knows what you're doing until you mess it up.

1297
00:55:06,985 --> 00:55:08,915
And then they know what you're doing, you know what I mean?

1298
00:55:08,915 --> 00:55:11,455
So like, it's what, like, that's such a huge

1299
00:55:11,455 --> 00:55:13,855
achievement for you to do that much of a data switch.

1300
00:55:13,885 --> 00:55:18,285
And like, to know you did it right is because nobody noticed, you know?

1301
00:55:18,515 --> 00:55:21,785
Yeah, that was one of the very high stakes moments.

1302
00:55:21,815 --> 00:55:24,404
We've had a couple of those since then, like turning on video.

1303
00:55:24,830 --> 00:55:26,540
Was like, I have no idea.

1304
00:55:26,550 --> 00:55:29,320
Video, the like backend for video is all custom.

1305
00:55:29,350 --> 00:55:33,390
It's all like, I wrote up our entire kind of video processing pipeline.

1306
00:55:33,500 --> 00:55:36,450
I architected it and, and set up the, it just runs

1307
00:55:36,450 --> 00:55:39,130
on a bunch of machines that, that we don't operate.

1308
00:55:39,230 --> 00:55:41,729
And I was like, I think this should be horizontally scalable.

1309
00:55:41,759 --> 00:55:42,660
Like I've done.

1310
00:55:43,135 --> 00:55:47,235
I've run it in Docker Compose on my like work machine and I've scaled

1311
00:55:47,235 --> 00:55:50,435
it to like, however many, you know, hits a second and it worked fine.

1312
00:55:50,455 --> 00:55:53,534
It should probably be okay, but our only way of like

1313
00:55:53,534 --> 00:55:55,454
figuring it out was like, all right, turn the dial and

1314
00:55:55,454 --> 00:55:58,114
actually let users use it and see if it's going to happen.

1315
00:55:58,114 --> 00:55:59,034
And this was right after Brazil.

1316
00:55:59,034 --> 00:55:59,875
So Brazil happened.

1317
00:55:59,885 --> 00:56:02,285
We had 10x the number of users we

1318
00:56:02,295 --> 00:56:04,695
expected to have. I had been building video

1319
00:56:05,140 --> 00:56:06,750
for the previous number of users.

1320
00:56:06,750 --> 00:56:09,940
But I was like, I want it to be able to scale to a billion horizontally.

1321
00:56:10,100 --> 00:56:13,739
And then Brazil came on and, and Paul was like, can we still do video?

1322
00:56:14,250 --> 00:56:15,570
And I was like, give me a week.

1323
00:56:15,580 --> 00:56:18,149
Like, yeah, give me, give me, give me a week.

1324
00:56:18,150 --> 00:56:19,860
Let me, let me, let me update some spreadsheets to

1325
00:56:19,860 --> 00:56:21,050
figure out what the costs are going to look like.

1326
00:56:21,050 --> 00:56:22,810
And then give me a week and then yeah, let's do video.

1327
00:56:22,910 --> 00:56:25,150
We had a last minute architectural change with video as well.

1328
00:56:25,150 --> 00:56:25,830
That was insane.

1329
00:56:25,839 --> 00:56:29,080
We were, it was the morning of the video launch.

1330
00:56:29,150 --> 00:56:31,149
Uh, we had, we had a transcoding partner that was

1331
00:56:31,150 --> 00:56:33,720
going to do like half of our video encoding for us.

1332
00:56:33,740 --> 00:56:35,870
And, and a big chunk of the, the workflow.

1333
00:56:35,970 --> 00:56:38,840
We submitted some jobs to their, their queues that morning.

1334
00:56:38,850 --> 00:56:38,960
Like.

1335
00:56:39,460 --> 00:56:42,730
Through their API and it took like an hour to process the video and I

1336
00:56:42,730 --> 00:56:45,780
was like, what? This was like working just fine. Like, last night it was

1337
00:56:45,790 --> 00:56:48,260
happening in seconds. And they said, oh, you know, there's a really

1338
00:56:48,260 --> 00:56:52,790
big backlog right now. And I was like, I can't ship that to like millions

1339
00:56:52,790 --> 00:56:56,289
of users. That's not acceptable. Like, people can't upload

1340
00:56:56,299 --> 00:56:58,850
videos if it's gonna take an hour to process a 60-second video. That

1341
00:56:58,850 --> 00:57:03,620
makes no sense. So in about 14 hours of insanity, I like rewrote their

1342
00:57:03,620 --> 00:57:07,660
entire part of that stack into the existing job system that I built.

1343
00:57:07,880 --> 00:57:09,020
And I was like, cool, I'm just going to replace your

1344
00:57:09,020 --> 00:57:12,087
product and I'm just going to shove these into an S3 bucket.

1345
00:57:12,087 --> 00:57:14,669
What kind of Monster do you drink?

1346
00:57:14,860 --> 00:57:16,280
Goodness, Paul

1347
00:57:16,280 --> 00:57:17,260
drinks Red Bull, doesn't he?

1348
00:57:17,410 --> 00:57:19,260
It's like between Red Bull and Monster.

1349
00:57:19,780 --> 00:57:21,130
Paul needs a fridge of Red Bull.

1350
00:57:21,130 --> 00:57:22,020
I think I ate that

1351
00:57:22,020 --> 00:57:23,180
night, briefly.

1352
00:57:23,309 --> 00:57:23,829
Yeah.

1353
00:57:23,940 --> 00:57:25,090
Me and, me and Divey.

1354
00:57:25,130 --> 00:57:27,689
Divey was like, I was like, Hey, I think this is how this can work.

1355
00:57:27,700 --> 00:57:30,769
Can you figure out how to get the CDN to front this like S3

1356
00:57:30,770 --> 00:57:33,580
bucket or like S3 compatible bucket, this block store bucket?

1357
00:57:33,830 --> 00:57:37,315
And then I will do everything I can to get us to encode these

1358
00:57:37,315 --> 00:57:39,590
HLS streams and get them into that block storage bucket.

1359
00:57:39,955 --> 00:57:42,605
And then hopefully it should just work, maybe.
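The encoding half of that plan, turning an upload into an HLS stream that can sit in a bucket behind a CDN, might look roughly like the sketch below. It only builds an FFmpeg command line and does not run it. The codec choices, the six-second segment length, the file names and the function name `build_hls_command` are all assumptions for illustration, not details from the episode. Uploading the resulting files to the bucket is left out.

```python
from pathlib import PurePosixPath
from typing import List


def build_hls_command(source: str, out_dir: str, segment_seconds: int = 6) -> List[str]:
    """Build an FFmpeg argument list that writes a VOD HLS playlist plus segments."""
    out = PurePosixPath(out_dir)
    return [
        "ffmpeg",
        "-i", source,                       # input video
        "-c:v", "libx264",                  # H.264 video (assumed codec)
        "-c:a", "aac",                      # AAC audio (assumed codec)
        "-f", "hls",                        # use FFmpeg's HLS muxer
        "-hls_time", str(segment_seconds),  # target segment duration in seconds
        "-hls_playlist_type", "vod",        # finished playlist, not a live stream
        "-hls_segment_filename", str(out / "segment_%03d.ts"),
        str(out / "playlist.m3u8"),
    ]


print(" ".join(build_hls_command("upload.mp4", "out")))
```

The list form can be handed to `subprocess.run` on a machine that has FFmpeg installed, after which the playlist and segment files are what would be copied into the bucket.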

1360
00:57:42,865 --> 00:57:45,085
Um, and we literally launched the next day.

1361
00:57:45,265 --> 00:57:48,804
You're like, oh, Jake did this and like, oh, I didn't do anything big.

1362
00:57:48,805 --> 00:57:50,645
And I'm like, are you listening to the

1363
00:57:50,655 --> 00:57:52,485
words that are coming out of your own mouth?

1364
00:57:52,825 --> 00:57:53,805
It was a lot.

1365
00:57:53,885 --> 00:57:54,045
It was a

1366
00:57:54,655 --> 00:57:57,365
bajillion times.

1367
00:57:57,365 --> 00:57:58,975
And like, it was no big deal though.

1368
00:57:58,975 --> 00:58:00,495
I just did it with a Monster.

1369
00:58:01,165 --> 00:58:04,055
The secret to video encoding is everybody's just calling FFmpeg.

1370
00:58:04,265 --> 00:58:04,845
It doesn't matter.

1371
00:58:04,845 --> 00:58:05,915
It doesn't matter how big of a company.

1372
00:58:05,955 --> 00:58:07,444
I mean, maybe if you're like Google scale or

1373
00:58:07,444 --> 00:58:09,205
something, you're not doing it anymore at that point.

1374
00:58:09,215 --> 00:58:09,515
But.

1375
00:58:10,085 --> 00:58:12,724
It's so much just like, yeah, you're calling FFmpeg.

1376
00:58:12,725 --> 00:58:14,095
Disney FFmpeg.

1377
00:58:14,185 --> 00:58:15,635
It's just legit.

1378
00:58:15,835 --> 00:58:18,345
Like, yeah, there's some hardware that's specialized to It's

1379
00:58:18,345 --> 00:58:20,295
so phenomenal that Disney didn't fall over in itself.

1380
00:58:20,434 --> 00:58:22,615
Also, like, can we talk, like, with the amount of times

1381
00:58:22,615 --> 00:58:26,295
that we saw the Twitter whale in early Twitter scale days?

1382
00:58:26,585 --> 00:58:27,655
Y'all are killing it.

1383
00:58:27,735 --> 00:58:31,725
The secret is we're a distributed system, so we're never fully down.

1384
00:58:31,875 --> 00:58:33,595
We only ever have partial outages.

1385
00:58:34,595 --> 00:58:36,355
We only ever have service degradations.

1386
00:58:36,595 --> 00:58:38,874
So occasionally the website goes into read only mode,

1387
00:58:38,895 --> 00:58:40,904
and you can't like things or anything, and they all get

1388
00:58:40,914 --> 00:58:43,374
backed up in a queue somewhere, but you can still scroll.
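The degraded mode described here, where reads keep working while writes back up in a queue, can be illustrated with a small sketch. This is a hypothetical single-process model, not how Bluesky implements it: `WriteGate`, `submit` and `resume` are invented names, and replaying the backlog in arrival order is an assumption.

```python
from collections import deque
from typing import Any, Callable, Deque


class WriteGate:
    """Applies writes normally, or parks them while the system is read-only."""

    def __init__(self, apply_write: Callable[[Any], None]) -> None:
        self._apply_write = apply_write
        self._backlog: Deque[Any] = deque()
        self.read_only = False

    def submit(self, write: Any) -> bool:
        """Return True if applied immediately, False if queued for later."""
        if self.read_only:
            self._backlog.append(write)
            return False
        self._apply_write(write)
        return True

    def resume(self) -> int:
        """Leave read-only mode and replay queued writes in arrival order."""
        self.read_only = False
        drained = 0
        while self._backlog:
            self._apply_write(self._backlog.popleft())
            drained += 1
        return drained
```

Reads never pass through the gate, which is why scrolling still works while likes and posts wait.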

1389
00:58:43,385 --> 00:58:43,649
You can still scroll.

1390
00:58:43,660 --> 00:58:45,340
Scrolla nine, that nine CAS.

1391
00:58:45,340 --> 00:58:48,779
If your

1392
00:58:48,779 --> 00:58:53,270
system is distributed enough, you're never fully down.

1393
00:58:53,320 --> 00:58:57,039
Your bugs will always be 10 times worse because you have to figure out where

1394
00:58:57,039 --> 00:59:01,009
you went wrong, but it'll be up and it looks like it's great for customers.

1395
00:59:01,889 --> 00:59:02,399
Exactly.

1396
00:59:02,419 --> 00:59:05,602
All of your, all of your bugs

1397
00:59:05,602 --> 00:59:06,779
are Heisenbugs.

1398
00:59:06,780 --> 00:59:07,000
What's next?

1399
00:59:07,810 --> 00:59:09,480
What's next for Bluesky for infrastructure?

1400
00:59:09,480 --> 00:59:10,510
What are you, what are you looking at?

1401
00:59:12,200 --> 00:59:14,290
We just did some hardware scaling, which was exciting.

1402
00:59:14,665 --> 00:59:17,815
Um, we're probably going to do some more of that in the future, depending

1403
00:59:17,815 --> 00:59:20,925
on how growth goes this year, you know, like we were at 100,000 users

1404
00:59:20,925 --> 00:59:25,245
18 months ago, we're sitting at 30, just shy of 30 million users today,

1405
00:59:25,375 --> 00:59:28,485
there's a lot of maturing our data architecture that we have to do,

1406
00:59:28,555 --> 00:59:32,264
there's a lot of like low hanging fruit in, in like how to do caches

1407
00:59:32,265 --> 00:59:35,325
better, how to coalesce requests better, how to, you know, hybrid

1408
00:59:35,325 --> 00:59:39,255
timeline fan out stuff for, uh, for celebrities, there's so many different

1409
00:59:39,255 --> 00:59:43,190
things that, if we stretched this, you know, this past six month period

1410
00:59:43,410 --> 00:59:47,310
over the course of two years, it would have gone totally differently.

1411
00:59:47,320 --> 00:59:48,880
Everything would have been perfectly smooth.

1412
00:59:48,920 --> 00:59:50,560
Like, we would have no tech debt.

1413
00:59:50,580 --> 00:59:53,000
It would have been great because we would have scaled at a rate

1414
00:59:53,010 --> 00:59:56,569
that like, you can see what's going to be a problem slightly ahead

1415
00:59:56,569 --> 00:59:59,350
of time and you can anticipate it and go do something about it.

1416
00:59:59,649 --> 01:00:01,380
But where we're at now is like, problems are

1417
01:00:01,400 --> 01:00:03,850
either on fire or they're not high enough priority.

1418
01:00:03,930 --> 01:00:05,820
And so that, that was in November.

1419
01:00:05,830 --> 01:00:09,020
And now, now we've got, we've bought ourselves some more breathing room.

1420
01:00:09,040 --> 01:00:11,350
And so I'm starting to look at how do we do service discovery?

1421
01:00:11,520 --> 01:00:13,974
We have a bunch of services that are like, here's like a,

1422
01:00:14,225 --> 01:00:16,915
here's a static list of instances to go try to talk to.

1423
01:00:17,055 --> 01:00:19,275
And if one of those instances goes down and I can't bring it back

1424
01:00:19,295 --> 01:00:21,445
up because it had some load bearing bloom filters or something

1425
01:00:21,445 --> 01:00:23,875
like that and we're in peak traffic, everything gets mad.

1426
01:00:23,875 --> 01:00:25,915
I have to go redeploy all of the services that talk

1427
01:00:25,915 --> 01:00:27,785
to it to tell it, hey, don't try to talk to this one.

1428
01:00:27,904 --> 01:00:29,555
So there's some kind of like dynamic configuration

1429
01:00:29,555 --> 01:00:31,245
and service discovery that we want to get rolling.
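The difference between a static instance list and dynamic discovery can be sketched as below. With a static list, removing a dead instance means redeploying every caller. With a shared registry, one call takes it out of rotation. This is a minimal in-process illustration with hypothetical names (`InstanceRegistry`, `mark_down`, `mark_up`, `pick`). A real setup would back this with a service-discovery system, which the episode does not name.

```python
import threading
from typing import Iterable, List


class InstanceRegistry:
    """Round-robin picker over instances that can be removed at runtime."""

    def __init__(self, instances: Iterable[str]) -> None:
        self._lock = threading.Lock()
        self._healthy: List[str] = list(instances)
        self._next = 0

    def mark_down(self, instance: str) -> None:
        """Stop routing to an instance without redeploying callers."""
        with self._lock:
            if instance in self._healthy:
                self._healthy.remove(instance)

    def mark_up(self, instance: str) -> None:
        """Put an instance back into rotation."""
        with self._lock:
            if instance not in self._healthy:
                self._healthy.append(instance)

    def pick(self) -> str:
        """Return the next healthy instance in round-robin order."""
        with self._lock:
            if not self._healthy:
                raise RuntimeError("no healthy instances available")
            instance = self._healthy[self._next % len(self._healthy)]
            self._next += 1
            return instance
```

Callers ask the registry on every request, so marking an instance down takes effect on the next call rather than on the next deploy.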

1430
01:00:31,945 --> 01:00:33,695
Lots of caching infrastructure changes.

1431
01:00:33,815 --> 01:00:36,115
Maybe writing a custom database for timelines.

1432
01:00:36,415 --> 01:00:40,215
That's, that's one thing that's been on my mind is uh, LSM tree is not a

1433
01:00:40,215 --> 01:00:44,900
great fit for this like, circular-buffer-style timeline where like, you've

1434
01:00:44,900 --> 01:00:48,110
got a fixed length of, of references you want to put in everybody's timelines.

1435
01:00:48,120 --> 01:00:49,840
Then you want to kind of overwrite the oldest one

1436
01:00:49,850 --> 01:00:52,060
when a new one comes in. It feels a lot like a circular buffer.

1437
01:00:52,060 --> 01:00:52,839
And I'm like, okay, cool.

1438
01:00:52,870 --> 01:00:53,670
Can we do something with that?

1439
01:00:53,670 --> 01:00:55,619
Can I go write a database for timelines

1440
01:00:55,630 --> 01:00:57,969
that is just going to be super specially built

1441
01:00:57,980 --> 01:01:00,769
for this workload and just really efficient and scale

1442
01:01:01,050 --> 01:01:02,880
way farther than I need it to right now?

1443
01:01:02,980 --> 01:01:04,580
So, yeah, writing some databases.
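The circular-buffer idea for timelines, a fixed number of post references per user where a new entry overwrites the oldest, can be sketched in a few lines. This is an in-memory illustration only. `TimelineRing` and its methods are hypothetical names, and nothing here reflects how a real on-disk timeline store would be laid out.

```python
from typing import List, Optional


class TimelineRing:
    """Fixed-capacity timeline: a new reference overwrites the oldest slot."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots: List[Optional[str]] = [None] * capacity
        self._next = 0    # index of the slot the next push will write
        self._count = 0   # number of slots filled so far

    def push(self, post_ref: str) -> None:
        """Write in place; once full, this replaces the oldest entry."""
        self._slots[self._next] = post_ref
        self._next = (self._next + 1) % len(self._slots)
        self._count = min(self._count + 1, len(self._slots))

    def newest_first(self) -> List[Optional[str]]:
        """Walk backwards from the most recent write."""
        cap = len(self._slots)
        return [self._slots[(self._next - 1 - i) % cap] for i in range(self._count)]
```

Every write lands in a slot that already exists, so storage per timeline stays constant and there is nothing to compact afterwards, which is the contrast with an LSM tree drawn above.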

1444
01:01:04,660 --> 01:01:06,140
I did that with a graph database last year.

1445
01:01:06,140 --> 01:01:09,539
Yeah, like that's totally no big deal.

1446
01:01:09,540 --> 01:01:10,930
Because everybody does that.

1447
01:01:10,980 --> 01:01:13,080
I'm just going to change the way that, like, you

1448
01:01:13,080 --> 01:01:15,370
know, AT Protocol and social media does data.

1449
01:01:15,460 --> 01:01:18,609
Hey, if you limit the scope of your problem, any problem is, any

1450
01:01:18,610 --> 01:01:20,370
problem can be tackleable if you limit the scope hard enough.

1451
01:01:20,390 --> 01:01:23,540
The next time you go for a job interview or write a bio, call us.

1452
01:01:23,900 --> 01:01:24,790
This is your new resume.

1453
01:01:24,810 --> 01:01:28,640
Yeah, you just, you're not doing, like, what you do justice, okay?

1454
01:01:28,790 --> 01:01:30,260
It's, yeah, I don't know.

1455
01:01:30,690 --> 01:01:33,390
There's so many, you wear so many hats at a, uh, like on a

1456
01:01:33,390 --> 01:01:37,130
tiny team that like, I forget what I do a month afterwards

1457
01:01:37,130 --> 01:01:40,160
because the, the, the past month is like, because you left, or

1458
01:01:40,160 --> 01:01:40,725
eighth of that month?

1459
01:01:43,040 --> 01:01:45,530
The past month is like a whole, a whole, like

1460
01:01:45,530 --> 01:01:47,240
every month is like we're in a whole new league.

1461
01:01:47,300 --> 01:01:47,900
Oh crap.

1462
01:01:47,930 --> 01:01:48,920
Now we're in a whole new league.

1463
01:01:48,950 --> 01:01:49,370
Oh crap.

1464
01:01:49,370 --> 01:01:50,390
Now we're in a whole new league.

1465
01:01:50,390 --> 01:01:50,900
And it's like your poor

1466
01:01:50,900 --> 01:01:55,040
brain hasn't had the time to turn off and like register the memory.

1467
01:01:56,350 --> 01:01:59,890
I, I took some time off over, over the holiday, over the winter holidays.

1468
01:01:59,890 --> 01:02:02,190
I got, I got like a week or two off there, which was, uh,

1469
01:02:02,900 --> 01:02:03,810
gave me some breathing room.

1470
01:02:03,850 --> 01:02:05,010
I slept for eight hours.

1471
01:02:05,110 --> 01:02:05,810
It was okay.

1472
01:02:06,480 --> 01:02:09,910
Jaz, thank you so much for coming on the podcast, explaining all of this.

1473
01:02:09,920 --> 01:02:14,490
The rollercoaster of Bluesky over the last year and a half has been phenomenal.

1474
01:02:14,490 --> 01:02:15,840
I've been enjoying it thoroughly.

1475
01:02:15,930 --> 01:02:18,475
I've been trying to play with the new things you've been

1476
01:02:18,475 --> 01:02:21,735
putting out with PDSs and whoever I want to, you know,

1477
01:02:21,735 --> 01:02:23,895
poke at a firehose and whatnot and see what's going on.

1478
01:02:23,895 --> 01:02:23,965
We are sorry

1479
01:02:23,965 --> 01:02:26,265
that Justin does hoodrat stuff with your infrastructure.

1480
01:02:26,265 --> 01:02:26,845
We apologize.

1481
01:02:26,905 --> 01:02:27,225
I

1482
01:02:27,245 --> 01:02:28,525
definitely am one of those abusers.

1483
01:02:28,995 --> 01:02:31,894
Just like, look, just we're, we're going to send, just make a

1484
01:02:31,894 --> 01:02:34,795
like little like page where we can send you coffee every time

1485
01:02:34,795 --> 01:02:37,565
Justin gets a bright idea and then post about it to encourage

1486
01:02:37,585 --> 01:02:40,265
other people to get said bright idea and do hoodrat stuff.

1487
01:02:40,835 --> 01:02:42,905
If a well intended dev can cause issues,

1488
01:02:42,905 --> 01:02:44,685
then we've, we've got work to do, right?

1489
01:02:44,685 --> 01:02:46,355
Justin's your chaos engineering.

1490
01:02:46,395 --> 01:02:47,915
He's your, like, chaos goblin.

1491
01:02:48,015 --> 01:02:50,985
Retroid is definitely another one of our chaos engineers in the community.

1492
01:02:50,985 --> 01:02:53,834
If you, if you follow Retroid, he's, since the early days,

1493
01:02:53,845 --> 01:02:57,085
has been helping us find, uh, bugs in unlikely places.

1494
01:02:57,524 --> 01:02:58,314
That's a way to describe.

1495
01:02:58,870 --> 01:02:59,700
that relationship.

1496
01:02:59,860 --> 01:03:01,710
That was such a nice way of doing it.

1497
01:03:02,530 --> 01:03:04,110
So everyone, thank you for listening.

1498
01:03:04,110 --> 01:03:06,410
If you're on Bluesky, go look up Jaz.

1499
01:03:06,680 --> 01:03:09,570
They're on the network, obviously very active

1500
01:03:09,599 --> 01:03:11,829
posting and sharing your knowledge and everything.

1501
01:03:11,830 --> 01:03:13,309
And so that's, that's been fantastic just to

1502
01:03:13,309 --> 01:03:15,429
follow along and everyone that's listening.

1503
01:03:15,460 --> 01:03:16,210
Thank you so much.

1504
01:03:16,210 --> 01:03:17,279
We will talk to you again next week.

1505
01:03:17,340 --> 01:03:18,150
Thank you for having me.

1506
01:03:33,460 --> 01:03:36,430
Thank you for listening to this episode of Fork Around and Find Out.

1507
01:03:36,760 --> 01:03:38,920
If you like this show, please consider sharing it with

1508
01:03:38,920 --> 01:03:42,100
a friend, a coworker, a family member, or even an enemy.

1509
01:03:42,160 --> 01:03:44,290
However we get the word out about this show

1510
01:03:44,500 --> 01:03:46,750
helps it to become sustainable for the long term.

1511
01:03:46,990 --> 01:03:53,110
If you wanna sponsor this show, please go to fafo.fm/sponsor and reach

1512
01:03:53,110 --> 01:03:56,410
out to us there about what you're interested in sponsoring and how we can help.

1513
01:03:57,725 --> 01:04:00,895
We hope your systems stay available and your pagers stay quiet.

1514
01:04:01,425 --> 01:04:02,605
We'll see you again next time.