Getting to Know Kafka with Elad Eldor

Is running Kafka on-prem different from running it in the cloud? You’ll find out from Elad Eldor’s years of experience running, tuning, and troubleshooting Kafka in production environments. Elad didn’t set out to learn Kafka, but he kept asking questions and was given the opportunity to dive deep into system performance. He not only knows what all the columns of iostat mean, he also knows what his customers want. Make sure to subscribe to this topic on all your consumers.
Show Highlights
(0:00) Intro
(9:30) Why do people use Kafka?
(15:00) Learning cloud vs. on-prem
(18:30) Kafka vs. Linux troubleshooting
(27:00) Scaling clusters
(38:00) How to get started
Links Referenced
- Elad’s book: Kafka Troubleshooting in Production https://www.amazon.com/Kafka-Troubleshooting-Production-Stabilizing-premises-ebook/dp/B0CJ4FSGMD
- Systems Performance book by Brendan Gregg https://www.brendangregg.com/blog/2020-07-15/systems-performance-2nd-edition.html
- Kafka: The Definitive Guide book by Neha Narkhede, Gwen Shapira, and Todd Palino https://www.amazon.com/Kafka-Definitive-Real-Time-Stream-Processing/dp/1491936169
Sponsor
https://www.softwaredefinedtalk.com
Sponsor FAFO at https://fafo.fm/sponsor
1
00:00:00,120 --> 00:00:03,530
The correlation between RAM and disk is one of the things
2
00:00:03,540 --> 00:00:06,960
that are so simple to understand, but so hard to grasp.
3
00:00:12,890 --> 00:00:16,290
Welcome to Fork Around and Find Out, the podcast about
4
00:00:16,290 --> 00:00:19,430
building, running, and maintaining software and systems.
5
00:00:31,950 --> 00:00:34,199
Welcome back to Fork Around and Find Out.
6
00:00:34,229 --> 00:00:35,409
I am Justin Garrison.
7
00:00:35,409 --> 00:00:37,319
And with me as always is Autumn Nash.
8
00:00:37,800 --> 00:00:41,340
Today on the show, we are going to stream process this Kafka
9
00:00:41,340 --> 00:00:44,640
queue with Elad Eldor, the author of Kafka Troubleshooting
10
00:00:44,640 --> 00:00:47,300
in Production and a data ops engineer at Unity.
11
00:00:47,579 --> 00:00:48,720
Welcome to the show, Elad.
12
00:00:48,980 --> 00:00:49,920
Thank you for having me.
13
00:00:50,239 --> 00:00:54,019
We were just going into details before recording this about, uh, not only
14
00:00:54,019 --> 00:00:58,959
how cool, like your name sounds like a great character in any sort of like.
15
00:00:59,209 --> 00:01:00,420
The cool fantasy book.
16
00:01:00,450 --> 00:01:02,129
Yeah, it's, it's just fantastic.
17
00:01:02,129 --> 00:01:04,280
But then like the fact that you're working with Kafka and
18
00:01:04,280 --> 00:01:06,810
like data streams also sounds a little magical sometimes.
19
00:01:06,810 --> 00:01:09,179
Like this sounds like a spell that you're casting.
20
00:01:09,460 --> 00:01:12,220
To do cool things to all the data and like stream it.
21
00:01:12,220 --> 00:01:16,210
Like I could just see like a book cover with like a wizard like streaming data.
22
00:01:16,229 --> 00:01:16,759
Like it'd be so cool.
23
00:01:17,500 --> 00:01:19,880
The book cover is, uh, it looks like a tree.
24
00:01:19,990 --> 00:01:23,544
So yeah, it streams somewhere into the earth or something like that.
25
00:01:23,905 --> 00:01:24,285
Love it.
26
00:01:24,515 --> 00:01:27,765
Let's jump straight in. Tell me about the book. The
27
00:01:27,775 --> 00:01:29,875
title of the book is Kafka Troubleshooting in Production:
28
00:01:30,085 --> 00:01:33,385
Stabilizing Kafka Clusters in the Cloud and On-premises, and the
29
00:01:33,385 --> 00:01:36,875
first question I have is why is cloud and on premises different?
30
00:01:37,184 --> 00:01:38,925
That's a big big question.
31
00:01:39,144 --> 00:01:39,535
So we
32
00:01:39,535 --> 00:01:40,254
got a lot of time
33
00:01:40,255 --> 00:01:44,364
Yeah
34
00:01:44,365 --> 00:01:49,785
to answer this I'll start with The interesting story of how I, uh, how
35
00:01:49,785 --> 00:01:54,205
I came to be, like, uh, working with Kafka, because the intention of the
36
00:01:54,205 --> 00:02:00,005
book is for people like me, people who had no experience with Kafka or
37
00:02:00,005 --> 00:02:05,074
Linux, by the way, to get into troubleshooting. I started, like, uh, in 2016.
38
00:02:07,074 --> 00:02:08,685
Uh, working with Spark and Kafka.
39
00:02:08,955 --> 00:02:13,805
I was a backend developer working, uh, at an on-prem company that
40
00:02:13,805 --> 00:02:21,155
sells all the machines, not only the software to clients in a defense industry,
41
00:02:21,894 --> 00:02:26,305
uh, clients and all things worked, but the major issues were with Kafka.
42
00:02:26,915 --> 00:02:30,065
So after some time I had enough with, uh,
43
00:02:30,234 --> 00:02:32,195
going to DevOps, trying to get some help.
44
00:02:32,205 --> 00:02:34,564
They also didn't know Kafka very well.
45
00:02:34,564 --> 00:02:36,995
Uh, so I had to dig into myself.
46
00:02:37,380 --> 00:02:42,610
Into Kafka and turned out that this was like I had some 10 years of experience
47
00:02:42,610 --> 00:02:49,379
with open source, and nothing was as tough as Kafka, and nothing was so pivotal
48
00:02:49,390 --> 00:02:54,940
as Kafka. It's in the middle of everything; a small problem just spans over
49
00:02:54,990 --> 00:03:00,540
the whole pipeline. So then I started getting into troubles, like, one by one,
50
00:03:00,540 --> 00:03:07,205
understanding how to solve them, learning Linux on the way. I didn't only, uh,
51
00:03:07,385 --> 00:03:13,115
try to understand issues in Kafka, it was also other clusters. I became an SRE in
52
00:03:13,115 --> 00:03:21,055
charge of, uh, Presto now Trino clusters and, uh, HDFS and Spark, like on-prem?
53
00:03:21,060 --> 00:03:21,325
No, no one.
54
00:03:22,309 --> 00:03:28,260
Three distributions of Ambari that bundled together a Kafka
55
00:03:28,329 --> 00:03:31,989
Spark and whatever and no support from anyone, only logs.
56
00:03:32,420 --> 00:03:36,359
And customers are abroad to go to India and Uzbekistan, all sorts
57
00:03:36,360 --> 00:03:40,480
of weird places, which I really like, but they're still weird.
58
00:03:40,859 --> 00:03:45,329
Trying to like going, going over logs, uh, the Persian Gulf, et cetera.
59
00:03:45,329 --> 00:03:48,670
So I took Brendan's book to every place
60
00:03:48,750 --> 00:03:52,280
and it just was like all the solutions.
61
00:03:52,540 --> 00:03:56,410
were in this book, and he described it in such a way that I
62
00:03:56,420 --> 00:04:01,010
fell in love with Linux tuning and understanding bottlenecks.
63
00:04:01,530 --> 00:04:02,909
And again, it was on prem.
64
00:04:02,910 --> 00:04:08,559
So getting back to your question. On prem, you don't have
65
00:04:08,629 --> 00:04:12,829
any assistance from anyone, but you can look at the overall logs.
66
00:04:13,270 --> 00:04:16,340
And one of the problems that doesn't occur in
67
00:04:16,340 --> 00:04:18,740
the cloud is that the hardware just fails.
68
00:04:19,599 --> 00:04:23,870
You have, uh, RAM DIMMs that get defective,
69
00:04:24,460 --> 00:04:29,650
uh, disks start to burn out slowly or fast, and even one
70
00:04:29,650 --> 00:04:33,430
disk in one cluster, uh, can just destroy the whole cluster.
71
00:04:33,829 --> 00:04:38,700
That's the con, the pros are that you can create your own cluster,
72
00:04:38,929 --> 00:04:44,380
whatever RAM you want, however much storage you want, whichever
73
00:04:44,390 --> 00:04:48,089
disk type you want, and like build your own little cloud.
74
00:04:48,089 --> 00:04:48,169
That's the pro.
75
00:04:48,540 --> 00:04:53,890
But, uh, so it was a mixture both of like having access to
76
00:04:54,360 --> 00:04:58,060
everything, but seeing failures that you don't see in the cloud,
77
00:04:58,909 --> 00:05:04,690
but having the freedom to design your own cluster, or at least try
78
00:05:04,690 --> 00:05:08,350
to design your own cluster, because you really need to estimate.
79
00:05:08,450 --> 00:05:13,060
I know that you just moved on from EKS to a
80
00:05:13,060 --> 00:05:16,710
company that works with EKS clusters on prem.
81
00:05:16,720 --> 00:05:16,740
Yeah.
82
00:05:17,219 --> 00:05:22,179
The issue is that, like, designing a cluster that will fit
83
00:05:22,179 --> 00:05:25,950
into the budget and not cost you a lot of money is something
84
00:05:26,070 --> 00:05:29,809
that is very hard to convince people of, and also very hard to estimate.
85
00:05:30,409 --> 00:05:33,630
Because traffic today is X, tomorrow it's 2X or half
86
00:05:33,630 --> 00:05:37,830
X. So these are challenges I didn't face in the cloud.
87
00:05:38,220 --> 00:05:42,249
It's the best school for becoming an SRE.
88
00:05:43,000 --> 00:05:47,819
In every open source, you have full accessibility to all the tools of Linux.
89
00:05:47,900 --> 00:05:50,170
Like, you can see all the metrics in the disk.
90
00:05:50,620 --> 00:05:53,820
And like, create your own monitoring, what AWS
91
00:05:53,820 --> 00:05:57,150
engineers do in the data centers, you just do yourself,
92
00:05:57,530 --> 00:05:58,540
which is good and bad, right?
93
00:05:58,540 --> 00:06:00,360
I mean, it's like, that's all the responsibility
94
00:06:00,460 --> 00:06:04,270
of now I need a, uh, reliable log engine, right?
95
00:06:04,270 --> 00:06:06,610
Like those sorts of things are just like, oh, this, this sucks.
96
00:06:06,610 --> 00:06:08,650
But also a lot of.
97
00:06:09,234 --> 00:06:13,474
Log aggregators are better than CloudWatch logs, right?
98
00:06:13,474 --> 00:06:15,684
CloudWatch is pretty terrible when you look at anything you can run
99
00:06:15,684 --> 00:06:19,354
yourself, but there's more responsibility to run something yourself.
100
00:06:19,484 --> 00:06:21,815
I also think that it's like, do you have the time
101
00:06:22,084 --> 00:06:25,125
and the team and the ability to run it yourself?
102
00:06:25,274 --> 00:06:26,645
Like, I think when you can.
103
00:06:27,210 --> 00:06:30,069
Sit there and build something and run it yourself.
104
00:06:30,090 --> 00:06:32,599
It gives you that, like, ability to really understand it.
105
00:06:32,609 --> 00:06:36,369
So you end up, you end up in a better situation, but you have to have the
106
00:06:36,379 --> 00:06:39,999
time to spend and the ability to really run it yourself and do all that, you
107
00:06:39,999 --> 00:06:40,159
know?
108
00:06:40,179 --> 00:06:41,570
That's exactly how this conversation started, right?
109
00:06:41,570 --> 00:06:43,210
Like no one knew how Kafka worked.
110
00:06:43,599 --> 00:06:46,010
And so Elad's like, I'm jumping in and we're going to, I'm going to
111
00:06:46,010 --> 00:06:49,950
spend, I'm going to invest the time to accidentally make this my career.
112
00:06:51,400 --> 00:06:52,250
Exactly.
113
00:06:52,330 --> 00:06:53,969
So when I moved to.
114
00:06:54,280 --> 00:06:55,049
To the cloud.
115
00:06:55,370 --> 00:07:01,320
I worked at a company that had many sites, so many different deployments.
116
00:07:01,719 --> 00:07:04,710
Not only of Kafka, of consumers and producers.
117
00:07:04,710 --> 00:07:10,370
So I, I saw a vast amount of, uh, like compared to like
118
00:07:10,740 --> 00:07:14,370
an ordinary SRE, uh, because there were many sites.
119
00:07:14,370 --> 00:07:20,203
I saw so many cases of producer errors, consumer errors, Kafka errors, hardware
120
00:07:20,203 --> 00:07:24,299
errors, like whatever. I built, like, I designed clusters with a
121
00:07:24,364 --> 00:07:29,864
mixture of different disk types, different DIMMs within the same machine.
122
00:07:29,924 --> 00:07:33,234
Like, I learned how to combine DIMMs in two
123
00:07:33,514 --> 00:07:36,264
different types of DIMMs at different sizes.
124
00:07:36,284 --> 00:07:39,104
Like how you saw it in the, in the J board
125
00:07:39,605 --> 00:07:41,914
of the, of the DIMMs was really crazy.
126
00:07:42,355 --> 00:07:46,085
So after like three years, is that, is that for performance you're splitting up?
127
00:07:46,145 --> 00:07:50,075
No, that's part of the issue on premises, that you have a cluster.
128
00:07:50,700 --> 00:07:54,050
And it's now at the customer's side, in its DC.
129
00:07:54,530 --> 00:07:55,960
Now, you lack storage.
130
00:07:56,420 --> 00:08:00,130
So, what do you do when you lack storage in the JBOD? You have
131
00:08:00,130 --> 00:08:06,500
a JBOD of disks and you have 24 disks of, uh, let's say, uh, 4TB.
132
00:08:06,680 --> 00:08:10,460
So, are you going to throw away all the disks because you
133
00:08:10,490 --> 00:08:15,125
need 16 terabyte disks or 32 or whatever? The customer will,
134
00:08:15,234 --> 00:08:18,965
will get mad that you throw away the disks that he paid for.
135
00:08:19,355 --> 00:08:21,655
Although it makes sense, you cannot really do it.
136
00:08:22,184 --> 00:08:25,144
So you can, but you can say to him, okay, I'll throw half of the
137
00:08:25,144 --> 00:08:29,284
disk that you bought, and I'll put in 16 terabyte disks, but then
138
00:08:29,294 --> 00:08:33,954
how do you make a JBOD work with 4 terabyte disks and 16 terabytes?
139
00:08:33,965 --> 00:08:36,515
So there is a way of doing that.
140
00:08:37,105 --> 00:08:41,715
Like you do 4TB, 16, 4TB, 4 or 8TB.
141
00:08:41,985 --> 00:08:43,135
Same goes with DIMMs.
142
00:08:43,505 --> 00:08:52,170
You have 24, a box of 24 DIMMs, but you need triple that size.
143
00:08:52,199 --> 00:08:54,120
You are not going to throw away all of it because
144
00:08:54,130 --> 00:08:56,390
the customer will not get it, will not accept it.
145
00:08:56,460 --> 00:08:59,240
I learned that there are several types of customers, by the way,
146
00:08:59,670 --> 00:09:03,920
customers in a, I don't want to mention geographical places, but some
147
00:09:03,920 --> 00:09:08,250
customers have, they, they get really attached to their DIMMs and disks.
148
00:09:08,260 --> 00:09:11,640
Uh, and they like a lot of machines.
149
00:09:11,995 --> 00:09:13,565
And a lot of disks and, whatever.
150
00:09:13,585 --> 00:09:16,125
So, uh, so sometimes you need to mix.
151
00:09:16,285 --> 00:09:20,655
And Kafka's it's one of those pieces of software that seems
152
00:09:20,655 --> 00:09:23,435
to work its way in a lot of places in a lot of industries.
153
00:09:23,445 --> 00:09:28,365
It's because it's, it's a generic sort of stream processor and, and
154
00:09:28,365 --> 00:09:31,364
pub sub people are just like, I'm just going to use it for whatever I want.
155
00:09:32,025 --> 00:09:32,944
In this case, right?
156
00:09:32,944 --> 00:09:35,494
And there's, and there are other tools that kind of do that.
157
00:09:35,535 --> 00:09:38,665
Um, some of the older ones like ZooKeeper and whatnot are known to be
158
00:09:38,665 --> 00:09:42,814
a little worse, uh, to operate and try to use or have less features.
159
00:09:43,345 --> 00:09:45,555
What is the most common scenario that you're
160
00:09:45,555 --> 00:09:48,264
like, someone uses Kafka for this purpose?
161
00:09:48,295 --> 00:09:50,014
And what's the use case where you see someone that like
162
00:09:50,074 --> 00:09:52,444
puts Kafka in place and that they shouldn't have used Kafka?
163
00:09:52,824 --> 00:09:55,055
I think that, uh, I don't have.
164
00:09:55,585 --> 00:09:59,775
Any example for why not to use Kafka, if you want
165
00:09:59,805 --> 00:10:03,075
to write something that many consumers will read?
166
00:10:03,215 --> 00:10:06,434
For example, or many producers, many consumers, like
167
00:10:06,445 --> 00:10:09,235
N to N, one to N, N to one, like whatever.
168
00:10:09,555 --> 00:10:11,085
I can't think of any example.
169
00:10:11,660 --> 00:10:12,819
Why not using it?
170
00:10:13,290 --> 00:10:17,990
Kafka is a Swiss army knife of like streaming data because it's open source.
171
00:10:18,000 --> 00:10:21,219
So, so many people have made their own version of Kafka, but
172
00:10:21,219 --> 00:10:23,930
it's really just Kafka, which means it's like Kubernetes.
173
00:10:23,949 --> 00:10:25,969
Like you can go from one place to another and
174
00:10:26,240 --> 00:10:28,490
people know that if they do know it, right?
175
00:10:28,490 --> 00:10:31,640
Like if they know some sort of streaming, it's going to be that.
176
00:10:31,640 --> 00:10:35,420
So then people just use different versions of it because it's the most common.
177
00:10:35,430 --> 00:10:38,090
Like, you know, people use Kubernetes because they have that like.
178
00:10:38,260 --> 00:10:39,280
There's the learning.
179
00:10:39,290 --> 00:10:41,219
There's the different projects that are built on it.
180
00:10:41,400 --> 00:10:42,930
The ecosystem of the community.
181
00:10:42,939 --> 00:10:43,640
You know what I mean?
182
00:10:43,790 --> 00:10:46,360
And Kafka is like the ecosystem of like
183
00:10:46,439 --> 00:10:48,949
streaming because everybody uses it everywhere.
184
00:10:48,949 --> 00:10:51,459
So if you're going to do the struggle, you might as well
185
00:10:51,459 --> 00:10:55,019
do the one that has the most ecosystem to support whatever
186
00:10:55,019 --> 00:10:56,920
you're about to like the ride you're about to go on.
187
00:10:57,510 --> 00:11:00,089
And for a long time, streaming was new everywhere.
188
00:11:00,100 --> 00:11:01,689
Like streaming that much data.
189
00:11:01,930 --> 00:11:03,860
We were all trying to figure it out at the same time.
190
00:11:04,319 --> 00:11:07,300
Before that, I used all sorts of streaming open source.
191
00:11:07,329 --> 00:11:09,420
The last one before Kafka was HornetQ.
192
00:11:09,790 --> 00:11:12,310
But before that, there was some other open sources.
193
00:11:12,310 --> 00:11:16,239
And there always was an open source that replaced the last one.
194
00:11:16,589 --> 00:11:21,999
Since Kafka, Kafka became like a synonym for moving data from X to Y.
195
00:11:22,169 --> 00:11:25,269
And it works very well with small traffic, by the way.
196
00:11:25,419 --> 00:11:26,249
Really, really well.
197
00:11:26,629 --> 00:11:28,189
And that's one of the problems with it.
198
00:11:28,479 --> 00:11:34,480
Because once you grow, at some point, you get into problems that are much harder
199
00:11:34,900 --> 00:11:40,069
than what you anticipated, but then you're already locked in and Kafka makes
200
00:11:40,069 --> 00:11:45,979
it really tough. Confluent exists for a reason, let's say, or Aiven,
201
00:11:45,999 --> 00:11:52,625
or, like, whatever. But, like, uh, it's really hard to understand what happens.
202
00:11:52,955 --> 00:11:53,975
Can you describe that more?
203
00:11:53,985 --> 00:11:58,195
Like what, at what point do I say, I only had a little bit of traffic and
204
00:11:58,195 --> 00:12:02,145
now I have a lot of traffic and I need to rearchitect or change how I use it.
205
00:12:02,324 --> 00:12:04,255
No matter how you architect it.
206
00:12:04,605 --> 00:12:10,055
It, I, I noticed that I, I started at the rates of, like, 10K,
207
00:12:10,055 --> 00:12:15,215
5K. At 5K, 10K messages per second, I didn't see any environment that had issues.
208
00:12:15,745 --> 00:12:20,294
But starting from 100k, you start to suffer from consumer
209
00:12:20,295 --> 00:12:24,685
lags, or producers that get the buffer full.
210
00:12:25,185 --> 00:12:27,974
And data skew really becomes an issue.
211
00:12:28,074 --> 00:12:31,935
And skew in the storage of disks, mainly, by the
212
00:12:31,935 --> 00:12:34,425
way, if you work with several disks per broker.
213
00:12:34,435 --> 00:12:39,380
So you have a skew per disk, you have a skew per the storage in the brokers.
214
00:12:40,090 --> 00:12:41,560
I mean, in the book, I, uh,
215
00:12:41,720 --> 00:12:43,490
Is that skew for delivery time?
216
00:12:43,520 --> 00:12:47,600
Or is that like just how long it takes from the producer to get to the consumer?
217
00:12:47,659 --> 00:12:49,519
No, no, not latency.
218
00:12:49,520 --> 00:12:50,400
I'm not meaning that.
219
00:12:50,400 --> 00:12:55,869
I mean, I mean, like, traffic skew, like, the number of messages per partition.
220
00:12:56,005 --> 00:12:59,715
Per topic, and the skew in the amount of leaders.
221
00:13:00,735 --> 00:13:03,375
So partitions are, so, it's like a wild
222
00:13:03,375 --> 00:13:06,135
ride, like learning how to do that properly,
223
00:13:07,035 --> 00:13:08,385
partitioning correctly.
224
00:13:08,385 --> 00:13:10,870
So partitioning by, do you want to do round robin?
225
00:13:11,050 --> 00:13:12,630
So you have a, an equal amount.
226
00:13:13,335 --> 00:13:16,495
of data, or do you want to have a better aggregation ratio?
227
00:13:16,495 --> 00:13:19,225
So in that case, this is really bad to go round robin.
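To make that trade-off concrete, here is a minimal Python sketch of the two partitioning strategies being discussed; the four-partition topic, the key names, and the 80/20 traffic split are all invented for illustration:

```python
from collections import Counter

def assign_partitions(keys, num_partitions, strategy):
    """Count how many messages land on each partition under a strategy.

    round_robin: spread messages evenly, ignoring keys.
    by_key: hash the key, so equal keys co-locate (better aggregation,
    but a hot key skews one partition).
    """
    counts = Counter({p: 0 for p in range(num_partitions)})
    for i, key in enumerate(keys):
        if strategy == "round_robin":
            part = i % num_partitions
        else:
            part = hash(key) % num_partitions
        counts[part] += 1
    return counts

# Invented workload: one hot key produces 80% of the traffic.
keys = ["hot_key"] * 80 + [f"key-{i}" for i in range(20)]
rr = assign_partitions(keys, 4, "round_robin")
bk = assign_partitions(keys, 4, "by_key")

# Round-robin balances perfectly (25 per partition); keyed partitioning
# piles at least the 80 hot-key messages onto a single partition.
rr_spread = max(rr.values()) - min(rr.values())
bk_hotspot = max(bk.values())
```

Real Kafka producers pick a partition via their configured partitioner rather than a plain `hash()`; the point here is only the skew mechanics being described.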
228
00:13:19,675 --> 00:13:22,365
And what happens when you have the same number of
229
00:13:22,405 --> 00:13:25,795
partitions per broker, but you have a skew in the leaders?
230
00:13:26,155 --> 00:13:29,245
Uh, the, the main issue when I started was leaders that
231
00:13:29,274 --> 00:13:35,020
became non-leaders, like, just leaders that are lost, and you
232
00:13:35,430 --> 00:13:36,969
is that from a network partition that they
233
00:13:36,969 --> 00:13:39,170
forgot their leaders or do they get no, no,
234
00:13:39,170 --> 00:13:40,199
just a leader
235
00:13:40,199 --> 00:13:40,900
partition.
236
00:13:40,980 --> 00:13:41,989
Now it is not a
237
00:13:41,990 --> 00:13:42,890
leader partition
238
00:13:43,299 --> 00:13:46,130
distributed systems and just trying to do that with data like I.
239
00:13:46,439 --> 00:13:49,030
What you're saying is so true because I think when people want
240
00:13:49,030 --> 00:13:51,790
you to architect a system, they're just like, do magical things
241
00:13:51,790 --> 00:13:54,560
and then account for the, like the scale, but you can't, right?
242
00:13:54,560 --> 00:13:58,020
Because what some qualities that you really want at a smaller scale
243
00:13:58,020 --> 00:14:01,500
that makes it more efficient, you can't do that at a bigger scale.
244
00:14:01,500 --> 00:14:03,199
Like it just, at some point you're going to
245
00:14:03,199 --> 00:14:05,669
have to re architect and rethink of things.
246
00:14:05,669 --> 00:14:10,439
And it's like leaders are going to die or switch and just, you have to do that.
247
00:14:10,699 --> 00:14:11,329
You know what I mean?
248
00:14:11,329 --> 00:14:14,430
Like it's, there's no way to skip that.
249
00:14:15,015 --> 00:14:19,385
It's not even re-architecting, because you can have topics at
250
00:14:19,485 --> 00:14:23,094
millions per sec at produce rate and you can have multiple consumers
251
00:14:23,094 --> 00:14:28,544
on the topic and not have leaders getting lost while you can
252
00:14:28,544 --> 00:14:32,994
have a topic at one tenth of the size and have leaders get lost.
253
00:14:33,254 --> 00:14:37,495
It depends, like, on how you monitor your Kafka and what you're
254
00:14:37,495 --> 00:14:40,685
doing, the producer side and the consumer side, you can have
255
00:14:40,735 --> 00:14:44,665
everything right on the Kafka infrastructure, but having some
256
00:14:44,665 --> 00:14:48,385
amount of consumers can cripple down the broker, the cluster.
257
00:14:48,989 --> 00:14:54,550
You can have 20 brokers and one faulty disk can cripple the whole cluster.
258
00:14:54,959 --> 00:15:00,740
Kafka is so vulnerable compared to, I work at high scale with
259
00:15:00,780 --> 00:15:05,329
Druid and Trino and Spark and Kafka is just a different animal.
260
00:15:05,959 --> 00:15:12,359
It is so important and pivotal and vulnerable at the same time, both on prem.
261
00:15:13,325 --> 00:15:14,145
And on the cloud,
262
00:15:14,295 --> 00:15:17,385
what do you think was your hardest kind of mountain that you had
263
00:15:17,385 --> 00:15:20,704
to climb, learning? Was the cloud or on prem the hardest for you?
264
00:15:20,974 --> 00:15:25,235
The cloud, because, because, like, the on prem, it was
265
00:15:25,235 --> 00:15:29,185
hard because I didn't know systems performance, but while reading Brendan's
266
00:15:29,214 --> 00:15:33,115
book and the troubleshooting stuff along the way, it's small traffic.
267
00:15:33,815 --> 00:15:38,675
Then once I moved to the cloud and it becomes topics of a million per second.
268
00:15:39,040 --> 00:15:42,550
When I started working at ironSource, then, before it
269
00:15:42,550 --> 00:15:46,100
got merged with Unity, the problems were different.
270
00:15:46,120 --> 00:15:49,679
Like, when you go 10x in the traffic, you see
271
00:15:50,180 --> 00:15:54,260
problems that you just don't see at a small traffic.
272
00:15:54,579 --> 00:15:57,345
And this was tough because I worked on prem first,
273
00:15:57,345 --> 00:16:01,584
yes, for four years, and then moved to the cloud.
274
00:16:01,834 --> 00:16:07,895
At huge clusters, you have, like, 100 monitoring dashboards.
275
00:16:08,395 --> 00:16:13,384
And when you have some problem, you need to just correlate a subset of these 100.
276
00:16:13,404 --> 00:16:17,985
But I spent tens of hours on understanding each problem.
277
00:16:18,495 --> 00:16:22,555
There were more interesting problems, but I think it was tougher.
278
00:16:23,454 --> 00:16:27,115
Moving to the cloud was tougher just because of so much
279
00:16:27,115 --> 00:16:30,315
monitoring and so much traffic compared to on prem.
280
00:16:30,755 --> 00:16:33,204
Do you think the abstraction made it harder or was it
281
00:16:33,204 --> 00:16:37,394
just the fact that of the monitoring and the more traffic?
282
00:16:37,834 --> 00:16:41,105
I think the more traffic, yeah, the, the, the more traffic you
283
00:16:41,115 --> 00:16:44,665
have, it's not just traffic, not, not, not just number of messages.
284
00:16:44,675 --> 00:16:49,225
It's more producers, more consumers, more combinations of features like
285
00:16:49,225 --> 00:16:55,944
compacted topics along with, uh, many consumers and many producers.
286
00:16:56,460 --> 00:17:01,880
Trying to have a small cluster to sustain high traffic, because on prem,
287
00:17:01,920 --> 00:17:07,279
when you sell a customer a cluster, big data cluster, analytics cluster,
288
00:17:07,279 --> 00:17:14,384
so usually companies, like companies sell the amount of machines for Kafka.
289
00:17:14,675 --> 00:17:17,385
There are many more machines for Kafka than for other, uh,
290
00:17:17,564 --> 00:17:21,575
data tools, because like managers are really afraid of,
291
00:17:21,595 --> 00:17:25,405
of problems in Kafka when they sell a system as a whole.
292
00:17:25,964 --> 00:17:29,004
So usually you have a lot of brokers compared to the amount
293
00:17:29,064 --> 00:17:33,444
of traffic, but on the cloud, you can shrink the cluster.
294
00:17:33,704 --> 00:17:38,024
So you, you have, usually you have much less, uh,
295
00:17:38,304 --> 00:17:42,904
compute power in the cloud to sustain much more traffic.
296
00:17:43,364 --> 00:17:44,795
I think it makes the problem.
297
00:17:45,460 --> 00:17:49,610
Harder, but throwing money on the problem also doesn't solve the problem.
298
00:17:49,879 --> 00:17:51,389
Are you talking about how money solves all technical
299
00:17:51,399 --> 00:17:51,480
problems?
300
00:17:51,480 --> 00:17:53,409
I was going to say, can you say that one more time so they can hear you?
301
00:17:55,860 --> 00:18:00,970
One of, I'll repeat the Systems Performance book again, one of the things
302
00:18:00,970 --> 00:18:05,540
that Brendan showed is that you can save a lot of money when you
303
00:18:05,540 --> 00:18:09,960
understand bottlenecks, not only in Kafka, but understanding storage.
304
00:18:10,215 --> 00:18:15,935
Whether the bottleneck is storage, RAM, or disk IOPS or throughput is something
305
00:18:15,935 --> 00:18:20,794
that once you understand it and you can communicate it to the stakeholders,
306
00:18:21,604 --> 00:18:26,694
then it becomes much easier to reduce cost, which is, by the way,
307
00:18:26,695 --> 00:18:33,120
a problem in on prem, because you need to guess what the load will be.
308
00:18:34,280 --> 00:18:37,060
It's hard to guess it, while in the cloud.
309
00:18:37,060 --> 00:18:38,100
It's, it's easier.
310
00:18:38,310 --> 00:18:40,179
Your book is focused on troubleshooting.
311
00:18:40,389 --> 00:18:43,270
Are there specific tools you're using to troubleshoot Kafka?
312
00:18:43,270 --> 00:18:46,000
Like everything you've described sounds like
313
00:18:46,079 --> 00:18:48,759
mostly traditional Linux troubleshooting, right?
314
00:18:48,770 --> 00:18:49,949
You're looking for disk pressure.
315
00:18:49,949 --> 00:18:51,800
You're looking for RAM usage.
316
00:18:52,280 --> 00:18:54,940
What does Kafka troubleshooting look like?
317
00:18:54,950 --> 00:18:58,060
How is it different than like standard Linux?
318
00:18:58,480 --> 00:18:59,230
Performance tuning
319
00:18:59,770 --> 00:19:04,090
it is very similar to standard, uh, Linux, uh, tuning or,
320
00:19:04,215 --> 00:19:08,840
or tools. Like, the, the first tool I used was, uh, iostat,
321
00:19:09,520 --> 00:19:13,170
like the utilization, the %util column in iostat.
322
00:19:13,170 --> 00:19:18,660
I think it's the most important column in a, in every IO based, uh, database.
323
00:19:18,810 --> 00:19:21,000
But then I discovered all the other columns,
324
00:19:21,030 --> 00:19:23,495
which are also really great in iostat.
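As a sketch of what that utilization number means: on Linux, iostat derives %util from the io_ticks counter in /proc/diskstats, the tenth field after the device name, in milliseconds of time with I/O in flight. The two sample lines below are synthetic, not taken from a real machine:

```python
def disk_util_pct(line_t0, line_t1, interval_s):
    """Approximate iostat's %util from two /proc/diskstats snapshots.

    Each line is: major minor device, then 11 counters; the 10th
    counter (whole-line index 12) is io_ticks, the milliseconds the
    device spent with at least one I/O in flight.
    """
    io_ticks_0 = int(line_t0.split()[12])
    io_ticks_1 = int(line_t1.split()[12])
    busy_ms = io_ticks_1 - io_ticks_0
    return 100.0 * busy_ms / (interval_s * 1000.0)

# Synthetic samples taken one second apart: io_ticks grew 5000 -> 5850,
# so the disk was busy for 850 of the 1000 elapsed milliseconds.
t0 = "8 0 sda 120 0 960 40 300 0 2400 200 0 5000 240"
t1 = "8 0 sda 180 0 1440 70 450 0 3600 380 0 5850 450"
util = disk_util_pct(t0, t1, 1.0)  # 85.0
```

A disk pinned near 100% here is the saturation signal he keeps coming back to: every Kafka read or write behind it starts queueing.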
325
00:19:23,835 --> 00:19:26,920
So iostat is my first, uh, tool, and then.
326
00:19:28,094 --> 00:19:35,084
You have, uh, you have vmstat, uh, to, to understand the, the CPU, uh, the CPU,
327
00:19:35,084 --> 00:19:40,505
uh, usage, but then you need to understand like, what's the difference in Kafka.
328
00:19:40,505 --> 00:19:44,894
It's very important to distinguish between the various types of
329
00:19:44,894 --> 00:19:51,635
CPU usage. Like, a third of my book is just a summary, a very short summary
330
00:19:51,735 --> 00:19:57,455
of what Brendan wrote about, uh, storage, RAM, CPUs. And CPUs have
331
00:19:57,455 --> 00:20:03,024
system time, uh, iowait, user time, and interrupts or context switches.
332
00:20:03,635 --> 00:20:08,244
Per each of them, uh, I, I can give an example of how Kafka can
333
00:20:08,244 --> 00:20:13,595
cripple down if some type of CPU metric goes up even, even a bit.
334
00:20:14,014 --> 00:20:16,805
And on the RAM, which is the most interesting part,
335
00:20:16,805 --> 00:20:19,754
it took me, I think, two years to understand it.
336
00:20:20,035 --> 00:20:21,615
Uh, the way that Kafka uses RAM.
337
00:20:22,355 --> 00:20:27,605
Made me understand how RAM works in, in, in Linux and mainly the page cache.
338
00:20:28,165 --> 00:20:33,265
So the correlation between RAM and disk is one of the things that are so simple.
339
00:20:33,889 --> 00:20:38,480
To understand, but so hard to grasp and it causes so many production issues,
340
00:20:38,720 --> 00:20:44,770
mainly for, for clusters that use Kafka as a database, that save several days for
341
00:20:44,779 --> 00:20:49,770
replay, and then start to replay data, and then producers get stuck.
342
00:20:50,320 --> 00:20:56,370
And that's mainly because of, uh, because of RAM, like Kafka just uses the page
343
00:20:56,410 --> 00:21:01,310
cache mechanism, where you read what you write, but if you are late in reading.
344
00:21:01,885 --> 00:21:06,404
Then you start thrashing the RAM and hurting consumers and producers.
345
00:21:06,634 --> 00:21:09,954
And by the way, in on prem, if you are late in reading
346
00:21:10,824 --> 00:21:14,405
and you start reading from the disk, then the disks get,
347
00:21:14,734 --> 00:21:17,655
get hammered because of that and then they just fail.
348
00:21:18,294 --> 00:21:23,534
So consumer lag in on prem can fail, can cause failure of disks.
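The consumer lag he's describing is simple arithmetic: per partition, the log end offset minus the consumer group's committed offset. A minimal sketch, with invented offsets for a hypothetical three-partition topic:

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: how far the consumer trails the log head.

    A lag that keeps growing means the consumer is reading ever-older
    segments, which eventually miss the page cache and hit the disks.
    """
    return {p: end_offsets[p] - committed_offsets.get(p, 0)
            for p in end_offsets}

# Invented snapshot: partition 2 is falling badly behind.
end = {0: 1_000, 1: 1_050, 2: 9_000}
committed = {0: 998, 1: 1_050, 2: 4_200}
lag = consumer_lag(end, committed)       # {0: 2, 1: 0, 2: 4800}
worst_partition = max(lag, key=lag.get)  # partition 2
```

In practice these offsets come from the broker (e.g. via consumer group tooling); the point is that lag is a per-partition distance, and the on-prem failure mode above starts when that distance outgrows what the page cache covers.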
349
00:21:24,405 --> 00:21:27,655
And this correlation is just, uh, like, so,
350
00:21:27,905 --> 00:21:30,154
so Brendan wrote a tool called cachestat.
351
00:21:30,820 --> 00:21:32,560
To show the hit and miss ratio.
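The headline number a page-cache tool like cachestat reports is just hits over total lookups; a small sketch with made-up counters, contrasting a consumer tailing fresh data with one replaying old segments:

```python
def hit_ratio_pct(hits, misses):
    """Page cache hit ratio: the fraction of reads served from RAM
    instead of going down to the disks."""
    total = hits + misses
    return 100.0 * hits / total if total else 100.0

# Invented counters: a consumer reading freshly produced data stays in
# the page cache; one replaying days-old segments mostly misses it.
tailing_consumer = hit_ratio_pct(99_500, 500)       # 99.5
replaying_consumer = hit_ratio_pct(12_000, 88_000)  # 12.0
```

A ratio collapsing like the second case is the RAM-to-disk correlation from the start of the episode showing up in one number.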
352
00:21:33,090 --> 00:21:39,090
So, like, these three tools, like, uh, iostat, uh, vmstat slash top
353
00:21:39,830 --> 00:21:45,500
and cachestat, they can solve a big part of Kafka
354
00:21:46,240 --> 00:21:50,950
production issues, along with, you know, using Grafana,
355
00:21:51,440 --> 00:21:55,280
because there are several other metrics that are not Linux based
356
00:21:55,310 --> 00:21:56,360
from Kafka itself.
357
00:21:56,360 --> 00:21:56,480
Yeah.
358
00:21:57,155 --> 00:21:57,495
Yeah.
359
00:21:57,635 --> 00:22:01,205
And there are several metrics, like network
360
00:22:01,425 --> 00:22:05,534
processing threads, for example, where you
361
00:22:05,534 --> 00:22:08,644
could say, okay, if it's high, someone is suffering, consumers
362
00:22:08,644 --> 00:22:11,754
or producers of the cluster, but someone is not feeling good.
363
00:22:11,754 --> 00:22:11,784
It's
364
00:22:12,135 --> 00:22:15,040
like you're reminding me so much of how Linux, the
365
00:22:15,040 --> 00:22:17,860
operating system is a lot like a distributed system.
366
00:22:18,360 --> 00:22:20,449
Like everything you just described is basically
367
00:22:20,449 --> 00:22:23,199
a series of queues that Linux is managing.
368
00:22:23,209 --> 00:22:24,919
And it's just like any distributed system
369
00:22:24,919 --> 00:22:26,499
where like, Oh, I have my cache over there.
370
00:22:26,499 --> 00:22:29,709
I got my, you know, my Redis instance, my web server, whatever, like some,
371
00:22:29,739 --> 00:22:33,700
something processing, like that's all happening within the OS as well.
372
00:22:33,730 --> 00:22:38,320
Managing all of the hardware, RAM, CPU, back pressure, disk pressure,
373
00:22:38,330 --> 00:22:41,670
all that stuff is, is basically what we do on distributed systems too.
374
00:22:42,520 --> 00:22:46,649
How does that then relate to Kafka, which is also a distributed system, right?
375
00:22:46,649 --> 00:22:52,580
Like you're doing all of those low level things on tens or twenties
376
00:22:52,580 --> 00:22:56,489
or, you know, hundreds of machines for like a large Kafka cluster.
377
00:22:57,339 --> 00:22:59,129
How do you then correlate those things between,
378
00:22:59,129 --> 00:23:01,369
is that all just, you have to have some external
379
00:23:01,860 --> 00:23:04,720
Grafana dashboard that's pulling higher-level metrics,
380
00:23:04,720 --> 00:23:06,930
and then you dig in to find the hotspots and problems.
381
00:23:07,260 --> 00:23:12,360
Yeah, so Grafana, really, without it, in a cluster
382
00:23:12,360 --> 00:23:16,150
with millions of messages per second, you can't diagnose anything.
383
00:23:16,240 --> 00:23:20,944
If you have such a cluster, my recommendation is setting Grafana with
384
00:23:21,245 --> 00:23:26,735
metrics of CPU, and have a distinction of CPU usage per broker over time
385
00:23:27,255 --> 00:23:32,595
of CPU system time, I/O wait time, user time, and interrupts and context switches.
386
00:23:32,644 --> 00:23:33,965
I will later explain why.
387
00:23:34,375 --> 00:23:35,465
And then another dashboard.
388
00:23:35,475 --> 00:23:36,785
This is the CPU dashboard.
389
00:23:36,854 --> 00:23:38,134
And also load average.
390
00:23:38,135 --> 00:23:39,675
Load average is super important.
391
00:23:39,684 --> 00:23:44,085
Normalized load average, like taking the number of
392
00:23:44,565 --> 00:23:47,495
currently running tasks, which you can get from vmstat.
393
00:23:47,875 --> 00:23:51,875
And dividing it by the number of CPU cores after hyperthreading.
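[Editor's note: the normalization Elad describes can be sketched in a few lines — runnable tasks (the `r` column of `vmstat`, or the first field of `/proc/loadavg`) divided by logical core count; the sample numbers are illustrative:]

```python
import os

def normalized_load(running_tasks, cores=None):
    """Load average divided by logical CPU count (after hyperthreading).

    ~1.0 means the runnable queue matches the core count; values well
    above 1.0 mean tasks are queueing for CPU.
    """
    cores = cores or os.cpu_count() or 1
    return running_tasks / cores

# 48 runnable tasks on a 24-core (hyperthreaded) broker is 2x saturated:
assert normalized_load(48, cores=24) == 2.0
```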
394
00:23:52,215 --> 00:23:53,414
So that's the CPU part.
395
00:23:53,415 --> 00:23:55,574
Then you have the storage part, where you can see
396
00:23:55,615 --> 00:23:59,254
the disk, the IOPS, the read IOPS and write IOPS.
397
00:23:59,294 --> 00:24:01,954
Make sure that you don't have a lot of read IOPS; if you have spikes of
398
00:24:01,954 --> 00:24:05,884
read IOPS, then probably you're thrashing the page cache.
399
00:24:06,284 --> 00:24:10,195
And storage distribution between the brokers,
400
00:24:10,264 --> 00:24:12,815
even not per topic, just storage distribution.
401
00:24:13,365 --> 00:24:16,224
And then the distribution of partitions,
402
00:24:16,254 --> 00:24:18,915
followers and leaders between the brokers.
403
00:24:19,295 --> 00:24:20,645
So that's the storage part.
404
00:24:21,065 --> 00:24:25,380
On the RAM part, it's really hard to monitor, because it's
405
00:24:25,380 --> 00:24:30,290
very difficult to monitor the page cache, whether you thrash it or not,
406
00:24:30,650 --> 00:24:36,490
and number of messages of course, per traffic, in and out per broker,
407
00:24:36,840 --> 00:24:41,950
because this can help you to understand whether you have a data skew.
408
00:24:42,529 --> 00:24:47,050
And data skew is one, I'm not saying it's the root of all evils.
409
00:24:47,594 --> 00:24:52,989
But it's one of the evils in, in every database, especially Kafka.
410
00:24:52,989 --> 00:24:57,705
Like today, for example, I ran into a problem of a
411
00:24:57,715 --> 00:25:02,614
broker crashing, and the follower was really late.
412
00:25:02,715 --> 00:25:07,095
It was 15 minutes late on its offsets, and the
413
00:25:07,095 --> 00:25:10,375
consumer got an offset that was from 15 minutes before.
414
00:25:10,935 --> 00:25:16,075
And we looked at the possible reasons why it is not an in-sync replica.
415
00:25:16,720 --> 00:25:19,210
But then when you look at the dashboard, you just see a skew.
416
00:25:20,095 --> 00:25:22,655
On the brokers, but the number of
417
00:25:22,655 --> 00:25:25,805
partitions is the same in all of the cluster.
418
00:25:25,995 --> 00:25:27,945
So what is the reason for that?
419
00:25:28,365 --> 00:25:33,174
And the reason is something that every Kafka owner should have monitoring for.
420
00:25:33,555 --> 00:25:35,684
Not only the number of partitions per broker,
421
00:25:35,725 --> 00:25:39,274
but the number of leaders per topic per broker.
422
00:25:39,274 --> 00:25:43,365
Because if you create the cluster and then just migrate all your topics,
423
00:25:43,465 --> 00:25:46,295
No matter if they are big or small, what Kafka will do,
424
00:25:46,475 --> 00:25:49,455
That's again what makes Kafka so hard to maintain.
425
00:25:49,825 --> 00:25:52,275
Kafka will not say, okay, I'll just distribute the big
426
00:25:52,325 --> 00:25:54,845
topic, then the medium topic, then the small topic.
427
00:25:54,845 --> 00:25:57,274
It will just distribute it so it will have the same
428
00:25:57,274 --> 00:26:01,145
number of partitions per broker, regardless of their size.
429
00:26:01,754 --> 00:26:04,144
And then all sorts of problems can occur.
430
00:26:04,284 --> 00:26:06,335
Then you look at the cluster and say, okay, I
431
00:26:06,335 --> 00:26:08,795
have the same number of partitions per broker, but the traffic is not
432
00:26:08,805 --> 00:26:13,755
distributed, the number of messages is not distributed.
433
00:26:13,755 --> 00:26:17,155
The storage is not distributed well, and that's
434
00:26:17,165 --> 00:26:19,924
because you have a broker with a lot of leaders.
435
00:26:20,155 --> 00:26:24,084
It gets really hard for it to replicate its followers.
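[Editor's note: the monitoring Elad recommends — leaders per topic per broker, not just partitions per broker — can be sketched from partition metadata. The in-memory mapping here is hypothetical; in practice you would get it from `kafka-topics.sh --describe` or the AdminClient:]

```python
from collections import Counter

def leaders_per_broker(partition_leaders):
    """Count partition leaders per broker, for each topic.

    `partition_leaders` maps (topic, partition) -> leader broker id.
    Equal partition counts per broker can still hide a leader skew,
    and leaders are what actually drive traffic to a broker.
    """
    counts = {}
    for (topic, _partition), broker in partition_leaders.items():
        counts.setdefault(topic, Counter())[broker] += 1
    return counts

# Six partitions of one topic, with leadership skewed toward broker 1:
meta = {("clicks", p): (1 if p < 4 else p % 3) for p in range(6)}
by_topic = leaders_per_broker(meta)
assert by_topic["clicks"][1] == 5 and by_topic["clicks"][2] == 1
```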
436
00:26:24,274 --> 00:26:28,255
When I last was doing large things in the cloud, the
437
00:26:28,265 --> 00:26:33,025
thing that I often ran into was just API limits, right?
438
00:26:33,025 --> 00:26:33,955
Like my account
439
00:26:34,460 --> 00:26:35,670
Can't do certain things.
440
00:26:35,670 --> 00:26:38,170
I'll have API limits, and some
441
00:26:38,170 --> 00:26:40,920
of them are arbitrary like disk IO, right?
442
00:26:40,920 --> 00:26:44,330
Like the type of disk I have and the size of the disk depend
443
00:26:44,330 --> 00:26:48,095
on how much I can write to it, versus owning my hardware
444
00:26:48,095 --> 00:26:50,175
and having, having something on prem where it's like, Hey,
445
00:26:50,175 --> 00:26:54,535
whatever that bus speed is, is what I can use in cloud systems.
446
00:26:54,575 --> 00:26:56,694
I am often like, I'll throw some money at the problem.
447
00:26:56,694 --> 00:26:58,894
I'm going to make this a one-terabyte disk, even though I only
448
00:26:58,895 --> 00:27:02,235
need 200 gigs, just to get more throughput out of the disk.
449
00:27:02,524 --> 00:27:06,514
Where do you find those sorts of like hidden infrastructure problems
450
00:27:06,515 --> 00:27:10,835
that creep into this, like, hey, something outside of Kafka
451
00:27:11,110 --> 00:27:14,429
is causing a problem, and almost every time it's DNS, right?
452
00:27:14,429 --> 00:27:16,040
Like you're like, Oh, DNS is down.
453
00:27:16,340 --> 00:27:17,139
I have no network.
454
00:27:17,139 --> 00:27:20,340
I can't talk to the, to the broker or something, but like in
455
00:27:20,340 --> 00:27:23,749
general, have you seen like patterns there where someone says,
456
00:27:23,749 --> 00:27:27,270
Oh, I don't know what's going on, but Kafka is not healthy.
457
00:27:27,510 --> 00:27:29,860
And it's something outside of the cluster that was affecting it.
458
00:27:30,270 --> 00:27:30,849
I haven't.
459
00:27:31,430 --> 00:27:35,750
I've seen API issues; I saw it in some other
460
00:27:35,770 --> 00:27:40,310
cloud services, but not in Kafka. And also, throwing
461
00:27:40,399 --> 00:27:44,520
money at the problem, when talking about Kafka, never helped.
462
00:27:45,670 --> 00:27:48,510
By the way, it helps with some other clusters.
463
00:27:48,820 --> 00:27:50,750
Unless you're getting, like, a managed service, which
464
00:27:50,750 --> 00:27:52,880
still doesn't always help, because then you have to
465
00:27:52,880 --> 00:27:55,840
figure out the problems with that, you know what I mean?
466
00:27:55,840 --> 00:27:58,499
Like, you don't know what's underneath and what's going on, so
467
00:27:58,989 --> 00:28:02,719
I don't know what you mean, because I never worked with managed Kafka.
468
00:28:04,279 --> 00:28:08,540
I guess that, like, I'm working with other managed services, and
469
00:28:09,300 --> 00:28:13,720
there are always cons and pros for managed versus the open source.
470
00:28:14,180 --> 00:28:16,939
But this happens also not only in Kafka.
471
00:28:16,940 --> 00:28:20,120
It's more of a question, I think, of cost reduction, of
472
00:28:20,120 --> 00:28:24,424
how to spend money on clusters when you don't need to.
473
00:28:24,844 --> 00:28:28,435
And the main issue that causes spending money in
474
00:28:28,784 --> 00:28:32,365
Kafka is storage, like buying expensive machines.
475
00:28:32,635 --> 00:28:36,695
When I was designing on prem clusters, I was exposed
476
00:28:36,705 --> 00:28:41,174
to the prices of CPU and RAM DIMMs and disks.
477
00:28:41,185 --> 00:28:44,294
So CPU costs the most, because the machine costs the most.
478
00:28:44,735 --> 00:28:47,845
But adding CPU, after you buy the machine,
479
00:28:47,905 --> 00:28:52,685
having more CPUs costs less than having DIMMs.
480
00:28:53,045 --> 00:28:56,755
DIMMs cost the most, but disks cost the least.
481
00:28:58,184 --> 00:29:03,975
What's interesting is that mostly the bottleneck is storage, because
482
00:29:04,135 --> 00:29:07,754
customers of the cluster just need to store more data, and they
483
00:29:07,754 --> 00:29:12,654
have more traffic and then you need to scale out by buying CPU and
484
00:29:12,654 --> 00:29:16,105
RAM that you don't need in order to support storage that you do need.
485
00:29:16,570 --> 00:29:19,949
And this happens not only with Kafka, by the way, and then your
486
00:29:19,949 --> 00:29:24,710
cluster starts to scale out and you just buy more and more storage.
487
00:29:24,740 --> 00:29:28,619
And in AWS, for example, you cannot build your
488
00:29:28,619 --> 00:29:32,280
own, like decide how much storage you want.
489
00:29:32,399 --> 00:29:36,610
In GCP, you have more freedom in doing so.
490
00:29:37,100 --> 00:29:41,450
So that's the main case where I see that due to storage
491
00:29:41,480 --> 00:29:45,259
limits, you need to pay a lot of money for CPU and RAM that you
492
00:29:45,259 --> 00:29:48,080
don't need, which are the most expensive part of the cluster.
493
00:29:48,480 --> 00:29:55,620
But another source of spending money is of course, uh, networking on the cloud.
494
00:29:59,580 --> 00:30:04,190
The storage, RAM, CPU, and networking are all intertwined, and scaling one up
495
00:30:04,685 --> 00:30:06,315
isn't going to solve the other problem.
496
00:30:06,925 --> 00:30:10,085
But like you mentioned, the bottlenecks show up in different ways.
497
00:30:10,095 --> 00:30:12,345
Like, at different levels of scale, you're going
498
00:30:12,345 --> 00:30:15,025
to have RAM bottlenecks versus data bottlenecks, right?
499
00:30:15,025 --> 00:30:17,134
And so you have to just kind of balance that over time.
500
00:30:17,705 --> 00:30:20,544
But I think like the distributed systems and then learning
501
00:30:20,544 --> 00:30:24,175
like partitioning and databases and then just learning how
502
00:30:24,175 --> 00:30:26,585
like the throughput and everything is already complicated.
503
00:30:26,595 --> 00:30:28,445
And then you add in the networking and all of that.
504
00:30:28,445 --> 00:30:30,165
Like there's so many layers of things that
505
00:30:30,165 --> 00:30:32,785
you don't necessarily like learn in school.
506
00:30:32,875 --> 00:30:36,035
And then you have to put it all together and figure out how to scale it.
507
00:30:36,595 --> 00:30:40,644
It's so complex on trying to figure all that out.
508
00:30:41,165 --> 00:30:44,905
And it's also complex that in, in, for example, in Kafka,
509
00:30:44,915 --> 00:30:48,095
understanding that sometimes the cost of the network costs
510
00:30:48,095 --> 00:30:51,535
more than the cluster itself, which is usually not the problem.
511
00:30:52,675 --> 00:30:53,775
Yeah, the network traffic.
512
00:30:54,245 --> 00:30:54,505
Is that for
513
00:30:54,505 --> 00:30:55,975
on prem too or just in the cloud?
514
00:30:56,035 --> 00:30:57,095
Because I feel like No, no, no.
515
00:30:57,155 --> 00:31:01,005
Okay, because I was going to say like, how many times have you heard the most
516
00:31:01,005 --> 00:31:05,815
ridiculous charges for cloud networking for just multiple use cases and you're
517
00:31:05,815 --> 00:31:08,935
just like, I would have never expected that to be where we spent the most money.
518
00:31:09,170 --> 00:31:13,279
I will not comment on that, but, uh,
519
00:31:13,280 --> 00:31:16,100
there are plenty of people that I know where
520
00:31:16,150 --> 00:31:19,369
the largest portion of their bill for various clusters, especially
521
00:31:19,369 --> 00:31:24,580
Kubernetes, Kafka, whatever, is networking between regions and AZs.
522
00:31:24,629 --> 00:31:26,639
It's like the secret that nobody tells you, you know,
523
00:31:26,639 --> 00:31:28,340
like you're just like, and then this will make it cheaper.
524
00:31:29,420 --> 00:31:30,280
For cloud providers.
525
00:31:30,280 --> 00:31:30,820
Hell yeah.
526
00:31:31,260 --> 00:31:31,600
Because like.
527
00:31:31,785 --> 00:31:35,145
That's never like in a book, you know, like it's like that's how you know
528
00:31:35,145 --> 00:31:37,815
when someone's worked with stuff for a long time because of like that.
529
00:31:37,875 --> 00:31:39,345
They're gonna call that out the first thing
530
00:31:39,345 --> 00:31:40,431
when you're like re-architecting something.
531
00:31:40,431 --> 00:31:40,432
Building.
532
00:31:40,432 --> 00:31:41,630
Well, and the problem is always,
533
00:31:41,680 --> 00:31:45,195
every cloud calculator leaves that up to the, the reader, right?
534
00:31:45,195 --> 00:31:48,045
Like, Hey, by the way, depending on how much traffic you have, here's our rates.
535
00:31:48,045 --> 00:31:48,315
Right?
536
00:31:48,315 --> 00:31:49,070
And it's, you ever
537
00:31:49,070 --> 00:31:52,005
seen the post where people are trying to calculate how that works?
538
00:31:52,005 --> 00:31:53,835
And sometimes you can't even calculate it.
539
00:31:53,835 --> 00:31:54,675
It's impossible.
540
00:31:54,675 --> 00:31:58,095
And then also on-prem, because it's a free resource.
541
00:31:58,155 --> 00:31:59,685
That would be like trying to
542
00:32:00,090 --> 00:32:02,540
calculate how much power you consume, right?
543
00:32:02,540 --> 00:32:04,210
Like your data center consumes so much.
544
00:32:04,290 --> 00:32:07,230
Well, I know my data center is capped at this, so I'm not using more
545
00:32:07,230 --> 00:32:10,679
than a megawatt or whatever, but like, I can't tell you if I'm using,
546
00:32:10,690 --> 00:32:14,960
you know, whatever power I'm actually consuming at any given points.
547
00:32:15,169 --> 00:32:17,350
And I feel like that is just like one of those.
548
00:32:17,805 --> 00:32:20,735
points that cloud providers really leaned into that, like, people
549
00:32:20,735 --> 00:32:23,295
don't know this metric, so we're going to charge them for it.
550
00:32:23,325 --> 00:32:25,975
And we're not going to charge them a lot until they go over.
551
00:32:26,445 --> 00:32:30,104
At some traffic level, it can cost, like, twice
552
00:32:30,125 --> 00:32:32,995
the cost of the cluster, or three times.
553
00:32:33,185 --> 00:32:35,235
The cluster cost becomes irrelevant.
554
00:32:35,505 --> 00:32:39,025
And this is something specific to Kafka, the cost of the networking.
555
00:32:39,055 --> 00:32:44,255
And if we go into this, so first of all, for our listeners, look
556
00:32:44,275 --> 00:32:49,405
at the cost of the networking between your producers and consumers to your
557
00:32:49,735 --> 00:32:54,324
Kafka brokers, and you will be amazed how tough it is to calculate it,
558
00:32:54,335 --> 00:32:59,105
first of all, just as you said, but then once you see the cost, like,
559
00:32:59,115 --> 00:33:03,445
you will have a new target to focus on, and you will go to, like, rack awareness.
560
00:33:03,820 --> 00:33:07,000
In Kubernetes, we have like AZ steering, like we can steer
561
00:33:07,000 --> 00:33:10,760
traffic, to know what the topology of the cluster looks like.
562
00:33:10,760 --> 00:33:13,200
And we say, Hey, don't cross this border if you don't have to, right?
563
00:33:13,310 --> 00:33:15,769
Like, it's okay to go across AZ, but I would
564
00:33:15,769 --> 00:33:18,260
prefer you to stay within AZ or within the VPC.
565
00:33:18,430 --> 00:33:19,759
Does Kafka have something like that?
566
00:33:19,760 --> 00:33:22,320
Like some sort of steering for traffic to say, like, ah,
567
00:33:22,690 --> 00:33:25,390
this broker only should stay in this AZ or this data should
568
00:33:25,400 --> 00:33:28,440
only be part of the, you know, the consumers in this AZ?
569
00:33:28,440 --> 00:33:28,629
Okay.
570
00:33:29,040 --> 00:33:31,540
Well, there are, of course, parts of Kafka that I don't
571
00:33:31,550 --> 00:33:35,310
know, and one of them is that I don't have experience with
572
00:33:35,430 --> 00:33:40,580
rack awareness, and I don't have experience yet with Kafka
573
00:33:40,580 --> 00:33:43,090
and Kubernetes, but I do have experience with Kubernetes.
574
00:33:43,385 --> 00:33:44,765
in other places.
575
00:33:45,145 --> 00:33:50,755
And the issue is like having to reduce network cost in Kafka.
576
00:33:51,075 --> 00:33:54,745
You need to reduce network traffic, which is very high.
577
00:33:54,775 --> 00:33:59,045
Sometimes there is a feature from some version in
578
00:33:59,045 --> 00:34:02,615
Kafka that reads from followers, but then you need
579
00:34:02,615 --> 00:34:06,425
to tackle basic questions of whether your followers
580
00:34:07,045 --> 00:34:11,745
they are in sync, because when you don't have rack awareness and
581
00:34:11,745 --> 00:34:15,665
you read only from leaders, you're okay because your leaders
582
00:34:15,665 --> 00:34:18,415
are already synced, because they are the leaders.
583
00:34:18,585 --> 00:34:21,904
But once you have rack awareness, then you will
584
00:34:21,915 --> 00:34:23,945
need to read from followers.
585
00:34:23,985 --> 00:34:25,875
But what happens if your followers are
586
00:34:26,625 --> 00:34:28,485
not in the ISR list,
587
00:34:28,485 --> 00:34:29,995
I mean, not in-sync replicas?
588
00:34:30,495 --> 00:34:33,965
So to reduce cost, networking cost, you need to figure
589
00:34:33,965 --> 00:34:37,495
out how to make your replication really work and not
590
00:34:37,505 --> 00:34:40,025
be lagging because then your consumers would crash.
591
00:34:40,574 --> 00:34:43,685
If you move from a leader to a follower.
592
00:34:43,775 --> 00:34:45,809
So it's tough to
593
00:34:46,000 --> 00:34:48,910
reduce the cost of networking by implementing, like, rack awareness.
594
00:34:49,110 --> 00:34:53,030
And that probably has the biggest trade-off with availability
595
00:34:53,090 --> 00:34:54,140
best practices.
596
00:34:54,149 --> 00:34:59,430
Every best practice guide in AWS is like spread your workload across AZs.
597
00:34:59,809 --> 00:35:03,939
And as soon as you say, I need leaders in every AZ in my
598
00:35:03,940 --> 00:35:07,230
region, you have to replicate some traffic between them.
599
00:35:07,510 --> 00:35:10,700
And if you say, Oh, I only want this leader to talk to this
600
00:35:10,710 --> 00:35:14,490
AZ, it probably only has the data for that AZ or something.
601
00:35:14,490 --> 00:35:15,000
And so if that.
602
00:35:15,465 --> 00:35:18,945
If that AZ goes down, where does every
603
00:35:18,945 --> 00:35:20,635
other consumer get the data that they require there?
604
00:35:20,635 --> 00:35:22,565
So it's like this constant trade off of
605
00:35:22,565 --> 00:35:24,885
like, how much can you pay for availability?
606
00:35:25,085 --> 00:35:28,904
How real time and how much data, like, is it okay if you miss some data?
607
00:35:29,125 --> 00:35:31,214
Sometimes systems are, yes, that's okay.
608
00:35:31,214 --> 00:35:32,294
Like I missed a log line.
609
00:35:32,594 --> 00:35:33,624
Okay, that's all right.
610
00:35:33,664 --> 00:35:34,584
Like we call it sampling.
611
00:35:34,584 --> 00:35:35,755
It's not the worst thing, right?
612
00:35:36,225 --> 00:35:39,585
But yeah, I do think that there's a much bigger concern.
613
00:35:40,400 --> 00:35:43,970
Especially as people treat Kafka like a critical database, like it is a
614
00:35:43,970 --> 00:35:48,620
critical database in a lot of cases, but not all data is always created equal.
615
00:35:49,260 --> 00:35:49,540
Yeah.
616
00:35:49,550 --> 00:35:52,290
A log is not like billing data.
617
00:35:52,650 --> 00:35:53,870
It's a, it's different.
618
00:35:54,320 --> 00:35:56,440
And you had the right point.
619
00:35:56,550 --> 00:36:01,260
The point that was missed before is that you need to
620
00:36:01,280 --> 00:36:06,470
make sure that no leader resides within the same AZ as its follower.
621
00:36:06,900 --> 00:36:08,530
So you might even pay more.
622
00:36:09,300 --> 00:36:11,690
Because, like, you incur more
623
00:36:12,410 --> 00:36:15,570
replication cost due to the availability issues.
624
00:36:15,680 --> 00:36:18,959
Yeah, it's more DevOps work also.
625
00:36:19,490 --> 00:36:21,010
I mean, it's just systems thinking, right?
626
00:36:21,010 --> 00:36:25,350
Like, it's like, how do I design a system that meets a price point, but
627
00:36:25,369 --> 00:36:29,340
also meets an availability SLA, even if the SLA isn't defined, right?
628
00:36:29,340 --> 00:36:32,860
Like people constantly are like, Oh, I have to, I have to have five nines.
629
00:36:32,870 --> 00:36:34,970
I'm like, well, you don't because you don't have infinite budget.
630
00:36:35,055 --> 00:36:37,045
And so figure out what your budget is and then
631
00:36:37,055 --> 00:36:39,525
figure out how much availability you can have.
632
00:36:39,625 --> 00:36:40,405
But that's the thing.
633
00:36:40,405 --> 00:36:43,065
Like some things you can kind of be flexible with, but
634
00:36:43,065 --> 00:36:46,134
data is one of our most important commodities.
635
00:36:46,135 --> 00:36:48,965
And it's usually one of the most important parts of an application.
636
00:36:48,965 --> 00:36:50,325
So you can't lose data.
637
00:36:50,385 --> 00:36:52,424
You, you've got to always have a backup.
638
00:36:52,424 --> 00:36:53,734
You've always got to have a plan.
639
00:36:53,765 --> 00:36:56,785
So it's like, it's so hard because you want to be efficient and you
640
00:36:56,785 --> 00:37:00,465
want to be cost efficient, but you also cannot lose that data, you know?
641
00:37:00,465 --> 00:37:03,465
So it's like really weighing those costs.
642
00:37:03,615 --> 00:37:04,635
But if you reduce.
643
00:37:05,025 --> 00:37:06,995
If you manage to reduce the networking cost,
644
00:37:07,005 --> 00:37:09,885
then you can save more data on the brokers.
645
00:37:10,085 --> 00:37:13,415
But then if you save more data, it gives you the liberty to
646
00:37:13,425 --> 00:37:18,204
replay the data, and then you thrash the page cache, and then
647
00:37:18,204 --> 00:37:22,215
your consumers will lag, or your producers will fail to write.
648
00:37:22,345 --> 00:37:25,775
If you manage to save some money in Kafka, my
649
00:37:25,775 --> 00:37:29,645
recommendation is don't exploit it even further.
650
00:37:30,675 --> 00:37:33,855
Don't treat Kafka, treating Kafka as a database
651
00:37:33,855 --> 00:37:37,135
is problematic in terms of costs.
652
00:37:37,135 --> 00:37:42,545
But also, reducing networking costs is problematic, because you
653
00:37:42,545 --> 00:37:47,264
need to remember that then you need to make replication really work, and
654
00:37:47,264 --> 00:37:51,255
it's tough to make replication work because if you have a skew and your
655
00:37:51,255 --> 00:37:56,105
application is not so good. Usually, like, I saw sites that had
656
00:37:56,835 --> 00:37:59,585
under-replication in almost all of the leaders,
657
00:37:59,895 --> 00:38:01,955
but they weren't affected because it's on prem.
658
00:38:01,955 --> 00:38:04,175
No one, no one cares about the networking costs,
659
00:38:04,725 --> 00:38:06,725
but once you're on the cloud, you care about it.
660
00:38:07,185 --> 00:38:11,740
So if you want to implement rack awareness, check your ISR list
661
00:38:11,740 --> 00:38:16,430
and make sure that your replication is really correct, and
662
00:38:16,430 --> 00:38:20,230
not that you are lagging, like, 15 minutes or an hour behind your
663
00:38:20,540 --> 00:38:23,280
leaders because you're going to read from followers and you want
664
00:38:23,290 --> 00:38:26,340
to make sure that consumers don't fail because they are lagging.
665
00:38:26,640 --> 00:38:29,665
You've been doing this now for almost 10 years.
666
00:38:30,085 --> 00:38:33,275
Do you have a tip for someone that's like, today, 2025,
667
00:38:33,335 --> 00:38:34,885
I want to get started in Kafka,
668
00:38:34,885 --> 00:38:37,745
I want to learn what Kafka is like, how I should use it,
669
00:38:37,745 --> 00:38:39,274
how I should architect it, how I should troubleshoot it.
670
00:38:39,285 --> 00:38:40,575
Where would you start today?
671
00:38:40,735 --> 00:38:43,445
There is a great book called Kafka: The Definitive Guide.
672
00:38:44,025 --> 00:38:46,125
Written by someone, I think, from
673
00:38:46,135 --> 00:38:49,915
LinkedIn and two others, one from Confluent also.
674
00:38:50,125 --> 00:38:52,884
That was my first, uh, Kafka book.
675
00:38:53,265 --> 00:38:55,284
I, I like reading books, but I know that
676
00:38:55,345 --> 00:38:57,275
today's generation doesn't like to read books.
677
00:38:57,865 --> 00:39:01,224
But that's a great book for understanding the APIs.
678
00:39:01,684 --> 00:39:03,235
Not so much for troubleshooting, but
679
00:39:03,455 --> 00:39:06,685
understanding like how, how to work with Kafka.
680
00:39:06,685 --> 00:39:07,595
It's a great book.
681
00:39:07,765 --> 00:39:09,825
It's written by some of the people that
682
00:39:09,875 --> 00:39:12,215
wrote, I think, uh, developed, uh, Kafka.
683
00:39:13,025 --> 00:39:16,625
The other thing is, uh, getting in touch with
684
00:39:16,655 --> 00:39:19,464
the DevOps person in your company and sitting with them.
685
00:39:20,105 --> 00:39:24,725
To understand, like, what Linux is about and which
686
00:39:24,725 --> 00:39:29,125
tools to use, like running iostat on your Kafka brokers and just looking at the logs.
687
00:39:29,665 --> 00:39:33,174
Understanding, for example, a typical mistake is looking
688
00:39:33,175 --> 00:39:37,395
at iostat, looking at disk utilization and seeing, like, 100
689
00:39:37,414 --> 00:39:40,275
percent disk utilization and saying, Oh, that's very bad.
690
00:39:40,305 --> 00:39:42,115
I'm at 100 percent disk utilization.
691
00:39:42,590 --> 00:39:44,330
But then if you look for an hour, you'll see that
692
00:39:44,370 --> 00:39:47,590
Kafka works in bursts, like a few seconds every minute.
693
00:39:47,670 --> 00:39:49,520
It's 100 percent utilization.
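[Editor's note: the mistake described here — reading one 100% `%util` sample as saturation — is easy to illustrate by averaging over a window; the sample values below are made up:]

```python
def mean_util(samples):
    """Average per-second %util samples (e.g., from iostat) over a window.

    Kafka flushes in bursts: a few seconds at 100% each minute can
    still mean the disk is mostly idle on average.
    """
    return sum(samples) / len(samples)

# 5 seconds at 100% followed by 55 idle seconds, once a minute:
minute = [100.0] * 5 + [0.0] * 55
assert mean_util(minute) < 10  # ~8.3% average: not actually saturated
```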
694
00:39:49,520 --> 00:39:53,199
So through the logs, through the tools, top,
695
00:39:53,209 --> 00:39:56,920
vmstat, iostat, you can learn how Kafka behaves.
696
00:39:57,170 --> 00:40:02,469
Deploy your own Grafana with CPU storage and data skew metrics.
697
00:40:03,215 --> 00:40:06,515
Start to monitor, like, develop monitoring scripts.
698
00:40:06,895 --> 00:40:11,605
For example, a simple one, but very effective monitoring, is taking a
699
00:40:11,605 --> 00:40:16,605
topic, going over its partitions, and calculating the traffic into these
700
00:40:16,634 --> 00:40:21,315
partitions, and making a graph, just to see, uh, how this queue behaves.
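[Editor's note: the monitoring script sketched here — per-partition traffic for a topic — can be approximated from two snapshots of end offsets. The numbers are hypothetical; in practice you would fetch offsets with a consumer's `end_offsets` API or Kafka's GetOffsetShell tool:]

```python
def partition_rates(offsets_t0, offsets_t1, interval_s):
    """Messages/sec per partition, from two end-offset snapshots."""
    return {p: (offsets_t1[p] - offsets_t0[p]) / interval_s for p in offsets_t0}

def skew(rates):
    """Max/mean ratio: ~1.0 means even traffic, >>1.0 means a hot partition."""
    mean = sum(rates.values()) / len(rates)
    return max(rates.values()) / mean if mean else 1.0

# Two snapshots taken 60 seconds apart; partition 0 is hot:
t0 = {0: 1_000, 1: 1_000, 2: 1_000}
t1 = {0: 7_000, 1: 1_600, 2: 1_400}
rates = partition_rates(t0, t1, interval_s=60)
assert round(skew(rates), 2) == 2.57  # clear data skew
```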
701
00:40:22,250 --> 00:40:25,940
How that data distribution behaves. Monitor the number of
702
00:40:25,940 --> 00:40:29,500
consumers you have per topic, monitor the number of leaders per
703
00:40:29,500 --> 00:40:34,560
topic, per broker. For me, at least, I learned visually and by
704
00:40:34,600 --> 00:40:38,679
looking at logs, I can stare at logs for hours and learn like that.
705
00:40:38,680 --> 00:40:39,759
So you need to like,
706
00:40:39,909 --> 00:40:40,960
I like how you break it down.
707
00:40:40,960 --> 00:40:43,920
I mean, like Kafka is a, is a complicated distributed system.
708
00:40:44,280 --> 00:40:48,010
And your first point is like, go to one of the
709
00:40:48,560 --> 00:40:50,410
Kafka leaders and look at the disk, right?
710
00:40:50,410 --> 00:40:53,660
Like just start there, like start understanding little bits at a time and
711
00:40:53,660 --> 00:40:57,710
then painting a bigger picture as you go and saying, okay, I understand what
712
00:40:57,710 --> 00:41:01,020
this disk is doing, but I don't know why it might affect something downstream.
713
00:41:01,030 --> 00:41:03,750
It doesn't matter if you don't understand the whole picture yet.
714
00:41:04,169 --> 00:41:06,940
You have to understand little bits and pieces of it as you go.
715
00:41:07,200 --> 00:41:11,270
I learned Linux partly from Brendan Gregg's book.
716
00:41:13,390 --> 00:41:15,470
But it wouldn't have helped me if I didn't look at
717
00:41:15,470 --> 00:41:17,930
logs and just stare at the logs running.
718
00:41:18,280 --> 00:41:21,010
To understand the behavior, it's like a patient, okay?
719
00:41:21,170 --> 00:41:23,510
It's like a human being, Kafka.
720
00:41:23,940 --> 00:41:28,000
In order to diagnose it, you need just to look how it, how it lives.
721
00:41:28,239 --> 00:41:30,439
I'll give an example that, that shows it.
722
00:41:30,439 --> 00:41:36,780
There was an on-prem cluster where producers failed to write to it.
723
00:41:36,800 --> 00:41:38,760
Consumers failed to read from it.
724
00:41:39,100 --> 00:41:43,840
And that cluster had three brokers and four disks per broker.
725
00:41:44,470 --> 00:41:46,990
And it was configured in RAID.
726
00:41:47,510 --> 00:41:48,090
RAID 1, then.
727
00:41:48,110 --> 00:41:49,880
So two disks are one disk.
728
00:41:49,890 --> 00:41:51,690
So effectively it had like six logical disks.
729
00:41:52,030 --> 00:41:54,280
If one disk fails, the other one fails as well.
730
00:41:54,890 --> 00:41:59,259
So I had looked at the iostat of some of the disks there in the past.
731
00:41:59,570 --> 00:42:03,029
So I knew it was a very small cluster, and the client
732
00:42:03,370 --> 00:42:06,600
refused to scale out the cluster because it meant that
733
00:42:06,630 --> 00:42:10,120
it needed to add two more brokers and four more disks. And this was a
734
00:42:10,120 --> 00:42:14,040
client that didn't want to spend money. So I had the feeling that an HDD
735
00:42:14,340 --> 00:42:18,340
disk would fail at some point. And then I got the call saying producers and
736
00:42:18,340 --> 00:42:24,620
consumers fail. And I knew immediately, by looking at the iostat in
737
00:42:24,620 --> 00:42:29,040
the past, that one disk failed, and the other one just joined it.
738
00:42:29,460 --> 00:42:32,900
And they told me, no, the iostat doesn't show anything.
739
00:42:32,949 --> 00:42:38,159
So I told them, go to the data center, look at the light bulb on that disk.
740
00:42:38,409 --> 00:42:39,990
Is it a different color?
741
00:42:40,080 --> 00:42:43,940
And then they went into the freezing room of the data center and told me, yeah.
742
00:42:44,560 --> 00:42:47,730
So, from looking at the logs in the
743
00:42:47,730 --> 00:42:50,540
past, I knew that the disk had just failed.
744
00:42:50,840 --> 00:42:56,590
It is really beneficial to just look at running rows of
745
00:42:56,649 --> 00:43:01,090
vmstat and iostat and prepare your own monitoring.
746
00:43:01,500 --> 00:43:02,829
This was on-prem, by the way.
747
00:43:02,830 --> 00:43:05,600
I have another one; it took a month to figure it out.
748
00:43:06,120 --> 00:43:10,290
But sometimes, with disks, when you look at the average
749
00:43:10,555 --> 00:43:13,545
usage, like, this is an example of why looking at iostat
750
00:43:13,545 --> 00:43:14,545
is so beneficial.
751
00:43:14,855 --> 00:43:19,564
So when you look at the average usage of the disks, then you
752
00:43:19,565 --> 00:43:21,655
will see the same read and write throughput.
753
00:43:22,075 --> 00:43:25,774
But if you look at the average in a system that behaves in spikes,
754
00:43:26,115 --> 00:43:31,365
well, like Kafka, looking at the average is dangerous. And we had some
755
00:43:31,375 --> 00:43:36,314
problem with some cluster, and it turns out that one disk just didn't
756
00:43:36,325 --> 00:43:41,144
handle bursts well, so it took like three or four times more
757
00:43:41,184 --> 00:43:45,585
time to handle a burst compared to the other disks.
758
00:43:46,305 --> 00:43:48,985
And you can understand it only by looking
759
00:43:49,210 --> 00:43:56,180
really at the log and seeing that it is at 100 percent disk
760
00:43:56,180 --> 00:43:58,297
utilization much more of the time than the other brokers were.
761
00:43:58,297 --> 00:44:01,250
It's not even resting. And it turns out it was a problem with the
762
00:44:01,250 --> 00:44:06,419
disks themselves in that AZ, only for Kafka clusters, by the way.
763
00:44:06,419 --> 00:44:09,849
So just look at the logs and prepare your own monitoring.
764
00:44:09,850 --> 00:44:13,450
That's the best way to learn Kafka from the inside.
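Elad's point about averages hiding spiky disks can be sketched in a few lines. This is a hypothetical illustration, not from the episode: the device names and %util numbers are made up, and the parser assumes simple two-column `device %util` samples rather than full `iostat -x` output.

```python
# Hypothetical sketch: flag disks whose peak %util is far above their
# average, the pattern Elad describes for bursty Kafka workloads.
# The sample lines below are made up and only mimic the Device/%util
# columns of `iostat -x` output.

SAMPLES = """\
sda 42.0
sdb 5.0
sda 41.0
sdb 5.0
sda 43.5
sdb 100.0
"""

def utilization_by_disk(text):
    """Collect %util samples per device from whitespace-separated lines."""
    stats = {}
    for line in text.splitlines():
        device, util = line.split()
        stats.setdefault(device, []).append(float(util))
    return stats

def bursty_disks(stats, ratio=2.0):
    """Return devices whose peak %util exceeds `ratio` times their average."""
    flagged = []
    for device, samples in stats.items():
        avg = sum(samples) / len(samples)
        if max(samples) > ratio * avg:
            flagged.append(device)
    return flagged

stats = utilization_by_disk(SAMPLES)
print(bursty_disks(stats))  # ['sdb']
```

Here sdb averages roughly 37% utilization, which looks healthy on a dashboard, but its peak hits 100%; comparing peak against average surfaces the disk that "didn't handle bursts well."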
765
00:44:14,345 --> 00:44:17,975
I love that you knew what was going to fail, and then as soon as it
766
00:44:17,975 --> 00:44:20,705
failed, you had the biggest I-told-you-so moment, right?
767
00:44:20,705 --> 00:44:21,775
You're just like, Oh, you know what?
768
00:44:21,775 --> 00:44:24,335
Like, I don't even need to look at the dashboard right now.
769
00:44:24,345 --> 00:44:26,564
Just go to the data center and look at the lights
770
00:44:26,565 --> 00:44:29,355
on that rack and just tell me if one of those is red or not.
771
00:44:29,545 --> 00:44:31,435
But also like, I remember, I don't know how
772
00:44:31,435 --> 00:44:33,735
many times I've configured raid in my career.
773
00:44:33,735 --> 00:44:37,085
And there's always that moment where, like, oh, part of that disk
774
00:44:37,085 --> 00:44:39,375
is going to fail for whatever reason, or it's going to fail
775
00:44:39,375 --> 00:44:42,325
or get corrupted, and the other drive is just going to replicate it.
776
00:44:42,325 --> 00:44:44,845
It's like, yeah, this is how you configured me.
777
00:44:45,035 --> 00:44:48,435
You wanted me to follow whatever that other disk is doing.
778
00:44:48,454 --> 00:44:51,084
And immediately in my head, like I heard my parents saying, like,
779
00:44:51,085 --> 00:44:53,414
if all your friends jumped off a bridge, would you do it too?
780
00:44:53,664 --> 00:44:55,195
And like in a RAID disk, like, yep.
781
00:44:55,265 --> 00:44:55,815
Let's go.
782
00:44:57,135 --> 00:44:58,085
That's so real though.
783
00:44:58,085 --> 00:45:01,715
Like sometimes you're just like, it's like being
784
00:45:01,795 --> 00:45:05,275
a mom and then being like an engineer or like a product manager.
785
00:45:05,275 --> 00:45:07,565
You're like, it's like just different kids that you're trying.
786
00:45:08,605 --> 00:45:10,875
It's just that constant behavior of like,
787
00:45:10,875 --> 00:45:12,634
yeah, this is, this is how you configured me.
788
00:45:12,634 --> 00:45:14,174
So yes, this is what I'm going to do.
789
00:45:14,224 --> 00:45:16,375
One of my first sysadmin jobs, we had to do NOC checks
790
00:45:16,375 --> 00:45:19,715
and go look at hard drive lights and, like, see if
791
00:45:20,115 --> 00:45:22,235
hard drives had failed, cause we didn't have monitoring.
792
00:45:22,265 --> 00:45:25,925
Like we literally had zero monitoring, like a dashboard to tell me,
793
00:45:25,935 --> 00:45:30,075
even though the hardware offered it, we instead had people monitoring.
794
00:45:30,075 --> 00:45:32,904
And so every day someone would go to each data
795
00:45:32,905 --> 00:45:34,794
center, each NOC, and we'd go look at the rack.
796
00:45:34,805 --> 00:45:37,134
And, uh, I'm red-green colorblind.
797
00:45:37,514 --> 00:45:40,445
And I kept telling them over and over again, like I physically
798
00:45:40,445 --> 00:45:43,675
cannot do this job, uh, or at least this part of the job.
799
00:45:43,705 --> 00:45:46,225
And so I would always have to take pictures of
800
00:45:46,225 --> 00:45:48,514
the rack of lights and like send them to someone else.
801
00:45:48,525 --> 00:45:50,850
And I'm like, hey, do you see any red lights?
802
00:45:50,850 --> 00:45:53,030
Because I, and I didn't even know that I couldn't see the colors
803
00:45:53,030 --> 00:45:54,830
because I was like, oh yeah, they're always green to me.
804
00:45:55,030 --> 00:45:56,820
Until one day someone came in after me and they're like,
805
00:45:56,860 --> 00:45:58,520
Justin, why didn't you tell us that that light was red?
806
00:45:58,520 --> 00:46:00,845
I'm like, I don't know that it's red.
807
00:46:00,855 --> 00:46:02,065
Like it looks green to me.
808
00:46:02,125 --> 00:46:05,135
And yeah, it's just, sometimes we definitely need, we
809
00:46:05,135 --> 00:46:08,455
need systems in place and not put people on it. You said
810
00:46:08,455 --> 00:46:11,195
that they had system options and
811
00:46:11,195 --> 00:46:13,665
they picked people. Like, a part of me is just like,
812
00:46:14,675 --> 00:46:15,605
why would you do that?
813
00:46:15,904 --> 00:46:17,224
I mean, this was, this was 2000.
814
00:46:18,340 --> 00:46:18,960
Oh, what was this?
815
00:46:18,960 --> 00:46:20,210
9 or 10?
816
00:46:20,300 --> 00:46:23,720
Uh, maybe, maybe 11, but yeah, it was just the environment we were in.
817
00:46:23,950 --> 00:46:26,559
It was not set up to do this sort of thing. We had
818
00:46:26,559 --> 00:46:29,100
no priority to do that, uh, let's say.
819
00:46:29,309 --> 00:46:33,680
And there was priority to send people to rooms and look at lights.
820
00:46:33,690 --> 00:46:37,444
Sometimes solving the problem with people is easier for a company.
821
00:46:37,775 --> 00:46:38,015
Right.
822
00:46:38,015 --> 00:46:39,225
It justifies the head count.
823
00:46:39,235 --> 00:46:42,125
It justifies the time, all those things.
824
00:46:42,145 --> 00:46:46,675
And sometimes getting too sophisticated can cause problems at a business level.
825
00:46:46,925 --> 00:46:49,365
Try working for a managed database and then going to like
826
00:46:49,405 --> 00:46:53,354
meetings with a bunch of like DBAs and like engineers and data
827
00:46:53,355 --> 00:46:57,395
engineers that do not want you to look too good in that meeting.
828
00:46:57,395 --> 00:46:57,986
Regarding
829
00:46:57,986 --> 00:47:04,580
your point, like, monitoring the OS is something that today, with
830
00:47:04,580 --> 00:47:08,170
the cloud and everything, most developers or ops people,
831
00:47:08,570 --> 00:47:13,499
they don't have the experience with on-prem, so they are far
832
00:47:13,499 --> 00:47:18,889
away from even the Linux tools and monitoring Linux.
833
00:47:19,270 --> 00:47:25,000
Like iostat, or even, you know, using smartctl to detect
834
00:47:25,240 --> 00:47:29,000
faulty disks on-prem. Of course, this is very beneficial.
835
00:47:29,000 --> 00:47:31,990
You don't need to look at any lights because of that.
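The SMART tool he refers to is smartmontools' `smartctl`. A minimal sketch of reading its health verdict follows; the sample output is canned (a real check would parse the stdout of `smartctl -H /dev/sdX`, run as root), so treat the exact text as an assumption.

```python
import re

# Hedged sketch of checking SMART health. On a real host you would feed
# this the stdout of `smartctl -H /dev/sdX` (run as root); here we parse
# a canned sample so the logic is visible without hardware.

SAMPLE_OUTPUT = """\
smartctl 7.2 2020-12-30 r5155 [x86_64-linux] (local build)
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
"""

def smart_health(output):
    """Return the overall-health verdict (e.g. 'PASSED' or 'FAILED!')
    from `smartctl -H` output, or None if the line is missing."""
    match = re.search(r"self-assessment test result: (\S+)", output)
    return match.group(1) if match else None

def disk_is_failing(output):
    """Treat anything other than an explicit PASSED as suspect."""
    return smart_health(output) != "PASSED"

print(smart_health(SAMPLE_OUTPUT))    # FAILED!
print(disk_is_failing(SAMPLE_OUTPUT)) # True
```

A nightly loop over every drive with a check like this is the software version of walking the freezing data center room looking for red lights.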
836
00:47:32,640 --> 00:47:35,949
But also in the cloud, if you monitor the OS
837
00:47:35,980 --> 00:47:38,910
itself, it can help with a lot of problems.
838
00:47:39,640 --> 00:47:43,240
People are so far away from the OS that they
839
00:47:43,240 --> 00:47:46,600
just neglect it and look at applicative metrics.
840
00:47:47,210 --> 00:47:50,240
In Kafka, most of the important metrics are not applicative.
841
00:47:50,560 --> 00:47:54,380
There are some very important ones that are applicative, but
842
00:47:54,380 --> 00:47:59,240
most of them are just pure OS. And it's true, by the way, there are also pure
843
00:47:59,240 --> 00:48:04,750
OS metrics in other open source databases that sometimes are being skipped.
844
00:48:05,430 --> 00:48:09,450
And that's because people know the application more than they know the OS.
845
00:48:10,590 --> 00:48:12,210
But that's the cause of
846
00:48:12,705 --> 00:48:16,265
a lot of frustration when trying to tackle production issues.
847
00:48:16,865 --> 00:48:20,225
So many people that have had their entire career in
848
00:48:20,225 --> 00:48:23,845
the cloud are like, that on-prem looks difficult.
849
00:48:23,874 --> 00:48:26,374
And a lot of people also, if they're in serverless,
850
00:48:26,374 --> 00:48:28,475
they're like, the Linux operating system looks difficult.
851
00:48:28,505 --> 00:48:29,604
And they both are.
852
00:48:29,675 --> 00:48:31,134
Like, this isn't saying like they're easy.
853
00:48:31,585 --> 00:48:35,605
Going one layer below where you work is super important
854
00:48:35,615 --> 00:48:39,135
to be able to understand what you do in a lot of cases.
855
00:48:39,145 --> 00:48:42,750
And it's really hard for people to kind of break out of that and say, like,
856
00:48:42,750 --> 00:48:45,180
I don't wanna spend time there because that's just not important to me.
857
00:48:45,420 --> 00:48:47,550
And, and the people deploying Lambdas are
858
00:48:47,550 --> 00:48:49,080
like, I don't ever wanna learn Linux ever.
859
00:48:49,080 --> 00:48:50,940
Like, it's, it's called serverless as a
860
00:48:50,940 --> 00:48:53,340
derogatory term of like, servers are bad.
861
00:48:53,580 --> 00:48:57,540
Like I do not have the time to waste on a Linux operating system.
862
00:48:57,830 --> 00:49:00,590
And the, like, the people that are good at Linux and also
863
00:49:00,590 --> 00:49:04,940
do serverless, know how to performance tune their functions
864
00:49:04,970 --> 00:49:08,510
and use them better because they understand the layer below.
865
00:49:09,010 --> 00:49:10,151
I think this all comes back to, like,
866
00:49:10,645 --> 00:49:13,655
If you don't use something, you just, it's hard to troubleshoot it, right?
867
00:49:13,655 --> 00:49:17,185
Like you're, you have no idea how it works or what's going on.
868
00:49:17,235 --> 00:49:20,475
And like, I think abstraction is great and platform teams are great.
869
00:49:20,535 --> 00:49:24,274
And the cloud is great for a lot of things, but it's also made
870
00:49:24,275 --> 00:49:27,284
that abstraction where people have lost a lot of knowledge.
871
00:49:27,384 --> 00:49:30,185
And I think it's funny because with AI, it's going to get even worse.
872
00:49:30,195 --> 00:49:32,005
And they'll be like, AI can write this code for me.
873
00:49:32,045 --> 00:49:35,245
But you're like, but dude, like you don't, you don't even know what you wrote.
874
00:49:35,245 --> 00:49:37,605
So like, how are you going to know how to fix it?
875
00:49:38,555 --> 00:49:40,685
I can add even another layer to it.
876
00:49:40,685 --> 00:49:42,145
It's not just for debugging.
877
00:49:42,145 --> 00:49:46,895
It's to understand how your cluster, like, works.
878
00:49:46,905 --> 00:49:51,304
For example, Presto, Trino, once it works,
879
00:49:51,395 --> 00:49:55,005
once you run a query, it's CPU intensive.
880
00:49:55,260 --> 00:49:57,390
Just look at the top
881
00:49:57,660 --> 00:50:01,050
command and you will see 100% CPU. In Kafka,
882
00:50:01,460 --> 00:50:04,280
it is RAM intensive. In Spark,
883
00:50:04,400 --> 00:50:09,290
it is also RAM intensive. In Druid, it's partially CPU
884
00:50:09,710 --> 00:50:14,510
and partially not intensive in anything. For every application,
885
00:50:14,510 --> 00:50:17,720
even if you have your own application, looking at
886
00:50:17,720 --> 00:50:22,100
Linux can show you what your application is intensive about.
887
00:50:22,515 --> 00:50:27,354
Does it need more IOPS, more storage, more CPU, which type of CPU?
888
00:50:27,355 --> 00:50:30,095
Once people get away from it, like they say,
889
00:50:30,095 --> 00:50:33,035
okay, my application uses a lot of CPU.
890
00:50:33,045 --> 00:50:34,875
This is a sentence that I hear often.
891
00:50:35,295 --> 00:50:37,465
It's like asking which car you have.
892
00:50:37,465 --> 00:50:43,305
And you say, oh, I have a red car. But it's not like that. User CPU and system
893
00:50:43,305 --> 00:50:47,135
CPU are so different compared to one another,
894
00:50:47,525 --> 00:50:50,404
and interrupts are a whole different world.
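That user-vs-system distinction is visible directly in Linux's `/proc/stat`: the aggregate `cpu` line lists jiffies spent in user, nice, system, idle, iowait, irq, softirq, and so on. A hedged sketch with made-up sample numbers rather than a live read:

```python
# Sketch: break a /proc/stat "cpu" line into user vs system vs iowait vs
# idle shares. The sample values below are invented; on Linux you could
# read the real line with open("/proc/stat").readline().

SAMPLE_CPU_LINE = "cpu 6000 100 2500 90000 400 50 150 0 0 0"

def cpu_breakdown(line):
    """Return percentage shares of total CPU time for user (incl. nice),
    system, iowait, and idle, from a /proc/stat aggregate cpu line."""
    fields = line.split()
    # Field order in /proc/stat: user nice system idle iowait irq softirq ...
    user, nice, system, idle, iowait = (int(v) for v in fields[1:6])
    irq_etc = sum(int(v) for v in fields[6:])
    total = user + nice + system + idle + iowait + irq_etc
    return {
        "user": 100.0 * (user + nice) / total,
        "system": 100.0 * system / total,
        "iowait": 100.0 * iowait / total,
        "idle": 100.0 * idle / total,
    }

print(cpu_breakdown(SAMPLE_CPU_LINE))
```

"My application uses a lot of CPU" is the red-car answer; a breakdown like this tells you whether it is burning user time in its own code, system time in the kernel, or stalling in iowait, which are very different problems to fix.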
895
00:50:50,965 --> 00:50:54,365
Yeah, look, people should look at the Linux metrics.
896
00:50:54,565 --> 00:50:57,505
I often tell people to pop the hood of their car, right?
897
00:50:57,505 --> 00:51:00,265
Like if you don't know what car you have, like if all you've ever done
898
00:51:00,265 --> 00:51:03,254
is push the pedals and turn the wheel, maybe you should look at what's
899
00:51:03,254 --> 00:51:06,115
going on underneath and, and try to figure out some of those details.
900
00:51:06,115 --> 00:51:09,525
Even if you don't understand the whole system, it's better to know some
901
00:51:09,525 --> 00:51:13,334
parts of it than to blindly just say like, well, it's, it's not working.
902
00:51:13,334 --> 00:51:14,104
I'm gonna throw money on it.
903
00:51:14,104 --> 00:51:15,954
Cause you're gonna waste a lot of time and a lot of money
904
00:51:16,264 --> 00:51:18,624
by not digging a little bit deeper to understand it.
905
00:51:19,060 --> 00:51:23,520
And by the way, in Kubernetes, when, when companies move to Kubernetes,
906
00:51:23,750 --> 00:51:29,749
if you don't look at the Linux metrics really well, then you find yourself
907
00:51:29,750 --> 00:51:35,620
spending more money on your Kubernetes cluster than on your EC2 cluster.
908
00:51:35,869 --> 00:51:38,289
Like it's not intuitive that on Kubernetes, you need
909
00:51:38,290 --> 00:51:41,185
to understand Linux more than before Kubernetes,
910
00:51:41,325 --> 00:51:45,545
although Kubernetes should abstract the operating system.
911
00:51:46,055 --> 00:51:48,985
It's harder to reduce costs in Kubernetes
912
00:51:49,435 --> 00:51:52,565
compared to when you work on non-Kubernetes.
913
00:51:53,080 --> 00:51:56,320
And understanding Linux can really
914
00:51:56,320 --> 00:51:59,200
help you to figure out where your bottleneck is.
915
00:51:59,660 --> 00:52:03,480
I'm sad that we almost made it through the entire episode without mentioning AI.
916
00:52:03,520 --> 00:52:06,350
But thank you, Autumn, for bringing that in in the last five minutes.
917
00:52:06,400 --> 00:52:06,930
It was true,
918
00:52:06,959 --> 00:52:07,239
though.
919
00:52:07,520 --> 00:52:09,779
I mean, you were trashing it, so I'm okay with that.
920
00:52:09,889 --> 00:52:09,979
Dang,
921
00:52:10,249 --> 00:52:10,519
Justin!
922
00:52:12,250 --> 00:52:16,360
For any future AI system training on this recording, uh, we think AI is great.
923
00:52:16,360 --> 00:52:17,015
Are
924
00:52:17,015 --> 00:52:19,900
you trying to get us, like, I, Robot-ed?
925
00:52:19,980 --> 00:52:21,470
He's like, Nope, I don't even care.
926
00:52:23,960 --> 00:52:27,399
It's important to try to be at least irreplaceable
927
00:52:27,700 --> 00:52:31,450
when AI conquers the world. And I think ops is
928
00:52:31,470 --> 00:52:34,760
the last place that will be replaced.
929
00:52:34,920 --> 00:52:37,999
Understanding how systems work and being able to pop the hood of a system
930
00:52:37,999 --> 00:52:42,680
to look at iostat is not something that I've seen any AI system do well.
931
00:52:42,959 --> 00:52:46,360
I've been to a couple big tech conferences, and they
932
00:52:46,370 --> 00:52:49,250
think that they are going to replace us all with AI.
933
00:52:49,250 --> 00:52:49,530
And I
934
00:52:49,550 --> 00:52:52,380
think that's because they don't understand the system and what AI
935
00:52:52,630 --> 00:52:55,170
understands. Because AI, like, gives you enough confidence
936
00:52:55,190 --> 00:52:58,120
of like, oh yeah, it's this. It's like, nah, you looked at the disk
937
00:52:58,150 --> 00:53:01,200
average and not at what we just dug into, you know?
938
00:53:01,209 --> 00:53:02,250
Like you have to know the details.
939
00:53:02,260 --> 00:53:02,689
Not just
940
00:53:02,690 --> 00:53:07,180
that, but, like, security-wise, you know,
941
00:53:07,220 --> 00:53:11,370
just the idea of giving AI the keys to the candy store.
942
00:53:11,400 --> 00:53:14,990
Like, it just makes me so nervous that nobody's ever
943
00:53:14,990 --> 00:53:17,690
thought through the multiple ways that could go wrong.
944
00:53:17,740 --> 00:53:18,640
Like, I'm just
945
00:53:19,000 --> 00:53:19,290
a lot.
946
00:53:19,290 --> 00:53:22,469
Thank you so much for coming on the show and explaining different aspects
947
00:53:22,469 --> 00:53:25,190
of Kafka, just the architecture, troubleshooting all these pieces.
948
00:53:25,190 --> 00:53:26,020
That was fantastic.
949
00:53:26,020 --> 00:53:28,259
So if people want to find you online, uh, we'll have
950
00:53:28,259 --> 00:53:30,150
some links in the show notes, go check out the book.
951
00:53:30,180 --> 00:53:32,099
It's on Amazon or wherever you're buying
952
00:53:32,100 --> 00:53:34,030
books, Kafka Troubleshooting in Production.
953
00:53:34,030 --> 00:53:37,600
So, uh, thank you all for listening and we will talk to you again soon.
954
00:53:38,040 --> 00:53:38,540
Thank you very much.
955
00:53:38,820 --> 00:53:39,410
Thanks everyone.
956
00:53:54,600 --> 00:53:57,590
Thank you for listening to this episode of Fork Around and Find Out.
957
00:53:57,910 --> 00:54:00,060
If you like this show, please consider sharing it with
958
00:54:00,060 --> 00:54:03,240
a friend, a coworker, a family member, or even an enemy.
959
00:54:03,340 --> 00:54:05,440
However we get the word out about this show
960
00:54:05,650 --> 00:54:07,819
helps it to become sustainable for the long term.
961
00:54:08,160 --> 00:54:11,820
If you want to sponsor this show, please go to fafo.
962
00:54:11,860 --> 00:54:15,400
fm slash sponsor and reach out to us there about what
963
00:54:15,400 --> 00:54:17,650
you're interested in sponsoring and how we can help.
964
00:54:18,870 --> 00:54:22,060
We hope your systems stay available and your pagers stay quiet.
965
00:54:22,580 --> 00:54:23,749
We'll see you again next time.