Tuning &os;
Introduction
Overview
This set of documents covers a variety of performance tuning topics.
It attempts to span tuning from the perspective of the system
administrator to that of the systems programmer. The art of performance
tuning itself is very old. To tune something means to make it operate
more efficiently; whether one is referring to a &os; based
technical server or a vacuum cleaner, the goal is to improve something,
whether that be the way it is done, how it works or how it is
put together.
What is Performance Tuning?
A view from 10,000 feet shows that nearly everything
we do is task oriented; this applies to a &os; system as well.
When the system boots, it automatically begins to perform a
variety of tasks. When a user logs in, they usually have a wide
variety of tasks to accomplish. In the scope of these
documents, however, performance tuning strictly means
improving how efficiently a &os; system performs.
The most common thought that crops up in someone's mind when
they think "tuning" is some sort of speed increase or decreasing the
size of the kernel. While those are ways to improve performance,
they are not the only ends an administrator may pursue
to increase efficiency.
For our purposes, performance tuning means this:
To make a &os; system operate in an optimum and stable
state.
That could mean a variety of things, not necessarily
speed enhancements. A good example of this is filesystem
formatting parameters: on a system that has a lot of small
files (say, a source repository) an administrator may
need to increase the number of inodes by lowering the
inode density, the number of data bytes per inode (say,
down to 1024 bytes). Increasing the number of inodes
keeps the administrator from getting those nasty
out of inodes messages, which ultimately makes the system more efficient.
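As an illustrative sketch (the partition name here is hypothetical;
see the &man.newfs.8; man page for details), the inode density is set
at filesystem creation time with the -i option of newfs, which takes
the number of bytes of data space per inode; a smaller value yields
more inodes:
&rprompt; newfs -i 1024 /dev/rwd0e
Inode usage on a mounted filesystem can then be checked with df -i.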
Tuning normally revolves around finding and eliminating bottlenecks.
Most of the time, such bottlenecks are spurious; for example, a
release of Mozilla that does not handle Java applets very well can
cause Mozilla to start crunching the CPU, especially on applets that
are not written well. Occasions when processes seem to spin off into
nowhere and eat CPU are almost always resolved with a kill. There are
instances, however, when resolving bottlenecks takes a lot longer;
for example, say a server that is rsynced to is just getting larger
and larger. Slowly, performance begins to fade and the administrator
may have to take some sort of action to speed things up. That
situation, however, is far less urgent than an emergency like an
instantly spiked CPU.
When does one tune?
Many &os; users rarely have to tune a system. The GENERIC
kernel may run just fine and the layout/configuration of the system
may do the job as well. By the same token, as a practical matter it
is always good to know how to tune a system. Most often tuning comes
as a result of a sudden bottleneck (which may occur randomly) or a
gradual loss of performance. In a sense it does happen to everyone
at some point: one process that is eating the CPU is a bottleneck
as much as a gradual increase in paging. So the question should
not be when to tune so much as when to learn to tune.
One last time to tune: if you can tune in a preventive manner
(and you think you might need to), then do it. One example of this
was on a system that needed to be able to reboot quickly. Instead of
waiting, I did everything I could to trim the kernel and make sure
there was absolutely nothing running that was not needed; I even
removed drivers whose devices were present but never used (lp).
The result was reducing reboot time by nearly two-thirds. In the
long run, it was a smart move to tune it before it became an issue.
What these Documents Will Not Cover
Before wrapping up the introduction, it is important to
note what these documents will not cover. This guide pertains
only to the core &os; system. In other words, it will not cover,
say, tuning a web server's configuration to make it run better;
however, it might mention how to tune &os; to run better as a web
server. The logic behind this is simple: web servers, database
software and so on are third party and almost limitless. It would be
easy to get mired in details that do not apply to the &os; system.
Almost all third party software has its own documentation about
tuning anyhow.
How Examples are Laid Out
Since there is ample man page documentation, only the options and
arguments used in the examples are discussed. In some cases, material
is truncated for brevity and not thoroughly discussed because, quite
simply, there is too much. For example, not every single device driver
entry in the kernel will be discussed; however, an example of
determining whether or not a given system needs one will be.
Nothing in this guide is concrete; tuning and performance are very
subjective. Instead, it is a guide for the reader to learn what
some of the tools available to them can do.
Tuning Considerations
Tuning a system is not really too difficult when the approach is
pro-active; this document approaches tuning from a
before it comes up
perspective. While tuning in spare time is
considerably easier than tuning, say, a server that is almost
completely bogged down to 0.1% idle time, there are still a few
things that should be mulled over before actually tuning;
ideally, before a system is even installed.
General System Configuration
Of course, how the system is set up makes a big difference.
Sometimes small items can be overlooked which may in fact
cause some sort of long term performance problem.
Filesystems and Disks
How the filesystem is laid out relative to disk drives is very
important. On hardware RAID systems it is not such a big deal,
but many &os; users specifically use &os; on older hardware
where hardware RAID simply is not an option. The idea of putting
/ on the first drive is a good one, but, for example, if there
are several drives to choose from for the first one, is
the best performing drive the one that / will be on?
On a related note, is it wise to split off /usr? Will the system
see heavy usage, say, in /usr/pkgsrc? It might make sense to
slap a fast drive in and mount it under /usr/pkgsrc, or it might
not. Like all things in performance tuning, this is subjective.
Swap Configuration
There are three schools of thought on swap size and about fifty
on using split swap files with prioritizing and how that should
be done. In the swap size arena, the vendor school (at least most
commercial ones) usually has its own formula per OS. As an
example, on a particular version of HP-UX with a particular
version of Oracle the formula was:
2.5 GB * number_of_processors
Well, that all really depends on what type of usage the
database sees and how large it is; for instance, if it is
so large that it must be distributed, that formula does not fit well.
The next school of thought about swap sizing is sort of strange
but makes some sense; it says, if possible, get a reference amount
of memory used by the system. It goes something like this:
Start up a machine and estimate total memory needs by
running everything that may ever be needed at once.
Databases, web servers .... whatever. Total up the amount.
Add a few MB for padding.
Subtract the amount of physical RAM from this total;
what is left over is the swap needed.
If the amount left over is 3 times the size of physical RAM,
consider getting more RAM. The problem, of course, is
figuring out what is needed and how much space it will take.
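As a hypothetical worked example of this method (all numbers are made
up for illustration): suppose everything running at once totals
roughly 560MB; pad that to 600MB. On a machine with 256MB of
physical RAM:
600MB (estimated need) - 256MB (physical RAM) = 344MB of swap
Since 344MB is well under 3 * 256MB = 768MB, the RAM is adequate
and a swap partition of around 344MB would do.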
There is also another flaw in this method: some programs do not
behave well. A glaring example of misbehaved software is web browsers.
On certain versions of Netscape, when something went wrong, Netscape
had a tendency to run away and eat swap space. So, the more spare
space available, the more time to kill it.
Last but not least is the tried and true PHYSICAL_RAM * 2 method.
On modern machines, and even older ones (with limited purpose, of
course), this seems to work best.
All in all, it is hard to tell when swap will start thrashing.
Even on small machines with 16MB of RAM (or less), &os; has always
worked well for most people, unless misbehaving software is running.
System Services
On servers, system services have a large impact. Getting them to
run at their best almost always requires some sort of network level
change or a fundamental speed increase in the underlying system
(which, of course, is what this is all about). There are instances
when some simple solutions can improve services. One example:
an ftp server is becoming slower, and a new release of the ftp
server that ships with the system comes out that just
happens to run faster. By upgrading the ftp software, a
performance boost is accomplished.
Another good example where services are concerned is the age
old question To use inetd or not to use inetd?. A great
service example is pop3. Pop3 connections can conceivably
clog up inetd; while the pop3 service itself starts to
degrade slowly, other services that are multiplexed
through inetd will also degrade (in some cases more than pop3).
Setting up pop3 to run outside of inetd, on its own, may help;
a sketch of the change follows.
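As a hedged sketch of that change (the daemon path and name are
hypothetical; the exact line depends on which pop3 server is
installed), the pop3 entry in /etc/inetd.conf would be commented out:
#pop3 stream tcp nowait root /usr/pkg/libexec/qpopper qpopper
Then inetd is signaled to reread its configuration with kill -HUP,
and the pop3 daemon is started in its standalone mode per its own
documentation.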
The &os; Kernel
The &os; kernel obviously plays a key role in how well a
system performs. While rebuilding and tuning the kernel is covered
later in the text, it is worth discussing here from a high level.
Tuning the &os; kernel really involves three main areas:
removing unrequired drivers
configuring options
system settings
Removing Unrequired Drivers
Taking drivers that are not needed out of the kernel achieves
several results: first, the system boots faster since the
kernel is smaller; second, again because the kernel is smaller,
more memory is free to users and processes; and third, the
kernel tends to respond quicker.
Configuring Options
Configuring options, such as enabling/disabling certain subsystems,
specific hardware and filesystems, can also improve performance
in much the same way as removing unrequired drivers does. A very
simple example of this is an FTP server that only hosts ftp files
- nothing else. On this particular server there is no need to
have anything but native filesystem support and perhaps a few
options to help speed things along. Why would it ever need
NTFS support, for example? Besides, if it did, support for
NTFS could be added at some later time. In the opposite case,
a workstation may need to support many different filesystem
types to share and access files.
System Settings
System wide settings are controlled by the kernel; a few
examples are filesystem settings, network settings and core
kernel settings such as the maximum number of processes.
Almost all system settings can be at least looked at or
modified via the sysctl facility. Examples using the
sysctl facility are given later on.
Visual Monitoring Tools
&os; ships a variety of performance monitoring tools with the
system. Most of these tools are common on all UNIX systems. In
this section some example usage of the tools is given with
interpretation of the output.
The top Process Monitor
The top monitor does exactly what it says: it displays the CPU
hogs on the system. To run it, simply type top at the prompt.
Without any arguments, it should look like:
load averages: 0.09, 0.12, 0.08 20:23:41
21 processes: 20 sleeping, 1 on processor
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Memory: 15M Act, 1104K Inact, 208K Wired, 22M Free, 129M Swap free
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
13663 root 2 0 1552K 1836K sleep 0:08 0.00% 0.00% httpd
127 root 10 0 129M 4464K sleep 0:01 0.00% 0.00% mount_mfs
22591 root 2 0 388K 1156K sleep 0:01 0.00% 0.00% sshd
108 root 2 0 132K 472K sleep 0:01 0.00% 0.00% syslogd
22597 jrf 28 0 156K 616K onproc 0:00 0.00% 0.00% top
22592 jrf 18 0 828K 1128K sleep 0:00 0.00% 0.00% tcsh
203 root 10 0 220K 424K sleep 0:00 0.00% 0.00% cron
1 root 10 0 312K 192K sleep 0:00 0.00% 0.00% init
205 root 3 0 48K 432K sleep 0:00 0.00% 0.00% getty
206 root 3 0 48K 424K sleep 0:00 0.00% 0.00% getty
208 root 3 0 48K 424K sleep 0:00 0.00% 0.00% getty
207 root 3 0 48K 424K sleep 0:00 0.00% 0.00% getty
13667 nobody 2 0 1660K 1508K sleep 0:00 0.00% 0.00% httpd
9926 root 2 0 336K 588K sleep 0:00 0.00% 0.00% sshd
200 root 2 0 76K 456K sleep 0:00 0.00% 0.00% inetd
182 root 2 0 92K 436K sleep 0:00 0.00% 0.00% portsentry
180 root 2 0 92K 436K sleep 0:00 0.00% 0.00% portsentry
13666 nobody -4 0 1600K 1260K sleep 0:00 0.00% 0.00% httpd
The top utility is great for finding CPU hogs, runaway processes or
groups of processes that may be causing problems. The output shown
above indicates that this particular system is in good health. Now,
the next display should show some very different results:
load averages: 0.34, 0.16, 0.13 21:13:47
25 processes: 24 sleeping, 1 on processor
CPU states: 0.5% user, 0.0% nice, 9.0% system, 1.0% interrupt, 89.6% idle
Memory: 20M Act, 1712K Inact, 240K Wired, 30M Free, 129M Swap free
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
5304 jrf -5 0 56K 336K sleep 0:04 66.07% 19.53% bonnie
5294 root 2 0 412K 1176K sleep 0:02 1.01% 0.93% sshd
108 root 2 0 132K 472K sleep 1:23 0.00% 0.00% syslogd
187 root 2 0 1552K 1824K sleep 0:07 0.00% 0.00% httpd
5288 root 2 0 412K 1176K sleep 0:02 0.00% 0.00% sshd
5302 jrf 28 0 160K 620K onproc 0:00 0.00% 0.00% top
5295 jrf 18 0 828K 1116K sleep 0:00 0.00% 0.00% tcsh
5289 jrf 18 0 828K 1112K sleep 0:00 0.00% 0.00% tcsh
127 root 10 0 129M 8388K sleep 0:00 0.00% 0.00% mount_mfs
204 root 10 0 220K 424K sleep 0:00 0.00% 0.00% cron
1 root 10 0 312K 192K sleep 0:00 0.00% 0.00% init
208 root 3 0 48K 432K sleep 0:00 0.00% 0.00% getty
210 root 3 0 48K 424K sleep 0:00 0.00% 0.00% getty
209 root 3 0 48K 424K sleep 0:00 0.00% 0.00% getty
211 root 3 0 48K 424K sleep 0:00 0.00% 0.00% getty
217 nobody 2 0 1616K 1272K sleep 0:00 0.00% 0.00% httpd
184 root 2 0 336K 580K sleep 0:00 0.00% 0.00% sshd
201 root 2 0 76K 456K sleep 0:00 0.00% 0.00% inetd
At first, it should seem rather obvious which process is hogging
the system; what is interesting in this case, however, is why. The
bonnie program is a disk benchmark tool which can write large files
in a variety of sizes and ways. What the previous output indicates
is only that the bonnie program is a CPU hog, not why.
Other Neat Things About Top
A careful examination of the manual page &man.top.1; shows that
there is a lot more that can be done with top; for example, processes
can have their priority changed and be killed. Additionally, filters
can be set for looking at processes.
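For example, to watch only the processes of a single user (a minimal
illustration; see &man.top.1; for the full list of options and
interactive commands):
&uprompt; top -U jrf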
The systat utility
As the man page &man.systat.1; indicates, the systat utility shows a
variety of system statistics using the curses library. While it is
running, the screen is shown in two parts: the upper window shows
the current load average, while the lower screen depends on user
commands. The exception to the split window view is the vmstat
display, which takes up the whole screen. Following is what
systat looks like on a fairly idle system when invoked with no
arguments:
/0 /1 /2 /3 /4 /5 /6 /7 /8 /9 /10
Load Average |
/0 /10 /20 /30 /40 /50 /60 /70 /80 /90 /100
<idle> XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Basically a lot of dead time there, so now have a look with some
arguments provided, in this case, systat inet.tcp,
which looks like this:
/0 /1 /2 /3 /4 /5 /6 /7 /8 /9 /10
Load Average |
0 connections initiated 19 total TCP packets sent
0 connections accepted 11 data
0 connections established 0 data (retransmit)
8 ack-only
0 connections dropped 0 window probes
0 in embryonic state 0 window updates
0 on retransmit timeout 0 urgent data only
0 by keepalive 0 control
0 by persist
29 total TCP packets received
11 potential rtt updates 17 in sequence
11 successful rtt updates 0 completely duplicate
9 delayed acks sent 0 with some duplicate data
0 retransmit timeouts 4 out of order
0 persist timeouts 0 duplicate acks
0 keepalive probes 11 acks
0 keepalive timeouts 0 window probes
0 window updates
Now that is informative. The first poll is cumulative, so it is
possible to see quite a lot of information in the output when
systat is invoked. Now, while that may be interesting, how about
a look at the buffer cache with systat bufcache:
/0 /1 /2 /3 /4 /5 /6 /7 /8 /9 /10
Load Average
There are 1642 buffers using 6568 kBytes of memory.
File System Bufs used % kB in use % Bufsize kB % Util %
/ 877 53 6171 93 6516 99 94
/var/tmp 5 0 17 0 28 0 60
Total: 882 53 6188 94 6544 99
Again, a pretty boring system, but great information to have
available. While this is all nice to look at, it is time to put a
false load on the system to see how systat can be used as a
performance monitoring tool. As with top, bonnie++ will be used
to put a high load on the I/O subsystems and a little on the CPU.
The bufcache will be looked at again to see if there are any
noticeable differences:
/0 /1 /2 /3 /4 /5 /6 /7 /8 /9 /10
Load Average |||
There are 1642 buffers using 6568 kBytes of memory.
File System Bufs used % kB in use % Bufsize kB % Util %
/ 811 49 6422 97 6444 98 99
Total: 811 49 6422 97 6444 98
First, notice that the load average shot up; this is to be
expected, of course. Then notice that, while most of the numbers
are close, utilization is at 99%. Throughout the time that bonnie++
was running, the utilization percentage remained at 99. This of
course makes sense; in a real troubleshooting situation, however,
it could be indicative of a process doing heavy I/O on one particular
file or filesystem.
Monitoring Tools
In addition to screen oriented monitors and tools, the &os;
system also ships with a set of command line oriented tools.
Many of the tools that ship with a &os; system can be found
on other UNIX and UNIX-like systems.
fstat
The &man.fstat.1; utility reports the status of open files on the
system. While it is not what many administrators consider a
performance monitor, it can help find out if a particular user or
process is using an inordinate number of files, generating large
files, and similar information.
Following is a sample of some fstat output:
USER CMD PID FD MOUNT INUM MODE SZ|DV R/W
jrf tcsh 21607 wd / 29772 drwxr-xr-x 512 r
jrf tcsh 21607 3* unix stream c057acc0<-> c0553280
jrf tcsh 21607 4* unix stream c0553280 <-> c057acc0
root sshd 21597 wd / 2 drwxr-xr-x 512 r
root sshd 21597 0 / 11921 crw-rw-rw- null rw
nobody httpd 5032 wd / 2 drwxr-xr-x 512 r
nobody httpd 5032 0 / 11921 crw-rw-rw- null r
nobody httpd 5032 1 / 11921 crw-rw-rw- null w
nobody httpd 5032 2 / 15890 -rw-r--r-- 353533 rw
...
The fields are pretty self explanatory. Again, this tool, while
not as performance oriented as others, can come in handy when
trying to find out information about file usage.
iostat
The &man.iostat.8; command does exactly what it sounds like: it
reports the status of the I/O subsystems on the system. When
iostat is employed, the user typically runs it with a certain
number of counts and an interval between them, like so:
&uprompt; iostat 5 5
tty wd0 cd0 fd0 md0 cpu
tin tout KB/t t/s MB/s KB/t t/s MB/s KB/t t/s MB/s KB/t t/s MB/s us ni sy in id
0 1 5.13 1 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0 0 0 0 100
0 54 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0 0 0 0 100
0 18 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0 0 0 0 100
0 18 8.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0 0 0 0 100
0 28 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0 0 0 0 100
The above output is from a very quiet ftp server. The fields
represent the various I/O devices: the tty (which, ironically,
is the most active because iostat is running), wd0 which is the
primary IDE disk, cd0 the cdrom drive, fd0 the floppy, and md0
the memory filesystem.
Now, let's see if we can pummel the system with some heavy usage:
a large ftp transaction consisting of a tarball of the netbsd-current
source, along with the bonnie++ disk benchmark program running
at the same time.
&uprompt; iostat 5 5
tty wd0 cd0 fd0 md0 cpu
tin tout KB/t t/s MB/s KB/t t/s MB/s KB/t t/s MB/s KB/t t/s MB/s us ni sy in id
0 1 5.68 1 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0 0 0 0 100
0 54 61.03 150 8.92 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 1 0 18 4 78
0 26 63.14 157 9.71 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 1 0 20 4 75
0 20 43.58 26 1.12 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0 0 9 2 88
0 28 19.49 82 1.55 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 1 0 7 3 89
As can be expected, notice that wd0 is very active. What is
interesting about this output is how the processor's I/O seems
to rise in proportion to wd0. This makes perfect sense; however,
it is worth noting that it can be observed only because this ftp
server is hardly being used. If, for example, the CPU's I/O
subsystem was already under a moderate load and the disk subsystem
was under the same load as it is now, it could appear that the
CPU is bottlenecked when in fact it would have been the disk.
In such a case, we can observe that "one tool" is rarely
enough to completely analyze a problem. A quick glance at
processes (after watching iostat) would probably tell us which
processes were causing problems.
ps
Using the &man.ps.1; command, or process status, a great deal of
information about the system can be discovered. Most of the time,
the ps command is used to isolate a particular process by name,
group, owner, etc. Invoked with no options or arguments, ps simply
prints out information about the user executing it.
&uprompt; ps
PID TT STAT TIME COMMAND
21560 p0 Is 0:00.04 -tcsh
21564 p0 I+ 0:00.37 ssh jrf.odpn.net
21598 p1 Ss 0:00.12 -tcsh
21673 p1 R+ 0:00.00 ps
21638 p2 Is+ 0:00.06 -tcsh
Not very exciting. The fields are self explanatory with the exception
of STAT, which is the state a process is in. The flags are
all documented in the man page; in the above example, however,
I is idle, S is sleeping, R is runnable, the + means the process
is in a foreground state, and the s means the process is a session
leader. This all makes perfect sense when looking at the flags:
for example, PID 21560 is a shell, it is idle and (as would be
expected) the shell is a session leader.
In most cases, someone is looking for something very specific
in the process listing. As an example, looking at all processes
is specified with -a; to see all processes plus those without
controlling terminals, use -ax; and to get a much more verbose
listing (basically everything, plus information about the
impact processes are having), use aux:
&rprompt; ps aux
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 0 0.0 9.6 0 6260 ?? DLs 16Jul02 0:01.00 (swapper)
root 23362 0.0 0.8 144 488 ?? S 12:38PM 0:00.01 ftpd -l
root 23328 0.0 0.4 428 280 p1 S 12:34PM 0:00.04 -csh
jrf 23312 0.0 1.8 828 1132 p1 Is 12:32PM 0:00.06 -tcsh
root 23311 0.0 1.8 388 1156 ?? S 12:32PM 0:01.60 sshd: jrf@ttyp1
jrf 21951 0.0 1.7 244 1124 p0 S+ 4:22PM 0:02.90 ssh jrf.odpn.net
jrf 21947 0.0 1.7 828 1128 p0 Is 4:21PM 0:00.04 -tcsh
root 21946 0.0 1.8 388 1156 ?? S 4:21PM 0:04.94 sshd: jrf@ttyp0
nobody 5032 0.0 2.0 1616 1300 ?? I 19Jul02 0:00.02 /usr/pkg/sbin/httpd
...
Again, most of the fields are self explanatory, with the exception of
VSZ and RSS, which can be a little confusing. RSS is the real size of
a process in 1024 byte units, while VSZ is the virtual size. This is
all great, but again, how can ps help? Well, for one, take a
look at this modified version of the same output:
&rprompt; ps aux
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 0 0.0 9.6 0 6260 ?? DLs 16Jul02 0:01.00 (swapper)
root 23362 0.0 0.8 144 488 ?? S 12:38PM 0:00.01 ftpd -l
root 23328 0.0 0.4 428 280 p1 S 12:34PM 0:00.04 -csh
jrf 23312 0.0 1.8 828 1132 p1 Is 12:32PM 0:00.06 -tcsh
root 23311 0.0 1.8 388 1156 ?? S 12:32PM 0:01.60 sshd: jrf@ttyp1
jrf 21951 0.0 1.7 244 1124 p0 S+ 4:22PM 0:02.90 ssh jrf.odpn.net
jrf 21947 0.0 1.7 828 1128 p0 Is 4:21PM 0:00.04 -tcsh
root 21946 0.0 1.8 388 1156 ?? S 4:21PM 0:04.94 sshd: jrf@ttyp0
nobody 5032 9.0 2.0 1616 1300 ?? I 19Jul02 0:00.02 /usr/pkg/sbin/httpd
...
Given that on this server our baseline indicates a relatively
quiet system, PID 5032 has an unusually large %CPU.
Sometimes this can also cause high TIME numbers. The output of ps
can be grepped for PIDs, usernames and process names, and hence
help track down processes that may be experiencing problems.
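For instance, to check on just the web server processes (a simple
illustration):
&uprompt; ps aux | grep httpd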
vmstat
Using &man.vmstat.1;, information pertaining to virtual memory can be
monitored and measured. Not unlike iostat, vmstat can be invoked
with a count and interval. Following is some sample output
using 5 5, like the iostat example:
&uprompt; vmstat 5 5
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr w0 c0 f0 m0 in sy cs us sy id
0 7 0 17716 33160 2 0 0 0 0 0 1 0 0 0 105 15 4 0 0 100
0 7 0 17724 33156 2 0 0 0 0 0 1 0 0 0 109 6 3 0 0 100
0 7 0 17724 33156 1 0 0 0 0 0 1 0 0 0 105 6 3 0 0 100
0 7 0 17724 33156 1 0 0 0 0 0 0 0 0 0 107 6 3 0 0 100
0 7 0 17724 33156 1 0 0 0 0 0 0 0 0 0 105 6 3 0 0 100
Yet again, relatively quiet. For comparison, the exact same load
that was put on this server in the iostat example will be used:
a large file transfer and the bonnie benchmark program.
&uprompt; vmstat 5 5
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr w0 c0 f0 m0 in sy cs us sy id
1 8 0 18880 31968 2 0 0 0 0 0 1 0 0 0 105 15 4 0 0 100
0 8 0 18888 31964 2 0 0 0 0 0 130 0 0 0 1804 5539 1094 31 22 47
1 7 0 18888 31964 1 0 0 0 0 0 130 0 0 0 1802 5500 1060 36 16 49
1 8 0 18888 31964 1 0 0 0 0 0 160 0 0 0 1849 5905 1107 21 22 57
1 7 0 18888 31964 1 0 0 0 0 0 175 0 0 0 1893 6167 1082 1 25 75
Just a little different. Notice that, since most of the work was I/O
based, the actual memory used was not very much. Since this system
uses mfs for /tmp, however, it can certainly get beat up.
Have a look at this:
&rprompt; vmstat 5 5
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr w0 c0 f0 m0 in sy cs us sy id
0 2 0 99188 500 2 0 0 0 0 0 1 0 0 0 105 16 4 0 0 100
0 2 0111596 436 592 0 587 624 586 1210 624 0 0 0 741 883 1088 0 11 89
0 3 0123976 784 666 0 662 643 683 1326 702 0 0 0 828 993 1237 0 12 88
0 2 0134692 1236 581 0 571 563 595 1158 599 0 0 0 722 863 1066 0 9 90
2 0 0142860 912 433 0 406 403 405 808 429 0 0 0 552 602 768 0 7 93
Pretty scary stuff. That was created by running bonnie in /tmp
on a memory based filesystem. If it continued for too long,
it is possible the system could have started thrashing.
Notice that even though the VM subsystem was taking a beating,
the processors still were not getting too battered.
Network Tools
Sometimes a performance problem is not a particular machine;
it is the network, or some sort of device on the network, such as
another host, a router, etc. What the other machines that provide a
service or some sort of connectivity to a particular &os; system
do, and how they act, can have a very large impact on performance
of the &os; system itself, or on the perception of performance by
users. A really great example of this is when a DNS server that a
&os; machine is using suddenly disappears: lookups take a long time
and eventually fail. Someone logged into the &os; machine who is not
experienced would undoubtedly (provided they had no other evidence)
blame the &os; system. One of my personal favorites,
the Internet is broke,
usually means either DNS service or a
router/gateway has dropped offline. Whatever the case may be,
a &os; system comes adequately armed to deal with finding out
what network issues may be cropping up, whether the fault of the
local system or some other issue.
ping
The classic &man.ping.8; utility can tell us if there is just plain
connectivity; it can also tell if host resolution (depending
on how nsswitch.conf dictates) is working. Following
is some typical ping output on a local network, with a count of 3
specified:
&rprompt; ping -c 3 marie
PING marie (172.16.14.12): 56 data bytes
64 bytes from 172.16.14.12: icmp_seq=0 ttl=255 time=0.571 ms
64 bytes from 172.16.14.12: icmp_seq=1 ttl=255 time=0.361 ms
64 bytes from 172.16.14.12: icmp_seq=2 ttl=255 time=0.371 ms
----marie PING Statistics----
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.361/0.434/0.571/0.118 ms
Not only does ping tell us if a host is alive, it tells us how
long it took and gives some nice details at the very end. If
a hostname cannot be resolved, the IP address can be specified
directly instead:
&rprompt; ping -c 1 172.16.20.5
PING ash (172.16.20.5): 56 data bytes
64 bytes from 172.16.20.5: icmp_seq=0 ttl=64 time=0.452 ms
----ash PING Statistics----
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.452/0.452/0.452/0.000 ms
Now, not unlike any other tool, the times are very subjective,
especially in regard to networking. For example, while the
times in the examples above are good, take a look at a ping to
localhost:
&rprompt; ping -c 4 localhost
PING localhost (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=255 time=0.091 ms
64 bytes from 127.0.0.1: icmp_seq=1 ttl=255 time=0.129 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=255 time=0.120 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=255 time=0.122 ms
----localhost PING Statistics----
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.091/0.115/0.129/0.017 ms
Much smaller, because the request never left the machine. Pings
can be used to gather information about how well a network
is performing. Ping is also good for problem isolation; for instance,
if there are three similarly sized &os; systems on a
network and one of them simply has horrible ping times, chances
are something is wrong on that one particular machine.
traceroute
The &man.traceroute.8; command is great for making sure a path is
available or detecting problems on a particular path. As
an example, here is a trace between the example ftp server
and ftp.NetBSD.org:
&rprompt; traceroute ftp.NetBSD.org
traceroute to ftp.NetBSD.org (204.152.184.75), 30 hops max, 40 byte packets
1 208.44.95.1 (208.44.95.1) 1.646 ms 1.492 ms 1.456 ms
2 63.144.65.170 (63.144.65.170) 7.318 ms 3.249 ms 3.854 ms
3 chcg01-edge18.il.inet.qwest.net (65.113.85.229) 35.982 ms 28.667 ms 21.971 ms
4 chcg01-core01.il.inet.qwest.net (205.171.20.1) 22.607 ms 26.242 ms 19.631 ms
5 snva01-core01.ca.inet.qwest.net (205.171.8.50) 78.586 ms 70.585 ms 84.779 ms
6 snva01-core03.ca.inet.qwest.net (205.171.14.122) 69.222 ms 85.739 ms 75.979 ms
7 paix01-brdr02.ca.inet.qwest.net (205.171.205.30) 83.882 ms 67.739 ms 69.937 ms
8 198.32.175.3 (198.32.175.3) 72.782 ms 67.687 ms 73.320 ms
9 so-1-0-0.orpa8.pf.isc.org (192.5.4.231) 78.007 ms 81.860 ms 77.069 ms
10 tun0.orrc5.pf.isc.org (192.5.4.165) 70.808 ms 75.151 ms 81.485 ms
11 ftp.NetBSD.org (204.152.184.75) 69.700 ms 69.528 ms 77.788 ms
All in all, not bad. The trace went from the host to the local
router, then out onto the provider network and finally out onto
the Internet looking for the final destination. How to interpret
traceroutes is, again, subjective, but abnormally high times in
portions of a path can indicate a bottleneck on a piece of network
equipment. Not unlike ping, if the host itself is suspect,
run traceroute from another host to the same destination. Now,
for the worst case scenario:
&rprompt; traceroute www.microsoft.com
traceroute: Warning: www.microsoft.com has multiple addresses; using 207.46.230.220
traceroute to www.microsoft.akadns.net (207.46.230.220), 30 hops max, 40 byte packets
1 208.44.95.1 (208.44.95.1) 2.517 ms 4.922 ms 5.987 ms
2 63.144.65.170 (63.144.65.170) 10.981 ms 3.374 ms 3.249 ms
3 chcg01-edge18.il.inet.qwest.net (65.113.85.229) 37.810 ms 37.505 ms 20.795 ms
4 chcg01-core03.il.inet.qwest.net (205.171.20.21) 36.987 ms 32.320 ms 22.430 ms
5 chcg01-brdr03.il.inet.qwest.net (205.171.20.142) 33.155 ms 32.859 ms 33.462 ms
6 205.171.1.162 (205.171.1.162) 39.265 ms 20.482 ms 26.084 ms
7 sl-bb24-chi-13-0.sprintlink.net (144.232.26.85) 26.681 ms 24.000 ms 28.975 ms
8 sl-bb21-sea-10-0.sprintlink.net (144.232.20.30) 65.329 ms 69.694 ms 76.704 ms
9 sl-bb21-tac-9-1.sprintlink.net (144.232.9.221) 65.659 ms 66.797 ms 74.408 ms
10 144.232.187.194 (144.232.187.194) 104.657 ms 89.958 ms 91.754 ms
11 207.46.154.1 (207.46.154.1) 89.197 ms 84.527 ms 81.629 ms
12 207.46.155.10 (207.46.155.10) 78.090 ms 91.550 ms 89.480 ms
13 * * *
.......
In this case, the Microsoft server cannot be reached; somewhere
along the line, a system or server cannot (or will not) reply to
the information request. At that point, one might think to try
ping. In the Microsoft case, ping gets no reply either; that is
because somewhere on their network ICMP is most likely disabled.
netstat
Another problem that can crop up on a &os; system is routing
table issues. These issues are not always the system's fault.
The &man.route.8; and &man.netstat.1; commands can show information
about routes and network connections (respectively).
The route command can be used to look at and modify routing tables,
while netstat can display information about network connections
and routes. First, here is some output from route show:
&rprompt; route show
Routing tables
Internet:
Destination Gateway Flags
default 208.44.95.1 UG
loopback 127.0.0.1 UG
localhost 127.0.0.1 UH
172.15.13.0 172.16.14.37 UG
172.16.0.0 link#2 U
172.16.14.8 0:80:d3:cc:2c:0 UH
172.16.14.10 link#2 UH
marie 0:10:83:f9:6f:2c UH
172.16.14.37 0:5:32:8f:d2:35 UH
172.16.16.15 link#2 UH
loghost 8:0:20:a7:f0:75 UH
artemus 8:0:20:a8:d:7e UH
ash 0:b0:d0:de:49:df UH
208.44.95.0 link#1 U
208.44.95.1 0:4:27:3:94:20 UH
208.44.95.2 0:5:32:8f:d2:34 UH
208.44.95.25 0:c0:4f:10:79:92 UH
Internet6:
Destination Gateway Flags
default localhost UG
default localhost UG
localhost localhost UH
::127.0.0.0 localhost UG
::224.0.0.0 localhost UG
::255.0.0.0 localhost UG
::ffff:0.0.0.0 localhost UG
2002:: localhost UG
2002:7f00:: localhost UG
2002:e000:: localhost UG
2002:ff00:: localhost UG
fe80:: localhost UG
fe80::%ex0 link#1 U
fe80::%ex1 link#2 U
fe80::%lo0 fe80::1%lo0 U
fec0:: localhost UG
ff01:: localhost U
ff02::%ex0 link#1 U
ff02::%ex1 link#2 U
ff02::%lo0 fe80::1%lo0 U
The Flags column shows the status of the route and whether or not it
is a gateway. In this case we see U, H and G (U is up, H is host and
G is gateway; see the man page for additional flags).
Now for some netstat output using the -r (routing) and -n
(show network numbers) options:
Routing tables
Internet:
Destination Gateway Flags Refs Use Mtu Interface
default 208.44.95.1 UGS 0 330309 1500 ex0
127 127.0.0.1 UGRS 0 0 33228 lo0
127.0.0.1 127.0.0.1 UH 1 1624 33228 lo0
172.15.13/24 172.16.14.37 UGS 0 0 1500 ex1
172.16 link#2 UC 13 0 1500 ex1
...
Internet6:
Destination Gateway Flags Refs Use
Mtu Interface
::/104 ::1 UGRS 0 0
33228 lo0 =>
::/96 ::1 UGRS 0 0
The above output is a little more verbose. So, how can this help?
Well, a good example is when routes between networks get changed
while users are connected. I saw this happen several times when
someone was rebooting routers all day long after each change.
Several users called up saying they were getting kicked out
and that it was taking very long to log back in. As it turned out,
the clients connecting to the system were being redirected to another
router (which took a very long route) to reconnect. I observed
the M flag, Modified dynamically (by redirect), on their connections.
I deleted the routes, had them reconnect, and summarily followed up
with the offending technician.
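For illustration, deleting such a modified network route might look
like this (the destination is hypothetical; see &man.route.8; for the
exact syntax):
&rprompt; route delete -net 172.15.13.0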
tcpdump
Last, and definitely not least, is &man.tcpdump.8;, the network
sniffer that can retrieve a lot of information. In this discussion,
there will be some sample output and an explanation of some of
the more useful options of tcpdump.
Following is a small snippet of tcpdump in action, just as it starts:
&rprompt; tcpdump
tcpdump: listening on ex0
14:07:29.920651 mail.ssh > 208.44.95.231.3551: P 2951836801:2951836845(44) ack 2
476972923 win 17520 <nop,nop,timestamp 1219259 128519450> [tos 0x10]
14:07:29.950594 12.125.61.34 > 208.44.95.16: ESP(spi=2548773187,seq=0x3e8c) (DF)
14:07:29.983117 smtp.somecorp.com.smtp > 208.44.95.30.42828: . ack 420285166 win
16500 (DF)
14:07:29.984406 208.44.95.30.42828 > smtp.somecorp.com.smtp: . 1:1376(1375) ack 0
win 7431 (DF)
...
Given that the particular server is a mail server, what is shown makes
perfect sense; however, the utility is very verbose. I prefer to
initially run tcpdump with no options and send the text output
into a file for later digestion, like so:
&rprompt; tcpdump > tcpdump.out
tcpdump: listening on ex0
So, what precisely in the mish mosh are we looking for? In short,
anything that does not seem to fit. For example, messed up packet
lengths (as in a lot of them) will show up as improper lengths or
malformed packets (basically garbage). If, however, we are looking
for something specific, tcpdump may be able to help, depending on
the problem.
Specific tcpdump Usage
These are just examples of a few things one can do with tcpdump.
Look for duplicate IP addresses:
tcpdump -e host ip-address
For example:
tcpdump -e host 192.168.0.2
Routing Problems:
tcpdump icmp
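Options can also be combined; for example, to capture only icmp
traffic into a file for later review (a small illustrative sketch):
&rprompt; tcpdump -w dump.out icmp
The saved capture can then be read back with tcpdump -r dump.out.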
There are plenty of third party tools available; however,
&os; comes shipped with a good tool set for tracking down
network level performance problems.
Accounting
The &os; system comes equipped with a great many performance
monitors for active monitoring, but what about long term monitoring?
Well, of course, the output of a variety of commands can be sent to
files and re-parsed later with a meaningful shell script or program.
&os; does, by default, offer some extraordinarily powerful
low level monitoring tools for the programmer, administrator or
really astute hobbyist.
Accounting
While accounting gives system usage at an almost userland level,
kernel profiling with gprof provides explicit system call usage.
Using the accounting tools can help figure out what possible
performance problems may be lying in wait, such as increased
usage of compilers or network services, for example.
Starting accounting is actually fairly simple; as root, use the
&man.accton.8; command. The syntax to start accounting is:
accton filename
where accounting information is appended to filename. Strangely
enough, the lastcomm command, which reads from an accounting output
file, looks in
/var/account/acct
by default, so I tend to just use the default location; however,
lastcomm can be told to look elsewhere.
To stop accounting, simply type accton with no arguments.
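Putting that together, a minimal illustrative session looks like this
(the touch is only needed if the default file does not already exist):
&rprompt; touch /var/account/acct
&rprompt; accton /var/account/acct
...
&rprompt; accton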
Reading Accounting Information
To read accounting information, there are two tools that can be used:
&man.lastcomm.1;
&man.sa.8;
lastcomm
The lastcomm command shows the last commands executed, in order;
all of them, in fact. It can, however, select by user. Here is some
sample output:
&uprompt; lastcomm jrf
last - jrf ttyp3 0.00 secs Tue Sep 3 14:39 (0:00:00.02)
man - jrf ttyp3 0.00 secs Tue Sep 3 14:38 (0:01:49.03)
sh - jrf ttyp3 0.00 secs Tue Sep 3 14:38 (0:01:49.03)
less - jrf ttyp3 0.00 secs Tue Sep 3 14:38 (0:01:49.03)
lastcomm - jrf ttyp3 0.02 secs Tue Sep 3 14:38 (0:00:00.02)
stty - jrf ttyp3 0.00 secs Tue Sep 3 14:38 (0:00:00.02)
tset - jrf ttyp3 0.00 secs Tue Sep 3 14:38 (0:00:01.05)
hostname - jrf ttyp3 0.00 secs Tue Sep 3 14:38 (0:00:00.02)
ls - jrf ttyp0 0.00 secs Tue Sep 3 14:36 (0:00:00.00)
...
Pretty nice. The lastcomm command gets its information from the
default location of /var/account/acct; however, using the -f
option, another file may be specified.
As may seem obvious, the output of lastcomm could get a little
heavy on large multi user systems. That is where sa comes into play.
sa
The sa command (meaning "print system accounting statistics") can be
used to maintain information. It can also be used interactively to
create reports. Following is the default output of sa:
&uprompt; sa
77 18.62re 0.02cp 8avio 0k
3 4.27re 0.01cp 45avio 0k ispell
2 0.68re 0.00cp 33avio 0k mutt
2 1.09re 0.00cp 23avio 0k vi
10 0.61re 0.00cp 7avio 0k ***other
2 0.01re 0.00cp 29avio 0k exim
4 0.00re 0.00cp 8avio 0k lastcomm
2 0.00re 0.00cp 3avio 0k atrun
3 0.03re 0.00cp 1avio 0k cron*
5 0.02re 0.00cp 10avio 0k exim*
10 3.98re 0.00cp 2avio 0k less
11 0.00re 0.00cp 0avio 0k ls
9 3.95re 0.00cp 12avio 0k man
2 0.00re 0.00cp 4avio 0k sa
12 3.97re 0.00cp 1avio 0k sh
...
From left to right, the fields are: total times called, real time
in minutes, sum of user and system time in minutes, average number
of I/O operations per execution, size, and command name.
The sa command can also be used to create summary files or reports
based on some options. For example, here is the output when
specifying a sort by CPU-time average memory usage:
&uprompt; sa -k
86 30.81re 0.02cp 8avio 0k
10 0.61re 0.00cp 7avio 0k ***other
2 0.00re 0.00cp 3avio 0k atrun
3 0.03re 0.00cp 1avio 0k cron*
2 0.01re 0.00cp 29avio 0k exim
5 0.02re 0.00cp 10avio 0k exim*
3 4.27re 0.01cp 45avio 0k ispell
4 0.00re 0.00cp 8avio 0k lastcomm
12 8.04re 0.00cp 2avio 0k less
13 0.00re 0.00cp 0avio 0k ls
11 8.01re 0.00cp 12avio 0k man
2 0.68re 0.00cp 33avio 0k mutt
3 0.00re 0.00cp 4avio 0k sa
14 8.03re 0.00cp 1avio 0k sh
2 1.09re 0.00cp 23avio 0k vi
The sa command is very helpful on larger systems.
How to Put Accounting to Use
Accounting reports, as was mentioned earlier, offer a way to help
predict trends, for example, on a system that has cc and make
being used more and more may indicate that in a few months some
changes will need to be made to keep the system running at an
optimum level . Another good example is web server usage. If it
begins to gradually increase, again, some sort of action may
need to be taken before it becomes a problem. Luckily, with
accounting tools, said actions can be reasonably predicted and
planned for ahead of time.
Kernel Profiling
Profiling a kernel is normally employed when the goal is to
compare the effect of new changes in the kernel against a previous
version, or to track down some sort of low level performance problem.
Two sets of data about profiled code behavior are recorded
independently: function call frequency and time spent in each function.
Getting Started
First, take a look at the sections on preparing and building a
kernel later in this document.
The only difference in procedure for setting up a kernel
with profiling enabled is that when you run config you add the -p
option. The build area is ../compile/<KERNEL_NAME>.PROF;
for example, a
GENERIC kernel would be built in ../compile/GENERIC.PROF.
Following is a quick summary of how to compile a kernel with profiling
enabled on the i386 port. The assumptions are that the appropriate
sources are available under /usr/src and the
GENERIC configuration is
being used; of course, that may not always be the situation:
cd /usr/src/sys/arch/i386/conf
config -p GENERIC
cd ../compile/GENERIC.PROF
make depend && make
cp /netbsd /netbsd.old
cp netbsd /
reboot
Once the new kernel is in place and the system has rebooted,
it is time to turn on the monitoring and start looking at results.
Using kgmon
To start kgmon:
&uprompt; kgmon -b
kgmon: kernel profiling is running.
Next, send the data into the file gmon.out:
&uprompt; kgmon -p
Now, it is time to make the output readable:
&uprompt; gprof /netbsd > gprof.out
Since gprof looks for gmon.out, it should find
it in the current working directory.
By just running kgmon alone, you may not get the information you need;
however, if you are comparing the differences between two different
kernels, then a known good baseline should be used. Note: it is
generally a good idea to stress the subsystem in question, if you
know what it is, both on the baseline and on the newer (or different)
kernel.
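When making such a comparison, it helps to start each measurement
from a clean slate. One way to do that, sketched here using kgmon's
halt, reset and begin flags (see &man.kgmon.8;):
&uprompt; kgmon -h
&uprompt; kgmon -r
&uprompt; kgmon -b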
Interpretation of kgmon Output
Now that kgmon can run, collect and parse information, it is time
to actually look at some of that information. In this particular
instance, a GENERIC kernel was run with profiling enabled for
about an hour, with only system processes and no adverse load;
in the fault insertion section, the example will be large enough
that even under a minimal load, detection of the problem should be easy.
Flat Profile
The flat profile is a list of functions, the number of times
they were called and how long each took (in seconds). Following
is sample output from the quiet system:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
99.77 163.87 163.87 idle
0.03 163.92 0.05 219 228310.50 228354.34 _wdc_ata_bio_start
0.02 163.96 0.04 219 182648.40 391184.96 wdc_ata_bio_intr
0.01 163.98 0.02 3412 5861.66 6463.02 pmap_enter
0.01 164.00 0.02 548 36496.35 36496.35 pmap_zero_page
0.01 164.02 0.02 Xspllower
0.01 164.03 0.01 481968 20.75 20.75 gettick
0.01 164.04 0.01 6695 1493.65 1493.65 VOP_LOCK
0.01 164.05 0.01 3251 3075.98 21013.45 syscall_plain
...
As expected, idle was the highest in percentage; however,
there were still some things going on. For example, a
little further down there is the vn_lock
function:
...
0.00 164.14 0.00 6711 0.00 0.00 VOP_UNLOCK
0.00 164.14 0.00 6677 0.00 1493.65 vn_lock
0.00 164.14 0.00 6441 0.00 0.00 genfs_unlock
This is to be expected, since locking still has to take place, regardless.
Call Graph Profile
The call graph is an augmented version of the flat profile showing
subsequent calls from the listed functions. First, here is some
sample output:
Call graph (explanation follows)
granularity: each sample hit covers 4 byte(s) for 0.01% of 164.14 seconds
index % time self children called name
<spontaneous>
[1] 99.8 163.87 0.00 idle [1]
-----------------------------------------------
<spontaneous>
[2] 0.1 0.01 0.08 syscall1 [2]
0.01 0.06 3251/3251 syscall_plain [7]
0.00 0.01 414/1660 trap [9]
-----------------------------------------------
0.00 0.09 219/219 Xintr14 [6]
[3] 0.1 0.00 0.09 219 pciide_compat_intr [3]
0.00 0.09 219/219 wdcintr [5]
-----------------------------------------------
...
Now, this can be a little confusing. An entry's index number is found
in the trailing bracketed number at the end of its line; for example,
...
0.00 0.01 85/85 dofilewrite [68]
[72] 0.0 0.00 0.01 85 soo_write [72]
0.00 0.01 85/89 sosend [71]
...
Here we see that dofilewrite was called first; now we can look at
index 64 and see what was happening there:
...
0.00 0.01 101/103 ffs_full_fsync <cycle 6> [58]
[64] 0.0 0.00 0.01 103 bawrite [64]
0.00 0.01 103/105 VOP_BWRITE [60]
...
And so on; in this way, a "visual trace" can be established.
At the end of the call graph, right after the terms section, is
an index by function name, which can help map indexes as well.
Putting it to Use
In this example, I have modified an area of the kernel I know
will create a problem that will be blatantly obvious.
Here is the top portion of the flat profile after running the system
for about an hour with little interaction from users:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
93.97 139.13 139.13 idle
5.87 147.82 8.69 23 377826.09 377842.52 check_exec
0.01 147.84 0.02 243 82.30 82.30 pmap_copy_page
0.01 147.86 0.02 131 152.67 152.67 _wdc_ata_bio_start
0.01 147.88 0.02 131 152.67 271.85 wdc_ata_bio_intr
0.01 147.89 0.01 4428 2.26 2.66 uvn_findpage
0.01 147.90 0.01 4145 2.41 2.41 uvm_pageactivate
0.01 147.91 0.01 2473 4.04 3532.40 syscall_plain
0.01 147.92 0.01 1717 5.82 5.82 i486_copyout
0.01 147.93 0.01 1430 6.99 56.52 uvm_fault
0.01 147.94 0.01 1309 7.64 7.64 pool_get
0.01 147.95 0.01 673 14.86 38.43 genfs_getpages
0.01 147.96 0.01 498 20.08 20.08 pmap_zero_page
0.01 147.97 0.01 219 45.66 46.28 uvm_unmap_remove
0.01 147.98 0.01 111 90.09 90.09 selscan
...
As is obvious, there is a large difference in performance. Right
off the bat, the idle time is noticeably less. The main difference
here is that one particular function has a large time across
the board with very few calls: check_exec. While
at first this may not seem strange if a lot of commands
had been executed, when compared to the flat profile of
the first measurement, proportionally it does not seem right:
...
0.00 164.14 0.00 37 0.00 62747.49 check_exec
...
In the first measurement, the call is made 37 times and performs
better. Obviously something in or around that function is wrong.
To eliminate other functions, a look at the call graph can help;
here is the first instance of check_exec:
...
-----------------------------------------------
0.00 8.69 23/23 syscall_plain [3]
[4] 5.9 0.00 8.69 23 sys_execve [4]
8.69 0.00 23/23 check_exec [5]
0.00 0.00 20/20 elf32_copyargs [67]
...
Notice how the time of 8.69 seems to affect the two previous
functions. It is possible that there is something wrong with them;
however, the next instance of check_exec
seems to prove otherwise:
...
-----------------------------------------------
8.69 0.00 23/23 sys_execve [4]
[5] 5.9 8.69 0.00 23 check_exec [5]
...
Now we can see that the problem, most likely, resides in
check_exec. Of course, problems
are not always this simple; in fact, here
is the trivial code that was inserted right after
check_exec
(the function is in sys/kern/kern_exec.c):
...
/* A Cheap fault insertion */
for (x = 0; x < 100000000; x++) {
y = x;
}
...
Not exactly glamorous, but enough to register a large change with profiling.
Summary
Kernel profiling can be enlightening for anyone, and provides a
much more refined method of hunting down performance problems
that are not as easy to find using conventional means. It is also
not nearly as hard as most people think: if you can compile a kernel,
you can get profiling to work.
System Tuning
Now that monitoring and analysis tools have been addressed,
it is time to look into some actual methods. This section
addresses tools and methods that can affect how the system performs
and that are applied without recompiling the kernel;
the next section examines kernel tuning by recompiling.
Using sysctl
The sysctl utility can be used to look at, and in some cases
alter, system parameters. There are so many parameters that
can be viewed and changed that they cannot all be shown here;
however, for the first example, here is a simple usage of sysctl
to look at the system PATH environment variable:
&uprompt; sysctl user.cs_path
user.cs_path = /usr/bin:/bin:/usr/sbin:/sbin:/usr/pkg/bin:/usr/pkg/sbin:/usr/local/bin:/usr/local/sbin
Fairly simple. Now for something that is actually related to
performance. As an example, let's say a system with many users is
having file open issues; by examining, and perhaps raising, the
kern.maxfiles parameter, the problem may be fixed. But first, a look:
&uprompt; sysctl kern.maxfiles
kern.maxfiles = 1772
Now, to change it, as root with the -w option specified:
&rprompt; sysctl -w kern.maxfiles=1972
kern.maxfiles: 1772 -> 1972
Note: when the system is rebooted, the old value will return.
There are two cures for this: first, modify that parameter in
the kernel and recompile; second (and simpler), add this line
to /etc/sysctl.conf:
kern.maxfiles=1972
UBC and sysctl
From what I have gathered on a variety of lists, the most
commonly tweaked
part of the system via sysctl is
the Unified Buffer Cache (UBC). In a really great
email,
a developer made some excellent recommendations for tuning a system
with UBC; this section draws on that for the example.
Following along with the maxfiles example above: even though a
larger number of files can now be open, there still may be more
that can be done to make the system run faster. Since more files
can be opened, it stands to reason that it would be good if the
caching were optimized as well. In the case of this machine,
there is a decent amount of RAM (256MB) that, despite a lot of
low level activity (such as editing and small compiles), is not
being utilized to its full potential. This means that the buffer
cache could probably be expanded, thereby improving performance,
since more would be in the cache versus rereading from disk. Now,
a word of warning: it is of course possible to raise the size to
a point where getting a cache hit takes too long, but as
has been well documented, that number seems to be fairly high. So
what is the parameter? It is kern.maxvnodes, and in the case of
this machine, 38k was used, which gave a pretty decent performance
boost. In general, it has been seen that kern.maxvnodes operates
well when set to between 1/6 and 1/4 of the amount of
RAM (depending on what the system is being used for).
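For illustration, on the 256MB machine described above the change
would look like this (the exact value is, as noted, subjective):
&rprompt; sysctl -w kern.maxvnodes=38000
As with kern.maxfiles, adding kern.maxvnodes=38000 to
/etc/sysctl.conf makes the change persist across reboots.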
You can also improve performance by tuning vm.anonmax, which
sets the amount of physical memory that can be reclaimed for anonymous
application data. Simon Burge wrote a
very good post to tech-kern
discussing this; it is highly recommended that you read his entire
e-mail before setting values for your system. For the lazy amongst us,
though, sysctl -w vm.anonmax=95 seems to work well
for him on his pc532 with 8MB of RAM; it might work well for your
system too.
memfs & softdeps
An operating system can often benefit from a few configuration
changes (along the same lines, they can also be of great detriment).
Two particular cases where system performance can be changed are
using memory based filesystems and/or soft updates.
Using memfs
Deciding when to use and when not to use the memory based filesystem
can be hard on large multi user systems. In some cases, however, it
makes pretty good sense; for example, on a development machine used
by only one developer at a time, the obj directory might be a good
place, or some of the tmp directories for builds. In a case like
that, it makes sense on machines that have a fair amount of RAM in
them. On the other side of the coin, if a system has only 16MB of
RAM and /var/tmp
is memfs based, there could be severe application issues.
The GENERIC kernel has memfs enabled by default. To use it on a
particular directory, first determine where the swap space
that you wish to use is; in the example case, a quick look in
/etc/fstab indicates that
/dev/wd0b is the swap partition:
mail% cat /etc/fstab
/dev/wd0a / ffs rw 1 1
/dev/wd0b none swap sw 0 0
/kern /kern kernfs rw
This system is a mail server, so I only want to use /tmp
with memfs; also, on this particular system I have linked
/tmp to /var/tmp to save space
(they are on the same drive). All I need to do is add the following
entry:
/dev/wd0b /var/tmp mfs rw 0 0
Now, a word of warning: make sure said directories are empty and
nothing is using them when you mount the memory file system! At this
point I can either mount -a or reboot the system.
Using softdeps
Soft dependencies are a mechanism that does not write meta-data to
disk immediately, but writes it in an ordered fashion, which
keeps the filesystem consistent. Soft dependencies are enabled by
building a kernel with options SOFTDEP and adding softdep to a
filesystem's options in /etc/fstab; an illustrative fstab line
follows. A great deal more information about
softdep capabilities can be found on the
author's page.
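For illustration, enabling softdep on the root filesystem from the
earlier fstab example would look like this (assuming the kernel was
built with options SOFTDEP):
/dev/wd0a / ffs rw,softdep 1 1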
Kernel Tuning
While many system parameters can be changed with sysctl, and many
improvements can be achieved by using enhanced system software,
changing the layout of the system, and managing services (moving
them in and out of inetd, for example), tuning the kernel will
also provide better performance, even if the gain appears to be
marginal.
Preparing to Recompile a Kernel
First, get the kernel sources for the release. Note: this document
can be used for -current tuning; however, a read of the
Tracking
-current documentation should be done first, as
much of the information there is repeated here.
Set the CVSROOT environment variable. Under csh or tcsh, do the
following:
&uprompt; setenv CVSROOT :pserver:anoncvs@anoncvs.NetBSD.org:/cvsroot
Under sh, ksh and similar shells:
&uprompt; CVSROOT=:pserver:anoncvs@anoncvs.NetBSD.org:/cvsroot; export CVSROOT
Now that the variable is set, it is time to check out the sources
for the kernel.
&uprompt; cvs checkout -rnetbsd-2-0 src/sys
The above will check out the required kernel sources to
src/sys
relative to the directory the user is in. Note the
-rnetbsd-2-0;
this indicates checking out the kernel sources for &os; 2.0.
If 1.6 is the release, the syntax would be:
&uprompt; cvs checkout -rnetbsd-1-6 src/sys
Last but not least, for tracking -current, simply omit the release
and cvs will default to the latest sources:
&uprompt; cvs checkout src/sys
Configuring the Kernel
Configuring a kernel in &os; can be daunting, because of
multiple line dependencies within the configuration file itself.
However, there is a benefit to this method: all it really
takes to configure a new kernel is an ASCII editor and some
dmesg output. The kernel configuration file is under
src/sys/arch/ARCH/conf, where ARCH is your
architecture (for example,
on a SPARC it would be under src/sys/arch/sparc/conf).
This is where
dmesg becomes your friend. A clean dmesg will show all of the
devices detected by the kernel at boot time. Using dmesg output, the
device options really needed can be determined.
Some example Configuration Items
In this example, an ftp server's kernel is being reconfigured to run
with the bare minimum of drivers and options, plus any other items
that might make it run faster (again, not necessarily smaller,
although it will be). The first thing to do is take a look at some
of the main configuration items. So, in
/usr/src/sys/arch/i386/conf the
GENERIC file is copied to FTP, then the file FTP is edited.
At the start of the file there are a bunch of options beginning
with maxusers, which will be left alone; however, on larger
multi-user systems it might help to crank that value up a bit.
Next is CPU support; looking at the dmesg output, this is seen:
cpu0: Intel Pentium II/Celeron (Deschutes) (686-class), 400.93 MHz
indicating that only the I686_CPU option needs to be used.
In the next section, all options are left alone except
PIC_DELAY, which is recommended unless it is an older machine.
In this case it is enabled since the 686 is relatively new.
Between that section and the compat options there
was really no need to change anything on this particular system.
In the compat section, however, there are several options that do
not need to be enabled; again, this is because this machine is
strictly an FTP server, so all compat options were turned off.
The next section is file systems; again, for this server very
few need to be on. The following were left on:
# File systems
file-system FFS # UFS
file-system LFS # log-structured file system
file-system MFS # memory file system
file-system CD9660 # ISO 9660 + Rock Ridge file system
file-system FDESC # /dev/fd
file-system KERNFS # /kern
file-system NULLFS # loopback file system
file-system PROCFS # /proc
file-system UMAPFS # NULLFS + uid and gid remapping
...
options SOFTDEP # FFS soft updates support.
...
Next comes the network options section. The only options left on were:
options INET # IP + ICMP + TCP + UDP
options INET6 # IPV6
options IPFILTER_LOG # &man.ipmon.8; log support
IPFILTER_LOG is a nice one to have around since the server
will be running ipf.
The next section is verbose messages for various subsystems;
since this machine is already running and has had no major problems,
all of them are commented out.
Some Drivers
The configurable items in the config file are relatively few and
easy to cover; however, device drivers are a different story.
In the following examples, two drivers are examined and their
associated areas
in the file trimmed down. First, a small
example: the cdrom. In dmesg are the following lines:
...
cd0 at atapibus0 drive 0: >CD-540E, , 1.0A< type 5 cdrom removable
cd0: 32-bit data port
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 2
pciide0: secondary channel interrupting at irq 15
cd0(pciide0:1:0): using PIO mode 4, Ultra-DMA mode 2 (using DMA data transfer
...
Now, it is time to track that section down in the configuration file.
Notice that the cd is on an atapibus and requires pciide support.
The section of interest in this case is the IDE section.
It is worth noting at this point that in and around the IDE section
are also ISA, PCMCIA, etc.; on this machine, the dmesg output
contains no PCMCIA devices, so it stands to reason that all PCMCIA
references can be removed. But first, the cd drive.
At the start of the IDE section is the following:
At the start of the IDE section is the following:
...
wd* at wdc? channel ? drive ? flags 0x0000
wd* at pciide? channel ? drive ? flags 0x0000
# ATAPI bus support
atapibus* at atapi?
...
Well, it is pretty obvious that those lines need to be kept. Next is this:
...
# ATAPI devices
# flags have the same meaning as for IDE drives.
cd* at atapibus? drive ? flags 0x0000 # ATAPI CD-ROM drives
sd* at atapibus? drive ? flags 0x0000 # ATAPI disk drives
st* at atapibus? drive ? flags 0x0000 # ATAPI tape drives
uk* at atapibus? drive ? flags 0x0000 # ATAPI unknown
...
The only device type that was in the dmesg output was the cd;
the rest can be commented out.
The next example is slightly more difficult: network interfaces.
This machine has two of them:
...
ex0 at pci0 dev 17 function 0: 3Com 3c905B-TX 10/100 Ethernet (rev. 0x64)
ex0: interrupting at irq 10
ex0: MAC address 00:50:04:83:ff:b7
UI 0x001018 model 0x0012 rev 0 at ex0 phy 24 not configured
ex1 at pci0 dev 19 function 0: 3Com 3c905B-TX 10/100 Ethernet (rev. 0x30)
ex1: interrupting at irq 11
ex1: MAC address 00:50:da:63:91:2e
exphy0 at ex1 phy 24: 3Com internal media interface
exphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
...
At first glance it may appear that there are in fact three devices;
however, a closer look at this line:
exphy0 at ex1 phy 24: 3Com internal media interface
reveals that there are only two physical cards. Not unlike the cdrom,
simply removing names that are not in dmesg will do the job. At
the beginning of the network interfaces section is:
...
# Network Interfaces
# PCI network interfaces
an* at pci? dev ? function ? # Aironet PC4500/PC4800 (802.11)
bge* at pci? dev ? function ? # Broadcom 570x gigabit Ethernet
en* at pci? dev ? function ? # ENI/Adaptec ATM
ep* at pci? dev ? function ? # 3Com 3c59x
epic* at pci? dev ? function ? # SMC EPIC/100 Ethernet
esh* at pci? dev ? function ? # Essential HIPPI card
ex* at pci? dev ? function ? # 3Com 90x[BC]
...
There is the ex device, so all of the rest under the PCI section
can be removed. Additionally, every single line all the way down
to this one:
exphy* at mii? phy ? # 3Com internal PHYs
can be commented out, as can the remaining lines in the section.
Multi Pass
When I tune a kernel, I like to do it remotely in an X session:
in one window the dmesg output, in the other the config file. It
can sometimes take a few passes to rebuild a very trimmed kernel,
since it is easy to accidentally remove dependencies.
Building the New Kernel
Now it is time to build the kernel and put it in place. In the
conf directory on the ftp server, the following command prepares the
build:
&uprompt; config FTP
When it is done, a message reminding me to make depend will display;
next:
&uprompt; cd ../compile/FTP
&uprompt; make depend && make
When that is done, I backup the old kernel and drop the new one in
place:
&rprompt; cp /netbsd /netbsd.orig
&rprompt; cp netbsd /
Now reboot. If the kernel cannot boot, stop the boot process
when prompted and type boot netbsd.orig to
boot from the previous kernel.