Mitch Richling: UNIX System Admin Tools
| Author: | Mitch Richling |
| Updated: | 2025-06-17 |
1. Introduction
You will find here several simple tools that may be of use to UNIX system administrators. My Guide to UNIX System Programming has a few programs that may be useful as well.
2. stats.pl: Compute Statistics
stats.pl started life as a bit of one-line Perl magic, and has grown over the years into what you see here. The idea is simple: bust text data into columns of numeric values, optionally grouped by categorical (factor) variables, and report simple statistical summary information like mean, max, min, count, standard deviation, variance, regression lines, and histograms. The output can be customized, allowing a range of formats from machine-readable ones like CSV and TSV to human-consumable reports using fixed-width tables.
Perhaps the most complex, and useful, feature of the stats.pl script is the powerful set of techniques it uses to extract the data in the first place. This is doubly important for UNIX geeks, who tend to deal with numerous oddly formatted text files on a daily basis. Note that the script is not only capable of using the data it extracts, it is also capable of outputting the filtered and scrubbed data in various formats (like CSV). For many users stats.pl is less of a computational tool, and more of a general purpose "data extractor and filter" allowing them to feed data into tools like R, SAS, or (goodness forbid) Excel.
For simple cases, the script "just works" with the default values; however, more complex examples are easy to find in the day-to-day life of a UNIX system administrator:
- How do I extract the data from vmstat? - The output of vmstat is funny as the second line has the titles while the first and third lines are junk, with the data starting on line four. That sounds painful, but stats.pl makes it easy:
-skipLines=3 -headerLine=2
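The flags above belong to stats.pl itself; as a rough illustration of what -skipLines and -headerLine accomplish, here is a small Python sketch (the function name and the sample text are invented for the example, not taken from the script):

```python
def extract(text, skip_lines=0, header_line=None):
    """Rough sketch of -skipLines/-headerLine: take column titles from one
    line, skip leading junk lines, and keep only pure-numeric data rows."""
    lines = text.splitlines()
    header = lines[header_line - 1].split() if header_line else None
    rows = []
    for line in lines[skip_lines:]:
        try:
            rows.append([float(f) for f in line.split()])
        except ValueError:
            continue  # not a pure-numeric data line; drop it
    return header, rows

# vmstat-like sample: lines 1 and 3 are junk, line 2 has the titles,
# and the data starts on line 4 -- hence skip_lines=3, header_line=2
sample = """procs -----------memory----------
 r  b   swpd   free
junk junk junk junk
 1  0   1024  5000
 0  0   1024  5100
"""
header, rows = extract(sample, skip_lines=3, header_line=2)
print(header)  # ['r', 'b', 'swpd', 'free']
print(rows)    # [[1.0, 0.0, 1024.0, 5000.0], [0.0, 0.0, 1024.0, 5100.0]]
```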
- How do I extract the data from mpstat? - The output of mpstat is another odd one in that the first line, and every fourth line after it, consists of column titles. How kooky is that? We note that each title line has the string CPU and none of the data lines do. So we can use something like this:
-headerLine -notRegex=CPU
- OK. I got the data from mpstat, but how do I get a summary for each CPU? - The CPU is labeled in the output of mpstat in a column called CPU - the column we used in the previous FAQ entry to delete the title lines. All we need do is tell stats.pl about this column. The following options will do the trick:
-headerLine=1 -notRegex=CPU -cats=CPU
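To illustrate what this combination of flags does, here is a Python sketch of the same idea: drop repeated title lines with a regex, then group the rows by one categorical column and summarize each group. The function name and the mpstat-like sample are invented for the example; only the flag semantics come from the text above.

```python
import re
from collections import defaultdict
from statistics import mean

def summarize_by(text, header_line, not_regex, cat_col):
    """Sketch of -headerLine/-notRegex/-cats: filter out repeated title
    lines, then group rows by one categorical column and average the rest."""
    lines = text.splitlines()
    header = lines[header_line - 1].split()
    drop = re.compile(not_regex)
    groups = defaultdict(list)
    for line in lines[header_line:]:
        if drop.search(line):
            continue  # a repeated title line
        row = dict(zip(header, line.split()))
        groups[row[cat_col]].append(row)
    # per-category mean of every numeric column
    return {cat: {c: mean(float(r[c]) for r in rows)
                  for c in header if c != cat_col}
            for cat, rows in groups.items()}

# mpstat-like sample: the title line repeats, and only title lines say "CPU"
sample = """CPU usr sys idle
0 10 5 85
1 20 10 70
CPU usr sys idle
0 12 7 81
1 22 8 70
"""
print(summarize_by(sample, header_line=1, not_regex="CPU", cat_col="CPU"))
```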
- How do I get the data from sar? - The output from sar is more complex. The first three lines are bogus, the fourth line has titles mixed with data, and the last two lines are junk (a blank line and an "Average" line). Still, it isn't too bad telling stats.pl how to get the data. Because this one is so complex, there are different ways to do it. Here are three:
-notRegex=Average -goodColCnt=5
-stopRegex='^$' -skipLines=4
-notRegex='(^$|Average)' -skipLines=4
- How can I get better titles for sar data? - First, see the previous question about how to get the data. Use one of those option sets, and add the following to the command line:
-colNames=time,usr,sys,wio,idle
3. Fast filesystem traversal
The traditional way to traverse a file system is to simply use a recursive algorithm.
This algorithm is generally I/O bound; however, the culprit on modern systems is often I/O latency - not bandwidth. This is particularly true with today's transaction based I/O subsystems and network file systems like NFS. One way to alleviate this bottleneck is to have multiple I/O operations simultaneously in flight. Using this technique on a single CPU Linux box with a local file system only produces marginal performance increases, but when dealing with NFS file systems the speedup can be quite significant. Experiments with multi-CPU hosts utilizing gigabit Ethernet with large NFS servers show incredible performance improvements of well over 50x (20 hours cut down to 20 minutes). This set of programs has been used to traverse hundreds of terabytes of storage distributed across more than a billion files and 100 fileservers in just a few hours.
The idea is to first store every directory given on the command line in a linked list. Then a thread pool is created, and the threads pop entries off of that linked list in the order they were placed in the list (FIFO). Each thread then reads all the entries in the directory it popped off the list, performs user-defined actions on each entry, and stores any subdirectories at the end of the linked list. This algorithm leads to a roughly breadth-first directory traversal. The nature of the algorithm places a heavy load upon the caching systems available in many operating systems. For example, ncsize plays a role in how effective this program is on a Solaris system. Also on Solaris, the number of simultaneous NFS connections dramatically affects performance. Depending on what the optional processing functions are doing, this program can place an incredible load on nscd.
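The queue-plus-thread-pool scheme just described can be sketched in a few lines of Python (the actual tools are a C pthread pool; the function name and thread count below are invented for the example). In CPython the directory-reading syscalls release the GIL, so even this sketch hides I/O latency the way the text describes:

```python
import os
import queue
import threading

def traverse(roots, action, n_threads=8):
    """Sketch of the traversal: seed a FIFO queue with the starting
    directories; workers pop a directory, run a user-defined action on
    each entry, and append subdirectories to the back of the queue."""
    work = queue.Queue()          # FIFO: gives a roughly breadth-first order
    for d in roots:
        work.put(d)

    def worker():
        while True:
            d = work.get()
            if d is None:                             # sentinel: time to exit
                work.task_done()
                return
            try:
                for name in os.listdir(d):
                    path = os.path.join(d, name)
                    action(path)                      # user-defined action
                    if os.path.isdir(path) and not os.path.islink(path):
                        work.put(path)                # subdir goes on the back
            except OSError:
                pass                                  # unreadable dir: skip it
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    work.join()                   # blocks until every queued directory is done
    for _ in threads:
        work.put(None)            # release the workers
    for t in threads:
        t.join()
```

Note the termination trick: `queue.Queue.join()` waits until every item put on the queue has been matched by a `task_done()`, which works even though the workers keep adding subdirectories while the traversal runs.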
The code base is designed to be customized so that binaries may be easily produced to do special tasks as the need arises. That said, the code linked here is written in C, and makes use of ancient C techniques to provide for tool customization. As a demonstration of how to customize the behaviour, several compile options exist for the code in the archive that generate different binaries that do very different things. Currently the following examples may be compiled right out of the box:
- du - A very fast version of /bin/du. It has no command line options, and simply displays the output of a 'du -sk'.
- dux - A very fast, extended version of /bin/du that displays much more data about the files traversed, including file sizes, number of blocks, files with holes, and lots of other data.
- own - Prints the names of all files in a directory tree that are owned by a specified list of users.
- age - Produces a report regarding the ages of the files in a directory tree.
- noown - Prints the names of all files in a directory tree that are NOT owned by a specified list of users.
- dirgo - Simply lists the files it finds. This is similar to a 'find ./', only it does an almost breadth-first search.