The Tale of a Slow Server

In casual conversation with a friend, we're talking about scaling companies and the fun and challenges of growth at small companies (fewer than 50 employees), when he says,

"Hey, our main log server is getting slower and slower by the week. We've been troubleshooting it for days to no avail. Would you be willing to take a look?"

And that's how it began. 9,217,483 directories and 5 days later, we have a solution and plan for Log Server 2.0. Names and details are left out to protect the company.

The company doesn't have any sysadmins/devops per se. They hire contractors for specific projects and are otherwise technically inclined enough to deploy pre-configured packages for the business themselves. Nothing really out of the ordinary here, except maybe the lack of a CTO or chief architect to oversee everything.

Background

This particular server is the central Rsyslog server for an infrastructure of a few hundred servers. Most of the infrastructure is point-and-click images (Docker, Bitnami, etc.), which are really just preconfigured virtual servers. All of them were configured to use Rsyslog to log remotely to this one machine. This one machine is real hardware running a flavor of Linux: 64 GB of ECC RAM, 6 dual-core Intel Xeon CPUs with hyper-threading enabled, and 10 Hitachi enterprise disk drives (HGST HUS726060ALE610 to be specific). The drives were configured as RAID 6 using native Linux software RAID with the on-board SATA controller.

A contractor was hired to set up the server and configure the centralized Rsyslog. When they add new servers, they run a script which serially loops through all the servers and deploys the correct Rsyslog configuration. The script literally takes around 8 hours to run because it has to connect, transfer the config, and then go back and restart Rsyslog on nearly 1,000 servers. I showed them Ansible and how to parallelize this and/or just set it up for new servers.
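
For a sense of scale, even a couple of ad-hoc Ansible commands would cut that 8-hour serial loop down to minutes, because Ansible fans the work out across many servers at once. The inventory group name, fork count, and file paths below are made up for illustration, not their actual setup:

    # push the rsyslog config and restart the service on all servers, 50 at a time
    ansible logclients -f 50 -m copy \
        -a 'src=rsyslog-remote.conf dest=/etc/rsyslog.d/remote.conf'
    ansible logclients -f 50 -m service -a 'name=rsyslog state=restarted'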

Investigation

First off, I set up some statistics logging so we can quantify "slowness" and start to isolate the various sub-systems (memory/virtual memory, CPU utilization, disk I/O, etc.). It's an Rsyslog server, so some disk I/O and network traffic is expected. However, the "slowness" seems to be constant and gets worse when the array has to run the daily summary jobs.
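
The statistics logging amounts to letting the standard sysstat-style tools sample in the background for a few days. Something along these lines is all it takes; the 60-second interval and output paths are arbitrary choices on my part:

    # long-running samplers; review the logs after the daily jobs have run a few times
    nohup vmstat 60        > /var/tmp/vmstat.log  &   # memory, swap, run queue
    nohup iostat -xz 60    > /var/tmp/iostat.log  &   # per-device utilization and await
    nohup sar -n DEV 60    > /var/tmp/sar-net.log &   # per-interface network throughput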

After logging in and poking around, a few things catch my eye, but nothing really enough to warrant the claimed slowdown. And then I find the /opt/log directory. Rsyslog is set to write logs here. Under /opt/log are so many directories that "ls" cannot even run. I run exa and some other commands to find there are 9,217,483 directories, never mind the more than 16 million files. All in one directory. Somehow there are 281 million inodes used in this filesystem alone, out of 700 million configured. We've found the problem.
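
For anyone curious why "ls" falls over: it sorts the entire directory listing in memory before printing anything (and is often aliased to stat every entry as well), while find simply streams entries. Counts like the ones above can be gathered with something along these lines:

    df -i /opt/log                                         # inode usage for the filesystem
    find /opt/log -mindepth 1 -maxdepth 1 -type d | wc -l  # directories, streamed, no sorting
    find /opt/log -mindepth 1 -maxdepth 1 -type f | wc -l  # files sitting directly in /opt/log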

Thankfully, everything is written as YYYY-MM-DD-8-character-random-hex for both files and directories; an example is "2018-03-10-39ee782d". Each of the hundreds of systems logs to this directory under the timestamp for the current day. Each server is assigned an 8-character hex string in the config file, and a separate text file maps the real server name to that hex string. The strings appear to be randomly generated and assigned to each server; each is generated by RANDOMSTRING="$(openssl rand -hex 4)". Something appears to have broken along the way, because sometimes directories are created as intended, and sometimes an error occurs and the files are written with the same naming template but directly into the /opt/log directory.
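
As far as I can reconstruct it, the assignment mechanism amounts to a couple of lines in their setup script. The mapping-file path and variable names here are my guesses, not their actual code:

    # run once when a new server is added
    RANDOMSTRING="$(openssl rand -hex 4)"                      # e.g. 39ee782d
    echo "${SERVER_NAME} ${RANDOMSTRING}" >> /opt/log/server-map.txt
    # that server's logs then land under a per-day name such as
    #   /opt/log/$(date +%F)-${RANDOMSTRING}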

Hacking together a solution

The quickest solution is to index the files by date and the first three characters of the hex string. Using the example above, "2018-03-10-39ee782d", we create the directory structure /opt/log/2018/03/10/3/9/e and put the log files for "2018-03-10-39ee782d" into this directory. Simple solution; long, slow implementation.
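
The actual migration was more careful than this, but the core of the re-layout is a bash loop along these lines (not the exact script used; the name filter guards against touching the new year directories as they appear):

    # re-layout sketch (bash); run from /opt/log
    cd /opt/log || exit 1
    find . -mindepth 1 -maxdepth 1 -printf '%f\n' | while read -r name; do
        case "$name" in                                  # only YYYY-MM-DD-xxxxxxxx entries
            [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-????????) ;;
            *) continue ;;
        esac
        y=${name:0:4}; m=${name:5:2}; d=${name:8:2}; hex=${name:11}
        dest="$y/$m/$d/${hex:0:1}/${hex:1:1}/${hex:2:1}"   # e.g. 2018/03/10/3/9/e
        mkdir -p "$dest"
        mv -- "$name" "$dest/"
    done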

The next discovery, while creating all these directories and moving files around, is that we run into millions of cases where the log user doesn't have permission to read or write a directory. The problem is that various scripts and contractors have either mistyped the log user or created new log users, so we have around 20 different accounts writing to this /opt/log directory. A simple "chown -R loguser:loguser ./" is really slow. I try breaking it up with parallel, but that just overloads the disk subsystem even faster, causing everything to slow down. I end up with a "find ./ -type d -name "pattern here" -exec chown -R loguser:loguser {} \;" command that runs fastest while still letting the disk subsystem keep up reliably. Four days later, we can read and write everything, and it takes a final weekend to move the files and directories into the new layout.
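
For the record, the find variant wins because it issues one bounded chown per top-level directory rather than one enormous recursive walk or dozens of competing ones. In shape it looks like this, with an illustrative name pattern rather than the one actually used:

    # one directory at a time keeps the disk subsystem from being swamped
    cd /opt/log || exit 1
    find . -maxdepth 1 -type d -name '20??-??-??-*' \
        -exec chown -R loguser:loguser {} \;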

Drive Sub-System

Massive reads should be fast. Massive writes will be slow. But why is the system this slow? A few observations:

  1. The native system board SATA controller is used. Chances are this is cheap and not fast. In fact it's the Intel C602 chipset (rev 6).
  2. The array stripe block size is set to 512 bytes, while smartctl reports the drives as "Sector Sizes:     512 bytes logical, 4096 bytes physical".
  3. On the same set of 10 disks, there are 4 different arrays, some RAID 1, some RAID 6.
  4. After looking at the individual disks with smartctl, we find 9 of 10 are running "SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)" (the commands behind these checks are sketched after this list).
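
None of this needs physical access; it all comes out of a handful of read-only commands roughly like the following (device and md names are assumptions):

    lspci | grep -i -e sata -e raid              # identify the on-board controller (C602 here)
    for d in /dev/sd[a-j]; do                    # the 10 data drives, names assumed
        smartctl -i "$d" | grep -E 'SATA Version|Sector Sizes'
    done
    cat /proc/mdstat                             # which md arrays share which member disks
    mdadm --detail /dev/md0 | grep -i chunk      # stripe/chunk size of one of the arrays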

The server is a dedicated server rented monthly at some provider, so a physical inspection is out. However, we know a few things now:

  1. The SATA controller is great for boot drives and basic disks, but not ready for enterprise workloads, especially not this one.
  2. The disks are physically optimized for 4K sectors, but the array is set up for 512-byte sectors. Therefore the disks have to map data between 512-byte and 4K sectors on every write and read, a performance hit since this is done automatically via a translation table maintained by each disk (a rough rebuild sketch follows this list).
  3. The different array types mean different workloads for the SATA controller and disks, and because all four arrays are spread across the same 10 drives, each drive is constantly asked to serve very different workloads at once. This likely means each drive is just scattered trying to keep up with it all. Said another way, since these are spinning disks with moving heads, nothing is ever in the optimal place for reads or writes, especially long streams of reads or writes.
  4. 9 of the 10 disks are connected at half their rated link speed. This could be due to cables, configuration, or something else I haven't figured out, like a mismatch between the drives and the Intel SATA controller.
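
Had the spinning-disk array stayed, point 2 alone suggests a rebuild with 4K-aligned partitions and a sensible chunk size. Purely as an illustration of what that would look like (device names, chunk size, and the ext4 geometry are my assumptions; the actual fix was the new server below):

    # illustrative only: a 4K-friendly layout for the existing 10 drives
    parted -a optimal /dev/sdX -- mklabel gpt mkpart primary 1MiB 100%   # repeat per drive; 1 MiB start is 4K-aligned
    mdadm --create /dev/md0 --level=6 --raid-devices=10 --chunk=512K /dev/sd[a-j]1
    # ext4 geometry: stride = 512K chunk / 4K block = 128; stripe-width = 128 * 8 data disks = 1024
    mkfs.ext4 -b 4096 -E stride=128,stripe-width=1024 /dev/md0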

A New Server/Hope

I helped spec out a new server. This time: 30x 2 TB 3D NAND SSDs, two dedicated hardware RAID controllers set up in JBOD mode and connected via PCI Express, hyper-threading disabled, and FreeBSD with ZFS managing the RAID and drives. NVMe drives would be much, much faster, but they're still really expensive compared to cheaper 2 TB SSDs. 10 TB spinning disk drives would be cheapest, but the company wants far better performance to hedge against future scaling up and against lower-quality contractors developing solutions. Of course, now we'll have to transfer millions of directories and 30 TB of files to the new server.

The two dedicated hardware controllers aren't for doing the RAID arrays in hardware, but to speed up the individual ops requested. One controller is for the RAID 1 disks, the other for the RAID 6/Z disks, and they sit on separate PCI Express buses to optimize traffic with the CPUs and memory. The memory on the new server is dual-channel, with the slots fully populated to provide the most bandwidth across the memory bus.
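
Roughly what the ZFS side could look like on FreeBSD, assuming the JBOD controllers expose the SSDs as da0 through da29; the vdev widths, pool names, and dataset below are illustrative choices rather than the final build:

    # log pool: three 8-disk raidz2 vdevs on one controller (device names assumed)
    zpool create logs \
        raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
        raidz2 da8 da9 da10 da11 da12 da13 da14 da15 \
        raidz2 da16 da17 da18 da19 da20 da21 da22 da23
    # remaining disks as mirrors on the other controller
    zpool create system mirror da24 da25 mirror da26 da27 mirror da28 da29
    zfs set compression=lz4 logs                    # text logs compress extremely well
    zfs create -o mountpoint=/opt/log logs/rsyslog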

Why not the cloud?

Why not? Cost. Renting/provisioning a 60 TB filesystem in the cloud is costly. Doing so on drives with guaranteed high IOPS is even more so.

The Real Solution

The real solution here isn't to build a fast log server. The real solution is one which doesn't involve a log server at all. They should be running a 24x7 streaming log analysis pipeline which eventually feeds into Elasticsearch/Kibana or Solr/Banana for real-time reporting and alerting. Something like Apache Kafka or a commercial solution would handle this load much better, get rid of the daily batch nature of the reports, and run on far less expensive hardware, possibly even in their current provider's virtual server environment.
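
On the collection side the change to the fleet would be small: rsyslog can ship messages straight to Kafka via its omkafka output module instead of writing to a central disk. A sketch of the per-server drop-in follows; the broker names, topic, and file path are placeholders, and the omkafka module has to be installed on each server:

    # /etc/rsyslog.d/30-kafka.conf (illustrative)
    module(load="omkafka")
    action(type="omkafka"
           broker=["kafka1:9092","kafka2:9092","kafka3:9092"]
           topic="syslog"
           template="RSYSLOG_FileFormat")

Drop something like that on each server, restart rsyslog, and the existing log host can be retired once the pipeline and dashboards are in place.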

In hearing about the rest of the environment, I suspect most of the pre-configured images could be dropped in favor of simple caching proxy/web servers talking to a load-balanced cluster of app servers, talking in turn to a load-balanced cluster of database servers. All of this could run on smaller virtual servers than they rent now (probably half as many), which would reduce costs overall. However, the total cost may be equivalent in the end when you factor in the cost of retooling existing code, a devops/sysadmin person to manage it all, and the cost of the time spent focusing on this rather than on other growth opportunities. We're still working out the math. And this sounds like a full-time job.