I recently upgraded one of my work machines from a single-processor whitebox to a dual-processor 3.1GHz, hyperthreaded Dell 470. In doing so, I also installed OpenSUSE 10.2 (x86_64). This is a big step up from the old box I was using. One thing I did notice, however, was that when I tried to transfer the tar files of my old data from another machine, ssh would constantly die with the following error: “Disconnecting: Corrupted MAC on input.”

I googled for a while and found that the error can be caused by a few different things: bad hardware, buggy drivers, etc. I first tried disabling the on-board network card and installing a 3Com 509c in its place, with no success. I was still getting the same errors and, sadly, they happened even sooner than with the on-board controller.

Some more research dug up a tie to SMP and a possible bug in OpenSSH or its underlying libraries. It seems that context switching and migration of processes/threads between CPUs can expose the bug, causing the problem.

To see if this was really the problem, I followed some advice I had read while googling for solutions: transfer the same files multiple times using HTTP, FTP, NFS or some other protocol, and check the md5 sum of each transferred file against the original. This narrows the errors down to either the program transferring the data or the driver. If the files are corrupted regardless of the method, then the driver is at fault; otherwise, it’s the program. After some experimentation I found that all the files transferred over HTTP with no errors at all. It was scp.
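The checksum comparison described above can be sketched as a small shell helper. This is only an illustration, not the exact commands I ran: the function name `same_md5` and the file names in the usage comment are made up, and it assumes the `md5sum` tool from coreutils is installed.

```shell
# Hypothetical helper: succeed only if two files have identical MD5 sums.
# Returns 0 on a match, 1 on a mismatch, 2 if either file cannot be read.
same_md5() {
    # Reading from stdin keeps the file name out of md5sum's output,
    # so the two results can be compared as plain strings.
    a=$(md5sum < "$1") || return 2
    b=$(md5sum < "$2") || return 2
    [ "$a" = "$b" ]
}

# Example usage (file names are placeholders):
#   same_md5 backup.tar.orig backup.tar.copy && echo "match" || echo "MISMATCH"
```

Repeating the transfer and the comparison several times for each protocol is what separates a flaky driver from a flaky program.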

Not wanting to disable my second CPU, hyperthreading or both, I really wanted to find a way to continue using OpenSSH on my machine and still have a functioning SMP machine – after all, isn’t the point of having hardware to be able to use it?

The Linux scheduler has a way of tying processes to certain processors. This is called processor affinity. Normally, the scheduler will keep a process on a single CPU until another process needs that CPU. This is “natural processor affinity”. The scheduler tries to keep a process tied to a processor for performance reasons. However, processes can get bumped, and it’s not that hard to do. This is where using hard processor affinity can really help out.

Processor affinity can be set programmatically using the APIs discussed in an IBM developerWorks article. What I needed was a way to tie a command-line program (scp, in my case) to a single processor.

Enter taskset

From the man page:

taskset is used to set or retrieve the CPU affinity of a running process given its PID or to launch a new COMMAND with a given CPU affinity. CPU affinity is a scheduler property that “bonds” a process to a given set of CPUs on the system. The Linux scheduler will honor the given CPU affinity and the process will not run on any other CPUs. Note that the Linux scheduler also supports natural CPU affinity: the scheduler attempts to keep processes on the same CPU as long as practical for performance reasons. Therefore, forcing a specific CPU affinity is useful only in certain applications.

taskset can either change the processor affinity of an existing process using its PID, or start a new command and tie that process to one or more CPUs.
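For the first mode, changing or querying a process that is already running, something like the following should work. It is a rough sketch: `pin_pid` and `show_affinity` are names I made up for illustration, PID 1234 is a placeholder, and it assumes a taskset that supports combining the `-c` and `-p` options, as the util-linux version does.

```shell
# Hypothetical helper: pin an already-running process to a single CPU.
#   $1 = zero-based CPU number, $2 = PID of the target process
pin_pid() {
    taskset -cp "$1" "$2"
}

# Hypothetical helper: print the list of CPUs a process may currently run on.
#   $1 = PID of the target process
show_affinity() {
    taskset -cp "$1"
}

# Example usage (PID 1234 is a placeholder):
#   pin_pid 0 1234
#   show_affinity 1234
```

Changing the affinity of a process owned by another user generally requires root privileges.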

In my case, I wanted to tie scp to a single CPU. I did this using the -c flag, which takes a comma-delimited list of one or more zero-based CPU numbers:

    taskset -c 0 scp root@mybackupserver:/backup.tar .

This will tie the scp process to processor 0 during its lifetime (unless its affinity is changed again using taskset). This solved the problem completely. Looking at my CPU load monitors (Gkrellm), I noted that only one processor was loaded during the transfer of my backup files. No failures occurred and my data arrived without error. Looks like OpenSSH may have some SMP-related bugs.
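To avoid typing the taskset prefix on every transfer, the command can be wrapped in a small shell function. This is only a sketch: the function name `pinned` and the `PIN_CPU` variable are my own inventions, not part of taskset or OpenSSH.

```shell
# Hypothetical wrapper: run any command pinned to a single CPU.
# PIN_CPU is an assumed variable naming the zero-based CPU to use;
# it defaults to CPU 0 when unset.
pinned() {
    taskset -c "${PIN_CPU:-0}" "$@"
}

# Example usage:
#   pinned scp root@mybackupserver:/backup.tar .
```

Child processes inherit the affinity of their parent, so anything scp launches (such as the underlying ssh process) stays on the same CPU.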

Setting the CPU affinity for a process can help in other ways too, not just for fixing applications which are incompatible or buggy with SMP. Several programs, such as Oracle’s database, are licensed on a per-CPU basis and may not actually run on a system that exceeds the CPU licensing requirements. Tying such a process to one CPU or a set of CPUs might allow the program to run correctly.

If you’re interested in trying taskset, it is in the schedutils package in some distros and in the util-linux package in OpenSUSE 10.2.