I'm sick of trying to debug ssh connection issues; they're often hard, and I work on them just infrequently enough to forget everything I knew. This blog post is a note to myself about what kinds of problem to look for and where to look when ssh has gone south…
I use Dirvish to do my backups. It's got some issues, but the overall structure is a really good idea. The biggest of several problems with it is that it requires that the backup host have appropriately-restricted root ssh access to all of the remote hosts being backed up. Typically, when backups start failing, it's either that the backup drive is screwed up, or that something has changed with the ssh setup and Dirvish cannot connect to the host to be backed up.
Step 1: Look in the application's ssh logs. My Dirvish setup connects through a shell script I wrote, which writes logfiles into /var/log/dirvish. The message I'm currently seeing is
Received disconnect from 192.168.1.999: 2: Too many authentication failures for root
OK, so this is definitely an ssh authentication problem. It turns out that I did in fact recently change a lot of my setup on the target machine, so this makes sense. The client-side log shows nothing interesting, so as expected Dirvish is not getting through.
Step 2: Look at /var/log/auth.log on the target host. In this case, I find a bunch of messages like
Jan 1 14:42:07 myhost sshd[19070]: User root not allowed because account is locked
Hmm. That's weird. But it explains some of what's going on. Maybe I can Google it?
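When auth.log is busy, it can help to pull out just the sshd lines. Here's a minimal Python sketch, assuming the traditional syslog line format of the message above. The name sshd_messages and the regular expression are my own illustration, not something sshd or Dirvish provides, and a syslog configured for a different timestamp style would need a different pattern.

```python
import re

# Assumed format: "Mon D HH:MM:SS host sshd[pid]: message"
SSHD_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+"
    r"sshd\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)


def sshd_messages(log_text, needle=""):
    """Return (timestamp, message) pairs for sshd lines whose message contains needle."""
    hits = []
    for line in log_text.splitlines():
        match = SSHD_LINE.match(line)
        if match and needle in match.group("msg"):
            hits.append((match.group("stamp"), match.group("msg")))
    return hits
```

Feeding it the contents of /var/log/auth.log (reading that file usually needs root) with needle set to "root" should leave only the sshd complaints about that account.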
Step 3: Try running the offending ssh command with verbose debugging. Note that this never helps. The debugging messages put out by ssh are verbose yet useless. In my case, I get a lot of "Roaming not allowed by server" messages, which are pretty cryptic.
Step 4: Check the auth files and permissions. On both the client and server side of my link, the .ssh directories and their contents look reasonable. Make sure in particular that your authorized_keys file contains exactly one key per line; it's really easy to get these insanely long lines broken up by various tools. Use wc -l to check that the line count matches the number of keys you expect. On my system, it all looks mostly OK, but the client key seems to be busted. I fix it.
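A line count only tells you how many lines there are, not whether each one holds a whole key. Here's a rough Python sketch of a stricter check. It assumes the usual "keytype base64-data comment" layout and relies on SSH public key data being a run of length-prefixed fields, so a key cut in half no longer adds up. The names check_authorized_keys and blob_is_complete are hypothetical helpers of mine, and the pattern could be fooled by an options field that happens to contain something that looks like a key type.

```python
import base64
import binascii
import re
import struct

# A key-type token (ssh-rsa, ssh-ed25519, ecdsa-sha2-..., sk-...) followed by base64 data.
KEY_RE = re.compile(
    r"(?:^|\s)((?:sk-)?(?:ssh|ecdsa)-[A-Za-z0-9@.\-]+)\s+([A-Za-z0-9+/=]+)"
)


def blob_is_complete(blob):
    """True if blob consists of whole length-prefixed fields with nothing missing."""
    pos = 0
    while pos < len(blob):
        if pos + 4 > len(blob):
            return False  # not even room for the 4-byte length
        (length,) = struct.unpack(">I", blob[pos:pos + 4])
        pos += 4 + length
    return pos == len(blob) and pos > 0


def check_authorized_keys(text):
    """Return (line_number, problem) pairs for lines that do not hold one whole key."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are fine
        match = KEY_RE.search(stripped)
        if match is None:
            problems.append((number, "no key type and key data found"))
            continue
        try:
            blob = base64.b64decode(match.group(2), validate=True)
        except binascii.Error:
            problems.append((number, "key data is not valid base64"))
            continue
        if not blob_is_complete(blob):
            problems.append((number, "key data looks truncated"))
    return problems
```

An empty result means every non-comment line at least looks like one intact key; it says nothing about whether it is the right key.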
Step 5: Google for help. In my case, I infer a simple explanation for my problem. I had just "cleared" the root password by sticking a "*" in root's password field. This works fine for me, since I can log in as root locally without a password, and don't want to be able to log in as root at all elsewhere. Some post I was reading pointed out that modern passwd has a "-S" status option. Sure enough, it showed that root was "locked": not what I intended at all.
The fix? Change root's password field to "x". This is read by PAM as a valid (but impossible) password and root is "unlocked".
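To remind myself why "*" and "x" behave so differently, here's a small Python sketch of how a password field might be classified. It assumes the convention I believe passwd -S follows on shadow-utils systems: a field starting with "*" or "!" is reported as locked (L), an empty field as no password (NP), and anything else as a usable password (P). The function names are mine, and other systems, or sshd's own lock check, may draw the line differently.

```python
def password_status(field):
    """Classify a password field as 'L' (locked), 'NP' (no password) or 'P'."""
    if field.startswith(("*", "!")):
        return "L"
    if field == "":
        return "NP"
    return "P"


def account_status(passwd_or_shadow_text, user):
    """Look up user in colon-separated passwd/shadow-style text and classify field 2."""
    for line in passwd_or_shadow_text.splitlines():
        parts = line.split(":")
        if len(parts) >= 2 and parts[0] == user:
            return password_status(parts[1])
    return None  # user not present
```

Under these assumptions, password_status("*") comes back as "L" while password_status("x") comes back as "P", which matches what I saw before and after the fix.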
Now I can has backups again. (B)