Troubleshooting Iowa Network Issues
Lots is missing from before 3/15. I was not properly documenting up until this point, but I'll try to summarize:
I Only symptoms have been identified; the cause is still unknown
I Massive packet loss and retransmissions between any and all devices
I seems to be triggered by shutting a machine down
C Disabled wireless devices on all workstations because the wireless NIC was responding to arpings for the other NIC
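The wrong-NIC ARP replies are the stock Linux "ARP flux" behavior: by default every interface answers ARP requests for any local address. If the wireless NICs ever need to come back, roughly these sysctls (a sketch; the file path is hypothetical) should limit each NIC to its own address:
{{{
# Hypothetical /etc/sysctl.d/arp-flux.conf
net.ipv4.conf.all.arp_ignore = 1      # reply only if the target IP lives on the receiving interface
net.ipv4.conf.all.arp_announce = 2    # use the best matching local address when sending ARP requests
}}}
Apply with `sysctl --system` (or `sysctl -p` against the specific file).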
C Disabled the virtual bridge for vbox on ws118 and rebooted. Issue still occurred
C created user 'tinkerbell' to test user logins and NFS
I DynDNS for the workstations still is not working
I fabfile.py can be found in file0.ia.votesmart.org:/root/build/utilities/fabric/iowa/
I We did try disconnecting the Wireless AP early in the situation, but no change.
I Tried using the proprietary driver for a workstation. Reboot still triggered the issue
I Triple checked MTU on workstations and server. All 1500, assuming the switches are not doing packet reassembly.
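For reference, roughly how the MTU was verified (interface name assumed; the ping forces a full 1500-byte frame with the don't-fragment bit set, so a smaller path MTU would show up as an error):
{{{
# Configured MTU on the server side
ip link show eth0 | grep mtu

# From a workstation: 1472-byte ICMP payload + 28 bytes of headers = 1500 bytes on the wire
ping -M do -s 1472 -c 3 file0.ia.votesmart.org
}}}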
I Disabled TCP segmentation offload (TSO) on file0 to no effect
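Sketch of the offload change via ethtool (interface name assumed):
{{{
# Current offload settings on file0's NIC
ethtool -k eth0

# Disable segmentation and receive offloads so captures show real on-the-wire packets
ethtool -K eth0 tso off gso off gro off
}}}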
3/14
D Got 2 tcpdump output files to analyze
I Only scratched the surface of analyzing the dumps
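A quicker first pass over the dumps might look like this (capture filename is a placeholder; older tshark versions take -R instead of -Y):
{{{
# Segments Wireshark flags as retransmitted or out of order
tshark -r dump1.pcap -Y "tcp.analysis.retransmission or tcp.analysis.out_of_order"

# Per-conversation totals, to see which hosts are worst affected
tshark -r dump1.pcap -q -z conv,tcp
}}}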
3/15
C restarted all(seemingly relevant) RPC services and nfs-server
C Cleared old connections from showmount
C set debugging output in /etc/sysconfig/nfs
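For the record, the restart/debug steps were roughly the following (a sketch; exact service names depend on the distro/init system, and rpcdebug is the runtime equivalent of the sysconfig debug flags):
{{{
# Restart the RPC plumbing and the NFS server (use the init scripts on pre-systemd hosts)
systemctl restart rpcbind nfs-server

# Clients the server still believes have mounts (stale entries live in /var/lib/nfs/rmtab)
showmount --all file0

# Turn kernel NFS/RPC debug logging on; swap -s for -c to turn it back off
rpcdebug -m nfsd -s all
rpcdebug -m rpc -s all
}}}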
I noticed ws133 (172.16.253.193) did not have a proper DNS entry. A couple of other workstations also have this problem.
I Workstation DynDNS updates are still being rejected due to "rejected by secure update"
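"Rejected by secure update" means named/IPA will only accept GSS-TSIG-signed updates. A quick way to test whether a workstation can update its own record (a sketch; the FQDN is assumed from the ia.votesmart.org zone and the address is ws133's from above):
{{{
# On the workstation: authenticate as its host principal from the default keytab,
# then send a signed dynamic update
kinit -k host/$(hostname -f)
nsupdate -g <<'EOF'
update delete ws133.ia.votesmart.org. A
update add ws133.ia.votesmart.org. 300 A 172.16.253.193
send
EOF
}}}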
I Best way to get current NFS connections is `netstat -na | grep 2049`
I There are a lot of automount expire messages on the clients. Unsure if this is expected behavior.
I DHCP lease expiration is set to 1200 seconds (20 minutes)
I There seem to be open connections to 192.168.255.30:53 (old poprocks). How is this possible? There is no mention of it in resolv.conf.
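If a longer lease ever looks worthwhile, the setting lives in dhcpd.conf (assuming ISC dhcpd is serving the LAN; path and cap value are examples):
{{{
# /etc/dhcp/dhcpd.conf (excerpt) - lease times are in seconds
default-lease-time 1200;   # the current 20-minute lease
max-lease-time 7200;       # example cap if we hand out longer leases
}}}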
I I had Walker power cycle the core switches. Resetting the switches did actually seem to alleviate the problems, but the active connections from `netstat -an` did NOT change after the power cycle. !!!NOTE!!!: Right before he pulled the plug, it looked like the network was already improving, so it could have to do with his usage more than with power cycling the switches. Will have to retry that at some point while the problems are occurring. UPDATE: Issues occurred again not too long after the switch power cycle.
C Found the reference to 192.168.255.30:53. The server record in IPA(LDAP) for file0 had '192.168.255.30' set as its forwarders. This must be leftover from the server's original setup in MT. Have now switched this to use Drake's DNS servers as upstream resolvers. This should improve Internet performance a bit for the clients.
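For reference, the global forwarders can be checked and changed from the IPA CLI (the actual fix was to the per-server record for file0 in LDAP; the address below is a placeholder for Drake's resolver):
{{{
# Show the global DNS configuration, including forwarders
ipa dnsconfig-show

# Point the global forwarder at Drake's resolver
ipa dnsconfig-mod --forwarder=10.0.0.53
}}}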
I Stopping FreeIPA entirely(`ipactl stop`) had no effect on network issues. (network did improve about 15 minutes after, probably not related)
I still going through the dumps from yesterday. Nothing sticks out yet. Need a better analysis tool, maybe.
I There are ongoing SSH attacks, which isn't ideal, but they don't seem any more extreme than on any other public-facing machine we had before. fail2ban is set up and the local network is whitelisted
I iptables rechecked. fail2ban is working and returns to the chain if not matched. ctstate INVALID is set to drop.
C Deleted the iptables rule dropping invalid packets. Considering the network errors, I think file0 should probably accept out-of-sequence packets, which the INVALID state does catch.
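Sketch of finding and removing the rule (the exact match may be written with `-m state --state INVALID` instead):
{{{
# Locate the INVALID drop in the INPUT chain
iptables -L INPUT -n --line-numbers

# Delete it by specification so conntrack no longer discards out-of-sequence segments
iptables -D INPUT -m conntrack --ctstate INVALID -j DROP
}}}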
3/16
I I asked Luke with Drake to take a look at the switches today to see if there's any interesting debug information. Looks like a lot of machines are flapping between their respective port and port 48 (either the switch uplink or the wireless controller). See the following image for more:

I Noticed a mild DHCP broadcast storm. Sent first from 00:24:97:ce:d2:60, then immediately from 00:42:5a:76:23:24, but spammed a couple hundred times in rapid succession. This might not be a problem for one machine, but across many machines (rebooting at once) with a short DHCP lease time (1200 seconds), it could add up to a perfect storm. Though there's no reason to think this would cause sustained issues. UPDATE: Found a second occurrence. Different machine and IP. UPDATE: These aren't broadcasts. Not even close to a storm.
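To catch a recurrence of the rapid DHCP bursts, a capture filtered to the DHCP/BOOTP ports with MAC addresses visible is enough (interface name assumed):
{{{
# Watch DHCP traffic and show link-layer addresses, no name resolution
tcpdump -i eth0 -e -n 'port 67 or port 68'
}}}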
I Kerberos on file0 appears to be using RST packets. Unclear from cursory research whether this is normal or not. Usually right after the RST, some out-of-order segments start coming through. The RST seems to be initiated by the client, maybe?
I Another Kerb RST from a client in response to a retransmit of a packet from 6 seconds before. Connection closing. Probably a normal client response, but a very overdue retransmission. Also repeated 8 seconds after the last... (253.154)
I file0 sent RST for ldap connection to 253.152. Looks like it's in response to a poor connection.
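To pull just these Kerberos/LDAP resets out of the dumps for a closer look (capture filename is a placeholder; older tshark takes -R instead of -Y):
{{{
# TCP resets on the Kerberos (88) and LDAP (389) ports, timestamps relative to capture start
tshark -r dump1.pcap -t r -Y "tcp.flags.reset == 1 and (tcp.port == 88 or tcp.port == 389)"
}}}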
CategoryITNotes