Robbat2 (robbat2) wrote,

[Gentoo] Upgrading/using nss_ldap/nss_mysql/nss_nis/nss... and not breaking your system

So lately there have been a lot of complaints about nss_ldap-249+ breaking systems on boot. This is not actually a new breakage, but a change in behavior that exposed something that was always broken. Most of the points below apply to any NSS backend whose data source might not be available during the early phases of booting (because the LDAP server has not started yet, or the network is not up yet).

In your /etc/nsswitch.conf file, you may have lines like:
passwd: files ldap
group: files ldap
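If you have it the other way around, that's the first cause of breakage. The sources that are always available (files) need to come first, so that lookups at system boot time are tried against them before anything that may be unreachable.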
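A quick way to check the ordering is to parse the file and look at the first source for each database. The following is only a minimal sketch: the function names are made up for illustration, and it only looks for the literal source name "files" (so a "compat" setup would be flagged even though it is also local).

```python
import re

def sources_for(nsswitch_text, database):
    """Return the ordered list of sources for one database, e.g. 'passwd'."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        if name.strip() == database:
            # Drop action items such as [NOTFOUND=return]
            rest = re.sub(r"\[[^\]]*\]", " ", rest)
            return rest.split()
    return []

def files_first(nsswitch_text, database):
    """True if 'files' is the first source listed for the database."""
    sources = sources_for(nsswitch_text, database)
    return bool(sources) and sources[0] == "files"

sample = "passwd: files ldap\ngroup: ldap files\n"
for db in ("passwd", "group"):
    status = "ok" if files_first(sample, db) else "remote source listed first!"
    print(db, "->", status)
```

To run it against a real system, read /etc/nsswitch.conf into a string and pass that in place of the sample.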

During boot, nearly every init script causes at least one lookup, and some, like udev, cause a lot of them. If a lookup can be answered by the files NSS backend, it never needs to go to LDAP (or any other unavailable backend). In the case of udev, for a very long time there has been this rule:
/etc/udev/rules.d/50-udev.rules:KERNEL=="tpm*", NAME="%k", OWNER="tss", GROUP="tss", MODE="0600"
This causes udev to look up the user and group 'tss' (that's two lookups). Does your system have a 'tss' user and group? Unless you have the app-crypt/trousers package installed, you probably don't.
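To see which names would fall through to the remote backend, you can compare the names your boot scripts ask for against the local files directly, without going through NSS at all. A rough sketch follows; the helper names and the sample passwd entries are invented for illustration, and the list of names to check is something you have to collect yourself.

```python
def local_names(file_text):
    """Collect the name field of every entry in passwd/group-format text."""
    names = set()
    for line in file_text.splitlines():
        line = line.strip()
        # Skip blanks, comments and compat-style +/- entries
        if not line or line[0] in "#+-":
            continue
        names.add(line.split(":", 1)[0])
    return names

def missing_locally(wanted, file_text):
    """Return the wanted names that have no entry in the local file."""
    present = local_names(file_text)
    return [name for name in wanted if name not in present]

# Made-up sample; on a real system read /etc/passwd (or /etc/group) instead.
sample_passwd = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "daemon:x:2:2:daemon:/sbin:/bin/false\n"
)
print(missing_locally(["root", "tss"], sample_passwd))
```

Anything this reports as missing is a lookup that will be passed on to the next source in nsswitch.conf.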

Ok, so if this has always been a problem, why did it suddenly turn up now? nss_ldap-249 has a change of behavior (badly documented by upstream, unfortunately). It changed from hardcoded timeout values to configurable ones, and greatly increased the defaults. Previously, if the server was not available or otherwise had issues, nss_ldap gave up after at most 30 seconds (and a lot sooner if the server IP/port was actually unreachable). As of 249, it takes 124 seconds. It tries twice, then waits 4 seconds, then another 8 seconds, another 16 seconds, another 32 seconds, and finally another 64 seconds, with an attempt between each of the waits. Unfortunately this behavior is serial, and happens for every lookup: udev tries to look up user 'tss', then group 'tss', etc. On some systems, this made boot-up unbearably slow, as there were 30+ lookups that went to nss_ldap, at 2 minutes each, leading to an hour of waiting before the actual login prompt came up.
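The arithmetic behind those numbers is easy to check. This is just a back-of-the-envelope sketch: it adds up the sleeps described above and ignores the time spent in the connection attempts themselves, and the function and parameter names are mine, not nss_ldap's.

```python
def total_sleep_seconds(first_sleep=4, max_sleep=64):
    """Sum a doubling backoff: 4 + 8 + 16 + 32 + 64 with the defaults."""
    total = 0
    sleep = first_sleep
    while sleep <= max_sleep:
        total += sleep
        sleep *= 2
    return total

per_lookup = total_sleep_seconds()
lookups = 30  # the "30+" failed lookups mentioned above

print(per_lookup)                 # 124
print(lookups * per_lookup / 60)  # 62.0 (minutes)
```

Since the lookups happen one after another, the delays simply add up, which is where the hour comes from.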

How do we fix this?
The proper way: For every Gentoo init script, we need to make sure that every value looked up is actually in the system files, so that no requests go to nss_ldap or any other remote backend. In the case of udev, this is a known flaw: it looks up stuff it doesn't need to. If somebody has enough time to look at the udev code, upstream would greatly appreciate it - they don't have enough time to do it themselves. In the meantime, you can also comment out the tss line if you want.
The temporary hack: I've committed nss_ldap-250-r1, which changes the default timeouts in the header files and documents them in /etc/ldap.conf, along with the old values and some even faster (read: more dangerous) ones.

Side note: It does seem there is something that changed with regards to SSL behaviour in either openldap-2.3.* or nss_ldap between 239 and 249. In some setups, 'ssl on' no longer works, but specifying a plain ldap:// URL instead of ldaps://, and using 'ssl start_tls' works perfectly fine. If you run into this, move to TLS!

Tags: gentoo, ldap, libc, nss_ldap