Bug #47898 for dhcp-public: Document how to safely change pools after a failover partnership has already been established

Wed, 04 Jul 2018 09:47:37 -0400 Cathy Almond <cathya@isc.org> - Ticket created

Subject:	Better handle moving pool from one DHCP failover association to another
From:	cathya@isc.org
Date:	Wed, 04 Jul 2018 13:47:33 +0000
To:	dhcp-suggest@isc.org

As requested:

There appears to be a bug in the ISC failover code. I've tried this out with 4.3.6-P1 and reviewed the current release notes, and I don't see any newer patches that would address it.

If a server (A) is participating in two failover associations, one each to two other servers (B and C), and a pool is moved from one FA (AB) to the second FA (AC), the server's knowledge of the state of the free and backup pools will be inconsistent with the new peer. That is, A will think that it has a certain number of free and backup leases, while C will not have this information. When C requests a pool rebalance, A will believe the pools are properly balanced and will not send any updates.

To demonstrate this issue, create three servers A, B and C with two failover associations, AB and AC. In the config file for A, include three subnets and pools, 172.16.131.0/24, 172.16.132.0/24 and 172.16.133.0/24, with reasonable ranges. Associate two ranges with AB and one with AC. Start all three servers and let them sync and balance. At this point A and B will each have half of their two pools, and A and C will each have half of their one pool.

Now stop all three servers and move one pool from AB to AC. When A reads its lease file it will still have half of all three of its pools and will think its peers have the other half. When C reads its lease file it will have half of the first pool it had but will not have any leases from the new pool. It will request a pool rebalance from A, but as far as A can tell it already has half of the available leases.
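
The mismatch described in the report can be illustrated with a toy model. This is a minimal sketch, not dhcpd's actual balancing code: the `initial_split` and `needs_rebalance` helpers, the pool size, and the "within one lease" balance rule are all assumptions made for illustration.

```python
# Toy model of the report above; NOT dhcpd's real balancing algorithm.
# Each server keeps its own per-pool view: leases it holds itself ("free")
# and leases it believes its peer holds ("backup").

def initial_split(pool_size):
    """After a normal sync, the two peers each hold about half of the pool."""
    half = pool_size // 2
    return {"free": half, "backup": pool_size - half}

def needs_rebalance(view):
    """Assumed rule: a server only sends updates if its own view looks unbalanced."""
    return abs(view["free"] - view["backup"]) > 1

POOL_SIZE = 200  # hypothetical number of addresses in the moved pool

# Server A balanced this pool with B, and that state survives in A's lease file.
view_on_a = initial_split(POOL_SIZE)

# Server C has never seen the pool, so its lease file holds nothing for it.
view_on_c = {"free": 0, "backup": 0}

# C asks A for a rebalance, but A's view is already balanced, so nothing is sent.
a_sends_updates = needs_rebalance(view_on_a)
c_leases_after_request = view_on_c["backup"] + (POOL_SIZE // 2 if a_sends_updates else 0)

print(a_sends_updates)         # False
print(c_leases_after_request)  # 0
```

In this model C stays at zero leases for the moved pool because A's decision is based only on A's own, stale view.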

Wed, 04 Jul 2018 09:48:02 -0400 Cathy Almond <cathya@isc.org> - Reference to https://support.isc.org/Ticket/Display.html?id=13067 added

Wed, 04 Jul 2018 12:23:03 -0400 Cathy Almond <cathya@isc.org> - Correspondence added

Original submitter responses to various questions:

> Our initial thoughts are that the primary doesn't really have any
> good way to know that there has been the switch, or to know quite
> what to do with the fact that it happened - and that dhcpd has
> probably operated this way since forever with failover partnerships.

I haven't checked but I believe the people we had reproduce this had issues with either the primary or secondary in this case. I suspect you are correct that this has been around for a while. I didn't dig into the code really deeply but don't believe there is any way for either the primary or secondary to figure out that something has happened directly. There would need to be some information in the lease file to allow the server to compare what it had previously and what it has now.

> We wonder what would happen if you fault the databases on the
> secondary failover peer servers.

I think removing the lease file on the peer that the pool was moved to would work, as it should cause a full update. However, there would be a performance hit. This probably isn't too bad if the peer is in only one FA but might be unpleasant if the peer is in multiple peering relationships and now needs to do a full update for all of them.

> Do you think this behaviour violates the DHCP Failover Specification
> in some way?

Faulting the DBs? No.
Having inconsistent DBs between peers? It might not violate the letter of it, but I'd certainly believe it violates the spirit of it.

> What thoughts do you have on how dhcpd should adapt to this
> kind of pool/peering swapping?

Things we've thought about:

1) Basically do a full update on the pool on a rebalance request, by adding code to add all of the possible leases to the update queue (in our case we mostly care about backup and free - active are kept in our DB and so are properly available)
2) Trigger a full update when a pool is moved. This would probably require some sort of flag in the config or maybe lease files. In our case the trigger would need to be removed from our DB automatically; in your case you might be able to have the admin change the config file to add and then remove the flag.
3) Tag the leases with the FA to which they belong; if it changes, trigger an appropriate update
4) Tag extra pool <-> FA information into the lease file
5) Delete the lease file (I'm trying to have someone get some idea of the performance penalty in doing this.)

I'm trying to balance the performance hit against the possible code complications.
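
Options 3 and 4 in the submitter's list both amount to comparing a recorded pool-to-FA mapping with the one in the newly loaded configuration. The following is a minimal sketch of that comparison only, under stated assumptions: dhcpd does not record such a mapping, and the `moved_pools` function, the pool names, and the dictionaries are hypothetical.

```python
def moved_pools(recorded, configured):
    """Return {pool: (old_fa, new_fa)} for pools whose failover association changed."""
    return {
        pool: (recorded[pool], new_fa)
        for pool, new_fa in configured.items()
        if pool in recorded and recorded[pool] != new_fa
    }

# Hypothetical mapping saved alongside the lease file at the last run.
recorded = {"pool-131": "AB", "pool-132": "AB", "pool-133": "AC"}
# Hypothetical mapping derived from the newly loaded configuration.
configured = {"pool-131": "AB", "pool-132": "AC", "pool-133": "AC"}

for pool, (old_fa, new_fa) in moved_pools(recorded, configured).items():
    print(f"{pool}: {old_fa} -> {new_fa}, queue a full update for this pool")
```

A server with this information could limit the full update to the moved pool instead of every pool in the association.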

Wed, 04 Jul 2018 12:23:04 -0400 The RT System itself - Status changed from 'new' to 'open'

Wed, 04 Jul 2018 12:27:24 -0400 Cathy Almond <cathya@isc.org> - Correspondence added

Further correspondence on tackling this:

ISC:

> Our operational advice for handling this situation would be something
> along the lines of:
>
> - comment out the to-be-differently-failover-paired subnets from the
>   server configurations that they're currently in
> - restart/stabilize
> - un-comment/edit to re-add the subnets as they now need to be
> - restart/stabilize
>
> The servers on the first restarts will discard any leases that don't
> belong anywhere in their configured subnets.
>
> On the second set of restarts, when clients renew (versus discover),
> they should be re-granted leases in the reconfigured subnets with the
> new failover pairings in action.
>
> Almost certainly there will be some operational 'nits' floating around
> depending on how the clients behave etcetera, but the above would seem
> to be the course of least disruption - particularly in a large-scale
> deployment environment.
>
> It's better than faulting the database entirely on the secondaries,
> which could be quite time-consuming to recover. We also know,
> operationally, that it will work to establish the new failover
> pairings correctly.
>
> Could someone manipulate the leases files to retain the existing
> leases through this? Possibly - but I would not like to be that
> sysadmin. Safer would be to discard the leases and let dhcpd take
> care of things itself.
>
> Let me know what you think - and apologies that there aren't any
> quick-fix answers to this one, although I doubt that you're too
> surprised?

Re: operational fix - I'll pass that along, but I'm not sure if that is usable in our situation. With the config file generation and dhcpd restarts being handled by the grid master, it adds another layer between the admin and the server. Normally this is good, but it makes it harder to do something like your suggestion.

Thu, 12 Jul 2018 07:56:16 -0400 Cathy Almond <cathya@isc.org> - Subject changed from 'Better handle moving pool from one DHCP failover association to another' to 'Document how to safely change pools after a failover partnership has already been established'

Thu, 12 Jul 2018 07:56:16 -0400 Cathy Almond <cathya@isc.org> - Queue changed from #10 to dhcp-public

Thu, 12 Jul 2018 07:56:16 -0400 Cathy Almond <cathya@isc.org> - Priority P1 High added

Thu, 12 Jul 2018 08:05:20 -0400 Cathy Almond <cathya@isc.org> - Correspondence added

After lengthy discussion, a decision was made that we are not going to add any functionality to ISC DHCP to make it 'safe' to reconfigure pools that have already been created and established between failover partners without going through a process that involves:

a) Deleting the pools from both partner configurations
b) Restarting both failover partners (one at a time) to reload the leases file, thus removing any leases already granted from the two pools
c) Reconfiguring the pools again in their new partnerships
d) Restarting the (new) failover partners

The underlying issue is around what happens to 'sort out' the anomalies during the transition:

- which server is responsible for leases already allocated (particularly if that server becomes no longer responsible for that pool after the reconfiguration)
- how pool balancing is managed when the servers are being rebooted, when instead of deleting and re-adding the pools to 'clean up' first, they are simply edited, so you end up with a transient mismatch between DHCP servers as the partners are restarted to load the new configuration/topology.

The same constraints would also apply to alterations to the range of addresses handled by a failover pair, for example if a pool becomes exhausted and needs to be extended. Again it is the pool balancing during the transition where the outcome is unpredictable, because the two partners have a different view of the size of the pool.

----

Instead, we will document our recommendations for safely making this type of ISC DHCP configuration change.
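
Steps (a) to (d) above can be sketched as a small planning helper that turns an old and a new pool-to-partners mapping into an ordered checklist. This is an illustrative sketch only, not an ISC tool: the `reconfiguration_plan` function, the pool name, and the server labels are hypothetical, and it only prints the steps rather than editing any configuration.

```python
def reconfiguration_plan(old, new):
    """Build the ordered steps for pools whose failover partners change.

    old and new map a pool name to a tuple of its two partner servers.
    """
    moved = sorted(p for p in new if p in old and old[p] != new[p])
    if not moved:
        return []
    old_partners = sorted({s for p in moved for s in old[p]})
    new_partners = sorted({s for p in moved for s in new[p]})
    # (a) delete the pools from both current partner configurations
    steps = [f"remove pools {moved} from the configuration of {old_partners}"]
    # (b) restart the current partners one at a time
    steps += [f"restart {s} and wait for it to stabilize" for s in old_partners]
    # (c) reconfigure the pools in their new partnerships
    steps.append(f"add pools {moved} to the configuration of {new_partners}")
    # (d) restart the new partners
    steps += [f"restart {s} and wait for it to stabilize" for s in new_partners]
    return steps

# Hypothetical example: one pool moves from the A/B pair to the A/C pair.
old = {"pool-132": ("A", "B")}
new = {"pool-132": ("A", "C")}
plan = reconfiguration_plan(old, new)
for number, step in enumerate(plan, start=1):
    print(f"{number}. {step}")
```

Keeping the removal and the re-addition as separate restart rounds reflects the point made above: the first round lets the servers drop leases for the removed pools before the new pairing is introduced.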

Thu, 12 Jul 2018 08:43:12 -0400 Cathy Almond <cathya@isc.org> - Reference to https://support.isc.org/Ticket/Display.html?id=13093 added

Created:	Wed, 04 Jul 2018 09:47:34 -0400
Updated:	Thu, 12 Jul 2018 08:43:26 -0400
Closed:	Not set

This bug tracker is no longer active.