X-RT-Interface: Web Content-Disposition: inline Content-Type: text/plain; charset="utf-8" From: michal@isc.org MIME-Version: 1.0 Content-Transfer-Encoding: binary Subject: Prevent further races in RFC 5011-related code Date: Thu, 02 Nov 2017 15:42:53 +0100 To: bind9-public@isc.org Message-ID: X-Mailer: MIME-tools 5.508 (Entity 5.508) X-RT-Original-Encoding: utf-8 Content-Length: 4273 Prevent further races in RFC 5011-related code The updated mkeys system test triggered an assertion failure [1]: 02-Nov-2017 00:27:13.069 zone_timer: managed-keys-zone : enter 02-Nov-2017 00:27:13.069 zone_maintenance: managed-keys-zone : enter 02-Nov-2017 00:27:13.069 zone_dump: managed-keys-zone : enter 02-Nov-2017 00:27:13.069 socket 0x7fef84928540 10.53.0.1#53238: packet received correctly 02-Nov-2017 00:27:13.069 zone_refreshkeys: managed-keys-zone : enter 02-Nov-2017 00:27:13.069 managed-keys-zone: Creating key fetch in zone_refreshkeys() for '.' 02-Nov-2017 00:27:13.069 fetch: ./DNSKEY 02-Nov-2017 00:27:13.069 fctx 0x7fef7801df90(./DNSKEY): create 02-Nov-2017 00:27:13.069 received control channel command 'managed-keys refresh' 02-Nov-2017 00:27:13.069 log_ns_ttl: fctx 0x7fef7801df90: fctx_create: . (in '.'?): 1 16 02-Nov-2017 00:27:13.069 fctx 0x7fef7801df90(./DNSKEY): join 02-Nov-2017 00:27:13.069 fetch 0x7fef84922548 (fctx 0x7fef7801df90(./DNSKEY)): created 02-Nov-2017 00:27:13.069 managed-keys-zone: synchronizing trusted keys 02-Nov-2017 00:27:13.069 zone_settimer: managed-keys-zone : enter 02-Nov-2017 00:27:13.069 zone_settimer: managed-keys-zone : settimer inactive 02-Nov-2017 00:27:13.069 zone_gotwritehandle: managed-keys-zone : enter 02-Nov-2017 00:27:13.101 rbtdb.c:1483: REQUIRE(rbtdb->future_version == ((void *)0)) failed, back trace 02-Nov-2017 00:27:13.101 #0 0x42781d in assertion_failed()+0x4d 02-Nov-2017 00:27:13.101 #1 0x6164ea in isc_assertion_failed()+0xa 02-Nov-2017 00:27:13.101 #2 0x4fd14a in newversion()+0x4ba 02-Nov-2017 00:27:13.101 #3 0x5c5ca0 in sync_keyzone()+0x140 02-Nov-2017 00:27:13.101 #4 0x5c645b in dns_zone_synckeyzone()+0x8b 02-Nov-2017 00:27:13.101 #5 0x43a275 in named_server_mkeys()+0x425 02-Nov-2017 00:27:13.101 #6 0x422bcd in named_control_docommand()+0x92d 02-Nov-2017 00:27:13.101 #7 0x425b3d in control_recvmessage()+0x80d 02-Nov-2017 00:27:13.101 #8 0x6387ed in run()+0x2cd 02-Nov-2017 00:27:13.101 #9 0x7fef90398dc5 in __do_global_dtors_aux_fini_array_entry()+0x7fef8fa8acd5 02-Nov-2017 00:27:13.101 #10 0x7fef900c773d in __do_global_dtors_aux_fini_array_entry()+0x7fef8f7b964d 02-Nov-2017 00:27:13.101 exiting (due to assertion failure) It is caused by zone_refreshkeys() releasing the zone lock too early: 9801 UNLOCK_ZONE(zone); 9802 9803 dns_diff_clear(&diff); 9804 if (ver != NULL) { 9805 dns_rriterator_destroy(&rrit); 9806 dns_db_closeversion(db, &ver, commit); 9807 } When "rndc managed-keys refresh" is called around the time a key refresh fetch is created, threads may be scheduled in a way which causes sync_keyzone() to call dns_db_newversion() before zone_refreshkeys() cleans up the database version it creates earlier. While investigating this issue, I found another one: if receiving the response to the initializing RFC 5011 key refresh query is quickly followed by receiving a control message from "rndc managed-keys refresh" and these two events happen at the same timestamp (with one second resolution), no further RFC 5011 refreshes will occur until named is restarted. This is caused by the same condition that caused RT #46055 because it checks whether the add hold-down timer for a key is lower than the current timestamp instead of checking whether it is equal or lower. This results in the DNS_ZONEFLG_REFRESHING flag being permanently set. As a side note, the code is a bit inconsistent about when a key is considered to be trusted, e.g. keyfetch_done() marks a key as trusted when its add hold-down timer is equal or lower than current time while mkey_dumpzone() claims a key with an add hold-down timer equal or greater than current time is pending trust. RFC 5011 does not specify which interpretation is correct as it uses the term "timer expiry" to define the moment a key should be trusted. I am inclined to consistently use the interpretation used by keyfetch_done(), i.e. a key with an add hold-down timer equal to current time should be considered trusted. [1] https://jenkins.isc.org/view/BIND/job/bind9-master-centos7-amd64-1/334/