MIME-Version: 1.0
In-Reply-To: <20170526103815.GA22789@jurassic>
X-Mailer: MIME-tools 5.505 (Entity 5.505)
Content-Disposition: inline
X-RT-Interface: Web
References: <20170526103815.GA22789@jurassic> <RT-Ticket-45293@isc.org>
Content-Type: text/plain; charset="utf-8"
Message-ID: <rt-4.2.8-30832-1496926116-967.45293-0-0@isc.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
RT-Send-CC:
X-RT-Encrypt: 0
X-RT-Sign: 0
Content-Length: 2669

I believe the mkeys system test is subject to subtle timing issues,
namely we do some grepping on the managed-keys zone file without waiting
until it is actually written to disk after "rndc managed-keys sync" is
invoked.

I was able to intermittently reproduce this issue only on a Fedora 25
VM.  I then modified bin/tests/system/mkeys/tests.sh to print current
time with nanosecond precision just after the problematic grep call
(`grep "trust pending" ns2/managed-keys.bind` in the "restore untrusted
standby key, revoke original key" test).  Here is an example of a failed
test being run with that modification:

    I: check for signed record (1)
    I: check positive validation with valid trust anchor (2)
    I: check positive validation using delv (3)
    I: check for failed validation due to wrong key in managed-keys (4)
    I: check new trust anchor can be added (5)
    I: ns2 refreshing managed keys for '_default'
    I: check new trust anchor can't be added with bad initial key (6)
    I: ns3 refreshing managed keys for '_default'
    I: remove untrusted standby key, check timer restarts (7)
    I: ns2 refreshing managed keys for '_default'
    I: restore untrusted standby key, revoke original key (8)
    I: ns2 refreshing managed keys for '_default'
    13:05:56.664647628
    I:failed

Meanwhile, named.run for that specific test contains:

    13:05:56.656 received control channel command 'managed-keys sync'
    13:05:56.656 zone_dump: managed-keys-zone : enter
    13:05:56.656 zone_gotwritehandle: managed-keys-zone : enter
    ...
    13:05:56.673 dump_done: managed-keys-zone : enter

In other words, we run grep on what we believe is the new managed-keys
zone file, compare that to the output of a similar grep command run on
the old managed-keys zone file and expect the two values to be
different.  What actually happens, though, is that both grep commands
are run on the same file, i.e. the old managed-keys zone file (I
confirmed this by echoing $t1 and $t2 in the failing test).

A quick way to reproduce this issue reliably (at least on the machines I
am using) is to remove the "sleep 1" line after the "managed-keys sync"
rndc call in the "remove untrusted standby key, check timer restarts"
test.

To fix this issue and also achieve some consistency in the mkeys system
test, I modified it so that we always sleep for 1 second after each
"managed-keys refresh" and "managed-keys sync" rndc call.  To avoid
confusion, I also removed all "managed-keys sync" calls whose results
(i.e. the managed-keys zone file) are not used for determining the
result of the test.

Fixes described above are in the rt45293 branch.  Please review.