Report information

The Basics
Id: 46602
Status: resolved
Priority: Medium/Medium
Queue:

People
Requestors: Stephen Morris <stephen@isc.org> (email delivery suspended)
Cc:
AdminCc:

BugTracker
Version Fixed: (no value)
Version Found: (no value)
Versions Affected: (no value)
Versions Planned: (no value)
Priority: P2 Normal
Severity: S2 Normal
CVSS Score: (no value)
CVE ID: (no value)
Component: BIND Server
Area: test

Dates
Created: Tue, 14 Nov 2017 11:07:39 -0500
Updated: Sun, 25 Feb 2018 18:29:19 -0500
Closed: Sun, 25 Feb 2018 18:29:19 -0500




To: bind-suggest@isc.org
Subject: Allow system tests to run in parallel
Date: Tue, 14 Nov 2017 16:07:39 +0000
From: stephen@isc.org
The system tests can take a significant time to run. Ondrej suggested
that they could be modified to run in parallel. This requires a number
of changes to the system test scripts.

Requirements
---
1. Run all the tests with up to a user-specified number of tests
   running in parallel.
2. If required, it should be possible to run a single test. It should
   also be possible to run all the tests sequentially.
3. Basic start/end messages should be output to stdout - messages from
   different tests may be interspersed. The full log from all of the
   tests should be placed in systests.output, ordered such that the
   output from each test forms a block of consecutive lines (as it does
   at present).
4. It should be possible to run single tests as can be done at present.
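[To make the requirements concrete, the invocations they imply might
look like the following. The target names and options here are purely
illustrative, not the interface this ticket eventually settled on:

    make -j8 check      # requirement 1: up to 8 tests in parallel
    make -j1 check      # requirement 2: all tests, run sequentially
    sh run.sh dnssec    # requirements 2 and 4: one test, as at present
]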
On Tue Nov 14 16:07:39 2017, stephen wrote:
> The system tests can take a significant time to run. Ondrej suggested
> that they could be modified to run in parallel. This requires a
> number of changes to the system test scripts.

=> unfortunately this is impossible with networking protocols, because
the use of resources like address/port pairs requires tests to be run
strictly in sequence. BTW, in Kea the way the agent's unit and system
tests are run by the Makefile conflicts for this reason, so it is not
possible to run "make distcheck" with a "-j xxx" argument. In
conclusion, not only is it impossible, but we already have a product
where some tests fail when run in parallel.

Now there are tools to build applications on different machines
(another meaning of "parallel"), so perhaps we can adapt one to run
tests this way. Note that all the unit and system tests we have require
only that they be run individually on one box.
On Tue Nov 14 23:57:40 2017, fdupont wrote:
> On Tue Nov 14 16:07:39 2017, stephen wrote:
> > The system tests can take a significant time to run. Ondrej
> > suggested that they could be modified to run in parallel. This
> > requires a number of changes to the system test scripts.
>
> => unfortunately this is impossible with networking
> protocols because the use of resources like address/port
> pairs requires tests to be run strictly in sequence.

No, you are mistaken. I already parallelized several long-running
tests. Please let me and Stephen do our work, so we don't end up in a
long streak of discussion.
On Wed Nov 15 03:01:24 2017, ondrej wrote:
> On Tue Nov 14 23:57:40 2017, fdupont wrote:
> > On Tue Nov 14 16:07:39 2017, stephen wrote:
> > > The system tests can take a significant time to run. Ondrej
> > > suggested that they could be modified to run in parallel. This
> > > requires a number of changes to the system test scripts.
> >
> > => unfortunately this is impossible with networking
> > protocols because the use of resources like address/port
> > pairs requires tests to be run strictly in sequence.
>
> No, you are mistaken. I already parallelized several long-running
> tests.

=> "to be run in parallel" means running 2 different system tests in
parallel (i.e., dnssec and xfer), not parallelizing some steps inside a
script. So I commented on the former, as I agree with you that the
tests take far too long.

> Please let me and Stephen do our work, so we don't end up in a long
> streak of discussion.

=> change the queue: bind-suggest is about ideas which need more
thinking, not about current work. And in other queues, having an owner
means something too.
I already mailed this to Stephen, but for clarity I am going to put
this here as well. I pushed the gitlab-ci-parallel-test branch into
repo.isc.org (ignore the gitlab-ci part for now - I did it for GitLab
CI, but the whole parallel thing is independent).

These are the changes that need to be made (see the sketch below):

* Makefile - add parallelized tests as individual (.PHONY) targets and
  remove them from conf.sh.*
* Dynamically allocate a free port (or ports) for each individual
  parallelized test (something like this might be sufficient:
  PORT=$$((32767+$$RANDOM)))
* Pass the port(s) to ./run.sh
* Modify the individual tests to use the passed port(s) instead of 5300

Bonuses:

* Modify the parallelized tests to use just localhost addresses
* Modify the parallelized tests to use the address(es) passed from the
  command line
* Modify the parallelized tests to cope with both IPv4 and IPv6
  addresses
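[A minimal sketch of what the Makefile side of this might look like.
The test list, the -p option to run.sh, and the reliance on $RANDOM
(which needs a shell such as bash) are illustrative assumptions, not
the actual interface on the branch:

    # Hypothetical Makefile fragment: each parallelized test is a
    # .PHONY target that allocates a random base port and hands it to
    # run.sh. "make -j3 allow_query serve-stale rpzrecurse" would then
    # run the three tests concurrently. Recipe lines start with a tab.
    PARALLEL_TESTS = allow_query serve-stale rpzrecurse

    .PHONY: $(PARALLEL_TESTS)
    $(PARALLEL_TESTS):
	PORT=$$((32767+$$RANDOM)); \
	sh ./run.sh -p $$PORT $@
]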
JFTR, allow_query, serve-stale and rpzrecurse have already been
modified to run in parallel.

Note: I think the random port allocation is enough for the tests, but
if you want to make it more robust, a script that would lock, allocate,
check the port, and unlock should be made. But I strongly believe it's
not necessary for a low number of tests and 32k available ports.
On Wed Nov 15 07:18:41 2017, ondrej wrote:
> JFTR, allow_query, serve-stale and rpzrecurse have already been
> modified to run in parallel.
>
> Note: I think the random port allocation is enough for the tests, but
> if you want to make it more robust, a script that would lock,
> allocate, check the port, and unlock should be made. But I strongly
> believe it's not necessary for a low number of tests and 32k
> available ports.

=> 32k available, but a 1/2 collision probability after ~180 ports (the
birthday paradox). More seriously, I really prefer the "more robust"
option, as I run tests on a machine that is also used for other
purposes. BTW, a feature which is very useful in some cases, and which
should come for free with random ports (i.e., just check that it
works), is being able to run the same test simultaneously on different
builds.
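[For reference, the standard birthday-problem approximation behind that
estimate, with N = 32768 usable ports and n tests each drawing one port
uniformly at random, puts the 1/2 mark near n = 214, the same ballpark
as the ~180 quoted:

    P(collision) ~= 1 - exp(-n(n-1)/(2N))
    n = 180:  P ~= 1 - exp(-32220/65536) ~= 0.39
    P = 1/2 at n ~= sqrt(2 N ln 2) ~= 214
]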
On Wed Nov 15 09:24:16 2017, fdupont wrote:
> On Wed Nov 15 07:18:41 2017, ondrej wrote:
> > JFTR, allow_query, serve-stale and rpzrecurse have already been
> > modified to run in parallel.
> >
> > Note: I think the random port allocation is enough for the tests,
> > but if you want to make it more robust, a script that would lock,
> > allocate, check the port, and unlock should be made. But I strongly
> > believe it's not necessary for a low number of tests and 32k
> > available ports.
>
> => 32k available, but a 1/2 collision probability after ~180 ports.

We have 103 system tests now, and only the most time-critical are being
ported.

> More seriously, I really prefer the "more robust" option, as I
> run tests on a machine that is also used for other purposes. BTW, a
> feature which is very useful in some cases, and which should come for
> free with random ports (i.e., just check that it works), is being
> able to run the same test simultaneously on different builds.

If you provide a script that will print the usable port number to
stdout, we will integrate it.
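[Such a lock/allocate/check/unlock script might look like the sketch
below. The lock and state file paths, the starting port, and the
netstat-based availability check are all illustrative assumptions, and
netstat output formats vary between systems:

    #!/bin/sh
    # get_port.sh (hypothetical): print an unused port to stdout,
    # serializing concurrent callers with a lock.
    LOCKDIR=/tmp/systest-port.lock
    STATE=/tmp/systest-port.next

    # mkdir is atomic, so it serves as a portable lock; wait for it.
    while ! mkdir "$LOCKDIR" 2>/dev/null; do
        sleep 1
    done
    trap 'rmdir "$LOCKDIR"' EXIT

    # Next candidate port, persisted across invocations.
    port=$(cat "$STATE" 2>/dev/null || echo 32768)
    port=${port:-32768}

    # Skip anything already in use on this machine.
    while netstat -an 2>/dev/null | grep -q "[:.]$port "; do
        port=$((port + 1))
    done

    echo $((port + 1)) > "$STATE"
    echo "$port"
]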
>> More seriously, I really prefer the "more robust" option, as I
>> run tests on a machine that is also used for other purposes. BTW, a
>> feature which is very useful in some cases, and which should come
>> for free with random ports (i.e., just check that it works), is
>> being able to run the same test simultaneously on different builds.
>
> If you provide a script that will print the usable port number to
> stdout, we will integrate it.

I've already done this. "make" is used to run the tests in parallel:
instead of putting the dependencies in the main Makefile (which is
where the difficulty of assigning unique ports arises), I've got it to
dynamically create a temporary makefile in which each test is assigned
a unique port, and to recursively run "make" on that.

>> => 32k available, but a 1/2 collision probability after ~180 ports.
>
> We have 103 system tests now, and only the most time-critical are
> being ported.

Although the issue with the clash on the query and control ports has
been resolved, there is another source of clashes. Evan raised the
point that, after Kaminsky, BIND uses a random port for each upstream
query. If we have multiple instances of BIND running on the same IP
address, it is possible that more than one instance will choose the
same port at approximately the same time, and there is no guarantee
which instance will receive which response. The probability of this
happening during an individual test is small, but we run tests
continuously on multiple machines: sooner or later there will be a
clash and we will have an apparently random failure.

I think we could get round this by adding a statement of the form
"use-v4-udp-ports <low> <high>" to the configuration files, with a
unique range of ports being chosen for each test. Given the number of
tests, I'd suggest giving each test a pool of 200 ports in the range
20,000 to 60,000. This will allow for up to 200 tests before we need to
start thinking about port ranges. I don't think this will have an
impact on the tests, but I'd appreciate input.
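[A minimal sketch of that generate-and-recurse scheme, with the base
port arithmetic matching the proposed 200-port pools starting at
20,000. The temporary makefile name, the MAXPARALLEL variable and the
-p option to run.sh are illustrative; the actual implementation is on
the branch:

    #!/bin/sh
    # Sketch: write a throwaway makefile that gives each named test a
    # unique 200-port range, then run it with make -j.
    # Usage: sh parallel.sh allow_query serve-stale rpzrecurse
    tmpmk="parallel.mk.$$"
    {
        printf '.PHONY: all %s\n' "$*"
        printf 'all: %s\n' "$*"
        i=0
        for t in "$@"; do
            baseport=$((20000 + i * 200))
            printf '%s:\n\tsh ./run.sh -p %s %s\n' "$t" "$baseport" "$t"
            i=$((i + 1))
        done
    } > "$tmpmk"
    make -j "${MAXPARALLEL:-4}" -f "$tmpmk" all
    rm -f "$tmpmk"
]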
The consensus on my last question is that it wasn't an issue - ports
will be bound before being used. So nothing has been changed in this
respect.

This ticket is now ready for first review. As parallelising all the
tests will change a large number of files, I'd prefer to get a review
done of the framework changes and the few tests that have already been
converted before proceeding with the rest of the changes.

I suggest that the review start with bin/tests/system/README. This has
been more or less completely rewritten, and should provide context for
the review of the remaining files.
On Fri Nov 24 13:16:32 2017, stephen wrote:
> The consensus on my last question is that it wasn't an issue - ports
> will be bound before being used. So nothing has been changed in this
> respect.
>
> This ticket is now ready for first review. As parallelising all the
> tests will change a large number of files, I'd prefer to get a review
> done of the framework changes and the few tests that have already
> been converted before proceeding with the rest of the changes.

Ack. Especially the changes in tests.sh get confusing, as there are a
lot of replacements.

> I suggest that the review start with bin/tests/system/README. This
> has been more or less completely rewritten, and should provide
> context for the review of the remaining files.

I attached my comments here:
https://gitlab.isc.org/BIND/bind9/merge_requests/7

But it needs one more reviewer, as I am a co-author of this.
I've pushed a few minor changes to the README text. BTW, we might want
to consider making this a markdown file.

According to the README, when tests are running in parallel, output is
jumbled together in systests.output. This seems like a bad idea; let's
capture the output of each test separately, then concatenate them as
the tests complete. I'd put the output for every test into
<test-name>/test-output, and then have stop.sh do
"cat <test-name>/test-output >> systests.output". (There's still a
slight chance two tests would be writing into systests.output at the
same time, but it would be much reduced.) The top-level clean.sh can
remove test-output; that way it won't be necessary for each per-test
clean.sh file to do so explicitly.

A minor feature request: I've often wished there were a way to run a
subset of the tests with a single command, like "sh run.sh dnssec
autosign inline", but never got around to implementing it. While we've
got the hood open, can we do that? If we support "make -j3 test-dnssec
test-autosign test-inline", we can even take advantage of
parallelization (see the sketch below).

There's nothing in the README about the SEQUENTIALDIRS variable in
conf.sh.in. IMHO that should be included, in case someone needs to
write a test that can't be parallelized for some reason.
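[One way those subset targets could look, as a sketch that assumes GNU
make for the pattern rule; the test-% naming follows the suggestion
above, and port handling is omitted:

    # Hypothetical pattern rule: "make -j3 test-dnssec test-autosign
    # test-inline" runs exactly those tests, up to three at a time.
    # The recipe line starts with a tab; $* is the pattern stem.
    test-%:
	sh ./run.sh $*
]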
I reviewed the branch on GitLab. Lots of nits, but also some doubts about the design. Not sure whether to give this ticket to Ondřej or Stephen; since Ondřej was the one who gave it to me, I am passing it back to him.
> I've pushed a few minor changes to the README text. BTW, we might
> want to consider making this a markdown file.

I've created a new issue (#39) in gitlab for this.

> According to the README, when tests are running in parallel, output
> is jumbled together in systests.output. This seems like a bad idea;
> let's capture the output of each test separately, then concatenate
> them as the tests complete. I'd put the output for every test into
> <test-name>/test-output, and then have stop.sh do
> "cat <test-name>/test-output >> systests.output". (There's still a
> slight chance two tests would be writing into systests.output at the
> same time, but it would be much reduced.) The top-level clean.sh can
> remove test-output; that way it won't be necessary for each per-test
> clean.sh file to do so explicitly.

I originally did it this way as output to the screen is jumbled, and
the existing code just tees that into a file. But ordered output is
required for automated testing to be able to identify failing tests, so
something will have to be done. The idea of writing into a test-output
file is a good one, although creating the systests.output file once all
tests are complete will avoid any problem with tests writing to the
file at the same time.

> A minor feature request: I've often wished there were a way to run a
> subset of the tests with a single command, like "sh run.sh dnssec
> autosign inline", but never got around to implementing it. While
> we've got the hood open, can we do that? If we support "make -j3
> test-dnssec test-autosign test-inline", we can even take advantage of
> parallelization.

Created gitlab issue #41 for this.

> There's nothing in the README about the SEQUENTIALDIRS variable in
> conf.sh.in. IMHO that should be included, in case someone needs to
> write a test that can't be parallelized for some reason.

Will do. Originally I hadn't documented it, as I intended it to be
temporary until all the tests had been converted to parallel running.
But the idea that a future test might not be able to be parallelized is
a valid one, so I will document it.
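[A sketch of that final assembly step, assuming each test's log ends up
in <test-name>/test-output and with $SUBDIRS standing in for the
ordered list of test directories from conf.sh:

    # Sketch: build an ordered systests.output only after every test
    # has finished, so no two writers touch the file concurrently.
    rm -f systests.output
    for test in $SUBDIRS; do
        if [ -f "$test/test-output" ]; then
            cat "$test/test-output" >> systests.output
        fi
    done
]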
This change has been merged to all branches, gitlab merge requests 7 and 59.