wds15
Contributor

@wds15 wds15 commented Apr 22, 2019

This reverts commit 07209fb, reversing changes made to 7dfc00c.

This PR reapplies the changes of the PR faster TLS v4 which was reverted from develop after test failures showed up on Windows under the cmd shell.

The test failures of develop seem to be caused by a bug in the test/unit/math/rev/arr/functor/coupled_ode_system_test.cpp which should be resolved after merging the branch https://github.com/stan-dev/math/tree/bugfix/issue-1210-coupled_ode_system . The final confirmation is pending as this bug in the test only showed up when this PR was combined with the original code and when run on Windows on the cmd shell.

The content of this PR was approved as is in its previous v4 version. In this (hopefully last) iteration the reviewer may consider these points:

  1. Agreement on proceeding with issue #1210 (Problem with coupled_ode_system). Done, it's merged.
  2. We could consider reverting commit 236b6d9, which changed ChainableStack::instance() into ChainableStack::instance_->. I made that change because I anticipated compiler problems on Windows and hoped to avoid them, but as it turns out there are none, and the additional abstraction we had with the instance() method is a good one. On the other hand, we can keep things simple and merge this PR as is, which avoids having to validate it again. We can revert this change later should we think it is needed.
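
To make point 2 concrete, here is a minimal sketch of the two access styles. It is a toy model and not Stan's code: DemoTape, DemoStack and push_both_ways are hypothetical names, and the only thing taken from the discussion is the idea of a static accessor instance() wrapping a raw instance_ pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for an AD tape (assumption: NOT Stan's real data structure).
struct DemoTape {
  std::vector<double> values;
};

// Hypothetical holder exposing both access styles discussed in point 2.
struct DemoStack {
  static thread_local DemoTape* instance_;
  // Accessor style: hides the raw pointer behind a function call.
  static DemoTape& instance() { return *instance_; }
};

thread_local DemoTape* DemoStack::instance_ = nullptr;

// Pushes one value through each access style and returns the tape size.
std::size_t push_both_ways() {
  DemoTape tape;
  DemoStack::instance_ = &tape;
  DemoStack::instance().values.push_back(1.0);  // accessor style
  DemoStack::instance_->values.push_back(2.0);  // direct pointer style
  const std::size_t n = tape.values.size();
  DemoStack::instance_ = nullptr;  // do not leave a dangling pointer behind
  return n;
}
```

Both calls reach the same object; the accessor only hides the pointer, which is the abstraction that reverting 236b6d9 would bring back.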

Summary

Please refer to the linked v4 PR, #1171, for details.

As I am not a git guru, here is what I did to get to this PR git wise:

git checkout origin/develop
git checkout -b feature/faster-ad-tls-v5
git revert 07209fb48faf61221af4079a467b3f9eada90b12 -m 1

Tests

Side Effects

Checklist

wds15 added 2 commits April 22, 2019 09:57
…ster-ad-tls-v4"

This reverts commit 07209fb, reversing
changes made to 7dfc00c.
…ster-ad-tls-v4"

This reverts commit 07209fb, reversing
changes made to 7dfc00c.
@syclik syclik force-pushed the feature/faster-ad-tls-v5 branch from 296e23c to f76ecac Compare May 3, 2019 02:44
@wds15
Contributor Author

wds15 commented May 3, 2019

@syclik What do you think about point 2 above?

I am good with not reverting the commit 236b6d9, as we would then merge exactly what was approved before. Reverting that one can still be done whenever we touch the AD system again later (and I plan to overhaul it a bit for the parallelisation which is coming).

... or we revert that commit, but then we would have to rerun a few performance benchmarks, I guess, as a sanity check. This is why I am somewhat in favor of leaving that commit in for now... we will benchmark the AD stack thoroughly anyway when we add parallelisation.

@wds15 wds15 changed the title from "WIP Faster TLS v5" to "Faster TLS v5" May 5, 2019
@wds15
Contributor Author

wds15 commented May 8, 2019

It would be really good to get this merged as is, given it was already approved. I just had to fight another merge conflict, which is an unnecessary resource drain.

@syclik
Member

syclik commented May 8, 2019

Sorry, missed the question. It would really be better to revert that commit (point 2).

That adds another level of complication to this PR; this changes the API when it seems like it's unnecessary. (At least based on what you've said.)

Could you revert it? I'll approve it within a day of that happening, and we'll merge when tests pass.

Member

@syclik syclik left a comment

Looks good! Thanks for cleaning up the PR so much! It made it a lot easier to review.

@@ -46,6 +46,7 @@ TEST(AgradAutoDiff, gradient_threaded) {
EXPECT_FLOAT_EQ(x_ref(0) * x_ref(0) + 3 * 2 * x_ref(1), grad_fx_ref(1));

auto thread_job = [&](double x1, double x2) {
stan::math::ChainableStack thread_instance;
Member

Is this required now when using threads?

Contributor Author

Yes, you need an instance of ChainableStack in every thread. Otherwise the AD tape is not initialized. You can have more than one of those instances, but only the first one created is relevant.
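
The behaviour described in this reply can be sketched with a self-contained toy model. This is an illustration under stated assumptions, not Stan's implementation: ToyTape, ToyStack and tape_needs_per_thread_holder are hypothetical names, and the ownership rule (the first holder created in a thread allocates the tape, later holders do nothing) is inferred from the sentence above.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Toy stand-in for an AD tape (assumption: NOT Stan's real ChainableStack).
struct ToyTape {
  std::vector<double> values;
};

// Hypothetical holder: every thread has its own raw thread_local pointer,
// which stays null until the first holder is constructed in that thread.
class ToyStack {
 public:
  ToyStack() : owns_(instance_ == nullptr) {
    if (owns_) instance_ = new ToyTape();
  }
  ~ToyStack() {
    if (owns_) {
      delete instance_;
      instance_ = nullptr;
    }
  }
  ToyStack(const ToyStack&) = delete;
  ToyStack& operator=(const ToyStack&) = delete;

  static thread_local ToyTape* instance_;

 private:
  bool owns_;  // true only for the first holder created in this thread
};

thread_local ToyTape* ToyStack::instance_ = nullptr;

// Returns true if a fresh thread sees no tape before creating a holder
// and a usable tape afterwards.
bool tape_needs_per_thread_holder() {
  bool null_before = false;
  bool usable_after = false;
  std::thread worker([&] {
    null_before = (ToyStack::instance_ == nullptr);
    ToyStack first;   // initializes this thread's tape
    ToyStack second;  // harmless: the first holder remains the owner
    ToyStack::instance_->values.push_back(1.0);
    usable_after = (ToyStack::instance_->values.size() == 1);
  });
  worker.join();
  return null_before && usable_after;
}
```

In this sketch a raw thread_local pointer needs no lazy-initialization check on each access, at the price that every thread must construct a holder before touching the tape, which is what the added line in the test does.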

@@ -0,0 +1,102 @@
#ifndef STAN_MATH_REV_MAT_FUNCTOR_MAP_RECT_CONCURRENT_HPP
Member

Quick question: why was this logic moved? I didn't see it mentioned in the PR description and it's easier to spot it now in the diff. (Just wondering why it's done this way, I'm not trying to imply that there's anything wrong with this.)

Contributor Author

It was moved for dependency reasons. The threaded map_rect now depends on the include

#include <stan/math/rev/core/chainablestack.hpp>

as we are sticking the chainablestack instance into the lambda doing all the work. This requires us to move essentially the prim version into rev. This is not ideal, but once we get the TBB we can handle the instantiation of chainablestack better (for now we have to make sure via the lambda that an instance is available).

@wds15
Contributor Author

wds15 commented May 9, 2019

@seantalts & @serban-nicusor-toptal could you please have a look at the (upstream) Jenkins failure on this one? The error looks like garbage to me. Thanks.

@serban-nicusor-toptal
Contributor

@wds15
Jenkins should be fine now, restarted the job.
Let me know if anything else happens.

Thanks!

@wds15
Contributor Author

wds15 commented May 9, 2019

I am running now into this for upstream tests... looks like Jenkins or something-else-out-of-my-control thing?

g++ -Werror  -std=c++1y -m64 -Wall -Wno-unused-function -Wno-uninitialized -Wno-unused-but-set-variable -Wno-unused-variable -Wno-sign-compare -Wno-unused-local-typedefs     -O3 -I src -I lib/stan_math/ -I lib/stan_math/lib/eigen_3.3.3 -I lib/stan_math/lib/boost_1.69.0 -I lib/stan_math/lib/sundials_4.1.0/include  -D_USE_MATH_DEFINES  -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION     -c -O0 -include src/stan/optimization/bfgs_update.hpp test/dummy.cpp -o nul
After 20s process did not stop
jenkins.util.io.CompositeIOException: Unable to delete 'C:\Jenkins2\workspace\Stan_downstream_tests@tmp\durable-90a0035a'. Tried 3 times (of a maximum of 3) 

@serban-nicusor-toptal
Contributor

I see a build in progress: http://d1m1s1b1.stat.columbia.edu:8080/job/Math%20Pipeline/view/change-requests/job/PR-1212/13/

I will let that run and if it fails I will look into why that happened.

@wds15
Contributor Author

wds15 commented May 10, 2019

Ok... now the downstream cmdstan build 599 fails with:


Started by upstream project "Stan/downstream_tests" build number 525
originally caused by:
 Started by upstream project "Math Pipeline/PR-1212" build number 13
 originally caused by:
  Started by user wds15
  java.lang.NullPointerException
	at org.jenkinsci.plugins.workflow.cps.replay.ReplayCause.getOriginal(ReplayCause.java:66)
	at org.jenkinsci.plugins.workflow.cps.replay.ReplayCause.print(ReplayCause.java:74)
	at hudson.model.Cause$UpstreamCause.print(Cause.java:322)
	at hudson.model.Cause$UpstreamCause.print(Cause.java:319)
	at hudson.model.Cause$UpstreamCause.print(Cause.java:298)
	at hudson.model.BuildListener.started(BuildListener.java:49)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:273)
	at hudson.model.ResourceController.execute(ResourceController.java:97)
	at hudson.model.Executor.run(Executor.java:429)
Finished: FAILURE

These Jenkins failures are very frustrating... so help is very much appreciated.

@serban-nicusor-toptal
Contributor

Good morning, looking into it.

@wds15
Contributor Author

wds15 commented May 10, 2019

Thanks for restarting... one more hiccup:


ar -rs lib/sundials_4.1.0/lib/libsundials_nvecserial.a lib/sundials_4.1.0/src/nvector/serial/nvector_serial.o lib/sundials_4.1.0/src/sundials/sundials_math.o
ar -rs lib/sundials_4.1.0/lib/libsundials_cvodes.a lib/sundials_4.1.0/src/cvodes/cvodes_nls_stg1.o lib/sundials_4.1.0/src/cvodes/cvodes_nls_sim.o lib/sundials_4.1.0/src/cvodes/cvodes_ls.o lib/sundials_4.1.0/src/cvodes/cvodea_io.o lib/sundials_4.1.0/src/cvodes/cvodes_spils.o lib/sundials_4.1.0/src/cvodes/cvodes.o lib/sundials_4.1.0/src/cvodes/cvodes_bandpre.o lib/sundials_4.1.0/src/cvodes/cvodes_bbdpre.o lib/sundials_4.1.0/src/cvodes/cvodes_nls_stg.o lib/sundials_4.1.0/src/cvodes/cvodea.o lib/sundials_4.1.0/src/cvodes/cvodes_io.o lib/sundials_4.1.0/src/cvodes/cvodes_direct.o lib/sundials_4.1.0/src/cvodes/cvodes_diag.o lib/sundials_4.1.0/src/cvodes/cvodes_nls.o lib/sundials_4.1.0/src/sundials/sundials_sparse.o lib/sundials_4.1.0/src/sundials/sundials_dense.o lib/sundials_4.1.0/src/sundials/sundials_nvector_senswrapper.o lib/sundials_4.1.0/src/sundials/sundials_nvector.o lib/sundials_4.1.0/src/sundials/sundials_pcg.o lib/sundials_4.1.0/src/sundials/sundials_math.o lib/sundials_4.1.0/src/sundials/sundials_sptfqmr.o lib/sundials_4.1.0/src/sundials/sundials_mpi.o lib/sundials_4.1.0/src/sundials/sundials_linearsolver.o lib/sundials_4.1.0/src/sundials/sundials_iterative.o lib/sundials_4.1.0/src/sundials/sundials_spbcgs.o lib/sundials_4.1.0/src/sundials/sundials_band.o lib/sundials_4.1.0/src/sundials/sundials_version.o lib/sundials_4.1.0/src/sundials/sundials_nonlinearsolver.o lib/sundials_4.1.0/src/sundials/sundials_direct.o lib/sundials_4.1.0/src/sundials/sundials_matrix.o lib/sundials_4.1.0/src/sunmatrix/band/sunmatrix_band.o lib/sundials_4.1.0/src/sunmatrix/dense/sunmatrix_dense.o lib/sundials_4.1.0/src/sunlinsol/band/sunlinsol_band.o lib/sundials_4.1.0/src/sunlinsol/dense/sunlinsol_dense.o lib/sundials_4.1.0/src/sunnonlinsol/newton/sunnonlinsol_newton.o lib/sundials_4.1.0/src/sunnonlinsol/fixedpoint/sunnonlinsol_fixedpoint.o
ar -rs lib/sundials_4.1.0/lib/libsundials_idas.a lib/sundials_4.1.0/src/idas/idas_nls.o lib/sundials_4.1.0/src/idas/idas_nls_sim.o lib/sundials_4.1.0/src/idas/idas_ls.o lib/sundials_4.1.0/src/idas/idas.o lib/sundials_4.1.0/src/idas/idas_bbdpre.o lib/sundials_4.1.0/src/idas/idas_nls_stg.o lib/sundials_4.1.0/src/idas/idaa_io.o lib/sundials_4.1.0/src/idas/idas_direct.o lib/sundials_4.1.0/src/idas/idas_io.o lib/sundials_4.1.0/src/idas/idaa.o lib/sundials_4.1.0/src/idas/idas_ic.o lib/sundials_4.1.0/src/idas/idas_spils.o lib/sundials_4.1.0/src/sundials/sundials_sparse.o lib/sundials_4.1.0/src/sundials/sundials_dense.o lib/sundials_4.1.0/src/sundials/sundials_nvector_senswrapper.o lib/sundials_4.1.0/src/sundials/sundials_nvector.o lib/sundials_4.1.0/src/sundials/sundials_pcg.o lib/sundials_4.1.0/src/sundials/sundials_math.o lib/sundials_4.1.0/src/sundials/sundials_sptfqmr.o lib/sundials_4.1.0/src/sundials/sundials_mpi.o lib/sundials_4.1.0/src/sundials/sundials_linearsolver.o lib/sundials_4.1.0/src/sundials/sundials_iterative.o lib/sundials_4.1.0/src/sundials/sundials_spbcgs.o lib/sundials_4.1.0/src/sundials/sundials_band.o lib/sundials_4.1.0/src/sundials/sundials_version.o lib/sundials_4.1.0/src/sundials/sundials_nonlinearsolver.o lib/sundials_4.1.0/src/sundials/sundials_direct.o lib/sundials_4.1.0/src/sundials/sundials_matrix.o lib/sundials_4.1.0/src/sunmatrix/band/sunmatrix_band.o lib/sundials_4.1.0/src/sunmatrix/dense/sunmatrix_dense.o lib/sundials_4.1.0/src/sunlinsol/band/sunlinsol_band.o lib/sundials_4.1.0/src/sunlinsol/dense/sunlinsol_dense.o lib/sundials_4.1.0/src/sunnonlinsol/newton/sunnonlinsol_newton.o lib/sundials_4.1.0/src/sunnonlinsol/fixedpoint/sunnonlinsol_fixedpoint.o
C:\Rtools\mingw_64\bin\ar.exe: creating lib/sundials_4.1.0/lib/libsundials_idas.a
C:\Rtools\mingw_64\bin\ar.exe: unable to rename 'lib/sundials_4.1.0/lib/libsundials_idas.a'; reason: Permission denied
C:\Rtools\mingw_64\bin\ar.exe: creating lib/sundials_4.1.0/lib/libsundials_cvodes.a
make: *** [make/libraries:55: lib/sundials_4.1.0/lib/libsundials_idas.a] Error 1
make: *** Waiting for unfinished jobs....
C:\Rtools\mingw_64\bin\ar.exe: creating lib/sundials_4.1.0/lib/libsundials_nvecserial.a

One more restart?

@wds15
Contributor Author

wds15 commented May 10, 2019

I restarted once more. @seantalts I hope there are plans and ideas to stabilize Jenkins... this PR is now on its way to its third run through the Jenkins pipeline. The Jenkins config appears fragile and Windows is not stable. If possible, we could consider automatic restarts for those OS-related failures. Anything that makes this more stable would be great. It's work to follow this along and babysit.

@serban-nicusor-toptal
Contributor

It's pretty strange that this happens, it never did before and there were no changes to how this part worked.
I will look into the windows Jenkinsfile configuration to see if I can force permissions.
Please keep me updated @wds15

Thank you!

@wds15
Contributor Author

wds15 commented May 10, 2019

Sure, will keep you posted.

Windows problems I have seen in the past: virus scanner, no disk space, permissions, ...

@wds15
Contributor Author

wds15 commented May 10, 2019

What is going on now? Jenkins claims that the performance tests failed downstream, but when I go there everything is green. I am not sure where I can look for an error. If in doubt I am going to restart a 4th time as there were changes in between to the Jenkins config.

@serban-nicusor-toptal
Contributor

Let's take a look.

@serban-nicusor-toptal
Contributor

That happened because the performance was more than 15% worse than the last run.
I've pushed a PR into Jenkins now to only run these tests on master.
Will Abort and Retry this Job.

@wds15
Contributor Author

wds15 commented May 10, 2019

Thanks for diving into this.

(This changeset was merged 2x to develop and reverted 2x...looks like now Jenkins really wants to prevent this presumably possibly "final" merge)

@wds15
Contributor Author

wds15 commented May 10, 2019

Ahh... that makes sense. This is because the SIR example on develop is now a lot faster and that changeset hasn't been merged into this PR.

Where would I have seen this? Is there some doc on where I can see this myself? I mean to me things just look odd.

@serban-nicusor-toptal
Contributor

serban-nicusor-toptal commented May 10, 2019

It will now complete successfully!
I'm really, really sorry for the trouble caused.

As always please keep me updated anytime Jenkins goes nuts.

Thanks! Have a great day !

@serban-nicusor-toptal
Contributor

See my latest commit: stan-dev/performance-tests-cmdstan@00f25e9

@wds15
Contributor Author

wds15 commented May 10, 2019

I am happy to ping you, but I also don't intend to waste your time on these things... so if there is doc about things I should know then just let me know.

@serban-nicusor-toptal
Contributor

You're not wasting my time.
Jenkins is my thing, so feel free to ping me for anything related to it!

Sadly there isn't a doc about this but I've assigned myself an Issue for this so I'll work on it. It'll take a bit since I have to understand the whole picture too.

Expect to start finding stuff in here by the next week or so: https://github.com/stan-dev/jenkins-config/wiki


perfReport compareBuildPrevious: true, 

                    relativeFailedThresholdPositive: 15,
                    relativeUnstableThresholdPositive: 10,

                    errorFailedThreshold: 1, 
                    failBuildIfNoResultFile: false, 
                    modePerformancePerTestCase: true, 
                    modeOfThreshold: true,
                    sourceDataFiles: '*.xml', 
                    modeThroughput: false,
                    configType: 'PRT'

This is the part that triggered the FAILURE.
More exactly: relativeFailedThresholdPositive: 15

In other words, if a test performs 15% worse, the job automatically fails and, at a later stage, sends a notification by email.
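
As a rough illustration of how such relative thresholds behave, here is a small sketch. It is not the Jenkins performance plugin's implementation: the function names, the use of a strict "greater than" comparison, and the interpretation of the two thresholds as percent slowdown relative to the previous run are all assumptions made for the example.

```cpp
#include <cassert>

enum class BuildStatus { Success, Unstable, Failure };

// Relative change in percent; positive means the current run is slower.
double relative_change_percent(double previous, double current) {
  return (current - previous) / previous * 100.0;
}

// Hypothetical classifier mirroring relativeUnstableThresholdPositive (10)
// and relativeFailedThresholdPositive (15) from the config above.
BuildStatus classify(double previous, double current,
                     double unstable_threshold = 10.0,
                     double failed_threshold = 15.0) {
  const double change = relative_change_percent(previous, current);
  if (change > failed_threshold) return BuildStatus::Failure;
  if (change > unstable_threshold) return BuildStatus::Unstable;
  return BuildStatus::Success;
}
```

Under these assumptions, a previous run of 80 s followed by a run of 100 s is a 25% slowdown and is classified as a failure. That is the situation described earlier in the thread: when the baseline on develop gets faster, a PR branch that has not merged that speedup looks like a regression even though it changed nothing.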

@wds15
Contributor Author

wds15 commented May 10, 2019

Ok... right now it is not apparent to me where I can see this 15% being crossed. I would also like to see which of the tests failed. I usually click on the red stuff to find out, and that did not work this time. It would be good if the doc contained pointers to where I can look to help myself.

Apart from that we should consider testing not the pr branch itself, but the tests should be on the pr branch merged to develop (this is anyway what we care about).

@serban-nicusor-toptal
Contributor

serban-nicusor-toptal commented May 10, 2019

You can go here: http://d1m1s1b1.stat.columbia.edu:8080/job/CmdStan%20Performance%20Tests/job/downstream_tests/24/console

Then scroll all the way down. You should see the test results that caused the Math Job to FAIL.
How I got there? Followed the downstream.
Math -> Stan -> CmdStan -> Performance Tests

I think the test logic was fine, the reporting tool was the issue here.
It should only report things as FAILURE on master branch, but that's now fixed.

We start in the Math console; to find the Stan downstream job, just do a CTRL + F for "Starting building".
And so on ... downstream.

@wds15
Contributor Author

wds15 commented May 11, 2019

@serban-nicusor-toptal Looks like one more hiccup:

--- CmdStan v2.19.1 built ---
+ ./runCmdStanTests.py -j25 src/test/interface
bad value for -j flag
exit now (05/11/19 02:45:11 UTC)

This happens in the cmdstan downstream tests. Thanks for looking into this.

@serban-nicusor-toptal
Contributor

Good morning.
I've looked into it and I don't know how that happened.
If we look into a successful CmdStan build that step looks like:

--- CmdStan v2.19.1 built ---
+ ./runCmdStanTests.py src/test/interface

Which in our case is:

--- CmdStan v2.19.1 built ---
+ ./runCmdStanTests.py -j25 src/test/interface
bad value for -j flag
exit now (05/11/19 02:45:11 UTC)

And the script is called by the following:

                    steps {
                        setupCXX("${MPICXX}")
                        sh "echo STAN_MPI=true >> make/local"
                        sh "make build-mpi > build-mpi.log 2>&1"
                        sh runTests("./")
                    }

def runTests(String prefix = "") {
    """ make -j${env.PARALLEL} build
  ${prefix}runCmdStanTests.py src/test/interface
    """
}

So I have no idea how the -j25 got into the runCmdStanTests.py call.

Investigating more

@wds15
Contributor Author

wds15 commented May 11, 2019

Wow... after 5 Jenkins runs!

@wds15 wds15 merged commit 5e76d67 into develop May 11, 2019
@seantalts
Member

seantalts commented May 11, 2019 via email

seantalts added a commit that referenced this pull request May 16, 2019
rok-cesnovar added a commit that referenced this pull request May 18, 2019
Revert "Merge pull request #1212 from stan-dev/feature/faster-ad-tls-v5"
rok-cesnovar added a commit that referenced this pull request May 18, 2019
@wds15 wds15 deleted the feature/faster-ad-tls-v5 branch June 30, 2019 17:57
@serban-nicusor-toptal serban-nicusor-toptal modified the milestone: 2.19.2 Jul 18, 2019
6 participants