Message boards : Number crunching : Does Rosie create new jobs if no 'net connection available?
Author | Message |
---|---|
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,632,667 RAC: 9,103 |
I'm surprised I've not read this anywhere yet, but it's not been an issue for me until now: Does Rosie make up new random seeds to run if there's no connection to the internet available? cheers Danny |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
I'm surprised I've not read this anywhere yet, but it's not been an issue for me until now: Yes. The random seed comes from your machine, not the server. Moderator9 ROSETTA@home FAQ Moderator Contact |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,632,667 RAC: 9,103 |
I'm surprised I've not read this anywhere yet, but it's not been an issue for me until now: That's good cos I've got a machine running with no net connection that I can't get to till after this bank holiday weekend, and it only had about 22hrs of work queued! cheers |
Robinski Send message Joined: 7 Mar 06 Posts: 51 Credit: 85,383 RAC: 0 |
I'm surprised I've not read this anywhere yet, but it's not been an issue for me until now: Great to hear, as one of my machines got disconnected somehow and didn't have a large queue. Member of the Dutch Power Cows Trying to get the world on IPv6, do you have it? check here: IPv6.RHarmsen.nl |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
Rosetta does not make new jobs on your machine. If you run out of work and can't connect to the server, then your machine will go idle. |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Ya, the seed is generated at the start of each model run. But the number of models done is based on your project preferences for the length of time you prefer to crunch. I see in my C:\Program Files\BOINC\slots\1\init_data.xml (one of the slots presently running Rosetta) that it has the value <cpu_run_time>86400</cpu_run_time> (my machine is set to crunch each WU for 24 hrs). I'm thinking you could change an XML file somewhere (I think the one in the slots folder is a COPY of the one you need to change) and bump up your time to crunch on your remaining WUs. Normally you'd do this via the Rosetta preferences and update to the project. But, with no network connection, you can't update to the project. And the update to the project actually affects your existing WUs! (Can be interesting if you go suddenly from 2hr to 24hr WUs :) I'm not sure WHERE to tell you to change the value, but I'm confident it can be done. Hoping someone else can post some specific details. [edit] oh, HERE it is! Untested, but I think if you shut down BOINC, edit this file: C:\Program Files\BOINC\account_boinc.bakerlab.org_rosetta.xml, bump up your <cpu_run_time>86400</cpu_run_time>, then restart, you'll make the most of your existing WUs. I'm not positive what the project has the WU max time set to presently. I think 24hrs, so 86400 may be as high as you can go right now. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
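A rough sketch of the edit Feet1st describes. The helper name `bump_cpu_run_time` is mine, not part of BOINC, and this is untested against a real BOINC install; as the post says, shut BOINC down before touching the file.

```python
import re

def bump_cpu_run_time(xml_text, seconds):
    """Rewrite the value inside the <cpu_run_time> element of a
    BOINC account file's text (hypothetical helper, not BOINC code)."""
    return re.sub(r"<cpu_run_time>\d+</cpu_run_time>",
                  f"<cpu_run_time>{int(seconds)}</cpu_run_time>",
                  xml_text)

# Toy account-file fragment: bump a 4-hour setting to 24 hours (86400 s)
sample = "<account><cpu_run_time>14400</cpu_run_time></account>"
updated = bump_cpu_run_time(sample, 86400)
```

In practice you would read the whole account_boinc.bakerlab.org_rosetta.xml file, apply the substitution, and write it back while BOINC is stopped.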
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,632,667 RAC: 9,103 |
ah It'd be a good idea to implement this automatically in Rosetta then (or does BOINC control this?). It doesn't seem difficult to do in theory, but I know that doesn't always match the practice... cheers Danny |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Actually the seed number is getting passed as an argument now so it does come from the server with each work unit. |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Actually the seed number is getting passed as an argument now so it does come from the server with each work unit. David would know!! Could you explain further? I mean when I run 50 models... are they all against that same seed then? Or seed through seed+50? (OK, +49! I know :) (Also, please review the suggestion below.) dcdc, I fear you're right; in practice it's harder than it sounds. BOINC wants to sort of protect you from a WU that gets "stuck", so a maximum runtime is ALSO defined for each WU. That was what was behind my recommendation not to try to increase beyond 24hrs. Not that it would hurt anything. The WU just ends when max time is reached, and it reports back what it found up to that point. But ya, I hope someday they can overcome that hurdle. It would allow a dial-up user to just download one or a few proteins, and crunch them against various versions of Rosetta or against various seeds over time, and not have to endure the multi-MB downloads of the WUs. I know, I know, we can set the WU length in the preferences. I'm talking about WUs that are used and reused and reported and studied for MONTHS, not just days. I am ALSO talking about downloading new proteins on YOUR schedule, not BOINC's. It might also let folks with multiple PCs just download a central repository of proteins ONCE and have all their PCs (regardless of CPU type, even OS type) utilize it. Once such a thing is in place, if we put together an installation CD, then it might include say 100 (or 1,000!) proteins on it as well. And you could just crunch against those if you like. Any new proteins the project wants to study would be downloaded to broadband users, or downloaded to dial-up users at times when they schedule a longer connection time. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,632,667 RAC: 9,103 |
wouldn't it just be a case of: if no work left and no net connection available, generate new seed and run loop? |
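dcdc's proposed loop might look like this in sketch form. All the names here are illustrative; this is not actual BOINC or Rosetta code, just the idea of falling back to a locally generated seed when both the work queue and the network are empty:

```python
import random

def crunch(queue, net_available, run_model, max_models=3):
    """dcdc's idea: pop queued seeds; if the queue is empty and there
    is no network, invent a seed locally rather than going idle."""
    results = []
    while len(results) < max_models:
        if queue:
            seed = queue.pop(0)
        elif not net_available():
            seed = random.randrange(2**31)   # locally generated seed
        else:
            break   # network is back: ask the server for real work
        results.append(run_model(seed))
    return results

# Offline host with two queued seeds keeps crunching past its queue:
out = crunch([1, 2], net_available=lambda: False, run_model=lambda s: s * 2)
```

The later posts in this thread explain why the server side makes this harder than it looks.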
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
wouldn't it just be a case of: ...well, if you were out there on your own it would be that simple. But you've got a server trying to manage the work and distribute work that is specific to what they wish to study most. For example, say they sent you protein X a few days ago, and now you've crunched your WU length of protein X and are looking for more work... well, perhaps the findings back at the Baker Lab are telling them that protein X has now been well defined, or is no longer of interest to their work... they now wish to pursue protein Y. So further crunching on protein X is now considered a fruitless path to pursue. Anyway, the server is tracking all of the WUs, deadlines, credits, failure rates, etc. And managing all of this would become vastly more complex if clients were dynamically generating WUs and reporting back results that were never requested by the project. But yes, in theory it works, and could be managed. Hence my suggestion. In the meantime, increase your setting for "connect to network every X days" in your general preferences, update to the project, and you'll retain a larger cache of WUs. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
In the meantime, increase your setting for "connect to network every X days" in your general preferences, update to the project, and you'll retain a larger cache of WUs. Actually, I think that (assuming always-on Internet) the best setting for Rosetta would be to reduce "connect to network every X days" to e.g. 0.1 (as low as practical, that's 0.1 days, i.e. connect every couple of hours), to update jobs as fast as possible and reduce roundtrip times for the project. And if you want, you can increase variable-WU-runtime to 4, 8, 12, 24 hours (I use 8hr myself, as the best tradeoff, so I still send 3 WUs per day per PC) Unlike other projects, keeping a large cache of WUs is really counterproductive for Rosetta@home, which needs fast round-trip times for results (not to mention you may load the boat with buggy WUs and have to cancel them one by one!) Project Scientists: Let me know if I got this all wrong please! Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Actually, I think that (assuming always-on Internet) the best setting for Rosetta would be to reduce "connect to network every X days" to e.g. 0.1 (as low as practical, that's 0.1 days, i.e. connect every couple of hours), to update jobs as fast as possible and reduce roundtrip times for the project. My point was that in the case at hand, there's apparently NOT always-on internet. And so the cache size helps ensure there is always work on hand, even if you have an ISP outage on a day that you would normally connect. I've got mine set to 2.5 days, and, as a result, I was entirely unaffected by last weekend's WU problems, because I already had 2 days of work (which, with 24hr WUs, is still only 2 or 3 WUs... 4-6 on a dual-CPU). So... it can work both ways. Also, if the project has an outage, I can keep crunching, and generally avoid contact during the catch-up phase of restarting the server and the congestion of servicing all of those with no work left. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
As I understand it, the seed is not for the model as such, but for the random number generator. The so-called random number generator is in fact a pseudo-random number generator - the numbers come out in an apparently random sequence, but if you start at the same place every time you get the same sequence - the Groundhog Day effect. To prevent this, each run seeds the random number generator with a different seed before starting, so it starts its sequence at a different point. With maybe a billion numbers before it loops round, this means that if the seed is truly random then the numbers drawn are also random; however, by recording the seed the sequence is repeatable, and if the result bombs out then the team can run exactly the same sequence again by using exactly the same seed. Now then, however many models you run, you only need to seed the random number generator once. After the first model, even though it branches back to the same code, the random number generator is at a different point in its pseudo-random sequence, and therefore gives a different set of "dice throws" this time round, and you get a different model. In David's analogy, you parachute down onto a different part of Planet Rosetta. The chance of hitting the same model as some other run that started from a different seed is roughly 1 in the total number of seeds - small enough to be discounted in practice. If David wants to run a million models of the same code and the same protein, then he'd want a seed set with at least a billion different possible values to avoid duplication. Disclaimer: I do not know the details of David's code, but the above is based on my general understanding of multi-run Monte Carlo models, and on my belief that David's code is such an animal. If I've made any material errors in the above, could one of the project programmers please correct me. |
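River~~'s point - seed once, let each model continue the sequence, and replay a failed run by reseeding with the recorded value - can be demonstrated with Python's standard generator (a stand-in for whatever generator Rosetta actually uses; `run_models` is an illustrative name):

```python
import random

def run_models(seed, n_models, throws_per_model=3):
    """Seed the pseudo-random generator ONCE; each model then draws the
    next chunk of the same sequence, so every model is different."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(throws_per_model)]
            for _ in range(n_models)]

first_run = run_models(12345, 2)
replay = run_models(12345, 2)   # same seed: the Groundhog Day effect
```

The two models inside one run differ (the generator has moved on), yet the whole run is reproducible from its seed.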
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
Given that we technically do this 'continuation' when we increase our task (job) length to 4 days, 1 day, etc., it should not harm anything to have it look at: hey, no connection; hey, no jobs; just continue (i.e. extend the time length). I don't think this alters much on the server other than the correction factor, but then the correction factor will be doing its job, right? As for the 'this form of task is no longer useful' argument... well, Rosetta don't send out kill commands (like CPDN do; AFAIK Rosetta have never sent out a kill command, even after the stuck jobs, though I could easily be wrong, or maybe they need to). So if I have my task length set to 4 days, I would be doing pointless work if they didn't kill my task...? Team mauisun.org |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Thanks River, yes, I misspoke. I do understand how random number generators work, and now we've clarified that the initial seed is sent from the server. From that an actual starting point is derived and a model is run. Am I correct that each "model" is just a different starting point? The other choice would be that each model represents a different set of values/parameters/heuristics in the algorithm used. I'm just still not clear why we aren't grinding through ALL of the possible starting points to completely test a given algorithm. Seems that the randomness might skew the results and miss how perfect the new adjustments to the algorithm CAN be if you hit the right start point. This would then be an easy way to get shorter WUs too. You assign starting numbers in increments of, say, 500 for people with long WU runtimes. So you assign 1-500 to one host, 501-1000 to the next, and so on. Then they report back their results; say the first host crunched 375 models, now there are 125 models left to reassign, maybe to the same host. Send a WU that specifically says to start at 376 and run no more than 125 models. If that host then crunches another 100 models, you assign a WU to crunch start points 476-500, or perhaps you just crunch those on the project's local grid. Now you've got "complete" coverage. I'm still unclear exactly what defines the starting points, or how many possible starts there are. Perhaps they themselves already have gaps and are only a subset of the possible configurations? Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
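Feet1st's range-assignment scheme could be sketched like this. The helper names are hypothetical, and (per the replies below) the real scheduler works differently; this just shows the bookkeeping the post describes:

```python
def assign_ranges(total_models, chunk=500):
    """Split start points 1..total_models into work-unit-sized ranges."""
    ranges, start = [], 1
    while start <= total_models:
        end = min(start + chunk - 1, total_models)
        ranges.append((start, end))
        start = end + 1
    return ranges

def leftover(assigned, models_done):
    """If a host only finished part of its range, return what remains."""
    start, end = assigned
    nxt = start + models_done
    return (nxt, end) if nxt <= end else None

# Feet1st's example: a host assigned 1-500 crunches 375 models,
# leaving start points 376-500 (125 models) to reassign.
remaining = leftover((1, 500), 375)
```
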
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
hi Feet1st, I have tested this in the opposite direction - reducing the run time to stop a WU running yet keep the work that has already been done. I edited the slot file down to 3600 sec, and the account file down to 7200, and restarted BOINC to see what would happen to a WU which by then had a total of 13 hours already run. As soon as BOINC restarted, the WU went to completion (obviously it picked up the latest checkpoint and found it had gone over time). Looking at the result on the database, it picked up the 7200 value, not the 3600 one - so your guess was right, it is the account_boinc... file that needs editing. It is interesting that you can see both the original and the amended target CPU time (in seconds) in the stderr.txt output. River~~ |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
... well Rosetta don't send out kill commands (like CPDN do, afaik anyway Rosetta have never sent out a kill command ... hi FC: the CPDN kill command is only sent when the server receives a trickle (for those not familiar with CPDN, a trickle is a kind of intermediate report sent by a result that is still in progress). BOINC has no facility for the server to contact the client - it is always the client that contacts the server, whether it be an update, an upload, a download, or whatever. Among other things this makes working through a firewall much easier, and it would be a very unpopular move among some users if this policy ever changed. In principle a kill command could be added to a normal update - when the client reports unit X is complete, the server could send a pre-emptive kill for unit Y - but in fact nothing like that has ever been suggested, let alone implemented. As at present only CPDN operate trickles, it must also be the case that only CPDN can kill a job once issued. Even CPDN cannot kill a job that has not yet started - it has to run till the first trickle before they can get at it. It is a damage limitation feature, not a damage prevention one. In fairness, it matters more on their megasecond workunits than it does here - OK, the lost work before the next trickle may be more than one of our whole work units, but the saved work from the rest of their WU far exceeds any problem Rosetta has ever had. Hope that helps - sorry if I told you more than you wanted to know, that's a geek thing ;-) |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
if I am right that this is a fairly normal "Monte Carlo" model, yes. If I have decoded the naming conventions correctly, then the left-hand end of the result name defines a set of results that all have the same values / parameters / heuristics. The final three-digit number in the result name distinguishes results that are identical apart from the seed. The final _0 is a BOINC thing, because there might be another result sent out with absolutely the same details, including the same seed.
Because there is an infinite plane on which we can operate, an exhaustive search would mean setting up a finite grid on top of a continuous terrain. It turns out that working to a grid skews results even more. Monte Carlo methods choose a very fine grid to work on - one so fine that we could not possibly cover all options - and then choose some of the possible options at random. If you want to know more about why this works, and when it doesn't, read up on Monte Carlo programming (e.g. Google, public library, etc). The approach is called Monte Carlo after the well-known European casino town/state - had the idea been first used by an American, I guess we'd call it Las Vegas modelling ;-) River~~
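A classic toy illustration of why random sampling works: estimating pi by throwing seeded random points at a quarter circle. This is not Rosetta's algorithm, just the standard Monte Carlo demonstration - the estimate converges without ever visiting every grid point:

```python
import random

def mc_pi(samples, seed=0):
    """Estimate pi by sampling random points in the unit square and
    counting how many land inside the quarter circle of radius 1."""
    rng = random.Random(seed)   # seeded, so the 'experiment' is repeatable
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / samples

estimate = mc_pi(100_000)
```

Note the seed plays the same role as in the posts above: recording it makes the random experiment exactly repeatable.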
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
Given we technically do this 'continuation' when we increase our task (job) length to 4 days, 1 day, etc... It should not harm in any way to have it look at: hey, no connection, hey, no jobs, just continue (i.e. continue time length) The harm is not obvious, and is not local, but in fact would be fatal to the project. In normal use the app would test the connection, and if working would finish the work, and in due course the client would upload it. A small overhead of one test packet per result; probably no problem. FC's line dies, FC's app tests the connection, no connection, nobody else knows, no problem. Line comes up, app ends at the next complete model, job's a good 'un. Bakerlab's line dies for three days. All the apps in the world overrun, and are typically ending a model every 20 to 40 minutes. Every 20 to 40 minutes about a zillion test packets get sent to Bakerlab. When the Bakerlab server comes up again, it is hit by a home-grown denial of service attack. Project never comes back again. Oh dear. The idea is possible, but you'd also have to build in an exponential back-off, so that after the first one you did maybe two more models, then four more, then eight more, etc. Quite soon you'd be doing more models than the user was happy with, especially if the line to other projects was still working. This is a very well-known danger, and why experienced programmers try hard to avoid live network tests in an auto-repeating program. River~~ PS hey - how come this thread has suddenly become the River tutorial page. I'm stopping here before even I get tired of the sight of my own typing ;-)
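The exponential back-off River~~ describes - doubling the number of models crunched between connection tests so a recovering server isn't flooded - is easy to sketch (illustrative helper, not project code):

```python
def models_before_next_test(attempt, cap=64):
    """After each failed connection test, crunch twice as many models
    before testing again, up to a cap: 1, 2, 4, 8, ..."""
    return min(2 ** attempt, cap)

# How many models an offline client would crunch between tests:
schedule = [models_before_next_test(i) for i in range(8)]
```

The cap reflects River~~'s caveat: without one, the client would soon be crunching far more models between tests than the user wants.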
©2024 University of Washington
https://www.bakerlab.org