Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
[snip] To be clear, I was advocating for having a shorter work 'buffer', i.e. no need to have 10+ days of work buffered and thus drive up the average 'turnaround time' of a given WU (even a short one) to many days because it gets downloaded onto the end of a really long queue. By all means, set a longer WU target run length. What I think is probably not a good idea is a 10-day buffer/cache of 1-hour WUs. Basically, that means high server load coupled with long turnaround time. As someone who deals with crunching data daily at my day job, I appreciate first-hand the frustration of waiting for queries that take many hours to run. I can't imagine being asked to iterate and experiment in an environment where queries take many days or even weeks to complete, which is why I think a small cache is better for the project, simply because it enables faster iterative experimentation. 38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research |
LC Send message Joined: 10 Jun 09 Posts: 8 Credit: 1,895,973 RAC: 0 |
Well that was fast... I changed from "Store at least 10 days" to storing at least 2 days. I changed "Store up to an additional 10 days" to storing up to 5 days. I now have 38 tasks due on 31st October and 66 due on 2nd November. So my buffer is back! Yay. Now it's time for me to understand what the problem was. Because it still makes no sense. So I /was/ asking for 10 days worth of work before, with /potentially/ another 10 days on top of that for a total of 20 days worth of work. According to Juha, Rosetta presently has WUs due in either 2, 5 or 7 day deadlines. So *obviously* under those circumstances, I should not be given 10 days worth of work, much less 20 days worth, because if I were, then anything after 7 days wouldn't finish in time. BUT what the heck? WHY wasn't I being given at least 7 days of work instead? Or 5 days instead? Or 2 days? WHY do I all of a sudden have to know what Rosetta's deadlines are for their WUs? I never knew before and everything worked just fine. This only makes partial sense *if* Rosetta had 10-day deadlines over the past 7 years...until just a few weeks ago. And even if that were true, it still makes no sense to prevent me from having ANY additional days at all when I'm requesting the max. If you can't give me 10 days that's ok, no big deal, but give me 7 days, or 5, or 2! If I'm requesting 10 days and it's blatantly obvious I can crunch dozens of WUs per day, why was I being *entirely* cut off from having a single extra WU? I'm not upset, I'm just confused because it doesn't make sense to have not allowed me to have ANY extra WUs and it makes even less sense that this problem came up suddenly out of nowhere after I had 7 years with the same settings. I think I understand what's happening right now but it doesn't seem to make sense that it's happening suddenly and it certainly doesn't make sense that I was being entirely cut off from having any additional WUs. 
Logically, given my max settings, I should have just been given the slightly lesser amount of buffer WUs which I could in fact handle. So what happens now if Rosetta drops 7-day deadlines? I'll run out of *all* buffer WUs yet again and have the same exact problem instead of being given the 'X' amount of 2-day deadline work which I can handle? It doesn't make sense for that to happen because you're forcing people to know Rosetta's deadlines as they choose to change them. I don't work in the lab...I just want the max they'll allow me in an amount which my computer can handle. As I said in an earlier post, I've never had major problems meeting deadlines because BOINC "learns" the amount of WUs your computer can handle. Don't misread me, I'm not upset over any of this, I just think something needs to be tweaked with the way buffer preferences are handled by Rosetta. And I still don't understand how/why this problem suddenly became an issue out of nowhere a few weeks ago. Time for me to go do some more re-reading & testing. I'll post again after I've learned more. Thanks again! |
Juha Send message Joined: 28 Mar 16 Posts: 13 Credit: 705,034 RAC: 0 |
The deadlines were longer before. I have a copy of client_state.xml from summer and the one Rosetta task contained in it has a 14-day deadline. The deadlines were probably shortened to limit the size of the database and thus limit the load on the database server. Edit: By that I mean: if a host goes missing after being assigned a task, the task waits until the deadline before it is sent to another host. Shorter deadline -> shorter wait time -> tasks get out of the system faster -> smaller and faster database. It also reduces the amount of work people can cache, which is again better for the database. Your host didn't receive less than 10 days of work because BOINC assumed that the next time your host would be online would be 10 days from the moment your host requested work. Since the deadlines for all tasks were shorter than that, your host would not have been able to complete and report the tasks in time. That setting is a tricky one. I think you are not the first one to get bitten by it. |
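
A minimal sketch of the reasoning Juha describes, assuming the "store at least X days" value is treated as the time until the host's next contact with the server. This is an illustration only, not BOINC's actual scheduler code; the function name and parameters are hypothetical.

```python
from typing import List


def deadlines_that_fit(next_contact_days: float,
                       runtime_days: float,
                       deadlines_days: List[float]) -> List[float]:
    """Return the deadlines a host could still meet (hypothetical model).

    Assumption: a result cannot be reported before the host's next
    contact, nor before the task has finished running.
    """
    earliest_report = max(next_contact_days, runtime_days)
    return [d for d in deadlines_days if earliest_report <= d]


one_hour = 1 / 24  # a 1-hour task expressed in days

# "Store at least 10 days" against 2-, 5- and 7-day deadlines: nothing fits.
print(deadlines_that_fit(10, one_hour, [2, 5, 7]))

# With the first setting at 0, every deadline fits.
print(deadlines_that_fit(0, one_hour, [2, 5, 7]))
```

Under this model a 10-day setting rules out every task once the longest deadline drops below 10 days, which matches the all-or-nothing behaviour LC saw, while the old 14-day deadline would still have fit.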
LC Send message Joined: 10 Jun 09 Posts: 8 Credit: 1,895,973 RAC: 0 |
"Your host didn't receive less than 10 days of work because BOINC assumed that the next time your host would be online would be 10 days from the moment your host requested work." But why would Rosetta think my personally-preferred Cache/WU buffer has anything at all to do with my online status? BOINC runs well over 80% of the time for my computer and out of that amount of time BOINC probably has internet connectivity at least 80% of the time if not 90%+. WU cache and internet connectivity percentage /may/ be related in some instances but they are in fact separate things. For someone using *web* preferences, there is a specific and separate setting under "Computing Preferences" which specifically asks how often your computer is connected to the internet and it states it will try to give you that much work to keep you going. In my case of using *local* preferences, there is no such setting. Local computing /preferences/ allow you to request a minimum buffer ("Store at least X days of work.") and a maximum ("Store up to an additional X days of work."). These are merely preferences...if Rosetta feels it needs to give you less because of shorter deadlines then it should give you less...not cut you off entirely. I think you're absolutely right - this setting is tricky. BUT now I think we've found a way to deal with it. If using *local* preferences, the trick seems to be to set the "Store /at least/" setting to a very low number and set the "Store /up to an additional/" setting to whatever higher number you want. I had set the "at least" setting high because I was hoping to keep a 'high minimum' buffer...which is all that setting /should/ be about. But you guys have helped prove there's obviously a second variable at work. I still want to set my minimum amount high but now that I think about it maybe there's no need to. So...it seems we've figured it out and we've figured out how to work around it. Very interesting stuff. 
I hope this thread comes in handy to anyone else who runs into the same issue. I'll update here again when I have a chance to play around with this more. Thanks again to everyone for all your time, help & suggestions. Side note - I understand the server-load effects of all these settings manipulations but it would be great to hear from someone inside Rosetta as to what they prefer. From my POV, if they offer 1-hour WUs, I would think they're happy enough to have me work on a bunch. If however they really prefer longer WUs, I would be very willing to increase my setting to accommodate. To be honest, I'll probably change it anyway just for the sake of being different. By the way, I now have about 80 WUs due the 31st October and about 200 due the 2nd November. I haven't changed anything since my last post. Thanks again everyone, I'll update again soon. LC |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
@LC Rosetta@home does not do such thinking. It is the BOINC Manager that does all of the scheduling and assignments of work and work requests. The first setting used to be worded something more like "connect every ... days". The second setting is your personally-preferred cache/WU buffer. You started out saying your machine has limited access to internet, but it looks like its average turnaround time on WUs is just a few hours. So, that would imply that you have access to the internet almost all the time. If you do have access full-time, then your new settings sound much more appropriate. As for the earlier discussion about "holding up the science", the project does not send strictly identical tasks to various hosts. It sends numerous tasks that pertain to the same protein and method of trying to solve it, but they all use different starting configurations. In the end, some very small percentage of the tasks sent will report stellar results. And even if your stellar task solution is never returned, some other task will likely come up with just about the same stellar result because of the numbers being sent. That is part of the challenge to the Project Team as well, to figure out how to make the program only come up with stellar results. That could reduce the number of tasks that need to be crunched to find it by about 90%. And because that is not where the science is yet, there are dozens of researchers worldwide working to further improve the searching algorithms used. By having such large sample sizes, and the crunching power of BOINC, they can rather immediately see if a new approach is indeed producing more stellar results or not. Rosetta Moderator: Mod.Sense |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2122 Credit: 41,194,697 RAC: 9,774 |
"Don't misread me, I'm not upset over any of this, I just think something needs to be tweaked with the way buffer preferences are handled by Rosetta." Buffer preferences are handled by BOINC, not Rosetta. What's changed with Rosetta is the deadlines. "The deadlines were longer before. I have a copy of client_state.xml from summer and the one Rosetta task contained in it has a 14-day deadline." My memory's bad, but that was my impression too. 14 days is more than the minimum 10 days setting, result happiness. "The deadlines were probably shortened to limit the size of the database and thus limit the load on the database server." Limiting the load on the database server was a big issue with the recent problems, yes, though the reason for shortening deadlines was a separate thing. In truth, I think it's my fault... In previous CASP runs, tasks that needed a quick turnaround were given 14-day deadlines like everything else, and I complained that there was no way users could identify and micro-manage which were priorities and which weren't. My suggestion was that this should be set in the deadlines by the project itself and BOINC could manage things from there on. I think this point was accepted, hence the 2/5/7-day deadlines. I think 14 days was reduced to 10 days first of all, but now 7 days seems to be the default max. Maybe that part is due to the server issues - just guessing. "But why would Rosetta think my personally-preferred Cache/WU buffer has anything at all to do with my online status? BOINC runs well over 80% of the time for my computer and out of that amount of time BOINC probably has internet connectivity at least 80% of the time if not 90%+." I actually agree with you here. I didn't know that first setting did what Juha says it does (but Mod.Sense reminds me that it did used to be called 'connect every x days' so maybe I just forgot). Upshot is, I agree you should set this at zero and use the next field to determine how many days worth of tasks you'd like. 
I'm pretty sure the second field meets your expectation - that is, if you ask for 100 days but it can only give you 6 days, you get as many as it can offer at the time and it'll keep trying to fill your wishlist as more tasks become available. "Side note - I understand the server-load effects of all these settings manipulations but it would be great to hear from someone inside Rosetta as to what they prefer. From my POV, if they offer 1-hour WUs, I would think they're happy enough to have me work on a bunch. If however they really prefer longer WUs, I would be very willing to increase my setting to accommodate. To be honest, I'll probably change it anyway just for the sake of being different." I kind of agree in that the option of 1hr should be forcibly removed, especially since the recent site problems, and that everyone on 1hr should be bumped up to a new minimum, whatever that's decided to be. The default runtime used to be 3hrs and there was a very long thread about bumping it up to 4hrs, with many resisting. With the recent server problems the default was pushed up to 6hrs and then 8hrs without telling anyone (a bit naughty). We can safely say Rosetta prefers you to run 8hrs, and the longer the runtime you're comfortable with between 1 and 8hrs, the better. "By the way, I now have about 80 WUs due the 31st October and about 200 due the 2nd November. I haven't changed anything since my last post." While you've got a big buffer, increasing the runtime immediately turns 7 days of 1hr tasks into the same number of 2hr tasks (14 days of work), so cut your buffer in half and run it down a bit before increasing runtime one notch, else half the tasks will miss the deadline. Same again if you increase to 3hrs, 4hrs etc. You get my drift. |
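
Sid's arithmetic about raising the runtime while holding a full buffer can be checked with a short sketch. It assumes every task runs for exactly the target runtime and that cores are kept busy around the clock; both function names are hypothetical.

```python
def buffer_days(task_count: int, runtime_hours: float, cores: int = 1) -> float:
    """Days of work represented by a queue of equal-length tasks."""
    return task_count * runtime_hours / (24 * cores)


def tasks_that_fit(deadline_days: float, runtime_hours: float, cores: int = 1) -> int:
    """How many equal-length tasks can finish before the deadline."""
    return int(deadline_days * 24 * cores // runtime_hours)


queued = 168  # seven days of 1-hour tasks on a single core

print(buffer_days(queued, 1))   # days of work at a 1-hour runtime
print(buffer_days(queued, 2))   # the same queue after switching to 2 hours
print(tasks_that_fit(7, 2))     # tasks that can still meet a 7-day deadline
```

With these assumed numbers the queue doubles from 7 to 14 days of work, and only half of the 168 tasks can finish inside a 7-day deadline, which is why running the buffer down first is suggested.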
LC Send message Joined: 10 Jun 09 Posts: 8 Credit: 1,895,973 RAC: 0 |
@Mod.Sense "Rosetta@home does not do such thinking. It is the BOINC Manager..." Gotcha, mea culpa. "You started out saying your machine has limited access to internet...If you do have access full-time, then your new settings sound much more appropriate." My laptop has some sort of sporadic wifi disconnect issue and almost every night I lose connectivity for varying amounts of time, usually several hours. Sorry, I didn't intend to imply it was worse than this but from my POV the connection drops are very annoying. "As for the earlier discussion about 'holding up the science'..." Thanks for that info, I wasn't aware of any of that. Learned a bunch of new things now. @Sid "I actually agree with you here. I didn't know that first setting did what Juha says it does (but Mod.Sense reminds me that it did used to be called 'connect every x days' so maybe I just forgot). Upshot is, I agree you should set this at zero and use the next field to determine how many days worth of tasks you'd like." I already changed it to 2+5 but I'll probably just drop it to 0+5ish because as you've guessed, yes, I do already have the cache behaviour I was looking for. After everything you guys have taught me over the past couple days, I'd agree with what everyone seems to be leaning towards - 1 hour runtimes should probably be dropped, maybe even 2 hour ones but I think I would prefer to keep 4. Some 'part-timers' only contribute very small chunks of time and may miss deadlines if you raise the minimum too high. To be honest, I would have set my runtimes higher from day one had I known what I know now. Thanks for all the help everyone. I'll report back when I have time to test more changes. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
@LC Yes, unless your laptop has an intentional or scheduled removal from the internet, your minimum work buffer should be under one day. That basically indicates that it should reasonably expect to have access each day, which means it should expect to be able to report results back each day as well. Please let me know if anyone objects to me moving this conversation off to a new thread. I'll probably do that this weekend. Rosetta Moderator: Mod.Sense |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2122 Credit: 41,194,697 RAC: 9,774 |
"By the way, I now have about 80 WUs due the 31st October and about 200 due the 2nd November. I haven't changed anything since my last post." Having a quick check, it's apparent the runtime setting was changed up to 4 hours before letting outstanding tasks run down, so it looks like a lot of them will miss the deadline. Probably best just to let it happen - they'll get sent out again and new tasks will come down to replace them with a more appropriate deadline. |
Jesse Viviano Send message Joined: 14 Jan 10 Posts: 42 Credit: 2,700,472 RAC: 0 |
I got a validate error at https://boinc.bakerlab.org/rosetta/result.php?resultid=886140787. I sincerely doubt that this is caused by the situation where task files are uploaded and the tasks are then reported before the files are finally written to storage, because of the time stamp of the reporting of the work unit and the following BOINC client log:
11/15/2016 10:56:30 AM | rosetta@home | Computation for task rb_11_12_70285_113852__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449751_697_1 finished
11/15/2016 10:56:31 AM | rosetta@home | Started upload of rb_11_12_70285_113852__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449751_697_1_0
11/15/2016 10:56:37 AM | rosetta@home | Finished upload of rb_11_12_70285_113852__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449751_697_1_0
11/15/2016 10:56:48 AM | rosetta@home | Computation for task rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_8103_1 finished
11/15/2016 10:56:51 AM | rosetta@home | Computation for task rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7725_1 finished
11/15/2016 10:56:52 AM | rosetta@home | Started upload of rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_8103_1_0
11/15/2016 10:56:53 AM | rosetta@home | Started upload of rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7725_1_0
11/15/2016 10:56:56 AM | rosetta@home | Finished upload of rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_8103_1_0
11/15/2016 10:56:57 AM | rosetta@home | Finished upload of rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7725_1_0
11/15/2016 11:00:01 AM | rosetta@home | Computation for task rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7681_1 finished
11/15/2016 11:00:02 AM | rosetta@home | Started upload of rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7681_1_0
11/15/2016 11:00:05 AM | rosetta@home | Finished upload of rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7681_1_0
11/15/2016 11:03:00 AM | rosetta@home | Computation for task rb_11_12_70293_113866__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449742_965_1 finished
11/15/2016 11:03:01 AM | rosetta@home | Started upload of rb_11_12_70293_113866__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449742_965_1_0
11/15/2016 11:03:07 AM | rosetta@home | Finished upload of rb_11_12_70293_113866__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449742_965_1_0
11/15/2016 11:03:35 AM | rosetta@home | Computation for task rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7647_1 finished
11/15/2016 11:03:36 AM | rosetta@home | Started upload of rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7647_1_0
11/15/2016 11:03:40 AM | rosetta@home | Finished upload of rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7647_1_0
11/15/2016 11:05:08 AM | rosetta@home | Computation for task rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7638_1 finished
11/15/2016 11:05:09 AM | rosetta@home | Started upload of rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7638_1_0
11/15/2016 11:05:16 AM | rosetta@home | Finished upload of rb_11_12_70288_113830_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_449736_7638_1_0
11/15/2016 11:05:16 AM | rosetta@home | Computation for task rb_11_12_70293_113866__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449742_398_1 finished
11/15/2016 11:05:17 AM | rosetta@home | Started upload of rb_11_12_70293_113866__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449742_398_1_0
11/15/2016 11:05:23 AM | rosetta@home | Finished upload of rb_11_12_70293_113866__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449742_398_1_0
11/15/2016 11:13:50 AM | rosetta@home | Computation for task rb_11_12_70287_113849__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449753_103_1 finished
11/15/2016 11:13:51 AM | rosetta@home | Started upload of rb_11_12_70287_113849__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449753_103_1_0
11/15/2016 11:13:56 AM | rosetta@home | Finished upload of rb_11_12_70287_113849__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_449753_103_1_0
11/15/2016 11:19:39 AM | rosetta@home | update requested by user
11/15/2016 11:19:40 AM | rosetta@home | Sending scheduler request: Requested by user.
11/15/2016 11:19:40 AM | rosetta@home | Reporting 9 completed tasks
11/15/2016 11:19:40 AM | rosetta@home | Not requesting tasks: don't need (CPU: not highest priority project; NVIDIA GPU: job cache full)
11/15/2016 11:19:41 AM | rosetta@home | Scheduler request completed
The file for the affected work unit was uploaded at 11/15/2016 11:05:16 AM UTC-5 (US Eastern Standard Time). The scheduler request was issued at 11/15/2016 11:19:40 AM UTC-5 and finished at 11/15/2016 11:19:41 AM UTC-5. There are at least 14 minutes between the upload and the report. Does the upload server need to have its file system checked? |
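
The gap between upload and report quoted above can be recomputed from the log timestamps. The sketch below assumes the `date time AM/PM | project | message` layout seen in this log; the helper names are hypothetical and are not part of BOINC.

```python
from datetime import datetime
from typing import Tuple

# Timestamp layout as it appears in the client log above.
LOG_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_log_line(line: str) -> Tuple[datetime, str, str]:
    """Split one client log line into (timestamp, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, LOG_TIME_FORMAT), project, message


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# Upload time and scheduler request time taken from the post above.
gap = seconds_between("11/15/2016 11:05:16 AM", "11/15/2016 11:19:40 AM")
print(gap)  # elapsed seconds between the upload and the report
```

For the two timestamps given, the difference is 864 seconds, a little over 14 minutes, consistent with the figure in the post.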
makarios Send message Joined: 2 Apr 14 Posts: 1 Credit: 350,281 RAC: 0 |
Proudly number crunching for Rosetta@home! Thank you for such an amazing project! Nolan Keck / "makarios" |
Petr Pulc Send message Joined: 1 Oct 16 Posts: 2 Credit: 8,475 RAC: 0 |
Since circa Nov 24, all units crash on my notebook straight away with the following output: <core_client_version>7.6.33</core_client_version> <![CDATA[ <message> process got signal 11 </message> <stderr_txt> </stderr_txt> ]]> My system is: 4.8.0-1-amd64 #1 SMP Debian 4.8.7-1 (2016-11-13) x86_64 GNU/Linux. Any idea how to trace the possible cause of the problem (bad memory usage)? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"Since circa Nov 24. all units crash on my notebook straight away with..." This thread describes a similar issue. Memory and overclocking are other possibilities, discussed in this FAQ. Rosetta Moderator: Mod.Sense |
Juha Send message Joined: 28 Mar 16 Posts: 13 Credit: 705,034 RAC: 0 |
"process got signal 11" Check the system log for messages about vsyscall similar to these. If you have them, then the Rosetta app is not compatible with your kernel. See this message for a workaround. |
Petr Pulc Send message Joined: 1 Oct 16 Posts: 2 Credit: 8,475 RAC: 0 |
"process got signal 11" Thanks for the replies. Yes, it is caused by disabled vsyscall emulation. |
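
One way to see how a Linux kernel was booted is to look for a `vsyscall=` option on its command line. The sketch below only reports whether that option is present; it makes no claim about the workaround linked above, which is not reproduced in this thread. The function names are hypothetical, and `/proc/cmdline` is the standard location of the boot command line on Linux.

```python
from typing import Optional


def vsyscall_mode(cmdline: str) -> Optional[str]:
    """Return the value of a vsyscall= boot option, or None if it is absent.

    When the option is absent, the behaviour depends on how the kernel
    was built, which this function cannot see.
    """
    for token in cmdline.split():
        if token.startswith("vsyscall="):
            return token.split("=", 1)[1]
    return None


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's boot command line (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Example with an assumed command line; on a real host, pass read_cmdline().
print(vsyscall_mode("BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet vsyscall=emulate"))
```

The system log check suggested earlier remains the more direct evidence, since it shows the kernel actually rejecting a vsyscall attempt.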
Steve1979 Send message Joined: 27 Apr 06 Posts: 2 Credit: 3,074,072 RAC: 237 |
I have a work unit that has been running for 43 hours. Usually they last 8 hours. What should I do? It seems a waste to lose all of that processing but if it has crashed I think I should stop it? The remaining column is blank, as if it had completed. Thanks for any help with this. Steve |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,028,097 RAC: 7,117 |
"I have a work unit that has been running for 43 hours. Usually they last 8 hours. What should I do? It seems a waste to lose all of that processing but if it has crashed I think I should stop it? The remaining column is blank, as if it had completed." If I notice a job like that, I just kill it, but it is a personal decision on your part. It would be a greater waste if it runs for another 43 hours and then you kill it. 8-) |
Steve1979 Send message Joined: 27 Apr 06 Posts: 2 Credit: 3,074,072 RAC: 237 |
I have aborted it. Thanks for your help. I've never noticed this happening in the many years I have been contributing to this project. I'll keep a closer eye on runtimes in the future. Steve |
Conan Send message Joined: 11 Oct 05 Posts: 150 Credit: 4,193,109 RAC: 903 |
"I have aborted it. Thanks for your help. I've never noticed this happening in the many years I have been contributing to this project. I'll keep a closer eye on runtimes in the future." Sometimes just suspending and resuming a task like that can get it running correctly again. They seem to get stuck in a loop; check the task manager to see whether CPU is being used, as the task might be "running" but not actually doing anything and not using a CPU core. Conan |
David Brundage Send message Joined: 10 Dec 16 Posts: 3 Credit: 2,590,266 RAC: 3,377 |
I seem to get quite a few work units that end with "computation error", although more finish without any problem. I am new to Rosetta, so I'm just wondering if this is normal or if I should be looking for some kind of solution? |
©2024 University of Washington
https://www.bakerlab.org