Message boards : Number crunching : client upgrade, stalled WU's - what is the cause and the fix???
Author | Message |
---|---|
PresterJohn Send message Joined: 4 Nov 05 Posts: 24 Credit: 2,121,609 RAC: 0 |
1) for the windows version of the client, is there a way to tell what version number of the software i am running? i see that my machine downloaded rosetta_4.79_windows_intelx86.exe but how can i tell if it is actually running 4.79? i see no mention of the 4.79 executable being started in my stdoutdae.txt. 2) i've skimmed thru one or two of the related threads about WU's stuck at 1%, etc and correct me if i'm wrong, but it seems that there are a number of different possibilities and no one seems to know what exactly is the cause of the problem. since this weekend, i've had approx 5 occurrences of stalled WU's. in two of those cases, the client kept happily trying to finish and eventually wasted 43.8 and 14.5 hrs respectively only to return a client error as the final outcome (see links below). https://boinc.bakerlab.org/rosetta/result.php?resultid=1626055 https://boinc.bakerlab.org/rosetta/result.php?resultid=1368310 the other three occurrences were cases of 'active' stalled jobs (the latest of which i discovered 90 minutes ago), which were aborted by user intervention. all told, probably over 120 hrs of wasted time and money (electricity in nyc isn't cheap you know) doing absolutely nothing useful. so understandably, i am not in a particularly happy mood about this and would like to know what is being done to diagnose and fix this problem. i would rather not hear suggestions about running boincview or checking my boxes more frequently. of the two sites where i run r@h, boincview will not work for one of them because the highly secured router/switch environment locks out the boinc service port. find-a-drug users are/were accustomed to a client that ran smoothly with a minimum of user intervention and administration.
an occasional bad batch of WU's being pushed out to users i can understand and live with, but unexplained, unreproducible errors which might be occurring on a frequent basis and which could result in nonproductive conditions that may last for days are almost untenable. we have some large crunchers on our team and the extra overhead to manage and check host machines to make sure they are properly working is entirely unsatisfactory and will probably negatively impact the number of participants interested in running rosetta. [edit] fixed typo in thread subject. - team XPC - 'Where merry times and good crunching meet head-on!' |
Andrew Send message Joined: 19 Sep 05 Posts: 162 Credit: 105,512 RAC: 0 |
Quick Answer: Those results are from the 4.78 client, not the 4.79 client. Once you start crunching 4.79 you shouldn't get any "stuck" WUs. Longer Answer: 1) for the windows version of the client, is there a way to tell what version number of the software i am running? On the Work tab of BOINC Manager, the Application column shows which application version is being used. To see what Windows is actually running, open the Task Manager, go to the Processes tab, and you'll see either rosetta_4.78_windows_intelx86.exe or rosetta_4.79_windows_intelx86.exe using most of your CPU. 2) i've skimmed thru some of one or two of the related threads about WU's stuck at 1%, etc and correct me if i'm wrong, but it seems that there is a number of different possibilities and no one seems to know what exactly is the cause of the problem. The above WU links indicate that they used 4.78. Once you're using 4.79 you shouldn't get any WUs with the 1% stalls... If you can't wait for your cache to run out of 4.78 WUs, just abort them or reset the project. EDIT: added longer answer |
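For anyone who would rather script the check than eyeball the process list, here is a minimal Python sketch that pulls the version out of an executable name of the form quoted above (for example rosetta_4.79_windows_intelx86.exe). The function names and the pattern are illustrative assumptions based only on the two file names mentioned in this thread; they are not part of BOINC.

```python
import re

# Assumed naming scheme, inferred from the two file names quoted in the thread:
#   <app>_<major>.<minor>_<platform>.exe
EXE_PATTERN = re.compile(
    r"^(?P<app>[A-Za-z]+)_(?P<major>\d+)\.(?P<minor>\d+)_(?P<platform>.+)\.exe$"
)


def parse_app_version(exe_name):
    """Return (app, major, minor) for a matching name, or None otherwise."""
    match = EXE_PATTERN.match(exe_name)
    if match is None:
        return None
    return match.group("app"), int(match.group("major")), int(match.group("minor"))


def newest_executable(exe_names):
    """Return the name with the highest (major, minor), ignoring non-matching names."""
    candidates = []
    for name in exe_names:
        parsed = parse_app_version(name)
        if parsed is not None:
            candidates.append(((parsed[1], parsed[2]), name))
    if not candidates:
        return None
    return max(candidates)[1]


if __name__ == "__main__":
    names = [
        "rosetta_4.78_windows_intelx86.exe",
        "rosetta_4.79_windows_intelx86.exe",
    ]
    for name in names:
        print(name, "->", parse_app_version(name))
    print("newest:", newest_executable(names))
```

Feed it the names you see in Task Manager or in the project directory; it only reads file names and says nothing about which executable a given result will actually be run with.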
PresterJohn Send message Joined: 4 Nov 05 Posts: 24 Credit: 2,121,609 RAC: 0 |
see my question #1... the stalled job that i killed on one of my machines this morning had d/l'ed 4.79 yesterday. how can i verify that it is indeed running the new version? - team XPC - 'Where merry times and good crunching meet head-on!' |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,359 RAC: 13 |
the stalled job that i killed on one of my machines this morning had d/l'ed 4.79 yesterday. how can i verify that it is indeed running the new version? In the Work tab, the Application column. Mine shows "rosetta 4.79" at the moment. |
PresterJohn Send message Joined: 4 Nov 05 Posts: 24 Credit: 2,121,609 RAC: 0 |
the stalled job that i killed on one of my machines this morning had d/l'ed 4.79 yesterday. how can i verify that it is indeed running the new version? yep, i noticed version # listed in the application column about a minute ago and it did say 4.78. but how exactly does the software know to use 4.79? just now i attempted to manually force 4.79 to load by renaming the 4.78 exe. it took two restarts on boincmgr to get 4.79 to load but in the process it cleared out my queue and it attempted to download 4.78 again. --- quoted from message log ---------------------- 2005-11-15 11:01:03 [---] request_reschedule_cpus: start failed 2005-11-15 11:01:03 [rosetta@home] Computation for result 1hz7A_abrelaxmode_random_gauss_fix_bb_jitter03_110659_0 finished 2005-11-15 11:01:03 [rosetta@home] Starting result 1n0u__abrelaxmode_random_length20_jitter02_omega_16322_0 using rosetta version 479 2005-11-15 11:01:26 [rosetta@home] Finished download of rosetta_4.78_windows_intelx86.exe 2005-11-15 11:01:26 [rosetta@home] Throughput 209181 bytes/sec 2005-11-15 11:02:01 [rosetta@home] Fetching master file 2005-11-15 11:02:06 [rosetta@home] Master file download succeeded 2005-11-15 11:02:12 [rosetta@home] Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi 2005-11-15 11:02:12 [rosetta@home] Reason: To fetch work 2005-11-15 11:02:12 [rosetta@home] Requesting 728251 seconds of new work, and reporting 41 results 2005-11-15 11:02:17 [rosetta@home] Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded 2005-11-15 11:02:18 [---] request_reschedule_cpus: files downloaded 2005-11-15 11:02:18 [---] request_reschedule_cpus: files downloaded 2005-11-15 11:02:18 [---] request_reschedule_cpus: files downloaded 2005-11-15 11:02:18 [---] request_reschedule_cpus: files downloaded 2005-11-15 11:02:18 [---] request_reschedule_cpus: files downloaded something does not look right here! - team XPC - 'Where merry times and good crunching meet head-on!' |
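The message log quoted above already records which version each result was started with ("Starting result ... using rosetta version 479"). A minimal Python sketch for extracting that from a saved log follows. It assumes the line layout shown in the quoted log, and it assumes that the three-digit number is major * 100 + minor (479 read as 4.79), which matches the file names in this thread; the function names are made up for illustration.

```python
import re

# Matches lines such as:
# 2005-11-15 11:01:03 [rosetta@home] Starting result <name> using rosetta version 479
START_LINE = re.compile(
    r"^\S+ \S+ \[(?P<project>[^\]]+)\] Starting result (?P<result>\S+) "
    r"using (?P<app>\S+) version (?P<num>\d+)\s*$"
)


def format_version(version_num):
    """Turn 479 into '4.79' (assumes major * 100 + minor encoding)."""
    major, minor = divmod(version_num, 100)
    return f"{major}.{minor:02d}"


def started_versions(log_lines):
    """Return (result_name, app_name, version_string) for every 'Starting result' line."""
    found = []
    for line in log_lines:
        match = START_LINE.match(line.strip())
        if match:
            found.append(
                (match.group("result"), match.group("app"),
                 format_version(int(match.group("num"))))
            )
    return found


if __name__ == "__main__":
    sample = [
        "2005-11-15 11:01:03 [---] request_reschedule_cpus: start failed",
        "2005-11-15 11:01:03 [rosetta@home] Starting result "
        "1n0u__abrelaxmode_random_length20_jitter02_omega_16322_0 "
        "using rosetta version 479",
    ]
    for result, app, version in started_versions(sample):
        print(f"{result}: {app} {version}")
```

Run against a copy of the log (for instance the stdoutdae.txt mentioned earlier in the thread), this lists one line per started result, which makes it easy to see whether any result is still being started with 4.78.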
Andrew Send message Joined: 19 Sep 05 Posts: 162 Credit: 105,512 RAC: 0 |
how exactly does the software know to use 4.79? This info is stored in the xml files in the boinc main directory. Which file and where, I don't exactly know. If you want to use 4.79 instead of 4.78 you'll have to create an app_info.xml in {BOINC_INSTALL_DIR}\projects\boinc.bakerlab.org_rosetta See this link about app_info.xml: link Basically the app_info.xml file will tell the boinc client what exe to use for 4.78. I believe your xml would look something like this (although I haven't tested this): <app_info> <app> <name>rosetta</name> </app> <file_info> <name>rosetta_4.79_windows_intelx86.exe</name> <executable/> </file_info> <app_version> <app_name>rosetta</app_name> <version_num>478</version_num> <file_ref> <file_name>rosetta_4.79_windows_intelx86.exe</file_name> <main_program/> </file_ref> </app_version> </app_info> However, after all this... I'd just abort the 4.78 WUs, and keep the 4.79 WUs. :) |
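Since a hand-written app_info.xml is easy to get subtly wrong, a quick consistency check before dropping it into the project directory may be worthwhile. The Python sketch below only verifies that every file_name inside a file_ref matches a name declared in some file_info, using the element names from the example above. It is an illustrative check, not a BOINC tool, and it does not validate anything else the client may require. The sample deliberately contains a mismatched name to show what gets reported.

```python
import xml.etree.ElementTree as ET


def undeclared_file_refs(app_info_xml):
    """Return file_ref names that have no matching file_info entry."""
    root = ET.fromstring(app_info_xml.strip())
    declared = {
        info.findtext("name", default="").strip()
        for info in root.iter("file_info")
    }
    missing = []
    for ref in root.iter("file_ref"):
        name = ref.findtext("file_name", default="").strip()
        if name not in declared:
            missing.append(name)
    return missing


# Hypothetical sample: the file_ref name ends in ".exee" and so matches nothing.
SAMPLE = """
<app_info>
  <app><name>rosetta</name></app>
  <file_info>
    <name>rosetta_4.79_windows_intelx86.exe</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>rosetta</app_name>
    <version_num>478</version_num>
    <file_ref>
      <file_name>rosetta_4.79_windows_intelx86.exee</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>
"""

if __name__ == "__main__":
    problems = undeclared_file_refs(SAMPLE)
    if problems:
        print("file_ref entries with no matching file_info:", problems)
    else:
        print("all file_ref entries are declared")
```

An empty list means the names line up; it does not mean the file is otherwise correct, so the advice to simply abort the 4.78 WUs still applies.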
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,359 RAC: 13 |
but how exactly does the software know to use 4.79? Each result you are assigned has a version that it should be processed with. Thus the earlier statement that you may have to look at your Work tab and "abort" any results which show 4.78 as the required application version. Note that until all 4.78 results have been processed, you may continue to receive a few. A project can run multiple science apps against different results as needed - this for example lets projects like Predictor have two different types of WUs, processed by two different science apps, yet all still be "Predictor". Deleting (or renaming) a science app will just cause another copy to be downloaded if you get a result assigned that requires it. EDIT:: I would caution against using an app_info.xml file in this case. It would probably work, but the slightest mistake can result in a large number of "lost" results, more than simply aborting them would. This file is normally used when, for example, you want to use an optimized SETI app in place of the standard app, and you know that the outcome of processing a result with either app is supposed to be identical. Also, you must remember to delete the file when it is no longer needed. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
The "stuck at 1%" issue was not directly addressed in the new version, so it may still occur. However, there have been some significant changes in the code from our development team, so I wouldn't be surprised if a fix was made unknowingly. This is a very peculiar and hard-to-find bug, as you may have gathered already from the message board threads. |
Andrew Send message Joined: 19 Sep 05 Posts: 162 Credit: 105,512 RAC: 0 |
EDIT:: I would caution against using an app_info.xml file in this case. It would probably work, but the slightest mistake can result in a large number of "lost" results, more than simply aborting them would. This file is normally used when, for example, you want to use an optimized SETI app in place of the standard app, and you know that the outcome of processing a result with either app is supposed to be identical. Also, you must remember to delete the file when it is no longer needed. I agree I would not use the app_info.xml in this case, but I just presented it to him as another option. |
AnRM Send message Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
As I have noted in another thread, we have not had any problems with R@H 4.79. It seems very stable. Most of our boxes are running BOINC 5.x, 3 projects, clients stay in memory and change after 60 mins. R@H client 4.78 (the old version) hung at 1% on a number of occasions even with 120 mins of continuous run time. Seems encouraging......Cheers, Rog. |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,359 RAC: 13 |
I agree I would not use the app_info.xml in this case, but I just presented it to him as another option. Yep! No problem - it's a viable option here. I didn't notice your statement at the bottom that you'd just abort the WUs or I wouldn't have bothered editing. I'm not concerned in this specific case, but I've seen people get "carried away" with this type of thing on the SETI boards - read a recommendation to one person, assume it applies to them even though they have a totally different problem - and then they wind up so messed up it takes a total reinstall to get it straightened up. Editing the xml is sometimes the _only_ way to go, but it's not for the masses; it's like "hit Reset Project" - for a while it seemed like everyone was doing that for every problem, even though it only helped in maybe 10% of the cases, and it caused tons of lost results... |
Mike Gelvin Send message Joined: 7 Oct 05 Posts: 65 Credit: 10,612,039 RAC: 0 |
how exactly does the software know to use 4.79? I believe the call to use one version over another is actually encoded in the work unit itself (again, I'm only speaking from observation). Once the workunit arrives, it triggers the download of the support files (the actual EXE) needed to run it. Thus the abort is the safest: abort until you receive 4.79 work units... or (and I STRONGLY don't suggest this), manipulate the xml files to fake BOINC into using 4.79 instead of 4.78. |
©2024 University of Washington
https://www.bakerlab.org