Message boards : Number crunching : Help us solve the 1% bug!
Author | Message |
---|---|
James Joined: 8 Jan 06 Posts: 21 Credit: 11,697 RAC: 0 |
Well, assuming my standard routine is 'routine behavior' (and I take that assumption as fact), I can no longer say that every WU is encountering the 1 percent issue. Three have, though: stuck for hours on end. I assumed that, too, was normal behavior related to the size of the work units, but when I checked back on them, they were relatively small ones. I had not been checking the graphics and had treated it as more of the same. I did the restart and that fixed it. So... 3 out of, umm, whatever I have now. Not too many, but I still get a few. The worst case sat at 1 percent for about 5 hours before I restarted. There. I suppose I should RTFM more closely. Part of the issue is that I wanted a quick fix for Einstein boredom and wasn't too interested in the progress meters. Although... it is a little odd that the progress isn't a bit more real-time. |
genes Joined: 8 Oct 05 Posts: 60 Credit: 702,872 RAC: 1,035 |
OK, I've got one: stuck at 1%, 20+ hours of CPU on a P3 1GHz dual, running WinXP SP2, BOINC 5.2.15 client and 4 other BOINC projects (S@H, S@H Enhanced, E@H, and CPDN). I've suspended the WU, stopped BOINC, and I'll run the tests. I neglected to mention that I have "Leave applications in memory" set to "yes". In the early days of Rosetta, that was the only way to get it to work at all on a multi-processor setup. I also upped the memory from 768MB to 1GB on my machines, and they typically run at 60% or less memory usage now with 5 projects. |
genes Joined: 8 Oct 05 Posts: 60 Credit: 702,872 RAC: 1,035 |
James wrote: "Well, assuming my standard routine is 'routine behavior' (and I take that assumption as fact), I can no longer say that every WU is encountering the 1 percent issue." You'll know that you've hit the "1% issue" if you look at the graphics and the line "Step: xxxx" is not increasing. (A good reason to have graphics to look at, BTW.) |
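The rule of thumb above (the "Step:" value stops increasing) can be written down as a small check. This is only a sketch in Python, assuming the step counter has been noted by hand at regular intervals; `is_stalled` and `min_samples` are hypothetical names for illustration and are not part of BOINC or Rosetta.

```python
from typing import Sequence


def is_stalled(step_samples: Sequence[int], min_samples: int = 3) -> bool:
    """Return True if the last `min_samples` readings of the step counter are identical."""
    if len(step_samples) < min_samples:
        return False  # not enough readings to judge either way
    recent = step_samples[-min_samples:]
    return all(step == recent[0] for step in recent)


# Made-up readings of the "Step:" value, taken a few minutes apart.
print(is_stalled([100, 250, 400, 400, 400]))  # True: the last three readings did not move
print(is_stalled([100, 250, 400]))            # False: the counter is still increasing
```

Requiring several identical readings, rather than two, is a deliberate choice here so that a single slow step is not mistaken for a hang.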
eberndl Joined: 17 Sep 05 Posts: 47 Credit: 3,062,163 RAC: 2,108 |
About the PPAH/R@H interaction possibility: I run these two projects (each with about 1/3 of my computer time) and have never had a 1% WU, so I doubt that this is the cause of the problems. I DO have BOINC set to leave applications in memory, though... |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
genes wrote: "OK, I've got one: stuck at 1%, 20+ hours of CPU on a P3 1GHz dual, running WinXP SP2, BOINC 5.2.15 client and 4 other BOINC projects (S@H, S@H Enhanced, E@H, and CPDN). I've suspended the WU, stopped BOINC, and I'll run the tests." Thanks. This again suggests it is not an internal Rosetta problem. We are going to ask the BOINC developers whether they have any ideas on what might be going on. |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
David Baker wrote: "Thanks. This again suggests it is not an internal Rosetta problem. We are going to see if the BOINC developers have any ideas on what might be going on." Can you tell us which API calls you are using in the application? Perhaps we can think through the behaviors and come up with a possible link... |
genes Joined: 8 Oct 05 Posts: 60 Credit: 702,872 RAC: 1,035 |
Might there be the possibility that it has to do with stopping/restarting BOINC or with BOINC pausing/resuming the app? I do have "leave in memory" set -- is that still necessary? |
bruce boytler Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
Hi all, I am running Rosetta on an AMD64 dual core. It is the only project in BOINC, and the computer does nothing else but crunch and talk to the Rosetta server. I have had every single bug mentioned in this thread and in the other thread on aborted workunits. This computer does a large volume of WUs, around 20-25 a day. The only constant I have been able to see is that the bugs crop up randomly, and their frequency goes up as more WUs get crunched. Ciao! And have a great day no matter where you are in the UTC landscape... |
Dimitris Hatzopoulos Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
I've never encountered a problem on WinXP, but today I'm seeing Rosetta processes stuck on a Linux (Debian Sarge) box (which Rosetta shares with 3 other BOINC projects). BOINC ver. 5.2.13. The other projects had been running fine for several days. (I'm still "testing the waters" wrt DC.) I've been running WCG/HPF (which uses an earlier version, 4.21, of Rosetta) on this Linux machine and have never had a problem so far. Just FYI. The first time I noticed rosetta4.8 stuck (12 hours ago), I killed the Rosetta and BOINC tasks and restarted BOINC. It worked for a few hours, and a few minutes ago I found Rosetta stuck again. Rosetta was stopped (ps flags are SN, stopped/nice) and takes no CPU time ("top" doesn't show the Rosetta task, whereas normally it would take 99% of CPU time). No other BOINC processes take its place, so the whole BOINC work queue gets stuck too. ~/BOINC/slots/1/stdout.txt says near the bottom:
So does it look like the 1% error? If so, then it's a Linux thing too. I've run Rosetta from the command line from its own home directory, as you described, and will let you know. |
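A note on reading the `ps` flags quoted above: in the procps `ps` manual page (PROCESS STATE CODES), `S` means interruptible sleep and `T` means stopped by a job control signal, with `N` marking low priority (nice). If the `ps` on that box follows those conventions, `SN` would read as "sleeping, niced" and a suspended task would be expected to show `T`. The Python sketch below just decodes such a string; the tables cover only a subset of codes, and `describe_stat` and `is_stopped` are hypothetical helper names.

```python
from typing import List

# Subset of the state codes documented in the procps `ps` manual page.
STATES = {
    "R": "running or runnable",
    "S": "interruptible sleep (waiting for an event)",
    "D": "uninterruptible sleep (usually I/O)",
    "T": "stopped by job control signal",
    "Z": "defunct (zombie)",
}
# Extra characters that may follow the state letter.
FLAGS = {
    "N": "low priority (nice)",
    "<": "high priority",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
}


def describe_stat(stat: str) -> List[str]:
    """Translate a ps STAT string such as 'SN' into readable descriptions."""
    if not stat:
        return []
    parts = [STATES.get(stat[0], f"unknown state {stat[0]!r}")]
    parts += [FLAGS.get(ch, f"unknown flag {ch!r}") for ch in stat[1:]]
    return parts


def is_stopped(stat: str) -> bool:
    """True only when the leading state letter is 'T' (stopped)."""
    return stat.startswith("T")


print(describe_stat("SN"))
print(is_stopped("SN"), is_stopped("TN"))  # False True
```

A process that sleeps without ever being scheduled also uses no CPU time, so the symptoms described in the post fit either state.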
Dimitris Hatzopoulos Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
Following up my prior post, the rosetta process has finished:
and the ~/BOINC/projects/boinc.bakerlab.org_rosetta/ has following files:
    $ ls -lat|head
    total 22508
    -rw-r--r-- 1 boinc boinc 55 2006-01-15 06:40 stderr.txt
    drwxr-xr-x 2 boinc boinc 4096 2006-01-15 06:40 .
    -rw-r--r-- 1 boinc boinc 27258 2006-01-15 06:40 aa1ogw.out.gz
    -rw-r--r-- 1 boinc boinc 0 2006-01-15 06:40 boinc_finish_called
    -rw-r--r-- 1 boinc boinc 2424 2006-01-15 06:40 init_data.xml
    -rw-r--r-- 1 boinc boinc 5 2006-01-15 06:40 rosetta_decoy_cnt.txt
    -rw-r--r-- 1 boinc boinc 939 2006-01-15 06:40 rosetta_random.txt
    -rw-r--r-- 1 boinc boinc 48711 2006-01-15 06:40 stdout.txt
    -rw-r--r-- 1 boinc boinc 7 2006-01-15 06:23 aa1ogw.last_pdb
    $ cat stderr.txt
    Can't open init data file - running in standalone mode
    $ tail stdout.txt
    wobblemin trials: 76 accepts: 0 %: 0
    final score: -166.453827
    ---------------------------------------------------
    BOINC :: [2006-01-15 06:40:37] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 10 :: num_decoys: 10
    BOINC :: [2006-01-15 06:40:37] :: Structure: 10 completed :: num_decoys: 10 :: total_iterations: 10 :: percent complete: 1
    GZIP SILENT FILE: ./aa1ogw.out
    ======================================================
    DONE :: 1 starting structures built 10 (nstruct) times
    This process generated 10 decoys from 10 attempts
    ======================================================
Let me know if there's anything else to check and report back. |
Grutte Pier [Wa Oars]~MAB The Frisian Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
A NO_SIM-ANNEAL_NO_BARCODE_2reb_243_286_0 has been running for more than 2:25:xx now and is still at 1%. What is the right thing to do? EDIT: Just found the instructions below, so will check that. EDIT 2: Followed the instructions but got no graphics (W2K); time is running and it is still at 1% at the moment. When can I expect the 1% to change? |
Grutte Pier [Wa Oars]~MAB The Frisian Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=6477569 Aborted it after running for another hour. |
Polian Joined: 21 Sep 05 Posts: 152 Credit: 10,141,266 RAC: 0 |
|
Grutte Pier [Wa Oars]~MAB The Frisian Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
Quote: "For these, unless you've uncovered a different problem, you can stop BOINC and restart it and the workunit will start over and should complete normally." In this case (4 hours wasted), six of these a day and you'd better choose another project to prevent wasting idle time. It would be easier to use a program that gives an alarm, or aborts right away, if the WU is still at 1% after 15 minutes, instead of taking the trouble to follow the instructions. |
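The alarm suggested above comes down to a single comparison. The Python sketch below shows only that decision rule. `TaskReading` and `looks_stuck` are hypothetical names, the 1% and 15-minute thresholds are the ones proposed in the post (not a project recommendation), and how the readings would be obtained from BOINC is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class TaskReading:
    """One observation of a work unit, e.g. copied from the BOINC manager."""
    name: str
    percent_done: float  # progress on a 0-100 scale
    cpu_minutes: float   # CPU time used so far, in minutes


def looks_stuck(reading: TaskReading,
                percent_limit: float = 1.0,
                minutes_limit: float = 15.0) -> bool:
    """Flag a work unit that is still at or below `percent_limit` after `minutes_limit`."""
    return (reading.percent_done <= percent_limit
            and reading.cpu_minutes >= minutes_limit)


# The work unit from the earlier post: still at 1% after roughly 2 h 25 min.
wu = TaskReading("NO_SIM-ANNEAL_NO_BARCODE_2reb_243_286_0", 1.0, 145.0)
print(looks_stuck(wu))  # True

# A made-up healthy task for comparison.
print(looks_stuck(TaskReading("example_task", 35.0, 145.0)))  # False
```

A flag from this rule is only a hint to go and look at the "Step:" counter; a large work unit might legitimately stay at 1% for longer than 15 minutes.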
Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Quote: "For these, unless you've uncovered a different problem, you can stop BOINC and restart it and the workunit will start over and should complete normally." I have been reading the older posts in this thread, and may be able to shed some light on one of Paul's theories. All my machines are Macs running OS 10.4.4 and BOINC 5.2.13. One is a Dual G4, one is a Dual G5, and one is a PowerBook G4. I cannot remember ever having a WU stuck at 1%, but I have had a few that stuck at other points during processing. I had one (reported earlier) that stuck at 80%. That was on the Dual G4. That system also runs SETI, Predictor, Climate, SETI Enhanced, and occasionally Einstein. At least in my case there seems to be no relationship between any stuck WUs and any other apps on the system: I have had a few stuck on the Dual G5, and it is only running R@H. It is interesting to note that BOINC, not R@H, actually keeps track of the CPU time and the counters. This function is somewhat dependent on what the project estimates as the time a particular WU will take. That said, if the WU sticks at 1%, one has to wonder whether that is actually what is happening, or whether BOINC is simply failing to update properly and the WU might complete if left alone. In my experience, when a WU sticks (again, I never see this at 1%) there will always be a kernel process running, started by "root", that starts to eat up CPU cycles. This is usually how I can tell whether the WU is stuck or just taking a long time. If I do not see this kernel process running and eating up the system, then the WU is not stuck and I leave it alone. If I do see it, I shut down BOINC, abort the kernel process (it won't stop on its own), and restart BOINC, and usually the WU will run to completion, picking up at the last checkpoint. I have to assume that since the WU started, it will complete with the same random seed, because that should have been stored on initialization. So if that is true, it is not likely a seed problem. 
At least on my system, processing does not actually stop. It is as though the program has managed to find a tight loop somewhere. So far I have been unable to capture any stats on this kernel process, but I do know that it never changes state, uses a lot of CPU percent, does some paging, and makes no machine calls. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
This "kernel_task" that takes CPU time: there is a recent thread in SETI in Q&A-Mac; you might look there... |
Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Tern wrote: "This 'kernel_task' that takes CPU time - recent thread in SETI in Q&A-Mac; might look there..." Well Bill, I would, but I can't find it. It looks like all the thousands of Mac folks just coming into BOINC are having all the same startup problems the rest of us had, all at once. It looks like a free-for-all over there. Kind of makes me want to write something to try to help them out. From what I can tell, most of them just don't quite understand how to make the thing run in the first place. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
Tern wrote: "This 'kernel_task' that takes CPU time - recent thread in SETI in Q&A-Mac; might look there..." Snake Doctor wrote: "Well Bill, I would, but I can't find it." Sorry, I sent you to the wrong place. Here is the thread I had in mind, on the BOINC boards instead of SETI, but it's not going to tell you anything new. |
premier Joined: 30 Dec 05 Posts: 14 Credit: 23,872,868 RAC: 0 |
My 1% stuck WU: 2006-01-17 12:48:14|rosetta@home|Starting result PRODUCTION_ABINITIO_1bq9A_250_543_0 using rosetta version 481. It had been running for almost 7 hours, frozen at step 21933. Suspending and resuming unfortunately didn't solve the problem. I closed the BOINC client (Windows version), opened it again, and guess what: the WU passed step 21933 and the whole WU was processed in 1 hr 15 min. The bad thing is I lost 7 hours doing nothing (and lost credit) :( And this is not my first WU stuck at 1%. I have saved the stdout.txt from the stuck unit. If you want it, I can mail it. One more thing: after restarting BOINC I compared the stdout.txt from the WU that got stuck with the one from the same WU that completed successfully. Up to the line with the random seed they were almost identical, except for one line: WARNING: check_decoy_exists: unexpected decoy number: start#,decoy#,lastdecoy# 1 1 40 0 |
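A comparison like the one described above can be repeated with Python's standard `difflib` module. This is a sketch only: `changed_lines` is a hypothetical helper name, and the sample log lines are placeholders, not real Rosetta output.

```python
import difflib
from typing import List


def changed_lines(stuck: List[str], completed: List[str]) -> List[str]:
    """Return only the lines that differ between two logs.

    Lines prefixed '- ' appear only in `stuck`; lines prefixed '+ ' appear
    only in `completed`.
    """
    return [line for line in difflib.ndiff(stuck, completed)
            if line.startswith(("- ", "+ "))]


# Placeholder content standing in for the two stdout.txt files.
stuck_log = ["seed line", "common line"]
completed_log = ["seed line", "extra warning line", "common line"]
print(changed_lines(stuck_log, completed_log))  # ['+ extra warning line']
```

For real files, each log could be loaded with `open(path).read().splitlines()` before being passed in. Run-specific lines such as timestamps will always differ, so it helps to compare only the portion up to the random seed line, as the post did.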
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
premier, can you email me both stdout.txt files? dekim at u dot washignton dot edu |
©2024 University of Washington
https://www.bakerlab.org