Help us solve the 1% bug!

Message boards : Number crunching : Help us solve the 1% bug!

James

Joined: 8 Jan 06
Posts: 21
Credit: 11,697
RAC: 0
Message 9016 - Posted: 14 Jan 2006, 16:20:40 UTC - in response to Message 9010.  

Well assuming my standard routine is 'routine behavior' (and I take that assumption as fact) I can no longer say that every WU is encountering the 1 percent issue.

3 though, have. As in, stuck for hours on end. I assumed that, too, was normal behavior related to the size of the work units. I checked back on them, they were relatively small ones. I had not been checking the graphics and had treated it as more of the same. I did the restart and stuff got fixed.

So...3 out of umm whatever I have now. Not too many. But I still get a few. The excessive stickiness was about 5 hours on 1 percent before restarts...

There. I suppose I should RTFM more closely. Part of the issue is that I wanted a quick fix to einstein boredom and wasn't too interested in the progress meters.

Although...it is a little odd that the progress isn't a bit more real time.
ID: 9016 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
genes
Joined: 8 Oct 05
Posts: 60
Credit: 702,872
RAC: 1,035
Message 9017 - Posted: 14 Jan 2006, 16:32:42 UTC - in response to Message 9008.  

OK, I've got one: stuck at 1%, 20+ hours of CPU on a P3 1GHz dual, running WinXP SP2, BOINC 5.2.15 client and 4 other BOINC projects (S@H, S@H Enhanced, E@H, and CPDN). I've suspended the WU, stopped BOINC, and I'll run the tests.
----


I neglected to mention that I have "Leave applications in memory" set to "yes". In the early days of Rosetta, that was the only way to get it to work at all on a multi-processor setup. I also upped the memory from 768MB to 1GB on my machines, and they typically run at 60% or less memory usage now with 5 projects.

ID: 9017 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
genes
Joined: 8 Oct 05
Posts: 60
Credit: 702,872
RAC: 1,035
Message 9018 - Posted: 14 Jan 2006, 16:37:03 UTC - in response to Message 9016.  

Well assuming my standard routine is 'routine behavior' (and I take that assumption as fact) I can no longer say that every WU is encountering the 1 percent issue.


You'll know that you've hit the "1% issue" if you look at the graphics and the line "Step: xxxx" is not increasing. (a good reason to have graphics to look at, BTW)

ID: 9018 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
eberndl
Joined: 17 Sep 05
Posts: 47
Credit: 3,062,163
RAC: 2,108
Message 9021 - Posted: 14 Jan 2006, 16:50:04 UTC

About the PPAH/R@H interaction possibility: I run these two projects (each with about 1/3 of my computer time) and have never had a 1% WU, so I doubt that this is the cause of the problems. I DO have BOINC set to leave applications in memory though...


Questions? Try the Wiki!
Take a look inside my brain
ID: 9021 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 9034 - Posted: 14 Jan 2006, 19:11:44 UTC - in response to Message 9008.  

OK, I've got one: stuck at 1%, 20+ hours of CPU on a P3 1GHz dual, running WinXP SP2, BOINC 5.2.15 client and 4 other BOINC projects (S@H, S@H Enhanced, E@H, and CPDN). I've suspended the WU, stopped BOINC, and I'll run the tests.
----
[edit]
When I ran it from the command prompt, it ran (and is continuing to run) normally. It's at 20% now.

WU name: PRODUCTION_ABINITIO_1iibA_239_573_0

WU link:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=5159485
----
[edit]
The command line that I used:
projects/boinc.bakerlab.org_rosetta/rosetta_4.81_windows_intelx86.exe xx 1iib A -output_silent_gz -silent -increase_cycles 10 -nstruct 10 -constant_seed -jran 1248601
[/edit]

I'm going to stop the command-line app and let it run again normally.
[/edit]


Thanks. This again suggests it is not an internal Rosetta problem. We are going to see if the BOINC developers have any ideas on what might be going on.


ID: 9034 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 9037 - Posted: 14 Jan 2006, 20:16:01 UTC - in response to Message 9034.  

Thanks. This again suggests it is not an internal Rosetta problem. We are going to see if the BOINC developers have any ideas on what might be going on.

Can you tell us which API calls you are using in the application?

Perhaps we can think through the behaviors and come up with a possible link ...
ID: 9037 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
genes
Joined: 8 Oct 05
Posts: 60
Credit: 702,872
RAC: 1,035
Message 9041 - Posted: 14 Jan 2006, 21:22:34 UTC - in response to Message 9034.  


Thanks. This again suggests it is not an internal Rosetta problem. We are going to see if the BOINC developers have any ideas on what might be going on.



Might there be the possibility that it has to do with stopping/restarting BOINC or with BOINC pausing/resuming the app? I do have "leave in memory" set -- is that still necessary?

ID: 9041 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 9054 - Posted: 15 Jan 2006, 0:54:36 UTC

Hi All,

I am running Rosetta on an AMD64 dual core.

It is the only project in BOINC, and the computer does nothing else but crunch and talk to the Rosetta server.

I have had every single bug mentioned in this thread, and in the other thread on aborted workunits, crop up.

This computer does a large volume of WUs, around 20-25 a day. The only constant I have been able to see is that the bugs crop up randomly, and their frequency goes up as more WUs get crunched.

Ciao, and have a great day no matter where you are in the UTC landscape...
ID: 9054 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Dimitris Hatzopoulos

Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 9065 - Posted: 15 Jan 2006, 2:55:35 UTC - in response to Message 8909.  
Last modified: 15 Jan 2006, 3:08:45 UTC


So I don't think it is a rosetta problem (bug) per se, and maybe it doesn't happen with boinc on linux.
Does anything similar happen with WU from other projects? Maybe James's computer can provide a clue--why does this happen to all of his WU?


I've never encountered a problem on WinXP, but today I'm seeing Rosetta processes stuck on a Linux (Debian Sarge) box (which Rosetta shares with 3 other BOINC projects). BOINC ver. 5.2.13. The other projects had been running fine for several days. (I'm still "testing the waters" wrt DC). I've been running WCG/HPF (which is using an earlier version 4.21 of Rosetta) on this Linux machine and have never had a problem so far. Just FYI.

The first time I noticed rosetta4.8 stuck (12hr ago) I killed the rosetta task and boinc tasks and restarted BOINC. It worked for a few hours, and a few minutes ago I found rosetta stuck again.

Rosetta was idle (ps flags are SN, sleeping/nice), takes no CPU time ("top" doesn't show the Rosetta task, whereas normally it'd take 99% of CPU time). No other BOINC processes take its place, so the whole BOINC work queue gets stuck too.

~/BOINC/slots/1/stdout.txt says near the bottom:


BOINC :: [2006-01-15 04:27:07] :: Total iterations: 10 :: mode: abrelax :: nstartnm: 1 :: number_of_output: 10 :: num_decoys: 9 :: percent complete: 0.9


So does it look like the 1% error?

If so, then it's a Linux thing too. I've run rosetta from the command-line from its own home directory, as you described and will let you know.
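
The progress line quoted above can be split into fields mechanically, which makes it easier to compare runs. This is a minimal Python sketch, assuming the `BOINC :: [timestamp] :: key: value :: ...` layout seen in this stdout.txt; `parse_progress_line` is a hypothetical helper name, not part of Rosetta or BOINC.

```python
def parse_progress_line(line):
    """Split a 'BOINC :: [timestamp] :: key: value :: ...' line into a dict.

    Returns None if the line does not start with the 'BOINC ::' prefix.
    """
    parts = [p.strip() for p in line.strip().split(" :: ")]
    if len(parts) < 2 or parts[0] != "BOINC":
        return None
    fields = {"timestamp": parts[1].strip("[]")}
    for part in parts[2:]:
        key, sep, value = part.partition(":")
        if sep:  # skip segments that are not in 'key: value' form
            fields[key.strip()] = value.strip()
    return fields


# The line from the post, rejoined onto one line.
line = ("BOINC :: [2006-01-15 04:27:07] :: Total iterations: 10 :: mode: abrelax"
        " :: nstartnm: 1 :: number_of_output: 10 :: num_decoys: 9"
        " :: percent complete: 0.9")
fields = parse_progress_line(line)

# Ratio of finished decoys to requested outputs for this line.
ratio = int(fields["num_decoys"]) / int(fields["number_of_output"])
```

For this line the ratio of `num_decoys` to `number_of_output` is 9/10 = 0.9, the same number the log prints as "percent complete". That suggests, though the log alone does not prove it, that the field is a fraction of 1 rather than a percentage.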
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 9065 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Dimitris Hatzopoulos

Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 9079 - Posted: 15 Jan 2006, 13:38:29 UTC

Following up my prior post, the rosetta process has finished:


./rosetta_4.80_i686-pc-linux-gnu aa 1ogw _ -abrelax -stringent_relax -more_relax_cycles -relax_score_filter -output_chi_silent -vary_omega -new_centroid_packing -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -omega_weight 0.5 -jitter_frag -jitter_variation gauss -max_frags 400 -output_silent_gz -paths frags400.txt -filter1 -110 -filter2 -145 -nstruct 10 -constant_seed -jran 1626021


and ~/BOINC/projects/boinc.bakerlab.org_rosetta/ has the following files:

$ ls -lat|head
total 22508
-rw-r--r-- 1 boinc boinc 55 2006-01-15 06:40 stderr.txt
drwxr-xr-x 2 boinc boinc 4096 2006-01-15 06:40 .
-rw-r--r-- 1 boinc boinc 27258 2006-01-15 06:40 aa1ogw.out.gz
-rw-r--r-- 1 boinc boinc 0 2006-01-15 06:40 boinc_finish_called
-rw-r--r-- 1 boinc boinc 2424 2006-01-15 06:40 init_data.xml
-rw-r--r-- 1 boinc boinc 5 2006-01-15 06:40 rosetta_decoy_cnt.txt
-rw-r--r-- 1 boinc boinc 939 2006-01-15 06:40 rosetta_random.txt
-rw-r--r-- 1 boinc boinc 48711 2006-01-15 06:40 stdout.txt
-rw-r--r-- 1 boinc boinc 7 2006-01-15 06:23 aa1ogw.last_pdb

$ cat stderr.txt
Can't open init data file - running in standalone mode

$ tail stdout.txt
wobblemin trials: 76 accepts: 0 %: 0
final score: -166.453827
---------------------------------------------------
BOINC :: [2006-01-15 06:40:37] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 10 :: num_decoys: 10
BOINC :: [2006-01-15 06:40:37] :: Structure: 10 completed :: num_decoys: 10 :: total_iterations: 10 :: percent complete: 1
GZIP SILENT FILE: ./aa1ogw.out
======================================================
DONE :: 1 starting structures built 10 (nstruct) times
This process generated 10 decoys from 10 attempts
======================================================

Let me know if there's anything else to check and report back.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 9079 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 9146 - Posted: 16 Jan 2006, 18:29:12 UTC
Last modified: 16 Jan 2006, 19:04:39 UTC

A NO_SIM-ANNEAL_NO_BARCODE_2reb_243_286_0 has been running for more than 2:25:xx now and is still at 1%.
What is the right thing to do?

EDIT: Just found the instructions below, so will check that.

EDIT 2: Followed the instructions but no graphics (W2K); time is running and it is still at 1% at the moment.
When can I expect the 1% to change?


ID: 9146 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 9152 - Posted: 16 Jan 2006, 20:08:20 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=6477569

Aborted it after running for another hour.

ID: 9152 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 9157 - Posted: 16 Jan 2006, 21:38:30 UTC

For these, unless you've uncovered a different problem, you can stop BOINC and restart it and the workunit will start over and should complete normally.

ID: 9157 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 9158 - Posted: 16 Jan 2006, 21:52:53 UTC - in response to Message 9157.  
Last modified: 16 Jan 2006, 22:12:14 UTC

For these, unless you've uncovered a different problem, you can stop BOINC and restart it and the workunit will start over and should complete normally.

In this case (4 hours wasted), with six of these a day you'd better choose another project to prevent wasting idle time.
It's easier to use a program that will give an alarm or abort right away if the WU is still at 1% after 15 minutes than to take the trouble to follow the instructions.
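
The kind of alarm described here could be as simple as checking how long the progress value has sat at or below 1%. This is a minimal Python sketch under stated assumptions: the caller supplies (elapsed seconds, progress fraction) samples from whatever source is available, and `is_stuck` with its 15-minute default is an illustrative name and value, not an existing BOINC feature.

```python
def is_stuck(samples, threshold_fraction=0.01, max_seconds=15 * 60):
    """Return True if progress never rose above threshold_fraction and the
    most recent sample is at least max_seconds into the run.

    samples: list of (elapsed_seconds, progress_fraction) tuples in time order.
    """
    if not samples:
        return False
    last_elapsed, _ = samples[-1]
    never_advanced = all(p <= threshold_fraction for _, p in samples)
    return never_advanced and last_elapsed >= max_seconds


# Hypothetical samples: still at 1% after 15 minutes, versus one that moved on.
stalled = is_stuck([(60, 0.01), (900, 0.01)])
moving = is_stuck([(60, 0.01), (900, 0.05)])
```

A work unit that is merely slow would trip this check too, so the "Step:" counter in the graphics mentioned earlier in the thread is probably a better signal than the percentage alone.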

ID: 9158 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 9165 - Posted: 17 Jan 2006, 0:23:48 UTC - in response to Message 9158.  

For these, unless you've uncovered a different problem, you can stop BOINC and restart it and the workunit will start over and should complete normally.

In this case (4 hours wasted), with six of these a day you'd better choose another project to prevent wasting idle time.
It's easier to use a program that will give an alarm or abort right away if the WU is still at 1% after 15 minutes than to take the trouble to follow the instructions.


I have been reading the older posts in this thread, and may be able to shed some light on one of Paul's theories. All my machines are Macs running OS 10.4.4 and BOINC 5.2.13. One is a Dual G4, one is a Dual G5, and one is a PowerBook G4. I cannot remember ever having a WU stuck at 1%, but I have had a few that stuck at other points during processing. I had one (reported earlier) that stuck at 80%. That was on the Dual G4. That system also runs SETI, Predictor, Climate, SETI Enhanced, and occasionally Einstein.

At least in my case there seems to be no relationship between any stuck WUs and any other apps on the system. I have had a few stuck on the Dual G5 and it is only running R@H. It is interesting to note that BOINC, not R@H, actually keeps track of the CPU time and the counters. This function is somewhat dependent on what the project estimates as the time a particular WU will take. That said, if the WU sticks at 1%, one has to wonder whether that is actually what is happening or whether it is simply a failure of BOINC to update properly, and whether it might complete if left alone.

In my experience, when a WU sticks (again I never see this at 1%) there will always be a kernel process running that was started by "root" that starts to eat up CPU cycles. This is usually how I can tell if the WU is stuck or just taking a long time. If I do not see this kernel process running and eating up the system, then the WU is not stuck and I leave it alone. If I do see it, I will shut down BOINC, abort the kernel process (it won't stop on its own), and restart BOINC, and usually the WU will run to completion, picking up at the last checkpoint.

I have to assume that since the WU started, it will complete with the same random seed, because that should have been stored on initialization. So if that is true, it is not likely a seed problem. At least on my system, processing does not actually stop. It is as though the program has managed to find a tight loop somewhere. So far I have been unable to capture any stats on this kernel process, but I do know it never changes state, it uses a lot of CPU percent, it does some paging, and it makes no machine calls.
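
The check described above (look for a root-owned process eating CPU) can be sketched by filtering `ps` output. This is a minimal Python sketch, assuming the text comes from `ps -A -o user,%cpu,comm`. `busy_root_processes` is a hypothetical helper, the 50% threshold is arbitrary, and the sample rows are invented for illustration rather than captured from a real machine.

```python
def busy_root_processes(ps_output, min_cpu=50.0):
    """Return (command, cpu_percent) for root-owned rows at or above min_cpu.

    ps_output: text with a header row followed by 'user %cpu command' rows.
    """
    hits = []
    for row in ps_output.strip().splitlines()[1:]:  # skip the header row
        cols = row.split(None, 2)  # the command itself may contain spaces
        if len(cols) < 3:
            continue
        user, cpu, command = cols
        try:
            cpu_value = float(cpu)
        except ValueError:
            continue  # ignore rows whose CPU column is not a plain number
        if user == "root" and cpu_value >= min_cpu:
            hits.append((command.strip(), cpu_value))
    return hits


# Invented sample output, for illustration only.
sample = """USER  %CPU COMM
root  92.5 kernel_task
phil   3.1 rosetta
root   0.2 launchd
"""
busy = busy_root_processes(sample)
```

To run it against a live system, the text could come from `subprocess.run(["ps", "-A", "-o", "user,%cpu,comm"], capture_output=True, text=True).stdout`.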

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
ID: 9165 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 9
Message 9174 - Posted: 17 Jan 2006, 4:35:28 UTC

This "kernel_task" that takes CPU time - recent thread in SETI in Q&A-Mac; might look there...

ID: 9174 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 9175 - Posted: 17 Jan 2006, 5:21:48 UTC - in response to Message 9174.  

This "kernel_task" that takes CPU time - recent thread in SETI in Q&A-Mac; might look there...


Well Bill, I would but I can't find it. It looks like all the thousands of Mac folks just coming into BOINC are having all the same startup problems the rest of us had, all at once. It looks like a free for all over there. Kind of makes me want to write something to try to help them out. From what I can tell most of them just don't quite understand how to make the thing run in the first place.

Regards
Phil

We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
ID: 9175 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,362
RAC: 9
Message 9201 - Posted: 17 Jan 2006, 13:40:02 UTC - in response to Message 9175.  

This "kernel_task" that takes CPU time - recent thread in SETI in Q&A-Mac; might look there...
Well Bill, I would but I can't find it.


Sorry - I sent you to the wrong place. Here is the thread I had in mind, on the BOINC boards instead of SETI - but it's not going to tell you anything new.

ID: 9201 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
premier

Joined: 30 Dec 05
Posts: 14
Credit: 23,872,868
RAC: 0
Message 9222 - Posted: 17 Jan 2006, 19:39:54 UTC

My 1% stuck WU:
2006-01-17 12:48:14|rosetta@home|Starting result PRODUCTION_ABINITIO_1bq9A_250_543_0 using rosetta version 481

It had been running for almost 7 hours. It froze at step 21933. Suspending and starting again unfortunately didn't solve the problem. I closed the BOINC client (Windows version) and opened it again, and guess what: the WU passed step 21933 and the whole WU was processed in 1 hr 15 min. The bad thing is I lost 7 hours doing nothing (and lost credit) :( And this is not my first 1% stuck WU. I have the stdout.txt from the stuck unit saved. If you want it, I can mail it.

One more thing. After restarting BOINC I compared the stdout.txt from the WU that got stuck with the one from the same WU when it completed successfully. Up to the line containing the random seed they were almost identical, except for one line:

WARNING: check_decoy_exists: unexpected decoy number: start#,decoy#,lastdecoy# 1 1 40 0
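
A comparison like this can be repeated with Python's standard `difflib`. This is a minimal sketch: `differing_lines` is an illustrative helper, the sample log lines other than the warning are invented, and putting the warning in the stuck log is an arbitrary choice, since the post does not say which file contained it.

```python
import difflib


def differing_lines(stuck_lines, good_lines):
    """Return lines present in only one of the two logs.

    Each result is prefixed '- ' (only in the stuck log) or '+ ' (only in
    the good log); lines common to both are dropped.
    """
    diff = difflib.ndiff(stuck_lines, good_lines)
    return [d for d in diff if d.startswith(("- ", "+ "))]


# Invented stand-ins for the two stdout.txt files, apart from the warning.
warning = ("WARNING: check_decoy_exists: unexpected decoy number: "
           "start#,decoy#,lastdecoy# 1 1 40 0")
stuck_log = ["header line", warning, "random seed line"]
good_log = ["header line", "random seed line"]

only_in_one = differing_lines(stuck_log, good_log)
```

With real files, each list could be built with `open(path).read().splitlines()` before calling the helper.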

ID: 9222 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 9234 - Posted: 17 Jan 2006, 21:55:12 UTC

premier,

Can you email me both stdout.txt files? dekim at u dot washington dot edu
ID: 9234 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote



©2024 University of Washington
https://www.bakerlab.org