Help us solve the 1% bug!

Message boards : Number crunching : Help us solve the 1% bug!

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 8802 - Posted: 11 Jan 2006, 20:49:23 UTC

To solve this problem, we need to determine whether it is due to a bug in rosetta or to a problem with boinc or the boinc/rosetta interface.
The way to answer this is to run exactly the same job that got stuck in boinc on the same machine, but standalone (not inside boinc), and see whether it gets stuck. If it still gets stuck, it is a rosetta problem; if not, it is a boinc or boinc/rosetta problem. Let us know!

David Kim has written up instructions on how to do this. If you have time to do the diagnosis, this will help us out a lot! Here they are:



If you notice your work unit stuck at "1%" for over an hour (the infamous 1% bug) and would like to help us determine its cause, please follow the instructions below. We would like to know if the bug is reproducible when running the same command without using the boinc client.


1. Stop or suspend the boinc client.
2. Find the stdout.txt file in the slot directory where the work unit was being run.

The "slots" directory should be located in your boinc installation directory:

For windows: C:\Program Files\BOINC
For mac: /Library/Application Support/BOINC Data/
For linux: in your BOINC installation directory

In the slots directory you should see one or more directories with numbers as names (like 0, 1, ...etc).
One of these directories will be where the work unit was being run and you should see a stdout.txt file.
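If you would rather not browse the folders by hand, the lookup in step 2 can be sketched in Python. This is only an illustration: the function name find_slot_stdout_files is made up here, and the directory you pass in is whatever your own BOINC installation directory is (the sketch assumes nothing about it beyond the "slots" layout described above).

```python
from pathlib import Path


def find_slot_stdout_files(boinc_dir):
    """Return the stdout.txt files found in numbered slot directories.

    boinc_dir is the BOINC installation directory (an input you supply).
    The helper name is hypothetical; it is not part of BOINC or rosetta.
    """
    slots = Path(boinc_dir) / "slots"
    if not slots.is_dir():
        return []
    # Slot directories have numbers as names (0, 1, ...).
    numbered = [p for p in slots.iterdir() if p.is_dir() and p.name.isdigit()]
    found = []
    for slot in sorted(numbered, key=lambda p: int(p.name)):
        candidate = slot / "stdout.txt"
        if candidate.is_file():
            found.append(candidate)
    return found
```

If more than one slot contains a stdout.txt, open each one and check which command line mentions the work unit that was stuck.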

3. Get the command and the random number seed from the stdout.txt file.

Open the stdout.txt file with a text editor. The command is on the second line, which starts with "command executed:".

example:

command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.81_windows_intelx86.exe aa 1n0u _ -silent -abrelax_mode -farlx -ex1 -ex2 -rand_envpair_res_wt -rand_SS_wt -ssblocks -barcode_from_fragments -barcode_mode 3 -barcode_from_fragments_length 20 -nstruct 10 -jitter_frag -jitter_amount 2 -relax_score_filter -filter1 48 -filter2 -81 -output_silent_gz -output_chi_silent -vary_omega -stringent_relax

Save the command line (text after "command executed: ") somewhere you can get back to easily.

The random number seed will be located further down the stdout.txt file in a line that starts with "# random seed:".

example:

# random seed: 324481

Save the seed value somewhere you can get back to easily.
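For anyone who prefers a script to a text editor, step 3 could look roughly like the Python sketch below. The function name parse_stdout is hypothetical. It relies only on the two line prefixes described above, and it collects every seed it finds, on the assumption that a file may hold more than one (a later post in this thread reports that restarting BOINC appends a new seed to stdout.txt).

```python
COMMAND_PREFIX = "command executed:"
SEED_PREFIX = "# random seed:"


def parse_stdout(text):
    """Extract the logged command line and all random seeds from stdout.txt text.

    Returns (command, seeds): command is the text after "command executed:"
    (None if absent), seeds is a list of integers in file order.
    """
    command = None
    seeds = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith(COMMAND_PREFIX) and command is None:
            # Keep the first command line found.
            command = line[len(COMMAND_PREFIX):].strip()
        elif line.startswith(SEED_PREFIX):
            seeds.append(int(line[len(SEED_PREFIX):].strip()))
    return command, seeds
```

If the list contains several seeds, the one to reuse is the seed of the run that actually got stuck, which should be the last entry as long as the work unit was not restarted afterwards.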

4. Using a command terminal (for windows you can open a "Command Prompt" or use Cygwin) go to the "boinc.bakerlab.org_rosetta" project directory in the "projects" directory located in your boinc installation directory.

You should see all input files and executable(s) for the rosetta work units if you list the directory contents.

5. Run the rosetta executable in this directory using the original arguments in the command line after adding the random seed to the command.

To add the random seed to the command, append the following: -constant_seed -jran 324481
where the example seed 324481 is replaced with your seed.

final example command to be run:

rosetta_4.81_windows_intelx86.exe aa 1n0u _ -silent -abrelax_mode -farlx -ex1 -ex2 -rand_envpair_res_wt -rand_SS_wt -ssblocks -barcode_from_fragments -barcode_mode 3 -barcode_from_fragments_length 20 -nstruct 10 -jitter_frag -jitter_amount 2 -relax_score_filter -filter1 48 -filter2 -81 -output_silent_gz -output_chi_silent -vary_omega -stringent_relax -constant_seed -jran 324481
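The edit in step 5 (drop the leading project path, keep the original arguments, append the seed flags) can be sketched in Python as follows. The function name build_standalone_command is made up for illustration, and the sketch assumes that no argument contains spaces, which holds for the example commands in this thread.

```python
def build_standalone_command(logged_command, seed):
    """Build the argument list for a standalone run from the logged command.

    logged_command is the text after "command executed:" in stdout.txt;
    seed is the value from the "# random seed:" line.
    """
    args = logged_command.split()
    # The command is run from inside the project directory, so keep only
    # the executable's file name, not the projects/... path in front of it.
    exe = args[0].replace("\\", "/").rsplit("/", 1)[-1]
    # Append the seed flags given in the instructions above.
    return [exe] + args[1:] + ["-constant_seed", "-jran", str(seed)]
```

Joining the result with spaces, for example " ".join(build_standalone_command(command, 324481)), gives a line that can be pasted into the terminal opened in step 4.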


6. If you are using Windows, you should see the graphics window and see the work unit's progress.

If you are on Mac or Linux, the progress can be seen by monitoring the stdout.txt file that gets generated in the same directory.
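One way to monitor the file without reading it by eye is sketched below in Python. It looks for the "percent complete:" field as it appears in a stdout.txt excerpt posted later in this thread; whether a standalone run writes the same line, and whether 0.01 there corresponds to the "1%" shown by the client, are assumptions. The function name latest_percent_complete is hypothetical.

```python
import re

# Matches e.g. "percent complete: 0.01" anywhere in a log line.
_PROGRESS = re.compile(r"percent complete:\s*([0-9]*\.?[0-9]+)")


def latest_percent_complete(text):
    """Return the last 'percent complete' value found in the text, or None."""
    matches = _PROGRESS.findall(text)
    return float(matches[-1]) if matches else None
```

Reading stdout.txt every few minutes and comparing the returned value with the previous one shows whether the run is still advancing.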


Please let us know if rosetta gets stuck again at 1% or if it continues to run like a normal work unit.

Thank you.
ID: 8802 · Rating: 3
Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 14,945,062
RAC: 0
Message 8842 - Posted: 12 Jan 2006, 13:38:25 UTC

David,

for our understanding: If a WU is stuck at 1% and BOINC (and Rosetta) are restarted, is the random seed normally the same, or does it change on restart?




Supporting BOINC, a great concept !
ID: 8842 · Rating: 0
Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 8882 - Posted: 12 Jan 2006, 21:14:12 UTC
Last modified: 12 Jan 2006, 21:27:12 UTC

So have the 1% bugs gone into hiding, now that the Baker Lab is going after them? ;-)

I certainly would want to help by performing the suggested diagnostic tests - however, I haven't encountered a single 1% bug among my 1000+ processed Rosetta WUs so far. In fact, the only errors I had were the ones everyone else was having over the Holidays, plus, initially, a couple of errors caused by local problems (use of an obsolete BOINC client version, unstable memory). Oh, and I run Rosetta on Linux 24 hours a day (so no switching between projects, suspensions, shutdowns, etc).
ID: 8882 · Rating: 0
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 8893 - Posted: 13 Jan 2006, 1:41:52 UTC
Last modified: 13 Jan 2006, 1:54:52 UTC

Found a stuck workunit just a few minutes ago (my first). 10 hours @ 1%, frozen graphics at step 25567. I ran the custom command (I hope I did it properly, please check my work) as shown:



and I got:



I think (!) I did it right...

EDIT: I just restarted BOINC, the result started over, back to zero CPU time, and blew right by the step number it was stuck on before.

---------stdout.txt------------

[2006-01-12 09:52:57] :: BOINC :: boinc_init()
command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.81_windows_intelx86.exe xx 2vik _ -output_silent_gz -silent -increase_cycles 10 -nstruct 10
[STR OPT]Default value for [-paths] paths.txt.
[T/F OPT]Default FALSE value for [-unix_paths]
--------------------------------------------
WARNING:: paths.txt file not found!!
Setting all paths to .
Using default fragment file names:
aa*****03_05.200_v1_3
aa*****03_05.200_v1_3
--------------------------------------------
[T/F OPT]Default FALSE value for [-version]
[T/F OPT]Default FALSE value for [-score]
[T/F OPT]Default FALSE value for [-abinitio]
[T/F OPT]Default FALSE value for [-refine]
[T/F OPT]Default FALSE value for [-assemble]
[T/F OPT]Default FALSE value for [-idealize]
[T/F OPT]Default FALSE value for [-relax]
[T/F OPT]Default FALSE value for [-abrelax_mode]
[T/F OPT]Default FALSE value for [-abrelax]
[T/F OPT]Default FALSE value for [-design]
[T/F OPT]Default FALSE value for [-dock]
[T/F OPT]Default FALSE value for [-membrane]
[T/F OPT]Default FALSE value for [-loops]
[T/F OPT]Default FALSE value for [-pdbstats]
[T/F OPT]Default FALSE value for [-interface]
[T/F OPT]Default FALSE value for [-pose1]
[T/F OPT]Default FALSE value for [-bk_min]
[T/F OPT]Default FALSE value for [-jumping]
[T/F OPT]Default FALSE value for [-extract]
[T/F OPT]Default FALSE value for [-close_chainbreaks]
[T/F OPT]Default FALSE value for [-barcode_stats]
Rosetta mode: abinitio
[T/F OPT]Default FALSE value for [-chain]
[T/F OPT]Default FALSE value for [-protein]
[T/F OPT]Default FALSE value for [-series]
series_code = xx :: protein_name is 2vik:: chain_id is _.
[INT OPT]New value for [-nstruct] 10
[T/F OPT]Default FALSE value for [-use_pdbseq]
[T/F OPT]Default FALSE value for [-read_all_chains]
[T/F OPT]Default FALSE value for [-use_pdb_numbering]
[T/F OPT]Default FALSE value for [-fa_input]
[T/F OPT]Default FALSE value for [-fa_output]
[T/F OPT]Default FALSE value for [-overwrite]
[T/F OPT]Default FALSE value for [-no_filters]
[T/F OPT]Default FALSE value for [-output_pdb_gz]
[T/F OPT]New TRUE value for [-output_silent_gz]
[T/F OPT]Default FALSE value for [-output_scorefile_gz]
[T/F OPT]Default FALSE value for [-termini]
[T/F OPT]Default FALSE value for [-Nterminus]
[T/F OPT]Default FALSE value for [-Cterminus]
[T/F OPT]Default FALSE value for [-use_trie]
[T/F OPT]Default FALSE value for [-no_trie]
[T/F OPT]Default FALSE value for [-trials_trie]
[T/F OPT]Default FALSE value for [-no_trials_trie]
[T/F OPT]Default FALSE value for [-read_interaction_graph]
[T/F OPT]Default FALSE value for [-write_interaction_graph]
[STR OPT]Default value for [-ig_file] .
[T/F OPT]Default FALSE value for [-silent_input]
[T/F OPT]Default FALSE value for [-output_chi_silent]
[T/F OPT]Default FALSE value for [-timer]
[T/F OPT]Default FALSE value for [-count_attempts]
[T/F OPT]Default FALSE value for [-status]
[T/F OPT]Default FALSE value for [-ise_movie]
[T/F OPT]Default FALSE value for [-output_all]
[T/F OPT]Default FALSE value for [-skip_missing_residues]
[STR OPT]Default value for [-cst] cst.
[STR OPT]Default value for [-dpl] dpl.
[STR OPT]Default value for [-resfile] none.
[T/F OPT]Default FALSE value for [-auto_resfile]
[T/F OPT]Default FALSE value for [-chain_inc]
[T/F OPT]Default FALSE value for [-full_filename]
[T/F OPT]Default FALSE value for [-map_sequence]
[INT OPT]Default value for [-max_frags] 200
[T/F OPT]Default FALSE value for [-enable_dna]
[T/F OPT]Default FALSE value for [-loops]
[T/F OPT]Default FALSE value for [-taboo]
[T/F OPT]Default FALSE value for [-multi_chain]
[T/F OPT]Default FALSE value for [-score_contact_flag]
[T/F OPT]Default FALSE value for [-score_contact_weight]
[T/F OPT]Default FALSE value for [-score_contact_threshold]
[T/F OPT]Default FALSE value for [-scorefxn]
default centroid scorefxn: 4
default fullatom scorefxn: 12
[INT OPT]Default value for [-run_level] 0
[T/F OPT]New TRUE value for [-silent]
run level: -4
[T/F OPT]Default FALSE value for [-benchmark]
[T/F OPT]Default FALSE value for [-debug]
[T/F OPT]Default FALSE value for [-ligand]
[T/F OPT]Default FALSE value for [-enzyme_design]
[STR OPT]Default value for [-s] none.
[STR OPT]Default value for [-l] none.
Reading .Rama_smooth_dyn.dat_ss_6.4.gz
Reading .phi.theta.36.HS.resmooth.gz
Reading .phi.theta.36.SS.resmooth.gz
[STR OPT]Default value for [-atom_vdw_set] default.
[T/F OPT]Default FALSE value for [-IUPAC]
Atom_mode set to all
Reading .paircutoffs.gz
[T/F OPT]Default FALSE value for [-decoystats]
set_decoystats_flag: from,to F F
[T/F OPT]Default FALSE value for [-decoyfeatures]
BOINC :: [2006-01-12 09:52:58] :: Total iterations: 10 :: mode: abinitio :: nstartnm: 1 :: number_of_output: 10 :: num_decoys: 0 :: percent complete: 0.01
Searching for dat file: .2vik.dat
Searching for dat file: .2vik.dat
WARNING!! .dat file not found!
Looking for fasta file: .2vik_.fasta
[T/F OPT]Default FALSE value for [-find_disulf]
[T/F OPT]Default FALSE value for [-fix_disulf]
Looking for psipred file: .2vik_.psipred_ss2
Protein type: alpha/beta Fraction beta: 0.584615409
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
WARNING: CONSTRAINT FILE NOT FOUND
Searched for: .2vik_.cst
Running without distance constraints
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
WARNING: DIPOLAR CONSTRAINT FILE NOT FOUND
Searched for: .2vik_.dpl
Dipolar constraints will not be used
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
fragment file: .aa2vik_03_05.200_v1_3.gz
Total Residue 122
frag size: 3 frags/residue: 200
fragment file: .aa2vik_09_05.200_v1_3.gz
Total Residue 122
frag size: 9 frags/residue: 200
generating 1mer library from 3mer library
[T/F OPT]Default FALSE value for [-ssblocks]
[T/F OPT]Default FALSE value for [-check_homs]
calculating fragment_diversity...
[T/F OPT]Default FALSE value for [-barcode_mode]
RG cutoff: 16.8790283
[REAL OPT]Default value for [-co] -1
Contact order cutoff: 19.9639988
[REAL OPT]Default value for [-rms] -1
Searching for pdb...: .2vik.pdb
Looking for dssp file: .2vik.dssp
dssp file not found
Looking for secondary structure assignment file: .2vik_.ssa
ssa file not found
calculating secondary structure from torsion angles
[REAL OPT]Default value for [-parallel_weight] 1
[REAL OPT]Default value for [-antiparallel_weight] 1
[T/F OPT]Default FALSE value for [-new_centroid_packing]
[REAL OPT]Default value for [-cb_weight] 1
[T/F OPT]Default FALSE value for [-separate_centroid_pack_score]
[T/F OPT]Default FALSE value for [-repeatin]
[T/F OPT]Default FALSE value for [-repeatout]
NEXT STRUCTURE: .xx2vik0001.pdb
[INT OPT]Default value for [-number_3mer_frags] 200
[INT OPT]Default value for [-number_9mer_frags] 25
[REAL OPT]New value for [-increase_cycles] 10
[T/F OPT]Default FALSE value for [-just_smooth_cycles]
[T/F OPT]Default FALSE value for [-rand_envpair_res_wt]
[T/F OPT]Default FALSE value for [-rand_SS_wt]
[T/F OPT]Default FALSE value for [-random_parallel_antiparallel]
[T/F OPT]Default FALSE value for [-rand_cst_res_wt]
[T/F OPT]Default FALSE value for [-random_frag]
starting fragment insertions...
[T/F OPT]Default FALSE value for [-constant_seed]
[INT OPT]Default value for [-seed_offset] 0
# =====================================
# random seed: 712181
# =====================================
[T/F OPT]Default FALSE value for [-jitter_frag]
score0 done: (best, low) rms
0 0 33.0176163
---------------------------------------------------------
score1 done: (best, low) rms (best,low)
13.8851242 -6.62228107 21.5588932 15.6374063
standard trials: 20000 accepts: 913 %: 4.565
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 0.371 14.968 0.371 36 15.637 13.149 15.637
[REAL OPT]Default value for [-cpu_frac] 0.100000001
[REAL OPT]Default value for [-frame_rate] 10

ID: 8893 · Rating: 0
James
Joined: 8 Jan 06
Posts: 21
Credit: 11,697
RAC: 0
Message 8894 - Posted: 13 Jan 2006, 1:55:00 UTC

It doesn't actually bother me - but every WU I've done has the 1 percent issue. When I get some time I'll run the tests...seeing as I have lots of opportunities.
ID: 8894 · Rating: 0
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 8897 - Posted: 13 Jan 2006, 2:22:09 UTC
Last modified: 13 Jan 2006, 2:50:59 UTC

Whoops! I ran the application from boinc\slots instead of boinc\projects\boinc.bakerlab.org_rosetta. I'm running it from the proper folder now (it launched the graphics window). It seems to be running normally so far; it has passed step 25567, which it originally froze on.

EDIT: This workunit ran standalone past 1% with no problems.

BOINC was restarted, the WU started over at 0 CPU time and stdout.txt was appended with a new random seed.
ID: 8897 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 8908 - Posted: 13 Jan 2006, 5:41:07 UTC - in response to Message 8842.  

"David,

for our understanding: If a WU is stuck at 1% and BOINC (and Rosetta) are restarted, is the random seed normally the same or does this number change by restarts ?"




I'm glad you asked--this is a confusing point!

In the jobs we are sending out now, the random seed is taken from the clock time when you start the job. Rosetta writes the seed to the log file so that the exact job can be reproduced if need be. To run rosetta with a specific random seed, specify -constant_seed -jran xxxx on the command line, where xxxx is the random seed.

So, if you restart rosetta and boinc, you will be using a new random seed unless -constant_seed is present in the argument list (it isn't now, but as I just said on another thread, we will change this to avoid duplication).
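The difference between a clock-derived seed and a fixed one can be illustrated with a small Python sketch. It uses Python's own random module as a stand-in and says nothing about how rosetta's random number generator works internally; sample_run is a made-up helper.

```python
import random
import time


def sample_run(seed, n=5):
    """Simulate a 'job' as a short sequence of pseudo-random numbers."""
    rng = random.Random(seed)  # independent generator with the given seed
    return [rng.randint(0, 999) for _ in range(n)]


# Seed taken from the clock: a restart at a different time generally
# yields a different seed and therefore a different trajectory.
clock_seed = int(time.time())
print("clock-seeded run:", clock_seed, sample_run(clock_seed))

# Fixed seed (the example value from the instructions): every run
# repeats exactly the same sequence.
print("fixed-seed run:  ", sample_run(324481))
print("fixed-seed rerun:", sample_run(324481))
```

This is why a stuck work unit may sail through after a restart inside boinc, while a standalone run with the logged seed is the only way to replay the exact trajectory that got stuck.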
ID: 8908 · Rating: 1
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 8909 - Posted: 13 Jan 2006, 5:48:58 UTC - in response to Message 8894.  

"It doesn't actually bother me - but every WU i've done has the 1 percent issue. When I get some time I'll run the tests...seeing as I have lots of opportunities."



Thanks for all of your help! From the results in this thread, the problem seems to involve BOINC on specific computers. Evidence:

(1) James has the problem with every WU

(2) Holderlin has not had the problem on > 1000 WU

(3) Backslash ran a problem WU outside of boinc and it worked fine


So I don't think it is a rosetta problem (bug) per se, and maybe it doesn't happen with boinc on linux.
Does anything similar happen with WU from other projects? Maybe James's computer can provide a clue--why does this happen to all of his WU?



ID: 8909 · Rating: 0
Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 14,945,062
RAC: 0
Message 8921 - Posted: 13 Jan 2006, 9:36:49 UTC

David,

I keep coming back to a different possible cause :-(

Could it be that the problem is related to something like the change from one WU to the next, with rosetta not clearing all internal buffers / variables / tmp ...?

As you can easily see in my sig, I'm running a lot of different projects; when other projects had something similar, a reason could be found in the project client. Since last summer, I haven't seen such a problem with a non-rosetta application, so I guess it must have something to do with rosetta ...



Supporting BOINC, a great concept !
ID: 8921 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 8932 - Posted: 13 Jan 2006, 12:40:41 UTC
Last modified: 13 Jan 2006, 12:47:38 UTC

David,

I hate to rain on the parade, but I have had a variety of stuck-at-1% work units over time, and the system varied. I cannot recall whether I ever had one on OS-X. I think I did, because I was annoyed that I could not take the screenshot ...

However, in the last few weeks I have not had one yet ...

And ALL of the ones restarted completed normally. Of course, this is most likely because the seed changed. Now that we have more directions, if I get one I will certainly try to get more data for you. But I am not as convinced that it is a "pure" BOINC problem ...

Questions about the 100% failure rate:
All the same WU name?
What other projects are running on the systems?
Are the systems stand alone, BOINC only? Or do they do other work?
What OS+SP?

and so on.

One thought DID just occur to me. I hate to perhaps seem biased, but are those seeing the 1% problem running Predictor@Home? Just thinking about it, I have stopped doing PPAH.

Easy to answer, is anyone getting 1% errors commonly NOT running PPAH?

==== edit

This means that it COULD be a BOINC only problem, but caused by interactions with various projects ... something I predicted would happen back in BOINC Beta when we were only doing SETI@Home ... has it finally happened? :)

==== edit #2

The seed MUST be written out as part of the error data so that it is returned with the result. THEN, in the future, if this is an issue, we have the data to reproduce it. This thought occurred to me as I read the other thread where the participant aborted the work ... now we have nothing ... to my mind this is a fix that should be done soon, like TODAY ... Einstein@Home has been putting some state information like this into the output field and it has proved useful in understanding a run's events ...
ID: 8932 · Rating: 0
Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 14,945,062
RAC: 0
Message 8933 - Posted: 13 Jan 2006, 12:45:36 UTC - in response to Message 8932.  

"Easy to answer, is anyone getting 1% errors commonly NOT running PPAH?"

Hm, if I remember right, I saw the 1% problem on boxes that had no PPAH units.



Supporting BOINC, a great concept !
ID: 8933 · Rating: 0
STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 4,100,301
RAC: 114
Message 8938 - Posted: 13 Jan 2006, 13:55:53 UTC - in response to Message 8932.  
Last modified: 13 Jan 2006, 13:56:31 UTC

"One thought DID just occur to me. I hate to perhaps seem biased, but are those seeing the 1% problem running Predictor@Home? Just thinking about it, I have stopped doing PPAH.

Easy to answer, is anyone getting 1% errors commonly NOT running PPAH?"


Paul, I have not run Predictor@home since starting out at the Rosetta project. I was getting the 1% error quite a bit some time ago, but like you I haven't had one that I can recall for at least the last 2-3 weeks.

The only problem I have right now is that sometimes when I suspend a WU it gives me a Computation Error, so these days I try not to suspend or shut down the BOINC Manager and exit BOINC until the Rosetta WU has finished, to minimize the Computation Errors ...
ID: 8938 · Rating: 0
Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 8947 - Posted: 13 Jan 2006, 15:34:46 UTC

I don't know if this info can be of any help, but here it is anyway.

I have never had a 1% stuck WU. With my old 4.72 client, I thought I had one, so I aborted it and sent the stdout file to David Kim, who answered that it had been about to end the first stage, so if I hadn't aborted it, it would have continued.

When I could use the 5.* client in LHC, I upgraded to the recommended client, 5.* something, and some of the Rosetta WUs sometimes looked stuck in that, but I could always jumpstart them by exiting the BOINC manager and starting it again. Then they continued fine.

I run the 5.3.2 client at the moment. My former teammate Ageless told me that it is stable, so I installed it, and I must say it's very stable (Thanks Ageless :-) ). I haven't had any problems at all with my Rosetta WUs and they all run very smoothly. (Except for the only DEFAULT_xxx_205_ I got, and the few ones that crashed after a few seconds lately)

So maybe it has something to do with the BOINC client?


"I'm trying to maintain a shred of dignity in this world." - Me

ID: 8947 · Rating: 0
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 8976 - Posted: 13 Jan 2006, 23:43:45 UTC - in response to Message 8932.  

"However, in the last few weeks I have not had one yet ..."

I never had one at all until last night.

"And ALL of the ones restarted completed normally. Of course, this is most likely because the seed changed."

It restarted normally in BOINC for me as well with a new seed. With the procedure that David gave, it also ran normally with the same seed used.

"Easy to answer, is anyone getting 1% errors commonly NOT running PPAH?"

I'm not running PPAH, but I am not getting 1% errors commonly :)

ID: 8976 · Rating: 0
Hoogie
Joined: 4 Nov 05
Posts: 13
Credit: 1,572,894
RAC: 0
Message 8979 - Posted: 14 Jan 2006, 0:44:27 UTC

I've had 2 of these recently on different computers. They both just stopped running, one at around 6000 steps, and the second at 20995 steps. I followed David Baker's instructions and ran the second one without BOINC. It ran without incident. The first one also ran to completion after I suspended and resumed it; it started from the beginning. If I resume the second one, still in the queue, it does not run, but stays at step 20995.
ID: 8979 · Rating: 0
James
Joined: 8 Jan 06
Posts: 21
Credit: 11,697
RAC: 0
Message 8981 - Posted: 14 Jan 2006, 1:41:04 UTC

Okay, I still haven't done what you want yet, but I've been getting short WUs and little sleep.

However I did want to mention that from when I actually look at the client it gets 'stuck' on one for awhile, jumps to 10, then to 20. Gets stuck on 20 for a bit, moves to 60, 70, 80 pretty quickly. Stuck on 80 then short 90 then done. That's about the entire deal. Note: One of the reasons this is a non-issue from my standpoint is that the 'cpu time' and 'to completion' both work in the 4 second increments. As such, I know the client is working. the percentage thing is really more an aesthetic thing.

Personally, I don't find it to be a problem because I'm still doing WUs and doing them as efficiently as those who don't have the problem. It would only 'bother' me if the 1 percent deal actually had some adverse consequences.

As for other boinc projects, no, this is the first encounter with the 1 percent issue. I've done predictor and seti a small amount and close to a year on einstein with zero issues there.
I think it's probably the way boinc interacts with rosetta code. It appears to be computer-specific, but it's also widespread. It appears to be platform/compilation specific.

I do sort of wonder if the 'stickiness' has something to do with the graphical interface. Specifically, the freeze ends when the protein folding enters a new stage.
ID: 8981 · Rating: 0
James
Joined: 8 Jan 06
Posts: 21
Credit: 11,697
RAC: 0
Message 8983 - Posted: 14 Jan 2006, 1:50:24 UTC - in response to Message 8981.  

Client I run: 5.2.13, i.e., the 'recommended version'.

I haven't followed the discussion on this closely. I was, and am, assuming that the stickiness issue has to do with long periods stuck at a specific percentage *but* the client is still crunching. If this is referring to a lack of crunching then it has nothing to do with me. Also, someone below mentioned they don't get the graphical interface for rosetta when they're 'stuck'. I do.

So, to be clear: When I am 'stuck' the client states that I'm at 1 percent for a long time, 10 percent for a bit, 20 percent for awhile, 50 percent sometimes, 60, 70, then 80 for awhile, 90 quickly and then done. The rosetta interface is always accessible if I want to look at the foldings.

My main reason for a lack of real concern is that the stickiness factor, for me, appears to be tied to the stages, i.e., it gets 'unstuck' when the folding enters a new stage.

Again, if the 'bug' you are referring to has to do with a complete stop in work then no, that's not me. But I did notice below that graphics aren't working for some, etc.

Also, I have never had a termination problem. All WUs have been completed.

I switched to Rosetta because of the cause and my own interests - MPH from UW, MD from elsewhere.

Anyway, hope that helps.
ID: 8983 · Rating: 0
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 8986 - Posted: 14 Jan 2006, 3:49:39 UTC - in response to Message 8983.  

"So, to be clear: When I am 'stuck' the client states that I'm at 1 percent for a long time, 10 percent for a bit, 20 percent for awhile, 50 percent sometimes, 60, 70, then 80 for awhile, 90 quickly and then done. The rosetta interface is always accessible if I want to look at the foldings."


Right, that's normal behavior.

ID: 8986 · Rating: 0
genes
Joined: 8 Oct 05
Posts: 60
Credit: 702,872
RAC: 1,035
Message 9008 - Posted: 14 Jan 2006, 14:20:50 UTC
Last modified: 14 Jan 2006, 15:08:40 UTC

OK, I've got one: stuck at 1%, 20+ hours of CPU on a P3 1GHz dual, running WinXP SP2, BOINC 5.2.15 client and 4 other BOINC projects (S@H, S@H Enhanced, E@H, and CPDN). I've suspended the WU, stopped BOINC, and I'll run the tests.
----
[edit]
When I ran it from the command prompt, it ran (and is continuing to run) normally. It's at 20% now.

WU name: PRODUCTION_ABINITIO_1iibA_239_573_0

WU link:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=5159485
----
[edit]
The command line that I used:
projects/boinc.bakerlab.org_rosetta/rosetta_4.81_windows_intelx86.exe xx 1iib A -output_silent_gz -silent -increase_cycles 10 -nstruct 10 -constant_seed -jran 1248601
[/edit]

I'm going to stop the command-line app and let it run again normally.
[/edit]

ID: 9008 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 9010 - Posted: 14 Jan 2006, 15:07:15 UTC

Well, good news and bad.

I got one that was hung. I could not tell for sure what it was doing at the time, as I did not have graphics enabled. Worse, I could not tell which of the two slots was the one in "trouble". The good news is that I DID save the slots directories and have them in a zip.

So, if you are interested in the contents let me know where you want the stuff sent ...

p.d.buck@comcast.net

I did not try the diagnostics as this was a remote computer and trying to do things over VNC is not always reliable ... or easy ...
ID: 9010 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org