Rosetta 4.1+ and 4.2+

Message boards : Number crunching : Rosetta 4.1+ and 4.2+

sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 96100 - Posted: 5 May 2020, 10:24:50 UTC - in response to Message 96097.  
Last modified: 5 May 2020, 10:25:11 UTC

3 finish file present too long errors on Pi4 Rosetta v4.20 aarch64-unknown-linux-gnu

That’s a BOINC issue. Getting BOINC 7.16.5 or later might help. They extended the time limit before BOINC complains about the files still being in the slot directory.
thanks, i'll check that out
ID: 96100
reindl
Joined: 31 Mar 20
Posts: 1
Credit: 1,765,751
RAC: 0
Message 96108 - Posted: 5 May 2020, 12:47:36 UTC - in response to Message 96086.  
Last modified: 5 May 2020, 12:49:05 UTC

Can you reduce the size of tasks for Android phones? I have a Samsung S20 equipped with Qualcomm's flagship Snapdragon 865 processor, and it can take more than half a day to finish one task. The deadline is set to about 3 days after the task is downloaded. I have to keep my phone charging for most of the day to finish the tasks I receive. This is not reasonable and puts a lot of pressure on me. So, could you please reduce the size of each task? Thanks


There are 2 things you can do:

    1. Get the 7.16.3 Android app and set the buffer to 0 or close to 0
    2. Go to your settings and create a separate profile for your phones with a shorter target runtime

ID: 96108
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 96187 - Posted: 6 May 2020, 23:36:36 UTC

Can the servers be updated such that a wingman is only created once the originally created task is unable to report results? Otherwise the first host reports late but still gets in before the second one, and then the second one reports results for the same WU anyway. See discussion here, and sample wu here
Rosetta Moderator: Mod.Sense
ID: 96187
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,773,598
RAC: 22,870
Message 96210 - Posted: 7 May 2020, 9:32:50 UTC - in response to Message 96187.  
Last modified: 7 May 2020, 9:39:18 UTC

Can the servers be updated such that a wingman is only created once the originally created task is unable to report results?
Project options

<report_grace_period>x</report_grace_period>
<grace_period_hours>x</grace_period_hours>
A "grace period" (in seconds or hours respectively) for task reporting. A task is considered time-out (and a new replica generated) if it is not reported by client_deadline + x.

So my thought is the Grace period needs to be 12 hours.
The deadline can be 3 Days, 7 days etc, then there is the Watchdog timer which is presently 10 hours. Allow another couple of hours (just because...) and that gives you 12 hours for the grace_period_hours x.

So a new Task won't be created until 12 hours after the deadline for the initial replication has passed (thinking about it even 6 hours would probably be long enough most of the time).
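
As a rough illustration of the arithmetic above (a sketch, not BOINC server code): the quoted option means a replacement task is generated only once client_deadline + x has passed. The function name, the send time and the 3-day deadline below are assumptions chosen for the example; the 10-hour watchdog and the 2-hour margin come from this post.

```python
from datetime import datetime, timedelta, timezone

WATCHDOG_HOURS = 10   # watchdog limit mentioned in this thread
MARGIN_HOURS = 2      # extra allowance suggested above

def replica_time(sent, deadline_days, grace_period_hours):
    """Earliest time a new replica would be generated: client_deadline + x."""
    client_deadline = sent + timedelta(days=deadline_days)
    return client_deadline + timedelta(hours=grace_period_hours)

grace = WATCHDOG_HOURS + MARGIN_HOURS                    # 12 hours
sent = datetime(2020, 5, 7, 9, 0, tzinfo=timezone.utc)   # hypothetical send time
print(replica_time(sent, deadline_days=3, grace_period_hours=grace))
```

With these example values the replacement would not be created until 12 hours after the 3-day deadline has passed.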
Grant
Darwin NT
ID: 96210
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 96230 - Posted: 7 May 2020, 14:00:51 UTC

i'm getting more finish file too long errors on Pi4 4.20, i've not upgraded boinc-client.
can't find a binary package that would install problem free, many dependencies.

however, i noticed one thing about the finish file too long errors.
they seem related to the Junior_HalfRoid tasks
https://boinc.bakerlab.org/rosetta/result.php?resultid=1172540347
https://boinc.bakerlab.org/rosetta/result.php?resultid=1172395662
and when these wu run, my Pi4 is close to using up all ram available. I'm not too sure if memory may after all be involved.
e.g. that they generate many error messages in the 'finish file' due to low memory conditions

there doesn't seem to be an easy way to solve it if it is due to memory, short of running fewer tasks. but the point is that when the tasks start, memory consumption normally looks ok, and it grows as the work progresses.
ID: 96230
Sid Celery
Joined: 11 Feb 08
Posts: 2122
Credit: 41,180,210
RAC: 9,765
Message 96232 - Posted: 7 May 2020, 14:04:32 UTC - in response to Message 96210.  

then there is the Watchdog timer which is presently 10 hours

Minor diversion from the topic:
I know this is what the watchdog was set to the last time we heard, but wasn't it for a very specific reason?
Does that reason apply any more? Because if it doesn't, it's a really long time for nominally 8hr task runtimes.

My sense of the watchdog was it's to allow for relatively short overruns that happen from time to time, but provides a cutoff for tasks if they've kind of gone rogue for some unknown reason.
10hrs doesn't really do the job any more and should be reduced to something more appropriate (was 4hrs)
ID: 96232
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 96233 - Posted: 7 May 2020, 14:10:58 UTC - in response to Message 96232.  

If I'm not mistaken, I believe the watchdog was extended to 10 hours, specifically for these potentially long-running Halfroids.
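
For reference, the task command lines quoted later in this thread include `-cpu_run_time 28800` and `-boinc::cpu_run_timeout 36000`. Reading these as the target runtime and the watchdog limit is an interpretation that matches the 8-hour and 10-hour figures discussed here, not something the project has confirmed. A minimal sketch (the helper name `flag_seconds` is hypothetical) that pulls the two values out and converts them to hours:

```python
def flag_seconds(command, flag):
    """Return the integer that follows `flag` in a space-separated command line."""
    tokens = command.split()
    return int(tokens[tokens.index(flag) + 1])

# Excerpt of a command line quoted in this thread (other flags omitted).
command = ("rosetta_4.20_windows_x86_64.exe -nstruct 10000 -cpu_run_time 28800 "
           "-boinc::watchdog -boinc::cpu_run_timeout 36000")

target_hours = flag_seconds(command, "-cpu_run_time") / 3600
watchdog_hours = flag_seconds(command, "-boinc::cpu_run_timeout") / 3600
print(target_hours, watchdog_hours, watchdog_hours - target_hours)
```

Under that reading, the watchdog leaves a 2-hour allowance on top of the 8-hour target.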
Rosetta Moderator: Mod.Sense
ID: 96233
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 96234 - Posted: 7 May 2020, 14:17:20 UTC - in response to Message 96230.  

One of those WUs used over 1GB and the other used over 2GB. What was in the out file about memory?

It would seem that running fewer threads would be better than failing WUs. But I would suspect that BOINC client would have had to put the others to "waiting for memory" in order to run the larger one anyway. So, reducing the number of threads should basically be occurring automatically, and only when the specific WU requires it.
Rosetta Moderator: Mod.Sense
ID: 96234
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,773,598
RAC: 22,870
Message 96245 - Posted: 7 May 2020, 18:46:48 UTC - in response to Message 96230.  

and when these wu run, my Pi4 is close to using up all ram available. I'm not too sure if memory may after all be involved.
Low available system RAM would impact on RAM available for disk caching.
Grant
Darwin NT
ID: 96245
Sid Celery
Joined: 11 Feb 08
Posts: 2122
Credit: 41,180,210
RAC: 9,765
Message 96260 - Posted: 8 May 2020, 8:31:15 UTC - in response to Message 96233.  

If I'm not mistaken, I believe the watchdog was extended to 10 hours, specifically for these potentially long-running Halfroids.

So the reason may still apply in future? I haven't seen one for a while. Ok
ID: 96260
dduggan47
Joined: 18 Sep 05
Posts: 12
Credit: 4,108,853
RAC: 8,281
Message 96271 - Posted: 8 May 2020, 18:01:10 UTC

I apologize for asking a question that's probably already been asked and answered but, despite having been running BOINC since the early days (and SETI before BOINC existed), I'm not always sufficiently technical to follow all the details discussed here.

My problem is that I'm getting many tasks which get "timed out - no response". For a while I was trying to look ahead and abort a lot of tasks, started and unstarted, which weren't going to finish by the deadline. I gather though that that might not be my best strategy for resolving this.

On one machine a couple of days ago I changed the "store at least" and "... additional" to 1 day and 0.25 days respectively, but on the other box I forgot and didn't make that change until today. At the moment I have 3 running tasks on the 1st machine that will not make it. On the other machine it's 12 running and about that many more which haven't started yet but won't make the deadline.

Am I right in assuming that BOINC will eventually figure this out? In the meantime, what's my best move? Abort all that won't make it? Abort only the unstarted? Let them all go until BOINC figures it out?

Thanks.
ID: 96271
Raistmer
Joined: 7 Apr 20
Posts: 49
Credit: 797,293
RAC: 0
Message 96275 - Posted: 8 May 2020, 19:09:28 UTC - in response to Message 96271.  
Last modified: 8 May 2020, 19:14:24 UTC


Am I right in assuming that BOINC will eventually figure this out? In the meantime, what's my best move? Abort all that won't make it? Abort only the unstarted? Let them all go until BOINC figures it out?

Thanks.


This project (unlike SETI, which you are familiar with) lets you reduce the length of tasks you have already received.
The best option for your host is to set them to the minimum possible length and then gradually increase it as long as you don't miss deadlines.

This can be done in project options here:


https://clip2net.com/s/47qBO85

As you can see, I have 2 different sets of options: one for powerful hosts (long task length) and one for netbooks/smartphones (short length, currently 4 hours per task).

P.S. You need to update the project settings (update the project from BOINC) and then restart the BOINC client itself to update the length of already downloaded tasks. Newly downloaded tasks will already have the new length.
ID: 96275
dduggan47
Joined: 18 Sep 05
Posts: 12
Credit: 4,108,853
RAC: 8,281
Message 96289 - Posted: 9 May 2020, 4:47:00 UTC - in response to Message 96275.  

Thanks for your help, Raistmer.


Am I right in assuming that BOINC will eventually figure this out? In the meantime, what's my best move? Abort all that won't make it? Abort only the unstarted? Let them all go until BOINC figures it out?

Thanks.


This project (unlike SETI, which you are familiar with) lets you reduce the length of tasks you have already received.
The best option for your host is to set them to the minimum possible length and then gradually increase it as long as you don't miss deadlines.

This can be done in project options here:


https://clip2net.com/s/47qBO85

As you can see, I have 2 different sets of options: one for powerful hosts (long task length) and one for netbooks/smartphones (short length, currently 4 hours per task).


This seems counterintuitive. Wouldn't I be better off to increase the expected length and then (I hope) run them in less time than to decrease the time and risk not making the deadlines?

P.S. You need to update the project settings (update the project from BOINC) and then restart the BOINC client itself to update the length of already downloaded tasks. Newly downloaded tasks will already have the new length.




I changed the expected times before reading your note but did it the opposite way as I described above. I can redo that if you advise that it would work better, even though I can't say I understand why. I also aborted anything that didn't look like it was going to make the deadline.

After seeing your post I stopped and restarted the BOINC client. This seemed to increase the expected times by a lot more than my change on some (but not all) running tasks but had little or no effect on unstarted tasks.

In my decades of running BOINC on around 40 different projects I've never run into this problem before. I'm finding it quite confusing. OTOH I was decades younger then too. Age tends not to reduce confusion! :-)

Thanks again.
ID: 96289
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,773,598
RAC: 22,870
Message 96290 - Posted: 9 May 2020, 5:13:39 UTC - in response to Message 96289.  

Am I right in assuming that BOINC will eventually figure this out? In the meantime, what's my best move? Abort all that won't make it? Abort only the unstarted? Let them all go until BOINC figures it out?

Thanks.

This project (unlike SETI, which you are familiar with) lets you reduce the length of tasks you have already received.
The best option for your host is to set them to the minimum possible length and then gradually increase it as long as you don't miss deadlines.
The best option is just to use the default Target CPU Runtime, and to have no cache at all, given the number of projects you are running.
Even if Rosetta were your only project, 0.5 days & 0.02 days extra is plenty.
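
A back-of-the-envelope sketch of why a small cache is enough. It assumes, as a simplification, that the client buffers just enough work to keep every core busy for the cached period and that each task runs for its full target runtime; the real BOINC work-fetch logic is more involved. The function name and the 4-core host are hypothetical.

```python
def buffered_tasks(cores, cache_days, task_hours):
    """Rough number of tasks needed to keep `cores` busy for `cache_days`."""
    return cores * cache_days * 24 / task_hours

# 8-hour tasks on a hypothetical 4-core host
small = buffered_tasks(4, 0.5 + 0.02, 8)   # settings suggested above
large = buffered_tasks(4, 1 + 0.25, 8)     # settings mentioned earlier in the thread
print(round(small, 2), round(large, 2))
```

Under these assumptions the larger setting holds more than twice as many tasks on the host, each already counting down to its deadline.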
Grant
Darwin NT
ID: 96290
Raistmer
Joined: 7 Apr 20
Posts: 49
Credit: 797,293
RAC: 0
Message 96294 - Posted: 9 May 2020, 7:29:12 UTC - in response to Message 96289.  
Last modified: 9 May 2020, 7:31:16 UTC


This seems counterintuitive. Wouldn't I be better off to increase the expected length and then (I hope) run them in less time than to decrease the time and risk not making the deadlines?

Expected length is the amount of CPU time a task is allowed to run. And here is the big difference from SETI and most other projects.
A task doesn't contain a fixed number of calculations to complete. If CPU time allows, a new model will be started for the same task (a slightly different initial configuration of atoms, or something like that).
So, if you allow 8 hours per task, it will run for 8 hours. If you allow only 2 hours, it will end in 2 hours.
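
The behaviour described above can be pictured with a toy simulation. This is only a sketch under assumptions: the real Rosetta application logic is not shown in this thread, and the fixed per-model time and the function name are invented for illustration.

```python
def run_task(target_cpu_seconds, model_seconds):
    """Simulate a task that keeps starting models until the target CPU time is used.

    Returns (models_completed, cpu_seconds_used).
    """
    used = 0
    models = 0
    while used < target_cpu_seconds:
        used += model_seconds   # run one more model to completion
        models += 1
    return models, used

print(run_task(8 * 3600, 1800))  # 8-hour target
print(run_task(2 * 3600, 1800))  # 2-hour target
print(run_task(2 * 3600, 2500))  # last model overruns the target slightly
```

In this picture the same task simply returns fewer models when the target runtime is shorter, which is why lowering the target helps tasks that are already downloaded.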

And yes, to avoid cache overflow in the future, it is better to set the BOINC cache size as small as possible. But changing the cache size will not help with already downloaded tasks.
ID: 96294
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,773,598
RAC: 22,870
Message 96352 - Posted: 11 May 2020, 4:27:49 UTC

rb_05_09_24541_24116_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_05_10_927507_5_0

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
Incorrect function.
 (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @rb_05_09_24541_24116_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 2 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_05_09_24541_24116_ab_t000__robetta.zip -frag3 rb_05_09_24541_24116_ab_t000__robetta.200.3mers.index.gz -fragA rb_05_09_24541_24116_ab_t000__robetta.200.10mers.index.gz -fragB rb_05_09_24541_24116_ab_t000__robetta.200.5mers.index.gz -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1576447
Using database: database_357d5d93529_n_methylminirosetta_database

[ ERROR ]: Caught exception:


File: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: -nan(ind)
 ------------------------ Begin developer's backtrace ------------------------- 
BACKTRACE:
 ------------------------- End developer's backtrace -------------------------- 


AN INTERNAL ERROR HAS OCCURED. PLEASE SEE THE CONTENTS OF ROSETTA_CRASH.log FOR DETAILS.



</stderr_txt>
]]>


This is the second time i've had this particular error message- last time it was a dodgy WU; the other system that got it also got the same error.
Waiting to see if that's the case again this time around.
Grant
Darwin NT
ID: 96352
Ivailo Bonev
Joined: 9 May 07
Posts: 15
Credit: 4,285,869
RAC: 0
Message 96357 - Posted: 11 May 2020, 8:29:10 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=1176852042

<core_client_version>7.16.5</core_client_version>
<![CDATA[
<message>
Incorrect function.
 (0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol jhr_boinc_v4.xml @flags -in:file:silent Junior_HalfRoid_design5_COVID-19_SAVE_ALL_OUT_IGNORE_THE_REST_6gx3kn9p.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip Junior_HalfRoid_design5_COVID-19_SAVE_ALL_OUT_IGNORE_THE_REST_6gx3kn9p.zip @Junior_HalfRoid_design5_COVID-19_SAVE_ALL_OUT_IGNORE_THE_REST_6gx3kn9p.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3876534
Using database: database_357d5d93529_n_methylminirosetta_database

ERROR: [ERROR] Unable to open constraints file: f39b38c813752ceb1e616c99588b316d_n0_c0_1_0001.MSAcst
ERROR:: Exit from: ......srccorescoringconstraintsConstraintIO.cc line: 457
BOINC:: Error reading and gzipping output datafile: default.out
11:22:22 (11520): called boinc_finish(1)

</stderr_txt>
]]>
ID: 96357
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,773,598
RAC: 22,870
Message 96383 - Posted: 12 May 2020, 6:17:40 UTC - in response to Message 96352.  

Looks like it was another dodgy WU- other system had the same error.
Grant
Darwin NT
ID: 96383
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,567,403
RAC: 6,975
Message 96421 - Posted: 13 May 2020, 5:49:40 UTC

Some "access violation"
1178319689
1178319933
etc
ID: 96421
James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 96476 - Posted: 14 May 2020, 7:59:22 UTC
Last modified: 14 May 2020, 8:17:20 UTC

Reference message 96433, which discusses problems with similar WUs.
Edit- having said that, i just had one of those WUs do the same thing on my system, yet was processed OK on another system, and even though i've processed several others of the same type with no problems.
3cl_7aa_6lu7_modified_AVLstub_relaxed_renumbered_0074_110_extract_B_SAVE_ALL_OUT_927956_74_0
(unknown error) - exit code -1073741819 (0xc0000005)
Unhandled Exception Detected...
Reason: Access Violation (0xc0000005) at address 0x00007FF63B7D1D48
Name: new_3cl_10aa_6lu7_modified_AVLstub_relaxed_renumbered_0674_33_extract_B_SAVE_ALL_OUT_928500_391_1
Application: Rosetta v4.20 windows_x86_64
Device: 3710630
Task: 1178942057. WU: 1058857778
Status: Error while computing.
Exit status: -1073741819 (0xC0000005) STATUS_ACCESS_VIOLATION
Errors: Too many errors (may have bug) Too many total results.
Stderr output:
(unknown error) - exit code -1073741819 (0xc0000005)
Unhandled Exception Detected...

Reason: Access Violation (0xc0000005) at address 0x0000000140348316 read attempt to address 0xFFFFFFFF

Engaging BOINC Windows Runtime Debugger...

My task was the 2nd try for this WU. The first host got the same error, so I suspect an issue with this type of WU/task.

My host also rec'd the same error with WU 1058853076, with my host again being the 2nd try for the same task.

Edit: As mentioned by others, some of the above WUs process normally while others receive the above-mentioned error. My host quoted above normally processed task 1178341520 (new_3cl_10aa_6lu7****).
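
The negative exit code and the hexadecimal status in these reports are the same 32-bit value. A small sketch of the conversion (the function name is illustrative; the status name is the one shown in the report above):

```python
def as_unsigned_hex(exit_code):
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return "0x{:08X}".format(exit_code & 0xFFFFFFFF)

# Only the status that appears in this thread; not a complete table.
KNOWN_STATUS = {"0xC0000005": "STATUS_ACCESS_VIOLATION"}

code = as_unsigned_hex(-1073741819)
print(code, KNOWN_STATUS.get(code, "unknown"))
```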
ID: 96476



©2024 University of Washington
https://www.bakerlab.org