Problems and Technical Issues with Rosetta@home

BlackPoison357

Joined: 5 Mar 24
Posts: 1
Credit: 1,674,561
RAC: 0
Message 108991 - Posted: 15 Mar 2024, 20:49:43 UTC - in response to Message 108986.  

That must have been why I've had 31 errors within the last 2 days.

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 109007 - Posted: 16 Mar 2024, 18:53:20 UTC - in response to Message 108983.  
Last modified: 16 Mar 2024, 18:54:27 UTC

Just got 21 of those.

It is this series: 7a_hal_c_hal_7aa_.................

Just downloaded 4 beta 6.05 tasks one of which immediately (0.02 seconds CPU) failed with :-

<core_client_version>7.24.1</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.05_x86_64-pc-linux-gnu @7a_hal_c_hal_7aa_12899_d40_0001.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Using database: database_0f7f01a1b07/database

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
21:15:06 (176255): called boinc_finish(1)

</stderr_txt>
]]>

Boinc 7.24.1 and Ubuntu 22.04.4
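
For anyone comparing failures across hosts, the useful parts of a stderr dump like the one above are the `ERROR:` lines and the exit code on the last line. A minimal Python sketch, assuming the stderr text is already in a string; the function name `summarize_stderr` and the regular expression are illustrative and not part of BOINC or Rosetta:

```python
import re

def summarize_stderr(stderr_text):
    """Return (exit_code, error_lines) pulled from a BOINC stderr dump."""
    # Keep the lines that start with "ERROR:" (this also matches "ERROR::").
    errors = [line.strip() for line in stderr_text.splitlines()
              if line.lstrip().startswith("ERROR:")]
    # The "called boinc_finish(N)" line carries the exit code.
    match = re.search(r"called boinc_finish\((-?\d+)\)", stderr_text)
    exit_code = int(match.group(1)) if match else None
    return exit_code, errors

# Sample text copied from the stderr output quoted in this post.
sample = """\
ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
21:15:06 (176255): called boinc_finish(1)
"""

code, errors = summarize_stderr(sample)
print("exit code:", code)
for line in errors:
    print(line)
```

Running the same function over several tasks makes it easy to see whether they all die at the same source line, which points at the work units rather than the host.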

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 6
Message 109009 - Posted: 17 Mar 2024, 7:52:58 UTC
Last modified: 17 Mar 2024, 8:44:34 UTC

A flock of work units arrived recently that are behaving oddly; well, all but one of them. They came with a run time of 8 hours. Of those which are running, one has 6 hours elapsed and 6 hours odd remaining; this figure wobbles about for a while, then drops some, then starts wobbling again, and repeats. The others have remaining times of 2 to 6 days, and increasing. The percentage complete figure is increasing, but very slowly.
<edit>
The task is Beta 6.04 running on Windows 10 x64.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1720
Credit: 18,351,686
RAC: 24,923
Message 109010 - Posted: 17 Mar 2024, 10:29:02 UTC - in response to Message 109009.  

A flock of work units arrived recently that are behaving oddly; well, all but one of them. They came with a run time of 8 hours. Of those which are running, one has 6 hours elapsed and 6 hours odd remaining; this figure wobbles about for a while, then drops some, then starts wobbling again, and repeats. The others have remaining times of 2 to 6 days, and increasing. The percentage complete figure is increasing, but very slowly.
<edit>
The task is Beta 6.04 running on Windows 10 x64.
And the answer is the same as always, but since you have ignored the answer for over 4 years now, there's no point repeating it.
Run time 1 days 20 hours 40 min 36 sec
CPU time 11 hours 59 min 57 sec
Almost 2 days to do 12 hours work, all because of settings you have made & chose not to fix.
Grant
Darwin NT
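
The two figures quoted above can be reduced to a single ratio, which makes the gap easier to compare between tasks. A small Python sketch using only the run time and CPU time from this post; the variable names are just for illustration:

```python
from datetime import timedelta

# Figures quoted in the post above.
run_time = timedelta(days=1, hours=20, minutes=40, seconds=36)
cpu_time = timedelta(hours=11, minutes=59, seconds=57)

# Dividing one timedelta by another gives a plain float.
efficiency = cpu_time / run_time
print(f"CPU time is {efficiency:.1%} of wall-clock run time")
```

That comes out at roughly 27%, meaning the task was waiting for a CPU for nearly three quarters of the time it was nominally running.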

kotenok2000
Joined: 22 Feb 11
Posts: 271
Credit: 507,897
RAC: 496
Message 109011 - Posted: 17 Mar 2024, 11:13:19 UTC - in response to Message 109009.  

What else do you have running?

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 6
Message 109012 - Posted: 17 Mar 2024, 12:16:16 UTC - in response to Message 109011.  

The only thing running that has any real impact is Folding@Home, this is set to its minimum activity level, but still grabs quite a lot of resources. The only other things I can see are Firefox, and the VPN service I use which has a small window that normally sits open on the desktop. I've no idea what Grant is talking about, (above), but it gives me something to look for. I don't have any weird settings by the way.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

kotenok2000
Joined: 22 Feb 11
Posts: 271
Credit: 507,897
RAC: 496
Message 109013 - Posted: 17 Mar 2024, 12:18:41 UTC

Reduce Folding@home to 4 cores (if running on CPU) and Rosetta to 4 cores.

TPCBF
Joined: 29 Nov 10
Posts: 111
Credit: 5,142,074
RAC: 3,111
Message 109014 - Posted: 17 Mar 2024, 23:53:01 UTC - in response to Message 80621.  

Please report any issues with work units in this thread.

All recent WUs on at least 3 different hosts all got cut short with the "error while computing"... :(

robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,338,560
RAC: 2,994
Message 109019 - Posted: 18 Mar 2024, 13:15:46 UTC - in response to Message 109014.  

Please report any issues with work units in this thread.

All recent WUs on at least 3 different hosts all got cut short with the "error while computing"... :(


"Error while computing" means that an error was detected, but gives no information about WHAT error.

There's generally at least one more line saying something about what error.

Notarick
Joined: 19 Nov 06
Posts: 1
Credit: 2,932,430
RAC: 6,892
Message 109032 - Posted: 25 Mar 2024, 14:20:29 UTC
Last modified: 25 Mar 2024, 14:29:35 UTC

I'm not sure what is going on but about 1/4 of the work units are reporting "Error while computing" and I am no longer getting new tasks (which I assume is due to the errors). I've been running BOINC for years without issue, but I'm still running some CPU stress tests/diagnostics to see if it is a hardware thing. Anyway:

My CPU is a Ryzen 7 3700X on a Gigabyte X570 UD mobo with 32 GB of RAM and a Radeon RX 570 video card. I'm running Windows 10. The only thing that has changed recently is a UEFI update.

Here is the diagnostic information from one of them (they all have the same error code and similar Stderr output):

Name 7a_hal_d_hal_7aa_13011_d120_0001_SAVE_ALL_OUT_2977649_108_1
Workunit 1383425641
Created 23 Mar 2024, 8:12:33 UTC
Sent 23 Mar 2024, 8:12:35 UTC
Report deadline 26 Mar 2024, 8:12:35 UTC
Received 23 Mar 2024, 8:14:02 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 1 (0x00000001) Unknown error code
Computer ID 6222889
Run time 11 sec
CPU time 2 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 5.15 GFLOPS
Application version Rosetta Beta v6.04
windows_x86_64
Peak working set size 117.73 MB
Peak swap size 88.94 MB
Peak disk usage 0.01 MB

Stderr output

<core_client_version>7.24.1</core_client_version>
<![CDATA[
<message>
Incorrect function.
(0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.04_windows_x86_64.exe @7a_hal_d_hal_7aa_13011_d120_0001.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Using database: database_0f7f01a1b07database

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
02:12:51 (5004): called boinc_finish(1)

</stderr_txt>
]]>

MJH333
Joined: 29 Jan 21
Posts: 18
Credit: 6,745,546
RAC: 22,305
Message 109034 - Posted: 25 Mar 2024, 14:48:08 UTC - in response to Message 109032.  
Last modified: 25 Mar 2024, 15:20:39 UTC

I'm not sure what is going on but about 1/4 of the work units are reporting "Error while computing" and I am no longer getting new tasks (which I assume is due to the errors).
I doubt that there is anything wrong with your system. The latest big batch of work units has run out.

You can see this from the Server Status section on the Rosetta home page: https://boinc.bakerlab.org/rosetta/, which currently shows "Total queued jobs" of 0. It can also be seen from the Project status page: https://boinc.bakerlab.org/rosetta/server_status.php, which currently shows "Tasks ready to send" of 0.

You may, of course, pick up the odd resend, or some Robetta tasks, but there will be no steady flow of tasks until another big batch of work units is released.

We may get another batch if/when the work units you identified with the "residue 1 does not have a LOWER_CONNECT" error are corrected and reissued.

When trying to get an idea of how much work is available, I tend to look at the "Total queued jobs" figure on the Rosetta home page because that shows all the work units available to be crunched, which may be in the millions. Whereas the "Tasks ready to send" figure on the Project status page shows just the tasks ready to be distributed, which is usually no more than 5,000.

kotenok2000
Joined: 22 Feb 11
Posts: 271
Credit: 507,897
RAC: 496
Message 109043 - Posted: 27 Mar 2024, 13:28:25 UTC

I see this in stderr.txt
command: projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.04_windows_x86_64.exe @7ahall_e_hal_7aa_15545_d239_0001.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Extracting in project directory: database_0f7f01a1b07.zip
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/rotamer/bbdep02.May.sortlib
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/rotamer/peptoid_rotlibs/001.rotlib
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.R.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.6.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.1.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.D.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.V.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.4.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/sampling/disulfide_jump_database_wip.dat
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/sampling/fragpicker_rama_tables/L_QP.counts.gz
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/sampling/vall.jul19.2011.torsions.gz
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/protocol_data/tensorflow_graphs/gcn_test_model/gcn_test_model_plot.png
        Permission denied
Extracting in slot directory: minirosetta_database.zip
Using database: database

Looks like each task tried to extract the database to E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07 all at once, gave up, and extracted to the slot directory instead.
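
To illustrate the kind of collision described above, here is one common way to let several processes share a one-time extraction without deleting files from under each other: unpack into a private staging directory, then publish it with a single rename. This is only a sketch of the general pattern in Python; it is not what Rosetta or the BOINC client actually does, and `extract_once` and the `extract_` staging prefix are names made up for the example.

```python
import os
import shutil
import tempfile
import zipfile

def extract_once(zip_path, target_dir):
    """Extract zip_path into target_dir unless it is already there.

    Returns True if this call did the extraction, False if target_dir
    already existed or another process published it first.
    """
    if os.path.isdir(target_dir):
        return False
    parent = os.path.dirname(os.path.abspath(target_dir))
    # Each caller unpacks into its own staging directory, so nobody
    # overwrites or deletes files another process is still reading.
    staging = tempfile.mkdtemp(prefix="extract_", dir=parent)
    try:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(staging)
        try:
            os.rename(staging, target_dir)  # publish in one step
        except OSError:
            return False  # someone else got there first
        return True
    finally:
        # Clean up the staging copy if it was not published.
        if os.path.isdir(staging):
            shutil.rmtree(staging, ignore_errors=True)
```

The callers that lose the race simply use the directory the winner published, so there is never an "old" file to delete and no reason to fall back to a per-slot copy.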

Sid Celery
Joined: 11 Feb 08
Posts: 2136
Credit: 41,518,559
RAC: 15,775
Message 109055 - Posted: 2 Apr 2024, 1:28:54 UTC

It's a strange thing, but every time tasks run out recently, another ~million seem to be added to the queue to keep us going.

I realise many have given up a bit on expecting reliability from Rosetta, but it almost seems like someone is paying a little attention on the quiet.

Or maybe I'm just wishing that was the case.

Either way, it's appreciated.
And there are still enough people around to blast through and return them quickly too.
(Comparatively) good times...

kotenok2000
Joined: 22 Feb 11
Posts: 271
Credit: 507,897
RAC: 496
Message 109057 - Posted: 2 Apr 2024, 10:47:24 UTC

When I run the graphics app, it closes immediately.
I see this in stderrgfx.txt

user@ubuntu:/var/lib/boinc/slots/19$ cat stderrgfx.txt

ERROR: Unable to open file: /var/lib/boinc/projects/boinc.bakerlab.org_rosetta/../database/chemical/residue_type_sets/fa_standard/residue_types.txt

ERROR:: Exit from: src/core/chemical/GlobalResidueTypeSet.cc line: 145
13:38:44 (25733): called boinc_finish(0)
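
The `..` in that path is worth a second look. Collapsing it lexically shows which directory the graphics app is really asking for. A short Python check, using only the path from the error above (`posixpath.normpath` works purely on the string, so a symlink on the real machine could make the actual location differ):

```python
import posixpath

# Path copied from the "Unable to open file" error above.
requested = ("/var/lib/boinc/projects/boinc.bakerlab.org_rosetta/"
             "../database/chemical/residue_type_sets/fa_standard/residue_types.txt")

resolved = posixpath.normpath(requested)
print(resolved)
```

The result sits under /var/lib/boinc/projects/database/, beside the project folder rather than inside it, whereas the task logs earlier in the thread report the database under database_0f7f01a1b07 within the project folder. Whether that mismatch is what makes the graphics app exit is a guess, but it would be consistent with the error.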

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2001
Credit: 9,780,807
RAC: 8,163
Message 109058 - Posted: 2 Apr 2024, 12:50:40 UTC - in response to Message 109055.  

And there are still enough people around to blast through and return them quickly too.
(Comparatively) good times...


+1
But, after months and hundreds of thousands of WUs, maybe it's time to let the app out of the beta stage.

kotenok2000
Joined: 22 Feb 11
Posts: 271
Credit: 507,897
RAC: 496
Message 109059 - Posted: 2 Apr 2024, 13:48:52 UTC

Tasks finish in 3 hours for me.

I have set "Target CPU run time" to "not selected"

just1vet
Joined: 13 Nov 05
Posts: 4
Credit: 3,673,481
RAC: 26,688
Message 109062 - Posted: 2 Apr 2024, 21:41:05 UTC

Big problems with Rosetta on Linux Mint 20 and 21. Had to remove the project from the client on both of my machines. It would freeze the computers to the point where they had to be rebooted, only to lock up again as soon as they started on Rosetta.
I narrowed it down to the Rosetta project after replacing the hard drive, motherboard and RAM. They run fine on the other projects. This has been going on for a while. Runs fine on my Windows computers.

Any ideas?

kotenok2000
Joined: 22 Feb 11
Posts: 271
Credit: 507,897
RAC: 496
Message 109063 - Posted: 2 Apr 2024, 21:43:47 UTC - in response to Message 109062.  

Maybe you can reduce the number of CPUs allocated from 100% to 75%?

Sid Celery
Joined: 11 Feb 08
Posts: 2136
Credit: 41,518,559
RAC: 15,775
Message 109064 - Posted: 2 Apr 2024, 23:47:57 UTC - in response to Message 109059.  

Tasks finish in 3 hours for me.

I have set "Target CPU run time" to "not selected"

Has something changed then? I noticed this elsewhere too.

Doesn't apply to me - since tasks have been harder to come by I've changed my default to 12hrs.
Maybe it's time to be explicit on runtime and change it to 8hrs, rather than let it run at a dodgy default value.
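
One way to check what run time a task was actually launched with is to read it off the command line in stderr. The command lines quoted earlier in this thread pass `-cpu_run_time 28800` and `-boinc::cpu_run_timeout 36000`; the Python sketch below converts those to hours. It assumes (not confirmed anywhere in this thread) that `-cpu_run_time` reflects the "Target CPU run time" preference, and `flag_seconds` is a helper name invented for the example.

```python
import shlex

# Trimmed copy of a command line quoted earlier in this thread.
command = ("rosetta_beta_6.04_windows_x86_64.exe "
           "@7a_hal_d_hal_7aa_13011_d120_0001.flags -nstruct 10000 "
           "-cpu_run_time 28800 -boinc:max_nstruct 20000 "
           "-checkpoint_interval 120 -boinc::watchdog "
           "-boinc::cpu_run_timeout 36000")

def flag_seconds(command_line, flag):
    """Return the integer value that follows `flag` on the command line."""
    tokens = shlex.split(command_line)
    return int(tokens[tokens.index(flag) + 1])

target_hours = flag_seconds(command, "-cpu_run_time") / 3600
timeout_hours = flag_seconds(command, "-boinc::cpu_run_timeout") / 3600
print(f"target: {target_hours:g} h, timeout: {timeout_hours:g} h")
```

For the tasks quoted here that gives 8 hours and 10 hours, so a task finishing in 3 hours stopped well short of the run time it was started with.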

just1vet
Joined: 13 Nov 05
Posts: 4
Credit: 3,673,481
RAC: 26,688
Message 109065 - Posted: 3 Apr 2024, 3:28:05 UTC - in response to Message 109063.  

Right now they are 32-core with 16 GB of RAM, which should be enough for crunching.
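
The arithmetic behind that can be checked quickly. The Python sketch below assumes one single-threaded task per allowed core and ignores memory used by the OS and anything else running; both are simplifications, and `ram_per_task_gb` is just a name for the example.

```python
def ram_per_task_gb(total_ram_gb, cores, cpu_percent):
    """Return (concurrent tasks, GB of RAM available to each)."""
    tasks = max(1, int(cores * cpu_percent / 100))
    return tasks, total_ram_gb / tasks

for pct in (100, 75):
    tasks, per_task = ram_per_task_gb(16, 32, pct)
    print(f"{pct}% of CPUs: {tasks} tasks, {per_task:.2f} GB each")
```

At 100% that is 32 tasks sharing 16 GB, or 0.5 GB each; dropping to 75% gives 24 tasks and about 0.67 GB each. If individual tasks need more than that, the machine would start swapping heavily, which could look like a freeze.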



©2024 University of Washington
https://www.bakerlab.org