Message boards : Number crunching : Rosetta 4.1+ and 4.2+
Author | Message |
---|---|
MarkJ Send message Joined: 28 Mar 20 Posts: 72 Credit: 25,238,680 RAC: 0 |
I had a bunch of TFSCAFFOLD0001 tasks fail; I’m currently up to 117 failures. My wingman on these has also failed, so I don’t think it’s my machines (9 of them). Some seem to work, but the dud ones fail within 18 seconds, so at least they don’t waste much compute time. The logs say it completed 1 decoy and then this: </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_9ji0sq4q_953352_10_1_r2051777753_0</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message> ]]> BOINC blog |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,272,718 RAC: 1,488 |
I've seen 4 recent tasks on my computer that failed with an upload error, each with names starting with TFSCAFFOLD0001_ and with very short run times. I also have two such tasks that have run for at least two hours each. A guess at what went wrong: The application encountered an error that it was not set up to report at all, and therefore declared that it was finished without doing anything to produce the file that failed to upload. Some of the event log: 7/9/2020 8:45:41 AM | Rosetta@home | Starting task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0 7/9/2020 8:45:58 AM | Rosetta@home | Computation for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0 finished 7/9/2020 8:45:58 AM | Rosetta@home | Output file TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0_r1023400029_0 for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0 absent 7/9/2020 8:46:01 AM | Rosetta@home | Starting task TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_7ph8ms6m_953352_14_0 7/9/2020 8:46:21 AM | Rosetta@home | Computation for task TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_7ph8ms6m_953352_14_0 finished 7/9/2020 8:46:21 AM | Rosetta@home | Output file TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_7ph8ms6m_953352_14_0_r541959699_0 for task TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_7ph8ms6m_953352_14_0 absent 7/9/2020 8:47:12 AM | Rosetta@home | Computation for task TFSCAFFOLD0001_9_SAVE_ALL_OUT_IGNORE_THE_REST_2ya1kk9v_953360_6_0 finished 7/9/2020 8:47:14 AM | Rosetta@home | Started upload of TFSCAFFOLD0001_9_SAVE_ALL_OUT_IGNORE_THE_REST_2ya1kk9v_953360_6_0_r1048254222_0 7/9/2020 8:47:19 AM | Rosetta@home | Finished upload of TFSCAFFOLD0001_9_SAVE_ALL_OUT_IGNORE_THE_REST_2ya1kk9v_953360_6_0_r1048254222_0 |
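The pattern in the event log above (a task starts, finishes within seconds, and its output file is then reported absent) can be picked out mechanically. What follows is a minimal sketch, not a BOINC tool: the function names `parse_line` and `short_failures` are made up for illustration, and the line layout is assumed from the excerpts quoted in this thread only.

```python
import re
from datetime import datetime

# Assumed line layout, copied from the excerpts in this thread:
#   "<date> <time> AM|PM | <project> | <message>"
LINE_RE = re.compile(
    r"(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) \| ([^|]+?) \| (.*)"
)

def parse_line(line, date_format="%m/%d/%Y %I:%M:%S %p"):
    """Return (timestamp, project, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, project, message = match.groups()
    return datetime.strptime(stamp, date_format), project, message

def short_failures(lines, date_format="%m/%d/%Y %I:%M:%S %p"):
    """Map task name -> seconds between 'Starting task' and 'finished',
    restricted to tasks whose output file was reported absent."""
    started, runtime, absent = {}, {}, set()
    for line in lines:
        parsed = parse_line(line, date_format)
        if parsed is None:
            continue
        when, _project, message = parsed
        if message.startswith("Starting task "):
            started[message[len("Starting task "):]] = when
        elif message.startswith("Computation for task ") and message.endswith(" finished"):
            task = message[len("Computation for task "):-len(" finished")]
            if task in started:
                runtime[task] = (when - started[task]).total_seconds()
        elif message.startswith("Output file ") and message.endswith(" absent"):
            # "Output file <file> for task <task> absent"
            _file, sep, task = message[:-len(" absent")].partition(" for task ")
            if sep:
                absent.add(task)
    return {task: secs for task, secs in runtime.items() if task in absent}

TASK = "TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_9nx4xt4n_953354_14_0"
SAMPLE = [
    f"7/9/2020 8:45:41 AM | Rosetta@home | Starting task {TASK}",
    f"7/9/2020 8:45:58 AM | Rosetta@home | Computation for task {TASK} finished",
    f"7/9/2020 8:45:58 AM | Rosetta@home | Output file {TASK}_r1023400029_0 for task {TASK} absent",
]
print(short_failures(SAMPLE))
```

For logs that appear to be written day-first (such as the `09/07/2020` entries elsewhere in this thread), pass `date_format="%d/%m/%Y %I:%M:%S %p"` instead; whether a given log is day-first is something to check against your own client.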
Curt3g Send message Joined: 30 Mar 20 Posts: 4 Credit: 1,908,126 RAC: 0 |
robertmiles wrote: "I've seen 4 recent tasks on my computer that failed with an upload error, each with names starting with TFSCAFFOLD0001_ and with very short run times." Same here. I've had 17 failures of this species of WU within the last 24 hours (on a single machine). |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,739,033 RAC: 7,061 |
Failures for me too, on all of my 6 Windows 10 PCs. Some scaffold tasks work, some don't. The failures only take 20 seconds, so I'll just leave them trying as some work. Non-scaffold tasks all work on all machines. https://boinc.bakerlab.org/rosetta/results.php?hostid=3792849&offset=0&show_names=0&state=6&appid= https://boinc.bakerlab.org/rosetta/results.php?hostid=4360598&offset=0&show_names=0&state=6&appid= https://boinc.bakerlab.org/rosetta/results.php?hostid=4368214&offset=0&show_names=0&state=6&appid= https://boinc.bakerlab.org/rosetta/results.php?hostid=3745283&offset=0&show_names=0&state=6&appid= https://boinc.bakerlab.org/rosetta/results.php?hostid=3746264&offset=0&show_names=0&state=6&appid= https://boinc.bakerlab.org/rosetta/results.php?hostid=3772248&offset=0&show_names=0&state=6&appid= 09/07/2020 3:13:32 PM | Rosetta@home | Starting task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_4ub7hx3y_953354_6_1 09/07/2020 3:13:54 PM | Rosetta@home | Computation for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_4ub7hx3y_953354_6_1 finished 09/07/2020 3:13:54 PM | Rosetta@home | Output file TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_4ub7hx3y_953354_6_1_r1892657513_0 for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_4ub7hx3y_953354_6_1 absent 09/07/2020 3:13:59 PM | Rosetta@home | Starting task TFSCAFFOLD0001_8_SAVE_ALL_OUT_IGNORE_THE_REST_5zi6bs4t_953359_12_1 09/07/2020 3:14:02 PM | Rosetta@home | [sched_op] Deferring communication for 00:01:26 09/07/2020 3:14:02 PM | Rosetta@home | [sched_op] Reason: Unrecoverable error for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_4ub7hx3y_953354_6_1 09/07/2020 3:14:17 PM | Rosetta@home | Computation for task TFSCAFFOLD0001_8_SAVE_ALL_OUT_IGNORE_THE_REST_5zi6bs4t_953359_12_1 finished 09/07/2020 3:14:17 PM | Rosetta@home | Output file TFSCAFFOLD0001_8_SAVE_ALL_OUT_IGNORE_THE_REST_5zi6bs4t_953359_12_1_r1282681604_0 for task TFSCAFFOLD0001_8_SAVE_ALL_OUT_IGNORE_THE_REST_5zi6bs4t_953359_12_1 absent
09/07/2020 3:14:27 PM | Rosetta@home | Starting task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_2fs4sj4q_953354_12_1 09/07/2020 3:14:27 PM | Rosetta@home | [sched_op] Deferring communication for 00:02:36 09/07/2020 3:14:27 PM | Rosetta@home | [sched_op] Reason: Unrecoverable error for task TFSCAFFOLD0001_8_SAVE_ALL_OUT_IGNORE_THE_REST_5zi6bs4t_953359_12_1 09/07/2020 3:14:49 PM | Rosetta@home | Computation for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_2fs4sj4q_953354_12_1 finished 09/07/2020 3:14:49 PM | Rosetta@home | Output file TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_2fs4sj4q_953354_12_1_r9418955_0 for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_2fs4sj4q_953354_12_1 absent 09/07/2020 3:14:54 PM | Rosetta@home | [sched_op] Deferring communication for 00:06:53 09/07/2020 3:14:54 PM | Rosetta@home | [sched_op] Reason: Unrecoverable error for task TFSCAFFOLD0001_3_SAVE_ALL_OUT_IGNORE_THE_REST_2fs4sj4q_953354_12_1 |
Tomcat雄猫 Send message Joined: 20 Dec 14 Posts: 180 Credit: 5,386,173 RAC: 0 |
8 TFSCAFFOLD0001 tasks here on my Ryzen Win10 machine, all upload failures with very short runtimes. Example: TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f_953352_14_1 <core_client_version>7.16.7</core_client_version> <![CDATA[ <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol marA_boinc_v1.xml @flags_TFSCAFFOLD -in:file:silent TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f.zip @TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2479129 Using database: database_357d5d93529_n_methylminirosetta_database ====================================================== DONE :: 1 starting structures 1201 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: WS_max 0 10:58:15 (5968): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f_953352_14_1_r428283726_0</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error> </message> ]]> |
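The upload failures quoted so far carry two different error codes (`-161 (not found)` and `-240 (stat() failed)`). When comparing many task reports it may help to pull the file name and code out of the stderr text automatically. This is a rough sketch that assumes the `<file_xfer_error>` layout seen in the posts above; `upload_errors` is a name invented here, not part of BOINC.

```python
import re

# Layout assumed from the stderr excerpts in this thread:
#   <file_name>NAME</file_name> <error_code>CODE (TEXT)</error_code>
XFER_RE = re.compile(
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<text>.*?)\)</error_code>"
)

def upload_errors(stderr_text):
    """Return a list of (file_name, error_code, error_text) tuples."""
    return [
        (m["name"], int(m["code"]), m["text"])
        for m in XFER_RE.finditer(stderr_text)
    ]

SAMPLE = (
    "<message> upload failure: <file_xfer_error> "
    "<file_name>TFSCAFFOLD0001_1_SAVE_ALL_OUT_IGNORE_THE_REST_3yo4ci7f_953352_14_1_r428283726_0</file_name> "
    "<error_code>-240 (stat() failed)</error_code> </file_xfer_error> </message>"
)

for name, code, text in upload_errors(SAMPLE):
    print(code, text, name)
```

The lazy `.*?` before `\)</error_code>` is there so that error texts containing their own parentheses, such as `stat() failed`, are captured whole.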
James W Send message Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0 |
4 TFSCAFFOLD0001 tasks errored out on my 2 hosts in last 24 hours. Example: WU: 1092308808 Task: 1217343606 Application: Rosetta v4.20 windows_x86_64 Errors: Too many errors (may have bug) Too many total results I was wingman for the original host, so we both errored out. I had the same error info as others reporting here. Hopefully the powers that be have got word of this problem WU and will let the researcher(s) know to correct it. |
Tomcat雄猫 Send message Joined: 20 Dec 14 Posts: 180 Credit: 5,386,173 RAC: 0 |
Another TFSCAFFOLD0001 errored out on my end; that was expected. One task ended prematurely due to a power outage for me, although it validated fine and wasn't marked as a computational error. 11527f43_fold_SAVE_ALL_OUT_951790_903_0 <core_client_version>7.16.7</core_client_version> <![CDATA[ <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers.index -in:file:native 00001.pdb -silent_gz 1 -frag9 00001.200.9mers.index -out:file:silent default.out -ex1 1 -abinitio::rsd_wt_loop 0.5 -relax::default_repeats 5 -abinitio::use_filters false -abinitio::increase_cycles 10 -abinitio::rsd_wt_helix 0.5 -beta 1 -abinitio::rg_reweight 0.5 -in:file:boinc_wu_zip 11527f43_data.zip -out:file:silent default.out -silent_gz -mute all -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3121464 Using database: database_357d5d93529_n_methylminirosetta_database ERROR: Fullatom mismatch in checkpointer. ERROR:: Exit from: ..\..\..\src\protocols\checkpoint\CheckPointer.cc line: 380 01:10:25 (15132): called boinc_finish(0) </stderr_txt> ]]> Methinks something went wrong with the checkpointing system? |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,563,789 RAC: 6,764 |
James W wrote: "4 TFSCAFFOLD0001 tasks errored out on my 2 hosts in last 24 hours." Still TFSCAFFOLD errors. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,767,500 RAC: 22,869 |
James W wrote: "4 TFSCAFFOLD0001 tasks errored out on my 2 hosts in last 24 hours." [VENETO] boboviz wrote: "Still TSCAFFOLD errors" I've had a few run to completion, but I'd estimate around a 95%+ failure rate for those particular Work Units, with none making it to 30 seconds before erroring out. A very bad batch of work. Grant Darwin NT |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,563,789 RAC: 6,764 |
Grant (SSSF) wrote: "A very bad batch of work." The same TFSCAFFOLD batch went through Ralph@Home earlier with the same errors. I cannot understand why they published these WUs on Rosetta... |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,739,033 RAC: 7,061 |
Grant (SSSF) wrote: "I've had a few run to completion, but i'd estimate around a 95%+ failure rate for those particular Work Units. none making it to 30 seconds before erroring out." Strange, I'm only getting about 20-30% failing. Anyway, nothing to worry about, since they only waste 20 seconds. They'll see them all coming back with a problem and sort it. The server won't resend them repeatedly; it'll give up after a few hosts reject it. Quote: "resetting the project would have them marked as Abandoned." If I reset a project, does it inform the project of the not-done work units? Or should I always abort and update first? Tomcat雄猫 wrote: "One task ended prematurely due to a power outage for me, although it validated fine and wasn't marked as a computational error." Presumably that's because Rosetta units comprise a number of modules, and you had probably completed some of them. It's strange it broke the unit though; normally it just continues, or resumes from the last checkpoint. Maybe the power went off as it was saving to disk? |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,155,074 RAC: 4,016 |
Resetting aborts all workunits and they are put back in the queue. SOME projects resend them to the original user during the reset process, but not all do that. How many you get back, if a project does resend them to you, will depend on your cache settings. To directly answer your question: it's six of one, half a dozen of the other which way you do it. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,767,500 RAC: 22,869 |
Mr P Hucker wrote: "If I reset a project, does it inform the project of the not-done work units? Or should I always abort and update first?" mikey wrote: "Resetting aborts all workunits and they are put back in the queue...SOME Projects resend them back to the original user during the reset process but not all do that. Your cache settings will depend on how many you get back if a project resends them back to you. To directly answer your question 'it's 6 of one half a dozen of another' on the way you do it." The advantage of resetting is that it also deletes all the project files and then re-downloads them. So if you have a corrupted application (or support) file, resetting the project will sort out that issue, whereas aborting & updating won't (it just gets rid of the present cache of Tasks and then gets new ones). Grant Darwin NT |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,563,789 RAC: 6,764 |
Mr P Hucker wrote: "Anyway, nothing to worry about, since they only waste 20 seconds." I'm worried, because it's a waste of time and of downloads. The problem was reported days ago, and there has been no solution or message from the admins. This morning, another 3 errors. <message> Please, STOP THIS BATCH! |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,563,789 RAC: 6,764 |
1220249128 [ ERROR ]: Caught exception: |
James W Send message Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0 |
Grant (SSSF) wrote: "I've had a few run to completion, but i'd estimate around a 95%+ failure rate for those particular Work Units. none making it to 30 seconds before erroring out." James W wrote: "4 TFSCAFFOLD0001 tasks errored out on my 2 hosts in last 24 hours." [VENETO] boboviz wrote: "Still TSCAFFOLD errors" In roughly the last 24 hours (7/13-7/14/20) I had 4 tasks error out on my 2 hosts, while 3 completed and validated. It appears quite random. Also, no WU had one valid and one failed task; each either validated with the first host or failed with both hosts. Definitely an issue with certain tasks having "issues." |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,739,033 RAC: 7,061 |
[VENETO] boboviz wrote: "I'm worry, because it's a waste of time and of download." Is it really that big a deal? Your computer is only wasting 20 seconds on a buggered 8 hour task. Most of them work. Those that don't just get sent back and they'll realise their error eventually. The project has plenty of bandwidth unlike Universe. mikey wrote: "Resetting aborts all workunits and they are put back in the queue...SOME Projects resend them back to the original user during the reset process but not all do that. Your cache settings will depend on how many you get back if a project resends them back to you. To directly answer your question 'it's 6 of one half a dozen of another' on the way you do it." Thanks, I'll do that in future. I used to set no new tasks, abort unstarted work, update the project, then reset. Actually, come to think of it, I more often detach. Does that also send back unfinished units? |
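The manual sequence described above (no new tasks, abort unstarted work, update, then reset) maps onto the `boinccmd` command-line tool that ships with the BOINC client. The sketch below only builds and prints the commands by default. The master URL is an assumption based on the links in this thread, the helper names are made up for illustration, and the task names have to come from your own client (for example from `boinccmd --get_tasks`).

```python
import subprocess

# Assumed master URL; check the URL your own client shows for the project.
ROSETTA_URL = "https://boinc.bakerlab.org/rosetta/"

def cleanup_commands(project_url, unstarted_tasks):
    """Build boinccmd argument lists for: no new tasks, abort, update, reset."""
    cmds = [["boinccmd", "--project", project_url, "nomorework"]]
    for name in unstarted_tasks:
        cmds.append(["boinccmd", "--task", project_url, name, "abort"])
    cmds.append(["boinccmd", "--project", project_url, "update"])
    cmds.append(["boinccmd", "--project", project_url, "reset"])
    return cmds

def run_all(cmds, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for argv in cmds:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)

# "example_task_name_0" is a placeholder, not a real task.
run_all(cleanup_commands(ROSETTA_URL, ["example_task_name_0"]))
```

One caveat, stated as a guess rather than a fact about the client: the update request returns before the scheduler contact has necessarily finished, so it may be worth checking that the aborted tasks have been reported before issuing the reset.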
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,155,074 RAC: 4,016 |
Yes, just not as quickly: they go back after they expire, because until then the server is waiting for you to finish them and has no way to know you actually left. At projects like this with a 3 day deadline it's better to abort all of them first, but at most projects with deadlines of 10 to 14 days or more there's not a lot of difference, except for those people who have a small queue and are waiting for a wingman. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,739,033 RAC: 7,061 |
mikey wrote: "Yes just not as quickly, they do so after they expire because they are waiting for you to finish them, the Server has no way to know you actually left. At projects like this with a 3 day deadline it's better to abort all of them first but at most projects with 10 to 14 day or more deadlines there's not alot of difference except for those people who have a small queue and are waiting for a wingman." I disagree: on a project that gives you two weeks to do them, you could be delaying stuff for another 2 weeks. BOINC really should tell the server about any tasks you won't be doing. |
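The disagreement above comes down to simple date arithmetic: if unreported tasks are only reissued once their deadline passes, the cost of detaching without aborting grows with the deadline. A small sketch of that arithmetic follows. It rests on the behaviour described in the posts (reissue at deadline), uses made-up function names, and the dates are hypothetical.

```python
from datetime import datetime, timedelta

def earliest_resend(sent, deadline_days, aborted_at=None):
    """When the server can first hand the task to another host.

    Assumption for illustration: an aborted task is reported at `aborted_at`
    and can be reissued from then on; an unreported task is only reissued
    once its deadline has passed.
    """
    if aborted_at is not None:
        return aborted_at
    return sent + timedelta(days=deadline_days)

def delay_from_detaching(sent, left_at, deadline_days):
    """Extra wait caused by leaving at `left_at` without aborting first."""
    without_abort = earliest_resend(sent, deadline_days)
    with_abort = earliest_resend(sent, deadline_days, aborted_at=left_at)
    return without_abort - with_abort

sent = datetime(2020, 7, 14, 12, 0)      # hypothetical send time
left = sent + timedelta(days=1)          # user leaves one day later
print(delay_from_detaching(sent, left, deadline_days=3))
print(delay_from_detaching(sent, left, deadline_days=14))
```

Under these assumptions the extra wait is the deadline minus the time already elapsed, so a 14 day deadline holds a task back far longer than a 3 day one.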
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,155,074 RAC: 4,016 |
Quote: "Yes just not as quickly, they do so after they expire because they are waiting for you to finish them, the Server has no way to know you actually left. At projects like this with a 3 day deadline it's better to abort all of them first but at most projects with 10 to 14 day or more deadlines there's not alot of difference except for those people who have a small queue and are waiting for a wingman." PrimeGrid tracks the workunits to a degree but I don't know any project that does what you are suggesting. |
©2024 University of Washington
https://www.bakerlab.org