Message boards : Number crunching : Rosetta 4.1+ and 4.2+
Author | Message |
---|---|
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,738,758 RAC: 8,494 |
Well I have no clue on how to troubleshoot the issue. As I stated no issues with any other tasks. I have 32GB of memory and memory usage is only 25% of max. I see less than 1GB of memory usage on the Rosetta tasks. I swear by "Memtest 86" (or "Memtest 86+"), whichever works on your system - one doesn't work on older machines and one doesn't work on newer ones, I can't remember which. You download it for free, it makes a bootable OS-independent CD, and you run it for about an hour or so until it says "pass complete". If even a single RAM error is reported, you need to replace the RAM. You can easily find out which chip is faulty by testing one at a time. |
Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 332,545 RAC: 421 |
I only run two Rosetta tasks at a time at most. The one task that I mentioned uses all available memory (32GB) plus all of the 6GB swap file every ten minutes or so. Must be writing out to a scratch file or something. Most single task memory usage I ever saw before on any Rosetta task was around 4GB. What prompted me to bump to 32GB in the first place. So this task species is most definitely an extreme outlier. As far as changing settings, since no other tasks from any other projects have any issues, the solution is just to quit crunching Rosetta. |
Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 332,545 RAC: 421 |
Well I have no clue on how to troubleshoot the issue. As I stated no issues with any other tasks. I have 32GB of memory and memory usage is only 25% of max. I see less than 1GB of memory usage on the Rosetta tasks. I swear by stressapptest. My systems pass 24 hours of memory testing using all available memory and all available cores with no errors. |
Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 332,545 RAC: 421 |
[Edit 2] Well it was this Rosetta task kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_201. It is grabbing all the memory and the swap file every five minutes or so. It's a resend, this is what the first system got with it. Thanks for the reply. That report is exactly what I am seeing on this task. My memory usage for the task climbs from 1GB all the way to all memory and swap in use for the task every ten minutes or so and then falls back to normal. Looking at it in htop was what allowed me to figure out the culprit. So I assume a faulty work unit and I will just abort it now. |
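For anyone who would rather log a task's memory than watch it in htop, here is a minimal sketch of one way to do it on Linux. It assumes the standard `VmRSS` field in `/proc/<pid>/status`; the function names and the 4GB threshold in the usage note are hypothetical, chosen only for illustration, and nothing here is part of BOINC or Rosetta.

```python
import re
from pathlib import Path
from typing import Optional


def parse_vm_rss_kb(status_text: str) -> Optional[int]:
    """Return the VmRSS value (in kB) found in the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read the current resident set size of a live process (Linux only)."""
    return parse_vm_rss_kb(Path(f"/proc/{pid}/status").read_text())


def exceeds_limit(rss_kb: Optional[int], limit_gb: float) -> bool:
    """True when the resident set size is known and above limit_gb gigabytes."""
    return rss_kb is not None and rss_kb > limit_gb * 1024 * 1024
```

Calling `read_rss_kb(pid)` every few seconds and flagging anything where `exceeds_limit(rss, 4)` is true would pick out a task like the one above, given that around 4GB was the most a single Rosetta task had used before.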
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,738,758 RAC: 8,494 |
I only run two Rosetta tasks at a time at most. The one task that I mentioned uses all available memory (32GB) plus all of the 6GB swap file every ten minutes or so. Must be writing out to a scratch file or something. Most single task memory usage I ever saw before on any Rosetta task was around 4GB. What prompted me to bump to 32GB in the first place. Kinda looks like you're getting unlucky and receiving dodgy tasks that eat memory. Oh well, put up with the errors and let them see the problem, or switch it off and let someone else get the horrid ones for a while. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,738,758 RAC: 8,494 |
Well I have no clue on how to troubleshoot the issue. As I stated no issues with any other tasks. I have 32GB of memory and memory usage is only 25% of max. I see less than 1GB of memory usage on the Rosetta tasks. I'd never heard of that, I assume it does the same thing as Memtest. Does it run within the OS? If so I'd not trust it, as the OS can't let it test memory in use by the kernel. |
Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 332,545 RAC: 421 |
Well I have no clue on how to troubleshoot the issue. As I stated no issues with any other tasks. I have 32GB of memory and memory usage is only 25% of max. I see less than 1GB of memory usage on the Rosetta tasks. Well I first used Memtest from a USB stick. But the memory testers on OCN state that is a very poor tester for Linux. They recommend the Google stressapptest. That is the one Google developed to test their servers that they deploy in their server farms before putting them into service. It is a standard application in the repositories. I then follow up the memory stress testing with several hours of Prime95 and y-cruncher to put the system under actual compute loads to make sure it is stable before starting up BOINC with my actual loads. Closest I can come to actual BOINC loads. But BOINC is the final arbiter of stability. If I don't run Rosetta, I don't get any errors on any of my other projects. [Edit] Here are some links about it. https://www.ghacks.net/2009/10/19/google-stress-app-test/ https://rog.asus.com/forum/showthread.php?73665-Our-preferred-memory-stress-test |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,755,824 RAC: 22,866 |
If I don't run Rosetta, I don't get any errors on any of my other projects. Yet you're the only one that is having signal 11 issues with WUs that others can process with no problems at all- even with the same application. Signal 11 (SIGSEGV) indicates a memory access problem. The problem only occurs with Rosetta Tasks- which in general use way more RAM than other projects' Tasks. And if you've been getting these errors since before the faulty memory pig WUs came out. Everything points towards a hardware memory issue- be it too much/too little voltage, too much overclock, or just a dodgy address(es). *shrug* Grant Darwin NT |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
If I don't run Rosetta, I don't get any errors on any of my other projects. The thing is: other people don’t get any errors on the Rosetta tasks that fail on your machine. Rosetta seems to be uncovering a fault that those synthetic stress testers fail to detect. |
Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 332,545 RAC: 421 |
If I don't run Rosetta, I don't get any errors on any of my other projects. Yet you're the only one that is having signal 11 issues with WUs that others can process with no problems at all- even with the same application. Not arguing with you. As I previously stated, I guess Rosetta tasks work the memory harder than any other project. The Einstein GW tasks are supposedly very hard on memory yet I have no issues. The TN-Grid tasks which are also molecular modeling like Rosetta have no issues. And I never got any response to my question about whether VSYSCALL=emulate is needed or not for Rosetta apps. Maybe that is the problem. I can either continue to run tasks here and have errors or give up completely. No skin off my nose as far as I am concerned. Only using 2 of 30 cores so not losing too much compute time. |
Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 332,545 RAC: 421 |
If I don't run Rosetta, I don't get any errors on any of my other projects. The thing is: other people don’t get any errors on the Rosetta tasks that fail on your machine. Rosetta seems to be uncovering a fault that those synthetic stress testers fail to detect. Well neither Prime95 nor y-cruncher are synthetic applications. They are real compute loads like Rosetta. And yet they uncover no memory issues and cause no sigsegv errors. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,738,758 RAC: 8,494 |
I think I've always used CDs because older machines sucked at booting from USB. What do you mean a "poor tester for Linux"? It tests the physical RAM, and it doesn't matter what OS you run on the machine afterwards. I've used all sorts of dodgy 2nd hand crap from Ebay, and Memtest has always spotted faulty RAM within 5 or 10 minutes. Never had anything crash that's passed a 2 hour memtest. It's weird that it's only Rosetta and only your machine. It has to be a bug in Rosetta that only occurs on certain models of CPU. If you had hardware problems, other projects would screw up too. Rosetta is hardly the most difficult project to run. I'd say LHC stresses it most (the virtual machine apps, not Sixtrack), do you run that? |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,755,824 RAC: 22,866 |
It's weird that it's only Rosetta and only your machine. It has to be a bug in Rosetta that only occurs on certain models of CPU. That isn't the case. There are the same types of Tasks running using the same application on the same model CPUs without error. If it occurred on all systems of a given CPU using different applications, but the same applications on similar CPUs were OK then it would be a problem with that CPU type (needing a micro-code fix, or a specific fix for that CPU in the Application). If the errors were produced by a particular application on a particular CPU, but other applications on that same CPU work OK, then it'd be a problem with the application. As the errors are only occurring on a given system, and not on other systems using the same application & the same CPU it's a pretty fair bet that it is an issue with that system. Grant Darwin NT |
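The isolation reasoning in the post above can be sketched as a small decision function. This is a deliberate simplification with hypothetical names and only two yes/no inputs; real diagnosis involves more evidence than this, so treat it as an illustration of the logic rather than a diagnostic tool.

```python
def likely_fault(other_systems_same_cpu_fail: bool,
                 other_apps_fail_on_that_cpu: bool) -> str:
    """Guess where a fault lies from where the errors do and do not appear.

    other_systems_same_cpu_fail: do other hosts with the same CPU model,
        running the same application, also produce the errors?
    other_apps_fail_on_that_cpu: do different applications also fail on
        that CPU model?
    """
    if not other_systems_same_cpu_fail:
        # Same application and same CPU work elsewhere: suspect this host.
        return "system"
    if other_apps_fail_on_that_cpu:
        # Every host with this CPU fails, across applications: suspect the CPU type.
        return "cpu type"
    # Only this application fails on this CPU model: suspect the application.
    return "application"
```

For the case discussed in this thread (other hosts with the same CPU complete the same tasks), `likely_fault(False, False)` returns `"system"`.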
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,738,758 RAC: 8,494 |
It's weird that it's only Rosetta and only your machine. It has to be a bug in Rosetta that only occurs on certain models of CPU. That isn't the case. Agreed. I guess he either has to mess around with hardware settings or pull RAM chips, or just not give that PC Rosetta to do. |
Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 332,545 RAC: 421 |
What do you mean a "poor tester for Linux"? It tests the physical RAM, and it doesn't matter what OS you run on the machine afterwards. It is the opinion of the memory testers at OCN that Memtest is a particularly poor tester. Does not test very thoroughly and also not very consistently. Since those experts have much more experience than I, I trust their opinions. |
Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 332,545 RAC: 421 |
but the same applications on similar CPUs were OK then it would be a problem with that CPU type (needing a micro-code fix, or a specific fix for that CPU in the Application). If the errors were produced by a particular application on a particular CPU, but other applications on that same CPU work OK, then it'd be a problem with the application. But we still do not know that. As far as I can tell in my research in various threads here and at Seti and Einstein, if an application is written expecting the deprecated VSYSCALL function to be available, the application will segfault. Only applies to Linux systems. Not applicable in Windows. |
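One way to check the vsyscall question on a given Linux host is to look at the kernel command line and at the process memory map. The sketch below assumes the `vsyscall=` boot parameter as it appears in `/proc/cmdline` and the `[vsyscall]` entry in `/proc/self/maps`; the function names are hypothetical. If the parameter is absent, the effective mode depends on how the kernel was built, so a `None` result is not proof either way.

```python
from typing import Optional


def vsyscall_mode(cmdline: str) -> Optional[str]:
    """Return the value of a vsyscall= boot parameter, or None if not set."""
    for token in cmdline.split():
        if token.startswith("vsyscall="):
            return token.split("=", 1)[1]
    return None


def vsyscall_page_mapped(maps_text: str) -> bool:
    """True if a [vsyscall] mapping is listed in /proc/<pid>/maps text."""
    return any(line.rstrip().endswith("[vsyscall]")
               for line in maps_text.splitlines())


# Usage on a live Linux system (paths are the standard procfs locations):
#   from pathlib import Path
#   print(vsyscall_mode(Path("/proc/cmdline").read_text()))
#   print(vsyscall_page_mapped(Path("/proc/self/maps").read_text()))
```

Both functions take plain text, so they can be tried against saved output from another machine as well as the local one.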
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,755,824 RAC: 22,866 |
Does the OS version info give you an idea of whether VSYSCALL function is likely to be available or not? Does the fact that you do complete some Tasks indicate it's not the issue? but the same applications on similar CPUs were OK then it would be a problem with that CPU type (needing a micro-code fix, or a specific fix for that CPU in the Application). If the errors were produced by a particular application on a particular CPU, but other applications on that same CPU work OK, then it'd be a problem with the application. Other systems running the same Linux application that completed WUs that errored out on your system. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1125457579 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1125456414 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1125187602 Grant Darwin NT |
MarkJ Send message Joined: 28 Mar 20 Posts: 72 Credit: 25,238,680 RAC: 0 |
I'm seeing a bunch of fold_and_dock work units that are using huge amounts of memory. I just spotted one where its properties had a working set size of 43GB. The machine in question has 64GB. I have suspended all the other tasks so it can get out of the way. I am now seeing the disk LED on constantly so it's probably grown past the available memory and paging. I have a couple of failures like this which only wanted 18GB. Not sure what they're doing but the average BOINC user isn't going to have machines with that much memory and most are going to fail. BOINC blog |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,755,824 RAC: 22,866 |
I have a couple of failures like this which only wanted 18GB. And on the other system that tried to process it (also Linux), while it only used 228MB of RAM, it crashed out with a Signal 11 error in 45 sec. Still waiting to see the result of one of these Tasks on a Windows system.
Edit- just found one against the out of control RAM Task that Keith aborted.
Outcome: Computation error
Client state: Compute error
Exit status: 1 (0x00000001) Unknown error code
Computer ID: 5159178
Run time: 19 min 44 sec
CPU time: 18 min 38 sec
Validate state: Invalid
Credit: 10.00
Device peak FLOPS: 3.28 GFLOPS
Application version: Rosetta v4.20 windows_x86_64
Stderr output:
<core_client_version>7.0.80</core_client_version>
<![CDATA[
<message>
Función incorrecta. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @kp8RjDVk_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_kp8RjDVk_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3873245
Using database: database_357d5d93529_n_methylminirosetta_database
ERROR: Error in core::kinematics::FoldTree::get_jump_that_builds_residue(): This residue is not the child of (built by) a jump!
ERROR:: Exit from: ......srccorekinematicsFoldTree.cc line: 436
BOINC:: Error reading and gzipping output datafile: default.out
16:00:04 (3796): called boinc_finish(1)
</stderr_txt>
]]>
Looks like another batch of dud Work Units.
Edit- just found one running on one of my systems. Been going for just under an hour, and its properties at this stage are
Virtual memory size 45.69GB
Working set size 9.35GB
And it hasn't checkpointed since 7 minutes after it started.
Checking Task Manager, the RAM usage for that task is increasing at roughly 2MB per second. Grant Darwin NT |
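To put the growth rate quoted above in perspective, here is a rough back-of-the-envelope sketch. It assumes the growth stays linear at the observed rate, which may not hold, and the 32GB total in the usage note is a hypothetical figure, since the post does not say how much RAM that system has.

```python
def seconds_until_exhausted(current_gb: float, total_gb: float,
                            rate_mb_per_s: float) -> float:
    """Seconds until a linearly growing working set reaches total_gb."""
    if rate_mb_per_s <= 0:
        raise ValueError("growth rate must be positive")
    remaining_mb = max(total_gb - current_gb, 0.0) * 1024  # GB to MB
    return remaining_mb / rate_mb_per_s
```

With the reported 9.35GB working set and 2MB per second, `seconds_until_exhausted(9.35, 32, 2) / 3600` comes to a little over three hours on a hypothetical 32GB machine, well inside the 28800-second CPU run time shown on the command line in the stderr output.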
MarkJ Send message Joined: 28 Mar 20 Posts: 72 Credit: 25,238,680 RAC: 0 |
I have a couple of failures like this which only wanted 18GB. And on the other system that tried to process it (also Linux), while it only used 228MB of RAM, it crashed out with a Signal 11 error in 45 sec. The other system only has 8GB of memory, that might account for why it only ran for 45 seconds. I have aborted all the fold_and_dock tasks and let the other tasks run. BOINC blog |