Questions and Answers : Unix/Linux : boincmgr with rosetta downloaded lots of data and when I rebooted it seemed to start over
Author | Message |
---|---|
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Looks like you are getting download errors: <core_client_version>7.16.6</core_client_version> <![CDATA[ <message> app_version download error: couldn't get input files: <file_xfer_error> <file_name>database_357d5d93529_n_methyl.zip</file_name> <error_code>-120 (RSA key check failed for file)</error_code> <error_message>signature verification failed</error_message> </file_xfer_error> </message> ]]> Perhaps you have an anti-virus that is blocking the zip file from downloading? Rosetta Moderator: Mod.Sense |
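For anyone who wants to pull the failing file name and error code out of a task report like the one quoted above, here is a minimal Python sketch. It assumes the `<file_xfer_error>` fragment is well-formed XML, as it is in this report; the function name `parse_xfer_error` is made up for illustration and is not part of BOINC.

```python
import re
import xml.etree.ElementTree as ET

# Fragment copied from the task report quoted in this thread.
REPORT = """<message> app_version download error: couldn't get input files:
<file_xfer_error>
  <file_name>database_357d5d93529_n_methyl.zip</file_name>
  <error_code>-120 (RSA key check failed for file)</error_code>
  <error_message>signature verification failed</error_message>
</file_xfer_error>
</message>"""


def parse_xfer_error(text):
    """Return one dict per <file_xfer_error> block found in text (hypothetical helper)."""
    results = []
    for block in re.findall(r"<file_xfer_error>.*?</file_xfer_error>", text, re.S):
        node = ET.fromstring(block)
        code_text = node.findtext("error_code", default="").strip()
        # The numeric code comes first, optionally followed by a reason in parentheses.
        match = re.match(r"(-?\d+)\s*(?:\((.*)\))?", code_text)
        results.append({
            "file": node.findtext("file_name", default="").strip(),
            "code": int(match.group(1)) if match else None,
            "reason": match.group(2) if match else code_text,
            "message": node.findtext("error_message", default="").strip(),
        })
    return results


for err in parse_xfer_error(REPORT):
    print(err["file"], err["code"], err["reason"])
```

Here the code is -120 and the stated reason is a failed signature check on the downloaded zip, so the file that arrived did not match what the project signed.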
Macuilxochitl Joined: 11 Oct 08 Posts: 13 Credit: 134,700 RAC: 0 |
No, I'm using Linux, I don't use AV. But I guess I may have learned why I was getting so many failed units, even if I haven't figured out my DL issues. I hadn't been able to run memtest because, even though it was installed, it wasn't one of my Ubuntu GRUB choices; I'm not sure why. Maybe I installed the system in UEFI mode, I don't know if that is a factor. But I booted from a live Debian image and was able to run memtest, and it started kicking up errors pretty quickly. I have a G.SKILL Ripjaws V Series 16GB 288-Pin DDR4 SDRAM DDR4 3200 stick and was running it at its XMP-2 profile, which is the rated speed of the RAM. So I set the RAM speed to 2133 MHz, the lowest speed, and it passed memtest. And I've been running it at that speed for 5 hours and have gotten no further errors. Also, for some reason I was using a little of my swap partition even though I always had plenty of reserve memory. Now I'm using 8 of 16GB of RAM, but no swap at all. After I'm finished using the machine for the day I'll reset the memory to its XMP 1 profile (which is probably ~2933 MHz or so) and run some memtest on it. If it is stable maybe I'll try pushing it up just a little bit. I'm not sure how much memory speed affects BOINC crunching speed. I'm just relieved that it doesn't look like my PSU is at fault, that would have been expensive to fix. |
Macuilxochitl Joined: 11 Oct 08 Posts: 13 Credit: 134,700 RAC: 0 |
I dialed the memory timings down from 3200MHz to 2933MHz, which seems like the maximum I can squeeze out of this stick and still pass a round of memtest. But I'm still getting a few errors. Over maybe 10 hours I've gone from 133 total errors to 136. How bad is that? Are errors to be expected or do any errors indicate a serious issue? Maybe I should dial the RAM down to 2800 or try to RMA the stick? |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647 |
How bad is that? Extremely bad. You should not get any errors. However, there will be some tasks that are cancelled by the project that will be classed as an error, and there will be some tasks that do error out. Actual computation errors (unless there is a batch of bad Work Units) should be 2% or less of your Total Task number. So you should have no more than 2 Computation errors for that system. Some of your errors are related to the download issues, but the others are computation related and show memory problems (or data corruption). Maybe I should dial the RAM down to 2800 or try to RMA the stick? You need to revert your CPU & memory clocks and voltages to stock values. Computation Errors show that the overclock is not stable. Grant Darwin NT |
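The 2% rule of thumb above is easy to check against the task counts shown on a host's page. The following Python sketch is only an illustration: the function names are invented here, and the task counts in the example are hypothetical, not taken from any host in this thread.

```python
def computation_error_rate(errors, total_tasks):
    """Fraction of all tasks that ended in a computation error."""
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    return errors / total_tasks


def within_guideline(errors, total_tasks, limit=0.02):
    """True if the error fraction is at or below the 2% rule of thumb."""
    return computation_error_rate(errors, total_tasks) <= limit


# Hypothetical host with 100 tasks in total.
print(within_guideline(2, 100))   # 2 errors out of 100 tasks
print(within_guideline(3, 100))   # 3 errors out of 100 tasks
```

Tasks cancelled by the project or lost to download failures would need to be left out of `errors` before applying the check, since the guideline refers to computation errors only.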
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
I dialed the memory timings down from 3200MHz to 2933MHz, which seems like the maximum I can squeeze out of this stick and still pass a round of memtest. But I'm still getting a few errors. Over maybe 10 hours I've gone from 133 total errors to 136. I didn't understand the relevance of this earlier in the thread, so I didn't want to interfere, but I just looked up your CPU and it says it can't access RAM faster than 2667. Obviously it has been, with errors, but it sounds like a good idea to step it down until it's fully successful. There's often a margin, so 2800 is worth a try next. RMA might be tricky if it's only failing at a speed you already know your CPU can't handle in the first place. The other thing is you have a 6-core/12-thread processor with 16GB of memory. With the project's RAM demands recently, you'll struggle to run more cores without more memory. You mentioned you could add another 8GB - that sounds like a good idea too. Also, ensure you have the latest BIOS. Some updates improve the stability of higher speed RAM. |
Macuilxochitl Joined: 11 Oct 08 Posts: 13 Credit: 134,700 RAC: 0 |
My $85 Ryzen 5 1600 AF can handle higher RAM speeds even though AMD rather conservatively says that its rated speed is only 2667MHz. From what I've read the motherboard is more of a constraint than the CPU, at least up to about 3200MHz. My motherboard's QVL list mentions many kits that have been tested to run substantially faster than 2667MHz. https://www.asrock.com/mb/AMD/B450M%20Pro4/index.us.asp#Memory Of the 290 RAM kits that ASRock tested, 61 of them were rated at 3000 or better and none of them tested as running slower than 2933MHz, and all of the 29 3200MHz kits apparently ran at their rated speeds, and the 7 tested '2933MHz' sets also tested running at their rated speeds. I do so love playing with spreadsheets! I reclocked my memory down to 2800MHz and have not gotten any additional errors over the last few days, running maybe 6-8 hours a day, so I guess that is where I'll stay. I am a bit disappointed that my memory does so much worse than all the other relatively fast sticks tested by ASRock, but probably it won't hurt my folding unduly. I'm not about to overclock my CPU, with the stock AMD processor fan I'm hitting rather high temps (80C) even at the rated default clock speed of 3200MHz (max burst speed is apparently 3700MHz without overclocking, but I've never seen the processor go faster than 3500MHz). On hot days I even reduce the CPU limits in BOINC preferences to keep the machine from overheating. Given my unimpressive performance I apparently wasn't too lucky in the hardware lottery, but what the heck, I built the system for about $300 and tax, if you don't count the case and power supply I recycled from an old Athlon XP 1700+ build. I'm only using 9GB of RAM now, and have never seen it go over 10GB on this (or any) machine, so I have 5.6GB in the bank, but if I ever see the RAM usage go over 12GB I'll order another stick, RAM prices seem to be falling at the moment after climbing for a few months. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
My $85 Ryzen 5 1600 AF can handle higher RAM speeds even though AMD rather conservatively says that its rated speed is only 2667MHz. From what I've read the motherboard is more of a constraint than the CPU, at least up to about 3200MHz. My motherboard's QVL list mentions many kits that have been tested to run substantially faster than 2667MHz. https://www.asrock.com/mb/AMD/B450M%20Pro4/index.us.asp#Memory Of the 290 RAM kits that ASRock tested, 61 of them were rated at 3000 or better and none of them tested as running slower than 2933MHz, and all of the 29 3200MHz kits apparently ran at their rated speeds, and the 7 tested '2933MHz' sets also tested running at their rated speeds. I do so love playing with spreadsheets! I understand, but it seems to me the board can take a whole range of CPUs which will support faster speeds than your CPU can, but seeing as yours is at the lower end, it's the CPU providing a bottleneck, so your RAM is successful at 2800MHz with a CPU that's supposed to only handle 2667MHz - consider yourself ahead. Now you've found a stable speed, to eke out the last dregs you might want to run CPU-Z and check what timings your RAM supports at other speeds. I've found it's possible to tweak them just a little, especially if the RAM isn't running at full speed. E.g. my DDR3 RAM defaults at 8-8-8-24-36 2T but I've edged it down to 8-8-8-24-33 1T while I've increased my FSB to run the RAM 2.85% above its top speed (1645.6 for 1600MHz RAM). It's marginal, but it's also stable - 22 days up-time for my overclock (that even surprised me!). I'm not about to overclock my CPU, with the stock AMD processor fan I'm hitting rather high temps (80C) even at the rated default clock speed of 3200MHz (max burst speed is apparently 3700MHz without overclocking, but I've never seen the processor go faster than 3500MHz). On hot days I even reduce the CPU limits in BOINC preferences to keep the machine from overheating.
Given my unimpressive performance I apparently wasn't too lucky in the hardware lottery, but what the heck, I built the system for about $300 and tax, if you don't count the case and power supply I recycled from an old Athlon XP 1700+ build. I'm running a small overclock but with permanent maximum speed (4525MHz instead of 4300MHz on an old FX8370) at 50C - also for 22 days - but if you have heat issues and a stock fan there's no point in me suggesting anything. And if you're limiting CPU usage with BOINC settings, that probably explains why you're well within your RAM limits, so that's one bonus. Overall, even though you're not happy with how you're running, it doesn't sound like there's a lot more you can do while avoiding errors. And unfortunately, I've never noticed tweaking RAM improves task performance at all tbh. Sorry. |
Macuilxochitl Joined: 11 Oct 08 Posts: 13 Credit: 134,700 RAC: 0 |
Crud, I thought I had this working properly, but I'm still getting a few errors. The machine was not cranking through units nearly as fast as I would like, but my errors didn't seem to be going up, so I ignored it for a couple of weeks. But now since I think May 30th I've gone from 44 completed units to 77, but errors crept up from 136 to 143. So a ratio of about 4.7 complete units to each failed unit, but that is still too high, right? So in my ~/.BOINC directory there is a big (724.5 KiB) text file called stderrgui.txt. It has 12417 lines and almost all of them contain the words fatal, failure, error, invalid, WARNING, CRITICAL or failed, but I have no idea if this file is what I want to look at or how to grep through it to find significant hints about why I am still getting failed tasks. Most of the error messages seem to be 'drawing failure for widget' or other things that seem to refer to the GUI, which is not normally running. Here are a few snippets from the file. (firefox:8163): Gtk-WARNING **: 18:24:11.661: Theme parsing error: colors.css:74:53: Invalid number for color value (firefox:8163): Gtk-WARNING **: 18:24:11.661: Theme parsing error: colors.css:75:53: Invalid number for color value (firefox:8163): Gtk-WARNING **: 18:24:11.661: Theme parsing error: colors.css:76:56: Invalid number for color value (boincmgr:9675): Gtk-WARNING **: 17:49:40.458: drawing failure for widget 'wxPizza': invalid matrix (not invertible) (boincmgr:9675): Gtk-WARNING **: 17:49:40.458: drawing failure for widget 'GtkBox': invalid matrix (not invertible) (boincmgr:9675): Gtk-WARNING **: 17:49:40.458: drawing failure for widget 'GtkWindow': invalid matrix (not invertible) (boincmgr:9675): Gtk-WARNING **: 17:49:40.475: drawing failure for widget 'wxPizza': invalid matrix (not invertible) Memory pressure relief: Total: res = 14671872/14622720/-49152, res+swap = 10211328/10211328/0 Memory pressure relief: Total: res = 14622720/14622720/0, res+swap = 10158080/10158080/0 Memory pressure
relief: Total: res = 14622720/14622720/0, res+swap = 10166272/10166272/0 Memory pressure relief: Total: res = 14618624/14630912/12288, res+swap = 10166272/10166272/0 Memory pressure relief: Total: res = 14561280/14561280/0, res+swap = 10108928/10108928/0 Memory pressure relief: Total: res = 14561280/14565376/4096, res+swap = 10113024/10113024/0 (boincmgr:170652): Gtk-CRITICAL **: 19:59:42.269: gtk_box_gadget_distribute: assertion 'size >= 0' failed in GtkScrollbar (boincmgr:170652): Gtk-CRITICAL **: 19:59:42.269: gtk_box_gadget_distribute: assertion 'size >= 0' failed in GtkScrollbar Gdk-Message: 20:00:16.658: WebKitWebProcess: Fatal IO error 11 (Resource temporarily unavailable) on X server :0. Gdk-Message: 20:00:16.658: boincmgr: Fatal IO error 11 (Resource temporarily unavailable) on X server :0. (boincmgr:460505): Gtk-CRITICAL **: 14:33:52.096: gtk_box_gadget_distribute: assertion 'size >= 0' failed in GtkScrollbar Temps seem OK (~70C now), memory usage is 5.35G/15.6G, and according to mpstat I'm only using about 35% of my CPU power even though my BOINC computing preferences Usage limits are "at most 100% of CPUs" and "at most 90% of CPU time"; maybe I should try and re-seat my CPU fan? This machine just really seems to be underperforming. |
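On the question of how to sift a file like this: one approach is to bucket the lines by their source and look only at whatever is left over. The Python sketch below does that for the kinds of lines quoted above. The category names and helper functions are invented for this example, and the last sample line is made up to show what survives the filter. As far as I know, stderrgui.txt is written by BOINC Manager (the GUI), while the client's own messages go to stdoutdae.txt and stderrdae.txt, so a file full of Gtk warnings may say little about failed tasks.

```python
import re
from collections import Counter

# Lines copied from the snippets in this thread, plus one made-up line
# ("Example non-GUI message") to show what is left after filtering.
SAMPLE = [
    "(firefox:8163): Gtk-WARNING **: 18:24:11.661: Theme parsing error: colors.css:74:53: Invalid number for color value",
    "(boincmgr:9675): Gtk-WARNING **: 17:49:40.458: drawing failure for widget 'wxPizza': invalid matrix (not invertible)",
    "Memory pressure relief: Total: res = 14622720/14622720/0, res+swap = 10158080/10158080/0",
    "(boincmgr:170652): Gtk-CRITICAL **: 19:59:42.269: gtk_box_gadget_distribute: assertion 'size >= 0' failed in GtkScrollbar",
    "Gdk-Message: 20:00:16.658: boincmgr: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.",
    "Example non-GUI message",
]

# Hypothetical categories for GUI-toolkit and browser-engine chatter.
NOISE_PATTERNS = {
    "gtk": re.compile(r"\bGtk-(WARNING|CRITICAL)\b"),
    "gdk": re.compile(r"^Gdk-Message:"),
    "memory_pressure": re.compile(r"^Memory pressure relief:"),
}


def classify(line):
    """Return the first matching noise category, or 'other'."""
    for name, pattern in NOISE_PATTERNS.items():
        if pattern.search(line):
            return name
    return "other"


def summarize(lines):
    """Count lines per category and collect the uncategorized ones."""
    counts = Counter()
    remaining = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        kind = classify(line)
        counts[kind] += 1
        if kind == "other":
            remaining.append(line)
    return counts, remaining


def summarize_file(path):
    """Same as summarize(), reading from a log file on disk."""
    with open(path, errors="replace") as handle:
        return summarize(handle)


counts, remaining = summarize(SAMPLE)
print(dict(counts))
print(remaining)
```

Calling `summarize_file` on the real file (for example with the path `~/.BOINC/stderrgui.txt` expanded via `os.path.expanduser`) would show how many of the 12417 lines fall outside the GUI categories.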
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
Crud, I thought I had this working properly, but I'm still getting a few errors. The machine was not cranking through units nearly as fast as I would like, but my errors didn't seem to be going up, so I ignored it for a couple of weeks. But now since I think May 30th I've gone from 44 completed units to 77, but errors crept up from 136 to 143. So a ratio of about 4.7 complete units to each failed unit, but that is still too high, right? I'm not sure it's quite as bad as you think. You say errors have increased by 7 and there are 7 boincmgr messages in your file snippets, all of which sound like errors within those tasks and not to do with your host machine (I'm not certain on that tbh). Your memory usage looks good - plenty of margin - and your temps are 10C lower than you reported before. But you've mentioned that you've set "at most 90% of CPU time", which I don't think you've mentioned before. Unintuitively, BOINC interprets this as running all cores at 100% for 90% of the time and 0% for 10% of the time, and this has been known to cause task errors. While you've gained some extra margin in your temps, bump this up to 100% and see how it goes. Temps are bound to go up, but with the benefit of an 11% improvement in CPU utilisation and hopefully some extra stability - maybe those weird errors will disappear. If temps become a problem, better to reduce "at most 100% of CPUs" to 92% (11 of 12 threads) to retain stability than reduce CPU time to anything below 100%. Aside from that, I have no idea why your CPU utilisation is reporting so low. Bear in mind that Rosetta is struggling to provide sufficient tasks to download at the moment - you've completed all your tasks right now. |
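The two figures in the post above (an 11% gain from raising CPU time from 90% to 100%, and 92% of CPUs working out to 11 of 12 threads) can be reproduced with a little arithmetic. This Python sketch uses invented function names, and it assumes the "% of CPUs" limit is rounded down to a whole number of threads with a minimum of one, which is my assumption rather than something stated in the thread.

```python
import math


def throughput_gain(old_cpu_time_pct, new_cpu_time_pct=100.0):
    """Relative gain in computing time when the 'CPU time' limit is raised."""
    return new_cpu_time_pct / old_cpu_time_pct - 1.0


def threads_allowed(total_threads, cpu_pct):
    """Whole threads permitted by an 'at most N% of CPUs' limit.

    Assumes rounding down with a floor of one thread.
    """
    return max(1, math.floor(total_threads * cpu_pct / 100.0))


print(f"{throughput_gain(90):.1%}")   # gain from 90% -> 100% CPU time
print(threads_allowed(12, 92))        # threads used at 92% of 12
print(threads_allowed(12, 100))       # threads used at 100% of 12
```

The design point is that limiting the number of threads keeps each running task at a steady load, whereas limiting CPU time repeatedly pauses and resumes every task.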
Macuilxochitl Joined: 11 Oct 08 Posts: 13 Credit: 134,700 RAC: 0 |
The snippet I quoted was just a few lines to give the flavor, the file had 1240+ lines, each one (that I noticed) with some sort of error message. As Sid seemed to suggest, I bumped my computing preferences up to 100% of CPUs 100% of the time and I've been running the machine a few more days without doing anything else and my failed tasks are still 143 while my completed tasks has gone from 77 to 100, and temps are holding steady at 78-80C, which is higher than a lot of folks with my processor report, but still acceptable apparently. So unless I get a big increase in failed tasks I'm just going to assume the machine is behaving reasonably well and making a contribution and I just got a mucked up work unit download. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647 |
I would reduce the size of your cache, as it's taking you around 3 days to return work, and the deadlines are 3 days, but with the amount of work you are carrying you are occasionally missing deadlines. In your account settings, under Other: "Store at least 0.2 days of work"; "Store up to an additional 0.02 days of work". That should give you enough work to keep the system busy, and not miss deadlines. Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
As Sid seemed to suggest, I bumped my computing preferences up to 100% of CPUs 100% of the time and I've been running the machine a few more days without doing anything else and my failed tasks are still 143 while my completed tasks has gone from 77 to 100, and temps are holding steady at 78-80C, which is higher than a lot of folks with my processor report, but still acceptable apparently. So unless I get a big increase in failed tasks I'm just going to assume the machine is behaving reasonably well and making a contribution and I just got a mucked up work unit download. Good news. And a little bit of googling lets me bring some more. As I'm still on an AMD FX8370 I don't know too much about any of the Ryzens, so when you said you have a Ryzen 5 1600AF I didn't recognise the significance of the AF bit. I've discovered what it is from a table here and the good news is that while the original 1600 could only access 2667 RAM, the AF can access 2933. It might be we got sidetracked on RAM speeds, while looking up the wrong processor, until we realised you were using that "90% of the time" setting, which might've been the real cause of your issues. So I'm going to suggest you bump your RAM speed back up to 2933 and see how that goes. Fingers crossed. |
Macuilxochitl Joined: 11 Oct 08 Posts: 13 Credit: 134,700 RAC: 0 |
Grant, my Computing preferences settings are as follows: "Store at least 0.1 days of work"; "Store up to an additional 0.5 days of work"; "Switch between tasks every 60 minutes". Can you suggest better values? My machine is on typically 5-8 hours a day, but often missing a day or more. I had no idea I only had 3 days to return a work unit. That seems a little tight for folks that do not leave their computers on all the time or have very fast machines. |
Macuilxochitl Joined: 11 Oct 08 Posts: 13 Credit: 134,700 RAC: 0 |
Sid, I think I will ramp up my RAM speed back to 2933MHz. I passed memtest at that speed, but dropped my memory speed down to 2800MHz when I noticed that I was getting failed units. But now that I know that my failed units might have been caused by taking too long to complete my tasks dropping the speed was probably a mistake. BTW, Mr. Celery, I'm amazed how much work you get done and how low your temps are with an AMD FX8370. My CPU is supposedly about twice as fast as your 6 year-old processor and uses half the amount of juice, but you have much lower temps and seem to be getting more crunching done. https://www.cpubenchmark.net/compare/AMD-Ryzen-5-1600-vs-AMD-FX-8370-Eight-Core/2984vs2347 |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
Grant, my Computing preferences settings are as follows: BOINC ought to account for your reduced up-time, but if you still think it's providing more tasks than you can complete within the relatively short deadlines there's nothing stopping you from manually reducing the additional days figure. There's no real rule here, apart from using a figure that allows you to be successful, while also calling down extra tasks before you run out, given you have slightly limited internet access. Edge your additional days down if you think you have too many tasks to complete by deadline, but if you find you have unused cores before grabbing more, tweak it back up again. Sid, I think I will ramp up my RAM speed back to 2933MHz. I passed memtest at that speed, but dropped my memory speed down to 2800MHz when I noticed that I was getting failed units. But now that I know that my failed units might have been caused by taking too long to complete my tasks dropping the speed was probably a mistake. Ha! The main reason for this is I'm in the Midlands of England and you're in Cuba. Ambient temperatures here give me a distinct advantage (the only one?) of letting me overclock and adjust my power settings so I'm running permanently at 4.525GHz and 24/7 while you're nearer 3.2GHz and 1/3 of not quite every day. I was in trouble last week when we had a mini heatwave with temps in the 90s, but now we're struggling to reach 70F I'm ok again. My CPU's max working temp is 62C though, not 95C like yours, so I don't have great margins and I also use a 280mm water-cooled CPU cooler to keep the temps right down. Once we get past what we laughingly call our summer, I'll be looking to increase my overclock back to 4.73GHz which I was running before I blew my last motherboard in March - running so fast comes at some cost. But I note your most recent 8hr tasks are crediting you over 410, while mine average 250-270.
That seems to better reflect the power of your CPU over mine now all your settings seem optimised. |
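To see why the cache size matters for a machine that is only on part of the day, here is a rough worst-case estimate in Python. It is a sketch under a loud assumption: it treats "days of work" as 24 hours of computing per day and ignores any correction BOINC applies for up-time (which, as noted above, it ought to make). The function names are invented, and the 6.5 hours per day figure is simply the midpoint of the 5-8 hours mentioned in the thread.

```python
def calendar_days_to_drain(buffer_days, hours_on_per_day, days_skipped=0):
    """Calendar days needed to work through the buffer.

    Worst-case model: the buffer is measured in 24-hour days of computing,
    and the machine computes only while it is switched on.
    """
    if hours_on_per_day <= 0:
        raise ValueError("hours_on_per_day must be positive")
    return buffer_days * 24.0 / hours_on_per_day + days_skipped


def misses_deadline(buffer_days, hours_on_per_day, deadline_days=3.0, days_skipped=0):
    """True if the last buffered task would finish after the deadline."""
    return calendar_days_to_drain(buffer_days, hours_on_per_day, days_skipped) > deadline_days


# 0.1 + 0.5 days of work (the settings quoted above), machine on ~6.5 h/day.
print(round(calendar_days_to_drain(0.6, 6.5), 2))
print(misses_deadline(0.6, 6.5))                    # no days skipped
print(misses_deadline(0.6, 6.5, days_skipped=1))    # one day with the machine off

# 0.2 + 0.02 days of work (the values suggested earlier in the thread).
print(misses_deadline(0.22, 6.5, days_skipped=1))
```

Under this model the larger buffer fits inside a 3-day deadline only if the machine runs every day, which matches the report of occasionally missed deadlines, while the smaller buffer leaves room for a skipped day.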
©2024 University of Washington
https://www.bakerlab.org