Message boards : Number crunching : GPU computing
David E K (Volunteer moderator, Project administrator, Project developer, Project scientist) · Joined: 1 Jul 05 · Posts: 1018 · Credit: 4,334,829 · RAC: 0

Plenty of scientific progress!
https://www.bakerlab.org/wp-content/uploads/2016/09/HuangBoyken_DeNovoDesign_Nature2016.pdf
https://www.bakerlab.org/wp-content/uploads/2016/09/Bhardwaj_Nature_2016.pdf
Sid Celery · Joined: 11 Feb 08 · Posts: 2125 · Credit: 41,228,659 · RAC: 8,784

I think the question was about re-coding to take advantage of newer protocols. But with regard to these papers from a few weeks ago: these are the sort of things that should be posted in the Science forum when they become available.
David E K (Volunteer moderator, Project administrator, Project developer, Project scientist) · Joined: 1 Jul 05 · Posts: 1018 · Credit: 4,334,829 · RAC: 0

They have been posted and tweeted. Lots of cool science happening recently.
Dr. Merkwürdigliebe · Joined: 5 Dec 10 · Posts: 81 · Credit: 2,657,273 · RAC: 0

Because I want to help. That's why.

One of the problems is the heterogeneous architecture of Rosetta@home: there are PCs, Macs and tablets/smartphones (seriously?). Why not an internet-connected dual-core toaster? These devices bring a lot of issues, and that is, IMHO, a waste of developer resources. A homogeneous architecture based on AVXx would alleviate all of those problems while yielding higher performance.

The distributed nature of Rosetta also introduces latencies: preparing work, zipping it, sending it and collecting the results back over a WAN. Being forced to deal with ultra-lame ancient CPUs and the like is another problem.
David E K (Volunteer moderator, Project administrator, Project developer, Project scientist) · Joined: 1 Jul 05 · Posts: 1018 · Credit: 4,334,829 · RAC: 0

We do use local and UW-hosted clusters: https://itconnect.uw.edu/service/shared-scalable-compute-cluster-for-research-hyak/ We have also been given time on cloud computing resources, and we have been awarded many, many compute years on supercomputing resources such as Blue Gene.

Specific questions and concerns about code development and optimizations are more of a Rosetta Commons issue; they have hired developers to tackle such issues. Keep in mind, we are a research lab whose main priority is research. One overlooked benefit of distributed computing is getting people familiar with science and allowing them to be directly a part of it.
robertmiles · Joined: 16 Jun 08 · Posts: 1232 · Credit: 14,281,662 · RAC: 1,402

"I've just been looking at the performance of the new GTX1080 and for DOUBLE precision calculations it does 4 Tflops!!!! For comparison a relatively high performance chip like an overclocked 5820K will do maybe 350GFlops. So we are talking an order of magnitude difference. In addition the Tesla HPC version will probably be double that at 8 TFlops. (Edit: Looks like it is actually 5.3TFlops) The Volta version of the gtx1080 (next gen on, due in about 18 months time) is rumoured to be 7TFlops FP64 in the consumer version."

More computing performance is not a good answer if the limit comes from available memory rather than from compute. Rosetta@home has already looked into GPU versions, and found that they would require about 6 GB of graphics memory per GPU to get the expected 10 times the performance of the CPU version. The GPU version would run each workunit at about the same speed as the CPU version, and would therefore need to run 10 workunits at the same time, using 10 times as much memory, to get 10 times as much performance. Rather few of the high-end graphics boards have that much memory.
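A back-of-the-envelope sketch of that memory argument, as plain C++. The per-workunit footprint below is a hypothetical round number implied by the 6 GB / 10x estimate above, not a measured value:

```cpp
// Sketch: why a 10x GPU speedup implies roughly 10x the memory.
#include <cstdio>

int main() {
    const double gb_per_workunit = 0.6;  // assumed footprint of one model (illustrative)
    const int    target_speedup  = 10;   // want 10x the throughput of one CPU core

    // Each GPU "copy" of the work runs about as fast as one CPU core, so the
    // only way to reach the target is to keep that many workunits resident at once.
    const double gpu_memory_needed = gb_per_workunit * target_speedup;
    std::printf("GPU memory needed: %.1f GB\n", gpu_memory_needed);  // prints 6.0 GB
    return 0;
}
```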
robertmiles · Joined: 16 Jun 08 · Posts: 1232 · Credit: 14,281,662 · RAC: 1,402

"I can't fathom the computing knowledge you need for something like Rosetta. Or anything useful for that matter... I just got into learning Python (I figured an EE should know a good bit of programming) and I'm struggling like mad. MATLAB is the only language I'm proficient at, but it's so user friendly it doesn't count IMO."

I've used Fortran for several years, and have taken classes in C++ and CUDA since then. Is any help needed for translating any remaining Fortran code to C++? I would not be able to travel for this. I'm still looking for an online OpenCL class aimed at GPUs rather than FPGAs.

A CUDA version would work on most Nvidia GPUs, but not on other brands. An OpenCL version should work on other brands of GPUs. A GPU version REQUIRES that most of the application allows many threads to run in any order, or even at the same time, since they don't use anything produced by the other threads. If this is not satisfied, the GPU version may be as slow as a quarter of the speed of the CPU version.
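A minimal C++17 sketch of that "threads in any order" requirement: each element is scored independently, so the runtime (or a GPU) is free to run the work in any order or all at once. The score() function is a made-up stand-in for real per-model work, not anything from Rosetta:

```cpp
// Order-independent, data-parallel work: the structure a GPU port needs.
// C++17 parallel algorithms; with GCC, link against TBB (-ltbb).
#include <algorithm>
#include <execution>
#include <vector>
#include <cmath>
#include <cstdio>

static float score(float x) {
    return std::sin(x) * std::cos(x);  // depends only on its own input
}

int main() {
    std::vector<float> models(1 << 20);
    for (std::size_t i = 0; i < models.size(); ++i)
        models[i] = static_cast<float>(i);

    std::vector<float> scores(models.size());
    // par_unseq tells the runtime the iterations may run in any order,
    // interleaved or simultaneously, because none depends on another.
    std::transform(std::execution::par_unseq,
                   models.begin(), models.end(), scores.begin(), score);

    // A trajectory where step i needs the result of step i-1 cannot be
    // written this way, and that is what makes a fast GPU port hard.
    std::printf("scores[42] = %f\n", scores[42]);
    return 0;
}
```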
robertmiles · Joined: 16 Jun 08 · Posts: 1232 · Credit: 14,281,662 · RAC: 1,402

So you want far fewer processors to be used? None of my computers use a CPU that even has AVXx available, and not enough money is available to replace all the computers available through BOINC with equivalents that have AVXx available.

It would be possible, though, to produce separate compiles of the application for computers with AVXx and computers without, and add a shell program that tests what the CPU has available, then starts only the version of the program best for the current CPU.
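A minimal POSIX sketch of such a launcher. The binary names (rosetta_generic, rosetta_avx, rosetta_avx2) are hypothetical; __builtin_cpu_supports is a real GCC/Clang built-in for x86 feature detection:

```cpp
// launcher.cpp -- start the Rosetta build best suited to this CPU (sketch).
// Assumes hypothetical builds "rosetta_generic", "rosetta_avx" and
// "rosetta_avx2" are installed alongside this launcher.
#include <unistd.h>   // execv (POSIX)
#include <cstdio>

int main(int argc, char* argv[]) {
    __builtin_cpu_init();                      // populate CPU feature flags (GCC/Clang)
    const char* binary = "./rosetta_generic";  // safe fallback for CPUs without AVX
    if (__builtin_cpu_supports("avx2"))
        binary = "./rosetta_avx2";             // Haswell or newer
    else if (__builtin_cpu_supports("avx"))
        binary = "./rosetta_avx";

    std::printf("launching %s\n", binary);
    execv(binary, argv);                       // replace the launcher with the chosen build
    std::perror("execv failed");               // only reached if exec fails
    return 1;
}
```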
Dr. Merkwürdigliebe · Joined: 5 Dec 10 · Posts: 81 · Credit: 2,657,273 · RAC: 0

Your computer does support AVX. AVX2, however, was introduced with the Haswell CPU generation, and AVX-512 will be featured on Skylake-EP CPUs.

Yes, I know. We've already had that discussion here on this board. We're just waiting for results.
sgaboinc · Joined: 2 Apr 14 · Posts: 282 · Credit: 208,966 · RAC: 0

IMHO, GPUs can be thought of as very simplified ALUs with thousands of 'registers', on which the ALUs perform SIMD (single instruction, multiple data) execution. Typical GPUs have hundreds to thousands of 'gpu' (e.g. CUDA) 'cores', and they benefit a specific class of problem: the whole array or matrix is loaded into the GPU as 'registers', and SIMD instructions run the algorithm in a highly *vectorized* fashion. This means, among other things, that the problem needs to be *vectorizable*, *large*, and able to *run entirely in the 'registers' without accessing memory*. It is useless if we are trying to solve 2x2 matrices over and over again where the next iteration depends on the previous one; the whole of the rest of the GPU simply sits *unused* except for a few transistors.

In addition, adapting algorithms to GPUs is often a significantly *difficult* software task. It isn't as simple as 'compiling' a program to optimise for the GPU. Quite often the algorithms at hand *cannot make use of the GPU's vectorized* infrastructure, and this at times requires a *complete redoing* of the entire design, and even completely different algorithms and approaches.

While I'd not want to discourage users who have invested in GPUs, the above are true software challenges in really 'making it work'. As I personally don't use software that exploits these aspects of a GPU, I've actually refrained from getting one and have made do with a fairly recent Intel i7 CPU. I would think that similar challenges confront the Rosetta research team, and I tend to agree that functional needs are the higher priority versus redoing all the algorithms just to make them use GPUs: the functional needs are themselves complex, and spending overwhelming effort on 'gpu' algorithms could compromise the original research objectives.
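A tiny C++ illustration of that distinction (a sketch, not Rosetta code): the first loop is a large, independent element-wise operation that a compiler or GPU can vectorize, while the second is a small recurrence where every step needs the previous one, so no amount of GPU hardware helps:

```cpp
// "Vectorizable and large" versus "small and dependent" (illustrative sketch).
#include <vector>
#include <cstdio>

int main() {
    // 1) Large, independent element-wise work: every iteration stands alone,
    //    so it maps well onto SIMD lanes or thousands of GPU threads.
    std::vector<float> a(1 << 20, 1.0f), b(1 << 20, 2.0f), c(1 << 20);
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = a[i] * b[i] + 1.0f;

    // 2) Small, serial recurrence: iteration n needs the result of n-1,
    //    so it cannot be spread across lanes or threads at all.
    float x = 0.5f;
    for (int n = 0; n < 1000; ++n)
        x = 3.9f * x * (1.0f - x);   // each step depends on the previous one

    std::printf("c[0] = %.1f, x = %f\n", c[0], x);
    return 0;
}
```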
Darrell · Joined: 28 Sep 06 · Posts: 25 · Credit: 51,934,631 · RAC: 0

As someone with 14 discrete GPU cards, I support those projects that have applications that run primarily on the GPU (Einstein, SETI). My five computers have fairly modern CPUs, so I also give their cycles to projects that DON'T have GPU applications (Rosetta, LHC). This works for me; it keeps both GPUs and CPUs busy.
[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 1994 · Credit: 9,623,704 · RAC: 7,594

Earlier in this thread I posted the two PDFs about GPUs in the Rosetta@home project. In those papers they say they created a GPU app (so it is possible) for specific simulations, but they were not satisfied with its performance. That was over 3 years ago. Now, I don't know whether they have retried the app on recent, more powerful GPUs, recompiled it with updated compilers/libraries/etc., or abandoned it for good...
AMDave · Joined: 16 Dec 05 · Posts: 35 · Credit: 12,576,896 · RAC: 0

For curiosity's sake, what about incorporating open source, specifically this (second paragraph)?
[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 1994 · Credit: 9,623,704 · RAC: 7,594

"For curiosity's sake, what about incorporating open source, specifically this (second paragraph)?"

First, it's great that POEM's admins will release the code. Second, I don't think that Rosetta can use this code. POEM, if I'm not wrong, runs homogeneous simulations, not heterogeneous ones like Rosetta (ab initio, docking, etc.).
ToyMachine · Joined: 31 Oct 16 · Posts: 1 · Credit: 621,562 · RAC: 0

Could this thread be made a "sticky"? Right up front, separated from all the other newb questions. It might also be appropriate to add this topic to the FAQ section, and maybe a bit on the main page: "We don't utilize GPUs, and here's why." I think that would make it quicker and easier for new contributors to determine which project to add to which computer, or where to direct upgrade funds, even if they (I) are too lazy to dig into the forum. ;)
[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 1994 · Credit: 9,623,704 · RAC: 7,594

"More computing performance is not a good answer if the limit comes from available memory rather than from compute. Rosetta@home has already looked into GPU versions, and found that they would require about 6 GB of graphics memory per GPU to get the expected 10 times the performance of the CPU version."

Until yesterday I thought that GPU memory was only a question of "amount", not of the "kind" of memory... Matrix-vector case study
mmonnin · Joined: 2 Jun 16 · Posts: 59 · Credit: 24,222,307 · RAC: 66,706

That test basically hits the memory wall, where data can't be moved fast enough to keep the processing cores fully utilized. In this case it's the GPU, and HBM improves the bandwidth between processor and memory. HMC is a similar technology for the CPU and main memory.
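A roofline-style sketch of why a matrix-vector test hits that wall. The peak numbers are round, illustrative figures, not the specs of any particular card:

```cpp
// Roofline-style estimate (sketch): is a kernel compute-bound or memory-bound?
#include <algorithm>
#include <cstdio>

int main() {
    const double peak_flops = 9.0e12;   // ~9 TFLOP/s FP32 (illustrative round figure)
    const double peak_bw    = 320.0e9;  // ~320 GB/s memory bandwidth (illustrative)

    // Matrix-vector multiply in FP32: each 4-byte element of the matrix is read
    // once and used for one multiply-add, i.e. roughly 2 FLOPs per 4 bytes moved.
    const double arithmetic_intensity = 2.0 / 4.0;  // FLOPs per byte

    const double attainable = std::min(peak_flops, arithmetic_intensity * peak_bw);
    std::printf("attainable: %.2f TFLOP/s (peak %.2f TFLOP/s)\n",
                attainable / 1e12, peak_flops / 1e12);
    // Here attainable = 0.5 FLOP/byte * 320 GB/s = 160 GFLOP/s, far below peak:
    // the kernel is limited by memory bandwidth, which is why HBM helps.
    return 0;
}
```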
Greg Tippitt · Joined: 4 May 07 · Posts: 5 · Credit: 8,086,891 · RAC: 4,476

There is a reason that PCs must have a CPU rather than simply a big video card that tries to run the entire operating system on a GPU: GPUs are specialized processors for applications that can be designed for parallel computing. Huge speed improvements for some analyses when using GPUs does not mean that everything can run on a GPU more quickly.

One metaphor for understanding this is to think about a Walmart store. If they open all of the checkout lanes, it is faster for you to check out without having to wait in line. This is like a GPU doing parallel computing. Having lots of available checkout lanes will not make it faster for you to do your shopping if, for instance, you need milk, antifreeze, shampoo, a pair of sweatpants, and a bag of kitty litter. These items are normally in departments scattered all over the store, so it takes you a lot of time to visit each one, and having lots of empty checkout lanes doesn't help. If you've taken your family with you to Walmart, you can send each person to get different items and rendezvous at the checkout; that might be thought of as analogous to having 4 CPU cores.

The repeated posts asking "Why don't they compile the code for GPUs so it will run faster?" are somewhat like asking "Why doesn't the highway department attach a snowblower to the front of a Dodge Challenger SRT Hellcat, so that they can clear all the streets really quickly instead of using those slow trucks that take forever to get about town?"

The easy solution is to run Rosetta on your CPU cores, and then run GPUGRID, or your other favorite BOINC apps, on your GPUs.
[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 1994 · Credit: 9,623,704 · RAC: 7,594

"There is a reason that PCs must have a CPU rather than simply a big video card that tries to run the entire operating system on a GPU: GPUs are specialized processors for applications that can be designed for parallel computing. Huge speed improvements for some analyses when using GPUs does not mean that everything can run on a GPU more quickly."

I completely agree with you. In fact, the Top500 supercomputers use CPUs and GPUs TOGETHER.

"The easy solution is to run Rosetta on your CPU cores, and then run GPUGRID, or your other favorite BOINC apps, on your GPUs."

A BETTER solution is to make deeper use of our CPUs, for example with SSEx or AVX.
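A minimal sketch of what "deeper use of the CPU" can mean: summing two float arrays with AVX intrinsics, eight elements per instruction. The arrays and values are illustrative; compile with -mavx on GCC/Clang:

```cpp
// AVX vector addition sketch: eight float additions in one instruction.
#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    alignas(32) float c[8];

    __m256 va = _mm256_load_ps(a);      // load 8 aligned floats
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  // 8 additions at once
    _mm256_store_ps(c, vc);

    for (int i = 0; i < 8; ++i)
        std::printf("%.0f ", c[i]);     // prints "9 9 9 9 9 9 9 9"
    std::printf("\n");
    return 0;
}
```

A scalar loop does the same work one element at a time; this is the kind of per-core speedup an AVX build could offer without touching GPUs at all.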
[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 1994 · Credit: 9,623,704 · RAC: 7,594

Some news on the OpenCL side (I posted these on the Ralph@home forum as well). New CodeXL 2.5. ROCm is now at version 1.6. Codeplay released ComputeCpp for developing SYCL apps in Visual Studio. VC4CL brings OpenCL to the Raspberry Pi. The Khronos Group released SYCL 1.2.1, which lets "code for heterogeneous processors be written in a 'single-source' style using completely standard modern C++" (and supports TensorFlow).
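For illustration, a minimal sketch of that "single-source" style in SYCL 1.2.1: host and device code live in one C++ file, and it assumes a SYCL implementation such as ComputeCpp is installed. The kernel name vadd and the buffer sizes are arbitrary:

```cpp
// SYCL 1.2.1 vector addition sketch: one independent work-item per element.
#include <CL/sycl.hpp>
#include <vector>
#include <cstdio>

int main() {
    std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024, 0.0f);
    {
        cl::sycl::queue q;  // default device selector picks a GPU if one is available
        cl::sycl::buffer<float, 1> ba(a.data(), cl::sycl::range<1>(a.size()));
        cl::sycl::buffer<float, 1> bb(b.data(), cl::sycl::range<1>(b.size()));
        cl::sycl::buffer<float, 1> bc(c.data(), cl::sycl::range<1>(c.size()));

        q.submit([&](cl::sycl::handler& h) {
            auto ra = ba.get_access<cl::sycl::access::mode::read>(h);
            auto rb = bb.get_access<cl::sycl::access::mode::read>(h);
            auto wc = bc.get_access<cl::sycl::access::mode::write>(h);
            h.parallel_for<class vadd>(cl::sycl::range<1>(a.size()),
                                       [=](cl::sycl::id<1> i) {
                wc[i] = ra[i] + rb[i];  // each work-item handles one element
            });
        });
    }   // buffers go out of scope here: results are copied back to the host vectors
    std::printf("c[0] = %.1f\n", c[0]);  // prints 3.0
    return 0;
}
```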