R@H Scientists/Coders: An analysis of the Rosetta binaries...

Author	Message
dcdc (Joined: 3 Nov 05, Posts: 1836, Credit: 124,963,503, RAC: 117)
Message 78401 - Posted: 3 Jul 2015, 9:24:47 UTC - in response to Message 78400.
> I don't think so. GPU computational power (if the software is OK) outclasses the CPU.
> We'll see how Knights Landing will perform with its AVX-512 support and many more cores than a regular desktop CPU. The logical future: AVX1024 and TSX to handle all those cores efficiently.
This is off-topic but interesting. Isn't the benefit of GPGPU that the silicon is already there anyway, whereas AVXx might be better but isn't likely to make up much of a desktop processor's die area?

[VENETO] boboviz (Joined: 1 Dec 05, Posts: 2183, Credit: 13,608,396, RAC: 9,255)
Message 78402 - Posted: 3 Jul 2015, 10:21:15 UTC - in response to Message 78400.
> We'll see how Knights Landing will perform with its AVX-512 support and many more cores than a regular desktop CPU.
Knights Landing is not a CPU... and is not a GPU... :-)

Dr. Merkwürdigliebe (Joined: 5 Dec 10, Posts: 81, Credit: 2,657,273, RAC: 0)
Message 78403 - Posted: 3 Jul 2015, 11:06:25 UTC - in response to Message 78402.
> Knights Landing is not a CPU... and is not a GPU... :-)
It's heterogeneous computing vs. homogeneous computing. Knights Landing will be available not only as a coprocessor but also as a host CPU capable of running an OS and applications on its own. Nobody really wants to rewrite his or her application to utilize CUDA or OpenCL. But if you obey some coding rules and compile with the right flags, you get an instant speedup. Write clean code in your favorite programming language and let the compiler do the hard work for you.
GPGPU suffers from all kinds of problems, e.g. latency, power consumption and the necessary rewrite of your application. David E K already showcased what is possible with minor optimizations, and I figure rosetta@home is not necessarily written with speed in mind.
I also think it would benefit the project if the developers dropped support for really ancient CPUs and exploited the capabilities of reasonably new CPUs.
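A minimal C++ sketch of the kind of "clean code" the post above has in mind may help. The `axpy` function is hypothetical and not taken from Rosetta. It only illustrates a loop shape that compilers can usually auto-vectorize: a simple counted loop over contiguous data, with no early exits and no function calls in the body. The flags mentioned in the comments (`-O3 -march=native`) are common GCC/Clang choices, not the project's actual build settings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel (not from Rosetta): y[i] += a * x[i].
// Built with optimization enabled (for example g++ -O3 -march=native),
// a loop of this shape is a typical candidate for auto-vectorization.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* xp = x.data();   // contiguous input
    double* yp = y.data();         // contiguous output
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] += a * xp[i];        // independent iterations, no branches
    }
}
```

Whether the compiler really emits vector code depends on the compiler version, flags and target CPU. With GCC, `-fopt-info-vec` reports which loops were vectorized.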

[VENETO] boboviz
Message 78404 - Posted: 3 Jul 2015, 13:04:00 UTC - in response to Message 78403.
> It's heterogeneous computing vs. homogeneous computing. Knights Landing will be available not only as a coprocessor but also as a host CPU capable of running an OS and applications on its own. Nobody really wants to rewrite his or her application to utilize CUDA or OpenCL.
I agree with you. But you're speaking about a monster CPU/GPU/coprocessor. The Xeon Phi 7120P costs over 4000 dollars (and has 1.2 TFLOPS of DP). The Radeon 290X has 700 GFLOPS in DP and costs 600 dollars.
P.S. Recent versions of OpenCL, for example, make it possible to write code for the CPU and "pass" it to the GPU easily with SPIR.
> I also think it would benefit the project if the developers dropped support for really ancient CPUs and exploited the capabilities of reasonably new CPUs.
+1
I think we are OT :-P

Chilean (Joined: 16 Oct 05, Posts: 711, Credit: 26,694,507, RAC: 0)
Message 78407 - Posted: 3 Jul 2015, 16:04:31 UTC - in response to Message 78403.
>> Knights Landing is not a CPU... and is not a GPU... :-)
> I also think it would benefit the project if the developers dropped support for really ancient CPUs and exploited the capabilities of reasonably new CPUs.
+1. GROMACS (used by Folding@Home), for example, is written in assembly by hand to squeeze out every bit of performance.

Dr. Merkwürdigliebe
Message 78408 - Posted: 3 Jul 2015, 16:19:09 UTC - in response to Message 78407.
> +1. GROMACS (used by Folding@Home), for example, is written in assembly by hand to squeeze out every bit of performance.
Hmmm, this is rather a -1. It's the exact opposite of what I am trying to convey here. Hands off assembly, hands off intrinsics; just heed some coding guidelines and let the compiler (designers) do the hard job.
Updating the compiler infrastructure and providing us with a 64-bit build (with whatever instruction set enabled) is the way to go.

Chilean
Message 78409 - Posted: 4 Jul 2015, 16:58:51 UTC - in response to Message 78408.
>> +1. GROMACS (used by Folding@Home), for example, is written in assembly by hand to squeeze out every bit of performance.
> Hmmm, this is rather a -1. It's the exact opposite of what I am trying to convey here. Hands off assembly, hands off intrinsics; just heed some coding guidelines and let the compiler (designers) do the hard job. Updating the compiler infrastructure and providing us with a 64-bit build (with whatever instruction set enabled) is the way to go.
It was more of a +1 to approve your suggestion; then I added an example of how far some teams go in the name of speed. I wasn't suggesting that R@H code in assembly haha!

Dr. Merkwürdigliebe
Message 78410 - Posted: 4 Jul 2015, 17:49:16 UTC - in response to Message 78409.
> It was more of a +1 to approve your suggestion; then I added an example of how far some teams go in the name of speed. I wasn't suggesting that R@H code in assembly haha!
We're on the same page here :-) Definitely nice to see some optimizations in the pipeline...

xdarma (Joined: 20 Jan 08, Posts: 5, Credit: 5,589,505, RAC: 1,229)
Message 78412 - Posted: 5 Jul 2015, 10:29:04 UTC - in response to Message 78373.
> If you are the developer/researcher, the question they ask is "How many systems are going to use this new feature and will it pay back the researcher effort for the port?" The Rosetta researchers have an idea about what the machine distribution looks like. I don't know if the number of AMD HSA APUs is sufficient to warrant the effort.
Does this principle also apply in the case of AVX-512? IMO, there are many more APUs on the market than AVX-512-enabled CPUs. Even Intel CPUs have an integrated GPU. It is not HSA-capable, but it is nevertheless unused compute power.
From Wikipedia:
> The AVX instructions support both 128-bit and 256-bit SIMD. The 128-bit versions can be useful to improve old code without needing to widen the vectorization, and avoid the penalty of going from SSE to AVX, they are also faster on some early AMD implementations of AVX. This mode is sometimes known as AVX-128.[4] AVX-128 instructions that do not use YMM registers are also safe to use on operating systems without AVX-support, since AVX-support in operating systems refers to handling YMM register state.[3]
Maybe the best test is to use gcc with the option -mprefer-avx128. I don't know about icc.
IIRC, gcc keeps an eye on portability, not on performance. So using icc may hurt the ARM version of Rosetta. For sure, using icc hurts all non-Intel CPUs.
Just some random thoughts, indeed.

rjs5 (Joined: 22 Nov 10, Posts: 274, Credit: 23,730,845, RAC: 0)
Message 78414 - Posted: 5 Jul 2015, 19:02:01 UTC - in response to Message 78412. Last modified: 5 Jul 2015, 19:03:12 UTC
>> If you are the developer/researcher, the question they ask is "How many systems are going to use this new feature and will it pay back the researcher effort for the port?" The Rosetta researchers have an idea about what the machine distribution looks like. I don't know if the number of AMD HSA APUs is sufficient to warrant the effort.
> Does this principle also apply in the case of AVX-512? IMO, there are many more APUs on the market than AVX-512-enabled CPUs. Even Intel CPUs have an integrated GPU. It is not HSA-capable, but it is nevertheless unused compute power.
> From Wikipedia:
>> The AVX instructions support both 128-bit and 256-bit SIMD. The 128-bit versions can be useful to improve old code without needing to widen the vectorization, and avoid the penalty of going from SSE to AVX, they are also faster on some early AMD implementations of AVX. This mode is sometimes known as AVX-128.[4] AVX-128 instructions that do not use YMM registers are also safe to use on operating systems without AVX-support, since AVX-support in operating systems refers to handling YMM register state.[3]
> Maybe the best test is to use gcc with the option -mprefer-avx128. I don't know about icc.
> IIRC, gcc keeps an eye on portability, not on performance. So using icc may hurt the ARM version of Rosetta. For sure, using icc hurts all non-Intel CPUs.
> Just some random thoughts, indeed.
AVX-512?
AVX-512 will be in the Xeon Phi to be released soon, but it is not likely to have many target machines running Rosetta@Home in the near future. That is why I suggested the ICC -ax option, which generates fat binaries with support for multiple CPUs.
APU vs AVX-512?
Which APU target would you use? Nvidia CUDA? AMD native? If you target OpenCL, you get both Nvidia and AMD GPUs... and you get all Intel GPUs too. OpenCL takes some substantial coding changes.
ICC vs ARM version?
ICC does not generate ARM code, so if you want to generate an ARM Rosetta@Home target, you would use the ARM gcc compiler.
gcc with the option -mprefer-avx128.
Most of the developers adding to the gcc optimizations have an @intel.com mail address. gcc is a good compiler and lags icc by (I would guess) a year or so in feature development. The option itself tells the compiler to use the XMM registers, and if the compiler cannot vectorize the code, it can be just as fast as the 256- or 512-bit options, since you are doing one operation at a time. The developer must ensure that the code parallelism is recognizable to the compiler. Many times poor coding practices introduce ambiguities that prevent the generation of vector code.
It is VERY tough to say that binary "B" is XX% faster than binary "A", because it depends on where the program bottlenecks are. An Intel Wolfdale ( http://ark.intel.com/products/codename/24736/Wolfdale?q=wolfdale#@Desktop ) will behave much differently from any CPU that followed it. Nehalem CPUs and beyond had dramatic improvements in the cache subsystems, which many times moved the bottleneck to different areas of the program. David's performance increases are probably different from what I would see on my Haswell Intel Core i7 5930K with DDR4 memory. Future Intel CPUs are going to increase memory bandwidth, and programs will see different percentage performance increases from one application version to the next.
Here is a 2011 Sandy Bridge era AVX1 compiler presentation. It talks about the icc v12 compiler, and I am currently beta testing v16.
https://indico.cern.ch/event/125167/material/slides/0.pdf
It is always a very fun puzzle to figure out how to structure the code so the compiler can generate vector code.
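One common example of the "ambiguities" mentioned in the post above is pointer aliasing. The sketch below is only an illustration under stated assumptions: both functions are hypothetical and not Rosetta code, and `__restrict` is a non-standard extension (accepted by GCC, Clang, ICC and MSVC), not part of ISO C++.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: the compiler must assume `out` may overlap `a`
// or `b`, so it either adds a runtime overlap check before a vector
// loop or falls back to scalar code.
void add_may_alias(const float* a, const float* b, float* out, std::size_t n) {
    assert(a && b && out);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Same loop, but the caller promises that the three arrays do not
// overlap. `__restrict` is a compiler extension, not standard C++.
void add_no_alias(const float* __restrict a, const float* __restrict b,
                  float* __restrict out, std::size_t n) {
    assert(a && b && out);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Both versions compute the same result when the arrays are distinct. The second only removes a source of doubt for the optimizer, and calling it with overlapping arrays is undefined behavior.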

xdarma
Message 78416 - Posted: 6 Jul 2015, 20:30:53 UTC - in response to Message 78414.
AVX-512
I was wrong: I did not mean AVX-512, but AVX2.
APU vs AVX-512
A GPU client is a well-known wish of the Rosetta crunchers. IIRC, the developers tested an OpenCL client a few years ago, but it did not fit the needs. And I can't compare clients that don't exist. As a side note: I don't think Nvidia can sell a CPU or APU without paying royalties to Intel or AMD.
ICC for ARM
> ICC does not generate ARM code...
Thank you for confirming this.
> Most of the developers adding to the gcc optimizations have an @intel.com mail address. gcc is a good compiler and lags icc by (I would guess) a year or so in feature development.
For sure, if Intel wants gcc to support its CPUs, it must contribute ;-)
I do not think gcc "lags behind"; rather, it follows a different path. For example: supporting the ARM architecture.
> The option itself tells the compiler to use the XMM registers, and if the compiler cannot vectorize the code, it can be just as fast as the 256- or 512-bit options, since you are doing one operation at a time. The developer must ensure that the code parallelism is recognizable to the compiler. Many times poor coding practices introduce ambiguities that prevent the generation of vector code.
So, you agree with me? Is -mprefer-avx128 worth a test?
> It is VERY tough to say that binary "B" is XX% faster than binary "A", because it depends on where the program bottlenecks are. [...cut...] Here is a 2011 Sandy Bridge era AVX1 compiler presentation. It talks about the icc v12 compiler, and I am currently beta testing v16. https://indico.cern.ch/event/125167/material/slides/0.pdf
Thank you for the information, but I'm no longer interested in buying Intel CPUs. Due to unfair competition. Sorry.
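Since the discussion has now narrowed to AVX2, the "fat binary" idea behind the ICC -ax option can be sketched by hand as runtime dispatch: ship a portable baseline plus an AVX2-compiled variant of the same routine and pick one at startup. This is a rough manual analogue, not what ICC generates. It assumes GCC or Clang on x86, where the `target("avx2")` attribute and `__builtin_cpu_supports` are available; on other platforms it falls back to the baseline. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

// Portable baseline that runs on any CPU the binary targets.
double sum_baseline(const double* v, std::size_t n) {
    assert(v != nullptr);
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

#if HAVE_X86_DISPATCH
// Same source, but the compiler may use AVX2 instructions here.
// It must only be called after the CPU check below has passed.
__attribute__((target("avx2")))
double sum_avx2(const double* v, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}
#endif

using SumFn = double (*)(const double*, std::size_t);

// Choose an implementation once, based on what the running CPU reports.
SumFn pick_sum() {
#if HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;
}
```

A caller would store the result of `pick_sum()` once and call through that pointer, so older CPUs keep working while newer ones take the faster path.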

[VENETO] boboviz
Message 78424 - Posted: 9 Jul 2015, 13:07:08 UTC
06/24, David wrote:
> I'll push it out to ralph soon
07/09: Any news??

David E K (Volunteer moderator, Project administrator, Project developer, Project scientist; Joined: 1 Jul 05, Posts: 1480, Credit: 4,334,829, RAC: 0)
Message 78425 - Posted: 9 Jul 2015, 23:12:35 UTC
Would anyone want to, or know someone who would want to, help optimize the Rosetta software, at the compiler level or even at the code level? It is freely available through an Academic license, but we can also provide it to individuals under the same license agreement. This is a great opportunity for you all to contribute directly and have a direct positive impact for all researchers who use Rosetta, and the optimizations would carry over to Rosetta@home.
David K

dcdc
Message 78426 - Posted: 10 Jul 2015, 9:01:19 UTC
This is great news :D
I can't help with the optimisation, but I have a spare PC that I'm happy to set up TeamViewer on if someone wants to use it to run copies of Rosetta to speed up testing.
D

rjs5
Message 78430 - Posted: 11 Jul 2015, 17:30:00 UTC - in response to Message 78425.
> Would anyone want to, or know someone who would want to, help optimize the Rosetta software, at the compiler level or even at the code level? It is freely available through an Academic license, but we can also provide it to individuals under the same license agreement. This is a great opportunity for you all to contribute directly and have a direct positive impact for all researchers who use Rosetta, and the optimizations would carry over to Rosetta@home. David K
I have looked at the Rosetta license a couple of times, but I am not associated with any educational institution and did not feel I qualified to download under the Academic license.
I would be interested in looking at optimizations under the direction of someone on the project and sharing any findings with them so they could verify and incorporate them. I would require some guidance on how to build and validate the results.
It would be easier for me to work in a Linux Fedora 21 environment, but I can build a VirtualBox of any Linux... or finally tackle a Windows VS version.

David E K
Message 78431 - Posted: 11 Jul 2015, 22:22:37 UTC - in response to Message 78430.
>> Would anyone want to, or know someone who would want to, help optimize the Rosetta software, at the compiler level or even at the code level? It is freely available through an Academic license, but we can also provide it to individuals under the same license agreement. This is a great opportunity for you all to contribute directly and have a direct positive impact for all researchers who use Rosetta, and the optimizations would carry over to Rosetta@home. David K
> I have looked at the Rosetta license a couple of times, but I am not associated with any educational institution and did not feel I qualified to download under the Academic license. I would be interested in looking at optimizations under the direction of someone on the project and sharing any findings with them so they could verify and incorporate them. I would require some guidance on how to build and validate the results. It would be easier for me to work in a Linux Fedora 21 environment, but I can build a VirtualBox of any Linux... or finally tackle a Windows VS version.
That's great! I'll get back to you about how you can get the source. We should be able to work something out regarding the license.

Chilean
Message 78432 - Posted: 12 Jul 2015, 3:36:10 UTC - in response to Message 78430.
>> Would anyone want to, or know someone who would want to, help optimize the Rosetta software, at the compiler level or even at the code level? It is freely available through an Academic license, but we can also provide it to individuals under the same license agreement. This is a great opportunity for you all to contribute directly and have a direct positive impact for all researchers who use Rosetta, and the optimizations would carry over to Rosetta@home. David K
> I have looked at the Rosetta license a couple of times, but I am not associated with any educational institution and did not feel I qualified to download under the Academic license.
As long as you don't run off making billions of dollars through the Rosetta software... I think you'd be fine as far as qualifying to download the code goes.
As a side note, I'm glad all of this discussion is turning out well.

[VENETO] boboviz
Message 78442 - Posted: 13 Jul 2015, 14:46:01 UTC - in response to Message 78430.
> It would be easier for me to work in a Linux Fedora 21 environment, but I can build a VirtualBox of any Linux... or finally tackle a Windows VS version.
VirtualBox 5.0 released:
- Make more instruction set extensions available to the guest when running with hardware-assisted virtualization and nested paging. Among others this includes: SSE 4.1, SSE 4.2, AVX, AVX-2, AES-NI, POPCNT, RDRAND and RDSEED

rjs5
Message 78444 - Posted: 13 Jul 2015, 22:00:44 UTC - in response to Message 78442.
>> It would be easier for me to work in a Linux Fedora 21 environment, but I can build a VirtualBox of any Linux... or finally tackle a Windows VS version.
> VirtualBox 5.0 released:
> - Make more instruction set extensions available to the guest when running with hardware-assisted virtualization and nested paging. Among others this includes: SSE 4.1, SSE 4.2, AVX, AVX-2, AES-NI, POPCNT, RDRAND and RDSEED
That VirtualBox change is very nice. Thanks for pointing it out. It will be interesting to see what you have to do to change the instruction set. I would expect that you shut down the guest OS and change the configuration so that it boots again as a different chip.

[VENETO] boboviz
Message 78445 - Posted: 14 Jul 2015, 7:05:46 UTC - in response to Message 78349.
> I'll also look into a VS upgrade. According to this source, MS will release VS2015 during this summer... :-)
From MS:
> On July 20th we will celebrate the final release of Visual Studio 2015
VS2015