Algorithm Discussion

Author	Message
uioped1 Send message Joined: 9 Feb 06 Posts: 15 Credit: 1,058,481 RAC: 0	Message 11267 - Posted: 23 Feb 2006, 20:12:51 UTC Can anyone point me to discussion of the search algorithm R@H uses? A quick search lead me to the rnd generator discussion initiated by MR. Buck, which more got bogged down in minutia rather than really getting anywhere. I'm looking for something aimed at programmers, but something's better than nothing. ID: 11267 · Rating: 1 · rate: / Reply Quote

alo_dk Send message Joined: 11 Dec 05 Posts: 19 Credit: 30,425 RAC: 0	Message 11272 - Posted: 23 Feb 2006, 22:03:25 UTC - in response to Message 11267. Last modified: 23 Feb 2006, 22:06:50 UTC I'm looking for something aimed at programmers, but something's better than nothing. Another try at talking about the algorithem was here: Genes... ID: 11272 · Rating: 0 · rate: / Reply Quote

uioped1 Send message Joined: 9 Feb 06 Posts: 15 Credit: 1,058,481 RAC: 0	Message 11275 - Posted: 23 Feb 2006, 23:23:07 UTC - in response to Message 11272. Last modified: 23 Feb 2006, 23:33:29 UTC I missed that one; thanks. I was more wanting greater detail about how the current crop of workunit-styles aproaches the task of paring down and navigating the search-space. In case someone in the know wanders by, here are a few questions just to get the ball rolling: (Note, I ended up asking a lot below. Feel free to take the low-hanging fruit, at least as a start.) First off, as I understand it, the app performs a few runs of a stochastic local search initiated at random (except for the'barcode' units?) on the search space, using simulated annealing and some heuristics int he form of known likely positions for subsequences (except fro the 'random frag' units?) Is that approaching correct? I was wondering how much we know about the search space? I'm not a molecular biologist or chemist, so I really have no clue how things would pan out or how much is known. Similarly, how good are our heuristics? Extending from Paul's questions in the other thread, what is the "granularity" and coverage of the initial random trajectories of these workunits, and how does that compare with the average travel of a search trajectory? O.K. So I made up granularity and travel. What I mean is what's the minimum spacing between trajectories, and how frequently will two trajectories overlap if they start near each other? How prevalent are local minima, and how steep is the approach to the known results? What I mean is, using our algorithm, how right would your random guess have to be in order to smack the nail on the head and get the real folding via our annealing? How does that compare to the granularity I mentioned above? Any answers would be read with interest, but I realize I've asked a lot more than I intended to. I don't expect the in-depth answer to everything right here, right now. I know that most everyone with answers is also likely to be fairly busy :) Thanks in advance; -alex. [edit]Maybe this one would be better for the FAQ, but what do the names of the workunits mean?[/edit] ID: 11275 · Rating: 0 · rate: / Reply Quote

David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 11293 - Posted: 24 Feb 2006, 5:15:53 UTC - in response to Message 11275. good questions. some partial answers below I missed that one; thanks. I was more wanting greater detail about how the current crop of workunit-styles aproaches the task of paring down and navigating the search-space. In case someone in the know wanders by, here are a few questions just to get the ball rolling: (Note, I ended up asking a lot below. Feel free to take the low-hanging fruit, at least as a start.) First off, as I understand it, the app performs a few runs of a stochastic local search initiated at random (except for the'barcode' units?) on the search space, using simulated annealing and some heuristics int he form of known likely positions for subsequences (except fro the 'random frag' units?) Is that approaching correct? yes I was wondering how much we know about the search space? I'm not a molecular biologist or chemist, so I really have no clue how things would pan out or how much is known. Similarly, how good are our heuristics? what is exciting about the calculations you are all doing is they provide really the first comprehensive picture of the landscape. there is very little currently known due to the very large size Extending from Paul's questions in the other thread, what is the "granularity" and coverage of the initial random trajectories of these workunits, and how does that compare with the average travel of a search trajectory? O.K. So I made up granularity and travel. What I mean is what's the minimum spacing between trajectories, and how frequently will two trajectories overlap if they start near each other? this is a non trivial question to answer, and we are still analyzing the results to get a clear picture. it appears that the trajectories are pretty well spaced, but for some of the proteins there are biases which are limiting sampling near the native strucutre. the HBLR runs I've submitted recently seek to reverse this bias. How prevalent are local minima, and how steep is the approach to the known results? What I mean is, using our algorithm, how right would your random guess have to be in order to smack the nail on the head and get the real folding via our annealing? How does that compare to the granularity I mentioned above? the local minima are very prevelant--that is why this is such a hard problem. the low resolution model found in the first part of the search has to be pretty close to the correct answer to "smack the nail on the head". Any answers would be read with interest, but I realize I've asked a lot more than I intended to. I don't expect the in-depth answer to everything right here, right now. I know that most everyone with answers is also likely to be fairly busy :) Thanks in advance; -alex. [edit]Maybe this one would be better for the FAQ, but what do the names of the workunits mean?[/edit] ID: 11293 · Rating: 0 · rate: / Reply Quote

dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 5,737	Message 11302 - Posted: 24 Feb 2006, 8:03:11 UTC - in response to Message 11293. Last modified: 24 Feb 2006, 8:05:16 UTC I've copied David's post from above and tidied it up (with [quotes]). If a mod comes along feel free to use this or remove it and edit the original. good questions. some partial answers below I missed that one; thanks. I was more wanting greater detail about how the current crop of workunit-styles aproaches the task of paring down and navigating the search-space. In case someone in the know wanders by, here are a few questions just to get the ball rolling: (Note, I ended up asking a lot below. Feel free to take the low-hanging fruit, at least as a start.) First off, as I understand it, the app performs a few runs of a stochastic local search initiated at random (except for the'barcode' units?) on the search space, using simulated annealing and some heuristics int he form of known likely positions for subsequences (except fro the 'random frag' units?) Is that approaching correct? yes I was wondering how much we know about the search space? I'm not a molecular biologist or chemist, so I really have no clue how things would pan out or how much is known. Similarly, how good are our heuristics? what is exciting about the calculations you are all doing is they provide really the first comprehensive picture of the landscape. there is very little currently known due to the very large size. Extending from Paul's questions in the other thread, what is the "granularity" and coverage of the initial random trajectories of these workunits, and how does that compare with the average travel of a search trajectory? [quote]O.K. So I made up granularity and travel. What I mean is what's the minimum spacing between trajectories, and how frequently will two trajectories overlap if they start near each other? this is a non trivial question to answer, and we are still analyzing the results to get a clear picture. it appears that the trajectories are pretty well spaced, but for some of the proteins there are biases which are limiting sampling near the native strucutre. the HBLR runs I've submitted recently seek to reverse this bias. How prevalent are local minima, and how steep is the approach to the known results? What I mean is, using our algorithm, how right would your random guess have to be in order to smack the nail on the head and get the real folding via our annealing? How does that compare to the granularity I mentioned above? the local minima are very prevelant--that is why this is such a hard problem. the low resolution model found in the first part of the search has to be pretty close to the correct answer to "smack the nail on the head". Any answers would be read with interest, but I realize I've asked a lot more than I intended to. I don't expect the in-depth answer to everything right here, right now. I know that most everyone with answers is also likely to be fairly busy :) Thanks in advance; -alex. [edit]Maybe this one would be better for the FAQ, but what do the names of the workunits mean?[/edit] ID: 11302 · Rating: 0 · rate: / Reply Quote

Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0	Message 11329 - Posted: 24 Feb 2006, 15:08:18 UTC This thread was moved from the Number crunching forum. Moderator9 ROSETTA@home FAQ Moderator Contact ID: 11329 · Rating: 0 · rate: / Reply Quote

uioped1 Send message Joined: 9 Feb 06 Posts: 15 Credit: 1,058,481 RAC: 0	Message 11345 - Posted: 24 Feb 2006, 20:29:13 UTC - in response to Message 11293. Thanks for the quick response. (And thanks to the moderator for moving me to the correct location.) This is certainly an interesting computational, as well as scientific, problem. Some more questions that your responses brought up for me: You mentioned two phases to the search: The low resolution model found in the first part of the search has to be pretty close to the correct answer to "smack the nail on the head". It appears to me that the units I'm currently running are not using a separate full atom relax stage. Are we using the full atom model for the whole search now, or are we splitting that up between two types of workunits? (or am I just not knowing what is really going on with my workunits) How small of an RMSD is required for the results to be scientifically interesting, outside of the computational advence? I mean, for a protien with an unknown structure, would it be useful to the scientists to know that "this structure is 95% likely to have an RMSD below 3" I look forward to hearing more as you discover features of the search space and make refinements to your algorithm. ID: 11345 · Rating: 0 · rate: / Reply Quote

Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0	Message 11352 - Posted: 24 Feb 2006, 23:10:32 UTC - in response to Message 11345. I look forward to hearing more as you discover features of the search space and make refinements to your algorithm. I guess we are all looking forward to the next biweekly ;-) science news update, hopefully explaining all those interestingly named WUs like, for example, 'quadruplelongrangeantiparallel'... ID: 11352 · Rating: 0 · rate: / Reply Quote

BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 11360 - Posted: 25 Feb 2006, 0:56:01 UTC - in response to Message 11345. How small of an RMSD is required for the results to be scientifically interesting, outside of the computational advence? I mean, for a protien with an unknown structure, would it be useful to the scientists to know that "this structure is 95% likely to have an RMSD below 3" While at Distributed Folding, we were told (if I remmeber correctly) that a client that could create a structure with a RMSD of 4 Angstroms from the native structure would be the point where it would start being useful for drug manufacturers. The lower the RMSD from the native structure, the more useful the algorithm would be. The larger the structure involved, the harder it was to get a low RMSD. Which brings to mind the question.. what size structures are most of the current drugs? ID: 11360 · Rating: 0 · rate: / Reply Quote

Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0	Message 11407 - Posted: 26 Feb 2006, 2:47:19 UTC - in response to Message 11302. Last modified: 26 Feb 2006, 2:48:44 UTC O.K. So I made up granularity and travel. What I mean is what's the minimum spacing between trajectories, and how frequently will two trajectories overlap if they start near each other? this is a non trivial question to answer, and we are still analyzing the results to get a clear picture. it appears that the trajectories are pretty well spaced, but for some of the proteins there are biases which are limiting sampling near the native strucutre. the HBLR runs I've submitted recently seek to reverse this bias. Maybe it's just a coincidence, but on 4 different HBLR WUs I've monitored today, I've quite often noticed much better RMSD values (3-6) than I've been used to (previously RMSD usually were 10-15, except for the occasional "NATIVE" cheating ones). I'm looking forward to the next Science news Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity ID: 11407 · Rating: 0 · rate: / Reply Quote

Branislav Send message Joined: 23 Mar 06 Posts: 6 Credit: 450,417 RAC: 0	Message 17045 - Posted: 25 May 2006, 9:08:44 UTC I understand that during the CASP you prefer not to shake the algorithm too much. But in the long run it might be beneficial to have some parts of the algorithm tested against other algorithms. For example I can see that considerable amount of time Rosetta spends in relaxation stage. During this stage the shape is not changing that much. In the early stages of a search, it might be better to have even higher proportion of time spent on random search instead. This would give you higher coverage of the search space for the price of not knowing the exact shape of each local minima. But I guess this could be acceptable for the early stages of the search. To improve the random search part of algorithm, you could reinforce the probability of those configurations that achieve lower energy (reinforcement learning). If every degree of freedom can have a limited number of values than assign a probability to each value. Then, after each throw of a dice, increase by a small amount probability of current values for all degrees of freedom if they decrease the energy, and decrease it if the opposite is true. After the local minimum is reached, increase the resolution of the search by searching only in the vicinity of the previous local minimum. Maybe You should organize a contest for the best search algorithm, provide the energy function (or several energy functions for several different proteins). You already have a reference: your current search algorithm (details of it might not even have to be published). A lot of researchers in the areas of neural networks, generic algorithms, reinforcement learning etc. need tough problems to further improve their algorithms. ID: 17045 · Rating: 0 · rate: / Reply Quote

David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 17061 - Posted: 25 May 2006, 15:43:08 UTC - in response to Message 17045. I understand that during the CASP you prefer not to shake the algorithm too much. But in the long run it might be beneficial to have some parts of the algorithm tested against other algorithms. For example I can see that considerable amount of time Rosetta spends in relaxation stage. During this stage the shape is not changing that much. In the early stages of a search, it might be better to have even higher proportion of time spent on random search instead. This would give you higher coverage of the search space for the price of not knowing the exact shape of each local minima. But I guess this could be acceptable for the early stages of the search. To improve the random search part of algorithm, you could reinforce the probability of those configurations that achieve lower energy (reinforcement learning). If every degree of freedom can have a limited number of values than assign a probability to each value. Then, after each throw of a dice, increase by a small amount probability of current values for all degrees of freedom if they decrease the energy, and decrease it if the opposite is true. After the local minimum is reached, increase the resolution of the search by searching only in the vicinity of the previous local minimum. Maybe You should organize a contest for the best search algorithm, provide the energy function (or several energy functions for several different proteins). You already have a reference: your current search algorithm (details of it might not even have to be published). A lot of researchers in the areas of neural networks, generic algorithms, reinforcement learning etc. need tough problems to further improve their algorithms. Good ideas! We are using a strategy like this in CASP--these are the FA_CONTACTS runs, which are directed back towards promising regions of the landscape found in initial sets of runs. I have tried to popularize this problem as much as possible in the CS community as I agree it is a good problem to test out lots of approaches (and it certainly is tough)! The problem with just making the first ab initio stage longer is that you could find the right answer but not recognize it because the energy computation is so inaccurate. not much moves during the fullatom stage, but the energy drops considerably, to the point where structures in the vicinity of a very low energy minimum can be recognized. ID: 17061 · Rating: 0 · rate: / Reply Quote

Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0	Message 17071 - Posted: 25 May 2006, 17:32:13 UTC Last modified: 25 May 2006, 17:35:37 UTC I've been wondering, sometimes when trying to brainstorm about a problem it is useful to take a different perspective. When studying an atomic level thing like a protein, it can be useful to pull back to the "big picture". I've noticed that when the native form is shown and I spin it around and study it, there seems to be a fair amount on consistency on the visual appearance of the thing. Now, my "experience" (I use the term very loosely) with this is very limited (to the Rosetta WUs we've processed), but I'm wondering if my observations hold over a larger sampling of the known structures. What I notice is that the structure almost always seems to have a central core, which appears hollow when you turn the thing the right way, and everything is wrapped around that. And that if you were to supersize it and drop it on a wood turning lathe, along that axis, that the thing is pretty compact. What I mean by that is it seems you could spin around that axis, and there aren't really any parts you'd visually expect would fly off (see my profile image for example). and so if your model has loose ends that dangle off the main structure or large lop-sided chains, it's probably not correct. If your model has chains that only have a single chain along side them, rather than the 3 or 4 that seems to be typical of native structure, then it's probably not right. I'm wondering if any of this "big picture" type of heuristics is built in to the algorythms to recongnize HOW to push the model into "looking" more correct. Such heuristics could be used to bias you towards the correct solution, or to recognize models that can safely be abandoned before completing processing, because they won't prove fruitful. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ ID: 17071 · Rating: 0 · rate: / Reply Quote

Branislav Send message Joined: 23 Mar 06 Posts: 6 Credit: 450,417 RAC: 0	Message 17415 - Posted: 31 May 2006, 11:53:13 UTC Could normalizing distance during random search stage improve coverage ? I was wandering whether you normalize the distance during the random search. To be more precise, when the new random configuration is created during the random search part of the algorithm, do you limit the maximum distance of that configuration from the original one. I guess you must control the distance one way or another or otherwise locality of the search couldn't be controlled. It makes more sense to have individual work units searching mostly non-overlapping volumes because this maximizes coverage of the configuration space for a given number of workunits. To increase the coverage even more, you could specify the random search to generate only configurations on exactly specified distance Dn from the starting one. 1. generate arbitrary random configuration, 2. measure the distance from the starting configuration D, 3. multiply coordinate distances with the scale factor S = Dn/D to determine new coordinates. In this way you still have a random search but you control the distance of the random search from the starting configuration. If different workunits are working different distances this gives you opportunity to start with unfolded position and gradually increase the search distance (knowing there want be overlapping). ID: 17415 · Rating: 0 · rate: / Reply Quote

David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 17429 - Posted: 31 May 2006, 14:47:20 UTC - in response to Message 17415. Could normalizing distance during random search stage improve coverage ? I was wandering whether you normalize the distance during the random search. To be more precise, when the new random configuration is created during the random search part of the algorithm, do you limit the maximum distance of that configuration from the original one. I guess you must control the distance one way or another or otherwise locality of the search couldn't be controlled. It makes more sense to have individual work units searching mostly non-overlapping volumes because this maximizes coverage of the configuration space for a given number of workunits. To increase the coverage even more, you could specify the random search to generate only configurations on exactly specified distance Dn from the starting one. 1. generate arbitrary random configuration, 2. measure the distance from the starting configuration D, 3. multiply coordinate distances with the scale factor S = Dn/D to determine new coordinates. In this way you still have a random search but you control the distance of the random search from the starting configuration. If different workunits are working different distances this gives you opportunity to start with unfolded position and gradually increase the search distance (knowing there want be overlapping). During the ab initio low resolution part of the search there is no limit on the distance, that is why the protein moves around so much. During the second, high resolution stage, there is an upper bound on the distance moved (actually, the size of the change in the backbone torsion angles). This ensures locality as you suggest. ID: 17429 · Rating: 0 · rate: / Reply Quote