Message boards : Number crunching : code release and redundancy
Jack Schonbrun Send message Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
I'm sure there are folks who would like to see the code released, but how is version control going to be carried out? Are users going to post changes on the board? Do you intend to let users compile their own software, or only suggest changes that go out to everyone?

We would start off maximally conservative. I think we would communicate directly with selected users, and incorporate changes ourselves. We are discussing ways to include some sort of authentication, and/or removing some components from the source, so that executables compiled by others could not be used for credit. |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 13 |
Thanks again to everyone. This discussion is extremely useful. Let me try to quickly summarize my understanding as of this moment, so you can tell me if I'm following correctly. I now think there are two related, but distinct issues:

Absolutely! My current impression is that catching cheaters is what we are really talking about. Much of the discussion about ways to assign credit revolves around coming up with systems that will be more cheat-proof (within the bounds of other constraints, such as lag between job completion and credit granting).

Well... actually I think you would have a very difficult time getting such scripts to work correctly given the current system, although some checking could be done just for knowledge. One example: I have a single result for which I received 209 credits. Under _any_ reasonable expectation, that would probably be red-flagged. Yet that one was perfectly legitimate, as it was a 31-HOUR result on a slow iBook G3 Mac, running the standard BOINC, no optimizations. (Also the only result for that Mac, as I then detached it...) Had flops-counting been in place, I would assume that I would have gotten much less credit for that result. Flops-counting will bring the range "closer together" and eliminate the extraneous causes for the extremes, hopefully leaving only the rare "accident" and the "cheats". Then the scripts can be pretty simple, and require a lot less human intervention.

Also, at present there is nothing in the "rules" to say that someone can't run an optimized BOINC Client with just as extreme a benchmark as they can get. We may all agree that it's "cheating", but I don't think the project can fairly actually _do_ anything about it, unless/until you either specifically change the rules to outlaw optimized Clients (a bad idea, imho, because of the Linux benchmarking issue) or change the method of giving credit. If you identified someone today who, because of extremely inflated benchmarks, is getting way too much credit - about all you can really do is ask them to stop. Doing any more than that _right_now_ I think would be bad P.R. and possibly even a legal problem. And I don't see how you can put an enforceable rule in place that says what is "okay" and what isn't, given the current benchmarking methods. Once flops-counting is in place, you could just put in a rule that says "anyone believed by the project to be returning artificially inflated credit claims on a regular basis will be removed from the project at the decision of the staff".

In a sense, we are talking about using "pseudo-redundancy" to catch cheaters, but not to assign credit. Even though WUs that differ only in random number seed can have a wide variance in FLOP counts, we should still be able to see outliers. Especially because we could do our culling after most results were in. I don't see how some lag in the cheat finding can be avoided. It requires waiting until a significant number of results are returned. Credit could still be granted instantly. It just might get taken away.

Credit taken away, or participant "expelled", or whatever outcome the project decides on. I don't see it as a problem if credit is given "instantly" and the cheat-finding scripts only run monthly.

It seems that we could get an especially strong cheat detection algorithm by not simply looking for outliers in a given WU, but by looking for boxes that are consistently outliers across WUs.
Exactly - looking at a single super-WU (to coin a term for a "batch" of WUs that differ only in seed), you're going to have outliers - the nature of statistics. The idea is that you take the "top 2%" or whatever (choose some figure that gives you at least 20-50 hosts per WU, but not thousands) from _several_ super-WUs, and then only manually look at the hosts that show up in this list more than once. If you have, say, 5 of these super-WUs available to examine, you're then looking for duplicates within a list of a few hundred hosts. You find something like seven that appear more than once, one of which appears eleven times. You at least look at the results list for all seven, but you _really_ study what's going on with that one guy...

So, if such a system were in place, would people be comfortable with us releasing a partial version of the code?

I would! |
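To make the cross-referencing step concrete, here is a minimal Python sketch of the check Tern outlines: take the top slice of claimers from each super-WU and flag only the hosts that appear in more than one such slice. The data layout (a mapping from super-WU id to (host, claimed credit) pairs), the 2% cut and the 20-host floor are illustrative assumptions, not anything the project has implemented.

```python
# Minimal sketch of the cross-super-WU outlier check described above.
# `super_wus` maps a super-WU id to a list of (host_id, claimed_credit) tuples;
# the data layout and the cut-offs are illustrative, not Rosetta's.
from collections import Counter

def top_claimers(results, fraction=0.02, floor=20):
    """Hosts whose claims fall in the top `fraction` for one super-WU."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    n = max(floor, int(len(ranked) * fraction))   # at least `floor` hosts
    return {host for host, _ in ranked[:n]}

def repeat_offenders(super_wus, min_hits=2):
    """Hosts that land in the top slice of more than one super-WU."""
    hits = Counter()
    for results in super_wus.values():
        hits.update(top_claimers(results))
    return {host: count for host, count in hits.items() if count >= min_hits}

# Hosts returned by repeat_offenders() are only candidates for manual review;
# a single appearance (like the legitimate 209-credit iBook result) is ignored.
```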
Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
Now wait a second, couldn't one even go a step further and determine the rate by which each host over- or under-claims credit? Using the terminology 'WU' for the collection of all 'results' that only differ by random seed, do the following:

(1) calculate the mean of the claimed credit for each WU

(2) for each result, divide the claimed credit by the mean for its respective WU

(3) for each host, determine, across all WUs for which results were returned, the mean and the error of the mean of the numbers determined under (2); the error of the mean decreases with the sqrt of the number of results the host has returned (technically you would also have to include the error of the mean of the numbers determined under (1) in the error budget, using error propagation, but this may be small enough to be neglected)

Well, the error calculation of course assumes that, on some level of approximation, all distributions are nicely Gaussian. ;-)

You are now left with one number, with an associated (1 sigma) error, for each host, which tells you whether the host claims the correct amount of credit (number consistent with 1), under-claims (number less than 1 by more than, say, three sigmas), or over-claims (number three sigmas above 1). Can anyone think of a catchy name for that number?

You would have to decide what you actually want to do with these numbers. I guess hell would break loose if you actually took away credit from a significant number of those who after all have returned valid results and who may believe they have valid reasons to use 'optimized clients'. Once the statistical error is small enough (many returned results) you could even think of correcting the credit, instead of taking it away (less of the 'legal' problems Bill was referring to?). And the same would be possible for the under-claimers (Linux with standard clients). Oh, and you could apply the correction factors not only retroactively but also proactively, to the claimed credit of incoming results. Note that none of this would be applicable to Bill's 209-credit example, since it requires a reasonably large number of returned results per host.

By the way, did you notice that the Rosetta credits/day on BOINCstats are leveling off (new users per day are still high)? I hope this isn't due to our 'credit' discussion around here (trying to avoid the ch*** word). So what happened to my good intentions to spend less time in the forum? ;-) |
Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
... to add one thing to my last post: since a three-sigma excess occurs by pure chance in roughly one of every thousand cases, and since we have tens of thousands of hosts, the threshold to detect over- or under-claimers needs to be higher, something like 4.5 sigma (upper tail probability 3.4e-6), to be on the safe side. |
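For illustration, the two posts above translate into a short calculation: normalise each claim by its WU mean, average the ratios per host, and flag hosts whose mean ratio differs from 1 by more than the chosen number of sigmas. This is only a sketch under the stated Gaussian assumption; the tuple layout of `results` and the function name are hypothetical.

```python
# Sketch of the per-host claim-ratio statistic from the two posts above,
# assuming `results` is a list of (host_id, wu_id, claimed_credit) tuples.
import math
from collections import defaultdict

def claim_factors(results, sigma_cut=4.5):
    # (1) mean claimed credit per WU (all results differing only by random seed)
    per_wu = defaultdict(list)
    for host, wu, credit in results:
        per_wu[wu].append(credit)
    wu_mean = {wu: sum(c) / len(c) for wu, c in per_wu.items()}

    # (2) normalise each claim by the mean for its WU
    per_host = defaultdict(list)
    for host, wu, credit in results:
        per_host[host].append(credit / wu_mean[wu])

    # (3) per-host mean ratio and error of the mean (Gaussian approximation;
    #     the error on the WU means is neglected, as suggested above)
    flagged = {}
    for host, ratios in per_host.items():
        n = len(ratios)
        if n < 2:
            continue                        # no error estimate from a single result
        mean = sum(ratios) / n
        var = sum((r - mean) ** 2 for r in ratios) / (n - 1)
        err = math.sqrt(var / n)
        if err > 0 and abs(mean - 1.0) / err > sigma_cut:
            flagged[host] = (mean, err)     # mean > 1: over-claimer, mean < 1: under-claimer
    return flagged
```

Lowering `sigma_cut` to 3 reproduces the threshold of the first post; 4.5 is the value suggested in the follow-up to allow for the large number of hosts being tested.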
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Jack, I am not sure I would say that my primary interest is to catch cheaters. My interest is to have a credit system that is fair and accurate. The primary side effect is of course improved confidence in the system. A very desirable side effect is that cheating becomes impossible, or at least severely restricted. But fairness and accuracy are the point. |
stephan_t Send message Joined: 20 Oct 05 Posts: 129 Credit: 35,464 RAC: 0 |
Uh? It's their project, they can do whatever they want. Break the EULA for any product or service (say, your telco) and see if they have qualms about disconnecting you... Companies regularly turn off the service of users who break their terms and conditions, and keep sending them bills - which is perfectly normal. However, I'm not saying you should 'ban' cheaters' IPs either. Plus, cheated credit on a WU still means the WU is done. So, hit them where it hurts: take them out of the credit stats (including the XML feeds). No incentive to cheat == no cheaters.

I do understand your concerns about non-cheaters being flagged as cheaters due to hardware problems. But I think you got it wrong - I myself had an overheating box which would bench high, then run slower and slower for days. That gave me a few 92-point WUs. But the CPU time was there to prove I spent time on it - a cheater would have a quick turnaround AND high benchmarks. Easy to spot.

Team CFVault.com http://www.cfvault.com |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 13 |
Uh? It's their project, they can do whatever they want. Break the EULA for any product or service (say, your telco) and see if they have qualms about disconnecting you... Companies regularly turn off the service of users who break their terms and conditions, and keep sending them bills - which is perfectly normal.

Absolutely true. Now point me to the webpage containing that EULA that says I can't just claim 1000 credits per result? Or the "terms and conditions" page with the same rule? It's not here, or here. I totally agree that they can do it - but they need to get the RULES in place before they try to _enforce_ them. While a civil suit might get nowhere, as there isn't any monetary "damage" to be proved, it could still cost the project in legal fees just to get the lawyers to write the "are you ^&%* kidding me?" letters, and then to show up in court to explain BOINC to a judge. Better to CYA up front. There are a lot of lunatics out there with lawyers. |
Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
It seems that we could get an especially strong cheat detection algorithm by not simply looking for outliers in a given WU, but by looking for boxes that are consistently outliers across WUs.

Jack, rereading your previous post, it occurred to me that much of what I said about detecting - and possibly correcting - excessive credit claims is probably exactly what you had in mind anyway. So sorry about the duplication. |
Ingleside Send message Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
Thanks again to everyone. This discussion is extremely useful. Let me try to quickly summarize my understanding as of this moment, so you can tell me if I'm following correctly. I now think there are two related, but distinct issues:

Actually, fixing #1 is in my opinion much more important at the moment for Rosetta@home than worrying about #2.

An example from SETI@Home/BOINC, with data collected in two different periods, Jan-April and May-June. Period 2 includes all validated results; period 1 is after removing any -9 result_overflow. After calculating average claimed and granted credit, standard-deviation/average, and credit/CPU-time, you get a table like this:

pc-run# | sample | %time | claim | grant | %claim | %grant | %c/time | %g/time | max_c/t | max_g/t
---|---|---|---|---|---|---|---|---|---|---
A-1 | 796 | 04,39% | 30,69 | 31,85 | 24,29% | 22,28% | 23,86% | 21,69% | 1,82 | 5,56
B-1 | 712 | 03,76% | 28,37 | 31,40 | 25,03% | 21,26% | 25,47% | 20,95% | 2,28 | 3,24
C-1 | 890 | 06,21% | 23,25 | 29,41 | 15,75% | 22,92% | 14,11% | 22,74% | 1,59 | 3,92
A-2 | 193 | 35,63% | 22,04 | 23,08 | 43,64% | 43,49% | 23,52% | 24,17% | 1,69 | 3,93
B-2 | 157 | 36,01% | 19,11 | 21,50 | 41,41% | 43,31% | 19,67% | 23,77% | 1,58 | 3,29
C-2 | 224 | 35,38% | 14,87 | 20,71 | 38,35% | 42,92% | 14,08% | 25,65% | 1,46 | 3,12

If we look at the maximum variation in claimed credit per CPU-time (max_c/t), and the same for granted credit (max_g/t), these should in theory be constant, but they definitely are not. Instead, even on the same computer they show huge differences - so huge that if the average claimed credit on a computer is 10 CS, the claimed credit for individual results will instead vary from 6,63 CS to 15,09 CS, a 2,28x variation. Granted credit is even worse: on a single computer it can vary from 3,96 CS to 22,00 CS, a 5,56x variation.

Now, if we look at average claimed credit and average granted credit, in run 1 there is a total variation of 32% in claimed, but only 8,3% in granted. In run 2, claimed has 48,3% variation, while granted has 11,5% variation. Meaning, in a project with min_quorum = 3, average granted credit between computers varies from 9,51 CS/result to 10,61 CS/result, so the "best" computer gets 11,5% more credit per result than the "worst" computer. If a project instead uses min_quorum = 1, like Rosetta@home is doing, average granted credit between computers varies from 7,96 CS/result to 11,80 CS/result, so the "best" computer gets 48,3% more credit per result than the "worst" computer.

For anyone that is more interested in the credit than in the science part, it's very likely computer C is moved to something else. Also, since A gets up to 15,3% more average credit than B, it's also possible computer B is moved to something else. Meaning, by moving B+C to another project, Rosetta@home just lost 66,4% of the possible production...

Also, remembering that the granted credit varied from 3,96 CS to 22,00 CS: someone whose average claimed credit is near the low end of this scale, even if not primarily crunching for the credits, can get discouraged when being out-crunched by someone who has 5x as much credit despite having a slower computer and using more time per result...

Since these huge variations in average claimed credit can happen even if no one is trying to cheat, as long as Rosetta@home is directly converting claimed credit to granted credit there can be many users who get discouraged and switch to something else... |
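As a rough illustration of how the columns of the table could be computed from a host's result list, here is a small sketch; the field names and data layout are made up for the example and are not Ingleside's actual script.

```python
# Per-computer credit statistics, assuming each result is a dict with
# 'claimed', 'granted' and 'cpu_time' fields (illustrative layout).
import statistics

def credit_stats(results):
    claimed = [r['claimed'] for r in results]
    granted = [r['granted'] for r in results]
    c_per_t = [r['claimed'] / r['cpu_time'] for r in results]
    g_per_t = [r['granted'] / r['cpu_time'] for r in results]
    return {
        'mean_claim': statistics.mean(claimed),
        'mean_grant': statistics.mean(granted),
        # relative spread, i.e. the %claim / %grant columns
        'rel_spread_claim': statistics.stdev(claimed) / statistics.mean(claimed),
        'rel_spread_grant': statistics.stdev(granted) / statistics.mean(granted),
        # max/min ratio of credit per CPU-second -- the max_c/t and max_g/t
        # columns; in theory both should be 1.0
        'max_c/t': max(c_per_t) / min(c_per_t),
        'max_g/t': max(g_per_t) / min(g_per_t),
    }
```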
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
For anyone that is more interested in credit than in the science part, it's very likely computer C is moved to something else. Also, since A gets up to 15,3% more average credit than B, it's also possible computer B is moved to something else. Meaning, by moving B+C to another project, Rosetta@home just lost 66,4% of the possible production...

Even those of us who are interested in both may do things like this... For example, I just moved my PowerMac to almost all Einstein@Home work. In effect, I get the best of both alternatives: more science, as the application is able to do the work in 50% of the time of a comparable x86 PC (Dell dual Xeon 3.4 GHz), and I get the benefit of having my scores "pulled up" in the averaging process. The only reason I mention this here is to demonstrate the tensions that can "pull" your participants in directions that are un-good for your specific project.

Fundamentally, I don't think that credit is the only thing that motivates people. But a good project with fair, or above average, credit grants is better than just a good project ... :)

Anyway, if any of this was easy we would not need Ingleside working on it ... |
stephan_t Send message Joined: 20 Oct 05 Posts: 129 Credit: 35,464 RAC: 0 |
[quote]Uh? It's their project, they can do whatever they want... Now point me to the webpage containing that EULA, that says I can't just claim 1000 credits per result? .... It's not here, or here.[/quote]

You are 100% right. However, the stats pages aren't mentioned in the EULA either, so banning cheaters from those unmentioned stats pages would be no problem. In fact, their EULA is just that - it simply says 'we offer no guarantee - whatsoever'. Heck, the whole application could be hijacked and transformed into a huge virus that takes over everybody's computer and the UW would still have nothing to answer for.

Team CFVault.com http://www.cfvault.com |
Hoelder1in Send message Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
A brief suggestion: should the project decide to go to flop counting, perhaps you could get away with something really simple. Strategically place two counters in the code, one in the main loop of the ab initio part and one in the all-atom relax part, and assume the total flop count to be a linear combination of the two: FLOP = a1*counter1 + a2*counter2. Then run a couple of different WUs with widely varying completion times (different protein, different Rosetta parameters) on a local computer for which you know the flop/s, and by a chi-square fit adjust the parameters a1 and a2 to best reproduce the measured FLOP (CPU-time * flop/s) values. If this turns out not to be accurate enough, add more counters till a satisfactory chi-square is reached. Just an idea to try to save you some time ... |
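The fit described here is an ordinary linear least-squares problem, so a sketch is short; the counter values and measured FLOP numbers below are invented purely to show the shape of the calculation, not taken from Rosetta.

```python
# Minimal sketch of the counter-calibration idea above: fit the weights a1, a2
# so that a1*counter1 + a2*counter2 best reproduces the FLOP totals measured
# as CPU-time * known flop/s on a reference machine.  Sample numbers are made up.
import numpy as np

# one row per calibration WU: [counter1, counter2]
counters = np.array([[1.2e9, 3.4e9],
                     [2.5e9, 1.1e9],
                     [0.8e9, 5.0e9],
                     [4.1e9, 2.2e9]])
# measured FLOP for the same WUs: cpu_time_seconds * reference_flops_per_second
measured = np.array([2.1e12, 1.6e12, 2.8e12, 3.0e12])

(a1, a2), residuals, *_ = np.linalg.lstsq(counters, measured, rcond=None)
print(f"a1 = {a1:.3g}, a2 = {a2:.3g}")
# If the residuals stay large, add more counters (more columns) until the fit
# -- equivalently the chi-square -- is acceptable.
```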
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
hi everyone, sorry for my late arrival to this thread, I only got to Rosetta on the 15th, and missed the previous message by 4 days!

I'd like to raise another possibility for redundancy - adaptive partial redundancy. The big disadvantage from a BOINC perspective is that it does mean different handling of results than at present; you can't just pull a big switch on the server and let it start, like you can with the quorum/validation system.

Partial redundancy works by validating hosts rather than work units. It is adaptive because the degree of redundancy applied to a host's work is adapted in view of the outcome of past redundancy checks. Hosts start out as probationers, and then move up to progressively more and more trusted status. Work from probationers always gets run again by a non-probationer. For all the other levels of trust, only a percentage of work gets run twice - the percentage falling the more trusted you are.

The results are compared for agreement on the science and on the credit claimed. You go up a notch when good agreement happens, you stay level when reasonable agreement happens, you go down a notch when poor agreement happens; and if at any time it becomes clear that you are cheating you lose all credit to date, not just for that host but for all that user's hosts. The longer you go on, the more you have to lose by cheating, and the less you will be watched.

The two crucial differences with the BOINC model are that, first, users never know which WU will be double-run until after both versions are returned. This prevents them from behaving only while they are being watched! Secondly, credit is always awarded as claimed, but a correction factor for each user is tweaked every time their work is compared. If you and I run the same work and I claim more than you, my multiplier is reduced and yours is increased a little. Over a large number of jobs everyone's multiplier tends to the fair value. Early on you may have over- or under-claimed, but those injustices get swamped by the huge number of later WUs credited at a fair rate.

When a probationer's work is double-run, only their correction factor is adjusted. They stay in probation till their adjusted credits start to be close enough to the comparator.

Whenever a WU is run twice, the scientific results are tested for agreement. If a particular protein / app version / etc has more than a preset number of mismatches, that is a sign that the code may be unstable. Sufficient WUs are run twice to make this a statistically valid check on the work, and this can be adjusted by increasing or decreasing the percentage redundancy of users at each of the trust levels above probationer. This is what happens to all other code, outside of DC: no other programmer ever tests every run of their program - you test it often enough to trust it for the general case.

This is not my original idea (tho I wish it had been). Something like this was used on the Zetagrid project to validate the science - and validating credit was not an issue as it was pretty deterministic from the work. However, it would be a brave BOINC project to take on a radical re-design like this - it would mean writing a lot of new code (instead of just reusing the code that is there) and then it would take time for the system to gain the trust of the other projects. I think that neither this nor pseudo-redundancy is likely to be adopted by any BOINC project for this reason.
However, if you are going to look at re-writing stuff for pseudo-redundancy, I think it would be worth also considering adaptive partial redundancy. It saves most of the extra overhead of full redundancy, while keeping its main advantage - a precise check of the same run on two different hosts, different apps, etc.

River~~ |
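A toy sketch of the bookkeeping River~~ describes might look like the following; the trust levels, check rates, tolerances and nudge sizes are all invented for illustration and are not taken from Zetagrid or any BOINC project.

```python
# Adaptive partial redundancy, as sketched in the post above: a per-host trust
# level sets the probability that a result is double-checked, and a per-user
# credit multiplier is nudged toward agreement whenever two hosts run the same
# work.  All numbers here are illustrative.
import random

# check probability per trust level (level 0 = probationer: everything re-run)
CHECK_RATE = {0: 1.00, 1: 0.25, 2: 0.10, 3: 0.02}

def needs_double_run(trust_level):
    """Decide, after the result is returned, whether it also goes to a second host."""
    return random.random() < CHECK_RATE[min(trust_level, 3)]

def compare_and_adjust(host, my_claim, other_claim, tolerance=0.05, nudge=0.02):
    """Tweak trust level and credit multiplier after two hosts ran the same work."""
    ratio = my_claim / other_claim
    diff = abs(ratio - 1.0)
    if diff <= tolerance:
        host['trust'] += 1                         # good agreement: up a notch
    elif diff > 3 * tolerance:
        host['trust'] = max(0, host['trust'] - 1)  # poor agreement: down a notch
    # otherwise 'reasonable' agreement: trust level unchanged

    # credit is always granted as claimed, but the multiplier drifts toward fairness
    if ratio > 1.0:
        host['multiplier'] *= (1.0 - nudge)
    elif ratio < 1.0:
        host['multiplier'] *= (1.0 + nudge)
```

A new host would start as `{'trust': 0, 'multiplier': 1.0}`, have every result re-run while on probation, and drift toward a fair multiplier as comparisons accumulate.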
Tom Send message Joined: 25 Nov 05 Posts: 2 Credit: 388,705 RAC: 0 |
I run a couple of open source projects. What I would suggest: open a project on SourceForge. Don't allow anonymous checkout from CVS. People have to ask to get the code. If you find someone taking advantage, don't let them have access. There are several people that seem to do a lot of "extra" work on BOINC and SETI and the work is never integrated back into the process. If you do this right, you can coordinate the effort and the program will get much, much better, with lots of clients optimized for specific processors. If someone cares that much about the credits, let them cheat; they need a life. |
Tom Send message Joined: 25 Nov 05 Posts: 2 Credit: 388,705 RAC: 0 |
Also, regarding the cheating issue, since currently the gained credit is donated CPU time x benchmark rating, it isn't possible to get more credits by tinkering with the rosetta code, or am I missing something here? So why should giving out the rosetta code be an issue as far as cheating for credits is concerned? Since the energy can easily be re-calculated from the returned structure I also don't see why redundancy should be required to detect invalid results.

This has come up with the SETI project more than once. I think people need to have a reality check. Are you really here for the credits? Or are you here to help find cures that could affect you, your kids, or your grandkids? The credits are fun and yes, I look at them, but I would do this without the credits.

Redundancy is often done to make sure the science is accurate. I would suggest setting up a threshold for the units that look like they have promise and recrunch only those. In science you always have a control, so doing duplicate crunching isn't a waste; it is good science. SETI crunches each unit 4 times. I think that is a little overkill; 3 would probably do. |