I think this practise kit helped me the most in preparing, out of all 4 tests. There were a few puzzle types for which I hadn't completely figured out how they would look or feel, and making them helped me a lot. I think it definitely improved my Wired Tapa solving, which I found surprisingly fun to solve. The Tapa-Like Loop puzzles were also a lot easier once I had figured out a few tricks from the genre while making them. I think my Twilight should also have helped for the test puzzles.

Overall I finished 4th, which seems appropriate. But I do think this normalised system with best 3 out of 4 isn't 100% perfect. If I hadn't made the small solving and answer key mistakes, I would have beaten Hideaki comfortably in this test, and I would have thought that might be enough to beat him for 3rd place, as I would have done exactly what I needed to do. Except, because Palmer beat everyone by a mile, this whole test wouldn't have counted, as everyone's normalised score is very low. I don't think that accurately depicts the difference between all players. Really, it now only compares everyone to Palmer, as he was just better than the rest. I feel that if you take Palmer out of the equation in all tests, you might see a bunch of shifts in the leader board, as TVC XII would be far more influential than it is now. Nyuta and EKBM did particularly well compared to many in this test, but only got normalised scores of 720 and 685 to show for it, which seems a bit unfair. It might be better to do something similar to the CTC, where not only the highest score sets the mark but a group of the best scorers. That could lead to a better representation overall for all players. I'm not sure of this and can't really run the numbers very quickly on what the results would be, but I'd be interested to see them if someone feels like it.

I'll also talk a little bit about the CTC here. I finished 5th overall. I had a bit of a false start in the first 2 weeks with a lot of errors. Halfway through it became clear that Palmer was running away with this one. Hideaki and Nyuta settled into a comfortable 2nd and 3rd, and there was a group of 5 players who formed a subtop, which included me in 8th place. This was kind of the point where I thought 4th place was the highest possible, which I had talked to Zoltan a bit about. From 8th place I slowly made my way up. Then there was a week of harder puzzles, which I got through pretty well and in which Hideaki and Nyuta dropped some points, and all of a sudden 3rd place became a possibility. I made it to 5th place behind Zoltan, who came agonisingly close to taking 3rd place, but eventually had to concede 4th. Sadly it wasn't to me, as Psyho overtook me on the same day I overtook Zoltan. Two days before the end I had a bad solve and lost all chance of 3rd place. Psyho came pretty close to getting 3rd, but made a mistake on the last day. I just missed out on getting 4th though. If there is another CTC next year, I know I'll have a pretty good chance of reaching the top 3 as long as I stay a bit more mistake free, as I really made far too many mistakes. Or if the penalty for mistakes isn't so big.

Okay, enough talk. I still have one Tapa puzzle for you all. It's a Tapa Tapa, which didn't match the rules. I really liked this puzzle. It worked out well. I know not all cities can be defined, but the Tapa Wall is unique though, which I think should be good enough.

Rules for Tapa

**Tapa Tapa**

Follow regular Tapa rules. Additionally, each train represents a city. The shortest possible distances between certain cities are given. The shortest distance is the shortest of all routes traveling horizontally and vertically along the Tapa wall that touches both cities (diagonal touching is enough).

[Note: There is no clue restriction on the cities, the wall can touch them any way possible]

LMI has a good normalization system but others don't follow it enough. 2 point normalization, and rank based for a fraction too. I'd prefer it to the weird system of TVC. I'd obviously also prefer instant grading but few people seem to be following my lead.

But your post also reminds me of a continuing problem: a fixed rather than proportional time bonus. This meant that on earlier tests (like TVC IX and XI), large victories were undervalued, so that XII stands out more. Let me explain.

Consider this:

On TVC XI, Palmer scored 1051 puzzle points in 55 minutes, 19.1 points per minute. But then he only got 10 points per minute time bonus for 20 minutes. A penalty for finishing early of 182 points.

On TVC XII, Palmer scored 1247 puzzle points in 67 minutes, 18.6 points per minute. Then he earned 13 points per minute time bonus for 8 minutes. A penalty for finishing early of just 44.8 points.

On both tests, I'd expect his score to be ~19*75 or ~1425 points. But his actual scores on the tests were 1251 and 1338 despite similar performances.
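The arithmetic in the two examples above can be checked with a short sketch. It assumes a 75-minute test (55 + 20 and 67 + 8 minutes) and treats the "penalty" as the gap between the solver's own points-per-minute rate and the fixed bonus rate, multiplied by the minutes left. The function name is made up for illustration.

```python
def early_finish_penalty(puzzle_points, minutes_used, bonus_per_minute, minutes_left):
    """Points lost compared with a bonus paid at the solver's own rate.

    Assumed model: a proportional bonus would pay the solver's own
    points-per-minute rate for every unused minute.
    """
    solving_rate = puzzle_points / minutes_used
    return (solving_rate - bonus_per_minute) * minutes_left


# Figures quoted in the comment above.
xi = early_finish_penalty(1051, 55, bonus_per_minute=10, minutes_left=20)
xii = early_finish_penalty(1247, 67, bonus_per_minute=13, minutes_left=8)

print(f"TVC XI penalty:  {xi:.1f}")
print(f"TVC XII penalty: {xii:.1f}")
```

This gives roughly 182 and 45 points, in line with the figures quoted above; the small difference on XII comes from rounding the solving rate to 18.6 before multiplying.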

On TVC IX, I scored 749 puzzle points in 49 minutes, or 15.6 points per minute (Palmer would have been at 16 ppm if not for a typo). I then got just 8 points per minute time bonus for that result. 15.6 * 75 would be ~1170 points. I got 971. An effective 200 or so point penalty compared to my actual solving rate.

It's best to just compare "total finish times". If you did that, your 2nd on IX should be normalized to about 745 normalized score, not the 857.9 that happened.

On TVC XI, the second place solver would have had about 842, not 912.1.

So it's not just that Palmer did great on XII (he certainly did). It's that giving smaller point values for time bonus, despite similar point/minute solving rates, suppressed the natural sizes of earlier victories. Proportional bonus is the best way in my opinion.

I meant to say weird system of CTC (not TVC) above in the first paragraph. That each day's puzzle ends up worth a variable amount of points is not ideal for a 4 test system.

I don't mean a system like the CTC where each test gets a certain point value. I just meant making the 1000 point mark at the average of multiple players' scores instead of only the top player's score. So then all tests will still have a comparable distribution as now, except it takes out the problem a runaway performance creates.

The problem you addressed last time is that you didn't get credit for having a runaway performance. But now people don't really get credit for doing well compared to others while someone has a runaway performance. And I think the standings should accurately depict how people did compared to each other. If you for example take out Palmer's scores, then your XI and XII performances are comparable to the rest of the players.

The same goes, in reverse, for my X and XI performances. My X performance is better than my XI performance compared to the other players. But because the gap Palmer had in XI to the rest is much smaller than in X, the normalised scores are about the same.

The problem is that taking the best 3 out of 4 tests can basically make a test like XII completely irrelevant.

So what I mean to say is that you have to get the scores to accurately depict how well everyone did compared to each other to pick the best 3 out of 4 tests. There should still be a normalised mark for each test so the scores will become equal for each test (unlike in CTC), but I don't know if only the top player's score is the best mark for that.

I actually had a pretty good XII until you consider I lost >15% of my expected points on one poorly formed submission, which took me from close to 2nd to a clear 2nd, and at the end of the test got the big loop puzzle out in 10 minutes when I only had 8.5 left to get it submitted (with penalty). I often have in mind an expected score given work on page, and when I put mine in for this test, I get about 800 normalized to Melon, which feels accurate both to my performance and to Melon's. Given all the lost points by others in the top 10 with typos and such, I don't know that the suppressed XII is just an effect of Palmer. It's as much an effect of one solver finishing and the others still having puzzles to do, which means just test timing and checking will lead to them having lower scores. See TVC X for a similar situation when Distiller was 20 minutes worth negative points for me, and I'm sure others too.

All this gets to a general point: on fixed-length tests with arbitrary puzzle points, you can most easily compare finishers, as a time-to-time comparison will be exact. Anything else, including comparing a finisher to someone still somewhere on the course, will not be as accurate, as there are no partial points to suggest exactly how far you are towards your next 50/100 point jump.

So, in the absence of more time so everyone in the top 10 or 20 finishes, the LMI system is probably the right place to start next year as it does two things TVC did not: A) it has a second pivot point for the normalization, to smooth the normalization and B) it includes a rank component to also reward good performances even in the presence of a runaway first or second place solver. If I'm the only voice next year on TVC scoring, as I was this year on the LMI forum, I'll simply say "use the LMI normalization". But maybe more people will contribute to the discussion next time before, and not after, the contests.

--Thomas, whose comments keep getting eaten up so I'm trying to reply by some different means than openid.

(this is a reply to both the above comment and your LMI post)

I would certainly agree that in a perfect world a test with most everyone finishing and an instant grading system to not make errors so ruinous is optimal. But that would require either short tests (do not want) or for solvers to have lots of time to set aside for these (obviously going to exclude a lot of people). So for nonfinishers, issues like puzzle resolution are just going to have to factor in to a solver's approach. In general when looking at top scores on tests I see a lot of LCS times near the time limit, so I think it's something that can be learned.

I am still not convinced pinning a pivot point at the top is a smart choice. Rank score mitigates the problem somewhat, but the issue remains the same: you reward someone's great performance by knocking everyone else down a lot. With a rating system that ignores worst performances, that might just mean the test gets ignored for other people, whether their relative performance was good or not.

I know you brought up that you have to arbitrarily decide where in the top to do the pivot, but I don't see why top10% average is not a good compromise. It still takes into account huge performances at the top, so that on TVC XII (106 positive scores) only 2nd and 1st finished above that line.
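To make the comparison concrete, here is a minimal sketch of the two ways of placing the 1000 mark. It assumes a plain linear scaling (`1000 * score / mark`), which may not match the actual TVC formula, and the 20 scores are invented purely for illustration. `normalise` and `top_fraction` are hypothetical names.

```python
def normalise(scores, top_fraction=None):
    """Scale scores so that the chosen mark maps to 1000.

    top_fraction=None pins the mark to the single best score;
    top_fraction=0.1 pins it to the average of the top 10% of positive scores.
    """
    positive = sorted((s for s in scores if s > 0), reverse=True)
    if top_fraction is None:
        mark = positive[0]
    else:
        k = max(1, round(len(positive) * top_fraction))
        mark = sum(positive[:k]) / k
    return [round(1000 * s / mark, 1) for s in scores]


# Invented raw scores with one runaway performance at the top.
scores = [1300, 950, 900, 880, 850, 800, 760, 700, 650, 600,
          550, 500, 450, 400, 350, 300, 250, 200, 150, 100]

by_top = normalise(scores)
by_group = normalise(scores, top_fraction=0.1)

print("2nd place, top-score mark:   ", by_top[1])
print("2nd place, top-10% avg mark: ", by_group[1])
print("1st place, top-10% avg mark: ", by_group[0])
```

With the group mark, the runaway winner still finishes well above 1000, while the players behind are no longer pushed down as far by a single outlier.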

In response to your claim about certain tests being worth more than others, that's exactly what I think is out of balance with the current system, and what I think a top10% system could do better. Top finisher had a good day? That score probably gets thrown out, whether it was a good run or not. All the top finishers had a few hiccups? Suddenly the test could be worth quite a bit for everyone.

Best example of the latter I can think of now is TVC XI: my run was going excellent until I got really held up on Power Of. I think the normalized scores being higher on that test than on X or XII do reflect this appropriately (perhaps the puzzle resolution issue and the time bonus disadvantage are counterbalancing each other). And of course, the mistake issue is present here too; EKBM was very close to a high 9xx normalized score on that test.

Heh, it seems you proposed a CTC-style scoring system around the same time I posted the results of applying it to the TVCs.

After thinking it over more, I'm not sure an exponential distribution is appropriate for normalizing a whole test since it compresses scores in the middle far too much. A linear model, like the LMI ratings or the TVC normalization, is a good choice. It occurred to me that the feature of the CTC scoring I was really going after was using the top 10% average (not top score) to normalize things, for exactly the reasons you detail. The current system is too tied to a runaway performance at the top.

That's an interesting point that the amount of time bonus might have contributed to which scores were the most lopsided. I would also prefer a proportional bonus system in general (WPC 2011 sprint rouuuuund). That said, I do think my TVC XI and IX performances were not as good as my run on XII. Not sure about TVC X, since Distiller was too crazy for me to evaluate how well I did relative to an "optimal day".

For Tapa Tapa (why that name?), how do you count path length? Is it number of wall squares, or total king-moves (with, in your puzzle, only wazir moves allowed along the wall), or something I managed to miss?

Tapa Tapa is called that because it originally used Estonian city names. Tapa is an Estonian city and a major hub in the Estonian railroad network, hence the trains.

You count them by traveling horizontally and vertically along the tapa wall. You can start from any square horizontally, vertically or diagonally touching any city.

So... if I start at one train, make a diagonal step onto a wall, take three horizontal and two vertical steps *along* the wall, and make one last step onto another train, is the total 5? (Also confusing: the five wall-to-wall steps take place along six wall squares.)

The length of the path is just the number of squares of the wall between the cities.
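As a rough illustration of that counting rule, here is a breadth-first search sketch. The grid encoding (`#` for wall, `.` for unshaded, letters for cities) and the function name are assumptions made up for this example, and the grid is not a valid Tapa solution; it only reproduces the path described in the question above.

```python
from collections import deque


def shortest_wall_distance(grid, city_a, city_b):
    """Count the wall squares on the shortest wall route between two cities.

    The route may start or end on any wall square touching a city
    (diagonal touching counts) and moves only orthogonally along the wall.
    Returns None if the wall does not connect the two cities.
    """
    rows, cols = len(grid), len(grid[0])

    def find(symbol):
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] == symbol:
                    return r, c
        raise ValueError(f"city {symbol!r} not found")

    def wall_squares_touching(cell):
        r, c = cell
        touching = set()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    if grid[nr][nc] == "#":
                        touching.add((nr, nc))
        return touching

    starts = wall_squares_touching(find(city_a))
    goals = wall_squares_touching(find(city_b))

    # Each start square already counts as one wall square on the route.
    queue = deque((cell, 1) for cell in starts)
    seen = set(starts)
    while queue:
        (r, c), length = queue.popleft()
        if (r, c) in goals:
            return length
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                if grid[nr][nc] == "#" and (nr, nc) not in seen:
                    seen.add((nr, nc))
                    queue.append(((nr, nc), length + 1))
    return None


# Three horizontal and two vertical steps along the wall: six wall squares.
example = [
    "A.....",
    ".####.",
    "....#.",
    "....#.",
    ".....B",
]
distance = shortest_wall_distance(example, "A", "B")
print(distance)
```

Under this reading, the route from the question counts as 6 (the number of wall squares), not 5 (the number of steps between them).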
