Peer Review: Anonymity should not be at the expense of transparency
We’ve all been through this: You do some work, you think it’s good, you send it to a top conference, and it gets rejected. Happens to the best of us, and, in fact, happens more often than many think. For example, 2013 was a very good year for me publications-wise, with 7 papers in top venues (SOSP, TOCS, EuroSys, SIGMOD, OOPSLA and RTAS). Yet, of my 19 submissions that year, 8 got rejected, so my acceptance rate wasn’t even 60%. And that was a good year. I’ve had much worse.
It has just happened again, this time with EuroSys, by any standard a top conference. (I would say that, I was the PC chair a few years ago. 😉 )
This time I was a bit, say, surprised. The paper had four reviews: one was a “weak reject”, one was a “weak accept”, and the remaining two were clear “accept”, making it, as my colleague and co-author Kev remarked, our highest-rated reject ever.
Now this sort of stuff happens. While one would naively think that with two clear supporters and one weak critic, this should get up, the reality is different. Things may come out of the discussion at the PC meeting, the weak critic may become a stronger critic from reading other reviews, etc. I’ve seen a lot of this happen in the many PCs I served on. And I’m certainly not saying that the PC meeting made a mistake. That isn’t my call; the whole point of peer review is that the judgement is made independently.
However, I do have a significant issue with the way this paper was handled by the PC, and that relates to transparency: The PC gave us, the authors, no halfway convincing justification for their decision. On the contrary, the points raised against the paper by the review could have easily been rebutted. In the interest of transparency I will not just make this claim, but back up my rant by going through the reviewer’s arguments below.
So, basically, the PC made a decision that is not understandable or reproducible from the data at hand. This is an issue in that it creates an impression of arbitrariness that has the potential to damage the reputation of the conference. Hopefully it is a one-off case. However, the conference failed to use two simple mechanisms that are part of what I consider to be best practice in transparency.
- Rebuttals: Many conferences send out reviews to authors and give them a few days to respond to them before the PC meeting. This is a mandatory requirement for all SIGPLAN-sponsored conferences (including ASPLOS, which is also one of “ours”). It has been used by most if not all EuroSys PCs since I introduced it in 2011, and has been used by many other top conferences, including last year’s OSDI. EuroSys’15 didn’t do rebuttals, which, in my opinion, is a transparency regression.
- Summary of PC discussion: This has recently been adopted by many top conferences: Papers discussed at the PC (at least those which end up being rejected) are given a brief summary of the PC discussion, stating the reasons for rejection. This has been used by both major systems conferences on whose PCs I served during the last year (OSDI and ASPLOS) and many others. I don’t remember whether previous EuroSys PCs used it, but it’s become a common enough practice to consider it best practice.
By not doing either of these, the EuroSys PC didn’t actually do anything wrong, but clearly missed an opportunity to be seen to be transparent. I think that’s highly regrettable for a conference I think highly of. In the particular instance of our paper, I would have been far less annoyed about the outcome if I had been given a good justification.
So, let’s have a look at the information we did get. Below I reproduce uncensored (other than for the list of typos) Review A, the only negative one. And I’ll show how I can rebut every criticism using only what is in the paper (in line with normal rules for rebuttals, which aren’t allowed to introduce new material). For clarity, I’ll be more detailed and verbose than I could in an actual rebuttal, which is generally very space constrained. You can check my arguments against the submitted version of the paper. And, to show that I’m not hiding anything, here are the complete reviews for the paper.
Note that I do not claim that being given the opportunity for rebuttal would have changed the outcome. But it would have given us some assurance that the decision wasn’t made on the basis of misunderstandings (this is the precise reason for rebuttals), and a summary would have indicated to us that our comments were taken into account. Without that we are left wondering.
Ok, here comes the review.
Overall merit: 2. Weak reject
Reviewer expertise: 4. Expert
===== Paper summary =====
The paper explores a trade-off between coarse grained locking, fine grained locking, and transactional memory in the context of a microkernel. It claims that using a single global lock with microkernel on a small embedded system is advantageous in these settings, the main reason being it allows for static verification of the kernel. The differences between used synchronization techniques are evaluated on a quad-core x86 and ARM systems and seL4 microkernel. The evaluation, which uses a server workload, shows no significant performance difference between the techniques on a real-world sized benchmark.
===== Comments for author =====
Experimental results don’t quite back-up the claim that a single global lock is better than fine-grained locking. Fine-grained locking performs best on both x86 and ARM with 4 cores on the multi-core IPC micro-benchmark (Figure 7), which is the benchmark showing any difference between synchronization different.
This simply misrepresents the claims of the paper. What we claim (in the abstract) is that “coarse-grained locking [with a microkernel on closely-clustered cores] achieves scalability comparable to fine-grained locking.” Further, at the end of the Introduction, we state that “the choice of concurrency control in the kernel is clearly distinguishable for extreme synthetic benchmarks. For a real I/O (and thus system call intensive) work-load, only minor performance differences are visible, and coarse grain locking is preferred (or the equivalent elided with hardware transactional memory) over extra kernel complexity of fine-grained locking.” A similar statement is in the Discussion section. I claim that the statements in the paper are consistent with the results presented, and contradict the assertions made by the reviewer.
If fine-grained locking performs best, even if only on a micro-benchmark, already with only four cores, why not use it?
Again, we explain this in the paper: it is more complex, and as such very error-prone (as we experienced very painfully when implementing it). That concurrency is hard is well established, and if you can avoid it without degrading performance, then it’s a reasonable thing to do. And, as we also clearly state, we wouldn’t be able to verify such a highly-concurrent kernel. So I really don’t understand why the reviewer simply ignores our arguments.
I think that it is interesting to see that transactional memory outperforms fine-grained locking and is the best option overall, but it is not (yet) widely spread and the paper is making a claim about BKL anyway.
While our microbenchmarks show that transactional memory performs best, this is only visible in extreme microbenchmarks. Like microbenchmarks in general, these are an analysis tool, but don’t tell you anything about real performance, as they are not representative of real workloads. (You don’t do 1000s of back-to-back identical system calls with no actual work in between, except for analysis.) This is why we did macrobenchmarks (Figs 8–11). And they show no discernible performance differences between the configurations. This is the core data that back our argument that the details of locking don’t matter (for the particular part of the design space we’re looking at), and justify taking the simplest option.
It is not clear that using four-core systems to evaluate synchronization techniques is acceptable today, as even phones are starting to have eight cores. The findings of the study are not really valid for eight cores, as claimed in the paper, as contention will be much higher with twice as many cores.
Yes, 8-way ARM chips are starting to appear, and our data cannot be used to prove anything beyond 4 cores. However, given that on the macrobenchmark there is no discernible difference for 4 cores, it is a fairly safe assumption that if there is a difference at 8 cores, it isn’t going to be dramatic.
This claim is backed by Fig 2(b). It represents an extreme case of hammering the kernel, with “think time” (i.e. out-of-kernel time) being equal to in-kernel time. As shown in Fig 1, only about 20% of pause times in our macrobenchmark are that short; the median is about four times as big. Hence, Fig 2(b) represents an unreasonably high lock contention. Yet, the benchmark still shows very little difference between lock implementations. If we double the number of cores, contention would double (as the reviewer states, it could actually be a bit worse than double), but would still not even be half of that of Fig 2(b), which shows essentially no difference. That’s why I can claim with reasonable confidence that the 8-core case would still work.
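To make that back-of-envelope argument concrete, here is a minimal sketch. It is not from the paper: it assumes a deliberately crude proxy in which contention on the single kernel lock grows linearly with the number of cores and shrinks linearly with the ratio of think time to in-kernel time, and it assumes the Fig 2(b) experiment ran on 4 cores. The function name and the proxy itself are illustrative assumptions, not a model of the real system.

```python
def contention_proxy(cores: int, think_to_kernel_ratio: float) -> float:
    """Crude, assumed proxy for pressure on a single kernel lock.

    cores: number of cores competing for the lock.
    think_to_kernel_ratio: out-of-kernel ("think") time divided by
        in-kernel time; larger means the lock is requested less often.
    """
    if cores <= 0 or think_to_kernel_ratio <= 0:
        raise ValueError("cores and think_to_kernel_ratio must be positive")
    return cores / think_to_kernel_ratio


# Fig 2(b): think time equal to in-kernel time (assumed to be on 4 cores).
fig2b = contention_proxy(cores=4, think_to_kernel_ratio=1.0)

# Macrobenchmark: median think time is about four times the Fig 2(b) value.
macro_4_cores = contention_proxy(cores=4, think_to_kernel_ratio=4.0)
macro_8_cores = contention_proxy(cores=8, think_to_kernel_ratio=4.0)

print(fig2b, macro_4_cores, macro_8_cores)  # prints: 4.0 1.0 2.0
print(macro_8_cores / fig2b)                # prints: 0.5
```

Under this linear proxy, the 8-core macrobenchmark at median think time lands at about half the pressure of the 4-core Fig 2(b) case. Real contention need not scale linearly, which is why the argument above allows for "a bit worse than double".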
Performance of locking primitives and IPC is well evaluated and results are very clearly presented.
Evaluation is limited as it uses a single benchmark. It is hard to draw solid conclusions based on one workload.
We cannot dispute the fact that we only have one macro-benchmark. However, I would argue that nothing more could be learned from having more than one (see below).
Moreover, the benchmark is a server workload (YCSB on Redis) and the paper is supposed to focus on lower-end systems.
For the point of the exercise, it is totally irrelevant what kind of workload it is. All the actual work is at user level, the kernel just acts as a context-switching engine. The only thing that matters is the distribution of in-kernel and think times, as this is what determines lock contention.
We took this particular benchmark as, of all the realistic workloads we could think of, it hammers the kernel most savagely, by producing a very high system-call load. In other words, it was the most pessimistic yet halfway-realistic workload we could come up with.
Server systems will have many more cores (typically 16 today) and using fine-grained synchronization is likely to be even more important for them.
With respect, that’s beside the point. Server platform or not, we specifically state that we’re aiming at cores that share a cache and have low inter-core communication latencies (see Introduction). Server platforms with high core counts tend not to match this requirement, and we argue that they would be best served with a “clustered multikernel” approach, where a BKL is used across a closely-coupled cluster, and a multikernel approach without shared state across clusters. All that is stated in the Introduction of the paper.
It is not clear why more than 32000 records would exceed memory limitations of the systems. If benchmark uses 1kB key-value pairs, this should only be 32MB total and ARM system has 1GB and x86 has 16GB of memory.
OK, we could have been clearer about those implementation limitations of our setup. However, if you think about it, you’ll realise it doesn’t matter for the results. Using more records, and thus increasing the working set of the user-level code, would have resulted in an increase of cache misses, and thus overall reduced userland performance. The kernel, with its tiny working set, would be less affected. The net effect is that think times go up, and thus lock contention goes down, making the benchmark more BKL-friendly.
End of review.
In summary, I think I can comprehensively rebut any of the arguments the reviewer makes against the paper. Where does that leave us?
Well, if that was all there was against the paper, then it is really hard to understand why it was rejected. There may have been other arguments, of course. But then the basic principle of transparency would have obliged the reviewers to tell us. After all, that’s the point of having written reviews that get released to authors: to avoid the impression that random or biased decisions get made behind closed doors.
So, in my opinion, the PC failed the transparency test. I can only hope that we were the only such case.
PS: There was also a shadow PC, consisting of students and junior researchers. It operated under the same conditions as the real PC, but had no influence on the decisions of the real PC. Interestingly, I found the reviews of the shadow PC of much better quality than the “real” ones, more extensive and better argued.