Tests show that cache bypass doesn't scale very well past a few
concurrent processes, with a lot of lock contention in the RX
layer. Switching the implementation to the iovec based rx_Readv
alleviates much of this.
Also take advantage of the fact that the upper layer readpages
only sends down contiguous lists of pages, and issue larger read
requests and populate the pagecache pages from the iovecs we
get back. The loop logic is changed significantly to accomodate
the new pattern.
Read throughput is improved by about 30-40% for some parallel read
benchmarks I use. Along with some other tweaks, it can allow the
throughput to be more than doubled.