Well, look at the screenshot. The CPU is completely idle 50% of the time (and runs one thread during the other half, when it could run two).
An even better solution might be to have exactly one dedicated I/O thread which does all reads asynchronously but sequentially, and feeds the results to the NUM_CPU+1 worker threads, which block until data arrives.
Under Linux, this probably won't gain an awful lot, since the elevator mechanism (the kernel's I/O scheduler) should already make sure that the hard disk is not thrashing when two threads try to read concurrently, but I wouldn't be so sure about that under Windows...
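To make the single-reader idea concrete, here is a rough sketch in Python (chosen only for brevity; the threading structure is what matters, not the language). One reader thread reads the files strictly one after another and pushes their contents into a bounded queue, and NUM_CPU+1 workers block on that queue. The names `process_files`, `reader`, `worker` and the `process` callback are made up for illustration and are not taken from any existing code.

```python
import os
import queue
import threading

NUM_WORKERS = (os.cpu_count() or 1) + 1  # the NUM_CPU+1 workers mentioned above
_DONE = object()                         # sentinel telling a worker to stop


def reader(paths, work_queue, num_workers):
    """Single I/O thread: read the files sequentially and hand them to the workers."""
    for path in paths:
        with open(path, "rb") as f:
            work_queue.put((path, f.read()))  # blocks while the queue is full
    for _ in range(num_workers):
        work_queue.put(_DONE)                 # one stop marker per worker


def worker(work_queue, results, lock, process):
    """Worker thread: block until data arrives, then process it."""
    while True:
        item = work_queue.get()
        if item is _DONE:
            break
        path, data = item
        value = process(data)
        with lock:
            results[path] = value


def process_files(paths, process, num_workers=NUM_WORKERS):
    """Run `process` on the contents of every file; return {path: result}."""
    # Bounded queue, so the reader cannot run arbitrarily far ahead of the workers.
    work_queue = queue.Queue(maxsize=2 * num_workers)
    results = {}
    lock = threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(work_queue, results, lock, process))
        for _ in range(num_workers)
    ]
    for t in workers:
        t.start()
    io_thread = threading.Thread(target=reader, args=(paths, work_queue, num_workers))
    io_thread.start()
    io_thread.join()
    for t in workers:
        t.join()
    return results
```

Calling `process_files(list_of_paths, len)` would return the size of each file. In a real parser, `process` would be the parsing step, and error handling (a worker dying while the reader waits on a full queue) would need more care than this sketch gives it.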
If someone can produce a cross-platform (POSIX + Windows) memory mapping class, then we might leave that work to the OS, too. It would probably be the most efficient way, anyway.
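Just to show the shape such a wrapper could take, here is a small sketch using Python's standard `mmap` module, which already hides the POSIX/Windows difference. This is not the class asked for above, only an illustration of the interface; `MappedFile` and its methods are hypothetical names.

```python
import mmap
import os


class MappedFile:
    """Read-only memory-mapped view of a file (hypothetical helper)."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = None
        # An empty file cannot be mapped, so only map when there is content.
        if os.fstat(self._file.fileno()).st_size > 0:
            self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def data(self):
        """Return the mapped bytes; pages are loaded by the OS on first access."""
        return self._map if self._map is not None else b""

    def close(self):
        if self._map is not None:
            self._map.close()
            self._map = None
        self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
```

Usage would be along the lines of `with MappedFile(path) as m: parse(m.data())`, where `parse` stands for whatever consumes the bytes. The point is that no explicit read call remains: the OS decides when and how to pull the pages in.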