Recently I was writing a program to average thousands of integers stored in a file and print the result. Furthermore, there were thousands of files I needed to average. I wondered what the fastest way to do this would be. I could use more threads, but wouldn’t memory I/O become saturated by just one or two threads? To test this I made two programs, AvgFileCreator.c and AvgMultFile.cpp. AvgFileCreator is a simple tool that writes sequences of integers into files as binary data. AvgMultiFile.cpp does the actual averaging, using the number of threads you specify on the command line.
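To make the file format concrete, here is a minimal sketch of both halves: writing integers as raw binary and averaging them back. It is not the code from the repo. The function names `write_ints` and `average_file` are made up for illustration, and I am assuming 32-bit integers stored in the machine's native byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: write 32-bit integers to `path` as raw binary data
// (assumed layout: native byte order, no header).
void write_ints(const std::string& path, const std::vector<std::int32_t>& values) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(std::int32_t)));
}

// Hypothetical helper: read the file back in chunks and return the mean.
// Returns 0.0 for an empty or unreadable file.
double average_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<std::int32_t> buffer(1 << 16);  // 64Ki integers per read
    std::int64_t sum = 0;
    std::int64_t count = 0;
    while (in) {
        in.read(reinterpret_cast<char*>(buffer.data()),
                static_cast<std::streamsize>(buffer.size() * sizeof(std::int32_t)));
        // The last read may be short; gcount() says how many bytes arrived.
        assert(in.gcount() % static_cast<std::streamsize>(sizeof(std::int32_t)) == 0);
        const std::size_t got =
            static_cast<std::size_t>(in.gcount()) / sizeof(std::int32_t);
        for (std::size_t i = 0; i < got; ++i) {
            sum += buffer[i];
        }
        count += static_cast<std::int64_t>(got);
    }
    return count == 0 ? 0.0 : static_cast<double>(sum) / static_cast<double>(count);
}
```

Summing into a 64-bit accumulator keeps a file of a few million 32-bit values from overflowing before the final division.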
I made 16 large files (~25 MB each) and timed how long the program took to complete with different numbers of threads, using the following command. I tested it on my 2012 Retina MacBook Pro, which has 4 cores.
time ls <the_test_files> | <AvgFileMulti.out> <number_of_threads>
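The command pipes file names into the program and passes the thread count as an argument. One way to spread those files across a fixed number of threads is sketched below. This is an assumption about the general shape, not the repo's actual code: `read_paths` and `process_files` are hypothetical names, and the per-file work is passed in as a callable so the sketch stays independent of how each file is averaged.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: read one file name per line, e.g. from std::cin
// when the names are piped in from ls.
std::vector<std::string> read_paths(std::istream& in) {
    std::vector<std::string> paths;
    for (std::string line; std::getline(in, line);) {
        if (!line.empty()) {
            paths.push_back(line);
        }
    }
    return paths;
}

// Hypothetical helper: run `work` on every path using `num_threads` workers.
// Each worker claims the next unprocessed index from a shared atomic counter,
// so a slow file does not leave the other threads idle.
std::vector<double> process_files(
    const std::vector<std::string>& paths,
    unsigned num_threads,
    const std::function<double(const std::string&)>& work) {
    assert(num_threads > 0);
    std::vector<double> results(paths.size());
    std::atomic<std::size_t> next{0};

    auto worker = [&] {
        for (std::size_t i = next.fetch_add(1); i < paths.size();
             i = next.fetch_add(1)) {
            results[i] = work(paths[i]);  // each index is written by one thread only
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back(worker);
    }
    for (auto& th : pool) {
        th.join();
    }
    return results;
}
```

With names arriving on standard input, a call would look something like `process_files(read_paths(std::cin), n, per_file_average)`, where `per_file_average` stands in for whatever function averages a single file.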
I ran the command several times before I began taking measurements so that cache misses and page faults would be negligible. I then averaged the results over 10 trials each for 1, 2, 4, 8 and 16 threads and found the following results.
For my purposes with the averaging program, this graph is the end of the story. Four threads is the fastest way to do it. However, I was still curious whether this meant that four threads was also the fastest way to read memory, so I removed the averaging code and took more time samples.
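A read-only variant can be sketched roughly like this, together with a small timing helper that discards a few warm-up runs before averaging. Both `read_only` and `mean_seconds` are hypothetical names, and timing inside the process with `std::chrono` is a stand-in for the shell's `time`, not what I actually measured with.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read the whole file in fixed-size chunks, do nothing
// with the data, and return the number of bytes read.
std::size_t read_only(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buffer(1 << 20);  // 1 MiB per read
    std::size_t total = 0;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        total += static_cast<std::size_t>(in.gcount());
    }
    return total;
}

// Hypothetical helper: call `run` `warmups` times untimed (to warm caches),
// then return the mean wall-clock seconds over `trials` timed calls.
template <typename Fn>
double mean_seconds(Fn&& run, int warmups, int trials) {
    assert(trials > 0);
    for (int i = 0; i < warmups; ++i) {
        run();
    }
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < trials; ++i) {
        run();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / trials;
}
```

Returning the byte count gives the caller something to check, so a run that silently read nothing does not get mistaken for a fast one.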
The results are not particularly surprising. Throwing more threads at reading memory only helps so much. You can probably get a speed boost by using more than one thread, but not much beyond that. The fastest for my computer was 2 threads. Even 3 threads was slower.
The takeaway here is that memory speed is well balanced with the number of cores on my test machine, and I suspect the same is true of most well-designed computers. If my computer had many more cores, they would have been wasted because the memory is too slow. On the other hand, if my computer had much faster memory, it wouldn’t matter because there wouldn’t be enough cores to process the data any faster. All of this leads me to the conclusion that the fastest way to do a simple task in parallel with data in memory is to use all your cores and no more. A well-designed computer will be bottlenecked by both its parallel processing power and the speed of its memory.
Check out the Bitbucket repo AvgFileFun for the code and readme.