Recently I was trying to make a program that would average the input of thousands of integers in a file and print the result. Furthermore there were thousands of files I needed to average. I wondered what would be the fastest way to do this? I could use more threads but wouldn’t memory I/O become saturated by just one or even two threads? To test this I made two programs AvgFileCreator.c and AvgMultFile.cpp. AvgFileCreator is a simple tool that writes sequences of integers into files as binary data. AvgMultiFile.cpp actually averages the integers in the file using the number of threads you specify on the command line.
I made 16 large files (~25 mb each) and timed how long it took to complete using different numbers of threads with the following command. I tested it on my 2012 Retina Macbook Pro that has 4 cores.
time ls <the_test_files> | <AvgFileMulti.out> <number_of_threads>
I ran the command several times before I began taking averages so cache misses, page faults would be negligible. I averaged the results over 10 trials for 1, 2, 4, 8 and 16 threads and found the following results.
For my purposes with the averaging program, this graph is the end of the story. Four threads is the fastest way to do it. However I was still curious if this meant that four threads was also the fastest way to read memory so I removed the averaging code and took a more time samples.
The results are not particularly surprising. Throwing more threads when trying to read in memory is only so helpful. You can probably get a speed boost by using more than one thread but not many after that. The fastest for my computer was 2 threads. Even 3 threads was slower.
The take away here is that the memory speed is well balanced with the number of cores for my test machine and I suspect the same for most well designed computers. If my computer had many more cores it would have been a waste because the memory is too slow. On the other hand if my computer had very very fast memory it wouldn’t matter because there wouldn’t be enough cores to process it any faster. All of this leads me to the conclusion that the fastest way to do a simple task in parallel with data in memory is to use all your cores and no more. A well designed computer will be bottlenecked by both the parallel processing power it has and the speed of it’s memory.
Checkout the bitbucket repo AvgFileFun for the code and readme.
From the movie Tron:Legacy
A few years ago as I had just started tinkering with Arduinos, I watched the dazzling movie Tron:Legacy. It’s not a grade “A” movie but the visuals are unbelievable. The idea of using an Arduino to create something that could capture someone’s fascination like the effects in the movie popped into my head. I forgot about the idea for a awhile but then recently stumbled across the TLC5940 chip that makes using large numbers of PWM outputs a breeze (read “Dimmable LED controller” if you’re unfamiliar). I created an Arduino powered array of 32 individually dimmable LEDs to display light shows. I put the display inside a picture frame and now it’s a constant source of intrigue. In this post I’ll share both how I made the device and the code LightArray runs so you can build one yourself.
But first a demo:
I recently made a small suite of programs that sorts words in a file in theoretically less than n*log(n) time and outputs only unique words. The master program works by piping the words to two “sort” programs to sort half of the words in the file. Finally another program merges the output of the sorts. It’s not terribly complicated but it’s a fun little program and the code came out very clean even for c. The program is a good introduction to forking, exec-ing, and piping. Heres a schematic of the program called Uniqify.
I say the program could theoretically sort faster than n*log(n) because it uses two different sort processes. If the OS was smart it would put these processes on different cores and thus do some of the sorting work simultaneously. In reality though, you will need a large input file to overcome the lost time in creating new processes and merging the sorted output back together.
Note: You must name the executable files correctly. Start the program with “./Uniqify”. It gets input from stdin so you’ll probably want to redirect an input file. Same for stdout.
Here is the source code:
Shell script that will compile everything and name it correctly: cmpall.txt
The DC3 algorithm is a powerful linear time suffix sorting algorithm. If you’re not familiar with suffix sorting, it is the process in which you sort the different parts of a single string. For example banana suffix sorted would be:
Efficient suffix sorting is needed in data compression, and substring matching. Practically speaking, suffix sorting can be used to search for text in an ebook, identifying long repeated sequences of DNA in a genome, etc.
The most obvious way to suffix sort is to use a n*log(n) comparison algorithm to compare all the suffices of a string. However in the last 20 years, faster linear time suffix sorting algorithms have been developed. These algorithms beat the n*log(n) barrier by avoiding redundant sorting associated with suffix sorting. For example, from the banana suffix sort above, if you have sorted “ana” and “a”, there is no reason to compare the entire suffices “nana” and “na”. Once you determine that they both start with “n”s you should just use the result of the “ana” and “a” sort. The DC3 algorithm uses this rationale to create an elegant solution for linear time suffix sorting. I’ll explain this algorithm by first exploring the process behind it and then by examining a java implementation I created.
Every year IFC puts out a “facebook” for all incoming freshmen. We usually use one of the standard blah designs from the publishing company, however this year I got to take a crack at the design. I didn’t want to use an overwhelming amount of photos, just a few really good ones that explain why Emory is awesome.