Lies, damned lies, and benchmarks.

Note: the GitHub code has since been updated. You can see the implementation this post is referring to by viewing the “1.0” tag.

I have been investigating various methods of statistically inferring synonyms from natural language corpora. As part of my investigation, I implemented a method based on collocation counting, where two words are considered likely to be synonyms if they tend to occur close to each other in the training corpus. Since counting collocations is a CPU- and memory-intensive problem, I implemented my solution in two languages that I knew had excellent compilers: C++ (G++) and Haskell (GHC). I then benchmarked the compiled binaries in terms of speed and memory use. Both of my implementations were algorithmically equivalent and emphasized simplicity and ease of development over performance. As such, I believe they provide a measure of the relative performance that users of C++ and Haskell can expect from similar programs without manual optimization. In this article I present the results of my evaluation.

Background

When using collocations for synonym inference, it is assumed that two words w1 and w2 are more likely to have a related meaning if they tend to occur close together. This is captured by a metric called mutual information [1]:

I(w1, w2) = log( P(w1, w2) / (P(w1) P(w2)) )

The individual probabilities can be estimated from data. For simplicity, assume we are using maximum likelihood estimates based on some language corpus C. P(w1) then becomes the frequency of w1 in C, and P(w1, w2) the frequency of w1 occurring in close proximity to w2. What constitutes proximity depends on the implementation. One popular approach is to use absolute word distance: w1 and w2 are considered in each other’s proximity if they are separated by at most k words, k being a parameter of the model. Another approach is to use sentence distance, defined analogously. In my implementation I use sentence distance with k = 1. I furthermore ignore words in the current sentence and only consider words in adjacent sentences as neighbors (e.g. if the words “a” and “b” are both present in the current sentence but “b” does not appear in any adjacent sentence, “a” and “b” will not be considered neighbors).
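
To make that neighbor definition concrete, here is a minimal Haskell sketch. The type Sentence and the function withAdjacentContext are illustrative names only and do not appear in the actual implementation:

-- Illustrative sketch only (these names are not from the actual implementation).
-- A paragraph is a list of sentences; a sentence is a list of words.
type Sentence = [String]

-- Pair each sentence with the words of its adjacent sentences (sentence
-- distance 1). Words of the current sentence itself are excluded.
withAdjacentContext :: [Sentence] -> [(Sentence, [String])]
withAdjacentContext ss =
  [ (s, concat (before ++ after))
  | (i, s) <- zip [0 ..] ss
  , let before = if i > 0             then [ss !! (i - 1)] else []
  , let after  = if i < length ss - 1 then [ss !! (i + 1)] else []
  ]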

An adept reader will notice that a key element of estimating mutual information is counting proximal word pairs in a language corpus. Conceptually, this is a simple problem: for each word w in the corpus and for each word v near w, produce the count (w, v, 1) and then tally up the counts. In practice, however, the large size of the language corpus makes this a lengthy and memory-consuming operation. It is easy to see why: assuming a dictionary of N possible words, in the worst case we will produce counts for each possible pair – a total of N² count pairs. Even for a modest dictionary of 10,000 words, this amounts to 100 million count pairs.

When implementing a collocation counter, it is therefore imperative to choose a programming language and a compiler that produce efficient code. While C/C++ (G++) are natural candidates, according to The Computer Language Benchmarks Game, Haskell (GHC) is a worthy contender, being on average only 2.5x slower than G++ on a wide set of benchmarks. This seems like a small price to pay for the niceties of the language.

However, the data from The Computer Language Benchmarks Game can be a bit misleading, as it only compares the fastest implementations in each language. As such, it only provides an upper bound of sorts on the performance that a user can expect from a compiler. While tremendously useful in many applications, it does not reflect the performance of a typical, idiomatic program in each language. This “typical performance” is what I attempted to measure in this article for the particular use case of counting collocations.

Implementation

The source code of my C++ and Haskell collocation counters is available on GitHub. Both implementations follow the same general algorithm:

get_counts(sentence, context):
  count_list = []
  for w in words(sentence):
    for v in unique_words(context):
      count_list.append((w, v, 1))
  return count_list

count_colocations(input):
  counts = new HashMap
  counts[w1, w2] = 0 for all w1 <- V, w2 <- V
  for paragraph in input:
    for sentence in paragraph:
      context = get_adjacent_sentences(sentence, paragraph)
      for (w, v, c) in get_counts(sentence, context):
        counts[w, v] += c
  return sort_by_words(to_association_list(counts))

The only thing of note is that get_counts returns a list of counts instead of accumulating the counts directly in the hash map, which is arguably non-idiomatic in imperative C++. I made this particular choice to keep get_counts pure in Haskell, which would otherwise have had to use Data.HashTable in the IO monad.
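
For illustration, the Haskell side of that design boils down to something like the sketch below. The names getCounts and accumulate are placeholders, not the exact definitions from Colocations.hs: a pure function emits (word, neighbor, 1) triples, and a separate fold tallies them into a map.

import Data.List (foldl')
import qualified Data.Map as M

-- Sketch only: placeholder names, not the real code from Colocations.hs.
-- Pure: emits one (word, neighbor, 1) triple per co-occurrence.
getCounts :: [String] -> [String] -> [(String, String, Int)]
getCounts sentence contextWords =
  [ (w, v, 1) | w <- sentence, v <- uniqueWords contextWords ]
  where
    uniqueWords = M.keys . M.fromList . map (\x -> (x, ()))

-- The accumulation happens separately, in a single fold over the triples.
accumulate :: [(String, String, Int)] -> M.Map (String, String) Int
accumulate = foldl' (\m (w, v, c) -> M.insertWith (+) (w, v) c m) M.empty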

Both the C++ and Haskell implementations read all of their input from stdin into memory first (albeit Haskell does this lazily) and print the counts to stdout, one per line.
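
As a rough, self-contained sketch of that stdin-to-stdout pipeline in Haskell (heavily simplified: every input line is treated as a single sentence and paragraph boundaries are ignored, so this is not the program from the repository):

import Data.List (foldl')
import qualified Data.Map as M

-- Simplified sketch: counts collocations between each line and its
-- adjacent lines, reading stdin lazily and writing one count per line.
countPairs :: [[String]] -> M.Map (String, String) Int
countPairs sentences =
  foldl' bump M.empty
    [ (w, v)
    | (i, s) <- zip [0 ..] sentences
    , let ctx = concat [ sentences !! j
                       | j <- [i - 1, i + 1]
                       , j >= 0
                       , j < length sentences ]
    , w <- s
    , v <- ctx
    ]
  where
    bump m k = M.insertWith (+) k 1 m

main :: IO ()
main = do
  input <- getContents                       -- stdin is read lazily
  let sentences = map words (lines input)
  mapM_ (\((w, v), c) -> putStrLn (unwords [w, v, show c]))
        (M.toAscList (countPairs sentences))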

Results

I evaluated both implementations on an excerpt of plaintext Wikipedia articles generated with AutoCorpus. The full dataset is available in the GitHub repository as data/input.txt. The test system was a 2.4GHz quad-core Q6600 Linux box with 6GB of memory running 64-bit Ubuntu 11.10. The benchmarks were compiled with -O2 optimizations using G++ 4.6.1 and GHC 7.0.3. Unless noted otherwise, all presented data points have been averaged over 30 runs. For individual run results, see data/collocations-{cpp,hs}-times.csv in the git repository.

Figures 1 and 2 show the time and maximum memory used by the C++ and Haskell implementations as a function of input size in thousands of lines. Standard deviation was minimal and, while shown, cannot be discerned in the plots. Since both implementations ran essentially the same algorithm, the overall shape of the time and memory curves is, as expected, identical. On average, Haskell was 5.91x slower than C++ on the same input and used 4.95x more memory. Both ratios held to within 0.9x of these averages regardless of the size of the input.

Figure 1: Time taken by the C++ (G++) and Haskell (GHC) implementations as a function of input size. On average Haskell was 5.91x slower than C++ for the same input.

Figure 2: Maximum memory used by the C++ (G++) and Haskell (GHC) implementations as a function of input size. On average Haskell used 4.95x more memory than C++ for the same input.

Tables 1 and 2 show the costliest calls (cumulative time) for the C++ and Haskell binaries for an input of 20 thousand lines. Please take these results with a grain of salt as they show only one run with compiler optimizations turned off. Due to space constraints, only the top functions are shown. For full information, the curious reader is welcome to look at the gprof and GHC profiler outputs. The visualized C++ call graph is also available here.

function          cumulative time (%)
main              100
documentCounts    98.7
paragraphCounts   54.9
sentenceCounts    38.7
std::sort         31.7
words             29.1

Table 1: Costliest calls (cumulative time) of the C++ collocation counter. For function details, see Colocations.cpp.

function          cumulative time (%)
main              100
documentCounts    48.7
paragraphCounts   28.1
sentenceCounts    27.9
sentenceToSet     22.5
aggregateCounts   19.5

Table 2: Costliest calls (cumulative time) of the Haskell collocation counter. For function details, see Colocations.hs.

The time profiles of the two programs are very similar. The difference in the time spent in documentCounts is due to the C++ implementation performing the sorting inside that method, while the Haskell implementation does it directly in main. Curiously, the profiler indicates that over 70% of the individual time of the Haskell program was spent in the GC module (see GHC profiler output), suggesting that garbage collection might be the culprit behind Haskell’s slower performance.

Conclusions

A naive collocation counting algorithm in Haskell (GHC) was 5.91x slower and used 4.95x more memory than the same algorithm implemented in C++ (G++). The interpretation of these results is up to the reader. Personally, I think the slowdown is a reasonable trade-off for the productivity benefits of Haskell as a language, and I suspect it is still significantly faster than an equivalent algorithm implemented in a popular scripting language like Python, Perl or Ruby (although I have no data to back up that claim ;-)).

References

  1. M. Baroni, S. Bisi. Using cooccurrence statistics and the web to discover synonyms in a technical language. Proc. of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). 2004.

20 responses

I appreciate the attempt to be fair, but I would like to point out that these implementations are not using the same data structures. The C++ version uses vectors and C strings, while the Haskell version uses lists and Strings (which are just lists of Chars!). A more fair comparison would be to use Vectors and ByteStrings. Done right, I would be surprised if the Haskell version is more than twice as slow and pretty unsurprised if it’s less than 50% slower.

December 18, 2011 2:17 am

Thanks for the great tips, Jake! My goal was not to write the fastest possible collocation counter in C++/Haskell, but rather to quickly write one without much regard for performance, throw it at the compiler and see what I get.

December 18, 2011 2:26 am

Sure, but my point is only that this is not a direct comparison since they do different things. It also shouldn’t really be any harder to use Vector and ByteString compared to list and String aside from some additional imports. You don’t really have to pay special attention to performance at all.

December 18, 2011 2:35 am

Jake,

Fair enough 🙂 I guess I was expecting too much from the compiler.

December 18, 2011 2:46 am

What Jake is getting at is that “algorithmically equivalent” can mean different things: either “share the general algorithmic outline” or “use almost the same data structures in RAM”. Your implementation falls in the former category, whereas benchmarks are generally only meaningful for the latter category.

December 18, 2011 7:22 am

Just changing String to ByteString should make a lot of difference.

And the slowdown/memory cost is nothing specific to Haskell. If you use totally wrong data structures (which are a bit simpler to use, just a tiny bit) and call it “idiomatic” and benchmark that you’ll get useless data.

December 18, 2011 10:00 am

Hey Marek, thanks for the comment. I was interested in how fast GHC could make naive Haskell code, without me having to sacrifice the niceness of Strings and lists or consciously thinking about performance. One of my favorite things about Haskell is that I can use highly abstract data structures and still get decent performance, which I did in my article. I do know that you can optimize Haskell code in a million different ways but then you give up a fair deal of that abstraction. To see what I mean, take a look at the programs in the Computer Language Benchmarks Game: the fastest Haskell programs are rather ugly and look suspiciously like C. Their source is even longer than Java in many cases! That’s why I wasn’t interested in the “best case” performance but in the “default case” performance.

Of course what constitutes “default case” is different for everybody. That said, a lot of people take my article as a criticism of GHC. It is the complete opposite! GHC managed to make my simple and naive Haskell implementation run within 6x of a much less clear C++ version.

December 18, 2011 11:22 am

I’m not talking about your test as if it were criticism. I have nothing against criticising something that is slow. 🙂

1. String vs ByteString is much simpler than you make it appear to be. s/String/B.ByteString and 90% of code uses faster version.
2. You call this code “default case” or high-level and so on but in reality it’s just sloppy code that appears to be simple. It allocates a lot and is at best really naive. I’m almost sure that it’s possible to rewrite it to work faster and be comparably simple (which I’m trying to do, I’ll get back when I have results).

December 18, 2011 12:32 pm

Plotting ratios of time & memory use would be more useful than the raw numbers. Or maybe plot both!

December 18, 2011 12:54 pm

Good idea, Brad! I’m working on a followup and will definitely do this.

December 18, 2011 1:54 pm

Your Haskell algorithm is not the same as your C algorithm.

First, Haskell’s nub is O(n²) while your C++’s “nub” is only O(n log² n), thanks to sort. You can also “nub” once instead of three times.

Second, a lot of stuff is working on String, which is really inefficient compared to C’s mutable strings. You can remove most of the String operations by using a new ADT for storing pairs.

Third, HashTable is not the pure functional way to go.

With these changes, I’m able to build Haskell code which runs as fast as the C++ when using Map, and 10% slower with a HashTable. I don’t even use ByteString:

[rapha@desktop Collocations-Benchmark]$ time cat data/input.txt | ./Colocations2-hash > out-hash.txt

real 0m10.627s
user 0m10.513s
sys 0m0.133s
[rapha@desktop Collocations-Benchmark]$ time cat data/input.txt | ./Colocations2-map > out-map.txt

real 0m8.684s
user 0m8.536s
sys 0m0.170s
[rapha@desktop Collocations-Benchmark]$ time cat data/input.txt | ./colocations-cpp > out-cpp.txt

real 0m9.069s
user 0m5.553s
sys 0m3.210s

Haskell is also able to use less memory, because of laziness. Map uses 55 MiB and HashTable uses 65 MiB. You will be able to get even better C++ performance by avoiding string and vector operations as well, but GHC is still an amazing piece of software.

Source code here: https://github.com/RaphaelJ/Collocations-Benchmark/blob/master/Colocations2.hs

December 18, 2011 3:31 pm

Raphael, I ran your version and it’s not returning the correct counts (you can verify this by running any of the old versions and comparing the outputs). Good suggestions though 🙂

December 18, 2011 3:43 pm

Sure,
I’ve changed a little thing.

You don’t have to look at the line before, because you will list the same pair twice.

Line 54 in my sources.

December 18, 2011 3:56 pm

> Lies, damned lies, and benchmarks.

“After all, facts are facts, and although we may quote one to another with a chuckle the words of the Wise Statesman, ‘Lies–damned lies–and statistics,’ still there are some easy figures the simplest must understand, and the astutest cannot wriggle out of.”

Leonard Henry Courtney, 1895

December 18, 2011 8:23 pm

Isaac, I haven’t heard that one before, thanks! It was not my intention to imply that statistics/benchmarks are useless, but rather that there’s always a fair deal of controversy when one brings them up 🙂

December 18, 2011 8:30 pm

It took me quite awhile before I realized that using Haskell’s String type is pretty much an anti-pattern. It’s unfortunate that this is the case, but there’s so much historical momentum that the situation is highly unlikely to change any time soon if ever. But the bottom line is that the sooner you abandon String as your default choice and switch to either a ByteString variant or Text, the better off you’ll be for writing real-world production code…benchmarks included. 🙂

December 19, 2011 4:45 pm

So far the comments have been about ways of improving the performance under GHC by using other data types. This is not the purpose of the blog post; they seem to have overlooked the statement: “[The benchmarks game] does not reflect the performance of a typical, idiomatic program in each language. This ‘typical performance’ is what I attempted to measure in this article for the particular use case of counting collocations.”

Having said that, take a look at the n-body Fortran program in the language shootout. It is both the fastest and one of the nicest-looking (that is, most idiomatic) programs in that particular benchmark.

December 19, 2011 8:40 pm


Today, OCaml seems to be the popular ML to learn, but there is at least one convincing argument in SML’s favor: MLton. MLton really delivers on the thesis that functional languages offer the best opportunities for optimization. As a whole-program optimizing compiler, I’ve yet to see another compiler match its performance. I once created OpenGL bindings for MLton to toy around with 3D graphics, and the resulting program ran faster than the C-based model I had used as a reference, with just 10% of the code.

November 28, 2013 4:44 am
