$include_dir="/home/hyper-archives/boost/include"; include("$include_dir/msg-header.inc") ?>
Subject: Re: [boost] [algorithm] Review Request: edit_distance
From: Erik Erlandson (eje_at_[hidden])
Date: 2014-01-31 19:53:59
----- Original Message -----
>
> - It would be great to see a benchmark against GNU diff!
So I cooked up an implementation of unix diff:
https://github.com/erikerlandson/algorithm/blob/edit_distance/sequence/example/edit_distance_diff_example.cpp
It produces standard default unix diff output:
http://en.wikipedia.org/wiki/Diff#Usage
That is to say, edit_distance_diff_example will reproduce the output of diff, at least if you just give both programs two files for arguments:
$ edit_distance_diff_example foo.txt bar.txt > ed.out
$ diff foo.txt bar.txt > diff.out
$ diff ed.out diff.out
$
I generated a few benchmarking data-sets, by editing /usr/share/dict/words. On my machine (F18), the 'words' file has 479,828 lines (one word per line) and is 4,953,680 bytes. So, a decent size of file in both line length and bytes. I did a couple different experiments, where I compared against a variation that differed by around 150 lines, and then another variation that differed by around 1200 lines. I also varied input size, by just duplicating the contents, so the result was twice the length and size.
Across these variations of input size and file difference that I tested, edit_distance_diff_example ran consistently 40% to 60% faster than unix diff. So, roughly speaking it's performing about twice as fast.
And less than 150 lines of code!