An O(ND) Difference Algorithm for C#

This article is about comparing text files and the proven, probably best-known algorithm for identifying the differences between them. The source code in the download implements a small class with a simple-to-use API that does just this job.
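As a first impression of the API, here is a minimal usage sketch. The class name Diff, the static DiffText method, and the Item fields (StartA, StartB, deletedA, insertedB) are written down here for illustration only; check the download for the exact names and parameter order.

    using System;

    class Demo
    {
        static void Main()
        {
            string textA = "a\nb\nc\nd";
            string textB = "a\nc\nd\ne";

            // The three booleans stand for the preprocessing options discussed
            // later in the article: trim space, ignore space, ignore case
            // (assumed order, may differ in the download).
            Diff.Item[] changes = Diff.DiffText(textA, textB, false, false, false);

            foreach (Diff.Item change in changes)
            {
                Console.WriteLine("delete {0} line(s) at A:{1}, insert {2} line(s) at B:{3}",
                    change.deletedA, change.StartA, change.insertedB, change.StartB);
            }
        }
    }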
You should have it in your bag of algorithms. Besides the class that implements the algorithm there is also a sample web application that compares two files and generates HTML output as a combined, colored document. The algorithm was first published 20 years ago as 'An O(ND) Difference Algorithm and its Variations' by Eugene Myers, Algorithmica, Vol. 1, No. 2, 1986, p. 251.
You can find a copy of it online. In that paper you can find an abstract, recursive definition of the algorithm in pseudo-code that needs to be transferred to an existing programming language. There are many C, Java, and Lisp implementations of this algorithm publicly available on the internet.
Before I wrote the C# version I discovered that almost all these implementations seem to come from the same source (GNU diffutils), which is only available under the (copyleft) GNU General Public License and therefore cannot be reused as source code in a commercial or redistributable application without being bound by the GPL. There are very old C implementations that use other (worse) heuristic algorithms. Microsoft also published the source code of a diff tool (WinDiff) that uses some tree structures. A direct port from a C source to C# is also not easy, because the typical C solutions contain a lot of pointer arithmetic and I wanted a managed solution. I tried a lot of sources but in the end found no usable solution written for the .NET platform. These are the reasons why I implemented the originally published algorithm from scratch and made it available without the GPL limitations. The history of this implementation goes back to 2002, when I published a Visual Studio add-in that can also compare files.
I have found no more bugs in the last three years, so I think the code is stable. I did not need a high-performance diff tool myself, but I will do some performance tweaking when needed, so please let me know. I also dropped some hints on that topic in the source code.
How it works (briefly)

You can find an online working version.

• Comparing two huge text files character by character is not easy to implement and tends to be slow. Comparing numbers is much easier, so the first step is to compute a unique number for every text line: identical lines get identical numbers.
• Before these numbers are computed there are some options that are useful for certain kinds of text: stripping whitespace characters and comparing case-insensitively.
• The core algorithm itself compares two arrays of numbers; the preparation is done in the private DiffCodes method using a Hashtable (see the first sketch after this list).
• The methods DiffText and DiffInts are the public entry points for comparing two texts or two arrays of integers respectively.
• The core of the algorithm is built from two methods: LCS, the divide-and-conquer implementation of the longest-common-subsequence algorithm, and SMS, which finds the Shortest Middle Snake.
• To get usable performance I made some changes to the original algorithm. The original algorithm is described as a recursive procedure that compares zero-indexed sequences and passes parts of these sequences as parameters. Extracting sub-arrays and rejoining them is very CPU and memory intensive, so instead of copying the sequences around all the time, the full arrays are passed together with their lower and upper bounds (see the second sketch after this list).
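The following sketch shows the idea of the line-to-number step. The method name DiffCodes and the use of a Hashtable come from the description above; the concrete option handling is simplified for illustration and does not claim to match the download line by line.

    using System.Collections;

    // Turns every line of the text into a number; identical lines get
    // identical numbers because the same Hashtable is used for both files.
    static int[] DiffCodes(string text, Hashtable codes, bool trimSpace, bool ignoreCase)
    {
        string[] lines = text.Split('\n');
        int[] result = new int[lines.Length];

        for (int i = 0; i < lines.Length; i++)
        {
            string line = lines[i];
            if (trimSpace) line = line.Trim();              // option: strip spaces
            if (ignoreCase) line = line.ToLowerInvariant(); // option: ignore case

            object code = codes[line];
            if (code == null)
            {
                code = codes.Count + 1;   // next unused number
                codes[line] = code;
            }
            result[i] = (int)code;
        }
        return result;
    }

The same Hashtable instance must be passed for both files, otherwise equal lines would not get equal numbers.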
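To illustrate the bounds-passing idea, here is a structural sketch of the recursive LCS method. The names and the exact division of work differ from the real source; in particular, the SMS call is only indicated in a comment, not implemented here.

    // Structural sketch only: the whole arrays dataA/dataB are passed on every
    // recursion step together with the bounds lowerA..upperA and lowerB..upperB,
    // instead of copying sub-arrays around.
    static void LCS(int[] dataA, int lowerA, int upperA, bool[] modifiedA,
                    int[] dataB, int lowerB, int upperB, bool[] modifiedB)
    {
        // skip equal lines at the start and at the end of both ranges
        while (lowerA < upperA && lowerB < upperB && dataA[lowerA] == dataB[lowerB])
        { lowerA++; lowerB++; }
        while (lowerA < upperA && lowerB < upperB && dataA[upperA - 1] == dataB[upperB - 1])
        { upperA--; upperB--; }

        if (lowerA == upperA)
        {
            // only insertions remain on the B side
            while (lowerB < upperB) modifiedB[lowerB++] = true;
        }
        else if (lowerB == upperB)
        {
            // only deletions remain on the A side
            while (lowerA < upperA) modifiedA[lowerA++] = true;
        }
        else
        {
            // find the Shortest Middle Snake (a split point x/y) and recurse on
            // both halves, again passing only bounds, never copies:
            //   Snake snake = SMS(dataA, lowerA, upperA, dataB, lowerB, upperB);
            //   LCS(dataA, lowerA, snake.x, modifiedA, dataB, lowerB, snake.y, modifiedB);
            //   LCS(dataA, snake.x, upperA, modifiedA, dataB, snake.y, upperB, modifiedB);
        }
    }

The modifiedA/modifiedB flags mark deleted and inserted lines; turning these flags into the final list of change items is a simple linear pass afterwards.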