author

Peter Lane <p.c.lane@herts.ac.uk>

date

2019-22

Introduction

Ferret is a copy-detection tool that locates duplicate text or code across multiple text documents or source files. It is designed to detect copying (collusion) within a given set of files.

Ferret is useful for:

  • teachers, looking for

    • collusion or plagiarism in student work

    • originality (in the form of unique content)

  • software engineers

    • locating duplicate code

    • studying the evolution of code over time

  • document analysis, comparing documents within a group or over time

Features:

  • compares text documents containing natural language or computer language

  • many major programming languages are recognised and tokenised appropriately

  • outputs for analysis include:

    • pairwise comparisons ordered by similarity, including trigram counts

    • counts of unique trigrams within each file / group

    • reverse index from trigrams to list of documents they are found in

    • XML detailed comparison of a pair of documents

Ferret computes a similarity measure based on the trigrams found within each of the two documents under comparison; this measure is a number from 0 (no copying) to 1 (everything has been copied). It should not be taken as an absolute measure of the amount of copying. Instead, it indicates how much the current pair has in common relative to the other pairs in the group. Pairs which appear at the top of the table of similarity comparisons should be examined for possible copying, but the measure itself does not support any reliable conclusion. For example, it is wrong to say that all measures above an arbitrary value, such as 0.1, are indicative of copying.
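As a rough illustration of how a trigram-based measure behaves, the sketch below builds the set of word trigrams for two short texts and divides the number of shared trigrams by the number of distinct trigrams found in either text. The tokenisation (lower-casing and splitting on non-word characters) and the function names are assumptions made for this example only; Ferret's own tokeniser is not described here and will differ, particularly for source code.

```python
import re

def trigrams(text):
    """Return the set of word trigrams in text (simplified tokenisation)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def similarity(a, b):
    """Shared trigrams divided by distinct trigrams in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

doc1 = trigrams("the quick brown fox jumps over the lazy dog")
doc2 = trigrams("the quick brown fox sleeps under the lazy dog")
score = similarity(doc1, doc2)
print(len(doc1 & doc2), len(doc1 | doc2), round(score, 3))
```

Changing two words in the middle of a nine-word sentence removes every trigram that touches them, which is why the score drops quickly even when most words are shared.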

Publications on Ferret include:

  • plagiarism detection: [1,2,3,4,5]

  • extension to work on Chinese: [6]

  • analysis of computer code: [7]

  • using trigrams and web searches: [8,9]

  • analysis of Ferret XML reports: [10,11,12]

Ferret was created at the University of Hertfordshire by the one-time Plagiarism Detection Group and developed by Ruth Barrett, Bob Dickerson, Caroline Lyon and James Malcolm. The program was extended to work with the Chinese language by Jun-Peng Bao, and Pam Green has shown how to extract more precise information on duplicate material from Ferret’s output. This version of Ferret has been implemented by Peter Lane.

Download

Precompiled versions (for AMD64) and source code are available for download:

Alternatively, a Ruby wrapper around a C++ implementation of Ferret, suitable for scripting, is available as a gem.

Command-line Interface

Ferret can be controlled from the command line, by providing it with input options. The full range of options is shown by calling ferret --help or ferret -h.

> ferret --help
Usage: ferret [-ghlsvx] filename [filenames...]
 -g, --group       Use subdirectory names to group files
 -h, --help        Show help information
 -l, --list-trigrams
                   Output list of trigrams used
 -u, --unique-counts
                   Output counts of unique trigrams
 -v, --version     Version number
 -x, --xml-report  filename1 filename2 [outfile] : Create XML report

Notice that all switches have both a long and a short form. You can either write ferret --help or ferret -h.

Input documents

Ferret provides two ways to define the input documents. The first is to provide a single directory name: Ferret will recursively locate all files within the directory which it can recognise. The second way is to provide a list of filenames.

For example, to run Ferret on all the text documents in the current directory, you can either use the directory name: ferret . or list the filenames: ferret *.txt

Ferret uses the file extension to decide on the type of document / computer language, and only recognised file types are processed. The full list of recognised file types and extensions is:

  • Text documents (.adoc, .md, .txt)

  • Computer languages

    • C/C++ (.h, .c, .cpp)

    • C# (.cs)

    • Clojure (.clj)

    • Go (.go)

    • Groovy (.groovy)

    • Haskell (.hs, .lhs)

    • Java (.java)

    • Lisp (.lisp, .lsp)

    • Prolog (.pl)

    • Python (.py)

    • Racket (.rkt)

    • Ruby (.rb)

    • Rust (.rs)

    • Scheme (.scm, .sld, .sls, .sps, .ss)

    • Visual Basic (.vb)

    • XML/HTML (.xml, .html)
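The extension-based selection described above can be sketched as a simple lookup. The table below copies the extensions from the list; the dictionary layout, the function names and the case-sensitive matching are assumptions for illustration, not a description of Ferret's internals.

```python
from pathlib import Path

# Extensions copied from the list above; the layout is illustrative.
RECOGNISED = {
    "Text": (".adoc", ".md", ".txt"),
    "C/C++": (".h", ".c", ".cpp"),
    "C#": (".cs",),
    "Clojure": (".clj",),
    "Go": (".go",),
    "Groovy": (".groovy",),
    "Haskell": (".hs", ".lhs"),
    "Java": (".java",),
    "Lisp": (".lisp", ".lsp"),
    "Prolog": (".pl",),
    "Python": (".py",),
    "Racket": (".rkt",),
    "Ruby": (".rb",),
    "Rust": (".rs",),
    "Scheme": (".scm", ".sld", ".sls", ".sps", ".ss"),
    "Visual Basic": (".vb",),
    "XML/HTML": (".xml", ".html"),
}

# Invert the table so each extension maps directly to its document type.
EXTENSION_TYPE = {ext: kind for kind, exts in RECOGNISED.items() for ext in exts}

def file_type(filename):
    """Return the document type for filename, or None if unrecognised."""
    return EXTENSION_TYPE.get(Path(filename).suffix)

def recognised_files(directory):
    """Recursively yield files under directory that have a recognised extension."""
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and file_type(path) is not None:
            yield path

print(file_type("anne/file1.java"), file_type("essay.docx"))
```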

Grouped documents

Grouping is useful when processing files, such as submissions from a class of students, where the files belonging to each individual student do not need to be compared with each other.

For example, a programming assignment might lead to the following layout of students and files:

- anne
  - file1.java
  - file2.java
  - file3.java
- ben
  - file1.java
  - file2.java
- charles
  - file1.java
  - file2.java
  - file3.java

If Ferret is called with the --group option, then "anne/file1.java" will not be compared with "anne/file2.java" etc. The eight files give 28 possible pairs, of which 7 fall within a single student's folder, so this reduces the total number of pairwise comparisons displayed from 28 to 21.
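The counts for this layout can be checked with a few lines of Python. This is only an arithmetic check on the example above; the group_of helper is a hypothetical name and simply takes the first path component as the group.

```python
from itertools import combinations

files = [
    "anne/file1.java", "anne/file2.java", "anne/file3.java",
    "ben/file1.java", "ben/file2.java",
    "charles/file1.java", "charles/file2.java", "charles/file3.java",
]

def group_of(path):
    """The group is taken to be the top-level folder name."""
    return path.split("/")[0]

# Every unordered pair of files, then only those spanning two groups.
all_pairs = list(combinations(files, 2))
cross_group_pairs = [(a, b) for a, b in all_pairs if group_of(a) != group_of(b)]

print(len(all_pairs), len(cross_group_pairs))  # prints: 28 21
```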

The group option only works when providing a single directory name as the location of the files.

Output formats

There are four forms of output provided by Ferret:

  • a similarity table, listing pairs of documents in order from most to least similar. This is the default output form.

  • a list of files (or groups) with a count of the trigrams unique to each. This requires the --unique-counts option.

  • a list of all the trigrams used and information on which documents they are contained in. This requires the --list-trigrams option.

  • an XML file containing the detailed comparison of just two documents. This requires the --xml-report option, naming the two files to compare and an optional output filename.

Similarity table

By default, Ferret lists every pair of documents along with their similarity. The list is written to standard output, but can be redirected to form a CSV file readable by spreadsheets or other analysis programs. The output is in order of decreasing similarity. The six columns are:

  1. Name of file 1

  2. Name of file 2

  3. The number of trigrams the two files have in common

  4. The number of trigrams in file 1

  5. The number of trigrams in file 2

  6. The similarity measure

For example:

> ferret *.txt
text-3.txt,text-4.txt,65,128,273,0.193452
text-2.txt,text-3.txt,2,152,128,0.00719424
text-1.txt,text-3.txt,1,194,128,0.00311526
text-2.txt,text-4.txt,1,152,273,0.00235849
text-1.txt,text-2.txt,0,194,152,0
text-1.txt,text-4.txt,0,194,273,0

For instance, in the first row, text-3.txt has 128 trigrams and text-4.txt has 273 trigrams. Of these, 65 trigrams are in common, giving a similarity score of 0.193452.
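Redirected output can be read back with any CSV library. The sketch below parses a few of the sample rows with Python's csv module and recomputes the score from the three counts. The formula used (shared trigrams divided by the distinct trigrams across both files) is inferred from the sample figures, which it reproduces to the precision shown; the function names are hypothetical.

```python
import csv
import io

# Rows copied from the example output above.
SAMPLE = """\
text-3.txt,text-4.txt,65,128,273,0.193452
text-2.txt,text-3.txt,2,152,128,0.00719424
text-1.txt,text-2.txt,0,194,152,0
"""

def read_rows(stream):
    """Yield (file1, file2, common, count1, count2, similarity) tuples."""
    for file1, file2, common, count1, count2, sim in csv.reader(stream):
        yield file1, file2, int(common), int(count1), int(count2), float(sim)

def recomputed(common, count1, count2):
    """Shared trigrams divided by distinct trigrams across both files."""
    distinct = count1 + count2 - common
    return common / distinct if distinct else 0.0

rows = list(read_rows(io.StringIO(SAMPLE)))
for file1, file2, common, count1, count2, sim in rows:
    print(file1, file2, sim, round(recomputed(common, count1, count2), 6))
```

For the first row this gives 65 / (128 + 273 - 65) = 65 / 336, which rounds to the reported 0.193452.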

The number of comparisons, for N documents, is N(N-1)/2, so 30 documents create 435 pairs, and 100 documents 4950 pairs.

Unique trigram counts

The switch --unique-counts is used to output a list of each file along with a count of its unique trigrams: a unique trigram is one which appears only in the named file. If the group option is also used, the count is made per group and group names are output instead of individual files. The list is ordered in descending order of count.

The output is suitable for redirection to form a CSV file. The two columns are:

  1. the name of the file (or group)

  2. the count of unique trigrams in that file (or group)

For example, the following are some unique counts for two sets of text documents:

> ferret -u aca/*.txt dem/*.txt
dem/KBW.txt,60107
dem/KB7.txt,55400
dem/KD0.txt,46259
dem/KE2.txt,41310
dem/KD8.txt,40841
aca/J18.txt,35573
aca/HWV.txt,34137
aca/FT1.txt,32940
dem/KBD.txt,31384
aca/HXH.txt,30604
...

Using the group option outputs a total count of the unique trigrams for all files within each folder:

> ferret -g -u aca/*.txt dem/*.txt
aca,710111
dem,511163
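The unique counts described in this section can be sketched as follows. The input is a mapping from file name to that file's set of trigrams, with short placeholder strings standing in for real trigrams. For the grouped case the sketch pools all files in a folder before counting, so a trigram shared only between files of the same group still counts as unique to that group; this pooling is an assumption about how the per-group count is made, and the function names are hypothetical.

```python
from collections import Counter

def unique_counts(trigram_sets):
    """Return (name, count) pairs, where count is the number of trigrams
    found under that name and no other, in descending order of count."""
    owners = Counter()
    for trigrams in trigram_sets.values():
        owners.update(trigrams)  # how many names contain each trigram
    counts = {
        name: sum(1 for t in trigrams if owners[t] == 1)
        for name, trigrams in trigram_sets.items()
    }
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

def merge_groups(trigram_sets):
    """Pool per-file sets into per-group sets, keyed by the folder name."""
    groups = {}
    for name, trigrams in trigram_sets.items():
        groups.setdefault(name.split("/")[0], set()).update(trigrams)
    return groups

files = {
    "aca/a.txt": {"t1", "t2", "t3"},
    "aca/b.txt": {"t3", "t4"},
    "dem/c.txt": {"t4", "t5", "t6", "t7"},
}
print(unique_counts(files))
print(unique_counts(merge_groups(files)))
```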

Trigram list

The switch --list-trigrams is used to output a list of all the trigrams within the provided documents, along with the documents in which they occur. The list is ordered in descending frequency of occurrence.

The output is suitable for redirection to form a CSV file. The three columns are:

  1. the trigram

  2. the number of files the trigram appears in

  3. a list of the file numbers the trigram appears in

> ferret -l *.txt
a programming language,2,0 1
a machine particularly,2,0 2
a computer such,2,0 1
a programmer to,2,1 2
a structured mechanism,1,0
a computation as,1,0
a collection of,1,0
...

The file numbers are based on the order in which the files are encountered when walking through the provided directory or files.
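A reverse index of this shape can be built in a few lines. In the sketch below each document is a set of trigram strings and a file's number is its position in the input list; the alphabetical tie-break between trigrams with the same file count is an assumption added to make the result deterministic, and the function name is hypothetical.

```python
def reverse_index(trigram_sets):
    """Return (trigram, file_count, file_numbers) rows, most widely shared first.

    trigram_sets is a list of sets; a file's number is its position in the list.
    """
    index = {}
    for number, trigrams in enumerate(trigram_sets):
        for trigram in trigrams:
            index.setdefault(trigram, []).append(number)
    rows = [(trigram, len(numbers), numbers) for trigram, numbers in index.items()]
    # Descending file count; ties broken alphabetically (an assumption).
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

documents = [
    {"a programming language", "a computer such", "a collection of"},
    {"a programming language", "a computer such", "a programmer to"},
    {"a programmer to"},
]
for trigram, count, numbers in reverse_index(documents):
    print(f"{trigram},{count},{' '.join(map(str, numbers))}")
```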

Individual reports

The switch --xml-report instructs Ferret to provide an XML report of the comparison of two documents. This XML report can be used for further analysis, or viewed in a web browser using the provided style sheet (keep the output file and style sheet in the same directory, and view the output file).

For example, to get a comparison of text-1.txt and text-2.txt in XML format, calling the XML document comparison.xml, you would call Ferret with:

> ferret -x text-1.txt text-2.txt comparison.xml

Note: The XML output is mainly intended for viewing two documents with their similar content highlighted, using the provided XSL style sheet. However, this may fall foul of your browser’s "unique origin policy".

  • To view the output in Firefox: go to "about:config", find "privacy.file_unique_origin" and set it to "false".

  • You can then view the XML output with the style sheet. Enjoy!

  • You should re-enable this property (set it back to "true") before returning to normal web activity.

References

[1] P. C. R. Lane, C. M. Lyon, and J. A. Malcolm, "Demonstration of the Ferret Plagiarism Detector", in Proceedings of the Second International Plagiarism Conference, 2006.

[2] C. M. Lyon, J. A. Malcolm, and R. G. Dickerson, "Detecting short passages of similar text in large document collections", in Proceedings of Conference on Empirical Methods in Natural Language Processing, 2001.

[3] C. M. Lyon, R. Barrett, and J. A. Malcolm, "Experiments in plagiarism detection", School of Computer Science, University of Hertfordshire, 388, 2003.

[4] C. M. Lyon, R. Barrett, and J. A. Malcolm, "A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector", in JISC(UK) Conference on Plagiarism: Prevention, Practice and Policies Conference, 2004.

[5] C. M. Lyon, R. Barrett, and J. A. Malcolm, "Plagiarism is easy, but also easy to detect", Plagiary: Cross-disciplinary studies in plagiarism, fabrication and falsification, vol. 1, pp. 1-10, 2006.

[6] J. P. Bao, C. M. Lyon, and P. C. R. Lane, "Copy detection in Chinese documents using Ferret", Language Resources and Evaluation, vol. 40, pp. 357-65, 2006.

[7] A. W. Rainer, P. C. R. Lane, J. A. Malcolm, and S. Scholz, "Using n-grams to rapidly characterise the evolution of software code", in The Fourth International ERCIM Workshop on Software Evolution and Evolvability, 2008, pp. 42-52.

[8] J. A. Malcolm and P. C. R. Lane, "Efficient search for plagiarism on the web", in Proceedings of the International Conference on Technology, Communication and Education, 2008, pp. 206-11.

[9] J. A. Malcolm and P. C. R. Lane, "An approach to detecting article spinning", in Proceedings of the Third International Conference on Plagiarism, 2008.

[10] P. D. Green, P. C. R. Lane, A. W. Rainer, and S. Scholz, "Analysing Ferret XML reports to estimate the density of copied code", Technical Report 501, Science and Technology Research Institute, University of Hertfordshire, 2010.

[11] P. D. Green, P. C. R. Lane, A. W. Rainer, and S. Scholz, "Unscrambling code clones for one-to-one matching of duplicated code", Technical Report 502, Science and Technology Research Institute, University of Hertfordshire, 2010.

[12] P. D. Green, P. C. R. Lane, A. W. Rainer, S. Scholz, and S. J. Bennett, "Same difference: Detecting collusion by finding unusual shared elements", in Proceedings of the Fifth International Plagiarism Conference, 2012.