2024-02-11: Text Classification with Fantom and Weka

As I did previously with JRuby and Kawa Scheme, here is a Fantom script to do some simple text classification using Weka. To keep this note self-contained, I have duplicated the text, changing only the code.

For data, I use the BBC dataset from Kaggle, selecting two classes: entertainment and tech.

So I have a parent folder, "bbc/", with two child folders, "entertainment" and "tech". Each file in the child folders is a text document, and will form an instance for the classifier to work on. The classes each have around 400 documents, and the total word count is 330,000.

To install Weka, download the latest version (I used 3.8.6), and place the file weka.jar in the "lib/java/ext/" folder of your Fantom installation.
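
To check that weka.jar is visible to Fantom's Java FFI, a tiny script like the following should print the version. (This is a minimal sketch: weka.core::Version and its VERSION field are part of Weka, but the class name CheckWeka is my own.)

  using [java]weka.core::Version

  class CheckWeka
  {
    static Void main()
    {
      echo("Weka version: ${Version.VERSION}")    // e.g. 3.8.6
    }
  }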

Overall process

The overall process can be divided into several steps. (Here, I keep things relatively simple, to investigate the API.)

First, the text must be converted into words only, so all numbers and punctuation are removed, and all letters are converted to lowercase. So "Address: 3 High Street, London." becomes the four words "address high street london".
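
For example (a standalone sketch of step 1, using the same regular expression as the script below):

  text := "Address: 3 High Street, London."
  text = Regex("[^a-zA-Z]").matcher(text).replaceAll(" ")
  echo(text.lower.split.join(" "))    // prints: address high street london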

Second, it is useful to simplify the words themselves. Common words (like "the", "a") are removed, using a stop-list. Also, stemming is applied so that different words like "walks" and "walking" are reduced to their stem "walk"; stemming increases the number of words in different documents which match.
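
Weka provides classes for both. As a quick illustration (a sketch using the LovinsStemmer and Rainbow classes imported in the script below; the exact stems depend on the stemmer chosen):

  stemmer := LovinsStemmer()
  echo(stemmer.stem("walking"))       // a stem such as "walk"
  echo(Rainbow().isStopword("the"))   // true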

Third, the text files are converted into a representation where each text file is an instance made from the values for a number of attributes, and a class label. The class label has two values, one for "entertainment" and one for "tech", depending on which folder the text file is in. Each attribute represents one word, and has values 1 or 0 (whether the word for that attribute is present in this instance or not).
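
To make this representation concrete, here is a hypothetical sketch in plain Fantom (the vocabulary is invented for illustration; in the real script, Weka builds the attributes itself):

  vocab := ["address", "film", "high", "street", "walk"]
  words := "film of high street life".split
  vector := vocab.map |Str w->Int| { words.contains(w) ? 1 : 0 }
  echo(vector)    // [0, 1, 1, 1, 0]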

Fourth, the number of attributes is usually very large at this stage, because natural language uses a lot of words. An attribute selection technique reduces the number of attributes to a manageable size.

Fifth, build some classification models. Three algorithms are used: Naive Bayes, Decision Trees and Support Vector Machines. For each algorithm, 10-fold cross-validation is used to derive an overall accuracy: with roughly 800 documents, each fold trains a model on about 720 documents and tests it on the remaining 80, so every document is tested exactly once.

Script

The script follows. It is actually one file, but broken up with some explanations.

At the start of the script, import the required classes.

// Aim to build a simple text classifier on BBC dataset

using pclStatsBox                                                 // <1>

using [java]java.io::File as JFile                                // <2>
using [java]weka.attributeSelection::CorrelationAttributeEval
using [java]weka.attributeSelection::Ranker
using [java]weka.classifiers::Classifier
using [java]weka.classifiers.bayes::NaiveBayes
using [java]weka.classifiers.functions::SMO
using [java]weka.classifiers.trees::J48
using [java]weka.core::Instances
using [java]weka.core.converters::TextDirectoryLoader
using [java]weka.core.stopwords::Rainbow
using [java]weka.core.tokenizers::WordTokenizer
using [java]weka.core.stemmers::LovinsStemmer
using [java]weka.filters::Filter
using [java]weka.filters.supervised.attribute::AttributeSelection
using [java]weka.filters.unsupervised.attribute::StringToWordVector
  1. See pclStatsBox - my own pod providing a Confusion Matrix.
  2. Give Java's File a new name, to avoid confusion with Fantom's File.

The following method was created because the preprocessing steps apply two of Weka's filters. Weka requires us to input each instance in turn to the filter, tell the filter the current batch has finished, and then retrieve the filtered instances. Notice that the output may have a different structure (number and type of attributes) from the input instances.

  static Instances applyFilter(Instances instances, Filter filter)
  {
    filter.setInputFormat(instances)
    instances.size.times |Int i|                      // <1>
    {
      filter.input(instances[i])                      // <2>
    }
    filter.batchFinished

    result := filter.getOutputFormat
    instance := filter.output
    while (instance != null)
    {
      result.add(instance)
      instance = filter.output
    }

    return result
  }
  1. Use Int.times to loop over the indices of the Java Instances.
  2. The Instances#get method is mapped to [] by Fantom.

The preprocess method covers the first four(!) steps described above.

Weka provides TextDirectoryLoader to load the text documents from the two folders. This process leaves each instance with two attributes: one is the text of the document, and the second is its class label (the name of the child folder).
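
As a quick check, the raw dataset can be inspected before any filtering (a sketch; numInstances, numAttributes and attribute are standard methods on Weka's Instances):

  loader := TextDirectoryLoader()
  loader.setSource(JFile("bbc/"))
  raw := loader.getDataSet
  echo("${raw.numInstances} documents, ${raw.numAttributes} attributes")
  echo(raw.attribute(1).name)    // the folder-derived class attribute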

Step 1 is done using a regular expression, to replace all non-alphabetic characters with spaces.

Steps 2-3 are done using a StringToWordVector filter. In this filter, I set the stemmer and stopwords handlers, and tell it to convert the text to lower case and tokenise the string as words (rather than character sequences). Setting outputWordCounts to false means the values will be 1 or 0, not actual word counts.

Step 4 is achieved using a second filter, AttributeSelection, which combines a CorrelationAttributeEval evaluator with a Ranker search to pick the 300 most predictive attributes.

  static Instances preprocess(Str textDirectory)
  {
    loader := TextDirectoryLoader()
    loader.setSource(JFile(textDirectory))
    instances := loader.getDataSet

    // remove numbers/punctuation - step 1
    instances.size.times |Int i|
    {
      text := instances[i].stringValue(0)
      text = Regex("[^a-zA-Z]").matcher(text).replaceAll(" ")     // <1>
      instances[i].setValue(0, text)
    }

    // turn into vector of words, applying filters - steps 2 & 3
    filter := StringToWordVector()                                // <2>
    filter.setLowerCaseTokens(true)
    filter.setOutputWordCounts(false)
    filter.setStemmer(LovinsStemmer())
    filter.setStopwordsHandler(Rainbow())
    filter.setTokenizer(WordTokenizer())
    // -- apply the filter
    instances = applyFilter(instances, filter)
    // identify the class label
    instances.setClassIndex(0)

    // reduce number of attributes to 300 - step 4
    selector := AttributeSelection()
    selector.setEvaluator(CorrelationAttributeEval())
    ranker := Ranker()
    ranker.setNumToSelect(300)
    selector.setSearch(ranker)
    // -- apply the filter
    instances = applyFilter(instances, selector)

    return instances
  }
  1. Use a Regex to replace non-letters with space.
  2. The StringToWordVector filter offers many options: some require providing an instance (like the stemmer), while others take a simple value (like whether to use lower-case tokens).

Step 5 is the task of the evaluateClassifier method, used to test a given classification algorithm. Weka provides methods on Instances to access the train/test sets for k-fold cross-validation, so we use those to build and evaluate a classifier for each fold, collating the results in a ConfusionMatrix. Notice the use of Fantom's type literals to provide a convenient reference to the classifier type.

  static Void evaluateClassifier(Type classifier, Instances instances, Int k := 10)
  {
    predictionAsLabel := |Float v->Str|                       // <1>
    {
      if (v > 0.5f)
        return "positive"
      else
        return "negative"
    }

    cm := ConfusionMatrix()                                   // <2>
    Classifier model := classifier.make                       // <3>

    k.times |Int f|
    {
      train := instances.trainCV(k, f)                        // <4>
      test := instances.testCV(k, f)
      model.buildClassifier(train)
      test.size.times |Int i|                                 // <5>
      {
        cm.addCount(
          predictionAsLabel(test[i].classValue),              // actual class
          predictionAsLabel(model.classifyInstance(test[i]))  // predicted class
        )
      }
    }

    echo("Classifier: ${classifier.name}")
    echo(" -- Precision:      ${cm.precision}")               // <6>
    echo(" -- Recall:         ${cm.recall}")
    echo(" -- Geometric Mean: ${cm.geometricMean}")
    echo()
    echo(cm)
    echo()
  }
  1. Simple function to convert a Float class value into a label.
  2. Create a ConfusionMatrix to collate results.
  3. Construct an instance of the classifier type.
  4. Use Weka's trainCV and testCV to create train/test splits for each fold of the cross-validation process.
  5. Run through every test instance in turn, recording its actual and predicted class in the confusion matrix.
  6. Pull out aggregate results from the confusion matrix.

The main method simply loads the dataset through the preprocess method, and then evaluates each classifier in turn. Notice how a type literal (such as J48#) identifies each classifier.

  static Void main(Str[] args)
  {
    data := preprocess("bbc/")

    evaluateClassifier(J48#, data)
    evaluateClassifier(NaiveBayes#, data)
    evaluateClassifier(SMO#, data)
  }

On my system, the script runs through in about 20 seconds. The output is:

>fan TextClassifier.fan
Classifier: J48
 -- Precision:      0.952020202020202
 -- Recall:         0.940149625935162
 -- Geometric Mean: 0.9454484813442652

Observed          |
positive negative | Predicted
------------------+----------
     377       24 | positive
      19      367 | negative


Classifier: NaiveBayes
 -- Precision:      0.9704433497536946
 -- Recall:         0.9825436408977556
 -- Geometric Mean: 0.9757039729011721

Observed          |
positive negative | Predicted
------------------+----------
     394        7 | positive
      12      374 | negative


Classifier: SMO
 -- Precision:      0.992462311557789
 -- Recall:         0.9850374064837906
 -- Geometric Mean: 0.9886261555033407

Observed          |
positive negative | Predicted
------------------+----------
     395        6 | positive
       3      383 | negative

Page from Peter's Scrapbook, output from a VimWiki on 2024-02-11.