2020-11-13: Text Classification with jRuby and Weka

jRuby is an ideal language for working with Java libraries. Today I wrote a script to do some simple text classification using Weka.

For data, I use the BBC dataset from Kaggle, selecting two classes: entertainment and tech.

So I have a parent folder, "bbc/", with two child folders, "entertainment" and "tech". Each file in the child folders is a text document, and will form an instance for the classifier to work on. The classes each have around 400 documents, and the total word count is 330,000.

Overall process

The overall process can be divided into several steps. (Here, I keep things relatively simple, to investigate the API.)

First, the text must be converted into words only, so all numbers and punctuation are removed, and all letters are converted to lowercase. So "Address: 3 High Street, London." becomes the four words "address high street london".
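As a quick sketch of this step in plain Ruby (the actual script does the case-folding inside a Weka filter; squeeze and strip here just tidy the leftover spaces for display):

text = "Address: 3 High Street, London."
text = text.gsub(/[^a-zA-Z]/, " ").downcase # strip digits/punctuation, fold case
puts text.squeeze(" ").strip                # => "address high street london"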

Second, it is useful to simplify the words themselves. Common words (like "the", "a") are removed, using a stop-list. Also, stemming is applied so that different words like "walks" and "walking" are reduced to their stem "walk"; stemming increases the number of words in different documents which match.
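A small check of these word-level tools (a sketch, not part of the script; it assumes weka.jar is loaded and the classes imported as in the script below, and uses the stem and isStopword methods from Weka's stemmer and stopwords interfaces):

stemmer = LovinsStemmer.new
puts stemmer.stem("walking")     # reduces the word to its stem
stopwords = Rainbow.new
puts stopwords.isStopword("the") # true for common words on the stop-list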

Third, the text files are converted into a representation where each text file is an instance made from the values for a number of attributes, and a class label. The class label has two values, one for "entertainment" and one for "tech", depending on which folder the text file is in. Each attribute represents one word, and has values 1 or 0 (whether the word for that attribute is present in this instance or not).
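As a toy illustration in plain Ruby (not part of the script), with a vocabulary of four attribute words, one document becomes a 0/1 vector plus its class label:

vocab     = %w[address high street walk] # one attribute per word
doc_words = %w[walk high street]         # words present in one document
vector = vocab.map { |w| doc_words.include?(w) ? 1 : 0 }
# => [0, 1, 1, 1], and the class label is "entertainment" or "tech"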

Fourth, the number of attributes is usually very large at this stage, because natural language uses a lot of words. An attribute selection technique reduces the number of attributes to a manageable size.

Fifth, build some classification models. Three algorithms are used: Naive Bayes, Decision Trees and Support Vector Machines. For each of these algorithms, 10-fold cross validation is used to derive an overall accuracy.

Script

jRuby works well with Java libraries. Not only is it easy to import a library, but Java method names are also mapped to Ruby conventions. So a setter like filter.setOutputWordCounts(false) can be written as filter.output_word_counts = false.
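For example, both styles below work in jRuby (a sketch, assuming weka.jar is on the load path):

require "java"
require_relative "weka.jar"

filter = Java::WekaFiltersUnsupervisedAttribute::StringToWordVector.new
filter.setOutputWordCounts(false) # Java-style setter call
filter.output_word_counts = false # equivalent Ruby-style assignment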

The script follows. It is actually one file, but broken up with some explanations.

At the start of the script, load in the Weka library (I am using version 3.8.4) and import the required classes, so each class can be referred to without an import path.

# Aim to build a simple text classifier on BBC dataset
#
require "java"
require_relative "weka.jar"

java_import [
  "weka.attributeSelection.CorrelationAttributeEval",
  "weka.attributeSelection.Ranker",
  "weka.classifiers.bayes.NaiveBayes",
  "weka.classifiers.functions.SMO",
  "weka.classifiers.trees.J48",
  "weka.core.converters.TextDirectoryLoader",
  "weka.core.stopwords.Rainbow",
  "weka.core.tokenizers.WordTokenizer",
  "weka.core.stemmers.LovinsStemmer",
  "weka.core.converters.ArffSaver",
  "weka.filters.supervised.attribute.AttributeSelection",
  "weka.filters.unsupervised.attribute.StringToWordVector",
]

The following method was created because our preprocessing applies two of Weka's filters. Weka requires us to input each instance in turn to the filter, tell the filter the current batch has finished, and then retrieve the filtered instances. Notice that the output may have a different structure (number and type of attributes) from the input instances.

def apply_filter(instances, filter)
  filter.setInputFormat(instances)
  # pass each instance into the filter in turn
  instances.each do |instance|
    filter.input(instance)
  end
  filter.batch_finished

  # collect the filtered instances into a dataset with the new structure
  result = filter.output_format
  loop do
    instance = filter.output
    break if instance.nil?
    result.add(instance)
  end

  result
end

The preprocess method covers the first four(!) steps described above.

Weka provides TextDirectoryLoader to load the text documents from the two folders. This process leaves each instance with two attributes: one is the text of the document, and the second is its class label (the name of the child folder).

Step 1 is done using a Ruby regular expression, to replace all non-alphabetic characters with spaces.

Steps 2-3 are done using a StringToWordVector filter. In this filter, I set the stemmer and stopwords handlers, tell it to convert the text to lower case and tokenise the string as words (rather than character sequences). Setting output_word_counts to false means the values will be 1 or 0, not actual word counts.

Step 4 is achieved using a second filter, AttributeSelection, which combines the CorrelationAttributeEval evaluator with a Ranker search to pick the 300 most predictive attributes.

def preprocess(text_dir)
  loader = TextDirectoryLoader.new
  loader.source = java::io::File.new(text_dir)
  instances = loader.data_set

  # remove numbers/punctuation - step 1
  instances.each do |instance|
    text = instance.string_value(0) # the text is in the first attribute
    text.gsub!(/[^a-zA-Z]/, ' ')
    instance.set_value(0, text)
  end

  # turn into vector of words, applying filters - steps 2 & 3
  filter = StringToWordVector.new
  filter.lower_case_tokens = true
  filter.output_word_counts = false
  filter.stemmer = LovinsStemmer.new
  filter.stopwords_handler = Rainbow.new
  filter.tokenizer = WordTokenizer.new
  # -- apply the filter
  instances = apply_filter(instances, filter)
  # identify the class label (StringToWordVector outputs the non-text attributes first)
  instances.class_index = 0

  # reduce number of attributes to 300 - step 4
  selector = AttributeSelection.new
  selector.evaluator = CorrelationAttributeEval.new
  ranker = Ranker.new
  ranker.num_to_select = 300
  selector.search = ranker
  # -- apply the filter
  instances = apply_filter(instances, selector)

  instances
end
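To sanity-check step 4, the surviving attribute names can be listed after preprocessing (a hypothetical addition, not part of the script):

data = preprocess("bbc/")
data.num_attributes.times do |i|
  puts data.attribute(i).name # the selected words, plus the class attribute
end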

Step 5 is the task of the evaluate_classifier method, which tests a given classification algorithm. Weka provides methods on instances to access the train/test sets for k-fold cross-validation, so we use those to build and evaluate a classifier for each fold. Notice the use of Ruby's const_get to obtain the classifier's class from its string name. Most of this method collates the confusion-matrix counts and reports the sensitivity, specificity and their geometric mean:

def evaluate_classifier(classifier, instances, k=10)
  true_pos = 0
  true_neg = 0
  false_pos = 0
  false_neg = 0

  k.times do |i|
    model = Object.const_get(classifier).new
    train = instances.train_cv(k, i)
    test = instances.test_cv(k, i)
    model.build_classifier(train)

    test.each do |instance|
      result = model.classify_instance(instance)
      if instance.class_value > 0.5 # positive group
        if (instance.class_value-result).abs < 0.5
          true_pos += 1
        else
          false_neg += 1
        end
      else # negative group
        if (instance.class_value-result).abs < 0.5
          true_neg += 1
        else
          false_pos += 1
        end
      end
    end
  end
  true_pos /= k.to_f
  true_neg /= k.to_f
  false_pos /= k.to_f
  false_neg /= k.to_f
  sensitivity = true_pos/(true_pos+false_neg) # true-positive rate
  specificity = true_neg/(true_neg+false_pos) # true-negative rate
  geometric_mean = Math.sqrt(sensitivity*specificity)

  puts "Classifier: #{classifier} "
  puts " -- Sensitivity    #{sensitivity}"
  puts " -- Specificity    #{specificity}"
  puts " -- Geometric mean #{geometric_mean}"
  puts
end

The final step is to create some actual data and build some classification models. Notice how the Weka class names are passed to evaluate_classifier, where they are used to create the right classifier.

def run_expt
  data = preprocess('bbc/')

  evaluate_classifier('J48', data)
  evaluate_classifier('NaiveBayes', data)
  evaluate_classifier('SMO', data)
end

run_expt

On my system, the script runs through in about 20 seconds. The output is:

Classifier: J48 
 -- Sensitivity    0.9201995012468828
 -- Specificity    0.9792746113989638
 -- Geometric mean 0.9492776248248251

Classifier: NaiveBayes 
 -- Sensitivity    0.9825436408977556
 -- Specificity    0.9715025906735751
 -- Geometric mean 0.9770075192044412

Classifier: SMO 
 -- Sensitivity    0.9750623441396509
 -- Specificity    0.9896373056994819
 -- Geometric mean 0.9823227937614932

Page from Peter's Scrapbook, output from a VimWiki on 2024-01-29.