jRuby is a perfect language to use with Java libraries. Today I made a script to do some simple text classification using Weka.
For data, I use the BBC dataset from Kaggle, selecting two classes: entertainment and tech.
So I have a parent folder, "bbc/", with two child folders, "entertainment" and "tech". Each file in the child folders is a text document, and will form an instance for the classifier to work on. The classes each have around 400 documents, and the total word count is 330,000.
The overall process can be divided into several steps. (Here, I keep things relatively simple, to investigate the API.)
First, the text must be converted into words only, so all numbers and punctuation are removed, and all letters are converted to lowercase. For example, "Address: 3 High Street, London." becomes the four words "address high street london". (This step and the next are illustrated in the short sketch after this list of steps.)
Second, it is useful to simplify the words themselves. Common words (like "the", "a") are removed, using a stop-list. Also, stemming is applied so that different words like "walks" and "walking" are reduced to their stem "walk"; stemming increases the number of words in different documents which match.
Third, the text files are converted into a representation where each text file is an instance made from the values for a number of attributes, and a class label. The class label has two values, one for "entertainment" and one for "tech", depending on which folder the text file is in. Each attribute represents one word, and has values 1 or 0 (whether the word for that attribute is present in this instance or not).
Fourth, the number of attributes is usually very large at this stage, because natural language uses a lot of words. An attribute selection technique reduces the number of attributes to a manageable size.
Fifth, build some classification models. Three algorithms are used: Naive Bayes, Decision Trees and Support Vector Machines. For each of these algorithms, 10-fold cross-validation is used to derive an overall accuracy.
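To make steps 1 and 2 concrete before the script itself, here is a small sketch (the stemmer is Weka's LovinsStemmer, which the script imports below; the exact stems depend on the Lovins rules):

text = "Address: 3 High Street, London."

# step 1: keep letters only, then lowercase and split into words
words = text.gsub(/[^a-zA-Z]/, ' ').downcase.split
# => ["address", "high", "street", "london"]

# step 2: a stop-list would drop common words here; stemming then
# maps related word forms to a shared stem
stemmer = LovinsStemmer.new
stemmer.stem("walks")   # expect "walk"
stemmer.stem("walking") # expect "walk"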
jRuby works well with Java libraries. Not only is it easy to import a library, but Java names are transformed into Ruby formats. So a setter call like

filter.setOutputWordCounts(false)

can be written as

filter.output_word_counts = false

and so on.
The script follows. It is actually one file, but broken up with some explanations.
At the start of the script, load in the Weka library (I am using version 3.8.4) and import the required classes, so each class can be referred to without its package path.
# Aim to build a simple text classifier on BBC dataset
#

require "java"
require_relative "weka.jar"

java_import [
  "weka.attributeSelection.CorrelationAttributeEval",
  "weka.attributeSelection.Ranker",
  "weka.classifiers.bayes.NaiveBayes",
  "weka.classifiers.functions.SMO",
  "weka.classifiers.trees.J48",
  "weka.core.converters.TextDirectoryLoader",
  "weka.core.stopwords.Rainbow",
  "weka.core.tokenizers.WordTokenizer",
  "weka.core.stemmers.LovinsStemmer",
  "weka.core.converters.ArffSaver",
  "weka.filters.supervised.attribute.AttributeSelection",
  "weka.filters.unsupervised.attribute.StringToWordVector",
]
The following method was created because our preprocessing applies two of Weka's filters. Weka requires us to input each instance in turn to the filter, tell the filter the current batch has finished, and then retrieve the instances. Notice that the output result may have a different structure (number and type of attributes) to the input instances.
# Pass each instance through the given filter and collect the
# filtered instances; the result may have a different structure
# (number and type of attributes) to the input.
def apply_filter(instances, filter)
  filter.setInputFormat(instances)
  instances.each do |instance|
    filter.input(instance)
  end
  filter.batch_finished
  result = filter.output_format
  loop do
    instance = filter.output
    break if instance.nil?
    result.add(instance)
  end
  result
end
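As an aside, Weka's Filter class also offers a static useFilter helper that performs this same input/batch/output cycle in one call. A minimal sketch, assuming "weka.filters.Filter" is added to the java_import list:

# equivalent, using Weka's built-in helper
filter.setInputFormat(instances)
instances = Filter.use_filter(instances, filter)

I kept the explicit version above because it makes the batch protocol visible.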
The preprocess method covers the first four(!) steps described above.
Weka provides TextDirectoryLoader to load the text documents from the two folders. This process leaves each instance with two attributes: one is the text of the document, and the second is its class label (the name of the child folder).
Step 1 is done using a Ruby regular expression, to replace all non-alphabetic characters with spaces.
Steps 2-3 are done using a StringToWordVector filter. In this filter, I set the stemmer and stopwords handlers, tell it to convert the text to lower case, and tokenise the string as words (rather than character sequences). Setting output_word_counts to false means the values will be 1 or 0, not actual word counts.
Step 4 is achieved using a second filter, AttributeSelection, configured with a CorrelationAttributeEval evaluator and a Ranker search to pick the 300 most predictive attributes.
def preprocess(text_dir)
  loader = TextDirectoryLoader.new
  loader.source = java::io::File.new(text_dir)
  instances = loader.data_set

  # remove numbers/punctuation - step 1
  instances.each do |instance|
    text = instance.string_value(0) # the text is in the first attribute
    text.gsub!(/[^a-zA-Z]/, ' ')
    instance.set_value(0, text)
  end

  # turn into vector of words, applying filters - steps 2 & 3
  filter = StringToWordVector.new
  filter.lower_case_tokens = true
  filter.output_word_counts = false
  filter.stemmer = LovinsStemmer.new
  filter.stopwords_handler = Rainbow.new
  filter.tokenizer = WordTokenizer.new
  # -- apply the filter
  instances = apply_filter(instances, filter)
  # identify the class label
  instances.class_index = 0

  # reduce number of attributes to 300 - step 4
  selector = AttributeSelection.new
  selector.evaluator = CorrelationAttributeEval.new
  ranker = Ranker.new
  ranker.num_to_select = 300
  selector.search = ranker
  # -- apply the filter
  instances = apply_filter(instances, selector)

  instances
end
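Incidentally, ArffSaver is imported above but not otherwise used. If you want to inspect the preprocessed data, a sketch along these lines (the filename "bbc.arff" is my own choice) would write it out in Weka's ARFF format:

saver = ArffSaver.new
saver.instances = preprocess('bbc/')
saver.file = java::io::File.new('bbc.arff')
saver.write_batch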
Step 5 is the task of the evaluate_classifier method, used to test a given classification algorithm. Weka provides methods on instances to access train/test sets for k-fold cross-validation, so we use those to build and evaluate a classifier for each fold. Notice the use of Ruby's const_get to access the classifier's class from its string name. Most of this method collates and reports the required numbers for evaluation:
def evaluate_classifier(classifier, instances, k=10)
  true_pos = 0
  true_neg = 0
  false_pos = 0
  false_neg = 0

  k.times do |i|
    # build a fresh model on this fold's training set
    model = Object.const_get(classifier).new
    train = instances.train_cv(k, i)
    test = instances.test_cv(k, i)
    model.build_classifier(train)
    # tally the confusion-matrix counts on the test set
    test.each do |instance|
      result = model.classify_instance(instance)
      if instance.class_value > 0.5 # positive group
        if (instance.class_value - result).abs < 0.5
          true_pos += 1
        else
          false_neg += 1
        end
      else # negative group
        if (instance.class_value - result).abs < 0.5
          true_neg += 1
        else
          false_pos += 1
        end
      end
    end
  end

  # average the counts over the k folds
  true_pos /= k.to_f
  true_neg /= k.to_f
  false_pos /= k.to_f
  false_neg /= k.to_f

  # NB: as defined here, "precision" is the true-positive rate
  # (sensitivity) and "recall" is the true-negative rate (specificity)
  precision = true_pos / (true_pos + false_neg)
  recall = true_neg / (true_neg + false_pos)
  geometric_mean = Math.sqrt(precision * recall)

  puts "Classifier: #{classifier} "
  puts " -- Precision #{precision}"
  puts " -- Recall #{recall}"
  puts " -- Geometric mean #{geometric_mean}"
  puts
end
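For comparison, Weka's own Evaluation class can run the whole cross-validation and print standard statistics. A minimal sketch, assuming "weka.classifiers.Evaluation" is added to the java_import list:

evaluation = Evaluation.new(data)
# 10-fold cross-validation with a fixed random seed
evaluation.cross_validate_model(NaiveBayes.new, data, 10, java::util::Random.new(1))
puts evaluation.to_summary_string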
The final step is to create some actual data and build some classification models. Notice how the Weka class names are passed to evaluate_classifier, which uses them to create the right classifier.
def run_expt
  data = preprocess('bbc/')
  evaluate_classifier('J48', data)
  evaluate_classifier('NaiveBayes', data)
  evaluate_classifier('SMO', data)
end

run_expt
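Because evaluate_classifier looks up the class by name, any other imported Weka classifier can be slotted in the same way. For example, assuming "weka.classifiers.lazy.IBk" is added to the java_import list:

evaluate_classifier('IBk', data)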
On my system, the script runs through in about 20 seconds. The output is:
Classifier: J48
 -- Precision 0.9201995012468828
 -- Recall 0.9792746113989638
 -- Geometric mean 0.9492776248248251

Classifier: NaiveBayes
 -- Precision 0.9825436408977556
 -- Recall 0.9715025906735751
 -- Geometric mean 0.9770075192044412

Classifier: SMO
 -- Precision 0.9750623441396509
 -- Recall 0.9896373056994819
 -- Geometric mean 0.9823227937614932