As I did previously with JRuby and Kawa Scheme, here is a Fantom script to do some simple text classification using Weka. To keep this note self-contained, I have duplicated the text, changing only the code.
For data, I use the BBC dataset from Kaggle, selecting two classes: entertainment and tech.
So I have a parent folder, "bbc/", with two child folders, "entertainment" and "tech". Each file in the child folders is a text document, and will form an instance for the classifier to work on. The classes each have around 400 documents, and the total word count is 330,000.
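On disk the layout is as follows (the individual file names are illustrative; any collection of plain-text files works):

bbc/
  entertainment/
    001.txt
    002.txt
    ...
  tech/
    001.txt
    002.txt
    ...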
To install Weka, download the latest version (I used 3.8.6), and place the file weka.jar in your "FANTOM/lib/java/ext/" folder.
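One way to check the jar has been picked up is a tiny script that prints Weka's version (weka.core::Version and its VERSION field are part of Weka's public API):

using [java]weka.core::Version

class CheckWeka
{
  static Void main() { echo(Version.VERSION) }
}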
The overall process can be divided into several steps. (Here, I keep things relatively simple, to investigate the API.)
First, the text must be converted into words only, so all numbers and punctuation are removed, and all letters are converted to lowercase. So "Address: 3 High Street, London." becomes the four words "address high street london".
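As a minimal, runnable sketch of this clean-up in Fantom (using sys::Regex, just as the full script does later):

class CleanDemo
{
  static Void main()
  {
    text := "Address: 3 High Street, London."
    // replace every non-letter with a space, then lower-case
    cleaned := Regex("[^a-zA-Z]").matcher(text).replaceAll(" ").lower
    // split on whitespace to discard the extra spaces
    echo(cleaned.split.join(" "))   // prints: address high street london
  }
}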
Second, it is useful to simplify the words themselves. Common words (like "the", "a") are removed, using a stop-list. Also, stemming is applied so that different words like "walks" and "walking" are reduced to their stem "walk"; stemming increases the number of words in different documents which match.
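To see the stemming step in isolation, here is a small sketch that calls Weka's LovinsStemmer directly (the exact stems depend on the Lovins suffix rules, so the printed values are indicative only):

using [java]weka.core.stemmers::LovinsStemmer

class StemDemo
{
  static Void main()
  {
    stemmer := LovinsStemmer()
    echo(stemmer.stem("walks"))     // something like "walk"
    echo(stemmer.stem("walking"))   // something like "walk"
  }
}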
Third, the text files are converted into a representation where each text file is an instance made from the values for a number of attributes, and a class label. The class label has two values, one for "entertainment" and one for "tech", depending on which folder the text file is in. Each attribute represents one word, and has values 1 or 0 (whether the word for that attribute is present in this instance or not).
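For example, with two made-up word attributes, a document from the "tech" folder that contains "software" but not "film" would be represented as:

film  software  ...  class
   0         1  ...  tech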
Fourth, the number of attributes is usually very large at this stage, because natural language uses a lot of words. An attribute selection technique reduces the number of attributes to a manageable size.
Fifth, build some classification models. Three algorithms are used: Naive Bayes, Decision Trees and Support Vector Machines. For each of these algorithms, 10-fold cross validation is used to derive an overall accuracy.
The script follows. It is actually one file, but broken up with some explanations.
At the start of the script, import the required classes.
// Aim to build a simple text classifier on BBC dataset

using pclStatsBox                                   // <1>
using [java]java.io::File as JFile                  // <2>
using [java]weka.attributeSelection::CorrelationAttributeEval
using [java]weka.attributeSelection::Ranker
using [java]weka.classifiers::Classifier
using [java]weka.classifiers.bayes::NaiveBayes
using [java]weka.classifiers.functions::SMO
using [java]weka.classifiers.trees::J48
using [java]weka.core::Instances
using [java]weka.core.converters::TextDirectoryLoader
using [java]weka.core.stopwords::Rainbow
using [java]weka.core.tokenizers::WordTokenizer
using [java]weka.core.stemmers::LovinsStemmer
using [java]weka.filters::Filter
using [java]weka.filters.supervised.attribute::AttributeSelection
using [java]weka.filters.unsupervised.attribute::StringToWordVector
- See pclStatsBox, my own pod, which provides the ConfusionMatrix class used later.
- Give Java's File a new name, to avoid confusion with Fantom's File.
The following method was created because our preprocessing steps apply two of Weka's filters. Weka requires us to input each instance in turn to the filter, tell the filter the current batch has finished, and then retrieve the instances. Notice that the output result may have a different structure (number and type of attributes) to the input instances.
static Instances applyFilter(Instances instances, Filter filter)
{
  filter.setInputFormat(instances)
  instances.size.times |Int i|        // <1>
  {
    filter.input(instances[i])        // <2>
  }
  filter.batchFinished
  result := filter.getOutputFormat
  instance := filter.output
  while (instance != null)
  {
    result.add(instance)
    instance = filter.output
  }
  return result
}
- Use a simple loop over the index to iterate over the Java Instances.
- The Instances#get method is mapped to [] by Fantom.
The preprocess method covers the first four(!) steps described above.
Weka provides TextDirectoryLoader to load the text documents from the two folders. This process leaves each instance with two attributes: one is the text of the document, and the second is its class label (the name of the child folder).
Step 1 is done using a regular expression, to replace all non-alphabetic characters with spaces.
Steps 2-3 are done using a StringToWordVector filter. In this filter, I set the stemmer and stopwords handlers, and tell it to convert the text to lower case and tokenise the string as words (rather than character sequences). Setting outputWordCounts to false means the values will be 1 or 0, not actual word counts.
Step 4 is achieved using a second filter, AttributeSelection, configured with a CorrelationAttributeEval evaluator and a Ranker search to pick the most predictive 300 attributes.
static Instances preprocess(Str textDirectory)
{
  loader := TextDirectoryLoader()
  loader.setSource(JFile(textDirectory))
  instances := loader.getDataSet

  // remove numbers/punctuation - step 1
  instances.size.times |Int i|
  {
    text := instances[i].stringValue(0)
    text = Regex("[^a-zA-Z]").matcher(text).replaceAll(" ")  // <1>
    instances[i].setValue(0, text)
  }

  // turn into vector of words, applying filters - steps 2 & 3
  filter := StringToWordVector()                             // <2>
  filter.setLowerCaseTokens(true)
  filter.setOutputWordCounts(false)
  filter.setStemmer(LovinsStemmer())
  filter.setStopwordsHandler(Rainbow())
  filter.setTokenizer(WordTokenizer())
  // -- apply the filter
  instances = applyFilter(instances, filter)

  // identify the class label
  instances.setClassIndex(0)

  // reduce number of attributes to 300 - step 4
  selector := AttributeSelection()
  selector.setEvaluator(CorrelationAttributeEval())
  ranker := Ranker()
  ranker.setNumToSelect(300)
  selector.setSearch(ranker)
  // -- apply the filter
  instances = applyFilter(instances, selector)

  return instances
}
- Use a Regex to replace non-letters with spaces.
- The StringToWordVector filter offers many options: some require providing a new instance (like the stemmer) and others a value (like whether to use lower-case tokens).
Step 5 is the task of the evaluateClassifier method, used to test a given classification algorithm. Weka provides methods on instances to create the train/test splits for k-fold cross-validation, so we use those to build and evaluate a classifier for each fold, and a ConfusionMatrix instance to collate the results. Notice the use of Fantom's type literals to provide a convenient reference to the classifier type.
static Void evaluateClassifier(Type classifier, Instances instances, Int k := 10)
{
  predictionAsLabel := |Float v->Str|       // <1>
  {
    if (v > 0.5f)
      return "positive"
    else
      return "negative"
  }

  cm := ConfusionMatrix()                   // <2>
  Classifier model := classifier.make       // <3>

  k.times |Int f|
  {
    train := instances.trainCV(k, f)        // <4>
    test := instances.testCV(k, f)
    model.buildClassifier(train)
    test.size.times |Int i|                 // <5>
    {
      cm.addCount(
        predictionAsLabel(test[i].classValue),              // predicted
        predictionAsLabel(model.classifyInstance(test[i]))  // observed
      )
    }
  }

  echo("Classifier: ${classifier.name}")
  echo(" -- Precision: ${cm.precision}")    // <6>
  echo(" -- Recall: ${cm.recall}")
  echo(" -- Geometric Mean: ${cm.geometricMean}")
  echo()
  echo(cm)
  echo()
}
- Simple function to convert a Float prediction into a class label.
- Create a ConfusionMatrix to collate results.
- Construct an instance of the classifier type.
- Use Weka's trainCV and testCV to create the train/test split for each fold of the cross-validation process.
- Run through every test instance in turn, recording its predicted and observed class in the confusion matrix.
- Pull out aggregate results from the confusion matrix.
The main method simply loads the dataset through the preprocess method, and then evaluates each classifier in turn. Notice how a type literal is used to identify each classifier.
static Void main(Str[] args)
{
  data := preprocess("bbc/")

  evaluateClassifier(J48#, data)
  evaluateClassifier(NaiveBayes#, data)
  evaluateClassifier(SMO#, data)
}
On my system, the script runs through in about 20 seconds. The output is:
>fan TextClassifier.fan
Classifier: J48
 -- Precision: 0.952020202020202
 -- Recall: 0.940149625935162
 -- Geometric Mean: 0.9454484813442652

       Observed
 positive  negative | Predicted
--------------------+----------
      377        24 | positive
       19       367 | negative

Classifier: NaiveBayes
 -- Precision: 0.9704433497536946
 -- Recall: 0.9825436408977556
 -- Geometric Mean: 0.9757039729011721

       Observed
 positive  negative | Predicted
--------------------+----------
      394         7 | positive
       12       374 | negative

Classifier: SMO
 -- Precision: 0.992462311557789
 -- Recall: 0.9850374064837906
 -- Geometric Mean: 0.9886261555033407

       Observed
 positive  negative | Predicted
--------------------+----------
      395         6 | positive
        3       383 | negative