Recision and Precall
Accuracy Measures for the 21st Century

Jonathan van der Meer
Center for Computational Bioinformatics and Linguistics
NYC, NY

Thanks to a decades-long case of physics envy and the advent of cheap computational power, linguistics has devolved from a cultured gentlemen’s pseudo-science into a debased money-grubbing quasi-science.

A pair of unfortunate side effects of this ensciencification and the ever-growing popularity of so-called “computational” linguistics are, first, the “need”1 to create formalizable and computationally tractable algorithms for doing linguistics,2 and, second, the “need” to create metrics to evaluate those mechanisms.

Two standard measures of accuracy in information retrieval and computational linguistics circles are precision and recall.

For the technically minded, precision is the percentage of correct results among the results returned for a given task. Precision is usually tractable to calculate in that it is typically possible to review the results returned and judge their adequacy. In some cases it may also be possible to have 100% precision by giving only one guaranteed-correct result, though this is usually of trivial usefulness.

In contrast, recall is the percentage of correct results returned out of all possible correct results for a given task. Recall can be difficult to calculate in that it may not be possible to survey the entire universe of possible results to determine which would have been adequate. In some cases it may also be possible to have 100% recall by returning all possible results, thus guaranteeing that all correct results are included as well, though this, too, is usually of trivial usefulness.

Colloquially, recall measures “how much of the good stuff you got.” Precision measures “how much of what you got is good stuff.”3
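For the irredeemably computational, here is a minimal Python sketch of the two measures over sets of results. The function names and toy data are ours, not anyone’s standard library, and real evaluation suites are of course fussier about edge cases.

    def precision(returned, relevant):
        """Share of the returned results that are actually correct."""
        return len(returned & relevant) / len(returned) if returned else 0.0

    def recall(returned, relevant):
        """Share of all correct results that were actually returned."""
        return len(returned & relevant) / len(relevant) if relevant else 0.0

    # Toy run: 4 results returned, 3 of them good, out of 6 good ones in total.
    returned = {"a", "b", "c", "d"}
    relevant = {"a", "b", "c", "e", "f", "g"}
    print(precision(returned, relevant))  # 0.75 -- how much of what you got is good stuff
    print(recall(returned, relevant))     # 0.5  -- how much of the good stuff you got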

Balancing precision and recall is often difficult when the two values differ significantly, whether in tuning the performance of a computational linguistic system, in determining the overall accuracy of a system, or, worst of all, in comparing across systems. Nonetheless, several attempts have been made to unify the two, including the popular but ungainly and counter-intuitive F-Measure. Though it risks putting most readers into a math coma, we must say that the F-Measure is the (sometimes weighted) harmonic mean of recall and precision. We’d also like to toss in that it is mathematically silly and over-simplified. Further, anyone who can’t take the time to come to terms with two potentially useful and fundamentally orthogonal performance measures and balance their differences against the needs of a particular task isn’t someone who is going to be able to do anything particularly wise with a single muddled number anyway.
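For those willing to risk the aforementioned math coma, the usual definition renders into Python roughly as follows. This is the weighted variant; beta = 1 gives the plain harmonic mean, and the toy numbers continue the example above.

    def f_measure(p, r, beta=1.0):
        """(Optionally beta-weighted) harmonic mean of precision p and recall r."""
        if p == 0.0 and r == 0.0:
            return 0.0
        return (1 + beta**2) * p * r / (beta**2 * p + r)

    print(f_measure(0.75, 0.5))  # 0.6 -- one number, two measures, zero nuance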

So, rather than trying to dumb things down and arrive at such a single “accuracy” number, we propose instead to dumb things up, constructing measures that focus on the real needs of a measurable theory, including the meta-system/contextual-matrix in which it is embedded (including, explicitly and for the first time, the researchers and grad students on the research team).

These two new measures are called recision and precall.

Recision is a measure of the amount of data that must be ignored (or surreptitiously dumped in the river with a new pair of cement shoes) in order to get publishable results. If 10% of your data must be “lost” in order to get good results that support your pre-computed conclusions, then your theory and your research team have a respectable recision score of 10%. If only 10% of your data is usable, then your recision score is a dismal4 90%.
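One plausible back-of-the-envelope formalization, ours and not any standards body’s, treats recision as simply the share of the original data that had to take the swim:

    def recision(total_items, usable_items):
        """Share of the data that had to be quietly 'lost' to get publishable results."""
        return (total_items - usable_items) / total_items

    print(recision(1000, 900))  # 0.1 -- a respectable 10%
    print(recision(1000, 100))  # 0.9 -- a dismal (if government-grade) 90%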

Precall is a measure of your team’s ability to quickly and correctly predict how well your algorithm or system will perform on a new data set that you can briefly review. Correctly predicting “This will give good results.” or “This is gonna suck!” 90% of the time translates directly into a precall score of 90%. Good precall (especially during live demos) can save a project when results are poorer than they should be. The ability to look at some data and accurately predict and, more importantly, explain why such data will give poor results shows a deep understanding of the problem space.5 Even when performance is decent, though, prefacing each data run with “We have no idea how this will turn out!” makes your team look lucky, at best, or, at worst, foolish.
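Precall admits an equally unserious formalization; a sketch, assuming each data run receives exactly one advance verdict and one actual outcome:

    def precall(verdicts, outcomes):
        """Share of data runs whose quality the team correctly called in advance.
        Both arguments are parallel lists of labels, e.g. 'good' or 'suck'."""
        hits = sum(v == o for v, o in zip(verdicts, outcomes))
        return hits / len(verdicts)

    verdicts = ["good", "suck", "good", "good", "suck"]
    outcomes = ["good", "suck", "good", "suck", "suck"]
    print(precall(verdicts, outcomes))  # 0.8 -- passable, but keep it out of the live demo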

Spend time improving the quality and complexity of your algorithms to decrease your recision score. Spend money improving the quality and complexity of your team to increase your precall score. Once you have mastered these important conceptual metrics, success, fame, and/or tenure await you!


Notes:

1 Here we use need in the sense of “that which is required to acquire funding and/or tenure”.

2 An affront to those who remember when our gentle form of madness was known as “philology”.

3 Interestingly, this is one of the few times when linguists, with more experience resolving subtle scoping differences, usually have a leg up on their computer science colleagues in properly internalizing technical vocabulary. If you are still lost, please do not submit your résumé to the Center for Computational Bioinformatics and Linguistics. Thanks.

4 But still good enough for government work!

5 A tip for our neophyte readers: mention your “deep understanding of the problem space” (or its equivalent, if you actually understand what that means) in every conversation you have with anyone who has influence over your funding.
