
Precision, Recall, and F1 for sequence labeling tasks?

What’s the matter with “normal” evaluation?

By “normal” I mean the kind of evaluation used for binary classification tasks.

In sequence labeling tasks, True Positives / True Negatives / False Positives / False Negatives are counted differently than in binary classification tasks.

Thus, you need to calculate the performance scores differently than you would for binary classification tasks.
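Whatever the counting scheme, the final scores come from the usual formulas. As a quick reminder, here is a generic helper (my own illustration, not part of seqeval):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts.

    The formulas are standard; what differs between tasks is
    how TP / FP / FN are counted in the first place.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


print(prf(1, 1, 1))  # → (0.5, 0.5, 0.5)
```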

How can I do it easily?

No worries at all. There is a super easy Python package that calculates these scores for you: seqeval.

Just run “pip install seqeval” and follow the sample code in the seqeval README. That’s all!

So … what are the definitions of TP / FP / FN / TN?

You can refer to the definitions in the seqeval source code on GitHub.

In plain words, the definitions are as follows.

Here, “Gold" means “label annotated by human" and “Prediction" means “label predicted by a sequence labeling model"

  • TP: if both Gold and Prediction are labels other than “O”, and Gold is the same as Prediction, then count +1
  • FP: if Gold is NOT the same as Prediction, then count +1
  • FN: if Gold is NOT “O” and Prediction is “O”, then count +1
  • TN: if both Gold and Prediction are “O”, then count +1
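Those rules can be sketched in a few lines of Python. This is my own illustrative rewrite of the description above, not seqeval’s actual source; note that I check the FN rule before the generic-mismatch FP rule, so a missed label counts as a false negative rather than a false positive:

```python
def confusion_counts(gold: list[str], pred: list[str]) -> tuple[int, int, int, int]:
    """Count TP / FP / FN / TN over aligned token label sequences,
    following the rules described above (an illustrative sketch)."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, pred):
        if g != "O" and p != "O" and g == p:
            tp += 1  # both carry a label, and the labels agree
        elif g != "O" and p == "O":
            fn += 1  # gold has a label, the prediction missed it
        elif g == "O" and p == "O":
            tn += 1  # both are unlabeled
        else:
            fp += 1  # remaining mismatches
    return tp, fp, fn, tn


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-ORG"]
print(confusion_counts(gold, pred))  # → (1, 1, 1, 1)
```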