This dataset contains automatically gathered Finnish paraphrase candidates (positives) and negative examples. The data comes from the OPUS dataset and from the web-scraped Turku Internet Parsebank. Candidate pairs are first selected with FinBERT to find pairs that are at least somewhat similar, and are then classified with a model trained on the Turku Paraphrase Corpus. Only the "safe" regions of reliably positive and reliably negative predictions are kept in these files. There are 542K positives and 5.6M negatives.

Negatives file columns:

sim: lexical similarity in terms of character n-grams
base:2: classification score as a negative example (the higher, the more likely it is not a paraphrase pair)
txt1, txt2: the two text segments of the pair

Positives file columns:

sim: lexical similarity in terms of character n-grams
base:4: classification score as a positive example (the higher, the more likely it is a paraphrase pair)
txt1, txt2: the two text segments of the pair
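
The column layout above can be read and filtered along the following lines. This is only a minimal sketch: it assumes the files are tab-separated with a header row naming the columns, which is not stated here, so adjust the delimiter if the actual files differ. The function name read_pairs, the thresholds, and the two sample rows are hypothetical and exist only for illustration; they are not taken from the data.

```python
import csv
import io
from typing import Iterator, TextIO


def read_pairs(handle: TextIO, score_column: str,
               min_score: float = float("-inf"),
               max_sim: float = float("inf"),
               delimiter: str = "\t") -> Iterator[dict]:
    """Yield rows whose classifier score and lexical similarity pass the filters.

    ASSUMPTION: the file has a header row with the column names listed
    above and is tab-separated. Neither is confirmed by the description.
    """
    # QUOTE_NONE keeps quotation marks inside the text segments untouched.
    reader = csv.DictReader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        score = float(row[score_column])  # "base:4" (positives) or "base:2" (negatives)
        sim = float(row["sim"])           # character n-gram lexical similarity
        if score >= min_score and sim <= max_sim:
            yield {"sim": sim, "score": score,
                   "txt1": row["txt1"], "txt2": row["txt2"]}


# Invented two-row sample in the assumed layout; NOT real rows from the dataset.
sample = (
    "sim\tbase:4\ttxt1\ttxt2\n"
    "0.31\t0.98\tHyvää huomenta\tHuomenta\n"
    "0.85\t0.91\tKiitos paljon\tKiitos paljon!\n"
)

# Keep confident positives that are lexically dissimilar (thresholds are arbitrary).
pairs = list(read_pairs(io.StringIO(sample), "base:4", min_score=0.9, max_sim=0.5))
for pair in pairs:
    print(pair["txt1"], "|", pair["txt2"])
```

For the negatives file, the same function would be called with "base:2" as the score column. Filtering on a low sim value is one way to favor pairs that share little surface form, which may be of interest when near-identical strings are not wanted.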