Computation and Language, Computer Science

Error in Extracting Named Entities Leads to Difficulty in Generating REFUTES and NEI

Record linkage is a crucial task in data analysis, which involves matching and merging records from different sources. The Fellegi-Sunter model has been widely used for record linkage, but it has some limitations. In this article, we propose using machine learning techniques to improve the accuracy of record linkage.
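For readers unfamiliar with the Fellegi-Sunter model, the following is a minimal sketch of its core scoring idea, not the implementation used in the work summarized here. Each compared field contributes a log-likelihood-ratio weight based on m (the probability that the field agrees for a true match) and u (the probability that it agrees for a non-match), and the summed weight is compared against two thresholds. The field names, probabilities, and thresholds below are invented for illustration.

```python
import math


def match_weight(agreements, m_probs, u_probs):
    """Sum the Fellegi-Sunter agreement/disagreement weights over all fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = m_probs[field], u_probs[field]
        if agrees:
            # Agreement is evidence for a match when m > u.
            total += math.log2(m / u)
        else:
            # Disagreement is evidence against a match when m > u.
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower, upper):
    """Three-way decision: match, non-match, or possible match (clerical review)."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


# Hypothetical m/u probabilities and thresholds, chosen only for illustration.
M_PROBS = {"surname": 0.9, "birth_year": 0.8}
U_PROBS = {"surname": 0.1, "birth_year": 0.2}

pair = {"surname": True, "birth_year": False}
print(classify(match_weight(pair, M_PROBS, U_PROBS), lower=-3.0, upper=3.0))
```

In practice the m and u probabilities are estimated from data rather than fixed by hand, and the thresholds trade off false matches against missed matches.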

Machine Learning Techniques

We enhance the Fellegi-Sunter model with three machine learning algorithms: decision trees, k-nearest neighbors (k-NN), and support vector machines (SVM). Each algorithm learns a set of rules for deciding whether two records match. We also introduce a new feature called "string comparator metrics", which provides a more accurate way of matching strings.
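As an illustration of how one of these classifiers could be applied to record pairs, here is a small k-NN sketch in plain Python. It assumes each candidate pair has already been reduced to a vector of per-field similarity scores in [0, 1]; that representation, the function name knn_predict, and the toy training data are all assumptions made for this example and are not taken from the original work.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a similarity vector by majority vote among its k nearest neighbors."""
    # Sort labeled examples by Euclidean distance to the query vector.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy labeled pairs: (name similarity, address similarity) -> label.
# These values are made up purely for illustration.
TRAIN = [
    ((0.95, 0.90), "match"),
    ((0.90, 1.00), "match"),
    ((0.85, 0.80), "match"),
    ((0.20, 0.10), "non-match"),
    ((0.30, 0.00), "non-match"),
    ((0.10, 0.40), "non-match"),
]

print(knn_predict(TRAIN, (0.88, 0.95)))
print(knn_predict(TRAIN, (0.15, 0.20)))
```

A decision tree or SVM would consume the same kind of feature vectors; k-NN is shown here only because it fits in a few lines without external libraries.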

String Comparator Metrics

The string comparator metrics are based on the Jaro-Winkler similarity measure, which scores two strings by the number of matching characters, the number of transpositions among those matches, and the lengths of the strings, with an added bonus for a shared prefix. We emphasize words similar to those in the claim (pink background) to improve orientation in the retrieved texts.
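The Jaro-Winkler measure can be sketched in a few lines. The version below follows the standard textbook definition (a matching window of half the longer string minus one, a prefix scale of 0.1, and a prefix cap of four characters); whether the original work uses exactly these parameters is an assumption.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they lie within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # each transposition involves two characters
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix bonus reflects the observation that typing and transcription errors are less common at the start of a name, so strings that agree on their first few characters are more likely to refer to the same value.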

Results

We evaluate our proposed approach on four languages – English, French, German, and Czech. Our results show that our machine learning-based approach outperforms the traditional Fellegi-Sunter model in terms of accuracy. We also observe that the string comparator metrics provide a more accurate way of matching strings compared to other similarity measures.

Conclusion

In this article, we proposed using machine learning techniques to improve the accuracy of record linkage in the Fellegi-Sunter model. Our results show that our approach outperforms the traditional method in terms of accuracy and provides a more accurate way of matching strings. We believe that our proposed approach can be useful in various applications where accurate record linkage is required.