Plagiarism Detection Using Levenshtein Distance and Smith-Waterman Algorithm

In this paper, the authors present two streaming algorithms for computing Levenshtein distance, a measure of how different two sequences are, in the low distance regime, where the sequences are expected to differ by only a few edits. They use a novel technique called the "online-offline" method, which combines the advantages of online and offline algorithms. The first algorithm, called "Spectral Embedding," uses the eigenvectors of the graph Laplacian to embed the sequences in a high-dimensional space and computes the distance there. The second algorithm, called "Streaming Sketches," uses random projections to reduce the dimensionality of the data and speed up computation.
To simplify the concepts, think of Levenshtein distance as a kind of "word puzzle" in which we must transform one sequence into another using the fewest possible single-character edits (deletions, insertions, or substitutions). The distance is the number of edits in that shortest transformation. The algorithms in this paper are like specialized tools that help us solve these word puzzles efficiently and accurately.
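
To make the "word puzzle" concrete, here is a minimal sketch of the classic dynamic-programming computation of Levenshtein distance. It is the standard textbook method, not either of the algorithms described in the paper, and the function name `levenshtein` is chosen here purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Fewest single-character insertions, deletions, or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string as the row
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Turning "kitten" into "sitting" takes two substitutions and one insertion, so the distance is 3. This version uses time proportional to the product of the two lengths, which is the cost that faster methods aim to avoid.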
The authors demonstrate the effectiveness of their algorithms on several datasets and show that they outperform existing methods in the low distance regime. They also provide a theoretical analysis of the algorithms and prove that they have logarithmic time complexity, which is important for large-scale applications.
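
One way to see why the low distance regime is easier is the standard banded dynamic program: if the distance is known to be at most some threshold `k`, only table cells within `k` of the main diagonal can matter. The sketch below illustrates that general idea only. It is not the authors' algorithm, and the name `levenshtein_within` and the parameter `k` are assumptions made for this example.

```python
from typing import Optional


def levenshtein_within(a: str, b: str, k: int) -> Optional[int]:
    """Return the Levenshtein distance of a and b if it is at most k, else None."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None  # the length difference alone already needs more than k edits
    inf = k + 1  # stands for "more than k"; values are capped here
    # Row 0: distance from the empty prefix of a to b[:j] is j.
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        # Only columns with |i - j| <= k can lie on a path of cost <= k.
        lo, hi = max(1, i - k), min(m, i + k)
        # Full rows are allocated for readability; only the band is computed.
        cur = [inf] * (m + 1)
        if i <= k:
            cur[0] = i
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost, inf)
        prev = cur
    return prev[m] if prev[m] <= k else None


if __name__ == "__main__":
    print(levenshtein_within("kitten", "sitting", 3))
    print(levenshtein_within("kitten", "sitting", 2))
```

The inner loop evaluates at most `2k + 1` cells per row, so the comparison work grows with `k` rather than with the length of the second sequence. A production version would store only the band instead of full rows.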
Overall, this paper contributes to computational biology by providing efficient algorithms for computing Levenshtein distance in the low distance regime. These algorithms can be used to analyze large datasets of biological sequences and help researchers understand the similarities and differences between them.

ARXIV/2312.07931 authored by Xiang Wei, Alan J.X. Guo, Sihan Sun, Mengyi Wei, Wei Yu.

LLama 2 7B Chat
