Abstract: Plagiarism, the unauthorized use or imitation of another’s work without proper acknowledgment, poses a significant challenge in academia, research, and professional content creation, amplified by the widespread sharing of digital information. Reliable plagiarism detection systems are essential to ensure originality and maintain integrity. This paper investigates two widely used algorithms—Jaccard and Cosine similarity—for their effectiveness in detecting textual similarities. Jaccard similarity excels in identifying exact or near-exact overlaps but struggles with rephrased content, whereas Cosine similarity captures deeper semantic similarities, including paraphrasing, but is computationally more demanding. Preprocessing techniques, such as tokenization, stop word removal, and stemming, are employed to optimize the algorithms’ performance. The research evaluates their strengths, limitations, and computational efficiency through a detailed comparative analysis, offering insights into their suitability for specific applications. The findings emphasize the importance of balancing detection accuracy with computational demands, guiding the selection of appropriate methods for plagiarism detection in various contexts.
Keywords: Plagiarism Detection, Cosine Similarity, Jaccard Similarity, Text Similarity, Text Preprocessing
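The two similarity measures and the preprocessing pipeline described above can be sketched with standard-library Python. This is a minimal illustration, not the paper's implementation: the stop-word list is an illustrative subset, and the suffix-stripping stemmer is a crude stand-in for a proper stemmer such as Porter's. Jaccard similarity is computed over token *sets*, which is why it is sensitive to exact overlap, while Cosine similarity is computed over term-frequency vectors, which tolerates word reordering and weights repeated terms.

```python
import math
import re
from collections import Counter

# Illustrative subset of English stop words (a real system would use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "for"}

def preprocess(text):
    """Tokenize, lowercase, remove stop words, and apply a crude suffix stemmer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        # Very rough stemming: strip one common suffix (stand-in for Porter stemming).
        for suffix in ("ing", "ed", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

def jaccard_similarity(tokens_a, tokens_b):
    """|A intersect B| / |A union B| over token sets: rewards exact word overlap."""
    sa, sb = set(tokens_a), set(tokens_b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between term-frequency vectors of the two documents."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0
```

A short usage example: two sentences that share vocabulary but rephrase each other score well below 1.0 under Jaccard, since stemming does not reconcile every variant (e.g. "detection" vs. "detect"), while Cosine similarity scores the same pair somewhat differently because it operates on term frequencies rather than set membership.

```python
doc1 = preprocess("Plagiarism detection systems ensure originality in academia.")
doc2 = preprocess("Systems for detecting plagiarism help ensure original work in academia.")
print(jaccard_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc2))
```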