How do I programmatically check the uniqueness of a text against search engines?

How do services like Copyscape determine whether a text is unique?

2 Answers

Most likely they look for similar documents. If the given text is, by some similarity metric, very close to any of them, it is considered a copy. Perhaps the same check is also done at the paragraph level.
To find similar documents quickly, use LSH (locality-sensitive hashing) and clustering.
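To make the LSH idea concrete, here is a minimal sketch of one common variant, MinHash signatures with banding. This is only an illustration, not how Copyscape or any search engine is known to work. All function names, the 3-word shingle size, the 64 hashes and the 16 bands are assumptions chosen for the example.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def word_shingles(text, k=3):
    """Return the set of k-word shingles (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash64(seed, shingle):
    # A seeded 64-bit hash; each seed acts as a separate hash function.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """For each hash function keep the minimum hash over all shingles.

    Texts shorter than k words have no shingles and get an all-zero signature.
    """
    return [min((_hash64(seed, s) for s in shingle_set), default=0)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Split signatures into bands; documents sharing any band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample documents, for illustration only.
docs = {
    "original": "the quick brown fox jumps over the lazy dog near the river bank",
    "copy": "The quick brown fox jumps over the lazy dog, near the river bank.",
    "other": "locality sensitive hashing groups similar items into the same bucket quickly",
}
signatures = {doc_id: minhash_signature(word_shingles(text))
              for doc_id, text in docs.items()}
candidates = lsh_candidate_pairs(signatures)
```

Only the candidate pairs then need a full comparison, which is what makes the search fast on a large collection. More bands with fewer rows each catch weaker similarities at the cost of more false candidates.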

Use shingles. That is, take a random shingle from the text (a run of consecutive words; I don't remember exactly, but usually 5 to 9 words), put it in quotation marks, and submit it to a search engine. If there is more than one result, someone has copy-pasted the text. From there the search engines' own algorithms decide which copy is the original, and they do not always identify the original source correctly.
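A rough sketch of the shingle approach in Python might look like this. The function names are made up for the example, and the lowercasing and punctuation stripping are my own assumptions. The code only builds the quoted phrases; sending them to a search engine is left out because that depends on whichever search API you have access to. `resemblance` is an extra helper for comparing two texts you already have locally.

```python
import random
import re


def shingles(text, k=5):
    """Return all runs of k consecutive words, in order of appearance."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def random_exact_queries(text, k=5, n=3, seed=None):
    """Pick up to n random shingles and wrap each in quotes for an exact-phrase search."""
    rng = random.Random(seed)
    candidates = shingles(text, k)
    picked = rng.sample(candidates, min(n, len(candidates)))
    return [f'"{s}"' for s in picked]


def resemblance(text_a, text_b, k=5):
    """Share of common shingles between two texts (Jaccard index, 0.0 to 1.0)."""
    a, b = set(shingles(text_a, k)), set(shingles(text_b, k))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Checking several random shingles rather than one makes the result less dependent on a single common phrase that happens to appear on unrelated pages.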