Make a separate site for each material and submit both for indexing in Yandex; if both end up in the index, we can consider them different :)
But seriously, there are services and programs that estimate text similarity (widespread among SEOs and their rewriter helpers). I haven't come across an open-source one, but you could try to come to an agreement with the authors, or plug such a service/program in as an external service/module.
Off the top of my head, I would solve the problem like this (a rough Python sketch follows the list):
— build the list of words in the material (with their counts)
— throw out the "garbage" (prepositions, conjunctions, "thank you" and "please")
— what's left is the list of "tags"
— look for the material(s) whose tag list most closely matches the current one (for example, loop through the current list, fetch the first N materials for each tag, and take the one(s) encountered most often)
— check whether what was found is similar enough to the current material (the criterion is configurable; for example, if more than 80% of the tags match, we consider it similar)
— if it's not similar (less than an 80% match), publish
— if it is similar, show the user those materials with the question "Did you mean this?"; if the user says "no", publish, if "yes", do nothing
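A minimal sketch of this pipeline, using Python for illustration. The corpus and tag_index layouts, the helper names, and the simple tag-overlap metric are my assumptions, not part of the original suggestion:

    import re
    from collections import Counter

    STOP_WORDS = {"a", "an", "and", "in", "of", "on", "or", "the", "to",
                  "thank", "you", "please"}  # the "garbage" dictionary, tuned over time

    def extract_tags(text):
        """The material's word list minus the garbage -> its 'tags'."""
        words = re.findall(r"[\w']+", text.lower())
        return Counter(w for w in words if w not in STOP_WORDS)

    def candidates(new_tags, tag_index, per_tag=50, top=10):
        """Loop through the current tag list, take the first N materials
        indexed under each tag, keep the ones encountered most often."""
        hits = Counter()
        for tag in new_tags:
            hits.update(tag_index.get(tag, [])[:per_tag])
        return [mid for mid, _ in hits.most_common(top)]

    def similarity(new_tags, old_tags):
        """Share of the new material's tag occurrences also found in the old one."""
        total = sum(new_tags.values())
        return sum((new_tags & old_tags).values()) / total if total else 0.0

    def check(new_text, corpus, tag_index, threshold=0.8):
        """Below the threshold -> publish; otherwise return the matches
        so the UI can ask the 'Did you mean this?' question."""
        new_tags = extract_tags(new_text)
        similar = [(mid, score) for mid in candidates(new_tags, tag_index)
                   if (score := similarity(new_tags, extract_tags(corpus[mid]))) >= threshold]
        return (not similar, similar)

Here corpus maps a material id to its text and tag_index maps a tag to the ids of materials containing it; both would come from the site's DB.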
After the initial rollout, monitor the quality of the filter (you can run it transparently for users, flagging similar materials only in the DB/admin panel) and tune the similarity threshold and the dictionary of insignificant words. Later you could introduce synonyms and/or cut words down to their stems (open-source stemmers even seem to have been described on Habr recently), take phrases into account, and the position of words within the material/sentence... and so, step by step, outdo the duplicate-content detection algorithms of Google/Yandex, sell the result to them, and forget about users who are too lazy to search before publishing :)
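If the stemming refinement is wanted, a stemmer can fold word forms into a single tag before counting. A hedged sketch using NLTK's SnowballStemmer (the class itself is real; the wiring around it reuses the assumptions from the sketch above):

    import re
    from collections import Counter
    from nltk.stem.snowball import SnowballStemmer  # pip install nltk

    STOP_WORDS = {"a", "an", "and", "the", "to", "thank", "you", "please"}  # as above
    stemmer = SnowballStemmer("english")  # "russian" is supported as well

    def extract_tags_stemmed(text):
        """Like extract_tags above, but word forms collapse into one stem:
        'publish', 'published' and 'publishing' all become the tag 'publish'."""
        words = re.findall(r"[\w']+", text.lower())
        return Counter(stemmer.stem(w) for w in words if w not in STOP_WORDS)

Swapping this in for extract_tags makes the 80% comparison less sensitive to trivial rewording, at the cost of a slightly coarser tag list.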
Another approach is to build a neural network: train it on the existing content base and keep training it as it runs. But there I would find it hard to estimate even the rough development effort and resource usage, let alone the quality of the actual analysis. Or develop a semantic analyzer :)