Filtering out similar records?

Hello!


In our project, users add material in the form of text strings up to 300 characters long.

There are a great many duplicates. When a record is added, we would like to check whether a string that is about 90% similar already exists and, if so, reject the addition.


The database is MySQL.


At the moment, the solution we have in mind is:


— remove all punctuation and spaces from the string

— convert it to lower case

— compute the MD5 hash of the result

— store the hash in a separate field in the database

— when adding a new record, check whether the same hash already exists in the database
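
The steps listed above can be sketched roughly as follows. This is only an illustration under some assumptions: the names `normalized_hash`, `try_add` and `seen_hashes` are made up for the example, "punctuation and spaces" is approximated by keeping only alphanumeric characters, and an in-memory set stands in for the indexed hash field in MySQL.

```python
import hashlib


def normalized_hash(text: str) -> str:
    """Return the MD5 hex digest of the text after normalization."""
    # Keep only letters and digits, which drops punctuation and whitespace
    # (assumption: other symbols may be dropped as well).
    kept = "".join(ch for ch in text if ch.isalnum())
    # Lower-case so that "Hello" and "hello" produce the same hash.
    normalized = kept.lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Hypothetical stand-in for the separate hash field in the database.
seen_hashes = set()


def try_add(text: str) -> bool:
    """Add the record unless a record with the same hash already exists."""
    digest = normalized_hash(text)
    if digest in seen_hashes:
        return False  # same text up to case, spacing and punctuation
    seen_hashes.add(digest)
    return True
```

With this sketch, `"Hello, world!"` and `"hello   WORLD"` collide, but `"Hello, word!"` is accepted as new: a hash of the normalized string only catches strings that are identical after normalization, not strings that are 90% similar.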


This solution is not the best; maybe there is something better?


PS: There are about 10 thousand records, and about 500 new ones are added per day. We could use Sphinx, but I did not find this kind of functionality in it.

5 Answers

Your current approach will filter out identical records, not similar ones...

I think this task is extremely difficult, if feasible at all, and it is probably a job for AI rather than for the database. Suppose there are two messages:
1. How do I filter out similar records in the database?
2. What is a way to avoid duplication of records in the database?
Are they similar?

In my opinion, it is best to leave this task to the users. For example, before publishing, show them a "Have you looked here?" block with 5 to 10 links, ordered by relevance, to messages that share words with the message being published. You could also use tags and search for messages not only by words but also by tags (or even by tags alone).

That said, this is all speculation. I have never had to deal with this in practice.

Create a separate page for each item and submit them to Yandex for indexing; if both end up in the index, we can consider them different :)

But seriously, there are services and programs that estimate the similarity of texts (they are common among SEO people and their rewriter assistants). I have not come across open-source ones, but you could try to come to an agreement with the authors, or plug such a service/program in as an external service/module.

Off the top of my head, I would solve the problem like this:
— build a list of the words in the material (with a count for each word)
— throw out the "garbage" (prepositions, conjunctions, "thank you" and "please")
— the result is a list of "tags"
— look for the material(s) whose tag list most closely matches the current one (for example, loop over the current list, fetch the first N materials for each tag and take the one(s) that occur most often)
— check how similar the current material is to the one(s) found (the criterion is set in the settings; for example, if more than 80% matches, we consider them similar)
— if it is not similar (less than 80% matches), publish it
— if it is similar, show these messages to the user with the question "Is this what you meant?"; if the user says "no", publish it; if "yes", do nothing
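
One possible reading of the steps above, as a small sketch. Everything named here (`STOP_WORDS`, `extract_tags`, `TagIndex`) is hypothetical, the stop-word list is a tiny English placeholder, and an in-memory inverted index stands in for whatever tag table the database would hold. Word counts are ignored for simplicity.

```python
import re
from collections import Counter, defaultdict

# Placeholder list of insignificant words; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and", "or", "is", "do",
              "i", "how", "what", "out", "thank", "you", "please"}


def extract_tags(text):
    """Split the text into lower-case words and drop the stop words."""
    words = re.findall(r"\w+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


class TagIndex:
    """Inverted index: tag -> ids of the materials that contain it."""

    def __init__(self):
        self.ids_by_tag = defaultdict(set)

    def add(self, item_id, text):
        for tag in extract_tags(text):
            self.ids_by_tag[tag].add(item_id)

    def most_similar(self, text):
        """Return (item_id, share of the new text's tags found in that item)."""
        tags = extract_tags(text)
        counts = Counter()
        for tag in tags:
            for item_id in self.ids_by_tag.get(tag, ()):
                counts[item_id] += 1  # one more tag shared with this item
        if not counts:
            return None, 0.0
        item_id, shared = counts.most_common(1)[0]
        return item_id, shared / len(tags)
```

A caller would compare the returned share against the configured threshold (0.8 in the example above) and, if it is reached, show the matching material to the user before publishing.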

After the initial launch, monitor the quality of the filter (you can do this transparently for users, flagging similar items only in the DB/admin panel) and adjust the similarity threshold and the dictionary of insignificant words as needed. You might introduce synonyms and/or stemming (an open-source product for this seems to have been described on Habr recently), take into account phrases and the position of words in the material/sentence... and so gradually surpass the duplicate-content detection algorithms of Google/Yandex, sell yours to them, and forget about users who are too lazy to search before publishing :)

Another approach is to build a neural network, train it on the existing database and keep training it as you go, but here I find it hard to estimate even roughly the effort of using and developing it, let alone the analysis itself. Well, or develop a semantic analyzer :)

Most likely you want shingles: habrahabr.ru/blogs/algorithm/65944/
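
To give an idea of what shingles look like, here is a minimal sketch. It is a simplified variant and not necessarily what the linked article describes: it uses overlapping word sequences ("shingles") of length `k` and compares the two sets with the Jaccard index, without the hashing or sampling steps a production version would likely add. The function names and the default `k=3` are assumptions made for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    return jaccard(shingles(text_a, k), shingles(text_b, k))
```

Because a single changed word affects up to `k` shingles, short strings are penalized heavily; for strings of at most 300 characters a smaller `k`, or character-level shingles, may work better.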

ru.wikipedia.org/wiki/Soundex: an algorithm for comparing two strings by how they sound. It assigns the same index to strings that sound similar.
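
For illustration, a compact sketch of the classic (American) Soundex code for a single word. Note the assumptions: it only handles the letters a to z, so it would not work on Russian text as is, and it encodes one word at a time, so a whole 300-character string would have to be processed word by word.

```python
# Consonant groups that share a Soundex digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code, or "" if there are no letters."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels have no code and reset prev
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code
    return result.ljust(4, "0")  # pad short codes with zeros
```

For example, `soundex("Robert")` and `soundex("Rupert")` both give `R163`.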