The algorithm for comparison of text strings?

0 like 0 dislike
9 views
Advise algorithm of string comparison with the working principle like this:


'Ivan Ivanovich Ivanov' = 'Ivanov Ivan Ivanovich'

'Ivan Ivanovich' ~ 'Ivanov Ivanovitch'

'Ivan Ivanovich Ivanov in the morning goes without pants' != 'Ivanov Ivan Ivanovich dress pants at night'


That is, you need to find the coefficient of similarity of strings, taking into account the fact that the words in the string can be swapped.

UPD: it Seems to have come up with:

a — array of words of the first line

b — the array of words of the second line


n is the number of words of the first line

m is the number of words of the second line


S — the coefficient of similarity of words a[i] and b[j] (you can use soundex or Levenshtein distance)


K = (C11 + C12 +... + С1м + C21 + C22 +... + C2m +... + Cnm) / ((n + m) / 2)


Total for instance, suppose Cij is calculated as a[i] == b[j] ? 1 : 0


a = ['Ivan', 'Ivanitch', 'Ivanov']

b = ['Ivanov', 'Ivan', 'Ivanovich']


K = (0 + 1 + 0 + 0 + 0 + 1 + 1 + 0 + 0) / ((3 + 3) / 2) = 3 / 3 = 1 — strings are the same


a = ['Ivan', 'Ivanovich']

b = ['Ivanov', 'Ivanovitch']


K = (0 + 0 + 0 + 1) / ((2 + 2) / 2) = 1 / 2 = 0.5 — similar, but not equal


Seems logical.


Thank you hamMElionreminding me to break lines at word %)
by | 9 views

5 Answers

0 like 0 dislike
Additionally, after split a string into words, they can be compared using the levinshtein(). Then, given the word length to obtain the coefficient of similarity. Thus, you can quite accurately determine similarity, even if you made a typo in the word, or if it's written a little differently.
Oh and an added bonus is the transliteration of the string and cleaning it of debris.
by
0 like 0 dislike
1. Split both strings into arrays of words (split)
2. The cycle of search of elements of the same array in the other (scoring = k)
3. Finding the number of matches for the second array from the ratio k1/n1=k2/n2 (n is the number of elements in the array)
4. The difference |k1-k2| is the coefficient of similarity
by
0 like 0 dislike
Algorithms — at least the antelope chew.
On staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html there are descriptions and links to implementation. Choose a suitable.
by
0 like 0 dislike
according to your algorithm it turns out that the string "Jay Jay Johanson" and "J. Q. Johnson" equal. you want to exclude from arrays of strings already match
by
0 like 0 dislike
by

Related questions

0 like 0 dislike
1 answer
0 like 0 dislike
3 answers
0 like 0 dislike
7 answers
asked Mar 21, 2019 by Zubchick
110,608 questions
257,186 answers
0 comments
28,839 users