## How do I find the percentage of similarity between two multiline Strings?

Question

I have two multi-line strings, and I'm using the following code to determine how similar they are. It uses the Levenshtein distance algorithm.

``````  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) {
        longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }

    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

}

public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0)
                costs[j] = j;
            else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1))
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0)
            costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
}
``````

But the above code does not work as expected.

For instance, say we have the following two strings, `s1` and `s2`:

S1 -> `How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?`

S2 -> `How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?`

I then pass these strings to the `similarity` method, but it does not give the percentage of difference I expect. How do I optimize the algorithm?

Update: the following is the method that calls it:

``````public static boolean authQuestion(String question) throws SQLException{

    boolean isQuestionAvailable = false;
    Connection dbCon = null;
    try {
        dbCon = MyResource.getConnection();
        String query = "SELECT * FROM WORDBANK where WORD ~*  ?;";
        PreparedStatement checkStmt = dbCon.prepareStatement(query);
        checkStmt.setString(1, question);
        ResultSet rs = checkStmt.executeQuery();
        while (rs.next()) {
            double re=similarity( rs.getString("question"), question);
            if(re  > 0.6){
                isQuestionAvailable = true;
            }else {
                isQuestionAvailable = false;
            }
        }
    } catch (URISyntaxException e1) {
        e1.printStackTrace();
    } catch (SQLException sqle) {
        sqle.printStackTrace();
    } catch (Exception e) {
        if (dbCon != null)
            dbCon.close();
    } finally {
        if (dbCon != null)
            dbCon.close();
    }

    return isQuestionAvailable;
}
``````

## Answers to How do I find the percentage of similarity between two multiline Strings? ( 3 )

1. Here is one approach.

You are using edit distance, which gives the number of characters in S1 that you need to change, add, or remove in order to turn it into S2.

So, for example:

``````S1 = "abc"
S2 = "cde"
``````

The edit distance is 3, and the strings are 100% different (assuming a character-by-character comparison).

So you can get an approximate percentage with:

``````S1 = "abc"
S2 = "cde"
edit = edit_distance(S1, S2)
percentage = min(edit/S1.length(), edit/S2.length())
``````

The `min` handles cases where the strings have very different lengths, for example:

``````S1 = "abc"
S2 = "defghijklmno"
``````

Here the edit distance (12) is larger than the length of S1 (3), so `edit/S1.length()` alone would give more than 100%. Taking the `min` of the two ratios is the same as dividing by the longer of the two lengths, which keeps the result at or below 100%.

hope that helps
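The pseudocode above can be sketched in Java as follows. This is a minimal illustration, not the asker's code: the class name `SimilarityPercent` and the method `percentDifferent` are hypothetical, and the distance is a plain two-row Levenshtein implementation. It divides by the longer length, which is equivalent to taking the `min` of the two ratios.

```java
class SimilarityPercent {

    /** Levenshtein distance: insertions, deletions and substitutions each cost 1. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(prev[j - 1] + cost,
                        Math.min(prev[j], curr[j - 1]) + 1);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    /** Percentage difference in [0, 100], normalised by the longer string (hypothetical helper). */
    static double percentDifferent(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0; // two empty strings are identical
        }
        return 100.0 * editDistance(a, b) / longer;
    }

    public static void main(String[] args) {
        System.out.println(percentDifferent("abc", "cde"));          // 100.0
        System.out.println(percentDifferent("abc", "defghijklmno")); // 100.0
    }
}
```

Because the edit distance can never exceed the length of the longer string, the result stays within 0 to 100 even when the lengths differ a lot.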

2. Your `similarity` method returns a number between 0 and 1 (both ends inclusive) where one means that the strings are the same (edit distance is zero).

However, in your `authQuestion` method you treat it as if it returned a number between 0 and 100, as shown by this line:

``````if(re > 60){
``````

You need to change that to

``````if(re > .6){
``````

Or to

``````if(re * 100 > 60){
``````

3. Since you are using your entire S1 in the WHERE clause of your SQL query, it will either find a perfect match or return no result at all.

As mentioned by @ErwinBolwidt, if it returns nothing then `isQuestionAvailable` will always remain false. And if it returns a perfect match, then you are bound to get 100% similarity.

What you can do is use a substring of your S1 to search for questions that match that part.

You can make the following change in the `authQuestion` method:

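``````checkStmt.setString(1, question.substring(0, Math.min(20, question.length()))); // e.g. the first 20 characters; Math.min avoids an exception on shorter questions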
``````

Out of the results fetched, you can compare each result with your question for similarity.
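As a rough sketch of that comparison step, the fetched rows can be treated as a plain list of strings so the logic can be run without a database. The class `QuestionMatcher` and the method `anySimilar` are hypothetical names, and the 0.6 threshold is taken from the question. Note that the loop in the question overwrites `isQuestionAvailable` on every row, so only the last row decides the result; returning on the first match avoids that.

```java
import java.util.Arrays;
import java.util.List;

class QuestionMatcher {

    /** Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(prev[j - 1] + cost,
                        Math.min(prev[j], curr[j - 1]) + 1);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1], ignoring case; 1.0 means the strings are equal. */
    static double similarity(String s1, String s2) {
        String a = s1.toLowerCase();
        String b = s2.toLowerCase();
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 1.0; // both strings are empty
        }
        return (longer - editDistance(a, b)) / (double) longer;
    }

    /** True if at least one candidate is more similar to the question than the threshold. */
    static boolean anySimilar(List<String> candidates, String question, double threshold) {
        for (String candidate : candidates) {
            if (similarity(candidate, question) > threshold) {
                return true; // stop here so a later, dissimilar row cannot reset the result
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for the rows a query might return; the second row is deliberately unrelated.
        List<String> rows = Arrays.asList("How do we optimize tje performance?", "12345");
        System.out.println(anySimilar(rows, "How do we optimize the performance?", 0.6)); // true
    }
}
```

In `authQuestion`, the same idea would mean collecting or checking each `rs.getString(...)` value inside the `while (rs.next())` loop and breaking out as soon as one row passes the threshold.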