Longest Common Substring

The longest common substring problem asks for the longest string that appears as a contiguous substring in both of two given strings.

There is one key difference between the longest common subsequence and the longest common substring. In a substring, all the elements must be contiguous in the original string and appear in the same order as in that string. In a subsequence, the order must also be preserved, but some elements may be skipped, so the elements need not be contiguous.

Let's understand through an example.

Consider two strings given below:

S1: a b c d a f

S2: b c d f

On comparing the above two strings, we will find that:

The longest common substring is bcd.

The longest common subsequence is bcdf.
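The difference can be checked directly. The following is a minimal Python sketch; the helper names is_substring and is_subsequence are made up for this illustration and are not part of the original tutorial.

```python
def is_substring(candidate: str, text: str) -> bool:
    """True if candidate appears in text as one contiguous run."""
    return candidate in text


def is_subsequence(candidate: str, text: str) -> bool:
    """True if candidate's characters appear in text in order, gaps allowed."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later characters can only match further to the right.
    return all(ch in remaining for ch in candidate)


s1, s2 = "abcdaf", "bcdf"
print(is_substring("bcd", s1), is_substring("bcd", s2))      # True True
print(is_substring("bcdf", s1), is_subsequence("bcdf", s1))  # False True
```

Here "bcdf" is a subsequence of S1 (the 'a' between 'd' and 'f' is skipped) but not a substring of it, which is why the two answers differ.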

As another example, consider the two strings given below:

S1: ABABCD

S2: BABCDA

On comparing the above two strings, we will find that BABCD is the longest common substring.

For long strings, finding the longest common substring by inspection, or by checking every possible substring, is impractical. So, we use the dynamic programming approach to solve this problem.
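To see why checking every substring becomes expensive, here is a rough brute-force sketch in Python. The function name longest_common_substring_bruteforce is an assumption made for this illustration; it tries slices of one string and searches for each of them in the other.

```python
def longest_common_substring_bruteforce(s1: str, s2: str) -> str:
    """Try slices of s1 and keep the longest one that also occurs in s2."""
    best = ""
    for start in range(len(s1)):
        # Only try slices that would be longer than the current best.
        for end in range(start + len(best) + 1, len(s1) + 1):
            if s1[start:end] in s2:
                best = s1[start:end]
            else:
                # If this slice is not in s2, no longer slice
                # from the same start can be in s2 either.
                break
    return best


print(longest_common_substring_bruteforce("ABABCD", "BABCDA"))  # BABCD
print(longest_common_substring_bruteforce("abcdaf", "bcdf"))    # bcd
```

Each `in` test is itself a scan over s2, so the total work grows quickly with the string lengths. The table-based approach avoids repeating these scans.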

Algorithm

Consider two strings given below:

S1: a b c d a f

S2: z b c d f

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0
b  0
c  0
d  0
f  0

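In the above table, the first row represents the first string, i.e., S1, and the first column represents the second string, i.e., S2. An extra row and an extra column of zeros border the table. In the steps below, i is the row index (a position in S2) and j is the column index (a position in S1), both starting from 0.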

When i=0, j=0 where S2[0] = z, S1[0] = a

Since 'z' and 'a' are different, no common substring ends at these two characters, so the value at S[0][0] is 0.

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0
b  0
c  0
d  0
f  0

When i=0, j=1 where S2[0] = z, S1[1] = b

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0
b  0
c  0
d  0
f  0

When i=0, j=2 where S2[0] = z, S1[2] = c

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0
b  0
c  0
d  0
f  0

When i=0, j=3 where S2[0] = z, S1[3] = d

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0
b  0
c  0
d  0
f  0

Similarly, we fill the remaining two columns of this row, and the table becomes:

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0
c  0
d  0
f  0

When i=1, j=0 where S2[1] = b, S1[0] = a

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0
c  0
d  0
f  0

When i=1, j=1 where S2[1] = b, S1[1] = b

Since both characters are 'b', the common substring ending at this cell is "b", so we put 1 at S[1][1], as shown below:

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1
c  0
d  0
f  0

When i=1, j=2 where S2[1] = b, S1[2] = c

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0
c  0
d  0
f  0

Since 'b' and 'c' are not the same, we put 0 at S[1][2].

When i=1, j=3 where S2[1] = b, S1[3] = d

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0
c  0
d  0
f  0

Since 'b' and 'd' are not the same, we put 0 at S[1][3].

When i=1, j=4 where S2[1] = b, S1[4] = a

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0
c  0
d  0
f  0

When i=1, j=5 where S2[1] = b, S1[5] = f

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0
d  0
f  0

Since 'b' and 'f' are not the same, we put 0 at S[1][5].

When i=2, j=0 where S2[2] = c and S1[0] = a

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0
d  0
f  0

Since 'c' and 'a' are not the same, we put 0 at S[2][0].

When i=2, j=1 where S2[2] = 'c' and S1[1] = 'b'

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0  0
d  0
f  0

Since 'c' and 'b' are not the same, we put 0 at S[2][1].

When i=2, j=2 where S2[2] = 'c' and S1[2] = 'c'

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0  0  2
d  0
f  0

Since both characters are 'c', we add 1 to the diagonal value S[1][1] = 1 and put 2 at S[2][2]. The common substring ending here is "bc", which is shared by the prefixes "zbc" and "abc", and its length is 2.
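The rule used so far for a single cell can be sketched in Python as follows. This is only an illustration: the function name cell_value is hypothetical, and the table here is stored without the extra border of zeros, so a missing diagonal neighbour is treated as 0.

```python
def cell_value(table, s1, s2, i, j):
    """Value for row i (character s2[i]) and column j (character s1[j])."""
    if s2[i] != s1[j]:
        # Different characters: no common substring ends at this cell.
        return 0
    # Same characters: extend the common substring that ended at the
    # diagonal neighbour (or start a new one at the table's edge).
    diagonal = table[i - 1][j - 1] if i > 0 and j > 0 else 0
    return diagonal + 1


s1, s2 = "abcdaf", "zbcdf"
# Rows for 'z' and 'b' as filled in the steps above.
table = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
print(cell_value(table, s1, s2, 2, 2))  # 2  ('c' matches 'c', diagonal is 1)
print(cell_value(table, s1, s2, 2, 3))  # 0  ('c' differs from 'd')
```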

When i=2, j=3 where S2[2] = 'c' and S1[3] = 'd'

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0  0  2  0
d  0
f  0

Since 'c' and 'd' are not the same, we put 0 at S[2][3].

When i=2, j=4 where S2[2] = 'c' and S1[4] = 'a'

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0  0  2  0  0
d  0
f  0

Since 'c' and 'a' are not the same, we put 0 at S[2][4].

When i=2, j=5 where S2[2] = 'c' and S1[5] = 'f'

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0  0  2  0  0  0
d  0
f  0

Since 'c' and 'f' are different, we put 0 at S[2][5].

When i=3, j=0 where S2[3] = 'd' and S1[0] = 'a'

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0  0  2  0  0  0
d  0  0
f  0

Since 'd' and 'a' are different, we put 0 at S[3][0].

When i=3, j=1 where S2[3] = 'd' and S1[1] = 'b'

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0  0  2  0  0  0
d  0  0  0
f  0

Since 'd' and 'b' are not the same, we put 0 at S[3][1].

When i=3, j=2 where S2[3] = 'd' and S1[2] = 'c'

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0  0  2  0  0  0
d  0  0  0  0
f  0

Since 'd' and 'c' are not the same, we put 0 at S[3][2].

When i=3, j=3 where S2[3] = 'd' and S1[3] = 'd'

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0  0  2  0  0  0
d  0  0  0  0  3
f  0

Since both characters are 'd', we add 1 to the diagonal value S[2][2] = 2 and put 3 at S[3][3]. The common substring ending here is "bcd", which is shared by the prefixes "abcd" and "zbcd", and its length is 3.

Similarly, we calculate the values of the remaining two columns, i.e., S[3][4] and S[3][5], as shown in the table below:

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0  0  2  0  0  0
d  0  0  0  0  3  0  0
f  0

The final table would be:

      a  b  c  d  a  f
   0  0  0  0  0  0  0
z  0  0  0  0  0  0  0
b  0  0  1  0  0  0  0
c  0  0  0  2  0  0  0
d  0  0  0  0  3  0  0
f  0  0  0  0  0  0  1
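The whole table can be filled with two nested loops. The sketch below is one possible Python version, not necessarily the original author's implementation. The name build_table is an assumption; the extra row and column of zeros are stored at index 0, so table cell [i][j] compares s2[i - 1] with s1[j - 1].

```python
def build_table(s1: str, s2: str):
    """Fill the table row by row; rows follow s2, columns follow s1."""
    rows, cols = len(s2), len(s1)
    # Row 0 and column 0 are the border of zeros.
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if s2[i - 1] == s1[j - 1]:
                # Matching characters extend the diagonal neighbour by 1.
                table[i][j] = table[i - 1][j - 1] + 1
            # Otherwise the cell keeps its initial value 0.
    return table


table = build_table("abcdaf", "zbcdf")
for label, row in zip(" zbcdf", table):
    print(label, *row)
print(max(max(row) for row in table))  # 3
```

The printed rows match the final table above, and the largest entry, 3, is the length of the longest common substring. Filling the table takes time proportional to the product of the two string lengths.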

The largest value in the above table is 3, so the length of the longest common substring is 3. We can also find the longest common substring itself from the table. First, we move to the cell having the highest value, i.e., 3; the character corresponding to 3 is 'd'. We then move diagonally up and to the left, where the value is 2 and the corresponding character is 'c'. We move diagonally again, where the value is 1 and the corresponding character is 'b'. The next diagonal value is 0, so we stop. Reading the collected characters in reverse order, the substring is "bcd".
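The walk back along the diagonal can be sketched in Python as below. The function name longest_common_substring and the variables best_i and best_j are assumptions made for this example. When several substrings share the maximum length, this version returns the first one found while filling the table.

```python
def longest_common_substring(s1: str, s2: str) -> str:
    """Fill the table, remember the largest cell, then walk back diagonally."""
    rows, cols = len(s2), len(s1)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    best_len, best_i, best_j = 0, 0, 0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if s2[i - 1] == s1[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > best_len:
                    best_len, best_i, best_j = table[i][j], i, j

    # Collect characters from the largest cell back to the first 0.
    chars = []
    i, j = best_i, best_j
    while i > 0 and j > 0 and table[i][j] > 0:
        chars.append(s2[i - 1])
        i -= 1
        j -= 1
    # Characters were collected last-to-first, so reverse them.
    return "".join(reversed(chars))


print(longest_common_substring("abcdaf", "zbcdf"))   # bcd
print(longest_common_substring("ABABCD", "BABCDA"))  # BABCD
```

Since the largest cell already gives the end position and the length, the same result could also be obtained by slicing s2 directly; the loop is kept here to mirror the diagonal walk described above.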





