Shortest Common Super-Sequence

Problem Statement

Given two strings str1 and str2, find the shortest common super-sequence.

The shortest common super-sequence is the shortest string that contains both str1 and str2 as subsequences.

Subsequence = You can derive it by deleting zero or more characters (without changing order).
Super-sequence = A string that has the original as a subsequence (may include extra characters).

Explanation

Let's say:

A = “abc”

Then aXbYcZ, abc, and aabc are super-sequences of abc because you can remove some characters and still get abc.

But, acb is not a super-sequence of abc because order is not maintained.

Examples

Example 1:

Input: str1 = "mno", str2 = "nop"
Output: "mnop"

Explanation: The shortest common supersequence is "mnop". It contains "mno" as the first three characters and "nop" as the last three characters, thus including both strings as subsequences.

Example 2:

Input: str1 = "dynamic", str2 = "program"
Output: "dynprogramic"

Explanation: The shortest common supersequence is "dynamprogramic". It includes all characters from both "dynamic" and "program", with minimal overlap. For example, "dynamic" appears as "dynam...ic" and "program" appears as "...program..." within "dynamprogramic".

Example 3:

Input: str1 = "abac", str2 = "cab"
Output: cabac

Explanation: _ab__

Example 4:

Input: str1 = "geek", str2 = "eke"
Output: geeke

Different Approaches

1️⃣ Tabulation

Intuition:

If we keep the “shortest” criteria aside, what can be a way to generate a super-sequence given two strings. One easy way is to con-cat the given strings (write one after the other), this will always give us a super-sequence for any pair of given strings.

This can be said as the worst case with time complexity of O(n+m), where n and are the lengths of strings str1 and str2 respectively.

How can we improve the naive approach:

If we observe, there are some common characters that we can avoid writing for both the strings separately. These common characters can’t be all the common characters. They are the characters that are common and come in the same order. In other words, they are the characters of the longest common subsequence.

In an optimum solution, the characters of the longest common subsequence are written only once and other characters are placed around them. For every character that belongs to the longest common subsequence, the non-lcs characters coming before them in the strings S1 and S2 are placed before the lcs-character in the answer string. The below figure explains this:

Length of Shortest Common Supersequence(lcs):

From the explanation above, we can see that characters of lcs need to be covered only once. Therefore, the length of the shortest Common supersequence = n + m -k, where (n and m are lengths of str1 and str2, and k is the length of the lcs string).

Finding the supersequence string:

Now, instead of the length, we are interested in finding the shortest supersequence string itself.
When we form the DP table to calculate the longest common subsequence, we have all the information of characters that are coming in the lcs string and characters that don’t. We use this same DP table to form the shortest common supersequence.

To frame the string, we need to understand how the dp table was formed and work in the reverse process.

Now, let us see what were the conditions that we used while forming the dp array:

If(str1[i-1] == str2[j-1]), then return 1 + dp[i-1][j-1]
If(str1[i-1] != str22[j-1]) , then return 0 + max(dp[i-1][j],dp[i][j-1])

Approach:

We will start from the right-most cell of the dp array, initially i=n and j=m. To form the string, we will work in a reverse manner.
1. If(str1[i-1] == str2[j-1]), this means the character is an lcs character and needs to be included only once from both the strings, so we add it to the ans string and reduce both i and j by 1. We reduce them simultaneously to make sure the character is counted only once.
2. If(str1[i-1] != str2[j-1]), this means that the character is a non-lcs character and then we move the pointer to the top cell or left cell depending on which is greater. This way non-lcs characters will be included separately in the right order.

The algorithm steps are stated below:

We start from cell dp[n][m]. Initially i=n and j=m.

At every cell, we will check if S1[i-1] == S2[j-1], if it is then it means this character is a part of the longest common subsequence. So we will push it to the ans string str. Then we will move to the diagonally top-left(↖) cell by assigning i to i-1 and j to j-1.
Else, this character is not a part of the longest common subsequence so we include it in ans string. Originally this cell got its value from its left cell (←) or from its top cell (↑). Whichever cell’s value will be more of the two, we will move to that cell.
We will continue till i>0 and j>0, failing it we will break from the loop.
After breaking, either i>0 or j>0 (only one condition will fail to break from the while loop), if(i>0) we push all the characters from S1 to ans string, else if(j>0), we push all the remaining characters from S2.
At last, we reverse the ‘ans’ string and we get our answer.

Dry Run:

In the beginning i=5 and j=5.

As S1[4] != S2[4], we move to the top cell(↑) as its value is greater than the left cell(←) but before moving as we are leaving row 5(i=5) and will not return to it so we add S1[4] to our ans string.

As S1[3] == S2[4], this is an lcs-character. We add the current character to the ans string(and move to i-1 and j-1 cell) i.e top-left(↖). Reducing i and j simultaneously helps in adding the lcs character only once in the ans string.

As S1[2] != S2[3], we move to the left cell(←) as its value is greater than or equal to the top cell(↑) but before moving as we are leaving column 4 (j=4) and will not return to it so we add S2[3] to our ans string.

As S1[2] != S2[2], we move to the left cell(←) as its value is greater than or equal to the top cell(↑) but before moving as we are leaving column 3 (j=3) and will not return to it so we add S2[2] to our ans string.

As S1[2] != S2[1], we move to the top cell(↑) as its value is greater than the left cell(←) but before moving as we are leaving row 3(i=3) and will not return to it so we add S1[2] to our ans string.

As S1[1] == S2[1], this is an lcs-character. We add the current character to the ans string(and move to i-1 and j-1 cell) i.e top-left(↖).

As S1[2] != S2[2], we move to the left cell(←) as its value is greater than or equal to the top cell(↑) but before moving as we are leaving column 1 (j=1) and will not return to it so we add S2[0] to our ans string.

As j=0, we have reached the exit condition of the while loop, so we will break from it but still there are some characters left in the other string. We will simply add them in the ans string.
At last, we will return the reverse of the ans string as the final answer.

Code:

#include <bits/stdc++.h>
using namespace std;

class Solution {
public:
    // Function to find the shortest common supersequence
    string shortestCommonSupersequence(string str1, string str2) {
        int len1 = str1.size(); // Length of the first string
        int len2 = str2.size(); // Length of the second string

        // Create a DP table of size (len1+1) x (len2+1) initialized to 0
        vector<vector<int>> dp(len1 + 1, vector<int>(len2 + 1, 0));

        // Fill the DP table to find the length of the LCS (Longest Common Subsequence)
        for (int i = 1; i <= len1; i++) {
            for (int j = 1; j <= len2; j++) {
                if (str1[i - 1] == str2[j - 1]) {
                    // If characters match, add 1 to the previous LCS length
                    dp[i][j] = 1 + dp[i - 1][j - 1];
                } else {
                    // Otherwise, take the maximum of excluding one character from either string
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }

        // Reconstruct the shortest common supersequence (SCS) from the DP table
        string scs = ""; // Result string
        int i = len1;    // Pointer for str1
        int j = len2;    // Pointer for str2

        while (i > 0 && j > 0) {
            if (str1[i - 1] == str2[j - 1]) {
                // If characters match, add to SCS and move diagonally in the table
                scs += str1[i - 1];
                i--;
                j--;
            } else if (dp[i - 1][j] > dp[i][j - 1]) {
                // If excluding the character from str1 gives a longer LCS, add that character to SCS
                scs += str1[i - 1];
                i--;
            } else {
                // Otherwise, exclude the character from str2
                scs += str2[j - 1];
                j--;
            }
        }

        // Add any remaining characters from str1
        while (i > 0) {
            scs += str1[i - 1];
            i--;
        }

        // Add any remaining characters from str2
        while (j > 0) {
            scs += str2[j - 1];
            j--;
        }

        // Since the SCS is built in reverse order, reverse it to get the correct result
        reverse(scs.begin(), scs.end());
        return scs;
    }
};

int main() {
    string str1 = "brute";
    string str2 = "groot";

    // Create an instance of the Solution class
    Solution solution;

    // Call the function and print the result
    cout << "The Shortest Common Supersequence is: " 
         << solution.shortestCommonSupersequence(str1, str2) << endl;

    return 0;
}

Complexity Analysis:

Time Complexity: O(N*M), Where N is the length of string str1 and M is the length of string str2. As there are two nested loops, the time complexity will be O(N*M).
Space Complexity: O(N*M), We are using an external array of size (N*M).

Problem Statement

Explanation

Examples

Different Approaches

1️⃣ Tabulation

Intuition:

How can we improve the naive approach:

Length of Shortest Common Supersequence(lcs):

Finding the supersequence string:

Approach:

Dry Run:

Code:

Complexity Analysis:

Leave a comment

Popular Posts

Variadic Function Working in C

How Characters are Stored in Memory

Tags

Quick links

Newsletter

🛠️ Settings

Problem Statement

Explanation

Examples

Different Approaches

1️⃣ Tabulation

Intuition:

How can we improve the naive approach:

Length of Shortest Common Supersequence(lcs):

Finding the supersequence string:

Approach:

Dry Run:

Code:

Complexity Analysis:

Leave a comment

Variadic Function Working in C

How Characters are Stored in Memory