Wednesday, 29 April 2020

Suffix Arrays #1 Longest Common Substring

A Story with Strings Problem Code: SSTORY

Problem Statement: here

Solution:
This is the first problem with suffix arrays. so I will give a little boring lecture on it. So, bare with me.

Suffix Arrays

This is another data structure that operates on strings. It is a much lighter version, also known as a compressed version of suffix trees, which is itself a special case of trie data structure. 
Let us come out of the ocean. Suffix Array is a data structure that stores the suffix strings in lexicographically shortest order. Note, we do not store the strings, instead, we just store the starting index of our suffix. Using the starting index, obviously, we can create the entire suffix string.

Use cases:
We will see many different applications of suffix arrays.
Today, as you saw in the problem statement, we are finding the Longest Common Substring of given 2 strings. 

We will construct the suffix array in O(nlogn) time. You can also use O(nlog^2n) to form the suffix array. After this, we can construct the LCP table of this suffix array in O(n) time. 
The LCP table tells us the longest common prefix of suffixes of our string. This construction is widely used in many questions, which involve operations with substrings.  
The way LCP is constructed is that we look at the suffix string present at index i in our suffix array and compare it with i+1 suffix string. Then, the longest common prefix among them is the value of LCP[i].

So, in our question, we can form a single string s = s2 + '#' + s1. Please note that the senital character must have precedence less than any other value present in s1 or s2. This will make sure, that it comes at the very start of our suffix array, and we can then continue to move on from the next index.

Once we have computed the LCP table,  we know that LCP[i] gives us the longest common prefix for suffixes starting at suffix[i] and suffix[i+1]. To make sure that these suffixes belong to 2 different strings, we can check whether the index at suffix[i] is less than the length of s2 or index of suffix[i+1] less than the length of s2 and the other greater than s2. Since, suffix[i] stores the index of the start of suffix string, this is possible to do!

Also, once you implement, you will notice some failures because of some test cases. 
Try these tests then and you will figure out the mistake.
Test 01:
abcadlfghijkl
abcdlfabcde
Test 02:
abcmdef
abcdefabcdl

My code:

#include<bits/stdc++.h> using namespace std; #include <ext/pb_ds/assoc_container.hpp> #include <ext/pb_ds/tree_policy.hpp> using namespace __gnu_pbds; #define M1 1000000007 #define M2 998244353 #define ll long long int #define pll pair<ll,ll> #define mll map<ll,ll> #define F first #define S second #define PB push_back #define mp make_pair #define lb lower_bound #define ub upper_bound #define V(a) vector<a> #define endl '\n' #define test(t) while(t--) #define PI acos(-1.0) #define rep(i,a,b) for(ll i=(ll)a;i<(ll)b;i++) #define repp(i,b,a) for(ll i=(ll)(b-1);i>=(ll)a;i--) #define clr(ar, val) memset(ar, val, sizeof(ar)) #define setbits(x) __builtin_popcountll(x) #define zrobits(x) __builtin_ctzll(x) #define ps(y) fixed << setprecision(y) #define all(x) begin(x),end(x) #define allr(x) rbegin(x),rend(x) const int inf= 0x3f3f3f3f; const ll INF= 0x3f3f3f3f3f3f3f3f; const int dx[4]= { 0, -1, 0, 1 }; const int dy[4]= { -1, 0, 1, 0 }; vector<int> suffix_array_cyclic(string s){ int n = s.size(), N = n + 256; vector<int> sa(n), ra(n); for(int i = 0; i < n; i++) sa[i] = i, ra[i] = s[i]; for(int k = 0; k < n; k ? k *= 2 : k++) { vector<int> nsa(sa), nra(n), cnt(N); for(int i = 0; i < n; i++) nsa[i] = (nsa[i] - k + n) % n; for(int i = 0; i < n; i++) cnt[ra[i]]++; for(int i = 1; i < N; i++) cnt[i] += cnt[i - 1]; for(int i = n - 1; i >= 0; i--) sa[--cnt[ra[nsa[i]]]] = nsa[i]; int r = 0; for(int i = 1; i < n; i++) { if(ra[sa[i]] != ra[sa[i - 1]]) r++; else if(ra[(sa[i] + k) % n] != ra[(sa[i - 1] + k) % n]) r++; nra[sa[i]] = r; } ra = nra; } return sa; } vector<int> kasai(string s, vector<int> sa) { int n = s.size(), k = 0; vector<int> ra(n), lcp(n); for(int i = 0; i < n; i++) ra[sa[i]] = i; for(int i = 0; i < n; i++) { if(k) k--; if(ra[i] == n - 1) {k = 0; continue;} int j = sa[ra[i] + 1]; while(k < n && s[(i + k) % n] == s[(j + k) % n]) k++; lcp[ra[i]] = k; if(ra[(sa[ra[i]] + 1) % n] > ra[(sa[ra[j]] + 1) % n]) k = 0; } return lcp; } void solve() { string s1,s2; cin >> s1 >> s2; int l2 = s2.size(); string s = s2 + '#' + s1 + '$'; vector<int> v = suffix_array_cyclic(s); vector<int> lcp = kasai(s,v); int mx = 0; map<string,bool> sm; for(int i=0;i<v.size()-1;i++){ if (min(v[i],v[i+1]) < l2 and max(v[i],v[i+1])>l2){ mx = max(mx,lcp[i]); } } if (mx == 0){ cout << 0 << endl; return; } for(int i=0;i<v.size()-1;i++){ if (min(v[i],v[i+1]) < l2 and max(v[i],v[i+1])>l2){ if (lcp[i] == mx){ sm[s.substr(max(v[i],v[i+1]),mx)] = true; } } } string fans = s2.substr(0,mx); if (sm[fans]){ cout << fans << endl; cout << mx << endl; return; } for(int i=mx;i<=s2.size();i++){ fans.erase(fans.begin()); fans += s2[i]; if (sm[fans]){ cout << fans << endl; cout << mx << endl; return; } } } int32_t main(){ ios_base::sync_with_stdio(0);cin.tie(0);cout.tie(0); int t=1; // cin>>t; test(t){ solve(); } return 0; }

Time Complexity: O(nlogn)

What next??

This problem is exactly the above one, but with my approach only you will be able to pass all tests, the time restrictions are tight. Although, this is a simpler version of the problem we discussed because it only asks for the maximum size of the substring. So, I won't discuss this problem.

Want a little challenging task?
We will be discussing this problem soon. 

No comments:

Post a Comment