-->

Sunday, December 25, 2016

187. Repeated DNA Sequences

https://leetcode.com/problems/repeated-dna-sequences/

Solution:
It is straightforward to use the "brute force" hash map method plus a sliding window with size of 10-letter. But something else can be applied to reduce the space complexity since it is too expensive when using string as the hash key.
We just need to distinguish four letters, so bitwise operation may be used and two bits are enough for four, i.e. 'A' can be 00, 'C' for 01, 'G' for 10, and 'T' for 11.
Two details need to consider:
1) : a mask may need to set all the bits beyond the first 20 from the left to be 0. (since we only need to consider the "10-letter" sub-sequences and each letter need 2 bits). 0xFFFFF represents the first 20 bits from the left are 1s, and all the rest are 0s. (F in hexadecimal is equivalent to 1111 in binary);
2): need to remove the possible duplicates. Can use set, but actually can also use hash map, see below.

Code:
class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> res;
        unordered_map<char, int> m1{{'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};
        unordered_map<int, int> m2;
        int i = 0, t = 0, mask = 0xfffff;  // the lower 20 bits are 1s;
        while(i<9) t = (t<<2) + m1[s[i++]];
        while(i<s.size()){
            t = ((t<<2) & mask) | m1[s[i++]];
            if(m2.find(t) == m2.end()) m2[t] = 1;
            else{ 
                if(m2[t] == 1){
                    res.push_back(s.substr(i-10, 10));
                    m2[t] = 0;
                }
            }
        }
        return res;
    }
};

No comments :

Post a Comment