All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Understand the problem:
The naive approach is to use a sliding window, with size of 10. Put the visited string into a map. If the substring is visited again, add into the result.
Code (Java):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | public class Solution { public List<String> findRepeatedDnaSequences(String s) { List<String> result = new ArrayList<String>(); if (s == null || s.length() < 10 ) { return result; } Map<String, Integer> map = new HashMap<String, Integer>(); int lo = 0 ; int hi = 9 ; while (hi < s.length()) { String substring = s.substring(lo, hi + 1 ); if (map.containsKey(substring) && map.get(substring) == 1 ) { result.add(substring); map.put(substring, map.get(substring) + 1 ); } else { map.put(substring, 1 ); } lo++; hi++; } return result; } } |
The code got memory limit exceeding error. Why? Because the space complexity is O(10 * n). We need to think of a better approach with less memory cost.
Use bit manipulation:
Since the string contains only 4 characters, we could encode the characters using 2 bits each, i.e,
'A' -- 00
'C' -- 01
'G' -- 10
'T' -- 11
And since the length of the substring is only 10 characters, the total number of bits needed are only 20. Therefore we could encode a string into a integer.
Code (Java):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | public class Solution { private Map<Character, Integer> codingMap = new HashMap<Character, Integer>(); private int encode(String s) { int value = 0 ; for ( int i = 0 ; i < s.length(); i++) { value = (value << 2 ) + codingMap.get(s.charAt(i)); } return value; } private void fill() { codingMap.put( 'A' , 0 ); codingMap.put( 'C' , 1 ); codingMap.put( 'G' , 2 ); codingMap.put( 'T' , 3 ); } public List<String> findRepeatedDnaSequences(String s) { List<String> result = new ArrayList<String>(); if (s == null || s.length() < 10 ) { return result; } fill(); Map<Integer, Integer> map = new HashMap<Integer, Integer>(); int lo = 0 ; int hi = 9 ; while (hi < s.length()) { String substring = s.substring(lo, hi + 1 ); int hashCode = encode(substring); if (map.containsKey(hashCode) && map.get(hashCode) == 1 ) { result.add(substring); map.put(hashCode, map.get(hashCode) + 1 ); } else { if (!map.containsKey(hashCode)) { map.put(hashCode, 1 ); } else { map.put(hashCode, map.get(hashCode) + 1 ); } } lo++; hi++; } return result; } } |
Analysis:
Now let's analyze the space cost. Since in Java, each character takes 2 bytes. For the previous solution using string, a 10-character substring takes 20 byte. For using the integer, which takes only 4 bytes. So the new solution saves the memory by 1/5.
No comments:
Post a Comment