In real-world NLP problems we often meet texts with a lot of typos. As a result, we are unable to reach the best score. As painful as it may be, the data should be cleaned before fitting. We need an automatic spelling corrector which can fix words with typos and, at the same time, not break correct spellings. Let's start with Norvig's spelling corrector and iteratively increase its capabilities.

Peter Norvig (director of research at Google) described the following approach to spelling correction. Take a word and brute-force all possible edits, such as delete, insert, transpose, replace and split. For example, for the word "abc" the possible candidates will be: ab ac bc bac cba acb a_bc ab_c aabc abbc acbc adbc aebc etc. Every such edit is added to the candidate list. We repeat this procedure a second time for every candidate, to get candidates with a bigger edit distance (for cases with two errors). Each candidate is then estimated with a unigram language model: for each vocabulary word a frequency is pre-calculated from some big text collection, and the candidate with the highest frequency is taken as the answer.
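A minimal sketch of this candidate generation, close to Norvig's published corrector (the split edit is added here since it is listed above; counts stands for the pre-calculated word frequencies and is an illustrative name):

import string

def edits1(word):
    # all candidates one edit away: delete, transpose, replace, insert, split
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    letters = string.ascii_lowercase
    deletes = [L + R[1:] for L, R in pairs if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in pairs if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in pairs if R for c in letters]
    inserts = [L + c + R for L, R in pairs for c in letters]
    splits = [L + ' ' + R for L, R in pairs if L and R]
    return set(deletes + transposes + replaces + inserts + splits)

def correction(word, counts):
    # prefer the word itself, then distance-1, then distance-2 candidates
    candidates = ({word} if word in counts else None) \
        or {c for c in edits1(word) if c in counts} \
        or {c2 for c in edits1(word) for c2 in edits1(c) if c2 in counts} \
        or {word}
    return max(candidates, key=lambda c: counts.get(c, 0))

Note that the distance-2 pass simply re-applies edits1 to every distance-1 candidate, which is exactly the "second time" step described above.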
Adding some context

The first improvement is adding an n-gram language model (3-grams). Let's pre-calculate frequencies not only for single words, but for a word together with a small context (the 3 nearest words). Then we can estimate the probability of a text fragment as the product of the probabilities of all its 1-, 2- and 3-grams. The sentence probability can be calculated like this (assuming math is imported at the top of the module):

def predict(self, sentence):
    result = 0
    for i in range(0, len(sentence) - 2):
        # summing log-probabilities == log of the product of n-gram probabilities
        p1 = self.getGram1Prob(sentence[i])
        p2 = self.getGram2Prob(sentence[i], sentence[i + 1])
        p3 = self.getGram3Prob(sentence[i], sentence[i + 1], sentence[i + 2])
        result += math.log(p1) + math.log(p2) + math.log(p3)
    return result

And the n-gram probabilities like this (K is a small additive-smoothing constant; gram1, gram2 and gram3 are the pre-calculated frequency dicts):

def getGram1Prob(self, wordID):
    wordCounts = self.gram1.get(wordID, 0) + SimpleLangModel.K
    vocabSize = len(self.gram1)
    return float(wordCounts) / (self.totalWords + vocabSize)

def getGram2Prob(self, wordID1, wordID2):
    countsWord1 = self.gram1.get(wordID1, 0) + self.totalWords
    countsBigram = self.gram2.get((wordID1, wordID2), 0) + SimpleLangModel.K
    return float(countsBigram) / countsWord1

def getGram3Prob(self, wordID1, wordID2, wordID3):
    countsGram2 = self.gram2.get((wordID1, wordID2), 0) + self.totalWords
    countsGram3 = self.gram3.get((wordID1, wordID2, wordID3), 0) + SimpleLangModel.K
    return float(countsGram3) / countsGram2

Now we can use our extended language model to estimate candidates with context.

However, the model becomes really huge, and everything works slowly. To get the best possible accuracy we need a big dataset (at least a few gigabytes). Training the n-gram model on a 600 MB text file already leads to significant memory consumption (25 GB). Half of that size is used by the language model, and the other half by the SymSpell index. One reason for such high memory usage is that we don't store the plain text; instead we store frequencies. For example, for the following text of 5 words: "a b c a b" we store the following frequencies:

a => 2
b => 2
c => 1
a b => 2
b c => 1
c a => 1
a b c => 1
b c a => 1
c a b => 1
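For illustration, here is a tiny counter (a hypothetical helper, not the project's code) that reproduces the frequency table above:

from collections import Counter

def count_ngrams(words, max_n=3):
    # count every 1-, 2- and 3-gram occurring in the word sequence
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[' '.join(words[i:i + n])] += 1
    return counts

Calling count_ngrams('a b c a b'.split()) yields exactly the nine entries listed above, e.g. 'a b' => 2 and 'b c a' => 1.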
Another reason is the high memory overhead of the hash table data structure (a hash table is used inside Python's dict and C++'s unordered_map).

To compress our n-gram model, let's use the approach described in the Efficient Minimal Perfect Hash Language Models paper. Let's use a perfect hash (Compress, Hash and Displace) to store the n-gram counts. A perfect hash is a hash function that guarantees no collisions on a known key set. Without collisions it's possible to store only the values (the count frequencies) and not the original n-grams. To ensure that the hash of an unknown word doesn't match an existing one, we use a bloom filter over the known words.
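To make the storage idea concrete, here is a toy sketch. It is not the CHD construction from the paper: it simply brute-forces a seed that makes a CRC-based hash collision-free on a small key set, keeps only a value array (the n-grams themselves are never stored), and uses a one-byte fingerprint per slot to reject unknown keys, standing in for the bloom filter above. All names are illustrative, and the brute-force search is only feasible for tiny key sets:

import zlib

def h(key, seed, size):
    return zlib.crc32(key.encode(), seed) % size

def build(counts):
    # find a seed that makes h collision-free on the keys (toy stand-in for CHD)
    size = len(counts)
    for seed in range(1, 1 << 20):
        if len({h(k, seed, size) for k in counts}) == size:  # perfect: no collisions
            values = [0] * size
            fps = [0] * size  # one-byte fingerprints to reject unknown keys
            for k, v in counts.items():
                i = h(k, seed, size)
                values[i] = v
                fps[i] = zlib.crc32(k.encode(), seed ^ 0xFFFF) & 0xFF
            return seed, values, fps
    raise RuntimeError('no collision-free seed found')

def lookup(key, seed, values, fps):
    i = h(key, seed, len(values))
    if zlib.crc32(key.encode(), seed ^ 0xFFFF) & 0xFF != fps[i]:
        return 0  # rejected as unknown (false positives at a rate of about 1/256)
    return values[i]

Building it over the nine counts from the "a b c a b" example, lookup('a b', seed, values, fps) returns 2, while most unseen n-grams are rejected by the fingerprint, which is the same role the bloom filter plays in the real model.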
We can also use nonlinear quantization to pack the 32-bit count frequencies into 16-bit values. This does not affect the final metrics but reduces memory usage.
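One possible nonlinear scheme (an illustration, not necessarily the quantizer used in the actual model) stores counts on a logarithmic scale, so small counts keep fine resolution and large counts lose only a tiny fraction of their value:

import math

SCALE = 2048  # log2(2**32) * 2048 is about 65536, so 16 bits cover the full 32-bit range

def quantize(count):
    # pack a 32-bit count into a 16-bit code on a log scale
    return min(65535, int(round(math.log2(count + 1) * SCALE)))

def dequantize(code):
    # recover an approximate count; the relative error stays well under 0.1%
    return int(round(2 ** (code / SCALE) - 1))

Small counts like 1 and 2 round-trip exactly, and a count of 1,000,000 comes back within a few dozen, far below the noise level of n-gram frequencies.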