Package smile.nlp
Class SimpleCorpus
java.lang.Object
smile.nlp.SimpleCorpus
- All Implemented Interfaces:
Corpus
An in-memory text corpus. Useful for text feature engineering.
-
Constructor Summary
Constructors
| Constructor | Description |
| --- | --- |
| SimpleCorpus() | Constructor. |
| SimpleCorpus(SentenceSplitter splitter, Tokenizer tokenizer, StopWords stopWords, Punctuations punctuations) | Constructor. |
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| | add | Adds a document to the corpus. |
| int | avgDocSize() | Returns the average size of documents in the corpus. |
| | bigrams() | Returns the iterator over the bigrams in the corpus. |
| int | count | Returns the total frequency of the term in the corpus. |
| int | count | Returns the total frequency of the bigram in the corpus. |
| long | nbigram() | Returns the number of bigrams in the corpus. |
| int | ndoc() | Returns the number of documents in the corpus. |
| int | nterm() | Returns the number of unique terms in the corpus. |
| | search | Returns the iterator over the set of documents containing the given term. |
| | search(RelevanceRanker ranker, String term) | Returns the iterator over the set of documents containing the given term in descending order of relevance. |
| | search(RelevanceRanker ranker, String[] terms) | Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance. |
| long | size() | Returns the number of words in the corpus. |
| | terms() | Returns the iterator over the terms in the corpus. |
-
Constructor Details
-
SimpleCorpus
public SimpleCorpus()
Constructor.
-
SimpleCorpus
public SimpleCorpus(SentenceSplitter splitter, Tokenizer tokenizer, StopWords stopWords, Punctuations punctuations)
Constructor.
- Parameters:
splitter - the sentence splitter.
tokenizer - the word tokenizer.
stopWords - the set of stop words to exclude.
punctuations - the set of punctuation marks to exclude. Set to null to keep all punctuation marks.
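The parameters above describe a preprocessing pipeline: split text into sentences, tokenize each sentence, then drop stop words and punctuation marks. The following is a minimal standalone sketch of that idea, not the Smile implementation. The class `PreprocessSketch`, its toy word sets, the regex-based splitter and tokenizer, and the lowercasing step are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the preprocessing implied by the constructor
// parameters. It does not use or reproduce any Smile class.
final class PreprocessSketch {
    // Toy sets, assumed for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "is"));
    static final Set<String> PUNCTUATIONS = new HashSet<>(Arrays.asList(".", ",", "!", "?"));

    // Naive tokenizer: a run of letters/digits, or one non-space symbol.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    // Splits text into sentences, tokenizes each sentence, and filters the
    // tokens. A null punctuation set keeps all punctuation marks.
    static List<String> preprocess(String text, Set<String> stopWords, Set<String> punctuations) {
        List<String> words = new ArrayList<>();
        // Naive sentence splitter: break at whitespace that follows '.', '!' or '?'.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            Matcher m = TOKEN.matcher(sentence);
            while (m.find()) {
                // Lowercasing is an assumption so that "The" matches "the".
                String token = m.group().toLowerCase(Locale.ROOT);
                if (stopWords.contains(token)) continue;
                if (punctuations != null && punctuations.contains(token)) continue;
                words.add(token);
            }
        }
        return words;
    }
}
```

Passing `null` as the last argument mirrors the documented behavior of the `punctuations` parameter: every punctuation token is kept.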
-
-
Method Details
-
add
Adds a document to the corpus.
- Parameters:
text - the document text.
- Returns:
the document.
-
size
public long size()
Description copied from interface: Corpus
Returns the number of words in the corpus.
-
ndoc
public int ndoc()
Description copied from interface: Corpus
Returns the number of documents in the corpus.
-
nterm
public int nterm()
Description copied from interface: Corpus
Returns the number of unique terms in the corpus.
-
nbigram
public long nbigram()
Description copied from interface: Corpus
Returns the number of bigrams in the corpus.
-
avgDocSize
public int avgDocSize()
Description copied from interface: Corpus
Returns the average size of documents in the corpus.
- Specified by:
avgDocSize in interface Corpus
- Returns:
the average size of documents in the corpus.
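To make the relationship between these counters concrete, here is a small standalone sketch that computes comparable statistics over documents represented as lists of words. `CorpusStatsSketch` is a hypothetical class, not part of Smile, and it assumes that the average document size is the total word count divided by the number of documents, truncated to an `int`.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the corpus counters. Each document is a
// list of already-tokenized words.
final class CorpusStatsSketch {
    // Total number of words across all documents.
    static long size(List<List<String>> docs) {
        long n = 0;
        for (List<String> doc : docs) n += doc.size();
        return n;
    }

    // Number of documents.
    static int ndoc(List<List<String>> docs) {
        return docs.size();
    }

    // Number of unique terms across all documents.
    static int nterm(List<List<String>> docs) {
        Set<String> vocabulary = new HashSet<>();
        for (List<String> doc : docs) vocabulary.addAll(doc);
        return vocabulary.size();
    }

    // Assumption: integer average, i.e. total words / documents, truncated.
    static int avgDocSize(List<List<String>> docs) {
        return docs.isEmpty() ? 0 : (int) (size(docs) / docs.size());
    }
}
```

Under this assumption, two documents of three and two words have an average size of 2, since the result is an `int`.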
-
count
Description copied from interface: Corpus
Returns the total frequency of the term in the corpus.
-
count
Description copied from interface: Corpus
Returns the total frequency of the bigram in the corpus.
-
terms
Description copied from interface: Corpus
Returns the iterator over the terms in the corpus.
-
bigrams
Description copied from interface: Corpus
Returns the iterator over the bigrams in the corpus.
-
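Term and bigram frequencies of the kind described above can be pictured as two frequency maps. The sketch below is a standalone illustration, not Smile code. `FrequencySketch` is a hypothetical class, the `"w1 w2"` string key is an assumed bigram representation, and bigrams are formed from adjacent words within a document (the library may apply a narrower boundary, such as a sentence).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of term and bigram frequency counting.
final class FrequencySketch {
    // Counts how often each term occurs across all documents.
    static Map<String, Integer> termCounts(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (String word : doc) counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Counts adjacent word pairs; the key "w1 w2" is an assumed representation.
    static Map<String, Integer> bigramCounts(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (int i = 0; i + 1 < doc.size(); i++) {
                counts.merge(doc.get(i) + " " + doc.get(i + 1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Iterating over the key sets of these maps corresponds to iterating over the terms and bigrams, and a lookup with `getOrDefault(key, 0)` gives the total frequency of one entry.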
search
Description copied from interface: Corpus
Returns the iterator over the set of documents containing the given term.
-
search
Description copied from interface: Corpus
Returns the iterator over the set of documents containing the given term in descending order of relevance.
-
search
Description copied from interface: Corpus
Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
-
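The ranked multi-term search described above can be sketched as: keep every document that contains at least one query term, then sort by a relevance score in descending order. The code below is a standalone illustration under stated assumptions, not the Smile implementation. `SearchSketch` is a hypothetical class, and its score (the total number of query-term occurrences in a document) is a deliberately simple stand-in for whatever a `RelevanceRanker` computes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of "at least one term" search with ranking.
final class SearchSketch {
    // Assumed relevance score: total occurrences of the query terms in the document.
    static int score(List<String> doc, String[] terms) {
        int s = 0;
        for (String term : terms) s += Collections.frequency(doc, term);
        return s;
    }

    // Returns the indices of documents containing at least one term,
    // most relevant first. Ties keep their original document order.
    static List<Integer> search(List<List<String>> docs, String[] terms) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (score(docs.get(i), terms) > 0) hits.add(i);
        }
        hits.sort(Comparator.comparingInt((Integer i) -> score(docs.get(i), terms)).reversed());
        return hits;
    }
}
```

A single-term search is the same procedure with a one-element `terms` array, and a query that matches nothing yields an empty result.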