How do I use custom stopwords and stemmer file in WEKA (Java)?
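So far I have:
    import weka.core.Instances;
    import weka.core.tokenizers.NGramTokenizer;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    NGramTokenizer tokenizer = new NGramTokenizer();
    tokenizer.setNGramMinSize(2);
    tokenizer.setNGramMaxSize(2);
    tokenizer.setDelimiters("[\\w+\\d+]");
    StringToWordVector filter = new StringToWordVector();
    // customize filter here
    Instances data = Filter.useFilter(input, filter);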
The API has these two methods for StringToWordVector:
    setStemmer(Stemmer value);
    setStopwordsHandler(StopwordsHandler value);
I have a text file containing stopwords, and a class that stems words. How do I use a custom stemmer and stopwords filter? Note that I'm taking phrases of size 2, so I can't preprocess the text and remove the stopwords beforehand.
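Update: this worked for me (using the Weka developer version 3.7.12).
To use a custom stopwords handler:
    import java.util.HashSet;
    import weka.core.stopwords.StopwordsHandler;

    public class MyStopwordsHandler implements StopwordsHandler {
        // Initialized up front so isStopword never hits a null set
        private HashSet<String> myStopwords = new HashSet<String>();

        public MyStopwordsHandler() {
            // load in your own stopwords, etc.
        }

        // must implement this method from the StopwordsHandler interface
        public boolean isStopword(String word) {
            return myStopwords.contains(word);
        }
    }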
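The "load in your own stopwords" step could look something like the sketch below. It is only an illustration under assumptions: the class name `FileStopwords` is hypothetical, and the file format (one stopword per line, UTF-8, lines starting with `#` ignored) is my own choice, not something Weka prescribes for a custom handler. To keep the sketch compilable without Weka on the classpath it does not declare the interface; in a Weka 3.7.x project you would add `implements weka.core.stopwords.StopwordsHandler` to the class.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class. With Weka on the classpath, add
// "implements weka.core.stopwords.StopwordsHandler" to the declaration.
public class FileStopwords {
    private final Set<String> stopwords = new HashSet<>();

    // Assumed file format: one stopword per line, UTF-8;
    // blank lines and lines starting with '#' are skipped.
    public FileStopwords(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty() && !word.startsWith("#")) {
                    stopwords.add(word);
                }
            }
        }
    }

    // Same signature as the interface method shown above;
    // the comparison is case-insensitive by design of this sketch.
    public boolean isStopword(String word) {
        return stopwords.contains(word.toLowerCase(Locale.ROOT));
    }
}
```

Lowercasing on both load and lookup is a design choice here; drop it if your pipeline already lowercases tokens or if case matters for your stopword list.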
To use a custom stemmer, create a class that implements the Stemmer interface and write implementations for these methods:
    public String stem(String word) { ... }
    public String getRevision() { ... }
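For illustration, here is a minimal sketch of what those two methods might contain. The class name `SuffixStemmer`, the suffix list, and the minimum stem length are all assumptions made up for this example; it is a toy suffix stripper, not the Porter algorithm or any stemmer shipped with Weka. It is written without the Weka dependency so it compiles on its own; in a real project you would add `implements weka.core.stemmers.Stemmer` to the class.

```java
import java.util.Locale;

// Hypothetical toy stemmer. With Weka on the classpath, add
// "implements weka.core.stemmers.Stemmer" to the declaration.
public class SuffixStemmer {
    // Illustrative rule list (an assumption of this sketch); checked in order.
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};
    // Do not strip a suffix if fewer than this many characters would remain.
    private static final int MIN_STEM_LENGTH = 3;

    // Lowercases the word and strips the first matching suffix, if any.
    public String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            if (lower.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return lower.substring(0, stemLength);
            }
        }
        return lower;
    }

    // Any version string will do for a custom class.
    public String getRevision() {
        return "1.0";
    }
}
```

With these rules, `stem("walking")` gives `walk` and `stem("boxes")` gives `box`, while short words such as `is` or `red` are returned unchanged because too little would remain after stripping.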
Then, to use your custom stopwords handler and stemmer:
    StringToWordVector filter = new StringToWordVector();
    filter.setStemmer(new MyStemmer());
    filter.setStopwordsHandler(new MyStopwordsHandler());
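One thing to check when combining this with a bigram tokenizer: if the filter hands each token produced by the tokenizer to the handler (I believe it does, but verify this against your Weka version), the handler will see two-word phrases rather than single words, and a plain set lookup would never match. A possible sketch of a phrase-aware check follows. The class name `PhraseStopwords` and the policy "drop the phrase if any of its words is a stopword" are assumptions for illustration; as before, add `implements weka.core.stopwords.StopwordsHandler` when using it with Weka.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical phrase-aware handler. With Weka on the classpath, add
// "implements weka.core.stopwords.StopwordsHandler" to the declaration.
public class PhraseStopwords {
    private final Set<String> stopwords;

    public PhraseStopwords(Set<String> stopwords) {
        // Defensive copy so later changes to the caller's set have no effect.
        this.stopwords = new HashSet<>(stopwords);
    }

    // Assumed policy: a token counts as a stopword if ANY of its
    // whitespace-separated words is in the stopword set.
    public boolean isStopword(String token) {
        for (String word : token.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (stopwords.contains(word)) {
                return true;
            }
        }
        return false;
    }
}
```

If you would rather keep phrases such as "state of" and only drop phrases made up entirely of stopwords, flip the loop to return false on the first non-stopword and true at the end.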
Note: the answer below by Thusitha works for the stable 3.6 version, and is simpler than the one described above. It does not work for the 3.7.12 version.
In the latest Weka library you can use
    StringToWordVector filter = new StringToWordVector();
    filter.setStopwords(new File("filename"));
I'm using the following dependency:
    <dependency>
        <groupId>nz.ac.waikato.cms.weka</groupId>
        <artifactId>weka-stable</artifactId>
        <version>3.6.12</version>
    </dependency>
From the API docs:
    public void setStopwords(java.io.File value)
Sets the file containing the stopwords; null or a directory unsets the stopwords. If the file exists, it automatically turns on the flag to use the stoplist.
Parameters: value - the file containing the stopwords