Indonesian Stemmer Using Lucene - java

Here is a class from the Lucene library that I want to make use of, but I don't know how to use that class from Java.
Example:
I have a string array >> menjadikan, menjawab, penerbangan
Can you help me stem the words in such an array in Java?

Here is an example code snippet (based on the Lucene test code) that creates a Lucene analyser using the Indonesian stemmer.
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
...
Analyzer a = new Analyzer() {
    @Override
    public TokenStreamComponents createComponents(
            String fieldName, Reader reader) {
        Tokenizer tokenizer = new KeywordTokenizer(reader);
        return new TokenStreamComponents(tokenizer,
                new IndonesianStemFilter(tokenizer));
    }
};
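A hedged usage sketch for pulling the stems back out of that analyzer (untested like the rest; it additionally needs the TokenStream, CharTermAttribute and StringReader imports, and the field name "dummy" is arbitrary):
TokenStream ts = a.tokenStream("dummy", new StringReader("menjadikan"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString()); // prints the stemmed form of the word
}
ts.end();
ts.close();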
You could also instantiate IndonesianStemmer directly and call the stem method on individual words. For example:
IndonesianStemmer stemmer = new IndonesianStemmer();
...
char[] chars = "menjadikan".toCharArray();
int len = stemmer.stem(chars, chars.length, false);
String stem = new String(chars, 0, len);
WARNING: the above code is not tested.
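To get stems for the whole array from the question, a similarly untested sketch could loop over the words and call stem on each (the array names are arbitrary):
String[] words = { "menjadikan", "menjawab", "penerbangan" };
String[] stems = new String[words.length];
IndonesianStemmer stemmer = new IndonesianStemmer();
for (int i = 0; i < words.length; i++) {
    char[] chars = words[i].toCharArray();
    // passing true asks the stemmer to also strip derivational affixes such as meN- and -kan
    int len = stemmer.stem(chars, chars.length, true);
    stems[i] = new String(chars, 0, len);
}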

Related

java.nio alternative for Java 6 [duplicate]

I have the following piece of code that uses Java 7 features such as java.nio.file.Files and java.nio.file.Paths:
import java.io.File;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;
import com.fasterxml.jackson.databind.node.ObjectNode;
public class JacksonObjectMapper {

    public static void main(String[] args) throws IOException {
        byte[] jsonData = Files.readAllBytes(Paths.get("employee.txt"));
        ObjectMapper objectMapper = new ObjectMapper();
        Employee emp = objectMapper.readValue(jsonData, Employee.class);
        System.out.println("Employee Object\n" + emp);
        Employee emp1 = createEmployee();
        objectMapper.configure(SerializationFeature.INDENT_OUTPUT, true);
        StringWriter stringEmp = new StringWriter();
        objectMapper.writeValue(stringEmp, emp1);
        System.out.println("Employee JSON is\n" + stringEmp);
    }
}
Now I have to run the same code on Java 6. What are the best possible alternatives, other than using FileReader?
In the Files class source you can see that readAllBytes reads the bytes from an InputStream:
public static byte[] readAllBytes(Path path) throws IOException {
    long size = size(path);
    if (size > (long) Integer.MAX_VALUE)
        throw new OutOfMemoryError("Required array size too large");
    try (InputStream in = newInputStream(path)) {
        return read(in, (int) size);
    }
}
In return read(in, (int)size) it uses a buffer to read the data from the InputStream.
So you can do the same thing yourself, or just use Guava or Apache Commons IO (http://commons.apache.org/io/).
The alternatives are the classes from java.io, Apache Commons IO, or Guava's IO utilities.
Guava is the most modern of these, so I think it is the best solution for you.
Read more: Guava's I/O package utilities, explained.
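For example, a minimal sketch of the Java 6 friendly one-liners from both libraries (assuming the Guava and Commons IO jars are on the classpath; the file name is just the one from the question):
import java.io.File;
import java.io.IOException;

public class ReadAllBytesJava6 {
    public static void main(String[] args) throws IOException {
        File file = new File("employee.txt");

        // Guava: com.google.common.io.Files
        byte[] guavaBytes = com.google.common.io.Files.toByteArray(file);

        // Apache Commons IO: org.apache.commons.io.FileUtils
        byte[] commonsBytes = org.apache.commons.io.FileUtils.readFileToByteArray(file);

        System.out.println(guavaBytes.length + " / " + commonsBytes.length);
    }
}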
If you really don't want to use FileReader (though I don't understand why), you can go for FileInputStream.
Syntax:
InputStream inputStream = new FileInputStream("<path of your file>");
Reader reader = new InputStreamReader(inputStream);
You are right to avoid FileReader as that always uses the default character encoding for the platform it is running on, which may not be the same as the encoding of the JSON file.
ObjectMapper has an overload of readValue that can read directly from a File, so there's no need to buffer the content in a temporary byte[]:
Employee emp = objectMapper.readValue(new File("employee.txt"), Employee.class);
You can read all bytes of a file into a byte array even in Java 6, as described in an answer to a related question:
import java.io.RandomAccessFile;
import java.io.IOException;

RandomAccessFile f = new RandomAccessFile(fileName, "r");
byte[] b;
try {
    if (f.length() > Integer.MAX_VALUE)
        throw new IOException("File is too large");
    b = new byte[(int) f.length()];
    f.readFully(b);
    if (f.getFilePointer() != f.length())
        throw new IOException("File length changed while reading");
} finally {
    f.close(); // close the file even when reading fails
}
I added the checks leading to exceptions and the change from read to readFully, which was proposed in comments under the original answer.

Solr custom Tokenizer Factory works randomly

I am new to Solr and I have to write a filter to lemmatize text, both when indexing documents and when analyzing queries.
I created a custom Tokenizer Factory that lemmatizes the text before passing it to the Standard Tokenizer.
Testing it in Solr's analysis section works fairly well (on the index side it is fine, but on the query side it sometimes analyzes the text twice). However, when indexing documents it only analyzes the first document, and on queries it behaves randomly: it only analyzes the first one, and to get another one analyzed you have to wait a while. It is not a performance problem, because I tried simply modifying the text instead of lemmatizing it.
Here is the code:
package test.solr.analysis;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
//import test.solr.analysis.TestLemmatizer;
public class TestLemmatizerTokenizerFactory extends TokenizerFactory {
    //private TestLemmatizer lemmatizer = new TestLemmatizer();
    private final int maxTokenLength;

    public TestLemmatizerTokenizerFactory(Map<String,String> args) {
        super(args);
        assureMatchVersion();
        maxTokenLength = getInt(args, "maxTokenLength", StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    public String readFully(Reader reader) {
        char[] arr = new char[8 * 1024]; // 8K at a time
        StringBuffer buf = new StringBuffer();
        int numChars;
        try {
            while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
                buf.append(arr, 0, numChars);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("### READFULLY ### => " + buf.toString());
        /*
        The original return with lemmatized text would be this:
        return lemmatizer.getLemma(buf.toString());
        To test it I only change the text by adding the word "lemmatized"
        */
        return buf.toString() + " lemmatized";
    }

    @Override
    public StandardTokenizer create(AttributeFactory factory, Reader input) {
        // I print this to see when the tokenizer is entered
        System.out.println("### Standard tokenizer ###");
        StandardTokenizer tokenizer = new StandardTokenizer(luceneMatchVersion, factory, new StringReader(readFully(input)));
        tokenizer.setMaxTokenLength(maxTokenLength);
        return tokenizer;
    }
}
With this, it only indexes the first text, adding the word "lemmatized" to it.
Then, on the first query, if I search for the word "example" it looks for "example" and "lemmatized", so it returns the first document.
On the next searches it doesn't modify the query. To get a new query with the word "lemmatized" added, I have to wait some minutes.
What happens?
Thank you all.
I highly doubt that the create method is invoked on each query (for starters, performance issues come to mind). I would take the safe route: create a Tokenizer that wraps a StandardTokenizer, then just override the setReader method and do the work there.

A simple stemming algorithm with String for input

I've been looking at word stemming algorithms such as the Porter algorithm, but everything I've found so far has dealt with files as input.
Are there any existing algorithms which would let me simply pass the stemmer a string, and have it return the stemmed string?
Something like:
String toBeStemmed = "The man worked tirelessly";
Stemmer s = new Stemmer();
String stemmed = s.stem(toBeStemmed);
The algorithms themselves don't take files. The code you found probably takes the file and reads it in as a series of Strings, which are fed to the algorithm. You just need to look at the part of the code that reads the Strings in from the file, and pass your own Strings in the same way.
In your example, toBeStemmed is a sentence that you want to tokenize first. Then you would stem the individual tokens/words, like 'worked' or 'tirelessly'.
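For instance, here is a rough sketch of that tokenize-then-stem flow using Lucene's StandardTokenizer and PorterStemFilter (assuming a Lucene 4.x dependency; the Version constant and the field name are placeholders):
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_47, source);
        result = new PorterStemFilter(result);
        return new TokenStreamComponents(source, result);
    }
};

String toBeStemmed = "The man worked tirelessly";
TokenStream ts = analyzer.tokenStream("field", new StringReader(toBeStemmed));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString()); // e.g. "work", "tirelessli" (a Porter stem is not always a dictionary word)
}
ts.end();
ts.close();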
Here's a fine morphological analyzer that I use as a stemmer in some of my projects.
stemmer JAR: https://code.google.com/p/hunglish-webapp/source/browse/trunk/#trunk%2Flib%2Fnet%2Fsf%2Fjhunlang%2Fjmorph%2F1.0
stemmer source: https://code.google.com/p/j-morph/source/checkout
language resource files: https://code.google.com/p/hunglish-webapp/source/browse/trunk/#trunk%2Fsrc%2Fmain%2Fresources%2Fresources-lang%2Fjmorph
How I use it with Lucene: https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/jmorph/
properties file: https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/resources/META-INF/spring/stemmer.properties
Example usage:
import java.util.List;
import net.sf.jhunlang.jmorph.lemma.Lemma;
import net.sf.jhunlang.jmorph.lemma.Lemmatizer;
import net.sf.jhunlang.jmorph.analysis.Analyser;
import net.sf.jhunlang.jmorph.analysis.AnalyserContext;
import net.sf.jhunlang.jmorph.analysis.AnalyserControl;
import net.sf.jhunlang.jmorph.factory.Definition;
import net.sf.jhunlang.jmorph.factory.JMorphFactory;
import net.sf.jhunlang.jmorph.parser.ParseException;
import net.sf.jhunlang.jmorph.sample.AnalyserConfig;
import net.sf.jhunlang.jmorph.sword.parser.EnglishAffixReader;
import net.sf.jhunlang.jmorph.sword.parser.EnglishReader;
import net.sf.jhunlang.jmorph.sword.parser.SwordAffixReader;
import net.sf.jhunlang.jmorph.sword.parser.SwordReader;
AnalyserConfig acEn = new AnalyserConfig();

// TODO: set path to the English affix file
String enAff = "src/main/resources/resources-lang/jmorph/en.aff";
Definition affixDef = acEn.createDefinition(enAff, "utf-8", EnglishAffixReader.class);

// TODO: set path to the English dict file
String enDic = "src/main/resources/resources-lang/jmorph/en.dic";
Definition dicDef = acEn.createDefinition(enDic, "utf-8", EnglishReader.class);

int enRecursionDepth = 3;
acEn.setRecursionDepth(affixDef, enRecursionDepth);

JMorphFactory jf = new JMorphFactory();
Analyser enAnalyser = jf.build(new Definition[] { affixDef, dicDef });

// renamed from acEn to avoid clashing with the AnalyserConfig declared above
AnalyserControl controlEn = new AnalyserControl(AnalyserControl.ALL_COMPOUNDS);
AnalyserContext analyserContextEn = new AnalyserContext(controlEn);

boolean enStripDerivates = true;
Lemmatizer enLemmatizer = new net.sf.jhunlang.jmorph.lemma.LemmatizerImpl(enAnalyser, enStripDerivates, analyserContextEn);

// After the somewhat complex initialization, here we go:
List<Lemma> lemmas = enLemmatizer.lemmatize("worked");
for (Lemma lemma : lemmas) {
    System.out.println(lemma.getWord());
}

Integrating ANTLR4 into Java

I have generated and compiled a grammar with ANTLR4. Via the command line I am able to see if there is an error, but I am having issues integrating this parser into a Java program successfully. I am able to use ANTLR4 methods as I've added the JARs to my library in Eclipse; however, I am completely unable to retrieve token text or find out if an error is being generated in any sort of meaningful manner. Any help would be appreciated. If I'm being ambiguous by any means, please let me know and I'll delve into more detail.
Looking at previous versions, an equivalent method to something like compilationUnit() might be what I want.
Something like this should work (assuming you generated GeneratedLexer and GeneratedParser from your grammar):
import java.io.FileInputStream;
import java.io.InputStream;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import test.GeneratedLexer;
import test.GeneratedParser;
public class Main {
    public static void main(String[] args) throws Exception {
        String inputFile = null;
        if (args.length > 0) {
            inputFile = args[0];
        }
        InputStream is = System.in;
        if (inputFile != null) {
            is = new FileInputStream(inputFile);
        }
        ANTLRInputStream input = new ANTLRInputStream(is);
        GeneratedLexer lexer = new GeneratedLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        GeneratedParser parser = new GeneratedParser(tokens);
        ParseTree tree = parser.startRule();
        // Do something useful with the tree (e.g. use a visitor if you generated one)
        System.out.println(tree.toStringTree(parser));
    }
}
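To find out whether errors occurred in a meaningful way, one option (a hedged sketch; the listener body is illustrative) is to register your own error listener on the parser right before the parser.startRule() call above:
import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;

parser.removeErrorListeners(); // drop the default ConsoleErrorListener
parser.addErrorListener(new BaseErrorListener() {
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                            int line, int charPositionInLine, String msg, RecognitionException e) {
        System.err.println("syntax error at line " + line + ":" + charPositionInLine + " - " + msg);
    }
});
// after parsing, parser.getNumberOfSyntaxErrors() tells you whether anything went wrong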
You could also use a parser and lexer interpreter if you don't want to pregenerate them from your grammar (or you have a dynamic grammar).
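A rough sketch of that interpreter route (it needs the full antlr4 tool jar, not just the runtime; the grammar file name, input text and start rule name are placeholders):
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.LexerInterpreter;
import org.antlr.v4.runtime.ParserInterpreter;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.tool.Grammar;

Grammar g = Grammar.load("Generated.g4");            // load the grammar at runtime
LexerInterpreter lexer = g.createLexerInterpreter(new ANTLRInputStream("some input"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
ParserInterpreter parser = g.createParserInterpreter(tokens);
ParseTree tree = parser.parse(g.getRule("startRule").index); // start rule looked up by name
System.out.println(tree.toStringTree(parser));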

Parsing CSV files using Regex in Java

I'm trying to create a program which reads CSV files from a directory, parses each line of each file using a regex, and displays the lines after matching the regex pattern.
For instance if this is the first line of my csv file
1997,Ford,E350,"ac, abs, moon",3000.00
my output should be
1997 Ford E350 ac, abs, moon 3000.00
I don't want to use any existing CSV libraries. I'm not good at regex; I've used a regex I found on the net, but it's not working in my program.
This is my source code. I'll be grateful if anyone tells me where and what I have to modify in order to make my code work. Please explain it to me.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class RegexParser {
    private static Charset charset = Charset.forName("UTF-8");
    private static CharsetDecoder decoder = charset.newDecoder();
    String pattern = "\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)";

    void regexparser(CharBuffer cb) {
        Pattern linePattern = Pattern.compile(".*\r?\n");
        Pattern csvpat = Pattern.compile(pattern);
        Matcher lm = linePattern.matcher(cb);
        Matcher pm = null;
        while (lm.find()) {
            CharSequence cs = lm.group();
            if (pm == null)
                pm = csvpat.matcher(cs);
            else
                pm.reset(cs);
            if (pm.find()) {
                System.out.println(cs);
            }
            if (lm.end() == cb.limit())
                break;
        }
    }

    public static void main(String[] args) throws IOException {
        RegexParser rp = new RegexParser();
        String folder = "Desktop/sample";
        File dir = new File(folder);
        File[] files = dir.listFiles();
        for (File entry : files) {
            FileInputStream fin = new FileInputStream(entry);
            FileChannel channel = fin.getChannel();
            int cs = (int) channel.size();
            MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_ONLY, 0, cs);
            CharBuffer cb = decoder.decode(mbb);
            rp.regexparser(cb);
            fin.close();
        }
    }
}
This is my input file
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
I'm getting the same text back as output. Where is the problem in my code? Why doesn't my regex have any impact on the output?
Using regexes seems "fancy", but with CSV files (at least in my opinion) it is not worth it. For my parsing I use http://commons.apache.org/csv/. It has never let me down. :)
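For reference, if you do end up allowing a library, a small sketch of what that could look like with Commons CSV (the method names follow the 1.x API; the file name is a placeholder for one of the CSV files shown in the question):
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class CommonsCsvExample {
    public static void main(String[] args) throws IOException {
        Reader in = new FileReader("cars.csv");
        // treat the first record (Year,Make,Model,Description,Price) as the header
        for (CSVRecord record : CSVFormat.DEFAULT.withHeader().parse(in)) {
            System.out.println(record.get("Year") + " " + record.get("Make") + " "
                    + record.get("Model") + " " + record.get("Description") + " "
                    + record.get("Price"));
        }
        in.close();
    }
}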
Anyway, I've found the fix myself. Thanks guys for your suggestions and help.
This was my initial code
if (pm.find())
    System.out.println(cs);
Now I changed this to
while (pm.find()) {
    CharSequence css = pm.group();
    // print css
}
Also I used a different Regex. I'm getting the desired output now.
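For completeness, a hedged sketch of how the field text could be pulled out of the match groups with the original pattern (group 1 holds quoted fields, group 2 unquoted ones; the joining logic is only illustrative):
Matcher pm = csvpat.matcher(cs);
StringBuilder sb = new StringBuilder();
while (pm.find()) {
    // group(1) is set for quoted fields, group(2) for unquoted ones
    String field = pm.group(1) != null ? pm.group(1) : pm.group(2);
    if (field != null && field.trim().length() > 0) {
        sb.append(field.trim()).append(' ');
    }
}
System.out.println(sb.toString().trim());
Note that this simple pattern still won't handle the doubled quotes or the embedded newline in the sample input, which is part of why the other answers suggest a real CSV parser.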
You can try this: [ \t]*+"[^"\r\n]*+"[ \t]*+|[^,\r\n]*+ with this code:
try {
    Pattern regex = Pattern.compile("[ \t]*+\"[^\"\r\n]*+\"[ \t]*+|[^,\r\n]*+",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.MULTILINE);
    Matcher matcher = regex.matcher(subjectString);
    while (matcher.find()) {
        // Do actions
    }
} catch (PatternSyntaxException ex) {
    // Take care of errors
}
But yeah, if it's not a very critical requirement, do try to use something that already works. :)
Take the advice offered and do not use regular expressions to parse a CSV file. The format is deceptively complicated in the way it can be used.
The following answer contains links to wikipedia and the RFC describing the CSV file format:
field size limitation of csv file
