I started learning to use Lucene and first tried to compile an Indexer class example from a book I found. The class looks like this:
public class Indexer
{
    private IndexWriter writer;

    public static void main(String[] args) throws Exception
    {
        String indexDir = "src/indexDirectory";
        String dataDir = "src/filesDirectory";
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexer = indexer.index(dataDir);
        indexer.close();
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexer + " files took "
                + (end - start) + " milliseconds");
    }

    public Indexer(String indexDir) throws IOException
    {
        Directory dir = new FSDirectory(new File(indexDir), null) {};
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_35), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException
    {
        writer.close();
    }

    public int index(String dataDir) throws Exception
    {
        File[] files = new File(dataDir).listFiles();
        for (int i = 0; i < files.length; i++)
        {
            File f = files[i];
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead() && acceptFile(f))
            {
                indexFile(f);
            }
        }
        return writer.numDocs();
    }

    protected boolean acceptFile(File f)
    {
        return f.getName().endsWith(".txt");
    }

    protected Document getDocument(File f) throws Exception
    {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    private void indexFile(File f) throws Exception
    {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        if (doc != null)
        {
            writer.addDocument(doc);
        }
    }
}
When I run it, I get
Exception in thread "main" java.lang.StackOverflowError
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:345)
at org.apache.lucene.store.Directory.openInput(Directory.java:143)
and the same two frames repeat like this dozens of times.
The constructor of my class

    public Indexer(String indexDir) throws IOException
    {
        Directory dir = new FSDirectory(new File(indexDir), null) {};
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_35), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

has an IndexWriter call that is deprecated (because the book was written for Lucene 3.0.0) and uses the also-deprecated IndexWriter.MaxFieldLength.UNLIMITED. Could this be causing the overflow? If so, what should I use instead?
No, don't make your own anonymous subclass of FSDirectory! Use FSDirectory.open instead, or instantiate one of the concrete subclasses of FSDirectory that Lucene provides, such as NIOFSDirectory (but in that case read the Javadoc on your chosen implementation carefully; each one has OS-specific pitfalls). Lucene never, even in version 3.0, expected you to subclass FSDirectory yourself. Your anonymous subclass overrides nothing, so the two openInput overloads in the abstract base classes keep delegating to each other, which is exactly the mutual recursion your stack trace shows.
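A minimal sketch of a corrected constructor for Lucene 3.5, assuming the rest of the class stays as-is: it uses FSDirectory.open together with IndexWriterConfig, which replaced the deprecated four-argument IndexWriter constructor (OpenMode.CREATE plays the role of the old true create flag):

    public Indexer(String indexDir) throws IOException
    {
        // FSDirectory.open picks a suitable concrete implementation for the OS
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
        // CREATE overwrites any existing index, like the old 'true' argument
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        writer = new IndexWriter(dir, config);
    }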
My program uses Picocli to parse XML data and store it in an ArrayList. For some reason, the list turns up empty when I try to access it from another class.
I run the code below, and it shows the elements just fine:
public class SourceSentences {
    static String source;
    static ArrayList<String> sourceArray = new ArrayList<>();

    public static void translate() throws ParserConfigurationException, IOException, SAXException {
        String xmlFileLocation = "C:\\Users\\user\\Desktop\\exercise\\source.txml";
        System.out.println("---------------");
        System.out.println("Get Text From Source File: ");
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        builderFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        //parse '.txml' file
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        Document document = builder.parse(new File(xmlFileLocation));
        //...
        document.getDocumentElement().normalize();
        //specify the tag in the '.txml' file and iterate
        NodeList nodeList = document.getElementsByTagName("segment");
        for (int i = 0; i < nodeList.getLength(); i++) {
            //index of the tag where the lines of elements are
            Node node = nodeList.item(i);
            //check that it is actually an element node
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                //cast to an Element to retrieve its children from the XML file
                Element element = (Element) node;
                //get the <source> text from the current node in nodeList
                source = element.getElementsByTagName("source").item(0).getTextContent();
                //check what it looks like
                System.out.println(source);
                //add to arraylist
                sourceArray.add(source);
            }
            /*String[] arr = source.split("\\s");
            System.out.println(Arrays.toString(arr));
            System.out.println(Arrays.toString(arr));*/
        }
        //get its data type to make sure
        System.out.println("data type: " + source.getClass().getSimpleName());
        System.out.println(sourceArray);
    }
}
So I try to access sourceArray from another class:
class getArrayElements extends SourceSentences {
    public static void main(String[] args) {
        System.out.println(SourceSentences.sourceArray);
    }
}
and the output is [], so the data never makes it to the other class.
Picocli setup snippet:
public class TranslateTXML implements Callable<String> {
    @Option(names = "-f", description = "path to source txml file")
    private String file;

    @Option(names = "-o", description = "output path")
    private String output;

    public static void main(String... args) throws Exception {
        int exitCode = new picocli.CommandLine(new TranslateTXML()).execute(args);
        System.exit(exitCode);
    }

    public String call() throws Exception {
        if (file != null) {
            if (file.equals("C:\\Users\\gnier\\Desktop\\exercise\\source.txml")) {
                sourceSent("C:\\Users\\gnier\\Desktop\\exercise\\source.txml");
                System.out.println("source.txml data retrieved\n");
            } else {
                System.out.println("File \"source.txml\" not found. Check FileName and Directory.");
                System.exit(2);
            }
        }
        WriteSourceTranslatedToTXML.makeTranslated(System.out);
        System.out.println("translated made");
        System.out.println("------");
        System.out.println("File \"translated.txml\" has been outputted to designated path");
    }
}
Running getArrayElements.main() starts a fresh JVM, so any static state from an earlier run of SourceSentences is gone. As far as getArrayElements.main() is concerned, the XML was never parsed, which is why sourceArray is empty. You need to call the translate() method from inside getArrayElements' main method:
class getArrayElements extends SourceSentences {
    public static void main(String[] args) throws ParserConfigurationException, IOException, SAXException {
        SourceSentences.translate();
        System.out.println(SourceSentences.sourceArray);
    }
}
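If you want to avoid mutable static state altogether, one option is to have translate() build and return the list, so every caller receives the data explicitly. A sketch, assuming nothing else needs to write to sourceArray:

    public static ArrayList<String> translate()
            throws ParserConfigurationException, IOException, SAXException {
        ArrayList<String> result = new ArrayList<>();
        // ... same parsing as above, but add each <source> text to result ...
        return result;
    }

    // caller:
    ArrayList<String> sentences = SourceSentences.translate();
    System.out.println(sentences);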
I have started working with Apache Pig for one of our projects, and I have to create a custom input format to load our data files. For this I followed this example: Hadoop:Custom Input format. I also created my custom RecordReader implementation to read the data (we get our data in binary format from some other application) and parse it to proper JSON format.
The problem occurs when I use my custom loader in a Pig script. As soon as my loader's getNext() method is invoked, it calls my custom RecordReader's nextKeyValue() method, which works fine. It reads the data properly and passes it back to my loader, which parses the data and returns a Tuple. So far so good.
The problem arises when my loader's getNext() method is called again and again. It gets called, works fine, and returns the proper output (I debugged it up to the return statement). But then, instead of letting execution go further, my loader gets called again. I tried to count the number of times my loader is called, and I saw the count go up to 20K!
Can somebody please help me understand the problem in my code?
Loader
public class SimpleTextLoaderCustomFormat extends LoadFunc {
    protected RecordReader in = null;
    private byte fieldDel = '\t';
    private ArrayList<Object> mProtoTuple = null;
    private TupleFactory mTupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple getNext() throws IOException {
        Tuple t = null;
        try {
            boolean notDone = in.nextKeyValue();
            if (!notDone) {
                return null;
            }
            String value = (String) in.getCurrentValue();
            byte[] buf = value.getBytes();
            int len = value.length();
            int start = 0;
            for (int i = 0; i < len; i++) {
                if (buf[i] == fieldDel) {
                    readField(buf, start, i);
                    start = i + 1;
                }
            }
            // pick up the last field
            readField(buf, start, len);
            t = mTupleFactory.newTupleNoCopy(mProtoTuple);
            mProtoTuple = null;
        } catch (InterruptedException e) {
            int errCode = 6018;
            String errMsg = "Error while reading input";
            e.printStackTrace();
            throw new ExecException(errMsg, errCode,
                    PigException.REMOTE_ENVIRONMENT, e);
        }
        return t;
    }

    private void readField(byte[] buf, int start, int end) {
        if (mProtoTuple == null) {
            mProtoTuple = new ArrayList<Object>();
        }
        if (start == end) {
            // NULL value
            mProtoTuple.add(null);
        } else {
            mProtoTuple.add(new DataByteArray(buf, start, end));
        }
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        //return new TextInputFormat();
        return new CustomStringInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split)
            throws IOException {
        in = reader;
    }
}
Custom InputFormat
public class CustomStringInputFormat extends FileInputFormat<String, String> {
    @Override
    public RecordReader<String, String> createRecordReader(InputSplit arg0,
            TaskAttemptContext arg1) throws IOException, InterruptedException {
        return new CustomStringInputRecordReader();
    }
}
Custom RecordReader
public class CustomStringInputRecordReader extends RecordReader<String, String> {
    private String fileName = null;
    private String data = null;
    private Path file = null;
    private Configuration jc = null;
    private static int count = 0;

    @Override
    public void close() throws IOException {
        // jc = null;
        // file = null;
    }

    @Override
    public String getCurrentKey() throws IOException, InterruptedException {
        return fileName;
    }

    @Override
    public String getCurrentValue() throws IOException, InterruptedException {
        return data;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) genericSplit;
        file = split.getPath();
        jc = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        InputStream is = FileSystem.get(jc).open(file);
        StringWriter writer = new StringWriter();
        IOUtils.copy(is, writer, "UTF-8");
        data = writer.toString();
        fileName = file.getName();
        writer.close();
        is.close();
        System.out.println("Count : " + ++count);
        return true;
    }
}
Try this in the Loader (the casts target the custom RecordReader, since that is what prepareToRead hands you):

    //....
    boolean notDone = ((CustomStringInputRecordReader) in).nextKeyValue();
    //...
    Text value = new Text(((CustomStringInputRecordReader) in).getCurrentValue().toString());
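That alone will not stop the endless calls, though: as posted, nextKeyValue() unconditionally returns true, so Pig keeps asking for more records forever. A whole-file reader has to report end of input after its single record. Here is a minimal sketch of that guard; the processed flag is an addition for illustration, not part of the original code:

    // Sketch: emit the single whole-file record only once.
    private boolean processed = false;

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (processed) {
            // no more records: getNext() sees false, returns null, and Pig stops
            return false;
        }
        InputStream is = FileSystem.get(jc).open(file);
        StringWriter writer = new StringWriter();
        IOUtils.copy(is, writer, "UTF-8");
        data = writer.toString();
        fileName = file.getName();
        writer.close();
        is.close();
        processed = true; // remember that the record has been emitted
        return true;
    }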
I don't know why my code only writes to the file the first time and then nothing; I think the problem is salida.close(), but I'm not sure. A bit of help would be appreciated. The method saves a binary tree to a file. If you need more information, ask me. Here is my code:
public boolean guardarAgenda() throws IOException
{
    NodoAgenda raiz;
    raiz = this.root;
    guardar(raiz);
    if (this.numNodes == 0)
        return true;
    else
        return false;
}

public void guardar(NodoAgenda nodo) throws IOException
{
    FileWriter fich_s = new FileWriter("archivo.txt");
    BufferedWriter be = new BufferedWriter(fich_s);
    PrintWriter salida = new PrintWriter(be);
    Parser p = new Parser();
    if (nodo != null)
    {
        guardar(nodo.left);
        p.ponerPersona(nodo.info);
        String linea = p.obtainLine();
        salida.println(linea);
        guardar(nodo.right);
        this.numNodes--;
    }
    salida.close();
}
You are opening the same file on every recursive call, and without the append option, so each call truncates what the previous one wrote. No wonder you are not finding what you expect in it once processing has ended.
When a recursion needs to share a single resource, it is better to split the work in two: one method opens the writer and starts the recursion, and a second method takes the writer as an argument and only does the writing. Something like this:
public void guardar(NodoAgenda nodo) throws IOException
{
    FileWriter fich_s = new FileWriter("archivo.txt");
    BufferedWriter be = new BufferedWriter(fich_s);
    PrintWriter salida = new PrintWriter(be);
    guardar(nodo, salida);
    salida.close();
}

public void guardar(NodoAgenda nodo, PrintWriter salida) throws IOException
{
    if (nodo != null)
    {
        Parser p = new Parser();
        guardar(nodo.left, salida);
        p.ponerPersona(nodo.info);
        String linea = p.obtainLine();
        salida.println(linea);
        guardar(nodo.right, salida);
        this.numNodes--;
    }
}
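If you are on Java 7 or later, the wrapper can also use try-with-resources, so the writer is closed and flushed even when an exception interrupts the traversal. A sketch:

    public void guardar(NodoAgenda nodo) throws IOException
    {
        try (PrintWriter salida = new PrintWriter(new BufferedWriter(new FileWriter("archivo.txt"))))
        {
            guardar(nodo, salida); // the two-argument overload does all the writing
        } // salida is closed automatically here, flushing the buffer
    }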
I'm trying to index an Arabic text file, using the ArabicAnalyzer provided by Apache Lucene. The following code shows what I am trying to do:
public class Indexer {
    public static void main(String[] args) throws Exception {
        String indexDir = "E:/workspace/IRThesisCorpusByApacheLucene/indexDir";
        String dataDir = "E:/workspace/IRThesisCorpusByApacheLucene/dataDir";
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_45, new ArabicAnalyzer(Version.LUCENE_45)));
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index(String dataDir, FileFilter filter)
            throws Exception {
        System.out.println(" Dir Path :::::" + new File(dataDir).getAbsolutePath());
        File[] files = new File(dataDir).listFiles();
        System.out.println(" Files number :::::" + files.length);
        for (File f : files) {
            System.out.println(" File is :::::" + f);
            if (!f.isDirectory() &&
                    !f.isHidden() &&
                    f.exists() &&
                    f.canRead() &&
                    (filter == null || filter.accept(f))) {
                indexFile(f);
            }
        }
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase()
                    .endsWith(".txt");
        }
    }

    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();
        InputStreamReader reader = new InputStreamReader(
                new FileInputStream(f), "UTF8");
        System.out.println(" Encoding is ::::" + reader.getEncoding());
        doc.add(new TextField("contents", reader));
        doc.add(new TextField("filename", f.getName(),
                Field.Store.YES));
        doc.add(new TextField("fullpath", f.getCanonicalPath(),
                Field.Store.YES));
        return doc;
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        System.out.println(" In indexFile :::::::: doc is ::" + doc + " writer:::" + writer);
        writer.addDocument(doc, new ArabicAnalyzer(Version.LUCENE_45));
    }
}
My text file contains:
{سم الله الرحمن الرحيم
اهلا و سهلا بكم ، ماذا بعد
كتب يكتب كاتب مكتوب سيكتب }
When I run it and open the resulting file _0.cfs, I get words, but also undefined characters.
What is the problem here? Why doesn't it show Arabic correctly?
You shouldn't be looking at .cfs files directly. A .cfs file is a compound index file, and is not, in any way, a plain-text document. You are meant to use the Lucene API to search and retrieve data from an index, not to inspect the files in an editor. If you want to know more about Lucene file formats, feel free to look at the codec documentation.
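As a rough illustration of reading the index through the API rather than the raw files, here is a minimal search sketch against the question's index (Lucene 4.5; the query term and the ten-hit limit are arbitrary choices for the example):

    import java.io.File;
    import org.apache.lucene.analysis.ar.ArabicAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class SearchDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(
                    new File("E:/workspace/IRThesisCorpusByApacheLucene/indexDir"));
            DirectoryReader reader = DirectoryReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);
            // query the analyzed Arabic contents field
            QueryParser parser = new QueryParser(Version.LUCENE_45, "contents",
                    new ArabicAnalyzer(Version.LUCENE_45));
            TopDocs hits = searcher.search(parser.parse("كتب"), 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                Document d = searcher.doc(sd.doc);
                System.out.println(d.get("filename")); // stored field, prints readable text
            }
            reader.close();
        }
    }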
I am indexing documents with Lucene and am trying to apply the SnowballAnalyzer for punctuation and stopword removal from the text, but I keep getting the following error:
IllegalAccessError: tried to access method org.apache.lucene.analysis.Tokenizer.<init>(Ljava/io/Reader;)V from class org.apache.lucene.analysis.snowball.SnowballAnalyzer
Here is the code; I would very much appreciate help! I am new to this.
public class Indexer {
    private Indexer() {}

    private String[] stopWords = {....};
    private String indexName;
    private IndexWriter iWriter;
    private static String FILES_TO_INDEX = "/Users/ssi/forindexing";

    public static void main(String[] args) throws Exception {
        Indexer m = new Indexer();
        m.index("./newindex");
    }

    public void index(String indexName) throws Exception {
        this.indexName = indexName;
        final File docDir = new File(FILES_TO_INDEX);
        if (!docDir.exists() || !docDir.canRead()) {
            System.err.println("Something wrong... " + docDir.getPath());
            System.exit(1);
        }
        Date start = new Date();
        PerFieldAnalyzerWrapper analyzers = new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
        analyzers.addAnalyzer("text", new SnowballAnalyzer("English", stopWords));
        Directory directory = FSDirectory.open(new File(this.indexName));
        IndexWriter.MaxFieldLength maxLength = IndexWriter.MaxFieldLength.UNLIMITED;
        iWriter = new IndexWriter(directory, analyzers, true, maxLength);
        System.out.println("Indexing to dir..........." + indexName);
        if (docDir.isDirectory()) {
            File[] files = docDir.listFiles();
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    try {
                        indexDocument(files[i]);
                    } catch (FileNotFoundException fnfe) {
                        fnfe.printStackTrace();
                    }
                }
            }
        }
        System.out.println("Optimizing...... ");
        iWriter.optimize();
        iWriter.close();
        Date end = new Date();
        System.out.println("Time to index was " + (end.getTime() - start.getTime()) + " milliseconds");
    }

    private void indexDocument(File someDoc) throws IOException {
        Document doc = new Document();
        Field name = new Field("name", someDoc.getName(), Field.Store.YES, Field.Index.ANALYZED);
        Field text = new Field("text", new FileReader(someDoc), Field.TermVector.WITH_POSITIONS_OFFSETS);
        doc.add(name);
        doc.add(text);
        iWriter.addDocument(doc);
    }
}
This error says that one Lucene class is inconsistent with another: one is accessing a member of the other that it isn't allowed to. This strongly suggests you have two different, incompatible versions of Lucene on your classpath somehow.
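One quick way to check is to print where each of the two classes named in the error is actually loaded from; if the two URLs point at different jars, you have found the conflict. A small diagnostic sketch (the class name WhichJar is made up for the example):

    public class WhichJar {
        public static void main(String[] args) {
            // the two classes named in the IllegalAccessError
            Class<?>[] suspects = {
                    org.apache.lucene.analysis.Tokenizer.class,
                    org.apache.lucene.analysis.snowball.SnowballAnalyzer.class
            };
            for (Class<?> c : suspects) {
                // prints the jar (or directory) each class was loaded from
                System.out.println(c.getName() + " -> "
                        + c.getProtectionDomain().getCodeSource().getLocation());
            }
        }
    }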