Lucene tika indexing failure

Lucene tika indexing failure - java

I wrote (mostly copied from lucene-in-action ebook) an indexing example using Tika. But it doesn't index the documents at all. There is no error on compile or run. I tried indexing a .pdf, .ppt, .doc, even .txt document, no use, at search returns 0 hits, and i payed attention at the words in my documents. Please take a look at the code:
public class TikaIndexer extends Indexer {
private boolean DEBUG = false;
static Set textualMetadataFields = new HashSet();
static {
textualMetadataFields.add(Metadata.TITLE);
textualMetadataFields.add(Metadata.AUTHOR);
textualMetadataFields.add(Metadata.COMMENTS);
textualMetadataFields.add(Metadata.KEYWORDS);
textualMetadataFields.add(Metadata.DESCRIPTION);
textualMetadataFields.add(Metadata.SUBJECT);
}
public TikaIndexer(String indexDir) throws IOException {
super(indexDir);
}
#Override
protected boolean acceptFile(File f) {
return true;
}
#Override
protected Document getDocument(File f) throws Exception {
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY,
f.getCanonicalPath());
InputStream is = new FileInputStream(f);
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(10*1024*1024);
try {
parser.parse(is, handler, metadata, new ParseContext());
} finally {
is.close();
}
Document doc = new Document();
doc.add(new Field("contents", handler.toString(), Field.Store.NO, Field.Index.ANALYZED));
if (DEBUG) {
System.out.println(" intregul textt: " + handler.toString());
}
for (String name : metadata.names()) {
String value = metadata.get(name);
if (textualMetadataFields.contains(name)) {
doc.add(new Field("contents", value,
Field.Store.NO, Field.Index.ANALYZED));
}
doc.add(new Field(name, value, Field.Store.YES, Field.Index.NO));
if (DEBUG) {
System.out.println(" " + name + ": " + value);
}
}
if (DEBUG) {
System.out.println();
}
return doc;
}
}
And main class:
public static void main(String args[])
{
String indexDir = "src/indexDirectory";
String dataDir = "src/filesDirectory";
try
{
TikaConfig config = TikaConfig.getDefaultConfig();
List<MediaType> parsers = new ArrayList(config.getParser().getSupportedTypes(new ParseContext())); //3
Collections.sort(parsers);
Iterator<MediaType> it = parsers.iterator();
System.out.println(parsers.size());
System.out.println("Tipuri de parsere:");
while (it.hasNext()) {
System.out.println(" " + it.next());
}
System.out.println();
long start = new Date().getTime();
TikaIndexer indexer = new TikaIndexer(indexDir);
int numIndexed = indexer.index(dataDir);
long end = new Date().getTime();
System.out.println("Indexarea a " + numIndexed + " fisiere a durat "
+ (end - start) + " milisecunde.");
System.out.println();
System.out.println("--------------------------------------------------------------");
System.out.println();
}
catch (Exception ex)
{
System.out.println("Nu s-a putut realiza indexarea: ");
ex.printStackTrace();
Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
}
}

Related

Flatten signatures in pdf with PDFBOX java

I want to flatten a pdf with signatures from a form but I am using this code and when I generate the final pdf I can still delete the signature. What I want is that when I generate the final pdf, I cannot delete anything at all from the pdf
private static void flattenPDF(String src, String dst) throws IOException {
PDDocument doc = null;
try {
doc = PDDocument.load(new File(src));
} catch (IOException e) {
System.out.println("Exception: " + e.getMessage());
}
PDDocumentCatalog catalog = doc.getDocumentCatalog();
PDAcroForm acroForm = catalog.getAcroForm();
if(acroForm == null) acroForm = new PDAcroForm(PDDocument.load(new File(src)));
PDResources resources = new PDResources();
List<PDField> fields = new ArrayList<>(acroForm.getFields());
processFields(fields, resources);
acroForm.setDefaultResources(resources);
try {
acroForm.flatten();
doc.save(dst);
doc.close();
} catch (IOException e) {
System.out.println("Exception: " + e.getMessage());
}
}
private static void processFields(List<PDField> fields, PDResources resources) {
fields.stream().forEach(f -> {
f.setReadOnly(true);
COSDictionary cosObject = f.getCOSObject();
String value = cosObject.getString(COSName.DV) == null ?
cosObject.getString(COSName.V) : cosObject.getString(COSName.DV);
System.out.println("Setting " + f.getFullyQualifiedName() + ": " + value);
try {
f.setValue(value);
} catch (IOException e) {
if (e.getMessage().matches("Could not find font: /.*")) {
String fontName = e.getMessage().replaceAll("^[^/]*/", "");
System.out.println("Adding fallback font for: " + fontName);
resources.put(COSName.getPDFName(fontName), PDType1Font.HELVETICA);
try {
f.setValue(value);
} catch (IOException e1) {
e1.printStackTrace();
}
} else {
e.printStackTrace();
}
}
if (f instanceof PDNonTerminalField) {
processFields(((PDNonTerminalField) f).getChildren(), resources);
}
});
}

Printing plain text files to PDF printer using javax.print results in an empty file

I need to create a pdf file from plain text files. I supposed that the simplest method would be read these files and print them to a PDF printer.
My problem is that if I print to a pdf printer, the result will be an empty pdf file. If I print to Microsoft XPS Document Writer, the file is created in plain text format, not in oxps format.
I would be satisfied with a two or three step solution. (Eg. converting to xps first then to pdf using ghostscript, or something similar).
I have tried a couple of pdf printers such as: CutePDF, Microsoft PDF writer, Bullzip PDF. The result is the same for each one.
The environment is Java 1.7/1.8 Win10
private void print() {
try {
DocFlavor flavor = DocFlavor.SERVICE_FORMATTED.PRINTABLE;
PrintRequestAttributeSet patts = new HashPrintRequestAttributeSet();
PrintService[] ps = PrintServiceLookup.lookupPrintServices(flavor, patts);
if (ps.length == 0) {
throw new IllegalStateException("No Printer found");
}
System.out.println("Available printers: " + Arrays.asList(ps));
PrintService myService = null;
for (PrintService printService : ps) {
if (printService.getName().equals("Microsoft XPS Document Writer")) { //
myService = printService;
break;
}
}
if (myService == null) {
throw new IllegalStateException("Printer not found");
}
myService.getSupportedDocFlavors();
DocPrintJob job = myService.createPrintJob();
FileInputStream fis1 = new FileInputStream("o:\\k\\t1.txt");
Doc pdfDoc = new SimpleDoc(fis1, DocFlavor.INPUT_STREAM.AUTOSENSE, null);
HashPrintRequestAttributeSet pr = new HashPrintRequestAttributeSet();
pr.add(OrientationRequested.PORTRAIT);
pr.add(new Copies(1));
pr.add(MediaSizeName.ISO_A4);
PrintJobWatcher pjw = new PrintJobWatcher(job);
job.print(pdfDoc, pr);
pjw.waitForDone();
fis1.close();
} catch (PrintException ex) {
Logger.getLogger(Docparser.class.getName()).log(Level.SEVERE, null, ex);
} catch (Exception ex) {
Logger.getLogger(Docparser.class.getName()).log(Level.SEVERE, null, ex);
}
}
class PrintJobWatcher {
boolean done = false;
PrintJobWatcher(DocPrintJob job) {
job.addPrintJobListener(new PrintJobAdapter() {
public void printJobCanceled(PrintJobEvent pje) {
allDone();
}
public void printJobCompleted(PrintJobEvent pje) {
allDone();
}
public void printJobFailed(PrintJobEvent pje) {
allDone();
}
public void printJobNoMoreEvents(PrintJobEvent pje) {
allDone();
}
void allDone() {
synchronized (PrintJobWatcher.this) {
done = true;
System.out.println("Printing done ...");
PrintJobWatcher.this.notify();
}
}
});
}
public synchronized void waitForDone() {
try {
while (!done) {
wait();
}
} catch (InterruptedException e) {
}
}
}

If you can install LibreOffice, it is possible to use the Java UNO API to do this.
There is a similar example here which will load and save a file: Java Convert Word to PDF with UNO. This could be used to convert your text file to PDF.
Alternatively, you could take the text file and send it directly to the printer using the same API.
The following JARs give access to the UNO API. Ensure these are in your class path:
[Libre Office Dir]/URE/java/juh.jar
[Libre Office Dir]/URE/java/jurt.jar
[Libre Office Dir]/URE/java/ridl.jar
[Libre Office Dir]/program/classes/unoil.jar
[Libre Office Dir]/program
The following code will then take your sourceFile and print to the printer named "Local Printer 1".
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import com.sun.star.beans.PropertyValue;
import com.sun.star.frame.XComponentLoader;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.view.XPrintable;
public class DirectPrintTest
{
public static void main(String args[])
{
// set to the correct name of your printers
String printer = "Local Printer 1";// "Microsoft Print to PDF";
File sourceFile = new File("c:/projects/WelcomeTemplate.doc");
if (!sourceFile.canRead()) {
throw new RuntimeException("Can't read:" + sourceFile.getPath());
}
com.sun.star.uno.XComponentContext xContext = null;
try {
// get the remote office component context
xContext = com.sun.star.comp.helper.Bootstrap.bootstrap();
System.out.println("Connected to a running office ...");
// get the remote office service manager
com.sun.star.lang.XMultiComponentFactory xMCF = xContext
.getServiceManager();
Object oDesktop = xMCF.createInstanceWithContext(
"com.sun.star.frame.Desktop", xContext);
com.sun.star.frame.XComponentLoader xCompLoader = (XComponentLoader) UnoRuntime
.queryInterface(com.sun.star.frame.XComponentLoader.class,
oDesktop);
StringBuffer sUrl = new StringBuffer("file:///");
sUrl.append(sourceFile.getCanonicalPath().replace('\\', '/'));
List<PropertyValue> loadPropsList = new ArrayList<PropertyValue>();
PropertyValue pv = new PropertyValue();
pv.Name = "Hidden";
pv.Value = Boolean.TRUE;
loadPropsList.add(pv);
PropertyValue[] loadProps = new PropertyValue[loadPropsList.size()];
loadPropsList.toArray(loadProps);
// Load a Writer document, which will be automatically displayed
com.sun.star.lang.XComponent xComp = xCompLoader
.loadComponentFromURL(sUrl.toString(), "_blank", 0,
loadProps);
// Querying for the interface XPrintable on the loaded document
com.sun.star.view.XPrintable xPrintable = (XPrintable) UnoRuntime
.queryInterface(com.sun.star.view.XPrintable.class, xComp);
// Setting the property "Name" for the favoured printer (name of
// IP address)
com.sun.star.beans.PropertyValue propertyValue[] = new com.sun.star.beans.PropertyValue[2];
propertyValue[0] = new com.sun.star.beans.PropertyValue();
propertyValue[0].Name = "Name";
propertyValue[0].Value = printer;
// Setting the name of the printer
xPrintable.setPrinter(propertyValue);
propertyValue[0] = new com.sun.star.beans.PropertyValue();
propertyValue[0].Name = "Wait";
propertyValue[0].Value = Boolean.TRUE;
// Printing the loaded document
System.out.println("sending print");
xPrintable.print(propertyValue);
System.out.println("closing doc");
((com.sun.star.util.XCloseable) UnoRuntime.queryInterface(
com.sun.star.util.XCloseable.class, xPrintable))
.close(true);
System.out.println("closed");
System.exit(0);
} catch (Exception e) {
e.printStackTrace(System.err);
System.exit(1);
}
}
}

Thank you for all. After two days struggling with various type of printers (I gave a chance to CUPS PDF printer too but I could not make it to print in landscape mode) I ended up using the Apache PDFbox.
It's only a POC solution but works and fits to my needs. I hope it will be useful for somebody.
( cleanTextContent() method removes some ESC control characters from the line to be printed. )
public void txt2pdf() {
float POINTS_PER_INCH = 72;
float POINTS_PER_MM = 1 / (10 * 2.54f) * POINTS_PER_INCH;
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy.MM.dd HH:m.ss");
PDDocument doc = null;
try {
doc = new PDDocument();
PDPage page = new PDPage(new PDRectangle(297 * POINTS_PER_MM, 210 * POINTS_PER_MM));
doc.addPage(page);
PDPageContentStream content = new PDPageContentStream(doc, page);
//PDFont pdfFont = PDType1Font.HELVETICA;
PDFont pdfFont = PDTrueTypeFont.loadTTF(doc, new File("c:\\Windows\\Fonts\\lucon.ttf"));
float fontSize = 10;
float leading = 1.1f * fontSize;
PDRectangle mediabox = page.getMediaBox();
float margin = 20;
float startX = mediabox.getLowerLeftX() + margin;
float startY = mediabox.getUpperRightY() - margin;
content.setFont(pdfFont, fontSize);
content.beginText();
content.setLeading(leading);
content.newLineAtOffset(startX, startY);
BufferedReader fis1 = new BufferedReader(new InputStreamReader(new FileInputStream("o:\\k\\t1.txt"), "cp852"));
String inString;
//content.setRenderingMode(RenderingMode.FILL_STROKE);
float currentY = startY + 60;
float hitOsszesenOffset = 0;
int pageNumber = 1;
while ((inString = fis1.readLine()) != null) {
currentY -= leading;
if (currentY <= margin) {
content.newLineAtOffset(0, (mediabox.getLowerLeftX()-35));
content.showText("Date Generated: " + dateFormat.format(new Date()));
content.newLineAtOffset((mediabox.getUpperRightX() / 2), (mediabox.getLowerLeftX()));
content.showText(String.valueOf(pageNumber++)+" lap");
content.endText();
float yCordinate = currentY+30;
float sX = mediabox.getLowerLeftY()+ 35;
float endX = mediabox.getUpperRightX() - 35;
content.moveTo(sX, yCordinate);
content.lineTo(endX, yCordinate);
content.stroke();
content.close();
PDPage new_Page = new PDPage(new PDRectangle(297 * POINTS_PER_MM, 210 * POINTS_PER_MM));
doc.addPage(new_Page);
content = new PDPageContentStream(doc, new_Page);
content.beginText();
content.setFont(pdfFont, fontSize);
content.newLineAtOffset(startX, startY);
currentY = startY;
}
String ss = new String(inString.getBytes(), "UTF8");
ss = cleanTextContent(ss);
if (!ss.isEmpty()) {
if (ss.contains("JAN") || ss.contains("SUMMARY")) {
content.setRenderingMode(RenderingMode.FILL_STROKE);
}
content.newLineAtOffset(0, -leading);
content.showText(ss);
}
content.setRenderingMode(RenderingMode.FILL);
}
content.newLineAtOffset((mediabox.getUpperRightX() / 2), (mediabox.getLowerLeftY()));
content.showText(String.valueOf(pageNumber++));
content.endText();
fis1.close();
content.close();
doc.save("o:\\k\\t1.pdf");
} catch (IOException ex) {
Logger.getLogger(Document_Creation.class.getName()).log(Level.SEVERE, null, ex);
} finally {
if (doc != null) {
try {
doc.close();
} catch (IOException ex) {
Logger.getLogger(Document_Creation.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
}

Arabic analyzer Lucene

I'm trying to index an Arabic text file, using the ArabicAnalyzer provided by Apache Lucene. The following code shows what I am trying to do:
public class Indexer {
public static void main(String[] args) throws Exception {
String indexDir = "E:/workspace/IRThesisCorpusByApacheLucene/indexDir";
String dataDir = "E:/workspace/IRThesisCorpusByApacheLucene/dataDir";
long start = System.currentTimeMillis();
Indexer indexer = new Indexer(indexDir);
int numIndexed;
try {
numIndexed = indexer.index(dataDir, new TextFilesFilter());
} finally {
indexer.close();
}
long end = System.currentTimeMillis();
System.out.println("Indexing " + numIndexed + " files took "
+ (end - start) + " milliseconds");
}
private IndexWriter writer;
public Indexer(String indexDir) throws IOException {
Directory dir = FSDirectory.open(new File(indexDir));
writer = new IndexWriter(dir,new IndexWriterConfig
(Version.LUCENE_45, new ArabicAnalyzer(Version.LUCENE_45))
);
}
public void close() throws IOException {
writer.close();
}
public int index(String dataDir, FileFilter filter)
throws Exception {
System.out.println(" Dir Path :::::"+ new File(dataDir).getAbsolutePath());
File[] files = new File(dataDir).listFiles();
System.out.println(" Files number :::::"+files.length);
for (File f: files) {
System.out.println(" File is :::::"+f);
if (!f.isDirectory() &&
!f.isHidden() &&
f.exists() &&
f.canRead() &&
(filter == null || filter.accept(f))) {
indexFile(f);
}
}
return writer.numDocs();
}
private static class TextFilesFilter implements FileFilter {
public boolean accept(File path) {
return path.getName().toLowerCase()
.endsWith(".txt");
}
}
protected Document getDocument(File f) throws Exception {
Document doc = new Document();
InputStreamReader reader=new InputStreamReader
(new FileInputStream(f),"UTF8");
System.out.println(" Encoding is ::::"+reader.getEncoding());
doc.add(new TextField("contents",reader ));
doc.add(new TextField("filename", f.getName(),
Field.Store.YES));
doc.add(new TextField("fullpath", f.getCanonicalPath(),
Field.Store.YES));
return doc;
}
private void indexFile(File f) throws Exception {
System.out.println("Indexing " + f.getCanonicalPath());
Document doc = getDocument(f);
System.out.println(" In indexFile :::::::: doc is ::"+doc+" writer:::"+writer);
writer.addDocument(doc,new ArabicAnalyzer(Version.LUCENE_45));
}
}
my text file contains :
{سم الله الرحمن الرحيم
اهلا و سهلا بكم ، ماذا بعد
كتب يكتب كاتب مكتوب سيكتب }
When run, I get the following results in file _0.cfs:
I get words, but also get undefined characters
What is the problem here? Why doesn't it show Arabic correctly?

You shouldn't be looking at .cfs files directly. The cfs is a compound index file, and is not, in any way, a plain text document. You are intended to use the Lucene API to search and retrieve data from an index, not just look at the file in an editor. If you want to know more about Lucene file formats, feel free to look at the codec documentation

How to insert and retrieve key from registry editor

I am new to cryptography. I have to develop project based on cryptography..In part of my project I have to insert a key to the registry and afterwards I have to retrieve the same key for decryption.. I done until getting the path of the registry ..
Here I am showing my code:
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
public final class Project {
public static final String readRegistry(String location, String key) {
try {
// Run reg query, then read output with StreamReader (internal class)
Process process = Runtime.getRuntime().exec("reg query " +
'"' + location + "\" /v " + key);
StreamReader reader = new StreamReader(process.getInputStream());
reader.start();
process.waitFor();
reader.join();
String output = reader.getResult();
// Output has the following format:
// \n<Version information>\n\n<key>\t<registry type>\t<value>
if (!output.contains("\t")) {
return null;
}
// Parse out the value
String[] parsed = output.split("\t");
return parsed[parsed.length - 1];
} catch (Exception e) {
return null;
}
}
static class StreamReader extends Thread {
private InputStream is;
private StringWriter sw = new StringWriter();
public StreamReader(InputStream is) {
this.is = is;
}
public void run() {
try {
int c;
while ((c = is.read()) != -1) {
System.out.println("Reading" + c);
sw.write(c);
}
} catch (IOException e) {
System.out.println("Exception in run() " + e);
}
}
public String getResult() {
System.out.println("Content " + sw.toString());
return sw.toString();
}
}
public static boolean addValue(String key, String valName, String val) {
try {
// Run reg query, then read output with StreamReader (internal class)
Process process = Runtime.getRuntime().exec("reg add \"" + key + "\" /v \"" + valName + "\" /d \"\\\"" + val + "\\\"\" /f");
StreamReader reader = new StreamReader(process.getInputStream());
reader.start();
process.waitFor();
reader.join();
String output = reader.getResult();
System.out.println("Processing........ggggggggggggggggggggg." + output);
// Output has the following format:
// \n<Version information>\n\n<key>\t<registry type>\t<value>
return output.contains("The operation completed successfully");
} catch (Exception e) {
System.out.println("Exception in addValue() " + e);
}
return false;
}
public static void main(String[] args) {
// Sample usage
JAXRDeleteConcept hc = new JAXRDeleteConcept();
System.out.println("Before Insertion");
if (JAXRDeleteConcept.addValue("HKEY_CURRENT_USER\\Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\ComDlg32\\OpenSaveMRU", "REG_SZ", "Muthus")) {
System.out.println("Inserted Successfully");
}
String value = JAXRDeleteConcept.readRegistry("HKEY_CURRENT_USER\\Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\ComDlg32\\OpenSaveMRU" , "Project_Key");
System.out.println(value);
}
}
But i dont know how to insert a key in a registry and read the particular key which i inserted..Please help me..
Thanks in advance..

It would be a lot easier to use the JRegistry library to edit the registry, rather than execute commands.

Exception when indexing text documents with Lucene, using SnowballAnalyzer for cleaning up

I am indexing the documents with Lucene and am trying to apply the SnowballAnalyzer for punctuation and stopword removal from text .. I keep getting the following error :(
IllegalAccessError: tried to access method org.apache.lucene.analysis.Tokenizer.(Ljava/io/Reader;)V from class org.apache.lucene.analysis.snowball.SnowballAnalyzer
Here is the code, I would very much appreciate help!!!! I am new with this..
public class Indexer {
private Indexer(){};
private String[] stopWords = {....};
private String indexName;
private IndexWriter iWriter;
private static String FILES_TO_INDEX = "/Users/ssi/forindexing";
public static void main(String[] args) throws Exception {
Indexer m = new Indexer();
m.index("./newindex");
}
public void index(String indexName) throws Exception {
this.indexName = indexName;
final File docDir = new File(FILES_TO_INDEX);
if(!docDir.exists() || !docDir.canRead()){
System.err.println("Something wrong... " + docDir.getPath());
System.exit(1);
}
Date start = new Date();
PerFieldAnalyzerWrapper analyzers = new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
analyzers.addAnalyzer("text", new SnowballAnalyzer("English", stopWords));
Directory directory = FSDirectory.open(new File(this.indexName));
IndexWriter.MaxFieldLength maxLength = IndexWriter.MaxFieldLength.UNLIMITED;
iWriter = new IndexWriter(directory, analyzers, true, maxLength);
System.out.println("Indexing to dir..........." + indexName);
if(docDir.isDirectory()){
File[] files = docDir.listFiles();
if(files != null){
for (int i = 0; i < files.length; i++) {
try {
indexDocument(files[i]);
}catch (FileNotFoundException fnfe){
fnfe.printStackTrace();
}
}
}
}
System.out.println("Optimizing...... ");
iWriter.optimize();
iWriter.close();
Date end = new Date();
System.out.println("Time to index was" + (end.getTime()-start.getTime()) + "miliseconds");
}
private void indexDocument(File someDoc) throws IOException {
Document doc = new Document();
Field name = new Field("name", someDoc.getName(), Field.Store.YES, Field.Index.ANALYZED);
Field text = new Field("text", new FileReader(someDoc), Field.TermVector.WITH_POSITIONS_OFFSETS);
doc.add(name);
doc.add(text);
iWriter.addDocument(doc);
}
}

This says that one Lucene class is inconsistent with another Lucene class -- one is accessing a member of the other that it can't. This strongly suggests you have two different and incompatible versions of Lucene in your classpath somehow.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Lucene tika indexing failure - java

Related

Flatten signatures in pdf with PDFBOX java

Printing plain text files to PDF printer using javax.print results in an empty file

Arabic analyzer Lucene

How to insert and retrieve key from registry editor

Exception when indexing text documents with Lucene, using SnowballAnalyzer for cleaning up

Categories

Resources