Lucene 7 Index TREC - java

I'm having difficulties indexing TREC collections in Lucene 7. Until now I only needed to index plain text files, which was easily achievable by using an InputStreamReader as described in the demo.
/** Indexes a single document */
static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
    try (InputStream stream = Files.newInputStream(file)) {
        // make a new, empty document
        Document doc = new Document();
        Field pathField = new StringField("path", file.toString(), Field.Store.YES);
        doc.add(pathField);
        doc.add(new LongPoint("modified", lastModified));
        doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));
        if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
            System.out.println("adding " + file);
            writer.addDocument(doc);
        } else {
            System.out.println("updating " + file);
            writer.updateDocument(new Term("path", file.toString()), doc);
        }
    }
}
But TREC documents contain tags that store information not relevant for the search results, such as <HEADER>, <TITLE>, <DOCNO> and many more. How would I adjust this code to save specific tags in their own text field with their appropriate content?

Answering my own question since I found a solution. It may not be the most optimal one, and it is by no means the best looking.
My solution is to take the complete InputStream and read it line by line, performing the appropriate action whenever a certain tag is found. Here is a small example:
BufferedReader in = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
String read;
while ((read = in.readLine()) != null) {
    String[] split = read.split("\\s+");
    boolean text = false;
    for (String part : split) {
        if (part.equals("<TEXT>")) {
            text = true;
        }
    }
    // ... react to the flag here, e.g. collect the following lines ...
}
This solution solves my problem, but I'm fairly certain that a better looking one exists.
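For illustration, the same line-by-line idea can be extended into a small parser. This is only a sketch, assuming common TREC markup where <DOCNO> opens and closes on a single line and the searchable body sits between <TEXT> tags; the field names are arbitrary:
static Document parseTrecDoc(BufferedReader in) throws IOException {
    Document doc = new Document();
    StringBuilder text = new StringBuilder();
    boolean inText = false;
    String line;
    while ((line = in.readLine()) != null) {
        if (line.contains("<DOCNO>")) {
            // e.g. "<DOCNO> FT911-3 </DOCNO>" -> "FT911-3"
            String docNo = line.replace("<DOCNO>", "").replace("</DOCNO>", "").trim();
            doc.add(new StringField("docno", docNo, Field.Store.YES));
        } else if (line.contains("<TEXT>")) {
            inText = true;
        } else if (line.contains("</TEXT>")) {
            inText = false;
        } else if (inText) {
            text.append(line).append('\n');
        }
    }
    doc.add(new TextField("contents", text.toString(), Field.Store.YES));
    return doc;
}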

Related

merge multiple pdfs in order

Hey guys, sorry for the long post, the bad language, and any unnecessary details.
I created multiple one-page PDFs from one PDF template using an Excel document.
I now have
something like this:
tempfile0.pdf
tempfile1.pdf
tempfile2.pdf
...
I'm trying to merge all the files into one single PDF using iText 5,
but it seems that the pages in the resulting PDF are not in the order I wanted.
For example:
tempfile0.pdf on the first page
tempfile1.pdf on the 2000th page
Here is the code I'm using.
The procedure is:
1. filling a form from a HashMap
2. saving the filled form as one PDF
3. merging all the files into one single PDF
public void fillPdfitext(int debut, int fin) throws IOException, DocumentException {
    for (int i = debut; i < fin; i++) {
        HashMap<String, String> currentData = dataextracted[i];
        // textArea.appendText("\n" + pdfoutputname + " en cours de preparation\n ");
        PdfReader reader = new PdfReader(this.sourcePdfTemplateFile.toURI().getPath());
        String outputfolder = this.destinationOutputFolder.toURI().getPath();
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(outputfolder + "\\" + "tempcontrat" + debut + "-" + i + "_.pdf"));
        // get the document catalog
        AcroFields acroForm = stamper.getAcroFields();
        // as there might not be an AcroForm entry a null check is necessary
        if (acroForm != null) {
            for (String key : currentData.keySet()) {
                try {
                    String fieldvalue = currentData.get(key);
                    if (key.equals("ADRESSE1")) {
                        fieldvalue = currentData.get("ADRESSE1") + " " + currentData.get("ADRESSE2");
                        acroForm.setField("ADRESSE", fieldvalue);
                    }
                    if (key.equals("IMEI")) {
                        acroForm.setField("NUM_SERIE_PACK", fieldvalue);
                    }
                    acroForm.setField(key, fieldvalue);
                    // textArea.appendText(key + ": " + fieldvalue + "\t\t");
                } catch (Exception e) {
                    // e.printStackTrace();
                }
            }
            stamper.setFormFlattening(true);
        }
        stamper.close();
    }
}
This is the code for merging:
public void Merge() throws IOException, DocumentException {
    File[] documentPaths = Main.objetapp.destinationOutputFolder.listFiles((dir, name) -> name.matches("tempcontrat.*\\.pdf"));
    Arrays.sort(documentPaths, NameFileComparator.NAME_INSENSITIVE_COMPARATOR);
    byte[] mergedDocument;
    try (ByteArrayOutputStream memoryStream = new ByteArrayOutputStream()) {
        Document document = new Document();
        PdfSmartCopy pdfSmartCopy = new PdfSmartCopy(document, memoryStream);
        document.open();
        for (File docPath : documentPaths) {
            PdfReader reader = new PdfReader(docPath.toURI().getPath());
            try {
                reader.consolidateNamedDestinations();
                PdfImportedPage pdfImportedPage = pdfSmartCopy.getImportedPage(reader, 1);
                pdfSmartCopy.addPage(pdfImportedPage);
            } finally {
                pdfSmartCopy.freeReader(reader);
                reader.close();
            }
        }
        document.close();
        mergedDocument = memoryStream.toByteArray();
    }
    FileOutputStream stream = new FileOutputStream(this.destinationOutputFolder.toURI().getPath() + "\\" +
            this.sourceDataFile.getName().replaceFirst("[.][^.]+$", "") + ".pdf");
    try {
        stream.write(mergedDocument);
    } finally {
        stream.close();
    }
    documentPaths = null;
    Runtime r = Runtime.getRuntime();
    r.gc();
}
My question is: how do I keep the order of the files the same in the resulting PDF?
It is because of the naming of the files. Your code
new FileOutputStream(outputfolder + "\\" + "tempcontrat" + debut + "-" + i + "_.pdf")
will produce:
tempcontrat0-0_.pdf
tempcontrat0-1_.pdf
...
tempcontrat0-10_.pdf
tempcontrat0-11_.pdf
...
tempcontrat0-1000_.pdf
Here tempcontrat0-1000_.pdf will be placed before tempcontrat0-11_.pdf, because you are sorting the files alphabetically before the merge.
It is better to left-pad the file number with the 0 character, using the leftPad() method of org.apache.commons.lang.StringUtils or java.text.DecimalFormat, so the names come out as tempcontrat0-000000.pdf, tempcontrat0-000001.pdf, ..., tempcontrat0-999999.pdf.
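For example, both of the following produce a fixed-width number (the six-digit width is only an assumption about how many files you expect):
// org.apache.commons.lang.StringUtils:
StringUtils.leftPad(String.valueOf(i), 6, '0')    // "000042" for i = 42
// or the plain JDK equivalent:
String.format("%06d", i)                          // "000042" for i = 42
// so the stamper line becomes:
new PdfStamper(reader, new FileOutputStream(outputfolder + "\\" + "tempcontrat" + debut + "-" + String.format("%06d", i) + "_.pdf"));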
You can also do it much more simply and skip the write-to-file and read-from-file steps by merging the documents right after the form fill, which will be faster. But it depends on how many documents you are merging, how big they are, and how much memory you have.
You can save the filled document into a ByteArrayOutputStream, and after stamper.close() create a new PdfReader over the bytes from that stream and call pdfSmartCopy.getImportedPage() for that reader. In short, it can look like:
// initialize
PdfSmartCopy pdfSmartCopy = new PdfSmartCopy(document, memoryStream);
for (int i = debut; i < fin; i++) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // fill in the form here
    stamper.close();
    PdfReader reader = new PdfReader(out.toByteArray());
    reader.consolidateNamedDestinations();
    PdfImportedPage pdfImportedPage = pdfSmartCopy.getImportedPage(reader, 1);
    pdfSmartCopy.addPage(pdfImportedPage);
    // other actions ...
}
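Spelled out a little more, a sketch under the same assumptions as the code above (a one-page template per filled form, field values in dataextracted[i]; templatePath is a placeholder for the template location) could look like this:
Document document = new Document();
ByteArrayOutputStream memoryStream = new ByteArrayOutputStream();
PdfSmartCopy pdfSmartCopy = new PdfSmartCopy(document, memoryStream);
document.open();
for (int i = debut; i < fin; i++) {
    // fill one copy of the form entirely in memory
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    PdfReader template = new PdfReader(templatePath);
    PdfStamper stamper = new PdfStamper(template, out);
    AcroFields form = stamper.getAcroFields();
    for (Map.Entry<String, String> entry : dataextracted[i].entrySet()) {
        form.setField(entry.getKey(), entry.getValue());
    }
    stamper.setFormFlattening(true);
    stamper.close();
    template.close();
    // merge it straight away, preserving the loop order
    PdfReader filled = new PdfReader(out.toByteArray());
    pdfSmartCopy.addPage(pdfSmartCopy.getImportedPage(filled, 1));
    pdfSmartCopy.freeReader(filled);
    filled.close();
}
document.close(); // memoryStream now holds the merged document
Because each page is appended in loop order, no file-name sorting is involved at all.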

Java, Lucene: Search with numbers as String not working

I am working on integrating Lucene in our Spring-MVC based project and currently it's working well, other than searches with numbers.
Whenever I try a search like 123Ab or 123, or anything which has numbers in it, I don't get back any search results.
As soon as I remove the numbers though, it works fine.
Any suggestions? Thank you.
Code:
public List<Integer> searchLucene(String text, long groupId, boolean type) {
    List<Integer> objectIds = new ArrayList<>();
    if (text != null) {
        //String specialChars = "+ - && || ! ( ) { } [ ] ^ \" ~ * ? : \\ /";
        text = text.replace("+", "\\+");
        text = text.replace("-", "\\-");
        text = text.replace("&&", "\\&&");
        text = text.replace("||", "\\||");
        text = text.replace("!", "\\!");
        text = text.replace("(", "\\(");
        text = text.replace(")", "\\)");
        text = text.replace("{", "\\{");
        text = text.replace("}", "\\}");
        text = text.replace("[", "\\[");
        text = text.replace("]", "\\]");
        text = text.replace("^", "\\^");
        // text = text.replace("\"", "\\\"");
        text = text.replace("~", "\\~");
        text = text.replace("*", "\\*");
        text = text.replace("?", "\\?");
        text = text.replace(":", "\\:");
        //text = text.replace("\\", "\\\\");
        text = text.replace("/", "\\/");
        try {
            Path path;
            //Set system path code
            Directory directory = FSDirectory.open(path);
            IndexReader indexReader = DirectoryReader.open(directory);
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            QueryParser queryParser = new QueryParser("contents", new SimpleAnalyzer());
            Query query = queryParser.parse(text + "*");
            TopDocs topDocs = indexSearcher.search(query, 50);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                org.apache.lucene.document.Document document = indexSearcher.doc(scoreDoc.doc);
                objectIds.add(Integer.valueOf(document.get("id")));
                System.out.println("");
                System.out.println("id " + document.get("id"));
                System.out.println("content " + document.get("contents"));
            }
            indexSearcher.getIndexReader().close();
            directory.close();
            return objectIds;
        } catch (Exception ignored) {
        }
    }
    return null;
}
Indexing code:
@Override
public void saveIndexes(String text, String tagFileName, String filePath, long groupId, boolean type, int objectId) {
    try {
        //indexing directory
        File testDir;
        Path path1;
        Directory index_dir;
        if (type) {
            // System path code
            Directory directory = org.apache.lucene.store.FSDirectory.open(path);
            IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
            IndexWriter indexWriter = new IndexWriter(directory, config);
            org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
            if (filePath != null) {
                File file = new File(filePath); // current directory
                doc.add(new TextField("path", file.getPath(), Field.Store.YES));
            }
            doc.add(new StringField("id", String.valueOf(objectId), Field.Store.YES));
            // doc.add(new TextField("id", String.valueOf(objectId), Field.Store.YES));
            if (text == null) {
                if (filePath != null) {
                    FileInputStream is = new FileInputStream(filePath);
                    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
                    StringBuilder stringBuffer = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        stringBuffer.append(line).append("\n");
                    }
                    stringBuffer.append("\n").append(tagFileName);
                    reader.close();
                    doc.add(new TextField("contents", stringBuffer.toString(), Field.Store.YES));
                }
            } else {
                text = text + "\n" + tagFileName;
                doc.add(new TextField("contents", text, Field.Store.YES));
            }
            indexWriter.addDocument(doc);
            indexWriter.commit();
            indexWriter.flush();
            indexWriter.close();
            directory.close();
        }
    } catch (Exception ignored) {
    }
}
I have tried with and without the wildcard, i.e. *. Thank you.
The issue is in your indexing code.
Your field contents is a TextField and you are using a SimpleAnalyzer. If you look at the SimpleAnalyzer documentation, it says:
An Analyzer that filters LetterTokenizer with LowerCaseFilter
That means that if your field is set to be tokenized, numbers will be removed.
Now look at the TextField code: a TextField is always tokenized, irrespective of it being TYPE_STORED or TYPE_NOT_STORED.
So if you wish to index letters and numbers, you need to use a StringField instead of a TextField.
From the StringField documentation:
A field that is indexed but not tokenized: the entire String value is
indexed as a single token. For example this might be used for a
'country' field or an 'id' field, or any field that you intend to use
for sorting or access through the field cache.
A StringField is never tokenized, irrespective of it being TYPE_STORED or TYPE_NOT_STORED.
So after indexing, the numbers have been removed from your contents field: it is indexed without them, which is why you don't find those patterns while searching.
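You can see the tokenizer's effect directly with a small sketch like this (field name and sample text are arbitrary):
Analyzer analyzer = new SimpleAnalyzer();
try (TokenStream ts = analyzer.tokenStream("contents", "123Ab test")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        // prints "ab" and "test": LetterTokenizer drops the digits.
        // With StandardAnalyzer it would print "123ab" and "test".
        System.out.println(term.toString());
    }
    ts.end();
}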
Instead of the QueryParser and complicated searches, first use a query like the one below to verify your indexed terms:
Query wildcardQuery = new WildcardQuery(new Term("contents", searchString));
TopDocs hits = searcher.search(wildcardQuery, 20);
Also, to know whether debugging should be focused on the indexer side or the searcher side, use the Luke tool to see if terms are created as per your need. If the terms are there, you can focus on the searcher code.

How to extract URLs from an HTML file stored on my computer using Java?

I need to find all the URLs present in an HTML file stored on my computer, extract the links, and store them in a variable. I'm using the code below to scan the file and get the lines, but I'm having a hard time extracting just the links. I would appreciate it if someone could help me out.
Scanner htmlScanner = new Scanner(new File(args[0]));
PrintWriter output = new PrintWriter(new FileWriter(args[1]));
while (htmlScanner.hasNext()) {
    output.print(htmlScanner.next());
}
System.out.println("\nDone");
htmlScanner.close();
output.close();
You can actually do this with the Swing HTML parser. Though the Swing parser only understands HTML 3.2, tags introduced in later HTML versions will simply be treated as unknown, and all you actually want are links anyway.
static Collection<String> getLinks(Path file)
        throws IOException, MimeTypeParseException, BadLocationException {
    HTMLEditorKit htmlKit = new HTMLEditorKit();
    HTMLDocument htmlDoc;
    try {
        htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            htmlKit.read(reader, htmlDoc, 0);
        }
    } catch (ChangedCharSetException e) {
        MimeType mimeType = new MimeType(e.getCharSetSpec());
        String charset = mimeType.getParameter("charset");
        htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        htmlDoc.putProperty("IgnoreCharsetDirective", true);
        try (Reader reader = Files.newBufferedReader(file, Charset.forName(charset))) {
            htmlKit.read(reader, htmlDoc, 0);
        }
    }
    Collection<String> links = new ArrayList<>();
    for (HTML.Tag tag : Arrays.asList(HTML.Tag.LINK, HTML.Tag.A)) {
        HTMLDocument.Iterator it = htmlDoc.getIterator(tag);
        while (it.isValid()) {
            String link = (String) it.getAttributes().getAttribute(HTML.Attribute.HREF);
            if (link != null) {
                links.add(link);
            }
            it.next();
        }
    }
    return links;
}
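Usage is then a short loop; the file name here is just a placeholder:
for (String link : getLinks(Paths.get("page.html"))) {
    System.out.println(link);
}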

Lucene Search Returns no results when the file contents are saved

I am trying to develop a log querying system using Apache Lucene. I have developed demo code to index two files and then search for the query string.
The first file contains the data
maclean
The second file contains the data
pinto
Below is the code that I have used for indexing:
fis = new FileInputStream(file);
DataInputStream in = new DataInputStream(fis);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
Document doc = new Document();
doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));
doc.add(new StoredField("filename", file.getCanonicalPath()));
if (indexWriter.getConfig().getOpenMode() == OpenMode.CREATE) {
    System.out.println("adding " + file);
    indexWriter.addDocument(doc);
} else {
    System.out.println("updating " + file);
    indexWriter.updateDocument(new Term("path", file.getPath()), doc);
}
If I use this code then I get the proper result, but in the display I can only show the file name, since that is all I have stored.
So I modified the code and stored the file contents as well, using this code:
FileInputStream fis = null;
if (file.isHidden() || file.isDirectory() || !file.canRead() || !file.exists()) {
    return;
}
if (suffix != null && !file.getName().endsWith(suffix)) {
    return;
}
System.out.println("Indexing file " + file.getCanonicalPath());
try {
    fis = new FileInputStream(file);
} catch (FileNotFoundException fnfe) {
    System.out.println("File Not Found" + fnfe);
}
DataInputStream in = new DataInputStream(fis);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
String Data = "";
while ((strLine = br.readLine()) != null) {
    Data = Data + strLine;
}
Document doc = new Document();
doc.add(new TextField("contents", Data, Field.Store.YES));
doc.add(new StoredField("filename", file.getCanonicalPath()));
if (indexWriter.getConfig().getOpenMode() == OpenMode.CREATE) {
    System.out.println("adding " + file);
    indexWriter.addDocument(doc);
} else {
    System.out.println("updating " + file);
    indexWriter.updateDocument(new Term("path", file.getPath()), doc);
}
According to my understanding I should get 1 matching result, and it should show the file name and content of the file containing maclean.
But instead I get this result:
-----------------------Results--------------------------
0 total matching documents
Found 0
Is there anything wrong that I am doing in the code, or is there a logical explanation for this? Why does the first version work while the second doesn't?
Search query code:
try {
    Directory directory = FSDirectory.open(indexDir);
    IndexReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
    QueryParser parser = new QueryParser(Version.LUCENE_41, "contents", analyzer);
    Query query = parser.parse(queryStr);
    System.out.println("Searching for: " + query.toString("contents"));
    TopDocs results = searcher.search(query, maxHits);
    ScoreDoc[] hits = results.scoreDocs;
    int numTotalHits = results.totalHits;
    System.out.println("\n\n\n-----------------------Results--------------------------\n\n\n");
    System.out.println(numTotalHits + " total matching documents");
    for (int i = 0; i < numTotalHits; i++) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println(i + ":File name is: " + d.get("filename"));
        System.out.println(i + ":File content is: " + d.get("contents"));
    }
    System.out.println("Found " + numTotalHits);
} catch (Exception e) {
    System.out.println("Exception Was caused in SimpleSearcher");
    e.printStackTrace();
}
Use StoredField instead of TextField:
doc.add(new StoredField("Data", Line));
When you use a TextField the string gets tokenized, and as a result you will not be able to search for the value as a whole. A StoredField stores the entire string without tokenizing it.
I think your exact problem is that by the time you get to creating a BufferedReader for the indexed field, you have already read the whole file, and the stream is at the end of the file with nothing further to read. You should be able to fix that with a call to fis.reset();
However, you should not do that. Don't store the same data in two separate fields, one for indexing and one for storage. Instead, set the same field to both store and index the data. TextField has a constructor that allows you to store the data as well as index it, something like:
doc.add(new TextField("contents", Data, Field.Store.YES));
I think there could be two problems with your code.
First, I notice that you did not use near-real-time (NRT) search and did not commit the writer before reading either. Lucene's IndexReader takes a snapshot of the index: the committed version when NRT is not used, or both the committed and uncommitted versions when NRT is used. That could be the reason your IndexReader fails to see the change. As it seems you require concurrent reading and writing, I recommend you use NRT search (IndexReader reader = DirectoryReader.open(indexWriter);).
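A minimal sketch of that NRT pattern, assuming the Lucene 4.x API the rest of the code uses (the boolean asks for pending deletes to be applied):
IndexWriter indexWriter = new IndexWriter(directory, config);
indexWriter.addDocument(doc);
// open the reader on the writer itself, so uncommitted changes are visible
DirectoryReader reader = DirectoryReader.open(indexWriter, true);
IndexSearcher searcher = new IndexSearcher(reader);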
The second problem could be that, as @femtoRgon said, the data you stored may not be what you expect. I notice that when you append the content of your file for storage, you lose the EOL characters. I suggest you use Luke to check your index: http://www.getopt.org/luke/
This works in Lucene 4.5: doc.add(new TextField("Data", Data, Field.Store.YES));

Apache POI HWPF - problem in convert doc file to pdf

I am currently working on a Java project that uses Apache POI.
In this project I want to convert a doc file to a pdf file. The conversion completes successfully, but I only get text in the pdf, without any text style or text colour.
My pdf file looks black & white, while my doc file is coloured and has different styles of text.
This is my code:
POIFSFileSystem fs = null;
Document document = new Document();
try {
    System.out.println("Starting the test");
    fs = new POIFSFileSystem(new FileInputStream("/document/test2.doc"));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    OutputStream file = new FileOutputStream(new File("/document/test.pdf"));
    PdfWriter writer = PdfWriter.getInstance(document, file);
    Range range = doc.getRange();
    document.open();
    writer.setPageEmpty(true);
    document.newPage();
    writer.setPageEmpty(true);
    String[] paragraphs = we.getParagraphText();
    for (int i = 0; i < paragraphs.length; i++) {
        org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
        // CharacterRun run = pr.getCharacterRun(i);
        // run.setBold(true);
        // run.setCapitalized(true);
        // run.setItalic(true);
        paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
        System.out.println("Length:" + paragraphs[i].length());
        System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());
        // add the paragraph to the document
        document.add(new Paragraph(paragraphs[i]));
    }
    System.out.println("Document testing completed");
} catch (Exception e) {
    System.out.println("Exception during test");
    e.printStackTrace();
} finally {
    // close the document
    document.close();
}
Please help me. Thanks in advance.
If you look at Apache Tika, there's a good example of reading some style information from a HWPF document. The code in Tika generates HTML based on the HWPF contents, but you should find that something very similar works for your case.
The Tika class is
https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
One thing to note about Word documents is that everything in any one character run has the same formatting applied to it. A paragraph is therefore made up of one or more character runs. Some styling is applied to the paragraph, and other parts are set on the runs. Depending on what formatting interests you, it may therefore be on the paragraph or on the run.
If you use WordExtractor, you will get text only. Try using the CharacterRun class instead: you will get the style along with the text. Please refer to the following sample code.
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
    org.apache.poi.hwpf.usermodel.Paragraph poiPara = range.getParagraph(i);
    int j = 0;
    while (true) {
        CharacterRun run = poiPara.getCharacterRun(j++);
        System.out.println("Color " + run.getColor());
        System.out.println("Font size " + run.getFontSize());
        System.out.println("Font Name " + run.getFontName());
        System.out.println(run.isBold() + " " + run.isItalic() + " " + run.getUnderlineCode());
        System.out.println("Text is " + run.text());
        if (run.getEndOffset() == poiPara.getEndOffset()) {
            break;
        }
    }
}
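To carry that styling into the PDF, you can map each run onto an iText Chunk. This is a minimal sketch, assuming iText 5 (com.itextpdf.text) and a stand-in base font; note that HWPF reports font sizes in half-points:
static Chunk toChunk(CharacterRun run) {
    int style = Font.NORMAL;
    if (run.isBold()) style |= Font.BOLD;
    if (run.isItalic()) style |= Font.ITALIC;
    if (run.getUnderlineCode() != 0) style |= Font.UNDERLINE;
    // getFontSize() is in half-points; HELVETICA is only a placeholder family
    Font font = new Font(Font.FontFamily.HELVETICA, run.getFontSize() / 2f, style);
    return new Chunk(run.text(), font);
}
Chunks built this way can then be added to an iText Paragraph, instead of the plain document.add(new Paragraph(paragraphs[i])) call from the question.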
