Lucene index HTML Headings - java

I want to index HTML files and be able to jump to the corresponding heading after receiving my search results.
I currently use an HTMLStripCharFilter to parse my files.
public class MyAnalyzer extends Analyzer {

    public MyAnalyzer() {
        super();
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer source = new StandardTokenizer();
        TokenStream result = new StandardFilter(source);
        result = new LowerCaseFilter(result);
        return new TokenStreamComponents(source, result);
    }
}
The method indexMyFile gets the path to one HTML file and creates the index, but it currently only stores the file name.
private static void indexMyFile(IndexWriter writer, Path file,
        long lastModified) throws IOException {
    try (InputStream stream = Files.newInputStream(file)) {
        Document doc = new Document();
        Field pathField = new StringField("path", file.toString(),
                Field.Store.YES);
        doc.add(pathField);
        doc.add(new TextField("contents", new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))));
        writer.addDocument(doc);
    }
}
My idea was to add a new TextField to this Lucene Document, but at this point in the code I don't know the headings yet.
Is there a way to use Lucene so that I can link the content to the current heading and file name? Or do I have to use jsoup or JTidy, pass my indexMyFile method the text after each heading, and create a Lucene Document per heading, similar to this post?

I used jsoup to parse the HTML tags. Then, instead of indexing the whole file, I created a Document for each heading, containing several fields:
private void indexString(Path path, String title, String heading,
        String urlHeading, String content) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", title, TextField.TYPE_STORED));
    doc.add(new Field("heading", heading, TextField.TYPE_STORED));
    doc.add(new StringField("path", path.toString(), Field.Store.YES));
    doc.add(new StringField("urlHeading", urlHeading, Field.Store.YES));
    doc.add(new TextField("contents", content, Field.Store.NO));
    writer.addDocument(doc);
}
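For completeness, the jsoup side can be sketched roughly like this. It is only one way to do it: the selector "h1, h2, h3", the use of the heading's id attribute as urlHeading, and the section-collection loop are my assumptions, and I assume indexString also receives the urlHeading value.

```java
// Sketch: parse the HTML file, cut a new section at every heading, and
// index each section separately via indexString(...) from above.
org.jsoup.nodes.Document html = Jsoup.parse(file.toFile(), "UTF-8");
String title = html.title();
Elements headings = html.body().select("h1, h2, h3");
for (Element h : headings) {
    // Collect the text between this heading and the next one.
    StringBuilder content = new StringBuilder();
    for (Element sib = h.nextElementSibling();
         sib != null && !sib.tagName().matches("h[1-3]");
         sib = sib.nextElementSibling()) {
        content.append(sib.text()).append(' ');
    }
    // The heading's id attribute doubles as the URL fragment to jump to.
    indexString(path, title, h.text(), h.id(), content.toString());
}
```

With one Document per heading, a search hit directly gives you the file path and the fragment (urlHeading) to jump to.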


read pdf file and write it to database [closed]

I have a pdf file that needs to be parsed and written to a database.
It looks something like this:
Some report name Data for 10.10.2022
_____________________________________________________________________________________________
Name Currency1 Currency2 Percent1 Percent2
Some name1(IO/C)|1.2% 1'01.12 USD 1'021.2 USD 0.11% 1.12%
Some name2(IO/C)|1.2% 1'01.12 USD 1'021.2 USD 0.11% 1.12%
I used the Apache PDFBox library:
public class PdfParser {

    private PDFParser parser;
    private PDFTextStripper textStripper;
    private PDDocument document;
    private COSDocument cosDocument;
    private String text;
    private String filePath;
    private File file;

    public PdfParser() {}

    public String parsePdf() throws IOException {
        this.textStripper = null;
        this.document = null;
        this.cosDocument = null;
        file = new File(filePath);
        parser = new PDFParser(new RandomAccessFile(file, "r")); // read-only access is enough
        parser.parse();
        cosDocument = parser.getDocument();
        textStripper = new PDFTextStripper();
        document = new PDDocument(cosDocument);
        textStripper.setStartPage(1); // PDFTextStripper pages are 1-based
        textStripper.setEndPage(2);
        text = textStripper.getText(document);
        return text;
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

    public PDDocument getDocument() {
        return document;
    }

    public void showDocument() {
        List<String> list = new ArrayList<>();
        PdfParser pdfParser = new PdfParser();
        pdfParser.setFilePath("C:\\......pdf");
        try {
            String text = pdfParser.parsePdf();
            list.addAll(List.of(text.split("\\r?\\n"))); // split the extracted text into lines
            for (String line : list) {
                System.out.println(line);
            }
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
I have not parsed PDF documents before, and the output is not what I need. As far as I understand, the stripper reads the text line by line and mangles the layout.
It comes out like this:
Some report name
Data for 10.10.2022
Name
Currency1
Currency2
Percent1
Percent2
1'01.12Some name1(IO/C)|1.2% 1'021.2 USD 0.11% 1.12%USD
1'01.12Some name2(IO/C)|1.2% 1'021.2 USD 0.11% 1.12%USD
Please advise, those who have worked with parsing PDF files: how can I at least display the document correctly in the console? After that I can map the extracted data to an object and save it to the database.
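PDFTextStripper emits text in content-stream order by default, which is why the table columns come out scrambled. Setting setSortByPosition(true) makes it sort text fragments by their position on the page, which usually restores the visual line order. A minimal sketch against the PDFBox 2.x API ("report.pdf" is a placeholder path):

```java
// Sketch: extract text sorted by page position so table rows come out
// in visual order rather than content-stream order.
try (PDDocument document = PDDocument.load(new File("report.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true); // order fragments by x/y position
    stripper.setStartPage(1);         // pages are 1-based
    stripper.setEndPage(2);
    String text = stripper.getText(document);
    for (String line : text.split("\\r?\\n")) {
        System.out.println(line);
    }
}
```

Once the lines come out in visual order, you can split each row on whitespace (or fixed column positions) to build your database objects.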

IText7 only creates form/widgets on new documents

When running this code with the PdfDocument not having a read source, it works properly. When I try reading from a premade pdf it stops creating the form/widgets, but still adds the paragraph as expected. There is no error given. Does anyone understand why this is happening?
Here is the code I'm running:
public class HelloWorld {

    public static final String DEST = "sampleOutput.pdf";
    public static final String SRC = "sample.pdf";

    public static void main(String[] args) throws IOException {
        new HelloWorld().createPdf(SRC, DEST);
    }

    public void createPdf(String src, String dest) throws IOException {
        // Initialize PDF reader and writer
        PdfReader reader = new PdfReader(src);
        PdfWriter writer = new PdfWriter(dest);
        // Initialize PDF document
        PdfDocument pdf = new PdfDocument(writer); // if I do (reader, writer) the widget isn't added to the first page anymore
        // Initialize document
        Document document = new Document(pdf);
        HelloWorld.addAcroForm(pdf, document);
        // Close document
        document.close();
    }

    public static PdfAcroForm addAcroForm(PdfDocument pdf, Document doc) throws IOException {
        Paragraph title = new Paragraph("Test Form")
                .setTextAlignment(TextAlignment.CENTER)
                .setFontSize(16);
        doc.add(title);
        doc.add(new Paragraph("Full name:").setFontSize(12));
        // Add acroform
        PdfAcroForm form = PdfAcroForm.getAcroForm(doc.getPdfDocument(), true);
        // Create text field
        PdfTextFormField nameField = PdfFormField.createText(doc.getPdfDocument(),
                new Rectangle(99, 753, 425, 15), "name", "");
        form.addField(nameField);
        return form;
    }
}
I adapted your code like this:
public static PdfAcroForm addAcroForm(PdfDocument pdf, Document doc) throws IOException {
    Paragraph title = new Paragraph("Test Form")
            .setTextAlignment(TextAlignment.CENTER)
            .setFontSize(16);
    doc.add(title);
    doc.add(new Paragraph("Full name:").setFontSize(12));
    PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
    PdfTextFormField nameField = PdfFormField.createText(pdf,
            new Rectangle(99, 525, 425, 15), "name", "");
    form.addField(nameField, pdf.getPage(1));
    return form;
}
You'll notice two changes:
I changed the Y offset of the field (525 instead of 753), so the field now lands inside the visible area of the page. In your code the field was added, but it wasn't visible.
I defined which page the field needs to be added to by passing pdf.getPage(1) as the second parameter to the addField() method.
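Putting that together, a createPdf that actually modifies the premade file would look roughly like this; the only structural change from your version is passing both reader and writer to the PdfDocument constructor, which opens the existing document in stamping mode instead of creating a blank one:

```java
public void createPdf(String src, String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    PdfWriter writer = new PdfWriter(dest);
    // (reader, writer) opens the premade PDF for modification.
    PdfDocument pdf = new PdfDocument(reader, writer);
    Document document = new Document(pdf);
    addAcroForm(pdf, document);
    document.close();
}
```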

Create PDF Table from HTML String with UTF-8 encoding

I want to create PDF table from HTML string. I can create that table, but instead of Text, I'm getting question marks. Here is my code:
public class ExportReportsToPdf implements StreamSource {

    private static final long serialVersionUID = 1L;
    private ByteArrayOutputStream byteArrayOutputStream;
    public static final String FILE_LOC = "C:/Users/KiKo/CasesWorkspace/case/Export.pdf";
    private static final String CSS = ""
            + "table {text-align:center; margin-top:20px; border-collapse:collapse; border-spacing:0; border-width:1px;}"
            + "th {font-size:14px; font-weight:normal; padding:10px; border-style:solid; overflow:hidden; word-break:normal;}"
            + "td {padding:10px; border-style:solid; overflow:hidden; word-break:normal;}"
            + "table-header {font-weight:bold; background-color:#EAEAEA; color:#000000;}";

    public void createReportPdf(String tableHtml, Integer type) throws IOException, DocumentException {
        // step 1
        Document document = new Document(PageSize.A4, 20, 20, 50, 20);
        // step 2
        PdfWriter.getInstance(document, new FileOutputStream(FILE_LOC));
        // step 3
        byteArrayOutputStream = new ByteArrayOutputStream();
        PdfWriter writer = PdfWriter.getInstance(document, byteArrayOutputStream);
        if (type != null) {
            writer.setPageEvent(new Watermark());
        }
        // step 4
        document.open();
        // step 5
        document.add(getTable(tableHtml));
        // step 6
        document.close();
    }

    private PdfPTable getTable(String tableHtml) throws IOException {
        // CSS
        CSSResolver cssResolver = new StyleAttrCSSResolver();
        CssFile cssFile = XMLWorkerHelper.getCSS(new ByteArrayInputStream(CSS.getBytes()));
        cssResolver.addCss(cssFile);
        // HTML
        HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
        htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
        // Pipelines
        ElementList elements = new ElementList();
        ElementHandlerPipeline pdf = new ElementHandlerPipeline(elements, null);
        HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
        CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
        // XML Worker
        XMLWorker worker = new XMLWorker(css, true);
        XMLParser parser = new XMLParser(worker);
        InputStream inputStream = new ByteArrayInputStream(tableHtml.getBytes());
        parser.parse(inputStream);
        return (PdfPTable) elements.get(0);
    }

    private static class Watermark extends PdfPageEventHelper {

        @Override
        public void onEndPage(PdfWriter writer, Document document) {
            try {
                URL url = Thread.currentThread().getContextClassLoader().getResource("/images/memotemp.jpg");
                Image background = Image.getInstance(url);
                float width = document.getPageSize().getWidth();
                float height = document.getPageSize().getHeight();
                writer.getDirectContentUnder().addImage(background, width, 0, 0, height, 0, 0);
            } catch (DocumentException | IOException e) {
                e.printStackTrace();
            }
        }
    }

    @Override
    public InputStream getStream() {
        return new ByteArrayInputStream(byteArrayOutputStream.toByteArray());
    }
}
This code works, and I'm getting this:
I've tried adding UTF-8,
InputStream inputStream = new ByteArrayInputStream(tableHtml.getBytes("UTF-8"));
but then I'm getting this:
I want to get something like this:
I think the problem is with the encoding, but I don't know how to solve this bug. Any suggestions?
To get bytes from a (Unicode) String in some encoding, you have to specify it; otherwise the default system encoding is used.
tableHtml.getBytes(StandardCharsets.UTF_8)
In your case, however, "Windows-1251" seems a better match, as the PDF does not appear to use UTF-8.
Maybe the original tableHtml String was read with the wrong encoding. It is worth checking that if it came from a file or database.
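The difference between the two encodings is easy to see with a character outside ASCII. A self-contained sketch using only the JDK (the Cyrillic letter is an arbitrary example):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        String s = "Ж"; // Cyrillic letter, outside ASCII
        // The same string yields different byte sequences per encoding.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] cp1251 = s.getBytes(Charset.forName("windows-1251"));
        System.out.println(utf8.length);   // 2 bytes in UTF-8
        System.out.println(cp1251.length); // 1 byte in windows-1251
    }
}
```

If the producer and the consumer disagree on the encoding, single-byte Windows-1251 text decoded as UTF-8 (or vice versa) is exactly what produces question marks.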
You need to tell iText which encoding to use by creating an instance of the BaseFont class. Then, where you build the content added via document.add(getTable(tableHtml));, you can pass in the font. There is an example at http://itextpdf.com/examples/iia.php?id=199.
I can't tell how you create the table, but the class PdfPTable has an addCell(PdfPCell) method, and one constructor of PdfPCell takes a Phrase. The Phrase can be constructed from a String and a Font, and the Font class takes a BaseFont as a constructor argument.
If you look around the iText Javadoc you will see that various classes take a Font as a constructor argument.
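A minimal sketch of that BaseFont route, against the iText 5 API; the font path is a placeholder for a Unicode-capable TTF shipped with the application, and IDENTITY_H is the horizontal Unicode encoding that covers Cyrillic:

```java
// Sketch: build a Font from a Unicode-capable TTF and use it per cell.
BaseFont baseFont = BaseFont.createFont("fonts/arial.ttf",
        BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font font = new Font(baseFont, 12);

PdfPTable table = new PdfPTable(2);
// Cells built from a Phrase with this font render non-ASCII text.
table.addCell(new PdfPCell(new Phrase("Текст", font)));
table.addCell(new PdfPCell(new Phrase("Text", font)));
```

Embedding the font (BaseFont.EMBEDDED) also keeps the PDF rendering correctly on machines that don't have the font installed.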

Arabic analyzer Lucene

I'm trying to index an Arabic text file, using the ArabicAnalyzer provided by Apache Lucene. The following code shows what I am trying to do:
public class Indexer {

    private IndexWriter writer;

    public static void main(String[] args) throws Exception {
        String indexDir = "E:/workspace/IRThesisCorpusByApacheLucene/indexDir";
        String dataDir = "E:/workspace/IRThesisCorpusByApacheLucene/dataDir";
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_45, new ArabicAnalyzer(Version.LUCENE_45)));
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index(String dataDir, FileFilter filter) throws Exception {
        System.out.println(" Dir Path :::::" + new File(dataDir).getAbsolutePath());
        File[] files = new File(dataDir).listFiles();
        System.out.println(" Files number :::::" + files.length);
        for (File f : files) {
            System.out.println(" File is :::::" + f);
            if (!f.isDirectory() &&
                    !f.isHidden() &&
                    f.exists() &&
                    f.canRead() &&
                    (filter == null || filter.accept(f))) {
                indexFile(f);
            }
        }
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();
        InputStreamReader reader = new InputStreamReader(
                new FileInputStream(f), "UTF8");
        System.out.println(" Encoding is ::::" + reader.getEncoding());
        doc.add(new TextField("contents", reader));
        doc.add(new TextField("filename", f.getName(), Field.Store.YES));
        doc.add(new TextField("fullpath", f.getCanonicalPath(), Field.Store.YES));
        return doc;
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        System.out.println(" In indexFile :::::::: doc is ::" + doc + " writer:::" + writer);
        writer.addDocument(doc, new ArabicAnalyzer(Version.LUCENE_45));
    }
}
My text file contains:
{سم الله الرحمن الرحيم
اهلا و سهلا بكم ، ماذا بعد
كتب يكتب كاتب مكتوب سيكتب }
When I run it, I see the following in the file _0.cfs:
I get words, but also undefined characters.
What is the problem here? Why doesn't it show the Arabic correctly?
You shouldn't be looking at .cfs files directly. The .cfs file is a compound index file and is not, in any way, a plain-text document. You are meant to use the Lucene API to search and retrieve data from an index, not to inspect the files in an editor. If you want to know more about Lucene file formats, have a look at the codec documentation.
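To check whether the Arabic text was indexed correctly, query the index through the API instead. A sketch against the Lucene 4.x classes already used above; the query term (one of the words from your test file) and the result limit of 10 are arbitrary:

```java
// Sketch: open the index and search it rather than inspecting .cfs files.
Directory dir = FSDirectory.open(new File(indexDir));
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
// Use the same analyzer for querying as for indexing.
QueryParser parser = new QueryParser(Version.LUCENE_45, "contents",
        new ArabicAnalyzer(Version.LUCENE_45));
Query query = parser.parse("كتب");
TopDocs hits = searcher.search(query, 10);
for (ScoreDoc sd : hits.scoreDocs) {
    System.out.println(searcher.doc(sd.doc).get("filename"));
}
reader.close();
```

If the search returns your file, the Arabic was analyzed and indexed correctly; the "undefined characters" you saw are just the index's internal binary format.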

Term frequency in Lucene 4.0

Trying to calculate term frequency using Lucene 4.0. I got document frequency working just fine, but I can't figure out how to get the term frequency through the API. Here's the code I have:
private static void addDoc(IndexWriter writer, String content) throws IOException {
    FieldType fieldType = new FieldType();
    fieldType.setStoreTermVectors(true);
    fieldType.setStoreTermVectorPositions(true);
    fieldType.setIndexed(true);
    fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
    fieldType.setStored(true);
    Document doc = new Document();
    doc.add(new Field("content", content, fieldType));
    writer.addDocument(doc);
}

public static void main(String[] args) throws IOException, ParseException {
    Directory directory = new RAMDirectory();
    Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_40);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter writer = new IndexWriter(directory, config);
    addDoc(writer, "Lucene is stupid");
    addDoc(writer, "Java is great");
    writer.close();
    IndexReader reader = DirectoryReader.open(directory);
    System.out.println(reader.docFreq(new Term("content", "Lucene")));
    reader.close();
}
I've tried doing something like reader.getTermVector(0, "content")... but can't find a method to just get the frequency of a particular term in that document.
Thanks!
K, figured it out. You can get a DocsEnum object from MultiFields, and then iterate over that.
private static void addDoc(IndexWriter writer, String content) throws IOException {
    FieldType fieldType = new FieldType();
    fieldType.setStoreTermVectors(true);
    fieldType.setStoreTermVectorPositions(true);
    fieldType.setIndexed(true);
    fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
    fieldType.setStored(true);
    Document doc = new Document();
    doc.add(new Field("content", content, fieldType));
    writer.addDocument(doc);
}

public static void main(String[] args) throws IOException, ParseException {
    Directory directory = new RAMDirectory();
    Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_40);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter writer = new IndexWriter(directory, config);
    addDoc(writer, "bla bla bla bleu bleu");
    addDoc(writer, "bla bla bla bla");
    writer.close();
    DirectoryReader reader = DirectoryReader.open(directory);
    DocsEnum de = MultiFields.getTermDocsEnum(reader, MultiFields.getLiveDocs(reader),
            "content", new BytesRef("bla"));
    int doc;
    while ((doc = de.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
        System.out.println(de.freq());
    }
    reader.close();
}
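If you do want to go through the term vector you already stored (the reader.getTermVector(0, "content") route from the question), the per-document frequency is also reachable through Terms/TermsEnum. A sketch against the Lucene 4.0 API; "bla" and document 0 are just the example values from above:

```java
// Sketch: read the frequency of "bla" in document 0 from its term vector.
// A term vector's stats are scoped to that single document, so
// totalTermFreq() here is the within-document frequency.
Terms terms = reader.getTermVector(0, "content");
TermsEnum termsEnum = terms.iterator(null); // Lucene 4.0 signature
if (termsEnum.seekExact(new BytesRef("bla"), true)) {
    System.out.println(termsEnum.totalTermFreq());
}
```

The MultiFields approach iterates all documents containing the term; the term-vector approach answers the question for one known document without a postings scan.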
