I am using Apache Tika to convert RTF documents to HTML.
In Tika's RTFParser class I made changes to generate an HTML file using HTMLEditorKit, and I am now able to generate the HTML file.
I want to add the metadata tags to the head tag of the generated HTML file.
Can anybody give me an idea of how to proceed?
Check this out:
Add Metadata
I'm not sure that this will help, but I think it is worth checking out.
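One possible approach, sketched roughly here, is to leave the parser alone and post-process the generated HTML string, inserting `<meta>` elements right after the opening `<head>` tag. The class and method names below are made up for illustration. The map is assumed to have been filled beforehand, for example by looping over Tika's `Metadata` object with `names()` and `get(name)`.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: adds <meta> tags to an HTML string.
final class HeadMetaInserter {
    // Matches <head> or <head attr="...">, but not <header>.
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // Escapes characters that are unsafe inside a double-quoted attribute value.
    static String escapeAttr(String s) {
        return s.replace("&", "&amp;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Returns a copy of html with one <meta> tag per map entry
    // inserted directly after the opening <head> tag.
    static String addMeta(String html, Map<String, String> metadata) {
        StringBuilder tags = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            tags.append("<meta name=\"").append(escapeAttr(e.getKey()))
                .append("\" content=\"").append(escapeAttr(e.getValue()))
                .append("\">\n");
        }
        Matcher m = HEAD_OPEN.matcher(html);
        if (!m.find()) {
            // Assumption: the generated document always has a head element.
            throw new IllegalArgumentException("No <head> tag found");
        }
        return html.substring(0, m.end()) + "\n" + tags + html.substring(m.end());
    }
}
```

Calling `HeadMetaInserter.addMeta(html, map)` on the HTMLEditorKit output before writing it to disk should be enough for simple cases. A real HTML parser would be more robust than the regular expression used here.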
Related
I have a PDF file that was produced with iText and created with JasperReports (I don't know if that is relevant). I was wondering whether there is an API or tool that lets me see its structure, because I need to extract text from it.
I tried iText, PDFBox and other Java libraries, but I only get the text line by line, and that's not what I need.
I also tried converting to HTML, XML and DOM, but I get the same result as with text extraction: no structure is parsed.
If I open it as DOCX, I see that Word recognizes some structure. For example, an area that looks like a table in the PDF is an actual table after conversion to DOCX.
I need to understand how the PDF was created, if this is possible. I know that working with PDFs is not easy, but I need to start with something useful. Thanks!
PDFTron PDFGenie can do full semantic table and paragraph extraction from a PDF file. It can generate a reflowable HTML file containing all the appropriate HTML tags for tables and paragraphs.
See this blog for more details.
https://www.pdftron.com/blog/parsing-extraction/table-extraction-and-pdf-to-xml-with-pdfgenie/#a-idpart7aevaluating-accuracy-of-pdf-table-recognition
You can download the Windows/macOS/Linux PDFGenie command line tool here.
https://www.pdftron.com/downloads/linux
Another option is Aspose.PDF, which can also extract text by paragraph. See the link below:
https://blog.aspose.com/2018/02/28/extract-text-by-paragraphs-and-convert-files-to-pdf-with-aspose.pdf/
I am working on a project that converts HTML to a .doc file. I built the HTML with divs rather than table/td. When I generate and download the .doc file, the CSS I used in the HTML is not applied.
I did some research and found that .doc does not support some CSS properties, e.g. position and float.
https://superuser.com/questions/146453/css-absolute-position-dont-work-in-ms-word
Is there any alternative way to get the CSS applied in the .doc format?
Can someone please help?
Here is a list of the supported attributes:
MS Word supported HTML tags
Check it and apply it carefully in your project; it should help you get there with only minor changes.
Most tags have full support, except the div tag.
Link to check COREEXTENDED
div has COREEXTENDED support. You can see it at the link.
Which Java APIs help with extracting table metadata from a PDF and presenting that table in a web page?
The result should be that, when the page source is viewed, it shows the HTML code of that table.
iText is useful in this context:
http://itextpdf.com/
I assume that you need a PDF library for Java.
PDFBox is one of the popular libraries for PDF manipulation, and I think it is worth a look.
Try the Metadata Extract Tool, which extracts metadata from specific file types, including PDF. You can then parse the XML output with any Java XML parser. Once you are able to parse it, the elements can easily be laid out in your view page.
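Whichever library does the extraction, the last step (turning the extracted cell values into a table that shows up in the page source) could look roughly like this. It is only a sketch: `HtmlTableRenderer` is a made-up name, and the rows are assumed to be already available as lists of strings.

```java
import java.util.List;

// Hypothetical helper: renders already-extracted cell values as an HTML table.
final class HtmlTableRenderer {
    // Escapes characters that would otherwise be read as markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds one <tr> per row and one <td> per cell.
    static String render(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("<table>\n");
        for (List<String> row : rows) {
            sb.append("  <tr>");
            for (String cell : row) {
                sb.append("<td>").append(escape(cell)).append("</td>");
            }
            sb.append("</tr>\n");
        }
        return sb.append("</table>").toString();
    }
}
```

The returned string can be written into the JSP or servlet response as it is, so the table markup is visible when the page source is viewed.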
I am working on a JSF application. Using some code, I have created an HTML file that is stored on my server, and now I want to convert that HTML file into a .doc file. Please help.
There are tools that convert HTML to PDF: Convert an HTML file to a PDF with their pictures and styles using Java
Then use: http://pdf-to-doc.software.informer.com/
You could strip it of scripts, comments, etc., then load it into Word and save it as .doc.
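A minimal sketch of the stripping step, assuming the HTML is available as a string. The class name is made up, and the regular expressions are a crude stand-in for a proper HTML parser, so unusual markup may slip through.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes <script> elements and HTML comments.
final class HtmlCleaner {
    // DOTALL lets '.' span line breaks; '.*?' stops at the first closing tag.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Scripts go first, so comment markers inside a script are removed with it.
    static String strip(String html) {
        String noScripts = SCRIPT.matcher(html).replaceAll("");
        return COMMENT.matcher(noScripts).replaceAll("");
    }
}
```

The cleaned string can then be saved as an .html file and opened in Word.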
One solution: I downloaded the jar file that provides
officetools.OfficeFile;
and then wrote some code, which is easy to find on the net...
You can convert the HTML to XML (using XSLT) and load it into Word afterwards. I did this when converting an HTML page with tables to XML with an XSLT file attached, and it worked just fine.
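The XSLT step can be sketched with the JDK's built-in `javax.xml.transform` API. This assumes the HTML is well-formed XML (XHTML without a namespace); real-world HTML usually needs tidying first. The stylesheet and the `rows`/`row`/`cell` element names are hypothetical, chosen only to show the mechanics, and are not the XML format Word expects.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies an XSLT stylesheet to well-formed XHTML.
final class XhtmlToXml {
    // Illustrative stylesheet: copies table cells into rows/row/cell elements.
    static final String TABLE_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' indent='no' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<rows><xsl:for-each select='//tr'>"
      + "<row><xsl:for-each select='td'>"
      + "<cell><xsl:value-of select='.'/></cell>"
      + "</xsl:for-each></row>"
      + "</xsl:for-each></rows>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the transformation and returns the result as a string.
    static String transform(String xhtml, String xslt) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

To produce something Word can open, the stylesheet would have to emit a format Word understands. The original answer does not say which one was used.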
I have a JSP file that writes all the data from a database into an MS Word document by setting the content type.
I need to add a header and footer to the same document. I couldn't find a direct way to do this from JSP without using APIs like POI, so I created a macro, which works locally.
How do I add this to a dynamically generated Word file?
I had a similar problem with POI and Excel.
The solution is to manually create a template .doc file, with the macro present. Then in your code, load that document, amend it with your data, and save it. The macro will be preserved from the template document.
I'd use POI or docx4j to create a docx file on the server, and add the header/footer as part of that process.