Adding pdf document to Solr in Java


I can add fields like this in Java, but I want to add a PDF document to Solr with SolrJ. How can I add a PDF file?
CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("cat", "lalal");
doc.addField("id", "1");
server.add(doc);
server.commit();

Solr uses Apache Tika to process binary files.
See http://wiki.apache.org/solr/ExtractingRequestHandler and http://wiki.apache.org/solr/ContentStreamUpdateRequestExample for a SolrJ example.
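The ExtractingRequestHandler page linked above documents the /update/extract endpoint, which accepts a raw file body and takes field values through literal.* parameters. As a minimal sketch, assuming Java 11+ and that the handler is enabled at that path, the following posts a PDF over plain HTTP with the JDK's HttpClient instead of SolrJ (in SolrJ, the corresponding request class is ContentStreamUpdateRequest). The class name SolrPdfPoster and its helper methods are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper, not part of Solr or SolrJ.
final class SolrPdfPoster {

    // Builds the extract URL; literal.id sets the "id" field of the new document.
    static URI extractUri(String solrBaseUrl, String docId) {
        String id = URLEncoder.encode(docId, StandardCharsets.UTF_8);
        return URI.create(solrBaseUrl + "/update/extract?literal.id=" + id + "&commit=true");
    }

    // Sends the PDF bytes as the request body and returns the HTTP status code.
    static int postPdf(String solrBaseUrl, String docId, Path pdf)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(extractUri(solrBaseUrl, docId))
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(pdf))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: SolrPdfPoster <docId> <file.pdf>");
            return;
        }
        // Base URL taken from the question; adjust it for your installation.
        int status = postPdf("http://localhost:8983/solr", args[0], Path.of(args[1]));
        System.out.println("Solr responded with HTTP " + status);
    }
}
```

On installations with named cores or collections, the base URL typically includes that name as well, so treat the URL here as an assumption to adapt.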

Related

How can I copy the content of one google doc file to another google doc file using google doc api?

I have a Google Doc file with some initial content, and I want to append to it the content (the entire body) of another Google Doc file. I am using the Google Docs API from Java.
I can read the contents of the Google Doc file that I want to copy as follows:
Document doc = service.documents().get(DOCUMENT_ID).execute();
doc.getBody().getContent()
But I do not know how to insert this content into the other Google Doc file. Specifically, one of my Google Doc files contains a table that I want to copy to the other file. First, I find the table as follows:
List<StructuralElement> elements = doc.getBody().getContent();
for (StructuralElement element : elements) {
    if (element.getTable() != null) {
        // How can I copy this element.getTable() table to my other Google Doc file?
    }
}
Currently I am using the following code to add content with the batchUpdate operation and requests:
List<Request> requests = new ArrayList<>();
BatchUpdateDocumentRequest body = new BatchUpdateDocumentRequest();
service.documents().batchUpdate(DOCUMENT_ID, body.setRequests(requests)).execute();
I have looked at the Google Docs API documentation, but I could not find what I am looking for.
Is there any approach or recommendation for performing this operation?
N.B. I am aware that Google Apps Script has a method to append a table to a Google Doc file:
body.appendTable(table)
Is there something like this in Java that can append a table from one doc file to another?

How to print ODF files using java print API

How do I print an ODF file using the Java Print API framework? I can't seem to find any information on printing ODF files with it.
Java accepts PostScript and wraps it in a Doc for printing:
FileInputStream fis = new FileInputStream("example.ps");
Doc doc = new SimpleDoc(fis, psFlavor, null);
pj.print(doc, aset);
How do I do the same for an ODF file?
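The Java Print Service API can only print a document format that an installed print service advertises support for. The following is a minimal sketch, assuming the ODF text MIME type application/vnd.oasis.opendocument.text and a hypothetical class name OdfPrintCheck, that asks which services (if any) accept an ODF stream directly. If none do, the file would have to be converted to a supported format such as PostScript or PDF before printing.

```java
import javax.print.DocFlavor;
import javax.print.PrintService;
import javax.print.PrintServiceLookup;

// Hypothetical helper for illustration only.
final class OdfPrintCheck {

    // MIME type for ODF text documents (.odt); an assumption about the input file.
    static final String ODT_MIME = "application/vnd.oasis.opendocument.text";

    // Describes "an InputStream carrying ODT data" to the print API.
    static DocFlavor odtFlavor() {
        return new DocFlavor.INPUT_STREAM(ODT_MIME);
    }

    // Returns the print services that claim to accept the given flavor.
    static PrintService[] servicesAccepting(DocFlavor flavor) {
        return PrintServiceLookup.lookupPrintServices(flavor, null);
    }

    public static void main(String[] args) {
        PrintService[] services = servicesAccepting(odtFlavor());
        System.out.println(services.length + " print service(s) accept " + ODT_MIME);
        for (PrintService service : services) {
            System.out.println("  " + service.getName());
        }
    }
}
```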

Read Text from RTF file

I tried to read an RTF file using Apache POI, but it fails with an Invalid Header exception. It seems that POI doesn't support RTF files. Is there any way to read .rtf files using an open source Java API? (I have heard about the Aspose API, but it's not free.)
Any solutions?

You can try the RTFEditorKit. It supports images and text as well.
Or look at this answer: Java API to convert RTF file to Word document (97-2003 format)
There is no free library that supports this. But it may not be that hard to create a basic compare function yourself. You can read in an rtf file and then extract the text like this:
// read rtf from file
JEditorPane p = new JEditorPane();
p.setContentType("text/rtf");
EditorKit rtfKit = p.getEditorKitForContentType("text/rtf");
rtfKit.read(new FileReader(fileName), p.getDocument(), 0);
rtfKit = null;
// convert to text
EditorKit txtKit = p.getEditorKitForContentType("text/plain");
Writer writer = new StringWriter();
txtKit.write(writer, p.getDocument(), 0, p.getDocument().getLength());
String documentText = writer.toString();
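The first suggestion, RTFEditorKit, can also be used directly without a JEditorPane. This is a minimal sketch, assuming the RTF is simple enough for the JDK's built-in parser; the class name RtfText and the sample input are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper for illustration only.
final class RtfText {

    // Parses RTF from the reader and returns the plain text of the document.
    static String extract(Reader rtf) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(rtf, doc, 0);
        return doc.getText(0, doc.getLength());
    }

    // Convenience overload for RTF already held in a String.
    static String extractFromString(String rtfSource) {
        try {
            return extract(new StringReader(rtfSource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException("Unexpected document offset", e);
        }
    }

    public static void main(String[] args) {
        String sample = "{\\rtf1\\ansi{\\fonttbl\\f0\\fswiss Helvetica;}"
                + "\\f0\\pard This is some {\\b bold} text.\\par}";
        System.out.println(extractFromString(sample).trim());
    }
}
```

For a file on disk, a FileReader can be passed to extract, as in the answer above.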

Creating PDF from Word (DOC) using Apache POI and iText in JAVA

I am trying to generate a PDF document from a *.doc document.
So far, thanks to Stack Overflow, I have managed to generate it, but with some problems.
My sample code below generates the PDF without formatting or images, just the text.
The document includes blank spaces and images which are not included in the PDF.
Here is the code:
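in = new FileInputStream(sourceFile.getAbsolutePath());
out = new FileOutputStream(outputFile);
// WordExtractor returns only the plain text of the .doc file
WordExtractor wd = new WordExtractor(in);
String text = wd.getText();
Document pdf = new Document(PageSize.A4);
PdfWriter.getInstance(pdf, out);
pdf.open();
pdf.add(new Paragraph(text));
// closing the document finishes writing the PDF to the output stream
pdf.close();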

docx4j includes code for creating a PDF from a docx using iText. It can also use POI to convert a doc to a docx.
There was a time when we supported both methods equally (as well as PDF via XHTML), but we decided to focus on XSL-FO.
If it's an option, you'd be much better off using docx4j to convert a docx to PDF via XSL-FO and FOP.
Use it like so:
wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
// Set up font mapper
Mapper fontMapper = new IdentityPlusMapper();
wordMLPackage.setFontMapper(fontMapper);
// Example of mapping missing font Algerian to installed font Comic Sans MS
PhysicalFont font
= PhysicalFonts.getPhysicalFonts().get("Comic Sans MS");
fontMapper.getFontMappings().put("Algerian", font);
org.docx4j.convert.out.pdf.PdfConversion c
= new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
// = new org.docx4j.convert.out.pdf.viaIText.Conversion(wordMLPackage);
OutputStream os = new java.io.FileOutputStream(inputfilepath + ".pdf");
c.output(os);
Update July 2016
As of docx4j 3.3.0, Plutext's commercial PDF renderer is docx4j's default option for docx to PDF conversion. You can try an online demo at converter-eval.plutext.com
If you want to use the existing docx to XSL-FO to PDF (or other target supported by Apache FOP) approach, then just add the docx4j-export-FO jar to your classpath.
Either way, to convert docx to PDF, you can use the Docx4J facade's toPDF method.
The old docx to PDF via iText code can be found at https://github.com/plutext/docx4j-export-FO/.../docx4j-extras/PdfViaIText/

WordExtractor just grabs the plain text, nothing else. That's why all you're seeing is the plain text.
What you'll need to do is get each paragraph individually, then grab each run, fetch the formatting, and generate the equivalent in PDF.
One option may be to find some code that turns XHTML into a PDF. Then, use Apache Tika to turn your word document into XHTML (it uses POI under the hood, and handles all the formatting stuff for you), and from the XHTML on to PDF.
Otherwise, if you're going to do it yourself, take a look at the code in Apache Tika for parsing word files. It's a really great example of how to get at the images, the formatting, the styles etc.

I have successfully used Apache FOP to convert a 'WordML' document to PDF. WordML is the Office 2003 way of saving a Word document as XML. XSLT stylesheets can be found on the web to transform this XML to XSL-FO, which in turn can be rendered by FOP into PDF (among other outputs).
It's not so different from the solution plutext offered, except that it doesn't read a .doc document, whereas docx4j apparently does. If your requirements are flexible enough to have WordML style documents as input, this might be worth looking into.
Good luck with your project!
Wim

Use OpenOffice/LibreOffice and JODConverter.
This also mostly works for converting .doc to .docx, though there are problems with graphics that I have not yet worked out.
private static void transformDocXToPDFUsingJOD(File in, File out)
{
    OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
    DocumentFormat pdf = converter.getFormatRegistry().getFormatByExtension("pdf");
    converter.convert(in, out, pdf);
}
private static OfficeManager officeManager;
@BeforeClass
public static void setupStatic() throws IOException {
    /*officeManager = new DefaultOfficeManagerConfiguration()
        .setOfficeHome("C:/Program Files/LibreOffice 3.6")
        .buildOfficeManager();
    */
    officeManager = new ExternalOfficeManagerConfiguration().setConnectOnStart(true).setPortNumber(8100).buildOfficeManager();
    officeManager.start();
}
@AfterClass
public static void shutdownStatic() throws IOException {
    officeManager.stop();
}
You need to be running LibreOffice as a server to make this work.
From the command line you can do this using:
"C:\Program Files\LibreOffice 3.6\program\soffice.exe" -accept="socket,host=0.0.0.0,port=8100;urp;LibreOffice.ServiceManager" -headless -nodefault -nofirststartwizard -nolockcheck -nologo -norestore

Another option I came across recently is using the OpenOffice (or LibreOffice) API (see here). I have not been able to get into this but it should be able to open documents in various formats and output them in a pdf format. If you look into this, let me know how it worked!
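A lighter-weight way to let LibreOffice do the conversion, without the API or a running listener, is its command-line converter: recent versions accept --headless --convert-to pdf --outdir. The following is a minimal sketch of driving that from Java with ProcessBuilder, assuming soffice is installed and reachable at the given path. The class name SofficeConvert and the file names are hypothetical, and this is a swapped-in alternative rather than the approach described above.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

// Hypothetical helper for illustration only.
final class SofficeConvert {

    // Builds the soffice command line for a single-file PDF conversion.
    static List<String> command(String sofficePath, File input, File outDir) {
        return List.of(sofficePath, "--headless", "--convert-to", "pdf",
                "--outdir", outDir.getPath(), input.getPath());
    }

    // Runs the conversion and returns the process exit code (0 usually means success).
    static int convert(String sofficePath, File input, File outDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(sofficePath, input, outDir))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: SofficeConvert <soffice> <input> <outDir>");
            return;
        }
        int exit = convert(args[0], new File(args[1]), new File(args[2]));
        System.out.println("soffice exited with code " + exit);
    }
}
```

The PDF is written into the output directory under the input file's base name. Older releases such as the LibreOffice 3.6 mentioned above use single-dash options, so the flags may need adjusting there.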

Java: parsing ms-word document using POI/HWPF

I have a ms-word document (MS-Office 2003; non-xml). Within this
document there is a string associated with a bookmark. Furthermore,
the word document contains word-macros. My goal is to read the
document with java, replace the string associated with the bookmark,
and save the document back to word format.
My first approach was using Apache POI HWPF:
HWPFDocument doc = new HWPFDocument(new FileInputStream("Test.doc"));
doc.write(new FileOutputStream("Test_generated.doc"));
The problem with this solution is that the generated file does not
contain the macro anymore (File size of the original document: 32k;
file size of the generated document 19k).
Does anybody know if it's possible to retain all the original info
using POI/HWPF?

I never found a solution. The customer had to either pay for an Aspose license (expensive) or refrain from using macros.
