Unable to read unicode character in pdf using java

I am trying to convert a PDF document that contains Tamil Unicode characters into a Word document while retaining all the formatting. I cannot read the Unicode characters in the PDF; they appear as junk characters in Word. I am using the code below. Can someone please help?
public static void main(String[] args) throws IOException {
    System.out.println("Document converted started");
    XWPFDocument doc = new XWPFDocument();
    String pdf = "D:\\sample1.pdf";
    PdfReader reader = new PdfReader(pdf);
    // InputStreamReader isr = new InputStreamReader(reader,"UTF8");
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        TextExtractionStrategy strategy = parser.processContent(i,
                new SimpleTextExtractionStrategy());
        System.out.println(strategy.getResultantText());
        String text = strategy.getResultantText();
        XWPFParagraph p = doc.createParagraph();
        XWPFRun run = p.createRun();
        // run.setFontFamily(new Font("Arial"));
        run.setFontSize(14);
        run.setText(text);
        // run.addBreak(BreakType.PAGE);
    }
    FileOutputStream out = new FileOutputStream("D:\\tamildoc.docx");
    doc.write(out);
    out.close();
    reader.close();
    System.out.println("Document converted successfully");
}

You can use the Apache PDFBox library (https://pdfbox.apache.org/download.cgi). Calling getText(PDDocument doc) on a PDFTextStripper returns a single String containing the text content of the PDF file.
Here is an example:
UploadedFile file = new UploadedFile(fileName);
InputStream is = file.getInputStream();
PDDocument doc = PDDocument.load(is);
String content = new PDFTextStripper().getText(doc);
doc.close();
After that, you can write the text to your file.
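For that last step, one way to keep Tamil characters intact is to name the encoding explicitly when writing. The following is only a minimal sketch using the JDK: the class name TextFileWriter, the method writeUtf8, and the temporary file are hypothetical, and the sample string stands in for the content value obtained above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextFileWriter {

    // Writes text to the given path as UTF-8 so non-ASCII characters survive.
    static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the String returned by PDFTextStripper.getText(doc);
        // the escapes spell a short Tamil word so the source file stays ASCII.
        String content = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD sample text";
        Path target = Files.createTempFile("extracted", ".txt");
        writeUtf8(target, content);

        // Read the file back with the same charset to check nothing was lost.
        String roundTrip = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println(content.equals(roundTrip));
    }
}
```

Relying on the platform default charset instead (for example via FileWriter on older JDKs) is a common way for such text to end up as junk characters, which is why the charset is spelled out here.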

Related

Delete paragraph from PDF - Java

Greetings,
What is the easiest way to delete text or a paragraph from a PDF document? The program takes a PDF document and creates a separate PDF document for each page. Each of those documents contains text from the original that I would like to delete.
I tried a couple of examples, but they either don't work or use old libraries.
I am using the iText 7 library.
private void processPDF(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
    if (object instanceof PRStream) {
        PRStream stream = (PRStream) object;
        byte[] data = PdfReader.getStreamBytes(stream);
        String dd = new String(data, "UTF8")
                .replace("Hand made software", "");
        stream.setData(dd.getBytes("UTF8"));
        if (dd.contains("Hand made software")) {
            System.out.println("Contains");
        }
    }
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}

private void processPDF2(String src, String dest) throws InvalidPasswordException, IOException {
    Map<String, String> map = new HashMap<>();
    map.put("Hand made software", "");
    File template = new File(src);
    PDDocument document = PDDocument.load(template);
    List<PDField> fields = document.getDocumentCatalog().getAcroForm().getFields();
    for (PDField field : fields) {
        for (Map.Entry<String, String> entry : map.entrySet()) {
            if (entry.getKey().equals(field.getFullyQualifiedName())) {
                field.setValue(entry.getValue());
                field.setReadOnly(true);
            }
        }
    }
    File out = new File(dest);
    document.save(out);
    document.close();
}
I want to delete the line "Hand made software".

You can do this by iterating through the elements of the PDF. First, create a PDF reader for the template at the src path and a writer for the dest path:
File template = new File(src);
PdfReader reader = new PdfReader(template);
File out = new File(dest);
PdfWriter writer = new PdfWriter(out);
Then create a PdfDocument and wrap it in a Document:
PdfDocument pdf = new PdfDocument(reader, writer);
Document document = new Document(pdf);
Next, iterate through the elements of the document and remove every paragraph whose text is "Hand made software":
for (int i = 0; i < document.getRoots().size(); i++) {
    if (document.getRoots().get(i) instanceof Paragraph) { // only look at paragraphs
        Paragraph paragraph = (Paragraph) document.getRoots().get(i);
        if (paragraph.getText().equals("Hand made software")) { // the paragraph matches the text to remove
            document.getRoots().remove(i);
            i--; // stay on the same index, since the list shifted left
        }
    }
}
Finally, close the document to save the changes:
document.close();

How to copy paragraphs, including character styles, to a new document in Apache POI?

I'm trying to copy a certain number of paragraphs from an MS Word file into a new one with Apache POI. Paragraph styles copy without a problem, but I can't transfer inline character styles to the new file. How do I get the character styles and apply them to the new document?
FileInputStream in = new FileInputStream("oldDoc.docx");
XWPFDocument doc = new XWPFDocument(in);
XWPFDocument newDoc = new XWPFDocument();
// Copy styles from old to new doc
XWPFStyles newStyles = newDoc.createStyles();
newStyles.setStyles(doc.getStyle());
List<XWPFParagraph> paragraphs = doc.getParagraphs();
for (int p = 0; p < paragraphs.size(); p++) {
    XWPFParagraph oldPar = paragraphs.get(p);
    XWPFParagraph newPar = newDoc.createParagraph();
    // Apply paragraph style
    newPar.setStyle(oldPar.getStyle());
    XWPFRun run = newPar.createRun();
    run.setText(oldPar.getText());
}
FileOutputStream outNewDoc = new FileOutputStream("newDoc.docx");
newDoc.write(outNewDoc);
in.close();
outNewDoc.close();

try {
    FileInputStream in = new FileInputStream("in.docx");
    XWPFDocument oldDoc = new XWPFDocument(in);
    XWPFDocument newDoc = new XWPFDocument();
    // Copy styles from template to new doc
    XWPFStyles newXStyles = newDoc.createStyles();
    newXStyles.setStyles(oldDoc.getStyle());
    List<XWPFParagraph> oldDocParagraphs = oldDoc.getParagraphs();
    for (XWPFParagraph oldPar : oldDocParagraphs) {
        // Create a new paragraph and give it the style of the old paragraph
        XWPFParagraph newPar = newDoc.createParagraph();
        newPar.setStyle(oldPar.getStyle());
        // Loop over the runs of the old paragraph.
        for (XWPFRun oldRun : oldPar.getRuns()) { // the paragraph's text, split into pieces by style
            // Create a run for the new paragraph
            XWPFRun newParRun = newPar.createRun();
            // Copy the old run's text to the new run
            String runText = oldRun.text();
            newParRun.setText(runText);
            // Copy the old run's character style to the new run
            CTRPr oldCTRPr = oldRun.getCTR().getRPr();
            if (oldCTRPr != null) {
                if (oldCTRPr.sizeOfRStyleArray() != 0) {
                    String carStyle = oldRun.getStyle();
                    newParRun.setStyle(carStyle);
                }
            }
            // Add the new run to the new paragraph
            newPar.addRun(newParRun);
        }
    }
    // Write to file and close, once after all paragraphs have been copied.
    FileOutputStream out = new FileOutputStream("out.docx");
    newDoc.write(out);
    out.close();
} catch (IOException | XmlException e) {
    e.printStackTrace();
}

Converting PDF document containing graphs and tables to Word Document

I am trying to convert a PDF document to a Word file using Java. I found a code snippet on the Internet that converts a PDF document to Word, but the layout of the resulting Word document is poor. Images, tables, and graphs are out of place, and everything is rendered as plain text paragraphs.
The code I have written is given below.
XWPFDocument doc = new XWPFDocument();
String pdf = "D:\\xyz.pdf";
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    TextExtractionStrategy strategy = (TextExtractionStrategy)
            parser.processContent(i, new SimpleTextExtractionStrategy());
    String text = strategy.getResultantText();
    XWPFParagraph p = doc.createParagraph();
    XWPFRun run = p.createRun();
    run.setText(text);
    run.addBreak(BreakType.PAGE);
}
Can anyone please help?

Java get plain Text from RTF

My database has a column that holds text in RTF format.
How can I get only the plain text of it, using Java?

RTFEditorKit rtfParser = new RTFEditorKit();
Document document = rtfParser.createDefaultDocument();
rtfParser.read(new ByteArrayInputStream(rtfBytes), document, 0);
String text = document.getText(0, document.getLength());
This should work.
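The same calls can be wrapped into a small self-contained program. This is a sketch rather than a definitive implementation: the class name RtfToPlainText, the method toPlainText, and the sample RTF string are hypothetical, and the bytes are assumed to be 7-bit RTF as it would come out of the database column.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

public class RtfToPlainText {

    // Parses RTF markup and returns only its text content.
    static String toPlainText(String rtf) throws IOException, BadLocationException {
        RTFEditorKit rtfParser = new RTFEditorKit();
        Document document = rtfParser.createDefaultDocument();
        // RTF files are 7-bit text, so ISO-8859-1 maps each char to one byte.
        byte[] rtfBytes = rtf.getBytes(StandardCharsets.ISO_8859_1);
        rtfParser.read(new ByteArrayInputStream(rtfBytes), document, 0);
        return document.getText(0, document.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample value standing in for the database column.
        String rtf = "{\\rtf1\\ansi{\\fonttbl\\f0\\fswiss Helvetica;}"
                + "\\f0\\pard This is some {\\b bold} text.\\par}";
        // trim() drops the trailing line break produced by \par.
        System.out.println(toPlainText(rtf).trim());
    }
}
```

Note that Document here is javax.swing.text.Document, not a class from a PDF or Word library, so the import matters if several of these libraries are on the classpath.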

You could also try AdvancedRTFEditorKit: http://java-sl.com/advanced_rtf_editor_kit.html
I have used it to build a complete RTF editor with all the features MS Word supports.

Apache POI will also read Microsoft Word formats, not just RTF.
import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public String getRtfText(String fileName) {
    File rtfFile = new File(fileName);
    // Open the file; try-with-resources closes the stream automatically.
    try (FileInputStream inStream = new FileInputStream(rtfFile.getAbsolutePath())) {
        // HWPFDocument reads the document from the input stream.
        HWPFDocument doc = new HWPFDocument(inStream);
        WordExtractor rtfExtractor = new WordExtractor(doc);
        // getParagraphText() returns one array entry per paragraph.
        StringBuilder rtfString = new StringBuilder();
        for (String paragraph : rtfExtractor.getParagraphText()) {
            rtfString.append(paragraph);
        }
        System.out.println(rtfString);
        return rtfString.toString();
    } catch (Exception ex) {
        // Return null instead of failing later on an uninitialized extractor.
        System.out.println(ex.getMessage());
        return null;
    }
}

This works if the RTF text is in a JEditorPane:
String s = getPlainText(aJEditorPane.getDocument());

String getPlainText(Document doc) {
    try {
        return doc.getText(0, doc.getLength());
    }
    catch (BadLocationException ex) {
        System.err.println(ex);
        return null;
    }
}

How to convert a pdf file into CSV file?

I want to convert a PDF file into a CSV file.
I am using the iText library for this.
The program runs fine, but the output is not in the desired format.
All the data ends up on the first line of the CSV file. The output should match the PDF file exactly (that is, with line breaks).
Please help.
Thanks in advance.
Document document = new Document();
document.open();
PdfReader reader = new PdfReader("C:\\Indiaops-projects\\PREMIUM_PAID_ACKNOWLEDGEMENT.pdf");
PdfDictionary dictionary = reader.getPageN(1);
AcroFields fileds = reader.getAcroFields();
PRIndirectReference reference = (PRIndirectReference)
        dictionary.get(PdfName.CONTENTS);
PRStream stream = (PRStream) PdfReader.getPdfObject(reference);
byte[] bytes = PdfReader.getStreamBytes(stream);
PRTokeniser tokenizer = new PRTokeniser(bytes);
FileOutputStream fos = new FileOutputStream("C:\\Indiaops-projects\\pdf.csv");
StringBuffer buffer = new StringBuffer();
StringBuffer data = new StringBuffer();
int i = 0;
while (tokenizer.nextToken()) {
    if (tokenizer.getTokenType() == PRTokeniser.TK_STRING) {
        String value = tokenizer.getStringValue();
        if ("x-none".equals(value)) {
            String datastr = data.toString();
            if (!"".equals(datastr)) {
                buffer.append("\"" + datastr + "\",");
                data = new StringBuffer();
            }
        } else {
            data.append(value);
        }
    }
}
String test = buffer.toString();
StringReader stReader = new StringReader(test);
int t;
while ((t = stReader.read()) > 0)
    fos.write(t);
document.add(new Paragraph(".."));
document.close();

You need to append a line break ('\n') to the buffer after each table row:
buffer.append("\n");
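To show where that call would sit, here is a minimal sketch that builds CSV text from rows of cells. It assumes the cell values have already been grouped into rows, which the question's tokenizer loop does not yet do; the class name CsvRows, the helper methods, and the sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class CsvRows {

    // Wraps one cell in quotes and doubles embedded quotes (usual CSV escaping).
    static String quote(String cell) {
        return "\"" + cell.replace("\"", "\"\"") + "\"";
    }

    // Joins each row's cells with commas and ends every row with '\n'.
    static String toCsv(List<List<String>> rows) {
        StringBuilder buffer = new StringBuilder();
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    buffer.append(',');
                }
                buffer.append(quote(row.get(i)));
            }
            buffer.append("\n"); // the line break that was missing in the question
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Hypothetical cell values; in the question they come from PRTokeniser.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Policy No", "Premium"),
                Arrays.asList("12345", "1,000"));
        System.out.print(toCsv(rows));
    }
}
```

Appending the separator before every cell except the first also avoids the trailing comma that the question's buffer.append("\"" + datastr + "\",") leaves at the end of each row.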
