Importing PDF to String in Java

I need to extract text from a PDF file using Java. I found iText, but it doesn't work the way I wanted it to. Here's my code:
package com.itextpdf.mavenproject1;
import com.itextpdf.forms.PdfAcroForm;
import com.itextpdf.forms.fields.PdfButtonFormField;
import com.itextpdf.forms.fields.PdfFormField;
import com.itextpdf.io.font.FontConstants;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.kernel.pdf.PdfString;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.action.PdfAction;
import com.itextpdf.kernel.pdf.annot.PdfAnnotation;
import com.itextpdf.kernel.pdf.annot.PdfTextAnnotation;
import com.itextpdf.kernel.pdf.canvas.PdfCanvas;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import com.itextpdf.test.annotations.WrapToTest;
import java.io.File;
import java.io.IOException;
public class zczytywanie {
public static void main(String args[]) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader("D:/pdf/pdf"));
String page= PdfTextExtractor.getTextFromPage(pdfDoc, 1);
System.out.println(page);
}
}
The compiler reports an error on the line where I call PdfTextExtractor (PdfDocument cannot be converted to PdfPage), although I read that the argument has to be a PdfReader.
It doesn't work with
PdfReader pdfDoc = new PdfReader("D:/pdf/pdf");
either.

You can try PDFBox or Tika. Here is an example using PDFBox.
Add the PDFBox dependency to your pom.xml:
<dependencies>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.23</version>
</dependency>
</dependencies>
Java class
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;
public class TestPDF {
    public static void main(String[] args) {
        // try-with-resources closes the document when extraction is done
        try (PDDocument document = PDDocument.load(new File("/path_to_your_pdf_file"))) {
            if (!document.isEncrypted()) {
                PDFTextStripper tStripper = new PDFTextStripper();
                // order the text by its position on the page
                tStripper.setSortByPosition(true);
                // getText returns the text of the whole document as one String
                String pdfFileInText = tStripper.getText(document);
                System.out.println("Text:" + pdfFileInText);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Related

How to create pptx file for Link-Video in Slide using Apache-POI

I want to create a .pptx file with a linked video in its slides using Apache POI.
I found one example in the Apache examples code:
poi-4.1.2\src\scratchpad\testcases\org\apache\poi\hslf\model\TestMovieShape.
Using this example I am able to create a .ppt file, but not a .pptx file.
Also, with this example the media controls are not visible.

Only a few lines needed to be changed compared with the embedded video case.
The video URI is not a real URI, but simply a relative .mp4 filename in the same directory. Although I haven't tested it, absolute file URIs should also work.
I haven't implemented the frame extraction, as it's mentioned in the embedded example - so either look for an archived version of xuggler or find a different library to extract the preview image.
Tested with Powerpoint 2016 / POI 5.0.0-Snapshot.
import org.apache.poi.openxml4j.opc.PackagePart;
import org.apache.poi.openxml4j.opc.PackageRelationship;
import org.apache.poi.openxml4j.opc.TargetMode;
import org.apache.poi.sl.usermodel.PictureData;
import org.apache.poi.xslf.usermodel.XMLSlideShow;
import org.apache.poi.xslf.usermodel.XSLFPictureData;
import org.apache.poi.xslf.usermodel.XSLFPictureShape;
import org.apache.poi.xslf.usermodel.XSLFSlide;
import org.apache.xmlbeans.XmlCursor;
import org.openxmlformats.schemas.drawingml.x2006.main.CTHyperlink;
import org.openxmlformats.schemas.presentationml.x2006.main.*;
import javax.xml.namespace.QName;
import java.awt.*;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import static org.apache.poi.openxml4j.opc.PackageRelationshipTypes.CORE_PROPERTIES_ECMA376_NS;
public class LinkVideoToPptx {
public static void main(String[] args) throws IOException, URISyntaxException {
XMLSlideShow pptx = new XMLSlideShow();
String videoFileName = "file_example_MP4_640_3MG.mp4";
XSLFSlide slide1 = pptx.createSlide();
PackagePart pp = slide1.getPackagePart();
URI mp4uri = new URI("./"+videoFileName);
PackageRelationship prsEmbed1 = pp.addRelationship(mp4uri, TargetMode.EXTERNAL, "http://schemas.microsoft.com/office/2007/relationships/media");
PackageRelationship prsExec1 = pp.addRelationship(mp4uri, TargetMode.EXTERNAL, "http://schemas.openxmlformats.org/officeDocument/2006/relationships/video");
File previewJpg = new File("preview.jpg");
XSLFPictureData snap = pptx.addPicture(previewJpg, PictureData.PictureType.JPEG);
XSLFPictureShape pic1 = slide1.createPicture(snap);
pic1.setAnchor(new Rectangle(100, 100, 500, 400));
CTPicture xpic1 = (CTPicture)pic1.getXmlObject();
CTHyperlink link1 = xpic1.getNvPicPr().getCNvPr().addNewHlinkClick();
link1.setId("");
link1.setAction("ppaction://media");
CTApplicationNonVisualDrawingProps nvPr = xpic1.getNvPicPr().getNvPr();
nvPr.addNewVideoFile().setLink(prsExec1.getId());
CTExtension ext = nvPr.addNewExtLst().addNewExt();
ext.setUri("{DAA4B4D4-6D71-4841-9C94-3DE7FCFB9230}");
String p14Ns = "http://schemas.microsoft.com/office/powerpoint/2010/main";
XmlCursor cur = ext.newCursor();
cur.toEndToken();
cur.beginElement(new QName(p14Ns, "media", "p14"));
cur.insertNamespace("p14", p14Ns);
cur.insertAttributeWithValue(new QName(CORE_PROPERTIES_ECMA376_NS, "link"), prsEmbed1.getId());
cur.dispose();
CTSlide xslide = slide1.getXmlObject();
CTTimeNodeList ctnl;
if (!xslide.isSetTiming()) {
CTTLCommonTimeNodeData ctn = xslide.addNewTiming().addNewTnLst().addNewPar().addNewCTn();
ctn.setDur(STTLTimeIndefinite.INDEFINITE);
ctn.setRestart(STTLTimeNodeRestartType.NEVER);
ctn.setNodeType(STTLTimeNodeType.TM_ROOT);
ctnl = ctn.addNewChildTnLst();
} else {
ctnl = xslide.getTiming().getTnLst().getParArray(0).getCTn().getChildTnLst();
}
CTTLCommonMediaNodeData cmedia = ctnl.addNewVideo().addNewCMediaNode();
cmedia.setVol(80000);
CTTLCommonTimeNodeData ctn = cmedia.addNewCTn();
ctn.setFill(STTLTimeNodeFillType.HOLD);
ctn.setDisplay(false);
ctn.addNewStCondLst().addNewCond().setDelay(STTLTimeIndefinite.INDEFINITE);
cmedia.addNewTgtEl().addNewSpTgt().setSpid(""+pic1.getShapeId());
try (FileOutputStream fos = new FileOutputStream("mp4test-poi.pptx")) {
pptx.write(fos);
}
}
}
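The two URI forms mentioned in this answer (the relative file name that was tested, and the absolute file URI that was not) can be compared with the JDK alone. This is only a sketch: the class and method names are made up for illustration, it uses URI.create instead of the answer's new URI(...) to avoid the checked exception, and whether PowerPoint accepts the absolute form is still untested, as the answer says.

```java
import java.net.URI;
import java.nio.file.Paths;

// Hypothetical helper, not part of Apache POI.
class VideoLinkUris {
    /** Relative reference with no scheme, the form used in the answer. */
    static URI relativeLink(String fileName) {
        return URI.create("./" + fileName);
    }

    /** Absolute file: URI for the same name, resolved against the working directory. */
    static URI absoluteLink(String fileName) {
        return Paths.get(fileName).toAbsolutePath().toUri();
    }

    public static void main(String[] args) {
        String videoFileName = "file_example_MP4_640_3MG.mp4";
        System.out.println(relativeLink(videoFileName)); // relative, no scheme
        System.out.println(absoluteLink(videoFileName)); // starts with file:
    }
}
```

Either result could be passed to pp.addRelationship(...) in place of mp4uri. Note that URI.create (like new URI) rejects file names containing spaces, while Path.toUri() percent-encodes them.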

Java-Can't pass Directory variable as an argument to IndexReader.open() in Apache Lucene 6.4.2

I'm trying to use the open method described in the Lucene documentation here: https://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/index/IndexReader.html (search the page for 'open'). However, NetBeans 8.1 with Apache Lucene 6.4.2 shows an inline error at the statement 'reader = IndexReader.open(indexDirectory);'. Here are the error and the code.
Cannot find symbol
symbol: method open(Directory)
location: class IndexReader
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Indexing_Searching
{
public static final String FIELD_CONTENTS = "contents";
public int searchIndex(String instring, String Index_Dir_Path)
{
int numDocs =0;
try
{
Path path = Paths.get(Index_Dir_Path);
Directory indexDirectory = FSDirectory.open(path);
IndexReader reader;
reader = IndexReader.open(indexDirectory);
Term term = new Term("content", instring);
numDocs = reader.docFreq(term);
//System.out.println("Number of documents for given key" + instring +" # docs" + numDocs);
}
catch (CorruptIndexException e)
{
e.printStackTrace();
}
catch (IOException e)
{
e.printStackTrace();
}
return(numDocs);
}// End of one-words searching function
}

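The documentation linked in the question is for Lucene 3.5.0. According to the IndexReader Javadoc for Lucene 6.4.2, there is no IndexReader.open method any more; use DirectoryReader.open instead. In the code above, that means importing org.apache.lucene.index.DirectoryReader and changing the failing statement to reader = DirectoryReader.open(indexDirectory); the result can still be assigned to the IndexReader variable, because DirectoryReader is a subclass of IndexReader.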

Convert docx file to pdf in java..issue

I am developing a project that needs a docx file to be converted to PDF. I found the same question already posted and used the code provided by "Kishan C S". It uses docx4j 2.8.1.
The code works and the PDF is generated, but the docx file contains logo.jpg (images in the header), and these are not converted. Only the text is converted to PDF.
I am posting the code I used. Please let me know how I can solve the problem.
P.S. The question I referred to: Convert docx file into PDF with Java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.List;
import org.apache.log4j.Level;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.docx4j.convert.out.pdf.viaXSLFO.PdfSettings;
import org.docx4j.fonts.IdentityPlusMapper;
import org.docx4j.fonts.Mapper;
import org.docx4j.fonts.PhysicalFont;
import org.docx4j.fonts.PhysicalFonts;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
public class DocxConverter {
public static void main(String[] args) throws FileNotFoundException, Docx4JException, Exception {
InputStream is = new FileInputStream(new File("D:\\Test\\C_IN0004_AppointmentLetter.docx"));
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(is);
List sections = wordMLPackage.getDocumentModel().getSections();
for (int i = 0; i < sections.size(); i++) {
wordMLPackage.getDocumentModel().getSections().get(i).getPageDimensions();
}
Mapper fontMapper = new IdentityPlusMapper();
PhysicalFont font = PhysicalFonts.getPhysicalFonts().get("Comic Sans MS");//set your desired font
fontMapper.getFontMappings().put("Algerian", font);
wordMLPackage.setFontMapper(fontMapper);
PdfSettings pdfSettings = new PdfSettings();
org.docx4j.convert.out.pdf.PdfConversion conversion = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
//To turn off logger
List<Logger> loggers = Collections.<Logger> list(LogManager.getCurrentLoggers());
loggers.add(LogManager.getRootLogger());
for (Logger logger : loggers) {
logger.setLevel(Level.OFF);
}
OutputStream out = new FileOutputStream(new File("D:\\Test\\C_IN0004_AppointmentLetter.pdf"));
conversion.output(out, pdfSettings);
System.out.println("DONE!!");
}
}

zxing qrcode, error on read. com.google.zxing.NotFoundException

I am trying to read a QR code from an image file uploaded from a JSP page. To read the QR code I used the zxing jars.
code from
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Hashtable;
import java.util.Map;
import javax.imageio.ImageIO;
import com.google.zxing.BarcodeFormat;
import com.google.zxing.BinaryBitmap;
import com.google.zxing.EncodeHintType;
import com.google.zxing.MultiFormatReader;
import com.google.zxing.NotFoundException;
import com.google.zxing.Result;
import com.google.zxing.WriterException;
import com.google.zxing.client.j2se.BufferedImageLuminanceSource;
import com.google.zxing.common.BitMatrix;
import com.google.zxing.common.HybridBinarizer;
import com.google.zxing.qrcode.QRCodeWriter;
import com.google.zxing.qrcode.decoder.ErrorCorrectionLevel;
public class GenerateQRCode {
public String readQRCode(String filePath, String charset)
throws FileNotFoundException, IOException, NotFoundException {
Hashtable hintMap = new Hashtable();
hintMap.put(EncodeHintType.ERROR_CORRECTION, ErrorCorrectionLevel.L);
BinaryBitmap binaryBitmap = new BinaryBitmap(new HybridBinarizer(
new BufferedImageLuminanceSource( ImageIO.read(new FileInputStream(filePath)))));
Result qrCodeResult = new MultiFormatReader().decode(binaryBitmap, hintMap); // NotFoundException is thrown here
return qrCodeResult.getText();
}
}
This is the call where I try to get the QR code value into the string "result":
String result = rr.readQRCode(tmpFile.getCanonicalPath(), "UTF-8");
The following error is thrown in the method above, at the line marked with the comment:
com.google.zxing.NotFoundException
I already found the same question on Stack Overflow:
http://stackoverflow.com/questions/27770665/error-when-decoding-qr-code
but it has no proper answer. Will this code work, or should I look for an alternative? I have finished the code for generating a QR code; reading the code from a file is the problem with zxing.

I had a similar problem and found this: https://github.com/zxing/zxing/issues/216
You should set the PURE_BARCODE hint. So your code should be:
import com.google.zxing.DecodeHintType;
import java.util.EnumMap;
import java.util.Map;
// ...
// decoding takes DecodeHintType keys, not the EncodeHintType used in the question
Map<DecodeHintType, Object> hints = new EnumMap<>(DecodeHintType.class);
hints.put(DecodeHintType.PURE_BARCODE, true);
Result qrCodeResult = new MultiFormatReader().decode(binaryBitmap, hints);
return qrCodeResult.getText();
// ...

How to read tibetan content from pdf file?

I have a Tibetan PDF file and I want to extract its content. I tried the following three programs to read the file, but the output was not what I wanted.
code1:
import java.io.IOException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
public class iTextReadDemo {
public static void main(String[] args) {
try {
PdfReader reader = new PdfReader("");
String page = PdfTextExtractor.getTextFromPage(reader, 1);
System.out.println("Page Content:\n\n" + page + "\n\n");
} catch (IOException e) {
e.printStackTrace();
}
}
}// - See more at:
// http://www.quicklyjava.com/read-pdf-file-in-java-using-itext/#sthash.iAhF00Kj.dpuf
code2 :
import java.io.FileOutputStream;
import com.lowagie.text.Document;
import com.lowagie.text.PageSize;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfStamper;
import com.lowagie.text.pdf.PdfWriter;
public class MainClass {
public static void main(String[] args) throws Exception {
PdfReader reader = new PdfReader("");
byte[] bs = new byte[100];
byte[] streamBytes = reader.getPageContent(1);
for(byte b: streamBytes){
System.out.print((char)b);
}
}
}
code3:
package pdfBox;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
public class PDFTest {
public static void main(String[] args) throws Exception {
PDDocument pd;
File input = new File("C:\\Users\\Administrator\\Desktop\\tibetan Dictionary pdf/藏英英藏词典 - 副本.pdf");
pd = PDDocument.load(input);
PDFTextStripper reader = new PDFTextStripper("utf-8");
String pageText = reader.getText(pd);
System.out.println(pageText);
}
}
This is the dependency section of my Maven pom.xml:
<dependency>
<groupId>com.itextpdf</groupId>
<artifactId>itextpdf</artifactId>
<version>5.5.3</version>
</dependency>
<dependency>
<groupId>com.lowagie</groupId>
<artifactId>itext</artifactId>
<version>4.2.1</version>
</dependency>
<dependency>
<groupId>org.swinglabs</groupId>
<artifactId>pdf-renderer</artifactId>
<version>1.0.5</version>
</dependency>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>1.8.7</version>
</dependency>
What is wrong?
Is the following answer correct?
https://answers.acrobatusers.com/Can-I-convert-PDF-Word-Doc-Tibetan-script-addition-English-language-q219757.aspx
The quality of exported content from a PDF is directly related to the quality of the PDF's "build" (what is under the hood, not what you "see"). Poor quality export indicates a poorly built PDF. There is nothing you can do other than ask the originator of the PDF to do a better job.
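One rough way to tell whether a library returned real Tibetan text or arbitrary glyph codes is to count how many of the extracted characters fall in the Unicode Tibetan block (U+0F00 to U+0FFF). The sketch below uses only the JDK; the class name, the method name, and the sample string are illustrative assumptions, and the text to check would be whatever String the extractor produced (for example pageText in code3).

```java
// Hypothetical diagnostic helper, not part of iText or PDFBox.
class TibetanTextCheck {
    /** Fraction of non-whitespace code points that belong to the Tibetan Unicode block. */
    static double tibetanRatio(String text) {
        long total = text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .count();
        if (total == 0) {
            return 0.0; // nothing to measure
        }
        long tibetan = text.codePoints()
                .filter(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.TIBETAN)
                .count();
        return (double) tibetan / total;
    }

    public static void main(String[] args) {
        // Sample Tibetan word written with Unicode escapes (assumed test data)
        String sample = "\u0F56\u0F7C\u0F51\u0F0B\u0F66\u0F90\u0F51";
        System.out.println("Tibetan ratio: " + tibetanRatio(sample));
    }
}
```

A ratio near zero for a page that visibly contains Tibetan suggests that the PDF does not map its glyphs to Unicode. That would be consistent with the quoted explanation, and in that case switching extraction libraries is unlikely to help.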