I need to extract text from a PDF file using Java. I found iText, but it doesn't work the way I wanted. Here's my code:
package com.itextpdf.mavenproject1;
import com.itextpdf.forms.PdfAcroForm;
import com.itextpdf.forms.fields.PdfButtonFormField;
import com.itextpdf.forms.fields.PdfFormField;
import com.itextpdf.io.font.FontConstants;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.kernel.pdf.PdfString;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.action.PdfAction;
import com.itextpdf.kernel.pdf.annot.PdfAnnotation;
import com.itextpdf.kernel.pdf.annot.PdfTextAnnotation;
import com.itextpdf.kernel.pdf.canvas.PdfCanvas;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import com.itextpdf.test.annotations.WrapToTest;
import java.io.File;
import java.io.IOException;
public class zczytywanie {
public static void main(String args[]) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader("D:/pdf/pdf"));
String page= PdfTextExtractor.getTextFromPage(pdfDoc, 1);
System.out.println(page);
}
}
It reports an error on the line where I use PdfTextExtractor (PdfDocument cannot be converted to PdfPage), although I read that pdfDoc has to be a PdfReader.
It doesn't work with
PdfReader pdfDoc = new PdfReader("D:/pdf/pdf");
either.
You can try PDFBox or Tika. Here is an example using PDFBox.
Add the PDFBox jar dependency to your pom.xml.
<dependencies>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.23</version>
</dependency>
</dependencies>
Java class
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;
public class TestPDF {
    public static void main(String[] args) {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File("/path_to_your_pdf_file"))) {
            if (!document.isEncrypted()) {
                PDFTextStripper tStripper = new PDFTextStripper();
                // Order the text by its position on the page
                tStripper.setSortByPosition(true);
                String pdfFileInText = tStripper.getText(document);
                System.out.println("Text:" + pdfFileInText);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
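If you would rather stay with iText 7, the compile error in the question suggests that getTextFromPage expects a PdfPage rather than a PdfDocument. Here is a minimal sketch of that, assuming the iText 7 kernel module is on the classpath and that only the com.itextpdf.kernel import of PdfTextExtractor is kept (the question also imports the iText 5 class of the same name). The class name ExtractFirstPage and the helper textOfPage are made up for this example.

```java
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;

import java.io.IOException;

// Hypothetical helper class, not part of iText.
class ExtractFirstPage {

    // Returns the text of one page; page numbers start at 1.
    static String textOfPage(String path, int pageNumber) throws IOException {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(path));
        try {
            // Pass the PdfPage object, not the PdfDocument itself.
            return PdfTextExtractor.getTextFromPage(pdfDoc.getPage(pageNumber));
        } finally {
            pdfDoc.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the question; adjust it to your own file.
        System.out.println(textOfPage("D:/pdf/pdf", 1));
    }
}
```

Mixing iText 5 (com.itextpdf.text...) and iText 7 (com.itextpdf.kernel...) imports in one file is worth avoiding, since both ship a class called PdfTextExtractor.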
I want to create a pptx file with a linked video in its slides using Apache POI.
I found an example in the Apache POI examples code:
poi-4.1.2\src\scratchpad\testcases\org\apache\poi\hslf\model\TestMovieShape.
Using this example I am able to create a .ppt file, but not a .pptx file.
Also, with this example the media controls are not visible.
Only a few lines need to change compared to the embedded video case.
The video URI is not a real URI, just a relative .mp4 filename in the same directory. Although I haven't tested it, absolute file URIs should also work.
I haven't implemented the frame extraction mentioned in the embedded example, so either look for an archived version of xuggler or find a different library to extract the preview image.
Tested with PowerPoint 2016 / POI 5.0.0-SNAPSHOT.
import org.apache.poi.openxml4j.opc.PackagePart;
import org.apache.poi.openxml4j.opc.PackageRelationship;
import org.apache.poi.openxml4j.opc.TargetMode;
import org.apache.poi.sl.usermodel.PictureData;
import org.apache.poi.xslf.usermodel.XMLSlideShow;
import org.apache.poi.xslf.usermodel.XSLFPictureData;
import org.apache.poi.xslf.usermodel.XSLFPictureShape;
import org.apache.poi.xslf.usermodel.XSLFSlide;
import org.apache.xmlbeans.XmlCursor;
import org.openxmlformats.schemas.drawingml.x2006.main.CTHyperlink;
import org.openxmlformats.schemas.presentationml.x2006.main.*;
import javax.xml.namespace.QName;
import java.awt.*;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import static org.apache.poi.openxml4j.opc.PackageRelationshipTypes.CORE_PROPERTIES_ECMA376_NS;
public class LinkVideoToPptx {
public static void main(String[] args) throws IOException, URISyntaxException {
XMLSlideShow pptx = new XMLSlideShow();
String videoFileName = "file_example_MP4_640_3MG.mp4";
XSLFSlide slide1 = pptx.createSlide();
PackagePart pp = slide1.getPackagePart();
URI mp4uri = new URI("./"+videoFileName);
PackageRelationship prsEmbed1 = pp.addRelationship(mp4uri, TargetMode.EXTERNAL, "http://schemas.microsoft.com/office/2007/relationships/media");
PackageRelationship prsExec1 = pp.addRelationship(mp4uri, TargetMode.EXTERNAL, "http://schemas.openxmlformats.org/officeDocument/2006/relationships/video");
File previewJpg = new File("preview.jpg");
XSLFPictureData snap = pptx.addPicture(previewJpg, PictureData.PictureType.JPEG);
XSLFPictureShape pic1 = slide1.createPicture(snap);
pic1.setAnchor(new Rectangle(100, 100, 500, 400));
CTPicture xpic1 = (CTPicture)pic1.getXmlObject();
CTHyperlink link1 = xpic1.getNvPicPr().getCNvPr().addNewHlinkClick();
link1.setId("");
link1.setAction("ppaction://media");
CTApplicationNonVisualDrawingProps nvPr = xpic1.getNvPicPr().getNvPr();
nvPr.addNewVideoFile().setLink(prsExec1.getId());
CTExtension ext = nvPr.addNewExtLst().addNewExt();
ext.setUri("{DAA4B4D4-6D71-4841-9C94-3DE7FCFB9230}");
String p14Ns = "http://schemas.microsoft.com/office/powerpoint/2010/main";
XmlCursor cur = ext.newCursor();
cur.toEndToken();
cur.beginElement(new QName(p14Ns, "media", "p14"));
cur.insertNamespace("p14", p14Ns);
cur.insertAttributeWithValue(new QName(CORE_PROPERTIES_ECMA376_NS, "link"), prsEmbed1.getId());
cur.dispose();
CTSlide xslide = slide1.getXmlObject();
CTTimeNodeList ctnl;
if (!xslide.isSetTiming()) {
CTTLCommonTimeNodeData ctn = xslide.addNewTiming().addNewTnLst().addNewPar().addNewCTn();
ctn.setDur(STTLTimeIndefinite.INDEFINITE);
ctn.setRestart(STTLTimeNodeRestartType.NEVER);
ctn.setNodeType(STTLTimeNodeType.TM_ROOT);
ctnl = ctn.addNewChildTnLst();
} else {
ctnl = xslide.getTiming().getTnLst().getParArray(0).getCTn().getChildTnLst();
}
CTTLCommonMediaNodeData cmedia = ctnl.addNewVideo().addNewCMediaNode();
cmedia.setVol(80000);
CTTLCommonTimeNodeData ctn = cmedia.addNewCTn();
ctn.setFill(STTLTimeNodeFillType.HOLD);
ctn.setDisplay(false);
ctn.addNewStCondLst().addNewCond().setDelay(STTLTimeIndefinite.INDEFINITE);
cmedia.addNewTgtEl().addNewSpTgt().setSpid(""+pic1.getShapeId());
try (FileOutputStream fos = new FileOutputStream("mp4test-poi.pptx")) {
pptx.write(fos);
}
}
}
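The answer above leaves absolute file URIs untested. As a rough sketch of how the two forms could be built with the standard library only, the following uses java.net.URI and java.io.File. The class and method names are made up for this example, and whether PowerPoint accepts the absolute form as an external relationship target is still an assumption.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class for illustration only.
class VideoLinkUris {

    // Relative reference such as "./clip.mp4", resolved next to the .pptx file.
    static URI relativeUri(String fileName) {
        try {
            // The multi-argument constructor percent-encodes characters such as
            // spaces, which the single-string constructor would reject.
            return new URI(null, null, "./" + fileName, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not usable as a URI path: " + fileName, e);
        }
    }

    // Absolute file: URI for the same video.
    static URI absoluteUri(File video) {
        return video.getAbsoluteFile().toURI();
    }

    public static void main(String[] args) {
        String videoFileName = "file_example_MP4_640_3MG.mp4";
        System.out.println(relativeUri(videoFileName));
        System.out.println(absoluteUri(new File(videoFileName)));
    }
}
```

Either result could be passed to pp.addRelationship(...) in place of mp4uri. A relative link keeps the presentation portable as long as the video travels with it, while an absolute link ties it to one machine.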
I'm trying to use the open method described in the Lucene documentation here: https://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/index/IndexReader.html (search the page for 'open'). However, NetBeans 8.1 with Apache Lucene 6.4.2 gives an inline error on the statement 'reader = IndexReader.open(indexDirectory);'. Here are the error and the code.
Cannot find symbol
symbol: method open(Directory)
location: class IndexReader
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Indexing_Searching
{
public static final String FIELD_CONTENTS = "contents";
public int searchIndex(String instring, String Index_Dir_Path)
{
int numDocs =0;
try
{
Path path = Paths.get(Index_Dir_Path);
Directory indexDirectory = FSDirectory.open(path);
IndexReader reader;
reader = IndexReader.open(indexDirectory);
Term term = new Term("content", instring);
numDocs = reader.docFreq(term);
//System.out.println("Number of documents for given key" + instring +" # docs" + numDocs);
}
catch (CorruptIndexException e)
{
e.printStackTrace();
}
catch (IOException e)
{
e.printStackTrace();
}
return(numDocs);
}// End of one-words searching function
}
The documentation you linked is for Lucene 3.5.0. According to the IndexReader JavaDoc for Lucene 6.4.2, which no longer has IndexReader.open, you should use DirectoryReader.open instead.
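A minimal sketch of the lookup with DirectoryReader.open, assuming lucene-core 6.x is on the classpath. The class name DocFreqLookup is made up for this example, and the field name is a parameter because the question declares FIELD_CONTENTS as "contents" but queries the field "content".

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of Lucene.
class DocFreqLookup {

    // Number of documents in the index that contain the given term in the given field.
    static int docFreq(Path indexDir, String field, String text) throws IOException {
        // Both the directory and the reader are closed when the block exits.
        try (Directory directory = FSDirectory.open(indexDir);
             IndexReader reader = DirectoryReader.open(directory)) {
            return reader.docFreq(new Term(field, text));
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocFreqLookup <index directory> <term>
        System.out.println(docFreq(Paths.get(args[0]), "contents", args[1]));
    }
}
```

The imports of org.apache.lucene.analysis.SimpleAnalyzer and org.apache.lucene.queryParser.QueryParser in the question also follow the 3.x package layout, so they will probably need updating for 6.4.2 as well.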
I am developing a project that needs to convert a docx file to PDF. I found the same question already posted and used the code provided by "Kishan C S". It uses docx4j 2.8.1.
The code works and the PDF is generated, but the docx file contains logo.jpg (an image in the header), and the images are not converted. Only the text is converted to PDF.
I am posting the code I used. Please let me know how I can solve this problem.
P.S.: the question I referred to is "Convert docx file into PDF with Java".
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.List;
import org.apache.log4j.Level;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.docx4j.convert.out.pdf.viaXSLFO.PdfSettings;
import org.docx4j.fonts.IdentityPlusMapper;
import org.docx4j.fonts.Mapper;
import org.docx4j.fonts.PhysicalFont;
import org.docx4j.fonts.PhysicalFonts;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
public class DocxConverter {
public static void main(String[] args) throws FileNotFoundException, Docx4JException, Exception {
InputStream is = new FileInputStream(new File("D:\\Test\\C_IN0004_AppointmentLetter.docx"));
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(is);
List sections = wordMLPackage.getDocumentModel().getSections();
for (int i = 0; i < sections.size(); i++) {
wordMLPackage.getDocumentModel().getSections().get(i).getPageDimensions();
}
Mapper fontMapper = new IdentityPlusMapper();
PhysicalFont font = PhysicalFonts.getPhysicalFonts().get("Comic Sans MS");//set your desired font
fontMapper.getFontMappings().put("Algerian", font);
wordMLPackage.setFontMapper(fontMapper);
PdfSettings pdfSettings = new PdfSettings();
org.docx4j.convert.out.pdf.PdfConversion conversion = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
//To turn off logger
List<Logger> loggers = Collections.<Logger> list(LogManager.getCurrentLoggers());
loggers.add(LogManager.getRootLogger());
for (Logger logger : loggers) {
logger.setLevel(Level.OFF);
}
OutputStream out = new FileOutputStream(new File("D:\\Test\\C_IN0004_AppointmentLetter.pdf"));
conversion.output(out, pdfSettings);
System.out.println("DONE!!");
}
}
I am trying to read the QR code from an image file uploaded from a JSP page. To read the QR code I have used the ZXing jars.
code from
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Hashtable;
import java.util.Map;
import javax.imageio.ImageIO;
import com.google.zxing.BarcodeFormat;
import com.google.zxing.BinaryBitmap;
import com.google.zxing.EncodeHintType;
import com.google.zxing.MultiFormatReader;
import com.google.zxing.NotFoundException;
import com.google.zxing.Result;
import com.google.zxing.WriterException;
import com.google.zxing.client.j2se.BufferedImageLuminanceSource;
import com.google.zxing.common.BitMatrix;
import com.google.zxing.common.HybridBinarizer;
import com.google.zxing.qrcode.QRCodeWriter;
import com.google.zxing.qrcode.decoder.ErrorCorrectionLevel;
public class GenerateQRCode {
public String readQRCode(String filePath, String charset)
throws FileNotFoundException, IOException, NotFoundException {
Hashtable hintMap = new Hashtable();
hintMap.put(EncodeHintType.ERROR_CORRECTION, ErrorCorrectionLevel.L);
BinaryBitmap binaryBitmap = new BinaryBitmap(new HybridBinarizer(
new BufferedImageLuminanceSource( ImageIO.read(new FileInputStream(filePath)))));
**Result qrCodeResult = new MultiFormatReader().decode(binaryBitmap, hintMap);**
return qrCodeResult.getText();
}
}
This is the call where I am trying to get the QR code value in the string "result".
String result = rr.readQRCode(tmpFile.getCanonicalPath(), "UTF-8");
The following error is thrown in the called method at the bold line:
com.google.zxing.NotFoundException
I already found the same question on Stack Overflow:
http://stackoverflow.com/questions/27770665/error-when-decoding-qr-code
but there is no proper response to it. Will this code work, or should I look for an alternative? I have completed the code for generating a QR code; reading the code from a file is the issue with ZXing.
I had a similar problem and found this: https://github.com/zxing/zxing/issues/216
You should add the PURE_BARCODE hint. So your code should be:
// ...
Map<DecodeHintType, Object> hints = new EnumMap<>(DecodeHintType.class);
hints.put(DecodeHintType.PURE_BARCODE, true);
Result qrCodeResult = new MultiFormatReader().decode(binaryBitmap, hints);
return qrCodeResult.getText();
// ...
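For reference, here is the fragment above as a complete method, with the imports it needs. This is only a sketch, assuming the ZXing core and javase jars are on the classpath; the class name QrFileReader is made up. The hint map in the question holds EncodeHintType.ERROR_CORRECTION, which is a hint for the encoder, so this version uses decoder hints only and passes the charset as CHARACTER_SET.

```java
import com.google.zxing.BinaryBitmap;
import com.google.zxing.DecodeHintType;
import com.google.zxing.MultiFormatReader;
import com.google.zxing.NotFoundException;
import com.google.zxing.Result;
import com.google.zxing.client.j2se.BufferedImageLuminanceSource;
import com.google.zxing.common.HybridBinarizer;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.EnumMap;
import java.util.Map;
import javax.imageio.ImageIO;

// Hypothetical helper class, not part of ZXing.
class QrFileReader {

    static String readQRCode(String filePath, String charset)
            throws IOException, NotFoundException {
        BufferedImage image = ImageIO.read(new File(filePath));
        if (image == null) {
            // ImageIO returns null when no registered reader understands the file.
            throw new IOException("Not a readable image: " + filePath);
        }
        return decode(image, charset);
    }

    static String decode(BufferedImage image, String charset) throws NotFoundException {
        // Decoder hints are keyed by DecodeHintType, not EncodeHintType.
        Map<DecodeHintType, Object> hints = new EnumMap<>(DecodeHintType.class);
        hints.put(DecodeHintType.PURE_BARCODE, Boolean.TRUE);
        hints.put(DecodeHintType.CHARACTER_SET, charset);
        BinaryBitmap bitmap = new BinaryBitmap(
                new HybridBinarizer(new BufferedImageLuminanceSource(image)));
        Result result = new MultiFormatReader().decode(bitmap, hints);
        return result.getText();
    }
}
```

PURE_BARCODE is meant for clean images that contain nothing but the barcode, such as a freshly generated QR code. For photos or scans, DecodeHintType.TRY_HARDER may be the better hint to try.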
I have a Tibetan PDF file and want to extract its content. I tried the following three code samples to read the file, but the output was not what I wanted.
code1:
import java.io.IOException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
public class iTextReadDemo {
public static void main(String[] args) {
try {
PdfReader reader = new PdfReader("");
String page = PdfTextExtractor.getTextFromPage(reader, 1);
System.out.println("Page Content:\n\n" + page + "\n\n");
} catch (IOException e) {
e.printStackTrace();
}
}
}// - See more at:
// http://www.quicklyjava.com/read-pdf-file-in-java-using-itext/#sthash.iAhF00Kj.dpuf
code2 :
import java.io.FileOutputStream;
import com.lowagie.text.Document;
import com.lowagie.text.PageSize;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfStamper;
import com.lowagie.text.pdf.PdfWriter;
public class MainClass {
public static void main(String[] args) throws Exception {
PdfReader reader = new PdfReader("");
byte[] bs = new byte[100];
byte[] streamBytes = reader.getPageContent(1);
for(byte b: streamBytes){
System.out.print((char)b);
}
}
}
code3:
package pdfBox;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
public class PDFTest {
public static void main(String[] args) throws Exception {
PDDocument pd;
File input = new File("C:\\Users\\Administrator\\Desktop\\tibetan Dictionary pdf/藏英英藏词典 - 副本.pdf");
pd = PDDocument.load(input);
PDFTextStripper reader = new PDFTextStripper("utf-8");
String pageText = reader.getText(pd);
System.out.println(pageText);
}
}
And this is the relevant part of the Maven pom dependencies:
<dependency>
<groupId>com.itextpdf</groupId>
<artifactId>itextpdf</artifactId>
<version>5.5.3</version>
</dependency>
<dependency>
<groupId>com.lowagie</groupId>
<artifactId>itext</artifactId>
<version>4.2.1</version>
</dependency>
<dependency>
<groupId>org.swinglabs</groupId>
<artifactId>pdf-renderer</artifactId>
<version>1.0.5</version>
</dependency>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>1.8.7</version>
</dependency>
What is wrong?
Is what he says here correct?
https://answers.acrobatusers.com/Can-I-convert-PDF-Word-Doc-Tibetan-script-addition-English-language-q219757.aspx
The quality of content exported from a PDF is directly related to the quality of the PDF's "build" (what is under the hood, not what you "see"). Poor-quality export indicates a poorly built PDF. There is nothing you can do other than ask the originator of the PDF to do a better job.
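One way to get a rough idea of whether the extracted text is usable is to measure how much of it falls in the Tibetan Unicode block (U+0F00 to U+0FFF). This is only a diagnostic sketch using the standard library; the class name TibetanTextCheck is made up, and how to read the number is an assumption. A high share next to garbled console output would point at the console encoding, while a share near zero would fit the explanation above that the PDF itself does not map its glyphs to Unicode.

```java
// Hypothetical diagnostic helper, not part of iText or PDFBox.
class TibetanTextCheck {

    // Fraction of non-whitespace code points that belong to the Tibetan block.
    static double tibetanShare(String text) {
        long total = text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .count();
        if (total == 0) {
            return 0.0;
        }
        long tibetan = text.codePoints()
                .filter(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.TIBETAN)
                .count();
        return (double) tibetan / total;
    }

    public static void main(String[] args) {
        // Replace this sample with the String returned by the text extractor.
        String extracted = "\u0F56\u0F7C\u0F51 dictionary";
        System.out.println("Tibetan share: " + tibetanShare(extracted));
    }
}
```

Calling tibetanShare(pageText) on the result of code1 or code3 gives the number without depending on how the console renders the characters.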