Java using iText and Extracting from PDF

Java using iText and Extracting from PDF - java

I followed some previous advice on StackOverflow here (Extract specific parts of PDF documents) to extract data from PDFs. I have limited programming experience and virtually none with Java (used C++ when I was younger).
After overcoming some newbie difficulty having the right jars available via the buildpath, I have run into what seems to be a define problem. I hope this isn't a stupid question, I've tried for quite some time to get this to work.
Here's the error I receive:
Exception in thread "main" java.lang.Error: Unresolved compilation problem:
The constructor RegionTextRenderFilter(Rectangle) is undefined
at PDFExtract.PDFExtract.parsePdf(PDFExtract.java:41)
at PDFExtract.PDFExtract.main(PDFExtract.java:59)
Here is my code:
package PDFExtract;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.RegionTextRenderFilter;
import com.itextpdf.text.pdf.parser.RenderFilter;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
public class PDFExtract {
/** The original PDF that will be parsed. */
public static final String PREFACE = "resources/pdfs/preface.pdf";
/** The resulting text file. */
public static final String RESULT = "results/output.txt";
/**
* Parses a specific area of a PDF to a plain text file.
* #param pdf the original PDF
* #param txt the resulting text
* #throws IOException
*/
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
Rectangle rect = new Rectangle(70, 80, 420, 500);
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();
}
/**
* Main method.
* #param args no arguments needed
* #throws DocumentException
* #throws IOException
*/
public static void main(String[] args) throws IOException, DocumentException {
new PDFExtract().parsePdf(PREFACE, RESULT);
}
}
The error appears on this line "RenderFilter filter = new RegionTextRenderFilter(rect);"
Hopefully someone can help me! Again I apologize if this is a stupid question. I tried jumping from a Hello World to this and it's taxing my aged understanding of C++

Related

Using PdfCleanUpTool or PdfAutoSweep causes some text to change to bold, line weights increase and double points change to hearts

I have weird problem when I try to use iText 7. Some parts of the PDF are modified (text to change to bold, line weights increase and double points change to hearts). In iText version 5.4.4 this didn't happened, but every version since that I have tried cause this same problem (5 or 7).
Does anyone have a clued why this is happening and is there anything I could to do to bypass this problem? Any help would be appreciated!
If more information is needed, I will try to provide it.
Below is simple code that I used to test iText 7.
Example PDF Files
package javaapplication1;
import com.itextpdf.kernel.colors.ColorConstants;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.pdfcleanup.PdfCleanUpLocation;
import com.itextpdf.pdfcleanup.PdfCleanUpTool;
import com.itextpdf.pdfcleanup.autosweep.ICleanupStrategy;
import com.itextpdf.pdfcleanup.autosweep.PdfAutoSweep;
import com.itextpdf.pdfcleanup.autosweep.RegexBasedCleanupStrategy;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
public class JavaApplication1 {
public static final String DEST = "D:/TEMP/TEMP/PDF/result/orientation_result.pdf";
public static final String DEST2 = "D:/TEMP/TEMP/PDF/result/orientation_result2.pdf";
public static final String DEST3 = "D:/TEMP/TEMP/PDF/result/orientation_result3.pdf";
public static final String SRC = "D:/TEMP/TEMP/PDF/TEST_PDF.pdf";
public static void main(String[] args) throws IOException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new JavaApplication1().manipulatePdf(DEST);
new JavaApplication1().manipulatePdf2(DEST2);
try (PdfDocument pdf = new PdfDocument(new PdfReader(SRC), new PdfWriter(DEST3))) {
final ICleanupStrategy cleanupStrategy = new RegexBasedCleanupStrategy(Pattern.compile("2019", Pattern.CASE_INSENSITIVE)).setRedactionColor(ColorConstants.PINK);
final PdfAutoSweep autoSweep = new PdfAutoSweep(cleanupStrategy);
autoSweep.cleanUp(pdf);
} catch (Exception e) {
System.out.println(e.toString());
}
}
protected void manipulatePdf(String dest) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC), new PdfWriter(dest));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
// The arguments of the PdfCleanUpLocation constructor: the number of page to be cleaned up,
// a Rectangle defining the area on the page we want to clean up,
// a color which will be used while filling the cleaned area.
PdfCleanUpLocation location = new PdfCleanUpLocation(1, new Rectangle(97, 405, 383, 40),
ColorConstants.GRAY);
cleanUpLocations.add(location);
PdfCleanUpTool cleaner = new PdfCleanUpTool(pdfDoc, cleanUpLocations);
cleaner.cleanUp();
pdfDoc.close();
}
protected void manipulatePdf2(String dest) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC), new PdfWriter(dest));
// If the second argument is true, then regions to be erased are extracted from the redact annotations
// contained inside the given document. If the second argument is false (that's default behavior),
// then use PdfCleanUpTool.addCleanupLocation(PdfCleanUpLocation)
// method to set regions to be erased from the document.
PdfCleanUpTool cleaner = new PdfCleanUpTool(pdfDoc, true);
cleaner.cleanUp();
pdfDoc.close();
}
}

Sorry for late response. I managed now verify that updating to latest versions corrected this problem. Thanks to #mkl pointing this out.
Problem solved

Convert Extent Report HTML file to PDF

I am trying to convert Extent report HTML file to PDF, however i did not succeed.
Below is the code i tried.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;
public class Demo
{
public static void main( String[] args ) throws DocumentException, IOException
{
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("pdf.pdf"));
document.open();
XMLWorkerHelper.getInstance().parseXHtml(writer, document,new FileInputStream("html.html"));
document.close();
System.out.println( "PDF Created!" );
}
}
Exception in thread "main" com.itextpdf.tool.xml.exceptions.RuntimeWorkerException: Invalid nested tag head found, expected closing tag link.
at com.itextpdf.tool.xml.XMLWorker.endElement(XMLWorker.java:134)
at com.itextpdf.tool.xml.parser.XMLParser.endElement(XMLParser.java:396)
at com.itextpdf.tool.xml.parser.state.ClosingTagState.process(ClosingTagState.java:70)
at com.itextpdf.tool.xml.parser.XMLParser.parseWithReader(XMLParser.java:236)
at com.itextpdf.tool.xml.parser.XMLParser.parse(XMLParser.java:214)
at com.itextpdf.tool.xml.parser.XMLParser.parse(XMLParser.java:175)
at com.itextpdf.tool.xml.XMLWorkerHelper.parseXHtml(XMLWorkerHelper.java:238)
at com.itextpdf.tool.xml.XMLWorkerHelper.parseXHtml(XMLWorkerHelper.java:210)
at com.itextpdf.tool.xml.XMLWorkerHelper.parseXHtml(XMLWorkerHelper.java:183)
at com.tib.controlStatements.Demo.main(Demo.java:22)
HTML File Link:
https://drive.google.com/open?id=1UrHafoit0rJuhTC0QRqCe9bC5PMpqIWS

Try this example. You did not create the file and trying to fill it.
package sandbox.xmlworker;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import sandbox.WrapToTest;
#WrapToTest
public class D02_ParseHtml {
public static final String HTML = "resources/xml/walden.html";
public static final String DEST = "results/xmlworker/walden1.pdf";
/**
* Html to pdf conversion example.
* #param file
* #throws IOException
* #throws DocumentException
*/
public void createPdf(String file) throws IOException, DocumentException {
// step 1
Document document = new Document();
// step 2
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
// step 3
document.open();
// step 4
XMLWorkerHelper.getInstance().parseXHtml(writer, document,
new FileInputStream(HTML));
// step 5
document.close();
}
/**
* Main method
*/
public static void main(String[] args) throws IOException, DocumentException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new D02_ParseHtml().createPdf(DEST);
}
}

Finally found a work around for it. So what I have done is Added A css file with some adjustments in to the html and was able to covert to PDF and print with attached screenshots.

Image alignment issue in PDF using iText pdfHTML

I am trying to export HTML page into a PDF using iText7.1.0 and pdfHTML2.0.0. For some reason, the pages have formatting issue for the Pie chart images (aligned horizontally in HTML whereas vertically aligned in PDF) and the table (titled "Features") on left is pushed down vertically. The jsFiddle link to my HTML code that is being used by PDF renderer.
Below is the Java code used for rendering the PDF (Page1.html is the same HTML code in the fiddle):
/*
* Copyright 2016-2017, iText Group NV.
* This example was created by Bruno Lowagie.
* It was written in the context of the following book:
* https://leanpub.com/itext7_pdfHTML
* Go to http://developers.itextpdf.com for more info.
*/
package com.itextpdf.htmlsamples.chapter01;
import java.io.File;
import java.io.IOException;
import com.itextpdf.html2pdf.HtmlConverter;
import com.itextpdf.licensekey.LicenseKey;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* Converts a simple HTML file to PDF using File objects
* as arguments for the convertToPdf() method.
*/
public class C01E03_HelloWorld {
/** The Base URI of the HTML page. */
public static final String BASEURI = "src/main/resources/html/";
/** The path to the source HTML file. */
public static final String SRC = String.format("%sPage1.html", BASEURI);
/** The target folder for the result. */
public static final String TARGET = "target/results/ch01/";
/** The path to the resulting PDF file. */
public static final String DEST = String.format("%stest-03.pdf", TARGET);
/**
* The main method of this example.
*
* #param args no arguments are needed to run this example.
* #throws IOException Signals that an I/O exception has occurred.
*/
public static void main(String[] args) throws IOException {
LicenseKey.loadLicenseFile("C://Users//Sparks//Desktop//itextkey-0.xml");
File file = new File(TARGET);
file.mkdirs();
new C01E03_HelloWorld().createPdf(BASEURI, SRC, DEST);
}
/**
* Creates the PDF file.
*
* #param baseUri the base URI
* #param src the path to the source HTML file
* #param dest the path to the resulting PDF
* #throws IOException Signals that an I/O exception has occurred.
*/
public void createPdf(String baseUri, String src, String dest) throws IOException {
HtmlConverter.convertToPdf(new File(src), new File(dest));
}
}
The output PDF file is here. It should have formatting similar to the one in HTML page.
Any suggestions would be helpful.

How do I return data in an array of type Product[] using this data?

public class InputFileData {
/**
* #param inputFile a file giving the data for an electronic
* equipment supplier’s product range
* #return an array of product details
* #throws IOException
*/
public static Product [] readProductDataFile(File inputFile)
throws IOException{
// YOUR CODE HERE
}
This code is meant to read a text file and store the data in an array of type Product[]. I know how to read in a text file and have it sort it into an array, but I've never seen code laid out in this fashion before (specifically "public static Product[]", and I'm unsure how to work with "(File inputfile)". I've looked all over the place but can't find any examples of anything like this. Could someone explain this to me?
EDIT
package electronicsequipmentdemo;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import static java.lang.System.out;
import java.util.Arrays;
import java.util.Scanner;
/**
*
* #author George
*/
public class InputFileData {
/**
* #param inputFile a file giving the data for an electronic
* equipment supplier’s product range
* #return an array of product details
* #throws IOException
*/
public static Product [] readProductDataFile(File inputFile)
throws IOException{
Product[] productName;
try {
FileReader fr = new FileReader("productDataFile.txt");
BufferedReader br = new BufferedReader(fr);
String str;
while ((str = br.readLine()) != null) {
Product[] list = str.split("/");
Arrays.toString(list);
productName = list[1];
return productName;
}
br.close();
} catch (IOException e) {
out.println("File not found");
}
return null;
}
This gives the following errors:
Product[] list = str.split("/"); incompatible types: String[] cannot be converted to product[]
productName = list[1]; incompatible types: Product cannot be converted to Product[]
I've tried lots of things, but without knowing how this sort of class is meant to work (I've never seen a method written out like this), combined with the fact I've been trying to make it work for a solid two days, I've probably got everything wrong. I'm desperate to learn how to do this, help would really be appreciated.

This is a static method. See this turorial (or a lot of others): http://www.programmingsimplified.com/java/source-code/java-static-method-program
Java static method
Java static method program: static methods in Java can be called
without creating an object of class. Have you noticed why we write
static keyword when defining main it's because program execution
begins from main and no object has been created yet. Consider the
example below to improve your understanding of static methods.

Please try the below code
package electronicsequipmentdemo;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import static java.lang.System.out;
import java.util.Arrays;
import java.util.Scanner;
/**
*
* #author George
*/
public class InputFileData {
/**
* #param inputFile a file giving the data for an electronic
* equipment supplier’s product range
* #return an array of product details
* #throws IOException
*/
public static Product [] readProductDataFile(File inputFile)
throws IOException{
Product[] productList=new Product[100];
int i=0;
try {
FileReader fr = new FileReader("productDataFile.txt");
BufferedReader br = new BufferedReader(fr);
String str;
String[] list= new String[100];
while ((str = br.readLine()) != null) {
list = str.split("/");
for(String name:list){
Product product=new Product();
product.setProductName(name);
if(i<100){
productList[i]=product;
i++;
}
}
}
br.close();
return productList;
}
} catch (IOException e) {
out.println("File not found");
}
return null;
}
The idea is that the read line method will return you a string, when you split this string using your delimiter you get an array of strings now you have to create a Product object using the setter method of the Product class(which you know i guess) the you need to add the product objet to the ProductList array here i have assumed that there are not more than 100 product objects and 100 names in a line of the file you are trying to read.

jAudio Feature Extractor : Null Exception

My project is to create an Android app that can perform feature extraction and classification of audio files. So first, I'm creating a Java application as a test run.
I'm attempting to use jAudio's feature extractor package to extract audio features from an audio file.
As a starter, I want to input a .wav file and run the feature extraction operation upon that file, and then store the results as an .ARFF file.
However, I'm getting the below NullPointer Exception error from a package within the project:
Exception in thread "main" java.lang.NullPointerException
at java.io.DataOutputStream.writeBytes(Unknown Source)
at jAudioFeatureExtractor.jAudioTools.FeatureProcessor.writeValuesARFFHeader(FeatureProcessor.java:853)
at jAudioFeatureExtractor.jAudioTools.FeatureProcessor.<init>(FeatureProcessor.java:258)
at jAudioFeatureExtractor.DataModel.extract(DataModel.java:308)
at Mfccarffwriter.main(Mfccarffwriter.java:70)
Initially I thought it was a file permission issue(i.e, the program was not being allowed to write a file because of lack of permissions), but even after granting every kind of permission to Eclipse 4.2.2(I'm running Windows 7, 64 bit version), I'm still getting the NullException bug.
The package code where the offending exception originates from is given below:
/**
* Write headers for an ARFF file. If saving for overall features, this must
* be postponed until the overall features have been calculated. If this a
* perWindow arff file, then all the feature headers can be extracted now
* and no hacks are needed.
* <p>
* <b>NOTE</b>: This procedure breaks if a feature to be saved has a
* variable number of dimensions
*
* #throws Exception
*/
private void writeValuesARFFHeader() throws Exception {
String sep = System.getProperty("line.separator");
String feature_value_header = "#relation jAudio" + sep;
values_writer.writeBytes(feature_value_header); // exception here
if (save_features_for_each_window && !save_overall_recording_features) {
for (int i = 0; i < feature_extractors.length; ++i) {
if (features_to_save[i]) {
String name = feature_extractors[i].getFeatureDefinition().name;
int dimension = feature_extractors[i]
.getFeatureDefinition().dimensions;
for (int j = 0; j < dimension; ++j) {
values_writer.writeBytes("#ATTRIBUTE \"" + name + j
+ "\" NUMERIC" + sep);
}
}
}
values_writer.writeBytes(sep);
values_writer.writeBytes("#DATA" + sep);
}
}
Here's the main application code:
import java.io.*;
import java.util.Arrays;
import com.sun.xml.internal.bind.v2.runtime.RuntimeUtil.ToStringAdapter;
import jAudioFeatureExtractor.Cancel;
import jAudioFeatureExtractor.DataModel;
import jAudioFeatureExtractor.Updater;
import jAudioFeatureExtractor.Aggregators.AggregatorContainer;
import jAudioFeatureExtractor.AudioFeatures.FeatureExtractor;
import jAudioFeatureExtractor.AudioFeatures.MFCC;
import jAudioFeatureExtractor.DataTypes.RecordingInfo;
import jAudioFeatureExtractor.jAudioTools.*;
public static void main(String[] args) throws Exception {
// Display information about the wav file
File extractedFiletoTest = new File("./microwave1.wav");
String randomID = Integer.toString((int) Math.random());
String file_path = "E:/Weka-3-6/tmp/microwave1.wav";
AudioSamples sampledExampleFile = new AudioSamples(extractedFiletoTest,randomID,false);
RecordingInfo[] samplefileInfo = new RecordingInfo[5];
samplefileInfo[1] = new RecordingInfo(randomID, file_path, sampledExampleFile, true);
double samplingrate= sampledExampleFile.getSamplingRateAsDouble();
int windowsize= 4096;
boolean normalize = false;
OutputStream valsavepath = new FileOutputStream(".\\values");
OutputStream defsavepath = new FileOutputStream(".\\definitions");
boolean[] featurestosaveamongall = new boolean[10];
Arrays.fill(featurestosaveamongall, Boolean.TRUE);
double windowoverlap = 0.0;
DataModel mfccDM = new DataModel("features.xml",null);
mfccDM.extract(windowsize, 0.5, samplingrate, true, true, false, samplefileInfo, 1); /// invokes the writeValuesARFFHeader function.
}
}
You can download the whole project (done so far)here.

This may be a bit late but I had the same issue and I tracked it down to the featureKey and featureValue never being set in the DataModel. There is not a set method for these but they are public field. Here is my code:
package Sound;
import jAudioFeatureExtractor.ACE.DataTypes.Batch;
import jAudioFeatureExtractor.DataModel;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
public class Analysis {
private static String musicFile = "/home/chris/IdeaProjects/AnotherProj/res/SheMovesInHerOwnWay15s.wav";
private static String featureFile = "/home/chris/IdeaProjects/AnotherProj/res/features.xml";
private static String settingsFile = "/home/chris/IdeaProjects/AnotherProj/res/settings.xml";
private static String FKOuputFile = "/home/chris/IdeaProjects/AnotherProj/res/fk.xml";
private static String FVOuputFile = "/home/chris/IdeaProjects/AnotherProj/res/fv.xml";
public static void main(String[] args){
Batch batch = new Batch(featureFile, null);
try{
batch.setRecordings(new File[]{new File(musicFile)});
batch.getAggregator();
batch.setSettings(settingsFile);
DataModel dm = batch.getDataModel();
OutputStream valsavepath = new FileOutputStream(FVOuputFile);
OutputStream defsavepath = new FileOutputStream(FKOuputFile);
dm.featureKey = defsavepath;
dm.featureValue = valsavepath;
batch.setDataModel(dm);
batch.execute();
}
catch (Exception e){
e.printStackTrace();
}
}
}
I created the settings.xml file using the GUI and just copied the features.xml file from the directory where you saved the jar.
Hope this helps

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java using iText and Extracting from PDF - java

Related

Using PdfCleanUpTool or PdfAutoSweep causes some text to change to bold, line weights increase and double points change to hearts

Convert Extent Report HTML file to PDF

Image alignment issue in PDF using iText pdfHTML

How do I return data in an array of type Product[] using this data?

jAudio Feature Extractor : Null Exception

Categories

Resources