Hi, I am relatively new to Java, but I am hoping to write a class that finds all the ALT (image) attributes in an HTML file using JSoup. I want it to print an error message if an image has no alt text, and if it does, to remind users to check it.
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
public class grabImages {
    File input = new File("...HTML");
    Document doc = Jsoup.parse(input, "UTF-8", "file:///C:...HTML");
    Elements img = doc.getElementsByTag("img");
    Elements alttext = doc.getElementsByAttribute("alt");
    for (Element el : img) {
        if (el.attr("img").contains("alt")) {
            System.out.println("is the alt text relevant to the image? ");
        }
        else {
            System.out.println("no alt text found on image");
        }
    }
}
I think your logic was a little off.
For example, here you are trying to read the 'img' attribute of the 'img' tag, which doesn't exist:
el.attr("img")
What you actually want to check for is the 'alt' attribute.
Here's my implementation of the program. You should be able to alter it for your own needs.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Controller {
    public static void main(String[] args) throws IOException {
        // Connect to a website. This can be replaced with your file-loading implementation
        Document doc = Jsoup.connect("http://www.google.co.uk").get();
        // Get all img tags
        Elements img = doc.getElementsByTag("img");
        int counter = 0;
        // Loop through the img tags
        for (Element el : img) {
            // If alt is empty or null, add one to the counter
            if (el.attr("alt") == null || el.attr("alt").equals("")) {
                counter++;
            }
            System.out.println("image tag: " + el.attr("src") + " Alt: " + el.attr("alt"));
        }
        System.out.println("Number of unset alt: " + counter);
    }
}
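One note on the check above: jsoup's attr() never returns null; it returns an empty string when the attribute is missing, so the condition can be collapsed to a single test, e.g.:
// attr() yields "" (not null) for a missing attribute in jsoup,
// so this one check covers both the missing and the empty case:
if (el.attr("alt").isEmpty()) {
    counter++;
}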
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class grabImages {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("...HTML").get();
            Elements img = doc.getElementsByTag("img");
            for (Element el : img) {
                if (el.hasAttr("alt")) {
                    System.out.println("is the alt text relevant to the image? ");
                }
                else {
                    System.out.println("no alt text found on image");
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
el.hasAttr("alt") will give 'alt' attr is there or not.
for more informatiom
http://jsoup.org/cookbook/extracting-data/example-list-links
You can simplify this by using CSS selectors to select the img elements that do not have an alt attribute, rather than iterating over every img in the doc.
Document doc = Jsoup.connect(url).get();
for (Element img : doc.select("img:not([alt])"))
    System.out.println("img does not have alt: " + img);
I have the code below to fetch the pages inside a given URL, but I am not sure how to display them in a tree-like structure.
import java.io.IOException;
import java.util.HashSet;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BasicWebCrawler {
    private HashSet<String> links;

    public BasicWebCrawler() {
        links = new HashSet<String>();
    }

    public void getPageLinks(String URL) {
        //4. Check if you have already crawled the URLs
        //(we are intentionally not checking for duplicate content in this example)
        if (!links.contains(URL)) {
            try {
                //4. (i) If not, add it to the index
                if (links.add(URL)) {
                    System.out.println(URL);
                }
                //2. Fetch the HTML code
                Document document = Jsoup.connect(URL).get();
                //3. Parse the HTML to extract links to other URLs
                Elements linksOnPage = document.select("a[href^=\"" + URL + "\"]");
                //5. For each extracted URL... go back to Step 4.
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        //1. Pick a URL from the frontier
        new BasicWebCrawler().getPageLinks("https://www.wikipedia.com/");
    }
}
Okay, I think I managed to do what you asked. The recursion finishes when all links on a site have been checked or the site has no links, but on the real internet that is not actually feasible; it's funny where you can end up from one site just by following the first unchecked link:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;

public class BasicWebCrawler {
    private HashSet<String> links;

    public BasicWebCrawler() {
        links = new HashSet<String>();
    }

    public void getPageLinks(String URL, int level) {
        //4. Check if you have already crawled the URLs
        //(we are intentionally not checking for duplicate content in this example)
        if (!links.contains(URL)) {
            try {
                //4. (i) If not, add it to the index
                if (links.add(URL)) {
                    for (int i = 0; i < level; i++) {
                        System.out.print("-");
                    }
                    System.out.println(URL);
                }
                //2. Fetch the HTML code
                Document document = Jsoup.connect(URL).get();
                //3. Parse the HTML to extract links to other URLs
                Elements linksOnPage = document.select("a[href]");
                //5. For each extracted URL... go back to Step 4.
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"), level + 1);
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        //1. Pick a URL from the frontier
        new BasicWebCrawler().getPageLinks("http://mysmallwebpage.com/", 0);
    }
}
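If you do point this at a real site, a simple way to stop the recursion from wandering across the whole web is a depth cap. A sketch of the change (MAX_DEPTH is my own hypothetical constant, not part of the code above):
private static final int MAX_DEPTH = 3; // hypothetical cap; tune as needed

public void getPageLinks(String URL, int level) {
    if (level > MAX_DEPTH) {
        return; // too deep; stop recursing down this branch
    }
    // ... rest of the method unchanged ...
}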
@JasonPlutext,
Hi Jason! I tried the above code, but it replaces the image entirely and deletes the whole template.
I would like to just replace/add a particular relationship of the image, say
<Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image10.png"/>
In place of the rId8 image, I would like to substitute the rId7 image.
My Source Code:
public static void main(String[] args) throws Exception {
    String inputfilepath = "C:\\Users\\saranyac\\QUERIES\\Estimation\\PPT-PSR\\PSR_Dev0ps\\PSRAutomationTemplate.pptx";
    PresentationMLPackage presentationMLPackage = (PresentationMLPackage) OpcPackage.load(new java.io.File(inputfilepath));
    MainPresentationPart pp = presentationMLPackage.getMainPresentationPart();
    SlidePart slidePart = presentationMLPackage.getMainPresentationPart().getSlide(0);
    SlideLayoutPart layoutPart = slidePart.getSlideLayoutPart();
    System.out.println("SlidePart Name:::::" + slidePart.getPartName().getName());
    String layoutName = layoutPart.getJaxbElement().getCSld().getName();
    System.out.println("layout: " + layoutPart.getPartName().getName() + " with cSld/@name='" + layoutName + "'");
    System.out.println("Master: " + layoutPart.getSlideMasterPart().getPartName().getName());
    System.out.println("layoutPart.getContents()::::::::s: " + layoutPart.getContents());
    //layoutPart.setContents((SldLayout) XmlUtils.unmarshalString(SAMPLE_PICTURE, Context.jcPML));
    // Add image part
    File file = new File("C:\\Users\\saranyac\\PPT-PSR\\PSR_Dev0ps\\ppt\\media\\image10.png");
    BinaryPartAbstractImage imagePart
            = BinaryPartAbstractImage.createImagePart(presentationMLPackage, slidePart, file);
    Relationship rel = pp.getRelationshipsPart().getRelationshipByID("rId8");
    System.out.println("Relationship:::::::s: " + imagePart.getSourceRelationship().getId());
    // pp.removeSlide(rel);
    java.util.HashMap<String, String> mappings = new java.util.HashMap<String, String>();
    mappings.put("rId8", imagePart.getSourceRelationship().getId());
    String outputfilepath = "C:\\Work\\24Jan2018_CheckOut\\PPT-TRAILS\\Success.pptx";
    //presentationMLPackage.save(new java.io.File(outputfilepath));
    SaveToZipFile saver = new SaveToZipFile(presentationMLPackage);
    saver.save(outputfilepath);
    System.out.println("\n\n done .. saved " + outputfilepath);
}
Please help me understand how to replace an image in the generated PPT.
With Regards,
Saranya
See https://github.com/plutext/docx4j/blob/master/src/samples/pptx4j/org/pptx4j/samples/TemplateReplaceSimple.java (just added):
package org.pptx4j.samples;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import javax.xml.bind.JAXBException;
import org.apache.commons.io.FileUtils;
import org.docx4j.TraversalUtil;
import org.docx4j.TraversalUtil.CallbackImpl;
import org.docx4j.dml.CTBlip;
import org.docx4j.dml.CTBlipFillProperties;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.OpcPackage;
import org.docx4j.openpackaging.packages.PresentationMLPackage;
import org.docx4j.openpackaging.parts.Part;
import org.docx4j.openpackaging.parts.PresentationML.SlidePart;
import org.docx4j.openpackaging.parts.WordprocessingML.BinaryPartAbstractImage;
import org.pptx4j.Pptx4jException;
/**
 * Example of how to replace text and images in a Pptx.
 *
 * Text is replaced using the familiar VariableReplace approach.
 *
 * Images are replaced by replacing their byte content.
 *
 * @author jharrop
 */
public class TemplateReplaceSimple {

    public static void main(String[] args) throws Docx4JException, Pptx4jException, JAXBException, IOException {

        // Input file
        String inputfilepath = System.getProperty("user.dir") + "/sample-docs/pptx/image.pptx";

        // String replacements
        HashMap<String, String> mappings = new HashMap<String, String>();
        mappings.put("colour", "green");

        // Image replacements
        List<ImageReplacementDetails> imageReplacements = new ArrayList<ImageReplacementDetails>();
        ImageReplacementDetails example1 = new ImageReplacementDetails();
        example1.slideIndex = 0;
        example1.imageRelId = "rId2";
        example1.replacementImageBytes = FileUtils.readFileToByteArray(new File("test.png"));
        imageReplacements.add(example1);

        PresentationMLPackage presentationMLPackage =
                (PresentationMLPackage) OpcPackage.load(new java.io.File(inputfilepath));

        // First, the text replacements
        List<SlidePart> slideParts =
                presentationMLPackage.getMainPresentationPart().getSlideParts();
        for (SlidePart slidePart : slideParts) {
            slidePart.variableReplace(mappings);
        }

        // Second, the image replacements.
        // We have a design choice here.
        // Either we can replace text placeholders with images,
        // or we can replace existing images with new images, but keep the XML specifying size etc.
        // Here I opt for the latter, so what we need is the relId and image bytes.
        for (ImageReplacementDetails ird : imageReplacements) {
            // it's a bit inefficient to potentially traverse a single slide
            // multiple times, but I've done it this way to keep this example simple
            SlidePart slidePart =
                    presentationMLPackage.getMainPresentationPart().getSlide(ird.slideIndex);
            SlidePicFinder traverser = new SlidePicFinder();
            new TraversalUtil(slidePart.getJaxbElement().getCSld().getSpTree().getSpOrGrpSpOrGraphicFrame(), traverser);
            for (org.pptx4j.pml.Pic pic : traverser.pics) {
                CTBlipFillProperties blipFill = pic.getBlipFill();
                if (blipFill != null) {
                    CTBlip blip = blipFill.getBlip();
                    if (blip.getEmbed() != null) {
                        String relId = blip.getEmbed();
                        // is this the one we want?
                        if (relId.equals(ird.imageRelId)) {
                            Part part = slidePart.getRelationshipsPart().getPart(relId);
                            try {
                                BinaryPartAbstractImage imagePart = (BinaryPartAbstractImage) part;
                                // you'll need to ensure that you replace like with like,
                                // ie png for png, not eg jpeg for png!
                                imagePart.setBinaryData(ird.replacementImageBytes);
                            } catch (ClassCastException cce) {
                                System.out.println(part.getClass().getName());
                            }
                        } else {
                            System.out.println(relId + " isn't a match for this replacement. ");
                        }
                    } else {
                        System.out.println("No a:blip/@r:embed");
                    }
                }
            }
        }

        System.out.println("\n\n saving .. \n\n");
        String outputfilepath = System.getProperty("user.dir") + "/OUT_VariableReplace.pptx";
        presentationMLPackage.save(new java.io.File(outputfilepath));
        System.out.println("\n\n done .. \n\n");
    }

    static class ImageReplacementDetails {
        int slideIndex;
        String imageRelId;
        byte[] replacementImageBytes;
    }

    static class SlidePicFinder extends CallbackImpl {

        List<org.pptx4j.pml.Pic> pics = new ArrayList<org.pptx4j.pml.Pic>();

        public List<Object> apply(Object o) {
            if (o instanceof org.pptx4j.pml.Pic) {
                pics.add((org.pptx4j.pml.Pic) o);
                System.out.println("added pic");
            }
            return null;
        }
    }
}
I'm trying to extract images from a PDF using PDFBox. The example PDF is here.
But I'm getting blank images only.
The code I'm trying:
public static void main(String[] args) {
    PDFImageExtract obj = new PDFImageExtract();
    try {
        obj.read_pdf();
    } catch (IOException ex) {
        System.out.println("" + ex);
    }
}

void read_pdf() throws IOException {
    PDDocument document = null;
    try {
        document = PDDocument.load("C:\\Users\\Pradyut\\Documents\\MCS-034.pdf");
    } catch (IOException ex) {
        System.out.println("" + ex);
    }
    List pages = document.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();
    int i = 1;
    String name = null;
    while (iter.hasNext()) {
        PDPage page = (PDPage) iter.next();
        PDResources resources = page.getResources();
        Map pageImages = resources.getImages();
        if (pageImages != null) {
            Iterator imageIter = pageImages.keySet().iterator();
            while (imageIter.hasNext()) {
                String key = (String) imageIter.next();
                PDXObjectImage image = (PDXObjectImage) pageImages.get(key);
                image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i);
                i++;
            }
        }
    }
}
Thanks
Here is code using PDFBox 2.0.1 that will get a list of all images from the PDF. This is different from the other code in that it recurses through the document instead of trying to get the images from the top level.
public List<RenderedImage> getImagesFromPDF(PDDocument document) throws IOException {
    List<RenderedImage> images = new ArrayList<>();
    for (PDPage page : document.getPages()) {
        images.addAll(getImagesFromResources(page.getResources()));
    }
    return images;
}

private List<RenderedImage> getImagesFromResources(PDResources resources) throws IOException {
    List<RenderedImage> images = new ArrayList<>();
    for (COSName xObjectName : resources.getXObjectNames()) {
        PDXObject xObject = resources.getXObject(xObjectName);
        if (xObject instanceof PDFormXObject) {
            images.addAll(getImagesFromResources(((PDFormXObject) xObject).getResources()));
        } else if (xObject instanceof PDImageXObject) {
            images.add(((PDImageXObject) xObject).getImage());
        }
    }
    return images;
}
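A usage sketch of my own (assuming PDFBox 2.x and a hypothetical input.pdf) that writes every image returned by getImagesFromPDF to disk as a PNG:
// Load the document, collect its images via the methods above, save each one.
try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
    int i = 0;
    for (RenderedImage image : getImagesFromPDF(document)) {
        ImageIO.write(image, "png", new File("image-" + i++ + ".png"));
    }
}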
The GetImagesFromPDF Java class below gets all images in the 04-Request-Headers.pdf file and saves them into the destination folder PDFCopy.
import java.io.File;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

@SuppressWarnings({ "unchecked", "rawtypes", "deprecation" })
public class GetImagesFromPDF {
    public static void main(String[] args) {
        try {
            String sourceDir = "C:/PDFCopy/04-Request-Headers.pdf"; // Paste pdf files in the PDFCopy folder to read
            String destinationDir = "C:/PDFCopy/";
            File oldFile = new File(sourceDir);
            if (oldFile.exists()) {
                PDDocument document = PDDocument.load(sourceDir);
                List<PDPage> list = document.getDocumentCatalog().getAllPages();
                String fileName = oldFile.getName().replace(".pdf", "_cover");
                int totalImages = 1;
                for (PDPage page : list) {
                    PDResources pdResources = page.getResources();
                    Map pageImages = pdResources.getImages();
                    if (pageImages != null) {
                        Iterator imageIter = pageImages.keySet().iterator();
                        while (imageIter.hasNext()) {
                            String key = (String) imageIter.next();
                            PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
                            pdxObjectImage.write2file(destinationDir + fileName + "_" + totalImages);
                            totalImages++;
                        }
                    }
                }
            } else {
                System.err.println("File does not exist");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
For PDFBox 2.0.1, pudaykiran's answer must be slightly modified since some APIs have changed.
public static void testPDFBoxExtractImages() throws Exception {
    PDDocument document = PDDocument.load(new File("D:/Temp/Test.pdf"));
    PDPageTree list = document.getPages();
    for (PDPage page : list) {
        PDResources pdResources = page.getResources();
        for (COSName c : pdResources.getXObjectNames()) {
            PDXObject o = pdResources.getXObject(c);
            if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
                File file = new File("D:/Temp/" + System.nanoTime() + ".png");
                ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) o).getImage(), "png", file);
            }
        }
    }
}
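A small aside of mine: in PDFBox 2.x, PDDocument is Closeable, so it is safer to wrap the load in try-with-resources so the file handle is always released:
// Same loop as above, but the document closes automatically.
try (PDDocument document = PDDocument.load(new File("D:/Temp/Test.pdf"))) {
    // ... iterate pages and XObjects as above ...
}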
Just add the .jpeg to the end of your path:
image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i + ".jpeg");
That works for me.
You can use the PDPage.convertToImage() function, which converts a PDF page into a BufferedImage. You can then use the BufferedImage to create an Image.
Use the following reference for further detail:
All PDF-related classes in PDFBox are listed in the
Apache PDFBox 1.8.3 API
There you can find the PDPage documentation.
And do not forget to look at the PDPage.convertToImage() function in the PDPage class.
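A minimal sketch of that approach (my own, assuming PDFBox 1.8.x and hypothetical file names), rendering each page to a PNG:
PDDocument document = PDDocument.load(new File("input.pdf"));
List<?> pages = document.getDocumentCatalog().getAllPages();
int pageNumber = 1;
for (Object p : pages) {
    // convertToImage() rasterizes the page into a BufferedImage
    BufferedImage image = ((PDPage) p).convertToImage();
    ImageIO.write(image, "png", new File("page-" + pageNumber++ + ".png"));
}
document.close();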
This is a Kotlin version of @Matt's answer.
fun <R> PDResources.onImageResources(block: (RenderedImage) -> (R)): List<R> =
    this.xObjectNames.flatMap {
        when (val xObject = this.getXObject(it)) {
            is PDFormXObject -> xObject.resources.onImageResources(block)
            is PDImageXObject -> listOf(block(xObject.image))
            else -> emptyList()
        }
    }
You can use it on PDPage Resources like this:
page.resources.onImageResources { image ->
    Files.createTempFile("image", "xxx").also { path ->
        if (!ImageIO.write(image, "xxx", path.toFile()))
            throw IllegalStateException("Couldn't write image to file")
    }
}
Where "xxx" is the format you need (like "jpeg")
For someone who wants to just copy and paste, here is ready-to-use code:
import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.UUID;

public class ExtractImagesUseCase extends PDFStreamEngine {
    private final String filePath;
    private final String outputDir;

    // Constructor
    public ExtractImagesUseCase(String filePath,
                                String outputDir) {
        this.filePath = filePath;
        this.outputDir = outputDir;
    }

    // Execute
    public void execute() {
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            for (PDPage page : document.getPages()) {
                processPage(page);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        String operation = operator.getName();
        if ("Do".equals(operation)) {
            COSName objectName = (COSName) operands.get(0);
            PDXObject pdxObject = getResources().getXObject(objectName);
            if (pdxObject instanceof PDImageXObject) {
                // Image
                PDImageXObject image = (PDImageXObject) pdxObject;
                BufferedImage bImage = image.getImage();

                // File
                String randomName = UUID.randomUUID().toString();
                File outputFile = new File(outputDir, randomName + ".png");

                // Write image to file
                ImageIO.write(bImage, "PNG", outputFile);
            } else if (pdxObject instanceof PDFormXObject) {
                PDFormXObject form = (PDFormXObject) pdxObject;
                showForm(form);
            }
        } else {
            super.processOperator(operator, operands);
        }
    }
}
Demo
public class ExtractImageDemo {
    public static void main(String[] args) {
        String filePath = "C:\\Users\\John\\Downloads\\Documents\\sample-file.pdf";
        String outputDir = "C:\\Users\\John\\Downloads\\Documents\\Output";

        ExtractImagesUseCase useCase = new ExtractImagesUseCase(
                filePath,
                outputDir
        );
        useCase.execute();
    }
}
Instead of calling
image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i);
You can use the ImageIO.write() static method to write the RGB image out in whatever format you need. Here I've used PNG:
File outputFile = new File( "C:\\Users\\Pradyut\\Documents\\image" + i + ".png");
ImageIO.write( image.getRGBImage(), "png", outputFile);
I need to write code that will get all the links in a website recursively. Since I'm new to this, this is what I've got so far:
List<WebElement> no = driver.findElements(By.tagName("a"));
nooflinks = no.size();
for (WebElement pagelink : no) {
    String linktext = pagelink.getText();
    link = pagelink.getAttribute("href");
}
Now what I need is this: if the list finds a link from the same domain, it should get all the links from that URL, then return to the previous loop and resume from the next link. This should go on until the last URL in the whole website has been found. For example: the home page is the base URL, and it has 5 URLs of other pages. After getting the first of the 5 URLs, the loop should get all the links of that first URL, return to the home page, and resume from the second URL. If the second URL has sub-sub-URLs, the loop should find links for those first, then resume to the second URL, and then go back to the home page and resume from the third URL.
Can anybody help me out here?
I saw this post recently. I don't know if you are still looking for ANY solution for this problem. If not, I thought it might be useful:
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.Iterator;
public class URLReading {
    public static void main(String[] args) {
        try {
            HashMap<String, String> h = new HashMap<>();
            String url = "https://abidsukumaran.wordpress.com/";
            Document doc = Jsoup.connect(url).get();

            // Page Title
            String title = doc.title();
            //System.out.println("title: " + title);

            // Links in page
            Elements links = doc.select("a[href]");
            List<String> url_array = new ArrayList<>();
            int i = 0;
            url_array.add(url);
            String root = url;
            h.put(url, title);
            while (i <= h.size()) {
                try {
                    url = url_array.get(i);
                    doc = Jsoup.connect(url).get();
                    title = doc.title();
                    links = doc.select("a[href]");
                    for (Element link : links) {
                        // abs:href resolves relative links against the page URL
                        String href = link.attr("abs:href");
                        String res = h.putIfAbsent(href, link.text());
                        if (res == null) {
                            url_array.add(href);
                            System.out.println("\nURL: " + href);
                            System.out.println("CONTENT: " + link.text());
                        }
                    }
                } catch (Exception e) {
                    System.out.println("\n" + e);
                }
                i++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
You can use Set and HashSet. You might try something like this (fetchLinks below is a placeholder for however you collect the links found on a page, e.g. with jsoup or Selenium):
Set<String> getLinksFromSite(int level, Set<String> links) {
    if (level < 5) {
        Set<String> locallinks = new HashSet<String>();
        for (String link : links) {
            // fetchLinks is a hypothetical helper: fetch the page at 'link'
            // and return the URLs found on it
            Set<String> new_links = fetchLinks(link);
            locallinks.addAll(getLinksFromSite(level + 1, new_links));
        }
        return locallinks;
    } else {
        return links;
    }
}
I would think the following idiom would be useful in this context:
Set<String> visited = new HashSet<>();
Deque<String> unvisited = new LinkedList<>();

unvisited.add(startingURL);
while (!unvisited.isEmpty()) {
    String current = unvisited.poll();
    visited.add(current);
    for /* each link in current */ {
        if (!visited.contains(link.url()))
            unvisited.add(link.url());
    }
}
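Fleshed out as a minimal runnable sketch (my own, assuming jsoup; restricting to URLs under startingURL is one way to keep the crawl finite):
Set<String> visited = new HashSet<>();
Deque<String> unvisited = new ArrayDeque<>();

unvisited.add(startingURL);
while (!unvisited.isEmpty()) {
    String current = unvisited.poll();
    if (!visited.add(current)) continue; // already fetched this one
    try {
        Document doc = Jsoup.connect(current).get();
        for (Element link : doc.select("a[href]")) {
            String url = link.attr("abs:href"); // resolve relative links
            if (url.startsWith(startingURL) && !visited.contains(url)) {
                unvisited.add(url);
            }
        }
    } catch (IOException e) {
        System.err.println("Skipping " + current + ": " + e.getMessage());
    }
}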
I have one PDF and some keywords. What I need is to search for those keywords in the PDF, highlight them in the PDF, and save it. After this, I have to view the PDF in Google Docs, and the words should be highlighted in it. I have to do this in Java.
My code is
package com.hiringsteps.ats.util.pdfclownUtil;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.ITextString;
import org.pdfclown.documents.contents.TextChar;
import org.pdfclown.documents.interaction.annotations.TextMarkup;
import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;
import org.pdfclown.files.File;
import org.pdfclown.files.SerializationModeEnum;
import org.pdfclown.util.math.Interval;
import org.pdfclown.util.math.geom.Quad;
import org.pdfclown.tools.TextExtractor;
import com.hiringsteps.ats.applicant.domain.ApplicantKeyWord;
import com.hiringsteps.ats.job.domain.CustomerJobKeyword;
public class TextHighlightUtil
{
    private int count;

    public Collection<ApplicantKeyWord> highlight(String inputPath, String outputPath, Collection<CustomerJobKeyword> customerJobKeywordList)
    {
        Collection<ApplicantKeyWord> applicantKeywordList = new ArrayList<ApplicantKeyWord>();
        ApplicantKeyWord applicantKeyword = null;
        // 1. Open the PDF file!
        File file;
        try
        {
            file = new File(inputPath);
        }
        catch(Exception e)
        {
            throw new RuntimeException(inputPath + " file access error.", e);
        }

        for(CustomerJobKeyword key : customerJobKeywordList) {
            applicantKeyword = new ApplicantKeyWord();
            count = 0;
            // Define the text pattern to look for!
            //String textRegEx = promptChoice("Please enter the pattern to look for: ");
            applicantKeyword.setKey(key);
            Pattern pattern = Pattern.compile(key.getName(), Pattern.CASE_INSENSITIVE);

            // 2. Iterating through the document pages...
            TextExtractor textExtractor = new TextExtractor(true, true);
            for(final Page page : file.getDocument().getPages())
            {
                // 2.1. Extract the page text!
                Map<Rectangle2D,List<ITextString>> textStrings = textExtractor.extract(page);

                // 2.2. Find the text pattern matches!
                final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings));

                // 2.3. Highlight the text pattern matches!
                textExtractor.filter(textStrings,
                    new TextExtractor.IIntervalFilter()
                    {
                        public boolean hasNext()
                        {
                            //if(key.getMatchCriteria() == 1){
                            if (matcher.find()) {
                                count++;
                                return true;
                            }
                            /*} else if(key.getMatchCriteria() == 2) {
                                if (matcher.hitEnd()) {
                                    count++;
                                    return true;
                                }
                            }*/
                            return false;
                        }

                        public Interval<Integer> next()
                        {
                            return new Interval<Integer>(matcher.start(), matcher.end());
                        }

                        public void process(Interval<Integer> interval, ITextString match)
                        {
                            // Defining the highlight box of the text pattern match...
                            List<Quad> highlightQuads = new ArrayList<Quad>();
                            {
                                Rectangle2D textBox = null;
                                for(TextChar textChar : match.getTextChars())
                                {
                                    Rectangle2D textCharBox = textChar.getBox();
                                    if(textBox == null)
                                    {textBox = (Rectangle2D)textCharBox.clone();}
                                    else
                                    {
                                        if(textCharBox.getY() > textBox.getMaxY())
                                        {
                                            highlightQuads.add(Quad.get(textBox));
                                            textBox = (Rectangle2D)textCharBox.clone();
                                        }
                                        else
                                        {textBox.add(textCharBox);}
                                    }
                                }
                                textBox.setRect(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight() + 5);
                                highlightQuads.add(Quad.get(textBox));
                            }
                            //TextMarkup.setPrintable(true);
                            // Highlight the text pattern match!
                            new TextMarkup(page, MarkupTypeEnum.Highlight, highlightQuads);
                            //TextMarkup temp = new TextMarkup(page, MarkupTypeEnum.Highlight, highlightQuads);
                            //temp.setMarkupBoxes(highlightQuads);
                            //temp.setPrintable(true);
                            //temp.setVisible(true);
                            //temp.setMarkupType(MarkupTypeEnum.Highlight);
                        }

                        public void remove()
                        {throw new UnsupportedOperationException();}
                    }
                );
            }
            applicantKeyword.setCount(count);
            applicantKeywordList.add(applicantKeyword);
        }

        SerializationModeEnum serializationMode = SerializationModeEnum.Incremental;
        try
        {
            file.save(new java.io.File(outputPath), serializationMode);
            file.close();
        }
        catch(Exception e)
        {
            System.out.println("File writing failed: " + e.getMessage());
            e.printStackTrace();
        }
        return applicantKeywordList;
    }
}
With this, I am able to highlight. But when I render the PDF in Google Docs, the words are no longer highlighted. If the PDF is opened in Adobe, they are highlighted. Also, if I just open and save the PDF in Adobe Acrobat Professional and then open it with Google Docs, the Google Docs version will have the words highlighted.
See this also
The author of PDF Clown reported that the problem was caused by the lack of an explicit appearance stream associated with the markup annotation. As subsequently stated, this issue has been solved by a revision committed to the project's SVN repository on SourceForge.net.