I have a PDF and some keywords. I need to search for those keywords in the PDF, highlight them, and save the file. After this, I have to view the PDF in Google Docs and the words should still be highlighted. I have to do this in Java.
My code is
package com.hiringsteps.ats.util.pdfclownUtil;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.ITextString;
import org.pdfclown.documents.contents.TextChar;
import org.pdfclown.documents.interaction.annotations.TextMarkup;
import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;
import org.pdfclown.files.File;
import org.pdfclown.files.SerializationModeEnum;
import org.pdfclown.util.math.Interval;
import org.pdfclown.util.math.geom.Quad;
import org.pdfclown.tools.TextExtractor;
import com.hiringsteps.ats.applicant.domain.ApplicantKeyWord;
import com.hiringsteps.ats.job.domain.CustomerJobKeyword;
public class TextHighlightUtil
{
private int count;
public Collection<ApplicantKeyWord> highlight(String inputPath, String outputPath, Collection<CustomerJobKeyword> customerJobKeywordList )
{
Collection<ApplicantKeyWord> applicantKeywordList = new ArrayList<ApplicantKeyWord>();
ApplicantKeyWord applicantKeyword = null;
// 1. Open the PDF file!
File file;
try
{
file = new File(inputPath);
}
catch(Exception e)
{
throw new RuntimeException(inputPath + " file access error.",e);
}
for(CustomerJobKeyword key : customerJobKeywordList) {
applicantKeyword = new ApplicantKeyWord();
count = 0;
// Define the text pattern to look for!
//String textRegEx = promptChoice("Please enter the pattern to look for: ");
applicantKeyword.setKey(key);
Pattern pattern = Pattern.compile(key.getName(), Pattern.CASE_INSENSITIVE);
// 2. Iterating through the document pages...
TextExtractor textExtractor = new TextExtractor(true, true);
for(final Page page : file.getDocument().getPages())
{
// 2.1. Extract the page text!
Map<Rectangle2D,List<ITextString>> textStrings = textExtractor.extract(page);
// 2.2. Find the text pattern matches!
final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings));
// 2.3. Highlight the text pattern matches!
textExtractor.filter(textStrings,
new TextExtractor.IIntervalFilter()
{
public boolean hasNext()
{
//if(key.getMatchCriteria() == 1){
if (matcher.find()) {
count++;
return true;
}
/*} else if(key.getMatchCriteria() == 2) {
if (matcher.hitEnd()) {
count++;
return true;
}
}*/
return false;
}
public Interval<Integer> next()
{
return new Interval<Integer>(matcher.start(), matcher.end());
}
public void process(Interval<Integer> interval, ITextString match)
{
// Defining the highlight box of the text pattern match...
List<Quad> highlightQuads = new ArrayList<Quad>();
{
Rectangle2D textBox = null;
for(TextChar textChar : match.getTextChars())
{
Rectangle2D textCharBox = textChar.getBox();
if(textBox == null)
{textBox = (Rectangle2D)textCharBox.clone();}
else
{
if(textCharBox.getY() > textBox.getMaxY())
{
highlightQuads.add(Quad.get(textBox));
textBox = (Rectangle2D)textCharBox.clone();
}
else
{textBox.add(textCharBox);}
}
}
textBox.setRect(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight()+5);
highlightQuads.add(Quad.get(textBox));
}
// Highlight the text pattern match!
TextMarkup temp = new TextMarkup(page, MarkupTypeEnum.Highlight, highlightQuads);
temp.setVisible(true);
}
public void remove()
{throw new UnsupportedOperationException();}
}
);
}
applicantKeyword.setCount(count);
applicantKeywordList.add(applicantKeyword);
}
SerializationModeEnum serializationMode = SerializationModeEnum.Incremental;
try
{
file.save(new java.io.File(outputPath), serializationMode);
file.close();
}
catch(Exception e)
{
System.out.println("File writing failed: " + e.getMessage());
e.printStackTrace();
}
return applicantKeywordList;
}
}
With this, I am able to highlight. But when I render the PDF in Google Docs, the words are no longer highlighted. If the PDF is opened in Adobe, they are highlighted. Also, if I just open and save the PDF in Adobe Acrobat Professional and then open it with Google Docs, the Google Docs version will show the words highlighted.
See this also:
The author of PDF Clown reported that the problem was caused by the lack of an explicit appearance stream associated with the markup annotation. As subsequently stated, this issue has been solved by a revision committed to the project's SVN repository on SourceForge.net.
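A quick way to check whether a given PDF Clown build writes that appearance stream is to inspect each annotation dictionary for an /AP entry after highlighting. The following is only a diagnostic sketch against the PDF Clown 0.1.x API; the containsKey call on the annotation's base dictionary is an assumption about how the wrapper exposes the underlying PdfDictionary:
import org.pdfclown.documents.interaction.annotations.Annotation;
import org.pdfclown.objects.PdfName;
// ... after the highlights have been created:
for (Page page : file.getDocument().getPages()) {
    for (Annotation annotation : page.getAnnotations()) {
        // Viewers like Adobe Reader synthesize a missing appearance, but
        // Google Docs only renders annotations carrying an explicit /AP entry.
        boolean hasAppearance = annotation.getBaseDataObject().containsKey(PdfName.AP);
        System.out.println(annotation.getClass().getSimpleName() + " has appearance stream: " + hasAppearance);
    }
}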
Related
Every time the user logs in, lines are appended to the file, and I'm reading it. I want to read only till the SECOND-LAST LINE OF THE FILE. I want to know what changes I need to make to the code so that I can read only till the second-last line of the file.
public static boolean User(String usid) {
boolean a = false;
try {
String acc = usid;
File file = new File("C:\\Temp\\logs\\bank.log");
Scanner myReader = new Scanner(file);
while (myReader.hasNextLine()) {
String data = myReader.nextLine();
String[] substrings = data.split("[:]");
if (substrings[5].contains(acc) && substrings[4].contains("Login Successful for user")) {
a = true;
} else {
a = false;
}
}
} catch (Exception e) {
e.printStackTrace();
}
return a;
}
Could anyone please guide me on what changes I need to make to the above code to read till the second-last line of the file? [NOTE: the contents of this file keep growing as users log in or log out.]
Try this; you can modify it as per your requirement:
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
public class Test{
public static boolean User(String usid) {
boolean a=false;
String fileName = "c://lines.txt";
List<String> list = new ArrayList<>();
String acc = usid;
try (BufferedReader br = Files.newBufferedReader(Paths.get(fileName))) {
// br.lines() returns a Stream<String>; collect it into a List
list = br.lines().collect(Collectors.toList());
for(int i=0; i<list.size()-1; i++){
String data = list.get(i);
String[] substrings = data.split("[:]");
if (substrings[5].contains(acc) && substrings[4].contains("Login Successful for user")) {
a = true;
} else {
a = false;
}
}
} catch (IOException e) {
e.printStackTrace();
}
return a;
}
}
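For completeness, a hypothetical call site; the account id below is invented for illustration:
public static void main(String[] args) {
    // Reflects whether the second-last line of the log records a
    // successful login for this (made-up) account id.
    boolean loggedIn = Test.User("ACC1001");
    System.out.println("Login found: " + loggedIn);
}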
This is already answered at Link
@JasonPlutext,
Hi Jason! I tried the above code, but it just replaces the image entirely, deleting the whole template.
I would like to just replace/add a particular relationship of the image, say
<Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image10.png"/>
and in place of the rId8 image I would like to put the rId7 image.
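In sketch form, the idea being asked about is to retarget an existing relationship rather than remove any parts. This assumes docx4j's JAXB Relationship type and reuses the slidePart, relationship id, and target from the source code below; it is not the approach the accepted answer takes:
import org.docx4j.relationships.Relationship;
// ...
// Look up the image relationship on the slide and point it at another media part.
Relationship rel = slidePart.getRelationshipsPart().getRelationshipByID("rId8");
rel.setTarget("../media/image10.png"); // assumes this media part already exists in the package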
My Source Code:
public static void main(String[] args) throws Exception {
String inputfilepath = "C:\\Users\\saranyac\\QUERIES\\Estimation\\PPT-PSR\\PSR_Dev0ps\\PSRAutomationTemplate.pptx";
PresentationMLPackage presentationMLPackage = (PresentationMLPackage)OpcPackage.load(new java.io.File(inputfilepath));
MainPresentationPart pp = presentationMLPackage.getMainPresentationPart();
SlidePart slidePart = presentationMLPackage.getMainPresentationPart().getSlide(0);
SlideLayoutPart layoutPart = slidePart.getSlideLayoutPart();
System.out.println("SlidePart Name:::::"+slidePart.getPartName().getName());
String layoutName = layoutPart.getJaxbElement().getCSld().getName();
System.out.println("layout: " + layoutPart.getPartName().getName() + " with cSld/#name='" + layoutName + "'");
System.out.println("Master: " + layoutPart.getSlideMasterPart().getPartName().getName());
System.out.println("layoutPart.getContents()::::::::s: " + layoutPart.getContents());
//layoutPart.setContents( (SldLayout)XmlUtils.unmarshalString(SAMPLE_PICTURE, Context.jcPML));
// Add image part
File file = new File("C:\\Users\\saranyac\\PPT-PSR\\PSR_Dev0ps\\ppt\\media\\image10.png" );
BinaryPartAbstractImage imagePart
= BinaryPartAbstractImage.createImagePart(presentationMLPackage, slidePart, file);
Relationship rel = pp.getRelationshipsPart().getRelationshipByID("rId8");
System.out.println("Relationship:::::::s: " +imagePart.getSourceRelationship().getId());
// pp.removeSlide(rel);
java.util.HashMap<String, String>mappings = new java.util.HashMap<String, String>();
mappings.put("rId8", imagePart.getSourceRelationship().getId());
String outputfilepath = "C:\\Work\\24Jan2018_CheckOut\\PPT-TRAILS\\Success.pptx";
//presentationMLPackage.save(new java.io.File(outputfilepath));
SaveToZipFile saver = new SaveToZipFile(presentationMLPackage);
saver.save(outputfilepath);
System.out.println("\n\n done .. saved " + outputfilepath);
}
Please help me understand how to replace an image in the generated PPT.
With Regards,
Saranya
See https://github.com/plutext/docx4j/blob/master/src/samples/pptx4j/org/pptx4j/samples/TemplateReplaceSimple.java (just added):
package org.pptx4j.samples;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import javax.xml.bind.JAXBException;
import org.apache.commons.io.FileUtils;
import org.docx4j.TraversalUtil;
import org.docx4j.TraversalUtil.CallbackImpl;
import org.docx4j.dml.CTBlip;
import org.docx4j.dml.CTBlipFillProperties;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.OpcPackage;
import org.docx4j.openpackaging.packages.PresentationMLPackage;
import org.docx4j.openpackaging.parts.Part;
import org.docx4j.openpackaging.parts.PresentationML.SlidePart;
import org.docx4j.openpackaging.parts.WordprocessingML.BinaryPartAbstractImage;
import org.pptx4j.Pptx4jException;
/**
* Example of how to replace text and images in a Pptx.
*
* Text is replaced using the familiar VariableReplace approach.
*
* Images are replaced by replacing their byte content.
*
* @author jharrop
*
*/
public class TemplateReplaceSimple {
public static void main(String[] args) throws Docx4JException, Pptx4jException, JAXBException, IOException {
// Input file
String inputfilepath = System.getProperty("user.dir") + "/sample-docs/pptx/image.pptx";
// String replacements
HashMap<String, String> mappings = new HashMap<String, String>();
mappings.put("colour", "green");
// Image replacements
List<ImageReplacementDetails> imageReplacements = new ArrayList<ImageReplacementDetails>();
ImageReplacementDetails example1 = new ImageReplacementDetails();
example1.slideIndex = 0;
example1.imageRelId = "rId2";
example1.replacementImageBytes = FileUtils.readFileToByteArray(new File("test.png"));
imageReplacements.add(example1);
PresentationMLPackage presentationMLPackage =
(PresentationMLPackage)OpcPackage.load(new java.io.File(inputfilepath));
// First, the text replacements
List<SlidePart> slideParts=
presentationMLPackage.getMainPresentationPart().getSlideParts();
for (SlidePart slidePart : slideParts) {
slidePart.variableReplace(mappings);
}
// Second, the image replacements.
// We have a design choice here.
// Either we can replace text placeholders with images,
// or we can replace existing images with new images, but keep the XML specifying size etc
// Here I opt for the latter, so what we need is the relId and image bytes.
for( ImageReplacementDetails ird : imageReplacements) {
// it's a bit inefficient to potentially traverse a single slide
// multiple times, but I've done it this way to keep this example simple
SlidePart slidePart=
presentationMLPackage.getMainPresentationPart().getSlide(ird.slideIndex);
SlidePicFinder traverser = new SlidePicFinder();
new TraversalUtil(slidePart.getJaxbElement().getCSld().getSpTree().getSpOrGrpSpOrGraphicFrame(), traverser);
for(org.pptx4j.pml.Pic pic : traverser.pics) {
CTBlipFillProperties blipFill = pic.getBlipFill();
if (blipFill!=null) {
CTBlip blip = blipFill.getBlip();
if (blip.getEmbed()!=null) {
String relId = blip.getEmbed();
// is this the one we want?
if (relId.equals(ird.imageRelId)) {
Part part = slidePart.getRelationshipsPart().getPart(relId);
try {
BinaryPartAbstractImage imagePart = (BinaryPartAbstractImage)part;
// you'll need to ensure that you replace like with like,
// ie png for png, not eg jpeg for png!
imagePart.setBinaryData(ird.replacementImageBytes);
} catch (ClassCastException cce) {
System.out.println(part.getClass().getName());
}
} else {
System.out.println(relId + " isn't a match for this replacement. ");
}
} else {
System.out.println("No a:blip/#r:embed");
}
}
}
}
System.out.println("\n\n saving .. \n\n");
String outputfilepath = System.getProperty("user.dir") + "/OUT_VariableReplace.pptx";
presentationMLPackage.save(new java.io.File(outputfilepath));
System.out.println("\n\n done .. \n\n");
}
static class ImageReplacementDetails {
int slideIndex;
String imageRelId;
byte[] replacementImageBytes;
}
static class SlidePicFinder extends CallbackImpl {
List<org.pptx4j.pml.Pic> pics = new ArrayList<org.pptx4j.pml.Pic>();
public List<Object> apply(Object o) {
if (o instanceof org.pptx4j.pml.Pic) {
pics.add((org.pptx4j.pml.Pic) o);
System.out.println("added pic");
}
return null;
}
}
}
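As written, the sample expects sample-docs/pptx/image.pptx under the working directory, replaces the text variable colour with green on every slide, swaps the bytes of the slide-0 image whose relationship id is rId2 for the contents of test.png, and saves the result as OUT_VariableReplace.pptx.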
I am facing an issue where some search keywords are not highlighted in Chinese documents. Due to confidentiality concerns, I am not providing the actual PDF. The search keywords are 1) 亿元或 2) 收入亿来源. Please find the PDF document path which I tested (pdfpath link) and the actual result (ActualResult link). I have already posted about this issue in the following Link, but some of the keywords are still not highlighted properly in a few Chinese documents. Kindly provide your inputs on highlighting the search keywords I mentioned.
import java.awt.Color;
import java.awt.Desktop;
import java.awt.geom.Rectangle2D;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.BufferedInputStream;
import java.io.File;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.ITextString;
import org.pdfclown.documents.contents.TextChar;
import org.pdfclown.documents.contents.colorSpaces.DeviceRGBColor;
import org.pdfclown.documents.interaction.annotations.TextMarkup;
import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;
import org.pdfclown.files.SerializationModeEnum;
import org.pdfclown.util.math.Interval;
import org.pdfclown.util.math.geom.Quad;
import org.pdfclown.tools.TextExtractor;
public class pdfclown2 {
private static int count;
public static void main(String[] args) throws IOException {
highlight("ebook.pdf","C:\\Users\\Downloads\\6.pdf");
System.out.println("OK");
}
private static void highlight(String inputPath, String outputPath) throws IOException {
org.pdfclown.files.File file = null;
try {
file = new org.pdfclown.files.File(inputPath);
Map<String, String> m = new HashMap<String, String>();
m.put("亿元或","hi");
m.put("收入亿来","hi");
System.out.println("map size"+m.size());
long startTime = System.currentTimeMillis();
// 2. Iterating through the document pages...
TextExtractor textExtractor = new TextExtractor(true, true);
for (final Page page : file.getDocument().getPages()) {
Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);
for (Map.Entry<String, String> entry : m.entrySet()) {
Pattern pattern;
String searchKey = entry.getKey();
final String translationKeyword = entry.getValue();
/*
if ((searchKey.contains(")") && searchKey.contains("("))
|| (searchKey.contains("(") && !searchKey.contains(")"))
|| (searchKey.contains(")") && !searchKey.contains("(")) || searchKey.contains("?")
|| searchKey.contains("*") || searchKey.contains("+")) {
pattern = Pattern.compile(Pattern.quote(searchKey), Pattern.CASE_INSENSITIVE);
}
else*/
pattern = Pattern.compile(searchKey, Pattern.CASE_INSENSITIVE);
// 2.1. Extract the page text!
//System.out.println(textStrings.toString().indexOf(entry.getKey()));
// 2.2. Find the text pattern matches!
final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings));
// 2.3. Highlight the text pattern matches!
textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {
public boolean hasNext() {
// System.out.println(matcher.find());
// if(key.getMatchCriteria() == 1){
if (matcher.find()) {
return true;
}
/*
* } else if(key.getMatchCriteria() == 2) { if
* (matcher.hitEnd()) { count++; return true; } }
*/
return false;
}
public Interval<Integer> next() {
return new Interval<Integer>(matcher.start(), matcher.end());
}
public void process(Interval<Integer> interval, ITextString match) {
// Defining the highlight box of the text pattern
// match...
System.out.println(match);
/* List<Quad> highlightQuads = new ArrayList<Quad>();
{
Rectangle2D textBox = null;
for (TextChar textChar : match.getTextChars()) {
Rectangle2D textCharBox = textChar.getBox();
if (textBox == null) {
textBox = (Rectangle2D) textCharBox.clone();
} else {
if (textCharBox.getY() > textBox.getMaxY()) {
highlightQuads.add(Quad.get(textBox));
textBox = (Rectangle2D) textCharBox.clone();
} else {
textBox.add(textCharBox);
}
}
}
textBox.setRect(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight());
highlightQuads.add(Quad.get(textBox));
}*/
List<Quad> highlightQuads = new ArrayList<Quad>();
List<TextChar> textChars = match.getTextChars();
Rectangle2D firstRect = textChars.get(0).getBox();
Rectangle2D lastRect = textChars.get(textChars.size()-1).getBox();
Rectangle2D rect = firstRect.createUnion(lastRect);
highlightQuads.add(Quad.get(rect));
// subtype can be Highlight, Underline, StrikeOut, Squiggly
new TextMarkup(page, highlightQuads, translationKeyword, MarkupTypeEnum.Highlight);
}
public void remove() {
throw new UnsupportedOperationException();
}
});
}
}
SerializationModeEnum serializationMode = SerializationModeEnum.Standard;
file.save(new java.io.File(outputPath), serializationMode);
System.out.println("file created");
long endTime = System.currentTimeMillis();
System.out.println("seconds take for execution is:"+(endTime-startTime)/1000);
} catch (Exception e) {
e.printStackTrace();
}
finally {
if (file != null) {
file.close();
}
}
}
}
Indeed, when searching for "亿元或", the resulting highlight is wrong: it covers only part of the matched text.
The cause is a PDF Clown bug. When it parses a composite font (aka Type 0 font), it expects the DW (default width) entry in the Type 0 font base dictionary, while the specification puts it in the CIDFont subdictionary!
In the document at hand, the widths of most characters, in particular of the Chinese characters, are not given explicitly and therefore default to that DW value. As this value cannot be determined properly due to the bug mentioned above, an average over the explicitly given widths is used instead, and this average happens to be merely ¾ of the correct value. Thus, the highlighted area is too short.
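To illustrate with round numbers (the 1000 here is the specification's default for DW; the ¾ ratio is from the document at hand): if the correct default width is 1000 units but the average of the explicit widths comes out at 750, every defaulted glyph is measured at ¾ of its true width, so a three-character match is highlighted as if it were only 2¼ characters long.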
You can fix this bug in the CompositeFont class (package org.pdfclown.documents.contents.fonts) at the end of the method onLoad. Simply replace
PdfInteger defaultWidthObject = (PdfInteger)getBaseDataObject().get(PdfName.DW);
by
PdfInteger defaultWidthObject = (PdfInteger)getCIDFontDictionary().get(PdfName.DW);
With this fix, the highlighting covers the full match.
I'd been looking for an easy way to add IDs to HTML tags and spent a few hours here, jumping from one tool to another, before I came up with this little test that solved my issue. Since my sprint backlog is almost empty, I have some time to share. Feel free to clean it up, and enjoy, those of you who are asked by QA to add IDs. Just change the tag and path, and run :)
I had some trouble getting a proper lambda together due to lack of coffee today...
How do you replace only the first occurrence, in a single lambda? My files had many lines containing the same tags.
private void replace(String path, String replace, String replaceWith) {
try (Stream<String> lines = Files.lines(Paths.get(path))) {
List<String> replaced = lines
.map(line -> line.replace(replace, replaceWith))
.collect(Collectors.toList());
Files.write(Paths.get(path), replaced);
} catch (IOException e) {
e.printStackTrace();
}
}
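For the first-occurrence question above, a minimal alternative sketch that avoids a per-line flag, assuming the file fits comfortably in memory (replaceFirst takes a regex and a replacement pattern, hence the quoting; the method name is mine):
private void replaceFirstOccurrence(String path, String replace, String replaceWith) {
    try {
        String content = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        // Quote both sides so literal text cannot be misread as regex syntax.
        content = content.replaceFirst(Pattern.quote(replace), Matcher.quoteReplacement(replaceWith));
        Files.write(Paths.get(path), content.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
        e.printStackTrace();
    }
}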
The version above replaced every matching line, because it kept finding the text to replace on subsequent lines too. A proper matcher with an auto-incrementing replacement inside this method body would be better than preparing the replaceWith value before the call. If I ever need this again, I'll add another final version.
Final version, to not waste more time (phase green):
import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.runners.MockitoJUnitRunner;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
@RunWith(MockitoJUnitRunner.class)
public class ReplaceInFilesWithAutoIncrement {
private int incremented = 100;
/**
* The tag you would like to add Id to
* */
private static final String tag = "label";
/**
* Regex to find the tag
* */
private static final Pattern TAG_REGEX = Pattern.compile("<" + tag + " (.+?)/>", Pattern.DOTALL);
private static final Pattern ID_REGEX = Pattern.compile("id=", Pattern.DOTALL);
@Test
public void replaceInFiles() throws IOException {
String nextId = " id=\"" + tag + "_%s\" ";
String path = "C:\\YourPath";
try (Stream<Path> paths = Files.walk(Paths.get(path))) {
paths.forEach(filePath -> {
if (Files.isRegularFile(filePath)) {
try {
List<String> foundInFiles = getTagValues(readFile(filePath.toAbsolutePath().toString()));
if (!foundInFiles.isEmpty()) {
for (String tagEl : foundInFiles) {
incremented++;
String id = String.format(nextId, incremented);
String replace = tagEl.split("\\r?\\n")[0];
replace = replace.replace("<" + tag, "<" + tag + id);
replace(filePath.toAbsolutePath().toString(), tagEl.split("\\r?\\n")[0], replace, false);
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
});
}
System.out.println(String.format("Finished with (%s) changes", incremented - 100));
}
private String readFile(String path)
throws IOException {
byte[] encoded = Files.readAllBytes(Paths.get(path));
return new String(encoded, StandardCharsets.UTF_8);
}
private List<String> getTagValues(final String str) {
final List<String> tagValues = new ArrayList<>();
final Matcher matcher = TAG_REGEX.matcher(str);
while (matcher.find()) {
if (!ID_REGEX.matcher(matcher.group()).find())
tagValues.add(matcher.group());
}
return tagValues;
}
private void replace(String path, String replace, String replaceWith, boolean log) {
if (log) {
System.out.println("path = [" + path + "], replace = [" + replace + "], replaceWith = [" + replaceWith + "], log = [" + log + "]");
}
try (Stream<String> lines = Files.lines(Paths.get(path))) {
List<String> replaced = new ArrayList<>();
boolean alreadyReplaced = false;
for (String line : lines.collect(Collectors.toList())) {
if (line.contains(replace) && !alreadyReplaced) {
line = line.replace(replace, replaceWith);
alreadyReplaced = true;
}
replaced.add(line);
}
Files.write(Paths.get(path), replaced);
} catch (IOException e) {
e.printStackTrace();
}
}
}
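As written, the test walks every regular file under C:\YourPath, inserts id="label_101", id="label_102", and so on into each <label .../> tag that does not already carry an id, and prints the number of changes made at the end.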
Try it with Jsoup.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupTest {
public static void main(String argv[]) {
String html = "<html><head><title>Try it with Jsoup</title></head>"
+ "<body><p>P first</p><p>P second</p><p>P third</p></body></html>";
Document doc = Jsoup.parse(html);
Elements ps = doc.select("p"); // The tag you would like to add Id to
int i = 12;
for(Element p : ps){
p.attr("id",String.valueOf(i));
i++;
}
System.out.println(doc.toString());
}
}
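Running this prints the same document with sequential ids assigned in document order: the three paragraphs come out as <p id="12">, <p id="13"> and <p id="14">.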
I am writing a web crawler program using the Jsoup library. (Sorry, I cannot post my code because it is too long to post here.) I need to crawl only URLs that can lead me to new links, without crawling URLs that start with http or https and end with image, PDF, RAR or ZIP files. I need to crawl only URLs ending with .html, .htm, .jsp, .php, .asp, etc.
I have two questions regarding this issue:
1- How can I prevent the program from reading unneeded URLs (like images, PDFs or RARs)?
2- How can I improve this class so it does not waste time loading the whole URL content into memory and then parsing the URLs out of it?
This is my code below :
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Writer;
import java.math.BigInteger;
import java.util.Formatter;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.security.*;
import java.nio.file.Path;
import java.nio.file.Paths;
public class HTMLParser {
private static final int READ_TIMEOUT_IN_MILLISSECS = (int) TimeUnit.MILLISECONDS.convert(30, TimeUnit.SECONDS);
private static HashMap <String, Integer> filecounter = new HashMap<> ();
public static List<LinkNodeLight> parse(LinkNode inputLink){
List<LinkNodeLight> outputLinks = new LinkedList<>();
try {
inputLink.setIpAdress(IpFromUrl.getIp(inputLink.getUrl()));
String url = inputLink.getUrl();
if (inputLink.getIpAdress() != null) {
url = url.replace(URLWeight.getHostName(url), inputLink.getIpAdress());
}
Document parsedResults = Jsoup
.connect(url)
.timeout(READ_TIMEOUT_IN_MILLISSECS)
.userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
.get();
inputLink.setSize(parsedResults.html().length());
/* IP address moved here in order to speed up the process */
inputLink.setStatus(LinkNodeStatus.OK);
inputLink.setDomain(URLWeight.getDomainName(inputLink.getUrl()));
if (true) {
/* save the file to the html */
String filename = parsedResults.title();//digestBig.toString(16) + ".html";
if (filename.length() > 24) {
filename = filename.substring(0, 24);
}
filename = filename.replaceAll("[^\\w\\d\\s]", "").trim();
filename = filename.replaceAll("\\s+", " ");
if (!filecounter.containsKey(filename)) {
filecounter.put(filename, 1);
} else {
Integer tmp = filecounter.remove(filename);
filecounter.put(filename, tmp + 1);
}
filename = filename + "-" + (filecounter.get(filename)).toString() + ".html";
filename = Paths.get("downloads", filename).toString();
inputLink.setFileName(filename);
/* write the page, prefixed with its source URL in an HTML comment */
try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(filename)))) {
out.println("<!--" + inputLink.getUrl() + "-->");
out.print(parsedResults.html());
out.flush();
out.close();
} catch (IOException e) {
e.printStackTrace();
}
}
String tag;
Elements tagElements;
List<LinkNode> result;
tag = "a[href";
tagElements = parsedResults.select(tag);
result = toLinkNodeObject(inputLink, tagElements, tag);
outputLinks.addAll(result);
tag = "area[href";
tagElements = parsedResults.select(tag);
result = toLinkNodeObject(inputLink, tagElements, tag);
outputLinks.addAll(result);
} catch (IOException e) {
inputLink.setParseException(e);
inputLink.setStatus(LinkNodeStatus.ERROR);
}
return outputLinks;
}
static List<LinkNode> toLinkNodeObject(LinkNode parentLink, Elements tagElements, String tag) {
List<LinkNode> links = new LinkedList<>();
for (Element element : tagElements) {
if(isFragmentRef(element)){
continue;
}
String absoluteRef = String.format("abs:%s", tag.contains("[") ? tag.substring(tag.indexOf("[") + 1, tag.length()) : "href");
String url = element.attr(absoluteRef);
if(url!=null && url.trim().length()>0) {
LinkNode link = new LinkNode(url);
link.setTag(element.tagName());
link.setParentLink(parentLink);
links.add(link);
}
}
return links;
}
static boolean isFragmentRef(Element element){
String href = element.attr("href");
return href!=null && (href.trim().startsWith("#") || href.startsWith("mailto:"));
}
}
To add another solution to Pshemo's for your first question: you may want to match the URL against a regex so that you don't even take the element and put it in the list.
In the method static List<LinkNode> toLinkNodeObject, something like https?://.+(?<!\.pdf)(?<!\.rar)(?<!\.zip) (negative lookbehinds rejecting the unwanted extensions; the regex in the original post did not quite do this) matched against your URL will do. This will speed up the program too, because you won't even be adding those links to parse.
String url = element.attr(absoluteRef);
// Keep only http(s) URLs that do not end in an unwanted extension.
if(url!=null && url.trim().length()>0
&& url.matches("https?://.+(?<!\\.pdf)(?<!\\.rar)(?<!\\.zip)")) {
LinkNode link = new LinkNode(url);
link.setTag(element.tagName());
link.setParentLink(parentLink);
links.add(link);
}
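For example, http://site.com/page.html passes this filter, while http://site.com/file.pdf is rejected before a LinkNode is ever created.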
As for speeding up the class as a whole, it would help to multithread the downloading and parsing, allowing multiple threads to fetch and validate pages concurrently.
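A minimal sketch of that idea, reusing HTMLParser.parse and the LinkNode/LinkNodeLight types from the question (the getUrl() accessor on LinkNodeLight, the worker count, and the seed URL are assumptions; politeness concerns such as robots.txt and rate limiting are ignored):
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class CrawlerPool {
    private static final BlockingQueue<LinkNode> FRONTIER = new LinkedBlockingQueue<>();
    private static final Set<String> SEEN = ConcurrentHashMap.newKeySet();

    public static void main(String[] args) {
        FRONTIER.add(new LinkNode("http://example.com/")); // invented seed URL
        for (int i = 0; i < 8; i++) { // placeholder worker count
            new Thread(() -> {
                try {
                    LinkNode link;
                    // A worker exits once the frontier has been empty for 30 seconds.
                    while ((link = FRONTIER.poll(30, TimeUnit.SECONDS)) != null) {
                        for (LinkNodeLight out : HTMLParser.parse(link)) {
                            // De-duplicate before re-queueing newly discovered links.
                            if (SEEN.add(out.getUrl())) {
                                FRONTIER.add(new LinkNode(out.getUrl()));
                            }
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}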