Searching for a particular string in a PDF document using Java - java

Hi, I have a PDF file and I need to search for a particular string in it. I tried various methods; I am able to read all the contents of the PDF file, but I am unable to find a particular string.
In this file, I need to search for strings such as Telephone, Garbage, Rent, etc. individually.
Could you please help me?
I have the code below for reading the file.
public class PDFBoxReader {
    private PDFParser parser;
    private PDFTextStripper pdfStripper;
    private PDDocument pdDoc;
    private COSDocument cosDoc;
    private String Text;
    private String filePath;
    private File file;
    public PDFBoxReader() {
    }
    public String ToText() throws IOException
    {
        this.pdfStripper = null;
        this.pdDoc = null;
        this.cosDoc = null;
        file = new File("D:\\report.pdf");
        parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdDoc.getNumberOfPages();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(10);
        // reading text from page 1 to 10
        // if you want to get text from full pdf file use this code
        // pdfStripper.setEndPage(pdDoc.getNumberOfPages());
        Text = pdfStripper.getText(pdDoc);
        return Text;
    }
    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }
}
It would be great if someone could help me with code that searches for a particular string. Thanks in advance.

Try text.indexOf("substring"), where text is the String returned from your ToText() method and "substring" is the string you wish to search for. (Side note: the convention in Java is camel-case method names, so this would be toText().)
This method returns the index of the first occurrence of the substring in your long String of text, or -1 if the substring does not occur. So you could call text.indexOf("Telephone") to find the first occurrence of the word Telephone.
If you want the text directly after that substring, the start index is simply text.indexOf("substring") + "substring".length().
You can also find the next occurrence (or the one after that) with the two-argument variant of this method: text.indexOf("substring", indexOfLastOccurrence + "substring".length()).
Example:
String myPDF = ToText();
int rentIndex = myPDF.indexOf("Rent"); // -1 if "Rent" does not occur
if (rentIndex >= 0) {
    rentIndex += "Rent".length();
    String rent = myPDF.substring(rentIndex); // everything after the 1st occurrence of "Rent"
    rent = rent.substring(0, Math.min(10, rent.length())); // keep only the first few characters (10 is an arbitrary example length)
    // process rent, e.g. Integer.parseInt(rent.trim()) if it holds only digits
    rentIndex = myPDF.indexOf("Rent", rentIndex); // next occurrence, or -1 if there is none
}
// Repeat to find the next occurrence, and the one after that, until indexOf returns -1, which indicates that no more occurrences exist.
Both methods can be found in the Java API:
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#indexOf(java.lang.String)
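The "repeat until the index is negative" idea can be sketched as a small helper. This is only a minimal sketch and not part of the original answer: the class name KeywordSearch and the method findAll are hypothetical, and the sample text merely stands in for whatever ToText() returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the start index of every occurrence of a keyword.
final class KeywordSearch {

    static List<Integer> findAll(String text, String keyword) {
        List<Integer> positions = new ArrayList<>();
        if (keyword.isEmpty()) {
            return positions; // an empty keyword would match at every index
        }
        int index = text.indexOf(keyword);
        while (index >= 0) {
            positions.add(index);
            // Resume the search just past the occurrence that was found.
            index = text.indexOf(keyword, index + keyword.length());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from the PDF.
        String text = "Rent 1200\nTelephone 45\nGarbage 30\nRent 1250\n";
        for (String keyword : new String[] {"Telephone", "Garbage", "Rent"}) {
            System.out.println(keyword + " found at " + findAll(text, keyword));
        }
    }
}
```

An empty result list means the keyword does not occur, which also answers the "is this string in the PDF at all" question.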

Related

Stop bullet numbers from being updated automatically when merging Word docs using docx4j

I am trying to merge two docx files, each of which has its own bullet numbering. After merging the Word docs, the bullet numbers are automatically updated.
E.g.:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the bullet numbering is updated to 1 2 3 4 5 6.
How can I stop this?
I am using the following code:
if(counter==1)
{
    FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
    FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
    main = FirstWordFile.getMainDocumentPart();
    //Add page break for Table of Content
    main.addObject(objBr);
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
    }
    //Table of contents - End
}
else
{
    FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FileIS = new java.io.ByteArrayInputStream(FileByteStream);
    byte[] bytes = IOUtils.toByteArray(FileIS);
    AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
    afiPart.setContentType(new ContentType(CONTENT_TYPE));
    afiPart.setBinaryData(bytes);
    Relationship altChunkRel = main.addTargetPart(afiPart);
    CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
    chunk.setId(altChunkRel.getId());
    main.addObject(objBr);
    htmlCode = new StringBuilder();
    htmlCode.append("<html>");
    htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">"+ReqName+"</p></h2>");
    htmlCode.append("</html>");
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
    }
    //Add Page Break before new content
    main.addObject(objBr);
    //Add new content
    main.addObject(chunk);
}
Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML.)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS, but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks and have docx4j-ImportXHTML convert them by calling main.convertAltChunks().
If the same problem occurs when you try that, well, at least we can address it.
I was able to fix my issue using the following code, which I found at (http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml). You can also generate custom code; they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir")+ "/";
public final static String DIR_OUT = System.getProperty("user.dir")+ "/";
public static void main(String[] args) throws Exception
{
    String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
    // Typed list instead of the raw List type
    List<BlockRange> blockRanges = new ArrayList<BlockRange>();
    for (int i=0 ; i< files.length; i++) {
        BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
        blockRanges.add( block );
        block.setStyleHandler(StyleHandler.RENAME_RETAIN);
        // Give each merged document its own list, so numbering restarts instead of continuing
        block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
        block.setRestartPageNumbering(false);
        block.setHeaderBehaviour(HfBehaviour.DEFAULT);
        block.setFooterBehaviour(HfBehaviour.DEFAULT);
        block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
    }
    // Perform the actual merge
    DocumentBuilder documentBuilder = new DocumentBuilder();
    WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
    // Save the result
    SaveToZipFile saver = new SaveToZipFile(output);
    saver.save(DIR_OUT+"OUT_MergeWholeDocumentsUsingBlockRange.docx");
}

Search pattern within String in JAVA

I'm using PDFBox in Java and have successfully retrieved the text of a PDF. Now I wish to search for a specific word and retrieve only the number that follows it. To be concrete, I want to search for Tax and retrieve the number that is the tax. The two strings seem to be separated by a tab.
My code is currently as follows:
File file = new File("yes.pdf");
// try-with-resources closes the document even if reading fails
try (PDDocument document = PDDocument.load(file)) {
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(document);
    System.out.println(text);
    // search for the word "Tax"
    // retrieve the number after the word "Tax"
} catch (IOException e) {
    e.printStackTrace();
}
I have used something similar in my project. I hope it helps you.
public class ExtractNumber {
    public static void main(String[] args) throws IOException {
        List<String> digitList = new ArrayList<String>();
        // "yourFile location" is a placeholder for the path to your PDF;
        // try-with-resources closes the document when done
        try (PDDocument doc = PDDocument.load(new File("yourFile location"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Read the text from the PDF
            String string = stripper.getText(doc);
            // A letter followed by one or more digits
            Pattern mainPattern = Pattern.compile("[a-zA-Z]\\d+");
            // Digits only; compiled once, outside the loop
            Pattern subPattern = Pattern.compile("\\d+");
            Matcher mainMatcher = mainPattern.matcher(string);
            while (mainMatcher.find()) {
                // Keep only the digits of each match
                Matcher subMatcher = subPattern.matcher(mainMatcher.group());
                if (subMatcher.find()) {
                    digitList.add(subMatcher.group());
                }
            }
        }
        for (String digit : digitList) {
            System.out.println(digit);
        }
    }
}
The regular expression [a-zA-Z]\d+ finds a letter followed by one or more digits in the PDF text.
The expression \d+ then extracts just the digits from each of those matches.
You can also use a different regular expression to find a specific number of digits, e.g. \d{4} for exactly four digits.
You can get more ideas from this tutorial.
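The two-step match (find the letter-plus-digits run, then pull the digits out of it) can also be written with a single capturing group. The following is a minimal sketch on plain strings rather than PDF text; the class name DigitsAfterLetter and the method extract are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns the digit runs that directly follow an ASCII letter.
final class DigitsAfterLetter {

    // Group 1 captures only the digits, so no second pattern is needed.
    private static final Pattern LETTER_THEN_DIGITS = Pattern.compile("[a-zA-Z](\\d+)");

    static List<String> extract(String text) {
        List<String> digits = new ArrayList<>();
        Matcher matcher = LETTER_THEN_DIGITS.matcher(text);
        while (matcher.find()) {
            digits.add(matcher.group(1));
        }
        return digits;
    }

    public static void main(String[] args) {
        // Sample input standing in for the text extracted from a PDF.
        System.out.println(extract("Invoice A123, room b7, total 99"));
    }
}
```

Note that digits separated from the letter by whitespace (such as "total 99" in the sample) are not matched by this pattern, which is the same behaviour as the original [a-zA-Z]\d+ expression.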
The best way to do something like that is to use regular expressions. I often use this tool to write my regular expressions. Your regex should probably look something like: tax\s([0-9]+). You can take a look at this tutorial on how to use regex in Java.
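As a rough illustration of that regex in Java, here is a minimal sketch that looks for a label such as Tax followed by whitespace (a tab counts) and a number. The class name LabelledNumber, the method numberAfter and the sample text are assumptions made for this example; case-insensitive matching and an optional decimal part were added because the label in the PDF is written "Tax" and amounts often carry decimals.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the first number that follows a given label.
final class LabelledNumber {

    static Optional<String> numberAfter(String text, String label) {
        // \b keeps "Tax" from matching inside a longer word; \s+ covers tabs and spaces.
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(label) + "\\s+(\\d+(?:[.,]\\d+)?)",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by PDFTextStripper.getText(document).
        String text = "Subtotal\t100\nTax\t25\nTotal\t125\n";
        System.out.println(numberAfter(text, "Tax").orElse("not found"));
    }
}
```

The result is returned as a String so the caller can decide how to parse it (for example with Integer.parseInt or new BigDecimal), depending on the number format used in the document.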

PDFBox - Splitting one single pdf into multiple pdf files

My requirement is to split a large PDF file into multiple small PDF files. I have a 10000-page PDF file and I want to split it into 1000 files of 10 pages each. I tried to split the file using the PDFBox API. I am able to split the file as required, and it works fine for files with a small number of pages. But when I tried it with 10000 pages, it took a huge amount of time, i.e. hours. In the actual scenario I may even get PDF files with more than 20000 pages and more than 5000 splits.
The time to split decreases as the number of splits decreases. If I split the same file into 100 files of 100 pages each, it takes less time. Can anyone please review my code and check whether I am doing it the right way, or whether I can add code to improve the performance?
Note: I cannot use iText since this is a client-specific project. Is there any API available for splitting PDF files other than iText and PDFBox?
Please refer to my code below:
public class Test {
    private static String sourceFolderPath = "/local_path/PDFSplitter_perf/10000_pages/";
    private static String outputPath = sourceFolderPath+"output/";
    private static String pdfFileName = sourceFolderPath+"test_1.pdf";
    private static int pageCount = 10;
    public static void main(String[] args) throws IOException {
        splitUsingPDFBox(pdfFileName);
    }
    // Only IOException is thrown here, which also matches what main declares
    public static void splitUsingPDFBox(String pdfFilePath) throws IOException {
        try (final PDDocument document = PDDocument.load(new File(pdfFilePath))) {
            int i = 1;
            while(i<10000){
                int startPage = i;
                int endPage = i + (pageCount-1);
                String childPdfFile = outputPath+"/"+startPage+"_"+endPage+".pdf";
                Splitter splitter = new Splitter();
                splitter.setStartPage(startPage);
                splitter.setEndPage(endPage);
                splitter.setSplitAtPage(endPage);
                List<PDDocument> pages = splitter.split(document);
                PDDocument pd = null;
                try{
                    pd = pages.get(0);
                    pd.save(childPdfFile);
                }finally{
                    if( pd != null ){
                        pd.close();
                    }
                }
                i += pageCount; // advance to the next chunk; without this the loop never terminates
            }
        }
    }
}
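The chunk boundaries in the loop above are plain arithmetic, and they can be computed independently of any PDF library. The following minimal sketch does that, including a shorter final chunk when the page count is not a multiple of the chunk size (a case the fixed while(i<10000) bound does not cover). The class name PageRanges and the method chunk are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the page range of each output file.
final class PageRanges {

    // Returns {startPage, endPage} pairs, 1-based and inclusive.
    static List<int[]> chunk(int totalPages, int pagesPerFile) {
        if (totalPages < 0 || pagesPerFile <= 0) {
            throw new IllegalArgumentException("totalPages must be >= 0 and pagesPerFile > 0");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += pagesPerFile) {
            // The last chunk may be shorter than pagesPerFile.
            int end = Math.min(start + pagesPerFile - 1, totalPages);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Prints the output file names for a 25-page document split into 10-page files.
        for (int[] range : chunk(25, 10)) {
            System.out.println(range[0] + "_" + range[1] + ".pdf");
        }
    }
}
```

With PDFBox itself, one option that may be worth trying (untested here against the 10000-page file) is to create a single Splitter, call setSplitAtPage(pageCount) without setting start and end pages, and call split(document) once. The returned list should then hold one PDDocument per chunk, so the source pages are traversed once rather than once per output file.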

ArrayList<String> in PDF from a new row

I want to send a survey as a PDF from Java. I tried different methods, with and without StringBuffer, but the text in the PDF always appears on one line.
public void writePdf(OutputStream outputStream) throws Exception {
    Paragraph paragraph = new Paragraph();
    Document document = new Document();
    PdfWriter.getInstance(document, outputStream);
    document.open();
    document.addTitle("Survey PDF");
    ArrayList nameArrays = new ArrayList();
    StringBuffer sb = new StringBuffer();
    int i = -1;
    for (String properties : textService.getAnswer()) {
        nameArrays.add(properties);
        i++;
    }
    for (int a= 0; a<=i; a++){
        System.out.println("nameArrays.get(a) -"+nameArrays.get(a));
        sb.append(nameArrays.get(a));
    }
    paragraph.add(sb.toString());
    document.add(paragraph);
    document.close();
}
textService.getAnswer() returns an ArrayList<String>.
Could you please advise how to separate the text so that each new sentence starts on a new line?
At the moment all of the text is rendered on a single line.
You forgot the newline character \n and your code seems a bit overcomplicated.
Try this:
StringBuffer sb = new StringBuffer();
for (String property : textService.getAnswer()) {
    sb.append(property);
    sb.append('\n');
}
What about:
nameArrays.add(properties+"\n");
You might be able to fix that by simply appending "\n" to the strings that you are collecting in your list, but I think that very much depends on the PDF library you are using.
You see, "newlines" or "paragraphs" are to a certain degree about formatting. It seems like a conceptual problem to add that formatting information to the data that you are processing.
Meaning: you might want to check whether your library allows you to provide plain strings and then does the formatting for you!
In other words: instead of passing strings with newlines, check whether you can keep using strings without newlines and have the PDF library add line breaks where appropriate.
Side note on code quality: you are using raw types:
ArrayList nameArrays = new ArrayList();
should better be
ArrayList<String> names = new ArrayList<>();
[ I also changed the name - there is no point in putting the type of a collection into the variable name! ]
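To illustrate both points (a typed list, and keeping line breaks out of the data), here is a minimal sketch that joins the answers only at the moment the text is built. The class name AnswerLines and the method joinLines are hypothetical; whether "\n" actually produces a visible line break still depends on the PDF library, as noted above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: the list holds plain answers; separators are added only when rendering.
final class AnswerLines {

    static String joinLines(List<String> answers) {
        // String.join puts the separator between elements, not after the last one.
        return String.join("\n", answers);
    }

    public static void main(String[] args) {
        // Stand-in for textService.getAnswer().
        List<String> answers = Arrays.asList("First answer", "Second answer", "Third answer");
        System.out.println(joinLines(answers));
    }
}
```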
This method saves the values of an ArrayList into a PDF document. In the mFilePath variable you can replace "/" with a folder name, for example "/example/".
For the mFileName variable you can use any name. I use the date and time at which the document is created. Don't use a static name; otherwise each run overwrites the same PDF.
private void savePDF()
{
    com.itextpdf.text.Document mDoc = new com.itextpdf.text.Document();
    // Lowercase yyyy, dd, mm and ss are required here: uppercase YYYY, DD, MM (in the time part) and SS
    // mean week year, day of year, month and milliseconds
    String mFileName = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss", Locale.getDefault()).format(System.currentTimeMillis());
    String mFilePath = Environment.getExternalStorageDirectory() + "/" + mFileName + ".pdf";
    try
    {
        PdfWriter.getInstance(mDoc, new FileOutputStream(mFilePath));
        mDoc.open();
        // One Paragraph per answer, so each answer starts on its own line
        for (String mtext : answers)
        {
            mDoc.add(new Paragraph(mtext));
        }
        mDoc.close();
    }
    catch (Exception e)
    {
        e.printStackTrace(); // do not swallow failures silently
    }
}
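The timestamp-based file name is easy to get wrong because pattern letters are case-sensitive. As a minimal sketch of the same naming idea, the following uses java.time (assumed to be available on the target platform) instead of SimpleDateFormat; the class name TimestampedName and the method forTime are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds a unique-per-second PDF file name from a timestamp.
final class TimestampedName {

    // yyyy = year, MM = month, dd = day of month, HH = hour (0-23), mm = minute, ss = second
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss");

    static String forTime(LocalDateTime time) {
        return STAMP.format(time) + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(forTime(LocalDateTime.now()));
    }
}
```

Two documents created within the same second would still get the same name, so a counter or finer-grained timestamp may be needed if that can happen.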

Reading words in Ms Word and replacing it with new words (with JAVA)

I am writing a program that reads some text or text fields from a Microsoft Office Word document and replaces them with new words using Jacob.
I got help from this link http://tech-junki.blogspot.de/2009/06/java-jacob-edit-ms-word.html but it didn't work. Could you please tell me how I can read some text and replace it with new text?
If you have a better idea, please tell me.
Note:
1- This method didn't give me any error, but it couldn't find the specific words!
2- How can I write an if() to check whether the requested search text (arrayKeyString in this method) exists in the Word document?
Thanks.
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
//class
ActiveXComponent oWord = null;
Dispatch documents = null;
Dispatch document = null;
Dispatch selection = null;
//method
oWord = new ActiveXComponent("Word.Application");
documents = oWord.getProperty("Documents").toDispatch();
document = Dispatch.call(documents, "Open", finalName).toDispatch();
Dispatch selections = oWord.getProperty("Selection").toDispatch();
Dispatch findT = Dispatch.call(selections, "Find").toDispatch();
//hm is a Hashmap
for (int i=0; i<hm.size();i++){
    hm.get(array[i].toString());
    String arrayValString = (arrayVal[i].toString());
    String arrayKeyString = array[i].toString();
    // Here we should write an if() to check for our key word:
    Dispatch.put(findT, "Text", arrayKeyString);
    Dispatch.call(findT, "Execute");
    Dispatch.put(selections, "Text", arrayValString);
}
OK, I have modified your for loop, which had logical errors. As far as I can understand your question, you don't need an if statement if you are trying to replace all the words from your HashMap in the document:
// hm is a HashMap
for (int i=0; i<hm.size();i++){
    // You were getting the replacement value but not storing it in arrayValString
    String arrayValString = hm.get(array[i].toString());
    String arrayKeyString = array[i].toString();
    // No if() is needed to replace every key from the map:
    // if the text is not present in the document, nothing will be replaced.
    Dispatch.put(findT, "Text", arrayKeyString);
    Dispatch.call(findT, "Execute");
    Dispatch.put(selections, "Text", arrayValString);
}
I know it is probably a little too late for my answer, but I'll leave this here for everyone else who finds this page.
This is how I've done it:
private static final Variant MATCH_CASE = new Variant(true);
private static final Variant MATCH_WILDCARDS = new Variant(false);
private static final Variant FORWARD = new Variant(true);
private static final Variant MATCH_WHOLE_WORD = new Variant(false);
private static final Variant MATCH_SOUNDS_LIKE = new Variant(false);
private static final Variant MATCH_ALL_WORD_FORMS = new Variant(false);
private static final Variant FORMAT = new Variant(false);
private static final Variant WRAP = new Variant(1);
private static final Variant REPLACE = new Variant(2);
//...........
Dispatch selection = Dispatch.get(oleComponent,"Selection").toDispatch();
Dispatch oFind = Dispatch.call(selection, "Find").toDispatch();
for (Entry<String, String> entry : replacements.entrySet()) {
    // Declared and reset for each entry so the inner loop runs at least once per key
    boolean replaced = true;
    while (replaced) {
        Variant variant = Dispatch.invoke(oFind,"Execute",Dispatch.Method, new Object[] {entry.getKey(),MATCH_CASE, MATCH_WHOLE_WORD,MATCH_WILDCARDS, MATCH_SOUNDS_LIKE, MATCH_ALL_WORD_FORMS,FORWARD, WRAP, FORMAT, entry.getValue(), new Variant(true), REPLACE }, new int[1]);
        // Execute reports whether a match was found; stop once there are no more
        replaced = variant.getBoolean();
    }
}
This code goes through the whole map and, for each element, replaces ALL of its occurrences in the Word document.
