I'm using PDFBox in Java and have successfully extracted the text of a PDF. Now I want to search for a specific word and retrieve only the number that follows it. Concretely, I want to search for "Tax" and retrieve the tax amount that follows. The two strings seem to be separated by a tab.
My code currently looks like this:
File file = new File("yes.pdf");
try (PDDocument document = PDDocument.load(file)) { // try-with-resources closes the document
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(document);
    System.out.println(text);
    // search for the word "Tax"
    // retrieve the number after the word "Tax"
} catch (IOException e) {
    e.printStackTrace();
}
I have used something similar in my project. I hope it helps you.
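import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// PDFBox 2.x package names (in 1.8.x, PDFTextStripper lives in org.apache.pdfbox.util)
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class ExtractNumber {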
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("yourFile location"));
PDFTextStripper stripper = new PDFTextStripper();
List<String> digitList = new ArrayList<String>();
//Read Text from pdf
String string = stripper.getText(doc);
// a letter immediately followed by one or more digits
Pattern mainPattern = Pattern.compile("[a-zA-Z]\\d+");
//Provide actual text
Matcher mainMatcher = mainPattern.matcher(string);
while (mainMatcher.find()) {
//Get only numbers
Pattern subPattern = Pattern.compile("\\d+");
String subText = mainMatcher.group();
Matcher subMatcher = subPattern.matcher(subText);
subMatcher.find();
digitList.add(subMatcher.group());
}
if (doc != null) {
doc.close();
}
if(digitList != null && digitList.size() > 0 ) {
for(String digit: digitList) {
System.out.println(digit);
}
}
}
}
The regular expression [a-zA-Z]\d+ matches a letter immediately followed by one or more digits in the PDF text.
The \d+ expression then extracts only the digits from each of those matches.
You can also use a different regular expression to match a specific number of digits.
You can get more ideas from this tutorial.
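The best way to do something like that is to use regular expressions. I often use this tool to write my regular expressions. Your regex should probably look something like tax\s([0-9]+), with case-insensitive matching enabled because the word appears as "Tax" in your document. The number is then available as capture group 1. You can take a look at this tutorial on how to use regex in Java.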
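To make the suggested pattern concrete, here is a minimal sketch. It assumes the extracted text contains a line such as "Tax", a tab, and then the amount. The class name TaxFinder, the method findNumberAfter, and the sample text are made up for illustration, and the pattern also accepts a decimal part, which goes slightly beyond the regex above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: finds the first number after a keyword.
final class TaxFinder {

    // Returns the first number that follows `keyword`, separated by whitespace (spaces or tabs).
    static Optional<String> findNumberAfter(String text, String keyword) {
        // Pattern.quote treats the keyword literally; (?i) ignores case so "Tax" and "tax" both match.
        Pattern p = Pattern.compile(
                "(?i)\\b" + Pattern.quote(keyword) + "\\s+([0-9]+(?:[.,][0-9]+)?)");
        Matcher m = p.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample text standing in for the output of PDFTextStripper.getText(document).
        String text = "Subtotal\t100.00\nTax\t25.00\nTotal\t125.00";
        System.out.println(findNumberAfter(text, "Tax").orElse("not found")); // 25.00
    }
}
```

The result is kept as a String on purpose: whether "25,00" or "25.00" should be parsed as a decimal depends on the locale of the PDF, so the parsing step is left to the caller.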
I'm starting to work with Apache Lucene 8.0. I would like to know how to convert my String text variable to lowercase using Lucene. I'm not sure how to do it because I couldn't find any examples. What I want is something like this:
public class DocumentLowercase {
private Analyzer analyzer;
public Analyzer DocAnalysis(Document d) {
analyzer = new StandardAnalyzer();
String text = d.text();
// Here: convert the String text to lowercase
// Maybe using LowerCaseTokenizer? But how?
return analyzer;
}
}
StandardAnalyzer already converts everything to lower case!
Check the docs here: http://lucene.apache.org/core/8_0_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
They say:
Filters StandardTokenizer with LowerCaseFilter and StopFilter, using a
configurable list of stop words.
You can also see in the source code which components a StandardAnalyzer includes:
@Override
protected TokenStreamComponents createComponents(final String fieldName) {
final StandardTokenizer src = new StandardTokenizer();
src.setMaxTokenLength(maxTokenLength);
TokenStream tok = new LowerCaseFilter(src);
tok = new StopFilter(tok, stopwords);
return new TokenStreamComponents(r -> {
src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
src.setReader(r);
}, tok);
}
If you want to customize your analyzer anyway you should look into CustomAnalyzer.
Hi, I have a PDF file and I need to search for a particular string in it. I tried various methods, and I am able to read all the contents of the PDF file but unable to find a particular string.
In this file, I need to search for strings such as Telephone, Garbage, Rent, etc. individually.
Could you please help me?
I have the code below for reading the file.
public class PDFBoxReader {
private PDFParser parser;
private PDFTextStripper pdfStripper;
private PDDocument pdDoc ;
private COSDocument cosDoc ;
private String Text ;
private String filePath;
private File file;
public PDFBoxReader() {
}
public String ToText() throws IOException
{
this.pdfStripper = null;
this.pdDoc = null;
this.cosDoc = null;
file = new File("D:\\report.pdf");
parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdDoc.getNumberOfPages();
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(10);
// reading text from page 1 to 10
// if you want to get text from full pdf file use this code
// pdfStripper.setEndPage(pdDoc.getNumberOfPages());
Text = pdfStripper.getText(pdDoc);
return Text;
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
}
It would be great if someone could help me with code that searches for a particular string. Thanks in advance.
Try text.indexOf("substring"), where text is the String returned from your ToText() method and "substring" is the string you wish to search for. (Side note: the Java convention is camel-case method names, so this would be toText().)
This method returns the index of the first occurrence of the given substring in your long String of text, or -1 if it does not occur. So you could call text.indexOf("Telephone") to find the first occurrence of the word Telephone.
If you want the text directly after that substring, its index is simply text.indexOf("substring") + "substring".length().
You can also find the next occurrence (and the one after that) with the two-argument variant of this method: text.indexOf("substring", indexOfLastOccurrence + "substring".length()).
Example:
String myPDF = ToText();
int found = myPDF.indexOf("Rent"); // index of the 1st occurrence of "Rent", or -1 if there is none
int rentIndex = found + "Rent".length();
String rent = myPDF.substring(rentIndex); // everything after the 1st occurrence of "Rent"
rent = rent.substring(0, Math.min(10, rent.length())); // keep only the first few characters; 10 is an arbitrary choice, adjust as needed
// process rent, e.g. Integer.parseInt(rent.trim()) or something
found = myPDF.indexOf("Rent", rentIndex); // next occurrence of "Rent", or -1 if there is none
rentIndex = found + "Rent".length();
rent = myPDF.substring(rentIndex); // everything after the next occurrence of "Rent"
// Repeat to find the occurrences after that. Stop when indexOf returns -1; check found before adding the length, because rentIndex itself never becomes negative.
Both methods can be found in the Java API:
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#indexOf(java.lang.String)
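The repeated indexOf calls described above can be wrapped in a loop. The following is only a sketch under a few assumptions: the value you want is on the same line as the keyword, and the class name KeywordSearch, the method textAfterEach, and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the indexOf loop; not part of PDFBox.
final class KeywordSearch {

    // Returns, for every occurrence of `keyword`, the trimmed rest of the line that follows it.
    static List<String> textAfterEach(String text, String keyword) {
        List<String> results = new ArrayList<>();
        if (keyword.isEmpty()) {
            return results; // an empty keyword would match everywhere and never advance
        }
        int from = 0;
        while (true) {
            int found = text.indexOf(keyword, from);
            if (found < 0) {
                break; // indexOf returns -1 when there are no more occurrences
            }
            int start = found + keyword.length();
            int lineEnd = text.indexOf('\n', start);
            if (lineEnd < 0) {
                lineEnd = text.length(); // the keyword is on the last line
            }
            results.add(text.substring(start, lineEnd).trim());
            from = start; // continue searching after this occurrence
        }
        return results;
    }

    public static void main(String[] args) {
        // Sample text standing in for the String returned by ToText().
        String text = "Rent 1200\nTelephone 45\nRent 1250";
        System.out.println(textAfterEach(text, "Rent"));    // [1200, 1250]
        System.out.println(textAfterEach(text, "Garbage")); // []
    }
}
```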
I have a complete file path and need to extract just the file name and extension, so my output would be fileName.csv.
For example, the complete path is:
/Dir1/Dir2/Dir3/Dir4/Dir5/Dir6/fileName_20150108_002_20150109013841.csv
The output of my regex should be fileName.csv.
The extension and the number of directory levels are not fixed.
As part of my requirement, I need a single regex that extracts fileName.csv, not fileName_20150108_002_20150109013841.csv. How can I do that in a single regular expression?
Without using regex, this can be solved as follows:
public static String getFileName(String args) {
args = args.substring(args.lastIndexOf('/')+1);
return args.substring(0,args.indexOf('_')) + args.substring(args.indexOf('.'));
}
The regex below might work for you:
[^\\/:*?"<>|\r\n]+$
This regex has been tested on these two examples:
\var\www\www.example.com\index.jsp
\index.jsp
Or rather, a better approach is to use File.getName():
String filename = new File("Payload/brownie.app/Info.plist").getName();
System.out.println(filename);
Another way is:
int index = path.lastIndexOf(File.separatorChar);
String filename = path.substring(index+1);
Finally, after getting the full file name, use the snippet below:
String str = filename;// in your case filename will be fileName_20150108_002_20150109013841.csv
str = str.substring(0,str.indexOf('_'))+str.substring(str.lastIndexOf('.'));
System.out.println("filename is ::"+str); // output will be fileName.csv
In the code below, group one will be /fileName_timestamp.extension (the last path segment). I then replace slashes, digits, and underscores with the empty string. This may look ugly, but it will still serve your purpose. If the file name itself contains digits, we need a different approach.
public static void main(String[] args) {
String toBeSplitted = "/Dir1/Dir2/Dir3/Dir4/Dir5/Dir6/fileName_20150108_002_20150109013841.csv";
Pattern r = Pattern.compile("(/[a-zA-Z0-9_.-]+)+/?");
Matcher m = r.matcher(toBeSplitted);
if(m.matches()){
String s = m.group(1).replaceAll("(/|[0-9]|_)", "");
System.out.println(s);
}
}
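Since the question asks for a single regular expression, here is one possible sketch using replaceAll with two capture groups. It assumes '/' as the separator, at least one directory level, and that everything between the first underscore and the extension should be dropped. The class name FileNameRegex and the method shortName are hypothetical.

```java
// Hypothetical helper: a single regular expression that keeps the part of the
// file name before the first underscore plus the extension.
final class FileNameRegex {

    static String shortName(String path) {
        // ^.*/          everything up to and including the last slash
        // ([^/_]+)      group 1: the name up to the first underscore
        // _[^/]*        the underscore and the timestamp part
        // (\.[^./]+)$   group 2: the extension
        return path.replaceAll("^.*/([^/_]+)_[^/]*(\\.[^./]+)$", "$1$2");
    }

    public static void main(String[] args) {
        String path = "/Dir1/Dir2/Dir3/Dir4/Dir5/Dir6/fileName_20150108_002_20150109013841.csv";
        System.out.println(shortName(path)); // fileName.csv
    }
}
```

If the path does not match the pattern (for example, the file name has no underscore), replaceAll returns the input unchanged, so callers may want to check for that case.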
I want to read the number of characters without spaces in a Word document using Apache POI.
I can get the number of characters with spaces using the SummaryInformation.getCharCount() method as in the following code:
public void countCharacters() throws FileNotFoundException, IOException {
File wordFile = new File(BASE_PATH, "test.doc");
POIFSFileSystem p = new POIFSFileSystem(new FileInputStream(wordFile));
HWPFDocument doc = new HWPFDocument(p);
SummaryInformation props = doc.getSummaryInformation();
int numOfCharsWithSpaces = props.getCharCount();
System.out.println(numOfCharsWithSpaces);
}
However there seems to be no method for returning the number of characters without spaces.
How do I find this value?
If you want to base this on the metadata of the document, all you will get is estimates (according to the Microsoft specs). There are essentially two values which you can play around with:
GKPIDSI_CHARCOUNT (which is what you already accessed in your own code sample)
GKPIDDSI_CCHWITHSPACES
Don't ask me about the exact differences of those two values, though. I haven't designed this stuff...
Below is a code sample to illustrate the access to them (GKPIDDSI_CCHWITHSPACES is a little awkward):
HWPFDocument document = [...];
SummaryInformation summaryInformation = document.getSummaryInformation();
System.out.println("GKPIDSI_CHARCOUNT: " + summaryInformation.getCharCount());
DocumentSummaryInformation documentSummaryInformation = document.getDocumentSummaryInformation();
Integer count = null;
for (Property property : documentSummaryInformation.getProperties()) {
if (property.getID() == 0x11) {
count = (Integer) property.getValue();
break;
}
}
System.out.println("GKPIDDSI_CCHWITHSPACES: " + count);
The moment at which Word's internal algorithm that updates those values kicks in is rather unpredictable to me. So what you see in Word's own statistics may not necessarily be the same as when running the above code.
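If an exact value matters more than matching Word's own statistics, another option is to count the characters yourself. The sketch below only shows the counting step and assumes the document text has already been extracted into a String (for example with POI's WordExtractor, which is not shown here). The class name CharCounter and the method countWithoutSpaces are hypothetical, and the result may differ from what Word reports, since Word applies its own counting rules.

```java
// Hypothetical helper: counts the characters of already extracted text that are not whitespace.
final class CharCounter {

    static long countWithoutSpaces(String text) {
        return text.codePoints()
                // isWhitespace covers spaces, tabs and line breaks;
                // isSpaceChar additionally covers non-breaking spaces
                .filter(cp -> !Character.isWhitespace(cp) && !Character.isSpaceChar(cp))
                .count();
    }

    public static void main(String[] args) {
        // Sample text standing in for the text extracted from the Word document.
        String text = "Hello world,\nthis is\ta test.";
        System.out.println(countWithoutSpaces(text)); // 23
    }
}
```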
I am trying to use the boilerpipe Java library to extract news articles from a set of websites.
It works great for texts in English, but for text with special characters, for example words with accent marks (história), these special characters are not extracted correctly. I think it is an encoding problem.
In the boilerpipe FAQ, it says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in this paper.
My question is: are there any parameters in boilerpipe where I can specify the encoding? Is there any way to work around this and get the text correctly?
How I'm using the library:
(first attempt based on the URL):
URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);
(second attempt, based on the HTML source code):
String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);
You don't have to modify inner Boilerpipe classes.
Just pass InputSource object to the ArticleExtractor.INSTANCE.getText() method and force encoding on that object. For example:
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
Regards!
Well, from what I see, when you use it like that, the library will automatically choose which encoding to use. From the HTMLFetcher source:
public static HTMLDocument fetch(final URL url) throws IOException {
final URLConnection conn = url.openConnection();
final String ct = conn.getContentType();
Charset cs = Charset.forName("Cp1252");
if (ct != null) {
Matcher m = PAT_CHARSET.matcher(ct);
if(m.find()) {
final String charset = m.group(1);
try {
cs = Charset.forName(charset);
} catch (UnsupportedCharsetException e) {
// keep default
}
}
}
Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
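To see what this kind of charset detection does in isolation, here is a small stand-alone sketch of the same idea: read the charset from a Content-Type header value and fall back to a default when it is missing or unknown. The pattern, the class name CharsetSniffer, and the method fromContentType are assumptions for illustration; Boilerpipe's own PAT_CHARSET may be defined differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Boilerpipe code.
final class CharsetSniffer {

    // Assumed pattern: "charset=" followed by an optionally quoted name.
    private static final Pattern CHARSET = Pattern.compile("(?i)charset=\"?([^\\s;\"]+)");

    // Returns the charset named in the header value, or `fallback` if none is usable.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback; // keep the default, as the excerpt above does
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(fromContentType("text/html; charset=UTF-8", fallback)); // UTF-8
        System.out.println(fromContentType("text/html", fallback));                // ISO-8859-1
    }
}
```

When a server sends no charset in the header, the fallback is used even if the page itself is UTF-8, which matches the symptom of broken accented characters.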
OK, I got a solution.
As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax.
What I did was convert all the fetched text to UTF-8.
At the end of the fetch function, I had to add two lines and change the last one:
final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs).getBytes("UTF-8"); // new line (conversion); passing the Charset itself avoids relying on displayName() being a valid charset name
cs = Charset.forName("UTF-8"); // set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
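The conversion step on its own looks like this. It is a stand-alone sketch with a hypothetical class name (Transcoder) and sample data, independent of Boilerpipe: the bytes are decoded with the charset they were written in and then re-encoded as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Boilerpipe code.
final class Transcoder {

    // Decodes `data` using `source` and re-encodes the resulting text as UTF-8.
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "história" encoded as ISO-8859-1: the accented letter takes one byte.
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        // In UTF-8 the accented letter takes two bytes.
        System.out.println(latin1.length + " -> " + utf8.length); // 8 -> 9
    }
}
```

This only works if the source charset is detected correctly in the first place; decoding UTF-8 bytes as Cp1252 and re-encoding them would preserve the garbled characters.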
Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English, such as measuring the number of words in average phrases. In any language that is more or less verbose than English (i.e. every other language), these algorithms will be less accurate.
Additionally, the library uses some English phrases to try and find the end of the article (comments, post a comment, have your say, etc) which will clearly not work in other languages.
This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.
Java:
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class Boilerpipe {
public static void main(String[] args) {
try{
URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
System.out.println(text);
}catch(Exception e){
e.printStackTrace();
}
}
}
Eclipse:
Run > Run Configurations > Common Tab. Set Encoding to Other(UTF-8), then click Run.
I had the same problem; cnr's solution works great. I just changed the encoding from UTF-8 to ISO-8859-1. Thanks!
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);