Parse all PDF pages at once with iText

I am trying to parse a PDF file with iText and extract the text of all pages in one go.
try {
PdfReader reader = new PdfReader("D:\\hl_sv\\L04MF.pdf");
int pages = reader.getNumberOfPages();
String content = "";
for (int i = 0; i <= pages; i++) {
System.out.println("============PAGE NUMBER " + i + "=============" );
content = content + " " + PdfTextExtractor.getTextFromPage(reader, i);
}
System.out.println(content);
}
I am getting this error:
Exception in thread "main" java.lang.NullPointerException
at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:77)
at com.itextpdf.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:74)
at com.itextpdf.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:89)
at com.pdf.PDF.main(PDF.java:18)
Another problem is that the hyphen (-) is extracted as a question mark (?). How can I fix that?
I appreciate any help.
Edit
It works for me like this, but I still cannot solve the hyphen problem.
try {
PdfReader reader = new PdfReader("D:\\hl_sv\\L04MF.pdf");
int pages = reader.getNumberOfPages();
for(int i = 1; i<= pages; i++) {
System.out.println("============PAGE NUMBER " + i + "=============" );
String line = PdfTextExtractor.getTextFromPage(reader,i);
System.out.println(line);
}
}

iText page numbers start at 1, so the loop has to start at 1 as well; your first version starts at 0, which is what triggers the NullPointerException.
public static String extractPdfText() throws IOException {
    PdfReader pdfReader = new PdfReader("/path/to/file/myfile.pdf");
    int pages = pdfReader.getNumberOfPages();
    String pdfText = "";
    for (int ctr = 1; ctr < pages + 1; ctr++) {
        // Page number cannot be 0 or getTextFromPage will throw an NPE
        pdfText += PdfTextExtractor.getTextFromPage(pdfReader, ctr);
    }
    pdfReader.close();
    return pdfText;
}
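
On the hyphen showing up as a question mark: one possible cause (an assumption, since the PDF itself is not shown) is that the extracted character is not the ASCII hyphen but a typographic dash such as U+2013, and the console encoding cannot represent it. The sketch below uses only the JDK; `DashEncodingDemo`, `roundTrip` and the sample string are made up for illustration. It encodes the same text as US-ASCII, where the dash degrades to `?`, and as UTF-8, where it survives.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DashEncodingDemo {

    // Encodes the text in the given charset and decodes it again,
    // mimicking what an output channel with that encoding ends up showing.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Hypothetical extracted text containing an en dash (U+2013)
        String extracted = "pages 1\u20135";

        // The en dash has no US-ASCII mapping, so it is replaced by '?'
        System.out.println(roundTrip(extracted, StandardCharsets.US_ASCII));

        // Printing through a UTF-8 PrintStream avoids the substitution,
        // provided the terminal itself is configured for UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(extracted);
    }
}
```

If the character still comes out wrong with a UTF-8 console, the font encoding inside the PDF is the more likely culprit, and that is not something this sketch addresses.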

Related

Saving scraped data to file

I'm scraping data from multiple web pages using Jsoup. How can I save the scraped data to a file without each page overwriting the data from the previously scraped page?
I've searched Stack Overflow and the Jsoup docs for a solution.
int j = 0;
int i = 0;
String URL = ("https://www.ufc.com/athletes/all?gender=All&search=&page="+j);
Document doc = Jsoup.connect(URL).userAgent("mozilla/70.0.1").get();
Elements temp = doc.select("div.c-listing-athlete__text");
for (Element fighterList:temp) {
i++;
System.out.println(i + " " + fighterList.getElementsByClass("c-listing-athlete__name").first().text());
}
j++;
URL = ("https://www.ufc.com/athletes/all?gender=All&search=&page="+j);
doc = Jsoup.connect(URL).userAgent("mozilla/70.0.1").get();
temp = doc.select("div.c-listing-athlete__text");
for (Element fighterList:temp) {
i++;
System.out.println(i + " " + fighterList.getElementsByClass("c-listing-athlete__name").first().text());
}

If you need to save the data from code, just check this, maybe it can help you:
int i = 0;
int pagesNumber = 10;
String URL = "";
Document doc = null;
Elements temp = null;
try {
    // Create file
    FileWriter fstream = new FileWriter(System.currentTimeMillis() + "out.txt");
    BufferedWriter out = new BufferedWriter(fstream);
    for (i = 0; i < pagesNumber; i++) {
        URL = ("https://www.ufc.com/athletes/all?gender=All&search=&page=" + i);
        doc = Jsoup.connect(URL).userAgent("mozilla/70.0.1").get();
        temp = doc.select("div.c-listing-athlete__text");
        for (Element fighter : temp) {
            // i is the page index here, not a running fighter count
            out.write(i + " " + fighter.getElementsByClass("c-listing-athlete__name").first().text());
            out.newLine(); // one entry per line; without this all names run together
        }
    }
    // Close the output stream
    out.close();
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
}
Hope it helps :)
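
A note on the overwriting part of the question: `FileWriter` truncates the file each time it is opened unless `true` is passed as the second constructor argument. Here is a minimal sketch using only the JDK, with the scraping replaced by hard-coded lists. `AppendDemo`, `appendLines` and the sample names are illustrative and not part of the original code.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class AppendDemo {

    // Appends each entry on its own line. The 'true' flag opens the file in
    // append mode, so earlier content is kept instead of being overwritten.
    public static void appendLines(Path file, List<String> lines) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(file.toFile(), true))) {
            for (String line : lines) {
                out.write(line);
                out.newLine();
            }
        } // try-with-resources closes the writer even if a write fails
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("fighters", ".txt");
        appendLines(file, Arrays.asList("Fighter A", "Fighter B")); // stand-in for page 0
        appendLines(file, Arrays.asList("Fighter C"));              // stand-in for page 1
        System.out.println(Files.readAllLines(file));
    }
}
```

Either approach works: keep one writer open across all pages as in the answer, or reopen the file in append mode for each page as in this sketch.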

pdfbox getcharacterbyarticle() rendering the vector for last page

I am trying to get text details such as coordinates, width and height using the following code (adapted from a solution posted here), but the output contains only the text from the last page.
Code
public static void main( String[] args ) throws IOException {
    PDDocument document = null;
    String fileName = "apache.pdf";
    PDFParser parser = new PDFParser(new FileInputStream(fileName));
    parser.parse();
    StringWriter outString = new StringWriter();
    CustomPDFTextStripper stripper = new CustomPDFTextStripper();
    stripper.writeText(parser.getPDDocument(), outString);
    Vector<List<TextPosition>> vectorlistoftps = stripper.getCharactersByArticle();
    for (int i = 0; i < vectorlistoftps.size(); i++) {
        List<TextPosition> tplist = vectorlistoftps.get(i);
        for (int j = 0; j < tplist.size(); j++) {
            TextPosition text = tplist.get(j);
            System.out.println(" String "
                + "[x: " + text.getXDirAdj() + ", y: "
                + text.getY() + ", height:" + text.getHeightDir()
                + ", space: " + text.getWidthOfSpace() + ", width: "
                + text.getWidthDirAdj() + ", yScale: " + text.getYScale() + "]"
                + text.getCharacter() +" Font "+ text.getFont().getBaseFont() + " PageNUm "+ (i+1));
        }
    }
}
CustomPDFTextStripper class:
class CustomPDFTextStripper extends PDFTextStripper
{
    //Vector<Vector<List<TextPosition>>> data = new Vector<Vector<List<TextPosition>>>();
    public CustomPDFTextStripper() throws IOException {
        super();
    }
    public Vector<List<TextPosition>> getCharactersByArticle(){
        // data.add(charactersByArticle);
        return charactersByArticle;
    }
}
I tried adding the vectors to a list, but writeText() iterates through all the pages and only the last page's details remain in the charactersByArticle vector, so that is what gets returned. How do I get the info for all pages?

Temporary fix:
I changed the main method to set the current page as the end page and read the text info after each pass. Not a good idea, though.
for (int page = 0; page < pageCount; page++)
{
stripper.setStartPage(0);
stripper.setEndPage(page + 1);
stripper.writeText(parser.getPDDocument(), outString);
Vector<List<TextPosition>> vectorlistoftps = stripper.getCharactersByArticle(); // typed, so get(i) below compiles without a cast
PDPage thisPage = stripper.getCurrentPage();
for (int i = 0; i < vectorlistoftps.size(); i++) {
List<TextPosition> tplist = vectorlistoftps.get(i);
}
}

PrintWriter not outputting leading zeros to text file

With this code I'm trying to write a text file in a fixed format so that it loads back without problems. I'm using String.format("%02d", 5) to output 05 instead of just 5. This works in the console, but in the text file the value appears as 5, not 05. Why is this happening?
PrintWriter output = new PrintWriter(scheduleFile);
for (int i = 0; i < getSchedule().length; i++) {
for (int j = 0; j < getSchedule()[i].length; j++) {
output.print(j + ":00-" + (j + 1) + ":00 =");
if (getSchedule()[i][j] != null) { // if not null
output.print(String.format("%02d", (j)) + ":00-" + String.format("%02d", (j+1)) + ":00 "); // print hours
}
output.println();
}
}
Text file looks like this:
0:00-1:00 =
1:00-2:00 =
2:00-3:00 =
3:00-4:00 =
4:00-5:00 =
5:00-6:00 =
6:00-7:00 =
7:00-8:00 =
8:00-9:00 =
9:00-10:00 =
.
.
.
It should look like this:
00:00-01:00 =
01:00-02:00 =
02:00-03:00 =
03:00-04:00 =
04:00-05:00 =
05:00-06:00 =
06:00-07:00 =
07:00-08:00 =
08:00-09:00 =
09:00-10:00 =
.
.
.
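
Going by the code shown, the lines in the file come from the first `output.print(j + ":00-" + ...)` call, which concatenates the raw `int` values. The `String.format("%02d", ...)` call only runs inside the `if` for non-null slots, so it never affects the label. A sketch of the likely fix is to format the label itself. `SlotLabel` and `label` are hypothetical names, and the schedule array is left out.

```java
public class SlotLabel {

    // Builds "HH:00-HH:00 =" with zero-padded hours for the slot starting at 'hour'.
    public static String label(int hour) {
        return String.format("%02d:00-%02d:00 =", hour, hour + 1);
    }

    public static void main(String[] args) {
        for (int hour = 0; hour < 10; hour++) {
            System.out.println(label(hour)); // first line printed: 00:00-01:00 =
        }
    }
}
```

In the original loop this would mean replacing the first `output.print(...)` with something like `output.print(label(j))`.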

Getting java.lang.IllegalArgumentException: The end (7905) must not be before the start (15721) Exception while reading word document

I am parsing Microsoft Word documents. I have imported the Apache POI jar to read the Word document. I want to get the headings present in the document. To pick them out, I check each run's font size against the sizes used for headings.
public void try1(POIFSFileSystem filestream) throws Exception
{
HWPFDocument doc = new HWPFDocument (filestream);
WordExtractor we = new WordExtractor(doc);
Range range = doc.getRange();
String[] paragraphs = we.getParagraphText();
for (int i = 0; i < paragraphs.length; i++)
{
Paragraph pr = range.getParagraph(i);
int k = 0;
if(pr.text().trim().length()!=0)
{
while (true)
{
System.out.println(k);
CharacterRun run = pr.getCharacterRun(k++);
/*System.out.println("Word is "+pr.text());
System.out.println("Color: " + run.getColor());
System.out.println("Font: " + run.getFontName());
System.out.println("Font Size: " + run.getFontSize());*/
System.out.println(pr.text());
System.out.println(run.getEndOffset()+" "+pr.getEndOffset());
if(run.getFontSize()==26||run.getFontSize()==24)
{
System.out.println("Selected One is "+pr.text());
}
if (run.getEndOffset() == pr.getEndOffset())
break;
}
}
}
}
I am getting this exception:
java.lang.IllegalArgumentException: The end (7905) must not be before the start (15721)
at org.apache.poi.hwpf.usermodel.Range.sanityCheckStartEnd(Range.java:247)
at org.apache.poi.hwpf.usermodel.Range.<init>(Range.java:181)
at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:98)
at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:791)
at com.honeywell.corept.srd.ReadDocFileFromJava.try1(ReadDocFileFromJava.java:122)
at com.honeywell.corept.srd.ReadDocFileFromJava.main(ReadDocFileFromJava.java:24)
Line 122 in the Java file is: CharacterRun run = pr.getCharacterRun(k++);

Replacing url in a CSS with CSS Parser and Regex (Java)

I need to replace the URLs in a CSS file. So far I have this code, which displays the rules of a CSS file:
@Override
public void parse(String document) {
    log.info("Parsing CSS: " + document);
    this.document = document;
    InputSource source = new InputSource(new StringReader(this.document));
    try {
        CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);
        CSSRuleList ruleList = stylesheet.getCssRules();
        log.info("Number of rules: " + ruleList.getLength());
        // lets examine the stylesheet contents
        for (int i = 0; i < ruleList.getLength(); i++)
        {
            CSSRule rule = ruleList.item(i);
            if (rule instanceof CSSStyleRule) {
                CSSStyleRule styleRule=(CSSStyleRule)rule;
                log.info("selector: " + styleRule.getSelectorText());
                CSSStyleDeclaration styleDeclaration = styleRule.getStyle();
                //assertEquals(1, styleDeclaration.getLength());
                for (int j = 0; j < styleDeclaration.getLength(); j++) {
                    String property = styleDeclaration.item(j);
                    log.info("property: " + property);
                    log.info("value: " + styleDeclaration.getPropertyCSSValue(property).getCssText());
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
However, I am not sure how to actually replace the URL, since there is not much documentation for CSS Parser.

Here is the modified for loop:
// Only images can be there in CSS.
// The extensions are grouped; without the group the pattern would also match a bare "jpeg", "png" or "gif".
Pattern URL_PATTERN = Pattern.compile("http://.*?\\.(?:jpg|jpeg|png|gif)");
for (int j = 0; j < styleDeclaration.getLength(); j++) {
    String property = styleDeclaration.item(j);
    String value = styleDeclaration.getPropertyCSSValue(property).getCssText();
    Matcher m = URL_PATTERN.matcher(value);
    // A CSS property can have multiple URLs, hence the while loop.
    while (m.find()) {
        String originalUrl = m.group(0);
        // Now you have the original URL here. Change it however you want.
    }
}
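
To actually swap the URLs once they are matched, `Matcher.appendReplacement` and `appendTail` can rebuild the value string. This is a sketch using only `java.util.regex`. The class name, the `rewrite` helper and the example hosts are hypothetical, and writing the new value back into the stylesheet (for example with `styleDeclaration.setProperty`) is assumed rather than shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssUrlRewriter {

    // Matches http image URLs; the extensions are grouped so the alternation
    // applies only to the file suffix, and the URL stops at quotes or ')'.
    private static final Pattern URL_PATTERN =
            Pattern.compile("http://[^\\s\"')]*?\\.(?:jpg|jpeg|png|gif)");

    // Replaces the scheme and host of every matched image URL with newBase.
    public static String rewrite(String cssValue, String newBase) {
        Matcher m = URL_PATTERN.matcher(cssValue);
        StringBuffer result = new StringBuffer();
        while (m.find()) {
            String originalUrl = m.group();
            // Keep only the path part of the original URL
            String path = originalUrl.replaceFirst("^http://[^/]+", "");
            // quoteReplacement guards against '$' or '\' in the new URL
            m.appendReplacement(result, Matcher.quoteReplacement(newBase + path));
        }
        m.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String value = "url(http://old.example.com/img/bg.png) no-repeat";
        System.out.println(rewrite(value, "http://cdn.example.com"));
    }
}
```

Values without a matching URL pass through unchanged, so the helper can be applied to every property in the loop.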
