Not able to read individual page using PDFTextStripper with multiple threads - java

I am able to create 10 threads, but the problem appears when I try to read individual pages with those threads in parallel. I have also tried putting the private static PDFTextStripper instance into a synchronized block. I still get the exception below:
COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
I am trying to print the first word from each of the first 10 pages, but it's not working. This is my first experiment with multithreading and PDF reading. Any help would be much appreciated.
public class ReadPDFFile extends Thread implements FileInstance {
private static String fileLocation;
private static String fileNameIV;
private static String userInput;
private static int userConfidence;
private static int totalPages;
private static ConcurrentHashMap<Integer, List<String>> map = null;
private Iterator<PDDocument> iteratorForThisDoc;
private PDFTextStripperByArea text;
private static PDFTextStripper pdfStrip = null;
private static PDFParser pdParser = null;
private Splitter splitter;
private static int counter=0;
private StringWriter writer;
private static ReentrantLock counterLock = new ReentrantLock(true);
private static PDDocument doc;
private static PDDocument doc2;
private static boolean flag = false;
List<PDDocument> listOfPages;
ReadPDFFile(String filePath, String fileName, String userSearch, int confidence) throws FileNotFoundException{
fileLocation= filePath;
fileNameIV = fileName;
userInput= userSearch;
userConfidence = confidence;
System.out.println("object created");
}
@Override
public void createFileInstance(String filePath, String fileName) {
List<String> list = new ArrayList<String>();
map = new ConcurrentHashMap<Integer, List<String>>();
try(PDDocument document = PDDocument.load(new File(filePath))){
doc = document;
pdfStrip = new PDFTextStripper();
this.splitter = new Splitter();
text = new PDFTextStripperByArea();
document.getClass();
if(!document.isEncrypted()) {
totalPages = document.getNumberOfPages();
System.out.println("Number of pages in this book "+totalPages);
listOfPages = splitter.split(document);
iteratorForThisDoc = listOfPages.iterator();
}
this.createThreads();
/*
* for(int i=0;i<1759;i++) { readThisPage(i, pdfStrip); } flag= true;
*/
}
catch(IOException ie) {
ie.printStackTrace();
}
}
public void createThreads() {
counter=1;
for(int i=0;i<=9;i++) {
ReadPDFFile pdf = new ReadPDFFile();
pdf.setName("Reader"+i);
pdf.start();
}
}
public void run() {
try {
while(counter < 10){
int pgNum= pageCounterReentrant();
readThisPage(pgNum, pdfStrip);
}
doc.close();
}catch(Exception e) {
}
flag= true;
}
public static int getCounter() {
counter= counter+1;
return counter;
}
public static int pageCounterReentrant() {
counterLock.lock();
try {
counter = getCounter();
} finally {
counterLock.unlock();
}
return counter;
}
public static void readThisPage(int pageNum, PDFTextStripper ts) {
counter++;
System.out.println(Thread.currentThread().getName()+" reading page: "+pageNum+", counter: "+counter);
synchronized(ts){
String currentpageContent= new String();
try {
ts.setStartPage(pageNum);
ts.setEndPage(pageNum);
System.out.println("-->"+ts.getPageEnd());
currentpageContent = ts.getText(doc);
currentpageContent = currentpageContent.substring(0, 10);
System.out.println("\n\n "+currentpageContent);
}
/*
* further operations on currentpageContent here
*/
catch(IOException io) {
io.printStackTrace();
}finally {
}
}
}
public static void printFinalResult(ConcurrentHashMap<Integer, List<String>> map) {
/*
* simply display content of ConcurrentHashMap
*/
}
public static void main(String[] args) throws FileNotFoundException {
Scanner sc = new Scanner(System.in);
System.out.println("Search Word");
userInput = sc.nextLine();
System.out.println("Confidence");
userConfidence = sc.nextInt();
ReadPDFFile pef = new ReadPDFFile("file path", "file name",userInput, userConfidence);
pef.createFileInstance("file path ","file name");
if(flag==true)
printFinalResult(map);
}
}
If I read each page sequentially in a for loop using one thread, it prints the content, but it does not work with multiple threads. You can see that code commented out in createFileInstance(), after this.createThreads();. I want to get the string content of each PDF page individually, using threads, and then perform operations on it. I already have the logic to collect each word token into a List, but I need to solve this problem before moving ahead.

Your code looks like this:
try(PDDocument document = PDDocument.load(new File(filePath))){
doc = document;
....
this.createThreads();
} // document gets closed here
...
//threads that do text extraction still running here (and using a closed object)
These threads use doc to extract the text (ts.getText(doc)). However, by that time the PDDocument object has already been closed by the try-with-resources statement, and its streams are closed as well. Hence the error message "Perhaps its enclosing PDDocument has been closed?".
You should start the threads before closing the document and wait for all of them to finish before closing it.
I'd advise against using multithreading on one PDDocument; see PDFBOX-4559. You could create several PDDocuments and extract from those, or not do it at all. Text extraction is pretty fast in PDFBox (compared to rendering).
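The lifetime rule above (start the workers inside the try block and wait for them before the resource is closed) can be sketched without PDFBox. In the sketch below, FakeDocument, WaitBeforeClose and readAllPages are hypothetical names invented for illustration; FakeDocument is only a stand-in that fails when read after close(), like a closed PDDocument. It is stateless, so sharing it between threads is safe here; that is not an assumption you can make for a real PDDocument, where one document per thread is the safer route.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a document: reading after close() fails. */
final class FakeDocument implements AutoCloseable {
    private volatile boolean closed = false;

    String readPage(int page) {
        if (closed) {
            throw new IllegalStateException("document has been closed");
        }
        return "text of page " + page;
    }

    @Override
    public void close() {
        closed = true;
    }
}

final class WaitBeforeClose {
    /** Reads pages 1..pageCount on a thread pool and returns them in page order. */
    static List<String> readAllPages(int pageCount, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FakeDocument doc = new FakeDocument()) {
            List<Callable<String>> tasks = new ArrayList<>();
            for (int page = 1; page <= pageCount; page++) {
                final int p = page;
                tasks.add(() -> doc.readPage(p));
            }
            List<String> pages = new ArrayList<>();
            // invokeAll blocks until every task has completed, so the
            // try-with-resources block cannot close doc while a worker uses it.
            for (Future<String> result : pool.invokeAll(tasks)) {
                pages.add(result.get());
            }
            return pages;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readAllPages(10, 4));
    }
}
```

If the tasks were submitted and the try block left without waiting, the workers would hit the IllegalStateException, which mirrors the "COSStream has been closed" error in the question.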

Related

NullPointer Exception while writing to file

I've been trying to write a method that writes the fields of Item, Client and Order objects to a file, but when I run the program it throws a NullPointerException and I have no idea what causes it. As the argument, I pass the path of a file, for example: "C:\Java\info.txt"
Any help with this issue would be appreciated.
public class IO {
ItemClass item;
ClientClass client;
OrderClass order;
private static HashSet<ItemClass> Items;
private static HashSet<ClientClass> Orders;
private static HashSet<OrderClass> Clients;
public void writeToFile(String directory) throws IOException{
File file = new File(directory);
if(!file.exists()){
file.createNewFile();
}
BufferedWriter bufferedW = new BufferedWriter(new FileWriter(directory));
Orders=getOrders();
Iterator <OrderClass>iteratorOrders = Orders.iterator();
while(iteratorOrders.hasNext()){
order = iteratorOrders.next();
bufferedW.write(order.getIdNumber());
bufferedW.write(order.getPersonalID());
bufferedW.write(order.getAddress());
bufferedW.write(order.getCountry());
bufferedW.write(order.getStatus());
}
Clients=getClients();
Iterator <ClientClass>iteratorClients = Clients.iterator();
while(iteratorClients.hasNext()){
client = iteratorClients.next();
bufferedW.write(client.getPersonalID());
bufferedW.write(client.getName());
bufferedW.write(client.getSurname());
bufferedW.write(client.getBirthDate());
boolean active = client.getIsActive();
boolean activeOrder = client.getHasActiveOrder();
bufferedW.write(String.valueOf(active));
bufferedW.write(String.valueOf(activeOrder));
}
Items=getItems();
Iterator <ItemClass>iteratorItems = Items.iterator();
while(iteratorItems.hasNext()){
item = iteratorItems.next();
bufferedW.write(item.getItemID());
int amount = item.getAmount();
double price = item.getPrice();
bufferedW.write(String.valueOf(amount));
bufferedW.write(String.valueOf(price));
bufferedW.write(item.getName());
bufferedW.write(item.getType());
bufferedW.write(item.getMadeIn());
}
bufferedW.close();
}
}

Get ID lexemes in lexer class ANTLR3 that implemented to a jTable

I am building a Java code clone detector in Swing that uses ANTLR. This is the screenshot:
https://www.dropbox.com/s/wnumgsjmpps33v5/SemogaYaAllah.png
As you can see in the screenshot, there is a main file that is compared with other files. I do this by comparing the tokens of the main file with those of each other file. The problem is that I fail to get the ID lexemes (tokens) from my lexer class.
This is my Antlr3JavaLexer:
public class Antlr3JavaLexer extends Lexer {
public static final int PACKAGE=84;
public static final int EXPONENT=173;
public static final int STAR=49;
public static final int WHILE=103;
public static final int MOD=32;
public static final int MOD_ASSIGN=33;
public static final int CASE=58;
public static final int CHAR=60;
I've created a JavaParser class like this to use that lexer:
public final class JavaParser extends AParser { //Parser is my Abstract Class
JavaParser() {
}
@Override
protected boolean parseFile(JCCDFile f, final ASTManager treeContainer)throws ParseException, IOException {
BufferedReader in = new BufferedReader(new FileReader(f.getFile()));
String filePath = f.getNama(); // getName of file
final Antlr3JavaLexer lexer = new Antlr3JavaLexer();
lexer.preserveWhitespacesAndComments = false;
try {
lexer.setCharStream(new ANTLRReaderStream(in));
} catch (IOException e) {
e.printStackTrace();
return false;
}
// This is the problem.
// When I activate this piece of code, I get output like this:
// https://www.dropbox.com/s/80uyva56mk1r5xy/Bismillah2.png
/*
StringBuilder sbu = new StringBuilder();
while (true) {
org.antlr.runtime.Token token = lexer.nextToken();
if (token.getType() == Antlr3JavaLexer.EOF) {
break;
}
sbu.append(token.getType());
System.out.println(token.getType() + ": :" + token.getText());
}*/
final CommonTokenStream tokens = new CommonTokenStream();
tokens.setTokenSource(lexer);
tokens.LT(10); // force load
// Create the parser
Antlr3JavaParser parser = new Antlr3JavaParser(tokens);
StringBuffer sb = new StringBuffer();
sb.append(tokens.toString());
DefaultTableModel model = (DefaultTableModel) Main_Menu.jTable2.getModel();
List<final_tugas_akhir.Report2> theListData = new ArrayList<Report2>();
final_tugas_akhir.Report2 theResult = new final_tugas_akhir.Report2();
theResult.setFile(filePath);
theResult.setId(sb.toString());
theResult.setNum(sbu.toString());
theListData.add(theResult);
for (Report2 report : theListData) {
System.out.println(report.getFile());
System.out.println(report.getId());
model.addRow(new Object[]{
report.getFile(),
report.getId(),
report.getNum(),
});
}
// in CompilationUnit
CommonTree tree;
try {
tree = (CommonTree) parser.compilationUnit().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
} catch (RecognitionException e) {
e.printStackTrace();
return false;
}
walkThroughChildren(tree, treeContainer, parser.getTokenStream()); //this is my method to check the similiar tokens
in.close();
this.posisiFix(treeContainer); //fix position
return true;
}
Once again, this is the link to the error output of my Java program: https://www.dropbox.com/s/80uyva56mk1r5xy/Bismillah2.png.
The tokens always give me a null value.

A Producer-Consumer implemented using java threads writes only half the data to file

Hello, I have a problem where I have to read a huge CSV file, take the first field from each line, and store only the unique values in a file. I have written a program using threads that implements the producer-consumer pattern.
The class CSVLineStripper does what the name suggests: it takes a line from the CSV, extracts the first field, and adds it to a queue. CSVLineProcessor then takes each field from the queue, stores it in an ArrayList, and checks whether it is unique, so that only unique values are kept. The ArrayList is used only for reference; every unique field is written to a file.
What happens is that all fields are stripped correctly. When I run about 3000 lines, everything is correct. When I run the program on all lines, which is around 700,000+ lines, I get incomplete records: about 1000 unique values are missing. Every field is enclosed in double quotes. What is weird is that the last field in the generated file is an incomplete word and its closing double quote is missing. Why is this happening?
import java.util.*;
import java.io.*;
class CSVData
{
Queue <String> refererHosts = new LinkedList <String> ();
Queue <String> uniqueReferers = new LinkedList <String> (); // final writable queue of unique referers
private int finished = 0;
private int safety = 100;
private String line = "";
public CSVData(){}
public synchronized String getCSVLine() throws InterruptedException{
int i = 0;
while(refererHosts.isEmpty()){
if(i < safety){
wait(10);
}else{
return null;
}
i++;
}
finished = 0;
line = refererHosts.poll();
return line;
}
public synchronized void putCSVLine(String CSVLine){
if(finished == 0){
refererHosts.add(CSVLine);
this.notifyAll();
}
}
}
class CSVLineStripper implements Runnable //Producer
{
private CSVData cd;
private BufferedReader csv;
public CSVLineStripper(CSVData cd, BufferedReader csv){ // CONSTRUCTOR
this.cd = cd;
this.csv = csv;
}
public void run() {
System.out.println("Producer running");
String line = "";
String referer = "";
String [] CSVLineFields;
int limit = 700000;
int lineCount = 1;
try {
while((line = csv.readLine()) != null){
CSVLineFields = line.split(",");
referer = CSVLineFields[0];
cd.putCSVLine(referer);
lineCount++;
if(lineCount >= limit){
break;
}
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("<<<<<< PRODUCER FINISHED >>>>>>>");
}
private String printString(String [] str){
String string = "";
for(String s: str){
string = string + " "+s;
}
return string;
}
}
class CSVLineProcessor implements Runnable
{
private CSVData cd;
private FileWriter fw = null;
private BufferedWriter bw = null;
public CSVLineProcessor(CSVData cd, BufferedReader bufferedReader){ // CONSTRUCTOR
this.cd = cd;
try {
this.fw = new FileWriter("unique_referer_dump.txt");
} catch (IOException e) {
e.printStackTrace();
}
this.bw = new BufferedWriter(fw);
}
public void run() {
System.out.println("Consumer Started");
String CSVLine = "";
int safety = 10000;
ArrayList <String> list = new ArrayList <String> ();
while(CSVLine != null || safety <= 10000){
try {
CSVLine = cd.getCSVLine();
if(!list.contains(CSVLine)){
list.add(CSVLine);
this.CSVDataWriter(CSVLine);
}
} catch (Exception e) {
e.printStackTrace();
}
if(CSVLine == null){
break;
}else{
safety++;
}
}
System.out.println("<<<<<< CONSUMER FINISHED >>>>>>>");
System.out.println("Unique referers found in 30000 records "+list.size());
}
private void CSVDataWriter(String referer){
try {
bw.write(referer+"\n");
} catch (Exception e) {
e.printStackTrace();
}
}
}
public class RefererCheck2
{
public static void main(String [] args) throws InterruptedException
{
String pathToCSV = "/home/shantanu/DEV_DOCS/Contextual_Work/excite_domain_kw_site_wise_click_rev2.csv";
CSVResourceHandler csvResHandler = new CSVResourceHandler(pathToCSV);
CSVData cd = new CSVData();
CSVLineProcessor consumer = new CSVLineProcessor(cd, csvResHandler.getCSVFileHandler());
CSVLineStripper producer = new CSVLineStripper(cd, csvResHandler.getCSVFileHandler());
Thread consumerThread = new Thread(consumer);
Thread producerThread = new Thread(producer);
producerThread.start();
consumerThread.start();
}
}
This is what a sample input looks like:
"xyz.abc.com","4432"."clothing and gifts","true"
"pqr.stu.com","9537"."science and culture","false"
"0.stu.com","542331"."education, studies","false"
"m.dash.com","677665"."technology, gadgets","false"
Producer stores in queue:
"xyz.abc.com"
"pqr.stu.com"
"0.stu.com"
"m.dash.com"
The consumer stores the unique values in the file, but on opening the file one sees:
"xyz.abc.com"
"pqr.stu.com"
"0.st
A couple of things: you are breaking after 700k lines, not 7M. Also, you are not flushing your BufferedWriter, so the last part of the output can be incomplete; add a flush at the end and close all your resources. A debugger is a good idea :)
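A minimal sketch of the flush-and-close advice, assuming the output is written from a single place: try-with-resources closes the BufferedWriter when the block ends, and close() flushes whatever is still sitting in the buffer. The class name FlushOnClose and the method writeLines are hypothetical, introduced only for this illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class FlushOnClose {
    /** Writes each entry on its own line; the writer is closed when the try block ends. */
    static void writeLines(Path target, List<String> lines) throws IOException {
        try (BufferedWriter bw = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                bw.write(line);
                bw.newLine();
            }
        } // close() flushes the remaining buffered characters to the file
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("unique_referer_dump", ".txt");
        writeLines(out, Arrays.asList("\"xyz.abc.com\"", "\"pqr.stu.com\"", "\"0.stu.com\""));
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```

In the question's code the writer is a field of the consumer, so the equivalent change would be to call bw.close() (or at least bw.flush()) once the consumer loop has finished.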

collecting text within <p> from html pages

I have a blog dataset containing a huge number of blog pages, with blog posts, comments and all the other blog features.
I need to extract only the blog posts from this collection and store them in a .txt file.
I need to modify this program so that it collects only the blog post text, which starts with <p> and ends with </p>, and skips all other tags.
Currently I use HTMLParser to do the job, here is what I have so far:
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;
public class HTMLParserTest {
public static void main(String... args) {
Parser parser = new Parser();
HasAttributeFilter filter = new HasAttributeFilter("P");
try {
parser.setResource("d://Blogs/asample.txt");
NodeList list = parser.parse(filter);
Node node = list.elementAt(0);
if (node instanceof MetaTag) {
MetaTag meta = (MetaTag) node;
String description = meta.getAttribute("content");
System.out.println(description);
}
} catch (ParserException e) {
e.printStackTrace();
}
}
}
thanks in advance
Provided the HTML is well formed, the following method should do what you need:
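import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;
// Collects the text found inside <p>...</p> elements of the given HTML file.
private static String extractText(File file) throws IOException {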
final ArrayList<String> list = new ArrayList<String>();
FileReader reader = new FileReader(file);
ParserDelegator parserDelegator = new ParserDelegator();
ParserCallback parserCallback = new ParserCallback() {
private int append = 0;
public void handleText(final char[] data, final int pos) {
if(append > 0) {
list.add(new String(data));
}
}
public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
if (Tag.P.equals(tag)) {
append++;
}
}
public void handleEndTag(Tag tag, final int pos) {
if (Tag.P.equals(tag)) {
append--;
}
}
public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
public void handleComment(final char[] data, final int pos) { }
public void handleError(final java.lang.String errMsg, final int pos) { }
};
parserDelegator.parse(reader, parserCallback, false);
reader.close();
String text = "";
for(String s : list) {
text += " " + s;
}
return text;
}
EDIT: Changed to handle nested P tags.

How can I display a PPT file in a Java Applet?

I want to open and display an existing Microsoft PowerPoint presentation in a Java Applet. How can I do that?
Tonic Systems was selling a Java PPT renderer until they were bought by Google. I know of no other solution.
You could implement this yourself, of course, but that's going to be a lot of work. There is rudimentary support for reading and writing PPT files in the Apache POI project, but you will have to do all the rendering yourself.
The presentation can be displayed by rendering each slide as a JPG image.
You can use the Jacob Java library for the conversion. This library uses a COM bridge and gets Microsoft Office PowerPoint to do the conversion (through the save-as command). I have PowerPoint 2007:
package jacobSample;
import com.jacob.activeX.*;
import com.jacob.com.*;
public class Ppt {
ActiveXComponent pptapp = null; //PowerPoint.Application ActiveXControl Object
Object ppto = null; //PowerPoint.Application COM Automation Object
Object ppts = null; //Presentations Set
// Office.MsoTriState
public static final int msoTrue = -1;
public static final int msoFalse = 0;
// PpSaveAsFileType
public static final int ppSaveAsJPG = 17 ;
//other formats..
public static final int ppSaveAsHTML = 12;
public static final int ppSaveAsHTMLv3 =13;
public static final int ppSaveAsHTMLDual= 14;
public static final int ppSaveAsMetaFile =15;
public static final int ppSaveAsGIF =16;
public static final int ppSaveAsPNG =18;
public static final int ppSaveAsBMP =19;
public static final int ppSaveAsWebArchive =20;
public static final int ppSaveAsTIF= 21;
public static final int ppSaveAsPresForReview= 22;
public static final int ppSaveAsEMF= 23;
public Ppt(){
try{
pptapp = new ActiveXComponent("PowerPoint.Application");
ppto = pptapp.getObject();
ppts = Dispatch.get((Dispatch)ppto, "Presentations").toDispatch();
}catch(Exception e){
e.printStackTrace();
}
}
public Dispatch getPresentation(String fileName){
Dispatch pres = null; //Presentation Object
try{
pres = Dispatch.call((Dispatch)ppts, "Open", fileName,
new Variant(Ppt.msoTrue), new Variant(Ppt.msoTrue),
new Variant(Ppt.msoFalse)).toDispatch();
}catch(Exception e){
e.printStackTrace();
}
return pres;
}
public void saveAs(Dispatch presentation, String saveTo, int ppSaveAsFileType){
try{
Object slides = Dispatch.get(presentation, "Slides").toDispatch();
Dispatch.call(presentation, "SaveAs", saveTo, new Variant(ppSaveAsFileType));
}catch (Exception e) {
e.printStackTrace();
}
}
public void closePresentation(Dispatch presentation){
if(presentation != null){
Dispatch.call(presentation, "Close");
}
}
public void quit(){
if(pptapp != null){
ComThread.Release();
//pptapp.release();
try{
pptapp.invoke("Quit", new Variant[]{});
}catch(Exception e){
System.out.println("error");
}
}
}
public static void main(String[] args){
//System.loadLibrary("jacob-1.15-M4-x86.dll");
//System.loadLibrary("jacob-1.15-M4-x64.dll");
Ppt a = new Ppt();
System.out.println("start");
Dispatch pres = a.getPresentation("C:\\j.pptx");// pptx file path
a.saveAs(pres, "C:\\im", Ppt.ppSaveAsJPG); // jpg destination folder
a.closePresentation(pres);
a.quit();
System.out.println("end");
}
}
I am not sure whether my idea of simulating PPT rendering would work:
Let the server side read the PPT file and generate JPG files for display.
The browser side then uses Ajax to request any specific page of the PPT.
