Saving scraped data to file


I'm scraping data from multiple web pages using Jsoup. How can I save the scraped data to a file without overwriting the data from the previously scraped page?
I've searched Stack Overflow and the Jsoup docs for a solution.
int j = 0;
int i = 0;
String URL = ("https://www.ufc.com/athletes/all?gender=All&search=&page="+j);
Document doc = Jsoup.connect(URL).userAgent("mozilla/70.0.1").get();
Elements temp = doc.select("div.c-listing-athlete__text");
for (Element fighterList:temp) {
i++;
System.out.println(i + " " + fighterList.getElementsByClass("c-listing-athlete__name").first().text());
}
j++;
URL = ("https://www.ufc.com/athletes/all?gender=All&search=&page="+j);
doc = Jsoup.connect(URL).userAgent("mozilla/70.0.1").get();
temp = doc.select("div.c-listing-athlete__text");
for (Element fighterList:temp) {
i++;
System.out.println(i + " " + fighterList.getElementsByClass("c-listing-athlete__name").first().text());
}

To save the data from code, open one writer before the loop and write every page to it, so nothing gets overwritten. Maybe this can help you:
int pagesNumber = 10;
int count = 0; // running fighter number across all pages
// Create the file once; try-with-resources closes it even if a request fails
try (BufferedWriter out = new BufferedWriter(new FileWriter(System.currentTimeMillis() + "out.txt"))) {
    for (int page = 0; page < pagesNumber; page++) {
        String url = "https://www.ufc.com/athletes/all?gender=All&search=&page=" + page;
        Document doc = Jsoup.connect(url).userAgent("mozilla/70.0.1").get();
        Elements temp = doc.select("div.c-listing-athlete__text");
        for (Element fighter : temp) {
            count++;
            out.write(count + " " + fighter.getElementsByClass("c-listing-athlete__name").first().text());
            out.newLine(); // one fighter per line
        }
    }
} catch (IOException e) { // network or file error
    System.err.println("Error: " + e.getMessage());
}
Hope it helps :)
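If the pages are scraped in separate runs or separate method calls, keeping one writer open is not always practical. A minimal sketch of the alternative, opening the file in append mode each time, is below. It uses only the standard library; the class name ScrapeOutput, the method appendPage, and the sample names are made up for illustration and stand in for the text Jsoup would return.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

class ScrapeOutput {
    // Appends one line per name. CREATE makes the file on first use;
    // APPEND keeps whatever earlier calls already wrote.
    static void appendPage(Path file, List<String> names) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String name : names) {
                out.write(name);
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("fighters", ".txt");
        // Hypothetical stand-ins for the names scraped from page 0 and page 1.
        appendPage(file, Arrays.asList("Fighter A", "Fighter B"));
        appendPage(file, Arrays.asList("Fighter C"));
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

With the older java.io API the same effect comes from new FileWriter(fileName, true), where the second argument switches the writer to append mode.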

Related

Warning: You did not close a PDF Document looping when renderImageWithDPI

I want to split a PDF into one image file per page, but I get "Warning: You did not close a PDF Document" in a loop when calling renderImageWithDPI.
I still get the warning.
Updated code:
public void splitImage(PDDocument document, File checkFile, File theDirSplit, String fileExtension, File theDir, File watermarkDirectory, int numberOfPages)
throws InvalidPasswordException, IOException {
String fileName = checkFile.getName().replace(".pdf", "");
int dpi = 300;
if (theDirSplit.list().length < numberOfPages)
{
for (int i = 0; i < numberOfPages; ++i)
{
if (i == numberOfPages)
break;
if (theDirSplit.list().length != numberOfPages)
{
File outPutFile = new File(theDirSplit + Constan.simbol + fileName + "_" + (i + 1) + "." + fileExtension);
document = PDDocument.load(checkFile);
PDFRenderer pdfRenderer = new PDFRenderer(document);
BufferedImage bImage = pdfRenderer.renderImageWithDPI(i, dpi, ImageType.RGB);
ImageIO.write(bImage, fileExtension, outPutFile);
}
// splitService.watermark(outPutFile, (i + 1), watermarkDirectory, "pdf");
}
document.close();
//System.out.println("Converted Images are saved at -> " + theDirSplit.getAbsolutePath());
}
System.out.println("Done Partial SPlit");
/*
* int i = 1; while (iterator.hasNext()) { PDDocument pd = iterator.next();
* pd.save(theDirSplit + Constan.simbol + i++ + ".pdf"); }
* System.out.println("Multiple PDF’s created");
*/
}
The error repeats inside the loop:
the total number of warnings equals the number of pages.
I have already tried closing the document, but it does not work, and this process crashes my server with java.lang.OutOfMemoryError: Java heap space.
Update:
else if ("pdf".equalsIgnoreCase(typeFile)) {
System.out.println(
"target file " + downloadPath + R_OBJECT_ID + Constan.simbol + R_OBJECT_ID + "." + typeFile);
// get the number of pages
try(PDDocument document = PDDocument.load(checkFile)){
File theDirSplit = new File(theDir.getAbsolutePath() + Constan.simbol + "splitImage");
createFolder(theDirSplit);
String fileExtension = "jpeg";
File watermarkDirectory = new File(theDir.getAbsolutePath() + Constan.simbol + "watermarkImage");
createFolder(watermarkDirectory);
// split 2 page image
if (theDirSplit.list().length <= document.getNumberOfPages()) {
try {
splitImage(document,checkFile, theDirSplit, fileExtension, theDir, watermarkDirectory, document.getNumberOfPages()/2);
} catch (IOException e) {
System.out.println("ERROR SPLIT PDF " + e.getMessage());
e.printStackTrace();
}
}
res.setTotalPages(document.getNumberOfPages());
document.close();
return new ResponseEntity<>(res, HttpStatus.OK);
}
} else {
res.setTotalPages(1);
return new ResponseEntity<>(res, HttpStatus.OK);
}
This is the code that calls the split method.

This is somewhat lost from the question, but the cause was failing to close the documents generated by splitter.split().
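The same rule applies to every document object created in a loop: each one has to be closed, not just the last one. Note that the loop in the question also calls PDDocument.load(checkFile) on every iteration and reassigns document, so only the final instance reaches document.close(). The sketch below shows the close-each-one pattern with the standard library only. FakeDocument and split are hypothetical stand-ins (they are not PDFBox classes) for a closeable document and for a call such as splitter.split() that returns one document per page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class CloseEachDocument {
    // Hypothetical stand-in for a document that holds memory until closed.
    static final class FakeDocument implements AutoCloseable {
        static final AtomicInteger open = new AtomicInteger();
        FakeDocument() { open.incrementAndGet(); }
        @Override public void close() { open.decrementAndGet(); }
    }

    // Stand-in for a split call that returns one document per page.
    static List<FakeDocument> split(int pages) {
        List<FakeDocument> parts = new ArrayList<>();
        for (int i = 0; i < pages; i++) {
            parts.add(new FakeDocument());
        }
        return parts;
    }

    // Handles the parts one by one and closes each as soon as it is done.
    static int processAll(int pages) {
        int processed = 0;
        for (FakeDocument part : split(pages)) {
            try (FakeDocument doc = part) {
                processed++; // rendering or saving the page would go here
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(processAll(5) + " processed, "
                + FakeDocument.open.get() + " still open");
    }
}
```

In the question's method, the equivalent change would be to render every page from the PDDocument that is passed in rather than loading the file again for each page.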

pdfbox getcharacterbyarticle() rendering the vector for last page

I am trying to get text details such as coordinates, width, and height using the following code (adapted from another solution), but the output contains only the text from the last page.
Code
public static void main( String[] args ) throws IOException {
PDDocument document = null;
String fileName = "apache.pdf";
PDFParser parser = new PDFParser(new FileInputStream(fileName));
parser.parse();
StringWriter outString = new StringWriter();
CustomPDFTextStripper stripper = new CustomPDFTextStripper();
stripper.writeText(parser.getPDDocument(), outString);
Vector<List<TextPosition>> vectorlistoftps = stripper.getCharactersByArticle();
for (int i = 0; i < vectorlistoftps.size(); i++) {
List<TextPosition> tplist = vectorlistoftps.get(i);
for (int j = 0; j < tplist.size(); j++) {
TextPosition text = tplist.get(j);
System.out.println(" String "
+ "[x: " + text.getXDirAdj() + ", y: "
+ text.getY() + ", height:" + text.getHeightDir()
+ ", space: " + text.getWidthOfSpace() + ", width: "
+ text.getWidthDirAdj() + ", yScale: " + text.getYScale() + "]"
+ text.getCharacter() +" Font "+ text.getFont().getBaseFont() + " PageNUm "+ (i+1));
}
}
}
CustomPDFTextStripper class:
class CustomPDFTextStripper extends PDFTextStripper
{
//Vector<Vector<List<TextPosition>>> data = new Vector<Vector<List<TextPosition>>>();
public CustomPDFTextStripper() throws IOException {
super();
}
public Vector<List<TextPosition>> getCharactersByArticle(){
// data.add(charactersByArticle);
return charactersByArticle;
}
}
I tried adding the vectors to a list, but when the stripper runs it iterates through all the pages, so only the last page's details are left in the charactersByArticle vector, and that is what gets returned. How do I get the info for all pages?

Temporary fix:
Changed the main method to set the current page as both the start page and the end page before getting the text info. Not a good idea, though.
for (int page = 0; page < pageCount; page++)
{
    // page numbers are 1-based; equal bounds process only this page
    stripper.setStartPage(page + 1);
    stripper.setEndPage(page + 1);
    stripper.writeText(parser.getPDDocument(), outString);
    // typed Vector so that get(i) returns a List<TextPosition> without a cast
    Vector<List<TextPosition>> vectorlistoftps = stripper.getCharactersByArticle();
    PDPage thisPage = stripper.getCurrentPage();
    for (int i = 0; i < vectorlistoftps.size(); i++) {
        List<TextPosition> tplist = vectorlistoftps.get(i);
    }
}

Can't delete a pdf file created using itext pdf library

I am using the iText PDF library (itextpdf.com/) to create PDF files for my project, which is written in Java.
The problem: my method creates two PDFs and I want to delete the first one, but for some reason the first PDF file cannot be deleted. I have tried File.delete(), including inside a finally block, and nothing seems to work.
I am sure that I close my FileOutputStream and call document.close() too. What can I do to remove this file?
public boolean gerarPDFDeStringVariosArquivosSemNumeroDePaginasComId(LinkedList < String > textosLidos, LinkedList < String > nomesDosArquivosLidos, File arquivoPdfOutput) {
try {
nomesDosArquivosLidosESeusIds = new HashMap < String, String > ();
FileOutputStream fos = new FileOutputStream(arquivoPdfOutput);
Document document = new Document();
PdfWriter.getInstance(document, fos);
document.open();
addMetaData(document);
addTitlePage(document);
for (int i = 0; i < textosLidos.size(); i++) {
String umTextoLido = textosLidos.get(i);
String umNomeArquivoLido = nomesDosArquivosLidos.get(i);
String idUmNomeArquivoLido = "#%&#" + "id_" + i + "#%&#";
this.nomesDosArquivosLidosESeusIds.put(umNomeArquivoLido, idUmNomeArquivoLido);
String umNomeArquivoLidoEIdDele = idUmNomeArquivoLido + " \n" + umNomeArquivoLido; // the id lets us find out how many pages the file takes up in the PDF
String textoLido2 = umTextoLido.replaceAll("\\t", " ");
addContent(document, textoLido2, umNomeArquivoLidoEIdDele);
}
document.close();
fos.close();
return true;
} catch (Exception e) {
e.printStackTrace();
return false;
}
}
public boolean gerarPDFDeStringVariosArquivosComNumeroDePaginas(LinkedList < String > textosLidos, LinkedList < String > nomesDosArquivosLidos, File arquivoPdfOutput, File arquivoPdfOutputComNumeroDePaginas) {
/* first I run gerarPDFDeStringVariosArquivosSemNumeroDePaginas to generate a PDF with the
* ids of each file and their texts, but without page counts, and I update the local variable this.nomesDosArquivosLidosESeusIds
*/
boolean conseguiGerarPrimeiroPdf = gerarPDFDeStringVariosArquivosSemNumeroDePaginasComId(textosLidos, nomesDosArquivosLidos, arquivoPdfOutput);
if (conseguiGerarPrimeiroPdf == true) {
// now I get how many pages each file has
VerificaNumeroDePaginasDeCadaArquivoNoPdfGerado verificaNumeroDePaginas = new VerificaNumeroDePaginasDeCadaArquivoNoPdfGerado();
HashMap < String, Integer > arquivosEQuantasPaginasElesTem = verificaNumeroDePaginas.pegarNumeroDePaginasNoPdfDeCadaArquivo(this.nomesDosArquivosLidosESeusIds, nomesDosArquivosLidos, Main.FILE);
// now I start creating the second PDF, which will carry the page count of each file
try {
FileOutputStream fos = new FileOutputStream(arquivoPdfOutputComNumeroDePaginas);
Document document = new Document();
PdfWriter.getInstance(document, fos);
document.open();
addMetaData(document);
addTitlePage(document);
for (int i = 0; i < textosLidos.size(); i++) {
String umTextoLido = textosLidos.get(i);
String umNomeArquivoLido = nomesDosArquivosLidos.get(i);
int quantasPaginasTemOArquivoLido = arquivosEQuantasPaginasElesTem.get(umNomeArquivoLido);
String umNomeArquivoLidoEPaginas;
if (quantasPaginasTemOArquivoLido > 1) {
umNomeArquivoLidoEPaginas = umNomeArquivoLido + " (" + quantasPaginasTemOArquivoLido + " páginas)";
} else {
umNomeArquivoLidoEPaginas = umNomeArquivoLido + " (" + quantasPaginasTemOArquivoLido + " página)";
}
String textoLido2 = umTextoLido.replaceAll("\\t", " ");
addContent(document, textoLido2, umNomeArquivoLidoEPaginas);
}
document.close();
fos.close();
arquivoPdfOutput.delete();
return true;
} catch (Exception e) {
e.printStackTrace();
return false;
}
} else {
return false;
}
}
I do this to test:
File arquivoPdfGerar = new File(Main.FILE);
File arquivopdfGerarComNumeroDePaginas = new File(Main.FILE2);
/*PrintStream ps = new PrintStream(fileOutputStream);
System.setOut(ps);*/
LinkedList < String > nomesArquivosLidos = new LinkedList < String > ();
LinkedList < String > textosArquivosLidos = new LinkedList < String > ();
String url = "C:/Users/fábioandrews/Documents/git/PdfGeneratorForSoftwareRegistration/PdfGeneratorForSoftwareRegistration/src/br/ufrn/pairg/pdfgenerator/FirstPDF.java";
String nomeProjeto = "PdfGeneratorForSoftwareRegistration";
String arquivoLido = LeitorArquivoTexto.lerArquivoQualquerDeTexto(url);
String nomeArquivoLido = LeitorArquivoTexto.pegarNomeArquivo(url, nomeProjeto);
nomesArquivosLidos.add(nomeArquivoLido);
textosArquivosLidos.add(arquivoLido);
url = "C:/Users/fábioandrews/Documents/git/PdfGeneratorForSoftwareRegistration/PdfGeneratorForSoftwareRegistration/src/br/ufrn/pairg/pdfgenerator/Main.java";
nomeProjeto = "PdfGeneratorForSoftwareRegistration";
arquivoLido = LeitorArquivoTexto.lerArquivoQualquerDeTexto(url);
nomeArquivoLido = LeitorArquivoTexto.pegarNomeArquivo(url, nomeProjeto);
nomesArquivosLidos.add(nomeArquivoLido);
textosArquivosLidos.add(arquivoLido);
GeraPDFDeStringVariosArquivos geradorPdf = new GeraPDFDeStringVariosArquivos();
geradorPdf.gerarPDFDeStringVariosArquivosComNumeroDePaginas(textosArquivosLidos, nomesArquivosLidos, arquivoPdfGerar, arquivopdfGerarComNumeroDePaginas);

What about the following line?
PdfWriter.getInstance(document, fos);
IMHO, this line
looks useless in your code, as I don't find any reference to the PdfWriter object it returns.
If this line can be removed, just do it ;-)
If not, you have to
hold on to the PdfWriter object returned,
and close it (in a finally block, as should be done for the FileOutputStream and Document instances too).
Note: these remarks are based on iText 5.5.6, the version I am using.
If you still have issues, you may plug in this little tool (created by the author of Jenkins). It saved me on an ooooold program.

Thanks for all your answers. The solution was right where Bruno Lowagie said: when I was reading the PDF files to count how many pages they had, I was not closing the PdfReader, and therefore the file was still in use.
Thank you all for the answers ^^
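The general pattern behind that fix is to release every handle on the file before deleting it. A minimal sketch with the standard library only is below: the InputStream stands in for the PdfReader that counted the pages, and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class DeleteAfterClose {
    // Reads the whole file (a stand-in for counting pages), then deletes it.
    static long readThenDelete(Path file) throws IOException {
        long bytes = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (in.read() != -1) {
                bytes++;
            }
        } // the handle is released here, before the delete
        Files.delete(file); // throws an IOException on failure
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("first", ".pdf");
        Files.write(file, new byte[] {1, 2, 3});
        System.out.println("read " + readThenDelete(file) + " bytes, exists: " + Files.exists(file));
    }
}
```

Files.delete is used here instead of File.delete() because it reports why a delete failed through an exception, whereas File.delete() only returns false.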

Parse all PDF pages at once with iText

I am trying to parse a PDF file with iText. What I am trying to achieve is to parse all pages at once.
try {
PdfReader reader = new PdfReader("D:\\hl_sv\\L04MF.pdf");
int pages = reader.getNumberOfPages();
String content = "";
for (int i = 0; i <= pages; i++) {
System.out.println("============PAGE NUMBER " + i + "=============" );
content = content + " " + PdfTextExtractor.getTextFromPage(reader, i);
}
System.out.println(content);
}
I am getting this error:
Exception in thread "main" java.lang.NullPointerException
at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:77)
at com.itextpdf.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:74)
at com.itextpdf.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:89)
at com.pdf.PDF.main(PDF.java:18)
Another problem I am facing is that the hyphen (-) is parsed as a question mark (?). How can I fix that?
I appreciate any help.
Edit
It works for me like this, but I still can't solve the hyphen bug.
try {
PdfReader reader = new PdfReader("D:\\hl_sv\\L04MF.pdf");
int pages = reader.getNumberOfPages();
for(int i = 1; i<= pages; i++) {
System.out.println("============PAGE NUMBER " + i + "=============" );
String line = PdfTextExtractor.getTextFromPage(reader,i);
System.out.println(line);
}
}

public static String extractPdfText() throws IOException {
    PdfReader pdfReader = new PdfReader("/path/to/file/myfile.pdf");
    try {
        int pages = pdfReader.getNumberOfPages();
        StringBuilder pdfText = new StringBuilder();
        // Page numbers start at 1; page 0 throws a NullPointerException
        for (int page = 1; page <= pages; page++) {
            pdfText.append(PdfTextExtractor.getTextFromPage(pdfReader, page));
        }
        return pdfText.toString();
    } finally {
        pdfReader.close(); // release the file even if extraction fails
    }
}
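
The hyphen question is not answered in this thread. One possible cause, offered here as an assumption rather than a diagnosis: the PDF contains a typographic dash such as U+2013 instead of the ASCII hyphen, and the charset used for console output cannot represent it, so Java substitutes a question mark. The sketch below reproduces that effect with the standard library only; the sample string and the class name are made up.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DashEncoding {
    // Encodes the text in the given charset and decodes it again, which is
    // roughly what happens when text is printed to a console in that charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) throws Exception {
        String extracted = "pages 1\u20132"; // en dash, not an ASCII hyphen

        // US-ASCII cannot represent U+2013, so it is replaced with '?'
        System.out.println(roundTrip(extracted, StandardCharsets.US_ASCII));

        // Printing through a UTF-8 stream keeps the dash, provided the
        // terminal itself displays UTF-8
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(extracted);
    }
}
```

If this is the cause, the extracted string itself is fine and only the output step needs a charset that contains the character.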

Getting a connect time out exception in jsoup

I am trying to read a lot of HTML pages using jsoup. I have an ArrayList called "allPageLinks" that holds the page links. Here is my code:
Document doc;
for (int i = 0; i < allPageLinks.size(); i++) {
try {
doc = Jsoup.connect(allPageLinks.get(i)).timeout(0).get();
Element page_clips = doc.getElementById("page_clips");
Element page_clip_content = page_clips
.getElementById("content");
Elements product_grid = page_clip_content
.select(".product-list.margin-left-5");
Elements products = product_grid.get(0).children();
for (int j = 0; j < products.size(); j++) {
try {
String productName = products.get(j)
.getElementsByClass("name").text();
String productPrice = products.get(j)
.getElementsByClass("price").text();
String productLink = products.get(j)
.getElementsByClass("image").select("a")
.first().attr("href");
Document newDoc = Jsoup.connect(productLink).get();
Elements elements = newDoc.getElementsByClass("left");
Elements productNameElement = elements.get(0)
.getElementsByClass("colorbox");
String productImage = productNameElement.attr("href");
elements = newDoc.getElementsByClass("right");
String productId = elements.get(0)
.getElementsByClass("field").get(1).text();
writer.append(productName);
writer.append(';');
writer.append(productPrice);
writer.append(';');
writer.append(productId);
writer.append(';');
writer.append(productImage);
writer.append(';');
writer.append(productLink);
writer.append('\n');
} catch (Exception ex) {
System.out.println(ex.getMessage() + " " + i + " "
+ allPageLinks.get(i) + " ICTEKICATCH");
}
}
} catch (Exception ex) {
System.out.println(ex.getMessage() + " " + i + " "
+ allPageLinks.get(i));
}
}
Even though I set the connection timeout to zero, I am getting connect timeout exceptions for most of the links. Can anyone help me get rid of that exception?
Thanks

You forgot to specify the timeout for the connection inside the inner loop:
Document newDoc = Jsoup.connect(productLink).get();
Should be:
Document newDoc = Jsoup.connect(productLink).timeout(0).get();
That is most likely where the timeout exception is occurring.
