Fetching correct news image

I am trying to make a small news crawler.
I got everything working after many tries.
The problem is that almost every HTML news page has more than 50 images.
Many of them are too small, so I am filtering them simply by checking size: only images larger than 200x200 are kept.
But there are many images on a single page which are large, and some news articles do not have any related image at all.
Let's take an example:
Link to news: http://timesofindia.indiatimes.com/india/Over-9-3-lakh-TB-patients-in-India-undetected-Report/articleshow/24600851.cms
My code got this image: Image no. 0 http://timesofindia.indiatimes.com/photo/10905539.cms
Image height: 300
Image width: 450
But this image is unrelated to the article's topic.
In simple words: how do I get the correct image dynamically?
I do not want to write special code for each website.
A blank image is better than a wrong image.

I would recommend an approach where you judge an image's relevance based on its position: if an image appears inside the article body, it is probably an image about the article itself (except for ads, which are often very wide).
You can also find out the source of the image and decide whether it should interest you or not. For instance, ad images usually come from a different server which doesn't belong to the site (in your case, indiatimes.com).
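As a rough sketch of that source check (plain JDK, no library; the host comparison and the second image URL are just illustrative):

import java.net.URI;

public class ImageSourceFilter {

    // Illustrative helper: keep an image only if it is hosted on the same
    // domain as the article page, since ads are often served from elsewhere.
    static boolean sameHost(String articleUrl, String imageUrl) throws Exception {
        String articleHost = new URI(articleUrl).getHost();
        String imageHost = new URI(imageUrl).getHost();
        return articleHost != null && articleHost.equalsIgnoreCase(imageHost);
    }

    public static void main(String[] args) throws Exception {
        String article = "http://timesofindia.indiatimes.com/india/Over-9-3-lakh-TB-patients-in-India-undetected-Report/articleshow/24600851.cms";
        // same host as the article -> worth considering
        System.out.println(sameHost(article, "http://timesofindia.indiatimes.com/photo/10905539.cms"));
        // hypothetical ad server -> probably not the article image
        System.out.println(sameHost(article, "http://ads.example-adnetwork.com/banner.jpg"));
    }
}

Note that this check alone would not reject the stock photo in your example (it is on the same host), so it works best combined with the position and alt-text heuristics.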

Consider the alt text: it usually contains either the complete title or at least some words from it.
Also, note that in your example the article does not actually have any relevant image associated with the title.
I also suggest JSoup:
jsoup: Java HTML Parser
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe white-list, to prevent XSS attacks
output tidy HTML
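A minimal jsoup sketch of the alt-text idea, assuming the page <title> is a usable stand-in for the headline and scoring images by naive word overlap (both of those are my assumptions, not part of the answer above):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AltTextImagePicker {
    public static void main(String[] args) throws Exception {
        String url = "http://timesofindia.indiatimes.com/india/Over-9-3-lakh-TB-patients-in-India-undetected-Report/articleshow/24600851.cms";
        Document doc = Jsoup.connect(url).get();

        String title = doc.title().toLowerCase();
        Element best = null;
        int bestScore = 0;

        for (Element img : doc.select("img[src]")) {
            String alt = img.attr("alt").toLowerCase();
            if (alt.isEmpty()) continue;

            // naive score: count the alt-text words that also appear in the title
            int score = 0;
            for (String word : alt.split("\\s+")) {
                if (word.length() > 3 && title.contains(word)) score++;
            }
            if (score > bestScore) {
                bestScore = score;
                best = img;
            }
        }

        // no overlap at all -> better to fall back to a blank image than a wrong one
        System.out.println(best == null ? "no matching image" : best.absUrl("src"));
    }
}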

Related

Creating documents programmatically

I have an Android app that generates documents with each client's info based on a template, which the client can then sign. The signature is joined into the document, and the whole doc is converted to an image and uploaded to a server.
Despite being converted to an image, the objective is to be as similar as possible to the A4 format.
For this I use a WebView, and then I convert it to a Bitmap based on the width and height of a ScrollView.
For the signature I use a Canvas.
But I'm not sure this approach is the best, as it is very difficult to simulate an A4 document. Depending on the device, the dimensions of the doc are not proportional, and I would have to keep adjusting based on each device's display size. Because of that, this component of the app is purposefully not available on some devices. But now we want to make it available to every device.
What approach do you recommend? Is there some way to develop one doc that fits all devices, with correct proportions and a similar aspect?
Thanks in advance.
Print to a PDF file. You can use Flying Saucer for that, or similar libraries.
PDF is famous for sticking to its definitions and is especially well suited to the needs of proper scaling and display.
You could even get away from the doc template and only use (X)HTML to design and fill the page.
Simply fill in some HTML, link an image (<img src='rel.jpg'>), send all that to FlyingSaucer, and you're done.
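A minimal Flying Saucer sketch, assuming the template is already well-formed XHTML and that the file paths are placeholders; the A4 size can be requested in the template's CSS with @page { size: A4; }:

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.xhtmlrenderer.pdf.ITextRenderer;

public class HtmlToPdf {
    public static void main(String[] args) throws Exception {
        // relative resources such as <img src='rel.jpg'> are resolved
        // against this document's location
        String xhtmlUri = "file:///path/to/template.xhtml";

        try (OutputStream out = new FileOutputStream("document.pdf")) {
            ITextRenderer renderer = new ITextRenderer();
            renderer.setDocument(xhtmlUri); // must be well-formed XHTML, not loose HTML
            renderer.layout();
            renderer.createPDF(out);
        }
    }
}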

Best way to extract text from PDF in java

I want to make a program that is able to read PDF files and parse their contents.
Thus I need to extract the text using some kind of library. I found 3 ways to do so.
OCR libraries (like Tesseract)
ScanPdf libraries (like iText)
Converters from PDF to text.
I fail to understand the big differences between them since all of them will produce in the end a text file from the PDF. So which is the best way to go about this?
PDF is a complex format. If you open a PDF and you're staring at a bunch of text, that doesn't really tell you much. It could be that you're staring at an image file someone decided to wrap into a PDF file. This is 99%+ certain what you have if someone scanned a document and told their scanner to 'scan to PDF', and 100% certain what you got if you have a PNG or JPG and 'save as PDF', or try to 'print to PDF' such a thing.
There is no text in the PDF then. There are pixels.
To turn pixels into text, that's where OCR libraries come in. That's what they do. That is all they do. It's an AI bonanza and error prone. No guarantees.
However, PDF is more complex than that, it isn't like PNG/JPG: It's more like HTML. You can put actual text in there.
This has different issues, though. You can place text blobs (i.e. a 'rectangle', with coordinates, and then the text that is supposed to go inside). Again a lot like HTML: You can do something like:
<p class="foo">
World!
</p>
<p class="bar">
Hello,
</p>
and then create CSS so that the foo is rendered after the bar block (can be as simple as .foo, .bar { display: block; } .foo {float: right}).
Turning that HTML into "World! Hello," is not all that tricky. Realizing that during a render, you end up seeing "Hello, World!", and thus writing code that returns "Hello, World!", that's way more complicated.
The same problem applies to PDF. For simple PDFs, extracting the raw text inside is not too difficult, but be aware that for even mildly complex PDFs, the text can arrive in a jumbled mess.
iText is trying to give you enough power, at least, to provide the latter: To give you a full hierarchical breakdown. It returns 'here is a text box, here is its positioning, and here is the text inside. and now here is another text box, etc'. It does not return a big string.
In other words: The answer depends a lot on what PDFs you have / what PDFs you expect to be able to read, and how complex they are. If they are scans, you need an OCR library. If they are simple, a basic pdf2text converter will do fine. If you want to attempt to take into account fancily positioned PDFs with forms inside and 'popups' that can be opened and closed, oof. Probably all these tools are insufficient and you're signing up for many person-weeks' worth of effort.
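For the simple case, here is a basic extraction sketch. It uses Apache PDFBox's PDFTextStripper rather than iText, purely as an example of a plain pdf2text pass (assuming PDFBox 2.x and a placeholder file name):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class SimplePdfToText {
    public static void main(String[] args) throws Exception {
        // only works for PDFs that actually contain text objects;
        // a scanned page comes back empty and needs OCR instead
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true); // helps with mildly jumbled layouts
            System.out.println(stripper.getText(doc));
        }
    }
}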
There definitely IS text embedded in PDFs; it is NOT just pixels.
It depends on whether the PDF is a "true" PDF (i.e. you can highlight the text and copy and paste it elsewhere) or whether it is a scanned image.
With scanned images, you'll have to use an OCR API. All of the major cloud providers have OCR APIs (e.g. Amazon Textract, Google Document AI, Microsoft Form Recognizer). If it's a true PDF, then I've found the pdf.js library (https://mozilla.github.io/pdf.js/) quite helpful for doing a direct text extraction.
Just know that doing this only gets you the text that is literally on the page, and there's quite a bit of work still to do to get key/value data fields programmatically across many documents.
This is something that my startup is working on (www.sensible.so/) too if you're interested in something more powerful!

PDFBox - 2.0.3 - PDFTextStripper picking up old text from page prior to cropping/rotating

I'm attempting to perform some string validation against individual PDF pages in a file via the use of Apache PDFBox.
I'm going to be utilizing PDFTextStripper for the majority of this, so my first issue to tackle was the fact that all the PDFs I'm going to be validating against are generated as 2-up; e.g. page 1 of 2 and page 2 of 2 are on the same physical page, as if you had literally scanned a book face down into a scanner. In addition to this, they were oriented incorrectly and needed rotating 90 degrees so PDFTextStripper could read them properly.
Using elements of the below questions/solutions, I have built a method which first crops the page exactly in half, exports the cropped pages in order to a new file, rotates each page to the correct orientation, and then saves the file:
Rotate PDF around its center using PDFBox in java
Split a PDF page in two parts [duplicate]
Visually, my method is seemingly working as expected, until I run PDFTextStripper against the result: it appears to return the text of not just the page I want, but also the page I cropped out of it.
To confirm the issue, I extracted a single page out of the entire document and saved it as a new file. When running PDFTextStripper I still get the same results, even though all I can see is literally one page. Adobe's search doesn't bring up the hidden, legacy data either.
I can only assume that during my transform method I need to redefine the cropped page with only the contents of the cropped page.
My question is, how can I do this?
P.S. I haven't posted my code as it's basically an amalgamation of the solutions provided in the links above; however, if it is needed, I can provide it.
The PDFTextStripper ignores the CropBox you set to crop the pages. It also ignores whether text is covered by some filled rectangle or image, or whether the text is invisible; it extracts all text (except text in patterns or contained in Type 3 font characters).
You might want to try the PDFTextStripperByArea instead. This class (which is derived from PDFTextStripper) restricts itself to regions you can define.
(Unfortunately these regions have to be defined using a different coordinate system than the one used for the CropBox, so usually you will have to transform the coordinates first.)
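A minimal sketch of that approach (PDFBox 2.x assumed; the file name and the "left half" region are placeholders, and as noted above your own CropBox values will usually need to be transformed into the region's coordinate system first):

import java.awt.geom.Rectangle2D;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public class ExtractRegionText {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("cropped.pdf"))) {
            PDPage page = doc.getPage(0);

            float width = page.getMediaBox().getWidth();
            float height = page.getMediaBox().getHeight();

            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            // extract only the left half of the page, ignoring the cropped-out half
            stripper.addRegion("leftHalf", new Rectangle2D.Float(0, 0, width / 2, height));
            stripper.extractRegions(page);

            System.out.println(stripper.getTextForRegion("leftHalf"));
        }
    }
}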

How to Convert HTML pages into Bmp using java

I have thousands of HTML files, and I want to convert them to .bmp (bitmap images) using Java.
Ideally I would select all the HTML files, specify the size, and then the code would convert them all to BMP.
Please suggest a simple method; which class should I use for this?
Is there any API for converting HTML to BMP?
One simple and easy way to do it is to rely on these two tools:
WkHtml2png: it can convert HTML pages to PNG and has really good support for advanced CSS and JavaScript. I'm currently using it to convert HTML to PDF.
png2bmp: it converts PNG images to BMP.
As these tools are native programs, you will have to wrap them using the basic Java APIs:
java.lang.ProcessBuilder
java.lang.Process
This discussion might be very useful if you are using WkHtml2pdf.
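A rough sketch of wrapping the native renderer with ProcessBuilder. The binary name and file names are assumptions about your installation; also note that instead of png2bmp, the JDK's own ImageIO can write BMP directly:

import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class HtmlToBmp {
    public static void main(String[] args) throws Exception {
        File html = new File("page.html");
        File png  = new File("page.png");
        File bmp  = new File("page.bmp");

        // step 1: render the HTML to PNG with the external wkhtmltoimage binary
        Process p = new ProcessBuilder("wkhtmltoimage", html.getAbsolutePath(), png.getAbsolutePath())
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new RuntimeException("wkhtmltoimage failed");
        }

        // step 2: convert the PNG to BMP; BMP has no alpha channel,
        // so repaint onto an opaque RGB image first
        BufferedImage src = ImageIO.read(png);
        BufferedImage rgb = new BufferedImage(src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        rgb.createGraphics().drawImage(src, 0, 0, Color.WHITE, null);
        ImageIO.write(rgb, "bmp", bmp);
    }
}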
If the HTML is very simple, and each file is short, you could render it in a JEditorPane, then use java.awt.Robot to take a screenshot with the .createScreenCapture() method.
See this question for saving it as BMP: BufferedImage to BMP in Java
However, JEditorPane is quite limited in the HTML it accepts.
This won't work if the rendering area is larger than the screen. You might be able to create a larger JEditorPane in a window larger than the screen and capture the Graphics buffer, instead.
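A rough sketch of the JEditorPane-plus-Robot idea, assuming a non-headless environment and HTML simple enough for JEditorPane (the sample markup and output file name are placeholders):

import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import javax.swing.JEditorPane;
import javax.swing.JFrame;
import javax.swing.SwingUtilities;

public class EditorPaneCapture {
    public static void main(String[] args) throws Exception {
        JEditorPane[] paneRef = new JEditorPane[1];

        SwingUtilities.invokeAndWait(() -> {
            JEditorPane pane = new JEditorPane("text/html",
                    "<html><body><h1>Hello</h1><p>Only simple HTML renders correctly.</p></body></html>");
            JFrame frame = new JFrame("capture");
            frame.add(pane);
            frame.setSize(800, 600);
            frame.setVisible(true);
            paneRef[0] = pane;
        });

        Thread.sleep(500); // give Swing a moment to finish painting

        JEditorPane pane = paneRef[0];
        Rectangle onScreen = new Rectangle(pane.getLocationOnScreen(), pane.getSize());
        BufferedImage shot = new Robot().createScreenCapture(onScreen);

        // ImageIO ships with a BMP writer (see the linked question)
        ImageIO.write(shot, "bmp", new File("page.bmp"));
        System.exit(0);
    }
}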
Why does this have to be done from Java?
If I understood you right, you have another problem as well.
Although all web browsers should display HTML the same way, they don't.
You should think about that.
If you only have simple HTML files, like Javadocs or similar, there should be no problem with Khaled Labidi's proposal, as long as you don't mind using native libraries.
Maybe you can have a look at http://lobobrowser.org/cobra.jsp.
Render your HTML and CSS and then try to convert it all to BMP.
I think there is no really easy way to do that.

Java HTML rendering using Cobra

I am currently using Cobra: Java HTML Renderer & Parser to render an HTML page that is dynamically generated based on user choices in a Java app.
In my app the user has a choice of hundreds of items to select from. The items are displayed in the form of special, uniquely colored symbols, and the user can select more than one item.
Once a number of items are selected, their written descriptions are dynamically generated, formatted with CSS2 and HTML4 tags, and loaded into the Cobra HTMLPanel for display.
I wish to display the image of each item's symbol alongside its written description in the HTMLPanel.
One way to do this would be to save the BufferedImage to a file using ImageIO.write and then include an img tag in the dynamically generated HTML document that is loaded into the HTMLPanel. Unfortunately this is unacceptable, as there may be hundreds of symbols selected by the user, which in turn would result in hundreds of ImageIO.write calls and an incredible decrease in the performance of my app.
An alternative would be to convert the BufferedImage to a Base64 encoding and then place the encoding directly into the HTML document as follows:
<img alt="Embedded Image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAADIA..." />
Unfortunately HTMLPanel appears to ignore the data URI scheme.
Does anyone know a solution?
Use an embedded servlet container like Jetty. Point the URLs to "http://localhost:somePort/imageId", and then serve those URLs up from memory.
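A minimal sketch of that idea, assuming Jetty 9's AbstractHandler API; the port, the symbols map, and the URL scheme are all illustrative:

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.imageio.ImageIO;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.AbstractHandler;

public class InMemoryImageServer {
    // imageId -> symbol image, filled by the application as the user selects items
    static final Map<String, BufferedImage> symbols = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        Server server = new Server(8085);
        server.setHandler(new AbstractHandler() {
            @Override
            public void handle(String target, Request baseRequest,
                               HttpServletRequest request, HttpServletResponse response)
                    throws IOException {
                // target is the request path, e.g. "/imageId"
                BufferedImage img = symbols.get(target.substring(1));
                if (img == null) {
                    response.setStatus(HttpServletResponse.SC_NOT_FOUND);
                } else {
                    response.setContentType("image/png");
                    ImageIO.write(img, "png", response.getOutputStream());
                }
                baseRequest.setHandled(true);
            }
        });
        server.start();
        // the generated HTML then references <img src="http://localhost:8085/imageId" />
    }
}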
