How to Convert HTML pages into Bmp using java

I have thousands of HTML files that I want to convert to .bmp (bitmap images) using Java.
Ideally I would select all the HTML files, specify the size, and the code would then convert every page into a BMP.
Please suggest a simple method and tell me which class I should use for this.
Is there any API for converting HTML to BMP?

One simple and easy way to do it is to rely on these two tools:
WkHtml2png: it can convert HTML pages to PNG and it handles advanced CSS and JavaScript really well. I'm currently using it to convert HTML to PDF.
png2bmp: it converts PNG images to BMP.
As these tools are native programs, you will have to wrap them using the basic Java API:
java.lang.ProcessBuilder
java.lang.Process
This discussion might be very useful if you are using WkHtml2pdf.
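A minimal sketch of wrapping a native program with ProcessBuilder might look like the following. The class name NativeTool, the executable names and the "<executable> <input> <output>" argument order are assumptions made for illustration; check the real command line of each tool before relying on it.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

public class NativeTool {

    // Builds "<executable> <input> <output>". The argument order is an
    // assumption; adjust it to whatever the real tool expects.
    static List<String> buildCommand(String executable, Path input, Path output) {
        return List.of(executable, input.toString(), output.toString());
    }

    // Starts the command, forwards its output to this JVM's console,
    // waits for it to finish and returns the exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical executable names and file paths, for illustration only.
        Path html = Path.of("page.html");
        Path png = Path.of("page.png");
        Path bmp = Path.of("page.bmp");
        int exitCode = run(buildCommand("html-to-png-tool", html, png));
        if (exitCode == 0) {
            run(buildCommand("png-to-bmp-tool", png, bmp));
        }
    }
}
```

Passing the command as a list, rather than as one string, avoids quoting problems when file names contain spaces. For thousands of files, the same two calls can simply be repeated in a loop over the input directory.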

If the HTML is very simple, and each file is short, you could render it in a JEditorPane, then use java.awt.Robot to take a screenshot with the .createScreenCapture() method.
See this question for saving it as BMP: BufferedImage to BMP in Java
However, JEditorPane is quite limited in the HTML it accepts.
This won't work if the rendering area is larger than the screen. You might be able to create a larger JEditorPane in a window larger than the screen and capture the Graphics buffer, instead.
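The off-screen variant can be sketched roughly as below: the pane is never shown in a window, it is painted straight into a BufferedImage and written with ImageIO. The class and method names are made up for this example, and it assumes simple HTML with no external images or stylesheets (JEditorPane may load those asynchronously, so they might be missing from the result).

```java
import javax.imageio.ImageIO;
import javax.swing.JEditorPane;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class HtmlToBmp {

    // Lays the HTML out in an off-screen JEditorPane of the given size and
    // paints it into an image, so the result is not limited by the screen size.
    static BufferedImage render(String html, int width, int height) {
        JEditorPane pane = new JEditorPane("text/html", html);
        pane.setSize(width, height);
        // Opaque RGB image: the BMP writer does not handle an alpha channel.
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            pane.print(g);
        } finally {
            g.dispose();
        }
        return image;
    }

    // Writes the image using the BMP writer that ships with the JDK.
    static void saveAsBmp(BufferedImage image, File target) throws IOException {
        if (!ImageIO.write(image, "bmp", target)) {
            throw new IOException("No BMP writer accepted this image");
        }
    }

    public static void main(String[] args) throws IOException {
        BufferedImage image = render("<html><body><h1>Hello</h1></body></html>", 800, 600);
        saveAsBmp(image, new File("hello.bmp"));
    }
}
```

Strictly speaking, Swing components should be used on the event dispatch thread; for a batch job it would be safer to wrap the render call in SwingUtilities.invokeAndWait.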
Why does this have to be done from Java?

Now that I understand you correctly, you have another problem.
Although all web browsers should render HTML identically, they don't.
You should think about that.
If you only have simple HTML files like Javadocs or similar, there should be no problem with Khaled Labidi's proposal, as long as you don't mind using native libraries.
Maybe you can have a look at http://lobobrowser.org/cobra.jsp.
Render your HTML and CSS and then try to convert it all to BMP.
I don't think there is a really easy way to do that.

Related

Creating documents programmatically

I have an Android app that generates documents with each client's info based on a template, which the client can then sign. The signature is merged into the document, and the whole document is converted to an image and uploaded to a server.
Despite being converted to an image, the document should look as close as possible to the A4 format.
For this I use a WebView, and then I convert it to a Bitmap based on the width and height of a ScrollView.
For the signature I use Canvas.
But I'm not sure this approach is the best, as it is very difficult to simulate an A4 document. Depending on the device, the dimensions of the document are not proportional, and to fix that I would have to adjust them for each device's display size. Because of that, this component of the app is deliberately unavailable on some devices. But now we want to make it available on every device.
What approach do you recommend? Is there some way to develop a one-size-fits-all document with correct proportions and a similar appearance?
Thanks in advance.

Print to a PDF file. You can use Flying Saucer for that, or similar libraries.
PDF is known for rendering documents exactly as defined, and it is especially well suited to proper scaling and display.
You could even get away from the doc template and only use (X)HTML to design and fill the page.
Simply fill in some HTML, link an image (<img src='rel.jpg'>), send all that to FlyingSaucer, and you're done.

Best way to extract text from PDF in java

I want to make a program that is able to read PDF files and parse their contents.
Thus I need to extract the text using some kind of library. I found 3 ways to do so.
OCR libraries (like Tesseract)
ScanPdf libraries (like iText)
Converters from PDF to text.
I fail to understand the big differences between them since all of them will produce in the end a text file from the PDF. So which is the best way to go about this?

PDF is a complex format. If you open a PDF and you're staring at a bunch of text, that doesn't really tell you much. It could be that you're staring at an image file someone decided to wrap into a PDF file. This is 99%+ certain what you have if someone scanned a document and told their scanner to 'scan to PDF', and 100% certain what you got if you have a PNG or JPG and 'save as PDF', or try to 'print to PDF' such a thing.
There is no text in the PDF then. There are pixels.
To turn pixels into text, that's where OCR libraries come in. That's what they do. That is all they do. It's an AI bonanza and error prone. No guarantees.
However, PDF is more complex than that, it isn't like PNG/JPG: It's more like HTML. You can put actual text in there.
This has different issues, though. You can place text blobs (i.e. a 'rectangle', with coordinates, and then the text that is supposed to go inside). Again a lot like HTML: You can do something like:
<p class="foo">
World!
</p>
<p class="bar">
Hello,
</p>
and then create CSS so that the foo is rendered after the bar block (can be as simple as .foo, .bar { display: block; } .foo {float: right}).
Turning that HTML into "World! Hello," is not all that tricky. Realizing that during a render, you end up seeing "Hello, World!", and thus writing code that returns "Hello, World!", that's way more complicated.
The same problem applies to PDF. For simple PDFs, extracting the raw text inside is not too difficult, but be aware that for even mildly complex PDFs, the text can arrive in a jumbled mess.
iText is trying to give you enough power, at least, to provide the latter: To give you a full hierarchical breakdown. It returns 'here is a text box, here is its positioning, and here is the text inside. and now here is another text box, etc'. It does not return a big string.
In other words: The answer depends a lot on what PDFs you have / what PDFs you expect to be able to read, and how complex they are. If they are scans, you need an OCR library. If they are simple, a basic pdf2text converter will do fine. If you want to attempt to take into account fancily positioned PDFs with forms inside and 'popups' that can be opened and closed, oof. Probably all these tools are insufficient and you're signing up to many personweeks worth of effort.
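The "World! Hello," problem described above can be sketched in a few lines. This is only an illustration under stated assumptions: TextBox is a hypothetical type standing in for whatever a layout-aware extractor reports, and the coordinates assume a top-left origin with y growing downwards (real PDF coordinates differ).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ReadingOrder {

    // Hypothetical positioned piece of text: where it is drawn and what it says.
    static final class TextBox {
        final double x;
        final double y;
        final String text;

        TextBox(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // What a naive extractor returns: text in the order it is stored in the file.
    static String inStoredOrder(List<TextBox> boxes) {
        return boxes.stream().map(b -> b.text).collect(Collectors.joining(" "));
    }

    // What a reader sees: text sorted top to bottom, then left to right.
    static String inVisualOrder(List<TextBox> boxes) {
        List<TextBox> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingDouble((TextBox b) -> b.y)
                .thenComparingDouble(b -> b.x));
        return inStoredOrder(sorted);
    }

    public static void main(String[] args) {
        List<TextBox> boxes = List.of(
                new TextBox(120, 50, "World!"),   // stored first, drawn on the right
                new TextBox(20, 50, "Hello,"));   // stored second, drawn on the left
        System.out.println(inStoredOrder(boxes));
        System.out.println(inVisualOrder(boxes));
    }
}
```

Real documents need more than a plain sort: boxes on the same visual line rarely share exactly the same y value, and multi-column layouts break the top-to-bottom assumption entirely.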

There definitely IS text embedded in PDFs, it is NOT just pixels.
It depends on if the PDF is a "true" PDF (ie you can highlight the text and copy and paste it elsewhere) or if the PDF is a scanned image.
With scanned images, you'll have to use an OCR API. All of the major cloud providers have OCR APIs (ie Amazon Textract, Google Document AI, Microsoft Form Recognizer, etc). If it's a true PDF, then I've found the pdf.js library (https://mozilla.github.io/pdf.js/) quite helpful in doing a direct text extraction.
Just know that doing this only gets you the text that is literally on the page, and there's quite a bit of work still to do to get key/value data fields programmatically across many documents.
This is something that my startup is working on (www.sensible.so/) too if you're interested in something more powerful!

Fetching correct news image - JAVA

I am trying to make a small news crawler.
I got everything working after many tries.
The problem is that almost every HTML news page has more than 50 images.
Many of them are too small, so I filter them simply by checking their size.
Only images larger than 200x200 are taken.
But a single page often contains many large images,
and some news articles do not have any related image.
Let's take an example:
Link to News - http://timesofindia.indiatimes.com/india/Over-9-3-lakh-TB-patients-in-India-undetected-Report/articleshow/24600851.cms
My code got this image - Image no. 0 http://timesofindia.indiatimes.com/photo/10905539.cms
Image height - 300
Image width - 450
But this image has nothing to do with the article's topic.
In simple words: "How do I get the correct image dynamically?"
I do not want to write code for each website.
A blank image is better than a wrong image.

I would recommend an approach where you judge an image's relevance by its position. If an image sits inside the article body, it is probably about the article itself (except for ads, which are usually very wide).
You can also find out the source of the image and decide whether it should interest you. For instance, ad images usually come from a different server that doesn't belong to the site (in your case, indiatimes.com).
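The source check could be sketched like this, combined with the 200x200 size filter from the question. The class and method names are invented for the example, and the domain comparison is a plain suffix match, which is an assumption rather than a complete treatment of public suffixes.

```java
import java.net.URI;
import java.util.Locale;

public class ImageFilter {

    // True when the image is served from the site's own domain or a subdomain of it.
    static boolean isFromSite(String imageUrl, String siteDomain) {
        String host = URI.create(imageUrl).getHost();
        if (host == null) {
            return false;
        }
        String h = host.toLowerCase(Locale.ROOT);
        String d = siteDomain.toLowerCase(Locale.ROOT);
        return h.equals(d) || h.endsWith("." + d);
    }

    // The size rule from the question: only images larger than 200x200.
    static boolean isLargeEnough(int width, int height) {
        return width > 200 && height > 200;
    }

    // An image stays a candidate only if it passes both checks.
    static boolean isCandidate(String imageUrl, int width, int height, String siteDomain) {
        return isLargeEnough(width, height) && isFromSite(imageUrl, siteDomain);
    }
}
```

This only narrows the candidates. The image from the question is served by the site itself and is large enough, so it would still pass; the position inside the article body has to be checked separately with an HTML parser.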

Consider the alt text. The alt text usually contains either the title completely or some words from the title.
Also, the article does not have any relevant image associated with the title.
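One way to use the alt text is to score how many words of the article title also appear in it, as in the rough sketch below. The names are hypothetical, and both the word-length cutoff and any acceptance threshold you apply to the score are assumptions you would need to tune.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class AltTextScore {

    // Lower-cases the text and splits it into words, ignoring very short ones.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (w.length() > 2) {
                result.add(w);
            }
        }
        return result;
    }

    // Fraction of the title's words that also occur in the alt text (0.0 to 1.0).
    static double score(String title, String altText) {
        Set<String> titleWords = words(title);
        if (titleWords.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(titleWords);
        shared.retainAll(words(altText));
        return (double) shared.size() / titleWords.size();
    }
}
```

If no image on the page scores above the chosen threshold, the crawler can fall back to a blank image, which matches the "a blank image is better than a wrong image" requirement.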
I also suggest JSoup:
jsoup: Java HTML Parser
jsoup is a Java library for working with real-world HTML. It provides
a very convenient API for extracting and manipulating data, using the
best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to
the same DOM as modern browsers do.
scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe white-list, to prevent XSS attacks
output tidy HTML

How to detect a large image in PDF and extract it as a JPEG or any format?

I'm creating a project that creates a magazine from a PDF file. Each uploaded magazine should have a thumbnail showing its cover photo, and I want to extract this image from the PDF as a JPEG in order to set it as the cover photo.
Is there any way to do it using Ghostscript or any other command line tool ?

Do you mean you want to render the first page of a PDF file to an image format ? If so then yes, Ghostscript can do that (also ImageMagick using Ghostscript, MuPDF and probably many other utilities too).
If you mean the first page contains an image, and you do actually want to extract it, then this is a harder job and you will need a PDF toolkit to do it. Ghostscript can do this, but it's probably overkill; again, you might find MuPDF more convenient. I have a vague memory that pdftk can extract images, but I may be mistaken. A quick search on Google should probably help, if this is what you want.

Poppler/XPDF comes with pdfimages:
Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files.
Pdfimages reads the PDF file PDF-file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).
The commandline to extract all images from page 1 of a PDF is this:
pdfimages -j -f 1 -l 1 some.pdf subdir/prefix
The images will be saved to subdir/ and named prefix-000.jpg, prefix-001.jpg, and so on. The -j parameter will try to get JPEG images, if possible. Direct JPEG extraction may fail, in which case the extracted images will be saved as PPM or PBM (attention, these are big, since they're uncompressed). These can be converted to JPEGs with ImageMagick's convert, if needed:
convert subdir/prefix-022.ppm subdir/prefix-022.jpg
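If this has to be driven from Java, a rough sketch could build the same pdfimages command for ProcessBuilder and then pick a cover candidate from the output directory. Choosing the largest file by size is purely an assumed heuristic, and the class and method names are made up for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

public class CoverImage {

    // Same flags as the command line shown earlier: -j, first page 1, last page 1.
    // The resulting list can be passed to new ProcessBuilder(...).
    static List<String> pdfimagesCommand(Path pdf, Path outputPrefix) {
        return List.of("pdfimages", "-j", "-f", "1", "-l", "1",
                pdf.toString(), outputPrefix.toString());
    }

    // Assumed heuristic: the biggest extracted file is taken to be the cover image.
    static Optional<Path> largestFile(Path directory) throws IOException {
        try (Stream<Path> files = Files.list(directory)) {
            return files.filter(Files::isRegularFile)
                    .max(Comparator.comparingLong(CoverImage::sizeOf));
        }
    }

    // File size in bytes, or -1 if it cannot be read.
    private static long sizeOf(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            return -1L;
        }
    }
}
```

File size is a crude measure: an uncompressed PPM is much larger than a JPEG of the same picture, so comparing pixel dimensions after converting everything to one format would be more reliable.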

ABCpdf will allow you to extract images from a PDF. It's a two stage operation. First you need to identify where images appear in the document. Then you need to export them.
You need something like this...
using (Doc theDoc = new Doc()) {
    theDoc.Read(theSrc);
    ImageOperation op = new ImageOperation(theDoc);
    op.PageContents.AddPages();
    ICollection<ImageProperties> images = op.GetImageProperties();
    foreach (ImageProperties pl in images) {
        foreach (ImageRendition plc in pl.Renditions) {
            // ... if plc is a good match
            plc.PixMap.GetBitmap().Save(@"c:\output.jpg");
        }
    }
}
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)

Save Java frame as a Microsoft Word or PDF document?

I am working on a billing program - right now when you click the appropriate button it generates a frame that shows the various charges etc, basically an invoice. Is there a way to give the user an option of saving that frame as a document, either Microsoft Word, Microsoft Works or PDF?

One approach would be to save the frame as an image. You can convert it to an image with code like the following:
Dimension size = myComponent.getSize();
BufferedImage myImage = new BufferedImage(size.width, size.height,
        BufferedImage.TYPE_INT_RGB);
Graphics2D g2 = myImage.createGraphics();
myComponent.paint(g2);
g2.dispose(); // release the graphics context once painting is done
You can then save this image and pass it into a Jasper report. From the JasperPrint object you can then save in a few different formats, including PDF. A better but similar approach would be to pass the Graphics context into JasperReports (there is a renderer to do this in Jasper, and the quality is much better).

Paint the JFrame into a BufferedImage using the JFrame's paint() method
Save the image as jpg or png or whatever image format
Take some pdf library and create a blank pdf (e.g. iText)
Insert the image into the PDF document
Save it - done

Instead of generating a word document, I'd rather use a Java library like iText to produce a PDF document (more portable) or, even better, the JasperReport report library that can output reports in a wide range of formats (PDF, XML, HTML, CSV, XLS, RTF, TXT) as suggested by bigbrother82 in a comment. This looks cleaner to me than using an image, especially for printing (not even mentioning that your invoice may be a multi-page document).

I'd likely look at this from a slightly different direction and instead of asking how to splat the GUI form as-is into a PDF or word document I'd ask how to get that content into a Word/PDF document.
The answer to that question is Apache FOP. Generate an XSL-FO file and ask FOP to convert it into an RTF document (with a .DOC extension) or a PDF.
Normally one does this by generating an XML file containing the data you need printed, then using an XSLT to convert that XML to XSL-FO. I, however, found it easier to generate an XSL-FO file directly using a templating language (such as Freemarker).
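To give an idea of what generating the XSL-FO directly involves, here is a minimal sketch that builds a one-page document from a title and a list of invoice lines using plain string building instead of a templating language. The class name, the A4 page setup and the invoice content are assumptions for illustration; the resulting string would then be handed to FOP for conversion.

```java
import java.util.List;

public class InvoiceFo {

    // Escapes the characters that are not allowed in XML text content.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds a single-page-layout XSL-FO document: a bold title block
    // followed by one block per invoice line.
    static String toXslFo(String title, List<String> lines) {
        StringBuilder fo = new StringBuilder();
        fo.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
          .append("<fo:root xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">\n")
          .append("  <fo:layout-master-set>\n")
          .append("    <fo:simple-page-master master-name=\"A4\"")
          .append(" page-width=\"21.0cm\" page-height=\"29.7cm\" margin=\"2cm\">\n")
          .append("      <fo:region-body/>\n")
          .append("    </fo:simple-page-master>\n")
          .append("  </fo:layout-master-set>\n")
          .append("  <fo:page-sequence master-reference=\"A4\">\n")
          .append("    <fo:flow flow-name=\"xsl-region-body\">\n")
          .append("      <fo:block font-size=\"16pt\" font-weight=\"bold\">")
          .append(escapeXml(title)).append("</fo:block>\n");
        for (String line : lines) {
            fo.append("      <fo:block>").append(escapeXml(line)).append("</fo:block>\n");
        }
        fo.append("    </fo:flow>\n")
          .append("  </fo:page-sequence>\n")
          .append("</fo:root>\n");
        return fo.toString();
    }

    public static void main(String[] args) {
        // Made-up invoice data.
        System.out.println(toXslFo("Invoice", List.of("Consulting: 100.00", "Total: 100.00")));
    }
}
```

A templating language keeps the markup readable once the invoice grows tables and headers, which is probably why it was the easier route in practice.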

You might want to look at the online demo for Docmosis as an example which gives the user the options for requesting the document up front. That demo does a download, but it could direct the document into a frame instead and leave it to the browser to display. This style of working (as mentioned by others) looks at the problem from a different angle: deciding the format up front, rather than trying to save the frame contents after the fact.
