community!
My project is simple: I have a link to a website that holds information on various chemical substances, and I want to extract some of that data and put it into a PDF. The catch is that I want to keep the formatting of the original HTML (using its CSS, of course).
Example of substance: http://www.molbase.com/en/msds_1659-31-0-moldata-2.html#tabs
I used jsoup to read the HTML of the table at the bottom of the page, the MSDS one, which contains multiple sections with different information about the substance, but I really don't know how to save the exact HTML format into my PDF file. I have tried iText too, but it gives me a "missing ending tag" error, and even if it worked, it would print the full page, not only that MSDS table.
Here is what I have tried, but it isn't effective:
Document docu = Jsoup.connect(urlbun).get();
Element tableHeader = docu.select("div[class=\"msds\"]")
.first();
String[] finSyn = tableHeader.text().split(" ");
String moreText =" ";
I tried to split the text that the webpage has under that div (class="msds"), but I cannot find a good way to split it.
Could you please give me a hint on what to do? Even if the formatting is not identical, I would like to be able to display the information in the same way, with indentation and such.
Thank you!
You can put the content that you want to convert to PDF inside an element with a CSS ID (such as a DIV) and then use the PDFmyURL API to convert only that section to PDF.
Please refer to this on our website about how to select pieces from a page to convert to PDF
Disclosure: I work for the company that owns this site
I'm almost done creating a Supreme bot. Now I need a keyword searcher: it should search for a keyword on the page and then click on it.
For example:
Illegal Business Hooded Sweatshirt Red
... the bot should search for the keyword and also for the color. I uploaded a screenshot (from the Supreme page) and need your help.
Screenshot from the source code (Supreme):
My code I tried:
driver.findElement(By.xpath("//h1[text()='Illegal Business Hooded Sweatshirt']/p[text()='Red']")).click();
Since there are encoded symbols in between the text, I believe you can't find those items directly with XPath.
I suggest finding all <article> tags, then for each article tag looking for the <h1> inside, retrieving its text (filtering out those weird symbols), and comparing it with the text you want.
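The compare-after-filtering step can be sketched in plain Java as below. This is only a sketch under assumptions: the "weird symbols" are assumed to be invisible Unicode format characters and non-breaking spaces (the screenshot is not available here), and `KeywordMatcher`, `Item`, `normalize` and `indexOfMatch` are hypothetical names. In a real bot the two strings per item would come from calling `getText()` on the elements found inside each article.

```java
import java.util.List;

final class KeywordMatcher {

    /** Hypothetical holder for the text read from one article element. */
    static final class Item {
        final String name;
        final String color;

        Item(String name, String color) {
            this.name = name;
            this.color = color;
        }
    }

    /** Drops invisible format characters and collapses runs of whitespace to one space. */
    static String normalize(String raw) {
        return raw.replaceAll("\\p{Cf}", "")          // e.g. zero-width characters (assumed)
                  .replaceAll("[\\s\\p{Zs}]+", " ")   // includes non-breaking spaces
                  .trim();
    }

    /** Returns the index of the first item whose name and color both match, or -1. */
    static int indexOfMatch(List<Item> items, String name, String color) {
        for (int i = 0; i < items.size(); i++) {
            Item item = items.get(i);
            if (normalize(item.name).equalsIgnoreCase(name)
                    && normalize(item.color).equalsIgnoreCase(color)) {
                return i;
            }
        }
        return -1;
    }
}
```

The returned index can then be used to pick the matching element from the list of articles and click its link.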
The p tag is not inside the h1 tag, and "Red" is inside an anchor tag (a).
So you can use this XPath: //h1/a[text()='Illegal Business Hooded Sweatshirt']/ancestor::div/p/a[text()='Red']
My website is just like Stack Overflow and is under development. I am using a plain textarea to take text input, as I do not have a WMD editor like Stack Overflow's.
When I take HTML code as input and store it in a database table in a text or nvarchar(max) column, it is stored successfully. But when I load that data for display, the browser renders the corresponding HTML page instead of showing the HTML code on screen. I am not able to resolve it. For better understanding, I am including images of the input page and output page of my website.
This is image of input page:
This is the image of output page:
What is going wrong here?
One easy way is to replace
< with &lt; and > with &gt;
in the HTML string that you retrieved, and then display it on the page.
Have you tried that?
You need to escape the HTML so it's not interpreted by the browser. How to do that depends on the view technology you're using.
With JSP and JSTL the escaping is done automatically with <c:out value="${myString}"/>. If you're not using JSTL yet, now's the time to start (there are a lot of other helpful things in there too).
You can save the HTML code just like text, using a varchar(max) column in the table. How the code is displayed depends on the browser. But if you use an nvarchar type, that will cause problems in display.
Another possible solution is to replace the HTML tags before storing the text in the database. What I did is:
text = text.replaceAll("<", "&lt;");
text = text.replaceAll(">", "&gt;");
and then I stored the text in the database, and it's working. Thanks to Bibin Mathew.
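For reference, the escaping approach from the answers above can be sketched as one small helper. This is a minimal sketch, not a replacement for the escaping your view technology already provides; `HtmlEscaper` and `escapeHtml` are hypothetical names. It also escapes `&` and quotes, which the two-line version does not, so that text already containing something like `&lt;` is shown literally too.

```java
final class HtmlEscaper {

    /** Replaces the characters that HTML treats specially with their entities. */
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Escaping at display time rather than before storing keeps the original input intact in the database, which may be preferable if the same text is later needed unescaped.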
I am trying to extract accented words from a PDF e-book. The best results are produced when using the iText library, but I fail to get the accents from the words.
Example:
побеђивање -should come out as- побеђи́ва̄ње (accents are missing)
The letters are Serbian Cyrillic.
I tried many OCR solutions, but they all give bad results. Is there a way for me to extract all of this PDF data the way it is in the PDF using iText? I know that this has a lot to do with the way PDF works and that this is a hard thing to get, but I really need this; the alternative is to retype all of the data.
The pdf file pdf example file
The sample document actually contains one big image, a scanned page, and invisible text information on top of the scanned printed letters. Most likely this text information is the result of some OCR process.
Unfortunately, this text information is already missing the accents in question. E.g. the text for the first entry
is added as
(\340\361\362\340\353\367\355)Tj 0 Tc (\236)Tj
...
As you can see, the same letter \340 is used at positions 1 and 4, while according to the scanned page one of the matching printed letters has an accent and the other does not.
This happens throughout the whole page.
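To make the observation concrete, the octal escapes in the string operand shown above can be turned into numeric character codes. The following is a minimal sketch, assuming the string body contains only `\ddd` octal escapes and plain characters (other PDF escapes such as `\n` or `\(` are not handled); `PdfStringBytes` and `decode` are hypothetical names, and no font encoding is applied, so the result is codes rather than letters.

```java
import java.util.ArrayList;
import java.util.List;

final class PdfStringBytes {

    /** Decodes \ddd octal escapes (1 to 3 digits) and plain characters into byte values. */
    static List<Integer> decode(String body) {
        List<Integer> codes = new ArrayList<>();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length() && isOctal(body.charAt(i + 1))) {
                i++; // skip the backslash
                int value = 0;
                int digits = 0;
                while (i < body.length() && digits < 3 && isOctal(body.charAt(i))) {
                    value = value * 8 + (body.charAt(i) - '0');
                    i++;
                    digits++;
                }
                codes.add(value & 0xFF);
            } else {
                codes.add((int) c);
                i++;
            }
        }
        return codes;
    }

    private static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }
}
```

Decoding the body of the first operand gives the same code at the first and fourth positions, so the text layer itself carries no information that could distinguish an accented letter from an unaccented one.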
Thus, any attempt at regular text extraction will fail to return the accents in question. The only chance you have is to use OCR.
You say you
tried many of the ocr solutions but they all give bad results
Probably you applied the OCR applications to the PDF or a rendered version of it. I would suggest you instead extract the scanned images; this way you get all the quality there is. iText can help you with image extraction.
I have a couple of HTML pages in my assets folder; I am able to open them and read them into a string. My problem comes after that: I just want to extract the text between certain tags. For example, suppose I have a line in my HTML page such as <h3>Hello have a nice day</h3>.
I just want to get "Hello have a nice day". So far I have tried string functions, but with no success. How can I achieve this?
UPDATE
I got the solution from link
Use Html.fromHtml(): pass it the HTML source and it will return only the text.
Check http://developer.android.com/reference/android/text/Html.html
If you are able to read the HTML files, then everything should be easy. If it's a simple HTML page, you can use XPath to parse it and retrieve whatever you want, or you can use a library such as jsoup to parse the HTML.
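For very simple pages, a plain-Java sketch with a regular expression may already be enough. It assumes the tag is not nested inside itself and that the markup is well formed; for anything more complex a real parser such as jsoup is the safer choice. `TagText` and `textBetween` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagText {

    /** Returns the trimmed content of every <tag>...</tag> pair found in the HTML string. */
    static List<String> textBetween(String html, String tag) {
        String name = Pattern.quote(tag);
        Pattern pattern = Pattern.compile(
                "<" + name + "(\\s[^>]*)?>(.*?)</" + name + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            // Group 2 is the content; any tags nested inside it are left untouched.
            result.add(matcher.group(2).trim());
        }
        return result;
    }
}
```

Calling `TagText.textBetween(html, "h3")` on the page string yields one entry per h3 element, in document order.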
I have to edit an existing PDF file using iText in Java. The existing PDF contains lots of pages. Given a page number in that PDF, I have to change the footer of that page to new text and output only that page, with the edited footer along with the rest of the page's contents. There is no need to output the remaining pages. Also, the existing PDF is in A6 format and I have to change the output PDF to A4 format. How is this possible?
You can split and merge PDF files using iText. That means you need to split your original document into three parts and keep only the middle (required) part. You can also delete and add objects. That means you can find the footer object, delete it and add a new object in its place. I do not think you would be able to change the format, unless you create a brand new document in the target format and copy the objects from the source into the new document. Worth trying.
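If you try the new-document route, the copied page content has to be scaled from A6 to A4. The arithmetic for that can be sketched independently of any PDF library. This is only a sketch: `PageFit` and `fit` are hypothetical names, and the page sizes in the usage note (A6 about 298 x 420 pt, A4 about 595 x 842 pt) are rounded values assumed here, not taken from the question.

```java
final class PageFit {

    /**
     * Computes a uniform scale and centering offsets for placing a source page
     * on a target page without distortion. Returns {scale, offsetX, offsetY},
     * with offsets in the same unit as the inputs (for example points).
     */
    static double[] fit(double srcWidth, double srcHeight,
                        double dstWidth, double dstHeight) {
        // Use the smaller ratio so the whole source page fits on the target page.
        double scale = Math.min(dstWidth / srcWidth, dstHeight / srcHeight);
        double offsetX = (dstWidth - srcWidth * scale) / 2.0;
        double offsetY = (dstHeight - srcHeight * scale) / 2.0;
        return new double[] {scale, offsetX, offsetY};
    }
}
```

With the rounded sizes above, `PageFit.fit(298, 420, 595, 842)` gives a scale just under 2, which matches the fact that A4 has twice the edge length of A6. The three values would then be applied as the transformation when the imported page is drawn onto the new A4 page.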