Screen scraping using java without downloading web source code - java

I am trying to extract information from a particular website and store it in a separate text file. For example, I want to go to http://www.ncbi.nlm.nih.gov/nuccore/293762 and extract genome sequences. These sequences are formatted as groups of 10 characters containing only the letters "a", "t", "c", and "g", separated by whitespace. A group looks something like this: "acctgtacgg". I've been searching for a solution for hours, but all I find are Java libraries that parse HTML, such as jsoup. The problem is that when I view the source of the page and search for the genome sequences, they don't seem to be included in the source code, although I can find them in the DOM tree. Is there a way to programmatically read the actual data on a web page without downloading the source? Or is there a better way to go about this? Please point me in the right direction; it will be greatly appreciated.
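
The format described above (groups of exactly 10 characters drawn from a, t, c, g) can be matched with a regular expression once the page text is available as a string. The following is only a sketch of that matching step: the class name SequenceExtractor, the method extractBlocks, and the sample input are made up for illustration, and it does not address how the text is obtained from the page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls 10-letter a/t/c/g groups out of already-fetched text.
class SequenceExtractor {
    // Exactly 10 characters from {a, t, c, g}. The lookarounds reject groups that
    // are part of a longer run of lowercase letters (e.g. an ordinary word).
    private static final Pattern BLOCK =
            Pattern.compile("(?<![a-z])[atcg]{10}(?![a-z])");

    static List<String> extractBlocks(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Sample text is an assumption about what the rendered page contains.
        String sample = "1 acctgtacgg ttgcatgcaa\n21 gattacagat";
        System.out.println(extractBlocks(sample));
    }
}
```

The returned list could then be written to a text file, for example with java.nio.file.Files.write.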

Related

Java creating PDF file

I am currently working on a project that generates contracts. The idea is that I enter the data in a form and save it in a simple database.
So far, this has been my favorite place to search for good ideas and simple solutions.
Now I am facing another problem that I don't know how to solve. I want to create a PDF and replace some placeholders with data from my form.
One idea was to use an existing Word template with some bookmarks and replace them with the data from my form. Maybe there is a way to do that, and I am just too stupid to find it.
Another idea was to use XML. I thought I was clever and converted the Word template to a PDF so that I could convert that PDF to XML. Attached, you will find the XML file. But now I need the XSL file. Is there an easy way to create it?
Or maybe there is another simple solution to my problem.
In these attachments you find the PDF file, the Word template and the XML:
Thank you a lot :)
Using a template is a good idea - it makes some changes much quicker to make and then deploy. The comments above are focused on conversion, but don't forget you need to merge your data in (population) first.
If you can use Adobe tools, you can have a PDF template and use the Adobe tools to populate. This saves a "conversion" stage.
You mentioned using Word for templates. This means you have to run through two stages of processing:
population - docx is a zipped set of XML files - so you can process them with your own code or using a library.
conversion - you need PDF, so you have to convert the docx to PDF. You also have to watch out for fonts at this stage (i.e. make sure they are available on your host).
The population stage you could do yourself since you are familiar with XML. But it is definitely complicated. The conversion needs to use a tool that is ideal for it. There are a few mentioned in the comments already.
There are some free/open-source and commercial tools that can do both parts:
docx4j
JOD Reports
LibreOffice (using the Java UNO API) (I blogged this once - Java Convert Word to PDF with UNO)
Docmosis (please note I work for Docmosis)
I suggest starting with the simple example you have attached and proving you can both populate and convert it. Then switch to a more complicated example to see if you can do the other things that might be required (e.g. repeating sections, conditions, or other logic) during the population stage.
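
As a rough illustration of the "population" stage using only the JDK: since a docx is a zip of XML files, the main part (word/document.xml) can be rewritten while every other entry is copied unchanged. This is a sketch, not what any of the tools above do internally. The class name DocxPopulator, its helper methods, and the ${name} placeholder syntax are all assumptions made for the example. Note also that Word may split a placeholder across several runs in the XML, in which case a plain string replacement like this one will miss it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: fills ${key} placeholders in the main part of a docx.
class DocxPopulator {
    static final String MAIN_PART = "word/document.xml";

    // Copies every zip entry; only the main document part is modified.
    static byte[] populate(byte[] docx, Map<String, String> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(docx));
             ZipOutputStream zout = new ZipOutputStream(out)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                byte[] data = zin.readAllBytes(); // reads to the end of this entry
                if (entry.getName().equals(MAIN_PART)) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    for (Map.Entry<String, String> e : values.entrySet()) {
                        xml = xml.replace("${" + e.getKey() + "}", escapeXml(e.getValue()));
                    }
                    data = xml.getBytes(StandardCharsets.UTF_8);
                }
                zout.putNextEntry(new ZipEntry(entry.getName()));
                zout.write(data);
                zout.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    // Form data must be escaped before it is placed inside XML text.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds a one-entry zip; stands in for a real template in this demo.
    static byte[] zipOf(String name, String content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            zout.putNextEntry(new ZipEntry(name));
            zout.write(content.getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    // Returns the text of the named entry, or null if it is absent.
    static String readEntry(byte[] zip, String name) {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    return new String(zin.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] template = zipOf(MAIN_PART, "<w:t>Contract for ${customer}</w:t>");
        byte[] filled = populate(template, Map.of("customer", "Smith & Sons"));
        System.out.println(readEntry(filled, MAIN_PART));
    }
}
```

With a real template, the bytes would come from the .docx file (for example via Files.readAllBytes) and the result would still need the separate conversion stage to become a PDF.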

extracting text AND Images from PDF file

I have been banging my head against the wall with this one; I have researched and tried pretty much every library suggested to me. I am currently trying to write a program in Java that will extract text AND images from a PDF file and allow me to write the extracted content to a Word file. I have managed to extract the content using the ICEpdf library, but the problem is that I need to be able to write the content in the exact same order as it was read. So, to clarify, I need a library that will help me keep track of where exactly on the page the text and images are situated so I can put them in the same place in my Word file.
A PDF to Word converter is a horribly complex proposition.
Your best bet will probably be to use Open Office to do it for you and not even try to handle the intermediate steps.
http://www.openoffice.org/api/
Look at this: Advanced PDF parser for Java
Off topic:
Also, to my knowledge, there is a Python parser that more or less converts the PDF to HTML (that way you can keep track of the ordering of the objects within the PDF). I know it's not Java, but you might be able to use the output.
http://www.unixuser.org/~euske/python/pdfminer/index.html

Creating an editable document via java web application

I am looking for a convenient method to export some data from my database into a form that would be editable afterwards. The perfect scenario would be to export a Word document, and perhaps a brutally simple solution would be to generate HTML and copy/paste it into Word.
I've looked at several open source libraries for generating Word documents, but they seem a bit too simple or incomplete. I need support for tables and embedded images, and control over formatting the fonts, table borders, etc. (too much formatting seems to be lost when copying HTML and pasting it into Word).
Although Word is the end format, it would be fine to generate the document in any format that Word can open and subsequently save as DOCX.
I really haven't been able to find anything about generating ODT files (server side, without client installation).
I would just dive into the ASPOSE libraries, but it will take ages (and significant pain) to get a purchase order sorted out, so I need to make sure it's the only viable option before taking that route.
I could generate it as an Excel file and copy that to Word - this is looking like the best option currently.

lucene searching

Dear Stack Overflow developers, I need your help. I am stuck using Apache Lucene in a Java Swing application. The problem is complex enough that I am not even sure how to ask it.
Please try to understand my actual requirement.
The case is simple: I have to provide HTML files so that the client can access them in a Swing application, and I decided to use Apache Lucene indexing for the search facility. This gives me search, but now I want to display the HTML files that match the search criteria. I am using Swing, and JEditorPane is the control in which I have to display the contents of the HTML file. Please suggest how I should index the HTML files and how I should get their content back from the Lucene index.
The HTML files contain not only text but also links, images, etc.
Thanks in advance, hoping for your help.
regards
In one of our projects where we employed Lucene for full text indexing & search, we handled HTML files as follows:
Stored the HTML document as is on disk (you can store in the DB as well).
Using Jericho HTMLParser's HTML->Text converter, we extracted the text, links etc., out of the HTML documents.
The Lucene document had attributes that stored the metadata about the HTML file, in addition to the text content of the HTML in tokenized form.
Used StandardAnalyzer to keep certain tokens like email, website links as is during the tokenization process before indexing.
Upon searching the index, the hits returned contained the metadata of the HTML files that matched the criteria. So, we were able to identify the HTML content to be displayed for a given search result.
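
The design in the steps above (index only the extracted text, keep the HTML itself on disk, and return metadata such as the file path from a search) can be illustrated without Lucene. The sketch below is a toy in-memory inverted index that stands in for Lucene purely to show the data flow; it uses no Lucene or Jericho API. The class HtmlIndexSketch, its methods, and the crude tag-stripping regex are assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for a full-text index: maps each term to the paths of the
// HTML files that contain it. The HTML itself is never stored in the index.
class HtmlIndexSketch {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Crude replacement for a real HTML-to-text converter: drops tags only.
    static String toText(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // "Indexing": tokenize the extracted text and record the file path per term.
    void add(String path, String html) {
        String text = toText(html).toLowerCase(Locale.ROOT);
        for (String term : text.split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(path);
            }
        }
    }

    // "Searching": a hit is just metadata (the path) of each matching file.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        HtmlIndexSketch index = new HtmlIndexSketch();
        index.add("docs/intro.html", "<html><body><h1>Lucene basics</h1></body></html>");
        index.add("docs/swing.html", "<p>Swing and Lucene together</p>");
        System.out.println(index.search("lucene"));
    }
}
```

Because a hit identifies the original file, the application could load that file into the viewer (for instance with JEditorPane.setPage and a file URL) instead of trying to rebuild the page from the index.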
HTH.

Convert HTML page into MS word using java or any API

I want to convert an HTML page into MS Word. I want to know which APIs will be helpful and whether there is any other option to do the same.
The entire page is to be converted into .doc (e.g. if there is a table in the HTML page, a similar table must be created in the Word document).
Apache POI does not provide an option to format the Word document as in the HTML page.
I need something that can give me a completely formatted Word document.
Some of the options I have looked at are jsoup, docx4j, JasperReports, and JODConverter.
I tried parsing the HTML page using jsoup, and I get the contents of the page in my Java program. Now I need to pass these contents to a doc/docx file. Can docx4j be helpful to get a formatted docx file?
Please help.
Thank you.
I would go with Ashwini Raman's suggestion. It won't work in every scenario. In the case of a complex HTML document with many images and so on, Word will not do a good job. But for most cases it should be fine. Otherwise, there is a complex task ahead of you. You will have to parse your HTML document, using the jsoup library for example, and then use the docx4j library to create your Word document.
Links to both are here:
http://www.docx4java.org/trac/docx4j
http://jsoup.org/
Even when you do it this way, the formatting might be iffy.
To answer your original question: no, there is no ready-made library that does what you are expecting. At least I haven't come across any.
I found a workaround to do the same. First I get the parsed objects using jsoup and pass them to a document template. I am now looking for options that let me create templates easily and generate the document dynamically.
I have asked another question regarding the same.
