For a blog website, I want to know whether saving the large HTML content directly to the database is a good approach, or whether there is a better way to handle the HTML content.
On a website such as a blog, the post's title and text are typically stored in a database as plain text, without the HTML. When you view a page, the web framework fetches the text content of the page from the database and uses a template to format it into an HTML page.
You should find a book to get into all this. How all of this works is far too broad a question for this site.
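As a rough illustration of the "plain text plus template" idea above, here is a minimal sketch in plain Java. The class name `BlogPageRenderer`, the `{{title}}`/`{{body}}` placeholders, and the inline template string are all assumptions made for this example; a real web framework would use its own template engine and load the text from the database.

```java
final class BlogPageRenderer {
    // Hypothetical template; a real framework would load this from a template file.
    static final String TEMPLATE =
        "<html><head><title>{{title}}</title></head>"
        + "<body><h1>{{title}}</h1><p>{{body}}</p></body></html>";

    // Escape characters that have a special meaning in HTML, so that
    // plain text from the database is displayed literally.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Fill the template with the (escaped) title and body text.
    static String render(String title, String body) {
        return TEMPLATE
            .replace("{{title}}", escapeHtml(title))
            .replace("{{body}}", escapeHtml(body));
    }

    public static void main(String[] args) {
        // In a real application these two strings would come from the database.
        System.out.println(render("My first post", "Tom & Jerry say 1 < 2"));
    }
}
```

The point of the sketch is only that the stored data stays free of markup and the HTML is produced at view time.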
Related
I am trying to crawl data from web pages such as Amazon, but I'm only interested in the prices of the products. When I crawl a lot of data, downloading the full HTML document takes far too much time. So I would like to download only the part of the HTML document where the price is (for example, the first 300 KB). It would be even better to download only a part from the middle of the HTML document, if that is possible, but a solution for downloading only a specific number of bytes would be enough. I am using Jsoup to crawl the data. It would be great if someone is able and willing to help me :)
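One possible direction, sketched here with the JDK only rather than Jsoup: ask the server for a byte range with the standard HTTP `Range` header and, in any case, stop reading after a fixed number of bytes. The class and method names are assumptions for this example. Servers are free to ignore `Range` and send the whole document, which is why the read itself is capped as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

final class PartialDownload {
    // Read at most maxBytes from the stream and return them.
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        return in.readNBytes(maxBytes); // available since Java 11
    }

    // Request only the first maxBytes of the resource (maxBytes must be >= 1).
    static byte[] fetchPrefix(String url, int maxBytes) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        // Ask for bytes 0..maxBytes-1; a middle slice would be "bytes=start-end".
        conn.setRequestProperty("Range", "bytes=0-" + (maxBytes - 1));
        try (InputStream in = conn.getInputStream()) {
            // Cap the read in case the server ignores the Range header.
            return readPrefix(in, maxBytes);
        } finally {
            conn.disconnect();
        }
    }
}
```

A truncated document is usually not well-formed HTML, so whatever parses the result has to tolerate unclosed tags.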
I want a Java app that captures all the images (and preferably data in other tags too) from a web page and writes their links to an Excel file.
While I know my way around Excel files and Java, I was wondering whether there's any way to capture images from web pages.
A quick Google search didn't help.
Obviously there is.
Since images are referenced in the page source, you can start with the simplest solution: get the page source, retrieve the image links, and download them.
KISS ;-)
You will probably need to parse the HTML of the web page and get the links that refer to images from the respective HTML tags.
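A minimal sketch of the "retrieve image links" step, assuming the page source has already been fetched into a string. It uses a regular expression from the JDK and only handles quoted `src` attributes on `<img>` tags; the class name is made up for this example, and a real HTML parser is more robust against unusual markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImageLinkExtractor {
    // Matches <img ... src="..."> or <img ... src='...'>, case-insensitively.
    // Group 1 is the quote character, group 2 the attribute value.
    private static final Pattern IMG_SRC = Pattern.compile(
        "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Return the src values of all <img> tags, in document order.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"logo.png\"><img alt='x' src='img/photo.jpg'></p>";
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

Relative links such as `img/photo.jpg` still have to be resolved against the page URL before they are downloaded or written to the spreadsheet.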
In my previous question I got the answer that I can store the data of a small index (a few sites) in Solr without using any database (Is it possible to store data in solr?). Now I wonder whether it is possible to store the full HTML page source in Solr without using any database.
Nutch with Solr is a solution if you want to crawl websites and have them indexed.
The Nutch with Solr tutorial will get you started.
However, Nutch would not keep the original source code with the HTML tags.
You would need to develop a custom solution that downloads the HTML page and then uses the Solr Extracting Request Handler to feed Solr with the HTML file and extract the contents from it. e.g. at link
Solr uses Apache Tika to extract the contents from the uploaded HTML file.
You can also check HTMLStripCharFilterFactory if you are feeding the data as HTML text.
Dear Stack Overflow developers, I need your help. I am stuck using Apache Lucene in a Java Swing application. The problem is complex enough that I am not even sure how to ask it.
Please try to understand what my actual requirement is.
The case is simple: I have to provide HTML files so that the client can access them in a Swing application, and for the search facility I decided to use Apache Lucene indexing. This gives me the search facility, but now I want to display the contents of the HTML files that matched the search criteria. I am using Swing, and JEditorPane is the control in which I have to display the contents of the HTML file. Please suggest how I should index the HTML files and how I should get the contents of the HTML files back from the Lucene index.
The HTML files contain not only text but also links, images, etc.
Thanks in advance, hoping for your help.
Regards
In one of our projects where we employed Lucene for full text indexing & search, we handled HTML files as follows:
Stored the HTML document as is on disk (you can store in the DB as well).
Using Jericho HTMLParser's HTML->Text converter, we extracted the text, links etc., out of the HTML documents.
The lucene document has attributes that stored the metadata about the HTML file apart from the text content in the HTML in tokenized format.
Used StandardAnalyzer to keep certain tokens like email, website links as is during the tokenization process before indexing.
Upon searching the index, the hits returned contained the metadata of the HTML files that matched the criteria. So, we were able to identify the HTML content to be displayed for a given search result.
HTH.
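The workflow described above can be sketched roughly as follows. This is not Lucene code: a small in-memory inverted index stands in for the Lucene index, a crude regular expression stands in for the Jericho HTML-to-text conversion, and all class and method names are assumptions for this example. What it illustrates is the design: only the extracted text is tokenized and indexed, while the index entry carries the file path as metadata so the original HTML can be loaded for display.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class HtmlSearchIndexSketch {
    // token -> paths of the HTML files whose text contains it
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Crude tag stripping; does not handle scripts, comments, or entities.
    static String extractText(String html) {
        return html.replaceAll("(?s)<[^>]*>", " ");
    }

    // Index the text of one HTML document, remembering its path as metadata.
    void add(String path, String html) {
        String text = extractText(html).toLowerCase(Locale.ROOT);
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(path);
            }
        }
    }

    // Return the paths of all documents containing the given single term.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                                     Collections.emptySet());
    }

    public static void main(String[] args) {
        HtmlSearchIndexSketch index = new HtmlSearchIndexSketch();
        index.add("docs/intro.html", "<h1>Lucene</h1><p>A short guide</p>");
        index.add("docs/swing.html", "<p>Swing guide with <img src=\"a.png\"></p>");
        System.out.println(index.search("guide"));
    }
}
```

In a Swing client, each returned path could then be turned into a URL and handed to `JEditorPane.setPage`, so the pane renders the original file, links and images included, rather than the stripped text.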
I want to convert an HTML page into an MS Word document. I want to know which APIs would be helpful and whether there is any other way to do this.
The entire page is to be converted into a .doc file (e.g. if there is a table in the HTML page, a similar table must be created in the Word document).
Apache POI does not provide an option to format the Word document as in the HTML page.
I need something that can give me a completely formatted Word document.
Some of the options I am looking at are JSOUP, docx4j, Jasper Reports, and JODConverter.
I tried parsing the HTML page using JSOUP and I get the contents of
the page in my java program. Now I need to pass these contents to a
doc/docx file. Can docx4j be helpful to get a formatted docx file?
Please help.
Thank you.
I would go with Ashwini Raman's suggestion. It won't work in every scenario: for a complex HTML document with many images and so on, Word will not do a good job. But for most cases it should be fine. Otherwise, there is a complex task ahead of you. You will have to parse your HTML document, using the jsoup library for example, and then use the docx4j library to create your Word document.
Links to both are here:
http://www.docx4java.org/trac/docx4j
http://jsoup.org/
Even when you do it this way, the formatting might be iffy.
To answer your original question: no, there is no ready-made library that does what you are expecting. At least I haven't come across any.
I found a workaround for this. First I get the parsed objects using JSOUP and pass them to a document template. I am now looking for options that let me create templates easily and generate the document dynamically.
I have asked another question about this.