Dear Stack Overflow developers, I need your help. I am stuck using Apache Lucene in a Java Swing application. The problem is complex enough that I am not even sure how to ask it.
Please bear with me while I explain my actual requirement.
The case is simple: I have to ship HTML files so that the client can access them in a Swing application, and for the search facility I decided to use Apache Lucene indexing. This gives me search, but now I also want to display the HTML file that matched the search criteria. On the Java side I am using Swing, and JEditorPane is the control in which I have to display the contents of the HTML file. Please suggest how I should index the HTML files and how I should get the content of the HTML files back from the Lucene index.
The HTML files contain not only text but also links, images, etc.
Thanks in advance, hoping for your help.
Regards
In one of our projects where we employed Lucene for full text indexing & search, we handled HTML files as follows:
Stored the HTML document as-is on disk (you can store it in the DB as well).
Using Jericho HTMLParser's HTML-to-text converter, we extracted the text, links, etc., out of the HTML documents.
The Lucene document had fields that stored metadata about the HTML file in addition to the HTML's text content in tokenized form.
Used StandardAnalyzer to keep certain tokens, such as email addresses and website links, intact during tokenization before indexing.
Upon searching the index, the hits returned contained the metadata of the HTML files that matched the criteria, so we were able to identify the HTML content to be displayed for a given search result; a rough sketch of the flow is below.
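This is only an illustrative sketch of that flow, assuming a reasonably recent Lucene release (constructor signatures differ between versions) and Jericho's TextExtractor; the field names, index path, and JEditorPane wiring are placeholders rather than the original project's code:

```java
import net.htmlparser.jericho.Source;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import javax.swing.JEditorPane;
import java.io.File;
import java.nio.file.Paths;

public class HtmlSearchSketch {

    static IndexWriter openWriter() throws Exception {
        return new IndexWriter(FSDirectory.open(Paths.get("html-index")),
                               new IndexWriterConfig(new StandardAnalyzer()));
    }

    // Index one HTML file: the raw file stays on disk; only extracted text plus metadata go into Lucene.
    static void indexHtmlFile(IndexWriter writer, File htmlFile) throws Exception {
        Source source = new Source(htmlFile);                 // Jericho parses the HTML
        String text = source.getTextExtractor().toString();   // plain text with tags stripped

        Document doc = new Document();
        doc.add(new StringField("path", htmlFile.getAbsolutePath(), Field.Store.YES)); // metadata
        doc.add(new TextField("contents", text, Field.Store.NO));                      // tokenized text
        writer.addDocument(doc);
    }

    // Search the index and hand the best-matching HTML file to a JEditorPane for display.
    static void searchAndDisplay(String userQuery, JEditorPane editorPane) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("html-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("contents", new StandardAnalyzer()).parse(userQuery);

            TopDocs hits = searcher.search(query, 10);
            if (hits.scoreDocs.length > 0) {
                String path = searcher.doc(hits.scoreDocs[0].doc).get("path"); // metadata identifies the file
                editorPane.setContentType("text/html");
                editorPane.setPage(new File(path).toURI().toURL());            // JEditorPane renders the stored HTML
            }
        }
    }
}
```

The key point is that only the extracted text is tokenized and indexed, while the stored "path" field is the metadata that lets you load the original HTML (with its links and images) back into the JEditorPane.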
HTH.
I know there are already PDFBox and iText, but they don't support visual content extraction, and they require working offline with the PDF. Instead, I want a way to do text and visual content extraction online; I do not want to download the PDF file first and then process it. What kind of API or library is there for the Java language?
EDIT: for those who find it unclear, let me explain a bit more:
Just imagine that with any HTML parser you can parse a page online, build the DOM or SAX tree, walk through its elements, and then extract photos and text based on the content of the nodes in those trees. At the very least, for photos you can get their corresponding HTML tags, and for text you can do the same plus get the actual text. Now I want to know whether there is anything similar for PDFs: walking through the text and images without downloading the file.
Gnostice PDFOne (for Java) has a getPageElements() method that can parse a PDF page for text and image elements. Text in a PDF is not in a DOM like an HTML or XML document. Text just appears at various x-y coordinates and magically looks well formatted. However, PDFOne has some PDF text extraction methods that reconstruct those text elements into user-friendly sentences. DISCLOSURE: I work for the company that makes this library.
PDFImageStream can do that. There is a free version with only one restriction: it can only be used in single-threaded applications.
I am trying to crawl data from web pages such as Amazon, but I am only interested in the price of the products. When I try to crawl a lot of data, it takes far too long to download the full HTML document. So I would like to download only the part of the HTML document where the price appears (say, the first 300 KB). It would be even better to download only a part from the middle of the HTML document if that is possible, but a way to download only a specific number of bytes would be enough. I am using Jsoup to crawl the data. It would be great if someone is able and willing to help me :)
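For what it's worth, here is a minimal sketch of that kind of partial fetch, assuming jsoup's Connection.maxBodySize(...) is an acceptable cap. It limits how many bytes jsoup reads from the response body, so it covers the "first 300 KB" case, but it cannot jump to the middle of the document (that would need HTTP range requests, which the server has to support). The URL and CSS selector are placeholders:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PartialFetchSketch {
    public static void main(String[] args) throws Exception {
        // Stop reading the response body after roughly 300 KB; jsoup parses whatever was received.
        Document doc = Jsoup.connect("https://www.example.com/some-product-page") // placeholder URL
                .userAgent("Mozilla/5.0")
                .maxBodySize(300 * 1024)   // 0 would mean "no limit"
                .get();

        // Placeholder selector: the real price element depends on the page's markup.
        String price = doc.select(".price").text();
        System.out.println("Price: " + price);
    }
}
```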
In my previous question I got the answer that I can store a small index (a few sites) in Solr without using any database (Is it possible to store data in solr?). I wonder: is it possible to store the full HTML page source code in Solr without using any database?
Nutch with Solr is a solution if you want to crawl websites and have them indexed.
The Nutch with Solr Tutorial will get you started.
However, Nutch would not preserve the original HTML source, with its tags, in Solr.
You would need to develop a custom solution: download the HTML page yourself, then use the Solr Extracting Request Handler to feed Solr the HTML file and extract contents from it (e.g. at link).
Solr uses Apache Tika to extract contents from the uploaded HTML file.
You can also check HTMLStripCharFilterFactory if you are feeding the data as HTML text.
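Purely as an illustration, here is a sketch of feeding an HTML file to the Extracting Request Handler with SolrJ; the Solr URL, core name, unique-key literal, and target field names are assumptions, while /update/extract is the handler path configured in the example solrconfig.xml:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

import java.io.File;

public class SolrHtmlIndexSketch {
    public static void main(String[] args) throws Exception {
        // Assumed Solr URL and core name.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("page.html"), "text/html");   // Tika extracts the text server-side
        req.setParam("literal.id", "page-1");              // assumed unique key value
        req.setParam("uprefix", "attr_");                  // catch-all prefix for unmapped fields
        req.setParam("fmap.content", "text");              // map the extracted body to an assumed "text" field
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(req);
        solr.close();
    }
}
```

If you also need the raw HTML source kept in Solr (the original question), you would read the file yourself and send it as a stored field in a normal update, with HTMLStripCharFilterFactory applied to the indexed copy.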
I want to convert an HTML page into MS Word. I want to know which APIs will be helpful, and also whether there is any other option to do the same.
The entire page is to be converted into .doc (e.g., if there is a table in the HTML page, a similar table must be created in the Word doc).
Apache POI does not provide an option to format the Word document the way the HTML page is formatted.
I need something that can give me a completely formatted Word document.
Some of the things I have looked at are JSOUP, docx4j, JasperReports, and JODConverter.
I tried parsing the HTML page using JSOUP, and I get the contents of the page in my Java program. Now I need to pass these contents to a doc/docx file. Can docx4j help me get a formatted docx file?
Please help.
Thank you.
I would go with Ashwini Raman's suggestion, though it won't work in every scenario. In the case of a complex HTML document with many images and the like, Word will not do a good job, but for most cases it should be fine. Otherwise, there is a complex task ahead of you: you will have to parse your HTML document using the jsoup library, for example, and then use the docx4j library to create your Word document.
Links to both are here:
http://www.docx4java.org/trac/docx4j
http://jsoup.org/
Even when you do it this way, the formatting might be iffy.
To answer your original question: no, there is no ready-made library that does what you are expecting. At least, I haven't come across any.
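For the record, here is a minimal sketch of that jsoup + docx4j route, assuming the docx4j-ImportXHTML module is on the classpath (its XHTMLImporterImpl converts XHTML into WordML). The input and output paths are placeholders, and complex CSS will still come through imperfectly:

```java
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;

public class HtmlToDocxSketch {
    public static void main(String[] args) throws Exception {
        // Parse and tidy the HTML with jsoup, emitting well-formed XHTML for docx4j.
        Document html = Jsoup.parse(new File("input.html"), "UTF-8");      // placeholder input
        html.outputSettings().syntax(Document.OutputSettings.Syntax.xml);

        WordprocessingMLPackage wordPackage = WordprocessingMLPackage.createPackage();
        XHTMLImporterImpl importer = new XHTMLImporterImpl(wordPackage);

        // Convert the XHTML string to WordML content and append it to the document body.
        wordPackage.getMainDocumentPart().getContent()
                   .addAll(importer.convert(html.outerHtml(), html.baseUri()));

        wordPackage.save(new File("output.docx"));                         // placeholder output
    }
}
```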
I found a workaround to do the same. First I get the parsed objects using JSOUP and pass them to a document template. I am now looking for options that make it easy to create templates and generate the document dynamically.
I have asked another question about this.
After cleaning a folder full of HTML files with Tidy, how can the tables' content be extracted for further processing?
I've used BeautifulSoup for such things in the past with great success.
It depends on what sort of processing you want to do. You can tell Tidy to generate XHTML, which is a type of XML, which means you can use all the usual XML tools like XSLT and XQuery on the results.
If you want to process them in Microsoft Excel, then you should be able to slice the table out of the HTML and put it in a file, then open that file in Excel: it will happily convert an HTML table into a spreadsheet page. You could then save it as CSV or as an Excel workbook, etc. (You can even use this on a web server: return an HTML table but set the Content-Type header to application/vnd.ms-excel, and Excel will open and import the table and turn it into a spreadsheet.)
If you want CSV to feed into a database, then you could go via Excel as before, or, if you want to automate the process, you could write a program that uses the XML-navigating API of your choice to iterate over the table rows and save them as CSV. Python's ElementTree and csv modules would make this pretty easy.
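As an aside, here is a tiny sketch of that server-side Content-Type trick, assuming a plain servlet (the table rows are stand-ins for whatever your real page produces):

```java
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.io.PrintWriter;

// Serving an HTML table with an Excel content type makes the browser hand it to Excel.
public class TableAsExcelServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("application/vnd.ms-excel");
        PrintWriter out = resp.getWriter();
        out.println("<table>");
        out.println("<tr><th>Name</th><th>Value</th></tr>");   // placeholder rows
        out.println("<tr><td>alpha</td><td>1</td></tr>");
        out.println("</table>");
    }
}
```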
After reviewing the suggestions, I wound up using HtmlUnit.
With HtmlUnit, I was able to write Java code to open each HTML file in the folder, navigate to the TABLE tag, query each column's content, and extract the data I needed to create a CSV file.
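A condensed sketch of that approach, assuming HtmlUnit 2.x (recent enough that WebClient is AutoCloseable); the folder path, output file, and the choice of the first table on each page are assumptions:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableCell;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;

import java.io.File;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class TablesToCsvSketch {
    public static void main(String[] args) throws Exception {
        File folder = new File("cleaned-html");              // placeholder folder of tidied files
        try (WebClient webClient = new WebClient();
             PrintWriter csv = new PrintWriter("tables.csv", "UTF-8")) {
            webClient.getOptions().setJavaScriptEnabled(false);

            for (File file : folder.listFiles((dir, name) -> name.endsWith(".html"))) {
                HtmlPage page = webClient.getPage(file.toURI().toURL());
                HtmlTable table = page.getFirstByXPath("//table");   // assumes one table of interest per file
                if (table == null) continue;

                for (HtmlTableRow row : table.getRows()) {
                    List<String> cells = new ArrayList<>();
                    for (HtmlTableCell cell : row.getCells()) {
                        cells.add(cell.asText().replace(",", " "));  // naive CSV escaping
                    }
                    csv.println(String.join(",", cells));
                }
            }
        }
    }
}
```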
In .NET you could use HTMLAgilityPack.
See this previous question on StackOverflow for more information.
If you want to extract the content from the HTML markup, you should use some type of HTML parser. To that end, there are plenty out there, and here are two that might suit your needs:
http://jtidy.sourceforge.net/
http://htmlparser.sourceforge.net/
Iterate through the text and use regular expressions :)