Indexing the content from URL in Java (Search Engine) - java

I would like to have some help in creating an index for the content that I saved in my database after crawling the URLs and removing the tags. I have been looking for solutions but couldn't find anything helpful yet. The thing is our instructor told us to not use Apache Lucene. I have no idea how to implement this and would like to get help in understanding this better.
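One common way to index text without Lucene is an inverted index: a map from each term to the set of documents that contain it. The following is a minimal in-memory sketch, not a complete search engine. The class name `InvertedIndex`, the integer document IDs (assumed here to be the primary keys of the rows holding the crawled text), and the simple tokenizer are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Minimal in-memory inverted index: term -> sorted set of document IDs. */
final class InvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Splits text into lowercase terms on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Indexes one document; docId is assumed to be the row's key in the database. */
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the IDs of documents that contain every term of the query (AND semantics). */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokenize(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // copy so the index itself is never modified
            } else {
                result.retainAll(docs);         // intersect with the next term's documents
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

To use it, call `add(id, text)` once per stored page and then `search("some words")` to get the matching IDs, which can be looked up in the database again. Ranking, stemming, and persisting the index are left out of this sketch.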

Related

Accessing a Property of an odt File?

I recently asked how I can access the metadata of an ODT file, and the most recommended way to do this was to use the Apache Tika library.
I tried to find out how to use this library in detail.
Reading through all the material, I find Tika pretty hard to use (at least for someone who is new to programming, like me).
So my question is whether someone has a link to a site with a step-by-step tutorial on how to use it, or can suggest a different method. I really don't want you to do my work for me, but I have been working on this for a long time without getting far, so I think it is time to ask for help here. :(
I want to access and set only one property of my file, which is the "DocumentID".
Thank you for helping, and sorry for my possibly bad English; it is not my native language.
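As a possible alternative to Tika, an ODT file is a ZIP archive whose metadata lives in an entry named `meta.xml`, so the standard library can read it. The sketch below assumes that "DocumentID" is stored as a custom (user-defined) property, i.e. as a `meta:user-defined` element; that is an assumption about the file, and the class name `OdtUserProperty` is made up for illustration. It only reads the value. Setting it would additionally require rewriting the archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class OdtUserProperty {
    // Namespace of the "meta:" prefix in OpenDocument meta.xml files.
    private static final String META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0";

    /** Returns the value of the named user-defined property, or null if it is absent. */
    static String read(InputStream odt, String name)
            throws IOException, ParserConfigurationException, SAXException {
        try (ZipInputStream zip = new ZipInputStream(odt)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"meta.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry into memory so the XML parser cannot close the zip stream.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buf.toByteArray()));
                NodeList props = doc.getElementsByTagNameNS(META_NS, "user-defined");
                for (int i = 0; i < props.getLength(); i++) {
                    Element e = (Element) props.item(i);
                    if (name.equals(e.getAttributeNS(META_NS, "name"))) {
                        return e.getTextContent();
                    }
                }
            }
        }
        return null;
    }
}
```

A call such as `OdtUserProperty.read(new FileInputStream("file.odt"), "DocumentID")` would then return the stored value, where `file.odt` is a placeholder path.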

what are the steps to make a word search for a website?

I want to write a word search that connects to a specific (huge) website, takes a word from the user, searches the site, and returns the strings that contain the word. It should be written in Java as an applet. I have read some tutorials and questions on this, and as I understand it, this is what has to be done:
1. Connect to the website, get its content, and save it to a string. (This should be done with a web crawler built from my own code for connecting to the website and saving the content to a string, plus the jsoup library to parse the HTML.)
2. Save the data to a database (in my case a NoSQL database).
3. Index the data in the database.
4. Query the database to show the results.
5. Make a UI for showing the search results (I use Swing's JApplet).
Now my questions are:
1. Have I understood the steps correctly? (Please explain in detail whether each step is necessary or unnecessary.)
2. Is it necessary to have a database?
Note: I want to implement it myself, without using ready-made tools such as Lucene, Nutch, Solr, ...
Edit: three people told me an applet is not suitable for such a thing, so what should the replacement be?
Many thanks for your help.
You should look at using Lucene, as it does most of what you want here.
You should not use applets.
For a small data set, a database should be sufficient. Databases like MySQL come with full-text search functions.
For a bigger data set, you might want to consider Lucene or Solr.
That is one way to implement this. Another (simpler) way would be to use an existing text search / indexing engine like Lucene / Solr. Going to the effort of reimplementing the "text search / indexing" wheel using database technology strikes me as a waste of effort, unless you have a sound technical reason for doing so.
You do need to have some kind of database, because indexing a website on the fly would simply not work. Lucene will handle that.
I think your choice of Java applets to build the UI is a bad idea. There are other technologies that give results that are as good or better ... without the security risk of a Java browser plugin.
Finally, another way to make your website searchable is to get Google to do it for you. Make your website content indexable, and then use Google's search APIs.
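For the do-it-yourself route the question asks about, the last step (returning the strings that contain the word) can be sketched as a scan over a page's stored text. This is a naive illustration under stated assumptions: the class name `WordSearch` is made up, the sentence splitting is deliberately simple, and for a huge site the pages to scan would first be narrowed down with an index rather than scanned one by one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class WordSearch {
    /** Returns the sentences of text that contain word as a whole word, ignoring case. */
    static List<String> sentencesContaining(String text, String word) {
        // \b anchors keep "java" from matching inside "javascript".
        Pattern whole = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        List<String> hits = new ArrayList<>();
        // Naive sentence split: break after '.', '!' or '?' when whitespace follows.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            if (whole.matcher(sentence).find()) {
                hits.add(sentence.trim());
            }
        }
        return hits;
    }
}
```

The returned list is what a UI would display as search results for one page.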

Where to find correct Java docs?

I'm using a new set of Java tools that I'm not entirely familiar with, for Birt report writing.
They're discussed here:
http://www.eclipse.org/birt/phoenix/deploy/designEngineAPI.php#concepts
Now normally the Oracle API site seems to have and explain everything I need, but I've been unable to locate anything Birt related there.
The most promising link I came up with through Google, and the above link was:
http://dev.eclipse.org/viewcvs/viewvc.cgi/source/org.eclipse.birt.report.model/?hideattic=1&root=BIRT_Project
However, I'm finding that page difficult to navigate. I couldn't find the grid, label, and image classes and methods that were mentioned in the first page's example. Have I missed them, or are they in one of those folders? Or can they be found on the Oracle site?
Additionally, I will be looking for a class that allows a JDBC connection to SQLite. I haven't looked yet, but if anyone can recommend a good one ahead of time, that would be helpful.
Try this link:
BIRT Programmer's Reference

Is it possible to use flashvars with JBoss?

I'm part of a team developing a product using JSF 2.0 and I was asked to investigate the possibility of including FusionCharts free in the app. I have tried different ways of inserting a simple chart in a JSF page but with no luck.
One of the methods involves using the OBJECT and EMBED elements, but when I try to use them I get a "null source" error from JBoss. From what I could find online (through Google), I am under the impression that 'flashvars' isn't quite compatible with JBoss. Is anyone here able to confirm this? If this is the case, what workaround would you suggest?
The other approaches I found online showed neither the chart nor an error message.
Thanks in advance.
It is hard to tell what the other methods quoted were, but the preferred way of embedding flash is to use swfobject, a javascript library that does not require any special tags (nor server-side support).
It boils down to preparing a div for your flash content, giving it an id, and then calling a single function that takes the swf file url, the size of the clip, the flashvars and so on. The javascript could easily contain EL expressions.
You might want to read this:
http://www.adobe.com/devnet/flashplayer/articles/swfobject.html
but skip to the "Under the hood: dynamic publishing" section; you will not be using static publishing or the GUI.
A probable solution is to pass the flashvars values as a query string on the URL that loads the chart swf file.
e.g.,
Column3D.swf?debugMode=1&dataURL=mydata.xml&registerWithJS=1&chartWidth=200&chartHeight=300
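On the server side, a URL like the one above could be assembled from a map of parameters so that values are URL-encoded properly. This is only a sketch of that idea. The class name `FlashVarsUrl` is hypothetical, and the parameter names are taken from the example URL rather than from any verified FusionCharts reference.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

final class FlashVarsUrl {
    /** Appends params to swfUrl as a URL-encoded query string, in the map's iteration order. */
    static String build(String swfUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(swfUrl);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(encode(e.getKey()))
              .append('=')
              .append(encode(e.getValue()));
            separator = '&';   // every parameter after the first is joined with '&'
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(ex);
        }
    }
}
```

Passing a `LinkedHashMap` keeps the parameters in insertion order. A value that itself contains `?` or `=`, such as a data URL with its own query string, is escaped instead of being misread as another parameter.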

How to get Blogger post comments for URL using API or anything else?

I have been looking for a way to get the comments from a Blogger blog if I have a regular URL. I know you can get the blogID by scraping the html, which is somewhat unpleasant but has a few standard ways to get it. The problem is that I have not been able to find a way to get the comments for a specific post if I have only the post URL and the blogID. The postID cannot be reliably scraped from the HTML as far as I can tell, and it seems like the postID is required to get the comments for a single post.
Also, the API call that gets the most recent posts for a blogID is only helpful if the post is one of the most recent 10 or 15, so if it is a slightly older post, I probably cannot use that option. Does anyone know of a decent method to do this? I am mostly looking for a Java solution, but if there is a solution in a different language I would gladly port it to Java.
I just wanted to document my findings given that this question seems to be asked often and rarely answered.
Basically, to get the comments for a single Blogger URL you need the postID. If you have the postID, you can go through the Blogger API. If you only have the URL of the post, there seems to be only one somewhat reliable option: looking for the default post comments feed. To find this, look in the page's HTML for the link tag whose URL has the form http://.../feeds/<postID>/comments/default.
In particular, the Java regex that works for this is:
Pattern p = Pattern.compile("http://.*/feeds/[0-9]+/comments/default");
If this link tag does not exist, then the blog likely has a third-party commenting system installed like Disqus, Echo or IntenseDebate.
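A sketch of how such a pattern might be applied to a downloaded page follows. The regex here is a tightened variant and not the exact one above: accepting https, stopping at quotes and whitespace, and capturing the digits as a group are all assumptions made for illustration, as is the class name `BloggerCommentsFeed`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BloggerCommentsFeed {
    // Variant of the pattern above (an assumption, not the answer's exact regex):
    // allows https, cannot run past a quote or whitespace, and captures the digits.
    private static final Pattern FEED =
            Pattern.compile("https?://[^\"'\\s<>]+/feeds/([0-9]+)/comments/default");

    /** Returns the first per-post comments feed URL found in the page HTML, if any. */
    static Optional<String> feedUrl(String html) {
        Matcher m = FEED.matcher(html);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    /** Returns the digits between /feeds/ and /comments/default, if a feed URL is found. */
    static Optional<String> postId(String html) {
        Matcher m = FEED.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

An empty result from `feedUrl` corresponds to the case described above where the link tag does not exist.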
