How to extract links from a webpage using JSP? - java

My requirement is to extract all links (the "a href" values) from a web page dynamically. I am using JSP. To be more specific, I am building a meta search engine in JSP, so when the user enters a query, I have to extract the links from the search results pages of Yahoo, Ask, Google, Mamma, etc.
To fetch a page as a string, the code I am using right now is:
try {
    String sUrl_yahoo = "http://www.mamma.com/result.php?type=web&q=hai+bird&j_q=&l=";

    String nextLine;
    String webPage;
    StringBuffer wPage;
    String sSql;

    java.net.URL siteURL = new java.net.URL(sUrl_yahoo);
    java.net.URLConnection siteConn = siteURL.openConnection();
    java.io.BufferedReader in = new java.io.BufferedReader(
            new java.io.InputStreamReader(siteConn.getInputStream()));

    wPage = new StringBuffer(30 * 1024);
    while ((nextLine = in.readLine()) != null) {
        wPage.append(nextLine);
    }
    in.close();

    webPage = wPage.toString();
    out.println(webPage);
}
catch (Exception e) {
    out.println("Error" + e);
}
Now, my question is: can you suggest some way to extract the links from the String webPage? Or is there some other way to extract those links? I would prefer doing it without using any external packages.

One quick solution would be to use a regex Matcher object to pull the URLs out (note the character class must be a-zA-Z, not a-z A-z):
Pattern p = Pattern.compile("<a +href=\"([a-zA-Z0-9\\:\\-\\/\\.]+)\">");
Matcher m = p.matcher(webPage);
ArrayList<String> foundUrls = new ArrayList<String>();
while (m.find()) {
    foundUrls.add(m.group(1));
}
You might have to play around with the URL pattern a little to make it more airtight, but this is a quick and dirty solution without using external libraries.
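If you want it a bit more airtight while staying in the standard library, a slightly more tolerant pattern can allow other attributes before href and either quote style. This is only a sketch (a regex is still a heuristic, not an HTML parser), and the class and method names are mine:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    public static List<String> extractLinks(String webPage) {
        // (?i)    : case-insensitive, so <A HREF=...> matches too
        // [^>]*?  : skips attributes such as class="..." before href
        // ["']    : accepts either quote character around the URL
        Pattern p = Pattern.compile("(?i)<a\\b[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']");
        Matcher m = p.matcher(webPage);
        List<String> foundUrls = new ArrayList<String>();
        while (m.find()) {
            foundUrls.add(m.group(1)); // group 1 is the URL between the quotes
        }
        return foundUrls;
    }
}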

Related

Upload documents into Watson's Retrieve & Rank service

I'm implementing a solution using Watson's Retrieve & Rank service.
When I use the tooling interface, I upload my documents and they appear as a list, where I can click on any of them to open up all the titles that are inside the document (answer units), as you can see in Picture 1 and Picture 2.
When I try to upload documents via Java, it won't recognize the documents; they get uploaded in parts (answer units as documents), each part as a new document.
I would like to know how I can upload each of my documents as one entire document and not only parts of it.
Here's the code for the upload functions in Java:
public Answers ConvertToUnits(File doc, String collection) throws ParseException, SolrServerException, IOException {
    DC.setUsernameAndPassword(USERNAME, PASSWORD);
    Answers response = DC.convertDocumentToAnswer(doc).execute();
    SolrInputDocument newdoc = new SolrInputDocument();
    WatsonProcessing wp = new WatsonProcessing();
    Collection<SolrInputDocument> newdocs = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < response.getAnswerUnits().size(); i++) {
        String titulo = response.getAnswerUnits().get(i).getTitle();
        String id = response.getAnswerUnits().get(i).getId();
        newdoc.addField("title", titulo);
        for (int j = 0; j < response.getAnswerUnits().get(i).getContent().size(); j++) {
            String texto = response.getAnswerUnits().get(i).getContent().get(j).getText();
            newdoc.addField("body", texto);
        }
        wp.IndexDocument(newdoc, collection);
        newdoc.clear();
    }
    wp.ComitChanges(collection);
    return response;
}
public void IndexDocument(SolrInputDocument newdoc, String collection) throws SolrServerException, IOException {
    UpdateRequest update = new UpdateRequest();
    update.add(newdoc);
    UpdateResponse addResponse = solrClient.add(collection, newdoc);
}
You can specify config options in this line:
Answers response = DC.convertDocumentToAnswer(doc).execute();
I think something like this should do the trick:
String configAsString = "{ \"conversion_target\":\"answer_units\", \"answer_units\": { \"selector_tags\": [] } }";
JsonParser jsonParser = new JsonParser();
JsonObject customConfig = jsonParser.parse(configAsString).getAsJsonObject();
Answers response = DC.convertDocumentToAnswer(doc, null, customConfig).execute();
I've not tried it out, so I might not have the syntax exactly right, but hopefully this will get you on the right track.
Essentially, what I'm trying to do here is use the selector_tags option in the config (see https://www.ibm.com/watson/developercloud/doc/document-conversion/customizing.shtml#htmlau for the documentation on this) to specify which tags the document should be split on. Specifying an empty list with no tags results in the document not being split at all, so it comes out as a single answer unit, as you want.
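For reference, the escaped configAsString above is just this JSON, formatted for readability:

{
  "conversion_target": "answer_units",
  "answer_units": {
    "selector_tags": []
  }
}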
(Note that you can do this through the tooling interface, too - by unticking the "Split my documents up into individual answers for me" option when you upload the document)

Search Box for JPanel

I am in the middle of creating an app that allows users to apply for job positions and upload their CVs. I'm currently stuck on trying to make a search box for the admin to be able to search for keywords. The app will then look through all the CVs, and if it finds such keywords it will show a list of CVs that contain the keyword. I am fairly new to GUI design and app creation, so I'm not sure how to go about doing it. I wish to do it in Java and am using the Eclipse WindowBuilder to help me design it. Any help will be greatly appreciated: hints, advice, anything. Thank you.
Well, this is not the right design approach, as a real-time search of words across all files in a given folder will be slow and not sustainable in the long run. Ideally you should have indexed all CVs for keywords. The search should run on the index and then fetch the associated CVs for the matching entries (think of indexes as similar to tags). There are many options for indexing: simple DB indexing, Apache Lucene, or the following steps to create an index using maps (a sketch follows the list):
Create a map Map<String, List<File>> for keeping the association of keywords to files
Iterate through all files, and for each word in each file, add that file to the list corresponding to that word in your index map
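A minimal sketch of that map-based index (the class and method names are mine, and it assumes plain-text CV files; a PDF or Word CV would need a text extractor first):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CvIndex {
    // keyword -> files containing that keyword
    private final Map<String, List<File>> index = new HashMap<String, List<File>>();

    // Build the index once; searches afterwards never re-read the files.
    public void build(File dir) throws IOException {
        File[] files = dir.listFiles();
        if (files == null) return;
        for (File f : files) {
            if (!f.isFile()) continue;
            String text = new String(Files.readAllBytes(f.toPath()));
            for (String word : text.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                List<File> hits = index.get(word);
                if (hits == null) {
                    hits = new ArrayList<File>();
                    index.put(word, hits);
                }
                if (!hits.contains(f)) hits.add(f);
            }
        }
    }

    // A lookup is now a single map hit.
    public List<File> search(String keyword) {
        List<File> hits = index.get(keyword.toLowerCase());
        return hits != null ? hits : new ArrayList<File>();
    }
}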
Here is Java code which will work for you, but I would still suggest changing your design approach and using indexes.
File dir = new File("Folder for CV's");
if (dir.exists()) {
    Pattern p = Pattern.compile("Java");
    ArrayList<String> list = new ArrayList<String>(); // list of CVs
    for (File f : dir.listFiles()) {
        if (!f.isFile()) continue;
        try {
            FileInputStream fis = new FileInputStream(f);
            byte[] data = new byte[fis.available()];
            fis.read(data);
            String text = new String(data);
            Matcher m = p.matcher(text);
            if (m.find()) {
                list.add(f.getName()); // add file to found-keyword list
            }
            fis.close();
        }
        catch (Exception e) {
            System.out.print("\n\t Error processing file : " + f.getName());
        }
    }
    System.out.print("\n\t List : " + list); // list of files containing keyword
} // process only if the directory exists
else {
    System.out.print("\n Directory doesn't exist.");
}
This gives you the list of files to show for "Java". As I said, use indexes :)
Thanks for taking the time to look into my problem.
I have actually come up with a solution of my own. It is probably amateurish, but it works for me.
JButton btnSearch = new JButton("Search");
btnSearch.addActionListener(new ActionListener() {
    public void actionPerformed(ActionEvent arg0) {
        list.clear();
        String s = SearchBox.getText();
        int i = 0, present = 0;
        int id;
        try {
            Class.forName(driver).newInstance();
            Connection conn = DriverManager.getConnection(url + dbName, userName, password);
            Statement st = conn.createStatement();
            ResultSet res = st.executeQuery("SELECT * FROM javaapp.test");
            while (res.next()) {
                i = 0;
                present = 0;
                while (i < 9) {
                    String out = res.getString(search[i]);
                    if (out.toLowerCase().contains(s.toLowerCase())) {
                        present = 1;
                        break;
                    }
                    i++;
                }
                if (tglbtnNormalshortlist.isSelected()) {
                    if (present == 1 && res.getInt("Shortlist") == 1) {
                        id = res.getInt("Candidate");
                        String print = res.getString("Name");
                        list.addElement(print + " " + id);
                    }
                }
                else {
                    if (present == 1 && res.getInt("Shortlist") == 0) {
                        id = res.getInt("Candidate");
                        String print = res.getString("Name");
                        list.addElement(print + " " + id);
                    }
                }
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
});
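A possible refinement (a sketch only, assuming the same javaapp.test table and the Candidate, Name and Shortlist columns used above, with conn, s, tglbtnNormalshortlist and list reused from the code above): push the filtering into SQL with a PreparedStatement, so only matching rows come back instead of scanning every row in Java. It only searches the Name column here; each of the other searched columns would need its own LIKE clause:

// Hypothetical refinement: let the database do the matching.
String sql = "SELECT Candidate, Name FROM javaapp.test "
           + "WHERE LOWER(Name) LIKE ? AND Shortlist = ?";
PreparedStatement ps = conn.prepareStatement(sql);
ps.setString(1, "%" + s.toLowerCase() + "%");             // keyword from the search box
ps.setInt(2, tglbtnNormalshortlist.isSelected() ? 1 : 0); // same toggle as above
ResultSet res = ps.executeQuery();
while (res.next()) {
    list.addElement(res.getString("Name") + " " + res.getInt("Candidate"));
}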

Extraction of HTML content from a Walmart HTML page

I have written the code below. I need to extract the price from the URL below. I am writing the code in Java.
http://www.walmart.com/ip/VIZIO-E70-C3-70-1080p-240Hz-Class-LED-Smart-HDTV/43310251
String regEx = "<span\\s+class=\"sup\">.+</span>[\n]*(\\d+(,)*\\d+)[\n*]<span\\s+class=\"visuallyhidden\">[.]*</span>[\n]*<span\\s+class=\"sup\">(\\d+)";
Pattern p1 = Pattern.compile(regEx);
System.out.println("Vikash");
while ((line = in.readLine()) != null) {
    sb.append(line + "\n");
}
m = p1.matcher(sb);
while (!m.hitEnd()) {
    if (m.find()) {
        System.out.println("$" + m.group());
    }
}
If you can't use the site's APIs, you should use a framework for this. Take a look at http://jsoup.org
It will generate a structured document and allows you to iterate over ids, classes, tags and so on.
E.g. getElementsByClass("sup"). I can provide some example code when I'm back at my desktop.
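In the meantime, here is a minimal sketch with jsoup. The selectors are assumptions based on the class names in the question's regex; Walmart's real markup may differ or be rendered by JavaScript:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PriceScraper {
    public static void main(String[] args) throws Exception {
        String url = "http://www.walmart.com/ip/VIZIO-E70-C3-70-1080p-240Hz-Class-LED-Smart-HDTV/43310251";
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // some sites reject Java's default agent
                .get();
        // Assumed structure: the "sup" spans from the question's regex sit
        // inside a parent element whose visible text is the full price.
        Element sup = doc.getElementsByClass("sup").first();
        if (sup != null && sup.parent() != null) {
            System.out.println("Price text: " + sup.parent().text());
        } else {
            System.out.println("Price element not found; the markup may have changed.");
        }
    }
}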

Save file from a website with Java

I'm trying to build a jsoup-based Java app to automatically download English subtitles for films (I'm lazy, I know; it was inspired by a similar Python-based app). It's supposed to ask you the name of the film and then download an English subtitle for it from Subscene.
I can make it reach the download link, but I get an "Unhandled content type" error when I try to 'go' to that link. Here's my code:
public static void main(String[] args) {
    try {
        String videoName = JOptionPane.showInputDialog("Title: ");
        subscene(videoName);
    }
    catch (Exception e) {
        System.out.println(e.getMessage());
    }
}

public static void subscene(String videoName) {
    try {
        String siteName = "http://www.subscene.com";
        String[] splits = videoName.split("\\s+");
        String codeName = "";
        String text = "";
        if (splits.length > 1) {
            for (int i = 0; i < splits.length; i++) {
                codeName = codeName + splits[i] + "-";
            }
            videoName = codeName.substring(0, videoName.length());
        }
        System.out.println("videoName is " + videoName);
        // String url = "http://www.subscene.com/subtitles/" + videoName + "/english";
        String url = "http://www.subscene.com/subtitles/title?q=" + videoName + "&l=";
        System.out.println("url is " + url);
        Document doc = Jsoup.connect(url).get();
        Element exact = doc.select("h2.exact").first();
        Element yuel = exact.nextElementSibling();
        Elements lis = yuel.children();
        System.out.println(lis.first().children().text());
        String hRef = lis.select("div.title > a").attr("href");
        hRef = siteName + hRef + "/english";
        System.out.println("hRef is " + hRef);
        doc = Jsoup.connect(hRef).get();
        Element nonHI = doc.select("td.a40").first();
        Element papa = nonHI.parent();
        Element link = papa.select("a").first();
        text = link.text();
        System.out.println("Subtitle is " + text);
        hRef = link.attr("href");
        hRef = siteName + hRef;
        Document subDownloadPage = Jsoup.connect(hRef).get();
        hRef = siteName + subDownloadPage.select("a#downloadButton").attr("href");
        Jsoup.connect(hRef).get(); // <-- Here's where the problem lies
    }
    catch (java.io.IOException e) {
        System.out.println(e.getMessage());
    }
}
Can someone please help me so I don't have to manually download subs?
I just found out that using
java.awt.Desktop.getDesktop().browse(java.net.URI.create(hRef));
instead of
Jsoup.connect(hRef).get();
downloads the file after prompting me to save it. But I don't want to be prompted because this way I won't be able to read the name of the downloaded zip file (I want to unzip it after saving using java).
Assuming that your files are small, you can do it like this. Note that you can tell Jsoup to ignore the content type.
// get the file content
Connection connection = Jsoup.connect(path);
connection.timeout(5000);
Connection.Response resultImageResponse = connection.ignoreContentType(true).execute();
// save to file
FileOutputStream out = new FileOutputStream(localFile);
out.write(resultImageResponse.bodyAsBytes());
out.close();
I would recommend verifying the content before saving, because some servers will just return an HTML page when the file cannot be found, i.e. a broken hyperlink.
...
String body = resultImageResponse.body();
if (body == null || body.toLowerCase().contains("<body>")) {
    throw new IllegalStateException("invalid file content");
}
...
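Since the question's goal is to unzip the archive after saving it, here is a follow-up sketch using java.util.zip (localFile is the zip saved above; Unzipper and destDir are hypothetical names of mine). No save dialog is involved, and the entry names are read straight from the archive:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class Unzipper {
    public static void unzip(File localFile, File destDir) throws IOException {
        ZipInputStream zis = new ZipInputStream(new FileInputStream(localFile));
        ZipEntry entry;
        byte[] buffer = new byte[4096];
        while ((entry = zis.getNextEntry()) != null) {
            if (entry.isDirectory()) continue;
            System.out.println("Extracting " + entry.getName()); // e.g. the .srt inside
            FileOutputStream fos = new FileOutputStream(new File(destDir, entry.getName()));
            int len;
            while ((len = zis.read(buffer)) > 0) {
                fos.write(buffer, 0, len);
            }
            fos.close();
            zis.closeEntry();
        }
        zis.close();
    }
}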
Here:
Document subDownloadPage = Jsoup.connect(hRef).get();
hRef = siteName+subDownloadPage.select("a#downloadButton").attr("href");
//specifically here
Jsoup.connect(hRef).get();
Looks like jsoup expects the result of Jsoup.connect(hRef) to be HTML or some text that it's able to parse; that's why the message states:
Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml
I followed the execution of your code manually, and the last URL you're trying to access returns a content type of application/x-zip-compressed, hence the exception.
In order to download this file, you should use a different approach. You could use the old but still useful URLConnection or URL classes, or a third-party library like Apache HttpComponents, to fire a GET request, retrieve the result as an InputStream, wrap it in a buffered stream, and write your file to disk.
Here's an example of doing this using URL:
URL url = new URL(hRef);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\foo.zip"));
final int BUFFER_SIZE = 1024 * 4;
byte[] buffer = new byte[BUFFER_SIZE];
BufferedInputStream bis = new BufferedInputStream(in);
int length;
while ((length = bis.read(buffer)) > 0) {
    out.write(buffer, 0, length);
}
out.close();
in.close();

Get the link from an HTML file

I use HtmlCleaner to parse HTML files. Here is an example from an HTML file:
.......<div class="name">Name</div>;......
I get the word Name using this construction in my code:
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setAllowHtmlInsideAttributes(true);
props.setAllowMultiWordAttributes(true);
props.setRecognizeUnicodeChars(true);
props.setOmitComments(true);
rootNode = cleaner.clean(htmlPage);
TagNode linkElements[] = rootNode.getElementsByName("div", true);
for (int i = 0; linkElements != null && i < linkElements.length; i++)
{
    String classType = linkElements.getAttributeByName("name");
    if (classType != null)
    {
        if (classType.equals(class) && classType.equals(CSSClassname)) { linkList.add(linkElements); }
    }
    System.out.println("TagNode" + linkElements.getText());
    linkList.add(linkElements);
}
and then add all of these names to the list view using
TagNode = linkelements.getText().toString();
But I don't understand how I can get the link in my example. I want to get the link http://exxample.com but I don't know what to do.
Please help me. I read the tutorial and used the function, but I couldn't make it work.
P.S. Sorry for my bad English
I don't use HtmlCleaner, but according to the javadoc you do it this way:
List<String> links = new ArrayList<String>();
for (TagNode aTag : linkElements[i].getElementListByName("a", false))
{
    String link = aTag.getAttributeByName("href");
    if (link != null && link.length() > 0) links.add(link);
}
P.S.: you posted clearly uncompilable code.
P.P.S.: why don't you use a library that creates an ordinary DOM tree from HTML? That way you'll be able to work with the parsed document using a commonly known API.
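For example, a minimal sketch with jsoup (the library choice is mine, since the P.P.S. names no specific one; the markup is the question's div, extended with the anchor it presumably contains):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NameLinkExtractor {
    public static void main(String[] args) {
        // Hypothetical markup: the question's div plus an anchor around "Name"
        String html = "<div class=\"name\"><a href=\"http://exxample.com\">Name</a></div>";
        Document doc = Jsoup.parse(html);
        for (Element div : doc.select("div.name")) {
            System.out.println("text: " + div.text());         // Name
            Element a = div.select("a[href]").first();
            if (a != null) {
                System.out.println("link: " + a.attr("href")); // http://exxample.com
            }
        }
    }
}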
