Extraction of HTML content from Walmart html page - java

I have written the code below. I need to extract the price from the following URL. I am writing the code in Java.
http://www.walmart.com/ip/VIZIO-E70-C3-70-1080p-240Hz-Class-LED-Smart-HDTV/43310251
String regEx = "<span\\s+class=\"sup\">.+</span>[\n]*(\\d+(,)*\\d+)[\n*]<span\\s+class=\"visuallyhidden\">[.]*</span>[\n]*<span\\s+class=\"sup\">(\\d+)";
Pattern p1 = Pattern.compile(regEx);
System.out.println("Vikash");
while ((line = in .readLine()) != null) {
sb.append(line + "\n");
}
m = p1.matcher(sb);
while (!m.hitEnd()) {
if (m.find()) {
System.out.println("$" + m.group());
}
}

If you can't use an API, you should use a framework for this. Take a look at http://jsoup.org
It generates a structured document and lets you iterate over ids, classes, tags and so on.
E.g. getElementsByClass("sup"). I can provide some example code when I'm back at my desktop.
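If adding a parser library is not an option, the regex approach from the question can be tightened instead. The following is only a sketch using java.util.regex: the HTML fragment is an assumption inferred from the question's pattern, not taken from the live page, and PriceExtractor and extractPrice are hypothetical names. It differs from the question's pattern in three ways: [\n*] is a character class that matches a single newline or asterisk, so \s* is used instead; [.]* matches literal dots, so [^<]* is used for the span contents; and the capture groups are read instead of m.group(), which returns the whole match including the tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceExtractor {
    // Group 1: dollars (digits with optional grouping commas); group 2: cents.
    private static final Pattern PRICE = Pattern.compile(
            "<span\\s+class=\"sup\">[^<]*</span>\\s*"                 // currency symbol span
            + "(\\d[\\d,]*)\\s*"                                      // dollars
            + "<span\\s+class=\"visuallyhidden\">[^<]*</span>\\s*"    // hidden decimal point
            + "<span\\s+class=\"sup\">(\\d+)</span>");                // cents

    // Returns the price as "$dollars.cents", or null if the markup is not found.
    static String extractPrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find() ? "$" + m.group(1) + "." + m.group(2) : null;
    }

    public static void main(String[] args) {
        // Assumed markup, reconstructed from the pattern in the question.
        String html = "<div><span class=\"sup\">$</span>1,298"
                + "<span class=\"visuallyhidden\">.</span>\n"
                + "<span class=\"sup\">00</span></div>";
        System.out.println(extractPrice(html));
    }
}
```

A pattern like this breaks as soon as the page changes its class names or attribute order, which is the main argument for the parser-based approach.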

Related

Search pattern within String in JAVA

I'm using PDFBox in Java and have successfully loaded a PDF. Now I want to search for a specific word and retrieve only the number that follows it. Concretely, I want to search for "Tax" and retrieve the tax amount. The two strings seem to be separated by a tab.
My code currently looks like this:
File file = new File("yes.pdf");
try {
PDDocument document = PDDocument.load(file);
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
System.out.println(text);
// search for the word tax
// retrieve the number after the word "Tax"
document.close();
} catch (IOException e) {
e.printStackTrace();
}
I have used something similar in my project. I hope it helps.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.pdfbox.pdmodel.PDDocument;    // package names as in PDFBox 2.x
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractNumber {
    public static void main(String[] args) throws IOException {
        List<String> digitList = new ArrayList<String>();
        // A letter immediately followed by digits; group 1 captures only the digits
        Pattern pattern = Pattern.compile("[a-zA-Z](\\d+)");
        // try-with-resources closes the document even if text extraction fails
        try (PDDocument doc = PDDocument.load(new File("yourFile location"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Read the text from the PDF
            String text = stripper.getText(doc);
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                digitList.add(matcher.group(1));
            }
        }
        for (String digit : digitList) {
            System.out.println(digit);
        }
    }
}
The regular expression [a-zA-Z]\d+ finds a letter immediately followed by one or more digits in the PDF text.
The \d+ part picks out just the digits from each match.
You can also use a different regular expression to find a specific number of digits.
See this tutorial for more ideas.
The best way to do something like that is to use regular expressions. I often use this tool to write my regular expressions. Your regex should probably look something like: tax\s([0-9]+). You can take a look at this tutorial on how to use regex in Java.
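As a rough illustration of the suggested pattern in Java, here is a minimal sketch. TaxFinder and findTax are hypothetical names, the sample text is made up, and the pattern is widened slightly (an assumption) to allow grouping commas and a decimal part. The string passed to findTax would be the text returned by pdfStripper.getText(document).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaxFinder {
    // The word "Tax", whitespace (a tab in the question), then a number
    // with optional grouping commas and an optional decimal part.
    private static final Pattern TAX = Pattern.compile(
            "\\bTax\\s+(\\d[\\d,]*(?:\\.\\d+)?)", Pattern.CASE_INSENSITIVE);

    // Returns the first number that follows "Tax", if there is one.
    static Optional<String> findTax(String text) {
        Matcher m = TAX.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the extracted PDF text.
        String sample = "Subtotal\t100.00\nTax\t25.00\nTotal\t125.00";
        System.out.println(findTax(sample).orElse("not found")); // prints 25.00
    }
}
```

Note that in a Java string literal every regex backslash has to be doubled, so tax\s([0-9]+) is written "tax\\s([0-9]+)".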

Decode alfresco file name or replace unicode[_x0020_] characters in String/fileName

I am using the Alfresco upload and download services from Java.
When I upload a file to the Alfresco server, it gives me the following path:
/app:Home/cm:Company_x0020_Home/cm:Abc/cm:TestFile/cm:V4/cm:BC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
When I use the same path to download the file through the Alfresco services, I take the file name from the end of the path,
i.e. ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
How can I remove or decode the [Unicode] escape sequences in the file name?
String decoded = URLDecoder.decode(queryString, "UTF-8");
The above does not work.
These are some of the Unicode characters that appear in my file name:
https://en.wikipedia.org/wiki/List_of_Unicode_characters
Please do not mark the question as a duplicate; I have gone through the links below, but none of them gave a solution.
These are the links I found on replacing Unicode characters in a String with Java:
Java removing unicode characters
Remove non-ASCII characters from String in Java
How can I replace a unicode character in java string
Java Replace Unicode Characters in a String
The solution given by Jeff Potts is perfect.
But I had a situation where I was using the file name in a different project that does not use any org.alfresco jars,
and I would have had to pull in all those dependencies just to decode a file name.
So I used plain Java with a regex to parse the file name and decode it, which gave me the same result as
ISO9075.decode(test);
This is the code:
public String decode_FileName(String fileName) {
System.out.println("fileName : " + fileName);
String decodedfileName = fileName;
String temp = "";
Matcher m = Pattern.compile("_x([0-9A-Fa-f]{4})_").matcher(decodedfileName); // regex that matches escapes such as _x0020_ (four hex digits)
List<String> unicodeChars = new ArrayList<String>();
while (m.find()) {
unicodeChars.add(m.group(1));
}
for (int i = 0; i < unicodeChars.size(); i++) {
temp = unicodeChars.get(i);
if (isInteger(temp)) {
String replace_char = String.valueOf(((char) Integer.parseInt(String.valueOf(temp), 16)));//converting
decodedfileName = decodedfileName.replace("_x" + temp + "_", replace_char);
}
}
System.out.println("Decoded FileName :" + decodedfileName);
return decodedfileName;
}
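And use this small utility to check that the captured text is a valid hexadecimal number. The escapes are hex, so the check must parse with radix 16; a decimal parse would skip codes such as 002F.
public static boolean isInteger(String s) {
    try {
        Integer.parseInt(s, 16);
    } catch (NumberFormatException e) {
        return false;
    } catch (NullPointerException e) {
        return false;
    }
    return true;
}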
The above code works as follows.
Example:
0028 is the left parenthesis, U+0028, as you can see at
https://en.wikipedia.org/wiki/List_of_Unicode_characters
String replace_char = String.valueOf(((char) Integer.parseInt(String.valueOf("0028"), 16)));
System.out.println(replace_char);
This code prints ( which is a left parenthesis.
This is the logic I have used in my Java program.
For this input, the above program gives the same result as ISO9075.decode(test).
Output :
fileName : ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
Decoded FileName :ABC1X 0400 0109-(1-2)_v2.pdf
In the org.alfresco.util package you will find a class called ISO9075. You can use it to encode and decode strings according to that spec. For example:
String test = "ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf";
String out = ISO9075.decode(test);
System.out.println(out);
Returns:
ABC1X 0400 0109-(1-2)_v2.pdf
If you want to see what it does behind the scenes, look at the source.

Using Alchemy Entity Extraction to retrieve JSON output

I am running the EntityTest.java file from the AlchemyAPI Java SDK, which can be found here. The program works fine, but there seems to be no way to change the output format to JSON.
I tried executing this code:
// Create an AlchemyAPI object.
AlchemyAPI alchemyObj = AlchemyAPI.GetInstanceFromFile("api_key.txt");
// Force the output type to be JSON
AlchemyAPI_NamedEntityParams params = new AlchemyAPI_NamedEntityParams();
params.setOutputMode("json");
// Extract a ranked list of named entities for a web URL.
Document doc = alchemyObj.URLGetRankedNamedEntities("http://www.techcrunch.com/", params);
System.out.println(getStringFromDocument(doc));
But the code throws a RuntimeException and prints the following on the console:
Exception in thread "main" java.lang.RuntimeException: Invalid setting json for parameter outputMode
at com.alchemyapi.api.AlchemyAPI_Params.setOutputMode(AlchemyAPI_Params.java:42)
at com.alchemyapi.test.EntityTest.main(EntityTest.java:29)
Also, here is the setOutputMode method from the AlchemyAPI_Params.java file:
public void setOutputMode(String outputMode) {
if( !outputMode.equals(AlchemyAPI_Params.OUTPUT_XML) && !outputMode.equals(OUTPUT_RDF) )
{
throw new RuntimeException("Invalid setting " + outputMode + " for parameter outputMode");
}
this.outputMode = outputMode;
}
As the code shows, the only two accepted output formats appear to be XML and RDF. Is that so? Is there no way to get the output in JSON?
Can anybody help me with this?
You will need to add a new constant, OUTPUT_JSON, to AlchemyAPI_Params and modify the setOutputMode method to accept it.
After that, in AlchemyAPI, you will need to modify the doRequest method to handle the new OUTPUT_JSON case.
You can use
http://www.oracle.com/technetwork/articles/java/json-1973242.html
to create the new content.
Hope it helps.
I solved the problem by resorting to a completely different approach. Instead of using the already available Java SDK, I made an HTTP connection to the endpoint of URLGetRankedNamedEntities API, and retrieved the response.
Here is a code sample that demonstrates how to do this-
// The page to analyse is itself a URL, so encode it before embedding it in the query string
String target = URLEncoder.encode("http://www.smashingmagazine.com/2015/04/08/web-scraping-with-nodejs/", "UTF-8");
URL urlObj = new URL("http://access.alchemyapi.com/calls/url/URLGetRankedNamedEntities?apikey=" + API_KEY_HERE + "&url=" + target + "&outputMode=json");
System.out.println(urlObj.toString() + "\n");
URLConnection connection = urlObj.openConnection();
connection.connect();
StringBuilder builder = new StringBuilder();
// try-with-resources closes the reader and the underlying stream when done
try (BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        builder.append(line).append("\n");
    }
}
System.out.println(builder);
Similar endpoints are available for other APIs as well, which can be found here.

Checking HTML (Website) tags within Java Code

I have a system in PHP where the user enters a website URL and we download the HTML and check values in tags. I now have to rewrite it in Java. I have been searching for days and can't find an easy way to do the following tasks.
1) download HTML based on URL
2) After downloading HTML check values in tags
THIS WILL NOT BUILD! CAN SOMEONE HELP ME
public String tagValue(String inHTML, String tag) throws DataNotFoundException
{
String value = null;
String searchFor = "/<" + tag + ">(.*?)<\/" + tag + "\>/";
Pattern pattern = Pattern.compile("<a href=([^ >]*)[^>]*>([^<]*)");
Matcher matcher = pattern.matcher(inHTML);
return value;
}
Check out http://download.oracle.com/javase/6/docs/api/java/net/URLConnection.html
Google "java html parser" for options. You could also use regular expressions if the requirements are fairly simple and straightforward.
An example follows. It took me a while; I haven't worked with these APIs for a long time.
jcomeau@intrepid:~/tmp$ cat test.java; javac test.java; java test
import java.util.regex.*;
import java.net.*;
import java.io.*;
public class test {
public static void main(String args[]) throws Exception {
URL target = new URL("http://www.example.com/");
URLConnection connection = target.openConnection();
connection.connect();
String html = "", line = null;
BufferedReader input = new BufferedReader(new InputStreamReader(
connection.getInputStream()));
while ((line = input.readLine()) != null) html += line;
Pattern pattern = Pattern.compile("<a href=([^ >]*)[^>]*>([^<]*)");
Matcher matcher = pattern.matcher(html);
System.out.println("href\ttext");
while (matcher.find()) {
System.out.println(matcher.group(1) + "\t" + matcher.group(2));
}
}
}
href text
"/"
"/domains/" Domains
"/numbers/" Numbers
"/protocols/" Protocols
"/about/" About IANA
"/go/rfc2606" RFC 2606
"/about/" About
"/about/presentations/" Presentations
"/about/performance/" Performance
"/reports/" Reports
"/domains/" Domains
"/domains/root/" Root Zone
"/domains/int/" .INT
"/domains/arpa/" .ARPA
"/domains/idn-tables/" IDN Repository
"/protocols/" Protocols
"/numbers/" Number Resources
"/abuse/" Abuse Information
"http://www.icann.org/" Internet Corporation for Assigned Names and Numbers
"mailto:iana#iana.org?subject=General%20website%20feedback" iana#iana.org
1) download HTML based on URL
There are various options. There are some helper libraries, e.g. Apache HTTPComponents. You can also just use Java's built-in classes. See e.g. java code to download a file from server .
2) After downloading HTML check values in tags
You probably want to use an HTML parser. For very simple cases, you could use regular expressions (as it seems you are trying to in your example), but this quickly leads to problems. See this famous question: RegEx match open tags except XHTML self-contained tags
THIS WILL NOT BUILD! CAN SOMEONE HELP ME
To put a "\" (backslash) into a literal Java string, you need to double it (because \ is used to introduce special sequences in a Java string literal). So to get a string with just a "\", write it as
String myBackslash = "\\";
See e.g. How can I print "\t" (as it looks) in Java?
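Applied to the tagValue method from the question, the points above might look like the sketch below. It makes a few assumptions: the tag is a simple element name without nesting, null is returned instead of throwing the question's DataNotFoundException (that class is not shown), and TagValue is a hypothetical wrapper class. Unlike PHP's preg functions, Java patterns take no surrounding / delimiters, and a forward slash needs no escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagValue {
    // Returns the text between <tag ...> and </tag>, or null if the tag is absent.
    static String tagValue(String inHTML, String tag) {
        // Pattern.quote guards against regex metacharacters in the tag name.
        String quoted = Pattern.quote(tag);
        Pattern pattern = Pattern.compile(
                "<" + quoted + "(?:\\s[^>]*)?>(.*?)</" + quoted + ">",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(inHTML);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example Domain</title></head></html>";
        System.out.println(tagValue(html, "title")); // prints Example Domain
    }
}
```

This only handles the very simple cases mentioned above; nested or malformed tags are where an HTML parser becomes the safer choice.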

How to extract links from a webpage using jsp?

My requirement is to extract all links (using "a href") from a web page dynamically. I am using JSP. To be more specific, I am building a meta search engine in JSP. So when the user enters a query, I have to extract the links from the search result pages of Yahoo, Ask, Google, Mamma etc.
To get the pages as strings, the code I am using right now is:
try {
    String sUrl_yahoo = "http://www.mamma.com/result.php?type=web&q=hai+bird&j_q=&l=";
    String nextLine;
    String webPage;
    StringBuffer wPage;
    String sSql;
    java.net.URL siteURL = new java.net.URL(sUrl_yahoo);
    java.net.URLConnection siteConn = siteURL.openConnection();
    java.io.BufferedReader in = new java.io.BufferedReader(new java.io.InputStreamReader(siteConn.getInputStream()));
    wPage = new StringBuffer(30 * 1024);
    while ((nextLine = in.readLine()) != null) {
        wPage.append(nextLine);
    }
    in.close();
    webPage = wPage.toString();
    out.println(webPage);
} catch (Exception e) {
    out.println("Error" + e);
}
Now, my request is: Can you suggest some way to extract the links from the String webPage ?
Or is there some other way to extract those links ? I would prefer doing it without using any external packages.
One quick solution would be to use a regex Matcher object to pull the URLs out:
Pattern p = Pattern.compile("<a +href=\"([a-zA-Z0-9_\\:\\-\\/\\.]+)\">");
Matcher m = p.matcher(webPage);
ArrayList<String> foundUrls = new ArrayList<String>();
while(m.find()) {
foundUrls.add(m.group(1));
}
You might have to play around with the URL pattern a little to make it more airtight, but this is a quick and dirty solution without using external libraries.
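One way the pattern could be made more tolerant, still using only java.util.regex, is sketched below. LinkExtractor and extractLinks are hypothetical names. The pattern accepts other attributes before href, upper-case tags, and double-quoted, single-quoted or unquoted values; it is still a regex heuristic, not a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Groups 1 to 3 hold the href value for the double-quoted,
    // single-quoted and unquoted forms respectively.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String webPage) {
        List<String> foundUrls = new ArrayList<>();
        Matcher m = HREF.matcher(webPage);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    foundUrls.add(m.group(g));
                    break;
                }
            }
        }
        return foundUrls;
    }

    public static void main(String[] args) {
        String webPage = "<a href=\"http://a.example/x?y=1\">A</a> "
                + "<A class='c' HREF='/rel'>B</A> <a href=bare>C</a>";
        System.out.println(extractLinks(webPage));
    }
}
```

In the JSP from the question, the webPage string built from the URLConnection could be passed straight to extractLinks.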
