URL: https://www.bing.com/search?q=vevo+USIV30300367
If I view the source of the above URL (in Internet Explorer 11, for that matter), the sub-string pertaining to the first search result is:
"[h2][a href="https://www.vevo.com/watch/rush/tom-sawyer-(live-exit-stage-left-version)/USIV30300367" h="ID=SERP,5075.1"]Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - Rush - [strong]Vevo[/strong][/a][/h2]"
Whereas via Java code, I get this:
"[h2][a href="https://www.vevo.com/watch/rush/tom-sawyer-(live-exit-stage-left-version)/USIV30300367" h="ID=SERP,5077.1"][span dir="ltr"]Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - …[/span][/a][/h2]"
The formatting is a bit different (check the [span] tags), but even worse, the video title has been truncated in the search result string (i.e. "Rush - Vevo" became "...").
Why is that? How can I fix it?
(NOTE: I am using "[" and "]" in this post as replacements for the original HTML tagging delimiters to avoid my strings being formatted here on SO.)
Below is my Java code:
private String getWebPage(String pageURL, UserAgentBrowser uab)
{
    URL url = null;
    InputStream is = null;
    BufferedReader br = null;
    URLConnection conn = null;
    StringBuilder pagedata = new StringBuilder();
    String contenttype = null, charset = "utf-8";
    String line = null;
    try {
        url = new URL(pageURL);
        conn = url.openConnection();
        conn.addRequestProperty("User-Agent", uab.toString());
        contenttype = conn.getContentType();
        // Pull the charset out of the Content-Type header, if one is specified
        int indexL = contenttype.indexOf("charset=") + 8;
        if (indexL > 7) {
            int indexR = contenttype.indexOf(";", indexL);
            charset = (indexR == -1 ? contenttype.substring(indexL) : contenttype.substring(indexL, indexR));
        }
        is = conn.getInputStream(); // Could throw an IOException
        br = new BufferedReader(new InputStreamReader(is, charset));
        // Read the page line by line (line breaks are not re-added to pagedata)
        while (true) {
            line = br.readLine();
            if (line == null) break;
            pagedata.append(line);
        }
    } catch (MalformedURLException mue) {
        // mue.printStackTrace();
    } catch (IOException ioe) {
        // ioe.printStackTrace();
    } finally {
        try {
            if (is != null) is.close();
        } catch (IOException ioe) {
            // Nothing to see here
        }
    }
    return (pagedata.length() == 0 ? null : pagedata.toString());
}
And the call looks like this:
String pagedata = getWebPage("https://www.bing.com/search?q=vevo+USIV30300367", UserAgentBrowser.INTERNET_EXPLORER);
Where UserAgentBrowser.INTERNET_EXPLORER.toString() equals:
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
Related
I am trying to use this API to get a report with Java; here is the link:
https://developers.google.com/admin-sdk/reports/v1/appendix/activity/meet
and here is what I am using now:
public static String getGraph() {
    String PROTECTED_RESOURCE_URL = "https://www.googleapis.com/admin/reports/v1/activity/users/all/applications/meet?eventName=call_ended&maxResults=10&access_token=";
    String graph = "";
    try {
        // "access_token" here is presumably a placeholder for the real OAuth access token
        URL urUserInfo = new URL(PROTECTED_RESOURCE_URL + "access_token");
        HttpURLConnection connObtainUserInfo = (HttpURLConnection) urUserInfo.openConnection();
        if (connObtainUserInfo.getResponseCode() == HttpURLConnection.HTTP_OK) {
            StringBuilder sbLines = new StringBuilder("");
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connObtainUserInfo.getInputStream(), "utf-8"));
            String strLine = "";
            while ((strLine = reader.readLine()) != null) {
                sbLines.append(strLine);
            }
            graph = sbLines.toString();
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    return graph;
}
I am pretty sure this is not a smart way to do it, and the string I get back is quite complex. Are there any Java samples that let me get the data directly instead of hand-rolling the HTTP request?
Or is there a class I can import to convert the JSON string into an object?
Can anyone help? I have been trying this for many days already!
Thanks!!
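For the JSON-to-object part specifically, a mapping library such as Gson is one common option; a minimal sketch, assuming Gson is on the classpath (the "items" field name below is a guess and would need to be checked against the actual Reports API response):

    import com.google.gson.Gson;
    import com.google.gson.JsonObject;

    // Parse the raw JSON string returned by getGraph() into a generic JSON tree,
    // so no custom model classes are needed up front.
    JsonObject root = new Gson().fromJson(getGraph(), JsonObject.class);

    // "items" is only an assumed field name; inspect the real payload first.
    if (root.has("items")) {
        System.out.println(root.getAsJsonArray("items").size() + " activity records");
    }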
I have this code and it works well for reading a remote file, but I wonder how I could read from a second URL if the first one fails.
That is, I read from the first URL; if it is available, I continue as normal.
If the first URL cannot be read, the code should fall back to the second URL.
How can I add a second, "backup" URL?
Thanks.
// Code
try {
    // Create a URL for the desired page
    URL url = new URL("http://myurl.com/archive.txt");
    // Read all the text returned by the server
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    network1 = in.readLine();
    network2 = in.readLine();
    network3 = in.readLine();
    network4 = in.readLine();
    in.close();
} catch (MalformedURLException e) {
} catch (IOException e) {
}
// Code
}
Use something like this:
String[] readUrl(String urlStr) throws Exception {
    URL url = new URL(urlStr);
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    String[] result = new String[4];
    for (int i = 0; i < 4; i++) {
        result[i] = in.readLine();
    }
    in.close();
    return result;
}

String[] tryMultipleUrls(String url1, String url2) throws Exception {
    String[] result = null;
    try {
        result = readUrl(url1);
    } catch (Exception ex) {
        // The first URL failed, so fall back to the backup URL
        result = readUrl(url2);
    }
    return result;
}
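A hypothetical call site, reusing the variable names from the question (the backup URL below is made up for illustration):

    try {
        String[] lines = tryMultipleUrls("http://myurl.com/archive.txt",
                                         "http://mybackup.example.com/archive.txt");
        network1 = lines[0];
        network2 = lines[1];
        network3 = lines[2];
        network4 = lines[3];
    } catch (Exception ex) {
        // Both URLs failed
    }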
I'm building a Java program that reads a file from Remax.com containing the ids of around 300 properties. It parses the HTML page of each property (www.remax.pt/(id)) and then writes some image URLs (found in the HTML page) into another file. It works well, but hangs in the middle of the process. Sometimes it writes 15 properties, sometimes 30, and sometimes 4. It seems random. I can't figure out when or why it hangs. Could it be something with the connection?
Here's my code, more or less:
try {
    // initializing variables
    reader = new BufferedReader(new InputStreamReader(new FileInputStream(inputdir), "UTF-8"));
    writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(ouputDir), "UTF-8"));
    String line = "";
    int nProperty = 1;
    // reading property id
    while ((line = reader.readLine()) != null) {
        id = line;
        // opening a connection to the property page, so I can grab the html and the images
        URLConnection spoof = new URL("http://www.remax.pt/" + id).openConnection();
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
        System.out.println(" Downloading photos from property " + nProperty + " - " + id);
        // getting an input stream to read the page
        InputStream in = spoof.getInputStream();
        try {
            InputStreamReader inR = new InputStreamReader(in);
            BufferedReader buf = new BufferedReader(inR);
            // searching the html page for the images I want
            while ((lineaux = buf.readLine()) != null) {
                if (lineaux.contains(".jpg")) {
                    Pattern p = Pattern.compile("www.remax.pt/.*?.jpg");
                    Matcher m = p.matcher(lineaux);
                    int i = 0;
                    int principal = 0;
                    String link = null;
                    while (m.find()) {
                        writer.write(m.group());
                        writer.newLine();
                        System.out.println("\t Downloading Photo " + i);
                    }
                }
            }
        } finally {
            in.close();
        }
        nProperty++;
    }
} catch (FileNotFoundException e) {
    System.out.println("File Not Found");
    e.printStackTrace();
} finally {
    try {
        writer.close();
        reader.close();
    } catch (Exception exp) {
    }
}
Then again, the code works. It's doing exactly what I want it to do, but it hangs at random stages (I get no error; the program just stops making progress) and I have no idea what I can do to prevent it.
Thank You!
I kinda solved it. I had to add spoof.setReadTimeout(10000); so that it times out after 10 seconds and tries to connect again. It should just be a safety measure, but without it, the program doesn't complete.
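For reference, a minimal sketch of that safety measure as I understand it (the 10-second values and the single retry are arbitrary choices, not something taken from the original code):

    InputStream in = null;
    for (int attempt = 0; attempt < 2 && in == null; attempt++) {
        try {
            URLConnection spoof = new URL("http://www.remax.pt/" + id).openConnection();
            spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
            spoof.setConnectTimeout(10000); // give up on a stalled connection attempt after 10 s
            spoof.setReadTimeout(10000);    // give up on a stalled read after 10 s
            in = spoof.getInputStream();
        } catch (IOException e) {
            // Timed out or failed; loop once more before giving up on this property
        }
    }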
I'm working with Java regular expressions on Android platform.
I'm trying to search some HTML with a regular expression I defined.
Here's my code:
public void mainaaForWWW(String websiteSource) {
    try {
        websiteSource = readDataFromWWW(websiteSource);
    } catch (IOException e1) {
        e1.printStackTrace();
    }
    ArrayList<String> cinemaArray = new ArrayList<String>();
    Pattern sample = Pattern.compile("<div class=\"theatre\">");
    Matcher secuence = sample.matcher(websiteSource);
    try {
        while (secuence.find()) {
            cinemaArray.add(secuence.group());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    titleTableForWWW = new String[cinemaArray.size()];
    for (int i = 0; i < titleTableForWWW.length; i++)
        titleTableForWWW[i] = cinemaArray.get(i);
}
The problem is quite strange, because when I debug the code, the String websiteSource looks okay (the whole HTML file is loaded), yet the while loop only finds 4 matches. In the HTML document I manually counted 11 matches. This regex is simplified just to find out what's going on. Any ideas?
Ok, my bad. I found a solution:
So, here's the code that was responsible for reading the HTML source code into a String:
public String readDataFromWWW(String UrlAdress) throws IOException
{
    String line = null;
    URL url = new URL(UrlAdress);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), "ISO-8859-2"));
    // Note: readLine() is called twice per iteration here (once in the condition,
    // once in the body), so every other line of the page is silently skipped.
    while (rd.readLine() != null) {
        line += rd.readLine();
    }
    System.out.println(line);
    return line;
}
I think that reading into the String that way messed something up, so I replaced that method with this one:
public String readDataFromWWW(String UrlAdress) throws IOException
{
    String wyraz = "";
    try {
        String webPage = UrlAdress;
        URL url = new URL(webPage);
        URLConnection urlConnection = url.openConnection();
        InputStream is = urlConnection.getInputStream();
        InputStreamReader isr = new InputStreamReader(is, "ISO-8859-2");
        int numCharsRead;
        char[] charArray = new char[1024];
        StringBuffer sb = new StringBuffer();
        while ((numCharsRead = isr.read(charArray)) > 0) {
            sb.append(charArray, 0, numCharsRead);
        }
        wyraz = sb.toString();
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return wyraz;
}
And everything works FINE! Thanks a lot for the clues and help. I think the problem was that the first version called readLine() twice per loop iteration, so every other line of the page was dropped while building the String.
I want to extract some content from this URL, http://www.xyz.com/default.aspx, using a regular expression. Here is my code:
String expr = "
What Regular Expression should I use here
";
Pattern patt = Pattern.compile(expr, Pattern.DOTALL | Pattern.UNIX_LINES);
URL url4 = null;
try {
    url4 = new URL("http://www.xyz.com/default.aspx");
} catch (MalformedURLException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
System.out.println("Text" + url4);
Matcher m = null;
try {
    m = patt.matcher(getURLContent(url4));
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
System.out.println("Match" + m);
while (m.find()) {
    String stateURL = m.group(1);
    System.out.println("Some Data" + stateURL);
}
public static CharSequence getURLContent(URL url8) throws IOException {
    URLConnection conn = url8.openConnection();
    String encoding = conn.getContentEncoding();
    if (encoding == null) {
        encoding = "ISO-8859-1";
    }
    BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), encoding));
    StringBuilder sb = new StringBuilder(16384);
    try {
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line);
            System.out.println(line);
            sb.append('\n');
        }
    } finally {
        br.close();
    }
    return sb;
}
As @bkent314 has mentioned, jsoup is a better and cleaner approach than using regular expressions.
If you inspect the source code of that website, you basically want the content from this snippet:
<div class="smallHd_contentTd">
<div class="breadcrumb">...</div>
<h2>Services</h2>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
By using jsoup, your code may look something like this:-
Document doc = Jsoup.connect("http://www.ferotech.com/Services/default.aspx").get();
Element content = doc.select("div.smallHd_contentTd").first();
String header = content.select("h2").first().text();
System.out.println(header);
for (Element pTag : content.select("p")) {
    System.out.println(pTag.text());
}
Hope this helps.