Getting HTML page with java hangs

I'm building a Java program that reads a file from Remax.com containing the ids of around 300 properties. It parses the HTML page of each property (www.remax.pt/(id)) and then writes some image URLs found in that page into another file. It works well, but hangs in the middle of the process: sometimes after writing 15 properties, sometimes 30, sometimes 4. It seems random, and I can't figure out when or why it hangs. Is it something to do with the connection?
Here's my code, more or less:
try {
//initializing variables
reader = new BufferedReader(new InputStreamReader(new FileInputStream(inputdir), "UTF-8"));
writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(ouputDir), "UTF-8"));
String line = "";
int nProperty = 1;
//reading Property id
while ((line = reader.readLine()) != null) {
id = line;
//opening a connection to the property page, so i can grab the html and the images.
URLConnection spoof = new URL("http://www.remax.pt/" + id).openConnection();
spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
System.out.println(" Downloading photos from property " + nProperty + " - " + id);
//getting an input stream to read the page
InputStream in = spoof.getInputStream();
try {
InputStreamReader inR = new InputStreamReader(in);
BufferedReader buf = new BufferedReader(inR);
// searching the html page for the images i want
while ((lineaux = buf.readLine()) != null) {
if (lineaux.contains(".jpg")) {
Pattern p = Pattern.compile("www.remax.pt/.*?.jpg");
Matcher m = p.matcher(lineaux);
int i = 0;
int principal = 0;
String link = null;
while (m.find()) {
writer.write(m.group());
writer.newLine();
System.out.println("\t Downloading Photo " + i);
}
}
}
} finally {
in.close();
}
nProperty++;
}
} catch (FileNotFoundException e) {
System.out.println("File Not Found");
e.printStackTrace();
} finally {
try {
writer.close();
reader.close();
} catch (Exception exp) {
}
}
Again, the code works and does exactly what I want, but it hangs at random stages. I get no error and the program doesn't stop, and I have no idea how to prevent it.
Thank you!

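I more or less solved it. By default a URLConnection has no read timeout (a value of 0 means wait forever), so a stalled server blocks the read indefinitely. I had to call spoof.setReadTimeout(10000); so that the read times out after 10 seconds and the program tries to connect again. It should be just a safety measure, but without it the program doesn't complete.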
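A minimal sketch of that fix, for illustration only. The class name TimeoutFetch, the helper fetchWithRetry and the timeout and retry values are assumptions, not part of the original program. It sets both a connect and a read timeout and retries when a SocketTimeoutException is thrown:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class TimeoutFetch {

    // Hypothetical helper: downloads a page, retrying when the server stalls.
    // Returns null if every attempt timed out.
    static String fetchWithRetry(String address, int timeoutMillis, int maxAttempts)
            throws IOException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            URLConnection conn = URI.create(address).toURL().openConnection();
            conn.setConnectTimeout(timeoutMillis); // give up if the connection cannot be opened in time
            conn.setReadTimeout(timeoutMillis);    // give up if the server stops sending data
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder page = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    page.append(line).append('\n');
                }
                return page.toString();
            } catch (SocketTimeoutException e) {
                // A stalled connection or read ends up here instead of hanging forever.
                System.err.println("Attempt " + attempt + " timed out for " + address);
            }
        }
        return null;
    }
}
```

In the loop over property ids, a null result could be logged and the id skipped, so one unresponsive page no longer blocks the remaining properties.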

Related

Parsing HTML page: difference in page content between Java code and browser

URL: https://www.bing.com/search?q=vevo+USIV30300367
If I View source of the above URL (in Internet Explorer 11 for that matter), the sub-string pertaining to the first search result is:
"[h2][a href="https://www.vevo.com/watch/rush/tom-sawyer-(live-exit-stage-left-version)/USIV30300367" h="ID=SERP,5075.1"]Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - Rush - [strong]Vevo[/strong][/a][/h2]"
Whereas via Java code, I get this:
"[h2][a href="https://www.vevo.com/watch/rush/tom-sawyer-(live-exit-stage-left-version)/USIV30300367" h="ID=SERP,5077.1"][span dir="ltr"]Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - …[/span][/a][/h2]"
The formatting is a bit different (note the [span] tags), but worse, the video title has been truncated in the search result string (i.e. "Rush - Vevo" became "...").
Why is that, and how can I fix it?
(NOTE: I am using "[" and "]" in this post as replacements for the original HTML tagging delimiters to avoid my strings being formatted here on SO.)
Below is my Java code:
private String getWebPage(String pageURL, UserAgentBrowser uab)
{
URL url = null;
InputStream is = null;
BufferedReader br = null;
URLConnection conn = null;
StringBuilder pagedata = new StringBuilder();
String contenttype = null, charset = "utf-8";
String line = null;
try {
url = new URL(pageURL);
conn = url.openConnection();
conn.addRequestProperty("User-Agent", uab.toString());
contenttype = conn.getContentType();
int indexL = contenttype.indexOf("charset=") + 8;
if (indexL > 7) {
int indexR = contenttype.indexOf(";", indexL);
charset = (indexR == -1 ? contenttype.substring(indexL): contenttype.substring(indexL, indexR));
}
is = conn.getInputStream(); // Could throw an IOException
br = new BufferedReader(new InputStreamReader(is, charset));
while (true) {
line = br.readLine();
if (line == null) break;
pagedata.append(line);
}
} catch (MalformedURLException mue) {
// mue.printStackTrace();
} catch (IOException ioe) {
// ioe.printStackTrace();
} finally {
try {
if (is != null) is.close();
} catch (IOException ioe) {
// Nothing to see here
}
}
return (pagedata.length() == 0 ? null : pagedata.toString());
}
And
String pagedata = getWebPage("https://www.bing.com/search?q=vevo+USIV30300367", UserAgentBrowser.INTERNET_EXPLORER);
Where UserAgentBrowser.INTERNET_EXPLORER.toString() equals:
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"

I am using the epublib and I am trying to get the entire chapter of a book at a time

I am trying to get one chapter of a book at a time, using Paul Siegmann's epublib library. I am able to get all the text from the book, but I am not sure how to split it by chapter or where to go from here.
// find InputStream for book
InputStream epubInputStream = assetManager
.open("the_planet_mappers.epub");
// Load Book from inputStream
mThePlanetMappersBookEpubLib = (new EpubReader()).readEpub(epubInputStream);
Spine spine = new Spine(mThePlanetMappersBookEpubLib.getTableOfContents());
for (SpineReference bookSection : spine.getSpineReferences()) {
Resource res = bookSection.getResource();
try {
InputStream is = res.getInputStream();
BufferedReader r = new BufferedReader(new InputStreamReader(is));
String line;
while ((line = r.readLine()) != null) {
line = Html.fromHtml(line).toString();
Log.i("Read it ", line);
mEntireBook.append(line);
}
} catch (IOException e) {
e.printStackTrace();
}
}

I don't know if you're still looking for an answer, but I'm working on this too right now. This is the code I use to retrieve the content of the whole EPUB file:
public ArrayList<String> getBookContent(Book bi) {
// GET THE CONTENTS OF ALL PAGES
StringBuilder string = new StringBuilder();
ArrayList<String> listOfPages = new ArrayList<>();
Resource res;
InputStream is;
BufferedReader reader;
String line;
Spine spine = bi.getSpine();
for (int i = 0; spine.size() > i; i++) {
res = spine.getResource(i);
try {
is = res.getInputStream();
reader = new BufferedReader(new InputStreamReader(is));
while ((line = reader.readLine()) != null) {
// FIRST PAGE LINE -> <?xml version="1.0" encoding="utf-8" standalone="no"?>
if (line.contains("<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>")) {
string.delete(0, string.length());
}
// ADD THAT LINE TO THE FINAL STRING REMOVING ALL THE HTML
string.append(Html.fromHtml(formatLine(line)));
// LAST PAGE LINE -> </html>
if (line.contains("</html>")) {
listOfPages.add(string.toString());
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
return listOfPages;
}
private String formatLine(String line) {
if (line.contains("http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd")) {
line = line.substring(line.indexOf(">") + 1, line.length());
}
// REMOVE STYLES AND COMMENTS IN HTML
if ((line.contains("{") && line.contains("}"))
|| ((line.contains("/*")) && line.contains("*/"))
|| (line.contains("<!--") && line.contains("-->"))) {
line = line.substring(line.length());
}
return line;
}
As you may have noticed, the filter still needs improvement, but I now have every chapter of the book in my ArrayList. To get a chapter I just call, for example, myList.get(0); and it's done.
To display the text properly, I'm using the bluejamesbond:textjustify library (https://github.com/bluejamesbond/TextJustify-Android).
It is easy to use and powerful.
I hope it helps, and if anybody finds a better way to filter that HTML, please let me know.
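One possible way to tighten the filter, sketched under assumptions: once a whole spine resource has been read into a single String, the XML declaration, DOCTYPE, comments and style/script blocks can be removed in one pass rather than line by line. The class ChapterText and method toPlainText are hypothetical names. Regular expressions are not a real HTML parser: a DOCTYPE with an internal subset would not be handled, and character entities such as &amp;amp; are left undecoded.

```java
import java.util.regex.Pattern;

public class ChapterText {

    // (?s) lets '.' match newlines so multi-line blocks are removed whole;
    // (?i) makes the tag names case-insensitive.
    private static final Pattern NON_TEXT = Pattern.compile(
            "(?is)<!--.*?-->|<\\?xml.*?\\?>|<!DOCTYPE.*?>|<(style|script)\\b.*?</\\1\\s*>");

    // Any remaining tag, such as <p> or </h1>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    // Hypothetical helper: reduces one XHTML document to plain text.
    static String toPlainText(String xhtml) {
        String text = NON_TEXT.matcher(xhtml).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Collapse the whitespace left behind by the removed markup.
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

Because each spine resource is handled as one document, this avoids relying on where the line breaks happen to fall inside a style block or comment.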

i want to change the text in a file, my code is searching the word but not replacing the word

I am trying to replace a string in a JS file that has content like this:
........
minimumSupportedVersion: '1.1.0',
........
I'm trying to replace 1.1.0 with 1.1.1. My code finds the text but does not replace it. Can anyone help me with this? Thanks in advance.
public class replacestring {
public static void main(String[] args)throws Exception
{
try{
FileReader fr = new FileReader("G:/backup/default0/default.js");
BufferedReader br = new BufferedReader(fr);
String line;
while((line=br.readLine()) != null) {
if(line.contains("1.1.0"))
{
System.out.println("searched");
line.replace("1.1.0","1.1.1");
System.out.println("String replaced");
}
}
}
catch(Exception e){
e.printStackTrace();
}
}
}

First, make sure you assign the result of replace to something, otherwise it's lost. Remember, String is immutable: replace returns a new String and leaves the original unchanged.
line = line.replace("1.1.0","1.1.1");
Second, you need to write the changes back to a file. I'd recommend creating a temporary file to which you write each line and, when finished, deleting the original file and renaming the temporary file into its place.
Something like...
File original = new File("G:/backup/default0/default.js");
File tmp = new File("G:/backup/default0/tmpdefault.js");
boolean replace = false;
try (FileReader fr = new FileReader(original);
BufferedReader br = new BufferedReader(fr);
FileWriter fw = new FileWriter(tmp);
BufferedWriter bw = new BufferedWriter(fw)) {
String line = null;
while ((line = br.readLine()) != null) {
if (line.contains("1.1.0")) {
System.out.println("searched");
line = line.replace("1.1.0", "1.1.1");
System.out.println("String replaced");
}
// Write every line, changed or not, so the rest of the file is preserved
bw.write(line);
bw.newLine();
}
replace = true;
} catch (Exception e) {
e.printStackTrace();
}
// Doing this here because I want the files to be closed!
if (replace) {
if (original.delete()) {
if (tmp.renameTo(original)) {
System.out.println("File was updated successfully");
} else {
System.err.println("Failed to rename " + tmp + " to " + original);
}
} else {
System.err.println("Failed to delete " + original);
}
}
for example.
You may also like to take a look at The try-with-resources Statement and make sure you are managing your resources properly
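The same temporary-file approach can be sketched with java.nio.file, where Files.move with REPLACE_EXISTING does the delete-and-rename in one call. The class ReplaceInFile and the helper replaceAllInFile are hypothetical names, and UTF-8 is an assumed encoding for the script file:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ReplaceInFile {

    // Hypothetical helper: rewrites 'file' with every 'target' replaced by 'replacement'.
    static void replaceAllInFile(Path file, String target, String replacement)
            throws IOException {
        // Create the temp file next to the original so the final move stays on one file system.
        Path tmp = Files.createTempFile(file.toAbsolutePath().getParent(), "replace", ".tmp");
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // String is immutable: replace() returns a new String, so use the result.
                out.write(line.replace(target, replacement));
                out.newLine();
            }
        }
        // Both streams are closed here; swap the temp file into place.
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

A call such as replaceAllInFile(Paths.get("G:/backup/default0/default.js"), "1.1.0", "1.1.1") would cover the case in the question. Line endings are normalized to the platform separator, and if an exception is thrown midway the temporary file is left behind, so real code may want to delete it in that case.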

If you're working with Java 7 or above, use the new file I/O API (NIO.2), like this:
// Get the file path
Path jsFile = Paths.get("C:\\Users\\UserName\\Desktop\\file.js");
// Read all the contents
byte[] content = Files.readAllBytes(jsFile);
// Create a buffer
StringBuilder buffer = new StringBuilder(
new String(content, StandardCharsets.UTF_8)
);
// Search for version code
int pos = buffer.indexOf("1.1.0");
if (pos != -1) {
// Replace if found
buffer.replace(pos, pos + 5, "1.1.1");
// Overwrite with new contents
Files.write(jsFile,
buffer.toString().getBytes(StandardCharsets.UTF_8),
StandardOpenOption.TRUNCATE_EXISTING);
}
I'm assuming your script file is smaller than a few megabytes; otherwise use the buffered I/O classes. Note that this replaces only the first occurrence of 1.1.0.

Error in reading file java.io

So I'm trying to read the following string from the text file addToLibrary.txt:
file:/Users/JEAcomputer/Music/iTunes/iTunes%20Media/Music/Flight%20Of%20The%20Conchords/Flight%20Of%20The%20Conchords/06%20Mutha'uckas.mp3
But when I do, I get the following error:
java.io.FileNotFoundException: file:/Users/JEAcomputer/Music/iTunes/iTunes%20Media/Music/Flight%20Of%20The%20Conchords/Flight%20Of%20The%20Conchords/06%20Mutha'uckas.mp3 (No such file or directory)
What's odd is that I got that string from a file chooser using this method:
public static void addToLibrary(File f) {
String fileName = "addToLibrary.txt";
try {
FileWriter filewriter = new FileWriter(fileName, true);
BufferedWriter bufferedWriter = new BufferedWriter(filewriter);
bufferedWriter.newLine();
bufferedWriter.write(f.toURI().toString());
System.out.println("Your file has been written");
bufferedWriter.close();
} catch (IOException ex) {
System.out.println(
"Error writing to file '"
+ fileName + "'");
} finally {
}
}
Even stranger, my file reader can read things in another folder but not anything in iTunes Media.
I attempt to read all the files in the different folders with the following method:
public void getMusicDirectory() {
int index = 0;
try {
File[] contents = musicDir.listFiles();
//System.out.println(contents[3].toString());
for (int i = 0; i < contents.length; i++) {
//System.out.println("----------------------------------------"+contents.length);
String name = contents[i].getName();
//System.out.println(name);
if (name.indexOf(".mp3") == -1) {
continue;
}
FileInputStream file = new FileInputStream(contents[i]);
file.read();
System.out.println(contents[i].toURI().toString());
songsDir.add(new Song((new MediaPlayer(new Media(contents[i].toURI().toString()))), contents[i]));
file.close();
}
} catch (Exception e) {
System.out.println("Error -- " + e.toString());
}
try(BufferedReader br = new BufferedReader(new FileReader("addToLibrary.txt"))) {
//System.out.println("In check login try");
for (String line; (line = br.readLine()) != null; ) {
FileInputStream file = new FileInputStream(new File(line));
file.read();
songsDir.add(new Song(new MediaPlayer(new Media(line)), new File(line)));
file.close();
}
// line is not visible here.
} catch (Exception e) {
System.out.println("Error reading add to library-- " + e.toString());
}
}
So how can I make this work? Why does the first part of the method work but not the second?

You are not having a problem reading the string
file:/Users/JEAcomputer/Music/iTunes/iTunes%20Media/Music/Flight%20Of%20The%20Conchords/Flight%20Of%20The%20Conchords/06%20Mutha'uckas.mp3
from a file. That part works fine. Your problem is after that, when you try to open the file with the path:
file:/Users/JEAcomputer/Music/iTunes/iTunes%20Media/Music/Flight%20Of%20The%20Conchords/Flight%20Of%20The%20Conchords/06%20Mutha'uckas.mp3
because that's not actually a path; it's a URI (although it can be converted to a path).
You could convert this to a path, in order to open it, but you have no reason to - your code doesn't actually read from the file (apart from the first byte, which it does nothing with) so there's no point in opening it. Delete the following lines:
FileInputStream file = new FileInputStream(contents[i]); // THIS ONE
file.read(); // THIS ONE
System.out.println(contents[i].toURI().toString());
songsDir.add(new Song((new MediaPlayer(new Media(contents[i].toURI().toString()))), contents[i]));
file.close(); // THIS ONE
and
FileInputStream file = new FileInputStream(new File(line)); // THIS ONE
file.read(); // THIS ONE
songsDir.add(new Song(new MediaPlayer(new Media(line)), new File(line)));
file.close(); // THIS ONE
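If a File is still needed for a line read from addToLibrary.txt (for example as the second argument to Song), the URI string can be converted back rather than passed to new File(String) directly. A minimal sketch, with UriToFile and fileFromUriString as hypothetical names:

```java
import java.io.File;
import java.net.URI;

public class UriToFile {

    // Hypothetical helper: turns a "file:" URI string, as produced by
    // File.toURI().toString(), back into a File. The File(URI) constructor
    // decodes escapes such as %20, which new File(String) does not.
    static File fileFromUriString(String uriString) {
        return new File(URI.create(uriString));
    }
}
```

With this, new File(line) in the second loop would become fileFromUriString(line), while new Media(line) can keep receiving the URI string unchanged.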

file:/Users/JEAcomputer/Music/iTunes/iTunes%20Media/Music/Flight%20Of%20The%20Conchords/Flight%20Of%20The%20Conchords/06%20Mutha'uckas.mp3 is not a valid File reference, especially under Windows.
Since you've identified the String as a URI, you should treat it as such...
URI uri = URI.create("file:/Users/JEAcomputer/Music/iTunes/iTunes%20Media/Music/Flight%20Of%20The%20Conchords/Flight%20Of%20The%20Conchords/06%20Mutha'uckas.mp3");
Okay, but there's no direct way to read from a URI. You can read from a URL, though, so we need to convert the URI to a URL. Luckily, this is quite simple...
URL url = uri.toURL();
From there you can use URL#openStream to open an InputStream (which you can wrap in an InputStreamReader) and read the contents of the file, for example...
String imageFile = "file:/...";
URI uri = URI.create(imageFile);
try {
URL url = uri.toURL();
try (InputStream is = url.openStream()) {
byte[] bytes = new byte[1024 * 4];
int bytesRead = -1;
int totalBytesRead = 0;
while ((bytesRead = is.read(bytes)) != -1) {
// Something, something, something, bytes
totalBytesRead += bytesRead;
}
System.out.println("Read a total of " + totalBytesRead);
} catch (IOException ex) {
ex.printStackTrace();
}
} catch (MalformedURLException ex) {
ex.printStackTrace();
}
You could, however, save yourself a lot of trouble by no longer using f.toURI().toString() (File#toURI followed by URI#toString) and simply using File#getPath instead. That would let you create a new File reference directly from the String.
Also, your resource management needs some work: basically, if you open it, you should close it. See The try-with-resources Statement for some more ideas.

Parsed strings from .csv-file are invalid tokens in an kml-file. How can i solve this?

I have code that parses strings from a CSV file (with Twitter data) and writes them into a new KML file. When I parse the comments from the Twitter data there are, of course, unknown tokens like: ðŸš¨. When I open the new KML file in Google Earth I get an error because of these unknown tokens.
Question:
When I parse the strings, can I tell Java to throw out all unknown tokens so that I don't have any of them in my KML?
Thank you.
Code below:
String csvFile = "twitter.csv";
BufferedReader br = null;
String line = "";
String cvsSplitBy = ";";
String[] twitter = null;
int row_desired = 0;
int row_counter = 0;
String[] placemarks = new String[1165];
// read the CSV from here on
try {
br = new BufferedReader(new FileReader(csvFile));
while ((line = br.readLine()) != null) {
if (row_counter++ == row_desired) {
twitter = line.split(cvsSplitBy);
placemarks[row_counter] =
"<Placemark>\n"+
"<name>User ID: "+twitter[7]+"</name>\n"+
"<description>This User wrote: "+twitter[5]+" at the: "+twitter[6]+"</description>\n"+
"<Point>\n"+
"<coordinates>"+twitter[1]+","+twitter[2]+"</coordinates>\n"+
"</Point>\n"+
"</Placemark>\n";
row_desired++;
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (br != null) {
try {
br.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
for(int i = 2; i <= 1164;i++){
String kml2 = kml.concat(""+placemarks[i]+"");
kml=kml2;
}
kml = kml.concat("</Document></kml>");
FileWriter fileWriter = new FileWriter(filepath);
fileWriter.write(kml);
fileWriter.close();
Runtime.getRuntime().exec(googlefilepath + filepath);
}

Text files are not all built equal: you must always consider what character encoding is in use. I'm not sure about Twitter's data specifically, but I would guess they're doing like the rest of the world and using UTF-8.
Basically, avoid FileReader and instead use the constructor of InputStreamReader which lets you specify the Charset.
Tip: if you're using Java 7+, try this:
for (String line : Files.readAllLines(file.toPath(), Charset.forName("UTF-8"))) { ...
More Info
The javadoc of FileReader states "The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate."
You should always avoid this class, or at least avoid it for any data that might ever be transferred between computers. Even on a single Windows machine, a program "using the default charset" may assume UTF-8 when run from inside Eclipse, or ISO_8859_1 when run outside Eclipse! Such non-determinism from a class is not good.
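To illustrate the point, here is a small sketch that reads a file through an InputStreamReader with an explicit Charset. The class CsvCharset and the helper readLines are hypothetical names, and the assumption that the Twitter export is UTF-8 is the same guess made above:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class CsvCharset {

    // Hypothetical helper: reads all lines of a file using the given charset
    // instead of the platform default that FileReader would pick.
    static List<String> readLines(File csv, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(csv), charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Reading a UTF-8 file with readLines(file, StandardCharsets.UTF_8) keeps an emoji as one character, whereas decoding the same bytes with a single-byte charset turns its four bytes into four unrelated characters, which is the kind of garbage shown in the question. The same care applies when writing: FileWriter also uses the default encoding, so the KML would need an OutputStreamWriter with an explicit charset that matches the encoding declared in the KML header.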
