Why is website crawling taking forever?


public class Parser {
public static void main(String[] args) {
Parser p = new Parser();
p.matchString();
}
parserObject courseObject = new parserObject();
ArrayList<parserObject> courseObjects = new ArrayList<parserObject>();
ArrayList<String> courseNames = new ArrayList<String>();
String theWebPage = " ";
{
try {
URL theUrl = new URL("http://ocw.mit.edu/courses/");
BufferedReader reader =
new BufferedReader(new InputStreamReader(theUrl.openStream()));
String str = null;
while((str = reader.readLine()) != null) {
theWebPage = theWebPage + " " + str;
}
reader.close();
} catch (MalformedURLException e) {
// do nothing
} catch (IOException e) {
// do nothing
}
}
public void matchString() {
// this is my regex that I am using to compare strings on input page
String matchRegex = "#\\w+(-\\w+)+";
Pattern p = Pattern.compile(matchRegex);
Matcher m = p.matcher(theWebPage);
int i = 0;
while (!m.hitEnd()) {
try {
System.out.println(m.group());
courseNames.add(i, m.group());
i++;
} catch (IllegalStateException e) {
// do nothing
}
}
}
}
What I am trying to achieve with the above code is to get the list of departments on the MIT OpenCourseWare website. I am using a regular expression that matches the pattern of the department names as they appear in the page source, and a Pattern object and a Matcher object to find() and print the department names that match. But the code is taking forever to run, and I don't think reading in a web page using BufferedReader takes that long. So I think I am either doing something horribly wrong, or parsing websites takes a ridiculously long time. I would appreciate any input on how to improve performance or correct any mistake in my code. I apologize for the badly written code.

The problem is with the code
while ((str = reader.readLine()) != null)
theWebPage = theWebPage + " " +str;
The variable theWebPage is a String, which is immutable. For each line read, this code creates a new String with a copy of everything that's been read so far, with a space and the just-read line appended. This is an extraordinary amount of unnecessary copying, which is why the program is running so slowly.
I downloaded the web page in question. It has 55,000 lines and is about 3.25MB in size. Not too big. But because of the copying in the loop, lines end up being copied about 1.5 billion times in total (1/2 of 55,000 squared); the first line alone is copied again on every one of the 55,000 iterations. The program is spending all its time copying and garbage collecting. I ran this on my laptop (2.66GHz Core2Duo, 1GB heap) and it took 15 minutes to run when reading from a local file (no network latency or web crawling countermeasures).
To fix this, make theWebPage into a StringBuilder instead, and change the line in the loop to be
theWebPage.append(" ").append(str);
You can convert theWebPage to a String using toString() after the loop if you wish. When I ran the modified version, it took a fraction of a second.
BTW your code is using a bare code block within { } inside a class. This is an instance initializer (as opposed to a static initializer). It gets run at object construction time. This is legal, but it's quite unusual. Notice that it misled other commenters. I'd suggest converting this code block into a named method.
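A minimal sketch of the StringBuilder fix, written as named methods rather than an instance initializer. The class name PageScanner, the method names, and the sample input are made up for illustration. The sketch also assumes the matching loop was meant to call find(), as the question's text says; the posted loop calls m.group() without it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageScanner {
    // Reads every line into one String. Appending to a StringBuilder avoids
    // re-copying all previously read text on each iteration.
    static String readAll(Reader source) {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(' ').append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return page.toString();
    }

    // Collects every match of regex in text. find() advances the matcher
    // and returns false once there are no more matches.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the downloaded page.
        String page = readAll(new StringReader(
                "<a href=\"#civil-engineering\">\n<a href=\"#physics\">"));
        System.out.println(findAll("#\\w+(-\\w+)+", page));
    }
}
```

Passing a Reader instead of a URL keeps the reading logic testable without a network connection; for the real page, an InputStreamReader over theUrl.openStream() could be passed in.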

Is this your whole program? Where is the declaration of parserObject?
Also, shouldn't all of this code be in your main() prior to calling matchString()?
parserObject courseObject = new parserObject();
ArrayList<parserObject> courseObjects = new ArrayList<parserObject>();
ArrayList<String> courseNames = new ArrayList<String>();
String theWebPage=" ";
{
try {
URL theUrl = new URL("http://ocw.mit.edu/courses/");
BufferedReader reader = new BufferedReader(new InputStreamReader(theUrl.openStream()));
String str = null;
while((str = reader.readLine())!=null)
{
theWebPage = theWebPage+" "+str;
}
reader.close();
} catch (MalformedURLException e) {
} catch (IOException e) {
}
}
You are also catching exceptions and not displaying any error messages. You should always display an error message and do something when you encounter an exception. For example, if you can't download the page, there is no reason to try to parse an empty string.
From your comment I learned about initializer blocks in classes (thank you, I didn't know about them). From what I've read, a static initializer needs the keyword static before the opening {; without it, the block is an instance initializer that runs when the object is constructed. Also, it might just be better to put the code into your main, so that you can exit if you get a MalformedURLException or IOException.
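A rough sketch of the error handling suggested above: the download lives in its own method, and main reports the failure and exits instead of going on to parse an empty page. PageFetcher and fetch are hypothetical names, and UTF-8 is an assumed page encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    // Downloads the page at the given address; failures propagate to the caller.
    static String fetch(String address) throws IOException {
        URL url = new URL(address); // throws MalformedURLException for a bad address
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) {
        try {
            String page = fetch("http://ocw.mit.edu/courses/");
            System.out.println("Read " + page.length() + " characters");
        } catch (MalformedURLException e) {
            System.err.println("Bad URL: " + e.getMessage());
            System.exit(1);
        } catch (IOException e) {
            System.err.println("Could not download the page: " + e.getMessage());
            System.exit(1);
        }
    }
}
```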

You can, of course, solve this assignment with the limited JDK 1.0 API, and run into the issue that Stuart Marks helped you solve in his excellent answer.
Or, you just use a popular de-facto standard library, like for instance, Apache Commons IO, and read your website into a String using a no-brainer like this:
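// using these...
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;
// run this (assuming the page is UTF-8 encoded)...
try (InputStream is = new URL("http://ocw.mit.edu/courses/").openStream()) {
    theWebPage = IOUtils.toString(is, StandardCharsets.UTF_8);
}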

Related

OutputStreamWriter only writing one item into file

I have used the following code to write elements from an ArrayList into a file, to be retrieved later using StringTokenizer. It works perfectly for three other ArrayLists, but for this particular one it throws an exception when reading with .nextToken(), and further troubleshooting with .countTokens() shows that it finds only 1 token in the file. The delimiter for both writing and reading is the same, ",", as for the other ArrayLists.
I'm puzzled why it doesn't work the way it does with the other lists when I have not changed the code structure.
=================Writing to file==================
public static void copy_TimeZonestoFile(ArrayList<AL_TimeZone> timezones, Context context){
try {
FileOutputStream fileOutputStream = context.openFileOutput("TimeZones.dat",Context.MODE_PRIVATE);
OutputStreamWriter writerFile = new OutputStreamWriter(fileOutputStream);
int TZsize = timezones.size();
for (int i = 0; i < TZsize; i++) {
writerFile.write(
timezones.get(i).getRegion() + "," +
timezones.get(i).getOffset() + "\n"
);
}
writerFile.flush();
writerFile.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
==========Reading from file (nested in thread/runnable combo)===========
public void run() {
if (fileTimeZones.exists()){
System.out.println("Timezone file exists. Loading.. File size is : " + fileTimeZones.length());
try{
savedTimeZoneList.clear();
BufferedReader reader = new BufferedReader(new InputStreamReader(openFileInput("TimeZones.dat")));
String lineFromTZfile = reader.readLine();
while (lineFromTZfile != null ){
StringTokenizer token = new StringTokenizer(lineFromTZfile,",");
AL_TimeZone timeZone = new AL_TimeZone(token.nextToken(),
token.nextToken());
savedTimeZoneList.add(timeZone);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (Exception e){
e.printStackTrace();
}
}
}
===================Trace======================
I/System.out: Timezone file exists. Loading.. File size is : 12373
W/System.err: java.util.NoSuchElementException
at java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
at com.cryptotrac.trackerService$1R_loadTimeZones.run(trackerService.java:215)
W/System.err: at java.lang.Thread.run(Thread.java:764)

It appears that this line of your code is causing the java.util.NoSuchElementException to be thrown.
AL_TimeZone timeZone = new AL_TimeZone(token.nextToken(), token.nextToken());
That probably means that at least one of the lines in file TimeZones.dat does not contain precisely two strings separated by a single comma.
This can be easily checked by making sure that the line that you read from the file is a valid line before you try to parse it.
Using method split, of class java.lang.String, is preferable to using StringTokenizer. Indeed the javadoc of class StringTokenizer states the following.
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
Try the following.
String lineFromTZfile = reader.readLine();
while (lineFromTZfile != null ){
String[] tokens = lineFromTZfile.split(",");
if (tokens.length == 2) {
// valid line, proceed to handle it
}
else {
// optionally handle an invalid line - maybe write it to the app log
}
lineFromTZfile = reader.readLine(); // Read next line in file.
}
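To make the validation above concrete, here is a small self-contained sketch. Because the AL_TimeZone class is not shown in the question, plain String[] pairs stand in for it; TimeZoneLines and parse are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TimeZoneLines {
    // Returns one {region, offset} pair per well-formed line.
    // Malformed lines are reported on stderr and skipped.
    static List<String[]> parse(BufferedReader reader) {
        List<String[]> rows = new ArrayList<>();
        try {
            String line;
            // Reading inside the loop condition guarantees every pass sees a new line.
            while ((line = reader.readLine()) != null) {
                // The -1 limit keeps trailing empty fields, so "a,b," counts as 3 tokens.
                String[] tokens = line.split(",", -1);
                if (tokens.length == 2 && !tokens[0].isEmpty() && !tokens[1].isEmpty()) {
                    rows.add(tokens);
                } else {
                    System.err.println("Skipping malformed line: " + line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

Each returned pair could then be handed to the two-argument AL_TimeZone constructor used in the question.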

There are probably multiple things wrong, because I'd actually expect you to run into an infinite loop: you only read the first line of the file and then parse it repeatedly.
You should check the following things:
Make sure that you are writing the file correctly. What exactly does the written file contain? Is there a newline at the end of each line?
Make sure that the data written (in this case, "region" and "offset") never contains a comma, otherwise parsing will break. I expect that there is a very good chance that "region" contains a comma.
When reading files, you always need to assume that the file (format) is broken. For example, assume that readLine will return an empty line or something that contains more or fewer than one comma.
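If the region field may itself contain commas, one possible workaround is to split at the last comma only. This is a sketch under the assumption that the offset never contains a comma, which the question does not confirm; LineSplitter and splitAtLastComma are hypothetical names.

```java
import java.util.Optional;

class LineSplitter {
    // Splits "region,offset" at the LAST comma so commas inside region survive.
    // Returns empty when there is no comma or either side would be empty.
    static Optional<String[]> splitAtLastComma(String line) {
        int idx = line.lastIndexOf(',');
        if (idx <= 0 || idx == line.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(new String[] {
            line.substring(0, idx),
            line.substring(idx + 1)
        });
    }
}
```

A more robust alternative would be to change the file format to use a delimiter that cannot occur in either field.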

Editing a file using async threads in Java

I'm a small Java developer currently working on a Discord bot that I made in Java. One of the features of my bot is a leveling system that runs whenever anyone sends a message (there are other conditions, but they are irrelevant to the problem I'm encountering).
Whenever someone sends a message, an event is fired and a thread is created to compute how much exp the user should gain. Eventually, the function that edits the storage file is called.
This works fine when called sparsely, but if two threads try to write to the file at once, the file usually gets deleted or truncated, and both outcomes are undesired.
I then tried to make a queuing system. It worked for over 24 hours but still failed once, so it is more stable but not fixed. I only know the basics of how threads work, so I may have skipped over something important that causes the problem.
The function looks like this:
Thread editingThread = null;
public boolean editThreadStarted = false;
HashMap<String, String> queue = new HashMap<>();
public final boolean editParameter(String key, String value) {
queue.put(key, value);
if(!editThreadStarted) {
editingThread = new Thread(new Runnable() {
@Override
public void run() {
while(queue.keySet().size() > 0) {
String key = (String) queue.keySet().toArray()[0];
String value = queue.get(key);
File inputFile = getFile();
File tempFile = new File(getFile().getName() + ".temp");
try {
tempFile.createNewFile();
} catch (IOException e) {
DemiConsole.error("Failed to create temp file");
handleTrace(e);
continue;
}
//System.out.println("tempFile.isFile = " + tempFile.isFile());
try (BufferedReader reader = new BufferedReader(new FileReader(inputFile)); BufferedWriter writer = new BufferedWriter(new FileWriter(tempFile))){
String currentLine;
while((currentLine = reader.readLine()) != null) {
String trimmedLine = currentLine.trim();
if(trimmedLine.startsWith(key)) {
writer.write(key + ":" + value + System.getProperty("line.separator"));
continue;
}
writer.write(currentLine + System.getProperty("line.separator"));
}
writer.close();
reader.close();
inputFile.delete();
tempFile.renameTo(inputFile);
} catch (IOException e) {
DemiConsole.error("Caught an IO exception while attempting to edit parameter ("+key+") in file ("+getFile().getName()+"), returning false");
handleTrace(e);
continue;
}
queue.remove(key);
}
editThreadStarted = false;
}
});
editThreadStarted = true;
editingThread.start();
}
return true;
}
getFile() returns the file the function is meant to write to.
The file format is:
memberid1:expamount
memberid2:expamount
memberid3:expamount
memberid4:expamount
The editing works by creating a temporary file to which I write all of the original file's data line by line, checking whether the memberid matches the one I want to edit. If it does, then instead of writing the original line I write the edited line with the new expamount, before continuing with the rest of the lines. Once that is done, the original file is deleted and the temporary file is renamed to the original file's name, replacing it.
This function will always be called asynchronously, so making the whole thing synchronous is not an option.
Thanks in advance
Edit(1) :
It has been suggested that I use semaphores, and after digging into it a little (I had never heard of semaphores before) it seems to be a really good option that would remove the need for a queue: simply acquire at the beginning and release at the end, nothing more required!

I ended up using semaphores, as per user207421's suggestion, and it seems to work perfectly.
To test it, I put delays between each line write to artificially make the task longer, so that multiple threads would try to write at once, and they all wait for their turn!
Thanks
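The poster's final code is not shown, so the following is only a sketch of what a semaphore-guarded edit could look like: acquire at the beginning, release in a finally block at the end. ExpStore and its constructor are hypothetical, and java.nio.file.Files is used instead of the question's reader/writer pair for brevity. Unlike the original, the sketch matches on key + ":" so that memberid1 does not also match memberid10.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;

class ExpStore {
    private final Path file;
    // One permit: at most one thread may rewrite the file at a time.
    private final Semaphore lock = new Semaphore(1);

    ExpStore(Path file) {
        this.file = file;
    }

    // Replaces the "key:..." line with "key:value".
    // Returns false if the thread was interrupted while waiting for the permit.
    boolean editParameter(String key, String value) {
        try {
            lock.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            List<String> out = new ArrayList<>();
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                out.add(line.trim().startsWith(key + ":") ? key + ":" + value : line);
            }
            // Write to a temp file next to the original, then swap it in.
            Path temp = file.resolveSibling(file.getFileName() + ".temp");
            Files.write(temp, out, StandardCharsets.UTF_8);
            Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            lock.release(); // always hand the permit back, even on failure
        }
    }
}
```

Callers can still invoke editParameter from any number of threads; each call simply waits for its turn, which matches the behavior described above.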

Java variable not being affected

This may sound like a question that has been asked many times before, but I've done a day of research and the people I found had other causes for this issue.
I have a function that reads part of the save file, and I have confirmed that it receives the correct data. The error is that the integer field completely ignores the newly assigned value and shows no change in the live debugger, so unlike many other posts, this is not just a duplicate-object error. I can't seem to pinpoint the main issue, and it's the last major thing holding me back. Any help would be great, and I'm sorry if I managed to miss a topic about this on the internet.
Code that fails:
@Override
public void read(List<String> data) {
//world positions are not being changed at all
System.out.println(data.get(1));
int test = Integer.valueOf(data.get(1).replaceAll("[^\\d.]", ""));
worldXPos = Integer.valueOf(data.get(0).replaceAll("[^\\d.]", ""));
worldZPos = test;
}
Another class that gives the data:
public void readSaveFunctions(){
if(!gameSaves.exists()){
gameSaves.mkdir();
}
String currentLine;
try {
List<String> data = new ArrayList<String>();
FileReader read = new FileReader(currentFile);
BufferedReader reader = new BufferedReader(read);
String key = "";
while((currentLine = reader.readLine()) != null){
if(currentLine.contains("#")){
key = currentLine;
data = new ArrayList<String>();
}else if(currentLine.contains("*end")){
for(int i = 0; i < saves.length; i++){
String tryKey = "#" + saves[i].IDName();
if(tryKey.equals(key)){
key = "";
saves[i].read(data);
}
}
}else data.add(currentLine);
}
reader.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
Another way of explaining it is this:
The debugger is set to step-by-step mode, so I see each line executed at human speed. Then I get to a line like this one (all of the lines that set the variables behave the same way):
worldXPos = Integer.valueOf(data.get(0).replaceAll("[^\\d.]", ""));
and the debugger shows the two integers having different numbers, but the instance variable stays exactly the same, with no change in the debugger after the line executes.
Update:
I forgot to mention that the method has an @Override annotation, and it seems that this @Override may be causing the issue. Now I may finally have a path to follow again.

So I found my answer: the AWT thread managed to call a method from another class that changed the integer before it could be read. It really threw me off at first because the debugger only showed one of the threads, with no way to know the other one was actively changing the value too early. Thanks for all the help :P.

String to Text - Line Break instead of comma

I need help before I totally despair :D
As you will see, I tried it in different ways, even though there are only a few differences between them. My problem is that I have a string which I want (or have) to output, which means I need it in a text file. Not that big a problem, eh? But the actual problem is that I want line breaks instead of commas. I know I could just replace them after the file is written, but that's unnecessary when there is another way.
The Output looks like this
[/rechtschreibung/_n, /rechtschreibung/_nauf, /rechtschreibung/_naus,
/rechtschreibung/_Ndrangheta, ....]
I want it to look like this
/rechtschreibung/_n
/rechtschreibung/_nauf
/rechtschreibung/_naus
/rechtschreibung/_Ndrangheta
Even though I won't need this method later, because I will store this and some other information in a database such as SQL, it will help me build up the program step by step and learn some more Java ;)
So here is my code snippet
BufferedWriter bw = null;
//PrintWriter out
//= new PrintWriter(new BufferedWriter(new FileWriter("foo.out")));
try {
bw = new BufferedWriter(new FileWriter("bfwr.txt"));
bw.write(test5.getWoerterListe().toString());
bw.newLine();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
/*
try {
PrintWriter out = new PrintWriter(new FileWriter("prwr.txt"));
out.print(test5.getWoerterListe());
out.close();
System.out.printf("Testing String");
} catch (IOException e) {
e.printStackTrace();
}
*/
/*
try {
FileWriter test10 = new FileWriter("test.txt");
test10.write(test5.getWoerterListe().toString());
test10.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
*/
Please be nice to me :D
Assistance appreciated =)
EDIT #1
The code directly before the first snippet:
Oberordner test2 = new Oberordner("http://www.duden.de/definition");
Unterordner test3 = new Unterordner(test2.getOberOrdner());
WoerterListe test5 = new WoerterListe(test3.getUnterOrdnerURL());
test5.setWoerterListe();
and, from WoerterListe.java, the very end:
public ArrayList<String> getWoerterListe(){
return WoerterListe;
}
Additional information: the string is not stored in the code because there are tens of thousands of words like '/rechtschreibung/*'.
By the way, the language here is German; unfortunately I have to use German words =(

I'm not a Java developer and you didn't state what getWoerterListe() returns, but here's my guess.
getWoerterListe() probably returns a list of strings, and the default behaviour of toString() in this case is to convert the list to comma-separated values. So instead of calling toString() on the list, loop through it and write out each line followed by a carriage return (or whatever Java uses to end lines).
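The loop described above could look roughly like this in Java. WordListWriter and writeLines are hypothetical names, and UTF-8 is an assumed encoding; BufferedWriter.newLine() writes the platform's line separator.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class WordListWriter {
    // Writes each list element on its own line instead of the
    // "[a, b, c]" form that List.toString() produces.
    static void writeLines(List<String> words, Path target) {
        // try-with-resources closes (and flushes) the writer even on failure.
        try (BufferedWriter bw = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String word : words) {
                bw.write(word);
                bw.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the question's objects, a call might be writeLines(test5.getWoerterListe(), Paths.get("bfwr.txt")), since getWoerterListe() returns an ArrayList<String>.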

Code:
String s = "[/rechtschreibung/_n, /rechtschreibung/_nauf, "
+ "/rechtschreibung/_naus, /rechtschreibung/_Ndrangheta, ....]";
String srp = s.replaceAll("\\[|\\]|\\.+" ,"");
String[] sp = srp.split(",");
for (int i = 0; i < sp.length; i++) {
System.out.println(sp[i].trim());
}
Output:
/rechtschreibung/_n
/rechtschreibung/_nauf
/rechtschreibung/_naus
/rechtschreibung/_Ndrangheta
Explanation:
I assumed [/rechtschreibung/_n, /rechtschreibung/_nauf, /rechtschreibung/_naus, /rechtschreibung/_Ndrangheta, ....] is a String. I removed all unnecessary characters such as [, ], and any run of . from it. After that, I split it on , and printed each element of the split string to the output.

Java looping through array - Optimization

I've got some Java code that runs as expected, but it takes some time (a few seconds) even though the job is just looping through an array.
The input file is a FASTA file. The file I'm using is 2.9 MB, and there are other FASTA files that can be up to 20 MB.
In the code I'm trying to loop through it in groups of three, e.g. AGC TTT TCA ... etc. The code makes no functional sense for now, but what I want is to append each amino acid to its equivalent group of bases. Example:
AGC - Ser / CUG Leu / ... etc
So what's wrong with the code? Is there any way to do it better? Any optimization? Looping through the whole String takes some time, maybe just seconds, but I need to find a better way to do it.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
public class fasta {
public static void main(String[] args) throws IOException {
File fastaFile;
FileReader fastaReader;
BufferedReader fastaBuffer = null;
StringBuilder fastaString = new StringBuilder();
try {
fastaFile = new File("res/NC_017108.fna");
fastaReader = new FileReader(fastaFile);
fastaBuffer = new BufferedReader(fastaReader);
String fastaDescription = fastaBuffer.readLine();
String line = fastaBuffer.readLine();
while (line != null) {
fastaString.append(line);
line = fastaBuffer.readLine();
}
System.out.println(fastaDescription);
System.out.println();
String currentFastaAcid;
for (int i = 0; i < fastaString.length(); i+=3) {
currentFastaAcid = fastaString.toString().substring(i, i + 3);
System.out.println(currentFastaAcid);
}
} catch (NullPointerException e) {
System.out.println(e.getMessage());
} catch (FileNotFoundException e) {
System.out.println(e.getMessage());
} catch (IOException e) {
System.out.println(e.getMessage());
} finally {
fastaBuffer.close();
}
}
}

currentFastaAcid = fastaString.toString().substring(i, i + 3);
Please replace with
currentFastaAcid = fastaString.substring(i, i + 3);
The toString method of StringBuilder creates a new String object every time you call it, and each one contains a copy of your entire large string. If you call substring directly on the StringBuilder, it returns only a small copy of the requested substring.
Also remove System.out.println if you don't really need it.
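As a sketch of the same idea in reusable form, the sequence can be converted to a String once, outside the loop, and then sliced. Codons and split are hypothetical names. The loop bound i + 3 <= length is an addition: with the original bound i < length, substring(i, i + 3) throws StringIndexOutOfBoundsException when the length is not a multiple of three.

```java
import java.util.ArrayList;
import java.util.List;

class Codons {
    // Splits the sequence into consecutive groups of three bases.
    // A trailing group of one or two bases is dropped.
    static List<String> split(CharSequence sequence) {
        String s = sequence.toString(); // converted once, not once per iteration
        List<String> codons = new ArrayList<>(s.length() / 3);
        for (int i = 0; i + 3 <= s.length(); i += 3) {
            codons.add(s.substring(i, i + 3));
        }
        return codons;
    }
}
```

Because the parameter is a CharSequence, the question's fastaString StringBuilder can be passed in directly.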

The big factor here is that you are calling substring on a new String each time.
Instead, call substring directly on the StringBuilder:
for (int i = 0; i < fastaString.length(); i+=3){
currentFastaAcid = fastaString.substring(i, i + 3);
System.out.println(currentFastaAcid);
}
Also, instead of printing currentFastaAcid each time, save it into a list and print the list at the end:
List<String> acids = new LinkedList<String>();
for (int i = 0; i < fastaString.length(); i+=3){
currentFastaAcid = fastaString.substring(i, i + 3);
acids.add(currentFastaAcid);
}
System.out.println(acids.toString());
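If the groups do need to be printed one per line, another way to batch the output is to send it through a BufferedWriter, so that the console is typically written to in large chunks rather than once per group. This is a sketch with made-up names (CodonPrinter, printCodons); the loop bound i + 3 <= length is an addition that skips an incomplete trailing group.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class CodonPrinter {
    // Writes each group of three characters on its own line,
    // buffering the output and flushing once at the end.
    static void printCodons(CharSequence sequence, Writer out) {
        try {
            BufferedWriter buffered = new BufferedWriter(out);
            for (int i = 0; i + 3 <= sequence.length(); i += 3) {
                buffered.append(sequence, i, i + 3); // copies only the three characters
                buffered.newLine();
            }
            buffered.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the FASTA sequence.
        printCodons("AGCTTTTCA", new OutputStreamWriter(System.out));
    }
}
```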

Your main problem, besides the debug output, is surely that you are creating a new String containing all the data read from the file in each iteration of your loop:
currentFastaAcid = fastaString.toString().substring(i, i + 3);
fastaString.toString() gives the same result in each iteration and is therefore redundant. Move it outside the loop and you will surely save some seconds of run time.

Apart from the suggested optimizations in the serial code, I would use parallel processing to reduce the time further. If you have a really big file, you can divide the work of reading the file and processing the lines read between separate threads. That way, while one thread is busy reading the next line from the large file, another thread can process the lines already read and print them on the console.

If you remove the
System.out.println(currentFastaAcid);
line in the for loop, you will save a decent amount of time.
