Scrape website for one data

I would like to extract the value of <div class="score">4.1</div> from a website with Java (Android). I tried Jsoup, and even though it couldn't be simpler to use, it takes 8 seconds to give me the value, which is very slow. Note that the page source is about 300,000 characters long and this <div> is somewhere in the middle.
Even using HttpClient, reading the source into a StringBuilder, and then scanning the whole string until the score part is found is faster (3-4 seconds).
I couldn't try out HtmlUnit because it requires a massive number of jar files, and after a while Eclipse always got hopelessly confused.
Is there a faster way?

You could simply send an XMLHttpRequest and then search the response using the search() function. I think this would be much faster.
Similar question: Retrieving source code using XMLhttpRequest in javascript
To make the search faster, you can use indexOf([string to search], [starting index]) and specify a starting index (it doesn't need to be very accurate; you just have to reduce the search area).

Here is what I did. The problem was that I read the web page line by line, glued the lines together in a StringBuilder, and then searched for the specific part. Then I asked myself: why read the page line by line only to glue it back together? So instead I read the page into a byte array and converted it into a String. The scraping time dropped to less than a second!
ByteArrayOutputStream outputDoc = new ByteArrayOutputStream();
InputStream is = null;
try {
    is = new URL(url).openStream();
    byte[] buf = new byte[1024];
    int len;
    // read() returns -1 at end of stream
    while ((len = is.read(buf)) != -1) {
        outputDoc.write(buf, 0, len);
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    // always release the connection
    if (is != null) {
        try { is.close(); } catch (IOException ignored) { }
    }
}
try {
    String page = new String(outputDoc.toByteArray(), "UTF-8");
    // here I used page.indexOf to find the part
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
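The indexOf step mentioned in the comment above could look roughly like the following. This is only a sketch: the class name ScoreFinder and the method findScore are made up for illustration, and it assumes the markup around the score is exactly <div class="score">...</div> as shown in the question. The fromIndex parameter is the optional starting index suggested in the other answer.

```java
final class ScoreFinder {
    // Assumed markers, taken from the snippet in the question.
    private static final String OPEN = "<div class=\"score\">";
    private static final String CLOSE = "</div>";

    /** Returns the text of the first score div found at or after fromIndex, or null if there is none. */
    static String findScore(String page, int fromIndex) {
        int start = page.indexOf(OPEN, fromIndex);
        if (start < 0) {
            return null; // opening tag not found
        }
        start += OPEN.length();
        int end = page.indexOf(CLOSE, start);
        if (end < 0) {
            return null; // closing tag not found
        }
        return page.substring(start, end).trim();
    }
}
```

Passing a fromIndex somewhere before the expected position only shrinks the search area; passing 0 searches the whole page.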

Related

Fastest way to import millions of JSON documents to MongoDB

I have more than 10 million JSON documents of the form :
["key": "val2", "key1" : "val", "{\"key\":\"val", \"key2\":\"val2"}"]
in one file.
Importing with the Java driver API took around 3 hours using the following function (inserting one BSON document at a time):
public static void importJSONFileToDBUsingJavaDriver(String pathToFile, DB db, String collectionName) {
    // open file
    FileInputStream fstream = null;
    try {
        fstream = new FileInputStream(pathToFile);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        System.out.println("file not exist, exiting");
        return;
    }
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    // read it line by line
    String strLine;
    DBCollection newColl = db.getCollection(collectionName);
    try {
        while ((strLine = br.readLine()) != null) {
            // convert line by line to BSON
            DBObject bson = (DBObject) JSON.parse(strLine);
            // insert BSONs to database
            try {
                newColl.insert(bson);
            }
            catch (MongoException e) {
                // duplicate key
                e.printStackTrace();
            }
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Is there a faster way? Might MongoDB settings influence the insertion speed? For example, adding an "_id" key that functions as the index, so that MongoDB does not have to create an artificial key (and thus an index entry) for each document, or disabling index creation entirely during insertion.
Thanks.

I'm sorry, but you're all picking at minor performance issues instead of the core one. Separating the file-reading logic from the inserting is a small gain. Loading the file in binary mode (via MMAP) is a small gain. Using Mongo's bulk inserts is a big gain, but still no dice.
The whole performance bottleneck is the BSON bson = JSON.parse(line) call. In other words, the problem with the Java driver is that it needs a conversion from JSON to BSON, and this code seems to be awfully slow or badly implemented. A full JSON round trip (encode + decode) via JSON-simple, or especially via JSON-smart, is 100 times faster than the JSON.parse() call.
I know Stack Overflow is telling me right above this box that I should be answering the question, which I'm not, but rest assured that I'm still looking for an answer to this problem. I can't believe all the talk about Mongo's performance and then this simple example code fails so miserably.

I've imported a multi-line JSON file with ~250M records. I just used mongoimport < data.txt and it took 10 hours. Compared to your 3 hours for 10M, I think this is considerably faster.
Also, in my experience, writing your own multi-threaded parser would speed things up drastically. The procedure is simple:
1) Open the file as BINARY (not TEXT!).
2) Set markers (offsets) evenly across the file. The number of markers depends on the number of threads you want.
3) Search for '\n' near the markers and adjust the markers so they are aligned to line boundaries.
4) Parse each chunk with a thread.
A reminder: when you want performance, don't use a stream reader or any built-in line-based read methods. They are slow. Just use a binary buffer and search for '\n' to identify a line, and (preferably) parse in place in the buffer without creating a string. Otherwise the garbage collector won't be happy.
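The marker-alignment step of the procedure above can be sketched as follows. The class and method names are made up for illustration, and the sketch assumes the whole file already sits in a byte array (for a file of this size a memory-mapped buffer would be the more realistic choice). Each chunk between two consecutive offsets can then be handed to its own thread.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkSplitter {
    /**
     * Returns chunk boundaries for data: the first entry is 0, the last is data.length,
     * and every entry in between points at the first byte after a '\n'.
     */
    static List<Integer> lineAlignedOffsets(byte[] data, int chunks) {
        List<Integer> offsets = new ArrayList<>();
        offsets.add(0);
        for (int i = 1; i < chunks; i++) {
            // evenly spaced marker
            int pos = (int) ((long) data.length * i / chunks);
            // move the marker forward to the next line break
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            if (pos < data.length) {
                pos++; // start the next chunk just after the '\n'
            }
            // skip markers that collapsed onto the previous one or onto the end
            if (pos > offsets.get(offsets.size() - 1) && pos < data.length) {
                offsets.add(pos);
            }
        }
        offsets.add(data.length);
        return offsets;
    }
}
```

Because markers only ever move forward to a line break, no line is split between two chunks, although chunks may end up slightly uneven in size.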

You can parse the entire file at once and then insert the whole JSON into a Mongo document. Avoid multiple loops; you need to separate the logic as follows:
1) Parse the file and retrieve the JSON object.
2) Once parsing is over, save the JSON object in the Mongo document.

I've got a slightly faster way (I'm also inserting millions at the moment): insert lists of documents instead of single documents with
insert(List<DBObject> list)
http://api.mongodb.org/java/current/com/mongodb/DBCollection.html#insert(java.util.List)
That said, it's not that much faster. I'm about to experiment with setting WriteConcerns other than ACKNOWLEDGED (mainly UNACKNOWLEDGED) to see if I can speed it up further. See http://docs.mongodb.org/manual/core/write-concern/ for info.
Another way to improve performance is to create indexes after bulk inserting. However, this is rarely an option except for one-off jobs.
Apologies if this sounds slightly woolly; I'm still testing things myself. Good question.
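To feed insert(List<DBObject> list) you need to group the parsed documents into lists first. A minimal sketch of that grouping, independent of the driver: Batcher and forEachBatch are hypothetical names, the sink stands in for whatever performs the insert (for example a lambda calling the collection's insert with the list), and the batch size is something to tune by measurement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class Batcher {
    /** Hands items to sink in lists of at most batchSize elements; returns the number of batches sent. */
    static <T> int forEachBatch(Iterable<T> items, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        for (T item : items) {
            batch.add(item);
            if (batch.size() == batchSize) {
                sink.accept(batch);
                batch = new ArrayList<>(batchSize); // fresh list, the sink may keep the old one
                batches++;
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // final, partially filled batch
            batches++;
        }
        return batches;
    }
}
```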

You can also remove all the indexes (except for the PK index, of course) and rebuild them after the import.

Use bulk insert/upsert operations. Since MongoDB 2.6 you can do bulk updates/upserts. The example below does a bulk update using the C# driver.
MongoCollection<foo> collection = database.GetCollection<foo>(collectionName);
var bulk = collection.InitializeUnorderedBulkOperation();
foreach (FooDoc fooDoc in fooDocsList)
{
var update = new UpdateDocument { {fooDoc.ToBsonDocument() } };
bulk.Find(Query.EQ("_id", fooDoc.Id)).Upsert().UpdateOne(update);
}
BulkWriteResult bwr = bulk.Execute();

You can use a bulk insertion.
You can read the documentation on the MongoDB website, and you can also check this Java example on Stack Overflow.

Failing for Larger Input Files Only: FileServiceFactory getBlobKey throws IllegalArgumentException

I have a Google App Engine app that converts XML to CSV files. It works fine for small XML inputs, but refuses to finalize the files for larger XML input. The XML is read from, and the resulting CSV files are written to, many times before finalization, over a long-running (multi-day) task. My problem is different from this one: FileServiceFactory getBlobKey throws IllegalArgumentException, since my code works fine both in production and development with small input files. So it's not that I'm neglecting to write to the file before closing/finalizing. However, it fails when I attempt to read from a larger XML file. The input XML file is ~150MB, and each of the resulting 5 CSV files is much smaller (perhaps 10MB each). I persisted the file URLs for the new CSV files, and even tried to close them with some static code, but I just reproduce the same error, which is
java.lang.IllegalArgumentException: creation_handle: String properties must be 500 characters or less. Instead, use com.google.appengine.api.datastore.Text, which can store strings of any length.
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedSingleValue(DataTypeUtils.java:242)
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedValue(DataTypeUtils.java:207)
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedValue(DataTypeUtils.java:173)
at com.google.appengine.api.datastore.Query$FilterPredicate.<init>(Query.java:900)
at com.google.appengine.api.datastore.Query$FilterOperator.of(Query.java:75)
at com.google.appengine.api.datastore.Query.addFilter(Query.java:351)
at com.google.appengine.api.files.FileServiceImpl.getBlobKey(FileServiceImpl.java:329)
But I know that it's not a String/Text data type issue, since I am already using similar length file service urls for the previous successful attempts with smaller files. It also wasn't an issue for the other stackoverflow post I linked above. I also tried putting one last meaningless write before finalizing, just in case it would help as it did for the other post, but it made no difference. So there's really no way for me to debug this... Here is my file closing code that is not working. It's pretty similar to the Google how-to example at http://developers.google.com/appengine/docs/java/blobstore/overview#Writing_Files_to_the_Blobstore .
log.info("closing out file 1");
try {
//locked set to true
FileWriteChannel fwc1 = fileService.openWriteChannel(csvFile1, true);
fwc1.closeFinally();
} catch (IOException ioe) {ioe.printStackTrace();}
// You can't get the blob key until the file is finalized
BlobKey blobKeyCSV1 = fileService.getBlobKey(csvFile1);
log.info("csv blob storage key is:" + blobKeyCSV1.getKeyString());
csvUrls[i-1] = blobKeyCSV1.getKeyString();
break;
At this point, I just want to finalize my new blob files, for which I have the URLs, but cannot. How can I get around this issue, and what may be the cause? Again, my code works for small files (~60 kB), but the input file of ~150MB fails. Thank you for any advice on what is causing this or how to get around it! Also, how long will my unfinalized files stick around before being deleted?

This issue was a bug in the Java MapReduce and Files API, which was recently fixed by Google. Read announcement here: groups.google.com/forum/#!topic/google-appengine/NmjYYLuSizo

Java - Fastest way, and best code to load a URL and get a response from the server

I was curious what the best and FASTEST way is to get a response from the server. Say I used a for loop to load a URL that returns an XML file: which approach could I use to load the URL and get the response 10 times in a row? Speed is the most important thing. I know it can only go as fast as my internet connection, but I need a way to load the URL as fast as my connection will allow and then put the whole output of the URL in a string so I can append it to a JTextArea. This is the code I've been using, but I'm looking for faster alternatives if possible:
int times = Integer.parseInt(jTextField3.getText());
for(int abc = 0; abc!=times; abc++){
try {
URL gameHeader = new URL(jTextField2.getText());
InputStream in = gameHeader.openStream();
byte[] buffer = new byte[1024];
try {
for(int cwb; (cwb = in.read(buffer)) != -1;){
jTextArea1.append(new String(buffer, 0, cwb));
}
} catch (IOException e) {}
} catch (MalformedURLException e) {} catch (IOException e) {}
}
Is there anything that would be faster than this?
Thanks
-CLUEL3SS

This seems like a job for Java NIO (non-blocking I/O). This article is from Java 1.4 but will still give you a good understanding of how to set up NIO. NIO has evolved a lot since then, and you may need to look up the API for Java 6 or Java 7 to find out what's new.
This solution is probably best as an async option. Basically, it will allow you to load 10 URLs without waiting for each one to complete before moving on and loading another.
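As a rough illustration of loading several URLs without waiting for each one in turn, here is a sketch that uses a plain thread pool rather than NIO selectors, so it is a swapped-in, simpler technique and not the NIO setup described above. ParallelLoader and loadAll are made-up names, and fetch is a placeholder for whatever actually downloads one URL and returns its body.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelLoader {
    /** Runs fetch for every URL on a fixed-size pool and returns the results in input order. */
    static List<String> loadAll(List<String> urls, Function<String, String> fetch, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetch.apply(url);
                futures.add(pool.submit(task)); // all requests are queued before any result is awaited
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that particular request is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

When updating a Swing component such as a JTextArea with the results, do so on the event dispatch thread.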

You can't load text this way, as the 1024-byte boundary could break an encoded character in two.
Copy all the data to a ByteArrayOutputStream and use toString() on it, or read text as text using a BufferedReader.
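A small sketch of the difference, assuming the data is UTF-8 (the class and method names are made up for illustration). decodePerChunk mirrors the loop in the question and can corrupt a character whose bytes straddle two reads; decodeWhole collects all bytes first and decodes once.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ChunkDecoding {
    /** Decodes every chunk on its own, as the code in the question does. */
    static String decodePerChunk(InputStream in, int bufSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[bufSize];
        for (int n; (n = in.read(buf)) != -1; ) {
            // a multi-byte character cut at the chunk boundary is decoded incorrectly here
            sb.append(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Collects all bytes first, then decodes once. */
    static String decodeWhole(InputStream in, int bufSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[bufSize];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Wrapping the stream in an InputStreamReader with an explicit charset (and a BufferedReader around that) avoids the problem as well, because the reader keeps incomplete byte sequences until the rest arrives.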

Use a BufferedReader; use a much larger buffer size than 1024; don't swallow exceptions. You could also try re-using the same URL object instead of creating a new one each time, might help with connection pooling.
But why would you want to read the same URL 10 times in a row?

StringBuilders ending with mass nul characters

I'm having a very difficult time debugging a problem with an application I've been building. I cannot seem to reproduce the problem with a representative test program, which makes it difficult to demonstrate. Unfortunately I cannot share my actual source for security reasons; however, the following test represents fairly well what I am doing: the files and data use Unix-style EOLs, the output is written to a zip file with a PrintWriter, and StringBuilders are used throughout:
import java.io.File;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class Tester {
    public static void main(String[] args) {
        // variables
        File target = new File("TESTSAVE.zip");
        PrintWriter printout1;
        ZipOutputStream zipStream;
        ZipEntry ent1;
        StringBuilder testtext1 = new StringBuilder();
        StringBuilder replacetext = new StringBuilder();
        // ensure file replace
        if (target.exists()) {
            target.delete();
        }
        try {
            // open the streams
            zipStream = new ZipOutputStream(new FileOutputStream(target, true));
            printout1 = new PrintWriter(zipStream);
            ent1 = new ZipEntry("testfile.txt");
            zipStream.putNextEntry(ent1);
            // construct the data
            for (int i = 0; i < 30; i++) {
                testtext1.append("Testing 1 2 3 Many! \n");
            }
            replacetext.append("Testing 4 5 6 LOTS! \n");
            replacetext.append("Testing 4 5 6 LOTS! \n");
            // the replace operation
            testtext1.replace(21, 42, replacetext.toString());
            // write it
            printout1 = new PrintWriter(zipStream);
            printout1.println(testtext1);
            // save it
            printout1.flush();
            zipStream.closeEntry();
            printout1.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The heart of the problem is that the file I see on my side has 16.3k characters. My friend, whether he uses the app on his PC or looks at exactly the same file as me, sees a file of 19.999k characters, the extra characters being a CRLF followed by a massive number of nul characters. No matter what application, encoding or view I use, I cannot see these nul characters at all; I only see a single LF at the last line, but I do see a file size of 20k. In all cases there is a difference between what is seen for the exact same files on the two machines, even though both are Windows machines and both use the same editing software to view them.
I've not yet been able to reproduce this behaviour with any number of dummy programs. I have, however, been able to trace the final line's stray CRLF to my use of println on the PrintWriter. When I replaced the println(s) calls with print(s + '\n') the problem appeared to go away (the file size was 16.3k). However, when I returned the program to println(s), the problem did not appear to return. I'm currently having the files verified by a friend in France to see if the problem really did go away (since I cannot see the nuls but he can), but this behaviour has me thoroughly confused.
I've also noticed that the documentation of StringBuilder's replace function states "This sequence will be lengthened to accommodate the specified String if necessary". Given that StringBuilder's setLength function pads with nul characters and that the ensureCapacity function sets the capacity to the greater of the input or (currentCapacity*2)+2, I suspected a relation somewhere. However, only once when testing this idea have I been able to get a result that matched what I've seen, and I have not been able to reproduce it since.
Does anyone have any idea what could be causing this error, or at least a suggestion on what direction to take the testing?
Edit since the comments section is broken for me:
Just to clarify, the output is required to be in Unix format regardless of the OS, hence the use of '\n' directly rather than through a formatter. The original StringBuilder that is inserted into is not in fact generated by me but is the contents of a file read in by the program. I'm happy the reading process works, as the information in it is used heavily throughout the application. I've done a little probing too and found that directly prior to saving, the buffer IS the correct capacity and that the output when toString() is invoked is the correct length (i.e. it contains no nul characters and is 16,363 long, not 19,999). This would put the cause of the error somewhere between generating the string and saving the zip file.

Finally found the cause. Managed to reproduce the problem a few times and traced the cause down not to the output side of the code but the input side. My file reading function was essentially this:
char[] buf;
int charcount = 0;
StringBuilder line = new StringBuilder(2048);
InputStreamReader reader = new InputStreamReader(stream);// provides a line-wise read
BufferedReader file = new BufferedReader(reader);
do { // capture loop
try {
buf = new char[2048];
charcount = file.read(buf, 0, 2048);
} catch (IOException e) {
return null; // unknown IO error
}
line.append(buf);
} while (charcount != -1);
// close and output
The problem was appending a buffer that wasn't full, so the trailing elements were still at their initial value of nul ('\0'). The reason I couldn't reproduce it was that some data filled the buffers exactly and some didn't.
Why I couldn't see the problem in my text editors I still have no idea, but I should be able to resolve this now. Any suggestions on the best way to do so are welcome; since this is part of one of my long-term utility libraries, I want to keep it as generic and optimised as possible.
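One possible fix, as a sketch rather than the author's final code (ReadAll and readAll are made-up names): append only the number of characters that read() reported, and test for -1 before appending, so that neither a partially filled buffer nor the final end-of-stream call adds nul characters.

```java
import java.io.IOException;
import java.io.Reader;

final class ReadAll {
    /** Reads everything from reader into a String without padding from unused buffer space. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder(2048);
        char[] buf = new char[2048]; // one buffer, reused for every read
        int charcount;
        while ((charcount = reader.read(buf, 0, buf.length)) != -1) {
            sb.append(buf, 0, charcount); // only the part that was actually filled
        }
        return sb.toString();
    }
}
```

The caller can pass a BufferedReader wrapped around an InputStreamReader, as in the original function, and remains responsible for closing it.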

struts2 data cut in string send to jsp

I've got this problem again...
So I've got String data in my Struts2 app. The data is quite big: 36KB read from an HTML file with this code:
BufferedReader reader = new BufferedReader(new FileReader("FILE.html"));
String readData;
while( (readData = reader.readLine()) != null) {
fileData.append(new String(readData.getBytes(),"UTF-8"));
}
reader.close();
fileData.trimToSize();
this.data2display = fileData.toString();
this.setData2display(this.data2display.replaceAll("\\s+", " "));
I display data2display in my jsp file, with just:
<s:property value="data2display" escape="false" escapeJavaScript="false" />
And... the data is complete while I'm debugging the controller, but when I try to display it in the JSP, I get only part of the data. There are no error/debug logs.
Any idea how to check/fix it?
My app: (struts2, jsp) everything is from the appfuse-basic-struts archetype.

My personal start point would be the source of PropertyTag, and from there on follow the code.
In this case, start with PropertyTag. You see that it extends ComponentTagSupport, which in turn extends StrutsBodyTagSupport.
This is where it gets interesting: the toString method uses a FastByteArrayOutputStream, which uses a default block size (buffer) of 8192 bytes. With the default constructor, as used by StrutsBodyTagSupport, you can't output a String with more data than that.
Not being an expert on Struts, I hesitate to call that an implementation bug; IMHO it should compute the buffer size from the value to be printed. Unfortunately, it doesn't. So I don't think there's an easy way around it.
The non-easy way is obviously defining a List of String data parts smaller than 8k bytes and iterating over that list in the JSP, or just using c:out or something like that.
This may not be the answer you're looking for, but I hope this will at least help you understand the trouble you're in.
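The list-of-parts workaround could be prepared in the action along these lines. This is only a sketch with made-up names (StringParts, split); it limits parts by character count, so for non-ASCII text, where one character can take several bytes in UTF-8, the limit should be chosen well below 8192.

```java
import java.util.ArrayList;
import java.util.List;

final class StringParts {
    /** Splits s into consecutive parts of at most maxChars characters each. */
    static List<String> split(String s, int maxChars) {
        if (maxChars < 2) {
            throw new IllegalArgumentException("maxChars must be at least 2");
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < s.length()) {
            int end = Math.min(start + maxChars, s.length());
            // do not cut a surrogate pair in half
            if (end < s.length() && Character.isHighSurrogate(s.charAt(end - 1))) {
                end--;
            }
            parts.add(s.substring(start, end));
            start = end;
        }
        return parts;
    }
}
```

The resulting list can be exposed through a getter and rendered with an iterator tag in the JSP, one part per iteration.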
