Extracting avg time spent at a place from Google - java

I'm trying to use jsoup to extract the average time spent at a place directly from Google's search results, since the Google Places API does not currently support fetching that information.
For example,
the URL is "https://www.google.com/search?q=vivocity" and the text to extract is "15 min to 2 hr".
I've tried the following code:
String temp = "";
try {
    String url = "https://www.google.com.sg/search?q=vivocity";
    Document doc = Jsoup.connect(url).userAgent("mozilla/17.0").get();
    Elements ele = doc.select("div._B1k");
    for (Element qwer : ele) {
        temp += "Avg time spent: " + qwer.getElementsByTag("b").first().text() + "\n";
    }
} catch (IOException e) {
    e.printStackTrace();
}
I have also tried just outputting doc.text() and searching through the output; it doesn't seem to contain anything related to the average time spent either.
The strange thing is that with other URLs and divs this works perfectly fine.
Any help would be appreciated, thank you.
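One way to narrow this down is to dump exactly what Google serves to Jsoup, since Google returns different markup to non-browser clients than to a real browser. A minimal diagnostic sketch, assuming a modern desktop user-agent string (the _B1k class may simply not exist in the HTML served to Jsoup):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class GoogleDump {
    public static void main(String[] args) throws IOException {
        // Assumption: a modern desktop user agent; Google varies its markup by client.
        Document doc = Jsoup.connect("https://www.google.com.sg/search?q=vivocity")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .get();
        // Save the raw HTML so it can be diffed against what the browser shows.
        Files.write(Paths.get("google-dump.html"), doc.outerHtml().getBytes("UTF-8"));
        // Quick check: does the served page contain the target class at all?
        System.out.println(doc.outerHtml().contains("_B1k"));
    }
}

If the saved HTML lacks the element entirely, the block is rendered client-side and no Jsoup selector will find it.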

Related

Seemingly random tika performance issues

I use RxJava to pipe documents from a source to Tika, and from Tika to Elasticsearch.
At some point Tika takes about 5 minutes to index a document, then continues normally afterwards.
I am unable to properly pin down the cause: if I restart the application, everything is still the same; say it took 5 minutes at the 301st document last time, it will take 5 minutes at the 301st document again. But if I change the order of the documents, it happens neither at the same index (301) nor with the same document (the previous 301st).
Here are the relevant parts of the application:
public Indexable analyze(Indexable indexable) {
    Timestamp from = new Timestamp(System.currentTimeMillis());
    if (indexable instanceof NCFile) {
        // there is some code here that has no effect
        Metadata md = this.generateMetadata(((NCFile) indexable).getPath());
        ((NCFile) indexable).setType(md.get("Content-Type"));
        if (((NCFile) indexable).getType().startsWith("text/")) {
            ((NCFile) indexable).setContent(this.parseContent(((NCFile) indexable).getPath())); // TODO: a lot more could be done here
        } else {
            ((NCFile) indexable).setContent("");
        }
        ((NCFile) indexable).setType(this.detectMediaType(((NCFile) indexable).getPath()));
    }
    Timestamp to = new Timestamp(System.currentTimeMillis());
    // note: this unconditional cast will throw if indexable is not an NCFile
    System.out.println("the file " + ((NCFile) indexable).getPath() + " took " + (to.getTime() - from.getTime()) + " ms to parse");
    return indexable;
}
and the pipeline that is feeding the code above:
nc.filter(action -> action.getOperation() == Operation.INSERT)
    .map(IndexingAction::getIndexable)
    .subscribeOn(Schedulers.computation())
    .map(indexableAction -> metadataAnalyzer.analyze(indexableAction))
    .map(indexable -> {
        indexer.insert(indexable);
        return indexable;
    })
    .subscribeOn(Schedulers.io())
    .map(indexable -> "The indexable " + indexable.getIdentifier() + " of class " + indexable.getClass().getName() + " has been inserted.")
    .subscribe(message -> this.logger.log(Level.INFO, message));
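One detail worth noting about this chain: in RxJava, only the first subscribeOn takes effect, so both calls above run the entire pipeline on the computation scheduler. To actually run the Tika analysis and the Elasticsearch insert on different schedulers, observeOn switches threads mid-stream. A hedged sketch of that placement against the same pipeline (not the author's code):

nc.filter(action -> action.getOperation() == Operation.INSERT)
    .map(IndexingAction::getIndexable)
    .observeOn(Schedulers.computation())   // CPU-bound Tika parsing runs below this point
    .map(indexable -> metadataAnalyzer.analyze(indexable))
    .observeOn(Schedulers.io())            // blocking Elasticsearch inserts run below this point
    .map(indexable -> {
        indexer.insert(indexable);
        return indexable;
    })
    .subscribe(indexable -> this.logger.log(Level.INFO,
        "The indexable " + indexable.getIdentifier() + " has been inserted."));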
My guess would be that the problem is memory- or thread-related, but as far as I can see the code should work perfectly fine.
the file Workspace/xx/xx-server/.test-documents/testRFC822_base64 took 5 ms to index
the file Workspace/xx/xx-server/.test-documents/testPagesHeadersFootersAlphaLower.pages took 306889 ms to index
the file Workspace/xx/xx-server/.test-documents/testFontAfterBufferedText.rtf took 2 ms to index
the file Workspace/xx/xx-server/.test-documents/testOPUS.opus took 7 ms to index
The funny thing is, these are the Tika test files provided in their repo.
EDIT:
After a request I looked into it using Flight Recorder, but I am not sure what exactly I have to look at:
Right after the plateau it stops working, even though neither the RAM nor the CPU limit is reached.
EDIT 2:
Is it the PipedReader that is blocking all of these threads? Do I understand that correctly?
EDIT 3:
Here is a 1-minute flight recording:
Note: the flight recording seems wrong. In my system monitor application I do not see such big memory consumption (apparently 16 GB?!)...
What am I doing wrong?
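Since the stall appears to happen inside the Tika parse, one way to confirm and contain it is to run the parse on a watchdog thread with a timeout, so a single pathological document cannot stall the pipeline. A hedged sketch: parseContent and its String return type are taken from the question, path stands in for the file path argument, and the 30-second budget is an assumption:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Wrap the suspect call so a hang surfaces as a timeout instead of a 5-minute stall.
ExecutorService watchdog = Executors.newSingleThreadExecutor();
Future<String> parse = watchdog.submit(() -> this.parseContent(path));
String content;
try {
    content = parse.get(30, TimeUnit.SECONDS); // assumption: 30 s is a generous per-file budget
} catch (TimeoutException e) {
    parse.cancel(true); // interrupt the parser thread
    content = "";       // index the file without content rather than blocking
} catch (Exception e) {
    content = "";
}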

JSOUP - Extracting data from a dynamic web page

I am trying to extract the price. Can anyone please help me? There is no output for the price or its weight. I've tried several ways, but none of them produce results.
Document doc = Jsoup.connect("https://www.jakmall.com/tokocamzone/mi-travel-charger-20a-output-fast-charging#9730928979371").get();
Elements rows = doc.getElementsByAttributeValue("class", "div[dp__price dp__price--2 format__money]");
System.out.println("rows.size() = " + rows.size());
String index = "";
for (Element span : rows) {
index = span.text();
}
System.out.println("index = " + index);
I've tried another way but did not get a result either. I am very curious, but I have not found the right approach.
If you run the lines of code below, you will discover that there is no price or div[dp__price dp__price--2 format__money] in the DOM. There is only JavaScript.
String d = doc.getElementsByClass("dp__header__info").outerHtml();
System.out.println(d);
Jsoup is not able to fetch the price because the content is loaded dynamically after the page loads. Consider using Selenium, which is more powerful and supports JavaScript-rendered websites.
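A minimal Selenium sketch of that approach, hedged: the .dp__price CSS class is taken from the question and may have changed, and having chromedriver on the PATH is an assumption:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class PriceScraper {
    public static void main(String[] args) {
        // Assumption: chromedriver is installed and on the PATH.
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.jakmall.com/tokocamzone/mi-travel-charger-20a-output-fast-charging#9730928979371");
            // Selenium executes the page's JavaScript, so the dynamically inserted price is in the DOM.
            WebElement price = driver.findElement(By.cssSelector(".dp__price"));
            System.out.println("price = " + price.getText());
        } finally {
            driver.quit();
        }
    }
}

For robustness, an explicit WebDriverWait on the price element would be better than relying on the page being ready immediately after get().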

BigQuery Pagination through large result set with cloud library

I am working on accessing data from Google BigQuery. The data is 500 MB, which I need to transform as part of the requirement. I am setting Allow Large Results, setting a destination table, etc.
I have written a Java job against Google's new cloud library, since that is what is recommended now - com.google.cloud:google-cloud-bigquery:0.21.1-beta (I have tried 0.20-beta as well, without any fruitful results).
I am having problems with pagination of this data; the library is inconsistent in fetching results page-wise. Here is my code snippet,
Code Snippet
System.out.println("Accessing Handle of Response");
QueryResponse response = bigquery.getQueryResults(jobId, QueryResultsOption.pageSize(10000));
System.out.println("Got Handle of Response");
System.out.println("Accessing results");
QueryResult result = response.getResult();
System.out.println("Got handle of Result. Total Rows: "+result.getTotalRows());
System.out.println("Reading the results");
int pageIndex = 0;
int rowId = 0;
while (result != null) {
System.out.println("Reading Page: "+ pageIndex);
if(result.hasNextPage())
{
System.out.println("There is Next Page");
}
else
{
System.out.println("No Next Page");
}
for (List<FieldValue> row : result.iterateAll()) {
System.out.println("Row: " + rowId);
rowId++;
}
System.out.println("Getting Next Page: ");
pageIndex++;
result = result.getNextPage();
}
Output print statements
Accessing Handle of Response
Got Handle of Response
Accessing results
Got handle of Result. Total Rows: 9617008
Reading the results
Reading Page: 0
There is Next Page
Row: 0
Row: 1
Row: 2
Row: 3
:
:
Row: 9999
Row: 10000
Row: 10001
:
:
Row: 19999
:
:
Please note that it never hits/prints "Getting Next Page: ".
My expectation was that I would get the data in chunks of 10,000 rows at a time. Note that if I run the same code on a query which returns 10-15K rows and set the pageSize to 100 records, I do get "Getting Next Page:" after every 100 rows. Is this a known issue with this beta library?
This looks very close to a problem I struggled with for hours, and I just found the solution, so I will share it here, even though you probably found a solution yourself a long time ago.
I did exactly what the documentation and tutorials said, but my page size was not respected and I kept getting all rows every time, no matter what I did. Eventually I found another example, official I think, right here.
What I learned from that example is that you should only use iterateAll() when you want all remaining rows. To get just the current page's rows, you need to use getValues() instead.
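In other words, the loop in the question pulls every page through iterateAll() on its first iteration, which is why "Getting Next Page:" never prints. A hedged sketch of the per-page variant against the same beta API (method names as in 0.21.1-beta; later versions renamed these types):

QueryResult result = response.getResult();
int pageIndex = 0;
while (result != null) {
    // getValues() yields only the current page (up to the requested pageSize).
    for (List<FieldValue> row : result.getValues()) {
        // process one row
    }
    System.out.println("Finished page " + pageIndex++);
    result = result.getNextPage(); // null once the last page has been read
}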

Using twitter4j to search through more than 100 queries [duplicate]

This question already has answers here:
How to retrieve more than 100 results using Twitter4j
(4 answers)
Closed 6 years ago.
I am trying to create a program that searches Twitter for a query. The problem I am having is that the API returns only 100 results per query, and when I try to retrieve more it keeps giving me the same results again.
User user = twitter.showUser("johnny");
Query query = new Query("football");
query.setCount(100);
query.lang("en");
int i = 0;
try {
    QueryResult result = twitter.search(query);
    for (int z = 0; z < 2; z++) {
        for (Status status : result.getTweets()) {
            System.out.println("#" + status.getUser().getScreenName() + ":" + status.getText());
            i++;
        }
    }
} catch (TwitterException e) {
    e.printStackTrace();
}
The program will print 200 results relating to the query "football", but instead of giving me 200 different results it prints the same 100 results twice. My goal is to print as many distinct results as the rate limit allows. I have seen programs that return more than 100 responses for a specific user, but I haven't seen anything that can return more than 100 responses for a search query like "football".
To get more than 100 results for a search Query, you need to keep requesting the next page of results via result.nextQuery().
Query query = new Query("football");
QueryResult result;
int count = 0;
do {
    result = twitter.search(query);
    List<Status> tweets = result.getTweets();
    for (Status tweet : tweets) {
        System.out.println("#" + tweet.getUser().getScreenName() + ":" + tweet.getText());
        count++;
    }
    try {
        Thread.sleep(500); // brief pause between requests to stay clear of the rate limit
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
} while ((query = result.nextQuery()) != null);
System.out.println(count);
System.exit(0);
I just tested it and got 275 tweets. Keep in mind this note from the documentation:
The Search API is not complete index of all Tweets, but instead an index of recent Tweets. At the moment that index includes between 6-9 days of Tweets.
And:
Before getting involved, it’s important to know that the Search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results. If you want to match for completeness you should consider using a Streaming API instead.
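If you want to replace the fixed 500 ms sleep with something that adapts to the actual limit, twitter4j exposes the rate-limit headers on every response (QueryResult inherits getRateLimitStatus() from TwitterResponse). A hedged sketch that could sit inside the do-loop in place of the fixed sleep:

RateLimitStatus status = result.getRateLimitStatus();
if (status != null && status.getRemaining() == 0) {
    try {
        // No calls left in this window: wait until the limit resets.
        Thread.sleep((status.getSecondsUntilReset() + 1) * 1000L);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}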

Real time web crawling using Jsoup

I have this web page https://rrtp.comed.com/pricing-table-today/ and from it I need to get only the Time (Hour Ending) and Day-Ahead Hourly Price columns. I tried the following code,
Document doc = Jsoup.connect("https://rrtp.comed.com/pricing-table-today/").get();
for (Element table : doc.select("table.prices three-col")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
if (tds.size() > 2) {
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}
but unfortunately I am unable to get the data I need.
Is there something wrong with the code, or can this page not be crawled?
I need some help.
As I said in a comment:
You should hit https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717, because that is the source from which the data on the page you pointed to is loaded.
The data under this link is not a valid HTML document (which is why it's not working for you), but you can easily make it close enough.
All you have to do is fetch the response and wrap it in <table>..</table> tags; then it can be parsed as an HTML document.
Connection.Response response = Jsoup.connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717").execute();
Document doc = Jsoup.parse("<table>" + response.body() + "</table>");
for (Element element : doc.select("tr")) {
    System.out.println(element.html());
}
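To get just the two columns the question asks for, the cell-extraction loop from the original attempt can be applied to the wrapped feed. A hedged sketch: the date parameter is hard-coded as in the answer, and the column positions (0 = Time (Hour Ending), 1 = Day-Ahead Hourly Price) are assumed to match the rendered table:

Connection.Response response = Jsoup
        .connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717")
        .execute();
Document doc = Jsoup.parse("<table>" + response.body() + "</table>");
for (Element row : doc.select("tr")) {
    Elements tds = row.select("td");
    if (tds.size() >= 2) {
        // Assumption: column 0 is the hour, column 1 is the day-ahead price.
        System.out.println(tds.get(0).text() + " : " + tds.get(1).text());
    }
}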
