Is it possible to find out the comment levels from a web page like the one below?
https://www.ozbargain.com.au/node/249439#comment-3719026
With jsoup I am able to parse the comments, usernames, etc., but I am having trouble getting the correct comment levels.
Viewing the source of that page, it doesn't seem to match the comment levels shown on the live page, unless I am reading it all wrong.
Is there a way to solve this?
I was able to generate the source comment level using:
String url = "https://www.ozbargain.com.au/node/249439";
Document doc = Jsoup.connect(url).get();
Elements level = doc.select("ul.comment");
for(Element column : e.select("ul")){
//comment level
System.out.println(column.attr("class"));
levels.add(column.attr("class"));
}
But it doesn't look right: it only shows one level-0 comment, etc.
Thanks
for(Element column : e.select("ul")) {
//comment level
System.out.println(column.attr("class"));
levels.add(column.attr("class"));
}
In the code above, where does e come from?
Anyway, you need to parse the class attribute value in order to find the comment level.
Here is a working sample code:
SAMPLE CODE
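import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) throws IOException {
    String url = "https://www.ozbargain.com.au/node/249439#comment-3719026";
    Document doc = Jsoup.connect(url).get();
    Elements comments = doc.select("div.comment-wrap");
    // Captures the digits that follow "level" in a class attribute value
    Matcher levelMatcher = Pattern.compile("(?i)^(.*level)(\\d+)(.*)$").matcher("");
    List<String> levels = new ArrayList<>();
    System.out.println("Comments found: " + comments.size());
    for (Element comment : comments) {
        // The level is encoded in the class of the comment's grandparent element
        if (levelMatcher.reset(comment.parent().parent().className()).find()) {
            levels.add(levelMatcher.replaceAll("$2"));
        }
    }
    System.out.println(levels);
}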
OUTPUT [https://www.ozbargain.com.au/node/249439#comment-3719026] (may change depending on the request time)
Comments found: 38
[0, 1, 2, 3, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 2, 3, 3, 1, 2, 3, 3, 0, 1, 2, 3, 2, 3, 3, 2, 0, 0, 0, 1, 2, 3]
OUTPUT [https://www.ozbargain.com.au/node/249604] (may change depending on the request time)
Comments found: 14
[0, 1, 0, 1, 0, 1, 1, 2, 1, 0, 0, 1, 2, 0]
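The parsing step can also be tried in isolation, without fetching the page. The sketch below is only an illustration: the class name CommentLevel and the sample attribute values such as "comment level2" are assumptions based on the regular expression in the answer, not something confirmed from the page markup.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentLevel {
    // Matches a whole class token made of "level" followed by digits, e.g. "level2".
    private static final Pattern LEVEL =
            Pattern.compile("(?i)(?:^|\\s)level(\\d+)(?:\\s|$)");

    /** Returns the level encoded in a class attribute value, or empty if there is none. */
    public static OptionalInt parse(String classAttr) {
        Matcher m = LEVEL.matcher(classAttr);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Sample values are assumptions about what the class attribute looks like.
        System.out.println(parse("comment level0"));
        System.out.println(parse("comment level3"));
        System.out.println(parse("comment-wrap"));
    }
}
```

Returning an OptionalInt rather than a String makes it explicit when an element carries no level at all, which is probably what you want when the selector also matches wrapper elements.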
I have this string, which is stored in an external file.
{
"IsValid": true,
"LiveSessionDataCollection": [
{
"CreateDate": "2017-12-27T13:29:06.595Z",
"Data": "Khttp://www8.hp.com/us/en/large-format-printers/designjet-printers/products.html&AbSGOX+SGOXpLXpBF8CXpGOA9BFFPconsole.info('DeploymentConfigName%3DRelease_20171227%26Version%3D1')%3B&HoConfig: Release_20171227&AwDz//////8NuaCh63&Win32&SNgYAJBBYWCYKW9a&2&SGOX+SGOXpF/1en-us&AAAAAAAAAAAAQICBCXpGOAAMBBBB8jl",
"DataFlags": 8,
"DataFlagType": 264,
"LegacyLiveSessionDataType": null,
"LiveSessionId": 1545190526042650,
"MessageNumber": 0,
"StreamId": 0,
"StreamMessageId": 0,
"ProjectId": 201
},
{
"CreateDate": "2017-12-27T13:29:08.887Z",
"Data": "oDB Information Level : Detailed&9BbRoDB Annual Sales : 55000000&BoDB Audience : Mid-Market Business&AoDB%20Audience%20Segment%20%3A%20Retail%20%26%20Distribution&AoDB B2C : true&AoDB Company Name : Clicktale Inc&AoDB SID : 120325490&AoDB Employee Count : 275&AoDB Employee Range : Mid-Market&AoDB%20Industry%20%3A%20Retail%20%26%20Distribution&AoDB Revenue Range : $50M - $100M&AoDB Sub Industry : Electronics&AoDB Traffic : High&AWB9tY/8bvOBBP_({\"a\":[{\"a\":{\"s\":\"w:auto;l:auto;\"},\"n\":\"div53\"}]})&sP_({\"a\":[{\"a\":{\"s\":\"w:auto;l:auto;\"},\"n\":\"div62\"}]})&FP_({\"r\":[\"script2\"],\"m\":[{\"n\":{\"nt\":1,\"tn\":\"SCRIPT\",\"a\":{\"async\":\"\",\"src\":\"http://admin.brightcove.com/js/api/SmartPlayerAPI.js?_=1514381348598\"},\"i\":\"script55\"},\"t\":false,\"pn\":\"head1\"}]})&8GuP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:0px;l:274.5px;ml:0px;\"},\"n\":\"div442\"}]})&SP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:0px;l:274.5px;ml:0px;\"},\"n\":\"div444\"}]})&D",
"DataFlags": 8,
"DataFlagType": 264,
"LegacyLiveSessionDataType": null,
"LiveSessionId": 1545190526042650,
"MessageNumber": 1,
"StreamId": 0,
"StreamMessageId": 1,
"ProjectId": 201
},
{
"CreateDate": "2017-12-27T13:29:08.971Z",
"Data": "P_({\"a\":[{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div105\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div114\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div123\"}]})&9B+8P_({\"a\":[{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div167\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div169\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div178\"}]})&JP_({\"a\":[{\"a\":{\"s\":\"mih:457px;\"},\"n\":\"div220\"},{\"a\":{\"s\":\"mih:457px;\"},\"n\":\"div229\"},{\"a\":{\"s\":\"mih:457px;\"},\"n\":\"div238\"}]})&FP_({\"a\":[{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div282\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div291\"},{\"a\":{\"s\":\"mih:480px;\"},\"n\":\"div300\"}]})&HP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:-92px;l:274.5px;ml:0px;\"},\"n\":\"div442\"}]})&HP_({\"a\":[{\"a\":{\"s\":\"t:0px;mt:-92px;l:274.5px;ml:0px;\"},\"n\":\"div444\"}]})&B",
"DataFlags": 8,
"DataFlagType": 264,
"LegacyLiveSessionDataType": null,
"LiveSessionId": 1545190526042650,
"MessageNumber": 2,
"StreamId": 0,
"StreamMessageId": 2,
"ProjectId": 201
},
{
"CreateDate": "2017-12-27T13:29:08.98Z",
"Data": "P_({\"r\":[\"object1\",\"param1\",\"param2\",\"param3\",\"param4\",\"param5\",\"param6\",\"param7\",\"param8\",\"param9\",\"param10\",\"param11\",\"param12\",\"param13\",\"param14\",\"param15\"],\"m\":[{\"n\":{\"nt\":1,\"tn\":\"OBJECT\",\"a\":{\"type\":\"application/x-shockwave-flash\",\"i\":\"LNK--1710e8cd-4820-4be0-8cf0-28d57402afd8LNK--1710e8cd-4820-4be0-8cf0-28d57402afd8\",\"width\":\"720\",\"height\":\"422\",\"c\":\"BrightcoveExperience BrightcoveExperienceID_1039\",\"seamlesstabbing\":\"undefined\"},\"i\":\"object3\"},\"t\":false,\"pn\":\"div443\",\"ps\":\"meta29\"},{\"n\":{\"nt\":1,\"tn\":\"SCRIPT\",\"a\":{\"type\":\"text/javascript\",\"src\":\"http://admin.brightcove.com/js/api/SmartPlayerAPI.js\"},\"i\":\"script56\"},\"t\":false,\"pn\":\"div443\",\"ps\":\"object3\"},{\"n\":{\"nt\":1,\"tn\":\"PARAM\",\"a\":{\"name\":\"allowScriptAccess\",\"v\":\"always\"},\"i\":\"param31\"},\"t\":false,\"pn\":\"object3\"},{\"n\":{\"nt\":1,\"tn\":\"PARAM\",\"a\":{\"name\":\"allowFullScreen\",\"v\":\"true\"},\"i\":\"param32\"},\"t\":false,\"pn\":\"object3\",\"ps\":\"param31\"},{\"n\":{\"nt\":1,\"tn\":\"PARAM\",\"a\":{\"name\":\"seamlessTabbing\",\"v\":\"false\"},\"i\":\"param33\"},\"t\":false,\"pn\":\"object3\",\"ps\":\"param32\"},{\"n\":{\"nt\":1,\"tn\":\"PARAM\",\"a\":{\"name\":\"swliveconnect\",\"v\":\"true\"},\"i\":\"param34\"},\"t\":false,\"pn\":\"object3\",\"ps\":\"param33\"},{\"n\":{\"nt\":1,\"tn\":\"PARAM\",\"a\":{\"name\":\"wmode\",\"v\":\"opaque\"},\"i\":\"param35\"},\"t\":false,\"pn\":\"object3\",\"ps\":\"param34\"},{\"n\":{\"nt\":1,\"tn\":\"PARAM\",\"a\":{\"name\":\"quality\",\"v\":\"high\"},\"i\":\"param36\"},\"t\":false,\"pn\":\"object3\",\"ps\":\"param35\"},{\"n\":{\"nt\":1,\"tn\":\"PARAM\",\"a\":{\"name\":\"bgcolor\",\"v\":\"FFFFFF\"},\"i\":\"param37\"},\"t\":false,\"pn\":\"object3\",\"ps\":\"param36\"}]})&9CAQ",
"DataFlags": 8,
"DataFlagType": 264,
"LegacyLiveSessionDataType": null,
"LiveSessionId": 1545190526042650,
"MessageNumber": 3,
"StreamId": 0,
"StreamMessageId": 3,
"ProjectId": 201
},
{
"CreateDate": "2017-12-27T13:29:09.413Z",
"Data": "P_({\"a\":[{\"a\":{\"s\":\"w:720px;h:422px;p:relative;\"},\"n\":\"div443\"},{\"a\":{\"s\":\"p:relative;\"},\"n\":\"div445\"}],\"r\":[\"script55\"],\"m\":[{\"n\":{\"nt\":1,\"tn\":\"DIV\",\"a\":{\"c\":\"spooler\",\"s\":\"d:block;o:0;\"},\"i\":\"div451\"},\"t\":false,\"pn\":\"div443\"},{\"n\":{\"nt\":1,\"tn\":\"DIV\",\"a\":{\"c\":\"ispl_sm\",\"s\":\"o:1;\"},\"i\":\"div452\"},\"t\":false,\"pn\":\"div451\"},{\"n\":{\"nt\":1,\"tn\":\"DIV\",\"a\":{\"c\":\"layer\",\"s\":\"o:1;\"},\"i\":\"div453\"},\"t\":false,\"pn\":\"div451\",\"ps\":\"div452\"},{\"n\":{\"nt\":1,\"tn\":\"DIV\",\"a\":{\"c\":\"spooler\",\"s\":\"d:block;o:0;\"},\"i\":\"div454\"},\"t\":false,\"pn\":\"div445\"},{\"n\":{\"nt\":1,\"tn\":\"DIV\",\"a\":{\"c\":\"ispl_sm\",\"s\":\"o:1;\"},\"i\":\"div455\"},\"t\":false,\"pn\":\"div454\"},{\"n\":{\"nt\":1,\"tn\":\"DIV\",\"a\":{\"c\":\"layer\",\"s\":\"o:1;\"},\"i\":\"div456\"},\"t\":false,\"pn\":\"div454\",\"ps\":\"div455\"}]})&9CA5P_({\"a\":[{\"a\":{\"s\":\"d:block;o:0.0282439;\"},\"n\":\"div451\"},{\"a\":{\"s\":\"o:0.989022;\"},\"n\":\"div453\"},{\"a\":{\"s\":\"d:block;o:0.0282439;\"},\"n\":\"div454\"},{\"a\":{\"s\":\"o:0.989022;\"},\"n\":\"div456\"}]})&W",
"DataFlags": 8,
"DataFlagType": 264,
"LegacyLiveSessionDataType": null,
"LiveSessionId": 1545190526042650,
"MessageNumber": 4,
"StreamId": 0,
"StreamMessageId": 4,
"ProjectId": 201
}
]
}
I am trying to parse it into a JSON array object. When I searched for it on Google I found the following solution:
JSONArray jsonArray = new JSONArray("path_to_file_to_parse");
but when I used it in my code I got an error. Is there another way to do it?
I am using json-simple version 1.1
Have you looked at Jackson Tree Model?
import java.io.IOException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// First, create a mapper object.
ObjectMapper mapper = new ObjectMapper();
// You will need to define json yourself (the file contents as a String) to run the code.
try {
    // Then create a JsonNode instance representing your JSON root structure.
    JsonNode root = mapper.readTree(json);
    // Here you get the list of your session nodes.
    JsonNode list = root.path("LiveSessionDataCollection");
    // Then you can iterate through them and get any inner value.
    for (JsonNode session : list) {
        // For example, you can get the create date or live session ID.
        System.out.println(session.path("CreateDate"));
        System.out.println(session.path("LiveSessionId"));
    }
} catch (IOException e) {
    // Parsing failed, so there is no root node to read from.
    System.out.println("Could not parse JSON: " + e.getMessage());
}
I may not be understanding your question fully but I think this is what you're after.
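Since the JSON lives in an external file, the json variable has to hold the file's contents, not its path. Here is a minimal sketch of one way to load it using only the standard library. The class name JsonFileReader and the file name session.json are placeholders, and UTF-8 encoding is assumed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class JsonFileReader {
    /** Reads the whole file at the given path into a String, decoding it as UTF-8. */
    public static String readJson(String path) throws IOException {
        Path file = Paths.get(path);
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "session.json" is a placeholder; pass the real path as the first argument.
        String json = readJson(args.length > 0 ? args[0] : "session.json");
        System.out.println(json.length() + " characters read");
    }
}
```

The resulting String is what you would hand to a parser in place of the file path.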
I'm trying to parse a JSON string into a JSONArray, but when I try I get "cannot be converted to JSONArray".
My string looks like this (but much longer):
{
"mylist": {
"myinfo": {
"user_id": 6225804,
"user_name": "culo",
"user_watching": 1092,
"user_completed": 0,
"user_onhold": 0,
"user_dropped": 0,
"user_plantowatch": 0,
"user_days_spent_watching": 0
},
"anime": [{
"series_animedb_id": 1,
"series_title": "Cowboy Bebop",
"series_synonyms": "; Cowboy Bebop",
"series_type": 1,
"series_episodes": 26,
"series_status": 2,
"series_start": "1998-04-03",
"series_end": "1999-04-24",
"series_image": "https:\/\/myanimelist.cdn-dena.com\/images\/anime\/4\/19644.webp",
"my_id": 0,
"my_watched_episodes": 0,
"my_start_date": "0000-00-00",
"my_finish_date": "0000-00-00",
"my_score": 0,
"my_status": 1,
"my_rewatching": 0,
"my_rewatching_ep": 0,
"my_last_updated": 1493924579,
"my_tags": ""
}, {
"series_animedb_id": 5,
"series_title": "Cowboy Bebop: Tengoku no Tobira",
"series_synonyms": "Cowboy Bebop: Knockin' on Heaven's Door; Cowboy Bebop: The Movie",
"series_type": 3,
"series_episodes": 1,
"series_status": 2,
"series_start": "2001-09-01",
"series_end": "2001-09-01",
"series_image": "https:\/\/myanimelist.cdn-dena.com\/images\/anime\/6\/14331.webp",
"my_id": 0,
"my_watched_episodes": 0,
"my_start_date": "0000-00-00",
"my_finish_date": "0000-00-00",
"my_score": 0,
"my_status": 1,
"my_rewatching": 0,
"my_rewatching_ep": 0,
"my_last_updated": 1496668154,
"my_tags": ""
}, {
"series_animedb_id": 6,
"series_title": "Trigun",
"series_synonyms": "; Trigun",
"series_type": 1,
"series_episodes": 26,
"series_status": 2,
"series_start": "1998-04-01",
"series_end": "1998-09-30",
"series_image": "https:\/\/myanimelist.cdn-dena.com\/images\/anime\/7\/20310.webp",
"my_id": 0,
"my_watched_episodes": 0,
"my_start_date": "0000-00-00",
"my_finish_date": "0000-00-00",
"my_score": 0,
"my_status": 1,
"my_rewatching": 0,
"my_rewatching_ep": 0,
"my_last_updated": 1496668441,
"my_tags": ""
}, ETCETERA 1000 more like this one
I don't really care about the "mylist" or "myinfo" part, just the "anime" part is needed. There are about 1000 items.
I've validated my JSON and it is valid.
This is my code:
JSONObject object = new JSONObject(replacedString);
JSONArray replacedResponse = new JSONArray(replacedString);
and here is where my issue begins. I've also tried this:
JSONObject object = new JSONObject(replacedString);
JSONArray replacedResponse = object.getJSONArray("mylist");
and
JSONObject object = new JSONObject(replacedString);
JSONArray replacedResponse = object.getJSONArray("anime");
with similar results.
What am I not seeing here? Thanks in advance!
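The root of your JSON is an object, "mylist" is also an object, and the "anime" array sits inside "mylist", so you have to walk down one level at a time. Please follow this code: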
String stringObj = "[YOUR JSON]";
// first Convert string into JsonObject
try {
JSONObject jsonObject = new JSONObject(stringObj);
// Inside the above object you have the "mylist" key and its JSONObject
JSONObject myListObject = jsonObject.optJSONObject("mylist");
// Inside mylist you have the myinfo JSONObject and the anime JSONArray
if(myListObject == null) {
return;
}
JSONObject myinfoObject = myListObject.optJSONObject("myinfo");
JSONArray animeJsonArray = myListObject.optJSONArray("anime");
// check null for myinfoObject and animeJsonArray and do the operation
} catch (JSONException e) {
e.printStackTrace();
}
I have this object:
"_id" : 1,
"employee_id" : [2, 3, 4, 5],
"project_name" : "qwerty"
I want to remove [3, 5] from the "employee_id" array and add the new values [13, 6, 8]. The result should be:
"_id" : 1,
"employee_id" : [2, 4, 13, 6, 8],
"project_name" : "qwerty"
I use this Java code:
DB database = mongoClient.getDB("employee_service");
DBCollection collectionProject = database.getCollection("project");
DBObject query = new BasicDBObject();
query.put("_id", project.getId());
DBObject projectMongoObject = new BasicDBObject();
projectMongoObject.put("project_name", project.getProjectName());
//something
collectionProject.update(query, projectMongoObject);
So how do I remove values from the array and add the new ones through projectMongoObject?
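Use the $pullAll operator to remove the values, and the combination of $push and $each to append the new values to the array. The two changes go in separate updates because MongoDB rejects a single update in which two operators modify the same field.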
DBObject query = new BasicDBObject();
query.put("_id", project.getId());
DBObject projectMongoObject = new BasicDBObject();
projectMongoObject.put("$set", new BasicDBObject("project_name",
project.getProjectName()));
projectMongoObject.put("$pullAll",
new BasicDBObject("employee_id", new int[]{3,5}));
collectionProject.update(query, projectMongoObject);
projectMongoObject = new BasicDBObject();
projectMongoObject.put("$push",
new BasicDBObject("employee_id",
new BasicDBObject("$each",
new int[]{13,6,8})));
collectionProject.update(query, projectMongoObject);
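To see what the two updates do to the array, here is a plain-Java sketch that mimics them on an in-memory list. No MongoDB is involved; the class name ArrayUpdateSketch and the method pullAllThenPush are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArrayUpdateSketch {
    /** Mimics $pullAll followed by $push with $each on an in-memory list. */
    public static List<Integer> pullAllThenPush(List<Integer> current,
                                                List<Integer> toRemove,
                                                List<Integer> toAdd) {
        List<Integer> result = new ArrayList<>(current);
        result.removeAll(toRemove); // like $pullAll: drops every occurrence of each listed value
        result.addAll(toAdd);       // like $push with $each: appends the values in order
        return result;
    }

    public static void main(String[] args) {
        List<Integer> employeeIds = Arrays.asList(2, 3, 4, 5);
        List<Integer> updated = pullAllThenPush(
                employeeIds, Arrays.asList(3, 5), Arrays.asList(13, 6, 8));
        System.out.println(updated); // prints [2, 4, 13, 6, 8]
    }
}
```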
I need to build a classifier based on features. I have 15M rows of data like this:
{
"app_entertainment" : 1,
"app_widgets" : 2,
"arcade" : 8,
"books_and_reference" : 2,
"comics" : 0,
"brain" : 20,
"business" : 0,
"cards" : 5,
"casual" : 1,
"communication" : 4,
"education" : 0,
"finance" : 1,
"game_wallpaper" : 0,
"game_widgets" : 0,
"health_fitness" : 0,
"libraries_demo" : 0,
"racing" : 1,
"lifestyle" : 1,
"media_video" : 0,
"medical" : 0,
"music_and_audio" : 7,
"news_magazines" : 2,
"personalization" : 1,
"photography" : 0,
"productivity" : 4,
"shopping" : 1,
"social" : 1,
"sports_apps" : 1,
"sports_games" : 7,
"tools" : 15,
"transportation" : 2,
"travel_and_local" : 8,
"weather" : 3,
"app_wallpaper" : 0,
"entertainment" : 0,
"health_and_fitness" : 0,
"libraries_and_demo" : 0,
"media_and_video" : 0,
"news_and_magazines" : 0,
"sports" : 0
}
For every row like this I also know a boolean label:
whether the user with this data clicked on an ad or not.
How can I use Mahout to train a classifier, and how do I classify new rows once it is trained?
Everything I found on the net is very abstract, with few examples of how to do it in Java.
There is very little material on Mahout on the internet. I relied on the Mahout source code and the sample code from Mahout in Action.
You could refer to the 20newsgroup example source code for classification.
Here is a simple example using a NaiveBayes classifier. The vector is the dataset to classify; classifier and labels are assumed to be fields defined elsewhere.
public List<String> classifyCase(Vector vector) {
TreeMap<Double, String> resultMap = new TreeMap<Double, String>();
Vector result = classifier.classifyFull(vector);
for (Vector.Element element: result) {
int categoryId = element.index();
double score = element.get();
resultMap.put(-score, labels.get(categoryId));
}
return new ArrayList<String>(resultMap.values());
}
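One detail of the ranking step is worth a note: a TreeMap keyed by the negated score keeps only one label when two categories get exactly the same score. The sketch below shows the same ordering done with a sort instead. It is plain Java with a double[] standing in for Mahout's Vector, and the names ScoreRanking and rankLabels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ScoreRanking {
    /** Returns the labels ordered from highest to lowest score; tied labels keep their index order. */
    public static List<String> rankLabels(double[] scores, List<String> labels) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            indices.add(i);
        }
        // List.sort is stable, so categories with equal scores are all kept.
        indices.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        List<String> ranked = new ArrayList<>();
        for (int i : indices) {
            ranked.add(labels.get(i));
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up scores for two made-up labels.
        double[] scores = {-12.5, -3.1};
        System.out.println(rankLabels(scores, Arrays.asList("no_click", "click")));
    }
}
```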
When converting string date representation to numeric values, I obtain a different result in Java/Groovy/PHP vs in Javascript. For some dates before 1970, the JS timestamp is exactly 3600 secs before the Java timestamp. I could reproduce it for Oct 1st, but for Jan 1st it's ok.
My test case (in groovy, using the usual Java API on purpose):
def sdf = new SimpleDateFormat("dd/MM/yyyy")
["01/10/1956", "01/01/1956", "01/10/1978"].each {
def d = sdf.parse(it)
println "${it} -> ${d.time}"
}
and in JS (I simply run it from the Chrome console - "9" is October here):
new Date(1956, 9, 1, 0, 0, 0).getTime()
A few samples:
*Groovy
01/10/1956 -> -418179600000
01/01/1956 -> -441853200000
01/10/1978 -> 276040800000
*Javascript
1956,9,1,0,0,0 -> -418183200000
1956,0,1,0,0,0 -> -441853200000
1978,9,1,0,0,0 -> 276040800000
=> Notice how 01/10/1956 is not converted the same way, yielding a 3600-second difference.
Daylight saving time or a timezone issue would be the perfect culprit, but I don't see why the two universes diverged at some point in the past.
Any hint welcome!
Thank you
EDIT more samples
*Java/Groovy
01/01/1974 -> 126226800000
01/10/1974 -> 149814000000
01/01/1976 -> 189298800000
01/10/1976 -> 212972400000
01/01/1978 -> 252457200000
01/10/1978 -> 276040800000
*JS
new Date(1974, 0, 1, 0, 0, 0).getTime() 126226800000
new Date(1974, 9, 1, 0, 0, 0).getTime() 149814000000
new Date(1976, 0, 1, 0, 0, 0).getTime() 189298800000
new Date(1976, 9, 1, 0, 0, 0).getTime() 212972400000
new Date(1978, 0, 1, 0, 0, 0).getTime() 252457200000
new Date(1978, 9, 1, 0, 0, 0).getTime() 276040800000
Around 1967~1971
01/01/1967 -> -94698000000
01/04/1967 -> -86922000000
01/10/1967 -> -71110800000
01/01/1968 -> -63162000000
01/04/1968 -> -55299600000
01/10/1968 -> -39488400000
01/01/1971 -> 31532400000
01/10/1971 -> 55119600000
new Date(1967, 0, 1, 0, 0, 0).getTime() -94698000000
new Date(1967, 3, 1, 0, 0, 0).getTime() -86925600000
new Date(1967, 9, 1, 0, 0, 0).getTime() -71114400000
new Date(1968, 0, 1, 0, 0, 0).getTime() -63162000000
new Date(1968, 3, 1, 0, 0, 0).getTime() -55303200000
new Date(1968, 9, 1, 0, 0, 0).getTime() -39492000000
new Date(1971, 0, 1, 0, 0, 0).getTime() 31532400000
new Date(1971, 9, 1, 0, 0, 0).getTime() 55119600000
Your profile says you're from Belgium.
There was no daylight saving time in 1976 for Brussels:
http://www.timeanddate.com/worldclock/clockchange.html?n=48&year=1976
But there is from 1977 onwards:
http://www.timeanddate.com/worldclock/clockchange.html?n=48&year=1977
Java is probably aware of this, whereas JavaScript is not.
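A quick way to check this on the Java side is to print the UTC offset that Java applies at local midnight for each date. This is only a sketch: the zone Europe/Brussels is an assumption taken from the profile remark above, and the class and method names are made up.

```java
import java.time.LocalDate;
import java.time.ZoneId;

public class MidnightEpoch {
    /** Epoch milliseconds of local midnight on the given date in the given zone. */
    public static long epochMillis(ZoneId zone, int year, int month, int day) {
        return LocalDate.of(year, month, day).atStartOfDay(zone).toInstant().toEpochMilli();
    }

    /** UTC offset, in seconds, in effect at local midnight on the given date. */
    public static int offsetSeconds(ZoneId zone, int year, int month, int day) {
        return LocalDate.of(year, month, day).atStartOfDay(zone).getOffset().getTotalSeconds();
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Europe/Brussels"); // assumed zone, see the lead-in
        int[][] dates = {{1956, 10, 1}, {1956, 1, 1}, {1978, 10, 1}};
        for (int[] d : dates) {
            System.out.println(d[2] + "/" + d[1] + "/" + d[0]
                    + " -> " + epochMillis(zone, d[0], d[1], d[2])
                    + " (offset " + offsetSeconds(zone, d[0], d[1], d[2]) + " s)");
        }
    }
}
```

For comparison, calling epochMillis with the fixed offset ZoneOffset.ofHours(1) for 01/10/1956 gives -418179600000, the Java value in the question, while ZoneOffset.ofHours(2) gives -418183200000, the JavaScript value. So the two results differ exactly by whether summer time is applied to that date.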
Timezone information is complex, and you would be surprised how often a) it changes and b) it is inaccurate.
I would try this in Java/Groovy as well (note that the deprecated java.util.Date constructor takes the year minus 1900, so 1956 is written as 56):
new Date(56, 9, 1, 0, 0, 0).getTime()
A cool website on timezones. http://www.bbc.co.uk/news/world-12849630
For example, the epoch is 1970/01/01 00:00 UTC, not 00:00 Europe/London: even though it was winter, the UK was one hour ahead of GMT, on British Standard Time. This only happened from Feb 1968 to Nov 1971. :P http://www.timeanddate.com/time/uk/time-zone-background.html
The more you learn about time and date, the more you realise it's all rather ad hoc. Even UTC is not an acronym as such: it stands for "Coordinated Universal Time" in English and "Temps Universel Coordonné" in French, and because they couldn't agree on what the acronym should be, the compromise was UTC, which is neither. http://en.wikipedia.org/wiki/Coordinated_Universal_Time