Reading data from websites using Jsoup

I found a bit of code by BalusC, later edited by another user, Pisek, and was wondering how to use it to read data from another website.
I understand how to find the class name needed to read different parts of the data, but I'm not sure how to read the quantity of the product.
Here's my code so far:
package internalAssessment;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class practiceArea {
    public static void main(String[] args) throws Exception {
        String url = "https://www.tesco.com/groceries/product/details/?id=265485175";
        Document document = Jsoup.connect(url).get();
        String price = document.select(".linePrice").text();
        System.out.println("Price: " + price);
        String quantity = document.select("").text();
        System.out.println("Quantity: " + quantity);
    }
}

You get the price by selecting the element by its class:
String price = document.select(".linePrice").text();
You can also get the quantity with its class (or by its id):
document.select(".quantity").attr("value"); // by class
document.select("#qty-265485175-1").attr("value"); // by id
The difference is in how you read the number: here it is held in a value attribute, so you use .attr("value") instead of .text().
As I said in the comments: this opens a new connection to the website, so the value you get is whatever the freshly loaded page contains, not what you typed in your browser.
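A minimal sketch of the attribute lookup described above, run against an inline HTML fragment rather than the live page. The markup, the class name QuantityExample, and the helper readQuantity are assumptions made for illustration; the real page may be structured differently.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class QuantityExample {

    // Returns the value attribute of the first element matching the selector,
    // or an empty string when nothing matches.
    static String readQuantity(String html, String cssSelector) {
        Document document = Jsoup.parse(html);
        return document.select(cssSelector).attr("value");
    }

    public static void main(String[] args) {
        // Hypothetical markup modelled on the selectors used in the answer.
        String html = "<span class=\"linePrice\">1.50</span>"
                + "<input class=\"quantity\" id=\"qty-265485175-1\" value=\"1\">";
        System.out.println("By class: " + readQuantity(html, ".quantity"));
        System.out.println("By id: " + readQuantity(html, "#qty-265485175-1"));
    }
}
```

Because attr returns an empty string when no element matches, it is worth checking for an empty result before parsing the quantity as a number.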

Related

Could you please help me with the following stanford-nlp OpenIE problem

I ran the demo example from the website on the following sentence:
"Hudson was born in Hampstead, which is a suburb of London."
and it gave me the following:
Hudson be bear
but I was expecting the following relations:
(Hudson, was born in, Hampstead)
(Hampstead, is a suburb of, London)
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Collection;
import java.util.Properties;
/**
 * A demo illustrating how to call the OpenIE system programmatically.
 */
public class OpenIEDemo {
    public static void main(String[] args) throws Exception {
        // Create the Stanford CoreNLP pipeline
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
        //tokenize,ssplit,pos,lemma,depparse,natlog,openie
        //tokenize,ssplit,pos,lemma,ner,regexner,parse,mention,entitymentions,coref,kbp
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // Annotate an example document.
        Annotation doc = new Annotation(args[0]);
        pipeline.annotate(doc);
        // Loop over sentences in the document
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Get the OpenIE triples for the sentence
            Collection<RelationTriple> triples =
                    sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
            // Print the triples
            for (RelationTriple triple : triples) {
                System.out.println(triple.confidence + "\t" +
                        triple.subjectLemmaGloss() + "\t" +
                        triple.relationLemmaGloss() + "\t" +
                        triple.objectLemmaGloss());
            }
        }
    }
}
Thank you for your help

So, the system is not wrong, though it is certainly undergenerating possible relations. "Hudson be bear" just asserts that Hudson was born (a true fact). This case in particular was caused by the ref edge Hampstead -ref-> which, and should be fixed in subsequent versions of the code.
In general though, like all NLP systems, OpenIE has an accuracy rate below 100%, and you should never expect it to produce completely correct output, especially for a task like Open IE, where even agreeing on what "correct" means is difficult.

Extract Data From Multiple Files

I have exactly 278 HTML files of essays from different students. Every file contains the student ID, first name, and last name in the following format:
<p>Student ID: 000000</p>
<p>First Name: John</p>
<p>Last Name: Doe</p>
I'm trying to extract the student IDs from all these files. Is there a way to extract the data between X and Y, X being "<p>Student ID: " and Y being "</p>", which should leave us with the ID?
What method, language, concept, or software would you recommend to get this done?

Using Java:
import java.io.File;
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
public class StudentIDsRetriever {
public static void main(String[] args) throws IOException {
File dir = new File("htmldir");
String[] htmlFiles = dir.list();
List<String> studentIds = new ArrayList<>();
List<String> emailDs = new ArrayList<>();
for (String htmlFile : htmlFiles) {
Path path = FileSystems.getDefault().getPath("htmldir", htmlFile);
List<String> lines = Files.readAllLines(path);
for (String str : lines) {
if (str.contains("<p>Student ID:")) {
String idTag = str.substring(str.indexOf("<p>Student ID:"));
// trim() drops the space that follows the colon
String id = idTag.substring("<p>Student ID:".length(), idTag.indexOf("</p>")).trim();
System.out.println("Id is "+id);
studentIds.add(id);
}
if (str.contains("@") && (str.contains(".com") || str.contains(".co.in"))) {
String[] words = str.split(" ");
for (String word : words)
if (word.contains("@") && (word.contains(".com") || word.contains(".co.in")))
emailDs.add(word);
}
}
}
System.out.println("Student list is "+studentIds);
System.out.println("Student email list is "+emailDs);
}
}
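P.S.: This works with Java 8 and later. The single-argument Files.readAllLines(Path) was added in Java 8; on Java 7, pass a Charset as the second argument.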
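As an alternative to the indexOf/substring approach above, a regular expression can capture the text between the two markers directly. This is only a sketch: the class name StudentIdRegex and the helper extractIds are made up for illustration, and it assumes each ID sits on one line inside a plain <p> tag, as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StudentIdRegex {

    // Captures whatever sits between "<p>Student ID:" and the next "</p>",
    // without the surrounding whitespace.
    private static final Pattern STUDENT_ID =
            Pattern.compile("<p>Student ID:\\s*(.*?)\\s*</p>");

    // Returns every student ID found in the given HTML text, in order.
    static List<String> extractIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = STUDENT_ID.matcher(html);
        while (matcher.find()) {
            ids.add(matcher.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String html = "<p>Student ID: 000000</p>\n"
                + "<p>First Name: John</p>\n"
                + "<p>Last Name: Doe</p>";
        System.out.println(extractIds(html));
    }
}
```

The same method can be called on the content of each of the 278 files in turn, collecting the results into one list.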

I recommend a Python script. It's fine if this is your first time using Python: it is an easy scripting language with plenty of references online.
1) Language: Python (version 2.7)
2) Library: BeautifulSoup (you can install it with pip, the package manager that can be installed along with Python)
Traverse the files one by one, open each local file, and parse its HTML content with BeautifulSoup (see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag).
Then extract the content of the <p> tag. It returns "Student ID: 000000".
Split this string on ":". This returns two parts, str[0] and str[1].
str[1] is the student number you want. You may need to remove the surrounding spaces: ' Hello '.strip() returns 'Hello'.
If you need help, reply.
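The last two steps above (split on the colon, then strip the spaces) are not specific to Python. A rough Java equivalent, to match the other answer's language, might look like this; the class name StudentIdSplit and the helper valueAfterColon are hypothetical, and the input is assumed to be the text already taken out of the <p> tag.

```java
public class StudentIdSplit {

    // Splits "Student ID: 000000" at the first colon and trims the right-hand part.
    // Returns an empty string when the text contains no colon.
    static String valueAfterColon(String paragraphText) {
        String[] parts = paragraphText.split(":", 2);
        if (parts.length < 2) {
            return "";
        }
        return parts[1].trim();
    }

    public static void main(String[] args) {
        System.out.println(valueAfterColon("Student ID: 000000"));
        System.out.println(valueAfterColon("First Name: John"));
    }
}
```

Limiting the split to two parts keeps any later colons inside the value instead of cutting it short.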

Connect To MongoDB using Apache Mahout

I'm trying to generate recommendations using Apache Mahout while using MongoDB to create the datamodel as per the MongoDBDataModel. My code is as follows :
import java.net.UnknownHostException;
import java.util.List;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.mongodb.MongoDBDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import com.mongodb.MongoException;
public class usingMongo {
public static void main(String[] args) throws UnknownHostException, MongoException, TasteException {
final long startTime = System.nanoTime();
MongoDBDataModel model = new MongoDBDataModel("AdamsLaptop", 27017,
"test", "ratings100k", false, false, null);
System.out.println("connected to mongo ");
UserSimilarity UserSim = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.5, UserSim, model);
UserBasedRecommender UserRecommender = new GenericUserBasedRecommender(model, neighborhood, UserSim);
List<RecommendedItem>UserRecommendations = UserRecommender.recommend(1, 3);
for (RecommendedItem recommendation : UserRecommendations) {
System.out.println("You may like movie " + recommendation.getItemID() + " as a user similar to you also rated it " + recommendation.getValue() + " USER");
}
ItemSimilarity ItemSim = new PearsonCorrelationSimilarity(model);//LogLikelihoodSimilarity(model);
GenericItemBasedRecommender ItemRecommender = new GenericItemBasedRecommender(model, ItemSim);
List<RecommendedItem>ItemRecommendations = ItemRecommender.recommend(1, 3);
for (RecommendedItem recommendation : ItemRecommendations) {
System.out.println("You may like movie " + recommendation.getItemID() + " as a user similar to you also rated it " + recommendation.getValue() + " ITEM");
}
final long duration = System.nanoTime() - startTime;
System.out.println(duration);
}
}
I can't see where I've gone wrong. Despite numerous changes and lots of trial and error, the error message remains the same:
Exception in thread "main" java.lang.NullPointerException
at org.apache.mahout.cf.taste.impl.model.mongodb.MongoDBDataModel.getID(MongoDBDataModel.java:743)
at org.apache.mahout.cf.taste.impl.model.mongodb.MongoDBDataModel.buildModel(MongoDBDataModel.java:570)
at org.apache.mahout.cf.taste.impl.model.mongodb.MongoDBDataModel.<init>(MongoDBDataModel.java:245)
at recommender.usingMongo.main(usingMongo.java:24)
Any suggestions? Here's an example of my data within MongoDB :
{ "_id" : ObjectId("56ddf61f5960960c333f3dcb"),"userId" : 1, "movieId" : 292, "rating" : 4, "timestamp" : 847116936 }

I successfully integrated MongoDB data with Mahout.
The structure of the data in MongoDB depends on the kind of similarity algorithm you use. For example:
UserSimilarity
MongoDBDataModel datamodel = new MongoDBDataModel("127.0.0.1", 27017, "testing", "ratings", true, true, null);
where user_id and item_id are integer values, preference is a float value, and created_at is a timestamp.
SVDRecommender
user_id and item_id are MongoDB objects, preference is a float value, and created_at is a timestamp.
The obvious first troubleshooting step is to check whether the MongoDB server is running. Judging by the exception, it is, so I think the problem lies in the structure of your data.
Use user_id instead of userId, item_id instead of movieId, and preference instead of rating. I don't know if this will make any difference. I followed one of the online tutorials, but can't find it at the moment.
It works for me, but it is too slow when I have more than 10000 users with 1000 items.

I think the problem is that Mahout assumes default names for the fields that hold the user ID, item ID, and preference in your MongoDB collection: user_id, item_id, and preference. The solution might be to use another MongoDBDataModel constructor that lets you pass the names of those fields as parameters, or to redesign your collection schema.
I hope that makes sense.

Parsing Information from URL Using Jsoup

I need help with my Java project using Jsoup (if you think there is a more efficient way to achieve the purpose, please let me know). The purpose of my program is to parse certain useful information from different URLs and put it in a text file. I am not an expert in HTML or JavaScript, therefore, it has been difficult for me to code in Java exactly what I want to parse.
In the website that you see in the code below as one of the examples, the information that interests me to parse with Jsoup is everything you can see in the table under “Routing”(Route, Location, Vessel/Voyage, Container Arrival Date, Container Departure Date; = Origin, Seattle SSA Terminal T18, 26 Jun 15 A, 26 Jun 15 A… and so on).
So far, with Jsoup we are only able to parse the title of the website, yet we have been unsuccessful in getting any of the body.
Here is the code that I used, which I got from an online source:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Jsouptest71115 {
    public static void main(String[] args) throws Exception {
        String url = "http://google.com/gentrack/trackingMain.do "
                + "?trackInput01=999061985";
        Document document = Jsoup.connect(url).get();
        String title = document.title();
        System.out.println("title : " + title);
        String body = document.select("body").text();
        System.out.println("Body: " + body);
    }
}

Working code:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.ArrayList;
public class Sample {
    public static void main(String[] args) {
        String url = "http://homeport8.apl.com/gentrack/blRoutingPopup.do";
        try {
            Connection.Response response = Jsoup.connect(url)
                    .data("blNbr", "999061985") // tracking number
                    .method(Connection.Method.POST)
                    .execute();
            Element tableElement = response.parse().getElementsByTag("table")
                    .get(2).getElementsByTag("table")
                    .get(2);
            Elements trElements = tableElement.getElementsByTag("tr");
            ArrayList<ArrayList<String>> tableArrayList = new ArrayList<>();
            for (Element trElement : trElements) {
                ArrayList<String> columnList = new ArrayList<>();
                for (int i = 0; i < 5; i++) {
                    columnList.add(i, trElement.children().get(i).text());
                }
                tableArrayList.add(columnList);
            }
            System.out.println("Origin/Location: "
                    + tableArrayList.get(1).get(1)); // row and column number
            System.out.println("Discharge Port/Container Arrival Date: "
                    + tableArrayList.get(5).get(3));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
Origin/Location: SEATTLE SSA TERMINAL (T18), WA  
Discharge Port/Container Arrival Date: 23 Jul 15  E

You need to use the select method, as in document.select("body"); its input is a CSS selector. To learn more about CSS selectors, look them up online. With CSS selectors you can easily identify parts of a web page's body.
In your particular case you will run into a different problem, though: the table you are after is inside an iframe. If you look at the HTML of the page you are visiting, the iframe's URL is "http://homeport8.apl.com/gentrack/blRoutingFrame.do". If you visit this URL directly to access its content, you get an exception, which is perhaps some restriction imposed by the server. To get the content properly you need to visit two URLs via jsoup: 1. http://homeport8.apl.com/gentrack/trackingMain.do?trackInput01=999061985 and 2. http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985
The first URL gives you nothing useful, but the second gives you the tables you are interested in. Then try document.select("table"), which gives you a list of tables; iterate over this list and find the table you want. Once you have the table, use Element.select("tr") to get the table rows, and for each "tr" use Element.select("td") to get the table cell data.
The web page you are visiting doesn't use CSS classes or ids, which would have made reading it with jsoup a lot easier, so I am afraid iterating over document.select("table") is your best and easiest option.
Good luck.
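The table walk described above can be sketched as follows. To keep the example self-contained it parses an inline HTML string instead of fetching the two URLs; the markup, the class name TableRowsExample, and the helper readTable are assumptions for illustration, not the structure of the real page.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableRowsExample {

    // Reads the rows of the table at tableIndex into lists of cell texts.
    // Rows without any <td> cells (for example header rows) are skipped.
    static List<List<String>> readTable(String html, int tableIndex) {
        Document document = Jsoup.parse(html);
        Element table = document.select("table").get(tableIndex);
        List<List<String>> rows = new ArrayList<>();
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("td")) {
                cells.add(cell.text());
            }
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical markup loosely modelled on the "Routing" table in the question.
        String html = "<table><tr><th>Route</th><th>Location</th></tr>"
                + "<tr><td>Origin</td><td>SEATTLE SSA TERMINAL (T18), WA</td></tr></table>";
        for (List<String> row : readTable(html, 0)) {
            System.out.println(row);
        }
    }
}
```

Note that select("tr") also matches rows of any tables nested inside the chosen one, so on a page with nested tables the index has to point at the innermost table of interest.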

How to parse data in Talend with Java (coming from a previously produced .txt file)?

I have a process in Talend which gets the search results of a page, saves the HTML, and writes it into files.
Initially I had a two-step process, parsing the data out of the HTML files in Java. The code works and writes the result to a MySQL database. Here it is (I'm a beginner, sorry for the lack of elegance):
package org.jsoup.examples;
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;
import java.io.IOException;
public class parse2 {
    static parse2 parseIt2 = new parse2();
    String companyName = "Platzhalter";
    String jobTitle = "Platzhalter";
    String location = "Platzhalter";
    String timeAdded = "Platzhalter";

    public static void main(String[] args) throws IOException {
        parseIt2.getData();
    }

    //
    public void getData() throws IOException {
        Document document = Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
        Elements elements = document.select(".joblisting");
        for (Element element : elements) {
            // Parse Data into Elements
            Elements jobTitleElement = element.select(".job_title span");
            Elements companyNameElement = element.select(".company_name span[itemprop=name]");
            Elements locationElement = element.select(".locality span[itemprop=addressLocality]");
            Elements dateElement = element.select(".job_date_added [datetime]");
            // Strip Data from unnecessary tags
            String companyName = companyNameElement.text();
            String jobTitle = jobTitleElement.text();
            String location = locationElement.text();
            String timeAdded = dateElement.attr("datetime");
            System.out.println("Firma:\t" + companyName + "\t" + jobTitle + "\t in:\t" + location + " \t Erstellt am \t" + timeAdded);
        }
    }
}
Now I want to run the whole process end-to-end in Talend, and I have been assured this works.
I tried the following, which looks quite shady to me.
Basically I put all the imports in "advanced settings" and the code in the "basic settings" section. This importLibrary is meant to load the jsoup parsing library as well as the MySQL connector (though I might do the connection with Talend tools).
Obviously this isn't working. I tried stripping the classes and so on from the base code, and it was even worse. Can you help me get the generated .txt files parsed with Java here?
EDIT: Here is the Link to the talend Job http://www.share-online.biz/dl/8M5MD99NR1
EDIT2: I changed the code to the one I tried in JavaFlex, but it didn't work (the import part in the "start" part of the code, the rest in "body/main", and nothing in "end").

This problem is specific to Talend: in your code, use fully qualified names, including their packages, for the classes you reference. For your document parsing, for example, you can use:
org.jsoup.nodes.Document document = org.jsoup.Jsoup.parse(new java.io.File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
