I am trying to extract a news article from a 'New York Times' URL, but it produces no output, while any other newspaper's URL works fine. I want to know whether something is wrong with my code or whether boilerpipe is simply unable to fetch it. Also, sometimes the output is not in English but comes out as garbled Unicode, mainly for 'Daily News'; I would like to know the reason for that as well.
import java.io.InputStream;
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.DefaultExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

class ExtractData
{
    public static void main(final String[] args) throws Exception
    {
        URL url;
        url = new URL(
                "http://www.nytimes.com/2013/03/02/nyregion/us-judges-offer-addicts-a-way-to-avoid-prison.html?hp&_r=0");

        // NOTE We ignore HTTP-based character encoding in this demo...
        final InputStream urlStream = url.openStream();
        final InputSource is = new InputSource(urlStream);

        final BoilerpipeSAXInput in = new BoilerpipeSAXInput(is);
        final TextDocument doc = in.getTextDocument();
        urlStream.close();

        // You have the choice between different Extractors
        //System.out.println(DefaultExtractor.INSTANCE.getText(doc));
        System.out.println(ArticleExtractor.INSTANCE.getText(doc));
    }
}
Nytimes.com has a paywall, and it returns HTTP 303 for your request. You could try to handle the redirect and the cookies it sets; trying other user-agent strings might also work.
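If you want to experiment with that, here is a minimal sketch (untested against the current nytimes.com; the class name and the "Mozilla/5.0" user-agent string are just placeholders) that sets a browser-like User-Agent, follows redirects, and also passes the charset declared in the Content-Type header to the parser, which may also help with the garbled non-English output you see for other sites:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

class ExtractDataWithHeaders
{
    public static void main(final String[] args) throws Exception
    {
        URL url = new URL(
                "http://www.nytimes.com/2013/03/02/nyregion/us-judges-offer-addicts-a-way-to-avoid-prison.html?hp&_r=0");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Pretend to be a regular browser; some sites serve different content otherwise
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setInstanceFollowRedirects(true);

        // Pick up the charset declared in the Content-Type header, e.g. "text/html; charset=UTF-8"
        String contentType = conn.getContentType();
        String charset = "UTF-8";
        if (contentType != null && contentType.toLowerCase().contains("charset=")) {
            charset = contentType.substring(contentType.toLowerCase().indexOf("charset=") + 8).trim();
        }

        final InputStream urlStream = conn.getInputStream();
        final InputSource is = new InputSource(urlStream);
        is.setEncoding(charset); // avoids mis-decoded output when the page is not UTF-8

        final TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
        urlStream.close();

        System.out.println(ArticleExtractor.INSTANCE.getText(doc));
    }
}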
I want to write an rdf4j.model.Model in the RDF/Turtle format. The model should contain IRIs that include the characters {}.
When I write the RDF model with rdf4j.rio.Rio, the {} characters come out as %7B%7D. Is there a way to avoid this, e.g. by creating an rdf4j.model.IRI with path and query variables, or by configuring the writer to preserve the {} characters?
I am using org.eclipse.rdf4j:rdf4j-runtime:3.6.2.
An example snippet:
import org.eclipse.rdf4j.model.BNode;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.model.util.ModelBuilder;
import org.eclipse.rdf4j.rio.*;
import org.eclipse.rdf4j.rio.helpers.BasicWriterSettings;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ExamplePathVariable {

    private final static Logger LOG = Logger.getLogger(ExamplePathVariable.class.getCanonicalName());

    public static void main(String[] args) {
        SimpleValueFactory rdf = SimpleValueFactory.getInstance();
        ModelBuilder modelBuilder = new ModelBuilder();

        BNode subject = rdf.createBNode();
        IRI predicate = rdf.createIRI("http://example.org/onto#hasURI");
        // IRI with special characters !
        IRI object = rdf.createIRI("http://example.org/{token}");

        modelBuilder.add(subject, predicate, object);

        String turtleStr = writeToString(RDFFormat.TURTLE, modelBuilder.build());
        LOG.log(Level.INFO, turtleStr);
    }

    static String writeToString(RDFFormat format, Model model) {
        OutputStream out = new ByteArrayOutputStream();
        try {
            Rio.write(model, out, format,
                    new WriterConfig().set(BasicWriterSettings.INLINE_BLANK_NODES, true));
        } finally {
            try {
                out.close();
            } catch (IOException e) {
                LOG.log(Level.WARNING, e.getMessage());
            }
        }
        return out.toString();
    }
}
This is what I get:
INFO:
[] <http://example.org/onto#hasURI> <http://example.org/%7Btoken%7D> .
There is no easy way to do what you want, because that would result in a syntactically invalid URI representation in Turtle.
The characters '{' and '}', even though they are not actually reserved characters in URIs, are not allowed to exist in un-encoded form in a URI (see https://datatracker.ietf.org/doc/html/rfc3987). The only way to serialize them legally is by percent-encoding them.
As an aside, the only reason this bit of code:
IRI object = rdf.createIRI("http://example.org/{token}");
succeeds is that the SimpleValueFactory you are using does not do character validation (for performance reasons). If you instead use the recommended approach (since RDF4J 3.5) of using the Values static factory:
IRI object = Values.iri("http://example.org/{token}");
...you would immediately get a validation error.
If you want to accept a string without knowing in advance whether it contains invalid characters, and want a best-effort conversion to a legal URI, you can use ParsedIRI.create:
IRI object = Values.iri(ParsedIRI.create("http://example.org/{token}").toString());
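Putting it together, a short sketch (assuming RDF4J 3.5+, where Values lives in org.eclipse.rdf4j.model.util and ParsedIRI in org.eclipse.rdf4j.common.net):

import org.eclipse.rdf4j.common.net.ParsedIRI;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.util.Values;

// Best-effort conversion of an arbitrary string into a legal IRI:
// ParsedIRI.create() parses the raw string, and Values.iri() validates
// the percent-encoded result instead of throwing on the braces.
String raw = "http://example.org/{token}";
IRI object = Values.iri(ParsedIRI.create(raw).toString());
// the braces end up percent-encoded, as in the Turtle output above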
I want to display parts of the content of a website in my app. I've seen some solutions here, but they are all very old and do not work with the newer versions of Android Studio. So maybe someone can help out.
https://jsoup.org/ should help for getting the full site data and parsing it by class, id, and so on. For instance, the code below fetches a site and prints its title:
Document doc = Jsoup.connect("http://www.moodmusic.today/").get();
String title = doc.select("title").text();
System.out.println(title);
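If you need specific parts of the page rather than the title, jsoup's CSS selectors do that as well. A small sketch (the .article-body and headline selectors are placeholders for whatever the target site actually uses):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://www.moodmusic.today/").get(); // throws IOException

// Select paragraphs inside a CSS class (placeholder class name)
Elements paragraphs = doc.select(".article-body p");
for (Element p : paragraphs) {
    System.out.println(p.text());
}

// Select a single element by id (placeholder id)
Element headline = doc.getElementById("headline");
if (headline != null) {
    System.out.println(headline.text());
}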
If you want to get raw data from a target website, you will need to do the following:
Create a URL object with the link of the website specified in the parameter
Cast it to HttpURLConnection
Retrieve its InputStream
Convert it to a String
This works with plain Java, no matter which IDE you're using.
To retrieve a connection's InputStream:
// Create a URL object
URL url = new URL("https://yourwebsitehere.domain");
// Retrieve its input stream
HttpURLConnection connection = ((HttpURLConnection) url.openConnection());
InputStream instream = connection.getInputStream();
Make sure to handle java.net.MalformedURLException and java.io.IOException
To convert an InputStream to a String
public static String toString(InputStream in) throws IOException {
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
        builder.append(line).append("\n");
    }
    reader.close();
    return builder.toString();
}
You can copy and modify the code above and use it in your source code!
Make sure to have the following imports
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
Example:
public static String getDataRaw() throws IOException, MalformedURLException {
    URL url = new URL("https://yourwebsitehere.domain");
    HttpURLConnection connection = ((HttpURLConnection) url.openConnection());
    InputStream instream = connection.getInputStream();
    return toString(instream);
}
To call getDataRaw(), handle IOException and MalformedURLException and you're good to go!
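For example, the call site could look like this (just a sketch; you'll also need import java.net.MalformedURLException):

try {
    String html = getDataRaw();
    System.out.println(html);
} catch (MalformedURLException e) {
    // Bad URL syntax
    System.err.println("Bad URL: " + e.getMessage());
} catch (IOException e) {
    // Network or read failure
    System.err.println("Could not read the page: " + e.getMessage());
}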
Hope this helps!
I need to extract the CoinMarketCap volume (e.g. Market Cap: $306,020,249,332) from the top of the page with Java; please see the attached picture.
I have used the jsoup library in Eclipse, but it didn't extract the volume; jsoup only extracts the other attributes. The problem probably comes from a JavaScript library.
I have also tried HtmlUnit, without success:
import java.io.IOException;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Testss {

    public static void main(String[] args) throws IOException {
        String url = "https://coinmarketcap.com/faq/";
        WebClient client = new WebClient();
        HtmlPage page = client.getPage(url);
        List<?> anchors = page.getByXPath("//div[@class='col-sm-6 text-center']//a");
        for (Object obj : anchors) {
            HtmlAnchor a = (HtmlAnchor) obj;
            System.out.println(a.getTextContent().trim());
        }
    }
}
How can I extract volume from this site with Java?
Thanks!
Check the network tab to find out the exact request that fetches the data. In your case it is https://files.coinmarketcap.com/generated/stats/global.json.
So fetching the main URL will not give you what you require; instead, fetch the data from that request URL directly and parse it with any JSON library. I can suggest SimpleJSON (json-simple) as one of them.
This is the JSON data you will get after hitting that URL:
{
    "bitcoin_percentage_of_market_cap": 55.95083004655126,
    "active_cryptocurrencies": 1324,
    "total_volume_usd": 21503093761,
    "active_markets": 7009,
    "total_market_cap_by_available_supply_usd": 301100436864
}
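Once you have that JSON, extracting the numbers is straightforward. Here is a rough sketch using json-simple (the field names come from the payload above; error handling is kept minimal):

import java.io.InputStreamReader;
import java.net.URL;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class GlobalStats {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://files.coinmarketcap.com/generated/stats/global.json");
        JSONParser parser = new JSONParser();
        // Parse the JSON document straight from the response stream
        JSONObject stats = (JSONObject) parser.parse(new InputStreamReader(url.openStream(), "UTF-8"));
        Object marketCap = stats.get("total_market_cap_by_available_supply_usd");
        Object volume = stats.get("total_volume_usd");
        System.out.println("Market cap (USD): " + marketCap);
        System.out.println("24h volume (USD): " + volume);
    }
}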
I made a sort of web scraper in Java that downloads HTML code and writes it to a logger.
The code for the data miner is the following:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class Scraping {

    private static final Logger LOGGER = Logger.getLogger(Logger.GLOBAL_LOGGER_NAME);

    public static void getData(String address, int val) throws IOException {
        // Make a URL to the web page
        URL url = new URL(address);

        // Get the input stream through URL Connection
        URLConnection con = url.openConnection();
        InputStream is = con.getInputStream();
        BufferedReader br = new BufferedReader(new InputStreamReader(is));

        String line = null;

        FileHandler fh;
        fh = new FileHandler(Integer.toString(val) + ".txt");
        LOGGER.addHandler(fh);
        //SimpleFormatter formatter = new SimpleFormatter();
        fh.setFormatter(new MyFormatter());
        LOGGER.setUseParentHandlers(false);
        LOGGER.setLevel(Level.FINE);

        while ((line = br.readLine()) != null) {
            toTable(line);
        }
    }

    /* arrange data in table */
    private static void toTable(String line) {
        if (line.startsWith("<tr ><th scope=\"row\" class=\"left \" data-append-csv=") && !line.contains("ts_pct")) {
            LOGGER.log(Level.FINE, line);
        }
    }
}
When I run the code once it gives me the correct output, but I need to run it multiple times in a for loop (passing a different address and the index i as val, so the Logger gets a different file name on every iteration). When I do that, each log file also receives data that should have gone to a different file.
So the file for index 0 ends up with the data for val 0, 1 and 2, instead of just the val 0 data.
The FileHandler's append boolean doesn't seem to make any difference to my program's output.
First of all, web scraping isn't data mining; no advanced statistics are involved.
Secondly, don't abuse loggers for I/O.
Logging is there to make sure you get some debug information, in a configurable way, when your program fails, and to let you see what is happening (which is also why each class should have its own logger rather than using GLOBAL_LOGGER).
For writing your output files, use the standard OutputStream etc. of your programming language. Don't try to reroute your output completely through logging.
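For example, here is a sketch of the same getData that keeps your filtering but writes straight to a per-value file with a BufferedWriter (the names mirror the question's code, not any fixed API):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class Scraping {

    public static void getData(String address, int val) throws IOException {
        URL url = new URL(address);
        URLConnection con = url.openConnection();

        // One reader for the page, one writer per output file; both closed automatically
        try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()));
             BufferedWriter out = new BufferedWriter(new FileWriter(val + ".txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith("<tr ><th scope=\"row\" class=\"left \" data-append-csv=")
                        && !line.contains("ts_pct")) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}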
I have been playing with Amazon's Product Advertising API, and I cannot get a request to go through and give me data. I have been working off of this: http://docs.amazonwebservices.com/AWSECommerceService/2011-08-01/GSG/ and this: Amazon Product Advertising API signed request with Java
Here is my code. I generated the SOAP bindings using this: http://docs.amazonwebservices.com/AWSECommerceService/2011-08-01/GSG/YourDevelopmentEnvironment.html#Java
On the Classpath, I only have: commons-codec.1.5.jar
import com.ECS.client.jax.AWSECommerceService;
import com.ECS.client.jax.AWSECommerceServicePortType;
import com.ECS.client.jax.Item;
import com.ECS.client.jax.ItemLookup;
import com.ECS.client.jax.ItemLookupRequest;
import com.ECS.client.jax.ItemLookupResponse;
import com.ECS.client.jax.ItemSearchResponse;
import com.ECS.client.jax.Items;

public class Client {

    public static void main(String[] args) {
        String secretKey = <my-secret-key>;
        String awsKey = <my-aws-key>;

        System.out.println("API Test started");

        AWSECommerceService service = new AWSECommerceService();
        service.setHandlerResolver(new AwsHandlerResolver(secretKey)); // important
        AWSECommerceServicePortType port = service.getAWSECommerceServicePort();

        // Get the operation object:
        com.ECS.client.jax.ItemSearchRequest itemRequest = new com.ECS.client.jax.ItemSearchRequest();

        // Fill in the request object:
        itemRequest.setSearchIndex("Books");
        itemRequest.setKeywords("Star Wars");
        // itemRequest.setVersion("2011-08-01");

        com.ECS.client.jax.ItemSearch ItemElement = new com.ECS.client.jax.ItemSearch();
        ItemElement.setAWSAccessKeyId(awsKey);
        ItemElement.getRequest().add(itemRequest);

        // Call the Web service operation and store the response
        // in the response object:
        com.ECS.client.jax.ItemSearchResponse response = port.itemSearch(ItemElement);

        String r = response.toString();
        System.out.println("response: " + r);

        for (Items itemList : response.getItems()) {
            System.out.println(itemList);
            for (Item item : itemList.getItem()) {
                System.out.println(item);
            }
        }

        System.out.println("API Test stopped");
    }
}
Here is what I get back.. I was hoping to see some Star Wars books available on Amazon dumped out to my console :-/:
API Test started
response: com.ECS.client.jax.ItemSearchResponse#7a6769ea
com.ECS.client.jax.Items#1b5ac06e
API Test stopped
What am I doing wrong (note that no "item" in the second for loop is printed out, because it is empty)? How can I troubleshoot this or get relevant error information?
I don't use the SOAP API, but your bounty requirements didn't state that it had to use SOAP, only that you wanted to call Amazon and get results. So I'll post this working example using the REST API, which will at least fulfill your stated requirement:
I would like some working example code that hits the amazon server and returns results
You'll need to download the following to fulfill the signature requirements:
http://associates-amazon.s3.amazonaws.com/signed-requests/samples/amazon-product-advt-api-sample-java-query.zip
Unzip it and grab the com.amazon.advertising.api.sample.SignedRequestsHelper.java file and put it directly into your project. This code is used to sign the request.
You'll also need to download Apache Commons Codec 1.3 from the following and add it to your classpath, i.e. add it to your project's libraries. Note that this is the only version of Codec that will work with the above class (SignedRequestsHelper):
http://archive.apache.org/dist/commons/codec/binaries/commons-codec-1.3.zip
Now you can copy and paste the following making sure to replace your.pkg.here with the proper package name and replace the SECRET and the KEY properties:
package your.pkg.here;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class Main {

    private static final String SECRET_KEY = "<YOUR_SECRET_KEY>";
    private static final String AWS_KEY = "<YOUR_KEY>";

    public static void main(String[] args) {
        SignedRequestsHelper helper =
                SignedRequestsHelper.getInstance("ecs.amazonaws.com", AWS_KEY, SECRET_KEY);

        Map<String, String> params = new HashMap<String, String>();
        params.put("Service", "AWSECommerceService");
        params.put("Version", "2009-03-31");
        params.put("Operation", "ItemLookup");
        params.put("ItemId", "1451648537");
        params.put("ResponseGroup", "Large");

        String url = helper.sign(params);
        try {
            Document response = getResponse(url);
            printResponse(response);
        } catch (Exception ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    private static Document getResponse(String url)
            throws ParserConfigurationException, IOException, SAXException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(url);
        return doc;
    }

    private static void printResponse(Document doc) throws TransformerException, FileNotFoundException {
        Transformer trans = TransformerFactory.newInstance().newTransformer();
        Properties props = new Properties();
        props.put(OutputKeys.INDENT, "yes");
        trans.setOutputProperties(props);
        StreamResult res = new StreamResult(new StringWriter());
        DOMSource src = new DOMSource(doc);
        trans.transform(src, res);
        String toString = res.getWriter().toString();
        System.out.println(toString);
    }
}
As you can see, this is much simpler to set up and use than the SOAP API. If you don't have a specific requirement for using the SOAP API, I would highly recommend that you use the REST API instead.
One of the drawbacks of using the REST API is that the results aren't unmarshalled into objects for you. This could be remedied by creating the required classes based on the WSDL.
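If you don't want to generate classes from the WSDL, a lightweight middle ground is to pull individual values out of the returned Document with XPath. A sketch of a method you could drop into the Main class above (the Item/ItemAttributes/Title path is an assumption based on the usual ItemLookup response layout, and it relies on the non-namespace-aware DocumentBuilder used in getResponse):

// additional classes used: javax.xml.xpath.* and org.w3c.dom.NodeList
private static void printTitles(Document doc) throws javax.xml.xpath.XPathExpressionException {
    javax.xml.xpath.XPath xpath = javax.xml.xpath.XPathFactory.newInstance().newXPath();
    // Every item title in the response (assumes an Items/Item/ItemAttributes/Title layout)
    org.w3c.dom.NodeList titles = (org.w3c.dom.NodeList) xpath.evaluate(
            "//Item/ItemAttributes/Title", doc, javax.xml.xpath.XPathConstants.NODESET);
    for (int i = 0; i < titles.getLength(); i++) {
        System.out.println(titles.item(i).getTextContent());
    }
}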
This ended up working (I had to add my associateTag to the request):
public class Client {

    public static void main(String[] args) {
        String secretKey = "<MY_SECRET_KEY>";
        String awsKey = "<MY AWS KEY>";

        System.out.println("API Test started");

        AWSECommerceService service = new AWSECommerceService();
        service.setHandlerResolver(new AwsHandlerResolver(secretKey)); // important
        AWSECommerceServicePortType port = service.getAWSECommerceServicePort();

        // Get the operation object:
        com.ECS.client.jax.ItemSearchRequest itemRequest = new com.ECS.client.jax.ItemSearchRequest();

        // Fill in the request object:
        itemRequest.setSearchIndex("Books");
        itemRequest.setKeywords("Star Wars");
        itemRequest.getResponseGroup().add("Large");
        // itemRequest.getResponseGroup().add("Images");
        // itemRequest.setVersion("2011-08-01");

        com.ECS.client.jax.ItemSearch ItemElement = new com.ECS.client.jax.ItemSearch();
        ItemElement.setAWSAccessKeyId(awsKey);
        ItemElement.setAssociateTag("th0426-20");
        ItemElement.getRequest().add(itemRequest);

        // Call the Web service operation and store the response
        // in the response object:
        com.ECS.client.jax.ItemSearchResponse response = port.itemSearch(ItemElement);

        String r = response.toString();
        System.out.println("response: " + r);

        for (Items itemList : response.getItems()) {
            System.out.println(itemList);

            for (Item itemObj : itemList.getItem()) {
                System.out.println(itemObj.getItemAttributes().getTitle()); // Title
                System.out.println(itemObj.getDetailPageURL()); // Amazon URL
            }
        }

        System.out.println("API Test stopped");
    }
}
It looks like the response object does not override toString(), so if it contains some sort of error response, simply printing it will not tell you what the error is. You'll need to look at the API for which fields are returned in the response object and print those individually. Either you'll get an obvious error message, or you'll have to go back to their documentation to figure out what is wrong.
You need to call the get methods on the Item object to retrieve its details, e.g.:
for (Item item : itemList.getItem()) {
    System.out.println(item.getItemAttributes().getTitle()); // Title of item
    System.out.println(item.getDetailPageURL()); // Amazon URL
    // etc
}
If there are any errors you can get them by calling getErrors()
if (response.getOperationRequest().getErrors() != null) {
    System.out.println(response.getOperationRequest().getErrors().getError().get(0).getMessage());
}