Logger is not able to separate data between different files - java

I made a sort of web scraper in Java that downloads HTML code and writes it to a logger.
The code for the data miner is the following:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class Scraping {
    private static final Logger LOGGER = Logger.getLogger(Logger.GLOBAL_LOGGER_NAME);

    public static void getData(String address, int val) throws IOException {
        // Make a URL to the web page
        URL url = new URL(address);
        // Get the input stream through URL Connection
        URLConnection con = url.openConnection();
        InputStream is = con.getInputStream();
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        String line = null;
        FileHandler fh;
        fh = new FileHandler(Integer.toString(val) + ".txt");
        LOGGER.addHandler(fh);
        //SimpleFormatter formatter = new SimpleFormatter();
        fh.setFormatter(new MyFormatter());
        LOGGER.setUseParentHandlers(false);
        LOGGER.setLevel(Level.FINE);
        while ((line = br.readLine()) != null) {
            toTable(line);
        }
    }

    /* arrange data in table */
    private static void toTable(String line) {
        if (line.startsWith("<tr ><th scope=\"row\" class=\"left \" data-append-csv=") && !line.contains("ts_pct")) {
            LOGGER.log(Level.FINE, line);
        }
    }
}
When I run the code once, it gives me the correct output. But I need to run it multiple times in a for loop (passing a different address each time and the loop index i as val, so each FileHandler gets a different file name), and when I do that, each log file also receives data that should go to a different file.
So the file for index 0 gets the data for val 0, 1, and 2, instead of just the val 0 data.
Setting the FileHandler's append boolean doesn't seem to make any difference to my program's output.

First of all, web scraping isn't data mining; no advanced statistics are involved.
Secondly, don't abuse loggers for IO. Logging exists to give you debug information when your program fails, in a configurable way, and to let you see what is happening (so don't use GLOBAL_LOGGER; each class should have its own logger). That shared global logger is also the direct cause of your symptom: every call to getData adds another FileHandler without removing the old ones, so each logged line is written to all the files created so far.
For writing your output files, use the standard OutputStream etc. of your programming language. Don't try to reroute your output completely through logging.
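As an illustration (my sketch, not code from the thread): the same download and row filter written with plain file IO, one writer per call, both streams closed by try-with-resources so no output can leak between indices. The MyFormatter class from the question is no longer needed.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class Scraping {
    public static void getData(String address, int val) throws IOException {
        URL url = new URL(address);
        try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openConnection().getInputStream()));
             BufferedWriter out = new BufferedWriter(new FileWriter(val + ".txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Same row filter as in the question
                if (line.startsWith("<tr ><th scope=\"row\" class=\"left \" data-append-csv=") && !line.contains("ts_pct")) {
                    out.write(line);
                    out.newLine();
                }
            }
        } // both streams are closed here, so each index gets exactly one file
    }
}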

Related

How to configure RDF4J Rio writer to write IRIs with special characters?

I want to write an rdf4j.model.Model with the rdf/turtle format. The model should contain IRIs with the characters {}.
When I try to write the RDF model with rdf4j.rio.Rio, the {} characters are written as %7B%7D. Is there a way to overcome this? e.g. create an rdf4j.model.IRI with path and query variables or configure the writer to preserve the {} characters?
I am using org.eclipse.rdf4j:rdf4j-runtime:3.6.2.
An example snippet:
import org.eclipse.rdf4j.model.BNode;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.model.util.ModelBuilder;
import org.eclipse.rdf4j.rio.*;
import org.eclipse.rdf4j.rio.helpers.BasicWriterSettings;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ExamplePathVariable {
    private final static Logger LOG = Logger.getLogger(ExamplePathVariable.class.getCanonicalName());

    public static void main(String[] args) {
        SimpleValueFactory rdf = SimpleValueFactory.getInstance();
        ModelBuilder modelBuilder = new ModelBuilder();
        BNode subject = rdf.createBNode();
        IRI predicate = rdf.createIRI("http://example.org/onto#hasURI");
        // IRI with special characters !
        IRI object = rdf.createIRI("http://example.org/{token}");
        modelBuilder.add(subject, predicate, object);
        String turtleStr = writeToString(RDFFormat.TURTLE, modelBuilder.build());
        LOG.log(Level.INFO, turtleStr);
    }

    static String writeToString(RDFFormat format, Model model) {
        OutputStream out = new ByteArrayOutputStream();
        try {
            Rio.write(model, out, format,
                    new WriterConfig().set(BasicWriterSettings.INLINE_BLANK_NODES, true));
        } finally {
            try {
                out.close();
            } catch (IOException e) {
                LOG.log(Level.WARNING, e.getMessage());
            }
        }
        return out.toString();
    }
}
This is what I get:
INFO:
[] <http://example.org/onto#hasURI> <http://example.org/%7Btoken%7D> .
There is no easy way to do what you want, because that would result in a syntactically invalid URI representation in Turtle.
The characters '{' and '}', even though they are not actually reserved characters in URIs, are not allowed to exist in un-encoded form in a URI (see https://datatracker.ietf.org/doc/html/rfc3987). The only way to serialize them legally is by percent-encoding them.
As an aside, the only reason this bit of code:
IRI object = rdf.createIRI("http://example.org/{token}");
succeeds is that the SimpleValueFactory you are using does not do character validation (for performance reasons). If you instead use the recommended approach (since RDF4J 3.5) of using the Values static factory:
IRI object = Values.iri("http://example.org/{token}");
...you would immediately have gotten a validation error.
If you want to input a string where in advance you don't know if it's going to contain any invalid chars, and want to have a best-effort approach to convert it to a legal URI, you can use ParsedIRI.create:
IRI object = Values.iri(ParsedIRI.create("http://example.org/{token}").toString());
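Putting the answer's two snippets together, a minimal runnable sketch (my addition, assuming rdf4j-runtime 3.5 or later on the classpath; the class name is made up):
import org.eclipse.rdf4j.common.net.ParsedIRI;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.util.Values;

public class BestEffortIri {
    public static void main(String[] args) {
        // ParsedIRI.create percent-encodes characters that may not appear
        // raw in an IRI; Values.iri then validates the cleaned-up string.
        IRI object = Values.iri(ParsedIRI.create("http://example.org/{token}").toString());
        System.out.println(object); // expected: http://example.org/%7Btoken%7D
    }
}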

Why is DataOutputStream in java not working as expected?

I am learning about file IO in java, and wanted to test this, but I am not sure why I am getting weird results. Here is the code.
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class driver {
    public static void main(String[] args) throws IOException {
        FileOutputStream out = new FileOutputStream("Hello.txt");
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeBoolean(true);
        dos.writeInt(68);
        dos.writeChar('c');
        dos.writeDouble(3.14);
        dos.writeFloat(56.789f);
    }
}
My input file "Hello.txt doesn't exist yet and I want to put all these values like 68, c, 3,14 etc into this file, however after running the above program, these are the contents of "Hello.txt".
D c# ¸Që…Bc'ð
So what exactly is happening here?
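The thread carries no answer, but the short version (my addition) is that DataOutputStream writes the raw binary encodings of the values, not their text form: writeInt(68) writes the four bytes 00 00 00 44, and 0x44 happens to be the letter 'D' you see in the file. To get the values back, read them in the same order with a DataInputStream; a minimal sketch (the class name ReadBack is made up):
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ReadBack {
    public static void main(String[] args) throws IOException {
        // Values must be read back in the same order and with the same
        // types they were written, since the file holds raw binary encodings.
        try (DataInputStream dis = new DataInputStream(new FileInputStream("Hello.txt"))) {
            System.out.println(dis.readBoolean()); // true
            System.out.println(dis.readInt());     // 68
            System.out.println(dis.readChar());    // c
            System.out.println(dis.readDouble());  // 3.14
            System.out.println(dis.readFloat());   // 56.789
        }
    }
}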

Unable to get correct letters when sending special characters to the printer in Java

I am writing a program that works with addresses, and I want the final output to be sent to a printer. As most of the addresses are located in northern Europe, I need to be able to handle some special characters, but I seem to be unable to do this when printing.
When writing to the terminal or to a *.txt file everything works fine, but on the printed pages I get gibberish.
This is basically what I am trying to do:
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.print.*;

public class PrintExample {
    public static void main(String[] args) throws PrintException, IOException {
        String testData = "ÅÄÖ, åäö";
        PrintService service = PrintServiceLookup.lookupDefaultPrintService();
        InputStream is = new ByteArrayInputStream(testData.getBytes("UTF-8"));
        DocFlavor flavor = DocFlavor.INPUT_STREAM.AUTOSENSE;
        DocPrintJob job = service.createPrintJob();
        Doc doc = new SimpleDoc(is, flavor, null);
        job.print(doc, null);
        is.close();
    }
}
Does anyone have a clue as to what's wrong?
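The thread has no fix attached, but a common workaround (my addition, not from the original post) is to avoid AUTOSENSE, which hands raw bytes to the printer to interpret in whatever charset it defaults to, and instead render the text yourself with Java2D via DocFlavor.SERVICE_FORMATTED.PRINTABLE; a minimal sketch:
import java.awt.Graphics2D;
import java.awt.print.Printable;
import javax.print.Doc;
import javax.print.DocFlavor;
import javax.print.PrintException;
import javax.print.PrintService;
import javax.print.PrintServiceLookup;
import javax.print.SimpleDoc;

public class PrintTextAsGraphics {
    public static void main(String[] args) throws PrintException {
        String testData = "ÅÄÖ, åäö";
        PrintService service = PrintServiceLookup.lookupDefaultPrintService();
        // Render the text ourselves, so no byte-level charset
        // interpretation is left to the printer.
        Printable printable = (graphics, pageFormat, pageIndex) -> {
            if (pageIndex > 0) {
                return Printable.NO_SUCH_PAGE;
            }
            Graphics2D g2 = (Graphics2D) graphics;
            g2.translate(pageFormat.getImageableX(), pageFormat.getImageableY());
            g2.drawString(testData, 0, 20);
            return Printable.PAGE_EXISTS;
        };
        Doc doc = new SimpleDoc(printable, DocFlavor.SERVICE_FORMATTED.PRINTABLE, null);
        service.createPrintJob().print(doc, null);
    }
}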

Not able to parse new york times article using boilerpipe

I am trying to get a news article from a 'New York Times' URL, but it gives no output, while any other newspaper's URL works. I want to know whether something is wrong with my code or whether boilerpipe is simply unable to fetch it. Also, sometimes the output is not in English but comes out as mangled Unicode, mainly for 'Daily News'; I want to know the reason for that too.
import java.io.InputStream;
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.DefaultExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

class ExtractData {
    public static void main(final String[] args) throws Exception {
        URL url = new URL(
                "http://www.nytimes.com/2013/03/02/nyregion/us-judges-offer-addicts-a-way-to-avoid-prison.html?hp&_r=0");
        // NOTE We ignore HTTP-based character encoding in this demo...
        final InputStream urlStream = url.openStream();
        final InputSource is = new InputSource(urlStream);
        final BoilerpipeSAXInput in = new BoilerpipeSAXInput(is);
        final TextDocument doc = in.getTextDocument();
        urlStream.close();
        // You have the choice between different Extractors
        //System.out.println(DefaultExtractor.INSTANCE.getText(doc));
        System.out.println(ArticleExtractor.INSTANCE.getText(doc));
    }
}
Nytimes.com has a paywall and returns HTTP 303 for your request; you could try handling the redirect and cookies. Trying other user-agent strings might also work.
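A sketch of those suggestions (my addition, not part of the original answer): open the connection yourself so you can set a browser-like User-Agent, follow redirects, and decode with an explicit charset instead of ignoring the HTTP encoding, which is a likely cause of the garbled non-English output:
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

class ExtractDataWithHeaders {
    public static void main(final String[] args) throws Exception {
        URL url = new URL("http://www.nytimes.com/2013/03/02/nyregion/us-judges-offer-addicts-a-way-to-avoid-prison.html?hp&_r=0");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        // A browser-like User-Agent; some sites serve bots differently.
        con.setRequestProperty("User-Agent", "Mozilla/5.0");
        con.setInstanceFollowRedirects(true);
        // Decode with an explicit charset rather than the platform default.
        InputSource is = new InputSource(new InputStreamReader(con.getInputStream(), "UTF-8"));
        TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
        System.out.println(ArticleExtractor.INSTANCE.getText(doc));
    }
}
This won't get past the paywall's cookie handling by itself, but it covers the redirect, user-agent, and encoding points from the answer.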

Using Java to pull data from a webpage?

I'm attempting to make my first program in Java. The goal is to write a program that browses to a website and downloads a file for me. However, I don't know how to use Java to interact with the internet. Can anyone tell me what topics to look up/read about or recommend some good resources?
The simplest solution (without depending on any third-party library or platform) is to create a URL instance pointing to the web page / link you want to download, and read the content using streams.
For example:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class DownloadPage {
    public static void main(String[] args) throws IOException {
        // Make a URL to the web page
        URL url = new URL("http://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage");
        // Get the input stream through URL Connection
        URLConnection con = url.openConnection();
        InputStream is = con.getInputStream();
        // Once you have the Input Stream, it's just plain old Java IO stuff.
        // For this case, since you are interested in getting plain-text web page
        // I'll use a reader and output the text content to System.out.
        // For binary content, it's better to directly read the bytes from stream and write
        // to the target file.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
            String line = null;
            // read each line and write to System.out
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
Hope this helps.
The Basics
Look at these to build a solution more or less from scratch:
Start from the basics: The Java Tutorial's chapter on Networking, including Working With URLs
Make things easier for yourself: Apache HttpComponents (including HttpClient)
The Easily Glued-Up and Stitched-Up Stuff
You always have the option of calling external tools from Java using the exec() and similar methods. For instance, you could use wget, or cURL.
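For instance, a minimal sketch (my addition, assuming curl is installed and on the PATH):
import java.io.IOException;

public class CurlFetch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // -s silences the progress bar, -L follows redirects;
        // inheritIO() streams curl's output to this console.
        Process p = new ProcessBuilder("curl", "-sL", "https://example.com/")
                .inheritIO()
                .start();
        System.out.println("curl exited with " + p.waitFor());
    }
}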
The Hardcore Stuff
Then if you want to go into more fully-fledged stuff, thankfully the need for automated web testing has given us very practical tools for this. Look at:
HtmlUnit (powerful and simple)
Selenium, Selenium-RC
WebDriver/Selenium2 (still in the works)
JBehave with JBehave Web
Some other libs are purposefully written with web-scraping in mind:
JSoup
Jaunt
Some Workarounds
Java is a language, but also a platform with many other languages running on it, some of which integrate great syntactic sugar or libraries that make it easy to build scrapers.
Check out:
Groovy (and its XmlSlurper)
or Scala (with great XML support as presented here and here)
If you know of a great library for Ruby (JRuby, with an article on scraping with JRuby and HtmlUnit) or Python (Jython) or you prefer these languages, then give their JVM ports a chance.
Some Supplements
Some other similar questions:
Scrape data from HTML using Java
Options for HTML Scraping
Here's my solution using URL and a try-with-resources statement to handle the exceptions.
/**
 * Created by mona on 5/27/16.
 */
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class ReadFromWeb {
    public static void readFromWeb(String webURL) throws IOException {
        URL url = new URL(webURL);
        InputStream is = url.openStream();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
            throw new MalformedURLException("URL is malformed!!");
        } catch (IOException e) {
            e.printStackTrace();
            throw new IOException();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = "https://madison.craigslist.org/search/sub";
        readFromWeb(url);
    }
}
You could additionally save it to a file based on your needs, or parse it using XML or HTML libraries.
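For example, a minimal sketch (my addition; the file name page.html is arbitrary) that saves the raw response bytes to a local file:
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SaveFromWeb {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://madison.craigslist.org/search/sub");
        try (InputStream in = url.openStream()) {
            // Copies the raw bytes, so this also works for binary downloads.
            Files.copy(in, Paths.get("page.html"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}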
Since Java 11 the most convenient way is to use java.net.http.HttpClient from the standard library.
Example:
HttpClient client = HttpClient.newBuilder()
        .version(Version.HTTP_1_1)
        .followRedirects(Redirect.NORMAL)
        .connectTimeout(Duration.ofSeconds(20))
        .proxy(ProxySelector.of(new InetSocketAddress("proxy.example.com", 80)))
        .authenticator(Authenticator.getDefault())
        .build();

HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://foo.com/"))
        .timeout(Duration.ofMinutes(2))
        .GET()
        .build();

HttpResponse<String> response = client.send(request, BodyHandlers.ofString());

System.out.println(response.statusCode());
System.out.println(response.body());
I use the following code for my API:
try {
    URL url = new URL("https://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage");
    InputStream content = url.openStream();
    int c;
    while ((c = content.read()) != -1) System.out.print((char) c);
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException ie) {
    ie.printStackTrace();
}
You can collect the characters and build them into a string. (Note that reading the stream byte by byte and casting to char is only safe for single-byte encodings such as ASCII; multi-byte UTF-8 characters will be mangled.)
