Reading entire html file to String?


Are there better ways to read an entire HTML file into a single String variable than:
String content = "";
try {
BufferedReader in = new BufferedReader(new FileReader("mypage.html"));
String str;
while ((str = in.readLine()) != null) {
content +=str;
}
in.close();
} catch (IOException e) {
}

You should use a StringBuilder:
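StringBuilder contentBuilder = new StringBuilder();
// try-with-resources closes the reader even if readLine() throws
try (BufferedReader in = new BufferedReader(new FileReader("mypage.html"))) {
    String str;
    while ((str = in.readLine()) != null) {
        // readLine() strips the line terminator, so add one back
        contentBuilder.append(str).append('\n');
    }
} catch (IOException e) {
    e.printStackTrace(); // do not silently swallow the failure
}
String content = contentBuilder.toString();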

There's the IOUtils.toString(..) utility from Apache Commons.
If you're using Guava there's also Files.readLines(..) and Files.toString(..).

You can use JSoup.
It's a very strong HTML parser for java

As Jean mentioned, using a StringBuilder instead of += would be better. But if you're looking for something simpler, Guava, IOUtils, and Jsoup are all good options.
Example with Guava:
String content = Files.asCharSource(new File("/path/to/mypage.html"), StandardCharsets.UTF_8).read();
Example with IOUtils:
InputStream in = new URL("file:///path/to/mypage.html").openStream(); // a URL needs a protocol; a bare path throws MalformedURLException
String content;
try {
content = IOUtils.toString(in, StandardCharsets.UTF_8);
} finally {
IOUtils.closeQuietly(in);
}
Example with Jsoup:
String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").toString();
or
String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").outerHtml();
NOTES:
Files.readLines() and Files.toString()
These are now deprecated as of Guava release version 22.0 (May 22, 2017).
Files.asCharSource() should be used instead as seen in the example above. (version 22.0 release diffs)
IOUtils.toString(InputStream) and Charsets.UTF_8
Deprecated as of Apache Commons-IO version 2.5 (May 6, 2016). IOUtils.toString should now be passed the InputStream and the Charset as seen in the example above. Java 7's StandardCharsets should be used instead of Charsets as seen in the example above. (deprecated Charsets.UTF_8)

I prefer using Guava:
import com.google.common.base.Charsets;
import com.google.common.io.Files;
File file = new File("/path/to/file");
String content = Files.toString(file, Charsets.UTF_8);

For string operations, use the StringBuilder or StringBuffer classes to accumulate blocks of string data. Do not use += on String objects: the String class is immutable, so each += creates a new String object, and in a loop this produces a large number of objects at runtime and hurts performance.
Use the .append() method of a StringBuilder/StringBuffer instance instead.
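A minimal sketch of that advice using only the JDK, assuming the file is small enough to hold in memory. The class and method names (HtmlFileReader, readAll) are illustrative and not from the answers above. It reads in chunks rather than lines, so line breaks are preserved as they appear in the file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the name is an assumption for this sketch.
final class HtmlFileReader {

    // Reads the whole file into one String, accumulating chunks in a StringBuilder.
    static String readAll(Path path, Charset charset) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader even when read() throws.
        try (BufferedReader in = Files.newBufferedReader(path, charset)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                // append() grows one buffer instead of creating a new String per chunk
                content.append(buffer, 0, n);
            }
        }
        return content.toString();
    }
}
```

It could be called as HtmlFileReader.readAll(Paths.get("mypage.html"), StandardCharsets.UTF_8), with the charset chosen to match how the file was actually saved.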

Here's a solution to retrieve the html of a webpage using only standard java libraries:
import java.io.*;
import java.net.*;
String urlToRead = "https://google.com";
URL url; // The URL to read
HttpURLConnection conn; // The actual connection to the web page
BufferedReader rd; // Used to read results from the web page
String line; // An individual line of the web page HTML
String result = ""; // A long string containing all the HTML
try {
url = new URL(urlToRead);
conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("GET");
rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
while ((line = rd.readLine()) != null) {
result += line;
}
rd.close();
} catch (Exception e) {
e.printStackTrace();
}
System.out.println(result);

import org.apache.commons.io.IOUtils;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
try {
    // Decode with an explicit charset instead of the platform default
    var content = new String(
            IOUtils.toByteArray(this.getClass().getResource("/index.html")),
            StandardCharsets.UTF_8);
} catch (IOException e) {
    e.printStackTrace();
}
// Java 10 code (uses var), assuming index.html is available inside the resources folder.
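If Commons IO is not on the classpath, something similar should be possible with the JDK alone on Java 9 or later, where InputStream.readAllBytes() exists. This is a sketch that assumes UTF-8 content; ResourceText and readUtf8 are hypothetical names:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is an assumption for this sketch.
final class ResourceText {

    // Drains the stream, closes it, and decodes the bytes as UTF-8.
    static String readUtf8(InputStream in) throws IOException {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

A call might look like ResourceText.readUtf8(MyClass.class.getResourceAsStream("/index.html")), where MyClass is a placeholder. getResourceAsStream returns null when the resource is missing, so a null check before the call is advisable.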

Related

G suite account get report java sample question

I am trying to use this API to get a report with Java. Here is the link:
https://developers.google.com/admin-sdk/reports/v1/appendix/activity/meet
and here is what I am using now:
public static String getGraph() {
String PROTECTED_RESOURCE_URL = "https://www.googleapis.com/admin/reports/v1/activity/users/all/applications/meet?eventName=call_ended&maxResults=10&access_token=";
String graph = "";
try {
URL urUserInfo = new URL(PROTECTED_RESOURCE_URL + "access_token");
HttpURLConnection connObtainUserInfo = (HttpURLConnection) urUserInfo.openConnection();
if (connObtainUserInfo.getResponseCode() == HttpURLConnection.HTTP_OK) {
StringBuilder sbLines = new StringBuilder("");
BufferedReader reader = new BufferedReader(
new InputStreamReader(connObtainUserInfo.getInputStream(), "utf-8"));
String strLine = "";
while ((strLine = reader.readLine()) != null) {
sbLines.append(strLine);
}
graph = sbLines.toString();
}
} catch (IOException ex) {
ex.printStackTrace();
}
return graph;
}
I am pretty sure this is not a smart way to do it, and the string I get back is quite complex. Is there a Java sample that gets the data directly instead of using Java's built-in HTTP request classes?
Or is there a class I can import to convert the JSON string to an object?
Can anyone help?
I have been trying this for many days already!
Thanks!!

Parsed strings from .csv-file are invalid tokens in an kml-file. How can i solve this?

I have code that parses strings from a CSV file (with Twitter data) and writes them to a new KML file. When I parse the comments from the Twitter data there are, of course, unknown tokens like: ðŸš¨. When I open the new KML file in Google Earth I get an error because of these unknown tokens.
Question:
When I parse the strings, can I tell Java to throw out all unknown tokens so that I don't have any in my KML?
Thank you
Code below:
String csvFile = "twitter.csv";
BufferedReader br = null;
String line = "";
String cvsSplitBy = ";";
String[] twitter = null;
int row_desired = 0;
int row_counter = 0;
String[] placemarks = new String[1165];
// from here on: read in the CSV
try {
br = new BufferedReader(new FileReader(csvFile));
while ((line = br.readLine()) != null) {
if (row_counter++ == row_desired) {
twitter = line.split(cvsSplitBy);
placemarks[row_counter] =
"<Placemark>\n"+
"<name>User ID: "+twitter[7]+"</name>\n"+
"<description>This User wrote: "+twitter[5]+" at the: "+twitter[6]+"</description>\n"+
"<Point>\n"+
"<coordinates>"+twitter[1]+","+twitter[2]+"</coordinates>\n"+
"</Point>\n"+
"</Placemark>\n";
row_desired++;
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (br != null) {
try {
br.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
for(int i = 2; i <= 1164;i++){
String kml2 = kml.concat(""+placemarks[i]+"");
kml=kml2;
}
kml = kml.concat("</Document></kml>");
FileWriter fileWriter = new FileWriter(filepath);
fileWriter.write(kml);
fileWriter.close();
Runtime.getRuntime().exec(googlefilepath + filepath);
}

Text files are not all built equal: you must always consider what character encoding is in use. I'm not sure about Twitter's data specifically, but I would guess they're doing like the rest of the world and using UTF-8.
Basically, avoid FileReader and instead use the constructor of InputStreamReader which lets you specify the Charset.
Tip: if you're using Java 7+, try this:
for (String line : Files.readAllLines(file.toPath(), Charset.forName("UTF-8"))) { ...
More Info
The javadoc of FileReader states: "The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate."
You should avoid this class, always, or at least for any data that might ever be transferred between computers. Even a program running on Windows "using the default charset" may assume UTF-8 when run from inside Eclipse, or ISO_8859_1 when run outside Eclipse! Such non-determinism from a class is not good.
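The InputStreamReader approach can be sketched as follows. CharsetAwareReader and readLines are hypothetical names, and the charset to pass is an assumption the caller has to verify for their own data:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is an assumption for this sketch.
final class CharsetAwareReader {

    // Reads all lines, decoding bytes with the charset the caller names explicitly.
    static List<String> readLines(String fileName, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Reading a UTF-8 file that contains "café" with StandardCharsets.UTF_8 returns the text intact, while reading the same bytes with StandardCharsets.ISO_8859_1 returns "cafÃ©". That is the same kind of garbling as the tokens shown in the question.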

Java: How to convert a File object to a String object in java? [duplicate]

Possible Duplicate:
How to create a Java String from the contents of a file
I have an HTML file from which I want to extract information. For that I am using Jsoup.
To use Jsoup, I need to convert the HTML file into a string. How can I do that?
File myhtml = new File("D:\\path\\report.html");
Now I want a String object that contains the content of the HTML file.

I use Apache Commons IO to read a text file into a single string:
String str = FileUtils.readFileToString(file);
Simple and clean. You can also specify the encoding of the text file with no hassle:
String str = FileUtils.readFileToString(file, "UTF-8");

Use a library like Guava or Commons / IO. They have oneliner methods.
Guava:
Files.toString(file, charset);
Commons / IO:
FileUtils.readFileToString(file, charset);
Without such a library, I'd write a helper method, something like this:
public String readFile(File file, Charset charset) throws IOException {
return new String(Files.readAllBytes(file.toPath()), charset);
}

With Java 7, it's as simple as:
final String EoL = System.getProperty("line.separator");
List<String> lines = Files.readAllLines(Paths.get(fileName),
Charset.defaultCharset());
StringBuilder sb = new StringBuilder();
for (String line : lines) {
sb.append(line).append(EoL);
}
final String content = sb.toString();
However, it does have a few minor caveats (such as handling files that do not fit into memory).
I would suggest taking a look at the corresponding section in the official Java tutorial (it also applies if you are on an earlier Java version).
As others pointed out, you might find some third-party libraries useful (like Apache Commons IO or Guava).
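For the memory caveat, one option on Java 8 or later is to process the file lazily instead of building one big String. The sketch below counts matching lines; LargeFileScan and countLinesContaining are hypothetical names, and UTF-8 is an assumed encoding:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper class; the name is an assumption for this sketch.
final class LargeFileScan {

    // Counts lines containing the marker without holding the whole file in memory.
    static long countLinesContaining(Path path, String marker) throws IOException {
        // Files.lines reads lazily; closing the stream releases the underlying file.
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.contains(marker)).count();
        }
    }
}
```

This only helps when the task can be done line by line. If the full content really is needed as a single String, the file has to fit in memory either way.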

Read the file with a FileInputStream and append its content to a String.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
public class CopyOffileInputStream {
public static void main(String[] args) {
//File file = new File("./store/robots.txt");
File file = new File("swingloggingsscce.log");
FileInputStream fis = null;
String str = "";
try {
fis = new FileInputStream(file);
int content;
while ((content = fis.read()) != -1) {
// convert the byte to a char and append it (only correct for single-byte encodings)
str += (char) content;
}
System.out.println("After reading file");
System.out.println(str);
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (fis != null)
fis.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
}

By the way, Jsoup has method that takes file: http://jsoup.org/apidocs/org/jsoup/Jsoup.html#parse(java.io.File,%20java.lang.String)

You can copy all contents of myhtml to String as follows:
Scanner myScanner = null;
try
{
myScanner = new Scanner(myhtml);
String contents = myScanner.useDelimiter("\\Z").next();
}
finally
{
if(myScanner != null)
{
myScanner.close();
}
}
Of course, you can add a catch block to handle exceptions properly.

Why not just read the file line by line and add each line to a StringBuffer?
After you reach the end of the file, you can get the String from the StringBuffer.

How to read a text file directly from Internet using Java?

I am trying to read some words from an online text file.
I tried doing something like this
File file = new File("http://www.puzzlers.org/pub/wordlists/pocket.txt");
Scanner scan = new Scanner(file);
but it didn't work, I am getting
http://www.puzzlers.org/pub/wordlists/pocket.txt
as the output and I just want to get all the words.
I know they taught me this back in the day but I don't remember exactly how to do it now, any help is greatly appreciated.

Use a URL instead of a File for any access that is not on your local computer.
URL url = new URL("http://www.puzzlers.org/pub/wordlists/pocket.txt");
Scanner s = new Scanner(url.openStream());
In fact, URL is more generally useful: it also works for local access (use a file: URL), jar files, and just about anything else that can be retrieved somehow.
The approach above interprets the file in your platform's default encoding. If you want to use the encoding indicated by the server instead, you have to use a URLConnection and parse its content type, as indicated in the answers to this question.
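A rough sketch of the URLConnection idea, assuming the server sends a Content-Type header with a charset parameter. UrlText, charsetOf, and read are hypothetical names, and the UTF-8 fallback is an assumption of this sketch rather than something the server guarantees:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Scanner;

// Hypothetical helper class; the name is an assumption for this sketch.
final class UrlText {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/plain; charset=ISO-8859-1"; falls back to UTF-8 (an assumption).
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Downloads the resource and decodes it with the server-declared charset.
    static String read(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        Charset charset = charsetOf(conn.getContentType());
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in, charset.name())) {
            s.useDelimiter("\\A"); // \A matches only at input start, so next() returns everything
            return s.hasNext() ? s.next() : "";
        }
    }
}
```

Keeping the header parsing in its own method makes that part easy to check without a network connection. The download itself can still fail with an IOException, which the caller has to handle.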
About your error: make sure your file compiles without any errors; you need to handle the exceptions. Click the red messages shown by your IDE; it should recommend how to fix them. Do not start a program that does not compile (even if the IDE allows this).
Here it is with some sample exception handling:
try {
URL url = new URL("http://www.puzzlers.org/pub/wordlists/pocket.txt");
Scanner s = new Scanner(url.openStream());
// read from your scanner
}
catch(IOException ex) {
// there was some connection problem, or the file did not exist on the server,
// or your URL was not in the right format.
// think about what to do now, and put it here.
ex.printStackTrace(); // for now, simply output it.
}

try something like this
URL u = new URL("http://www.puzzlers.org/pub/wordlists/pocket.txt");
InputStream in = u.openStream();
Then use it as any plain old input stream

What really worked for me: (source: Oracle documentation, "Reading Directly from a URL")
import java.net.*;
import java.io.*;
public class UrlTextfile {
public static void main(String[] args) throws Exception {
URL oracle = new URL("http://yoursite.com/yourfile.txt");
BufferedReader in = new BufferedReader(
new InputStreamReader(oracle.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
System.out.println(inputLine);
in.close();
}
}

Using Apache Commons IO:
import org.apache.commons.io.IOUtils;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
public static String readURLToString(String url) throws IOException
{
try (InputStream inputStream = new URL(url).openStream())
{
return IOUtils.toString(inputStream, StandardCharsets.UTF_8);
}
}

Use this code to read an Internet resource into a String:
public static String readToString(String targetURL) throws IOException
{
URL url = new URL(targetURL);
BufferedReader bufferedReader = new BufferedReader(
new InputStreamReader(url.openStream()));
StringBuilder stringBuilder = new StringBuilder();
String inputLine;
while ((inputLine = bufferedReader.readLine()) != null)
{
stringBuilder.append(inputLine);
stringBuilder.append(System.lineSeparator());
}
bufferedReader.close();
return stringBuilder.toString().trim();
}
This is based on here.

For an old school input stream, use this code:
InputStream in = new URL("http://google.com/").openConnection().getInputStream();

I did that in the following way for an image, you should be able to do it for text using similar steps.
// folder & name of image on PC
File fileObj = new File("C:\\Displayable\\imgcopy.jpg");
Boolean testB = fileObj.createNewFile();
System.out.println("Test this file eeeeeeeeeeeeeeeeeeee "+testB);
// image on server
URL url = new URL("http://localhost:8181/POPTEST2/imgone.jpg");
InputStream webIS = url.openStream();
FileOutputStream fo = new FileOutputStream(fileObj);
int c = 0;
do {
c = webIS.read();
System.out.println("==============> " + c);
if (c !=-1) {
fo.write((byte) c);
}
} while(c != -1);
webIS.close();
fo.close();

Alternatively, you can use Guava's Resources object:
URL url = new URL("http://www.puzzlers.org/pub/wordlists/pocket.txt");
List<String> lines = Resources.readLines(url, Charsets.UTF_8);
lines.forEach(System.out::println);

The corrected method is deprecated now. It gives this option instead:
private WeakReference<MyActivity> activityReference;
The solution here will be useful.

Extract links from a web page

Using Java, how can I extract all the links from a given web page?

Using Java, download the page as plain text/HTML and pass it through Jsoup or HtmlCleaner. Both are similar and can parse even malformed HTML 4.0 syntax. You can then use the usual HTML DOM parsing methods such as getElementsByTagName("a"), or, even nicer in Jsoup, simply use:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]");
// img with src ending .png
Element masthead = doc.select("div.masthead").first();
and find all links, then get the details using (on Elements, attr returns the value from the first matching element that has the attribute):
String linkhref=links.attr("href");
Taken from http://jsoup.org/cookbook/extracting-data/selector-syntax
The selectors have the same syntax as jQuery; if you know jQuery function chaining, you will certainly love it.
EDIT: In case you want more tutorials, you can try this one by mkyong:
http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/

Either use a Regular Expression and the appropriate classes or use a HTML parser. Which one you want to use depends on whether you want to be able to handle the whole web or just a few specific pages of which you know the layout and which you can test against.
A simple regex which would match 99% of pages could be this:
// The HTML page as a String
String HTMLPage;
Pattern linkPattern = Pattern.compile("(<a[^>]+>.+?</a>)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
Matcher pageMatcher = linkPattern.matcher(HTMLPage);
ArrayList<String> links = new ArrayList<String>();
while(pageMatcher.find()){
links.add(pageMatcher.group());
}
// links ArrayList now contains all links in the page as a HTML tag
// i.e. <a att1="val1" ...>Text inside tag</a>
You can edit it to match more, be more standard compliant etc. but you would want a real parser in that case.
If you are only interested in the href="" and text in between you can also use this regex:
Pattern linkPattern = Pattern.compile("<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
And access the link part with .group(1) and the text part with .group(2)
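Put together, the href-and-text variant might look like the sketch below. LinkExtractor and extract are hypothetical names, and the pattern keeps the regex approach's limits: it expects at least one character between <a and href, and it can miss or mangle unusual markup:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this sketch.
final class LinkExtractor {

    // Group 1 captures the href value, group 2 the text between the tags.
    private static final Pattern LINK = Pattern.compile(
            "<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one "href -> text" entry per link found in the page.
    static List<String> extract(String htmlPage) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(htmlPage);
        while (m.find()) {
            links.add(m.group(1) + " -> " + m.group(2));
        }
        return links;
    }
}
```

For the input `<a href="http://example.com/a">First</a>` this yields the single entry `http://example.com/a -> First`. For pages you do not control, a real parser remains the safer choice.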

You can use the HTML Parser library to achieve this:
public static List<String> getLinksOnPage(final String url) {
final List<String> result = new LinkedList<String>();
try {
// The Parser constructor can itself throw ParserException, so create it inside the try
final Parser htmlParser = new Parser(url);
final NodeList tagNodeList = htmlParser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
for (int j = 0; j < tagNodeList.size(); j++) {
final LinkTag loopLink = (LinkTag) tagNodeList.elementAt(j);
final String loopLinkStr = loopLink.getLink();
result.add(loopLinkStr);
}
} catch (ParserException e) {
e.printStackTrace(); // TODO handle error
}
return result;
}

This simple example seems to work, using a regex from here
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public ArrayList<String> extractUrlsFromString(String content)
{
ArrayList<String> result = new ArrayList<String>();
String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(content);
while (m.find())
{
result.add(m.group());
}
return result;
}
and if you need it, this seems to work to get the HTML of an url as well, returning null if it can't be grabbed. It works fine with https urls as well.
import org.apache.commons.io.IOUtils;
public String getUrlContentsAsString(String urlAsString)
{
try
{
URL url = new URL(urlAsString);
String result = IOUtils.toString(url);
return result;
}
catch (Exception e)
{
return null;
}
}

import java.io.*;
import java.net.*;
public class NameOfProgram {
public static void main(String[] args) {
URL url;
InputStream is = null;
BufferedReader br;
String line;
try {
url = new URL("http://www.stackoverflow.com");
is = url.openStream(); // throws an IOException
br = new BufferedReader(new InputStreamReader(is));
while ((line = br.readLine()) != null) {
if(line.contains("href="))
System.out.println(line.trim());
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
if (is != null) is.close();
} catch (IOException ioe) {
//exception
}
}
}
}

You would probably need to use regular expressions on the HTML link tags <a href=> and </a>
