Using Java, how can I extract all the links from a given web page?
Download the page as plain text/HTML and pass it through Jsoup or HtmlCleaner. Both are similar and can parse even malformed HTML 4.0 syntax. You can then use the familiar HTML DOM parsing methods like getElementsByTagName("a"), or, in Jsoup, it's even cooler: you can simply use
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]");
// img with src ending .png
Element masthead = doc.select("div.masthead").first();
to find all links, and then get the details using
String linkhref = links.attr("href");
Taken from http://jsoup.org/cookbook/extracting-data/selector-syntax
The selectors have the same syntax as jQuery; if you know jQuery's function chaining, you will certainly love it.
EDIT: In case you want more tutorials, you can try out this one made by mkyong.
http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/
Either use a regular expression and the appropriate classes, or use an HTML parser. Which one to use depends on whether you want to be able to handle the whole web or just a few specific pages whose layout you know and can test against.
A simple regex which would match 99% of pages could be this:
// The HTML page as a String
String HTMLPage;
Pattern linkPattern = Pattern.compile("(<a[^>]+>.+?</a>)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
Matcher pageMatcher = linkPattern.matcher(HTMLPage);
ArrayList<String> links = new ArrayList<String>();
while (pageMatcher.find()) {
    links.add(pageMatcher.group());
}
// links ArrayList now contains all links in the page as a HTML tag
// i.e. <a att1="val1" ...>Text inside tag</a>
You can edit it to match more, be more standard compliant etc. but you would want a real parser in that case.
If you are only interested in the href value and the text in between, you can also use this regex:
Pattern linkPattern = Pattern.compile("<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
And access the link part with .group(1) and the text part with .group(2)
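For instance, here's a quick self-contained check of that pattern; the HTML snippet, URL, and class name below are invented for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkGroupsDemo {
    public static void main(String[] args) {
        // Hypothetical input; any anchor tag with an href would do
        String html = "<p><a href=\"https://example.com/page\" class=\"x\">Example</a></p>";
        Pattern linkPattern = Pattern.compile(
                "<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = linkPattern.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // the href value
            System.out.println(m.group(2)); // the text between the tags
        }
    }
}
```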
You can use the HTML Parser library to achieve this:
public static List<String> getLinksOnPage(final String url) {
    final Parser htmlParser = new Parser(url);
    final List<String> result = new LinkedList<String>();
    try {
        final NodeList tagNodeList = htmlParser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
        for (int j = 0; j < tagNodeList.size(); j++) {
            final LinkTag loopLink = (LinkTag) tagNodeList.elementAt(j);
            final String loopLinkStr = loopLink.getLink();
            result.add(loopLinkStr);
        }
    } catch (ParserException e) {
        e.printStackTrace(); // TODO handle error
    }
    return result;
}
This simple example seems to work, using a regex from here
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public ArrayList<String> extractUrlsFromString(String content) {
    ArrayList<String> result = new ArrayList<String>();
    String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(content);
    while (m.find()) {
        result.add(m.group());
    }
    return result;
}
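A quick sanity check of that method's logic on a hardcoded string (the URLs below are made up, so no network access is needed):

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlRegexDemo {
    public static void main(String[] args) {
        // Invented sample text containing two URLs
        String content = "See https://example.com/a?x=1 and ftp://files.example.org/x.txt, thanks.";
        String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
        ArrayList<String> result = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(content);
        while (m.find()) {
            result.add(m.group());
        }
        // Trailing punctuation is excluded by the final character class
        System.out.println(result);
    }
}
```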
And if you need it, this also seems to work for getting the HTML of a URL as a String, returning null if it can't be grabbed. It works fine with https URLs as well.
import org.apache.commons.io.IOUtils;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public String getUrlContentsAsString(String urlAsString) {
    try {
        URL url = new URL(urlAsString);
        return IOUtils.toString(url, StandardCharsets.UTF_8);
    } catch (Exception e) {
        return null;
    }
}
import java.io.*;
import java.net.*;
public class NameOfProgram {
    public static void main(String[] args) {
        URL url;
        InputStream is = null;
        BufferedReader br;
        String line;
        try {
            url = new URL("http://www.stackoverflow.com");
            is = url.openStream(); // throws an IOException
            br = new BufferedReader(new InputStreamReader(is));
            while ((line = br.readLine()) != null) {
                if (line.contains("href="))
                    System.out.println(line.trim());
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            try {
                if (is != null) is.close();
            } catch (IOException ioe) {
                // exception on close, nothing to do
            }
        }
    }
}
You would probably need to use regular expressions on the HTML link tags <a href=...> and </a>.
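A minimal sketch of that idea; the pattern and the sample page fragment below are my own invention, not taken from the answer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefOnlyDemo {
    public static void main(String[] args) {
        // Invented sample page fragment
        String page = "<a href=\"/one.html\">One</a><br><a href=\"/two.html\">Two</a>";
        // Capture just the double-quoted href attribute of each anchor tag
        Pattern p = Pattern.compile("<a[^>]*href=\"([^\"]*)\"[^>]*>", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(page);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}
```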
Related
I would love to scrape the titles of the top 250 movies (https://www.imdb.com/chart/top/) for educational purposes.
I have tried a lot of things but I messed up at the end every time. Could you please help me scrape the titles with Java and regex?
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class scraping {
    public static void main(String args[]) {
        try {
            URL URL1 = new URL("https://www.imdb.com/chart/top/");
            URLConnection URL1c = URL1.openConnection();
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(URL1c.getInputStream(), "ISO8859_7"));
            String line;
            int lineCount = 0;
            Pattern pattern = Pattern.compile("<td\\s+class=\"titleColumn\"[^>]*>" + ".*?</a>");
            Matcher matcher = pattern.matcher(br.readLine());
            while (matcher.find()) {
                System.out.println(matcher.group());
            }
        } catch (Exception e) {
            System.out.println("Exception: " + e.getClass() + ", Details: " + e.getMessage());
        }
    }
}
Thank you for your time.
Parsing Mode
To parse XML or HTML content, a dedicated parser will always be easier than a regex. For HTML in Java there is Jsoup, which will get you your films very easily:
Document doc = Jsoup.connect("https://www.imdb.com/chart/top/").get();
Elements films = doc.select("td.titleColumn");
for (Element film : films) {
    System.out.println(film);
}
<td class="titleColumn"> 1. Les évadés <span class="secondaryInfo">(1994)</span> </td>
<td class="titleColumn"> 2. Le parrain <span class="secondaryInfo">(1972)</span> </td>
To get the content only:
for (Element film : films) {
    System.out.println(film.getElementsByTag("a").text());
}
Les évadés
Le parrain
Le parrain, 2ème partie
Regex Mode
You were not reading the whole content of the website. Also, since it is markup, everything is not on one line: you can't count on finding the opening and closing tags on the same line. So read everything first, and only then apply the regex. That gives something like this:
URL url = new URL("https://www.imdb.com/chart/top/");
InputStream is = url.openStream();
StringBuilder sb = new StringBuilder();
try (BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line);
    }
} catch (MalformedURLException e) {
    e.printStackTrace();
    throw new MalformedURLException("URL is malformed!!");
} catch (IOException e) {
    e.printStackTrace();
    throw new IOException();
}
// Full line
Pattern pattern = Pattern.compile("<td class=\"titleColumn\">.*?</td>");
String content = sb.toString();
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
System.out.println(matcher.group());
}
// Title only
Pattern pattern = Pattern.compile("<td class=\"titleColumn\">.+?<a href=.+?>(.+?)</a>.+?</td>");
String content = sb.toString();
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
As the existing answer says, Jsoup or another HTML parser should be used for the sake of correctness.
I'm only completing your current solution in case you want to use a similar approach for a more reasonable use-case. It cannot work as posted, because you read only the first line from the buffer:
Matcher matcher = pattern.matcher(br.readLine());
Also, the regex pattern is wrong: your solution is built to read line by line and test each single line against the regex, but the source of the website shows that the content of the table row is spread across multiple lines.
The solution based on reading one line at a time should use a much simpler regex (sorry, the example output contains movie names in my native language):
\" ?>([^<]+)<\/a>
An example of working code is:
try {
    URL URL1 = new URL("https://www.imdb.com/chart/top/");
    URLConnection URL1c = URL1.openConnection();
    BufferedReader br = new BufferedReader(
            new InputStreamReader(URL1c.getInputStream(), "ISO8859_7"));
    Pattern pattern = Pattern.compile("\" ?>([^<]+)<\\/a>"); // Compiled once
    br.lines()                       // Stream<String>
      .map(pattern::matcher)         // Stream<Matcher>
      .filter(Matcher::find)         // Stream<Matcher> .. if the regex matches
      .limit(250)                    // Stream<Matcher> .. to avoid the possible mess below
      .map(m -> m.group(1))          // Stream<String>  .. captured movie name
      .forEach(System.out::println); // Printed out
} catch (Exception e) {
    System.out.println("Exception: " + e.getClass() + ", Details: " + e.getMessage());
}
Note the following:
Regex is not suitable for this. Use a library built for this use-case.
My solution is a working example, but the performance is poor (Stream API, regex pattern matching of each line)...
A solution like this doesn't guard against a possible mess: the regex can capture more than intended.
The website content, CSS class names etc. might change in the future.
I'm trying to filter through an ArrayList that contains the content of a URL (stored in List<String> quotes = new ArrayList<>();) and display everything between the <pre> </pre> tags (all the quotes are placed between these two tags). I already figured out the printing part, but is there any method in Java that lets you filter an ArrayList as I specified? Thanks.
more detail:
So you have a normal HTML file that contains all kinds of tags. Let's say I scan the page and store all the text in a string array. I want to display only the content between the <pre></pre> tags and not the rest. Hope this helps.
here is how the text is stored:
List<String> cookies = new ArrayList<>();

public void init() throws ServletException {
    try {
        URL url = new URL("http://fortunes.cat-v.org/openbsd/");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            cookies.add(line);
        }
        in.close();
    } catch (java.net.MalformedURLException e) {
        System.out.println("Malformed URL: " + e.getMessage());
    } catch (IOException e) {
        System.out.println("I/O Error: " + e.getMessage());
    }
}
Use a regular expression; here's a full working example:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
    public static void main(String[] args) {
        // This list is supposed to be filled with some values
        List<String> quotes = new ArrayList<String>();
        // Compile the pattern once, outside the loop
        Pattern pattern = Pattern.compile(".*?<pre>(.*?)</pre>.*?");
        for (String quote : quotes) {
            Matcher m = pattern.matcher(quote);
            while (m.find()) {
                String result = m.group(1);
                System.out.println(result);
            }
        }
    }
}
You can find the index of the String "<pre>" and the index of "</pre>" and loop over all the elements between them:
int startIndex = quotes.indexOf("<pre>");
int endIndex = quotes.indexOf("</pre>");
for (int i = startIndex + 1; i < endIndex; i++) {
    // do something here ...
    // System.out.println(quotes.get(i));
}
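If the opening and closing tags really are stored as their own list elements, subList gives the same slice without a manual loop; the data below is a made-up stand-in for the scraped lines:

```java
import java.util.Arrays;
import java.util.List;

public class PreSliceDemo {
    public static void main(String[] args) {
        // Invented stand-in for the lines read from the page
        List<String> quotes = Arrays.asList(
                "<html>", "<pre>", "first quote", "second quote", "</pre>", "</html>");
        int startIndex = quotes.indexOf("<pre>");
        int endIndex = quotes.indexOf("</pre>");
        // Elements strictly between the two tags
        List<String> between = quotes.subList(startIndex + 1, endIndex);
        System.out.println(between);
    }
}
```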
I want to read values from a text file using Selenium WebDriver, and I want to get the third entry from the file. The text file is something like this:
1.Apple
2.Orange
3.Grape
I want to read the third option (Grape) and display it. Please help.
If you are able to read the text file and store its data in a String, then you can use a regular expression to get the third option.
String ps = ".*3\\.([A-Za-z]*)"; // regex (the dot is escaped so it matches a literal ".")
String s = "1.Apple 2.Orange 3.Grape"; // file data in a String object
Pattern p = Pattern.compile(ps);
Matcher m = p.matcher(s);
if (m.find()) {
    System.out.println(m.group(0)); // returns value of s
    System.out.println(m.group(1)); // returns result= Grape
}
Check regex here : https://regex101.com/r/cF9pB7/1
You can get the other values by changing the regular expression.
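Putting that together on the sample data, with the dot escaped so it doesn't match any character:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThirdOptionDemo {
    public static void main(String[] args) {
        String s = "1.Apple 2.Orange 3.Grape"; // file data in a String
        // \\. matches a literal dot; ([A-Za-z]*) captures the word after "3."
        Pattern p = Pattern.compile("3\\.([A-Za-z]*)");
        Matcher m = p.matcher(s);
        if (m.find()) {
            System.out.println(m.group(1)); // Grape
        }
    }
}
```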
You do not need Selenium to read a text file; Selenium is just a browser-automation tool. You can use the code below to read a text file with plain Java.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
public class BufferedReaderExample {
    public static void main(String[] args) {
        BufferedReader br = null;
        try {
            String sCurrentLine;
            br = new BufferedReader(new FileReader("C:\\testing.txt"));
            while ((sCurrentLine = br.readLine()) != null) {
                System.out.println(sCurrentLine);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) br.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}
You can integrate it with your selenium code.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to create a Java String from the contents of a file
I have a html file which I want to use to extract information. For that I am using Jsoup.
Now for using Jsoup, I need to convert the html file into a string. How can I do that?
File myhtml = new File("D:\\path\\report.html");
Now, I want a String object that contains the content inside the html file.
I use Apache Commons IO to read a text file into a single String:
String str = FileUtils.readFileToString(file);
Simple and "clean". You can even set the encoding of the text file with no hassle:
String str = FileUtils.readFileToString(file, "UTF-8");
Use a library like Guava or Commons / IO. They have oneliner methods.
Guava:
Files.toString(file, charset);
Commons / IO:
FileUtils.readFileToString(file, charset);
Without such a library, I'd write a helper method, something like this:
public String readFile(File file, Charset charset) throws IOException {
    return new String(Files.readAllBytes(file.toPath()), charset);
}
With Java 7, it's as simple as:
final String EoL = System.getProperty("line.separator");
List<String> lines = Files.readAllLines(Paths.get(fileName),
Charset.defaultCharset());
StringBuilder sb = new StringBuilder();
for (String line : lines) {
    sb.append(line).append(EoL);
}
final String content = sb.toString();
However, it does have a few minor caveats (such as handling files that do not fit into memory).
I would suggest taking a look at the corresponding section in the official Java tutorial (also worthwhile if you are coming from an earlier Java version).
As others pointed out, you might find some 3rd-party libraries useful (like Apache Commons IO or Guava).
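On Java 11 and later, the standard library also has a one-line alternative, Files.readString. The path and file content below are assumed examples; the demo writes a small file first so it has something to read:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadStringDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical file; substitute your own HTML file's path
        Path path = Files.createTempFile("report", ".html");
        Files.writeString(path, "<html>hello</html>");
        // The whole file as one String, decoded with the given charset
        String content = Files.readString(path, StandardCharsets.UTF_8);
        System.out.println(content);
        Files.delete(path);
    }
}
```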
Read the file with a FileInputStream and append the file content to a String:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
public class CopyOffileInputStream {
    public static void main(String[] args) {
        //File file = new File("./store/robots.txt");
        File file = new File("swingloggingsscce.log");
        FileInputStream fis = null;
        String str = "";
        try {
            fis = new FileInputStream(file);
            int content;
            while ((content = fis.read()) != -1) {
                // convert to char and append it
                str += (char) content;
            }
            System.out.println("After reading file");
            System.out.println(str);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (fis != null)
                    fis.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}
By the way, Jsoup has method that takes file: http://jsoup.org/apidocs/org/jsoup/Jsoup.html#parse(java.io.File,%20java.lang.String)
You can copy all contents of myhtml to String as follows:
Scanner myScanner = null;
try {
    myScanner = new Scanner(myhtml);
    String contents = myScanner.useDelimiter("\\Z").next();
} finally {
    if (myScanner != null) {
        myScanner.close();
    }
}
Of course, you can add a catch block to handle exceptions properly.
Why not just read the file line by line and append each line to a StringBuffer? Once you reach the end of the file, you can get the String from the StringBuffer.
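A sketch of that suggestion using StringBuilder and try-with-resources so the reader is closed automatically; "report.html" is a placeholder path:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class FileToStringDemo {
    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        // "report.html" is a placeholder; point this at your own file
        try (BufferedReader in = new BufferedReader(new FileReader("report.html"))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n'); // keep line breaks if you need them
            }
        }
        String contents = sb.toString();
        System.out.println(contents);
    }
}
```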
Are there better ways to read an entire html file to a single string variable than:
String content = "";
try {
    BufferedReader in = new BufferedReader(new FileReader("mypage.html"));
    String str;
    while ((str = in.readLine()) != null) {
        content += str;
    }
    in.close();
} catch (IOException e) {
}
You should use a StringBuilder:
StringBuilder contentBuilder = new StringBuilder();
try {
    BufferedReader in = new BufferedReader(new FileReader("mypage.html"));
    String str;
    while ((str = in.readLine()) != null) {
        contentBuilder.append(str);
    }
    in.close();
} catch (IOException e) {
}
String content = contentBuilder.toString();
There's the IOUtils.toString(..) utility from Apache Commons.
If you're using Guava there's also Files.readLines(..) and Files.toString(..).
You can use Jsoup.
It's a very strong HTML parser for Java.
As Jean mentioned, using a StringBuilder instead of += would be better. But if you're looking for something simpler, Guava, IOUtils, and Jsoup are all good options.
Example with Guava:
String content = Files.asCharSource(new File("/path/to/mypage.html"), StandardCharsets.UTF_8).read();
Example with IOUtils:
InputStream in = new FileInputStream("/path/to/mypage.html");
String content;
try {
    content = IOUtils.toString(in, StandardCharsets.UTF_8);
} finally {
    IOUtils.closeQuietly(in);
}
Example with Jsoup:
String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").toString();
or
String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").outerHtml();
NOTES:
Files.readLines() and Files.toString()
These are now deprecated as of Guava release version 22.0 (May 22, 2017).
Files.asCharSource() should be used instead as seen in the example above. (version 22.0 release diffs)
IOUtils.toString(InputStream) and Charsets.UTF_8
Deprecated as of Apache Commons-IO version 2.5 (May 6, 2016). IOUtils.toString should now be passed the InputStream and the Charset as seen in the example above. Java 7's StandardCharsets should be used instead of Charsets as seen in the example above. (deprecated Charsets.UTF_8)
I prefer using Guava:
import com.google.common.base.Charsets;
import com.google.common.io.Files;

File file = new File("/path/to/file");
String content = Files.toString(file, Charsets.UTF_8);
For string operations, use the StringBuilder or StringBuffer classes to accumulate string data. Do not use the += operation on String objects: the String class is immutable, so you will produce a large number of String objects at runtime, and it will hurt performance.
Use .append() method of StringBuilder/StringBuffer class instance instead.
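For example, accumulating parts with append into one mutable buffer instead of creating a new String on every += (the parts list here is illustrative):

```java
import java.util.Arrays;
import java.util.List;

public class AppendDemo {
    public static void main(String[] args) {
        // Invented fragments standing in for lines of a page
        List<String> parts = Arrays.asList("<html>", "<body>", "</body>", "</html>");
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part); // one mutable buffer, no intermediate Strings
        }
        String page = sb.toString();
        System.out.println(page);
    }
}
```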
Here's a solution to retrieve the html of a webpage using only standard java libraries:
import java.io.*;
import java.net.*;

String urlToRead = "https://google.com";
URL url;                 // The URL to read
HttpURLConnection conn;  // The actual connection to the web page
BufferedReader rd;       // Used to read results from the web page
String line;             // An individual line of the web page HTML
String result = "";      // A long string containing all the HTML
try {
    url = new URL(urlToRead);
    conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    while ((line = rd.readLine()) != null) {
        result += line;
    }
    rd.close();
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println(result);
import org.apache.commons.io.IOUtils;
import java.io.IOException;

// Java 10 code (uses var) - assumes index.html is available inside the resources folder.
try {
    var content = new String(IOUtils.toByteArray(this.getClass()
            .getResource("/index.html")));
} catch (IOException e) {
    e.printStackTrace();
}