Scraping a site with java regex

I would love to scrape the titles of the top 250 movies (https://www.imdb.com/chart/top/) for educational purposes.
I have tried a lot of things but I messed up at the end every time. Could you please help me scrape the titles with Java and regex?
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class scraping {
public static void main (String args[]) {
try {
URL URL1=new URL("https://www.imdb.com/chart/top/");
URLConnection URL1c=URL1.openConnection();
BufferedReader br=new BufferedReader(new
InputStreamReader(URL1c.getInputStream(),"ISO8859_7"));
String line;int lineCount=0;
Pattern pattern = Pattern.compile("<td\\s+class=\"titleColumn\"[^>]*>"+ ".*?</a>");
Matcher matcher = pattern.matcher(br.readLine());
while(matcher.find()){
System.out.println(matcher.group());
}
} catch (Exception e) {
System.out.println("Exception: " + e.getClass() + ", Details: " + e.getMessage());
}
}
}
Thank you for your time.

Parsing Mode
To parse XML or HTML content, a dedicated parser is always easier than a regex. For HTML in Java there is Jsoup, which gets you the films very easily:
Document doc = Jsoup.connect("https://www.imdb.com/chart/top/").get();
Elements films = doc.select("td.titleColumn");
for (Element film : films) {
System.out.println(film);
}
<td class="titleColumn"> 1. Les évadés <span class="secondaryInfo">(1994)</span> </td>
<td class="titleColumn"> 2. Le parrain <span class="secondaryInfo">(1972)</span> </td>
To get the title text only:
for (Element film : films) {
System.out.println(film.getElementsByTag("a").text());
}
Les évadés
Le parrain
Le parrain, 2ème partie
Regex Mode
You were not reading the whole content of the page. It is also markup, so it is not all on one line: the opening and the closing of a tag are not necessarily on the same line. Read everything first, then apply the regex, which gives something like this:
URL url = new URL("https://www.imdb.com/chart/top/");
InputStream is = url.openStream();
StringBuilder sb = new StringBuilder();
try (BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
}
} catch (IOException e) {
// Keep the original exception as the cause instead of discarding it.
throw new IOException("Could not read " + url, e);
}
// Full line
Pattern pattern = Pattern.compile("<td class=\"titleColumn\">.*?</td>");
String content = sb.toString();
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
System.out.println(matcher.group());
}
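// Title only (an alternative to the "full line" block above; the variable names are reused, so keep only one of the two)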
Pattern pattern = Pattern.compile("<td class=\"titleColumn\">.+?<a href=.+?>(.+?)</a>.+?</td>");
String content = sb.toString();
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
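To try the title-only pattern without hitting the network, here is a minimal sketch that runs it against a hard-coded fragment. The fragment is only an assumption about what the table rows looked like when this answer was written (the real markup may well have changed since), and the class name, method name and sample titles are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleRegexDemo {

    // Same "title only" pattern as above; group 1 is the text inside the <a> tag.
    private static final Pattern TITLE = Pattern.compile(
            "<td class=\"titleColumn\">.+?<a href=.+?>(.+?)</a>.+?</td>");

    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher matcher = TITLE.matcher(html);
        while (matcher.find()) {
            titles.add(matcher.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        // Hypothetical rows, already joined into one line the way the
        // StringBuilder loop does (readLine() drops the line breaks).
        String html =
                "<td class=\"titleColumn\"> 1. <a href=\"/title/a/\" title=\"x\">First Movie</a> "
              + "<span class=\"secondaryInfo\">(1994)</span> </td>"
              + "<td class=\"titleColumn\"> 2. <a href=\"/title/b/\" title=\"y\">Second Movie</a> "
              + "<span class=\"secondaryInfo\">(1972)</span> </td>";
        extractTitles(html).forEach(System.out::println);
    }
}
```

Because the lines were appended without their line breaks, `.` can cross what used to be separate lines even without Pattern.DOTALL. If you keep the line breaks in the buffer, you would need that flag.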

As the existing answer says, Jsoup or another HTML parser should be used for the sake of correctness.
I am only completing your current solution in case you want to use a similar approach for a more reasonable use case. Your code cannot work because you read only the first line from the buffer:
Matcher matcher = pattern.matcher(br.readLine());
The regex pattern is also wrong: your solution is built to take a single line and test only that line against the regex, but the source of the website shows that the content of a table row is spread across multiple lines.
A solution based on reading one line at a time should use a much simpler regex (sorry, the example contains movie names in my native language):
\" ?>([^<]+)<\/a>
An example of working code is:
try {
URL URL1=new URL("https://www.imdb.com/chart/top/");
URLConnection URL1c=URL1.openConnection();
BufferedReader br=new BufferedReader(new
InputStreamReader(URL1c.getInputStream(),"ISO8859_7"));
String line;int lineCount=0;
Pattern pattern = Pattern.compile("\" ?>([^<]+)<\\/a>"); // Compiled once
br.lines() // Stream<String>
.map(pattern::matcher) // Stream<Matcher>
.filter(Matcher::find) // Stream<Matcher> .. if Regex matches
.limit(250) // Stream<Matcher> .. to avoid possible mess below
.map(m -> m.group(1)) // Stream<String> .. captured movie name
.forEach(System.out::println); // Printed out
} catch (Exception e) {
System.out.println("Exception: " + e.getClass() + ", Details: " + e.getMessage());
}
Note the following:
Regex is not suitable for this. Use a library built for this use case.
My solution is a working example, but the performance is poor (Stream API, regex matching on every line).
A solution like this does not protect you from a mess: the regex can capture more than intended.
The website content, CSS class names, etc. might change in the future.

Related

Regex for replacing Exact String match [duplicate]

My input:
1. end
2. end of the day or end of the week
3. endline
4. something
5. "something" end
Based on the above discussions, if I try to replace a single string using this snippet, it removes the matching words from each line successfully:
public class DeleteTest {
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
String delete="end";
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+delete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
}
My output If I use the above snippet:(Also my expected output)
1.
2. of the day or of the week
3. endline
4. something
5. "something"
But when I include more words to delete, and for that purpose when I use Set, I use the below code snippet:
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
Set<String> toDelete = new HashSet<>();
toDelete.add("end");
toDelete.add("something");
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+toDelete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
I get my output as: (It just removes the space)
1. end
2. endofthedayorendoftheweek
3. endline
4. something
5. "something" end
Can you help me with this?

You need to create an alternation group out of the set with
String.join("|", toDelete)
and use as
line = line.replaceAll("\\b(?:"+String.join("|", toDelete)+")\\b", "");
The pattern will look like
\b(?:end|something)\b
See the regex demo. Here, (?:...) is a non-capturing group that is used to group several alternatives without creating a memory buffer for the capture (you do not need it since you remove the matches).
Or, better, compile the regex before entering the loop:
Pattern pat = Pattern.compile("\\b(?:" + String.join("|", toDelete) + ")\\b");
...
line = pat.matcher(line).replaceAll("");
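Put together, the compiled alternation could be used like this. It is only a sketch: the class and method names are invented for the example, and it assumes the words contain no regex metacharacters.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

class WordRemover {

    // Builds \b(?:w1|w2|...)\b once so it can be reused for every line.
    static Pattern buildPattern(Set<String> words) {
        return Pattern.compile("\\b(?:" + String.join("|", words) + ")\\b");
    }

    static String removeWords(Pattern pattern, String line) {
        return pattern.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        Set<String> toDelete = new LinkedHashSet<>();
        toDelete.add("end");
        toDelete.add("something");
        Pattern pat = buildPattern(toDelete);

        String[] lines = {
            "1. end",
            "2. end of the day or end of the week",
            "3. endline",
            "4. something"
        };
        for (String line : lines) {
            System.out.println(removeWords(pat, line));
        }
    }
}
```

"endline" is left alone because there is no word boundary after "end" inside it. The spaces around a removed word stay in the line, so trim or collapse whitespace afterwards if that matters.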
UPDATE:
To allow matching whole "words" that may contain special chars, you need to Pattern.quote those words to escape those special chars, and then you need to use unambiguous word boundaries, (?<!\w) instead of the initial \b to make sure there is no word char before and (?!\w) negative lookahead instead of the final \b to make sure there is no word char after the match.
In Java 8, you may use this code:
Set<String> nToDel = toDelete.stream()
.map(Pattern::quote)
.collect(Collectors.toCollection(HashSet::new));
String pattern = "(?<!\\w)(?:" + String.join("|", nToDel) + ")(?!\\w)";
If the set contained, for example, +end and something-, the regex would look like (?<!\w)(?:\Q+end\E|\Qsomething-\E)(?!\w). Note that the symbols between \Q and \E are parsed as literal symbols.
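Here is a small self-contained sketch of the quoted variant. The words "+end" and "something-" and the class name are just example values, not something taken from the question's file.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuotedWordRemover {

    // Quotes every word so that characters like + or - are taken literally,
    // then wraps the alternation in lookarounds instead of \b.
    static Pattern buildPattern(Set<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?<!\\w)(?:" + alternation + ")(?!\\w)");
    }

    public static void main(String[] args) {
        Set<String> toDelete = new TreeSet<>();
        toDelete.add("+end");
        toDelete.add("something-");
        Pattern pat = buildPattern(toDelete);

        // Removed: "+end" stands on its own between two spaces.
        System.out.println(pat.matcher("a +end b").replaceAll(""));
        // Kept: the word character "x" directly precedes "+end".
        System.out.println(pat.matcher("x+end").replaceAll(""));
    }
}
```

Collectors.joining("|") does the same job as String.join here, so the intermediate set of quoted words is not needed.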

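The problem is that you're not creating the correct regex for replacing the words in the set.
"\\b"+toDelete+"\\b" concatenates the set's toString(), which produces the string \b[end, something]\b. The regex engine reads [end, something] as a character class (any one of those letters, a comma or a space), so the pattern only removes a single such character sitting between two word boundaries. That is why only the spaces between words disappeared.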
To fix that you can do something like this:
for(String del : toDelete){
line = line.replaceAll("\\b"+del+"\\b", "");
}
What this does is to go through the set, produce a regex from each word and remove that word from the line String.
Another approach will be to produce a single regex from all the words in the set.
Eg:
String regex = "";
for(String word : toDelete){
regex+=(regex.isEmpty() ? "" : "|") + "(\\b"+word+"\\b)";
}
....
line = line.replaceAll(regex, ""); // replaceAll, not replace: replace() treats its argument as a literal string
This should produce a regex that looks something like this: (\bend\b)|(\bsomething\b)

Java - Groovy : regex parse text block

I know that this is a common question, and I've been through a lot of forums trying to figure out what the problem in my code is.
I have to read a text file with several blocks in the following format:
import com.myCompanyExample.gui.Layout
/*some comments here*/
#Layout
LayoutModel currentState() {
MyBuilder builder = new MyBuilder()
form example
title form{
row_1
row_1
row_n
}
return build.get()
}
#Layout
LayoutModel otherState() {
....
....
return build.get()
}
I have this code to read all the file and I'd like to extract each block between the keyword "#Layout" and the keyword "return". I need also to catch all newline so later I'll be able to split each matched block into a list
private void myReadFile(File fileLayout){
String line = null;
StringBuilder allText = new StringBuilder();
try{
FileReader fileReader = new FileReader(fileLayout);
BufferedReader bufferedReader = new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
allText.append(line)
}
bufferedReader.close();
}
catch(FileNotFoundException ex) {
System.out.println("Unable to open file");
}
catch(IOException ex) {
System.out.println("Error reading file");
}
Pattern pattern = Pattern.compile("(?s)#Layout.*?return",Pattern.DOTALL);
Matcher matcher = pattern.matcher(allText);
while(matcher.find()){
String [] layoutBlock = (matcher.group()).split("\\r?\\n")
for(index in layoutBlock){
//check each line of the current block
}
}
layoutBlock returns size=1

I think this may be a so-called XY problem. Anyway, if the Groovy source is composed only of #Layout-annotated blocks of code, you can use a tempered greedy token to select everything up to the next annotation (view online demo).
Change the pattern line to this:
Pattern pattern = Pattern.compile( "#Layout(?:(?!#Layout).)*", Pattern.DOTALL );
PS: the (?s) flag inside the regex and the Pattern.DOTALL parameter do the same thing (they enable dotall mode, also called single-line mode, in which . also matches line terminators), so use only one of them, whichever you prefer.
UPDATE
I tried your code. The problem (preserving newlines) is in the way you slurp the file: bufferedReader.readLine() strips the line terminator from the end of each line it returns.
Simply re-add a newline when appending to allText:
String ln = System.lineSeparator();
while((line = bufferedReader.readLine()) != null) {
allText.append(line + ln);
}
Or you can replace all the code to slurp the file with this:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
//can throw an IOException
String filePath = "/path/to/layout.groovy";
String allText = new String(Files.readAllBytes(Paths.get(filePath)),StandardCharsets.UTF_8);
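To check that the two pieces work together (newlines kept in the text, tempered greedy token for the blocks), here is a minimal sketch on an in-memory string. The sample text is a shortened version of the file from the question, and the class and method names are my own, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LayoutBlocks {

    // Tempered greedy token: consume any character (DOTALL lets '.' cross
    // line breaks) as long as it does not start the next "#Layout" marker.
    private static final Pattern BLOCK =
            Pattern.compile("#Layout(?:(?!#Layout).)*", Pattern.DOTALL);

    // Returns one String[] of lines per "#Layout" block.
    static List<String[]> splitBlocks(String allText) {
        List<String[]> blocks = new ArrayList<>();
        Matcher matcher = BLOCK.matcher(allText);
        while (matcher.find()) {
            blocks.add(matcher.group().split("\\r?\\n"));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String allText = String.join("\n",
                "import com.myCompanyExample.gui.Layout",
                "#Layout",
                "LayoutModel currentState() {",
                "return build.get()",
                "}",
                "#Layout",
                "LayoutModel otherState() {",
                "return build.get()",
                "}");
        for (String[] block : splitBlocks(allText)) {
            System.out.println(block.length + " lines, second line: " + block[1]);
        }
    }
}
```

Each block now splits into several lines instead of one, because the line breaks are still present when split("\\r?\\n") runs.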

How do I read Windows NTFS's Alternate Data Stream using Java's IO?

I'm trying to have my Java application read all the data in a given path. So files, directories, metadata etc. This also includes one weird thing NTFS has called Alternate Data Stream (ADS).
Apparently it's like a second layer of data in a directory or file. You can open the command prompt and create a file in the ADS using ':', for example:
C:\ADSTest> echo test>:ads.txt
So,
C:\ADSTest> notepad :ads.txt
Should open a notepad that contains the string "test" in it. However, if you did:
C:\ADSTest> dir
You will not be able to see ads.txt. However, if you use the dir option that displays ADS data, you will be able to see it:
C:\ADSTest> dir /r
MM/dd/yyyy hh:mm .:ads.txt
Now, I am aware that Java IO has the capability to read ADS. How do I know that? Well, Oracle's documentation clearly states so:
If the file attributes supported by your file system implementation
aren't sufficient for your needs, you can use the
UserDefinedAttributeView to create and track your own file attributes.
Some implementations map this concept to features like NTFS
Alternative Data Streams and extended attributes on file systems such
as ext3 and ZFS.
Also, a random post on a random forum :D
The data is stored in NTFS Alternate data streams (ADS) which are
readable through Java IO (I have tested it).
The problem is, I can't find any pre-written file attribute viewer that can parse ADS, and I don't understand how to write an ADS parser of my own. I'm a beginner programmer so I feel this is way over my head. Would anybody please help me out or point me in the right direction?
Solution
EDIT: With the help of #knosrtum I was able to concoct a method that will return all the parsed ADS information from a given path as an ArrayList of Strings (it can also be easily edited to an ArrayList of files). Here's the code for anyone who might need it:
public class ADSReader {
public static ArrayList<String> start(Path toParse) {
String path = toParse.toString();
ArrayList<String> parsedADS = new ArrayList<>();
final String command = "cmd.exe /c dir " + path + " /r"; // listing of given Path.
final Pattern pattern = Pattern.compile(
"\\s*" // any amount of whitespace
+ "[0123456789,]+\\s*" // digits (with possible comma), whitespace
+ "([^:]+:" // group 1 = file name, then colon,
+ "[^:]+:" // then ADS, then colon,
+ ".+)"); // then everything else.
try {
Process process = Runtime.getRuntime().exec(command);
// Read the output before waiting: waiting first can hang if the listing fills the pipe buffer.
try (BufferedReader br = new BufferedReader(
new InputStreamReader(process.getInputStream()))) {
String line;
while ((line = br.readLine()) != null) {
Matcher matcher = pattern.matcher(line);
if (matcher.matches()) {
parsedADS.add((matcher.group(1)));
}
}
}
process.waitFor();
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
for (int z = 0; z<parsedADS.size(); z++)
System.out.println(parsedADS.get(z));
return parsedADS;
}
}

I was able to read the ADS of a file simply by opening the file with the syntax "file_name:stream_name". So if you've done this:
C:>echo Hidden text > test.txt:hidden
Then you should be able to do this:
package net.snortum.play;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
public class AdsPlay {
public static void main(String[] args) {
new AdsPlay().start();
}
private void start() {
File file = new File("test.txt:hidden");
try (BufferedReader bf = new BufferedReader( new FileReader(file))) {
String hidden = bf.readLine();
System.out.println(hidden);
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
If you want to get the ADS data from the dir /r command, I think you just need to execute a shell and capture the output:
package net.snortum.play;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class ExecPlay {
public static void main(String[] args) {
new ExecPlay().start();
}
private void start() {
String fileName = "not found";
String ads = "not found";
final String command = "cmd.exe /c dir /r"; // listing of current directory
final Pattern pattern = Pattern.compile(
"\\s*" // any amount of whitespace
+ "[0123456789,]+\\s*" // digits (with possible comma), whitespace
+ "([^:]+):" // group 1 = file name, then colon
+ "([^:]+):" // group 2 = ADS, then colon
+ ".+"); // everything else
try {
Process process = Runtime.getRuntime().exec(command);
// Read the output before waiting: waiting first can hang if the listing fills the pipe buffer.
try (BufferedReader br = new BufferedReader(
new InputStreamReader(process.getInputStream()))) {
String line;
while ((line = br.readLine()) != null) {
Matcher matcher = pattern.matcher(line);
if (matcher.matches()) {
fileName = matcher.group(1);
ads = matcher.group(2);
break;
}
}
}
process.waitFor();
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println(fileName + ", " + ads);
}
}
Now you can use the first code listing to print the ADS data.
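If you want to check the pattern without running cmd.exe, you can feed it a few listing lines by hand. The sample lines below are my assumption of what dir /r prints (a size followed by name:stream:$DATA for a stream); the exact layout may differ between Windows versions and locales, and the class name is invented for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DirListingParser {

    // Same pattern as in ExecPlay: size, file name, stream name, then the rest.
    private static final Pattern ADS_LINE = Pattern.compile(
            "\\s*[0123456789,]+\\s*([^:]+):([^:]+):.+");

    // Returns "fileName:streamName" for every line that describes a stream.
    static List<String> findStreams(List<String> lines) {
        List<String> streams = new ArrayList<>();
        for (String line : lines) {
            Matcher matcher = ADS_LINE.matcher(line);
            if (matcher.matches()) {
                streams.add(matcher.group(1) + ":" + matcher.group(2));
            }
        }
        return streams;
    }

    public static void main(String[] args) {
        // Hypothetical output lines, not captured from a real machine.
        List<String> lines = Arrays.asList(
                "01/01/2020  10:00 AM                14 test.txt",
                "                                    14 test.txt:hidden:$DATA",
                "               1 File(s)             14 bytes");
        findStreams(lines).forEach(System.out::println);
    }
}
```

Each returned string is in the "file_name:stream_name" form that the first listing passes to new File(...).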

matching between array and content of file without using regex

Is it possible to match the contents of an array against the contents of a file without using regex?
Suppose I have a text file that contains these sentences:
the sql is the best book for jon.
book sql is the best title for jon.
the html for author asr.
book java for famous writer amr.
and I have stored these strings in an array:
sql html java
jon asr amr
I want to search the file for the contents of the array. For example, if "sql" and "jon" appear in the same sentence in the text file, then print the sentence, and also print all the words before "sql" (the prefix), all the words between "sql" and "jon" (the middle), and all the words after "jon" (the suffix).
I tried to write this code:
String book[][] = {{"sql","html","java"},{"jon","asr","amr"}};
String input;
try {
BufferedReader br = new BufferedReader(new FileReader(new File("sample.txt") ));
input= br.readLine();
while ((input)!= null)
{
if((book[0][0].contains(input))&( book[1][0]).contains(input)){
System.out.println();
if((book[0][1].contains(input))&( book[1][1]).contains(input)){
System.out.println();
if((book[0][2].contains(input))&( book[1][2]).contains(input)){
System.out.println();
}
else
System.out.println("not match");
}}
}} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
I don't know how to write the code to extract the prefix, middle and suffix.
The expected output is:
the sentence is : the sql is the best book for jon.
prefix is :the
middle is:is the best book for
suffix is: null
and so on...

You should use Pattern class for that. http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Tutorial http://docs.oracle.com/javase/tutorial/essential/regex/
Sorry, I'm not going to write the exact code.
The pattern will look like
"(.*)(?:sql|html|java)(.*)(?:jon|asr|amr)(.*)"
Then, in Matcher you will find your prefix, middle and suffix as matcher.group(1), matcher.group(2) and matcher.group(3).

Here is the code you need:
String line = "the sql is the best book for jon.";
String regex = "(.*)(sql|html|java)(.*)(jon|asr|amr)(.*)";
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(line);
if (matcher.find()) { // group() throws IllegalStateException when nothing matched
String prefix = matcher.group(1);
String firstMatch = matcher.group(2);
String middle = matcher.group(3);
String secondMatch = matcher.group(4);
String suffix = matcher.group(5);
}
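Since the question asks for a way without regex, the same split can also be done with plain indexOf and substring. This is only a sketch: the class and method names are made up, it matches substrings rather than whole words (so "java" would also be found inside "javascript"), and it only looks at the first occurrence of each word.

```java
class SentenceSplitter {

    // Returns {prefix, middle, suffix}, or null when the two words do not
    // both occur in the sentence in that order.
    static String[] split(String sentence, String first, String second) {
        int firstPos = sentence.indexOf(first);
        if (firstPos < 0) {
            return null;
        }
        int afterFirst = firstPos + first.length();
        int secondPos = sentence.indexOf(second, afterFirst);
        if (secondPos < 0) {
            return null;
        }
        return new String[] {
            sentence.substring(0, firstPos).trim(),
            sentence.substring(afterFirst, secondPos).trim(),
            sentence.substring(secondPos + second.length()).trim()
        };
    }

    public static void main(String[] args) {
        String[][] book = { {"sql", "html", "java"}, {"jon", "asr", "amr"} };
        String sentence = "the sql is the best book for jon.";
        // Pair book[0][i] with book[1][i], as in the question's array.
        for (int i = 0; i < book[0].length; i++) {
            String[] parts = split(sentence, book[0][i], book[1][i]);
            if (parts != null) {
                System.out.println("the sentence is: " + sentence);
                System.out.println("prefix is: " + parts[0]);
                System.out.println("middle is: " + parts[1]);
                System.out.println("suffix is: " + parts[2]);
            }
        }
    }
}
```

For the first sentence the suffix comes out as "." rather than null, because the period after "jon" is still part of the line. Strip trailing punctuation first if you want an empty suffix.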

Extract links from a web page

Using Java, how can I extract all the links from a given web page?

Download the page as plain text/HTML and pass it through Jsoup or HtmlCleaner. Both are similar and can parse even malformed HTML 4.0 syntax. You can then use the usual HTML DOM parsing methods such as getElementsByTagName("a"), or in Jsoup, even nicer, you can simply use
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]");
// img with src ending .png
Element masthead = doc.select("div.masthead").first();
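and find all the links, then get the details of each one (calling attr("href") on the Elements collection itself returns only the first match) using
for (Element link : links) {
String linkHref = link.attr("href");
}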
Taken from http://jsoup.org/cookbook/extracting-data/selector-syntax
The selectors use a jQuery-like syntax, so if you know jQuery function chaining you will certainly love it.
EDIT: In case you want more tutorials, you can try out this one made by mkyong.
http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/

Either use a Regular Expression and the appropriate classes or use a HTML parser. Which one you want to use depends on whether you want to be able to handle the whole web or just a few specific pages of which you know the layout and which you can test against.
A simple regex which would match 99% of pages could be this:
// The HTML page as a String
String HTMLPage;
Pattern linkPattern = Pattern.compile("(<a[^>]+>.+?</a>)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
Matcher pageMatcher = linkPattern.matcher(HTMLPage);
ArrayList<String> links = new ArrayList<String>();
while(pageMatcher.find()){
links.add(pageMatcher.group());
}
// links ArrayList now contains all links in the page as a HTML tag
// i.e. <a att1="val1" ...>Text inside tag</a>
You can edit it to match more, be more standard compliant etc. but you would want a real parser in that case.
If you are only interested in the href="" and text in between you can also use this regex:
Pattern linkPattern = Pattern.compile("<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
And access the link part with .group(1) and the text part with .group(2)
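As a quick check of the href/text idea, here is a small sketch run against a hard-coded snippet. It uses a negated class, [^\"'>]+, for the href value so that group 1 stops at the closing quote. The HTML, the class name and the method name are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkRegexDemo {

    // group(1) = href value, group(2) = text between <a ...> and </a>.
    private static final Pattern LINK = Pattern.compile(
            "<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one {href, text} pair per link found.
    static List<String[]> extractLinks(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            links.add(new String[] { matcher.group(1), matcher.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"https://example.com/a\">First</a> and "
                + "<A class='x' HREF='/b'>Second\nlink</A></p>";
        for (String[] link : extractLinks(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

CASE_INSENSITIVE takes care of the upper-case tag and attribute, and DOTALL lets the link text span a line break. Relative links such as /b come back as written, so resolve them against the page URL yourself if you need absolute addresses.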

You can use the HTML Parser library to achieve this:
public static List<String> getLinksOnPage(final String url) {
final List<String> result = new LinkedList<String>();
try {
// The Parser(String) constructor can itself throw ParserException, so it belongs inside the try.
final Parser htmlParser = new Parser(url);
final NodeList tagNodeList = htmlParser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
for (int j = 0; j < tagNodeList.size(); j++) {
final LinkTag loopLink = (LinkTag) tagNodeList.elementAt(j);
final String loopLinkStr = loopLink.getLink();
result.add(loopLinkStr);
}
} catch (ParserException e) {
e.printStackTrace(); // TODO handle error
}
return result;
}

This simple example seems to work, using a regex from here
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public ArrayList<String> extractUrlsFromString(String content)
{
ArrayList<String> result = new ArrayList<String>();
String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(content);
while (m.find())
{
result.add(m.group());
}
return result;
}
and if you need it, this seems to work to get the HTML of an url as well, returning null if it can't be grabbed. It works fine with https urls as well.
import java.net.URL;
import org.apache.commons.io.IOUtils;
public String getUrlContentsAsString(String urlAsString)
{
try
{
URL url = new URL(urlAsString);
String result = IOUtils.toString(url);
return result;
}
catch (Exception e)
{
return null;
}
}

import java.io.*;
import java.net.*;
public class NameOfProgram {
public static void main(String[] args) {
URL url;
InputStream is = null;
BufferedReader br;
String line;
try {
url = new URL("http://www.stackoverflow.com");
is = url.openStream(); // throws an IOException
br = new BufferedReader(new InputStreamReader(is));
while ((line = br.readLine()) != null) {
if(line.contains("href="))
System.out.println(line.trim());
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
if (is != null) is.close();
} catch (IOException ioe) {
//exception
}
}
}
}

You would probably need to use regular expressions on the HTML link tags <a href=> and </a>
