Extract some contents from the url using regular expressions in java

I want to extract some content from the page at http://www.xyz.com/default.aspx using a regular expression. This is my code so far; I do not know which regular expression to put in expr:
String expr = ""; // What regular expression should I use here?
Pattern patt = Pattern.compile(expr, Pattern.DOTALL | Pattern.UNIX_LINES);
URL url4 = null;
try {
url4 = new URL("http://www.xyz.com/default.aspx");
} catch (MalformedURLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Text" +url4);
Matcher m = null;
try {
m = patt.matcher(getURLContent(url4));
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Match" +m);
while (m.find()) {
String stateURL = m.group(1);
System.out.println("Some Data" +stateURL);
}
public static CharSequence getURLContent(URL url8) throws IOException {
URLConnection conn = url8.openConnection();
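// getContentEncoding() reports the Content-Encoding header (e.g. "gzip"), not the
// character set, so it must not be passed to InputStreamReader. Use a fixed charset
// here; the real one, if any, is named in the Content-Type header.
String encoding = "ISO-8859-1";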
BufferedReader br = new BufferedReader(new
InputStreamReader(conn.getInputStream(), encoding));
StringBuilder sb = new StringBuilder(16384);
try {
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
System.out.println(line);
sb.append('\n');
}
} finally {
br.close();
}
return sb;
}

As @bkent314 has mentioned, jsoup is a better and cleaner approach than regular expressions.
If you inspect the source code of that website, the content you want sits in this snippet:
<div class="smallHd_contentTd">
<div class="breadcrumb">...</div>
<h2>Services</h2>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
With jsoup, your code may look something like this:
Document doc = Jsoup.connect("http://www.ferotech.com/Services/default.aspx").get();
Element content = doc.select("div.smallHd_contentTd").first();
String header = content.select("h2").first().text();
System.out.println(header);
for (Element pTag : content.select("p")) {
System.out.println(pTag.text());
}
Hope this helps.
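If a regular expression is still required, a minimal sketch of the kind of pattern the question asks for might look like the following. It assumes the markup matches the snippet above (an h2 followed by p elements); RegexExtractSketch and the sample HTML are illustrative names and data, not part of the original code. Patterns like these break easily when the markup changes, which is why jsoup remains the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answer.
class RegexExtractSketch {

    // Captures the inner text of an <h2> element; DOTALL lets '.' span line breaks.
    private static final Pattern HEADER =
            Pattern.compile("<h2(?:\\s[^>]*)?>(.*?)</h2>", Pattern.DOTALL);

    // Captures the inner text of each <p> element (but not <pre>, <param>, ...).
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p(?:\\s[^>]*)?>(.*?)</p>", Pattern.DOTALL);

    // Returns the text of the first <h2>, or null if there is none.
    static String header(CharSequence html) {
        Matcher m = HEADER.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Returns the text of every <p> element, in document order.
    static List<String> paragraphs(CharSequence html) {
        List<String> result = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) { // find() must succeed before group() is read
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div class=\"smallHd_contentTd\">\n"
                + "<h2>Services</h2>\n"
                + "<p>First paragraph</p>\n"
                + "<p>Second paragraph</p>\n"
                + "</div>";
        System.out.println(header(html));
        System.out.println(paragraphs(html));
    }
}
```

The CharSequence returned by getURLContent in the question could be passed to both methods directly.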

Related

Parsing HTML page: difference in page content between Java code and browser

URL: https://www.bing.com/search?q=vevo+USIV30300367
If I View source of the above URL (in Internet Explorer 11 for that matter), the sub-string pertaining to the first search result is:
"[h2][a href="https://www.vevo.com/watch/rush/tom-sawyer-(live-exit-stage-left-version)/USIV30300367" h="ID=SERP,5075.1"]Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - Rush - [strong]Vevo[/strong][/a][/h2]"
Whereas via Java code, I get this:
"[h2][a href="https://www.vevo.com/watch/rush/tom-sawyer-(live-exit-stage-left-version)/USIV30300367" h="ID=SERP,5077.1"][span dir="ltr"]Tom [strong]Sawyer (Live Exit Stage Left Version[/strong]) - …[/span][/a][/h2]"
The formatting is a bit different (check the [span] tags), but even worse, the video title has been truncated in the search result string (i.e. "Rush - Vevo" became "...").
Why is that? How can I fix it?
(NOTE: I am using "[" and "]" in this post as replacements for the original HTML tagging delimiters to avoid my strings being formatted here on SO.)
Below is my Java code:
private String getWebPage(String pageURL, UserAgentBrowser uab)
{
URL url = null;
InputStream is = null;
BufferedReader br = null;
URLConnection conn = null;
StringBuilder pagedata = new StringBuilder();
String contenttype = null, charset = "utf-8";
String line = null;
try {
url = new URL(pageURL);
conn = url.openConnection();
conn.addRequestProperty("User-Agent", uab.toString());
contenttype = conn.getContentType();
int indexL = contenttype.indexOf("charset=") + 8;
if (indexL > 7) {
int indexR = contenttype.indexOf(";", indexL);
charset = (indexR == -1 ? contenttype.substring(indexL): contenttype.substring(indexL, indexR));
}
is = conn.getInputStream(); // Could throw an IOException
br = new BufferedReader(new InputStreamReader(is, charset));
while (true) {
line = br.readLine();
if (line == null) break;
pagedata.append(line);
}
} catch (MalformedURLException mue) {
// mue.printStackTrace();
} catch (IOException ioe) {
// ioe.printStackTrace();
} finally {
try {
if (is != null) is.close();
} catch (IOException ioe) {
// Nothing to see here
}
}
return (pagedata.length() == 0 ? null : pagedata.toString());
}
And
String pagedata = getWebPage("https://www.bing.com/search?q=vevo+USIV30300367", UserAgentBrowser.INTERNET_EXPLORER);
Where UserAgentBrowser.INTERNET_EXPLORER.toString() equals:
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
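This does not explain the truncated title, but one fragile spot in getWebPage is the charset parsing: getContentType() may return null, in which case contenttype.indexOf throws a NullPointerException that the catch blocks do not handle. A minimal sketch of a null-safe helper follows; CharsetSketch and charsetOf are hypothetical names, and falling back to UTF-8 is an assumption carried over from the default in the code above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original question.
class CharsetSketch {

    // Returns the charset named in a Content-Type header value,
    // or UTF-8 when the header is missing or names no usable charset.
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" parameter.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the default.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf(null));
    }
}
```

InputStreamReader also has a constructor that accepts a Charset, so the result could be passed as new InputStreamReader(is, charsetOf(conn.getContentType())).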

How to read another remote url android if the first url is not possible

I have this code, and it works well for reading a remote file, but I would like to know how to read a second URL if the first one fails.
That is, if the first URL can be read, continue as normal.
If the first URL cannot be read, fall back to the second URL.
How can I add a second "backup" URL?
Thanks.
// Code
try {
// Create a URL for the desired page
URL url = new URL("http://myurl.com/archive.txt");
// Read all the text returned by the server
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
network1 = in.readLine();
network2 = in.readLine();
network3 = in.readLine();
network4 = in.readLine();
in.close();
} catch (MalformedURLException e) {
} catch (IOException e) {
}
// Code
}

Use something like this:
String[] readUrl(String urlStr) throws IOException {
    URL url = new URL(urlStr);
    // try-with-resources closes the reader even if a read fails
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String[] result = new String[4];
        for (int i = 0; i < 4; i++) {
            result[i] = in.readLine();
        }
        return result;
    }
}
String[] tryMultipleUrls(String url1, String url2) throws IOException {
    try {
        return readUrl(url1);
    } catch (IOException ex) {
        // The first URL could not be read, so fall back to the backup URL.
        // If this one fails too, the exception reaches the caller.
        return readUrl(url2);
    }
}

I am using the epublib and I am trying to get the entire chapter of a book at a time

I am trying to get one chapter of a book at a time. I am using Paul Siegmann's epublib library. I am able to get all the text from the book, but I am not sure how to split it into chapters or where to go from there.
// find InputStream for book
InputStream epubInputStream = assetManager
.open("the_planet_mappers.epub");
// Load Book from inputStream
mThePlanetMappersBookEpubLib = (new EpubReader()).readEpub(epubInputStream);
Spine spine = new Spine(mThePlanetMappersBookEpubLib.getTableOfContents());
for (SpineReference bookSection : spine.getSpineReferences()) {
Resource res = bookSection.getResource();
try {
InputStream is = res.getInputStream();
BufferedReader r = new BufferedReader(new InputStreamReader(is));
String line;
while ((line = r.readLine()) != null) {
line = Html.fromHtml(line).toString();
Log.i("Read it ", line);
mEntireBook.append(line);
}
} catch (IOException e) {
}

I don't know if you're still looking for an answer, but...
I'm working on this too right now. This is the code I use to retrieve the content of the entire EPUB file:
public ArrayList<String> getBookContent(Book bi) {
// GET THE CONTENTS OF ALL PAGES
StringBuilder string = new StringBuilder();
ArrayList<String> listOfPages = new ArrayList<>();
Resource res;
InputStream is;
BufferedReader reader;
String line;
Spine spine = bi.getSpine();
for (int i = 0; spine.size() > i; i++) {
res = spine.getResource(i);
try {
is = res.getInputStream();
reader = new BufferedReader(new InputStreamReader(is));
while ((line = reader.readLine()) != null) {
// FIRST PAGE LINE -> <?xml version="1.0" encoding="utf-8" standalone="no"?>
if (line.contains("<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>")) {
string.delete(0, string.length());
}
// ADD THAT LINE TO THE FINAL STRING REMOVING ALL THE HTML
string.append(Html.fromHtml(formatLine(line)));
// LAST PAGE LINE -> </html>
if (line.contains("</html>")) {
listOfPages.add(string.toString());
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
return listOfPages;
}
private String formatLine(String line) {
if (line.contains("http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd")) {
line = line.substring(line.indexOf(">") + 1, line.length());
}
// REMOVE STYLES AND COMMENTS IN HTML
if ((line.contains("{") && line.contains("}"))
|| ((line.contains("/*")) && line.contains("*/"))
|| (line.contains("<!--") && line.contains("-->"))) {
line = line.substring(line.length());
}
return line;
}
As you may have noticed, I still need to improve the filter, but I have every chapter of the book in my ArrayList. Now I just need to read from that ArrayList, for example myList.get(0), and it is done.
To display the text properly, I'm using the bluejamesbond:textjustify library (https://github.com/bluejamesbond/TextJustify-Android).
It is easy to use and powerful.
I hope it helps you, and if anybody finds a better way to filter that HTML, please let me know.
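One possible alternative to filtering line by line is to read each chapter into a single string and strip the markup in one pass. The sketch below is a crude, regex-based illustration using only the standard library; HtmlFilterSketch and toPlainText are hypothetical names, it does not decode entities such as &amp;amp;, and a real HTML parser would be more robust.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class HtmlFilterSketch {

    // Whole <style> or <script> elements, including their content.
    private static final Pattern STYLE_OR_SCRIPT = Pattern.compile(
            "<(style|script)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // HTML comments, possibly spanning several lines.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Any remaining tag, XML declaration, or DOCTYPE.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Reduces one chapter's XHTML to whitespace-normalized plain text.
    static String toPlainText(String xhtml) {
        String s = STYLE_OR_SCRIPT.matcher(xhtml).replaceAll("");
        s = COMMENT.matcher(s).replaceAll("");
        s = TAG.matcher(s).replaceAll(" ");
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String chapter = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
                + "<html><head><style>p { margin: 0; }</style></head>\n"
                + "<body><!-- note --><p>Hello</p>\n<p>world</p></body></html>";
        System.out.println(toPlainText(chapter));
    }
}
```

Because the whole chapter is processed at once, styles and comments that span several lines are removed too, which the per-line checks in formatLine cannot do.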

Regex extract string, why my pattern don't works?

I have a long string in this format (a long single line in file):
"1":"Aname","2":"AnotherName","3":"Sempronio"
I want to extract the number and the name and save them on a Map.
I tried this:
FileReader fileReader = null;
BufferedReader br = null;
File file = new File("./SingleLineFileNames.txt");
try {
fileReader = new FileReader(file);
br = new BufferedReader(fileReader);
String line;
Pattern p = Pattern.compile("\"(\\d+)\":\"([\\w-.' ]+)\"");
Matcher matcher;
while((line = br.readLine()) != null) {
matcher = p.matcher(line);
String name;
int i = 1;
while((name = matcher.group(i)) != null){
// save in map
i++;
}
}
}
catch (FileNotFoundException e) {
e.printStackTrace();
}
catch (IOException e) {
e.printStackTrace();
}
finally {
try {
br.close();
fileReader.close();
}
catch (IOException e) {
e.printStackTrace();
}
}
return null;
The result is java.lang.IllegalStateException: No match found.
Is this the right way to iterate over groups?
Where am I going wrong?

First split the String at , (String#split) and then split each resulting array element at : to get the key and the value. With input strings like these, I wonder what kind of masochism drives developers to use regex sledgehammers to crack such simple nuts.
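The split-based approach could be sketched as follows. It assumes that neither the numbers nor the names contain a comma or a colon, which holds for the sample line in the question but is not guaranteed in general; SplitSketch and parse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original answer.
class SplitSketch {

    // Parses a line such as "1":"Aname","2":"AnotherName" into number -> name.
    static Map<String, String> parse(String line) {
        Map<String, String> names = new LinkedHashMap<>();
        for (String pair : line.split(",")) {
            String[] keyValue = pair.split(":", 2); // at most one split per pair
            if (keyValue.length == 2) {
                names.put(unquote(keyValue[0]), unquote(keyValue[1]));
            }
        }
        return names;
    }

    // Removes surrounding whitespace and the double quotes.
    private static String unquote(String s) {
        return s.trim().replace("\"", "");
    }

    public static void main(String[] args) {
        System.out.println(parse("\"1\":\"Aname\",\"2\":\"AnotherName\",\"3\":\"Sempronio\""));
    }
}
```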

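If you use a hyphen inside [], always place it first or last so that it is read as a literal character and not as a range operator:
Pattern p = Pattern.compile("\"(\\d+)\":\"([-\\w.' ]+)\"");
// the hyphen is now the first character inside [ ]
Also, the way you are calling group() is not correct. A Matcher only holds a result after find() (or matches()) has returned true, and the group index selects a capturing group within the current match, not the next match. Use this instead: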
while(matcher.find()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
}
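Putting the pattern and the find() loop together, a minimal sketch that fills the Map the question asks for might look like this. NameMapSketch and parse are hypothetical names, and using Integer keys is an assumption about what the asker wants to store.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class NameMapSketch {

    // Group 1 is the number, group 2 is the name.
    private static final Pattern ENTRY =
            Pattern.compile("\"(\\d+)\":\"([-\\w.' ]+)\"");

    static Map<Integer, String> parse(String line) {
        Map<Integer, String> names = new LinkedHashMap<>();
        Matcher matcher = ENTRY.matcher(line);
        // Each successful find() moves to the next "number":"name" pair.
        while (matcher.find()) {
            names.put(Integer.parseInt(matcher.group(1)), matcher.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(parse("\"1\":\"Aname\",\"2\":\"AnotherName\",\"3\":\"Sempronio\""));
    }
}
```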

Remove the broken square bracket construct ([\\w-.' ]+). If the names contain only word characters, (\\w+) is enough there.

Regular expressions in Java, can't search all HTML

I'm working with Java regular expressions on the Android platform.
I'm trying to search an HTML page using a regular expression I defined.
Here's my code:
public void mainaaForWWW(String websiteSource){
try {
websiteSource = readDataFromWWW(websiteSource);
} catch (IOException e1) {
e1.printStackTrace();
}
ArrayList<String> cinemaArray = new ArrayList<String>();
Pattern sample = Pattern.compile("<div class=\"theatre\">");
Matcher secuence = sample.matcher(websiteSource);
try {
while (secuence.find()) {
cinemaArray.add(secuence.group());
}
} catch (Exception e) {
e.printStackTrace();
}
titleTableForWWW = new String[cinemaArray.size()];
for(int i = 0; i < titleTableForWWW.length; i++)
titleTableForWWW[i] = cinemaArray.get(i);
}
The problem is quite strange: when I debug the code, the String websiteSource looks fine (the whole HTML document is loaded), but the while loop runs only 4 times, even though I found 11 matches in the HTML document by hand. The regex is simplified here just to find out what is going on. Any ideas?
OK, my bad. I found a solution.
Here is the code responsible for reading the HTML source into a String:
public String readDataFromWWW(String UrlAdress) throws IOException
{
String line = null;
URL url = new URL(UrlAdress);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), "ISO-8859-2"));
while (rd.readLine() != null) {
line += rd.readLine();
}
System.out.println(line);
return line;
I suspected that reading into the String that way messed something up, so I replaced the method with this one:
public String readDataFromWWW(String UrlAdress) throws IOException
{
String wyraz = "";
try {
String webPage = UrlAdress;
URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is, "ISO-8859-2");
int numCharsRead;
char[] charArray = new char[1024];
StringBuffer sb = new StringBuffer();
while ((numCharsRead = isr.read(charArray)) > 0) {
sb.append(charArray, 0, numCharsRead);
}
wyraz = sb.toString();
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return wyraz;
}
And everything works FINE! Thanks a lot for the clues and help. I think the problem was connected with newlines while building the String, but I'm not quite sure.
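A likely cause, which the post does not confirm: the first version of readDataFromWWW calls readLine() twice per iteration, once in the loop condition and once in the loop body. The line read by the condition is thrown away, so only every second line reaches the String, and because line starts as null the result also begins with the text "null". That would explain why only some of the matches were found. A minimal sketch of a line-based loop with a single readLine() per iteration follows; ReadAllLinesSketch and the StringReader input are illustrative stand-ins for the HTTP stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of the original post.
class ReadAllLinesSketch {

    // Reads every line from the source, calling readLine() once per iteration.
    static String readAll(Reader source) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader rd = new BufferedReader(source)) {
            String line;
            while ((line = rd.readLine()) != null) {
                sb.append(line).append('\n'); // keep the line break readLine() strips
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String page = "<div class=\"theatre\">\n<div class=\"theatre\">\n<div class=\"theatre\">";
        System.out.print(readAll(new StringReader(page)));
    }
}
```

With the HTTP connection from the post, the Reader would be new InputStreamReader(conn.getInputStream(), "ISO-8859-2").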
