LucidWorks: Java Regular Expressions & GNU Regular Expressions - java

I am trying to create regular expressions so that I can crawl and index certain URLs on my web site with LucidWorks.
Example URL: http://www.example.com/reviews/assassins-creed-revelations/24475/reviews/
Example URL: http://www.example.com/reviews/super-mario-3d-land/64303/reviews/
Basically, I want LucidWorks to crawl my entire site and index only URLs that end with /reviews/.
Could anyone help me construct an expression to do that, please? :)
Updated:
URL: http://www.example.com/
Include paths: //*/reviews/*
That kind of worked, but it only crawls the first page; it won't go to the next page with more reviews (1, 2, 3, etc.).
If I also add: ///reviews/.*
I get a load of pages indexed which I don't want such as http://www.example.com/?page=2

Check with this function
public boolean canAcceptURL(String url,String endsWith){
boolean canAccept = false;
String regex = "";
try{
if(endsWith.equals("")){
endsWith = "/reviews/";
}
regex = "[\\x20-\\x7E]*"+endsWith+"$"; // Check that the URL ends with endsWith. If endsWith is empty, the default "/reviews/" is used. Note that endsWith is interpreted as a regex here.
canAccept = url.matches(regex);
}catch (PatternSyntaxException pe) {
pe.printStackTrace();
}catch (Exception e) {
e.printStackTrace();
}
System.out.println("String matches : "+canAccept);
return canAccept;
}
Sample output:
calling function : canAcceptURL("http://www.example.com/reviews/super-mario-3d-land/64303/reviews/","/reviews/");
String matches : true
If you want to accept any URL that contains '/reviews/', just change the regex string to
String regex = "[\\x20-\\x7E]*/reviews/[\\x20-\\x7E]*"; // accepts printable ASCII, including spaces and special characters, on both sides of /reviews/
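A compact variant of the same check, offered as a sketch rather than a LucidWorks-specific setting. It passes the suffix through Pattern.quote so that characters such as . or ? in it are matched literally. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class ReviewUrlFilter {
    // Returns true when the URL ends with the given literal suffix.
    static boolean endsWithSuffix(String url, String suffix) {
        // Pattern.quote prevents the suffix from being read as regex syntax.
        Pattern p = Pattern.compile(".*" + Pattern.quote(suffix) + "$");
        return p.matcher(url).matches();
    }
}
```

With this, a URL such as http://www.example.com/?page=2 is rejected, because it does not end with /reviews/.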

Related

Can I use regex in properties file in JAVA?

I am trying to assert some text that is stored in properties file:
Text=\
some text\n\
to be\n\
asserted\n\
based on 1567 ratings
The number 1567 varies, so I want to replace it with a regex that matches any number, using \d (or rather \\d, since I have to escape \ in a properties file).
So I tried using
Text=\
some text\n\
to be\n\
asserted\n\
based on \\d ratings
this is the method I am using to assert:
Assert.assertEquals(PropertyLoader.loadProperty("filename.properties", "Text"),actual);
actual is a WebElement that gives me the text from the website; I use actual.getText().
This is the class that has the loadProperty method:
class PropertyLoader {
static String loadProperty(String file, String name) {
Properties props = new Properties();
try {
props.load(getResourceAsStream(file));
} catch (IOException e) {
e.printStackTrace();
}
return props.getProperty(name);
}
}
The end result is that I am getting:
Comparison Failure:
Expected:
some text
to be
asserted
based on \d ratings
Actual:
some text
to be
asserted
based on 1567 ratings
Not sure if this is even possible or I am simply missing something?
First, the content of your properties file should be:
Text=\
some text\n\
to be\n\
asserted\n\
based on \\d+ ratings
Then your test case will be (the matcher needs the element's text, not the WebElement itself):
Assert.assertTrue(
Pattern.compile(
PropertyLoader.loadProperty("filename.properties", "Text"),
Pattern.MULTILINE
).matcher(actual.getText()).matches()
);
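To see how the escaping works end to end, here is a minimal sketch that uses only the standard library. It assumes the file content is supplied as a string (FILE_CONTENT stands in for filename.properties), and the class and method names are hypothetical. Properties turns \n into a newline and \\ into a single backslash, so the loaded value contains the regex \d+.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Pattern;

final class PropertyRegexDemo {
    // Stand-in for the contents of filename.properties (an assumption for this sketch).
    // As file text this reads: Text=some text\nto be\nasserted\nbased on \\d+ ratings
    static final String FILE_CONTENT =
            "Text=some text\\nto be\\nasserted\\nbased on \\\\d+ ratings\n";

    // Loads the "Text" property and uses it as a regex against the whole input.
    static boolean matchesExpected(String actualText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(FILE_CONTENT));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String expectedRegex = props.getProperty("Text");
        return Pattern.compile(expectedRegex).matcher(actualText).matches();
    }
}
```

Keep in mind that the whole property value is now treated as a regex, so any literal characters such as ( or . in the expected text would need escaping too.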

How to detect URL to different page (also in the same domain)

I have a question about detecting URLs in a page, and I'm looking for the best way to solve it. I use Jsoup to download the page.
URI uri = new URI("http://www.niocchi.com/");
Document doc = Jsoup.connect(uri.toString()).get();
Elements links = doc.select("a");
This gives me some links, for example:
http://www.niocchi.com/#Package organization
http://www.niocchi.com/#Architecture
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/
I need only the links to different pages, without references to sections within a page.
From the example, I would like to get this:
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/
It looks like you want to select only those <a> elements whose href attribute value consists entirely of characters other than #. In that case you can use
doc.select("a[href~=^[^#]+$]")
attribute~=regex is the syntax used to check whether part of the attribute's value can be matched by the regex.
A regex accepting one or more non-# characters can look like this: [^#]+
A regex that must match the entire string (not just part of it) needs to be surrounded by the ^ and $ anchors, which represent:
^ - start of the string,
$ - end of the string.
You could convert them to strings and then split them on the # mark.
For example:
public void stringSplitter() {
String result = null;
// example
String[] stringURL = {"http://www.niocchi.com/#Package organization", "http://www.niocchi.com/#Architecture",
"http://www.linkedin.com/in/ivanprado", "http://www.niocchi.com/examples/ "};
try {
for (int i = 0; i < stringURL.length; i++) {
String [] parts = stringURL[i].split("#");
result = parts[0];
System.out.println(result);
}
}catch (Exception ex) {
ex.printStackTrace();
}
}
The output is:
http://www.niocchi.com/
http://www.niocchi.com/
http://www.linkedin.com/in/ivanprado
http://www.niocchi.com/examples/
I would also consider having the method return only unique URLs.
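One way to get unique URLs, as a sketch: cut each link at the first # and collect the results in a LinkedHashSet, which drops duplicates while keeping the original order. The class and method names are illustrative only.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class FragmentStripper {
    // Removes the "#fragment" part of each URL and returns the distinct pages in first-seen order.
    static Set<String> uniquePages(List<String> urls) {
        Set<String> pages = new LinkedHashSet<>();
        for (String url : urls) {
            int hash = url.indexOf('#');
            pages.add(hash == -1 ? url : url.substring(0, hash));
        }
        return pages;
    }
}
```

Note that this keeps http://www.niocchi.com/ itself in the result; filter out the base page separately if it should not appear.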

Dynamically replace part in URL using Regex

I tried searching for something similar, and couldn't find anything. I'm having difficulty trying to replace a few characters after a specific part in a URL.
Here is the URL: https://scontent-b.xx.fbcdn.net/hphotos-xpf1/v/t1.0-9/s130x130/10390064_10152552351881633_355852593677844144_n.jpg?oh=479fa99a88adea07f6660e1c23724e42&oe=5519DE4B
I want to remove the /v/ part, leave the t1.0-9, and also remove the /s130x130/. I cannot just replace s130x130, because those values may differ. How do I go about doing that?
I have a previous URL where I am using this code:
if (pictureUri.indexOf("&url=") != -1)
{
String replacement = "";
String url = pictureUri.replaceAll("&", "/");
String result = url.replaceAll("().*?(/url=)",
"$1" + replacement + "$2");
String pictureUrl = null;
if (result.startsWith("/url="))
{
pictureUrl = result.replace("/url=", "");
}
}
Can I do something similar with the above URL?
With the regex
/v/|/s\d+x\d+/
replaced with
/
It turns the string from
https://scontent-b.xx.fbcdn.net/hphotos-xpf1/v/t1.0-9/s130x130/10390064_10152552351881633_355852593677844144_n.jpg?oh=479fa99a88adea07f6660e1c23724e42&oe=5519DE4B
to
https://scontent-b.xx.fbcdn.net/hphotos-xpf1/t1.0-9/10390064_10152552351881633_355852593677844144_n.jpg?oh=479fa99a88adea07f6660e1c23724e42&oe=5519DE4B
Is this what you're trying to do?
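In Java the same replacement can be written with String.replaceAll, with the backslashes doubled inside the string literal. This is a minimal sketch; the class and method names are made up for illustration.

```java
final class PictureUrlCleaner {
    // Collapses "/v/" and any "/s<width>x<height>/" segment into a single "/".
    static String clean(String url) {
        return url.replaceAll("/v/|/s\\d+x\\d+/", "/");
    }
}
```

Because the size segment is matched with \d+x\d+, it works for dimensions other than 130x130 as well.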

If a string contains a letter, return the entire String

Weird one but:
Let's say you have a huge HTML page, and if the page contains an email address (looking for an @ sign) you want to return that email.
So far I know I need something like this:
String email;
if (myString.contains("@")) {
email = myString.substring("@")
}
I know how to get to the @, but how do I go back in the string to find what's before it, etc.?
If myString is the email string you received from the HTML page, then
you can return the same string if it contains @, right? Something like below:
String email;
if (myString.contains("@")) {
email = myString;
}
What's the challenge here? Can you explain what the difficulty is?
This method will give you a list of all the email addresses contained in a string.
static ArrayList<String> getEmailAdresses(String str) {
ArrayList<String> result = new ArrayList<>();
Matcher m = Pattern.compile("\\S+?@[^. ]+(\\.[^. ]+)*").matcher(str.replaceAll("\\s", " "));
while(m.find()) {
result.add(m.group());
}
return result;
}
String email = null;
// Locate the @
int atLocation = myString.indexOf("@");
if (atLocation != -1) {
// Get the string before the @
String start = myString.substring(0, atLocation);
// Keep only what follows the last space (lastIndexOf returns -1 if there is none, so +1 gives 0)
start = start.substring(start.lastIndexOf(" ") + 1);
// Get the string after the @
String end = myString.substring(atLocation + 1);
// Cut at the first space after the @, if there is one
int space = end.indexOf(" ");
if (space != -1) {
end = end.substring(0, space);
}
// Stick it all together
email = start + "@" + end;
}
This may be a little off as I've been writing javascript all day. :)
Rather than exact code, I would like to give you an approach.
Checking just for the @ symbol might not be appropriate, as it can appear in other contexts as well.
Search the internet for, or write your own, regex pattern that matches an email address.
(If you want, you can add a check for email providers as well.) Here is a link: http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
Find the pattern in the string using the regex and extract the matching substring (the email in your case).
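The approach above could look roughly like this. The pattern is a deliberately simple assumption (local part, @, domain, and a top-level domain of at least two letters), not a full RFC-compliant validator, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailExtractor {
    // Simplified email pattern; real-world addresses can be more varied than this.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns every substring of the text that looks like an email address.
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Matcher.start() and Matcher.end() give the indices of each match if the position in the page is needed as well.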

What's the best way to check if a String contains a URL in Java/Android?

What's the best way to check if a String contains a URL in Java/Android? Would the best way be to check if the string contains |.com | .net | .org | .info | .everythingelse|? Or is there a better way to do it?
The url is entered into a EditText in Android, it could be a pasted url or it could be a manually entered url where the user doesn't feel like typing in http://... I'm working on a URL shortening app.
The best way would be to use a regular expression, something like below:
public static final String URL_REGEX = "^((https?|ftp)://|(www|ftp)\\.)?[a-z0-9-]+(\\.[a-z0-9-]+)+([/?].*)?$";
Pattern p = Pattern.compile(URL_REGEX);
Matcher m = p.matcher("example.com");//replace with string to compare
if(m.find()) {
System.out.println("String contains URL");
}
This is simply done with a try catch around the constructor (this is necessary either way).
String inputUrl = getInput();
if (!inputUrl.contains("http://"))
inputUrl = "http://" + inputUrl;
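URL url = null; // must be initialized, otherwise the null check below does not compile
try {
url = new URL(inputUrl);
} catch (MalformedURLException e) {
Log.v("myApp", "bad url entered");
}
if (url == null)
userEnteredBadUrl();
else
continueWithUrl(url); // placeholder name; "continue" is a reserved word in Java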
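A self-contained variant of this idea, as a sketch under a few assumptions: it parses with java.net.URI before converting to a URL, which is stricter than calling the URL constructor directly (for example, it rejects spaces), and it prepends http:// only when no scheme is present. The class and method names are invented for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

final class UrlCheck {
    // Returns true if the input (with "http://" prepended when it has no scheme) parses as a URL.
    static boolean isParsableUrl(String input) {
        String candidate = input.contains("://") ? input : "http://" + input;
        try {
            new URI(candidate).toURL();
            return true;
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            // Bad syntax, unknown protocol, or a non-absolute URI.
            return false;
        }
    }
}
```

This only checks that the text parses; it does not check that the host exists or is reachable.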
After looking around I tried to improve Zaid's answer by removing the try-catch block. Also, this solution recognizes more patterns as it uses a regex.
So, firstly get this pattern:
// Pattern for recognizing a URL, based off RFC 3986
private static final Pattern urlPattern = Pattern.compile(
"(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
+ "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
+ "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
Then, use this method (supposing str is your string):
// separate input by spaces ( URLs don't have spaces )
String [] parts = str.split("\\s+");
// get every part
for( String item : parts ) {
if(urlPattern.matcher(item).matches()) {
//it's a good url
System.out.print(""+ item + " " );
} else {
// it isn't a url
System.out.print(item + " ");
}
}
Based on Enkk's answer, I present my solution:
public static boolean containsLink(String input) {
boolean result = false;
String[] parts = input.split("\\s+");
for (String item : parts) {
if (android.util.Patterns.WEB_URL.matcher(item).matches()) {
result = true;
break;
}
}
return result;
}
Old question, but found this, so I thought it might be useful to share. Should help for Android...
I would first use java.util.Scanner to find candidate URLs in the user input using a very dumb pattern that will yield false positives, but no false negatives. Then, use something like the answer @ZedScio provided to filter them down. For example,
Pattern p = Pattern.compile("[^.]+[.][^.]+");
Scanner scanner = new Scanner("Hey Dave, I found this great site called blah.com you should visit it");
while (scanner.hasNext()) {
if (scanner.hasNext(p)) {
String possibleUrl = scanner.next(p);
if (!possibleUrl.contains("://")) {
possibleUrl = "http://" + possibleUrl;
}
try {
URL url = new URL(possibleUrl);
doSomethingWith(url);
} catch (MalformedURLException e) {
continue;
}
} else {
scanner.next();
}
}
If you don't want to experiment with regular expressions and would rather use a tested method, you can use the Apache Commons Validator library to check whether a given string is a URL/hyperlink. Below is an example.
Please note: this example detects whether a given text as a whole is a URL. For text that may contain a mix of regular text and URLs, you would need an additional step: split the string on spaces, loop through the array, and validate each item.
Gradle dependency:
implementation 'commons-validator:commons-validator:1.6'
Code:
import org.apache.commons.validator.routines.UrlValidator;
// Using the default constructor of UrlValidator class
public boolean URLValidator(String s) {
UrlValidator urlValidator = new UrlValidator();
return urlValidator.isValid(s);
}
// Passing a scheme set to the constructor
public boolean URLValidator(String s) {
String[] schemes = {"http","https"}; // add 'ftp' if you need it
UrlValidator urlValidator = new UrlValidator(schemes);
return urlValidator.isValid(s);
}
// Passing a Scheme set and set of Options to the constructor
public boolean URLValidator(String s) {
String[] schemes = {"http","https"}; // add 'ftp' if you need it. Providing no schemes will validate for http, https and ftp
long options = UrlValidator.ALLOW_ALL_SCHEMES + UrlValidator.ALLOW_2_SLASHES + UrlValidator.NO_FRAGMENTS;
UrlValidator urlValidator = new UrlValidator(schemes, options);
return urlValidator.isValid(s);
}
// Possible Options are:
// ALLOW_ALL_SCHEMES
// ALLOW_2_SLASHES
// NO_FRAGMENTS
// ALLOW_LOCAL_URLS
To use multiple options, just add them with the '+' operator.
If you need to exclude project-level or transitive dependencies in Gradle while using the Apache Commons library, you may want to do the following (remove whatever is not required from the list):
implementation('commons-validator:commons-validator:1.6') {
exclude group: 'commons-logging'
exclude group: 'commons-collections'
exclude group: 'commons-digester'
exclude group: 'commons-beanutils'
}
For more information, the link may provide some details.
http://commons.apache.org/proper/commons-validator/dependencies.html
You can use URLUtil.isNetworkUrl(url) or URLUtil.isValidUrl(url) from android.webkit.URLUtil.
public boolean isURL(String text) {
return text.length() > 3 && text.contains(".")
&& text.toCharArray()[text.length() - 1] != '.' && text.toCharArray()[text.length() - 2] != '.'
&& !text.contains(" ") && !text.contains("\n");
}
This function works for me:
private boolean containsURL(String content){
String REGEX = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
Pattern p = Pattern.compile(REGEX,Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(content);
return m.find();
}
Call this function
boolean isContain = containsURL("Pass your string here...");
Log.d("Result", String.valueOf(isContain));
NOTE: I have tested this only with a string containing a single URL.
The best way is to set the autoLink property on your TextView; Android will then recognize links anywhere inside the string, change their appearance, and make them clickable.
android:autoLink="web"
