Java Regex: How to detect a URL with a file extension - java

How can I create a regex to detect whether a "String url" contains a file extension (.pdf, .jpeg, .asp, .cfm...)?
Valid (without extensions):
http://www.yahoo.com
http://dbpedia.org/ontology/
http://www.rdf.com.br
Invalid (with extensions):
http://www.thesis.com/paper.pdf
http://pics.co.uk/mypic.png
http://jpeg.com/images/cool/the_image.JPEG
Thanks,
Celso

In Java, you are better off using String.endsWith(). This is faster and easier to read.
Example:
"file.jpg".endsWith(".jpg") == true

Alternative version without a regexp, using the URI class instead:
import java.net.*;
class IsFile {
public static void main( String ... args ) throws Exception {
URI u = new URI( args[0] );
for( String ext : new String[] {".png", ".pdf", ".jpg", ".html" } ) {
if( u.getPath().endsWith( ext ) ) {
System.out.println("Yeap");
break;
}
}
}
}
Works with:
java IsFile "http://download.oracle.com/javase/6/docs/api/java/net/URI.html#getPath()"
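One gap in the endsWith() approach is that it is case sensitive, so the question's .JPEG example would slip through. A minimal sketch of a case-insensitive variant follows. The class name, method name and extension list are my own assumptions; only URI.getPath() and endsWith() come from the answer above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Locale;

class UrlExtensionCheck {
    // Hypothetical list; extend it with whatever extensions matter to you.
    private static final List<String> EXTENSIONS =
            List.of(".pdf", ".png", ".jpg", ".jpeg", ".asp", ".cfm");

    static boolean hasKnownExtension(String url) {
        try {
            // getPath() drops the scheme, host, query and fragment
            String path = new URI(url).getPath();
            if (path == null) {
                return false; // opaque URIs such as mailto: have no path
            }
            String lower = path.toLowerCase(Locale.ROOT);
            return EXTENSIONS.stream().anyMatch(lower::endsWith);
        } catch (URISyntaxException e) {
            return false; // unparsable input is treated as "no extension"
        }
    }

    public static void main(String[] args) {
        System.out.println(hasKnownExtension("http://www.yahoo.com"));
        System.out.println(hasKnownExtension("http://jpeg.com/images/cool/the_image.JPEG"));
    }
}
```

Because only the path is inspected, a query string such as ?p=a.htm does not count as an extension.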

How about this?
// assuming the file extension is either 3 or 4 characters long
public boolean hasFileExtension(String s) {
return s.matches("^[\\w\\d\\:\\/\\.]+\\.\\w{3,4}(\\?[\\w\\W]*)?$");
}
@Test
public void testHasFileExtension() {
assertTrue("3-character extension", hasFileExtension("http://www.yahoo.com/a.pdf"));
assertTrue("3-character extension", hasFileExtension("http://www.yahoo.com/a.htm"));
assertTrue("4-character extension", hasFileExtension("http://www.yahoo.com/a.html"));
assertTrue("3-character extension with param", hasFileExtension("http://www.yahoo.com/a.pdf?p=1"));
assertTrue("4-character extension with param", hasFileExtension("http://www.yahoo.com/a.html?p=1&p=2"));
assertFalse("2-character extension", hasFileExtension("http://www.yahoo.com/a.co"));
assertFalse("2-character extension with param", hasFileExtension("http://www.yahoo.com/a.co?p=1&p=2"));
assertFalse("no extension", hasFileExtension("http://www.yahoo.com/hello"));
assertFalse("no extension with param", hasFileExtension("http://www.yahoo.com/hello?p=1&p=2"));
assertFalse("no extension with param ends with .htm", hasFileExtension("http://www.yahoo.com/hello?p=1&p=a.htm"));
}

Not a Java developer anymore, but you could define what you're looking for with the following regex
"/\.(pdf|jpe{0,1}g|asp|docx{0,1}|xlsx{0,1}|cfm)$/i"
Not certain what the function would look like.
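For what it's worth, the slashes and the trailing i in that pattern are delimiter and flag syntax from other languages. A rough Java equivalent might look like the sketch below, assuming the same extension list; the class and method names are made up for illustration. It uses find() rather than matches() because the pattern is anchored only at the end.

```java
import java.util.regex.Pattern;

class ExtensionRegex {
    // Same alternatives as the pattern above, with e{0,1} written as e?.
    private static final Pattern EXTENSION = Pattern.compile(
            "\\.(pdf|jpe?g|asp|docx?|xlsx?|cfm)$", Pattern.CASE_INSENSITIVE);

    static boolean hasListedExtension(String url) {
        // find() looks for the pattern anywhere; the $ pins it to the end
        return EXTENSION.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(hasListedExtension("http://www.thesis.com/paper.pdf"));
        System.out.println(hasListedExtension("http://www.yahoo.com"));
    }
}
```

Note that a URL with a query string after the file name would not match, since the extension is then no longer at the end of the input.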

If the following code returns true, then the URL ends with a file extension:
urlString.matches("\\p{Graph}+\\.\\p{Alpha}{2,4}$");
This assumes that a file extension is a dot followed by 2, 3 or 4 alphabetic characters.
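One caveat worth checking: applied to the whole URL, this pattern also accepts a bare host such as http://www.yahoo.com, because .com looks like a three-letter extension. A possible workaround, sketched here under the assumption that the input parses as a java.net.URI, is to apply the same pattern to the path only. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PathExtensionCheck {
    // Same pattern as above; the trailing $ is dropped because matches() already spans the whole input
    private static final String EXTENSION_PATTERN = "\\p{Graph}+\\.\\p{Alpha}{2,4}";

    static boolean pathHasExtension(String url) {
        try {
            String path = new URI(url).getPath(); // empty for "http://www.yahoo.com"
            return path != null && path.matches(EXTENSION_PATTERN);
        } catch (URISyntaxException e) {
            return false; // unparsable input is treated as "no extension"
        }
    }

    public static void main(String[] args) {
        // Whole-URL check: the host's ".com" is mistaken for an extension
        System.out.println("http://www.yahoo.com".matches(EXTENSION_PATTERN));
        // Path-only check on the same URL
        System.out.println(pathHasExtension("http://www.yahoo.com"));
    }
}
```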

Related

How to Make this Regex Greedy?

I'm trying to extract the domain + subdomain from any URL (without the full URL suffix or http and www prefix).
I have the following list of inputs and the output I expect for each:
p.io -> p.io
amazon.com -> amazon.com
d.amazon.ca -> d.amazon.ca
domain.amazon.co.uk -> domain.amazon.co.uk
https://regex101.com/ -> regex101.com
www.regex101.comdddd -> regex101.com
www.wix.com.co -> wix.com.co
https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions -> stackoverflow.com
smile.amazon.com -> smile.amazon.com
I'm using the following regex to extract domain + subdomain:
[^w.\:\/]+[a-zA-Z\.]?\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?(\.[a-zA-Z]{0,3})?|[w]{1,2}[^w.]+\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?
The issue is that it splits several domains in two, such as d.amazon.ca -> d.ama + zon.ca, and it matches some non-domain text such as what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions.
How can I force the regex to be greedy in the sense that it matches the full domain as a single match?
I'm using Java.
I'd use the standard URI class instead of a regular expression to parse out the domain:
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;
public class Demo {
private static Optional<String> getHostname(String domain) {
try {
// Add a scheme if missing
if (domain.indexOf("://") == -1) {
domain = "https://" + domain;
}
URI uri = new URI(domain);
return Optional.ofNullable(uri.getHost()).map(s -> s.startsWith("www.") ? s.substring(4) : s);
} catch (URISyntaxException e) {
return Optional.empty();
}
}
public static void main(String[] args) {
String[] domains = new String[] {
"p.io",
"amazon.com",
"d.amazon.ca",
"domain.amazon.co.uk",
"https://regex101.com/",
"www.regex101.comdddd", // .comdddd is (potentially) a valid TLD; not sure why your output removes the d's
"www.wix.com.co",
"https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions",
"smile.amazon.com"
};
for (String domain : domains) {
System.out.println(getHostname(domain).orElse("hostname not found"));
}
}
}
outputs
p.io
amazon.com
d.amazon.ca
domain.amazon.co.uk
regex101.com
regex101.comdddd
wix.com.co
stackoverflow.com
smile.amazon.com
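If a regular expression is still preferred, one option is to stop describing what a domain looks like and instead let a single greedy negated character class consume everything up to the first delimiter. The following is only a sketch under that assumption; the class name, method name and pattern are mine, and it does not handle user info (user@host) or validate the TLD.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostRegex {
    // Optional scheme, optional "www.", then the host: every character
    // up to the first '/', ':', '?' or '#', taken greedily as one match.
    private static final Pattern HOST = Pattern.compile(
            "^(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?(?:www\\.)?([^/:?#]+)");

    static Optional<String> extractHost(String input) {
        Matcher m = HOST.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractHost("d.amazon.ca").orElse("hostname not found"));
        System.out.println(extractHost("https://regex101.com/").orElse("hostname not found"));
        System.out.println(extractHost("www.wix.com.co").orElse("hostname not found"));
    }
}
```

Because the host is a single quantified class, the match cannot be split into d.ama + zon.ca, and path text after the first slash is never considered.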

Check if files under a root are named in a portable way

I want to check whether all the files in a given folder have portable names, or whether some have unfortunate names that may make it impossible to represent the same file structure on various file systems; I want to support at least the most common cases.
For example, on Windows you cannot have a file called aux.txt, and file names are not case sensitive.
This is my best attempt, but I'm not an expert in operating systems and file system design.
Looking on Wikipedia, I've found 'incomplete' lists of possible problems... but how can I catch all the issues?
Please look at my code below and see if I've forgotten any subtle unfortunate case. In particular, I've found a lot of 'Windows issues'. Is there any Linux/Mac issue that I should check for?
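import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
class CheckFileSystemPortable {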
Path top;
List<Path> okPaths=new ArrayList<>();
List<Path> badPaths=new ArrayList<>();
List<Path> repeatedPaths=new ArrayList<>();
CheckFileSystemPortable(Path top){
assert Files.isDirectory(top);
this.top=top;
try (Stream<Path> walk = Files.walk(top)) {//the first one is guaranteed to be the root
walk.skip(1).forEach(this::checkSystemIndependentPath);
} catch (IOException e) {
throw new Error(e);
}
for(var p:okPaths) {
checkRepeatedPaths(p);
}
okPaths.removeAll(repeatedPaths);
}
private void checkRepeatedPaths(Path p) {
var s=p.toString();
for(var pi:okPaths){
if (pi!=p && pi.toString().equalsIgnoreCase(s)) {
repeatedPaths.add(pi);
}
}
}
//incomplete list from wikipedia below:
//https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
private static final List<String>forbiddenWin=List.of(
"CON", "PRN", "AUX", "CLOCK$", "NUL",
"COM0", "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
"LPT0", "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
"LST", "KEYBD$", "SCREEN$", "$IDLE$", "CONFIG$",
"$Mft", "$MftMirr", "$LogFile", "$Volume", "$AttrDef", "$Bitmap", "$Boot",
"$BadClus", "$Secure", "$Upcase", "$Extend", "$Quota", "$ObjId", "$Reparse"
);
private void checkSystemIndependentPath(Path path) {
String lastName=path.getName(path.getNameCount()-1).toString();
String[] parts=lastName.split("\\.");
var ko = forbiddenWin.stream()
.filter(f -> Stream.of(parts).anyMatch(p->p.equalsIgnoreCase(f)))
.count();
if(ko!=0) {
badPaths.add(path);
} else {
okPaths.add(path);
}
}
}
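As a side note on the code above, checkRepeatedPaths compares every path with every other path. A minimal sketch of the same collision check in a single pass is shown below, grouping paths by a folded key. Normalizing to NFC before lower-casing is my own addition, on the assumption that some file systems treat differently composed Unicode names as the same file; the class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class NameCollisions {
    // Folded key: Unicode NFC form (an assumption, see above), then lower case
    static String key(String relativePath) {
        return Normalizer.normalize(relativePath, Normalizer.Form.NFC)
                .toLowerCase(Locale.ROOT);
    }

    // Returns every group of paths that share the same folded key
    static List<List<String>> findCollisions(List<String> relativePaths) {
        Map<String, List<String>> byKey = new LinkedHashMap<>();
        for (String p : relativePaths) {
            byKey.computeIfAbsent(key(p), k -> new ArrayList<>()).add(p);
        }
        List<List<String>> collisions = new ArrayList<>();
        for (List<String> group : byKey.values()) {
            if (group.size() > 1) {
                collisions.add(group);
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        System.out.println(findCollisions(
                List.of("src/Readme.md", "src/README.md", "src/Main.java")));
    }
}
```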
If I understand your question correctly, and going by the Filename Wikipedia page, portable file names must:
Be POSIX compliant, e.g. ASCII alphanumeric characters plus _ and -.
Avoid Windows and DOS device names.
Avoid NTFS special names.
Avoid special characters, e.g. \, |, /, $, etc.
Avoid a trailing space or dot.
Avoid file names beginning with a -.
Meet the maximum length, e.g. 8-bit FAT has a maximum length of 9 characters.
Some systems expect an extension consisting of a . followed by 3 letters.
With all that in mind checkSystemIndependentPath could be simplified a bit, to cover most of those cases using a regex.
For example, POSIX file name, excluding special devices, NTFS, special characters and trailing space or dot:
private void checkSystemIndependentPath(Path path){
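// (?i): Windows device names are reserved regardless of case
String reserved = "^(?i)(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\\..*)*$";
// ASCII letters, digits, dot, underscore and hyphen
String posix = "^[a-zA-Z0-9._-]+$";
// "\\s" is the regex whitespace class; '|' would be a literal inside [...]
String trailing = ".*[\\s.]$";
int nameLimit = 9;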
String fileName = path.getFileName().toString();
if (fileName.matches(posix) &&
!fileName.matches(reserved) &&
!fileName.matches(trailing) &&
fileName.length() <= nameLimit) {
okPaths.add(path);
} else {
badPaths.add(path);
}
}
Note that the example is untested and doesn't cover edge cases.
For example, some systems ban dots in directory names.
Some systems will complain about multiple dots in a file name.
Assuming your Windows forbidden list is correct, and adding ":" (Mac) and NUL (everywhere), use regex!
private static final List<String> FORBIDDEN_WINDOWS_NAMES = List.of(
"CON", "PRN", "AUX", "CLOCK$", "NUL",
"COM0", "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
"LPT0", "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
"LST", "KEYBD$", "SCREEN$", "$IDLE$", "CONFIG$",
"$Mft", "$MftMirr", "$LogFile", "$Volume", "$AttrDef", "$Bitmap", "$Boot",
"$BadClus", "$Secure", "$Upcase", "$Extend", "$Quota", "$ObjId", "$Reparse"
); // you can add more
private static final String FORBIDDEN_CHARACTERS = "\0:"; // you can add more
// (.*/)?+ is possessive: once the directory prefix is consumed, the engine cannot
// backtrack and test the name lookahead at the start of the path instead
// (with a plain (.*/)? a path such as "foo/CON" would be accepted)
private static final String REGEX = "^(?i)(?!.*[" + FORBIDDEN_CHARACTERS + "])(.*/)?+(?!(\\Q" +
String.join("\\E|\\Q", FORBIDDEN_WINDOWS_NAMES) + "\\E)(\\.[^/]*)?$).*";
private static final Pattern ALLOWED_PATTERN = Pattern.compile(REGEX);
public static boolean isAllowed(String path) {
return ALLOWED_PATTERN.matcher(path).matches();
}
FYI, the regex generated from the lists/chars as defined here is:
^(?i)(?!.*[<nul>:])(.*/)?+(?!(\QCON\E|\QPRN\E|\QAUX\E|\QCLOCK$\E|\QNUL\E|\QCOM0\E|\QCOM1\E|\QCOM2\E|\QCOM3\E|\QCOM4\E|\QCOM5\E|\QCOM6\E|\QCOM7\E|\QCOM8\E|\QCOM9\E|\QLPT0\E|\QLPT1\E|\QLPT2\E|\QLPT3\E|\QLPT4\E|\QLPT5\E|\QLPT6\E|\QLPT7\E|\QLPT8\E|\QLPT9\E|\QLST\E|\QKEYBD$\E|\QSCREEN$\E|\Q$IDLE$\E|\QCONFIG$\E|\Q$Mft\E|\Q$MftMirr\E|\Q$LogFile\E|\Q$Volume\E|\Q$AttrDef\E|\Q$Bitmap\E|\Q$Boot\E|\Q$BadClus\E|\Q$Secure\E|\Q$Upcase\E|\Q$Extend\E|\Q$Quota\E|\Q$ObjId\E|\Q$Reparse\E)(\.[^/]*)?$).*
Each forbidden file name has been wrapped in \Q and \E, which is how you quote an expression in regex so that all chars are treated as literal chars. For example, the dollar sign in \Q$Boot\E doesn't mean end of input, it's just a plain dollar sign.
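The \Q...\E wrapping can also be delegated to Pattern.quote, which is part of the standard library. A small sketch of the difference quoting makes follows; the class and helper names are mine, not part of the answer above.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuoteDemo {
    // Builds a literal alternation such as \QCON\E|\Q$Boot\E
    static String alternation(List<String> names) {
        return names.stream().map(Pattern::quote).collect(Collectors.joining("|"));
    }

    public static void main(String[] args) {
        // Unquoted: '$' is the end-of-input anchor, so the literal text cannot match
        System.out.println(Pattern.matches("$Boot", "$Boot"));
        // Quoted: every character is taken literally
        System.out.println(Pattern.matches(Pattern.quote("$Boot"), "$Boot"));
        System.out.println(alternation(List.of("CON", "$Boot")));
    }
}
```

Pattern.quote also copes with names that themselves contain \E, which the manual String.join approach does not.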
Thanks everyone.
I have now written the complete code for this.
I'm sharing it as a potential answer, since I think the trade-offs I had to make are likely quite common.
Main points:
I had to choose 248 as the maximum size.
I had to accept '$' in file names.
I had to completely skip any file/folder/subtree that is either labelled as hidden (Windows) or starts with '.'; those files are hidden and likely to be autogenerated, out of my control, and in any case not used by my application.
Of course, if your application relies on ".**" files/folders, you may have to check for those.
Another point of friction is multiple dots: not only may some systems be upset, but it is also unclear where the main name ends and the extension starts.
For example, I had a use case involving the file derby-10.15.2.0.jar.
Is the extension .jar or .15.2.0.jar? Do some systems disagree on this?
For now, I'm forcing those files to be renamed to, for example, derby-10_15_2_0.jar
public class CheckFileSystemPortable{
Path top;
List<Path> okPaths = new ArrayList<>();
List<Path> badPaths = new ArrayList<>();
List<Path> repeatedPaths = new ArrayList<>();
public void makeError(..) {..anything you need for a good message..}
public boolean isDirectory(Path top){ return Files.isDirectory(top); }
//I override the above when I do mocks for testing
public CheckFileSystemPortable(Path top){
assert isDirectory(top);
this.top = top;
walkIn1(top);
for(var p:okPaths){ checkRepeatedPaths(p); }
okPaths.removeAll(repeatedPaths);
}
public void walkIn1(Path path) {
try(Stream<Path> walk = Files.walk(path,1)){
//the first one is guaranteed to be the root
walk.skip(1).forEach(this::checkSystemIndependentPath);
}
catch(IOException e){ throw new Error(e); }//considered unreachable; handled as in the question's code
}
private void checkRepeatedPaths(Path p){
var s = p.toString();
for(var pi:okPaths){
if (pi!=p && pi.toString().equalsIgnoreCase(s)) {repeatedPaths.add(pi);}
}
}
private static final List<String>forbiddenWin = List.of(
"CON", "PRN", "AUX", "CLOCK$", "NUL",
"COM0", "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
"LPT0", "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
"LST", "KEYBD$", "SCREEN$", "$IDLE$", "CONFIG$",
"$Mft", "$MftMirr", "$LogFile", "$Volume", "$AttrDef", "$Bitmap", "$Boot",
"$BadClus", "$Secure", "$Upcase", "$Extend", "$Quota", "$ObjId", "$Reparse",
""
);
static final Pattern regex = Pattern.compile(//POSIX + $,
"^[a-zA-Z0-9\\_\\-\\$]+$");// but . is handled separately
public void checkSystemIndependentPath(Path path){
String lastName=path.getFileName().toString();
//too dangerous even for ignored ones
if(lastName.equals(".") || lastName.equals("..")) { badPaths.add(path); return; }
boolean skip = path.toFile().isHidden() || lastName.startsWith(".");
if(skip){ return; }
var badSizeEndStart = lastName.length()>248
||lastName.endsWith(".")
||lastName.endsWith("-")
|| lastName.startsWith("-");
if(badSizeEndStart){ badPaths.add(path); return; }
var i=lastName.indexOf(".");
var fileName = i==-1?lastName:lastName.substring(0,i);
var extension = i==-1?"":lastName.substring(i+1);
var extensionDots = extension.contains(".");
if(extensionDots){ badPaths.add(path); return; }
var badDir = isDirectory(path) && i!=-1;
if(badDir){ badPaths.add(path); return; }
var badFileName = !regex.matcher(fileName).matches();
var badExtension = !extension.isEmpty() && !regex.matcher(extension).matches();
if(badFileName||badExtension){ badPaths.add(path); return; }
var ko = forbiddenWin.stream()
.filter(f->fileName.equalsIgnoreCase(f)).count();
if(ko!=0){ badPaths.add(path); return; }
okPaths.add(path);
walkIn1(path);//recursive exploration
}
}
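To make the multiple-dots ambiguity mentioned above concrete, here is a small sketch of the two obvious conventions. The code above splits at the first dot (indexOf); splitting at the last dot is the other common choice. The class and method names are hypothetical.

```java
class ExtensionSplit {
    // Convention used in the code above: everything after the FIRST dot
    static String extensionFromFirstDot(String name) {
        int i = name.indexOf('.');
        return i == -1 ? "" : name.substring(i + 1);
    }

    // Alternative convention: everything after the LAST dot
    static String extensionFromLastDot(String name) {
        int i = name.lastIndexOf('.');
        return i == -1 ? "" : name.substring(i + 1);
    }

    public static void main(String[] args) {
        String name = "derby-10.15.2.0.jar";
        System.out.println(extensionFromFirstDot(name)); // 15.2.0.jar
        System.out.println(extensionFromLastDot(name));  // jar
    }
}
```

Renaming to derby-10_15_2_0.jar makes both conventions agree on jar.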

Java commons cli parser not recognizing command line arguments

This should be very simple, but I am not sure why it's not working. I am trying to pass named arguments (so that I can pass them in any order) using the Apache Commons CLI library, but it does not seem to work. I pass the arguments from the Eclipse IDE. I know this part is not the problem, because I am able to print the arguments with args[0] and the like.
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
public class MainClass {
public static void main(String[] args) throws ParseException {
System.out.println(args[0]);
Options options = new Options();
options.addOption("d", false, "add two numbers");
CommandLineParser parser = new DefaultParser();
CommandLine cmd = parser.parse( options, args);
if(cmd.hasOption("d")) {
System.out.println("found d");
} else {
System.out.println("Not found");
}
}
}
The above lines are exactly like the examples given online, but I don't know why it's not working. I have been struggling with this for a day now. Please help me see where I am going wrong.
According to the examples, the name of the parameter must be present on the command line.
Property without value
Usage: ls [OPTION]... [FILE]...
-a, --all do not hide entries starting with .
And the respective code is:
// create the command line parser
CommandLineParser parser = new DefaultParser();
// create the Options
Options options = new Options();
options.addOption( "a", "all", false, "do not hide entries starting with ." );
In this scenario the correct call is:
ls -a or ls --all
With value separated by space
-logfile <file> use given file for log
Respective code is:
Option logfile = OptionBuilder.withArgName( "file" )
.hasArg()
.withDescription( "use given file for log" )
.create( "logfile" );
And the call would be:
app -logfile name.of.file.txt
With value separated by equals
-D<property>=<value> use value for given property
The code is:
Option property = OptionBuilder.withArgName( "property=value" )
.hasArgs(2)
.withValueSeparator()
.withDescription( "use value for given property" )
.create( "D" );
And the call would be:
app -Dmyprop=myvalue

Is there any class in org.opendaylight.yangtools.yang.model.api that can be used for tailf:action?

I am working on a project in which I need to parse tailf:action in a YANG schema using the OpenDaylight library. I am trying to find a class in org.opendaylight.yangtools.yang.model.api that can be used to parse tailf:action, so that I can get the input and output (normally lists of leafs) from an instance of that class for recursive processing.
Does anyone know whether there is a class in org.opendaylight.yangtools.yang.model.api that supports tailf:action?
A tailf:action example is shown below.
Thanks in advance.
tailf:action set-ip-attributes {
description "set ip";
tailf:info "...";
tailf:exec "/usr/local/a.py" {
tailf:args "-c $(context) -p $(path)";
}
tailf:cli-mount-point "set";
input {
leaf ip {
type inet:ip-address;
mandatory true;
description "IP Address of the session";
tailf:info "IP Address of the session";
}
leaf attribute {
type string;
mandatory true;
description "Name of the attribute";
tailf:info "Name of the attribute";
}
}
output {
uses set-session-attribute;
}
}
OpenDaylight only supports "action", not "tailf:action". The following classes are used:
org.opendaylight.yangtools.yang.model.api.ActionDefinition;
org.opendaylight.yangtools.yang.model.api.ActionNodeContainer;

Find a String with some keys in Java

Consider a map as below:
Map("PDF","application/pdf")
Map("XLSX","application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
Map("CVS","application/csv")
....
There is an export method which takes the export button name and finds the export type and the content type:
public void setExport(String exportBtn) {
for (String key : exportTypes.keySet()) {
if (exportBtn.contains(key)) {
this.export = key;
this.exportContentType = exportTypes.get(key);
LOG.debug("Exporting to {} ", this.export);
return ;
}
}
}
This method can be called as
setExport("PDF") >> export=PDF, exportContentType=application/pdf
setExport("Make and PDF") >> PDF, exportContentType=application/pdf
setExport("PDF Maker") >> PDF, exportContentType=application/pdf
I am not happy with this approach! I think there should be some library, for example in StringUtils, which can do something like:
String keys[]={"PDF","XLSX","CVS"};
String input="Make the PDF";
selectedKey = StringUtils.xxx(input,keys);
This could somehow simplify my method.
But I could not find anything. Any comments?
You could use Regex to solve this issue, something like this:
final Pattern pattern = Pattern.compile("(PDF|XLSX|CVS)");
final Matcher matcher = pattern.matcher("Make the PDF");
if (matcher.find()) {
setExportType(matcher.group());
}
You then need to build the pattern programmatically, once, so that it includes all keys, and of course use the button's name instead of "Make the PDF".
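Building the pattern from the map's keys could look roughly like the sketch below. The class and method names are hypothetical, and the map contents are taken from the question. Pattern.quote is used so that a key containing regex metacharacters is still matched literally.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ExportTypeFinder {
    // Joins all keys into one alternation, e.g. \QPDF\E|\QXLSX\E|\QCVS\E
    static Pattern patternFor(Set<String> keys) {
        return Pattern.compile(
                keys.stream().map(Pattern::quote).collect(Collectors.joining("|")));
    }

    // Returns the first key that occurs in the button name, if any
    static Optional<String> findKey(Pattern keyPattern, String buttonName) {
        Matcher m = keyPattern.matcher(buttonName);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> exportTypes = new LinkedHashMap<>();
        exportTypes.put("PDF", "application/pdf");
        exportTypes.put("XLSX", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
        exportTypes.put("CVS", "application/csv");

        Pattern keyPattern = patternFor(exportTypes.keySet()); // compile once, reuse
        findKey(keyPattern, "Make the PDF")
                .ifPresent(k -> System.out.println(k + " -> " + exportTypes.get(k)));
    }
}
```

With an empty map the alternation is the empty pattern, which matches everywhere, so the map should be checked for emptiness before compiling.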
A Map is the easiest and best way to store key-value pairs.
Why can't you directly use the map's get method with the key?
exportContentType = exportTypes.get(exportBtn);
if (exportContentType == null || exportContentType.isEmpty())
throw new IllegalArgumentException("Unknown export type: " + exportBtn); // exception type is a placeholder
else
export = exportBtn;
