How to censor website links? - java

I've been working on a regex censor for quite some time and can't seem to find a decent way of censoring website links (and attempts to circumvent the censor).
Here's what I got so far, ignoring escape sequences:
([a-zA-Z0-9_-]+[\\W[_]]*)+(\\.|[\\W]?|dot|\\(\\.\\)|[\\(]?dot[\\)]?)+([\\w]{2,6})((\\.|[\\W]?|dot|\\(\\.\\)|[\\(]?dot[\\)]?)([\\w]{1,4}))*
I'm not sure what is causing the problem, but it censors the words "com" and "come" and pretty much anything that is 3 or more letters long.
Problem: I want to know how to censor website links and invalid links that are attempts to circumvent the censor. Examples:
Google.com
goo gle .com
g o o g l e . c o m
go o gl e % com
go og le (.) c om
Also, a slight addition: is there a way to add links to a whitelist for this? Thank you.

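You could use a simple function such as this. It assumes the link has exactly three dot-separated parts:
private String hideLink(String link) {
    // Split on literal dots, e.g. "www.google.com" -> ["www", "google", "com"]
    String[] split = link.split("\\.");
    StringBuilder output = new StringBuilder(split[0]).append('.');
    // Mask every character of the middle part
    for (int i = 0; i < split[1].length(); i++) {
        output.append('*');
    }
    output.append('.').append(split[2]);
    return output.toString();
}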
calling
hideLink("www.google.com");
returns www.******.com
calling
hideLink("www.msn.net");
returns www.***.net
calling
hideLink("http://abc.12345.org");
returns http://abc.*****.org
etc...
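The whitelist part of the question could be approached along the following lines. This is only a minimal sketch: the class name LinkCensor, the host pattern, and the whitelist contents are assumptions made for illustration, and the pattern only catches plainly written hosts such as Google.com. It does not handle the obfuscated forms ("goo gle .com", "(.)", "dot"), and it will also mask things like file names that look like hosts.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkCensor {
    // Hypothetical pattern: dot-separated labels ending in a 2-6 letter suffix.
    private static final Pattern HOST = Pattern.compile(
            "\\b(?:[a-z0-9-]+\\.)+[a-z]{2,6}\\b", Pattern.CASE_INSENSITIVE);

    // Masks every host-like token that is not in the (lowercase) whitelist.
    static String censor(String text, Set<String> whitelist) {
        Matcher m = HOST.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String host = m.group();
            String replacement = host;
            if (!whitelist.contains(host.toLowerCase(Locale.ROOT))) {
                char[] stars = new char[host.length()];
                Arrays.fill(stars, '*');
                replacement = new String(stars);
            }
            // quoteReplacement keeps '$' and '\' in the host from being interpreted
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling LinkCensor.censor(message, whitelist) with a whitelist containing "example.org" leaves that host readable and masks other hosts with asterisks of the same length.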

Related

How to remove comments from a given String in Java?

How do I remove comments that start with "//" and with /**, * etc.? I haven't found any solutions on Stack Overflow that have helped me very much; a lot of them have been way above my head and I'm still at the very basics.
What I have thought about so far:
for (int i = 0; i < length; i++) {
for (j = i; j < length; j++) {
if (obj.charAt(j) == '/' && obj.charAt(j + 1) == '/')
But I'm not really sure how to replace the words following those characters, or how to know when to stop the replacement for a "//" comment. With the /* comments, at least conceptually I know I should replace all words until "*/" pops up, though again I'm not sure how to limit the replacement to that point. To replace, I thought of replacing each charAt after the second "/" with an empty string until... where? I cannot figure out where to "end" the replacing.
I have looked at a few implementations on Stack, but I really didn't get it. Any help is appreciated, especially if it's at a basic level and understandable for someone not well versed in programming!
Thanks.
I have done something similar with regex (Java 9+):
// Checks for
// 1) Single char literal '"'
// 2) Single char literal '\"'
// 3) Strings; termination ignores \", \\\", ..., but allows ", \\", \\\\", ...
// 4) Single-line comment // ... to first \n
// 5) Multi-line comments /*[*] ... */
Pattern regex = Pattern.compile(
"(?s)('\"'|'\\\"'|\".*?(?<!\\\\)(?:\\\\\\\\)*\"|//[^\n]*|/\\*.*?\\*/)");
// Assuming 'text' contains your java text
// Leaves 1,2,3) unchanged and replaces comments 4,5) with ""
// Need quoteReplacement to prevent matcher processing $ and \
String textWithoutComments = regex.matcher(text).replaceAll(
m -> m.group().charAt(0) == '/' ? "" : Matcher.quoteReplacement(m.group()));
If you don't have Java 9+ then you could use this replace function:
String textWithoutComments = replaceAll(regex, text,
m -> m.group().charAt(0) == '/' ? "" : m.group());
public static String replaceAll(Pattern p, String s,
Function<MatchResult, String> replacer) {
Matcher m = p.matcher(s);
StringBuilder b = new StringBuilder();
int lastStart = 0;
while (m.find()) {
String replacement = replacer.apply(m);
b.append(s.substring(lastStart, m.start())).append(replacement);
lastStart = m.end();
}
return b.append(s.substring(lastStart)).toString();
}
I'm not sure if you're using an IDE like IntelliJ or Eclipse but you could do this without using code if you're just interested in removing all comments for the project. You can do this with "Replace in Path" tool. Notice how "Regex" is checked, allowing us to match lines based on regular expressions.
This configuration in the tool will find every line starting with // and replace it with an empty line.
The command to get to this on a Mac is ctrl + shift + r.
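For readers who prefer the character-by-character approach the question started with, the "where do I stop?" problem can be handled by remembering what kind of text the loop is currently inside. The following is a rough sketch under some assumptions: the class name CommentStripper and its states are made up for illustration, and Java text blocks (triple-quoted strings) are not handled.

```java
class CommentStripper {
    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    static String strip(String src) {
        StringBuilder out = new StringBuilder(src.length());
        State state = State.CODE;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') {
                        state = State.LINE_COMMENT;
                        i++; // skip the second '/'
                    } else if (c == '/' && next == '*') {
                        state = State.BLOCK_COMMENT;
                        i++; // skip the '*'
                    } else {
                        if (c == '"') state = State.STRING;
                        else if (c == '\'') state = State.CHAR;
                        out.append(c);
                    }
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    if (c == '\\' && i + 1 < src.length()) {
                        out.append(next); // keep the escaped character as-is
                        i++;
                    } else if (c == (state == State.STRING ? '"' : '\'')) {
                        state = State.CODE; // closing quote
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n') { // a line comment ends at the newline, which is kept
                        state = State.CODE;
                        out.append(c);
                    }
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') { // a block comment ends at "*/"
                        state = State.CODE;
                        i++;
                    }
                    break;
            }
        }
        return out.toString();
    }
}
```

The design choice here is that nothing is "replaced": characters are copied to the output only while the loop is outside a comment, so a "//" inside a string literal is left alone.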

Regex to match/group repeating characters in a string

I need a regular expression that will match groups of characters in a string. Here's an example string:
qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT
It should match
(match group) "result"
(1) "q"
(2) "wwwwwwwww"
(3) "eeeee"
(4) "rr"
(5) "t"
(6) "yyyyy"
(7) "qqqq"
(8) "w"
(9) "EE"
(10) "r"
(11) "TTT"
After doing some research, this is the best I could come up with:
/(.)(\1*)/g
The problem I'm having is that the only way to use the \1 back-reference is to capture the character first. If I could reference the result of a non capturing group I could solve this problem but after researching I don't think it's possible.
How about /((.)(\2*))/g? That way, you match the group as a whole (I'm assuming that that's what you want, and that's what's lacking from the solution you found).
Looks like you need to use a Matcher in a loop:
Pattern p = Pattern.compile("((.)\\2*)");
Matcher m = p.matcher("qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT");
while (m.find()) {
System.out.println(m.group(1));
}
Outputs:
q
wwwwwwwww
eeeee
rr
t
yyyyy
qqqq
w
EE
r
TTT
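If the repeated character and the length of each run are needed separately, the inner group can be read as well. A small sketch, assuming a hypothetical helper class RunFinder and a "char:length" output format chosen only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunFinder {
    // Returns one "char:length" entry per run of identical characters.
    static List<String> runs(String s) {
        Matcher m = Pattern.compile("((.)\\2*)").matcher(s);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            // group(2) is the repeated character, group(1) is the whole run
            result.add(m.group(2) + ":" + m.group(1).length());
        }
        return result;
    }
}
```

Because the match is case sensitive, "w" and "EE" in the example string end up as separate runs from their differently cased neighbours.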
Assuming what @cruncher said as a premise is true: "we want to catch repeating letter groups without knowing beforehand which letter should be repeating" then:
/((a*?+)|(b*?+)|(c*?+)|(d*?+)|(e*?+)|(f*?+)|(g*?+)|(h*?+))/
The above RegEx should allow the capture of repeating letter groups without hardcoding a particular order in which they would occur.
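The ?+ is meant as a "reluctant possessive" quantifier, which helps us not waste RAM by not saving previously valid backtracking cases if the current case is valid. Note, though, that java.util.regex has no such combined quantifier: *? is reluctant and *+ is possessive, so in Java each alternative needs one or the other.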
Since you did tag java, I'll give an alternative non-regex solution (I believe in requirements being the end product, not the method by which you get there).
// s is the input string
String repeat = "";
char c = '\0'; // placeholder start value; '' is not a valid char literal
for (int i = 0; i < s.length(); i++) {
    if (s.charAt(i) == c) {
        repeat += c; // same character: extend the current run
    } else {
        if (!repeat.isEmpty())
            doSomething(repeat); // add to an array if you want
        c = s.charAt(i);
        repeat = "" + c; // start a new run
    }
}
if (!repeat.isEmpty())
    doSomething(repeat); // flush the last run

Java - Pattern Matching

I have some code in PHP that I made using preg_grep for matching several words, in any order, that can exist in any context. I'm trying to convert it to Java but I can't seem to figure it out.
My php code for converting a keyword to a regex string is:
function createRegexSearch($keywords)
{
$regex = '';
foreach ($keywords as $key)
$regex .= '(?=.*' . $key . ')';
return '/^' . $regex . '/i';
}
It would create a regex string similar to: /^(?=.*bot)/i - which should match robot, robots, bots etc. The same regex string doesn't seem to work in java which is leaving me confused. Currently in java I created a similar effect with contains but would rather use regex.
for (Map.Entry<String, String> entry : mKeyList.entrySet())
{
boolean found = true;
String val = entry.getValue().toLowerCase();
for (int i = 0; i < keywords.length; i++)
{
if (!val.contains(keywords[i].toLowerCase()))
found = false;
}
if (found)
ret.add(entry.getValue());
}
One thing that Java does differently from many languages is that it has two ways of "matching" a regex against a target: matches() and find(). matches() is the equivalent of putting ^ and $ at the beginning and end of your expression, while find() finds the first match wherever it might be in the string. For example, while you can find() the pattern .*bot in the target string robots, it would not be true to say that the pattern matches() the target. I'm not entirely sure how the lookahead might affect this.
Without posted Java code (containing the problem), it's hard to tell you where you might be going wrong, but my guess is that it could very easily be in this area.
Also, the equivalent of putting /i at the end of your expression in Java (and .NET) is putting (?i) at the beginning of your expression (or at the start of any region you want to be case insensitive). Thus, /[a-f0-9]/i is equivalent to (?i)[a-f0-9]
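The difference between the two methods matters for the lookahead pattern from the question, because a pattern made only of ^ and lookaheads consumes no characters. A short sketch, assuming a hypothetical class MatchVsFind with two helper methods introduced just for this comparison:

```java
import java.util.regex.Pattern;

class MatchVsFind {
    // True only if the regex consumes the entire input.
    static boolean whole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex matches somewhere in the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String lookahead = "(?i)^(?=.*bot)"; // Java form of /^(?=.*bot)/i
        System.out.println(whole(lookahead, "RoBots"));        // false: zero-width match
        System.out.println(anywhere(lookahead, "RoBots"));     // true
        System.out.println(whole(lookahead + ".*", "RoBots")); // true: .* consumes the rest
    }
}
```

So the PHP pattern can be kept as it is when find() is used, or extended with a trailing .* when matches() (or String.matches) is used.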
String.contains is case sensitive. The PHP code behaves case-insensitively because of the /i flag, while a plain contains check in Java behaves case-sensitively, so there can be differences in behavior.
If that is the difference, convert both strings to the same case, say with toUpperCase(), before the contains check.
Also you are using a regex in PHP code and not in Java, any specific reason behind this?
Regards
Ajai G
You can use the embedded flag expression (?i), so the regex you should be using to match bot, robot, bots and robots is (?i)^(.*bots?)$. This should work with either String.matches or Pattern/Matcher.
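A direct port of the PHP createRegexSearch function might look like the sketch below. The class name KeywordSearch is an assumption, and so is the use of Pattern.quote, which treats each keyword as literal text (the PHP original inserts the keyword unescaped).

```java
import java.util.regex.Pattern;

class KeywordSearch {
    // Builds ^(?=.*k1)(?=.*k2)... so every keyword must occur, in any order.
    static Pattern createRegexSearch(String... keywords) {
        StringBuilder regex = new StringBuilder("^");
        for (String key : keywords) {
            regex.append("(?=.*").append(Pattern.quote(key)).append(")");
        }
        // CASE_INSENSITIVE plays the role of the /i flag in the PHP version
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Uses find(), because the pattern itself consumes no characters.
    static boolean containsAll(String value, String... keywords) {
        return createRegexSearch(keywords).matcher(value).find();
    }
}
```

In the loop from the question, the inner contains check could then be replaced by a single call such as KeywordSearch.containsAll(entry.getValue(), keywords); compiling the pattern once outside the loop would avoid rebuilding it for every entry.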
JMPL is a simple Java library that emulates some pattern-matching features using Java 8 features.
import org.kl.state.Else;
import static org.kl.pattern.DeconstructPattern.matches;
import static org.kl.pattern.DeconstructPattern.foreach;
import static org.kl.pattern.DeconstructPattern.let;
let(figure, (int w, int h) -> {
System.out.println("border: " + w + " " + h);
});
matches(figure).as(
Rectangle.class, (int w, int h) -> System.out.println("square: " + (w * h)),
Circle.class, (int r) -> System.out.println("square: " + (2 * Math.PI * r)),
Else.class, () -> System.out.println("Default square: " + 0)
);
foreach(listRectangles, (int w, int h) -> {
System.out.println("square: " + (w * h));
});

How to remove the degrees celsius symbol from a string (java)

I've been trying to remove the degree Celsius symbol from the following string for a few hours now. I've looked at prior posts and I see that \u2103 is the Unicode representation for it. Despite trying to remove that character, I've still had no luck. Here's what I have now:
String temp = "Technology=Li-poly;Temperature=23.0 <degree symbol>C;Voltage=3835";
StringBuilder filtered = new StringBuilder(temp.length());
for (int i = 0; i < temp.length(); i++) {
char test = temp.charAt(i);
if (test >= 0x20 && test <= 0x7e) {
filtered.append(test);
}
}
temp = filtered.toString();
temp.replaceAll(" ", "%20");
The resulting string looks like this:
Technology=Li-poly;Temperature=23.0 C;
I've also tried
temp.replaceAll("\\u2103", "");
temp.replaceChar((char)0x2103, ' ');
But none of this works.
My current problem is that the function to filter the string leaves a blank space but the call to replaceAll(" ", "%20") doesn't seem to recognize that particular space. ReplaceAll will replace other spaces with %20.
This is one problem:
temp.replaceAll(" ", "%20");
You're calling replaceAll but never using the result. Strings are immutable: any method which looks like it's changing the content is actually returning a different string as a result. You want:
temp = temp.replaceAll(" ", "%20");
Having said that, it's not clear why you're trying to replace the space at all, nor what's wrong with your resulting string.
You've got the same problem with your other temp.replaceAll and temp.replaceChar calls.
Your attempt to replace the character directly would also fail as you're escaping the backslash - you really want:
temp = temp.replace("\u2103", "");
Note the use of replace instead of replaceAll - the latter uses regular expressions, which there's no need to use at all here.
Perhaps you could leverage the Character.isWhitespace() function.
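Putting the points above together, a minimal sketch could look like this. The class name DegreeCleaner is made up, and handling U+00B0 (the plain degree sign, often followed by an ordinary C) in addition to U+2103 is an assumption, since the question does not show which character the input actually contains.

```java
class DegreeCleaner {
    static String clean(String input) {
        // Each call returns a new String, so the result must be kept.
        String result = input
                .replace("\u2103", "")  // DEGREE CELSIUS as a single character
                .replace("\u00B0", ""); // DEGREE SIGN, typically followed by 'C'
        // replace works on literal text; no regular expression is needed
        return result.replace(" ", "%20");
    }
}
```

With this approach the filtering loop from the question is not needed for the symbol itself, although it may still be useful if other non-ASCII characters have to be dropped.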

Regex to find commas that aren't inside "( and )"

I need some help to model this regular expression. I think it'll be easier with an example. I need a regular expression that matches a comma, but only if it's not inside this structure: "( )", like this:
,a,b,c,d,"("x","y",z)",e,f,g,
Then the first five and the last four commas should match the expression, the two between xyz and inside the ( ) section shouldn't.
I tried a lot of combinations, but regular expressions are still a little foggy for me.
I want to use it with the split method in Java. The example is short, but the real input can be much longer and have more than one section between "( and )". The split method receives an expression, and any text that matches it (in this case the comma) is treated as the separator.
So, I want to do something like this:
String keys[] = row.split(expr);
System.out.println(keys[0]); // print a
System.out.println(keys[1]); // print b
System.out.println(keys[2]); // print c
System.out.println(keys[3]); // print d
System.out.println(keys[4]); // print "("x","y",z)"
System.out.println(keys[5]); // print e
System.out.println(keys[6]); // print f
System.out.println(keys[7]); // print g
Thanks!
You can do this with a negative lookahead. Here's a slightly simplified problem to illustrate the idea:
String text = "a;b;c;d;<x;y;z>;e;f;g;<p;q;r;s>;h;i;j";
String[] parts = text.split(";(?![^<>]*>)");
System.out.println(java.util.Arrays.toString(parts));
// _ _ _ _ _______ _ _ _ _________ _ _ _
// [a, b, c, d, <x;y;z>, e, f, g, <p;q;r;s>, h, i, j]
Note that instead of ,, the delimiter is now ;, and instead of "( and "), the parentheses are simply < and >, but the idea still works.
On the pattern
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
The * repetition specifier can be used to match "zero-or-more times" of the preceding pattern.
The (?!…) is a negative lookahead; it can be used to assert that a certain pattern DOES NOT match, looking ahead (i.e. to the right) of the current position.
The pattern [^<>]*> matches a (possibly empty) sequence of anything except parentheses, followed by a parenthesis of the closing type.
Putting all of the above together, we get ;(?![^<>]*>), which matches a ;, but only if we can't see the closing parenthesis as the first parenthesis to its right, because witnessing such phenomenon would only mean that the ; is "inside" the parentheses.
This technique, with some modifications, can be adapted to the original problem. Remember to escape regex metacharacters ( and ) as necessary, and of course " as well as \ in a Java string literal must be escaped by preceding with a \.
You can also make the * possessive to try to improve performance, i.e. ;(?![^<>]*+>).
References
regular-expressions.info/Character class, Repetition, Lookarounds, Possessive
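Adapted to the original problem, the same lookahead could be sketched as follows. The class name QuotedParenSplit is hypothetical, and the sketch assumes that the "( ... )" sections are not nested and that parentheses do not appear elsewhere in the row.

```java
import java.util.Arrays;

class QuotedParenSplit {
    static String[] split(String row) {
        // A comma separates fields unless the next parenthesis to its right is ')'.
        return row.split(",(?![^()]*\\))");
    }

    public static void main(String[] args) {
        String row = ",a,b,c,d,\"(\"x\",\"y\",z)\",e,f,g,";
        System.out.println(Arrays.toString(split(row)));
    }
}
```

Note that because the example row starts with a comma, split returns an empty string at index 0, so "a" ends up at index 1 rather than index 0; trailing empty strings are dropped by split.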
Try this one:
(?![^(]*\)),
It worked for me in my testing; it grabbed all commas not inside parentheses.
Edit: Gopi pointed out the need to escape the backslashes in Java:
(?![^(]*\\)),
Edit: Alan Moore pointed out some unnecessary complexity. Fixed.
If the parens are paired correctly and cannot be nested, you can split the text first at parens, then process the chunks.
List<String> result = new ArrayList<String>();
String[] chunks = text.split("[()]");
for (int i = 0; i < chunks.length; i++) {
if ((i % 2) == 0) {
String[] atoms = chunks[i].split(",");
for (int j = 0; j < atoms.length; j++)
result.add(atoms[j]);
}
else
result.add(chunks[i]);
}
Well,
After some tests I found an answer that does what I need for now. At the moment, all items inside the "( ... )" block are inside "" too, as in "("a", "b", "c")", so the regex ((?<!\"),)|(,(?!\")) works great for what I want!
But I'm still looking for one that can find the commas even if there are no "" around the inner terms.
Thanks for the help, guys.
This should do what you want:
(".*")|([a-z])
I didn't check it in Java, but if you test it with http://www.fileformat.info/tool/regex.htm
the groups $1 and $2 contain the right values, so they match and you should get what you want.
It gets a little trickier if you have values more complex than a-z between the commas.
If I understand split correctly, don't use it; just fill your array with the backreference $0, which holds the values you are looking for.
A match function may be a better way, because working with the values directly gives you this really simple regexp. The other solutions I have seen so far are very good, no question about that, but they are really complicated, and in two weeks you won't remember exactly what the regexp did.
Inverting the problem often makes it simpler.
I had the same issue. I chose Adam Schmideg's answer and improved it.
I had to deal with these 3 strings, for example:
France (Grenoble, Lyon), Germany (Berlin, Munich)
Italy, Suede, Belgium, Portugal
France, Italy (Torino), Spain (Bercelona, Madrid), Austria
The idea was to have :
France (Grenoble, Lyon)
or Germany (Berlin, Munich)
Italy, Suede, Belgium, Portugal
France, Italy (Torino), Spain (Bercelona, Madrid), Austria
I chose not to use a regex because I wanted to be 100% sure of what I was doing and that it would work in every case.
String[] chunks = input.split("[()]");
for (int i = 0; i < chunks.length; i++) {
if ((i % 2) != 0) {
chunks[i] = "("+chunks[i].replaceAll(",", ";")+")";
}
}
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < chunks.length; i++) {
buffer.append(chunks[i]);
}
String s = buffer.toString();
String[] output = s.split(",");
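An alternative that avoids both the temporary ";" substitution and the regex-based split is to walk the string once and count open parentheses. The following is a sketch only; the class name DepthSplitter is made up, and trimming each part is an assumption about the desired output.

```java
import java.util.ArrayList;
import java.util.List;

class DepthSplitter {
    // Splits on commas that are not enclosed in parentheses (nesting allowed).
    static List<String> splitTopLevel(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of currently open parentheses
        for (char c : input.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
            }
            if (c == ',' && depth == 0) {
                parts.add(current.toString().trim()); // top-level comma ends a part
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString().trim()); // last part has no trailing comma
        return parts;
    }
}
```

Unlike the chunk-based versions, this keeps the commas inside the parentheses unchanged and also copes with nested parentheses.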
