One line check if String contains bannedSubstrings

I have a String title and a List<String> bannedSubstrings. Now I want to perform a one line check if title is free of those bannedSubstrings.
My approach:
if(bannedSubstrings.stream().filter(bannedSubstring -> title.contains(bannedSubstring)).isEmpty()){
...
}
Unfortunately, there is no isEmpty() method for streams. So how would you solve the problem? Is there a one line solution?

Sounds like you want to read up on anyMatch:
if (bannedSubstrings.stream().anyMatch(title::contains)) {
// bad words!
}
Inversely, there's also noneMatch:
if (bannedSubstrings.stream().noneMatch(title::contains)) {
// no bad words :D
}
This isn't very efficient if title is a long string (but titles usually aren't supposed to be long, I suppose).
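For reference, here is a minimal, self-contained sketch of the noneMatch check. The class name BannedSubstringCheck, the helper isClean and the sample words are made up for illustration; only title and bannedSubstrings come from the question.

```java
import java.util.Arrays;
import java.util.List;

class BannedSubstringCheck {
    // Returns true when the title contains none of the banned substrings.
    // noneMatch short-circuits: it stops at the first substring that is found.
    static boolean isClean(String title, List<String> bannedSubstrings) {
        return bannedSubstrings.stream().noneMatch(title::contains);
    }

    public static void main(String[] args) {
        List<String> banned = Arrays.asList("spam", "scam"); // sample data, assumed
        System.out.println(isClean("Weekly report", banned));
        System.out.println(isClean("Free spam inside", banned));
    }
}
```

With an empty list, noneMatch returns true, so a title is considered clean when nothing is banned.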

If you want an efficient solution and you have many bannedSubstrings, it may be faster to join them into a single regular expression like this:
Pattern badWords = Pattern.compile(bannedSubstrings.stream().map(Pattern::quote)
.collect(Collectors.joining("|")));
Then use it like this:
if (badWords.matcher(title).find()) {
...
}
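A single scan of title can be faster than one contains call per substring, but java.util.regex is a backtracking engine and is not guaranteed to build a prefix tree from your substrings, so measure before relying on it. If performance is not a concern in your case, use the other answers.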
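A small sketch of why the Pattern.quote step matters: without it, characters such as "." or "+" in a banned substring would be interpreted as regex syntax. The class name QuotedAlternation and the sample substrings are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuotedAlternation {
    // Builds one pattern that matches any of the given strings literally.
    static Pattern compile(List<String> literals) {
        return Pattern.compile(
                literals.stream()
                        .map(Pattern::quote)               // treat each entry as a literal
                        .collect(Collectors.joining("|"))); // alternation of the literals
    }

    public static void main(String[] args) {
        Pattern badWords = compile(Arrays.asList("c++", "a.b")); // sample data, assumed
        System.out.println(badWords.matcher("I like c++").find());
        // "axb" is not matched: the quoted '.' only matches a literal dot.
        System.out.println(badWords.matcher("axb").find());
    }
}
```

One pitfall to keep in mind: an empty list produces an empty pattern, and find() on an empty pattern succeeds for every input, so you may want to handle that case separately.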

I suppose you are looking for something like this:
if(bannedSubstrings.stream().anyMatch(title::contains)){
}

The answer you've selected is pretty good, but for real performance you'd probably be better off pre-compiling the list of bad words into a regex.
public class BannedWordChecker {
public final Pattern bannedWords;
public BannedWordChecker(Collection<String> bannedWords) {
this.bannedWords =
Pattern.compile(
bannedWords.stream()
.map(Pattern::quote)
.collect(Collectors.joining("|")));
}
public boolean containsBannedWords(String string) {
return bannedWords.matcher(string).find();
}
}

Related

Replacing special characters from a string

Just would like to know if there is a more elegant and maintainable approach for this:
private String replaceSpecialChars(String fileName) {
if (fileName.length() < 1) return null;
if (fileName.contains("Ü")) {
fileName = fileName.replace("Ü", "Ue");
}
if (fileName.contains("Ä")) {
fileName = fileName.replace("Ä", "Ae");
}
if (fileName.contains("Ö")) {
fileName = fileName.replace("Ö", "Oe");
}
if (fileName.contains("ü")) {
fileName = fileName.replace("ü", "ue");
}
...
return fileName;
}
I'm restricted to Java 6.

Before you go any further on this, note that what you're doing is effectively impossible. For example, the 'ASCII-fication' of 'Ö' in Swedish is 'O' and not 'Oe'. There is no way to know whether a word is Swedish or German; after all, Swedes sometimes move to Germany. If you open a German phone book, see a Mrs. Sjögren, and asciify that to Sjoegren, you have messed it up.
If you want to run 'case and asciification insensitive comparisons', well, first you have to answer a few questions. Is muller equal to mueller equal to müller? That rabbit hole goes quite deep.
The general solution is trigrams or other generalized text search tools such as those provided by PostgreSQL. Alternatively, opt out of this mechanism, store this stuff in Unicode, and be clear that to find Ms. Sjögren you have to search for "Sjögren", for the same reason that you will not find Mr. Johnson by searching for Jahnson.
Note that most filesystems allow unicode filenames; there is no need to try to replace a Ü.
This also goes some way as to explain why there are no ready libraries available for this seemingly common job; the job is, in fact, impossible.
You can simplify this code by using a Map<String, String> with replacements if you must. I advise against it for the above reasons. Or, just.. keep it as is, but ditch the contains. This code is needlessly slow and lengthy.
There is no difference between:
if (fileName.contains("x")) fileName = fileName.replace("x", "y");
and just fileName = fileName.replace("x", "y"); except that the former is strictly slower. replace does not make a new string and returns itself if you ask it to replace a string that it does not contain. The former searches twice, the latter only once, and neither makes a new string unless an actual replacement needs to be done.
You can then chain it:
if (fileName.isEmpty()) return null;
return fileName
.replace("Ü", "Ue")
.replace("Ä", "Ae")
...
;
But, as I said, you probably don't want to do that, unless you want an aggravated person on the line at some point in the future complaining that you bungled up the asciification of their surname.
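If you do go the Map<String, String> route mentioned above, a minimal Java 6 compatible sketch could look like this. The class name UmlautReplacer and the table contents are assumptions; the characters are written as Unicode escapes only so the snippet compiles regardless of source encoding, and the caveats about asciification still apply.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class UmlautReplacer {
    // Hypothetical replacement table; LinkedHashMap keeps insertion order.
    private static final Map<String, String> REPLACEMENTS =
            new LinkedHashMap<String, String>();
    static {
        REPLACEMENTS.put("\u00DC", "Ue"); // capital U with diaeresis
        REPLACEMENTS.put("\u00C4", "Ae"); // capital A with diaeresis
        REPLACEMENTS.put("\u00D6", "Oe"); // capital O with diaeresis
        REPLACEMENTS.put("\u00FC", "ue"); // small u with diaeresis
    }

    // Mirrors the original method: null for null or empty input,
    // otherwise every table entry is applied in turn.
    static String replaceSpecialChars(String fileName) {
        if (fileName == null || fileName.isEmpty()) {
            return null;
        }
        String result = fileName;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

Adding a new character then only means adding one entry to the table rather than another statement in the method.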

You can remove the unnecessary if statements and use a chain of String.replace calls. Your code might look something like this:
private static String replaceSpecialChars(String fileName) {
if (fileName == null)
return null;
else
return fileName
.replace("Ü", "Ue")
.replace("Ä", "Ae")
.replace("Ö", "Oe")
.replace("ü", "ue");
}
public static void main(String[] args) {
System.out.println(replaceSpecialChars("ABc")); // ABc
System.out.println(replaceSpecialChars("ÜÄÖü")); // UeAeOeue
System.out.println(replaceSpecialChars("").length()); // 0
System.out.println(replaceSpecialChars(null)); // null
}

Simplifying/optimizing massive if...else if...else statement(s)

Okay so essentially, I have some code that uses the contains() method to detect the presence of specific characters in two strings. For extra context, this question is a good resource as to what kind of problem I'm having (and the third solution is also something I've looked into for this). Regardless, here is some of my code:
// code up here basically just concatenates different
// characters to Strings: stringX and stringY
if (stringX.contains("!\"#")) {
} else if (stringX.contains("$%&")) {
} else if (stringX.contains("\'()")) {
} else if (stringX.contains("!$\'")) {
} else if (stringX.contains("\"%(")) {
// literally 70+ more else-if statements
}
if (stringY.contains("!\"#")) {
} else if (stringY.contains("$%&")) {
} else if (stringY.contains("\'()")) {
} else if (stringY.contains("!$\'")) {
} else if (stringY.contains("\"%(")) {
// literally 70+ more else-if statements, all of which are
// exactly the same as those working with stringX
}
I'm still pretty new to Java programming, so I'm not sure how I should go about this. Maybe it is a non-issue? Also, if I can remedy this without using RegEx, that would be preferable; I am not very knowledgeable in it at this point in time. But if the only rational solution would be to utilize it, I will obviously do so.
Edit: The code within all of these else-if statements will not be very different from each other at all; basically just a System.out.println() with some information about what characters stringX/stringY contains.

Writing the same code more than once should immediately set off alarm bells in your head to move that code into a function so it can be reused.
As for simplifying the expression, the best approach is probably storing the patterns you're looking for as an array and iterating over the array with your condition.
private static final String[] patterns = new String[] {"!\"#", "$%&", "\'()", "!$\'", "\"%(", ...};
private static void findPatterns(String input) {
for (String pattern : patterns) {
if (input.contains(pattern)) {
System.out.println("Found pattern: " + pattern);
}
}
}
// Elsewhere...
findPatterns(stringX);
findPatterns(stringY);
This pattern is especially common in functional and functional-style languages. Java 8 streams are a good example, so you could equivalently do
List<String> patterns = Arrays.asList("!\"#", "$%&", "\'()", "!$\'", "\"%(", ...);
patterns.stream()
.filter(pattern -> stringX.contains(pattern))
.forEach(pattern -> System.out.println("Found pattern: " + pattern));

You can simply make a list of your cases and then use a Java 8 stream filter:
List<String> patterns = Arrays.asList("!\"#", "$%&", ...);
// findFirst() turns the filtered stream into an Optional holding the first match
Optional<String> matched = patterns.stream().filter(p -> stringX.contains(p)).findFirst();
if (matched.isPresent()) {
System.out.println(matched.get());
}
Streams may be slightly slower than a plain loop, but usually not by much.

Java - Parsing unformatted string

I have an unformatted string like this:
Tabs,[
{ tab1 = {
Title = tab1name
}
}
{ tab2 = {
Title = tab2name
}
}
{ tab3 = {
Title = tab3name
}
}
]
I need to parse this string and extract the titles from it.
Is there a way to do this that is similar to JSON parsing?
Any help would be appreciated.

Your question is a bit unclear - are you trying to parse source code or are you trying to parse the elements within the Tab[] object? If you're looking into this for a serious project, I'd recommend looking into something like cup. If it's something simpler and you merely need specific information from a collection of strings, you can use a variety of string methods. For instance -
replace()
split()
substring()
toUpperCase()
etc...
You can find more on these methods in the String class documentation; I'd recommend it as a good read that might help you answer this and future questions.
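As a rough sketch of the string-method approach, the following pulls out every "Title = ..." value using only split, trim, indexOf and substring. The class name TitleExtractor is made up, and the code assumes each title sits on its own line as in the sample above; it is not a general parser for this format.

```java
import java.util.ArrayList;
import java.util.List;

class TitleExtractor {
    // Collects the value of every line of the form "Title = value".
    static List<String> extractTitles(String input) {
        List<String> titles = new ArrayList<String>();
        for (String line : input.split("\\r?\\n")) {
            String trimmed = line.trim();
            int eq = trimmed.indexOf('=');
            // Only accept lines whose key (the part before '=') is exactly "Title".
            if (eq > 0 && trimmed.substring(0, eq).trim().equals("Title")) {
                titles.add(trimmed.substring(eq + 1).trim());
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        String input = "Tabs,[\n{ tab1 = {\nTitle = tab1name\n}\n}\n]";
        System.out.println(extractTitles(input));
    }
}
```

If the format can nest or span lines in other ways, a real grammar and a parser generator are the safer choice.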

complex if( ) or enum?

In my app, I need to branch out if the input matches some specific 20 entries.
I thought of using an enum
public enum dateRule { is_on, is_not_on, is_before,...}
and a switch on the enum constant to do a function
switch(dateRule.valueOf(input))
{
case is_on :
case is_not_on :
case is_before :
.
.
.
// function()
break;
}
But the input strings will be like 'is on', 'is not on', 'is before' etc without _ between words.
I learnt that an enum cannot have constants containing space.
Possible ways I could make out:
1, Using if statement to compare 20 possible inputs that giving a long if statement like
if(input.equals("is on") ||
input.equals("is not on") ||
input.equals("is before") ...) { // function() }
2, Work on the input to insert _ between words but even other input strings that don't come under this 20 can have multiple words.
Is there a better way to implement this?

You can define your own version of valueOf method inside the enum (just don't call it valueOf).
public enum State {
IS_ON,
IS_OFF;
public static State translate(String value) {
return valueOf(value.toUpperCase().replace(' ', '_'));
}
}
Simply use it like before.
State state = State.translate("is on");
The earlier switch statement would still work.

It is possible to separate the enum identifier from the value. Something like this:
public enum MyEnumType
{
IS_BEFORE("is before"),
IS_ON("is on"),
IS_NOT_ON("is not on");
public final String value;
MyEnumType(final String value)
{
this.value = value;
}
}
You can also add methods to the enum-type (the method can have arguments as well), something like this:
public boolean isOnOrNotOn()
{
return this == IS_ON || this == IS_NOT_ON;
}
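Use in a switch, where rule is the MyEnumType constant whose value equals the input (MyEnumType.valueOf(input) only accepts the identifier, such as "IS_ON", not "is on"):
switch (rule)
{
case IS_ON: ...
case IS_NOT_ON: ...
case IS_BEFORE: ...
}
And when you print the value of IS_ON, for example with System.out.println(MyEnumType.IS_ON.value), it will show is on.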
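Since valueOf only understands the identifiers, an enum that carries display text needs its own lookup from text to constant. Here is one possible sketch; the names DateRule, label and fromLabel are hypothetical and not part of the answer above.

```java
import java.util.HashMap;
import java.util.Map;

enum DateRule {
    IS_ON("is on"),
    IS_NOT_ON("is not on"),
    IS_BEFORE("is before");

    // Lookup table from display text to constant, filled once at class load.
    private static final Map<String, DateRule> BY_LABEL = new HashMap<String, DateRule>();
    static {
        for (DateRule rule : values()) {
            BY_LABEL.put(rule.label, rule);
        }
    }

    final String label;

    DateRule(String label) {
        this.label = label;
    }

    // Returns the matching constant, or null when the input is not a known rule.
    static DateRule fromLabel(String input) {
        return BY_LABEL.get(input);
    }
}
```

A caller can then check the result for null (input is not one of the rules) before switching on it, which also covers the "else, do something else" branch.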

If you're using Java 7, you can also choose the middle road here, and do a switch statement with Strings:
switch (input) {
case "is on":
// do stuff
break;
case "is not on":
// etc
}

You're not really breaking the concept up enough, both solutions are brittle...
Look at your syntax
"is", can remove, seems to be ubiquitous
"not", optional, apply a ! to the output comparison
on, before, after, apply comparisons.
So split the input on spaces, check that each word exists in the syntax definition, and then evaluate the expression step by step. This lets you extend the syntax easily (without having to add an "is" and an "is not" variant for each combination) and keeps your code easy to read.
Having multiple conditions munged into one for the purposes of switch statements leads to huge bloat over time.
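To make the idea concrete, here is a small sketch of such a step-by-step evaluation. It assumes the rules compare two dates and uses java.time.LocalDate (Java 8 or later); the class name DateRuleEvaluator and the method signature are invented for illustration, and only the three comparison words listed above are handled.

```java
import java.time.LocalDate;

class DateRuleEvaluator {
    // Evaluates expressions such as "is on" or "is not before" for two dates.
    static boolean evaluate(String expression, LocalDate actual, LocalDate reference) {
        String[] words = expression.trim().split("\\s+");
        int i = 0;
        if (i < words.length && words[i].equals("is")) {
            i++; // "is" carries no meaning, skip it
        }
        boolean negate = false;
        if (i < words.length && words[i].equals("not")) {
            negate = true; // "not" inverts the final result
            i++;
        }
        if (i != words.length - 1) {
            throw new IllegalArgumentException("Unrecognized rule: " + expression);
        }
        boolean result;
        switch (words[i]) { // exactly one comparison word must remain
            case "on":     result = actual.isEqual(reference);  break;
            case "before": result = actual.isBefore(reference); break;
            case "after":  result = actual.isAfter(reference);  break;
            default: throw new IllegalArgumentException("Unrecognized rule: " + expression);
        }
        return negate ? !result : result;
    }
}
```

Supporting a new comparison then means adding one case, and its negated form comes for free.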

Thanks for the suggestions. They guided me here.
This is almost same as other answers, just a bit simplified.
To summarize, I need to compare the input string with a set of 20 strings and
if they match, do something. Else, do something else.
Static set of strings to which input needs to be compared :
is on,is not on,is before,is after, etc 20 entries
I created an enum
public enum dateRule
{
is_on
,is_not_on
,is_before
,is_after
.
.
.
}
and switching on formatted value of input
if (isARule(in = input.replace(" ", "_")))
{
switch (dateRule.valueOf(in))
{
case is_on:
case is_not_on:
case is_before: ...
}
}
I copied the formatted value of 'input' to 'in' so that I can reuse input without another replace of '_' with ' '.
private static boolean isARule(String value)
{
for(dateRule rule : dateRule.values())
{
if(rule.toString().equals(value))
{
return true;
}
}
return false;
}
Problem solved.
Reference : https://stackoverflow.com/a/4936895/1297564

Fastest way to lookup a String value

I have a simple application that reads data in small strings from large text files and saves them to a database. To actually save each such String, the application calls the following method several (maybe thousands, or more) times:
public void setValue(String value)
{
if (!ignore(value))
{
// Save the value in the database
}
}
Currently, I implement the ignore() method by just successively comparing a set of Strings, e.g.
public boolean ignore(String value)
{
if (value.equalsIgnoreCase("Value 1") || value.equalsIgnoreCase("Value 2"))
{
return true;
}
return false;
}
However, because I need to check against many such "ignorable" values, which will be defined in another part of the code, I need to use a data structure for this check, instead of multiple consecutive if statements.
So, my question is, what would be the fastest data structure from standard Java to implement this? A HashMap? A Set? Something else?
Initialization time is not an issue, since it will happen statically and once per application invocation.
EDIT: The solutions suggested thus far (including HashSet) appear slower than just using a String[] with all the ignored words and just running "equalsIgnoreCase" against each of these.

Use a HashSet, storing the values in lowercase, and its contains() method, which has better lookup performance than TreeSet (constant-time versus log-time for contains).
Set<String> ignored = new HashSet<String>();
ignored.add("value 1"); // store in lowercase
ignored.add("value 2"); // store in lowercase
public boolean ignore(String value) {
return ignored.contains(value.toLowerCase());
}
Storing the values in lowercase and searching for the lowercased input avoids the hassle of dealing with case during comparison, so you get the full speed of the HashSet implementation and zero collection-related code to write (eg Collator, Comparator etc).
EDITED
Thanks to Jon Skeet for pointing out that certain Turkish characters behave oddly when calling toLowerCase(), but if you're not intending on supporting Turkish input (or perhaps other languages with non-standard case issues) then this approach will work well for you.
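One way to sidestep the default-locale problem is to normalize with an explicit locale. The sketch below uses toLowerCase(Locale.ROOT) for both the stored values and the lookups; the class name IgnoreList and its constructor are assumptions. Note that this is not guaranteed to agree with equalsIgnoreCase for every possible character.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class IgnoreList {
    private final Set<String> ignored = new HashSet<String>();

    // Stores every value in normalized (lowercased) form.
    IgnoreList(String... values) {
        for (String value : values) {
            ignored.add(normalize(value));
        }
    }

    // Locale.ROOT makes the result independent of the JVM's default locale.
    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    boolean ignore(String value) {
        return ignored.contains(normalize(value));
    }
}
```

Because the same normalization is applied on both sides, the stored values can be written in any case.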

In most cases I'd normally start with a HashSet<String> - but as you want case-insensitivity, that makes it slightly harder.
You can try using a TreeSet<Object> using an appropriate Collator for case-insensitivity. For example:
Collator collator = Collator.getInstance(Locale.US);
collator.setStrength(Collator.SECONDARY);
TreeSet<Object> set = new TreeSet<Object>(collator);
Note that you can't create a TreeSet<String> as Collator only implements Comparator<Object>.
EDIT: While the above version works with just strings, it may be faster to create a TreeSet<CollationKey>:
Collator collator = Collator.getInstance(Locale.US);
collator.setStrength(Collator.SECONDARY);
TreeSet<CollationKey> set = new TreeSet<CollationKey>();
for (String value : valuesToIgnore) {
set.add(collator.getCollationKey(value));
}
Then:
public boolean ignore(String value)
{
return set.contains(collator.getCollationKey(value));
}
It would be nice to have a way of storing the collation keys for all ignored values but then avoid creating new collation keys when testing, but I don't know of a way of doing that.

Add the words to ignore to a list and just check whether the word is in that list.
That keeps it dynamic.

If using Java 7 this is a fast way to do it:
public boolean ignore(String value) {
switch (value.toLowerCase()) { // see Jon Skeet's comment about toLowerCase()
case "lowercased_ignore_value1":
case "lowercased_ignore_value2":
// etc
return true;
default:
return false;
}
}

It seems that String[] is slightly better (performance-wise) than the other methods proposed, so I will use that.
It is simply something like this:
public boolean ignore(String value)
{
    for (String ignore : IGNORED_VALUES)
    {
        if (ignore.equalsIgnoreCase(value))
        {
            return true;
        }
    }
    // Only reached when no ignored value matched.
    return false;
}
The IGNORED_VALUES object is just a String[] with all ignored values in there.
