Extracting Twitter username from a given text (JAVA, Regex)

I believe the code is OK, the problem is the regex.
Basically I want to find a username mention (it starts with #), and then I want to extract the allowed username part from the given word.
For example if the text contains "#FOO!!" I want to extract only "foo", but I believe the problem is with my "split("[a-z0-9-_]+")[0]" part.
Btw, allowed symbols are numbers, letters, - and _
public static Set<String> getMentionedUsers(List<Tweet> tweets) {
Set<String> mentioned = new HashSet<>();
for (Tweet tweet : tweets) {
String tweetToAnal = null;
if (tweet.getText().contains("#")) tweetToAnal = tweet.getText();
if (tweetToAnal == null) continue;
String[] splited = tweetToAnal.split("\\s+");
for (String elem : splited) {
String newElem = "";
if (elem.startsWith("#")) {
newElem = elem.substring(1).toLowerCase().split("[a-z0-9-_]+")[0];
}
if (newElem.length() > 0) mentioned.add(newElem);
}
}
return mentioned;
}

The problem is not in your regex but in your logic.
You are using the lines below to analyze usernames:
if (elem.startsWith("#")) {
newElem = elem.substring(1).toLowerCase().split("[a-z0-9-_]+")[0];
}
If you step through your code in a debugger, you will notice that substring(1) removes the #, and that split then treats every run of allowed characters as a delimiter, so it consumes the very characters you want to keep: for "foo!!" the first element of the result is an empty string. You don't want split to consume those characters; you want to capture them.
So you can keep using split, but with the negated version of your character class:
split("[^a-z0-9-_]+")
        ^---- Notice the negated character class indicator
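To see the difference concretely, here is a small sketch. The SplitDemo class and the firstToken helper are made up for illustration, and the hyphen is moved to the end of the character class so it is unambiguously a literal:

```java
class SplitDemo {
    // Hypothetical helper: lowercases the word, splits it on the given regex,
    // and returns the first piece (or "" if split returns no pieces at all).
    static String firstToken(String word, String regex) {
        String[] parts = word.toLowerCase().split(regex);
        return parts.length > 0 ? parts[0] : "";
    }

    public static void main(String[] args) {
        // Allowed characters used as the delimiter: "foo" is consumed, the first piece is empty.
        System.out.println("[" + firstToken("FOO!!", "[a-z0-9_-]+") + "]");  // prints []
        // Disallowed characters used as the delimiter: "foo" survives as the first piece.
        System.out.println("[" + firstToken("FOO!!", "[^a-z0-9_-]+") + "]"); // prints [foo]
    }
}
```

The length check is there because split drops trailing empty strings, so a word made up only of disallowed characters can yield an empty array.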
On the other hand, instead of splitting the whole text into tokens and analyzing each one, you can use a regex with a capturing group and grab the usernames directly. So, instead of having this code:
String[] splited = tweetToAnal.split("\\s+");
for (String elem : splited) {
String newElem = "";
if (elem.startsWith("#")) {
newElem = elem.substring(1).toLowerCase().split("[a-z0-9-_]+")[0];
}
if (newElem.length() > 0) mentioned.add(newElem);
You can use much simpler code like this:
Matcher m = Pattern.compile("(?<=#)([\\w-]+)").matcher(tweetToAnal); // Analyze text with a regex that will capture usernames preceded by #
while (m.find()) { // Stores all usernames (without #)
mentioned.add(m.group(1));
}
Btw, I didn't test the code, so I may have a typo but you can understand the idea. Anyway the code is pretty simple to understand.
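For reference, a self-contained sketch of the same idea might look like the following. It assumes the tweet text is available as a plain String (the Tweet class from the question is not reproduced here), and the MentionExtractor and extractMentions names are hypothetical. It spells out the allowed characters instead of using \w, and lowercases each match as the question asks:

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MentionExtractor {
    // '#' followed by one or more allowed characters: letters, digits, '_' and '-'.
    private static final Pattern MENTION = Pattern.compile("#([A-Za-z0-9_-]+)");

    // Returns every distinct mentioned name in the text, lowercased, without the '#'.
    static Set<String> extractMentions(String text) {
        Set<String> mentioned = new LinkedHashSet<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            mentioned.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return mentioned;
    }

    public static void main(String[] args) {
        System.out.println(extractMentions("hi #FOO!! and #bar_1, also #foo")); // prints [foo, bar_1]
    }
}
```

Because the pattern is applied to the whole text, there is no need to split on whitespace first.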

I'm not a Java person, but you can easily match Twitter usernames without the "#" using the following regex:
(?<=#)[\w-]+
Of course, you would need to escape special characters properly for Java, but since I have no clue about Java, you would have to do that, and the actual matching, yourself.

Related

Censoring bad words in a string

I am trying to create a function to replace all the bad words in a string with an asterisk in the middle, and here is what I came up with.
public class Censor {
public static String AsteriskCensor(String text){
String[] word_list = text.split("\\s+");
String result = "";
ArrayList<String> BadWords = new ArrayList<String>();
BadWords.add("word1");
BadWords.add("word2");
BadWords.add("word3");
BadWords.add("word4");
BadWords.add("word5");
BadWords.add("word6");
ArrayList<String> WordFix = new ArrayList<String>();
WordFix.add("w*rd1");
WordFix.add("w*rd2");
WordFix.add("w*rd3");
WordFix.add("w*rd4");
WordFix.add("w*rd5");
WordFix.add("w*rd6");
int index = 0;
for (String i : word_list)
{
if (BadWords.contains(i)){
word_list[index] = WordFix.get(BadWords.indexOf(i));
index++;
}
}
for (String i : word_list)
result += i + ' ';
return result;
}
My idea was to break it down into single words, then replace the word if you encounter a bad word, but so far it is not doing anything. Can someone tell me where I went wrong? I am quite new to the language.

If you move the index++ out of the if statement, your code works fine.
However, it won't work properly if a punctuation mark immediately follows a word to be censored. For example, in the sentence "We have word1 to word6, and they are censored", only "word1" will be censored, because of the comma immediately following "word6".
I personally would approach this differently. Instead of maintaining two lists, you could also create a Map which maps the bad words to their censored counterparts:
static String censor(String text) {
Map<String, String> filters = Map.of(
"hello", "h*llo",
"world", "w*rld",
"apple", "*****"
);
for (var filter : filters.entrySet()) {
text = text.replace(filter.getKey(), filter.getValue());
}
return text;
}
Of course, this code is still a little naive, because it will also filter words like 'applet', since 'applet' contains 'apple'. That's probably not what you want.
Instead, we need to tweak the code a little, so the found words must be whole words, that is, not part of another word. You can fix this by replacing the body of the for loop by this:
String pattern = "\\b" + Pattern.quote(filter.getKey()) + "\\b";
text = text.replaceAll(pattern, filter.getValue());
It replaces text using a regular expression. The \b is a word-boundary anchor: it matches only at the start or end of a word, so each filter applies to whole words only. This way, words like 'dapple' and 'applet' are no longer matched.
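Put together, the whole-word version could be sketched roughly as follows. The WordCensor class name is made up, the filter words are the ones from the example above, and Matcher.quoteReplacement is an extra precaution (not in the snippet above) in case a censored form ever contains '$' or '\':

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCensor {
    // Maps each bad word to its censored form (sample data from the answer).
    private static final Map<String, String> FILTERS = new LinkedHashMap<>();
    static {
        FILTERS.put("hello", "h*llo");
        FILTERS.put("world", "w*rld");
        FILTERS.put("apple", "*****");
    }

    static String censor(String text) {
        for (Map.Entry<String, String> filter : FILTERS.entrySet()) {
            // \b on both sides restricts the match to whole words.
            String pattern = "\\b" + Pattern.quote(filter.getKey()) + "\\b";
            text = text.replaceAll(pattern, Matcher.quoteReplacement(filter.getValue()));
        }
        return text;
    }

    public static void main(String[] args) {
        // prints: An applet is not an *****, w*rld!
        System.out.println(censor("An applet is not an apple, world!"));
    }
}
```

Note that punctuation right after a word no longer prevents the match, since a comma or exclamation mark also forms a word boundary.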

Extracting digits in the middle of a string using delimiters

String ccToken = "";
String result = "ssl_transaction_type=CCGETTOKENssl_result=0ssl_token=4366738602809990ssl_card_number=41**********9990ssl_token_response=SUCCESS";
String[] elavonResponse = result.split("=|ssl");
for (String t : elavonResponse) {
System.out.println(t);
}
ccToken = (elavonResponse[6]);
System.out.println(ccToken);
I want to be able to grab a specific part of a string and store it in a variable. The way I'm currently doing it, is by splitting the string and then storing the value of the cell into my variable. Is there a way to specify that I want to store the digits after "ssl_token="?
I want my code to be able to obtain the value of ssl_token without having to worry about changes in the string that are not related to the token, since I won't have control over the string. I have searched online but I can't find answers for my specific problem, or I may be using the wrong search terms.

You can use replaceAll with this regex .*ssl_token=(\\d+).* :
String number = result.replaceAll(".*ssl_token=(\\d+).*", "$1");
Outputs
4366738602809990

You can do it with regex. It would probably be better to change the specifications of the input string so that each key/value pair is separated by an ampersand (&) so you could split it (similar to HTTP POST parameters).
Pattern p = Pattern.compile(".*ssl_token=([0-9]+).*");
Matcher m = p.matcher(result);
if(m.matches()) {
long token = Long.parseLong(m.group(1));
System.out.println(String.format("token: [%d]", token));
} else {
System.out.println("token not found");
}
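A variation on the same idea, as a rough sketch: using find() with a pattern that names only the key means the surrounding text does not have to be matched at all, and returning the token as a String avoids assuming it always fits in a long. The TokenExtractor class and extractToken method are hypothetical names:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenExtractor {
    // The literal key followed by one or more digits, captured in group 1.
    private static final Pattern TOKEN = Pattern.compile("ssl_token=(\\d+)");

    // Returns the digits after "ssl_token=", or an empty Optional if the key is absent.
    static Optional<String> extractToken(String response) {
        Matcher m = TOKEN.matcher(response);
        if (m.find()) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String result = "ssl_transaction_type=CCGETTOKENssl_result=0ssl_token=4366738602809990"
                + "ssl_card_number=41**********9990ssl_token_response=SUCCESS";
        System.out.println(extractToken(result).orElse("token not found")); // prints 4366738602809990
    }
}
```

The key "ssl_token_response=" is not picked up by mistake, because the pattern requires "=" immediately after "ssl_token".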

Find the index of ssl_token in the string and create a substring starting just past the "ssl_token=" prefix. Then convert the leading digits of that substring to a number, which is straightforward because the number sits at the very beginning of the substring.
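The steps above can be sketched without any regex, roughly like this. TokenByIndex and extractToken are made-up names, and returning null for a missing key is just one possible choice:

```java
class TokenByIndex {
    // Returns the run of digits that follows "ssl_token=", or null if the key is missing.
    static String extractToken(String response) {
        String key = "ssl_token=";
        int start = response.indexOf(key);
        if (start < 0) {
            return null; // key not present
        }
        start += key.length(); // skip past the key itself
        int end = start;
        // Advance over the leading digits only.
        while (end < response.length() && Character.isDigit(response.charAt(end))) {
            end++;
        }
        return response.substring(start, end);
    }

    public static void main(String[] args) {
        String result = "ssl_transaction_type=CCGETTOKENssl_result=0ssl_token=4366738602809990"
                + "ssl_card_number=41**********9990ssl_token_response=SUCCESS";
        long token = Long.parseLong(extractToken(result));
        System.out.println(token); // prints 4366738602809990
    }
}
```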

java create variable from regex findings

I'm pretty new to Java, but I am looking to create a String variable from a regex finding. But I am not too sure how.
Basically I need: previous_identifier = (all the text in nextline up to the third comma);
Something maybe like this?
previous_identifier = line.split("^(.+?),(.+?),(.+?),");
Or:
line = reader.readLine();
Pattern courseColumnPattern = Pattern.compile("^(.+?),(.+?),(.+?),");
previous_identifier = (courseColumnPattern.matcher(line).find());
But I know that won't work. What should I do differently?

You can use split to return an array of Strings, then use a StringBuilder to build your return string. An advantage of this approach is being able to easily return the first four strings, two strings, ten strings, etc.
int limit = 3, current = 0;
StringBuilder sb = new StringBuilder();
// Used as an example of input
String str = "test,west,best,zest,jest";
String[] strings = str.split(",");
for(String s : strings) {
if(++current > limit) {
// We've reached the limit; bail
break;
}
if(current > 1) {
// Add a comma if it's not the first element. Alternative is to
// append a comma each time after appending s and remove the last
// character
sb.append(",");
}
sb.append(s);
}
System.out.println(sb.toString()); // Prints "test,west,best"
If you don't need to use the three elements separately (you truly want just the first three elements in a chunk), you can use a Matcher with the following regular expression:
String str = "test, west, best, zest, jest";
// Matches against "non-commas", then a comma, then "non-commas", then
// a comma, then "non-commas". This way, you still don't have a trailing
// comma at the end.
Matcher match = Pattern.compile("^([^,]*,[^,]*,[^,]*)").matcher(str);
if(match.find())
{
// Print out the output!
System.out.println(match.group(1));
}
else
{
// We didn't have a match. Handle it here.
}
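As a shorter variant of the split approach (a sketch, not what the code above does): split accepts a limit argument, so the line can be cut into at most limit + 1 pieces and the first ones joined back together. The FirstFields class and firstFields method are hypothetical names:

```java
import java.util.Arrays;

class FirstFields {
    // Returns the first `count` comma-separated fields of `line`, joined by commas.
    static String firstFields(String line, int count) {
        // With a limit of count + 1, everything after the count-th comma stays in the last piece.
        String[] parts = line.split(",", count + 1);
        int n = Math.min(count, parts.length);
        return String.join(",", Arrays.copyOfRange(parts, 0, n));
    }

    public static void main(String[] args) {
        System.out.println(firstFields("test,west,best,zest,jest", 3)); // prints test,west,best
        System.out.println(firstFields("only,two", 3));                 // prints only,two
    }
}
```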

Your regex will work, but could be expressed more briefly. This is how you can "extract" it:
String head = str.replaceAll("((.+?,){3}).*", "$1");
This matches the whole string, while capturing the target, with the replacement being the captured input using a back reference to group 1.
Despite the downvote, here's proof the code works!
String str = "foo,bar,baz,other,stuff";
String head = str.replaceAll("((.+?,){3}).*", "$1");
System.out.println(head);
Output:
foo,bar,baz,

Try an online regex tester to work out the regex. I think you need fewer brackets to get the entire text; I'd guess something like:
([^,]+,[^,]+,[^,]+)
This says: find everything that isn't a comma, then a comma, then everything that isn't a comma, then a comma, then everything else that isn't a comma. I suspect this can be improved dramatically; I am not a regex expert.
Then your Java just needs to compile it and match it against your string:
line = reader.readLine();
Pattern courseColumnPattern = Pattern.compile("([^,]+,[^,]+,[^,]+)");
Matcher matcher = courseColumnPattern.matcher(line);
if (matcher.find()) {
previous_identifier = matcher.group(1);
}

Replace text with data & matched group contents

I don't believe I saw this when searching (believe me, I spent a good amount of time searching for this) for a solution to this so here goes.
Goal:
Match regex in a string and replace it with something that contains the matched value.
Regex used currently:
\b(Connor|charries96|Foo|Bar)\b
For the record, I suck at regex, in case this isn't the best way to do it.
My current code (and several other methods I tried) can only replace the text with the first match it encounters if there are multiple matches.
private Pattern regexFromList(List<String> input) {
if(input.size() < 1) {
return "";
}
StringBuilder builder = new StringBuilder();
builder.append("\\b");
builder.append("(");
for(String s : input) {
builder.append(s);
if(!s.equals(input.get(input.size() - 1)))
{
builder.append("|");
}
}
builder.append(")");
builder.append("\\b");
return Pattern.compile(builder.toString(), Pattern.CASE_INSENSITIVE);
}
Example input:
charries96's name is Connor.
Example result using TEST as the data to prepend the match with
TESTcharries96's name is TESTcharries96.
Desired result using example input:
TESTcharries96's name is TESTConnor.
Here is my current code for replacing the text:
if(highlight) {
StringBuilder builder = new StringBuilder();
Matcher match = pattern.matcher(event.getMessage());
String string = event.getMessage();
if (match.find()) {
string = match.replaceAll("TEST" + match.group());
// I do realise I'm using #replaceAll but that's mainly given it gives me the same result as other methods so why not just cut to the chase.
}
builder.append(string);
return builder.toString();
}
EDIT:
Working example of desired result on RegExr

There are a few problems here:
You are taking the user input as-is and building the regex from it:
builder.append(s);
If the user input contains special characters, they may be interpreted as regex metacharacters and cause unexpected behavior.
Always use Pattern.quote if you want to match a string literally, exactly as it is passed in:
builder.append(Pattern.quote(s));
Matcher.replaceAll is a high-level function: it resets the Matcher (starting the search all over again), finds all the matches, and performs the replacement on each of them. Your code calls match.group() once, after the first find, so that first match is baked into the replacement string used for every occurrence. In your case, the fix can be as simple as:
String result = match.replaceAll("TEST$1");
The StringBuilder should be thrown away, along with the if statement.
Matcher.find and Matcher.group are lower-level functions that give fine-grained control over what you do with each match.
When you perform the replacement yourself, you need to build the result with Matcher.appendReplacement and Matcher.appendTail.
A while loop (instead of an if statement) should be used with Matcher.find to search for and replace all matches.
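A rough sketch of the lower-level route, combining the points above. The Highlighter class and prefixMatches method are invented names, the empty-list check is an assumption about how you want to handle that case, and StringBuffer is used because the appendReplacement overload taking a StringBuilder only exists from Java 9 on:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class Highlighter {
    // Builds \b(name1|name2|...)\b with every name quoted so it is matched literally.
    static Pattern regexFromList(List<String> names) {
        if (names.isEmpty()) {
            throw new IllegalArgumentException("at least one name is required");
        }
        String alternatives = names.stream().map(Pattern::quote).collect(Collectors.joining("|"));
        return Pattern.compile("\\b(" + alternatives + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    // Prepends `prefix` to every match, one match at a time.
    static String prefixMatches(Pattern pattern, String message, String prefix) {
        Matcher m = pattern.matcher(message);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the replacement text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(prefix + m.group()));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = regexFromList(Arrays.asList("Connor", "charries96", "Foo", "Bar"));
        // prints: TESTcharries96's name is TESTConnor.
        System.out.println(prefixMatches(p, "charries96's name is Connor.", "TEST"));
    }
}
```

For a plain prefix, replaceAll("TEST$1") is shorter; the loop only pays off when the replacement has to be computed per match.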

String splitting

I have a string. What is the best way to put the things between the $ signs into a list in Java?
String temp = "$abc$and$xyz$";
How can I get all the variables within the $ signs as a list in Java?
[abc, xyz]
I can do it using StringTokenizer, but I want to avoid it if possible.
Thanks

Maybe you could think about calling String.split(String regex) ...

The pattern is simple enough that String.split should work here, but in the more general case, one alternative for StringTokenizer is the much more powerful java.util.Scanner.
String text = "$abc$and$xyz$";
Scanner sc = new Scanner(text);
while (sc.findInLine("\\$([^$]*)\\$") != null) {
System.out.println(sc.match().group(1));
} // abc, xyz
The pattern to find is:
\$([^$]*)\$
  \_____/     i.e. literal $, a sequence of anything but $ (captured in group 1)
     1        and another literal $
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
(…) is used for grouping. (pattern) is a capturing group and creates a backreference.
The backslash preceding the $ (outside of a character class definition) is used to escape the $, which has a special meaning as the end-of-line anchor. That backslash is doubled in a String literal: "\\" is a String of length one containing a backslash.
This is not a typical usage of Scanner (usually the delimiter pattern is set, and tokens are extracted using next), but it does show how you'd use findInLine to find an arbitrary pattern (ignoring delimiters), and then use match() to access the MatchResult, from which you can get individual group captures.
You can also use this Pattern in a Matcher find() loop directly.
Matcher m = Pattern.compile("\\$([^$]*)\\$").matcher(text);
while (m.find()) {
System.out.println(m.group(1));
} // abc, xyz
Related questions
Validating input using java.util.Scanner
Scanner vs. StringTokenizer vs. String.Split

Just try this one: temp.split("\\$");

I would go for a regex myself, like Riduidel said.
This special case is, however, simple enough that you can just treat the String as a character sequence, and iterate over it char by char, and detect the $ sign. And so grab the strings yourself.
On a side note, I would try to go for different demarcation characters, to make it more readable to humans. Use $ as start-of-sequence and something else as end-of-sequence, for instance. Or something like what I think the Bash shell uses: ${some_value}. As said, the computer doesn't care, but you debugging your string just might :)
As for an appropriate regex, something like (\\$.*\\$)* or so should do. Though I'm no expert on regexes (see http://www.regular-expressions.info for nice info on regexes).

Basically I'd ditto Khotyn as the easiest solution. I see you post on his answer that you don't want zero-length tokens at beginning and end.
That brings up the question: What happens if the string does not begin and end with $'s? Is that an error, or are they optional?
If it's an error, then just start with:
if (!text.startsWith("$") || !text.endsWith("$"))
return "Missing $'s"; // or whatever you do on error
If that passes, fall into the split.
If the $'s are optional, I'd just strip them out before splitting. i.e.:
if (text.startsWith("$"))
text=text.substring(1);
if (text.endsWith("$"))
text=text.substring(0,text.length()-1);
Then do the split.
Sure, you could make more sophisticated regex's or use StringTokenizer or no doubt come up with dozens of other complicated solutions. But why bother? When there's a simple solution, use it.
PS There's also the question of what result you want to see if there are two $'s in a row, e.g. "$foo$$bar$". Should that give ["foo","bar"], or ["foo","","bar"]? Khotyn's split will give the second result, with zero-length strings. If you want the first result, you should split("\\$+").
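As a sketch of that last suggestion: instead of stripping the outer $'s first, the variant below splits on runs of $ and simply skips any empty pieces, which handles both optional outer $'s and doubled $'s. The DollarSplit class and tokens method are made-up names:

```java
import java.util.ArrayList;
import java.util.List;

class DollarSplit {
    // Splits on one or more '$' and keeps only the non-empty pieces.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String part : text.split("\\$+")) {
            // A leading '$' produces an empty first piece; skip it.
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("$abc$and$xyz$")); // prints [abc, and, xyz]
        System.out.println(tokens("$foo$$bar$"));    // prints [foo, bar]
    }
}
```

Note that a plain split treats "and" as a token too; getting only [abc, xyz] requires matching $...$ pairs, as in the Scanner and Matcher answers.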

If you want a simple split function, use Apache Commons Lang, which has StringUtils.split. The built-in Java one takes a regex, which can be overkill or confusing.

You can do it in simple manner writing your own code.
Just use the following code and it will do the job for you
import java.util.ArrayList;
import java.util.List;
public class MyStringTokenizer {
/**
* @param args
*/
public static void main(String[] args) {
List <String> result = getTokenizedStringsList("$abc$efg$hij$");
for(String token : result)
{
System.out.println(token);
}
}
private static List<String> getTokenizedStringsList(String string) {
List <String> tokenList = new ArrayList <String> ();
char [] in = string.toCharArray();
StringBuilder myBuilder = null;
int stringLength = in.length;
int start = -1;
int end = -1;
{
for(int i=0; i<stringLength;)
{
myBuilder = new StringBuilder();
while(i<stringLength && in[i] != '$')
i++;
i++;
while((i)<stringLength && in[i] != '$')
{
myBuilder.append(in[i]);
i++;
}
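// Skip the empty token that would otherwise be added after the final $
if (myBuilder.length() > 0) tokenList.add(myBuilder.toString());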
}
}
return tokenList;
}
}

You can use
String temp = "$abc$and$xyz$";
String array[]=temp.split(Pattern.quote("$"));
List<String> list=new ArrayList<String>();
for(int i=0;i<array.length;i++){
list.add(array[i]);
}
Now the list has what you want.
