How to best strip out certain strings in a file?

How to best strip out certain strings in a file? - java

If I have a file with the following content:
11:17 GET this is my content #2013
11:18 GET this is my content #2014
11:19 GET this is my content #2015
How can I use a Scanner and ignore certain parts of a `String line = scanner.nextLine();?
The result that I like to have would be:
this is my content
this is my content
this is my content
So I'd like to trip everything from the start until GET, and then take everything until the # char.
How could this easily be done?

You can use the String.indexOf(String str) and String.indexOf(char ch) methods. For example:
String line = scanner.nextLine();
int start = line.indexOf("GET");
int end = line.indexOf('#');
String result = line.substring(start + 4, end);

One way might be
String strippedStart = scanner.nextLine().split(" ", 3)[2];
String result = strippedStart.substring(0, strippedStart.lastIndexOf("#")).trim();
This assumes the are always two space separated tokens at the beginning (11:22 GET or 11:33 POST, idk).

You could do something like this:-
String line ="11:17 GET this is my content #2013";
int startIndex = line.indexOf("GET ");
int endIndex = line.indexOf("#");
line = line.substring(startIndex+4, endIndex-1);
System.out.println(line);

In my opinion the best solution for your problem would be using Java regex. Using regex you can define which group or groups of text you want to retrieve and what kind of text comes where. I haven't been working with Java in a long time, so I'll try to help you out from the top of my head. I'll try to give you a point in the right direction.
First off, compile a pattern:
Pattern pattern = Pattern.compile("^\d{1,2}:\d{1,2} GET (.*?) #\d+$", Pattern.MULTILINE);
First part of the regex says that you expect one or two digits followed by a colon followed by one or two digits again. After that comes the GET (you can use GET|POST if you expect those words or \w+? if you expect any word). Then you define the group you want with the parentheses. Lastly, you put the hash and any number of digits with at least one digit. You might consider putting flags DOTALL and CASE_INSENSITIVE, although I don't think you'll be needing them.
Then you continue with the matcher:
Matcher matcher = pattern.matcher(textToParse);
while (matcher.find())
{
//extract groups here
String group = matcher.group(1);
}
In the while loop you can use matcher.group(1) to find the text in the group you selected with the parentheses (the text you'd like extracted). matcher.group(0) gives the entire find, which is not what you're currently looking for (I guess).
Sorry for any errors in the code, it has not been tested. Hope this puts you on the right track.

You can try this rather flexible solution:
Scanner s = new Scanner(new File("data"));
Pattern p = Pattern.compile("^(.+?)\\s+(.+?)\\s+(.*)\\s+(.+?)$");
Matcher m;
while (s.hasNextLine()) {
m = p.matcher(s.nextLine());
if (m.find()) {
System.out.println(m.group(3));
}
}
This piece of code ignores first, second and last words from every line before printing them.
Advantage is that it relies on whitespaces rather than specific string literals to perform the stripping.

Related

How to find a String of last 2 items in colon separated string

I have a string = ab:cd:ef:gh. On this input, I want to return the string ef:gh (third colon intact).
The string apple:orange:cat:dog should return cat:dog (there's always 4 items and 3 colons).
I could have a loop that counts colons and makes a string of characters after the second colon, but I was wondering if there exists some easier way to solve it.

You can use the split() method for your string.
String example = "ab:cd:ef:gh";
String[] parts = example.split(":");
System.out.println(parts[parts.length-2] + ":" + parts[parts.length-1]);

String example = "ab:cd:ef:gh";
String[] parts = example.split(":",3); // create at most 3 Array entries
System.out.println(parts[2]);

The split function might be what you're looking for here. Use the colon, like in the documentation as your delimiter. You can then obtain the last two indexes, like in an array.

Yes, there is easier way.
First, is by using method split from String class:
String txt= "ab:cd:ef:gh";
String[] arr = example.split(":");
System.out.println(arr[arr.length-2] + " " + arr[arr.length-1]);
and the second, is to use Matcher class.

Use overloaded version of lastIndexOf(), which takes the starting index as 2nd parameter:
str.substring(a.lastIndexOf(":", a.lastIndexOf(":") - 1) + 1)

Another solution would be using a Pattern to match your input, something like [^:]+:[^:]+$. Using a pattern would probably be easier to maintain as you can easily change it to handle for example other separators, without changing the rest of the method.
Using a pattern is also likely be more efficient than String.split() as the latter is also converting its parameter to a Pattern internally, but it does more than what you actually need.
This would give something like this:
String example = "ab:cd:ef:gh";
Pattern regex = Pattern.compile("[^:]+:[^:]+$");
final Matcher matcher = regex.matcher(example);
if (matcher.find()) {
// extract the matching group, which is what we are looking for
System.out.println(matcher.group()); // prints ef:gh
} else {
// handle invalid input
System.out.println("no match");
}
Note that you would typically extract regex as a reusable constant to avoid compiling the pattern every time. Using a constant would also make the pattern easier to change without looking at the actual code.

Filter and find integers in a String with Regex

I have this long string:
String responseData = "fker.phone.bash,0,0,0"
+ "fker.phone.bash,0,0,0"
+ "fker.phone.bash,2,0,0";
What I want to do is to extract the integers in this string. I have successfully done that with this code:
String pattern = "(\\d+)";
// this pattern finds EVERY integer. I only want the integers after the comma
Pattern pr = Pattern.compile(pattern);
Matcher match = pr.matcher(responseData);
while (match.find()) {
System.out.println(match.group());
}
So far it is working, but I want to make my regex more secure because the responsedata I get is dynamic. Sometimes I might get an integer in the middle of the string, but I only want the last integers, meaning after the comma.
I know the regex for starts with is ^ and I have to put my comma tecken as an argument, but I don't know how to piece it all together and that is why I am asking for help. Thank you.

String pattern = "(,)(\\d)+";
Then get the second group.

You can use positive lookbehind for that:
String pattern = "(?<=,)\\d+";
You don't need to extract any groups to do use that solution, because lookbehind is zero-length assertion.

You can simply use the following and find by match.group(1):
String pattern = ",(\\d+)";
See working demo
You can also use word boundaries to get independent numbers:
String pattern = "\\b(\\d+)\\b";

Java Regex is including new line in match

I'm trying to match a regular expression to textbook definitions that I get from a website.
The definition always has the word with a new line followed by the definition. For example:
Zither
Definition: An instrument of music used in Austria and Germany It has from thirty to forty wires strung across a shallow sounding board which lies horizontally on a table before the performer who uses both hands in playing on it Not to be confounded with the old lute shaped cittern or cithern
In my attempts to get just the word (in this case "Zither") I keep getting the newline character.
I tried both ^(\w+)\s and ^(\S+)\s without much luck. I thought that maybe ^(\S+)$ would work, but that doesn't seem to successfully match the word at all. I've been testing with rubular, http://rubular.com/r/LPEHCnS0ri; which seems to successfully match all my attempts the way I want, despite the fact that Java doesn't.
Here's my snippet
String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\S+)$");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
String result = mtch.group();
terms.add(new SearchTerm(result, System.nanoTime()));
}
This is easily solved by triming the resulting string, but that seems like it should be unnecessary if I'm already using a regular expression.
All help is greatly appreciated. Thanks in advance!

Try using the Pattern.MULTILINE option
Pattern rgx = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);
This causes the regex to recognise line delimiters in your string, otherwise ^ and $ just match the start and end of the string.
Although it makes no difference for this pattern, the Matcher.group() method returns the entire match, whereas the Matcher.group(int) method returns the match of the particular capture group (...) based on the number you specify. Your pattern specifies one capture group which is what you want captured. If you'd included \s in your Pattern as you wrote you tried, then Matcher.group() would have included that whitespace in its return value.

With regular expressions the first group is always the complete matching string. In your case you want group 1, not group 0.
So changing mtch.group() to mtch.group(1) should do the trick:
String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\w+)\s");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
String result = mtch.group(1);
terms.add(new SearchTerm(result, System.nanoTime()));
}

A late response, but if you are not using Pattern and Matcher, you can use this alternative of DOTALL in your regex string
(?s)[Your Expression]
Basically (?s) also tells dot to match all characters, including line breaks
Detailed information: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html

Just replace:
String result = mtch.group();
By:
String result = mtch.group(1);
This will limit your output to the contents of the capturing group (e.g. (\\w+)) .

Try the next:
/* The regex pattern: ^(\w+)\r?\n(.*)$ */
private static final REGEX_PATTERN =
Pattern.compile("^(\\w+)\\r?\\n(.*)$");
public static void main(String[] args) {
String input = "Zither\n Definition: An instrument of music";
System.out.println(
REGEX_PATTERN.matcher(input).matches()
); // prints "true"
System.out.println(
REGEX_PATTERN.matcher(input).replaceFirst("$1 = $2")
); // prints "Zither = Definition: An instrument of music"
System.out.println(
REGEX_PATTERN.matcher(input).replaceFirst("$1")
); // prints "Zither"
}

The best way to find out that part of the string is potencial RegEx match

how would you do this:
I have a string and some regexes. Then I iterate over the string and in every iteration I need to know if the part (string index 0 to string currently iterated index) of that string is possible full match of one or more given regexes in next iterations.
Thank you for help.

What about a code like this:
// all of *greedy* regexs into a list
List<String> regex = new ArrayList<String>();
// here is my text
String mytext = "...";
String tmp = null;
// iterate over letters of my text
for (int i = 0; i < mytext.length(); i++) {
// substring from 0. position till i. index
tmp = mytext.substring(0, i);
// append regex on sub text
for (String reg : regex ) {
Pattern p = Pattern.compile(reg);
Matcher m = p.matcher(tmp);
// if found, do smt
if (m.find() ) { bingo.. do smt! }
}
}

You could use Matcher.lookingAt() to try to match as much as possible from a given input, but not requiring the whole input to match (.matches() would require the full input to match and .find() would not require the match to start at the beginning).

I don't believe the Java regular expression API provides such "incremental" or "step-by-step" search.
What you could do however, is to formulate your expression using reluctant quantifiers.
[...] The reluctant quantifiers, however, take the opposite approach: They start at the beginning of the input string, then reluctantly eat one character at a time looking for a match. The last thing they try is the entire input string. [...]
If this isn't viable in your case, you could use the Matcher.setRegion method to incrementally increase the region used by the matcher.

So I've been searching for alternatives to Java's standart RegEx library and found one that does the job well - JRegex

Pattern, matcher in Java, REGEX help

I'm trying to just get rid of duplicate consecutive words from a text file, and someone mentioned that I could do something like this:
Pattern p = Pattern.compile("(\\w+) \\1");
StringBuilder sb = new StringBuilder(1000);
int i = 0;
for (String s : lineOfWords) { // line of words is a List<String> that has each line read in from txt file
Matcher m = p.matcher(s.toUpperCase());
// and then do something like
while (m.find()) {
// do something here
}
I tried looking at the m.end to see if I could create a new string, or remove the item(s) where the matches are, but I wasn't sure how it works after reading the documentation. For example, as a test case to see how it worked, I did:
if (m.find()) {
System.out.println(s.substring(i, m.end()));
}
To the text file that has: This is an example example test test test.
Why is my output This is?
Edit:
if I have an AraryList lineOfWords that reads each line from a line of .txt file and then I create a new ArrayList to hold the modified string. For example
List<String> newString = new ArrayList<String>();
for (String s : lineOfWords {
s = s.replaceAll( code from Kobi here);
newString.add(s);
}
but then it doesn't give me the new s, but the original s. Is it because of shallow vs deep copy?

Try something like:
s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
That regex is a bit stronger than yours - it checks for whole words (no partial matches), and gets rid of any number of consecutive repetitions.
The regex captures a first word: \b(\w+)\b, and then attempts to match spaces and repetitions of that word: (\s+\1)+. The final \b is to avoid partial matching of \1, as in "for formatting".

The first match is "ThIS IS an example...", so m.end() points to the end of the second "is". I'm not sure why you use i for the start index; try m.start() instead.
To improve your regex, use \b before and after the word to indicate that there should be word boundaries: (\\b\\w+\\b). Otherwise, as you're seeing, you'll get matches inside of words.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to best strip out certain strings in a file? - java

You can use the String.indexOf(String str) and String.indexOf(char ch) methods. For example: String line = scanner.nextLine(); int start = line.indexOf("GET"); int end = line.indexOf('#'); String result = line.substring(start + 4, end);

One way might be String strippedStart = scanner.nextLine().split(" ", 3)[2]; String result = strippedStart.substring(0, strippedStart.lastIndexOf("#")).trim(); This assumes the are always two space separated tokens at the beginning (11:22 GET or 11:33 POST, idk).

You could do something like this:- String line ="11:17 GET this is my content #2013"; int startIndex = line.indexOf("GET "); int endIndex = line.indexOf("#"); line = line.substring(startIndex+4, endIndex-1); System.out.println(line);

Related

How to find a String of last 2 items in colon separated string

Filter and find integers in a String with Regex

Java Regex is including new line in match

The best way to find out that part of the string is potencial RegEx match

Pattern, matcher in Java, REGEX help

Categories

Resources