Regular expression for extracting information - java

I have a csv file with the following data format
123,"12.5","0.6","15/9/2012 12:11:19"
These numbers are:
order number
price
discount rate
date and time of sale
I want to extract these data from the line.
I have tried the regular expression:
String line = "123,\"12.5\",\"0.6\",\"15/9/2012 12:11:19\"";
Pattern pattern = Pattern.compile("(\\W?),\"([\\d\\.\\-]?)\",\"([\\d\\.\\-]?)\",\"([\\W\\-\\:]?)\"");
Scanner scanner = new Scanner(line);
if(scanner.hasNext(pattern)) {
...
}else{
// Always goes here
}
It looks like my pattern is not correct, as it always goes to the else branch. What did I do wrong? Can someone suggest a solution for this?
Many thanks.

A regular expression seems overcomplicated for this. Try splitting on the most obvious delimiter between the elements, which is the comma, and then strip the quotes from each piece. Perhaps something like this:
final String info = "123,\"12.5\",\"0.6\",\"15/9/2012 12:11:19\"";
final String[] split = info.split(",");
final int orderNumber = Integer.parseInt(split[0]);
final double price = Double.parseDouble(split[1].replace("\"", ""));
final double discountRate = Double.parseDouble(split[2].replace("\"", ""));
final String date = split[3].replace("\"", "");

Regular expressions are very cumbersome for this type of work.
I suggest using a CSV library such as OpenCSV instead.
The library parses each line into a String array, and individual entries can then be converted as required. Here is an OpenCSV example for this specific problem:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null) {
int orderNumber = Integer.parseInt(nextLine[0]);
double price = Double.parseDouble(nextLine[1]);
double discountRate = Double.parseDouble(nextLine[2]);
...
}
Full documentation and examples can be found in the OpenCSV project documentation.

? in regex means "zero or one occurrence". You probably wanted to use + instead (one or more) so it could capture all the digits, points, colons, etc.
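To make the difference between the quantifiers concrete, here is a minimal sketch. The class and method names are made up for illustration; only Pattern.matches comes from the JDK.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns true only if the entire input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("\\d?", "123")); // false: ? allows at most one digit
        System.out.println(matchesWhole("\\d+", "123")); // true: + allows one or more digits
        System.out.println(matchesWhole("\\d+", ""));    // false: + needs at least one digit
        System.out.println(matchesWhole("\\d*", ""));    // true: * also accepts zero digits
    }
}
```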

scanner.hasNext(pattern)
from documentation
Returns true if the next complete token matches the specified pattern.
but the next token is 123,"12.5","0.6","15/9/2012 because Scanner splits tokens on whitespace by default.
There are also a few problems with your regex:
you used ? which means zero or one, where you should use * (zero or more) or + (one or more),
you used \\W at the start, but this matches only non-word characters and therefore excludes digits.
If you really want to use scanner and regex then try with
Pattern.compile("(\\d+),\"([^\"]+)\",\"([^\"]+)\",\"([^\"]+)\"");
and change used delimiter to new line mark with
scanner.useDelimiter(System.lineSeparator());

This is a possible solution to your situation:
String line = "123,\"12.5\",\"0.6\",\"15/9/2012 12:11:19\"";
Pattern pattern = Pattern.compile("([0-9]+),\\\"([0-9.]+)\\\",\\\"([0-9.]+)\\\",\\\"([0-9/:\\s]+)\\\"");
Scanner scanner = new Scanner(line);
scanner.useDelimiter("\n");
if(scanner.hasNext(pattern)) {
MatchResult result = scanner.match();
System.out.println("1st: " + result.group(1));
System.out.println("2nd: " + result.group(2));
System.out.println("3rd: " + result.group(3));
System.out.println("4th: " + result.group(4));
}else{
System.out.println("There");
}
Note that ? means 0 or 1 occurrences, meanwhile + means 1 or more.
Observe the use of 0-9 for digits. You can also use \d if you like. Because the date field contains a space, you must change the delimiter of the scanner, for example with scanner.useDelimiter("\n"), so that the whole line is treated as a single token.
The output of this snippet is:
1st: 123
2nd: 12.5
3rd: 0.6
4th: 15/9/2012 12:11:19
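If you do not need the Scanner at all, the same pattern idea works with a plain Matcher, and the captured groups can be converted to typed values. This is only a sketch: the SaleLineParser and Sale names are hypothetical, and the date format d/M/yyyy HH:mm:ss is an assumption based on the single sample line.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SaleLineParser {
    // One group per field: order number, price, discount rate, date and time.
    private static final Pattern LINE = Pattern.compile(
            "(\\d+),\"([\\d.]+)\",\"([\\d.]+)\",\"([^\"]+)\"");

    // Assumed layout: day/month/year followed by a 24-hour time.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("d/M/yyyy HH:mm:ss");

    // Hypothetical holder for the parsed values.
    static final class Sale {
        final int orderNumber;
        final double price;
        final double discountRate;
        final LocalDateTime soldAt;

        Sale(int orderNumber, double price, double discountRate, LocalDateTime soldAt) {
            this.orderNumber = orderNumber;
            this.price = price;
            this.discountRate = discountRate;
            this.soldAt = soldAt;
        }
    }

    static Sale parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line format: " + line);
        }
        return new Sale(
                Integer.parseInt(m.group(1)),
                Double.parseDouble(m.group(2)),
                Double.parseDouble(m.group(3)),
                LocalDateTime.parse(m.group(4), STAMP));
    }

    public static void main(String[] args) {
        Sale sale = parse("123,\"12.5\",\"0.6\",\"15/9/2012 12:11:19\"");
        System.out.println(sale.orderNumber + " " + sale.price + " "
                + sale.discountRate + " " + sale.soldAt);
    }
}
```

Matcher.matches() requires the whole line to fit the pattern, so a malformed line fails loudly instead of being partially parsed.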

Related

How to split a string by a newline and a fixed number of tabs like "\n\t" in Java?

My input string is the following:
String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
My intended result is
dir,
subdir1,
subdir2\n\t\tfile.ext
The requirement is to split the input by "\n\t" but not "\n\t\t".
A simple try of
String[] answers = input.split("\n\t");
also splits "\tfile.ext" from the last entry. Is there a simple regular expression to solve the problem? Thanks!
You can split on a newline and tab, and assert not a tab after it to the right.
\n\t(?!\t)
See a regex demo.
String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
String[] answers = input.split("\\n\\t(?!\\t)");
System.out.println(Arrays.toString(answers));
Output
[dir, subdir1, subdir2
file.ext]
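The same negative lookahead can be generalised to any nesting depth. The sketch below is an illustration only; splitAtDepth is a hypothetical helper, not something from the question.

```java
import java.util.Arrays;

class TabDepthSplitter {
    // Splits on a newline followed by exactly `depth` tabs.
    static String[] splitAtDepth(String input, int depth) {
        // (?!\t) is a negative lookahead: the match is rejected if one more tab follows.
        return input.split("\\n\\t{" + depth + "}(?!\\t)");
    }

    public static void main(String[] args) {
        String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
        // Depth 1 keeps "subdir2\n\t\tfile.ext" together as the last element.
        System.out.println(Arrays.toString(splitAtDepth(input, 1)));
        // Depth 2 splits only in front of file.ext.
        System.out.println(Arrays.toString(splitAtDepth(input, 2)));
    }
}
```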
If you are looking for a generic approach, it depends heavily on what format the input will generally have. If the format is the same for all possible inputs (dir\n\tdir2\n\tdir3\n\t\tfile.something), one way to do it is the following:
String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
String[] answers = input.split("\n\t");
for (int i = 1; i < answers.length; i++)
if (answers[i].contains("\t"))
answers[i-1] = answers[i-1] + "\n\t" + answers[i];
String[] answersFinal = Arrays.copyOf(answers, answers.length-1);
for (int i = 0; i < answersFinal.length; i++)
answersFinal[i] = answers[i];
for (String s : answersFinal)
System.out.println(s);
However this is not a good solution and I would suggest reformatting your input to include a special sequence of characters that you can use to split the input, for example:
String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
input = input.replaceAll("\n\t", "%%%").replaceAll("%%%\t", "\n\t\t");
And then split the input with '%%%', you will get your desired output.
But again, this highly depends on how generic you want it to be, the best solution is to use an overall different approach to achieve what you want, but I cannot provide it since I don't have enough information on what you are developing.
You can simply do:
String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
String[] modifiedInput = input.replaceAll("\n\t\t", "####").replaceAll("\n\t", "§§§§").replaceAll("####", "\n\t\t").split("§§§§");
Replace each \n\t\t which contain the \n\t
Replace each \n\t
Change back the \n\t\t as you seemingly want to preserve it
Make the split.
Not very efficient but still works fast enough if you won't use it in mass data situations.
This approach is more efficient because it uses only two splits, but it works only if there is a single element prefixed with \n\t\t and it comes at the end. Array access is cheap (O(1), constant time), so this is more code but fewer full passes over the string (replaceAll, split).
final String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
final String[] s1 = input.split("\n\t\t");
final String last = s1[s1.length - 1];
final String[] modifiedInput = s1[0].split("\n\t");
modifiedInput[modifiedInput.length -1] = modifiedInput[modifiedInput.length -1] + "\n\t\t" + last;

How to get the string between double quotes in a string in Java [duplicate]

For example, input will be like:
AddItem rt456 4 12 BOOK “File Structures” “Addison-Wesley” “Michael Folk”
and I want to read it all using a Scanner and put it in an array,
like:
info[0] = rt456
info[1] = 4
..
..
info[4] = File Structures
info[5] = Addison-Wesley
So how can I get the string between quotes?
EDIT: a part of my code->
public static void main(String[] args) {
String command;
String[] line = new String[6];
Scanner read = new Scanner(System.in);
Library library = new Library();
command = read.next();
if(command.matches("AddItem"))
{
line[0] = read.next(); // Serial Number
line[1] = read.next(); // Shelf Number
line[2] = read.next(); // Shelf Index
command = read.next(); // Type of the item. "Book" - "CD" - "Magazine"
if(command.matches("BOOK"))
{
line[3] = read.next(); // Name
line[4] = read.next(); // Publisher
line[5] = read.next(); // Author
Book yeni = new Book(line[0],Integer.parseInt(line[1]),Integer.parseInt(line[2]),line[3],line[4],line[5]);
}
}
}
so I use read.next to read String without quotes.
SOLVED BY USING REGEX AS
read.next("([^\"]\\S*|\".+?\")\\s*");
You can use StreamTokenizer for this in a pinch. If operating on a String, wrap it with a StringReader. If operating on a file just pass your Reader to it.
// Replace “ and ” with " to make parsing easier; do this only if you truly are
// using pretty quotes (as you are in your post).
inputString = inputString.replaceAll("[“”]", "\"");
StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(inputString));
tokenizer.resetSyntax();
tokenizer.whitespaceChars(0, 32);
tokenizer.wordChars(33, 255);
tokenizer.quoteChar('\"');
while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
// tokenizer.sval will contain the token
System.out.println(tokenizer.sval);
}
You will have to use an appropriate configuration for non-ASCII text, the above is just an example.
If you want to pull numbers out separately, then the default StreamTokenizer configuration is fine, although it uses double and provides no int numeric tokens. Annoyingly, it is not possible to simply disable number parsing without resetting the syntax from scratch.
If you don't want to mess with all this, you could also consider changing the input format to something more convenient, as in Steve Sarcinella's good suggestion, if it is appropriate.
As a reference, take a look at this: Scanner Docs
How you read from the scanner is determined by how you will present the data to your user.
If they are typing it all on one line:
Scanner scanner = new Scanner(System.in);
String result = "";
System.out.println("Enter Data:");
result = scanner.nextLine();
Otherwise if you split it up into input fields you could do:
Scanner scanner = new Scanner(System.in);
System.out.println("Enter Identifier:");
info[0] = scanner.nextLine();
System.out.println("Enter Num:");
info[1] = scanner.nextLine();
...
If you want to validate anything before assigning the data to a variable, try using scanner.next(""); where the quotes contain a regex pattern to match
EDIT:
Check here for regex info.
As an example, say I have a string
String foo = "The cat in the hat";
Regex (regular expressions) can be used to manipulate this string quickly and concisely. If I take that string and do foo = foo.replaceAll("\\s+", "");, this replaces every run of whitespace with nothing, thereby eliminating the whitespace. (It has to be replaceAll; String.replace treats its first argument as literal text, not as a regex.)
Breaking down the argument \\s+, we have \s, which matches any whitespace character.
The extra \ before \s is an escape character that lets the \s pass through the Java string literal to the regex engine.
The + means match the previous expression one or more times.
So foo, after running replaceAll, would be "Thecatinthehat".
The same regex logic applies to scanner.next(String regex);
Hopefully this helps a bit more, I'm not the best at explanation :)
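One detail worth checking for yourself: String.replace treats its first argument as literal text, while String.replaceAll treats it as a regex. A small sketch (the class and method names are made up for illustration):

```java
class WhitespaceDemo {
    // Removes every run of whitespace using a regex.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String foo = "The cat in the hat";
        // replace() looks for the literal characters \s+ and finds none,
        // so the string comes back unchanged.
        System.out.println(foo.replace("\\s+", ""));
        // replaceAll() interprets \s+ as "one or more whitespace characters".
        System.out.println(stripWhitespace(foo));
    }
}
```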
An alternative using a messy regular expression:
public static void main(String[] args) throws Exception {
Pattern p = Pattern.compile("^(\\w*)[\\s]+(\\w*)[\\s]+(\\w*)[\\s]+(\\w*)[\\s]+(\\w*)[\\s]+[“](.*)[”][\\s]+[“](.*)[”][\\s]+[“](.*)[”]");
Matcher m = p.matcher("AddItem rt456 4 12 BOOK “File Structures” “Addison-Wesley” “Michael Folk”");
if (m.find()) {
for (int i=1;i<=m.groupCount();i++) {
System.out.println(m.group(i));
}
}
}
That prints:
AddItem
rt456
4
12
BOOK
File Structures
Addison-Wesley
Michael Folk
I assumed the quotes are as you typed them in the question (“ and ”) and not ", so they don't need to be escaped.
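If the number of fields is not fixed, a shorter pattern with Matcher.find() may be easier to maintain than one long expression. The following is a sketch under the assumption that quoted values never contain a quote character themselves; QuotedTokenizer and tokenize are hypothetical names. It accepts both straight quotes and the curly quotes from the question (written as \u201C and \u201D).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokenizer {
    // Either a quoted run (straight or curly quotes) or a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile(
            "[\"\u201C]([^\"\u201D]*)[\"\u201D]|(\\S+)");

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            // Group 1 is set for quoted tokens (quotes stripped), group 2 for bare words.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "AddItem rt456 4 12 BOOK \u201CFile Structures\u201D "
                + "\u201CAddison-Wesley\u201D \u201CMichael Folk\u201D";
        for (String token : tokenize(line)) {
            System.out.println(token);
        }
    }
}
```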
You can try this. I have prepared the demo for your requirement
public static void main(String args[]) {
String str = "\"ABC DEF\"";
System.out.println(str);
String str1 = str.replaceAll("\"", "");
System.out.println(str1);
}
After reading, just replace the double quotes with an empty string.

How to best strip out certain strings in a file?

If I have a file with the following content:
11:17 GET this is my content #2013
11:18 GET this is my content #2014
11:19 GET this is my content #2015
How can I use a Scanner and ignore certain parts of a String line = scanner.nextLine();?
The result that I like to have would be:
this is my content
this is my content
this is my content
So I'd like to strip everything from the start until GET, and then take everything until the # char.
How could this easily be done?
You can use the String.indexOf(String str) and String.indexOf(char ch) methods. For example:
String line = scanner.nextLine();
int start = line.indexOf("GET");
int end = line.indexOf('#');
String result = line.substring(start + 4, end).trim(); // trim() drops the space before '#'
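Note that indexOf returns -1 when the marker is missing, which would make substring throw. A slightly more defensive sketch of the same idea follows; ContentExtractor and extract are hypothetical names, and returning null for a non-matching line is just one possible choice.

```java
class ContentExtractor {
    // Returns the text between "GET " and the last '#',
    // or null if either marker is missing or they are in the wrong order.
    static String extract(String line) {
        String marker = "GET ";
        int start = line.indexOf(marker);
        int end = line.lastIndexOf('#');
        if (start < 0 || end < start + marker.length()) {
            return null;
        }
        return line.substring(start + marker.length(), end).trim();
    }

    public static void main(String[] args) {
        System.out.println(extract("11:17 GET this is my content #2013"));
        System.out.println(extract("a line without the markers"));
    }
}
```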
One way might be
String strippedStart = scanner.nextLine().split(" ", 3)[2];
String result = strippedStart.substring(0, strippedStart.lastIndexOf("#")).trim();
This assumes there are always two space-separated tokens at the beginning (11:22 GET or 11:33 POST, idk).
You could do something like this:-
String line ="11:17 GET this is my content #2013";
int startIndex = line.indexOf("GET ");
int endIndex = line.indexOf("#");
line = line.substring(startIndex+4, endIndex-1);
System.out.println(line);
In my opinion the best solution for your problem would be using Java regex. Using regex you can define which group or groups of text you want to retrieve and what kind of text comes where. I haven't been working with Java in a long time, so I'll try to help you out from the top of my head. I'll try to give you a point in the right direction.
First off, compile a pattern:
Pattern pattern = Pattern.compile("^\\d{1,2}:\\d{1,2} GET (.*?) #\\d+$", Pattern.MULTILINE);
First part of the regex says that you expect one or two digits followed by a colon followed by one or two digits again. After that comes the GET (you can use GET|POST if you expect those words or \w+? if you expect any word). Then you define the group you want with the parentheses. Lastly, you put the hash and any number of digits with at least one digit. You might consider putting flags DOTALL and CASE_INSENSITIVE, although I don't think you'll be needing them.
Then you continue with the matcher:
Matcher matcher = pattern.matcher(textToParse);
while (matcher.find())
{
//extract groups here
String group = matcher.group(1);
}
In the while loop you can use matcher.group(1) to find the text in the group you selected with the parentheses (the text you'd like extracted). matcher.group(0) gives the entire find, which is not what you're currently looking for (I guess).
Sorry for any errors in the code, it has not been tested. Hope this puts you on the right track.
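For reference, here is a self-contained sketch of the approach described above, run against text that contains several lines. Note that the backslashes have to be doubled inside a Java string literal. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GetLineExtractor {
    // MULTILINE makes ^ and $ match at the start and end of every line.
    private static final Pattern LINE = Pattern.compile(
            "^\\d{1,2}:\\d{1,2} GET (.*?) #\\d+$", Pattern.MULTILINE);

    // Collects group 1 (the text between GET and the trailing #number) of every match.
    static List<String> contents(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = LINE.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "11:17 GET this is my content #2013\n"
                + "11:18 GET this is my content #2014\n"
                + "11:19 GET this is my content #2015";
        for (String content : contents(text)) {
            System.out.println(content);
        }
    }
}
```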
You can try this rather flexible solution:
Scanner s = new Scanner(new File("data"));
Pattern p = Pattern.compile("^(.+?)\\s+(.+?)\\s+(.*)\\s+(.+?)$");
Matcher m;
while (s.hasNextLine()) {
m = p.matcher(s.nextLine());
if (m.find()) {
System.out.println(m.group(3));
}
}
This piece of code skips the first, second and last words of every line and prints what remains.
Advantage is that it relies on whitespaces rather than specific string literals to perform the stripping.

Parsing string from the name

I am trying to parse the certain name from the filename.
The examples of File names are
xs_1234323_00_32
sf_12345233_99_12
fs_01923122_12_12
I used String parsedname= child.getName().substring(4.9) to get the 1234323 out of the first line. Instead, how do I format it for the above 3 to output only the middle numbers(between the two _)? Something using split?
one line solution
String n = str.replaceAll("\\D+(\\d+).+", "$1");
most efficient solution
int i = str.indexOf('_');
int j = str.indexOf('_', i + 1);
String n = str.substring(i + 1, j);
String [] tokens = filename.split("_");
/* xs_1234323_00_32 would be
[0]=>xs [1]=> 1234323 [2]=> 00 [3] => 32
*/
String middleNumber = tokens[1]; // index 1 holds 1234323, as listed above
You can try split with '_' as the delimiter.
The String.split method splits the string around matches of the given parameter. Use it like this:
String[] output = input.split("_");
Here output[1] will be your desired result,
and input will be something like
String input = "xs_1234323_00_32";
I would do this:
filename.split("_", 3)[1]
The second argument of split indicates the maximum number of pieces the string should be split into, in your case you only need 3. This will be faster than using the single-argument version of split, which will continue splitting on the delimiter unnecessarily.
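To see what the limit argument does, here is a short sketch (MiddleNumber and middle are hypothetical names). It assumes every file name contains at least one underscore; otherwise the [1] access throws ArrayIndexOutOfBoundsException.

```java
import java.util.Arrays;

class MiddleNumber {
    // With a limit of 3 the string is cut into at most three pieces;
    // the third piece keeps any remaining underscores.
    static String middle(String filename) {
        return filename.split("_", 3)[1];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("xs_1234323_00_32".split("_", 3)));
        // prints [xs, 1234323, 00_32]
        System.out.println(middle("xs_1234323_00_32"));
        System.out.println(middle("sf_12345233_99_12"));
        System.out.println(middle("fs_01923122_12_12"));
    }
}
```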

Splitting strings based on a delimiter

I am trying to break apart a very simple collection of strings that come in the forms of
0|0
10|15
30|55
etc etc. Essentially numbers that are separated by pipes.
When I use java's string split function with .split("|"). I get somewhat unpredictable results. white space in the first slot, sometimes the number itself isn't where I thought it should be.
Can anybody please help and give me advice on how I can use a reg exp to keep ONLY the integers?
I was asked to give the code trying to do the actual split. So allow me to do that in hopes to clarify further my problem :)
String temp = "0|0";
String[] splitString = temp.split("|");
results
\n
0
|
0
I am trying to get
0
0
only. Forever grateful for any help ahead of time :)
I still suggest using split(); by default it discards trailing empty strings. If you strip everything except pipes and digits from the string, split() gives you what you want. Alternatively, you can pass several delimiters to split() as a regex character class, and this should work:
String[] splited = yourString.split("[\\|\\s]+");
and the regex:
import java.util.regex.*;
Pattern pattern = Pattern.compile("\\d+(?=([\\|\\s\\r\\n]))");
Matcher matcher = pattern.matcher(yourString);
while (matcher.find()) {
System.out.println(matcher.group());
}
The pipe symbol is special in a regexp (it marks alternatives), you need to escape it. Depending on the java version you are using this could well explain your unpredictable results.
class t {
public static void main(String[] args)
{
String temp = "0|0";
String[] splitString = temp.split("\\|");
for (int i=0; i<splitString.length; i++)
System.out.println("splitString["+i+"] is " + splitString[i]);
}
}
outputs
splitString[0] is 0
splitString[1] is 0
Note that one backslash is the regexp escape character, but because a backslash is also the escape character in java source you need two of them to push the backslash into the regexp.
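If the double backslash is easy to forget, Pattern.quote builds the escaped form for you. A minimal sketch (PipeSplitDemo and splitOnPipe are names invented for the example):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PipeSplitDemo {
    // Pattern.quote turns the pipe into a literal, so no manual escaping is needed.
    static String[] splitOnPipe(String s) {
        return s.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        // Manual escaping and Pattern.quote give the same result.
        System.out.println(Arrays.toString("10|15".split("\\|")));
        System.out.println(Arrays.toString(splitOnPipe("10|15")));
        // Unescaped, the pattern means "empty OR empty" and matches between every character.
        System.out.println(Arrays.toString("10|15".split("|")));
    }
}
```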
You can replace the white space with pipes and then split. Remember to escape the pipe in the split pattern:
String test = "0|0 10|15 30|55";
test = test.replace(" ", "|");
String[] result = test.split("\\|");
Hope this helps.
You can use StringTokenizer.
String test = "0|0";
StringTokenizer st = new StringTokenizer(test, "|"); // the default delimiters are whitespace only, so pass the pipe explicitly
int firstNumber = Integer.parseInt(st.nextToken()); //will parse out the first number
int secondNumber = Integer.parseInt(st.nextToken()); //will parse out the second number
Of course you can always nest this inside of a while loop if you have multiple strings.
Also, you need to import java.util.* for this to work.
The pipe ('|') is a special character in regular expressions. It needs to be "escaped" with a '\' character if you want to use it as a regular character, unfortunately '\' is a special character in Java so you need to do a kind of double escape maneuver e.g.
String temp = "0|0";
String[] splitStrings = temp.split("\\|");
The Guava library has a nice class, Splitter, which is a much more convenient alternative to String.split(). The advantages are that you can choose to split the string on specific characters (like '|'), on specific strings, or with regexps, and you can choose what to do with the resulting parts (trim them, throw away empty parts etc.).
For example you can call
Iterable<String> parts = Splitter.on('|').trimResults().omitEmptyStrings().split("0|0");
This should work for you:
([0-9]+)
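A pattern like this is meant to be used with Matcher.find() in a loop rather than with split(). A minimal sketch follows; IntegerFinder and findIntegers are hypothetical names, and it assumes every digit run fits into an int (Integer.parseInt throws NumberFormatException otherwise).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IntegerFinder {
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // Collects every run of digits in the text, ignoring pipes, spaces and anything else.
    static List<Integer> findIntegers(String text) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            numbers.add(Integer.parseInt(m.group()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(findIntegers("0|0 10|15 30|55"));
    }
}
```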
Consider a scenario in which we have read a line from a CSV or XLS file as a string and need to separate the columns into an array of strings, depending on the delimiter.
Below is a code snippet that does this.
{ ...
....
String line = new BufferedReader(new FileReader("your file")).readLine();
String[] splittedString = StringSplitToArray(line, "\"");
...
....
}
public static String[] StringSplitToArray(String stringToSplit, String delimiter)
{
StringBuffer token = new StringBuffer();
Vector tokens = new Vector();
char[] chars = stringToSplit.toCharArray();
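for (int i=0; i < chars.length; i++) {
// part of this line was lost in the original post; the delimiter test below is a reconstruction
if (delimiter.indexOf(chars[i]) != -1) {
if (token.length() > 0) {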
tokens.addElement(token.toString());
token.setLength(0);
i++;
}
} else {
token.append(chars[i]);
}
}
if (token.length() > 0) {
tokens.addElement(token.toString());
}
// convert the vector into an array
String[] preparedArray = new String[tokens.size()];
for (int i=0; i < preparedArray.length; i++) {
preparedArray[i] = (String)tokens.elementAt(i);
}
return preparedArray;
}
The snippet above calls StringSplitToArray, which converts the line into a string array by splitting it on the delimiter passed to the method. The delimiter can be a comma (,) or a double quote (").
For more on this, follow this link : http://scrapillars.blogspot.in
