I am reading an extremely messy tab-delimited text file line by line and trying to extract the unique column names from it.
The problem is that it uses tabs as the field separator, but some column names contain spaces. I am using
String[] cols = line.split("\\t");
which does not seem to work properly, because in some cases it treats the spaces as separators. Is a regex a good solution? If so, what regex removes spaces from a string but keeps the tabs?
The data looks like:
Sever ID Name
12221 zxsz
Tab in a string literal is just "\t". "\\t" is a literal backslash followed by a "t", which the regex engine in turn interprets as a tab. Having said that, either form works for me:
public class Scratch2 {
public static void main(String[] args) {
String welk = "anna one\tanna two\tanna three";
System.out.println("\\t");
String[] annas = welk.split("\t");
for (String anna : annas) {
System.out.println(anna);
}
System.out.println("\\\\t");
annas = welk.split("\\t");
for (String anna : annas) {
System.out.println(anna);
}
}
}
Output:
\t
anna one
anna two
anna three
\\t
anna one
anna two
anna three
The simplest explanation is that your input strings don't contain the whitespace characters you think they do.
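One way to check that explanation is to print the code point of every whitespace character in a line before splitting it. The following is only a sketch and not part of the original answer: the class name WhitespaceProbe, the method describeWhitespace, and the sample line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceProbe {

    // Lists every whitespace character in the line as "index:U+XXXX",
    // so a tab (U+0009) can be told apart from a space (U+0020)
    // or a non-breaking space (U+00A0).
    public static List<String> describeWhitespace(String line) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c) || Character.isSpaceChar(c)) {
                found.add(String.format("%d:U+%04X", i, (int) c));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical header line: a space inside the first column name,
        // a tab between the two columns.
        String line = "Unit Price\tName";
        System.out.println(describeWhitespace(line));

        // Splitting on the tab keeps the embedded space in the column name.
        for (String col : line.split("\t")) {
            System.out.println("[" + col + "]");
        }
    }
}
```

If the report shows U+0020 where a tab was expected, the file is not really tab-delimited at that position, and no split pattern on "\t" alone will separate those columns.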
Related
I am trying to remove extra spaces from a string. To do this I used the normalizeSpace method of the StringUtils class, but it does not remove the spaces before and after "-".
public static void main(String[] args)
{
String test = " Hi - World Java";
System.out.println(StringUtils.normalizeSpace(test));
}
Actual output: "Hi - World Java"
Expected output: "Hi-World Java"
Any inputs?
Note: The solution in the question below applies while concatenating strings, whereas we have the data in a single string, so this is not a duplicate.
Remove spaces before a punctuation mark in a string
test = test.replaceAll("[ ]+"," ");
test = test.replaceAll("- ","-");
test = test.replaceAll(" -","-");
test = test.replaceAll("^\\s+","");
The utility removes all extra spaces but leaves one space. In other words, where it finds a sequence of more than one space, it removes all but one. So your result is as expected. If you need "Hi-World Java", you need your own logic, as described in some of the other answers here.
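The "own logic" can be written with plain String methods. This is a minimal sketch that assumes every hyphen in the input should lose its surrounding whitespace; the class name HyphenSpaces and the method tighten are hypothetical.

```java
public class HyphenSpaces {

    // Trims the ends, removes any whitespace directly around a hyphen,
    // then collapses the remaining runs of whitespace to a single space.
    public static String tighten(String input) {
        return input.trim()
                .replaceAll("\\s*-\\s*", "-")
                .replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(tighten("  Hi  -   World   Java ")); // prints Hi-World Java
    }
}
```

Doing the hyphen replacement before collapsing the spaces means the order of the two replaceAll calls does not leave a stray space next to the hyphen.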
I want to split a String on punctuation marks and whitespace, but keep the punctuation marks. For example:
String example = "How are you? I am fine!";
I want to have as a result
["How","are","you","?","I","am","fine","!"]
but instead I get
["how"," ","are"," ","you"," ","?"," ","i"," ","am"," ","fine"," ","!"].
what I used was example.toLowerCase().trim().split("(?<=\\b|[^\\p{L}])");
Why are you doing toLowerCase()? This already messes up your expected result. And why the trim() on the full string?
Doing this with a single split call is probably not too simple.
An alternative would be to just filter out the unwanted entries:
String example = "How are you? I am fine!";
Pattern pattern = Pattern.compile("\\b");
String[] result = pattern.splitAsStream(example)
.filter(Predicate.not(String::isBlank))
.toArray(String[]::new);
System.out.println(Arrays.toString(result));
Output:
[How, are, you, ? , I, am, fine, !]
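Reacting to your comment about wanting [How,are,you,?,I,am,fine,!] as output: simply don't print with Arrays.toString, but build the string yourself. The spaces after the commas come from Arrays.toString, not from the array. The one exception is the "? " element, which keeps its trailing space because \b does not split between "?" and the space; add .map(String::trim) before the filter to remove it.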
System.out.println("[" + String.join(",", result) + "]");
You can do it as follows:
import java.util.Arrays;
public class Main {
public static void main(String[] args) {
String example = "How are you? I am fine!";
String[] arr = example.split("\\s+|\\b(?=\\p{Punct})");
System.out.println(Arrays.toString(arr));
}
}
Output:
[How, are, you, ?, I, am, fine, !]
Explanation of the regex:
\\s+ matches one or more whitespace characters
\\b matches a word boundary
(?=\\p{Punct}) is a positive lookahead that requires the next character to be punctuation
| is the alternation (OR) between the two cases
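An alternative to split is to describe what a token looks like and collect the matches. This sketch is not from either answer above; the class name PunctTokenizer and the method tokenize are illustrative, and it assumes each punctuation character should become its own token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PunctTokenizer {

    // A token is either a single punctuation character or a run of
    // characters that are neither whitespace nor punctuation.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{Punct}|[^\\s\\p{Punct}]+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [How, are, you, ?, I, am, fine, !]
        System.out.println(tokenize("How are you? I am fine!"));
    }
}
```

Because whitespace is never part of a token, no filtering or trimming step is needed afterwards.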
I need to split a string into substrings in order to sort them into quoted and unquoted ones. The single quote character is used as the separator; two consecutive single quotes are an escape sequence and must not be used for splitting.
For example:
"111 '222''22' 3333"
should be split into
"111", "222''22", "3333"
no matter with or without whitespaces.
So, I wrote the following code, but it does not work. Tried lookbehind with "\\'(?<!\\')" as well, but with no success. Please help
String rgxSplit="\\'(?!\\')";
String text="";
Scanner s=new Scanner(System.in);
System.out.println("\""+rgxSplit+"\"");
text=s.nextLine();
while(!text.equals(""))
{
String [] splitted=text.split(rgxSplit);
for(int i=0;i<splitted.length;i++)
{
if(i%2==0)
{
System.out.println("+" + splitted[i]);
}
else
{
System.out.println("-" + splitted[i]);
}
}
text=s.nextLine();
}
Output:
$ java ParseTest
"\'(?!\')"
111 '222''22' 3333
+111
-222'
+22
- 3333
This should split on a single quote (when it is not doubled), and in the case of three consecutive, it will group the first two and will split on the third.
String [] splitted=text.split("(?<!') *' *(?!')|(?<='') *' *");
To split on single apostrophes, use lookarounds on both sides of the apostrophe:
String[] parts = str.split(" *(?<!')'(?!') *");
See live demo on ideone.
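For reference, the lookaround pattern can be tried on the sample string from the question like this. The class name QuoteSplit and the method splitOnSingleQuotes are made up for the example; the regex is the one from the answer above.

```java
import java.util.Arrays;

public class QuoteSplit {

    // Splits on a single quote that is neither preceded nor followed by
    // another quote, swallowing any spaces next to it.
    public static String[] splitOnSingleQuotes(String text) {
        return text.split(" *(?<!')'(?!') *");
    }

    public static void main(String[] args) {
        // prints [111, 222''22, 3333]
        System.out.println(Arrays.toString(splitOnSingleQuotes("111 '222''22' 3333")));
    }
}
```

Note that this pattern never splits inside a run of three or more quotes; that case is what the second alternative in the other regex is meant to cover.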
I'm looking for a regex to split the following strings
red 12478
blue 25 12375
blue 25, 12364
This should give
Keywords red, ID 12478
Keywords blue 25, ID 12375
Keywords blue IDs 25, 12364
Each line has 2 parts, a set of keywords and a set of IDs. Keywords are separated by spaces and IDs are separated by commas.
I came up with the following regex: \s*((\S+\s+)+?)([\d\s,]+)
However, it fails for the second one. I've been trying to work with lookahead, but can't quite work it out
I am trying to split the string into its component parts (keywords and IDs)
The format of each line is one or more space separated keywords followed by one or more comma separated IDs. IDs are numeric only and keywords do not contain commas.
I'm using Java to do this.
I found a two-line solution using replaceAll and split:
pattern = "(\\S+(?<!,)\\s+(\\d+\\s+)*)";
String[] keywords = theString.replaceAll(pattern+".*","$1").split(" ");
String[] ids = theString.split(pattern)[1].split(",\\s?");
I assumed that the comma will always be immediately after the ID for each ID (this can be enforced by removing spaces adjacent to a comma), and that there is no trailing space.
I also assumed that the first keyword is a sequence of non-whitespace chars (without trailing comma) \\S+(?<!,)\\s+, and the rest of the keywords (if any) are digits (\\d+\\s+)*. I made this assumption based on your regex attempt.
The regex here is very simple, just take (greedily) any sequence of valid keywords that is followed by a space (or whitespaces). The longest will be the list of keywords, the rest will be the IDs.
Full Code:
public static void main(String[] args){
String pattern = "(\\S+(?<!,)\\s+(\\d+\\s+)*)";
Scanner sc = new Scanner(System.in);
while(true){
String theString = sc.nextLine();
String[] keywords = theString.replaceAll(pattern+".*","$1").split(" ");
String[] ids = theString.split(pattern)[1].split(",\\s?");
System.out.println("Keywords:");
for(String keyword: keywords){
System.out.println("\t"+keyword);
}
System.out.println("IDs:");
for(String id: ids){
System.out.println("\t"+id);
}
System.out.println();
}
}
Sample run:
red 124
Keywords:
red
IDs:
124
red 25 124
Keywords:
red
25
IDs:
124
red 25, 124
Keywords:
red
IDs:
25
124
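The same split can also be sketched with a single anchored match and two capture groups instead of replaceAll plus split. This is an alternative under the stated format (space-separated keywords, then comma-separated numeric IDs), not the answer's method; KeywordIdParser, Parsed and parse are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordIdParser {

    // Simple holder for the two halves of a line.
    public static final class Parsed {
        public final List<String> keywords;
        public final List<String> ids;

        Parsed(List<String> keywords, List<String> ids) {
            this.keywords = keywords;
            this.ids = ids;
        }
    }

    // Group 1: the keywords, matched lazily so group 2 takes as much as it can.
    // Group 2: one number, optionally followed by more comma-separated numbers.
    private static final Pattern LINE =
            Pattern.compile("\\s*(.+?)\\s+(\\d+(?:\\s*,\\s*\\d+)*)\\s*");

    public static Parsed parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized line: " + line);
        }
        List<String> keywords = Arrays.asList(m.group(1).split("\\s+"));
        List<String> ids = Arrays.asList(m.group(2).split("\\s*,\\s*"));
        return new Parsed(keywords, ids);
    }

    public static void main(String[] args) {
        for (String line : new String[] {"red 12478", "blue 25 12375", "blue 25, 12364"}) {
            Parsed p = parse(line);
            System.out.println("Keywords " + p.keywords + " IDs " + p.ids);
        }
    }
}
```

A number followed by a space rather than a comma ends up among the keywords, which matches the "blue 25 12375" case in the question.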
I came up with:
(red|blue)( \d+(?!$)(?:, \d+)*)?( \d+)?$
as illustrated in http://rubular.com/r/y52XVeHcxY which seems to pass your tests. It's a straightforward matter to insert your keywords between the match substrings.
Ok since the OP didn't specify a target language, I am willing to tilt at this windmill over lunch as a brain teaser and provide a C#/.Net Regex replace with match evaluator which gives the required output:
Keywords red, ID 12478
Keywords blue 25 ID 12375
Keywords blue IDs 25, 12364
Note that there is no error checking. This is a fine example of using a lambda expression as the match evaluator to build the replacement according to the rules. Also note that, because the sample data is so small, it does not handle multiple IDs or keywords, as may actually be required.
string data = @"red 12478
blue 25 12375
blue 25, 12364";
var pattern = @"(?xmn) # x=IgnorePatternWhiteSpace m=multiline n=explicit capture
^
(?<Keyword>[^\s]+) # Match Keyword Color
[\s,]+
(
(?<Numbers>\d+)
(?<HasComma>,)? # If there is a comma that signifies IDs
[,\s]*
)+ # 1 or more values
$";
Console.WriteLine (Regex.Replace(data, pattern, (mtch) =>
{
StringBuilder sb = new StringBuilder();
sb.AppendFormat("Keywords {0}", mtch.Groups["Keyword"].Value);
var values = mtch.Groups["Numbers"]
.Captures
.OfType<Capture>()
.Select (cp => cp.Value)
.ToList();
if (mtch.Groups["HasComma"].Success)
{
sb.AppendFormat(" IDs {0}", string.Join(", ", values));
}
else
{
if (values.Count() > 1)
sb.AppendFormat(" {0} ID {1}", values[0], values[1] );
else
sb.AppendFormat(", ID {0}", values[0]);
}
return sb.ToString();
}));
Here is my basic problem: I am reading some lines in from a file. The format of each line in the file is this:
John Doe 123
There is a tab between Doe and 123.
I'm looking for a regex such that I can "pick off" the John Doe. Something like scanner.next(regular expression) that would give me the John Doe.
This is probably very simple, but I can't seem to get it to work. Also, I'm trying to figure this out without having to rely on the tab being there.
I've looked here: Regular Expression regex to validate input: Two words with a space between. But none of these answers worked. I kept getting runtime errors.
Some Code:
while(inFile.hasNextLine()){
String s = inFile.nextLine();
Scanner string = new Scanner(s);
System.out.println(s); // check to make sure I got the string
System.out.println(string.next("[A-Za-z]+ [A-Za-z]+")); // This doesn't work for me
System.out.println(string.next("\\b[A-Za-z ]+\\b")); // Nor does this
}
Are you required to use regex for this? You could simply use a split method across \t on each line and just grab the first or second element (I'm not sure which you meant by 'pick off' john doe).
It would help if you provided the code you're trying that is giving you runtime errors.
You could use regex:
[A-Za-z]+ [A-Za-z]+
if you always knew your name was going to be two words.
You could also try
\b[A-Za-z ]+\b
which matches any number of words (made up of letters) and makes sure it captures whole words (that's what the '\b' is for), so that it returns "John Doe" instead of "John Doe " (with the trailing space). Don't forget that backslashes need to be escaped in Java string literals.
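A likely cause of the runtime errors in the question is that Scanner.next(pattern) tests the pattern against the next whitespace-delimited token only. With the default delimiter that token is "John", so a pattern containing a space cannot match and the call throws a NoSuchElementException (typically its subclass InputMismatchException). Scanner.findInLine ignores delimiters, so it can be used instead. The sketch below assumes the name consists of letters separated by single spaces; NameExtractor and extractName are illustrative names.

```java
import java.util.Scanner;

public class NameExtractor {

    // Returns the first run of space-separated alphabetic words in the line,
    // or null if the line contains none.
    public static String extractName(String line) {
        try (Scanner scanner = new Scanner(line)) {
            // findInLine searches the line without regard to delimiters,
            // so the pattern is allowed to contain a space.
            return scanner.findInLine("[A-Za-z]+(?: [A-Za-z]+)*");
        }
    }

    public static void main(String[] args) {
        System.out.println(extractName("John Doe\t123")); // prints John Doe
    }
}
```

Because the pattern stops at the first character that is neither a letter nor a single space, it does not rely on the tab being present.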
This basically works to isolate John Doe from the rest...
public String isolateAndTrim( String candidate ) {
// This pattern isolates "John Doe" as a group...
Pattern pattern = Pattern.compile( "(\\w+\\s+\\w+)\\s+\\d*" );
Matcher matcher = pattern.matcher( candidate );
String clean = "";
if ( matcher.matches() ) {
clean = matcher.group( 1 );
// This replace all reduces away extraneous whitespace...
clean = clean.replaceAll( "\\s+", " " );
}
return clean;
}
The grouping parentheses allow you to "pick off" the name portion from the digit portion. "John Doe", "Jane Austin", whatever. You should learn how grouping works in regex, as it is well suited to problems like this one.
The trick to remove the extra whitespace comes from How to remove duplicate white spaces in string using Java?
Do you prefer simplicity and readability? If so, consider the following solution
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
public class MyLineScanner
{
public static void readLine(String source_file) throws FileNotFoundException
{
File source = new File(source_file);
Scanner line_scanner = new Scanner(source);
while(line_scanner.hasNextLine())
{
String line = line_scanner.nextLine();
// echo the line to confirm it was read
System.out.println(line);
// split the line on tabs; this works for me
Scanner words_scanner = new Scanner(line);
words_scanner.useDelimiter("\t");
while (words_scanner.hasNext())
{
System.out.format("word : %s %n", words_scanner.next());
}
}
}
public static void main(String[] args) throws FileNotFoundException
{
readLine("source.txt");
}
}