How to use regex with String.split() - java

I have the following String:
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
I want to convert it to an array of String that looks like this:
String[] Title = {"Title1 Title2","Title3 Title4","Title5 Title6","Title7"};
I am trying the following code:
String[] Title = fullPDFContex.split("\r\n\r\n|\r\n \r\n|\r\n");
But I am not getting the desired output.

You need to split with a pattern that matches any amount of whitespace that contains a line break:
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
String separator = "\\p{javaWhitespace}*\\R\\p{javaWhitespace}*";
String results[] = fullPDFContex.split(separator);
System.out.println(Arrays.toString(results));
// => [Title1 Title2, Title3 Title4, Title5 Title6, Title7]
See the Java demo.
The pattern \\p{javaWhitespace}*\\R\\p{javaWhitespace}* matches:
\\p{javaWhitespace}* - zero or more whitespace chars
\\R - a line break sequence (\\R is available from Java 8; for Java 7 and older you may replace it with [\r\n])
\\p{javaWhitespace}* - zero or more whitespace chars.
Alternatively, you may use a slightly more efficient pattern:
String separator = "[\\s&&[^\r\n]]*\\R\\s*";
See another demo
Unfortunately, the \R construct cannot be used inside character classes, so CR and LF are excluded explicitly instead. The pattern will match:
[\\s&&[^\r\n]]* - zero or more whitespace chars other than CR and LF (character class subtraction, written as an intersection with a negated class)
\\R - a line break
\\s* - any 0+ whitespace chars.
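To compare the two separators side by side, here is a minimal sketch that runs both against the string from the question. The class name SeparatorDemo and the helper titles() are made up for illustration, and \R assumes Java 8 or later.

```java
import java.util.Arrays;

public class SeparatorDemo {
    // The two separators discussed above.
    static final String UNICODE_WS = "\\p{javaWhitespace}*\\R\\p{javaWhitespace}*";
    static final String HORIZONTAL_FIRST = "[\\s&&[^\r\n]]*\\R\\s*";

    // Hypothetical helper: splits the text with the given separator regex.
    static String[] titles(String text, String separator) {
        // split() drops trailing empty strings, so the final run of line breaks adds nothing.
        return text.split(separator);
    }

    public static void main(String[] args) {
        String text = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
        System.out.println(Arrays.toString(titles(text, UNICODE_WS)));
        System.out.println(Arrays.toString(titles(text, HORIZONTAL_FIRST)));
        // both lines print [Title1 Title2, Title3 Title4, Title5 Title6, Title7]
    }
}
```

Both patterns swallow the blanks around each line break, which is why " Title7 " comes out trimmed.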

Here is another solution. We can use StringTokenizer; I have used a List to collect the split values, which helps when you do not know in advance how many values the string will be split into.
package com.sujit;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
public class UserInput {
public static void main(String[] args) {
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
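// Tokenize on CR and LF; a run of delimiters is treated as a single separator.
StringTokenizer token = new StringTokenizer(fullPDFContex, "\r\n");
List<String> list = new ArrayList<>();
while (token.hasMoreTokens()) {
// Trim each token and skip blank ones, so " " and " Title7 " do not leak into the result.
String title = token.nextToken().trim();
if (!title.isEmpty()) {
list.add(title);
}
}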
for (String string : list) {
System.out.println(string);
}
}
}

With this code you get the output you want:
String[] Title = fullPDFContex.split(" *(\r\n ?)+ *");

Related

How to split records based on white space when different lines have spaces at different positions

I have a file with records as below, and I am trying to split each record on white space and join the fields with commas.
file:
a 3w 12 98 header P6124
e 4t 2 100 header I803
c 12L 11 437 M12
BufferedReader reader = new BufferedReader(new FileReader("/myfile.txt"));
String line = reader.readLine();
while (line != null) {
    System.out.println(line);
    // Split the current line before reading the next one, so the final null is never split.
    String[] splitLine = line.split("\\s+");
    line = reader.readLine();
}
When the data is separated by multiple white spaces, I usually split with a regex: split("\\s+") or split(" +").
But in the above case, record c does not have the value header. Hence the regex "\\s+" or " +" just skips the missing field, and I get c,12L,11,437,M12 instead of c,12L,11,437,,M12.
How do I properly split the lines in this case so that I get the data in the format below:
a,3w,12,98,header,P6124
e,4t,2,100,header,I803
c,12L,11,437,,M12
Could anyone let me know how I can achieve this ?
Maybe you can try a more complicated approach: use a regex that matches exactly six fields on each line and explicitly handles a missing value for the fifth one.
I rewrote your example and added some console logging to clarify my suggestion:
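import java.io.BufferedReader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {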
private static final String Input = "a 3w 12 98 header P6124\n" +
"e 4t 2 100 header I803\n" +
"c 12L 11 437 M12";
public static void main(String[] args) throws Exception {
BufferedReader reader = new BufferedReader(new StringReader(Input));
String line = null;
Pattern pattern = Pattern.compile("^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");
do {
line = reader.readLine();
System.out.println(line);
if(line != null) {
String[] splitLine = line.split("\\s+");
System.out.println(splitLine.length);
System.out.println("Line: " + line);
Matcher matcher = pattern.matcher(line);
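boolean matches = matcher.matches();
System.out.println("matches: " + matches);
System.out.println("groups: " + matcher.groupCount());
// group(i) throws IllegalStateException unless the match succeeded.
for(int i = 1; matches && i <= matcher.groupCount(); i++) {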
System.out.printf(" Group %d has value '%s'\n", i, matcher.group(i));
}
}
} while (line != null);
}
}
The key is that the pattern used to match each line requires a sequence of six fields:
for each field, the value is described as [^ ]+ (one or more non-space characters)
separators between fields are described as " +" (one or more spaces)
the fifth (nullable) field is described as ([^ ]+)?, so the whole group is optional
each value is captured as a group using parentheses: ( ... )
start (^) and end ($) of each line are marked explicitly
Then each line is matched against the pattern, yielding six groups: you can access each group using matcher.group(index), where index is 1-based because group(0) returns the full match. A group that did not take part in the match, such as a missing fifth field, is returned as null.
This is a more complex approach but I think it can help you to solve your problem.
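As a rough sketch of how the six groups could be turned into the comma-separated output asked for, the helper below maps a missing fifth group to an empty string. The class name SixFieldSplit, the method toCsv() and the spacing in the sample lines are assumptions for illustration; the pattern only finds the gap if at least two spaces remain where the fifth value is missing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SixFieldSplit {
    // Same pattern as above: six fields, the fifth one optional.
    private static final Pattern LINE = Pattern.compile(
            "^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");

    // Hypothetical helper: returns the fields joined with commas, or null if the line does not match.
    static String toCsv(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= m.groupCount(); i++) {
            if (i > 1) {
                sb.append(',');
            }
            String value = m.group(i);            // null when the optional group did not participate
            sb.append(value == null ? "" : value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample lines with assumed spacing; the real file may be aligned differently.
        System.out.println(toCsv("a 3w 12 98 header P6124")); // a,3w,12,98,header,P6124
        System.out.println(toCsv("c 12L 11 437   M12"));      // c,12L,11,437,,M12
    }
}
```

A line where the missing value leaves only a single space is indistinguishable from a five-field record, so toCsv() returns null for it.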
Put a limit on the number of whitespace chars that may be used to split the input.
In the case of your example data, a maximum of 5 works:
String[] splitLine = line.split("\\s{1,5}");
See live demo (of this code working as desired).
Are you just trying to switch your delimiters from spaces to commas?
In that case:
cat myFile.txt | sed -E 's/ {3,}/  /g' | sed 's/ /,/g'
*edit: added a stage to strip out runs of more than two spaces, replacing them with just the two spaces needed to retain the double comma.

how to get character length of the unicode along with space in java

I need to find the length of my string "பாரதீய ஜனதா இளைஞர் அணி தலைவர் அனுராக்சிங் தாகூர் எம்.பி. நேற்று தேர்தல் ஆணையர் வி.சம்பத்". I got a string length of 45, but I expect it to be 59. I need to extend the regular expression so that it also counts spaces and dots (.). My code:
import java.util.*;
import java.lang.*;
import java.util.regex.*;
class UnicodeLength
{
public static void main (String[] args)
{
String s="பாரதீய ஜனதா இளைஞர் அணி தலைவர் அனுராக்சிங் தாகூர் எம்பி நேற்று தேர்தல் ஆணையர் விசம்பத்";
List<String> characters=new ArrayList<String>();
Pattern pat = Pattern.compile("\\p{L}\\p{M}*");
Matcher matcher = pat.matcher(s);
while (matcher.find()) {
characters.add(matcher.group());
}
// Test if we have the right characters and length
System.out.println(characters);
System.out.println("String length: " + characters.size());
}
}
The code below worked for me. There were three issues that I fixed:
I added a check for spaces to your regular expression.
I added a check for punctuation to your regular expression.
I pasted the string from your comment into the string in your code. They weren't the same!
Here's the code:
public static void main(String[] args) {
String s = "பாரதீய ஜனதா இளைஞர் அணி தலைவர் அனுராக்சிங் தாகூர் எம்.பி. நேற்று தேர்தல் ஆணையர் வி.சம்பத்";
List<String> characters = new ArrayList<String>();
Pattern pat = Pattern.compile("\\p{P}|\\p{L}\\p{M}*| ");
Matcher matcher = pat.matcher(s);
while (matcher.find()) {
characters.add(matcher.group());
}
// Test if we have the right characters and length
int i = 1;
for (String character : characters) {
System.out.println(String.format("%d = [%s]", i++, character));
}
System.out.println("Characters Size: " + characters.size());
}
It's probably worth pointing out that your code is remarkably similar to the solution to another Stack Overflow question. One comment on that solution in particular led me to discover the missing check for punctuation in your code, and allowed me to notice that the string from your comment didn't match the string in your code.
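To see the difference between String.length() and the count produced by this regex on a string small enough to check by hand, here is a minimal sketch. The class name VisibleCharCount and the helper units() are invented for the example; the pattern is the one from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VisibleCharCount {
    // Punctuation, or a letter followed by its combining marks, or a space.
    private static final Pattern UNIT = Pattern.compile("\\p{P}|\\p{L}\\p{M}*| ");

    // Hypothetical helper: returns each matched unit as its own string.
    static List<String> units(String s) {
        List<String> result = new ArrayList<>();
        Matcher matcher = UNIT.matcher(s);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a' + COMBINING ACUTE ACCENT, a space, 'b', a dot
        String s = "a\u0301 b.";
        System.out.println(s.length());      // 5 UTF-16 code units
        System.out.println(units(s).size()); // 4 units: letter+mark, space, letter, dot
    }
}
```

Anything that is not a letter, mark, punctuation character or plain space (digits, tabs, symbols) is skipped by this pattern and therefore not counted.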

extracting sentences which contain a particular word

I want to get the sentences in a text file which contain a particular keyword. I have tried a lot but am not able to get the proper sentences that contain the keyword. I have more than one set of keywords; if any of them matches within the paragraph, that sentence should be taken.
For example, if my text file contains words like robbery, robbed etc., then that sentence should be extracted. Below is the code which I tried. Is there any way to solve this using regex? Any help will be appreciated.
BufferedReader br1 = new BufferedReader(new FileReader("/home/pgrms/Documents/test/one.txt"));
String str="";
while(br1 .ready())
{
str+=br1 .readLine() +"\n";
}
Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher match = re.matcher(str);
String sentenceString="";
while (match .find())
{
sentenceString=match.group(0);
System.out.println(sentenceString);
}
Here is an example for when you have a list of predefined keywords:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.*;
public class Tester {
public static void main(String [] args){
try {
BufferedReader br1 = new BufferedReader(new FileReader("input"));
String[] words = {"robbery","robbed", "robbers"};
String word_re = words[0];
String str="";
for (int i = 1; i < words.length; i++)
word_re += "|" + words[i];
word_re = "[^.]*\\b(" + word_re + ")\\b[^.]*[.]";
// Join lines with a space so that words at line ends do not run together.
while(br1.ready()) { str += br1.readLine() + " "; }
Pattern re = Pattern.compile(word_re,
Pattern.MULTILINE | Pattern.COMMENTS |
Pattern.CASE_INSENSITIVE);
Matcher match = re.matcher(str);
String sentenceString="";
while (match .find()) {
sentenceString = match.group(0);
System.out.println(sentenceString);
}
} catch (Exception e) { e.printStackTrace(); }
}
}
This creates a regex of the form:
[^.]*\b(robbery|robbed|robbers)\b[^.]*[.]
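As a self-contained sketch of the same idea that works on an in-memory string instead of a file, the helper below builds the alternation with String.join and Pattern.quote. The class name KeywordSentences, the method sentencesWith() and the sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordSentences {
    // Hypothetical helper: returns every period-terminated sentence containing one of the keywords.
    static List<String> sentencesWith(String text, String... keywords) {
        List<String> quoted = new ArrayList<>();
        for (String keyword : keywords) {
            quoted.add(Pattern.quote(keyword)); // keywords are matched literally
        }
        // Same shape as above: [^.]*\b(kw1|kw2|...)\b[^.]*[.]
        Pattern pattern = Pattern.compile(
                "[^.]*\\b(?:" + String.join("|", quoted) + ")\\b[^.]*[.]",
                Pattern.CASE_INSENSITIVE);
        List<String> sentences = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            sentences.add(matcher.group().trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The bank was robbed. Nobody was hurt. Police caught the robbers.";
        System.out.println(sentencesWith(text, "robbery", "robbed", "robbers"));
        // prints [The bank was robbed., Police caught the robbers.]
    }
}
```

Like the original pattern, this treats only the period as a sentence terminator, so abbreviations such as "Dr." will cut a sentence short.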
In general, to check whether a sentence contains rob, robbery or robbed, you can add a lookahead after the beginning-of-string anchor, before the rest of your regex pattern:
(?=.*(?:rob|robbery|robbed))
In this case, it is more efficient to group the common prefix rob and then check for the possible suffixes:
(?=.*(?:rob(?:bery|bed)?))
In your Java code, we can (for instance) modify your loop like this. Note that String.matches() must match the whole string, so the lookahead is followed by .*, and (?s) lets . also match the line breaks inside a sentence:
while (match.find())
{
    sentenceString=match.group(0);
    if (sentenceString.matches("(?s)(?=.*(?:rob(?:bery|bed)?)).*")) {
        System.out.println(sentenceString);
    }
}
Explain Regex (the lookahead, without the (?s) flag)
(?=           # look ahead to see if there is:
  .*          #   any character except \n (0 or more times
              #   (matching the most amount possible))
  (?:         #   group, but do not capture:
    rob       #     'rob'
    (?:       #     group, but do not capture (optional
              #     (matching the most amount possible)):
      bery    #       'bery'
     |        #      OR
      bed     #       'bed'
    )?        #     end of grouping
  )           #   end of grouping
)             # end of look-ahead
Take a look at the ICU Project and icu4j. It does boundary analysis, so it splits sentences and words for you, and will do it for different languages.
For the rest, you can either match the words against a Pattern (as others have suggested), or check it against a Set of the words you're interested in.
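As a rough illustration of the boundary-analysis approach combined with a Set lookup, the sketch below uses the JDK's own java.text.BreakIterator rather than ICU4J, which is a substitution made here only to keep the example dependency-free. The class name SentenceFilter, the method name and the sample text are assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SentenceFilter {
    // Hypothetical helper: keeps the sentences that contain at least one keyword as a whole word.
    static List<String> sentencesContaining(String text, Set<String> keywords) {
        List<String> hits = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            String sentence = text.substring(start, end).trim();
            // Lower-case the sentence and split it on non-word characters to get its words.
            for (String word : sentence.toLowerCase(Locale.ENGLISH).split("\\W+")) {
                if (keywords.contains(word)) {
                    hits.add(sentence);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("robbery", "robbed", "robbers"));
        String text = "The bank was robbed. Nobody was hurt. Police caught the robbers.";
        for (String sentence : sentencesContaining(text, keywords)) {
            System.out.println(sentence);
        }
    }
}
```

The keywords in the set are expected in lower case, because each sentence is lower-cased before the lookup.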

Splitting String using split method

I want to split a string like this:
C:\Program\files\images\flower.jpg
but, using the following code:
String[] tokens = s.split("\\");
String image= tokens[4];
I obtain this error:
11-07 12:47:35.960: E/AndroidRuntime(6921): java.util.regex.PatternSyntaxException: Syntax error U_REGEX_BAD_ESCAPE_SEQUENCE near index 1:
Try:
String s="C:\\Program\\files\\images\\flower.jpg";
String[] tokens = s.split("\\\\");
In Java's regex syntax, \ is a metacharacter. To match it literally, escape it with an extra \ or enclose it in \Q...\E.
Below are some of the metacharacters:
<([{\^-=$!|]})?*+.>
To treat any of the characters listed above as a normal character, either escape it with \ or wrap it in \Q and \E, like:
\\\\ or \\Q\\\\E
You need to split with \\\\, because the regex engine must see \\ in order to match one literal backslash. Try it yourself with the following test case:
@Test
public void split(){
String s = "C:\\Program\\files\\images\\flower.jpg";
String[] tokens = s.split("\\\\");
String image= tokens[4];
assertEquals("flower.jpg",image);
}
There are two levels of interpretation: first the Java compiler turns "\\" into a single \, and that is what the regex engine sees. It is invalid because it is an escape sequence without a character to escape.
So you need to use s.split("\\\\"), so that the regex engine sees \\, which in turn means a literal \.
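To make the two levels concrete, here is a small sketch that splits the path both with the hand-escaped regex and with Pattern.quote, which builds the \Q...\E form for you. The class name BackslashSplit and its helper names are made up for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class BackslashSplit {
    // Hand-escaped: the Java literal "\\\\" is the regex \\, which matches one backslash.
    static String[] splitEscaped(String path) {
        return path.split("\\\\");
    }

    // Pattern.quote wraps the single backslash in \Q...\E, so no regex escaping is needed.
    static String[] splitQuoted(String path) {
        return path.split(Pattern.quote("\\"));
    }

    public static void main(String[] args) {
        String s = "C:\\Program\\files\\images\\flower.jpg";
        System.out.println(Arrays.toString(splitEscaped(s)));
        // prints [C:, Program, files, images, flower.jpg]
        System.out.println(splitQuoted(s)[4]);
        // prints flower.jpg
    }
}
```

Pattern.quote is convenient when the delimiter comes from a variable and may contain other metacharacters.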
If you are defining that string in a string literal, you must escape the backslashes there as well:
String s = "C:\\Program\\files\\images\\flower.jpg";
String[] tokens=s.split("\\\\");
Try this:
String s = "C:/Program/files/images/flower.jpg";
String[] tokens = s.split("/");
String image = tokens[4];
Your original input text should be
C:\\Program\\files\\images\\flower.jpg
instead of
C:\Program\files\images\flower.jpg
This works,
public static void main(String[] args) {
String str = "C:\\Program\\files\\images\\flower.jpg";
str = str.replace('\\', '/');
System.out.println(str);
String[] tokens = str.split("/");
System.out.println(tokens[4]);
}

Splitting Java string with quotation marks [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Can you recommend a Java library for reading (and possibly writing) CSV files?
I need to split the String in Java. The separator is the space character.
The string may include paired quotation marks (with some text and spaces inside); everything inside a pair of quotation marks should be treated as a single token.
Example:
Input:
token1 "token 2" token3
Output: array of 3 elements:
token1
token 2
token3
How to do it?
Thanks!
Split twice. First on quotes, then on spaces.
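One way to read "split twice" is sketched below: after splitting on the quote character, every odd-numbered part was inside quotes and is kept whole, while the other parts are split on whitespace. The class name QuoteAwareSplit and the method tokenize() are invented for the example, and it assumes balanced, unescaped quotes.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplit {
    // Hypothetical helper: splits on spaces but keeps quoted text together.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        // First split: on the quote character. Parts at odd indexes were inside quotes.
        String[] parts = input.split("\"");
        for (int i = 0; i < parts.length; i++) {
            if (i % 2 == 1) {
                tokens.add(parts[i]);
            } else {
                // Second split: on whitespace, skipping the empty string an empty part produces.
                for (String token : parts[i].trim().split("\\s+")) {
                    if (!token.isEmpty()) {
                        tokens.add(token);
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("token1 \"token 2\" token3"));
        // prints [token1, token 2, token3]
    }
}
```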
Assuming that the other solutions will not work for you, because they do not properly detect matching quotes or ignore spaces within quoted text, try something like:
private void addTokens(String tokenString, List<String> result) {
    // Split on runs of whitespace and skip the empty tokens that leading whitespace produces.
    for (String token : tokenString.split("[\\r\\n\\t ]+")) {
        if (!token.isEmpty()) {
            result.add(token);
        }
    }
}
List<String> result = new ArrayList<String>();
while (input.contains("\"")) {
    String prefixTokens = input.substring(0, input.indexOf("\""));
    input = input.substring(input.indexOf("\"") + 1);
    String literalToken = input.substring(0, input.indexOf("\""));
    // Continue with the text after the closing quote.
    input = input.substring(input.indexOf("\"") + 1);
    addTokens(prefixTokens, result);
    result.add(literalToken);
}
addTokens(input, result);
Note that this won't handle unbalanced quotes, escaped quotes, or other cases of erroneous/malformed input.
import java.util.StringTokenizer;
class STDemo {
static String in = "token1;token2;token3";
public static void main(String args[]) {
StringTokenizer st = new StringTokenizer(in, ";");
while(st.hasMoreTokens()) {
String val = st.nextToken();
System.out.println(val);
}
}
}
This is an easy way to tokenize a string.
