validate file java


I have a flat file like:
A 10
S 20
W A 20 10
S A 45 10
S W S 20 20 20 30
W A S 22 50 20 55
I want to make sure it is well formed, (separated by blank space " ")
allowing only a regular expression like:
anyword* then " " then (word*|numbers*)*
where * is any number of words
but there is also one issue,
if there is only one word or char there is only one number
if there are 2 words or chars separated by " " then there must be 2 numbers separated by " "
if there are 3 words or chars separated by " " then there must be 4 numbers separated by " "
I was doing something like this, but do not know where to incorporate validation of line
try (BufferedReader input = new BufferedReader(new FileReader(filename))) {
    String line;
    while ((line = input.readLine()) != null) {
        String[] words = line.split(" ");
        if (words.length == 2) {
            // validation of the line would go here
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}

This regex should do it:
^[a-z]+ (?:\d+|[a-z]+(?: \d+ \d+| [a-z]+(?: \d+){4}))$
I tried to make it as short as possible, but it may be possible to condense it a bit more. Since your sample data uses uppercase letters, this should be used with case-insensitive matching enabled, or you should change all of the [a-z] to [a-zA-Z].
Here is a Rubular.
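A minimal sketch of how the regex above could be applied line by line in Java. The class and method names (FlatFileLineValidator, isWellFormed) are made up for illustration, and the pattern is compiled with Pattern.CASE_INSENSITIVE so that the uppercase sample data matches [a-z]:

```java
import java.util.regex.Pattern;

final class FlatFileLineValidator {
    // Regex from the answer above; CASE_INSENSITIVE lets "A", "S", "W" match [a-z].
    private static final Pattern LINE = Pattern.compile(
            "^[a-z]+ (?:\\d+|[a-z]+(?: \\d+ \\d+| [a-z]+(?: \\d+){4}))$",
            Pattern.CASE_INSENSITIVE);

    /** Returns true if the line has 1 word + 1 number, 2 words + 2 numbers, or 3 words + 4 numbers. */
    static boolean isWellFormed(String line) {
        return LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"A 10", "W A 20 10", "S W S 20 20 20 30", "W A 20", "A 10 20"};
        for (String line : samples) {
            System.out.println(line + " -> " + (isWellFormed(line) ? "valid" : "invalid"));
        }
    }
}
```

Inside the question's read loop, the call would replace the words.length check: read a line, pass it to isWellFormed, and report or reject the line when it returns false.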

Related

Java - String splitting

I read a txt with data in the following format: Name Address Hobbies
Example(Bob Smith ABC Street Swimming)
and Assigned it into String z
Then I used z.split to separate each field using " " as the delimiter(space) but it separated Bob Smith into two different strings while it should be as one field, same with the address. Is there a method I can use to get it in the particular format I want?
P.S Apologies if I explained it vaguely, English isn't my first language.
String z;
try {
BufferedReader br = new BufferedReader(new FileReader("desc.txt"));
z = br.readLine();
} catch(IOException io) {
io.printStackTrace();
}
String[] temp = z.split(" ");

If the format of name and address parts is fixed to consist of two parts, you could just join them:
String z = ""; // z must be initialized
// use try-with-resources to ensure the reader is closed properly
try (BufferedReader br = new BufferedReader(new FileReader("desc.txt"))) {
z = br.readLine();
} catch(IOException io) {
io.printStackTrace();
}
String[] temp = z.split(" ");
String name = String.join(" ", temp[0], temp[1]);
String address = String.join(" ", temp[2], temp[3]);
String hobby = temp[4];
Another option could be to create a format string as a regular expression and use it to parse the input line using named groups (?<group_name>capturing text):
// use named groups to define parts of the line
Pattern format = Pattern.compile("(?<name>\\w+\\s\\w+)\\s(?<address>\\w+\\s\\w+)\\s(?<hobby>\\w+)");
Matcher match = format.matcher(z);
if (match.matches()) {
String name = match.group("name");
String address = match.group("address");
String hobby = match.group("hobby");
System.out.printf("Input line matched: name=%s address=%s hobby=%s%n", name, address, hobby);
} else {
System.out.println("Input line not matching: " + z);
}

I can think of three solutions.
In order from best to worst:
Different delimiter
Enforce the format to always have two names, two address parts and one hobby
Have a dictionary with names and hobbies, check each word to determine which type it is and then group them together as needed.
(The 3rd option is not meant as a serious alternative.)

As others have mentioned, using spaces as both field delimiter and inside fields is problematic. You could use a regex pattern to split the line (paste (\w+ \w+) (\w+ \w+) (.+) in Regex101 for an explanation):
Pattern pattern = Pattern.compile("(\\w+ \\w+) (\\w+ \\w+) (.+)");
Matcher matcher = pattern.matcher("Bob Smith ABC Street Bowling Fishing Rollerblading");
System.out.println("matcher.matches() = " + matcher.matches());
for (int i = 0; i <= matcher.groupCount(); i++) {
System.out.println("matcher.group(" + i + ") = " + matcher.group(i));
}
This would give the following output:
matcher.matches() = true
matcher.group(0) = Bob Smith ABC Street Bowling Fishing Rollerblading
matcher.group(1) = Bob Smith
matcher.group(2) = ABC Street
matcher.group(3) = Bowling Fishing Rollerblading
However this only works for this exact format. If you get a line with three name parts for example:
John B Smith ABC Street Swimming
This will get split into John B as the name, Smith ABC as the address and Street Swimming as hobbies.
So either make 100% sure your input will always match this format or use a different delimiter.

The split() method works on two things:
the delimiter, and
the String object it is called on.
Optionally it also takes a limit.
Whatever delimiter and limit you provide, split() simply does its work according to them.
It does not understand whether the substring on the left is a name or not, and the same goes for the substring on the right.
Have a look at this code snippet:
String assets = "Gold:Stocks:Fixed Income:Commodity:Interest Rates";
String[] splits = assets.split(":");
System.out.println("splits.size: " + splits.length);
for (String asset : splits) {
    System.out.println(asset);
}
Output:
splits.size: 5
Gold
Stocks
Fixed Income // with space
Commodity
Interest Rates // with space
The spaces are preserved in the output because I provided : as the delimiter, not the space.
This should help you get to your answer.
Find Detailed Information on Split():
Top 5 Use cases of Split()
Java Docs : Split()

It depends on the data you're dealing with. Will the name always consist of a first and last name? Then you can simply combine the first two elements from the resulting array into a new string.
Otherwise, you might have to find a different way to separate out the different pieces within the txt file. Possibly a comma? Some character that you know won't ever be used in your normal data.

Assuming that every line follows the format
Bob Smith ABC Street Swimming
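i.e. first name, surname, and so on, this code merges the first two tokens into a single name field:
String[] temp = z.split(" ");
String[] temp2 = new String[temp.length - 1];
// join first name and surname into one element
temp2[0] = temp[0] + " " + temp[1];
// shift the remaining tokens one position to the left
for (int i = 2; i < temp.length; i++) {
    temp2[i - 1] = temp[i];
}
temp = temp2;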

How to split records based on white spaces when different lines have spaces at different positions

I have a file with records as below, and I am trying to split each record on white space and convert it to comma-separated values.
file:
a 3w 12 98 header P6124
e 4t 2 100 header I803
c 12L 11 437 M12
BufferedReader reader = new BufferedReader(new FileReader("/myfile.txt"));
String line = reader.readLine();
while (line != null) {
    System.out.println(line);
    String[] splitLine = line.split("\\s+");
    line = reader.readLine();
}
If the data is separated by multiple white spaces, I usually go for a regex split: split("\\s+") or split(" +").
But in the above case, record c doesn't have the value header. Hence the regex "\\s+" or " +" will just skip the missing field, and I will get c,12L,11,437,M12 instead of c,12L,11,437,,M12 with an empty field.
How do I properly split the lines based on any delimiter in this case so that I get data in the below format:
a,3w,12,98,header,P6124
e,4t,2,100,header,I803
c,12L,11,437,,M12
Could anyone let me know how I can achieve this ?

Maybe you can try a more elaborate approach: use a regex that matches exactly six fields on each line and explicitly handles a missing value for the fifth one.
I rewrote your example adding some console log in order to clarify my suggestion:
public class RegexTest {
private static final String Input = "a 3w 12 98 header P6124\n" +
"e 4t 2 100 header I803\n" +
"c 12L 11 437 M12";
public static void main(String[] args) throws Exception {
BufferedReader reader = new BufferedReader(new StringReader(Input));
String line = null;
Pattern pattern = Pattern.compile("^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");
do {
line = reader.readLine();
System.out.println(line);
if(line != null) {
String[] splitLine = line.split("\\s+");
System.out.println(splitLine.length);
System.out.println("Line: " + line);
Matcher matcher = pattern.matcher(line);
System.out.println("matches: " + matcher.matches());
System.out.println("groups: " + matcher.groupCount());
for(int i = 1; i <= matcher.groupCount(); i++) {
System.out.printf(" Group %d has value '%s'\n", i, matcher.group(i));
}
}
} while (line != null);
}
}
The key is that the pattern used to match each line requires a sequence of six fields:
for each field, the value is described as [^ ]+
separators between fields are described as " +" (one or more spaces)
the value of the fifth (nullable) field is described as ([^ ]+)?, i.e. an optional group
each value is captured as a group using parentheses: ( ... )
start (^) and end ($) of each line are marked explicitly
Then, each line is matched against the given pattern, obtaining six groups: you can access each group using matcher.group(index), where index is 1-based because group(0) returns the full match.
This is a more complex approach but I think it can help you to solve your problem.
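As a rough sketch, the same pattern could be used to produce the comma-separated output asked for in the question. The class and method names (SpaceToCsv, toCsv) are hypothetical, and the sample input assumes that the gap left by the missing field is wider than a single space, as in a column-aligned file:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpaceToCsv {
    // Same pattern as in the answer: six fields, the fifth one optional.
    private static final Pattern ROW = Pattern.compile(
            "^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");

    /** Returns the comma-separated form of the line, or null if the line does not match. */
    static String toCsv(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> fields = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            String value = m.group(i);
            // an optional group that did not take part in the match is null: emit an empty column
            fields.add(value == null ? "" : value);
        }
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        System.out.println(toCsv("a 3w 12 98 header P6124"));
        // assumed spacing: several spaces where the fifth field is missing
        System.out.println(toCsv("c 12L 11 437        M12"));
    }
}
```

If the missing field leaves only a single space behind, the line has five plain tokens and the pattern does not match at all, so toCsv returns null and the caller has to decide how to treat such a line.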

Put a limit on the number of whitespace chars that may be used to split the input.
In the case of your example data, a maximum of 5 works:
String[] splitLine = line.split("\\s{1,5}");
See live demo (of this code working as desired).

Are you just trying to switch your delimiters from spaces to commas?
In that case:
cat myFile.txt | sed 's/   */  /g' | sed 's/ /,/g'
*edit: added a stage to strip out lists of more than two spaces, replacing them with just the two spaces needed to retain the double comma.

How to access each element after a split

I am trying to read from a text file and split it into three separate categories. ID, address, and weight. However, whenever I try to access the address and weight I have an error. Does anyone see the problem?
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;
class Project1
{
public static void main(String[] args)throws Exception
{
List<String> list = new ArrayList<String>();
List<String> packages = new ArrayList<String>();
List<String> addresses = new ArrayList<String>();
List<String> weights = new ArrayList<String>();
//Provide the file path
File file = new File(args[0]);
//Reads the file
BufferedReader br = new BufferedReader(new FileReader(file));
String str;
while((str = br.readLine()) != null)
{
if(str.trim().length() > 0)
{
//System.out.println(str);
//Splits the string by commas and trims whitespace
String[] result = str.trim().split("\\s*,\\s*", 3);
packages.add(result[0]);
//ERROR: Doesn't know what result[1] or result[2] is.
//addresses.add(result[1]);
//weights.add(result[2]);
System.out.println(result[0]);
//System.out.println(result[1]);
//System.out.println(result[2]);
}
}
for(int i = 0; i < packages.size(); i++)
{
System.out.println(packages.get(i));
}
}
}
Here is the text file (The format is intentional):
,123-ABC-4567, 15 W. 15th St., 50.1
456-BgT-79876, 22 Broadway, 24
QAZ-456-QWER, 100 East 20th Street, 50
Q2Z-457-QWER, 200 East 20th Street, 49
678-FGH-9845 ,, 45 5th Ave,, 12.2,
678-FGH-9846,45 5th Ave,12.2
123-A BC-9999, 46 Foo Bar, 220.0
347-poy-3465, 101 B'way,24
,123-FBC-4567, 15 West 15th St., 50.1
678-FGH-8465 45 5th Ave 12.2

Looking at your data, some lines start with an unneeded comma, some use multiple commas as the delimiter, and one line has no commas at all and uses spaces instead. You will need a regex that handles all of these cases. This regex does so for your data and captures each field appropriately:
([\w- ]+?)[ ,]+([\w .']+)[ ,]+([\d.]+)
Here is the explanation for above regex,
([\w- ]+?) - Captures ID data which consists of word characters hyphen and space and places it in group1
[ ,]+ - This acts as a delimiter where it can be one or more space or comma
([\w .']+) - This captures address data which consists of word characters, space, . and ' and places it in group2
[ ,]+ - Again the delimiter as described above
([\d.]+) - This captures the weight data which consists of numbers and . and places it in group3
Demo
Here is the modified Java code you can use. I've removed some of your variable declarations; you can add them back as needed. This code uses a Matcher object to capture the fields and prints them in the form you wanted.
Pattern p = Pattern.compile("([\\w- ]+?)[ ,]+([\\w .']+)[ ,]+([\\d.]+)");
// Reads the file
try (BufferedReader br = new BufferedReader(new FileReader("data1.txt"))) {
String str;
while ((str = br.readLine()) != null) {
Matcher m = p.matcher(str);
if (m.matches()) {
System.out.println(String.format("Id: %s, Address: %s, Weight: %s",
new Object[] { m.group(1), m.group(2), m.group(3) }));
}
}
}
Prints,
Id: 456-BgT-79876, Address: 22 Broadway, Weight: 24
Id: QAZ-456-QWER, Address: 100 East 20th Street, Weight: 50
Id: Q2Z-457-QWER, Address: 200 East 20th Street, Weight: 49
Id: 678-FGH-9845, Address: 45 5th Ave, Weight: 12.2
Id: 678-FGH-9846, Address: 45 5th Ave, Weight: 12.2
Id: 123-A BC-9999, Address: 46 Foo Bar, Weight: 220.0
Id: 347-poy-3465, Address: 101 B'way, Weight: 24
Id: 678-FGH-8465, Address: 45 5th Ave, Weight: 12.2
Let me know if this works for you and if you have any query further.

The last line only contains one token. So split will only return an array with one element.
A minimal reproducing example:
import java.io.*;
class Project1 {
public static void main(String[] args) throws Exception {
//Provide the file path
File file = new File(args[0]);
//Reads the file
BufferedReader br = new BufferedReader(new FileReader(file));
String str;
while ((str = br.readLine()) != null) {
if (str.trim().length() > 0) {
String[] result = str.trim().split("\\s*,\\s*", 3);
System.out.println(result[1]);
}
}
}
}
With this input file:
678-FGH-8465 45 5th Ave 12.2
The output looks like this:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at Project1.main(a.java:22)
Process finished with exit code 1
So you will have to decide, what your program should do in such cases. You might ignore those lines, print an error, or only add the first token in one of your lists.
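One way to make that decision explicit, as a minimal sketch: check the length of the array returned by split before indexing into it, and skip (or report) lines that do not yield three fields. The class and method names (PackageLineParser, parse) are made up for this example:

```java
import java.util.ArrayList;
import java.util.List;

final class PackageLineParser {
    /** Splits on commas like the question's code; returns null when fewer than three fields are found. */
    static String[] parse(String line) {
        String[] result = line.trim().split("\\s*,\\s*", 3);
        return result.length == 3 ? result : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "456-BgT-79876, 22 Broadway, 24",
            "678-FGH-8465 45 5th Ave 12.2" // no commas: split returns a single element
        };
        List<String> packages = new ArrayList<>();
        List<String> addresses = new ArrayList<>();
        List<String> weights = new ArrayList<>();
        for (String line : lines) {
            String[] fields = parse(line);
            if (fields == null) {
                System.err.println("Skipping malformed line: " + line);
                continue;
            }
            packages.add(fields[0]);
            addresses.add(fields[1]);
            weights.add(fields[2]);
        }
        System.out.println(packages + " " + addresses + " " + weights);
    }
}
```

This only guards against the ArrayIndexOutOfBoundsException. Lines with a leading comma or doubled commas still produce three fields, but with an empty or shifted value, so they would need extra cleanup.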

You can add the following length checks to your code:
if (result.length > 0) {
packages.add(result[0]);
}
if (result.length > 1) {
addresses.add(result[1]);
}
if (result.length > 2) {
weights.add(result[2]);
}

how to format string line that contains numbers with regex

I want to format telephone numbers from the following format:
359878123456
0878123456
00359878123456
that are placed in a file that has information about name and phone number in the following format:
DarkoT 00359878123456
so that only the numbers are put into a standard form (the name is left as is), as shown below:
DarkoT +359 87 2 123456
this is for all cases.
This is where i am at.(my regex)
System.out.println(String.valueOf(inputLine).replaceAll("((\\+|00)359|0)(\\-|\\s)?8[7-9][2-9](\\-|\\s)?\\d{3}(\\s|\\-)?\\d{3}$", "($1)-\\$"));
I am confused with the placement. Please advise.

One solution without regex: you could just split the string and group it from the back. But I would really prefer doing it with regex, so here I go:
I suppose you must have this format (if this is not correct, this will not work):
[OPTIONAL] 00 (length: 2)
[OPTIONAL] 111 (Considering always 3 numbers) (length: 3)
If you don't have the former: 0 (changing zones, I understand?) (length: 1)
22 (length: 2)
3 (length: 1)
444444 (length: 6)
And now the regex to capture this:
(?:(?:00)?(\d{3})|0)(\d{2})(\d{1})(\d{6})
You will have, as a result:
Group 1: 3 digits (country code?) or nothing (if it's the '0').
Group 2: 2 digits (zone?)
Group 3: 1 digit (no idea, in my country we don't use this)
Group 4: last 6 digits
Using a replace has some limitations, so I would use a Matcher; it is as easy as:
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(originalNumber);
if (m.find()) {
String nationalCode = m.group(1) != null ? m.group(1) : DEFAULT_NATIONAL_CODE;
formattedNumber = "+" + nationalCode + " " + m.group(2) + " " + m.group(3) + " " + m.group(4);
}
If you want more flexibility (for example, country numbers as 2 digits, not only 3) let me know and I will change the regexp.
NOTE: I didn't test this, just coded off the top of my head, let me know if it fails.
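For completeness, here is a self-contained sketch of the matcher approach above applied to a "name number" line. The class name PhoneFormatter, the method formatLine, and the value "359" for DEFAULT_NATIONAL_CODE are assumptions made for this example. It uses matches() instead of find() so that the whole number has to fit the pattern:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhoneFormatter {
    // Assumption: numbers written without a country code belong to +359.
    private static final String DEFAULT_NATIONAL_CODE = "359";
    // Regex from the answer above: optional 00, 3-digit country code (or a single 0), then 2 + 1 + 6 digits.
    private static final Pattern NUMBER = Pattern.compile(
            "(?:(?:00)?(\\d{3})|0)(\\d{2})(\\d{1})(\\d{6})");

    /** Formats "name number" as "name +CCC NN N NNNNNN"; returns the line unchanged if it does not fit. */
    static String formatLine(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length != 2) {
            return line;
        }
        Matcher m = NUMBER.matcher(parts[1]);
        if (!m.matches()) {
            return line;
        }
        String code = m.group(1) != null ? m.group(1) : DEFAULT_NATIONAL_CODE;
        return parts[0] + " +" + code + " " + m.group(2) + " " + m.group(3) + " " + m.group(4);
    }

    public static void main(String[] args) {
        System.out.println(formatLine("DarkoT 00359878123456"));
        System.out.println(formatLine("DarkoT 359878123456"));
        System.out.println(formatLine("DarkoT 0878123456"));
    }
}
```

Numbers that already start with + are not covered by this regex and are returned unchanged; the pattern would need an extra alternative for them.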

I think this is a continuation of your previous problem How to check exact phone number in Java with regex
Sample.java
import java.util.*;
import java.util.regex.*;
import java.io.*;
public class Sample{
public static void main(String[] args){
try{
File inFile = new File ("phonebook.txt");
Scanner sc = new Scanner (inFile);
while (sc.hasNextLine())
{
String line = sc.nextLine();
if(line.matches("^((.)*\\s)?((\\+|00)359|0)8[7-9][2-9]\\d{6}$")) // here that code doesn't work
{
System.out.println (line.replaceAll("^([^\\s\\+]*\\s?)?((\\+|00)?359|0)[-\\s]?(8[7-9][2-9])[-\\s]?(\\d{3})[-\\s]?(\\d{3})$", "$1 - $2 $4 $5 $6").replaceAll("00359","+359").replaceAll("- 0","+359"));
}
}
sc.close();
} catch (Exception e) {
    e.printStackTrace(); // do not silently swallow I/O errors
}
}
}
phonebook.txt
Sagar +359883123456
Test 00359883123565
Someone 0883123456
People 1234567890
Test1 +359873123456
Output
C:\Users\acer\Desktop\Java\programs>javac Sample.java
C:\Users\acer\Desktop\Java\programs>java Sample
Sagar - +359 883 123 456
Test - +359 883 123 565
Someone +359 883 123 456
Test1 - +359 873 123 456

Capture each data in a string from a flat file txt?

I am reading a flat file that contains a few codes per line, and my question is how to capture each value in an organized way. Some columns do not always have a value; if a column is empty, I want to store it as NULL in an object.
Input:
19150526 1 7 1
19400119 2 20 1 1
19580122 2 20 9 1
19600309 1 20 7 1
19570310 2 20 5 1
19401215 1 10 1 1
19650902 2 20 0 1
19510924 1 20 3 1
19351118 2 30 1
19560118 1 20 0 1
19371108 2 7 1
19650315 1 30 6 1
19601217 2 30 4 1
Code Java:
FileInputStream fstream = new FileInputStream("C:\\sppadron.txt");
DataInputStream entrada = new DataInputStream(fstream);
BufferedReader buffer = new BufferedReader(new InputStreamReader(entrada));
String strLinea;
List<Sppadron> listSppadron = new ArrayList<Sppadron>();
while ((strLinea = buffer.readLine()) != null){
Sppadron spadron= new Sppadron();
spadron.setSpNac(strLinea.substring(143, 152).trim());
spadron.setSpSex(strLinea.substring(152, 154).trim());
spadron.setSpGri(strLinea.substring(154, 157).trim());
spadron.setSpSec(strLinea.substring(157, 158).trim());
spadron.setSpDoc(strLinea.substring(158, strLinea.length()).trim());
listSppadron.add(spadron);
}
entrada.close();
Originally I had the idea of doing it this way, but in practice the position of each value is not fixed as it appears to be. So I tried split(), but the number of spaces between values varies; most recently I tried replaceAll(), but that leaves all the data run together. Is there any way to separate the values regardless of the spacing between them?
Note that the second-to-last value in each row can be empty, as the input file above shows.

Try the following:
strLinea = strLinea.trim().replaceAll("\\s+", " ");
String[] strArr = strLinea.split(" ");
Then use strArr as your requirements dictate.
If you want it as a list, you can use Arrays.asList(strArr).

Try it this way:
replace every run of contiguous whitespace with a single space
strLinea = strLinea.replaceAll("\\s+", " ");
then do your split.
Or split on the whitespace runs directly:
String[] tokensVal = strLinea.split("\\s+");

You're on the right lines using a BufferedReader to read the file line by line. Once you have a line, try using the StringTokenizer class to pull out each field. StringTokenizer will by default split on white space, and you can iterate through the columns.
Consider the following:
public static void main(String[] args) {
String s = "hello\t\tworld some spaces \tbetween here";
StringTokenizer st = new StringTokenizer(s);
while(st.hasMoreTokens())
{
System.out.println(st.nextToken());
}
}
This will output:
hello
world
some
spaces
between
here
You could base your solution on this. Maybe have a builder pattern that can return the objects you need given the current line..
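A minimal sketch of such a per-line factory, using split("\\s+") rather than StringTokenizer so the tokens can be counted. It assumes, as the question states, that only the second-to-last column can be empty, so a row with four tokens is read as having that column missing. The class PadronRow and its field names are hypothetical stand-ins for the question's Sppadron object:

```java
final class PadronRow {
    // Hypothetical field names mirroring the setters in the question (setSpNac, setSpSex, ...).
    final String nac;
    final String sex;
    final String gri;
    final String sec; // null when the second-to-last column is empty
    final String doc;

    private PadronRow(String nac, String sex, String gri, String sec, String doc) {
        this.nac = nac;
        this.sex = sex;
        this.gri = gri;
        this.sec = sec;
        this.doc = doc;
    }

    /** Parses 4 or 5 whitespace-separated tokens; with 4 tokens the second-to-last column is treated as missing. */
    static PadronRow parse(String line) {
        String[] t = line.trim().split("\\s+");
        if (t.length == 5) {
            return new PadronRow(t[0], t[1], t[2], t[3], t[4]);
        }
        if (t.length == 4) {
            return new PadronRow(t[0], t[1], t[2], null, t[3]);
        }
        throw new IllegalArgumentException("Expected 4 or 5 fields but got " + t.length + ": " + line);
    }

    public static void main(String[] args) {
        PadronRow full = parse("19400119   2 20  1 1");
        PadronRow partial = parse("19150526 1   7     1");
        System.out.println(full.sec + " / " + partial.sec);
    }
}
```

If more than one column can be empty, counting tokens is no longer enough, and the column positions in the original file would have to be used instead.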
