I have a question. I am reading a flat file that contains a few fields per line, and I want to know how to capture each field in order. Some columns do not always have a value; when a column is empty, I want to store it as NULL on an object.
Input:
19150526 1 7 1
19400119 2 20 1 1
19580122 2 20 9 1
19600309 1 20 7 1
19570310 2 20 5 1
19401215 1 10 1 1
19650902 2 20 0 1
19510924 1 20 3 1
19351118 2 30 1
19560118 1 20 0 1
19371108 2 7 1
19650315 1 30 6 1
19601217 2 30 4 1
Code Java:
FileInputStream fstream = new FileInputStream("C:\\sppadron.txt");
DataInputStream entrada = new DataInputStream(fstream);
BufferedReader buffer = new BufferedReader(new InputStreamReader(entrada));
String strLinea;
List<Sppadron> listSppadron = new ArrayList<Sppadron>();
while ((strLinea = buffer.readLine()) != null){
Sppadron spadron= new Sppadron();
spadron.setSpNac(strLinea.substring(143, 152).trim());
spadron.setSpSex(strLinea.substring(152, 154).trim());
spadron.setSpGri(strLinea.substring(154, 157).trim());
spadron.setSpSec(strLinea.substring(157, 158).trim());
spadron.setSpDoc(strLinea.substring(158, strLinea.length()).trim());
listSppadron.add(spadron);
}
entrada.close();
Originally I planned to do it this way, but in practice the position of each field is not as fixed as it looks. I then tried split(), but the number of spaces between fields varies, and most recently replaceAll(), but that leaves all the data run together. Is there any way to separate the fields regardless of the spacing between them?
Bear in mind that the penultimate field in each row can be empty, as the input file above shows.
Try the following:
strLinea = strLinea.trim().replaceAll("\\s+", " ");
String[] strArr = strLinea.split(" ");
Then use strArr as your requirements dictate.
If you want it as a list, you can use Arrays.asList(strArr);
Try it this way.
Replace every run of contiguous whitespace with a single space (strings are immutable, so assign the result):
strLinea = strLinea.replaceAll("\\s+", " ");
Then do your splits.
Or use something like:
String[] tokensVal = strLinea.split("\\s+");
You're on the right lines using a BufferedReader to read the file line by line. Once you have a line, try using the StringTokenizer class to pull out each field. By default StringTokenizer splits on whitespace, and you can iterate through the columns.
consider the below:
public static void main(String[] args) {
String s = "hello\t\tworld some spaces \tbetween here";
StringTokenizer st = new StringTokenizer(s);
while(st.hasMoreTokens())
{
System.out.println(st.nextToken());
}
}
This will output:
hello
world
some
spaces
between
here
You could base your solution on this. Maybe have a builder pattern that can return the objects you need given the current line.
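As a minimal sketch of how a tokenized line could be mapped to nullable fields: it assumes, as the question states, that only the penultimate column can be missing, so a four-token line is treated as having no value there. The class name PadronLineParser and the array layout are hypothetical and only for illustration.

```java
import java.util.Arrays;

final class PadronLineParser {
    // Returns {nac, sex, gri, sec, doc}; sec is null when the line has only four tokens.
    // Assumption: the penultimate column is the only one that can be missing.
    static String[] parse(String line) {
        // trim() first, otherwise leading whitespace produces an empty first token
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length == 5) {
            return tokens;
        }
        if (tokens.length == 4) {
            return new String[] {tokens[0], tokens[1], tokens[2], null, tokens[3]};
        }
        throw new IllegalArgumentException("Unexpected number of fields: " + line);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("19400119 2 20 1 1")));
        System.out.println(Arrays.toString(parse("19150526 1 7 1")));
    }
}
```

Each element of the returned array can then be passed to the matching setter on Sppadron. Note that splitting on whitespace cannot tell which column is blank; if any other column could be empty, the fields would have to be located by position instead.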
Related
I have a file with records as below, and I am trying to split each record on whitespace and join the fields with commas.
file:
a 3w 12 98 header P6124
e 4t 2 100 header I803
c 12L 11 437        M12
BufferedReader reader = new BufferedReader(new FileReader("/myfile.txt"));
String line = reader.readLine();
while (line != null) {
    System.out.println(line);
    // split the current line before reading the next one, so a null line is never split
    String[] splitLine = line.split("\\s+");
    line = reader.readLine();
}
If the data is separated by multiple whitespace characters, I usually go for a regex replace followed by split("\\s+") or split(" +").
But in the case above, record c has no value in the header column. The regex "\\s+" or " +" therefore just skips the missing field, and I get c,12L,11,437,M12 instead of c,12L,11,437,,M12.
How do I properly split the lines based on any delimiter in this case so that I get data in the below format:
a,3w,12,98,header,P6124
e,4t,2,100,header,I803
c,12L,11,437,,M12
Could anyone let me know how I can achieve this ?
Maybe you can try a more complicated approach: use a regex that matches exactly six fields on each line and explicitly handles a missing value for the fifth one.
I rewrote your example adding some console log in order to clarify my suggestion:
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
    // The wide gap before M12 stands for the blank fifth column.
    private static final String Input = "a 3w 12 98 header P6124\n" +
            "e 4t 2 100 header I803\n" +
            "c 12L 11 437        M12";

    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new StringReader(Input));
        Pattern pattern = Pattern.compile("^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");
        String line;
        while ((line = reader.readLine()) != null) {
            // A plain whitespace split drops the empty column, so the third line yields 5 tokens.
            String[] splitLine = line.split("\\s+");
            System.out.println(splitLine.length);
            System.out.println("Line: " + line);
            Matcher matcher = pattern.matcher(line);
            boolean matches = matcher.matches();
            System.out.println("matches: " + matches);
            System.out.println("groups: " + matcher.groupCount());
            if (!matches) {
                continue; // group(i) would throw IllegalStateException without a match
            }
            for (int i = 1; i <= matcher.groupCount(); i++) {
                // A missing fifth field is reported as null.
                System.out.printf(" Group %d has value '%s'\n", i, matcher.group(i));
            }
        }
    }
}
The key is that the pattern used to match each line requires a sequence of six fields:
for each field, the value is described as [^ ]+
separators between fields are described as " +" (one or more spaces)
the fifth (nullable) field is made optional by placing ? after its group: ([^ ]+)?
each value is captured as a group using parentheses: ( ... )
start (^) and end ($) of each line are marked explicitly
Then, each line is matched against the given pattern, obtaining six groups: you can access each group using matcher.group(index), where index is 1-based because group(0) returns the full match.
This is a more complex approach but I think it can help you to solve your problem.
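To get from the matched groups to the comma-separated output asked for in the question, the groups can be joined with an empty string standing in for a missing fifth field. This is only a sketch: the class name SixFieldCsv is hypothetical, and the eight-space gap in the sample line is an assumption about how the blank column appears in the original file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SixFieldCsv {
    // Same pattern as above: six fields, the fifth one optional.
    private static final Pattern LINE =
            Pattern.compile("^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");

    static String toCsv(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Line does not have the expected layout: " + line);
        }
        List<String> fields = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            String value = m.group(i);
            // group(i) is null when the optional fifth field did not participate in the match
            fields.add(value == null ? "" : value);
        }
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        System.out.println(toCsv("a 3w 12 98 header P6124")); // a,3w,12,98,header,P6124
        System.out.println(toCsv("c 12L 11 437        M12")); // c,12L,11,437,,M12
    }
}
```

A line whose blank column is separated by only a single space cannot be distinguished from a five-field line, so it is rejected rather than guessed at.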
Put a limit on the number of whitespace chars that may be used to split the input.
In the case of your example data, a maximum of 5 works:
String[] splitLine = line.split("\\s{1,5}");
See live demo (of this code working as desired).
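A minimal sketch of the bounded split follows. It assumes populated fields are separated by at most five whitespace characters and that the blank column leaves a gap of six to ten; the eight-space gap in the sample line is an assumption about the original file, and BoundedSplit is a hypothetical name.

```java
final class BoundedSplit {
    // Each separator match consumes at most five whitespace characters, so a wider gap
    // is split twice and leaves an empty token where the blank column was.
    static String toCsv(String line) {
        return String.join(",", line.split("\\s{1,5}"));
    }

    public static void main(String[] args) {
        System.out.println(toCsv("a 3w 12 98 header P6124")); // a,3w,12,98,header,P6124
        System.out.println(toCsv("c 12L 11 437        M12")); // c,12L,11,437,,M12
    }
}
```

The limit is tied to the file layout: a gap of eleven or more whitespace characters would be split three times and produce two empty tokens, so the bound has to be chosen to fit the actual column widths.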
Are you just trying to switch your delimiters from spaces to commas?
In that case:
cat myFile.txt | sed 's/   */  /g' | sed 's/ /,/g'
*Edit: added a stage that reduces runs of more than two spaces to just the two spaces needed to retain the double comma.
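Since the question's code is Java, the same two-stage substitution can be sketched there as well. It assumes populated fields are separated by exactly one space and that a blank column shows up as a run of two or more spaces; SpacesToCommas is a hypothetical name.

```java
final class SpacesToCommas {
    static String convert(String line) {
        // Stage 1: reduce any run of two or more spaces to exactly two spaces.
        // Stage 2: turn every remaining space into a comma, so two spaces become ",,".
        return line.replaceAll(" {2,}", "  ").replace(' ', ',');
    }

    public static void main(String[] args) {
        System.out.println(convert("a 3w 12 98 header P6124")); // a,3w,12,98,header,P6124
        System.out.println(convert("c 12L 11 437        M12")); // c,12L,11,437,,M12
    }
}
```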
BOWLING O M R W ECON 0s 45 6 WD NB Losing Dhoni as a batter always
difficult for us - Raina
TABoult 4 0 3 0 925 M 2 3 1 0 The Chennai Super Kings batsman
struck form after lean season and
JETED 6 0 = 4 O 0 0 lauded Dhoni's support at the crease
CHMorris 4 0 4 ns o9 8 1 1 against Delhi Capitals
AR Patel 3 o 3 1 1033 6 3 2 o o “Watch the ball, hit the ball' - Dhoni's
formula for the final over
S o0 e sEoe 10 o o The CSK captain has hit 554 runs in
e PR el 227 balls inthe 20th over of an IPL
match. Thats 13% of all the runs he's
made i this tournament
. Delhi Capitals Innings (target: 180 runs from 20 overs) Talking Points - Is Dhoni babering #EEIEER -
This is my string.
I want it in Excel.
Based on the sparse description of what you want to do, I would suggest:
Read the text from the image
Replace all spaces with a semicolon
String csvContent = imgData.replaceAll(" ",";");
Save the text to a CSV file
Open the CSV file with Excel
The following example assumes that you have managed to retrieve the data, which is then post-processed into CSV format. The contents are written to a file which you can just double-click to see that the data is split into columns as you requested.
String[] data = new String[] {
"BOWLING O M R W ECON 0s 45 6", //notice that your OCR software does not properly recognise the string here
"TABoult 4 0 3 0 925 M 2 3",
"JETED 6 0 = 4 O 0 0"
};
BufferedWriter writer = new BufferedWriter( new FileWriter( System.getProperty( "user.home" ) + System.getProperty( "file.separator" ) + "data.csv" ) );
for( String record : data ) {
writer.write( record.replaceAll( " ", ";" ) );
writer.write( "\n" );
}
writer.close();
As I noted in the comment above, your OCR does not work correctly. I would suggest you look into the jsoup HTML parser to get the information directly from the page and continue from there. Otherwise you will not be satisfied with the result.
driver.get("https://www.espncricinfo.com/series/8048/scorecard/1178425/chennai-super-kings-vs-delhi-capitals-50th-match-indian-premier-league-2019");
WebElement element = driver.findElement(By.xpath("//article[@class='sub-module scorecard'][1]"));
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("arguments[0].scrollIntoView(true);", element);
File screen = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
File file = new File("C:\\Users\\user\\Desktop\\screenshot1\\screenshotOfElement2.png");
FileHandler.copy(screen, file);
ITesseract instance = new Tesseract();
instance.setDatapath("C:\\selenium_work\\ScrapingText.PDF\\tessdata");
String result = instance.doOCR(file);
//System.out.println(result);
String[] lines = result.split("\\n");
This is what I am trying.
I'm trying to work out the average number of items from a list of insurance policies, how do I get the value of only the items(int) from a text file?
Here is an example of data in the text file:
20-Jul-2017 EQ123B 3 40000 30 A 5389 l a
20-Jul-2017 ED423A 2 40000 30 A 5389 k d
31-Jul-2017 ZD123V 4 40000 30 A 5389 s c
Each line represents data for a different insurance policy, with the third column being the amount of items to be insured. I had planned to get the average number of items per policy by getting the total amount of items in the file and dividing that by the number of policies.
Here is my code so far:
try{
int numOfPolicies = 0;
try (Scanner file = new Scanner(new FileReader("policy.txt"))) {
//loop through the file counting each line. Each line represents a policy
while(file.hasNextLine()){
numOfPolicies++;
file.nextLine();
}
}
System.out.println("Total Number of Policies: " + numOfPolicies);
}
catch(FileNotFoundException e){
System.out.println("File not found");
}
As you can see I have already got the number of policies in the file. How do I read only the number of items from each line, and store this in a variable?
If your lines follow this format (whitespace used only as separator) :
20-Jul-2017 EQ123B 3 40000 30 A 5389 l a
To retrieve 3, you could capture the number between the second and the third whitespace.
You could use the String.split() method with the \\s+ regex and a limit of 4, as you don't care about the tokens after the number of items:
String[] split = file.nextLine().split("\\s+", 4);
You will get the following tokens:
20-Jul-2017
EQ123B
3
40000 30 A 5389 l a
You can then get the third token:
String number = split[2];
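Putting this together for the average asked about in the question, a minimal sketch might look as follows. It assumes every non-blank line has at least three whitespace-separated tokens and that the third one is an integer; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class PolicyAverage {
    // Sums the third token of every non-blank line and divides by the number of such lines.
    static double averageItems(List<String> lines) {
        int totalItems = 0;
        int numOfPolicies = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines instead of counting them as policies
            }
            String[] split = trimmed.split("\\s+", 4);
            totalItems += Integer.parseInt(split[2]);
            numOfPolicies++;
        }
        return numOfPolicies == 0 ? 0.0 : (double) totalItems / numOfPolicies;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "20-Jul-2017 EQ123B 3 40000 30 A 5389 l a",
                "20-Jul-2017 ED423A 2 40000 30 A 5389 k d",
                "31-Jul-2017 ZD123V 4 40000 30 A 5389 s c");
        System.out.println(averageItems(lines)); // 3.0
    }
}
```

The lines could come from the Scanner loop in the question, or from Files.readAllLines(Paths.get("policy.txt")) if reading the whole file at once is acceptable.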
I have a file which I read line by line in Java.
Below is the content of the file
My File contains the following characters (persons, indicated by name)
There are three characters in this line Jack = 10 Jill = 11 Jhon = 12
There are two characters in the line Jack = 14 Melissa = 15
I have to search line by line for 'Jack' and I have to fetch his value 10 (in first line) and 14 (in second line) and pass it to another variable. How to achieve this?
This should get you started. I assume you know how to read file line by line, that's the draft of what you should do for every line.
Pattern pattern = Pattern.compile("(.*Jack)\\s*=\\s*(\\d+)(.*)");
String testString = " Jack =154, Jill = 111";
Matcher matcher = pattern.matcher(testString);
if(matcher.find()) {
System.out.println(matcher.group(2));
}
These are the essentials you should know to understand what's going on: http://docs.oracle.com/javase/tutorial/essential/regex/
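Applied to every line of the file, the same idea can be sketched as below. The pattern here is a slightly narrower variant that captures only the number after "Jack ="; the class name JackValues is hypothetical, and lines that do not mention Jack are simply skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JackValues {
    // \b keeps names that merely end in "Jack" from matching; group 1 is the number.
    private static final Pattern JACK = Pattern.compile("\\bJack\\s*=\\s*(\\d+)");

    static List<Integer> valuesOf(List<String> lines) {
        List<Integer> values = new ArrayList<>();
        for (String line : lines) {
            Matcher matcher = JACK.matcher(line);
            if (matcher.find()) {
                values.add(Integer.parseInt(matcher.group(1)));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "My File contains the following characters (persons, indicated by name)",
                "There are three characters in this line Jack = 10 Jill = 11 Jhon = 12",
                "There are two characters in the line Jack = 14 Melissa = 15");
        System.out.println(valuesOf(lines)); // [10, 14]
    }
}
```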
I have a flat file like:
A 10
S 20
W A 20 10
S A 45 10
S W S 20 20 20 30
W A S 22 50 20 55
I want to make sure it is well formed, (separated by blank space " ")
allowing only a regular expression like:
anyword* then " " then (word*|numbers*)*
where * is any number of words
but there is also one issue,
if there is only one word or char there is only one number
if there are 2 words or chars separated by " " then there must be 2 numbers separated by " "
if there are 3 words or chars separated by " " then there must be 4 numbers separated by " "
I was doing something like this, but do not know where to incorporate validation of line
try {
input = new BufferedReader(new FileReader(new File(filename)));
String line = null;
while ((line = input.readLine()) != null) {
String[] words = line.split(" ");
if (words.length == 2) {
}
}
}
This regex should do it:
^[a-z]+ (?:\d+|[a-z]+(?: \d+ \d+| [a-z]+(?: \d+){4}))$
I tried to make it as short as possible, but it may be possible to condense it a bit more. This should be used with case insensitivity enabled, or you should change all of the [a-z] to [a-zA-Z].
Here is a Rubular.
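In Java, the check could be sketched as follows, using the [a-zA-Z] variant so that no flag is needed. The class name FlatFileValidator is hypothetical; the call to isWellFormed would go inside the while loop from the question in place of the words.length test.

```java
import java.util.regex.Pattern;

final class FlatFileValidator {
    // One word + one number, two words + two numbers, or three words + four numbers,
    // each separated by exactly one space.
    private static final Pattern WELL_FORMED = Pattern.compile(
            "^[a-zA-Z]+ (?:\\d+|[a-zA-Z]+(?: \\d+ \\d+| [a-zA-Z]+(?: \\d+){4}))$");

    static boolean isWellFormed(String line) {
        return WELL_FORMED.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("A 10"));              // true
        System.out.println(isWellFormed("S W S 20 20 20 30")); // true
        System.out.println(isWellFormed("W A 20"));            // false
    }
}
```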