How to split a file into several tokens - java

I am trying to tokenize an input file, splitting its sentences into tokens (words).
For example,
"This is a test file." should become five words, "this" "is" "a" "test" "file", omitting punctuation and whitespace. The words should then be stored in an ArrayList.
I wrote the following code:
public static ArrayList<String> tokenizeFile(File in) throws IOException {
    String strLine;
    String[] tokens;
    //create a new ArrayList to store tokens
    ArrayList<String> tokenList = new ArrayList<String>();
    if (null == in) {
        return tokenList;
    } else {
        FileInputStream fStream = new FileInputStream(in);
        DataInputStream dataIn = new DataInputStream(fStream);
        BufferedReader br = new BufferedReader(new InputStreamReader(dataIn));
        while (null != (strLine = br.readLine())) {
            if (strLine.trim().length() != 0) {
                //make sure strings are independent of capitalization and then tokenize them
                strLine = strLine.toLowerCase();
                //create regular expression pattern to split
                //first letter to be alphabetic and the remaining characters to be alphanumeric or '
                String pattern = "^[A-Za-z][A-Za-z0-9'-]*$";
                tokens = strLine.split(pattern);
                int tokenLen = tokens.length;
                for (int i = 1; i <= tokenLen; i++) {
                    tokenList.add(tokens[i - 1]);
                }
            }
        }
        br.close();
        dataIn.close();
    }
    return tokenList;
}
This code runs, except that instead of splitting the whole file into words (tokens), it turns each whole line into a single token. "area area" becomes one token instead of "area" appearing twice. I don't see the error in my code. Maybe something is wrong with my trim() call.
Any advice is appreciated. Thank you so much.
Maybe I should use Scanner instead? I'm confused.

I think Scanner is more appropriate for this task. As for this code, you should fix the regex; try "\\s+".

Try String pattern = "[^\\w]"; as the pattern in the same code.
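Some context on why the posted code returns whole lines: String.split expects a regex that matches the separators, not the tokens, and the ^...$ anchors mean the posted pattern can only ever match an entire line. Below is a minimal sketch of splitting on runs of non-token characters instead. The names LineTokenizer and tokenize are hypothetical, and keeping apostrophes inside words is an assumption taken from the comment in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LineTokenizer {
    // Splits one line into lowercase word tokens.
    // Assumption: a token is a run of ASCII letters, digits, or apostrophes.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        // The regex describes the separators: one or more characters
        // that cannot be part of a token.
        for (String t : line.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!t.isEmpty()) { // a leading separator produces one empty string
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This is a test file."));
        System.out.println(tokenize("area area"));
    }
}
```

Calling tokenize on each line returned by readLine() and adding the results to tokenList with addAll would slot into the loop from the question.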

Related

Java compare strings from two places and exclude any matches

I'm trying to end up with a results.txt minus any matching items, having successfully compared some string inputs against another .txt file. Been staring at this code for way too long and I can't figure out why it isn't working. New to coding so would appreciate it if I could be steered in the right direction! Maybe I need a different approach? Apologies in advance for any loud tutting noises you may make. Using Java8.
//Sending a String[] into 'searchFile', contains around 8 small strings.
//Example of input: String[]{"name1","name2","name 3", "name 4.zip"}
^ This is my exclusions list.
public static void searchFile(String[] arr, String separator)
{
    StringBuilder b = new StringBuilder();
    for(int i = 0; i < arr.length; i++)
    {
        if(i != 0) b.append(separator);
        b.append(arr[i]);
        String findME = arr[i];
        searchInfo(MyApp.getOptionsDir()+File.separator+"file-to-search.txt",findME);
    }
}
^This works fine. I'm then sending the results to 'searchInfo' and trying to match and remove any duplicate (complete, not part) strings. This is where I am currently failing. Code runs but doesn't produce my desired output. It often finds part strings rather than complete ones. I think the 'results.txt' file is being overwritten each time...but I'm not sure tbh!
file-to-search.txt contains: "name2","name.zip","name 3.zip","name 4.zip" (text file is just a single line)
public static String searchInfo(String fileName, String findME)
{
    StringBuffer sb = new StringBuffer();
    try {
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        String line = null;
        while((line = br.readLine()) != null)
        {
            if(line.startsWith("\""+findME+"\""))
            {
                sb.append(line);
                //tried various replace options with no joy
                line = line.replaceFirst(findME+"?,", "");
                //then goes off with results to create a txt file
                FileHandling.createFile("results.txt",line);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return sb.toString();
}
What I'm trying to end up with is a result file MINUS any matching complete strings (not part strings):
e.g. results.txt to end up with: "name.zip","name 3.zip"
OK, with the information I have, what you can do is this:
List<String> result = new ArrayList<>();
String content = FileUtils.readFileToString(file, "UTF-8");
for (String s : content.split(", ")) {
    if (!s.equals(findME)) { // assuming both have string quotes added already
        result.add(s);
    }
}
FileUtils.write(newFile, String.join(", ", result), "UTF-8");
This uses FileUtils from Apache Commons IO for convenience. Add or remove the space after the comma in the delimiter to match your file.
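The same idea can be sketched with only the standard library, filtering against the whole exclusion list in one pass so that the result is produced once rather than once per search term. This is a sketch under assumptions: the names ExclusionFilter and removeExcluded are hypothetical, entries are assumed to contain no commas, and file reading and writing are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ExclusionFilter {
    // Returns the comma-separated, quoted entries of `line` whose unquoted
    // value is not in `exclusions`. Assumption: entries contain no commas.
    static String removeExcluded(String line, Set<String> exclusions) {
        List<String> kept = new ArrayList<>();
        for (String entry : line.split(",")) {
            String name = entry.trim().replace("\"", ""); // strip the quotes
            // Set.contains compares whole names, so "name 3" does not match "name 3.zip".
            if (!name.isEmpty() && !exclusions.contains(name)) {
                kept.add("\"" + name + "\"");
            }
        }
        return String.join(",", kept);
    }

    public static void main(String[] args) {
        Set<String> exclusions =
                new HashSet<>(Arrays.asList("name1", "name2", "name 3", "name 4.zip"));
        String line = "\"name2\",\"name.zip\",\"name 3.zip\",\"name 4.zip\"";
        System.out.println(removeExcluded(line, exclusions)); // "name.zip","name 3.zip"
    }
}
```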

Generating pattern that adds + sign between spaces as one string for each line

I am using a scanner to read a file which is structured as follows:
ali nader sepahi
simon nadel
rahim nadeem merse
shahid nadeem
Each line has multiple strings that together represent the full name of a person. How can I put "+" in place of the spaces in each name, so that something like "ali+nader+sepahi" is printed as one String?
public class dataScanner
{
    public dataScanner() throws IOException
    {
        Scanner file = new Scanner(new File("info.txt"));
        while(file.hasNext())
        {
            String s = file.next().trim();
            System.out.println(s+"+");
        }
    }
}
Use Scanner.nextLine to read the whole line, then replace the spaces with +
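The Scanner.nextLine suggestion can be sketched roughly as follows. The names NameJoiner and joinWithPlus are hypothetical, and the Scanner reads from an in-memory sample instead of info.txt so the example is self-contained. Collapsing repeated spaces into a single "+" is an assumption about the desired output.

```java
import java.util.Scanner;

class NameJoiner {
    // Replaces each run of whitespace inside the line with a single '+'.
    static String joinWithPlus(String line) {
        return line.trim().replaceAll("\\s+", "+");
    }

    public static void main(String[] args) {
        String sample = "ali nader sepahi\nsimon nadel\n";
        try (Scanner in = new Scanner(sample)) {
            while (in.hasNextLine()) { // one full name per line
                System.out.println(joinWithPlus(in.nextLine()));
            }
        }
    }
}
```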
For this kind of task a Scanner is not really suitable; use a BufferedReader and String.replace(char, char) as follows:
try (BufferedReader reader = new BufferedReader(new FileReader("info.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line.replace(' ', '+'));
    }
}

BufferedReader to read lines, then assign the new formed line's tokens to variables

I have a text file that I need to modify before parsing it. 1) I need to combine lines if the leading line ends with "\" and delete blank lines. This has been done using this code:
public List<String> OpenFile() throws IOException {
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        String line;
        StringBuilder concatenatedLine = new StringBuilder();
        List<String> formattedStrings = new ArrayList<>();
        while ((line = br.readLine()) != null) {
            if (line.isEmpty()) {
                line = line.trim();
            } else if (line.charAt(line.length() - 1) == '\\') {
                line = line.substring(0, line.length() - 1);
                concatenatedLine.append(line);
            } else {
                concatenatedLine.append(line);
                formattedStrings.add(concatenatedLine.toString());
                concatenatedLine.setLength(0);
            }
        }
        return formattedStrings;
    }
}
}//The formattedStrings arrayList contains all of the strings formatted for use.
Now my question: how can I search those lines for a pattern and assign their token[i] values to variables that I can use later?
The new combined text will look like this:
Field-1 Field-2 Field-3 Field-4 Field-5 Field-6 Field-7
Now, if the line contains "Field-6" and "Field-2" Then set the following:
String S =token[1] token[3];
String Y =token[5-7];
A question you might have for me: how am I deciding which tokens to save to a string? I will manually search for the pattern in the text file, and if the line contains Field-6 and Field-2 (or any other required pattern), I will manually count which tokens I need to assign to the string. However, it would be nice if there were another way to approach this, for example assigning what is between token[4] and token[7] to a string (s) if the line has token[2] and token[6], or another way that gives more granular control over what to store as a string and what to ignore.
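One possible shape for this, offered as a sketch and not a definitive answer: split each combined line on whitespace, test for the required fields, and pull out ranges of tokens with a small helper. The names FieldPicker and fields are hypothetical, and treating positions as 1-based (so that "5-7" means Field-5 through Field-7) is an assumption, since the question does not say.

```java
import java.util.Arrays;
import java.util.List;

class FieldPicker {
    // Joins tokens from..to with single spaces.
    // Assumption: positions are 1-based and inclusive.
    static String fields(String[] tokens, int from, int to) {
        return String.join(" ", Arrays.copyOfRange(tokens, from - 1, to));
    }

    public static void main(String[] args) {
        String line = "Field-1 Field-2 Field-3 Field-4 Field-5 Field-6 Field-7";
        String[] tokens = line.trim().split("\\s+");
        List<String> present = Arrays.asList(tokens);

        // Only extract values when both marker fields occur on the line.
        if (present.contains("Field-6") && present.contains("Field-2")) {
            String s = fields(tokens, 1, 1) + " " + fields(tokens, 3, 3);
            String y = fields(tokens, 5, 7);
            System.out.println(s); // Field-1 Field-3
            System.out.println(y); // Field-5 Field-6 Field-7
        }
    }
}
```

Keeping the range logic in one helper means each new pattern only needs a condition and a pair of positions, which may give the finer control the question asks about.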

split(//s+) doesn't remove whitespace

I need to read a lot of files and insert the data into MS SQL.
I have a file in which the text fields appear to be separated by //t.
Split does not do the job; I have even tried "//s+", as you can see in the code below:
public void InsetIntoCustomers(final File _file, final Connection _conn)
{
    conn = _conn;
    try
    {
        FileInputStream fs = new FileInputStream(_file);
        DataInputStream in = new DataInputStream(fs);
        BufferedReader br = new BufferedReader(new InputStreamReader(in));
        //String strline contains readline() from BufferedReader
        String strline;
        while((strline = br.readLine()) != null)
        {
            if(!strline.contains("#"))
            {
                String[] test = strline.split("//s+");
                if((tempid = sNet.chkSharednet(_conn, test[0] )) != 0)
                {
                    // do something
                }
            }
        }
        // close BufferedReader
        br.close();
    }
I need to know where in my String[] each piece of data ends up, for a file with 500k lines. But my test[] has length 1, and all the data from readLine() is at index 0.
Am I using split wrong?
Or are there other places I need to look?
// Mir
Haha, thank you so much. Why the hell didn't I see that myself?
Yeah, of course. I am using \s+ in all my other files.
But thanks for pointing it out.
The correct regex is \\s+, with backslashes instead of forward slashes.
You could also have tried \\t.
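A small sketch of the difference, assuming a tab-separated line. The sample data and the names SplitDemo and countFields are made up for illustration. With forward slashes the regex looks for two literal "/" characters followed by one or more "s", which never occurs in the line, so split returns the whole line as a single element.

```java
class SplitDemo {
    // Number of pieces produced by splitting `line` on `regex`.
    static int countFields(String line, String regex) {
        return line.split(regex).length;
    }

    public static void main(String[] args) {
        String line = "10.0.0.1\tcustomer\t42"; // hypothetical tab-separated record

        System.out.println(countFields(line, "//s+"));  // 1: no literal "//s" in the line
        System.out.println(countFields(line, "\\s+"));  // 3: splits on runs of whitespace
        System.out.println(countFields(line, "\\t"));   // 3: splits on each tab
    }
}
```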

Processing a string of information

I have a text file that holds baseball teams as YEAR:TEAM1:POINTS1:TEAM2:POINTS2 on each line.
How can I process it so that I wind up with the year, 1st team's name, and if they won or not?
I know I should use delimiter \n and : to separate the data, but how can I actually keep track of the info that I need?
Since this is homework, here are some hints rather than the solution:
Have a look at the class StringTokenizer to split the line.
Have a look at InputStreamReader and FileInputStream to read the file.
Have a look at the String class's split method.
To split the text you can use the methods String#indexOf(), String#lastIndexOf() and String#substring().
Then, to work out which team has won, I would convert the two point Strings into ints and compare the values.
How about a healthy serving of Regex?
Try something like this:
public static void readTeams() throws IOException {
    try {
        FileInputStream fstream = new FileInputStream("yourPath");
        BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
        String s;
        // read and split one line per iteration until the end of the file
        while ((s = br.readLine()) != null) {
            String[] tokens = s.split(":");
            for (String t : tokens) {
                System.out.println(t);
            }
        }
        br.close();
    } catch (FileNotFoundException ex) {
        Logger.getLogger(YourClass.class.getName()).log(Level.SEVERE, null, ex);
    }
}
Here is an example found at java-examples.com
String str = "one-two-three";
String[] temp;
/* delimiter */
String delimiter = "-";
/* given string will be split by the argument delimiter provided. */
temp = str.split(delimiter);
/* print substrings */
for(int i =0; i < temp.length ; i++)
    System.out.println(temp[i]);//prints one two three on different lines
For reading the input you can use BufferedReader and FileReader; search Google for examples.
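Putting the hints together, one line in the YEAR:TEAM1:POINTS1:TEAM2:POINTS2 format could be processed roughly like this. It is a sketch under assumptions: the names GameLine and summarize and the sample line are made up, team names are assumed to contain no colons, and reporting a tie separately is a choice the question does not specify.

```java
class GameLine {
    // Turns "YEAR:TEAM1:POINTS1:TEAM2:POINTS2" into "YEAR TEAM1 won|lost|tied".
    static String summarize(String line) {
        String[] parts = line.split(":");
        String year = parts[0];
        String team1 = parts[1];
        // Compare the scores as numbers, not as text.
        int points1 = Integer.parseInt(parts[2].trim());
        int points2 = Integer.parseInt(parts[4].trim());

        String outcome;
        if (points1 > points2) {
            outcome = "won";
        } else if (points1 < points2) {
            outcome = "lost";
        } else {
            outcome = "tied";
        }
        return year + " " + team1 + " " + outcome;
    }

    public static void main(String[] args) {
        // Hypothetical sample line; a real program would call summarize per line read.
        System.out.println(summarize("1999:TeamA:5:TeamB:3")); // 1999 TeamA won
    }
}
```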
