OpenNLP - Tokenize an Array of Strings - java

I am trying to tokenize a text file using the OpenNLP tokenizer.
I read in a .txt file and store it in a list. I then want to iterate over every line, tokenize the line, and write the tokenized line to a new file.
In the line:
tokens[i] = tokenizer.tokenize(output[i]);
I get:
Type mismatch: cannot convert from String[] to String
This is my code:
public class Tokenizer {
public static void main(String[] args) throws Exception {
InputStream modelIn = new FileInputStream("en-token-max.bin");
try {
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);
CSVReader reader = new CSVReader(new FileReader("ParsedRawText1.txt"),',', '"', 1);
String csv = "ParsedRawText2.txt";
CSVWriter writer = new CSVWriter(new FileWriter(csv),CSVWriter.NO_ESCAPE_CHARACTER,CSVWriter.NO_QUOTE_CHARACTER);
//Read all rows at once
List<String[]> allRows = reader.readAll();
for(String[] output : allRows) {
//get current row
String[] tokens=new String[output.length];
for(int i=0;i<output.length;i++){
tokens[i] = tokenizer.tokenize(output[i]);
System.out.println(tokens[i]);
}
//write line
writer.writeNext(tokens);
}
writer.close();
}
catch (IOException e) {
e.printStackTrace();
}
finally {
if (modelIn != null) {
try {
modelIn.close();
}
catch (IOException e) {
}
}
}
}
}
Does anyone have any idea how to complete this task?

As the compiler says, you are trying to assign an array of Strings (the result of tokenize()) to a String (tokens[i] is a String). So you should declare and use tokens inside the inner loop and write tokens[] there, too:
for (String[] output : allRows) {
// get current row
for (int i = 0; i < output.length; i++) {
String[] tokens = tokenizer.tokenize(output[i]);
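// Arrays.toString (java.util.Arrays) prints the tokens rather than the array reference
System.out.println(Arrays.toString(tokens));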
// write line
writer.writeNext(tokens);
}
}
writer.close();
By the way, are you sure that your source file is a CSV? If it is actually a plain text file, then you are splitting the text on commas and giving those chunks to OpenNLP, which may then perform worse, because its model was trained on normal sentences, not on fragments split like yours.
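If the input really is plain text, one option is to skip the CSV reader and tokenize each whole line, writing the tokens of a line back as a single space-separated line. The following is only a minimal sketch of that idea: the class and method names are made up for illustration, and a whitespace split stands in for the OpenNLP tokenizer so the example runs without the model file. With OpenNLP you would pass tokenizer::tokenize instead, since tokenize(String) returns a String[].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper class, not part of OpenNLP.
class TokenizeLinesSketch {

    // Tokenizes every line and joins the tokens of each line with single spaces,
    // so the result has exactly one output line per input line.
    static List<String> tokenizeLines(List<String> lines, Function<String, String[]> tokenize) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] tokens = tokenize.apply(line); // a String[] per line, not a String
            out.add(String.join(" ", tokens));
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in tokenizer (assumption): plain whitespace split.
        Function<String, String[]> standIn = s -> s.trim().split("\\s+");
        List<String> lines = Arrays.asList("Add label abc to xyz", "Add   instance cdd to pqr");
        for (String tokenized : tokenizeLines(lines, standIn)) {
            System.out.println(tokenized);
        }
    }
}
```

Keeping the tokenizer behind a Function makes it easy to swap the stand-in for the real model once it is loaded.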

Related

Read the each string text from file in java

I am new to Java. I just want to read each string from a file and print it on the console.
Code:
public static void main(String[] args) throws Exception {
File file = new File("/Users/OntologyFile.txt");
try {
FileInputStream fstream = new FileInputStream(file);
BufferedReader infile = new BufferedReader(new InputStreamReader(
fstream));
String data = new String();
while ((data = infile.readLine()) != null) { // use if for reading just 1 line
System.out.println(""+data);
}
} catch (IOException e) {
// Error
}
}
If file contains:
Add label abc to xyz
Add instance cdd to pqr
I want to read each word from file and print it to a new line, e.g.
Add
label
abc
...
And afterwards, I want to extract the index of a specific string, for instance get the index of abc.
Can anyone please help me?
It sounds like you want to be able to do two things:
Print all words inside the file
Search the index of a specific word
In that case, I would suggest scanning all lines, splitting on any whitespace character (space, tab, etc.) and storing the words in a collection so you can search it later on. Now the question is: can words repeat, and if so, which index would you like to print? The first? The last? All of them?
Assuming words are unique, you can simply do:
public static void main(String[] args) throws Exception {
File file = new File("/Users/OntologyFile.txt");
ArrayList<String> words = new ArrayList<String>();
try {
FileInputStream fstream = new FileInputStream(file);
BufferedReader infile = new BufferedReader(new InputStreamReader(
fstream));
String data = null;
while ((data = infile.readLine()) != null) {
for (String word : data.split("\\s+")) {
words.add(word);
System.out.println(word);
}
}
} catch (IOException e) {
// Error
}
// search for the index of abc:
for (int i = 0; i < words.size(); i++) {
if (words.get(i).equals("abc")) {
System.out.println("abc index is " + i);
break;
}
}
}
If you don't break, it'll print every index of abc (if words are not unique). You could of course optimize it more if the set of words is very large, but for a small amount of data, this should suffice.
Of course, if you know in advance which words' indices you'd like to print, you could forego the extra data structure (the ArrayList) and simply print that as you scan the file, unless you want the printings (of words and specific indices) to be separate in output.
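If words can repeat and you want all of their positions, one possible optimisation is to build a map from each word to the list of its indices while scanning, so later lookups do not need another pass over the list. This is a sketch under assumptions: the class and method names are invented for the example, the input is passed in as a list of lines rather than read from the file, and indices count words across the whole input starting at 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration.
class WordIndexSketch {

    // Maps each word to every position at which it occurs (0-based, across all lines).
    static Map<String, List<Integer>> indexWords(List<String> lines) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        int index = 0;
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // a blank line splits into a single empty string
                }
                positions.computeIfAbsent(word, k -> new ArrayList<>()).add(index);
                index++;
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Add label abc to xyz", "Add instance cdd to pqr");
        Map<String, List<Integer>> positions = indexWords(lines);
        System.out.println("abc occurs at " + positions.get("abc"));
        System.out.println("Add occurs at " + positions.get("Add"));
    }
}
```

A word that never occurs is simply missing from the map, so positions.get(word) returns null for it.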
Split the String received for any whitespace with the regex \\s+ and print out the resultant data with a for loop.
public static void main(String[] args) { // Don't make main throw an exception
File file = new File("/Users/OntologyFile.txt");
try {
FileInputStream fstream = new FileInputStream(file);
BufferedReader infile = new BufferedReader(new InputStreamReader(fstream));
String data;
while ((data = infile.readLine()) != null) {
String[] words = data.split("\\s+"); // Split on whitespace
for (String word : words) { // Iterate through info
System.out.println(word); // Print it
}
}
} catch (IOException e) {
// Probably best to actually have this on there
System.err.println("Error found.");
e.printStackTrace();
}
}
Just add a for-each loop before printing the output:
while ((data = infile.readLine()) != null) { // use if for reading just 1 line
for(String temp : data.split(" "))
System.out.println(temp); // no need to concatenate the empty string.
}
This will automatically print the individual strings, obtained from each String line read from the file, in a new line.
And afterwards, I want to extract the index of a specific string, for
instance get the index of abc.
I am not sure which index you mean. If you want the position of the word within each line being read, add a temporary variable count initialised to 0
and increment it for each word until temp equals abc (the count is 1-based, because it is incremented before the comparison). Like this:
int count = 0;
for(String temp : data.split(" ")){
count++;
if("abc".equals(temp))
System.out.println("Index of abc is : "+count);
System.out.println(temp);
}
Use the split() method of the String class. You can adapt it to your needs.
Alternatively,
use length() to iterate over the complete line
and, whenever you reach a non-alphabetic character, take the substring() up to it and write it on a new line.
List<String> words = new ArrayList<String>();
while ((data = infile.readLine()) != null) {
for(String d : data.split(" ")) {
System.out.println(""+d);
}
words.addAll(Arrays.asList(data.split(" "))); // add the individual words, not the whole line
}
//words List will hold all the words. Do words.indexOf("abc") to get index
if(words.indexOf("abc") < 0) {
System.out.println("word not present");
} else {
System.out.println("word present at index " + words.indexOf("abc"));
}

Export a string in Java to CSV

How can I export a string in Java to a CSV file in the format below, using only one column?
This is what I am expecting:
Column 1
Row 1: string1,string2,string3
Row 2: string4, string5, string6
Thanks in advance
In the code below you provide a List of elements. Each element contains the info for one line of the csv file.
The StringBuilder is used to create the String for one line, which then is output at once to the file.
// Elements is a placeholder class with the fields info1, info2 and info3
public void writeCsvFile(List<Elements> elements, String fileName) throws IOException {
BufferedWriter csvFile = null;
String delim = ",";
try {
csvFile = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(fileName), StandardCharsets.UTF_8));
for (int i = 0; i < elements.size(); i++) {
StringBuilder buf = new StringBuilder();
Elements elem = elements.get(i);
buf.append(elem.info1).append(delim);
buf.append(elem.info2).append(delim);
buf.append(elem.info3);
csvFile.write(buf.toString());
csvFile.newLine();
}
} finally {
try {
if (csvFile != null) {
csvFile.close();
}
} catch (IOException e) {
// empty
}
}
}
You essentially have to "escape" the commas so the CSV reader won't interpret them as column delimiters.
If you wrap your row values in quotes, the commas inside them should be ignored as delimiters.
This will give you 3 columns
Value1,Value2,Value3
This should give you 1 column with the entire string as a single value
"Value1,Value2,Value3"
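The quoting rule can be sketched as a small helper that wraps a value in double quotes when it contains a comma, a quote, or a line break, and doubles any embedded quotes, which is the convention most CSV readers follow. The class and method names here are hypothetical and only meant to illustrate the idea.

```java
// Hypothetical helper class for illustration.
class CsvQuoteSketch {

    // Returns the value unchanged when it is safe, otherwise wraps it in quotes
    // and doubles every embedded double quote.
    static String quoteField(String value) {
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        if (!needsQuotes) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // Each call produces one CSV row with a single column.
        System.out.println(quoteField("string1,string2,string3"));
        System.out.println(quoteField("string4, string5, string6"));
    }
}
```

Writing each row through a helper like this keeps the whole string in one column even though it contains commas.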

Keep quotes when parsing csv

I know there is already a question related to this: How to keep quotes when parsing csv file? (But it's for C#)
Let's say I have a csv with values e.g:
12312414-DEF_234, "34-DE, 234-EG, 36354-EJ", 23
...
When I parse it with OpenCSV, it doesn't keep the quotes.
CSVReader reader = new CSVReader(new FileReader("../path.csv"), ',', '\"');
List<String[]> list = reader.readAll();
String[][] csvArray = new String[list.size()][];
csvArray = list.toArray(csvArray);
So, after I store all of the values into an array, when I try to print out the values (for checking), the quotes are not there.
...
System.out.println(csvArray[i][j]);
// output below
// 34-DE, 234-EG, 36354-EJ
How can I keep the quotes? The reason is because I am going to be changing some values, and need to re-output it back into a csv.
The CSVReader has to parse and remove the quotes, otherwise you wouldn't get the single value 34-DE, 234-EG, 36354-EJ, but the three values "34-DE, 234-EG and 36354-EJ". So it's OK that the quotes are being removed.
The CSVWriter should add them again for every value that needs quoting.
Have you tried to write the array back into a CSV? The value 34-DE, 234-EG, 36354-EJ - actually any value that contains a comma - should be quoted.
public static void readCSV(){
String csvFile = "input.csv";
BufferedReader br = null;
String line = "";
String splitter = ",";
try {
br = new BufferedReader(new FileReader(csvFile));
while ((line = br.readLine()) != null) {
// use comma as separator
String[] words = line.split(splitter);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (br != null) {
try {
br.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
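Note that a plain split on commas, as in the snippet above, also breaks the quoted value into three pieces. If you want to parse by hand and keep the quotes, a quote-aware split is one option. The sketch below is an assumption-laden illustration, not OpenCSV behaviour: the class and method names are invented, it handles a single line only, and it trims whitespace around each field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration.
class QuoteAwareSplitSketch {

    // Splits on commas that are outside double quotes and keeps the quote characters.
    static List<String> splitKeepingQuotes(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // toggle on every quote character
                current.append(c);    // keep the quote in the output
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim()); // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String line = "12312414-DEF_234, \"34-DE, 234-EG, 36354-EJ\", 23";
        for (String field : splitKeepingQuotes(line)) {
            System.out.println(field);
        }
    }
}
```

Because the quotes are preserved, the fields can be joined with commas again to reproduce a valid CSV line after you have changed individual values.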

Write Array of String Arrays to File (txt,csv, etc)

I have an Arraylist of String Arrays called NewArray.
ArrayList<String[]> NewArray = new ArrayList<String[]>();
The data in NewArray looks somewhat like
[Vial1,Dest1]
[Vial2,Dest1]
[Vial3,Dest2]
[Vial4,Dest2]
I want to save this data, in this format (without the brackets) to a CSV/text file (with headers). The ideal output format would be:
VialNo,DestinationNo (these are the headers)
Vial1,Dest1
Vial2,Dest1
Vial3,Dest2
Vial4,Dest2
How would I use something like FileWriter to obtain that desired output in a txt/CSV file?
I've tried something like
FileWriter writer = new FileWriter("output.txt");
for(String[] str: NewArray) {
writer.write(str);
}
writer.close();
But I'm getting the error "The method write(int) in the type OutputStreamWriter is not applicable for the arguments (String[])"
public static void main(String[] args) throws Exception {
// initialize
ArrayList<String[]> list = new ArrayList<String[]>();
list.add(new String[] {"Vial1","Dest1"});
list.add(new String[] {"Vial2","Dest2"});
list.add(new String[] {"Vial3","Dest3"});
list.add(new String[] {"Vial4","Dest4"});
// writer
FileWriter writer = new FileWriter("output.txt");
// headers
writer.write("VialNo,DestinationNo\n");
writer.flush();
// data
for(String[] arr: list) {
String appender = "";
for(String s : arr){
writer.write(appender + s);
appender = ",";
}
writer.write("\n");
writer.flush();
}
writer.close();
}
This gave me the output
VialNo,DestinationNo
Vial1,Dest1
Vial2,Dest2
Vial3,Dest3
Vial4,Dest4
You need to loop over each string in each array, not try to simply print out the array. I also used an appender for formatting the file as a csv.
Updated code to include creating the headers
I would suggest you loop through your list and write one line per array. It is better to open your writer in a try-with-resources block (Java's counterpart to a C# using), so you don't have to bother with closing and flushing streams, writers, etc.
If you actually want to save the object and not just write the content of the array down, then take a look at the serializer which outputs the object as xml which you can save to a file and load through a deserialize.
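The loop-and-write suggestion could look roughly like the sketch below, which uses try-with-resources so the writer is closed (and therefore flushed) automatically. The class name, method name, and parameters are assumptions made for the example, and the values are written as-is without any CSV quoting.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration.
class WriteRowsSketch {

    // Writes a header line followed by one comma-separated line per String[].
    static void writeRows(Path target, String header, List<String[]> rows) throws IOException {
        // The writer is closed automatically when the try block ends.
        try (BufferedWriter writer = Files.newBufferedWriter(target)) {
            writer.write(header);
            writer.newLine();
            for (String[] row : rows) {
                writer.write(String.join(",", row));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Vial1", "Dest1"});
        rows.add(new String[] {"Vial2", "Dest1"});
        writeRows(Paths.get("output.txt"), "VialNo,DestinationNo", rows);
    }
}
```

String.join removes the need for a separate appender variable to place the commas.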
In case the accepted answer didn't work for someone (it didn't for me), try this for an
ArrayList of a specific class type, for example ArrayList tester = new ArrayList();
public class PassDataToFile {
public static void main(String[] args) throws IOException {
try {
RSSFeedParser parser = new RSSFeedParser("http://feeds.reuters.com/reuters/technologysectorNews");
Feed feed = parser.readFeed();
String input = "C:\\Users\\Special\\workspace\\demo.txt";
File newFile = new File(input);
if (!newFile.exists()){
newFile.createNewFile();
}
FileWriter writer = new FileWriter(newFile.getAbsoluteFile());
int sx = feed.getMessages().size();
for (int i = 0; i < sx; i++) {
writer.write(feed.getMessages().get(i).toString() + "\n");
}
writer.close();
System.out.println("File successfully written into " + input);
} catch (IOException e) {
System.out.println("File writing operation failed ");
e.printStackTrace();
}
}
}

In Java, I want to split an array into smaller arrays, the length of which varies with inputted text files

So far, I have 2 arrays: one with stock codes and one with a list of file names. What I want to do is read the .txt file for each of the file names in the second array and then split this input into: 1. Arrays for each file 2. Arrays for each part within each file.
I have this:
ImportFiles f1 = new ImportFiles("File");
for (String file : FileArray.filearray) {
if (debug) {
System.out.println(file);
}
try {
String line;
String fileext = "C:\\ASCIIpdbSKJ\\"+file+".txt";
importstart = new BufferedReader(new FileReader(fileext));
for (line = importstart.readLine(); line != null; line = importstart.readLine()) {
importarray.add (line);
if (debug){
System.out.println(importarray.size());
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
importarray.add("End");
This approach works and creates one large array containing all the files. Would it be easier to change the input method so the data is split as it comes in, or to split the large array I already have?
At this point, the stock code array is irrelevant. Once I have split the arrays down I know where I will go from there.
Thanks.
Edit: I am aware that this code is incomplete in terms of { } but it is only printstreams and debugging missed off.
If you want to get a map with a filename and all its lines from all the files, here are relevant code parts:
Map<String, List<String>> fileLines = new HashMap<String, List<String>>();
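for (String file : FileArray.filearray) {
// path built the same way as in the question
String fileext = "C:\\ASCIIpdbSKJ\\" + file + ".txt";
BufferedReader reader = new BufferedReader(new FileReader(fileext));
List<String> lines = new ArrayList<String>();
String line;
while ((line = reader.readLine()) != null) {
lines.add(line);
}
reader.close();
fileLines.put(file, lines);
}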
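If you would rather keep the existing import loop and split the large list afterwards, the "End" entries added after each file can serve as separators. This is a minimal sketch under that assumption: the class and method names are invented, and it only works if no real data line is exactly equal to the marker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration.
class SplitOnSentinelSketch {

    // Splits a flat list into sublists, starting a new sublist after each sentinel entry.
    static List<List<String>> splitOnSentinel(List<String> flat, String sentinel) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : flat) {
            if (line.equals(sentinel)) {
                groups.add(current);          // close the current file's lines
                current = new ArrayList<>();  // start collecting the next file
            } else {
                current.add(line);
            }
        }
        if (!current.isEmpty()) {
            groups.add(current); // trailing lines without a closing sentinel
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> importarray = Arrays.asList("a1", "a2", "End", "b1", "End");
        for (List<String> fileLines : splitOnSentinel(importarray, "End")) {
            System.out.println(fileLines);
        }
    }
}
```

Splitting while reading, as in the map-based code above, avoids the marker entirely and keeps the file name attached to its lines, so it is usually the simpler of the two options.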
