Splitting up a text file into two files (Java)

I need some help figuring out how to split a text file into two files in Java.
I have a text file in which each line contains a word, a space, and its index, with the lines in alphabetical order, e.g.
...
stand 345
stand 498
stare 894
...
What I would like to do is read in this file and then write two separate files. One file should contain each word only once, and the other should contain the positions of each word in the document.
The file is really big, and I was wondering whether I can use an array or a list to store the words and indexes before creating the files, or whether there is a better way.
I don't really know how to approach this.

I would suggest creating a HashMap that uses the word as the key and a list of indexes as the value, like HashMap<String, List<String>>. That way you can easily check which words are already in the map and update their index lists.
List<String> list = map.get(word);
if (list == null)
{
list = new ArrayList<String>();
map.put(word, list);
}
list.add(index);
After reading and storing all values, you just need to iterate through the map and write its keys in one file and values in another.
for (Map.Entry<String, List<String>> entry : map.entrySet()) {
String key = entry.getKey();
List<String> value = entry.getValue();
// writing code here
}
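Putting the two snippets above together, a minimal sketch might look like the following. The class name WordIndexGrouper and the method group are made up for illustration. The sketch assumes every line has the form "word index" with a single space, and it uses a TreeMap so the words come out sorted (a HashMap does not guarantee any order).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original answer.
final class WordIndexGrouper {

    // Groups "word index" lines into word -> list of indexes.
    // Lines that lack a word or an index are skipped.
    static Map<String, List<String>> group(Iterable<String> lines) {
        Map<String, List<String>> map = new TreeMap<>();
        for (String line : lines) {
            int space = line.indexOf(' ');
            if (space <= 0 || space == line.length() - 1) {
                continue; // malformed line: no word or no index
            }
            String word = line.substring(0, space);
            String index = line.substring(space + 1);
            // Create the list the first time a word is seen, then append.
            map.computeIfAbsent(word, k -> new ArrayList<>()).add(index);
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("stand 345", "stand 498", "stare 894");
        for (Map.Entry<String, List<String>> entry : group(lines).entrySet()) {
            System.out.println(entry.getKey() + " -> " + String.join(" ", entry.getValue()));
        }
    }
}
```

Keep in mind that this holds every word and index in memory at once, so it only suits files that fit comfortably in the heap.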

If your file is really big, consider using a database. If it is not too big, you can use a HashMap. You can also use a class like the one below. It requires the file to be sorted, and it writes the words to one file and the indices to another:
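import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
public class Split {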
private String fileName;
private PrintWriter fileWords;
private PrintWriter fileIndices;
public Split(String fname) {
fileName = fname;
if (initFiles()) {
writeList();
}
closeFiles();
}
private boolean initFiles() {
boolean retval = false;
try {
fileWords = new PrintWriter("words-" + fileName, "UTF-8");
fileIndices = new PrintWriter("indices-" + fileName, "UTF-8");
retval = true;
} catch (Exception e) {
System.err.println(e.getMessage());
}
return retval;
}
private void closeFiles() {
if (null != fileWords) {
fileWords.close();
}
if (null != fileIndices) {
fileIndices.close();
}
}
private void writeList() {
String lastWord = null;
List<String> wordIndices = new ArrayList<String>();
Path file = Paths.get(fileName);
Charset charset = Charset.forName("UTF-8");
try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
String line = null;
while ((line = reader.readLine()) != null) {
int len = line.length();
if (len > 0) {
int ind = line.indexOf(' ');
if (ind > 0 && ind < (len - 1)) {
String word = line.substring(0, ind);
String indice = line.substring(ind + 1, len);
if (!word.equals(lastWord)) {
if (null != lastWord) {
writeToFiles(lastWord, wordIndices);
}
lastWord = word;
wordIndices = new ArrayList<String>();
wordIndices.add(indice);
} else {
wordIndices.add(indice);
}
}
}
}
if (null != lastWord) {
writeToFiles(lastWord, wordIndices);
}
} catch (IOException x) {
System.err.format("IOException: %s%n", x);
}
}
private void writeToFiles(String word, List<String> list) {
boolean first = true;
fileWords.println(word);
for (String elem : list) {
if (first) {
first = false;
}
else {
fileIndices.print(" ");
}
fileIndices.print(elem);
}
fileIndices.println();
}
}
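Note that the file name handling is not very robust: the output names are built by prefixing the input name, so an input name that contains a directory will not work as expected. You can use the class like this: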
Split split = new Split("data.txt") ;

You can use this to save the words and the indices. You just need to call addLine for each line of your file.
Map<String, Set<Integer>> entries = new LinkedHashMap<>();
public void addLine(String word, Integer index) {
Set<Integer> indicesOfWord = entries.get(word);
if (indicesOfWord == null) {
entries.put(word, indicesOfWord = new TreeSet<>());
}
indicesOfWord.add(index);
}
To store them in separate files you can use this method:
public void storeInSeparateFiles(){
for (Entry<String, Set<Integer>> entry : entries.entrySet()) {
String word = entry.getKey();
Set<Integer> indices = entry.getValue();
// TODO: Save in separate files.
}
}
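One way the TODO above could be filled in is sketched below. The class name IndexFileWriter and the method store are assumptions for illustration, not part of the original answer. The sketch writes one word per line to the first writer and the matching indices, separated by spaces, to the same line of the second writer.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original answer.
final class IndexFileWriter {

    // Line n of wordsOut holds a word; line n of indicesOut holds its indices.
    static void store(Map<String, Set<Integer>> entries, Writer wordsOut, Writer indicesOut) {
        PrintWriter words = new PrintWriter(wordsOut);
        PrintWriter indices = new PrintWriter(indicesOut);
        for (Map.Entry<String, Set<Integer>> entry : entries.entrySet()) {
            words.println(entry.getKey());
            indices.println(entry.getValue().stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining(" ")));
        }
        // Flush but do not close: the caller owns the underlying writers.
        words.flush();
        indices.flush();
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> entries = new LinkedHashMap<>();
        entries.computeIfAbsent("stand", k -> new TreeSet<>()).add(498);
        entries.computeIfAbsent("stand", k -> new TreeSet<>()).add(345);
        entries.computeIfAbsent("stare", k -> new TreeSet<>()).add(894);

        // In-memory writers keep the demo free of file system side effects.
        StringWriter words = new StringWriter();
        StringWriter indices = new StringWriter();
        store(entries, words, indices);
        System.out.print(words);
        System.out.print(indices);
    }
}
```

To produce real files, the same method can be given writers obtained from Files.newBufferedWriter inside a try-with-resources block, so that both files are closed even if writing fails.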

Related

Find the line number of a text file by each word

I want to find the line numbers in a text file for each word; however, the method I wrote below only gives the first number, while I need a list of line numbers.
For instance, if "a" occurs in lines: 1,3,5, it should have a list of [1,3,5]. This list result then will be passed into another method for further process. But, my result only shows [1] for "a".
Can someone help me fix this? Thank you!
public SomeObject<Word> buildIndex(String fileName, Comparator<Word> comparator) {
SomeObject<Word> someObject = new SomeObject<>(comparator);
Comparator<Word> comp = checkComparator(someObject.comparator());
int num = 0;
if (fileName != null) {
File file = new File(fileName);
try (Scanner scanner = new Scanner(file, "latin1")) {
while (scanner.hasNextLine()) {
String lines;
if (comparator instanceof IgnoreCase) {
lines = scanner.nextLine().toLowerCase();
} else {
lines = scanner.nextLine();
}
if (lines != null) {
String[] lineFromText = lines.split("\n");
List<Integer> list = new ArrayList<>();
for (int i = 0; i < lineFromText.length; i++) {
String[] wordsFromText = lineFromText[i].split("\\W");
num++;
for (String s : wordsFromText) {
if (s != null && lineFromText[i].contains(s)) {
list.add(num);
}
if (s != null && !s.trim().isEmpty() && s.matches("^[a-zA-Z]*$")) {
doInsert(s, comp, someObject, list);
}
}
}
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
return someObject;
}
Does something like this work for you?
It reads in the lines one at a time, finds the words by splitting on whitespace, and then puts the words and the line numbers in a map where the key is the word and the value is a list of line numbers.
int lineCount = 1;
String fileName = "SomeFileName";
Map<String, List<Integer>> index = new HashMap<>();
// Pass a File: new Scanner("fileName") would scan the literal string instead of the file.
Scanner scanner = new Scanner(new File(fileName));
while (scanner.hasNextLine()) {
//get single line from file
String line = scanner.nextLine().toLowerCase();
//split into words
for (String word : line.split("\\s+")) {
// add lineCount to the word's List if the List is already there,
// otherwise create a new List and then add lineCount
index.compute(word,
(wd, list) -> list == null ? new ArrayList<>()
: list).add(lineCount);
}
// bump lineCount for next line
lineCount++;
}
Print them out.
index.forEach((k, v) -> System.out.println(k + " --> " + v));
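The approach above can also be wrapped in a small method, as in the sketch below. The class name LineIndex and the method build are made up for illustration. Two choices here are assumptions rather than part of the answer: words are split on anything that is not a letter a-z, and a TreeSet is used so that a word occurring twice on the same line is recorded only once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of the original answer.
final class LineIndex {

    // Maps each lower-cased word to the sorted set of 1-based line numbers it occurs on.
    static Map<String, SortedSet<Integer>> build(List<String> lines) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        int lineNumber = 0;
        for (String line : lines) {
            lineNumber++;
            for (String word : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (word.isEmpty()) {
                    continue; // split yields "" for empty lines or leading separators
                }
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(lineNumber);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("A cat", "a dog, a bird", "", "Cat");
        build(lines).forEach((word, numbers) -> System.out.println(word + " --> " + numbers));
    }
}
```

For a real file, the list of lines could come from Files.readAllLines, at the cost of loading the whole file into memory.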

Reading input files in Java

The purpose of this program is to read an input file and parse it looking for words. I used a class and instantiated objects to hold each unique word along with a count of that word as found in the input file. For instance, for a sentence “Word” is found once, “are” is found once, “fun” is found twice, ... This program ignores numeric data (e.g. 0, 1, ...) as well as punctuation (things like . , ; : - )
The assignment does not allow using a fixed size array to hold word strings or counts. The program should work regardless of the size of the input file.
I am getting the following compiling error:
'<>' operator is not allowed for source level below 1.7 [line: 9]
import java.io.*;
import java.util.*;
public class Test {
public static void main(String args[]) throws IOException {
HashMap<String,Word> map = new HashMap<>();
// The name of the file to open.
String fileName = "song.txt";
// This will reference one line at a time
String line = null;
try {
// FileReader reads text files in the default encoding.
FileReader fileReader =
new FileReader(fileName);
// Always wrap FileReader in BufferedReader.
BufferedReader bufferedReader =
new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
String[] words = line.split(" ");
for(String word : words){
if(map.containsKey(word)){
Word w = map.get(word);
w.setCount(w.getCount()+1);
}else {
Word w = new Word(word, 1);
map.put(word,w);
}
}
}
// Always close files.
bufferedReader.close();
}
catch(FileNotFoundException ex) {
System.out.println(
"Unable to open file '" +
fileName + "'");
}
catch(IOException ex) {
System.out.println(
"Error reading file '"
+ fileName + "'");
// Or we could just do this:
// ex.printStackTrace();
}
for(Map.Entry<String,Word> entry : map.entrySet()){
System.out.println(entry.getValue().getWord());
System.out.println("count:"+entry.getValue().getCount());
}
}
static class Word{
public Word(String word, int count) {
this.word = word;
this.count = count;
}
String word;
int count;
public String getWord() {
return word;
}
public void setWord(String word) {
this.word = word;
}
public int getCount() {
return count;
}
public void setCount(int count) {
this.count = count;
}
}
}
You either need to compile with a source level of 1.7 or later (which requires a JDK of version 1.7 or later), or change the line:
HashMap<String,Word> map = new HashMap<>();
to
HashMap<String,Word> map = new HashMap<String,Word>();
Replace
HashMap<String,Word> map = new HashMap<>();
with:
HashMap<String,Word> map = new HashMap<String,Word>();
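Separately from the compile error, the counting described in the question (ignoring digits and punctuation, no fixed-size arrays) could be sketched as follows. This is only an illustration: the class name WordCounter and the method count are hypothetical, it stores plain Integer counts instead of the question's Word class, and it treats every character that is not a letter A-Z or a-z as a separator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original question or answers.
final class WordCounter {

    // Counts case-sensitive words; digits and punctuation only separate words.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split("[^A-Za-z]+")) {
                if (!token.isEmpty()) {
                    // merge inserts 1 for a new word, otherwise adds 1 to the old count
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Words are fun, fun!", "2 fun 4 you");
        count(lines).forEach((word, n) -> System.out.println(word + " count:" + n));
    }
}
```

Because the map grows as new words appear, the sketch does not depend on the size of the input.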

Reading Unique Values

I wrote a piece of code that reads values from columns in a text file. To output the number of values, I used 'length', which works fine, but I need to count only the number of unique values.
public class REading_Two_Files {
public static void main(String[] args) {
try {
readFile(new File("C:\\Users\\teiteie\\Desktop\\RECSYS\\yoochoose-test.csv"), 0,( "C:\\Users\\teiteie\\Desktop\\RECSYS\\yoochoose-buys.csv"), 3);
//readFile(new File(File1,0, File2,3);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
//// 0 - will print column from file1
//3 - will print column from file 2
private static void readFile(File fin1,int whichcolumnFirstFile,String string,int whichcolumnSecondFile) throws IOException {
//private static void readFile(File fin1,int whichcolumnFirstFile,String string,int whichcolumnSecondFile) throws IOException
// code for this method.
//open the two files.
int noSessions = 0;
int noItems = 0;
// HashSet<String> uniqueLength = new HashSet<String>();
FileInputStream fis = new FileInputStream(fin1); //first file
FileInputStream sec = new FileInputStream(string); // second file
//Construct BufferedReader from InputStreamReader
BufferedReader br1= new BufferedReader(new InputStreamReader(fis));
BufferedReader br2= new BufferedReader(new InputStreamReader(sec));
String lineFirst = null, first_file[];
String lineSec = null, second_file [];
while ((lineFirst = br1.readLine()) != null && (lineSec = br2.readLine()) != null) {
first_file= lineFirst.split(",");
second_file = lineSec.split(",");
//int size = data[].size();
System.out.println(first_file[0]+" , "+second_file[0]);
if(first_file.length != 0){
noSessions++;
}
if(second_file.length != 0) {
noItems ++;
}
}
br1.close();
br2.close();
System.out.println("no of sessions "+noSessions+"\nno of items "+noItems );
}
}
To count only unique values we usually use a Set, since a Set is specified as containing only unique values. From the Set Javadoc:
A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
Essentially - put all of your values in a Set (generally a HashSet is the most efficient but if you want concurrency there are better options) and then take the Set.size() as the number of unique values you put in.
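A minimal sketch of that idea is shown below. The class name UniqueColumnCounter and the method countUnique are made up for illustration, and the sketch assumes plain comma-separated lines without quoted fields.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original answer.
final class UniqueColumnCounter {

    // Returns how many distinct values appear in the given 0-based column.
    static int countUnique(List<String> lines, int column) {
        Set<String> seen = new HashSet<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // nothing to count on an empty line
            }
            String[] fields = line.split(",");
            if (column < fields.length) {
                seen.add(fields[column].trim()); // duplicates are silently ignored
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,a", "2,b", "1,c");
        System.out.println("unique values in column 0: " + countUnique(lines, 0));
        System.out.println("unique values in column 1: " + countUnique(lines, 1));
    }
}
```

In the question's loop, the same effect could be had by adding first_file[0] and second_file[0] to two separate sets and printing their sizes at the end.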
Just to give you some inspiration:
Map<String,Integer> lAllWordsWithCount = new HashMap<String, Integer>();
String[] lAllMyStringToCount = {"Hello", "I", "am", "what", "I", "am"};
for (String lMyString : lAllMyStringToCount) {
int lCount = 1;
if (lAllWordsWithCount.containsKey(lMyString)){
lCount = lAllWordsWithCount.get(lMyString) +1;
}
lAllWordsWithCount.put(lMyString, lCount);
}
for(String lStringKey : lAllWordsWithCount.keySet()){
System.out.println(lStringKey+" count="+lAllWordsWithCount.get(lStringKey));
}
This will result in (the order may differ, since HashMap does not guarantee iteration order):
what count=1
am count=2
I count=2
Hello count=1

How to get specific rows of a 2D array returned by reading a CSV file in Java

This is my data.csv file. I want to extract the rows that have a given classtype x (any number) and store those extracted rows in a new array, so if I have n classtypes, I will have n new arrays.
age sex zipcode classtype
21 m 23423 1
12 f 23133 2
23 m 32323 2
23 f 23211 1
Example: If I want to retrieve the rows which have classtype 1 and store these values in a new 2D array, then the output should look like this:
array1={{21,m,23423,1},{23,f,23211,1}}
I have written the below code which gives me arrayList as output.
public class CsvParser {
public static void main(String[] args) {
try {
FileReader fr = new FileReader((args.length > 0) ? args[0] : "data.csv");
Map<String, List<String>> values = parseCsv(fr, "\\s,", true);
System.out.println(values);
} catch (IOException e) {
e.printStackTrace();
}
}
public static Map<String, List<String>> parseCsv(Reader reader, String separator, boolean hasHeader) throws IOException {
Map<String, List<String>> values = new LinkedHashMap<String, List<String>>();
List<String> columnNames = new LinkedList<String>();
BufferedReader br = null;
br = new BufferedReader(reader);
String line;
int numLines = 0;
while ((line = br.readLine()) != null) {
if (StringUtils.isNotBlank(line)) {
if (!line.startsWith("#")) {
String[] tokens = line.split(separator);
if (tokens != null) {
for (int i = 0; i < tokens.length; ++i) {
if (numLines == 0) {
columnNames.add(hasHeader ? tokens[i] : ("row_"+i));
} else {
List<String> column = values.get(columnNames.get(i));
if (column == null) {
column = new LinkedList<String>();
}
column.add(tokens[i]);
values.put(columnNames.get(i), column);
}
}
}
++numLines;
}
}
}
return values;
}
The output of this code is:
{age=[21,12,23,23],sex=[m,f,m,f],zipcode=[23423,23133,32323,23211],classtype=[1,2,2,1]}
I found a few links that talk about grouping elements with the Java Collectors class, but I don't know whether that is useful.
http://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-
Your help will be very useful.
You can try something like
String[][] allArrays = new String[50][]; //Set it to however many you need
String classType = "1";
int counter = 0;
Scanner s = new Scanner(new File(fileName));
while(s.hasNextLine()) {
String row = s.nextLine();
if (row.endsWith(classType)) {
allArrays[counter++] = row.split(","); //Adds the row, with each element being split by the comma
}
}
Do not reinvent the wheel; you can use an existing library to load the content of a CSV file into a Java collection. I usually use OpenCSV to read the contents of a CSV file into a List<String[]>. It can read everything in one line of code.
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
List<String[]> lines= reader.readAll();
Then iterate the list like this to do the grouping.
Map<String, List<String[]>> values = new LinkedHashMap<String, List<String[]>>();
for(String[] line : lines){
String key = line[3]; // classtype is the fourth column, so its index is 3
if(values.get(key) == null){
values.put(key, new ArrayList<String[]>());
}
values.get(key).add(line);
}
System.out.println(values);
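Since the question mentions Collectors.groupingBy, the same grouping can be sketched with streams and converted to the 2D arrays that were asked for. The class name RowGrouper and the method groupByColumn are hypothetical, and the rows are assumed to be already parsed into String[] (for example by the CSV reader above).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original answer.
final class RowGrouper {

    // Groups rows by the value in the given 0-based column, one 2D array per value.
    static Map<String, String[][]> groupByColumn(List<String[]> rows, int column) {
        // LinkedHashMap keeps the groups in order of first appearance.
        Map<String, List<String[]>> grouped = rows.stream()
                .collect(Collectors.groupingBy(row -> row[column],
                        LinkedHashMap::new, Collectors.toList()));
        Map<String, String[][]> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String[]>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), entry.getValue().toArray(new String[0][]));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"21", "m", "23423", "1"},
                new String[] {"12", "f", "23133", "2"},
                new String[] {"23", "m", "32323", "2"},
                new String[] {"23", "f", "23211", "1"});
        groupByColumn(rows, 3).forEach((classType, group) ->
                System.out.println(classType + " = " + Arrays.deepToString(group)));
    }
}
```

The header row, if present, should be skipped before grouping; otherwise it ends up in a group of its own under the key "classtype".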

How to read several files to array of hashtable and get corresponding file name

I want to read all files in a folder, with each file's content read into its own hashtable. Then I need to compare each word in a text file against each of these hashtables. If the word matches any word in a hashtable, a variable should be set based on the name of the file that the hashtable was created from.
Now I have two difficulties:
1. How to keep a list of hashtables, one for every file in the folder.
2. How to name the variable when the word is found in a hashtable.
I tried this code and it works for 1 file, 1 hashtable.
Hashtable HashTableName;
public String namebymatching;
// compare the spannedText for words in each dictionary in folder
public OneExtractor() throws IOException {
super();
// location
HashTableName = new Hashtable();
FilenameFilter ff = new OnlyExt("txt");
File folder = new File("/Folder Path/");
File[] files = folder.listFiles(ff);
Map<String, String> map = new LinkedHashMap<String, String>();
for (int i = 0; i < files.length; i++) {
FileReader fr = new FileReader(files[i].getPath());
BufferedReader reader = new BufferedReader(fr);
String st = "", str = " ";
while ((st = reader.readLine()) != null) {
str += st + " ";
}
map.put(files[i].getName(), str);
}
Set set = map.entrySet();
Iterator i = set.iterator();
while (i.hasNext()) {
Map.Entry me = (Map.Entry) i.next();
BufferedReader br = null;
try {
br = new BufferedReader(new FileReader(
"/Folder Path"+me.getKey()));
} catch (FileNotFoundException ex) {
Logger.getLogger(PersonExtractor.class.getName()).log(Level.SEVERE, null, ex);
}
try {
String line = br.readLine();
HashTableName.put(line.toLowerCase(), 1);
while (line != null) {
line = br.readLine();
if (!line.isEmpty())
HashTableName.put(line.toLowerCase(), 1);
}
} catch (Exception ex) {
} finally {
try {
br.close();
} catch (IOException ex) {
Logger.getLogger(PersonExtractor.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
}
private boolean isHashTableName(String s) {
return HashTableName.containsKey(s.toLowerCase());
}
///Extension
public static class OnlyExt implements FilenameFilter {
String ext;
public OnlyExt(String ext) {
this.ext = "." + ext;
}
public boolean accept(File dir, String name) {
return name.endsWith(ext);
}
}
// Find word match :
String word = //some function here to extract word;
namebymatching = "NOT" + filename; //filename should be here
if (isHashTableName(spannedText))
namebymatching = "ISPARTOF" +filename;//filename should be here
You can use another Hashtable to manage your collection of Hashtables. If you want to be slightly more modern, use a HashMap instead. The outer hash table maps each file to an inner hash table, and the inner hash tables can then be analyzed. For each file you find, add an entry to the outer hash table; then, for each entry, run the process you have already figured out for a single file.
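A rough sketch of that outer/inner structure is shown below. The class name DictionaryMatcher and its methods are made up for illustration, a Set of words stands in for each inner hash table (the question only ever stores the value 1), and the file contents are passed in as lists of lines so the example runs without touching the file system.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the original answer.
final class DictionaryMatcher {

    // Outer map: file name -> set of lower-cased words read from that file.
    private final Map<String, Set<String>> dictionaries = new LinkedHashMap<>();

    void addDictionary(String fileName, List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        dictionaries.put(fileName, words);
    }

    // Returns the name of the first file whose words contain the given word, or null.
    String findDictionary(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Set<String>> entry : dictionaries.entrySet()) {
            if (entry.getValue().contains(key)) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        DictionaryMatcher matcher = new DictionaryMatcher();
        matcher.addDictionary("person.txt", Arrays.asList("Alice", "Bob"));
        matcher.addDictionary("city.txt", Arrays.asList("Paris"));

        String fileName = matcher.findDictionary("paris");
        String nameByMatching = (fileName == null) ? "NOT" : "ISPARTOF" + fileName;
        System.out.println(nameByMatching);
    }
}
```

In the question's constructor, addDictionary would be called once per file returned by listFiles, with the lines read from that file.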
