How to count duplicate entries in a .csv file?

I have a .csv file that is formatted like this:
ID,date,itemName
456,1-4-2020,Lemon
345,1-3-2020,Bacon
345,1-4-2020,Sausage
123,1-1-2020,Apple
123,1-2-2020,Pineapple
234,1-2-2020,Beer
345,1-4-2020,Cheese
I have already implemented the algorithm to go through the file, read the first number (the ID) on each line, sort the lines by it in ascending order, and write a new output:
123,1-1-2020,Apple
123,1-2-2020,Pineapple
234,1-2-2020,Beer
345,1-3-2020,Bacon
345,1-4-2020,Cheese
345,1-4-2020,Sausage
456,1-4-2020,Lemon
My question is: how do I extend my algorithm so that the output counts the duplicate first-number entries and is reformatted to look like this...
123,1-1-2020,1,Apple
123,1-2-2020,1,Pineapple
234,1-2-2020,1,Beer
345,1-3-2020,1,Bacon
345,1-4-2020,2,Cheese,Sausage
456,1-4-2020,1,Lemon
...so that it counts the number of occurrences for each ID, writes that count on the line, and, if the date of that ID is also the same, combines the item names on the same line. Below is my source code (each line in the .csv is turned into an object of type 'Receipt' that has an ID, date, and name with their respective get() methods):
public class ReadFile {
private static List<Receipt> readFile() {
List<Receipt> receipts = new ArrayList<>();
try {
BufferedReader reader = new BufferedReader(new FileReader("dataset.csv"));
// Move past the first title line
reader.readLine();
String line = reader.readLine();
// Start reading from second line till EOF, split each string at ","
while (line != null) {
String[] attributes = line.split(",");
Receipt attribute = getAttributes(attributes);
receipts.add(attribute);
line = reader.readLine();
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
return receipts;
}
private static Receipt getAttributes(String[] attributes) {
// Get ID located before the first ","
long memberNumber = Long.parseLong(attributes[0]);
// Get date located after the first ","
String date = attributes[1];
// Get name located after the second ","
String name = attributes[2];
return new Receipt(memberNumber, date, name);
}
// Parse the data into new file after sorting
private static void parse(List<Receipt> receipts) {
PrintWriter output = null;
try {
output = new PrintWriter("output.txt");
} catch (FileNotFoundException e) {
e.printStackTrace();
}
// For each receipts, assert the text output stream is not null, print line.
for (Receipt p : receipts) {
assert output != null;
output.println(p.getMemberNumber() + "," + p.getDate() + "," + p.getName());
}
assert output != null;
output.close();
}
// Main method, accept input file, sort and parse
public static void main(String[] args) {
List<Receipt> receipts = readFile();
QuickSort q = new QuickSort();
q.quickSort(receipts);
parse(receipts);
}
}

The easiest way is to use a map.
Sample data from your file.
String[] lines = {
"123,1-1-2020,Apple",
"123,1-2-2020,Pineapple",
"234,1-2-2020,Beer",
"345,1-3-2020,Bacon",
"345,1-4-2020,Cheese",
"345,1-4-2020,Sausage",
"456,1-4-2020,Lemon"};
Create a map.
As you read the lines, split them and add them to the map using the compute method. This puts the line in if the key (ID and date) doesn't exist yet. Otherwise it simply appends the last field (the item name) to the existing entry.
The file does not have to be sorted, but the values are appended in the order in which they are encountered.
Map<String, String> map = new LinkedHashMap<>();
for (String line : lines) {
String[] vals = line.split(",");
// the key joins ID and date with a separator so that different pairs cannot collide
// if v is null, add the line
// if v exists, take the existing line and append the last value
map.compute(vals[0] + "," + vals[1], (k, v) -> v == null ? line : v + "," + vals[2]);
}
for (String line : map.values()) {
String[] fields = line.split(",",3);
int count = fields[2].split(",").length;
System.out.printf("%s,%s,%s,%s%n", fields[0],fields[1],count,fields[2]);
}
For this sample, the run prints:
123,1-1-2020,1,Apple
123,1-2-2020,1,Pineapple
234,1-2-2020,1,Beer
345,1-3-2020,1,Bacon
345,1-4-2020,2,Cheese,Sausage
456,1-4-2020,1,Lemon
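A variation on the same idea, sketched under the assumption that each line has exactly three comma-separated fields: instead of appending to a string and re-splitting it later, the map can hold a list of item names per ID and date, so the count is simply the list size. The class name GroupReceipts and the method group are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupReceipts {
    // Groups item names by "ID,date" and formats each group as ID,date,count,items...
    static List<String> group(List<String> lines) {
        // LinkedHashMap keeps the keys in the order they were first seen
        Map<String, List<String>> itemsByKey = new LinkedHashMap<>();
        for (String line : lines) {
            String[] vals = line.split(",", 3);
            String key = vals[0] + "," + vals[1];
            itemsByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(vals[2]);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : itemsByKey.entrySet()) {
            List<String> items = e.getValue();
            result.add(e.getKey() + "," + items.size() + "," + String.join(",", items));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "345,1-3-2020,Bacon",
            "345,1-4-2020,Cheese",
            "345,1-4-2020,Sausage");
        // prints 345,1-3-2020,1,Bacon and then 345,1-4-2020,2,Cheese,Sausage
        group(lines).forEach(System.out::println);
    }
}
```

Keeping the items in a list avoids parsing the accumulated string a second time, at the cost of one extra formatting pass at the end.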

Related

algorithm arraylist remove String duplicates and save to new text file

I am currently writing an algorithm that creates an ArrayList from a .txt file and checks it for duplicates with a loop. The loop should work like this:
Line one is written to the new .txt file, and the boolean found is set to true because the string has now been seen.
Line two is written to the new .txt file, and so on.
But if two strings are identical, the duplicate (i.e. the second string) should just be ignored, and the loop should continue with the next one.
public class test {
public static void main(String[] args) throws IOException {
String suche = "88 BETRAG-MINUS VALUE 'M'.";
String suche2 = "88 BETRAG-PLUS VALUE 'P'";
boolean gefunden = false;
File neueDatei = new File("C:\\Dev\\xx.txt");
if (neueDatei.createNewFile()) {
System.out.println("Datei wurde erstellt");
}
if (gefunden == false) {
dateiEinlesen(null, gefunden);
ArrayList<String> arr = null;
inNeueDateischreiben(neueDatei, gefunden, arr, suche, suche2);
}
}
public static void dateiEinlesen(File neueDatei, boolean gefunden) {
BufferedReader reader;
String zeile = null;
try {
reader = new BufferedReader(new FileReader("C:\\Dev\\Test.txt"));
zeile = reader.readLine();
ArrayList<String[]> arr = new ArrayList<String[]>();
while (zeile != null) {
arr.add(zeile.split(" "));
zeile = reader.readLine();
}
System.out.println(arr);
} catch (IOException e) {
System.err.println("Error2 :" + e);
}
}
public static void inNeueDateischreiben(File neueDatei, boolean gefunden, ArrayList<String> arr, String suche2,
String suche22) throws IOException {
FileWriter writer = new FileWriter(suche22);
String lastValue = null;
for (Iterator<String> i = arr.iterator(); i.hasNext();) {
String currentValue = i.next();
if (lastValue != null && currentValue.equals(lastValue)) {
i.remove();
{
writer.write(suche2.toString());
gefunden = true;
}
}
writer.close();
}
}
}

Your variable names (suche2, suche22) make the code difficult to read.
Other than that, your writing algorithm looks off. You only compare adjacent lines, while duplicate lines could be anywhere in the file. In addition, writer.write is only reached when you find a duplicate, so unique lines are never written. The call is also wrong: arr is null when it is passed in, and the arguments suche and suche2 end up in the parameters suche2 and suche22, so the FileWriter is opened on a search string rather than on the output file.
Here are some general steps to write this correctly:
1. Open the file so you can read it line by line.
2. Create a file writer.
3. Create a set or dictionary-like data structure that lets you look up items in constant time.
4. For each line that you read, do the following:
   a. Check whether the line exists in the dictionary.
   b. If not, write it to the new file.
   c. If it already exists in the dictionary, skip to step 4.
   d. Add that line to the dictionary for later comparisons and go to step 4.
5. When the lines are exhausted, close both files.
I suggest you rewrite your code completely, as the current version is very difficult to amend.
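The steps above could be sketched roughly as follows. This is only one possible shape, not a drop-in replacement: the class name RemoveDuplicateLines, the method copyUnique, and the default file names are hypothetical, and a HashSet stands in for the "set or dictionary" mentioned in the steps.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

class RemoveDuplicateLines {
    // Copies each distinct line of 'in' to 'out', keeping the first occurrence.
    // Returns the number of lines written.
    static int copyUnique(Path in, Path out) throws IOException {
        Set<String> seen = new HashSet<>();
        int written = 0;
        // try-with-resources closes both files, even if an exception is thrown
        try (BufferedReader reader = Files.newBufferedReader(in);
             BufferedWriter writer = Files.newBufferedWriter(out)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Set.add returns false when the line is already in the set
                if (seen.add(line)) {
                    writer.write(line);
                    writer.newLine();
                    written++;
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default file names; pass real paths as arguments
        Path in = Paths.get(args.length > 0 ? args[0] : "Test.txt");
        Path out = Paths.get(args.length > 1 ? args[1] : "unique.txt");
        System.out.println(copyUnique(in, out) + " unique lines written");
    }
}
```

Because Set.add both checks for and records the line, the lookup and the "add to the dictionary" step collapse into a single call.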

In Java, using Scanner, Is there a way to find a specific String in a CSV file, use it as a column header and return all values under it?

I am trying to find the String "5464" in a csv document then have it return all of the values under that String (same number of Delimiters from the start of the line), until reaching the end of the list (no more values in the column). Any help would be sincerely appreciated.
import javax.swing.JOptionPane;
public class SearchNdestroyV2 {
private static Scanner x;
public static void main(String[] args) {
String filepath = "tutorial.txt";
String searchTerm = "5464"
readRecord(searchTerm,filepath);
}
public void readRecord(String searchTerm, String filepath)
{
boolean found = false;
String ID = ""; String ID2 = ""; String ID3 = "";
}
try
{
x = new Scanner(new File(filepath));
x.useDelimeter("[,\n]");
while(x.hasNext() && !found )
{
ID = x.next();
ID2 = x.nextLine();
ID3 = x.nextLine();
if(ID.equals(searchTerm))
{
found = true;
}
}
if (found)
{
JOptionPane.showMessageDialog(null,"ID: " + ID + "ID2: " + ID2 + "ID3: "+ID3);
}
}
else
{
JOptionPane.showMessageDialog(null, "Error:");
}
catch(Exception e)
{
}
{
}

I'm not exactly sure of what you mean. The way I read your question:
You want to locate a specific String ("5464") that is contained within a specific column within a comma (,) delimited CSV file. If this specific string (search term) is found then retrieve all other values contained within the same column for the rest of the CSV file records from the point of location. Here is how:
import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;
import javax.swing.JOptionPane;
public class SearchNDestroyV2 {
private Scanner fileInput;
public static void main(String[] args) {
// Do this if you don't want to deal with statics
new SearchNDestroyV2().startApp(args);
}
private void startApp(String[] args) {
String filepath = "tutorial.txt";
String searchTerm = "5464";
readRecord(searchTerm, filepath);
}
public void readRecord(String searchTerm, String filepath) {
try {
fileInput = new Scanner(new File(filepath));
// Variable to hold each file line data read.
String line;
// Used to hold the column index value to
// where the found search term is located.
int foundColumn = -1;
// An ArrayList to hold the column values retrieved from file.
ArrayList<String> columnList = new ArrayList<>();
// Read file to the end...
while(fileInput.hasNextLine()) {
// Read in file - 1 trimmed line per iteration
line = fileInput.nextLine().trim();
//Skip blank lines (if any).
if (line.equals("")) {
continue;
}
// Split the currently read line into a String Array
// based on the comma (,) delimiter
String[] lineParts = line.split("\\s{0,},\\s{0,}"); // Split on any comma/space situation.
// Iterate through the lineParts array to see if any
// delimited portion equals the search term.
for (int i = 0; i < lineParts.length; i++) {
/* This IF statement will always accept the column data and
store it if the foundColumn variable equals i OR the current
column data being checked is equal to the search term.
Initially when declared, foundColumn equals -1 and will
never equal i unless the search term is indeed found. */
if (foundColumn == i || lineParts[i].equals(searchTerm)) {
// Found a match
foundColumn = i; // Hold the column index number of the found item.
columnList.add(lineParts[i]); // Add the found item to the List.
break; // Get out of this loop. Don't need it anymore for this line.
}
}
}
if (foundColumn != -1) {
System.out.println("Items Found:" + System.lineSeparator() +
"============");
for (String str : columnList) {
System.out.println(str);
}
}
else {
JOptionPane.showMessageDialog(null, "Can't find the Search Term: " + searchTerm);
}
}
catch(Exception ex) {
System.out.println(ex.getMessage());
}
}
}
If, however, what you want is to search through the CSV file and, as soon as any column equals the search term ("5464"), store the whole CSV line (all its data columns) that contains it, here is how:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;
import javax.swing.JFrame;
import javax.swing.JOptionPane;
public class SearchNDestroyV2 {
/* A JFrame used as Parent for displaying JOptionPane dialogs.
Using 'null' can allow the dialog to open behind other open
applications (like the IDE). This ensures that it will be
displayed above all other applications at center screen. */
JFrame iFRAME = new JFrame();
{
iFRAME.setAlwaysOnTop(true);
iFRAME.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
iFRAME.setLocationRelativeTo(null);
}
public static void main(String[] args) {
// Do this if you don't want to deal with statics
new SearchNDestroyV2().startApp(args);
}
private void startApp(String[] args) {
String filepath = "tutorial.txt";
String searchTerm = "5464";
ArrayList<String> recordsFound = readRecord(searchTerm, filepath);
/* Display any records found where a particular column
matches the Search Term. */
if (!recordsFound.isEmpty()) {
System.out.println("Records Found:" + System.lineSeparator()
+ "==============");
for (String str : recordsFound) {
System.out.println(str);
}
}
else {
JOptionPane.showMessageDialog(iFRAME, "Can't find the Search Term: " + searchTerm);
iFRAME.dispose();
}
}
/**
* Returns an ArrayList (of String) of any comma delimited CSV file line
* records which contain any column matching the supplied Search Term.<br>
*
* @param searchTerm (String) The String to search for in all Record
* columns.<br>
*
* @param filepath (String) The CSV (or text) file that contains the data
* records.<br>
*
* @return ({@code ArrayList<String>}) An ArrayList of String Type which
* contains the file line records where any particular column
* matches the supplied Search Term.
*/
public ArrayList<String> readRecord(String searchTerm, String filepath) {
// An ArrayList to hold the line(s) retrieved from file
// that match the search term.
ArrayList<String> linesList = new ArrayList<>();
// Try-with-resources used here to auto-close the Scanner reader.
try (Scanner fileInput = new Scanner(new File(filepath))) {
// Variable to hold each file line data read.
String line;
// Read file to the end...
while (fileInput.hasNextLine()) {
// Read in file - 1 trimmed line per iteration
line = fileInput.nextLine().trim();
//Skip blank lines (if any).
if (line.equals("")) {
continue;
}
// Split the currently read line into a String Array
// based on the comma (,) delimiter
String[] lineParts = line.split("\\s{0,},\\s{0,}"); // Split on any comma/space situation.
// Iterate through the lineParts array to see if any
// delimited portion equals the search term.
for (int i = 0; i < lineParts.length; i++) {
if (lineParts[i].equals(searchTerm)) {
// Found a match
linesList.add(line); // Add the found line to the List.
break; // Get out of this loop. Don't need it anymore for this line.
}
}
}
}
catch (FileNotFoundException ex) {
System.out.println(ex.getMessage());
}
return linesList; // Return the ArrayList
}
}
Please try to note the differences between the two code examples. In particular how the file reader (Scanner object) is closed, etc.

How to access key contents of nested hashmap

I'm creating an inverted index for an information retrieval course and can't figure out how to see if a word is in my nested hashmap.
"inner" contains a word & its frequency while the "invertedIndex" contains the name of the document it occurs in.
When processing a search, I'm trying to see if the user input defined as "query" is in the inner hashmap. I'm pretty sure the error is arising from the nested for loop at the bottom of my code...
My code is below.
public class PositionalIndex extends Stemmer{
// no more than this many input files needs to be processed
final static int MAX_NUMBER_OF_INPUT_FILES = 100;
// an array to hold Gutenberg corpus file names
static String[] inputFileNames = new String[MAX_NUMBER_OF_INPUT_FILES];
static int fileCount = 0;
// loads all files names in the directory subtree into an array
// violates good programming practice by accessing a global variable (inputFileNames)
public static void listFilesInPath(final File path) {
for (final File fileEntry : path.listFiles()) {
if (fileEntry.isDirectory()) {
listFilesInPath(fileEntry);
}
else if (fileEntry.getName().endsWith((".txt"))) {
inputFileNames[fileCount++] = fileEntry.getPath();
}
}
System.out.println("File count: " + fileCount);
}
public static void main(String[] args){
// did the user provide correct number of command line arguments?
// if not, print message and exit
if (args.length != 1){
System.err.println("Number of command line arguments must be 1");
System.err.println("You have given " + args.length + " command line arguments");
System.err.println("Incorrect usage. Program terminated");
System.err.println("Correct usage: java Ngrams <path-to-input-files>");
System.exit(1);
}
// extract input file name from command line arguments
// this is the name of the file from the Gutenberg corpus
String inputFileDirName = args[0];
System.out.println("Input files directory path name is: " + inputFileDirName);
// collects file names and write them to
listFilesInPath(new File (inputFileDirName));
// wordPattern specifies pattern for words using a regular expression
Pattern wordPattern = Pattern.compile("[a-zA-Z]+");
// wordMatcher finds words by spotting word word patterns with input
Matcher wordMatcher;
// a line read from file
String line;
// br for efficiently reading characters from an input stream
BufferedReader br = null;
// an extracted word from a line
String word;
// simplified version of porterStemmer
Stemmer porterStemmer = new Stemmer();
System.out.println("Processing files...");
// create an instance of the Stemmer class
Stemmer stemmer = new Stemmer();
Map<String, Map<String, Integer>> invertedIndex = new HashMap<String, Map<String, Integer>>();
Map<String, Integer> inner = new HashMap<String, Integer>();
// process one file at a time
for (int index = 0; index < fileCount; index++){
// open the input file, read one line at a time, extract words
// in the line, extract characters in a word, write words and
// character counts to disk files
try {
// get a BufferedReader object, which encapsulates
// access to a (disk) file
br = new BufferedReader(new FileReader(inputFileNames[index]));
// as long as we have more lines to process, read a line
// the following line is doing two things: makes an assignment
// and serves as a boolean expression for while test
while ((line = br.readLine()) != null) {
// process the line by extracting words using the wordPattern
wordMatcher = wordPattern.matcher(line);
// process one word at a time
while ( wordMatcher.find() ) {
// extract the word
word = line.substring(wordMatcher.start(), wordMatcher.end());
word = word.toLowerCase();
//use Stemmer class to stem word & convert to lowercase
porterStemmer.stemWord(word);
if (!inner.containsKey(word)) {
inner.put(word, 1);
}
else
{
inner.put(word, inner.get(word) + 1);
}
} // end one word at a time while
} // end outer while
invertedIndex.put(inputFileNames[index], inner);
/*for(String x : inner.keySet()) {
System.out.println(x);
}*/
inner.clear();
} // end try
catch (IOException ex) {
System.err.println("File " + inputFileNames[index] + " not found. Program terminated.\n");
System.exit(1);
}
} // end for
System.out.print("Enter a query: ");
Scanner kbd = new Scanner(System.in);
String query = kbd.next();
for(String fileName : invertedIndex.keySet()) {
for(String wordInFile : invertedIndex.get(fileName).keySet())
{
if(wordInFile.equals(query))
{
System.out.println(query + " was found in document " + fileName);
}
}
}
}
}

Why are you invoking:
inner.clear()
It seems that a new inner map needs to be created for every file and then added to invertedIndex. Because invertedIndex stores a reference to the same map rather than a copy, clearing it discards the data you just stored.
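A minimal sketch of that fix, assuming a simplified index built from in-memory word arrays instead of files (the class name InnerMapDemo and the helper buildIndex are made up for illustration). Each document gets its own inner map, so nothing that was stored is cleared afterwards:

```java
import java.util.HashMap;
import java.util.Map;

class InnerMapDemo {
    // Builds a tiny index of document name -> (word -> count).
    static Map<String, Map<String, Integer>> buildIndex(Map<String, String[]> docs) {
        Map<String, Map<String, Integer>> invertedIndex = new HashMap<>();
        for (Map.Entry<String, String[]> doc : docs.entrySet()) {
            // A fresh inner map per document, instead of reusing and clearing one
            Map<String, Integer> inner = new HashMap<>();
            for (String word : doc.getValue()) {
                inner.merge(word, 1, Integer::sum); // insert 1 or add 1 to the count
            }
            invertedIndex.put(doc.getKey(), inner);
        }
        return invertedIndex;
    }

    public static void main(String[] args) {
        Map<String, String[]> docs = new HashMap<>();
        docs.put("a.txt", new String[] {"apple", "pear", "apple"});
        docs.put("b.txt", new String[] {"plum"});
        Map<String, Map<String, Integer>> index = buildIndex(docs);
        System.out.println(index.get("a.txt").containsKey("apple")); // true
        System.out.println(index.get("a.txt").get("apple"));         // 2
        System.out.println(index.get("b.txt").containsKey("apple")); // false
    }
}
```

With a single shared map that is cleared after each file, every entry of invertedIndex would point to the same, finally empty, map, which is why the lookup in the question never finds anything.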

Try this:
for(String w : invertedIndex.keySet()) {
Map<String, Integer> fileWordMap = invertedIndex.get(w);
if(fileWordMap.containsKey(query))
{
System.out.println(query + " was found in document " + w);
}
}
Or, as per your original code:
for(String fileName : invertedIndex.keySet()) {
for(String wordInFile : invertedIndex.get(fileName).keySet())
{
if(wordInFile.equals(query))
{
System.out.println(query + " was found in document " + fileName);
}
}
}
As a tip, try to use variable names that tell you what the code is doing :) It's very easy to get confused if we only use random variable names.

Read the each string text from file in java

I am new to Java. I just want to read each string from a file and print it on the console.
Code:
public static void main(String[] args) throws Exception {
File file = new File("/Users/OntologyFile.txt");
try {
FileInputStream fstream = new FileInputStream(file);
BufferedReader infile = new BufferedReader(new InputStreamReader(
fstream));
String data = new String();
while ((data = infile.readLine()) != null) { // use if for reading just 1 line
System.out.println(""+data);
}
} catch (IOException e) {
// Error
}
}
If file contains:
Add label abc to xyz
Add instance cdd to pqr
I want to read each word from file and print it to a new line, e.g.
Add
label
abc
...
And afterwards, I want to extract the index of a specific string, for instance get the index of abc.
Can anyone please help me?

It sounds like you want to be able to do two things:
Print all words inside the file
Search the index of a specific word
In that case, I would suggest scanning all lines, splitting by any whitespace character (space, tab, etc.) and storing the words in a collection so you can search it later on. Now the question is: can you have repeats, and in that case which index would you like to print? The first? The last? All of them?
Assuming words are unique, you can simply do:
public static void main(String[] args) throws Exception {
File file = new File("/Users/OntologyFile.txt");
ArrayList<String> words = new ArrayList<String>();
try {
FileInputStream fstream = new FileInputStream(file);
BufferedReader infile = new BufferedReader(new InputStreamReader(
fstream));
String data = null;
while ((data = infile.readLine()) != null) {
for (String word : data.split("\\s+")) {
words.add(word);
System.out.println(word);
}
}
} catch (IOException e) {
// Error
}
// search for the index of abc:
for (int i = 0; i < words.size(); i++) {
if (words.get(i).equals("abc")) {
System.out.println("abc index is " + i);
break;
}
}
}
If you don't break, it'll print every index of abc (if words are not unique). You could of course optimize it more if the set of words is very large, but for a small amount of data, this should suffice.
Of course, if you know in advance which words' indices you'd like to print, you could forego the extra data structure (the ArrayList) and simply print that as you scan the file, unless you want the printings (of words and specific indices) to be separate in output.
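If words can repeat, the first, last, and all indices can each be obtained from the same list. The following is a small sketch using the two sample lines from the question as in-memory input. The class name WordIndices and its helper methods are hypothetical, and indices are 0-based positions over the whole file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordIndices {
    // Splits every line on whitespace and returns the words in reading order.
    static List<String> words(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) { // skip blank lines
                result.addAll(Arrays.asList(trimmed.split("\\s+")));
            }
        }
        return result;
    }

    // Returns every 0-based position at which 'target' occurs.
    static List<Integer> indicesOf(List<String> words, String target) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            if (words.get(i).equals(target)) {
                indices.add(i);
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        List<String> words = words(Arrays.asList(
            "Add label abc to xyz",
            "Add instance cdd to pqr"));
        System.out.println(words.indexOf("abc"));     // first index: 2
        System.out.println(words.lastIndexOf("Add")); // last index: 5
        System.out.println(indicesOf(words, "Add"));  // all indices: [0, 5]
    }
}
```

List.indexOf and List.lastIndexOf cover the first and last cases directly and return -1 when the word is absent, so only the "all of them" case needs an explicit loop.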

Split the String received for any whitespace with the regex \\s+ and print out the resultant data with a for loop.
public static void main(String[] args) { // Don't make main throw an exception
File file = new File("/Users/OntologyFile.txt");
try {
FileInputStream fstream = new FileInputStream(file);
BufferedReader infile = new BufferedReader(new InputStreamReader(fstream));
String data;
while ((data = infile.readLine()) != null) {
String[] words = data.split("\\s+"); // Split on whitespace
for (String word : words) { // Iterate through info
System.out.println(word); // Print it
}
}
} catch (IOException e) {
// Probably best to actually have this on there
System.err.println("Error found.");
e.printStackTrace();
}
}

Just add a for-each loop before printing the output :-
while ((data = infile.readLine()) != null) { // use if for reading just 1 line
for(String temp : data.split(" "))
System.out.println(temp); // no need to concatenate the empty string.
}
This will automatically print the individual strings, obtained from each String line read from the file, in a new line.
And afterwards, I want to extract the index of a specific string, for
instance get the index of abc.
I don't know which index you are actually talking about. But if you want the position of the word within each line being read, add a temporary variable count initialised to 0.
Increment it for each word until temp equals abc. Because count is incremented before the comparison, the printed position is 1-based. Like,
int count = 0;
for(String temp : data.split(" ")){
count++;
if("abc".equals(temp))
System.out.println("Index of abc is : "+count);
System.out.println(temp);
}

Use the split() method of the String class. You may adapt it to your needs.
or
use the length() method to iterate over the complete line,
and at any non-alphabetic character take the substring() up to that point and write it on a new line.

List<String> words = new ArrayList<String>();
while ((data = infile.readLine()) != null) {
for(String d : data.split(" ")) {
System.out.println(""+d);
}
// add the individual words of the line, not the whole line
words.addAll(Arrays.asList(data.split(" ")));
}
//words List will hold all the words. Do words.indexOf("abc") to get index
if(words.indexOf("abc") < 0) {
System.out.println("word not present");
} else {
System.out.println("word present at index " + words.indexOf("abc"));
}

ArrayList confusion

The code below is my attempt to read from a file of strings, read through each line until a ':' is found, then store and print everything after that. However, the print function prints out everything that I read in from the file. Can someone spot where I'm going wrong? Thanks.
Edit: every line is in this format: "Some text here:More text here"
public void openFile() {
try {
scanner = new BufferedReader(new FileReader("calendar.ics"));
} catch (Exception e) {
System.out.println("Could not open file");
}
}
public void readFile() {
ArrayList<String> vals = new ArrayList<String>();
String test;
try {
while ((line = scanner.readLine()) != null)
{
int indexOfComma = line.indexOf("\\:"); // returns firstIndexOf ':'
test = line.substring(indexOfComma+1); // test to be everything after ':'
vals.add(test); // add values to vals
}
} catch(Exception ex){ }
for(int i=0; i<vals.size(); i++){
System.out.println(vals.get(i));
}
}

You don't need to escape your colon.
line.indexOf("\\:");
Change the above line to:
line.indexOf(":");
That is because "\\:" makes indexOf search for the two-character string \: (a backslash followed by a colon). It is not found in your lines, so indexOf returns -1.
test = line.substring(indexOfComma+1);
So, if indexOfComma is -1, which it certainly will be when your string does not contain \:, the line above becomes:
line.substring(0); // same as whole string
As a suggestion, you should use the abstract type as the type of the reference when declaring your list. So, use List instead of ArrayList on the left-hand side of the declaration:
List<String> vals = new ArrayList<String>();
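Putting the fix together, a small sketch of the extraction step might look like this. The class name AfterColon and the method afterFirstColon are made up for illustration, and returning the whole line when there is no colon is an assumption about the desired behaviour:

```java
class AfterColon {
    // Returns the text after the first ':' or the whole line when there is none.
    static String afterFirstColon(String line) {
        int indexOfColon = line.indexOf(":"); // -1 if the line has no colon
        return indexOfColon < 0 ? line : line.substring(indexOfColon + 1);
    }

    public static void main(String[] args) {
        // prints: More text here
        System.out.println(afterFirstColon("Some text here:More text here"));
        // prints: -1, because "\\:" looks for a backslash followed by a colon
        System.out.println("a:b".indexOf("\\:"));
    }
}
```

Checking for -1 explicitly makes the "no colon" case a deliberate choice rather than a side effect of substring(0).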
