Write file with SuperCsv preserving leading zeros while opening in excel - java

I was wondering if there is a way to keep the leading 0 while using SuperCsv.
My problem is that I have a few columns containing numbers with leading zeros. I want to keep the zeros, but Excel keeps stripping them. I've also tried prepending a few characters to the number, such as ' = ", but with no good result.
Excel displays the first character I added at the beginning of the number, so the column value looks like =0222333111, probably because Super CSV wraps the output in quotes.
I didn't find anything on the Super CSV website, and I guess I am not the only one who has this problem.
Should I migrate to an Excel Java library, or is there a workaround?

The CSV file format does not allow you to specify how the cells are treated by external programs. Even if the leading zeroes are written to the CSV file (please check that, if you have not already done so), Excel might think that it's smarter than you, that the leading zeroes are there by accident and discard them.
Even if there were workarounds, like adding all sorts of invisible Unicode characters, those are just hacks that are not guaranteed to work with other versions of Excel.
Therefore, CSV seems not to be an adequate file format for your requirements. Either switch to a different file format, or configure Excel to treat all cells as strings instead of numbers (I don't know how or if the latter is possible).
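If switching formats is an option, writing a real spreadsheet avoids the problem entirely. Here is a minimal sketch, assuming Apache POI (poi-ooxml) is on the classpath; the file name and value are illustrative:
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class XlsxLeadingZeroes {
    public static void main(String[] args) throws Exception {
        try (Workbook wb = new XSSFWorkbook();
                FileOutputStream out = new FileOutputStream("numbers.xlsx")) {
            Sheet sheet = wb.createSheet();
            Cell cell = sheet.createRow(0).createCell(0);
            // stored as a string cell, so Excel keeps the leading zeros
            cell.setCellValue("0222333111");
            wb.write(out);
        }
    }
}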

In Super CSV you can use the custom cell processor below; it wraps your cell value as ="value", which Excel evaluates as a formula returning text:
import org.supercsv.cellprocessor.CellProcessorAdaptor;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.util.CsvContext;

public class PreserveLeadingZeroes extends CellProcessorAdaptor {

    public PreserveLeadingZeroes() {
        super();
    }

    public PreserveLeadingZeroes(CellProcessor next) {
        super(next);
    }

    @Override
    public Object execute(Object value, CsvContext context) {
        if (value == null) {
            return next.execute("", context);
        }
        // wrap the value so Excel keeps it as text, e.g. ="0222333111"
        final String result = "=\"" + value.toString() + "\"";
        return next.execute(result, context);
    }
}
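A hypothetical wiring of the processor into a writer (the file name and header are illustrative); the written cell comes out as ="0222333111", which Excel displays as 0222333111:
import java.io.FileWriter;
import java.util.Arrays;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvListWriter;
import org.supercsv.io.ICsvListWriter;
import org.supercsv.prefs.CsvPreference;

public class PreserveLeadingZeroesDemo {
    public static void main(String[] args) throws Exception {
        CellProcessor[] processors = { new PreserveLeadingZeroes() };
        try (ICsvListWriter writer = new CsvListWriter(
                new FileWriter("out.csv"), CsvPreference.STANDARD_PREFERENCE)) {
            writer.writeHeader("phone");
            // written as ="0222333111" so the leading zero survives in Excel
            writer.write(Arrays.asList("0222333111"), processors);
        }
    }
}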

Related

CSVFormat.RFC4180 ignores quoted values in .csv file

I have a .csv file that has quoted values:
Gender,ParentIncome,IQ,ParentEncouragement,CollegePlans
"Male",53900,118,"Encouraged","Plans to attend"
"Female",24900,87,"Not Encouraged","Does not plan to attend"
"Female",65800,93,"Not Encouraged","Does not plan to attend"
Reading this file with the following code (using IntelliJ and observing values in the debugger) returns values without quotes.
@Override
public CsvConnectorService read(String fullFileName, String outputAddress, int intervalMs, boolean repeat,
        Handler<AsyncResult<Void>> result) {
    CSVFormat format = CSVFormat.RFC4180.withHeader().withIgnoreEmptyLines().withQuote('"');
    Subscription subscription = createCsvObservable(fullFileName, format, intervalMs, repeat)
            .subscribeOn(Schedulers.io())
            .subscribe(record -> eventBus.publish(outputAddress, convertRecordToJson(record)));
    subscriptions.add(subscription);
    result.handle(Future.succeededFuture());
    return this;
}
Reading with .withQuote('"') or without it makes no difference.
The quote character " is the default for quoted fields, so setting it explicitly makes no difference.
Do you want to keep the original quote characters? In that case, try setting the quote to a character that doesn't occur in the text, such as .withQuote('\0');
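A small sketch of the difference, following that suggestion (inline data for brevity):
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class QuoteDemo {
    public static void main(String[] args) throws Exception {
        String csv = "Gender,ParentIncome\n\"Male\",53900\n";
        // default quote: the quotes are consumed while parsing
        for (CSVRecord r : CSVFormat.RFC4180.withHeader().parse(new StringReader(csv))) {
            System.out.println(r.get("Gender")); // Male
        }
        // quote set to a character that never occurs: the quotes survive as data
        for (CSVRecord r : CSVFormat.RFC4180.withHeader().withQuote('\0')
                .parse(new StringReader(csv))) {
            System.out.println(r.get("Gender")); // "Male"
        }
    }
}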

Ignore rows having columns less than number of headers in csv - SuperCSV [duplicate]

I am working on a CSV parser requirement and I am using the Super CSV parser library. My CSV file can have 25 columns (separated by a pipe (|) delimiter) and up to 100k rows, plus a header row.
I would like to ignore white-space-only lines and lines containing fewer than 25 columns.
I am using ICsvBeanReader with name mappings (to set CSV values on a POJO) and field processors (to handle validation) for reading a file.
I am assuming that Super CSV's ICsvBeanReader will skip white-space lines by default. But how do I handle a row that contains fewer than 25 columns?
You can easily do this by writing your own Tokenizer.
For example, the following Tokenizer will have the same behaviour as the default one, but will skip over any lines that don't have the correct number of columns.
public class SkipBadColumnCountTokenizer extends Tokenizer {

    private final int expectedColumns;
    private final List<Integer> ignoredLines = new ArrayList<>();

    public SkipBadColumnCountTokenizer(Reader reader, CsvPreference preferences, int expectedColumns) {
        super(reader, preferences);
        this.expectedColumns = expectedColumns;
    }

    @Override
    public boolean readColumns(List<String> columns) throws IOException {
        boolean moreInputExists;
        // keep reading until a line with the expected column count is found (or input ends)
        while ((moreInputExists = super.readColumns(columns)) && columns.size() != this.expectedColumns) {
            System.out.println(String.format("Ignoring line %s with %d columns: %s",
                    getLineNumber(), columns.size(), getUntokenizedRow()));
            ignoredLines.add(getLineNumber());
        }
        return moreInputExists;
    }

    public List<Integer> getIgnoredLines() {
        return this.ignoredLines;
    }
}
And a simple test using this Tokenizer...
@Test
public void testInvalidRows() throws IOException {
    String input = "column1,column2,column3\n" +
            "has,three,columns\n" +
            "only,two\n" +
            "one\n" +
            "three,columns,again\n" +
            "one,too,many,columns";
    CsvPreference preference = CsvPreference.EXCEL_PREFERENCE;
    int expectedColumns = 3;
    SkipBadColumnCountTokenizer tokenizer = new SkipBadColumnCountTokenizer(
            new StringReader(input), preference, expectedColumns);
    try (ICsvBeanReader beanReader = new CsvBeanReader(tokenizer, preference)) {
        String[] header = beanReader.getHeader(true);
        TestBean bean;
        while ((bean = beanReader.read(TestBean.class, header)) != null) {
            System.out.println(bean);
        }
        System.out.println(String.format("Ignored lines: %s", tokenizer.getIgnoredLines()));
    }
}
Prints the following output (notice how it has skipped all of the invalid rows):
TestBean{column1='has', column2='three', column3='columns'}
Ignoring line 3 with 2 columns: only,two
Ignoring line 4 with 1 columns: one
TestBean{column1='three', column2='columns', column3='again'}
Ignoring line 6 with 4 columns: one,too,many,columns
Ignored lines: [3, 4, 6]
(1) If the selection must be done by your Java program using Super CSV, then (and I quote) "you'll have to use CsvListReader" — in particular, checking listReader.length() for each row (see the sketch after point (2)).
See this Super CSV page for details.
(2) If you can perform the selection by preprocessing the CSV file, then you might wish to consider a suitable command-line tool (or tools, depending on how complicated the CSV format is). If the delimiter of the CSV file does not occur within any field, then awk would suffice. For example, if the assumption is satisfied, and if the delimiter is |, then the relevant awk filter could be as simple as:
awk -F'|' 'NF == 25 {print}'
If the CSV file format is too complex for a naive application of awk, then you may wish to convert the complex format to a simpler one; often TSV has much to recommend it.
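A hedged sketch of option (1); the file name and the pipe-delimited preference are illustrative:
import java.io.FileReader;
import java.util.List;
import org.supercsv.io.CsvListReader;
import org.supercsv.io.ICsvListReader;
import org.supercsv.prefs.CsvPreference;

public class SkipShortRows {
    public static void main(String[] args) throws Exception {
        CsvPreference pipeDelimited = new CsvPreference.Builder('"', '|', "\n").build();
        try (ICsvListReader listReader = new CsvListReader(
                new FileReader("input.csv"), pipeDelimited)) {
            listReader.getHeader(true);
            List<String> row;
            while ((row = listReader.read()) != null) {
                if (listReader.length() != 25) {
                    continue; // skip rows that don't have exactly 25 columns
                }
                // process the row...
            }
        }
    }
}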

DisplayTag format number in Excel export

I have a little problem with DisplayTag and its Excel export. I have a table with columns containing strings starting with 0, like phone numbers or pin codes for example...
When I try to export them to an Excel file, Excel treats them as numbers and deletes the leading zeros... (0012 becomes 12)
My config is the following
export.excel.class = org.displaytag.export.ExcelView
I already added a decorator (see below) that wraps values as ="MYSTRING" when I export to Excel, but I don't like this solution much, because you can see the trick in the Excel file...
public class QuotedExportDecorator implements DisplaytagColumnDecorator {

    @Override
    public Object decorate(Object value, PageContext pageContext, MediaTypeEnum media) {
        if (media.equals(MediaTypeEnum.EXCEL)) {
            value = "=\"" + value + "\"";
        }
        return value;
    }
}
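For reference, a column decorator like this is typically attached per column in the JSP via the decorator attribute (the property name and package here are illustrative):
<display:column property="phone" decorator="com.example.QuotedExportDecorator" />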
Any other idea to work around this problem?

How to know if a file is text rendering or not? (Java)

How can I know at run time whether a file in a specified folder can be rendered as text (i.e. files like CSV, HTML, etc. that can be displayed as text)?
I do not want to do this via extension matching (by checking for .txt, .html extensions, etc.).
Suppose there is a .jpg file and I deliberately rename its extension to .txt; the Java code should still be able to detect that this file (although with a .txt extension) cannot be rendered as text.
How can I achieve this in Java?
You could guess the type by scanning the file and using Character.isISOControl to check whether non-printable characters are included.
Binary files usually include headers which often contain control characters; see this list of file signatures, most of which would be detected by isISOControl.
Implement a heuristic matcher which scans files for known signatures.
One classic example is the file command: http://en.wikipedia.org/wiki/File_(command) and the libmagic library.
There are several variants in Java, such as Tika: http://tika.apache.org/
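A minimal sketch with Tika's facade, assuming Tika is on the classpath (the file name is illustrative); detection is content-based, so magic bytes take precedence over a misleading extension:
import java.io.File;
import org.apache.tika.Tika;

public class DetectType {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // inspects the file's magic bytes rather than trusting the extension
        String mimeType = tika.detect(new File("mystery.txt"));
        System.out.println(mimeType); // e.g. image/jpeg for a renamed photo
    }
}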
I don't think there is a 100% foolproof way to do this, since it's a matter of opinion what counts as "can be displayed as text"... but if you're okay with restricting it to English text, you could examine the bytes of the file, and if most or all of the byte values are in the range 32 through 126 (decimal, unsigned), then it is likely plain ASCII text.
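A minimal sketch of that byte-range heuristic (the sample size and 95% threshold are arbitrary choices):
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class AsciiHeuristic {
    /** Samples up to maxBytes of the file and checks the printable-ASCII ratio. */
    public static boolean looksLikeAsciiText(String path, int maxBytes) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            int b, checked = 0, printable = 0;
            while (checked < maxBytes && (b = in.read()) != -1) {
                checked++;
                if ((b >= 32 && b <= 126) || b == '\n' || b == '\r' || b == '\t') {
                    printable++;
                }
            }
            // arbitrary cut-off: call it text if 95% of the sampled bytes are printable
            return checked == 0 || printable >= checked * 0.95;
        }
    }
}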
This is going to call for some kind of statistical pattern matching. You could, for example, if you were working with English only, check how many "foreign" characters appear in the first 100 characters. That should give you a pretty good idea of whether this is a text document or not. If you run into too many characters that are not a..zA..Z0..9[punctuation], you can guess it is not text. Working with English-language files, and languages that can be expressed using mostly the ASCII character set, you should be relatively safe.
This of course goes out the window the moment you start working with foreign languages where some of the characters might appear to be special characters, but only to someone who does not speak the language.
The other alternative is to use file markers (like in Java a class file starts with a specific header) and compare the values in the file to a library of headers. It's cumbersome and error-prone as well, as you might not have the file on record and could therefore think it's a text file when it is not.
Using Character#isISOControl is a good approach. You should take the encoding into consideration too (e.g. UTF-8). Here is my function:
/**
 * Tests whether a file is a text file. This is the case only if it contains
 * no well-known control characters (see {@link Character#isISOControl(int)});
 * carriage return, line feed and tab are accepted.
 *
 * @param file the file to test
 * @return true if the file looks like text
 * @throws IOException if the file cannot be read
 */
public static boolean isTextFile(final File file) throws IOException {
    // the charset is an assumption; adjust it to the encoding you expect
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
        boolean isText;
        int read;
        do {
            read = in.read();
            isText = read == -1;                    // end of file
            isText |= read == 13;                   // carriage return
            isText |= read == 10;                   // line feed
            isText |= read == 9;                    // tab
            isText |= !Character.isISOControl(read);
        } while (isText && read != -1);
        return isText;
    }
}
You can maintain a list of acceptable MIME types and then get the MIME type of the file you are reading. If it matches, you are good to go.
import javax.activation.MimetypesFileTypeMap;
import java.io.File;

class GetMimeType {
    public static void main(String[] args) {
        File f = new File("gumby.gif");
        System.out.println("Mime Type of " + f.getName() + " is "
                + new MimetypesFileTypeMap().getContentType(f));
        // expected output:
        // "Mime Type of gumby.gif is image/gif"
    }
}
http://www.rgagnon.com/javadetails/java-0487.html
