I hope the kind people here can help me with my CSV situation. I only need to read the first 3 columns of the CSV file I'm using. I have no control over the number of columns in the file, and there are no headers available. I tried the partial-read approach with CsvBeanReader (https://super-csv.github.io/super-csv/examples_partial_reading.html), but I keep getting the "nameMapping array and the number of columns read" error. Does the partial reading example work with Super CSV version 2.4.0, which I'm currently using? Below is the code I used, patterned on the partial read example:
public class MainPartialRead {

    public void partialRead() throws Exception {
        ICsvBeanReader beanReader = null;
        String csv_filename = "test2.csv";
        try {
            beanReader = new CsvBeanReader(new FileReader(csv_filename), CsvPreference.STANDARD_PREFERENCE);
            beanReader.getHeader(true); // skip past the header (we're defining our own)
            System.out.println("beanreader Length: " + beanReader.length());
            // only map the first 3 columns - setting header elements to null means those columns are ignored
            final String[] header = new String[]{"column1", "column2", "column3", null, null, null, null, null,
                    null, null};
            // no processing required for ignored columns
            final CellProcessor[] processors = new CellProcessor[]{new NotNull(), new NotNull(),
                    new NotNull(), null, null, null, null, null, null, null};
            beanCSVReport customer;
            while ((customer = beanReader.read(beanCSVReport.class, header, processors)) != null) {
                System.out.println(String.format("lineNo=%s, rowNo=%s, customer=%s", beanReader.getLineNumber(),
                        beanReader.getRowNumber(), customer));
            }
        } finally {
            if (beanReader != null) {
                beanReader.close();
            }
        }
    }
}
Here is the sample CSV file I'm using:
466,24127,abc,53516
868,46363,hth,249
You did not mention the complete error message. Here it is in full:
Exception in thread "main" java.lang.IllegalArgumentException: the nameMapping array and the number of columns read should be the same size
(nameMapping length = 10, columns = 4)
From this it is very clear what the issue is: the CSV file has just 4 columns, but you have supplied a mapping for 10 columns, 7 of them null. Removing 6 of the nulls from both header and processors (leaving the 3 names plus 1 null, matching the 4 columns) fixes the issue.
Another point to note: the following call skips the first line on the assumption that it is a header, as you instructed, but in your file it is actually a data row. You should not call this method.
beanReader.getHeader(true);
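For reference, here is a minimal sketch of the corrected mapping for the four-column sample above (beanCSVReport and its three mapped properties are taken from the question's code):
// 4 entries to match the 4 CSV columns; the trailing null ignores the last column
final String[] header = new String[]{"column1", "column2", "column3", null};
final CellProcessor[] processors = new CellProcessor[]{new NotNull(), new NotNull(), new NotNull(), null};

beanCSVReport customer;
while ((customer = beanReader.read(beanCSVReport.class, header, processors)) != null) {
    System.out.println(customer);
}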
I am trying to create a map-reduce job in Java on a table from an HBase database. Using the examples from here and other material from the internet, I managed to write a simple row counter. However, trying to write one that actually does something with the data from a column was unsuccessful, because the received bytes are always null.
A part of my Driver from the job is this:
/* Set main, map and reduce classes */
job.setJarByClass(Driver.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);

/* Get data only from the last 24h */
Timestamp timestamp = new Timestamp(System.currentTimeMillis());
try {
    long now = timestamp.getTime();
    scan.setTimeRange(now - 24 * 60 * 60 * 1000, now);
} catch (IOException e) {
    e.printStackTrace();
}

/* Initialize the initTableMapperJob */
TableMapReduceUtil.initTableMapperJob(
        "dnsr",
        scan,
        Map.class,
        Text.class,
        Text.class,
        job);

/* Set output parameters */
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
As you can see, the table is called dnsr. My mapper looks like this:
@Override
public void map(ImmutableBytesWritable row, Result value, Context context)
        throws InterruptedException, IOException {
    byte[] columnValue = value.getValue("d".getBytes(), "fqdn".getBytes());
    if (columnValue == null)
        return;
    byte[] firstSeen = value.getValue("d".getBytes(), "fs".getBytes());
    // if (firstSeen == null)
    //     return;

    String fqdn = new String(columnValue).toLowerCase();
    String fs = (firstSeen == null) ? "empty" : new String(firstSeen);

    context.write(new Text(fqdn), new Text(fs));
}
Some notes:
the column family in the dnsr table is just d; there are multiple columns, some of them called fqdn and fs (firstSeen);
even though the fqdn values appear correctly, fs is always the "empty" string (I added this check after getting errors saying that null cannot be converted to a new String);
if I change the fs column name to something else, for example ls (lastSeen), it works;
the reducer doesn't do anything, just outputs everything it receives.
I created a simple table scanner in JavaScript that queries the exact same table and columns, and I can clearly see the values are there. Querying manually from the command line, I can also see the fs values are not null; they are bytes that can later be converted into a string (representing a date).
What could be the reason I'm always getting null?
Thanks!
Update:
If I get all the columns in a specific column family, I don't receive fs. However, a simple scanner implemented in JavaScript returns fs as a column from the dnsr table.
@Override
public void map(ImmutableBytesWritable row, Result value, Context context)
        throws InterruptedException, IOException {
    byte[] columnValue = value.getValue(columnFamily, fqdnColumnName);
    if (columnValue == null)
        return;
    String fqdn = new String(columnValue).toLowerCase();

    /* Getting all the columns */
    String[] cns = getColumnsInColumnFamily(value, "d");
    StringBuilder sb = new StringBuilder();
    for (String s : cns) {
        sb.append(s).append(";");
    }

    context.write(new Text(fqdn), new Text(sb.toString()));
}
I used an answer from here to get all the column names.
In the end, I managed to find the 'problem'. HBase is a column-oriented datastore: data is stored and retrieved by column, so only the relevant columns need to be read when only some of the data is required. Every column family has one or more column qualifiers (columns), and each column has multiple cells. The interesting part is that every cell has its own timestamp.
Why was this the problem? When you do a ranged scan, only the cells whose timestamp falls in that range are returned, so you may end up with a row that has "missing cells". In my case, I had a DNS record and other fields such as firstSeen and lastSeen. lastSeen is updated every time I see that domain, whereas firstSeen remains unchanged after the first occurrence. As soon as I changed the ranged map-reduce job to a plain one (using data from all time), everything was fine (but the job took longer to finish).
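To illustrate the effect, here is a minimal sketch (assuming the HBase 1.x client API and the dnsr table from above; the class name and command-line row key are illustrative) that fetches a full row without a time range, so old cells such as fs are included:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FetchFullRow {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("dnsr"))) {
            Get get = new Get(Bytes.toBytes(args[0])); // row key passed on the command line
            get.addFamily(Bytes.toBytes("d"));         // no setTimeRange(): cells of any age are returned
            Result result = table.get(get);
            byte[] fs = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("fs"));
            System.out.println(fs == null ? "fs missing" : Bytes.toString(fs));
        }
    }
}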
Cheers!
Hi, I'm using the Google Sheets API v4 for Java.
I want to make a server list where I register new server IPs for my own small project. At the moment I can append a new entry in an empty row using:
AppendCellsRequest appendCellReq = new AppendCellsRequest();
appendCellReq.setSheetId(0);
appendCellReq.setRows(rowData);
appendCellReq.setFields("userEnteredValue");
The problem now is that I want to delete this row later, so I need a way to find it again. My ideas were to add a unique ID, to search for the exact values I added, or to remember the row number. Another way would be to find and replace across all cells, but I would rather have a way to get the row number of my added data.
I'm very happy to hear some advice.
After a long search I finally found an answer. There are tutorials that make it possible to append a row, but they weren't able to return which row the data was inserted into. With the code below it is now possible. I found it somewhere after hours of searching; it is not the best code, but it works and can be modified.
String range = "A1"; // TODO: Update placeholder value.
// How the input data should be interpreted.
String valueInputOption = "RAW"; // TODO: Update placeholder value.
// How the input data should be inserted.
String insertDataOption = "INSERT_ROWS"; // TODO: Update placeholder value.

// TODO: Assign values to desired fields of `requestBody`:
ValueRange requestBody = new ValueRange();
List<Object> data1 = new ArrayList<Object>();
data1.addAll(Arrays.asList(dataArr));
List<List<Object>> data2 = new ArrayList<List<Object>>();
data2.add(data1);
requestBody.setValues(data2);

Sheets sheetsService;
try {
    sheetsService = getSheetsService();
    Sheets.Spreadsheets.Values.Append request = sheetsService.spreadsheets().values().append(spreadSheetId,
            range, requestBody);
    request.setValueInputOption(valueInputOption);
    request.setInsertDataOption(insertDataOption);
    AppendValuesResponse response = request.execute();
    // TODO: Change code below to process the `response` object:
    Logger.println(response.getTableRange());
    // getTableRange() is e.g. "Sheet1!A1:D5"; take the cell after the colon
    String endCell = response.getTableRange().split(":")[1];
    String colString = endCell.replaceAll("\\d", "");   // strip digits -> column letters
    String row = endCell.replaceAll(colString, "");     // strip letters -> row number
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
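A possibly simpler route, assuming your version of the v4 Java client exposes it: the response's updates object reports the range that was actually written, unlike getTableRange(), which describes the table before the append. A short sketch:
AppendValuesResponse response = request.execute();
// updatedRange covers exactly the appended cells, e.g. "Sheet1!A6:C6"
String updatedRange = response.getUpdates().getUpdatedRange();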
I am unable to add a row to an existing spreadsheet.
I'm trying the steps from here: https://developers.google.com/google-apps/spreadsheets/data
The following line throws the exception below:
row = service.insert(listFeedUrl, row);
Exception:
Exception in thread "main" com.google.gdata.util.InvalidEntryException: Bad Request
Blank rows cannot be written; use delete instead.
at com.google.gdata.client.http.HttpGDataRequest.handleErrorResponse(HttpGDataRequest.java:602)
at com.google.gdata.client.http.GoogleGDataRequest.handleErrorResponse(GoogleGDataRequest.java:564)
at com.google.gdata.client.http.HttpGDataRequest.checkResponse(HttpGDataRequest.java:560)
at com.google.gdata.client.http.HttpGDataRequest.execute(HttpGDataRequest.java:538)
at com.google.gdata.client.http.GoogleGDataRequest.execute(GoogleGDataRequest.java:536)
at com.google.gdata.client.Service.insert(Service.java:1409)
at com.google.gdata.client.GoogleService.insert(GoogleService.java:613)
at TestGoogle.main(TestGoogle.java:93)
To make a long story short: the above example is quite similar to the code in the application that I need to fix, and the application worked some time ago.
I managed to pass the OAuth2 authentication.
The reason you get this error message is probably that you're adding to a blank spreadsheet in which no header row exists yet.
If you add the headers first, it should work.
Using the "Add a list row" example from the documentation you linked, add the headers like this before adding a list row:
CellQuery cellQuery = new CellQuery(worksheet.CellFeedLink);
CellFeed cellFeed = service.Query(cellQuery);
CellEntry cellEntry = new CellEntry(1, 1, "firstname");
cellFeed.Insert(cellEntry);
cellEntry = new CellEntry(1, 2, "lastname");
cellFeed.Insert(cellEntry);
cellEntry = new CellEntry(1, 3, "age");
cellFeed.Insert(cellEntry);
cellEntry = new CellEntry(1, 4, "height");
cellFeed.Insert(cellEntry);
Then the list entry example should add to the spreadsheet properly
// Fetch the list feed of the worksheet.
ListQuery listQuery = new ListQuery(listFeedLink.HRef.ToString());
ListFeed listFeed = service.Query(listQuery);
// Create a local representation of the new row.
ListEntry row = new ListEntry();
row.Elements.Add(new ListEntry.Custom() { LocalName = "firstname", Value = "Joe" });
row.Elements.Add(new ListEntry.Custom() { LocalName = "lastname", Value = "Smith" });
row.Elements.Add(new ListEntry.Custom() { LocalName = "age", Value = "26" });
row.Elements.Add(new ListEntry.Custom() { LocalName = "height", Value = "176" });
// Send the new row to the API for insertion.
service.Insert(listFeed, row);
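Note that the snippets above use the .NET client; since the question is about Java, a rough Java GData equivalent might look like the sketch below (assuming service, cellFeedUrl, and listFeedUrl are already set up as in the linked documentation):
// Add the header row first (cells are addressed by 1-based row/column)
CellFeed cellFeed = service.getFeed(cellFeedUrl, CellFeed.class);
cellFeed.insert(new CellEntry(1, 1, "firstname"));
cellFeed.insert(new CellEntry(1, 2, "lastname"));
cellFeed.insert(new CellEntry(1, 3, "age"));
cellFeed.insert(new CellEntry(1, 4, "height"));

// Then the list-based insert has header names to map the values onto
ListEntry row = new ListEntry();
row.getCustomElements().setValueLocal("firstname", "Joe");
row.getCustomElements().setValueLocal("lastname", "Smith");
row.getCustomElements().setValueLocal("age", "26");
row.getCustomElements().setValueLocal("height", "176");
service.insert(listFeedUrl, row);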
Is it possible to parse a delimited file and determine the column datatypes? E.g.:
Delimited file:
Email,FirstName,DOB,Age,CreateDate
test@test1.com,Test User1,20/01/2001,24,23/02/2015 14:06:45
test@test2.com,Test User2,14/02/2001,24,23/02/2015 14:06:45
test@test3.com,Test User3,15/01/2001,24,23/02/2015 14:06:45
test@test4.com,Test User4,23/05/2001,24,23/02/2015 14:06:45
Output:
Email datatype: email
FirstName datatype: Text
DOB datatype: date
Age datatype: int
CreateDate datatype: Timestamp
The purpose of this is to read a delimited file, construct a table-creation query on the fly, and insert the data into that table.
I tried using Apache Commons Validator; I believe we need to parse the complete file in order to determine each column's data type.
EDIT: The code that I've tried:
CSVReader csvReader = new CSVReader(new FileReader(fileName), ',');
String[] row = null;
int[] colLength = null;
int colCount = 0;
String[] colDataType = null;
String[] colHeaders = null;
String[] header = csvReader.readNext();
if (header != null) {
    colCount = header.length;
}
colLength = new int[colCount];
colDataType = new String[colCount];
colHeaders = new String[colCount];
for (int i = 0; i < colCount; i++) {
    colHeaders[i] = header[i];
}
int templength = 0;
String tempType = null;
IntegerValidator intValidator = new IntegerValidator();
DateValidator dateValidator = new DateValidator();
TimeValidator timeValidator = new TimeValidator();
while ((row = csvReader.readNext()) != null) {
    for (int i = 0; i < colCount; i++) {
        templength = row[i].length();
        colLength[i] = templength > colLength[i] ? templength : colLength[i];
        if (colHeaders[i].equalsIgnoreCase("email")) {
            logger.info("Col " + i + " is Email");
        } else if (intValidator.isValid(row[i])) {
            tempType = "Integer";
            logger.info("Col " + i + " is Integer");
        } else if (timeValidator.isValid(row[i])) {
            tempType = "Time";
            logger.info("Col " + i + " is Time");
        } else if (dateValidator.isValid(row[i])) {
            tempType = "Date";
            logger.info("Col " + i + " is Date");
        } else {
            tempType = "Text";
            logger.info("Col " + i + " is Text");
        }
        logger.info(row[i].length() + "");
    }
}
Not sure if this is the best way of doing this; any pointers in the right direction would be of help.
If you wish to write this yourself rather than use a third-party library, probably the easiest mechanism is to define a regular expression for each data type and then check whether all fields satisfy it. Here's some sample code to get you started (using Java 8).
import java.util.Arrays;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public enum DataType {
    DATETIME("\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2}"),
    DATE("\\d{2}/\\d{2}/\\d{4}"),
    EMAIL("\\w+@\\w+"),
    TEXT(".*");

    private final Predicate<String> tester;

    DataType(String regexp) {
        tester = Pattern.compile(regexp).asPredicate();
    }

    public static Optional<DataType> getTypeOfField(String[] fieldValues) {
        return Arrays.stream(values())
                .filter(dt -> Arrays.stream(fieldValues).allMatch(dt.tester))
                .findFirst();
    }
}
Note that this relies on the order of the enum values (e.g. testing for datetime before date).
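For instance, with the DOB column from the sample file collected into an array (an illustrative snippet, not part of the original answer):
String[] dobColumn = {"20/01/2001", "14/02/2001", "15/01/2001", "23/05/2001"};
Optional<DataType> type = DataType.getTypeOfField(dobColumn);
System.out.println(type); // Optional[DATE] - DATETIME was tested first but did not match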
Yes, it is possible, and you do have to parse the entire file first. Have a set of rules for each data type, and iterate over every row in the column. Start off with every column having every data type as a candidate, and cancel out a data type when a row in that column violates one of that type's rules. After iterating over the column, check which data types are left for it. E.g., say we have two data types, integer and text: the rules for integer are that it must contain only the digits 0-9 and may begin with '-'; text can be anything.
Our column:
345
-1ab
123
The integer data type would be eliminated at the second row, so the column would be text. If row two were just -1, you would be left with both integer and text, and the answer would be integer, because text is never eliminated (our rule says text can be anything). Basically, you don't have to check for text: if no other data type is left, the answer is text. A short sketch of this elimination approach is below. Hope this answers your question.
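By way of illustration (not part of the original answer), a minimal sketch of the elimination approach with just the integer and text rules:
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;

public class ColumnTypeSketch {

    enum Type { INTEGER, TEXT }

    // Start with every type as a candidate; drop a type when a value violates its rule.
    static Type inferColumnType(List<String> column) {
        EnumSet<Type> candidates = EnumSet.allOf(Type.class);
        for (String value : column) {
            if (!value.matches("-?[0-9]+")) {
                candidates.remove(Type.INTEGER); // rule violated: digits only, optional leading '-'
            }
            // TEXT can be anything, so it is never removed
        }
        return candidates.contains(Type.INTEGER) ? Type.INTEGER : Type.TEXT;
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("345", "-1ab", "123");
        System.out.println(inferColumnType(column)); // prints TEXT: "-1ab" eliminated INTEGER
    }
}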
I needed slightly similar logic for my project. I searched a lot but did not find the right solution. In my case I need to pass a String object to a method that should return the datatype of that object. Finally I found the post from @sprinter; it looks similar to my logic, but I need to pass a single string instead of a string array.
I modified the code for my needs and posted it below.
import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Pattern;

public enum DataType {
    DATE("\\d{2}/\\d{2}/\\d{4}"),
    // matches() requires the whole string to match, so the pattern allows text around @gmail
    EMAIL(".*@gmail.*"),
    NUMBER("[0-9]+"),
    STRING("^[A-Za-z0-9? ,_-]+$");

    private final String regEx;

    public String getRegEx() {
        return regEx;
    }

    DataType(String regEx) {
        this.regEx = regEx;
    }

    public static Optional<DataType> getTypeOfField(String str) {
        return Arrays.stream(DataType.values())
                .filter(dt -> Pattern.compile(dt.getRegEx()).matcher(str).matches())
                .findFirst();
    }
}
For example:
Optional<DataType> dataType = DataType.getTypeOfField("Bharathiraja");
System.out.println(dataType);
System.out.println(dataType.get());
Output:
Optional[STRING]
STRING
Please note, the regular expression patterns vary based on requirements, so modify the patterns to suit your needs rather than taking them as-is.
Happy coding!
How can I get a composite column's components from a ByteBuffer?
I asked this question but got no response. I am now attempting to get the column name from a byte buffer as follows:
Composite start = new Composite();
start.addComponent(System.currentTimeMillis(), LS);
List<HColumn<Composite, String>> columns = cs.setRange(start, null, true, 10).execute().get().getColumns();
for (HColumn<Composite, String> column : columns) {
    ByteBuffer bf = column.getNameBytes();
    Serializer<Composite> ns = column.getNameSerializer();
    Composite composite = ns.fromByteBuffer(bf);
    // I get an exception from the line above
    String value = column.getValue();
}
My problem is that I have a column family whose composite comparator is made of two LongTypes. I do a column slice on one of its rows, and from the resulting list of columns I want to get each column name and extract the individual components from it. Please, someone help me; I am stuck.
for (HColumn<Composite, String> column : columns) {
    // getName() already deserializes the composite name, so there is no need
    // to go through getNameBytes() and fromByteBuffer() manually
    for (Component<?> compositeComponent : column.getName().getComponents()) {
        Serializer<?> srz = compositeComponent.getSerializer();
        Object value = compositeComponent.getValue(srz);
    }
}
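If the comparator really is two LongTypes, each component can likely also be read by index; a hedged sketch, assuming Hector's Composite.get(int, Serializer) and LongSerializer are available in your version:
Composite name = column.getName();
Long first = name.get(0, LongSerializer.get());   // first LongType component
Long second = name.get(1, LongSerializer.get());  // second LongType component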