I'm creating a Parquet file from Java using org.apache.parquet.*. Other data types cause no issues, but when a binary value is written and I cat the Parquet file using parquet-tools, the value is shown in an encoded format. Because of that, the file cannot be processed further downstream in my system.
Code block:
case BINARY:
recordConsumer.addBinary(stringToBinary(val));
break;
AND
private Binary stringToBinary(Object value) {
return Binary.fromString(value.toString());
}
Schema used is:
message m {
required int64 id;
required binary username;
required boolean active;
}
When I cat:
parquet-tools cat <parquetFileName>
I see something like this:
id = 1
username = TmFtZTE=
active = true
id = 2
username = TmFtZTI=
active = false
I want to see the actual username that was passed, not the encoded strings.
Try this in your schema:
required binary username (UTF8)
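The values you are seeing are not corrupted: without a logical type annotation, parquet-tools has no way to know that the binary column holds text, so it prints the raw bytes as Base64 (TmFtZTE= is simply Base64 for "Name1"). With the UTF8 annotation the column is rendered as a string. The full schema would become:
message m {
  required int64 id;
  required binary username (UTF8);
  required boolean active;
}
No change is needed on the write side; Binary.fromString already produces UTF-8 encoded bytes.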
I am trying to parse Widevine PSSH data and read its content. This is a PSSH data example taken from https://storage.googleapis.com/wvmedia/cenc/h264/tears/tears.mpd
AAAAR3Bzc2gAAAAA7e+LqXnWSs6jyCfc1R0h7QAAACcIARIBMBoNd2lkZXZpbmVfdGVzdCIKMjAxNV90ZWFycyoFQVVESU8=
I create the proto message in this way:
syntax = "proto2";
package drm;
option java_package = "com.drm.widevine";
option java_outer_classname = "WidevineCenc";
message WidevinePsshData {
  enum Algorithm {
    UNENCRYPTED = 0;
    AESCTR = 1;
  };
  optional Algorithm algorithm = 1;
  repeated bytes key_id = 2;
  // Content provider name.
  optional string provider = 3;
  // A content identifier, specified by content provider.
  optional bytes content_id = 4;
  // Track type. Acceptable values are SD, HD and AUDIO. Used to
  // differentiate content keys used by an asset.
  optional string track_type = 5;
  // The name of a registered policy to be used for this asset.
  optional string policy = 6;
  // Crypto period index, for media using key rotation.
  optional uint32 crypto_period_index = 7;
}
And I try to deserialize the pssh data as follows:
String psshString = "AAAAR3Bzc2gAAAAA7e+LqXnWSs6jyCfc1R0h7QAAACcIARIBMBoNd2lkZXZpbmVfdGVzdCIKMjAxNV90ZWFycyoFQVVESU8=";
byte[] data = Base64.decode(psshString, Base64.DEFAULT);
WidevineCenc.WidevinePsshData pssh = WidevineCenc.WidevinePsshData.parseFrom(data);
I get this error:
com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
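The decoded bytes are not a bare serialized WidevinePsshData message: the Base64 string from the MPD is the complete ISO BMFF pssh box, whose header (4-byte size, 'pssh' type, version/flags, and the 16-byte Widevine system ID) precedes the protobuf payload. The parser starts on the leading zero bytes of the box size, which is exactly what produces "invalid tag (zero)". A minimal sketch, assuming a version-0 pssh box (no key IDs in the header), that skips the 32-byte header and parses only the payload:
String psshString = "AAAAR3Bzc2gAAAAA7e+LqXnWSs6jyCfc1R0h7QAAACcIARIBMBoNd2lkZXZpbmVfdGVzdCIKMjAxNV90ZWFycyoFQVVESU8=";
byte[] box = Base64.decode(psshString, Base64.DEFAULT);
// Version-0 pssh box layout:
//   size (4) | 'pssh' (4) | version + flags (4) | SystemID (16) | DataSize (4) | Data
int dataSize = java.nio.ByteBuffer.wrap(box, 28, 4).getInt();
byte[] payload = java.util.Arrays.copyOfRange(box, 32, 32 + dataSize);
WidevineCenc.WidevinePsshData pssh = WidevineCenc.WidevinePsshData.parseFrom(payload);
For this particular string the box is 0x47 = 71 bytes: a 32-byte header followed by a 39-byte protobuf payload.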
I am currently seeing JsonFormat.printer print a long (fixed64) value as a String in JSON.
Is there any way/option to make JsonFormat.printer not do this conversion (long (fixed64) -> String in JSON)?
The JSON is consumed by a Java app, so representing fixed64 as an integer in JSON should not be a problem for Java.
Here is the code:
In data.proto
syntax = "proto2";
message data {
  required fixed64 user_id = 1;
  required int32 member_id = 2;
}
Here is the Java code; the file format is *.pb.gz:
import com.google.protobuf.util.JsonFormat;
.......
//print data in JSON format
final InputStream input = new GZIPInputStream(new FileInputStream(pathToFile));
Message m;
m = defaultMsg.getParserForType().parseDelimitedFrom(input);
String jsonString = JsonFormat.printer().preservingProtoFieldNames().omittingInsignificantWhitespace().print(m);
Generated Java class: Data.java (Generated with protoc 2.6.1)
...
private long userId_;
...
private int memberId_;
...
expected result:
{"user_id":6546585813946021349,"member_id":7521}
actual result:
{"user_id":"6546585813946021349","member_id":7521}
The user_id is a String in the JSON, but I want it as an integer.
Thanks
David
It seems this is by design, according to the source code. All 64-bit integer types (INT64, SINT64, and SFIXED64, and likewise UINT64 and FIXED64) are always printed out with surrounding double quotes, no questions asked:
https://github.com/protocolbuffers/protobuf/blob/f9d8138376765d229a32635c9209061e4e4aed8c/java/util/src/main/java/com/google/protobuf/util/JsonFormat.java#L1081-L1084
case INT64:
case SINT64:
case SFIXED64:
generator.print("\"" + ((Long) value).toString() + "\"");
In that same file, a few lines above, you can see that INT32 types are only double quoted if they're keys in a map (which your proto obviously doesn't have).
So, I'd ask for more information on the protobuf mailing list, or maybe report it as a bug/feature-request.
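This mirrors the proto3 JSON mapping, which represents 64-bit integers as JSON strings because consumers that parse JSON numbers as IEEE-754 doubles (JavaScript, for instance) silently lose precision above 2^53. If your consumer is Java-only and you still want a bare number, one workaround is to post-process the printer's output; a sketch, assuming user_id is the only affected field:
String jsonString = JsonFormat.printer().preservingProtoFieldNames().omittingInsignificantWhitespace().print(m);
// Strip the quotes around the known-numeric 64-bit field.
// Safe only because user_id is known to contain digits exclusively.
String unquoted = jsonString.replaceAll("\"user_id\":\"(\\d+)\"", "\"user_id\":$1");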
I encode two EPC tags through "NiceLabel Pro" with data:
First tag: EPC: 555555555, UserData: 9876543210123456789
Second tag: EPC: 444444444, UserData: 123456789123456789
Now I'm trying to get that data through LLRP (in my Java application):
My LLRPClient (one function):
public void PrepareInventoryRequest() {
AccessCommand accessCommand = new AccessCommand();
// A list to hold the op specs for this access command.
accessCommand.setAccessCommandOpSpecList(GenerateOpSpecList());
// Create a new tag spec.
C1G2TagSpec tagSpec = new C1G2TagSpec();
C1G2TargetTag targetTag = new C1G2TargetTag();
targetTag.setMatch(new Bit(1));
// We want to check memory bank 1 (the EPC memory bank).
TwoBitField memBank = new TwoBitField("2");
targetTag.setMB(memBank);
// The EPC data starts at offset 0x20.
// Start reading or writing from there.
targetTag.setPointer(new UnsignedShort(0));
// This is the mask we'll use to compare the EPC.
// We want to match all bits of the EPC, so all mask bits are set.
BitArray_HEX tagMask = new BitArray_HEX("00");
targetTag.setTagMask(tagMask);
// We only want to operate on tags with this EPC.
BitArray_HEX tagData = new BitArray_HEX("00");
targetTag.setTagData(tagData);
// Add a list of target tags to the tag spec.
List <C1G2TargetTag> targetTagList =
new ArrayList<>();
targetTagList.add(targetTag);
tagSpec.setC1G2TargetTagList(targetTagList);
// Add the tag spec to the access command.
accessCommand.setAirProtocolTagSpec(tagSpec);
accessSpec.setAccessCommand(accessCommand);
...
private List<AccessCommandOpSpec> GenerateOpSpecList() {
// A list to hold the op specs for this access command.
List <AccessCommandOpSpec> opSpecList =
new ArrayList<>();
// Set up the default OpSpec (used for the event cycle of AccessSpec 3).
C1G2Read opSpec1 = new C1G2Read();
// Set the OpSpecID to a unique number.
opSpec1.setOpSpecID(new UnsignedShort(1));
opSpec1.setAccessPassword(new UnsignedInteger(0));
// We'll read from user memory (bank 3).
TwoBitField opMemBank = new TwoBitField("3");
opSpec1.setMB(opMemBank);
// We'll read from the base of this memory bank (0x00).
opSpec1.setWordPointer(new UnsignedShort(0));
// A word count of 0 means "read the entire memory bank".
opSpec1.setWordCount(new UnsignedShort(0));
opSpecList.add(opSpec1);
return opSpecList;
}
My tag handler function:
private void updateTable(TagReportData tag) {
if (tag != null) {
EPCParameter epcParam = tag.getEPCParameter();
String EPCStr;
List<AccessCommandOpSpecResult> accessResultList = tag.getAccessCommandOpSpecResultList();
for (AccessCommandOpSpecResult accessResult : accessResultList) {
if (accessResult instanceof C1G2ReadOpSpecResult) {
C1G2ReadOpSpecResult op = (C1G2ReadOpSpecResult) accessResult;
if ((op.getResult().intValue() == C1G2ReadResultType.Success) &&
(op.getOpSpecID().intValue() < 1000)) {
UnsignedShortArray_HEX userMemoryHex = op.getReadData();
System.out.println("User Memory read from the tag is = " + userMemoryHex.toString());
}
}
}
...
For the first tag, "userMemoryHex.toString()" = "3938 3736"
For the second tag, "userMemoryHex.toString()" = "3132 3334"
Why? How do I get all user data?
This is my RFID tag.
The values that you get seem to be the first 4 characters of the number (interpreted as an ASCII string):
39383736 = "9876" (when interpreting those 4 bytes as ASCII characters)
31323334 = "1234" (when interpreting those 4 bytes as ASCII characters)
Since the specification of your tag says
Memory: EPC 128 bits, User 32 bits
your tag can only contain 32 bits (= 4 bytes) of user data. Hence, your tag simply can't contain the full value (i.e. 9876543210123456789 or 123456789123456789) that you tried to write as UserData (regardless of whether this was interpreted as a decimal number or a string).
Instead, your writer application seems to have taken the first 4 characters of those values, encoded them in ASCII, and written them to the tag.
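As a quick sanity check, here is a small sketch that turns the hex words you printed back into text (the hex literal stands in for userMemoryHex.toString()):
// Decode the hex dump of the read words back into ASCII characters.
String words = "3938 3736".replace(" ", "");
byte[] bytes = new byte[words.length() / 2];
for (int i = 0; i < bytes.length; i++) {
    bytes[i] = (byte) Integer.parseInt(words.substring(2 * i, 2 * i + 2), 16);
}
System.out.println(new String(bytes, java.nio.charset.StandardCharsets.US_ASCII)); // prints "9876"
To store the full 19-digit number, you would need a tag with a larger user memory bank.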
I want to set pageToken to get items stored in Google Cloud Storage. I'm using the Google API Client Library for Java (v1.19.x).
I have no idea how to generate a pageToken from a file path (or file name).
Two files are stored in the bucket:
my-bucket
/test.csv
/test2.csv
When I tried the Google APIs Explorer with the following parameters, I got the nextPageToken Cgh0ZXN0LmNzdg==.
I found out that decoding nextPageToken with Base64 yields the string test.csv.
bucket: my-bucket
pageToken:
prefix: test
maxResults: 1
{"kind": "storage#objects", "nextPageToken": "Cgh0ZXN0LmNzdg==", ...}
But how can I get Cgh0ZXN0LmNzdg== from test.csv?
I tried Base64 encoding, but the result didn't match.
import com.google.api.client.repackaged.org.apache.commons.codec.binary.Base64;
String lastFile = "test.csv";
String token = Base64.encodeBase64String(lastFile.getBytes());
String bucket = "my-bucket";
String prefix = "test";
Storage.Objects.List listObjects = client.objects().list(bucket);
listObjects.setPrefix(prefix);
listObjects.setPageToken(token);
long maxResults = 1;
listObjects.setMaxResults(maxResults);
do {
Objects objects = listObjects.execute();
List<StorageObject> items = objects.getItems();
token = objects.getNextPageToken();
listObjects.setPageToken(token);
} while (token != null);
I managed to derive the next token from the path string myself.
How to get nextToken from a path string:
String nextToken = base64encode(0x0a + lengthByte + pathString)
The lengthByte (anywhere between 0x01 (SOH) and 0x7f (DEL)) is simply the length of the path string. This matches the protobuf wire format: 0x0a is the header of field number 1 with wire type 2 (length-delimited), and the following byte(s) are the varint-encoded length of the object name.
my-bucket/
a/a(3byte) 0x03
a/ab(4byte) 0x04
test.txt(8byte) 0x08
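To make the pattern concrete, here is the byte-level construction for test.csv (8 bytes), which reproduces the token from the question:
0x0a 0x08 0x74 0x65 0x73 0x74 0x2e 0x63 0x73 0x76
(marker byte, length 8, then the UTF-8 bytes of "test.csv")
Base64 of these 10 bytes = Cgh0ZXN0LmNzdg==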
Notice
If the path is longer than 127 bytes, the length no longer fits in a single byte and is encoded as a multi-byte varint, so the single-byte pattern above no longer applies.
See also Object Name Requirements
import com.google.common.io.BaseEncoding;
String lastFile = "test.csv";
String token = base64Encode(lastFile);
String bucket = "my-bucket";
String prefix = "test";
Storage.Objects.List listObjects = client.objects().list(bucket);
listObjects.setPrefix(prefix);
listObjects.setPageToken(token);
long maxResults = 1;
listObjects.setMaxResults(maxResults);
do {
Objects objects = listObjects.execute();
List<StorageObject> items = objects.getItems();
token = objects.getNextPageToken();
listObjects.setPageToken(token);
} while (token != null);
private String base64Encode(String path) {
    // Build the token bytes: 0x0a marker, length byte, then the UTF-8 bytes of the path.
    byte[] utf8 = path.getBytes(Charsets.UTF_8);
    byte[] encoding = new byte[utf8.length + 2];
    encoding[0] = 0x0a;
    encoding[1] = (byte) utf8.length; // single-byte length; assumes path < 128 bytes
    System.arraycopy(utf8, 0, encoding, 2, utf8.length);
    return BaseEncoding.base64().encode(encoding);
}
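For example, base64Encode("test.csv") now produces exactly the token returned by the API:
String token = base64Encode("test.csv");
System.out.println(token); // Cgh0ZXN0LmNzdg==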
I know this question is already answered and the accepted answer is for Java; I'd like to mention that this question applies to PHP as well.
With the help of the accepted answer from sakama above, I figured out a PHP version of his solution.
The PHP equivalent for generating the token is as follows:
base64_encode(pack('c', 0x0a) . pack('c', $path_string_length) . pack('a*', $path_string));
The byte pattern does indeed seem to be (as sakama already mentioned):
<line feed><line data length><line data>
I am doing a project using LIBSVM and I am preparing my data for the library. How can I convert a CSV file to LIBSVM-compatible data?
CSV File:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/data/iris.csv
In the frequently asked questions:
How to convert other data formats to LIBSVM format?
It depends on your data format. A simple way is to use libsvmwrite in the libsvm matlab/octave interface. Take a CSV (comma-separated values) file in UCI machine learning repository as an example. We download SPECTF.train. Labels are in the first column. The following steps produce a file in the libsvm format.
matlab> SPECTF = csvread('SPECTF.train'); % read a csv file
matlab> labels = SPECTF(:, 1); % labels from the 1st column
matlab> features = SPECTF(:, 2:end);
matlab> features_sparse = sparse(features); % features must be in a sparse matrix
matlab> libsvmwrite('SPECTFlibsvm.train', labels, features_sparse);
The transformed data are stored in SPECTFlibsvm.train.
Alternatively, you can use convert.c to convert CSV format to libsvm format.
But I don't want to use Matlab; I use Python.
I also found this solution using Java.
Can anyone recommend a way to tackle this problem?
You can use csv2libsvm.py to convert CSV to LIBSVM data:
python csv2libsvm.py iris.csv libsvm.data 4 True
where 4 is the index of the target (label) column and True means the CSV has a header.
Finally, you can get libsvm.data as
0 1:5.1 2:3.5 3:1.4 4:0.2
0 1:4.9 2:3.0 3:1.4 4:0.2
0 1:4.7 2:3.2 3:1.3 4:0.2
0 1:4.6 2:3.1 3:1.5 4:0.2
...
from iris.csv
150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
...
csv2libsvm.py does not work with Python 3, and it does not support string label targets, so I have slightly modified it. Now it should work with Python 3 as well as with string labels.
I am very new to Python, so my code may not follow best practices, but I hope it is good enough to help someone.
#!/usr/bin/env python
"""
Convert CSV file to libsvm format. Works only with numeric features;
labels may be strings. Put -1 as label index (argv[3]) if there are
no labels in your file. Expecting no headers. If present, headers
can be skipped with argv[4] == 1.
"""
import sys
import csv

def construct_line(label, line, labels_dict):
    new_line = []
    if label.isnumeric():
        if float(label) == 0.0:
            label = "0"
        new_line.append(label)
    else:
        # Map each distinct string label to a numeric id, assigned on first sight.
        if label in labels_dict:
            new_line.append(labels_dict.get(label))
        else:
            label_id = str(len(labels_dict))
            labels_dict[label] = label_id
            new_line.append(label_id)
    for i, item in enumerate(line):
        # libsvm is a sparse format: empty and zero-valued features are skipped.
        if item == '' or float(item) == 0.0:
            continue
        elif item == 'NaN':
            item = "0.0"
        new_item = "%s:%s" % (i + 1, item)
        new_line.append(new_item)
    new_line = " ".join(new_line)
    new_line += "\n"
    return new_line

# ---

input_file = sys.argv[1]
try:
    output_file = sys.argv[2]
except IndexError:
    output_file = input_file + ".out"
try:
    label_index = int(sys.argv[3])
except IndexError:
    label_index = 0
try:
    skip_headers = int(sys.argv[4])
except IndexError:
    skip_headers = 0

i = open(input_file, 'rt')
o = open(output_file, 'wb')

reader = csv.reader(i)
if skip_headers:
    headers = next(reader)

labels_dict = {}
for line in reader:
    if label_index == -1:
        label = '1'
    else:
        label = line.pop(label_index)
    new_line = construct_line(label, line, labels_dict)
    o.write(new_line.encode('utf-8'))
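Usage stays the same as the original script; with the modified version, pass 1 (rather than True) to skip the header row:
python csv2libsvm.py iris.csv libsvm.data 4 1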