I have a weird problem.
I have an application that crawls a webpage to get a list of names. This list is then passed to another application that, using those names, requests information from a site through its APIs.
When I compare strings from the first webpage with strings grabbed via the APIs, the comparison usually fails.
I printed the character values letter by letter and got this:
Rocco De Nicola
82 111 99 99 111 160 68 101 32 78 105 99 111 108 97 1st web page
82 111 99 99 111 32 68 101 32 78 105 99 111 108 97 2nd
As you can see, in the first string a space is codified by 160 (non-breaking space) instead of 32.
How can I encode the first set of Strings correctly?
I have also tried setting the Charset to UTF-8, but it didn't work.
Maybe I just have to replace 160 with 32?
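If the non-breaking space is the only offender, replacing U+00A0 with a regular space before comparing should be enough; a minimal sketch (the class and method names are mine, not from your code):

```java
public class NameNormalizer {
    // Replace non-breaking spaces (U+00A0) with regular spaces and trim.
    static String normalize(String s) {
        return s.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        String fromWeb = "Rocco\u00A0De Nicola"; // space encoded as 160
        String fromApi = "Rocco De Nicola";      // regular space (32)
        System.out.println(normalize(fromWeb).equals(fromApi)); // prints true
    }
}
```

If other invisible characters show up later (thin spaces, zero-width characters), a more general normalization pass over all whitespace-like code points may be needed.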
I would first trim the strings and replace the problematic characters, and only then call equals. This also brings advantages when you have language-specific replacements in your text. It's also a good idea to convert the resulting strings to lower case.
Normally I use something like this:
private String removeExtraCharsAndToLower(String str) {
    str = str.toLowerCase();
    str = str.replaceAll("ä", "ae");
    str = str.replaceAll("ö", "oe");
    str = str.replaceAll("ü", "ue");
    str = str.replaceAll("ß", "ss");
    return str.replaceAll("[^a-z]", "");
}
Using brute force. This lists all the character sets that convert 160 to 32 when encoding.
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.Map;

String s = "" + (char) 160;
for (Map.Entry<String, Charset> stringCharsetEntry : Charset.availableCharsets().entrySet()) {
    try {
        ByteBuffer bytes = stringCharsetEntry.getValue().encode(s);
        if (bytes.get(0) == 32)
            System.out.println(stringCharsetEntry.getKey());
    } catch (Exception ignored) {
    }
}
This prints nothing.
If I change the condition to
if (bytes.get(0) != (byte) 160)
    System.out.println(stringCharsetEntry.getKey() + " " + new String(bytes.array(), java.nio.charset.StandardCharsets.ISO_8859_1));
I get quite a few examples.
I am creating a file by appending all the strings to a StringBuilder and then dumping it to a file. But when I check the file, the strings in each line are not properly aligned.
Below is the code, called by an upper layer passing in all the necessary data.
private static StringBuilder appendStrings(String pro, List<ProcConfig> list,
                                           StringBuilder sb) {
    for (ProcConfig pc : list) {
        sb.append(pro).append(" ").append(pc.getNo()).append(" ").append(pc.getId())
          .append(" ").append(pc.getAddress()).append(" ").append(pc.getPort()).append(" ")
          .append(pc.getNumPorts()).append(" ").append(pc.getName())
          .append(System.getProperty("line.separator"));
    }
    sb.append(System.getProperty("line.separator"));
    return sb;
}
And here is how sample lines from the above code look. After each newline there is a new set, so I want to align all the lines properly within each set.
config 51 106 10.178.151.25 8095 5 tyt_87612nsas_woqa_7y2_0
config 51 104 10.124.192.124 8080 5 tyt_abc_pz1_rn03c-7vb_01
if_hello_abc_tree tyt.* then process_is_not_necessary 32 80 10.86.25.29 9091 5 tyt_goldenuserappslc22
if_hello_abc_tree tyt.* then process_is_not_necessary 51 50 10.174.192.209 9091 5 tyt_goldenuserapprno01
if_hello_abc_tree tyt.* then config 4 140 10.914.198.26 10001 1 silos_lvskafka-1702600
if_hello_abc_tree tyt.* then config 4 184 10.444.289.138 10001 1 silos_lvskafka-1887568
Is there any way I can align each of the above lines properly? The output should look like this:
config 51 106 10.178.151.25 8095 5 tyt_87612nsas_woqa_7y2_0
config 51 104 10.124.192.124 8080 5 tyt_abc_pz1_rn03c-7vb_01
if_hello_abc_tree tyt.* then process_is_not_necessary 32 80 10.86.25.29 9091 5 tyt_goldenuserappslc22
if_hello_abc_tree tyt.* then process_is_not_necessary 51 50 10.174.192.209 9091 5 tyt_goldenuserapprno01
if_hello_abc_tree tyt.* then config 4 140 10.914.198.26 10001 1 silos_lvskafka-1702600
if_hello_abc_tree tyt.* then config 4 184 10.444.289.138 10001 1 silos_lvskafka-1887568
Update:
Below is how it is generated now. As you can see, the IP address column is slightly off if you compare the two lines.
config 51 106 97.143.765.65 8095 5 abc_tyewaz1_rna03c-7nhl_02
config 51 104 97.143.162.184 8080 5 abc_tyewaz1_rna03c-7vjb_01
Instead, can we generate it like below, making each column straight? Is this possible?
config 51 106 97.143.765.65 8095 5 abc_tyewaz1_rna03c-7nhl_02
config 51 104 97.143.162.184 8080 5 abc_tyewaz1_rna03c-7vjb_01
First of all: I'm not a Java developer, but I know a little about it. I know this is not the best way to solve your problem, but you'll get the point.
Instead of calling the append method multiple times, use a Formatter and loop like this:
private static StringBuilder appendStrings(String pro, List<ProcConfig> list, StringBuilder sb) {
    Formatter formatter = new Formatter(sb);
    String template = "";
    for (ProcConfig pc : list) {
        if (pro.length() == 6)
            template = "%-6s %d %3d %-15s %d %d %s %n";
        else if (pro.length() > 35)
            template = "%-53s %d %3d %-15s %d %d %s %n";
        else
            template = "%-35s %d %3d %-15s %d %d %s %n";
        formatter.format(template, pro, pc.getNo(), pc.getId(), pc.getAddress(), pc.getPort(), pc.getNumPorts(), pc.getName());
    }
    formatter.close();
    return sb;
}
Because pro comes in different lengths, you can switch to a template appropriate for each pro.
Note: Don't forget to import java.util.Formatter.
You can use format specifiers to control how the data is formatted. Here is a list of common specifiers:
%s or %S: string
%x or %X: hexadecimal integer
%o: octal integer
%d: decimal integer
%c: character
%t or %T: date and time
%n: line separator
%b or %B: boolean
%a or %A: hexadecimal floating point
%f: decimal floating point
As I said before in the comments, the dash means the string is left-justified and padded to the number of characters defined in the template, so try to find the widths that best suit your needs.
To learn more about formatter take a look at here and here.
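If hard-coding one template per pro feels brittle, another option is to compute each column's width from the data first and pad to it. A rough, self-contained sketch of that idea on plain string rows (not the ProcConfig type from the question):

```java
import java.util.List;

public class ColumnAligner {
    // Find the widest cell per column, then left-pad every cell to that width.
    static String align(List<String[]> rows) {
        int cols = rows.get(0).length;
        int[] width = new int[cols];
        for (String[] row : rows)
            for (int c = 0; c < cols; c++)
                width[c] = Math.max(width[c], row[c].length());
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            for (int c = 0; c < cols; c++)
                sb.append(String.format("%-" + width[c] + "s ", row[c]));
            sb.append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(align(List.of(
                new String[]{"config", "51", "106", "10.178.151.25"},
                new String[]{"config", "51", "104", "10.124.192.124"})));
    }
}
```

The trade-off is that this needs two passes (or buffering all rows), whereas a fixed template streams in one pass.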
We have a problem when saving UTF-8 encoded XML data into a table in DB2 9.7 LUW.
Table DDL:
CREATE TABLE DB2ADMIN.TABLE_FOR_XML
(
    ID INTEGER NOT NULL,
    XML_FIELD XML NOT NULL
)
The problem occurs in some rare cases with rare Unicode characters; we are using the Java JDBC DB2 driver.
For example, looking in an editor in normal mode (not hex view, e.g. Notepad++), the strange character below (after "16.") is shown as NEL in a black square.
The input XML is UTF-8 encoded, and viewed in a hex editor it has these values:
00000010h: 31 36 2E 20 C2 85 42 ; 16. Â…B
After inserting into DB2, I presume that some kind of conversion occurs, because when selecting the data back this same character is now
00000010h: 31 36 2E 20 0D 0A 42 ; 16. ..B
C2 85 is transformed into 0D 0A, that is, a new line.
Another thing I noticed: when saving the XML into the table, the header started with
<?xml version="1.0" encoding="UTF-8"?>
but after fetching the XML from DB2, the content started with
<?xml version="1.0" encoding="UTF-16"?>
Is there a way to force DB2 to store XML in UTF-8 without conversions? Fetching with XMLSERIALIZE didn't help:
SELECT XML_FIELD AS CONTENT1, XMLSERIALIZE(XML_FIELD AS CLOB(1M)) AS CONTENT2 FROM DB2ADMIN.TABLE_FOR_XML
In CONTENT2 there is no XML header, but the newline is still there.
This behaviour is standard for XML 1.1 processors. From XML 1.1, section 2.11:
the XML processor must behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating [the single character #x85] to a single #xA character
The line-ending style is one of the many details of a document that are lost over a parse-and-serialise cycle (like attribute order, whitespace in tags, numeric character references...).
It's slightly surprising that DB2's XML fields use XML 1.1, since not much uses that revision of XML, but not super-surprising in that support for NEL (an ancient, largely useless mainframe line-ending character) is something only IBM ever wanted.
Is there way to force db2 to store XML in UTF-8 without conversions ?
Use a BLOB?
If you need both native-XML-field functionality and to retain the exact original serialised form of a document then you'll need two columns.
(Are you sure you need to retain NEL line endings? Nobody usually cares about line endings, and these are pretty bogus.)
As I don't generally need non-printable characters, before saving the XML string to DB2 I decided to clean the string of x'C285' (code point 133) and, just in case, of 4-byte UTF-8 characters.
I found a similar example (How to replace/remove 4(+)-byte characters from a UTF-8 string in Java?) and adjusted it.
public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
public static final String REPLACEMENT_CHAR = "\uFFFD";

public static String toValid3ByteUTF8String(String line) {
    final int length = line.length();
    StringBuilder b = new StringBuilder(length);
    for (int offset = 0; offset < length; ) {
        final int codepoint = line.codePointAt(offset);
        if (codepoint > LAST_3_BYTE_UTF_CHAR.codePointAt(0)) { // 4-byte UTF-8, replace
            b.append(REPLACEMENT_CHAR);
        } else if (codepoint == 133) { // NEL, x'C285'
            b.append(REPLACEMENT_CHAR);
        } else if (Character.isValidCodePoint(codepoint)) {
            b.appendCodePoint(codepoint);
        } else {
            b.append(REPLACEMENT_CHAR);
        }
        offset += Character.charCount(codepoint);
    }
    return b.toString();
}
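As a quick check of what the cleaner does, here is a condensed, self-contained version of the same method run on a string containing NEL (U+0085) and a 4-byte emoji (the class name is mine):

```java
public class Utf8Cleaner {
    static final int LAST_3_BYTE_CP = 0xFFFF;  // highest code point encodable in 3 UTF-8 bytes
    static final String REPLACEMENT = "\uFFFD";

    static String clean(String line) {
        StringBuilder b = new StringBuilder(line.length());
        for (int offset = 0; offset < line.length(); ) {
            int cp = line.codePointAt(offset);
            if (cp > LAST_3_BYTE_CP || cp == 0x85 || !Character.isValidCodePoint(cp))
                b.append(REPLACEMENT);  // 4-byte character, NEL, or invalid
            else
                b.appendCodePoint(cp);
            offset += Character.charCount(cp);
        }
        return b.toString();
    }

    public static void main(String[] args) {
        // NEL between "16." and "B", plus a 4-byte emoji at the end
        String cleaned = clean("16.\u0085B\uD83D\uDE00");
        System.out.println(cleaned.equals("16.\uFFFDB\uFFFD")); // prints true
    }
}
```

Both the NEL and the emoji come back as U+FFFD, so nothing DB2 would normalize survives the round trip.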
I'm implementing an API that reads data from a JSON response and writes the resulting objects to CSV.
Is there a way to convert an object in Java to a table format (row-column)?
E.g. assume I have these objects:
public class Test1 {
    private int a;
    private String b;
    private Test2 c;
    private List<String> d;
    private List<Test2> e;
    // getters-setters ...
}

public class Test2 {
    private int x;
    private String y;
    private List<String> z;
    // getters-setters ...
}
Let's say I have an instance with the following values:
Test1 c1 = new Test1();
c1.setA(11);
c1.setB("12");
c1.setC(new Test2(21, "21", Arrays.asList("211", "212")));
c1.setD(Arrays.asList("111", "112"));
c1.setE(Arrays.asList(
        new Test2(31, "32"),
        new Test2(41, "42")));
I would like to see something like this returned as a List<Map<String, Object>> or some other object:
a b c.x c.y c.z d e.x e.y
---- ---- ------ ------- ------ ---- ------ ------
11 12 21 21 211 111 31 32
11 12 21 21 211 111 41 42
11 12 21 21 211 112 31 32
11 12 21 21 211 112 41 42
11 12 21 21 212 111 31 32
11 12 21 21 212 111 41 42
11 12 21 21 212 112 31 32
11 12 21 21 212 112 41 42
I have already implemented something to achieve this result using reflection, but my solution is too slow for larger objects.
I was thinking of using an in-memory database to convert the object into a database table and then select the result, something like MongoDB or ObjectDB, but I think it's overkill, and maybe slower than my approach. Also, these two do not support an in-memory mode, and I do not want to use another disk-based database, since I'm already using MySQL with Hibernate. Using a ramdisk is not an option, since my server only has limited RAM. Is there an in-memory OODBMS that can do this?
As a solution I would prefer an algorithm, or even better, an existing library that can convert any object to a row-column format, something like Jackson or JAXB, which convert data to and from other formats.
Thanks for the help
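For what it's worth, the expansion shown in the example table is a cartesian product over the list-valued fields. A minimal sketch of just that step on plain maps (the names expand, column, etc. are mine, not from any library):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Denormalize {
    // Cross every partial row with every candidate value for one list-valued column.
    static List<Map<String, Object>> expand(List<Map<String, Object>> rows,
                                            String column, List<?> values) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows)
            for (Object v : values) {
                Map<String, Object> copy = new LinkedHashMap<>(row);
                copy.put(column, v);
                out.add(copy);
            }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> seed = new LinkedHashMap<>();
        seed.put("a", 11);
        seed.put("b", "12");
        List<Map<String, Object>> rows = List.of(seed);
        rows = expand(rows, "c.z", List.of("211", "212"));
        rows = expand(rows, "d", List.of("111", "112"));
        System.out.println(rows.size()); // 2 x 2 = 4 combinations
    }
}
```

Applying expand once per list-valued field reproduces the row counts in the table above (2 x 2 x 2 = 8 rows for c.z, d and e); the reflection part of the problem is then just discovering which fields are lists.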
Finally, after one week of banging my head against every possible thing available in my house, I managed to find a solution.
I shared the code on GitHub so that if anyone ever encounters this problem again, they can avoid a couple of migraines :)
you can get the code from here:
https://github.com/Sebb77/Denormalizer
Note: I had to use the getType() function and the FieldType enum for my specific problem.
In the future I will try to speed up the code with some caching, or something else :)
Note 2: this is just sample code that should be used only for reference. Lots of improvements can be made.
Anyone is free to use the code; just send me a thank-you email :)
Any suggestions, improvements or bugs reports are very welcome.
I'm using RXTX-2.1-7 and am trying to read a COM port on Windows. I see some inconsistencies in the received data compared to the data received using PuTTY.
The data received in PuTTY:
a5 5a 0b 05 10 00 00 00 00 27 d4 73 30 ae
But the exact same data received using RXTX:
3f 5a 0b 05 10 00 00 00 00 27 3f 73 30 3f
It seems like all received bytes greater than (at least) a0 are read as 3f. Here's the relevant part of the code I'm using:
char[] buffer = new char[14];
int i = 0;
Arrays.fill(buffer, (char) 0);
while (i < 14) {
    buffer[i++] = (char) charReader.read(); /*DOUBT*/
}
/*System.out.println(Arrays.toString(buffer));*/
String bufferString = new String(buffer);
System.out.println(String.format("%x ", new BigInteger(bufferString.getBytes("ISO-8859-1"))));
And charReader is an InputStreamReader over the opened serial port. I also checked whether the cast to (char) in the line marked /*DOUBT*/ is the culprit, but even without the cast I still get the inconsistency:
65533, 90, 11, 5, 16, 0, 0, 0, 0, 39, 65533, 115, 48, 65533
Any ideas why I'm getting this inconsistency?
It's a character-encoding issue: I think your Java program decodes the port input as UTF-8, so you get either ? (ASCII 0x3F) or the Unicode REPLACEMENT CHARACTER (65533) as a placeholder for invalid bytes.
It's better to work with bytes explicitly, without conversion to characters, when you are really working with bytes. If you absolutely have to represent bytes as characters, use the single-byte encoding ISO-8859-1 (which maps 1:1 to the first 256 Unicode code points).
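Concretely, that means dropping the InputStreamReader and reading the port's InputStream directly. A minimal sketch of byte-level reading with hex output (the ByteArrayInputStream here stands in for the serial port's stream):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RawByteRead {
    // Read up to n bytes and render them as hex, with no charset decoding involved.
    static String readHex(InputStream in, int n) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            int b = in.read();  // returns 0..255, or -1 on end of stream
            if (b < 0) break;
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xa5, 0x5a, 0x0b, 0x05, 0x10};
        System.out.println(readHex(new ByteArrayInputStream(data), 5));
        // prints a5 5a 0b 05 10
    }
}
```

Because read() hands back the raw byte value, 0xa5 stays 0xa5 instead of being mangled into 0x3f by a decoder.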
I need a hint on how to convert a huge (300-400 MB) ASCII file to a CSV file.
My ASCII file is a database dump with a lot of products (about 600,000 of them = 55,200,000 lines in the file).
Below is ONE product. It is like a table row in a database, with 88 columns.
If you count the lines below, there are 92.
Every occurrence of '00I' + CR+LF indicates a new row/product.
Each line is ended with a CR+LF.
A whole product/row is ended with the following three lines:
A00
A10
A21
-as shown below.
Between the starting line ('00I' + CR+LF) and the three ending lines, we have lines starting with 2 digits (the column name); whatever comes after those digits is the data for that column.
If we take the first line below the starting line, we see '0109321609': 01 indicates the column named 01, and the rest, '09321609', is the data stored in that column.
I want to strip out the two digits indicating each column name, so the first line after the '00I' marker, 0109321609, comes out as: ”09321609”.
Putting it together with the next line (02), it should give an output like:
”09321609”,”15274”, etc.
When coming to the end, we want a new row.
The first line ('00I') and the three last lines ('A00', 'A10' and 'A21') should not be included in the output file.
Here is what a row looks like (every line is ended by CR+LF):
00I
0109321609
0215274
032
0419685
05
062
072
081
09
111
121
15
161
17
1814740
1920120401
2020120401
2120120401
22
230
240
251
26BLAHBLAH 1000MG
27
281
29
30
31BLAHBLAH 1000 mg Filmtablets Hursutacinzki
32
3336
341
350
361
371
401
410
420
43
445774
45FTA
46
47AN03AX14
48BLAHBLAH00000000000000000000010
491
501
512
522
5317
542
552
561
572
581
591
60
61
62
631
641
65
66
67
681
69
721
74884
761
771
780
790
801
811
831
851474
86
871
880
891
901
911
922
930
941
951
961
97
98
990
A00
A10
A21
Anyone got a hint on how it can be converted?
The file is too big for a webserver with PHP and MySQL to handle. My thought was to put the file in a directory on my local server, read it, strip out the line numbers, and insert the data directly into a MySQL database on the fly, but the file is too big and the server stalls.
I'm able to run under Linux (Ubuntu) and Windows 7.
Maybe some Python or Java is recommended? I can run both, but my experience with them is low. I'm a quick learner, though, so can someone give a hint? :-)
Best Regards
Bjarke :-)
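Since you mention Java as an option: the same streaming idea — cut the two-digit prefix from each data line and emit a CSV row when the 'A21' terminator appears — could be sketched like this (the quoting is naive; a real CSV writer should also escape embedded quotes):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

public class AsciiToCsv {
    // Stream the record format: '00I' starts a product, 'A21' ends it,
    // 'A00'/'A10' are skipped, and every other line is "<2-digit column><data>".
    public static void convert(BufferedReader in, PrintWriter out) throws IOException {
        List<String> row = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("00I")) {
                row.clear();                                         // new product starts
            } else if (line.startsWith("A21")) {
                out.println("\"" + String.join("\",\"", row) + "\""); // product complete
            } else if (!line.startsWith("A00") && !line.startsWith("A10")) {
                row.add(line.length() > 2 ? line.substring(2) : ""); // drop column prefix
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String sample = "00I\n0109321609\n0215274\nA00\nA10\nA21\n";
        StringWriter sw = new StringWriter();
        try (PrintWriter pw = new PrintWriter(sw)) {
            convert(new BufferedReader(new StringReader(sample)), pw);
        }
        System.out.print(sw); // prints "09321609","15274"
    }
}
```

Reading line by line keeps memory flat regardless of file size, so the 300-400 MB file is no problem; swap the StringReader/StringWriter for FileReader/FileWriter for the real run.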
If you are absolutely certain that each entry is 92 lines long:
from itertools import izip
import csv

with open('data.txt') as inf, open('data.csv', 'wb') as outf:
    lines = (line[2:].rstrip() for line in inf)
    rows = (data[1:89] for data in izip(*([lines] * 92)))
    csv.writer(outf).writerows(rows)
In Python it would be like this:

import csv

fo = csv.writer(open('out.csv', 'wb'))
with open('eg.txt', 'r') as f:
    for line in f:
        assert line[:3] == '00I'
        buf = []
        for i in range(88):
            line = f.next()
            buf.append(line.strip()[2:])
        line = f.next()
        assert line[:3] == 'A00'
        line = f.next()
        assert line[:3] == 'A10'
        line = f.next()
        assert line[:3] == 'A21'
        fo.writerow(buf)