hex-Encoding in Java goes wrong - java

Several experienced Java developers and I have worked on this for about an hour and cannot get it to work. Does anyone have any tips?
Problem:
We have text in an Excel file that seems to be encoded inconsistently. Sometimes there are special characters, sometimes not, and sometimes they are displayed and interpreted differently.
What I want to do is write a small Java program that checks the text from the Excel file and converts all the different character sequences into what we want them to be.
My Code:
while (iterator.hasNext()) {
Entity entity = (Entity) iterator.next();
Dataset dataset = produkt_store.getDataset(entity);
FormData formdata = dataset.getFormData();
DomElement dom = (DomElement) formdata.get(lang,
"cs_description_short").get();
String beschreibung = dom.toText(true);
System.out.println("Before: " + beschreibung);
String hexBeschreibung = StringToHex(beschreibung);
String newHexBeschreibung = hexBeschreibung.replaceAll("75 3F", "FC");
newHexBeschreibung = newHexBeschreibung.replaceAll("75 A8", "FC");
//beschreibung2 = beschreibung2.replaceAll("75A8", "FC");
System.out.println("After: " + HexToString(newHexBeschreibung));
System.out.println(hexBeschreibung.equals(newHexBeschreibung) + "\n");
// dom.set(beschreibung);
}
I also have these functions to encode to and decode from hex:
private static String StringToHex(String s) {
if (s.length() == 0)
return "";
char c;
StringBuffer buff = new StringBuffer();
for (int i = 0; i < s.length(); i++) {
c = s.charAt(i);
buff.append(Integer.toHexString(c) + " ");
}
return buff.toString().trim();
}
private static String HexToString(String s) {
if (s.length() == 0)
return "";
String[] arr = s.split(" ");
StringBuffer buff = new StringBuffer();
int i;
for (String str : arr) {
i = Integer.valueOf(str, 16).intValue();
String hs = new Character((char) i).toString();
buff.append(hs);
}
return buff.toString();
}
Example:
Sometimes where there should be an "ü" it is shown as "u?", which we obviously want to avoid. Looking at it in a hex editor, we see it represented sometimes as 753F and sometimes as 75A8. The same goes for "ä", "ö" and "ß". We tried to replace that with "ü", but it doesn't work. Does anyone have any tips?
Before that we tried String.replaceAll() with something like String.replaceAll("u\?","ü"), but that didn't work either: nothing was changed at all.
Thanks for any tips on that encoding stuff! :)
EDIT:
This is the solution which works perfectly fine:
beschreibung = beschreibung.replace("U\u0308", "\u00DC"); // "Ü"
beschreibung = beschreibung.replace("u\u0308", "\u00FC"); // "ü"
beschreibung = beschreibung.replace("A\u0308", "\u00C4"); // "Ä"
beschreibung = beschreibung.replace("a\u0308", "\u00E4"); // "ä"
beschreibung = beschreibung.replace("O\u0308", "\u00D6"); // "Ö"
beschreibung = beschreibung.replace("o\u0308", "\u00F6"); // "ö"
beschreibung = beschreibung.replace("s\u0308", "\u00DF"); // "ß"

Somewhere, ü was represented not as the single character LATIN SMALL LETTER U WITH DIAERESIS (U+00FC) but as a small letter u followed by COMBINING DIAERESIS (U+0308). This is valid.
Then there was some conversion back, maybe to ISO-8859-1 (or even US-ASCII?), and the combining mark was converted separately. There is no such character in ISO-8859-1, so you got a question mark instead.
A repair afterwards would be:
String s = ...
s = s.replace("U?", "\u00DC"); // "Ü"
s = s.replace("u?", "\u00FC"); // "ü"
...
(I have escaped the characters to prevent problems if the Java compiler and the editor use different encodings, which would be an error.)
That can also be done in a slightly more sophisticated way:
s = s.replaceAll("([aouAOU])\\?", "$1\u0308"); // Again ASCII + Umlaut separately
s = Normalizer.normalize(s, Normalizer.Form.NFC);
// Now single non-ASCII letters.
java.text.Normalizer might be a help here.
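A minimal sketch of the two-step repair described above, assuming the damaged text really contains a literal '?' after the base vowel. The class and method names are made up for illustration; java.text.Normalizer is the standard JDK class. Note that a genuine "u?" at the end of a question would be rewritten as well, so this is only safe on data known to be damaged.

```java
import java.text.Normalizer;

class UmlautRepair {
    // Hypothetical helper: puts a combining diaeresis after the vowel,
    // then lets NFC normalization compose the pair into one character.
    static String repair(String s) {
        String withCombiningMark = s.replaceAll("([aouAOU])\\?", "$1\u0308");
        return Normalizer.normalize(withCombiningMark, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String repaired = repair("Mu?ller");
        // Compare against the escaped precomposed form to avoid console encoding issues.
        System.out.println(repaired.equals("M\u00FCller"));
    }
}
```

Comparing against the escaped form keeps the check independent of the console encoding mentioned in the caveat.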
Caveat: the '?' may also appear only in a console (e.g. in the IDE), since a conversion takes place there too.
Somewhere a conversion was done. This can happen implicitly, where the encoding is optional and such. You might try with setting the system property file.encoding to UTF-8 or Cp1252 (Windows Latin-1).

First thing to check: is upper/lowercase important? For example, if your StringToHex produces "75 3f" (Integer.toHexString returns lowercase digits), replaceAll("75 3F", ...) will not match it. hexBeschreibung = hexBeschreibung.toUpperCase() would solve this issue.
Second (more of a hint): "u?" doesn't mean 'u' + '?', but 'u' + <not a Unicode character and definitely not '?'>.
I hope my first suggestion will help :)
--
Sorry I can't comment, so I have to edit:
Hex editors may show hex values upper or lower case, because it doesn't matter. You have to check your used String by yourself, because Java may represent hex in Strings with lowercase letters.
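To illustrate the case issue, here is a small sketch (the class and method names are illustrative only). Integer.toHexString produces lowercase digits, so an uppercase search string never matches the output of the question's StringToHex; formatting with an uppercase conversion avoids the mismatch.

```java
class HexCaseDemo {
    // Formats each UTF-16 code unit as uppercase hex (at least two digits), space-separated.
    static String toUpperHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(0xFC)); // fc
        System.out.println(toUpperHex("u\u0308"));     // 75 308
    }
}
```

The second line of output also shows that a letter u followed by a combining diaeresis is two code units inside a Java String, whatever bytes a hex editor shows for the file.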

Related

how to detect base64 encoded strings? [duplicate]

I want to decode a Base64 encoded string, then store it in my database. If the input is not Base64 encoded, I need to throw an error.
How can I check if a string is Base64 encoded?
You can use the following regular expression to check if a string constitutes a valid base64 encoding:
^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$
In base64 encoding, the character set is [A-Z, a-z, 0-9, and + /]. If the final group has fewer than 4 characters, it is padded with '=' characters.
^([A-Za-z0-9+/]{4})* means the string starts with 0 or more base64 groups.
([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$ means the string ends in one of three forms: [A-Za-z0-9+/]{4}, [A-Za-z0-9+/]{3}= or [A-Za-z0-9+/]{2}==.
If you are using Java, you can actually use commons-codec library
import org.apache.commons.codec.binary.Base64;
String stringToBeChecked = "...";
boolean isBase64 = Base64.isArrayByteBase64(stringToBeChecked.getBytes());
[UPDATE 1] Deprecation Notice
Use instead
Base64.isBase64(value);
/**
* Tests a given byte array to see if it contains only valid characters within the Base64 alphabet. Currently the
* method treats whitespace as valid.
*
* @param arrayOctet
* byte array to test
* @return {@code true} if all bytes are valid characters in the Base64 alphabet or if the byte array is empty;
* {@code false}, otherwise
* @deprecated 1.5 Use {@link #isBase64(byte[])}, will be removed in 2.0.
*/
@Deprecated
public static boolean isArrayByteBase64(final byte[] arrayOctet) {
return isBase64(arrayOctet);
}
Well you can:
Check that the length is a multiple of 4 characters
Check that every character is in the set A-Z, a-z, 0-9, +, / except for padding at the end which is 0, 1 or 2 '=' characters
If you're expecting that it will be base64, then you can probably just use whatever library is available on your platform to try to decode it to a byte array, throwing an exception if it's not valid base 64. That depends on your platform, of course.
As of Java 8, you can simply use java.util.Base64 to try and decode the string:
String someString = "...";
Base64.Decoder decoder = Base64.getDecoder();
try {
decoder.decode(someString);
} catch(IllegalArgumentException iae) {
// That string wasn't valid.
}
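The try/catch above can be wrapped into a reusable check. This is only a sketch: the class and method names are hypothetical, and it accepts whatever the JDK's basic decoder accepts. As far as I know that decoder does not insist on trailing '=' padding, so it may be slightly more lenient than the regular expressions shown elsewhere on this page.

```java
import java.util.Base64;

class Base64Check {
    // Returns true when the JDK's basic Base64 decoder accepts the input.
    static boolean looksLikeBase64(String candidate) {
        if (candidate == null) {
            return false;
        }
        try {
            Base64.getDecoder().decode(candidate);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown for characters outside the Base64 alphabet or a malformed final group.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeBase64("QUJD"));        // true
        System.out.println(looksLikeBase64("not base64!")); // false
    }
}
```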
Try like this for PHP5
//where $json is some data that can be base64 encoded
$json=some_data;
//this will check whether data is base64 encoded or not
if (base64_decode($json, true) !== false)
{
echo "base64 encoded";
}
else
{
echo "not base64 encoded";
}
Use this for PHP7
//$string parameter can be base64 encoded or not
function is_base64_encoded($string){
//this will check if $string is base64 encoded and return true, if it is.
if (base64_decode($string, true) !== false){
return true;
}else{
return false;
}
}
var base64Rejex = /^(?:[A-Z0-9+\/]{4})*(?:[A-Z0-9+\/]{2}==|[A-Z0-9+\/]{3}=|[A-Z0-9+\/]{4})$/i;
var isBase64Valid = base64Rejex.test(base64Data); // base64Data is the base64 string
if (isBase64Valid) {
// true if in base64 format
console.log('It is base64');
} else {
// false if not in base64 format
console.log('it is not in base64');
}
Try this:
public void checkForEncode(String string) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(string);
if (m.find()) {
System.out.println("true");
} else {
System.out.println("false");
}
}
It is impossible to check whether a string is base64 encoded. It is only possible to validate that the string has the format of a base64 encoded string, which means that it could have been produced by base64 encoding (to check that, the string can be validated against a regexp or with a library; many other answers to this question show good ways to do this, so I won't go into details).
For example, the string flow is a valid base64 encoded string. But it is impossible to know whether it is just a simple string, the English word flow, or the base64 encoding of the string ~Z0.
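The ambiguity is easy to reproduce. The sketch below (class and method names are hypothetical) decodes the ordinary English word with java.util.Base64 and prints the three bytes as text; ISO-8859-1 is chosen only because it maps every byte to exactly one character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class AmbiguityDemo {
    // Decodes Base64 and maps each resulting byte to one character.
    static String decodeToLatin1(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decodeToLatin1("flow")); // ~Z0
    }
}
```

The decoder raises no error here, which is why a format check alone cannot tell you whether the sender actually encoded anything.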
There are many variants of Base64, so consider just determining if your string resembles the variant you expect to handle. As such, you may need to adjust the regex below with respect to the index and padding characters (i.e. +, /, =).
class String
def resembles_base64?
self.length % 4 == 0 && self =~ /^[A-Za-z0-9+\/=]+\Z/
end
end
Usage:
raise 'the string does not resemble Base64' unless my_string.resembles_base64?
Check whether the string's length is a multiple of 4. Afterwards use this regex to make sure all characters in the string are base64 characters.
\A[a-zA-Z\d\/+]+={,2}\z
If the library you use adds a newline as a way of observing the 76 max chars per line rule, replace them with empty strings.
/^([A-Za-z0-9+\/]{4})*([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)$/
This regular expression helped me identify Base64 in my Rails application. I had only one problem: it also recognizes the string "errorDescripcion", which caused an error for me. To solve it, I additionally validate the length of the string.
For Flutter, I tested a couple of the above answers and translated them into a Dart function as follows:
static bool isBase64(dynamic value) {
if (value.runtimeType == String){
final RegExp rx = RegExp(r'^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$',
multiLine: true,
unicode: true,
);
final bool isBase64Valid = rx.hasMatch(value);
if (isBase64Valid == true) {return true;}
else {return false;}
}
else {return false;}
}
In Java below code worked for me:
public static boolean isBase64Encoded(String s) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(s);
return m.find();
}
This works in Python:
import base64
def IsBase64(str):
    try:
        base64.b64decode(str)
        return True
    except Exception as e:
        return False
if IsBase64("ABC"):
    print("ABC is Base64-encoded and its result after decoding is: " + str(base64.b64decode("ABC")).replace("b'", "").replace("'", ""))
else:
    print("ABC is NOT Base64-encoded.")
if IsBase64("QUJD"):
    print("QUJD is Base64-encoded and its result after decoding is: " + str(base64.b64decode("QUJD")).replace("b'", "").replace("'", ""))
else:
    print("QUJD is NOT Base64-encoded.")
Summary: IsBase64("string here") returns true if string here is Base64-encoded, and it returns false if string here was NOT Base64-encoded.
C#
This is performing great:
static readonly Regex _base64RegexPattern = new Regex(BASE64_REGEX_STRING, RegexOptions.Compiled);
private const String BASE64_REGEX_STRING = @"^[a-zA-Z0-9\+/]*={0,3}$";
private static bool IsBase64(this String base64String)
{
var rs = (!string.IsNullOrEmpty(base64String) && !string.IsNullOrWhiteSpace(base64String) && base64String.Length != 0 && base64String.Length % 4 == 0 && !base64String.Contains(" ") && !base64String.Contains("\t") && !base64String.Contains("\r") && !base64String.Contains("\n")) && (base64String.Length % 4 == 0 && _base64RegexPattern.Match(base64String, 0).Success);
return rs;
}
There is no way to distinguish a plain string from a base64 encoded one, unless the strings in your system have some specific limitation or identification.
This snippet may be useful when you know the length of the original content (e.g. a checksum). It checks that encoded form has the correct length.
public static boolean isValidBase64( final int initialLength, final String string ) {
final int padding ;
final String regexEnd ;
switch( ( initialLength ) % 3 ) {
case 1 :
padding = 2 ;
regexEnd = "==" ;
break ;
case 2 :
padding = 1 ;
regexEnd = "=" ;
break ;
default :
padding = 0 ;
regexEnd = "" ;
}
final int encodedLength = ( ( ( initialLength / 3 ) + ( padding > 0 ? 1 : 0 ) ) * 4 ) ;
final String regex = "[a-zA-Z0-9/\\+]{" + ( encodedLength - padding ) + "}" + regexEnd ;
return Pattern.compile( regex ).matcher( string ).matches() ;
}
If the RegEx does not work and you know the format style of the original string, you can reverse the logic, by regexing for this format.
For example, I work with base64 encoded XML files and just check whether the file contains valid XML markup. If it does not, I can assume that it's base64 encoded. This is not very dynamic but works fine for my small application.
This works in Python:
import re
def is_base64(string):
    # re has no test() function; re.match anchors at the start of the string
    if len(string) % 4 == 0 and re.match(r'^[A-Za-z0-9+/=]+\Z', string):
        return True
    else:
        return False
Try this using a previously mentioned regex:
String regex = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
if("TXkgdGVzdCBzdHJpbmc/".matches(regex)){
System.out.println("it's a Base64");
}
...We can also make a simple validation like, if it has spaces it cannot be Base64:
String myString = "Hello World";
if(myString.contains(" ")){
System.out.println("Not B64");
}else{
System.out.println("Could be B64 encoded, since it has no spaces");
}
If decoding yields a string containing non-ASCII characters, we assume the string was not encoded.
(RoR) ruby solution:
def encoded?(str)
Base64.decode64(str.downcase).scan(/[^[:ascii:]]/).count.zero?
end
def decoded?(str)
Base64.decode64(str.downcase).scan(/[^[:ascii:]]/).count > 0
end
Function Check_If_Base64(ByVal msgFile As String) As Boolean
Dim I As Long
Dim Buffer As String
Dim Car As String
Check_If_Base64 = True
Buffer = Leggi_File(msgFile)
Buffer = Replace(Buffer, vbCrLf, "")
For I = 1 To Len(Buffer)
Car = Mid(Buffer, I, 1)
If (Car < "A" Or Car > "Z") _
And (Car < "a" Or Car > "z") _
And (Car < "0" Or Car > "9") _
And (Car <> "+" And Car <> "/" And Car <> "=") Then
Check_If_Base64 = False
Exit For
End If
Next I
End Function
Function Leggi_File(PathAndFileName As String) As String
Dim FF As Integer
FF = FreeFile()
Open PathAndFileName For Binary As #FF
Leggi_File = Input(LOF(FF), #FF)
Close #FF
End Function
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public static String encodeBase64(String s) {
return Base64.getEncoder().encodeToString(s.getBytes());
}
public static String decodeBase64(String s) {
try {
if (isBase64(s)) {
return new String(Base64.getDecoder().decode(s));
} else {
return s;
}
} catch (Exception e) {
return s;
}
}
public static boolean isBase64(String s) {
String pattern = "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(s);
return m.find();
}
For Java flavour I actually use the following regex:
"([A-Za-z0-9+]{4})*([A-Za-z0-9+]{3}=|[A-Za-z0-9+]{2}(==){0,2})?"
This also makes the == optional in some cases.
Best!
I tried this one, and yes, it works:
^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$
but I added a condition to check that the string contains at least one = character:
string.lastIndexOf("=") >= 0

masking of email address in java

I am trying to mask an email address with "*", but I am bad at regex.
input : nileshxyzae@gmail.com
output : nil********@gmail.com
My code is
String maskedEmail = email.replaceAll("(?<=.{3}).(?=[^@]*?.@)", "*");
but it gives me the output nil*******e@gmail.com. I don't understand what is going wrong here. Why is the last character not converted?
Also, can someone explain the meaning of this regex?
Your look-ahead (?=[^@]*?.@) requires at least 1 character to be there in front of @ (see the dot before @).
If you remove it, you will get all the expected symbols replaced:
(?<=.{3}).(?=[^@]*?@)
Here is the regex demo (replace with *).
However, the regex is not a proper regex for the task. You need a regex that will match each character after the first 3 characters up to the first @:
(^[^@]{3}|(?!^)\G)[^@]
See another regex demo, replace with $1*. Here, [^@] matches any character that is not @, so we do not match addresses like abc@example.com. Only those emails will be masked that have 4+ characters in the username part.
See IDEONE demo:
String s = "nileshkemse@gmail.com";
System.out.println(s.replaceAll("(^[^@]{3}|(?!^)\\G)[^@]", "$1*"));
If you're bad at regular expressions, don't use them :) I don't know if you've ever heard the quote:
Some people, when confronted with a problem, think
"I know, I'll use regular expressions." Now they have two problems.
(source)
You might get a working regular expression here, but will you understand it today? tomorrow? in six months' time? And will your colleagues?
An easy alternative is using a StringBuilder, and I'd argue that it's a lot more straightforward to understand what is going on here:
StringBuilder sb = new StringBuilder(email);
for (int i = 3; i < sb.length() && sb.charAt(i) != '@'; ++i) {
sb.setCharAt(i, '*');
}
email = sb.toString();
"Starting at the fourth character (index 3), replace the characters with a * until you reach the end of the string or @."
(You don't even need to use StringBuilder: you could simply manipulate the elements of email.toCharArray(), then construct a new string at the end).
Of course, this doesn't work correctly for email addresses where the local part is shorter than 3 characters - it would actually then mask the domain.
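One way around the short-local-part caveat is to locate the separator first and never walk past it. This is a sketch rather than a definitive implementation: the class name, the method name and the keep parameter are made up here, and it assumes the separator is the usual @ character.

```java
class EmailMask {
    // Masks every character of the local part after the first `keep` characters.
    static String mask(String email, int keep) {
        int at = email.indexOf('@');
        if (at < 0) {
            return email; // no separator found: leave the input untouched
        }
        StringBuilder sb = new StringBuilder(email);
        // Start at `keep`, but never beyond the separator, so the domain is never touched.
        for (int i = Math.min(keep, at); i < at; i++) {
            sb.setCharAt(i, '*');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("nileshxyzae@gmail.com", 3)); // nil********@gmail.com
        System.out.println(mask("ab@x.com", 3));              // ab@x.com
    }
}
```

Short local parts are returned unmasked in this sketch; whether that is acceptable depends on your privacy requirements.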
Your Look-ahead is kind of complicated. Try this code :
public static void main(String... args) throws Exception {
String s = "nileshkemse@gmail.com";
s= s.replaceAll("(?<=.{3}).(?=.*@)", "*");
System.out.println(s);
}
O/P :
nil********@gmail.com
I like this one because I just want to hide 4 characters; it also dynamically decreases the number of hidden characters to 2 if the email address is too short:
public static String maskEmailAddress(final String email) {
final String mask = "*****";
final int at = email.indexOf("@");
if (at > 2) {
final int maskLen = Math.min(Math.max(at / 2, 2), 4);
final int start = (at - maskLen) / 2;
return email.substring(0, start) + mask.substring(0, maskLen) + email.substring(start + maskLen);
}
return email;
}
Sample outputs:
my.email@gmail.com > my****il@gmail.com
info@mail.com > i**o@mail.com
//In Kotlin
val email = "nileshkemse@gmail.com"
val maskedEmail = email.replace(Regex("(?<=.{3}).(?=.*@)"), "*")
public static string GetMaskedEmail(string emailAddress)
{
string _emailToMask = emailAddress;
try
{
if (!string.IsNullOrEmpty(emailAddress))
{
var _splitEmail = emailAddress.Split(Char.Parse("@"));
var _user = _splitEmail[0];
var _domain = _splitEmail[1];
if (_user.Length > 3)
{
var _maskedUser = _user.Substring(0, 3) + new String(Char.Parse("*"), _user.Length - 3);
_emailToMask = _maskedUser + "@" + _domain;
}
else
{
_emailToMask = new String(Char.Parse("*"), _user.Length) + "@" + _domain;
}
}
}
catch (Exception) { }
return _emailToMask;
}

java how to escape accented character in string

For example
{"orderNumber":"S301020000","customerFirstName":"ke ČECHA ","customerLastName":"张科","orderStatus":"PENDING_FULFILLMENT_REQUEST","orderSubmittedDate":"May 13, 2015 1:41:28 PM"}
How can I find accented characters like "Č" in the JSON string above and escape them in Java?
To give some context for this question, please see my other question:
Ajax unescape response text from java servlet not working properly
Sorry for my English :)
You should escape all characters that are greater than 0x7F. You can loop through the String's characters using the .charAt(index) method. For each character ch that needs escaping, replace it with:
String hexDigits = Integer.toHexString(ch).toUpperCase();
String escapedCh = "\\u" + "0000".substring(hexDigits.length()) + hexDigits;
I don't think you will need to unescape them in JavaScript because JavaScript supports escaped characters in string literals, so you should be able to work with the string the way it is returned by the server. I'm guessing you will be using JSON.parse() to convert the returned JSON string into a JavaScript object.
Here's a complete function:
public static String escapeJavaScript(String source)
{
StringBuilder result = new StringBuilder();
for (int i = 0; i < source.length(); i++)
{
char ch = source.charAt(i);
if (ch > 0x7F)
{
String hexDigits = Integer.toHexString(ch).toUpperCase();
String escapedCh = "\\u" + "0000".substring(hexDigits.length()) + hexDigits;
result.append(escapedCh);
}
else
{
result.append(ch);
}
}
return result.toString();
}

Want to replace special characters with equivalent UTF-8 symbols

As part of my application I have written a custom method to extract data from the DB and return it as a string. My string contains special characters like the pound sign, which looks like this when extracted:
"MyMobile Blue &#163;54.99 [12 month term]"
I want the &#163; to be replaced with the actual pound symbol. Below is my method:
public String getOfferName(String offerId) {
log(Level.DEBUG, "Entered getSupOfferName");
OfferClient client = (OfferClient) ApplicationContext
.get(OfferClient.class);
OfferObject offerElement = getOfferElement(client, offerId);
if (offerElement == null) {
return "";
} else {
return offerElement.getDisplayValue();
}
}
Can someone help with this?
The document contains XML/HTML entities.
You can use the StringEscapeUtils.unescapeXml() method from commons-lang to parse these back to their Unicode equivalents.
If this is HTML rather than XML, use the other methods, as there are differences between the two sets of entities.
I voted for the StringEscapeUtils.unescapeXml() solution. Anyway, here is a custom solution:
String s = "MyMobile Blue &#163;54.99 [12 month term]";
Pattern p = Pattern.compile("&#(\\d+?);");
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while(m.find()) {
int c = Integer.parseInt(m.group(1));
m.appendReplacement(sb, "" + (char)c);
}
m.appendTail(sb);
System.out.println(sb);
output
MyMobile Blue £54.99 [12 month term]
Note that it does not accept hexadecimal entity references.
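If hexadecimal references are needed as well, the same loop can be extended. The following is a sketch under assumptions, not a replacement for a real library: the class and method names are made up, named entities are left untouched, and out-of-range values are not validated (Character.toChars throws for them).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDecoder {
    // Group 1 captures hexadecimal digits, group 2 captures decimal digits.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|(\\d+));");

    static String decode(String s) {
        Matcher m = NUMERIC_ENTITY.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            String replacement = new String(Character.toChars(codePoint));
            // quoteReplacement keeps '$' and '\' from being read as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("MyMobile Blue &#163;54.99 and &#xA3;10");
        System.out.println(decoded.equals("MyMobile Blue \u00A354.99 and \u00A310"));
    }
}
```

Using Character.toChars instead of a cast to char also covers code points above U+FFFF.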

Trim() in Java not working the way I expect? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Query about the trim() method in Java
I am parsing a site's usernames and other information, and each one has a bunch of spaces after it (but spaces in between the words).
For example: "Bob the Builder " or "Sam the welder ". The numbers of spaces vary from name to name. I figured I'd just use .trim(), since I've used this before.
However, it's giving me trouble. My code looks like this:
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).trim());
}
The result is just the same; no spaces are removed at the end.
Thank you in advance for your excellent answers!
UPDATE:
The full code is a bit more complicated, since there are HTML tags that are parsed out first. It goes exactly like this:
for (String s : splitSource2) {
if (s.length() > "<td class=\"dddefault\">".length() && s.substring(0, "<td class=\"dddefault\">".length()).equals("<td class=\"dddefault\">")) {
splitSource3.add(s.substring("<td class=\"dddefault\">".length()));
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).substring(0, splitSource3.get(i).length() - 5));
splitSource3.set(i, splitSource3.get(i).trim());
System.out.println(i + ": " + splitSource3.get(i));
}
}
UPDATE:
Calm down. I never said the fault lay with Java, and I never said it was a bug or broken or anything. I simply said I was having trouble with it and posted my code for you to collaborate on and help solve my issue. Note the phrase "my issue" and not "java's issue". I have actually had the code printing out
System.out.println(i + ": " + splitSource3.get(i) + "*");
in a for each loop afterward.
This is how I knew I had a problem.
By the way, the problem has still not been fixed.
UPDATE:
Sample output (minus single quotes):
'0: Olin D. Kirkland                                          '
'1: Sophomore                                          '
'2: Someplace, Virginia  12345<br />VA SomeCity<br />'
'3: Undergraduate                                          '
EDIT the OP rephrased his question at Query about the trim() method in Java, where the issue was found to be Unicode whitespace characters which are not matched by String.trim().
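For illustration, here is a sketch of a trim that also removes Unicode space characters. The note above only says "Unicode whitespace characters", so the non-breaking space (U+00A0) used in the example is an assumption, and the class and method names are hypothetical. String.trim() only strips characters up to U+0020, which is why it leaves such a string unchanged.

```java
class UnicodeTrim {
    // Treats both Java whitespace and Unicode space separators as blank.
    private static boolean isBlank(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isBlank(s.charAt(start))) {
            start++;
        }
        while (end > start && isBlank(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String name = "Bob the Builder\u00A0\u00A0";
        System.out.println(name.trim().length());      // 17
        System.out.println(trimUnicode(name).length()); // 15
    }
}
```

Both predicates are needed because Character.isWhitespace deliberately excludes non-breaking spaces, while Character.isSpaceChar does not cover tabs or line breaks.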
It just occurred to me that I used to have this sort of issue when I worked on a screen-scraping project. The key is that sometimes the downloaded HTML sources contain non-printable characters which are not whitespace characters either. These are very difficult to copy-paste to a browser. I assume that this could have happened to you.
If my assumption is correct then you've got two choices:
Use a binary reader and figure out what those characters are - and delete them with String.replace(); E.g.:
private static String cutCharacters(String fromHtml) {
String result = fromHtml;
char[] problematicCharacters = {'\000', '\001', '\003'}; //this could be a private static final constant too
for (char ch : problematicCharacters) {
result = result.replace(String.valueOf(ch), ""); //removes every occurrence; the input parameter itself is not modified
}
return result;
}
If you find some sort of reoccurring pattern in the HTML to be parsed then you can use regexes and substrings to cut the unwanted parts. E.g.:
private String getImportantParts(String fromHtml) {
Pattern p = Pattern.compile("(\\w*\\s*)"); //this could be a private static final constant as well.
Matcher m = p.matcher(fromHtml);
StringBuilder buff = new StringBuilder();
while (m.find()) {
buff.append(m.group(1));
}
return buff.toString().trim();
}
Works without a problem for me.
Here is your code, refactored a bit and (maybe) more readable:
final String openingTag = "<td class=\"dddefault\">";
final String closingTag = "</td>";
List<String> splitSource2 = new ArrayList<String>();
splitSource2.add(openingTag + "Bob the Builder " + closingTag);
splitSource2.add(openingTag + "Sam the welder " + closingTag);
for (String string : splitSource2) {
System.out.println("|" + string + "|");
}
List<String> splitSource3 = new ArrayList<String>();
for (String s : splitSource2) {
if (s.length() > openingTag.length() && s.startsWith(openingTag)) {
String nameWithoutOpeningTag = s.substring(openingTag.length());
splitSource3.add(nameWithoutOpeningTag);
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
String name = splitSource3.get(i);
int closingTagBegin = splitSource3.get(i).length() - closingTag.length();
String nameWithoutClosingTag = name.substring(0, closingTagBegin);
String nameTrimmed = nameWithoutClosingTag.trim();
splitSource3.set(i, nameTrimmed);
System.out.println("|" + splitSource3.get(i) + "|");
}
I know that's not a real answer, but i cannot post comments and this code as a comment wouldn't fit, so I made it an answer, so that Olin Kirkland can check his code.
