Java is changing my leading spaces to question marks

I am using this inside a Minecraft mod to read and write a file, and all the leading spaces are being converted to ? in the output file.
File input sample:
{
   "ReturnToStart": "1b",
File output sample:
{
???"ReturnToStart": "1b",
//xxxxxxxxxxxxxxxxxxxxxxx
var ips = new java.io.FileInputStream("ABC.json");
var fileReader = new java.io.InputStreamReader(ips,"UTF-8");
var data1 = fileReader.read();
var data;
var start1 = "";
while(data1 != -1) {
data = String.fromCharCode(data1);
start1 = start1+data;
data1 = fileReader.read();
}
fileReader.close();
var fileWriter = new java.io.FileWriter("J_out2.txt");
fileWriter.write(start1);
fileWriter.close();

It looks like you are using Nashorn in Java 8, which is JavaScript running in a Java VM with access to all Java objects. I don't think those are normal spaces; I suspect they are non-breaking spaces (code 160). It would be interesting to see what the value of data1 is at these positions.
Rebuilding the text with String.fromCharCode is not a reliable approach in Nashorn. In UTF-8 a single character can span multiple bytes, and the value that comes back from read() is a single 16-bit code unit, so it is simpler to let Java read and write whole lines. Note also that java.io.FileWriter encodes with the platform default charset, and a character that charset cannot represent is typically written as ?.
Below is probably what you need. I have included, but commented out, the start1 variable because you may want to use it in your code; it is not needed for the copy itself.
var fileReader = new java.io.InputStreamReader(
    new java.io.FileInputStream("ABC.json"), "UTF-8");
var bufferedReader = new java.io.BufferedReader(fileReader);
var fileWriter = new java.io.OutputStreamWriter(
    new java.io.FileOutputStream("J_out2.txt"), "UTF-8");
var line;
// var start1 = new java.lang.StringBuilder();
// Compare against null explicitly: an empty line is falsy in JavaScript
// and would otherwise end the loop early.
while ((line = bufferedReader.readLine()) != null) {
    // start1.append(line);
    // start1.append('\n');
    fileWriter.write(line);
    fileWriter.write('\n');
}
fileWriter.close();
bufferedReader.close();
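To see why a ? can end up in the output file, here is a small Java sketch. The class name and the sample string are made up for illustration. It encodes a line that starts with non-breaking spaces, as suspected above, once with a charset that cannot represent them and once with UTF-8. US-ASCII only stands in for whatever default charset FileWriter picks up on the machine; that choice is an assumption, not something the question confirms.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encode and decode with the same charset, as a file write followed by a read would.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample: three non-breaking spaces (U+00A0) before the key.
        String line = "\u00a0\u00a0\u00a0\"ReturnToStart\": \"1b\",";

        // US-ASCII cannot encode U+00A0, so each one is replaced by '?'.
        System.out.println(roundTrip(line, StandardCharsets.US_ASCII));

        // UTF-8 can encode it, so the text survives unchanged.
        System.out.println(roundTrip(line, StandardCharsets.UTF_8).equals(line));
    }
}
```

If the first line printed shows three question marks before the key, the same mechanism is a plausible explanation for the file in the question, and writing through an OutputStreamWriter with an explicit "UTF-8" avoids it.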

var ips = new java.io.FileReader("ABC.json");
var data1 = ips.read();
var data;
var start1 = "";
while (data1 != -1) {
    // skip line feed (10), vertical tab (11), and form feed (12)
    if (data1 == 11 || data1 == 12 || data1 == 10) {
        data1 = ips.read();
        continue;
    }
    //npc.say(data1+" "+ data);
    data = String.fromCharCode(data1);
    start1 = start1 + data;
    data1 = ips.read();
}
ips.close();
npc.say(start1);
Well, I took out line feed, vertical tab, and form feed (10, 11, 12) and it works.

Related

How to avoid running out of Java heap space when modifying data inside a Camel route?

I am new to Java and am currently trying to update the body of a message that is transported within a Camel route (via Talend).
So far all is well: I manage to modify the message. However, if the file is too large, the JVM runs out of memory.
How can I rebuild the message when only certain lines must be modified?
For example, I have to change line A but keep lines B and C intact, and then rebuild the message.
As far as I can tell, Camel does not provide something like an updateBody(), and setting the body on getOut() creates a new message.
Thanks in advance!
InputStream inputStream = exchange.getIn().getBody(InputStream.class);
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
String line = reader.readLine();
String columnToForceToZero = String.valueOf(exchange.getProperty("targetColumn"));
String lineToChange = String.valueOf(exchange.getProperty("targetLine"));
//String Separator = String.valueOf(exchange.getProperty("separator"));
List<String> listColumn = new ArrayList<String>(Arrays.asList(columnToForceToZero.split(",")));
int lengthLineToChange = lineToChange.length();
//String newBody = exchange.getIn().getBody(String.class);
String nouveauBody="";
while( line != null) {
if (lineToChange.equals(line.substring(0,lengthLineToChange))){
List<String> list = new ArrayList<String>(Arrays.asList(line.split("\\|")));
for (int i = 0; i < listColumn.size() ; i++) {
list.set(Integer.parseInt(listColumn.get(i)), context.work_YODA_IN_value_to_force);
}
line = String.join("|", list);
}
nouveauBody = nouveauBody + line +"\n";
line = reader.readLine();
}
exchange.getOut().setBody(nouveauBody);
exchange.getOut().setHeaders(exchange.getIn().getHeaders());
//copy attachements from IN to OUT to propagate them
exchange.getOut().setAttachments(exchange.getIn().getAttachments());
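One way to keep memory use flat is to stop accumulating the whole body in a String and instead write each line to a temporary file as soon as it has been processed. The sketch below shows only that streaming part, using plain java.io and java.nio. The class name, the method rewriteToTempFile, and its parameters are hypothetical, and UTF-8 input is assumed. Handing the resulting file back to the exchange as the new body is left out on purpose, because it depends on the Camel version and the route.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class StreamingLineRewriter {
    // Copies the input line by line to a temp file, forcing the given
    // columns to forcedValue on every line that starts with linePrefix.
    static Path rewriteToTempFile(InputStream in, String linePrefix,
                                  List<Integer> columns, String forcedValue) throws IOException {
        Path out = Files.createTempFile("body", ".tmp");
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.UTF_8));
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // startsWith is safe even when the line is shorter than the prefix
                if (line.startsWith(linePrefix)) {
                    // limit -1 keeps trailing empty fields
                    String[] fields = line.split("\\|", -1);
                    for (int column : columns) {
                        if (column < fields.length) {
                            fields[column] = forcedValue;
                        }
                    }
                    line = String.join("|", fields);
                }
                // only one line is held in memory at a time
                writer.write(line);
                writer.write('\n');
            }
        }
        return out;
    }
}
```

The design choice here is simply that the heap only ever holds the current line, so the size of the file no longer matters; the repeated nouveauBody + line concatenation in the code above is what makes memory grow with the input.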

Optimising CSV parsing to be faster

I'm working on a program that reads data from two large CSV files line by line, compares an array element from each file and, when a match is found, writes the data I need into a third file. The only problem is that it is very slow: it reads 1-2 lines per second, and I have millions of records. Any ideas on how I could make it faster? Here's my code:
public class ReadWriteCsv {
public static void main(String[] args) throws IOException {
FileInputStream inputStream = null;
FileInputStream inputStream2 = null;
Scanner sc = null;
Scanner sc2 = null;
String csvSeparator = ",";
String line;
String line2;
String path = "D:/test1.csv";
String path2 = "D:/test2.csv";
String path3 = "D:/newResults.csv";
String[] columns;
String[] columns2;
Boolean matchFound = false;
int count = 0;
StringBuilder builder = new StringBuilder();
FileWriter writer = new FileWriter(path3);
try {
// specifies where to take the files from
inputStream = new FileInputStream(path);
inputStream2 = new FileInputStream(path2);
// creating scanners for files
sc = new Scanner(inputStream, "UTF-8");
// while there is another line available do:
while (sc.hasNextLine()) {
count++;
// storing the current line in the temporary variable "line"
line = sc.nextLine();
System.out.println("Number of lines read so far: " + count);
// defines the columns[] as the line being split by ","
columns = line.split(",");
inputStream2 = new FileInputStream(path2);
sc2 = new Scanner(inputStream2, "UTF-8");
// checks if there is a line available in File2 and goes in the
// while loop, reading file2
while (!matchFound && sc2.hasNextLine()) {
line2 = sc2.nextLine();
columns2 = line2.split(",");
if (columns[3].equals(columns2[1])) {
matchFound = true;
builder.append(columns[3]).append(csvSeparator);
builder.append(columns[1]).append(csvSeparator);
builder.append(columns2[2]).append(csvSeparator);
builder.append(columns2[3]).append("\n");
String result = builder.toString();
writer.write(result);
}
}
builder.setLength(0);
sc2.close();
matchFound = false;
}
if (sc.ioException() != null) {
throw sc.ioException();
}
} finally {
//then I close my inputStreams, scanners and writer
Use an existing CSV library rather than rolling your own. It will be far more robust than what you have now.
However, your problem is not CSV parsing speed; it is that your algorithm is O(n^2): for each line in the first file, you scan the second file. This kind of algorithm explodes very quickly with the size of the data, and with millions of rows you will run into problems. You need a better algorithm.
The other problem is that you re-open and re-parse the second file for every line of the first. You should at least read it into memory once at the start of the program, as an ArrayList or similar, so you only need to load and parse it once.
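The two points above can be combined: read the second file once into a HashMap keyed by the column being compared, then make a single pass over the first file. The sketch below is one possible shape under stated assumptions. The class and method names are invented, the column positions mirror the question's code (column 3 of file 1 against column 1 of file 2), and the naive split(",") is kept only for brevity, so quoted fields would still need a real CSV library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

public class CsvHashJoin {
    static void join(BufferedReader file1, BufferedReader file2, Writer out) throws IOException {
        // Pass 1: index file 2 by its second column, read and parsed only once.
        Map<String, String[]> index = new HashMap<>();
        String line2;
        while ((line2 = file2.readLine()) != null) {
            String[] columns2 = line2.split(",");
            if (columns2.length > 3) {
                // keep the first row per key, like the original first-match loop
                index.putIfAbsent(columns2[1], columns2);
            }
        }

        // Pass 2: stream file 1 and look each key up in constant time on average.
        String line;
        while ((line = file1.readLine()) != null) {
            String[] columns = line.split(",");
            if (columns.length <= 3) {
                continue; // skip rows that are too short to compare
            }
            String[] match = index.get(columns[3]);
            if (match != null) {
                out.write(columns[3] + "," + columns[1] + "," + match[2] + "," + match[3] + "\n");
            }
        }
    }
}
```

This trades memory for time: file 2 must fit in the heap, but the work drops from one scan of file 2 per row of file 1 to one scan of each file.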
Use univocity-parsers' CSV parser as it won't take much longer than a couple of seconds to process two files with 1 million rows each:
public void diff(File leftInput, File rightInput) {
    CsvParserSettings settings = new CsvParserSettings(); //many config options here, check the tutorial
    CsvParser leftParser = new CsvParser(settings);
    CsvParser rightParser = new CsvParser(settings);
    leftParser.beginParsing(leftInput);
    rightParser.beginParsing(rightInput);
    String[] left;
    String[] right;
    int row = 0;
    while ((left = leftParser.parseNext()) != null && (right = rightParser.parseNext()) != null) {
        row++;
        if (!Arrays.equals(left, right)) {
            System.out.println(row + ":\t" + Arrays.toString(left) + " != " + Arrays.toString(right));
        }
    }
    leftParser.stopParsing();
    rightParser.stopParsing();
}
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Java remove diacritic

I am trying to write a function that removes diacritics (I deliberately don't want to use Normalizer). The function looks like this:
private static String normalizeCharacter(Character curr) {
String sdiac = "áäčďéěíĺľňóôőöŕšťúůűüýřžÁÄČĎÉĚÍĹĽŇÓÔŐÖŔŠŤÚŮŰÜÝŘŽ";
String bdiac = "aacdeeillnoooorstuuuuyrzAACDEEILLNOOOORSTUUUUYRZ";
char[] s = sdiac.toCharArray();
char[] b = bdiac.toCharArray();
String ret;
for(int i = 0; i < sdiac.length(); i++){
if(curr == s[i])
curr = b[i];
}
ret = curr.toString().toLowerCase();
ret = ret.replace("\n", "").replace("\r","");
return ret;
}
The function is called like this (every character from the file is sent to it):
private static String readFile(String fName) {
File f = new File(fName);
StringBuilder sb = new StringBuilder();
try{
FileInputStream fStream = new FileInputStream(f);
Character curr;
while(fStream.available() > 0){
curr = (char) fStream.read();
sb.append(normalizeCharacter(curr));
System.out.print(normalizeCharacter(curr));
}
}catch(IOException e){
e.printStackTrace();
}
return sb.toString();
}
The file text.txt contains ľščťžýáíéúäôň, and I expect the program to return lcstzyaieuaon, but instead I get ¾è yaieuaoò. I know the problem is somewhere in the encoding, but I don't know where. Any ideas?
You are converting each byte directly into a character.
However, the character ľ is not represented as a single byte. Its Unicode code point is U+013E, and its UTF-8 representation is C4 BE. Thus, it is represented by two bytes. The same is true for the other accented characters.
Suppose the encoding of your file is UTF-8. Then you read the byte value C4, and then you convert it to a char. This will give you the character U+00C4 (Ä), not U+013E. Then you read the BE, and it is converted to the character U+00BE (¾).
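The byte-by-byte explanation above can be checked with a short Java sketch. The class and method names are made up, and UTF-8 is assumed as in the explanation. It takes the two bytes C4 BE and compares casting each byte to a char, which is what the question's loop does, with decoding both bytes together.

```java
import java.nio.charset.StandardCharsets;

public class ByteVsCharDemo {
    // What the question's code does: each byte is cast to a char on its own.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // & 0xFF mirrors the 0..255 value returned by InputStream.read()
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    // What a Reader does: the bytes are decoded together using the charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC4, (byte) 0xBE};
        // Two separate characters, U+00C4 and U+00BE
        System.out.println(castEachByte(bytes).length());
        // One character, U+013E
        System.out.println(decodeUtf8(bytes).length());
    }
}
```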
So don't confuse bytes and characters. Instead of using the InputStream directly, you should wrap it in a Reader. A Reader is able to read characters based on the encoding it is created with:
BufferedReader reader = new BufferedReader(
    new InputStreamReader(
        new FileInputStream(f), StandardCharsets.UTF_8
    )
);
Now you can read characters, or even whole lines, and the decoding is handled for you.
int readVal;
while ((readVal = reader.read()) != -1) {
    curr = (char) readVal;
    // ... the rest of your code
}
Remember that read() without parameters still returns an int, so check for -1 before casting to char.

Handling Unicode characters in Java and writing the results to MySQL and HBase

I have a large Twitter dataset in which each tweet is stored as a JSON object; one of the fields is the tweet text, which is a sequence of Unicode characters. I need to store this tweet text in MySQL and HBase. MySQL and HBase don't store Unicode characters by default, and I tried a number of approaches, including changing the data storage format in MySQL. To process this 1 TB dataset I use MapReduce to extract the relevant fields from each JSON object and write the output as a string such as:
1000000069:2014-04-20+09:23:41,457811834188607488%3A1%3ART+%40followback_707%3A+Retweet+this+%3F+ALL+%3F+WHO+%3F+RETWEETS+%3F+WANT+%3F+NEW+FOLLOWERS+FAST+%3F+%3F+%23FollowPyramid+%3F+%23TeamFollowBack%
(The key and value are separated at the first comma, so in the text above 1000000069:2014-04-20+09:23:41 is the key and everything after it is the text information. The leading numbers in the value are the tweet ID and then the sentiment score of the tweet.)
In the text above some characters were encoded correctly, but other texts are completely illegible because their storage was not handled properly. The bigger challenge is handling the text in the MapReduce job and writing its output correctly, so that it is fit for storage in the required stores, i.e. MySQL and HBase.
1000001353:2014-04-28+13:20:59,460770655601164288%3A0%3A%22%40FuckinFather_%3A+%3F%3F%3F%3F%3F%3F%3F+%3F+%3F%3F%3F%3F%3F%3F%3F%3F.%22+%3F+%3F%3F%3F%3F%3F%3F+%3F%3F%3F%3F%3F+%3F+%3F%3F%3F%3F%3F%3F+%3F%3F%3F%3F%3F%3F.
The sequences of %3F we see all decode to question marks, which means the tweet text is lost completely. I have been using URLEncoder.encode(string, "UTF-8") to encode the tweet text; apparently it works for some strings but not for other kinds of characters. Is there a way to encode all kinds of tweet text correctly for storage in MySQL and HBase?
The code I am using to write the data from my mapper and reducer is given below:
My mapper function:
public class TweetParsingMapper {
public static void main(String[] args){
List<HashMap<String, Integer>> wordSentimentScoringList = createAffinDataSet();
HashSet<String> censorList = createCensorList();
try{
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in));
String jsonString;
while((jsonString = bufferedReader.readLine()) != null){
JSONObject tweetInfo = new org.json.JSONObject(jsonString);
String tweetText = tweetInfo.get("tweetText").toString();
String tweetId = tweetInfo.get("tweetId").toString();
String userId = tweetInfo.get("userId").toString();
String timeStamp = tweetInfo.get("timeStamp").toString();
Integer sentimentScore = calculateSentimentScore(wordSentimentScoringList, tweetText);
String censoredTweet = censorTweet(censorList, tweetText);
censoredTweet = censoredTweet.replace("\n", "+delimiterfornewline+");
censoredTweet = censoredTweet.replace("\r", "+delimiterforcarriagereturn+");
censoredTweet = censoredTweet.replace(",", "+delimiterforcomma+");
censoredTweet = tweetId + ":" + sentimentScore.toString() + ":"+ censoredTweet;
String requestKey = userId + ":" + timeStamp;
System.out.println(requestKey + "\t" + censoredTweet);
}
bufferedReader.close();
}catch(IOException e){
System.err.println(e.getMessage());
}
}
My reducer function:
{
try{
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in));
String input = null;
String requestKey = null;
String currentRequestKey = null;
List<String> responseString = new ArrayList<String>();
while((input = bufferedReader.readLine()) != null){
String[] mapperResposeSet = input.split("\t", 2);
requestKey = mapperResposeSet[0];
if(requestKey.equals(currentRequestKey) && currentRequestKey != null){
if(!responseString.contains(mapperResposeSet[1]))
responseString.add(mapperResposeSet[1]);
}
else{
if(currentRequestKey != null){
StringBuilder finalResponse = new StringBuilder();
for(String str : responseString){
finalResponse.append(str);
finalResponse.append("+delimiter+");
}
System.out.println(currentRequestKey + "," + URLEncoder.encode(finalResponse.toString(), "UTF-8"));
responseString.clear();
}
currentRequestKey = requestKey;
responseString.add(mapperResposeSet[1]);
}
}
}catch(IOException e){
System.err.println(e.getMessage());
}
}
A sample tweet object:
{"timeStamp": "2014-06-25+08:54:16", "tweetText": "\uff62\u30c9\u30e9\u30b4\u30f3\u30b3\u30a4\u30f3\u30ba\uff63\u3092\u59cb\u3081\u305f\u3088\uff01\u30b8\u30e3\u30e9\u30b8\u30e3\u30e9\u30b3\u30a4\u30f3\u304c\u8d85\u723d\u5feb\uff01\u4e00\u7dd2\u306b\u3042\u305d\u307c\uff01 http://t.co/An4nuiswy1 #\u30c9\u30e9\u30b3\u30a4", "uId": "1669549520", "ht": ["\u30c9\u30e9\u30b3\u30a4"], "tweetId": "481722030383828994"}
You could store it as is, that is, 6 bytes per 'character' (\uxxxx). If you could get it turned into UTF-8, it would take half the space. For example, the 6 characters \u30c9 represent Katakana DO; so does the hex e38389 in UTF-8, which is stored as 3 bytes.
In PHP (version >= 5.4), this will give the UTF-8 instead of the \uxxxx:
$t = json_encode($s, JSON_UNESCAPED_UNICODE);
I do not know the equivalent in Java.
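On the Java side, here is a small sketch of what probably happens. This is an assumption, since the job's actual default charset is not shown. Once the text has passed through a charset that cannot represent it, every such character is already a literal ?, and URLEncoder.encode can only turn that into %3F. If the string still holds the real characters, the same call produces their UTF-8 percent-escapes. The class and method names are illustrative, and US-ASCII stands in for the unknown lossy charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TweetEncodingDemo {
    // Simulates text being written and read back through the given charset.
    static String passThrough(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    static String urlEncode(String text) throws UnsupportedEncodingException {
        return URLEncoder.encode(text, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String katakanaDo = "\u30c9"; // the Katakana DO character from the sample tweet

        // Intact string: encoded as the three UTF-8 bytes E3 83 89.
        System.out.println(urlEncode(katakanaDo)); // %E3%83%89

        // After a lossy charset the character is already '?', so only %3F is left.
        System.out.println(urlEncode(passThrough(katakanaDo, StandardCharsets.US_ASCII))); // %3F
    }
}
```

If that is the cause, giving the streams an explicit charset, for example new InputStreamReader(System.in, StandardCharsets.UTF_8) for input and new PrintStream(System.out, true, "UTF-8") for output, should keep the characters intact between the mapper and the reducer.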

How to set a JSP variable in a JavaScript array?

Hi, I am trying to put a JSP variable into a JavaScript array. When I run my code I do not get my output; I only see a blank screen.
Here is my code:
value.jsp
<script>
<%
URL url;
ArrayList<String> list1 = new ArrayList<String>();
ArrayList<String> list2 = new ArrayList<String>();
List commodity = null;
List pric = null;
int c = 0;
int p = 0;
try {
// get URL content
String a = "http://122.160.81.37:8080/mandic/commoditywise?c=paddy";
url = new URL(a);
URLConnection conn = url.openConnection();
// open the stream and put it into BufferedReader
BufferedReader br = new BufferedReader(
new InputStreamReader(conn.getInputStream()));
StringBuffer sb = new StringBuffer();
String inputLine;
while ((inputLine = br.readLine()) != null)
{
System.out.println(inputLine);
// sb.append(inputLine);
String s=inputLine.replace("|", "\n");
s = s.replace("~"," ");
StringTokenizer str = new StringTokenizer(s);
while(str.hasMoreTokens())
{
String mandi = str.nextElement().toString();
String price = str.nextElement().toString();
list1.add(mandi);
list2.add(price);
}
}
commodity = list1.subList(0,10);
pric = list2.subList(0,10);
for (c = 0,p = 0; c < commodity.size()&& p<pric.size(); c++,p++)
{
String x = (String)commodity.get(c);
String y = (String)pric.get(p);
//out.println(y);
//out.println(x);}
%>
jQuery(function ($) {
var roles = [];
roles.push(<%= x%>)
roles.push(<%= y%>)
<%
}
%>
var counter = 0;
var $role = $('#role')
//repeat the passed function at the specified interval - it is in milliseconds
setInterval(function () {
//display the role and increment the counter to point to next role
$role.text(roles[counter++]);
//if it is the last role in the array point back to the first item
if (counter >= roles.length) {
counter = 0;
}
}, 400)
});
</script>
<%
br.close();
//System.out.println(sb);
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
%>
How can I achieve my desired output?
When you output to JS from JSP, you have to make sure your output is valid JS.
Your current code emits the strings without quotes, so the JS interpreter treats them as JS code rather than as the string values you want.
In addition, you are producing 10 copies of the JS
jQuery(function ($) {
var roles = [];
roles.push(<%= x%>)
roles.push(<%= y%>)
and that function is never closed.
I would look for a decent JSON output library. That would simplify your life quite a bit here.
It looks like you are clearing the roles variable (var roles = [];) with each pass of the loop.
You might want to consider not posting the entire file, but paring the example down to just the part you need help with.
You could also put the array into a JSON object and parse that in your JavaScript. That might be a cleaner implementation.
Try enclosing the scriptlet expressions in quotes, like roles.push("<%= x%>"), and also check the array initialization.
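The suggestions above amount to building the whole array on the Java side and emitting it once as a valid JS literal. Below is a minimal sketch of such a helper. The class JsArrayLiteral and its methods are hypothetical, the sample values are invented, and the escaping covers only quotes, backslashes, and line breaks, so a proper JSON library remains the safer choice for arbitrary data.

```java
import java.util.Arrays;
import java.util.List;

public class JsArrayLiteral {
    // Wraps a value in double quotes and escapes the characters that
    // would otherwise end or break the JS string literal.
    static String quote(String value) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : value.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Builds a JS array literal such as ["a", "b"] from the given values.
    static String toJsArray(List<String> values) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(quote(values.get(i)));
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        // Invented sample values standing in for a mandi name and a price
        System.out.println(toJsArray(Arrays.asList("Delhi", "1450")));
    }
}
```

In the JSP, the loop would then only fill a Java list, and the script would contain a single line along the lines of var roles = <%= JsArrayLiteral.toJsArray(values) %>; placed outside any loop, where values is whatever list the page has built.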
