Convert in reverse ascii to whole decimal in Java

Hi all and thank you for the help in advance.
I have scoured the webs and have not really turned up with anything concrete as to my initial question.
I have a program I am developing in Java whose primary purpose is to read a .DAT file, extract certain values from it, calculate an output based on the extracted values, and write that output back to the file.
The file is made up of records that are all the same length and format, so it should be fairly straightforward to access. Currently I am using a loop and an if statement to find the first occurrence of a record, and then user input to determine the length of each record so I can loop through the records.
HOWEVER! The first record of this file is blank (or so I thought). As it turns out, this first record is the key to the rest of the file: its first few chars are ascii and reference the record length and the number of records contained within the file respectively.
Below is a list of the ascii values themselves as found in the files (disregard the quotes; the ascii is contained within them):
"#¼ ä "
"#g â "
"ÇG # "
"lj ‰ "
"Çò È "
"=¼ "
A friend of mine who many years ago used to code in BASIC reckons the first 3 chars refer to the record length and the following 9 refer to the number of records.
Basically what I need to do is convert this initial string of ascii chars into two decimal numbers in order to work out the length of each record and the number of records.
Any assistance will be greatly appreciated.
Edit...
Please find below the Basic code used to access the file in the past, perhaps this will help?
CLS
INPUT "Survey System Data File? : ", survey$
survey$ = "f:\apps\survey\" + survey$
reclen = 3004
OPEN survey$ + ".dat" FOR RANDOM AS 1 LEN = reclen
FIELD #1, 3 AS RL$, 9 AS n$
GET #1, 1
RL = CVI(RL$): n = CVI(n$)
PRINT "Record Length = "; RL
reclen = RL
PRINT "Number of Records = "; n
CLOSE #1
Basically what I am looking for is something similar, but in Java.

ASCII is a way to translate a bit pattern in a byte to a character, and that gives each character a numerical value; for the letter 'A' this is 65.
In Java, you can get that numerical value by converting the char to an int (strictly, this gives you the Unicode value, but for ASCII characters the Unicode value is the same as the ASCII value, so this does not matter).
But now you need to know how the length is calculated: do you have to add the values? Or multiply them? Or append them? Or multiply them with 128^p, where p is the position, and add the results? And, in the latter case, is the first byte at position 0 or position 3?
Same for the number of records, of course.
Another possible interpretation of the data is that the bytes are BCD-encoded numbers. In that case, each nibble (4-bit group) represents a number from 0 to 9, and you have to do some bit manipulation to extract the digits and concatenate them, from left (highest) to right (lowest). At least you do not have to struggle with byte order and further interpretation here …
But as BCD would require 8 bits, this would not be the right interpretation if the file really contains ASCII, since ASCII is 7-bit.
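Given the BASIC listing in the question, though, the most likely interpretation is the one CVI uses: a 16-bit integer stored least-significant byte first. Here is a minimal sketch of that in Java. The header layout (a 3-byte length field followed by a 9-byte count field, of which CVI only ever reads the first two bytes) is taken from the FIELD statement; the file name and everything else is illustrative, not confirmed:

```java
public class DatHeader {
    // Mimics BASIC's CVI: a 16-bit integer stored least-significant byte first.
    static int cvi(byte lo, byte hi) {
        return (lo & 0xFF) | ((hi & 0xFF) << 8);
    }

    // Reads the 3-byte record-length field and the 9-byte record-count field
    // from the start of the .DAT file. CVI only looks at the first two bytes
    // of each field, matching the BASIC FIELD/CVI behaviour.
    static int[] readHeader(java.io.RandomAccessFile raf) throws java.io.IOException {
        byte[] h = new byte[12];
        raf.readFully(h);
        return new int[] { cvi(h[0], h[1]), cvi(h[3], h[4]) };
    }

    public static void main(String[] args) {
        // The record length 3004 from the BASIC listing would be stored
        // on disk as the two bytes BC 0B.
        System.out.println(cvi((byte) 0xBC, (byte) 0x0B)); // 3004
    }
}
```

If the real file matches this layout, `readHeader` applied to the opened file should reproduce the RL and n values the old BASIC program printed.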

how to output sorted files in java

I have a problem where I want to scan the files that are in a certain folder and output them.
The only problem is that the output is (1.jpg, 10.jpg, 11.jpg, 12.jpg, ..., 19.jpg, 2.jpg) when I want it to be (1.jpg, 2.jpg and so on). Since I use File actual = new File(i.); (i is the number of times the loop repeats) to scan for images, I don't know how to sort the output.
This is my code for now:
//variables
String htmlHeader = ("<!DOCTYPE html>:\n"
        + "<html lang=\"en\">\n"
        + "<head>\n"
        + "<meta charset=\"UTF-8\">\n"
        + "<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n"
        + "<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n"
        + "<title>Document</title>\n"
        + "</head>"
        + "<body>;\n");
String mangaName = ("THREE DAYS OF HAPPINESS");
String htmlEnd = ("</body>\n</html>");
String image = ("image-");
//ask for page number
Scanner scan = new Scanner(System.in);
System.out.print("enter a chapter number: ");
int n = scan.nextInt();
//create file for chapter
File creator = new File("manga.html");
//for loop
for (int i = 1; i <= n; ++i) {
    //writing to HTML file
    BufferedWriter bw = null;
    bw = new BufferedWriter(new FileWriter("manga" + i + ".html"));
    bw.write(htmlHeader);
    bw.write("<h2><center>" + mangaName + "</center></h2</br>");
    //scanning files
    File actual = new File("Three Days Of Happiness Chapter " + i + " - Manganelo_files.");
    for (File f : actual.listFiles()) {
        String pageName = f.getName();
        //create list
        List<String> list = Arrays.asList(pageName);
        list.sort(Comparator.nullsFirst(Comparator.comparing(String::length).thenComparing(Comparator.naturalOrder())));
        System.out.println("list");
        //writing body to html file
        bw.write("<p><center><img src=\"Three Days Of Happiness Chapter " + i + " - Manganelo_files/" + pageName + "\" <br/></p>\n");
        System.out.println(pageName);
    }
    bw.write(htmlEnd);
    bw.close();
    System.out.println("Process Finished");
}
When you try to sort the names, you'll most certainly notice that they are sorted alphanumerically (e.g. comparing 9 with 12: 12 would come before 9 because the leftmost digit 1 < 9).
One way to get around this is to use an extended numbering format when naming & storing your files.
This has been working great for me when sorting pictures, for example. I use YYYY-MM-DD for all dates regardless of whether the day contains one digit (e.g. 9) or two digits (e.g. 11). This means that I always type 9 as 09. It also means that every file name in a given folder has the same length, and each digit (when compared to the corresponding digit of any adjacent file name) is compared properly.
One solution to your problem is to do the same and add zeros to the left of the file names so that they are easily sorted both by the OS and by your Java program. The drawback to this solution is that you'll need to decide the maximum number of files you'll want to store in a given folder beforehand, by setting the number of digits properly (e.g. 3 digits would mean a maximum of 1000 uniquely & linearly numbered file names from 000 to 999). The plus, however, is that this will save you the hassle of having to sort unevenly numbered files, while making it so that your files are pre-sorted once and are ready to be quickly read whenever.
Generally, file systems do not have an order to the files in a directory. Instead, anything that lists files (be it an ls or dir command on a command line, calling Files.list in java code, or opening Finder or Explorer) will apply a sorting order.
One common sorting order is 'alphanumerically'. In which case, the order you describe is correct: 2 comes after 1 and also after 10. You can't wave a magic wand and tell the OS or file system driver not to do that; files as a rule don't have an 'ordering' property.
Instead, make your filenames such that they do sort the way you want, when sorting alphanumerically. Thus, the right name for the first file would be 01.jpg. Or possibly even 0001.jpg - you're going to have to make a call about how many digits you're going to use before you start, unfortunately.
String.format("%05d", 1) becomes "00001" - that's pretty useful here.
The same principle applies to reading files - you can't just rely on the OS sorting it for you. Instead, read it all into e.g. a list of some sort and then sort that. You're going to have to write a fairly funky sorting order: Find the dot, strip off the left side, check if it is a number, etc. Quite complicated. It would be a lot simpler if the 'input' is already properly zero-prefixed, then you can just sort them naturally instead of having to write a complex comparator.
That comparator should probably be modal. Comparators work by being handed 2 elements, and you must say which one is 'earlier', and you must be consistent (if a is before b, and later I ask: so, how about b and a?, you must indicate that b is after a).
Thus, an algorithm would look something like:
Determine if a is numeric or not (find the dot, parseInt the substring from start to the dot).
Determine if b is numeric or not.
If both are numeric, check ordering of these numbers. If they have an order (i.e. aren't identical), return an answer. Otherwise, compare the stuff after the dot (1.jpg should presumably be sorted before 1.png).
If neither are numeric, just compare alphanum (aName.compareTo(bName)).
If one is numeric and the other one is not, the numeric one always wins, and vice versa.
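That algorithm, sketched as an actual Comparator (a minimal version for illustration; it parses the part before the last dot and treats a parse failure as "not numeric"):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FileNameSort {
    // Returns the numeric value of the part before the last dot,
    // or -1 if that part is not a plain number.
    static long numericPrefix(String name) {
        int dot = name.lastIndexOf('.');
        String base = dot < 0 ? name : name.substring(0, dot);
        try {
            return Long.parseLong(base);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    static final Comparator<String> NUMERIC_FIRST = (a, b) -> {
        long na = numericPrefix(a);
        long nb = numericPrefix(b);
        if (na >= 0 && nb >= 0) {
            int byNumber = Long.compare(na, nb);
            // Equal numbers: fall back to comparing the rest (1.jpg before 1.png).
            return byNumber != 0 ? byNumber : a.compareTo(b);
        }
        if (na >= 0) return -1;   // numeric names sort before non-numeric ones
        if (nb >= 0) return 1;
        return a.compareTo(b);    // neither numeric: plain alphabetical order
    };

    public static void main(String[] args) {
        List<String> names = Arrays.asList("10.jpg", "2.jpg", "1.jpg", "cover.jpg");
        names.sort(NUMERIC_FIRST);
        System.out.println(names); // [1.jpg, 2.jpg, 10.jpg, cover.jpg]
    }
}
```

You would sort the array from actual.listFiles() (or the file names taken from it) with this comparator before writing the <img> tags.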

Keeping Java String Offsets With Unicode Consistent in Python

We are building a Python 3 program which calls a Java program. The Java program (which is a 3rd party program we cannot modify) is used to tokenize strings (find the words) and provide other annotations. Those annotations are in the form of character offsets.
As an example, we might provide the program with string data such as "lovely weather today". It provides something like the following output:
0,6
7,14
15,20
Where 0,6 are the offsets corresponding to word "lovely", 7,14 are the offsets corresponding to the word "weather" and 15,20 are offsets corresponding to the word "today" within the source string. We read these offsets in Python to extract the text at those points and perform further processing.
All is well and good as long as the characters are within the Basic Multilingual Plane (BMP). However, when they are not, the offsets reported by this Java program show up all wrong on the Python side.
For example, given the string "I feel 🙂 today", the Java program will output:
0,1
2,6
7,9
10,15
On the Python side, these translate to:
0,1 "I"
2,6 "feel"
7,9 "🙂 "
10,15 "oday"
Where the last index is technically invalid. Java sees "🙂" as length 2, which causes all the annotations after that point to be off by one from the Python program's perspective.
Presumably this occurs because Java encodes strings internally in a UTF-16esque way, and all string operations act on those UTF-16esque code units. Python strings, on the other hand, operate on the actual unicode characters (code points). So when a character shows up outside the BMP, the Java program sees it as length 2, whereas Python sees it as length 1.
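The mismatch is easy to demonstrate on the Java side with nothing but standard String methods (length counts UTF-16 code units, codePointCount counts characters):

```java
public class CodeUnits {
    public static void main(String[] args) {
        String s = "I feel \uD83D\uDE42 today";   // "I feel 🙂 today"
        // Java counts UTF-16 code units:
        System.out.println(s.length());                        // 15
        // ...but there are only 14 code points (characters):
        System.out.println(s.codePointCount(0, s.length()));   // 14
        // offsetByCodePoints converts a code-point index to a code-unit index:
        System.out.println(s.offsetByCodePoints(0, 8));        // 9
    }
}
```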
So now the question is: what is the best way to "correct" those offsets before Python uses them, so that the annotation substrings are consistent with what the Java program intended to output?
You could convert the string to a bytearray in UTF16 encoding, then use the offsets (multiplied by 2 since there are two bytes per UTF-16 code-unit) to index that array:
x = "I feel 🙂 today"
y = bytearray(x, "UTF-16LE")
offsets = [(0,1),(2,6),(7,9),(10,15)]
for word in offsets:
    print(str(y[word[0]*2:word[1]*2], 'UTF-16LE'))
Output:
I
feel
🙂
today
Alternatively, you could convert every python character in the string individually to UTF-16 and count the number of code-units it takes. This lets you map the indices in terms of code-units (from Java) to indices in terms of Python characters:
import itertools

x = "I feel 🙂 today"
utf16offsets = [(0,1),(2,6),(7,9),(10,15)] # from java program
# map python string indices to an index in terms of utf-16 code units
chrLengths = [len(bytearray(ch, "UTF-16LE"))//2 for ch in x]
utf16indices = [0] + list(itertools.accumulate(chrLengths))
# reverse the map so that it maps utf16 indices to python indices
index_map = dict((x,i) for i, x in enumerate(utf16indices))
# convert the offsets from utf16 code-unit indices to python string indices
offsets = [(index_map[o[0]], index_map[o[1]]) for o in utf16offsets]
# now you can just use those indices as normal
for word in offsets:
    print(x[word[0]:word[1]])
Output:
I
feel
🙂
today
The above code is messy and can probably be made clearer, but you get the idea.
This solves the problem given the proper encoding, which, in our situation appears to be 'UTF-16BE':
def correct_offsets(input, offsets, encoding):
    offset_list = [{'old': o, 'new': [o[0], o[1]]} for o in offsets]
    for idx in range(0, len(input)):
        if len(input[idx].encode(encoding)) > 2:
            for o in offset_list:
                if o['old'][0] > idx:
                    o['new'][0] -= 1
                if o['old'][1] > idx:
                    o['new'][1] -= 1
    return [o['new'] for o in offset_list]
This may be pretty inefficient though. I gladly welcome any performance improvements.

Translate Hexadecimal transformation from Oracle SQL into Java code

In searching for an answer, I used the solution provided in the following link : How to format a Java string with leading zero?
I have the following code that needs to be translated into java:
TRIM(TO_CHAR(123,'000X'))
From what I can tell, it translates the number into hexa and adds some leading zeros.
However, if I give a big value, I get ##### as answer, e.g. for the following code:
TRIM(TO_CHAR(999999,'000X'))
In Java, my current solution is the following:
String numberAsHex = Integer.toHexString(123);
System.out.println(("0000" + numberAsHex).substring(numberAsHex.length()));
It works fine for small numbers, but for big ones, like 999999, it outputs 423f. So it does the transformation, resulting in the value f423f, and then it cuts part of it off. So I am nowhere near the value from Oracle.
Any suggestion as to how to do this? Or why are ##### displayed in the Oracle case?
Instead of Integer.toHexString I would recommend using String.format because of its greater flexibility.
int number = ...;
String numberAsHex = String.format("%06x", number);
The 06 means you get 6 digits with leading zeros, x means you get lowercase hexadecimal.
Examples:
for number = 123 you get numberAsHex = "00007b"
for number = 999999 you get numberAsHex = "0f423f"
(As for the ##### in the Oracle case: TO_CHAR prints # signs when the value does not fit the format mask; '000X' leaves room for only four hex digits, and 999999 needs five.)
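Put together as a runnable sketch (the method name is just for illustration):

```java
public class HexFormat {
    // Java analogue of Oracle's TRIM(TO_CHAR(n, '00000X')): six hex digits
    // with leading zeros. %06x gives lowercase; use %06X for uppercase
    // to match Oracle's output exactly.
    static String toHex6(int n) {
        return String.format("%06x", n);
    }

    public static void main(String[] args) {
        System.out.println(toHex6(123));    // 00007b
        System.out.println(toHex6(999999)); // 0f423f
    }
}
```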

Function displaying wrong character code value

Background to my problem
Hi, I am just attempting to complete an exercise on Project Euler which states that I must read all names from a ".txt" file and add up the character codes for each character within each string. As I was doing the exercise I realized that the wrong character codes were being displayed.
This is the full details for my problem from project Euler
Using names.txt (right click and 'Save Link/Target As...'), a 46K text
file containing over five-thousand first names, begin by sorting it
into alphabetical order. Then working out the alphabetical value for
each name, multiply this value by its alphabetical position in the
list to obtain a name score.
For example, when the list is sorted into alphabetical order, COLIN,
which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the
list. So, COLIN would obtain a score of 938 × 53 = 49714.
What is the total of all the name scores in the file?
My Question
Why is my code displaying the value "67" for the character "C" when the actual character code value for "C" is 3? Thanks in advance.
private static int NameValue(string name)
{
    string StrimName = name.Substring(1, name.Length - 2); // name ---> COLIN
    Console.WriteLine(StrimName[0] + 0); // should print 3 because character code for "C" is 3, but result is 67...
    return 0;
}
It prints a number from the ASCII table: http://www.asciitable.com/
You should replace it with:
Console.WriteLine((StrimName[0] - 64) + 0);
to get what you want. It turns out you want to count 'A' as one, and its number in the ASCII table is 65, hence the subtraction of 64.
Every character has a number in the ASCII code.
The ASCII code for 'C' is 67; this is why you see 67.
You can see a table of ASCII codes here.
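For what it's worth, the same idea as a small Java sketch (the question's code is C#, but the logic is identical): the "alphabetical value" is the character code minus the code for 'A', plus one.

```java
public class NameScore {
    // Alphabetical value as Project Euler defines it: A=1, B=2, ..., Z=26.
    static int nameValue(String name) {
        int sum = 0;
        for (char c : name.toCharArray()) {
            sum += c - 'A' + 1;   // 'C' has char code 67, so 'C' - 'A' + 1 == 3
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(nameValue("COLIN"));       // 53
        System.out.println(938 * nameValue("COLIN")); // 49714
    }
}
```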

How to use minimum memory when analyzing data from a huge file

I have two huge files, and each file stores one integer value per line. The format looks like this:
File1:
3
4
11
30
0
...
File2:
13
43
11
40
9
...
I need to find a way to read these two files and find the values duplicated between them. In the example above, the value 11 would be printed since it appears in both files.
Reading from a file and looping over the values is easy. But the problem is that the number of lines in each file is far more than Integer.MAX_VALUE. So I can't read the whole files into memory, otherwise I will run out of memory. Is there an efficient way to solve this problem while consuming less memory?
EDIT1
I want to find a solution which does not read all the data into memory. It would be better to read part of a file, analyze it, then continue reading. But I don't know how to achieve this.
A naive approach to minimize memory would be to compare the files line by line ("for each line of file 1, for each line of file 2, compare the lines"), but that might take quite long, as the bottleneck is disk I/O. It's at one end of the spectrum: minimum memory, maximum duration.
At the other end would be to create huge heaps and operate entirely in memory: maximum memory, minimum duration.
So the best solution is something in between, balancing the tradeoffs, which usually requires some more upfront thinking.
analyze the input data
With two files only containing integer (32-bit) values, and both files containing >2^32 entries, you will certainly already have duplicates within the first file.
But to determine whether file 2 contains a duplicate, we only need to know whether the same value occurs at least once in the first file. That's just 1 bit of information we have to keep in memory, not the entire value.
Because the value range is limited to int, and for each possible value we have to know whether it occurs at least once, we need a bit set with 2^32 bits.
In a single long value we can store the information for 64 values of the file (a long has 64 bits). So we need a long array of size 67,108,864 to store the entire occurrence information of file 1 in memory. That's around 512 MB of memory.
After you have read this representation of file 1 into memory, you can scan file 2 line by line and check for each value whether it occurs in file 1, using the array we've created, and print it out if it's a duplicate (or write it to another file or into a data structure).
the smart version
In case you want to get your hands dirty, continue reading. If you'd rather use what comes out of the JDK box, use a BitSet of size Integer.MAX_VALUE (thanks #Mzf).
BitSet bs = new BitSet(Integer.MAX_VALUE);
bs.set(valueFromFile1);
boolean duplicate = bs.get(valueFromFile2);
the version for men with beards who run as root:
The structure of the lookup array is like
[ 0000 0001 0000 0000 ... , 0000 0000 1000 0000 ..., ...]
           ^                            ^
           7 (0*64 + 7)                 72 (1*64 + 8)
What you need to have is a conversion from int value to index position and bit-offset.
int pos(int value) {
    return value / 64;
}

long offsetMask(int value) {
    return 1L << (value % 64);
}

boolean exists(long[] index, int value) {
    return (index[pos(value)] & offsetMask(value)) != 0;
}

long[] index = new long[67108864];

//read reference file
Scanner sc1 = new Scanner(new File("file1"));
while (sc1.hasNextInt()) {
    int value = sc1.nextInt();
    index[pos(value)] |= offsetMask(value);
}

//find duplicates
Scanner sc2 = new Scanner(new File("file2"));
while (sc2.hasNextInt()) {
    int value = sc2.nextInt();
    if (exists(index, value)) {
        System.out.println("Duplicate: " + value);
    }
}
(it's basically the same as what the BitSet does)
It doesn't matter how much larger the files get; as long as the value range does not increase, you do not need more than 512 MB.
First sort each file smallest to largest. Now read the first line from each file and compare. If they match, record a duplicate, then advance to the next line in either file (I'd choose file A). If they don't match, advance to the next line in the file that had the smaller value. Repeat until you reach the end of one file.
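A sketch of that merge step in Java, assuming both files have already been sorted ascending; the iterators here stand in for line-by-line readers over the sorted files, so only the current value from each file is held in memory:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SortedMerge {
    // Merge-compare two ascending streams of integers, collecting values that
    // appear in both. In practice each Iterator would wrap a BufferedReader
    // over one of the pre-sorted files.
    static List<Integer> duplicates(Iterator<Integer> a, Iterator<Integer> b) {
        List<Integer> result = new ArrayList<>();
        if (!a.hasNext() || !b.hasNext()) return result;
        int x = a.next(), y = b.next();
        while (true) {
            if (x == y) {
                result.add(x);                       // match: record, advance both
                if (!a.hasNext() || !b.hasNext()) break;
                x = a.next();
                y = b.next();
            } else if (x < y) {
                if (!a.hasNext()) break;             // advance the smaller side
                x = a.next();
            } else {
                if (!b.hasNext()) break;
                y = b.next();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> f1 = Arrays.asList(0, 3, 4, 11, 30);
        List<Integer> f2 = Arrays.asList(9, 11, 13, 40, 43);
        System.out.println(duplicates(f1.iterator(), f2.iterator())); // [11]
    }
}
```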
First you will have to split your input into smaller files with ordered data.
These files should have fixed value ranges, so the first file would hold the distinct values in the range 0-1000000, the second 1000001-2000000, and so on.
If you do this for each input, you will end up with ordered "buckets" of values in given ranges. All you have to do now is compare those "buckets" against each other to get the duplicate values.
This trades disk space for memory usage.
This is how I would try to solve this at first glance.
There are several ways to solve this problem.
But to start with, you don't have to read an entire file into memory; you can read it line by line using FileUtils - see several examples here.
You said that the number of lines is bigger than Integer.MAX_VALUE.
So you can read the first file line by line and save the distinct numbers in a HashMap.
Now read the second file line by line; for each line, search for the number in the map - if it exists, print it.
The max amount of memory used is Integer.MAX_VALUE.
If you want to reduce the memory footprint you can use a bit map instead of the map: instead of using sizeof(int) * #distinct_numbers_in_first_file you use a constant size of Integer.MAX_VALUE bits.
If you know something about the input, you can choose the right solution for you.
In both cases, the memory used is limited and you never load an entire file into memory.
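A minimal sketch of that two-pass approach, using a HashSet for the distinct values; StringReader stands in for the real files here so the example is self-contained (with real data the Readers would be FileReaders over file1 and file2):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

public class LineByLineDuplicates {
    // Reads one integer per line without ever holding the whole file:
    // only the set of distinct values stays in memory.
    static Set<Integer> distinctValues(Reader source) {
        Set<Integer> seen = new HashSet<>();
        try (BufferedReader r = new BufferedReader(source)) {
            String line;
            while ((line = r.readLine()) != null) {
                seen.add(Integer.parseInt(line.trim()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Pass 1: remember every distinct value from the first file.
        Set<Integer> seen = distinctValues(new StringReader("3\n4\n11\n30\n0"));
        // Pass 2: stream the second file and report values seen before.
        for (String line : "13\n43\n11\n40\n9".split("\n")) {
            int value = Integer.parseInt(line);
            if (seen.contains(value)) {
                System.out.println("Duplicate: " + value); // prints: Duplicate: 11
            }
        }
    }
}
```

Note the memory caveat from the answer above still applies: a HashSet of boxed Integers costs far more per value than the bit-set approach, so this only works when the number of distinct values is manageable.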
