Normalize Text - Read Each Character and remove spaces - Bad Encoding


I am trying to write a program that normalizes my text: it collapses multiple spaces, copies the other characters from the original file, and adds spaces as well as start and end symbols.
After the conversion, when I write the txt file and open it, I see this content:
numa situaã § ã £ o de emergãªncia mã © dica
As you can see, there are some weird characters that I don't want. Could it be because of the encoding?
The text is in my language, Portuguese.
Here is my code. How can I fix it?
public static void main(String[] args) throws IOException {
Charset encoding = Charset.defaultCharset();
InputStream in = new FileInputStream(new File("data.txt"));
Reader reader = new InputStreamReader(in, encoding);
Reader buffer = new BufferedReader(reader);
StringBuilder normalizedLanguage = new StringBuilder("<");
int r;
while ((r = buffer.read()) != -1) {
char ch = (char) r;
boolean newline = false;
boolean hasLetterBefore = false;
boolean hasLetterAfter = false;
char symbol = '-';
int lines = 0;
if (newline)
{
normalizedLanguage.append("\n<");
}
if (ch == '\r' || ch == '\n' )
{
lines++;
normalizedLanguage.append(">");
newline = true;
hasLetterBefore = false;
}
else if (Character.isLetterOrDigit(ch))
{
if (hasLetterBefore == true)
{
normalizedLanguage.append(Character.toString(symbol) + Character.toString(Character.toLowerCase(ch)));
}else{
normalizedLanguage.append(Character.toString(Character.toLowerCase(ch)));
}
newline = false;
hasLetterBefore = true;
}
else if (ch == ' ')
{
normalizedLanguage.append(Character.toString(ch));
newline = false;
hasLetterBefore = false;
}
else if (ch == '\t')
{
System.out.println("Tab detected: " + ch);
newline = false;
hasLetterBefore = false;
}
else
{
// Symbols, among others
if (!hasLetterBefore)
{
normalizedLanguage.append(" " + Character.toString(ch) + " ");
}
else
{
symbol = ch;
}
newline = false;
}
}
String normalizedLanguageString = normalizedLanguage.toString().trim().replaceAll(" +", " ");
PrintWriter out = new PrintWriter("data_after.txt");
out.println(normalizedLanguageString);
out.close();
buffer.close();
reader.close();
in.close();
}
Thank you very much in advance ;)

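The problem was solved by using a different charset encoding :)
Presumably the file is saved as UTF-8 while the platform default charset is a different one, so multi-byte characters such as ç and ã were decoded as several separate characters.
Change this line:
Charset encoding = Charset.defaultCharset();
To:
Charset encoding = Charset.forName("UTF8");
Thank you very much anyway.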
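For reference, here is a minimal sketch of reading and writing with an explicit charset, so that the result no longer depends on the platform default. The class name Utf8Copy and the helper normalizeSpaces are made up for illustration and are much simpler than the normalization in the question; only the file names come from the question. Note that the writer gets UTF-8 as well, because new PrintWriter("data_after.txt") also falls back to the default charset.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

class Utf8Copy {
    // Hypothetical helper: trim, collapse runs of spaces, and lowercase the text.
    static String normalizeSpaces(String line) {
        return line.trim().replaceAll(" +", " ").toLowerCase(Locale.ROOT);
    }

    // Reads 'source' as UTF-8 and writes the normalized lines to 'target' as UTF-8.
    static void copyNormalized(Path source, Path target) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(normalizeSpaces(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // File names taken from the question.
        copyNormalized(Paths.get("data.txt"), Paths.get("data_after.txt"));
    }
}
```

StandardCharsets.UTF_8 is equivalent to Charset.forName("UTF8") but cannot fail at runtime because of a misspelled charset name.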

Related

Strange behavior while decoding hex characters to ASCII in Java

I wrote a Java program that takes the lines of a file and picks out those with a specific ID; part of each such line is then converted from HEX to ASCII characters. It worked great for a couple of files until it hit the HEX value "0D", which seems to be a carriage return (I have no idea what that does).
When it encounters that, it ends the output line (which it shouldn't do). I can't figure out what's happening.
This is the code, which compiles with no errors. I've attached a picture with the result.
File 1 contains the characters up to ID=xxx:LENGHT=8, followed by the 8 HEX characters that need to be converted. The program then converts them and adds the text on the same line. I need them on the same line to figure out the pattern.
import java.io.*;
import java.util.Scanner;
import java.io.FileWriter;
import java.io.IOException;
public class FrameDecoder {
public static void main(String[] args) throws IOException {
try {
// Filter the frames with the target id
File fisierSursa = new File("file1.txt"); // The original file
FileWriter fisierData = new FileWriter("file2.txt"); // File with the frames that have the searched id
FileWriter fisierTranzit = new FileWriter("file3.txt"); // File with HEX characters, which will be deleted
Scanner citireSursa = new Scanner(fisierSursa);
while (citireSursa.hasNextLine()){
String data = citireSursa.nextLine();
//System.out.println("data = " + data);
int intIndex = data.indexOf("ID=289"); // the id you are looking for
int intIndex2 = data.indexOf("ID=1313"); // the second id you are looking for
if (intIndex != -1 || intIndex2 != -1){
char[] text = data.toCharArray();
int counter = 0;
for (int i=0; i<text.length; i++){
if (text[i] == ':' && counter < 5){
counter++;
}
if (text[i] == ':' && counter == 5){
fisierTranzit.write(text[i+1]);
fisierTranzit.write(text[i+2]);
}
}
fisierTranzit.write("\r\n");
fisierData.write(data + "\r\n");
}
}
citireSursa.close();
fisierTranzit.close();
fisierData.close();
// Convert HEX to ASCII
FileWriter fisierAscii = new FileWriter("file4.txt"); // File that will contain the decoded ASCII characters
File fisierTranzitRedeschis = new File("file3.txt"); // Reopen the transit file so we can read from it
Scanner citireTranzit = new Scanner(fisierTranzitRedeschis);
while (citireTranzit.hasNextLine()){
String data2 = citireTranzit.nextLine();
System.out.println("data2 = " + data2);
if (data2.length() % 2 != 0){
System.err.println("Invalid hex string!");
return;
}
StringBuilder builder = new StringBuilder();
for (int i=0; i<data2.length(); i=i+2){
// Split the string into groups of two characters
String s = data2.substring(i, i+2);
// Convert each group to an integer using the valueOf method
int n = Integer.valueOf(s, 16);
// Convert the integer value to a char
builder.append((char)n);
}
fisierAscii.write(builder.toString() + "\r\n");
//System.out.println(builder.toString());
}
citireTranzit.close();
fisierAscii.close();
// Delete file 3
File stergereFisier3 = new File("file3.txt");
if(stergereFisier3.delete()){
System.out.println("File 3 deleted successfully");
}else{
System.out.println("Failed to delete file 3");
}
// Merge the files
PrintWriter fisierFinal = new PrintWriter("file5.txt");
BufferedReader br1 = new BufferedReader(new FileReader("file2.txt"));
BufferedReader br2 = new BufferedReader(new FileReader("file4.txt"));
String line1 = br1.readLine();
String line2 = br2.readLine();
//loop to copy lines
//of file1.txt and file2.txt
//to file3.txt alternatively
while (line1 != null || line2 !=null){
if(line1 != null){
fisierFinal.print(line1 + " ");
line1 = br1.readLine();
}
if (line2 != null){
fisierFinal.println(line2 );
line2 = br2.readLine();
}
}
fisierFinal.flush();
//closing resources
br1.close();
br2.close();
fisierFinal.close();
System.out.println("Merged files succesfully");
// Delete files 2 and 4
File stergereFisier2 = new File("file2.txt");
File stergereFisier4 = new File("file4.txt");
if(stergereFisier2.delete() && stergereFisier4.delete()){
System.out.println("Files 2 and 4 deleted successfully");
}else{
System.out.println("Failed to delete files 2 and 4");
}
}catch (FileNotFoundException e){
System.out.println("An error occurred.");
e.printStackTrace();
}catch (IOException e){
System.out.println("No data to print");
e.printStackTrace();
}
}
}
Edit: I've cheated a little and added a condition when writing the HEX characters: if it encounters 0D, it just replaces it with 00. It worked. I'll also try your method; that one seems better than mine.
for (int i=0; i<text.length; i++){
if (text[i] == ':' && counter < 5){
counter++;
}
if (text[i] == ':' && counter == 5){
if(text[i+1] == '0' && text[i+2] == 'D'){
fisierTranzit.write('0');
fisierTranzit.write('0');
}
else{
fisierTranzit.write(text[i+1]);
fisierTranzit.write(text[i+2]);
}
}
}

The carriage return character \r (hex 0D) is one of the standard line separator characters, and Scanner.hasNextLine() and nextLine() methods assume it must terminate the current line.
To get more control, set the delimiter for Scanner to just the line feed character \n and use hasNext/next methods instead of hasNextLine/nextLine methods. For example:
Scanner citireTranzit = new Scanner(fisierTranzitRedeschis);
citireTranzit.useDelimiter("\n");
while (citireTranzit.hasNext()){
String data2 = citireTranzit.next();
...
}
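To see the difference without any files, here is a small self-contained sketch that runs both approaches over the same in-memory string. The class and method names (DelimiterDemo, splitOnLineFeed, splitWithNextLine) are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Splits the input on line feeds only, so a lone '\r' stays inside the token.
    static List<String> splitOnLineFeed(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter("\n");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    // Default behaviour for comparison: nextLine() also stops at a lone '\r'.
    static List<String> splitWithNextLine(String input) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String input = "AB\rCD\nEF\n"; // the first line contains a decoded 0x0D
        System.out.println("delimiter \\n: " + splitOnLineFeed(input).size() + " tokens");
        System.out.println("nextLine():   " + splitWithNextLine(input).size() + " lines");
    }
}
```

One thing to keep in mind: if the file itself uses Windows line endings (\r\n), every token returned by next() keeps a trailing \r, which may need to be stripped separately.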

How can I use the splitter ^ in Java

I have a problem with my Java program. I have to read lines from a file. The lines have this form:
1#the^cat#the^dog#the^bird#^fish#bear
2#the^cat#the^dog#the^bird#^fish#bear
and print everything except the "#" and "^" in text fields in my GUI. The "^" must appear only when there is no article. For example, ^fish has to be printed as ^fish, but the^dog has to be printed as the dog.
So far I can read the lines and print them in the text fields, but I can't find a way to skip the "^" between the words.
Here is my code:
try {
FileReader file = new FileReader("C:\\Guide.txt");
BufferedReader BR = new BufferedReader(file);
boolean eof = false;
int i=0;
while (!eof) {
String line = BR.readLine();
if (line == null)
eof = true;
else {
i++;
System.out.println("Parsing line "+i+" <"+line+">");
String[] words = line.split("#");
if (words.length != 7) continue;
number=words[0];
onomastiki=words[1];
geniki=words[2];
aitiatiki=words[3];
klitiki=words[4];
genos=words[5];
Region=words[6];
E = new CityEntry(number,onomastiki,geniki,
aitiatiki,klitiki,
genos,Region);
Cities.add(E);
}

You can try something like this.
FileReader file = new FileReader("C:\\Users\\aq104e\\Desktop\\text");
BufferedReader BR = new BufferedReader(file);
boolean eof = false;
int i = 0;
while (!eof) {
String line = BR.readLine();
if (line == null)
eof = true;
else {
i++;
System.out.println("Parsing line " + i + " <" + line + ">");
String[] words = line.split("#");
for (int j = 0; j < words.length; j++) {
if(words[j].contains("^")) {
if(words[j].indexOf("^") == 0) {
// write your code here
//This is case for ^fish
}else {
// split using ^ and do further manipulations
}
}
}
}
}
Let me know if this works for you.

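That will work, but it is not the best way. Something shorter:
for (int j = 0; j < words.length; j++) {
if (words[j].contains("the")) {
// Strings are immutable, so the result of replace() has to be stored
words[j] = words[j].replace("^", " ");
}
}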
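Putting the two ideas together, a complete version could look like the sketch below. It assumes the rule from the question: a field that starts with "^" has no article and is kept unchanged, while in any other field the "^" is replaced by a space. The names CaretSplit and parseFields are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CaretSplit {
    // Splits a record on '#'. A field starting with '^' (no article) is kept as-is;
    // otherwise the '^' between article and noun becomes a space.
    static List<String> parseFields(String line) {
        List<String> fields = new ArrayList<>();
        for (String field : line.split("#")) {
            if (field.startsWith("^")) {
                fields.add(field);
            } else {
                fields.add(field.replace('^', ' '));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints [1, the cat, the dog, the bird, ^fish, bear]
        System.out.println(parseFields("1#the^cat#the^dog#the^bird#^fish#bear"));
    }
}
```

Checking startsWith("^") instead of contains("the") avoids depending on which article is used.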

Reading ascii file line by line - Java

I am trying to read an ASCII file and find the position of each newline character "\n", so that I know which and how many characters I have in every line. The file size is 538 MB. When I run the code below, it never prints anything.
I searched a lot, but I didn't find anything for ASCII files. I use NetBeans and Java 8. Any ideas?
Below is my code.
String inputFile = "C:\\myfile.txt";
FileInputStream in = new FileInputStream(inputFile);
FileChannel ch = in.getChannel();
int BUFSIZE = 512;
ByteBuffer buf = ByteBuffer.allocateDirect(BUFSIZE);
Charset cs = Charset.forName("ASCII");
while ( (rd = ch.read( buf )) != -1 ) {
buf.rewind();
CharBuffer chbuf = cs.decode(buf);
for ( int i = 0; i < chbuf.length(); i++ ) {
if (chbuf.get() == '\n'){
System.out.println("PRINT SOMETHING");
}
}
}

Method to store the contents of a file to a string:
static String readFile(String path, Charset encoding) throws IOException
{
byte[] encoded = Files.readAllBytes(Paths.get(path));
return new String(encoded, encoding);
}
Here's a way to find the occurrences of a character in the entire string:
public static void main(String [] args) throws IOException
{
List<Integer> indexes = new ArrayList<Integer>();
String content = readFile("filetest", StandardCharsets.UTF_8);
int index = content.indexOf('\n');
while (index >= 0)
{
indexes.add(index);
index = content.indexOf('\n', index + 1);
}
}

The number of characters in a line is the length of the string read by a readLine call:
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
int iLine = 0;
String line;
while ((line = br.readLine()) != null) {
System.out.println( "Line " + iLine + " has " +
line.length() + " characters." );
iLine++;
}
} catch( IOException ioe ){
// ...
}
Note that the (system-dependent) line end marker has been stripped from the string by readLine.
If a very large file contains no newlines, it is indeed possible to run out of memory. Reading character by character will avoid this.
File file = new File( "Z.java" );
Reader reader = new FileReader(file);
int len = 0;
int c;
int iLine = 0;
while( (c = reader.read()) != -1) {
if( c == '\n' ){
iLine++;
System.out.println( "line " + iLine + " contains " +
len + " characters" );
len = 0;
} else {
len++;
}
}
reader.close();
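A slightly extended sketch of the same character-by-character idea follows. It makes two assumptions that go beyond the code above: a '\r' before the '\n' should not be counted, and a last line without a trailing newline should still be reported. The names LineLengths and lineLengths are invented for the example, and the String overload exists only to make it easy to try out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineLengths {
    // Counts characters per line without ever holding a whole line in memory.
    // '\r' is not counted; a final line without a trailing '\n' is still reported.
    static List<Integer> lineLengths(Reader reader) throws IOException {
        List<Integer> lengths = new ArrayList<>();
        int len = 0;
        boolean pending = false; // true if anything was read since the last '\n'
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n') {
                lengths.add(len);
                len = 0;
                pending = false;
            } else {
                if (c != '\r') {
                    len++;
                }
                pending = true;
            }
        }
        if (pending) {
            lengths.add(len);
        }
        return lengths;
    }

    // Convenience overload for trying the method on an in-memory string.
    static List<Integer> lineLengths(String text) {
        try (Reader reader = new StringReader(text)) {
            return lineLengths(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints [2, 3, 1]
        System.out.println(lineLengths("ab\r\ncde\nf"));
    }
}
```

For a real file, wrap the FileReader in a BufferedReader before passing it in; calling read() on an unbuffered reader for every single character is very slow on a file of several hundred megabytes.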

You should use FileReader, which is a convenience class for reading character files.
The FileInputStream Javadoc clearly states:
FileInputStream is meant for reading streams of raw bytes such as
image data. For reading streams of characters, consider using
FileReader.
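Try the code below. Note that readLine() strips the line terminator, so within each returned line the newline would have been at index line.length():
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line;
while ((line = br.readLine()) != null) {
// searching the line for "\n" would never match, so report its length instead
System.out.println("\\n at " + line.length());
}
}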

Java: reading utf-8 file page by page using FileInputStream

I need some code that will allow me to read one page at a time from a UTF-8 file.
I've used this code:
File fileDir = new File("DIRECTORY OF FILE");
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(fileDir), "UTF8"));
String str;
while ((str = in.readLine()) != null) {
System.out.println(str);
}
in.close();
}
After surrounding it with a try-catch block, it runs but outputs the entire file!
Is there a way to amend this code to display just ONE PAGE of text at a time?
The file is in UTF-8 format, and after viewing it in Notepad++, I can see that the file contains FF characters to denote the next page.

You will need to look for the form feed character by comparing to 0x0C.
For example:
int c = in.read(); // read() returns an int so that -1 can signal end of stream
while ( c != -1 ) {
if ( c == 0x0C ) {
// form feed
} else {
// handle displayable character
}
c = in.read();
}
Edit: added an example of using a Scanner, as suggested by Boris:
Scanner s = new Scanner(new File("a.txt")).useDelimiter("\u000C");
while ( s.hasNext() ) {
String str = s.next();
System.out.println( str );
}
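Building on the Scanner variant, here is a small sketch that returns just one page instead of printing all of them. PageReader and readPage are hypothetical names, pages are counted from 0, and the demo uses an in-memory string instead of the real file.

```java
import java.io.StringReader;
import java.util.Scanner;

class PageReader {
    // Returns the page with the given zero-based index, or null if there are fewer pages.
    // Pages are assumed to be separated by a single form feed character (0x0C).
    static String readPage(Readable source, int pageIndex) {
        try (Scanner scanner = new Scanner(source)) {
            scanner.useDelimiter("\f");
            for (int i = 0; scanner.hasNext(); i++) {
                String page = scanner.next();
                if (i == pageIndex) {
                    return page;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "page one\fpage two\fpage three";
        System.out.println(readPage(new StringReader(text), 1)); // prints page two
    }
}
```

For the UTF-8 file from the question, the source could be new InputStreamReader(new FileInputStream(fileDir), StandardCharsets.UTF_8), so that the charset is set explicitly.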

If the file is valid UTF-8, that is, the pages are split by U+00FF, aka (char) 0xFF, aka "\u00FF", 'ÿ', then a buffered reader will do. If it is a raw byte 0xFF, there is a problem: the byte 0xFF never occurs in valid UTF-8, so the decoder treats it as malformed input.
int soughtPageno = ...; // Counted from 0
int currentPageno = 0;
try (BufferedReader in = new BufferedReader(new InputStreamReader(
new FileInputStream(fileDir), StandardCharsets.UTF_8))) {
String str;
while ((str = in.readLine()) != null && currentPageno <= soughtPageno) {
int pos;
while ((pos = str.indexOf('\u00FF')) >= 0) {
if (currentPageno == soughtPageno) {
// the sought page ends at this separator
System.out.println(str.substring(0, pos));
++currentPageno;
break;
}
++currentPageno;
str = str.substring(pos + 1);
}
if (currentPageno == soughtPageno) {
System.out.println(str);
}
}
}
For a byte 0xFF (wrong, hacked UTF-8) use a wrapping InputStream between FileInputStream and the reader:
class PageInputStream extends InputStream {
InputStream in;
int pageno = 0;
boolean eof = false;
PageInputStream(InputStream in, int pageno) {
this.in = in;
this.pageno = pageno;
}
@Override
public int read() throws IOException {
if (eof) {
return -1;
}
// skip everything up to the start of the requested page
while (pageno > 0) {
int c = in.read();
if (c == 0xFF) {
--pageno;
} else if (c == -1) {
eof = true;
in.close();
return -1;
}
}
int c = in.read();
if (c == 0xFF) {
// the next separator ends the requested page
c = -1;
eof = true;
in.close();
}
return c;
}
}
Take this as an example, a bit more work is to be done.

You can use a Regex to detect form-feed (page break) characters. Try something like this:
File fileDir = new File("DIRECTORY OF FILE");
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(fileDir), "UTF8"));
String str;
Regex pageBreak = new Regex("(^.*)(\f)(.*$)")
while ((str = in.readLine()) != null) {
Match match = pageBreak.Match(str);
bool pageBreakFound = match.Success;
if(pageBreakFound){
String textBeforeLineBreak = match.Groups[1].Value;
//Group[2] will contain the form feed character
//Group[3] will contain the text after the form feed character
//Do whatever logic you want now that you know you hit a page boundary
}
System.out.println(str);
}
in.close();
The parentheses around portions of the regex denote capture groups, which get recorded in the Match object. The \f matches the form feed character.
Edit: Apologies, for some reason I read C# instead of Java, but the core concept is the same. Here's the regex documentation for Java: http://docs.oracle.com/javase/tutorial/essential/regex/
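For completeness, a rough Java counterpart of the same idea using java.util.regex is sketched below. PageBreakRegex and splitAtPageBreak are made-up names, and the pattern is simplified to two capture groups (the text before and the text after the form feed).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageBreakRegex {
    // Group 1: text before the form feed, group 2: text after it.
    private static final Pattern PAGE_BREAK = Pattern.compile("^(.*)\\f(.*)$");

    // Returns {textBefore, textAfter} if the line contains a form feed, otherwise null.
    static String[] splitAtPageBreak(String line) {
        Matcher m = PAGE_BREAK.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = splitAtPageBreak("end of page\fstart of next");
        System.out.println(parts[0]); // prints end of page
        System.out.println(parts[1]); // prints start of next
    }
}
```

Because the first group is greedy, a line with several form feeds is split at the last one; for such lines, String.split("\f") may be the simpler tool.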

Java ignores EOF while reading chars from file

I am trying to read a file char by char. Unfortunately, Java seems to ignore EOF while reading chars from the file.
FileReader fileReader = new FileReader(fileText);
char c;
String word = "";
List<String> words = new ArrayList<String>();
while ((c = (char) fileReader.read()) != -1) {
System.out.println(c);
if (c != ' ') {
word = word + c;
}
else {
words.add(word + " ");
word = "";
}
}
It should stop after the file has been read, but instead it never stops running.

In Java, char is unsigned and cannot equal -1. You should do the comparison before you do the cast.
int ch;
while ((ch = fileReader.read()) != -1) {
char c = (char)ch;
System.out.println(c);
...
}

This happens because char cannot be equal to -1, even if you assign -1 to it:
char c = (char)-1;
System.out.println(c == -1); // prints false
Make c an int, and cast it to char only when you concatenate:
word = word + (char)c;
Better yet, use StringBuilder to build strings at runtime: otherwise, you create lots of temporary string objects in a loop, and these objects get thrown away.
StringBuilder word = new StringBuilder();
List<String> words = new ArrayList<String>();
int c;
while ((c = fileReader.read()) != -1) {
System.out.println((char)c);
word.append((char)c);
if (c == ' ') {
words.add(word.toString());
word = new StringBuilder();
}
}
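Here is a runnable sketch of the same approach that can be tried without a file. Compared to the snippet above it makes two choices of its own: the space is not stored with the word, and the last word is added even though no space follows it. WordReader and readWords are names invented for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class WordReader {
    // Reads space-separated words; the loop ends when read() returns -1.
    static List<String> readWords(Reader reader) throws IOException {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int c; // must stay an int: after a cast to char, -1 becomes 65535
        while ((c = reader.read()) != -1) {
            if (c == ' ') {
                if (word.length() > 0) {
                    words.add(word.toString());
                    word.setLength(0);
                }
            } else {
                word.append((char) c);
            }
        }
        if (word.length() > 0) {
            words.add(word.toString()); // the last word has no trailing space
        }
        return words;
    }

    // Convenience overload for in-memory text.
    static List<String> readWords(String text) {
        try (Reader reader = new StringReader(text)) {
            return readWords(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readWords("read each  word")); // prints [read, each, word]
        System.out.println((int) (char) -1);              // prints 65535
    }
}
```

With a file, the call would be readWords(new BufferedReader(new FileReader(fileText))).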

Try the code below:
public static void main(String[] args) throws IOException {
FileReader fileReader = new FileReader(fileLocation);
int c;
String word = "";
List<String> words = new ArrayList<String>();
while ((c = (int) fileReader.read()) != -1) {
System.out.println((char)c);
char ch = (char)c;
if (ch != ' ') {
word = word + ch;
} else {
words.add(word + " ");
word = "";
}
}
System.out.println(word);
}
