How can I check whether a line contains a special character? - java

Hi, I have a file stored on a Linux system that contains the special character ^C.
Something like this:
ABCDEF^CIJKLMN
Now I need to read this file in Java and detect this ^C so I can split on it.
The problem is that, to see the special character ^C when reading the file on UNIX, I must use cat -v fileName; otherwise it is invisible.
This is my sample code.
InputStreamReader inputStreamReader = new InputStreamReader(new FileInputStream(this),
        Charset.forName("UTF-8"));
BufferedReader br = new BufferedReader(inputStreamReader);
String line;
while ((line = br.readLine()) != null) {
    if (line.contains("^C")) {
        String[] split = line.split("\\" + sepRecord);
        System.out.println(split);
    }
}

You are checking whether the line contains the two-character String "^C", not the single control character (which corresponds to 0x03, or \u0003). You should search for the character 0x03 instead. Here's a code example that works in your case:
byte[] fileContent = new byte[] {'A', 0x03, 'B'};
String fileContentStr = new String (fileContent);
System.out.println (fileContentStr.contains ("^C")); // false
System.out.println (fileContentStr.contains (String.valueOf ((char) 0x03))); // true
System.out.println (fileContentStr.contains ("\u0003")); // true, thanks to @Thomas Fritsch for pointing this out
String[] split = fileContentStr.split ("\u0003");
System.out.println (split.length); // 2
System.out.println (split[0]); // A
System.out.println (split[1]); // B
The ^C character is displayed in caret notation but must be interpreted as a single character.
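Applied to the original file-reading loop, the fix could look like the following sketch. The file content is simulated here with a StringReader so the example is self-contained; in the real code you would read from the file instead:

```java
import java.io.BufferedReader;
import java.io.StringReader;

public class CtrlCSplit {
    public static void main(String[] args) throws Exception {
        // Simulated file content; in practice: Files.newBufferedReader(path)
        BufferedReader br = new BufferedReader(new StringReader("ABCDEF\u0003IJKLMN"));
        String line;
        while ((line = br.readLine()) != null) {
            if (line.indexOf('\u0003') >= 0) {         // look for the real 0x03 character
                String[] split = line.split("\u0003"); // no regex escaping needed for 0x03
                System.out.println(String.join(" | ", split)); // ABCDEF | IJKLMN
            }
        }
    }
}
```

Note that readLine() does not treat 0x03 as a line terminator, so the whole record arrives in one line and the split works as expected.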

Related

Incorrect printing of non-English characters with Java

I thought this was only an issue with Python 2, but I have now run into a similar issue with Java (Windows 10, JDK 8).
My searches have led to little resolution so far.
I read this value from the 'stdin' input stream: Viļāni. When I print it to the console I get this: Vi????ni.
Relevant code snippets are as follows:
BufferedReader in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
ArrayList<String> corpus = new ArrayList<String>();
String inputString = null;
while ((inputString = in.readLine()) != null) {
    corpus.add(inputString);
}
String[] allCorpus = new String[corpus.size()];
allCorpus = corpus.toArray(allCorpus);
for (String line : allCorpus) {
    System.out.println(line);
}
Further expansion on my problem as follows:
I read a file containing the following 2 lines:
を
Sōten_Kōro
When I read this from disk and output to a second file I get the following output:
ã‚’
S�ten_K�ro
When I read the file from stdin using cat testinput.txt | java UTF8Tester I get the following output:
???
S??ten_K??ro
Both are obviously wrong. I need to be able to print the correct characters to console and file. My sample code is as follows:
public class UTF8Tester {
    public static void main(String args[]) throws Exception {
        BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String[] stdinData = readLines(stdinReader);
        printToFile(stdinData, "stdin_out.txt");

        BufferedReader fileReader = new BufferedReader(new FileReader("testinput.txt"));
        String[] fileData = readLines(fileReader);
        printToFile(fileData, "file_out.txt");
    }

    private static void printToFile(String[] data, String fileName)
            throws FileNotFoundException, UnsupportedEncodingException {
        PrintWriter writer = new PrintWriter(fileName, "UTF-8");
        for (String line : data) {
            writer.println(line);
        }
        writer.close();
    }

    private static String[] readLines(BufferedReader reader) throws IOException {
        ArrayList<String> corpus = new ArrayList<String>();
        String inputString = null;
        while ((inputString = reader.readLine()) != null) {
            corpus.add(inputString);
        }
        String[] allCorpus = new String[corpus.size()];
        return corpus.toArray(allCorpus);
    }
}
Really stuck here and help would really be appreciated! Thanks in advance. Paul
System.in and System.out use the default Windows character set.
A Java String uses Unicode internally.
FileReader/FileWriter are old utility classes that use the default character set, so they are suitable only for non-portable local files.
The garbled output you saw is a special character encoded as a two-byte UTF-8 sequence, where each byte was then interpreted in the default single-byte encoding; since those byte values have no mapping there, each became a ? substitution, hence two ? per character.
What is required is that the character can be entered on System.in in the default charset; the String is then converted from the default charset, and writing it to a file in UTF-8 requires specifying UTF-8 explicitly.
Hence:
BufferedReader stdinReader = new BufferedReader(new InputStreamReader(System.in));
String[] stdinData = readLines(stdinReader);
printToFile(stdinData, "stdin_out.txt");

// Reading a UTF-8 file (UTF-8 is the default for Files.readAllLines):
List<String> utf8Lines = Files.readAllLines(Paths.get("testinput-utf8.txt"));

// Reading a Windows-1252 file requires naming the charset:
List<String> latinLines = Files.readAllLines(Paths.get("testinput-winlatin1.txt"),
        Charset.forName("windows-1252"));

// Writing lines as UTF-8:
Files.write(Paths.get("file_out.txt"), latinLines, StandardCharsets.UTF_8);
To check whether your current computer system handles Japanese:
System.out.println("Hiragana letter Wo '\u3092'."); // Either を or ?.
If you see ?, the conversion to the default system encoding could not represent the character.
を is U+3092, which can be written in ASCII source as the u-escape \u3092.
To create an UTF-8 text under Windows:
Files.write(Paths.get("out-utf8.txt"),
"\uFEFFHiragana letter Wo '\u3092'.".getBytes(StandardCharsets.UTF_8));
Here I use an ugly (generally unneeded) BOM marker char \uFEFF (a zero-width no-break space) that lets Windows Notepad recognize the text as UTF-8.
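If the console itself is the culprit, one workaround is to wrap System.out in a UTF-8 PrintStream. This is a sketch, not a guaranteed fix: the terminal's code page and font must also support the characters (e.g. chcp 65001 on Windows):

```java
import java.io.PrintStream;

public class Utf8Console {
    public static void main(String[] args) throws Exception {
        // Replace the default stdout encoding with UTF-8 (autoflush enabled).
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("Hiragana letter Wo '\u3092'."); // を
        utf8Out.println("Vi\u013C\u0101ni");             // Viļāni
    }
}
```

The bytes sent to the console are now always UTF-8, independent of the platform default charset.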

BufferedReader doesn't read the whole line at a time?

I'm using BufferedReader to read a text file line by line with BufferedReader.readLine(), but suddenly it doesn't read the whole line; instead it reads only the first part.
Example: if the first line in the text file is:
[98.0,20.0,-65.0] [103.0,20.0,-70.0] 5.0 [98.0,20.0,-70.0] ccw
And my code is:
BufferedReader br = new BufferedReader(new FileReader("path" + "arcs.txt"));
String line = br.readLine();
System.out.println(line);
The output will be:
[98.0,20.0,-65.0]
Why is this happening?
The readLine() method of BufferedReader reads a string until it reaches a line separator such as \n or \r. Your text file must contain one of these characters after [98.0,20.0,-65.0].
The BufferedReader works internally as follows:
The characters from the InputStream or file are buffered until a \n (line feed) or \r (carriage return) occurs; until one of these two characters appears, readLine() keeps blocking. The exception to this is when the input stream is closed. Therefore the only reasonable explanation for it returning early is that there is some kind of line-ending character after the first part of the line. To make sure you catch everything in the file you can add a surrounding while loop.
Example:
// Please ignore the fact that I am using the System.in stream in this example.
try (BufferedReader br = new BufferedReader(new InputStreamReader(System.in))) {
    String line;
    StringBuilder sb = new StringBuilder();
    // Read lines until null (EoF / end of file / end of stream)
    while ((line = br.readLine()) != null) {
        // Append the line we just read to the StringBuilder's buffer
        sb.append(line);
    }
    // Print the StringBuilder's buffer as a String.
    System.out.println(sb.toString());
} catch (IOException e) {
    e.printStackTrace();
    // Exception handling in general...
}

Buffered Reader read text until character

I am using a BufferedReader to read a file filled with lines of information. Some of the longer lines of text extend over more than one physical line, so the BufferedReader treats them as separate lines. Each logical line ends with a ';'. So I was wondering whether there is a way to make the BufferedReader read until it reaches the ';' and then return the whole logical line as a string. Here is how I am using the BufferedReader so far:
String currentLine;
while ((currentLine = reader.readLine()) != null) {
    // trim newline when comparing with lineToRemove
    String[] line = currentLine.split(" ");
    String fir = line[1];
    String las = line[2];
    for (int c = 0; c < players.size(); c++) {
        if (players.get(c).getFirst().equals(fir) && players.get(c).getLast().equals(las)) {
            System.out.println(fir + " " + las);
            String text2 = currentLine.replaceAll("[.*?]", ".150");
            writer.write(text2 + System.getProperty("line.separator"));
        }
    }
}
It would be much easier to do with a Scanner, where you can just set the delimiter:
Scanner scan = new Scanner(new File("/path/to/file.txt"));
scan.useDelimiter(Pattern.compile(";"));
while (scan.hasNext()) {
    String logicalLine = scan.next();
    // rest of your logic
}
To answer your question directly: it is not possible. BufferedReader cannot scan the stream ahead to find this character and then return everything before it.
When you read from a stream with a BufferedReader you consume characters, and you cannot know what a character is without reading it.
You could use the inherited read() method to read a single character at a time and stop when you detect the desired character. Granted, this is not a good thing to do, because it somewhat contradicts the purpose of BufferedReader.
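A sketch of that character-by-character approach: a small helper (the name readUntil is made up for this example) that consumes characters until the delimiter and returns the accumulated record:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class ReadUntilDemo {
    // Reads characters up to (but not including) the delimiter, or to end of stream.
    // Returns null once the stream is exhausted.
    static String readUntil(BufferedReader reader, char delimiter) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (c == delimiter) {
                return sb.toString();
            }
            sb.append((char) c);
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        // A logical line that spans a physical newline, as in the question.
        BufferedReader br = new BufferedReader(new StringReader("first record;second\nrecord;"));
        String record;
        while ((record = readUntil(br, ';')) != null) {
            // The embedded newline is noise here, so flatten it.
            System.out.println(record.replace("\n", " ").trim());
        }
    }
}
```

Because BufferedReader buffers the underlying stream, the per-character read() calls are still cheap; what is lost is only the convenience of readLine().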

readLine() reads only the first line

I need to read a text file, split each line on a common piece of text, and print part of the split text. This works, but only for the first line of the file.
However, if I print the lines without the split part, everything prints correctly. What am I doing wrong?
File sample: (I want to split by "words")
Line 1 This text is of length: 7 words. I need to learn how to program.
Line 2 Now we have text of length: 3 words. No matter what the words are, I must program
FileInputStream fis = new FileInputStream(fin);
// Construct BufferedReader from InputStreamReader
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
String line = null;
ArrayList<String> txt1 = new ArrayList<>();
while ((line = br.readLine()) != null) {
    String[] pair = line.split("words");
    txt1.add(pair[1]);
    System.out.println(txt1);
    //System.out.println(line);
}
br.close();
Given the following unit test, which simulates the file input via a StringInputStream:
@Test
public void test() throws IOException {
    String fileContent = "This text is of length: 7 words.\r\nI need to learn how to program and one day.";
    StringInputStream stream = new StringInputStream(fileContent);
    BufferedReader br = new BufferedReader(new InputStreamReader(stream));
    String line = null;
    ArrayList<String> txt1 = new ArrayList<>();
    while ((line = br.readLine()) != null) {
        String[] pair = line.split("words");
        txt1.add(pair[1]);
        System.out.println(txt1);
        //System.out.println(line);
    }
}
The first line will be split into
"This text is of length: 7 " and
"."
Since you put item [1] into the array list, it'll just contain a single dot.
The second line will be split into
"I need to learn how to program and one day."
only. There is no second item, so accessing [1] results in an ArrayIndexOutOfBoundsException.
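A minimal corrected loop, sketched with a StringReader standing in for the file: guard the index before using it, and use the two-argument split if you only want to split at the first occurrence of "words":

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;

public class SplitGuardDemo {
    public static void main(String[] args) throws Exception {
        String fileContent = "This text is of length: 7 words. I need to learn how to program.\n"
                + "Now we have text of length: 3 words. No matter what the words are, I must program";
        BufferedReader br = new BufferedReader(new StringReader(fileContent));
        ArrayList<String> txt1 = new ArrayList<>();
        String line;
        while ((line = br.readLine()) != null) {
            // limit = 2: split only at the first "words", keep the rest intact
            String[] pair = line.split("words", 2);
            if (pair.length > 1) { // only lines that actually contain "words"
                txt1.add(pair[1]);
            }
        }
        System.out.println(txt1); // both fragments, no ArrayIndexOutOfBoundsException
    }
}
```

The length check is what prevents the exception; the limit argument additionally avoids surprises when "words" occurs more than once in a line.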

Reading characters like ö and ü from a file in Eclipse

I have an input file which contains some words like bört and übuk. When I read a line with the following code I get these strange results. How can I solve this?
String line = bufferedReader.readLine();
if (line == null) { break; }
String[] words = line.split("\\W+");
for (String word : words) {
    System.out.println(word);
}
The output is:
b
rt
and
buk
Try creating a BufferedReader that handles the UTF-8 character encoding:
FileInputStream fis = new FileInputStream(new File("someFile.txt"));
InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
BufferedReader bufferedReader = new BufferedReader(isr);
It seems that your problem is that the standard character class \\W is the negation of \\w, which represents only the characters [a-zA-Z0-9_], so split("\\W+") splits on every character outside this class, such as ö and ü in your case.
To solve this problem and include Unicode characters as well, you can compile your regex with the Pattern.UNICODE_CHARACTER_CLASS flag, which enables the Unicode version of the predefined and POSIX character classes. To use this flag you can add (?U) at the start of the regex:
String[] words = line.split("(?U)\\W+");
Demo:
String line = "bört and übuk";
String[] words = line.split("(?U)\\W+");
for (String word : words)
System.out.println(word);
Output:
bört
and
übuk
You need something like this:
BufferedReader bufferReader = new BufferedReader(
        new InputStreamReader(new FileInputStream(fileDir), "UTF-8"));
Instead of UTF-8 you can put whatever encoding you need to support while reading the file.
