Strange behaviour of String.length() - java

I have a class with main:
public class Main {
// args[0] - is path to file with first and last words
// args[1] - is path to file with dictionary
public static void main(String[] args) {
try {
List<String> firstLastWords = FileParser.getWords(args[0]);
System.out.println(firstLastWords);
System.out.println(firstLastWords.get(0).length());
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
and I have FileParser:
public class FileParser {
public FileParser() {
}
final static Charset ENCODING = StandardCharsets.UTF_8;
public static List<String> getWords(String filePath) throws IOException {
List<String> list = new ArrayList<String>();
Path path = Paths.get(filePath);
try (BufferedReader reader = Files.newBufferedReader(path, ENCODING)) {
String line = null;
while ((line = reader.readLine()) != null) {
String line1 = line.replaceAll("\\s+","");
if (!line1.equals("") && !line1.equals(" ") ){
list.add(line1);
}
}
reader.close();
}
return list;
}
}
args[0] is the path to a txt file with just 2 words. So if the file contains:
тор
кит
the program prints:
[тор, кит]
4
If file contains:
т
тор
кит
the program prints:
[т, тор, кит]
2
even if file contains:
//jump to next line
тор
кит
the program prints:
[, тор, кит]
1
where the number is the length of the first string in the list.
So the question is: why does it count one extra character?

Thanks all.
As @Bill said, this symbol is the BOM (http://en.wikipedia.org/wiki/Byte_order_mark), which resides at the beginning of a text file.
I found the symbol with this line:
System.out.println(((int)firstLastWords.get(0).charAt(0)));
It printed 65279 (U+FEFF).
Then I just changed this line:
String line1 = line.replaceAll("\\s+","");
to this:
String line1 = line.replaceAll("\uFEFF","");
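One caveat with the replacement above: it no longer strips whitespace, which the original line did. A minimal sketch that keeps both behaviours, assuming the BOM can only appear at the very start of a line (the class and method names are illustrative and not part of the original code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BomAwareCleaner {
    // U+FEFF is the byte order mark; decoded from UTF-8 it is one char (65279)
    private static final char BOM = '\uFEFF';

    // Removes a leading BOM (if any), then all whitespace, from one line
    static String clean(String line) {
        if (!line.isEmpty() && line.charAt(0) == BOM) {
            line = line.substring(1);
        }
        return line.replaceAll("\\s+", "");
    }

    // Cleans every line and keeps only the non-empty results
    static List<String> cleanAll(List<String> rawLines) {
        List<String> words = new ArrayList<>();
        for (String raw : rawLines) {
            String word = clean(raw);
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Simulates a file whose first line starts with a BOM
        List<String> words = cleanAll(Arrays.asList("\uFEFFabc ", "", "def"));
        System.out.println(words + " " + words.get(0).length());
    }
}
```

In getWords, clean(line) would take the place of the replaceAll call, so the length of the first word no longer includes the BOM.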

Cyrillic characters are difficult to capture using regex; for example, \p{Graph} does not match them, although they are clearly visible characters. Anyway, that is beside the OP's question.
The actual problem is likely due to other non-visible characters, probably control characters. Try the following regex to remove more: replaceAll("(\\s|\\p{Cntrl})+",""). You can play around with the regex to extend it to other cases.

Related

finding character count between two special symbols

I am trying to find the character count between = and the \n newline character using the Java code below. But \n is not being matched in my case.
I am using the org.apache.commons.lang3.StringUtils class.
Please find my Java code below.
public class CharCountInLine {
public static void main(String[] args)
{
BufferedReader reader = null;
try
{
reader = new BufferedReader(new FileReader("C:\\wordcount\\sample.txt"));
String currentLine = reader.readLine();
String[] line = currentLine.split("=");
while (currentLine != null ){
String res = StringUtils.substringBetween(currentLine, "=", "\n"); // \n is not working.
if(res != null) {
System.out.println("line -->"+res.length());
}
currentLine = reader.readLine();
}
}
catch (IOException e)
{
e.printStackTrace();
}
finally
{
try
{
reader.close();
}
catch (IOException e)
{
e.printStackTrace();
}
}
}
}
Please find my sample text file.
sample.txt
Karthikeyan=123456
sathis= 23546
Arun = 23564
Well, you're reading the string using readLine(), which according to the Javadoc (emphasis mine):
Returns:
A String containing the contents of the line, not including
any line-termination characters, or null if the end of the stream has
been reached
So your code doesn't work because the string does not contain a newline character.
You can address this in a number of ways:
Use StringUtils.substringAfter() instead of StringUtils.substringBetween().
If it meets the requirements, treat your file as a Java properties file so you don't need to parse it yourself.
Use String.split().
Use String.lastIndexOf().
Some simple regex matching and grouping.
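As a rough sketch of the plain-String route from the options above (using indexOf and substring, with no extra library), assuming each line holds a single key=value pair; the class and method names are made up for illustration:

```java
class ValueLength {
    // Returns the length of the text after the first '=', or -1 if there is none
    static int lengthAfterEquals(String line) {
        int idx = line.indexOf('=');
        if (idx < 0) {
            return -1;
        }
        // trim() drops the spaces around values such as "sathis= 23546"
        return line.substring(idx + 1).trim().length();
    }

    public static void main(String[] args) {
        String[] sample = {"Karthikeyan=123456", "sathis= 23546", "Arun = 23564"};
        for (String line : sample) {
            System.out.println("line -->" + lengthAfterEquals(line));
        }
    }
}
```

Because readLine() has already removed the line terminator, everything after = is the value; drop the trim() call if the surrounding spaces should be counted.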
You don't need to change how you read the lines, simply change your logic to extract the text after =.
Pattern p = Pattern.compile("(?:.+)=(.+)$");
Matcher m = p.matcher("Karthikeyan=123456");
if (m.find()) {
System.out.println(m.group(1).length());
}
No need for Apache StringUtils either, simple Java regex will do. If you don't want to count whitespace, trim the string before calling length().
Alternatively, you can also split the line around = as discussed here.
10x simpler code:
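Path p = Paths.get("C:\\wordcount\\sample.txt");
// Files.lines throws IOException; try-with-resources closes the file
try (Stream<String> lines = Files.lines(p)) {
lines.forEach(line -> {
// Put the above code here
});
}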

Find a word in a File and print the line that contains it in java

Using command line, I am supposed to enter a file name that contains text and search for a specific word.
foobar file.txt
I started writing the following code:
import java.util.*;
import java.io.*;
class Find {
public static void main (String [] args) throws FileNotFoundException {
String word = args[0];
Scanner input = new Scanner (new File (args[1]) );
while (input.hasNext()) {
String x = input.nextLine();
}
}
}
My program is supposed to find word and then print the whole line that contains it.
Please be specific since I am new to java.
You are already reading in each line of the file, so using the String.contains() method will be your best solution:
if (x.contains(word)) ...
The contains() method simply returns true if the given String contains the character sequence (or String) you pass to it.
Note: This check is case sensitive, so if you want to check if the word exists with any mix of capitalization, just convert the strings to the same case first:
if (x.toLowerCase().contains(word.toLowerCase())) ...
So now here is a complete example:
public static void main(String[] args) throws FileNotFoundException {
String word = args[0];
Scanner input = new Scanner(new File(args[1]));
// Let's loop through each line of the file
while (input.hasNextLine()) { // pairs with nextLine(); hasNext() only looks for the next token
String line = input.nextLine();
// Now, check if this line contains our keyword. If it does, print the line
if (line.contains(word)) {
System.out.println(line);
}
}
}
First you have to open the file, then read it line by line and check whether the word is in that line or not. See the code below.
class Find {
public static void main (String [] args) throws IOException { // readLine() can throw IOException
String word = args[0]; // the word you want to find
try (BufferedReader br = new BufferedReader(new FileReader(args[1]))) { // open the file named on the command line
String line;
while ((line = br.readLine()) != null) { //read file line by line in a loop
if(line.contains(word)) { // check if line contain that word then prints the line
System.out.println(line);
}
}
}
}
}

Issue with encoding; .jar doesn't work with Cyrillic characters in UTF-8 files

So I have this regex as String literal in my code:
private static final String FILE_PATTERN = "((\\s*\".*НЕКОТОРЫЕ СИМВОЛЫ .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";
Also I have input test files in UTF-8 encoding.
The problem is that when I test my program in the IDE (IntelliJ IDEA in my case), everything is OK. In particular, the regex works with Cyrillic characters in the test files.
But when I build the program (Maven) and test the .jar file with the same test files, the regex apparently does not match the Cyrillic characters.
Then I tested it again with file in Windows 1251 encoding and it worked.
So my question is - how can I make my .jar work with UTF-8 files, just like in IDE?
Thanks in advance.
[UPDATE1]
two test files, one in UTF-8 and another in Windows 1251
I've tried to replace Cyrillic characters with \u codes like this:
private static final String FILE_PATTERN = "((\\s*\".*\\u041E\\u0442\\u0434\\u0435\\u043B .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";
this doesn't work :(
[UPDATE2]
File processing starts like this:
static void processFile(String inputFile) {
try {
String fileStr = FileHandler.readFile(inputFile).toString();
if (!FileParser.validateFile(fileStr)) {
System.out.println("Sorry, input file format is invalid");
...
File validating looks like this:
public class FileParser {
private static final String FILE_PATTERN = "((\\s*\".*Отдел .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";
public static boolean validateFile(String fileStr) {
return Pattern.compile(FILE_PATTERN).matcher(fileStr).matches();
}
...
File reading is very common I think:
public class FileHandler {
public static StringBuilder readFile(String fileName) {
StringBuilder res = new StringBuilder();
String temp;
try (BufferedReader r = new BufferedReader(new FileReader((fileName)))) {
while ((temp = r.readLine()) != null) {
res.append(temp).append("\n");
}
} catch (FileNotFoundException e) {
System.out.println("Input file not found!");
} catch (IOException e) {
// log exception
}
return res;
}
...
I'll throw some possibilities at the problem.
The classes FileReader and FileWriter use the default platform encoding; before Java 11 they had no overload taking a charset. I am not sure whether relying on the default is intended here, but here is one of the alternatives:
public static StringBuilder readFile(String fileName) {
StringBuilder res = new StringBuilder();
String temp;
Charset charset = StandardCharsets.UTF_8;
//Charset charset = Charset.forName("Windows-1251");
try (BufferedReader r = Files.newBufferedReader(Paths.get(fileName), charset)) {
while ((temp = r.readLine()) != null) {
res.append(temp).append("\n");
}
} catch (NoSuchFileException e) { // java.nio.file reports a missing file this way
System.out.println("Input file not found!");
} catch (IOException e) {
// log exception
}
return res;
}
Or:
String readFile(String fileName) throws IOException {
byte[] content = Files.readAllBytes(Paths.get(fileName));
return new String(content, StandardCharsets.UTF_8);
}
Next, the encoding your editor saves the Java sources in must match the encoding the javac compiler assumes. You can check this by writing such special characters as \uXXXX ASCII escapes: if it then suddenly works, ...
You used two backslashes, but a \uXXXX escape with a single backslash is processed at the Java source level: \u0063 is the letter c, and instead of public class you can in fact write publi\u0063 \u0063lass.
private static final String FILE_PATTERN =
"((\\s*\".*\u041E\u0442\u0434\u0435\u043B .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";
Finally, there is the regular expression itself, which has two Unicode flags, (?u) and (?U), that among other things affect what counts as a letter. That should not be a problem here.
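To make the first possibility concrete, here is a small sketch of what happens when UTF-8 bytes are decoded with a different charset. It assumes the IDE and the command line start the JVM with different default charsets, which the question does not confirm; ISO-8859-1 merely stands in for a non-UTF-8 default, and the class and method names are illustrative:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // The word from the question's pattern, written with escapes so the
    // result does not depend on the source file encoding
    static final String WORD = "\u041E\u0442\u0434\u0435\u043B";

    // Encodes WORD as UTF-8 (like the test file), decodes it with the given
    // charset (like the reader), and checks whether the word survives
    static boolean survives(Charset readerCharset) {
        byte[] fileBytes = WORD.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(fileBytes, readerCharset);
        return decoded.contains(WORD);
    }

    public static void main(String[] args) {
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("read as UTF-8:      " + survives(StandardCharsets.UTF_8));
        // ISO-8859-1 stands in for "some non-UTF-8 platform default"
        System.out.println("read as ISO-8859-1: " + survives(StandardCharsets.ISO_8859_1));
    }
}
```

Printing Charset.defaultCharset() from both the IDE run and the .jar run is a quick way to see whether the two environments actually differ.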

remove '#' symbol from the beginning of the string in java

Sample data in csv file
##Troubleshooting DHCP Configuration
#Module 3: Point-to-Point Protocol (PPP)
##Configuring HDLC Encapsulation
Hardware is HD64570
So I want to get the lines as
#Troubleshooting DHCP Configuration
Module 3: Point-to-Point Protocol (PPP)
#Configuring HDLC Encapsulation
Hardware is HD64570
I have written sample code
public class ReadCSV {
public static BufferedReader br = null;
public static void main(String[] args) {
ReadCSV obj = new ReadCSV();
obj.run();
}
public void run() {
String sCurrentLine;
try {
br = new BufferedReader(new FileReader("D:\\compare\\Genre_Subgenre.csv"));
try {
while ((sCurrentLine = br.readLine()) != null) {
if(sCurrentLine.charAt(0) == '#'){
System.out.println(sCurrentLine);
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
I am getting the output and error below:
##Troubleshooting DHCP Configuration
#Module 3: Point-to-Point Protocol (PPP)
##Configuring HDLC Encapsulation
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 0
at java.lang.String.charAt(Unknown Source)
at example.ReadCSV.main(ReadCSV.java:19)
Please suggest me how to do this?
Steps:
Read the CSV file line by line
Use line.replaceFirst("#", "") to remove the first # from each line
Write the modified lines to an output (a file or a String), whichever suits you
If the variable s contains the content of the CSV file as a String,
s = s.replace("##", "#");
will replace all occurrences of "##" with "#", wherever they appear in the text, not only at the start of a line.
You need something like String line=buffer.readLine()
Check the first character of the line with line.charAt(0)=='#'
Get the new String with String newLine=line.substring(1)
This is a rather trivial question. Rather than do the work for you, I'll outline the steps that you need to take without gifting you the answer.
Read in a file line by line
Take the first line and check if the first character of this line is a # - If it is, create a substring of this line excluding the first character ( or use fileLine.replaceFirst("#", ""); )
Store this line somewhere in an array like data structure or simply replace the current variable with the edited one ( fileLine = fileLine.replaceFirst("#", ""); )
Repeat until no more lines left from file.
If you want to save these changes to the file, simply overwrite the old file with the new lines (e.g. a FileWriter constructed with its second parameter, append, set to false overwrites the file)
Make an attempt and show us what you have tried, people will be more likely to help if they believe you have attempted the problem yourself thoroughly first.
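For reference, the StringIndexOutOfBoundsException in the question is what charAt(0) throws on an empty string, which is what readLine() returns for a blank line in the file. A hedged sketch of the steps above that guards against this, with illustrative class and method names and a hard-coded list standing in for the file:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LeadingHashStripper {
    // Removes exactly one leading '#'; other lines pass through unchanged
    static String stripOneHash(String line) {
        // startsWith is safe on "", unlike charAt(0)
        return line.startsWith("#") ? line.substring(1) : line;
    }

    // Applies stripOneHash to every line, preserving order
    static List<String> stripAll(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            result.add(stripOneHash(line));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "##Troubleshooting DHCP Configuration",
                "#Module 3: Point-to-Point Protocol (PPP)",
                "",
                "Hardware is HD64570");
        for (String line : stripAll(input)) {
            System.out.println(line);
        }
    }
}
```

Unlike replaceFirst("#", ""), this only touches a # in the first position, so a # later in the line is left alone.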
package stackoverflow.q_25054783;
import java.util.Arrays;
public class RemoveHash {
public static void main(String[] args) {
String [] strArray = new String [3];
strArray[0] = "##Troubleshooting DHCP Configuration";
strArray[1] = "#Module 3: Point-to-Point Protocol (PPP)";
strArray[2] = "##Configuring HDLC Encapsulation";
System.out.println("Original array: " + Arrays.toString(strArray));
for (int i = 0; i < strArray.length; i++) {
strArray[i] = strArray[i].replaceFirst("#", "");
}
System.out.println("Updated array: " + Arrays.toString(strArray));
}
}
//Output:
//Original array: [##Troubleshooting DHCP Configuration, #Module 3: Point-to-Point Protocol (PPP), ##Configuring HDLC Encapsulation]
//Updated array: [#Troubleshooting DHCP Configuration, Module 3: Point-to-Point Protocol (PPP), #Configuring HDLC Encapsulation]
OpenCSV reads CSV file line by line and gives you an array of strings, where each string is one comma separated value, right? Thus, you are operating on a string.
You want to remove '#' symbol from the beginning of the string (if it is there). Correct?
Then this should do it:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null) {
// startsWith is safe even when the first value is an empty string
if (nextLine.length > 0 && nextLine[0].startsWith("#")) {
nextLine[0] = nextLine[0].substring(1);
}
}
Replacing the first '#' symbol on each of the lines in the CSV file:
private static List<String> getFileContentWithoutFirstChar(File f) {
try (BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8))) {
List<String> lines = new ArrayList<String>();
for (String line = input.readLine(); line != null; line = input.readLine()) {
// Strip only a leading '#'; startsWith is safe on empty lines
lines.add(line.startsWith("#") ? line.substring(1) : line);
}
return lines;
} catch (IOException e) {
e.printStackTrace();
System.exit(1);
return null;
}
}
private static void writeFile(List<String> lines, File f) {
try (BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8))) {
for (String line : lines) {
bw.write(line);
bw.newLine(); // readLine() dropped the line terminators
}
} catch (IOException e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
File f = new File("file/path");
List<String> lines = getFileContentWithoutFirstChar(f);
// FileOutputStream truncates the existing file, so no delete is needed
writeFile(lines, f);
}

JAVA: Getting the content of specific strings from text files

I have a text file like this:
text
text
text
.
.
#data
instances1
instances2
.
.
instancesN
I want to get the contents of this file from #data until the end of the file. How can I do this?
I found this method of the FileUtils class (from Apache Commons IO), but it's usable only if I already know the line number.
String ln = FileUtils.readLines(new File("arff_file/"+results.get(0)))
.get(lineNumber);
Since you are using Apache Commons, you can do it in one line:
String contents = FileUtils.readFileToString(new File("arff_file/"+results.get(0)), "UTF-16").replaceAll("(?s)^.*?(?=#data)", "");
This works by
reading the whole file into a single String
using regex-based replaceAll() to remove (by replacing with an empty string) everything up to, but not including, #data
The regex breakdown of (?s)^.*?(?=#data) is:
(?s) the DOTALL flag, so that . also matches line terminators (without it the match could not span the lines before #data)
^ start of input
.*? a reluctantly quantified wildcard
(?=#data) a positive (non-consuming) look ahead that asserts that the next input is #data
The reluctant quantifier matters: it stops at the first #data instead of skipping past it, in case #data appears more than once in the input.
static List<String> readAfterData(String file) throws IOException {
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line;
while ((line = br.readLine()) != null) {
if (line.equals("#data")) {
// Hand the reader over once the marker is found; a boolean flag would work too
return nowRead(br);
}
}
}
return new ArrayList<String>(); // no #data line in the file
}
static ArrayList<String> nowRead(BufferedReader br) throws IOException {
ArrayList<String> s = new ArrayList<String>();// do it as you wish
String line;
while ((line = br.readLine()) != null) {
s.add(line);
}
return s;
}
Path start = Paths.get("test.txt");
try
{
List<String> lines = Files.readAllLines(start);
for (Iterator<String> it = lines.iterator(); it.hasNext();)
{
String line = it.next();
if (!"#data".equals(line.trim()))
{
it.remove();
}
else
{
break;
}
}
System.out.println(lines);
}
catch (IOException e)
{
e.printStackTrace();
}
I was reading about Path online, so why not something like this as an alternative to Bohemian's code?
Maybe something could be done using Java 8's stream(), but I have nothing for that yet.
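One possible stream version is sketched below. It relies on Stream.dropWhile, which only exists from Java 9 on, so it is not a pure Java 8 solution; the class and method names are illustrative, and a hard-coded list stands in for the file:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FromDataMarker {
    // Keeps the "#data" line and everything after it (dropWhile needs Java 9+)
    static List<String> fromMarker(List<String> lines) {
        return lines.stream()
                .dropWhile(line -> !"#data".equals(line.trim()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("text", "text", "#data", "instances1", "instances2");
        System.out.println(fromMarker(lines));
    }
}
```

With a real file, the same dropWhile call can be applied to the stream returned by Files.lines(path), opened in a try-with-resources block so the file is closed afterwards.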
