I have a List of DTOs mapped from an HTTP response (via a RestTemplate call), each with two values: id and content. While iterating over the list, I unescape the HTML entities in content and strip some unwanted characters using the code below:
String content = null;
for(TestDto testDto: testDtoList) {
try {
content = StringEscapeUtils.unescapeHtml4(testDto.getContent()).
replaceAll("<style(.+?)</style>", "").
replaceAll("<script(.+?)</script>", "").
replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ").
replaceAll("[^a-zA-Z0-9\\\\.]+", " ").
replace("\\n", " ").
replaceAll("\\\\r","").trim();
processContent(content);
} catch (Exception e) {
System.out.println("Content err: " + e.getMessage());
}
}
Partway through the loop, the code halts with a Java "constant string too long" error, and I am not even able to catch it.
How should I handle this problem?
EDIT:
The length of the getContent() String can exceed Integer.MAX_VALUE.
That code is hard to read anyway, so you might want to refactor it. One thing you could try is to use a StringBuffer along with Pattern, Matcher and the appendReplacement() and appendTail() methods. That way you could provide a list of patterns and replacements, iterate over it, iterate over all occurrences of the current pattern and replace each one. Unfortunately, before Java 9 those methods don't accept a StringBuilder, but it might at least be worth a try. In fact, the replaceAll() method basically does the same, but by doing it yourself you could skip the return sb.toString(); part, which probably causes the problem.
Example:
class ReplacementInfo {
String pattern;
String replacement;
}
List<ReplacementInfo> list = ...; //build it
StringBuffer input = new StringBuffer( testDto.getContent() );
StringBuffer output = new StringBuffer( );
for( ReplacementInfo replacementInfo : list ) {
//create the pattern and matcher for the current input
Pattern pattern = Pattern.compile( replacementInfo.pattern );
Matcher matcher = pattern.matcher( input );
//replace all occurences of the pattern
while( matcher.find() ) {
matcher.appendReplacement( output, replacementInfo.replacement );
}
//add the rest of the input
matcher.appendTail( output );
//reuse the output as the input for the next iteration
input = output;
output = new StringBuffer();
}
At the end input would contain the result unless you handle reusing the intermediate steps differently, e.g. by clearing the buffers and adding output to input thus keeping output until the next iteration.
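For reference, here is a minimal, self-contained sketch of the same buffer-to-buffer idea. The class name BufferReplace, the method applyAll and the three sample rules are made up for illustration and are not the patterns from the question; a LinkedHashMap stands in for the list of ReplacementInfo objects so that the rules are applied in insertion order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BufferReplace {
    // Applies each (pattern -> replacement) rule in order, going from buffer to buffer
    // without building an intermediate String per step. Names here are hypothetical.
    static StringBuffer applyAll(CharSequence content, Map<String, String> rules) {
        StringBuffer input = new StringBuffer(content);
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Matcher matcher = Pattern.compile(rule.getKey()).matcher(input);
            StringBuffer output = new StringBuffer();
            // replace every occurrence of the current pattern
            while (matcher.find()) {
                matcher.appendReplacement(output, rule.getValue());
            }
            // copy whatever follows the last match
            matcher.appendTail(output);
            // the output of this rule is the input of the next one
            input = output;
        }
        return input;
    }

    public static void main(String[] args) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("(?s)<script.*?</script>", ""); // sample rule, not from the question
        rules.put("(?s)<[^>]*>", " ");            // sample rule
        rules.put("\\s+", " ");                   // sample rule
        StringBuffer result = applyAll("<p>Hello</p><script>x()</script><b>world</b>", rules);
        System.out.println("[" + result + "]"); // prints "[ Hello world ]"
    }
}
```

Keep in mind that appendReplacement() treats $ and \ in the replacement specially, so replacements containing those characters would need Matcher.quoteReplacement().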
Btw, you might also want to look into using StringEscapeUtils.UNESCAPE_HTML4.translate(input, writer) along with a StringWriter that allows you to access the underlying StringBuffer and thus completely operate on the content without using String.
Supposing your DTO isn't big enough, you could:
store the response in a temporary file,
add a catch clause with the specific exception that is thrown during the runtime, and inside the clause the handling code for it.
That way you can parse the strings and when the exception hits, you could handle the long string by splitting it in parts and cleaning it.
Change your catch block as shown below:
String content = null;
for(TestDto testDto: testDtoList) {
try {
content = StringEscapeUtils.unescapeHtml4(testDto.getContent()).
replaceAll("<style(.+?)</style>", "").
replaceAll("<script(.+?)</script>", "").
replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ").
replaceAll("[^a-zA-Z0-9\\\\.]+", " ").
replace("\\n", " ").
replaceAll("\\\\r","").trim();
} catch (ContentTooLongException e) {
System.out.println("Content err: " + e.getMessage());
}catch (Exception e) {
System.out.println("other err: " + e.getMessage());
}
}
Now you'll be able to handle any exception.
I have used the following code to write elements from an ArrayList into a file, to be retrieved later on using StringTokenizer. It works perfectly for 3 other ArrayLists, but for this particular one it throws an exception when reading with .nextToken(), and further troubleshooting with .countTokens() shows that it finds only 1 token. The delimiter for both writing and reading is the same, ",", just as for the other ArrayLists.
I'm puzzled why it doesn't work the way it does for the other lists when I have not changed the code structure.
=================Writing to file==================
public static void copy_TimeZonestoFile(ArrayList<AL_TimeZone> timezones, Context context){
try {
FileOutputStream fileOutputStream = context.openFileOutput("TimeZones.dat",Context.MODE_PRIVATE);
OutputStreamWriter writerFile = new OutputStreamWriter(fileOutputStream);
int TZsize = timezones.size();
for (int i = 0; i < TZsize; i++) {
writerFile.write(
timezones.get(i).getRegion() + "," +
timezones.get(i).getOffset() + "\n"
);
}
writerFile.flush();
writerFile.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
==========Reading from file (nested in thread/runnable combo)===========
public void run() {
if (fileTimeZones.exists()){
System.out.println("Timezone file exists. Loading.. File size is : " + fileTimeZones.length());
try{
savedTimeZoneList.clear();
BufferedReader reader = new BufferedReader(new InputStreamReader(openFileInput("TimeZones.dat")));
String lineFromTZfile = reader.readLine();
while (lineFromTZfile != null ){
StringTokenizer token = new StringTokenizer(lineFromTZfile,",");
AL_TimeZone timeZone = new AL_TimeZone(token.nextToken(),
token.nextToken());
savedTimeZoneList.add(timeZone);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (Exception e){
e.printStackTrace();
}
}
}
===================Trace======================
I/System.out: Timezone file exists. Loading.. File size is : 12373
W/System.err: java.util.NoSuchElementException
at java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
at com.cryptotrac.trackerService$1R_loadTimeZones.run(trackerService.java:215)
W/System.err: at java.lang.Thread.run(Thread.java:764)
It appears that this line of your code is causing the java.util.NoSuchElementException to be thrown.
AL_TimeZone timeZone = new AL_TimeZone(token.nextToken(), token.nextToken());
That probably means that at least one of the lines in file TimeZones.dat does not contain precisely two strings separated by a single comma.
This can be easily checked by making sure that the line that you read from the file is a valid line before you try to parse it.
Using method split, of class java.lang.String, is preferable to using StringTokenizer. Indeed the javadoc of class StringTokenizer states the following.
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
Try the following.
String lineFromTZfile = reader.readLine();
while (lineFromTZfile != null ){
String[] tokens = lineFromTZfile.split(",");
if (tokens.length == 2) {
// valid line, proceed to handle it
}
else {
// optionally handle an invalid line - maybe write it to the app log
}
lineFromTZfile = reader.readLine(); // Read next line in file.
}
There are probably multiple things wrong, because I'd actually expect you to run into an infinite loop, because you are only reading the first line of the file and then repeatedly parse it.
You should check following things:
Make sure that you are writing the file correctly. What does the written file exactly contain? Are there new lines at the end of each line?
Make sure that the data written (in this case, "region" and "offset") never contain a comma, otherwise parsing will break. I expect that there is a very good chance that "region" contains a comma.
When reading files you always need to assume that the file (format) is broken. For example, assume that readLine will return an empty line or something that contains more or less than one comma.
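The checks above could be sketched roughly like this. It is only an illustration under some assumptions: the class TimeZoneLines and its parse method are invented names, the sample data is made up, and it assumes the offset never contains a comma, so that splitting on the last comma keeps a region with commas in one piece.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class TimeZoneLines {
    // Returns {region, offset} pairs. Malformed lines are skipped instead of
    // causing a NoSuchElementException. Hypothetical helper, not the asker's code.
    static List<String[]> parse(BufferedReader reader) throws IOException {
        List<String[]> result = new ArrayList<>();
        String line;
        // reading inside the loop condition advances to the next line on every pass
        while ((line = reader.readLine()) != null) {
            int comma = line.lastIndexOf(',');
            if (comma <= 0 || comma == line.length() - 1) {
                continue; // empty line, no comma at all, or an empty field
            }
            result.add(new String[] { line.substring(0, comma), line.substring(comma + 1) });
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // made-up sample data: one good line, one empty, one broken, one with a comma in the region
        String data = "Europe/Berlin,+01:00\n\nbroken line\nAmerica/Indiana, Knox,-06:00\n";
        for (String[] pair : parse(new BufferedReader(new StringReader(data)))) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

In the real code you would probably log the skipped lines rather than drop them silently, so that a broken file is noticed.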
I am trying to figure out why my inputFile.delete() will not delete the file. After looking at numerous topics it looks like something is still using the file and hence it won't delete. But I can't figure it out. What am I missing??
File inputFile = new File("data/Accounts.txt");
File tempFile = new File("data/tmp.txt");
try {
tempFile.createNewFile();
BufferedReader reader = new BufferedReader(new FileReader(inputFile));
BufferedWriter writer = new BufferedWriter(new FileWriter(tempFile));
String line;
int i = 0;
for (User u : data) {
String toRemove = getIDByUsername(username);
while ((line = reader.readLine()) != null) {
if (line.contains(toRemove + " ")) {
line = (i + " " + username + " " + getStatusByUsername(username) + " " + password);
}
writer.write(line + "\n");
i++;
}
}
reader.close();
writer.close();
} catch (FileNotFoundException e) {
ex.FileNotFound();
} catch (IOException ee) {
ex.IOException();
} finally {
inputFile.delete();
tempFile.renameTo(inputFile);
}
You can have that much shorter and easier by using java.nio:
public static void main(String[] args) {
// provide the path to your file, (might have to be an absolute path!)
Path filePath = Paths.get("data/Accounts.txt");
// lines go here, initialize it as empty list
List<String> lines = new ArrayList<>();
try {
// read all lines (alternatively, you can stream them by Files.lines(...)
lines = Files.readAllLines(filePath);
// do your logic here, this is just a very simple output of the content
System.out.println(String.join(" ", lines));
// delete the file
Files.delete(filePath);
} catch (NoSuchFileException nsfe) {
// handle the situation of a non existing file (wrong path or similar);
// java.nio reports this as NoSuchFileException, not FileNotFoundException
System.err.println("The file at " + filePath.toAbsolutePath().toString()
+ " could not be found." + System.lineSeparator()
+ nsfe.toString());
} catch (FileSystemException fse) {
// handle the situation of an inaccessible file
System.err.println("The file at " + filePath.toAbsolutePath().toString()
+ " could not be accessed:" + System.lineSeparator()
+ fse.toString());
} catch (IOException ioe) {
// catch unexpected IOExceptions that might be thrown
System.err.println("An unexpected IOException was thrown:" + System.lineSeparator()
+ ioe.toString());
}
}
This prints the content of the file and deletes it afterwards.
You will want to do something different instead of just printing the content, but that will be possible, too ;-) Try it...
I am trying to figure out why my inputFile.delete() will not delete the file.
That's because the old file API is crappy specifically in this way: It has no ability to tell you why something is not succeeding. All it can do, is return 'false', which it will.
See the other answer, by @deHaar, which shows how to do this with the newer API. Aside from being cleaner code and the newer API giving you more options, the newer API also fixes this problem where various methods, such as File.delete(), cannot tell you the reason why they cannot do what you ask.
There are many, many issues with your code, which is why I strongly suggest you go with deHaar's attempt. To wit:
You aren't properly closing your resources; if an exception happens, your file handlers will remain open.
Both reading and writing here is done with 'platform default encoding', whatever that might be. Basically, never use those FileReader and FileWriter constructors. Fortunately, the new API defaults to UTF_8 if you fail to specify an encoding, which is more sensible.
Your exception handling is not great (you're throwing away any useful messages, whatever ex.FileNotFound() might be doing here), and you still try to delete-and-replace even if exceptions occur, which then fails, as your file handles are still open.
The method should be called getIdByUsername.
Your toRemove string is the same every time, or at least, the username variable does not appear to be updated as you loop through. If indeed it never updates, move that line out of your loop.
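To make the first three points concrete, here is a rough sketch, not a drop-in replacement for the question's method. The class RewriteFile, the rewrite method and the sample account lines are invented for illustration; the per-line edit is passed in as a function so that the sketch stays independent of the asker's User and getIDByUsername code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.function.UnaryOperator;

class RewriteFile {
    // Rewrites 'target' line by line through a temp file in the same directory,
    // then moves the temp file over the original. Hypothetical helper.
    static void rewrite(Path target, UnaryOperator<String> editLine) throws IOException {
        Path temp = Files.createTempFile(target.toAbsolutePath().getParent(), "rewrite", ".tmp");
        try (BufferedReader reader = Files.newBufferedReader(target, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(editLine.apply(line));
                writer.newLine();
            }
        } // both handles are closed here, even if an exception was thrown
        // only reached when copying succeeded; throws with a reason if the move fails
        Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("accounts", ".txt");
        Files.write(file, Arrays.asList("0 alice active pw1", "1 bob active pw2")); // made-up data
        rewrite(file, line -> line.startsWith("1 ") ? "1 bob locked pw2" : line);
        System.out.println(Files.readAllLines(file));
        Files.delete(file);
    }
}
```

Because the move happens after the try-with-resources block, nothing is still holding the file open at that point, and a failure surfaces as an exception with a message instead of a silent false.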
My data format looks like "#9#0000075".
How can I delete "#9" so that the data starts with "#0" and the value that follows? And how can I get the number of characters before "#0"?
For info: the leading part (here "#9") is not always the same length; it may be longer or shorter. But the data I need always starts with "#0".
StringBuilder gate = new StringBuilder();
int count = 1;
String header = "";
String oldHeader = "";
public void processRawPacket(int i) throws Exception {
try {
gate.append((char) i);
oldHeader = gate.toString();
for (int index = 0; index <= oldHeader.length(); index++) {
if (oldHeader == "#0") {
gate.append(oldHeader);
}
index++;
}
process = false;
} catch (Exception e) {
out.println(expMsg + "processRawPacket(i): " + e);
e.printStackTrace();
StringWriter stack = new StringWriter();
e.printStackTrace(new PrintWriter(stack));
out.println("Caught exception packet; decorating with appropriate status template : " + stack.toString());
resetPacket();
}
}
You have a potential issue with String comparison. The == operator compares references. To compare the contents of two strings, use the .equals() method.
Precisely, this code
if (oldHeader == "#0") {
gate.append(oldHeader);
}
Should be
if("#0".equals(oldHeader)) {
gate.append(oldHeader);
}
Split the string on the delimiter to get its parts as an array. When you work with lengths afterwards, add one for each # that the split removed.
String s="#9#0000075";
String[] ss=s.split("#");
Just note that, when working with the lengths, you need to add one to account for each # that was removed.
This way you can handle the parts separately and use their lengths.
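A small sketch of what the split actually produces, and of the "add one per #" bookkeeping, might look like this. The class SplitPacket and the method prefixLength are made-up names, and the sketch assumes that the last # in the packet is the one that starts the "#0" header.

```java
import java.util.Arrays;

class SplitPacket {
    // Number of characters in front of the last '#', rebuilt from the split parts.
    // Assumes the last '#' begins the header you want to keep (hypothetical helper).
    static int prefixLength(String packet) {
        String[] parts = packet.split("#");
        int length = 0;
        for (int i = 0; i < parts.length - 1; i++) {
            length += parts[i].length();
            if (i > 0) {
                length++; // the '#' that stood in front of this part
            }
        }
        return length;
    }

    public static void main(String[] args) {
        String s = "#9#0000075";
        // a leading '#' yields an empty first element
        System.out.println(Arrays.toString(s.split("#"))); // prints [, 9, 0000075]
        System.out.println(prefixLength(s));               // prints 2, the length of "#9"
    }
}
```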
You could use the native String methods:
final String header = "#0";
String pkg = "#9#048965648";
int oldheader = pkg.indexOf(header); // Yields 2 here
pkg = pkg.substring(oldheader); // Yields #048965648
The str.indexOf(searchString) method returns the first index at which searchString starts in str. If searchString cannot be found, it returns -1 (cf. the JavaDoc). If the header you wish to remove is also #0, then lastIndexOf is what you're looking for, supposing that there are at most two occurrences of #0, i.e. when the old and new headers are both #0.
substring (in this form) returns the string beginning at the specified position, which (here) effectively clips the header you wish to remove. (JavaDoc). Should you want to retain the original header, use the two-integer variant here with 0 as the begin index.
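Put together, and with the -1 case handled, this could look like the following sketch. The class HeaderStrip and its two methods are names invented for this example.

```java
class HeaderStrip {
    static final String HEADER = "#0";

    // Number of characters in front of the first "#0", or -1 if there is none.
    static int countBeforeHeader(String packet) {
        return packet.indexOf(HEADER);
    }

    // Drops everything in front of the first "#0"; returns the input unchanged if it is absent.
    static String stripOldHeader(String packet) {
        int start = packet.indexOf(HEADER);
        return start < 0 ? packet : packet.substring(start);
    }

    public static void main(String[] args) {
        String pkg = "#9#0000075";
        System.out.println(countBeforeHeader(pkg));                   // prints 2
        System.out.println(stripOldHeader(pkg));                      // prints #0000075
        System.out.println(pkg.substring(0, countBeforeHeader(pkg))); // prints #9
    }
}
```

The index returned by indexOf is also the count of characters before "#0" that the question asks for, which is why no separate counter is needed.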
I am generating a StringBuilder from an SQL query result set:
StringBuilder sb = new StringBuilder();
PreparedStatement ps1 = null;
String[] systems = null;
try {
ps1 = c.prepareStatement(systemNames);
ResultSet rs = ps1.executeQuery();
while (rs.next()) {
sb.append(rs.getString("system_name") + ",");
System.out.println(">-------------------->>> " + sb.toString());
}
systems = sb.toString().split(",");
} catch (SQLException e) {
log.error(e.toString(), e);
} finally {
if (ps1 != null) {
try {
ps1.close();
} catch (Exception e) {
}
}
if (c != null) {
try {
c.close();
} catch (Exception e) {
}
}
}
System.out.println(">-------------------->>> " + systems.toString());
What I am confused about is the string object it prints after the split, the first print statement prints:
>-------------------->>> ACTIVE_DIRECTORY,
While the second, after the delimiter split, prints:
>-------------------->>> [Ljava.lang.String;@387f6ec7
How do I print just ACTIVE_DIRECTORY, without the comma, after the split?
It prints out the comma because you appended it.
Just print out your array (the systems variable):
System.out.println(Arrays.toString(systems));
Be sure to check out the Oracle documentation on the Arrays utility class method Arrays.toString(Object[] a). This will 'pretty-print' the contents of your array, rather than the type name and hash code.
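Also, the string you build will end with a comma every time, because you append one after each name. (Note that split(",") discards trailing empty strings, so that final comma does not end up in the array.) If you want to strip commas yourself, you could do: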
myString = myString.replaceAll(",", "");
or
myString[myString.length - 1] = myString[myString.length - 1].replaceAll(",", "");
Depending on whether you want to remove the commas before or after the split(","); Doing this on the last element will take the comma out of the String. I recommend after, because doing it before will make your split not work!
Alternately you could do a replaceAll(",", " "); and then split(" "); but then you might as well just append spaces instead of commas to begin with.
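Here is a short sketch of how the array prints in each case. The class name SystemNames and the second system name "LDAP" are made up for illustration. Since split(",") drops trailing empty strings, a single trailing comma in the input does not leave an extra element behind.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SystemNames {
    public static void main(String[] args) {
        // same shape as the string built in the question: a comma after every name
        String[] systems = "ACTIVE_DIRECTORY,LDAP,".split(",");
        System.out.println(systems.length);           // prints 2
        System.out.println(Arrays.toString(systems)); // prints [ACTIVE_DIRECTORY, LDAP]
        System.out.println(systems[0]);               // prints ACTIVE_DIRECTORY

        // alternative: collect into a List and skip the append-then-split round trip
        List<String> names = new ArrayList<>();
        names.add("ACTIVE_DIRECTORY");
        names.add("LDAP");
        String[] direct = names.toArray(new String[0]);
        System.out.println(String.join(",", direct)); // prints ACTIVE_DIRECTORY,LDAP
    }
}
```

In the question's loop, the List variant would mean calling names.add(rs.getString("system_name")) instead of appending to the StringBuilder.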
Try replacing:
System.out.println(">-------------------->>> " + systems.toString());
with
System.out.println(">-------------------->>> " + systems[0]);
Since String[] systems is an array, which is an object, calling toString() on it prints the type name and hash code rather than the contents.
public class Parser {
public static void main(String[] args) {
Parser p = new Parser();
p.matchString();
}
parserObject courseObject = new parserObject();
ArrayList<parserObject> courseObjects = new ArrayList<parserObject>();
ArrayList<String> courseNames = new ArrayList<String>();
String theWebPage = " ";
{
try {
URL theUrl = new URL("http://ocw.mit.edu/courses/");
BufferedReader reader =
new BufferedReader(new InputStreamReader(theUrl.openStream()));
String str = null;
while((str = reader.readLine()) != null) {
theWebPage = theWebPage + " " + str;
}
reader.close();
} catch (MalformedURLException e) {
// do nothing
} catch (IOException e) {
// do nothing
}
}
public void matchString() {
// this is my regex that I am using to compare strings on input page
String matchRegex = "#\\w+(-\\w+)+";
Pattern p = Pattern.compile(matchRegex);
Matcher m = p.matcher(theWebPage);
int i = 0;
while (!m.hitEnd()) {
try {
System.out.println(m.group());
courseNames.add(i, m.group());
i++;
} catch (IllegalStateException e) {
// do nothing
}
}
}
}
What I am trying to achieve with the above code is to get the list of departments on the MIT OpenCourseWare website. I am using a regular expression that matches the pattern of the department names as they appear in the page source, and I am using a Pattern object and a Matcher object to try to find() and print the department names that match the regular expression. But the code is taking forever to run, and I don't think reading in a webpage using BufferedReader takes that long. So I think I am either doing something horribly wrong or parsing websites takes a ridiculously long time. I would appreciate any input on how to improve performance or correct a mistake in my code, if any. I apologize for the badly written code.
The problem is with the code
while ((str = reader.readLine()) != null)
theWebPage = theWebPage + " " +str;
The variable theWebPage is a String, which is immutable. For each line read, this code creates a new String with a copy of everything that's been read so far, with a space and the just-read line appended. This is an extraordinary amount of unnecessary copying, which is why the program is running so slowly.
I downloaded the web page in question. It has 55,000 lines and is about 3.25MB in size. Not too big. But because of the copying in the loop, lines end up being copied about 1.5 billion times in total (roughly 1/2 of 55,000 squared). The program is spending all its time copying and garbage collecting. I ran this on my laptop (2.66GHz Core2Duo, 1GB heap) and it took 15 minutes to run when reading from a local file (no network latency or web crawling countermeasures).
To fix this, make theWebPage into a StringBuilder instead, and change the line in the loop to be
theWebPage.append(" ").append(str);
You can convert theWebPage to a String using toString() after the loop if you wish. When I ran the modified version, it took a fraction of a second.
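As a self-contained illustration of the StringBuilder version, here is a sketch that reads from any Reader, so it can be tried without a network connection. The class PageReader and the method readAll are names chosen for this example; in the original program the Reader would be the InputStreamReader wrapped around theUrl.openStream().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PageReader {
    // Joins all lines with a single space in front of each, appending in place
    // instead of copying the whole text for every line. Hypothetical helper.
    static String readAll(Reader source) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(' ').append(line);
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // stand-in input; the real program would pass the URL stream's reader
        System.out.println(readAll(new StringReader("<html>\n<body>\n</html>")));
    }
}
```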
BTW your code is using a bare code block within { } inside a class. This is an instance initializer (as opposed to a static initializer). It gets run at object construction time. This is legal, but it's quite unusual. Notice that it misled other commenters. I'd suggest converting this code block into a named method.
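To show the difference between the two kinds of initializer blocks, here is a tiny sketch. The class InitOrder is invented for this purpose; it just records the order in which the blocks run.

```java
class InitOrder {
    static final StringBuilder LOG = new StringBuilder();

    // static initializer: runs once, when the class is initialized
    static {
        LOG.append("static;");
    }

    // instance initializer: runs for every new object, before the constructor body
    {
        LOG.append("instance;");
    }

    InitOrder() {
        LOG.append("constructor;");
    }

    public static void main(String[] args) {
        new InitOrder();
        new InitOrder();
        // prints static;instance;constructor;instance;constructor;
        System.out.println(LOG);
    }
}
```

So the download in the question's bare block happens each time new Parser() is executed, before any constructor code, which is easy to overlook when reading the class.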
Is this your whole program? Where is the declaration of parserObject?
Also, shouldn't all of this code be in your main() prior to calling matchString()?
parserObject courseObject = new parserObject();
ArrayList<parserObject> courseObjects = new ArrayList<parserObject>();
ArrayList<String> courseNames = new ArrayList<String>();
String theWebPage=" ";
{
try {
URL theUrl = new URL("http://ocw.mit.edu/courses/");
BufferedReader reader = new BufferedReader(new InputStreamReader(theUrl.openStream()));
String str = null;
while((str = reader.readLine())!=null)
{
theWebPage = theWebPage+" "+str;
}
reader.close();
} catch (MalformedURLException e) {
} catch (IOException e) {
}
}
You are also catching exceptions and not displaying any error messages. You should always display an error message and do something when you encounter an exception. For example, if you can't download the page, there is no reason to try to parse an empty string.
From your comment I learned about static blocks in classes (thank you, I didn't know about them). However, from what I've read, you need to put the keyword static before the start of the block {. Also, it might just be better to put the code into your main; that way you can exit if you get a MalformedURLException or IOException.
You can, of course, solve this assignment with the limited JDK 1.0 API, and run into the issue that Stuart Marks helped you solve in his excellent answer.
Or, you just use a popular de-facto standard library, like for instance, Apache Commons IO, and read your website into a String using a no-brainer like this:
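// using these...
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;
// run this...
try (InputStream is = new URL("http://ocw.mit.edu/courses/").openStream()) {
// pass the charset explicitly (UTF-8 is assumed here); the one-argument toString(InputStream) is deprecated
theWebPage = IOUtils.toString(is, StandardCharsets.UTF_8);
}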