Cutting down runtime of csv parser - java

I have a dataset of 230'000 entries in a CSV file. One line looks like this:
1849-06-01,24.844,1.402,Abidjan,Côte D'Ivoire,5.63N,3.23W
Now, when I try to split the entries into an array, create an object from each line, and add that object to an ArrayList, it fails. I noticed that it has to do with the number of entries I try it with: I found 20'000 to be the absolute maximum. My parser looks like this:
try {
    new RequestBuilder(RequestBuilder.GET, "test20.csv").sendRequest("", new RequestCallback() {
        String data[] = new String[20000];

        @Override
        public void onResponseReceived(Request req, Response resp) {
            String text = resp.getText();
            data = text.split("\n");
            for (String str : data) {
                String[] results = str.split(",");
                // creates DataTableObject using the constructor
                DataTableObject object = new DataTableObject(results[0], results[1], results[2],
                        results[3], results[4], results[5], results[6]);
                dataSet.add(object);
            }
            drawTable(dataSet);
        }

        @Override
        public void onError(Request res, Throwable throwable) {
            // handle errors
        }
    });
} catch (RequestException e) {
    e.printStackTrace();
}
The drawTable() method just inserts all the created objects into a FlexTable.
dataSet is an ArrayList<DataTableObject>; those objects just contain all the Strings (city, country, etc.).
The whole thing runs with GWT 2.7.0 and JDK 1.8.
Do you have any idea how to draw all 230'000 entries without hitting a timeout in the browser?
I've tried working without objects, but that doesn't cut down the runtime notably. Neither does splitting only by \n or only by ,.
Thanks
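
One common way to avoid the browser's long-running-script timeout in GWT is to spread the parsing over several event-loop turns with Scheduler.scheduleIncremental, so no single script execution runs too long. A minimal sketch, assuming the dataSet field and drawTable method from the question (the slice size of 1000 rows is an arbitrary starting point):

// inside onResponseReceived:
final String[] lines = resp.getText().split("\n");
Scheduler.get().scheduleIncremental(new Scheduler.RepeatingCommand() {
    private int next = 0;

    @Override
    public boolean execute() {
        // parse a bounded slice per event-loop turn so the UI stays responsive
        int end = Math.min(next + 1000, lines.length);
        for (; next < end; next++) {
            String[] results = lines[next].split(",");
            dataSet.add(new DataTableObject(results[0], results[1], results[2],
                    results[3], results[4], results[5], results[6]));
        }
        if (next < lines.length) {
            return true; // reschedule for the next slice
        }
        drawTable(dataSet); // render once everything is parsed
        return false;
    }
});

Even with incremental parsing, inserting 230'000 rows into a single FlexTable will itself be extremely slow; rendering a page at a time (pagination, or a CellTable with a pager) is usually unavoidable at that size.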

Related

How to create mock CsvExceptions to use with csvToBean.getCapturedExceptions()

I am trying to write some unit tests to see if a logging method gets called for CSV exceptions. The flow goes something like this:
CsvToBean is used to parse some info, and each bean that is produced has some work done on it.
After all this, CsvToBean.getCapturedExceptions().forEach() is used to process the exceptions.
How do I create some of these exceptions for testing?
public void parseAndSaveReportToDB(Reader reader, String reportFileName,
        ItemizedActivityRepository iaRepo,
        ICFailedRecordsRepository icFailedRepo,
        String reportCols) throws Exception {
    try {
        CsvToBean<ItemizedActivity> csvToBean = new CsvToBeanBuilder<ItemizedActivity>(reader)
                .withType(ItemizedActivity.class)
                .withThrowExceptions(false)
                .build();
        csvToBean.parse().forEach(itmzActvty -> {
            itmzActvty.setReportFileName(reportFileName);
            String liteDesc = itmzActvty.getBalanceTransactionDescription();
            if (liteDesc.contains(":")) {
                liteDesc = liteDesc.substring(liteDesc.indexOf(":") + 1).trim();
            }
            itmzActvty.setLiteDescription(liteDesc);
            itmzActvty.setAmount(convertCentToDollar(itmzActvty.getAmount()));
            iaRepo.save(itmzActvty);
        });
        log.info("Successfully saved report data in DB");
        csvToBean.getCapturedExceptions().forEach(csvExceptionObj ->
                logFailedRecords(reportFileName, csvExceptionObj, icFailedRepo, reportCols));
        reader.close();
    } catch (Exception ex) {
        log.error("Exception when saving report data to DB", ex);
        throw ex;
    }
}
In this code I need to trigger the logFailedRecords method. To do so I need to fill the captured-exceptions queue with an exception, but I don't know how to get an exception in there.
What I have is not much, since I keep hitting walls:
@Test
public void testParseAndSaveReportToDBWithExceptions() throws Exception {
    // CsvException csvExceptionObject = new CsvException("testException");
    CsvToBean<ItemizedActivity> csvToBean = mock(CsvToBean.class);//<ItemizedActivity>(reader).withType(ItemizedActivity.class).withThrowExceptions(false).build().class);
    BufferedReader reader = mock(BufferedReader.class);
    ReportingMetadata rmd = this.getReportingMetadata();
    verify(this.reportsUtil).parseAndSaveReportToDB(reader, "test.csv",
            this.iaRepo, this.icFailedRepo, rmd.getReportCols());
    // System.out.println(csvToBean.getCapturedExceptions().toString());
}
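
CsvException can be constructed directly (it has a plain String constructor), so the stubbing itself is simple; the harder part is that parseAndSaveReportToDB builds its own CsvToBean internally, so a mock can only reach it if the builder is injected or the method is refactored. A sketch of the stubbing under that assumption (hypothetical wiring, not the original code):

@SuppressWarnings("unchecked")
@Test
public void capturedExceptionsCanBeStubbed() {
    CsvToBean<ItemizedActivity> csvToBean = mock(CsvToBean.class);
    // com.opencsv.exceptions.CsvException needs no CSV input to be created
    when(csvToBean.getCapturedExceptions())
            .thenReturn(Collections.singletonList(new CsvException("testException")));

    // anything that iterates getCapturedExceptions() now sees one entry
    csvToBean.getCapturedExceptions()
            .forEach(ex -> System.out.println(ex.getMessage()));
}

Without refactoring, the practical alternative is to pass a Reader whose rows genuinely fail conversion (for example, text in a column that ItemizedActivity declares as numeric): since the production code uses withThrowExceptions(false), opencsv captures the resulting CsvException instead of throwing, and logFailedRecords gets invoked for real.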

Download files with netty

I am creating a very basic webserver using netty and Java. It will have basic functionality: its main responsibilities are to serve responses for API calls done from a client (e.g. a browser, or a console app I am building) in JSON form, or to send a zip file. For that reason I have created the HttpServerHandler class, which is responsible for getting the request, parsing it to find the command, and calling the appropriate API call. It extends SimpleChannelInboundHandler and overrides the following functions:
@Override
public void channelActive(ChannelHandlerContext ctx) throws Exception {
    LOG.debug("channelActive");
}

@Override
public void channelReadComplete(ChannelHandlerContext ctx) {
    LOG.debug("In channelComplete()");
    ctx.flush();
}

@Override
public void channelRead0(ChannelHandlerContext ctx, Object msg)
        throws IOException {
    ctx = processMessage(ctx, msg);
    if (!HttpHeaders.isKeepAlive(request)) {
        // If keep-alive is off, close the connection once the content is
        // fully written.
        ctx.writeAndFlush(Unpooled.EMPTY_BUFFER).addListener(
                ChannelFutureListener.CLOSE);
    }
}
private ChannelHandlerContext processMessage(ChannelHandlerContext ctx, Object msg) {
    if (msg instanceof HttpRequest) {
        HttpRequest request = this.request = (HttpRequest) msg;
        if (HttpHeaders.is100ContinueExpected(request)) {
            send100Continue(ctx);
        }
        // parse message to find command, parameters and cookies
        ctx = executeCommand(command, parameters, cookies);
    }
    if (msg instanceof LastHttpContent) {
        LOG.debug("msg is of LastHttpContent");
        if (!HttpHeaders.isKeepAlive(request)) {
            // If keep-alive is off, close the connection once the content is
            // fully written.
            ctx.writeAndFlush(Unpooled.EMPTY_BUFFER).addListener(
                    ChannelFutureListener.CLOSE);
        }
    }
    return ctx;
}

private ChannelHandlerContext executeCommand(String command,
        HashMap<String, List<String>> parameters, Set<Cookie> cookies) {
    // switch case to see which command has to be invoked
    switch (command) {
        // many cases
        case "/report":
            ctx = myApi.getReport(parameters, cookies); // This is a member var of ServerHandler
            break;
        // many more cases
    }
    return ctx;
}
In my API class, the getReport function (getReportFile) looks like this:
public ChannelHandlerContext getReportFile(Map<String, List<String>> parameters,
        Set<Cookie> cookies) {
    // some initializations. Actual file handling happens below
    File file = new File(fixedReportPath);
    RandomAccessFile raf = null;
    long fileLength = 0L;
    try {
        raf = new RandomAccessFile(file, "r");
        fileLength = raf.length();
        LOG.debug("creating response for file");
        this.response = Response.createFileResponse(fileLength);
        this.ctx.write(response);
        this.ctx.write(new HttpChunkedInput(new ChunkedFile(raf, 0, fileLength, 8192)),
                this.ctx.newProgressivePromise());
    } catch (FileNotFoundException fnfe) {
        LOG.debug("File was not found", fnfe);
        this.response = Response.createStringResponse("failure");
        this.ctx.write(response);
    } catch (IOException ioe) {
        LOG.debug("Error getting file size", ioe);
        this.response = Response.createStringResponse("failure");
        this.ctx.write(response);
    } finally {
        try {
            if (raf != null) {
                raf.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return this.ctx;
}
The Response class is responsible for handling various types of response creation (JsonString, JsonArray, JsonInteger, File, etc.):
public static FullHttpResponse createFileResponse(long fileLength) {
    FullHttpResponse response = new DefaultFullHttpResponse(HttpVersion.HTTP_1_1, HttpResponseStatus.OK);
    HttpHeaders.setContentLength(response, fileLength);
    response.headers().set(HttpHeaders.Names.CONTENT_TYPE, "application/octet-stream");
    return response;
}
My API works great for my JSON responses (easier to achieve), but not for my file response. When making a request from e.g. Chrome, it only hangs and does not download the file. Should I do something else when downloading a file using netty? I know it's not the best-written code; I still think I have some bits and pieces missing from totally understanding it, but I would like your advice on how to handle downloads in my code. For my code I took into consideration this and this.
First, some remarks on your code...
Instead of returning ctx, I would prefer to return the last Future for the last command, so that your final step (when keep-alive is off) can use it directly.
public void channelRead0(ChannelHandlerContext ctx, Object msg)
        throws IOException {
    ChannelFuture future = processMessage(ctx, msg);
    if (future != null && !HttpHeaders.isKeepAlive(request)) {
        // If keep-alive is off, close the connection once the content is
        // fully written.
        future.addListener(ChannelFutureListener.CLOSE);
    }
}
Doing it this way lets you close directly, without any "pseudo" send, not even an empty one.
Important: note that in HTTP, the response is managed such that chunks are sent for all data after the first HttpResponse item, until the last one, which is empty (LastHttpContent). Sending another empty chunk that is not a LastHttpContent could break the internal logic.
Moreover, you're doing the work twice (once in channelRead0, once in processMessage), which could lead to issues.
Also, since you check for KeepAlive, you should ensure to set it back in the response:
if (HttpHeaders.isKeepAlive(request)) {
    response.headers().set(CONNECTION, HttpHeaders.Values.KEEP_ALIVE);
}
For the send itself, you have two choices (depending on whether SSL is used or not). You've selected only the second one, which is more general and therefore valid in all cases, but less efficient:
// Write the content.
ChannelFuture sendFileFuture;
ChannelFuture lastContentFuture;
if (ctx.pipeline().get(SslHandler.class) == null) {
    sendFileFuture =
            ctx.write(new DefaultFileRegion(raf.getChannel(), 0, fileLength),
                    ctx.newProgressivePromise());
    // Write the end marker.
    lastContentFuture = ctx.writeAndFlush(LastHttpContent.EMPTY_LAST_CONTENT); // <= last writeAndFlush
} else {
    sendFileFuture =
            ctx.writeAndFlush(new HttpChunkedInput(new ChunkedFile(raf, 0, fileLength, 8192)),
                    ctx.newProgressivePromise()); // <= last writeAndFlush
    // HttpChunkedInput will write the end marker (LastHttpContent) for us.
    lastContentFuture = sendFileFuture;
}
It is this lastContentFuture that you can return to the caller for the keep-alive check.
Note, however, that you didn't include a single flush there (except via your EMPTY_BUFFER, which may well be the main reason for your issue!), contrary to the example from which I copied this source.
Note that both branches use writeAndFlush for the last (or only) call.
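
Putting those remarks together, the file-sending branch could return the last write future to the caller instead of the context. A rough sketch in the spirit of the question's code (sendFile is a hypothetical helper; Response.createFileResponse as defined above, error handling trimmed):

private ChannelFuture sendFile(ChannelHandlerContext ctx, File file) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    long fileLength = raf.length();
    ctx.write(Response.createFileResponse(fileLength));
    // HttpChunkedInput emits the LastHttpContent marker itself once the file
    // is exhausted, so this writeAndFlush is the only flush needed.
    return ctx.writeAndFlush(
            new HttpChunkedInput(new ChunkedFile(raf, 0, fileLength, 8192)),
            ctx.newProgressivePromise());
}

The caller can then attach ChannelFutureListener.CLOSE to the returned future when keep-alive is off, as in the channelRead0 shown above. Note that the chunked path also requires a ChunkedWriteHandler in the pipeline, and that the handler closes the ChunkedFile (and with it the RandomAccessFile) when the transfer completes, so there is no need to close it in a finally block as the original code does.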

How to watch file for new content and retrieve that content

I have a file named foo.txt. This file contains some text. I want to achieve the following functionality:
I launch the program
I write something to the file (for example, add one row: new string in foo.txt)
I want to get ONLY the NEW content of this file.
Can you suggest the best solution to this problem? I also want to resolve a related issue: if I modify foo.txt, I want to see a diff.
The closest tool I found in Java is WatchService, but if I understood it correctly, it can only detect the type of event that happened on the filesystem (file created, deleted or modified).
Java Diff Utils is designed for that purpose.
final List<String> originalFileContents = new ArrayList<String>();
final String filePath = "C:/Users/BackSlash/Desktop/asd.txt";

FileListener fileListener = new FileListener() {
    @Override
    public void fileDeleted(FileChangeEvent paramFileChangeEvent)
            throws Exception {
        // use this to handle file deletion event
    }

    @Override
    public void fileCreated(FileChangeEvent paramFileChangeEvent)
            throws Exception {
        // use this to handle file creation event
    }

    @Override
    public void fileChanged(FileChangeEvent paramFileChangeEvent)
            throws Exception {
        System.out.println("File Changed");
        // get new contents
        List<String> newFileContents = new ArrayList<String>();
        getFileContents(filePath, newFileContents);
        // get the diff between the two files
        Patch patch = DiffUtils.diff(originalFileContents, newFileContents);
        // get single changes in a list
        List<Delta> deltas = patch.getDeltas();
        // print the changes
        for (Delta delta : deltas) {
            System.out.println(delta);
        }
    }
};

DefaultFileMonitor monitor = new DefaultFileMonitor(fileListener);
try {
    FileObject fileObject = VFS.getManager().resolveFile(filePath);
    getFileContents(filePath, originalFileContents);
    monitor.addFile(fileObject);
    monitor.start();
} catch (FileNotFoundException e) {
    // handle
    e.printStackTrace();
} catch (IOException e) {
    // handle
    e.printStackTrace();
}
Where getFileContents is :
void getFileContents(String path, List<String> contents) throws FileNotFoundException, IOException {
    contents.clear();
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8"));
    try {
        String line = null;
        while ((line = reader.readLine()) != null) {
            contents.add(line);
        }
    } finally {
        reader.close(); // don't leak the file handle between change events
    }
}
What I did:
I loaded the original file contents into a List<String>.
I used Apache Commons VFS to listen for file changes, using FileMonitor. You may ask, why? Because WatchService is only available starting from Java 7, while FileMonitor works with at least Java 5 (personal preference; if you prefer WatchService, you can use it). Note: Apache Commons VFS depends on Apache Commons Logging; you'll have to add both to your build path in order to make it work.
I created a FileListener and implemented the fileChanged method.
That method loads the new contents from the file and uses DiffUtils.diff to retrieve all differences, then prints them.
I created a DefaultFileMonitor, which basically listens for changes to a file, and added my file to it.
I started the monitor.
After the monitor is started, it will begin listening for file changes.
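
If you are on Java 7+ and only need the appended content (the "new rows" case from the question), a WatchService combined with a remembered file offset avoids re-reading and diffing the whole file. A minimal sketch, assuming the file is append-only (a truncated or rewritten file would need the diff approach above):

import java.io.RandomAccessFile;
import java.nio.file.*;

public class FileTail {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("foo.txt");
        long offset = file.toFile().length(); // skip the content that already exists
        WatchService watcher = FileSystems.getDefault().newWatchService();
        file.toAbsolutePath().getParent().register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
        while (true) {
            WatchKey key = watcher.take(); // blocks until something in the directory changes
            for (WatchEvent<?> event : key.pollEvents()) {
                if (file.getFileName().equals(event.context())) {
                    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
                        raf.seek(offset); // jump past everything we have already seen
                        String line;
                        while ((line = raf.readLine()) != null) {
                            System.out.println("new content: " + line);
                        }
                        offset = raf.getFilePointer();
                    }
                }
            }
            key.reset(); // re-arm the key, or no further events are delivered
        }
    }
}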

JUnit testing for HTML parsing

I'm trying to set up unit tests on a web crawler and am rather confused as to how I should test it. (I've only done unit testing once, and that was on a calculator program.)
Here are two example methods from the program:
protected static void HttpURLConnection(String URL) throws IOException {
    try {
        URL pageURL = new URL(URL);
        HttpURLConnection connection = (HttpURLConnection) pageURL.openConnection();
        stCode = connection.getResponseCode();
        System.out.println("HTTP Status code: " + stCode);
        // append to CSV string
        CvsString.append(stCode);
        CvsString.append("\n");
        // retrieve URL
        siteURL = connection.getURL();
        System.out.println(siteURL + " = URL");
        CvsString.append(siteURL);
        CvsString.append(",");
    } catch (MalformedURLException e) {
        e.printStackTrace();
    }
}
and:
public static void HtmlParse(String line) throws IOException {
    // create new string reader object
    aReader = new StringReader(line);
    // create HTML parser object
    HTMLEditorKit.Parser parser = new ParserDelegator();
    // parse A anchor tags whilst handling start tag
    parser.parse(aReader, new HTMLEditorKit.ParserCallback() {
        // method to handle start tags
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a,
                int pos) {
            // check if A tag
            if (t == HTML.Tag.A) {
                Object link = a.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    links.add(String.valueOf(link));
                    // cast to string and pass to methods to get title, status
                    String pageURL = link.toString();
                    try {
                        parsePage(pageURL); // Title - to print URL, HTML page title, and HTTP status
                        HttpURLConnection(pageURL); // Status
                        // pause for half a second between pages
                        Thread.sleep(500);
                    } catch (IOException e) {
                        e.printStackTrace();
                    } catch (BadLocationException e) {
                        e.printStackTrace();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }, true);
    aReader.close();
}
I've set up a test class in Eclipse and have outlined test methods along these lines:
@Test
public void testHttpURLConnection() throws IOException {
    classToTest.HttpURLConnection( ? );
    assertEquals("Result", ? ? )
}
I don't really know where to go from here. I'm not even sure whether I should be testing live URLs or local files.
I found this question here: https://stackoverflow.com/questions/5555024/junit-testing-httpurlconnection
but I couldn't really follow it and I'm not sure it was solved anyway.
Any pointers appreciated.
There is no one conclusive answer to your question; what you test depends on what your code does and how deeply you want to test it.
So if you have a parse method that takes an HTML document and returns the string "this is a parsed html" (obviously not very useful, but just making a point), you'd test it like this:
@Test
public void testHtmlParseSuccess() throws IOException {
    assertEquals("this is a parsed html", classToTest.parse(html)); // will return true, test will pass
}

@Test
public void testHtmlParseFailure() throws IOException {
    assertEquals("this is a wrong answer", classToTest.parse(html)); // will return false, test will fail
}
There are a lot more methods besides assertEquals(), so you should look here.
Eventually it is up to you to decide which parts to test and how to test them.
Think about what effects your methods should have. In the first case, the expected result of calling HttpURLConnection(url) seems to be that the status code and URL are appended to something called CvsString. You will have to implement something in CvsString so that you can inspect whether what you expected did actually happen.
However: looking at your code, I would suggest you consult a book about unit testing and about refactoring code so that it becomes well testable. In your code snippets I see a lot of reasons why unit testing your code is difficult, if not impossible, e.g. the pervasive use of static methods, methods with side effects, very little separation of concerns, etc. Because of this, it is impossible to answer your question fully in this context. A concrete sketch of one such refactoring follows below.
Don't get me wrong, this isn't meant in an offending way. It is well worth learning these things; it will improve your coding abilities a lot.
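
To make that concrete: extracting the link collection into a method that takes a string and returns the found links removes the network calls, the statics and the sleeps from the unit under test, so it can be exercised with a fixed HTML snippet. A sketch of such a refactoring, using the same javax.swing.text.html classes as the question (not the original code):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public static List<String> extractLinks(String html) throws IOException {
    final List<String> links = new ArrayList<String>();
    new ParserDelegator().parse(new StringReader(html),
            new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                    // collect only anchor tags that actually carry an href
                    if (t == HTML.Tag.A) {
                        Object href = a.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            links.add(href.toString());
                        }
                    }
                }
            }, true);
    return links;
}

@Test
public void extractsAnchorHrefs() throws IOException {
    List<String> links = extractLinks("<a href=\"http://example.com\">x</a>");
    assertEquals(1, links.size());
    assertEquals("http://example.com", links.get(0));
}

No live URLs are involved; the status-code logic can then be tested separately against a local HTTP server (or hidden behind an interface and mocked), which keeps the tests fast and deterministic.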

Design pattern to implement an iterative fallback mechanism

I have written a word-definition fetcher that parses web pages from a dictionary website.
Not all web pages have exactly the same HTML structure, so I had to implement several parsing methods to support the majority of cases.
Below is what I have done so far, which is pretty ugly code.
What do you think would be the cleanest way of coding some kind of iterative fallback mechanism (there may be a more appropriate term), so that I can implement N ordered parsing methods, where a parsing failure triggers the next parsing method, whereas exceptions such as IOException break the process?
public String[] getDefinition(String word) {
    String[] returnValue = { "", "" };
    returnValue[0] = word;
    Document doc = null;
    try {
        String finalUrl = String.format(_baseUrl, word);
        Connection con = Jsoup.connect(finalUrl)
                .userAgent("Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17");
        doc = con.get();
        // *** Case 1 (parsing method that works for 80% of the words) ***
        String basicFormOfWord = doc.select("DIV.luna-Ent H2.me").first().text().replace("·", "");
        String firstPartOfSpeech = doc.select("DIV.luna-Ent SPAN.pg").first().text();
        String firstDef = doc.select("DIV.luna-Ent DIV.luna-Ent").first().text();
        returnValue[1] = "<b>" + firstPartOfSpeech + "</b><br/>" + firstDef;
        returnValue[0] = basicFormOfWord;
    } catch (NullPointerException e) {
        try {
            // *** Case 2 (Alternate parsing method - for poorer results) ***
            String basicFormOfWord = doc.select("DIV.results_content p").first().text().replace("·", "");
            String firstDef = doc.select("DIV.results_content").first().text().replace(basicFormOfWord, "");
            returnValue[1] = firstDef;
            returnValue[0] = basicFormOfWord;
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return returnValue;
}
Sounds like a Chain-of-Responsibility-like pattern. I would have the following:
public interface UrlParser {
    public Optional<String[]> getDefinition(String word) throws IOException;
}

public class Chain {
    private List<UrlParser> list;

    @Nullable
    public String[] getDefinition(String word) throws IOException {
        for (UrlParser parser : list) {
            Optional<String[]> result = parser.getDefinition(word);
            if (result.isPresent()) {
                return result.get();
            }
        }
        return null;
    }
}
I am using Guava's Optional here, but you could return a @Nullable result from the interface as well. Then define a class for each URL parser you need and inject them into Chain.
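For example, wiring two hypothetical parser implementations together (assuming Chain gains a constructor that accepts the list):

List<UrlParser> parsers = Arrays.asList(
        new LunaEntParser(),         // handles the common page layout
        new ResultsContentParser()); // fallback for the other layout
Chain chain = new Chain(parsers);
String[] definition = chain.getDefinition("serendipity"); // null if no parser succeeded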
Chain of Responsibility, as already noted, is a good candidate.
John's answer, OTOH, does not feature a chain of responsibility in the proper sense, since an UrlParser does not actively decide whether to handle the request or pass it on to the next parser.
Here's my trivial shot at it:
public class ParserChain {
    private ArrayList<UrlParser> chain = new ArrayList<UrlParser>();
    private int index = 0;

    public void add(UrlParser parser) {
        chain.add(parser);
    }

    public String[] parse(Document doc) throws IOException {
        if (index == chain.size()) {
            return null; // every parser failed
        }
        return chain.get(index++).parse(doc, this);
    }
}

public interface UrlParser {
    public String[] parse(Document doc, ParserChain chain) throws IOException;
}

public abstract class AbstractUrlParser implements UrlParser {
    @Override
    public String[] parse(Document doc, ParserChain chain) throws IOException {
        try {
            return this.doParse(doc);
        } catch (ParseException pe) {
            // this parser failed: delegate to the next one in the chain
            return chain.parse(doc);
        }
    }

    protected abstract String[] doParse(Document doc) throws ParseException, IOException;
}
Notable things:
This code keeps a stack frame for ParserChain#parse and one for UrlParser#parse for every parser it enters, until some parser stops the chain of responsibility. If you have huge chains, you could run into a stack overflow (how appropriate).
An UrlParser that does not extend AbstractUrlParser can modify the argument and then delegate to the next in the chain, or delegate to the next in the chain and then modify the result.
The ParserChain is not thread-safe (but I'd say this is inherent to the Chain of Responsibility pattern).
Edit: corrected code as of Sebastien's comment
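
Each concrete parser then extends AbstractUrlParser and implements only doParse. Assembling and running the chain could look like this (parser class names are hypothetical; note that a fresh ParserChain is needed per document, since index is stateful):

ParserChain parsers = new ParserChain();
parsers.add(new LunaEntParser());        // tries the DIV.luna-Ent selectors first
parsers.add(new ResultsContentParser()); // falls back to DIV.results_content
String[] definition = parsers.parse(doc); // null means every parser gave up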
