Fastest way to replace data in Java - java

I need to write a Java method that will:
retrieve HTML from a data table
search the HTML for a specific marker (embedded within a comment)
replace that marker with more HTML
For example, the original HTML could have a page header, the marker and a page footer. I would want to get that HTML and replace the marker with page content, like a blog posting.
My main concerns are speed and functionality. Since the original HTML and the HTML to be injected into the original HTML could be quite large, I need some advice.
I know I could use Strings and use String.replace(), but I'm concerned about the size limitations of a String and how fast that would perform.
I'm also thinking about using the Reader/Writer objects, but I don't know if that would be faster or not.
I know there is a Java Clob object, but I don’t really see if it can be used for my particular situation.
Any ideas/advice would be welcome.
Thanks,
Tim

Stream the data in with a Reader, parse it on the fly to find your tags, and replace the data as it goes by while you are streaming the data out with a Writer.
Yes, you have to write a parser to do this.
Do not load it in to a big buffer, do searches and regexes and whatever on the buffer, and then write it out. Processing the data once is the fastest thing you can do.
If you have data later in the file that will fill in spots higher in the file, then you're stuck reading the whole thing in.
Finally, why aren't you just using something like Apache Velocity?
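The streaming approach described in this answer could be sketched roughly as follows. This is only a minimal illustration, not a full parser: the class name `MarkerReplacer` and the marker text `<!--CONTENT-->` are made up for the example, and it assumes the marker is a fixed literal string. It keeps a small sliding window the size of the marker, so only that many characters are held in memory at a time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

final class MarkerReplacer {

    /** Copies in to out, replacing every occurrence of marker with replacement. */
    static void replace(Reader in, Writer out, String marker, String replacement)
            throws IOException {
        if (marker.isEmpty()) {
            throw new IllegalArgumentException("marker must not be empty");
        }
        // Holds the most recent characters that have not been written yet.
        StringBuilder window = new StringBuilder(marker.length());
        int c;
        while ((c = in.read()) != -1) {
            window.append((char) c);
            if (window.length() == marker.length()) {
                if (marker.contentEquals(window)) {
                    // Full marker seen: emit the replacement instead.
                    out.write(replacement);
                    window.setLength(0);
                } else {
                    // No match starts at the oldest character, so release it.
                    out.write(window.charAt(0));
                    window.deleteCharAt(0);
                }
            }
        }
        // Whatever is left is shorter than the marker and cannot match.
        out.write(window.toString());
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        replace(new StringReader("<html><!--CONTENT--></html>"), out,
                "<!--CONTENT-->", "<p>Blog post</p>");
        System.out.println(out);
    }
}
```

With real data you would probably wrap the source in a BufferedReader and the target in a BufferedWriter, since this sketch reads one character at a time.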

How big is your HTML? A gigabyte? A megabyte? 100k? 10k? For all but the first, string manipulation will be just fine. If that answer doesn't satisfy you, then use indexOf() to find the start and end of the marker, and use substring() to write the portions of the original string before and after.
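A minimal sketch of the indexOf() idea, assuming the whole page fits comfortably in memory. The class and method names are made up for illustration. It appends ranges of the original string to a StringBuilder rather than calling substring(), which avoids creating two intermediate strings but is otherwise the same technique.

```java
final class StringMarkerReplacer {

    /** Replaces the first occurrence of marker in html with content. */
    static String replaceMarker(String html, String marker, String content) {
        int start = html.indexOf(marker);
        if (start < 0) {
            return html; // marker not present, nothing to do
        }
        int end = start + marker.length();
        int size = html.length() - marker.length() + content.length();
        return new StringBuilder(size)
                .append(html, 0, start)           // everything before the marker
                .append(content)                  // the injected HTML
                .append(html, end, html.length()) // everything after the marker
                .toString();
    }

    public static void main(String[] args) {
        String page = "<header/><!--CONTENT--><footer/>";
        System.out.println(replaceMarker(page, "<!--CONTENT-->", "<p>Blog post</p>"));
    }
}
```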

StringBuilder (not threadsafe) and StringBuffer (threadsafe) are the two basic constructions for String manipulation. But if you are reading your data from a stream it is probably better if you do it on the fly. (read lines, look for marker, if found write content instead of it)

Related

Java add attribute to HTML tags without changing formatting

I have a task to make a Maven plugin which takes HTML files in a certain location and adds a service attribute to each tag that doesn't have it. This is done on the source code, which means my colleagues and I will have to edit those files further.
As a first solution I turned to Jsoup which seems to be doing the job but has one small yet annoying problem: if we have a tag with multiple long attributes (we often do as this HTML code is a source for further processing) we wrap the lines like this:
<ui:grid id="category_search" title="${handler.getMessage( 'title' )}"
class="is-small is-outlined is-hoverable is-foldable"
filterListener="onApplyFilter" paginationListener="onPagination" ds="${handler.ds}"
filterFragment="grid_filter" contentFragment="grid_contents"/>
However, Jsoup turns this into one very long line:
<ui:grid id="category_search" title="${handler.getMessage( 'title' )}" class="is-small is-outlined is-hoverable is-foldable" filterListener="onApplyFilter" paginationListener="onPagination" ds="${handler.ds}" filterFragment="grid_filter" contentFragment="grid_contents"/>
This is bad practice and a real pain to read and edit.
So is there any other not very convoluted way to add this attribute without parsing and recomposing HTML code or maybe somehow preserve line breaks inside the tag?
Unfortunately JSoup's main use case is not to create HTML that is read or edited by humans. Specifically JSoup's API is very closely modeled after DOM which has no way to store or model line breaks inside tags, so it has no way to preserve them.
I can think of only two solutions:
Find (or write) an alternative HTML parser library, that has an API that preserves formatting inside tags. I'd be surprised if such a thing already exists.
Run the generated code through a formatter that supports wrapping inside tags. This won't preserve the original line breaks, but at least the attributes won't be all on one line. I wasn't able to find a Java library that does that, so you may need to consider using an external program.
It seems there is no good way to preserve breaks inside tags while parsing them into POJOs (or I haven't found one), so I wrote a simple tokenizer which splits incoming HTML string into parts sort of like this:
String[] parts = html.split( "((?=<)|(?<=>))" );
This uses regex lookarounds to split before < and after >. Then just iterate over the parts and decide whether to insert the attribute or not.
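Here is roughly what the loop over the parts might look like. It is a sketch under a few assumptions: the attribute name `data-service` and the class name `AttributeAdder` are invented for the example, and the split breaks down if an attribute value itself contains `<` or `>`. The attribute check is a simple text search, so it can also be fooled by a value that happens to contain the attribute name followed by `=`.

```java
import java.util.regex.Pattern;

final class AttributeAdder {

    /** Adds name="value" to every opening tag that lacks the attribute. */
    static String addAttribute(String html, String name, String value) {
        Pattern hasAttr = Pattern.compile("\\s" + Pattern.quote(name) + "\\s*=");
        StringBuilder out = new StringBuilder(html.length() + 64);
        // Split before every '<' and after every '>' without consuming anything,
        // so line breaks inside tags survive untouched.
        for (String part : html.split("((?=<)|(?<=>))")) {
            boolean openingTag = part.length() > 2
                    && part.charAt(0) == '<'
                    && Character.isLetter(part.charAt(1)) // skips </x>, <!-- -->, <?x?>
                    && part.endsWith(">");
            if (openingTag && !hasAttr.matcher(part).find()) {
                // Insert just before the closing ">" or "/>".
                int end = part.endsWith("/>") ? part.length() - 2 : part.length() - 1;
                out.append(part, 0, end)
                   .append(' ').append(name).append("=\"").append(value).append('"')
                   .append(part, end, part.length());
            } else {
                out.append(part);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<ui:grid id=\"category_search\"\n    class=\"is-small\"/>";
        System.out.println(addAttribute(html, "data-service", "x"));
    }
}
```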

Parsing html text to obtain input fields

So I currently have a big blob of html text, and I want to generate an input form based on what is contained in that text. For example, if the text contains '[%Name%]', I want to be able to read that in and recognize 'Name' is there, and so in turn enable a form field for name. There will be multiple tags ([%age%], [%height%], etc.)
I was thinking about using Regex, but after doing some research it seems that Regex is a horrible idea to parse html with. I came across parsing html pages with groovy, but it is not strictly applicable to my implementation. I am storing the html formatted text (which I am creating using ckeditor) in a database.
Is there an efficient way to do this in Java/Groovy? Or should I just create an algorithm similar to the examples shown here? I'm not too sure how effective the given algorithms would be, as they seem to be constructed around relatively small strings, whereas my string to parse may end up being quite large (a 15-20 page document).
Thanks in advance
Instead of reinventing the wheel, I think it's better to use jsoup. It is an excellent tool for your task, and it makes it easy to obtain anything in an HTML page using its selector syntax. Check out the usage examples in their cookbook.
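For comparison, and not as a jsoup example: the `[%Name%]` placeholders are plain-text tokens rather than HTML structure, so a single regex pass over the stored text may be enough to list them. The sketch below uses only java.util.regex; the class name is made up, and it assumes the editor has not split a placeholder across inline tags or encoded the brackets as entities.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderScanner {

    // Matches tokens such as [%Name%] and captures the name between the markers.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[%(\\w+)%\\]");

    /** Returns the distinct placeholder names in order of first appearance. */
    static Set<String> findFields(String html) {
        Set<String> fields = new LinkedHashSet<>();
        Matcher m = PLACEHOLDER.matcher(html);
        while (m.find()) {
            fields.add(m.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(findFields("<p>Dear [%Name%], you are [%age%].</p>"));
    }
}
```

A single pass like this is linear in the length of the text, so a 15-20 page document should not be a problem.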

Regex: Negating a whole word (needed to optimize a file)

I am trying to do a simple weather widget for Android, that provides temperatures just for my country (Jordan). The website I am using for the weather records provides a JSON file with country regions data for many countries. The problem is that the file contains 2500+ objects, and it takes a really long time to be parsed. Thus, and as I actually need <100 of them (the regions of my country), I thought that I could optimize the file before passing it to the JSON parser, by taking off all of the records I don't need. I don't know if it's a good solution, but it was what I thought of. Anyway, my problem now is getting the right Regex.
This is the URL of the JSON file.
As you can see, every object has four items. The one I need to check for is "icon", which specifies the country of that region.
EXAMPLE:
{"value":"khalda","icon":"Jordan","label":"khalda","desc":"Amman & Madaba"},
What I could come up with so far is the pattern of the object I actually need. However, I need to match the ones I don't need in order to delete them. Here is the pattern: \{[^\{]*Jordan*[^\}]*\}, (This has to be modified so it matches when "Jordan" does NOT exist, which I couldn't figure out.)
Any help/hint is highly appreciated.
Thanks.
Rather than matching and deleting the objects you don't need, match and extract the single(?) object that you do need. It will be faster.
(And I agree with minitech's comment. Parsing the JSON file is unlikely to be the real bottleneck.)
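A sketch of the "extract what you need" idea, under the assumption that the file is a flat array of objects like the example in the question, with no nested braces and no braces inside string values. The class name is invented; a real JSON parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegionFilter {

    /** Returns the raw text of every flat object whose "icon" equals country. */
    static List<String> extract(String json, String country) {
        // [^{}]* cannot cross a brace, so each match stays inside one object.
        Pattern p = Pattern.compile(
                "\\{[^{}]*\"icon\"\\s*:\\s*\"" + Pattern.quote(country) + "\"[^{}]*\\}");
        List<String> objects = new ArrayList<>();
        Matcher m = p.matcher(json);
        while (m.find()) {
            objects.add(m.group());
        }
        return objects;
    }

    public static void main(String[] args) {
        String json = "[{\"value\":\"khalda\",\"icon\":\"Jordan\",\"label\":\"khalda\","
                + "\"desc\":\"Amman & Madaba\"},"
                + "{\"value\":\"x\",\"icon\":\"Egypt\",\"label\":\"x\",\"desc\":\"y\"}]";
        // Re-wrap the kept objects in brackets to get a smaller JSON array.
        System.out.println("[" + String.join(",", extract(json, "Jordan")) + "]");
    }
}
```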

Is HTML parsing (in Java/Android) then extracting data from it, an effective way of getting a webpage's content?

So, I'm using HTTP Post Requests in Android Java to log into a website, before extracting the entire HTML code. After that, I use Pattern/Matcher (regex) to find all the elements I need before extracting them from the HTML data, and deleting everything unnecessary. For instance when I extract this:
String extractions = "<td>Good day sir</td>";
Then I use:
extractions = extractions.replaceAll("<td>", "").replaceAll("</td>", "");
I do this multiple times until I have all the data needed from that site, before I display it in some kind of list.
I'm not particularly stuck on anything, but please, can you tell me if this is an effective/efficient/fast way of getting data from a page and processing it, or are there ways to do this faster? Because sometimes it's like my program takes a lot of time to get certain data (although mostly that's when I'm on 3G on my phone).
Like others have said, regex is not the best tool for this job. But in this case, the particular way you use regex is even more inefficient than it would normally be.
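To illustrate the inefficiency: matching the whole cell and then running two more replaceAll() passes compiles and applies extra patterns for every extraction. If you stay with regex at all, one precompiled pattern with a capturing group gets the inner text in a single pass. This is only a sketch with a made-up class name, and it shares the usual weaknesses of regex on HTML (nested tables and malformed markup will trip it up).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CellExtractor {

    // Compiled once and reused; group 1 is the text between <td ...> and </td>.
    private static final Pattern CELL = Pattern.compile(
            "<td[^>]*>(.*?)</td>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the trimmed contents of every table cell found in html. */
    static List<String> cells(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(cells("<tr><td>Good day sir</td><td> 42 </td></tr>"));
    }
}
```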
In any case, let me offer one more possible solution (depending on your use case).
It's called YQL (Yahoo Query Language).
http://developer.yahoo.com/yql/
Here is a console for it so you can play around with it.
http://developer.yahoo.com/yql/console/
YQL is the lazy developer's way to build your own api on the fly. The main inconvenience is that you have to use Yahoo as a go-between, but if you're ok with that, then I'd suggest you go that route. Using YQL is probably the quickest way to get that kind of work done (especially if the html you're targeting keeps on changing and if its html tags are not always valid).
Using regex to parse a website is always a bad idea:
How to use regular expressions to parse HTML in Java?
Using regular expressions to parse HTML: why not?
Have a look at the Apache Tika library for extracting text from HTML. It also provides parsers for many other formats, such as PDF: http://tika.apache.org/

Generate HTML from plain text using Java

I have to convert a .log file into a nice and pretty HTML file with tables. Right now I just want to get the HTML header down. My current method is to println to file every single line of the HTML file. for example
p.println("<html>");
p.println("<script>");
etc. There has to be a simpler way, right?
How about using a JSP scriptlet and JSTL? You could create a custom object which holds all the important information and display it formatted using the Expression Language.
Printing raw HTML text as strings is probably the "easiest" (most straightforward) way to do what you're asking but it has its drawbacks (e.g. properly escaping the content text).
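If you do stay with plain strings, the escaping drawback can be handled with a small helper. The sketch below is one possible shape, with invented class and method names, and it assumes each log line becomes a single table cell; it escapes the four characters that matter most in element content and attribute values.

```java
import java.util.Arrays;
import java.util.List;

final class LogTableWriter {

    /** Escapes the characters that have special meaning in HTML text. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds a one-column table with one row per log line. */
    static String toTable(List<String> lines) {
        StringBuilder html = new StringBuilder("<table>\n");
        for (String line : lines) {
            html.append("  <tr><td>").append(escape(line)).append("</td></tr>\n");
        }
        return html.append("</table>\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(toTable(Arrays.asList("INFO started", "WARN a < b && c")));
    }
}
```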
You could use the DOM (e.g. Document et al) interface provided by Java but that would hardly be "easy". Perhaps there are "DOM builder" type tools/libraries for Java that would simplify this task for you; I suggest looking at dom4j.
Look at this Java HTML Generator library (easy to use). It should make generating the actual HTML muuuch clearer. There are complications when creating HTML with Java Strings (what happens if you want to change something like a rowspan?) that can be avoided with this library. Especially when dealing with tables.
There are many templating engines available. Have a look at https://stackoverflow.com/questions/174204/suggestions-for-a-java-based-templating-engine
This way you can define a template in a txt file and have the java code fill in the variables.
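To give an idea of what "fill in the variables" means without committing to a particular engine, here is a tiny hand-rolled sketch. The `${name}` placeholder syntax and the class name are assumptions made for the example, not the syntax of any specific library; a real templating engine adds loops, conditionals and escaping on top of this.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TinyTemplate {

    // Matches ${name} and captures the variable name.
    private static final Pattern VAR = Pattern.compile("\\$\\{(\\w+)\\}");

    /** Replaces each ${name} with its value; unknown names are left as-is. */
    static String fill(String template, Map<String, String> values) {
        Matcher m = VAR.matcher(template);
        StringBuffer out = new StringBuffer(template.length());
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement stops '$' and '\' in values being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("title", "Server log");
        System.out.println(fill("<html><h1>${title}</h1>${rows}</html>", values));
    }
}
```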
