I am trying to build a simple weather widget for Android that provides temperatures just for my country (Jordan). The website I am using for the weather records provides a JSON file with region data for many countries. The problem is that the file contains 2500+ objects and takes a really long time to parse. Since I actually need fewer than 100 of them (the regions of my country), I thought I could optimize the file before passing it to the JSON parser by stripping out all the records I don't need. I don't know if it's a good solution, but it's what I thought of. Anyway, my problem now is getting the right regex.
This is the URL of the JSON file.
As you can see, every object has four items. The one I need to check for is "icon", which specifies the country of that region.
EXAMPLE:
{"value":"khalda","icon":"Jordan","label":"khalda","desc":"Amman & Madaba"},
What I could come up with so far is a pattern for the objects I actually need. However, I need to match the ones I don't need so I can delete them. Here is the pattern: \{[^\{]*Jordan*[^\}]*\}, (this would have to be modified so it matches only when "Jordan" does NOT appear, which is the part I couldn't figure out).
Any help/hint is highly appreciated.
Thanks.
Rather than matching and deleting the objects you don't need, match and extract the single(?) object that you do need. It will be faster.
(And I agree with minitech's comment. Parsing the JSON file is unlikely to be the real bottleneck.)
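To make the extraction concrete, here is a minimal sketch of that approach in Java. It assumes the raw JSON text is already in a String (rawJson here is a placeholder name) and that the objects look exactly like the example in the question, with no nested braces:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Match whole objects whose "icon" is "Jordan"; [^{}]* keeps each match
    // inside a single object, which works because the objects are flat.
    Pattern jordanObject = Pattern.compile("\\{[^{}]*\"icon\":\"Jordan\"[^{}]*\\}");
    Matcher m = jordanObject.matcher(rawJson);

    StringBuilder trimmed = new StringBuilder("[");
    while (m.find()) {
        if (trimmed.length() > 1) {
            trimmed.append(',');
        }
        trimmed.append(m.group());
    }
    trimmed.append(']');
    // trimmed now holds a much smaller JSON array to hand to the parser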
I plan on reading several files when my app/game starts and using the information from them for the entirety of the app. I also have to write to the file at one point.
I have two files. One is a 2-column text file that I'll turn into a dictionary for fast searching. The other is a text file with 11 columns. I'll make a dictionary out of two of the columns, and the other data needs to be kept as-is so I can write to the columns to count the number of times something happens in different circumstances, for data mining.
Currently, I've turned the second file into a list of a list of strings, or List<List<String>>. I can't figure out how to pass that around in intents; .putStringArrayListExtra() only works for a list of strings.
Am I going about this the wrong way entirely? This is my first real Android app.
In order to store a data structure in an Intent, it has to be either Serializable or Parcelable. If your data structure is neither, you can create a wrapper class that implements Serializable and manages it. A good example might be found here.
Once done, you can store your data structure with Intent.putExtra(...), which has an overload that accepts a Serializable (note that putSerializable(...) itself lives on Bundle, not Intent). See this:
Using putSerializable in Android
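For the nested list from the question, a minimal sketch of such a wrapper might look like this (TableData and the extra key "table" are illustrative names, not an existing API):

    import java.io.Serializable;
    import java.util.ArrayList;

    // ArrayList is Serializable, and so are its String elements, so the
    // whole nested structure serializes without extra work.
    public class TableData implements Serializable {
        private static final long serialVersionUID = 1L;
        private final ArrayList<ArrayList<String>> rows;

        public TableData(ArrayList<ArrayList<String>> rows) {
            this.rows = rows;
        }

        public ArrayList<ArrayList<String>> getRows() {
            return rows;
        }
    }

You would then store and retrieve it like so (intent and rows being your own objects):

    intent.putExtra("table", new TableData(rows));
    // ...and in the receiving Activity:
    TableData data = (TableData) getIntent().getSerializableExtra("table");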
In addition, if you can convert your structure into JSON, you'd already be done, since it would travel as a plain String. If not, the above solution should be easy to implement.
Imagine that I am building a hashtag search. My main indexed type is called Post, which has a list of Hashtag items marked as @IndexedEmbedded. Separately, every post has a list of Comment objects, each of which again contains a list of Hashtag objects.
On the search side, I am using a MultiFieldQueryParser, to which I pass a long list of possible search fields, including some nested fields like:
hashTags.value and
comments.hashTags.value
Now, the interesting thing happens when I want to search for something, say #architecture. I know where the hashtags live, so the simplest logical thing to do would be to convert a query like #architecture into one like hashTags.value:architecture or comments.hashTags.value:architecture. Although possible, this is very inflexible. What if I come up with yet another field that contains hashtags? I'd have to include that too.
Is there a general way to do this?
P.S. Please keep in mind that the root type I am searching for is Post, because those are the kind of results I'd like to get back.
Hashtags are keywords, and you should let Lucene handle the text analysis to extract the hashtags from your main text and store them in a custom field.
You can do this very easily with Hibernate Search by mapping the text to be indexed into two different @Field entries (using the @Fields annotation). You could have one field named comments and another named commentsHashtags.
You then apply a custom analyzer to commentsHashtags that performs standard tokenization and discards any term not starting with #; you can define one easily by taking a standard tokenizer and applying a custom filter, as sketched below.
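Such a filter is only a few lines of Lucene. Here is a minimal sketch, assuming a Lucene 3.x-era TokenFilter API (the class name HashtagOnlyFilter is made up for illustration):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Keeps only tokens beginning with '#', discarding everything else.
    public final class HashtagOnlyFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public HashtagOnlyFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            while (input.incrementToken()) {
                if (termAtt.length() > 0 && termAtt.charAt(0) == '#') {
                    return true; // pass hashtag terms through
                }
                // otherwise drop the token and look at the next one
            }
            return false; // stream exhausted
        }
    }

One caveat: the standard tokenizer strips punctuation such as '#', so you may need a whitespace tokenizer in front of this filter to preserve the hashtag prefix.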
When you run a query, you don't have to write custom code to look for hashtags in the query input: let it be processed by the same analyzer (which is the default anyway) and target both fields. You can even boost the hashtags field more if that makes sense.
With this solution you
take advantage of the high efficiency of Hibernate Search's text analysis
avoid extra entities and tables in the database just to hold the hashtags: useless overhead
avoid messing with free-text extraction
It also gets you another strong win:
you can then open a raw IndexReader and load the term vector from commentsHashtags to get both a list of all used tags and metrics about them. Cool for doing some data mining, or just visualizing a tag cloud.
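As a hedged sketch of that last point: with the raw Lucene 3.x API you can enumerate every term of the field (a TermEnum walk, rather than per-document term vectors) to build the tag list and counts. Here, directory stands in for however you obtain your index:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    IndexReader reader = IndexReader.open(directory);
    TermEnum terms = reader.terms(new Term("commentsHashtags", ""));
    try {
        while (terms.term() != null
                && "commentsHashtags".equals(terms.term().field())) {
            // term text plus how many documents use it: tag cloud input
            System.out.println(terms.term().text() + " -> " + terms.docFreq());
            if (!terms.next()) {
                break;
            }
        }
    } finally {
        terms.close();
        reader.close();
    }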
Instead of treating the fields as different and the top-level document as Post, why not store both Posts and Comments as Lucene documents? That way, you can just have a single field called "hashtags" that you search. You should also have a field called "type" or something to differentiate between comments and posts.
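A hedged sketch of that layout with the Lucene 3.x Field API (extractedTags stands in for whatever hashtag string you pull from a comment, and writer is your own IndexWriter):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // discriminator so queries can filter or display by type
    doc.add(new Field("type", "comment", Field.Store.YES, Field.Index.NOT_ANALYZED));
    // the one shared hashtags field every document type gets
    doc.add(new Field("hashtags", extractedTags, Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);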
Search results may be either comments or posts. You can filter by type if users want to search only posts or only comments, or you can show the two differently in your UI.
If you want to add another concept that also uses hashtags (like... I dunno... splanks, or whatever silly name we all give to Internet communications in the future), then you can add it alongside the existing Post and Comment documents simply by indexing your new type with a "hashtags" field. You'll have to do plenty of work to add the splanks anyway, so adding a handler for that new type of search result shouldn't be too much of an inconvenience.
The Problem:
I have numerous files that contain Apache web server log entries. The entries are not in date/time order and are scattered across the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by date and time, and then write them to files named for the day and hour of the entries they contain.
Setup:
Once I have imported my files, I use a regex to extract the date field, and then truncate it to the hour. This produces a set with the full record in one field and the date, truncated to the hour, in another. From there I group on the date-hour field.
First Attempt:
My first thought was to use the STORE command while iterating through my groups with a FOREACH, and I quickly found out that Pig doesn't allow that.
Second Attempt:
My second try was to use the MultiStorage() store function from the Piggybank, which worked great until I looked at the output. The problem is that MultiStorage wants to write all fields to the file, including the field I grouped on. What I really want is just the original record written to the file.
The Question:
So...am I using Pig for something it is not intended for, or is there a better way for me to approach this problem using Pig? Now that I have this question out there, I will work on a simple code example to further explain my problem. Once I have it, I will post it here. Thanks in advance.
Out of the box, Pig doesn't have a lot of functionality. It does the basic stuff, but more often than not I find myself having to write custom UDFs or load/store funcs to get from 95% of the way there to 100%. I usually find it worth it, since writing a small store function is a lot less Java than a whole MapReduce program.
Your second attempt is really close to what I would do. Either copy/paste the source code for MultiStorage or use inheritance as a starting point. Then modify the putNext method to strip out the group value but still write to that file. Unfortunately, Tuple doesn't have a remove or delete method, so you'll have to rebuild the entire tuple (see the sketch below). Or, if all you have is the original string, just pull that out and output it wrapped in a Tuple.
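The tuple-rebuilding piece might look like the following sketch, to be spliced into your copy of MultiStorage's putNext (splitField is whatever field index you grouped on; the helper name stripField is made up):

    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Build a copy of the tuple without the grouping field, since Tuple
    // offers no remove()/delete().
    private Tuple stripField(Tuple in, int splitField) throws ExecException {
        Tuple out = TupleFactory.getInstance().newTuple(in.size() - 1);
        int j = 0;
        for (int i = 0; i < in.size(); i++) {
            if (i != splitField) {
                out.set(j++, in.get(i));
            }
        }
        return out;
    }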
Some general documentation on writing Load/Store functions in case you need a bit more help: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions
I need to write a Java method that will:
retrieve HTML from a data table
search the HTML for a specific marker (embedded within a comment)
replace that marker with more HTML
For example, the original HTML could have a page header, the marker, and a page footer. I would want to take that HTML and replace the marker with page content, like a blog posting.
My main concerns are speed and functionality. Since the original HTML and the HTML to be injected into the original HTML could be quite large, I need some advice.
I know I could use Strings and use String.replace(), but I'm concerned about the size limitations of a String and how fast that would perform.
I'm also thinking about using the Reader/Writer objects, but I don't know if that would be faster or not.
I know there is a Java Clob object, but I can't really tell whether it fits my particular situation.
Any ideas/advice would be welcome.
Thanks,
Tim
Stream the data in with a Reader, parse it on the fly to find your tags, and replace the data as it goes by while you are streaming the data out with a Writer.
Yes, you have to write a parser to do this.
Do not load it into a big buffer, run searches and regexes over the buffer, and then write it out. Processing the data in a single pass is the fastest thing you can do.
If you have data later in the file that needs to fill in spots earlier in the file, then you're stuck sucking the whole thing in.
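A minimal sketch of that streaming approach, assuming the marker sits on a single line and appears at most once per line (the method name and parameters are placeholders, not a library API):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.Writer;

    public static void injectContent(Reader in, Writer out,
                                     String marker, Reader content) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            int pos = line.indexOf(marker);
            if (pos >= 0) {
                writer.write(line, 0, pos);      // text before the marker
                char[] buf = new char[8192];     // stream in the replacement HTML
                int n;
                while ((n = content.read(buf)) != -1) {
                    writer.write(buf, 0, n);
                }
                writer.write(line.substring(pos + marker.length()));
            } else {
                writer.write(line);
            }
            writer.newLine();
        }
        writer.flush();
    }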
Finally, why aren't you just using something like Apache Velocity?
How big is your HTML? A gigabyte? A megabyte? 100k? 10k? For all but the first, string manipulation will be just fine. If that answer doesn't satisfy you, then use indexOf() to find the start and end of the marker, and use substring() to write the portions of the original string before and after it, as in the snippet below.
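In code, that looks like this sketch (html, marker, and newContent are your own strings):

    String result = html; // fall back to the original if no marker is found
    int start = html.indexOf(marker);
    if (start >= 0) {
        result = html.substring(0, start)
               + newContent
               + html.substring(start + marker.length());
    }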
StringBuilder (not thread-safe) and StringBuffer (thread-safe) are the two basic classes for String manipulation. But if you are reading your data from a stream, it is probably better to do the replacement on the fly (read lines, look for the marker, and if found, write the new content in its place).
I am working on a project here that ingests internal resumes from people at my company, strips the skills and relevant content out of them, and stores it all in a database. This was all done using docx4j and Grails. It required the resumes to first be submitted via a template that formatted everything just right, so the ingest tool knew what to look for when stripping the data.
The 2nd portion of this: what if we want to get a "reduced" resume out of the database? In other words, I want to search the uploaded content I now have and only print out new resumes for people who have, let's say, Java programming experience. So I can go into my database, find the people who originally had Java as a skill, and output a new set of resumes that are still in a nice templated format and only have the relevant info in them, instead of ALL the content.
I have been writing some software to do this in Java that basically uses a docx template, overwriting the items in the customXML which are bound to the content controls in the doc, so the new data shows up and can be saved as a new docx with that custom data.
This seems really cumbersome to me, and it has some limitations. For one, let's say my template has a place for 3 skills, and a particular person has 8 skills. There seems to be no good way to add those 5 additional skills to the docx other than painstakingly inserting the data with all of the formatting XML tags and such. This is a real pain, because if the template changes, I don't want to have to go back into my software and edit source code to change that additional data-input XML tag to bold instead of italic.
I was doing some reading on using InfoPath to create a form that I could use to get the input, connecting to some SharePoint data source or something to store the stripped-out data. However, I can't seem to find out whether it is possible, using SharePoint, to get the data back out in a nicely formatted way. What would the general steps for this be? I couldn't find much about this topic with any quick Googling.
Thanks
You could set up the skills like this:

    <skills>
      <skill>..</skill>
      <skill>..</skill>
    </skills>
and use a "repeat" content control pointing to the container. This would handle any number of <skill> entries.