pub.document.sortDocuments not sorting - java

I am stuck. I had this working last week; now I have changed something and it will not work!
I have a simple flow service as follows:
pub.file.getFile
pub.flatFile.convertToValues
pub.document.sortDocuments
But the sortDocuments stage is not doing anything.
The recordWithNoID document list is perfect and all the fields are correct (so the schema and dictionary are working as intended), but when I try to sort it on the key "Field1", the sort does nothing; the documents do not change order at all.
See two attached screenshots:
Screenshot 1 shows the pipeline during pub.document.sortDocuments step
key variable is: Field1
order variable is: ascending
Screenshot 2 shows recordWithNoID after running the flow service. As you can see, the Field1 column has not been ordered (it is still in the original document order). I have also tried mapping the results to other document types, with the same result.
As I said above I had this working last week and now cannot seem to get it to work. I have even started the whole process from scratch and it still will not work. Any help would be very much appreciated!
EDIT:
I resolved this issue by mapping to the Document Type created from the Schema.

It appears that you are mapping the ffValues document (IData) and not the recordWithNoID document list (IData array) inside it, which would be the wrong level.
Please map the recordWithNoID instead and let us know if that solves the issue.
While not related to the question, there seems to be some "clutter" on the pipeline. I always recommend that people drop variables as early as possible, mostly to improve readability but also for performance.
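To make the level distinction concrete, here is a minimal Java sketch of what a sort at the document-list level amounts to, using the webMethods IData API. It is purely illustrative (pub.document.sortDocuments is the built-in that does this for you); the names recordWithNoID and Field1 are taken from the question.

    import java.util.Arrays;
    import java.util.Comparator;
    import com.wm.data.IData;
    import com.wm.data.IDataCursor;
    import com.wm.data.IDataUtil;

    public final class SortSketch {
        // Sorts the document list (the IData array) in place, ascending on Field1.
        // Note that the sort applies to the IData[] itself, not to the
        // ffValues document that wraps it.
        public static void sortByField1(IData[] recordWithNoID) {
            Arrays.sort(recordWithNoID, Comparator.comparing((IData doc) -> {
                IDataCursor c = doc.getCursor();
                String value = IDataUtil.getString(c, "Field1");
                c.destroy();
                return value == null ? "" : value;
            }));
        }
    }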

I am not sure, but maybe this is the problem: in screenshot 1 we can see that you sort ffValues, but it is being mapped to document (because you are using an invoke, this happens automatically).
Is screenshot 2 showing the ffValues variable or the document variable?
Maybe you are checking the wrong, unsorted variable?
I also want to suggest using a Map step with a transformer rather than an invoke, because a map gives you the power to control the pipeline.
With an invoke, every output variable is saved to the pipeline (a variable with the same name both on the pipeline and in the service output will result in the pipeline variable being overwritten).

Related

XDocReport generate report : loop thru collection in table (java)

I have been struggling with trying to follow a code sample by XDocReport (an open source project).
I followed this tutorial from the website:
https://code.google.com/p/xdocreport/wiki/DocxReportingJavaMainListFieldInTable
I used the Freemarker template style.
It would not iterate and create the table; I just get back $variable as literal text in the output doc.
Then I dug further and discovered that the tutorial on the website had probably not been updated for the newer version. I found some more examples at this URL, which contains a zip file:
https://code.google.com/p/xdocreport/downloads/detail?name=docxandfreemarker-1.0.4-sample.zip
I still could not get it to work.
I was hoping someone would have a working code sample that takes a java collection and populates a table in a Word document.
I hope one of the developers of XDocReport, angelo.zerr, will give some input on this.
Sincerely,
P
I was hoping someone would have a working code sample that takes a java collection and populates a table in a Word document.
What is the problem with https://code.google.com/p/xdocreport/wiki/DocxReportingJavaMainListFieldInTable?
I suggest you create an issue on the XDocReport forum with a very simple case (a simple Java main + docx).
It seems that the issue was the template. If one sets up a mail merge field in a Word template and doesn't use it in the Java program, the program complains that it can't find the variable, or something to that effect. And if you just delete the mail merge text in the document, it may still exist as a mail merge field variable in the Word document.
So it seems one needs to be very careful with how things are set up in the template.
I think the API should be able to ignore a field that is set up in the template but not referenced in the code, though. But that solved the problem.
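For anyone landing here, a minimal sketch of the table-from-collection setup the tutorial describes, assuming a docx template whose table row contains the merge fields ${developers.name} and ${developers.mail} (the file names and the Developer bean here are hypothetical):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.Arrays;
    import java.util.List;
    import fr.opensagres.xdocreport.document.IXDocReport;
    import fr.opensagres.xdocreport.document.registry.XDocReportRegistry;
    import fr.opensagres.xdocreport.template.IContext;
    import fr.opensagres.xdocreport.template.TemplateEngineKind;
    import fr.opensagres.xdocreport.template.formatter.FieldsMetadata;

    public class TableReport {
        public static void main(String[] args) throws Exception {
            // Load the docx template with the Freemarker engine.
            InputStream in = new FileInputStream("DocxTable.docx"); // hypothetical template
            IXDocReport report = XDocReportRegistry.getRegistry()
                    .loadReport(in, TemplateEngineKind.Freemarker);

            // Declare the fields as list fields so the table row is repeated
            // once per element; without this you get the placeholder back as text.
            FieldsMetadata metadata = report.createFieldsMetadata();
            metadata.addFieldAsList("developers.name");
            metadata.addFieldAsList("developers.mail");
            report.setFieldsMetadata(metadata);

            // Put the collection into the context under the name the template uses.
            List<Developer> developers = Arrays.asList(
                    new Developer("Angelo", "angelo@example.com"),
                    new Developer("Pascal", "pascal@example.com"));
            IContext context = report.createContext();
            context.put("developers", developers);

            OutputStream out = new FileOutputStream("DocxTable_Out.docx");
            report.process(context, out);
        }

        public static class Developer {
            private final String name;
            private final String mail;
            public Developer(String name, String mail) { this.name = name; this.mail = mail; }
            public String getName() { return name; }
            public String getMail() { return mail; }
        }
    }

The key point, matching the resolution above, is that every merge field declared in the template must line up with what the metadata and context provide.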

Notify when web content changes

I'm new to Java and working on a simple application that monitors a URL and notifies me when a table is updated with new items. Looking at the entire page will not work, as there are ads that change all the time and would give false positives.
My thought was to fetch the URL line by line, looking for the elements. For each element I will check whether it is already in an ArrayList; if not, the element is added to the ArrayList and a notification is sent.
What I need help with is not the exact code but advice on whether this is a good approach, and whether I should store the elements in an ArrayList or use a file instead, as there are two lines of text in each element.
It would also be good to get recommendations on which methods and libraries would be good to look at.
Thanks in advance
Sebastian
To check the site, it would probably be more stable to parse the HTML and work with an object representation of the DOM. I've never had to do this myself, but in a question about how to do it another user suggested JTidy; maybe you could have a look at that.
As for storing the information (what you currently do in your ArrayList): this really depends on what you use your application for. If you only want to be notified of changes that occur during the runtime of your program, this is perfectly fine. If you want the information to persist, you should find a way to store it in the file system or a database.
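As a rough sketch of the approach (using jsoup here instead of JTidy, purely as an example; the URL and CSS selector are hypothetical and would need to match the actual page):

    import java.util.HashSet;
    import java.util.Set;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class TableWatcher {
        private final Set<String> seen = new HashSet<>();

        public void poll() throws Exception {
            // Parse the page and select only the table of interest,
            // ignoring the ad sections that change on every load.
            Document doc = Jsoup.connect("http://example.com/items").get(); // hypothetical URL
            for (Element row : doc.select("table.items tr")) {              // hypothetical selector
                String item = row.text();
                if (seen.add(item)) { // add() returns false if the item was already known
                    System.out.println("New item: " + item);
                }
            }
        }
    }

A Set is a better fit than an ArrayList for the "have I seen this?" check; to persist across restarts, you would write the set to a file or database as described above.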

Dynamic Content Parsing

I am working on content parsing. I executed the sample program for this, and I took a sample link; please visit the link below:
http://www.equitymaster.com/stockquotes/sector.asp?sector=0%2CSOFTL&utm_source=top-menu&utm_medium=website&utm_campaign=performance&utm_content=key-sector
From the above link I parsed the table data and stored it into a Java object.
BSE and NSE are not my exact requirement; I just took them as a sample example. The page above is built with tables that do not use ids or classes. In my example I parsed the data using XPath.
This is my XPath:
/html/body/table[4]/tbody/tr/td/table[2]/tbody/tr[2]/td[2]/font/table[2]
Selecting and parsing with it works fine. The problem is that if they change the website structure in the future, my program will certainly stop working. Is there another way to parse the data dynamically, store it in a database, and display the results based on the condition even if they change the webpage structure? I used the JSoup API for this. Are there any other APIs that provide better support for this type of requirement?
If you're trying to parse a page without any clear id/class to select your nodes, you have to try and rely on something else. Redefining the whole tree is indeed the weakest way of doing it, if anything is added/changed everything will collapse.
You could try relying on color: //table[@bgcolor="#c9d0e0"], the "GET MORE INFO" field: //table[tr/td//text()="GET MORE INFO"], or the "More Info" that is on every line: //table[.//td//text()="&nbsp;More Info&nbsp;"]...
The idea is to find something ideally unique (if you can't find any unique criteria, table[color condition selecting a few tables][2] is still stronger than walking the whole tree), present every time, and use that as an id.
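Since the question already uses JSoup, the same ideas translate to its CSS selectors; the selectors below are illustrative, based on the attributes mentioned above:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class SectorParser {
        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect(
                    "http://www.equitymaster.com/stockquotes/sector.asp?sector=0%2CSOFTL").get();

            // Rely on a distinctive attribute instead of the full tree path:
            Elements byColor = doc.select("table[bgcolor=#c9d0e0]");

            // Or rely on stable text content that appears on every row:
            Elements byText = doc.select("table:contains(More Info)");

            for (Element table : byColor) {
                // Extract rows/cells here and map them to your Java objects.
                System.out.println(table.select("tr").size() + " rows");
            }
        }
    }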

Using a common query convention for multiple search fields

Imagine that I am building a hashtag search. My main indexed type is called Post, which has a list of Hashtag items, which are marked as IndexedEmbedded. Separately, every post has a list of Comment objects, each of which, again, contains a list of Hashtag objects.
On the search side, I am using a MultiFieldQueryParser, to which I pass a long list of possible search fields, including some nested fields like:
hashTags.value and
comments.hashTags.value
Now, the interesting thing happens when I want to search for something, say #architecture. I figure out where the hashtags are, so the simplest logical thing to do would be to convert a query of the type #architecture, into one of the type hashTags.value:architecture or comments.hashTags.value:architecture Although possible, this is very inflexible. What if I come up with yet another field that contains hashtags? I'd have to include that too.
Is there a general way to do this?
P.S. Please have in mind that the root type I am searching for is Post, because this is the kind of results I'd like to achieve
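For context, the query-side setup described above looks roughly like this (field list abbreviated, recent Lucene signatures assumed):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
    import org.apache.lucene.search.Query;

    public class HashtagSearch {
        public static Query build(String input) throws Exception {
            // Every field that may contain hashtags has to be listed explicitly,
            // which is the inflexibility the question describes.
            String[] fields = { "hashTags.value", "comments.hashTags.value" };
            MultiFieldQueryParser parser =
                    new MultiFieldQueryParser(fields, new StandardAnalyzer());
            return parser.parse(input);
        }
    }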
Hashtags are keywords, and you should let Lucene handle the text analysis to extract the hashtags from your main text and store them in a custom field.
You can do this very easily with Hibernate Search by defining your text to be indexed in two different @Field mappings (using the @Fields annotation). You could have one field named comments and another commentsHashtags.
You then apply a custom Analyzer to commentsHashtags which does some standard tokenization and discards any term not starting with #; you can define one easily by taking a standard tokenizer and applying a custom filter.
When you run a query, you don't have to write custom code to look for hashtags in the query input; let it be processed by the same Analyzer (which is the default anyway) and target both fields. You can even boost the hashtags more if that makes sense.
With this solution you:
- take advantage of the high efficiency of Search's text analysis
- avoid entities and tables in the database containing the hashtags: useless overhead
- avoid messing with free-text extraction
It gets you another strong win:
you can then open a raw IndexReader and load the term vector from commentsHashtags to get both a list of all used tags and metrics about them. Cool for some data mining, or just to visualize a tag cloud.
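An illustrative sketch of such an analyzer in plain Lucene terms (Lucene 5/6 signatures assumed; a whitespace tokenizer is used instead of the standard one so the leading '#' survives tokenization; with Hibernate Search you would wire the equivalent up via @AnalyzerDef):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.FilteringTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class HashtagAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new WhitespaceTokenizer(); // keeps the leading '#'
            TokenStream stream = new LowerCaseFilter(source);
            stream = new HashtagOnlyFilter(stream);       // discard non-hashtag terms
            return new TokenStreamComponents(source, stream);
        }

        private static final class HashtagOnlyFilter extends FilteringTokenFilter {
            private final CharTermAttribute term = addAttribute(CharTermAttribute.class);

            HashtagOnlyFilter(TokenStream in) {
                super(in);
            }

            @Override
            protected boolean accept() {
                // Keep only terms that start with '#' and have content after it.
                return term.length() > 1 && term.charAt(0) == '#';
            }
        }
    }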
Instead of treating the fields as different and the top-level document as Post, why not store both Posts and Comments as Lucene documents? That way, you can just have a single field called "hashtags" that you search. You should also have a field called "type" or something to differentiate between comments and posts.
Search results may be either comments or posts. You can filter by type if users want to search only posts or comments, or you can show them differently in your UI.
If you want to add another concept that also uses hashtags (like ... I dunno ... splanks, or whatever silly name we all give to Internet communications in the future), then you can add it alongside the existing Post and Comment documents simply by indexing your new type with a "hashtags" field. You'll have to do plenty of work to add the splanks anyway, so adding a handler for that new type of search result shouldn't be too much of an inconvenience.
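In raw Lucene terms, the suggestion boils down to something like this (field names illustrative, recent Lucene assumed):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class HashtagIndexing {
        // Comments and posts become separate documents sharing a "hashtags" field.
        static void indexComment(IndexWriter writer, String postId, String hashtags)
                throws Exception {
            Document doc = new Document();
            doc.add(new StringField("type", "comment", Store.YES));
            doc.add(new StringField("postId", postId, Store.YES)); // link to the parent Post
            doc.add(new TextField("hashtags", hashtags, Store.YES));
            writer.addDocument(doc);
        }

        // One field to search; the type clause filters posts vs. comments.
        static Query postsTagged(String tag) {
            return new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("hashtags", tag)), Occur.MUST)
                    .add(new TermQuery(new Term("type", "post")), Occur.FILTER)
                    .build();
        }
    }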

Writing one file per group in Pig Latin

The Problem:
I have numerous files that contain Apache web server log entries. The entries are not in date-time order and are scattered across the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by date and time, then write them to files named for the day and hour of the entries they contain.
Setup:
Once I have imported my files, I am using Regex to get the date field, then I am truncating it to hour. This produces a set that has the record in one field, and the date truncated to hour in another. From here I am grouping on the date-hour field.
First Attempt:
My first thought was to use the STORE command while iterating through my groups using a FOREACH, and I quickly found out that Pig is not cool with that.
Second Attempt:
My second try was to use the MultiStorage() method from the piggybank, which worked great until I looked at the output. The problem is that MultiStorage wants to write all fields to the file, including the field I grouped on. What I really want is just the original record written to the file.
The Question:
So...am I using Pig for something it is not intended for, or is there a better way for me to approach this problem using Pig? Now that I have this question out there, I will work on a simple code example to further explain my problem. Once I have it, I will post it here. Thanks in advance.
Out of the box, Pig doesn't have a lot of functionality. It does the basic stuff, but more often than not I find myself having to write custom UDFs or load/store funcs to get from 95% of the way there to 100%. I usually find it worth it, since writing a small store function is a lot less Java than a whole MapReduce program.
Your second attempt is really close to what I would do. You should either copy/paste the source code of MultiStorage or use inheritance as a starting point. Then modify the putNext method to strip out the group value but still write to that file. Unfortunately, Tuple doesn't have a remove or delete method, so you'll have to rebuild the entire tuple. Or, if all you have is the original string, just pull that out and output it wrapped in a Tuple.
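A rough sketch of that putNext change, assuming you have copied the MultiStorage source into your own class (field names like splitFieldIndex and writer follow the piggybank source, but may differ between Pig versions):

    // Modified putNext inside your copied MultiStorage class.
    // Needs org.apache.pig.data.Tuple, org.apache.pig.data.TupleFactory, java.io.IOException.
    @Override
    public void putNext(Tuple tuple) throws IOException {
        // The group field is still needed to pick the output file...
        String key = String.valueOf(tuple.get(splitFieldIndex));

        // ...but since Tuple has no remove(), rebuild the tuple without it
        // so only the original record fields get written.
        Tuple stripped = TupleFactory.getInstance().newTuple();
        for (int i = 0; i < tuple.size(); i++) {
            if (i != splitFieldIndex) {
                stripped.append(tuple.get(i));
            }
        }
        try {
            writer.write(key, stripped); // same file routing, record only
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }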
Some general documentation on writing Load/Store functions in case you need a bit more help: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions
