UIMA for structured data

I am new to UIMA.
I want to connect to a database, extract data, process it with the UIMA regex annotator, and write the results back to the database.
Example:
Table: emp
Name   Department   EmpId
AB-C   Sale's       2134[3]
XYZ,   Fina&nce     23423
PQ#R   Marketing    234(47
To be transformed using the UIMA regex annotator.
Desired output:
Name   Department   EmpId
ABC    Sales        21343
XYZ    Finance      23423
PQR    Marketing    23447
I have installed UIMA, Eclipse, and the relevant JDBC drivers to connect to the database.
Thanks in advance.

There are a couple of ways to achieve this.
The simplest (though not very extensible) way is to write three classes (use uimaFIT, http://uima.apache.org/uimafit.html#Documentation, to make coding easier):
CollectionReader:
- read all the data into objects
- iterate over the objects and create a JCas from each one; you can store the primary key in an annotation
Analysis Engine:
- use the UIMA regex annotator to manipulate the JCas's document text
Consumer:
- read the JCas document text and use the primary key to update the database
A better way is to abstract the reading and writing by creating an external resource (http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#ugr.tools.uimafit.externalresources) that connects to the database. Give it hasNext() and next() methods, which are very convenient to use from the CollectionReader and the Consumer. This has the advantage that all initialisation logic is isolated in one place. With uimaFIT, you can use configuration parameter injection (http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#ugr.tools.uimafit.configurationparameters), for example to make the connection string and the search query configurable.
Use the SimplePipeline class in uimaFIT to run your pipeline: http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#ugr.tools.uimafit.pipelines
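The cleanup step itself, independent of UIMA, can be sketched with plain java.util.regex. This is only a rough illustration and not the UIMA regex annotator: the class name EmpCleaner is hypothetical, and the rule "drop every character that is not an ASCII letter or digit" is an assumption inferred from the sample rows in the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of UIMA or uimaFIT.
final class EmpCleaner {
    // Assumption: anything that is not an ASCII letter or digit is noise.
    private static final Pattern NOISE = Pattern.compile("[^A-Za-z0-9]");

    private EmpCleaner() {}

    // Returns the value with all noise characters removed.
    static String clean(String raw) {
        return NOISE.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample rows copied from the question's emp table.
        String[][] rows = {
            {"AB-C", "Sale's", "2134[3]"},
            {"XYZ,", "Fina&nce", "23423"},
            {"PQ#R", "Marketing", "234(47"}
        };
        for (String[] row : rows) {
            System.out.println(clean(row[0]) + " " + clean(row[1]) + " " + clean(row[2]));
        }
    }
}
```

In a pipeline like the one described above, a method such as clean would be applied to each column value before the Consumer writes the row back using the stored primary key.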

Related

Spring JPA Specification - Generate a Plain SQL Where-Clause

We have Spring/Hibernate code that generates a Spring Data JPA Specification object. This works well (we can use the Specification object to do things like get a count). Is there a way to get a plain SQL where-clause from the Specification object?
The current code is:
// The line below builds the spec based on business logic. Won't go in the details here, but it is working.
Specification<Crash> querySpec = buildQuerySpec(query);
long count = myDataRepository.count(querySpec);
// Here is what I need: a simple plain T-SQL / Microsoft SQL Server where-clause to be used on other disconnected systems. So something like this:
String whereClause = querySpec.getPlainSQLWhereClause(...); // e.g. weather in ("raining", "cold") and year in (2014, 2015)
Reason: the Specification object (and its related repository) is used in the main system. We also have other, completely separate enterprise systems (an Esri GIS system) whose APIs only accept a plain SQL where-clause.
P.S. I know there's not much code to work with, so some pointers/guides will be much appreciated.

How can I list database tables in a set of databases using the new DBCPConnectionPoolLookup in NiFi?

As of NiFi 1.7.1, the new DBCPConnectionPoolLookup enables dynamic selection of database connections: set an attribute database.name on a FlowFile and when a consuming processor accesses a configured DBCPConnectionPoolLookup controller service, the content of that attribute will be used to get a connection through this lookup's configured properties, which contain a mapping of potential values to DBCPConnectionPool controller service.
I'd like to list the tables in each database that I've configured in the lookup, but the ListDatabaseTables processor does not accept incoming FlowFiles. This seems to mean that it's not usable for listing tables in a dynamic set of databases.
What is the best way to accomplish this?

ListDatabaseTables uses the JDBC API for getting table info from the metadata of an established JDBC connection. This hides the underlying method of how to actually get tables from a particular database.
If all your databases are of the same type and you have a list of them, you could generate one flow file per database, fill in the database.name attribute, and then use ExecuteSQL with the DBCPConnectionPoolLookup to execute the SQL statement that lists the tables for that database, such as SHOW TABLES. You can parse the records using any of the record-aware processors such as QueryRecord, UpdateRecord, ConvertRecord, etc., and if you need one table per flow file you can use SplitRecord. If the output is JSON, CSV, or XML, you could use EvaluateJsonPath, ExtractText, or EvaluateXPath respectively to get the table name into an attribute, and continue on from there.
I wrote up NIFI-5519 to cover the proposal for ListDatabaseTables to optionally accept incoming connections. In the meantime, you'd need one ListDatabaseTables instance for each of your DBCPConnectionPool instances.

Using $addToSet with Java Morphia aggregation

I have a MongoDB aggregation query that works perfectly in the shell.
How can I rewrite this query to use it with Morphia?
org.mongodb.morphia.aggregation.Group.addToSet(String field) accepts only a single field name, but I need to add an object to the set.
Query:
......aggregate([
{$group:
{"_id":"$subjectHash",
"authors":{$addToSet:"$fromAddress.address"},
---->> "messageDataSet":{$addToSet:{"sentDate":"$sentDate","messageId":"$_id"}},
"messageCount":{$sum:1}}},
{$sort:{....}},
{$limit:10},
{$skip:0}
])
Java code:
AggregationPipeline aggregationPipeline = myDatastore.createAggregation(Message.class)
.group("subjectHash",
grouping("authors", addToSet("fromAddress.address")),
--------??????------>> grouping("messageDataSet", ???????),
grouping("messageCount", new Accumulator("$sum", 1))
).sort(...)).limit(...).skip(...);

That's currently not supported, but if you file an issue I'd be happy to include it in an upcoming release.

Thanks for your answer. I had guessed as much from the source code. :(
I don't want to use spring-data or the java-driver directly (for this project), so I changed my document representation.
I added a messageDataSet object that contains sentDate and messageId (and some other nested objects). These values are now duplicated in the document, which is bad design.
The aggregation becomes: "messageDataSet":{$addToSet:"$messageDataSet"},
and the Java code is: grouping("messageDataSet", addToSet("messageDataSet")),
This works with Morphia. Thanks.

How to store all user activities in a website?

I have a web application built in Django + Python that interacts with web services (written in Java).
All the database management is done by the web services, i.e. all CRUD operations on the actual database go through them.
Now I have to track all user activities on my website in a log table.
For example, if a user posts a new article, the web services create a new row in the Articles table, and alongside that I need to add a new row to the log table, something like "User : Raman has posted a new article (with ID, title etc)".
I have to do this for all objects in my database, such as "Article", "Media", "Comments", etc.
Note: I am using PostgreSQL.
So what is the best way to achieve this? Should I do it in PostgreSQL or in Java, and how?

So, you have UI <-> Web Services <-> DB
Since the web services talk to the DB, and the web services contain the business logic (i.e. I guess you validate stuff there, create your queries and execute them), then the best place to 'log' activities is in the services themselves.
IMO, logging PostgreSQL transactions is a different thing. It's not the same as logging 'user activities' anymore.
EDIT: This still means you create DB schema for 'logs' and write them to DB.
Second EDIT: Catching log-worthy events in the UI and then logging them from there might not be the best idea either. You would have to rewrite the logging if you ever decided to replace the UI or to write an alternate UI, for example for mobile devices.

For an audit table within the DB itself, have a look at the PL/pgSQL Trigger Audit Example
This logs every INSERT, UPDATE, DELETE into another table.

In your log table you can have various columns, including:
user_id (the user that did the action)
activity_type (the type of activity, such as view or commented_on)
object_id (the actual object that it concerns, such as the Article or Media)
object_type (the type of object; this can be used later, in combination with object_id to lookup the object in the database)
This way, you can keep track of all actions the users do. You'd need to update this table whenever something happens that you wish to track.
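Since the web services in the question are written in Java, the columns above could be mirrored there roughly as follows. This is a minimal sketch under assumptions: the class name ActivityLogEntry, the activity_log table name, and the numeric ID types are all hypothetical and not taken from the question.

```java
import java.util.Objects;

// Hypothetical value class mirroring the columns listed above.
final class ActivityLogEntry {
    final long userId;          // user_id: the user that did the action
    final String activityType;  // activity_type: e.g. "posted" or "commented_on"
    final long objectId;        // object_id: the object the action concerns
    final String objectType;    // object_type: e.g. "Article" or "Media"

    ActivityLogEntry(long userId, String activityType, long objectId, String objectType) {
        this.userId = userId;
        this.activityType = Objects.requireNonNull(activityType);
        this.objectId = objectId;
        this.objectType = Objects.requireNonNull(objectType);
    }

    // Parameterized INSERT for an assumed activity_log table;
    // bind userId, activityType, objectId, objectType in that order.
    static String insertSql() {
        return "INSERT INTO activity_log (user_id, activity_type, object_id, object_type) "
             + "VALUES (?, ?, ?, ?)";
    }

    // Human-readable summary of the entry.
    String describe() {
        return "User " + userId + " " + activityType + " " + objectType + " " + objectId;
    }
}
```

A service method that creates an Article could build one entry and execute the INSERT through a JDBC PreparedStatement in the same transaction, so the log row and the article row succeed or fail together.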

Whenever we had to do this, we overrode signals for every model and possible action.
https://docs.djangoproject.com/en/dev/topics/signals/
You can have the signal do whatever you want, from injecting some HTML into the page, to making an entry in the database. They're an excellent tool to learn to use.

I used django-audit-log and I am very satisfied.
django-audit-log can track multiple models, each in its own additional table. All of these tables are structured alike, so it should be fairly straightforward to create a SQL view that shows data for all models.
Here is what I've done to track a single model ("Pauza"):
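from django.db import models
from audit_log.models.managers import AuditLog

class Pauza(models.Model):
    started = models.TimeField(null=True, blank=False)
    ended = models.TimeField(null=True, blank=True)
    #... more fields ...
    # manager from django-audit-log that records every change to this model
    audit_log = AuditLog()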
If you want changes to show in Django Admin, you can create an unmanaged model (but this is by no means required):
class PauzaAction(models.Model):
    started = models.TimeField(null=True, blank=True)
    ended = models.TimeField(null=True, blank=True)
    #... more fields ...
    # fields added by Audit Trail:
    action_id = models.PositiveIntegerField(primary_key=True, default=1, blank=True)
    action_user = models.ForeignKey(User, null=True, blank=True)
    action_date = models.DateTimeField(null=True, blank=True)
    action_type = models.CharField(max_length=31, choices=(('I', 'create'), ('U', 'update'), ('D', 'delete'),), null=True, blank=True)
    pauza = models.ForeignKey(Pauza, db_column='id', on_delete=models.DO_NOTHING, default=0, null=True, blank=True)

    class Meta:
        db_table = 'testapp_pauzaauditlogentry'
        managed = False
        app_label = 'testapp'
The table testapp_pauzaauditlogentry is created automatically by django-audit-log; the class above merely defines a model for displaying data from it.
It may be a good idea to throw in some crude tamper protection:
class PauzaAction(models.Model):
    # ... all like above, plus:
    def save(self, *args, **kwargs):
        raise Exception('Permission Denied')

    def delete(self, *args, **kwargs):
        raise Exception('Permission Denied')
As I said, I imagine you could create a SQL view with the four action_ fields and an additional 'action_model' field containing a varchar reference to the model itself (maybe just the original table name).

RESTful Web Services - Maintaining Foreign Keys

Am I misunderstanding a basic concept of RESTful web services? I have an Android app from which I am trying to do a RESTful PUT. There are two MySQL tables, Country and StateProvince, with countryId as a foreign key on the StateProvince table.
If I try to do a PUT to StateProvince using the following:
<StateProvince><stateName>Victoria</stateName><countryId>1</countryId></StateProvince>
I get the error below. Am I misunderstanding a basic concept regarding foreign keys and Rest?
Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.2.0.v20110202-r8913): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Column 'country_id' cannot be null
Error Code: 1048
Call: INSERT INTO state_province (state_id, state_name, country_id) VALUES (?, ?, ?)
bind => [3 parameters bound]
Query: InsertObjectQuery(com.pezoot.models.StateProvince[ stateId=2 ])

Short Answer: country_id is null, so this looks like a database/persistence issue. You probably didn't set the Country for the StateProvince (or add the StateProvince to the Country - I haven't seen your code, so I don't know how you're mapping things).
Long Answer:
Why is there a database identifier coming in as part of your HTTP request?
You need to start thinking in terms of URIs and resources. Your StateProvince representation should have some kind of link that relates it to a country at a particular URI (e.g. <link rel="country" href="/country/1" />), and in the resource class that handles the PUT verb you need to be able to convert that URI into a domain object: an entity (as it seems you're using EclipseLink) on which you can call a setter to establish the database relation. The REST relationship and the database relationship are fundamentally different.
It takes practice and careful thinking to map what seems like a simple concept (HTTP verbs) onto your persistence unit. Something that seems straightforward, like PUT, requires nontrivial processing in order to work as REST would expect.
It is tempting to use database identifiers in URIs because it is easy (especially if you use subresources that just happen to magically know who their parent is, e.g. country/1/stateprovince/2), but you have to step back and ask yourself: is it country/1 or is it country/usa? You also have to ask yourself whether the country and the state/province are really entities, or just value objects. Do you really intend to PUT a State/Province in its entirety?
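The step of turning a link such as href="/country/1" back into something the persistence layer understands can be sketched as follows. This is a rough illustration under assumptions: the class name CountryLinkResolver is hypothetical, and it assumes country URIs have the form /country/{numeric id}, which is only one of the options discussed above.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for a resource class that handles the PUT.
final class CountryLinkResolver {
    // Assumption: country resources live at /country/{numeric id}.
    private static final Pattern COUNTRY_PATH = Pattern.compile("/country/(\\d+)");

    private CountryLinkResolver() {}

    // Extracts the numeric id from a relative or absolute country link.
    static long countryIdFromHref(String href) {
        String path = URI.create(href).getPath();
        Matcher m = COUNTRY_PATH.matcher(path == null ? "" : path);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a country link: " + href);
        }
        return Long.parseLong(m.group(1));
    }
}
```

With the id in hand, the resource class could load the entity (for example with JPA's EntityManager.find, assuming a Country entity class) and pass it to the StateProvince setter before persisting.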

Thanks guys.
Once again (unsurprisingly) it appears to be a syntax error. When I used the following:
<stateProvince>
    <countryId>
        <countryId>1</countryId>
    </countryId>
    <stateName>New South Wales</stateName>
</stateProvince>
Hey presto, it works. Previously I had failed to nest the countryId as shown above (see old code below):
<stateProvince>
    <countryId>1</countryId>
    <stateName>New South Wales</stateName>
</stateProvince>
And thanks Doug, your response is the sort of insight I am seeking. I don't believe I have quite wrapped my head around the use of links as you describe, but I will investigate further now.
