Best practice for storing languages in a database - java

I'm currently working on a project for university. Basically, the task is to implement a basic library management tool (in Java, using the Spring Framework and the Java Persistence API). Part of the software requirement is to be able to add and modify book entries and store them in a database. A book
has multiple properties (title, publication date, ...) and also a specific language. Now, my question is: How do I (neatly) implement that whole concept of "language"? I came up with some ideas, each with its own benefits and trade-offs:
Idea 1:
User Input: plain text field
Database: store language as a string (as an attribute of the book relation)
Pros: very simple to implement
Cons: languages might not be uniform ("english", "en", "English", "ENGLISH", ...), lots of room for human error (typos, ...)
Idea 2:
User Input: drop-down / combo box
Database: predefined language relation, containing all possible languages (e.g. language_id as a foreign key in the book relation)
Pros: languages are uniform (e.g. "English", "German", "Italian", ...), little room for human error
Cons: what is the set of all possible languages? Should/Can some languages be omitted? Is this overkill? Is there too much overhead? What format of languages should be used (human-readable: "English", "english", or more compact: "en", "en-us", ...) ?
Idea 3:
User Input: drop-down / combo box
Source Code: as in idea 2, but now hardcode all possible languages as enumerations in source code
Database: convert the specific enumeration to either a string/ordinal and use it as an attribute in the book relation
Pros: pros of Idea 2 + no hardcoding of language strings in source code (from if language = "english" to if language = Language.ENGLISH)
Cons: cons of Idea 2 + how are the enumerations mapped into the database (string, ordinal, separate relation)? The enumerations and the languages in the database must be "in sync".
To me, Idea 2 might be the "most reasonable", but I'm still not sure whether this is actually a "good" approach. Maybe you can help me out.

Separate presentation from business logic.
Use a standardized language code for your internal business logic and data storage. You have a choice of several. I would choose the code used by Java in its Locale class, if that covers your domain’s needs. For example, en for English and fr for French.
For presentation to the user, localize the display name of each language. When the user chooses a language during data-entry you translate that to a language code value for logic and storage.
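For example, a minimal sketch of how that might look as a JPA entity, storing just the two-letter code in a plain string column (the Book class and its field names here are only illustrative, not taken from the original question):

import java.util.Locale;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

// Hypothetical Book entity: persists the ISO 639-1 code ("en", "fr", ...)
// rather than a display name, keeping presentation separate from storage.
@Entity
public class Book {

    @Id
    @GeneratedValue
    private Long id;

    private String title;

    // Two-letter ISO 639-1 language code, e.g. "en", "de", "it".
    @Column(length = 2)
    private String languageCode;

    public String getLanguageCode() {
        return languageCode;
    }

    public void setLanguageCode(String languageCode) {
        this.languageCode = languageCode;
    }

    // Rebuild a Locale from the stored code when a display name is needed.
    public Locale getLanguageLocale() {
        return new Locale(languageCode);
    }
}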
Use a GUI widget that lets the user pick from a list. For a long list, usually best to use a widget that allows for type-ahead to pick an item as the user types the first few letters of the name.
Get an array of all known Locale objects by calling Locale.getAvailableLocales. Loop over those to build a Set, giving a distinct list of language codes.
Or call Locale.getISOLanguages to get a list of all 2-letter language codes defined in ISO 639-1. The Locale class also offers the three-letter ISO 639-2 language codes. I am not an expert here, so I do not know the practical difference, but we can see this list of 3 & 2 letter codes defined by ISO 639-1 and 639-2.
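For example, a small sketch of both approaches, using nothing beyond the standard Locale API:

import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class LanguageCodes {
    public static void main(String[] args) {
        // Approach 1: collect the distinct language codes of all known locales.
        Set<String> fromLocales = new TreeSet<>();
        for (Locale locale : Locale.getAvailableLocales()) {
            if (!locale.getLanguage().isEmpty()) {
                fromLocales.add(locale.getLanguage());
            }
        }
        System.out.println(fromLocales);

        // Approach 2: all two-letter codes defined in ISO 639-1.
        Set<String> iso639_1 = new TreeSet<>(Arrays.asList(Locale.getISOLanguages()));
        System.out.println(iso639_1);
    }
}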
To get the localized name of a language, pass the user’s preferred Locale object to the Locale::getDisplayLanguage method.
String displayNameOfFrenchLanguageForJapaneseUser = Locale.FRENCH.getDisplayLanguage( Locale.JAPAN );
System.out.println( "displayNameOfFrenchLanguageForJapaneseUser = " + displayNameOfFrenchLanguageForJapaneseUser );

Output: displayNameOfFrenchLanguageForJapaneseUser = フランス語

And for a German user:

String displayNameOfFrenchLanguageForGermanUser = Locale.FRENCH.getDisplayLanguage( Locale.GERMAN );
System.out.println( "displayNameOfFrenchLanguageForGermanUser = " + displayNameOfFrenchLanguageForGermanUser );

Output: displayNameOfFrenchLanguageForGermanUser = Französisch
As for defining enums: an enum is defined at compile time. An enum cannot be redefined with additional or fewer items at runtime. Locales and language codes, in contrast, do change occasionally. Upgrading your deployment to a different version of Java may bring changes in its known locales. So I would lean towards soft-coding. If your domain applies to a specific set of languages that are unlikely to change, say the Romance languages of Western Europe, then hard-coding with an enum might be appropriate.
If you need to track dialects rather than mere broad languages, such as Québec French versus French in general, then you may want to learn about the Common Locale Data Repository (CLDR) now bundled with Java implementations built from OpenJDK.

Related

What are trained models in NLP?

I am new to natural language processing. Can anyone tell me what the trained models in either OpenNLP or Stanford CoreNLP are? While coding in Java using the Apache OpenNLP package, we always have to include some trained models (found here: http://opennlp.sourceforge.net/models-1.5/). What are they?
A "model" as downloadable for OpenNLP is a set of data representing a set of probability distributions used for predicting the structure you want (e.g. part-of-speech tags) from the input you supply (in the case of OpenNLP, typically text files).
Given that natural language is context-sensitive†, this model is used in lieu of a rule-based system because it generally works better than the latter for a number of reasons which I won't expound here for the sake of brevity. For example, as you already mentioned, the token perfect could be either a verb (VB) or an adjective (JJ) and this can only be disambiguated in context:
This answer is perfect — for this example, the following sequences of POS tags are possible (in addition to many more‡):
1. DT NN VBZ JJ
2. DT NN VBZ VB
However, according to a model which accurately represents ("correct") English§, the probability of example 1 is greater than that of example 2: P([DT, NN, VBZ, JJ] | ["This", "answer", "is", "perfect"]) > P([DT, NN, VBZ, VB] | ["This", "answer", "is", "perfect"])
†In reality, this is quite contentious, but I stress here that I'm talking about natural language as a whole (including semantics/pragmatics/etc.) and not just about natural-language syntax, which (in the case of English, at least) is considered by some to be context-free.
‡When analyzing language in a data-driven manner, in fact any combination of POS tags is "possible", but, given a sample of "correct" contemporary English with little noise, tag assignments which native speakers would judge to be "wrong" should have an extremely low probability of occurrence.
§In practice, this means a model trained on a large, diverse corpus of (contemporary) English (or some other target domain you want to analyze) with appropriate tuning parameters (If I want to be even more precise, this footnote could easily be multiple paragraphs long).
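To make that concrete, here is a rough sketch of loading one of the downloaded model files and tagging the example sentence with OpenNLP; the file name en-pos-maxent.bin matches the 1.5 models page linked in the question, and the sentence is assumed to be tokenized already:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class PosTagExample {
    public static void main(String[] args) throws Exception {
        // Load the pre-trained English POS model downloaded from the models page.
        try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
            POSModel model = new POSModel(modelIn);
            POSTaggerME tagger = new POSTaggerME(model);

            // The model picks the most probable tag sequence in context,
            // so "perfect" should come out as JJ here rather than VB.
            String[] tokens = { "This", "answer", "is", "perfect" };
            String[] tags = tagger.tag(tokens);

            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}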
Think of a trained model as a "wise brain with existing information".
When you start out in machine learning, the brain for your model is clean and empty. You can either download a trained model or train your own model (like teaching a child).
Usually you only train your own models for edge cases; otherwise you download "trained models" and get to work on prediction/machine learning.

Formal notation for code translations by compiler

I'm designing a DSL which translates to Java source code. Are there notations that are commonly used to specify the semantics/translation of a compiler?
Example:
DSL:
a = b = c = 4
Translates into:
Integer temp0 = 4;
Integer a = temp0;
Integer b = temp0;
Integer c = temp0;
Thanks in advance,
Jeroen
Pattern-matching languages can be used to formalise small tree transforms. For an example of such a DSL, take a look at the Nanopass framework. A more generic approach is to think of the tree transforms as a form of term rewriting.
Such transforms are formal enough, e.g., they can be certified, as in CompCert.
There are formal languages to define semantics; you can see such languages and definitions in almost any technical paper in conference proceedings on programming languages. Texts on the topic are available (https://mitpress.mit.edu/.../semantics-programming-languages); you need to have some willingness to read concise mathematical notation.
As a practical matter, these semantics are not used to drive translations/compilers; this is still a research topic. See http://www.andrew.cmu.edu/user/asubrama/dissertation.pdf. To read these you typically need to have spent some time reading introductory texts such as the above.
There has been more practical work on defining translations; the most practical are program transformation systems.
With such tools, one can write transformation rules using the notation of the source language (e.g., your DSL) and the notation of the target language (e.g., Java or assembler or whatever), of the form:
replace source_language_fragment by target_language_fragment if condition
These tools are driven by grammars for the source and target languages, and interpret the transformation rules from their readable form into AST-to-AST rewrites. Fully translating a complex DSL to another language typically requires hundreds of rules, but a key point is that they are much more easily read than the procedural code typical of hand-written translators.
Trying to follow the OP's example, assuming one has grammars for the OP's "MyDSL" and for Java as a target, and using our DMS Software Reengineering Toolkit's style of transformation rules:
source domain dsl;
target domain Java;
rule translate_single_assignment(t: dsl_IDENTIFIER, e: dsl_expression):
" \t = \e " -- MyDSL syntax
-> -- read as "rewrites to"
" int \JavaIdentifier\(\t\)=\e;
".
rule translate_multi_assignment(t1: dsl_IDENTIFIER, t2: dsl_IDENTIFIER, e: dsl_expression):
" \t1 = \t2 = \e " -- MyDSL syntax
-> -- read as "rewrites to"
" \>\dsl \t2 = \e \statement
int \t1;
\t1=\t2;
".
You need two rules: one for the base case of a simple assignment t=e, and one to handle the multiple-assignment case. The multiple-assignment rule peels off the outermost assignment, generates code for it, and inserts the remainder of the multiple assignment back in its original DSL form, to be reprocessed by one of the two rules.
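For contrast only, the same peel-off-the-outermost-assignment recursion can be sketched as plain procedural Java (a toy that assumes the DSL line has already been reduced to '='-separated identifiers and a final literal; it is not how DMS works, just an illustration of the two-rule idea):

import java.util.ArrayList;
import java.util.List;

// Toy illustration of the two-rule recursion above, done procedurally:
// translate("a = b = c = 4") peels off one assignment per call.
public class MultiAssignTranslator {

    public static List<String> translate(String dslAssignment) {
        List<String> javaLines = new ArrayList<>();
        String[] parts = dslAssignment.split("=");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        emit(parts, 0, javaLines);
        return javaLines;
    }

    // Base case: "t = e" becomes a single declaration.
    // Recursive case: "t1 = t2 = ... = e" translates the rest first, then declares t1.
    private static void emit(String[] parts, int index, List<String> out) {
        String target = parts[index];
        if (index == parts.length - 2) {
            out.add("Integer " + target + " = " + parts[index + 1] + ";");
            return;
        }
        emit(parts, index + 1, out);   // reprocess the shrinking assignment
        out.add("Integer " + target + " = " + parts[index + 1] + ";");
    }

    public static void main(String[] args) {
        translate("a = b = c = 4").forEach(System.out::println);
    }
}

Running it on a = b = c = 4 prints Integer c = 4;, Integer b = c;, Integer a = b;, mirroring the way the second rule keeps re-feeding the remaining assignment to the first.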
You can see another example of this used for refactoring (source_language == target_language) at https://stackoverflow.com/questions/22094428/programmatic-refactoring-of-java-source-files/22100670#22100670

Simple physical quantity measurement unit parser for Java

I want to be able to parse expressions representing physical quantities like
g/l
m/s^2
m/s/kg
m/(s*kg)
kg*m*s
°F/(lb*s^2)
and so on. In the simplest way possible. Is it possible to do so using something like Pyparsing (if such a thing exists for Java), or should I use more complex tools like Java CUP?
EDIT: To answer MrD's question, the goal is to make conversions between quantities, so for example convert g to kg (this one is simple...), or maybe °F/(kg*s^2) to K/(lb*h^2), supposing h is for hours and lb for pounds.
This is harder than it looks. (I have done a fair amount of work here.) The main problem is that there is no standard (I have worked with NIST on units, and although they have finally created a markup language, few people use it). So it's really a form of natural language processing and has to deal with:
ambiguity (what does "M" mean - meters or mega)
inconsistent punctuation
abbreviations
symbols (e.g. "mu" for micro)
unclear semantics (e.g. is kg/m/s the same as kg/(m*s)?)
If you are just creating a toy system then you should create a BNF for the system and make sure that all examples adhere to it. This will use common punctuation ("/", "*", "(", ")", "^"). Character fields can be of variable length ("m", "kg", "lb"). Algebra on these strings ("kg" -> 1000*"g") has problems, as kg is a fundamental unit.
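As a starting point for that toy-system route, here is a compact recursive-descent sketch in plain Java that reduces an expression such as m/(s*kg) or °F/(lb*s^2) to a map from unit symbol to exponent; the grammar and the symbol handling are my own assumptions, not any standard:

import java.util.LinkedHashMap;
import java.util.Map;

// Tiny recursive-descent parser for unit expressions:
//   expr   := factor (('*' | '/') factor)*
//   factor := unit ('^' integer)? | '(' expr ')'
// The result maps each unit symbol to its integer exponent.
public class UnitParser {
    private final String input;
    private int pos;

    public UnitParser(String input) {
        this.input = input.replaceAll("\\s+", "");
    }

    public Map<String, Integer> parse() {
        Map<String, Integer> exponents = new LinkedHashMap<>();
        expr(exponents, 1);
        return exponents;
    }

    private void expr(Map<String, Integer> exps, int sign) {
        factor(exps, sign);
        while (pos < input.length() && (peek() == '*' || peek() == '/')) {
            char op = input.charAt(pos++);
            factor(exps, op == '*' ? sign : -sign);
        }
    }

    private void factor(Map<String, Integer> exps, int sign) {
        if (peek() == '(') {
            pos++;                 // consume '('
            expr(exps, sign);
            pos++;                 // consume ')'
            return;
        }
        // Unit symbol: everything up to the next operator or end of input.
        int start = pos;
        while (pos < input.length() && "*/^()".indexOf(peek()) < 0) {
            pos++;
        }
        String unit = input.substring(start, pos);
        int exponent = 1;
        if (pos < input.length() && peek() == '^') {
            pos++;
            int numStart = pos;
            while (pos < input.length() && Character.isDigit(peek())) {
                pos++;
            }
            exponent = Integer.parseInt(input.substring(numStart, pos));
        }
        exps.merge(unit, sign * exponent, Integer::sum);
    }

    private char peek() {
        return input.charAt(pos);
    }

    public static void main(String[] args) {
        System.out.println(new UnitParser("m/s^2").parse());        // {m=1, s=-2}
        System.out.println(new UnitParser("°F/(lb*s^2)").parse());  // {°F=1, lb=-1, s=-2}
        System.out.println(new UnitParser("kg*m*s").parse());       // {kg=1, m=1, s=1}
    }
}

Note that in this toy grammar kg/m/s and kg/(m*s) produce the same exponent map, which is one way to settle the semantic ambiguity mentioned above, but only by fiat.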
If you are doing it seriously then ANTLR (@Yaugen) is useful, but be aware that units in the wild will not follow a regular grammar due to the inconsistencies above.
If you are REALLY serious (i.e. prepared to put in a solid month), I'd be interested to know. :-)
My current approach (which is outside the scope of your question) is to collect a large number of examples from the literature automatically and create a number of heuristics.

ANTLR: Multiple ASTs using the same ambiguous grammar?

I'm building an ANTLR parser for a small query language. The query language is by definition ambiguous, and we need all possible interpretations (ASTs) to process the query.
Example:
query : CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN
| ANY_TOKEN UNCLASSIFIED_TOKEN
;
In this case, if the input matches both rules, I need to get 2 ASTs with both interpretations. ANTLR will return the first matched AST.
Do you know a simple way to get all possible ASTs for the same grammar? I'm thinking about running the parser multiple times, "turning off" already matched rules between iterations; this seems dirty. Is there a better idea? Maybe another lex/parser tool with Java support that can do this?
Thanks
If I were you, I'd remove the ambiguities. You can often do that by using contextual information to determine which grammar rules actually trigger. For instance, in
C* X;
in C (not your language, but this is just to make a point), you can't tell if this is just a pointless multiplication (legal to write in C), or a declaration of a variable X of type "pointer to C". So, there are two valid (ambiguous) parses. But if you know that C is a type declaration (from some context, perhaps an earlier code declaration), you can hack the parser to kill off the inappropriate choices and end up with just the one "correct" parse, no ambiguities.
If you really don't have the context, then you likely need a GLR parser, which will happily generate both parses in your final tree. I don't know of any available for Java.
Our DMS Software Reengineering Toolkit [not a Java-based product] has GLR parsing support, and we use it all the time to parse difficult languages with ambiguities. The way we handle the C example above is to produce both parses, because the GLR parser is happy to do this, and then, if we have additional information (such as a symbol table), post-process the tree to remove the inappropriate parses.
DMS is designed to support the customized analysis and transformation of arbitrary languages, such as your query language, and makes it easy to define the grammar. Once you have a context-free grammar (ambiguities or not), DMS can parse code and you can decide what to do later.
I doubt you're going to get ANTLR to return multiple parse trees without wholesale rewriting of the code.
I believe you're going to have to partition the ambiguities, each into its own unambiguous grammar and run the parse multiple times. If the total number of ambiguous productions is large you could have an unmanageable set of distinct grammars. For example, for three binary ambiguities (two choices) you'll end up with 8 distinct grammars, though there might be slightly fewer if one ambiguous branch eliminates one or more of the other ambiguities.
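A rough sketch of that partitioning approach, assuming ANTLR 4 and two hypothetical lexer/parser pairs (QueryV1Lexer/QueryV1Parser and QueryV2Lexer/QueryV2Parser) generated from the two disambiguated grammar variants, each exposing the same top-level rule query:

import java.util.ArrayList;
import java.util.List;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

// Runs each generated parser over the same input and collects every tree
// that parses without syntax errors, giving one tree per valid interpretation.
public class MultiParse {

    public static List<ParseTree> parseAllInterpretations(String query) {
        List<ParseTree> trees = new ArrayList<>();

        // Grammar variant 1 (e.g. the CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN alternative).
        QueryV1Parser p1 = new QueryV1Parser(
                new CommonTokenStream(new QueryV1Lexer(CharStreams.fromString(query))));
        ParseTree t1 = p1.query();
        if (p1.getNumberOfSyntaxErrors() == 0) {
            trees.add(t1);
        }

        // Grammar variant 2 (e.g. the ANY_TOKEN UNCLASSIFIED_TOKEN alternative).
        QueryV2Parser p2 = new QueryV2Parser(
                new CommonTokenStream(new QueryV2Lexer(CharStreams.fromString(query))));
        ParseTree t2 = p2.query();
        if (p2.getNumberOfSyntaxErrors() == 0) {
            trees.add(t2);
        }

        return trees;
    }
}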
Good luck

LibSVM Input format

I want to represent a set of labelled instances (data) in a file to be fed into LibSVM as training data, for the problem mentioned in this question. It will include:
Login date
Login time
Location (country code?)
Day of the week
Authenticity (0 - Non Authentic, 1 - Authentic) - The Label
How can I format this data to be input to the SVM?
Are you asking about the data format or how to convert the data? For the latter, you're going to have to experiment to find the right way to do this. The general idea is to convert your data into nominal or ordinal attributes. Some of these are simple (#4, #6); some of these are going to be tough (#1-#3).
For example, you could represent #1 as three attributes of day, month and year, or just one by converting it to a UNIX like timestamp.
The IP is even harder - there's no straightforward way to convert that into a meaningful ordinal value. Using every IP as a nominal attribute might not be useful depending on your problem.
Once you figure this out, convert your data and check the LibSVM docs. The general format is the label followed by index:value pairs, i.e., +1 1:0 2:0 .. etc.
I believe there is an unstated assumption in the previous answers. The unstated assumption is that users of libSVM know that they should avoid putting categorical data into the classifier.
For example, libSVM will not know what to do with country codes. If you are trying to predict which visitors are most likely to buy something on your site then you could have problems if USA is between Chad and Niger in your country code list. The bulge from USA will likely skew predictions for the countries located near it.
To fix this I would create one category for each country under consideration (and perhaps an 'other' category). Then for each instance you want to classify, set all the country categories to zero except the one to which the instance belongs. (With the libSVM sparse file format, this isn't really a big deal.)
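A small sketch of what writing one such training line might look like in Java; the feature layout (day of week first, then one indicator slot per country) is just an illustrative choice, not anything LibSVM prescribes:

import java.util.Arrays;
import java.util.List;

// Builds one libSVM-format training line: "<label> <index>:<value> ..." with
// 1-based feature indices and zero-valued features omitted (sparse format).
public class LibSvmLine {

    // Illustrative country list; the position in this list decides the one-hot slot.
    private static final List<String> COUNTRIES =
            Arrays.asList("US", "DE", "FR", "TD", "NE", "OTHER");

    public static String encode(int label, int dayOfWeek, String countryCode) {
        StringBuilder line = new StringBuilder();
        line.append(label > 0 ? "+1" : "-1");

        // Feature 1: day of week as an ordinal value (1..7).
        line.append(" 1:").append(dayOfWeek);

        // Features 2..N+1: one-hot country indicators; only the "1" is written.
        int idx = COUNTRIES.indexOf(countryCode);
        if (idx < 0) {
            idx = COUNTRIES.indexOf("OTHER");
        }
        line.append(" ").append(2 + idx).append(":1");

        return line.toString();
    }

    public static void main(String[] args) {
        // e.g. an authentic login on a Tuesday from Germany -> "+1 1:2 3:1"
        System.out.println(encode(1, 2, "DE"));
    }
}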
