I want to make a translator, e.g. English to Spanish.
I want to translate a large text using a map for the translation.
Map<String, String> hashmap = new HashMap<>();
hashmap.put("hello", "hola");
...
Which object should I use to handle my initial text of 1,000 words? Is a String or a StringBuilder fine?
How can I do a large replace without iterating over every word for every element of the map?
I don't want to take each word of the string and check whether there is a match in my map.
Maybe a multimap keyed on the first letter of the word?
If you have any answer or advice, thank you.
Here is an example implementation:
import java.io.*;
import java.util.*;

public class Translator {

    public enum Language {
        EN, ES
    }

    private static final String TRANSLATION_TEMPLATE = "translation_%s_%s.properties";

    private final Properties translations = new Properties();

    public Translator(Language from, Language to) {
        String translationFile = String.format(TRANSLATION_TEMPLATE, from, to);
        try (InputStream is = getClass().getResourceAsStream(translationFile)) {
            if (is == null) {
                // getResourceAsStream returns null when the file is missing
                throw new FileNotFoundException(translationFile);
            }
            translations.load(is);
        } catch (final IOException e) {
            throw new RuntimeException("Could not read: " + translationFile, e);
        }
    }

    private String[] translate(String text) {
        String[] source = normalizeText(text);
        List<String> translation = new ArrayList<>();
        for (String sourceWord : source) {
            translation.add(translateWord(sourceWord));
        }
        return translation.toArray(new String[source.length]);
    }

    private String translateWord(String sourceWord) {
        Object value = translations.get(sourceWord);
        String translatedWord;
        if (value != null) {
            translatedWord = String.valueOf(value);
        } else {
            // if no translation is found, add the source word with a question mark
            translatedWord = sourceWord + "?";
        }
        return translatedWord;
    }

    private String[] normalizeText(String text) {
        String alphaText = text.replaceAll("[^A-Za-z]", " ");
        return alphaText.split("\\s+");
    }

    public static void main(final String[] args) {
        final Translator translator = new Translator(Language.EN, Language.ES);
        System.out.println(Arrays.toString(translator.translate("hello world!")));
    }
}
Then put a file called 'translation_EN_ES.properties' on your classpath (e.g. src/main/resources) with:
hello=hola
world=mundo
If you know all the words beforehand, you can easily create a regex trie.
Then, at runtime, compile the regex once and you are good to go.
To create the regex, download and install RegexFormat 5 here.
From the main menu, select Tools -> Strings to Regex - Ternary Tree,
paste the word list into the input box, then press the Generate button.
It spits out a full regex trie that is as fast as any hash lookup there is.
Copy the compressed output from that dialog into an Rxform tab (MDI) window.
Right-click the window to get the context menu, select Misc Utilities -> Line Wrap,
set it to about a 60-character width, and press OK.
Next, press the C++ button on the window's toolbar to bring up the MegaString
dialog. Click "make C-style strings, Lines Catenated-1" and press OK.
Copy and paste the result into your Java source.
Use the regex in a replace-all with a callback.
In the callback, use the match as a key into your hash table to return the
translation to replace.
It's simple, one pass, and oh so fast.
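For illustration, here is a minimal sketch of that replace-all-with-callback idea in plain Java, using a tiny hand-written alternation in place of the generated trie (the word list is just the example from the question):

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrieReplace {
    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("hello", "hola");
        dict.put("world", "mundo");

        // Stand-in for the generated regex trie: one alternation over the keys.
        Pattern pattern = Pattern.compile("\\b(?:hello|world)\\b");

        Matcher m = pattern.matcher("hello world, hello again");
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The match itself is the key into the translation table.
            m.appendReplacement(out, Matcher.quoteReplacement(dict.get(m.group())));
        }
        m.appendTail(out);
        System.out.println(out); // hola mundo, hola again
    }
}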
For a more extreme example of the tool, see this regex of a 130,000-word dictionary.
Here is a sample for the letter X:
"(?:x(?:anth(?:a(?:m|n|te(?:s)?)|e(?:in|ne)|i(?:an|"
"c|n(?:e)?|um)|o(?:ma(?:s|ta)?|psia|us|xyl))|e(?:be"
"c(?:s)?|n(?:arthral|i(?:a(?:l)?|um)|o(?:biotic|cry"
"st(?:s)?|g(?:amy|enous|raft(?:s)?)|lith(?:s)?|mani"
"a|n|ph(?:ile(?:s)?|ob(?:e(?:s)?|ia|y)|ya)|time))|r"
"(?:a(?:fin(?:s)?|n(?:sis|tic)|rch|sia)|ic|o(?:derm"
"(?:a|i(?:a|c))|graphy|m(?:a(?:s|ta)?|orph(?:s)?)|p"
"h(?:agy|ily|yt(?:e(?:s)?|ic))|s(?:is|tom(?:a|ia))|"
"t(?:es|ic))))|i(?:pho(?:id(?:al)?|pag(?:ic|us)|sur"
"an))?|oan(?:a|on)|u|y(?:l(?:e(?:m|n(?:e(?:s)?|ol(?"
":s)?))|i(?:c|tol)|o(?:carp(?:s)?|g(?:en(?:ous)?|ra"
"ph(?:s|y)?)|id(?:in)?|l(?:ogy|s)?|m(?:a(?:s)?|eter"
"(?:s)?)|nic|ph(?:ag(?:an|e(?:s)?)|on(?:e(?:s)?|ic)"
")|rimba(?:s)?|se|tomous)|yl(?:s)?)|st(?:er(?:s)?|i"
"|o(?:i|s)|s|us)?)))"
I am new to Java (and to programming in general). I am working on a personal project where a user types a character that is converted to another character. More specifically, the user types a romanization of a Japanese character, and the equivalent Japanese hiragana is output. I am using two separate classes at the moment:
RomaHiraCore.java
import java.util.*;

public class RomaHiraCore
{
    public static void main(String[] args)
    {
        Table.initialize(); // Table.java needed!
        Map<String, String> table = Table.getTable();

        Scanner roma = new Scanner(System.in);
        System.out.println("Romaji: ");
        String romaji = roma.nextLine().toLowerCase();

        if (table.containsKey(roma))
        {
            System.out.println(table.get(roma));
        }
        else
        {
            System.out.println("Please enter a valid character (e. g. a, ka)");
        }
        roma.close();
    }
}
Table.java
import java.util.*;

public class Table
{
    private static Map<String, String> table = new LinkedHashMap<>();

    public static Map<String, String> getTable()
    {
        return table;
    }

    public static void initialize()
    {
        // a - o
        table.put("a", "あ");
        table.put("i", "い");
        table.put("u", "う");
        table.put("e", "え");
        table.put("o", "お");
        // ka - ko
        table.put("ka", "か");
        table.put("ki", "き");
        table.put("ku", "く");
        table.put("ke", "け");
        table.put("ko", "こ");
    }
}
If anyone can point me in the right direction, I would greatly appreciate it. I've tried to go over the documentation, but I can't seem to grasp it (maybe I'm overthinking it). When I run the program, it lets me enter a character; however, it always falls through to the "else" branch rather than checking whether the input matches any of the entries in the table. Either I'm overlooking something or I need an entirely different approach.
In your map you have String keys, and the String you provide is in the romaji variable, so your if should look like this: if (table.containsKey(romaji)). What's more, in this situation using a LinkedHashMap doesn't give you anything; a simple HashMap would be just as good (even better), because you don't need to maintain the insertion order of your characters.
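A sketch of the corrected check, reusing the variable names from the question:

String romaji = roma.nextLine().toLowerCase();
if (table.containsKey(romaji)) // look up the entered string, not the Scanner
{
    System.out.println(table.get(romaji));
}
else
{
    System.out.println("Please enter a valid character (e. g. a, ka)");
}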
In your main method, you store the string value input by the user, but you never access that variable anywhere else.
You wrote table.containsKey(roma); however, roma is the Scanner object, not the string the user entered, so you should check whether that string is a valid key by using table.containsKey(romaji).
Next, in your else clause you ask them to re-enter an input, but you never give them the chance to, because you just terminate the scanner.
What you should be doing is something more like this:
String romaji = roma.nextLine().toLowerCase();
while (true) {
    if (table.containsKey(romaji)) {
        System.out.println(table.get(romaji));
        break;
    }
    else {
        System.out.println("Enter a valid char:");
        romaji = roma.nextLine().toLowerCase();
    }
}
roma.close();
I'm looking for suggestions on how to go about validating input from a user. My assignment is to execute commands based on textual input from the user. My only concern is that there can be many acceptable variations of each command.
For example, these commands are all acceptable and do the same thing, "show the game board":
sh board,
sho board,
show board,
show bo,
sho bo,
sh bo
About 10 other commands share this same property, so I was wondering what the best practice would be for validating a user's input.
Should I store all the different combinations in a HashMap?
Look into regex (regular expressions). Regexes are great when you want to match values that are not necessarily complete.
For example:
Say I type "shutdo".
With regex you can make your program understand that anything starting with the string "shutd" means powerOff().
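A minimal sketch of that idea (powerOff() is just an illustrative name):

public class CommandMatcher {
    public static void main(String[] args) {
        String input = "shutdo";
        // Anything that begins with "shutd" counts as the shutdown command.
        if (input.matches("shutd\\w*")) {
            powerOff();
        }
    }

    private static void powerOff() {
        System.out.println("powering off...");
    }
}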
It looks like the minimum allowed command length is 2.
So first, check that the length of the term is at least 2.
Next, you can loop over the available commands and stop at the first one
that starts with the term, for example:
String term = "sho"; // the user's (possibly abbreviated) input
List<String> commands = Arrays.asList("show", "create", "delete");
for (String command : commands) {
    if (command.startsWith(term)) {
        // found a match, command is: command
        break;
    }
}
If the commands are very specific and limited, I would just add all of them to some data structure (a hash map being one of them).
If the problem is that you're supposed to understand what the user's input is meant to do, then I would say find the pattern using either regex or simple pattern validation (it looks like they're all two words, the first starting with "sh" and the second starting with "bo").
But honestly, ~15 commands aren't that big of a deal in terms of space/efficiency.
Edit:
"There are about 10 other commands that share this similar property"
If this means 10 more commands like "show board", then I would store them in a hash map. But if I misunderstood you and you mean that there are 10 other commands that do similar things ("set piece", "set pie", "se pi", etc.), then regex is the way to go.
If I understood you correctly, there are N distinct commands, which can be combined, and it should be allowed to abbreviate each command as long as it stays unambiguous.
If this is the case, the following methods expandCommands(String) and expandCommand(String) will normalize each command part.
import java.util.*;

public class Main {

    static Set<String> availableCommands = new HashSet<>(Arrays.asList(
            "show",
            "board",
            "btest"
    ));

    public static void main(String[] args) throws Exception {
        List<String> testData = Arrays.asList(
                "sh board",
                "sho board",
                "show board",
                "show bo",
                "sho bo",
                "sh bo"
        );
        String expected = "show board";
        for (String test : testData) {
            String actual = expandCommands(test);
            if (!expected.equals(actual)) {
                System.out.println(test + "\t" + actual);
            }
        }

        // "b" is ambiguous: it matches both "board" and "btest"
        try {
            expandCommands("sh b");
            throw new IllegalStateException();
        } catch (Exception e) {
            if (!"not unique command: b".equals(e.getMessage())) {
                throw new Exception();
            }
        }

        // "asd" matches no available command
        try {
            expandCommands("sh asd");
            throw new IllegalStateException();
        } catch (Exception e) {
            if (!"unknown command: asd".equals(e.getMessage())) {
                throw new Exception();
            }
        }
    }

    private static String expandCommands(String aInput) throws Exception {
        final String[] commandParts = aInput.split("\\s+");
        StringBuilder result = new StringBuilder();
        for (String commandPart : commandParts) {
            String command = expandCommand(commandPart);
            result.append(command).append(" ");
        }
        return result.toString().trim();
    }

    private static String expandCommand(final String aCommandPart) throws Exception {
        String match = null;
        for (String candidate : availableCommands) {
            if (candidate.startsWith(aCommandPart)) {
                if (match != null) {
                    throw new Exception("not unique command: " + aCommandPart);
                }
                match = candidate;
            }
        }
        if (match == null) {
            throw new Exception("unknown command: " + aCommandPart);
        }
        return match;
    }
}
The Set<String> availableCommands contains all possible commands.
Every part of the input command is checked to see whether it is the prefix of exactly one available command.
You can use regex matching to validate input. E.g., the pattern below will match anything that starts with sh followed by zero or more word characters, then a space, and then bo followed by zero or more word characters.
public class Validator {
    public static void main(String[] args) {
        String pattern = "sh[\\w]* bo[\\w]*";
        System.out.println(args[0].matches(pattern));
    }
}
I'm trying to build a text classifier using Weka, but the probabilities returned by distributionForInstance are 1.0 for one class and 0.0 for all the others, so classifyInstance always returns the same class as the prediction. Something in the training doesn't work correctly.
ARFF training data:
@relation test1
@attribute tweetmsg String
@attribute classValues {politica,sport,musicatvcinema,infogeneriche,fattidelgiorno,statopersonale,checkin,conversazione}
@data
"Renzi Berlusconi Salvini Bersani",politica
"Allegri insulta la terna arbitrale",sport
"Bravo Garcia",sport
Training methods
public void trainClassifier(final String INPUT_FILENAME) throws Exception
{
    getTrainingDataset(INPUT_FILENAME);
    // trainingInstances consists of the feature vector of every input
    for (Instance currentInstance : inputDataset)
    {
        Instance currentFeatureVector = extractFeature(currentInstance);
        currentFeatureVector.setDataset(trainingInstances);
        trainingInstances.add(currentFeatureVector);
    }
    classifier = new NaiveBayes();
    try {
        // classifier training code
        classifier.buildClassifier(trainingInstances);
        // storing the trained classifier to a file for future use
        weka.core.SerializationHelper.write("NaiveBayes.model", classifier);
    } catch (Exception ex) {
        System.out.println("Exception in training the classifier." + ex);
    }
}
private Instance extractFeature(Instance inputInstance) throws Exception
{
    String tweet = inputInstance.stringValue(0);
    StringTokenizer defaultTokenizer = new StringTokenizer(tweet);
    List<String> tokens = new ArrayList<String>();
    while (defaultTokenizer.hasMoreTokens())
    {
        String t = defaultTokenizer.nextToken();
        tokens.add(t);
    }

    Iterator<String> a = tokens.iterator();
    while (a.hasNext())
    {
        String token = (String) a.next();
        String word = token.replaceAll("#", "");
        if (featureWords.contains(word))
        {
            double cont = featureMap.get(featureWords.indexOf(word)) + 1;
            featureMap.put(featureWords.indexOf(word), cont);
        }
        else
        {
            featureWords.add(word);
            featureMap.put(featureWords.indexOf(word), 1.0);
        }
    }

    attributeList.clear();
    for (String featureWord : featureWords)
    {
        attributeList.add(new Attribute(featureWord));
    }
    attributeList.add(new Attribute("Class", classValues));

    int indices[] = new int[featureMap.size() + 1];
    double values[] = new double[featureMap.size() + 1];
    int i = 0;
    for (Map.Entry<Integer, Double> entry : featureMap.entrySet())
    {
        indices[i] = entry.getKey();
        values[i] = entry.getValue();
        i++;
    }
    indices[i] = featureWords.size();
    values[i] = (double) classValues.indexOf(inputInstance.stringValue(1));

    trainingInstances = createInstances("TRAINING_INSTANCES");
    return new SparseInstance(1.0, values, indices, 1000000);
}
private void getTrainingDataset(final String INPUT_FILENAME)
{
    try {
        ArffLoader trainingLoader = new ArffLoader();
        trainingLoader.setSource(new File(INPUT_FILENAME));
        inputDataset = trainingLoader.getDataSet();
    } catch (IOException ex) {
        System.out.println("Exception in getTrainingDataset Method");
    }
    System.out.println("dataset " + inputDataset.numAttributes());
}
private Instances createInstances(final String INSTANCES_NAME)
{
    // create an Instances object with initial capacity of zero
    Instances instances = new Instances(INSTANCES_NAME, attributeList, 0);
    // set the class index to the last attribute
    instances.setClassIndex(instances.numAttributes() - 1);
    return instances;
}
public static void main(String[] args) throws Exception
{
    Classificatore wekaTutorial = new Classificatore();
    wekaTutorial.trainClassifier("training_set_prova_tent.arff");
    wekaTutorial.testClassifier("testing.arff");
}

public Classificatore()
{
    attributeList = new ArrayList<Attribute>();
    initialize();
}

private void initialize()
{
    featureWords = new ArrayList<String>();
    featureMap = new TreeMap<>();
    classValues = new ArrayList<String>();
    classValues.add("politica");
    classValues.add("sport");
    classValues.add("musicatvcinema");
    classValues.add("infogeneriche");
    classValues.add("fattidelgiorno");
    classValues.add("statopersonale");
    classValues.add("checkin");
    classValues.add("conversazione");
}
Testing methods
public void testClassifier(final String INPUT_FILENAME) throws Exception
{
    getTrainingDataset(INPUT_FILENAME);
    // testingInstances consists of the feature vector of every input
    Instances testingInstances = createInstances("TESTING_INSTANCES");
    for (Instance currentInstance : inputDataset)
    {
        // extractFeature returns the feature vector for the current input
        Instance currentFeatureVector = extractFeature(currentInstance);
        // make the currentFeatureVector belong to testingInstances
        currentFeatureVector.setDataset(testingInstances);
        testingInstances.add(currentFeatureVector);
    }
    try {
        // classifier deserialization
        classifier = (Classifier) weka.core.SerializationHelper.read("NaiveBayes.model");
        // classifier testing code
        for (Instance testInstance : testingInstances)
        {
            double score = classifier.classifyInstance(testInstance);
            double[] vv = classifier.distributionForInstance(testInstance);
            for (int k = 0; k < vv.length; k++) {
                // these are the class probabilities; as a result I get 1.0 for one and 0.0 for all the others
                System.out.println("distribution " + vv[k]);
            }
            System.out.println(testingInstances.attribute("Class").value((int) score));
        }
    } catch (Exception ex) {
        System.out.println("Exception in testing the classifier." + ex);
    }
}
I want to create a text classifier for short messages; this code is based on this tutorial: http://preciselyconcise.com/apis_and_installations/training_a_weka_classifier_in_java.php . The problem is that the classifier predicts the wrong class for almost every message in testing.arff, because the class probabilities are not correct. training_set_prova_tent.arff has the same number of messages per class.
The example I'm following uses a featureWords.dat file and associates 1.0 with a word if it is present in a message; instead, I want to create my own dictionary with the words present in training_set_prova_tent plus the words present in testing, and associate with every word its number of occurrences.
P.S.
I know that this is exactly what I can do with the StringToWordVector filter, but I haven't found any example that explains how to use this filter with two files: one for the training set and one for the test set. So it seemed easier to adapt the code I found.
Thank you very much
It seems like you changed the code from the website you referenced at some crucial points, but not in a good way. I'll try to outline what you're trying to do and the mistakes I've found.
What you (probably) wanted to do in extractFeature is:
Split each tweet into words (tokenize)
Count the number of occurrences of these words
Create a feature vector representing these word counts plus the class
What you've overlooked in that method:
You never reset your featureMap. The line
Map<Integer, Double> featureMap = new TreeMap<>();
was originally at the beginning of extractFeature, but you moved it to initialize. That means you always add up the word counts but never reset them. For each new tweet, your word count also includes the word counts of all previous tweets. I'm sure that is not what you wanted.
You don't initialize featureWords with the words you want as features. Yes, you create an empty list, but you fill it incrementally with each tweet. The original code initialized it once in the initialize method, and it never changed after that. There are two problems with that:
With each new tweet, new features (words) get added, so your feature vector grows with each tweet. That wouldn't be such a big problem by itself (SparseInstance), but it means that
your class attribute is always in a different place. These two lines work for the original code, because featureWords.size() is basically a constant, but in your code the class label will be at index 5, then 8, then 12, and so on, whereas it must be the same for every instance:
indices[i] = featureWords.size();
values[i] = (double) classValues.indexOf(inputInstance.stringValue(1));
This also manifests itself in the fact that you build a new attributeList for each new tweet, instead of only once in initialize, which is bad for the reasons already explained.
There may be more issues, but as it stands, your code is rather unfixable. What you want is much closer to the tutorial source code you modified than to your current version.
Also, you should look into StringToWordVector, because it seems to be exactly what you want to do:
Converts String attributes into a set of attributes representing word occurrence (depending on the tokenizer) information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
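Since the question mentions not finding an example of StringToWordVector with separate training and test files, here is a minimal sketch of Weka's batch filtering (the file names are taken from the question; the rest is an assumption about your setup):

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchFilterExample {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("training_set_prova_tent.arff").getDataSet();
        Instances test = new DataSource("testing.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true); // word counts instead of 0/1 presence
        filter.setInputFormat(train);     // the dictionary comes from the first batch (training data)
        Instances trainVec = Filter.useFilter(train, filter);
        Instances testVec = Filter.useFilter(test, filter); // same attributes applied to the test set

        NaiveBayes classifier = new NaiveBayes();
        classifier.buildClassifier(trainVec);
        System.out.println(java.util.Arrays.toString(
                classifier.distributionForInstance(testVec.instance(0))));
    }
}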
My current assignment is to write a program that reads a file with instructions in a very tiny, basic programming language (it behaves a little like FORTRAN) and executes those instructions; basically, it's a simple interpreter for the language. It is completely linear, with all statements defined in sequence, and it only has String and integer variables. There are 8 keywords and 4 arithmetic operators that I need to find and handle if they exist within the source file, and each line must start with one of the reserved words.
A program in this language might look something like this:
#COMMENTS
LET.... (declares variables with values)
INTEGER myINT
STRING myString
CALCULATE...
PRINT
PRINTLN
END
Can I use a switch block instead of a chain of if statements to find and execute all of these? My concern is that switch doesn't work with Strings in Java 6, which is what I'm supposed to be using, and I don't see how to easily assign int values so that a switch block would work. Thanks in advance for any suggestions and advice!
If your language is so simple that every statement begins on its own line and is identified by one word only, then (as Gray pointed out in another comment) you can split the words in each line and compare the first word against a map. However, instead of mapping the words to ints and then doing one big switch, I would suggest mapping them to objects (as suggested by Dave Newton), like this:
interface Directive {
    public void execute(String line);
}

class LetDirective implements Directive {
    public void execute(String line) { ...handle LET directive here... }
}
...define other directives in the same way...
Then define the map:
private Map<String, Directive> directives = new HashMap<String, Directive>();
directives.put("LET", new LetDirective());
...
Then in your parsing method:
int firstSpace = line.indexOf(' ');
String command = line;
if (firstSpace > 0)
    command = line.substring(0, firstSpace);

Directive directive = directives.get(command.toUpperCase());
if (directive != null)
    directive.execute(line);
else
    ...show some error...
Each directive would have to parse the rest of the line on its own and handle it correctly inside its execute() method.
The benefit of this over a switch is that you can handle a larger number of commands without ending up with one gigantic method; instead, you get one small method per command.
If you are talking about converting strings to integers, then you could do it with a Java enumerated type:
private enum ReservedWord {
    LET,
    ...
}

// skip blank lines and comments
String[] tokens = codeLine.split(" ");
ReservedWord keyword;
try {
    keyword = ReservedWord.valueOf(tokens[0]);
} catch (IllegalArgumentException e) {
    // spit out nice syntax error message
}
You could also put the processing of the line inside the enum as a method if you'd like (a sketch of that follows the Map example below). Alternatively, you could do it with a Map:
private final Map<String, Integer> reservedWords = new HashMap<String, Integer>();
private static final int RESERVED_WORD_LET = 1;
...
{
    reservedWords.put("LET", RESERVED_WORD_LET);
    ...
}

// skip blank lines and comments
String[] tokens = codeLine.split(" ");
Integer value = reservedWords.get(tokens[0]);
if (value == null) {
    // handle error...
}
switch (value) {
    case RESERVED_WORD_LET:
        // LET
        ...
}
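And here is a minimal sketch of the enum-with-method variant mentioned above (the method bodies are illustrative, not from the assignment):

enum ReservedWord {
    LET {
        @Override
        void process(String[] tokens) {
            // handle LET: declare a variable from the rest of the line
        }
    },
    PRINT {
        @Override
        void process(String[] tokens) {
            System.out.println(tokens.length > 1 ? tokens[1] : "");
        }
    };

    abstract void process(String[] tokens);
}

// usage: ReservedWord.valueOf(tokens[0]).process(tokens);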
I am trying to get JLine to do tab completion so I can enter something like the following:
commandname --arg1 value1 --arg2 value2
I am using the following code:
final List<Completor> completors = Arrays.asList(
        new SimpleCompletor("commandname "),
        new SimpleCompletor("--arg1"),
        new SimpleCompletor("--arg2"),
        new NullCompletor());
consoleReader.addCompletor(new ArgumentCompletor(completors));
But after I type value2, tab completion stops.
(Supplementary question: can I validate value1 as a date using JLine?)
I had the same problem, and I solved it by creating my own classes to complete the commands with jLine; I just needed to implement my own Completor.
I am developing an application that helps DBAs type not only the command names, but also the parameters. I am using jLine just for the terminal interactions, and I created a separate Completor.
I have to provide the complete grammar to the Completor, and that is the objective of my application. It is called Zemucan and it is hosted on SourceForge; the application is initially focused on DB2, but any grammar could be incorporated. The Completor I am using looks like this:
public final int complete(final String buffer, final int cursor,
        @SuppressWarnings("rawtypes") final List candidateRaw) {
    final List<String> candidates = candidateRaw;
    final String phrase = buffer.substring(0, cursor);
    try {
        // Analyzes the typed phrase. This is my program: Zemucan.
        // ReturnOptions is an object that contains the possible options of the command.
        // It can propose completing the command name, or propose options.
        final ReturnOptions answer = InterfaceCore.analyzePhrase(phrase);
        // The first candidate is the new phrase.
        final String complete = answer.getPhrase().toLowerCase();
        // Deletes extra spaces.
        final String trim = phrase.trim().toLowerCase();
        // Compares if they are equal.
        if (complete.startsWith(trim)) {
            // Takes the difference.
            String diff = complete.substring(trim.length());
            if (diff.startsWith(" ") && phrase.endsWith(" ")) {
                diff = diff.substring(1, diff.length());
            }
            candidates.add(diff);
        } else {
            candidates.add("");
        }
        // If there are options or phrases, add them as
        // candidates. There is no predefined phrase.
        candidates.addAll(this.fromArrayToColletion(answer.getPhrases()));
        candidates.addAll(this.fromArrayToColletion(answer.getOptions()));
        // Adds a dummy option, in order to prevent jLine from
        // automatically adding the option as a phrase.
        if ((candidates.size() == 2) && (answer.getOptions().length == 1)
                && (answer.getPhrases().length == 0)) {
            candidates.add("");
        }
    } catch (final AbstractZemucanException e) {
        String cause = "";
        if (e.getCause() != null) {
            cause = e.getCause().toString();
        }
        System.exit(InputReader.ASSISTING_ERROR);
    }
    return cursor;
}
This is an extract of the application. You could write a simple Completor, for which you have to provide an array of options. Eventually, you will want to implement your own CompletionHandler to improve the way the options are presented to the user.
The complete code is available here.
Create two completers, then use them to complete arbitrary arguments. Note that not all of the arguments need to be completed.
List<Completer> completors = new LinkedList<>();
// Completes using the filesystem
completors.add(new FileNameCompleter());
// Completes using random words
completors.add(new StringsCompleter("--arg0", "--arg1", "command"));
// Aggregate the above completors
AggregateCompleter aggComp = new AggregateCompleter(completors);
// Parse the buffer line and complete each token
ArgumentCompleter argComp = new ArgumentCompleter(aggComp);
// Don't require all completors to match
argComp.setStrict(false);
// Add it all together
conReader.addCompleter(argComp);
Remove the NullCompletor and you will have what you want: the NullCompletor makes sure your entire command is only three words long.
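Concretely, a sketch of the setup from the question with the NullCompletor dropped (same JLine 1.x classes as above):

final List<Completor> completors = Arrays.asList(
        new SimpleCompletor("commandname "),
        new SimpleCompletor("--arg1"),
        new SimpleCompletor("--arg2"));
consoleReader.addCompletor(new ArgumentCompletor(completors));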