I'm having a look at uimaFIT and I just had quite some difficulty adding a dictionary annotator to an analysis engine.
This is my best shot so far:
public class LocationAnnotator extends JCasAnnotator_ImplBase {

    public static final String RES_DICTIONARY = "dictionary";

    @ExternalResource(key = RES_DICTIONARY)
    private DataResource resource;

    private Dictionary dictionary;

    @Override
    public void initialize(UimaContext context) throws ResourceInitializationException {
        super.initialize(context);
        try {
            DictionaryBuilder dictBuilder = new HashMapDictionaryBuilder();
            // create dictionary file parser
            DictionaryFileParserImpl fileParser = new DictionaryFileParserImpl();
            fileParser.parseDictionaryFile(resource.getUri().getPath(), resource.getInputStream(), dictBuilder);
            dictionary = dictBuilder.getDictionary();
        } catch (IOException e) {
            throw new ResourceInitializationException(e); // keep the cause
        }
    }

    @Override
    public void process(JCas cas) throws AnalysisEngineProcessException {
        String docText = cas.getDocumentText();
        int offset = 0; // track position so repeated words aren't all matched at their first occurrence
        for (String line : docText.split("\n")) {
            for (String word : line.split(" ")) {
                if (dictionary.contains(word)) {
                    int pos = docText.indexOf(word, offset);
                    Location annotation = new Location(cas, pos, pos + word.length());
                    annotation.addToIndexes();
                }
                offset += word.length() + 1; // word plus the single space/newline separator
            }
        }
    }
}
I'm executing the engine like this:
CollectionReaderDescription reader = CollectionReaderFactory.createReaderDescription(CvReader.class, CvReader.PARAM_INPUT_FILE, "docs/simple-doc.txt");
AnalysisEngineDescription tokenizer = AnalysisEngineFactory.createEngineDescription(LocationAnnotator.class);
ExternalResourceFactory.bindResource(tokenizer, LocationAnnotator.RES_DICTIONARY, "META-INF/dictionaries/location.dict.xml");

for (JCas cas : SimplePipeline.iteratePipeline(reader, tokenizer)) {
    for (Location location : JCasUtil.select(cas, Location.class)) {
        System.out.println("Found location: " + location.getCoveredText());
    }
}
Is there no more elegant way? I don't like the initialization; I would expect to be injected with an already-initialized dictionary via the @ExternalResource annotation, rather than with the raw DataResource.
I would be glad if someone could provide me with a simpler example. Thanks!
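EDIT: one direction I've been sketching (untested: DictionaryResource is a name I made up, SharedResourceObject is the standard UIMA contract, and the exact ExternalResourceFactory.bindResource overload taking a resource class plus a URL string may differ between uimaFIT versions) is to move the parsing into a shared resource object, so the annotator is injected with a ready-made dictionary:

// Shared resource that parses the dictionary once, when UIMA loads it.
public class DictionaryResource implements SharedResourceObject {

    private Dictionary dictionary;

    @Override
    public void load(DataResource resource) throws ResourceInitializationException {
        try {
            DictionaryBuilder builder = new HashMapDictionaryBuilder();
            DictionaryFileParserImpl fileParser = new DictionaryFileParserImpl();
            fileParser.parseDictionaryFile(resource.getUri().getPath(), resource.getInputStream(), builder);
            dictionary = builder.getDictionary();
        } catch (IOException e) {
            throw new ResourceInitializationException(e);
        }
    }

    public Dictionary getDictionary() {
        return dictionary;
    }
}

The annotator would then declare @ExternalResource(key = RES_DICTIONARY) private DictionaryResource dictResource; and drop its custom initialize() entirely, with the binding naming the resource class:

ExternalResourceFactory.bindResource(tokenizer, LocationAnnotator.RES_DICTIONARY,
        DictionaryResource.class, new File("META-INF/dictionaries/location.dict.xml").toURI().toURL().toString());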
I created a Spring Boot project.
I use Spring Data with Elasticsearch.
The whole pipeline (controller -> service -> repository) is ready.
I now have a file that represents country objects (name and isoCode), and I want to create a job to insert them all into Elasticsearch.
I read the Spring documentation and found that there's too much configuration for such a simple job.
So I'm trying to write a simple main "job" that reads a CSV, creates objects, and inserts them into Elasticsearch.
But I have a bit of trouble understanding how injection would work in this case:
@Component
public class InsertCountriesJob {

    private static final String file = "D:path\\to\\countries.dat";

    private static final Logger LOG = LoggerFactory.getLogger(InsertCountriesJob.class);

    @Autowired
    public CountryService service;

    public static void main(String[] args) {
        LOG.info("Starting insert countries job");
        try {
            saveCountries();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void saveCountries() throws Exception {
        try (CSVReader csvReader = new CSVReader(new FileReader(file))) {
            String[] values = null;
            while ((values = csvReader.readNext()) != null) {
                String name = values[0];
                String iso = values[1].equals("N") ? values[2] : values[1];
                Country country = new Country(iso, name);
                LOG.info("info: country: {}", country);
                // write in db;
                // service.save(country); <= can't do this because of the injection
            }
        }
    }
}
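For reference, the parsing above assumes lines shaped roughly like this (a hypothetical sample: name first, ISO code second, where an "N" in the second column means the real code sits in the third):

Andorra,AD
France,FR
Bouvet Island,N,BV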
Based on Simon's comment, here's how I resolved my problem. It might help people who are getting into Spring and are trying not to get lost.
Basically, to inject anything in Spring you need a SpringBootApplication: Spring only injects dependencies into beans it manages, so the job has to run inside the application context rather than in a bare static main.
@SpringBootApplication // boots the Spring context, so @Autowired works
public class InsertCountriesJob implements CommandLineRunner {

    private static final String file = "D:path\\to\\countries.dat";

    private static final Logger LOG = LoggerFactory.getLogger(InsertCountriesJob.class);

    @Autowired
    public CountryService service;

    public static void main(String[] args) {
        LOG.info("STARTING THE APPLICATION");
        SpringApplication.run(InsertCountriesJob.class, args);
        LOG.info("APPLICATION FINISHED");
    }

    @Override
    public void run(String... args) throws Exception {
        LOG.info("Starting insert countries job");
        try {
            saveCountry();
        } catch (Exception e) {
            e.printStackTrace();
        }
        LOG.info("job over");
    }

    public void saveCountry() throws Exception {
        try (CSVReader csvReader = new CSVReader(new FileReader(file))) {
            String[] values = null;
            while ((values = csvReader.readNext()) != null) {
                String name = values[0];
                String iso = values[1].equals("N") ? values[2] : values[1];
                Country country = new Country(iso, name);
                LOG.info("info: country: {}", country);
                // write in db;
                service.save(country);
            }
        }
    }
}
I use LanguageTool for some spell-checking and spell-correction functionality in my application.
The LanguageTool documentation describes how to exclude words from spell checking (by calling the addIgnoreTokens(...) method of the spell-checking rule you're using).
How do you add words (e.g., from a specific dictionary) to spell checking? That is, can LanguageTool fix misspellings by suggesting words from my specific dictionary?
Unfortunately, the API doesn't support this, I think. Without the API, you can add words to spelling.txt to get them accepted and used as suggestions. With the API, you might need to extend MorfologikSpellerRule and change this place in the code. (Disclosure: I'm the maintainer of LanguageTool.)
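For the spelling.txt route, the file is a plain word list: one entry per line, with # starting a comment. A hypothetical example:

# project-specific terms, accepted and offered as suggestions
isoCode
uimaFIT
Morfologik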
I have a similar requirement, which is to load some custom words into the dictionary as "suggested words", not just "ignored words". In the end I extended MorfologikSpellerRule to do this:
Create a class MorfologikSpellerRuleEx extending MorfologikSpellerRule, override the method match(), and write my own initSpeller() for creating the spellers.
Then, for the language tool, add this custom speller rule to replace the existing one.
Code:
Language lang = new AmericanEnglish();
JLanguageTool langTool = new JLanguageTool(lang);
langTool.disableRule("MORFOLOGIK_RULE_EN_US");
try {
    MorfologikSpellerRuleEx spellingRule = new MorfologikSpellerRuleEx(JLanguageTool.getMessageBundle(), lang);
    // spellingFilePath is the file with my own words + words from /hunspell/spelling_en-US.txt
    spellingRule.setSpellingFilePath(spellingFilePath);
    langTool.addRule(spellingRule);
} catch (IOException e) {
    e.printStackTrace();
}
The code of my custom MorfologikSpellerRuleEx:
public class MorfologikSpellerRuleEx extends MorfologikSpellerRule {

    private String spellingFilePath = null;
    private boolean ignoreTaggedWords = false;

    public MorfologikSpellerRuleEx(ResourceBundle messages, Language language) throws IOException {
        super(messages, language);
    }

    @Override
    public String getFileName() {
        return "/en/hunspell/en_US.dict";
    }

    @Override
    public String getId() {
        return "MORFOLOGIK_SPELLING_RULE_EX";
    }

    @Override
    public void setIgnoreTaggedWords() {
        ignoreTaggedWords = true;
    }

    public String getSpellingFilePath() {
        return spellingFilePath;
    }

    public void setSpellingFilePath(String spellingFilePath) {
        this.spellingFilePath = spellingFilePath;
    }

    private void initSpellerEx(String binaryDict) throws IOException {
        String plainTextDict = null;
        if (JLanguageTool.getDataBroker().resourceExists(getSpellingFileName())) {
            plainTextDict = getSpellingFileName();
        }
        if (plainTextDict != null) {
            BufferedReader br = null;
            if (this.spellingFilePath != null) {
                try {
                    br = new BufferedReader(new FileReader(this.spellingFilePath));
                } catch (Exception e) {
                    br = null;
                }
            }
            if (br != null) {
                speller1 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 1);
                speller2 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 2);
                speller3 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 3);
                br.close();
            } else {
                speller1 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 1);
                speller2 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 2);
                speller3 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 3);
            }
            setConvertsCase(speller1.convertsCase());
        } else {
            throw new RuntimeException("Could not find ignore spell file in path: " + getSpellingFileName());
        }
    }

    private boolean canBeIgnored(AnalyzedTokenReadings[] tokens, int idx, AnalyzedTokenReadings token)
            throws IOException {
        return token.isSentenceStart() || token.isImmunized() || token.isIgnoredBySpeller() || isUrl(token.getToken())
                || isEMail(token.getToken()) || (ignoreTaggedWords && token.isTagged()) || ignoreToken(tokens, idx);
    }

    @Override
    public RuleMatch[] match(AnalyzedSentence sentence) throws IOException {
        List<RuleMatch> ruleMatches = new ArrayList<>();
        AnalyzedTokenReadings[] tokens = getSentenceWithImmunization(sentence).getTokensWithoutWhitespace();
        // lazy init
        if (speller1 == null) {
            String binaryDict = null;
            if (JLanguageTool.getDataBroker().resourceExists(getFileName())) {
                binaryDict = getFileName();
            }
            if (binaryDict != null) {
                initSpellerEx(binaryDict); // here's the change
            } else {
                // should not happen, as we only configure this rule (or rather its subclasses)
                // when we have the resources:
                return toRuleMatchArray(ruleMatches);
            }
        }
        int idx = -1;
        for (AnalyzedTokenReadings token : tokens) {
            idx++;
            if (canBeIgnored(tokens, idx, token)) {
                continue;
            }
            // if we use token.getToken() we'll get ignored characters inside and speller
            // will choke
            String word = token.getAnalyzedToken(0).getToken();
            if (tokenizingPattern() == null) {
                ruleMatches.addAll(getRuleMatches(word, token.getStartPos(), sentence));
            } else {
                int index = 0;
                Matcher m = tokenizingPattern().matcher(word);
                while (m.find()) {
                    String match = word.subSequence(index, m.start()).toString();
                    ruleMatches.addAll(getRuleMatches(match, token.getStartPos() + index, sentence));
                    index = m.end();
                }
                if (index == 0) { // tokenizing char not found
                    ruleMatches.addAll(getRuleMatches(word, token.getStartPos(), sentence));
                } else {
                    ruleMatches.addAll(getRuleMatches(word.subSequence(index, word.length()).toString(),
                            token.getStartPos() + index, sentence));
                }
            }
        }
        return toRuleMatchArray(ruleMatches);
    }
}
I am building a Java clone-code detector in Swing that uses ANTLR. This is the screenshot:
https://www.dropbox.com/s/wnumgsjmpps33v5/SemogaYaAllah.png
As you can see in the screenshot, there is a main file that is compared to other files. What I do is compare the main file's tokens to the other files' tokens. The problem is that I fail to get the lexeme IDs (token types) from my lexer class.
This is my Antlr3JavaLexer:
public class Antlr3JavaLexer extends Lexer {
    public static final int PACKAGE=84;
    public static final int EXPONENT=173;
    public static final int STAR=49;
    public static final int WHILE=103;
    public static final int MOD=32;
    public static final int MOD_ASSIGN=33;
    public static final int CASE=58;
    public static final int CHAR=60;
I've created a JavaParser class like this to use that lexer:
public final class JavaParser extends AParser { // AParser is my abstract class

    JavaParser() {
    }

    @Override
    protected boolean parseFile(JCCDFile f, final ASTManager treeContainer) throws ParseException, IOException {
        BufferedReader in = new BufferedReader(new FileReader(f.getFile()));
        String filePath = f.getNama(); // getName of file
        final Antlr3JavaLexer lexer = new Antlr3JavaLexer();
        lexer.preserveWhitespacesAndComments = false;
        try {
            lexer.setCharStream(new ANTLRReaderStream(in));
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }

        StringBuilder sbu = new StringBuilder(); // declared outside the comment so the code below compiles

        // This is the problem.
        // When I activate this piece of code, I get the output shown here:
        // https://www.dropbox.com/s/80uyva56mk1r5xy/Bismillah2.png
        /*
        while (true) {
            org.antlr.runtime.Token token = lexer.nextToken();
            if (token.getType() == Antlr3JavaLexer.EOF) {
                break;
            }
            sbu.append(token.getType());
            System.out.println(token.getType() + ": :" + token.getText());
        }
        */

        final CommonTokenStream tokens = new CommonTokenStream();
        tokens.setTokenSource(lexer);
        tokens.LT(10); // force load

        // Create the parser
        Antlr3JavaParser parser = new Antlr3JavaParser(tokens);
        StringBuffer sb = new StringBuffer();
        sb.append(tokens.toString());

        DefaultTableModel model = (DefaultTableModel) Main_Menu.jTable2.getModel();
        List<final_tugas_akhir.Report2> theListData = new ArrayList<Report2>();
        final_tugas_akhir.Report2 theResult = new final_tugas_akhir.Report2();
        theResult.setFile(filePath);
        theResult.setId(sb.toString());
        theResult.setNum(sbu.toString());
        theListData.add(theResult);
        for (Report2 report : theListData) {
            System.out.println(report.getFile());
            System.out.println(report.getId());
            model.addRow(new Object[]{
                report.getFile(),
                report.getId(),
                report.getNum(),
            });
        }

        // in CompilationUnit
        CommonTree tree;
        try {
            tree = (CommonTree) parser.compilationUnit().getTree();
            DOTTreeGenerator gen = new DOTTreeGenerator();
            StringTemplate st = gen.toDOT(tree);
        } catch (RecognitionException e) {
            e.printStackTrace();
            return false;
        }

        walkThroughChildren(tree, treeContainer, parser.getTokenStream()); // my method to check for similar tokens
        in.close();
        this.posisiFix(treeContainer); // fix position
        return true;
    }
Once again, this is the error output of my Java program: https://www.dropbox.com/s/80uyva56mk1r5xy/Bismillah2.png.
The tokens always give me a null value.
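One likely fix, as a sketch (assuming an ANTLR 3 runtime new enough to have BufferedTokenStream.fill(); with older runtimes a tokens.LT(1) call forces the buffer to load): iterate the buffered token list via getTokens() instead of draining the lexer with nextToken(). Draining the lexer first leaves nothing for the CommonTokenStream, which would explain the empty tokens.

final Antlr3JavaLexer lexer = new Antlr3JavaLexer();
lexer.setCharStream(new ANTLRReaderStream(in));

CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill(); // buffer all tokens without consuming the stream position

StringBuilder sbu = new StringBuilder();
for (Object o : tokens.getTokens()) {
    org.antlr.runtime.Token token = (org.antlr.runtime.Token) o;
    if (token.getType() != org.antlr.runtime.Token.EOF) {
        sbu.append(token.getType());
        System.out.println(token.getType() + ": " + token.getText());
    }
}

// The stream is still positioned at the start, so the parser sees every token.
Antlr3JavaParser parser = new Antlr3JavaParser(tokens);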
I want to test the MessageProcessor1.listAllKeyword method, which in turn calls HbaseUtil1.getAllKeyword. Initially, I had to deal with a problem associated with the static initializer and the constructor: they set up an HBase DB connection. I used PowerMock to suppress the static and constructor calls, and that worked fine.
But even though I mocked the HbaseUtil1.getAllKeyword method, the actual method is still being called and executes all the HBase code, leading to an exception because the HBase server is not up.
EasyMock.expect(hbaseUtil.getAllKeyword("msg", "u1")).andReturn(expectedList);
Please give me an idea of how to avoid the actual method call. I have tried many ways, but none of them worked.
public class MessageProcessor1 {

    private static Logger logger = Logger.getLogger("MQ-Processor");
    private final static String CLASS_NAME = "MessageProcessor";
    private static boolean keywordsTableExists = false;
    public static PropertiesLoader props;
    HbaseUtil1 hbaseUtil;

    /**
     * For checking whether the table exists in HBase. If it doesn't exist, a
     * new table is created. This runs only once, when the class is loaded.
     */
    static {
        props = new PropertiesLoader();
        String[] userTablefamilys = {
                props.getProperty(Constants.COLUMN_FAMILY_NAME_COMMON_KEYWORDS),
                props.getProperty(Constants.COLUMN_FAMILY_NAME_USER_KEYWORDS) };
        keywordsTableExists = new HbaseUtil1()
                .creatTable(props.getProperty(Constants.HBASE_TABLE_NAME),
                        userTablefamilys);
    }

    /**
     * This loads a new configuration every time this class is instantiated.
     */
    {
        props = new PropertiesLoader();
    }

    public String listAllKeyword(String userId) throws IOException {
        HbaseUtil1 util = new HbaseUtil1();
        Map<String, List<String>> projKeyMap = new HashMap<String, List<String>>();
        //logger.info(CLASS_NAME + ": inside listAllKeyword method");
        //logger.debug("passed id : " + userId);
        List<String> qualifiers = util.getAllKeyword("msg", userId);
        List<String> keywords = null;
        for (String qualifier : qualifiers) {
            String[] token = qualifier.split(":");
            if (projKeyMap.containsKey(token[0])) {
                projKeyMap.get(token[0]).add(token[1]);
            } else {
                keywords = new ArrayList<String>();
                keywords.add(token[1]);
                projKeyMap.put(token[0], keywords);
            }
        }
        List<Project> projects = buildProject(projKeyMap);
        Gson gson = new GsonBuilder().excludeFieldsWithoutExposeAnnotation()
                .create();
        System.out.println("Json projects:::" + gson.toJson(projects));
        //logger.debug("list all keyword based on project::::" + gson.toJson(projects));
        //return gson.toJson(projects);
        return "raj";
    }

    private List<Project> buildProject(Map<String, List<String>> projKeyMap) {
        List<Project> projects = new ArrayList<Project>();
        Project proj = null;
        Set<String> keySet = projKeyMap.keySet();
        for (String hKey : keySet) {
            proj = new Project(hKey, projKeyMap.get(hKey));
            projects.add(proj);
        }
        return projects;
    }

    //@Autowired
    //@Qualifier("hbaseUtil1")
    public void setHbaseUtil(HbaseUtil1 hbaseUtil) {
        this.hbaseUtil = hbaseUtil;
    }
}
public class HbaseUtil1 {

    private static Logger logger = Logger.getLogger("MQ-Processor");
    private final static String CLASS_NAME = "HbaseUtil";
    private static Configuration conf = null;

    public HbaseUtil1() {
        PropertiesLoader props = new PropertiesLoader();
        conf = HBaseConfiguration.create();
        conf.set(HConstants.ZOOKEEPER_QUORUM, props
                .getProperty(Constants.HBASE_CONFIGURATION_ZOOKEEPER_QUORUM));
        conf.set(HConstants.ZOOKEEPER_CLIENT_PORT,
                props.getProperty(Constants.HBASE_CONFIGURATION_ZOOKEEPER_CLIENT_PORT));
        conf.set("hbase.zookeeper.quorum", props
                .getProperty(Constants.HBASE_CONFIGURATION_ZOOKEEPER_QUORUM));
        conf.set("hbase.zookeeper.property.clientPort",
                props.getProperty(Constants.HBASE_CONFIGURATION_ZOOKEEPER_CLIENT_PORT));
    }

    public List<String> getAllKeyword(String tableName, String rowKey)
            throws IOException {
        List<String> qualifiers = new ArrayList<String>();
        HTable table = new HTable(conf, tableName);
        Get get = new Get(rowKey.getBytes());
        Result rs = table.get(get);
        for (KeyValue kv : rs.raw()) {
            System.out.println("KV: " + kv + ", keyword: "
                    + Bytes.toString(kv.getRow()) + ", qualifier: "
                    + Bytes.toString(kv.getQualifier()) + ", family: "
                    + Bytes.toString(kv.getFamily()) + ", value: "
                    + Bytes.toString(kv.getValue()));
            qualifiers.add(new String(kv.getQualifier()));
        }
        table.close();
        return qualifiers;
    }

    /**
     * Create a table.
     *
     * @param tableName name of the table to be created.
     * @param familys   array of the names of the column families to be created with the table
     * @throws IOException
     */
    public boolean creatTable(String tableName, String[] familys) {
        HBaseAdmin admin = null;
        boolean tableCreated = false;
        try {
            admin = new HBaseAdmin(conf);
            if (!admin.tableExists(tableName)) {
                HTableDescriptor tableDesc = new HTableDescriptor(tableName);
                for (int i = 0; i < familys.length; i++) {
                    tableDesc.addFamily(new HColumnDescriptor(familys[i]));
                }
                admin.createTable(tableDesc);
                System.out.println("create table " + tableName + " ok.");
            }
            tableCreated = true;
            admin.close();
        } catch (MasterNotRunningException e1) {
            e1.printStackTrace();
        } catch (ZooKeeperConnectionException e1) {
            e1.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return tableCreated;
    }
}
Below is my Test class.
@RunWith(PowerMockRunner.class)
@PrepareForTest(MessageProcessor1.class)
@SuppressStaticInitializationFor("com.serendio.msg.mqProcessor.MessageProcessor1")
public class MessageProcessorTest1 {

    private MessageProcessor1 messageProcessor;
    private HbaseUtil1 hbaseUtil;

    @Before
    public void setUp() {
        messageProcessor = new MessageProcessor1();
        hbaseUtil = EasyMock.createMock(HbaseUtil1.class);
    }

    @Test
    public void testListAllKeyword() {
        List<String> expectedList = new ArrayList<String>();
        expectedList.add("raj:abc");

        suppress(constructor(HbaseUtil1.class));
        //suppress(method(HbaseUtil1.class, "getAllKeyword"));

        try {
            EasyMock.expect(hbaseUtil.getAllKeyword("msg", "u1")).andReturn(expectedList);
            EasyMock.replay(hbaseUtil); // a bare replay() without the mock has no effect
            assertEquals("raj", messageProcessor.listAllKeyword("u1"));
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The HbaseUtil1 is instantiated within the listAllKeyword method:

public String listAllKeyword(String userId) throws IOException {
    HbaseUtil1 util = new HbaseUtil1();
    ...

So the mock you create in your test isn't being used at all.
If possible, make the HbaseUtil1 object passable to, or settable on, the MessageProcessor1 class (it already has a setHbaseUtil setter), and then set it in the test class.
Also, and note that I'm not 100% familiar with PowerMock, you could include HbaseUtil1 in the @PrepareForTest annotation. I think that will make PowerMock instantiate mocks instead of real objects and then use the expectations you provide in your test.
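A sketch of that constructor-interception route with PowerMock's EasyMock API (untested; expectNew, replayAll and verifyAll are standard calls in org.powermock.api.easymock.PowerMock, and preparing MessageProcessor1, the class that executes `new HbaseUtil1()`, is what enables the interception):

@RunWith(PowerMockRunner.class)
@PrepareForTest(MessageProcessor1.class) // the class that calls `new HbaseUtil1()`
@SuppressStaticInitializationFor("com.serendio.msg.mqProcessor.MessageProcessor1")
public class MessageProcessorTest1 {

    @Test
    public void testListAllKeyword() throws Exception {
        List<String> expectedList = new ArrayList<String>();
        expectedList.add("raj:abc");

        // Return the mock whenever listAllKeyword does `new HbaseUtil1()`.
        HbaseUtil1 hbaseUtil = PowerMock.createMock(HbaseUtil1.class);
        PowerMock.expectNew(HbaseUtil1.class).andReturn(hbaseUtil);
        EasyMock.expect(hbaseUtil.getAllKeyword("msg", "u1")).andReturn(expectedList);
        PowerMock.replayAll();

        assertEquals("raj", new MessageProcessor1().listAllKeyword("u1"));
        PowerMock.verifyAll();
    }
}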
I have a blog dataset with a huge number of blog pages, including blog posts, comments, and all the usual blog features.
I need to extract only the blog posts from this collection and store them in a .txt file.
I need to modify this program so that it collects the blog-post text that starts with <p> and ends with </p>, avoiding the other tags.
Currently I use HTMLParser to do the job; here is what I have so far:
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {

    public static void main(String... args) {
        Parser parser = new Parser();
        HasAttributeFilter filter = new HasAttributeFilter("P");
        try {
            parser.setResource("d://Blogs/asample.txt");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);
            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");
                System.out.println(description);
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
Thanks in advance.
Provided the HTML is well formed, the following method should do what you need:
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

private static String extractText(File file) throws IOException {
    final ArrayList<String> list = new ArrayList<String>();
    FileReader reader = new FileReader(file);
    ParserDelegator parserDelegator = new ParserDelegator();
    ParserCallback parserCallback = new ParserCallback() {

        private int append = 0; // > 0 while we are inside (possibly nested) <p> tags

        @Override
        public void handleText(final char[] data, final int pos) {
            if (append > 0) {
                list.add(new String(data));
            }
        }

        @Override
        public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
            if (Tag.P.equals(tag)) {
                append++;
            }
        }

        @Override
        public void handleEndTag(Tag tag, final int pos) {
            if (Tag.P.equals(tag)) {
                append--;
            }
        }

        @Override
        public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }

        @Override
        public void handleComment(final char[] data, final int pos) { }

        @Override
        public void handleError(final java.lang.String errMsg, final int pos) { }
    };
    parserDelegator.parse(reader, parserCallback, false);
    reader.close();

    StringBuilder text = new StringBuilder();
    for (String s : list) {
        text.append(" ").append(s);
    }
    return text.toString();
}
EDIT: Changed to handle nested P tags.
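A small usage sketch (the file names are hypothetical, following the d://Blogs/asample.txt path from the question; additionally needs java.io.FileWriter and java.io.PrintWriter): extract the paragraph text of one page and write it to a .txt file.

public static void main(String[] args) throws IOException {
    String text = extractText(new File("d://Blogs/asample.txt"));
    try (PrintWriter out = new PrintWriter(new FileWriter("d://Blogs/asample-post.txt"))) {
        out.println(text.trim());
    }
}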