collecting text within <p> from html pages - java

I have a blog dataset containing a huge number of blog pages, with blog posts, comments and all the other blog features.
I need to extract only the blog posts from this collection and store them in a .txt file.
I need to modify the program below so that it collects only the text of each blog post, which starts with <p> and ends with </p>, and ignores all other tags.
Currently I use HTMLParser to do the job. Here is what I have so far:
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;
public class HTMLParserTest {
public static void main(String... args) {
Parser parser = new Parser();
HasAttributeFilter filter = new HasAttributeFilter("P");
try {
parser.setResource("d://Blogs/asample.txt");
NodeList list = parser.parse(filter);
Node node = list.elementAt(0);
if (node instanceof MetaTag) {
MetaTag meta = (MetaTag) node;
String description = meta.getAttribute("content");
System.out.println(description);
}
} catch (ParserException e) {
e.printStackTrace();
}
}
}
thanks in advance

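Provided the HTML is well formed, the following method should do what you need. It uses the HTML parser that ships with Swing, so it needs these imports:
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;
private static String extractText(File file) throws IOException {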
final ArrayList<String> list = new ArrayList<String>();
FileReader reader = new FileReader(file);
ParserDelegator parserDelegator = new ParserDelegator();
ParserCallback parserCallback = new ParserCallback() {
private int append = 0;
public void handleText(final char[] data, final int pos) {
if(append > 0) {
list.add(new String(data));
}
}
public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
if (Tag.P.equals(tag)) {
append++;
}
}
public void handleEndTag(Tag tag, final int pos) {
if (Tag.P.equals(tag)) {
append--;
}
}
public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
public void handleComment(final char[] data, final int pos) { }
public void handleError(final java.lang.String errMsg, final int pos) { }
};
parserDelegator.parse(reader, parserCallback, false);
reader.close();
String text = "";
for(String s : list) {
text += " " + s;
}
return text;
}
EDIT: Changed to handle nested P tags.
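As a usage illustration, here is a minimal self-contained sketch of the same idea that parses from any Reader, so it can be tried on an in-memory string before pointing it at real blog files. The class name ParagraphTextDemo, the method name extractParagraphText and the sample HTML are made up for this example; only the Swing parser classes come from the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ParagraphTextDemo {

    // Collects the text reported by the parser while it is inside a <p> element.
    static String extractParagraphText(Reader reader) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0; // greater than 0 while inside at least one <p>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (HTML.Tag.P.equals(tag)) {
                    depth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (HTML.Tag.P.equals(tag) && depth > 0) {
                    depth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    if (text.length() > 0) {
                        text.append(' '); // separate consecutive text chunks
                    }
                    text.append(data);
                }
            }
        };
        // The third argument tells the parser to ignore charset declarations in the page.
        new ParserDelegator().parse(reader, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample page, standing in for one blog file.
        String html = "<html><body><h1>Title</h1><p>First post.</p>"
                + "<p>Second post.</p></body></html>";
        System.out.println(extractParagraphText(new StringReader(html)));
    }
}
```

To store the result in a .txt file, the returned string could be passed to something like Files.write; for real files, wrap a FileReader in try-with-resources so it is closed even if parsing fails.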

Related

How to store an ArrayList in a file?

I have a class which represents an ArrayList stored in a file, because I need an ArrayList with multiple gigabytes of data in it which is obviously too large to be stored in memory. The data is represented by a class called Field and the function Field.parse() is just for converting the Field into a String and the other way.
The Field class stores a list of (strange) chess pieces and their coordinates.
My class is working fine, but it takes a long time to add an element to the file and I need my program to run as fast as possible. Does anyone know a more efficient/faster way of doing things?
Also, I am not allowed to use external libraries/apis. Please keep that in mind.
This is the class which is responsible for storing Field objects in a temp file:
private File file;
private BufferedReader reader;
private BufferedWriter writer;
public FieldSaver() {
try {
file = File.createTempFile("chess-moves-", ".temp");
System.out.println(file.getAbsolutePath());
} catch (IOException e) {
e.printStackTrace();
}
}
public void add(Field field) {
try {
File temp = File.createTempFile("chess-moves-", ".temp");
writer = new BufferedWriter(new FileWriter(temp));
reader = new BufferedReader(new FileReader(file));
String line;
while((line = reader.readLine()) != null ) {
writer.write(line);
writer.newLine();
}
reader.close();
writer.write(field.parse());
writer.close();
file.delete();
file = new File(temp.getAbsolutePath());
} catch (IOException e) {
e.printStackTrace();
}
}
public Field get(int n) {
try {
reader = new BufferedReader(new FileReader(file));
for (int i = 0; i < n; i++) {
reader.readLine();
}
String line = reader.readLine();
reader.close();
return Field.parse(line);
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
And this is the Field class:
private WildBoar wildBoar;
private HuntingDog[] huntingDogs;
private Hunter hunter;
private int size;
@Override
public String toString() {
String result = "Wildschwein: " + wildBoar.toString();
for (HuntingDog dog : huntingDogs) {
result += "; Hund: " + dog.toString();
}
return result + "; Jäger: " + hunter.toString();
}
@Override
public boolean equals(Object obj) {
if (obj instanceof Field) {
Field field = (Field) obj;
HuntingDog[] dogs = field.getHuntingDogs();
return wildBoar.equals(field.getWildBoar()) && hunter.equals(field.getHunter()) && huntingDogs[0].equals(dogs[0]) && huntingDogs[1].equals(dogs[1]) && huntingDogs[2].equals(dogs[2]);
}
return false;
}
public Field(int size, WildBoar wildBoar, HuntingDog[] huntingDogs, Hunter hunter) {
this.size = size;
this.wildBoar = wildBoar;
this.huntingDogs = huntingDogs;
this.hunter = hunter;
}
public WildBoar getWildBoar() {
return wildBoar;
}
public HuntingDog[] getHuntingDogs() {
return huntingDogs;
}
public Hunter getHunter() {
return hunter;
}
public int getSize() {
return size;
}
public static Field parse(String s) {
String[] arr = s.split(",");
WildBoar boar = WildBoar.parse(arr[0]);
Hunter hunter = Hunter.parse(arr[1]);
HuntingDog[] dogs = new HuntingDog[arr.length - 2];
for(int i = 2; i < arr.length; i++) {
dogs[i - 2] = HuntingDog.parse(arr[i]);
}
return new Field(8, boar, dogs, hunter);
}
public String parse() {
String result = wildBoar.parse() + "," + hunter.parse();
for(HuntingDog dog : huntingDogs) {
result += "," + dog.parse();
}
return result;
}
Here's an MCVE to do what you want, based on the information you provided.
You can run it and see that it can save a Field to the file and get a Field by index very quickly.
Each Field occupies a constant number of bytes, so you can get a Field by index by seeking to the byte offset index * (Field length in bytes). This would be significantly more difficult if the Fields were not constant length.
import java.io.Closeable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
public class FieldSaver implements Closeable {
public static void main(String[] args) throws IOException {
File f = File.createTempFile("chess-moves-", ".temp");
try (FieldSaver test = new FieldSaver(f);) {
for (byte i = 0; i < 100; i++) {
test.add(new Field(8, new WildBoar(i, i), new Hunter(i, i), new HuntingDog[] {
new HuntingDog(i, i),
new HuntingDog(i, i),
new HuntingDog(i, i) }));
}
// Get a few Fields by index
System.out.println(test.get(0));
System.out.println(test.get(50));
System.out.println(test.get(99));
// EOF exception, there is no Field 100
// System.out.println(test.get(100));
}
}
private final RandomAccessFile data;
public FieldSaver(File f) throws FileNotFoundException {
data = new RandomAccessFile(f, "rw");
}
public void add(Field field) throws IOException {
data.seek(data.length());
field.write(data);
}
public Field get(int index) throws IOException {
// cast to long first: an int product overflows once the file grows past 2 GB
data.seek((long) index * Field.STORAGE_LENGTH_BYTES);
return Field.read(data);
}
public void close() throws IOException { data.close(); }
static abstract class Piece {
protected byte xPos;
protected byte yPos;
public Piece(DataInput data) throws IOException {
xPos = data.readByte();
yPos = data.readByte();
}
public Piece(byte xPos, byte yPos) {
this.xPos = xPos;
this.yPos = yPos;
}
public void write(DataOutput data) throws IOException {
data.writeByte(xPos);
data.writeByte(yPos);
}
public String toString() { return "[" + xPos + ", " + yPos + "]"; }
}
static class Hunter extends Piece {
public Hunter(byte xPos, byte yPos) { super(xPos, yPos); }
public Hunter(DataInput data) throws IOException { super(data); }
}
static class HuntingDog extends Piece {
public HuntingDog(byte xPos, byte yPos) { super(xPos, yPos); }
public HuntingDog(DataInput data) throws IOException { super(data); }
}
static class WildBoar extends Piece {
public WildBoar(byte xPos, byte yPos) { super(xPos, yPos); }
public WildBoar(DataInput data) throws IOException { super(data); }
}
static class Field {
// size of boar + hunter + 3 dogs
public static final int STORAGE_LENGTH_BYTES = 2 + 2 + (3 * 2);
private int size;
private WildBoar boar;
private Hunter hunter;
private final HuntingDog[] dogs;
public Field(int size, WildBoar wildBoar, Hunter hunter, HuntingDog[] huntingDogs) {
this.size = size;
this.boar = wildBoar;
this.hunter = hunter;
this.dogs = huntingDogs;
}
public String toString() {
String result = "Wildschwein: " + boar.toString();
for (HuntingDog dog : dogs) {
result += "; Hund: " + dog.toString();
}
return result + "; Jäger: " + hunter.toString();
}
public static Field read(DataInput data) throws IOException {
WildBoar boar = new WildBoar(data);
Hunter hunter = new Hunter(data);
HuntingDog[] dogs = new HuntingDog[3];
for (int i = 0; i < 3; i++) {
dogs[i] = new HuntingDog(data);
}
return new Field(8, boar, hunter, dogs);
}
public void write(DataOutput data) throws IOException {
boar.write(data);
hunter.write(data);
for (HuntingDog dog : dogs) {
dog.write(data);
}
}
}
}
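For comparison, here is a rough sketch of what the variable-length case could look like, using only the standard library. It is not part of the MCVE above: the class name VariableRecordFile is hypothetical, records are plain strings rather than Field objects, and the offset index is kept only in memory, so it is lost when the program exits.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class VariableRecordFile implements Closeable {
    private final RandomAccessFile data;
    // Byte offset at which each record starts; index i maps to record i.
    private final List<Long> offsets = new ArrayList<>();

    public VariableRecordFile(File f) throws IOException {
        data = new RandomAccessFile(f, "rw");
    }

    // Appends a record at the end of the file and remembers where it starts.
    public void add(String record) throws IOException {
        long start = data.length();
        data.seek(start);
        data.writeUTF(record); // writes a 2-byte length prefix followed by the bytes
        offsets.add(start);
    }

    // Jumps straight to the stored offset instead of scanning the file.
    public String get(int index) throws IOException {
        data.seek(offsets.get(index));
        return data.readUTF();
    }

    @Override
    public void close() throws IOException {
        data.close();
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("records-", ".bin");
        f.deleteOnExit();
        try (VariableRecordFile store = new VariableRecordFile(f)) {
            store.add("short");
            store.add("a considerably longer record");
            store.add("mid-sized");
            System.out.println(store.get(1));
        }
    }
}
```

The index costs 8 bytes per record, which is almost as much as the 10-byte Field itself, so for this particular data the fixed-length layout is the better fit. Also note that writeUTF limits each record to 65535 encoded bytes.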
Use a Map implementation like Cache from ehcache. This library will optimize for you so you don't have to handle writing and reading to disk and manage when to keep it in memory or on disk. You can just use it as a normal map. You probably want a map instead of a list for faster lookup so the library can optimize even more for you.
http://www.ehcache.org/
CacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
.withCache("preConfigured",
CacheConfigurationBuilder.newCacheConfigurationBuilder(Long.class, String.class,
ResourcePoolsBuilder.heap(100))
.build())
.build(true);
Cache<Long, String> preConfigured
= cacheManager.getCache("preConfigured", Long.class, String.class);

How can I sort a ranking list by a specific column from a file and print the whole file sorted? (Java)

I have already tried this but can't make it work.
I also tried adding another while ((line = br.readLine()) != null) {} loop and placing the sort before it, but that second loop is never entered, so nothing is printed.
The file looks like this:
1-Fred-18-5-0
2-luis-12-33-0
3-Helder-23-10-0
And I want it printed like this:
2-luis-12-33-0
3-Helder-23-10-0
1-Fred-18-5-0
public static void lerRanking() throws IOException {
File ficheiro = new File("jogadores.txt");
BufferedReader br = new BufferedReader(new FileReader(ficheiro));
List<Integer> jGanhos = new ArrayList<Integer>();
int i = 0;
String line;
String texto = "";
while ((line = br.readLine()) != null) {
String[] col = line.split("-");
int colunas = Integer.parseInt(col[3]);
jGanhos.add(colunas);
i++;
if(i>=jGanhos.size()){
Collections.sort(jGanhos);
Collections.reverse(jGanhos);
for (int j = 0; j < jGanhos.size(); j++) {
if(colunas == jGanhos.get(i)){
texto = texto + line + "\n";
}
}
}
}
PL(texto);
}
Make it step by step:
public static void lerRanking() throws IOException {
File ficheiro = new File("jogadores.txt");
// read file
BufferedReader br = new BufferedReader(new FileReader(ficheiro));
List<String> lines = new ArrayList<>();
String line;
while ((line = br.readLine()) != null) {
lines.add(line);
}
// sort lines
lines.sort(new Comparator<String>() {
@Override
public int compare(String s1, String s2) {
// sort by the 4th column (index 3), descending
return Integer.compare(Integer.parseInt(s2.split("-")[3]), Integer.parseInt(s1.split("-")[3]));
}
});
// concat lines
String texto = "";
for (String l : lines) {
texto += l + "\n";
}
System.out.println(texto);
// PL(texto);
}
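On Java 8 or later, the same sort can be written more compactly with Comparator.comparingInt. The following is only a sketch under that assumption; the class name RankingSortDemo and the method name sortByFourthColumnDesc are invented for the example, and the input lines are the three sample rows from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RankingSortDemo {

    // Returns a new list sorted by the 4th dash-separated column, highest value first.
    static List<String> sortByFourthColumnDesc(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        Comparator<String> byFourthColumn = Comparator.comparingInt(
                (String line) -> Integer.parseInt(line.split("-")[3]));
        sorted.sort(byFourthColumn.reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1-Fred-18-5-0",
                "2-luis-12-33-0",
                "3-Helder-23-10-0");
        // Prints the luis line (33), then Helder (10), then Fred (5).
        sortByFourthColumnDesc(lines).forEach(System.out::println);
    }
}
```

Comparator.comparingInt compares the extracted int values directly, which avoids the overflow that subtracting two ints in a comparator can cause for very large or negative values.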
Okay, so first of all I think you should introduce a Java class (in my code this is ParsedObject) to manage your objects.
Second it should implement the Comparable<ParsedObject> interface, so you can easily sort it from anywhere in the code (without passing a custom comparator each time).
Here is the full code:
import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public class Main {
public static void main(String[] args) throws IOException {
lerRanking();
}
public static void lerRanking() throws IOException {
File ficheiro = new File("jogadores.txt");
// read lines to a list
List<String> lines = readLines(ficheiro);
// parse them to a list of objects
List<ParsedObject> objects = ParsedObject.from(lines);
// sort
Collections.sort(objects);
// print the output
writeLines(objects);
}
public static List<String> readLines(File ficheiro) throws IOException {
// read file line by line
BufferedReader br = new BufferedReader(new FileReader(ficheiro));
List<String> lines = new ArrayList<>();
String line;
while((line = br.readLine()) != null) {
lines.add(line);
}
br.close(); // THIS IS IMPORTANT never forget to close a Reader :)
return lines;
}
private static void writeLines(List<ParsedObject> objects) throws IOException {
File file = new File("output.txt");
BufferedWriter bw = new BufferedWriter(new FileWriter(file));
for(ParsedObject object : objects) {
// print the output line by line
bw.write(object.originalLine);
}
bw.flush();
bw.close(); // THIS IS IMPORTANT never forget to close a Writer :)
}
// our object that holds the information
static class ParsedObject implements Comparable<ParsedObject> {
// the original line, if needed
public String originalLine;
// the columns
public Integer firstNumber;
public String firstString;
public Integer secondNumber;
public Integer thirdNumber;
public Integer fourthNumber;
// parse line by line
public static List<ParsedObject> from(List<String> lines) {
List<ParsedObject> objects = new ArrayList<>();
for(String line : lines) {
objects.add(ParsedObject.from(line));
}
return objects;
}
// parse one line
public static ParsedObject from(String line) {
String[] splitLine = line.split("-");
ParsedObject parsedObject = new ParsedObject();
parsedObject.originalLine = line + "\n";
parsedObject.firstNumber = Integer.valueOf(splitLine[0]);
parsedObject.firstString = splitLine[1];
parsedObject.secondNumber = Integer.valueOf(splitLine[2]);
parsedObject.thirdNumber = Integer.valueOf(splitLine[3]);
parsedObject.fourthNumber = Integer.valueOf(splitLine[4]);
return parsedObject;
}
@Override
public int compareTo(ParsedObject other) {
return other.thirdNumber.compareTo(this.thirdNumber);
}
}
}
If you have any more questions, feel free to ask :)
The easiest way is to first create a class that holds the data from your file, provided your lines keep the same format:
public class MyClass {
private Integer column1;
private String column2;
private Integer column3;
private Integer column4;
private Integer column5;
public MyClass(String data) {
String[] cols = data.split("-");
if (cols.length != 5) return;
column1 = Integer.parseInt(cols[0]);
column2 = cols[1];
column3 = Integer.parseInt(cols[2]);
column4 = Integer.parseInt(cols[3]);
column5 = Integer.parseInt(cols[4]);
}
public synchronized final Integer getColumn1() {
return column1;
}
public synchronized final String getColumn2() {
return column2;
}
public synchronized final Integer getColumn3() {
return column3;
}
public synchronized final Integer getColumn4() {
return column4;
}
public synchronized final Integer getColumn5() {
return column5;
}
@Override
public String toString() {
return String.format("%d-%s-%d-%d-%d", column1, column2, column3, column4, column5);
}
}
Next you can get a list of your items like this:
public static List<MyClass> getLerRanking() throws IOException {
List<MyClass> items = Files.readAllLines(Paths.get("jogadores.txt"))
.stream()
.filter(line -> !line.trim().isEmpty())
.map(data -> new MyClass(data.trim()))
.filter(data -> data.getColumn4() != null)
.sorted((o1, o2) -> o2.getColumn4().compareTo(o1.getColumn4()))
.collect(Collectors.toList());
return items;
}
This reads the whole file, filters out any blank lines, then parses each remaining line into a MyClass object.
It then makes sure that column4 isn't null in the converted objects.
Finally, it sorts the objects in descending order by the value in column 4 and collects them into a list.
To print the results you can do something like this:
public static void main(String[] args) throws IOException {
List<MyClass> rankingList = getLerRanking();
rankingList.forEach(item -> System.out.println(item));
}
Since we overrode the toString() method, each object is printed in the same format as its line in the file.
Hope this helps.

Issue loading data from csv file - Java

As it stands I have a data set in the form of a .csv file which you can find here. Also there is some brief documentation on it which you can find here. What I am attempting to do is manipulate the data set so that I can work with some machine learning algorithms, but as it stands I can't seem to print the loaded data to the console.
ImageMatrix.java
import java.util.Arrays;
public class ImageMatrix {
public static int[] data;
public int classCode;
public ImageMatrix(int[] data, int classCode) {
assert data.length == 64;
}
public String toString() {
return "Class Code: " + classCode + " DataSet:" + Arrays.toString(data) + "\n";
}
public int[] getData() {
return data;
}
public int getClassCode() {
return classCode;
}
}
ImageMatrixDB.java
import java.io.*;
import java.util.*;
public class ImageMatrixDB implements Iterable<ImageMatrix> {
List<ImageMatrix> list = new ArrayList<ImageMatrix>();
public static ImageMatrixDB load(String f) throws IOException {
ImageMatrixDB result = new ImageMatrixDB();
try (FileReader fr = new FileReader(f);
BufferedReader br = new BufferedReader(fr)) {
for (String line; null != (line = br.readLine()); ) {
int lastComma = line.lastIndexOf(',');
int classCode = Integer.parseInt(line.substring(1 + lastComma));
int[] data = Arrays.stream(line.substring(0, lastComma).split(","))
.mapToInt(Integer::parseInt)
.toArray();
result.list.add(new ImageMatrix(data, classCode));
}
System.out.println(ImageMatrix.data.toString());
}
return result;
}
public Iterator<ImageMatrix> iterator() {
return this.list.iterator();
}
public static void main(String[] args){
ImageMatrixDB i = new ImageMatrixDB();
i.load("dataset1.csv"); // <<< ERROR IS HERE
}
}
The error is in my main function, on the line i.load(...). I know I must be missing something or have made a mistake somewhere. I have tried making data non-static, but that just throws more errors and I can't figure it out. Any ideas?
Your issue is in the ImageMatrix class.
You never set the int[] data in the constructor. You have:
public ImageMatrix(int[] data, int classCode) {
assert data.length == 64;
}
You need:
public ImageMatrix(int[] data, int classCode) {
assert data.length == 64;
this.data = data;
this.classCode = classCode;
}
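Note also that the question declares data as a static field, so every ImageMatrix would share one array and each new row would overwrite the previous one. The sketch below illustrates the difference with two hypothetical stand-in classes (SharedData and OwnData are not part of the original code). It also uses Arrays.toString, because calling toString() directly on an int[] prints a type-and-hash string rather than the contents.

```java
import java.util.Arrays;

public class StaticFieldDemo {

    // Mimics the question's class: the array is static, so all instances share it.
    static class SharedData {
        static int[] data;

        SharedData(int[] d) {
            data = d;
        }

        int[] getData() {
            return data;
        }
    }

    // Mimics the corrected class: each instance keeps its own array.
    static class OwnData {
        private final int[] data;

        OwnData(int[] d) {
            this.data = d;
        }

        int[] getData() {
            return data;
        }
    }

    public static void main(String[] args) {
        SharedData s1 = new SharedData(new int[] {1, 2});
        SharedData s2 = new SharedData(new int[] {3, 4});
        // s1 now reports s2's values, because the static field was overwritten.
        System.out.println(Arrays.toString(s1.getData())); // [3, 4]

        OwnData o1 = new OwnData(new int[] {1, 2});
        OwnData o2 = new OwnData(new int[] {3, 4});
        // o1 still holds the array it was constructed with.
        System.out.println(Arrays.toString(o1.getData())); // [1, 2]
    }
}
```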
Here is your updated/complete/working code:
ImageMatrix:
import java.util.*;
public class ImageMatrix {
private int[] data;
private int classCode;
public ImageMatrix(int[] data, int classCode) {
assert data.length == 64;
this.data = data;
this.classCode = classCode;
}
public String toString() {
return "Class Code: " + classCode + " DataSet:" + Arrays.toString(data) + "\n";
}
public int[] getData() {
return data;
}
public int getClassCode() {
return classCode;
}
}
ImageMatrixDB:
import java.util.*;
import java.io.*;
public class ImageMatrixDB implements Iterable<ImageMatrix> {
private List<ImageMatrix> list = new ArrayList<ImageMatrix>();
public ImageMatrixDB load(String f) throws IOException {
try (
FileReader fr = new FileReader(f);
BufferedReader br = new BufferedReader(fr)) {
String line = null;
while((line = br.readLine()) != null) {
int lastComma = line.lastIndexOf(',');
int classCode = Integer.parseInt(line.substring(1 + lastComma));
int[] data = Arrays.stream(line.substring(0, lastComma).split(","))
.mapToInt(Integer::parseInt)
.toArray();
ImageMatrix matrix = new ImageMatrix(data, classCode);
list.add(matrix);
}
}
return this;
}
public void printResults(){
for(ImageMatrix matrix: list){
System.out.println(matrix);
}
}
public Iterator<ImageMatrix> iterator() {
return this.list.iterator();
}
public static void main(String[] args){
ImageMatrixDB i = new ImageMatrixDB();
try{
i.load("cw2DataSet1.csv");
i.printResults();
}
catch(Exception ex){
ex.printStackTrace();
}
}
}
Your load method can throw an IOException. You need to catch it (or declare it with throws) for the code to compile:
public static void main(String[] args){
ImageMatrixDB i = new ImageMatrixDB();
try{
i.load("dataset1.csv"); // <<< ERROR IS HERE
}
catch(Exception e){
System.out.println(e.getMessage());
}
}

PIG Custom loader's getNext() is being called again and again

I have started working with Apache Pig for one of our projects. I have to create a custom input format to load our data files. For this, I followed this example Hadoop:Custom Input format. I also created my custom RecordReader implementation to read the data (we get our data in binary format from some other application) and parse that to proper JSON format.
The problem occurs when I use my custom loader in Pig script. As soon as my loader's getNext() method is invoked, it calls my custom RecordReader's nextKeyValue() method, which works fine. It reads the data properly, passes it back to my loader which parses the data and returns a Tuple. So far so good.
The problem arises when my loader's getNext() method is called again and again. It gets called, works fine, and returns the proper output (I debugged it till return statement). But then, instead of letting the execution go further, my loader gets called again. I tried to see the number of times my loader is called, and I could see the number go till 20K!
Can somebody please help me understand the problem in my code?
Loader
public class SimpleTextLoaderCustomFormat extends LoadFunc {
protected RecordReader in = null;
private byte fieldDel = '\t';
private ArrayList<Object> mProtoTuple = null;
private TupleFactory mTupleFactory = TupleFactory.getInstance();
@Override
public Tuple getNext() throws IOException {
Tuple t = null;
try {
boolean notDone = in.nextKeyValue();
if (!notDone) {
return null;
}
String value = (String) in.getCurrentValue();
byte[] buf = value.getBytes();
int len = value.length();
int start = 0;
for (int i = 0; i < len; i++) {
if (buf[i] == fieldDel) {
readField(buf, start, i);
start = i + 1;
}
}
// pick up the last field
readField(buf, start, len);
t = mTupleFactory.newTupleNoCopy(mProtoTuple);
mProtoTuple = null;
} catch (InterruptedException e) {
int errCode = 6018;
String errMsg = "Error while reading input";
e.printStackTrace();
throw new ExecException(errMsg, errCode,
PigException.REMOTE_ENVIRONMENT, e);
}
return t;
}
private void readField(byte[] buf, int start, int end) {
if (mProtoTuple == null) {
mProtoTuple = new ArrayList<Object>();
}
if (start == end) {
// NULL value
mProtoTuple.add(null);
} else {
mProtoTuple.add(new DataByteArray(buf, start, end));
}
}
@Override
public InputFormat getInputFormat() throws IOException {
//return new TextInputFormat();
return new CustomStringInputFormat();
}
@Override
public void setLocation(String location, Job job) throws IOException {
FileInputFormat.setInputPaths(job, location);
}
@Override
public void prepareToRead(RecordReader reader, PigSplit split)
throws IOException {
in = reader;
}
}
Custom InputFormat
public class CustomStringInputFormat extends FileInputFormat<String, String> {
@Override
public RecordReader<String, String> createRecordReader(InputSplit arg0,
TaskAttemptContext arg1) throws IOException, InterruptedException {
return new CustomStringInputRecordReader();
}
}
Custom RecordReader
public class CustomStringInputRecordReader extends RecordReader<String, String> {
private String fileName = null;
private String data = null;
private Path file = null;
private Configuration jc = null;
private static int count = 0;
@Override
public void close() throws IOException {
// jc = null;
// file = null;
}
@Override
public String getCurrentKey() throws IOException, InterruptedException {
return fileName;
}
@Override
public String getCurrentValue() throws IOException, InterruptedException {
return data;
}
@Override
public float getProgress() throws IOException, InterruptedException {
return 0;
}
@Override
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
throws IOException, InterruptedException {
FileSplit split = (FileSplit) genericSplit;
file = split.getPath();
jc = context.getConfiguration();
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
InputStream is = FileSystem.get(jc).open(file);
StringWriter writer = new StringWriter();
IOUtils.copy(is, writer, "UTF-8");
data = writer.toString();
fileName = file.getName();
writer.close();
is.close();
System.out.println("Count : " + ++count);
return true;
}
}
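Try this in the loader (note that in is a RecordReader, so the cast has to target the record reader class, not the input format):
//....
boolean notDone = ((CustomStringInputRecordReader) in).nextKeyValue();
//...
Text value = new Text(((CustomStringInputRecordReader) in).getCurrentValue().toString());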
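One thing that stands out in the question's CustomStringInputRecordReader is that nextKeyValue() returns true unconditionally. The loader's getNext() only returns null, which signals the end of the input, when nextKeyValue() returns false, so that may well be why getNext() keeps being called. The sketch below models the contract in plain Java without any Hadoop or Pig classes; OneShotFileReader and its methods are hypothetical stand-ins, not a drop-in replacement for the real RecordReader.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class OneShotFileReader {
    private final Path file;
    private boolean processed = false; // set once the whole file has been handed out
    private String data;

    public OneShotFileReader(Path file) {
        this.file = file;
    }

    // Same contract as RecordReader.nextKeyValue(): true only while a record is available.
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false; // the single whole-file record was already delivered
        }
        data = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        processed = true;
        return true;
    }

    public String getCurrentValue() {
        return data;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("whole-file-", ".txt");
        Files.write(tmp, "a\tb\tc".getBytes(StandardCharsets.UTF_8));

        OneShotFileReader reader = new OneShotFileReader(tmp);
        int records = 0;
        while (reader.nextKeyValue()) { // the loop ends after one record
            records++;
        }
        System.out.println("records read: " + records); // records read: 1
        Files.delete(tmp);
    }
}
```

In the real record reader the same effect could be had with a boolean field that is checked at the top of nextKeyValue() and set after the file has been read.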

Dictionary example for uimaFIT code based

I'm having a look at uimaFIT and I had quite some difficulty adding a Dictionary Annotator to an analysis engine.
This is my best shot so far:
public class LocationAnnotator extends JCasAnnotator_ImplBase {
public static final String RES_DICTIONARY = "dictionary";
@ExternalResource(key = RES_DICTIONARY)
private DataResource resource;
private Dictionary dictionary;
@Override
public void initialize(UimaContext context) throws ResourceInitializationException {
super.initialize(context);
try {
DictionaryBuilder dictBuilder = new HashMapDictionaryBuilder();
// create dictionary file parser
DictionaryFileParserImpl fileParser = new DictionaryFileParserImpl();
fileParser.parseDictionaryFile(resource.getUri().getPath(), resource.getInputStream(), dictBuilder);
dictionary = dictBuilder.getDictionary();
} catch (IOException e) {
throw new ResourceInitializationException();
}
}
@Override
public void process(JCas cas) throws AnalysisEngineProcessException {
String docText = cas.getDocumentText();
for (String line : docText.split("\n")) {
for (String word : line.split(" ")) {
if (dictionary.contains(word)) {
int pos = docText.indexOf(word);
Location annotation = new Location(cas, pos, pos + word.length());
annotation.addToIndexes();
}
}
}
}
}
I'm executing the engine like this:
CollectionReaderDescription reader = CollectionReaderFactory.createReaderDescription(CvReader.class, CvReader.PARAM_INPUT_FILE, "docs/simple-doc.txt");
AnalysisEngineDescription tokenizer = AnalysisEngineFactory.createEngineDescription(LocationAnnotator.class);
ExternalResourceFactory.bindResource(tokenizer, LocationAnnotator.RES_DICTIONARY, "META-INF/dictionaries/location.dict.xml");
for (JCas cas : SimplePipeline.iteratePipeline(reader, tokenizer)) {
for (Location location : JCasUtil.select(cas, Location.class)) {
System.out.println("Found location: " + location.getCoveredText());
}
}
Is there no more elegant way? I don't like the initialization. I would expect to be able to initialize the dictionary with an annotation, like @ExternalResource.
I would be glad if someone could provide me with a simpler example. Thanks!
