Retrieving word definitions from Google in Java

I have a list of words (1K+) in a file, and I would like to fetch their definitions and save them. I was thinking about getting the definitions from Google, as a definition is the first thing its results show. The approach I had in mind is quite rudimentary: create a URL instance pointing to the Google search for the given word, read the content using streams, and then "filter" out the definition, which always sits between data-dobid="dfn"><span> and </span>.
For example:
[...]data-dobid="dfn"><span>unwilling or refusing to change one's views or to agree about something</span>[...]
which is the definition of "intransigent".
However, I would like to know if there is a more "efficient" way of doing this, for example without retrieving all the other results of the search. I would also like to know if it's possible to load multiple results in a background thread, so that when I want to "decode" a definition and save it, I don't always have to wait for the search to complete.

A more efficient approach is to download a dictionary that you can then load locally. This gives you a local file or database that is readily searchable.
This approach is not only computationally efficient, but it also ensures you're using the information correctly under its license. What you are proposing is commonly called "scraping", and it may go against various licenses and terms of service.
This blog post lists several freely available and freely licensed dictionaries.
This AskUbuntu.SE question describes some more of the technical work required to acquire a free dictionary and reference it from the command line. You would want to replicate these reading patterns to load the data in Java.
Yet another approach would be to use a freely available and appropriately licensed API such as https://dictionaryapi.com/. This would still use HTTP calls, but it is clearly licensed and is an explicit API for looking up human-language word definitions. This is an advantage over scraping Google: you won't have to parse HTML, and the data is appropriately licensed for you to use.
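If you go the API route, here is a minimal sketch of that kind of lookup, assuming the Merriam-Webster endpoint hosted at dictionaryapi.com and a placeholder API key (both the URL format and the key are assumptions to verify against their documentation). The ExecutorService at the end addresses the background-thread part of the question: each word becomes a task, so lookups overlap instead of blocking one another.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class DefinitionFetcher {
        // Placeholder key: dictionaryapi.com issues keys on registration.
        private static final String KEY = "YOUR-API-KEY";

        // Fetch the raw JSON response for one word. The endpoint format is an
        // assumption; verify it against the documentation of the dictionary
        // you sign up for.
        static String fetchJson(String word) throws Exception {
            URL url = new URL("https://www.dictionaryapi.com/api/v3/references/collegiate/json/"
                    + word + "?key=" + KEY);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line);
                }
            }
            return sb.toString(); // parse with the JSON library of your choice
        }

        public static void main(String[] args) throws Exception {
            List<String> words = List.of("intransigent", "rudimentary");
            // Background lookups: submit each word as a task so that decoding
            // and saving one definition never waits on the others.
            ExecutorService pool = Executors.newFixedThreadPool(8);
            List<Future<String>> results = new ArrayList<>();
            for (String w : words) {
                results.add(pool.submit(() -> fetchJson(w)));
            }
            for (Future<String> f : results) {
                System.out.println(f.get());
            }
            pool.shutdown();
        }
    }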
Finally there are some similar, if not duplicate, questions on StackOverflow and StackExchange such as this one: How to implement an English dictionary in Java?

Related

How to generate sequence diagrams automatically on executing junit

I have been given the task of generating sequence diagrams automatically on execution of a JUnit test case in Eclipse. I am learning UML. I found tools that can generate a sequence diagram, and I am aware of JUnit, but how do I combine the two?
The tools I found good were UMLet, ModelGoon UML, and ObjectAid, and I zeroed in on ModelGoon, which I found simple and easy to use. How do I automate this task? Please guide me.
If there are any other tools available, please point me to them.
First: this is a very good idea, and there are several ways to go about it. I will assume that you are working in a JVM language (e.g. Kotlin or Java), so the suggestions I make are biased by that.
Direct approach
Set up your logging to log using JSON; it makes the rest much simpler: https://www.baeldung.com/java-log-json-output
Make a small library for logging the name of the component/method you are in and the session you are processing. There are many ways of doing this, but a simple one is a thread-local variable: set the variable to contain the name of the thing you are tracing ("usecase foobar") and some unique ID (UUIDs are a decent choice). Another would be to generate a tracing ID (or get one from an external interaction) and pass it as a parameter to all involved methods. Both will work, and which one is simplest in practice depends on the architecture of your application (see the sketch after these steps for the thread-local variant).
In the methods you want to trace, write a log entry that contains that tracing information (name of usecase, trace ID, or any combination thereof), the location where the log entry was written, and any other information you want to add to your sequence diagram.
Run your test normally. A log will be produced. You need to be able to retrieve that log. There are many ways this can be done, use one :-)
Filter the log entries so you get only the ones you are interested in. Using the "jq" utility is a decent choice.
Process the filtered output to generate PlantUML input files (http://plantuml.com/) for sequence diagrams.
Process the PlantUML files to get the sequence diagrams.
Done.
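As a minimal sketch of the thread-local variant described above (class and method names here are illustrative, not from any particular library):

    import java.util.UUID;

    public final class Trace {
        // Thread-local trace context: use-case name plus a unique ID.
        private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

        public static void start(String useCase) {
            CONTEXT.set(useCase + ":" + UUID.randomUUID());
        }

        // One JSON log line per traced call. Filter these with jq, then turn
        // each into a PlantUML line such as "OrderService -> PaymentService: charge".
        public static void log(String from, String to, String message) {
            System.out.printf(
                "{\"trace\":\"%s\",\"from\":\"%s\",\"to\":\"%s\",\"msg\":\"%s\"}%n",
                CONTEXT.get(), from, to, message);
        }
    }

Call Trace.start("usecase foobar") at the top of the test, then Trace.log(...) at each point you want to appear on the diagram; the jq filtering and PlantUML generation steps then operate on these lines.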
Industrial approach
Use standard tooling for tracing such as https://opentracing.io/, instrument your application with it, and extract your diagrams using that tooling.
This will also work in production and will probably scale much better than the direct approach, but if scaling isn't a concern, the direct approach may be all you need.

Java crawler with custom file save ability

I'm looking for an open-source web crawler written in Java which, in addition to the usual web crawler features such as depth control and multi-threading, has the ability to custom-handle each file type.
To be more precise, when a file is downloaded (or is about to be downloaded), I want to handle the saving operation myself. HTML files should be saved in one repository, images in another location, and other files somewhere else. Also, the repository may not be just a simple file system.
I've heard a lot about Apache Nutch. Does it have the ability to do this? I'm looking to achieve this as simply and quickly as possible.
On the assumption that you want a lot of control over how the crawler works, I would recommend crawler4j. There are many examples, so you can get a quick glimpse of how things work.
You can easily handle resources based on their content type (take a look at the Page.java class - it is the class of the object that contains information about a fetched resource).
There are no limitations regarding the repository; you can use anything you wish.
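As a rough sketch of that routing idea, assuming a recent crawler4j version (the domain and target directories are placeholders, and the directories could just as well be databases or other repositories):

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class SortingCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            return url.getURL().startsWith("https://example.com/"); // placeholder domain
        }

        @Override
        public void visit(Page page) {
            String contentType = page.getContentType(); // e.g. "text/html; charset=UTF-8"
            // Route by content type: one directory per category here.
            Path dir;
            if (contentType != null && contentType.startsWith("text/html")) {
                dir = Paths.get("store/html");
            } else if (contentType != null && contentType.startsWith("image/")) {
                dir = Paths.get("store/images");
            } else {
                dir = Paths.get("store/other");
            }
            try {
                Files.createDirectories(dir);
                String name = String.valueOf(page.getWebURL().getURL().hashCode());
                Files.write(dir.resolve(name), page.getContentData());
            } catch (IOException e) {
                logger.error("Could not save " + page.getWebURL().getURL(), e);
            }
        }
    }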

MPXJ: How to get columns other than task and resource?

I'm trying to use the MPXJ library to get fields from an MS Project .mpp file. I managed to get the tasks and resources, but my file contains additional fields like start date, end date, comments, etc. Can anyone help me extract these fields?
Thanks in advance :)
You may find it useful to take a look at the notes in the "getting started" section of the MPXJ web site. To summarise briefly, data from Microsoft Project, and other project planning tools, typically consists of a top level project, tasks, resources, and assignments (which link tasks and resources together).
This is pretty much how MPXJ represents the data read from a project plan. The attributes of each of these objects can be set or retrieved using the relevant set and get methods on each object. For example, the Task object in MPXJ exposes setStart() and getStart() methods to allow you to work with the task start date. The method names follow the names used for the attributes in Microsoft Project, so hopefully you will find it straightforward to locate the attributes you need. You may also find the API documentation helpful in this respect.
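For instance, a minimal sketch along those lines, assuming a reasonably recent MPXJ version in which UniversalProjectReader opens an .mpp file directly (older versions used format-specific readers such as MPPReader):

    import net.sf.mpxj.ProjectFile;
    import net.sf.mpxj.Task;
    import net.sf.mpxj.reader.UniversalProjectReader;

    public class MppFields {
        public static void main(String[] args) throws Exception {
            ProjectFile project = new UniversalProjectReader().read("project.mpp");
            for (Task task : project.getTasks()) {
                // Start/finish dates and notes ("comments" in the Project UI)
                System.out.println(task.getName()
                        + " start=" + task.getStart()
                        + " finish=" + task.getFinish()
                        + " notes=" + task.getNotes());
            }
        }
    }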

java keyword extraction

Is there a simple-to-use Java library that can take a String and return a set of Strings which are the keywords/keyphrases?
It doesn't have to be particularly clever, just use stop words and stemming to match keywords.
I am looking at the KEA package http://code.google.com/p/kea-algorithm/ but I can't figure out how to use their code.
Ideally something simple which has a little example documentation would be good. In the meantime I will set about writing this myself!
EDIT: When I say I can't figure out how to use their code, I mean I can't see a simple way. The individual classes by themselves have useful methods that will do much of the work.
This is a fairly old question and the OP has probably already solved their problem, but I'm putting this here for others who may stumble upon it looking for how to use KEA.
For KEA, you will need a training set - some of your documents will need to have keywords already set. The training data consists of a directory of documents (.txt files) and corresponding keywords files (.key files), with one keyword per line. You train KEA on this set, then use the model to extract keywords on the rest of your documents, which are in another directory of .txt files. KEA will write out corresponding .key files in this directory.
For more information, take a look at one or more of the following:
1) The KEA source distribution has a TestKEA.java class which shows how to extract keywords from a small test corpus. The README has details on the directory format required.
2) This blog post has (somewhat terse, IMO) instructions on how to use KEA:
http://kea-pranay.blogspot.com/2010/02/kea-key-extraction-algorithm.html
3) My blog post, written up last weekend while learning how to generate keywords from a corpus I had (already manually annotated with keywords). It has Python code to pre-process data into the form KEA expects, Scala code (KEA provides a Java API) to train and run the extractor, and Python code to analyze and visualize the generated keywords.
http://sujitpal.blogspot.com/2014/08/keyword-extraction-with-kea.html
You might try the Porter stemming algorithm: the Java version is at http://tartarus.org/~martin/PorterStemmer/java.txt and the main page is at http://tartarus.org/~martin/PorterStemmer/. It's old, but it doesn't do a bad job.
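A minimal sketch of driving that class once you have added the java.txt source (which defines a Stemmer class) to your project:

    public class StemDemo {
        public static void main(String[] args) {
            // The tartarus.org source defines a Stemmer with add(char),
            // stem() and toString(); feed it one lowercase word at a time.
            Stemmer stemmer = new Stemmer();
            for (char c : "running".toCharArray()) {
                stemmer.add(c);
            }
            stemmer.stem();
            System.out.println(stemmer.toString()); // prints "run"
        }
    }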

Java content APIs for a large number of files

Does anyone know any open-source Java libraries that provide features for handling a large number of files (write/read) on disk? I am talking about 2-4 million files (most of them PDF and MS Office docs), and it is not a good idea to store all files in a single directory. Instead of re-inventing the wheel, I am hoping this has been done by many people already.
Features I am looking for:
1) Able to write/read files from disk
2) Able to create random directories/sub-directories for new files
3) Provide versioning/auditing (optional)
I was looking at the JCR API and it looks promising, but it starts with a workspace and I'm not sure what the performance will be when there are many nodes.
Edit: JCR does look pretty good. I'd suggest trying it out to see how it actually performs for your use case.
If you're running your system on Windows and noticed a horrible n^2 performance hit at some point, you're probably running up against the performance hit incurred by automatic 8.3 filename generation. Of course, you can disable 8.3 filename generation, but as you pointed out, it would still not be a good idea to store large numbers of files in a single directory.
One common strategy I've seen for handling large numbers of files is to create directories for the first n letters of the filename. For example, document.pdf would be stored as d/o/c/u/m/document.pdf. I don't recall ever seeing a library that does this in Java, but it seems pretty straightforward. If necessary, you can create a database to store the lookup table (mapping keys to the uniformly distributed random filenames), so you won't have to rebuild your index every time you start up. If you want the benefit of automatic deduplication, you can hash each file's content and use that checksum as the filename, but add a check so you don't accidentally discard a file whose checksum matches an existing file even though the contents are actually different (see the sketch below).
Depending on the sizes of the files, you might also consider storing the files themselves in a database. If you do this, it would be trivial to add versioning, and you wouldn't necessarily have to create random filenames, because you could reference files using an auto-generated primary key.
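Here is a minimal sketch of the checksum-as-filename variant (class name and layout are illustrative): the first two pairs of hex digits of a SHA-256 digest become nested directories, which keeps any single directory small and the distribution uniform.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;

    public class ShardedStore {
        private final Path root;

        public ShardedStore(Path root) {
            this.root = root;
        }

        // Store content under a path derived from its SHA-256 digest,
        // e.g. a1/b2/a1b2c3....pdf, so files spread evenly across directories.
        public Path store(byte[] content, String extension) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            String name = hex.toString();
            Path dir = root.resolve(name.substring(0, 2)).resolve(name.substring(2, 4));
            Files.createDirectories(dir);
            Path target = dir.resolve(name + extension);
            Files.write(target, content);
            return target;
        }
    }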
Combine the functionality in the java.io package with your own custom solution.
The java.io package can write and read files from disk and create arbitrary directories or sub-directories for new files. There is no external API required.
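For instance, a bare-bones sketch using nothing but java.io (paths are placeholders):

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    public class PlainIoStore {
        public static void main(String[] args) throws IOException {
            // Create an arbitrary sub-directory tree, then write a file into it.
            File dir = new File("store/2024/batch-01");
            if (!dir.exists() && !dir.mkdirs()) {
                throw new IOException("Could not create " + dir);
            }
            try (FileWriter out = new FileWriter(new File(dir, "example.txt"))) {
                out.write("hello");
            }
        }
    }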
The versioning or auditing would have to be provided by your own custom solution. There are many ways to handle this, and you probably have a specific requirement to meet. Especially if you're concerned about the performance of an open-source API, you will likely get the best result by coding a solution that specifically fits your needs.
It sounds like your module should scan all the files on startup and form an index of everything that's available. Based on the method used for sharing and indexing these files, it can rescan the files every so often or you can code it to receive a message from some central server when a new file or version is available. When someone requests a file or provides a new file, your module will know exactly how it is organized and exactly where to get or put the file within the directory tree.
It seems that it would be far easier to just engineer a solution specific to your needs.
