How to subtract a substring from a string in web harvest

I am new to webharvest and am using it to get the article data from a website, using the following statement:
let $text := data($doc//div[@id="articleBody"])
and this is the data that I get from the above statement :
The Refine Spa (Furman's Mill) was built as a stone grist mill along the on a tributary of Capoolong Creek by Moore Furman, quartermaster general of George Washington's army
Notable people
Notable current and former residents of Pittstown include:
My question is: is it possible to subtract one string from another,
in the above example "Notable people" from the content?
Is it possible to do it this way? If it's possible, please let me know how. Thanks.
Is there something that I can do like this:
if (contains($text, 'Notable people')) then $text := minus($text, 'Notable people')
Here contains is an example function name for checking whether one string is a substring of another,
and minus is an example function name for removing a substring from another.
The desired output:
The Refine Spa (Furman's Mill) was built as a stone grist mill along the on a tributary of Capoolong Creek by Moore Furman, quartermaster general of George Washington's army
Notable current and former residents of Pittstown include:

From http://web-harvest.sourceforge.net/manual.php :
regexp
Searches the body for the given regular expression and optionally replaces found occurrences with specified pattern.
If body is a list of values then the regexp processor is applied to every item and final execution result is the list.
You just have to use a correct regular expression: a correct regexp-pattern and a correct regexp-result.
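The same idea, finding a pattern and replacing it with an empty result, can be sketched in plain Java. This is not Web-Harvest configuration; the class and method names are hypothetical, and the sketch only illustrates the kind of regular expression that would remove the "Notable people" line from the text.

```java
import java.util.regex.Pattern;

final class SubstringRemoval {
    // Removes every line that consists solely of `unwanted`,
    // together with the line break that follows it (if any).
    static String removeLine(String text, String unwanted) {
        // (?m) lets ^ and $ match at line boundaries; Pattern.quote
        // treats `unwanted` as literal text; \R matches any line break.
        String regex = "(?m)^" + Pattern.quote(unwanted) + "$\\R?";
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String text = "The Refine Spa (Furman's Mill) was built as a stone grist mill\n"
                + "Notable people\n"
                + "Notable current and former residents of Pittstown include:";
        System.out.println(removeLine(text, "Notable people"));
    }
}
```

Anchoring the pattern to a whole line keeps the following line, which also starts with "Notable", untouched.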

Related

Java SE generate words from given Input

I'm currently trying to generate valid words from a given Input.
Here is my pseudocode:
1. String and depth as a parameter for method.
2. Find similar or alike words from the same "wordfamily"
2b. You can specify the "depth". (Explained below)
3. return all found words as a List.
I am not looking for the code, more for the approach, or whether there is a library or a specific topic I should do some research on.
Here is a Testcase:
The parameters are Summer and depth 1; a possible result may be [Summer, Birds, Sun, Flowers, Warm ...] (let's say these are "direct" hits).
Depending on the depth, you get more "abstract", "broader" results that are not directly, but still in a certain way, associated with the given word.
Given the same parameter Summer, but with a higher depth (2), you may now get in addition [Winter, Snow ...]
So the depth influences how many results you may get back.

Javaslang object decomposition not working

I am using Javaslang-2.1.0-alpha and its Javaslang-match equivalent to do some object decomposition. According to this blog post by Daniel, in the "Match the Fancy way" section:
Match(person).of( Case(Person("Carl", Address($(), $())), (street, number) -> ...) )
This should retrieve the values matching the two wildcard patterns inside Address into street and number, but the example does not even compile. I later realized all objects must be wrapped inside atomic patterns, i.e. "Carl" becomes $("Carl"). This was after reading this issue.
I followed the updated tutorial but there was no update to this example.
I updated the example to this:
Person person = new Person("Carl", new Address("Milkyway", 42));
String result2 = Match(person).of(
    Case(Person($("Carl"), Address($(), $())),
        (street, number) -> "Carl lives in " + street + " " + number),
    Case($(), () -> "not found")
);
System.out.println(result2);
It compiles but my values are not being matched properly, judging from the console output:
Carl lives in Carl Address [street=Milkyway, number=42]
It's clear that street contains Carl and number, the entire Address object.
When I try to add a third lambda parameter to catch Carl:
Case(Person($("Carl"), Address($(), $())),
    (name, street, number) -> "Carl lives in " + street + " " + number)
The code can't compile, the lambda expression gets a red underline with the following error text:
The target type of this expression must be a functional interface
There is no way of ignoring a value with $_ in the latest versions of javaslang-match. So I want to match each of the atomic patterns which would return three lambda parameters as above.
I need somebody who understands this library to explain to me how to do this object decomposition in the latest version.

Disclaimer: I'm the creator of Javaslang.
The case needs to handle (String, Address) -> {...}. The $() match arbitrary values but the handler/function receives only the first layer of the decomposed object tree. The $() are at the second layer.
Rule: All layers are matched against patterns, only the first layer is passed to the handler.
The first prototype of Match in fact handled arbitrary tree depths, but methods had to be generated under the hood for all possible combinations: the maximum byte code size was easily exceeded and compile time exploded exponentially.
The current version of Match is the only practical way in Java I see at the moment.
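The rule above can be modelled in plain Java. The sketch below is not Javaslang's API: FirstLayerDemo, matchPerson and the Person/Address classes are hypothetical stand-ins, assumed only to show that sub-patterns are checked on every layer while the handler receives just the first layer (name and address).

```java
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Predicate;

final class FirstLayerDemo {
    static final class Address {
        final String street;
        final int number;
        Address(String street, int number) { this.street = street; this.number = number; }
    }

    static final class Person {
        final String name;
        final Address address;
        Person(String name, Address address) { this.name = name; this.address = address; }
    }

    // Hypothetical stand-in for a Person(...) pattern: both sub-patterns are
    // tested, but the handler is only handed the first layer (name, address),
    // never the second layer (street, number).
    static <R> Optional<R> matchPerson(Person person,
                                       Predicate<String> namePattern,
                                       Predicate<Address> addressPattern,
                                       BiFunction<String, Address, R> handler) {
        boolean matches = namePattern.test(person.name) && addressPattern.test(person.address);
        return matches ? Optional.of(handler.apply(person.name, person.address)) : Optional.empty();
    }

    static String describe(Person person) {
        return matchPerson(person,
                name -> name.equals("Carl"),   // plays the role of $("Carl")
                address -> true,               // plays the role of Address($(), $())
                (name, address) -> name + " lives in " + address.street + " " + address.number)
            .orElse("not found");              // plays the role of Case($(), ...)
    }

    public static void main(String[] args) {
        Person person = new Person("Carl", new Address("Milkyway", 42));
        System.out.println(describe(person));
    }
}
```

Applied to the question, this means writing the handler with two parameters, (name, address), and reading street and number from the address object inside the handler.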
Update:
Please let me give a more figurative update on this topic.
We distinguish between
1. The object graph of the input
2. The pattern tree passed to the match case
3. The decomposed objects of the object graph
Ad 1) The Object Graph
Given an object, the object graph is spanned by traversing the properties (resp. instance variables) of that object. Notably, we do not prohibit an object from containing cycles (e.g. a mutable list that contains itself).
In Javaslang there is no natural way to decompose an object into its parts. We need a so-called pattern for that purpose.
Example of an object graph:
           Person                 <-- root
           /    \
      "Carl"    Address           <-- 1st level
                /     \
        "Milkyway"     42         <-- 2nd level
Ad 2) The Pattern Tree
A pattern (instance) inherently defines how to decompose an object.
In our example the pattern types look like this (simplified generics):
        Pattern2<Person, String, Address<String, Integer>>
             /                    \
  Pattern0<String>      Pattern2<Address, String, Integer>
                             /              \
                  Pattern0<String>    Pattern0<Integer>
The called pattern methods return instances of the above types:
           Person(...)
           /        \
    $("Carl")     Address(...)
                   /      \
                 $()      $()
Javaslang's Match API does the following:
1. The Match instance passes the given person object to the first Case.
2. The Case passes the person object to the pattern Person(...)
3. The Person(...) pattern checks if the given object person is of type Person.
   - If true then the pattern decomposes the object into its parts (represented by a tuple) and checks if the sub-patterns $("Carl") and Address(...) match these parts (recursively repeats 3.)
   - If false, then Match passes the object to the next Case (see 2.)
4. If the pattern is atomic, i.e. it can't decompose the object any more, then equality is checked and the callers are informed all the way back to the match case.
5. When a match case got a pattern match then it passes the decomposed objects of the first level of the object graph to the match case handler.
Currently Java's type system does not allow us to pass matched objects of arbitrary object graph/tree levels in a typed way to the handler.
Ad 3) Decomposed Objects
We already mentioned object decomposition above in 2). In particular it is used when parts of our given objects are sent down the pattern tree.
Because of the limitation of the type system we mentioned above, we separate the process of matching an object from the process of handling decomposed parts.
Java allows us to match arbitrary object graphs. We are not limited to any level here.
However, when an object is successfully matched, we can only pass the decomposed objects of the first layer to the handler.
In our example these decomposed objects are name and address of the given person (and not street and number).
I know that this is not obvious to the user of the Match API.
One of the next Java versions will contain value objects and native pattern matching! However, that version of pattern matching will be limited entirely to the first level.
Javaslang allows matching arbitrary object graphs, but it has a price. The handler receives only the first layer of decomposed objects, which might be confusing.
I hope this answered the question in an understandable way.
- Daniel

How to calculate similarity between Chamber of Commerce numbers?

I am working on an engine that does OCR post-processing, and currently I have a set of organizations in the database, including Chamber of Commerce Numbers.
Also from the OCR output I have a list of possible Chamber of Commerce (COC) numbers.
What would be the best way to search for the most similar one? Currently I am using Levenshtein Distance, but the result range is simply too big and on big databases I really doubt its feasibility. Currently it's implemented in Java, and the database is a MySQL database.
Side note: A Chamber of Commerce number in The Netherlands is defined to be an 8-digit number for every company. An earlier version of this system used another 4 digits (0000, 0001, etc.) to indicate an establishment of an organization; nowadays, entirely new COC numbers are given out for those.
Example of COCNumbers:
30209227
02045251
04087614
01155720
20081288
020179310000
09053023
09103292
30039925
13041611
01133910
09063023
34182B01
27124701
List of possible COCNumbers determined by post-processing:
102537177
000450093333
465111338098
NL90223l30416l
NLﬂ0737D447B01
12juni2013
IBANNL32ABNA0242244777
lncassantNL90223l30416l10000
KvK13041611
BtwNLﬂ0737D447B01
A few extra notes:
The post-processing picks up words and word groups from the invoice, and those word groups are concatenated into one string. (A word group is, as it says, a group of words, usually denoted by a space between them.)
The condition that the post-processing uses for it to be a COC number is the following: The length should be 8 or more, half of the content should be numbers and it should be alphanumerical.
The amount of possible COCNumbers determined by post-processing is relatively small.
The database itself can grow very big, up to tens of thousands of records.
How would I proceed to find the best match in general? (In this case (13041611, KvK13041611) is the best (and moreover correct) match)

Doing this matching exclusively in MySQL is probably a bad idea for a simple reason: there's no way to use a regular expression to modify a string natively.
You're going to need to use some sort of scoring algorithm to get this right, in my experience (which comes from ISBNs and other book-identifying data).
This is procedural -- you probably need to do it in Java (or some other procedural programming language).
1. Is the candidate string found in the table exactly? If yes, score 1.0.
2. Is the candidate string "kvk" (case-insensitive) prepended to a number that's found in the table exactly? If so, score 1.0.
3. Is the candidate string the correct length, and does it match after changing lower case L into 1 and upper case O into 0? If so, score 0.9.
4. Is the candidate string the correct length after trimming all alphabetic characters from either the beginning or the end, and does it match? If so, score 0.8.
5. Do both steps 3 and 4, and if you get a match score 0.7.
6. Trim alpha characters from both the beginning and end, and if you get a match score 0.6.
7. Do steps 3 and 6, and if you get a match score 0.55.
The highest scoring match wins.
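The steps above could be sketched in Java roughly as follows. The class and method names (CocScorer, score, best) are hypothetical, the in-memory Set stands in for a lookup against the MySQL table, the length of 8 is taken from the question, and applying the trimming before the character substitution in the combined steps is an assumption, since the steps do not fix an order.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class CocScorer {
    private static final int COC_LENGTH = 8; // assumed: 8 characters, per the question

    private final Set<String> known; // stand-in for the database table

    CocScorer(Set<String> known) { this.known = known; }

    // Step 3 substitution: lower case L -> 1, upper case O -> 0.
    static String fixOcr(String s) { return s.replace('l', '1').replace('O', '0'); }
    static String trimLeadingAlpha(String s) { return s.replaceFirst("^[A-Za-z]+", ""); }
    static String trimTrailingAlpha(String s) { return s.replaceFirst("[A-Za-z]+$", ""); }

    // True if s has the expected length and is found in the table.
    private boolean hit(String s) { return s.length() == COC_LENGTH && known.contains(s); }

    double score(String candidate) {
        if (known.contains(candidate)) return 1.0;                              // step 1
        if (candidate.regionMatches(true, 0, "kvk", 0, 3)
                && known.contains(candidate.substring(3))) return 1.0;          // step 2
        if (hit(fixOcr(candidate))) return 0.9;                                 // step 3
        String lead = trimLeadingAlpha(candidate);
        String trail = trimTrailingAlpha(candidate);
        if (hit(lead) || hit(trail)) return 0.8;                                // step 4
        if (hit(fixOcr(lead)) || hit(fixOcr(trail))) return 0.7;                // step 5
        String both = trimTrailingAlpha(lead);
        if (hit(both)) return 0.6;                                              // step 6
        if (hit(fixOcr(both))) return 0.55;                                     // step 7
        return 0.0;                                                             // no match
    }

    // "The highest scoring match wins": the best-scoring candidate, if any scored above 0.
    Optional<String> best(List<String> candidates) {
        return candidates.stream()
                .filter(c -> score(c) > 0.0)
                .max(Comparator.comparingDouble(this::score));
    }
}
```

With a table containing 13041611, the candidate KvK13041611 from the question is accepted at step 2, while 12juni2013 gets no score at all.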
Take a visual look at the ones that don't match after this set of steps and see if you can discern another pattern of OCR junk or concatenated junk. Perhaps your OCR is seeing "g" where the input is "8", or other possible issues.
You may be able to try using Levenshtein's distance to process these remaining items if you match substrings of equal length. They may also be few enough in number that you can correct your data manually and proceed.
Another possibility: you may be able to use Amazon Mechanical Turk to purchase crowdsourced labor to resolve some difficult cases.

Java inflector to convert plurals to singular forms

I am using the Java Inflector library to convert singular forms to plurals, example : 2 boat => 2 boats.
However, it fails when the inputs are already plural.
1 boats => boats,
butterflies => butterflieses
Is there any other Java utility that -
1. Converts plurals to singular when necessary, example : 1 boat => boat
2. Retains plural as it is, if the plural form is required.
Thanks!

You can check this with a round trip, so that you avoid pluralizing words that are already plural. How? Take the word and convert it to singular, then convert that result to plural, and check whether the resulting string is the same as the input string. If it is, the input was already plural; if not, you know that it is not plural.
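A minimal sketch of that round-trip check follows. The singularize and pluralize operators are placeholders for whatever inflector library is in use; the two TOY_ operators are deliberately naive assumptions that only exist to make the example runnable, and they do not handle words such as "butterflies".

```java
import java.util.function.UnaryOperator;

final class PluralCheck {
    // Toy operators for demonstration only; a real inflector would replace them.
    static final UnaryOperator<String> TOY_SINGULARIZE =
            w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
    static final UnaryOperator<String> TOY_PLURALIZE = w -> w + "s";

    // Round trip: singularize, pluralize again, and compare with the input.
    static boolean isAlreadyPlural(String word,
                                   UnaryOperator<String> singularize,
                                   UnaryOperator<String> pluralize) {
        return pluralize.apply(singularize.apply(word)).equals(word);
    }

    // Pluralizes only when the round trip says the word is not plural yet.
    static String pluralizeSafely(String word,
                                  UnaryOperator<String> singularize,
                                  UnaryOperator<String> pluralize) {
        return isAlreadyPlural(word, singularize, pluralize) ? word : pluralize.apply(word);
    }

    public static void main(String[] args) {
        System.out.println(pluralizeSafely("boat", TOY_SINGULARIZE, TOY_PLURALIZE));
        System.out.println(pluralizeSafely("boats", TOY_SINGULARIZE, TOY_PLURALIZE));
    }
}
```

The check relies on the singularizing step leaving words that are already singular unchanged.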

Natural language processing to recognise numerical data

My requirement is to recognize and extract numerical data from a natural language sentence (English only) in response to queries. Platform is Java. For example if the user query is "What is the height of mount Everest" and we have a paragraph as:
In 1856, the Great Trigonometric Survey of British India established the first published height of Everest, then known as Peak XV, at 29,002 ft (8,840 m). In 1865, Everest was given its official English name by the Royal Geographical Society upon recommendation of Andrew Waugh, the British Surveyor General of India at the time, who named it after his predecessor in the post, and former chief, Sir George Everest.[4] Chomolungma had been in common use by Tibetans for centuries, but Waugh was unable to propose an established local name because Nepal and Tibet were closed to foreigners. (Pasted from wikipedia)
For a user query "Height of mount Everest" from the paragraph I need to get 29002 ft or 8840 m as the answer. Can anyone please suggest any possible ways of doing it in Java? Are there any open source libraries for the same?

Obviously, doing this well is extremely difficult. If it's an assignment though then I'm guessing the expectation is a bit lower. Here are some thoughts to hopefully get you started:
I'd split the problem into 2 parts: parsing the question block and then parsing the answer block. From the question block, you need to know 2 pieces of information: the noun of what you're searching for, and the type of the answer. In this case the noun is Everest and the type is height. For the "types" of data you can fairly quickly build a dictionary to search your input string for (e.g. "height", "weight", "distance", "age"). The nouns are more difficult, so I'd say to just assume that every non-type in the question is a potential noun, perhaps after removing a dictionary of known non-nouns (such as "at", "the", "of" etc.).
Once you've identified the noun and type from the question, you can begin scanning your answer block. I'd begin by breaking that up into sentences. Then scan each sentence for each of your nouns. If one is found in that sentence, you need to scan the sentence again for numbers (taking into account possible whitespace or comma delimiting). Finally, you need to look "around" any numbers you find for a measurement type. So in this case, your "type" that we parsed from the question was "height". You would need to create a mapping of types to measurements, so "height" would map to "km, ft, in, cm, m" etc. If the number has one of these types around it, then return the number and measurement type as the answer.
Hope that gets you started. As stated above, this is not intended to be a robust, commercial solution. It's homework-level.
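The answer-scanning part could be sketched like this. MeasurementFinder and its method are hypothetical names, the noun and type are assumed to have been extracted from the question already, the sentence splitter is a crude regular expression, and only a unit directly after the number is considered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MeasurementFinder {
    // Mapping of answer types to measurement units, as suggested above.
    private static final Map<String, List<String>> UNITS = new HashMap<>();
    static {
        UNITS.put("height", Arrays.asList("km", "ft", "in", "cm", "m"));
    }

    // Digits that may contain commas and a decimal part, then optional
    // whitespace, then the following word (which may or may not be a unit).
    private static final Pattern NUMBER_WITH_WORD =
            Pattern.compile("(\\d[\\d,]*(?:\\.\\d+)?)\\s*([A-Za-z]+)");

    static List<String> find(String noun, String type, String text) {
        List<String> answers = new ArrayList<>();
        List<String> units = UNITS.getOrDefault(type.toLowerCase(), Collections.emptyList());
        // Crude sentence split: break after ., ! or ? followed by whitespace.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            if (!sentence.toLowerCase().contains(noun.toLowerCase())) {
                continue; // the noun does not occur in this sentence
            }
            Matcher m = NUMBER_WITH_WORD.matcher(sentence);
            while (m.find()) {
                if (units.contains(m.group(2))) {
                    // Drop the comma grouping so "29,002" becomes "29002".
                    answers.add(m.group(1).replace(",", "") + " " + m.group(2));
                }
            }
        }
        return answers;
    }
}
```

Short units such as "in" will also match ordinary words after a number, which is one of the reasons this stays at homework level.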
