jcrfsuite training file format

jcrfsuite training file format - java

From what I understand from the example of POS Tagging given in the examples of jcrfsuite. The training file is tab separated and first token is the label. But I do not get the BigCluster| thing. Can somebody help me with how to specify tokens in training file.
Example below:
O BigCluster|00 BigCluster|0000 BigCluster|000000 BigCluster|00000000 BigCluster|0000000000 BigCluster|000000000000 BigCluster|00000000000000 BigCluster|0000000000000000 NextBigCluster|0100 NextBigCluster|01000101 NextBigCluster|010001011111 POSTagDict|D POSTagDict|N POSTagDict|^ POSTagDict|$ POSTagDict|G NextPOSTag|V 1gramSuff|i 1gramPref|i prevword| prevcurr||i nextword|predict nextword|predict currnext|i|predict Word|I Lower|i Xxdshape|X charclass|1, first-shortcap prevnext||predict t=0
Test file format:
! BigCluster|01 BigCluster|0110 BigCluster|011011 BigCluster|01101100 BigCluster|0110110011 BigCluster|011011001100 BigCluster|01101100110000 BigCluster|0110110011000000 NextBigCluster|1000 NextBigCluster|10001000 NextBigCluster|100010000000 POSTagDict|V NextPOSTag|, metaph_POSDict|N 1gramSuff|n 2gramSuff|nn 3gramSuff|mnn 4gramSuff|mmnn 5gramSuff|mmmnn 6gramSuff|ammmnn 7gramSuff|aammmnn 8gramSuff|aaammmnn 9gramSuff|daaammmnn 1gramPref|d 2gramPref|da 3gramPref|daa 4gramPref|daaa 5gramPref|daaam 6gramPref|daaamm 7gramPref|daaammm 8gramPref|daaammmn 9gramPref|daaammmnn prevword| prevcurr||daaammmnn nextword|. nextword|. currnext|daaammmnn|. Word|Daaammmnn Lower|daaammmnn Xxdshape|Xxxxxxxxx charclass|1,2,2,2,2,2,2,2,2, first-initcap prevnext||. t=0

What is specified after the label is a list of feature-name and feature-value.
It is in a sparse representation instead of tabular representation.
BigCluster is just one of the features and it's relevant to the specific example only. You should create your own features if you are training from scratch.

I have noticed that CRFsuite does not care for the naming convention nor feature design of labels and attributes, because treats them as strings.
CRFsuite learns weights of associations (feature weights) between attributes and labels, without knowing the meaning of labels and attributes. In other words, one can design and use arbitrary features just by writing label and attribute names in data sets, just find the best posible attributes for your example and run some experiments with different sets of attributes and features. And you will good to go.

Related

How can I efficiently extract text from bunch for web pages without extra information

I have list of webpages around 1 million, I want to efficiently just extract text from those pages. Currently I am using BeautifulSoup library in python to get text from HTML and using request command to get html of a webpage. This approach extract some extra information in addition to the text like if any javascript is listed in body.
Could you please suggest me any suitable and efficient way to do the task. I looked at scrapy but it looks like it crawls specific website. Can we pass it list of specific webpages to get information from ?
Thank you in advance.

Yes, you can use Scrapy to crawl a set of URLs in a generic fashion.
You simply need to set them on the start_urls list attribute of your spider, or reimplement the start_requests spider method to yield requests from any data source, and then implement your parse callback to perform the generic content extraction you want.
You can use html-text to extract text from them, and regular Scrapy selectors to extract additional data like the one you mention.

In scrapy you can set up your own parser. E.g. Beautiful soup. This parser you can call from your parse method.
To extract text from generic pages I traverse the body only, exclude comments etc and some tags like script, style, etc:
for snippet in soup.find('body').descendants:
if isinstance(snippet, bs4.element.NavigableString) \
and not isinstance(snippet, EXCLUDED_STRING_TYPES)\
and snippet.parent.name not in EXCLUDED_TAGS:
snippet = re.sub(UNICODE_WHITESPACES, ' ', snippet)
snippet = snippet.strip()
if snippet != '':
snippets.append(snippet)
with
EXCLUDED_STRING_TYPES = (bs4.Comment, bs4.CData, bs4.ProcessingInstruction, bs4.Declaration)
EXCLUDED_TAGS = ['script', 'noscript', 'style', 'pre', 'code']
UNICODE_WHITESPACES = re.compile(u'[\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004'
u'\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+')

How to prepare training data for OpenNLP to Tokenize the token that contains more than one word?

In some language (for example: Vietnamese), some vocabulary consists of multiple words. So that some tokens which contain more than one word can be tokenized not just using the white space.
I have following input:
Người dân địa phương đã nhiều lần báo Điện lực Bến Tre nhưng chưa được giải quyết .
Expected output:
["Người dân", "địa phương", "đã", "nhiều", "lần", "báo", "Điện lực", "Bến Tre", "nhưng", "chưa", "được", "giải quyết"]
Training data I have _ connect the word that need to stick together in one token:
Người_dân địa_phương đã nhiều lần báo Điện_lực Bến_Tre nhưng chưa được giải_quyết .
Here is command line I use to train
opennlp TokenizerTrainer -model "model/vi-token.bin" -alphaNumOpt 1 -lang "vi" -data "data/merge_vlsp_removehtml" -encoding "UTF-8" -params param/wordseg.param
with param
Iterations=1000
However, the output cannot connect multiple word in one token but it split by whitespace.
Command I run to get output
opennlp TokenizerME model/vi-token.bin < sample/sample_text > sample/sample_text.out
What should I do with training data our config param to train the tokenizer with multiple word each token ?

Rather than using the underscore for training, use tags. OpenNLP uses tags as the reference for training. Follow the instructions given for NER and training your Tokenizer.
opennlp provides 'TokenizerTrainer' tool to train data. The OpenNLP format contains one sentence per line. You can also specify tokens either separated by a whitespace or by a special tag.
you can follow this blog for head start in opennlp for various purposes. The post will show you how to create a training file and build a new model.
You can easily create your own training data-set using the modelbuilder addon and follow some rules as mentioned here to train create a good NER model.
you can find some help using modelbuilder addon here.
It is basically, you put all the information in a text file and the NER entities in another. The addon searches for a paticular entity and replace it with the required tag. Hence producing the tagged data. It must be pretty easy to use this tool!
Also, follow mr. markg's answer to get an understanding on creating new models on your own. This will help you build your own models which can be customized for your applications.
Hope this helps!

How can I parse CDATA?

How can I find and iterate through all the nodes present under CDATA and those nodes are started by (<) and closed by (>)?
Also, how should I iterate over all the child nodes and get the values like in below child node? I want to retrieve the value.
Input XML
<SOURCE TransactionId="1" ProviderName="ABCDD"><RESPONSE><![CDATA[<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><soap:Body><NetworkResponse xmlns="http://www.example.com/"><NetworkResult><Network offering_id="13" transaction_id="2" submission_id="3" timestamp="20140828 16010683 GMT" customer_id="NETTest">
<Network_List>
<Network_Info att0="Y" att1="N" att2="N" att3="Y" att4="Y">
<SIM_DATA>
<SIM><![CDATA[1100040101]]></SIM>
</SIM_DATA>
<NetworkResponseInfo k_status="C">
<KEY1>269</KEY1>
<PARENTNODE>
<CHILDNODE1>
<KEY2>XXXXXXX</KEY2>
<KEY3>YYYYYYY</KEY3>
</CHILDNODE1>
<CHILDNODE2>
<KEY4>N</KEY4>
<KEY5>I</KEY5>
</CHILDNODE2>
<CHILDNODE3>
<KEY6>1</KEY6>
<KEY7>3</KEY7>
</CHILDNODE3>
</PARENTNODE>
<KEY8><![CDATA[some image not visible]]></KEY8>
<KEY9>N</KEY9>
<KEY10>15</KEY10>
</NetworkResponseInfo>
</Network_Info>
</Network_List>
<response_message_list transaction_status_code="000" transaction_status_text="Successful"/>
</Network></NetworkResult></NetworkResponse></soap:Body></soap:Envelope>]]></RESPONSE></SOURCE>
Output XML
<ns3:NetworkResponse>
<Networks_OF_List>
<NetCharSeq>
<Nrep>
<type>Some Image</type>
<data> Data Coming from KEY8 CDATA section</data>
</Nrep>
<Nrep>
<type>ANYTHING</type>
<data>VALUE INSIDE SIM CDATA</data>
</Nrep>
<NetDetail>
<MYKEY1>Value present inside KEY4</MYKEY1>
<MYKEY2>Value present inside KEY5</MYKEY2>
</NetDetail>
<SystemID>Value of KEY2</SystemID>
<SystemPath>Valuelue of KEY3</SystemPath>
</NetCharSeq>
</Networks_OF_List>
</ns3:NetworkResponse>

(Welcome at SO. Please note that you are downvoted by some users because you do not show what you have done so far. Have a look at the How To Ask section to learn how to ask questions that actually can be answered and are considered proper questions in the SO format.)
If you can use XSLT 3.0, you can consider using the new fn:parse-xml function, which will take a document-as-a-string.
However, your CDATA-section contains itself escaped data, which means that, after you apply fn:parse-xml, you will have to do it once again for the text node that is the child of NetworkResult.
A better solution is often to fix this at the source and creating an XML format that allows other XML in certain elements (you can allow this with a proper XSD). It will save you a lot of trouble and at least you XML can then be pre-validated.
If you are stuck with XSLT 2.0 or 1.0, you can use disable-output-escaping (google it, there is a lot of info around on how to use it), but you will have to re-process your output once more because of the double-escape that is used. You may want to consider an XProc pipeline to ease the process.
You wrote: Also, how should I iterate over all the child nodes and get the values like in below child node
That is what XSLT is all about, please read this XSLT Tutorial, or any other tutorial you can find, it will be explained to you in the first minutes.
Update: as suggested by michael.hor257k in the comments, you can also parse the escaped data by hand using string manipulation functions. As he already says in the comments, this is laborious and error-prone, but sometimes, esp. if the XML is not really XML after unescaping, but something like XML, then this may be your only option.

Convert OpenIE triplet to N-Triplet (NT)

I downloaded and used OpenIE4.1 jar file (downloadable from http://knowitall.github.io/openie/) to process some free text documents and produced triplet-like outputs along with the text and confidence score, for instance,
The rail launchers are conceptually similar to the underslung SM-1
0.93 (The rail launchers; are; conceptually similar to the underslung SM-1)
I wrote a java parser to extract OpenIE triplets which confidence score is >= 0.85 and
need to know the way to convert it to N-triplet (NT), format look like.
Not sure if I need to be familiar with the ontology that I'm trying to map to.

After discussion with my colleagues. This is what I should do to create N-Triplet(NT) and Detailed Java codes can be found in another Question: Use RDF API (Jena, OpenRDF or Protege) to convert OpenIE outputs
Create a blank node identifier for each distinct :subject in the file (call it node_s)
Create a blank node identifier for each distinct :object in the file (call it node_o)
Define a URI for each distinct predicate
Create these triples:
1. node_s rdf:type <http://mypage.org/vocab#Corpus>
2. node_s dc:title “The rail launchers”
3. node_s dc:source “Sample File”
4. node_s rdf:predicate <http://mypage.org/vocab#are>
5. node_o rdf:type <http://mypage.org/vocab#Corpus>
6. node_o dc:title “conceptually similar to the underslung SM-1”

Distributing the XML files

I am totally new to XML and its capabilities.
I have a file say xyz.xml.
It contains content like this:
<system-config>
<business-model>
<agent-category key="operator">
<singular-name>Operator</singular-name>
<plural-name>Operators</plural-name>
<attribute>agent-attribute.reference</attribute>
</agent-category>
Next I have
<agent-attribute id="agent-attribute.reference">
<name>Reference</name>
< description>A unique identifier for this agent, typically an MSISDN.</description>
<mandatory>true< /mandatory>
<editable>false< /editable>
<deletable>false< /deletable>
<sensitive>false< /sensitive>
<system-generated>false< /system-generated>
<input-method xsi:type="AgentReferenceInputMethod"></input-method>
<storage-location xsi:type="AgentRefStorage" field="reference"></storage-location>
</agent-attribute>
</business-model>
Now I want to distribute the agent-attributes to different file named agentAttr.xml.
Is it possible to do so (mind it <agent-attribute> is under <system-config><business-model>), if so how?

So you want to extract the agent-attribute portions ?. You can do that with simple XSLT transformation (use e.g. Xalan for that). Another option could be jsoup, parsing it using DOM or manually.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.