parsing multiple lines with regex - java

I'm writing a program in Java that parses a BibTeX library file. Each entry should be parsed into a field and a value. This is an example of a single BibTeX entry from a library:
@INPROCEEDINGS{conf/icsm/Ceccato07,
author = {Mariano Ceccato},
title = {Migrating Object Oriented code to Aspect Oriented Programming},
booktitle = {ICSM},
year = {2007},
pages = {497--498},
publisher = {IEEE},
bibdate = {2008-11-18},
bibsource = {DBLP, http://dblp.uni-trier.de/db/conf/icsm/icsm2007.html#Ceccato07},
crossref = {conf/icsm/2007},
owner = {Administrator},
timestamp = {2009.04.30},
url = {http://dx.doi.org/10.1109/ICSM.2007.4362668}
}
In this case, I just read the line and split it using the split method. For example, the first field (author) is parsed like this:
Scanner in = new Scanner(new File("library.bib"));
in.nextLine(); //skip the header
String input = in.nextLine(); //read (author = {Mariano Ceccato},)
String field = input.split("=")[0].trim(); //field = "author"
String value = input.split("=")[1]; //value = "{Mariano Ceccato},"
value = value.split("\\}")[0]; //value = "{Mariano Ceccato"
value = value.split("\\{")[1]; //value = "Mariano Ceccato"
value = value.trim(); //remove any white space (if any)
Up to now everything is good. However, there are BibTeX entries in the library that have multi-line values:
@ARTICLE{Aksit94AbstractingCF,
author = {Mehmet Aksit and Ken Wakita and Jan Bosch and Lodewijk Bergmans and
Akinori Yonezawa },
title = {{Abstracting Object Interactions Using Composition Filters}},
journal = {Lecture Notes in Computer Science},
year = {1994},
volume = {791},
pages = {152--??},
acknowledgement = {Nelson H. F. Beebe, Center for Scientific Computing, University of
Utah, Department of Mathematics, 110 LCB, 155 S 1400 E RM 233, Salt
Lake City, UT 84112-0090, USA, Tel: +1 801 581 5254, FAX: +1 801
581 4148, e-mail: \path|beebe@math.utah.edu|, \path|beebe@acm.org|,
\path|beebe@computer.org|, \path|beebe@ieee.org| (Internet), URL:
\path|http://www.math.utah.edu/~beebe/|},
bibdate = {Mon May 13 11:52:14 MDT 1996},
coden = {LNCSD9},
issn = {0302-9743},
owner = {aljasser},
timestamp = {2009.01.08}
}
As you can see, the acknowledgement field spans more than one line, so I can't read it using nextLine(). My parsing function works fine with it if I pass the whole entry to it as a String. So what is the best way to read this entry and other multi-line entries, while still being able to read single-line entries?

The form of these entries is
@<type>{<Id>
<name>={<value>},
....
<name>={<value>}
}
Note that the last name-value pair is not followed by a comma.
If a value is split over several lines, then that simply means that a particular line does not yet contain the closing brace. In that case, scan the next line and append it to the string you are about to split. Keep doing this until the last characters in the string are "}," or "}" (this latter would happen if the 'acknowledgement' was the last name-value pair in the record).
For extra safety, count that the number of closing braces matches the number of opening braces, and keep appending lines to your string until it does. This would be to cover situations where you have a long title in an article that happened to unfortunately break at the wrong place, such as
title = {{Abstracting Object Interactions Using Composition Filters, and other stuff}
},
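The brace-counting idea above can be sketched in plain Java. This is a minimal, hedged sketch: the class name and the field/value regex are illustrative, not from any BibTeX library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BibLineJoiner {
    // Matches "name = {value}" with an optional trailing comma; DOTALL so
    // the value may span several joined lines.
    private static final Pattern FIELD =
            Pattern.compile("(\\w+)\\s*=\\s*\\{(.*)\\},?\\s*", Pattern.DOTALL);

    // Joins physical lines into logical "name = {value}," entries by
    // counting braces, then parses each complete entry.
    public static List<String[]> parse(List<String> lines) {
        List<String[]> fields = new ArrayList<>();
        StringBuilder entry = new StringBuilder();
        int depth = 0;
        for (String line : lines) {
            entry.append(line).append(' ');
            for (char c : line.toCharArray()) {
                if (c == '{') depth++;
                else if (c == '}') depth--;
            }
            if (depth <= 0) { // braces balanced: the entry is complete
                Matcher m = FIELD.matcher(entry.toString().trim());
                if (m.matches()) {
                    fields.add(new String[] { m.group(1), m.group(2).trim() });
                }
                entry.setLength(0);
                depth = 0;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "author = {Mehmet Aksit and Ken Wakita and",
                "Akinori Yonezawa},",
                "year = {1994},");
        for (String[] f : parse(lines)) {
            System.out.println(f[0] + " -> " + f[1]);
        }
    }
}
```

Counting braces rather than just checking for a trailing `},` also covers the unlucky-line-break case shown above.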

For these kinds of issues, it is always better to use a specific parser.
I googled for a BibTeX parser and found this.
If you would rather keep your own, as you are doing now, one solution to this problem is to check whether the line ends with }, and if not, append the next line to the current one.
Having said that, there might be other issues; that's why I suggested using a parser.

Related

How can I read user data (memory) from EPC RFID tag through LLRP?

I encode two EPC tags through "NiceLabel Pro" with data:
First tag: EPC: 555555555, UserData: 9876543210123456789
Second tag: EPC: 444444444, UserData: 123456789123456789
Now I'm trying to get that data through LLRP (in my Java application):
My LLRPClient (one function):
public void PrepareInventoryRequest() {
AccessCommand accessCommand = new AccessCommand();
// A list to hold the op specs for this access command.
accessCommand.setAccessCommandOpSpecList(GenerateOpSpecList());
// Create a new tag spec.
C1G2TagSpec tagSpec = new C1G2TagSpec();
C1G2TargetTag targetTag = new C1G2TargetTag();
targetTag.setMatch(new Bit(1));
// We want to check memory bank 1 (the EPC memory bank).
TwoBitField memBank = new TwoBitField("2");
targetTag.setMB(memBank);
// The EPC data starts at offset 0x20.
// Start reading or writing from there.
targetTag.setPointer(new UnsignedShort(0));
// This is the mask we'll use to compare the EPC.
// We want to match all bits of the EPC, so all mask bits are set.
BitArray_HEX tagMask = new BitArray_HEX("00");
targetTag.setTagMask(tagMask);
// We only want to operate on tags with this EPC.
BitArray_HEX tagData = new BitArray_HEX("00");
targetTag.setTagData(tagData);
// Add a list of target tags to the tag spec.
List <C1G2TargetTag> targetTagList =
new ArrayList<>();
targetTagList.add(targetTag);
tagSpec.setC1G2TargetTagList(targetTagList);
// Add the tag spec to the access command.
accessCommand.setAirProtocolTagSpec(tagSpec);
accessSpec.setAccessCommand(accessCommand);
...
private List<AccessCommandOpSpec> GenerateOpSpecList() {
// A list to hold the op specs for this access command.
List <AccessCommandOpSpec> opSpecList =
new ArrayList<>();
// Set default opspec which for eventcycle of accessspec 3.
C1G2Read opSpec1 = new C1G2Read();
// Set the OpSpecID to a unique number.
opSpec1.setOpSpecID(new UnsignedShort(1));
opSpec1.setAccessPassword(new UnsignedInteger(0));
// We'll read from user memory (bank 3).
TwoBitField opMemBank = new TwoBitField("3");
opSpec1.setMB(opMemBank);
// We'll read from the base of this memory bank (0x00).
opSpec1.setWordPointer(new UnsignedShort(0));
// A word count of 0 means read to the end of the memory bank.
opSpec1.setWordCount(new UnsignedShort(0));
opSpecList.add(opSpec1);
return opSpecList;
}
My tag handler function:
private void updateTable(TagReportData tag) {
if (tag != null) {
EPCParameter epcParam = tag.getEPCParameter();
String EPCStr;
List<AccessCommandOpSpecResult> accessResultList = tag.getAccessCommandOpSpecResultList();
for (AccessCommandOpSpecResult accessResult : accessResultList) {
if (accessResult instanceof C1G2ReadOpSpecResult) {
C1G2ReadOpSpecResult op = (C1G2ReadOpSpecResult) accessResult;
if ((op.getResult().intValue() == C1G2ReadResultType.Success) &&
(op.getOpSpecID().intValue() < 1000)) {
UnsignedShortArray_HEX userMemoryHex = op.getReadData();
System.out.println("User Memory read from the tag is = " + userMemoryHex.toString());
}
}
}
...
For the first tag, "userMemoryHex.toString()" = "3938 3736"
For the second tag, "userMemoryHex.toString()" = "3132 3334"
Why? How do I get all user data?
This is my RFID tag.
The values that you get seem to be the first 4 characters of the number (interpreted as an ASCII string):
39383736 = "9876" (when interpreting those 4 bytes as ASCII characters)
31323334 = "1234" (when interpreting those 4 bytes as ASCII characters)
Since the specification of your tag says
Memory: EPC 128 bits, User 32 bits
your tag can only contain 32 bits (= 4 bytes) of user data. Hence, your tag simply can't contain the full value (i.e. 9876543210123456789 or 123456789123456789) that you tried to write as UserData (regardless of whether this was interpreted as a decimal number or a string).
Instead, your writer application seems to have taken the first 4 characters of those values, encoded them in ASCII, and wrote them to the tag.
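The decoding step can be reproduced without any LLRP classes. Assuming the hex words are exactly what `toString()` printed, a plain-Java sketch:

```java
public class HexAscii {
    // Converts a string of hex bytes (spaces ignored) into the ASCII text
    // it represents, e.g. "3938 3736" -> "9876".
    public static String hexToAscii(String hex) {
        String clean = hex.replace(" ", "");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < clean.length(); i += 2) {
            out.append((char) Integer.parseInt(clean.substring(i, i + 2), 16));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexToAscii("3938 3736")); // prints 9876
        System.out.println(hexToAscii("3132 3334")); // prints 1234
    }
}
```

Four bytes of user memory can hold exactly four ASCII characters, which matches what both reads returned.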

How to edit an entry-sequenced Enscribe file

I need some help with this problem. It looks stupid, but I could not resolve it. I have an entry-sequenced file with variable-length records. I only need to replace the first 3 bytes with XXX, so I have to rebuild the whole file. The problem I am getting is that I am changing the length of all records, padding them with nulls, because I have no way of knowing in advance how many bytes each record contains.
For example I have this file with three records:
AAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCC
DDDDDDDDDDDDDD
The file has a REC attribute of 26 (equal to the length of the second record). When I execute my program to change the first three letters, the file ends up like this (read "N" as a null character):
AAAAAAAAAAAAAAAANNNNNNNNNN
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCNNNNNNNNNNNNNNNNNNNNN
DDDDDDDDDDDDDDNNNNNNNNNNNN
How can i change my program to get what i want?
XXXAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCC
DDDDDDDDDDDDDD
This is my code (Java):
EnscribeFile p_origin = new EnscribeFile(file);
String first_record;
byte buffer[];
//First, load all records and then purge the file content
ArrayList<byte[]> records = new ArrayList<byte[]>();
p_origin.open(EnscribeOpenOptions.READ_WRITE, EnscribeOpenOptions.SHARED);
EnscribeFileAttributes et = p_origin.getFileInfo();
buffer = new byte[et.getRecordLength()];
while ( p_origin.read(buffer, et.getRecordLength()) != EnscribeFile.POSITION_UNUSED )
{
byte auxRecord[] = new byte[et.getRecordLength()];
System.arraycopy(buffer, 0, auxRecord, 0, et.getRecordLength());
buffer = new byte[et.getRecordLength()];
records.add(auxRecord);
}
p_origin.purgeData();
//Second, modify the first record
first_record = new String(records.get(0));
first_record = "XXX" + first_record.substring(3);
records.set(0, first_record.getBytes());
//Third, rewrite the records and close the file
Iterator<byte[]> i = records.iterator();
while ( i.hasNext() )
{
byte aux[] = i.next();
p_origin.write(aux, et.getRecordLength()); //Check the note
}
p_origin.close();
Note: I cannot just find the last character before the first null before writing, because a null or nulls at the end of a record are possible and acceptable. Example (remember, "N" is a null):
AAAAAAAAAAAAAAAANN
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCNN
DDDDDDDDDDDDDDNN
Must equal to this after the process:
XXXAAAAAAAAAAAAANN
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCNN
DDDDDDDDDDDDDDNN
OK, I found the solution in another forum. It is very simple. The method
p_origin.read(...)
returns the number of bytes read, which is the length I did not know, so I simply save that length in a variable before creating each new record. With some changes the code becomes:
EnscribeFile p_origin = new EnscribeFile(file);
String first_record;
byte buffer[];
//First, load all records and then purge the file content
ArrayList<byte[]> records = new ArrayList<byte[]>();
p_origin.open(EnscribeOpenOptions.READ_WRITE, EnscribeOpenOptions.SHARED);
EnscribeFileAttributes et = p_origin.getFileInfo();
buffer = new byte[et.getRecordLength()];
int aux_len = p_origin.read(buffer, et.getRecordLength());
while ( aux_len != EnscribeFile.POSITION_UNUSED )
{
byte auxRecord[] = new byte[aux_len];
System.arraycopy(buffer, 0, auxRecord, 0, aux_len); //copy only the bytes actually read
records.add(auxRecord);
aux_len = p_origin.read(buffer, et.getRecordLength());
}
p_origin.purgeData();
//Second, modify first record
first_record = new String(records.get(0));
first_record = "XXX" + first_record.substring(3);
records.set(0,first_record.getBytes());
//Third, rewrite the records and close the file
Iterator<byte[]> i = records.iterator();
while( i.hasNext() )
{
byte aux_byte[] = i.next();
p_origin.write(aux_byte,aux_byte.length);
}
p_origin.close();
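Stripped of the Enscribe API, the essence of the fix is "copy only as many bytes as read() reported". A minimal sketch with plain arrays (the padded buffer below stands in for what a fixed-size read returns):

```java
import java.util.Arrays;

public class RecordCopy {
    // Copies only bytesRead bytes out of a fixed-size read buffer, so
    // padding added by the buffer size never leaks into the record.
    public static byte[] truncate(byte[] buffer, int bytesRead) {
        return Arrays.copyOf(buffer, bytesRead);
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[26]; // REC attribute of 26
        byte[] data = "CCCCC\0\0".getBytes(); // record with legitimate trailing nulls
        System.arraycopy(data, 0, buffer, 0, data.length);
        // A read() that reports 7 bytes keeps the record's own nulls
        // but drops the 19 padding bytes.
        System.out.println(truncate(buffer, data.length).length); // prints 7
    }
}
```

This is why scanning backwards for the last non-null character cannot work: the record's own trailing nulls and the buffer's padding nulls are indistinguishable without the reported length.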

Customized Relationship Extraction Between two Entities Stanford NLP

I am looking for logic similar to that described here: RelationExtraction NLP
Following the process explained in that answer, I am able to get NER and entity linking working, but I am very confused by the "slot filling" logic and am not finding proper resources on the Internet.
Here is my code sample
public static void main(String[] args) throws IOException, ClassNotFoundException {
// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// read some text in the text variable
//String text = "Mary has a little lamb. She is very cute."; // Add your text here!
String text = "Matrix Partners along with existing investors Sequoia Capital and Nexus Venture Partners has invested R100 Cr in Mumbai based food ordering app, TinyOwl. The series B funding will be used by the company to expand its geographical presence to over 50 cities, upgrade technology and enhance user experience.";
text+="In December last year, it raised $3 Mn from Sequoia Capital India and Nexus Venture Partners to deepen its presence in home market Mumbai. It was seeded by Deap Ubhi (who had earlier founded Burrp) and Sandeep Tandon.";
text+="Kunal Bahl and Rohit Bansal, were also said to be planning to invest in the company’s second round of fund raise.";
text+="Founded by Harshvardhan Mandad and Gaurav Choudhary, TinyOwl claims to have tie-up with 5,000 restaurants and processes almost 2000 orders. The app which competes with the likes of FoodPanda aims to process over 50,000 daily orders.";
text+="The top-line comes from the cut the company takes from each order placed through its app.";
text+="The startup is also planning to come with reviews which would make it a competitor of Zomato, valued at $660 Mn. Also, Zomato is entering the food ordering business to expand its offerings.";
text+="Recently another peer, Bengaluru based food delivery startup, SpoonJoy raised an undisclosed amount of funding from Sachin Bansal (Co-Founder Flipkart) and Mekin Maheshwari (CPO Flipkart), Abhishek Goyal (Founder, Tracxn) and Sahil Barua (Co-Founder, Delhivery).";
text+="-TechCrunch";
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
// run all Annotators on this text
pipeline.annotate(document);
// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for(CoreMap sentence: sentences) {
// traversing the words in the current sentence
// a CoreLabel is a CoreMap with additional token-specific methods
for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
// this is the text of the token
String word = token.get(TextAnnotation.class);
//System.out.println(" word \n"+word);
// this is the POS tag of the token
String pos = token.get(PartOfSpeechAnnotation.class);
// System.out.println(" pos \n"+pos);
// this is the NER label of the token
String ne = token.get(NamedEntityTagAnnotation.class);
//System.out.println(" ne \n"+ne);
}
// this is the parse tree of the current sentence
Tree tree = sentence.get(TreeAnnotation.class);
System.out.println(" TREE \n"+tree);
// this is the Stanford dependency graph of the current sentence
SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
System.out.println(" dependencies \n"+dependencies);
}
// This is the coreference link graph
// Each chain stores a set of mentions that link to each other,
// along with a method for getting the most representative mention
// Both sentence and token offsets start at 1!
Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
System.out.println("graph \n "+graph);
}
}
This gives output with the same entities combined. Now I have to take this further by finding the relationships between those entities. For example, from the text in the code I should get as output that "Matrix Partners" and "Sequoia Capital" have the relation "investor", or a similar kind of structure.
Please correct me if I am wrong somewhere, and point me in the right direction.

How do I get this Scanner to stop right before a blank line?

I am trying to get a file scanner to stop scanning when the next line is blank.
public static void createAudioTypes(File list, Scanner mediaReader, String mediaType) {
if (mediaType.equalsIgnoreCase("CD") || mediaType.equalsIgnoreCase("CASSETTE")) {
String title = mediaReader.next();
String artist = mediaReader.next();
int year = 0;
if (mediaReader.hasNextInt()) {
year = mediaReader.nextInt();
}
String lbl = mediaReader.next();
ArrayList<String> songs = new ArrayList<String>();
System.out.println(title);
System.out.println(artist);
System.out.println(year);
System.out.println(lbl);
mediaReader.useDelimiter(",");
while(mediaReader.nextLine().equalsIgnoreCase(mediaType)) {
songs.add(mediaReader.next());
}
System.out.println(songs);
}
}
Part of the text file I'm reading from:
CD
Immersion
Pendulum
2011
Atlantic
Genesis, Salt in the Wounds, Watercolour, Set Me on Fire, Crush, Under the Waves, Immunize (feat. Liam Howlett), The Island - pt. 1 - Dawn, The Island - pt. 2 - Dusk, Comprachicos, The Vulture, Witchcraft, Self vs Self (feat. In Flames), The Fountain (feat Steven Wilson), Encoder
16.99
CD
Demon Days
Gorillaz
Notice that the line with all the track titles does not word-wrap; this line is to be read into an ArrayList. Just beneath the line of track titles there is a price, and then a blank line. I want this blank line to be the stopping point for the scanner, but I can't get the syntax.
Thank you!
Read the whole line (nextLine() on a Scanner, or readLine() on a BufferedReader), then trim() it and check whether it isEmpty().
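A minimal sketch of that suggestion (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class StopAtBlank {
    // Reads lines until a blank (empty or whitespace-only) line is hit.
    public static List<String> readUntilBlank(Scanner in) {
        List<String> lines = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (line.trim().isEmpty()) {
                break; // blank line: stop scanning
            }
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) {
        Scanner in = new Scanner("CD\nImmersion\n16.99\n\nCD\nDemon Days");
        System.out.println(readUntilBlank(in)); // [CD, Immersion, 16.99]
    }
}
```

Reading whole lines also avoids mixing next()/nextInt() token reads with line-oriented input, which is a common source of Scanner confusion.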

ROME API to parse RSS/Atom

I'm trying to parse RSS/Atom feeds with the ROME library. I am new to Java, so I am not in tune with many of its intricacies.
Does ROME automatically use its modules to handle different feeds as it comes across them, or do I have to ask it to use them? If so, any direction on this.
How do I get to the correct 'source'? I was trying to use item.getSource(), but it is giving me fits. I guess I am using the wrong interface. Some direction would be much appreciated.
Here is the meat of what I have for collecting my data.
I noted two areas where I am having problems, both revolving around getting the source information of the feed. And by source, I mean CNN, or FoxNews, or whoever, not the author.
Judging from my reading, .getSource() is the correct method.
List<String> feedList = theFeeds.getFeeds();
List<FeedData> feedOutput = new ArrayList<FeedData>();
for (String sites : feedList ) {
URL feedUrl = new URL(sites);
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = input.build(new XmlReader(feedUrl));
List<SyndEntry> entries = feed.getEntries();
for (SyndEntry item : entries){
String title = item.getTitle();
String link = item.getUri();
Date date = item.getPublishedDate();
Problem here --> ** SyndEntry source = item.getSource();
String description;
if (item.getDescription()== null){
description = "";
} else {
description = item.getDescription().getValue();
}
String cleanDescription = description.replaceAll("\\<.*?>","").replaceAll("\\s+", " ");
FeedData feedData = new FeedData();
feedData.setTitle(title);
feedData.setLink(link);
And Here --> ** feedData.setSource(link);
feedData.setDate(date);
feedData.setDescription(cleanDescription);
String preview =createPreview(cleanDescription);
feedData.setPreview(preview);
feedOutput.add(feedData);
// lets print out my pieces.
System.out.println("Title: " + title);
System.out.println("Date: " + date);
System.out.println("Text: " + cleanDescription);
System.out.println("Preview: " + preview);
System.out.println("*****");
}
}
getSource() is definitely wrong: it returns the SyndFeed to which the entry in question belongs. Perhaps what you want is getContributors()?
As far as modules go, they should be selected automatically. You can even write your own and plug it in as described here.
What about trying to regex the source from the URL, without using the API?
That was my first thought; anyway, I checked the RSS standardized format itself to get an idea of whether this option is actually available at that level, and then tried to trace its implementation upwards.
In RSS 2.0, I have found the source element; however, it appears that it doesn't exist in previous versions of the spec, which is not good news for us!
<source> is an optional sub-element of <item>.
Its value is the name of the RSS channel that the item came from, derived from its <title>. It has one required attribute, url, which links to the XMLization of the source.
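A hedged sketch of the "derive it from the URL" idea, using only the JDK. Stripping a leading "www." is an assumption about what makes a readable source label, not anything the RSS spec or ROME defines:

```java
import java.net.URI;

public class FeedSource {
    // Derives a crude "source" label (the host) from a feed URL.
    // This is a workaround, not part of the ROME API.
    public static String sourceFromUrl(String feedUrl) {
        String host = URI.create(feedUrl).getHost();
        // Strip a leading "www." so "www.foxnews.com" becomes "foxnews.com".
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        System.out.println(sourceFromUrl("http://rss.cnn.com/rss/edition.rss")); // rss.cnn.com
        System.out.println(sourceFromUrl("http://www.foxnews.com/about/rss/"));  // foxnews.com
    }
}
```

Since the loop above already has the feed URL in `sites`, this could feed `feedData.setSource(...)` directly, independent of whether the feed itself carries a source element.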
