String matching and replace in Java - java

I have a String like this:
String a = "Barbara Liskov (born Barbara Jane Huberman on November 7, 1939"
+" in California) is a computer scientist.[2] She is currently the Ford"
+" Professor of Engineering in the MIT School of Engineering's electrical"
+" engineering and computer science department and an institute professor"
+" at the Massachusetts Institute of Technology.[3]";
I would like to replace all of these elements: [1], [2], [3], etcetera, with a blank space.
I tried with:
if (a.matches("([){1}\\d(]){1}")) {
a = a.replace("");
}
but it does not work!

Your Pattern is all wrong.
Try this example:
String input =
"Barbara Liskov (born Barbara Jane Huberman on November 7, 1939 in California) "
+ "is a computer scientist.[2] She is currently the Ford Professor of Engineering "
+ "in the MIT School of Engineering's electrical engineering and computer "
+ "science department and an institute professor at the Massachusetts Institute "
+ "of Technology.[3]";
// \\[   escaped opening square bracket
// \\d+  one or more digits
// \\]   escaped closing square bracket
// " "   replace with one space
System.out.println(input.replaceAll("\\[\\d+\\]", " "));
Output (newlines added for clarity)
Barbara Liskov (born Barbara Jane Huberman on November 7,
1939 in California) is a computer scientist.
She is currently the Ford Professor of Engineering in the MIT
School of Engineering's electrical engineering and computer science
department and an institute professor at the Massachusetts Institute of Technology.

Very simple:
a = a.replaceAll("\\[\\d+\\]","");
The changes:
Use replaceAll instead of replace, since replace treats its first argument as literal text, not as a regex.
Escape [ and ] - they are regex special characters. The parentheses do not escape them.
No need for {1} in your regex: \\[{1} is the same as \\[ - both say that the character should occur exactly once.
The + in \\d+ is for numbers with more than one digit, such as [12].

About your pattern ([){1}\\d(]){1}:
{1} is always redundant, since a count of one is implicit
[ and ] need to be escaped with a backslash (which must itself be escaped with another backslash, since it is inside a string literal)
\\d has no quantifier, so [12], for example, won't match since it has two digits
So, better try: \\[\\d+\\]
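To make the difference concrete, here is a minimal sketch comparing the pattern with and without the + quantifier. The class name, the helper name stripMarkers and the shortened sample text (including the two-digit marker [12]) are made up for illustration.

```java
class CitationMarkerDemo {
    // Removes bracketed numeric markers such as [2] or [12].
    static String stripMarkers(String text) {
        return text.replaceAll("\\[\\d+\\]", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample; [12] is added to show the multi-digit case.
        String a = "is a computer scientist.[2] She is currently the Ford Professor.[12]";

        // Without '+', only single-digit markers are removed, so [12] survives.
        System.out.println(a.replaceAll("\\[\\d\\]", ""));

        // With '+', markers with any number of digits are removed.
        System.out.println(stripMarkers(a));

        // matches() tests the WHOLE string against the pattern, so it is false here.
        System.out.println(a.matches("\\[\\d+\\]"));
    }
}
```

This is also why the original if (a.matches(...)) check never fired: matches only succeeds when the entire string is a single marker.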

Use String's replaceAll(String regex, String replacement) method.
All you have to do is a = a.replaceAll("\\[\\d+\\]", " ").
See the Javadoc for more information.

Use this:
String a = "Barbara Liskov (born Barbara Jane Huberman on November 7, 1939 in California) is a computer scientist.[2] She is currently the Ford Professor of Engineering in the MIT School of Engineering's electrical engineering and computer science department and an institute professor at the Massachusetts Institute of Technology.[3]";
for(int i =1 ; i<= 3; i++){
a= a.replace("["+i+"]","");
}
System.out.println(a);
This works, but only for the markers [1] to [3] that the loop covers.

Related

How to remove all type of prefix salutation from string in java?

Suppose I have string coming from Input with name (for eg: Mr . Aditya Jha). How do I remove salutation from the start of input?
List of salutations that can come are:
Mr, Mrs, Dr, Miss, Ms, Rev, Mr. , Mr. , Dr. , Miss. , Ms. , Rev. , Mr . , Mr . , Dr . , Miss . , Ms . , Rev .
Any solution or regex statement which can consider all of these salutations?
I tried this:
name.replaceAll("\\s{2,}", " ").replaceFirst("(?i)(Mr . )", "").replaceFirst("(?i)(Mr |Mr. )", "").trim()
It works, but for a name like amra khan it also removes the mr.
You may use
name = name.replaceAll("\\s{2,}", " ").replaceFirst("(?i)^\\s*(?:M(?:iss|rs?|s)|Dr|Rev)\\b[\\s.]*", "").trim();
See the regex demo
Pattern details
(?i) - case ignoring option
^ - start of string
\s* - 0+ whitespaces
(?:M(?:iss|rs?|s)|Dr|Rev) - M followed by iss, r, rs or s, or else Dr or Rev (you may add more alternatives after | here)
\b - word boundary
[\s.]* - 0 or more whitespaces or dots.
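As a quick check of the pattern above, here is a small sketch that wraps it in a helper. The class name, the method name stripSalutation and the extra sample names are assumptions for illustration only.

```java
class SalutationDemo {
    // Pattern from the answer: anchored with ^ so it only fires at the start of the name.
    private static final String SALUTATION =
            "(?i)^\\s*(?:M(?:iss|rs?|s)|Dr|Rev)\\b[\\s.]*";

    static String stripSalutation(String name) {
        return name.replaceAll("\\s{2,}", " ")   // collapse runs of whitespace first
                   .replaceFirst(SALUTATION, "") // drop one leading salutation, if any
                   .trim();
    }

    public static void main(String[] args) {
        System.out.println(stripSalutation("Mr . Aditya Jha")); // Aditya Jha
        System.out.println(stripSalutation("amra khan"));       // amra khan
        System.out.println(stripSalutation("Mrinal Sen"));      // Mrinal Sen
    }
}
```

The \b is what protects a name such as Mrinal: after "Mr" the next character is a letter, so there is no word boundary and nothing is removed.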

How to detect multi set words OpenNLP

I'm doing NER using Java OpenNLP and I'm not sure how I can detect multi-word names (e.g. New York, Bruno Mars, Hong Kong) using the custom model I have trained.
My training data do cover multi-word spans:
<START:place> Hong Kong <END> ... <START:person> Putin <END>
I'm pretty sure my trained model and training data are working well. I just don't know how to get the multi-word names. Here is what I did:
// testing the model
NameFinderME nameFinder = new NameFinderME(nameFinderModel);
String sentence = "India may US to Japan France so Putin should Hong Kong review Trump";
WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
// Tokenizing the given paragraph
String tokens[] = whitespaceTokenizer.tokenize(sentence);
Span nameSpans[] = nameFinder.find(tokens);
for (Span s : nameSpans)
System.out.println(s.toString() + " " + tokens[s.getStart()]);
And here is what I get:
[0..1) place India
[0..1) place US
[0..1) place Japan
[0..1) place France
[0..1) person Putin
[0..1) place Hong
[0..1) person Trump
But I want to get [0..1) place Hong Kong instead of splitting them into two categories.
Thanks.
I defined an ArrayList holding the first words of all the multi-word place names, e.g. {"Hong", "New", "North", "South" ... }, and then check whether it contains tokens[s.getStart()]. If yes, I add tokens[s.getStart()] + " " + tokens[s.getStart() + 1]; otherwise I add tokens[s.getStart()]. It's not the best approach, but it's enough for me for now.
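An alternative worth trying first: the loop in the question prints only tokens[s.getStart()], i.e. the first token of each span, even when the span covers several tokens. A Span also has getEnd() (exclusive), so the full text can be rebuilt by joining the tokens from start to end. The sketch below shows just that joining step without the OpenNLP dependency; the class name, the helper spanText and the indices 9 and 11 are assumptions standing in for s.getStart() and s.getEnd() of a span that the model would have to return for Hong Kong.

```java
import java.util.Arrays;

class SpanTextDemo {
    // Joins tokens[start..end) with single spaces; end is exclusive, as in OpenNLP's Span.
    static String spanText(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String sentence = "India may US to Japan France so Putin should Hong Kong review Trump";
        String[] tokens = sentence.split("\\s+");

        // Hypothetical span [9..11) covering "Hong" and "Kong".
        System.out.println(spanText(tokens, 9, 11)); // Hong Kong

        // A single-token span such as [7..8) still works.
        System.out.println(spanText(tokens, 7, 8));  // Putin
    }
}
```

With real spans this would be called as spanText(tokens, s.getStart(), s.getEnd()); OpenNLP's Span.spansToStrings(nameSpans, tokens) does the same for a whole array of spans. If the spans really are all one token long, the model itself is not producing multi-word names and no printing change will fix that.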

String Split After Second Space Character

I have a string that I download from JSOUP that looks like
Paul Millsap Al Horford Tiago Splitter Jeff Teague Kyle Korver Thabo Sefolosha Mike Scott Shelvin Mack Kent Bazemore Dennis Schröder Tim Hardaway Jr. Walter Tavares Justin Holiday Mike Muscala Lamar Patterson Terran Petteway
I want to split it into an array for use in a list view, so the desired output would be:
Paul Millsap, Al Horford, Tiago Splitter, Jeff Teague, Kyle Korver, Thabo Sefolosha, Mike Scott, Shelvin Mack, Kent Bazemore, Dennis Schröder, Tim Hardaway Jr., Walter Tavares, Justin Holiday, Mike Muscala, Lamar Patterson, Terran Petteway,
How can I do this? Thanks for any help.
Preferred answer:
Since you are parsing a page which has a nice table, and you want the values from a specific column (the player names, which are also links), you can do it easily with:
String url = "http://www.spotrac.com/nba/atlanta-hawks/cap/";
Document doc = Jsoup.connect(url).get();
Elements players = doc.select("table.datatable td.player a");
for (Element player : players){
System.out.println(player.text());
}
which will:
find the table tag with class datatable,
then inside that table select td.player, i.e. each td cell with class player,
and finally pick the links (a elements) inside those cells, since the names are links.
Original answer:
Based only on example data from your question, you could try to find OneWord[space]SecondWord(optional:[space]Jr.).
Code based on this idea could look like:
String input = " Paul Millsap Al Horford Tiago Splitter Jeff Teague Kyle Korver Thabo Sefolosha Mike Scott Shelvin Mack Kent Bazemore Dennis Schröder Tim Hardaway Jr. Walter Tavares Justin Holiday Mike Muscala Lamar Patterson Terran Petteway";
Pattern p = Pattern.compile("\\w+\\s+\\w+(\\s+Jr[.])?",
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println(m.group());
}
Output:
Paul Millsap
Al Horford
Tiago Splitter
Jeff Teague
Kyle Korver
Thabo Sefolosha
Mike Scott
Shelvin Mack
Kent Bazemore
Dennis Schröder
Tim Hardaway Jr.
Walter Tavares
Justin Holiday
Mike Muscala
Lamar Patterson
Terran Petteway
You could do a basic "split on every second space", then examine the next string to see if it has anything (like a period) that would indicate that it belongs to the previous string. This works if things like Jr. have the period; it won't work if the punctuation isn't present.
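A rough sketch of that idea, assuming every name is exactly two words optionally followed by one token ending in a period. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class NameSplitDemo {
    static List<String> splitNames(String input) {
        List<String> names = new ArrayList<>();
        String[] words = input.trim().split("\\s+");
        int i = 0;
        while (i + 1 < words.length) {
            // Take two words as one name.
            StringBuilder name = new StringBuilder(words[i]).append(' ').append(words[i + 1]);
            i += 2;
            // A following token that ends with a period (e.g. "Jr.") belongs to this name.
            if (i < words.length && words[i].endsWith(".")) {
                name.append(' ').append(words[i]);
                i++;
            }
            names.add(name.toString());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(splitNames("Tim Hardaway Jr. Walter Tavares Justin Holiday"));
    }
}
```

The result is a List<String>, which can be handed to a list view adapter directly instead of building a comma-separated string first.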
Search for two words, and then search for any third word only if it ends in a . character:
\b(\w+ \w+\b(?: \w+\.)?)
Replace with \1, (the match followed by a comma); in Java the group reference is written $1. regex101.com example
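In Java this could look roughly like the sketch below. Two things here are my own additions: the (?U) flag, which makes \w and \b Unicode-aware so that a name like Schröder is treated as one word, and the class and method names.

```java
class NameCommaDemo {
    // Two words, optionally followed by a third word that ends in a period (e.g. "Jr.").
    private static final String NAME = "(?U)\\b(\\w+ \\w+\\b(?: \\w+\\.)?)";

    static String addCommas(String input) {
        // $1 is Java's syntax for the first capture group in a replacement string.
        return input.replaceAll(NAME, "$1,");
    }

    public static void main(String[] args) {
        String input = "Dennis Schr\u00f6der Tim Hardaway Jr. Walter Tavares";
        System.out.println(addCommas(input));
    }
}
```

Like the desired output in the question, the result ends with a trailing comma; splitting it on "," afterwards gives the individual names.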

Regex to extract a paragraph

I need a regex to extract each paragraph from a text buffer containing many similar paragraphs, and store it as a string for additional processing.
Example: Say, the text buffer is like this:
=== Jun 11 14:05:39 - Person Details ===
Person Name = "Hurlman"
Person Address = "2nd Street Benjamin Blvd NJ"
Persion Age = 25
=== Jun 11 14:05:39 - Person Details ===
Person Name = "Greg"
Person Address = "3rd Street Benjamin Blvd NJ"
Persion Age = 26
=== Jun 11 14:05:42 - Person Details ===
Person Name = "Michel"
Person Address = "4th Street Benjamin Blvd NJ"
Persion Age = 27
And I need to iterate through all the paragraphs and store each one of them to further find the specific person details inside.
Each paragraph I need to extract should be of the below format
=== Jun 11 14:05:42 - Person Details ===
Person Name = "Michel"
Person Address = "4th Street Benjamin Blvd NJ"
Persion Age = 27
Any help is much appreciated!
You could use this pattern: (===.*===[\s\S]*?)(?====|$)
Demo
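For reference, a minimal Java sketch of applying that pattern with a Matcher loop. The class and method names are made up, and the sketch assumes that === never occurs inside a field value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordRegexDemo {
    // Header line, then everything (lazily) up to the next "===" or the end of input.
    private static final Pattern RECORD =
            Pattern.compile("(===.*===[\\s\\S]*?)(?====|$)");

    static List<String> extractRecords(String buffer) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(buffer);
        while (m.find()) {
            records.add(m.group(1).trim()); // trim drops the trailing line break
        }
        return records;
    }

    public static void main(String[] args) {
        String buffer =
                "=== Jun 11 14:05:39 - Person Details ===\n"
              + "Person Name = \"Hurlman\"\n"
              + "Persion Age = 25\n"
              + "=== Jun 11 14:05:39 - Person Details ===\n"
              + "Person Name = \"Greg\"\n"
              + "Persion Age = 26\n";
        for (String record : extractRecords(buffer)) {
            System.out.println(record);
            System.out.println("-----");
        }
    }
}
```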
Using regexes to solve this is possible, but it is likely to give you a poor (inefficient, hard to understand, hard to maintain, etc) solution.
What you have is an informal record structure represented using lines of text. (This is not natural language text, so describing it in terms of "paragraphs" doesn't make sense.)
The way to process it is to read it a line at a time and then use Scanner (or equivalent) to parse each line into name value pairs. You just need some simple logic to detect the record boundaries and / or check that they are appearing at the correct place in the input stream.
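A minimal sketch of the line-by-line approach, assuming a line starting with === opens a new record and every other line has the form key = value. For brevity it uses String.split rather than Scanner; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecordLineParserDemo {
    static List<Map<String, String>> parse(String buffer) {
        List<Map<String, String>> records = new ArrayList<>();
        Map<String, String> current = null;
        for (String line : buffer.split("\\R")) { // \R matches any line break (Java 8+)
            line = line.trim();
            if (line.startsWith("===")) {
                // Record boundary: start a new record.
                current = new LinkedHashMap<>();
                records.add(current);
            } else if (current != null && line.contains("=")) {
                // Split on the first '=' only, then strip surrounding quotes from the value.
                String[] kv = line.split("=", 2);
                current.put(kv[0].trim(), kv[1].trim().replaceAll("^\"|\"$", ""));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String buffer =
                "=== Jun 11 14:05:39 - Person Details ===\n"
              + "Person Name = \"Hurlman\"\n"
              + "Person Address = \"2nd Street Benjamin Blvd NJ\"\n"
              + "Persion Age = 25\n";
        System.out.println(parse(buffer));
    }
}
```

Each record ends up as a map, so looking up a specific person detail is a plain get on the key as it is spelled in the input (including Persion Age).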

How do I split up a line in a text file in read/write?

I currently have a text file that has the following:
1 Commercial & Enterprise 5 SLICE 59.99 IP MICRO
2 Commercial & Enterprise 5 SLICE 59.99 MULTI-USE SWITCH
.
.
.
.
18 Government & Military 6 TCP 15.00 TCP
I am trying to split the line so that I can have the following:
Product number: 18
Category: Government & Military
Product name: TCP
Units in stock: 6
Price: $15.00
Total value: $90.00
Fee: $4.50
Total value: $94.50
I currently have the following code:
while ((line = lineReader.readLine()) != null) {
StringTokenizer tokens = new StringTokenizer(line, "\t");
p = new ActionProduct();
add(p);
String category = p.getCategory();
String name = p.getName();
category = tokens.nextToken();
int item = p.getItem();
double price = p.getPrice();
int units = p.getUnits();
while (tokens.hasMoreTokens()) {
item = Integer.parseInt(tokens.nextToken());
price = Double.parseDouble(tokens.nextToken());
units = Integer.parseInt(tokens.nextToken());
}
System.out.println("Category: " + category);
System.out.println("Product number: " + item);
System.out.println("Product name: " + name);
System.out.println("Units in stock: "+ units);
System.out.println("Price: $" + String.format("%.2f", price));
System.out.println("Total value: $" + String.format("%.2f",p.value()));
System.out.println("Fee: $" + String.format("%.2f", p.fee()));
System.out.println("Total value: $" + String.format("%.2f", value()));
}
And I am getting this output instead:
Category: 1 Commercial & Enterprise 5 SLICE 59.99 IP MICRO
Product number: 0
Product name: null
Units in stock: 0
Price: $0.00
Total value: $0.00
Fee: $0.00
Total value: $0.00
Category: 2 Commercial & Enterprise 5 SLICE 59.99 MULTI-USE SWITCH
Product number: 0
Product name: null
Units in stock: 0
Price: $0.00
Total value: $0.00
Fee: $0.00
Total value: $0.00
So my question is: what must I do to split up the line, so that I can print each value of my text file individually? Thanks in advance guys, I would really appreciate some direction!
Here is my text file:
1 Commercial & Enterprise 5 SLICE 59.99 IP MICRO
2 Commercial & Enterprise 5 SLICE 59.99 MULTI-USE SWITCH
3 Commercial & Enterprise 4 SLICE 59.99 2100
4 Commercial & Enterprise 6 SLICE 59.99 IP
5 Commercial & Enterprise 4 HDX 45.00 HYBRID CARRIER
6 Commercial & Enterprise 10 TRANSip 45.00 IP Technology Suite
7 Commercial & Enterprise 5 GUI 30.00 LINK COMMAND SYS
8 Commercial & Enterprise 5 GUI 30.00 MAUI
9 Commercial & Enterprise 6 RCP 20.00 RCP
10 Government & Military 5 SLICE 60.00 IP MICRO
11 Government & Military 5 SLICE 60.00 MULTI-USE SWITCH
12 Government & Military 4 SLICE 60.00 2100
13 Government & Military 6 SLICE 55.00 IP
14 Government & Military 4 HDX.C 35.00 HYBRID CARRIER
15 Government & Military 10 TRANSip 30.00 IP Technology Suite
16 Government & Military 5 GUI 20.00 LINK COMMAND SYS
17 Government & Military 5 GUI 20.00 MAUI
18 Government & Military 6 TCP 15.00 TCP
Take a good look at the data. Are you getting more data, or is this the only file?
If you're getting more data, you need some kind of spec, so you can be sure that your parser will continue working.
If you have fixed positioning of the data, then you can use
String part = line.substring(beginIndex, endIndex)
This data file almost has fixed positions, except that the columns shift when the product number gains a digit.
Instead you can try a regex or line.split(delimiter).
Don't use regexes too much before you really understand them.
If this was the only file, I think I would start with a
String[] parts = line.split("  "); // two spaces
and then parse from the string array you get.
The first part, parts[0], would contain both product number and category, but you can split that as well.
Since you want to split the text based on an arbitrary pattern, that is exactly what regexes are for: use a regex to tokenize your input, then process the tokens as desired.
Simply put, you read the file, pass each line to the regex, then work on the tokens (i.e. strings).
An example regex pattern for your data would be
[0-9]+[\s]+[a-zA-Z\s\Q&\E]+[\s]+[0-9]+[\s]+[a-zA-Z]+[\s]+[0-9\Q.\E]+[\s]+[a-zA-Z0-9]+
You can quickly and effectively create your pattern by using e.g.
http://gskinner.com/RegExr/
further reading:
http://en.wikipedia.org/wiki/Regular_expression
http://docs.oracle.com/javase/tutorial/essential/regex/
http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
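As one possible way to apply a regex here, the sketch below uses capture groups so that each field can be read out by position. It is a different, looser pattern than the one above, so that values like HDX.C and MULTI-USE SWITCH also match. The column meanings (number, category, units, code, price, description) are my guess from the expected output in the question, and the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProductLineDemo {
    // number, category (lazy), units, code, price, description (rest of the line)
    private static final Pattern LINE = Pattern.compile(
            "^(\\d+)\\s+(.+?)\\s+(\\d+)\\s+(\\S+)\\s+(\\d+\\.\\d+)\\s+(.+)$");

    // Returns the six fields in order, or null if the line does not fit the pattern.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[6];
        for (int i = 0; i < 6; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] f = parse("18 Government & Military 6 TCP 15.00 TCP");
        int units = Integer.parseInt(f[2]);
        double price = Double.parseDouble(f[4]);
        System.out.println("Product number: " + f[0]);
        System.out.println("Category: " + f[1]);
        System.out.println("Product name: " + f[5]);
        System.out.println("Units in stock: " + units);
        System.out.println("Price: $" + String.format("%.2f", price));
        System.out.println("Total value: $" + String.format("%.2f", units * price));
    }
}
```

Inside the read loop from the question, the parsed values would then be set on the ActionProduct before printing, instead of reading the getters of a freshly created, empty object.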
