I have a string that I download with Jsoup that looks like this:
Paul Millsap Al Horford Tiago Splitter Jeff Teague Kyle Korver Thabo Sefolosha Mike Scott Shelvin Mack Kent Bazemore Dennis Schröder Tim Hardaway Jr. Walter Tavares Justin Holiday Mike Muscala Lamar Patterson Terran Petteway
I want to split it into an array for use in a list view, so the desired output would be:
Paul Millsap, Al Horford, Tiago Splitter, Jeff Teague, Kyle Korver, Thabo Sefolosha, Mike Scott, Shelvin Mack, Kent Bazemore, Dennis Schröder, Tim Hardaway Jr., Walter Tavares, Justin Holiday, Mike Muscala, Lamar Patterson, Terran Petteway,
How can I do this? Thanks for any help.
Preferred answer:
Since you are parsing a page which has a nice table, and you want the values from a specific column (the names of the players, which are also links), you can do it easily with:
String url = "http://www.spotrac.com/nba/atlanta-hawks/cap/";
Document doc = Jsoup.connect(url).get();
Elements players = doc.select("table.datatable td.player a");
for (Element player : players){
System.out.println(player.text());
}
which will:
find the table element with class datatable,
then, inside that table, select td.player, i.e. each td cell with class player,
and finally pick the links (a elements) inside those cells, since the names are links.
Original answer:
Based only on example data from your question, you could try to find OneWord[space]SecondWord(optional:[space]Jr.).
Code based on this idea could look like:
String input = " Paul Millsap Al Horford Tiago Splitter Jeff Teague Kyle Korver Thabo Sefolosha Mike Scott Shelvin Mack Kent Bazemore Dennis Schröder Tim Hardaway Jr. Walter Tavares Justin Holiday Mike Muscala Lamar Patterson Terran Petteway";
Pattern p = Pattern.compile("\\w+\\s+\\w+(\\s+Jr[.])?",
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println(m.group());
}
Output:
Paul Millsap
Al Horford
Tiago Splitter
Jeff Teague
Kyle Korver
Thabo Sefolosha
Mike Scott
Shelvin Mack
Kent Bazemore
Dennis Schröder
Tim Hardaway Jr.
Walter Tavares
Justin Holiday
Mike Muscala
Lamar Patterson
Terran Petteway
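Since the goal is a list for a list view rather than printed lines, the matches can be collected instead of printed. A minimal sketch, assuming the same pattern as above; the names NameSplitter and splitNames are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are not from any library.
class NameSplitter {
    // Two words, optionally followed by "Jr." (same pattern as in the answer above).
    private static final Pattern NAME = Pattern.compile(
            "\\w+\\s+\\w+(\\s+Jr[.])?",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

    static List<String> splitNames(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(input);
        while (m.find()) {
            names.add(m.group()); // one full name per match
        }
        return names;
    }

    public static void main(String[] args) {
        // \u00f6 is the o-umlaut in the name from the question
        String input = " Kent Bazemore Dennis Schr\u00f6der Tim Hardaway Jr. Walter Tavares";
        for (String name : splitNames(input)) {
            System.out.println(name);
        }
    }
}
```

If an array is really needed, names.toArray(new String[0]) converts the list.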
You could do a basic "split at every second space", then examine the next string to see if it has anything (like a period) that would indicate that it belongs to the previous string. This works if suffixes like Jr. have the period; it wouldn't work if the punctuation isn't present.
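The idea above could be sketched roughly like this. It assumes every name has exactly two words plus an optional suffix ending in a period; PairSplitter and splitPairs are illustrative names only:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the "every second space" idea.
class PairSplitter {
    static List<String> splitPairs(String input) {
        String[] words = input.trim().split("\\s+");
        List<String> names = new ArrayList<>();
        int i = 0;
        // Take words two at a time; a trailing unpaired word is ignored.
        while (i + 1 < words.length) {
            StringBuilder name = new StringBuilder(words[i]).append(' ').append(words[i + 1]);
            i += 2;
            // A following token that ends in a period (e.g. "Jr.") is treated as a suffix.
            if (i < words.length && words[i].endsWith(".")) {
                name.append(' ').append(words[i]);
                i++;
            }
            names.add(name.toString());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(splitPairs("Tim Hardaway Jr. Walter Tavares Justin Holiday"));
    }
}
```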
Search for two words, and then search for any third word only if it ends in a . character:
\b(\w+ \w+\b(?: \w+\.)?)
Replace each match with \1 followed by a comma.
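In Java, the group reference in a replacement string is written $1 rather than \1. A small sketch of that replacement; the (?U) flag is my addition so that \w and \b also treat letters like the o-umlaut as word characters, and CommaInserter is just an illustrative name:

```java
// Hypothetical helper; only String.replaceAll from the JDK is used.
class CommaInserter {
    static String addCommas(String input) {
        // (?U) enables UNICODE_CHARACTER_CLASS for \w and \b.
        // Group 1: two words, plus an optional third word that ends in a period.
        return input.replaceAll("(?U)\\b(\\w+ \\w+\\b(?: \\w+\\.)?)", "$1,");
    }

    public static void main(String[] args) {
        System.out.println(addCommas("Dennis Schr\u00f6der Tim Hardaway Jr. Walter Tavares"));
    }
}
```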
I'm doing NER using Java OpenNLP and I'm not sure how I can detect multi-word names (e.g. New York, Bruno Mars, Hong Kong) using the custom model I have trained.
My training data do cover multi-word spans:
<START:place> Hong Kong <END> ... <START:person> Putin <END>
I'm pretty sure my trained model and training data are working well. It's just that I do not know how to get the multi-word names. Here is what I did:
// testing the model
NameFinderME nameFinder = new NameFinderME(nameFinderModel);
String sentence = "India may US to Japan France so Putin should Hong Kong review Trump";
WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
// Tokenizing the given paragraph
String tokens[] = whitespaceTokenizer.tokenize(sentence);
Span nameSpans[] = nameFinder.find(tokens);
for (Span s : nameSpans)
System.out.println(s.toString() + " " + tokens[s.getStart()]);
And here is what I get:
[0..1) place India
[0..1) place US
[0..1) place Japan
[0..1) place France
[0..1) person Putin
[0..1) place Hong
[0..1) person Trump
But I want to get [0..1) place Hong Kong instead of splitting them into two categories.
Thanks.
I defined an array list containing the first word of every multi-word place name, e.g. {"Hong", "New", "North", "South" ... }, and then check whether it contains tokens[s.getStart()]. If yes, I add tokens[s.getStart()] + " " + tokens[s.getStart() + 1]; otherwise I add tokens[s.getStart()]. It's not the best approach, but it's enough for me for now.
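One thing worth checking, as a hedged side note: the code in the question prints only tokens[s.getStart()], so even if the name finder did return a span covering two tokens, only the first token would be shown. A span's text is every token from its start (inclusive) to its end (exclusive). The sketch below uses plain int indexes in place of an OpenNLP Span so that it runs without the library; SpanText and cover are made-up names, and this does not change what the model itself detects:

```java
import java.util.Arrays;

// Hypothetical helper; start/end stand in for Span.getStart() and Span.getEnd().
class SpanText {
    // start is inclusive, end is exclusive
    static String cover(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] tokens = "so Putin should Hong Kong review".split(" ");
        System.out.println(cover(tokens, 1, 2)); // a one-token span
        System.out.println(cover(tokens, 3, 5)); // a two-token span
    }
}
```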
So here is my predicament. I have no idea where to start with this. Let's say I'm given a list of people who are members of multiple organizations, like the following:
John NAACP PETA NRA
Bill NRA WHO
Nancy NAACP NRA WHO
Jim PETA WHO
But I want to take another file in that has a list of all the possible organizations and then output something like this (with the organizations in alphabetical order and the members also in alphabetical order, and no names next to an organization if nobody is in it):
NAACP John Nancy
NRA Bill John Nancy
PETA Jim John
WHO Bill Jim Nancy
YEO
I'm new to HashMaps and I have no idea how to go about doing this, so I'd appreciate all the help I can get.
Try something like a HashMap<String, ArrayList<String>>. Insert each organization name as a String key with an empty ArrayList<String>. Then loop over the list of people => organizations, look up the organizations one by one, and insert the person's name in the ArrayList for that organization.
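A rough sketch of that approach, with two swaps that are my own choices rather than part of the suggestion: a TreeMap instead of a HashMap so the organizations come out in alphabetical order, and a sort of each member list at the end. The class and method names are illustrative, and reading the two files is left out (the lines are passed in as lists):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; names are for illustration only.
class Membership {
    static Map<String, List<String>> invert(List<String> organizations, List<String> memberLines) {
        // TreeMap keeps the organization names sorted.
        Map<String, List<String>> byOrg = new TreeMap<>();
        for (String org : organizations) {
            byOrg.put(org, new ArrayList<>()); // every known organization starts empty
        }
        for (String line : memberLines) {
            String[] parts = line.trim().split("\\s+");
            String person = parts[0];
            for (int i = 1; i < parts.length; i++) {
                // computeIfAbsent also covers organizations missing from the master list
                byOrg.computeIfAbsent(parts[i], k -> new ArrayList<>()).add(person);
            }
        }
        for (List<String> members : byOrg.values()) {
            Collections.sort(members); // members in alphabetical order
        }
        return byOrg;
    }

    public static void main(String[] args) {
        Map<String, List<String>> byOrg = invert(
                Arrays.asList("NAACP", "NRA", "PETA", "WHO", "YEO"),
                Arrays.asList("John NAACP PETA NRA", "Bill NRA WHO", "Nancy NAACP NRA WHO", "Jim PETA WHO"));
        for (Map.Entry<String, List<String>> e : byOrg.entrySet()) {
            System.out.println((e.getKey() + " " + String.join(" ", e.getValue())).trim());
        }
    }
}
```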
Not the most elegant solution but it will work
To add people to the list you can use the following code:
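Map<String, List<String>> storage = new LinkedHashMap<String, List<String>>();
// create the list for an organization the first time it is seen
if (!storage.containsKey("NRA")) {
    storage.put("NRA", new ArrayList<String>());
}
// add each member as a separate element
storage.get("NRA").add("Bill");
storage.get("NRA").add("John");
storage.get("NRA").add("Nancy");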
To extract and print people you can use the following code:
for(Entry<String, List<String>> entry : storage.entrySet()){
String line = entry.getKey(); // the organization name
for(String name : entry.getValue()){ // each member name from the list
line += " ";
line += name;
}
System.out.println(line); //printing the result
}
I didn't check this code in an IDE, but apart from possible typos it should work.
Create a HashMap for each organization, e.g., HashMap<String,Boolean> memberOfNAACP = new HashMap<String,Boolean>();. As you loop through the members and you find one who is a member of NAACP, run memberOfNAACP.put("John",true). When you're done, dump out the contents of the hash with memberOfNAACP.keySet(). If you don't know all the organizations in advance, use an ArrayList of HashMaps, type ArrayList<HashMap<String,Boolean>>.
I'm working on a utility where I've this requirement:
there is a string which contains parameters like - #p1 or #p2 or #pn, where n can be any number.
for example string is :
Input:
It provides #p1 latest news, videos #p2 from India and #p3 the world. Get today's news headlines from #p4 Business, #p5
Replace all the parameters with #pn#. So if the parameter is #p1 it will become #p1#.
The above string will become :
Output:
It provides #p1# latest news, videos #p2# from India and #p3# the world. Get today's news headlines from #p4# Business, #p5#
Any quick help appreciated.
Thanks.
Use string.replaceAll function like below.
string.replaceAll("(#p\\d+)", "$1#");
\d+ matches one or more digits. The parentheses () form a capturing group: the characters matched by the pattern inside them are stored in the corresponding group, and we can refer to them later by index, like $1 or $2.
Example:
String s = "It provides #p1 latest news, videos #p2 from India and #p3 the world. Get today's news headlines from #p5 Business, #p5";
System.out.println(s.replaceAll("(#p\\d+)", "$1#"));
Output:
It provides #p1# latest news, videos #p2# from India and #p3# the world. Get today's news headlines from #p5# Business, #p5#
You can try regex like this :
public static void main(String[] args) {
String s = "it provides #p1 latest news, videos #p2 from India and #p3 the world. Get today's news headlines from #p5 Business, #p5";
System.out.println(s.replaceAll("(#p\\d+)(?=\\s+|$)", "$1\\#"));
}
O/P :
it provides #p1# latest news, videos #p2# from India and #p3# the world. Get today's news headlines from #p5# Business, #p5#
Explanation :
(#p\\d+)(?=\\s+|$) --> `#p` followed by one or more digits (all captured), followed by whitespace or the end of the String (checked by the lookahead, but not consumed or captured).
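To see where the lookahead version differs from the plain (#p\\d+) pattern, here is a small comparison on an input of my own where a parameter is directly followed by a comma. ParamMarker and its method names are made up for the example:

```java
// Hypothetical helper comparing the two patterns from the answers above.
class ParamMarker {
    // Appends '#' after every #p<digits>.
    static String markAll(String s) {
        return s.replaceAll("(#p\\d+)", "$1#");
    }

    // Appends '#' only when #p<digits> is followed by whitespace or the end of the input.
    static String markBeforeSpaceOrEnd(String s) {
        return s.replaceAll("(#p\\d+)(?=\\s+|$)", "$1#");
    }

    public static void main(String[] args) {
        String s = "see #p1, #p2 and #p3";
        System.out.println(markAll(s));              // also changes "#p1,"
        System.out.println(markBeforeSpaceOrEnd(s)); // leaves "#p1," untouched
    }
}
```

Which behavior is wanted depends on whether parameters can be followed directly by punctuation.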
I have a String like this:
String a = "Barbara Liskov (born Barbara Jane Huberman on November 7, 1939"
+" in California) is a computer scientist.[2] She is currently the Ford"
+" Professor of Engineering in the MIT School of Engineering's electrical"
+" engineering and computer science department and an institute professor"
+" at the Massachusetts Institute of Technology.[3]";
I would like to replace all of these elements: [1], [2], [3], etcetera, with a blank space.
I tried with:
if (a.matches("([){1}\\d(]){1}")) {
a = a.replace("");
}
but it does not work!
Your Pattern is all wrong.
Try this example:
String input =
"Barbara Liskov (born Barbara Jane Huberman on November 7, 1939 in California) "
+ "is a computer scientist.[2] She is currently the Ford Professor of Engineering "
+ "in the MIT School of Engineering's electrical engineering and computer "
+ "science department and an institute professor at the Massachusetts Institute "
+ "of Technology.[3]";
// \\[   escaped opening square bracket
// \\d+  one or more digits
// \\]   escaped closing square bracket
// " "   replace with one space
System.out.println(input.replaceAll("\\[\\d+\\]", " "));
Output (newlines added for clarity)
Barbara Liskov (born Barbara Jane Huberman on November 7,
1939 in California) is a computer scientist.
She is currently the Ford Professor of Engineering in the MIT
School of Engineering's electrical engineering and computer science
department and an institute professor at the Massachusetts Institute of Technology.
Very simple:
a = a.replaceAll("\\[\\d+\\]","");
The changes:
Use replaceAll instead of replace
Escape the [ and ] - they are regex special chars. Wrapping them in parentheses does not escape them.
No need for {1} in your regex: [{1} == [ - both say that the character should occur exactly once.
The + in \\d+ allows numbers with more than one digit, such as [12].
About your pattern ([){1}\\d(]){1}:
{1} is always useless since it is always implicit
[ and ] need to be escaped with a backslash (which must itself be escaped with another backslash inside a Java string literal)
\\d has no quantifier, so [12] for example won't match since it has two digits
So, better try: \\[\\d+\\]
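One more point that may explain why the original if never fired: String.matches only returns true when the whole string matches the pattern, not when the pattern occurs somewhere inside it. So the check is not needed, and replaceAll can be called directly. A small sketch (CitationStripper is just an illustrative name):

```java
// Hypothetical helper; only String.matches and String.replaceAll from the JDK are used.
class CitationStripper {
    static String strip(String text) {
        // Replace every [digits] marker with one space.
        return text.replaceAll("\\[\\d+\\]", " ");
    }

    public static void main(String[] args) {
        String marker = "\\[\\d+\\]";
        System.out.println("[2]".matches(marker));               // whole string is a marker
        System.out.println("scientist.[2] She".matches(marker)); // marker is only a part of it
        System.out.println(strip("a scientist.[2] She is at MIT.[13]"));
    }
}
```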
Use the String replaceAll(String regex, String replacement).
All you have to do is a = a.replaceAll("\\[\\d+\\]", " ").
You can read the Javadoc for more information.
Use this:
String a = "Barbara Liskov (born Barbara Jane Huberman on November 7, 1939 in California) is a computer scientist.[2] She is currently the Ford Professor of Engineering in the MIT School of Engineering's electrical engineering and computer science department and an institute professor at the Massachusetts Institute of Technology.[3]";
for(int i =1 ; i<= 3; i++){
a= a.replace("["+i+"]","");
}
System.out.println(a);
This will work, but only for the markers [1] to [3] that the loop covers.
I need a regex to extract each paragraph from a text buffer containing many similar paragraphs, and store each one as a string for additional processing.
Example: Say, the text buffer is like this:
=== Jun 11 14:05:39 - Person Details ===
Person Name = "Hurlman"
Person Address = "2nd Street Benjamin Blvd NJ"
Persion Age = 25
=== Jun 11 14:05:39 - Person Details ===
Person Name = "Greg"
Person Address = "3rd Street Benjamin Blvd NJ"
Persion Age = 26
=== Jun 11 14:05:42 - Person Details ===
Person Name = "Michel"
Person Address = "4th Street Benjamin Blvd NJ"
Persion Age = 27
And I need to iterate through all the paragraphs and store each one of them to further find the specific person details inside.
Each paragraph I need to extract should be of the below format
=== Jun 11 14:05:42 - Person Details ===
Person Name = "Michel"
Person Address = "4th Street Benjamin Blvd NJ"
Persion Age = 27
Any help is much appreciated!
You could use this pattern: (===.*===[\s\S]*?)(?====|$)
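In Java, that pattern could be applied roughly as follows. This is a sketch under the assumption that "===" never appears inside a value; RecordSplitter and split are made-up names, and the trailing newline of each block is trimmed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are for illustration only.
class RecordSplitter {
    // Header line, then lazily everything up to the next "===" or the end of the input.
    private static final Pattern RECORD = Pattern.compile("(===.*===[\\s\\S]*?)(?====|$)");

    static List<String> split(String buffer) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(buffer);
        while (m.find()) {
            records.add(m.group(1).trim());
        }
        return records;
    }

    public static void main(String[] args) {
        String buffer = "=== Jun 11 14:05:39 - Person Details ===\n"
                + "Person Name = \"Hurlman\"\n"
                + "Persion Age = 25\n"
                + "=== Jun 11 14:05:42 - Person Details ===\n"
                + "Person Name = \"Michel\"\n"
                + "Persion Age = 27\n";
        for (String record : split(buffer)) {
            System.out.println(record);
            System.out.println("--");
        }
    }
}
```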
Using regexes to solve this is possible, but it is likely to give you a poor (inefficient, hard to understand, hard to maintain, etc) solution.
What you have is an informal record structure represented using lines of text. (This is not natural language text, so describing it in terms of "paragraphs" doesn't make sense.)
The way to process it is to read it a line at a time and then use Scanner (or equivalent) to parse each line into name value pairs. You just need some simple logic to detect the record boundaries and / or check that they are appearing at the correct place in the input stream.
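A minimal sketch of that line-by-line approach, assuming each record starts with a line beginning with "===" and every other line has the form key = value. The RecordReader class, the read method and the "header" map key are all my own naming, not anything prescribed above:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

// Hypothetical helper; names are for illustration only.
class RecordReader {
    static List<Map<String, String>> read(String buffer) {
        List<Map<String, String>> records = new ArrayList<>();
        Map<String, String> current = null;
        try (Scanner in = new Scanner(buffer)) {
            while (in.hasNextLine()) {
                String line = in.nextLine().trim();
                if (line.startsWith("===")) {
                    // A header line marks the boundary of a new record.
                    current = new LinkedHashMap<>();
                    current.put("header", line); // "header" is our own key
                    records.add(current);
                } else if (current != null && line.contains("=")) {
                    int eq = line.indexOf('=');
                    String key = line.substring(0, eq).trim();
                    String value = line.substring(eq + 1).trim();
                    // Drop surrounding double quotes, if present.
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1);
                    }
                    current.put(key, value);
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String buffer = "=== Jun 11 14:05:39 - Person Details ===\n"
                + "Person Name = \"Hurlman\"\n"
                + "Person Address = \"2nd Street Benjamin Blvd NJ\"\n"
                + "Persion Age = 25\n";
        System.out.println(read(buffer));
    }
}
```

Each record ends up as a map, so looking up a specific person's details afterwards is a plain get on the key.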