How to tokenize in java without using the java.util tokenizer?

Consider the following as tokens:
+, -, ), (
alphabetic characters and underscore
integers
Implement:
1. getToken() - returns a string corresponding to the next token
2. getTokPos() - returns the position of the current token in the input string
Example input: (a+b)-21)
Output: (| a| +| b| )| -| 21| )|
Note: Cannot use the java string tokenizer class
Work in progress - Successfully tokenized +,-,),(. Need to figure out characters and numbers:
OUTPUT: +|-|+|-|(|(|)|)|)|(| |

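java.util.StringTokenizer is a legacy class; its own documentation discourages its use in new code.
Splitting Strings in Java has been much easier with "String.split()" since Java 1.4:
String[] tokens = "(a+b)-21)".split("[-+()]");
The hyphen comes first in the character class so that it is read literally and not as a range. Note that split() discards the delimiters themselves.
If this is homework, you probably have to reimplement a "split" method yourself: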
read the String character by character
if the character is not a special char, add it to a buffer
when you encounter a special char, add the buffer content to a list and clear the buffer
Since it is (probably) homework, I will let you implement it.
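For readers who want to compare against their own attempt, the three steps listed above might look roughly like this. It is only a sketch: the class name ManualSplit and the SPECIAL constant are made up for illustration, and unlike String.split() it never produces empty strings.

```java
import java.util.ArrayList;
import java.util.List;

class ManualSplit {
    // Characters treated as delimiters (assumption: the operators from the question).
    private static final String SPECIAL = "+-()";

    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                // Special character: flush the buffer (if any) and clear it.
                if (buffer.length() > 0) {
                    parts.add(buffer.toString());
                    buffer.setLength(0);
                }
            } else {
                // Ordinary character: keep collecting.
                buffer.append(c);
            }
        }
        // Flush whatever is left after the last special character.
        if (buffer.length() > 0) {
            parts.add(buffer.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("(a+b)-21)")); // prints [a, b, 21]
    }
}
```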

Java lets you examine the characters in a String one by one with the charAt method. So use that in a for loop and examine each character. When you encounter a token, wrap it in pipes; append any other character to the output unchanged.
public static final char PLUS_TOKEN = '+';
// declare the remaining tokens the same way
public String doStuff(String input)
{
    StringBuilder output = new StringBuilder();
    for (int index = 0; index < input.length(); index++)
    {
        char current = input.charAt(index);
        if (current == PLUS_TOKEN)
        {
            // when you see a token you need to append the pipes (|) around it
            output.append('|');
            output.append(current);
            output.append('|');
        }
        // else if (current == ...): compare the current character with the other tokens
        else
        {
            // just add to new output
            output.append(current);
        }
    }
    return output.toString();
}

If it's not a homework assignment, use String.split(). If it is a homework assignment, say so and tag it so that we can give the appropriate level of help (I did so for you, just in case...).

Because the string needs to be cut in several different ways, not just on whitespace or parens, using the String.split method with any of the symbols there will not work: split removes the character used as a separator. You could try to split on the empty string, but that would break up multi-character tokens like 21. To correctly parse this string, you will need to effectively implement your own tokenizer. Think about how you could tell that you have a complete token if you look at the string one character at a time. You could collect characters in a string until you have identified a complete token, then remove those characters from the input and return the collected string. Starting from this point, you can probably make a basic tokenizer.
If you'd rather learn how to make a full strength tokenizer, most of them are defined by creating a regular expression that only matches the tokens.
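As a rough illustration of the regular-expression approach, applied to the getToken()/getTokPos() interface from the question, here is a minimal sketch. The class name RegexTokenizer and the exact pattern are assumptions: it treats a run of letters and underscores as one token, and it silently skips any character that matches no token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenizer {
    // One alternative per token kind: operator or paren, identifier, integer.
    private static final Pattern TOKEN = Pattern.compile("[-+()]|[A-Za-z_]+|[0-9]+");

    private final Matcher matcher;
    private int tokPos = -1; // -1 until the first token has been read

    RegexTokenizer(String input) {
        this.matcher = TOKEN.matcher(input);
    }

    /** Returns the next token, or null when no tokens remain. */
    String getToken() {
        if (!matcher.find()) {
            return null;
        }
        tokPos = matcher.start();
        return matcher.group();
    }

    /** Returns the start index of the token most recently returned by getToken(). */
    int getTokPos() {
        return tokPos;
    }

    public static void main(String[] args) {
        RegexTokenizer t = new RegexTokenizer("(a+b)-21)");
        StringBuilder out = new StringBuilder();
        for (String tok = t.getToken(); tok != null; tok = t.getToken()) {
            out.append(tok).append("| ");
        }
        System.out.println(out.toString().trim()); // prints (| a| +| b| )| -| 21| )|
    }
}
```

Keeping the Matcher as a field means each call to getToken() resumes where the previous one stopped, so no characters have to be removed from the input string.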

Related

How to count number of symbols like #,#,+ etc in Java

I'm trying to write code to count the number of letters, characters, spaces and symbols in a String, but I don't know how to count symbols.
Is there any such function available in java?

That very much depends on your definition of the term symbol.
A straightforward solution could be something like
Set<Character> SYMBOLS = Set.of('#', ' ', ....
int count = 0;
for (int i = 0; i < someString.length(); i++) {
    if (SYMBOLS.contains(someString.charAt(i))) {
        count++;
    }
}
That iterates over the chars of someString and checks whether each one can be found in the predefined SYMBOLS set.
Alternatively, you could use a regular expression to define "symbols", or, you can rely on a variety of existing definitions. When you check the regex Pattern language for java, you can find
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
for example. And various other shortcuts that denote this or that set of characters already.
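A minimal sketch of the regular-expression route, under the assumption that a "symbol" is any character that is neither a word character (\w) nor whitespace (\s). The class name SymbolCounter is made up for illustration; adjust the pattern to your own definition of a symbol.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SymbolCounter {
    // Assumed definition: not a word character and not whitespace.
    private static final Pattern SYMBOL = Pattern.compile("[^\\w\\s]");

    static int countSymbols(String s) {
        Matcher m = SYMBOL.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSymbols("abcd!##")); // prints 3
    }
}
```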

Please post what you have tried so far
If you need the count of individual characters, iterate over the string and use a map to track each character with its count.
Or
you can use a regex if just the overall count is enough, like below:
while (matcher.find()) { count++; }
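The map-based variant could be sketched as follows. The class name CharFrequency is hypothetical, and a TreeMap is used only so that the printed order is predictable.

```java
import java.util.Map;
import java.util.TreeMap;

class CharFrequency {
    /** Counts how often each character occurs in s. */
    static Map<Character, Integer> count(String s) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (int i = 0; i < s.length(); i++) {
            // merge() inserts 1 for a new key, otherwise adds 1 to the old value.
            counts.merge(s.charAt(i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("abcd!##")); // prints {!=1, #=2, a=1, b=1, c=1, d=1}
    }
}
```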

One way of doing it would be to iterate over the String and compare each character to the ASCII value you are looking for (33 is '!'):
String str = "abcd!##";
for (int i = 0; i < str.length(); i++)
{
    if (33 == str.charAt(i))
        System.out.println("Found !");
}
Look up ASCII values here: https://www.cs.cmu.edu/~pattis/15-1XX/common/handouts/ascii.html

Java StreamTokenizer splits Email address at # sign

I am trying to parse a document containing email addresses, but the StreamTokenizer splits the E-mail address into two separate parts.
I already set the # sign as an ordinaryChar and space as the only whitespace:
StreamTokenizer tokenizer = new StreamTokenizer(freader);
tokenizer.ordinaryChar('#');
tokenizer.whitespaceChars(' ', ' ');
Still, all E-mail addresses are split up.
A line to parse looks like the following:
"Student 6 Name6 LastName6 del6#uni.at Competition speech University of Innsbruck".
The Tokenizer splits del6#uni.at to "del6" and "uni.at".
Is there a way to tell the tokenizer to not split at # signs?

So here is why it worked like it did:
StreamTokenizer regards its input much like a programming language tokenizer. That is, it breaks it up into tokens that are "words", "numbers", "quoted strings", "comments", and so on, based on the syntax the programmer sets up for it. The programmer tells it which characters are word characters, plain characters, comment characters etc.
So in fact it does rather sophisticated tokenizing - recognizing comments, quoted strings, numbers. Note that in a programming language, you can have a string like a = a+b;. A simple tokenizer that merely breaks the text by whitespace would break this into a, = and a+b;. But StreamTokenizer would break this into a, =, a, +, b, and ;, and will also give you the "type" for each of these tokens, so your "language" parser can distinguish identifiers from operators. StreamTokenizer's types are rather basic, but this behavior is the key to understanding what happened in your case.
It wasn't recognizing the # as whitespace. In fact, it was parsing it and returning it as a token. But its value was in the ttype field, and you were probably just looking at the sval.
A StreamTokenizer would recognize your line as:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6
The character #
The word uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
(This is the actual output of a little demo I wrote tokenizing your example line and printing by type).
In fact, by telling it that # was an "ordinary character", you were telling it to take the # as its own token (which it does anyway by default). The ordinaryChar() documentation tells you that this method:
Specifies that the character argument is "ordinary" in this tokenizer.
It removes any special significance the character has as a comment
character, word component, string delimiter, white space, or number
character. When such a character is encountered by the parser, the
parser treats it as a single-character token and sets ttype field to
the character value.
(My emphasis).
In fact, if you had instead passed it to wordChars(), as in tokenizer.wordChars('#','#') it would have kept the whole e-mail together. My little demo with that added gives:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6#uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
If you need a programming-language-like tokenizer, StreamTokenizer may work for you. Otherwise your options depend on whether your data is line-based (each line is a separate record, there may be a different number of tokens on each line), where you would typically read lines one-by-one from a reader, then split them using String.split(), or if it is just a whitespace-delimited chain of tokens, where Scanner might suit you better.
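The behaviour described above can be reproduced with a small program along these lines. This is not the demo mentioned in the answer, only a sketch; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerDemo {
    /** Tokenizes line and describes each token by type. */
    static List<String> describe(String line, boolean hashIsWordChar) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        if (hashIsWordChar) {
            st.wordChars('#', '#'); // keep '#' inside words
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("The word " + st.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("The number " + st.nval);
                        break;
                    default:
                        // Ordinary characters arrive as their own char value in ttype.
                        out.add("The character " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Student 6 Name6 LastName6 del6#uni.at Competition speech University of Innsbruck";
        describe(line, false).forEach(System.out::println);
        System.out.println("--- with wordChars('#', '#') ---");
        describe(line, true).forEach(System.out::println);
    }
}
```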

In order to simply split a String, see the answer to this question (adapted for whitespace):
The best way is to not use a StringTokenizer at all, but use String's
split method. It returns an array of Strings, and you can get the
length from that.
For each line in your file you can do the following:
String[] tokens = line.split(" +");
tokens will now have 6 - 8 Strings. Use tokens.length (a field, not a method) to find out
how many, then create your object from the array.
This is sufficient for the given line, and might be sufficient for everything. Here is some code that uses it (it reads System.in):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
public class T {
    public static void main(String[] args) {
        BufferedReader st = new BufferedReader(new InputStreamReader(System.in));
        String line;
        try {
            // readLine() returns null at end of input
            while ((line = st.readLine()) != null) {
                String[] tokens = line.split(" +");
                for (String token : tokens) {
                    System.out.println(token);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e); // handle error here
        }
    }
}

How to concatenate several strings with different format and then split them

Hi all.
I want to concatenate some strings that have no fixed format in Java. For example, I want to concatenate multiple objects such as a signature, a BigInteger and a string, all of them converted to strings. I cannot rely on a fixed delimiter, because any delimiter may also occur inside these strings. How can I concatenate these strings and then split them again?
thanks all.

Use a well-defined format, like XML or JSON. Or choose a delimiter and escape every instance of this delimiter in each of the Strings. Or prepend the length of each part in the message. For example:
10/7/14-<10 chars of signature><7 chars of BigInteger><14 chars of string>
or
10-<10 chars of signature>7-<7 chars of BigInteger>14-<14 chars of string>
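The second layout (each part preceded by its own length) could be sketched like this. The class name LengthPrefixed is made up for illustration, and the code assumes the encoded string is well formed; it does no error handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LengthPrefixed {
    /** Encodes each part as "<length>-<part>", e.g. "abc;def" becomes "7-abc;def". */
    static String join(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p.length()).append('-').append(p);
        }
        return sb.toString();
    }

    /** Reverses join(): reads a length, then exactly that many characters. */
    static List<String> split(String encoded) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < encoded.length()) {
            int dash = encoded.indexOf('-', pos);
            int len = Integer.parseInt(encoded.substring(pos, dash));
            int start = dash + 1;
            parts.add(encoded.substring(start, start + len));
            pos = start + len;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("abc;def", "12345:", "99;red:balloons");
        String encoded = join(parts);
        System.out.println(encoded);        // prints 7-abc;def6-12345:15-99;red:balloons
        System.out.println(split(encoded)); // prints [abc;def, 12345:, 99;red:balloons]
    }
}
```

Because the decoder always reads the length first, the parts themselves may contain digits, dashes or any other character without escaping.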

You can escape the delimiter in your string. For example, let's say you have the following strings:
String a = "abc;def";
String b = "12345:";
String c = "99;red:balloons";
You want to be able to do something like this
String concat = a + delim + b + delim + c;
String[] tokens = concat.split(delim);
But if our delim is ";" then quite clearly this will not suffice, as we will have 5 tokens, and not 3. We could use a set of possible delimiters, search the strings for those delimiters, and then use the first one that isn't in the target strings, but this has two problems. First, how do we know which delimiter was used? Second, what if all delimiters exist in the strings? That's not a valid solution, and it's certainly not robust.
We can get around this by using an escape delimiter. Let us use ":" as our escape delimiter. We can use it to say "The next character is just a regular old character, it doesn't mean anything important."
So if we did this:
String aEscaped = a.replace(";",":;");
String bEscaped = b.replace(";",":;");
String cEscaped = c.replace(";",":;");
Then, we can split the concatenated string on every ";" that is not preceded by ":", like
String[] tokens = concat.split("(?<!:);");
But there is one problem: What if our text actually contains ":;" or ends with ":"? Either way, these will produce false positives. In this case, we must also escape our escape character. It basically says the same thing as before: "The next character does nothing special."
So now our escaped strings become:
// note we escape our escape token first, otherwise we'll escape
// real usages of the token
String aEscaped = a.replace(":","::").replace(";",":;");
String bEscaped = b.replace(":","::").replace(";",":;");
String cEscaped = c.replace(":","::").replace(";",":;");
And now, we must account for this in the regex. If someone knows a regex that works for this, they can feel free to edit it in. What occurs to me is something like concat.split("(::;|[^:];)") but it doesn't seem to get the job done. Parsing the string by hand, however, is pretty easy. I threw together a small test driver for it, and it seems to work just fine.
Code found at http://ideone.com/wUlyz
Result:
abc;def becomes abc:;def
ja:3fr becomes ja::3fr
; becomes :;
becomes
: becomes ::
83;:;:;;;; becomes 83:;:::;:::;:;:;:;
:; becomes :::;
Final product:
abc:;def;ja::3fr;:;;;::;83:;:::;:::;:;:;:;;:::;
Expected 'abc;def', Actual 'abc;def', Matches true
Expected 'ja:3fr', Actual 'ja:3fr', Matches true
Expected ';', Actual ';', Matches true
Expected '', Actual '', Matches true
Expected ':', Actual ':', Matches true
Expected '83;:;:;;;;', Actual '83;:;:;;;;', Matches true
Expected ':;', Actual ':;', Matches true
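The linked test driver is not reproduced here. As a rough sketch of the same idea (escape, join with ";", then parse character by character), something like the following should behave the same way on the inputs listed above. The class and method names are assumptions, not the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EscapedJoin {
    private static final char DELIM = ';';
    private static final char ESCAPE = ':';

    /** Escapes the escape character first, then the delimiter. */
    static String escape(String s) {
        return s.replace(":", "::").replace(";", ":;");
    }

    /** Escapes every part and joins the results with the delimiter. */
    static String join(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) {
                sb.append(DELIM);
            }
            sb.append(escape(parts.get(i)));
        }
        return sb.toString();
    }

    /** Splits on unescaped delimiters and removes the escapes again. */
    static List<String> split(String joined) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < joined.length(); i++) {
            char c = joined.charAt(i);
            if (c == ESCAPE && i + 1 < joined.length()) {
                // The character after an escape is taken literally.
                current.append(joined.charAt(++i));
            } else if (c == DELIM) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString()); // the last part has no trailing delimiter
        return parts;
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("abc;def", "ja:3fr", ";", "", ":", "83;:;:;;;;", ":;");
        String joined = join(parts);
        System.out.println(joined);
        System.out.println(split(joined).equals(parts)); // prints true
    }
}
```

One limitation of this layout: an empty list and a list holding a single empty string both encode to the empty string, so split("") returns one empty part.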

You concatenate using the concatenation operator(+) as below:
String str1 = "str1";
String str2 = "str2";
int inte = 2;
String result = str1+str2+inte;
But to split them apart again you need some special character as a delimiter, because the split function in String works on a delimiter.

Java String.replaceAll method to sanitize phone numbers

I have a database field called TelephoneName. This field contains telephone numbers in different formats.
What I need now is to separate them into country code and subscriber number.
For example, I saw a telephone number +49 (0)711 / 61947-xx.
I want to remove all the slashes, brackets, minus signs and spaces. The result could be +49 (country code) and 071161947** (subscriber number).
How can I do that with replaceAll method?
replaceAll("//()-","") is that correct?
The thing is I got a lot of unformatted telephone number such as:
+49 04261 85120
+32027400050
It is difficult to apply the same algorithm to every telephone number.

The replaceAll method takes a regular expression as argument. To remove everything except digits and +, you could thus do
str = str.replaceAll("[^0-9+]", "")
Here's a more complete example that also figures out the country code (based on the index of the ( symbol):
String str = "+49 (0)711 / 61947-12";
int lpar = str.indexOf('(');
String countryCode = str.substring(0, lpar).trim();
String subscriber = str.substring(lpar).trim();
subscriber = subscriber.replaceAll("[^0-9]", "");
System.out.println(countryCode); // prints +49
System.out.println(subscriber); // prints 07116194712
replaceAll("//()-","") is that correct?
No, not quite. That will remove all //- substrings. To remove those characters you need to put them in [...], like this: replaceAll("[/()-]", "") (and / does not need to be escaped).
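For the other formats mentioned in the question there is no ( to split on, so only the clean-up step carries over directly. A minimal sketch of that step (the class name PhoneSanitizer is hypothetical) applied to all three sample numbers:

```java
class PhoneSanitizer {
    /** Removes every character except digits and '+'. */
    static String sanitize(String raw) {
        return raw.replaceAll("[^0-9+]", "");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("+49 (0)711 / 61947-12")); // prints +4907116194712
        System.out.println(sanitize("+49 04261 85120"));       // prints +490426185120
        System.out.println(sanitize("+32027400050"));          // prints +32027400050
    }
}
```

Separating the country code from a number like +32027400050 cannot be done by pattern alone, because country codes vary in length; that step would need a table of known codes, which is outside this sketch.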

The first argument of replaceAll() is a regex pattern, so what you want to do is make it match all non digits (and +). You can do this using the "[^...]" (not one of...) construct :
mystring.replaceAll("[^0-9+]", "")

No, that doesn't work.
replaceAll() replaces each substring of this string that matches the given regular expression with the given replacement.
So your expression would replace every occurrence of the sequence //- (the () is just an empty group) with an empty string.
You need to do something like
String output = "+49 (0)711 / 61947-xx".replaceAll("[/()-]","");
The square brackets make it a regex character class ('either slash or open bracket or close bracket or hyphen'), rather than a sequence of characters that must appear one after the other.

This can be done simply by using :
s=s.replace("/","");
s=s.replace("(","");
s=s.replace(")","");
Then substring it to get country code.

How to convert "string" to "*s*t*r*i*n*g*"

I need to convert a string like
"string"
to
"*s*t*r*i*n*g*"
What's the regex pattern? Language is Java.

You want to match an empty string, and replace with "*". So, something like this works:
System.out.println("string".replaceAll("", "*"));
// "*s*t*r*i*n*g*"
Or better yet, since the empty string can be matched literally without regex, you can just do:
System.out.println("string".replace("", "*"));
// "*s*t*r*i*n*g*"
Why this works
It's because any instance of a string startsWith(""), and endsWith(""), and contains(""). Between any two characters in any string, there's an empty string. In fact, there are an infinite number of empty strings at these locations.
(And yes, this is true for the empty string itself. That is, an "empty" string contains itself!)
The regex engine and String.replace automatically advance the index when looking for the next match in these kinds of cases, to prevent an infinite loop.
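These claims are easy to check directly. A small sketch (the class name EmptyMatchDemo is made up for illustration):

```java
class EmptyMatchDemo {
    /** Inserts "*" at every empty-string position in s. */
    static String star(String s) {
        return s.replace("", "*");
    }

    public static void main(String[] args) {
        System.out.println("string".startsWith("")); // prints true
        System.out.println("string".contains(""));   // prints true
        System.out.println("".contains(""));         // prints true
        System.out.println(star("ab"));              // prints *a*b*
        System.out.println(star(""));                // prints *
    }
}
```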
A "real" regex solution
There's no need for this, but it's shown here for educational purpose: something like this also works:
System.out.println("string".replaceAll(".?", "*$0"));
// "*s*t*r*i*n*g*"
This works by matching "any" character with ., and replacing it with * and that character, by backreferencing to group 0.
To add the asterisk after the last character, we allow . to be matched optionally with .?. This works because ? is greedy and will always take a character if one is available, so the empty match happens only at the very end of the string.
If the string may contain newline characters, then use Pattern.DOTALL/(?s) mode.
References
regular-expressions.info/Dot Matches (Almost) Any Character and Grouping and Backreferences

I think "" is the regex you want.
System.out.println("string".replaceAll("", "*"));
This prints *s*t*r*i*n*g*.

If this is all you're doing, I wouldn't use a regex:
public static String glitzItUp(String text) {
    return insertPeriodically(text, "*", 1);
}
Putting char into a java string for each N characters
public static String insertPeriodically(
    String text, String insert, int period)
{
    StringBuilder builder = new StringBuilder(
        text.length() + insert.length() * (text.length()/period)+1);
    int index = 0;
    while (index <= text.length())
    {
        builder.append(insert);
        builder.append(text.substring(index,
            Math.min(index + period, text.length())));
        index += period;
    }
    return builder.toString();
}
Another benefit (besides simplicity) is that it's about ten times faster than a regex.

Just to be a jerk, I'm going to say use J:
I've spent a school year learning Java, and taught myself a bit of J over the course of the summer. If you're going to be doing this for yourself, it's probably most productive to use J, simply because this whole inserting-an-asterisk thing is easily done with one simple verb definition using one loop.
asterisked =: 3 : 0
i =. 0
running_String =. '*'
while. i < #y do.
NB. #y returns tally, or number of items in y: right operand to the verb
running_String =. running_String, (i{y) , '*'
i =. >: i
end.
]running_String
)
This is why I would use J: I know how to do this, and have only studied the language for a couple months loosely. This isn't as succinct as the whole .replaceAll() method, but you can do it yourself quite easily and edit it to your specifications later. Feel free to delete this/ troll this/ get inflamed at my suggestion of J, I really don't care: I'm not advertising it.
