Java misinterpreting apostrophes when parsing input - java

So I am trying to use the Wikipedia API to read the first paragraph of a given Wikipedia page. Unfortunately, Wikipedia uses a weird system to deal with special characters (http://www.mediawiki.org/wiki/API:Data_formats#JSON_parameters), and I was unable to parse the default response without getting the characters as escape sequences. Obviously the best solution would be to interpret these directly in Java, but I'm not sure there is a way to do that, so I force a UTF-8 response. This approach looks like it should work, but when I pass it through my parsing code, it returns:
Ella Marija Lani Yelich-O'Connor (born 7 November 1996).....named among Time?'?s most influential teenagers in the world, and in the following year, she made her way into Forbes?'?s "30 Under 30" list.
Notice that some apostrophes are kept and some aren't. I think the misinterpreted characters are a side effect of my earlier parsing steps (I want the plain text, so I parse the HTML tags out). Here is my parsing code; it's a bit messy, but it almost works:
public static String getWikiParagraph (String url){
try {
//System.out.println(url.substring(url.lastIndexOf('/') + 1));
URL apiURL = new URL("http://www.en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&utf8&exintro=&titles="+url.substring(url.lastIndexOf('/') + 1));
BufferedReader br = new BufferedReader(new InputStreamReader(apiURL.openStream(), Charset.forName("UTF-8")));
StringBuilder sb=new StringBuilder();
String read = br.readLine();
while(read != null) {
sb.append(read);
read =br.readLine();
}
String s = sb.toString();
s = Arrays.toString(getTagValues(s).toArray());
s=s.replace("<i>","");
s=s.replace("</i>","");
s=s.replace("?'?","'"); //makes no difference in output
s=s.replace("u200a","");
s=s.replace("<b>","");
s=s.replace("</b>","");
s=s.replace("\\","");
s=s.substring(1, s.length() -1);
return s;
} catch (MalformedURLException e) {
e.printStackTrace();
} catch(IOException e){
System.out.println("Error fetching data from url");
}
return null;
}
private static List<String> getTagValues(final String str) {
final Pattern TAG_REGEX = Pattern.compile("<p>(.+?)</p>");
final List<String> tagValues = new ArrayList<String>();
final Matcher matcher = TAG_REGEX.matcher(str);
while (matcher.find()) {
tagValues.add(matcher.group(1));
}
return tagValues;
}
Any help would be greatly appreciated.

Use a JSON parser and run the results you want to clean up through something like JSoup. Sure, you could write your own brittle HTML parser, but this is a bit of a fool's errand. HTML is subtle, and quick to anger. Spend your time building your logic and let the utility classes do the grungy stuff.
And, yes. The comments are correct. This JSON has Unicode sequences in it, at least when I look at that URL, which will not render correctly in most terminals.
EDIT
The JSON format is (apparently) subject to change. I got cleaner output by specifying "&continue=" in the URL to go back to an older continuation format. You should probably find out what these continuation format changes mean for you.
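For illustration only, here is a minimal standard-library sketch of the "interpret the escape sequences directly in Java" step the question asks about. It is not the JSON parser approach recommended above, and the class name JsonUnicodeUnescaper is made up for this example. It assumes the input contains only simple four-digit Unicode escapes and does not handle an escaped backslash that happens to be followed by the letter u, which is one reason a real JSON parser is the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class JsonUnicodeUnescaper {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every four-digit Unicode escape in raw with the character it names.
    static String unescape(String raw) {
        Matcher m = ESCAPE.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' in the decoded text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Text without escapes passes through unchanged, so the method can be applied to the whole extract before the HTML tags are stripped.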

Related

Loop through file and look up text with Regex

I am using Selenium to test web-pages and want to make a simpler way to update the test-cases (not important for the problem).
I loop through lines now with this:
driver.get("http://vg.no"); //open the web page
try {
BufferedReader reader = new BufferedReader(new FileReader("//Users//file.txt"));
try {
String line = null;
while ((line = reader.readLine()) != null) {
driver.findElement(By.cssSelector(line)).click(); //find and click on the data specified in every line in the document
}
} finally {
reader.close();
}
} catch (IOException ioe) {
System.err.println("oops " + ioe.getMessage());
}
Textfile content example now:
a[href*='//nyheter//meninger//']
img[class*='logo-red']
img[class*='article-image']
I want to rebuild it into a solution that runs different commands based on regular expressions.
I am trying to get it to work this way:
vg.no //this will start driver.get("vg.no")
img[class*='logo-red'] //this will start driver.findElement(By.cssSelector("img[class*='logo-red']")).click()
img[class*='article-image']
ItAvisen.no
img[class*='article-image']
img[class*='article-image']
Is there a way I can use regex to run different parts of the code based on the content of the text file, and use part of that content as variables?
It works this way after feedback from cvester:
Finding matches for img[class*='logo-red']
String regexp = "img\\[class\\*=\\'*\\'(.*)\\]";
boolean match = line.matches(regexp);
Will it still be line based?
In that case you can just read line by line and use the String.matches(String regex) for each case you identify.
If you can specify more specific information I might be able to give you a better solution.
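The line-by-line matching suggested above could be sketched roughly as follows. To stay runnable without Selenium, this version only returns a description of the action it would take. The class name LineDispatcher, both patterns, and the GET/CLICK/SKIP labels are assumptions made for this example; in the real test the two branches would call driver.get(...) and driver.findElement(By.cssSelector(...)).click().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical dispatcher: decides per line which command to run.
final class LineDispatcher {
    // Assumed rule: a bare host name such as vg.no means "open this page".
    private static final Pattern URL_LINE =
            Pattern.compile("^([A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+)$");
    // Assumed rule: tag[attribute*='value'] means "click this element".
    private static final Pattern SELECTOR_LINE =
            Pattern.compile("^(\\w+)\\[(\\w+)\\*='([^']*)'\\]$");

    static String describe(String line) {
        String trimmed = line.trim();
        Matcher url = URL_LINE.matcher(trimmed);
        if (url.matches()) {
            // Real code: driver.get("http://" + url.group(1));
            return "GET http://" + url.group(1);
        }
        Matcher selector = SELECTOR_LINE.matcher(trimmed);
        if (selector.matches()) {
            // Real code: driver.findElement(By.cssSelector(trimmed)).click();
            // group(3) is the captured attribute value, usable as a variable.
            return "CLICK " + trimmed + " (value=" + selector.group(3) + ")";
        }
        return "SKIP " + trimmed;
    }
}
```

The capture groups are what let part of the line act as a variable: group(1) of the first pattern is the host to open, and group(3) of the second is the attribute value.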

Strategy for Processing a Text File with a Header using Reg Exp in Java

I have a file that contains a header with comments (e.g. [Comment] This is a comment) and a subsequent data section. The data starts at "Mk1=".
The program I am working on should:
Copy the header contents
Search and replace only in the data section of the file
Write header and data to a new file
I am currently using:
StringBuffer
Scanner
regex.Pattern;
In my code so far (reduced to its essentials):
public static void main(String[] args) {
File file = readFile("file.ext");
Scanner inputScanner = null;
try {
inputScanner = new Scanner(file);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
String currentLine = "";
while(inputScanner.hasNext()) {
currentLine = inputScanner.findInLine(regexpPattern);
if (currentLine != null){
fileOutput.append(currentLine + "\n");
}
}
}
Because the Scanner works like a queue, I have trouble figuring out what strategy I should use. I have found examples of using a Matcher instead of a Scanner. To my understanding I also have to work with boolean flags, because of the queue-like structure of Scanner. The findWithinHorizon() method does not seem helpful as I want the reg exp only to apply beyond the horizon. Is there perhaps a "hack" for the delimiter of the Scanner, assuming I know the series of characters of the header start and end?
File Example
[Comment]
Text goes here.
[Another Comment]
;Instructions: Below you will find Mk1= where the data can be assigned.
;More text.
Mk1=data
Mk2=data
Mk3=data
What strategy should I use?
Assuming you can use java.nio.file.Files (since Java 1.7) and your text file isn't too big, I'd read all lines at once and go for the Matcher:
Charset charset = Charset.forName("UTF-8");
List<String> lines = Files.readAllLines(file.toPath(), charset);
for (String line : lines) {
Matcher matcher = regexpPattern.matcher(line);
if (matcher.matches()) {
// do something
}
}
Using regex groups will prove useful for retrieving parameter-value pairs:
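Pattern dataPattern = Pattern.compile("^Mk(\\d+)=(.*)$");
Matcher dataMatcher = dataPattern.matcher(line);
// group() is only valid after a successful match, so call matches() first
if (dataMatcher.matches()) {
    int mk = Integer.parseInt(dataMatcher.group(1));
    String data = dataMatcher.group(2);
}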
Parsing is a two step process: You have a tokenizer which recognizes patterns in the input and a parser which reads tokens but also has a state to know where it is.
You can use regexp for the "tokenize" part of the problem but you also need a parser which remembers "I have seen [Comment]" so it knows what could/should be next.
Related:
https://class.coursera.org/compilers/lecture
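The "parser with state" idea above can be sketched with a single boolean flag. This is only one possible reading of the requirements: it assumes the data section begins at the first line that starts with "Mk1=" (as in the file example) and that everything before it is header. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: copies the header untouched, replaces only in the data section.
final class HeaderAwareReplacer {
    static List<String> process(List<String> lines, String regex, String replacement) {
        List<String> out = new ArrayList<>();
        boolean inData = false; // the parser state: have we reached the data section yet?
        for (String line : lines) {
            // Assumption: the first line starting with "Mk1=" opens the data section.
            if (!inData && line.startsWith("Mk1=")) {
                inData = true;
            }
            out.add(inData ? line.replaceAll(regex, replacement) : line);
        }
        return out;
    }
}
```

Combined with Files.readAllLines from the earlier answer, the returned list can be written straight to the new file. Note that a header line which merely mentions "Mk1=" in the middle of a comment does not flip the flag, because startsWith only looks at the beginning of the line.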

Special characters coming through as ? in SMPP and Java

I've spent a crazy amount of time trying to get special characters to come through properly in our application. Our provider told us to use "GSM0338, also known as ISO-8859". To me, this means ISO-8859-1, since we want Spanish characters.
The flow: (Telling you everything, since I've been playing around with this for a while.)
Used Notepad++ to create the message files in UTF-8 encoding. (No option to save as ISO-8859-1).
Sent each file through a quick Java program which converts and writes new files:
String text = readTheFile(....);
output = text.getBytes("ISO-8859-1");
FileOutputStream fos = new FileOutputStream(filesPathWithoutName + "\\converted\\" + filename);
fos.write(output);
fos.close();
SMPP test class in another project reads these files:
private static String readMessageFile(final String filenameOfFirstMessage) throws IOException {
BufferedReader br = new BufferedReader(new FileReader(filenameOfFirstMessage));
String message;
try {
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
sb.append(line);
sb.append("\n");
line = br.readLine();
}
message = sb.toString();
} finally {
br.close();
}
return message;
}
Calls send
public void send(final String message, final String targetPhone) throws MessageException {
SmppMessage smppMessage = toSmppMessage(message, targetPhone);
smppSmsService.sendMessage(smppMessage);
}
private SmppMessage toSmppMessage(final String message, final String targetPhone) {
SmppMessage smppMessage = new SmppMessage();
smppMessage.setMessage(message);
smppMessage.setRecipientAddress(toGsmAddress(targetPhone));
smppMessage.setSenderAddress(getSenderGsmAddress());
smppMessage.setMessageType(SmppMessage.MSG_TYPE_DATA);
smppMessage.setMessageMode(SmppMessage.MSG_MODE_SAF);
smppMessage.requestStatusReport(true);
return smppMessage;
}
Problem:
SMSs containing letters ñ í ó are delivered, but with these letters displaying as question marks.
Configuration:
smpp.smsc.charset=ISO-8859-1
smpp.data.coding=0x03
Absolutely any help with this would be GREATLY appreciated. Thank you so much for reading.
Well, your provider is wrong. GSM 03.38 is not ISO-8859-1. They mostly agree from space (0x20) through "Z" (0x5A), apart from a few positions such as "$" and "@", and again for the lowercase letters a-z, but elsewhere they diverge. For instance, in GSM 03.38, ñ is 0x7D, while in ISO-8859-1, it is 0xF1. Since GSM 03.38 is a 7-bit code, anything above 0x7F is going to come out as a "?". Accented letters and symbols outside the shared ranges are going to come out as something unexpected.
Since Java doesn't usually come with GSM 03.38 support, you're going to have to decode by hand. It shouldn't be too difficult to do, and the following piece of software might already do most of what you need:
Java GSM 03.38 SMS Character Set Translator
You might also find this translation table between GSM 03.38 and Unicode useful.
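To give a feel for what translating by hand involves, here is a deliberately partial sketch. The class name Gsm0338Sketch is hypothetical, and the lookup table covers only a handful of characters from the GSM 03.38 default alphabet; anything not covered is sent as "?" (0x3F). A real implementation needs the full table, including the extension (escape) characters. As far as I know the default alphabet contains é but not í or ó, so check the translation table before relying on those characters.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, intentionally incomplete encoder for the GSM 03.38 default alphabet.
final class Gsm0338Sketch {
    private static final Map<Character, Byte> SPECIAL = new HashMap<>();
    static {
        SPECIAL.put('@', (byte) 0x00);
        SPECIAL.put('$', (byte) 0x02);
        SPECIAL.put('\u00E9', (byte) 0x05); // e with acute
        SPECIAL.put('\u00C9', (byte) 0x1F); // E with acute
        SPECIAL.put('\u00A1', (byte) 0x40); // inverted exclamation mark
        SPECIAL.put('\u00D1', (byte) 0x5D); // N with tilde
        SPECIAL.put('\u00BF', (byte) 0x60); // inverted question mark
        SPECIAL.put('\u00F1', (byte) 0x7D); // n with tilde
    }

    // Returns one unpacked 7-bit code per input character (no septet packing).
    static byte[] encode(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Byte special = SPECIAL.get(c);
            if (special != null) {
                out[i] = special;
            } else if (sameAsAscii(c)) {
                out[i] = (byte) c;
            } else {
                out[i] = 0x3F; // '?' for anything this partial table does not cover
            }
        }
        return out;
    }

    // Positions where the GSM default alphabet and ASCII share the same code.
    private static boolean sameAsAscii(char c) {
        return c == '\n' || c == '\r'
                || (c >= 0x20 && c <= 0x23)
                || (c >= 0x25 && c <= 0x3F)
                || (c >= 'A' && c <= 'Z')
                || (c >= 'a' && c <= 'z');
    }
}
```

Whether the resulting bytes can be handed to the SMPP library directly depends on that library and on the data coding the provider expects, which the question does not settle.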

Cannot start new line with "\n"?

I am working on a project and there is a small part of it really confusing me.
Say I have a String array :
String[]text = {"string","array"};
for example.
And I want to make it to a single String with a new line "\n" between every single word.
So here is my code
public String setText(String [] text) throws UnsupportedEncodingException{
StringBuffer result = new StringBuffer();
String newline = System.getProperty("line.separator");
if (text.length > 0) {
result.append(text[0]);
for (int i=1; i<text.length; i++) {
result.append(newline);
result.append(text[i]);
}
}
return result.toString();
}
My code doesn't work. The return value is a single string, but when I use it, it is still on one line.
Can anyone help me with this?
Thank you
Allan
Did you check what System.getProperty("line.separator") returns?
It differs between operating systems: \n on Linux and \r\n on Windows.
Try to use
System.getProperty("line.separator", "\r\n"); or System.getProperty("line.separator", "\n");
Depends on how you use the result. Have you tried writing it to System.out / a file or anything similar?
I would recommend using Joiner from Google's Guava library.
Joiner.on(System.getProperty("line.separator")).join(text)
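If adding Guava is not an option, the same join can be written with the standard library. This sketch swaps Joiner for String.join (available since Java 8) and System.lineSeparator() (since Java 7); the class name LineJoiner is made up for this example.

```java
// Hypothetical helper showing a standard-library equivalent of the Guava Joiner call.
final class LineJoiner {
    // Joins the words with the platform line separator between them.
    static String joinLines(String[] text) {
        return String.join(System.lineSeparator(), text);
    }

    public static void main(String[] args) {
        // Printed to a console, each word appears on its own line.
        System.out.println(joinLines(new String[] {"string", "array"}));
    }
}
```

Whether the separator shows up as a line break still depends on where the string is displayed: a console or a text file honors it, while something like an HTML page or a single-line UI widget does not.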

How to replace multiple occurences of a string in a text file with a variable entered by the user and save all to a new file?

public static void main(String args[])
{
try
{
File file = new File("input.txt");
BufferedReader reader = new BufferedReader(new FileReader(file));
String line = "000000", oldtext = "414141";
while((line = reader.readLine()) != null)
{
oldtext += line + "\r\n";
}
reader.close();
// replace a word in a file
//String newtext = oldtext.replaceAll("drink", "Love");
//To replace a line in a file
String newtext = oldtext.replaceAll("This is test string 20000", "blah blah blah");
FileWriter writer = new FileWriter("input.txt");
writer.write(newtext);writer.close();
}
catch (IOException ioe)
{
ioe.printStackTrace();
}
}
}
A couple suggestions on your sample code:
Have the user pass in old and new on the command line (i.e., args[0] and args[1]).
If it's sufficient to do this a line at a time, it's going to be much more efficient to read a line, replace old -> new, then stream it out.
Also check out StringUtils and IOUtils, which may make your life easier in this case.
Easiest is the String.replace(oldstring, newstring), or String.replaceAll(regex, newString) function, you can just read the one file and write the replacement into a new file (or do it line by line if you're concerned about file size).
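Putting the suggestions above together, a minimal sketch might look like the following. It assumes four command-line arguments (input file, output file, old text, new text), UTF-8 files that fit in memory, and literal rather than regex replacement; the class name FileReplacer and the argument order are choices made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: literal search and replace from one file into a new file.
final class FileReplacer {
    // Replaces every literal occurrence of oldText in each line.
    static List<String> replaceInLines(List<String> lines, String oldText, String newText) {
        return lines.stream()
                .map(line -> line.replace(oldText, newText))
                .collect(Collectors.toList());
    }

    // Reads input, applies the replacement, and writes the result to output.
    static void replaceInFile(Path input, Path output, String oldText, String newText) {
        try {
            List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
            Files.write(output, replaceInLines(lines, oldText, newText), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 4) {
            System.err.println("usage: FileReplacer <input> <output> <old> <new>");
            return;
        }
        replaceInFile(Paths.get(args[0]), Paths.get(args[1]), args[2], args[3]);
    }
}
```

String.replace treats the search text literally, so characters such as "." or "$" in the user's input need no escaping. Use replaceAll instead only if the user is expected to type a regular expression.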
After reading your last comment - that's a totally different story... the preferred solution would be to parse the css file into an object model (like DOM), apply the changes there and serialize the model to css afterwards. It's much easier to find all color attributes in DOM and change them compared to doing the same with search and replace.
I've found some CSS parser in the wild wild web, but none of them looked like being capable of writing CSS files.
If you wanted to replace the color names with search and replace, you'd search for 'color:<colorname>' and replace it with 'color:<youHexColorValue>'. You may have to do the same for 'color:"<colorname>"', because the color name can be set in double quotes (another argument for using a CSS parser..)
String.replaceAll() is the easiest way to do it. Just read the complete CSS file into one String, replace all as suggested above and write the new String to the same (or a temporary) file (first).
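As a rough illustration of the search-and-replace route for colors described above, the sketch below rewrites 'color:<colorname>' (optionally in double quotes) to a hex value taken from a lookup map. The class name, the regular expression, and the map contents are all assumptions for this example. It is a text-level replacement, not a CSS parser, so it will also touch properties whose names end in "color" (such as background-color) and knows nothing about comments or strings.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces named colors with hex values by plain text matching.
final class CssColorReplacer {
    // "color", optional spaces, ":", optional spaces, then a name with optional double quotes.
    private static final Pattern COLOR = Pattern.compile("(color\\s*:\\s*)\"?([A-Za-z]+)\"?");

    static String replaceColorNames(String css, Map<String, String> nameToHex) {
        Matcher m = COLOR.matcher(css);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String hex = nameToHex.get(m.group(2).toLowerCase(Locale.ROOT));
            // Unknown names (for example "inherit") are left exactly as they were.
            String replacement = (hex == null) ? m.group() : m.group(1) + hex;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The map from color names to hex values would come from the user input mentioned in the question, or from a fixed table of the CSS color keywords.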
