I'm parsing a text file made from this Wikipedia article; basically, I pressed Ctrl+A and copy/pasted all the content into a text file (I use it as an example).
I'm trying to make a list of words with their counts, and for that I use a Scanner with this delimiter:
sc.useDelimiter("[\\p{javaWhitespace}\\p{Punct}]+");
It works well for my needs, but while analysing the result I saw something that looks like a blank token (again...). The character comes after (nynorsk) in the article (funnily, when I copy/paste it here the character disappears; in gedit I can press → and ← and the cursor doesn't move).
After further research I found out that this token is actually POP DIRECTIONAL FORMATTING (U+202C).
It's not the only directional character; looking at the Character documentation, Java seems to define them.
So I'm wondering whether there is a standard way to detect these characters and, if possible, one that can be easily integrated into the delimiter pattern.
I'd like to avoid making my own list because I fear I would forget some of them.
You could always go the other way round and use a whitelist rather than a blacklist:
sc.useDelimiter("[^\\p{L}]+");
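If the blacklist approach is preferred, one option is to add the Unicode general category "Format" (\p{Cf}) to the original delimiter; U+202C and the other directional formatting characters belong to that category. The following is only a sketch: the class name WordTokens and the method tokens are made up for illustration. Note that the whitelist [^\\p{L}]+ above also treats digits as delimiters, which may or may not be wanted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class WordTokens {
    // Whitespace, ASCII punctuation, and Unicode "format" characters (category Cf),
    // which include U+202C POP DIRECTIONAL FORMATTING.
    static final String DELIMITER = "[\\p{javaWhitespace}\\p{Punct}\\p{Cf}]+";

    /** Splits the text into tokens using the delimiter above. */
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter(DELIMITER);
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The invisible U+202C after ")" no longer shows up as a token of its own.
        System.out.println(tokens("(nynorsk)\u202C og bokm\u00e5l"));
    }
}
```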
I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
I have to recognize and remove HTML tags without removing the original content.
I have to store the index of the previously existing markups.
So here's an example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a String sequence of 35 characters, marked up with a strong tag. As we know, an HTML markup has a start and an end, and if we interpret the start and end markup as sequences of characters, each also has a start and an end (a character index).
Again, in the previous example, the beginning index of the open/start tag is 5 (counting from index 0), and the end index is 13. The same logic applies to the close tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
Given this sequence, how can I remember the places where the markup has to be inserted again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag in the X position/index, and the closing tag in the Y position/index... Like so:
This is a message.
5 9
index 5 = <strong>
index 9 = </strong>
Keep in mind that the following situation is also possible:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document. For this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example (if possible). It does not have to be the best example (the best solution) in the world, but only that it works the right way for any kind of situation.
If anything in my question is unclear, please comment and I will improve it.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I do not care whether I use regular expressions or not to solve this problem; I just want to solve it, no matter how (but of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that gives the same result. I read that an XML parser could be the solution. Is that correct? Is there an XML parser capable of doing everything I need?
Again, Thanks in advance.
EDIT 2
I'm making this edit to explain the applicability of my problem (as asked). Before I start, I want to say that what I'm trying to do is something I've never done before and is not in my area, so it may not be the most appropriate way to do it. Anyway...
I'm developing a site where users are allowed to read content but cannot edit it (edit or remove text). However, users can still mark/highlight excerpts (ranges) of the content (with some styling). This is the big summary.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to enable styling of content without text editing. I could save the styled text to a database, but this would take up a lot of space, since every client is allowed to do this and there are many clients. So if a client marks snippets of a paragraph, saving the paragraph back to the database for each client in the system is rather costly in terms of storage.
So I thought of just saving the range (indexes) of the markups made by users in a database. It is much easier to save just a few numbers than all the text with the styling required. In the case, for example, I could save a line / record in a table that says:
In paragraph X, from index Y to index Z, the user P defined an ABC stylization.
This would require a translation / conversion, from database to HTML, and HTML to database. Setting a converter can be easy (I guess), but I do not know how to get the indexes (following this logic). And then we stop again at the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
In a similar way, I know it would be ideal to do this processing with JavaScript (client-side). It turns out that I do not have a specialized JavaScript team, so this needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is already ready, easy and quick to use. Would you know of any ready-made, easy-to-use library that can do it in a simple way? Does it exist?
You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but HTML isn't a regular language because it has a recursive structure.
A language that allows arbitrarily deep nesting isn't regular, so you can't parse it with a regex.
Keep in mind that HTML is a context-free language (or, at least, pretty close to context-free). See also the Chomsky hierarchy.
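For the index bookkeeping the question asks about, full parsing may not be needed: it can be enough to copy everything outside <...> into a plain-text buffer and remember, for each tag, the plain-text index at which it stood. The sketch below does that with a simple character scan; the names TagOffsets, Mark, strip and restore are invented for this example. It assumes that no attribute value contains a literal '>' and it does not handle comments or scripts, so a real HTML parser is still the safer choice for arbitrary documents.

```java
import java.util.ArrayList;
import java.util.List;

final class TagOffsets {
    /** A removed tag and the index in the plain text where it stood. */
    static final class Mark {
        final int index;
        final String tag;
        Mark(int index, String tag) { this.index = index; this.tag = tag; }
    }

    /** Copies everything outside <...> into the result and records each tag in marks. */
    static String strip(String html, List<Mark> marks) {
        StringBuilder plain = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            int open = html.indexOf('<', i);
            int close = open < 0 ? -1 : html.indexOf('>', open);
            if (close < 0) {                 // no further complete tag: keep the rest as text
                plain.append(html, i, html.length());
                break;
            }
            plain.append(html, i, open);
            marks.add(new Mark(plain.length(), html.substring(open, close + 1)));
            i = close + 1;
        }
        return plain.toString();
    }

    /** Re-inserts the tags; going from last to first keeps the earlier indexes valid. */
    static String restore(String plain, List<Mark> marks) {
        StringBuilder sb = new StringBuilder(plain);
        for (int k = marks.size() - 1; k >= 0; k--) {
            sb.insert(marks.get(k).index, marks.get(k).tag);
        }
        return sb.toString();
    }
}
```

Because each tag is stored on its own rather than as an open/close pair, overlapping markup such as the <a>...<f> example round-trips unchanged as well.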
I've looked at JLine, Lanterna, and others, but I'm not seeing a simple way to find the current caret position in the terminal with these tools. I've looked at a number of escape codes, tput, etc. But, I'm looking for the easiest way to get the current column and row where the caret is located with Java. Maybe I haven't found the right call in these libraries...
What's the easiest way to get the row and column of the caret in the terminal?
I'm looking for a purely textual library so that I can rewrite the buffer. I'm aware of ANSI escape codes and how to manipulate them to produce the effects I'm after. What I'm trying to do is make a Java prompt library in the vein of Inquirer.js for Node. It has a number of simple ways to get info from the user (lists, questions, split lists, etc). All of it is text, so all of it is without a GUI and without Swing. I don't want Swing, I just want a decent terminal UI experience.
Edit2
With the Caret interface (http://docs.oracle.com/javase/7/docs/api/javax/swing/text/Caret.html), you can create a CaretListener object in order to find the caret position.
So you would have to create a new CaretListener that responds to the GUI and reads the position with the getDot() method.
This might help... with code anyway.
http://bestjavapractices.blogspot.com.au/2011/11/get-current-caret-position-in.html
At the moment that would only work with GUI/Swing components, and I'm not sure whether it would work for a terminal application, which is what you want, where I assume you would be using command-line arguments and such to get things to work.
I don't think that you can do this, as I think the terminal is really just printing the output, but I will keep checking for a little while anyway.
If you could tell me what you are trying to use the caret for that would help in my efforts.
Hope this helps.
If it doesn't you may need to look through some more Text Toolkits, such as
JCurses - http://sourceforge.net/projects/javacurses/
Charva - http://www.pitman.co.za/projects/charva/index.html
The current version of Lanterna can read the cursor position using an ANSI CSI sequence: https://github.com/mabe02/lanterna/blob/master/src/main/java/com/googlecode/lanterna/terminal/ansi/ANSITerminal.java#L266
And here you can find a simple Java solution showing how to send the ANSI command and read the reported cursor position from System.in:
How to read the cursor position given by the terminal (ANSI Device Status Report) in JAVA
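As a rough illustration of the idea (not the Lanterna code itself): the request is ESC [ 6 n, and a terminal that supports it replies on standard input with ESC [ row ; column R. The class and method names below are made up for this sketch. The query method only works when the terminal is in raw (non-canonical) mode, which plain Java cannot switch on by itself, so that part is an assumption about the environment.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CursorPosition {
    // Reply to the Device Status Report request: ESC [ row ; column R
    private static final Pattern REPORT = Pattern.compile("\u001b\\[(\\d+);(\\d+)R");

    /** Returns {row, column} (both 1-based), or null if the text holds no report. */
    static int[] parse(String reply) {
        Matcher m = REPORT.matcher(reply);
        if (!m.find()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    /** Sends ESC[6n and reads up to the terminating 'R'; assumes a raw-mode terminal. */
    static int[] query(InputStream in) throws IOException {
        System.out.print("\u001b[6n");
        System.out.flush();
        StringBuilder reply = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            reply.append((char) c);
            if (c == 'R') {
                break;
            }
        }
        return parse(reply.toString());
    }
}
```

A caller would typically invoke CursorPosition.query(System.in) after putting the terminal into raw mode by some external means.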
I've built a content management tool that allows a product team to create and manage product that gets exported to a website and for a different team of designers to create print ads for newspapers displaying the same product data.
My problem is with the InDesign graphic designers and the macros that they use within InDesign. The macros have the ability to take copy/pasted text/data and auto format the text inside InDesign based on the presence of certain characters. In particular the design team uses tab, "soft line break" (shift return), and regular line breaks (hard returns) in their macros.
Right now I generate a block of text with the records and the desired formatting characters in a Java class, and that is sent via DWR to the client side. When a tab character is required I send \t, a return is \r, and I was hoping that a soft line break would be \n; however, InDesign seems to regard both \r and \n as a regular line break.
I had given up on being able to pass a soft return until yesterday, when I came across Unicode \u2028 (soft line break) and \u2029 (regular line break). I've tried outputting both of these characters instead of \r and \n in the hope that InDesign might treat them differently. In the box that the designers copy the output from, it looks like there is no character there. There's no line break at all in the places where I've specified \u2028 to appear. When I copy/paste the output into a text editor it shows me that there is an unrecognized character there (it displays as a box with a question mark in it).
Platform is Java/MySQL running on Tomcat.
To date, I haven't had to deal too much with character encoding in this application. Header has <meta charset="utf-8" /> set but that's about it so far. I've tried setting this to utf-16 but it doesn't change the output. All of the tables in the MySQL database are set to utf8/utf8_general_ci.
Thoughts? How can I force InDesign to take copy/pasted text and recognize all of its macro capable characters? Actually, it's just the soft line breaks that it's not recognizing. HELP! :)
Thank you. Sorry this is so long!
Ryan V
I've been playing around with ID CS6 (OS X) for a while and I can't for the life of me get it to recognize a pasted LF as a forced line break. LF, CR and CRLF all turn into paragraph breaks. U+2028 and U+2029 are displayed as empty glyphs, not breaks.
I'm a little wary of posting this as an answer, but I'll give it a go:
You might consider providing the text as a downloaded .txt file. CS5 introduced "Tagged Text" (a sort of XML-ish text document with full support for InDesign characters, attributes, etc.,) so this means your designers will be able to place the text file and InDesign will treat everything as intended.
To turn your existing text into CS5+'s Tagged Text (see the reference here), plop a <ASCII-MAC> or <ASCII-WIN> (as appropriate) as the first line and escape any '<' or '>'s with a backslash, then you're free to use <0x000A> as a forced line break. (literally those 8 characters)
That's probably mega-overkill, but it's certainly the most stupidly reliable way I can think of. I'll edit if I get anything else working.
NB. "forced line break" is the term InDesign itself uses for the character produced by Shift+Enter (your "soft line break"); contrast it with "paragraph break" for a standard carriage return. InDesign apparently represents forced breaks with LF (U+000A) and paragraph breaks with CR (U+000D).
I'm not sure how you were trying to transfer and print out your characters (if you post your DWR and JavaScript code I might be able to help more), but one thing I would try is to ensure that your Java output is actual UTF-8 using something like:
String yourRecordString = "Some line 1. \u2028Some line 2.";
ByteBuffer bb = Charset.forName("UTF-8").encode(yourRecordString);
Then, you can write out the bytes in bb into an output stream/file and check them. (Make sure to write them as bytes and not as a String nor as chars.) For example, the UTF-8 encoding of \u2028 is E2 80 A8, so you should see that sequence at the appropriate place in your output. (I use hexmode in vim for things like this.)
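If no hex editor is at hand, the same check can be done directly in Java. This is a small sketch with an invented helper name (Utf8Hex.hex); it prints the UTF-8 bytes of a string as hex pairs so the E2 80 A8 sequence can be spotted.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Hex {
    /** Returns the UTF-8 bytes of s as space-separated upper-case hex pairs. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three bytes E2 80 A8 should appear between the two sentences.
        System.out.println(hex("Some line 1. \u2028Some line 2."));
    }
}
```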
Then, make sure that these bytes are received on the JavaScript side. (While I'm not an expert with DWR, I might prefer to make your Java function return something other than a String.)
This should at least help you diagnose where the problem lies. If you do see that sequence and if InDesign still isn't recognizing the soft line breaks, then you at least know the problem is with InDesign and that you will have to find some other solution (such as modifying the designer's macros to recognize other characters).
(Also, note that you can see the default encoding for your JVM using Charset.defaultCharset(). My guess is that your default is not UTF-8 and that InDesign may have also had a problem with the UTF-16 you tried due to endianness or something like that.)
So, I finally discovered that JavaFX lets you use HostServices.showDocument(uri) to open a browser at the given URL. I have run into a problem though: I cannot open URLs that contain Chinese characters. It interprets them as '?', taking you to the wrong URL. AWT's Desktop.browse(uri) handles these characters without a problem, so I know that they can technically be passed to the browser. I'm not sure whether there is anything I can do on my end, though.
My question is: Is there any way to make JavaFX's HostServices.showDocument() correctly read in Chinese characters?
EDIT:
Sample string
http://www.mdbg.net/chindict/chindict.php?page=worddict&wdrst=0&wdqb=%E6%96%87
You can follow the link to see the address's Chinese character (at the very end of the URL). In doing this, I noticed that the character gets converted to a series of % signs, letters, and numbers. Plugging those into showDocument() in place of the character works fine. So I guess the question is now "How do I convert a character to this format?"
I was able to figure out that converting the string into a URI and then calling the .toASCIIString() method gave me what I needed (it converts Chinese characters, and I would assume others, into something readable by showDocument()). Thanks for the help, jewelsea.
If there is a better way to do this, feel free to give me another answer.
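For reference, the conversion described above can be sketched as follows. The helper name AsciiUrl.toAscii is made up; URI.create and toASCIIString are standard java.net.URI methods. Note that URI.create throws IllegalArgumentException for input that is not a legal URI (for example one containing spaces), so this is only a sketch for URLs that are otherwise well formed.

```java
import java.net.URI;

final class AsciiUrl {
    /** Percent-encodes non-ASCII characters as UTF-8 and leaves the rest of the URL alone. */
    static String toAscii(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        // \u6587 is the Chinese character at the end of the sample URL.
        String url = "http://www.mdbg.net/chindict/chindict.php?page=worddict&wdrst=0&wdqb=\u6587";
        System.out.println(toAscii(url));
    }
}
```

The resulting string can then be passed to showDocument().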
So I'm working with last.fm API. Sometimes, the query results in tracks that contain characters like these:
Æther, é, Hṛṣṭa
or non-English characters like these:
水鏡.
When debugging in Eclipse, I see them just fine (as-is), but printing to the console shows them as ???, which is OK for me.
Now, how do I handle these? At first I thought I could remove every song that has any character other than the ones in the English language. I used the regex ^\\w+$ but it didn't work. I also tried \\w+. That didn't work either.
Then I thought further about how to handle these properly. Can anyone help me out? I am perfectly fine with leaving these tracks out of the equation, i.e. I'm fine with having only tracks with English characters.
Another question: what is the best way to display these characters on the console and/or in a Swing GUI?
First, you must ensure that you use the correct encoding when reading your input.
Second, ensure that the font used in Eclipse on the platform you are developing on is able to display all these characters. Swing will display Unicode characters if you read them correctly.
You will likely want to use UTF-8 everywhere.
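As a sketch of both points: the regex attempts probably failed because, by default, Java's \w matches only [a-zA-Z_0-9] and therefore rejects spaces and punctuation in otherwise English titles too. If the goal is simply to keep ASCII-only track names, asking an ASCII encoder is one option; printing through an explicit UTF-8 PrintStream avoids depending on the platform default encoding. The class name TrackNames and the sample titles are made up, and whether the console itself renders UTF-8 depends on its own settings and font.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

final class TrackNames {
    /** True if every character of the name is plain ASCII. */
    static boolean isAscii(String name) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(name);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Print through an explicit UTF-8 stream instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        String[] names = { "Yesterday (Remastered)", "\u00c6ther", "\u6c34\u93e1" };
        for (String name : names) {
            out.println(name + " ascii=" + isAscii(name));
        }
    }
}
```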