Following my question about having to deal with a poorly implemented chat server, I have come to the conclusion that I should try to extract the chat messages from the other server responses.
Basically, I receive a string that would look like this:
13{"ts":2135646,"msg":"{\"ts\":123156,\"msg\":\"this is my chat {message 1\"}","sender":123,"recipient":321}45{"ts":2135646,"msg":"{\"ts\":123156,\"msg\":\"this is my chat} message 2\"}","sender":123,"recipient":321}1
And the result I would like is two substrings:
{"ts":2135646,"msg":"{\"ts\":123156,\"msg\":\"this is my chat {message 1\"}","sender":123,"recipient":321}
{"ts":2135646,"msg":"{\"ts\":123156,\"msg\":\"this is my chat} message 2\"}","sender":123,"recipient":321}
The output I can receive is a mix between JSON objects (possibly containing other JSON objects) and some numerical data.
I need to extract the JSON objects from that string.
I have thought about counting curly braces to pick what is between the first opening one and the corresponding closing one. However, the messages can possibly contain a curly brace.
I have thought about regular expressions, but I can't write one that works (I am not good at regexes).
Any idea about how to proceed?
This should work:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile(
"\\{ # Match an opening brace. \n" +
"(?: # Match either... \n" +
" \" # a quoted string, \n" +
" (?: # which may contain either... \n" +
" \\\\. # escaped characters \n" +
" | # or \n" +
" [^\"\\\\] # any other characters except quotes and backslashes \n" +
" )* # any number of times, \n" +
" \" # and ends with a quote. \n" +
"| # Or match... \n" +
" [^\"{}]* # any number of characters besides quotes and braces. \n" +
")* # Repeat as needed. \n" +
"\\} # Then match a closing brace.",
Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
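For reference, here is a minimal, self-contained sketch that runs the same pattern (written on one line) against a shortened version of the sample input. The names JsonObjectExtractor and extract are made up for this illustration. It assumes, as in the question, that nested JSON only ever appears inside a quoted string value; a genuinely nested object such as {"a":{"b":1}} would not be matched as a whole, because the pattern allows no unquoted brace between the outer braces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsonObjectExtractor {
    // One-line form of the commented pattern: an opening brace, then any mix of
    // quoted strings (with backslash escapes) and non-quote, non-brace
    // characters, then a closing brace.
    private static final Pattern JSON_OBJECT = Pattern.compile(
        "\\{(?:\"(?:\\\\.|[^\"\\\\])*\"|[^\"{}]*)*\\}");

    // Returns every top-level {...} chunk found in the input, in order.
    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = JSON_OBJECT.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened sample: numbers between objects, braces inside string values.
        String input = "13{\"ts\":1,\"msg\":\"{\\\"a\\\":\\\"x {y\\\"}\",\"sender\":123}"
                     + "45{\"ts\":2,\"msg\":\"z} w\"}1";
        for (String json : extract(input)) {
            System.out.println(json);
        }
    }
}
```

Braces inside the quoted values are skipped because a quoted string is always consumed as one unit before the closing brace is looked for.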
I'm just learning how to use regex's:
I'm reading in a text file that is split into sections of two different sorts, demarcated by
<:==]:> and <:==}:> . I need to know for each section whether it's a ] or } , so I can't just do
Pattern.compile("<:==]:>|<:==}:>").split(text);
Doing this:
Pattern.compile("<:==").split(text);
works, and then I can just look at the first char in each substring, but this seems sloppy to me, and I think I'm only resorting to it because I'm not fully grasping something I need to grasp about regex's:
What would be the best practice here? Also, is there any way to split a string up while leaving the delimiter in the resulting strings- such that each begins with the delimiter?
EDIT: the file is laid out like this:
Old McDonald had a farm
<:==}:>
EIEIO. And on that farm he had a cow
<:==]:>
And on that farm he....
It may be a better idea not to use split() for this. You could instead do a match:
List<String> delimList = new ArrayList<String>();
List<String> sectionList = new ArrayList<String>();
Pattern regex = Pattern.compile(
"(<:==[\\]}]:>) # Match a delimiter, capture it in group 1.\n" +
"( # Match and capture in group 2:\n" +
" (?: # the following group which matches...\n" +
" (?!<:==[\\]}]:>) # (unless we're at the start of another delimiter)\n" +
" . # any character\n" +
" )* # any number of times.\n" +
") # End of group 2",
Pattern.COMMENTS | Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
delimList.add(regexMatcher.group(1));
sectionList.add(regexMatcher.group(2));
}
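As for the second part of the question (splitting while keeping each delimiter at the start of its section), one possible approach is to split on a zero-width lookahead, so that the delimiter itself is not consumed. This is only a sketch; the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

class DelimiterSplitDemo {
    // Split *before* every <:==]:> or <:==}:> without consuming it, so each
    // piece after the first starts with its own delimiter.
    static List<String> splitKeepingDelimiters(String text) {
        return Arrays.asList(text.split("(?=<:==[\\]}]:>)"));
    }

    public static void main(String[] args) {
        String text = "Old McDonald had a farm\n<:==}:>\nEIEIO. And on that farm he had a cow\n"
                    + "<:==]:>\nAnd on that farm he....";
        for (String section : splitKeepingDelimiters(text)) {
            // The bracket type, if any, is the fifth character of the section.
            boolean hasDelimiter = section.startsWith("<:==");
            System.out.println((hasDelimiter ? section.charAt(4) : '-') + " | " + section.trim());
        }
    }
}
```

Any text before the first delimiter comes back as its own element, so the caller has to check whether a section actually starts with a delimiter.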
My code doesn't seem to do what I intend. The problems are:
It produces one long string instead of a string that is properly broken up and separated.
It does not handle the 'separator' appropriately (it produces , instead of ",").
The same goes for the 'optional' value (it produces ' instead of " '").
Current result:
LOAD DATA INFILE 'max.csv'BADFILE 'max.bad'DISCARDFILE
'max.dis' APPEND INTO TABLEADDRESSfields terminated by,optionally enclosed
by'(ID,Name,sex)
The intended result should look like this
Is there a better way of doing this, or of improving the code above?
Yeah. Use the character \n to start a new line in the file, and escape " characters as \". Also, you'll want to add a space after each variable.
content = " LOAD DATA\nINFILE "+ fileName + " BADFILE "+ badName + " DISCARDFILE " +
discardName + "\n\nAPPEND\nINTO TABLE "+ table + "\n fields terminated by \"" + separator
+ "\" optionally enclosed by '" + optional + "'\n (" + column + ")";
This is assuming fileName, badName, and discardName include the quotes around the names.
Don't reinvent the wheel... the Apache Commons IO library does all that in one line:
FileUtils.write(new File(controlName), content);
Here's the javadoc for FileUtils.write(File, CharSequence):
Writes a CharSequence to a file creating the file if it does not exist
To insert a new line, use \n (or \r\n on Windows).
For example:
discardName + "\n" + // new line here
"APPEND INTO TABLE"
For the double quote symbol on the other hand you need to specifically type \" around the comma:
"fields terminated by \"" + separator +"\""
which will produce this ","
and that is similar to what the optional variable needs to be
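Putting the two points together, a complete version might look like the sketch below. The question does not show the intended layout, so the line breaks chosen here are an assumption, as are the class and method names. Like the earlier answer, it assumes that fileName, badName and discardName already contain their own single quotes.

```java
class ControlFileDemo {
    // Builds the control file text. "\n" starts a new line and "\"" is an
    // escaped double quote inside a Java string literal.
    static String buildControlFile(String fileName, String badName, String discardName,
                                   String table, String separator, String optional,
                                   String column) {
        return "LOAD DATA\n"
             + "INFILE " + fileName + " BADFILE " + badName
             + " DISCARDFILE " + discardName + "\n"
             + "APPEND INTO TABLE " + table + "\n"
             + "fields terminated by \"" + separator + "\""
             + " optionally enclosed by \"" + optional + "\"\n"
             + "(" + column + ")\n";
    }

    public static void main(String[] args) {
        // Values taken from the "current result" shown in the question.
        System.out.print(buildControlFile("'max.csv'", "'max.bad'", "'max.dis'",
                                          "ADDRESS", ",", "'", "ID,Name,sex"));
    }
}
```

Note the explicit space or newline at the end of every literal that is followed by a variable; the missing ones are what glued TABLE and ADDRESS together in the original output.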
I asked this question some time ago (Regular expression that does not contain quote but can contain escaped quote) and got an answer, but somehow I am not able to make it work in Java.
Basically, I need to write a regular expression that matches a valid string beginning and ending with quotes, which can have quotes in between provided they are escaped.
In the code below, I essentially want to match all three strings and print true, but cannot.
What should be the correct regex?
Thanks
public static void main(String[] args) {
String[] arr = new String[]
{
"\"tuco\"",
"\"tuco \" ABC\"",
"\"tuco \" ABC \" DEF\""
};
Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
for (String str : arr) {
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());
}
}
The problem is not so much your regex, but rather your test strings. The single backslash before each internal quote in your second and third example strings is consumed when the string literal is parsed. The string being passed to the regex engine has no backslash before the quote. (Try printing it out.) Here is a tested version of your function which works as expected:
import java.util.regex.*;
public class TEST
{
public static void main(String[] args) {
String[] arr = new String[]
{
"\"tuco\"",
"\"tuco \\\" ABC\"",
"\"tuco \\\" ABC \\\" DEF\""
};
//old: Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
Pattern pattern = Pattern.compile(
"# Match double quoted substring allowing escaped chars. \n" +
"\" # Match opening quote. \n" +
"( # $1: Quoted substring contents. \n" +
" [^\"\\\\]* # {normal} Zero or more non-quote, non-\\. \n" +
" (?: # Begin {(special normal*)*} construct. \n" +
" \\\\. # {special} Escaped anything. \n" +
" [^\"\\\\]* # more {normal} non-quote, non-\\. \n" +
" )* # End {(special normal*)*} construct. \n" +
") # End $1: Quoted substring contents. \n" +
"\" # Match closing quote. ",
Pattern.DOTALL | Pattern.COMMENTS);
for (String str : arr) {
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());
}
}
}
I've replaced your regex with an improved version (taken from MRE3). Note that this question gets asked a lot. Please see this answer where I compare several functionally equivalent expressions.
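To see the two levels of escaping in isolation, here is a small sketch that runs the pattern from the question against both spellings of the second test string. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class EscapeLevelsDemo {
    // The pattern from the question, unchanged.
    static final Pattern QUOTED = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");

    static boolean isQuoted(String s) {
        return QUOTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // One level of escaping: the compiler consumes the backslash,
        // so the runtime value is  "tuco " ABC"  (no backslash at all).
        String oneLevel = "\"tuco \" ABC\"";
        // Two levels: \\ yields a real backslash and \" a real quote,
        // so the runtime value is  "tuco \" ABC"
        String twoLevels = "\"tuco \\\" ABC\"";

        System.out.println(oneLevel + " -> " + isQuoted(oneLevel));
        System.out.println(twoLevels + " -> " + isQuoted(twoLevels));
    }
}
```

Printing the strings makes the difference visible: only the second one actually contains a backslash for the regex to see.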
I am trying to harvest all inclusion directives from a PHP file using a regular expression (in Java).
The expression should pick up only those which have file names expressed as unconcatenated string literals. Ones with constants or variables are not necessary.
Detection should work for both single and double quotes, include-s and require-s, plus the additional trickery with _once and last but not least, both keyword- and function-style invocations.
A rough input sample:
<?php
require('a.php');
require 'b.php';
require("c.php");
require "d.php";
include('e.php');
include 'f.php';
include("g.php");
include "h.php";
require_once('i.php');
require_once 'j.php';
require_once("k.php");
require_once "l.php";
include_once('m.php');
include_once 'n.php';
include_once("o.php");
include_once "p.php";
?>
And output:
["a.php","b.php","c.php","d.php","e.php","f.php","g.php","h.php","i.php","j.php","k.php","l.php","m.php","n.php","o.php","p.php"]
Any ideas?
Use token_get_all. It's safe and won't give you headaches.
There is also PEAR's PHP_Parser if you require userland code.
To do this accurately, you really need to fully parse the PHP source code. This is because the text sequence: require('a.php'); can appear in places where it is not really an include at all - such as in comments, strings and HTML markup. For example, the following are NOT real PHP includes, but will be matched by the regex:
<?php // Examples where a regex solution gets false positives:
/* PHP multi-line comment with: require('a.php'); */
// PHP single-line comment with: require('a.php');
$str = "double quoted string with: require('a.php');";
$str = 'single quoted string with: require("a.php");';
?>
<p>HTML paragraph with: require('a.php');</p>
That said, if you are happy with getting a few false positives, the following single regex solution will do a pretty good job of scraping all the filenames from all the PHP include variations:
// Get all filenames from PHP include variations and return in array.
function getIncludes($text) {
$count = preg_match_all('/
# Match PHP include variations with single string literal filename.
\b # Anchor to word boundary.
(?: # Group for include variation alternatives.
include # Either "include"
| require # or "require"
) # End group of include variation alternatives.
(?:_once)? # Either one may be the "once" variation.
\s* # Optional whitespace.
( # $1: Optional opening parentheses.
\( # Literal open parentheses,
\s* # followed by optional whitespace.
)? # End $1: Optional opening parentheses.
(?| # "Branch reset" group of filename alts.
\'([^\']+)\' # Either $2{1}: Single quoted filename,
| "([^"]+)" # or $2{2}: Double quoted filename.
) # End branch reset group of filename alts.
(?(1) # If there were opening parentheses,
\s* # then allow optional whitespace
\) # followed by the closing parentheses.
) # End group $1 if conditional.
\s* # End statement with optional whitespace
; # followed by semi-colon.
/ix', $text, $matches);
if ($count > 0) {
$filenames = $matches[2];
} else {
$filenames = array();
}
return $filenames;
}
Addendum (2011-07-24): It turns out the OP wants a solution in Java, not PHP. Here is a tested Java program which is nearly identical. Note that I am not a Java expert and don't know how to dynamically size an array. Thus, the solution below (crudely) uses a fixed-size array (100 elements) to hold the filenames.
import java.util.regex.*;
public class TEST {
// Set maximum size of array of filenames.
public static final int MAX_NAMES = 100;
// Get all filenames from PHP include variations and return in array.
public static String[] getIncludes(String text)
{
int count = 0; // Count of filenames.
String filenames[] = new String[MAX_NAMES];
String filename;
Pattern p = Pattern.compile(
"# Match include variations with single string filename. \n" +
"\\b # Anchor to word boundary. \n" +
"(?: # Group include variation alternatives. \n" +
" include # Either 'include', \n" +
"| require # or 'require'. \n" +
") # End group of include variation alts. \n" +
"(?:_once)? # Either one may have '_once' suffix. \n" +
"\\s* # Optional whitespace. \n" +
"(?: # Group for optional opening paren. \n" +
" \\( # Literal open parentheses, \n" +
" \\s* # followed by optional whitespace. \n" +
")? # Opening parentheses are optional. \n" +
"(?: # Group for filename alternatives. \n" +
" '([^']+)' # $1: Either a single quoted filename, \n" +
"| \"([^\"]+)\" # or $2: a double quoted filename. \n" +
") # End group of filename alternatives. \n" +
"(?: # Group for optional closing paren. \n" +
" \\s* # Optional whitespace, \n" +
" \\) # followed by the closing parentheses. \n" +
")? # Closing parenthesis is optional. \n" +
"\\s* # End statement with optional ws, \n" +
"; # followed by a semi-colon. ",
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);
Matcher m = p.matcher(text);
while (m.find() && count < MAX_NAMES) {
// The filename is in either $1 or $2
if (m.group(1) != null) filename = m.group(1);
else filename = m.group(2);
// Add this filename to array of filenames.
filenames[count++] = filename;
}
return filenames;
}
public static void main(String[] args)
{
// Test string full of various PHP include statements.
String text = "<?php\n"+
"\n"+
"require('a.php');\n"+
"require 'b.php';\n"+
"require(\"c.php\");\n"+
"require \"d.php\";\n"+
"\n"+
"include('e.php');\n"+
"include 'f.php';\n"+
"include(\"g.php\");\n"+
"include \"h.php\";\n"+
"\n"+
"require_once('i.php');\n"+
"require_once 'j.php';\n"+
"require_once(\"k.php\");\n"+
"require_once \"l.php\";\n"+
"\n"+
"include_once('m.php');\n"+
"include_once 'n.php';\n"+
"include_once(\"o.php\");\n"+
"include_once \"p.php\";\n"+
"\n"+
"?>\n";
String filenames[] = getIncludes(text);
for (int i = 0; i < MAX_NAMES && filenames[i] != null; i++) {
System.out.print(filenames[i] +"\n");
}
}
}
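On the fixed-size array: a growable java.util.ArrayList avoids the 100-entry limit. The sketch below is a condensed variant of the program above, not a replacement for it; the class name IncludeScanner is made up, and the pattern is the same one collapsed onto a few lines. As in the program above, the two quote styles use separate capture groups and the parentheses are independently optional, since java.util.regex has neither branch-reset groups nor conditionals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncludeScanner {
    // include/require, optional _once, optional "(", a single- or
    // double-quoted filename, optional ")", then ";".
    private static final Pattern INCLUDE = Pattern.compile(
        "\\b(?:include|require)(?:_once)?\\s*(?:\\(\\s*)?"
        + "(?:'([^']+)'|\"([^\"]+)\")"
        + "(?:\\s*\\))?\\s*;",
        Pattern.CASE_INSENSITIVE);

    // Returns all literal filenames, in order of appearance.
    static List<String> getIncludes(String text) {
        List<String> filenames = new ArrayList<>();
        Matcher m = INCLUDE.matcher(text);
        while (m.find()) {
            // The filename is in group 1 (single quotes) or group 2 (double quotes).
            filenames.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return filenames;
    }

    public static void main(String[] args) {
        String text = "<?php\nrequire('a.php');\ninclude \"b.php\";\n"
                    + "require_once 'c.php';\ninclude_once(\"d.php\");\n?>\n";
        System.out.println(getIncludes(text));
    }
}
```

The caveat about false positives in comments, strings and HTML applies to this version just as much.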
/(?:require|include)(?:_once)?[( ]['"](.*)\.php['"]\)?;/
Should work for all cases you've specified, and captures only the filename without the extension
Test script:
<?php
$text = <<<EOT
require('a.php');
require 'b.php';
require("c.php");
require "d.php";
include('e.php');
include 'f.php';
include("g.php");
include "h.php";
require_once('i.php');
require_once 'j.php';
require_once("k.php");
require_once "l.php";
include_once('m.php');
include_once 'n.php';
include_once("o.php");
include_once "p.php";
EOT;
$re = '/(?:require|include)(?:_once)?[( ][\'"](.*)\.php[\'"]\)?;/';
$result = array();
preg_match_all($re, $text, $result);
var_dump($result);
To get the filenames like you wanted, read $result[1]
I should probably point out that I too am partial to cweiske's answer, and that unless you really just want an exercise in regular expressions (or want to do this using, say, grep), you should use the tokenizer.
The following should work pretty well (the m modifier is needed when the subject contains more than one line):
/^(require|include)(_once)?(\(|\s+)("|')(.*?)("|')(\)|\s*);$/m
You'll want the fifth captured group.
This works for me:
preg_match_all('/\b(require|include|require_once|include_once)\b(\(| )(\'|")(.+)\.php(\'|")\)?;/i', $subject, $result, PREG_PATTERN_ORDER);
$result = $result[4];
I have a Regex, which is [\\.|\\;|\\?|\\!][\\s]
This is used to split a string. But I don't want it to split . ; ? ! if it is in quotes.
I'd not use split but Pattern & Matcher instead.
A demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String text = "start. \"in quotes!\"; foo? \"more \\\" words\"; bar";
String simpleToken = "[^.;?!\\s\"]+";
String quotedToken =
"(?x) # enable inline comments and ignore white spaces in the regex \n" +
"\" # match a double quote \n" +
"( # open group 1 \n" +
" \\\\. # match a backslash followed by any char (other than line breaks) \n" +
" | # OR \n" +
" [^\\\\\r\n\"] # any character other than a backslash, line breaks or double quote \n" +
") # close group 1 \n" +
"* # repeat group 1 zero or more times \n" +
"\" # match a double quote \n";
String regex = quotedToken + "|" + simpleToken;
Matcher m = Pattern.compile(regex).matcher(text);
while(m.find()) {
System.out.println("> " + m.group());
}
}
}
which produces:
> start
> "in quotes!"
> foo
> "more \" words"
> bar
As you can see, it can also handle escaped quotes inside quoted tokens.
Here is what I do in order to ignore quotes in matches.
(?:[^\"\']|(?:\".*?\")|(?:\'.*?\'))*? # <-- append the query you wanted to search for - don't use something greedy like .* in the rest of your regex.
To adapt this for your regex, you could do
(?:[^\"\']|(?:\".*?\")|(?:\'.*?\'))*?[.;?!]\s*
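In Java, the adapted expression could be used with Matcher.find() roughly as follows. This is only a sketch: the class and method names are invented, it assumes quotes in the input are balanced, and an apostrophe in ordinary text would be treated as the start of a single-quoted section. Unlike split(), each returned piece keeps its terminating punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareSplitDemo {
    // Lazily consume plain characters and whole quoted sections until the
    // first . ; ? or ! that sits outside quotes, plus trailing whitespace.
    private static final Pattern SENTENCE = Pattern.compile(
        "(?:[^\"']|\".*?\"|'.*?')*?[.;?!]\\s*");

    static List<String> splitSentences(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        int end = 0;
        while (m.find()) {
            parts.add(m.group().trim());
            end = m.end();
        }
        // Whatever follows the last terminator is the final piece.
        if (end < text.length()) {
            parts.add(text.substring(end).trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "He said \"Wait. Stop!\" loudly. Then left; done";
        for (String part : splitSentences(text)) {
            System.out.println("> " + part);
        }
    }
}
```

The punctuation inside "Wait. Stop!" is ignored because the quoted section is consumed as a single alternative before the terminator is tried again.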