I'm quite new to Java and I'm facing a problem I can't solve. I have some HTML code and I'm trying to run a regular expression and store all matches in an array. Here's my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
public class RegexMatch{
boolean foundMatch = false;
public String[] arrayResults;
public String[] TestRegularExpression(String sourceCode, String pattern){
try{
Pattern regex = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(sourceCode);
while (regexMatcher.find()) {
arrayResults[matches] = regexMatcher.group();
matches ++;
}
} catch (PatternSyntaxException ex) {
// Exception occurred
}
return arrayResults;
}
}
I'm passing a string containing HTML code and the regular expression pattern to extract all meta tags and store them in the array. Here's how I call the method:
RegexMatch regex = new RegexMatch();
regex.TestRegularExpression(sourceCode, "<meta.*?>");
String[] META_TAGS = regex.arrayResults;
Any hint?
Thanks!
Firstly, parsing HTML with regular expressions is a bad idea. There are alternatives which will convert the HTML into a DOM etc - you should look into those.
Assuming you still want the "match multiple results" idea though, it seems to me that a List<E> of some form would be more useful, so you don't need to know the size up-front. You can also build that in the method itself, rather than having state. For example:
import java.util.*;
import java.util.regex.*;
public class Test
{
public static void main(String[] args)
throws PatternSyntaxException
{
// Want to get x10 and x5 from this
String text = "x10 y5 x5 xyz";
String pattern = "x\\d+";
List<String> matches = getAllMatches(text, pattern);
for (String match : matches) {
System.out.println(match);
}
}
public static List<String> getAllMatches(String text, String pattern)
throws PatternSyntaxException
{
Pattern regex = Pattern.compile(pattern);
List<String> results = new ArrayList<String>();
Matcher regexMatcher = regex.matcher(text);
while (regexMatcher.find()) {
results.add(regexMatcher.group());
}
return results;
}
}
It's possible that there's something similar to this within the Matcher class itself, but I can't immediately see it...
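Since Java 9, Matcher does offer something along these lines: Matcher.results() returns a stream of MatchResult objects. A minimal sketch, assuming Java 9 or later; the class name AllMatches is only illustrative:

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AllMatches {
    // Collects every match of the pattern in the text, in order of appearance.
    static List<String> getAllMatches(String text, String pattern) {
        return Pattern.compile(pattern)
                .matcher(text)
                .results()                 // Stream<MatchResult>, available since Java 9
                .map(MatchResult::group)   // the full matched substring
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Same sample input as the example above
        System.out.println(getAllMatches("x10 y5 x5 xyz", "x\\d+"));
    }
}
```

On older Java versions the explicit find() loop shown above remains the way to go.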
With Jsoup, you could do something as simple as...
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class GetMeta {
private static final String META_QUERY = "meta";
public static List<String> parseForMeta(String htmlText) {
Document jsDocument = Jsoup.parse(htmlText);
Elements metaElements = jsDocument.select(META_QUERY);
List<String> metaList = new ArrayList<String>();
for (Element element : metaElements) {
metaList.add(element.toString());
}
return metaList;
}
}
For example:
import java.io.IOException;
import java.net.*;
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class GetMeta {
private static final String META_QUERY = "meta";
private static final String MAIN_URL = "http://www.yahoo.com";
public static void main(String[] args) {
try {
Scanner scan = new Scanner(new URL(MAIN_URL).openStream());
StringBuilder sb = new StringBuilder();
while (scan.hasNextLine()) {
sb.append(scan.nextLine() + "\n");
}
List<String> metaList = parseForMeta(sb.toString());
for (String metaStr : metaList) {
System.out.println(metaStr);
}
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
public static List<String> parseForMeta(String htmlText) {
Document jsDocument = Jsoup.parse(htmlText);
Elements metaElements = jsDocument.select(META_QUERY);
List<String> metaList = new ArrayList<String>();
for (Element element : metaElements) {
metaList.add(element.toString());
}
return metaList;
}
}
I'm facing an issue where some search keywords are not highlighted in Chinese documents. For confidentiality reasons I'm not providing the actual PDF. The search keywords are 1) 亿元或 and 2) 收入亿来源. Please find the PDF document I tested (pdfpath link) and the actual result (ActualResult link). I have already posted about this issue (see the following link), but some of the keywords are still not highlighted properly in a few Chinese documents. Please suggest how to highlight the search keywords I mentioned.
import java.awt.Color;
import java.awt.Desktop;
import java.awt.geom.Rectangle2D;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.BufferedInputStream;
import java.io.File;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.ITextString;
import org.pdfclown.documents.contents.TextChar;
import org.pdfclown.documents.contents.colorSpaces.DeviceRGBColor;
import org.pdfclown.documents.interaction.annotations.TextMarkup;
import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;
import org.pdfclown.files.SerializationModeEnum;
import org.pdfclown.util.math.Interval;
import org.pdfclown.util.math.geom.Quad;
import org.pdfclown.tools.TextExtractor;
public class pdfclown2 {
private static int count;
public static void main(String[] args) throws IOException {
highlight("ebook.pdf","C:\\Users\\Downloads\\6.pdf");
System.out.println("OK");
}
private static void highlight(String inputPath, String outputPath) throws IOException {
URL url = new URL(inputPath);
InputStream in = new BufferedInputStream(url.openStream());
org.pdfclown.files.File file = null;
try {
file = new org.pdfclown.files.File("C:\\Users\\Desktop\\pdf\\test123.pdf");
Map<String, String> m = new HashMap<String, String>();
m.put("亿元或","hi");
m.put("收入亿来","hi");
System.out.println("map size"+m.size());
long startTime = System.currentTimeMillis();
// 2. Iterating through the document pages...
TextExtractor textExtractor = new TextExtractor(true, true);
for (final Page page : file.getDocument().getPages()) {
Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);
for (Map.Entry<String, String> entry : m.entrySet()) {
Pattern pattern;
String serachKey = entry.getKey();
final String translationKeyword = entry.getValue();
/*
if ((serachKey.contains(")") && serachKey.contains("("))
|| (serachKey.contains("(") && !serachKey.contains(")"))
|| (serachKey.contains(")") && !serachKey.contains("(")) || serachKey.contains("?")
|| serachKey.contains("*") || serachKey.contains("+")) {
pattern = Pattern.compile(Pattern.quote(serachKey), Pattern.CASE_INSENSITIVE);
}
else*/
pattern = Pattern.compile(serachKey, Pattern.CASE_INSENSITIVE);
// 2.1. Extract the page text!
//System.out.println(textStrings.toString().indexOf(entry.getKey()));
// 2.2. Find the text pattern matches!
final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings));
// 2.3. Highlight the text pattern matches!
textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {
public boolean hasNext() {
// System.out.println(matcher.find());
// if(key.getMatchCriteria() == 1){
if (matcher.find()) {
return true;
}
/*
* } else if(key.getMatchCriteria() == 2) { if
* (matcher.hitEnd()) { count++; return true; } }
*/
return false;
}
public Interval<Integer> next() {
return new Interval<Integer>(matcher.start(), matcher.end());
}
public void process(Interval<Integer> interval, ITextString match) {
// Defining the highlight box of the text pattern
// match...
System.out.println(match);
/* List<Quad> highlightQuads = new ArrayList<Quad>();
{
Rectangle2D textBox = null;
for (TextChar textChar : match.getTextChars()) {
Rectangle2D textCharBox = textChar.getBox();
if (textBox == null) {
textBox = (Rectangle2D) textCharBox.clone();
} else {
if (textCharBox.getY() > textBox.getMaxY()) {
highlightQuads.add(Quad.get(textBox));
textBox = (Rectangle2D) textCharBox.clone();
} else {
textBox.add(textCharBox);
}
}
}
textBox.setRect(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight());
highlightQuads.add(Quad.get(textBox));
}*/
List<Quad> highlightQuads = new ArrayList<Quad>();
List<TextChar> textChars = match.getTextChars();
Rectangle2D firstRect = textChars.get(0).getBox();
Rectangle2D lastRect = textChars.get(textChars.size()-1).getBox();
Rectangle2D rect = firstRect.createUnion(lastRect);
highlightQuads.add(Quad.get(rect).get(rect));
// subtype can be Highlight, Underline, StrikeOut, Squiggly
new TextMarkup(page, highlightQuads, translationKeyword, MarkupTypeEnum.Highlight);
}
public void remove() {
throw new UnsupportedOperationException();
}
});
}
}
SerializationModeEnum serializationMode = SerializationModeEnum.Standard;
file.save(new java.io.File(outputPath), serializationMode);
System.out.println("file created");
long endTime = System.currentTimeMillis();
System.out.println("seconds take for execution is:"+(endTime-startTime)/1000);
} catch (Exception e) {
e.printStackTrace();
}
finally{
in.close();
}
}
}
Indeed, when searching for "亿元或" the result highlight is somewhat wrong:
The cause is a PDF Clown bug. When it parses a composite font (aka Type 0 font), it expects the DW (default width) entry in the Type 0 font base dictionary while it is specified to be in the CIDFont subdictionary!
In case of the document at hand the widths of most characters, in particular of the Chinese characters, are not given explicitly and, therefore, default to that DW value. As this value cannot be determined properly due to the bug mentioned above, an average over the explicitly given widths is used, and this average happens to be merely ¾ of the correct value. Thus, the highlighted area is too short.
You can fix this bug in the CompositeFont class (package org.pdfclown.documents.contents.fonts) at the end of the method onLoad. Simply replace
PdfInteger defaultWidthObject = (PdfInteger)getBaseDataObject().get(PdfName.DW);
by
PdfInteger defaultWidthObject = (PdfInteger)getCIDFontDictionary().get(PdfName.DW);
The highlighting now results in
I've been looking for an easy way to add IDs to HTML tags and spent a few hours jumping from one tool to another before I came up with this little test, which solved my issue. Since my sprint backlog is almost empty, I have some time to share it. Feel free to clean it up; I hope it helps those who are asked by QA to add IDs. Just change the tag and path, and run it :)
I had some trouble writing a proper lambda here due to a lack of coffee today...
How can I replace only the first occurrence in a single lambda? My files had many lines with the same tags.
private void replace(String path, String replace, String replaceWith) {
try (Stream<String> lines = Files.lines(Paths.get(path))) {
List<String> replaced = lines
.map(line -> line.replace(replace, replaceWith))
.collect(Collectors.toList());
Files.write(Paths.get(path), replaced);
} catch (IOException e) {
e.printStackTrace();
}
}
The above replaced every matching line, because it also found the text to replace in the following lines. A proper matcher-based replace with auto-increment, used inside the method body instead of preparing the replaceWith value before the call, would be better. If I ever need this again, I'll add another final version.
Final version, to avoid wasting more time (green phase):
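One way to replace only the first occurrence without tracking a flag is to treat the file as a single string and use String.replaceFirst, quoting both arguments so they are taken literally. A sketch under the assumption that the whole file fits in memory and is UTF-8; FirstOccurrence, replaceFirstLiteral and replaceInFile are made-up names:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstOccurrence {
    // Replaces only the first literal occurrence of 'replace' in 'content'.
    static String replaceFirstLiteral(String content, String replace, String replaceWith) {
        // Pattern.quote and Matcher.quoteReplacement disable regex and '$' handling
        return content.replaceFirst(Pattern.quote(replace), Matcher.quoteReplacement(replaceWith));
    }

    // Applies the replacement to a file in place (assumes UTF-8 content).
    static void replaceInFile(Path path, String replace, String replaceWith) throws IOException {
        String content = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        String updated = replaceFirstLiteral(content, replace, replaceWith);
        Files.write(path, updated.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String html = "<label for=\"a\"/>\n<label for=\"a\"/>\n";
        System.out.println(replaceFirstLiteral(html, "<label", "<label id=\"label_101\""));
    }
}
```

Only the first of the two identical tags receives the id; the second line is left untouched.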
import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.runners.MockitoJUnitRunner;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
@RunWith(MockitoJUnitRunner.class)
public class RepalceInFilesWithAutoIncrement {
private int incremented = 100;
/**
* The tag you would like to add Id to
* */
private static final String tag = "label";
/**
* Regex to find the tag
* */
private static final Pattern TAG_REGEX = Pattern.compile("<" + tag + " (.+?)/>", Pattern.DOTALL);
private static final Pattern ID_REGEX = Pattern.compile("id=", Pattern.DOTALL);
@Test
public void replaceInFiles() throws IOException {
String nextId = " id=\"" + tag + "_%s\" ";
String path = "C:\\YourPath";
try (Stream<Path> paths = Files.walk(Paths.get(path))) {
paths.forEach(filePath -> {
if (Files.isRegularFile(filePath)) {
try {
List<String> foundInFiles = getTagValues(readFile(filePath.toAbsolutePath().toString()));
if (!foundInFiles.isEmpty()) {
for (String tagEl : foundInFiles) {
incremented++;
String id = String.format(nextId, incremented);
String replace = tagEl.split("\\r?\\n")[0];
replace = replace.replace("<" + tag, "<" + tag + id);
replace(filePath.toAbsolutePath().toString(), tagEl.split("\\r?\\n")[0], replace, false);
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
});
}
System.out.println(String.format("Finished with (%s) changes", incremented - 100));
}
private String readFile(String path)
throws IOException {
byte[] encoded = Files.readAllBytes(Paths.get(path));
return new String(encoded, StandardCharsets.UTF_8);
}
private List<String> getTagValues(final String str) {
final List<String> tagValues = new ArrayList<>();
final Matcher matcher = TAG_REGEX.matcher(str);
while (matcher.find()) {
if (!ID_REGEX.matcher(matcher.group()).find())
tagValues.add(matcher.group());
}
return tagValues;
}
private void replace(String path, String replace, String replaceWith, boolean log) {
if (log) {
System.out.println("path = [" + path + "], replace = [" + replace + "], replaceWith = [" + replaceWith + "], log = [" + log + "]");
}
try (Stream<String> lines = Files.lines(Paths.get(path))) {
List<String> replaced = new ArrayList<>();
boolean alreadyReplaced = false;
for (String line : lines.collect(Collectors.toList())) {
if (line.contains(replace) && !alreadyReplaced) {
line = line.replace(replace, replaceWith);
alreadyReplaced = true;
}
replaced.add(line);
}
Files.write(Paths.get(path), replaced);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Try it with Jsoup.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupTest {
public static void main(String argv[]) {
String html = "<html><head><title>Try it with Jsoup</title></head>"
+ "<body><p>P first</p><p>P second</p><p>P third</p></body></html>";
Document doc = Jsoup.parse(html);
Elements ps = doc.select("p"); // The tag you would like to add Id to
int i = 12;
for(Element p : ps){
p.attr("id",String.valueOf(i));
i++;
}
System.out.println(doc.toString());
}
}
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
public class TagText {
public static void main(String[] args) throws IOException, ClassNotFoundException {
// Initializing the tagger
MaxentTagger tagger = new MaxentTagger("taggers/english-left3words-distsim.tagger");
List<String> lines = new ArrayList<>();
lines = new ReadCSV().readColumn("Tt2.csv", 4);
for (String line : lines) {
String tagged = tagger.tagString(line);
System.out.println(tagged);
}
}
}
I'm trying to parse a CSV file that contains a character (binary 10010111, —) which I want the text parser to ignore. How would I do that?
So I guess you want to remove all special characters?
I guess it was something like: replaceAll("[^\\w\\s]", "");
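In Java source, a pattern such as [^\w\s] has to be written with doubled backslashes. A small sketch of the idea; note that it strips all punctuation and, because \w is ASCII-only by default, any non-ASCII letters as well, which may or may not suit the tagger. CleanText and stripSpecial are illustrative names:

```java
class CleanText {
    // Removes every character that is neither a word character nor whitespace.
    static String stripSpecial(String line) {
        return line.replaceAll("[^\\w\\s]", "");
    }

    public static void main(String[] args) {
        // U+2014 is an em dash; U+FFFD is the Unicode replacement character
        System.out.println(stripSpecial("price \u2014 10\uFFFD units"));
    }
}
```

If only one known character should go, a plain replace of that character is the less invasive option.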
Edit: Full Code
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
public class TagText {
public static void main(String[] args) throws IOException, ClassNotFoundException {
// Initializing the tagger
MaxentTagger tagger = new MaxentTagger("taggers/english-left3words-distsim.tagger");
List<String> lines = new ArrayList<>();
lines = new ReadCSV().readColumn("Tt2.csv", 4);
for (String line : lines) {
String tagged = tagger.tagString(line.replace("\uFFFD",""));
System.out.println(tagged);
}
}
}
I'm using a regex to search for a very specific pattern against a directory that's only about 106 MB in size. It takes about 10 seconds to complete.
Is there anything that I can do to improve the performance?
package com.JFileReader;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FileData {
public static void main(String[] args) {
File dir = new File("/Users/me/Desktop/");
if(dir.isFile()) { handleFile(dir); }
if(dir.isDirectory()) { handleDir(dir); }
}
public static void handleFile(File aFile) {
String regex = "[a-zA-Z]+[.][a-zA-Z]+[#][a-zA-Z]+[.][a-zA-Z]+";
Pattern pattern = Pattern.compile(regex);
try {
BufferedReader br = new BufferedReader(new FileReader(aFile));
Matcher m;
String line;
while ((line = br.readLine()) != null) {
m = pattern.matcher(line);
if (m.find()) {
System.out.println("Found: " + aFile);
}
}
br.close();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
public static void handleDir(File dir) {
for (File file : dir.listFiles()) {
if(file.isFile()) { handleFile(file); }
if(file.isDirectory()) { handleDir(file); }
}
}
}
You can use possessive quantifiers:
String regex = "[a-zA-Z]++\\.[a-zA-Z]++#[a-zA-Z]++\\.[a-zA-Z]++";
When you use possessive quantifiers, the regex engine doesn't record backtrack positions and never goes back to try other possibilities when the match fails.
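To illustrate the difference: a possessive quantifier keeps everything it matched, so a pattern that relies on backtracking stops matching. A sketch with a deliberately contrived pattern next to the one from this answer; PossessiveDemo is a hypothetical name:

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy: [a-z]+ can give back its last letter so the trailing 'z' can match.
    static final Pattern GREEDY = Pattern.compile("[a-z]+z");
    // Possessive: [a-z]++ consumes all letters and never gives one back.
    static final Pattern POSSESSIVE = Pattern.compile("[a-z]++z");
    // Pattern from the answer: every letter run is followed by a non-letter,
    // so no part of it depends on backtracking.
    static final Pattern ANSWER =
            Pattern.compile("[a-zA-Z]++\\.[a-zA-Z]++#[a-zA-Z]++\\.[a-zA-Z]++");

    public static void main(String[] args) {
        System.out.println(GREEDY.matcher("jazz").matches());
        System.out.println(POSSESSIVE.matcher("jazz").matches());
        System.out.println(ANSWER.matcher("see foo.Bar#baz.Qux here").find());
    }
}
```

How much time this saves depends on the input; it mainly helps on lines with long letter runs that fail to match.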
Compiling your regex pattern repeatedly (for each file) is a relatively expensive waste.
You could define that once and keep using the same instance.
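A sketch of what that could look like, assuming the pattern from the question; the class and method names are hypothetical. Pattern instances are immutable, so a single static instance can be shared by all calls:

```java
import java.util.regex.Pattern;

class LineScanner {
    // Compiled once when the class is loaded and reused for every line and file.
    private static final Pattern PATTERN =
            Pattern.compile("[a-zA-Z]+[.][a-zA-Z]+[#][a-zA-Z]+[.][a-zA-Z]+");

    // Returns true if the line contains at least one match.
    static boolean containsMatch(String line) {
        return PATTERN.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("call foo.bar#baz.qux"));
        System.out.println(containsMatch("nothing to see"));
    }
}
```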
I am trying to read the contents of a file using StringTokenizer and store all the tokens in an array, but I keep getting an exception in main. I need advice on how to do this. Below is the code I'm using:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;
public class FileTokenizer
{
private static final String DEFAULT_DELIMITERS = "< , { } >";
private static final String DEFAULT_TEST_FILE = "trans1.txt";
public List<String> tokenize(Reader reader) throws IOException
{
List<String> tokens = new ArrayList<String>();
BufferedReader br = null;
try
{
int i = 0;
br = new BufferedReader(reader);
Scanner scanner = new Scanner(br);
while (scanner.hasNext())
{
StringTokenizer st = new StringTokenizer(scanner.next(), DEFAULT_DELIMITERS, true);
while (st.hasMoreElements())
{
String[] t = new String[200];
tokens.add(st.nextToken());
t[i] = st.nextToken();
System.out.println(t[i]);
i++;
}
}
}
finally
{
close(br);
}
return tokens;
}
public static void close(Reader r)
{
try
{
if (r != null)
{
r.close();
}
}
catch (IOException e)
{
e.printStackTrace();
}
}
public static void main(String[] args)
{
try
{
String fileName = ((args.length > 0) ? args[0] : DEFAULT_TEST_FILE);
FileReader fileReader = new FileReader(new File(fileName));
FileTokenizer fileTokenizer = new FileTokenizer();
List<String> tokens = fileTokenizer.tokenize(fileReader);
//System.out.println(tokens);
}
catch (IOException e)
{
e.printStackTrace();
}
}
}
My file looks like this:
PDA = (
{ q1, q2, q3, q4},
{ 0, 1 },
{ 0, $ },
{ (q1, #, #) -> { (q2, $) }, (q2, 0, #) -> { (q2, 0) },
(q2, 1, 0) -> { (q3, #) }, (q3, 1, 0) -> { (q3, #) },
(q3, #, $) -> { (q4, #) } },
q1,
{ q1, q4}
)
You get a java.util.NoSuchElementException because you call st.nextToken() twice within the loop:
while (st.hasMoreElements())
Modifying harigm's example, you can then add t[i] to tokens as you require:
String[] t = new String[200];
t[i] = st.nextToken(); // the only nextToken() call in the loop body
System.out.println(t[i]);
tokens.add(t[i]);
Delimiters shouldn't be separated by spaces:
private static final String DEFAULT_DELIMITERS = "<,{}>";
Also, keep the following in mind (from the Javadoc):
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
String.split() was introduced in JDK 1.4.
That said:
Using a Scanner to tokenize a stream together with a StringTokenizer looks a bit weird to me;
You call st.nextToken() twice in the inner loop;
t is useless. You re-create it each time in your inner loop and use only one element of it.
It seems that what you are trying to build is a lexical analyzer. Maybe you should look up some documentation on the subject.
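A very small lexer along these lines can be built on java.util.regex, as the Javadoc quoted above suggests. This is only a sketch under the assumption that the tokens are the single delimiter characters < , { } > plus runs of other non-whitespace characters; RegexTokenizer is a made-up name:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenizer {
    // A token is either one delimiter character or a run of characters
    // that are neither delimiters nor whitespace.
    private static final Pattern TOKEN = Pattern.compile("[<,{}>]|[^<,{}>\\s]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {          // whitespace between tokens is skipped
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("{ q1, q2, q3, q4},"));
    }
}
```

Each token is read exactly once, so the NoSuchElementException from the double nextToken() call cannot occur, and no fixed-size array is needed.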
Hi,
I have modified your code and it now works perfectly fine. Check this:
package org.sample;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;
public class FileTokenizer
{
private static final String DEFAULT_DELIMITERS = "< , { } >";
// private static final String DEFAULT_TEST_FILE = "trans1.txt";
public List<String> tokenize(Reader reader) throws IOException
{
List<String> tokens = new ArrayList<String>();
BufferedReader br = null;
try
{
int i = 0;
br = new BufferedReader(reader);
Scanner scanner = new Scanner(br);
while (scanner.hasNext())
{
StringTokenizer st = new StringTokenizer(scanner.next(), DEFAULT_DELIMITERS, true);
while (st.hasMoreElements())
{
String[] t = new String[200];
// tokens.add(st.nextToken());
t[i] = st.nextToken(); // consume exactly one token per iteration
System.out.println(t[i]);
i++;
}
}
}
finally
{
close(br);
}
return tokens;
}
public static void close(Reader r)
{
try
{
if (r != null)
{
r.close();
}
}
catch (IOException e)
{
e.printStackTrace();
}
}
public static void main(String[] args)
{
try
{
// String fileName = ((args.length > 0) ? args[0] : DEFAULT_TEST_FILE);
FileReader fileReader = new FileReader(new File("c:\\DevTest\\1.txt"));
FileTokenizer fileTokenizer = new FileTokenizer();
List<String> tokens = fileTokenizer.tokenize(fileReader);
//System.out.println(tokens);
}
catch (IOException e)
{
e.printStackTrace();
}
}
}
Looking at your input file, I should point out that its hierarchical and irregular structure makes it more suited to be parsed by an actual parser. You may have to learn how to use a parser generator and write a lexer and grammar for it etc, but in the end you'll end up with a much more maintainable code. Doing this yourself is rather painstaking and error-prone.
I recommend ANTLR. It's quite mature, and it has a wide enough user base that I'm sure you can get help easily.