How to ignore an ASCII Character before parsing? - java

import java.io.*;
import java.util.ArrayList;
import java.util.List;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
public class TagText {
public static void main(String[] args) throws IOException, ClassNotFoundException {
// Initializing the tagger
MaxentTagger tagger = new MaxentTagger("taggers/english-left3words-distsim.tagger");
List<String> lines = new ArrayList<>();
lines = new ReadCSV().readColumn("Tt2.csv", 4);
for (String line : lines) {
String tagged = tagger.tagString(line);
System.out.println(tagged);
}
}
}
I'm trying to parse a CSV file that contains a character (BIN 10010111, —) which I want the text parser to ignore. How would I do that?

So I guess you want to remove all special characters?
I think it was something like: replaceAll("[^\\w\\s]", "");
Edit: Full code
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
public class TagText {
public static void main(String[] args) throws IOException, ClassNotFoundException {
// Initializing the tagger
MaxentTagger tagger = new MaxentTagger("taggers/english-left3words-distsim.tagger");
List<String> lines = new ArrayList<>();
lines = new ReadCSV().readColumn("Tt2.csv", 4);
for (String line : lines) {
String tagged = tagger.tagString(line.replace("\uFFFD",""));
System.out.println(tagged);
}
}
}
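If the character only turns into U+FFFD because the file is being decoded with the wrong charset, another option is to decode the bytes explicitly and then drop the dash itself. This is a minimal sketch, assuming the CSV was saved as windows-1252 (where byte 10010111, i.e. 0x97, is the dash U+2014). The class and method names are made up for illustration and are not part of ReadCSV or the tagger:

```java
import java.nio.charset.Charset;

final class DashCleaner {

    // Assumption: the file is windows-1252 encoded, so byte 0x97 decodes to U+2014.
    static String decodeCp1252(byte[] raw) {
        return new String(raw, Charset.forName("windows-1252"));
    }

    // Removes the dash (U+2014) and the replacement character (U+FFFD) from a line.
    static String strip(String line) {
        return line.replace("\u2014", "").replace("\uFFFD", "");
    }
}
```

Each line would then go through DashCleaner.strip(line) before being passed to tagger.tagString.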

Related

Save words from a text file into an array, separated by multiple spaces (Java)

I'm a Java beginner working on a small dictionary project. I want to save each word and its translation in a file. Words in my native language often contain spaces (for example, chicken is con gà), so I can't use a single space as the separator. Instead, each line holds a word and its translation separated by a "tab" (multiple spaces), like chicken con gà. I want to read the two parts and store them in the array of Words I created earlier, so I want to do something like:
w1=word1inline;
w2=word2inline;
Word(word1inline,word2inline);(this is a member of array);
Please help. I only know how to read a line from a text file and use split to get the words, but I'm not sure how to split on multiple spaces.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
public class docfile {
public static void main(String[] args)
{
String readLine;
ArrayList<String>str=new ArrayList<>(String);
try {
File file = new File("text.txt");
BufferedReader b = new BufferedReader(new FileReader(file));
while ((readLine = b.readLine()) != null) {
str.add()=readLine.split(" ");
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
If you stick to using tabs as a separator, this should work:
package Application;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
public class Application {
public static void main(String[] args) {
String line;
ArrayList<String> str = new ArrayList<>();
try {
File file = new File("text.txt");
BufferedReader b = new BufferedReader(new FileReader(file));
while ((line = b.readLine()) != null) {
for (String s : line.split("\t")) {
str.add(s);
}
}
str.forEach(System.out::println);
} catch (IOException e) {
e.printStackTrace();
}
}
}
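To get word/translation pairs rather than one flat list, each line can be split at the first tab only. This is a rough sketch: Word here is a hypothetical stand-in for the asker's own class, and TabPairs and parse are names invented for the example. It skips lines that contain no tab.

```java
import java.util.ArrayList;
import java.util.List;

final class TabPairs {

    // Hypothetical stand-in for the asker's Word class.
    static final class Word {
        final String term;
        final String meaning;

        Word(String term, String meaning) {
            this.term = term;
            this.meaning = meaning;
        }
    }

    // Splits each line at the first tab; lines without a tab are skipped.
    static List<Word> parse(List<String> lines) {
        List<Word> words = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                words.add(new Word(parts[0].trim(), parts[1].trim()));
            }
        }
        return words;
    }
}
```

The limit of 2 keeps any later tabs inside the translation, and spaces in the translation (as in con gà) are left alone.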
Why not just use a properties file?
dict.properties:
chicken=con gà
Dict.java:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
public class Dict {
public static void main(String[] args) throws IOException {
Properties dict = new Properties();
dict.load(Files.newBufferedReader(Paths.get("dict.properties")));
System.out.println(dict.getProperty("chicken"));
}
}
Output:
con gà
If your line looks like chicken con gà, you can use the indexOf() method to find the first space in the string.
Then you can extract each part with the substring() method.
readLine = b.readLine();
ArrayList<String>str=new ArrayList<>();
int i = readLine.indexOf(' ');
String firstWord = readLine.substring(0, i);
String secondWord = readLine.substring(i+1, readLine.length());
str.add(firstWord);
str.add(secondWord);

How to use delimiter having multiple characters using apache commons csv

I have an input file whose delimiter is a combination of characters, like #$#. But Apache Commons CSVParser accepts only a single character as the delimiter, not multiple characters. Here is the input file:
Rajeev Kumar Singh ♥#$#rajeevs@example.com#$#+91-9999999999#$#India
Sachin Tendulkar#$#sachin@example.com#$#+91-9999999998#$#India
Barak Obama#$#barak.obama@example.com#$#+1-1111111111#$#United States
Donald Trump#$#donald.trump@example.com#$#+1-2222222222#$#United States
Code snippet:
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class CSVReader {
private static final String SAMPLE_CSV_FILE_PATH = "./users.csv";
public static void main(String[] args) throws IOException {
try (
Reader reader = Files.newBufferedReader(Paths.get(SAMPLE_CSV_FILE_PATH));
CSVParser csvParser = new CSVParser(reader, CSVFormat.EXCEL.withDelimiter('#'))
) {
long recordCount;
List<CSVRecord> csvRecords = csvParser.getRecords();
}
}
}
Please help me use a delimiter with multiple characters. In the example above, the delimiter is only a single character, '#'. I need to set the delimiter to "#$#".
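import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Function;
// Applies a replacement to every line returned by readLine().
// Note: only readLine() is overridden, so callers that use read() see the original text.
public class ReplacingReader extends BufferedReader {
private final Function<String, String> replacer;
public ReplacingReader(Reader in, Function<String, String> replacer) {
super(in);
this.replacer = replacer;
}
@Override
public String readLine() throws IOException {
String line = super.readLine();
// readLine() returns null at end of stream; pass that through untouched
return line == null ? null : replacer.apply(line);
}
}
Reader reader = new ReplacingReader(
Files.newBufferedReader(Paths.get(SAMPLE_CSV_FILE_PATH)),
line -> line.replace("#$#", "§")
);
CSVParser csvParser = new CSVParser(reader, CSVFormat.EXCEL.withDelimiter('§'));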
The buffering is now done twice; one could use another Reader, a FileInputStream, or similar instead.
I'm not quite clear on why you need CSVParser in your case. I tested this locally with your data and came up with this parsing demo:
public static void main(String... args) {
try (Stream<String> lines = Files.lines(Paths.get(Thread.currentThread().getContextClassLoader().getResource("csv.txt").toURI()))) {
lines.forEach(line -> {
String[] words = line.split("#\\$#");
System.out.println(Arrays.toString(words));
});
} catch (URISyntaxException | IOException ignored) {
ignored.printStackTrace();
}
}
The output will be:
[Rajeev Kumar Singh ♥, rajeevs@example.com, +91-9999999999, India]
[Sachin Tendulkar, sachin@example.com, +91-9999999998, India]
[Barak Obama, barak.obama@example.com, +1-1111111111, United States]
[Donald Trump, donald.trump@example.com, +1-2222222222, United States]
By the way, csv.txt is in the resources directory.
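Because String.split takes a regular expression, the delimiter can be wrapped in Pattern.quote so that any delimiter string works without manual escaping. This is a small sketch with made-up names (MultiCharSplit, splitLine). It does no CSV quoting or escaping, so it only fits when fields never contain the delimiter:

```java
import java.util.regex.Pattern;

final class MultiCharSplit {

    // Splits on the literal delimiter; the -1 limit keeps trailing empty fields.
    static String[] splitLine(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1);
    }
}
```

For example, splitLine(line, "#$#") on one of the sample lines above gives four fields.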
public List<CSVRecord> getCSVRecords(String path, String delimiter) throws IOException {
List<CSVRecord> csvRecords = null;
Stream<String> lines = Files.lines(Paths.get(path));
List<String> replaced = lines.map(line -> line.replaceAll(Pattern.quote(delimiter), "§")).collect(Collectors.toList());
try (
BufferedReader buffer =
new BufferedReader(new StringReader(String.join("\n", replaced)));
CSVParser csvParser = new CSVParser(buffer, CSVFormat.EXCEL.withDelimiter('§'))
) {
csvRecords = csvParser.getRecords();
return csvRecords;
}
}

Separating words in a file with commas Java?

This is what I have.
No error is displayed, but it doesn't run. Please tell me what the problem is.
Do I have to import something else?
The file is a paragraph from a book.
import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;
public class Unique {
public static void main(String[] args) {
}
public void add(String fileName) throws Exception {
File inFile = new File("ReadThis.txt");
ArrayList<String> words = new ArrayList<String>();
Scanner scanner = new Scanner(inFile);
while (scanner.hasNext()) {
String word = scanner.next() ;
word = word.replaceAll("[^a-zA-Z ]", "");
words.add(word) ;
}
scanner.close();
}
}
The entry point of your code is empty.
public static void main(String[] args) {
}
The behavior you describe is exactly what this code does: nothing.
You'll have to insert the code you want to run into the main-method in order to get it running. E.g.:
public static void main(String[] args) throws Exception {
new Unique().add("someFile");
}
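Once main actually calls the method, the comma-separated output from the title could be produced along these lines. This is only a sketch: CommaWords and joinWords are invented names, and it reads from a String instead of ReadThis.txt so it runs without the file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class CommaWords {

    // Strips everything except letters from each token and joins the rest with commas.
    static String joinWords(String text) {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                String word = scanner.next().replaceAll("[^a-zA-Z]", "");
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return String.join(", ", words);
    }
}
```

To use it with the file, pass new Scanner(inFile) instead, as in the original add method.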

Comparing Phrases to whole text

I have two text files. The first contains a paragraph of text entered by the user. The second is a dictionary of terms extracted from an OWL file, like so:
Inferior salivatory nucleus
Retrosplenial area
lateral agranular part
I have coded the parts that create these files. I am stuck on how to compare the files so that any whole phrases that appear in both the dictionary and the paragraph of text are printed to the command line in Java.
Try the following code; it should help. Correct the file path in fileName and put your search condition inside the while loop:
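import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
public class JavaReadFile {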
public static void main(String[] args) throws IOException {
String fileName = "filePath.txt";
//read using BufferedReader, to read line by line
readUsingBufferedReader(fileName);
}
private static void readUsingBufferedReader(String fileName) throws IOException {
File file = new File(fileName);
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
String line;
while((line = br.readLine()) != null){
//process the line
System.out.println(line);
}
//close resources
br.close();
fr.close();
}
}
You could write the file to a string and iterate over the keys in your dictionary and check if they are present in the paragraph with contains. This probably isn't a particularly efficient solution, but it should work.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
public class Test {
public static void main(String[] args) throws IOException {
String fileString = new String(Files.readAllBytes(Paths.get("dictionary.txt")),StandardCharsets.UTF_8);
Set<String> set = new HashSet<String>();
set.add("ZYMURGIES");
for (String term : set) {
if(fileString.contains(term)) {
System.out.println(term);
}
}
}
}
Here's a Java 8 version of the contains check.
package insert.name.here;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
public class InsertNameHere {
public static void main(String[] args) throws IOException {
String paragraph = new String(Files.readAllBytes(Paths.get("<paragraph file path>")));
Files.lines(Paths.get("<dictionary file path>"))
.filter(paragraph::contains)
.forEach(phrase -> System.out.printf("Paragraph contains %s%n", phrase));
}
}
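Plain contains is case-sensitive and also matches inside longer words, so a dictionary entry like "Retrosplenial area" would miss "retrosplenial area" in the paragraph. One possible variant is a case-insensitive regex with word boundaries. This is a sketch with made-up names (PhraseFinder, findPhrases), and it assumes each phrase starts and ends with a letter or digit so that \b behaves as expected:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PhraseFinder {

    // Returns the dictionary phrases that occur in the paragraph as whole phrases,
    // ignoring case. Blank dictionary lines are skipped.
    static List<String> findPhrases(String paragraph, List<String> phrases) {
        List<String> found = new ArrayList<>();
        for (String phrase : phrases) {
            String trimmed = phrase.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            Pattern pattern = Pattern.compile(
                    "\\b" + Pattern.quote(trimmed) + "\\b", Pattern.CASE_INSENSITIVE);
            if (pattern.matcher(paragraph).find()) {
                found.add(trimmed);
            }
        }
        return found;
    }
}
```

The phrase list could come from Files.readAllLines on the dictionary file.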

Iterate an Array and test a regular expression to each value (Java)

I'm quite new to Java and I'm facing a situation I can't solve. I have some html code and I'm trying to run a regular expression to store all matches into an array. Here's my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
public class RegexMatch{
boolean foundMatch = false;
public String[] arrayResults;
public String[] TestRegularExpression(String sourceCode, String pattern){
try{
Pattern regex = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(sourceCode);
while (regexMatcher.find()) {
arrayResults[matches] = regexMatcher.group();
matches ++;
}
} catch (PatternSyntaxException ex) {
// Exception occurred
}
return arrayResults;
}
}
I'm passing a string containing HTML code and the regular expression pattern to extract all meta tags and store them in the array. Here's how I call the method:
RegexMatch regex = new RegexMatch();
regex.TestRegularExpression(sourceCode, "<meta.*?>");
String[] META_TAGS = regex.arrayResults;
Any hint?
Thanks!
Firstly, parsing HTML with regular expressions is a bad idea. There are alternatives which will convert the HTML into a DOM etc - you should look into those.
Assuming you still want the "match multiple results" idea though, it seems to me that a List<E> of some form would be more useful, so you don't need to know the size up-front. You can also build that in the method itself, rather than having state. For example:
import java.util.*;
import java.util.regex.*;
public class Test
{
public static void main(String[] args)
throws PatternSyntaxException
{
// Want to get x10 and x5 from this
String text = "x10 y5 x5 xyz";
String pattern = "x\\d+";
List<String> matches = getAllMatches(text, pattern);
for (String match : matches) {
System.out.println(match);
}
}
public static List<String> getAllMatches(String text, String pattern)
throws PatternSyntaxException
{
Pattern regex = Pattern.compile(pattern);
List<String> results = new ArrayList<String>();
Matcher regexMatcher = regex.matcher(text);
while (regexMatcher.find()) {
results.add(regexMatcher.group());
}
return results;
}
}
It's possible that there's something similar to this within the Matcher class itself, but I can't immediately see it...
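On Java 9 and later, Matcher does have something along those lines: results() returns a stream of MatchResult objects. Here is a short sketch of the same getAllMatches helper built on it; the class name RegexMatches is made up, and this does not compile on Java 8 or earlier:

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class RegexMatches {

    // Collects every match of the pattern in the text, in order of appearance.
    // Requires Java 9+ because of Matcher.results().
    static List<String> getAllMatches(String text, String pattern) {
        return Pattern.compile(pattern)
                .matcher(text)
                .results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }
}
```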
With Jsoup, you could do something as simple as...
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class GetMeta {
private static final String META_QUERY = "meta";
public static List<String> parseForMeta(String htmlText) {
Document jsDocument = Jsoup.parse(htmlText);
Elements metaElements = jsDocument.select(META_QUERY);
List<String> metaList = new ArrayList<String>();
for (Element element : metaElements) {
metaList.add(element.toString());
}
return metaList;
}
}
For example:
import java.io.IOException;
import java.net.*;
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class GetMeta {
private static final String META_QUERY = "meta";
private static final String MAIN_URL = "http://www.yahoo.com";
public static void main(String[] args) {
try {
Scanner scan = new Scanner(new URL(MAIN_URL).openStream());
StringBuilder sb = new StringBuilder();
while (scan.hasNextLine()) {
sb.append(scan.nextLine() + "\n");
}
List<String> metaList = parseForMeta(sb.toString());
for (String metaStr : metaList) {
System.out.println(metaStr);
}
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
public static List<String> parseForMeta(String htmlText) {
Document jsDocument = Jsoup.parse(htmlText);
Elements metaElements = jsDocument.select(META_QUERY);
List<String> metaList = new ArrayList<String>();
for (Element element : metaElements) {
metaList.add(element.toString());
}
return metaList;
}
}
