I have a single column of data, exported from Google Sheets as CSV and also from LibreOffice as CSV. I've tried to read both files using OpenCSV, but I'm only getting a small portion of the data.
How can I read this file in? I don't really see any commas in this CSV file... but then, it's only a single column of data.
file:
thufir@dur:~/jaxb$
thufir@dur:~/jaxb$ head input.csv
Field 1
Foo # 16
bar
baz
fdkfdl
fdsfdsfsdfgh
thufir@dur:~/jaxb$
output:
thufir@dur:~/NetBeansProjects/BaseXFromJAXB$
thufir@dur:~/NetBeansProjects/BaseXFromJAXB$ gradle run
> Task :run
Jan 10, 2019 3:36:08 PM net.bounceme.dur.basexfromjaxb.csv.ReaderForCVS printMap
INFO: Foo # 16
Jan 10, 2019 3:36:08 PM net.bounceme.dur.basexfromjaxb.csv.ReaderForCVS printMap
INFO: Field 1
Jan 10, 2019 3:36:08 PM net.bounceme.dur.basexfromjaxb.csv.ReaderForCVS printMap
INFO: Foo # 16
BUILD SUCCESSFUL in 1s
3 actionable tasks: 1 executed, 2 up-to-date
thufir@dur:~/NetBeansProjects/BaseXFromJAXB$
code:
package net.bounceme.dur.basexfromjaxb.csv;

import com.opencsv.CSVReaderHeaderAware;
import java.io.File;
import java.io.FileReader;
import java.net.URI;
import java.util.Collection;
import java.util.Map;
import java.util.logging.Logger;

public class ReaderForCVS {

    private static final Logger LOG = Logger.getLogger(ReaderForCVS.class.getName());
    private Map<String, String> values;

    public ReaderForCVS() {
    }

    public void unmarshal(URI inputURI) throws Exception {
        FileReader f = new FileReader(new File(inputURI));
        values = new CSVReaderHeaderAware(f).readMap();
    }

    public void printMap() {
        Collection<String> stringValues = values.values();
        for (String s : stringValues) {
            LOG.info(s);
        }
        for (Map.Entry<String, String> item : values.entrySet()) {
            String key = item.getKey();
            String value = item.getValue();
            LOG.info(key);
            LOG.info(value);
        }
    }
}
Frankly, I can't tell whether the library is reading the file in a funky way, or the file is mangled in some way, or what. I'll be looking at CSV from websites next, but I'm not sure what that establishes. I don't think it's likely that the library isn't parsing properly, but neither can I see a problem with this data.
There are only so many ways to export data from a spreadsheet as CSV, and I've tried a few. The content of the file is immaterial, but that structure: lines with no content, just a single column, special characters, is what I'm dealing with.
Reading in the file as text gives the desired output...
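For reference, a minimal sketch of the plain-text read that behaves as expected (assuming input.csv is in the working directory):

import java.nio.file.Files;
import java.nio.file.Paths;

public class PlainRead {
    public static void main(String[] args) throws Exception {
        // Plain line-by-line read: prints the header and every data row
        for (String line : Files.readAllLines(Paths.get("input.csv"))) {
            System.out.println(line);
        }
    }
}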
It looks like CSVReaderHeaderAware works similarly to the other CSVReaders (see the OpenCSV guide on reading): each call to readMap() returns a single data row, so you have to call it in a loop until it returns null. Here is some sample code I used which seemed to work:
import com.opencsv.CSVReaderHeaderAware;
import java.io.StringReader;
import java.util.Map;

Map<String, String> values;
CSVReaderHeaderAware csvReaderHeaderAware = new CSVReaderHeaderAware(new StringReader(DAOConstants.IND_DATA));
// readMap() returns one row per call, so loop until it returns null
while ((values = csvReaderHeaderAware.readMap()) != null) {
    for (Map.Entry<String, String> entry : values.entrySet()) {
        System.out.println("Key = " + entry.getKey() + ", Value = " + entry.getValue());
    }
}
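Adapting that to the ReaderForCVS class above, unmarshal() needs the same loop; a minimal sketch, assuming the values field is widened to a List<Map<String, String>> so it can hold every row:

import com.opencsv.CSVReaderHeaderAware;
import java.io.File;
import java.io.FileReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ReaderForCVSLoop {

    private final List<Map<String, String>> values = new ArrayList<>();

    public void unmarshal(URI inputURI) throws Exception {
        try (CSVReaderHeaderAware reader =
                new CSVReaderHeaderAware(new FileReader(new File(inputURI)))) {
            Map<String, String> row;
            // readMap() consumes one data row per call and returns null at end of file
            while ((row = reader.readMap()) != null) {
                values.add(row);
            }
        }
    }
}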
Related
I have some Parquet files written in Python using PyArrow. Now I want to read them using a Java program. I tried the following, using Apache Avro:
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class Main {

    private static Path path = new Path("D:\\pathToFile\\review.parquet");

    public static void main(String[] args) throws IllegalArgumentException {
        try {
            Configuration conf = new Configuration();
            Schema schema = SchemaBuilder.record("lineitem")
                    .fields()
                    .name("reviewID")
                    .aliases("review_id$str")
                    .type().stringType()
                    .noDefault()
                    .endRecord();
            conf.set(AvroReadSupport.AVRO_REQUESTED_PROJECTION, schema.toString());

            ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path)
                    .withConf(conf)
                    .build();

            GenericRecord r;
            while (null != (r = reader.read())) {
                r.getSchema().getField("reviewID").addAlias("review_id$str");
                Object review_id = r.get("review_id$str");
                String review_id_str = review_id != null ? ("'" + review_id.toString() + "'") : "-";
                System.out.println("review_id: " + review_id_str);
            }
        } catch (IOException e) {
            System.out.println("Error reading parquet file.");
            e.printStackTrace();
        }
    }
}
My Parquet file contains columns whose names contain the symbols [, ], ., \ and $. (In this case, the Parquet file contains a column review_id$str, whose values I want to read.) However, these characters are invalid in Avro names (see: https://avro.apache.org/docs/current/spec.html#names). Therefore, I tried to use aliases (see: http://avro.apache.org/docs/current/spec.html#Aliases). Even though I no longer get any "Invalid Character" errors, I am still unable to get the values; only the placeholder is printed even though the column contains values.
It only prints:
review_id: -
review_id: -
review_id: -
review_id: -
...
And expected would be:
review_id: Q1sbwvVQXV2734tPgoKj4Q
review_id: GJXCdrto3ASJOqKeVWPi6Q
review_id: 2TzJjDVDEuAW6MR5Vuc1ug
review_id: yi0R0Ugj_xUx_Nek0-_Qig
...
Am I using the aliases wrong? Is it even possible to use aliases in this situation? If so, please explain how I can fix it. Thank you.
Update 2021:
In the end, I decided not to use Java for this task. I stuck to my solution in Python using PyArrow which works perfectly fine.
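For anyone who still wants to do this in Java: one way to sidestep Avro's name restrictions entirely is Parquet's example Group API from parquet-hadoop, which works with the raw column names, so the $ is not a problem. A minimal, untested sketch:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class GroupRead {
    public static void main(String[] args) throws Exception {
        Path path = new Path("D:\\pathToFile\\review.parquet");
        // GroupReadSupport reads rows as untyped Groups, keyed by the raw Parquet column name
        try (ParquetReader<Group> reader =
                ParquetReader.builder(new GroupReadSupport(), path).build()) {
            Group g;
            while ((g = reader.read()) != null) {
                // getString(column, index): index 0 is the first (only) value of the field
                System.out.println("review_id: " + g.getString("review_id$str", 0));
            }
        }
    }
}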
The code:
package org.javautil.salesdata;

import java.io.File;
import java.io.IOException;
import java.util.Map;

import org.javautil.util.ListOfNameValue;

import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

// https://github.com/FasterXML/jackson-dataformats-text/tree/master/csv
public class Manufacturers {

    private static final String fileName = "src/main/resources/pdssr/manufacturers.csv";

    ListOfNameValue getManufacturers() throws IOException {
        ListOfNameValue lnv = new ListOfNameValue();
        File csvFile = new File(fileName);

        CsvMapper mapper = new CsvMapper();
        CsvSchema schema = CsvSchema.emptySchema().withHeader(); // use first row as header; otherwise defaults are fine
        MappingIterator<Map<String, String>> it = mapper.readerFor(Map.class)
                .with(schema)
                .readValues(csvFile);
        while (it.hasNext()) {
            Map<String, String> rowAsMap = it.next();
            System.out.println(rowAsMap);
        }
        return lnv;
    }
}
The data:
"mfr_id","mfr_cd","mfr_name"
"0000000020","F-L", "Frito-Lay"
"0000000030","GM", "General Mills"
"0000000040","HVEND", "Hershey Vending"
"0000000050","HFUND", "Hershey Fund Raising"
"0000000055","HCONC", "Hershey Concession"
"0000000060","SNYDERS", "Snyder's of Hanover"
"0000000080","KELLOGG", "Kellogg & Keebler"
"0000000115","KARS", "Kar Nut Product (Kar's)"
"0000000135","MARS", "Mars Chocolate "
"0000000145","POORE", "Inventure Group (Poore Brothers)"
"0000000150","WOW", "WOW Foods"
"0000000160","CADBURY", "Cadbury Adam USA, LLC"
"0000000170","MONOGRAM", "Monogram Food"
"0000000185","JUSTBORN", "Just Born"
"0000000190","HOSTESS", "Hostess, Dolly Madison"
"0000000210","SARALEE", "Sara Lee"
The exception is
com.fasterxml.jackson.databind.exc.RuntimeJsonMappingException: Too many entries: expected at most 3 (value #3 (4 chars) "LLC"")
I thought I would throw out my own CSV parser and adopt a supported project with more functionality, but most of them are far slower, just plain break, or have examples all over the web that don't work with the current release of the product.
The problem is that your file does not meet the CSV standard: the third field always starts with a space.
"mfr_id","mfr_cd","mfr_name"
"0000000020","F-L", "Frito-Lay"
"0000000030","GM", "General Mills"
"0000000040","HVEND", "Hershey Vending"
"0000000050","HFUND", "Hershey Fund Raising"
From wikipedia:
According to RFC 4180, spaces outside quotes in a field are not allowed; however, the RFC also says that "Spaces are considered part of a field and should not be ignored." and "Implementors should 'be conservative in what you do, be liberal in what you accept from others' (RFC 793, section 2.10) when processing CSV files."
Jackson is being "liberal" in processing most of your records; but when it finds
"0000000160","CADBURY", "Cadbury Adam USA, LLC"
It has no choice but to treat it as 4 fields:
'0000000160'
'CADBURY'
' "Cadbury Adam USA'
' LLC"'
I would suggest fixing the file, as that will allow parsing with most CSV libraries. Or you could try another library; there is no shortage of them.
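If editing the export is not an option, a hedged workaround is to normalize the separators before Jackson sees the file, assuming the sequence ", " (quote, comma, space) only ever occurs between fields, which holds for this sample:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class CsvSeparatorFixer {
    public static void main(String[] args) throws Exception {
        Path in = Paths.get("src/main/resources/pdssr/manufacturers.csv");
        Path out = Paths.get("src/main/resources/pdssr/manufacturers-fixed.csv");
        List<String> fixed = Files.readAllLines(in, StandardCharsets.UTF_8).stream()
                .map(line -> line.replace("\", \"", "\",\"")) // drop the space after each separator
                .collect(Collectors.toList());
        Files.write(out, fixed);
    }
}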
univocity-parsers can handle that without any issues. It's built to deal with all sorts of tricky and non-standard CSV files and is also faster than the parser you are using.
Try this code:
import java.io.File;
import java.util.Map;

import com.univocity.parsers.common.record.Record;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

String fileName = "src/main/resources/pdssr/manufacturers.csv";

CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true);

CsvParser parser = new CsvParser(settings);
for (Record record : parser.iterateRecords(new File(fileName))) {
    Map<String, String> rowAsMap = record.toFieldMap();
    System.out.println(rowAsMap);
}
Hope this helps.
Disclosure: I'm the author of this library. It's open source and free (Apache 2.0 license)
I have exactly 278 HTML files of essays from different students; every file contains the student ID, first name, and last name in the following format:
<p>Student ID: 000000</p>
<p>First Name: John</p>
<p>Last Name: Doe</p>
I'm trying to extract the student IDs from all these files. Is there a way to extract the data between X and Y, where X is "<p>Student ID: " and Y is "</p>"? That should leave us with the ID.
What method/language/concept/software would you recommend to get this done?
Using Java:

import java.io.File;
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class StudentIDsRetriever {

    public static void main(String[] args) throws IOException {
        File dir = new File("htmldir");
        String[] htmlFiles = dir.list();
        List<String> studentIds = new ArrayList<>();
        List<String> emailDs = new ArrayList<>();
        for (String htmlFile : htmlFiles) {
            Path path = FileSystems.getDefault().getPath("htmldir", htmlFile);
            List<String> lines = Files.readAllLines(path);
            for (String str : lines) {
                if (str.contains("<p>Student ID:")) {
                    // Cut out everything between the opening tag and "</p>"
                    String idTag = str.substring(str.indexOf("<p>Student ID:"));
                    String id = idTag.substring("<p>Student ID:".length(), idTag.indexOf("</p>"));
                    System.out.println("Id is " + id);
                    studentIds.add(id);
                }
                if (str.contains("@") && (str.contains(".com") || str.contains(".co.in"))) {
                    String[] words = str.split(" ");
                    for (String word : words)
                        if (word.contains("@") && (word.contains(".com") || word.contains(".co.in")))
                            emailDs.add(word);
                }
            }
        }
        System.out.println("Student list is " + studentIds);
        System.out.println("Student email list is " + emailDs);
    }
}

P.S.: This works on Java 7+.
I recommend a Python script. If this is your first time using Python, that's OK; Python is an easy scripting language and has many references on Google.
1) Language: Python (version 2.7)
2) Library: BeautifulSoup (you can download this with pip; pip is a package manager program and can be installed with the Python installer)
Traverse the files one by one and open each local file, then parse the HTML content using BeautifulSoup (see this part: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag).
Then extract the content from the <p> tag. That returns "Student ID: 000000".
Split this string on ":". That gives you str[0] and str[1].
str[1] is the student number you want (you may need to remove the surrounding whitespace, e.g. ' 000000 '.strip() -> '000000').
If you need help, reply.
I need help with my Java project using Jsoup (if you think there is a more efficient way to achieve the purpose, please let me know). The purpose of my program is to parse certain useful information from different URLs and put it in a text file. I am not an expert in HTML or JavaScript; therefore, it has been difficult for me to code in Java exactly what I want to parse.
In the website that you see in the code below as one of the examples, the information that interests me to parse with Jsoup is everything you can see in the table under “Routing”(Route, Location, Vessel/Voyage, Container Arrival Date, Container Departure Date; = Origin, Seattle SSA Terminal T18, 26 Jun 15 A, 26 Jun 15 A… and so on).
So far, with Jsoup we are only able to parse the title of the website; we have been unsuccessful in getting any of the body.
Here is the code that I used, which I got from an online source:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Jsouptest71115 {

    public static void main(String[] args) throws Exception {
        String url = "http://google.com/gentrack/trackingMain.do "
                + "?trackInput01=999061985";
        Document document = Jsoup.connect(url).get();

        String title = document.title();
        System.out.println("title : " + title);

        String body = document.select("body").text();
        System.out.println("Body: " + body);
    }
}
Working code:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;

public class Sample {

    public static void main(String[] args) {
        String url = "http://homeport8.apl.com/gentrack/blRoutingPopup.do";
        try {
            Connection.Response response = Jsoup.connect(url)
                    .data("blNbr", "999061985") // tracking number
                    .method(Connection.Method.POST)
                    .execute();
            Element tableElement = response.parse().getElementsByTag("table")
                    .get(2).getElementsByTag("table")
                    .get(2);
            Elements trElements = tableElement.getElementsByTag("tr");
            ArrayList<ArrayList<String>> tableArrayList = new ArrayList<>();
            for (Element trElement : trElements) {
                ArrayList<String> columnList = new ArrayList<>();
                for (int i = 0; i < 5; i++) {
                    columnList.add(i, trElement.children().get(i).text());
                }
                tableArrayList.add(columnList);
            }
            System.out.println("Origin/Location: "
                    + tableArrayList.get(1).get(1)); // row and column number
            System.out.println("Discharge Port/Container Arrival Date: "
                    + tableArrayList.get(5).get(3));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
Origin/Location: SEATTLE SSA TERMINAL (T18), WA
Discharge Port/Container Arrival Date: 23 Jul 15 E
You need to use document.select("body"): the input to the select method is a CSS selector. To learn more about CSS selectors just google them, or read this. Using CSS selectors you can easily identify parts of a web page's body.
In your particular case you have a different problem, though: the table you are after is inside an iframe. If you look at the HTML of the page you are visiting, the iframe's URL is "http://homeport8.apl.com/gentrack/blRoutingFrame.do". If you visit that URL directly to access its content, you will get an exception, which is presumably some restriction on the server's side. To get the content properly you need to visit two URLs via Jsoup: 1. http://homeport8.apl.com/gentrack/trackingMain.do?trackInput01=999061985 and 2. http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985
For the first URL you'll get nothing useful, but for the second you'll get the tables of your interest. Try document.select("table"), which gives you a list of tables; iterate over this list and find the table of your interest. Once you have the table, use Element.select("tr") to get its rows, and then for each "tr" use Element.select("td") to get the cell data.
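A generic skeleton of that traversal (a sketch; it assumes the frame URL above is directly reachable with the tracking parameter):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableWalk {
    public static void main(String[] args) throws Exception {
        Document document = Jsoup.connect(
                "http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985").get();
        for (Element table : document.select("table")) { // every table on the page
            for (Element tr : table.select("tr")) {      // every row in that table
                for (Element td : tr.select("td")) {     // every cell in that row
                    System.out.print(td.text() + " | ");
                }
                System.out.println();
            }
        }
    }
}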
The webpage you are visiting doesn't use CSS class and id selectors, which would have made reading it with Jsoup a lot easier, so I'm afraid iterating over document.select("table") is your best and easiest option.
Good Luck.
I have a process in Talend which gets the search result of a page, saves the HTML, and writes it into files, as seen here:
Initially I had a two-step process, with the parsing of the date out of the HTML files done in Java. It works and writes the results to a MySQL database. Here is the code, which basically does exactly that. (I'm a beginner, sorry for the lack of elegance.)
package org.jsoup.examples;

import java.io.*;

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

public class parse2 {

    static parse2 parseIt2 = new parse2();

    String companyName = "Platzhalter";
    String jobTitle = "Platzhalter";
    String location = "Platzhalter";
    String timeAdded = "Platzhalter";

    public static void main(String[] args) throws IOException {
        parseIt2.getData();
    }

    public void getData() throws IOException {
        Document document = Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
        Elements elements = document.select(".joblisting");
        for (Element element : elements) {
            // Parse data into elements
            Elements jobTitleElement = element.select(".job_title span");
            Elements companyNameElement = element.select(".company_name span[itemprop=name]");
            Elements locationElement = element.select(".locality span[itemprop=addressLocality]");
            Elements dateElement = element.select(".job_date_added [datetime]");

            // Strip data from unnecessary tags
            String companyName = companyNameElement.text();
            String jobTitle = jobTitleElement.text();
            String location = locationElement.text();
            String timeAdded = dateElement.attr("datetime");

            System.out.println("Firma:\t" + companyName + "\t" + jobTitle + "\t in:\t" + location + " \t Erstellt am \t" + timeAdded);
        }
    }
}
Now I want to do the process end-to-end in Talend, and I was assured this works.
I tried this (which looks quite shady to me):
Basically I put all the imports in "advanced settings" and the code in the "basic settings" section. The importLibrary call is supposed to load the Jsoup parsing library, as well as the MySQL connector (I might do the connection with Talend tools, though).
Obviously this isn't working. I tried to strip the base code of classes and such, and it was even worse. Can you help me get the generated .txt files parsed with Java here?
EDIT: Here is the link to the Talend job: http://www.share-online.biz/dl/8M5MD99NR1
EDIT2: I changed the code to the one I tried in tJavaFlex, but it didn't work (the imports in the "start" part of the code, the rest in "body/main", and nothing in "end").
This is a problem related to Talend: in your code, use the complete class names, including their packages. For your document parsing, for example, you can use:
Document document = org.jsoup.Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
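Following the same rule, the rest of the question's parsing code could go into the component body like this (a sketch; same selectors as the original, fully qualified so no imports are needed):

// tJavaFlex main section: every class fully qualified, so no import statements required
org.jsoup.nodes.Document document = org.jsoup.Jsoup.parse(
        new java.io.File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
org.jsoup.select.Elements elements = document.select(".joblisting");
for (org.jsoup.nodes.Element element : elements) {
    String companyName = element.select(".company_name span[itemprop=name]").text();
    String jobTitle = element.select(".job_title span").text();
    String location = element.select(".locality span[itemprop=addressLocality]").text();
    String timeAdded = element.select(".job_date_added [datetime]").attr("datetime");
    System.out.println("Firma:\t" + companyName + "\t" + jobTitle
            + "\t in:\t" + location + "\t Erstellt am \t" + timeAdded);
}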