I am indexing PDFs using the Java API. I have installed the ingest-attachment processor plugin, and from my Java code I convert each PDF to base64 and index the encoded content.
The PDFs are on my machine's d:\ drive, and their file paths are stored in an Elasticsearch index named documents_local. So I fetch all the records from the documents_local index to get the file paths, read each PDF, encode it to base64, and index it.
For this process I use the scroll API to fetch the file paths from the index, because I have more than 100k documents. With the Java code below, indexing 20,000 PDFs takes 8 hours.
So I tried to split this process up.
I created three classes:
Controller.java
Producer.java
Consumer.java
Controller.java reads all the file paths from my index, stores them in an ArrayList, and passes them to the Producer class.
Producer.java reads each PDF from its file path, converts it to base64, and pushes the encoded content onto the queue.
Consumer.java reads the messages from the queue that Producer.java publishes.
My idea is to do the indexing of the encoded files in Consumer.java (this part is not implemented, and I am not sure how to do it).
Please find my Java code below.
Controller.java
public class Controller {
private static final int QUEUE_SIZE = 2;
private static BlockingQueue<String> queue;
private static Collection<Thread> producerThreadCollection, allThreadCollection;
private final static String INDEX = "documents_local";
private final static String ATTACHMENT = "document_suggestion";
private final static String TYPE = "doc";
private static final Logger logger = Logger.getLogger(Thread.currentThread().getStackTrace()[0].getClassName());
public static void main(String[] args) throws IOException {
RestHighLevelClient restHighLevelClient = null;
Document doc=new Document();
List<String> filePathList = new ArrayList<String>();
producerThreadCollection = new ArrayList<Thread>();
allThreadCollection = new ArrayList<Thread>();
queue = new LinkedBlockingDeque<String>(QUEUE_SIZE);
SearchRequest searchRequest = new SearchRequest(INDEX);
searchRequest.types(TYPE);
final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(60L)); //part of Scroll API
searchRequest.scroll(scroll); //part of Scroll API
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryBuilder qb = QueryBuilders.matchAllQuery();
searchSourceBuilder.query(qb);
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = SearchEngineClient.getInstance3().search(searchRequest);
String scrollId = searchResponse.getScrollId(); //part of Scroll API
SearchHit[] searchHits = searchResponse.getHits().getHits();
long totalHits=searchResponse.getHits().totalHits;
logger.info("Total Hits --->"+totalHits);
//part of Scroll API -- process the current batch first, then fetch the next one
while (searchHits != null && searchHits.length > 0) {
    for (SearchHit hit : searchHits) {
        Map<String, Object> sourceAsMap = hit.getSourceAsMap();
        if (sourceAsMap != null) {
            doc.setId((int) sourceAsMap.get("id"));
            doc.setApp_language(String.valueOf(sourceAsMap.get("app_language")));
            doc.setFilename(String.valueOf(sourceAsMap.get("filename")));
            doc.setPath(String.valueOf(sourceAsMap.get("path")));
            if (doc.getPath() != null && doc.getFilename() != null) {
                filePathList.add(doc.getPath().concat(doc.getFilename()));
            }
        }
    }
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(scroll);
    searchResponse = SearchEngineClient.getInstance3().searchScroll(scrollRequest);
    scrollId = searchResponse.getScrollId();
    searchHits = searchResponse.getHits().getHits();
}
createAndStartProducers(filePathList);
createAndStartConsumers(filePathList);
for(Thread t: allThreadCollection){
try {
t.join();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
System.out.println("Controller finished");
}
private static void createAndStartProducers(List<String> filePathList){
    for (int i = 0; i < filePathList.size(); i++) { // 0-based iteration over every path
        Producer producer = new Producer(Paths.get(filePathList.get(i)), queue);
        Thread producerThread = new Thread(producer, "producer-" + (i + 1));
        producerThreadCollection.add(producerThread);
        producerThread.start();
    }
    allThreadCollection.addAll(producerThreadCollection);
}
private static void createAndStartConsumers(List<String> filePathList){
for(int i = 0; i < filePathList.size(); i++){
Thread consumerThread = new Thread(new Consumer(queue), "consumer-"+i);
allThreadCollection.add(consumerThread);
consumerThread.start();
}
}
public static boolean isProducerAlive(){
for(Thread t: producerThreadCollection){
if(t.isAlive())
return true;
}
return false;
}
}
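One thing worth flagging before the other classes: createAndStartProducers and createAndStartConsumers start one thread per file, which will not scale to 100k files. Below is a minimal sketch of the same wiring on bounded pools, reusing the Producer and Consumer classes above; the pool size of 4 and the method name startWithPools are illustrative assumptions, not part of the original code.

import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical alternative wiring: bounded pools instead of one thread per file.
private static void startWithPools(List<String> filePathList) throws InterruptedException {
    ExecutorService producerPool = Executors.newFixedThreadPool(4);
    ExecutorService consumerPool = Executors.newFixedThreadPool(4);
    for (String path : filePathList) {
        producerPool.execute(new Producer(Paths.get(path), queue)); // each task reads and encodes one PDF
    }
    for (int i = 0; i < 4; i++) {
        consumerPool.execute(new Consumer(queue)); // a few long-lived consumers drain the queue
    }
    producerPool.shutdown();
    producerPool.awaitTermination(1, TimeUnit.DAYS); // wait for all reads to finish
    consumerPool.shutdown();
}

With pools, the consumers' stop condition would need to key off the producer pool's state rather than Controller.isProducerAlive().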
Producer.java
public class Producer implements Runnable {
private Path fileToRead;
private BlockingQueue<String> queue;
File file=null;
public Producer(Path filePath, BlockingQueue<String> q){
fileToRead = filePath;
queue = q;
}
public void run() {
    File file = fileToRead.toFile();
    if (file.exists() && !file.isDirectory()) {
        try {
            // read the PDF bytes, base64-encode them, and hand the result to the consumer
            byte[] bytes = Files.readAllBytes(fileToRead);
            String encodedfile = Base64.getEncoder().encodeToString(bytes);
            queue.put(encodedfile); // blocks while the queue is full
            System.out.println(Thread.currentThread().getName() + " finished");
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    } else {
        System.out.println("File does not exist: " + fileToRead);
    }
}
}
Consumer.java (incomplete class; I am not sure how to do the indexing from the consumer class, so I am just showing its skeleton.)
public class Consumer implements Runnable {
private BlockingQueue<String> queue;
File file=null;
public Consumer(BlockingQueue<String> q){
queue = q;
}
public void run(){
while(true){
String line = queue.poll();
if(line == null && !Controller.isProducerAlive())
return;
if(line != null){
System.out.println(Thread.currentThread().getName()+" processing line: "+line);
//Do something with the line here like see if it contains a string
}
}
}
}
With the piece of code below I have indexed the encoded files, but it takes too long because I have 100k documents. That is why I am trying the producer and consumer approach.
jsonMap = new HashMap<>();
jsonMap.put("id", doc.getId());
jsonMap.put("app_language", doc.getApp_language());
jsonMap.put("fileContent", result);
String id=Long.toString(doc.getId());
IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id )
.source(jsonMap)
.setPipeline(ATTACHMENT);
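To move that indexing into Consumer.java, here is a minimal sketch of a run() body. It assumes the producer enqueues the document id together with the base64 content, joined with a separator; that pairing and the "id|base64" format are my assumptions, since the original code only enqueues plain strings.

// Hypothetical Consumer.run(): drain the queue and index each payload.
public void run() {
    while (true) {
        String payload = queue.poll();
        if (payload == null && !Controller.isProducerAlive()) {
            return; // producers are done and the queue is drained
        }
        if (payload == null) {
            continue;
        }
        String[] parts = payload.split("\\|", 2); // assumed "id|base64Content" format
        Map<String, Object> jsonMap = new HashMap<>();
        jsonMap.put("id", parts[0]);
        jsonMap.put("fileContent", parts[1]);
        IndexRequest request = new IndexRequest(ATTACHMENT, "doc", parts[0])
                .source(jsonMap)
                .setPipeline(ATTACHMENT);
        try {
            SearchEngineClient.getInstance3().index(request);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Here ATTACHMENT is the target index and pipeline name, as in the snippet above.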
Related
I want to read multiple files through multithreading. I wrote code for this, but my threads execute one by one, which is very time consuming; I want them to run simultaneously.
Please point out what I am doing wrong in the code below. I implemented the Callable interface because I have to read each file, set its data into the variable of a Model object, and then return the list of objects.
Thanks in advance.
class A {
ExecutorService executor = getExecuterService();
private ExecutorService getExecuterService() {
int threadPoolSize = Runtime.getRuntime().availableProcessors() - 1;
System.out.println("Number of COre" + threadPoolSize);
return Executors.newFixedThreadPool(threadPoolSize);
}
@SuppressWarnings({ "unused", "unchecked" })
FutureTask<List<DSection>> viewList = (FutureTask<List<DSection>>) executor
.submit(new MultiThreadedFileReadForDashboard(DashboardSectionList, sftpChannel,customQuery));
executor.shutdown();
while (!executor.isTerminated()) {
}
}
Class for task:
public class MultiThreadedFileReadForDashboard implements Callable {
public MultiThreadedFileReadForDashboard(List<DSection> dashboardSectionList, ChannelSftp sftpChannel,
CustomQueryImpl customQuery) {
this.dashboardSectionList = dashboardSectionList;
this.sftpChannel = sftpChannel;
this.customQuery = customQuery;
}
public List<DSection> call() throws Exception {
for (int i = 0; i < dashboardSectionList.size(); ++i) {
DSection DSection = dashboardSectionList.get(i);
List<LView> linkedViewList = new ArrayList<LView>(DSection.getLinkedViewList());
LView lView;
for (int j = 0; j < linkedViewList.size(); ++j) {
lView = linkedViewList.get(j);
int UserQueryId = Integer.parseInt(lView.getUserQueryId());
outputFileName = customQuery.fetchTableInfo(UserQueryId);
if ((outputFileName != null) && (!outputFileName.equalsIgnoreCase(""))) {
String data = readFiles(outputFileName);
lView.setData(data);
} else {
lView.setData("No File is present");
}
}
if (size == dashboardSectionList.size()) {
break;
}
}
return dashboardSectionList;
}
private String readFiles(String outputFileName) {
String response = null;
try {
InputStream in = sftpChannel.get(outputFileName);
BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
StringBuilder inputData = new StringBuilder("");
String line = null;
while ((line = br.readLine()) != null) {
inputData.append(line).append("\n");
}
JSONArray array = null;
if (outputFileName.toLowerCase().contains("csv")) {
array = CDL.toJSONArray(inputData.toString());
} else {
}
response = array.toString();
} catch (Exception e) {
e.printStackTrace();
}
return response;
// TODO Auto-generated method stub
}
}
}
I do not see multiple files being read through multithreading here. I see one task submitted to the ExecutorService, and that task reads all the files. Multithreading is achieved by submitting multiple tasks to the ExecutorService, each given one file to process (for example via its constructor).
Here is what I think you should do:
Inside the inner for loop, construct a task that is given outputFileName in its constructor and submit it to the executor, getting back a Future instance. After all tasks have been submitted, you will have a List<Future> that you can query to see when they are done and to collect results. Each task calls readFiles() (an odd name for a method that reads one file...).
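A rough sketch of that shape, reusing the names from the code above (the single-file readFile helper and the pool sizing are illustrative, not from the original code):

// Submit one Callable per file; collect the Futures and join them afterwards.
ExecutorService executor = Executors.newFixedThreadPool(
        Math.max(1, Runtime.getRuntime().availableProcessors() - 1));
List<Future<String>> futures = new ArrayList<>();
List<LView> views = new ArrayList<>();
for (LView lView : linkedViewList) {
    final String outputFileName = customQuery.fetchTableInfo(Integer.parseInt(lView.getUserQueryId()));
    futures.add(executor.submit(() -> readFile(outputFileName))); // one task = one file
    views.add(lView);
}
executor.shutdown();
for (int i = 0; i < futures.size(); i++) {
    try {
        views.get(i).setData(futures.get(i).get()); // get() blocks until that file has been read
    } catch (InterruptedException | ExecutionException e) {
        e.printStackTrace();
    }
}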
I am trying to index more than 100k documents using the Java executor service in order to index much faster.
I read the 100k+ file paths from the index documents_qa using the scroll API. The actual files are on my local d:\ drive. Using each file path I read the actual file, convert it to base64, and reindex the base64 content into another index, document_attachment_qa.
Please find my code below.
public class DocumentIndex {
private final static String INDEX = "documents_qa";
private final static String TYPE = "doc";
private static final Logger logger = Logger.getLogger(DocumentIndex.class.getName());
public static void main(String[] args) throws IOException {
ExecutorService executor = Executors.newFixedThreadPool(5);
List<String> filePathList = new ArrayList<String>();
Document doc=new Document();
logger.info("Started Indexing the Document.....");
//Fetching Id, FilePath & FileName from Document Index.
SearchRequest searchRequest = new SearchRequest(INDEX);
searchRequest.types(TYPE);
final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(60L)); //part of Scroll API
searchRequest.scroll(scroll); //part of Scroll API
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryBuilder qb = QueryBuilders.matchAllQuery();
searchSourceBuilder.query(qb);
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = SearchEngineClient.getInstance3().search(searchRequest);
String scrollId = searchResponse.getScrollId(); //part of Scroll API
SearchHit[] searchHits = searchResponse.getHits().getHits();
long totalHits=searchResponse.getHits().totalHits;
//part of Scroll API -- Starts
while (searchHits != null && searchHits.length > 0) {
SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
scrollRequest.scroll(scroll);
searchResponse = SearchEngineClient.getInstance3().searchScroll(scrollRequest);
scrollId = searchResponse.getScrollId();
searchHits = searchResponse.getHits().getHits();
Map<String, Object> jsonMap ;
for (SearchHit hit : searchHits) {
Map<String, Object> sourceAsMap = hit.getSourceAsMap();
if(sourceAsMap != null) {
doc.setId((int) sourceAsMap.get("id"));
doc.setApp_language(String.valueOf(sourceAsMap.get("app_language")));
doc.setFilename(String.valueOf(sourceAsMap.get("filename")));
doc.setPath(String.valueOf(sourceAsMap.get("path")));
}
if(doc.getPath()!= null && doc.getFilename() != null) {
filePathList.add(doc.getPath().concat(doc.getFilename()));
}
}
for (int i = 0; i < filePathList.size(); i++) {
Runnable worker = new WorkerThread(doc);
executor.execute(worker);
}
}
executor.shutdown();
while (!executor.isTerminated()) {
}
System.out.println("Finished all threads");
}
}
Please find the worker thread below:
public class WorkerThread implements Runnable {
private String command;
private Document doc;
private final static String ATTACHMENT = "document_attachment_qa";
private static final Logger logger = Logger.getLogger(Thread.currentThread().getStackTrace()[0].getClassName());
Map<String, Object> jsonMap ;
List<String> filePathList = new ArrayList<String>();
public WorkerThread(Document doc){
this.doc=doc;
}
@Override
public void run() {
File all_files_path = new File("d:\\All_Files_Path.txt");
File available_files = new File("d:\\Available_Files.txt");
int totalFilePath=1;
int totalAvailableFile=1;
String encodedfile = null;
File file=null;
if(doc.getPath()!= null && doc.getFilename() != null) {
filePathList.add(doc.getPath().concat(doc.getFilename()));
}
PrintWriter out=null;
try{
out = new PrintWriter(new FileOutputStream(all_files_path, true));
for(int i=0;i<filePathList.size();i++) {
out.println("FilePath Count ---"+totalFilePath+":::::::ID---> "+doc.getId()+"File Path --->"+filePathList.get(i));
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
finally {
out.close();
}
for(int i=0;i<filePathList.size();i++) {
file = new File(filePathList.get(i));
if(file.exists() && !file.isDirectory()) {
try {
try(PrintWriter out1 = new PrintWriter(new FileOutputStream(available_files, true)) ){
out1.println("Available File Count --->"+totalAvailableFile+":::::::ID---> "+doc.getId()+"File Path --->"+filePathList.get(i));
totalAvailableFile++;
}
FileInputStream fileInputStreamReader = new FileInputStream(file);
byte[] bytes = new byte[(int) file.length()];
fileInputStreamReader.read(bytes);
encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
fileInputStreamReader.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
jsonMap = new HashMap<String, Object>();
jsonMap.put("id", doc.getId());
jsonMap.put("app_language", doc.getApp_language());
jsonMap.put("fileContent", encodedfile);
System.out.println(Thread.currentThread().getName()+" End.");
String id=Long.toString(doc.getId());
IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id )
.source(jsonMap)
.setPipeline(ATTACHMENT);
try {
IndexResponse response = SearchEngineClient.getInstance3().index(request);
} catch(ElasticsearchException e) {
if (e.status() == RestStatus.CONFLICT) {
}
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
processCommand();
}
}
private void processCommand() {
try {
Thread.sleep(5000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
@Override
public String toString(){
return this.command;
}
}
The issue is that after indexing the first document, it takes a very long time to process the next one, and eventually the code terminates without indexing anything further.
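A likely culprit: every pass of the scroll loop re-submits the entire, still-growing filePathList, every task shares the single mutable Document (so most tasks see the last hit's data), each task sleeps five seconds in processCommand(), and the first scroll page is never processed because the loop fetches the next batch before reading its hits. Below is a minimal sketch of a per-hit submission, one fresh Document per task; variable names are illustrative.

// Inside the scroll loop, processing the current batch of hits:
for (SearchHit hit : searchHits) {
    Map<String, Object> sourceAsMap = hit.getSourceAsMap();
    if (sourceAsMap == null) continue;
    Document hitDoc = new Document(); // fresh instance per task, never shared across threads
    hitDoc.setId((int) sourceAsMap.get("id"));
    hitDoc.setApp_language(String.valueOf(sourceAsMap.get("app_language")));
    hitDoc.setFilename(String.valueOf(sourceAsMap.get("filename")));
    hitDoc.setPath(String.valueOf(sourceAsMap.get("path")));
    if (hitDoc.getPath() != null && hitDoc.getFilename() != null) {
        executor.execute(new WorkerThread(hitDoc)); // submit each document exactly once
    }
}

Dropping the Thread.sleep(5000) in processCommand() would also help.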
The context is as follows:
I've got objects that represent tweets (from Twitter). Each object has an id, a date, and the id of the original tweet (if there was one).
I receive a file of tweets (each tweet on its own line, in the format 05/04/2014 12:00:00, tweetID, originalID) and I want to save them to an XML file where each field has its own tag.
I then want to be able to read the file back and return a list of Tweet objects corresponding to the tweets in the XML file.
After writing the XML parser that does this, I want to test that it works correctly, and I've got no idea how to test this.
The XML Parser:
public class TweetToXMLConverter implements TweetImporterExporter {
//there is a single file used for the tweets database
static final String xmlPath = "src/main/resources/tweetsDataBase.xml";
//some "defines", as we like to call them ;)
static final String DB_HEADER = "tweetDataBase";
static final String TWEET_HEADER = "tweet";
static final String TWEET_ID_FIELD = "id";
static final String TWEET_ORIGIN_ID_FIELD = "originalTweet"; // XML names cannot contain spaces
static final String TWEET_DATE_FIELD = "tweetDate";
static File xmlFile;
static boolean initialized = false;
@Override
public void createDB() {
try {
Element tweetDB = new Element(DB_HEADER);
Document doc = new Document(tweetDB);
doc.setRootElement(tweetDB);
XMLOutputter xmlOutput = new XMLOutputter();
// pretty-print the output
xmlOutput.setFormat(Format.getPrettyFormat());
xmlOutput.output(doc, new FileWriter(xmlPath));
xmlFile = new File(xmlPath);
initialized = true;
} catch (IOException io) {
System.out.println(io.getMessage());
}
}
@Override
public void addTweet(Tweet tweet) {
if (!initialized) {
//TODO throw an exception? should not come to pass!
return;
}
SAXBuilder builder = new SAXBuilder();
try {
Document document = (Document) builder.build(xmlFile);
Element newTweet = new Element(TWEET_HEADER);
newTweet.setAttribute(new Attribute(TWEET_ID_FIELD, tweet.getTweetID()));
newTweet.setAttribute(new Attribute(TWEET_DATE_FIELD, tweet.getDate().toString()));
if (tweet.isRetweet())
newTweet.addContent(new Element(TWEET_ORIGIN_ID_FIELD).setText(tweet.getOriginalTweet()));
document.getRootElement().addContent(newTweet);
// write the document back out, so the new tweet actually reaches the file
XMLOutputter xmlOutput = new XMLOutputter();
xmlOutput.setFormat(Format.getPrettyFormat());
xmlOutput.output(document, new FileWriter(xmlPath));
} catch (IOException io) {
System.out.println(io.getMessage());
} catch (JDOMException jdomex) {
System.out.println(jdomex.getMessage());
}
}
//break glass in case of emergency
@Override
public void addListOfTweets(List<Tweet> list) {
for (Tweet t : list) {
addTweet(t);
}
}
@Override
public List<Tweet> getListOfTweets() {
if (!initialized) {
//TODO throw an exception? should not come to pass!
return null;
}
try {
SAXBuilder builder = new SAXBuilder();
Document document;
document = (Document) builder.build(xmlFile);
List<Tweet> tweets = new ArrayList<Tweet>();
for (Object o : document.getRootElement().getChildren(TWEET_HEADER)) {
    Element rawTweet = (Element) o;
    String id = rawTweet.getAttributeValue(TWEET_ID_FIELD);
    String original = rawTweet.getChildText(TWEET_ORIGIN_ID_FIELD);
    Date date = new Date(rawTweet.getAttributeValue(TWEET_DATE_FIELD));
    tweets.add(new Tweet(id, original, date));
}
return tweets;
} catch (JDOMException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return null;
}
}
Some usage:
private TweetImporterExporter converter = new TweetToXMLConverter();
List<Tweet> tweetList = converter.getListOfTweets();
for (String tweetString : lines)
converter.addTweet(new Tweet(tweetString));
How can I make sure that the XML file I read (which contains tweets) corresponds to the file I receive (in the format stated above)?
How can I make sure that the tweets I add to the file correspond to the ones I tried to add?
Assuming that you have the following model:
public class Tweet {
private Long id;
private Date date;
private Long originalTweetid;
// getters and setters
}
The process would be the following:
create an instance of TweetToXMLConverter
create a list of Tweet instances that you expect to receive after parsing the file
feed the converter the list you generated
compare the list returned by the parser with the list you created at the start of the test
public class MainTest {
private TweetToXMLConverter converter;
private List<Tweet> tweets = new ArrayList<Tweet>();
@Before
public void setup() {
Tweet tweet = new Tweet(1, "05/04/2014 12:00:00", 2);
Tweet tweet2 = new Tweet(2, "06/04/2014 12:00:00", 1);
Tweet tweet3 = new Tweet(3, "07/04/2014 12:00:00", 2);
tweets.add(tweet);
tweets.add(tweet2);
tweets.add(tweet3);
converter = new TweetToXMLConverter();
converter.createDB(); // create the backing XML file; addTweet() is a no-op until this runs
converter.addListOfTweets(tweets);
}
@Test
public void testParse() {
List<Tweet> parsedTweets = converter.getListOfTweets();
Assert.assertEquals(parsedTweets.size(), tweets.size());
for (int i=0; i<parsedTweets.size(); i++) {
//assuming that both lists are sorted
Assert.assertEquals(parsedTweets.get(i), tweets.get(i));
};
}
}
I am using JUnit for the actual testing.
I'm trying to create an XML report that can be opened as an xls table.
I currently get the following output:
<Report>
<test>
<string>4.419</string>
<string>4.256</string>
</test>
</Report>
from this code:
/**
* declare arrays
*/
// ArrayList<String> test = new ArrayList<String>();
ArrayList<String> stats = new ArrayList<String>();
// ArrayList<String> count = new ArrayList<String>();
/**
 * @return array list with loading times
 */
public ArrayList<String> launch() {
for (int i = 0; i < 2; i++) {
// ui.off();
// ui.on();
device.pressHome();
ui.openProgramInMenu("ON");
long TStart = System.currentTimeMillis();
ui.detectContactList();
long TStop = System.currentTimeMillis();
float res = TStop - TStart;
res /= 1000;
ui.log("[loading time]: " + res);
// ui.off();
test.add(i, "Loading time");
stats.add(i, Float.toString(res));
count.add(i, Integer.toString(i));
}
System.out.println(stats);
return stats;
}
where the ReportSettings class (rep) has this code:
public class ReportSettings {
public List<String> test = new ArrayList<String>();
public List<String> count = new ArrayList<String>();
public List<String> stats = new ArrayList<String>();
/**
* Test method
*/
public static void main(String[] args) {
ReportSettings rep = new ReportSettings();
rep.saveXML("report/data.xml");
// System.out.println(rep.test);
// rep = rep.loadXML("report/data.xml");
// System.out.println(rep.home);
System.out.println(rep.getXML());
}
public void createReport() {
ReportSettings rep = new ReportSettings();
rep.saveXML("report/data.xml");
}
public String getXML() {
XStream xstream = new XStream();
xstream.alias("Report", ReportSettings.class);
xstream.autodetectAnnotations(true);
return xstream.toXML(this);
}
public void saveXML(String filename) {
if (!filename.contains(".xml")) {
System.out.println("Error in saveReport syntax");
return;
}
String xml = this.getXML();
File f = new File(filename);
try {
FileOutputStream fo = new FileOutputStream(f);
fo.write(xml.getBytes());
fo.close();
}
catch (FileNotFoundException e) {
e.printStackTrace();
}
catch (IOException e) {
e.printStackTrace();
}
}
public ReportSettings loadXML(String filename) {
if (!filename.endsWith(".xml")) {
System.out.println("Error in loadReport syntax!");
throw new RuntimeException("Error in loadReport syntax!");
}
File f = new File(filename);
XStream xstream = new XStream(new DomDriver());
xstream.alias("Report", ReportSettings.class);
xstream.autodetectAnnotations(true);
ReportSettings ort = (ReportSettings)xstream.fromXML(f);
return ort;
}
}
Finally, I want to build a table from the 3 ArrayLists (stats, count, test), one row per iteration i.
How can I use an XStream alias to change the <string> tags to <somethingAnother> in the XML file? I need to change them to stringOne and stringTwo, for example.
You can use a ClassAliasingMapper in XStream to give the items in your collection a different tag when serializing to XML.
You add a block like this for each collection (stats, count, test):
ClassAliasingMapper statsMapper = new ClassAliasingMapper(xstream.getMapper());
statsMapper.addClassAlias("somethingAnother", String.class);
xstream.registerLocalConverter(
    ReportSettings.class,
    "stats",
    new CollectionConverter(statsMapper)
);
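With that registered, serializing should produce output roughly like this for the stats list (my expectation of the shape, not verified output):

<Report>
  <stats>
    <somethingAnother>4.419</somethingAnother>
    <somethingAnother>4.256</somethingAnother>
  </stats>
</Report>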
Hello, I have a problem wherein I have to read a huge CSV file, remove the first field from each line, and then store only the unique values to a file. I have written a program using threads that implements the producer-consumer pattern.
The class CSVLineStripper does what the name suggests: it takes a line of the CSV, removes the first field from it, and adds that field to a queue. CSVLineProcessor then takes the fields and stores them one by one in an ArrayList, checking that only uniques are stored. The ArrayList is only used for reference; every unique field is written to a file.
Now what is happening is that all fields are stripped correctly. When I run about 3000 lines it is all correct. When I run the program over all lines, around 700,000+, I get incomplete records: about 1000 uniques are missing. Every field is enclosed in double quotes. What is weird is that the last field in the generated file is an incomplete word and its ending double quote is missing. Why is this happening?
import java.util.*;
import java.io.*;
class CSVData
{
Queue <String> refererHosts = new LinkedList <String> ();
Queue <String> uniqueReferers = new LinkedList <String> (); // final writable queue of unique referers
private int finished = 0;
private int safety = 100;
private String line = "";
public CSVData(){}
public synchronized String getCSVLine() throws InterruptedException{
int i = 0;
while(refererHosts.isEmpty()){
if(i < safety){
wait(10);
}else{
return null;
}
i++;
}
finished = 0;
line = refererHosts.poll();
return line;
}
public synchronized void putCSVLine(String CSVLine){
if(finished == 0){
refererHosts.add(CSVLine);
this.notifyAll();
}
}
}
class CSVLineStripper implements Runnable //Producer
{
private CSVData cd;
private BufferedReader csv;
public CSVLineStripper(CSVData cd, BufferedReader csv){ // CONSTRUCTOR
this.cd = cd;
this.csv = csv;
}
public void run() {
System.out.println("Producer running");
String line = "";
String referer = "";
String [] CSVLineFields;
int limit = 700000;
int lineCount = 1;
try {
while((line = csv.readLine()) != null){
CSVLineFields = line.split(",");
referer = CSVLineFields[0];
cd.putCSVLine(referer);
lineCount++;
if(lineCount >= limit){
break;
}
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("<<<<<< PRODUCER FINISHED >>>>>>>");
}
private String printString(String [] str){
String string = "";
for(String s: str){
string = string + " "+s;
}
return string;
}
}
class CSVLineProcessor implements Runnable
{
private CSVData cd;
private FileWriter fw = null;
private BufferedWriter bw = null;
public CSVLineProcessor(CSVData cd, BufferedReader bufferedReader){ // CONSTRUCTOR
this.cd = cd;
try {
this.fw = new FileWriter("unique_referer_dump.txt");
} catch (IOException e) {
e.printStackTrace();
}
this.bw = new BufferedWriter(fw);
}
public void run() {
System.out.println("Consumer Started");
String CSVLine = "";
int safety = 10000;
ArrayList <String> list = new ArrayList <String> ();
while(CSVLine != null || safety <= 10000){
try {
CSVLine = cd.getCSVLine();
if(!list.contains(CSVLine)){
list.add(CSVLine);
this.CSVDataWriter(CSVLine);
}
} catch (Exception e) {
e.printStackTrace();
}
if(CSVLine == null){
break;
}else{
safety++;
}
}
System.out.println("<<<<<< CONSUMER FINISHED >>>>>>>");
System.out.println("Unique referers found in 30000 records "+list.size());
}
private void CSVDataWriter(String referer){
try {
bw.write(referer+"\n");
} catch (Exception e) {
e.printStackTrace();
}
}
}
public class RefererCheck2
{
public static void main(String [] args) throws InterruptedException
{
String pathToCSV = "/home/shantanu/DEV_DOCS/Contextual_Work/excite_domain_kw_site_wise_click_rev2.csv";
CSVResourceHandler csvResHandler = new CSVResourceHandler(pathToCSV);
CSVData cd = new CSVData();
CSVLineProcessor consumer = new CSVLineProcessor(cd, csvResHandler.getCSVFileHandler());
CSVLineStripper producer = new CSVLineStripper(cd, csvResHandler.getCSVFileHandler());
Thread consumerThread = new Thread(consumer);
Thread producerThread = new Thread(producer);
producerThread.start();
consumerThread.start();
}
}
This is how a sample input is:
"xyz.abc.com","4432"."clothing and gifts","true"
"pqr.stu.com","9537"."science and culture","false"
"0.stu.com","542331"."education, studies","false"
"m.dash.com","677665"."technology, gadgets","false"
Producer stores in queue:
"xyz.abc.com"
"pqr.stu.com"
"0.stu.com"
"m.dash.com"
The consumer stores the uniques in the file, but after opening the file one would see:
"xyz.abc.com"
"pqr.stu.com"
"0.st
A couple of things: you break out after 700k lines (limit = 700000), so anything past that is never read; and you never flush or close your BufferedWriter, so the last buffered data can be incomplete. Add a flush at the end and close all your resources. A debugger is a good idea :)
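Concretely, that means flushing and closing bw when the consume loop in CSVLineProcessor.run() ends; a minimal sketch of the method's tail:

// At the end of CSVLineProcessor.run(), after the while loop exits:
try {
    bw.flush(); // push any buffered output to disk; without this the tail of the file is lost
    bw.close(); // also closes the underlying FileWriter
} catch (IOException e) {
    e.printStackTrace();
}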