I have a problem while using the Weka StringToWordVector. How can I create a word matrix from a list of strings?
In my code, I create instances from strings. As soon as I call setInputFormat(), the code runs into some kind of loop with no processing activity in the background. It never produces a result; it just keeps running without progress and without an error.
Here is the code example that causes my problem:
ArrayList<String> stringList = new ArrayList<>();
stringList.add("test1");
stringList.add("test2");

// list of attributes for the Instances header
ArrayList<Attribute> attributesList = new ArrayList<>();
Attribute attributeContent = new Attribute("content", (ArrayList<String>) null); // string attribute
attributesList.add(attributeContent);

Instances data = new Instances("Strings", attributesList, stringList.size());
for (String s : stringList) {
    DenseInstance instance = new DenseInstance(1);
    instance.setValue(attributesList.get(0), s);
    data.add(instance);
}
StringToWordVector filter = new StringToWordVector();
Instances newData = null;
try {
    filter.setInputFormat(data);
    newData = Filter.useFilter(data, filter);
} catch (Exception e) {
    e.printStackTrace();
}
OK, the code works fine... the problem was with my dependencies in Java. Because of the Play Framework I was using, Java did not auto-load an alternative for the netlib library. This library picks a linear-algebra implementation optimised for the OS; I had to set the following flag in IntelliJ and it works:
I'm using Java with JDBC to run MySQL code. I want to execute a DDL script, but JDBC can only execute a single statement at a time, which makes it unsuitable for executing a whole .sql file out of the box.
What I'm trying to do is use Antlr4 to parse the .sql file so I can break up each individual statement and then iteratively execute them with JDBC.
I've gotten this far:
InputStream resourceAsStream = Main.class.getClassLoader()
.getResourceAsStream("an-arbitrary-ddl.sql");
CharStream codePointCharStream = CharStreams.fromStream(resourceAsStream);
MySqlLexer tokenSource = new MySqlLexer(new CaseChangingCharStream(codePointCharStream, true));
TokenStream tokenStream = new CommonTokenStream(tokenSource);
MySqlParser mySqlParser = new MySqlParser(tokenStream);
// Where do I go from here?
I'm sure I'm just not searching for the correct terms because I'm new to ANTLR and to parsing code manually. I can't find any reference from here as to what I need to do to get individual SQL statements out of the MySqlParser. What do I need to do next?
A parser is not the right tool for this kind of problem. A statement splitter is pretty easy to write manually and much faster if you do it yourself. I implemented such a splitter in C++ in MySQL Workbench; it shouldn't be difficult to port it to Java. The code is very fast (1 million lines of SQL code in under 1 second on an average machine). A parser would need much longer.
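For a rough idea of what such a splitter can look like in Java, here is a minimal sketch (my own simplification, not the MySQL Workbench code): it splits on ';' outside of quoted literals and comments, and it deliberately ignores complications such as DELIMITER commands and stored routines, so treat it as a starting point only.
import java.util.ArrayList;
import java.util.List;

public class SimpleSqlSplitter {

    public static List<String> split(String script) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        int n = script.length();
        while (i < n) {
            char c = script.charAt(i);
            if (c == '\'' || c == '"' || c == '`') {
                // copy a quoted literal/identifier verbatim, honouring backslash escapes
                char quote = c;
                current.append(c);
                i++;
                while (i < n) {
                    char q = script.charAt(i);
                    current.append(q);
                    i++;
                    if (q == '\\' && i < n) {        // escaped character
                        current.append(script.charAt(i));
                        i++;
                    } else if (q == quote) {
                        break;
                    }
                }
            } else if (c == '-' && i + 1 < n && script.charAt(i + 1) == '-') {
                // line comment: skip to end of line
                while (i < n && script.charAt(i) != '\n') i++;
            } else if (c == '/' && i + 1 < n && script.charAt(i + 1) == '*') {
                // block comment: skip to the closing */
                i += 2;
                while (i + 1 < n && !(script.charAt(i) == '*' && script.charAt(i + 1) == '/')) i++;
                i += 2;
            } else if (c == ';') {
                // end of statement
                String stmt = current.toString().trim();
                if (!stmt.isEmpty()) statements.add(stmt);
                current.setLength(0);
                i++;
            } else {
                current.append(c);
                i++;
            }
        }
        String tail = current.toString().trim();
        if (!tail.isEmpty()) statements.add(tail);
        return statements;
    }
}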
I'm sure this can be improved, but the simplest way I could come up with was to create a listener and provide its constructor with a Consumer<String> object. The listener looks at individual statements and recursively reconstructs them. There is probably a more optimal solution, but I no longer have time to look for one.
/**
 * @author Paul Nelson Baker
 * @see GitHub
 * @see LinkedIn
 * @since 2018-09
 */
public class SqlStatementListener extends MySqlParserBaseListener {

    private final Consumer<String> sqlStatementConsumer;

    public SqlStatementListener(Consumer<String> sqlStatementConsumer) {
        this.sqlStatementConsumer = sqlStatementConsumer;
    }

    @Override
    public void enterSqlStatement(MySqlParser.SqlStatementContext ctx) {
        if (ctx.getChildCount() > 0) {
            StringBuilder stringBuilder = new StringBuilder();
            recreateStatementString(ctx.getChild(0), stringBuilder);
            stringBuilder.setCharAt(stringBuilder.length() - 1, ';');
            String recreatedSqlStatement = stringBuilder.toString();
            sqlStatementConsumer.accept(recreatedSqlStatement);
        }
        super.enterSqlStatement(ctx);
    }

    private void recreateStatementString(ParseTree currentNode, StringBuilder stringBuilder) {
        if (currentNode instanceof TerminalNode) {
            stringBuilder.append(currentNode.getText());
            stringBuilder.append(' ');
        }
        for (int i = 0; i < currentNode.getChildCount(); i++) {
            recreateStatementString(currentNode.getChild(i), stringBuilder);
        }
    }
}
Next you need to traverse the statements; the string consumer from earlier allows you to lazily redirect the output wherever you need. This can be as simple as printing to stdout, but it can just as easily be used to append to a list.
public List<String> mySqlStatementsFrom(String sourceCode) {
    List<String> statements = new ArrayList<>();
    mySqlStatementsToConsumer(sourceCode, statements::add);
    return statements;
}

public void mySqlStatementsToConsumer(String sourceCode, Consumer<String> mySqlStatementConsumer) {
    CharStream codePointCharStream = CharStreams.fromString(sourceCode);
    MySqlLexer tokenSource = new MySqlLexer(new CaseChangingCharStream(codePointCharStream, true));
    TokenStream tokenStream = new CommonTokenStream(tokenSource);
    MySqlParser mySqlParser = new MySqlParser(tokenStream);
    SqlStatementListener statementListener = new SqlStatementListener(mySqlStatementConsumer);
    ParseTreeWalker.DEFAULT.walk(statementListener, mySqlParser.sqlStatements());
}
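To tie this back to the original goal of executing the script over JDBC, a usage sketch might look like the following; the JDBC URL and credentials are placeholders, readAllBytes() requires Java 9+, and exception handling is omitted.
// Usage sketch: read the script from the classpath, split it with the helper above,
// and execute each statement over plain JDBC. URL and credentials are placeholders.
String ddl;
try (InputStream in = Main.class.getClassLoader().getResourceAsStream("an-arbitrary-ddl.sql")) {
    ddl = new String(in.readAllBytes(), StandardCharsets.UTF_8); // readAllBytes() needs Java 9+
}
try (Connection connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "user", "password");
     Statement statement = connection.createStatement()) {
    for (String sql : mySqlStatementsFrom(ddl)) {
        statement.execute(sql);
    }
}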
I'm trying to search for strings in an ArrayList as shown below, which works fine. The challenge is when the string I'm searching for is embedded in other text, for example "XXXXTESTXXXX".
The binary search I have so far doesn't seem to find it. I have also tried the contains method, but that didn't work either. I'm not sure where I'm going wrong. Please suggest.
Array Content :[PIGEON, XXXBEARXXX , XXXCAT, XXXDOG, XXXELEPHANTXXX , XXXHORSEXXX , XXXLIONXXX , XXXMOUSEXXX , XXXOWLXXX , XXXPARROTXXX , XXXTIGERXXX ]
Example search string:-
Search String = "BEAR"
Code
ArrayList<String> File_F1_Array = new ArrayList<>();
String File_F1_Line;
String File_F2_Line;

// Read the lines of the source file (File_1) into the ArrayList
try {
    BufferedReader File_F1_Br = new BufferedReader(new FileReader(File_F1));
    while ((File_F1_Line = File_F1_Br.readLine()) != null) {
        File_F1_Array.add(File_F1_Line);
    }
    File_F1_Br.close();
} catch (Exception e) {
    e.printStackTrace();
}

// Sort the ArrayList
Collections.sort(File_F1_Array);

// Search lines from the reference file (File_2) in the ArrayList
try {
    BufferedReader File_F2_br = new BufferedReader(new FileReader(File_F2));
    while ((File_F2_Line = File_F2_br.readLine()) != null) {
        int index = Collections.binarySearch(File_F1_Array, File_F2_Line);
        boolean StringCheck = File_F1_Array.contains(File_F2_Line);
    }
    File_F2_br.close();
} catch (Exception e) {
    e.printStackTrace();
}
Your approach is not correct: you are searching a list whose entries are wrapped in "XXX...", but your search keys do not have that padding, and Collections.binarySearch and contains only find exact matches. You either have to build a second list with the "XXX" padding stripped (keeping some key that maps back to the original entries), or compare with a substring check instead of an exact match.
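For example, a plain linear scan with a substring check will find "BEAR" inside "XXXBEARXXX". This is a minimal sketch using the list from the question; note that the sample entries also carry trailing spaces, which an exact match would additionally miss.
// Simple substring search instead of an exact-match binary search.
// O(n) per lookup, which is usually fine for file-sized lists.
String searchString = "BEAR";
String match = null;
for (String line : File_F1_Array) {
    if (line.contains(searchString)) {
        match = line;
        break;
    }
}
System.out.println(match != null ? "Found: " + match : "Not found");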
I have a module based on Apache Lucene 5.5 / 6.0 which retrieves keywords. Everything is working fine except one thing — Lucene doesn't filter stop words.
I tried to enable stop word filtering with two different approaches.
Approach #1:
tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();
Approach #2:
tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream.reset();
The full code is available here:
https://stackoverflow.com/a/36237769/462347
My questions:
Why doesn't Lucene filter the stop words?
How can I enable the stop words filtering in Lucene 5.5 / 6.0?
Just tested both approach 1 and approach 2, and they both seem to filter out stop words just fine. Here is how I tested it:
public static void main(String[] args) throws IOException, ParseException, org.apache.lucene.queryparser.surround.parser.ParseException
{
    StandardTokenizer stdToken = new StandardTokenizer();
    stdToken.setReader(new StringReader("Some stuff that is in need of analysis"));
    TokenStream tokenStream;

    // Your code starts here
    tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
    tokenStream.reset();
    // And ends here

    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        System.out.println(token.toString());
    }
    tokenStream.close();
}
Results:
some
stuff
need
analysis
Which has eliminated the four stop words in my sample.
The pitfall was in Lucene's default stop-word list: I expected it to be much broader than it actually is.
Here is the code, which first tries to load a customized stop-word list and falls back to the standard one if that fails:
CharArraySet stopWordsSet;
try {
    // use customized stop words list
    String stopWordsDictionary = FileUtils.readFileToString(new File(%PATH_TO_FILE%));
    stopWordsSet = WordlistLoader.getWordSet(new StringReader(stopWordsDictionary));
} catch (FileNotFoundException e) {
    // use standard stop words list
    stopWordsSet = CharArraySet.copy(StandardAnalyzer.STOP_WORDS_SET);
}
tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), stopWordsSet);
tokenStream.reset();
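If you just want a broader list without maintaining a file, an alternative sketch (reusing the stdToken from the test above; the extra words are only examples) is to copy the default set and add your own entries to it:
// Alternative sketch: copy the default English stop set (the copy is modifiable)
// and add extra entries of your own.
CharArraySet customStopWords = CharArraySet.copy(EnglishAnalyzer.getDefaultStopSet());
customStopWords.add("stuff");     // example extra stop words
customStopWords.add("analysis");
tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), customStopWords);
tokenStream.reset();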
For IP reasons I'm not able to post the full source code. However, I make a call to submit an Amazon Elastic MapReduce (EMR) job, which now runs to completion. Previously it failed with essentially a file-not-found error.
RunJobFlowResult result=emr.runJobFlow(request);
succeeds and I can get the job flow ID from it.
Later, I have a loop that polls for the status. It first creates
DescribeJobFlowsRequest request=new DescribeJobFlowsRequest(jobFlowIdArray);
I check each state in a loop by calling
request.getJobFlowStates()
Unfortunately, that call always returns an empty collection, regardless of whether the job is running, failed or succeeded. How can I get at least some indication of what's going on?
AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
AmazonElasticMapReduceClient client = new AmazonElasticMapReduceClient(credentials);
client.setEndpoint("elasticmapreduce.us-east-1.amazonaws.com");

StepFactory stepFactory = new StepFactory();
StepConfig enableDebugging = new StepConfig()
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

String[] arguments = {...}; // Custom jar arguments
HadoopJarStepConfig jarConfig = new HadoopJarStepConfig();
jarConfig.setJar(JAR_NAME);
jarConfig.setArgs(Arrays.asList(arguments));
StepConfig runJar = new StepConfig(JAR_NAME.substring(JAR_NAME.indexOf('/') + 1), jarConfig);

RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("...")
        .withSteps(runJar)
        .withLogUri("...")
        .withInstances(
                new JobFlowInstancesConfig()
                        .withHadoopVersion("1.0.3")
                        .withInstanceCount(5)
                        .withKeepJobFlowAliveWhenNoSteps(false)
                        .withMasterInstanceType("m1.small")
                        .withSlaveInstanceType("m1.small"));
RunJobFlowResult result = client.runJobFlow(request);
String jobFlowID = result.getJobFlowId();

List<String> describeJobFlowIdList = new ArrayList<String>(1);
describeJobFlowIdList.add(jobFlowID);
String lastState = "";
boolean jobMonitoringNotDone = true;
while (jobMonitoringNotDone) {
    DescribeJobFlowsRequest describeJobFlowsRequest =
            new DescribeJobFlowsRequest(describeJobFlowIdList);
    // Call to describeJobFlowsRequest.getJobFlowStates() always returns an
    // empty list, even when the job succeeds or fails.
    for (String state : describeJobFlowsRequest.getJobFlowStates()) {
        if (DONE_STATES.contains(state)) {
            jobMonitoringNotDone = false;
        } else if (!lastState.equals(state)) {
            lastState = state;
            System.out.println("Job " + state + " at " + new Date().toString());
        }
    }
    try {
        Thread.sleep(10000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
The code above was missing a call similar to
DescribeJobFlowsResult describeJobFlowsResult = client.describeJobFlows(describeJobFlowsRequest);
This got me a solution that works, but unfortunately Amazon has deprecated the method without providing a drop-in alternative. I wish I had a non-deprecated solution, so this is only a partial answer.
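For completeness, the working loop reads the states from the result rather than from the request object. A rough sketch of the body of the polling loop, assuming the (deprecated) JobFlowDetail and getExecutionStatusDetail() accessors:
// Sketch only: poll the *result* of describeJobFlows, not the request object.
DescribeJobFlowsRequest describeJobFlowsRequest =
        new DescribeJobFlowsRequest(describeJobFlowIdList);
DescribeJobFlowsResult describeJobFlowsResult = client.describeJobFlows(describeJobFlowsRequest);
for (JobFlowDetail jobFlowDetail : describeJobFlowsResult.getJobFlows()) {
    String state = jobFlowDetail.getExecutionStatusDetail().getState();
    if (DONE_STATES.contains(state)) {
        jobMonitoringNotDone = false;
    } else if (!lastState.equals(state)) {
        lastState = state;
        System.out.println("Job " + state + " at " + new Date());
    }
}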
I am getting Java Heap Space Error while writing large data from database to an excel sheet.
I don't want to use the JVM -Xmx option to increase the memory.
Following are the details:
1) I am using the org.apache.poi.hssf API for Excel sheet writing.
2) JDK version 1.5
3) Tomcat 6.0
The code I have written works well for around 23 thousand records, but it fails for more than 23K records.
Following is the code:
ArrayList l_objAllTBMList = new ArrayList();
l_objAllTBMList = (ArrayList) m_objFreqCvrgDAO.fetchAllTBMUsers(p_strUserTerritoryId);
ArrayList l_objDocList = new ArrayList();
m_objTotalDocDtlsInDVL = new HashMap();
Object l_objTBMRecord[] = null;
Object l_objVstdDocRecord[] = null;
int l_intDocLstSize = 0;
VisitedDoctorsVO l_objVisitedDoctorsVO = null;
int l_tbmListSize = l_objAllTBMList.size();
System.out.println(" getMissedDocDtlsList_NSM ");
for (int i = 0; i < l_tbmListSize; i++)
{
    l_objTBMRecord = (Object[]) l_objAllTBMList.get(i);
    l_objDocList = (ArrayList) m_objGenerateVisitdDocsReportDAO.fetchAllDocDtlsInDVL_NSM((String) l_objTBMRecord[1], p_divCode, (String) l_objTBMRecord[2], p_startDt, p_endDt, p_planType, p_LMSValue, p_CycleId, p_finYrId);
    l_intDocLstSize = l_objDocList.size();
    try {
        l_objVOFactoryForDoctors = new VOFactory(l_intDocLstSize, VisitedDoctorsVO.class);
        /* Factory class written to create and maintain a limited number of Value Objects (VOs) */
    } catch (ClassNotFoundException ex) {
        m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:" + ex);
    } catch (InstantiationException ex) {
        m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:" + ex);
    } catch (IllegalAccessException ex) {
        m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:" + ex);
    }
    for (int j = 0; j < l_intDocLstSize; j++)
    {
        l_objVstdDocRecord = (Object[]) l_objDocList.get(j);
        l_objVisitedDoctorsVO = (VisitedDoctorsVO) l_objVOFactoryForDoctors.getVo();
        if (((String) l_objVstdDocRecord[6]).equalsIgnoreCase("-"))
        {
            // Compare strings with equals() rather than !=
            if (!"null".equals(String.valueOf(l_objVstdDocRecord[2])))
            {
                l_objVisitedDoctorsVO.setPotential_score(String.valueOf(l_objVstdDocRecord[2]));
                l_objVisitedDoctorsVO.setEmpcode((String) l_objTBMRecord[1]);
                l_objVisitedDoctorsVO.setEmpname((String) l_objTBMRecord[0]);
                l_objVisitedDoctorsVO.setDoctorid((String) l_objVstdDocRecord[1]);
                l_objVisitedDoctorsVO.setDr_name((String) l_objVstdDocRecord[4] + " " + (String) l_objVstdDocRecord[5]);
                l_objVisitedDoctorsVO.setDoctor_potential((String) l_objVstdDocRecord[3]);
                l_objVisitedDoctorsVO.setSpeciality((String) l_objVstdDocRecord[7]);
                l_objVisitedDoctorsVO.setActualpractice((String) l_objVstdDocRecord[8]);
                l_objVisitedDoctorsVO.setLastmet("-");
                l_objVisitedDoctorsVO.setPreviousmet("-");
                m_objTotalDocDtlsInDVL.put((String) l_objVstdDocRecord[1], l_objVisitedDoctorsVO);
            }
        }
    } // End of inner for loop
    writeExcelSheet(); // Pasting this method at the end

    // Clean-up code
    l_objVOFactoryForDoctors.resetFactory();
    m_objTotalDocDtlsInDVL.clear(); // Clear the used map
    l_objDocList = null;
    l_objTBMRecord = null;
    l_objVstdDocRecord = null;
} // End of outer for loop
l_objAllTBMList = null;
m_objTotalDocDtlsInDVL = null;
-------------------------------------------------------------------
private void writeExcelSheet() throws IOException
{
    HSSFRow l_objRow = null;
    HSSFCell l_objCell = null;
    VisitedDoctorsVO l_objVisitedDoctorsVO = null;
    Iterator l_itrDocMap = m_objTotalDocDtlsInDVL.keySet().iterator();
    while (l_itrDocMap.hasNext())
    {
        Object key = l_itrDocMap.next();
        l_objVisitedDoctorsVO = (VisitedDoctorsVO) m_objTotalDocDtlsInDVL.get(key);
        l_objRow = m_objSheet.createRow(m_iRowCount++);

        l_objCell = l_objRow.createCell(0);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(String.valueOf(l_intSrNo++)); // Serial number

        l_objCell = l_objRow.createCell(1);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getEmpname() + " (" + l_objVisitedDoctorsVO.getEmpcode() + ")"); // TBM name

        l_objCell = l_objRow.createCell(2);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getDr_name()); // Doctor name

        l_objCell = l_objRow.createCell(3);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getPotential_score()); // Frequency potential score

        l_objCell = l_objRow.createCell(4);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getDoctor_potential()); // Doctor potential

        l_objCell = l_objRow.createCell(5);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getSpeciality()); // CP_GP_SPL

        l_objCell = l_objRow.createCell(6);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getActualpractice()); // Actual practice

        l_objCell = l_objRow.createCell(7);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getPreviousmet()); // Previously met

        l_objCell = l_objRow.createCell(8);
        l_objCell.setCellStyle(m_objCellStyle4);
        l_objCell.setCellValue(l_objVisitedDoctorsVO.getLastmet()); // Last met
    }

    // Write output stream
    try {
        out = new FileOutputStream(m_objFile);
        outBf = new BufferedOutputStream(out);
        m_objWorkBook.write(outBf);
    } catch (Exception ioe) {
        ioe.printStackTrace();
        System.out.println(" Exception in chunk write");
    } finally {
        if (outBf != null) {
            outBf.flush();
            outBf.close();
            out.close();
            l_objRow = null;
            l_objCell = null;
        }
    }
}
Instead of populating the complete list in memory before starting to write to Excel, you need to modify the code so that each object is written to the file as it is read from the database. Take a look at this question to get an idea of that approach.
Well, I'm not sure whether POI can handle incremental updates, but if it can you might want to write chunks of, say, 10,000 rows to the file. If not, you might have to use CSV instead (so no formatting) or increase the memory.
The problem is that you need to make the objects already written to the file eligible for garbage collection (no references from a live thread any more) before writing the file is finished (before all rows have been generated and written to the file).
Edit:
If you can write smaller chunks of data to the file, you also have to load only the necessary chunks from the DB. It doesn't make sense to load 50,000 records at once and then try to write 5 chunks of 10,000, since those 50,000 records are likely to consume a lot of memory already.
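As a rough illustration of that streaming idea, here is a sketch that writes each database row straight to a CSV file as it is read. The connection, query, column names and file name are placeholders, and exception handling is omitted; CSV is used here because an HSSF workbook keeps all rows in memory.
// Sketch only: stream rows straight to a CSV file instead of building the whole
// map and HSSF workbook in memory. "connection", the query and the column names
// are placeholders.
PreparedStatement statement = connection.prepareStatement(
        "SELECT emp_name, dr_name, potential_score FROM visited_doctors");
ResultSet resultSet = statement.executeQuery();
BufferedWriter writer = new BufferedWriter(new FileWriter("visited_doctors.csv"));
try {
    writer.write("Employee,Doctor,Potential score");
    writer.newLine();
    while (resultSet.next()) {
        writer.write(resultSet.getString("emp_name") + ","
                + resultSet.getString("dr_name") + ","
                + resultSet.getString("potential_score"));
        writer.newLine(); // each row becomes eligible for garbage collection once written
    }
} finally {
    writer.close();
    resultSet.close();
    statement.close();
}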
As Thomas points out, you have too many objects taking up too much space, and you need a way to reduce that. There are a couple of strategies for this I can think of:
Do you need to create a new factory each time in the loop, or can you reuse it?
Can you start with a loop getting the information you need into a new structure, and then discarding the old one?
Can you split the processing into a thread chain, sending information forward to the next step, so you avoid building a large memory-consuming structure at all? (A small sketch follows below.)
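Regarding the last point, a minimal sketch of such a thread chain with a bounded queue might look like this; Row, readNextRowOrNull() and writeRow() are placeholders for your own types and I/O.
// Sketch of the "thread chain" idea: one thread reads rows from the database and
// hands them to a bounded queue, a second thread writes them out as they arrive,
// so at most the queue capacity of rows is held in memory at any time.
final BlockingQueue<Row> queue = new ArrayBlockingQueue<Row>(1000);
final Row POISON_PILL = new Row(); // marks the end of the stream

Thread producer = new Thread(new Runnable() {
    public void run() {
        Row row;
        while ((row = readNextRowOrNull()) != null) {
            try {
                queue.put(row); // blocks when the writer falls behind
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        try {
            queue.put(POISON_PILL);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
});

Thread consumer = new Thread(new Runnable() {
    public void run() {
        try {
            Row row;
            while ((row = queue.take()) != POISON_PILL) {
                writeRow(row); // append to the output file, then let the row be collected
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
});

producer.start();
consumer.start();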