We are building a search feature in our application that needs to traverse more than 100,000 XML files for content.
The data is stored as a huge number of XML files.
Is it a good idea to keep this many XML files and, on every search (e.g. by name), traverse each file for results? It may hurt our application's search performance.
Or what is the best way to do this?
You want Elasticsearch here. It will give you what you need.
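If you do go with Elasticsearch, here is a minimal sketch of indexing one file over its REST API using the JDK's HttpClient (the index name xml-files, the field names, and the document id are made up for illustration):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IndexOneXmlFile {
    public static void main(String[] args) throws Exception {
        // Index the searchable bits of one XML file as a JSON document.
        // Field names ("fileName", "content") are placeholders.
        String json = "{\"fileName\": \"order-123.xml\", \"content\": \"...\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/xml-files/_doc/order-123"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}

A search by name then becomes a query against the index (for example GET /xml-files/_search?q=fileName:order-123.xml) rather than a scan over 100,000 files on disk.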
I'm trying to index a bunch of XML files on my hard drive with Apache Lucene in Java. My idea was, for the best performance, to use the tags in the files as fields in the Lucene index, so when you search for a specific tag in a file (for example, an Ordernumber), the Lucene query would simply be ordernumber:123.
For example, I have this part of a file:
<contactPerson>
<identification>
<source>CUSTOMER</source>
<sourceId>12345678</sourceId>
</identification>
<lastName>Vader</lastName>
<firstName>Darth</firstName>
<telefon>0000</telefon>
<emailAddress>darth.vader@Imperium.com</emailAddress>
<roleType>ORDERER</roleType>
</contactPerson>
Is it now possible to index the XML file, so that I can use the query lastname:Vader to look for the last name? Would you recommend a better solution to search in XML files? Is there maybe an out-of-the-box feature I could use? Would it be easier if the XML files were stored in a database?
Also, later I will use Elasticsearch or Solr for the same task, but then the XML files will be in a database. What different options would be available when using one of these?
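A minimal sketch of this tags-as-fields idea with Lucene and the JDK's DOM parser (the index directory and the input file name are placeholders; every leaf element of the file above becomes a field):

import java.io.File;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlTagIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("lucene-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {

            // Parse one XML file with the JDK's DOM parser.
            org.w3c.dom.Document xml = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("contactPerson.xml"));

            // One Lucene document per file; every leaf element becomes a field,
            // so queries like lastName:Vader or sourceId:12345678 become possible.
            Document doc = new Document();
            NodeList elements = xml.getElementsByTagName("*");
            for (int i = 0; i < elements.getLength(); i++) {
                Element e = (Element) elements.item(i);
                if (e.getElementsByTagName("*").getLength() == 0) {
                    doc.add(new TextField(e.getTagName(), e.getTextContent(), Field.Store.YES));
                }
            }
            writer.addDocument(doc);
        }
    }
}

Note that field names are case-sensitive in Lucene, and StandardAnalyzer lowercases term text, so the query side should use the same analyzer (e.g. through QueryParser) for lastName:Vader to match.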
I've used Apache Flume to pipe a large number of tweets into HDFS. I'm trying to do sentiment analysis on this data - just something simple to begin with, like a positive vs. negative word comparison.
My problem is that all the guides I find showing how to do it assume a text file of positive and negative words and then one huge text file containing every tweet.
As I used Flume, all my data is already in Hadoop. When I access it using localhost:50070 I can see the data, in separate files according to month/day/hour, with each file containing three or four tweets. I have maybe 50 of these files for every hour. Although it doesn't say anywhere, I'm assuming they are in JSON format.
Bearing this in mind, how can I perform my analysis on them? In all the examples I've seen where a Mapper and Reducer have been written, the job runs over a single file, not a large collection of small JSON files. What should my next step be?
This example should get you started
https://github.com/cloudera/cdh-twitter-example
Basically, use a Hive external table to map your JSON data and query it using HiveQL.
When you want to process all the files in a directory, you can just specify the path of the directory as the input path of your Hadoop job, and it will consider all the files in that directory as its input.
For example, if your small files are in the directory /user/flume/tweets/...., then in your Hadoop job you can just specify /user/flume/tweets/ as your input path.
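For instance, a minimal driver might look like the sketch below (SentimentMapper, SentimentReducer, and the output path are placeholders for your own classes; also note that if the tweets sit in month/day/hour subdirectories, you may need a glob such as /user/flume/tweets/*/*/* instead of the bare directory, since FileInputFormat does not descend into subdirectories by default):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetSentimentDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tweet sentiment");
        job.setJarByClass(TweetSentimentDriver.class);
        // job.setMapperClass(SentimentMapper.class);   // plug in your own Mapper
        // job.setReducerClass(SentimentReducer.class); // ... and Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The whole directory is the input; every file under it becomes a split.
        FileInputFormat.addInputPath(job, new Path("/user/flume/tweets/"));
        FileOutputFormat.setOutputPath(job, new Path("/user/flume/tweets_sentiment"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}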
If you want to automate the analysis to run every hour, you will need to write an Oozie workflow.
You can refer to the link below for sentiment analysis in Hive:
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/
I have an application that receives weather information every x seconds. I want to save this data to an XML file.
Should I create a new XML file for each weather notification, or append each notification to the same XML file? I am not sure what the XML standards say or what common practice is.
I highly recommend appending, not because that is standard XML practice, but because creating a new file every x seconds will likely be a very difficult way to manage your data. You may also run into limitations of your file system (e.g. a maximum number of files per directory).
You might also consider using a database instead of files to store your data.
XML documents have only one root element. You can write multiple XML fragments into the same file, but then it won't be a well-formed document. So while both options are workable, and you should consider your other requirements too, the standard somewhat nudges you towards writing one file (or one database row) per notification.
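If you do go with appending, here is a minimal sketch of writing each notification as a small, self-contained fragment (the element and attribute names are invented for illustration):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WeatherLog {
    // Appends one notification as an XML fragment on its own line.
    // The file as a whole has no single root element, but each fragment
    // parses on its own (or the file can be wrapped in a root when read back).
    static void append(Path log, String station, double tempC) throws Exception {
        String fragment = String.format(
                "<notification station=\"%s\" tempC=\"%.1f\" ts=\"%d\"/>%n",
                station, tempC, System.currentTimeMillis());
        Files.writeString(log, fragment, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws Exception {
        append(Path.of("weather.xml"), "KSFO", 17.5);
    }
}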
More specifically, large XML web pages (RSS feeds). I am using the excellent Rome library to parse them, but the page I am currently trying to get is really large, and Java runs out of memory before getting the whole document.
How can I split up the webpage so that I can pass it to XMLReader? Should I just do it myself and pass the feeds in parts after adding my own XML to start and finish them?
First off, learn to set the Java command-line options -Xms and -Xmx to appropriate values; all the DOM-based parsers eat crap-loads of memory. Second, look at using a pull parser; it will not have to load the entire XML into a document before processing it.
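A small sketch of the pull-parser approach with the JDK's built-in StAX reader (the feed URL is a placeholder); it streams element events instead of building the whole document in memory:

import java.io.InputStream;
import java.net.URL;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class FeedTitles {
    public static void main(String[] args) throws Exception {
        // Stream the feed instead of building a DOM; memory use stays flat
        // no matter how large the document is.
        try (InputStream in = new URL("https://example.com/huge-feed.xml").openStream()) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // getElementText() reads this element's text and moves past it.
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }
}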
I'm trying to generate some graphs with prefuse, and it seems like the easiest way to load the data into prefuse is to use a GraphML file.
Is there an easy way to write these files from my data?
Or is there an easier way to load my data into prefuse?
Thanks
yEd can export graphs in GraphML format, and JGraphT has a GraphMLExporter. That leaves the problem of how to get your data into those products or libraries, but at least both can create the desired format.
On the other hand, GraphML is just XML, so you can easily use JDOM or dom4j to create a document, add the nodes based on your data, and serialize it to an XML file. This shouldn't be too complicated.
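For example, here is a small sketch with JDOM2 (two hard-coded nodes and one edge stand in for your data; <key>/<data> attributes are omitted):

import java.io.FileWriter;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.Namespace;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

public class GraphMLWriterSketch {
    public static void main(String[] args) throws Exception {
        Namespace ns = Namespace.getNamespace("http://graphml.graphdrawing.org/xmlns");
        Element graphml = new Element("graphml", ns);
        Element graph = new Element("graph", ns)
                .setAttribute("id", "G")
                .setAttribute("edgedefault", "undirected");
        graphml.addContent(graph);

        // Two nodes and one edge; replace this with a loop over your own data.
        graph.addContent(new Element("node", ns).setAttribute("id", "n0"));
        graph.addContent(new Element("node", ns).setAttribute("id", "n1"));
        graph.addContent(new Element("edge", ns)
                .setAttribute("source", "n0").setAttribute("target", "n1"));

        try (FileWriter out = new FileWriter("graph.graphml")) {
            new XMLOutputter(Format.getPrettyFormat()).output(new Document(graphml), out);
        }
    }
}

Prefuse's GraphMLReader should then be able to load the resulting file.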
You could use the Network Workbench, which allows you to load data in a lot of different forms including edge lists. Edge lists are usually the easiest format to generate.
I'm not completely sure if you can export from NWB to, say, GraphML, but NWB includes a number of visualizations, some of which are based on Prefuse.
If you want to do more with your data than just visualize it, then NWB might help you.
Check PyGraphML, a basic Python library designed to parse and generate GraphML files. http://hadim.github.io/pygraphml/index.html