Virus/Malware Danger While Web Crawling [closed] - java

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
I recently wrote a custom web crawler/spider using Java and the JSoup (http://jsoup.org/) HTML parser. The web crawler is very rudimentary: it uses the JSoup connect and get methods to fetch the source of pages and then other JSoup methods to parse the content. It randomly follows almost any link it finds, but at no point does it attempt to download files or execute scripts.
The crawler picks seed pages from a long list of essentially random webpages, some of which probably contain adult content and/or malicious code. Recently, while I was running the crawler, my antivirus (Avast) flagged one of the requests as "threat detected". The offending URL looked malicious.
My question is, can my computer get a virus or any sort of malware through my web crawler? Are there any precautions or checks I should put in place?

In theory, it can.
However, as you don't execute Flash and similar plugins, but only process the text data, chances are pretty high that your HTML parser does not have a known vulnerability.
Furthermore, viruses and malicious web sites target the big user groups. Very few people use JSoup; most use Internet Explorer, for example, which is why viruses target those platforms. These days, Mac OS X is becoming more and more attractive as a target. I just read about new malware that infects only Mac OS X users, via an old Java security issue, when they visit a web site. It was found on Dalai Lama related web sites, so maybe it's Chinese.
If you really are paranoid, set up a heavily restricted "nobody" user on your system and run the crawler as that user. This works best with Linux. With SELinux in particular, you can narrow the crawler's permissions to the point where it can do nothing except load external web sites and send the results to a database. An attacker can then only crash your crawler, or maybe abuse it for a DDoS attack, but not corrupt or take over your system.
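As one possible application-level precaution to go with the OS-level restrictions above, the crawler could refuse to follow links that are obviously not web pages. The sketch below is only an illustration: the class name LinkFilter, the method shouldFetch and the extension list are assumptions and are not part of JSoup or of the original crawler.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides whether a discovered link is worth fetching.
final class LinkFilter {
    // Assumed deny-list of extensions that are clearly not HTML pages.
    private static final Set<String> SKIPPED_EXTENSIONS =
            Set.of(".exe", ".msi", ".zip", ".jar", ".swf", ".pdf", ".dmg");

    static boolean shouldFetch(String link) {
        final URI uri;
        try {
            uri = new URI(link.trim());
        } catch (URISyntaxException e) {
            return false; // malformed links are skipped
        }
        String scheme = uri.getScheme();
        if (scheme == null) {
            return false; // relative links must be resolved before filtering
        }
        scheme = scheme.toLowerCase(Locale.ROOT);
        if (!scheme.equals("http") && !scheme.equals("https")) {
            return false; // rejects javascript:, ftp:, mailto:, file:, ...
        }
        if (uri.getHost() == null) {
            return false;
        }
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);
        for (String extension : SKIPPED_EXTENSIONS) {
            if (path.endsWith(extension)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldFetch("http://example.com/index.html"));
        System.out.println(shouldFetch("http://example.com/setup.exe"));
        System.out.println(shouldFetch("javascript:alert(1)"));
    }
}
```

A filter like this only reduces what the crawler requests. It does not replace running the crawler under a restricted account, since a server can return anything regardless of what the URL looks like.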

Related

Can Java Applets be dangerous? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
So I'm currently reading the book "Java: A Beginner's Guide, 7th Edition", and the following sentences made it seem to me that applets could be used as viruses. Was this ever done?
An Applet is a special kind of Java program that is designed to be transmitted over the Internet and automatically executed inside a Java-compatible web browser.
The key feature of applets is that they execute locally...
To me it sounds like it wouldn't be hard to build in a virus into an Applet.
The problem with applets is that they run automatically when you load the page. They're also so complex (compared to HTML or JavaScript) that it was just too complicated to meaningfully secure them. Run Automatically + Complicated to Secure + Doesn't Update Automatically = impossible to completely secure.
Regular apps are far far more dangerous to your machine than applets were. But, they don't run automatically when you visit a web page.
Desktop apps written in languages (like C or C++) where you manipulate the memory with pointers and don't automatically bounds check arrays, are much harder to write securely. Languages (like Java or C#) that don't have pointers and do automatically bounds check arrays are easier to write secure apps in.
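A small illustration of the bounds-checking point: in Java, writing one element past the end of an array raises an exception instead of silently overwriting neighbouring memory, as the equivalent C code could. The class and method names here are made up for the example.

```java
// Hypothetical demo of Java's automatic array bounds checking.
final class BoundsCheckDemo {
    // Returns true if the JVM rejected the out-of-range write.
    static boolean writePastEnd(int[] buffer) {
        try {
            buffer[buffer.length] = 42; // one past the last valid index
            return false;               // never reached
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;                // the write was refused, memory is untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(writePastEnd(new int[4])); // prints "true"
    }
}
```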
Java includes many safeguards to prevent ill behavior, but time after time those security features were not enough because of various bugs or design problems.
As standalone apps they are as safe or risky as any other app. Just make sure to download your app from trusted sources.

How to use Java in PHP? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
I want to use WordPress for my web development. It is written in PHP, including the database connection to MySQL. The whole thing is PHP. But I need to use Java for back-end data processing, along with a number of existing open source Java libraries.
A Google search suggests that PHP/Java Bridge is one way to go. Is that bridge the best way to go? If everything is PHP with WordPress, is there still a way to use J2EE technologies, including JSP, Servlets, etc.?
edit
Java is needed because I need to run machine learning algorithms whose libraries are only available for Java. Also, PHP may run into efficiency issues when it's used to process large amounts of data.
A good example of the Java libraries I am going to use is those for processing Big Data, which are mainly written in Java, like Hadoop.
The very simple answer here is: don't.
PHP is designed to start up at every page request, execute a small series of scripts as a single operation, output the data associated with those scripts, and then immediately die. It literally does not have time to wait for your Java programs and libraries to do their thing, so don't try to put one in the other. This is also why PHP scripts that rely on databases tend to use databases heavily optimised for immediate retrieval, instead of general databases that rely on joins and selects that take a few seconds to form the correct data response. Neither PHP nor users browsing websites have time for that.
What you could do is wrap your Java tools in Java Servlets and run them on the same server/host as your PHP instance, so that your scripts can access the Servlets at http://127.0.0.1:7254/... just like any other RESTful API they need to use while generating output. Just make damn sure that you're not going to make PHP wait: if it has to send data to your tools, that is a post-and-forget operation. PHP should not be getting any response back other than an immediate "data accepted" or "data rejected" before the data is actually handled by your tools. If you need to post data and then get a result back, you're going to have to use two calls: one to post the data, and a second to request the result of that posting.
For instance:
web page generation chain: WordPress CMS based on PHP -> your database
web input for processing: WordPress CMS based on PHP -> Java Servlets for machine learning
data processing chain: Java Servlets for machine learning -> your database
So you build pages based only on what's in your database, and you post data to your Java Servlets only to get them to start doing something, without waiting for a response. Their results will end up in your database, where your pages will pick them up once they're in, and your Java programs do what they need to do independently of your WordPress setup.
And if you're going to do that, you should probably write that functionality as a WordPress plugin that can talk to your Java Servlets.
And now you have a second project you need to work on: turning your Java programs into web servers. Not terribly complex, but definitely something you're going to lose some time on doing right (because you'll need to wrap them with servlets, as well as make sure they can run without crashing on the same server as your WordPress instance, which is always fun).
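The post-and-forget idea with two separate calls could look roughly like this on the Java side. This is only a sketch of the pattern, not servlet code: JobService, submit and result are hypothetical names, the in-memory map stands in for "your database", and a real setup would expose the two methods over HTTP.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical stand-in for the Java side: accepts work at once, finishes it later.
final class JobService {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Stands in for the database that the page-generation side reads from.
    private final Map<Long, String> results = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);
    private final Function<String, String> task; // must not return null

    JobService(Function<String, String> task) {
        this.task = task;
    }

    // First call ("data accepted"): queues the work and returns a job id immediately.
    long submit(String input) {
        long id = nextId.getAndIncrement();
        worker.execute(() -> results.put(id, task.apply(input)));
        return id;
    }

    // Second call: empty until the job has finished.
    Optional<String> result(long id) {
        return Optional.ofNullable(results.get(id));
    }

    // Stops accepting jobs and waits for the queued ones; false on timeout or interrupt.
    boolean shutdownAndWait(long timeoutMillis) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The caller of submit never blocks on the slow work. It gets an id back straight away and asks for the result in a later request, which is the behaviour the PHP side needs.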

How to make only one instance of application to be used by multiple users? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
This is not about restricting the opening of multiple instances. I wrote a little app that creates reports and sends scheduled emails. This app is in a shared drive folder that everybody in our company has access to.
I want to set it up so that it actually executes only on my computer (like a server). However, everyone else should be able to open it, see all the processes running in the instance that is open on my computer, and also make modifications, etc.
How can I do it?
A single copy of an app running on a server and handling requests from multiple locations... that's called "client-server" and you have essentially two choices:
A modern HTML-based web application (aka "thin client", but the "thin" part is debatable nowadays). The user interface is implemented in HTML/Javascript/CSS, runs on the client's browser, and interacts with a web server over the network (HTTP or AJAX or both) to execute the application logic. The main advantage of this is that the client needs only a modern web browser and can be run on any platform that supports the browser (Windows, Linux, iOS, MacOS, etc)
A "fat client" application. You write the user interface using Java/Swing/AWT/GWT/etc, and a server component also using Java. They communicate over the network using whatever you want to layer on top of TCP/IP. This can also run on many clients but they must have Java installed, so iOS is probably out. And clients may need to install Java, and some users may not want to. I.e. some clients might encounter a barrier to running your app.
A detailed explanation of how to write client-server apps is far beyond the scope of SO. You'll need to do a lot of reading and studying.
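To give a rough feel for the second option, here is a minimal sketch of one server instance holding state that any client can read or change over TCP. Everything in it is an assumption made for illustration: the class name StatusServer, the one-line "SET ..." / "GET" text protocol, and the use of the loopback address. It handles one client at a time and has no authentication, so it is nowhere near a real application.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical single-instance server: it owns the state, clients only send commands.
final class StatusServer implements AutoCloseable {
    private final ServerSocket listener = open();
    private volatile String status = "idle";

    StatusServer() {
        Thread thread = new Thread(this::serve, "status-server");
        thread.setDaemon(true);
        thread.start();
    }

    private static ServerSocket open() {
        try {
            // Port 0 asks the OS for any free port; loopback keeps the demo local.
            return new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int port() {
        return listener.getLocalPort();
    }

    // Accept loop: one command per connection, one connection at a time.
    private void serve() {
        while (!listener.isClosed()) {
            try (Socket client = listener.accept();
                 BufferedReader in = new BufferedReader(new InputStreamReader(
                         client.getInputStream(), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(new OutputStreamWriter(
                         client.getOutputStream(), StandardCharsets.UTF_8), true)) {
                String line = in.readLine();
                if (line == null) {
                    continue;
                }
                if (line.startsWith("SET ")) {
                    status = line.substring(4);
                    out.println("OK");
                } else if (line.equals("GET")) {
                    out.println(status);
                } else {
                    out.println("ERR unknown command");
                }
            } catch (IOException e) {
                // Listener closed or client went away; the loop condition decides what next.
            }
        }
    }

    // Client side: connect, send one command, read the one-line reply.
    static String send(int port, String command) {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
             PrintWriter out = new PrintWriter(new OutputStreamWriter(
                     socket.getOutputStream(), StandardCharsets.UTF_8), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     socket.getInputStream(), StandardCharsets.UTF_8))) {
            out.println(command);
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            listener.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A client on another machine would call something like StatusServer.send(port, "GET") against the server's real address instead of loopback. The state lives only in the one server process, which is the point of the client-server split.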

Data hosting for a android application [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
I want to create an Android application that will fetch stories (probably HTML or text files) from the internet. I want to know where I can host these files (a paid service is no problem).
Users should be able to search the stories, rate them, and use options like "most read", "new", etc.
Are there any predefined web services available for this kind of purpose?
If not, what technologies should I be familiar with to achieve this on a normal web server?
Thanks in advance.
I suggest you start with shared web hosting.
Starting at ~ $5/month, shared hosting offers usually have the following advantages:
No need to set up the Linux system, Apache and the MySQL server yourself
cPanel administration
Support of your preferred server-side language: PHP, Python (less common than PHP) etc.
Migrating to another host is pretty simple
The choice of the programming language + framework depends on your taste and experience.
Two very popular options are PHP/CodeIgniter and Python/Django.
Of course, if the traffic becomes significant or if you already expect a very fast growth, you may also consider a scalable solution (which shared or even dedicated hosting is not). Amazon, Google and Microsoft provide this kind of service in the cloud.
From my personal experience with Amazon S3, setting up a web service in the cloud is far more time-consuming than with a traditional web host. I would not recommend it unless you forecast traffic of dozens or hundreds of hits/second or more.
If you want to create your own service, check out Google App Engine (GAE).
Advantages:
Java deployment is supported (no need to program in PHP, Python, etc.)
Scalable (almost every mobile app has the potential to reach 1M users)
Free quotas (free to start)
Good integration with Google services and tools (e.g. GWT)
Disadvantages:
There is no easy way to migrate your solution to another service.
No ready-to-install apps (forums, etc.).

What to choose for article management website [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 12 years ago.
I have some questions regarding article management system.
I am thinking of making a website where people will become members and write articles, which they can then publish, rank, etc.
I have been googling for the past two weeks to find out which technology is best, and how to store the articles so that search engines (like Google, Yahoo, etc.) can find them.
If the articles are stored as HTML somewhere on my server, then Google's spider programs will be able to get them for search results, but if I store the content of my articles in MySQL (the database I want to use), how would search engines rank my website's articles?
I am really confused, please guide me.
I need to know if there is any open source PHP article management script which I can update or change to suit my needs and which has not been hacked. Or a Java content management script, or something else that can save me the time of developing this whole thing.
I would really appreciate it.
Generally if you store the content in a database, you have scripts which serve up that content, and thus search engine spiders index the served versions of the article.
There are many content management systems out there, and which one you choose is really a subjective choice. Whether or not something "has been hacked" is a poor indicator of whether it can currently be, or might in the future be, compromised. The developers of CMS software tend to patch known holes, and it's impossible to predict future holes based on past ones. So really, your best bet is to find something with solid support and active development, and to patch frequently as security updates are released.
As others have said, storing article data in a database is no problem. The articles will get rendered into HTML by some script, and displayed on your site, where search engines will find them. There are a bunch of techniques to improve how well your articles will show up in search.
In this day and age, I wouldn't recommend rolling your own system. There are a great number of off-the-shelf software packages that can handle your requirements. WordPress is a very popular blogging system, written in PHP (with MySQL), that will probably meet all of your requirements. It supports multiple authors (and various roles for authors such as author/editor/administrator), commenting/discussion, and has a huge array of plugins that provide additional or altered functionality. It's well documented (both user and developer documentation), actively maintained, and pretty good overall.
If WordPress doesn't float your boat, I'd look around at some of the other PHP-driven blogging tools. There are a ton of them, and it's very likely that one will fit your needs, and you can avoid reinventing the wheel for the 900th time.
I am sorry, I still don't understand. Here is an example: let's say user1 submitted article1 and the content got stored in the database. Now on the home page there is a link "How to train your pet". A user clicks on this link and it goes to a servlet, which pulls the article content and information from the database, generates an output and displays it as... what? As HTML? Will it save the output as an HTML file on the server, so that the next time another user clicks on "How to train your pet" on the home page, he will be directed to this generated HTML?
Or, in another case, the servlet will generate the output in the browser, where the user will read, vote, rank, etc. But in this case there is no HTML file, so how would search engines rank this article when the file doesn't exist? It's so confusing.
If the articles are stored as html somewhere on my server then Google Spider programs will be able to get them for search results but if i store the content of my article in MySQL (the database which i want to use), how would search engines rank my website articles.
It doesn't matter how they are stored, only that they are addressable via http URIs. Browsers don't access data in databases, they make requests to web servers (which might run programs to fetch data from databases or might fetch data from files on the file system).
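To make this concrete, here is a minimal sketch of what such a server-side program does on every request: it looks the article up and builds the HTML on the fly, and nothing is saved as an .html file. The names ArticlePage, Article and render, the /articles/ URL prefix, and the in-memory map standing in for the MySQL table are all assumptions made for the example.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical request handler: turns a stored article into an HTML page on demand.
final class ArticlePage {
    static final class Article {
        final String title;
        final String body;

        Article(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    private static final String PREFIX = "/articles/";

    // Escapes text so stored content cannot inject markup into the page.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    // Returns the page for a request path, or empty if there is no such article.
    // The map stands in for a database lookup.
    static Optional<String> render(Map<String, Article> store, String requestPath) {
        if (!requestPath.startsWith(PREFIX)) {
            return Optional.empty();
        }
        Article article = store.get(requestPath.substring(PREFIX.length()));
        if (article == null) {
            return Optional.empty();
        }
        return Optional.of("<html><head><title>" + escape(article.title) + "</title></head>"
                + "<body><h1>" + escape(article.title) + "</h1>"
                + "<p>" + escape(article.body) + "</p></body></html>");
    }
}
```

A search engine spider requesting /articles/how-to-train-your-pet receives the same generated HTML as a browser does, so it can index the article even though no file with that content exists on disk.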
I need to know if there is any PHP article management script which is open source which i can update or change to suit my needs and has not been hacked. Or Java Content management script or something which can save me the time to develop this whole thing.
Hundreds, in both Java and PHP.
