There are two main kinds of spam, and each affects Internet users differently. Usenet spam is a single message sent to twenty or more Usenet newsgroups. (Years of experience have taught Usenet users that a message posted to that many newsgroups is usually irrelevant to most or all of them.) Usenet spam is aimed at "lurkers": people who read newsgroups but rarely or never post, and so never give their address away. It robs users of the usefulness of newsgroups by burying them under a flood of advertising and other irrelevant posts. It also undermines the ability of system administrators and owners to manage the topics they accept on their own systems.
Email spam targets individual users with direct mail messages. Email spam lists are often built by scanning Usenet postings, harvesting addresses from Internet mailing lists, or searching the web for addresses. Receiving email spam often costs recipients money out of their own pockets. Many people, including anyone with metered phone service, read or receive their mail while the meter is running, so spam costs them additional money. On top of that, it costs Internet service providers (ISPs) and other online services money to carry spam, and those costs are passed straight on to subscribers.
I believe that it is possible to stop spam, and that content-based filters are the way to do it. The Achilles heel of the spammers is their message. They can get around any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that.
The statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to try to write software that recognizes individual properties of spam. You look at spam and think: the gall of these people, trying to send me mail that begins "Dear Friend" or has a subject line that is all uppercase and ends in eight exclamation points. I can filter all of that out with about one line of code.
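To make that first instinct concrete, here is a minimal sketch of such a hand-written rule in Python. The function name `looks_like_spam` and the exact checks are hypothetical, chosen only to mirror the two features mentioned above; this is not taken from any real filter.

```python
def looks_like_spam(subject: str, body: str) -> bool:
    """Hypothetical feature-recognizing rule, for illustration only."""
    # Feature 1: the body begins with "Dear Friend" (case-insensitive).
    starts_dear_friend = body.lstrip().lower().startswith("dear friend")

    # Feature 2: every letter in the subject is uppercase and the
    # subject ends in eight exclamation points.
    letters = [c for c in subject if c.isalpha()]
    all_caps = bool(letters) and all(c.isupper() for c in letters)
    shouting = all_caps and subject.rstrip().endswith("!" * 8)

    return starts_dear_friend or shouting
```

A rule like this is easy to write, but it only catches the exact features someone thought to list, and it says nothing about how strongly each feature should count.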
But the real advantage of the Bayesian approach, of course, is that you know what you are measuring. Feature-recognizing filters such as SpamAssassin assign an email a spam "score." The Bayesian approach assigns an actual probability. The problem with a score is that no one knows what it means. The user doesn't know what it means, and worse still, neither does the developer of the filter. How many points should an email get for containing the word "sex"? A probability can of course be mistaken, but there is little ambiguity about what it means or how evidence should be combined to calculate it. According to my corpus, "sex" indicates a 0.97 probability that the email containing it is spam, whereas "sexy" indicates a 0.99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being spam.
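The 99.97% figure can be checked with a few lines of Python. This is a sketch under two assumptions that the text does not spell out: the words are treated as independent pieces of evidence, and spam and non-spam are taken to be equally likely beforehand. Under those assumptions, Bayes' Rule for two words with probabilities a and b reduces to ab / (ab + (1 - a)(1 - b)). The helper name `combine` is made up for this example.

```python
from math import prod  # available in Python 3.8+


def combine(probabilities):
    """Combine per-word spam probabilities into one probability.

    Assumes the words are independent evidence and that spam and
    non-spam have equal prior probability (illustrative assumptions).
    """
    p_spam = prod(probabilities)                  # all words seen in spam
    p_ham = prod(1.0 - p for p in probabilities)  # all words seen in non-spam
    return p_spam / (p_spam + p_ham)


# "sex" -> 0.97 and "sexy" -> 0.99, as in the corpus figures above.
combined = combine([0.97, 0.99])
print(f"{combined:.2%}")  # rounds to 99.97%
```

The same function accepts any number of word probabilities, which is what makes a probability easier to reason about than an ad hoc score: the rule for combining evidence is fixed by the arithmetic rather than tuned by hand.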