Thursday, September 29, 2005

My Recommendations for Google

Most of these apply specifically to Google, but a few apply to other blog service providers as well.

1. Advertisers and online marketing companies should stop doing business with spammers. The motivation behind blog spam is no different from that behind any other spam: it's all about money. If you reduce the money flowing to spammers, you reduce the spam. Google needs to be much more proactive on this front. I don't believe Google is doing enough to cut off AdSense revenue to spammers. In my experience, Google hasn't shut down many spammers' AdSense accounts. There is another side to this as well. Google is currently letting spammers advertise their blog spamming software via AdWords. Just google for "blogburner" and you'll see that Rick Butts' blog spamming software is being advertised through Google's AdWords. I think it's about time Google made its position on blog spam clear.

2. Blogger could put limits on various activities. If the limits are high enough, they should not affect normal or even highly active bloggers, but they should prevent spammers from going about their daily spamming.

  • Limit on number of accounts a person can create in one day
  • Limit on number of blogs a person can create in one day
  • Limit on number of blogs a person can create per account
  • Limit on number of blog posts in one day
  • Limit on number of comments in one day
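A minimal sketch of how the per-day limits above might be enforced. The cap values, the `DailyLimiter` class and the idea of keying on a user name are all assumptions made for illustration; the post does not propose specific numbers, and this is not how Blogger works. The per-account blog limit is not a daily limit and is left out here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily caps; the post does not name specific numbers.
DAILY_LIMITS = {
    "account": 3,
    "blog": 5,
    "post": 50,
    "comment": 100,
}


class DailyLimiter:
    """Counts actions per (user, action, day) and refuses any beyond the cap."""

    def __init__(self, limits):
        self.limits = limits
        self.counts = defaultdict(int)

    def allow(self, user, action, day):
        key = (user, action, day)
        if self.counts[key] >= self.limits[action]:
            return False  # cap reached for this user, action and day
        self.counts[key] += 1
        return True


limiter = DailyLimiter(DAILY_LIMITS)
today = date(2005, 9, 29)

# Seven attempts to create a blog on the same day: the first five pass.
results = [limiter.allow("some_user", "blog", today) for _ in range(7)]
print(results)
```

Because the day is part of the key, the counters effectively reset each day without any cleanup step, although a real service would need to expire old entries.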

3. Put a timed moratorium on newly created blogs. Make it so that newly created blogs will not be indexed by search engines or show up on the Next Blog ring for the first 30 days, or whatever the appropriate time may be. This should be enough time to identify and delete spam blogs before they see the light of day and pollute the search indexes. Currently a spammer can create dozens if not hundreds of blogs, and they will show up on the Next Blog ring and get indexed immediately. This sort of immediate gratification is what needs to be addressed. Spammers are by nature not patient. They are out to make a quick and easy buck, and slowing them down should frustrate them enough to make most of them go away.
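The moratorium boils down to a simple date check. The sketch below assumes a 30-day window as suggested above; the function name `is_publicly_listed` and the `flagged_as_spam` parameter are hypothetical.

```python
from datetime import date, timedelta

# 30 days is the figure suggested in the post; adjust as appropriate.
QUARANTINE = timedelta(days=30)


def is_publicly_listed(created_on, today, flagged_as_spam=False):
    """Return True if a blog may be indexed and shown in the Next Blog ring."""
    if flagged_as_spam:
        return False  # identified as spam during (or after) the waiting period
    return today - created_on >= QUARANTINE


print(is_publicly_listed(date(2005, 9, 1), date(2005, 9, 29)))  # 28 days old
print(is_publicly_listed(date(2005, 9, 1), date(2005, 10, 1)))  # 30 days old
```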

4. Template tampering prevention is needed. Spammers are working around Blogger's meta tags by stripping the <$BlogMetaData$> tag from the template and replacing it with their own meta tags, tricking search engines into crawling and indexing their spam blogs. This sort of tampering is well documented here. It should be very simple to put a check in the system to prevent such tampering.
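One way such a check might look: refuse to save a template unless the <$BlogMetaData$> tag is still inside its head section. This is only a sketch under that assumption; the function name is made up, and a real check would also have to deal with extra meta tags added alongside the required one.

```python
import re

REQUIRED_TAG = "<$BlogMetaData$>"

# Captures everything between <head ...> and </head>, case-insensitively.
HEAD_RE = re.compile(r"<head\b[^>]*>(.*?)</head>", re.IGNORECASE | re.DOTALL)


def template_is_intact(template):
    """Return True if the required tag is present inside the head section."""
    match = HEAD_RE.search(template)
    return bool(match) and REQUIRED_TAG in match.group(1)


good = "<html><head><title>My blog</title>\n<$BlogMetaData$>\n</head><body></body></html>"
tampered = '<html><head><title>My blog</title>\n<meta name="robots" content="index,follow">\n</head><body></body></html>'

print(template_is_intact(good))
print(template_is_intact(tampered))
```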

5. A "Mark this comment as spam" feature is needed. This is analogous to the flagging feature but applies to individual comments. The information gathered from this kind of feature can make current antispam measures more effective and less intrusive. The current implementation of the word verification feature to block comment spam creates a serious problem for visually impaired users, and I think most bloggers aren't aware of its implications. I believe there is a smarter method that can solve these two problems at once. Instead of word verification being all or nothing, it can be applied only when necessary. There is already one pattern that holds for just about every piece of comment spam: every one I've seen has a link for the user to click on. A server can keep track of comments with links, as well as comment flag counts, and apply word verification to suspicious comments. For example, when a lot of comments are being flagged as spam and those comments link to some domain like buyfakewatches.info, word verification would kick in. There are other potentially effective means of curbing comment spam that can be implemented once data collection is in place. If a Blogger user is flagged for posting numerous spam comments, they can be throttled so that they are not allowed to comment for a set period of time. Perhaps, combining this with the ideas in #2, word verification could be mandatory for newly created accounts. I'm sure there are plenty of other technical means to curb comment spam, and they should be discussed further.
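The selective word verification idea can be sketched roughly as follows: count spam flags per linked domain, and ask for word verification only when a new comment links to a domain that has been flagged often enough. The threshold of 3, the `CommentScreen` class and the link-matching pattern are assumptions for illustration only.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Rough pattern for http(s) links in comment text.
LINK_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
FLAG_THRESHOLD = 3  # hypothetical number of flags before verification kicks in


def link_domains(comment_text):
    """Return the set of host names linked from a comment."""
    domains = set()
    for url in LINK_RE.findall(comment_text):
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # malformed URL, ignore it
        if host:
            domains.add(host.rstrip("."))
    return domains


class CommentScreen:
    def __init__(self, threshold=FLAG_THRESHOLD):
        self.threshold = threshold
        self.flags_by_domain = Counter()

    def record_spam_flag(self, comment_text):
        """Called when a reader marks a comment as spam."""
        for domain in link_domains(comment_text):
            self.flags_by_domain[domain] += 1

    def needs_word_verification(self, comment_text):
        """True only if the comment links to a frequently flagged domain."""
        return any(
            self.flags_by_domain[domain] >= self.threshold
            for domain in link_domains(comment_text)
        )


screen = CommentScreen()
for _ in range(3):
    screen.record_spam_flag("Great post! http://buyfakewatches.info/cheap")

print(screen.needs_word_verification("Nice blog, thanks for writing."))
print(screen.needs_word_verification("Look here: http://buyfakewatches.info/new"))
```

Comments without links, or with links to domains nobody has flagged, go through without any challenge, which is the point of the exercise for visually impaired commenters.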

Ultimately, spammers depend on their ability to create massive numbers of accounts, blogs and comments. There are technical means to hinder this unrestricted creation of junk. Unlike with email, Google has full control over its own infrastructure, and therefore I believe blog spammers' days are numbered.

Wednesday, September 28, 2005

Current Status

Returning from a short vacation of sorts, I see that my trusty script has spidered over 120,000 blogspot blog pages while I was gone. The total I need to sift through now stands at over 180,000 pages, about 15 GB of raw data. My little Perl script that extracts links from HTML is no longer all that useful. It used to take about a minute to process 2000 pages, but now that I'm dealing with a much larger amount of data I'll need to figure out a more efficient way to process it all.
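For illustration, link extraction of the kind described above can be done with a streaming parser that reads one saved page at a time, so memory use stays flat however large the crawl gets. This is a Python sketch using the standard library, not the Perl script mentioned in the post; `LinkExtractor`, `extract_links` and `iter_page_links` are made-up names, and no claim is made about its speed.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


def iter_page_links(paths):
    """Yield (path, links) one saved page at a time."""
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            yield path, extract_links(handle.read())


sample = '<p><a href="http://a.example/x">x</a> <A HREF=\'http://b.example/\'>y</A> <a name="n">z</a></p>'
print(extract_links(sample))
```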

Tuesday, September 20, 2005

Scope of the Splog Problem

I've been monitoring and processing about 27000 blogs daily looking for splogs, and it appears that at least 15000 new splogs are being created daily, give or take a thousand. Of course this is only from blogspot, and I haven't even begun to identify splogs on other free blog services. The splog problem is bigger than I imagined and it's growing at a rapid pace. It needs immediate attention from industry leaders like Google and Yahoo, who should address it head on. Hopefully something will come out of the second web spam summit hosted by Technorati.

Friday, September 16, 2005

Lost and Found

Recently I've seen the number of visible splogs recede like a tide. Even though they are trying to make a comeback on the Next Blog ring, they are nowhere near as visible as they once were. I began wondering where they were. Did spammers just give up? I didn't think so. After poking around I found a way to efficiently retrieve a list of potential splogs, and it's huge. It's much larger than I ever expected. Currently I have a list of 22000 blogs, and by my rough estimate about 15000 of them are splogs. I've been able to definitively confirm only about a thousand of them today. Obviously I will be exploring various ways to identify splogs. The growth rate of the splog count is not yet known, since today was the first day I've taken advantage of this new information source. As a result, my list has suddenly jumped to 3021 splogs.

Wednesday, September 14, 2005

2085 Splogs

I've submitted my list of 2085 splogs to Blogger for their review and hopefully prompt removal. We'll see if this will bring up the number of splogs being deleted daily.

Sunday, September 11, 2005

Post Flag Day Status

Splog flag day came and went without much hoopla. I don't know how many people participated, and its effectiveness is not apparent yet. Blogger doesn't automatically remove splogs from the Next Blog ring, so obviously we're not going to see the result for a few days. What I have noticed, however, is that there seem to be more splogs in the Next Blog ring. It looks as though spammers have stepped up their spamming activities and figured out Blogger's filtering heuristics. I'm actually a bit surprised by this, because I thought their primary motivation for creating splogs was to be indexed by search engines, and they can achieve that goal regardless of being in the Next Blog ring. I may have gotten this wrong. Maybe they do need to be visible because Google is doing a better job of ignoring the spammers' obvious attempts at PageRank manipulation. Anyway, as a result of splogs being more visible on the Next Blog ring, my splog list has grown by about 500 during the three days. It now stands at 1867 as of this writing. Sadly, Blogger deleted only nine splogs on Friday.

Something I had not really thought about came to my attention today. About three weeks ago Blogger implemented word verification, also known as a captcha, to prevent comment spamming on blogs. To me this seemed like a pretty good idea for preventing spam, since most comment spam is posted by automated programs. Now I see that it has the unintended consequence of preventing blind people from using blogs. Here is a blog by a blind person voicing his opinion, back in April, about the word verification requirement during blog creation. Obviously things have gotten even worse for this guy since Blogger added word verification for comments. I see this as a very bad thing. Blind bloggers are now cut off because of spammers. Because of this, I have decided to turn off word verification on comments and urge others to do the same. I imagine I will get spammed now, but that's ok. I will turn this to my advantage. This blog will serve as a honeypot of sorts to collect more information about spammers.

Friday, September 09, 2005

Flag Day Tomorrow

I've just posted a list of 350 splogs on the flag day wiki for all to see and flag if they wish. Having done that, I have some doubt as to whether flagging actually works. Something I've observed lately is that Google doesn't seem to be deleting splogs based on a high number of flags. I think others have noticed as well. I think Blogger is using flag information to fine tune its splog identification heuristics and filter splogs out of the Next Blog ring, but that's about it. For some unknown reason they are not deleting the blogs that they've identified as splogs. At the same time, I've noticed that the splogs I've listed on this blog are all gone. I don't know if Google is looking at this blog or if others have taken the list and submitted it to Google for removal. I guess I'll wait and see what Google will do with all the new flags tomorrow. Will they remove all the splogs as they should?

Tuesday, September 06, 2005

Case #14 - e5y3461.blogspot.com and more

I've been working on some Perl and shell scripts to identify and extract data from splogs. This is the first fruit of my labor. Here is a list of 85 splogs which all point to one domain, talk-stuff.com. The total number of links pointing to this domain is a whopping 60098. That's right, it's over 60 thousand links to one domain! Obviously this guy is really trying to pump up his PageRank on Google.
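A rough sketch of the kind of tally described above: given the links found on each blog, count how many links point at each domain and how many distinct blogs link to it. This is Python rather than the Perl and shell scripts mentioned in the post, the function name `summarize_targets` is hypothetical, and the sample data is made up purely to show the shape of the input.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def summarize_targets(links_by_blog):
    """Return (links per domain, set of linking blogs per domain)."""
    link_counts = Counter()
    blogs_by_domain = defaultdict(set)
    for blog, links in links_by_blog.items():
        for link in links:
            try:
                host = urlparse(link).hostname
            except ValueError:
                continue  # malformed URL, skip it
            if not host:
                continue  # relative link or no host part
            link_counts[host] += 1
            blogs_by_domain[host].add(blog)
    return link_counts, blogs_by_domain


# Made-up sample data, not taken from the actual splog list.
sample = {
    "blog-one.example": ["http://spam-target.example/a", "http://spam-target.example/b"],
    "blog-two.example": ["http://spam-target.example/c", "http://elsewhere.example/"],
}

counts, blogs = summarize_targets(sample)
for domain, total in counts.most_common():
    print(domain, total, "links from", len(blogs[domain]), "blogs")
```

Sorting by link count puts the heaviest link farms at the top, which is where a case like the one above would show up.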


Case #13 - corincent.blogspot.com

Initially splogs were nothing more than pages full of links. However, I'm noticing that more and more of them are trying to look like they have content. Some do a better job of that than others, and this one isn't one of them. Anyway, I'm still going after them one by one.

Case #12 - digitalcameracorner.blogspot.com

This is a typical splog picked at random. I realize that just targeting large spammers may not be the best way to go about it, hence the randomness. I've reported this splog to AdSense, as I have all the others.

Monday, September 05, 2005

Case #11 - ioanaani.blogspot.com

At first glance this is yet another splog, but it's really a precursor of a lot more. What's different about this splog is that it has a link to sex animation web pages. AdSense has a policy of not allowing AdSense on "Pornography, adult, or mature content." Obviously I've alerted the AdSense people about this. What's becoming clear is that there is a definite rise in pornography related splogs.

Friday, September 02, 2005

Current Status Updated

It appears spammers have changed their tactics. They seemingly have gone away, but they are still there. They're just hidden. They know that people like me are coming after them vigorously. They are now creating splogs that will not show up in the Blogger directory, so clicking on the Next Blog button will not show them. Google will still index them, thus fulfilling the spammers' goal of manipulating Google search results to generate revenue. I guess it's time for me to read up on the Google API to ferret out these spammers. This may actually be a good thing, since I can now use Google to find splogs instead of clicking on the Next Blog button. They can run but they can't hide.

Current Status

I admit I haven't posted much here in the last couple of days. Just so you know, I haven't given up. I have four cases that I've worked on but not posted anything about. Having said that, four cases in four days isn't much. I've been busy working on a set of Perl and shell scripts to automate many analysis tasks. With this new tool I've found a set of splogs that has over 60000 links pointing to a spammer's website. I believe it will greatly accelerate many other tasks. The current number of splogs in my database is 1292. The number tends to fluctuate between 1200 and 1500 depending on how many new splogs are being created each day and how many are being shut down. I've noticed that on average Blogger deletes 75 to 150 splogs on my list. If anyone wants to report splogs they've found, please send them to fightsplog@gmail.com.

I noticed a sudden surge of pornography splogs yesterday. At the same time, I think Blogger is starting to step up its efforts to curb the growth of splogs. So far we are winning the battle. I was able to click through about twenty consecutive legitimate blogs via the Next Blog button, and that's pretty remarkable.