Building a better captcha
A simple concept to limit blog comment spam
First off, this is a quick post. I am writing this now because it is something I have recently been practicing with great success, and I'm afraid if I wait too long to discuss it, someone will beat me to it.
Anyone who has a blog knows that comment spam is a HUGE pain in the ass. There are many MANY solutions out there to deal with the ever-increasing volume of blog comment spam. These solutions range from simple captchas, to simple mathematical questions (e.g. what is 4 plus 9?), to some even more esoteric solutions (Sidenote: The Hacker Webzine is a blog I can't recommend enough for people interested in Internet Security). While the solution presented on the Hacker Webzine link just mentioned seems to be very effective, I'd like to propose to you all a much more unorthodox method of combating comment spam.
A normal captcha presents some data (usually in the form of a rasterized image) and asks a user to interpret and enter that data to be validated by the script. This is very effective, but recently many systems which employ it (Digg?) have had their captchas cracked via rather reliable methods. What does this mean? It means it's time to start thinking about the next big thing to prevent comment spam! Captchas by themselves are about 70% effective at preventing spam comments, but there are always those that slip through, especially when we are dealing with others (like myself and probably many of the readers here) who have enough creativity and drive to come up with new techniques and tricks.
I would like to suggest a Reverse Captcha. In addition to the captcha you already use, consider this: create a blank text field and give it a name that you expect a spam bot to recognize. For example, if your comment form does not ask the person to enter their email twice, create a text field called email2. Build this field as you normally would, but use CSS rules to make it invisible to the user (don't use display:none;). Then write a back-end script that validates the comment form by checking for data in that hidden email2 field.
if (!empty($_POST['email2'])) { die("Sorry sucka."); }
If your script detects that there is a value passed from that field then it can safely assume that the comment was not submitted by a human because any human viewing the screen will not ever see that field. Bam, you've just made yourself a reverse captcha.
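Here is a slightly fuller sketch of the same idea, in case the one-liner is too terse. Only the email2 field name comes from this post; the function names, the off-screen CSS positioning and the 400 response are my own assumptions, so treat it as a starting point rather than a finished implementation.

```php
<?php
// Hypothetical helpers: only the "email2" field name comes from the post.

/**
 * Returns the markup for the decoy field. It is pushed off-screen with CSS
 * (an assumed technique) instead of being hidden with display:none.
 */
function render_honeypot_field(string $field = 'email2'): string
{
    $name = htmlspecialchars($field, ENT_QUOTES);
    return '<div style="position:absolute; left:-9999px;" aria-hidden="true">'
         . '<label for="' . $name . '">Leave this field empty</label>'
         . '<input type="text" id="' . $name . '" name="' . $name . '"'
         . ' value="" tabindex="-1" autocomplete="off">'
         . '</div>';
}

/**
 * True when the decoy field came back with anything in it.
 */
function honeypot_tripped(array $post, string $field = 'email2'): bool
{
    $value = $post[$field] ?? '';
    // Whitespace-only strings count as empty; any other content trips the check.
    return is_string($value) ? trim($value) !== '' : !empty($value);
}

// Only act on real web form posts, so this file can also be loaded from the CLI.
if (PHP_SAPI !== 'cli'
    && ($_SERVER['REQUEST_METHOD'] ?? '') === 'POST'
    && honeypot_tripped($_POST)) {
    http_response_code(400);
    exit('Sorry sucka.');
}
```

One small difference from the one-liner: PHP's empty() treats the string "0" as empty, so a bot that fills the field with "0" would get through. The trim() comparison in this sketch catches that case.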
Apply this very simple concept to your pre-existing comment form and I guarantee you will see a dramatic decrease in your comment spam.
And for all my Black Hats out there....one more fucking thing for you to consider. Time to up your game again!
The Canucks Suck (and the Canadian Real Estate industry doesn't know shit about SEM)
Still....the Canucks? disgraceful.....
Tonight I was invited to a hockey game by some of my suppliers.
First off, the Canucks sucked. I'm sorry Vancouver, but WHAT THE FUCK WAS THAT?!?!? We were playing against Nashville....what the hell! We shoulda creamed them, but it was like our boys were asleep at the switch...we ended with something like 36 shots on goal, and 0 points...whereas Nashville had like 5 shots on goal and 4 points. Disgraceful........
.............but I digress...............As I was saying, I was invited to a company suite at GM Place by a supplier who I spend a lot of advertising dollars with. Now keep in mind, I'm not talking about internet marketing or search advertising. I'm talking about print advertising.
Whenever you get a couple beers in business people, and especially in an informal setting, you get some pretty insightful conversations happening, and tonight didn't disappoint. Inevitably, conversation turned towards the power of internet marketing & new media in the real estate industry, with me contending that America's struggling market (especially Arizona) is light-years ahead of Canadian Real Estate marketers in terms of its adoption and utilisation of digital media......
These fellows are executives from one of the major real estate publications, and as conversation continued, I started explaining to them that, in my opinion, one of the things they fail at is their web approach. Their website is a rather embarrassing pre-1998 piece of nastiness that really feels like it has been forgotten about. Nowadays, with the popularization of brand-based networks and the ubiquity of digital media, I expect a real estate publication to have nothing less than a complete MLS-style listing of the projects listed in its pages, as well as video/photographic slide show tours of homes, not to mention full editorial opinion delivered via blog and an open forum for discussion on the market and product offerings. Not only would I be charging developers an arm and a leg to have their ads displayed on page and in email correspondence with users, but I would also be tapping into the realtor market and gleaning advertising dollars from that impressive sub-section of the overall real estate market....
This, more or less, was met with agreement from the executives I was conversing with, but still they had no clue how to even wrap their heads around the needs and demands of today's internet user. One said to me, "We have been seeing income from our Buy & Sell site dwindle quite rapidly over the past year." Which prompted me to ask, "What is your plan for contending with the likes of Craigslist?".........only one person in the crowd knew what Craigslist was.
Then it occurred to me: Real Estate is an 'old boy' industry. I mean...I've always known that the industry was full of roughnecks and old boys, but it never really hit home until tonight. Tonight, I realized that at some point, someone is going to really step up and shine by showing people the true power of digital media as it pertains to the Canadian Real Estate industry. Here lies before us a market untapped.
Goal, Result, Consequence - Picking Effective Strategies
I want to write a post about picking the most effective search strategies to achieve desired results. Further, I want to talk about whether one should go about achieving those results by employing whitehat practices or blackhat practices.
The problem is, I refuse to enter into the blackhat vs. whitehat debate and sound off as an advocate for either side. The two terms have become so clouded in ethical debate that it's impossible to convey the concepts that they actually represent. So, it's therefore impossible for me to write the post that I want to write.
Unless you agree to a proposition......
I propose we approach a specific SEO problem, and that for the sake of this post, we consider whitehat SEO as “organic search manipulation”, and blackhat SEO as “artificial search manipulation”. I propose, for the sake of this post, we approach the problem while forgetting the word “spam”; we forget all moral, ethical and professional objections we might have to artificial/organic search manipulation.
Instead, we'll simply think of the following: Goal, Result, Consequence.
Let's say we have a Real Estate lead-generating site which has a variety of high converting landing pages that target particular communities and geographic areas. These leads are later sold on, for a premium, to realtors/developers in those areas.
Our problem is this: The landing pages on the site don't rank well for their respective targets because there isn't much content and the entire site lacks authority. Let's also assume that we want to be certain that the moneymaking site itself doesn't appear questionable, and therefore it doesn't carry a lot of keyword-stuffed content.
So our Goal is: Deliver quality traffic to each lead generating landing page without relying on direct search engine traffic.
For this example, we need to funnel both link juice and quality traffic to those pages through intermediary sites. These networks of sites are the middle-man between the search engine and the money making site. Their job is to rank in the search engine and pass traffic onto their respective landing page targets.
So, there are two ways to approach this, organically or artificially.
The organic approach is to create many legitimate sites with content written by content writers, with each site's content targeted to its respective landing page. These sites have quality design, nice imagery and generally look entirely innocuous. Most of the links on each site should point to the target landing page (save for a few barely noticeable outbound links to related quality sites), and so should the onsite advertising (banners etc). This way it becomes pretty hard for the visitor to NOT end up at your lead generating page.
Normally the artificial approach would differ from the organic approach in that we wouldn't be hiring content writers, but rather, we'd be mass generating content that would target our keywords broadly. For the sake of generating Real Estate leads, we know that won't cut it, because we want to deliver high quality leads that convert into sales. Besides, since we know how well these leads can pay, we can still afford to hire content writers. So, in that sense, the artificial approach to this situation is much the same as the organic one in that we create sites with targeted human written content. But that's where the similarities end. Instead of creating sites that please the human eye, we employ IP delivery. We show bots a bare bones site with our targeted content. This site also has some respectable outgoing links and some images and a basic layout, so as to appear legitimate to the algorithm. When the site receives hits from IPs not identified as being bots, they are immediately transferred over to the lead generating landing page.
So, those are our two approaches. Now, how about the results?
The results for the organic search manipulation might play out something like this:
For every 1000 uniques to the organic intermediary site, we might achieve 500-600 visits to our landing page. At that point, how well they convert into leads depends on the performance of our landing page. These people who convert will be people who A) resonated with the search result displaying our intermediary site, B) resonated with the content of our intermediary site enough to click on something instead of bouncing and C) resonated with the landing page enough to fill out the lead form. That all adds up to very targeted leads. They've passed a three filter process and we've successfully funneled them into our database.
For the results of the artificial search manipulation we can expect that for every 1000 non-bot visitors to our intermediary site, we will achieve 1000 visits to our landing page. Once again, how well those 1000 convert into leads depends largely on the landing page. Whereas above, the people passed a three filter funnel, the artificial approach passes visitors through a two filter funnel: they resonate with the search listing and click, and then they submit the lead form. These leads will be less targeted than the leads generated organically. That said, there are many more leads generated through this approach.
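To put rough numbers on the two funnels, here is a back-of-the-napkin sketch. The 1000 visitors and the 50-60% versus 100% pass-through figures come from the paragraphs above; the landing page conversion rates (10% for the more targeted organic traffic, 8% for the less targeted artificial traffic) are placeholders I made up purely for illustration, and the function name is hypothetical.

```php
<?php
// The conversion rates used below are made-up placeholders, not figures from the post.

/**
 * Leads produced by a funnel: visitors to the intermediary site, the share
 * of them that reach the landing page, and the share of those that submit
 * the lead form.
 */
function estimate_leads(int $visitors, float $reachRate, float $conversionRate): int
{
    return (int) round($visitors * $reachRate * $conversionRate);
}

// Organic: 500-600 of every 1000 visitors click through to the landing page.
$organicLow  = estimate_leads(1000, 0.50, 0.10);
$organicHigh = estimate_leads(1000, 0.60, 0.10);

// Artificial: every non-bot visitor is sent straight to the landing page,
// but the traffic is less targeted, so a lower conversion rate is assumed.
$artificial = estimate_leads(1000, 1.00, 0.08);

printf("Organic: %d-%d leads per 1000 visitors\n", $organicLow, $organicHigh);
printf("Artificial: %d leads per 1000 visitors\n", $artificial);
```

With these placeholder rates the artificial funnel still comes out ahead on volume. Plug in your own landing page numbers, because the gap narrows quickly as the less targeted traffic converts worse.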
The consequences of the artificial method, in this case, are poor leads (if we care about our business reputation, we must deliver quality leads), and potential penalization for violating ToS. The costs and time required are relatively low compared to organically building site networks, but don't fool yourself: the degree of effort required is about the same.
The consequences of the organic method, in this case, are that we spend way more time playing groundskeeper to our farm of blogs (which definitely require upkeep!) but we at least deliver higher quality leads. We pay more out of pocket for the upkeep of our farm, and we have a much reduced, but still present, potential of being penalized by Google. (face it, they're out to get'cha)
At the end of the day, the choice is yours. Perhaps you can cheaply pre-qualify your leads before reselling them and so the artificial approach makes financial sense. Perhaps you deem the risk of penalization to be too big of a threat to your business, so you decide that the organic approach is for you.
What's important is that we place all our options on a level playing field, and choose the one best suited to our goal, with the most desirable results and the least negative consequences.
Taking Content Generation to the Next Level
A discussion on how best to generate content to pass Human inspection.
I just read two blog articles echoing the same sentiment.
One by Mark on Digerati, and one by SlightlyShady. These two articles both highlighted the most obvious epiphany one could glean from the Google Spam Docs: The algorithm cannot be perfect. Google needs a huge team to catch the spam that fools the algo. They need humans. And as SlightlyShady wrote, "Humans are easy to fool."
There are many aspects of a website that might indicate it is spam: design & layout, imagery, links (and linking patterns), site architecture, age, TLD, and of course, content. Content is probably the greatest stumbling block when it comes to creating sites that can pass human inspection.
We know how to create content that can fool the algorithm, but how do we create content that can fool Google's army of Monkeys at typewriters?
We can't expect to get away with markov content or synonym replacement; not for any respectable period of time anyways. After a while, if our sites eventually rank high enough to warrant human intervention, even scraped content is easy to detect as being duplicate.
A basic directive of the Google Spam Wranglers guidelines is to pare away all the scraped content, and if whatever is left is just ads, then it's most likely spam.
So how do we take content generation to the next level?
We need to create legible, syntactically correct content that a human can read and make sense of. The key to this is taking small distinct chunks of data and splicing them together with joiner words or phrases. Yes, I'm talking about madlibbing.
A madlib script can create legible content and have thousands of different iterations. It takes a lot more creativity to create a madlib script than it does to set up some feed scraper blog, but the extra time is an investment in the future. This content has great staying power over the long term, and if executed properly, it will never result in your site getting banned.
If you are really worried about the time and creativity required to write madlib scripts, hire a writer. Think of it like this: you could pay an article writer $5 to write a great article that you can use once, or you could pay an article writer $10 to write a great madlib script that you can use to create 1000 different articles.
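To show where the "thousands of different iterations" come from, here is a minimal sketch of the mechanics. The template, the phrase lists and both function names are hypothetical. The point is just that the number of possible texts is the product of the number of options in each slot.

```php
<?php
// Hypothetical template and phrase lists, for illustration only.

/**
 * Counts how many distinct texts a template can yield: the product of the
 * number of options available for each slot.
 */
function count_iterations(array $slots): int
{
    $total = 1;
    foreach ($slots as $options) {
        $total *= count($options);
    }
    return $total;
}

/**
 * Fills {slot} placeholders in the template with one option per slot.
 * $choices maps a slot name to an option index; slots without an entry
 * get a random option.
 */
function fill_template(string $template, array $slots, array $choices = []): string
{
    return preg_replace_callback('/\{(\w+)\}/', function (array $m) use ($slots, $choices) {
        $name = $m[1];
        if (!isset($slots[$name])) {
            return $m[0]; // leave unknown placeholders untouched
        }
        $options = $slots[$name];
        $index = $choices[$name] ?? random_int(0, count($options) - 1);
        return $options[$index];
    }, $template);
}

$template = '{opener}, {place} {verb} {feature}.';
$slots = [
    'opener'  => ['In recent years', 'These days', 'For many buyers'],
    'place'   => ['the east side', 'the waterfront'],
    'verb'    => ['offers', 'is known for', 'stands out for'],
    'feature' => ['quiet streets', 'easy transit access'],
];

echo count_iterations($slots), " possible sentences\n";
echo fill_template($template, $slots), "\n";
```

Even this toy sentence has 3 x 2 x 3 x 2 = 36 variations. Four slots with ten options each would give 10,000, which is why the upfront writing effort scales so well.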
Of course, there are more things than content to consider when planning to create spam that passes human inspection.
How about page Design?
You always need a template to spin blogs from, but you really kind of need an `un-template`; something highly versatile. WordPress does this well.
WordPress works well because you can easily switch themes. Create a WordPress install package that you can use on all your servers, and include a whole bunch of different themes. Also think about including a plugin like this one for rotating header images (hxxp://mhough.com/wordpress/2007/header-image-rotator-plugin/). This way, you can always have a different image that is not the theme's default image. Be sure to include a whole bunch of different images in your package!
I imagine the truly intrepid among you will rewrite the code for the default blogroll links in your install package.
Remember Google's directive of paring back content to see if just ads are leftover? Well here's a revelation: How about you don't include ads? The decision is really up to you, of course. You have to ask yourself why you are creating these sites...ad money or linking power?
And don't forget linking.
To quote an old post Birds, Bimbos, and Blog Networks:
I'd suggest breaking your network of blogs into chunks of say 5 -10 blogs, assigning an independent IP to each chunk. Consider interlinking the blogs in a chunk if each part of the chunk seems relevant to the others. Obviously, don't link to other chunks/other IP ranges. If you want to take caution a step further, be aware of not cross promoting links on each chunk; by back searching a sites links, your network can be laid open to public view.
Now that I read this quote, I'd also add that sometimes you actually want to link 4 or 5 chunks to the same URL, because you are going to need more linking power than just one chunk can provide. As long as you are sure that you aren't interlinking your chunks en masse, you can quarantine off parts of your network for other uses.
At any rate, these are just a few extraneous thoughts off the top of the dome. What I'm really focused on right now is the content angle, not so much the other factors. I have many more thoughts on the subject of creating content, and specifically on how to use the technique to maximum advantage. For now, though, I want to ask you guys: What do you think is the best way to create content that passes human inspection?
