OK….to start, let me state the obvious:
The Intertube is overrun with scraper sites. They are everywhere, and many people are profiting from them. While they may be questionable in practice, they can still be very successful.
As you already know, volume is one of the keys to success in SEO. And scraper sites give you the ability to quickly and easily automate the production of a huge amount of copy for use in your niches. Some implementations of the scraper site are better than others. For example…..this one isn’t bad: http://www.red78.net/loafers/
And here’s a pretty bad one: http://opensource.votio.com/php/forum/Loafers
What sets them apart is simply advertising integration….I gotta wonder if it is converting as well as the first one.
Anyways, this is very interesting and all, but I don’t just want to give an overview of generic scraper sites for you guys…I want to talk about something called Parasite Scraping.
You see, all these BlackHatters out there are busting their asses creating scraper sites. They are coming up with the perfect Markov Chains, creating huge synonym databases, scraping old cached websites, scraping Wikipedia, etc etc etc……
Point is, people are dedicating a huge amount of effort, and processing power to create scraper sites.
And amongst all this hype for scraper sites, here I am; and I’m thinking, “I hate dedicating effort to pretty much anything I find mundane, especially ‘huge’ amounts of it.”
So, what’s a lazy bastard like me to do?……Scrape the Scrapers
Let’s dive right in and focus on the crappy one, shall we?
You see, the Crappy One, as we’re calling it, has this url: http://opensource.votio.com/php/forum/Loafers
By altering this final term, we can get a whole new set of free content, served up for us and easy to extract! Watch: (By the way, I program in Ruby…it’s pretty easy to read and follow along)
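Swapping that final term can be sketched in a couple of lines. This is only a sketch: the `scraper_url` helper and the `BASE` constant are names made up for illustration, and percent-encoding the keyword is an assumption about what the site accepts (the URLs seen so far only use plain words and hyphens).

```ruby
require 'erb'

# Base path taken from the example URL; everything after it is the keyword.
BASE = "http://opensource.votio.com/php/forum/"

# Hypothetical helper: build the page URL for an arbitrary keyword.
# Percent-encoding is an assumption, not something the site documents.
def scraper_url(keyword)
  BASE + ERB::Util.url_encode(keyword)
end

puts scraper_url("Loafers")
puts scraper_url("poker-gambling")
```

Letters, digits and hyphens pass through `url_encode` untouched, so single-word and hyphenated keywords come out exactly as typed.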
It would be a simple process to extract the <td class="box"> at the top and bottom of the page. You’d do this with the CSS selector td.box (roughly //td[@class='box'] in XPath).
This box contains tags relevant to the keyword (in this case: Loafers)
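require 'mechanize'

# Mechanize 2.x exposes the class as Mechanize (older releases used WWW::Mechanize)
agent = Mechanize.new
doc = agent.get("http://opensource.votio.com/php/forum/Loafers")

# Grab the text of every td.box and split it into individual tags
tags = doc.search("td.box").inner_text
tags = tags.split(',')
tags.each do |tag|
  # Drop the "Tags:" label and collapse stray whitespace
  tag = tag.gsub(/(Tags:)/, '').squeeze(' ').strip
  puts tag
end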
**DEMO REMOVED**
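With the live demo gone, the cleanup step can still be tried offline. A minimal sketch, assuming the box text looks something like the made-up sample string below (the real page's format may differ); `clean_tags` is a hypothetical helper name:

```ruby
# Hypothetical sample of what the td.box text might look like.
sample = "Tags:  penny loafers,  leather   loafers , mens loafers"

# Split on commas, drop the "Tags:" label, collapse repeated spaces, trim.
def clean_tags(raw)
  raw.split(',').map do |tag|
    tag.gsub(/Tags:/, '').squeeze(' ').strip
  end
end

puts clean_tags(sample)
```

Returning an array instead of printing inside the loop makes the cleaned tags easy to reuse as keywords for further requests.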
Now let’s get ourselves some free RSS links…shall we?
require 'mechanize'

agent = Mechanize.new
doc = agent.get("http://opensource.votio.com/php/forum/Loafers")

# Print the title and URL of every RSS link on the page
links = doc.search("a.rsslink")
links.each do |link|
  puts link.inner_text
  puts link[:href]
end
**DEMO REMOVED**
Annnnnnnnd to wrap things up, let’s really get evil:
require 'mechanize'

agent = Mechanize.new
doc = agent.get("http://opensource.votio.com/php/forum/Loafers")
page = doc.search("html").inner_html

# Swap the publisher ID in the ad snippet for a placeholder
page = page.gsub(/^google_ad_client = "pub-[0-9]+";/, 'google_ad_client = "pub-XXXXXXXXXXXXXXXXX";')
puts page
It doesn’t take a rocket scientist to tell what this does, **DEMO REMOVED**
So there you have it! Why continue to waste your time, money, and server resources?!? It’s a pain in the ass to set up a fleet of scraper sites. All the effort finding sources, creating wiki scrapers, scraping search results, even building templates and hard coding the structure of the site itself!! I have just shown you how to take a simple keyword-based scraper site and republish it under your own AdSense ID, using only 7 lines of code.
I want you guys to keep in mind that those two example websites are just the tip of the iceberg. Like everything else, creativity must be applied here. There are many websites that scrape and markov content, unlike the example sites that just directly scraped rss feeds and serps. So be on the lookout for good quality scraper sites!
In terms of the ethical issues surrounding parasite scraping….I think this post’s slug-line says it best:
Two Wrongs CAN Make a Right
**UPDATE: I removed the demos and links, sorry guys**

Isn't it kind of risky to have that kind of site? –> http://opensource.votio.com/php/forum/poker-gambling
Yeah, as we're talking about here, it is definitely easy to exploit scraper sites….not even just ones that have the scraped query in the URL through mod_rewrite…..there are other techniques to exploit other kinds of scrapers….but that's fodder for a later post.
There are tons of sites like this if you know how to find them
Thanks for this post. I have been reading about scraping recently and I'm trying to learn how to do it. This is interesting – scraping the scrapers… I like it.
This is a newbie question, but how do you implement the code examples above on a webpage to show the scraped content?
That is Ruby. It's not as simple to run as just copy and pasting…..It requires its own interpreter program to execute the code, and it is not standard on many hosts. Do some research.
Thanks for the response, off to my friend google I go
Good post, hopefully all the newbies will be able to benefit off of everyone else's Wordze accounts.
Hehe.
I just got finished writing my own content generation program that is similar, the first example scraper site is really good. That is a good blackhatter, and the sites are similar to mine (very clean and neat looking template).
I have been pretty sick of the spammiest looking spam sites you can ever imagine coming up in a G search. I'd also be surprised if those sites converted at all, for clicks or for affiliate programs. They just look so shitty.
The secret is to look neat, and have your PPC ads above the fold. Easy.
Whenever I write a post in the wee hours of the morning (like this one), I invariably forget to mention some key things….so here's a very quick run-down of the key things you should also think about…
Be anonymous. Proxy. Rotate IPs if possible. Install tor on your server if you have to…..
You shouldn't do anything half-assed, especially this….don't forget to consider all aspects of your code that might leak your site's IP or other identifying features.
So be slick by remembering to pay attention to how you display the CSS for a page….I usually have my scraper scrape all pages ending in .css and then copy them into <style> tags in the header of the page I am scraping.
Always filter your html to avoid opening yourself to XSS
Keep in mind that the people you are going to be targeting will generally have no qualms about fucking with you if they catch you. It wouldn't be hard for them to feed your scraper malicious code.
Oh…that reminds me….I didn't proxy those examples…..oh shit..uhhhhhhhhhhhhhh……
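As a minimal sketch of that filtering, assuming you only republish scraped text and not its markup, everything can be escaped before it hits your page. The `escape_scraped` helper and the sample payload are made up for illustration; keeping some of the original HTML would need a real sanitizer with a whitelist instead.

```ruby
require 'erb'

# Hypothetical helper: neutralize any markup in a scraped text fragment
# so it renders as literal text instead of executing in the browser.
def escape_scraped(text)
  ERB::Util.html_escape(text)
end

# Made-up example of a hostile payload planted for scrapers.
scraped = %q{Loafers <script>alert("gotcha")</script>}
puts escape_scraped(scraped)
```

The script tag comes out as `&lt;script&gt;…`, so the browser displays it rather than running it.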
# bus error
You mentioned installing Tor..
Do you know of any sites that give instructions for installing Tor on a linux server, so that I can run my scripts a "bit" more secretly..
Hoping you can help
Bruce
Bruce, my suggestion is that you read all the help files on the tor website.
There are so many ways to set up an install on linux depending on the environment, so I suggest you read the doc files.
Make sure you have libevent installed on your machine (that was a stumbling block for me at first)
Another way to avoid detection would be to do your scraping off-site on a test server, injecting content into your own db, then uploading to your own sites. It's also a good workaround if you don't have VPSes or your own dedicated servers yet. Can you say WP splogs?