SEO, Theory

Parasite Scraping

OK….to start, let me state the obvious:

The Intertube is overrun with scraper sites. They are everywhere, and many people are profiting from them. While they may be questionable in practice, they can still be very successful.

As you already know, volume is one of the keys to success in SEO. And scraper sites give you the ability to quickly and easily automate the production of a huge amount of copy for use in your niches. Some implementations of the scraper site are better than others. For example… this one isn’t bad: http://www.red78.net/loafers/

And here’s a pretty bad one: http://opensource.votio.com/php/forum/Loafers

What sets them apart is simply advertising integration… I gotta wonder if the second one is converting as well as the first one.

Anyways, this is very interesting and all, but I don’t just want to give you guys a generic overview of scraper sites… I want to talk about something called Parasite Scraping.

You see, all these BlackHatters out there are busting their ass creating scraper sites. They are coming up with the perfect Markov chains, creating huge synonym databases, scraping old cached websites, scraping Wikipedia, etc etc etc……

Point is, people are dedicating a huge amount of effort and processing power to creating scraper sites.

And amongst all this hype for scraper sites, here I am, and I’m thinking, “I hate dedicating effort to pretty much anything I find mundane, especially ‘huge’ amounts of it.”

So, what’s a lazy bastard like me to do?…… Scrape the Scrapers

Let’s dive right in and focus on the crappy one, shall we?

You see, the Crappy One, as we’re calling it, has this url: http://opensource.votio.com/php/forum/Loafers

By altering the final term of that URL (the keyword, Loafers), we can get a whole new set of free content, served up for us and easy to extract! Watch: (By the way, I program in Ruby… it’s pretty easy to read and follow along)
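To make the “altering the final term” bit concrete, here’s a rough sketch of building one URL per keyword. The keyword list and the scraper_url helper are made up for illustration, and I’m assuming the site accepts a percent-encoded keyword in that last path segment, which I haven’t confirmed.

```ruby
require 'erb'

# Base URL taken from the post; everything after the last slash is the keyword.
BASE_URL = "http://opensource.votio.com/php/forum/"

# Hypothetical helper: percent-encode the keyword and append it to the base URL.
# Assumption: the site accepts an encoded keyword as the final path segment.
def scraper_url(keyword)
  BASE_URL + ERB::Util.url_encode(keyword)
end

# Hypothetical keyword list for a shoe niche.
keywords = ["Loafers", "Boat Shoes", "Sandals"]
urls = keywords.map { |kw| scraper_url(kw) }
puts urls
```

Each of those URLs can then be fed to agent.get in place of the hard-coded Loafers one.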

It’s a simple process to extract the <td class="box"> at the top and bottom of the page. You’d do this with the CSS selector td.box (the XPath equivalent would be //td[@class="box"]).

This box contains tags relevant to the keyword (in this case: Loafers).

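require 'mechanize'

agent = Mechanize.new  # older Mechanize releases used WWW::Mechanize.new
doc = agent.get("http://opensource.votio.com/php/forum/Loafers")

# Grab the text of the tag box and split it on commas
tags = doc.search("td.box").inner_text
tags = tags.split(',')
tags.each do |tag|
    # Drop the "Tags:" label, collapse repeated spaces, trim the ends
    tag = tag.gsub(/(Tags:)/, '').squeeze(' ').strip
    puts tag
end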

**DEMO REMOVED**
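Since the demo is gone, here’s a little sketch of what that cleanup line does, run against a made-up string instead of the live page. The sample text is my own invention (not real output from the site), clean_tags is just a name I picked, and the reject step for empty entries is an extra that isn’t in the script above.

```ruby
# Hypothetical helper wrapping the same split / gsub / squeeze / strip steps.
def clean_tags(raw)
  raw.split(',')
     .map { |tag| tag.gsub(/Tags:/, '').squeeze(' ').strip }
     .reject(&:empty?)  # extra step: skip blanks left by stray commas
end

# Made-up sample of what the tag box text might look like.
sample = "Tags:  penny loafers,  leather   shoes , slip-ons,"
puts clean_tags(sample)
```

Each cleaned tag is itself a keyword, so the list can go straight back into the URL as a new final term.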

Now let’s get ourselves some free RSS links… shall we?

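require 'mechanize'

agent = Mechanize.new  # older Mechanize releases used WWW::Mechanize.new
doc = agent.get("http://opensource.votio.com/php/forum/Loafers")

# Every RSS link on the page carries the "rsslink" class
links = doc.search("a.rsslink")
links.each do |link|
    puts link.inner_text
    puts link['href']
end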

**DEMO REMOVED**

Annnnnnnnd to wrap things up, let’s really get evil:

require 'mechanize'

agent = Mechanize.new  # older Mechanize releases used WWW::Mechanize.new
doc = agent.get("http://opensource.votio.com/php/forum/Loafers")

# Pull the whole page, then swap the publisher ID in the ad snippet
page = doc.search("html").inner_html
page = page.gsub(/^google_ad_client = "pub-[0-9]+";/, 'google_ad_client = "pub-XXXXXXXXXXXXXXXXX";')
puts page

It doesn’t take a rocket scientist to tell what this does, **DEMO REMOVED**

So there you have it! Why continue to waste your time, money, and server resources?!? It’s a pain in the ass to set up a fleet of scraper sites. All the effort finding sources, creating wiki scrapers, scraping search results, even building templates and hard coding the structure of the site itself!! I have just shown you how to take a simple keyword-based scraper site and republish it under your own AdSense ID, using only 7 lines of code.

I want you guys to keep in mind that those two example websites are just the tip of the iceberg. Like everything else, creativity must be applied here. There are many websites that scrape content and run it through Markov chains, unlike the example sites, which just directly scrape RSS feeds and SERPs. So be on the lookout for good quality scraper sites!

In terms of the ethical issues surrounding parasite scraping… I think this post’s slug line says it best:
Two Wrongs CAN Make a Right

**UPDATE: I removed the demos and links, sorry guys**
