Thin Content Checker

Domain:

Optional Arguments

About the Thin Content Checker

Thin content is content with little added value. Search engines tend to penalize these less valuable pages in the search results. One approach to avoiding thin content is to pay attention to page word count. The standard rule of thumb is a minimum of 200-300 words per page. The right threshold for you may differ. Quality always trumps quantity.
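The word count rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not the checker's actual code: the names MIN_WORDS, word_count and is_thin are hypothetical, and the threshold is assumed to be the low end of the 200-300 word range.

```python
import re

# Assumed threshold: the low end of the 200-300 word rule of thumb.
MIN_WORDS = 200

def word_count(text: str) -> int:
    """Count words, treating "don't" and "well-known" as one word each."""
    return len(re.findall(r"\w+(?:['’-]\w+)*", text))

def is_thin(text: str, min_words: int = MIN_WORDS) -> bool:
    """Flag text that falls below the chosen minimum word count."""
    return word_count(text) < min_words

if __name__ == "__main__":
    sample = "Quality always trumps quantity."
    print(word_count(sample), is_thin(sample))
```

A word count is only a first-pass signal. A page that clears the threshold can still add little value, so adjust min_words to suit your own site.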

About the Spider
In order for this tool to work, we must crawl the site or page you want analyzed. We do this with DatayzeBot, the datayze spider.

Our spider crawls at a leisurely rate of 1 page every 1.5 seconds. While the spider doesn't keep track of the contents of the pages it crawls, it does keep track of the number of requests issued by each visitor. Currently the crawler is limited to 1000 pages per user per day. Since DatayzeBot does not index or cache any pages it crawls, rerunning the Thin Content Checker will count against your daily allowance of page crawls. If your site is larger than the cap, you can pause the crawler and resume it another day.
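To get a feel for these limits, here is a rough back-of-the-envelope sketch. It assumes the delay between requests is exactly 1.5 seconds and ignores the time each page takes to download. The function name crawl_plan is hypothetical.

```python
import math

SECONDS_PER_PAGE = 1.5   # crawl rate stated above
DAILY_PAGE_CAP = 1000    # per-user daily limit stated above

def crawl_plan(total_pages: int) -> tuple[int, float]:
    """Return (days needed under the daily cap, total crawl time in minutes)."""
    days = math.ceil(total_pages / DAILY_PAGE_CAP)
    minutes = total_pages * SECONDS_PER_PAGE / 60
    return days, minutes

if __name__ == "__main__":
    print(crawl_plan(1000))   # (1, 25.0)
    print(crawl_plan(2500))   # (3, 62.5)
```

In other words, a full day's allowance takes about 25 minutes of crawling, and a 2500-page site would need to be spread over three days.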

DatayzeBot now respects the robots exclusion standard. To specifically allow (or disallow) the crawler to access a page or directory, create a new set of rules for "DatayzeBot" in your robots.txt file. DatayzeBot will follow the longest matching rule for a specified page, rather than the first matching rule. If no matching rule is found, DatayzeBot assumes it is allowed to crawl the page. Not sure if a page is excluded by your robots.txt file? The Index/No Index app will parse HTML headers, meta tags and robots.txt and summarize the results for you.
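The longest-match behaviour described above can be illustrated with a simplified sketch. This is not DatayzeBot's own code. It assumes plain prefix matching with no wildcards, and it assumes a tie between equally long rules goes to "allow". The function name is_allowed and the example paths are hypothetical.

```python
# Example robots.txt group that the rules below stand in for:
#
#   User-agent: DatayzeBot
#   Disallow: /private/
#   Allow: /private/press/

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules holds (directive, prefix) pairs; directive is "allow" or "disallow"."""
    best_len = -1
    allowed = True  # no matching rule means the page may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_for_allow:  # assumption: ties favour "allow"
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed

if __name__ == "__main__":
    rules = [("disallow", "/private/"), ("allow", "/private/press/")]
    for path in ("/private/press/2024.html", "/private/notes.html", "/blog/"):
        print(path, is_allowed(path, rules))
```

Under a first-match policy the press page would be blocked by the earlier Disallow line. Under longest-match the more specific Allow rule wins, so rule order in the file does not matter.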

Interested in web development? Try our other tools, like the Site Navigability Analyzer, which lets you see what a spider sees. It can analyze your anchor text diversity and find the length of the shortest path to any page. The Site Validator can summarize the types of HTML errors on your site, as well as provide a page-by-page breakdown. Web developers often need to know which of their pages are being indexed and which are not, so we created the Sitemap Index Analyzer.

Parameters to Crawl
Some URL parameters can change page content. Which parameters should the spider pay attention to when crawling? (Comma or new line separated.)

Directories and URLs to Exclude
Excluding pages can reduce the load on the crawler and keep you from reaching the URL cap, so you can analyze more of your sites. Enter the full path, or a substring, of the URLs you wish to exclude. (Comma or new line separated.)

Specific Elements
Long headers, footers, or menus can make a page appear to have more content than it actually has. Limiting the Thin Content Analyzer to specific HTML elements, by specifying element types, class names (e.g. ".text") or ids (e.g. "#content"), makes the word count more closely match the true amount of content on each page. (Comma or new line separated.) Elements will be included if they match any of the criteria. For example, "p, #page" will return all the text present in any p elements, as well as the contents of the element with id "page".
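As an illustration of the Specific Elements option, the sketch below counts only the words inside elements matching a list of simple selectors. It uses Python's standard html.parser module and is not the tool's actual implementation. It assumes each selector is a bare element type, a ".class" or an "#id". The names SelectedText and selected_word_count are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked as open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class SelectedText(HTMLParser):
    """Collect text found inside any element matching one of the selectors."""

    def __init__(self, selectors: str):
        super().__init__()
        raw = selectors.replace("\n", ",").split(",")
        self.selectors = [s.strip() for s in raw if s.strip()]
        self.stack = []    # (tag, matched) for each open element
        self.inside = 0    # number of open elements that match a selector
        self.chunks = []

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        for sel in self.selectors:
            if sel.startswith("#"):
                if attrs.get("id") == sel[1:]:
                    return True
            elif sel.startswith("."):
                if sel[1:] in classes:
                    return True
            elif sel.lower() == tag:
                return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        matched = self._matches(tag, attrs)
        self.stack.append((tag, matched))
        self.inside += matched

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return  # stray closing tag
        while self.stack:  # also closes elements whose end tag was omitted
            open_tag, matched = self.stack.pop()
            self.inside -= matched
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text is stored once, even when several enclosing elements match.
        if self.inside and self.stack[-1][0] not in ("script", "style"):
            self.chunks.append(data)

def selected_word_count(html: str, selectors: str) -> int:
    parser = SelectedText(selectors)
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())

if __name__ == "__main__":
    page = ('<div id="menu">Home About Contact</div>'
            '<p>First paragraph here.</p>'
            '<div id="page">Main body text <br> continues.</div>')
    print(selected_word_count(page, "p, #page"))
```

With the selectors "p, #page" the menu text is ignored, so only the paragraph and the "page" element contribute to the count.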
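The Parameters to Crawl and Directories and URLs to Exclude options can be approximated with Python's standard urllib.parse module. This is a sketch under assumptions, not the crawler's real logic. It assumes that parameters not listed are dropped when deciding whether two URLs are the same page, and that exclusions are plain substring matches. The names split_list, canonical_url and is_excluded, and the example.com URLs, are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def split_list(raw: str) -> list[str]:
    """Parse a comma- or newline-separated field into trimmed entries."""
    return [item.strip() for item in raw.replace("\n", ",").split(",") if item.strip()]

def canonical_url(url: str, keep_params: set[str]) -> str:
    """Drop the fragment and every query parameter not listed in keep_params."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k in keep_params]
    kept.sort()  # make the result independent of parameter order
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def is_excluded(url: str, exclusions: list[str]) -> bool:
    """True if any exclusion entry appears anywhere in the URL."""
    return any(pattern in url for pattern in exclusions)

if __name__ == "__main__":
    keep = set(split_list("page, id"))
    skip = split_list("/tag/\n/print/")
    for url in ("https://example.com/list?utm_source=x&page=2#top",
                "https://example.com/tag/python"):
        if is_excluded(url, skip):
            print("skip ", url)
        else:
            print("crawl", canonical_url(url, keep))
```

Canonicalizing URLs this way keeps tracking parameters from producing duplicate pages that would each count against the daily crawl cap.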