Understanding Site Indexing vs. Caching, Using Wikipedia vs. IF4IT as Examples

fullboar

New Member
Hello All,

Some background info...

We were profiling site indexing and caching for Wikipedia vs. IF4IT, and doing so uncovered some questions that I figured I'd bring to the group. I'm hoping some of you can provide insight into what we're seeing.

If you execute the command: "site:http://www.wikipedia.org" you should see varying numbers. For example, the last time I ran it, I got approximately 6,670 pages cached for Wikipedia. Another time I ran it (from a different browser on a different IP address), I got only 6 pages cached as a result.

When I perform the same command for IF4IT, "site:http://www.if4it.com", I get approximately 5,600 pages cached. The numbers fluctuate daily, but I've watched the count rise steadily over the last few months.

In the case of IF4IT, the numbers seem normal and follow what I would consider a predictable growth pattern, rising toward the total number of pages in the sitemap.xml file, which is currently around 11,000. However, I believe Wikipedia has many millions of pages and would therefore expect to see millions of pages indexed and cached.

QUESTION #1: Why wouldn't executing the "site" command for Wikipedia yield results in the many millions or, at least, hundreds of thousands?

QUESTION #2: I believe I read somewhere that a single sitemap file is capped at 50,000 URLs, but that Google will index and cache far more if you split your URLs across multiple sitemaps linked from a sitemap index file. Is this accurate? If so, can someone please point me to the reference article?
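To frame the question, my understanding of "separately linked sitemaps" is a sitemap index file that points at multiple child sitemaps, each under the 50,000-URL cap. A rough sketch of what I mean, following the sitemaps.org format (the domain and file names here are made up):

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>http://www.example.com/sitemap-pages-1.xml</loc>
      </sitemap>
      <sitemap>
        <loc>http://www.example.com/sitemap-pages-2.xml</loc>
      </sitemap>
    </sitemapindex>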

QUESTION #3: Is there a difference between the number of pages indexed and the number of pages cached? I would think so: crawling a page and registering it in the master index is technically very different from locally caching the page content (text, images, formatting, etc.) to facilitate speed. If there is a difference, does the "site:http://www.mysite" command report what's been indexed OR what's been cached?
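For what it's worth, Google also exposes a separate cache: operator, which makes me suspect the two concepts are at least surfaced differently (the page URL below is just an example):

    site:www.if4it.com
    (returns the list of pages Google has indexed for the site)

    cache:www.if4it.com/index.html
    (shows Google's stored copy of that single page, if one exists)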

Thanks very much, in advance, for your help.

My Best,

Frank

Quote: Originally Posted by Guerino1
If you execute the command: "site:http://www.wikipedia.org" you should see varying numbers. For example, the last time I ran it, I got approximately 6,670 pages cached for Wikipedia. Another time I ran it (from a different browser on a different IP address), I got only 6 pages cached as a result.

Try modifying your command to "site:wikipedia.org" and I think the results you'll see will be more along the lines of what you were expecting.

Quote: Originally Posted by ~CReed
Try modifying your command to "site:wikipedia.org" and I think the results you'll see will be more along the lines of what you were expecting.

Hi CReed,

Thanks for the info. I just reran it the way you recommended and got 38,000,000. Huge difference.
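Side by side, the two forms return wildly different counts for me (numbers are from my own runs, and they clearly vary):

    site:http://www.wikipedia.org   ->  ~6,670 results (and, once, only 6)
    site:wikipedia.org              ->  ~38,000,000 results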

Would you happen to know why there's a difference between www.sitename vs. just sitename? Shouldn't they yield the same results?

My Best,

Frank

Hi Frank,

I'm not sure why Google would index Wikipedia the way it has - the non-www version redirects to the www.wikipedia.org version, so you'd think they would index it that way.

I have seen numerous sites show different results for the site: search, but it was usually because both the www and non-www versions resolved (i.e., neither redirected to the other).
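When that's the cause, the usual fix is a 301 redirect to a single canonical hostname. A minimal sketch for an Apache .htaccess (example.com is a placeholder, and this assumes mod_rewrite is enabled):

    RewriteEngine On
    # If the request arrived on the bare hostname, permanently
    # redirect it to the www hostname so only one version is indexed.
    RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
    RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]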

Surprised you're only getting 38,000,000 results - I'm seeing 186,000,000 results. We must be hitting different data centers.
 