bsrgpfxpyy
What?:
I'm trying to get a page-to-page link map (matrix) of Wikipedia pages by \[code\]page_id\[/code\] in the following format:
\[code\]
from1 to1 to2 to3 ...
from2 to1 to2 to3 ...
...
\[/code\]
Why?:
I'm looking for a data set (pages from Wikipedia) to try out PageRank.
Problem:
At dumps.wikimedia.org it is possible to download pages-articles.xml, which is XML in this kind of format:
\[code\]
<page>
  <title>...</title>
  <id>...</id> // pageid
  <text>...</text>
</page>
\[/code\]
I will use it for retrieving the articles (\[code\]text\[/code\]). There is also the base per-page data (page.sql), which contains some details about pages by \[code\]page_id\[/code\], and the last one that seems relevant to me is pagelinks.sql, which contains page-to-page link records. The problem is that the \[code\]pagelinks\[/code\] table has the following fields: \[code\]pl_from\[/code\], \[code\]pl_namespace\[/code\] and \[code\]pl_title\[/code\], so the target of each link is given by title, not by \[code\]page_id\[/code\].
Idea:
Create a temporary database, import the \[code\]page\[/code\] and \[code\]pagelinks\[/code\] tables, and create this matrix from the \[code\]pagelinks\[/code\] table by looking up the \[code\]page_id\[/code\] for each \[code\]pl_title\[/code\].
Question:
Is there a place to get this kind of matrix of page-to-page links by \[code\]page_id\[/code\], so that I don't need to create it on my own? If not, is there a faster way to get this kind of matrix than the idea I've outlined?
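For illustration, the idea above can be sketched as a join between the two tables. This is only a minimal sketch under assumptions: it uses an in-memory SQLite database in place of the imported MySQL dump, the rows are made-up toy data, and it relies on the \[code\]page\[/code\] table's \[code\]page_namespace\[/code\] and \[code\]page_title\[/code\] columns to match \[code\]pl_namespace\[/code\] and \[code\]pl_title\[/code\]. The function names and the index name are hypothetical.

```python
import sqlite3
from collections import defaultdict


def build_link_map(conn):
    """Resolve (pl_namespace, pl_title) to page_id and group targets by source page."""
    query = """
        SELECT pl.pl_from, p.page_id
        FROM pagelinks AS pl
        JOIN page AS p
          ON p.page_namespace = pl.pl_namespace
         AND p.page_title = pl.pl_title
        ORDER BY pl.pl_from, p.page_id
    """
    link_map = defaultdict(list)
    for from_id, to_id in conn.execute(query):
        link_map[from_id].append(to_id)
    return dict(link_map)


def format_link_map(link_map):
    """Render one 'from to1 to2 ...' line per source page."""
    return [
        " ".join(str(n) for n in [src, *targets])
        for src, targets in sorted(link_map.items())
    ]


def demo():
    # Toy stand-in for the imported dump tables (assumption: reduced columns).
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER,
            page_title TEXT
        );
        CREATE TABLE pagelinks (
            pl_from INTEGER,
            pl_namespace INTEGER,
            pl_title TEXT
        );
        -- Hypothetical index name; it keeps the title lookup from scanning the table.
        CREATE UNIQUE INDEX page_ns_title ON page (page_namespace, page_title);
    """)
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?)",
        [(1, 0, "A"), (2, 0, "B"), (3, 0, "C")],
    )
    conn.executemany(
        "INSERT INTO pagelinks VALUES (?, ?, ?)",
        [(1, 0, "B"), (1, 0, "C"), (2, 0, "A"), (3, 0, "Missing")],
    )
    return format_link_map(build_link_map(conn))


if __name__ == "__main__":
    print("\n".join(demo()))
```

Because this is an inner join, links whose target title has no row in \[code\]page\[/code\] are dropped, so a page with only such links gets no line at all. The sketch also does not resolve redirects, and on a full dump the grouping would probably need to be streamed to a file rather than held in a dictionary.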