PHP implementation of a Bayes classifier: Assign topics to texts

bousegastut

In my news page project, I have a database table news with the following structure:

\[code\]
 - id: [integer] unique number identifying the news entry, e.g.: *1983*
 - title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
 - topic: [string] category which should be chosen by the classifier, e.g.: *Sports*
\[/code\]

Additionally, there's a table bayes with information about word frequencies:

\[code\]
 - word: [string] a word which the frequencies are given for, e.g.: *real estate*
 - topic: [string] same content as "topic" field above, e.g.: *Economics*
 - count: [integer] number of occurrences of "word" in "topic" (incremented when new documents go to "topic"), e.g.: *100*
\[/code\]

Now I want my PHP script to classify all news entries and assign one of several possible categories (topics) to them.

Is this the correct implementation? Can you improve it?

\[code\]
<?php
include 'mysqlLogin.php';

$get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
$get2 = mysql_abfrage($get1);

// pTOPICS BEGIN
$pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
$pTopics2 = mysql_abfrage($pTopics1);
$pTopics = array();
while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
    $pTopics[$pTopics3['topic']] = $pTopics3['count'];
}
// pTOPICS END

// pWORDS BEGIN
$pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
$pWords2 = mysql_abfrage($pWords1);
$pWords = array();
while ($pWords3 = mysql_fetch_assoc($pWords2)) {
    if (!isset($pWords[$pWords3['topic']])) {
        $pWords[$pWords3['topic']] = array();
    }
    $pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
}
// pWORDS END

while ($get3 = mysql_fetch_assoc($get2)) {
    $pTextInTopics = array();
    $tokens = tokenizer($get3['title']);
    foreach ($pTopics as $topic=>$documentsInTopic) {
        if (!isset($pTextInTopics[$topic])) {
            $pTextInTopics[$topic] = 1;
        }
        foreach ($tokens as $token) {
            echo '....'.$token;
            if (isset($pWords[$topic][$token])) {
                $pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
            }
        }
        $pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
    }
    asort($pTextInTopics); // pick topic with lowest value
    if ($chosenTopic = each($pTextInTopics)) {
        echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
    }
}
?>
\[/code\]

Training is done manually and is not included in this code. If the text "You can make money if you sell real estates" is assigned to the category/topic "Economics", then all words (you, can, make, ...) are inserted into the table bayes with "Economics" as the topic and an initial count of 1. If the word is already there in combination with the same topic, the count is incremented.

Sample learning data:

word       topic       count
kaczynski  Politics    1
sony       Technology  1
bank       Economics   1
phone      Technology  1
sony       Economics   3
ericsson   Technology  2

Sample output/result:

Title of the text: Phone test Sony Ericsson Aspen - sensitive Winberry
Politics....phone....test....sony....ericsson....aspen....sensitive....winberry
Technology....phone FOUND....test....sony FOUND....ericsson FOUND....aspen....sensitive....winberry
Economics....phone....test....sony FOUND....ericsson....aspen....sensitive....winberry
Result: The text belongs to topic Technology with a likelihood of 0.013888888888889

Thank you very much in advance!
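
For comparison, here is a minimal sketch of how the same scoring might look with add-one (Laplace) smoothing and summed logarithms, so that a word missing from a topic lowers that topic's score instead of being skipped, and the highest score wins. This is not the code from the post above: tokenizeTitle() and classifyTokens() are hypothetical names, the tokenizer is only a stand-in for the tokenizer() function that is not shown, and the counts are the sample learning data hard-coded into an array instead of being read from the bayes table. The prior is still estimated from word totals per topic, as in the posted code.

```php
<?php
// Stand-in for the tokenizer() used in the post (its real behavior is unknown):
// lowercases the text and splits on anything that is not a-z or 0-9.
function tokenizeTitle(string $text): array
{
    $parts = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

// $wordCounts has the shape [topic => [word => count]], like $pWords in the post.
// Returns [topic => log score], sorted so that the best topic comes first.
function classifyTokens(array $tokens, array $wordCounts): array
{
    // Vocabulary size over all topics, needed for add-one smoothing.
    $vocabulary = [];
    foreach ($wordCounts as $words) {
        $vocabulary += $words; // array union: collects the distinct words as keys
    }
    $vocabularySize = count($vocabulary);

    // Total word count per topic and overall.
    $topicTotals = array_map('array_sum', $wordCounts);
    $grandTotal = array_sum($topicTotals);

    $scores = [];
    foreach ($wordCounts as $topic => $words) {
        // Assumption: prior taken from word totals, not from document counts.
        $score = log($topicTotals[$topic] / $grandTotal);
        foreach ($tokens as $token) {
            $count = $words[$token] ?? 0; // unseen words count as 0, smoothed below
            $score += log(($count + 1) / ($topicTotals[$topic] + $vocabularySize));
        }
        $scores[$topic] = $score;
    }

    arsort($scores); // highest (least negative) log score first
    return $scores;
}

// Sample learning data from the post.
$sampleCounts = [
    'Politics'   => ['kaczynski' => 1],
    'Technology' => ['sony' => 1, 'phone' => 1, 'ericsson' => 2],
    'Economics'  => ['bank' => 1, 'sony' => 3],
];

$title = 'Phone test Sony Ericsson Aspen - sensitive Winberry';
$scores = classifyTokens(tokenizeTitle($title), $sampleCounts);
echo 'Best topic: ' . array_keys($scores)[0] . "\n";
```

With the sample learning data, Technology ends up first in the sorted result. Summing logarithms instead of multiplying probabilities avoids underflow on longer texts; the returned values are log scores for ranking, not likelihoods between 0 and 1.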
 