Schinke Latin stemming algorithm in PHP

paulkangaroo · Sep 13, 2012

This website offers the "Schinke Latin stemming algorithm" for download, for use in the Snowball stemming system. I want to use this algorithm, but I don't want to use Snowball.

The good thing: there's some pseudocode on that page which you could translate into a PHP function. This is what I've tried:

\[code\]
<?php
function stemLatin($word) {
    // Returns array(NOUN-BASED STEM, VERB-BASED STEM)

    // Words ending in -que that keep the -que
    $queWords = array('atque', 'quoque', 'neque', 'itaque', 'absque', 'apsque', 'abusque', 'adaeque', 'adusque', 'denique', 'deque', 'susque', 'oblique', 'peraeque', 'plenisque', 'quandoque', 'quisque', 'quaeque', 'cuiusque', 'cuique', 'quemque', 'quamque', 'quaque', 'quique', 'quorumque', 'quarumque', 'quibusque', 'quosque', 'quasque', 'quotusquisque', 'quousque', 'ubique', 'undique', 'usque', 'uterque', 'utique', 'utroque', 'utribique', 'torque', 'coque', 'concoque', 'contorque', 'detorque', 'decoque', 'excoque', 'extorque', 'obtorque', 'optorque', 'retorque', 'recoque', 'attorque', 'incoque', 'intorque', 'praetorque');
    // List A: noun suffixes, longest first
    $suffixesA = array('ibus', 'ius', 'ae', 'am', 'as', 'em', 'es', 'ia', 'is', 'nt', 'os', 'ud', 'um', 'us', 'a', 'e', 'i', 'o', 'u');
    // List B: verb suffixes, longest first
    $suffixesB = array('iuntur', 'beris', 'erunt', 'untur', 'iunt', 'mini', 'ntur', 'stis', 'bor', 'ero', 'mur', 'mus', 'ris', 'sti', 'tis', 'tur', 'unt', 'bo', 'ns', 'nt', 'ri', 'm', 'r', 's', 't');

    $word = strtolower(trim($word));        // lowercase + strip surrounding whitespace
    $word = str_replace('j', 'i', $word);   // replace all <j> by <i>
    $word = str_replace('v', 'u', $word);   // replace all <v> by <u>

    if (substr($word, -3) === 'que') {      // word ends with -que
        if (in_array($word, $queWords)) {
            return array($word, $word);     // queWord is both noun-based and verb-based stem
        }
        $word = substr($word, 0, -3);       // remove the -que
    }

    // Noun-based stem: remove one suffix from list A
    $nounBased = $word;
    foreach ($suffixesA as $suffixA) {
        if (substr($nounBased, -strlen($suffixA)) === $suffixA) {
            $nounBased = substr($nounBased, 0, -strlen($suffixA));
            break;                          // remove only one suffix
        }
    }
    if (strlen($nounBased) < 2) { $nounBased = ''; }   // keep only stems of two or more characters

    // Verb-based stem: starts again from the word after -que removal,
    // not from the noun-based stem
    $verbBased = $word;
    foreach ($suffixesB as $suffixB) {
        if (substr($verbBased, -strlen($suffixB)) === $suffixB) {
            $base = substr($verbBased, 0, -strlen($suffixB));
            switch ($suffixB) {
                case 'iuntur': case 'erunt': case 'untur': case 'iunt': case 'unt':
                    $verbBased = $base.'i';   break;   // replace suffix by <i>
                case 'beris': case 'bor': case 'bo':
                    $verbBased = $base.'bi';  break;   // replace suffix by <bi>
                case 'ero':
                    $verbBased = $base.'eri'; break;   // replace suffix by <eri>
                default:
                    $verbBased = $base;                // remove the suffix
            }
            break;                          // handle only one suffix
        }
    }
    if (strlen($verbBased) < 2) { $verbBased = ''; }   // keep only stems of two or more characters

    return array($nounBased, $verbBased);
}
?>
\[/code\]

My questions:
1) Will this code work correctly? Does it follow the algorithm's rules?
2) How could you improve the code (performance)?

Thank you very much in advance!
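
On question 2, one possible direction is a table-driven version of the list B step, so that the suffix and its replacement live in one place instead of a loop plus a switch. The following is only a sketch under assumptions: `stemVerbSuffix` is a hypothetical helper name (not from the Snowball page), it handles only the verb suffix step, and it relies on the table being kept in longest-first order so that the first match is also the longest.

```php
<?php
// Hypothetical helper: applies the list B (verb) suffix rule to one word.
// Keys are suffixes in longest-first order; values are their replacements.
function stemVerbSuffix(string $word): string {
    static $map = [
        'iuntur' => 'i', 'beris' => 'bi', 'erunt' => 'i', 'untur' => 'i',
        'iunt' => 'i', 'mini' => '', 'ntur' => '', 'stis' => '',
        'bor' => 'bi', 'ero' => 'eri', 'mur' => '', 'mus' => '',
        'ris' => '', 'sti' => '', 'tis' => '', 'tur' => '', 'unt' => 'i',
        'bo' => 'bi', 'ns' => '', 'nt' => '', 'ri' => '',
        'm' => '', 'r' => '', 's' => '', 't' => '',
    ];
    foreach ($map as $suffix => $replacement) {
        $len = strlen($suffix);
        // Compare only the tail of the word against the suffix
        if (strlen($word) >= $len && substr_compare($word, $suffix, -$len) === 0) {
            return substr($word, 0, -$len) . $replacement;
        }
    }
    return $word; // no suffix from the table matched
}
```

With this sketch, `stemVerbSuffix('apparebunt')` gives `'apparebi'` and `stemVerbSuffix('laudabo')` gives `'laudabi'`. The two-character minimum and the -que handling would still have to be applied around it.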
