The best way to understand this part is to read the article: A Statistical Approach to the Spam Problem.
Otherwise, this section would amount to a translation of a translation...
Learning uses the two steps seen before (tokenization and storage). The aim is to build a representative base of pages from both the allowed and the forbidden categories. We need a relatively large number of these pages for the learning to be useful, and so that the stored data can cover a correct (...)
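As a rough illustration of the learning step, the sketch below counts token occurrences per category over a small set of labeled pages. The function names (`tokenize`, `learn`), the category labels, and the page format are assumptions made for this example, and the regex tokenizer is only a stand-in for the real tokenization step.

```python
import re
from collections import Counter

CATEGORIES = ("allowed", "forbidden")  # assumed labels for this sketch


def tokenize(text):
    # Stand-in tokenizer (assumption): lowercase words of 3+ letters.
    return re.findall(r"[a-z]{3,}", text.lower())


def learn(pages):
    """Build per-category token counts from (text, category) pairs."""
    counts = {category: Counter() for category in CATEGORIES}
    totals = {category: 0 for category in CATEGORIES}
    for text, category in pages:
        # Add this page's tokens to the counts of its category.
        counts[category].update(tokenize(text))
        # Track how many pages were learned per category.
        totals[category] += 1
    return counts, totals
```

Calling `learn` with a list of `(text, category)` pairs returns the token counts and the number of pages seen for each category, which is the kind of data a later scoring step would consult.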
When retrieving the tokens of each web page, we need to store them: temporarily during the learning step and the scoring step, in order to process them, and permanently, together with their number of occurrences, when building the learning database.
The best way is to store them (...)
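The storage structure preferred by the original text is cut off above, so the following is only one possible arrangement, not the author's method: an in-memory `Counter` for temporary storage and an SQLite table for the permanent learning database. The table layout and the function names (`open_store`, `save_counts`, `occurrences`) are assumptions for this sketch.

```python
import sqlite3
from collections import Counter


def open_store(path=":memory:"):
    # Open (or create) the permanent store; ":memory:" keeps it in RAM for testing.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        "token TEXT NOT NULL, category TEXT NOT NULL, "
        "occurrences INTEGER NOT NULL DEFAULT 0, "
        "PRIMARY KEY (token, category))"
    )
    return conn


def save_counts(conn, category, counts):
    # Merge temporary counts into the permanent table in one transaction.
    with conn:
        for token, n in counts.items():
            conn.execute(
                "INSERT OR IGNORE INTO tokens (token, category) VALUES (?, ?)",
                (token, category),
            )
            conn.execute(
                "UPDATE tokens SET occurrences = occurrences + ? "
                "WHERE token = ? AND category = ?",
                (n, token, category),
            )


def occurrences(conn, token, category):
    # Return the stored number of occurrences, or 0 for an unknown token.
    row = conn.execute(
        "SELECT occurrences FROM tokens WHERE token = ? AND category = ?",
        (token, category),
    ).fetchone()
    return row[0] if row else 0
```

A page's tokens can be accumulated in a `Counter` while the page is processed, then passed to `save_counts` once the page is done, so the database is only touched once per page.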
The principle of tokenization is to cut the HTML code of a web page into significant parts (tokens). To determine what is significant in the content of a web page, we need to analyse it.
Example: http://www.cplair.com/
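A minimal sketch of this idea, assuming that "significant" means the words of the visible text: the parser below walks the HTML, ignores the contents of `script` and `style` elements, and keeps lowercase words of three or more characters. The class name `PageTokenizer`, the function `tokenize_html`, and the word rule are assumptions for illustration; the original analysis may retain other parts of the page as well.

```python
import re
from html.parser import HTMLParser

WORD = re.compile(r"[a-z0-9]{3,}")  # assumed definition of a token


class PageTokenizer(HTMLParser):
    SKIPPED = {"script", "style"}  # elements whose text is not visible

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep words only when we are outside script/style content.
        if self._skip_depth == 0:
            self.tokens.extend(WORD.findall(data.lower()))


def tokenize_html(html):
    parser = PageTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Passing the HTML source of a page as a string to `tokenize_html` returns its tokens in document order, with duplicates preserved so that occurrences can be counted afterwards.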