Tokenization
Wednesday 26 April 2006
by ClarK
The principle of tokenization is to cut the HTML code of a webpage into significant parts (tokens).
To determine what is significant in the content of a webpage, we first need to analyse it.
Example: http://www.cplair.com/
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>AVERTISSEMENT</title>
</head>
<body bgcolor="#000000">
<table border="0" cellpadding="0" cellspacing="1" width="100%">
<tr>
<td width="100%">
<p align="center"><b><font size="3" color="#FF0000">AVERTISSEMENT
!</font><font color="#ff00ff" size="3">
Ce site Internet est réservé à un public majeur.<br>
Il contient des photos classées X qui peuvent être choquantes.<br>
</font><font color="#FF0000" size="3">
<u>l'accès est interdite aux personnes mineures</u></font><font
color="#ff00ff" size="3">
<br>
<br>
Je certifie sur l'honneur :<br>
- d'être majeur selon la loi en vigueur dans mon pays.<br>
- que les lois de mon pays m'autorisent a accéder à ce site.<br>
- je consulte ce site a titre personnel <br>
Je m'engage sur l'honneur à :<br>
- ne pas faire état de l'existence de ce serveur et à ne pas en
diffuser
le contenu à des mineurs.<br>
- assumer ma responsabilité, si un mineur accède à ce serveur à cause
de négligences de ma part : absence de protection de l'ordinateur
personnel, absence de logiciel de censure, divulgation ou perte du
mot de
passe de sécurité.<br>
- Je m'interdis dès à présent de poursuivre l'éditeur et l'auteur de
ce site sur toute action judiciaire.<br>
J'ai lu attentivement les paragraphes ci-dessus et signe<br>
électroniquement mon accord avec ce qui précède en cliquant sur le
bouton ENTRER</font></b></td>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="1" width="100%">
<tr>
<td width="50%">
<p align="center"><a href="accueil.htm"><img border="0"
src="entrer.gif" width="132" height="33"></a></td>
<td width="50%">
<p align="center"><b><font color="#00FFFF"><u><font size="4">pour les
mineurs</font></u> <a href="http://www.sitespourenfants.com/">
<img border="0" src="sortir.gif" width="121"
height="35"></a></font></b></td>
</tr>
</table>
</body>
</html>
From this analysis we can observe that:
- HTML tags can tell us about the content of a page. They describe, for example, the look of the page, such as the colors used (<body bgcolor="#000000"> or <font ... color="#FF0000">) or the fonts, which can give us information about the type of content. Indeed, porn websites tend to use similar styles (e.g. a pink background). In the same way, the number of images (<img border="0" src="sortir.gif" width="121" height="35">) can be significant, as can the links (<a href="http://www.sitespourenfants.com/">), the table headers (<thead...>), or the vocabulary used in the <meta> tags (which webmasters use to reference websites on search engines and which contain an explicit vocabulary about the content of the website).
All the other tags can of course be important as well.
- Domain names (www.sitespourenfants.com) can also be informative. Many websites are interconnected by hypertext links, and a domain name tells us about the web host, which generally hosts only one of the two types of content. Some domain names occur frequently and contribute to the score of the page.
- Finally, once all the code (all the tags) has been removed, only the visible text remains. Here again, the visible text is revealing about the type of content, and retrieving each word can give us important information. To improve the "understanding" of the text we also take into account bi-words, that is, pairs of two consecutive words.
For example, in the pair "être majeur" ("to be of legal age"), the word "majeur" alone carries some meaning but "être" does not. The pair "être majeur" gives more information.
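The bi-word idea can be illustrated with a few lines of Python. This is only a minimal sketch: the function names `words` and `biwords` are hypothetical, and splitting on `\w+` is an assumed simplification (it breaks "d'être" into "d" and "être"), not necessarily the rule used by the actual filter.

```python
import re

def words(text):
    """Split visible text into lowercase word tokens (assumed rule: runs of word characters)."""
    return re.findall(r"\w+", text.lower())

def biwords(tokens):
    """Pair each word with the word that immediately follows it."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]

# Sample sentence assembled from the page shown above.
text = "Je certifie sur l'honneur d'être majeur"
tokens = words(text)
print(tokens)           # single-word tokens
print(biwords(tokens))  # bi-words, among them "être majeur"
```

Both lists would then be kept as tokens, so that "majeur" and "être majeur" can each contribute to the analysis.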
Distinguishing the different types of tokens avoids depending on only one of them. Some webpages are very poor in text but still contain HTML code (thumbnail galleries, for example).
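The separation into token types described above could be sketched as follows with Python's standard `html.parser` module. The class name `TypedTokenizer`, the helper `tokenize`, the three group names and the `tag.attribute=value` format for color tokens are all assumptions made for this illustration. The sketch also ignores the <meta> vocabulary and does not filter out script or style content.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse

class TypedTokenizer(HTMLParser):
    """Collects tokens into three groups: 'tag', 'domain' and 'word'."""

    def __init__(self):
        super().__init__()
        self.tokens = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # Every opening tag is itself a token (e.g. counts the images).
        self.tokens["tag"].append(tag)
        for name, value in attrs:
            if value is None:
                continue
            if name in ("href", "src"):
                # Keep only the host name of absolute URLs.
                host = urlparse(value).hostname
                if host:
                    self.tokens["domain"].append(host)
            elif name in ("bgcolor", "color"):
                # Style information, e.g. "body.bgcolor=#000000".
                self.tokens["tag"].append(f"{tag}.{name}={value.lower()}")

    def handle_data(self, data):
        # Text between tags, split into lowercase words.
        self.tokens["word"].extend(re.findall(r"\w+", data.lower()))

def tokenize(html):
    """Return the tokens of an HTML string, grouped by type."""
    parser = TypedTokenizer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return {kind: parser.tokens[kind] for kind in ("tag", "domain", "word")}

# A text-poor page in the spirit of a thumbnail gallery.
gallery = (
    '<body bgcolor="#000000">'
    '<a href="http://www.sitespourenfants.com/"><img src="sortir.gif"></a>'
    '<a href="accueil.htm"><img src="entrer.gif"></a>'
    '</body>'
)
for kind, found in tokenize(gallery).items():
    print(kind, found)
```

On such a page the word group stays empty, while the tag and domain groups still provide tokens to work with.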