Filtrage Web/ Web Filter

Storage

Wednesday 26 April 2006 by ClarK

When matching patterns, knowing which regular expression matched a token tells us its type, so we can store the token together with that type.
To do so, we first cut the tokens already retrieved (HTML tags and words) using separators, in order to obtain the final tokens (smaller and more precise). We then record the type of each token (domain, tag, word or bi-word).
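As an illustration only, the cutting and typing steps could be sketched as follows in Python. The separator set, the patterns and the names `SEPARATORS`, `TOKEN_PATTERNS`, `split_tokens` and `classify` are assumptions made for this example; they are not the expressions actually used by the filter.

```python
import re

# Hypothetical separators used to cut retrieved text into final tokens.
SEPARATORS = re.compile(r"[\s,;:!?()\"]+")

# Hypothetical patterns, tried in order: the first one that matches gives the type.
TOKEN_PATTERNS = [
    ("domain", re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$")),
    ("tag", re.compile(r"^<[a-z][a-z0-9]*>$")),
    ("bi-word", re.compile(r"^[a-z]+ [a-z]+$")),
    ("word", re.compile(r"^[a-z]+$")),
]


def split_tokens(text):
    """Cut a piece of text into lower-case tokens, dropping empty pieces."""
    return [piece.lower() for piece in SEPARATORS.split(text) if piece]


def classify(token):
    """Return the type of the first pattern that matches, or None."""
    for token_type, pattern in TOKEN_PATTERNS:
        if pattern.match(token):
            return token_type
    return None
```

For instance, `classify("example.com")` gives `"domain"` with these patterns, while a token that matches none of them is left untyped.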

We need to store each token and its number of occurrences. The aim is thus to store a key/value pair in which the key is the token and the value is its number of occurrences (in the allowed and forbidden categories).

There are no complex queries to run on this database. We just want to store the pair and access it quickly when needed.

Sleepycat Software has developed Berkeley DB, which meets our need.
Indeed, Berkeley DB stores data as key/value pairs in which the key is unique.
As for the structure, we are free to choose between a hash, a B-tree or a FIFO (queue) structure. Moreover, it lets us adjust the size of the buffer in order to optimise data access time.
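To make the unique-key behaviour concrete, here is a small sketch that uses Python's standard `dbm.dumb` module as a stand-in for Berkeley DB. It is not the Berkeley DB API, and it offers no choice of structure or buffer size; it only shows that writing the same key twice replaces the value. The names `open_store` and `demo` are made up for the example.

```python
import dbm.dumb
import os
import tempfile


def open_store(path):
    """Open (or create) a small on-disk key/value store at the given path."""
    return dbm.dumb.open(path, "c")


def demo():
    """Write the same key twice and report what the store holds afterwards."""
    with tempfile.TemporaryDirectory() as folder:
        store = open_store(os.path.join(folder, "tokens"))
        store[b"casino"] = b"first"
        store[b"casino"] = b"second"  # same key: the value is replaced
        count = len(store)
        value = store[b"casino"]
        store.close()
        return count, value
```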

However, since we have to store a key/value pair for every token, we encode this pair in order to optimise storage size.

Instead of storing two long integers, we encode the value on 4 bytes (two bytes per integer): on the one hand the number of allowed occurrences, and on the other hand the number of forbidden occurrences.
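A minimal sketch of this 4-byte value, using Python's `struct` module. The big-endian byte order, the order of the two counters (allowed first) and the helper names are assumptions of the example, not something stated above.

```python
import struct

# Two unsigned 16-bit integers, big-endian (assumed): allowed, then forbidden.
COUNTS = struct.Struct(">HH")


def encode_counts(allowed, forbidden):
    """Pack both counters into exactly 4 bytes (each must be in 0..65535)."""
    return COUNTS.pack(allowed, forbidden)


def decode_counts(raw):
    """Unpack 4 bytes back into the (allowed, forbidden) pair."""
    return COUNTS.unpack(raw)
```

With this layout, a token seen on 300 allowed pages and 2 forbidden pages is stored as the bytes `01 2C 00 02`.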

The highest number we can encode on one byte is $2^8 - 1 = 255$, and we can encode $2^8 = 256$ values (from 0 to 255).
In the same way, on two bytes we can encode $256^2 = 65536$ values (from 0 to 65535).
A base 10 number $X$ in that range can be written: $X_{10} = a \cdot 256^1 + b \cdot 256^0 = 256a + b$.

then:

  • $a = \lfloor X / 256 \rfloor$ (the integer part of $X / 256$)
  • $b = X \bmod 256$
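The decomposition above can be checked with a few lines of Python. The function names are made up for the example; for instance, 1000 splits into a = 3 and b = 232, since 3 × 256 + 232 = 1000.

```python
def split_base_256(x):
    """Split a number in 0..65535 into its high byte a and low byte b."""
    if not 0 <= x <= 65535:
        raise ValueError("x does not fit on two bytes")
    a = x // 256  # integer part of x / 256
    b = x % 256   # remainder of the division by 256
    return a, b


def join_base_256(a, b):
    """Rebuild the number from its two bytes: a * 256 + b."""
    return a * 256 + b
```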

As a token is counted only once per page during the learning step (and each count is encoded on two bytes), it would take 65,536 learning pages containing that token, in one category or the other, before its counter overflows.
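The once-per-page counting rule can be sketched as below. Holding the counter at 65535 once it is reached is an assumption added for the example (the text only says when the overflow would occur), and `learn_page`, `ALLOWED` and `FORBIDDEN` are hypothetical names.

```python
MAX_COUNT = 65535  # largest value that fits on two bytes

ALLOWED = 0    # index of the allowed counter in a pair
FORBIDDEN = 1  # index of the forbidden counter in a pair


def learn_page(counts, page_tokens, category):
    """Count each distinct token of one page once in the given category."""
    for token in set(page_tokens):  # set(): a token counts once per page
        pair = counts.setdefault(token, [0, 0])
        if pair[category] < MAX_COUNT:  # assumed: stop at the maximum
            pair[category] += 1
```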

