Filtrage Web/ Web Filter

Regular expressions

Wednesday 26 April 2006 by ClarK

Regular expressions are precise and powerful tools for matching patterns in text.
Here they are used to strip out the code we do not want (comments, JavaScript, style sheets, etc.) and to detect the tokens we want to retrieve and store.

We want to retrieve HTML tags, domain names (which may appear inside tags or in the visible text), as well as words and bi-words.

Each step can be summed up in one regular expression. I will detail them here in Ruby (the language used to build the prototype), but they are very similar to the ones used with the Qt library used by KDE or with Flex.

The stored tokens are as follows.

Domain names

Since domain names can appear both in HTML tags and in the visible text, we match them first, before the code is altered by the later steps.

https?:\/\/([\w\-\.]+)
Regular expression matching domain names

It will match:
  • any string of characters beginning with http:// or https:// (matched by https?:\/\/),
  • followed by one or more word characters (letters, digits or underscore, matched by \w), hyphens or dots.

As the pattern does not match the slash (’/’), detection stops at the first one (or at any other character outside the class).
We thus obtain the domain names.
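
As an illustration, here is a minimal Ruby sketch of this step. The helper name extract_domains and the sample string are assumptions made for the example, not part of the prototype; only the pattern comes from the text above.

```ruby
# Pattern from the article; the capture group holds the host name.
DOMAIN_RE = /https?:\/\/([\w\-\.]+)/

# Hypothetical helper: returns every host name found in +html+,
# in order of appearance.
def extract_domains(html)
  # scan returns one array per match (one entry per capture group),
  # so flatten gives a plain list of host names.
  html.scan(DOMAIN_RE).flatten
end

sample = '<a href="http://www.ac-rouen.fr/index.html">Rectorat</a> https://example.org/a'
p extract_domains(sample)  # => ["www.ac-rouen.fr", "example.org"]
```

The pattern alone keeps the www. prefix. The example at the end of this article lists ac-rouen.fr, so the prototype presumably normalises the host afterwards; that step is not shown here.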

HTML tags

In order to retrieve only the necessary tags, we first need to remove the ones we do not want.

<!--((.|\n)*?)--> | <script [^>]+?> ((.|\n)*?)\/script>
Regular expression matching scripts and comments

It will match:
  • comments beginning with <!-- and ending with -->, containing anything, including newline characters,
  • as well as scripts (in the same way).

Once these expressions are matched, we remove them.
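
A minimal Ruby sketch of this removal step follows. It assumes that the spaces around the alternation in the pattern above are only typographic (they are dropped here, while the space after <script is kept as a literal), and the helper name strip_comments_and_scripts is hypothetical.

```ruby
# Pattern from the article, with the presentation spaces removed (assumption).
UNWANTED_RE = /<!--((.|\n)*?)-->|<script [^>]+?>((.|\n)*?)\/script>/

# Hypothetical helper: deletes HTML comments and script blocks from +html+.
def strip_comments_and_scripts(html)
  html.gsub(UNWANTED_RE, '')
end

html = "<p>Hello</p><!-- a\ncomment -->" \
       "<script type=\"text/javascript\">\nalert(1);\n</script><p>World</p>"
puts strip_comments_and_scripts(html)  # prints <p>Hello</p><p>World</p>
```

Read literally, the script branch requires a space and at least one character after <script, so a bare <script> tag without attributes would not be caught by this pattern.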
We can now retrieve all the other tags.

<(\"[^\"]*\" | \'[^\']*\' | [^\'\">])*>
Regular expression detecting HTML tags

We retrieve:
  • any text beginning with a "less than" sign and ending with a "greater than" sign,
  • which contains any mix of double-quoted strings, single-quoted strings, and characters other than quotes or the "greater than" sign.

We remove these tokens from the HTML code as we detect them.
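
The detect-and-remove step could look like the following Ruby sketch. The helper name extract_tags is hypothetical, the presentation spaces of the pattern are dropped, and replacing each tag with a space (so that neighbouring words do not merge) is an assumption rather than something the prototype is known to do.

```ruby
# Pattern from the article, with the presentation spaces removed (assumption).
TAG_RE = /<(\"[^\"]*\"|\'[^\']*\'|[^\'\">])*>/

# Hypothetical helper: returns [tags, text], where tags is the list of
# matched tags and text is the input with each tag replaced by a space.
def extract_tags(html)
  tags = []
  text = html.gsub(TAG_RE) do |tag|
    tags << tag  # the block receives the whole match, not the group
    ' '
  end
  [tags, text]
end

tags, text = extract_tags('<a href="http://www.ac-rouen.fr"> Rectorat de Rouen </a>')
p tags        # => ["<a href=\"http://www.ac-rouen.fr\">", "</a>"]
p text.strip  # => "Rectorat de Rouen"
```

Because quoted strings are matched as a unit, a "greater than" sign inside an attribute value, as in <img alt="a > b">, does not end the tag early.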

Words

We now have the HTML page without any of the code, so we can retrieve the text after removing all the useless characters:

&nbsp;? | &lt;? | &#[\d]+;? | &gt;? | \s+
Regular expression matching useless characters

All the non-significant characters are removed (&nbsp;, &lt;, &gt;, &#[\d]+; and runs of whitespace).
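
A possible Ruby rendering of this cleanup is sketched below. The helper name strip_noise is hypothetical, and replacing each match with a single space, then squeezing repeated spaces, is an assumption made so that words stay separated.

```ruby
# Pattern from the article, with the presentation spaces removed (assumption).
NOISE_RE = /&nbsp;?|&lt;?|&#[\d]+;?|&gt;?|\s+/

# Hypothetical helper: replaces every useless sequence with one space,
# then collapses runs of spaces and trims both ends.
def strip_noise(text)
  text.gsub(NOISE_RE, ' ').squeeze(' ').strip
end

puts strip_noise("Rectorat&nbsp;de\n  Rouen &lt;info&gt; &#169; 2006")
# prints: Rectorat de Rouen info 2006
```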

&(\w)\w+;
Regular expression matching accented characters

We also match the entities for accented characters (&eacute;, &egrave;, etc.) and special characters (&ccedil;), replacing them with their unaccented equivalent (e, c, etc.), which is the letter captured by (\w).
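
In Ruby this replacement can be sketched with a back-reference to the captured letter. The helper name unaccent is hypothetical.

```ruby
# Pattern from the article: the first letter of the entity name is captured.
ACCENT_RE = /&(\w)\w+;/

# Hypothetical helper: replaces each named entity with its first letter,
# e.g. &eacute; becomes e and &ccedil; becomes c.
def unaccent(text)
  text.gsub(ACCENT_RE, '\1')
end

puts unaccent("acad&eacute;mie fran&ccedil;aise")  # prints academie francaise
```

The pattern matches any named entity, so &nbsp; or &lt; would also be reduced to their first letter (n, l) if they were still present. This is one reason to remove the useless characters first, as the article does.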

[-\w\d$!?+%@=]+
Regular expression matching words

Finally we match the words: groups of letters, digits and punctuation characters ($!?+%@=-). Any character that is not matched acts as a separator.
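
A short Ruby sketch of the word extraction is given below. The helper name extract_words and the sample sentence are assumptions made for the example.

```ruby
# Same character set as the article's pattern; \d is left out because
# \w already covers digits, so the behaviour is unchanged.
WORD_RE = /[-\w$!?+%@=]+/

# Hypothetical helper: returns the words found in +text+, in order.
def extract_words(text)
  text.scan(WORD_RE)
end

p extract_words("Rectorat de Rouen: 100% gratuit!")
# => ["Rectorat", "de", "Rouen", "100%", "gratuit!"]
```

The colon and the spaces are not in the character class, so they act as separators, while % and ! stay attached to the neighbouring word.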

Example

<a href="http://www.ac-rouen.fr"> Rectorat de Rouen </a>
Example of HTML code

The regular expressions will produce the result:
Domain names:
domain:ac-rouen.fr
HTML tags:
tag:a
tag:a_href
tag:a_http
tag:a_www
tag:a_ac
tag:a_rouen
tag:a_fr
Words:
word:Rectorat
word:de
word:Rouen
Bi-words:
biword:Rectorat_de
biword:de_Rouen
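
The article does not spell out how bi-words are built. Judging only from the result above, they look like pairs of consecutive words joined by an underscore, which could be sketched in Ruby as follows. The helper name biwords and this reading of the output are assumptions.

```ruby
# Hypothetical helper: joins each pair of consecutive words with "_".
# This construction is inferred from the example output, not from the
# prototype's code.
def biwords(words)
  words.each_cons(2).map { |first, second| "#{first}_#{second}" }
end

p biwords(%w[Rectorat de Rouen])  # => ["Rectorat_de", "de_Rouen"]
```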

