Friday 21 April 2006 by ClarK
This website presents the work I did during my final-year internship for my engineering degree.
The goal was to develop a tool that classifies a web page as allowed or forbidden.
The system is based on a spam detection method: it combines a learning phase with Bayesian calculations.
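To make the principle concrete, here is a minimal Python sketch of a two-class naive Bayes scorer of the kind used in spam filters. It is not the project's code (the most efficient version of that is written with Flex): the class name `PageClassifier`, the tokenizer, and the Laplace smoothing are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: lowercase alphabetic words of 3 letters or more.
TOKEN_RE = re.compile(r"[a-z]{3,}")


def tokenize(text):
    """Return the list of words found in the text."""
    return TOKEN_RE.findall(text.lower())


class PageClassifier:
    """Two-class naive Bayes over word counts (illustrative sketch only)."""

    LABELS = ("allowed", "forbidden")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.page_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Learning step: record the words of a page already labelled."""
        self.word_counts[label].update(tokenize(text))
        self.page_counts[label] += 1

    def score(self, text):
        """Return the estimated probability that the page is forbidden."""
        vocab = set(self.word_counts["allowed"]) | set(self.word_counts["forbidden"])
        total_pages = sum(self.page_counts.values())
        words = tokenize(text)
        log_post = {}
        for label in self.LABELS:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Smoothed log prior of the class.
            lp = math.log((self.page_counts[label] + 1) / (total_pages + 2))
            for word in words:
                # Laplace smoothing keeps unseen words from zeroing the product.
                lp += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
            log_post[label] = lp
        # Normalise in log space to avoid underflow on long pages.
        top = max(log_post.values())
        weights = {label: math.exp(lp - top) for label, lp in log_post.items()}
        return weights["forbidden"] / (weights["allowed"] + weights["forbidden"])
```

After a few calls to `train` with labelled pages, `score` returns a value between 0 and 1 that can be compared with a threshold to decide whether a page is allowed or forbidden.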
The idea is:
More detailed information on how to use the code (dependencies, compilation, execution) can be found in the corresponding sections.
For now, the most efficient implementation is the Flex one.
It is still too slow to score web pages at click time (the French Rectorat of Rouen sees 10 million hits per day). So each night, cron scripts retrieve all the pages logged during the day by the Squid proxy, score them, and add them to the DansGuardian blacklist if necessary. I will try to add those scripts too...
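Until those scripts are published, the nightly pass can be sketched roughly as follows in Python. This is not the actual script: it assumes Squid's native access.log format, where the requested URL is the seventh whitespace-separated field, and `score_page` and `threshold` are hypothetical stand-ins for the real scorer and its cut-off.

```python
from urllib.parse import urlsplit


def urls_from_squid_log(lines):
    """Yield each distinct URL found in Squid native-format log lines."""
    seen = set()
    for line in lines:
        fields = line.split()
        # Native format: time elapsed client code/status bytes method URL ...
        if len(fields) < 7:
            continue  # skip truncated or malformed lines
        url = fields[6]
        if url not in seen:
            seen.add(url)
            yield url


def domains_to_blacklist(urls, score_page, threshold=0.9):
    """Return the sorted host names of URLs scoring at or above the threshold.

    score_page is a hypothetical callable: it takes a URL and returns a
    score between 0 and 1 (for example by fetching and classifying the page).
    """
    flagged = set()
    for url in urls:
        if score_page(url) >= threshold:
            host = urlsplit(url).hostname
            if host:  # entries without a parsable host are ignored
                flagged.add(host)
    return sorted(flagged)
```

A cron job would feed the day's log into `urls_from_squid_log`, then append the resulting domains to the DansGuardian blacklist. The blacklist location depends on the installation, so that step is left out here.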
Feel free to browse the website and try the different programs. The contact links let you email me with any questions or suggestions about the evolution of this project and website.
Nicolas PEYRUSSIE.