Pretreatment of web log files
Abstract
Pretreatment of Web data is often the most laborious and time-consuming step, mainly because the raw data is poorly structured and contains a large amount of noise. Pretreating Web log files means cleaning and organizing the data they contain to prepare it for later analysis. Because Web log files are usually plain text, one objective of the pretreatment step is to transfer the data into an environment that is easier to use (e.g., a database).
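As an illustration of the transfer from text to a database described above, the following is a minimal sketch and not the authors' implementation. It assumes the log follows the Apache/NCSA Combined Log Format (the abstract does not say which format is used) and loads the parsed fields into SQLite. The names parse_line, load_into_db and the requests table are hypothetical.

```python
import re
import sqlite3

# Assumption: Apache/NCSA Combined Log Format; the referrer and user-agent
# fields are optional so that Common Log Format lines also match.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    # A size of "-" means no body was sent; store it as 0.
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row


def load_into_db(lines, conn):
    """Insert every parsable line into a (hypothetical) requests table.

    Returns the number of rows inserted; unparsable lines are skipped.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests ("
        "host TEXT, remote_user TEXT, time_local TEXT, method TEXT, url TEXT, "
        "protocol TEXT, status INTEGER, size INTEGER, referrer TEXT, agent TEXT)"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    conn.executemany(
        "INSERT INTO requests VALUES (:host, :remote_user, :time_local, :method, "
        ":url, :protocol, :status, :size, :referrer, :agent)",
        rows,
    )
    conn.commit()
    return len(rows)
```

Once the rows are in a table, later steps can be written as ordinary queries instead of repeated text scans.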
In this paper we first present the different formats of Web log files, then the pretreatment methods that we used: cleaning out Web robot queries, removing queries relating to scripts (.js, .css, .swf), and identifying users, sessions, and visits.
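The cleaning and identification steps listed above can be sketched as follows. This is a hedged illustration and not the method of the paper. The robot heuristics (a request for /robots.txt, or a user-agent containing "bot", "crawler" or "spider"), the use of the (host, user-agent) pair as a user key, and the 30-minute inactivity timeout are all assumptions chosen for the example. Each request is assumed to be a dict with host, agent, url and time keys.

```python
from datetime import datetime, timedelta

# All three constants are assumptions made for this sketch.
STATIC_EXTENSIONS = (".js", ".css", ".swf")
ROBOT_MARKERS = ("bot", "crawler", "spider")
INACTIVITY_TIMEOUT = timedelta(minutes=30)


def parse_time(text):
    """Parse a timestamp such as '10/Oct/2014:13:55:36 +0000'."""
    return datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")


def looks_like_robot(request):
    """Heuristic: the user-agent string contains a known robot marker."""
    agent = (request.get("agent") or "").lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


def is_static(request):
    """True if the requested path (query string ignored) is a script or style file."""
    path = request["url"].split("?", 1)[0].lower()
    return path.endswith(STATIC_EXTENSIONS)


def clean(requests):
    """Drop robot traffic and requests for static resources."""
    # Hosts that fetched /robots.txt are treated as robots for all their requests.
    robot_hosts = {r["host"] for r in requests if r["url"] == "/robots.txt"}
    return [
        r for r in requests
        if r["host"] not in robot_hosts
        and not looks_like_robot(r)
        and not is_static(r)
    ]


def sessionize(requests):
    """Group requests per user, splitting when the gap exceeds the timeout.

    Returns a dict mapping (host, agent) to a list of request lists.
    """
    sessions = {}
    for r in sorted(requests, key=lambda r: r["time"]):
        user = (r["host"], r.get("agent"))
        runs = sessions.setdefault(user, [])
        if runs and r["time"] - runs[-1][-1]["time"] <= INACTIVITY_TIMEOUT:
            runs[-1].append(r)
        else:
            runs.append([r])
    return sessions
```

Cleaning runs before grouping so that robot and script requests do not inflate the groups. Whether each time-bounded group corresponds to what the paper calls a session or a visit depends on its definitions, which the abstract does not give.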
Copyright (c) 2015 Journal of Information Sciences and Computing Technologies
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.