Automated web usage data mining and recommendation system using K-Nearest Neighbor (KNN) classification method

TLDR

Many online sites overwhelm users with too many choices, making it difficult and time-consuming to find the right product or information. This study develops an automatic web-usage mining and recommendation system that infers user needs from click-stream data on an RSS reader website, so that it can provide relevant information without explicit requests. Using a real-time K-Nearest-Neighbor classifier trained on cleansed and grouped RSS click-stream sessions, the system matches each user to a user group and recommends a tailored browsing option. The authors report that the K-NN classifier is transparent, consistent, simple to understand, and easier to implement than most other machine learning techniques, specifically when little or no prior knowledge of the data distribution exists.

Abstract

A major problem of many online websites is that they present too many choices to the client at once, which usually makes finding the right product or information on the site a strenuous and time-consuming task. In this work, we present a study of an automatic web usage data mining and recommendation system based on current user behavior, captured through his/her click-stream data on a newly developed Really Simple Syndication (RSS) reader website, in order to provide relevant information to the individual without explicitly asking for it. The K-Nearest-Neighbor (KNN) classification method was trained for online, real-time use to identify a client's/visitor's click-stream data, match it to a particular user group, and recommend a tailored browsing option that meets the needs of the specific user at a particular time. To achieve this, the web users' RSS address file was extracted, cleansed, formatted, and grouped into meaningful sessions, and a data mart was developed. Our results show that the K-Nearest Neighbor classifier is transparent, consistent, straightforward, simple to understand, tends to possess desirable qualities, and is easier to implement than most other machine learning techniques, specifically when there is little or no prior knowledge about the data distribution.
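The matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature encoding (click counts per RSS category), the group names, the feed names, the Euclidean distance, and the choice of k = 3 are all assumptions made only for illustration.

```python
import math
from collections import Counter

# Hypothetical training sessions: each vector counts clicks per RSS
# category (news, sport, tech); each label is an assumed user group.
TRAIN = [
    ((5, 0, 1), "news_readers"),
    ((4, 1, 0), "news_readers"),
    ((0, 6, 1), "sport_fans"),
    ((1, 5, 0), "sport_fans"),
    ((0, 1, 7), "tech_followers"),
    ((1, 0, 6), "tech_followers"),
]

# Hypothetical mapping from user group to recommended RSS feeds.
RECOMMENDATIONS = {
    "news_readers": ["world-headlines", "politics"],
    "sport_fans": ["football", "athletics"],
    "tech_followers": ["gadgets", "programming"],
}


def classify_session(session, train=TRAIN, k=3):
    """Return the majority group among the k nearest training sessions."""
    # Sort the training sessions by Euclidean distance to the new session
    # and keep the k closest ones.
    nearest = sorted(train, key=lambda item: math.dist(session, item[0]))[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def recommend(session, k=3):
    """Recommend feeds for the group the session is matched to."""
    return RECOMMENDATIONS[classify_session(session, k=k)]


if __name__ == "__main__":
    current_session = (4, 0, 1)  # mostly news clicks
    print(classify_session(current_session))
    print(recommend(current_session))
```

Because K-NN keeps the training sessions and defers all computation to query time, it needs no assumption about how the click-stream data is distributed, which is the property the abstract highlights. A real deployment would derive the session vectors from the cleansed data mart rather than from hard-coded tuples.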
