Name: Mohammed F. Fofana
Matric number: Tp2016/17/H/1295
Degree obtained from previous institution: B.Sc. in Information Technology, United Methodist University (UMU)
Program applied for: M.Sc. Software Engineering
Supervisor and Co-Supervisor: Professor G.A. Aderounmu and Dr. Odukoya H.O.
Research area: Software Engineering
Research Topic: Identifying web spam with user behaviour analysis.

Chapter one
INTRODUCTION
1.0 BACKGROUND
Search engines face numerous challenges in combating web spam, and this proposal approaches the problem through user behavior analysis. With the growth of modern approaches to distributing information around the world, web search engines have become increasingly important in people’s lives: they make sharing and reading information across the Internet much easier. Despite these benefits, web spam has become one of the major sources of frustration for search engines and website owners. Web spam is the junk that appears in search results when websites try to cheat their way into higher positions or otherwise violate search engine quality guidelines. A decade ago, the spam situation was so bad that search engines would regularly return off-topic web spam for many different searches.

Nigeria has about 91,598,757 Internet users, roughly 47.7% of the population, and about 36.5% regard search engines as a major way to find newly published information of interest to them. With the explosive growth of information on the Web, search engines are becoming more and more important in people’s daily lives.

After a certain kind of Web spam appears in search result lists, engineers examine the characteristics of this spam type and design specific strategies to identify it. However, once a kind of spam is detected and banned, spammers quickly turn to developing new Web spam. With this method, anti-spam techniques can only identify Web spam that has already caused severe loss and drawn search engine engineers’ attention.

In contrast to the prevailing approaches, we propose a different anti-spam framework in which spam sites are identified by their deceitful motivation rather than by their content or hyperlink appearance. We introduce three features derived from user behavior pattern analysis and design a learning-based approach that combines these features to identify Web spam pages. A search engine usually returns thousands of results for a given query, yet according to statistics gathered in the research, most users view only the first few pages of the result list.

Many anti-spam detection systems have been built to combat the spread of spam across search engines, but most of them are designed to detect only specific types of spam. Because users view only the top results, ranking position has become a major concern of Internet service providers. In order to get “an unjustifiably favorable relevance or importance score for some Web page, considering the page’s true value”, various kinds of Web spam techniques were designed to mislead search engines. In 2006, it was estimated that about one seventh of English Web pages were spam and that this spam posed a great obstacle to users’ information acquisition.

Outside of search results, there are two main types of spam, and they have different effects on Internet users. Cancellable Usenet spam is a single message sent to 20 or more Usenet newsgroups. Usenet spam is aimed at “lurkers”, people who read newsgroups but rarely or never post and give their address away. Usenet spam robs users of the utility of the newsgroups by overwhelming them with a barrage of advertising or other irrelevant posts. Furthermore, Usenet spam subverts the ability of system administrators and owners to manage the topics they accept on their systems.

Email spam targets individual users with direct mail messages. Email spam lists are often created by scanning Usenet postings, stealing Internet mailing lists, or searching the Web for addresses. Email spam typically costs users money out of pocket to receive. Many people (anyone with measured phone service) read or receive their mail while the meter is running, so to speak, and spam costs them additional money. On top of that, it costs ISPs and online services money to transmit spam, and these costs are passed on directly to subscribers.

Therefore, spam detection is regarded as a major challenge for Web search service providers. State-of-the-art anti-spam techniques usually make use of Web page features, either content-based or hyperlink-structure-based, to construct Web spam classifiers. In this spam detection framework, when a certain kind of Web spam appears in search engine results, anti-spam engineers examine the characteristics of this kind of spam and design specific strategies to identify it. However, once one kind of spam is detected and banned, spammers quickly develop new Web spam techniques. Since search engines were first widely adopted in the late 1990s, Web spam has evolved from term spamming and link spamming to the current hiding and JavaScript spamming techniques.

Although machine-learning-based methods have shown their superiority in adapting easily to newly developed spam, these approaches still require researchers to provide the features of specific spam pages and to build suitable training sets.

Research Statement
Our review of existing work shows that this kind of anti-spam framework has caused many problems in the development of Web search engines. Anti-spam work has become a never-ending process, yet it can only detect Web spam types that have already caused severe loss and drawn anti-spam engineers’ attention.

It is quite difficult to design and implement anti-spam techniques in time, because by the time engineers become aware of a certain spam type, it has already succeeded in attracting many users’ attention. In contrast to the prevailing approaches, this research proposes a different anti-spam framework: the User Behavior-oriented Web Spam Detection framework.

Web spam attempts to deceive search engine ranking algorithms instead of meeting Web users’ information needs as ordinary pages do. Therefore, the user-visiting patterns of Web spam pages differ from those of ordinary Web pages. By collecting and analyzing large-scale user-access data of Web pages, we identify several user behavior features of spam pages. These features are used to develop an anti-spam algorithm that identifies Web spam in a timely, effective, and type-independent manner.
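
As a rough illustration of how behavior features might be computed from user-access data, the Python sketch below derives two per-site rates from a list of visit records. The Visit record, the two rates (share of visits arriving from a search engine, share of very short visits), and the 10-second threshold are assumptions made for illustration only; they are not the three features this proposal will define.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Visit:
    """One hypothetical user-access record (illustrative fields only)."""
    site: str              # site that was visited
    from_search: bool      # True if the visitor arrived from a search result page
    dwell_seconds: float   # time the visitor stayed on the site


def behavior_features(visits, short_visit_seconds=10.0):
    """Return {site: (search_referral_rate, short_visit_rate)}."""
    totals = defaultdict(int)
    from_search = defaultdict(int)
    short = defaultdict(int)
    for v in visits:
        totals[v.site] += 1
        if v.from_search:
            from_search[v.site] += 1
        if v.dwell_seconds < short_visit_seconds:
            short[v.site] += 1
    # Every site in `totals` has at least one visit, so the division is safe.
    return {
        site: (from_search[site] / totals[site], short[site] / totals[site])
        for site in totals
    }
```

A site whose traffic comes almost entirely from search engines and whose visitors leave within seconds would score high on both rates, which is the kind of pattern the behavior-oriented framework aims to exploit.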

Aim/goal
This project aims to provide a system that protects and guides search engines and Web users against spammers and the spam that normally frustrates them.

The system will provide strong protection against spammers and prevent spam from spreading over the network of a company or agency. It will also help prevent spam from spreading viruses and malicious software to computers.

Objectives
The objectives of this project include, but are not limited to, the following:
To develop a Web spam detection framework in which spam sites are identified by their deceitful motivation rather than by their content or hyperlink appearance.
To introduce three features derived from user behavior pattern analysis that can distinguish spam Web sites from ordinary sites in a timely and effective manner.
To design a learning-based approach that combines the proposed user-behavior features and computes the likelihood that a Web site is spam.
To create corresponding Web spam training sets; this data set will be used to evaluate search engine performance.
To provide the system with spam detection.

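The learning-based objective above can be sketched in Python. The proposal does not yet fix a learning algorithm, so logistic regression is used here purely as a stand-in; the two input features per site and the six labeled samples are made-up illustrative values, not research data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def spam_likelihood(features, weights, bias):
    """Combine the behavior features into a likelihood between 0 and 1."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, features)))


# Toy training set (assumed values): each sample is a pair of behavior
# feature rates for one site; label 1 = spam, 0 = ordinary.
samples = [(0.9, 0.8), (0.95, 0.9), (0.85, 0.7),
           (0.3, 0.2), (0.4, 0.1), (0.2, 0.3)]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = train(samples, labels)
```

Calling spam_likelihood((0.9, 0.85), weights, bias) then yields a value above 0.5 for a site that resembles the spam samples, while a site that resembles the ordinary samples scores below 0.5. A real training set would be far larger and built from actual user-access data.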

Methodology
The method to be used for this project is to structure and develop a well-defined system that will be able to track and block spam pages and spam hyperlinks. Analysis will be carried out to discover the areas in which Web spam is identified and the weaknesses of existing approaches; the objectives of the proposed system will then be defined and its design implemented. The method includes, but is not limited to, the following:
The search engine system will alert the end user about a spam message.
The system will advise the user not to follow the link because it is a spam message.
The user will be advised and warned about the threat involved if he or she goes on to the page flagged in the warning.
The system will delete unwanted Web spam messages.
If the detected page is not spam but its name resembles that of a spam page, the user will be given the right to choose whether or not to proceed.
The system will add form fields that only spam bots can see and fill in.
The system will use a CAPTCHA.
The system will have a human-friendly, bot-unfriendly test question.
Session tokens will be applied at the site level and required by the form.
Data recorded from form submissions, such as IP addresses, will be used to block spammers.
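
Three of the form-level measures listed above (the hidden field, the session token, and IP-based blocking) can be sketched together in Python. This is a minimal illustration under stated assumptions, not the system's design: the field names "website" and "token", the HMAC-based token scheme, and the in-memory blocklist are all hypothetical. CAPTCHA is left out because it would normally rely on an external service.

```python
import hashlib
import hmac

HONEYPOT_FIELD = "website"  # hypothetical hidden field; humans never see or fill it


def issue_token(session_id: str, secret: bytes) -> str:
    """Derive the token that the site embeds in the form for this session."""
    return hmac.new(secret, session_id.encode("utf-8"), hashlib.sha256).hexdigest()


def check_submission(form: dict, session_id: str, client_ip: str,
                     secret: bytes, blocked_ips: set) -> str:
    """Classify one form submission; returns a short reason string."""
    if client_ip in blocked_ips:
        return "blocked_ip"
    # Only a bot that fills in every field will populate the hidden field.
    if form.get(HONEYPOT_FIELD, "").strip():
        blocked_ips.add(client_ip)  # remember the sender for later requests
        return "honeypot_filled"
    expected = issue_token(session_id, secret)
    supplied = form.get("token", "")
    # Constant-time comparison on bytes avoids timing leaks.
    if not hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8")):
        return "bad_token"
    return "ok"
```

The checks run from cheapest to most expensive, and a sender caught by the hidden field is added to the blocklist so that later requests from the same address are rejected immediately.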

Justification
Implementing a system that identifies Web spam through user behaviour analysis will help limit and prevent the spread of Web spam on the website of a company or organization.

The system will provide, but is not limited to, the following:
The system will have a CAPTCHA.
The system will have a human-friendly, bot-unfriendly test question.
Session tokens will be applied at the site level and required by the form.
Data recorded from form submissions, such as IP addresses, will be collected and used to block spammers.

Scope of the project
The scope of this project is to enlighten users about what spam entails and why the user community has become a major target of spammers.