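Graduation Project (Thesis), Foreign Literature Translation (Chinese-English Parallel Text), Computer Science and Technology: Preprocessing and Mining Web Log Data for Web Personalization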
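Taizhou Institute of Science and Technology, Nanjing University of Science and Technology
Graduation Project (Thesis): Translation of Foreign-Language Material

Department: Computer Science and Technology
Major: Computer Science and Technology
Name:
Student ID:
Source of the foreign text: Dipartimento di Informatica, Università di Pisa
Attachments: 1. Translation of the foreign-language material; 2. Original foreign-language text.
Supervisor's comments:
Signature:
Date (year, month, day):
Note: please bind this cover page together with the attachments.

Appendix 1: Translation of the foreign-language material

Preprocessing and Mining Web Log Data for Web Personalization

Abstract: We describe the web usage mining activities of an ongoing project, called ClickWorld, whose goal is to extract models of the navigational behavior of a web site's users. The models are inferred from the access logs of a web server by means of data mining and web mining techniques. The extracted knowledge is deployed to offer users personalized and proactive web services. We first describe the preprocessing steps that access logs must undergo in order to clean, select and prepare the data for knowledge extraction. We then present two sets of experiments: the first tries to predict the sex of a user from the web pages visited; the second tries to predict whether a user might be interested in visiting a section of the site.

Keywords: knowledge discovery, web mining, classification.

1 Introduction

Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services. A common taxonomy of web mining defines three main research lines: content mining, structure mining and usage mining. The boundary between these categories is not clear-cut, and approaches very often combine techniques from different categories.

Content mining covers data mining techniques that extract models from the contents of web objects, including plain text, semi-structured documents (e.g., HTML or XML), structured documents (digital libraries), dynamic documents and multimedia documents. The extracted models are used to classify web objects, to extract keywords for information retrieval, and to infer the structure of semi-structured or unstructured objects.

Structure mining aims at finding the underlying topology of the interconnections between web objects. The resulting model can be used to categorize and rank web sites, and to find similarities between them.

Usage mining is the application of data mining techniques to discover usage patterns from web data. The data is usually collected from users' interaction with the web, e.g. web/proxy server logs, user queries and registration data. Usage mining tools discover and predict user behavior, in order to help the designer improve the web site, attract visitors, or give regular users a personalized and adaptive service.

In this paper, we describe the web usage mining activities of an ongoing project, called ClickWorld, that aims at extracting models of user behavior for the purpose of web site personalization. We collected and preprocessed access logs from a medium-large national web portal, vivacity.it, over a period of five months. The portal includes a national area (www.vivacity.it) with news, forums, jokes, etc., and more than 30 local areas (e.g., www.roma.vivacity.it) with city-specific information, such as local news, restaurant addresses, theatre programmes, bus timetables, etc.

The preprocessing steps include data selection, cleaning and transformation, and the identification of users and of user sessions. The result of preprocessing is a data mart of web accesses and registration information. Starting from the preprocessed data, web mining aims at pattern discovery by adapting methods from statistics, data mining, machine learning and pattern recognition. Among the basic data mining techniques, we mention association rules, which discover groups of objects that users frequently request together; clustering, which groups users with similar browsing patterns, or objects with similar content or access patterns; classification, where a profile is built for users belonging to a given class or category; and sequential patterns, namely sequences of requests that are common to many users.

In the ClickWorld project, several of these methods are currently being used to extract useful information for proactive personalization of web sites. In this paper, we describe two sets of classification experiments. The first aims at extracting a classification model able to discriminate the sex of a user based on the set of web pages visited. The second aims at extracting a classification model able to discriminate the users who visit pages about, e.g., sport or finance from those who typically do not.

2 Preprocessing for Web Personalization

We have developed a data mart of web logs specifically to support web personalization analysis. The data mart is populated from a web log data warehouse or, more simply, from raw web/proxy server log files. In this section, we describe a number of preprocessing and coding steps performed for data selection, comprehension, cleaning and transformation. While some of them are general data preparation steps for web usage mining, it is worth noting that many of them necessarily require some form of domain knowledge in order to clean, correct and complete the input data according to the web personalization requirements.

2.1 User registration data

In addition to web access logs, our input includes personal data on a subset of users, namely those who are registered with the vivacity.it web site (registration is not mandatory). For a registered user, the system records the following information: sex, city, province, marital status and date of birth. This information is provided by the user in a web form at the time of registration and, as one would expect, the quality of the data depends on the user's honesty. As a preprocessing step, improbable data are detected and removed, such as birth dates in the future or in the remote past. In addition, some further input fields were not imported into the data mart, because almost all of their values were left at the default choice of the web form. In other words, those fields were considered not useful for discriminating user choices and preferences.

To spare users from typing their login and password at each visit, the vivacity.it web site uses cookies. If the user's browser provides a cookie, authentication is not required. Otherwise, after authentication, a new cookie is sent to the user's browser. With this mechanism, it is possible to track any user as long as she does not delete the cookies on her system. In addition, if the user is registered, the login-cookie association is available in the input data, so the user can be tracked even after she deletes the cookies. This mechanism also allows for detecting non-human users, such as system diagnosis and monitoring programs. By checking the number of cookies assigned to each user, we discovered that the user login test009 was assigned more than 24,000 distinct cookies. This is possible only if the user is a program that automatically deletes the assigned cookies, e.g. a system diagnosis program.

2.2 Web URL

On the one hand, a number of normalizations must be performed on URLs in order to remove irrelevant syntactic differences. For example, the host can be in IP format or in host-name format: 131.114.2.91 is the same host as kdd.di.unipi.it. On the other hand, some web server programs adopt non-standard formats for passing parameters. The vivacity.it web server program is one of them. For instance, in the following URL:

http://roma.vivacity.it/speciali/EditColonnaSpeciale/1,3478,|DX,00.html

the file name 1,3478,|DX,00 contains a code for the local web site, a web page id (3478) and its specific parameters (DX).

The form above was designed for efficient machine processing. For instance, the web page id is a key into a database table where the page template is found, while the parameters allow the page content to be retrieved from some other table. Unfortunately, this is a nightmare when mining a clickstream of URLs. Syntactic features of URLs are of little help: we need some semantic information, or ontology, assigned to URLs.

At best, we can expect an application-level log to be available, i.e. a log of accesses to semantically relevant objects. An example of an application-level log is one recording that the user entered the site from the home page, then visited a sport page with news on a soccer team, and so on. This would require a system module that monitors user steps at a semantic level of granularity. In the ClickWorld project such a module is called ClickObserve. Unfortunately, however, the module is a deliverable of the project, and it was not available for collecting data at the beginning of the project.

Therefore, we decided to extract both syntactic and semantic information from URLs via a semi-automatic approach. The approach consists in reverse-engineering URLs, starting from the web site designer's description of the meaning of each URL path, web page id and web page parameters. Using a PERL script based on the designer's description, we extracted the following information from the original URLs:

- the local web server (i.e., vivacity.it or roma.vivacity.it, etc.), which gives us some spatial information about user interests;
- a first-level classification of URLs into 24 types, some of which are: home, news, finance, photo galleries, jokes, shopping, forum, pubs;
- a second-level classification of URLs depending on the first-level one, e.g. URLs classified as shopping may be further classified as book shopping or PC shopping, and so on;
- a third-level classification of URLs depending on the second-level one, e.g. URLs classified as book shopping may be further classified as programming book shopping or narrative book shopping, and so on;
- parameter information, further detailing the three-level classification, e.g. URLs classified as programming book shopping may have the ISBN book code as a parameter;
- the depth of the classification, i.e. 1 if the URL has only a first-level classification, 2 if the URL has a first-level and a second-level classification, and so on.

Of course, the adopted approach was mainly a heuristic one, with the hierarchical ontology designed a posteriori. Moreover, the designed ontology does not exploit any content-based classification: the description of an elementary object such as a sport news item with id 12345 is its code (i.e., the first level is news), with no reference to the content of the news.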
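As a rough illustration of the classification record described above, the following Python sketch models one classified URL and derives its depth from the levels that are present. The class name `UrlClassification`, its field names and the sample values are hypothetical and chosen only for illustration; the project itself used a PERL script whose internals are not given in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UrlClassification:
    """One classified URL (hypothetical record layout, not the project's own)."""
    server: str                      # local web server, e.g. "roma.vivacity.it"
    level1: str                      # first-level type, e.g. "shopping"
    level2: Optional[str] = None     # second-level type, e.g. "book shopping"
    level3: Optional[str] = None     # third-level type
    parameter: Optional[str] = None  # extra detail, e.g. an ISBN code

    @property
    def depth(self) -> int:
        """Number of consecutive classification levels that are filled in."""
        depth = 0
        for level in (self.level1, self.level2, self.level3):
            if level is None:
                break
            depth += 1
        return depth


# A sport news item: first level "news", second level "sport", parameter 12345.
news_item = UrlClassification("vivacity.it", "news", "sport", parameter="12345")
```

Keeping the depth as a derived property, rather than a stored field, avoids records whose stated depth disagrees with the levels actually assigned.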
Appendix 2: Original foreign-language text

Preprocessing and Mining Web Log Data for Web Personalization

M. Baglioni (1), U. Ferrara (2), A. Romei (1), S. Ruggieri (1), and F. Turini (1)

(1) Dipartimento di Informatica, Università di Pisa, Via F. Buonarroti 2, 56125 Pisa, Italy
{baglioni,romei,ruggieri,turini}@di.unipi.it
(2) KSolutions S.p.A., Via Lenin 132/26, 56017 S. Martino Ulmiano (PI), Italy
ferrara@ksolutions.it

Abstract. We describe the web usage mining activities of an ongoing project, called ClickWorld, that aims at extracting models of the navigational behavior of a web site's users. The models are inferred from the access logs of a web server by means of data and web mining techniques. The extracted knowledge is deployed to offer users a personalized and proactive view of the web services. We first describe the preprocessing steps on access logs that are necessary to clean, select and prepare data for knowledge extraction. We then present two sets of experiments: the first tries to predict the sex of a user based on the visited web pages, and the second tries to predict whether a user might be interested in visiting a section of the site.

Keywords: knowledge discovery, web mining, classification.

1 Introduction

According to [10], Web Mining is the use of data mining techniques to automatically discover and extract information from web documents and services. A common taxonomy of web mining defines three main research lines: content mining, structure mining and usage mining.
The distinction between those categories is not clear-cut, and approaches very often combine techniques from different categories.

Content mining covers data mining techniques for extracting models from web object contents, including plain text, semi-structured documents (e.g., HTML or XML), structured documents (digital libraries), dynamic documents and multimedia documents. The extracted models are used to classify web objects, to extract keywords for use in information retrieval, and to infer the structure of semi-structured or unstructured objects.

Structure mining aims at finding the underlying topology of the interconnections between web objects. The model built can be used to categorize and rank web sites, and also to find similarities between them.

Usage mining is the application of data mining techniques to discover usage patterns from web data. Data is usually collected from users' interaction with the web, e.g. web/proxy server logs, user queries, registration data. Usage mining tools [3,4,9,15] discover and predict user behavior, in order to help the designer improve the web site, attract visitors, or give regular users a personalized and adaptive service.

In this paper, we describe the web usage mining activities of an ongoing project, called ClickWorld, that aims at extracting models of the navigational behavior of users for the purpose of web site personalization [6]. We have collected and preprocessed access logs from a medium-large national web portal, vivacity.it, over a period of five months. The portal includes a national area (www.vivacity.it) with news, forums, jokes, etc., and more than 30 local areas (e.g., www.roma.vivacity.it) with city-specific information, such as local news, restaurant addresses, theatre programmes, bus timetables, etc.

The preprocessing steps include data selection, cleaning and transformation, and the identification of users and of user sessions [2]. The result of preprocessing is a data mart of web accesses and registration information. Starting from preprocessed data, web mining aims at pattern discovery by adapting methods from statistics, data mining, machine learning and pattern recognition. Among the basic data mining techniques [7], we mention association rules, which discover groups of objects that are frequently requested together by users; clustering, which groups users with similar browsing patterns, or objects with similar content or access patterns; classification, where a profile is built for users belonging to a given class or category; and sequential patterns, namely sequences of requests that are common to many users.

In the ClickWorld project, several of the mentioned methods are currently being used to extract useful information for proactive personalization of web sites. In this paper, we describe two sets of classification experiments. The first aims at extracting a classification model able to discriminate the sex of a user based on the set of web pages visited. The second aims at extracting a classification model able to discriminate those users who visit pages regarding, e.g., sport or finance from those who typically do not.

2 Preprocessing for Web Personalization

We have developed a data mart of web logs specifically to support web personalization analysis. The data mart is populated starting from a web log data warehouse (such as those described in [8,16]) or, more simply, from raw web/proxy server log files. In this section, we describe a number of preprocessing and coding steps performed for data selection, comprehension, cleaning and transformation. While some of them are general data preparation steps for web usage mining [2,16], it is worth noting that many of them necessarily require some form of domain knowledge in order to clean, correct and complete the input data according to the web
personalization requirements.

2.1 User registration data

In addition to web access logs, our input includes personal data on a subset of users, namely those who are registered with the vivacity.it website (registration is not mandatory). For a registered user, the system records the following information: sex, city, province, marital status and date of birth. This information is provided by the user in a web form at the time of registration and, as one would expect, the quality of the data depends on the user's honesty. As a preprocessing step, improbable data are detected and removed, such as birth dates in the future or in the remote past. Also, some additional input fields were not imported into the data mart, since almost all values were left at the default choice in the web form. In other words, those fields were considered not useful for discriminating user choices and preferences.

In order to spare users from typing their login and password at each visit, the vivacity.it web site uses cookies. If a cookie is provided by the user's browser, then authentication is not required. Otherwise, after authentication, a new cookie is sent to the user's browser. With this mechanism, it is possible to track any user as long as she does not delete the cookies on her system. In addition, if the user is registered, the login-cookie association is available in the input data, so the user can be tracked even after she deletes the cookies. This mechanism also allows for detecting non-human users, such as system diagnosis and monitoring programs. By checking the number of cookies assigned to each user, we discovered that the user login test009 was assigned more than 24,000 distinct cookies. This is possible only if the user is a program that automatically deletes the assigned cookies, e.g. a system diagnosis program.

2.2 Web URL

Resources in the World Wide Web are uniformly identified by means of URLs (Uniform Resource Locators). The syntax of an http URL is:

    http://host.domain[:port][abs_path[?query]]

where

- host.domain[:port] is the name of the server site. The TCP/IP port is optional (the default port is 80),
- abs_path is the absolute path of the requested resource in the server filesystem. We further consider abs_path of the form path/filename[.extension], i.e. consisting of the filesystem path, filename and file extension,
- query is an optional collection of parameters,
  to be passed as input to a resource that is actually an executable program, e.g. a CGI script.

On the one hand, a number of normalizations must be performed on URLs in order to remove irrelevant syntactic differences (e.g., the host can be in IP format or host format: 131.114.2.91 is the same host as kdd.di.unipi.it). On the other hand, some web server programs adopt non-standard formats for passing parameters. The vivacity.it web server program is one of them. For instance, in the following URL:

    http://roma.vivacity.it/speciali/EditColonnaSpeciale/1,3478,|DX,00.html

the file name 1,3478,|DX,00 contains a code for the local web site (1 stands for roma.vivacity.it), a web page id (3478) and its specific parameters (DX).

The form above was designed for efficient machine processing. For instance, the web page id is a key into a database table where the page template is found, while the parameters allow the web page content to be retrieved from some other table. Unfortunately, this is a nightmare when mining a clickstream of URLs. Syntactic features of URLs are of little help: we need some semantic information, or ontology [5,13], assigned to URLs.

At best, we can expect an application-level log to be available, i.e. a log of accesses to semantically relevant objects. An example of an application-level log is one recording that the user entered the site from the home page, then visited a sport page with news on a soccer team, and so on. This would require a system module that monitors user steps at a semantic level of granularity. In the ClickWorld project such a module is called ClickObserve. Unfortunately, however, the module is a deliverable of the project, and it was not available for collecting data at the beginning of the project.

Therefore, we decided to extract both syntactic and semantic information from URLs via a semi-automatic approach. The adopted approach consists in reverse-engineering URLs, starting from the web site designer's description of the meaning of each URL path, web page id and web page parameters. Using a PERL script based on the designer's description, we extracted the following information from the original URLs:

- the local web server (i.e., vivacity.it or roma.vivacity.it, etc.), which provides us with some spatial information about user interests;
- a first-level classification of URLs into 24 types, some of which are: home, news, finance, photo galleries, jokes, shopping, forum, pubs;
- a second-level classification of URLs depending on the first-level one, e.g. URLs classified as shopping may be further classified as book shopping or PC shopping, and so on;
- a third-level classification of URLs depending on the second-level one, e.g. URLs classified as book shopping may be further classified as programming book shopping or narrative book shopping, and so on;
- parameter information, further detailing the three-level classification, e.g. URLs classified as programming book shopping may have the ISBN book code as a parameter;
- the depth of the classification, i.e. 1 if the URL has only a first-level classification, 2 if the URL has a first-level and a second-level classification, and so on.

Of course, the adopted approach was mainly a heuristic one, with the hierarchical ontology designed a posteriori. Also, the designed ontology does not exploit any content-based classification: the description of an elementary object such as a sport news item with id 12345 is its code (i.e., first level news, second level sport, parameter information 12345), with no reference to the content of the news (was the news reporting about any specific player?).
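The non-standard file name format discussed in this section can be decoded with a few lines of code. The sketch below is written in Python rather than the PERL used in the project, and it assumes that every such file name has the same four comma-separated fields as the single example given in the text: a site code, a page id, parameters prefixed by `|`, and a trailing field whose meaning the text does not state. The function name `decode_portal_url` and the dictionary keys are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def decode_portal_url(url):
    """Split a portal URL into host, path and the fields packed in its file name.

    Returns None when the file name does not have the assumed four-field layout.
    """
    parts = urlsplit(url)
    filename = posixpath.basename(parts.path)      # e.g. "1,3478,|DX,00.html"
    stem, _extension = posixpath.splitext(filename)
    fields = stem.split(",")
    if len(fields) != 4:
        return None                                # not in the assumed format
    site_code, page_id, params, trailer = fields
    return {
        "host": parts.hostname,                    # local web server
        "path": posixpath.dirname(parts.path),     # URL path without the file name
        "site_code": site_code,                    # code for the local web site
        "page_id": page_id,                        # key of the page template
        "params": params.lstrip("|"),              # page-specific parameters
        "trailer": trailer,                        # meaning not given in the text
    }


decoded = decode_portal_url(
    "http://roma.vivacity.it/speciali/EditColonnaSpeciale/1,3478,|DX,00.html"
)
```

A decoder of this kind only recovers the syntactic fields; mapping a path and page id to a first-level, second-level or third-level type would still rely on the designer's description, which is not reproduced here.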