
Graduation Thesis Foreign Literature Translation (Chinese-English, Computer Science and Technology): Preprocessing and Mining Web Log Data for Web Personalization


Nanjing University of Science and Technology, Taizhou Institute of Science and Technology
Graduation Design (Thesis) Translation of Foreign-Language Material

Department: Computer Science and Technology
Major: Computer Science and Technology
Name:                Student ID:
Source of the original: Dipartimento di Informatica, Università di Pisa
Attachments: 1. Translation of the foreign-language material; 2. The original text.
Advisor's comments:                Signature:            Date:
Note: bind this cover page together with the attachments.

Attachment 1: Translation of the foreign-language material (a Chinese rendering of "Preprocessing and Mining Web Log Data for Web Personalization"; the English original is reproduced in full as Attachment 2 below).

Attachment 2: Original text

Preprocessing and Mining Web Log Data for Web Personalization

M. Baglioni (1), U. Ferrara (2), A. Romei (1), S. Ruggieri (1), and F. Turini (1)

(1) Dipartimento di Informatica, Università di Pisa, Via F. Buonarroti 2, 56125 Pisa, Italy
{baglioni,romei,ruggieri,turini}@di.unipi.it

(2) KSolutions S.p.A., Via Lenin 132/26, 56017 S. Martino Ulmiano (PI), Italy
ferrara@ksolutions.it

Abstract. We describe the web usage mining activities of an on-going project, called ClickWorld, that aims at extracting models of the navigational behaviour of a web site's users. The models are inferred from the access logs of a web server by means of data and web mining techniques. The extracted knowledge is deployed to the purpose of offering a personalized and proactive view of the web services to users. We first describe the preprocessing steps on access logs necessary to clean, select and prepare data for knowledge extraction. Then we show two sets of experiments: the first one tries to predict the sex of a user based on the visited web pages, and the second one tries to predict whether a user might be interested in visiting a section of the site.

Keywords: knowledge discovery, web mining, classification.

1 Introduction

According to [10], Web Mining is the use of data mining techniques to automatically discover and extract information from web documents and services. A common taxonomy of web mining defines three main research lines: content mining, structure mining and usage mining. The distinction between these categories is not a clear cut, and very often approaches use combinations of techniques from different categories.

Content mining covers data mining techniques to extract models from web object contents, including plain text, semi-structured documents (e.g., HTML or XML), structured documents (digital libraries), dynamic documents, and multimedia documents. The extracted models are used to classify web objects, to extract keywords for use in information retrieval, and to infer the structure of semi-structured or unstructured objects.

Structure mining aims at finding the underlying topology of the interconnections between web objects. The model built can be used to categorize and to rank web sites, and also to find out similarities between them.

Usage mining is the application of data mining techniques to discover usage patterns from web data. Data is usually collected from users' interaction with the web, e.g. web/proxy server logs, user queries, and registration data. Usage mining tools [3,4,9,15] discover and predict user behavior, in order to help the designer improve the web site, to attract visitors, or to give regular users a personalized and adaptive service.

In this paper, we describe the web usage mining activities of an on-going project, called ClickWorld, that aims at extracting models of the navigational behavior of users for the purpose of web site personalization [6]. We have collected and preprocessed access logs from a medium-large national web portal, vivacity.it, over a period of five months. The portal includes a national area (www.vivacity.it) with news, forums, jokes, etc., and more than 30 local areas (e.g., www.roma.vivacity.it) with city-specific information, such as local news, restaurant addresses, theatre programming, bus timetables, etc.

The preprocessing steps include data selection, cleaning and transformation, and the identification of users and of user sessions [2]. The result of preprocessing is a data mart of web accesses and registration information. Starting from preprocessed data, web mining aims at pattern discovery by adapting methods from statistics, data mining, machine learning and pattern recognition. Among the basic data mining techniques [7], we mention association rules, discovering groups of objects that are frequently requested together by users; clustering, grouping users with similar browsing patterns, or grouping objects with similar content or access patterns; classification, where a profile is built for users belonging to a given class or category; and sequential patterns, namely sequences of requests which are common to many users.

In the ClickWorld project, several of the mentioned methods are currently being used to extract useful information for proactive personalization of web sites. In this paper, we describe two sets of classification experiments. The first one aims at extracting a classification model able to discriminate the sex of a user based on the set of web pages visited. The second experiment aims at extracting a classification model able to discriminate those users that visit pages regarding e.g. sport or finance from those that typically do not.

2 Preprocessing for Web Personalization

We have developed a data mart of web logs specifically to support web personalization analysis. The data mart is populated starting from a web log data warehouse (such as those described in [8,16]) or, more simply, from raw web/proxy server log files. In this section, we describe a number of preprocessing and coding steps performed for data selection, comprehension, cleaning and transformation. While some of them are general data preparation steps for web usage mining [2,16], it is worth noting that in many of them a form of domain knowledge must necessarily be included in order to clean, correct and complete the input data according to the web personalization requirements.

2.1 User registration data

In addition to web access logs, our given input includes personal data on a subset of users, namely those who are registered to the vivacity.it website (registration is not mandatory). For a registered user, the system records the following information: sex, city, province, civil status, and birth date. This information is provided by the user in a web form at the time of registration and, as one could expect, the quality of the data is up to the user's fairness. As a preprocessing step, improbable data are detected and removed, such as birth dates in the future or in the remote past. Also, some additional input fields were not imported into the data mart since almost all their values were left as the default choice in the web form. In other words, those fields were considered not to be useful in discriminating user choices and preferences.

In order to avoid users having to type their login and password at each visit, the vivacity.it web site adopts cookies. If a cookie is provided by the user's browser, then authentication is not required. Otherwise, after authentication, a new cookie is sent to the user's browser. With this mechanism, it is possible to track any user as long as she does not delete the cookies on her system. In addition, if the user is registered, the association between login and cookie is available in the input data, and then it is possible to track the user also after she deletes the cookies. This mechanism allows for detecting non-human users, such as system diagnosis and monitoring programs. By checking the number of cookies assigned to each user, we discovered that the user login test009 was assigned more than 24,000 distinct cookies. This is possible only if the user is some program that automatically deletes assigned cookies, e.g. a system diagnosis program.

2.2 Web URLs

Resources in the World Wide Web are uniformly identified by means of URLs (Uniform Resource Locators). The syntax of an http URL is:

http://<host.domain>:<port>/<abs_path>?<query>

where host.domain:port is the name of the server site. The TCP/IP port is optional (the default port is 80), and abs_path is the absolute path of the requested resource in the server filesystem. We further consider abs_path of the form path/filename.extension, i.e. consisting of the filesystem path, the filename and the file extension. query is an optional collection of parameters, to be passed as input to a resource that is actually an executable program, e.g. a CGI script.

On the one side, there are a number of normalizations that must be performed on URLs in order to remove irrelevant syntactic differences (e.g., the host can be in IP format or hostname format: 131.114.2.91 is the same host as kdd.di.unipi.it). On the other side, there are some web server programs that adopt non-standard formats for passing parameters. The vivacity.it web server program is one of them. For instance, in the following URL:

http://roma.vivacity.it/speciali/EditColonnaSpeciale/1,3478,|DX,00.html

the file name 1,3478,|DX,00 contains a code for the local web site (1 stands for roma.vivacity.it), a web page id (3478) and its specific parameters (DX). The form above has been designed for efficient machine processing. For instance, the web page id is a key for a database table where the page template is found, while the parameters allow for retrieving the web page content from some other table. Unfortunately, this is a nightmare when mining clickstreams of URLs.

Syntactic features of URLs are of little help: we need some semantic information, or ontology [5,13], assigned to URLs. At best, we can expect that an application-level log is available, i.e. a log of accesses to semantically relevant objects. An example of an application-level log is one recording that the user entered the site from the home page, then visited a sport page with news on a soccer team, and so on. This would require a system module monitoring user steps at a semantic level of granularity. In the ClickWorld project such a module is called ClickObserve. Unfortunately, however, the module is a deliverable of the project, and it was not available for collecting data at the beginning of the project. Therefore, we decided to extract both syntactic and semantic information from URLs via a semi-automatic approach. The adopted approach consists in reverse-engineering the URLs, starting from the web site designer's description of the meaning of each URL path, web page id and web page parameters. Using a PERL script, starting from the designer's description we extracted the following information from the original URLs:

- the local web server (i.e., vivacity.it, roma.vivacity.it, etc.), which provides us with some spatial information about user interests;
- a first-level classification of URLs into 24 types, some of which are: home, news, finance, photo galleries, jokes, shopping, forum, pubs;
- a second-level classification of URLs depending on the first-level one, e.g. URLs classified as shopping may be further classified as book shopping or pc shopping, and so on;
- a third-level classification of URLs depending on the second-level one, e.g. URLs classified as book shopping may be further classified as programming book shopping or narrative book shopping, and so on;
- parameter information, further detailing the three-level classification, e.g. URLs classified as programming book shopping may have the ISBN book code as a parameter;
- the depth of the classification, i.e. 1 if the URL has only a first-level classification, 2 if the URL has a first- and second-level classification, and so on.

Of course, the adopted approach was mainly a heuristic one, with the hierarchical ontology designed a posteriori. Also, the designed ontology does not exploit any content-based classification, i.e. the description of an elementary object such as a sport news item with id 12345 is its code (i.e., first level is news, second level is sport, parameter information is 12345), with no reference to the content of the news (was the news reporting about any specific player?).
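The registration-data cleaning step of Section 2.1 rejects birth dates in the future or in the remote past. A minimal sketch of such a filter follows; the 110-year cutoff is an assumption of this sketch, as the paper does not state a threshold.

```python
from datetime import date

def plausible_birth_date(born: date, today: date, max_age_years: int = 110) -> bool:
    """Detect improbable registration birth dates: anything in the
    future or more than max_age_years in the past is rejected."""
    if born > today:
        return False  # birth date in the future
    return today.year - born.year <= max_age_years
```

Records failing this check would simply be dropped before populating the data mart.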
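The non-human-user check in Section 2.1 (the test009 login assigned more than 24,000 distinct cookies) amounts to counting distinct cookies per login. A sketch of that check, where the threshold value is an assumption rather than a figure from the paper:

```python
from collections import defaultdict

def suspicious_logins(login_cookie_pairs, threshold=1000):
    """Return logins associated with more distinct cookies than
    `threshold`; such logins are likely automated clients that
    delete their cookies, as in the test009 case."""
    cookies_by_login = defaultdict(set)
    for login, cookie in login_cookie_pairs:
        cookies_by_login[login].add(cookie)
    return {login for login, cookies in cookies_by_login.items()
            if len(cookies) > threshold}
```

The input is the login-cookie association mentioned in the paper, i.e. one (login, cookie) pair per authenticated request.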
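The URL normalization mentioned in Section 2.2 can be sketched with the standard library. This sketch lowercases the scheme and host, drops the default port 80, and defaults an empty path to "/"; resolving an IP such as 131.114.2.91 to its hostname would additionally require a DNS lookup, which is omitted here.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Remove irrelevant syntactic differences from an http URL:
    lowercase scheme and host, drop the default port 80, and
    default an empty path to "/"."""
    parts = urlsplit(url)
    netloc = (parts.hostname or "").lower()
    if parts.port and parts.port != 80:
        netloc += ":%d" % parts.port
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))
```

Applying this before any counting or sessionizing ensures that syntactic variants of the same resource collapse to one key.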
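The non-standard vivacity.it filename format can be decoded with a regular expression. The layout below (site code, page id, parameters, then a fixed "00" suffix whose meaning is not explained) is inferred from the single example 1,3478,|DX,00.html given in the paper, so this is an illustrative guess, not the project's actual parser.

```python
import re

# Filename layout inferred from the one example in the paper:
#   "1,3478,|DX,00.html" -> site code 1, page id 3478, parameters "|DX".
FILENAME_RE = re.compile(r"^(?P<site>\d+),(?P<page_id>\d+),(?P<params>[^,]*),00\.html$")

def parse_vivacity_filename(filename: str):
    """Split a vivacity.it file name into its site code, web page id
    and page-specific parameters; return None on non-matching names."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    return {"site": int(m.group("site")),
            "page_id": int(m.group("page_id")),
            "params": m.group("params")}
```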
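The three-level classification plus depth produced by the reverse-engineering step could be represented as a lookup from page id to category path. The table below is a toy stand-in for the designer's description; the real mapping used in ClickWorld is not given in the paper, and these ids and paths are invented for illustration.

```python
# Toy stand-in for the designer's description of page ids.
PAGE_ONTOLOGY = {
    3478: ("news",),                                               # depth 1
    5120: ("shopping", "book shopping"),                           # depth 2
    5121: ("shopping", "book shopping", "programming book shopping"),
}

def classify_page(page_id):
    """Return (classification levels, depth) for a page id, mimicking
    the up-to-three-level classification of Section 2.2."""
    levels = PAGE_ONTOLOGY.get(page_id, ())
    return levels, len(levels)
```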
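For the classification experiments announced in the introduction (e.g. predicting a user's sex from the set of visited pages), the clickstream must be turned into feature vectors. One plausible encoding, not necessarily the one used in ClickWorld, is counts over the first-level URL categories:

```python
from collections import Counter

def user_feature_vector(visited_categories, all_categories):
    """Encode a user's clickstream as visit counts over first-level
    URL categories, yielding one fixed-length vector per user."""
    counts = Counter(visited_categories)
    return [counts.get(c, 0) for c in all_categories]
```

A classifier could then be trained on such vectors, with the registered users' declared sex as the class label.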
