Graduation Thesis Foreign Literature Translation (Computer Science and Technology): Preprocessing and Mining Web Log Data for Web Personalization

Nanjing University of Science and Technology, Taizhou College
Graduation Design (Thesis) Foreign Literature Translation

Department: Computer Science and Technology
Major: Computer Science and Technology
Name:
Student ID:
Source of the foreign text: Dipartimento di Informatica, Universita di Pisa

Preprocessing and Mining Web Log Data for Web Personalization

M. Baglioni (1), U. Ferrara (2), A. Romei (1), S. Ruggieri (1), and F. Turini (1)
(1) Dipartimento di Informatica, Universita di Pisa, Via F. Buonarroti 2, 56125 Pisa, Italy
{baglioni,romei,ruggieri,turini}@di.unipi.it
(2) KSolutions S.p.A., Via Lenin 132/26, 56017 S. Martino Ulmiano (PI), Italy
ferrara@ksolutions.it

Abstract. We describe the web usage mining activities of an on-going project, called ClickWorld, that aims at extracting models of the navigational behaviour of a web site's users. The models are inferred from the access logs of a web server by means of data and web mining techniques. The extracted knowledge is deployed to the purpose of offering a personalized and proactive view of the web services to users. We first describe the preprocessing steps on access logs necessary to clean, select and prepare data for knowledge extraction. Then we show two sets of experiments: the first one tries to predict the sex of a user based on the visited web pages, and the second one tries to predict whether a user might be interested in visiting a section of the site.

Keywords: knowledge discovery, web mining, classification.

1 Introduction

According to [10], web mining is the use of data mining techniques to automatically discover and extract information from web documents and services. A common taxonomy of web mining defines three main research lines: content mining, structure mining and usage mining. The distinction between these categories is not a clear cut, and approaches very often combine techniques from different categories.

Content mining covers data mining techniques to extract models from web object contents, including plain text, semi-structured documents (e.g., HTML or XML), structured documents (digital libraries), dynamic documents, and multimedia documents. The extracted models are used to classify web objects, to extract keywords for use in information retrieval, and to infer the structure of semi-structured or unstructured objects.

Structure mining aims at finding the underlying topology of the interconnections between web objects. The model built can be used to categorize and to rank web sites, and also to find out similarities between them.

Usage mining is the application of data mining techniques to discover usage patterns from web data. Data is usually collected from users' interaction with the web, e.g. web/proxy server logs, user queries and registration data. Usage mining tools [3,4,9,15] discover and predict user behavior, in order to help the designer improve the web site, attract visitors, or give regular users a personalized and adaptive service.

In this paper, we describe the web usage mining activities of an on-going project, called ClickWorld, that aims at extracting models of the navigational behavior of users for the purpose of web site personalization [6]. We have collected and preprocessed access logs from a medium-large national web portal, vivacity.it, over a period of five months. The portal includes a national area (www.vivacity.it) with news, forums, jokes, etc., and more than 30 local areas (e.g., www.roma.vivacity.it) with city-specific information, such as local news, restaurant addresses, theatre programming and bus timetables.

The preprocessing steps include data selection, cleaning and transformation, and the identification of users and of user sessions [2]. The result of preprocessing is a data mart of web accesses and registration information.
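[Editor's illustration] The identification of users and sessions is only cited here ([2]), not spelled out. The sketch below is one conventional reconstruction in Python: it parses Common Log Format lines and splits each user's requests into sessions on 30 minutes of inactivity. The log format, the 30-minute timeout, and keying users by host are assumptions, not details taken from the ClickWorld data.

import re
from collections import defaultdict
from datetime import datetime, timedelta

# One Common Log Format line, e.g.:
# 131.114.2.91 - - [10/Mar/2003:14:55:36 +0100] "GET /news/index.html HTTP/1.1" 200 2326
CLF = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+')

def parse_line(line):
    m = CLF.match(line)
    if m is None:
        return None  # malformed line: discarded during cleaning
    host, ts, method, url, status = m.groups()
    when = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")
    return host, when, method, url, int(status)

def sessionize(records, timeout=timedelta(minutes=30)):
    """Group each user's requests into sessions, splitting on inactivity gaps."""
    by_user = defaultdict(list)
    for host, when, _method, url, _status in records:
        # Here the user key is the requesting host; with the portal's cookies
        # one would key on the cookie id instead.
        by_user[host].append((when, url))
    sessions = []
    for user, hits in by_user.items():
        hits.sort()
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            if cur[0] - prev[0] > timeout:
                sessions.append((user, current))
                current = []
            current.append(cur)
        sessions.append((user, current))
    return sessions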
Starting from the preprocessed data, web mining aims at pattern discovery by adapting methods from statistics, data mining, machine learning and pattern recognition. Among the basic data mining techniques [7], we mention association rules, discovering groups of objects that are frequently requested together by users; clustering, grouping users with similar browsing patterns, or grouping objects with similar content or access patterns; classification, where a profile is built for users belonging to a given class or category; and sequential patterns, namely sequences of requests which are common to many users.

In the ClickWorld project, several of the mentioned methods are currently being used to extract useful information for proactive personalization of web sites. In this paper, we describe two sets of classification experiments. The first one aims at extracting a classification model able to discriminate the sex of a user based on the set of web pages visited. The second experiment aims at extracting a classification model able to discriminate those users that visit pages regarding, e.g., sport or finance from those that typically do not.

2 Preprocessing for Web Personalization

We have developed a data mart of web logs specifically to support web personalization analysis. The data mart is populated starting from a web log data warehouse (such as those described in [8,16]) or, more simply, from raw web/proxy server log files. In this section, we describe a number of preprocessing and coding steps performed for data selection, comprehension, cleaning and transformation. While some of them are general data preparation steps for web usage mining [2,16], it is worth noting that in many of them a form of domain knowledge must necessarily be included in order to clean, correct and complete the input data according to the web personalization requirements.

2.1 User registration data

In addition to web access logs, our given input includes personal data on a subset of users, namely those who are registered to the vivacity.it web site (registration is not mandatory). For a registered user, the system records the following information: sex, city, province, civil status and date of birth. This information is provided by the user in a web form at the time of registration and, as one could expect, the quality of the data depends on the user's honesty. As a preprocessing step, improbable data are detected and removed, such as birth dates in the future or in the remote past. Also, some additional input fields were not imported into the data mart, since almost all their values were left as the default choice of the web form. In other words, those fields were considered not to be useful in discriminating user choices and preferences.

In order to avoid users having to type their login and password at each visit, the vivacity.it web site adopts cookies. If a cookie is provided by the user's browser, then authentication is not required. Otherwise, after authentication, a new cookie is sent to the user's browser. With this mechanism, it is possible to track any user as long as she does not delete the cookies on her system. In addition, if the user is registered, the association login-cookie is available in the input data, and then it is possible to track the user even after she deletes the cookies. This mechanism also allows for detecting non-human users, such as system diagnosis and monitoring programs. By checking the number of cookies assigned to each user, we discovered that the user login test009 was assigned more than 24,000 distinct cookies. This is possible only if the user is some program that automatically deletes assigned cookies, e.g. a system diagnosis program.
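[Editor's illustration] The paper reports the outcome of this check (login test009 bound to over 24,000 cookies) but not its mechanics. A minimal sketch of the idea, assuming the input data expose (login, cookie) pairs; the field names and the flagging threshold are illustrative, not from the paper.

from collections import defaultdict

def suspicious_logins(login_cookie_pairs, threshold=1000):
    """Flag logins bound to an implausible number of distinct cookies.

    A human who occasionally clears cookies accumulates a handful of
    cookie ids; a program that drops its cookie on every run accumulates
    thousands (test009 had more than 24,000)."""
    cookies = defaultdict(set)
    for login, cookie in login_cookie_pairs:
        cookies[login].add(cookie)
    return {login: len(ids) for login, ids in cookies.items() if len(ids) >= threshold}

# Example with made-up pairs, as they might come out of the access data mart:
pairs = [("test009", f"ck{i}") for i in range(24001)] + [("alice", "ck-a")]
print(suspicious_logins(pairs))  # {'test009': 24001}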
2.2 Web URL

Resources in the World Wide Web are uniformly identified by means of URLs (Uniform Resource Locators). The syntax of an http URL is:

    http://host.domain:port/abs_path?query

where host.domain:port is the name of the server site. The TCP/IP port is optional (the default port is 80), and abs_path is the absolute path of the requested resource in the server filesystem. We further consider abs_path of the form path/filename.extension, i.e. consisting of the filesystem path, the filename and the file extension. query is an optional collection of parameters, to be passed as input to a resource that is actually an executable program, e.g. a CGI script.

On the one side, there are a number of normalizations that must be performed on URLs in order to remove irrelevant syntactic differences (e.g., the host can be in IP format or host format: 131.114.2.91 is the same host as kdd.di.unipi.it). On the other side, there are some web server programs that adopt non-standard formats for passing parameters. The vivacity.it web server program is one of them. For instance, in the following URL:

    http://roma.vivacity.it/speciali/EditColonnaSpeciale/1,3478,|DX,00.html

the file name 1,3478,|DX,00 contains a code for the local web site (1 stands for roma.vivacity.it), a web page id (3478) and its specific parameters (DX). The form above has been designed for efficient machine processing. For instance, the web page id is a key for a database table where the page template is found, while the parameters allow for retrieving the web page content from some other table. Unfortunately, this is a nightmare when mining clickstreams of URLs. Syntactic features of URLs are of little help: we need some semantic information, or ontology [5,13], assigned to URLs. At best, we can expect that an application-level log is available, i.e. a log of accesses to semantically relevant objects. An example of an application-level log is one recording that the user entered the site from the home page, then visited a sport page with news on a soccer team, and so on. This would require a system module monitoring user steps at a semantic level of granularity. In the ClickWorld project such a module is called ClickObserve. Unfortunately, however, the module is a deliverable of the project, and it was not available for collecting data at the beginning of the project. Therefore, we decided to extract both syntactic and semantic information from URLs via a semi-automatic approach. The adopted approach consists in reverse-engineering URLs, starting from the web site designer's description of the meaning of each URL path, web page id and web page parameters.
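[Editor's illustration] The paper shows the filename format only through the single example above, so the decoder below is a guess at the grammar: comma-separated fields holding the site code, the page id, the page parameters and a trailing code. Everything beyond the three fields named in the text (site code, page id, parameters) is an assumption.

import re
from urllib.parse import urlparse

# Guessed grammar for file names like "1,3478,|DX,00.html":
#   <site-code>,<page-id>,<parameters>,<trailing>.html
FNAME = re.compile(r"^(\d+),(\d+),([^,]*),(\d+)\.html$")

def decode_vivacity_url(url):
    parts = urlparse(url)
    path, _, fname = parts.path.rpartition("/")
    m = FNAME.match(fname)
    if m is None:
        return None
    site_code, page_id, params, trailing = m.groups()
    return {
        "host": parts.netloc,      # e.g. roma.vivacity.it: spatial information
        "path": path,              # e.g. /speciali/EditColonnaSpeciale
        "site_code": site_code,    # 1 stands for roma.vivacity.it
        "page_id": page_id,        # key into the page-template table
        "params": params,          # page-specific parameters, e.g. |DX
        "trailing": trailing,      # meaning not given in the paper; kept verbatim
    }

print(decode_vivacity_url(
    "http://roma.vivacity.it/speciali/EditColonnaSpeciale/1,3478,|DX,00.html"))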
Using a Perl script, starting from the designer's description we extracted the following information from the original URLs:

- the local web server (i.e., vivacity.it, roma.vivacity.it, etc.), which provides us with some spatial information about user interests;
- a first-level classification of URLs into 24 types, some of which are: home, news, finance, photo galleries, jokes, shopping, forum, pubs;
- a second-level classification of URLs depending on the first-level one; e.g., URLs classified as shopping may be further classified as book shopping or pc shopping, and so on;
- a third-level classification of URLs depending on the second-level one; e.g., URLs classified as book shopping may be further classified as programming book shopping or narrative book shopping, and so on;
- a parameter, further detailing the three-level classification; e.g., URLs classified as programming book shopping may have the ISBN book code as parameter;
- the depth of the classification, i.e. 1 if the URL has only a first-level classification, 2 if the URL has first- and second-level classifications, and so on.

Of course, the adopted approach was mainly a heuristic one, with the hierarchical ontology designed a posteriori. Also, the designed ontology does not exploit any content-based classification: the description of an elementary object such as a sport news item with id 12345 is its code (i.e., first level news, second level sport, parameter 12345), with no reference to the content of the news (was the news reporting about any specific player?).
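[Editor's illustration] One way to picture the result of this reverse-engineering step is a lookup table from page id to classification levels, with the depth derived from how many levels are filled in. The table entries below are invented for illustration; only the category names come from the paper, and the original extraction was done in Perl rather than Python.

# Hypothetical excerpt of the designer-supplied ontology: page id -> levels.
# The ids and level combinations are made up; only the level names appear
# in the paper.
ONTOLOGY = {
    "3478": ("news", "sport", None),
    "9911": ("shopping", "book shopping", "programming book shopping"),
    "0042": ("home", None, None),
}

def classify(page_id, parameter=None):
    levels = ONTOLOGY.get(page_id)
    if levels is None:
        return None
    first, second, third = levels
    depth = sum(1 for lvl in levels if lvl is not None)
    return {
        "first_level": first,
        "second_level": second,
        "third_level": third,
        "parameter": parameter,  # e.g. the ISBN for programming book shopping
        "depth": depth,
    }

print(classify("9911", parameter="88-8183-002-1"))  # depth 3; ISBN is an example value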
