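System ID U0026-1608201610194600
Title (Chinese) 大數據即時平台上的重贅節省資料傳輸方法
Title (English) Fast Deduplication Data Transmission Scheme on a Big Data Real-time Platform
Institution National Cheng Kung University
Department (Chinese) 資訊工程學系
Department (English) Institute of Computer Science and Information Engineering
Academic year 104 (ROC calendar; 2015-2016)
Semester 2
Year of publication 105 (ROC calendar; 2016)
Author (Chinese) 陳建廷
Author (English) Jian-Ting Chen
Student ID P76034575
Degree Master's
Language English
Number of pages 49
Oral defense committee Advisor-鄭憲宗
Committee member-王英宏
Committee member-蕭宏章
Committee member-蔡垂雄
Committee member-周承復
Keywords (Chinese) 大數據 (Big Data)  資料重複刪除技術 (Deduplication)  In-memory Computing  Spark 
Keywords (English) Big Data  Deduplication  In-memory Computing  Spark 
Subject classification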
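Abstract (translated from Chinese) With the arrival of the big data era, making effective use of massive amounts of data has become a major challenge. The goal is to process more data in less time, or even in real time, and the earlier distributed computing framework MapReduce no longer meets real-time requirements. To address this problem, the concept of In-memory Computing (IMC) was proposed.
As its name suggests, in-memory computing removes the cost incurred by MapReduce's excessive disk access and can execute distributed iterative computations efficiently. However, IMC distributed computing still cannot escape one bottleneck: network bandwidth, which limits both fetching data from the source and distributing it to the nodes. Observation shows that part of the data from sensor devices is duplicated because of temporal or spatial dependence. Deduplication is therefore a promising solution: eliminating the duplicate portions of the data improves transmission efficiency.
This study proposes a deduplication data transmission scheme to optimize Spark Streaming, an IMC-based parallel real-time computing platform. It uses deduplication to remove blocks of the source data that are likely to be duplicates, with the aim of improving data utilization. Under the same bandwidth, the scheme can therefore transmit more data and in turn raise the processing capacity of the computing platform.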
Abstract (English) With the arrival of the big data era, exploiting and computing on these data efficiently has become a difficult problem. Today, MapReduce is inadequate for handling more data in less time, let alone in real time. Hence, “In-memory Computing (IMC)” was proposed to resolve this problem of Hadoop MapReduce.
IMC, as its name suggests, computes in memory to avoid the cost caused by Hadoop's excessive disk access, and it can perform distributed iterative operations efficiently. However, IMC distributed computing still cannot get rid of one bottleneck: network bandwidth, which restricts the speed of receiving data from the source and of distributing data to each node. According to observation, some data from sensor devices may be duplicated because of temporal or spatial dependence. Deduplication is therefore a good solution: by eliminating the duplicate parts of the data, it can improve data utilization.
This study presents an optimization of the distributed real-time IMC platform “Spark Streaming” that uses deduplication to eliminate possible duplicate blocks at the source. It is expected to reduce redundant data transmission and improve the throughput of Spark Streaming.
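The block-level deduplication idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's scheme: fixed-size blocks, full SHA-1 digests as fingerprints, and the names `BLOCK_SIZE`, `fingerprint`, `encode`, `decode`, and `wire_size` are all assumptions made for illustration only.

```python
import hashlib

BLOCK_SIZE = 64  # bytes per block; hypothetical value chosen for the demo


def fingerprint(block: bytes) -> bytes:
    """Full SHA-1 digest (20 bytes) used as the block fingerprint (an assumption)."""
    return hashlib.sha1(block).digest()


def encode(payload: bytes, seen: set) -> list:
    """Sender side: replace blocks whose fingerprint was already sent by a reference."""
    messages = []
    for i in range(0, len(payload), BLOCK_SIZE):
        block = payload[i:i + BLOCK_SIZE]
        fp = fingerprint(block)
        if fp in seen:
            messages.append(("ref", fp))     # only the fingerprint travels
        else:
            seen.add(fp)
            messages.append(("raw", block))  # first occurrence travels in full
    return messages


def decode(messages: list, cache: dict) -> bytes:
    """Receiver side: rebuild the payload, caching every raw block by fingerprint."""
    out = bytearray()
    for kind, body in messages:
        if kind == "raw":
            cache[fingerprint(body)] = body
            out += body
        else:
            out += cache[body]               # look up a previously received block
    return bytes(out)


def wire_size(messages: list) -> int:
    """Bytes sent on the wire, ignoring any framing overhead."""
    return sum(len(body) for _, body in messages)


if __name__ == "__main__":
    payload = b"A" * 64 + b"B" * 64 + b"A" * 64 + b"A" * 64
    sent = encode(payload, set())
    assert decode(sent, {}) == payload
    print(len(payload), "bytes of data,", wire_size(sent), "bytes on the wire")
```

In this sketch a reference only saves bandwidth when the fingerprint is shorter than the block, so block length, fingerprint length, and repetition rate jointly determine the benefit.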
Table of contents Abstract (Chinese) I
Abstract II
ACKNOWLEDGEMENT IV
TABLE OF CONTENTS V
LIST OF TABLES VII
LIST OF FIGURES VIII
Chapter 1. Introduction and Motivation 1
1.1. Introduction 1
1.2. Motivation 2
1.3. Thesis Overview 5
Chapter 2. Backgrounds 6
2.1. Spark 6
2.1.1. Spark Core 6
2.1.2. Spark Streaming 10
2.2. Rsync 13
2.2.1. Algorithm 13
2.2.2. The Problem of FSP 15
2.3. Content Defined Chunking (CDC) 16
Chapter 3. Data Deduplication Transmission Scheme 18
3.1. Problem Description 18
3.2. Scheme Overview 19
3.2.1. Bandwidth-saving Model 23
3.3. Block Fingerprint 24
3.4. Data Chunk Preprocess 28
Chapter 4. Implementation and Experiment 32
4.1. Experiment Environment and Settings 32
4.2. Implementation 33
4.3. Experiment Result 35
4.3.1. Length of Data Block 36
4.3.2. Repetition Rate 38
4.3.3. Length of Fingerprint 40
4.3.4. Bandwidth 42
4.3.5. Physical World Taxi GPS Trajectory Dataset 44
Chapter 5. Conclusions and Future work 46
References 48


References [1] “Google MapReduce,” 2011, http://research.google.com/archive/mapreduce.html [Jun. 30, 2016].
[2] “Hadoop,” 2014, http://hadoop.apache.org/ [Jun. 30, 2016].
[3] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in HotCloud, 2010.
[4] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “A Survey on Sensor Networks,” IEEE Communications Magazine, vol. 40, no. 8, Aug. 2002, pp. 102-114.
[5] The MD5 Message-Digest Algorithm, IETF RFC 1321, April 1992; www.rfc-editor.org/rfc/rfc1321.txt.
[6] US secure hash algorithm 1 (SHA1), IETF RFC 3174, 2001; www.rfc-editor.org/rfc/rfc3174.txt.
[7] A. Tridgell and P. Mackerras, “The Rsync Algorithm,” Technical Report TR-CS-96-05, Department of Computer Science, The Australian National University, Canberra, Australia, June 1998. Available: https://rsync.samba.org/tech_report/ [Jun. 30, 2016].
[8] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 2-2, USENIX Association, 2012.
[9] “The Scala programming language,” 2016, http://www.scala-lang.org [Jun. 30, 2016].
[10] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica, “Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters,” In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing, pages 10–10. USENIX Association, 2012.
[11] ZLIB compressed data format specification version 3.3, IETF RFC 1950, May 1996; www.rfc-editor.org/rfc/rfc1950.txt.
[12] A. Muthitacharoen, B. Chen, and D. Mazieres, “A low-bandwidth network file system,” ACM SIGOPS Operating Systems Review, vol. 35, no. 5, ACM, 2001.
[13] M. O. Rabin. “Fingerprinting by random polynomials.” Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
[14] D. T. Meyer and W. J. Bolosky, “A Study of Practical Deduplication,” in Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST’ 11), 2011, pp. 1-14.
[15] The ‘application/zlib’ and ‘application/gzip’ Media Types, IETF RFC 6713, August 2012; www.rfc-editor.org/rfc/rfc6713.txt.
[16] Y. Collet, “xxhash,” https://github.com/Cyan4973/xxHash [Jun. 27, 2016].
[17] A. Appleby, “SMHasher & MurmurHash,” 2012, https://github.com/aappleby/smhasher [Jun. 27, 2016].
[18] J. Yuan, Y. Zheng, X. Xie, and G. Sun, “Driving with knowledge from the physical world,” In The 17th ACM SIGKDD (international conference on Knowledge Discovery and Data mining), KDD'11, New York, NY, USA, 2011. ACM.
[19] L. A. Zadeh, “Fuzzy sets,” Information and Control, vol. 8, pp. 338-353, 1965.
[20] J. Kreps, N. Narkhede, and J. Rao. “Kafka: A distributed messaging system for log processing.” In Proceedings of 6th International Workshop on Networking Meets Databases (NetDB), Athens, Greece, 2011.
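Full-text access rights
  • Authorized for on-campus browsing/printing of the electronic full text; publicly available from 2019-08-20.
  • Authorized for off-campus browsing/printing of the electronic full text; publicly available from 2019-08-20.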

