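The electronic thesis is not yet authorized for public release; for the print copy, please check the library catalog.
(Note: if the thesis cannot be found, or the holdings status shows "closed stacks, not public", the copy is not in the stacks and cannot be retrieved.)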
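System ID U0026-0512201907085600
Title (Chinese) 大數據系統於半導體產業之設計、實現與應用 (Design, Implementation, and Application of Big Data Systems in the Semiconductor Industry)
Title (English) Distributed Big Data Platform Services and Their Applications on Semiconductor Wafer Fabrication Foundries
University National Cheng Kung University
Department (Chinese) 資訊工程學系
Department (English) Institute of Computer Science and Information Engineering
Academic year 108 (ROC calendar)
Semester 1
Year of publication 108 (ROC calendar)
Graduate student (Chinese) 蔡嘉平
Graduate student (English) Chia-Ping Tsai
Email chia7712@gmail.com
Student ID P78011078
Degree Doctoral
Language English
Number of pages 92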
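Oral defense committee Advisor: 蕭宏章
Committee member: 賴冠州
Committee member: 許慶賢
Committee member: 陳朝鈞
Committee member: 張哲唯
Committee member: 劉炳宏
Committee member: 梁勝富
Committee member: 張燕光
Committee member: 莊坤達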
Keywords (Chinese) big data  distributed systems  data systems  computing systems
Keywords (English) HBase  Hadoop  Distributed  big data
Subject classification
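Abstract (Chinese) This thesis presents two application services that were researched, designed, and implemented jointly with a Taiwanese semiconductor manufacturer over four years: the Hadoop data service (HDS) and the Distributed R language computing service (DRS). HDS is a distributed file access system built on HDFS and HBase. It is scalable, available, and stable, and it provides several powerful features: 1) RESTful APIs, so that programs written in different languages can operate on HDS; 2) optimized access for both large and small files, in that HDS detects the size of the data being written and chooses a suitable storage location; 3) integration with popular big data analytics tools such as Spark and Hive. DRS is a distributed data analysis service built on YARN. Besides sharing the distributed advantages of HDS, its design also supports fault tolerance, resource management, isolation of computing environments, and packaging of computing environments.
In practice, DRS first supports the R runtime environment: R users can write programs, access data through HDS, package the programs, and run them in a distributed environment through DRS.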
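The size-based placement described for HDS can be illustrated with a minimal sketch. The abstract says only that HDS detects the size of incoming data and picks a suitable location; the threshold value, the direction of the mapping (small objects to HBase, large objects to HDFS), and the names `SMALL_OBJECT_LIMIT_BYTES`, `Placement`, and `choose_backend` are all assumptions made for illustration, not the thesis's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the abstract does not state the threshold HDS uses.
SMALL_OBJECT_LIMIT_BYTES = 10 * 1024 * 1024


@dataclass(frozen=True)
class Placement:
    backend: str  # "hbase" or "hdfs"
    reason: str


def choose_backend(size_bytes: int, limit: int = SMALL_OBJECT_LIMIT_BYTES) -> Placement:
    """Pick a storage backend from the object size alone.

    Assumption: small objects go to HBase and large objects go to HDFS.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= limit:
        return Placement("hbase", "small object: stored as a table cell")
    return Placement("hdfs", "large object: stored as a file")
```

Under these assumptions, a caller would pass the size of an uploaded object and write it to whichever backend the returned `Placement` names.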
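This thesis also examines data restoration methods commonly seen in current NoSQL applications. The goal is to build a system that can restore data to an arbitrary timestamp, a property that enables many further applications such as data debugging, auditing, and testing. On the implementation side, the thesis proposes different approaches to restoration for different scenarios, taking into account memory overhead, degree of parallelism, software compatibility, and restoration granularity. Each of these considerations has its own trade-offs and varies in degree, so the thesis proposes several theories and methods that adapt to the different scenarios and realizes them in Apache HBase. These implementations were verified by the community and merged back into Apache HBase as one of the system's restoration tools.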
Abstract (English) This thesis first presents two novel infrastructural services based on Hadoop for big data storage and computing at a Taiwanese semiconductor wafer fabrication foundry. The two services are the Hadoop data service (HDS) and the distributed R language computing service (DRS), which have been built and operated in production systems for four years. They have evolved over time by incrementally accommodating users' requirements. HDS is a web-based distributed big data storage facility. Users simply rely on HDS to access data objects stored in Hadoop over HTTP. In addition, HDS is scalable and reliable. Moreover, HDS achieves efficiency by intelligently selecting either the Hadoop distributed file system (HDFS) or the database (HBase) for publishing data objects. HDS is also transparent to existing analytics and data inquiry applications, such as Spark and Hive. While HDS is a unified storage layer that supports sequential and random access to big data in the wafer fabrication foundry, DRS is a distributed computing framework for typical R language users. R users employ DRS to run data-parallel computations effortlessly and seamlessly.
Similar to HDS, DRS can be horizontally scaled out. It guarantees the completion of computational jobs even with failures. In particular, it adaptively reallocates computational resources on the fly, minimizing job execution time and maximizing utilization of allocated resources. This thesis discusses the design and implementation features for HDS and DRS.
It also demonstrates their performance metrics. In addition to the discussion of HDS and DRS, this thesis further addresses the data restoration issue for Not Only SQL (NoSQL) distributed databases (NoSQL for short). NoSQL is a state-of-the-art technology that is scalable and provides flexible schemas, thereby complementing existing relational database technologies.
Although NoSQL is flourishing, present solutions lack the features that enterprises require for mission-critical use. In this thesis, we also explore solutions to the data recovery issue in NoSQL. Data recovery for any database table entails restoring the table to a prior state or replaying (insert/update) operations over the table for a given time period in the past. Recovery of NoSQL database tables enables applications such as failure recovery, analysis of historical data, debugging, and auditing. In this thesis, we identify the design and implementation issues of the data recovery problem for NoSQL databases, including the time length of recovery, fault tolerance, scalability, memory constraints, software compatibility, and quality of recovery. Four solutions are then proposed and evaluated to address the data recovery problem in NoSQL; each solution has its pros and cons. We implement our proposals on Apache HBase, a popular NoSQL database in the Hadoop ecosystem. Our implementations are extensively benchmarked with an industrial NoSQL benchmark in real environments. Based on our study, one of our solutions has been rigorously reviewed by the Apache HBase community. It is currently integrated into Apache HBase and distributed worldwide.
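The idea of restoring a table to a prior state by replaying insert/update operations up to a chosen time can be sketched on a toy in-memory log. This is only an illustration of the general technique under simplifying assumptions (a single ordered log, integer timestamps, no deletes); the names `Op` and `restore_to` are hypothetical, and the sketch is not one of the four solutions proposed in the thesis nor the tool merged into Apache HBase.

```python
from typing import Dict, Iterable, NamedTuple


class Op(NamedTuple):
    ts: int     # timestamp of the insert/update
    row: str    # row key
    value: str  # value written at that timestamp


def restore_to(log: Iterable[Op], target_ts: int) -> Dict[str, str]:
    """Rebuild the table state as of target_ts by replaying the log in time order."""
    table: Dict[str, str] = {}
    for op in sorted(log, key=lambda o: o.ts):
        if op.ts > target_ts:
            break  # later operations are not part of the restored state
        table[op.row] = op.value
    return table


# Hypothetical sample log for illustration only.
log = [
    Op(1, "wafer-1", "queued"),
    Op(2, "wafer-2", "queued"),
    Op(3, "wafer-1", "etched"),
]
```

Calling `restore_to(log, 2)` replays only the first two operations, so the update at timestamp 3 is excluded from the restored state.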
Table of contents Abstract
Abstract (Chinese)
Acknowledgement
1. Introduction
1.1 The Hadoop Data Service and Distributed R Computing Platform
1.2 Our Contributions
1.3 The Time Machine in NoSQL
1.4 Our Contributions
1.5 The Thesis Organization
2. The Hadoop Data Service and Distributed R Computing Platform
2.1 Overview
2.1.1 The Roadmap
2.2 Building Blocks
2.2.1 HDFS
2.2.2 HBase
2.2.3 YARN
2.2.4 ZooKeeper
2.3 Design and Implementation
2.3.1 Hadoop Data Service
2.3.2 Distributed R Language Computing Service
2.4 Performance Evaluation
2.4.1 Experimental Setup
2.4.2 Performance Results
2.5 Related Works
3. The Time Machine in NoSQL
3.1 Overview
3.1.1 The Roadmap
3.2 Background
3.3 System Model and Research Problem
3.3.1 System Model
3.3.2 Research Problem
3.4 Proposals
3.4.1 Design Considerations
3.4.2 Architecture
3.4.3 Application Programming Interfaces (APIs)
3.4.4 Data Structure
3.4.5 Approaches
3.4.6 Discussions
3.5 Performance Evaluation
3.5.1 Experimental Setup
3.5.2 Performance Results
3.6 Related Works
4. Summary
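Full-text access rights
  • On-campus browsing and printing of the electronic full text is authorized, with public access from 2024-12-27.
  • Off-campus browsing and printing of the electronic full text is authorized, with public access from 2024-12-27.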

