進階搜尋


 
系統識別號 U0026-2608201316405100
論文名稱(中文) 一個使用Map-Phase-Multi-Join方法的SPARQL查詢系統
論文名稱(英文) A SPARQL Query Processing System using Map-Phase-Multi join
校院名稱 成功大學
系所名稱(中) 電腦與通信工程研究所
系所名稱(英) Institute of Computer & Communication
學年度 101
學期 2
出版年 102
研究生(中文) 余家豪
研究生(英文) Chia-Ho Yu
學號 Q36001054
學位類別 碩士
語文別 英文
論文頁數 47頁
口試委員 指導教授-謝錫堃
口試委員-陳俊良
口試委員-周立德
口試委員-侯廷偉
口試委員-曾黎明
中文關鍵字 語意網  雲端運算  雲端資料庫 
英文關鍵字 Semantic Web  MapReduce  NoSQL  Linked Open Data  TripleStore  SPARQL  RDF 
學科別分類
中文摘要 由於以RDF格式儲存的資料如雨後春筍般越來越多,用來儲存與處理這些大量RDF資料的TripleStore面臨了儲存大量資料之壓力與處理效能之瓶頸。而目前在處理大量資料與處理效能的技術方面以雲端技術為主軸,利用Google提出來的運算模型MapReduce,我們可以藉由分散式環境來提高處理大量資料的效能,所以在改善TripleStore處理效能的相關研究們多是利用這些持續發展中的雲端技術,包含了使用MapReduce來處理SPARQL query與使用NoSQL資料庫來儲存這些上億筆的RDF資料。然而,在處理複雜的SPARQL query之部分依舊沒有一個研究能夠有效率的處理它,處理的時間依資料量的大小可能為幾十秒或是上百秒,很顯然這個系統還有許多可以改進的機會。而在本篇論文中,我們提出了一個Map-Phase-Multi Join演算法來處理SPARQL query,包含了僅利用一個MapReduce工作來處理一個複雜的query,避免開啟多個工作可能需要更多的初始時間之問題,另外此演算法也實作在Map端避免那些不會成為解答之資料還必須在群集中被傳輸,浪費分散式運算中寶貴的頻寬,其他的改進部分如設計一個全新的HBase儲存schema以及一個Join-Order的分析工具都是此系統不可或缺的元件。在系統效能評測的部分,我們證明了此系統不論在處理簡單的query或複雜的query幾乎都打敗了傳統的系統,而與目前相關研究中效能最好的系統H2RDF相比,我們也在部分測資上大幅度領先,在檢討效能比較差的測資時,我們也發現了許多不足的地方尚待改進,期許未來我們能夠繼續精進我們的系統並讓它成為一個實用且有貢獻的系統。
英文摘要 In recent years, the RDF data grows in a rapid manner. Therefore the TripleStore, which stores and processes these data, requires more scalable and efficient technologies. The programming model “MapReduce” is the most representative of Cloud technology which aims to solve scalable and efficiency of computing. In the past few years, there were several researches using cloud technologies to construct their TripleStore. The implementations of these approaches have something in common, a MapReduce algorithm to serve the SPARQL query and a distributed storage system like HDFS or HBase to store billions of RDF data. However, these systems still exhibit poor performance in processing complex SPARQL queries. In this thesis, we propose a Map-Phase-Multi Join algorithm for processing SPARQL queries. The advantages includes the using of multi-join to reduce the job initialization time by avoiding iterative of MapReduce jobs and map-phase join to save bandwidth cost by preventing join-less data transferring among cluster. Furthermore, we design a storage schema in HBase which enhances our join algorithm’s performance. Finally, we formulate a join-order rule which aims to reduce the input size of mappers and the number of join computations. The evaluation results of our proposed system defeats traditional join approaches in most simple queries and complex queries. To summarized, the performance of our system is closed related to HBase’s read request time and the input size of mappers. We will keep refining our system to work out these issues which have high correlations to the system performance.
論文目次 Chapter 1: Introduction 1
Chapter 2: Background 7
2.1 Sematic Web Technology 7
2.1.1 RDF 8
2.1.2 SPARQL 9
2.1.3 TripleStore 11
2.2 Cloud Technology 12
2.2.1 MapReduce Programming Model 12
2.2.2 Hadoop Framework 12
2.2.3 Apache HBase 13
Chapter 3: Related Work 15
3.1 Research Topics in Semantic Web 15
3.2 Join Algorithms in MapReduce 17
Chapter 4: System Design 19
4.1 System Overview 19
4.2 Design Issue of Map-Phase-Multi Join Algorithm 21
4.2.1 Execution Process 21
4.3 Table Schema Design 23
4.4 Query Analyzer Design 27
Chapter 5: Implementation 28
5.1 Data Preprocessing 28
5.1.1 HBase Bulkload 28
5.2 Query Processing Engine 30
5.2.1 Query Analyzer 30
5.2.2 Map-Phase-Multi join 33
Chapter 6: Performance Evaluation 35
6.1 Environments 35
6.2 Evaluation with LUBM Queries 36
Chapter 7: Conclusion and Future Work 43
Reference 45
參考文獻 [1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107-113, 2008.
[2] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, et al., "A comparison of approaches to large-scale data analysis," in Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, 2009, pp. 165-178.
[3] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, pp. 72-77, 2010.
[4] H. Choi, J. Son, Y. Cho, M. K. Sung, and Y. D. Chung, "SPIDER: a system for scalable, parallel/distributed evaluation of large-scale RDF data," in Proceedings of the 18th ACM conference on Information and knowledge management, 2009, pp. 2087-2088.
[5] N. Soule, "Efficient SPARQL Query Processing via Map-Reduce-Merge."
[6] D. Jiang, A. K. Tung, and G. Chen, "Map-join-reduce: Toward scalable and efficient data analysis on large clusters," Knowledge and Data Engineering, IEEE Transactions on, vol. 23, pp. 1299-1311, 2011.
[7] M. Husain, J. McGlothlin, M. M. Masud, L. Khan, and B. M. Thuraisingham, "Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing," Knowledge and Data Engineering, IEEE Transactions on, vol. 23, pp. 1312-1327, 2011.
[8] M. F. Husain, L. Khan, M. Kantarcioglu, and B. Thuraisingham, "Data intensive query processing for large RDF graphs using cloud computing tools," in Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on, 2010, pp. 1-10.
[9] J. Sun and Q. Jin, "Scalable RDF store based on HBase and MapReduce," in Advanced Computer Theory and Engineering (ICACTE), 2010 3rd International Conference on, 2010, pp. V1-633-V1-636.
[10] S. Lynden, Y. Tanimura, I. Kojima, and A. Matono, "Dynamic data redistribution for MapReduce joins," in Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on, 2011, pp. 717-723.
[11] F. N. Afrati and J. D. Ullman, "Optimizing joins in a map-reduce environment," in Proceedings of the 13th International Conference on Extending Database Technology, 2010, pp. 99-110.
[12] N. Papailiou, I. Konstantinou, D. Tsoumakos, and N. Koziris, "H2RDF: adaptive query processing on RDF data in the cloud," in Proceedings of the 21st international conference companion on World Wide Web, 2012, pp. 397-400.
[13] S. Harris, N. Lamb, and N. Shadbolt, "4store: The design and implementation of a clustered RDF store," in 5th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2009), 2009, pp. 94-109.
[14] J. Broekstra, A. Kampman, and F. Van Harmelen, "Sesame: A generic architecture for storing and querying rdf and rdf schema," in The Semantic Web—ISWC 2002, ed: Springer, 2002, pp. 54-68.
[15] A. Kiryakov, D. Ognyanov, and D. Manov, "OWLIM–a pragmatic semantic repository for OWL," in Web Information Systems Engineering–WISE 2005 Workshops, 2005, pp. 182-192.
[16] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, et al., "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems (TOCS), vol. 26, p. 4, 2008.
[17] J. Urbani, S. Kotoulas, E. Oren, and F. Van Harmelen, "Scalable distributed reasoning using mapreduce," in The Semantic Web-ISWC 2009, ed: Springer, 2009, pp. 634-649.
[18] J. Urbani, S. Kotoulas, J. Maassen, F. Van Harmelen, and H. Bal, "WebPIE: A Web-scale parallel inference engine using MapReduce," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 10, pp. 59-75, 2012.
[19] M. Przyjaciel-Zablocki, A. Schätzle, T. Hornung, C. Dorner, and G. Lausen, "Cascading map-side joins over HBase for scalable join processing," arXiv preprint arXiv:1206.6293, 2012.
[20] J. Myung, J. Yeon, and S.-g. Lee, "SPARQL basic graph pattern processing with iterative MapReduce," in Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, 2010, p. 6.
[21] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds, "SPARQL basic graph pattern optimization using selectivity estimation," in Proceedings of the 17th international conference on World Wide Web, 2008, pp. 595-604.
[22] Y. Guo, Z. Pan, and J. Heflin, "LUBM: A benchmark for OWL knowledge base systems," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 3, pp. 158-182, 2005.
論文全文使用權限
  • 同意授權校內瀏覽/列印電子全文服務,於2015-08-30起公開。
  • 同意授權校外瀏覽/列印電子全文服務,於2015-08-30起公開。


  • 如您有疑問,請聯絡圖書館
    聯絡電話:(06)2757575#65773
    聯絡E-mail:etds@email.ncku.edu.tw