Workshop on Big Graph Analysis Systems – University of Copenhagen

Forward this page to a friend Resize Print kalender-ikon Bookmark and Share

Department of Computer Science DIKU > Event Calendar 2017 > Workshop on Big Graph ...

There is an increasing amount of data that takes the form of complex graphs in various applications, such as social network, linked data, telecommunication networks, chemistry, life science, etc. The analysis of such data should not only focus on the attributes attached to the nodes or edges, but also on the way how the nodes are interconnected.

To address the challenges of the analysis of big graph data and develop supporting system technologies for data scientists, it requires joint efforts from various research areas, including but not limited to graph data management, linked data management, datalog, high-performance computing, distributed/parallel systems, etc.

The purpose of this workshop is to get together experts from several relevant communities that are actively addressing the problems related to big graph data analysis to identify commonalities and synergies and to stimulate cross-collaboration that can be expected but yet to be explored.

The workshop will host invited talks and will feature plenty of free time for discussions.

Organizers

  • Yongluan Zhou, University of Copenhagen
  • Amol Deshpande, University of Maryland at College Park
  • Marcos António Vaz Salles, University of Copenhagen

Workshop Program (Tentative)

21 August

09.00 - 09.15 Welcome with a short introduction to the university and the department
09.15 - 10.00 Yannis Papakonstantinou, University of California San Diego

The GQ-Fast accelerator for queries over typed graphs


Abstract:


The GQ-Fast system processes (potentially extended) SQL queries over relational data. It is specialized for the case where the relational data capture an encoding of an annotated graph and the queries correspond to graph analytics. In the current version, these queries involve aggregation, join, semijoin, intersection and selection thus presenting a wide superset of fixed-length graph reachability queries and of tree pattern queries. Extensions to variable-length reachability are in process.

We present challenging real-world interactive (OLAP) graph analytics scenarios, and show that row stores, column stores and graph databases are unacceptably slow in such OLAP scenarios. GQ-Fast's internal structure corresponds to efficient encoding of annotated adjacency lists that combines salient features of column-based organization, indexing and compression. GQ-Fast uses a bottom-up fully pipelined query execution model, which enables (a) aggressive compression (e.g., compressed bitmaps and Huffman) and (b) avoids intermediate results that consist of row IDs (which are typical in column databases). GQ-Fast compiles query plans into executable C++ source code. Besides achieving runtime efficiency, GQ-Fast also reduces main memory requirements because, unlike column databases, GQ-Fast selectively allows dense forms of compression including heavy-weight compressions, which do not support random access.
We used GQ-Fast to accelerate queries for two OLAP dashboards in the biomedical field. GQ-Fast outperforms PostgreSQL by 2–4 orders of magnitude and MonetDB, Vertica and Neo4j by 1–3 orders of magnitude when all of them are running on RAM. We discuss the origins of these performance benefits.

Finally, we present an under-development extension where the GQ-Fast memory-resident database/processor is in charge of the computationally demanding part of processing graph analytics, while a polystore accesses multiple sources, mirrors in GQ-Fast the graph structure and decomposes graph analytics queries into parts executed at GQ-Fast and parts executed at the original sources. 


Speaker:


Yannis Papakonstantinou is a Professor of Computer Science and Engineering at the University of California, San Diego. His research is in the intersection of data management technologies and the web, where he has published over one hundred research articles that have received more than 14,000 citations, according to Google Scholar. A common theme of his research is the extension of database platforms and query processors beyond centralized relational databases and into semistructured & graph databases, integrated views of distributed databases and web services, textual data and queries involving keyword search, and most recently spatiotemporal sensor data. He has given multiple tutorials and invited talks, has served on journal editorial boards and has chaired and participated in program committees for many international conferences and workshops. He is a co-director and teaches for UCSD's Master of Advanced Studies in Data Science.

Yannis enjoys to commercialize his research and to inform his research accordingly. He was the CEO and Chief Scientist of Enosys Software, which built and commercialized an early Enterprise Information Integration platform for structured and semistructured data. The Enosys Software was OEM'd and sold under the BEA Liquid Data and BEA Aqualogic brand names, eventually acquired in 2003 by BEA Systems. He has also consulted for Amazon Web Services and multiple startups.

Yannis holds a Diploma of Electrical Engineering from the National Technical University of Athens, MS and Ph.D. in Computer Science from Stanford University (1997) and an NSF CAREER award for his work on data integration.

10.00 - 10.45 Amol Deshpande, University of Maryland

GraphGen: Adaptive Graph Extraction and Analytics over Relational Databases


Abstract:


Graph querying and analytics are becoming an increasingly important component of the arsenal of tools for extracting different kinds of insights from data. However, graphs are not the primary representation choice for most data today, and users who want to employ graph analytics are forced to extract data from their data stores, construct the requisite graphs, and then use a specialized engine to write and execute their graph analysis tasks. This cumbersome and costly process not only raises barriers in using graph analytics, but also makes it hard to explore and identify hidden or implicit graphs in the data.

In this talk, I will present our ongoing work on an end-to-end graph analysis framework, called GraphGen, that sits atop an RDBMS and enables users to declaratively specify graph extraction tasks, visually explore the extracted graphs, and write and execute graph algorithms over them, either directly or using existing graph libraries like the widely used NetworkX Python library. GraphGen has a fundamentally different goal from recent work on using relational databases to store graph data through "shredding". Instead, GraphGen is intended to analyze graphs that are present in existing relational databases. GraphGen attempts to utilize the underlying relational database to the full extent possible by pushing down computation, uses a novel condensed representation to handle graphs that may be too large to extract in their entirety, allows writing programs using a general subgraph-centric API, and features several optimizations for efficient extraction and querying of large graphs.


Speaker:


Amol Deshpande is a Professor in the Department of Computer Science at the University of Maryland with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received his Ph.D. from University of California at Berkeley in 2004. His research interests include uncertain data management, adaptive query processing, data streams, graph analytics, and sensor networks. He is a recipient of an NSF Career award, and has received best paper awards at the VLDB 2004, EWSN 2008, and VLDB 2009 conferences.

10.45 - 11.15 Coffee break
11.15 - 12.00 Dan Olteanu, University of Oxford

TBA

Abstract:


Speaker:


Dan Olteanu is a Computer Science professor at Oxford and a computer scientist at LogicBlox. He has also taught at the universities of California Berkeley, Munich, Saarland, and Heidelberg. He received his PhD in Computer Science from University of Munich in 2005. His research interests are in databases and adjacent areas. Dan contributed to XML query processing, incomplete information and probabilistic databases, and more recently to factorized databases, in-database analytics, and the LogicBlox commercial system. He co-authored the book "Probabilistic Databases" (2011). He has served as associate editor for PVLDB'13 and IEEE TKDE, track chair for ICDE'15, group leader for SIGMOD'15, and vice chair for SIGMOD'17. His current research is supported by an ERC consolidator grant and awards from Google, LogicBlox, and Ordnance Survey.

12.00 - 12.15 Pimin Konstantin Kefaloukos (Industrial Talk)

Reverse engineering a social network from web logs


Abstract:


Social networks contain a lot of information that is valuable for advertisers, such as the products purchased, interests, family relations and socio-demographic brackets of the network users. But what do you do if you don't own a social network? Every day advertising networks send billions of web requests to AudienceProject servers, where an individual log lines may contains information about device identifiers, URLs, IP addresses and more. This talks shows how AudienceProject has used novel and existing graph algorithms to reverse engineered a vast social and knowledge graph from web logs and used it to aid adversiters and media agencies. The focus of the talk will be on graph data in the advertisement world as well as some of the graph algorithms innovated or utilised by AudienceProject in this domain.


Speaker:


Pimin Konstantin Kefaloukos works in the ad-tech industry for AudienceProject in Copenhagen. He has a Ph.D. in Computer Science from the University of Copenhagen.







12.15 - 13.30 Lunch
13.30 - 14.15 Hai Jin, Huazhong University of Science and Technology

TBA


Abstract:


TBA


Speaker:


Hai Jin is a Cheung Kung Scholars Chair Professor of computer science and engineering at Huazhong University of Science and Technology (HUST) in China. Jin received his PhD in computer engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz in Germany. Jin worked at The University of Hong Kong between 1998 and 2000, and as a visiting scholar at the University of Southern California between 1999 and 2000. He was awarded Excellent Youth Award from the National Science Foundation of China in 2001. Jin is the chief scientist of ChinaGrid, the largest grid computing project in China, and the chief scientists of National 973 Basic Research Program Project of Virtualization Technology of Computing System, and Cloud Security.

Jin is a Fellow of CCF, senior member of the IEEE and a member of the ACM. He has co-authored 22 books and published over 800 research papers. His research interests include computer architecture, virtualization technology, cluster computing and cloud computing, peer-to-peer computing, network storage, and network security.

14.15 - 15.00 Hassan Chafi, Oracle Labs

Large Scale Graph Processing Opportunities and Challenges, an Industry Perspective


Abstract:


This talk will present Oracle Labs’ ongoing work in the area of large scale graph processing. We will present our approach to the problem both from a research and product perspective. In particular, we will highlight some of the unique aspects of our approach which relies on the adoption of high level Domain-Specific Languages (DSLs) to express both algorithms and queries against graph datasets. Next, we will discuss our approach to distributed graph computation which leverages and adapts existing state of the art techniques as well as some of our own novel ideas. We will highlight in particular, recent work to support distributed graph querying using an asynchronous graph pattern matching engine.The final part of the talk will present some our current streams of research and our perspective on some of the more practical challenges of large scale graph processing.


Speaker:


Hassan Chafi is Sr. Director of Research and Advanced Development at Oracle Labs where he currently leads various projects. His research investigates high-performance, parallel, in-memory Graph Analytics and using domain specific languages (DSLs) to simplify parallel programming. A second major project focuses on enabling extensibility within Oracle’s various data management solutions, including the Oracle Database. Dr. Chafi received his PhD from Stanford University. His thesis work at Stanford focused on building a Domain Specific Language Infrastructure, Delite. He was advised by Dr. Kunle Olukotun. Prior to that, Hassan worked in the area of hardware transactional memory as part of the Transactional Coherence and Consistency (TCC) project at Stanford where he developed a scalable extension to the original TCC protocol.

15.00 - 15.30 Coffee break
15.30 - 16.15 Jun Huan, University of Kansas

Investigating feed-forward Deep Neural Networks for Modeling Graph-based High Throughput Chemical Bioactivity Data


Abstract:


In recent years, research in Artificial Neural Networks (ANNs) has resurged, now under the Deep-Learning umbrella, and grown extremely popular due to major breakthroughs in methodological and computing capabilities. Deep-Learning methods are part of representation-learning algorithms that attempt to extract and organize discriminative information from the data. Recently reported success of DL techniques in crowd-sourced quantitative structure activity relationship modeling and predictive toxicology competitions has showcased these methods as powerful tools for cheminformatics. Nevertheless, reported applications of Deep Learning techniques for modeling complex bioactivity data for small molecules remain limited.

In this talk I will present our recent work on optimizing feed-forward Deep Neural Nets (DNNs) hyper-parameters and performance evaluation of these methods as compared to shallow methods. In our study 48 DNNs, 24 Random Forest, 20 SVM and 6 Naïve Bayes arbitrary but reasonably selected configurations were compared employing 7 diverse bioactivity datasets assembled from ChEMBL repository combined with circular fingerprints as molecular descriptors. The non-parametric Wilcoxon paired singed-rank test was employed to compare the performance of DNN to RF, SVM and NB. Overall it was found that DNNs with 2 hidden layers, 2,000 neurons per each hidden layer, ReLU activation function and Dropout regularization technique achieved strong classification performance across all tested datasets. Our results demonstrate that DNNs are powerful modeling techniques for modeling complex bioactivity data.


Speaker:


Dr. Jun (Luke) Huan is a professor in the Department of Electrical Engineering and Computer Science at the University of Kansas. He directs the Data Science and Computational Life Sciences Laboratory at KU. Dr. Huan holds courtesy appointments at the KU Bioinformatics Center and the KU Bioengineering Program.

Dr. Huan is an internationally recognized investigator in data science. He has published more than 120 peer-reviewed papers in leading conferences and journals and has graduated more than 10 graduate students including seven Ph.D.s. He was a recipient of the National Science Foundation Faculty Early Career Development Award in 2009. His group won the Best Student Paper Award at the IEEE International Conference on Data Mining in 2011 and the Best Paper Award (runner-up) at the ACM International Conference on Information and Knowledge Management in 2009. His work appeared at mass media including Science Daily, R&D magazine, and EurekAlert. Dr. Huan’s editorial memberships have included Springer Journal of Big Data, Elsevier Journal of Big Data Research, and the International Journal of Data Mining and Bioinformatics among others. He regularly serves the program committee of top-tier international conferences on Machine Learning, Data Mining, Big Data, and Bioinformatics.

Since 2015 Dr. Huan has served as a Program Director in the Information and Intelligent Systems division at the US National Science Foundation. At NSF he manages programs such as IIS core, Big Data, and Partnerships for International Research and Education.

16.15 - 17.00 George Fletcher, Eindhoven University of Technology

Schema-driven generation of synthetic graphs and queries with gMark


Abstract:


Massive graph data sets are pervasive in contemporary application domains. Hence, graph database systems are becoming increasingly important. In the experimental study of these systems, it is vital that the research community has shared solutions for the generation of database instances and query workloads having predictable and controllable properties. In this talk, we present the design and engineering principles of gMark, a domain- and query language-independent graph instance and query workload generator. A core contribution of gMark is its ability to target and control the diversity of properties of both the generated instances and the generated workloads coupled to these instances. Further novelties include support for regular path queries, a fundamental graph query paradigm, and schema-driven selectivity estimation of queries, a key feature in controlling workload chokepoints. We illustrate the flexibility and practical usability of gMark by showcasing the framework's capabilities in generating high quality graphs and workloads, and its ability to encode user-defined schemas across a variety of application domains.

This is joint work with colleagues at CNRS Lyon (France), INRIA Lille (France), Université Clermont Auvergne (France), and TU Eindhoven (Netherlands).


Speaker:


George Fletcher (PhD, Indiana University Bloomington) is an associate professor of computer science at Eindhoven University of Technology. His research interests span query language design and engineering, foundations of databases, and data integration. His current focus is on management of massive graphs such as social networks and linked open data. He is a member of the LDBC Graph Query Language Standardization Task Force.

18.00 Conference dinner

22 August

09.00 - 09.45 Semih Salihoglu, University of Waterloo

Surprises in a Survey of the Graphs, Computations, and Software Used in Practice


Abstract:


This is a two-part talk on two projects at University of Waterloo. The main part of the talk will be about
an online user survey we conducted in April 2017 among industry and research users of a wide-spectrum of graph data processing technologies, including graph databases, RDF engines, distributed graph processing systems, libraries for graph processing, graph visualization software, and the Gremlin query language. The survey was intended to understand three high-level questions: (1) What kinds of graph data do participants have? (2) What kind of computations do participants do on their graphs? and (3) What kind of software do participants use to perform these computations. We also asked the participants about their major challenges when processing their graphs. An analysis of the 89 participants revealed very surprising facts about these questions, which we will share during the talk.

A second (and small) part of the talk will be about a new prototype active graph database called Graphflow that we are building at University of Waterloo. Unlike existing passive graph databases, Graphflow is an active system that supports triggers based on continuous shortest paths and subgraph queries. For example, users can register triggers asking Graphflow to perform an action each time a particular subgraph emerges or gets deleted or the shortest paths between two nodes change. We will give a very high-level overview of the system and the techniques we use to support these triggers.


Speaker:


Semih Salihoglu is an assistant professor at University of Waterloo's Cheriton School of Computer Science. He is member of the Data Systems Research Group. Previously, he was a PhD student at Stanford University, advised by Jennifer Widom. His current research focuses graph databases and distributed graph processing engines.

09.45 - 10.30 Milos Nikolic, University of Oxford

A Unified Approach to Incremental View Maintenance of In-Database Analytics


Abstract:


In this talk, we study the problem of incremental computation of analytical tasks over joins. We introduce a principled incremental view maintenance (IVM) mechanism that reduces the maintenance of the given task to the maintenance of a hierarchy of increasingly simpler views. The views are functions mapping keys, which are tuples of input data values, to payloads, which are elements from a task-specific ring. The computation over the keys is the same for all tasks, while the computation over the payloads depends on the task. Our IVM approach achieves efficiency by factorizing the computation of the keys, payloads, and updates. 
We demonstrate our technique on a variety of analytical tasks, such as gradient computation for learning regression models, linear algebra operations, and factorized evaluation of conjunctive queries, and experimentally show that it can achieve orders of magnitude better performance and lower memory consumption than existing state-of-the-art IVM techniques.

Speaker:


Milos Nikolic is a departmental lecturer in the Department of Computer Science at the University of Oxford. His research focuses on the design and implementation of data-intensive systems. His work studies the incremental computation of complex analytical queries, such as database queries and machine learning models, in local and distributed streaming environments using novel approaches to query optimization and compilation. He received a Ph.D. in Computer Science from EPFL.

10.30 - 10.45 Tim D. Ward (Industrial Talk)

How to build a decision engine with Neo4j and Machine Learning Techniques


Abstract:


There is no doubt that machine learning and graphs go hand in hand — and Neo4j makes it possible to take machine learning to the next level. In this presentation, we'll share how CluedIn built Neo4j into a decision engine that turns company data into actionable insights. By using machine learning techniques and the graph as a decision tree, we were able to achieve amazing precision in merging and identifying insights in the enterprise. With practical demos and techniques, viewers will leave this presentation with new and effective ways of working with Neo4j.


Speaker:


Tim is an Engineer at CluedIn where they specialize in combining Neo4j and other database technologies with Machine Learning. He has been using the Graph for over 5 years and the team at CluedIn are currently running a Neo4j Graph Database with over 180 Million nodes in production.



10.45 - 11.15 Coffee break
11.15 - 12.00 Ioana Manolescu, INRIA Saclay

TBA

TBA

12.00 - 12.45 Yongluan Zhou, University of Copenhagen

Parallel SPARQL Query Optimization


Abstract:


Existing parallel SPARQL query optimizers assume hash-based data partitioning and adopt plan enumeration algorithms with unnecessarily high complexity. Therefore, they cannot easily accommodate other partitioning methods and only consider an unnecessarily limited plan space. To address these problems, we first define a generic RDF data partitioning model to capture the common structure of various state-of-the-art RDF data partitioning methods. Then we propose a query plan enumeration algorithm that not only has an optimal efficiency, but also accommodates different data partitioning methods. Furthermore, based on a solid analysis of the complexity of the plan enumeration algorithm, we propose two new heuristic methods that can consider a much larger plan space than the existing methods, and at the same time can still confine the search space of the algorithm. An autonomous approach is proposed to choose one of the two methods by considering the structure and the size of a complex SPARQL query. We conduct extensive experiments using synthetic and a real-world dataset, which show the superiority of our algorithms in comparing to existing ones.


Speaker:


Yongluan is an Associate Professor in the Department of Computer Science at the University of Copenhagen. Before joining KU, he has been an Associate Professor at the University of Southern Denmark and a postdoc at Ecole Polytechnique Fédérale de Lausanne (EPFL). He obtained his PhD in Computer Science at the National University of Singapore. He also held visiting positions at several international institutes, including the University of Maryland at College Park, the Advanced Digital Sciences Center at the University of Illinois at Urbana-Champaign and INRIA Rennes. His research interests span database systems and distributed systems. His recent research focuses on scalable streaming data processing and graph data processing. He has published over 50 research articles in international journals and conference proceedings.

12.45 - 14.00 Lunch
14.00 - 14.45 Nikolay Yakovets, Eindhoven University of Technology

Optimization of Regular Path Queries in Graph Databases


Abstract:


Regular path queries offer a powerful navigational mechanism in graph databases. Recently, there has been renewed interest in such queries in the context of the Se- mantic Web. The extension of SPARQL in version 1.1 with property paths offers a type of regular path query for RDF graph databases. While eminently useful, such queries are difficult to optimize and evaluate efficiently, however. We design and implement a cost-based optimizer we call Waveguide for SPARQL queries with property paths. Waveguide builds a query plan—which we call a waveplan (WP)—which guides the query evaluation. There are numerous choices in the construction of a plan, and a number of optimization methods, so the space of plans for a query can be quite large. Execution costs of plans for the same query can vary by orders of magnitude with the best plan often offering excellent performance. A WP’s costs can be estimated, which opens the way to cost-based optimization. We demonstrate that Waveguide properly subsumes existing techniques and that the new plans it adds are relevant. We analyze the effective plan space which is enabled by Waveguide and design an efficient enumerator for it. We implement a prototype of a Waveguide cost-based optimizer on top of an open-source relational RDF store. Finally, we perform a comprehensive performance study of the state of the art for evaluation of SPARQL property paths and demonstrate the significant performance gains that Waveguide offers.


Speaker:


Nikolay Yakovets is an assistant professor of computer science at Eindhoven University of Technology. He obtained his PhD from Lassonde School of Engineering at York University in 2016. He worked on various database topics at IBM CAS Canada and Empress Software Canada. His current focus is on design and implementation of core database technologies, management of massive graph data, and efficient processing of queries on graphs.

15.00 - 17.00 Discussions and walking tour of the city

Venue

Room 4-0-02, Biocenter 
University of Copenhagen
Ole Maaløes Vej 5
2200 Copenhagen N