Mining and Searching Text with Graph Databases

This article discusses how natural language processing (NLP) techniques are used to handle huge amounts of data and convert them into useful sources of information for further processing.

Information Extraction is defined as the process of analysing text and identifying the relationships between semantic entities. These relationships can be stored in a database and accessed through Cypher queries. The article also introduces GraphAware NLP, a relatively new tool for managing textual data.
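As a minimal sketch of how an extracted relationship could end up in a graph database, the snippet below turns one (subject, relation, object) triple into a parameterized Cypher MERGE statement. The `Entity` label, the `name` property and the helper name `triple_to_cypher` are assumptions made for illustration; they are not taken from the article or from GraphAware NLP.

```python
import re


def triple_to_cypher(subject, relation, obj):
    """Build a parameterized Cypher MERGE statement for one extracted triple.

    The `Entity` label and `name` property are illustrative assumptions.
    """
    # Relationship types cannot be passed as Cypher parameters, so the
    # relation is normalized to letters, digits and underscores and then
    # placed (backtick-quoted) directly in the query text.
    rel_type = re.sub(r"[^A-Za-z0-9]+", "_", relation).strip("_").upper()
    if not rel_type:
        raise ValueError("relation must contain at least one letter or digit")
    query = (
        "MERGE (a:Entity {name: $subject}) "
        "MERGE (b:Entity {name: $object}) "
        f"MERGE (a)-[:`{rel_type}`]->(b)"
    )
    # Entity names travel as parameters, never as part of the query string.
    return query, {"subject": subject, "object": obj}


query, params = triple_to_cypher("Alice", "works for", "Acme Corp")
print(query)
print(params)
```

The returned query and parameter dictionary could then be passed to whichever Neo4j client is in use; keeping the entity names as parameters avoids quoting problems in the extracted text.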

Learning to Speed Up Query Planning in Graph Databases

This paper is not directly related to our project; however, it aims to improve the query response times of graph databases. Since we will use the Neo4j graph database and do not have direct access to its source code, we depend on the way Neo4j plans the computational steps of query processing. Even so, the paper’s focus is relevant for making query execution more efficient, especially on large graphs. Because querying large graphs can be very slow, the paper presents a framework that is applicable to a large class of queries and claims a significant improvement in the STAR graph query reasoner.

Why Panama Papers Journalists Use Graph Databases

This Forbes magazine article briefly explains why the ICIJ used graph databases to organize the huge Panama Papers leak. It notes that relational databases are poorly suited to managing lengthy documents and the complex relationships that connect them. By contrast, the article states that graph databases are a better fit for applications that need to find related documents quickly, especially when only a few documents are needed at a time. Graph databases therefore made it easier to represent and analyze the Panama Papers data, especially for journalists working from all over the world. Although it is not related to the implementation of our project, the article is informative about the kinds of data for which graph databases are useful.

Extending Neo4j with Extensions

This website shows how the capabilities of Cypher (Neo4j’s query language) can be extended by writing extensions. Since we aim to write an extension that makes additional graph and clustering algorithms available to queries in Neo4j, this is a useful guide for our purpose.

The guide shows:

  • how to write a procedure
  • how to write integration tests for procedures
  • how to write an unmanaged extension
  • how to use Cypher in the extension
  • how to test an extension

The extensions are written in Java. We aim to study this guide and start writing our own extension soon.

Presentation

This is the presentation that we gave in class as part of the coursework.

Natural Language Processing with Graphs

Graph databases are a type of NoSQL database that uses a graph data model and can support a variety of natural language processing techniques.

This video provides an overview of graph databases, followed by a survey of the role for graph databases in natural language processing tasks, including:

  • modeling text as a graph
  • mining word associations from a text corpus using a graph data model and
  • mining opinions from a corpus of product reviews.

It concludes with a demonstration of how graphs can enable content recommendation based on keyword extraction.

This video helps us learn how text can be represented as a graph and how NLP-related operations can be performed on that graph.
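To make the idea of modeling text as a graph concrete, here is a small sketch that builds a word co-occurrence graph and reads word associations from it. This is only one simple way to do it and is not necessarily the approach shown in the video; the function names, the window size and the sample sentence are assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def cooccurrence_graph(text, window=2):
    """Return {word: {neighbor: count}} linking words that appear within
    `window` positions of each other in the same sentence."""
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i, word in enumerate(words):
            # Look only ahead; each pair is recorded in both directions.
            for other in words[i + 1 : i + 1 + window]:
                if other != word:
                    graph[word][other] += 1
                    graph[other][word] += 1
    return graph


def top_associations(graph, word, n=3):
    """Return the n neighbors that co-occur most often with `word`."""
    neighbors = graph.get(word, {})
    return sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


text = "Graph databases store nodes. Graph databases store relationships."
graph = cooccurrence_graph(text)
print(top_associations(graph, "graph"))  # [('databases', 2), ('store', 2)]
```

Each word becomes a node and each co-occurrence count becomes a weighted edge, so the same structure could be loaded into a graph database with words as nodes and counts as relationship properties.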

Analyzing Graph Databases by Aggregate Queries

This paper proposes several extensions to a previously proposed data model and query language (BiQL). The main elements of these extensions are aggregates, rankings and path expressions, which make it possible to calculate well-known network statistics (such as centrality measures), to transform networks for the application of more advanced mining algorithms, and to compute more complex probabilistic measures (such as connection probabilities).

This paper gives us ideas on how to enhance our queries so that we can explore the ICIJ Offshore Database in more detail. The extensions that the paper proposes for BiQL can be carried over to Cypher, Neo4j’s query language, and then used to write more complex queries.
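As an illustration of the kind of statistic that aggregates and rankings produce, the sketch below computes normalized degree centrality over an edge list and ranks the nodes by it. It is plain Python rather than BiQL or Cypher syntax, and the node names are hypothetical placeholders, not records from the ICIJ data.

```python
def degree_centrality(edges):
    """Normalized degree centrality for an undirected graph.

    `edges` is an iterable of (node, node) pairs. Each node's score is its
    number of distinct neighbors divided by (number of nodes - 1).
    """
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # self-loops are ignored in this sketch
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def rank(scores, k):
    """Return the k highest-scoring nodes, ties broken by name."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical officer-to-entity links, for illustration only.
edges = [
    ("officer_a", "entity_1"),
    ("officer_a", "entity_2"),
    ("officer_a", "entity_3"),
    ("officer_b", "entity_2"),
]
scores = degree_centrality(edges)
print(rank(scores, 2))  # [('officer_a', 0.75), ('entity_2', 0.5)]
```

The aggregate step is the neighbor count per node and the ranking step is the sort; in a query language the same two steps would typically appear as a count aggregation followed by an ordering clause.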

ICIJ Offshore Leaks Database

This database contains information on almost 500,000 offshore entities that are part of the Panama Papers, the Offshore Leaks and the Bahamas Leaks investigations. The data covers nearly 40 years – from 1977 through to early 2016 – and links to people and companies in more than 200 countries and territories.

We will analyze this database during the semester as our course project. The data is available as a Neo4j graph database, which makes it easier to analyze. We will apply known clustering and graph algorithms to this graph database to infer relationships (possibly ones not noticed so far) between people, companies and countries. Since Neo4j does not provide some of these algorithms, we plan to implement them as Neo4j extensions when needed.