
Introduction
Graphs are everywhere, used by everyone, for everything. Neo4j is one of the most popular graph databases and can be used to make recommendations, model social networks, find paths, uncover fraud, manage networks, and so on. A graph database can store any kind of data using nodes (graph data records), relationships (which connect nodes), and properties (named data values).
A graph database is well suited to highly connected data, which is hard to model and query in relational or other NoSQL databases because they lack first-class relationships and efficient multi-depth traversals. Graph databases embrace relationships, which naturally form paths, and querying or traversing the graph means following those paths. Because the data model is fundamentally path-oriented, most path-based graph operations align closely with the way the data is laid out, which makes them very efficient.
Use Case
This use case is based on a modified version of the StackOverflow dataset. It shows a network of programming languages, the questions that refer to them, and the users who asked and answered those questions. Connecting these nodes with relationships in a Neo4j graph database reveals insights that are hard to obtain from a relational or other NoSQL database.
What we want to do:
- Prerequisites
- Download StackOverflow Dataset
- Data Manipulation with R
- Create Nodes & Relationships file with Java
- Create GraphDB with BatchImporter
- Visualize Graph with Neo4J
Solution
Prerequisites
- Download and Install Neo4j: We will be using Neo4j 2.x, which is easy to install on Windows. Follow the instructions at the link below to download and install it.
Note: Neo4j 2.x requires JDK 1.7 and above.
http://www.neo4j.org/download/windows
- Download and Install RStudio: We will use R to perform some data manipulation on the StackOverflow dataset, which is available in RData format. This includes filtering rows and altering and dropping columns. R is used here to show its power for data manipulation, but the same can be done in other programming languages as well. Download the open source edition of RStudio from the link below.
http://www.rstudio.com/products/rstudio/#Desk
Download StackOverflow Dataset
- Download Dataset: This use case is based on a modified version of the StackOverflow dataset, which is rather old and is available in both CSV and RData formats. Use the links below to download it. The first link describes the various fields and the second downloads the RData file.
http://www.ics.uci.edu/~duboisc/StackOverflow
http://www.ics.uci.edu/~duboisc/StackOverflow/answers.Rdata
- Understanding Dataset:
We will be mostly interested in the following fields which will be used to create nodes and relationships in Neo4j.
Field | Description
qid | Unique question id
i | User id of questioner
qs | Score of the question
tags | Comma-separated list of the tags associated with the question; the tags refer to programming languages
qvc | Number of views of this question
aid | Unique answer id
j | User id of answerer
as | Score of the answer
Data Manipulation with R
We will reshape the dataset to fit our needs and appreciate the power of data manipulation with R. The original RData contains 263,540 rows, but this use case applies the following manipulations to keep it small and interesting.
- Open RStudio and Set Working Directory: Open RStudio and set the working directory to where the RData file was downloaded.
- Load and Perform Data Manipulation:
    # Load the answers.Rdata file that was downloaded
    load("answers.Rdata")

    # The data is available in the "data" object; take a quick look with head()
    head(data)

    # Load the stringr library for string manipulation
    library(stringr)

    # Create a new column Match that is TRUE when the tags contain one of the
    # programming languages we are interested in. The pattern matches substrings,
    # so tags such as javascript and cakephp also match.
    data$Match <- str_detect(string = data$tags, pattern = "(java|mysql|linux|python|django|php|jquery)")

    # Create a new column length holding the number of comma-separated tags.
    # str_split splits every row's tags; sapply applies length to each result.
    data$length <- sapply(str_split(data$tags, ","), length)

    # The data object now contains 2 new columns: Match and length
    head(data)

    # Number of rows in the data object
    nrow(data)  # This will show 263540 rows

    # Keep rows where Match is TRUE, there is exactly one tag, and both the
    # question and answer scores are greater than zero
    newdata <- subset(data, (Match == TRUE & length == 1 & qs > 0 & as > 0))

    # The row count drops significantly, to 1668
    nrow(newdata)

    # The first rows show that the tags column now holds a single programming language
    head(newdata)

    # Columns that are no longer needed (qt, at, Match, and length)
    drops <- c("qt", "at", "Match", "length")

    # finaldata does not contain the dropped columns
    finaldata <- newdata[, !(names(newdata) %in% drops)]
    head(finaldata)

    # Order finaldata by question id
    finaldata <- finaldata[order(finaldata$qid), ]

    # Write finaldata to a CSV file that will be used to create nodes and relationships
    write.csv(finaldata, "finaldata.csv", sep = ",", row.names = FALSE)
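Note: write.csv always writes comma-separated output, so it ignores the sep argument and prints a warning. The warning can be safely ignored.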
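The filtering rule used in the R subset call can also be expressed in Java, the language used in the next step. The following is only a minimal sketch: the class name TagFilter and the sample values are hypothetical and are not part of the original program. It keeps a row when the tags match the language pattern, there is exactly one tag, and both scores are positive.

```java
import java.util.regex.Pattern;

/** Hypothetical helper mirroring the R subset() condition; not part of the original program. */
final class TagFilter {
    // Same alternation as the R pattern. find() matches substrings,
    // so "javascript" and "cakephp" also pass (they contain "java" / "php").
    private static final Pattern LANGUAGES =
            Pattern.compile("(java|mysql|linux|python|django|php|jquery)");

    /** Number of comma-separated tags, like sapply(str_split(tags, ","), length). */
    static int tagCount(String tags) {
        return tags.split(",", -1).length;
    }

    /** True when a row with these values would survive the subset() call. */
    static boolean keep(String tags, int questionScore, int answerScore) {
        return LANGUAGES.matcher(tags).find()
                && tagCount(tags) == 1
                && questionScore > 0
                && answerScore > 0;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        System.out.println(keep("java", 3, 1));        // single matching tag
        System.out.println(keep("java,swing", 3, 1));  // two tags
        System.out.println(keep("javascript", 2, 5));  // substring match on "java"
        System.out.println(keep("ruby", 2, 5));        // no matching language
    }
}
```

The substring behaviour is the reason tags such as javascript and cakephp show up in the language nodes later on.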
Create Nodes and Relationship file with Java
We will write a Java program that takes the finaldata.csv file generated by the R program above and creates multiple node files and a single relationship file that contains the relations between the nodes. Our node and relationship structure is as follows:
Nodes: question_nodes, answer_nodes, user_nodes, lang_nodes
Relationships: The following are the relationships
    // One question refers to one programming language
    Question REFERS Language

    // One question can have multiple answers
    Question HAS_ANSWER Answer

    // One question is asked by one user
    Question ASKED_BY User

    // One answer is answered by one user
    Answer ANSWERED_BY User
- Details about Java Program: This Java program is self-explanatory and simply creates the node and relationship files in the CSV format needed by the Neo4j Batch Importer. A few things to keep in mind:
- The format of the Nodes file is as follows:
    // id is the actual id, string is the datatype of the id, and users is the name of the index that we want to create in Neo4j.
    // The first column should contain somename:datatype:index_name and may be followed by more node attributes, tab delimited.
    // This is the format that the Neo4j Batch Importer expects.
    id:string:users    attribute1    attribute2
    qid_123456    4 (views)    10 (score)
- The format of the Relationship file is as follows:
    // ids of the two nodes and the type of the relationship between them.
    // So, the question qid_797771 is ASKED_BY user uid_94691
    id:string:users    id:string:users    type
    qid_797771    uid_94691    ASKED_BY
    qid_887301    javascript    REFERS
    qid_607386    aid_608425    HAS_ANSWER
    qid_809735    uid_88631    ASKED_BY
    qid_954376    uid_117795    ASKED_BY
- lang_nodes is created manually because it is static. All other node and relationship files are generated programmatically.
    // lang_nodes.csv
    id:string:users    name
    java    Java
    mysql    MySQL
    linux    Linux
    python    Python
    django    Django
    php    PHP
    jquery    JQuery
    javascript    Javascript
    cakephp    CakePHP
- finaldata.csv is renamed to sodata.csv (optional)
- The dataset doesn’t come with the names of questioners and answerers, so we downloaded some fictional names and associated them with the user ids. This makes more sense when we view them in the Neo4j graphical interface. A file of around 1500 fictional names was created from http://homepage.net/name_generator/ and stored as “random_names.txt”.
    Edward MacDonald
    Nicholas Arnold
    Faith Lambert
    Peter White
    Trevor Campbell
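To make the name:datatype:index_name header convention more concrete, here is a small sketch that splits a header line into its parts. The HeaderField class is hypothetical and is not part of the Batch Importer, and treating a missing datatype as "string" is an assumption made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one header cell such as "id:string:users". */
final class HeaderField {
    final String name;   // property name, e.g. "id"
    final String type;   // declared datatype, e.g. "string"
    final String index;  // index name, or null when the column is not indexed

    HeaderField(String name, String type, String index) {
        this.name = name;
        this.type = type;
        this.index = index;
    }

    /** Splits "name:type:index"; a missing type is ASSUMED to mean "string". */
    static HeaderField parse(String cell) {
        String[] parts = cell.trim().split(":");
        String type = parts.length > 1 ? parts[1] : "string";
        String index = parts.length > 2 ? parts[2] : null;
        return new HeaderField(parts[0], type, index);
    }

    /** Parses a whole tab-delimited header line into its fields. */
    static List<HeaderField> parseHeader(String headerLine) {
        List<HeaderField> fields = new ArrayList<>();
        for (String cell : headerLine.split("\t")) {
            fields.add(parse(cell));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Header written by the question node file in this post.
        for (HeaderField f : parseHeader("id:string:users\tname\tviews\tscore")) {
            System.out.println(f.name + " type=" + f.type + " index=" + f.index);
        }
    }
}
```

Only the first column carries an index name here, which is how every node, whatever its kind, ends up in the single users index.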
- Java Program to Create Nodes & Relationships:
Note: The program below depends only on the OpenCSV library, which can be downloaded from http://sourceforge.net/projects/opencsv/
    package com.treselle.soagrapher;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;
    import java.util.Set;

    import au.com.bytecode.opencsv.CSVReader;

    public class NodeRelationCreator {

        private static final String QUESTION_NODE_FILE = "question_nodes.csv";
        private static final String USER_NODE_FILE = "user_nodes.csv";
        private static final String ANSWER_NODE_FILE = "answer_nodes.csv";
        private static final String RELATIONS_FILE = "rels.csv";
        private static final String INPUT_FILE = "sodata.csv";
        private static final String RANDOM_NAME_FILE = "random_names.txt";

        // stores question id as the key and views, score as map values
        private static Map<String, Map<String, String>> questions = new HashMap<String, Map<String, String>>();
        // stores unique userids of both questioner and answerer
        private static Set<String> users = new HashSet<String>();
        // stores random names from the file
        private static List<String> randomNames = new ArrayList<String>();
        // stores answerid as key and score as the map values
        private static Map<String, Map<String, String>> answers = new HashMap<String, Map<String, String>>();
        // stores various relations between nodes. The key is two nodes delimited by :: and the value is relation type
        private static Map<String, String> relsMap = new HashMap<String, String>();

        private void readFromCSV() throws Exception {
            // Read the comma-delimited CSV and skip the first (header) row
            CSVReader csvReader = new CSVReader(new FileReader(INPUT_FILE), ',', '\"', 1);
            String[] rows = null;
            String lang = null;
            String questionId = null;
            String question_user = null;
            String question_score = null;
            String question_views = null;
            String answerId = null;
            String answer_user = null;
            String answer_score = null;
            Map<String, String> questionAttrs = null;
            Map<String, String> answerAttrs = null;

            while ((rows = csvReader.readNext()) != null) {
                questionAttrs = new HashMap<String, String>();
                answerAttrs = new HashMap<String, String>();

                questionId = rows[0];
                question_user = rows[1];
                question_score = rows[2];
                lang = rows[3];
                question_views = rows[4];
                answerId = rows[6];
                answer_user = rows[7];
                answer_score = rows[8];

                questionAttrs.put("views", question_views);
                questionAttrs.put("score", question_score);
                questions.put("qid_" + questionId, questionAttrs);

                answerAttrs.put("score", answer_score);
                answers.put("aid_" + answerId, answerAttrs);

                users.add("uid_" + question_user);
                users.add("uid_" + answer_user);

                relsMap.put("qid_" + questionId + "::" + "aid_" + answerId, "HAS_ANSWER");
                relsMap.put("qid_" + questionId + "::" + "uid_" + question_user, "ASKED_BY");
                relsMap.put("aid_" + answerId + "::" + "uid_" + answer_user, "ANSWERED_BY");
                relsMap.put("qid_" + questionId + "::" + lang, "REFERS");
            }

            this.writeQuestionNodesFile();
            this.writeAnswersNodesFile();
            this.writeUsersNodesFile();
            this.writeRelationsFile();
            csvReader.close();
        }

        private void writeQuestionNodesFile() {
            try {
                FileWriter fos = new FileWriter(QUESTION_NODE_FILE);
                PrintWriter dos = new PrintWriter(fos);
                dos.println("id:string:users\tname\tviews\tscore");
                for (Entry<String, Map<String, String>> entry : questions.entrySet()) {
                    dos.print(entry.getKey());
                    Map<String, String> valueMap = entry.getValue();
                    // the question id doubles as the node name
                    dos.print("\t" + entry.getKey());
                    dos.print("\t" + valueMap.get("views"));
                    dos.print("\t" + valueMap.get("score"));
                    dos.println();
                }
                dos.close();
                fos.close();
            } catch (IOException e) {
                System.err.println("Error writeQuestionNodesFile File");
            }
        }

        private void writeAnswersNodesFile() {
            try {
                FileWriter fos = new FileWriter(ANSWER_NODE_FILE);
                PrintWriter dos = new PrintWriter(fos);
                dos.println("id:string:users\tname\tscore");
                for (Entry<String, Map<String, String>> entry : answers.entrySet()) {
                    dos.print(entry.getKey());
                    Map<String, String> valueMap = entry.getValue();
                    // the answer id doubles as the node name
                    dos.print("\t" + entry.getKey());
                    dos.print("\t" + valueMap.get("score"));
                    dos.println();
                }
                dos.close();
                fos.close();
            } catch (IOException e) {
                System.err.println("Error writeAnswersNodesFile File");
            }
        }

        private void writeUsersNodesFile() {
            try {
                FileWriter fos = new FileWriter(USER_NODE_FILE);
                PrintWriter dos = new PrintWriter(fos);
                dos.println("id:string:users\tname");
                // assign the next fictional name to each user; the names file must
                // contain at least as many names as there are unique users
                int count = 0;
                for (String user : users) {
                    dos.print(user);
                    dos.print("\t" + randomNames.get(count));
                    dos.println();
                    count++;
                }
                dos.close();
                fos.close();
            } catch (IOException e) {
                System.err.println("Error writeUsersNodesFile File");
            }
        }

        private void writeRelationsFile() {
            try {
                FileWriter fos = new FileWriter(RELATIONS_FILE);
                PrintWriter dos = new PrintWriter(fos);
                dos.println("id:string:users\tid:string:users\ttype");
                for (Map.Entry<String, String> entry : relsMap.entrySet()) {
                    String splitKeys[] = entry.getKey().split("::");
                    dos.print(splitKeys[0] + "\t");
                    dos.print(splitKeys[1] + "\t");
                    dos.println(entry.getValue());
                }
                dos.close();
                fos.close();
            } catch (IOException e) {
                System.err.println("Error writeRelationsFile File");
            }
        }

        private void readRandomNames() {
            try {
                BufferedReader in = new BufferedReader(new FileReader(RANDOM_NAME_FILE));
                String line = "";
                while ((line = in.readLine()) != null) {
                    randomNames.add(line);
                }
                in.close();
            } catch (IOException e) {
                System.err.println("Error readRandomNames File");
            }
        }

        public static void main(String[] args) {
            try {
                long start = System.currentTimeMillis();
                NodeRelationCreator nodeRelationCreator = new NodeRelationCreator();
                nodeRelationCreator.readRandomNames();
                nodeRelationCreator.readFromCSV();
                long end = System.currentTimeMillis();
                System.out.println("Done Processing in " + (end - start) + " ms");
            } catch (Exception e) {
                System.out.println("Exception in main is " + e.getMessage());
                e.printStackTrace();
            }
        }
    }
- Output of the Program:
Run the above program from the command line or within Eclipse to create question_nodes.csv, answer_nodes.csv, user_nodes.csv, and rels.csv. Click here to download the nodes and relationship zip file and run it through the Batch Importer directly to create the graph DB.
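Before handing the generated files to the importer, it can be useful to check that every relationship points at ids that actually exist in the node files. The sketch below is a hypothetical helper, not part of the original program. It works on lines that have already been read into memory, and the sample lines in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sanity check for the generated node and relationship files. */
final class RelationCheck {

    /** Collects the first tab-separated column of every data line (the header line is skipped). */
    static Set<String> nodeIds(List<String> nodeFileLines) {
        Set<String> ids = new HashSet<>();
        for (int i = 1; i < nodeFileLines.size(); i++) {
            ids.add(nodeFileLines.get(i).split("\t")[0]);
        }
        return ids;
    }

    /** Returns the relationship lines whose start or end id is not a known node id. */
    static List<String> dangling(Set<String> ids, List<String> relFileLines) {
        List<String> bad = new ArrayList<>();
        for (int i = 1; i < relFileLines.size(); i++) {
            String[] cols = relFileLines.get(i).split("\t");
            if (cols.length < 3 || !ids.contains(cols[0]) || !ids.contains(cols[1])) {
                bad.add(relFileLines.get(i));
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the same tab-delimited layout as the generated files.
        List<String> nodes = Arrays.asList(
                "id:string:users\tname",
                "qid_797771\tqid_797771",
                "uid_94691\tEdward MacDonald");
        List<String> rels = Arrays.asList(
                "id:string:users\tid:string:users\ttype",
                "qid_797771\tuid_94691\tASKED_BY",
                "qid_797771\truby\tREFERS");
        for (String line : dangling(nodeIds(nodes), rels)) {
            System.out.println("No node for: " + line);
        }
    }
}
```

With the real files, the lines could be loaded with Files.readAllLines and the ids from all four node files merged into one set before calling dangling.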
Create GraphDB with Batch Importer
- Download and Set up Batch Importer: The Batch Importer is a separate library that creates the graph.db data files needed by Neo4j. Its input is configured in the batch.properties file, which indicates the files to use as nodes and relationships. More details about the Batch Importer can be found in the readme at https://github.com/jexp/batch-import/tree/20
Download Link: https://dl.dropboxusercontent.com/u/14493611/batch_importer_20.zip
Note: Unzip to the location where the nodes and relationship files are created by the Java program.
- Create batch.properties: Create the batch.properties file as shown below. Each property is explained in more detail on the Batch Importer site. The batch_import.nodes_files and batch_import.rels_files properties are the most important, as they define the node and relationship input files.
    dump_configuration=false
    cache_type=none
    use_memory_mapped_buffers=true
    neostore.propertystore.db.index.keys.mapped_memory=5M
    neostore.propertystore.db.index.mapped_memory=5M
    neostore.nodestore.db.mapped_memory=200M
    neostore.relationshipstore.db.mapped_memory=500M
    neostore.propertystore.db.mapped_memory=200M
    neostore.propertystore.db.strings.mapped_memory=200M
    batch_import.node_index.users=exact
    batch_import.nodes_files=lang_nodes.csv,question_nodes.csv,answer_nodes.csv,user_nodes.csv
    batch_import.rels_files=rels.csv
- Execute Batch Importer: Run import.bat from the Batch Importer directory and pass it batch.properties and the name of the graph DB to create. This command creates the graph.db data file in the same location as your nodes and relationship files:
    batch_importer_20\import.bat batch.properties graph.db
    Using Existing Configuration File
    Importing 9 Nodes took 0 seconds
    Importing 676 Nodes took 0 seconds
    Importing 1653 Nodes took 0 seconds
    Importing 1491 Nodes took 0 seconds
    Importing 4656 Relationships skipped (2) took 0 seconds
    Total import time: 2 seconds
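If the importer reports fewer nodes than expected, one simple thing to check is which input files batch.properties really points at. The following sketch uses the standard java.util.Properties class to list them. The BatchConfig class is hypothetical and is not part of the Batch Importer, and the sample configuration in main is shortened for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical pre-flight check for batch.properties. */
final class BatchConfig {

    /** Returns the node files followed by the relationship files named in the configuration. */
    static List<String> inputFiles(Reader config) {
        Properties props = new Properties();
        try {
            props.load(config);
        } catch (IOException e) {
            throw new IllegalStateException("Could not read configuration", e);
        }
        List<String> files = new ArrayList<>();
        String[] keys = {"batch_import.nodes_files", "batch_import.rels_files"};
        for (String key : keys) {
            // Both properties hold a comma-separated list of file names.
            for (String file : props.getProperty(key, "").split(",")) {
                if (!file.trim().isEmpty()) {
                    files.add(file.trim());
                }
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Shortened sample; in practice pass a FileReader for batch.properties.
        String sample = "batch_import.nodes_files=lang_nodes.csv,question_nodes.csv\n"
                + "batch_import.rels_files=rels.csv\n";
        for (String name : inputFiles(new StringReader(sample))) {
            System.out.println(name + " exists=" + new File(name).exists());
        }
    }
}
```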
Visualize Graph with Neo4j
- Copy graph.db file: Create a new directory “data” under the root of the Neo4j installation directory and copy graph.db into it. This is optional, but keeping graph.db in the same location as Neo4j is recommended.
- Start Neo4j: Execute the “neo4j-community” file under the bin directory of Neo4j to start it. You will be prompted to choose the location of the graph.db file.
- Visualize Graphs:
- Launch Neo4j Web Console: http://localhost:7474/browser/
- Navigate to Graphs: Click on the bubbles at the top left and choose “*”
- Customize Graph Attributes: Double-click on the “Java” node and choose “name” as the caption.
- Explore Graphs: The exploration below shows the following:
Tracing the orange line shows that the user Trevor, who answered a Java question (aid_853052), also asked a PHP question (qid_865476). Tracing the red line shows that the user Audrey answered two Java questions (aid_853030 and aid_892379). Working with a graph database is a lot of fun because there are so many ways to traverse the data. Note that the user names are fictional and do not belong to real users.
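The kind of question answered by tracing the orange line can also be written down as a small traversal over the relationship triples. The sketch below is a plain Java illustration, not a Neo4j API: it finds users who answered a question referring to one language and asked a question referring to another. The CrossLanguageUsers class, the user id uid_1, and the question id qid_1 are hypothetical; only aid_853052 and qid_865476 come from the text above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory traversal over (start, end, type) relationship triples. */
final class CrossLanguageUsers {

    /** Users who answered a question in answeredLang and asked a question in askedLang. */
    static Set<String> answeredAndAsked(List<String[]> rels, String answeredLang, String askedLang) {
        Map<String, String> questionLang = new HashMap<>();   // question id -> language
        Map<String, String> answerQuestion = new HashMap<>(); // answer id -> question id
        Map<String, String> asker = new HashMap<>();          // question id -> user id
        Map<String, String> answerer = new HashMap<>();       // answer id -> user id
        for (String[] r : rels) {
            switch (r[2]) {
                case "REFERS":      questionLang.put(r[0], r[1]); break;
                case "HAS_ANSWER":  answerQuestion.put(r[1], r[0]); break;
                case "ASKED_BY":    asker.put(r[0], r[1]); break;
                case "ANSWERED_BY": answerer.put(r[0], r[1]); break;
                default: break; // ignore unknown relationship types
            }
        }
        // Follow Answer -> Question -> Language for every answer.
        Set<String> answered = new TreeSet<>();
        for (Map.Entry<String, String> e : answerer.entrySet()) {
            String questionId = answerQuestion.get(e.getKey());
            if (questionId != null && answeredLang.equals(questionLang.get(questionId))) {
                answered.add(e.getValue());
            }
        }
        // Follow Question -> Language for every asked question.
        Set<String> asked = new TreeSet<>();
        for (Map.Entry<String, String> e : asker.entrySet()) {
            if (askedLang.equals(questionLang.get(e.getKey()))) {
                asked.add(e.getValue());
            }
        }
        answered.retainAll(asked);
        return answered;
    }

    public static void main(String[] args) {
        List<String[]> rels = new ArrayList<>();
        rels.add(new String[] {"qid_1", "java", "REFERS"});            // hypothetical Java question
        rels.add(new String[] {"qid_1", "aid_853052", "HAS_ANSWER"});
        rels.add(new String[] {"aid_853052", "uid_1", "ANSWERED_BY"}); // uid_1 stands in for Trevor
        rels.add(new String[] {"qid_865476", "php", "REFERS"});
        rels.add(new String[] {"qid_865476", "uid_1", "ASKED_BY"});
        System.out.println(answeredAndAsked(rels, "java", "php"));
    }
}
```

In a graph database this is a single path pattern, whereas here every hop needs its own lookup table, which hints at why path-oriented storage is convenient for such questions.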
Conclusion
- Neo4j is one of the best graph databases around and comes with the powerful Cypher Query Language (CQL), which lets us traverse nodes via their relationships and filter on node properties as well. We will cover CQL in our next blog post, based on this graph data.
- R is very handy for quickly cleansing, transforming, and reshaping data to fit our needs.
- Neo4j also comes with a REST API for adding nodes and relationships dynamically to an existing graph DB.
References
- Neo4J: http://www.neo4j.org/
- Neo4J Use Cases: http://www.neo4j.org/learn/use_cases
- R: http://www.r-project.org/
- Neo4J Batch Importer: https://github.com/jexp/batch-import/tree/20
- Files: Click here to download nodes and relationship zip file
The post Embrace Relationships with Neo4J, R & Java appeared first on treselle.com.