When migrating data from legacy systems to Neo4J there are various approaches to choose from. There is the high-speed, specialised import tool /usr/bin/neo4j-import which prepares a database offline from a collection of CSV files. A great help if it fits your problem.
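For the record, an invocation looks roughly like this (the database and CSV file names here are made up, and the exact flags depend on your Neo4j version):
neo4j-import --into graph.db --nodes nodes.csv --relationships rels.csv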
However, most of the time I find myself writing scripts to read data from a source, do some conversions on it and then put it in the graph. My go-to tools in cases like that are the Python language and the excellent py2neo library.
Python has a very rich ecosystem capable of reading almost any file format you can think of, and py2neo provides a very reliable connection to the graph. This method is notoriously much slower than the import tool, as everything is sent via HTTP and the transactional pipe. But with the conversion power of Python at your fingertips it is a very capable approach. Often a script becomes a self-contained piece of “magic” which does its conversion in a very repeatable way, great when you need to do stuff more than once.
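A minimal sketch of the kind of script I mean, assuming py2neo 2.x talking to a local instance (the CSV file name is just an example):
import csv
from py2neo import Graph, Node

graph = Graph()  # defaults to http://localhost:7474/db/data/

# Read a one-column CSV and create a node per row
with open('certifications.csv') as f:
    for row in csv.reader(f):
        if row:
            graph.create(Node("Certification", name=row[0]))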
Recently I was working on a project where some “lists” needed to be migrated into the graph. Typically these were lists of 5 to 15 entries representing some “kind of thing”, used in the frontend to present them in a SELECT or checkbox kind of interface.
Like Pavlov's dog I started a new Python script to mould 5 lines of CSV into the graph. The first few keystrokes sent shivers down my spine. Something was silly here. Although I had written stuff like this numerous times before, it felt wrong to do it this way. Five lines, no conversions, no magic: make a CREATE statement out of it, paste it into the console and be done. Right?
But those commands would get lost in the conversion train I was building to recreate a fresh dataset whenever needed. Ah, but there is a tool for that! Just in time I remembered the existence of /usr/bin/neo4j-shell. It can read from a file, so I quickly drafted a .cypher file instead of a silly .py :
CREATE (c:Certification) set c.name='ISO 27001';
CREATE (c:Certification) set c.name='ISO 9001';
CREATE (c:Certification) set c.name='ISO 50000';
CREATE (c:Certification) set c.name='ISO 14001';
CREATE (c:Certification) set c.name='NEN 7510';
CREATE (c:Certification) set c.name='OHSAS 18001';
CREATE (c:Certification) set c.name='PCI DSS';
CREATE (c:Certification) set c.name='ISAE 3402';
CREATE (c:Certification) set c.name='SOC 2-2';
CREATE (c:Certification) set c.name='EU Code of Conduct';
Boom! Done! I dropped into the Unix shell and typed:
neo4j-shell -file certifications.cy
Nothing. Zilch. Nada; it just hangs. The neo4j-shell was trying something which took forever to finish, never returning to the command line. Weird. So forget about that file for a moment and try without any parameters.
Great; same thing, just sitting there waiting for a shell prompt which never comes. The smart thing to do at this moment is to shrug your shoulders, move back to the known route and finish the task. Just create that Python script and be done with it.
But of course I couldn’t; I had to know what was wrong with my beloved database. I knew the shell worked, I had played with it before, just not on this VM. I opened a Terminal on my Hackintosh and voilà, I was instantly presented with a neo4j shell connected to the Neo4J instance running there.
Could it be a VM issue? The development work is done in an Ubuntu VM running on that same Hackintosh, to mimic the production environment where the actual site is hosted. Puzzled by “the waiting shell” I decided to run tcpdump to see if anything was going on over the network interface.
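Something along these lines (eth0 is an assumption for the VM's interface name; the filter just keeps my own SSH session out of the capture):
sudo tcpdump -n -i eth0 not port 22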
And there it was (IP removed to protect the innocent):
14:54:20.786685 IP 10.10.10.236.54356 > ec2.eu-central-1.compute.amazonaws.com.38501: Flags