Empty Dataset after importing data from CSV

Hi,

I am trying to test some queries on the loaded dataset, but I always get an “empty set” as shown below.

Dataset was loaded successfully

My test dataset looks like this:

nodes.csv

relationships.csv

What am I missing here?

Also, I would like to know if there is a good programmatic way to load and query the data using Python?

PS: I have read the Python example in the docs, but I am not sure how to load the data.

Any pointer in the right direction would be appreciated. :slight_smile:


Hi @red_devil!

Empty Dataset after importing data from CSV

I am not sure which command you are using to start Memgraph, but I am going to assume it’s the source of your problem because everything else seems alright.

A typical command for starting Memgraph would be:

docker run -p 7687:7687 memgraph

Even though Memgraph will be up and running, this command isn’t the right way of starting a database instance that intends to load stored data.

When you use the CSV Import Tool, a snapshot is created from the supplied CSV files in the directory /var/lib/memgraph inside the Docker container. This snapshot needs to be loaded when Memgraph starts next time to populate the database.
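For reference, the import step itself is run inside the container; a minimal sketch (file names are placeholders for your own CSV files, and the exact flags may differ between versions, so check the CSV Import Tool docs):

mg_import_csv --nodes=nodes.csv --relationships=relationships.csv

The resulting snapshot is written to the data directory mentioned above.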

The problem is that Docker containers don’t persist data by default (all changes are lost when the container is removed), so we need to specify a volume in which to store the data permanently.

From the official Docker documentation:

  • Volumes are stored in a part of the host filesystem which is managed by Docker (/var/lib/docker/volumes/ on Linux). Volumes are the best way to persist data in Docker.

With this in mind, the correct command for starting Memgraph would be:

docker run -v mg_lib:/var/lib/memgraph -p 7687:7687 memgraph/memgraph --data-directory /var/lib/memgraph

Also, I would like to know if there is a good programmatic way to load and query the data using Python?

The simplest way of querying the database programmatically would be using the pymgclient driver for Python. You can find the instructions on how to install it here.

If you, for example, needed to load the following CSV file (nodes.csv) into Memgraph:

0,General,Forum
1,Support,Forum
2,Music,Forum
3,Film,Forum
4,Programming,Forum

the Python code in example.py could look like this:

import csv
import mgclient

# Make a connection to the database
connection = mgclient.connect(host='172.17.0.2', port=7687)

# Create a cursor for query execution
cursor = connection.cursor()

# Load the CSV file and create a node for each row
with open('nodes.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        query = """CREATE (n:{label})
                   SET n.title = '{title}'""".format(title=row[1], label=row[2])
        cursor.execute(query)

# A query to return all nodes
query = "MATCH (n) RETURN n"

# Execute the query
cursor.execute(query)

# Fetch one row of query results
row = cursor.fetchone()

# Print the first member in row
print(row[0])

# Make database changes persistent
connection.commit()

Note that this example requires the CSV file to be located in the same directory as example.py.
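A side note: building queries with .format() works for a quick test, but it breaks on titles containing quotes. pymgclient also accepts query parameters (Cypher $placeholders) for values; labels can’t be parameterized in Cypher, so the label still has to be interpolated. A hedged variant of the loading loop:

import csv
import mgclient

connection = mgclient.connect(host='172.17.0.2', port=7687)
cursor = connection.cursor()

with open('nodes.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        # The label is interpolated, but the title is passed as a query parameter
        query = "CREATE (n:{label}) SET n.title = $title".format(label=row[2])
        cursor.execute(query, {'title': row[1]})

connection.commit()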

I hope that this will help you with your work. :smile:


Hey @idespot :smiley:

Thank you for your response.

Basically, I am using docker-compose to run Memgraph as a service in a separate container. The aim of my project (thesis) is to run queries across Neo4j, Memgraph, and RedisGraph on the same dataset and build a query system around it. Here’s my configuration:

docker-compose.yml

version: "3"
services:
  redis:
    image: redislabs/redisgraph
    container_name: redis
    volumes:
      - ./redis/data:/data
    restart: unless-stopped
    ports:
      - "6379:6379"

  memgraph:
    image: memgraph:latest
    container_name: memgraph
    volumes:
      - ./memgraph/modules:/usr/lib/memgraph/query_modules
      - ./memgraph/import-data:/usr/lib/memgraph/import-data
    networks:
      - test_network
    ports:
      - "7688:7687"

  neo4j:
    image: neo4j:latest
    container_name: neo4j
    volumes:
      - ./neo4j/data:/data
      - ./neo4j/plugins:/plugins
      - ./neo4j/import:/import
      - ./neo4j/logs:/logs
      - ./neo4j/conf:/conf
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_apoc_export_file_enabled=true
      - NEO4J_apoc_import_file_enabled=true
      - NEO4J_apoc_import_file.use_neo4j_config=false
      - NEO4J_dbms_security_procedures_unrestricted=apoc.*,algo.*
      - NEO4J_dbms_memory_heap_initial__size=512m
      - NEO4J_dbms_memory_heap_max__size=2G
      - NEO4J_apoc_uuid_enabled=true
      - NEO4J_dbms_default__listen__address=0.0.0.0
      - NEO4J_dbms_allow__upgrade=true
      - NEO4J_dbms_default__database=neo4j
      - NEO4J_AUTH=neo4j/test
      - NEO4JLABS_PLUGINS=["apoc"]
    networks: ["Neo4j-test"]
    restart: unless-stopped

networks:
  Neo4j-test:
  test_network:
    driver: bridge

# use docker volume to persist data outside of a container.
volumes:
  Neo4j-test:

I simply run docker-compose up -d to bring all the containers up.

This loads correctly as per the logs.

I have to either use mgconsole or go inside the container to manually run queries.

I cannot use this approach in the long run, which is why I need a programmatic way to access the database and run my queries.

Regarding the CSV files, I have converted JSON to CSV from my Neo4j implementation and tried to format them according to the documentation.

However, I am not entirely sure about this, but at least the mg_import_csv tool no longer complains as it did earlier.

I am still exploring pymgclient from its documentation. I tried to replicate one of the example implementations.

app.py

import mgclient

# Make a connection to the database
connection = mgclient.connect(host='172.19.0.2', port=7687)

# Create a cursor for query execution
cursor = connection.cursor()

# Delete all nodes and relationships for a cleaner start
query = "MATCH (n) DETACH DELETE n"

# Execute the query
cursor.execute(query)

# Create a Cow node connected to its Class node
query = ("""CREATE (n:Cow {name: 'Cow1'})-[e:TYPE]->
            (m:Class {name: 'ABOX'})
            RETURN n, e, m""")

# Execute the query
cursor.execute(query)

# Fetch one row of query results
row = cursor.fetchone()

# Print the first member in row
print(row)

Output:

(<mgclient.Node(id=33, labels={'Cow'}, properties={'name': 'Cow1'}) at 0x7fba3acecea0>, <mgclient.Relationship(start_id=33, end_id=34, type='TYPE', properties={}) at 0x7fba3ac61b30>, <mgclient.Node(id=34, labels={'Class'}, properties={'name': 'ABOX'}) at 0x7fba3acece70>)

I can see multiple entries for obvious reasons (I was testing), but I am not able to figure out how to inspect the rest of the IDs.

It’s very unclear which direction I am going in at the moment. Any advice for a beginner would be appreciated. :slight_smile:

PS: Loving Memgraph’s idea and implementation from the start. Good job, guys!

Update!

I ran this in a separate container (without running the previous ones).

Voila! I got something out of my Python snippet:

import mgclient

# Make a connection to the database
connection = mgclient.connect(host='172.17.0.2', port=7687)

# Create a cursor for query execution
cursor = connection.cursor()

# A query to return all nodes with associated relationships
query = """MATCH(n)-[r]->(m)
        RETURN n, r, m
        """
# Execute the query
cursor.execute(query)

# Fetch all rows of query results
rows = cursor.fetchall()

# Print the first row
print(rows[0])

# Make database changes persistent
connection.commit()

Output:

(<mgclient.Node(id=4, labels={'Abox'}, properties={'id': '4', 'properties__admin': '', 'properties__comment': '', 'properties__description': 'A Cow', 'properties__html_info': '', 'properties__identifier': 'http://example.org/abox/Cowebcc054f', 'properties__namespace': '', 'properties__ontology_level': '', 'properties__pl': '', 'properties__sing': '', 'properties__title': 'Cow1', 'properties__unique': '', 'properties__uri': '', 'properties__url': '', 'properties__version': '', 'properties__xsd_type': ''}) at 0x7fbdff2defc0>, <mgclient.Relationship(start_id=4, end_id=2, type='relationship', properties={'end__labels__001': 'Abox', 'end__labels__002': '', 'label': 'relation', 'properties__description': '', 'properties__level': '', 'properties__namespace': '', 'properties__title': '', 'properties__uri': '', 'start__labels': 'Abox'}) at 0x7fbdfee59770>, <mgclient.Node(id=2, labels={'Abox'}, properties={'id': '2', 'properties__admin': '', 'properties__comment': '', 'properties__description': 'A Sensor Collar', 'properties__html_info': '', 'properties__identifier': 'http://example.org/abox/SensorCollar8153e23d', 'properties__namespace': '', 'properties__ontology_level': '', 'properties__pl': '', 'properties__sing': '', 'properties__title': 'SensorCollar1', 'properties__unique': '', 'properties__uri': '', 'properties__url': '', 'properties__version': '', 'properties__xsd_type': ''}) at 0x7fbdff2def90>)

From what I understand -

  1. I was querying the wrong instance and data was not getting persisted.
  2. Port mapping is an issue, even though I initially thought it shouldn’t be.

My questions -

  1. What is the output format? I am assuming it is a graph data structure. (hard to understand)
  2. How do I run this in my docker-compose?
  3. How do I persist data after querying?

Other Results:

Thank you for your assistance.


Hi @red_devil!

I hope you are doing well with your thesis. :smile:

What is the output format? I am assuming it is a graph data structure. (hard to understand)

The detailed documentation for pymgclient can be found here.
There are three main types of objects that are returned:

  • mgclient.Node

  • mgclient.Relationship

  • mgclient.Path

Here is a small code snippet that demonstrates all of them:

import mgclient

# Make a connection to the database
connection = mgclient.connect(host='172.17.0.2', port=7687)

# Create a cursor for query execution
cursor = connection.cursor()

# Create two nodes and a relationship
query = """CREATE (:Movie {title: 'Avatar', year: 2009})-
           [:DIRECTED_BY {announced: '2006'}]->(:Person {name: 'James Cameron'})"""
cursor.execute(query)

# Execute a query to return every node
query = "MATCH (n) RETURN n"
cursor.execute(query)
row = cursor.fetchone()

print(row[0].id)
print(row[0].labels)
print(row[0].properties)
print(row[0].properties['title'])

# Execute a query to return every relationship
query = "MATCH ()-[r]-() RETURN r"
cursor.execute(query)
row = cursor.fetchone()

print(row[0].id)
print(row[0].start_id)
print(row[0].end_id)
print(row[0].type)
print(row[0].properties)
print(row[0].properties['announced'])

# Execute a query to return every path
query = "MATCH p=((n)-[r]-(m)) RETURN p"
cursor.execute(query)
row = cursor.fetchone()

print(row[0].relationships)
print(row[0].nodes)

# Make database changes persistent
connection.commit()
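Regarding inspecting the rest of the IDs from your earlier snippet: fetchall() returns a list of result rows, so you can simply iterate over all of them instead of printing just the first one. Continuing from the snippet above:

# Iterate over every returned node and print its id, labels and properties
query = "MATCH (n) RETURN n"
cursor.execute(query)
for row in cursor.fetchall():
    node = row[0]
    print(node.id, node.labels, node.properties)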

How do I run this in my docker-compose?

Ideally, you would create a Docker container that runs a Python script to query all of your databases. You can take a look at this post where I have utilized Memgraph with docker-compose to query the database programmatically.

The most important step is to supply the right address of the container where Memgraph is running. This can be done with environment variables. For example:

version: "3"
services:
  memgraph:
    image: "memgraph"
    ports:
      - "7687:7687"
  sng_demo:
    build: .
    volumes:
      - .:/app
    ports:
      - "5000:5000"
    environment:
      MG_HOST: memgraph
      MG_PORT: 7687
    depends_on:
      - memgraph

The variables MG_HOST and MG_PORT contain the hostname of the container where Memgraph will be started (Docker Compose resolves service names to container addresses) and the port we have specified. They can be retrieved in a Python script with:

import os

# The second argument is the default value if the variable isn't found
MG_HOST = os.getenv('MG_HOST', '127.0.0.1')
MG_PORT = int(os.getenv('MG_PORT', '7687'))
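These values can then be passed straight to the driver, continuing the snippet above:

# Connect using the values injected by docker-compose
connection = mgclient.connect(host=MG_HOST, port=MG_PORT)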

How do I persist data after querying?

You can mount the directories where database snapshots and logs are stored, just like you specified volumes for importing data and query modules in your compose file. The database will be repopulated from these files when you start it again.

For example:

version: '3'
services:
  memgraph:
    image: "memgraph"
    ports:
      - "7687:7687"
    volumes:
      - ./mg_lib:/var/lib/memgraph
      - ./mg_log:/var/log/memgraph
      - ./mg_etc:/etc/memgraph

From the querying perspective, you need to commit the transaction (connection.commit()) so the database knows it is finished.
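For example (a minimal sketch; pymgclient also exposes an autocommit flag on the connection if you prefer not to call commit() manually):

import mgclient

connection = mgclient.connect(host='127.0.0.1', port=7687)
cursor = connection.cursor()

# Changes made inside a transaction are persisted only after commit()
cursor.execute("CREATE (:Node {name: 'persisted'})")
connection.commit()

# Alternatively, commit every statement automatically
connection.autocommit = True
cursor.execute("CREATE (:Node {name: 'also persisted'})")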


Hi @idespot

Thanks for your help. I hope you’re doing well. The thesis is going quite well, though with a lot of pressure.

I researched and tried a lot of ways to run Memgraph in docker-compose.yaml, but it seems like it does not work. The main purpose of this docker-compose is to instantiate three services (Neo4j, Memgraph, and RedisGraph) simultaneously in their respective containers. Neo4j works fine, but Memgraph only works with the docker run -v mg_lib:/var/lib/memgraph -p 7687:7687 memgraph/memgraph --data-directory /var/lib/memgraph command. I have tried writing a docker-compose config like this:

docker-compose.yaml:

version: "3"
services:
  redis:
    image: redislabs/redisgraph
    container_name: redis
    volumes:
      - ./redis/data:/tmp/redis
    restart: unless-stopped
    ports:
      - "6379:6379"

  memgraph:
    image: memgraph
    container_name: memgraph
    volumes:
      - ./memgraph/import-data:/usr/lib/memgraph/import-data
      - ./memgraph/mg_lib:/var/lib/memgraph
      - ./memgraph/mg_log:/var/log/memgraph
      - ./memgraph/mg_etc:/etc/memgraph
    ports:
      - "7687:7687"
    environment:
      MG_HOST: memgraph
      MG_PORT: 7687

  neo4j:
    image: neo4j:latest
    container_name: neo4j
    volumes:
      - ./neo4j/data:/data
      - ./neo4j/plugins:/plugins
      - ./neo4j/import:/import
      - ./neo4j/logs:/logs
      - ./neo4j/conf:/conf
    ports:
      - "7474:7474"
      - "7688:7687"
    environment:
      - NEO4J_apoc_export_file_enabled=true
      - NEO4J_apoc_import_file_enabled=true
      - NEO4J_apoc_import_file.use_neo4j_config=false
      - NEO4J_dbms_security_procedures_unrestricted=apoc.*,algo.*
      - NEO4J_dbms_memory_heap_initial__size=512m
      - NEO4J_dbms_memory_heap_max__size=2G
      - NEO4J_apoc_uuid_enabled=true
      - NEO4J_dbms_default__listen__address=0.0.0.0
      - NEO4J_dbms_allow__upgrade=true
      - NEO4J_dbms_default__database=neo4j
      - NEO4J_AUTH=neo4j/test
      - NEO4JLABS_PLUGINS=["apoc"]
    networks: [ "Neo4j-test" ]
    restart: unless-stopped

networks:
  Neo4j-test:
    driver: bridge

# use docker volume to persist data outside of a container.
volumes:
  Neo4j-test:

Well, I can explain my thesis in short: I am creating an API on top of the system that translates plain/raw Cypher from the client, performs queries over datasets loaded into the data stores (Memgraph, Neo4j, and RedisGraph), and returns a graph data structure as output.

System architecture:

My Memgraph implementation looks something like this at the moment:

import utilities.memgraphConfig as mcfg
from persistence.connection import Connection
import mgclient


class MemgraphImpl(Connection):

    def __init__(self):
        self.connect(host=mcfg.host, port=mcfg.port)

    def connect(self, host, port):
        self.connection = mgclient.connect(host=host, port=port)
        print("Memgraph connection successful")

    def execute(self, query):
        cursor = self.connection.cursor()
        cursor.execute(query)
        row = cursor.fetchall()
        self.connection.commit()
        return row


    def disconnect(self):
        self.connection.close()
        print("Memgraph connection terminated")

There is an orchestrator (custom driver) that runs on the system and takes care of datastore configuration and inputs, roughly as sketched below.
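For illustration, a minimal sketch of how such an orchestrator could dispatch one query across implementations of the Connection interface (the names and the single registered store here are hypothetical):

# Hypothetical dispatcher: run the same Cypher against every registered store
stores = {
    "memgraph": MemgraphImpl(),
    # "neo4j": Neo4jImpl(), "redisgraph": RedisGraphImpl(), ...
}

def run_everywhere(query):
    return {name: store.execute(query) for name, store in stores.items()}

results = run_everywhere("MATCH (n) RETURN count(n)")
print(results["memgraph"])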

It is an awesome library! :smiley: It really helped me speed things up.

Your input would be much appreciated here! :slight_smile: