LOAD CSV timing out!

I have been trying to load LDBC data (9 node files and 9 edge files, each with around 1,000,000 rows) using LOAD CSV through Python, but the queries keep timing out.
I am connected to a high-powered server, so the server itself shouldn’t be the bottleneck.
I also set query_execution_time to 0.
Is there a solution to this or another way of importing files?

Hello @ritwik1999. That’s unusual behaviour. Do you have the adequate indexes configured?

Yep. Right after importing the nodes, I create an index for each node type on one of its properties. Even then, I am not able to import the relationship files.

I see. Could we take a closer look at the query and the indexes you have set up? Can you share the exact LOAD CSV query and the indexes you have configured so far?

So these are the queries generated for each file.

After importing the node files and before importing the edge files, I create an index for each node type. Each node CSV has a ‘key’ column, and I use that column to create the indices.
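For reference, the index-creation step could be sketched like this; the label names and the `index_query` helper are assumptions, since the actual schema isn’t shown in the thread:

```python
# Hypothetical LDBC node labels; adjust to the real schema.
node_labels = ["Person", "Comment", "Post", "Forum", "Tag"]

def index_query(label, prop="key"):
    """Build a Memgraph label-property index statement for one label."""
    return f"CREATE INDEX ON :{label}({prop});"

queries = [index_query(label) for label in node_labels]

# Each statement would then be executed through the Python client, e.g.:
# for q in queries:
#     cursor.execute(q)
```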

Can you show the results of `EXPLAIN LOAD CSV FROM …` (for just one relationship file)?

Are all the CSV rows unique? If so, you can use MATCH + CREATE instead of MATCH + MERGE, which may reduce the time.
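Sketched with a hypothetical relationship file (the file path, column names, and labels are assumptions): MERGE first checks whether the edge already exists, while CREATE skips that check entirely.

```python
# MERGE looks up an existing edge before creating one, which is slow
# without perfect index support; CREATE inserts unconditionally.
merge_query = """
LOAD CSV FROM '/data/commentHasTag.csv' WITH HEADER AS row
MATCH (c:Comment {key: row.commentKey})
MATCH (t:Tag {key: row.tagKey})
MERGE (c)-[:HAS_TAG]->(t);
"""

# If every row is unique, the MERGE can safely become a CREATE.
create_query = merge_query.replace("MERGE", "CREATE")
```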

All the rows should be unique because of the key column, but I am using MERGE because I later delete 100 nodes and then want to re-insert the relationships for those nodes.

Anyhow, I changed MERGE to CREATE. Even now, I am 40 minutes in and not a single file has finished importing.

Hello, would it be possible to send the files to josip.mrden@memgraph.io so I can check them out and see what’s going on?

I have shared the link.

I see. It seems that MERGE has been a bit troublesome for inserts. We’ll try to improve it by v2.7.1, since we are already aware of some performance issues with MERGE.

Have you tried running it with analytics mode?

So what can be done to improve the runtime?
And what is analytics mode?

Analytics mode disables MVCC in the database and is meant to be used for bulk imports like the workload you’re running:

```cypher
STORAGE MODE IN_MEMORY_ANALYTICAL;
// bulk import
STORAGE MODE IN_MEMORY_TRANSACTIONAL;
// after this you have all your ACID guarantees back.
```

So I tried measuring the time for each relationship row using this:
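The snippet isn’t shown in the thread; a generic per-query timing wrapper, independent of the client library, might look like this (`timed` is a hypothetical helper):

```python
import time

def timed(run_query, query, *args):
    """Run a query via the given callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = run_query(query, *args)
    return result, time.perf_counter() - start

# Usage with any client, e.g.:
# _, secs = timed(cursor.execute, "MATCH (n) RETURN count(n);")
# print(f"{secs:.3f} s")
```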

It’s 0.12 s per row on average, with or without analytics mode.
The whole file took 2627 minutes to run. I don’t know what else to do.

Your queries seem correct, but the execution time is horrible. A few questions on my mind:

  • How do you run Memgraph? Is it native on Linux or Docker or something else?
  • Is Memgraph deployed on the same machine as the client code?
  • Could you print the query summary? It contains the execution time observed from Memgraph’s perspective, and it would be interesting to see that number.
  • Maybe there is an issue in the Python layer. Did you try running the same query from Memgraph Lab?
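If the connection goes over Bolt, the server-side timings can be read from the result summary. A minimal sketch, assuming the neo4j Python driver is used as the client (the `report` helper is hypothetical):

```python
def report(summary):
    """Format server-side timings from a neo4j ResultSummary-like object.

    result_available_after / result_consumed_after are reported in milliseconds.
    """
    return (f"available after {summary.result_available_after} ms, "
            f"consumed after {summary.result_consumed_after} ms")

# Usage with the neo4j Bolt driver (assumes Memgraph is reachable over Bolt):
# with driver.session() as session:
#     summary = session.run(query).consume()
#     print(report(summary))
```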

Hello @ritwik1999, your comments.csv sample looks like this:

1.37439E+12,2011-09-19T19:54:40.331+00:00,,Internet Explorer,thanks,6,8.79609E+12,77,1.23695E+12,UNKWN

It seems that the key column is truncated by scientific notation (e.g. E+12). That might be causing issues and creating supernodes, since nodes that should have distinct keys end up with the same key. Is this intended behaviour?
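For context, this is the classic float-export pitfall: large integer keys rendered in scientific notation lose their low digits, so distinct keys collide. A quick demonstration with illustrative key values:

```python
# Two distinct LDBC-style keys that differ only in their low digits.
key_a = 1374389534721
key_b = 1374389534722

# Rendered with ~6 significant figures, as in the sample row above,
# both collapse to the same string.
short_a = format(key_a, ".5E")
short_b = format(key_b, ".5E")
print(short_a)             # 1.37439E+12
assert short_a == short_b  # distinct keys become identical

# The fix is to keep the key column as text end to end, e.g. with pandas:
# pd.read_csv("comments.csv", dtype={"key": str})
```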

When I loaded commentsHasTag.csv, it created 10M edges instead of 1M, since that is how many rows the CSV contains. Could that be the first issue we need to solve in order to fix the import?

I run Memgraph on a remote server using Docker.
I uploaded the result of `EXPLAIN LOAD CSV …` in the comments of this post; you can check that out.
I tried running a simple query in Memgraph Lab, but setting up Memgraph Lab against the server was even worse; it was lagging a lot.