AWS Glue + Neo4j : Tutorials?

I'm learning AWS Glue (essentially managed Spark, from what I understand) and I'd like to use Neo4j as my destination. I have a bunch of JSON in S3 that I'm hoping to process. Does anyone know if this is going to be possible: [S3] -> [AWS Glue] -> [Neo4j]? I'm hoping I can just follow the write-ups on Spark and they'll transfer over. If anyone has any resources they can point me to, I'd appreciate it! I'm new to the whole Spark ecosystem, so I know I have a lot of googling ahead of me.

I finally got around to spending some more time on this project today and had some success that I wanted to share for anyone else who comes across this post.

  • Neo4j Spark Documentation - a helpful starting point
  • Download the latest release of the connector from GitHub and upload it to an S3 bucket.
  • I also downloaded the GraphFrames jar and uploaded it to the same S3 bucket.

AWS Glue Job
I made a Scala job because that's what the examples are written in (to do: figure out the Python equivalent).
Dependent Jars
Include the two jars, comma separated.
Parameters
This was the tricky part: AWS only lets you specify a key once, and they don't encourage you to pass in --conf settings, but that's how Neo4j wants the connection parameters. So I specified a single --conf key and, in the value, just kept chaining more configs. The Neo4j documentation says you can embed the user and password as part of the URL parameter, but I never could get that to work.
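Concretely, the single job parameter from my setup ended up looking like this in the console:

Key:   --conf
Value: spark.neo4j.bolt.url=bolt://mydomain.com:7687 --conf spark.neo4j.bolt.user=neo4j --conf spark.neo4j.bolt.password=password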

That's what you need as far as specifying the Glue job. For the actual code of my job, I did just a very basic read and printed the results.

import com.amazonaws.services.glue.ChoiceOption
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.ResolveSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

import org.neo4j.spark._
import org.graphframes._

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    
    // The connector reads the bolt URL, user, and password from the
    // spark.neo4j.bolt.* settings passed in through the --conf job parameter
    val neo = Neo4j(spark)

    // Load the pattern (:Person)-[:KNOWS]->(:Person) into a GraphFrame,
    // reading 3 partitions of up to 1000 rows each
    val graphFrame = neo.pattern(("Person","id"),("KNOWS",null), ("Person","id")).partitions(3).rows(1000).loadGraphFrame

    graphFrame.vertices.show()

    Job.commit()
  }
}

Most of this is just the boilerplate that AWS provides when making a new Scala job. Not knowing Scala and being new to Spark in general, it took me some trial and error to get all of this figured out. But it runs successfully, and in the CloudWatch logs I can see values from my database printed!

Things still left to do

  • Figure out how to do this in Python
  • How to write data back and do general manipulations with GraphFrames
  • Manage connection information passed in as parameters (see the sketch after this list)
  • What if you had two database connections? How would you manage that?
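For those last two items, one idea I haven't tested yet: read the connection details as regular Glue job parameters and put them on the SparkConf yourself before the SparkContext is created, so the connector still finds them under the usual spark.neo4j.bolt.* keys. The --neo4j_url, --neo4j_user, and --neo4j_password parameter names below are made up for the sketch:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

import org.neo4j.spark._

object GlueApp {
  def main(sysArgs: Array[String]) {
    // Hypothetical job parameters: --neo4j_url, --neo4j_user, --neo4j_password
    val args = GlueArgParser.getResolvedOptions(sysArgs,
      Seq("JOB_NAME", "neo4j_url", "neo4j_user", "neo4j_password").toArray)

    // Set the keys the connector expects before the SparkContext exists
    val conf = new SparkConf()
      .set("spark.neo4j.bolt.url", args("neo4j_url"))
      .set("spark.neo4j.bolt.user", args("neo4j_user"))
      .set("spark.neo4j.bolt.password", args("neo4j_password"))

    val spark: SparkContext = new SparkContext(conf)
    val glueContext: GlueContext = new GlueContext(spark)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val neo = Neo4j(spark)
    // ... same pattern/loadGraphFrame code as above ...

    Job.commit()
  }
}

A second database connection would presumably need a different mechanism, since those spark.neo4j.bolt.* keys are global to the SparkConf.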

Hi Mike,

I keep getting a syntax error while running your script above as an AWS Glue job:

object GlueApp {
^
SyntaxError: invalid syntax

Could you please let me know what's wrong?

Also, is it possible to use AWS Glue to get the Neo4j properties/schema only?

Thanks,
Charles

Here are some things I'd verify:

  • You're using AWS Glue Scala and not AWS Glue Python
  • You have the dependent Jars specified when creating the job
  • You've set the job parameter for --conf

To test that you at least have all the jars and configuration importing correctly, you can strip that object down to just printing hello world. Then slowly start uncommenting code until things break.
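Something like this minimal sketch, keeping the connector imports so the jars still have to resolve at compile time:

import org.neo4j.spark._     // compilation fails here if the connector jar isn't on the path
import org.graphframes._     // likewise for the GraphFrames jar

object GlueApp {
  def main(sysArgs: Array[String]) {
    println("hello world")
  }
}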

Are you asking if you can use Cypher to get the Neo4j database schema? The same information as if you were to execute this in the Neo4j Browser?

call db.schema()
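If that's the goal, you could presumably run schema procedures through the connector's Cypher loader as well. A rough sketch using the 2.x connector's cypher()/loadRowRdd API as documented, untested on my side (db.labels() used here because it returns plain rows):

// Hypothetical: pull the label list out of Neo4j as rows, using the
// neo handle from the job above, and print them
val labelRows = neo.cypher("CALL db.labels()").loadRowRdd
labelRows.collect().foreach(println)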

Hi Mike,

Thanks for your quick reply! I am currently working on a project to use AWS Glue to collect database schemas on AWS. For example, we use an AWS Glue crawler to collect RDS MySQL schema information, such as table names, column names, data types, etc.
One of our application teams is using Neo4j, so we are wondering whether we can do something similar for that database technology too; that is how I found your article on the internet. Could you please let us know whether that is possible?

Thank you in advance!

Charles

I'm sure it's going to be possible. I'm still new to GraphFrames and to working with AWS Glue (aka Apache Spark), so I don't have specific instructions on how to do it.

If I may, I'd question the choice of AWS Glue for collecting database metadata. Glue (Apache Spark) is good for large volumes of data when you want to leverage massively parallel processing (MPP). Using it to query a database for its schema seems like overkill. Not only is it a much larger engine than you need, it'll also be costly, because AWS Glue bills a minimum of 10 minutes per job. Gathering database metadata should take far less time than that, over relatively few rows that you wouldn't need to shard out across multiple DPUs.
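To put rough numbers on it, at the list price of $0.44 per DPU-hour as I write this: a job allocated 10 DPUs that finishes in seconds still bills 10 DPUs × 10 minutes ≈ 1.67 DPU-hours, or about $0.73 per run, just to fetch a handful of schema rows.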

What would be more appropriate, and likely far easier and cheaper to implement, is an AWS Glue Python Shell job. It's a much cheaper, lighter-weight processing engine running plain Python. From there, refer to the instructions for connecting vanilla Python to Neo4j (the official Neo4j Python driver covers this).

That's just my opinion and you may have other reasons why you need to be using Apache Spark to gather metadata.

Thanks, Mike, for the information. I was pulled into a project to save all database schemas into the AWS Glue catalog, so that our developers don't need to create various "collectors" to retrieve the schema for each different database technology. I am actually a relational database engineer, and new to both AWS Glue and Neo4j. I will definitely relay your information to our developer teams.

By the way, I still can't get your stuff working under my account. When I ran the Scala job, I got this error message:

Compilation result: /tmp/g-5de1747b394a6f0bf75445dd36dd03e4d27dbe16-790763640924356726/script_2020-01-13-14-34-21.scala:15:
error: expected class or object definition
GlueApp {
^
one error found
Compilation failed.

Could you please review my Glue job definition below to see if anything is wrong? What should the Scala class name be?

Name: yju200-neo4j-scala
IAM role: AWSGlueServiceAdminRole
Type: Spark
Spark version: 2.4
ETL language: scala
Scala class name: GlueApp
Script location: s3://aws-glue-scripts-xxxxx-us-east-1/admin/yju200-neo4j-scala
Temporary directory: s3://aws-glue-temporary-xxxxx-us-east-1/admin
Job bookmark: Disable
Job metrics: Disable
Continuous logging: Disable
Server-side encryption: Disabled
Python lib path: s3://yju200-glue/graphframes-f9e13ab4ac1a7113f8439744a1ab45710eb50a72.zip,s3://yju200-glue/neo4j-spark-connector-2.2.1-M5.zip
Jar lib path: s3://yju200-glue/graphframes-f9e13ab4ac1a7113f8439744a1ab45710eb50a72.zip,s3://yju200-glue/neo4j-spark-connector-2.2.1-M5.zip
Other lib path:
Job parameters: --conf spark.neo4j.bolt.url=bolt://10.140.107.171:7687 --conf spark.neo4j.bolt.user=neo4j --conf spark.neo4j.bolt.password=xxxxx
Connections:
Maximum capacity: 10
Job timeout (minutes): 2880
Delay notification threshold (minutes):
Tags: