This tutorial covers only the Python setup. I recommend Anaconda for a more convenient development environment.
AWS Glue is a serverless AWS service for ETL (Extract, Transform, Load).
Install the Anaconda Python Environment
1. Visit the Anaconda downloads page.
Go to the following link: https://www.anaconda.com/products/distribution
2. Select Linux
On the downloads page, select the Linux operating system, right-click 64-Bit (x86) Installer (581 MB), and choose Copy link address.
3. Use wget to download the bash installer
Now that the link to the bash installer (.sh file) is on the clipboard, use wget to download the installer script. In a terminal, cd into the home directory and make a new directory called setup. cd into setup, download the installer with wget, then run it with bash or sh.
cd ~
mkdir setup
cd setup
wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
bash Anaconda3-2021.11-Linux-x86_64.sh
Continue the installation by following the on-screen instructions. When the installation completes, you should see the following prompt:
Do you wish the installer to initialize Anaconda3 by running conda init? [yes|no]
[no] >>> yes
Type yes and press Enter to initialize Anaconda.
Next, reload your shell configuration so that the changes made by conda init take effect:
cd ~
source ~/.bashrc
4. Set up the Glue Python environment in Anaconda
Our project uses Glue 3.0, which works well with Python 3.7. Create a conda environment named glue with Python 3.7.3:
conda create --name glue python==3.7.3
After the environment is created, install the necessary libraries:
conda activate glue
pip install boto3
pip install pytest
To activate the conda environment automatically, add the source activate glue command at the end of .bash_profile or .profile:
cd ~
sudo nano .profile
Add this line:
source activate glue
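Once the glue environment is active, a quick sanity check can confirm that the interpreter and libraries match what this step set up. The following is only a minimal sketch: the function names are hypothetical, and the expected version (3.7) and module list (boto3, pytest) are taken from the commands above.

```python
import importlib.util
import sys

# Assumption: Glue 3.0 is paired with Python 3.7, as stated in this tutorial.
EXPECTED_PYTHON = (3, 7)
# Libraries installed with pip in the step above.
REQUIRED_MODULES = ["boto3", "pytest"]


def python_version_ok(version_info=sys.version_info):
    """Return True when the major.minor version matches EXPECTED_PYTHON."""
    return tuple(version_info[:2]) == EXPECTED_PYTHON


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python 3.7 active:", python_version_ok())
    print("Missing modules:", missing_modules())
```

Run it with python inside the activated environment. An empty list of missing modules suggests the pip installs landed in the right environment.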
Install AWS CLI

Install according to the instructions here: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
Set Up Glue Local Development With Glue 3.0
Complete some prerequisite steps and then use AWS Glue utilities to test and submit your Python ETL script.
Prerequisites for Local Python Development
1. Install the required packages:
sudo add-apt-repository ppa:webupd8team/java
sudo apt install openjdk-8-jdk
sudo apt install zip
2. Create a glue folder in the home directory and download the required libraries:
cd ~
mkdir glue
- Clone the AWS Glue Python repository from GitHub (https://github.com/awslabs/aws-glue-libs) and check out the glue-3.0 branch:
cd ~
cd glue
git clone https://github.com/awslabs/aws-glue-libs
cd aws-glue-libs
git checkout glue-3.0
- Install Apache Maven:
cd ~
cd glue
wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
tar xzvf apache-maven-3.6.0-bin.tar.gz
rm -rf apache-maven-3.6.0-bin.tar.gz
- Install the Apache Spark distribution:
cd ~
cd glue
wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
tar xzvf spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
rm -rf spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
3. Configure Environment Variables
cd into the home directory and add the following lines to .bash_profile or .profile:
cd ~
sudo nano .profile
Add at the end of the file:
GLUE_DEV=$HOME/glue
PATH=$GLUE_DEV/apache-maven-3.6.0/bin:$PATH
PATH=$GLUE_DEV/aws-glue-libs/bin:$PATH
export PATH
export SPARK_HOME=$GLUE_DEV/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3
export SPARK_LOCAL_IP=127.0.0.1
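After opening a new terminal, it can help to confirm that these variables are visible to Python. The sketch below is an illustration only: find_problems is a hypothetical helper, and the variable names and PATH suffixes are copied from the .profile lines above.

```python
import os

# Names and directory suffixes copied from the .profile lines above.
REQUIRED_VARS = ["SPARK_HOME", "SPARK_LOCAL_IP"]
REQUIRED_PATH_SUFFIXES = ["apache-maven-3.6.0/bin", "aws-glue-libs/bin"]


def find_problems(env):
    """Return a list of readable problems found in an environment mapping."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    # Split PATH into entries and ignore trailing slashes when comparing.
    entries = [p.rstrip("/") for p in env.get("PATH", "").split(os.pathsep)]
    for suffix in REQUIRED_PATH_SUFFIXES:
        if not any(entry.endswith(suffix) for entry in entries):
            problems.append(f"PATH has no entry ending in {suffix}")
    return problems


if __name__ == "__main__":
    for problem in find_problems(os.environ):
        print("Problem:", problem)
```

If the script prints nothing, both variables are set and both bin directories are on PATH. It does not check that the directories actually exist on disk.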
4. Fix some known issues:
- Fix the mysql driver: download mysql-connector-java-8.0.29.jar, then copy the .jar file into spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/jars/
- import imp fails when running tests: open the file /home/<your user>/glue/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py and change import imp to import importlib (this path names a Spark 2.4.3 directory; adjust it if your Spark directory differs)
- netty error: open the file /home/<your user>/glue/aws-glue-libs/bin/glue-setup.sh and add rm -rf $ROOT_DIR/jarsv1/netty-* below line 19:
# Run mvn copy-dependencies target to get the Glue dependencies locally
mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jarsv1 dependency:copy-dependencies
rm -rf $ROOT_DIR/jarsv1/netty-*
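To check whether the netty fix took effect after a Glue command has run, one option is to list any netty-* files still present in the jarsv1 folder. This is a rough sketch: leftover_netty_jars is a hypothetical helper, and the default location ~/glue/aws-glue-libs/jarsv1 is an assumption based on the folder layout used in this tutorial.

```python
import glob
import os


def leftover_netty_jars(jars_dir):
    """Return the sorted file names in jars_dir that match netty-*."""
    pattern = os.path.join(jars_dir, "netty-*")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


if __name__ == "__main__":
    # Assumed location, following the ~/glue layout from the earlier steps.
    jars_dir = os.path.expanduser("~/glue/aws-glue-libs/jarsv1")
    print(leftover_netty_jars(jars_dir))
```

An empty list means no netty jars remain in that folder. A non-empty list suggests the rm line was not added or was placed before the mvn command.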
Running Your Python ETL Script
- With the AWS Glue jar files available for local development, you can run the AWS Glue Python package locally.
- Use the following utilities and frameworks to test and run your Python script.
| Utility | Command | Description |
|---|---|---|
| AWS Glue Shell | gluepyspark | Enter and run Python scripts in a shell that integrates with AWS Glue ETL libraries. |
| AWS Glue Submit | gluesparksubmit | Submit a complete Python script for execution. |
| Pytest | gluepytest | Write and run unit tests of your Python code. The pytest module must be installed and available in the PATH. |
- Usage:
gluesparksubmit <script.py>
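As an illustration of what such a script could look like, here is a minimal sketch that works with both gluesparksubmit and gluepytest. The file name sample_job.py, the clean_record function, and the name and amount fields are all hypothetical. The transformation logic is kept as a plain Python function so it can be unit tested without starting Spark, and the awsglue imports are deferred so the file can still be imported where those libraries are missing.

```python
def clean_record(record):
    """Trim the name and convert the amount to int (hypothetical fields)."""
    return {"name": record["name"].strip(), "amount": int(record["amount"])}


def test_clean_record():
    # pytest picks up functions whose names start with test_.
    cleaned = clean_record({"name": " Alice ", "amount": "10"})
    assert cleaned == {"name": "Alice", "amount": 10}


def main():
    try:
        # These imports are expected to resolve under gluesparksubmit.
        from awsglue.context import GlueContext
        from pyspark.context import SparkContext
    except ImportError:
        print("awsglue/pyspark not importable; launch with gluesparksubmit")
        return
    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session
    # Small in-memory DataFrame so the sketch needs no external data source.
    df = spark.createDataFrame(
        [(" Alice ", "10"), ("Bob", "7")], ["name", "amount"]
    )
    for row in df.collect():
        print(clean_record(row.asDict()))


if __name__ == "__main__":
    main()
```

Under these assumptions, gluesparksubmit sample_job.py would run main(), and gluepytest sample_job.py would run test_clean_record, provided gluepytest passes its arguments on to pytest.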
