AWS offers many data transfer tools, but none of them can pull files from an SFTP server into S3 out of the box.
Luckily Glue is very flexible, and it is possible to run a pure Python script there.
Without further ado, here is a basic Python script which can run in Glue (as well as locally). It reads all files in the root of an SFTP server and uploads them into an S3 bucket.
import boto3
import paramiko
s3 = boto3.resource("s3")
bucket = s3.Bucket(name="destination-bucket")
bucket.load()
ssh = paramiko.SSHClient()
# In prod, add explicitly the rsa key of the host instead of using the AutoAddPolicy:
# ssh.get_host_keys().add('example.com', 'ssh-rsa', paramiko.RSAKey(data=decodebytes(b"""AAAAB3NzaC1yc2EAAAABIwAAAQEA0hV...""")))
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(
    hostname="sftp.example.com",
    username="thisdataguy",
    password="very secret",
)
sftp = ssh.open_sftp()
for filename in sftp.listdir():
    print(f"Downloading {filename} from sftp...")
    # mode: ssh treats all files as binary anyway, so 'b' is ignored.
    with sftp.file(filename, mode="r") as file_obj:
        print(f"uploading {filename} to s3...")
        bucket.put_object(Body=file_obj, Key=f"destdir/{filename}")
    print(f"All done for {filename}")
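One caveat with this loop: listdir() returns directories as well as regular files, and opening a directory with sftp.file() will fail. A hedged refinement (the helper name is mine, not from the original script) is to use paramiko's listdir_attr(), which returns SFTPAttributes objects carrying st_mode, and keep only regular files:

```python
import stat

def is_regular_file(st_mode: int) -> bool:
    # listdir() also yields directories; uploading one would raise,
    # so filter on the POSIX mode bits that mark a regular file.
    return stat.S_ISREG(st_mode)

# Sketch of how it would slot into the loop above, assuming the same
# sftp session:
#
# for entry in sftp.listdir_attr():
#     if is_regular_file(entry.st_mode):
#         with sftp.file(entry.filename, mode="r") as file_obj:
#             bucket.put_object(Body=file_obj, Key=f"destdir/{entry.filename}")
```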
There is only one thing to take care of: Paramiko is not available by default in Glue, so in the job setup you need to point the Python library path to a paramiko wheel uploaded to S3.
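That job setup can also be scripted with the Glue API. Below is a sketch of the job definition for a pure-Python (non-Spark) shell job, where the wheel is passed through the --extra-py-files default argument; the job name, script location, wheel path, and IAM role ARN are all placeholders of mine, not values from this post:

```python
# Hypothetical Glue job definition; replace the placeholder ARN and
# S3 paths with your own.
job_definition = {
    "Name": "sftp-to-s3",
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder
    "Command": {
        "Name": "pythonshell",  # pure Python shell job, no Spark cluster
        "PythonVersion": "3",
        "ScriptLocation": "s3://my-glue-scripts/sftp_to_s3.py",  # placeholder
    },
    "DefaultArguments": {
        # Point Glue at the paramiko wheel uploaded to S3.
        "--extra-py-files": "s3://my-glue-scripts/paramiko.whl",  # placeholder
    },
}

# boto3.client("glue").create_job(**job_definition) would register the job;
# it is left commented out here so the sketch has no AWS dependency.
```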