Automated Unzipping of Google Cloud Storage files

Darpan Patel
4 min read · Aug 21, 2020

Welcome to the Data Analytics/Warehousing realm on Google Cloud Platform. I am sure you are taking advantage of the infrastructure and services that Google Cloud Platform provides, in a way that is most convenient and efficient for you.

Data warehousing pipelines on Google Cloud usually (for most use cases) start with ingestion of data into a landing/staging zone. Google Cloud Storage (GCS) is the service that typically serves as that landing/staging area. GCS holds all kinds of data and is the preferred choice for unstructured data such as CSV files or blobs.

Ingestion of data can be achieved in several ways.

  • If data is coming to GCP from on-premises and security is a requirement, the transfer usually happens over a VPN.
  • We can extract data from an API with code and write it to a bucket (a minimal sketch follows this list).
  • Some vendors also send out data from an automated email platform to a business email address that we use.
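
As an example of the second bullet, here is a minimal sketch of pulling data from an API and landing it in a bucket. The API URL and bucket name are placeholders, and it assumes the requests and google-cloud-storage libraries are available:

import requests
from google.cloud import storage

# Placeholder values; substitute your own API endpoint and landing bucket.
API_URL = "https://example.com/api/export.csv"
BUCKET = "<your-landing-bucket>"

def land_api_data():
    """Fetch data from an API and write it to the GCS landing zone."""
    payload = requests.get(API_URL, timeout=60).content
    storage.Client().bucket(BUCKET).blob("export.csv").upload_from_string(payload)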

With on-prem transfers and automated emails, data typically arrives in compressed form, either as .zip or .tar.gz. That makes sense: transmission is faster and the files are smaller. But we cannot use the data in its compressed form; we need to decompress the archive before the data inside can be used or passed to other services. I faced exactly this challenge with zipped files I was receiving from an external vendor.

Let’s move on to the solution and set up an automated pipeline that delivers the extracted content. There are two approaches you can take, since Google Cloud Storage has no native capability to extract zipped files. Both approaches are fully automated once set up and run in near real time (of course, if the files are large, processing takes longer).

Option 1 pipeline

A file drops into your Cloud Storage bucket, which triggers a Cloud Function on the object-creation event. Code in the Cloud Function (written in any supported programming language) extracts the archive and writes the contents back to Cloud Storage. By the way, Cloud Functions is a serverless compute platform on GCP that runs a piece of code when triggered. This works, but you have to write the extraction code yourself, and the compute capacity of a Cloud Function is limited (a minimal sketch of this approach follows). I chose the second approach, which is described below.
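
For completeness, here is a minimal sketch of what the Option 1 function could look like, assuming the google-cloud-storage client library and a placeholder output bucket. It downloads the archive into memory, extracts it with Python's zipfile module, and writes each member back to Cloud Storage:

import io
import zipfile
from google.cloud import storage

# Placeholder destination bucket for the extracted files.
OUTPUT_BUCKET = "<output-bucket-extracted-content>"

def unzip_gcs_object(event, context):
    """Triggered by object creation; extracts a .zip archive in memory."""
    client = storage.Client()
    blob = client.bucket(event["bucket"]).blob(event["name"])

    # The whole archive must fit in the function's memory, which is the
    # main limitation of this approach.
    archive = io.BytesIO(blob.download_as_bytes())

    output_bucket = client.bucket(OUTPUT_BUCKET)
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            output_bucket.blob(member).upload_from_string(zf.read(member))
            print(f"Extracted {member} to gs://{OUTPUT_BUCKET}/{member}")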

Option 2 pipeline

From the figure above, let me explain a bit about what the Cloud Dataflow service is. It is fully managed, serverless Apache Beam, ready to work for you, and it auto-scales as resources are required. You may be wondering whether you need to write code for Dataflow as well. Luckily, Google provides a few pre-built Dataflow templates that we can use, and Bulk Decompress Cloud Storage Files is the one we need. There is no way to automatically run or schedule a Dataflow template as a file arrives, nor is there a cron facility within the Dataflow service. Hence, we need the help of a Cloud Function: it is triggered when the object arrives and in turn launches our template through the Dataflow API, which does the work and writes the extracted data back to Google Cloud Storage. Below is sample Python code you can use in the Cloud Function to invoke the Dataflow job.

from googleapiclient.discovery import build

def main(event, context):
    """Triggered by a change to a Cloud Storage bucket.
    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    file = event
    print(f"Processing file: {file['name']}.")

    project = "<your-gcp-project-id>"
    job = "<unique-dataflow-jobname>"
    template = "gs://<path-to-bucket-storing-template>/Bulk_Decompress_GCS_Files"
    parameters = {
        "inputFilePattern": "gs://<bucketname-with-zip-file>/" + file['name'],
        "outputDirectory": "gs://<output-bucket-extracted-content>",
        "outputFailureFile": "gs://<temp-bucket-to-store-error>/failed.csv",
    }
    environment = {
        "tempLocation": "gs://<temp-bucket-location>/",
        "workerRegion": "us-central1",
        "maxWorkers": 2,
        "subnetwork": "regions/us-central1/subnetworks/default",
    }

    dataflow = build('dataflow', 'v1b3')
    request = dataflow.projects().templates().launch(
        projectId=project,
        gcsPath=template,
        body={
            "jobName": job,
            "parameters": parameters,
            "environment": environment,
        },
    )
    response = request.execute()
    print(response)
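
If you want to confirm that the launched job actually starts running, the launch response includes the created Dataflow job, and a small follow-up call against the same client can report its state. This is only a sketch assuming the v1b3 API response shape; adjust it if your response differs:

def get_job_state(dataflow, project, job_id):
    """Look up the current state of a launched Dataflow job."""
    job = dataflow.projects().jobs().get(projectId=project, jobId=job_id).execute()
    return job.get("currentState")  # e.g. JOB_STATE_RUNNING or JOB_STATE_DONE

# Example usage with the response from the launch call above:
# state = get_job_state(dataflow, project, response["job"]["id"])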

Below is requirements.txt:

google-api-python-client==1.7.11
oauth2client==4.1.3
google-auth-oauthlib==0.4.1

I hope this is helpful for your use case. I tried to find a good piece of code to unzip files from a serverless platform, but it turned out to be a bit tricky, so I wanted to share this in case it helps you. Feel free to drop me a message at darpan.3073@yahoo.com with any concerns or queries!


Darpan Patel

Technology enthusiast | Cloud and Data Engineer | Turning complex to simpler | Meditator | Curious | Learner | Helper