Python Developer / Data Engineer Resume
Detroit, MI
SUMMARY
- 5+ years of experience as an Application Developer in analytical programming using Python, PySpark, Django, Flask, AWS, GCP, and SQL.
- Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, the bq command-line utility, Dataproc, and Stackdriver.
- Strong knowledge of and experience with the Cloudera ecosystem (HDFS, YARN, Hive, Sqoop, Flume, HBase, Oozie, Kafka, Pig), data pipelines, and data analysis and processing with Hive SQL, Impala, Spark, and Spark SQL.
- Used Flume, Kafka, and Spark Streaming to ingest real-time and near-real-time data into HDFS.
- Analyzed data and provided insights with R and Python pandas.
- Hands-on experience architecting ETL transformation layers and writing Spark jobs to perform the processing.
- Good experience in software development in Python (libraries used: Beautiful Soup, NumPy, SciPy, Matplotlib, pandas DataFrames, NetworkX, urllib2, MySQLdb for database connectivity) and IDEs such as Sublime Text, Spyder, and PyCharm.
- Expertise in AWS resources such as EC2, S3, EMR, Athena, Redshift, Glue, VPC, ELB, AMI, SNS, RDS, IAM, Route 53, Auto Scaling, CloudFormation, CloudWatch, API Gateway, and Kinesis.
- Working experience with AWS (Amazon Web Services) cloud infrastructure, running AMI-based virtual machines on Elastic Compute Cloud (EC2).
- Experience with AWS Lambda, which runs code in response to events.
- Experienced with JSON-based RESTful web services and XML-based SOAP web services; also worked on various applications using Python IDEs such as Sublime Text and PyCharm.
- Worked with testing frameworks and tools such as unittest, pytest, and Bazel.
- Using Django Evolution and manual SQL modifications, modified Django models while retaining all data, with the site in production.
- Experienced in NoSQL technologies such as MongoDB and Cassandra, and in relational databases such as Oracle, SQLite, PostgreSQL, and MySQL.
- Developed CloudFormation templates and launched AWS Elastic Beanstalk for deploying, monitoring, and scaling web applications on platforms such as Docker and Python.
- Extensively worked with automation tools such as Jenkins, Artifactory, SonarQube, Chef, and Puppet for continuous integration and continuous delivery (CI/CD) and to implement end-to-end automation.
- Good experience working with version control systems such as Git, GitHub, and AWS CodeCommit.
- Experience using Apache Tomcat servers and Docker containers for deployment.
- Good working knowledge of issue-tracking tools such as Bugzilla and Jira.
- Hands-on experience in data mining and data warehousing using ETL tools, and proficient in building reports and dashboards in Tableau (BI tool).
PROFESSIONAL EXPERIENCE
Confidential - Detroit, MI
Python Developer / Data Engineer
Responsibilities:
- Build and architect multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinate tasks among the team.
- Design and architect the various layers of the data lake.
- Design star schemas in BigQuery.
- Analyze client data using Scala, Spark, and Spark SQL, and present the end-to-end data lake design to the team.
- Design transformation layers for the ETL written in Scala and Spark and distribute the work among the team.
- Keep the team motivated to deliver the project on time and work side by side with other members as part of the team.
- Design and develop Spark jobs in Scala to implement end-to-end data pipelines for batch processing.
- Load Salesforce data incrementally every 15 minutes into the BigQuery raw and UDM layers using SOQL, Google Dataproc, GCS buckets, Hive, Spark, Scala, Python, gsutil, and shell scripts (see the Salesforce sketch after this list).
- Use REST APIs with Python to ingest data from external sites into BigQuery.
- Build a program with Python and Apache Beam and execute it in Cloud Dataflow to run data validation between raw source files and BigQuery tables (see the validation sketch after this list).
- Build a configurable Scala and Spark based framework to connect to common data sources such as MySQL, Oracle, Postgres, SQL Server, Salesforce, and BigQuery and load the data into BigQuery.
- Monitor BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across all environments.
- Open SSH tunnels to Google Dataproc to access the YARN resource manager and monitor Spark jobs.
- Submit Spark jobs using gsutil and spark-submit and have them executed on the Dataproc cluster.
- Write a Python program to maintain raw file archival in the GCS bucket (see the archival sketch after this list).
- Analyze various types of raw files such as JSON, CSV, and XML with Python using pandas, NumPy, etc.
- Write Scala programs for Spark transformations in Dataproc.
- Use Cloud Functions with Python to load data into BigQuery when CSV files arrive in the GCS bucket (see the Cloud Function sketch after this list).
- Write a program to download a SQL dump from their equipment maintenance site and load it into a GCS bucket; on the other side, load this SQL dump from the GCS bucket into MySQL (hosted in Google Cloud SQL) and load the data from MySQL to BigQuery using Python, Scala, Spark, and Dataproc.
- Process and load bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python (see the streaming sketch after this list).
- Create firewall rules to access Google Dataproc from other machines.
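A much-simplified, Python-only sketch of the incremental Salesforce pull referenced above; the production pipeline ran on Dataproc with Spark/Scala, and the credentials, SOQL query, and target table here are hypothetical placeholders.

```python
# Much-simplified sketch of a 15-minute incremental Salesforce pull into BigQuery.
# Credentials, SOQL query, and table names are hypothetical placeholders.
from datetime import datetime, timedelta, timezone
from simple_salesforce import Salesforce
from google.cloud import bigquery

RAW_TABLE = "example-project.raw_layer.sf_accounts"   # placeholder


def incremental_load():
    sf = Salesforce(username="user@example.com", password="...",
                    security_token="...")              # placeholder credentials
    since = (datetime.now(timezone.utc) - timedelta(minutes=15)).strftime("%Y-%m-%dT%H:%M:%SZ")
    soql = f"SELECT Id, Name, LastModifiedDate FROM Account WHERE LastModifiedDate > {since}"
    records = sf.query_all(soql)["records"]
    # Drop the simple-salesforce metadata key before loading.
    rows = [{k: v for k, v in rec.items() if k != "attributes"} for rec in records]

    if rows:
        client = bigquery.Client()
        job = client.load_table_from_json(
            rows, RAW_TABLE,
            job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND",
                                              autodetect=True),
        )
        job.result()  # wait for the load job to finish


if __name__ == "__main__":
    incremental_load()
```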
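A minimal Apache Beam sketch of the Dataflow validation job referenced above, comparing row counts between a raw GCS file and a BigQuery table; bucket, table, and project names are assumed placeholders.

```python
# Minimal sketch: compare row counts between a raw GCS file and a BigQuery table.
# Bucket, dataset, table, and project names are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

RAW_FILE = "gs://example-raw-bucket/salesforce/accounts.csv"   # placeholder
BQ_TABLE = "example-project:raw_layer.accounts"                # placeholder


def run():
    options = PipelineOptions(runner="DataflowRunner", project="example-project",
                              region="us-central1",
                              temp_location="gs://example-temp/tmp")
    with beam.Pipeline(options=options) as p:
        raw_count = (
            p
            | "ReadRawFile" >> beam.io.ReadFromText(RAW_FILE, skip_header_lines=1)
            | "CountRaw" >> beam.combiners.Count.Globally()
        )
        bq_count = (
            p
            | "ReadBQ" >> beam.io.ReadFromBigQuery(table=BQ_TABLE)
            | "CountBQ" >> beam.combiners.Count.Globally()
        )
        # Join the two singleton counts and flag any mismatch.
        (
            raw_count
            | "Compare" >> beam.Map(
                lambda raw, bq: {"raw": raw, "bigquery": bq, "match": raw == bq},
                bq=beam.pvalue.AsSingleton(bq_count),
            )
            | "WriteResult" >> beam.io.WriteToText("gs://example-temp/validation/result")
        )


if __name__ == "__main__":
    run()
```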
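A short sketch of the raw-file archival routine referenced above, using the google-cloud-storage client; bucket and prefix names are assumptions.

```python
# Sketch: move processed raw files into a dated archive prefix in GCS.
# Bucket and prefix names are hypothetical placeholders.
from datetime import date
from google.cloud import storage

BUCKET = "example-raw-bucket"        # placeholder
SOURCE_PREFIX = "landing/"           # placeholder
ARCHIVE_PREFIX = f"archive/{date.today():%Y/%m/%d}/"


def archive_raw_files():
    client = storage.Client()
    bucket = client.bucket(BUCKET)
    for blob in client.list_blobs(BUCKET, prefix=SOURCE_PREFIX):
        new_name = ARCHIVE_PREFIX + blob.name[len(SOURCE_PREFIX):]
        # Copy into the archive prefix, then remove the original object.
        bucket.copy_blob(blob, bucket, new_name)
        blob.delete()


if __name__ == "__main__":
    archive_raw_files()
```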
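A minimal sketch of the GCS-triggered Cloud Function that loads arriving CSV files into BigQuery; the target table is an assumed placeholder.

```python
# Sketch of a GCS-triggered Cloud Function (Python) that loads an arriving CSV
# into BigQuery. The target dataset/table is a hypothetical placeholder.
from google.cloud import bigquery

TARGET_TABLE = "example-project.raw_layer.sales_csv"   # placeholder


def load_csv_to_bq(event, context):
    """Background Cloud Function triggered by google.storage.object.finalize."""
    bucket = event["bucket"]
    name = event["name"]
    if not name.endswith(".csv"):
        return  # ignore non-CSV objects

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    uri = f"gs://{bucket}/{name}"
    load_job = client.load_table_from_uri(uri, TARGET_TABLE, job_config=job_config)
    load_job.result()  # wait for the load to finish
```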
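A streaming-side sketch of the Pub/Sub-to-BigQuery load via Cloud Dataflow; the topic, table, and schema are illustrative assumptions.

```python
# Sketch: stream JSON messages from a Pub/Sub topic into BigQuery with Beam/Dataflow.
# Topic, table, and schema below are hypothetical placeholders.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

TOPIC = "projects/example-project/topics/example-events"    # placeholder
TABLE = "example-project:udm_layer.events"                   # placeholder
SCHEMA = "event_id:STRING,event_ts:TIMESTAMP,payload:STRING"


def run():
    options = PipelineOptions(streaming=True, runner="DataflowRunner",
                              project="example-project", region="us-central1",
                              temp_location="gs://example-temp/tmp")
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
            | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteBQ" >> beam.io.WriteToBigQuery(
                TABLE,
                schema=SCHEMA,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()
```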
Confidential - Phoenix, AZ
Application Developer/ Data Engineer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Managed jobs using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
- Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which consumes data from Kafka in near real time and persists it into Cassandra (see the streaming sketch after this list).
- Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
- Developed Spark scripts using Scala shell commands as per requirements.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark 1.6 for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Designed, developed, and implemented performant ETL pipelines using the Python API of Apache Spark (PySpark) on AWS (see the PySpark sketch after this list).
- Developed Spark and MapReduce jobs to parse the JSON and XML data.
- Integrated data storage solutions with Spark, especially AWS S3 object storage.
- Performance-tuned existing PySpark scripts.
- Performance-tuned Spark applications by setting the right batch interval, the correct level of parallelism, and appropriate memory settings.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze logs produced by the Spark cluster.
- Handled large datasets using partitions, Spark in-memory capabilities, broadcast variables, and effective and efficient joins and transformations during the ingestion process itself.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment, with both traditional and non-traditional source systems as well as RDBMS and NoSQL data stores, for data access and analysis.
- Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, in order to adopt the former in the project.
- Developed Hive queries to process the data and generate data cubes for visualization.
- Implemented schema extraction for Parquet and Avro file formats in Hive.
- Good experience with Talend Open Studio for designing ETL jobs for data processing.
- Implemented partitioning, dynamic partitions, and bucketing in Hive (see the Hive sketch after this list).
- Imported data from AWS S3 into Spark RDDs to perform transformations and actions on those RDDs.
- Used Spark SQL to load data, create schema RDDs, and handle structured data.
- Worked with various file formats and compression codecs such as Avro, Parquet, and Snappy.
- Experience with cloud technologies: AWS (Lambda, S3, CFTs, CloudWatch rules, Redshift, EC2, EBS, IAM, API Gateway, CloudFormation) and Snowflake.
- Integrated services such as GitHub, AWS CodePipeline, Jenkins, and AWS Elastic Beanstalk to create a deployment pipeline.
- Created Python (Boto3) scripts that integrated with the Amazon EC2 API to control instance operations (see the Boto3 sketch after this list).
- Designed, built, and coordinated an automated build and release CI/CD process using GitLab, Jenkins, and Puppet on hybrid IT infrastructure.
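A simplified PySpark Structured Streaming analogue of the Kafka-to-Cassandra flow referenced above (the original used Spark 1.6 Streaming in Scala); broker, topic, keyspace, and table names are assumptions, and the Cassandra write relies on the DataStax spark-cassandra-connector.

```python
# Simplified PySpark Structured Streaming analogue of the Kafka -> Cassandra flow.
# Broker, topic, keyspace, and table names are hypothetical; the Kafka source needs
# the spark-sql-kafka package and the Cassandra write the spark-cassandra-connector.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (SparkSession.builder
         .appName("learner-model-stream")
         .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder
         .getOrCreate())

schema = StructType([
    StructField("learner_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder
          .option("subscribe", "learner-events")               # placeholder
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))


def write_to_cassandra(batch_df, batch_id):
    # Persist each micro-batch into the Cassandra table.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="learning", table="learner_events")     # placeholders
     .mode("append")
     .save())


query = (events.writeStream
         .option("checkpointLocation", "/tmp/learner-stream-checkpoint")
         .foreachBatch(write_to_cassandra)
         .start())
query.awaitTermination()
```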
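A minimal PySpark sketch of the S3-based ETL and Spark SQL pattern referenced above; bucket paths and column names are assumed.

```python
# Minimal PySpark ETL sketch: read JSON from S3, transform, and write Parquet back.
# Bucket paths and column names are hypothetical; s3a access assumes hadoop-aws.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-etl-sketch").getOrCreate()

raw = spark.read.json("s3a://example-raw-bucket/events/")       # placeholder path

cleaned = (raw
           .filter(F.col("event_type").isNotNull())
           .withColumn("event_date", F.to_date("event_ts")))

# Register as a temp view so transformations can also be expressed in Spark SQL.
cleaned.createOrReplaceTempView("events")
daily = spark.sql("""
    SELECT event_date, event_type, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date, event_type
""")

(daily.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("s3a://example-curated-bucket/daily_event_counts/"))  # placeholder path
```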
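A sketch of the Hive dynamic partitioning and bucketing referenced above, expressed through Spark SQL with Hive support; database, table, and column names are assumptions.

```python
# Sketch: dynamic partitioning and bucketing in Hive via Spark SQL.
# Database, table, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Allow non-static (dynamic) partition inserts.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.orders_part (
        order_id STRING, customer_id STRING, amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
""")

# Dynamic-partition insert: the partition value comes from the SELECT itself.
spark.sql("""
    INSERT INTO TABLE analytics.orders_part PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM analytics.orders_staging
""")

# A bucketed variant clusters rows by customer_id for more even joins and sampling.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.orders_bucketed (
        order_id STRING, customer_id STRING, amount DOUBLE, order_date STRING
    )
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS PARQUET
""")
```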
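A short Boto3 sketch of the EC2 instance-control scripting referenced above; the region and tag filter are assumptions.

```python
# Sketch: control EC2 instance operations with Boto3.
# The region and tag filter below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region


def instances_by_tag(tag_key, tag_value):
    """Return the IDs of instances carrying the given tag."""
    response = ec2.describe_instances(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]}]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


def stop_environment(tag_value="dev"):
    ids = instances_by_tag("Environment", tag_value)
    if ids:
        ec2.stop_instances(InstanceIds=ids)


def start_environment(tag_value="dev"):
    ids = instances_by_tag("Environment", tag_value)
    if ids:
        ec2.start_instances(InstanceIds=ids)


if __name__ == "__main__":
    stop_environment("dev")
```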