Data Engineer/AWS Cloud Resume
New York
PROFESSIONAL EXPERIENCE:
Confidential, New York
Data Engineer/AWS Cloud
Responsibilities:
- Performed data profiling and generated reports on missing and inconsistent data.
- Designed the infrastructure for the ELK clusters and used the ELK stack (Elasticsearch, Logstash, and Kibana) to implement name-search patterns for a customer.
- Developed data models for the data marts in the Erwin tool.
- Worked on PySpark DataFrames, Datasets, and RDDs, applying multiple transformations and aggregations to build data pipelines (see the PySpark sketch after this list).
- Worked closely with the Kafka admin team to set up Kafka clusters in the QA and production environments, and used Spring Kafka API calls to process messages reliably on the cluster.
- Developed new and existing modules in Scala while working with developers across the globe.
- Converted HiveQL/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Developed Scala and Python software in an agile environment using continuous integration.
- Ran Apache Hadoop on Amazon Elastic MapReduce (EMR) clusters backed by EC2.
- Migrated MapReduce programs to Spark transformations using Spark and Scala.
- Analyzed SQL scripts and designed solutions to implement them using PySpark.
- Used Jenkins extensively to implement continuous integration and continuous deployment (CI/CD); set up full CI/CD pipelines so that every developer commit goes through the standard software lifecycle and is thoroughly tested before it reaches production.
- Created Hive tables and loaded and analyzed data using Hive queries.
- Developed DataStage jobs to process data and generate data cubes for visualization.
- Implemented schema extraction for Parquet and Avro file formats in Hive, and implemented partitioning, dynamic partitions, and buckets in Hive (see the Hive DDL sketch after this list).
- Developed a Java Spring Boot application to ingest AWS S3 objects into Kafka and implemented a Spark Scala consumer to read the data from Kafka (a streaming-read sketch follows this list).
- Built S3 buckets, managed S3 bucket policies, and used S3 and Glacier for storage and backup on AWS.
- Integrated GitHub, AWS CodePipeline, Jenkins, and AWS Elastic Beanstalk to create a deployment pipeline.
- Proficient in building data pipelines and related ETL pipelines within large-scale distributed frameworks and associated query/processing technologies such as Spark, Impala, Hive, and Presto.
- Built a Spark pipeline for real-time analysis of active users across different countries.
- Worked with the data science team to build datasets and pipelines for their models.
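A minimal PySpark sketch of the DataFrame-style transformations and aggregations referenced above; the paths, schema, and column names are illustrative assumptions, not the original pipeline.

```python
# Minimal PySpark sketch: profiling-style checks plus an aggregation step.
# Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer-profile-pipeline").getOrCreate()

# Read raw customer events (hypothetical location and schema).
events = spark.read.parquet("s3://example-bucket/raw/customer_events/")

# Flag missing or inconsistent records, in the spirit of the profiling reports.
profiled = events.withColumn(
    "is_missing_email", F.col("email").isNull() | (F.trim("email") == "")
)

# Aggregate per customer and day for a downstream data mart.
daily_summary = (
    profiled
    .filter(F.col("event_date").isNotNull())
    .groupBy("customer_id", "event_date")
    .agg(
        F.count("*").alias("event_count"),
        F.sum(F.col("is_missing_email").cast("int")).alias("missing_email_count"),
    )
)

daily_summary.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/curated/customer_daily_summary/"
)
```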
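An illustrative sketch of the Hive partitioning, dynamic-partition, and bucketing work mentioned above, with the HiveQL issued through spark.sql; the database, table, and column names are assumed for the example.

```python
# Hive DDL sketch: a partitioned table, a bucketed table, and a
# dynamic-partition insert. All object names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Parquet-backed table partitioned by load date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.customer_events (
        customer_id STRING,
        event_type  STRING,
        event_ts    TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
""")

# Bucketed lookup table (DDL only; bucketing column is illustrative).
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.customer_lookup (
        customer_id   STRING,
        customer_name STRING
    )
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS PARQUET
""")

# Dynamic-partition insert from a staging table into the partitioned table.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE analytics.customer_events PARTITION (load_date)
    SELECT customer_id, event_type, event_ts, load_date
    FROM staging.customer_events_raw
""")
```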
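The Kafka consumer above was written in Spark Scala; the sketch below shows an equivalent Structured Streaming read in PySpark, assuming the spark-sql-kafka connector is on the classpath and using placeholder brokers, topic, and message schema.

```python
# PySpark Structured Streaming sketch of a Kafka consumer (the original was
# Spark Scala). Requires the spark-sql-kafka connector package; brokers,
# topic, paths, and the JSON schema are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-s3-object-consumer").getOrCreate()

# Hypothetical schema of the JSON messages produced by the S3-to-Kafka ingester.
schema = StructType([
    StructField("bucket", StringType()),
    StructField("key", StringType()),
    StructField("ingested_at", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "s3-object-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; decode and parse the JSON payload.
parsed = (
    raw.selectExpr("CAST(value AS STRING) AS json_value")
    .select(F.from_json("json_value", schema).alias("event"))
    .select("event.*")
)

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/s3_object_events/")
    .option("checkpointLocation", "hdfs:///checkpoints/s3_object_events/")
    .start()
)
query.awaitTermination()
```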
Confidential
Big Data Engineer/Azure
Responsibilities:
- Worked with R, SPSS, and Python to develop neural network algorithms and cluster analyses.
- Developed ETL data pipelines for multiple storage solutions, including distributed platforms such as Hadoop.
- Managed and prioritized multiple assignments.
- Experienced in troubleshooting and configuring Azure VMs and related services such as Azure Data Factory and Azure Data Lake.
- Expertise in building Azure-native enterprise applications and migrating applications from on-premises environments to Azure.
- Developed data pipelines with Snowflake to extract data from weblogs and store it in HDFS.
- Implemented multiple data pipeline DAGs and maintenance DAGs in Airflow orchestration.
- Built data pipelines using Airflow and SnowSQL to compute business KPIs on a daily basis (an Airflow DAG sketch follows this list).
- Formulated procedures to integrate R programs with data sources and delivery systems.
- Implemented a Spark ETL pipeline to prepare analytical datasets from the ingested raw data.
- Wrote Hive and Spark-Scala scripts used as part of DI/DQ checks and other post-validation processes.
- Worked extensively on performance tuning of Spark jobs and Hive queries.
- Created an SFTP ingestion framework using the Python Paramiko library to ingest data from multiple vendors into HDFS (a Paramiko sketch follows this list).
- Developed an Airflow hook and operator from scratch to connect to the Google Ad Manager API and pull advertising data.
- Used GoldenGate Kafka adapters to write data to Kafka clusters.
- Worked on R packages to interface with the Caffe deep learning framework and performed statistical analyses in R.
- Developed shell and Python scripts to automate routine pipeline tasks.
- Developed Oozie coordinator workflows to automate jobs.
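A minimal Airflow DAG sketch for the daily KPI computation described above, assuming the apache-airflow-providers-snowflake package and a preconfigured snowflake_default connection; the KPI tables and SQL are illustrative.

```python
# Daily KPI DAG sketch using the Snowflake provider operator.
# Connection id, schedules, tables, and SQL are assumptions for illustration.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_business_kpis",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",  # run once a day at 06:00
    catchup=False,
    default_args=default_args,
) as dag:

    compute_active_users = SnowflakeOperator(
        task_id="compute_active_users",
        snowflake_conn_id="snowflake_default",
        sql="""
            INSERT INTO analytics.daily_active_users
            SELECT CURRENT_DATE, country, COUNT(DISTINCT user_id)
            FROM raw.weblog_events
            WHERE event_date = CURRENT_DATE - 1
            GROUP BY country;
        """,
    )

    compute_revenue = SnowflakeOperator(
        task_id="compute_daily_revenue",
        snowflake_conn_id="snowflake_default",
        sql="""
            INSERT INTO analytics.daily_revenue
            SELECT CURRENT_DATE, SUM(order_total)
            FROM raw.orders
            WHERE order_date = CURRENT_DATE - 1;
        """,
    )

    compute_active_users >> compute_revenue
```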
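A sketch of the Paramiko-based SFTP ingestion described above, under the assumption that files are staged locally and then pushed to HDFS with the hdfs CLI; host, credentials, and paths are placeholders.

```python
# SFTP-to-HDFS ingestion sketch using Paramiko. All hosts, key paths, and
# directories are placeholders.
import os
import subprocess

import paramiko

SFTP_HOST = "sftp.vendor.example.com"
SFTP_USER = "ingest_user"
REMOTE_DIR = "/outgoing/daily"
LOCAL_STAGING = "/tmp/vendor_staging"
HDFS_TARGET = "/data/raw/vendor_feeds"


def ingest_vendor_files():
    os.makedirs(LOCAL_STAGING, exist_ok=True)

    # Connect to the vendor SFTP server with key-based auth.
    key = paramiko.RSAKey.from_private_key_file("/home/etl/.ssh/id_rsa")
    transport = paramiko.Transport((SFTP_HOST, 22))
    transport.connect(username=SFTP_USER, pkey=key)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        for filename in sftp.listdir(REMOTE_DIR):
            local_path = os.path.join(LOCAL_STAGING, filename)
            sftp.get(f"{REMOTE_DIR}/{filename}", local_path)
            # Land the file in HDFS; assumes the hdfs client is on PATH.
            subprocess.run(
                ["hdfs", "dfs", "-put", "-f", local_path, HDFS_TARGET],
                check=True,
            )
    finally:
        sftp.close()
        transport.close()


if __name__ == "__main__":
    ingest_vendor_files()
```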
Confidential
Sr. Data Engineer / Spark Developer
Responsibilities:
- Teamed up with architects to design a Spark model for the existing MapReduce model.
- Helped the Business Insights team with statistical predictions, business intelligence, and data science efforts.
- Developed scripts and UDFs using both Spark SQL and Spark-Scala for aggregation operations (a PySpark sketch follows this list).
- Experienced in implementing Spark RDD/DataFrame transformations and actions for business analysis, and worked with Spark accumulators and broadcast variables.
- Wrote Python and shell scripts for Jenkins to push build and commit information to ServiceNow.
- Created custom real-time reports and dashboards in ServiceNow for Jenkins and Tenacity metrics using reports and performance analytics.
- Used Cloudera Manager to monitor the health of jobs running on the cluster.
- Worked closely with the App Support team on production deployments and on scheduling jobs in TIDAL/CTL-M.
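The UDF, accumulator, and broadcast-variable work above was done in Spark SQL and Spark-Scala; the sketch below illustrates the same ideas in PySpark with an invented lookup table and columns.

```python
# PySpark sketch of a UDF that uses a broadcast lookup table and an
# accumulator, feeding an aggregation. Data and names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-broadcast-accumulator-demo").getOrCreate()
sc = spark.sparkContext

# Small reference table shipped to every executor as a broadcast variable.
country_names = sc.broadcast({"US": "United States", "IN": "India", "DE": "Germany"})

# Accumulator to count rows whose country code is not in the lookup.
unknown_codes = sc.accumulator(0)


@F.udf(returnType=StringType())
def resolve_country(code):
    name = country_names.value.get(code)
    if name is None:
        unknown_codes.add(1)
        return "UNKNOWN"
    return name


orders = spark.createDataFrame(
    [(1, "US", 120.0), (2, "IN", 80.5), (3, "BR", 42.0)],
    ["order_id", "country_code", "order_total"],
)

# Aggregation over the resolved country name.
revenue_by_country = (
    orders.withColumn("country", resolve_country("country_code"))
    .groupBy("country")
    .agg(F.sum("order_total").alias("total_revenue"))
)

revenue_by_country.show()
print("Rows with unknown country codes:", unknown_codes.value)
```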
Confidential
Hadoop Developer
Responsibilities:
- Worked with business analysts to capture business domain details and prepared documentation.
- Optimized long-running Hive ETL jobs by rewriting Hive SQL, restructuring Hive tables, and changing the underlying file formats.
- Migrated existing analytical business models written in Hive to PySpark and developed new models directly in PySpark; output from these models is used to generate reports in Tableau.
- Migrated existing data pipelines to the cloud and developed new applications by choosing the right AWS components.
- Sourced thousands of S3 JSON objects into Spark SQL and created Hive external tables (see the sketch after this list).
- Used Bitbucket and Jenkins for the CI/CD process and exported data to Snowflake for Tableau dashboards.
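A sketch of sourcing S3 JSON objects into Spark SQL and exposing them through a Hive external table, as described above; the bucket, columns, and table names are placeholders.

```python
# S3 JSON -> Spark SQL -> Hive external table sketch. Paths, columns, and
# table names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-json-to-hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the raw JSON objects directly from S3 (schema inferred).
events = spark.read.json("s3a://example-bucket/raw/events/*.json")

# Keep a query-friendly subset and persist it as Parquet.
curated = events.select("event_id", "event_type", "event_ts")  # illustrative columns
curated.write.mode("overwrite").parquet("s3a://example-bucket/curated/events/")

# Register a Hive external table over the curated location so downstream
# Tableau extracts and Hive users can query it.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
        event_id   STRING,
        event_type STRING,
        event_ts   TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3a://example-bucket/curated/events/'
""")
```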