
Sr Data Engineer Resume


Atlanta, GA

SUMMARY

  • 7+ years of experience as a Big Data Engineer using languages such as SQL, Python, Scala, and Java on AWS, with tools such as Spark, Hadoop, and Kafka.
  • Experience with the Matplotlib, NumPy, Seaborn, and Pandas Python libraries.
  • Skilled in programming with the MapReduce framework and the Hadoop ecosystem.
  • Excellent working knowledge of the HDFS filesystem and Hadoop daemons such as ResourceManager, NameNode, Secondary NameNode, DataNode, and NodeManager.
  • Experience with the Hadoop 2.0 YARN architecture and developing YARN applications on it.
  • Experience working on Hortonworks, Cloudera, and MapR distributions.
  • Worked with different file formats such as TextFile, SequenceFile, Avro, ORC, and Parquet for Hive querying and processing.
  • Hands-on experience with major components in the Hadoop ecosystem such as MapReduce, HDFS, YARN, Hive, Pig, HBase, Sqoop, Oozie, and Cassandra.
  • Experience working on Spark and Spark Streaming.
  • Experience in setting up Hadoop clusters on cloud platforms like AWS.
  • Work experience with cloud infrastructure such as Amazon Web Services (AWS) EC2 and S3.
  • Exposure to Apache Kafka for building data pipelines of logs as streams of messages using producers and consumers.
  • Implementing and orchestrating data pipelines using Oozie and Airflow.
  • Good knowledge of Apache Hadoop cluster planning, including choosing the hardware and operating systems to host an Apache Hadoop cluster.
  • Worked on developing ETL processes to load data from multiple data sources into HDFS.
  • Experience writing PySpark scripts on the Spark framework for high-speed data analytics (see the sketch at the end of this list).
  • Experience in data cleansing, data transformation, and building ETL pipelines for both static and streaming data.
  • Experience in designing, developing, executing, and maintaining data extraction, transformation, and loading for multiple corporate operational data store, data warehouse, and data mart systems.
  • Experience in AWS CloudFront, including creating and managing distributions to provide access to an S3 bucket or an HTTP server running on EC2 instances.
  • Experience in developing custom UDFs in Java to extend Hive.
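
A minimal PySpark sketch of the kind of cleansing and ETL scripting referenced above; the S3 paths and column names (event_id, event_ts) are illustrative assumptions, not taken from any specific engagement.

    # Read raw CSV, apply basic cleansing, and write partitioned Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cleanse_events").getOrCreate()

    # hypothetical raw landing zone
    raw = spark.read.option("header", "true").csv("s3a://example-bucket/raw/events/")

    cleaned = (
        raw.dropDuplicates(["event_id"])                     # remove duplicate events
           .filter(F.col("event_ts").isNotNull())            # drop rows missing a timestamp
           .withColumn("event_date", F.to_date("event_ts"))  # derive a partition column
    )

    cleaned.write.mode("overwrite").partitionBy("event_date") \
        .parquet("s3a://example-bucket/curated/events/")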

TECHNICAL SKILLS

Big Data frameworks: HDFS, Spark, MapReduce, Pig, Hive, Sqoop, Oozie, Kafka, Cassandra, Spark Streaming, Spark SQL

Programming languages: Python, Core Java, SQL, Scala, MapReduce.

Cloud Technologies: Amazon Web Services, GCP

Databases: MySQL, Oracle, SQL server

Operating Systems: Windows, Linux (Ubuntu, CentOS)

File Formats: CSV, ORC, JSON, Sequence, Delimited/Fixed Width

Development methodologies: Agile, Waterfall

Web Technologies: HTML, XML, JavaScript, JSON.

Version Tools: Git and CVS

Others: Putty, WinSCP, Azure DevOps.

PROFESSIONAL EXPERIENCE

Confidential

Sr Data Engineer

Responsibilities:

  • Migrated DAGs from one Airflow environment to another.
  • Performed data validation using Tableau reports.
  • Created two GCP instances, one for development and the other for production.
  • Designed, architected, and implemented scalable cloud-based web applications using AWS and GCP.
  • Developed Spark applications using the Spark Core, Spark SQL, and Spark Streaming APIs.
  • Used a Git repository to host the code.
  • Worked with the Scrum team to deliver agreed user stories on time every sprint.
  • Explored Dataplex Data Quality.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Designed, built, and managed ELT data pipelines leveraging Airflow, Python, dbt, Stitch Data, and GCP solutions (see the sketch at the end of this list).
  • Stored Spark Datasets in Snowflake relational databases for analytics reporting.
  • Worked closely with the Kafka admin team on Kafka cluster setup in the QA and production environments.
  • Used Kibana and Elasticsearch to identify Kafka message failure scenarios.
  • Set up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configuration, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
  • Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
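
An illustrative Airflow DAG sketch for the kind of ELT orchestration described above; the DAG name, dbt project path, schedule, and task layout are assumptions for the example, not the actual pipeline.

    # Hypothetical Airflow 2.x DAG: run dbt models daily, then run dbt tests.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="elt_dbt_daily",             # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="cd /opt/dbt/project && dbt run",   # assumed project location
        )
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command="cd /opt/dbt/project && dbt test",
        )
        dbt_run >> dbt_test                 # tests only run after a successful dbt run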

Confidential, Atlanta, GA

SR Data Engineer

Responsibilities:

  • Implemented AWS solutions using EC2, S3, and load balancers.
  • Installed applications on AWS EC2 instances and configured storage on S3 buckets.
  • Responsible for migrating data from legacy systems to the AWS cloud, which was running on the XMBI data lake.
  • Fetched data from various upstream applications and made it available for reporting in Redshift.
  • Developed real-time streaming applications integrated with Kafka and NiFi to handle high-volume, high-velocity data streams in a scalable, reliable, and fault-tolerant manner for Confidential campaign management analytics.
  • Performed continuous data loads using Snowpipe with appropriate file sizing, and loaded structured and semi-structured data into Snowflake using web interfaces.
  • Implemented ETL jobs using NiFi to import data from multiple databases such as Exadata, Teradata, and MS SQL into HDFS for business intelligence (MicroStrategy and SAS), visualization, and user reporting.
  • Used AWS services including RDS, networking, Route 53, IAM, S3, EC2, EBS, and VPC, and administered AWS resources using the Console and CLI.
  • Strong understanding of AWS (Amazon Web Services), S3, Amazon RDS, and Apache Spark RDD processes and concepts; developed logical data architecture with adherence to enterprise architecture.
  • Developed Airflow DAGs in Python using the Airflow libraries.
  • Reviewed explain plans for SQL queries in Redshift.
  • Stored and loaded data from HDFS to Amazon S3 and backed up the namespace data.
  • Worked with data delivery teams to set up new Hadoop users, including setting up Linux users, creating Kerberos principals, and testing HDFS and Hive access.
  • Involved in creating Hadoop streaming jobs using Python.
  • Enabled concurrent access to Hive tables with shared and exclusive locking, configured in Hive with the help of the ZooKeeper implementation in the cluster.
  • Utilized Airflow to schedule, automatically trigger, and execute data ingestion pipelines.
  • Worked on various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Worked on importing and exporting data from Oracle and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
  • Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.
  • Implemented PySpark and Spark SQL for faster testing and processing of data.
  • Developed multiple MapReduce jobs in Java for data cleaning.
  • Developed Hive UDF to parse the staged raw data to get the Hit Times of the claims from a specific branch for a particular insurance type code.
  • Configured the settings that allow Airflow to communicate with its PostgreSQL database.
  • Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • Used Scala to write the code for all the use cases in Spark and Spark SQL.
  • Expertise in implementing Spark and Scala applications using higher-order functions for both batch and interactive analysis requirements.
  • Implemented Spark batch jobs.
  • Created external tables with partitions using Hive, AWS Athena, and Redshift.
  • Developed data warehouse model in Snowflake for over 100 datasets.
  • Designed and implemented a fully operational production grade large scale data solution on Snowflake Data Warehouse.
  • Developed a NiFi workflow to pick up multiple retail files from an FTP location and move them to HDFS on a daily basis.
  • Worked with Spark core, Spark Streaming and Spark SQL module of Spark.
  • Worked on reading multiple data formats on HDFS using PySpark (see the sketch at the end of this list).
  • Ran many performance tests using the cassandra-stress tool to measure and improve the read and write performance of the cluster.
  • Created data model for structuring and storing the data efficiently. Implemented partitioning and bucketing of tables in Cassandra.
  • Worked on migrating MapReduce programs into PySpark transformations.
  • Built wrapper shell scripts to hold Oozie workflow.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
  • Loaded data into Amazon Redshift and used AWS CloudWatch to collect and monitor AWS RDS instances within Confidential.
  • Worked on distributed/cloud computing (MapReduce/Hadoop, Pig, HBase, Avro, ZooKeeper, etc.) and Amazon Web Services (S3, EC2, EMR, etc.).
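
A hedged PySpark sketch of reading multiple formats from HDFS and landing them in a Hive table, in the spirit of the bullets above; the HDFS paths, table name, and the Spark 3.1+ unionByName option are assumptions.

    # Illustrative only: combine ORC and JSON inputs from HDFS into one Hive-managed table.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("claims_ingest")
        .enableHiveSupport()       # needed for saveAsTable against the Hive metastore
        .getOrCreate()
    )

    orc_df = spark.read.orc("hdfs:///data/claims/orc/")      # hypothetical locations
    json_df = spark.read.json("hdfs:///data/claims/json/")

    # allowMissingColumns requires Spark 3.1+; older versions need manual column alignment
    combined = orc_df.unionByName(json_df, allowMissingColumns=True)

    combined.write.mode("append").format("parquet").saveAsTable("analytics.claims_staged")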

Confidential, SFO, CA

SR Data Engineer

Responsibilities:

  • Developed and implemented large-scale data pipelines from source identification through completion; created Hive tables and loaded data into them incrementally and in full using dynamic partitioning.
  • Developed a Spark Streaming job to consume data from Kafka topics of different source systems and push the data into HDFS locations (see the sketch at the end of this list).
  • Built a data pipeline to read data from Kafka topics and write into tables in Snowflake.
  • Worked on shell scripting and automated the Spark jobs using schedulers.
  • Maintained and developed complex SQL queries, views, functions and reports that qualify customer requirements on Snowflake.
  • Implemented NiFi to Spark Streaming directly, without using Kafka internally, to provide various options to the client in a single Confidential.
  • Analyzed incident, change, and job data from Snowflake and created a dependency tree-based model of incident occurrence for every internal application service.
  • Helped business users minimize manual work by creating Python scripts (LDA sourcing, OneLake, SDP, S3, Databricks, Databench, Snowflake) to collect cloud metrics and ease their efforts.
  • Developed database scripts and code and designed table structures; performed data ingestion and migration from databases, and ETL transformations to load data into Hadoop and other RDBMSs.
  • Wrote Redshift UDFs and Lambda functions using Python for custom data transformation and ETL.
  • Used AWS Redshift, S3, Spectrum, and Athena services to query large amounts of data stored on S3 and create a virtual data lake without having to go through an ETL process.
  • Worked on Migrating jobs from NiFi development to Pre-PROD and Production cluster.
  • Scheduled different Snowflake jobs using NiFi.
  • Used NiFi to ping Snowflake to keep the client session alive.
  • Developed HQL and Spark jobs to perform transformations and joins based on the business requirements provided in the mapping document.
  • Worked on developing a multi-threaded scheduler, which allows faster scheduling cycles and fault tolerance while importing DAG files.
  • Implemented Unix shell scripts to automate various ETL jobs, using crontab as the scheduler.
  • Developed shell scripts and Spark wrappers as part of framework development to load data in batch and real-time processes.
  • Performed performance analysis and optimization for data processing; installed and performed data quality checks.
  • Experience in building ETL pipelines using NiFi.
  • Developed SQL and Spark jobs to perform transformations and joins based on the business requirements provided in the mapping document.
  • Prepared workflows using the Automation Engine tool to schedule data processing into Hive, Hadoop, and ADLS.
  • Built data pipelines in Composer/Airflow on GCP for ETL-related jobs using different Airflow operators.
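
An illustrative Spark Structured Streaming sketch in the spirit of the Kafka-to-HDFS job described above; the broker, topic, paths, and trigger interval are assumptions, and the real job's schema handling and Snowflake load are omitted.

    # Hypothetical streaming job: consume a Kafka topic and land raw payloads on HDFS as Parquet.
    # Requires the spark-sql-kafka connector on the classpath.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka_to_hdfs").getOrCreate()

    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
        .option("subscribe", "orders")                        # assumed topic
        .option("startingOffsets", "latest")
        .load()
    )

    parsed = stream.select(
        F.col("value").cast("string").alias("payload"),       # raw message body
        F.col("timestamp").alias("kafka_ts"),
    )

    query = (
        parsed.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/streams/orders/")
        .option("checkpointLocation", "hdfs:///checkpoints/orders/")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()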

Confidential, Alpharetta, GA

Data Engineer

Responsibilities:

  • Developed pipeline using Hive (HQL) to retrieve the data from Hadoop cluster, SQL to retrieve data from Oracle database and used ETL for data transformation.
  • Analyzed and gathered business requirements from clients, conceptualized solutions with technical architects, verified the approach with appropriate stakeholders, and developed E2E scenarios for building the application.
  • Developed the data ingestion pipeline in Apache NiFi.
  • Derived data from relational databases to perform complex data manipulations and conducted extensive data checks to ensure data quality; performed data wrangling to clean, transform, and reshape the data using the NumPy and Pandas libraries (see the sketch at the end of this list).
  • Worked with datasets of varying size and complexity, including both structured and unstructured data; participated in all phases of data mining, data cleaning, data collection, variable selection, feature engineering, model development, validation, and visualization; and performed gap analysis.
  • Optimized many SQL statements and PL/SQL blocks by analyzing the execution plans of SQL statements, and created and modified triggers, SQL queries, and stored procedures for performance improvement.
  • Built a process to report OneLake usage for all LOBs to higher-level VPs, using a Kibana data source, EFK, and Kinesis streams, and created AWS Lambda functions to automate it on a daily basis.
  • Working knowledge of Google Cloud Platform (GCP): BigQuery, Cloud Dataproc, and Composer/Airflow.
  • Experience in GCP Dataproc, GCS, Cloud functions, BigQuery.
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Storage, and BigQuery.
  • Experience with AWS Cloud IAM, Data Pipeline, EMR, S3, and EC2.
  • Experience with Terraform scripts that automate step execution in EMR to load data into ScyllaDB.
  • Experience in Amazon AWS services such as EMR, EC2, S3, CloudFormation, and Redshift, which provide fast and efficient processing of big data.
  • Implemented Predictive analytics and machine learning algorithms in Databricks to forecast key metrics in the form of designed dashboards on to AWS (S3/EC2) and Django platform for the company's core business.
  • Participated in feature engineering such as feature generation, PCA, feature normalization, and label encoding with scikit-learn preprocessing; performed data imputation using various methods in the scikit-learn package in Python.
  • Used Sqoop to move data from an Oracle database into Hive by creating delimiter-separated files, exposing them in an external location as external Hive tables, and then moving the data into refined tables in Parquet format using Hive queries.
  • Experience building and optimizing big data pipelines, architectures, and data sets (HiveMQ, Kafka, Cassandra, S3, Redshift).
  • Used Teradata utilities such as FastExport and MLOAD for various data migration/ETL tasks from OLTP source systems to OLAP target systems.
  • Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.
  • Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets.
  • Evaluated the performance of the Databricks environment by converting complex Redshift scripts to Spark SQL as part of a new technology adoption project.
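
A small pandas/scikit-learn sketch of the wrangling and preprocessing steps mentioned above; the file name and column names are illustrative only.

    # Hypothetical cleansing and preprocessing of a tabular extract.
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, StandardScaler

    df = pd.read_csv("claims_sample.csv")                       # assumed extract

    df["claim_amount"] = pd.to_numeric(df["claim_amount"], errors="coerce")
    df = df.drop_duplicates().dropna(subset=["claim_amount"])   # basic cleansing

    # label-encode a categorical column and normalize a numeric one
    df["region_code"] = LabelEncoder().fit_transform(df["region"].astype(str))
    df["claim_amount_scaled"] = StandardScaler().fit_transform(df[["claim_amount"]]).ravel()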

Confidential

Data Engineer

Responsibilities:

  • Involved in Requirement Analysis, Development and Documentation.
  • Developed scripts to schedule various Sqoop jobs.
  • Developed MapReduce programs to clean and aggregate the data (see the sketch at the end of this list).
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Experience in Hadoop development using HDFS, MapReduce, Hive, and Sqoop.
  • Hands-on coding and scripting (automation) experience using OO languages such as Java, Python.
  • Used ETL tool to perform extract, transform and load on large amounts of source data.
  • Ingested real-time data into HBase using Kafka through Spark Streaming.
  • Created dependencies that trigger jobs by priority using shell scripting.
  • Created directories and log file folders in the production environment using Bash.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Worked on Apache Spark with the Scala programming language to transfer data in a much faster and more efficient way.
  • Developed Spark Streaming applications for Real Time Processing.
  • Used Flume to collect, aggregate, and store web log data from different sources such as web servers, mobile, and network devices, and pushed it to HDFS.
  • Developed an HBase data model on top of HDFS data to perform real-time analytics using Java.
  • Implemented fair schedulers on the JobTracker to share cluster resources among the MapReduce jobs submitted by users.
  • Worked extensively on development and maintenance of Hadoop applications using Java and MapReduce.
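
A minimal Hadoop Streaming style mapper/reducer in Python, reflecting the clean-and-aggregate MapReduce work listed above; the tab-delimited record layout is an assumption.

    # Run via hadoop-streaming with this script as both mapper ("map" arg) and reducer ("reduce" arg).
    import sys

    def mapper():
        """Emit key<TAB>1 for every well-formed input record, dropping malformed lines."""
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[1].strip():
                continue                          # skip malformed or empty records
            print(f"{fields[0]}\t1")

    def reducer():
        """Sum counts per key; relies on Hadoop sorting mapper output by key."""
        current_key, count = None, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t")
            if current_key is not None and key != current_key:
                print(f"{current_key}\t{count}")
                count = 0
            current_key = key
            count += int(value)
        if current_key is not None:
            print(f"{current_key}\t{count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()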

Confidential

Data Analyst

Responsibilities:

  • Enhanced data collection procedures to include information relevant for building analytic systems, and created value from data by applying advanced analytics and statistical techniques to deepen insights and inform optimal solution architecture, maintainability, and scalability, making predictions and generating recommendations.
  • Performed data pre-processing and cleansing for various PFM projects by implementing feature selection and regression on the datasets.
  • Maintained and developed complex SQL queries, stored procedures, views, functions and reports that qualify customer requirements using Microsoft SQL Server 2008 R2.
  • Created views, table-valued functions, common table expressions (CTEs), joins, and complex subqueries to provide reporting solutions (see the sketch at the end of this list).
  • Developed Tableau reports and dashboards from multiple data sources using data blending.
  • Developed test plans to ensure successful delivery of projects; employed performance analytics based on high-quality data to build reports and dashboards with actionable insights.
  • Worked with the ETL team to document the transformation rules for data migration from OLTP to the warehouse environment for reporting purposes.
  • Generated the reports and visualizations based on the insights mainly using Tableau and developed dashboards for the company insight teams.
  • Communicated data value to key stakeholders by synthesizing findings into actionable management reporting, using words, charts, graphs, and other visualizations to present findings.
  • Supported Sales and Engagement management planning and decision making on sales incentives by developing and maintaining financial models, reporting, and sensitivity analysis by customer segment.
  • Developed and implemented several types of sub-reports, drill-down reports, summary reports, parameterized reports, and ad-hoc reports using SSRS, delivered through mail server subscriptions and SharePoint server.
  • Produced parameterized sales performance reports every month and distributed them to the respective departments/clients using Tableau.
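
A brief Python sketch of the kind of CTE-based reporting query described above, run against SQL Server via pyodbc and pandas; the DSN, table, and column names are hypothetical.

    # Hypothetical monthly sales summary using a common table expression (CTE).
    import pandas as pd
    import pyodbc

    conn = pyodbc.connect("DSN=ReportingDW;Trusted_Connection=yes")   # assumed DSN

    query = """
    WITH monthly_sales AS (
        SELECT customer_segment,
               YEAR(order_date)  AS order_year,
               MONTH(order_date) AS order_month,
               SUM(order_amount) AS total_sales
        FROM dbo.orders
        GROUP BY customer_segment, YEAR(order_date), MONTH(order_date)
    )
    SELECT customer_segment, order_year, order_month, total_sales
    FROM monthly_sales
    ORDER BY order_year, order_month, customer_segment;
    """

    report = pd.read_sql(query, conn)      # DataFrame feeding the downstream report
    print(report.head())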
