Data Engineer III Resume
Orlando
SUMMARY
- Overall 7 years of experience in Big Data and Hadoop ecosystem technologies, with domain experience in Finance and Retail.
- Evaluated technology stacks for building analytics solutions on the cloud by researching the right strategies and tools for end-to-end analytics solutions, and helped design technology roadmaps for data ingestion, data lakes, data processing, and visualization.
- Good knowledge of Hadoop architecture and its ecosystem.
- Extensive hands-on Hadoop experience in storage, query writing, and data processing and analysis.
- Strong understanding of NoSQL databases such as HBase and DynamoDB.
- Experience migrating on-premises ETL processes to the AWS cloud.
- Experience working with Apache NiFi to ingest data into the big data platform from different source systems.
- Experience creating and loading data into Hive tables with appropriate static and dynamic partitions for query efficiency (see the PySpark sketch after this summary).
- Worked with various Hadoop file formats such as Parquet, ORC, and Avro.
- Experience in data warehousing applications, responsible for the extraction, transformation, and loading (ETL) of data from multiple sources into the data warehouse.
- Experience in optimizing Hive SQL queries and Spark jobs.
- Implemented frameworks for data quality analysis, data governance, data trending, data validation, and data profiling using technologies such as big data tooling, DataStage, Spark, Python, and mainframe, with databases such as DB2 and Hive.
- Experience building and optimizing ‘big data’ data pipelines, architectures and datasets.
- Experience performing root cause analysis on internal and external data and processes to answer specific business questions and identify opportunities for improvement.
- Experience with native AWS technologies for data and analytics such as EMR, AWS Glue, Kinesis, IAM, S3, Lambda, Glue Catalog, Service Catalog, Route 53, SNS, CloudWatch, etc.
- Strong experience building complex data transformations in Scala and Python.
- Strong experience creating AWS services by writing AWS CloudFormation templates from scratch.
- Strong experience creating and troubleshooting issues in AWS EMR ingestion/data processing clusters.
- Experience creating AWS Glue DynamicFrames in Scala and Python and writing Parquet files to S3.
- Strong experience working with configurations of MapReduce, Tez, Hive, etc., on AWS EMR.
- Experienced with SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Involved in converting Hive/SQL queries into Spark transformations using Spark and Scala.
- Day-to-day responsibilities include developing ETL pipelines in and out of the data warehouse and developing major regulatory and financial reports using advanced SQL queries in Snowflake.
- Experience creating user-defined functions for complex transformations on data in Scala and Python.
- Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory tuning.
- Good knowledge of Hadoop ecosystem components such as Spark, HDFS, MapReduce, Hive, and Sqoop.
- Developed PySpark code using Python and Spark-SQL for faster testing and data processing.
- Highly skilled in deployment, data security and troubleshooting of the applications using AWS services.
- Proficient in writing CloudFormation Templates (CFT) in JSON format to build AWS services following the Infrastructure as Code paradigm.
- Experienced with event-driven and scheduled AWS Lambda functions that run code in response to events and trigger various AWS resources.
- Hands on experience with version control tools such as SVN, GitHub, and GitLab.
- Experienced in Agile Methodologies and SCRUM process including Sprint planning, backlog grooming, Daily Standups and Code review.
- Integrated ServiceNow with LDAP for authentication.
- Excellent knowledge of the Hadoop ecosystem: HDFS, MapReduce, Spark, AWS EMR, S3, Athena, Kinesis, Glue, Flume, Sqoop, HBase, Confidential.
- Programming/scripting experience with Python, Linux shell scripting, and Bash.
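The bullet on Hive tables with static and dynamic partitions above refers to a common PySpark loading pattern; the sketch below is a minimal, hypothetical illustration (the bucket, database, table, and column names such as sales_orc and load_date are placeholders, not taken from any project listed here).

```python
from pyspark.sql import SparkSession

# Illustrative sketch: load raw CSV files and write them into a Hive ORC table
# partitioned by load_date. All paths and names are hypothetical.
spark = (
    SparkSession.builder
    .appName("csv-to-hive-orc")
    .enableHiveSupport()
    .getOrCreate()
)

# Enable dynamic partition inserts for existing partitioned Hive tables.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://example-bucket/raw/sales/")   # hypothetical source path
)

(
    raw_df.write
    .mode("append")
    .format("orc")
    .partitionBy("load_date")                # partition column
    .saveAsTable("analytics.sales_orc")      # hypothetical Hive table
)
```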
TECHNICAL SKILLS
Languages: Scala, Python, JavaScript, SQL, PySpark, Snowflake
SCM Tools: Subversion, Bamboo, Bitbucket, Git, GitHub
Operating Systems: UNIX, Linux, Solaris, Windows, DOS, VMware
Database: SQL Server, MySQL, Teradata and Oracle
Development IDE: PyCharm, PyDev Eclipse, Vim, NetBeans, MS Visio, Sublime Text, Notepad++
Methodologies: Agile, Scrum
PROFESSIONAL EXPERIENCE
Confidential, Orlando
Data Engineer III
Responsibilities:
- Experience in data ingestion from SQL Server to the Hadoop environment for further processing and analytics.
- Design and develop ETL integration patterns using Python on Spark.
- Create PySpark data frames to bring data from DB2 to Amazon S3 (see the JDBC sketch after this section).
- Translate business requirements into maintainable software components and understand their impact (technical and business).
- Provide guidance to the development team working on PySpark as the ETL platform.
- Configured and used the QuerySurge tool to connect with HBase using Apache Phoenix for data validation.
- Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables.
- Make sure that quality standards are defined and met.
- Optimize PySpark jobs for faster data processing.
- Provide workload estimates to the client.
- Migrated on-premises Informatica ETL processes to the AWS cloud.
- Implement a CI/CD (Continuous Integration and Continuous Delivery) pipeline for code deployment.
- Review components developed by team members.
- Worked on converting SQL Server functions into Teradata stored procedures for confirming the data.
- Used the AWS CLI to suspend an AWS Lambda function processing an Amazon Kinesis stream and then resume it; strong experience with Amazon Kinesis Data Firehose.
- Used SQL Assistant to query Teradata tables.
- Developed Spark DataFrames and dynamic frames for structured data processing.
- Experienced in creating user-defined functions for complex transformations on data frames.
- Experience in creating Hive tables and optimizing Hive queries.
- Created a Livy API application to submit PySpark and Spark applications to the cluster (see the Livy sketch after this section).
- Able to evaluate data quality, including identifying data outliers and anomalies that require resolution prior to analysis.
- Ability to work with stakeholders to develop actionable metrics and data-driven insights.
- Developing Spark DataFrame operations to perform required validations on the data and analytics on the Hive data.
Technologies: EMR, Glue, Kinesis, Lambda, Athena, S3, EC2, IAM, CloudWatch, Spark, Python 3.6, Big Data, Hadoop 3.1 environment, Splunk, DB2.
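A minimal sketch of the DB2-to-S3 PySpark pattern referenced in the responsibilities above, assuming the IBM DB2 JDBC driver is available on the Spark classpath; the connection URL, credentials, table, and bucket names are placeholders.

```python
from pyspark.sql import SparkSession

# Sketch only: pull a DB2 table over JDBC and land it as Parquet on S3.
# The URL, credentials, table, and bucket below are placeholders.
spark = SparkSession.builder.appName("db2-to-s3").getOrCreate()

db2_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "SCHEMA.TRANSACTIONS")
    .option("user", "db2_user")
    .option("password", "********")
    .option("fetchsize", "10000")   # larger fetches for bulk reads
    .load()
)

(
    db2_df.write
    .mode("overwrite")
    .parquet("s3://example-bucket/landing/transactions/")
)
```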
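A hedged sketch of submitting a PySpark job through Apache Livy's batch REST API, as mentioned in the Livy bullet above; the endpoint, script path, and Spark settings shown are assumptions for illustration.

```python
import json
import requests

# Submit a PySpark script to the cluster via Livy's /batches endpoint.
# The Livy host and script location are placeholders.
LIVY_URL = "http://livy-host:8998"

payload = {
    "file": "s3://example-bucket/jobs/transform_job.py",   # PySpark script
    "args": ["--run-date", "2020-01-01"],
    "conf": {"spark.executor.memory": "4g", "spark.executor.cores": "2"},
}

resp = requests.post(
    f"{LIVY_URL}/batches",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
batch_id = resp.json()["id"]

# Check the batch state reported by Livy (poll until terminal in practice).
state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
print(f"Batch {batch_id} state: {state}")
```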
Confidential, FC
AWS Big Data Engineer
Responsibilities:
- Developed the AWS Glue ETL environment by deploying AWS CloudFormation templates written from scratch.
- Developed Glue ETL scripts in Scala and Python for data transformation, reading and writing Parquet files from S3 (see the Glue sketch after this section).
- Implemented the Oozie workflow engine to run multiple Hive and Python jobs.
- Involved in automating big data jobs on the Microsoft HDInsight platform and managing logs.
- Experience in cloud data migration using AWS and Snowflake.
- Involved in data migration to Snowflake using AWS S3 buckets.
- Responsible for creating design documents, establishing specific solutions, and creating test cases.
- Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
- Configured and used the QuerySurge tool to connect with HBase using Apache Phoenix for data validation.
- Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables.
- Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
- Created different Hive raw and standardized tables with partitions and buckets for data validation and analysis.
- Experience with Elasticsearch index-based search and the ELK log analytics stack (Elasticsearch, Logstash).
- Designed NoSQL database schemas to help migrate legacy application datastores to Elasticsearch.
- Involved in loading and transforming large sets of structured and semi-structured data from multiple data sources to the Raw Data Zone (HDFS) using Sqoop imports and Spark jobs.
- Developed Hive queries for data sampling and analysis for the analysts.
- Developed Spark DataFrames and dynamic frames for structured data processing.
- Experienced in creating user-defined functions for complex transformations on data frames.
- Experience in creating Hive tables.
- Created AWS EMR data processing clusters for ingesting data from on-premises sources to S3 buckets.
- Experience handling EMR configurations to address memory issues in Spark jobs.
- Scheduled ad-hoc jobs in AWS EMR for data transformation.
- Experience with AWS EMR Jupyter notebooks for analyzing data and transformations.
- Provided production support for the EMR cluster, mainly troubleshooting memory and Spark job application issues.
- Responsible for upgrading the EMR versions.
- Moved data in and out of an instance using import sets and transform maps, including automated imports of data into ServiceNow.
- Ingested on-premises CSV files to the AWS S3 bucket as daily incrementals using AWS EMR by adding steps (see the EMR step sketch after this section).
- Hands-on experience with Oozie workflows for ingesting data on an hourly basis.
- Imported data using Sqoop to load data from Oracle/Linux servers to AWS S3/HDFS on a regular basis.
- Developing Spark DataFrame operations to perform required validations on the data and analytics on the Hive data.
- Used Hive for transformations, joins, filters, and some pre-aggregations after storing the data in HDFS.
- Monitored Control-M jobs for process scheduling and managed the Elastic MapReduce cluster through the AWS console.
- Developed an AWS Lambda function for monitoring EMR cluster status updates and jobs.
- Created Splunk alerts for failures in the hourly ingestion process.
- Experience monitoring CloudWatch logs for EMR bootstrap actions, step logs, etc.
- Experience configuring and implementing AWS tools such as CloudWatch and CloudTrail, and directing system logs for monitoring.
Environment: Amazon EMR, Hadoop, Hive, Jupyter Notebook, Impala, Oracle, Spark, Sqoop, Oozie, MapReduce, Git, HDFS, Linux, Bamboo, Cucumber, and Jira.
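A minimal sketch of a Glue ETL script of the kind described above: it reads a Glue Catalog table as a DynamicFrame, applies a column mapping, and writes Parquet to S3. The database, table, column, and path names are hypothetical.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Read a catalog table as a DynamicFrame, cast columns, write Parquet to S3.
# Database, table, column, and path names below are placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

source_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_zone", table_name="orders_csv"
)

mapped_dyf = ApplyMapping.apply(
    frame=source_dyf,
    mappings=[
        ("order_id", "string", "order_id", "long"),
        ("order_ts", "string", "order_ts", "timestamp"),
        ("amount", "string", "amount", "double"),
    ],
)

glue_context.write_dynamic_frame.from_options(
    frame=mapped_dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/standardized/orders/"},
    format="parquet",
)
job.commit()
```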
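A hedged boto3 sketch of the daily incremental ingestion pattern of adding steps to a running EMR cluster, as mentioned above; the cluster ID, script location, and arguments are placeholders.

```python
import boto3

# Add a spark-submit step to a running EMR cluster for a daily incremental load.
# The cluster id, job script, and run date below are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTERID",
    Steps=[
        {
            "Name": "daily-csv-ingest-2020-01-01",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://example-bucket/jobs/ingest_csv.py",
                    "--run-date", "2020-01-01",
                ],
            },
        }
    ],
)
print("Submitted step:", response["StepIds"][0])
```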
Confidential
Data Engineer
Responsibilities:
- Built reusable data ingestion and data transformation frameworks using Python.
- Migrated the data warehouse from SQL Server to Hadoop and Hive.
- Responsible for architecting a complex data layer that sources raw data from a variety of sources, generates derived data per business requirements, and feeds the data to BI reporting and the data science team.
- Designed and built data quality frameworks covering aspects such as completeness, accuracy, and coverage, using Python, Spark, and Kafka.
- Optimized Hive queries using best practices and the right parameters, with technologies such as Hadoop, YARN, Python, and PySpark.
- Created different Hive raw and standardized tables with partitions and buckets for data validation and analysis.
- Involved in loading and transforming large sets of structured and semi-structured data from multiple data sources to the Raw Data Zone (HDFS) using Sqoop imports and Spark jobs.
- Developed Hive queries for data sampling and analysis for the analysts.
- Wrote Sqoop queries to import data into Hadoop from SQL Server tables.
- Configured and used the QuerySurge tool to connect with HBase using Apache Phoenix for data validation.
- Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables.
- Involved in converting Hive queries into Spark actions and transformations by creating RDDs and DataFrames from the required files in HDFS (see the DataFrame sketch after this section).
- Implemented an ETL framework using Spark with Python and loaded standardized data into Hive and HBase tables.
- Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
- Implemented Oozie workflow engine to run multiple Hive and Python jobs.
- Responsible for creating design documents, establishing specific solutions, and creating test cases.
- Responsible for closing the defects identified by QA team and responsible for managing the Release process for the modules.
Technologies: Hadoop (2.6), Spark (2.1), HDFS, MapReduce, Hive, YARN, ZooKeeper, Oozie, Python 2.7, Scala, Apache NiFi (1.1), Hue, ETL, MS SQL Server, shell scripting.
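A minimal sketch of converting a Hive aggregation query into equivalent Spark DataFrame operations, as referenced in the responsibilities above; the tables and columns (standardized.orders, customer_totals) are illustrative assumptions, not project artifacts.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Convert a Hive aggregation query into DataFrame operations.
# Table and column names are illustrative.
spark = (
    SparkSession.builder
    .appName("hive-query-to-dataframe")
    .enableHiveSupport()
    .getOrCreate()
)

# Original Hive-style query:
#   SELECT customer_id, SUM(amount) AS total_amount
#   FROM standardized.orders
#   WHERE order_date >= '2019-01-01'
#   GROUP BY customer_id

orders = spark.table("standardized.orders")

totals = (
    orders
    .filter(F.col("order_date") >= "2019-01-01")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

totals.write.mode("overwrite").saveAsTable("derived.customer_totals")
```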