AWS Data Engineer Resume
Atlanta, Georgia
SUMMARY
- Motivated IT professional with around 6 years of experience as a Big Data Engineer, with expertise in designing data-intensive applications spanning the Hadoop ecosystem, big data analytics, cloud data engineering, data warehouses/data marts, data visualization, and reporting.
- In-depth knowledge of Hadoop architecture and its components such as YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, ResourceManager, TaskTracker, and the MapReduce programming paradigm.
- Extensive experience in Hadoop-led development of enterprise-level solutions utilizing components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, ZooKeeper, and YARN.
- Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations). Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
- Experienced in using Spark to improve the performance of and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked extensively with PySpark and Scala (a minimal PySpark sketch writing a partitioned Hive table appears after this summary).
- Handled ingestion of data from different sources into HDFS using Sqoop and Flume, and performed transformations using Hive, MapReduce, and Azure Data Factory before loading the results back into HDFS. Managed Sqoop jobs with incremental loads to populate Hive external tables. Experienced in importing streaming data into HDFS using Flume sources and sinks and transforming the data with Flume interceptors.
- Experience with the Oozie workflow scheduler, managing Hadoop jobs as a directed acyclic graph (DAG) of actions with control flows.
- Implemented security requirements for Hadoop and integrated it with Kerberos authentication infrastructure: KDC server setup, realm/domain creation, and ongoing administration.
- Experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance. Experience with file formats such as Avro, Parquet, ORC, JSON, and XML.
- Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie. Experienced with the most common Airflow operators, including PythonOperator, BashOperator, GoogleCloudStorageDownloadOperator, GoogleCloudStorageObjectSensor, and GoogleCloudStorageToS3Operator (a minimal DAG sketch appears after this summary).
- Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Created Java applications to handle data in MongoDB and HBase, and used Apache Phoenix to provide a SQL layer over HBase.
- Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
- Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server, and performing data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.
- Expert in designing parallel jobs using various stages such as Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.
- Created and configured new batch jobs in the Denodo scheduler with email notification capabilities; implemented cluster settings across multiple Denodo nodes and set up load balancing to improve performance.
- Instantiated, created, and maintained CI/CD (continuous integration and deployment) pipelines and applied automation to environments and applications. Worked with automation tools such as Git, Terraform, and Ansible.
- Experienced in fact and dimension modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).
- Experienced with JSON-based RESTful web services and XML-based SOAP web services; also worked on various applications using Python IDEs such as Sublime Text and PyCharm.
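The following is a minimal PySpark sketch of the pattern referenced above: reading raw data with the DataFrame API, applying a simple transformation, and writing a partitioned Hive table queried through Spark SQL. Paths, table names, and columns are illustrative placeholders, not taken from any specific project.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Placeholder names throughout; requires a configured Hive metastore.
spark = (
    SparkSession.builder
    .appName("orders-hive-partitioned")
    .enableHiveSupport()
    .getOrCreate()
)

# Read raw Parquet, apply a simple enrichment, and write a partitioned Hive table.
orders = spark.read.parquet("/data/raw/orders")          # hypothetical HDFS path
enriched = (
    orders
    .withColumn("order_date", F.to_date("order_ts"))     # assumed timestamp column
    .filter(F.col("amount") > 0)                          # assumed amount column
)

(
    enriched.write
    .mode("overwrite")
    .partitionBy("order_date")                            # Hive-style partitioning
    .format("parquet")
    .saveAsTable("analytics.orders_enriched")             # managed Hive table
)

# Spark SQL can then prune partitions on the partition column.
spark.sql(
    "SELECT order_date, SUM(amount) FROM analytics.orders_enriched GROUP BY order_date"
).show()
```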
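Below is a minimal Airflow DAG sketch of the operator usage mentioned above, using PythonOperator and BashOperator with Airflow 2.x import paths. The DAG id, schedule, and task logic are hypothetical placeholders; the Google Cloud Storage operators mentioned above come from the Google provider package and are omitted here.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator      # Airflow 2.x import paths
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder callable: a real pipeline would pull a file or query a source here.
    print("extracting for run", context["ds"])


default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="example_daily_ingest",        # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    pull = PythonOperator(task_id="extract", python_callable=extract)
    load = BashOperator(task_id="load", bash_command="echo 'load step placeholder'")

    pull >> load                          # extract runs before load
```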
TECHNICAL SKILLS
Languages/Tools: Java, C++, Scala, VB, XML, HTML/XHTML, HDML, DHTML, Python, SQL.
Big Data Technologies: HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Oozie, ZooKeeper, Spark, PySpark, Kafka, MDM, Storm, Cassandra, Impala.
Cloud: AWS (S3, EC2, EMR, Glue, Kinesis), Azure (VM, Databricks, Data Factory, Cosmos DB, SQL DB)
GUI Environment: Swing, AWT, Applets.
Operating Systems: Windows 95/98/NT/2000/XP/7/8/10, MS-DOS, UNIX, Linux.
Messaging and Web Services Technology: SOAP, WSDL, UDDI, XML, SOA, JAX-RPC, IBM WebSphere MQ v5.3, JMS.
Network Protocols: HTTP, HTTPS, FTP, UDP, TCP/IP, SNMP, SMTP, and POP3.
Databases/NoSQL: Oracle 10g, MS SQL Server 2000, DB2, MS Access, MySQL, Teradata, Cassandra, and MongoDB.
Testing and Case Tools: JUnit, Log4j, Rational ClearCase, CVS, Ant, Maven, JBuilder.
Project Methodology: Agile (Scrum), Waterfall
PROFESSIONAL EXPERIENCE
Confidential - Atlanta, Georgia
AWS Data Engineer
Responsibilities:
- Responsible for developing and supporting Data warehousing operations.
- Involved in petabyte-scale data migration operations.
- Designed and implemented custom NiFi processors to react to and process data for the data pipeline.
- Worked on building and developing ETL pipelines using Spark-based applications.
- Designed technical solutions using object-oriented design concepts.
- Maintained resources on-premises as well as on the cloud.
- Used Apache NiFi to copy data from the local file system to HDFS.
- Implemented Test driven development as per the architecture.
- Developed Java routines in ETL jobs for data transformation as required
- Developed SQL queries to generate the look up files needed for the ETL job
- Designed and developed ETL integration patterns using Python on Spark (PySpark).
- Used Informatica to extract required data from operational systems, transform it on the Informatica server, and load it into the data warehouse.
- Used Talend application for integration solutions.
- Designed VNets and subscriptions to conform to Azure network limits.
- Used Redshift to run queries against exabytes of data in Amazon S3.
- Used MDM to help provide a comprehensive, 360-degree view of customers.
- Used MDM systems designed for ready incorporation into a variety of applications, including those for big data.
- Exposed virtual machines and cloud services in the VNets to the Internet using the Azure external load balancer.
- Used Impala, which supports various file formats such as LZO, SequenceFile, Avro, ORC, and Parquet.
- Used Impala to provide faster access to data in HDFS compared with other SQL engines.
- Performed data extraction, aggregation, and consolidation within AWS Glue using PySpark.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into Amazon Redshift (a minimal Glue job sketch appears after this section).
- Tested the functionality of the ETL jobs by validating data between source and target systems.
- Developed the PySpark code for AWS Glue jobs and for EMR.
- Used HDFS (on EMR) and S3 as sources to process batch files for daily and weekly jobs.
- Worked on developing ETL pipelines over S3 Parquet files in the data lake using AWS Glue.
- Utilized various cloud-based services to maintain and monitor various cluster resources.
- Conducted ETL Data Integration, Cleansing, and Transformations using Apache Kudu, Spark.
- Used Apache NiFi for file conversions and data processing.
- Developed applications to map the data between different sources and destinations using Python and Scala.
- Reviewed and conducted performance tuning on various Spark applications.
- Responsible for managing data from disparate sources.
- Used Terraform to manage resource scheduling, disposable environments, and multi-tier applications.
- Used Hive scripts in Spark for data cleaning and transformation.
- Responsible for migrating data from various conventional data sources as per the architecture.
- Used Autosys to schedule Spark and Kafka producer jobs to run in parallel.
- Developed Spark applications in Scala and Python to migrate the data.
- Developed Linux based shell scripts to automate the applications.
- Provided support for building Kafka consumer applications (a minimal consumer sketch appears after this section).
- Performed unit testing and collaborated with the QA team for possible bug fixes.
- Collaborated with data modelers and other developers during the implementation.
- Worked in an Agile-based Scrum Methodology.
- Loaded data into Hive partitioned tables.
- Exported the analyzed data to relational databases using Kudu for visualization and for generating reports for the Business Intelligence team.
Environment: AWS, Linux, Spark SQL, Python, Scala, CDH 5.12.1, Kudu, Spark, Oozie, Cloudera Manager, MDM, Hue, SQL Server, Maven, Git, Agile methodology, PySpark, Redshift, Informatica.
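As referenced in the Glue bullets above, here is a minimal AWS Glue PySpark job sketch: the standard Glue boilerplate, a read of campaign Parquet files from S3, a simple aggregation, and a load into Redshift via a Glue connection. Bucket, schema, table, and connection names are placeholders, not details from the actual project.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read campaign Parquet files from S3 and aggregate per campaign per day.
campaigns = spark.read.parquet("s3://example-bucket/campaign/")   # hypothetical path
daily = campaigns.groupBy("campaign_id", "event_date").count()    # assumed columns

# Write the result to Redshift through a Glue catalog connection.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(daily, glueContext, "daily"),
    catalog_connection="redshift-conn",                 # placeholder connection name
    connection_options={"dbtable": "analytics.campaign_daily", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/tmp/",         # staging dir Glue uses for COPY
)

job.commit()
```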
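The consumer sketch referenced above is shown with the kafka-python client, which is an assumption; the actual applications may have used a different client. Topic, broker, and group names are illustrative placeholders.

```python
import json

from kafka import KafkaConsumer   # kafka-python package

# Minimal consumer: subscribe to a topic and deserialize JSON payloads.
consumer = KafkaConsumer(
    "orders-events",                                  # placeholder topic
    bootstrap_servers=["broker1:9092"],               # placeholder broker
    group_id="orders-loader",                         # placeholder consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    enable_auto_commit=True,
)

for message in consumer:
    # Each message carries topic/partition/offset metadata plus the deserialized payload.
    print(message.topic, message.partition, message.offset, message.value)
```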
Confidential - Philadelphia, Pennsylvania
Azure Data Engineer
Responsibilities:
- Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations before loading the data into Azure Synapse.
- Managed, Configured, and scheduled resources across the cluster using Azure Kubernetes Service.
- Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse, improving query performance.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
- Developed dashboards and visualizations to help business users analyze data and to provide data insights to upper management, with a focus on Microsoft products such as SQL Server Reporting Services (SSRS) and Power BI.
- Performed migration of large data sets to Databricks (Spark): created and administered clusters, loaded data, configured data pipelines, and loaded data from ADLS Gen2 into Databricks using ADF pipelines.
- Created various pipelines to load data from Azure Data Lake into a staging SQL DB and then into Azure SQL DB.
- Worked extensively on Azure Data Factory, including data transformations, Integration Runtimes, Azure Key Vault, triggers, and migrating Data Factory pipelines to higher environments using ARM templates.
- Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming for streaming analytics in Databricks (a minimal sketch appears after this section).
Environment: Azure SQL DW, Databricks, Azure Synapse, Cosmos DB, ADF, SSRS, Power BI, Azure Data Lake, ARM, Azure HDInsight, Blob Storage, Apache Spark.
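A minimal sketch of the mini-batch pattern described above, using the classic Spark Streaming (DStream) API: a StreamingContext with a fixed batch interval and RDD-style transformations applied to each mini-batch. The socket source and field layout are stand-ins for the real ingest source, not details from the actual pipeline.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Placeholder app name, source, and schema.
sc = SparkContext(appName="mini-batch-demo")
ssc = StreamingContext(sc, batchDuration=10)        # 10-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)     # stand-in for the real stream source
events = lines.map(lambda line: line.split(","))

# RDD-style transformation applied to every mini-batch: count events per key.
counts = events.map(lambda fields: (fields[0], 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```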
Confidential
Data Engineer
Responsibilities:
- Responsible for creating Hive tables, loading the structured data resulting from MapReduce jobs into the tables, and writing Hive queries to further analyze the logs to identify issues and behavioral patterns.
- Worked on designing and developing the real-time analysis module for the analytics dashboard using Cassandra, Kafka, and Spark Streaming.
- Involved in running MapReduce jobs for processing millions of records.
- Built reusable Hive UDF libraries for business requirements, enabling users to use these UDFs in Hive queries.
- Responsible for Data Modeling in Cassandra as per our requirement.
- Managing and scheduling Jobs on a Hadoop cluster using Oozie and cron jobs.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Created UDFs to calculate the pending payment for a given residential or small-business customer and used them in Pig and Hive scripts.
- Deployed and built the application using Maven.
- Used Python scripting for large-scale text-processing utilities.
- Handled importing of data from various data sources, performed transformations using Hive. (External tables, partitioning).
- Responsible for data modeling in MongoDB to load both structured and unstructured data.
- Processed unstructured files such as XML and JSON using a custom-built Java API and pushed them into MongoDB (an illustrative loading sketch appears after this section).
- Wrote test cases in MRUnit for unit testing of MapReduce programs.
- Involved in developing templates and screens in HTML and JavaScript.
- Developed the XML Schema and Web services for the data maintenance and structures.
- Built and deployed applications into multiple UNIX based environments and produced both unit and functional test results along with release notes.
Environment: HDFS, MapReduce, Hive, Pig, Cloudera, Impala, Oozie, Greenplum, MongoDB, Cassandra, Kafka, Storm, Maven, Python, Cloud Manager, Ambari, JDK, J2EE, Struts, JSP, Servlets, Elasticsearch, WebSphere, HTML, XML, JavaScript, MRUnit.
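The loading sketch referenced above, shown in Python with PyMongo purely for illustration; the work described used a custom-built Java API. The connection string, database, collection, and file names are placeholders.

```python
import json
import xml.etree.ElementTree as ET

from pymongo import MongoClient   # PyMongo driver

# Placeholder connection, database, and collection names.
client = MongoClient("mongodb://localhost:27017")
collection = client["staging"]["raw_events"]


def from_json(path):
    # One JSON document per file, loaded as-is.
    with open(path) as handle:
        return json.load(handle)


def from_xml(path):
    # Flatten the children of the XML root into a single document.
    root = ET.parse(path).getroot()
    return {child.tag: child.text for child in root}


docs = [from_json("event1.json"), from_xml("event2.xml")]   # hypothetical files
collection.insert_many(docs)
print("documents in collection:", collection.count_documents({}))
```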
Confidential
Data Engineer
Responsibilities:
- Involved in sprint planning as part of monthly deliveries.
- Involved in daily scrum calls and stand-up meetings as part of the Agile methodology.
- Hands-on experience with the VersionOne tool to update work details and working hours for tasks.
- Involved in the design of views.
- Involved in writing Spring configuration files and business logic based on requirements.
- Involved in code-review sessions.
- Implemented JUnit tests based on the business logic for the assigned backlog items in the sprint plan.
- Experienced in creating Jenkins CI jobs and Sonar jobs.
- Performed unit testing and collaborated with the QA team for possible bug fixes.
Environment: Core Java, Spring, Maven, XMF Services, JMS, Oracle 10g, PostgreSQL 9.2, FitNesse, Eclipse, SVN.