Sr. AWS Data Engineer Resume
Irving, TX
SUMMARY
- Around 9 years of extensive experience in Information Technology with expertise in Data Analytics, Data Architecture, Design, Development, Implementation, Testing, and Deployment of software applications in the Banking, Finance, Insurance, Retail, and Telecom domains.
- Working experience in designing and implementing complete end-to-end Hadoop infrastructure using HDFS, MapReduce, Hive, HBase, Kafka, Sqoop, Spark, ZooKeeper, Ambari, Scala, Oozie, YARN, NoSQL, Postman, and Python.
- Created DataFrames and performed analysis using Spark SQL.
- Sound knowledge of Spark Streaming and the Spark machine learning libraries (MLlib).
- Hands-on expertise in writing RDD (Resilient Distributed Dataset) transformations and actions using Scala, Python, and Java.
- Excellent understanding of the Spark architecture and framework: SparkContext, APIs, RDDs, Spark SQL, DataFrames, Streaming, and MLlib.
- Worked on Agile projects delivering end-to-end continuous integration/continuous delivery (CI/CD) pipelines by integrating tools such as Jenkins with AWS for VM provisioning.
- Experienced in writing automated scripts for monitoring file systems and key MapR services.
- Implemented continuous integration and deployment (CI/CD) through Jenkins for Hadoop jobs.
- Good knowledge of Cloudera distributions and of Amazon Simple Storage Service (Amazon S3), AWS Redshift, Lambda, Amazon EC2, and Amazon EMR.
- Excellent understanding of Hadoop architecture and good exposure to Hadoop components such as MapReduce, HDFS, HBase, Hive, Sqoop, Cassandra, and Kafka, as well as Amazon Web Services (AWS); tested, documented, and monitored APIs with Postman, which integrates those tests easily into build automation.
- Used Sqoop to import data from relational databases (RDBMS) into HDFS and Hive, storing it in formats such as Text, Avro, Parquet, SequenceFile, and ORC with compression codecs such as Snappy and Gzip.
- Performed transformations on the imported data and exported it back to the RDBMS.
- Worked on Amazon Web Services (AWS) to integrate EMR with Spark 2, S3 storage, and Snowflake.
- Experience in writing queries in HQL (Hive Query Language) to perform data analysis.
- Created Hive external and managed tables.
- Implemented partitioning and bucketing on Hive tables for query optimization (see the sketch below this summary).
- Used Apache Flume to ingest data from different sources into sinks such as Avro and HDFS.
- Implemented custom interceptors for Flume to filter data and defined channel selectors to multiplex the data into different sinks.
- Excellent knowledge of Kafka architecture.
- Integrated Flume with Kafka, using Flume both as a producer and a consumer (the Flafka pattern).
- Used Kafka for activity tracking and log aggregation.
- Experienced in writing Oozie workflows and coordinator jobs to schedule sequential Hadoop jobs.
- Experience working with Text, SequenceFile, XML, Parquet, JSON, ORC, and Avro file formats as well as clickstream log files.
- Familiar with data architecture, including data ingestion pipeline design, Hadoop architecture, data modeling, data mining, and advanced data processing; experienced in optimizing ETL workflows.
- Good exposure to data quality, data mapping, and data filtration using data warehouse ETL tools such as Talend, Informatica, DataStage, and Ab Initio.
- Good exposure to creating dashboards in reporting tools such as SAS, Tableau, Power BI, BO, and QlikView, using filters and sets while dealing with huge volumes of data.
- Experience with various databases such as Oracle, Teradata, Informix, and DB2.
- Experience with NoSQL databases such as MongoDB and HBase, and PostgreSQL-based platforms such as Greenplum.
- Worked through the complete Software Development Life Cycle (analysis, design, development, testing, implementation, and support) using Agile and Waterfall methodologies.
- Demonstrated a full understanding of the Fact/Dimension data warehouse design model, including star and snowflake design methods.
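As an illustration of the Hive partitioning and bucketing noted above, the following is a minimal PySpark sketch; the table and column names (sales_raw, analytics.sales_by_day, txn_date, customer_id) are hypothetical placeholders rather than names from any actual engagement.

```python
# Minimal sketch: write a Hive table partitioned by date and bucketed by customer,
# assuming Hive support is available on the cluster. All names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partition-bucket-sketch")
    .enableHiveSupport()            # required so saveAsTable creates Hive tables
    .getOrCreate()
)

df = spark.table("sales_raw")       # source table ingested earlier (e.g., via Sqoop)

(
    df.write
      .mode("overwrite")
      .format("orc")                # ORC plus partitioning/bucketing for query optimization
      .partitionBy("txn_date")      # lets Hive prune partitions on date filters
      .bucketBy(32, "customer_id")  # co-locates rows for joins/aggregations on customer_id
      .sortBy("customer_id")
      .saveAsTable("analytics.sales_by_day")
)
```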
TECHNICAL SKILLS
Big Data Ecosystem: MapReduce, Pig, Hive, Sqoop, Kafka, Flume, Cassandra, Impala, MapR, Amazon Web Services (AWS), EMR
Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Gradient Boosting Classifier, Extreme Gradient Boosting Classifier, Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayes Classifier, Extra Trees Classifier, Stochastic Gradient Descent, etc.
Cloud Technologies: AWS, Azure
IDEs: IntelliJ, Eclipse, Spyder, Jupyter
Ensemble and Stacking: Averaged Ensembles, Weighted Averaging, Base Learning, Meta Learning, Majority Voting, Stacked Ensemble.
Databases: Oracle 11g/10g/9i, MySQL, DB2, MS SQL Server, HBASE
Programming / Query Languages: Java, SQL, Python Programming (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, SAS, R Programming, RStudio, PL/SQL, Linux shell scripts, Scala.
Data Engineer/Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Oozie, Zookeeper, etc.; AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, PuTTY, Bash Shell, Unix, etc.; Tableau, Power BI, SAS, Web Intelligence, Crystal Reports, Dashboard Design.
PROFESSIONAL EXPERIENCE
Confidential, Irving TX
Sr. AWS Data Engineer
Responsibilities:
- Implemented installation and configuration of a multi-node cluster in the cloud using Amazon Web Services (AWS) EC2.
- Handled AWS management tools such as CloudWatch and CloudTrail.
- Stored log files in AWS S3 and used versioning on S3 buckets where highly sensitive information is stored.
- Integrated AWS DynamoDB with AWS Lambda to store item values and back up DynamoDB Streams.
- Automated regular AWS tasks such as snapshot creation using Python scripts.
- Designed data warehouses on platforms such as AWS Redshift, Azure SQL Data Warehouse, and other high-performance platforms.
- Installed and configured Apache Airflow to work with AWS S3 buckets and created DAGs to run Airflow workflows.
- Prepared scripts to automate the ingestion process using PySpark and Scala, as needed, from various sources such as APIs, AWS S3, Teradata, and Redshift.
- Created multiple scripts to automate ETL/ELT processes from multiple sources using PySpark.
- Developed PySpark scripts utilizing SQL and RDDs in Spark for data analysis, storing results back into S3 (see the sketch after this list).
- Developed PySpark code to load data from the staging (stg) layer to the hub layer, implementing the business logic.
- Developed Spark SQL code implementing business logic with Python as the programming language.
- Designed, developed, and delivered jobs and transformations over the data to enrich it and progressively promote it for consumption in the Pub layer of the data lake.
- Worked on SequenceFiles, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.
- Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
- Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
- Maintained Kubernetes patches and upgrades.
- Managed multiple Kubernetes clusters in a production environment.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation; used the Spark engine and Spark SQL for data analysis and provided the results to data scientists for further analysis.
- Developed various UDFs in MapReduce and Python for Pig and Hive.
- Handled data integrity checks using Hive queries, Hadoop, and Spark.
- Worked on performing transformations & actions on RDDs and Spark Streaming data with Scala.
- Implemented the Machine learning algorithms using Spark with Python.
- Profiled structured, unstructured, and semi-structured data across various sources to identify patterns and implemented data quality metrics using the necessary queries or Python scripts depending on the source.
- Designed and implemented Scala programs using Spark DataFrames and RDDs for transformations and actions on input data.
- Improved Hive query performance by implementing partitioning and clustering and by using optimized file formats (ORC).
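The S3 read-transform-write pattern referenced above can be sketched as follows; the bucket names, paths, and columns (status, event_ts, customer_id, amount) are hypothetical placeholders.

```python
# Minimal PySpark sketch of reading raw data from S3, applying a business rule,
# and writing curated output back to S3. All names and paths are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3-etl-sketch").getOrCreate()

# Read raw events landed in S3 (assumes the EMR/cluster role already has S3 access)
raw = spark.read.parquet("s3://example-raw-bucket/events/")

# Example rule: keep valid records and aggregate per customer per day
daily = (
    raw.filter(F.col("status") == "VALID")
       .withColumn("event_date", F.to_date("event_ts"))
       .groupBy("customer_id", "event_date")
       .agg(F.count("*").alias("event_count"),
            F.sum("amount").alias("total_amount"))
)

# Write partitioned ORC back to S3 for downstream Hive/Athena consumption
(
    daily.write
         .mode("overwrite")
         .partitionBy("event_date")
         .orc("s3://example-curated-bucket/daily_customer_summary/")
)
```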
Environment: AWS, JMeter, Kafka, Ansible, Jenkins, Docker, Maven, Linux, Red Hat, GIT, Cloud Watch, Python, Shell Scripting, Golang, Web Sphere, Splunk, Tomcat, Soap UI, Kubernetes, Terraform, PowerShell.
Confidential, McLean, VA
Big Data Engineer & AWS Cloud Engineer
Responsibilities:
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Using AWS Redshift, extracted, transformed, and loaded data across various heterogeneous data sources and destinations.
- Created Tables, Stored Procedures, and extracted data using T-SQL for business users whenever required.
- Performed data analysis and design; created and maintained large, complex logical and physical data models and metadata repositories using ERwin and MB MDR.
- Wrote shell scripts to trigger DataStage jobs.
- Assisted service developers in finding relevant content in the existing reference models.
- Worked with sources such as Access, Excel, CSV, Oracle, and flat files using the connectors, tasks, and transformations provided by AWS Data Pipeline.
- Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries (see the sketch after this list).
- Developed PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns.
- Responsible for design, development, and testing of the database; developed stored procedures, views, and triggers.
- Created Tableau reports with complex calculations and worked on ad-hoc reporting using Power BI.
- Created a data model that correlates all the metrics and produces valuable output.
- Tuned SQL queries to bring down run time by working on indexes and execution plans.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
- Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
- Implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
- Designed, developed, and tested dimensional data models using star and snowflake schema methodologies under the Kimball method.
- Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Worked on DirectQuery in Power BI to compare legacy data with the current data, and generated and stored reports and dashboards.
- Designed SSIS packages to extract, transform, and load (ETL) existing data into SQL Server from different environments for the SSAS (OLAP) cubes and SQL Server Reporting Services (SSRS). Created and formatted cross-tab, conditional, drill-down, Top N, summary, form, OLAP, subreport, ad-hoc, parameterized, interactive, and custom reports.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
- Developed visualizations and dashboards using Power BI.
- Stuck to the ANSI SQL language specification wherever possible and provided context about similar functionality in other industry-standard engines (e.g., referencing PostgreSQL function documentation).
- Used ETL to implement Slowly Changing Dimension transformations to maintain historical data in the data warehouse.
- Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and uploading it into the data warehouse servers.
- Created dashboards for analyzing POS data using Power BI.
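The Spark SQL API usage in PySpark mentioned above can be sketched roughly as follows; the JDBC URL, credentials, table, and column names are hypothetical placeholders, and the appropriate JDBC driver is assumed to be on the classpath.

```python
# Minimal sketch: extract a table over JDBC, query it with Spark SQL, and persist
# the result for downstream reporting. All identifiers are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-extract-sketch").getOrCreate()

# Extract: load a source table over JDBC
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://example-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Register a temporary view so plain SQL can be used for analysis
orders.createOrReplaceTempView("orders")

monthly_revenue = spark.sql("""
    SELECT date_trunc('month', order_date) AS order_month,
           SUM(order_total)                AS revenue
    FROM   orders
    WHERE  status = 'COMPLETE'
    GROUP  BY date_trunc('month', order_date)
""")

# Load: persist results for downstream reporting (e.g., Power BI extracts)
monthly_revenue.write.mode("overwrite").parquet("s3://example-bucket/marts/monthly_revenue/")
```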
Environment: MS SQL Server 2016, T-SQL, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), Management Studio (SSMS), Advanced Excel, Spark, Python, ETL, Power BI, Tableau, Presto, Hive/Hadoop, Snowflake, AWS Data Pipeline, IBM Cognos 10.1, DataStage, Cognos Report Studio 10.1, Cognos 8 & 10 BI, Cognos Connection, Cognos Office Connection, Cognos 8.2/3/4, DataStage and QualityStage 7.5
Confidential, Columbus, OH
Sr. AWS Data Engineer
Responsibilities:
- Processed web server logs by developing multi-hop Flume agents using an Avro sink and loaded them into MongoDB for further analysis; also extracted files from MongoDB through Flume and processed them.
- Expert knowledge of MongoDB, NoSQL data modeling, tuning, and disaster recovery backups; used it for distributed storage and processing via CRUD operations.
- Extracted and restructured the data into MongoDB using the import and export command-line utility tools.
- Experience in setting up fan-out workflows in Flume to design a V-shaped architecture that takes data from many sources and ingests it into a single sink.
- Experience in creating, dropping, and altering tables at run time without blocking updates and queries, using HBase and Hive.
- Experience in working with different join patterns and implemented both Map and Reduce Side Joins.
- Wrote Flume configuration files for importing streaming log data into HBase with Flume.
- Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
- Used Flume with a spooling directory source to load data from the local file system (LFS) to HDFS.
- Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
- Created Partitioned Hive tables and worked on them using HiveQL.
- Loaded data into HBase using bulk load and non-bulk load.
- Worked on the continuous integration tool Jenkins and automated JAR builds at the end of each day.
- Worked with Tableau, integrated Hive with Tableau Desktop reports, and published them to Tableau Server.
- Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
- Experience in setting up the whole application stack and in setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Used Spark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables, and handled structured data using Spark SQL (see the sketch after this list).
- Implemented Spark Scripts using Scala, Spark SQL to access hive tables into Spark for faster processing of data.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Tested Apache Tez for building high performance batch and interactive data processing applications on Pig and Hive jobs.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, Scala, DataFrames, Impala, OpenShift, Talend, and pair RDDs.
- Set up data pipelines using TDCH, Talend, Sqoop, and PySpark based on the size of the data loads.
- Designed column families in Cassandra, ingested data from RDBMS, performed transformations, and exported the data to Cassandra.
- Led testing efforts in support of projects/programs across a large landscape of technologies (Unix, AngularJS, AWS, SauceLabs, Cucumber JVM, MongoDB, GitHub, Bitbucket, SQL, NoSQL databases, APIs, Java, Jenkins).
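The JSON-to-Hive flow via Spark SQL described above can be sketched as follows; the HDFS path, database, table, and column names are hypothetical placeholders.

```python
# Minimal sketch: load JSON with Spark SQL, query it, and persist the result to Hive.
# All paths and identifiers are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("json-to-hive-sketch")
    .enableHiveSupport()      # needed so saveAsTable writes to the Hive metastore
    .getOrCreate()
)

# Spark infers the schema from the JSON files (a SchemaRDD in older Spark versions,
# a DataFrame in current ones)
events = spark.read.json("hdfs:///data/raw/events/*.json")

# Handle the structured data with Spark SQL
events.createOrReplaceTempView("events")
clicks = spark.sql("""
    SELECT user_id, page, event_ts
    FROM   events
    WHERE  event_type = 'click'
""")

# Load the result into a Hive table for downstream HiveQL access
clicks.write.mode("append").saveAsTable("weblogs.click_events")
```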
Environment: Hadoop (HDFS, MapReduce), Databricks, Spark, Talend, Impala, Hive, PostgreSQL, Jenkins, NiFi, Scala, MongoDB, Cassandra, Python, Pig, Sqoop, Hibernate, Spring, Oozie, AWS services (EC2, S3, Auto Scaling), Azure, Elasticsearch, DynamoDB, UNIX Shell Scripting.
Confidential, Columbus, Indiana
Data Engineer
Responsibilities:
- Gathered data and business requirements from end users and management. Designed and built data solutions to migrate existing source data from the data warehouse to the Atlas Data Lake (Big Data).
- Analyzed huge volumes of data and devised simple and complex Hive and SQL scripts to validate dataflow in various applications. Performed Cognos report validation. Made use of MHUB for validating data profiling and data lineage.
- Devised PL/SQL statements (stored procedures, functions, triggers, views, and packages). Made use of indexing, aggregation, and materialized views to optimize query performance.
- Created reports using Tableau/Power BI/Cognos to perform data validation.
- Involved in creating Tableau dashboards using stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, etc. using the Show Me functionality; built dashboards and stories as needed using Tableau Desktop and Tableau Server.
- Performed statistical analysis using SQL, Python, R Programming and Excel.
- Used Python and SAS to extract, transform, and load source data from transaction systems and generated reports, insights, and key conclusions.
- Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, which allowed end users to understand the data on the fly with quick filters for on-demand information.
- Analyzed and recommended improvements for better data consistency and efficiency
- Designed and developed data mapping procedures (ETL: data extraction, data analysis, and loading processes) for integrating data using R programming.
- Effectively communicated plans, project status, project risks, and project metrics to the project team and planned test strategies in accordance with project scope.
- Ingested data with Sqoop and Flume from an Oracle database.
- Responsible for wide-ranging data ingestion using Sqoop and HDFS commands. Accumulated partitioned data in various storage formats such as text, JSON, and Parquet. Involved in loading data from the Linux file system to HDFS.
- Started working with AWS for storage and handling of terabytes of data for customer BI reporting tools.
- Experience in fact/dimension modeling (star schema, snowflake schema), transactional modeling, and SCDs (slowly changing dimensions).
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines (see the sketch after this list).
- Worked on Confluence and Jira; skilled in data visualization libraries such as Matplotlib and Seaborn.
- Experience implementing machine learning back-end pipelines with Pandas and NumPy.
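A minimal Apache Airflow sketch of the kind of pipeline authoring and scheduling described above is shown below; the DAG id, task names, and callables are hypothetical placeholders (Airflow 2.x operator imports assumed).

```python
# Minimal Airflow 2.x sketch of a daily ingestion DAG. All names are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_source(**context):
    # Placeholder: pull the day's records from the source system
    print("extracting source data for", context["ds"])


def load_to_storage(**context):
    # Placeholder: write the extracted files to HDFS / S3
    print("loading data for", context["ds"])


with DAG(
    dag_id="daily_ingestion_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_from_source",
                             python_callable=extract_from_source)
    load = PythonOperator(task_id="load_to_storage",
                          python_callable=load_to_storage)

    extract >> load   # run extract before load, monitored in the Airflow UI
```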
Environment: Hive, AWS, Hadoop, HDFS, Python, PL/SQL, SQL, R Programming, Apache Airflow, NumPy, Pandas, Jira, Pig, Tableau, Spark, Linux.
Confidential
Hadoop Developer
Responsibilities:
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing (see the sketch at the end of this list).
- Experience in installing, configuring, and using Hadoop ecosystem components.
- Experience in administering, installing, upgrading, and managing CDH3, Pig, Hive, and HBase.
- Imported and exported data into HDFS and Hive using Sqoop.
- Experienced in defining job flows.
- Knowledge in performance troubleshooting and tuning Hadoop clusters.
- Experienced in managing and reviewing Hadoop log files.
- Participated in development/implementation of Cloudera Hadoop environment.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Responsible for managing data coming from different sources.
- Gained good experience with NoSQL databases.
- Supported MapReduce programs running on the cluster.
- Involved in loading data from the UNIX file system to HDFS.
- Installed and configured Hive and wrote Hive UDFs.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Implemented a CDH3 Hadoop cluster on CentOS.
- Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slots configuration.
- Created HBase tables to store variable data formats of PII data coming from different portfolios.
- Implemented best income logic using Pig scripts.
- Provided cluster coordination services through ZooKeeper.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Supported setting up the QA environment and updating configurations for implementing scripts with Pig and Sqoop.
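The MapReduce data-cleaning jobs mentioned at the top of this list were written in Java; purely as an illustration, a Hadoop Streaming mapper performing comparable record cleaning can be sketched in Python against a hypothetical pipe-delimited layout (id, timestamp, amount).

```python
# Illustrative Hadoop Streaming mapper for data cleaning (Python stand-in for the
# Java MapReduce jobs described above). The three-field, pipe-delimited layout is
# a hypothetical example, not an actual production format.
import sys

EXPECTED_FIELDS = 3

def main():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("|")
        # Drop malformed records with the wrong number of fields
        if len(fields) != EXPECTED_FIELDS:
            continue
        record_id, ts, amount = (f.strip() for f in fields)
        # Drop records with an empty id or a non-numeric amount
        if not record_id:
            continue
        try:
            float(amount)
        except ValueError:
            continue
        # Emit cleaned, tab-separated output for the reducer / HDFS
        print("\t".join([record_id, ts, amount]))

if __name__ == "__main__":
    main()
```

Such a script would be submitted with the Hadoop Streaming JAR; the exact JAR path varies by distribution.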