
AWS Data Engineer Resume


CA

SUMMARY

  • Possess a thorough knowledge of IT, especially the Azure cloud, including how to restrict database access, migrate on-premises databases to Azure Data Lake Storage using Azure Data Factory, and move SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse.
  • Developed pipelines, datasets, linked services, and integration runtime in Azure Data Factory.
  • Experienced in implementing and deploying workloads on Azure virtual machines, as well as managing Azure infrastructure hosting plans (VMs).
  • Built AWS security groups that acted as virtual firewalls controlling the traffic allowed to reach one or more EC2 instances. As part of operations and maintenance support for AWS cloud resources, deployed, maintained, and troubleshot EC2 instances, S3 buckets, Virtual Private Clouds (VPC), Elastic Load Balancers (ELB), and Relational Database Service (RDS) instances.
  • Working knowledge of AWS CloudFormation templates for creating IAM roles and deploying complete architectures end to end (creation of EC2 instances and their supporting infrastructure).
  • Experienced in using boto3 to trigger actions across a range of AWS resources from event-driven and scheduled AWS Lambda functions. Hands-on experience with Cloudera (CDH 4/CDH 5), Hortonworks, MapR, IBM BigInsights, Apache, and Amazon EMR Hadoop distributions.
  • Thorough understanding of Amazon Elastic Compute Cloud (EC2) for computational workloads and Simple Storage Service (S3) as a storage mechanism.
  • Worked on building serverless web pages using API Gateway and Lambda.
  • Proficient use of Apache NiFi to automate data transfer between different Hadoop systems.
  • Used Apache NiFi to move data from the local file system to HDFS. Integrated MapReduce with HBase to ingest large volumes of data using MR algorithms.
  • Good exposure to Apache Hadoop, MapReduce programming, distributed applications, and HDFS.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Following an examination of the SQL scripts, developed a Scala and Python implementation of the solution.
  • Experience in developing scalable solutions using NoSQL databases such as Cassandra and MongoDB.
  • Developed enhancements to MongoDB architecture to improve performance and scalability.
  • Developed Scala scripts, UDFs, and queries that used Spark DataFrames for data aggregation and Sqoop to write data back into the OLTP system.
  • Created Scala programs using Spark to compare Spark's performance with that of Hive and Spark SQL.
  • Over 5 years of experience designing, developing, and implementing big data applications with frameworks such as MapReduce, YARN, Sqoop, Spark, HDFS, Storm, HBase, Impala, NiFi, Zookeeper, Airflow, Flume, Kafka, Oozie, etc.
  • Hands-on experience installing, configuring, and using Hadoop ecosystem components such as Hadoop MapReduce, HDFS, HBase, Hive, Sqoop, Zookeeper, and Flume.
  • Worked on developing ETL processes to load data from multiple data sources into HDFS using Flume and Sqoop, perform structural modifications using MapReduce and Hive, and analyze data using visualization/reporting tools.
  • Worked on storage classes to handle a variety of data formats such as JSON and XML. Experienced with compression techniques such as Snappy.
  • Good knowledge of Oozie concepts like design, development, and execution of workflows in Oozie.
  • Experience with Oozie workflow engine in running workflow jobs with actions that run Hadoop Map Reduce jobs.
  • Implemented the workflows using Apache Oozie framework to automate tasks.
  • Expert in setting up Hortonworks clusters with and without using Ambari.
  • Cluster monitoring and troubleshooting using tools such as Cloudera and Ambari metrics.
  • Expertise utilizing Hortonworks Ambari to create and manage multi-node Hadoop clusters for development and production that contain a variety of Hadoop components (HIVE, SQOOP, OOZIE, FLUME, HBASE, ZOOKEEPER).
  • Developed Spark Streaming programs to process Kafka data in near real time, handling both stateless and stateful transformations.
  • Hands-on experience installing and configuring Spark and Impala.
  • Good knowledge of Spark components such as Spark SQL and Spark Streaming; worked extensively with Spark Streaming and Apache Kafka to fetch live stream data.
  • Involved in integrating Hive queries into the Spark environment using Spark SQL.
  • Used Storm and Kafka Services to push data to HBase and Hive tables.
  • Substantial experience in writing MapReduce jobs and working with Flume, Hive, and Storm.
  • Thorough understanding of Hadoop, Hive, and NoSQL databases such as MongoDB, Cassandra, and HBase.
  • Expertise configuring Zookeeper to maintain data consistency and coordinate servers in clusters.
  • Skilled at managing cluster resources with Zookeeper and Sqoop and importing and exporting data.
  • Used Apache Kafka to gather and combine a sizable volume of web log data, which was then stored in HDFS for analysis.
  • Using Sqoop and Kafka to import data from Amazon S3 into HIVE while maintaining multi-node development and test Kafka clusters.
  • Real-time streaming of data using Spark with Kafka.
  • Good understanding of MPP databases such as Impala.
  • Expert in creating UDFs, UDTFs, and UDAFs for Hive and Impala.
  • Defined best practices for Tableau report development.
  • Solid understanding of data preparation, modeling, and visualization using Power BI, as well as experience developing various reports and dashboards using Tableau visualizations.
  • Developed various analytical dashboards displaying critical KPIs using Power BI.
  • Experience with PySpark-based stream processing systems; able to write DDL and DML scripts in SQL and HQL for RDBMS and Hive analytics systems.
  • Helped design the application architecture for several microservices (AWS).
  • Used Teradata to extract, manipulate, and analyze health care and retail data from multiple sources to generate and visualize actionable insights for decision making.
  • Experience in deploying Cassandra clusters in the cloud and on premises, including data storage and disaster recovery.
  • Implemented multi-data-center and multi-rack Cassandra clusters.
  • Good experience with MongoDB scaling across data centers and an in-depth understanding of MongoDB HA strategies, including replica sets.
  • Data extraction, cleaning, and loading, statistical analysis, exploratory analysis, data wrangling, and predictive modeling using R, Azure, Netezza, Python, and other tools.
  • Working knowledge of Python Integrated Development Environments such as PyCharm, Anaconda, and Jupyter Notebook, as well as others.
  • Familiar with Python math and analytics libraries, including NumPy, SciPy, and Pandas for data preparation, Matplotlib and Seaborn for data visualization, TensorFlow, Theano, and Keras for deep learning, re for NLP, and NLTK and statsmodels for time series forecasting.
  • Understanding of the scientific computing stack, Jupyter Notebook, and Python (NumPy, SciPy, pandas and matplotlib).
  • Extensive experience in SQL, REST APIs, Web Services, and Message Queue development.
  • Created Python APIs with SQLAlchemy for ORM and MongoDB, documented APIs in Swagger docs, and deployed the application using Jenkins.
  • Data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
  • Making use of StreamSets, Apache NiFi, and AWS Glue to create data pipelines.
  • Worked on setting up producers and listeners in Kafka to process streaming data. Developed data pipelines to handle and load streaming data into Cassandra and internal Hive databases.
  • End-to-end data pipelines were built to extract, cleanse, process, and analyze large amounts of behavioral and log data.
  • Helped new QA members get familiar with the main features and application flow.
  • Used Lambda functions and Step Functions to initiate Glue jobs and orchestrate the data flow (a minimal sketch follows this list).
  • Created AWS Glue data ingestion modules to import data into multiple levels in S3, with reporting through Athena and QuickSight.
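
As an illustration of the Lambda-and-Glue orchestration described above, a handler along the following lines is typical. This is a minimal sketch, not the project's actual code: the Glue job name raw_to_curated and the --source_path argument are hypothetical placeholders.

```python
# Minimal sketch: an event-driven AWS Lambda handler that starts a Glue job
# when a new object lands in S3. The job name and argument key below are
# hypothetical placeholders, not the project's actual configuration.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Each record corresponds to one S3 object-created event.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Kick off the (hypothetical) Glue job, passing the new object as an argument.
        response = glue.start_job_run(
            JobName="raw_to_curated",
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        print(f"Started Glue job run {response['JobRunId']} for s3://{bucket}/{key}")

    return {"status": "ok"}
```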

TECHNICAL SKILLS

Languages: Scala, C, C++, Python (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), XML, SQL, HTML.

Technologies: Apache NiFi, MapReduce, Sqoop, Flume, Hive, Oozie, Ambari, Storm, Spark, HDFS, Impala, Zookeeper and Kafka.

Microsoft Technologies: ASP.NET, ASP.NET MVC 3/4/5, ADO.NET, Entity Framework 4.0, LINQ, WPF, WCF and SharePoint.

Frameworks: ASP.Net MVC, Entity Framework, Spring

Web Services: RESTful.

Data Virtualization tools: Denodo

Data Formats: Parquet, JSON, AVRO, ORC, CSV, XML, and Protocol Buffers.

Deployments: Pivotal Cloud Foundry, Chef.

Integration Tools: Azure DevOps, Jenkins, TeamCity.

Operating Systems: Windows 7/10, Mac OS

Development Tools: Visual Studio, Jupyter, PyCharm

Database: Teradata, SQL Server, MySQL, PostgreSQL

ETL and Reporting Tools: SSRS, Tableau, SSIS, Power BI

NoSQL Database: HBase, Cassandra and MongoDB.

AWS Services: EC2, EMR, S3, Redshift, Lambda, Glue, Data Pipeline, Athena, Kinesis.

Azure: Data Lake, Data Factory, Databricks, Azure SQL

Version Control Tools: TFS, Git

PROFESSIONAL EXPERIENCE

Confidential, CA

AWS DATA ENGINEER

Responsibilities:

  • 4+ years of IT experience with Big Data Hadoop Ecosystem components MapReduce, HDFS, Yarn, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spark integration with Cassandra, Zookeeper.
  • Made the decision to use PySpark to develop the solution and created the Python code to extract data from HBase.
  • Possesses practical knowledge of NoSQL databases including Cassandra, MongoDB, and HBase.
  • Installed and configured Hadoop MapReduce, HDFS, developed multiple MapReduce jobs for data cleaning and preprocessing.
  • To store the Parquet-formatted data produced by Apache Spark on the Cloudera Hadoop cluster, Hive tables were built on HDFS.
  • Extensive experience with Spark SQL, Spark Streaming, and using the Core Spark API to explore Spark features for building data pipelines.
  • Created data pipelines utilizing Sqoop, and Hive to load clinical, biometric, lab, claims, and customer member data into HDFS for data analytics.
  • Built data pipelines with Flume and Sqoop to import customer and cargo histories into HDFS for analysis.
  • Created ETL data pipelines by combining technologies such as Hive, Spark SQL, and PySpark.
  • Used Python and NoSQL databases like HBase and Cassandra to store real-time data from Kafka to HDFS using Spark Streaming.
  • To integrate with Cassandra, a distributed messaging queue was developed using Apache Kafka and Zookeeper.
  • Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
  • Set up Spark Streaming to receive continuous information from Kafka and store it in HDFS (see the sketch after this list).
  • Worked with Spark RDD and PySpark concepts to convert Hive/SQL queries into Spark transformations.
  • Used Spark SQL to connect many hive tables, write them to a single table, and store the results on S3.
  • Used Spark SQL for pre-processing, cleaning, and merging very large data sets.
  • Used Spark with Scala and Spark SQL to test and process data.
  • Used PySpark to create and modify data frames in Apache Spark by encoding and decoding JSON objects.
  • Developed product profiles using commodity UDFs.
  • Joined different Cassandra tables using Spark and Scala, then performed analytics on top of them.
  • Involved in the migration of MapReduce programs into Spark using Spark and Scala.
  • Wrote MapReduce jobs using various input and output formats, including custom formats when necessary.
  • Collected logs from the physical machines and the OpenStack controller and integrated them into HDFS using Flume.
  • Developed a Flume ETL job handling data from an HTTP source with an HDFS sink.
  • Hive/SQL queries were converted into Spark transformations using Scala and Spark RDDs.
  • Developed and deployed several proof-of-concepts in Scala on the YARN cluster to compare the performance of Spark, Hive, and SQL/Teradata.
  • Established Jenkins slave nodes to run builds for the Jenkins master.
  • Created open-source software bundles (R libraries and Python packages) to boost the productivity of the modeling team; used Jenkins for continuous integration.
  • Automated tasks using Jenkins jobs.
  • Knowledge of Python, Jupyter, and the scientific computing stack (NumPy, SciPy, pandas and matplotlib).
  • Plotted graphs using Matplotlib and Seaborn to get the insights from the data.
  • Took part in all phases of the project lifecycle, including design, development, deployment, testing, implementation, and support; used the NumPy and Matplotlib Python packages to produce various graphical capacity planning reports.
  • Built an Oozie pipeline to speed up the loading of HDFS data.
  • Knowledge of managing a MongoDB environment from the standpoint of scalability, performance, and availability.
  • Developed ETL workflows utilizing Sqoop and Oozie to import data from various data sources into HDFS.
  • Implemented scripts for MongoDB import, export, dump, and restore.
  • Worked on analyzing and examining customer behavioral data using MongoDB.
  • Worked on moving Tableau Dashboard from Development to Production Live Environment.
  • Developed and assessed custom SQL queries with join clauses in Tableau Desktop to check and reconcile data.
  • Developed ETL mapping for data collection from multiple data feeds via REST API. Data sources include Twitter, web, and YouTube feeds.
  • Organized data interchange and integration with customers' and third-party systems via CSV, XLS, XML, JSON, REST, and SOAP; developed and maintained the parsing logic.
  • Created custom Denodo views by merging tables from several data sources.
  • Expertise with a variety of databases, such as MySQL, Teradata, Redshift, MongoDB, MS SQL Server, and Denodo views (data virtualization).
  • Participated in the building, maintenance, and enhancement of applications for the Snowflake database.
  • Created a data warehouse model in Snowflake for over 100 datasets using WhereScape.
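
As an illustration of the Kafka-to-HDFS streaming described above, the sketch below uses PySpark Structured Streaming. It is a minimal, assumed example: the broker address, topic name, and HDFS paths are placeholders, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# persist the records to HDFS as Parquet. Broker, topic, and paths are
# hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the stream; requires the spark-sql-kafka package on the classpath.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "member-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to a string for downstream parsing.
parsed = events.select(col("value").cast("string").alias("raw_json"))

# Write micro-batches to HDFS as Parquet, with checkpointing for fault tolerance.
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/landing/member_events")
    .option("checkpointLocation", "hdfs:///checkpoints/member_events")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```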

Environment: Hadoop, HDFS, Spark, Kafka, Hive, Spark SQL, Flume, Jenkins, HBase, MapReduce, Oozie, Sqoop, Tableau, REST, Denodo, Matplotlib, MongoDB, Cassandra, Zookeeper, Snowflake.

Confidential, NJ

Big Data Engineer

Responsibilities:

  • Extensive experience with real-time streaming technologies Spark, Storm, and Kafka.
  • Used Apache Kafka to implement large-scale publish-subscribe message queues.
  • Created and deployed Kafka-based large data ingestion pipelines to ingest many TBs of data from a variety of data sources.
  • Created Airflow DAGs for batch processing to manage Python data pipelines that prepare CSV files before ingestion, using the run's conf to parameterize a plethora of input files from various hospitals and start distinct task instances (see the DAG sketch after this list).
  • Complex data pipelines were created to import log data into RDBMS.
  • Backend data structures such as ETL, data pipelines, data frames, and data storage structures in RDBMS, Hadoop/HDFS, both on-premises and in the cloud utilizing AWS and Oracle Cloud, were designed, constructed, and deployed.
  • Real-time data streaming with Spark and Kafka.
  • Designed and implemented Kafka topics by configuring them in the new Kafka cluster across all environments.
  • Deployed Data Lake cluster with Hortonworks Ambari on AWS using EC2 and S3.
  • Set up Hortonworks infrastructure, from configuring clusters down to individual nodes.
  • Worked with the Hortonworks support team on Grafana consumer lag issues.
  • Created REST APIs and software packages that abstract complicated prediction and forecasting algorithms over time series data.
  • Worked on the creation of web services that send and receive data in the JSON format over external interfaces utilizing REST APIs.
  • Created a statistical analysis tool for server-based web traffic utilizing RESTful APIs and Pandas.
  • Used Impala and Presto for querying the datasets.
  • Developed Hive and Impala queries for end user/analyst requirements to perform ad-hoc analysis.
  • Experienced in running query using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
  • Installed Storm and Kafka on a four-node cluster.
  • Developed the Persistence layer for HBase using Apache Storm.
  • Added, installed, and removed components through Ambari.
  • Configuring YARN capacity scheduler with Apache Ambari.
  • Managing Ambari administration and setting up user alerts.
  • Handled the data exchange between HDFS, web applications, and databases using Flume and Sqoop.
  • Collaborated with software engineers to set up a user experience service based on Flume and Hadoop.
  • Working knowledge of several build tools, such as Maven.
  • Created and built various Spark projects with Maven.
  • Maven builds were integrated, and workflows were created to automate the build and deployment process.
  • Expertise with a variety of databases, including MS SQL Server, MongoDB, Cassandra, MySQL, and Oracle.
  • Setting up geographical MongoDB replica sets across various data centers to achieve high availability.
  • Using Spark, HBase, and Hive, data pipelines for extraction, transformation, and loading were developed.
  • Used Hive External tables to perform data analysis with HBase.
  • Built scripts to construct Hive tables, extract data statements, and insert data into tables.
  • Created PySpark projects with the help of the data science team.
  • PySpark scripts built on Python were used to extract important data from data sets and store it on HDFS.
  • Knowledge of cloud-based version control tools such as GitHub.
  • Experienced in managing code versions and configurations using version control tools including Git, GitHub, CVS, and SVN.
  • Used the Python libraries NumPy and SciPy to perform various mathematical operations for calculation purposes.
  • Very good knowledge of YARN terminology and high-availability Hadoop clusters.
  • Worked on the large - scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive.
  • Used Git, GitHub, Amazon EC2, and Heroku for deployment.
  • Extracted data from various source systems such as Oracle, Teradata, XML files, and flat files as per the requirements.
  • Extracted data from Teradata in the form of XML using the XML Publishing methodology and loaded XML files into Teradata using the XML Shredding methodology. Assisted the DBA panel with Teradata XML Services installation.
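
The Airflow bullet above refers to a DAG shaped roughly like the sketch below. It is a minimal, hypothetical example: the DAG id, hospital list, and prepare_csv callable are placeholders, with the run's conf used to override the input path, as described.

```python
# Minimal sketch of an Airflow DAG that prepares hospital CSV files before
# ingestion. DAG id, task callable, and the hospital list are hypothetical
# placeholders; the real DAG parameterized runs via dag_run.conf.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def prepare_csv(hospital, **context):
    # Fall back to the triggering run's conf when an input path is supplied there.
    dag_run = context.get("dag_run")
    conf = (dag_run.conf or {}) if dag_run else {}
    input_path = conf.get("input_path", f"/landing/{hospital}/daily.csv")
    print(f"Cleaning and validating {input_path} for {hospital}")


with DAG(
    dag_id="hospital_csv_prep",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task instance per hospital feed.
    for hospital in ["hospital_a", "hospital_b", "hospital_c"]:
        PythonOperator(
            task_id=f"prepare_{hospital}",
            python_callable=prepare_csv,
            op_kwargs={"hospital": hospital},
        )
```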

Environment: Kafka, Hortonworks, REST, Impala, Storm, Ambari, Flume, Maven, MongoDB, Apache Hive, PySpark, GitHub, Teradata, Power BI, YARN.

Confidential

.NET/Hadoop Developer

Responsibilities:

  • Used Ambari, Kafka Tool, and Kafka Manager to monitor and manage Kafka.
  • Implemented SSL-based Kafka security features without Kerberos; in addition, for more fine-grained security, set up Kerberos with users and groups to enable more complex security features (see the SSL client sketch after this list).
  • Created and constructed a system to use Kafka to collect data from various portals and Spark to process it.
  • Built Power BI Apps with the REST API and linked them with Power BI dashboards and reports.
  • Used a microservice architecture, based on services communicating over a combination of REST and Azure, to test and deploy Identity Microservices.
  • Scheduled the Denodo data extraction jobs to push to the end system as flat files, CSV files, excel files, hyper for Tableau, and data loads into other reporting databases.
  • Upgraded databases from Denodo 5.5 to Denodo 7.0 using Solution Manager.
  • Proficiency with Chef, Chef with Jenkins, and Ansible setup and automation.
  • Migrated Jenkins builds to Azure.
  • Used Amazon EMR for MapReduce tasks and Jenkins for regional testing.
  • Used Hive from Hadoop to analyze data.
  • Used Hive and MapReduce scripts to move data from HDFS to MongoDB, then used Tableau dashboards to view streaming data.
  • Implemented Oozie workflows for MapReduce, Hive, and Sqoop activities.
  • Expert in performance optimization, Power BI Desktop, Power BI Service (SaaS), and Power BI Gateway.
  • Capable of employing Power BI capabilities including drill-throughs, selection panes, and bookmarks.
  • In charge of HBase configuration and data storage.
  • Created MapReduce code to process and parse data from various sources, then used HBase-Hive integration to store the processed data in HBase and Hive.
  • Developed HBase tables to store data in various application-specific forms.
  • Working knowledge with MongoDB, MySQL, and Cassandra databases.
  • Working knowledge in managing, setting up, and administering MySQL and NoSQL databases like MongoDB and Cassandra.
  • Wide-ranging familiarity and expertise with NoSQL databases like MongoDB and Cassandra.
  • Implemented Azure Storage, including Blob storage, storage accounts, and Azure SQL Server; examined Azure storage accounts for services such as Blob storage.
  • Moderate workloads were transferred from on-premises to Azure.
  • Deployed web applications to Azure and used Azure to implement security in web applications.
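
For the SSL-secured Kafka setup described above, a client configuration along these lines is typical. This sketch assumes the kafka-python library; the broker addresses, certificate paths, and topic name are placeholders, not the cluster's actual configuration.

```python
# Minimal sketch: a kafka-python producer configured for SSL (no Kerberos).
# Broker addresses, certificate paths, and the topic are hypothetical placeholders.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9093", "broker2:9093"],
    security_protocol="SSL",
    ssl_cafile="/etc/kafka/certs/ca.pem",        # CA used to verify the brokers
    ssl_certfile="/etc/kafka/certs/client.pem",  # client certificate
    ssl_keyfile="/etc/kafka/certs/client.key",   # client private key
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send one message and block until the broker acknowledges it.
future = producer.send("portal-events", {"portal": "claims", "status": "received"})
record_metadata = future.get(timeout=10)
print(f"Delivered to {record_metadata.topic} partition {record_metadata.partition}")

producer.flush()
producer.close()
```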

Environment: Kafka, REST, Denodo, Jenkins, Hive, Power BI, HBase, MongoDB, Cassandra, Azure

Confidential

AWS ENGINEER

Responsibilities:

  • Extensive experience in data extraction, cleaning, loading, statistical analysis, exploratory analysis, data wrangling, predictive modeling with R, AWS, Azure, Python, and data visualization with Databricks, Grafana, SSRS, and Power BI.
  • Created ETL data pipelines by combining technologies such as Hive, Spark SQL, and PySpark.
  • Added data to Power BI from a range of sources, including SQL, SQL Azure, Excel, Oracle, and more.
  • Built a market research Azure Proof of Concept.
  • Used AWS Redshift to run SQL queries and manage the data warehouse.
  • Set up a multi-node Hadoop cluster with a configuration management/deployment tool (Chef).
  • Hands-on experience testing Chef recipes, cookbooks, and ChefSpec in Test Kitchen and automating node bootstrapping with Knife; refactored Chef code to operate in the AWS cloud environment.
  • Created Tableau dashboards to track client acquisition and weekly campaign success; constructed trends over intervals of a week, month, quarter, and year.
  • Assisting Tableau users in the creation or editing of spreadsheets and data visualization dashboards.
  • Experience in importing data from various sources to the Cassandra cluster.
  • Created documentation for benchmarking the Cassandra cluster for the designed tables.
  • Enhanced the performance and scalability of the MongoDB architecture.
  • Experience in managing large, shared MongoDB cluster.
  • Experience in managing life cycle of MongoDB including sizing, automation, monitoring and tuning.
  • Imported a JSON file to change the theme and color palettes per the requirements in Power BI Desktop.
  • Profiled and cleansed data extracted from various sources utilizing mashup codes in query editor on Power BI Desktop.
  • Gained experience with the Spark compute engine and the functional programming language used by the Spark Shell.
  • Used Python's SciPy and Matplotlib modules to work with data from various sources to make graphs and charts.
  • Created a Hive external table on top of HBase, which was used for feed generation.
  • Created Hive scripts to load the historical data and partition the data.
  • Hands-on experience in creating Apache Spark RDD transformations on data sets in the Hadoop data lake.
  • Validated RESTful API services (see the sketch after this list).
  • Proficient with testing REST APIs, Web, and Database testing.
  • Strong Knowledge/experience in creating Jenkins CI pipelines.
  • Experience in Jenkins to automate most of the build-related tasks.
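
As a small illustration of the REST API validation listed above, a check like the one below is representative. The endpoint URL and expected fields are hypothetical placeholders, not the project's actual API.

```python
# Minimal sketch: validating a RESTful endpoint's status code, content type,
# and response fields with the requests library. URL and fields are
# hypothetical placeholders.
import requests

BASE_URL = "https://api.example.com/v1"

def validate_customer_endpoint(customer_id: int) -> None:
    response = requests.get(f"{BASE_URL}/customers/{customer_id}", timeout=10)

    # Basic transport-level checks.
    assert response.status_code == 200, f"unexpected status {response.status_code}"
    assert "application/json" in response.headers.get("Content-Type", "")

    # Basic payload checks against the fields the reporting layer depends on.
    body = response.json()
    for field in ("id", "name", "created_at"):
        assert field in body, f"missing field: {field}"

if __name__ == "__main__":
    validate_customer_endpoint(42)
```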

Environment: Matplotlib, SQL, Chef, Tableau, AWS, Cassandra, MongoDB, Teradata, Power BI, Hive, Spark, REST, Jenkins.
