
AWS Big Data Engineer Resume


Dallas, TX

SUMMARY

  • Around 8 years of professional IT work experience in Analysis, Design, Development, Deployment and Maintenance of critical software and big data applications.
  • Hands-on experience across the Hadoop ecosystem, including extensive experience with Big Data technologies.
  • Experience in developing and deploying enterprise applications using major Hadoop ecosystem components such as MapReduce, YARN, Hive, Pig, HBase, Flume, Sqoop, Spark Streaming, Spark SQL, Storm, Kafka, Oozie, and ZooKeeper.
  • Implemented performance tuning techniques for Spark SQL queries.
  • Strong knowledge of Hadoop HDFS architecture and the MapReduce (MRv1) and YARN (MRv2) frameworks.
  • Strong hands-on experience publishing messages to various Kafka topics using Apache NiFi and consuming them into HBase using Spark and Python.
  • Created Spark jobs that process the true source files and performed various transformations on the source data using the Spark DataFrame and Spark SQL APIs (see the PySpark sketch after this summary).
  • Developed Sqoop scripts to migrate data from Teradata and Oracle to the Big Data environment.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Worked with the Hue GUI for easy job scheduling, file browsing, job browsing, and metastore management.
  • Experience in importing and exporting the data using Sqoop from HDFS to Relational Database systems and vice-versa.
  • Hands-on experience in installation, configuration, support, and management of Hadoop clusters using Apache, Cloudera (CDH3, CDH4), and YARN distributions (CDH 5.x).
  • Implemented a real-time data streaming pipeline using AWS Kinesis, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets (see the Lambda sketch after this summary).
  • Integrated AWS DynamoDB with AWS Lambda to store value items and back up the DynamoDB streams.
  • Experienced in AWS Elastic Beanstalk for app deployments and worked on AWS Lambda with Amazon Kinesis.
  • Developed a Marketing Cloud service on AWS and built serverless applications using AWS Lambda, S3, Redshift, and RDS.
  • Experience in working with Snowflake.
  • Extensive knowledge of developing Spark Streaming jobs with RDDs (Resilient Distributed Datasets) using Scala, Python, PySpark, and spark-shell.
  • Worked on large-scale data transfers across different Hadoop clusters and implemented new technology stacks on Hadoop clusters using Apache Spark.
  • Good experience in writing MapReduce jobs using native Java code, Pig, and Hive for various business use cases.
  • Wrote Python scripts to parse XML documents and load the data into the database.
  • Added support for AWS S3 and RDS to host static/media files and the database into Amazon Cloud.
  • Experience in project deployment using Heroku/Jenkins and using web services like Amazon Web Services (AWS) EC2, AWS S3, Autoscaling, CloudWatch and SNS.
  • Performed data scrubbing and processing with Oozie for workflow automation and coordination.
  • Hands on experience in analyzing log files for Hadoop and eco-system services and finding root cause.
  • Hands-on experience handling different file formats such as Avro, Parquet, SequenceFiles, MapFiles, CSV, XML, log, ORC, and RC.
  • Experience with the NoSQL databases HBase, Cassandra, and MongoDB.
  • Experience with AIX/Linux RHEL, Unix Shell Scripting and SQL Server 2008.
  • Worked on the data search tool Elasticsearch and the data collection tool Logstash.
  • Strong knowledge in Hadoop cluster installation, capacity planning and performance tuning, benchmarking, disaster recovery plan and application deployment in production cluster.
  • Experience in developing stored procedures, triggers using SQL, PL/SQL in relational databases such as MS SQL Server 2005/2008.
  • Exposure to Scrum, Agile, and Waterfall methodologies.
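
The following is a minimal, illustrative PySpark sketch of the DataFrame/Spark SQL transformation work referenced in the summary above. The file paths, column names, and event fields are hypothetical placeholders, not details from any actual engagement.

# Minimal PySpark sketch: read source files, transform them with the DataFrame
# API, and express the same aggregation through Spark SQL. Paths and column
# names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-transform-sketch").getOrCreate()

# Read raw source files (assumed to be CSV with a header row).
usage = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("s3://example-bucket/raw/usage/"))

# DataFrame API transformations: filter, derive a date column, aggregate.
daily = (usage
         .filter(F.col("event_type") == "click")
         .withColumn("event_date", F.to_date("event_ts"))
         .groupBy("customer_id", "event_date")
         .agg(F.count("*").alias("events")))

# The same aggregation expressed through Spark SQL.
usage.createOrReplaceTempView("usage")
daily_sql = spark.sql("""
    SELECT customer_id, to_date(event_ts) AS event_date, COUNT(*) AS events
    FROM usage
    WHERE event_type = 'click'
    GROUP BY customer_id, to_date(event_ts)
""")

daily.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_usage/")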
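
The Lambda sketch below illustrates the Kinesis-to-DynamoDB pattern mentioned in the summary: a Python Lambda handler decoding Kinesis records and persisting them with boto3. The table name, key schema, and payload fields are assumptions for illustration only.

# Sketch of a Kinesis-triggered AWS Lambda that writes records to DynamoDB.
# The table name and payload layout are hypothetical.
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("example-transactions")  # hypothetical table name


def handler(event, context):
    """Decode each Kinesis record and persist it as a DynamoDB item."""
    records = event.get("Records", [])
    for record in records:
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)
        # Assumes the payload already carries the table's partition key.
        table.put_item(Item=item)
    return {"processed": len(records)}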

TECHNICAL SKILLS

Programming Languages: Java, Scala, Python, SQL, and C/C++

Big Data Ecosystem: Hadoop, MapReduce, Kafka, Spark, Pig, Hive, YARN, Flume, Sqoop, Oozie, Zookeeper, Talend.

Hadoop Distributions: Cloudera Enterprise, Databricks, Hortonworks, EMC Pivotal.

Databases: Oracle, SQL Server, PostgreSQL.

Web Technologies: HTML, XML, jQuery, Ajax, CSS, JavaScript, JSON.

Streaming Tools: Kafka, RabbitMQ

Testing: Hadoop Testing, Hive Testing, MRUnit.

Operating Systems: Linux Red Hat/Ubuntu/CentOS, Windows 10/8.1/7/XP.

Cloud: AWS EMR, Glue, RDS, CloudWatch, S3, Redshift Cluster, Kinesis, DynamoDB.

Technologies and Tools: Servlets, JSP, Spring (Boot, MVC, Batch, Security), Web Services, Hibernate, Maven, GitHub, Bamboo.

Application Servers: Tomcat, JBoss.

IDEs: Eclipse, NetBeans, IntelliJ.

PROFESSIONAL EXPERIENCE

AWS Big Data Engineer

Confidential - Dallas, TX

RESPONSIBILITIES:

  • Designed and developed scalable and cost-effective architecture in AWS Big Data services for data life cycle of collection, ingestion, storage, processing, and visualization.
  • Involved in creating an end-to-end data pipeline within a distributed environment using big data tools, the Spark framework, and Tableau for data visualization.
  • Ensured that applications continue to function normally through software maintenance and testing in the production environment.
  • Leveraged Spark features such as in-memory processing, distributed cache, broadcast variables, accumulators, and map-side joins to implement data preprocessing pipelines with minimal latency.
  • Implemented real-time solutions for money movement and transactional data using Kafka, Spark Streaming, and HBase.
  • The project also included a range of big data tools and programming languages such as Sqoop, Python, and Oozie.
  • Worked on scheduling Oozie workflow engine to run multiple jobs.
  • Experience in creating a Python topology script to generate a CloudFormation template for creating the EMR cluster in AWS.
  • Wrote complex SnowSQL scripts in the Snowflake data warehouse for business analysis and reporting (see the Snowflake sketch after this list).
  • Created logical dataset scripts in Snowflake using SnowSQL joins, aggregations, transformations, and window functions.
  • Experience in using the AWS services Athena, Redshift and Glue ETL jobs.
  • Good knowledge on AWS Services like EC2, EMR, S3, Service Catalog, and Cloud Watch.
  • Experience in using Spark SQL to handle structured data from Hive on the AWS EMR platform (m4.xlarge and m5.12xlarge clusters).
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark (see the conversion sketch after this list).
  • Explored Spark to improve performance and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, and transformations during the ingestion process itself.
  • Experienced in optimizing Hive queries, joins to handle different data sets.
  • Involved in creating Hive tables (Managed tables and External tables), loading and analyzing data using hive queries.
  • Actively involved in code review and bug fixing for improving the performance.
  • Good experience in handling data manipulation using python Scripts.
  • Involved in development, building, testing, and deploy to Hadoop cluster in distributed mode.
  • Created Splunk dashboard to capture the logs for end to end process of data ingestion.
  • Wrote unit test cases for Spark code as part of the CI/CD process.
  • Good knowledge of configuration management and CI/CD tools such as Bitbucket/GitHub and Bamboo.
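
The conversion sketch below is a hedged illustration of rewriting a Hive/SQL query as PySpark DataFrame transformations, as referenced in this list; the database, table, and column names are hypothetical.

# Sketch: run a HiveQL query through Spark SQL, then express the same logic
# with the DataFrame API. Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Original HiveQL, executed unchanged through Spark SQL on EMR.
totals_sql = spark.sql("""
    SELECT account_id, SUM(amount) AS total_amount
    FROM finance.transactions
    WHERE txn_date >= '2020-01-01'
    GROUP BY account_id
""")

# The same query rewritten as DataFrame transformations.
totals_df = (spark.table("finance.transactions")
             .filter(F.col("txn_date") >= "2020-01-01")
             .groupBy("account_id")
             .agg(F.sum("amount").alias("total_amount")))

totals_df.write.mode("overwrite").saveAsTable("finance.account_totals")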
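
The Snowflake sketch below shows the kind of SnowSQL window-function query mentioned above, issued from Python through the Snowflake connector. The account, credentials, and table and column names are placeholders, not real values.

# Sketch of running a Snowflake window-function query from Python.
# Connection parameters and object names are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="example_user",
    password="example_password",
    warehouse="ANALYTICS_WH",
    database="SALES_DB",
    schema="PUBLIC",
)

# Rank each customer's orders by amount, a typical window-function pattern
# for building reporting datasets.
query = """
    SELECT customer_id,
           order_id,
           order_amount,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY order_amount DESC) AS amount_rank
    FROM orders
"""

try:
    cursor = conn.cursor()
    for customer_id, order_id, amount, rank in cursor.execute(query):
        print(customer_id, order_id, amount, rank)
finally:
    conn.close()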

Environment: AWS EMR, Attunity, Kinesis, DynamoDB, SNS, SQS, CloudWatch, HDFS, Sqoop, Linux, Oozie, Hive, Spark, Kafka, Spark Streaming, Scala, Snowflake, PySpark, Python, Tableau, MongoDB, Amazon Web Services, Talend.

Big Data Engineer

Confidential - Dallas, TX

RESPONSIBILITIES:

  • Worked with extensive data sets in Big Data to uncover patterns and problems and unleash value for the enterprise.
  • Worked with internal and external data sources to improve data accuracy and coverage, and generated recommendations on the process flow to accomplish the goal.
  • Ingested various types of data feeds, from both SOR and use-case perspectives, into the Cornerstone 3.0 platform.
  • Re-engineered the legacy IDN FastTrack process to bring Bloomberg data directly from the source into CS3.0.
  • Converted legacy shell scripts to distributed MapReduce jobs, eliminating the processing burden on the edge node.
  • Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Created Spark applications for data preprocessing for greater performance.
  • Involved in loading data from Linux file systems, servers, java web services using Kafka producers and partitions.
  • Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
  • Experience in creating Spark applications using RDDs and DataFrames.
  • Worked extensively with Hive to analyze the data and create data quality reports.
  • Implemented partitioning, dynamic partitions, and buckets in Hive to increase performance and help organize data logically.
  • Wrote Hive queries for data analysis to meet business requirements, and designed and developed user-defined functions (UDFs) for Hive.
  • Developed Spark programs using Scala to compare the performance of Spark with Hive and Spark SQL.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables (see the sketch after this list).
  • Involved in creating Hive tables (Managed tables and External tables), loading and analyzing data using hive queries.
  • Good knowledge of configuration management tools such as SVN, CVS, and GitHub.
  • Experience in configuring Event Engine nodes to import and export the data from Teradata to HDFS and vice-versa.
  • Developed MapReduce jobs using the MapReduce Java API and HiveQL.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Worked with the source to bring history data as well as BAU data from IDN Teradata to the Cornerstone platform, and also migrated feeds from CS2.0.
  • Expert in creating the nodes in Event Engine as per the use-case requirement to automate the process for the BAU data flow.
  • Worked with Spark Ecosystem using Scala and Hive Queries on different data formats like Text file and parquet.
  • Exported the Event Engine nodes created in the silver environment to the IDN repository in BitBucket and created DaVinci package to migrate it to Platinum.
  • Worked with the FDP team to create a secured flow to get the data from the Kafka queue into CS3.0.
  • Expert in creating SFTP connections to internal and external sources to receive data securely without any breakage.
  • Handled production incidents assigned to our workgroup promptly, fixed bugs or routed them to the respective teams, and optimized SLAs.
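
The sketch below illustrates the PySpark pattern referenced above for loading CSV files into partitioned Hive ORC tables; the landing path, target column list, and table name are hypothetical.

# Sketch: load CSV files with varying schemas into a partitioned Hive ORC
# table from PySpark. Paths, columns, and table names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("csv-to-hive-orc-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read all CSVs under the landing path.
raw = (spark.read
       .option("header", "true")
       .csv("hdfs:///data/landing/feeds/*.csv"))

# Normalize schema differences by padding missing target columns with nulls.
target_cols = ["feed_name", "record_id", "payload", "load_date"]
for col_name in target_cols:
    if col_name not in raw.columns:
        raw = raw.withColumn(col_name, F.lit(None).cast("string"))

curated = raw.select(*target_cols)

# Write as ORC, partitioned by load_date, into a Hive-managed table.
(curated.write
 .format("orc")
 .mode("append")
 .partitionBy("load_date")
 .saveAsTable("curated_db.feed_records"))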

Environment: MapReduce, HDFS, Sqoop, Linux, Oozie, Hadoop, PySpark, Hive, Spark, Kafka, Spark Streaming, Python, MongoDB, Amazon Web Services, Talend.

Data Engineer

Confidential - Chicago, IL

RESPONSIBILITIES:

  • Compared the results of the traditional system with those from the Hadoop environment to identify any differences and fixed them by finding the root cause.
  • Created a complete processing engine, based on the Hortonworks distribution, tuned for performance.
  • Designed, developed, and tested ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift (see the Glue sketch after this list).
  • Implemented AWS IAM for managing the user permissions of applications that run on EC2 instances.
  • Deployed applications onto AWS Lambda with HTTP triggers and integrated them with API Gateway.
  • Developed multiple ETL Hive scripts for data cleansing and transformation.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables.
  • Used Apache NiFi to load PDF documents from Microsoft SharePoint into HDFS. Worked on the Publish component to read the source data, extract metadata, and apply transformations to build Solr documents, indexing them using SolrJ.
  • Experienced working on Star and Snowflake Schemas and used the fact and dimension tables to build the cubes, perform processing and deployed them to SSAS database.
  • Developed web pages using Struts framework, JSP, XML, JavaScript, HTML/ DHTML and CSS, configure struts application, use tag library.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Exported data from Hive to AWS s3 bucket for further near real time analytics.
  • Ingested data in real time from Apache Kafka to Hive and HDFS.
  • Developed the Apache Storm, Kafka, and HDFS integration project to do a real-time data analysis.
  • Use of Sqoop to import and export data from RDBMS to HDFS and vice-versa.
  • Exporting data to Teradata using SQOOP.
  • Migrated the computational code from HQL to PySpark.
  • Designed and developed dynamic pages using HTML, CSS layout techniques, and JavaScript.
  • Configured AWS Lambda with multiple functions.
  • Used Python to develop Spark projects and execute using spark-submit.
  • Built a pipeline using Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS (see the streaming sketch after this list).
  • Worked extensively with importing metadata into Hive using Scala and migrated existing tables and applications to work on Hive and AWS cloud.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Implemented Kerberos Security Authentication protocol for existing cluster.
  • Good experience in troubleshooting production level issues in the cluster and its functionality.
  • Involved in advanced procedures like text analytics and processing using the in-memory computing capabilities like Apache Spark written in Scala.
  • Regular Commissioning and Decommissioning of nodes depending upon the amount of data.
  • Performed advanced procedures such as text analytics and processing using Spark's in-memory computing capabilities with Scala.
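
The Glue sketch below is a hedged illustration of the S3-to-Redshift ETL pattern described in this list, written against the AWS Glue job API. The catalog database and table, the Redshift connection name, and the column mappings are assumptions for illustration; a real job would use the project's own catalog objects.

# Sketch of an AWS Glue ETL job: read campaign files from a Glue Data Catalog
# table backed by S3, remap columns, and load them into Redshift.
# Catalog, connection, and column names are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: a catalog table pointing at ORC/Parquet/text campaign files in S3.
campaigns = glue_context.create_dynamic_frame.from_catalog(
    database="marketing", table_name="campaign_raw")

# Rename and cast columns to match the Redshift target table.
mapped = ApplyMapping.apply(
    frame=campaigns,
    mappings=[
        ("campaign_id", "string", "campaign_id", "string"),
        ("spend", "double", "spend_usd", "double"),
        ("start_dt", "string", "start_date", "string"),
    ],
)

# Sink: Redshift through a pre-defined Glue JDBC connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-analytics",
    connection_options={"dbtable": "public.campaigns", "database": "analytics"},
    redshift_tmp_dir=args["TempDir"],
)

job.commit()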
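
The streaming sketch below illustrates a Kafka-to-HDFS pipeline of the kind referenced in this list, shown here with PySpark Structured Streaming (the original work may have used the older DStream API). The broker address, topic, and HDFS paths are placeholders, and the job assumes the spark-sql-kafka package is available on the cluster.

# Sketch: read a Kafka topic with Structured Streaming and land the raw
# stream in HDFS as Parquet. Broker, topic, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "campaign-events")
          .option("startingOffsets", "latest")
          .load()
          .select(F.col("key").cast("string"),
                  F.col("value").cast("string"),
                  "timestamp"))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/campaign_events/")
         .option("checkpointLocation", "hdfs:///checkpoints/campaign_events/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()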

Environment: Hadoop, Hortonworks, AWS, Spark, Java, PySpark, Python, SQL, Redshift, and GitHub.

Big Data Engineer

Confidential - San Jose, CA

RESPONSIBILITIES:

  • Experience with professional software engineering practices and best practices for the full software development life cycle including coding standards, code reviews, source control management and build processes.
  • Worked on analyzing the Hadoop cluster and different big data analytic tools, including MapReduce and Hive.
  • Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed file formats.
  • Worked on Teradata Parallel Transporter (TPT) to load data from databases and files into Teradata.
  • Wrote views based on user and/or reporting requirements.
  • Wrote Teradata Macros and used various Teradata analytic functions.
  • Involved in migration projects to migrate data from data warehouses on Oracle/DB2 and migrated those to Teradata.
  • Configured Flume source, sink and memory channel to handle streaming data from server logs and JMS sources.
  • Generated property lists for every application dynamically using Python modules such as math, glob, random, itertools, functools, NumPy, matplotlib, seaborn, and pandas (see the sketch after this list).
  • Experience in working with Flume to load the log data from multiple sources directly into HDFS.
  • Worked in the BI team in Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
  • Worked with Spark Ecosystem using Scala and Hive Queries on different data formats like Text file and parquet.
  • Used Python to develop Spark projects and execute using spark-submit.
  • Involved in source system analysis, data analysis, data modeling to ETL (Extract, Transform and Load).
  • Handling structured and unstructured data and applying ETL processes.
  • Worked extensively with Sqoop for importing and exporting the data from HDFS to Relational Database systems/mainframe and vice-versa. Loading data into HDFS.
  • Involved in collecting, aggregating and moving data from servers to HDFS using Flume.
  • Implemented a logging framework, the ELK stack (Elasticsearch, Logstash, and Kibana), on AWS.
  • Developed the Pig UDF’S to pre-process the data for analysis.
  • Coding complex Oracle stored procedures, functions, packages, and cursors for the client specific applications.
  • Experienced in using Java REST APIs to perform CRUD operations on HBase data.
  • Applied Hive queries to perform data analysis on HBase using the HBase storage handler to meet the business requirements.
  • Wrote Hive queries to aggregate data that needs to be pushed to the HBase tables.
  • Create/Modify shell scripts for scheduling various data cleansing scripts and ETL loading process.
  • Supported and assisted QA engineers in understanding, testing, and troubleshooting.
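
The sketch below is a purely illustrative Python example of dynamically generating per-application property lists with itertools and pandas, as mentioned in this list; the application names, settings, and output file are hypothetical.

# Illustrative sketch: build a property list for each application by expanding
# combinations of settings with itertools, then persist it with pandas.
# Application names, settings, and the output path are hypothetical.
import itertools

import pandas as pd

applications = ["ingest", "transform", "report"]
environments = ["dev", "uat", "prod"]
retention_days = [7, 30]

rows = []
for app, env, retention in itertools.product(applications, environments,
                                              retention_days):
    rows.append({
        "application": app,
        "environment": env,
        "retention_days": retention,
        "property_file": f"{app}_{env}.properties",
    })

properties = pd.DataFrame(rows)
properties.to_csv("application_properties.csv", index=False)
print(properties.head())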

Environment: Hadoop, Hive, Linux, Java, MapReduce, Sqoop, Storm, HBase, Flume, Eclipse, Maven, JUnit, Agile methodologies.

Junior Database Administrator

Confidential

Responsibilities:

  • Involved in Installation and customization of Databases (10g&9i) in Windows, Linux, and Solaris environments.
  • Involved in Schema management, Tablespace Management, User Creation, and privileges.
  • Upgraded and applied patches as and when required on ORACLE and Linux.
  • Implemented Backup and recovery using RMAN incremental and cumulative backups.
  • Scheduling the databases backup using Crontab. Monitored Day-to-Day backups, logs and recovering them as per the requirements.
  • Interacting with the Oracle support on Service Requests and implementing the solutions for successful completion of tasks.
  • Working on DBA activities like Refreshes from DR database, Table Reorg, Data Refreshes.
  • Putting Standby DB to Flashback mode when required by Application teams.
  • Restored RMAN backups and worked on RMAN backup failures.
  • Worked on all DB L1 issues and added ASM disks to disk groups.
  • Performed daily health checks and monitored DR log sync.
  • Performing Database (DB) tuning & performance monitoring as well as capacity planning for growth and changes; undertaking Patch Management and Version Control.
  • Recommending production database best practices, analyzing the existing database structures, and developing new code to provide solutions for new requirements.
  • Responding to and resolving database access and performance issues.

Environment: Linux, Windows, Oracle, SQL, and PuTTY.
