Sr. Data Engineer Resume
Seattle, WA
SUMMARY
- AWS-certified Data Engineer with around 7 years of IT experience and deep expertise in the Big Data/Hadoop ecosystem and data analytics.
- Hands-on experience with the Big Data/Hadoop ecosystem, including Apache Spark, MapReduce, Spark Streaming, PySpark, Hive, HDFS, Kafka, Redis, Sqoop, and Oozie.
- Proficient in Python scripting, including statistical functions with NumPy, visualization with Matplotlib, and data organization with Pandas.
- Experience with different Hadoop distributions such as Cloudera and Hortonworks Data Platform (HDP).
- In-depth understanding of Hadoop architecture, including YARN and components such as HDFS, Resource Manager, Node Manager, NameNode, and DataNode.
- Hands-on experience importing and exporting data between RDBMS and HDFS using Sqoop.
- Experience with the Hive data warehouse tool: creating tables, distributing data with static and dynamic partitioning and bucketing, and applying Hive optimization techniques (see the sketch after this list).
- Experience working with NoSQL databases including Cassandra, MongoDB, and HBase.
- Experience tuning and debugging Spark applications and applying Spark optimization techniques.
- Experience building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
- Hands-on experience creating real-time data streaming solutions using Spark Core, Spark SQL, and DataFrames.
- Extensive knowledge of implementing, configuring, and maintaining Amazon Web Services (AWS) such as EC2, S3, Redshift, Glue, and Athena.
- Experience working with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, ML Studio, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
- Experienced in data manipulation using Python and libraries such as Pandas, NumPy, SciPy, and scikit-learn for data analysis, numerical computation, and machine learning.
- Experience writing SQL queries and working on data integration and performance tuning.
- Developed various shell and Python scripts to automate Spark jobs and Hive scripts.
- Actively involved in all phases of the data science project life cycle, including data collection, data pre-processing, exploratory data analysis, feature engineering, feature selection, and building machine learning model pipelines.
- Hands-on experience with visualization tools such as Tableau and Power BI.
- Experience working with Git and Bitbucket version control systems.
- Extensive experience working in Test-Driven Development and Agile/Scrum environments.
- Participated in daily Scrum meetings to discuss development progress and actively helped make them more productive.
- Excellent communication, interpersonal, and problem-solving skills; a team player able to quickly adapt to new environments and technologies.
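A minimal PySpark sketch of the static and dynamic Hive partitioning pattern referenced in the Hive bullet above; the table and column names (sales_by_date, sales_staging, order_date) are hypothetical, and bucketing is omitted for brevity.

# Illustrative only: Hive static and dynamic partitioning issued through Spark SQL.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_by_date (
        order_id    BIGINT,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
""")

# Static partitioning: the partition value is fixed in the statement.
spark.sql("""
    INSERT OVERWRITE TABLE sales_by_date PARTITION (order_date = '2024-01-01')
    SELECT order_id, customer_id, amount
    FROM sales_staging
    WHERE order_date = '2024-01-01'
""")

# Dynamic partitioning: Hive derives order_date from the last column of the SELECT.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE sales_by_date PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM sales_staging
""")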
TECHNICAL SKILLS
Big Data/Hadoop Ecosystem: Apache Spark, HDFS, MapReduce, Hive, Sqoop, Oozie, Zookeeper, Kafka, Redis, Flume, IntelliJ
Programming Languages: Python, Scala, R, SQL, PL/SQL, Linux Shell Scripts
NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB
Databases: Oracle 11g/10g, MySQL, MS SQL Server, DB2, Teradata, PrestoDB
Web Technologies: HTML, XML, JDBC, JSP, CSS, JavaScript, SOAP
Tools: Eclipse, PuTTY, WinSCP, NetBeans, QlikView, Power BI
Operating Systems: Linux, Unix, Windows, Mac OS X, CentOS, Red Hat
Methodologies: Agile/Scrum, Rational Unified Process, Waterfall
Distributed Platforms: Cloudera, Hortonworks, MapR
PROFESSIONAL EXPERIENCE
Confidential, Seattle WA
Sr. Data Engineer
Responsibilities:
- Created pipelines in Azure ML Workspace and Azure Data Factory (ADF) using linked services, datasets, and pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (a brief sketch follows this list).
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Performed performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory configuration.
- Created Azure ML pipelines with Python modules for production and user consumption.
- Developed JSON definitions for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
- Orchestrated all Data pipelines using Azure Data Factory and built a custom alerts platform for monitoring.
- Used Azure Data Lake Storage Gen2 to store Excel and Parquet files and retrieved them using the Blob API.
- Created Logic Apps with different triggers and connectors to integrate data from Workday into various destinations.
- Tested and fixed bugs in a monitoring application that lets users create, start, stop, or delete Spark clusters, compute instances, and compute clusters in Azure Databricks and the Azure ML workspace.
- Developed RESTful endpoints that cache application-specific data in in-memory data stores such as Redis.
- Created NiFi workflows for ingesting data into the Hadoop data lake from MySQL and Postgres.
- Developed solution-driven views and dashboards in Power BI using chart types including pie charts, bar charts, tree maps, circle views, line charts, area charts, and scatter plots.
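As referenced above, a minimal PySpark sketch of this kind of extraction and aggregation; the ADLS Gen2 storage account, container paths, and column names are hypothetical.

# Illustrative only: aggregate usage events read from ADLS Gen2 Parquet files.
# Storage account, container, paths, and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-aggregation-sketch").getOrCreate()

events = spark.read.parquet(
    "abfss://raw@examplelake.dfs.core.windows.net/usage/events/"
)

daily_usage = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("customer_id", "event_date")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("duration_sec").alias("total_duration_sec"),
    )
)

# Persist the aggregate for downstream reporting (e.g., Power BI).
daily_usage.write.mode("overwrite").parquet(
    "abfss://curated@examplelake.dfs.core.windows.net/usage/daily_usage/"
)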
Environment: Azure Databricks, Data Lake, MySQL, Azure ML, Azure SQL, Power BI, Blob Storage, Data Factory, Data Storage Explorer, Hadoop 2.x (HDFS, MapReduce, YARN), Spark v2.0.2, PySpark, Hive, Bitbucket, Postgres, PrestoDB, Redis, RabbitMQ.
Confidential, Irvine CA
Data Engineer
Responsibilities:
- Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Tuned Spark applications to set the right batch interval, the correct level of parallelism, and memory usage.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions to build a common learner data model that receives data from Kafka in real time and persists it to Cassandra (see the sketch after this list).
- Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
- Developed Kafka consumer APIs in Python for consuming data from Kafka topics.
- Used Kafka to consume XML messages and Spark Streaming to process the XML files and capture UI updates.
- Practical experience implementing cloud technologies including IAM and AWS services such as Elastic Compute Cloud (EC2), ElastiCache, Simple Storage Service (S3), CloudFormation, Virtual Private Cloud (VPC), Route 53, Lambda, Glue, and EMR.
- Migrated an existing on-premises application to AWS and used services such as EC2 and S3 for processing and storing small data sets.
- Loaded data into S3 buckets using AWS Lambda functions, AWS Glue, and PySpark; filtered data stored in S3 using Elasticsearch and loaded it into Hive external tables. Maintained and operated the Hadoop cluster on AWS EMR.
- Used an AWS EMR Spark cluster and Cloud Dataflow on GCP to compare the efficiency of a POC pipeline.
- Configured Snowpipe to pull data from S3 buckets into Snowflake tables and stored incoming data in the Snowflake staging area.
- Created real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline.
- Worked on Amazon Redshift to consolidate multiple data warehouses into one.
- Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra per the business requirements.
- Designed, developed, deployed, and maintained MongoDB.
- Worked extensively with Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, Spark, and MapReduce programming.
- Worked extensively with Sqoop to import and export data between HDFS and relational database systems (RDBMS).
- Wrote several MapReduce-style jobs using PySpark and NumPy and used Jenkins for continuous integration.
- Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
- Optimized Hive tables using techniques such as partitioning and bucketing to improve HiveQL query performance.
- Worked on cloud deployments using Maven, Docker, and Jenkins.
- Used Avro, Parquet, RCFile, and JSON file formats and developed UDFs in Hive.
- Worked on custom loaders and storage classes in Pig to handle data formats such as JSON, XML, and CSV and generated bags for processing with Pig.
- Generated various kinds of reports using Power BI and Tableau based on client specifications.
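As referenced in the Spark Streaming bullet above, a minimal Structured Streaming sketch of the Kafka-to-Cassandra flow; the broker address, topic, schema, keyspace, and table names are hypothetical, and the Kafka source and Spark Cassandra Connector packages are assumed to be available on the cluster.

# Illustrative only: consume JSON events from Kafka and append each micro-batch
# to Cassandra. Broker, topic, schema, keyspace, and table names are hypothetical;
# the Kafka source and Spark Cassandra Connector packages are assumed to be present.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("learner-stream-sketch").getOrCreate()

event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "learner-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch is appended to the Cassandra table via the connector.
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="learning", table="learner_events")
        .mode("append")
        .save())

query = (
    events.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/learner-events")
    .foreachBatch(write_to_cassandra)
    .start()
)
query.awaitTermination()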
Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, S3, EC2, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, Scala, PySpark, Shell scripting, Linux, MySQL, NoSQL, Solr, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, Power BI, SOAP, Cassandra, and Agile methodologies.
Confidential, New York City, NY
Big Data Engineer
Responsibilities:
- Analyzed large datasets to determine the optimal way to aggregate and report on them.
- Developed Spark Applications by using Scala and Implemented Apache Spark data processing project to handle data from various RDBMS and Streaming sources.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Used Spark Streaming to receive real-time data from Kafka and stored the stream data in HDFS and NoSQL databases such as HBase and Cassandra using Scala.
- Used Kafka for live data streaming and performed analytics on it. Worked with Sqoop to transfer data between relational databases and Hadoop.
- Loaded data from Web servers and Teradata using Sqoop, Flume and Spark Streaming API.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Designed and deployed multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
- Wrote AWS Lambda code in Python to convert, compare, and sort nested JSON files.
- Created AWS data pipelines using resources including API Gateway, Lambda, S3, DynamoDB, and Snowflake, with Lambda functions retrieving data from Snowflake and returning the response through API Gateway in JSON format.
- Migrated an existing on-premises application to AWS, used services such as EC2 and S3 for processing and storing small data sets, and maintained the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from file formats including XML, JSON, CSV, and other compressed formats.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python to develop machine learning models, applying algorithms such as linear regression, multivariate regression, naive Bayes, random forests, K-means, and KNN for data analysis.
- Developed Python code defining tasks, dependencies, and a time sensor for each job for workflow management and automation with Airflow (see the sketch after this list).
- Worked on cloud deployments using Maven, Docker and Jenkins.
- Created Glue jobs to process data from the S3 staging area to the S3 persistence area.
- Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
- Used the data to build interactive Power BI dashboards and reports based on business requirements.
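As referenced in the Airflow bullet above, a minimal Airflow 2.x sketch of the task/dependency/time-sensor pattern; the DAG id, schedule, delay, and task callables are hypothetical placeholders.

# Illustrative only: an Airflow DAG with a time-delta sensor and two Python tasks.
# DAG id, schedule, delay, and task bodies are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.time_delta import TimeDeltaSensor

def extract():
    print("pull raw files from S3")

def transform():
    print("run the Spark/Glue transformation")

with DAG(
    dag_id="daily_usage_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Wait one hour past the scheduled time before the run starts.
    wait_for_upstream = TimeDeltaSensor(
        task_id="wait_for_upstream",
        delta=timedelta(hours=1),
    )

    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    wait_for_upstream >> extract_task >> transform_task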
Environment: AWS EMR, S3, EC2, Lambda, MapR, Apache Spark, Spark Streaming, Spark SQL, HDFS, Hive, Pig, Apache Kafka, Sqoop, Flume, Python, Scala, Shell scripting, Linux, MySQL, HBase, NoSQL, DynamoDB, Cassandra, Machine Learning, Snowflake, Maven, Docker, AWS Glue, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, Power BI.
Confidential, Chicago, IL
ETL/SQL Developer
Responsibilities:
- Analyzed, designed, and developed databases using ER diagrams, normalization, and relational database concepts.
- Involved in the design, development, and testing of the system.
- Developed SQL Server stored procedures and tuned SQL queries (using indexes and execution plans).
- Developed user-defined functions and created views.
- Created triggers to maintain referential integrity.
- Implemented exception handling.
- Worked on client requirements and wrote complex SQL queries to generate Crystal Reports.
- Created and automated regular jobs.
- Tuned and optimized SQL queries using execution plan and profiler.
- Developed teh controller component with Servlets and action classes.
- Developed business (model) components using Enterprise JavaBeans (EJB).
- Established schedule and resource requirements by planning, analyzing and documenting development effort to include timelines, risks, test requirements and performance targets.
- Analyzed system requirements and prepared the system design document.
- Developed dynamic user interface with HTML and JavaScript using JSP and Servlet technology.
- Used JMS elements for sending and receiving messages.
- Created and executed test plans using Quality Center/TestDirector.
- Mapped requirements to test cases in Quality Center.
- Supported system test and user acceptance test.
- Rebuilt indexes and tables as a part of performance tuning exercise.
- Involved in performing database backup and recovery.
- Worked on documentation using MS Word.
Environment: MS SQL Server, SSRS, SSIS, SSAS, DB2, HTML, XML, JSP, Servlet, JavaScript, EJB, JMS, MS Excel, MS Word.
Confidential
Jr. SQL/PL/SQL Developer
Responsibilities:
- Responsible for requirements analysis, application design, coding, testing, maintenance, and support.
- Created stored procedures, functions, database triggers, packages, and SQL scripts based on requirements.
- Created complex SQL queries using views, subqueries, and correlated subqueries.
- Developed UNIX shell scripts to support and maintain the implementation.
- Developed ad-click data analytics for keyword analysis and insights.
- Crawled public posts from Facebook and tweets.
- Worked on MapReduce jobs with the data science team to analyze this data.
- Converted the output to structured data and imported it into Tableau with the analytics team.
- Defined problems to identify the right data and analyzed results to scope new projects.
- Created shell scripts to invoke SQL scripts and scheduled them using crontab.
- Managed defects, including discussions with the business, process analysts, and the team.
- Tracked defects and prepared test summary reports.
Environment: C++, Oracle PL/SQL(MS Visual Studio, SQL Developer), Unix Shell Scripts.