
Sr. Data Engineer Resume


Seattle, WA

SUMMARY

  • AWS Certified Data Engineer with around 7 years of IT experience and exceptional expertise in the Big Data/Hadoop ecosystem and data analytics techniques.
  • Hands on experience working with Big Data/Hadoop ecosystem including Apache Spark, Map Reduce, Spark Streaming, PySpark, Hive, HDFS, Kafka, Redis, Sqoop, Oozie.
  • Proficient in Python scripting; used NumPy for statistical functions, Matplotlib for visualization, and Pandas for organizing data.
  • Experience in different Hadoop distributions like Cloudera and Hortonworks Data Platform (HDP).
  • In-depth understanding of Hadoop architecture, including YARN and components such as HDFS, Resource Manager, Node Manager, Name Node, and Data Node.
  • Hands-on experience importing and exporting data from RDBMS into HDFS and vice versa using Sqoop.
  • Experience in working with Hive data warehouse tool-creating tables, distributing data by doing static partitioning and dynamic partitioning, bucketing, and using Hive optimization techniques.
  • Experience working with Cassandra and NoSQL database including MongoDB and HBase.
  • Experience in tuning and debugging Spark applications using Spark optimization techniques.
  • Experience in building PySpark and Spark-Scala applications for interactive analysis, batch processing and stream processing.
  • Hands on experience in creating real time data streaming solutions using Apache Spark Core, Spark SQL, and Data Frames.
  • Extensive knowledge in implementing, configuring, and maintaining Amazon Web Services (AWS) like EC2, S3, Redshift, Glue and Athena.
  • Experience in working with Azure cloud platform (HDInsight, DataLake, Databricks, Blob Storage, Data Factory, ML Studio, Synapse, SQL, SQL DB, DWH and Data Storage Explorer).
  • Experienced in data manipulation using Python and libraries such as Pandas, NumPy, SciPy, and Scikit-learn for data analysis, numerical computation, and machine learning.
  • Experience in writing SQL queries, data integration, and performance tuning.
  • Developed various shell scripts and Python scripts to automate Spark jobs and Hive scripts.
  • Actively involved in all phases of the data science project life cycle, including data collection, data pre-processing, exploratory data analysis, feature engineering, feature selection, and building machine learning model pipelines.
  • Hands on Experience in using Visualization tools like Tableau, Power BI.
  • Experience in working with GIT, Bitbucket Version Control System.
  • Extensive experience working in Test-Driven Development and Agile/Scrum environments.
  • Involved in daily SCRUM meetings to discuss the development/progress and was active in making scrum meetings more productive.
  • Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.
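The shell/Python automation of Spark jobs mentioned above can be sketched as a small wrapper that assembles a `spark-submit` command line. This is an illustrative sketch only: the job path, resource settings, and application arguments are hypothetical placeholders, not values from the actual projects.

```python
# Sketch of a job-automation helper that builds a spark-submit command line.
# All paths, resource values, and app arguments are illustrative placeholders.
import shlex


def build_spark_submit(app_path, master="yarn", deploy_mode="cluster",
                       executor_memory="4g", num_executors=10, app_args=()):
    """Assemble the spark-submit command for a batch job."""
    cmd = [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--executor-memory", executor_memory,
        "--num-executors", str(num_executors),
        app_path,
    ]
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    cmd = build_spark_submit("jobs/daily_aggregation.py",
                             app_args=["--run-date", "2021-01-01"])
    print(shlex.join(cmd))
```

A cron entry or workflow scheduler can then execute the printed command, which keeps resource settings in one reviewable place instead of scattered across ad-hoc scripts.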

TECHNICAL SKILLS

Big Data/Hadoop Ecosystem: Apache Spark, HDFS, Map Reduce, HIVE, Sqoop, Oozie, Zookeeper, Kafka, Redis, Flume, IntelliJ

Programming Languages: Python, Scala, R, SQL, PL/SQL, Linux Shell Scripts

NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB

Databases: Oracle 11g/10g, MySQL, MS SQL Server, DB2, Teradata, PrestoDB

Web Technologies: HTML, XML, JDBC, JSP, CSS, JavaScript, SOAP

Tools Used: Eclipse, PuTTY, WinSCP, NetBeans, QlikView, Power BI

Operating Systems: Linux, Unix, Windows, Mac OS-X, CentOS, Red Hat

Methodologies: Agile/Scrum, Rational Unified Process and Waterfall

Distributed Platforms: Cloudera, Hortonworks, MapR

PROFESSIONAL EXPERIENCE

Confidential, Seattle WA

Sr. Data Engineer

Responsibilities:

  • Created pipelines in Azure ML Workspace and ADF using linked services, datasets, and pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and a write-back tool.
  • Developed Spark applications using Pyspark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
  • Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
  • Created Azure ML pipelines with Python modules for production and user consumption.
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity.
  • Orchestrated all Data pipelines using Azure Data Factory and built a custom alerts platform for monitoring.
  • Used Azure Data Lake Storage Gen2 to store Excel and Parquet files and retrieved data using the Blob API.
  • Created Logic Apps with different triggers and connectors to integrate data from Workday into different destinations.
  • Responsible for testing and fixing bugs in a monitoring application that lets users create, start, stop, or delete Spark clusters, compute instances, and compute clusters in Azure Databricks and the Azure ML workspace.
  • Developed RESTful endpoints to cache application-specific data in in-memory data stores such as Redis.
  • Created NiFi workflows for data ingestion into the Hadoop Data Lake from MySQL and Postgres.
  • Developed various solution-driven views and dashboards with different chart types, including pie charts, bar charts, tree maps, circle views, line charts, area charts, and scatter plots, in Power BI.
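The JSON scripts for deploying ADF pipelines described above can be sketched as a Python helper that assembles a simplified pipeline definition and serializes it to JSON. This is a hedged illustration: the pipeline name, activity name, activity type, and overall structure are simplified placeholders, and real ADF pipeline JSON carries more fields than shown here.

```python
# Sketch: assembling a simplified ADF-style pipeline definition in Python
# and serializing it to JSON for deployment. Names, the "Script" activity
# type, and the structure are illustrative placeholders, not the exact
# production schema.
import json


def build_adf_pipeline(name, sql_text):
    """Return a minimal pipeline document with one SQL activity."""
    return {
        "name": name,
        "properties": {
            "activities": [
                {
                    "name": "RunSqlActivity",
                    "type": "Script",  # placeholder; real ADF activity types vary
                    "typeProperties": {
                        "scripts": [{"type": "Query", "text": sql_text}],
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    pipeline = build_adf_pipeline("daily_load", "SELECT 1")
    print(json.dumps(pipeline, indent=2))
```

Generating the definition in code rather than hand-editing JSON makes it easy to parameterize one template across many similar pipelines.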

Environment: Azure Databricks, Data Lake, MySQL, Azure ML, Azure SQL, Power BI, Blob Storage, Data Factory, Data Storage Explorer, Hadoop 2.x (HDFS, MapReduce, YARN), Spark v2.0.2, PySpark, Hive, Bitbucket, Postgres, PrestoDB, Redis, RabbitMQ.

Confidential, Irvine CA

Data Engineer

Responsibilities:

  • Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark-SQL, Spark MLlib, DataFrames, Pair RDDs, and Spark on YARN.
  • Performed tuning of Spark Applications to set batch interval time and correct level of Parallelism and memory tuning.
  • Used Spark Streaming APIs to perform transformations and actions on the fly, building a common learner data model that gets data from Kafka in real time and persists it to Cassandra.
  • Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
  • Developed Kafka consumer APIs in Python for consuming data from Kafka topics.
  • Used Kafka to consume XML messages and Spark Streaming to process the XML file to capture UI updates.
  • Valuable experience on practical implementation of cloud-specific technologies including IAM, Amazon Cloud Services like Elastic Compute Cloud (EC2), ElastiCache, Simple Storage Services (S3), Cloud Formation, Virtual Private Cloud (VPC), Route 53, Lambda, Glue, EMR.
  • Migrated an existing on-premises application to AWS and used AWS services like EC2 and S3 for small data set processing and storage.
  • Loaded data into S3 buckets using AWS Lambda functions, AWS Glue, and PySpark; filtered data stored in S3 buckets using Elasticsearch and loaded it into Hive external tables. Maintained and operated the Hadoop cluster on AWS EMR.
  • Used AWS EMR Spark cluster and Cloud Dataflow on GCP to compare the efficiency of a POC on a developed pipeline.
  • Configured Snowpipe to pull data from S3 buckets into Snowflake tables and stored incoming data in the Snowflake staging area.
  • Created live, real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Worked on Amazon Redshift to consolidate all data warehouses into one.
  • Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra per the business requirements.
  • Designed, developed, deployed, and maintained MongoDB.
  • Worked extensively on Hadoop Components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, YARN, Spark and Map Reduce programming.
  • Worked extensively with Sqoop for importing and exporting the data from HDFS to Relational Database systems (RDBMS) and vice-versa.
  • Wrote several MapReduce jobs using PySpark and NumPy, and used Jenkins for continuous integration.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Optimized the Hive tables using optimization techniques like partitions and bucketing to provide better performance with HiveQL queries.
  • Worked on cloud deployments using Maven, Docker, and Jenkins.
  • Experience using Avro, Parquet, RCFile, and JSON file formats; developed UDFs in Hive.
  • Worked on custom loaders and storage classes in Pig to handle data formats such as JSON, XML, and CSV, and generated bags for processing in Pig.
  • Generated various kinds of reports using Power BI and Tableau based on client specifications.
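The XML-message processing described above (Kafka feeding XML payloads into Spark Streaming to capture UI updates) can be sketched as the per-message parse step such a job would apply. The `<uiUpdate>` schema and field names below are hypothetical, introduced only for illustration.

```python
# Sketch: the per-message XML parsing a streaming job might apply to
# payloads consumed from Kafka. The <uiUpdate> schema is hypothetical.
import xml.etree.ElementTree as ET


def parse_ui_update(payload):
    """Turn one XML message into a flat dict of fields."""
    root = ET.fromstring(payload)
    return {
        "user_id": root.findtext("userId"),
        "component": root.findtext("component"),
        "action": root.findtext("action"),
    }


sample = ("<uiUpdate><userId>42</userId>"
          "<component>cart</component>"
          "<action>click</action></uiUpdate>")
print(parse_ui_update(sample))
```

In a real pipeline this function would be mapped over each micro-batch of Kafka records, with the resulting dicts persisted to HDFS or a NoSQL store.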

Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, S3, EC2, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, Scala, PySpark, shell scripting, Linux, MySQL, NoSQL, SOLR, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, Power BI, SOAP, Cassandra, and Agile methodologies.

Confidential, New York City, NY

Big Data Engineer

Responsibilities:

  • Analyzed large datasets to determine the optimal way to aggregate and report on them.
  • Developed Spark Applications by using Scala and Implemented Apache Spark data processing project to handle data from various RDBMS and Streaming sources.
  • Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark-SQL, Spark MLlib, DataFrames, Pair RDDs, and Spark on YARN.
  • Used Spark Streaming to receive real-time data from Kafka and stored the streamed data to HDFS using Scala and to NoSQL databases such as HBase and Cassandra.
  • Used Kafka for live streaming data and performed analytics on it; worked on Sqoop to transfer data between relational databases and Hadoop.
  • Loaded data from Web servers and Teradata using Sqoop, Flume and Spark Streaming API.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
  • Involved in designing and deploying multi-tier applications using all the AWS services like (EC2, Route53, S3, RDS, Dynamo DB, SNS, SQS, IAM) focusing on high-availability, fault tolerance, and auto-scaling in AWS Cloud Formation.
  • Wrote AWS Lambda code in Python to convert, compare, and sort nested JSON files.
  • Created AWS data pipelines using AWS API Gateway to receive responses from AWS Lambda, retrieved data from Snowflake via Lambda functions, and converted the responses into JSON, using Snowflake, DynamoDB, AWS Lambda, and AWS S3.
  • Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage, experienced in Maintaining the Hadoop cluster on AWS EMR.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
  • Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed file formats.
  • Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, NLTK in Python for developing various machine learning algorithms and utilized machine learning algorithms such as linear regression, multivariate regression, naive Bayes, Random Forests, K-means, & KNN for data analysis.
  • Developed Python code for different tasks, dependencies, and time sensor for each job for workflow management and automation using Airflow tool.
  • Worked on cloud deployments using Maven, Docker and Jenkins.
  • Created Glue jobs to process data from the S3 staging area to the S3 persistence area.
  • Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
  • Built interactive Power BI dashboards and reports based on business requirements.
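The nested-JSON Lambda work described above can be sketched as a small handler that flattens a nested record and returns it sorted. This is an illustrative sketch only: the event shape and key names are placeholders, not the actual production schema.

```python
# Sketch of the nested-JSON handling a Lambda function might do; the event
# shape and key names are illustrative, not the real production schema.
import json


def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into dot-separated keys."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def handler(event, context):
    """Lambda-style entry point: flatten and sort the incoming record."""
    record = json.loads(event["body"])
    return {"statusCode": 200, "body": json.dumps(flatten(record), sort_keys=True)}
```

Flattening like this makes nested records easy to compare field by field and to load into tabular stores downstream.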

Environment: AWS EMR, S3, EC2, Lambda, MapR, Apache Spark, Spark Streaming, Spark SQL, HDFS, Hive, Pig, Apache Kafka, Sqoop, Flume, Python, Scala, shell scripting, Linux, MySQL, HBase, NoSQL, DynamoDB, Cassandra, machine learning, Snowflake, Maven, Docker, AWS Glue, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, Power BI.

Confidential, Chicago, IL

ETL/SQL Developer

Responsibilities:

  • Analyzed, designed, and developed databases using ER diagrams, normalization, and relational database concepts.
  • Involved in design, development, and testing of the system.
  • Developed SQL Server stored procedures, tuned SQL queries (using indexes and execution plan).
  • Developed user defined functions and created views.
  • Created triggers to maintain the referential integrity.
  • Implemented exception handling.
  • Worked on client requirements and wrote complex SQL queries to generate Crystal Reports.
  • Created and automated the regular jobs.
  • Tuned and optimized SQL queries using execution plan and profiler.
  • Developed the controller component with Servlets and action classes.
  • Developed business (model) components using Enterprise JavaBeans (EJB).
  • Established schedule and resource requirements by planning, analyzing and documenting development effort to include timelines, risks, test requirements and performance targets.
  • Analyzed system requirements and prepared the system design document.
  • Developed dynamic user interface with HTML and JavaScript using JSP and Servlet technology.
  • Used JMS elements for sending and receiving messages.
  • Created and executed test plans in Quality Center using TestDirector.
  • Mapped requirements to test cases in Quality Center.
  • Supported system test and user acceptance test.
  • Rebuilt indexes and tables as a part of performance tuning exercise.
  • Involved in performing database backup and recovery.
  • Worked on documentation using MS Word.
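The referential-integrity triggers described above can be sketched as follows. Note the hedge: the original work used SQL Server triggers, but the sketch uses SQLite (via Python's standard library) so it is runnable here, and the table names are hypothetical.

```python
# Sketch of a referential-integrity trigger. Shown with SQLite so the
# example is runnable; the original environment was MS SQL Server, and
# the orders/order_items tables are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER);

-- Reject child rows whose parent order does not exist.
CREATE TRIGGER trg_order_items_fk
BEFORE INSERT ON order_items
FOR EACH ROW
WHEN (SELECT COUNT(*) FROM orders WHERE order_id = NEW.order_id) = 0
BEGIN
    SELECT RAISE(ABORT, 'order_id does not exist');
END;
""")
conn.execute("INSERT INTO orders VALUES (1)")
conn.execute("INSERT INTO order_items VALUES (10, 1)")       # parent exists: OK
try:
    conn.execute("INSERT INTO order_items VALUES (11, 99)")  # orphan: rejected
except sqlite3.DatabaseError as exc:
    print("rejected:", exc)
```

In SQL Server the same check would typically live in an `AFTER INSERT` trigger or, more simply, a declared `FOREIGN KEY` constraint; a trigger is useful when the integrity rule is more complex than a plain key match.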

Environment: MS SQL Server, SSRS, SSIS, SSAS, DB2, HTML, XML, JSP, Servlet, JavaScript, EJB, JMS, MS Excel, MS Word.

Confidential

Jr. SQL/PLSQL Developer

Responsibilities:

  • Responsible for requirements analysis, application design, coding, testing, maintenance, and support.
  • Created stored procedures, functions, database triggers, packages, and SQL scripts based on requirements.
  • Created complex SQL queries using views, subqueries, and correlated subqueries.
  • Developed UNIX shells/scripts to support and maintain the implementation.
  • Developed ad-clicks based data analytics, for keyword analysis and insights.
  • Crawled public posts from Facebook and tweets from Twitter.
  • Worked on MapReduce jobs with the data science team to analyze this data.
  • Converted the output to structured data and imported it into Tableau with the analytics team.
  • Defined problems, identified the right data, and analyzed results to scope new projects.
  • Created shell scripts to invoke SQL scripts and scheduled them using crontab.
  • Managed defects through discussions with business stakeholders, process analysts, and the team.
  • Tracked defects and prepared test summary reports.
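The crontab-scheduled shell scripts that invoke SQL scripts (mentioned above) can be sketched as a small Python wrapper around `sqlplus`. This is an assumption-laden illustration: the connect string, script name, and cron schedule are placeholders, and the real scripts were plain shell.

```python
# Sketch of a wrapper that a crontab entry such as
#   0 2 * * * /usr/bin/python3 run_sql.py nightly_cleanup.sql
# might invoke. The sqlplus connect string and script path are placeholders.
import sys


def build_sqlplus_command(script_path, connect_string="scott/tiger@ORCL"):
    """Assemble the sqlplus invocation for one SQL script."""
    return ["sqlplus", "-S", connect_string, f"@{script_path}"]


if __name__ == "__main__":
    script = sys.argv[1] if len(sys.argv) > 1 else "nightly_cleanup.sql"
    cmd = build_sqlplus_command(script)
    # In production this would run: subprocess.run(cmd, check=True)
    print(" ".join(cmd))
```

Centralizing the invocation in one wrapper keeps credentials and flags out of every individual cron entry.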

Environment: C++, Oracle PL/SQL(MS Visual Studio, SQL Developer), Unix Shell Scripts.
