Sr. Data Engineer Resume
Boston, MA
SUMMARY
- Data engineering professional with around 8 years of experience across a variety of data platforms, with hands-on experience in big data engineering and data analytics.
- Strong expertise with the Hadoop ecosystem and big data stack, including HDFS, Spark, Airflow, PySpark, MapReduce, Kafka, Hive, Oozie, ZooKeeper, Ambari, HBase, and Impala.
- Analyzed Hadoop clusters and worked with various big data analytics tools, including Pig, HBase, and Sqoop.
- Strong Experience in Data Engineering, Data Pipeline Design, Development, Documentation, Deployment and Integration as a Sr. Data Engineer/Data Developer.
- Experience across the layers of the Hadoop framework: storage (HDFS), analysis (Pig and Hive), and engineering (jobs and workflows), extending functionality by writing custom UDFs.
- Hands-on experience with Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce programming.
- Extensive knowledge of data architecture including designing pipelines, data ingestion, Hadoop/Spark architecture and advanced data processing.
- Reviewed the HDFS usage and system design for future scalability and fault-tolerance. Installed and configured Hadoop HDFS, MapReduce, Pig, Hive, Sqoop.
- Installed and configured Hive and wrote custom Hive UDFs.
- Created Hive tables and worked with them using HiveQL; imported and exported data to and from HDFS and Hive using Sqoop.
- Experience analyzing and manipulating huge, complex data sets and finding insightful patterns and trends within structured, semi-structured, and unstructured data.
- Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, PyExcel, Boto3, embedPy, pandas, and NumPy.
- Expert in the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
- Experience collecting real-time streaming data from different sources with Kafka and building Spark pipelines that store the data in HDFS and NoSQL databases (see the sketch after this list).
- Extensive shell/Python scripting experience; set up production systems in UNIX/Linux environments.
- Experience in providing application support for Jenkins.
- Experience with continuous integration frameworks, building regression-testable data code using GitHub, Jenkins, and related applications.
- Experience working with Amazon Web Services (AWS), using EC2 for compute, and Redshift Spectrum, Glue, and S3 for storage and querying.
- Strong working knowledge of the Amazon Web Services (AWS) Cloud Platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
- Good experience managing Kubernetes environments for scalability, availability, and zero downtime.
- Primarily involved in data migration using SQL, Azure SQL, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
- Implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Designed, developed, and tested dimensional data models using Star and Snowflake schema methodologies under the Kimball method.
- Experience with Google Cloud components, Google container builders, GCP client libraries, and the Cloud SDK.
- Experience in working with Flume for loading log files into Hadoop.
- Experience in working with NoSQL databases like HBase and Cassandra.
- Develop effective working relationships with client teams to understand and support requirements, develop tactical and strategic plans to implement technology solutions, and effectively meet client expectations.
- An excellent team member capable of working efficiently in cross-functional teams; strong communication, analytical, problem-solving, and decision-making skills.
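The Kafka-to-HDFS streaming work referenced above can be illustrated with a minimal PySpark Structured Streaming sketch. The broker address, topic name, and HDFS paths below are illustrative placeholders rather than actual project values, and the Spark Kafka connector package is assumed to be available on the cluster.

```python
# Minimal sketch: stream events from Kafka into HDFS as Parquet.
# Broker, topic, and paths are placeholders, not actual project values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the raw event stream from Kafka and keep the message payload as text.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "raw-events")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload")
)

# Persist the stream to HDFS with checkpointing for fault tolerance.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/raw/events/")
    .option("checkpointLocation", "hdfs:///checkpoints/raw_events/")
    .start()
)
query.awaitTermination()
```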
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Boston, MA
Responsibilities:
- Participated in gathering and identifying data points to create a Data model.
- Classified all incoming source data into IVC templates, which initiate ingestion based on the classification.
- Created on-demand tables over S3 files using Lambda functions written in Python and PySpark.
- Ingested data into RAW layer using PySpark framework.
- Experience in Hadoop Ecosystem components like Hive, HDFS, Sqoop, Spark, Kafka, Pig.
- Good experience in handling messaging services using Apache Kafka.
- Migrated the data from AWS S3 to HDFS using Kafka.
- Worked on streaming pipeline that uses PySpark to read data from Kafka, transform it and write it to HDFS.
- Created data validations applied while loading data into the Curated layer.
- Defined the encryption process for data loading and processing from the App layer to the Reporting layer.
- Implemented Change Data Capture based on file name and tracked each file's load date.
- Developed Python scripts and Informatica ETL jobs for extracting, transforming, and loading data into the data warehouse.
- Worked on developing Visual reports, Dashboards and KPI scorecards.
- Installed and configured Apache Airflow to work with S3 buckets and the Snowflake data warehouse, and created DAGs to orchestrate the workflows (see the sketch after this list).
- Gathered data from various sources such as databases, files, and APIs and integrated it into the RAW layer.
- Managed stakeholders, understood business requirements, and created technical documents.
- Participated in biweekly DWG meetings to support and discuss the technical challenges of the solution being delivered.
- Worked in an Agile methodology, reviewing project status twice a week and holding daily scrum meetings to discuss the product backlog; met expectations by delivering on tight deadlines.
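The Airflow-to-Snowflake orchestration referenced above can be illustrated with a minimal DAG sketch. The connection details, stage, and table names are placeholders rather than actual project values; in practice, credentials would come from an Airflow connection instead of being hard-coded.

```python
# Minimal Airflow DAG sketch: copy staged S3 files into Snowflake on a daily schedule.
# All names and credentials below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
import snowflake.connector


def load_s3_to_snowflake():
    # Copy files from an external S3 stage into a Snowflake table.
    conn = snowflake.connector.connect(
        user="<user>", password="<password>", account="<account>",
        warehouse="<warehouse>", database="<database>", schema="<schema>",
    )
    try:
        conn.cursor().execute(
            "COPY INTO raw_events FROM @s3_raw_stage FILE_FORMAT = (TYPE = PARQUET)"
        )
    finally:
        conn.close()


with DAG(
    dag_id="s3_to_snowflake_raw_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="copy_s3_to_snowflake", python_callable=load_s3_to_snowflake)
```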
Sr. Data Engineer
Confidential, Santa Rosa, NM
Responsibilities:
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
- Supported persistent storage in AWS using Elastic Block Store (EBS) and S3; created volumes and configured snapshots for EC2 instances.
- Used the DataFrame API in Scala to work with distributed collections of data organized into named columns, developing predictive analytics with the Apache Spark Scala APIs.
- Developed Python scripts using both the DataFrame/SQL/Dataset and RDD/MapReduce APIs in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop.
- Developed Hive queries to pre-process the data required for running the business process.
- Worked on ETL migration by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Data Catalog and can be queried from Athena (see the sketch after this section).
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Implemented a generalized solution model using AWS SageMaker.
- Extensive expertise using the core Spark APIs and processing data on an EMR cluster.
- Programmed in Hive, Spark SQL, Java, and Python to streamline incoming data, build data pipelines that surface useful insights, and orchestrate those pipelines.
- Worked on an ETL pipeline to source these tables and deliver the calculated ratio data from AWS to the Datamart (SQL Server) and the Credit Edge server.
- Experience in using and tuning relational databases (e.g. Microsoft SQL Server, Oracle, MySQL) and columnar databases (e.g. Amazon Redshift, Microsoft SQL Data Warehouse).
Environment: Hortonworks, Hadoop, HDFS, AWS Glue, AWS Athena, EMR, Pig, Sqoop, Hive, NoSQL, HBase, Shell Scripting, Python, Scala, Spark, SparkSQL, AWS, SQL Server, Tableau, ETL
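The serverless Lambda/Glue/Athena pipeline referenced above can be illustrated with a minimal sketch. The database, table, and results-bucket names are placeholders rather than the actual project objects, and the function is assumed to be triggered by an S3 event.

```python
# Minimal AWS Lambda sketch: query newly landed data through a Glue Catalog table via Athena.
# Database, table, and output bucket names are placeholders.
import boto3

athena = boto3.client("athena")


def handler(event, context):
    # Kick off an Athena query against the Glue Data Catalog table backing the pipeline.
    response = athena.start_query_execution(
        QueryString="SELECT count(*) FROM raw_events WHERE load_date = current_date",
        QueryExecutionContext={"Database": "serverless_pipeline_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    return {"query_execution_id": response["QueryExecutionId"]}
```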
Sr. Data Engineer
Confidential, Costa Mesa, CA
Responsibilities:
- Built S3 buckets, managed their bucket policies, and used S3 Glacier for storage and backup on AWS (see the sketch after this section).
- Designed, built, and coordinated an automated build-and-release CI/CD process using GitLab, Jenkins, and Puppet on hybrid IT infrastructure.
- Involved in designing and developing Amazon EC2, Amazon S3, Amazon RDS, Amazon Elastic Load Balancing, Amazon SWF, Amazon SQS, and other services of the AWS infrastructure.
- Ran build jobs and integration tests on a Jenkins master/slave configuration.
- Conducted systems design, feasibility, and cost studies and recommended cost-effective cloud solutions such as Amazon Web Services (AWS).
- Involved in maintaining the reliability, availability, and performance of Amazon Elastic Compute Cloud (Amazon EC2) instances.
- Managed Servers on the Amazon Web Services (AWS) platform instances using Puppet configuration management.
- Integrated services such as GitHub, AWS CodePipeline, Jenkins, and AWS Elastic Beanstalk to create a deployment pipeline.
- Involved in the complete SDLC: designing, coding, testing, debugging, and production support.
- Coordinated with and assisted developers in establishing and applying appropriate branching and labeling/naming conventions using Git.
- Used Kubernetes to deploy, scale, load-balance, and manage Docker containers.
- Worked with JIRA for defect/issue logging and tracking, and documented all work in Confluence.
- Performed branching, merging, and release activities in Git; used GitHub as the version control host for source code.
Environment: Jenkins, JIRA, Maven, Git, AWS EMR/EC2/S3/Redshift, Python, Cassandra, WebLogic, Unix Shell Scripting, SQL, Kubernetes, Docker.
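The S3/Glacier storage setup referenced above can be illustrated with a minimal boto3 sketch. The bucket name, region, prefix, and transition window are placeholders rather than the actual project configuration.

```python
# Minimal boto3 sketch: create a bucket and add a lifecycle rule that archives backups to Glacier.
# Bucket name, region, prefix, and transition window are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

s3.create_bucket(
    Bucket="example-backup-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
s3.put_bucket_lifecycle_configuration(
    Bucket="example-backup-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-backups",
                "Filter": {"Prefix": "backups/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```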
Data Engineer
Confidential
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Demonstrated a strong comprehension of project scope, data extraction, design of dependent and profile variables, logic and design of data cleaning, exploratory data analysis and statistical methods.
- Used Spark Streaming APIs to perform the necessary transformations for building the common learner data model, which receives data from Kafka in near real time and persists it into Hive.
- Developed Spark scripts by using Python as per the requirements.
- Developed real time data pipeline using Spark to ingest customer events/activity data into Hive and Cassandra from Kafka.
- Optimized and performance-tuned Spark jobs to improve running time and resource usage.
- Read and wrote multiple data formats such as JSON, Avro, Parquet, and ORC on HDFS using PySpark (see the sketch after this section).
- Designed, developed, and maintained data integration across Hadoop and RDBMS environments, with both traditional and non-traditional source systems as well as RDBMS and NoSQL data stores for data access and analysis.
- Involved in the recovery of Hadoop clusters and worked on a cluster of 310 nodes.
- Worked on creating Hive tables, loading, and analyzing data using Hive queries.
- Experience in providing application support for Jenkins.
- Developed a data pipeline with AWS to extract the data from weblogs and store in HDFS.
- Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Used reporting tools like Tableau and Power BI to connect with Hive for generating daily reports of data.
Environment: Python, Big Data, Hadoop, HBase, Hive, Spark, PySpark, Cloudera, Kafka, Sqoop, Jenkins, Unix Shell Scripting, GitHub, SQL, Tableau, Power BI.
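The multi-format PySpark read/write work referenced above can be illustrated with a minimal sketch. The HDFS paths are placeholders; Avro would additionally require the spark-avro package, which is omitted here.

```python
# Minimal PySpark sketch: read raw JSON from HDFS and persist it in columnar formats.
# Paths are placeholders, not actual project locations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-conversion").getOrCreate()

events = spark.read.json("hdfs:///data/raw/events/")
events.write.mode("overwrite").parquet("hdfs:///data/curated/events_parquet/")
events.write.mode("overwrite").orc("hdfs:///data/curated/events_orc/")
```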
Data Analyst
Confidential
Responsibilities:
- Collaborated with multiple business units to gather requirements and develop data reporting infrastructure to provide real-time business insights.
- Developed a Power BI data model using datasets from multiple data sources (Postgres, Oracle, FAST) by establishing meaningful relationships between them, improving performance by 60%.
- Presented data insights to executive stakeholders using dashboards to provide them with data-based recommendations.
- Modeled a new data mart on SQL server to hold subset of data from multiple data sources by diverse use of SQL queries with working knowledge of RDBMS and ETL tools.
- Built the infrastructure required for optimal extraction, transformation and loading of data from a wide variety of data sources using SQL.
- Collaborated with QA analysts to troubleshoot data quality issues and worked with data source owners to resolve them; the monitoring dashboard helped improve data quality by 80%.
- Utilized Excel features (pivot tables, VLOOKUP) to reconcile account values for the Actuarial department against the Policy Management System, reducing the risk of invalid cash disbursements by 85%.
- Used REST API endpoints to interact with Power BI web services and automate report deployment across different environments, increasing efficiency by 40% (see the sketch after this section).
Environment: Tableau, Azure, Informatica, Oracle server, PL/SQL, Linux, Python.
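The Power BI REST API automation referenced above can be illustrated with a minimal sketch. It assumes an Azure AD access token has already been acquired, and the workspace and dataset IDs are placeholders; the endpoints follow the public Power BI REST API and should be verified against the current documentation.

```python
# Minimal sketch: list reports in a workspace and trigger a dataset refresh via the Power BI REST API.
# Token, workspace ID, and dataset ID are placeholders.
import requests

ACCESS_TOKEN = "<azure-ad-access-token>"
WORKSPACE_ID = "<workspace-id>"
DATASET_ID = "<dataset-id>"
BASE_URL = "https://api.powerbi.com/v1.0/myorg"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# List the reports deployed in the target workspace.
reports = requests.get(f"{BASE_URL}/groups/{WORKSPACE_ID}/reports", headers=HEADERS)
reports.raise_for_status()
for report in reports.json().get("value", []):
    print(report["name"])

# Trigger a refresh of the dataset that backs those reports.
refresh = requests.post(
    f"{BASE_URL}/groups/{WORKSPACE_ID}/datasets/{DATASET_ID}/refreshes",
    headers=HEADERS,
)
refresh.raise_for_status()
```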