Senior Data Engineer Resume
Boston, MA
SUMMARY
- 10+ years of IT experience in software development, including 5+ years as a Big Data/Hadoop Developer with strong knowledge of the Hadoop framework.
- Expertise in Hadoop architecture and various components such as HDFS, YARN, High Availability, Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce programming paradigm.
- Experience with all aspects of development from initial implementation and requirement discovery, through release, enhancement and support (SDLC & Agile techniques).
- 4+ years of experience in design, development, data migration, testing, support, and maintenance using Redshift databases.
- 4+ years of experience with Apache Hadoop technologies such as HDFS, the MapReduce framework, Hive, Pig, PySpark, Sqoop, Oozie, HBase, Spark, Scala, and Python.
- 3+ years of experience in AWS cloud solution development using Lambda, SQS, SNS, Dynamo DB, Athena, S3, EMR, EC2, Redshift, Glue, and CloudFormation.
- Experience in using Microsoft Azure SQL database, Data Lake, Azure ML, Azure data factory, Functions, Databricks and HDInsight.
- Working experience in big data on the cloud using AWS EC2 and Microsoft Azure; handled Redshift and DynamoDB databases holding large volumes of data (300 TB).
- Extensive experience in migrating on premise Hadoop platforms to cloud solutions using AWS And Azure.
- 3+ years of experience writing Python ETL frameworks and PySpark jobs to process large volumes of data daily.
- Strong experience implementing data models and loading unstructured data using HBase, DynamoDB, and Cassandra.
- Created multiple report dashboards, visualizations, and heat maps using Tableau, QlikView, and Qlik Sense reporting tools.
- Strong experience extracting and loading data with complex business logic using Hive from different data sources, and built ETL pipelines that process terabytes of data daily.
- Experienced in transporting, and processing real time event streaming using Kafka and Spark Streaming.
- Hands on experience with importing and exporting data from Relational databases to HDFS, Hive and HBase using Sqoop.
- Experienced in processing real-time data using Kafka 0.10.1 producers and stream processors; implemented stream processing with Kinesis, landing data into the S3 data lake.
- Experience in implementing multitenant models for the Hadoop 2.0 Ecosystem using various big data technologies.
- Designed and developed Spark pipelines to ingest real-time event-based data from Kafka and other message queue systems, and processed huge volumes of data with Spark batch processing into the Hive data warehouse.
- Experienced in creating and analyzing Software Requirement Specifications (SRS) and Functional Specification Document (FSD).
- Excellent working experience in Scrum / Agile framework, Iterative and Waterfall project execution methodologies.
- Designed data models for both OLAP and OLTP applications using Erwin and used both star and snowflake schemas in the implementations.
- Capable of organizing, coordinating and managing multiple tasks simultaneously.
- Excellent communication and inter-personal skills, self-motivated, organized and detail-oriented, able to work well under deadlines in a changing environment and perform multiple tasks effectively and concurrently.
- Strong analytical skills with ability to quickly understand client’s business needs. Involved in meetings to gather information and requirements from the clients.
TECHNICAL SKILLS
Hadoop: Hadoop, Spark (PySpark), MapReduce, Hive, Pig, Impala, Sqoop, HDFS, HBase, Oozie, Ambari, Scala, and MongoDB
Cloud Technologies: AWS Kinesis, Lambda, EMR, EC2, SNS, SQS, Dynamo DB, Step Functions, Glue, Athena, CloudWatch, Azure Data Factory, Azure Data Lake, Functions, Azure SQL Data Warehouse, Databricks and HDInsight
DBMS: Amazon Redshift, Postgres, Oracle 9i, SQL Server, IBM DB2, and Teradata
ETL Tools: DataStage, Talend, and Ab Initio
Reporting Tools: Power BI, Tableau, TIBCO Spotfire, QlikView, and Qlik Sense
Deployment Tools: Git, Jenkins, Terraform and CloudFormation
Programming Language: Python, Scala, PL/SQL and Java
Scripting: Unix Shell and Bash scripting
PROFESSIONAL EXPERIENCE
Confidential, Boston, MA
Senior Data Engineer
Responsibilities:
- Built data pipelines in Airflow on GCP for ETL jobs using Airflow operators.
- Created Spark data pipelines using GCP Dataproc.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Developed a framework to generate daily ad-hoc reports and extract data from BigQuery.
- Designed advanced analytical models and coordinated with the data science team to implement them on the Hadoop cluster over large datasets.
- Wrote Hive SQL scripts to create complex tables with performance features such as partitioning, clustering, and skewing.
- Read data from BigQuery into pandas or Spark DataFrames for advanced ETL capabilities.
- Created BigQuery views for row-level security and for exposing data to other teams.
- Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing analysis of BigQuery usage.
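The daily ad-hoc reporting bullet above can be illustrated with a minimal sketch. The table and column names (`analytics.events`, `event_ts`) are hypothetical; in practice the rendered SQL would be handed to a driver such as the google-cloud-bigquery client for execution.

```python
from datetime import date

# Hypothetical daily report template; the dataset/table and columns are
# illustrative assumptions, not taken from the projects described above.
REPORT_TEMPLATE = """\
SELECT event_type, COUNT(*) AS event_count
FROM `analytics.events`
WHERE DATE(event_ts) = '{report_date}'
GROUP BY event_type
ORDER BY event_count DESC"""

def build_daily_report_query(report_date: date) -> str:
    """Render the ad-hoc report SQL for a specific report date."""
    return REPORT_TEMPLATE.format(report_date=report_date.isoformat())

query = build_daily_report_query(date(2023, 5, 1))
```

Keeping the SQL as a parameterized template lets one framework serve many report dates without hand-editing queries.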
Confidential, Charlotte, NC
Data Engineer
Responsibilities:
- Wrote ETL jobs using Spark data pipelines to process data from different sources and transform it for multiple targets.
- Created streams using Spark, processed real-time data into RDDs and DataFrames, and created analytics using PySpark SQL.
- Created PySpark DataFrames to bring data from DB2 to Amazon S3.
- Optimized PySpark jobs to run on the EMR cluster for faster data processing.
- Designed a Redshift-based data delivery layer for business intelligence tools to operate directly on AWS S3.
- Implemented Kinesis data streams to read real-time data and load it into S3 for downstream processing.
- Set up AWS infrastructure on EC2 and implemented the S3 API for accessing S3 bucket data files.
- Designed “Data Services” to intermediate data exchange between the Data Clearinghouse and the Data Hubs.
- Wrote ETL flows and MapReduce jobs to process data from AWS S3 into DynamoDB and HBase.
- Involved in the ETL phase of the project; designed and analyzed data in Oracle and migrated it to Redshift and Hive.
- Created databases and tables in Redshift and DynamoDB, and wrote complex EMR scripts to process terabytes of data into AWS S3.
- Performed real-time analytics on transactional data using Python to create statistical models for predictive and reverse product analysis.
- Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables.
- Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
- Participated in client meetings, explained design views, and supported requirements gathering.
- Worked in an Agile methodology to understand the requirements of user stories.
- Prepared High-level design documentation for approval
- Used data visualization tools such as Tableau, QuickSight, and Kibana to bring new insights out of extracted data and represent it more clearly.
- Designed data models for dynamic, real-time data intended for use by various applications with OLAP and OLTP needs.
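The CSV-loading work described above (many files, differing schemas, one target table) boils down to unioning headers and padding missing columns. This pure-Python sketch shows the idea without a Spark cluster; in PySpark the equivalent would be `DataFrame.unionByName(..., allowMissingColumns=True)`.

```python
import csv
import io

def harmonize(csv_texts):
    """Merge CSV documents with different headers into one row list.

    Returns the unified column list (in first-seen order) and the rows,
    each padded with None for columns the source file lacked.
    """
    rows, columns = [], []
    for text in csv_texts:
        for record in csv.DictReader(io.StringIO(text)):
            for col in record:
                if col not in columns:
                    columns.append(col)
            rows.append(record)
    # Re-emit every row against the full, unified column set.
    return columns, [{c: r.get(c) for c in columns} for r in rows]

cols, data = harmonize(["id,name\n1,ann\n", "id,age\n2,34\n"])
```

After harmonizing, every row shares one schema and can be written to a single ORC-backed Hive table.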
Confidential, Herndon, VA
Cloud Support Engineer
Responsibilities:
- Assisted in incident management, including problem resolution and tracking.
- Maintained up-to-date knowledge of IT technologies and developments.
- Provided proactive communication to the customer base for wide-scale, service-affecting problems.
- Provided fast, value-add responses to inbound customer tickets, acknowledging receipt and communicating next steps through both written and verbal channels.
- Utilized monitoring tools to proactively identify problems with systems, applications, and networks.
Confidential, Auburn Hills, MI
Big Data Engineer/Hadoop Developer
Responsibilities:
- Handled the full project life cycle from design, analysis, and logical and physical architecture modeling through development, implementation, and testing.
- Conferred with data scientists and other qlikstream developers to obtain information on limitations or capabilities for data processing projects.
- Designed and developed automation test scripts using Python
- Creating Data Pipelines using Azure Data Factory.
- Automating the jobs using Python.
- Creating tables and loading data in the Azure MySQL database.
- Creating Azure Functions, Logic Apps for Automating the Data pipelines using Blob triggers.
- Analyzed SQL scripts and designed solutions to implement them using PySpark.
- Developed Spark code using Python (PySpark) for faster processing and testing of data.
- Used the Spark API to perform analytics on data in Hive.
- Optimized and tuned Hive and Spark queries using data layout techniques such as partitioning, bucketing, and other advanced techniques.
- Performed data cleansing, integration, and transformation using Pig.
- Involved in exporting and importing data from local file system and RDBMS to HDFS
- Designing and coding the pattern for inserting data into Data lake.
- Moving the data from On-Prem HDP clusters to Azure
- Building, installing, upgrading or migrating petabyte size big data systems
- Fixing Data related issues
- Loaded data into the DB2 database using DataStage.
- Monitored big data and messaging systems such as Hadoop, Kafka, and Kafka MirrorMaker to ensure they operate at peak performance at all times.
- Created Hive tables and loaded and analyzed data using Hive queries.
- Communicated regularly with the business teams to ensure that any gaps between business requirements and technical requirements were resolved.
- Read and translated data models, queried data, identified data anomalies, and provided root-cause analysis.
- Support "Qlik Sense" reporting, to gauge performance of various KPIs/facets to assist top management in decision-making.
- Engage in project planning and delivering to commitments.
- Conducted POCs on new technologies available in the market (e.g., Snowflake) to determine the best fit for the organization's needs.
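One of the Hive tuning techniques mentioned above, bucketing, can be sketched concretely. With Hive's original (version 1) bucketing scheme, a string key is hashed with Java's `String.hashCode` and taken modulo the bucket count; newer Hive versions default to Murmur hashing. The re-implementation below is for illustration only, not Hive's actual source.

```python
def java_string_hashcode(s: str) -> int:
    """Re-implement Java's String.hashCode (h = 31*h + char) in Python."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Interpret the result as a signed 32-bit integer, as Java does.
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a string key to a bucket the way Hive (bucketing v1) does."""
    # Masking with Integer.MAX_VALUE keeps the bucket index non-negative.
    return (java_string_hashcode(key) & 0x7FFFFFFF) % num_buckets
```

Because the same key always lands in the same bucket, joins on the clustering column can be executed bucket-by-bucket instead of shuffling the whole table.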
Confidential, McLean, VA
Data Warehouse Architect - Hadoop Developer/SQL Developer
Responsibilities:
- Set up and built AWS infrastructure with various services by writing CloudFormation templates (CFTs) in JSON and YAML.
- Developed CloudFormation scripts to build EC2 instances on demand.
- Using IAM, created roles, users, and groups and attached policies to grant least-privilege access to resources.
- Updated bucket policies with IAM roles to restrict user access.
- Configured AWS Identity and Access Management (IAM) groups and users for improved login authentication.
- Created topics in SNS to send notifications to subscribers as per the requirement.
- Involved in full life cycle of the project from Design, Analysis, logical and physical architecture modeling, development, Implementation, testing.
- Moving data from Oracle to HDFS using Sqoop
- Data profiling on critical tables from time to time to check for the abnormalities
- Created Hive tables, loaded transactional data from Oracle using Sqoop, and worked with highly unstructured and semi-structured data.
- Developed MapReduce (YARN) jobs for cleaning, accessing, and validating the data.
- Created and ran Sqoop jobs with incremental load to populate Hive external tables.
- Wrote scripts to distribute queries for performance test jobs in the Amazon data lake.
- Developed optimal strategies for distributing web log data over the cluster, importing and exporting the stored web log data into HDFS and Hive using Sqoop.
- Installed and configured Apache Hadoop on multiple nodes in AWS EC2.
- Developed Pig Latin scripts to replace the existing legacy process on Hadoop, feeding the data to AWS S3.
- Worked on CDC (Change Data Capture) tables using a Spark application to load data into dynamic-partition-enabled Hive tables.
- Designed and developed automation test scripts using Python
- Integrated Apache Storm with Kafka to perform web analytics and to perform click stream data from Kafka to HDFS.
- Analyzed the SQL scripts and designed the solution to implement using Pyspark
- Implemented Hive GenericUDFs to incorporate business logic into Hive queries.
- Responsible for developing data pipeline with Amazon AWS to extract the data from weblogs and store in HDFS.
- Uploaded streaming data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
- Supported data analysis projects using Elastic MapReduce on the Confidential (AWS) cloud; performed export and import of data into S3.
- Designed the HBase row key to store text and JSON as key values, structuring the key so rows can be retrieved and scanned in sorted order.
- Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Map-Reduce, Pig, Hive, and Sqoop) as well as system specific jobs (such as Java programs and shell scripts).
- Creating Hive tables and working on them using Hive QL.
- Designed and Implemented Partitioning (Static, Dynamic) Buckets in HIVE.
- Developed multiple POCs using PySpark, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Developed syllabus/curriculum data pipelines from syllabus/curriculum web services to HBase and Hive tables.
- Monitored workload, job performance and capacity planning using Cloudera Manager.
- Built applications using Maven and integrated with CI servers such as Jenkins to build jobs.
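The incremental Sqoop loads described above hinge on a simple watermark contract: pull only rows whose modification time exceeds the last recorded check value, then advance that value. This pure-Python sketch (field names are illustrative) mirrors what Sqoop's `--incremental lastmodified` mode does against a source table.

```python
def incremental_import(rows, last_value):
    """Return rows changed since last_value, plus the advanced watermark.

    `rows` is the source table as dicts; `last_value` is the previous
    check value (ISO date strings compare correctly as plain strings).
    """
    new_rows = [r for r in rows if r["last_modified"] > last_value]
    # If nothing changed, keep the old watermark rather than regressing it.
    new_watermark = max((r["last_modified"] for r in new_rows), default=last_value)
    return new_rows, new_watermark

source_table = [
    {"id": 1, "last_modified": "2020-01-01"},
    {"id": 2, "last_modified": "2020-02-01"},
    {"id": 3, "last_modified": "2020-03-01"},
]
pulled, watermark = incremental_import(source_table, "2020-01-15")
```

Persisting the watermark between runs (Sqoop stores it in the saved job metastore) is what makes each load pull only the delta.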
Confidential
Hadoop Analyst
Responsibilities:
- Participated in SDLC Requirements gathering, Analysis, Design, Development and Testing of application developed using AGILE methodology.
- Developed managed, external, and partitioned tables per requirements.
- Ingested structured data into appropriate schemas and tables to support rules and analytics.
- Developed custom user-defined functions (UDFs) in Hive to transform large volumes of data per business requirements.
- Developed Pig scripts, Pig UDFs, Hive scripts, and Hive UDFs to load data files.
- Responsible for building scalable distributed data solutions using Hadoop.
- Involved in loading data from edge node to HDFS using shell scripting
- Implemented scripts for loading data from the UNIX file system to HDFS.
- Load and transform large sets of structured, semi structured and unstructured data.
- Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
- Actively participated in Object Oriented Analysis Design sessions of the Project, which is based on MVC Architecture using Spring Framework.
- Developed the presentation layer using HTML, CSS, JSPs, Bootstrap, and AngularJS.
- Adopted J2EE design patterns like DTO, DAO, Command and Singleton.
- Implemented object-relational mapping in the persistence layer using the Hibernate framework in conjunction with Spring functionality.
- Generated POJO classes to map to the database table.
- Configured Hibernate's second-level cache using EhCache to reduce the number of hits to the configuration table data.
- Used the ORM tool Hibernate to represent entities and tune fetching strategies for optimization.
- Implementing the transaction management in the application by applying Spring Transaction and Spring AOP methodologies.
- Wrote SQL queries and stored procedures for the application to communicate with the database.
- Used the JUnit framework for unit testing of the application.
- Used Maven to build and deploy the application.
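For the per-row transformations described above, a lightweight alternative to a Java Hive UDF is Hive's `TRANSFORM` streaming interface, which pipes tab-delimited rows through an external script (`SELECT TRANSFORM(...) USING 'python clean.py' ...`). The column layout below (an id plus a raw amount field) is an assumption for illustration.

```python
import sys

def transform_line(line: str) -> str:
    """Normalize one tab-delimited row: trim fields, default bad amounts to 0."""
    row_id, raw_amount = line.rstrip("\n").split("\t")
    try:
        amount = f"{float(raw_amount.strip()):.2f}"
    except ValueError:
        # Unparseable amounts (e.g. "N/A") are coerced to zero.
        amount = "0.00"
    return f"{row_id.strip()}\t{amount}"

if __name__ == "__main__":
    # Hive streams rows on stdin and reads transformed rows from stdout.
    for line in sys.stdin:
        print(transform_line(line))
```

Keeping the logic in a pure function makes the script unit-testable outside Hive while the `__main__` block handles the streaming contract.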
Confidential
Java Developer
Responsibilities:
- Participated in gathering business requirements, analyzing the project and creating use Cases and Class Diagrams.
- Interacted and coordinated with the design team, business analysts, and end users of the system.
- Created sequence diagrams, collaboration diagrams, class diagrams, use cases and activity diagrams using Rational Rose for the Configuration, Cache & logging Services.
- Implemented a Tiles-based framework to present layouts to the user; created the web UI using Struts, JSPs, servlets, and custom tags.
- Designed and developed Caching and Logging service using Singleton pattern, Log4j.
- Coded different Action classes in Struts and maintained deployment descriptors such as struts-config.xml, ejb-jar.xml, and web.xml.
- Used JSP, JavaScript, custom tag libraries, Tiles, and validations provided by the Struts framework.
- Wrote authentication and authorization classes and managed them in the front controller for all users according to their entitlements.
- Developed and deployed Session Beans and Entity Beans for database updates.
- Implemented caching techniques, wrote POJO classes for storing data and DAOs to retrieve it, and performed other database configurations using EJB 3.0.
- Developed stored procedures and complex packages extensively using PL/SQL and shell programs.
- Used the Struts Validator framework for all front-end validations of form entries.
- Developed SOAP based Web Services for Integrating with the Enterprise Information System Tier.
- Design and development of JAXB components for transfer objects.
- Prepared EJB deployment descriptors using XML.
- Involved in Configuration and Usage of Apache Log4J for logging and debugging purposes.
- Wrote Action Classes to service the requests from the UI, populate business objects & invoke EJBs.
- Used JAXP (DOM, XSLT) and XSD for XML data generation and presentation.