Lead Data Engineer Resume
Columbus, GA
SUMMARY
- 12 years of IT experience in data engineering and analytics.
- Worked on distributed and cloud-based big data technologies including Apache Hadoop, Spark, PySpark, AWS, and Oracle.
- Proficient in programming languages including Python, Scala, and Java.
- Expertise in building ETL frameworks using various tools and technologies.
- Worked with a variety of data storage formats such as Parquet, ORC, Avro, XML, JSON, XLS, and CSV, covering structured, semi-structured, and unstructured data.
- Good exposure to supervised and unsupervised learning, natural language processing (NLP), and mathematical and statistical methods.
- Developed data warehouse solutions in Hadoop using HDFS, Hive, Pig, Sqoop, HBase, Oozie, Cloudera Hue, Cloudera Manager, Scala, Spark, Python, Java, Impala, Ambari, and Ranger.
- Developed cloud-based solutions using AWS Redshift, Glue, Lambda, Athena, S3, and Redshift Spectrum, as well as Azure Data Factory, Synapse, and Databricks.
- Proficient in other tools and technologies such as Oracle 11g/12c, SQL Server, MySQL, and Netezza.
- Experienced in designing, documenting, and implementing data warehouse strategies, including building ETL, ELT, and data pipeline processes.
- Collaborated with stakeholders such as product managers, architects, analysts, and project managers to deliver solutions.
- Thorough understanding of software lifecycle management, following best practices throughout.
- Supervised team activities including work scheduling, technical direction, and standard development practices.
- Believing in continuous improvement, developed multiple frameworks across my career to improve team efficiency.
TECHNICAL SKILLS
Big Data Ecosystems: Apache Hadoop, MapReduce, Spark, HDFS, HBase, Hive, Pig, Sqoop, Oozie, Kafka
Cloud Ecosystems: AWS EC2, Redshift, Glue, Lambda, Athena, S3, EMR, Spectrum, Azure Data Factory, Synapse, Databricks
Languages: Python, Scala, Java, PL/SQL
Machine learning: Scikit-learn, Pandas, Matplotlib, NumPy, NLTK
Databases: Oracle, Netezza, RedShift, SQL Server, MySQL
NoSQL Database: HBase, Elastic Search
Operating Systems: Windows, Red Hat Linux
Tools Used: Spyder, PyCharm, Toad, IntelliJ, Anaconda, Atlan Data Catalog
Streaming Tools (Real-time): Kafka, NiFi
Version Controls: SVN, TFS, Mercurial, Bitbucket, Git
Data Processing: Structured, semi-structured, and unstructured data
PROFESSIONAL EXPERIENCE
Confidential, Columbus, GA
Lead Data Engineer
Responsibilities:
- As the primary resource, understood the complete project and data architecture and worked closely with the client team to improve data collection, quality, reporting, and analytics capabilities.
- Built data pipelines using AWS Glue, Glue Data Catalog, S3, Athena, and Lambda (see the sketch below).
- Built ETL data pipelines using Azure Blob Storage, Data Factory, Synapse, and Azure Databricks.
- Analyzed the complexity of the existing system and proposed new solutions to run the pipelines more efficiently.
- Optimized existing pipelines, reducing operational cost and improving scalability and usability.
Technologies: AWS Glue, Glue Data Catalog, S3, Athena, Lambda, MySQL, Azure Blob Storage, Data Factory, Synapse, Azure Databricks, Python, PySpark, Bitbucket, Spyder, Atlan Data Catalog
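
A minimal sketch of the Glue-based pipeline pattern described above, assuming a simple batch job: read a table registered in the Glue Data Catalog, filter it, and write Parquet back to S3 for Athena to query. The database, table, column, and bucket names here are hypothetical placeholders, not the actual project configuration.

```python
import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job bootstrap
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a source table registered in the Glue Data Catalog
# ("sales_db" and "raw_orders" are hypothetical names)
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Keep only completed orders (illustrative transformation)
completed = Filter.apply(frame=orders, f=lambda r: r["status"] == "COMPLETED")

# Write curated Parquet to S3 so Athena can query it
glue_context.write_dynamic_frame.from_options(
    frame=completed,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/orders/"},
    format="parquet",
)

job.commit()
```

Once the curated S3 location is crawled or registered in the Data Catalog, Athena can query the Parquet output directly.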
Confidential, Burlington, MA
Sr Specialist
Responsibilities:
- Served as lead engineer and analyst; designed the end-to-end data lake architecture and helped analyze employee flight-risk and learning models.
- Proposed, designed, and developed a generic plug-and-play ETL framework on PySpark, enabling developers to configure and run any new pipeline quickly and efficiently and reducing development effort by more than 50% (see the sketch below).
- Helped the team develop a Python-based data validation utility to detect data issues at an early stage of the pipeline.
- Performed a variety of tasks to facilitate project completion, including coordinating with different teams and helping resolve complex production issues.
- Strategized an efficient migration from Hadoop to the AWS cloud and Palantir platforms.
- Helped the team migrate data from Hadoop to the AWS and Palantir platforms in an efficient manner.
Technologies: Hadoop, PySpark, HDFS, Hive, Sqoop, Oozie, Bitbucket, PyCharm, AWS Glue, Lambda, Athena, S3, EMR, Spectrum, Elastic Search, Scikit-learn, Pandas, NumPy, NLTK, Data modeling, Dataiku, Palantir, Ambari
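
A rough illustration of how a configuration-driven ETL framework like the one above could look: each pipeline is described by a JSON file (sources, a SQL transformation, and a target), and one generic PySpark driver executes it. The config keys, file names, and paths are assumptions made for this example, not the framework's actual interface.

```python
import json
from pyspark.sql import SparkSession


def run_pipeline(config_path: str) -> None:
    """Run a pipeline entirely described by a JSON config file (hypothetical schema)."""
    with open(config_path) as f:
        config = json.load(f)

    spark = SparkSession.builder.appName(config.get("name", "generic_etl")).getOrCreate()

    # Register each configured source as a temporary view
    for source in config["sources"]:
        df = (
            spark.read.format(source["format"])
            .options(**source.get("options", {}))
            .load(source["path"])
        )
        df.createOrReplaceTempView(source["view"])

    # Apply the configured SQL transformation
    result = spark.sql(config["transform_sql"])

    # Write the result to the configured target
    target = config["target"]
    result.write.mode(target.get("mode", "overwrite")).format(target["format"]).save(target["path"])


if __name__ == "__main__":
    # Hypothetical config file; a real deployment would pass this path as an argument
    run_pipeline("pipeline_config.json")
```

With this pattern, adding a new pipeline means writing a new JSON file rather than new Spark code, which is where the reduction in development effort comes from.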
Confidential
Sr Data Engineer
Responsibilities:
- Helped migrate legacy DataStage pipelines to the Hadoop and Spark ecosystem efficiently, together with the team.
- Championed and implemented a Python-based file monitoring framework for all non-Hadoop pipelines, saving 5 hours of operations time per week.
- Built customer sessionization logic on coupon-website clickstream data using PySpark, which eliminated the need for an HBase cluster (see the sketch below).
- Developed a configurable, generic PySpark utility that generates XML reports from a JSON configuration file, serving different clients based on configuration.
Technologies: Hadoop, PySpark, HDFS, Hive, Sqoop, Oozie, Mercurial, PyCharm, Python, Pandas, Apache Solr, Unix Shell scripting, Cloudera Manager
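
Clickstream sessionization of the kind mentioned above is commonly done in PySpark with window functions: order each user's events by time, start a new session whenever the gap to the previous event exceeds a timeout, and take a running sum of those session breaks as the session id. The column names, paths, and 30-minute timeout below are assumptions for illustration only.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("sessionization").getOrCreate()

# Hypothetical clickstream input with user_id and event_time (timestamp) columns
clicks = spark.read.parquet("s3://example-bucket/clickstream/")

SESSION_TIMEOUT_SEC = 30 * 60  # assume a 30-minute inactivity timeout

w = Window.partitionBy("user_id").orderBy("event_time")

sessions = (
    clicks
    # Seconds since the user's previous event
    .withColumn("prev_time", F.lag("event_time").over(w))
    .withColumn(
        "gap_sec",
        F.unix_timestamp("event_time") - F.unix_timestamp("prev_time"),
    )
    # A new session starts on the first event or after a long gap
    .withColumn(
        "new_session",
        F.when(
            F.col("gap_sec").isNull() | (F.col("gap_sec") > SESSION_TIMEOUT_SEC), 1
        ).otherwise(0),
    )
    # Running sum of session breaks gives a per-user session number
    .withColumn("session_id", F.sum("new_session").over(w))
)

sessions.write.mode("overwrite").parquet("s3://example-bucket/sessions/")
```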
Confidential
Sr Database Engineer
Responsibilities:
- Provided excellent leadership by recommending the right technologies and solutions for a given use case.
- Provided technical support to resolve or assist in resolution of issues relating to production systems.
- Involved in requirements gathering and designing end-to-end project architecture in Hadoop.
- Built and automated a report using MS SSIS, reducing manual effort by 2 hours per week.
- Involved in migrating data pipelines from Netezza to the AWS environment.
- Initiated multiple process improvement activities.
Technologies: Hadoop, Spark, Scala, HDFS, Hive, Pig, HBase, Sqoop, Oozie, SVN, IntelliJ, Unix Shell scripting, Data modeling, AWS Glue, S3, Redshift, MS SSIS, SQL Server
Confidential, Danbury, CT
Data Engineer
Responsibilities:
- Built a data lake in Hadoop, migrating pipelines and data from Netezza to Hive and Pig scripts.
- Worked on the top revenue-generating analytical project, which analyzes trends in pharmaceutical products and market segments.
- Built a cost-effective, generic pipeline and data model that accommodates multiple client reports on a single data platform.
- Worked on performance improvement of data pipelines and received an out-of-the-box-thinker award.
- Provided excellent leadership by recommending the right technologies and solutions for a given use case.
- Designed best practices to support continuous process automation for data ingestion and data pipeline workflows.
- Prepared and presented reports, analyses, and presentations to various stakeholders, including executives.
Technologies: Oracle PLSQL, Netezza, Hadoop, Python, HDFS, Hive, Pig, Sqoop, Unix Shell scripting, Data modeling, Toad, AWS Redshift, S3
Confidential
Software Engineer
Responsibilities:
- Involved in the development of an in-house tool called Mech Bench to provide access to all Honeywell and external employees.
- Gained very good exposure to OLTP systems and processes.
- Built efficient backend logic to handle data-heavy activities.
- Worked on advanced concepts like arrays, triggers, materialized views, and temporary tables.
- Worked on performance tuning and code refactoring across multiple projects.
Technologies: Oracle PLSQL, Toad, SVN, Data modeling, Unix Shell scripting, Oracle Database administration.