Sr. Data Engineer Resume
Dania Beach, FL
SUMMARY
- 7+ years of technical IT experience in all phases of the Software Development Life Cycle (SDLC), with skills in data analysis, design, development, testing, and deployment of software systems.
- Strong experience across the SDLC, including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
- Strong experience writing data-analysis scripts using the Python, PySpark, and Spark APIs.
- More than 5 years of industry experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark/PySpark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spark integration with Cassandra, and ZooKeeper.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience developing Spark applications using Spark SQL and PySpark in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Hands-on experience with Hadoop ecosystem components such as Spark/PySpark, SQL, Hive, Pig, Sqoop, Flume, ZooKeeper/Kafka, HBase, and MapReduce.
- Experience converting SQL queries into Spark transformations using Spark/PySpark RDDs in Scala and Python, including map-side joins on RDDs (a minimal sketch appears after this summary).
- Experience importing and exporting data with Sqoop between HDFS and relational (RDBMS) and non-relational database systems.
- Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step functions, CloudWatch, SNS, DynamoDB, SQS.
- Proficiency in multiple databases like MongoDB, Cassandra, MySQL, ORACLE, and MS SQL Server.
- Excellent programming skills in Scala, Java, and Python.
- Experience with Hive partitioning and bucketing, performing joins on Hive tables, and implementing Hive SerDes.
- Worked with different file formats such as delimited files, Avro, JSON, and Parquet.
- Experience developing Kafka producers and consumers for streaming millions of events per second.
- Hands-on experience designing and developing Spark applications in Scala and PySpark, including comparing Spark performance against Hive and SQL/Oracle.
- Solid experience working with CSV, text, sequence, Avro, Parquet, ORC, and JSON data formats.
- Experience in writing complex SQL queries, creating reports and dashboards.
- Ability to tune Big Data solutions to improve performance and end-user experience.
- Managed multiple tasks and worked under tight deadlines in a fast-paced environment.
- Excellent analytical and communication skills that help in understanding business logic and building good relationships between stakeholders and team members.
- Strong communication skills, analytic skills, good team player and quick learner, organized and self-motivated.
- Good knowledge of and experience with Google Cloud Platform (GCP).
- Worked on GCP with IAM roles and application migrations, using Cloud Storage, BigQuery, Dataflow, and Dataproc.
- Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, coordinating tasks among the team.
- Experience with Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK.
- Developed and deployed Spark and Scala code on a Hadoop cluster running on GCP.
- Migrated previously written cron jobs to Airflow on GCP.
- Experience in moving data between GCP and Azure using Azure Data Factory.
- Experience creating complex data pipeline processes using T-SQL scripts, SSIS packages, Alteryx workflows, PL/SQL scripts, cloud REST APIs, Python scripts, GCP Composer, and GCP Dataflow.
- Knowledge of Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud Dataproc, Cloud Pub/Sub, Cloud SQL, BigQuery, Stackdriver monitoring, and Cloud Deployment Manager.
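Illustrative sketch for the SQL-to-Spark conversion and map-side (broadcast) joins noted above; the table names, columns, and paths are hypothetical placeholders rather than project code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-to-spark-sketch").getOrCreate()

# Hypothetical inputs standing in for Hive/RDBMS tables
orders = spark.read.parquet("/data/orders")        # order_id, customer_id, amount
customers = spark.read.parquet("/data/customers")  # customer_id, region

# SQL equivalent:
#   SELECT c.region, SUM(o.amount) AS total_amount
#   FROM orders o JOIN customers c ON o.customer_id = c.customer_id
#   GROUP BY c.region
# Broadcasting the small dimension table yields a map-side (broadcast hash) join.
totals = (orders
          .join(F.broadcast(customers), "customer_id")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_amount")))

totals.show()
```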
TECHNICAL SKILLS
Hadoop/Big Data Technologies: HDFS, Hive, Pig, Sqoop, Yarn, Spark, Spark SQL, Kafka
Hadoop Distributions: Hortonworks and Cloudera Hadoop
Languages: C, C++, Python, Scala, UNIX Shell Script, COBOL, SQL and PL/SQL
Tools: Teradata SQL Assistant, Pycharm, Autosys
Operating Systems: Linux, Unix, z/OS and Windows
Databases: Teradata, Oracle 9i/10g, DB2, SQL Server, MySQL 4.x/5.x
ETL Tools: IBM InfoSphere Information Server V8, V8.5 & V9.1
Reporting: Tableau
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Dania Beach, FL
Responsibilities:
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
- Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Used Azure Event Grid as a managed event service to route events across many different Azure services and applications.
- Used Delta Lake for scalable metadata handling and unified streaming and batch processing.
- Optimized Apache Spark clusters using Delta Lake.
- Used Spark SQL to load data, created schema RDDs on top of it, loaded them into Hive tables, and handled structured data with Spark SQL.
- Involved in converting HQL queries into Spark transformations using Spark RDDs with Python and Scala.
- Developed JSON scripts for deploying ADF pipelines that run batch workloads through the Azure Databricks workspace.
- Worked on creating tabular models in Azure Analysis Services to meet business reporting requirements.
- Have good experience working with Azure Blob and Data Lake Storage and loading data into Azure Synapse Analytics (SQL DW).
- Created on-demand tables on S3 files with Lambda functions and AWS Glue using Python and PySpark.
- Built an ETL process that uses a Spark JAR to execute the business analytical model.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Set up and installed Azure Databricks, Azure Data Factory, Azure Data Lake, and Delta Lake.
- Performed data validation with record-wise count comparisons between source and destination.
- Scheduled and automated workflows using Airflow to map data from source to destination.
- Worked on the data support team handling bug fixes, schedule changes, memory tuning, schema changes, and loading of historical data.
- Implemented checkpoints such as Hive count checks, Sqoop record checks, done-file creation checks, done-file checks, and touch-file lookups.
- Used Delta Lake time travel, where data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments (see the brief sketch after this role's environment list).
- Delta Lake supports merge, update, and delete operations to enable complex use cases.
- Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
- Used Databricks to integrate easily with the whole Microsoft stack.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Used Azure Data Catalog to organize data assets and get more value from existing investments.
- Performance-tuned Spark applications by setting the right batch interval, the correct level of parallelism, and appropriate memory settings.
- Created partitioned and bucketed Hive tables in Parquet format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables.
- Used Azure Data Factory with the SQL and MongoDB APIs to integrate data from MongoDB, MS SQL, and cloud stores (Blob, Azure SQL DB, Cosmos DB).
- Involved in designing and developing tables in HBase and storing aggregated data from Hive tables.
- Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
- Handled escalations and resource assignments as a team lead for several tasks/tickets in the production environment.
Environment: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Lambda, Glue, Azure Event Grid, Azure Synapse Analytics, Azure Data Catalog, Service Bus, ADF, Delta Lake, Blob, Cosmos DB, Python, PySpark, Java, Scala, SQL, Sqoop v1.4.6, Kafka, Airflow v1.9.0, Oozie, HBase, Oracle, Teradata, Cassandra, Tableau, Maven, Git, Jira.
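Brief sketch of the Delta Lake merge and time-travel usage described above, assuming a Spark session configured with the delta-spark package; the table paths and column names are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

path = "/mnt/datalake/customers_delta"  # hypothetical Delta table path

# Upsert incoming changes into the Delta table (merge / update / insert)
updates = spark.read.parquet("/mnt/datalake/customer_updates")  # hypothetical staging data
target = DeltaTable.forPath(spark, path)
(target.alias("t")
 .merge(updates.alias("u"), "t.customer_id = u.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read an earlier version of the table for audits or rollback checks
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```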
Sr. Data Engineer
Confidential, New York, NY
Responsibilities:
- Worked as a Data Engineer to review business requirements and compose source-to-target data mapping documents.
- Advised the business on Spark SQL best practices while making sure the solution met business needs.
- Participated in requirements sessions to gather requirements along with business analysts and product owners.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as ORC/Parquet/text files in S3 into AWS Redshift (a hedged job sketch follows this role's environment list).
- Performance-tuned tables in Redshift.
- Reviewed explain plans for SQL queries in Redshift.
- Extracted, aggregated, and consolidated Adobe data within AWS Glue using PySpark.
- Worked with Spark SQL to create DataFrames by loading data from Hive tables and stored the results in AWS S3.
- Developed batch scripts to fetch data from AWS S3 and performed transformations in Scala using Spark.
- Created Sqoop Scripts to import and export customer profile data from RDBMS to S3 buckets.
- Built custom Input adapters to migrate click stream data from FTP servers to S3.
- Used Spark Streaming APIs to perform transformations and actions, building a common learner data model that moves data from Kafka to HBase in real time.
- Involved in Big data requirement analysis, develop and design solutions for ETL and Business Intelligence platforms.
- Automated creation and termination of AWS EMR clusters.
- Created monitors, alarms, notifications, logs for Lambda functions, Glue Jobs, EC2 hosts using CloudWatch.
- Utilize AWS EMR and Jupyter notebooks to create PySpark enabled notebooks which perform data wrangling and data manipulation operations.
- Consumed XML messages from Kafka and processed them using Spark Streaming to capture UI updates.
- Migrate data into RV Data Pipeline using DataBricks, Spark SQL and Scala.
- Worked in a Snowflake environment to remove redundancy and loaded real-time data from various sources into HDFS using Kafka.
- Imported data from AWS S3 into Spark RDD, Performed transformations and actions on RDD's.
- Designed and built ETL pipelines in Python to fetch data from the Redshift data warehouse and applications.
- Created PySpark DataFrames to bring data from DB2 to Amazon S3.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Enabled ACID transactions on Spark by using Delta Lake.
- Accessed Hive tables using the Spark HiveContext (Spark SQL) and used Scala for interactive operations.
- Implemented a fully operational, production-grade, large-scale data solution on the Snowflake Data Warehouse.
- Worked on structured/semi-structured data ingestion and processing on AWS using S3 and Python, and migrated on-premises big data workloads to AWS.
- Wrote Python scripts to parse XML, JSON documents and load the data in database.
- Involved in preparing SQL and PL/SQL coding conventions and standards.
- Involved in Data mapping specifications to create and execute detailed system test plans.
Environment: Agile, ODS, OLTP, ETL, HDFS, Kafka, AWS, S3, Python, K-means, XML, SQL, Talend, Redshift, Glue, Delta Lake, Lambda, MS SQL, Cosmos DB, MongoDB, Ambari, PowerBI, Azure DevOps, Ranger, Git, Spark, Hive, Scala, PySpark.
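Hedged sketch of an AWS Glue PySpark job of the kind described above, moving campaign data from S3 into Redshift; the bucket, connection, and table names are placeholders, and the real jobs' transformations are omitted.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job bootstrap
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read campaign files from S3 (bucket and path are placeholders)
campaigns = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/campaigns/"]},
    format="parquet",
)

# Project-specific cleansing and aggregation would happen here

# Load into Redshift through a pre-defined Glue connection (names are placeholders)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=campaigns,
    catalog_connection="example-redshift-connection",
    connection_options={"dbtable": "analytics.campaigns", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/glue-temp/",
)

job.commit()
```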
Data Engineer
Confidential
RESPONSIBILITIES:
- Worked on development of data ingestion pipelines using the Talend ETL tool and bash scripting with big data technologies including, but not limited to, Hive, Impala, and Kafka.
- Developed scalable and secure data pipelines for large datasets.
- Supported data quality management by implementing proper data quality checks in data pipelines.
- Delivered data engineering services such as data exploration, ad-hoc ingestion, and subject-matter expertise to data scientists using big data technologies.
- Built machine learning models to showcase big data capabilities using PySpark and MLlib.
- Enhanced the data ingestion framework by creating more robust and secure data pipelines.
- Implemented data streaming capability using Kafka and Talend for multiple data sources.
- Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
- Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.
- Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
- Knowledge of implementing Autosys JILs to automate jobs in the production cluster.
- Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
- Worked on analyzing and resolving the production job failures in several scenarios.
- Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
- Automated regular AWS tasks such as snapshot creation using Python scripts.
- Designed data warehouses on platforms such as AWS Redshift, Azure SQL Data Warehouse, and other high-performance platforms.
- Installed and configured Apache Airflow to work with AWS S3 buckets and created DAGs to run in Airflow (a brief DAG sketch follows this role's environment list).
- Prepared scripts in PySpark and Scala to automate the ingestion process from various sources such as APIs, AWS S3, Teradata, and Redshift.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Using AWS Redshift, extracted, transformed, and loaded data from heterogeneous sources and destinations such as Access, Excel, CSV, Oracle, and flat files, using connectors, tasks, and transformations provided by AWS Data Pipeline.
- Set up the whole application stack and configured and debugged Logstash to send Apache logs to Elasticsearch.
- Led testing efforts in support of projects/programs across a large landscape of technologies (Unix, AngularJS, AWS, Sauce Labs, Cucumber JVM, MongoDB, GitHub, Bitbucket, SQL, NoSQL databases, APIs, Java, Jenkins).
- Started working with AWS for storage and handling of terabytes of data for customer BI reporting tools.
Environment: Python, AWS, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Git, Oozie, Talend, Agile Methodology.
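Minimal Airflow DAG sketch for the S3-related scheduling described above, assuming Airflow 2.x import paths and boto3; the bucket, prefix, and DAG id are hypothetical.

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path assumed


def list_new_files(**_):
    """List objects under an S3 prefix; bucket and prefix are placeholders."""
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket="example-bucket", Prefix="incoming/")
    for obj in response.get("Contents", []):
        print(obj["Key"])


with DAG(
    dag_id="s3_ingestion_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Single illustrative task; real DAGs would chain ingestion and load steps
    check_s3 = PythonOperator(task_id="list_new_files", python_callable=list_new_files)
```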
Data Engineer
Confidential
RESPONSIBILITIES:
- Researched and recommended a suitable technology stack for Hadoop migration, considering the current enterprise architecture.
- Responsible for building scalable distributed data solutions using Hadoop.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Developed Spark jobs and Hive Jobs to summarize and transform data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark data frames, Scala and Python.
- Developed Spark scripts for data analysis in both Python and Scala.
- Wrote Scala scripts to integrate Spark Streaming with Kafka as part of the Spark-Kafka integration efforts.
- Built on-premises data pipelines using Kafka and Spark for real-time data analysis (a PySpark streaming sketch follows this role's environment list).
- Created reports in Tableau for visualization of the data sets created and tested Spark SQL connectors.
- Implemented complex Hive UDFs to execute business logic within Hive queries.
- Developed various custom filters and handled pre-defined filters on HBase data using the HBase API.
- Implemented Spark jobs in Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
- Imported data from different sources into HDFS using Sqoop, performed transformations using Hive, and loaded the results back into HDFS.
- Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
- Collected and aggregated large amounts of log data and staged it in HDFS for further analysis.
- Managed and reviewed Hadoop log files.
- Used Sqoop to transfer data between relational databases and Hadoop.
- Worked on HDFS to store and access huge datasets within Hadoop.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
- Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL.
- Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Design and implement migration strategies for traditional systems on Azure (lift and shift, Azure Migrate, and other third-party tools).
- Used various sources to pull data into Power BI such as SQL Server, Excel, Oracle, SQL Azure etc.
- Propose architectures considering cost/spend in Azure and develop recommendations to right-size data infrastructure.
- Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
- Migrated SQL Server and Oracle databases to Microsoft Azure Cloud.
- Migrated the data using Azure Database Migration Service (DMS).
Environment: Cloudera Manager (CDH5), Azure, HDFS, Sqoop, Pig, Hive, Oozie, Kafka, Flume, Java, Git, Tableau.
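Sketch of a Kafka-to-Spark streaming read similar to the real-time pipelines above, written here as PySpark Structured Streaming rather than the Scala DStream code used on the project; the broker, topic, and aggregation are placeholders, and the spark-sql-kafka connector is assumed to be available.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

# Assumes the spark-sql-kafka-0-10 package is on the Spark classpath
spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Read a Kafka topic as a streaming DataFrame (broker and topic names are placeholders)
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Kafka values arrive as bytes; cast to string and count events per minute
counts = (events.selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"))
          .agg(count("*").alias("events")))

# Write running aggregates to the console purely for illustration
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```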