Sr. Data Engineer Resume
Lake Success, NY
SUMMARY
- 7 years of IT experience across a variety of industries, working with Big Data technologies on Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
- Hands-on experience in developing and deploying enterprise applications using major Hadoop ecosystem components such as MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, and Kafka.
- Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data in Scala using in-memory computing. Worked with Spark to improve the efficiency of existing algorithms using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Fluent programming experience with Scala, Java, Python, SQL, T-SQL, R.
- Experience integrating data from various sources such as Oracle SE2, SQL Server, flat files, and unstructured files into a data warehouse.
- Adept at configuring and installing Hadoop/Spark Ecosystem Components.
- Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
- Experience in extracting, transforming, and loading (ETL) data from various sources into data warehouses, as well as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
- Strong experience in writing scripts using Python API, PySpark API and Spark API for analyzing the data.
- Used Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and used Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala, including writing user-defined functions (UDFs) for Hive and Pig in Python.
- Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.
- Hands-on experience with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Experienced in working with Amazon Web Services (AWS), using EC2 for computing and S3 as a storage mechanism.
- Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Hands-on experience with other Amazon Web Services such as Auto Scaling, Redshift, DynamoDB, and Route 53.
- Experience in working with Flume and NiFi for loading log files into Hadoop.
- Experience in working with NoSQL databases like HBase and Cassandra.
- Experienced in creating shell scripts to push data loads from various sources on the edge nodes into HDFS.
- Good experience in implementing and orchestrating data pipelines using Oozie and Airflow.
- Hands-on experience with Hadoop architecture and its components, such as the Hadoop Distributed File System (HDFS), JobTracker, TaskTracker, NameNode, DataNode, and Hadoop MapReduce programming.
- Experienced in using various Python libraries like NumPy, SciPy, python-twitter, Pandas.
- Worked on visualization tools like Tableau for report creation and further analysis.
- Developed an end-to-end ETL pipeline using Spark SQL and Scala on the Spark engine; imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Comprehensive experience in developing simple to complex MapReduce and streaming jobs using Scala and Java for data cleansing, filtering, and aggregation, with detailed knowledge of the MapReduce framework.
- Used IDEs and editors such as Eclipse, IntelliJ IDEA, PyCharm, Notepad++, and Visual Studio for development.
- Seasoned in machine learning algorithms and predictive modeling, such as Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering.
- Experience working with GitHub/Git 2.12 source and version control systems.
- Worked with Confluence, SharePoint, and Jira.
TECHNICAL SKILLS
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: Java, Python, Scala, SQL, PL/SQL, and UNIX shell scripting.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Azure, Google Cloud.
Databases: Oracle 12c/11g, Teradata R15/R14, Netezza, MySQL, SQL Server
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data warehouse Tools: Informatica 9.6/9.1, Talend Open Studio 7.3, Alteryx, and Tableau.
Operating System: Windows, Unix, Sun Solaris, and Macintosh.
Big Data Tools: Hadoop Ecosystem, MapReduce.
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Lake Success, NY
Responsibilities:
- Worked on AWS Data Pipeline to configure data loads from S3 to Redshift.
- Extracted, transformed, and loaded data from various heterogeneous data sources and destinations using AWS Redshift.
- Created tables and stored procedures, and extracted data using T-SQL for business users whenever required.
- Worked with big data technologies including Spark 2.3 and 3.0, Scala, Hive, and Hadoop clusters (Cloudera platform).
- Built data pipelines using Data Fabric jobs, Sqoop, Spark, Scala, and Kafka. In parallel, worked with Oracle and MySQL servers on source-to-target data design.
- Worked closely with the pub-sub model as part of the Lambda model implemented at TCF Bank.
- Designed and implemented Spark SQL tables and Hive script jobs, using Stonebranch for scheduling and for creating workflows and task flows.
- Installed application on AWS EC2 instances and configured the storage on S3 buckets.
- Stored data in AWS S3, used in the same way as HDFS, and ran EMR programs on the stored data.
- Used the AWS CLI to suspend an AWS Lambda function and to automate backups of ephemeral data stores to S3 buckets and EBS.
- Implemented a Continuous Delivery pipeline with Docker, GitHub, and AWS.
- Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
- Performed data analysis and design; created and maintained large, complex logical and physical data models and metadata repositories using ERwin and MB MDR.
- Created a Lambda deployment function and configured it to receive events from S3 buckets.
- Wrote shell scripts to trigger DataStage jobs.
- Assisted service developers in finding relevant content in the existing reference models.
- Responsible for the design, development, and testing of the database; developed stored procedures, views, and triggers.
- Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
- Compiled and validated data from all departments and presented it to the Director of Operations.
- Created a data model that correlates all the metrics and produces valuable output.
- Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and uploading it to the data warehouse servers.
- Designed SSIS packages to extract, transform, and load (ETL) existing data into SQL Server from different environments for the SSAS cubes (OLAP).
- Designed, developed, and tested dimensional data models using star and snowflake schema methodologies under the Kimball method.
- Used Hive to implement a data warehouse and stored data in HDFS on Hadoop clusters set up in AWS EMR.
- Leveraged cloud and GPU computing technologies, such as AWS and Azure, for automated machine learning and analytics pipelines.
- Used SQL Server Reporting Services (SSRS) to create and format cross-tab, conditional, drill-down, Top N, summary, form, OLAP, sub-reports, ad-hoc reports, parameterized reports, interactive reports, and custom reports.
Environment: Spark, Python, ETL, Power BI, Tableau, Hive/Hadoop, Snowflake, AWS Data Pipeline, Cognos Connection, MS SQL Server, T-SQL, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), SQL Server Management Studio (SSMS), Advanced Excel (creating formulas, pivot tables, HLOOKUP, VLOOKUP, macros).
Cloud Data Engineer
Confidential, Charlotte, NC
Responsibilities:
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Built the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using SQL and big data technologies such as Hadoop Hive and Azure Data Lake Storage.
- Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines
- Designed several DAGs (directed acyclic graphs) for automating ETL pipelines.
- Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management.
- Created YAML files for each data source, including Glue table stack creation.
- Developed Lambda functions and assigned IAM roles to run Python scripts with various triggers (SQS, EventBridge, SNS).
- Created and maintained optimal data pipeline architecture in the Microsoft Azure cloud using Data Factory and Azure Databricks.
- Wrote PySpark and Spark SQL transformations in Azure Databricks to implement complex business rules.
- Hands-on experience in Azure Cloud Services (PaaS & IaaS), Storage, Web Apps, Active Directory, U-SQL, Application Insights, and Logic Apps.
- Wrote UNIX shell scripts to automate jobs and scheduled cron jobs using crontab.
- Developed various mappings covering all sources, targets, and transformations using Informatica Designer.
- Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.
- Installed IBM HTTP Server, WebSphere plug-ins, and WebSphere Application Server Network Deployment (ND).
- Set up high availability for a major production cluster and designed automatic failover control using ZooKeeper and quorum journal nodes.
- Provided troubleshooting and best-practice guidance for development teams, including process automation and new application onboarding.
- Produced unit tests for Spark transformations and helper methods, and designed data processing pipelines.
- Configured IBM HTTP Server, WebSphere plug-ins, and WebSphere Application Server Network Deployment (ND) for user workload distribution.
- Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.
- Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
- Compiled data from various sources to perform complex analysis for actionable results
- Measured the efficiency of the Hadoop/Hive environment to ensure SLAs are met.
- Optimized the TensorFlow model for efficiency.
- Analyzed the system for new enhancements and functionality and performed impact analysis of the application for implementing ETL changes.
- Built performant, scalable ETL processes to load, cleanse and validate data
Environment: Microsoft Windows Azure, Azure Data Factory, Databricks, Apache Airflow, Cloud Dataflow, Cloud Shell, Hadoop, Hive, Lambda, MySQL, PostgreSQL, SQL Server, Python, Scala, Spark, Spark SQL, Docker, Unix, Shell Scripting, GitHub.
Big Data Engineer
Confidential
Responsibilities:
- Implemented a Continuous Delivery pipeline with Docker and GitHub.
- Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
- Devised simple and complex SQL scripts to check and validate Dataflow in various applications.
- Performed data preparation using Pig Latin to get the data into the required format.
- Used Python pandas, NiFi, Jenkins, NLTK, and textbook to finish the ETL process of clinical data for future NLP analysis.
- Used PCA to reduce dimensionality and compute eigenvalues and eigenvectors, and used OpenCV to analyze CT scan images to identify disease.
- Processed the image data on the Hadoop distributed system using MapReduce, then stored the results in HDFS.
- Created Session Beans and controller Servlets for handling HTTP requests from Talend
- Designed and developed data mapping procedures for the ETL process (data extraction, data analysis, and loading) to integrate data using R programming.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Performed database activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop jobs for analyzing data using HiveQL (queries), Pig Latin (data flow language), and custom MapReduce programs in Java. Good experience with ETL concepts, building ETL solutions, and data modeling.
- Architected several DAGs (directed acyclic graphs) for automating ETL pipelines.
- Hands-on experience architecting ETL transformation layers and writing Spark jobs to do the processing.
- Gathered and processed raw data at scale (including writing scripts, web scraping, calling APIs, writing SQL queries, and writing applications).
- Experience in fact and dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).
- Devised PL/SQL stored procedures, functions, triggers, views, and packages. Used indexing, aggregation, and materialized views to optimize query performance.
- Developed logistic regression models (Python) to predict subscription response rate based on customer variables such as past transactions, response to prior mailings, promotions, demographics, interests, and hobbies.
- Developed a near-real-time data pipeline using Spark.
- Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines
- Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling; skilled in data visualization with libraries such as Matplotlib and Seaborn.
- Experience implementing machine learning back-end pipeline with Pandas, NumPy.
Environment: BigQuery, GCS Bucket, G-Cloud Function, Spark, MapReduce, HDFS, AWS, GCP, Apache Airflow, Matplotlib, Seaborn, PL/SQL, Hive, Pandas, NiFi, NumPy, Talend, Star schema, Snowflake schema, SQL Server, Docker, GitHub.
Hadoop Developer
Confidential
Responsibilities:
- Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, Zookeeper and Sqoop.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Evaluated existing infrastructure, systems, and technologies; provided gap analysis; documented requirements, evaluations, and recommendations for systems, upgrades, and technologies; and created the proposed architecture and specifications.
- Installed and configured Sqoop to import and export data between relational databases and Hive.
- Administered large Hadoop environments, including cluster build, support, setup, performance tuning, and monitoring in an enterprise environment.
- Closely monitored and analyzed MapReduce job executions on the cluster at the task level and optimized Hadoop cluster components to achieve high performance.
- Developed a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
- Integrated HDP clusters with Active Directory and enabled Kerberos for Authentication.
- Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
- Set up alerting and monitoring using Stackdriver in GCP.
- Designed and implemented large-scale distributed solutions in GCP.
- Monitored Hadoop cluster functioning through MCS and worked with NoSQL databases including HBase.
- Created Hive tables, loaded data, and wrote Hive UDFs; worked with the Linux server admin team on administering the server hardware and operating system.
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores for data access and analysis.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
- Converted Hive/SQL queries into Spark transformations using Spark DataFrames, Scala, and Python.
- Built on-premises data pipelines using Kafka and Spark for real-time data analysis.
- Created reports in Tableau to visualize the resulting data sets and tested Spark SQL connectors.
- Implemented complex Hive UDFs to execute business logic with Hive queries.
- Implemented Spark applications in Scala, using DataFrames and the Spark SQL API for faster data processing.
- Imported data from different data sources into HDFS using Sqoop, performed transformations using Hive, and loaded the results into HDFS.
- Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
- Experience in managing and reviewing Hadoop Log files.
- Worked on HDFS to store and access huge datasets within Hadoop.
Environment: Hadoop, Hive, Pig, Spark, HBase, GCP, Zookeeper, Sqoop, Scala, Kafka, Tableau, MapReduce, Python, MySQL, NoSQL Database.
Data Analyst
Confidential
Responsibilities:
- Participated in testing procedures and data using PL/SQL to ensure the integrity and quality of data in the data warehouse.
- Gathered data from the help desk ticketing system and wrote ad-hoc reports, charts, and graphs for analysis.
- Worked to ensure high levels of data consistency between diverse source systems including flat files, XML, and SQL databases.
- Developed and ran ad-hoc data queries against multiple database types to identify systems of record, data inconsistencies, and data quality issues.
- Developed complex SQL statements to extract data, and packaged and encrypted data for delivery to customers.
- Provided business intelligence analysis to decision-makers using an interactive OLAP tool
- Created T-SQL statements (select, insert, update, delete) and stored procedures.
- Defined data requirements and elements used in XML transactions.
- Created Informatica mappings using various Transformations like Joiner, Aggregate, Expression, Filter and Update Strategy.
- Performed Tableau administration using Tableau admin commands.
- Involved in defining the source-to-target data mappings, business rules, and data definitions.
- Ensured the compliance of the extracts with the Data Quality Center initiatives.
- Performed metrics reporting, data mining, and trend analysis in a help desk environment using Access.
- Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple heterogeneous information sources.
- Built reports and report models using SSRS to enable end user report builder usage.
- Created Excel charts and pivot tables for ad-hoc data pulls.
- Created columnstore indexes on dimension and fact tables in the OLTP database to improve read performance.
Environment: SQL, PL/SQL, T-SQL, XML, Informatica, Tableau, OLAP, SSIS, SSRS, Excel, OLTP.
