Sr. Data Engineer/Big Data Developer Resume
Boston, MA
SUMMARY
- Data Engineering professional with solid foundational skills and a proven track record of implementations across a variety of data platforms. Self-motivated, with strong personal accountability in both individual and team settings.
- 8+ years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
- Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
- Strong experience in writing scripts using the Python, PySpark, and Spark APIs for analyzing data.
- Extensively used Python libraries including PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
- Experience with Google Cloud components, Google container builders, GCP client libraries, and the Cloud SDK.
- Substantial experience in Spark 3.0 integration with Kafka 2.4.
- Experience in setting up monitoring infrastructure for Hadoop cluster using Nagios and Ganglia.
- Sustaining BigQuery, PySpark, and Hive code by fixing bugs and providing the enhancements required by business users.
- Working with the AWS and GCP clouds, including GCP Cloud Storage, Dataproc, Dataflow, and BigQuery, as well as AWS EMR, S3, Glacier, and EC2 instances with EMR clusters.
- Knowledge of the Cloudera platform and Apache Hadoop 0.20.
- Very good exposure to OLAP and OLTP.
- Proficient in statistical methodologies including Hypothesis Testing, ANOVA, Time Series, Principal Component Analysis, Factor Analysis, Cluster Analysis, and Discriminant Analysis.
- Worked with various text analytics libraries like Word2Vec, GloVe, and LDA; experienced with hyperparameter tuning techniques like Grid Search and Random Search, and with model performance tuning using ensembles and deep learning.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
- Experience with Proofs of Concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for exploration using data munging and Teradata.
- Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
- Expertise in designing complex mappings and in performance tuning of Slowly Changing Dimension tables and Fact tables.
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
- Experienced in building automated regression scripts in Python for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
- Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
- Excellent communication skills; work successfully in fast-paced, multitasking environments both independently and in collaborative teams; a self-motivated, enthusiastic learner.
- Skilled in data parsing, ingestion, manipulation, architecture, modeling, and preparation, with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape (a brief pandas sketch follows this summary).
- Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig using Python.
- Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.
- Hands-on with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Experience in working with Flume and NiFi for loading log files into Hadoop.
- Experience in working with NoSQL databases like HBase and Cassandra.
- Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.
- Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
- Worked with Cloudera and Hortonworks distributions.
- Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
- Good experience in building pipelines using Azure Data Factory and moving the data into Azure Data Lake Store.
- Experience in designing Star and Snowflake schemas for Data Warehouse and ODS architectures.
- Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake modeling for Fact and Dimension tables) using Analysis Services.
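A brief pandas sketch of the data preparation steps listed above (descriptive statistics, regex split, merge, melt/reshape). The file names and column names are hypothetical and used only for illustration.

```python
# Minimal pandas sketch: describe, regex split, merge, pivot, and melt.
# File and column names (orders.csv, customers.csv, etc.) are hypothetical.
import pandas as pd

orders = pd.read_csv("orders.csv")        # order_id, customer_id, amount, order_date
customers = pd.read_csv("customers.csv")  # customer_id, full_name, region

print(orders.describe())                  # descriptive statistics of numeric columns

# Split a combined name column with a regex, then merge the two frames
customers[["first_name", "last_name"]] = customers["full_name"].str.split(r"\s+", n=1, expand=True)
merged = orders.merge(customers, on="customer_id", how="left")

# Reshape: wide monthly totals per region, then melt back to long form
merged["month"] = pd.to_datetime(merged["order_date"]).dt.to_period("M").astype(str)
wide = merged.pivot_table(index="region", columns="month", values="amount", aggfunc="sum")
long_form = wide.reset_index().melt(id_vars="region", var_name="month", value_name="total_amount")
```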
TECHNICAL SKILLS
Big Data Tools: Hadoop Ecosystem (MapReduce), Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
BI Tools: SSIS, SSRS, SSAS.
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Azure, Google Cloud.
Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
Databases: Oracle, Teradata R15/R14.
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
Operating System: Windows, Unix, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, Boston, MA
Sr. Data Engineer/Big Data Developer
Responsibilities:
- Working as a developer in Hive and Impala for highly parallel data processing on Cloudera systems.
- Working with big data technologies such as Spark 2.3 & 3.0, Scala, Hive, and a Hadoop cluster (Cloudera platform).
- Building data pipelines with Data Fabric jobs, Sqoop, Spark, Scala, and Kafka, while working in parallel on the database side with Oracle and MySQL Server for source-to-target data design.
- Writing BigQuery queries for data wrangling, with the help of Dataflow, on GCP.
- Working closely with the Pub/Sub model as well, as part of the Lambda architecture we implemented at TCF Bank.
- Designing and implementing Spark SQL tables and Hive script jobs, with Stonebranch for scheduling, and creating workflows and task flows.
- Using partitioning and bucketing in Hive as part of Hive optimization, so queries run faster (a brief sketch follows this list).
- Writing Spark programs to move data from input storage locations to output locations, performing data loading, validation, and transformation along the way.
- Using Scala functions, dictionaries, and data structures (arrays, lists, maps) for better code reusability.
- Performing unit testing based on the development work.
- Preparing the Technical Release Notes (TRN) for application deployment into the DEV/STAGE/PROD environments.
- Developed report layouts for Suspicious Activity and Pattern analysis under AML regulations
- Prepared and analyzed AS-IS and TO-BE states of the existing architecture and performed gap analysis. Created workflow scenarios, designed new process flows, and documented the business process and various business scenarios and activities from the conceptual to the procedural level.
- Analyzed business requirements and employed Unified Modeling Language (UML) to develop high-level and low-level Use Cases, Activity Diagrams, Sequence Diagrams, Class Diagrams, Data-flow Diagrams, Business Workflow Diagrams, Swim Lane Diagrams, using Rational Rose
- Worked with senior developers to implement ad-hoc and standard reports using Informatica, Cognos, MS SSRS and SSAS.
- Thorough understanding of various AML modules, including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
- Used SQL Server Management Studio to check the data in the database against the given requirements.
- Performed data analysis and design; created and maintained large, complex logical and physical data models and metadata repositories using ERwin and MB MDR.
- Wrote shell scripts to trigger DataStage jobs.
- Assist service developers in finding relevant content in the existing reference models.
- Worked with sources like Access, Excel, CSV, Oracle, and flat files using connectors, tasks, and transformations provided by AWS Data Pipeline.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Developed PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns.
- Responsible for design, development, and testing of the database; developed stored procedures, views, and triggers.
- Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
- Compiled and validated data from all departments and presented it to the Director of Operations.
- Built a KPI calculator sheet and maintained it within SharePoint.
- Created Tableau reports with complex calculations and worked on Ad-hoc reporting using Power BI.
- Created a data model that correlates all the metrics and produces valuable output.
- Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
- Performed ETL testing activities such as running the jobs, extracting the data from the database with the necessary queries, transforming it, and uploading it into the data warehouse servers.
- Performed pre-processing using Hive and Pig.
- Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Implemented Copy activity and custom Azure Data Factory pipeline activities.
- Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
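A minimal sketch of the Hive partitioning and bucketing approach mentioned above, expressed through Spark SQL in PySpark. The database, table, and column names are hypothetical.

```python
# Minimal PySpark sketch of Hive partitioning and bucketing.
# Table and column names are hypothetical; Hive support must be enabled on the session.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-bucket-demo")
         .enableHiveSupport()
         .getOrCreate())

# Partition by load date and bucket by customer_id so partition pruning and
# bucketed joins cut down the data each query has to scan.
spark.sql("""
    CREATE TABLE IF NOT EXISTS txn_db.transactions (
        txn_id      STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Load one partition from a staging table. Note: depending on the Spark version,
# this INSERT may need to run in Hive itself, since some Spark releases do not
# produce Hive-compatible bucketed output.
spark.sql("""
    INSERT INTO txn_db.transactions PARTITION (load_date = '2021-06-01')
    SELECT txn_id, customer_id, amount
    FROM txn_db.transactions_stage
""")
```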
Environment: HDFS, Hive, Pig, Azure, AWS, Lambda, Sqoop, Spark, Linux, Kafka, Scala, Python, Stonebranch, Cloudera, PySpark, RESTful, Oracle 11g/10g, PL/SQL, SQL Server, T-SQL, Unix, Tableau, Parquet file systems.
Confidential, Oldsmar, FL
Big Data Engineer / Hadoop Developer
Responsibilities:
- Handled job management using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
- Used Spark and Hive to implement the transformations needed to join the daily ingested data to historic data.
- Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which receives data from Kafka in near real time (a brief sketch follows this list).
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Used Spark API over EMR Cluster Hadoop YARN to perform analytics on data in Hive.
- Developed Scala scripts, UDFs using both Data frames/SQL/Data sets and RDD in Spark for Data Aggregation, queries and writing data back into OLTP system through Sqoop.
- Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and Pair RDDs.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
- Developed logistic regression models (using R programming and Python) to predict subscription response rate based on customer’s variables like past transactions, response to prior mailings, promotions, demographics, interests and hobbies, etc.
- Created Tableau dashboards/reports for data visualization, Reporting and Analysis and presented it to Business.
- Created/managed Groups, Workbooks and Projects, Database Views, Data Sources, and Data Connections.
- Worked with the Business development managers and other team members on report requirements based on existing reports/dashboards, timelines, testing, and technical delivery.
- Knowledge of the Tableau Administration tool for configuration, adding users, managing licenses and data connections, scheduling tasks, and embedding views by integrating with other platforms.
- Developed dimensions and fact tables for data marts like Monthly Summary, Inventory data marts with various Dimensions like Time, Services, Customers and policies.
- Developed reusable transformations to load data from flat files and other data sources to the Data Warehouse.
- Assisted the operations support team with transactional data loads by developing SQL*Loader and Unix scripts.
- Implemented Python scripts to call the Cassandra REST API, performed transformations, and loaded the data into Hive.
- Worked extensively in Python and built the custom ingestion framework.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, transformations, and other optimizations during the ingestion process itself.
- Experienced in writing live, real-time processing jobs using Spark Streaming with Kafka.
- Created Cassandra tables to store various formats of data coming from different sources.
- Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
- Generated Custom SQL to verify the dependency for the daily, Weekly, Monthly jobs.
- Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats like text files and CSV files.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Closely involved in scheduling Daily, Monthly jobs with Precondition/Post condition based on the requirement.
- Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
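A minimal sketch of the Kafka-to-HDFS streaming flow described above, using the direct Kafka DStream API available in the Spark 1.6 line listed in the environment. The broker, topic, and output path names are hypothetical.

```python
# Minimal PySpark (Spark 1.6-era DStream API) sketch: read from Kafka,
# transform on the fly, and persist to HDFS. Names are hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-to-hdfs")
ssc = StreamingContext(sc, batchDuration=30)   # 30-second micro-batches

# Direct stream: one Kafka partition maps to one RDD partition
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["learner-events"],
    kafkaParams={"metadata.broker.list": "broker1:9092,broker2:9092"},
)

# Each record is a (key, value) pair; keep the value and drop blank lines
events = stream.map(lambda kv: kv[1]).filter(lambda line: line.strip() != "")

# Write each micro-batch to HDFS as text files (one directory per batch)
events.saveAsTextFiles("hdfs:///data/raw/learner_events/batch")

ssc.start()
ssc.awaitTermination()
```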
Environment: Hadoop YARN, Spark 1.6, Spark Streaming, Spark SQL, Scala, Kafka, Python, Hive, Sqoop 1.4.6, Impala, Tableau, OLTP, Talend, Oozie, Cassandra, Control-M, Java, AWS S3, Oracle 12c, Linux
Confidential, St. Paul, Minnesota
Data Engineer
Responsibilities:
- Created and executed Hadoop Ecosystem installation and document configuration scripts on Google Cloud Platform.
- Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and csv file datasets into data frames using PySpark.
- Researched and downloaded jars for Spark-avro programming.
- Developed a PySpark program that writes dataframes to HDFS as avro files.
- Utilized Spark's parallel processing capabilities to ingest data.
- Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
- Developed a script that copies Avro-formatted data from HDFS to the external tables in the raw layer.
- Created PySpark code that uses Spark SQL to generate dataframes from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
- In charge of PySpark code that creates dataframes from tables in the data service layer and writes them to a Hive data warehouse.
- Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.
- Configured documents which allow Airflow to communicate to its PostgreSQL database.
- Developed Airflow DAGs in Python by importing the Airflow libraries.
- Utilized Airflow to schedule, automatically trigger, and execute the data ingestion pipeline (a minimal DAG sketch follows this list).
- Implemented clustering techniques like DBSCAN, K-means, K-means++ and Hierarchical clustering for customer profiling to design insurance plans according to their behavior pattern.
- Used Grid Search to evaluate the best hyper-parameters for the model and the K-fold cross-validation technique to train the model for best results.
- Worked with Customer Churn Models including Random forest regression, lasso regression along with pre-processing of the data.
- Used Python 3.X (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop variety of models and algorithms for analytic purposes.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python, and built models using deep learning frameworks.
- Applied various machine learning algorithms and statistical modeling techniques such as Decision Trees, Text Analytics, Sentiment Analysis, Naive Bayes, Logistic Regression, and Linear Regression in Python to determine the accuracy rate of each model.
- Implemented Univariate, Bivariate, and Multivariate Analysis on the cleaned data for getting actionable insights on the 500-product sales data by using visualization techniques in Matplotlib, Seaborn, Bokeh, and created reports in Power BI.
- Decommissioned and added nodes in the clusters for maintenance.
- Monitored cluster health by setting up alerts using Nagios and Ganglia.
- Added new users and groups of users as per requests from the client.
- Worked on tickets opened by users regarding various incidents and requests.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
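A minimal Airflow 1.10-style DAG sketch for the scheduled ingestion pipeline described above. The DAG id, schedule, and spark-submit wrapper scripts are hypothetical.

```python
# Minimal Airflow 1.10-style DAG: schedule and chain two ingestion steps.
# DAG id, schedule, and script paths are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_ingestion_pipeline",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",   # run daily at 02:00
    catchup=False,
) as dag:

    # Ingest source tables to HDFS as Avro via a spark-submit wrapper script
    ingest = BashOperator(
        task_id="ingest_to_raw",
        bash_command="spark-submit /opt/jobs/ingest_to_raw.py",
    )

    # Convert the raw Avro layer to ORC tables in the data service layer
    transform = BashOperator(
        task_id="raw_to_service_layer",
        bash_command="spark-submit /opt/jobs/raw_to_orc.py",
    )

    ingest >> transform
```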
Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile methodology, Stonebranch, Cloudera, Oracle 11g/10g, PL/SQL, Unix, JSON, and Parquet file systems
Confidential
Hadoop & Spark Developer
Responsibilities:
- Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, Zookeeper and Sqoop.
- Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
- Evaluated existing infrastructure, systems, and technologies and provided gap analysis, and documented requirements, evaluation, and recommendations of system, upgrades, technologies and created proposed architecture and specifications along with recommendations.
- Installed and Configured Sqoop to import and export the data into Hive from Relational databases.
- Administered large Hadoop environments, including cluster setup, performance tuning, and monitoring in an enterprise environment.
- Closely monitored and analyzed MapReduce job executions on the cluster at the task level and optimized Hadoop cluster components to achieve high performance.
- Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
- Used Python& SAS to extract, transform & load source data from transaction systems, generated reports, insights, and key conclusions.
- Manipulated and summarized data to maximize possible outcomes efficiently
- Developed story telling dashboards inTableauDesktop and published them on toTableauServer which allowed end users to understand the data on the fly with the usage of quick filters for on demand needed information.
- Analyzed and recommended improvements for better data consistency and efficiency
- Designed and developed data mapping procedures for ETL (data extraction, data analysis, and loading) to integrate data using R programming.
- Effectively communicated plans, project status, project risks, and project metrics to the project team and planned test strategies in accordance with project scope.
- Integrated HDP clusters with Active Directory and enabled Kerberos for Authentication.
- Worked on Google Cloud Platform (GCP) services like Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
- Set up alerting and monitoring using Stackdriver in GCP.
- Designed and implemented large-scale distributed solutions in the AWS and GCP clouds.
- Monitored the functioning of the Hadoop cluster through MCS and worked on NoSQL databases including HBase.
- Used Hive, created Hive tables, and was involved in data loading and in writing Hive UDFs (a Python TRANSFORM sketch follows this list); worked with the Linux server admin team in administering the server hardware and operating system.
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores, for data access and analysis.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS.
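A minimal sketch of extending Hive with a Python script, as mentioned above, using Hive's streaming TRANSFORM mechanism. The column layout and script name are hypothetical.

```python
#!/usr/bin/env python
# Minimal sketch of a Python script used as a Hive UDF via SELECT TRANSFORM.
# Hive streams tab-separated rows on stdin; we emit tab-separated rows on stdout.
# The column layout (user_id, raw_amount) is hypothetical.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue
    user_id, raw_amount = fields[0], fields[1]
    try:
        # Normalize the amount to two decimal places; emit 0.00 for bad input
        amount = "{0:.2f}".format(float(raw_amount))
    except ValueError:
        amount = "0.00"
    sys.stdout.write("{0}\t{1}\n".format(user_id, amount))
```

In Hive, such a script would typically be registered with ADD FILE and invoked along the lines of SELECT TRANSFORM(user_id, raw_amount) USING 'python normalize_amount.py' AS (user_id, amount) FROM source_table.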
Environment: Hadoop YARN, Spark 1.6, Spark Streaming, Spark SQL, Scala, Kafka, Python, Hive, Sqoop 1.4.6, Impala, Tableau, Talend, Oozie, Java, AWS S3, Oracle 12c, Linux
Confidential
Data Analyst
Responsibilities:
- Imported Legacy data from SQL Server and Teradata into Amazon S3.
- Created consumption views on top of metrics to reduce the running time for complex queries.
- Exported data into Snowflake by creating staging tables to load data from different files in Amazon S3 (a brief loading sketch follows this list).
- Compared the data in a leaf-level process across various databases whenever data transformation or data loading took place, analyzing data quality after these loads to check for any data loss or data corruption.
- As part of data migration, wrote many SQL scripts to find data mismatches and worked on loading history data from Teradata to Snowflake.
- Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e., name, address, SSN, phone number) in Teradata, SQL Server Management Studio, and Snowflake databases for the project.
- Retrieved data from FS to S3 using Spark commands.
- Built S3 buckets, managed bucket policies, and used S3 and Glacier for storage and backup on AWS.
- Created performance dashboards in Tableau, Excel, and PowerPoint for the key stakeholders.
- Incorporated predictive modeling (rule engine) to evaluate the Customer/Seller health score using python scripts, performed computations and integrated with the Tableau viz.
- Worked with stakeholders to communicate campaign results, strategy, issues or needs.
- Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
- Worked with business to identify the gaps in mobile tracking and come up with the solution to solve.
- Analyzed click events of the Hybrid landing page, including bounce rate, conversion rate, jump-back rate, list/gallery view, etc., and provided valuable information for landing page optimization.
- Evaluated the traffic and performance of Daily Deals PLA ads and compared those items with non-daily-deal items to assess the possibility of increasing ROI; suggested improvements and modified existing BI components (reports, stored procedures).
- Understood business requirements thoroughly and came up with a test strategy based on business rules.
- Prepared a test plan to ensure the QA and development phases ran in parallel.
- Wrote and executed test cases and reviewed them with the Business and Development teams.
- Implemented a defect tracking process using the JIRA tool by assigning bugs to the Development team.
- Used the automated regression tool (Qute), reducing manual effort and increasing team productivity.
- Involved in functional, integration, regression, smoke, and performance testing; tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
- Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
- Generated Custom SQL to verify the dependency for the daily, Weekly, Monthly jobs.
- Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats like text files and CSV files.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Closely involved in scheduling Daily, Monthly jobs with Precondition/Postcondition based on the requirement.
- Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
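A minimal sketch of the S3-to-Snowflake staging load described above, using the snowflake-connector-python package. The account, credentials, stage, and table names are hypothetical, and an external stage over the S3 bucket is assumed to already exist.

```python
# Minimal sketch: bulk-load S3 files into a Snowflake staging table.
# Account, credential, stage, and table names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()

    # Staging table mirroring the file layout
    cur.execute("""
        CREATE TABLE IF NOT EXISTS STG_ORDERS (
            ORDER_ID    STRING,
            CUSTOMER_ID STRING,
            AMOUNT      NUMBER(12, 2),
            ORDER_DATE  DATE
        )
    """)

    # COPY INTO from an external S3 stage into the staging table
    cur.execute("""
        COPY INTO STG_ORDERS
        FROM @S3_ORDERS_STAGE/daily/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
finally:
    conn.close()
```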
Environment: Snowflake, AWS S3, GitHub, ServiceNow, MapReduce, EMR, Nebula, Python, Pig, Hive, Teradata, SQL Server, Scala, Apache Spark, Sqoop