
Sr. Big Data Engineer Resume


Sunnyvale, CA

SUMMARY

  • Data Engineering professional with solid foundational skills and a proven track record of implementations across a variety of data platforms. Self-motivated with a strong adherence to personal accountability in both individual and team scenarios.
  • 8+ years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
  • Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification, and Testing, in both Waterfall and Agile methodologies.
  • Strong experience writing data analysis scripts against the Python, PySpark, and Spark APIs (a minimal PySpark UDF sketch follows this summary).
  • Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, psycopg2, embedPy, NumPy, and Beautiful Soup.
  • Experience with Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK.
  • Experience in setting up monitoring infrastructure for Hadoop cluster using Nagios and Ganglia.
  • Sustained BigQuery, PySpark, and Hive code by fixing bugs and delivering enhancements required by business users.
  • Worked with AWS and GCP clouds, including GCP Cloud Storage, Dataproc, Dataflow, and BigQuery, as well as EMR, S3, Glacier, and EC2 instances with EMR clusters on AWS.
  • Knowledge of the Cloudera platform and Apache Hadoop 0.20.x.
  • Very good exposure to OLAP and OLTP.
  • Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
  • Experience with Proofs of Concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for exploration using data munging and Teradata.
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Expertise in designing complex mappings, in performance tuning, and in slowly changing dimension tables and fact tables.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Experienced in building automated regression scripts for validating ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
  • Skilled in data parsing, ingestion, manipulation, architecture, modeling, and preparation, with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape.
  • Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
  • Experience in developing MapReduce programs on Apache Hadoop to analyze big data as per requirements.
  • Knowledge of Chef, Linux, shell scripting, and regex.
  • Knowledge of graph databases such as OrientDB for mapping system and data dependencies.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Strong understanding in Dimensional modeling - Star Schema, Snow Flake Schema.
  • Good understanding of BI stack tools such as SSRS, Tableau, and Power BI.
  • Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Designed and implemented effective analytics solutions and models with Snowflake.
  • Hands-on experience with Snowflake utilities, SnowSQL, and Snowpipe.
  • Experience in data migration from RDBMS to the Snowflake cloud data warehouse.
  • Experience in designing star schema, Snowflake schema for Data Warehouse, ODS architecture.
  • Good knowledge in Unix commands and Python Scripting.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake modeling for fact and dimension tables) using Analysis Services.
  • Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.
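As an illustration of the PySpark scripting and Python UDF work summarized above, here is a minimal, hypothetical sketch of a column-cleaning UDF; the SparkSession setup, column name, and function name are illustrative assumptions rather than code from any of the engagements below.

```python
# Minimal sketch (assumptions: a local SparkSession and a "raw_name" column).
import re

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

@udf(returnType=StringType())
def normalize_name(value):
    """Trim, collapse whitespace, and upper-case a free-text name field."""
    if value is None:
        return None
    return re.sub(r"\s+", " ", value).strip().upper()

df = spark.createDataFrame([(" john   doe ",), (None,)], ["raw_name"])
df.withColumn("clean_name", normalize_name("raw_name")).show()
```

The same pattern can be registered for SQL use with spark.udf.register when the function needs to be called from Hive-style queries.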

TECHNICAL SKILLS:

Big Data Ecosystem: HDFS, MapReduce, HBase, Pig, Hive, Sqoop, Kafka, Flume, Cassandra, Impala, Oozie, Zookeeper, MapR, Amazon Web Services (AWS), EMR, Teradata, Airflow

Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Gradient Boosting Classifier, Extreme Gradient Boosting Classifier, Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayes Classifier, Extra Trees Classifier, Stochastic Gradient Descent, etc.

Cloud Technologies: AWS, Azure, Google cloud platform (GCP)

IDEs: IntelliJ, Eclipse, Spyder, Jupyter

Databases: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, HBASE, PostgreSQL

Programming / Query Languages: Java, SQL, Python Programming (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, SAS, R Programming (Caret, Glmnet, XGBoost, rpart, ggplot2, sqldf), RStudio, PL/SQL, Linux shell scripts, Scala.

Data Engineer/Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, GCP, Google Cloud Shell, Linux, PuTTY, Bash Shell, Unix, Tableau, Power BI, SAS, Web Intelligence, Crystal Reports, Dashboard Design

PROJECT EXPERIENCE:

Confidential, Sunnyvale, CA

Sr. Big Data Engineer

Responsibilities:

  • Installed, configured, and maintained data pipelines.
  • Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks.
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Developed solutions leveraging ETL tools and identified opportunities for process improvements using Informatica and Python.
  • Design and implement effective data systems and models with Snowflake.
  • Experience with end-to-end implementation of the Snowflake cloud data warehouse.
  • Expertise in Snowflake: data modeling, ELT using Snowflake SQL, implementing complex stored procedures, and standard DWH and ETL concepts (a hedged connector-based sketch appears at the end of this responsibilities list).
  • Expertise in advanced Snowflake concepts such as setting up resource monitors, RBAC controls, virtual warehouse sizing, query performance tuning, zero-copy cloning, and Time Travel.
  • Expertise in deploying Snowflake features such as data sharing, events, and lakehouse patterns.
  • Conducted root cause analysis and resolved production problems and data issues.
  • Performed performance tuning, code promotion, and testing of application changes.
  • Hands-on experience analyzing and re-platforming on-prem data warehouses to data platforms on GCP using GCP and third-party services.
  • Used Cloud Spanner to handle relational data, Bigtable to store huge volumes of key-value pairs, and BigQuery for interactive data analysis.
  • Conducted performance analysis, optimized data processes, and made recommendations for continuous improvement of the data processing environment.
  • Developed a data platform from scratch and took part in the requirements gathering and analysis phase of the project, documenting the business requirements.
  • Designed and implemented multiple ETL solutions with various data sources through extensive SQL scripting, ETL tools, Python, shell scripting, and scheduling tools; performed data profiling and wrangling of XML, web feeds, and files using Python, Unix, and SQL.
  • Loaded data from different sources into a data warehouse and performed data aggregations for business intelligence using Python.
  • Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 to generate interactive reports.
  • Used Sqoop to move data between HDFS and different RDBMS sources.
  • Developed Spark applications using Pyspark and Spark-SQL for data extraction, transformation and aggregation from multiple file formats.
  • Used SSIS to build automated multi-dimensional cubes.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS and in NoSQL databases such as HBase and Cassandra using Python (a minimal streaming sketch follows this project's environment list).
  • Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Prepared and uploaded SSRS reports; managed database and SSRS permissions.
  • Used Apache NiFi to copy data from local file system to HDP. Thorough understanding of various modules of AML including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
  • Used SQL Server Management Studio to check the data in the database against the given requirements.
  • Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
  • Identified and documented Functional/Non-Functional and other related business decisions for implementing Actimize-SAM to comply with AML Regulations.
  • Worked with regional and country AML Compliance leads to support the start-up of compliance-led projects at regional and country levels, including defining the subsequent phases (training, UAT, staffing to execute test scripts, data migration), the uplift strategy (updating customer information to bring it to the new KYC standards), and the review of customer documentation.
  • End-to-end development of Actimize models for the project bank's trading compliance solutions.
  • Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MultiLoad, BTEQ, and FastLoad.
  • Implemented Actimize Anti-Money Laundering (AML) system to monitor suspicious transactions and enhance regulatory compliance.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Used Oozie to automate data loading into the Hadoop Distributed File System.
  • Developed automated regression scripts for validating ETL processes between multiple databases such as AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL) using Python.
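The Snowflake ELT work above can be pictured with a short, hedged sketch using the snowflake-connector-python package; the account, credentials, warehouse, stage, and table names below are placeholders rather than details from this project, and the COPY INTO statement assumes an external stage already exists.

```python
# Hedged sketch only: placeholder connection details and object names.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # placeholder
    user="etl_user",        # placeholder
    password="***",         # placeholder
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="STAGING",
)
try:
    cur = conn.cursor()
    # Load staged files, then transform inside Snowflake (ELT-style).
    cur.execute(
        "COPY INTO STAGING.ORDERS_RAW FROM @ORDERS_STAGE FILE_FORMAT = (TYPE = CSV)"
    )
    cur.execute(
        """
        CREATE OR REPLACE TABLE REPORTING.DAILY_ORDERS AS
        SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
        FROM STAGING.ORDERS_RAW
        GROUP BY order_date
        """
    )
finally:
    conn.close()
```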

Environment: Cloudera Manager (CDH5), Hadoop, PySpark, HDFS, NiFi, Pig, Hive, S3, Kafka, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, OLTP, OLAP, HBase, Cassandra, SQL Server, Python, Shell Scripting, XML, Unix.
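For the Kafka-to-HDFS flow in the project above, the following is a minimal PySpark Structured Streaming sketch (the original work may equally have used the older DStream API); brokers, topic, and HDFS paths are placeholders, and the spark-sql-kafka connector package is assumed to be available on the cluster.

```python
# Hedged sketch: placeholder brokers, topic, and output paths.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the raw Kafka stream; key/value arrive as binary and are cast to strings.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "learner-events")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
)

# Persist micro-batches to HDFS as Parquet, with a checkpoint for fault tolerance.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/learner/events")
    .option("checkpointLocation", "hdfs:///checkpoints/learner-events")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```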

Confidential, Boise, Idaho

Sr. Data Engineer

Responsibilities:

  • Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap; installed, configured, and maintained data pipelines.
  • Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Worked with data governance and data quality to design various models and processes.
  • Involved in all steps and the full scope of the project's reference data approach to MDM; created a data dictionary and source-to-target mappings in the MDM data model.
  • Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, with an understanding of how to integrate them with other Azure services; knowledge of U-SQL.
  • Responsible for working with various teams on a project to develop analytics-based solution to target customer subscribers specifically.
  • Created AWS Lambda functions and assigned IAM roles to run Python scripts, and used Lambda with Java for event-driven processing; created Lambda jobs and configured roles using the AWS CLI.
  • Responsible for wide-ranging data ingestion using Sqoop and HDFS commands; accumulated partitioned data in various storage formats such as text, JSON, and Parquet, and loaded data from the Linux file system into HDFS.
  • Stored data files in Google Cloud Storage buckets on a daily basis; used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
  • Started working with AWS for storage and handling of terabytes of data for customer BI reporting tools.
  • Built a 12-node Hadoop cluster.
  • Decommissioned and added nodes in the clusters for maintenance.
  • Monitored cluster health by setting up alerts using Nagios and Ganglia.
  • Added new users and groups of users per client requests.
  • Worked on tickets opened by users regarding various incidents and requests.
  • Created a Lambda deployment function and configured it to receive events from S3 buckets (a minimal handler sketch follows this responsibilities list).
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs using crontab.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
  • Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and the MLlib libraries.
  • Ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
  • Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume using the scikit-learn package in Python as well as R and MATLAB. Collaborated with data engineers and software developers to develop experiments and deploy solutions to production.
  • Create and publish multiple dashboards and reports using Tableau server and work on Text Analytics, Naive Bayes, Sentiment analysis, creating word clouds and retrieving data from Twitter and other social networking platforms.
  • Implemented data warehousing solutions in Snowflake that met data analysis and data extraction, transformation, and loading needs.
  • Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
  • Tackled a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms.
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
  • Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization
  • Optimized algorithms with stochastic gradient descent and fine-tuned parameters with both manual tuning and automated tuning such as Bayesian optimization.
  • Write research reports describing the experiment conducted, results, and findings and make strategic recommendations to technology, product, and senior management.
  • Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python Jupyter Notebook, Hive and NoSql.
  • Wrote production level Machine Learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes in certain time frame.
  • Performed all necessary day-to-day GIT support for different projects, Responsible for design and maintenance of the GIT Repositories, and the access control strategies.
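The S3-triggered Lambda processing described above can be sketched with a generic, assumed handler; bucket and key names come from the standard S3 event payload, and the logging is illustrative rather than this project's actual logic.

```python
# Hedged sketch of an S3-triggered Lambda handler (illustrative only).
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Log basic metadata for each object that triggered the S3 event."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        head = s3.head_object(Bucket=bucket, Key=key)
        print(f"Received s3://{bucket}/{key} ({head['ContentLength']} bytes)")
    return {"status": "ok", "records": len(records)}
```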

Environment: Hadoop, Kafka, Spark, Sqoop, Spark SQL, Spark Streaming, Hive, Scala, Pig, NoSQL, SSIS, HBase, OLTP, Data Lake, Python, AWS, Lambda, Informatica PowerCenter, Linux Shell Scripting, Git.

Confidential, Weston, FL

Big Data Engineer

Responsibilities:

  • Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
  • Worked on a Hadoop cluster that ranged from 4-8 nodes during the pre-production stage and was sometimes extended to 24 nodes during production.
  • Built APIs that will allow customer service representatives to access the data and answer queries.
  • Designed changes to transform current Hadoop jobs to HBase.
  • Handled fixing of defects efficiently and worked with the QA and BA team for clarifications.
  • Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files.
  • Extended the functionality of Hive with custom UDFs and UDAFs.
  • The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports and established self-service reporting model in Cognos for business users.
  • Implemented bucketing and partitioning in Hive to assist users with data analysis (a PySpark partitioning sketch follows this responsibilities list).
  • Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
  • Implemented partitioning, dynamic partitions, and buckets in Hive.
  • Develop database management systems for easy access, storage, and retrieval of data.
  • Perform DB activities such as indexing, performance tuning, and backup and restore.
  • Expertise in writing Hadoop jobs to analyze data using HiveQL (queries), Pig Latin (data flow language), and custom MapReduce programs in Java.
  • Performed various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Expert in creating Hive UDFs in Java to analyze data efficiently.
  • Responsible for loading the data from BDW Oracle database, Teradata into HDFS using Sqoop.
  • Implemented AJAX, JSON, and JavaScript to create interactive web screens.
  • Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB. Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries. Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
  • Created Session Beans and controller Servlets for handling HTTP requests from Talend
  • Performed data visualization and designed dashboards with Tableau, generating complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
  • Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
  • Utilized Waterfall methodology for team and project management
  • Used Git for version control with the Data Engineer team and Data Scientist colleagues. Created Tableau dashboards using stacked bars, bar graphs, scatter plots, geographical maps, and Gantt charts with the Show Me functionality, and built dashboards and stories as needed using Tableau Desktop and Tableau Server.
  • Performed statistical analysis using SQL, Python, R Programming and Excel.
  • Worked extensively with Excel VBA Macros, Microsoft Access Forms
  • Import, clean, filter and analyze data using tools such as SQL, HIVE and PIG.
  • Used Python and SAS to extract, transform, and load source data from transaction systems, and generated reports, insights, and key conclusions.
  • Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, which allowed end users to understand the data on the fly using quick filters for on-demand information.
  • Analyzed and recommended improvements for better data consistency and efficiency
  • Designed and Developed data mapping procedures ETL-Data Extraction, Data Analysis and Loading process for integrating data using R programming.
  • Effectively communicated plans, project status, project risks, and project metrics to the project team and planned test strategies in accordance with the project scope.
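The Hive partitioning and bucketing noted above can be expressed through the PySpark DataFrame writer; the database, table, and column names in this sketch are assumptions for illustration, and a Hive-enabled SparkSession is assumed.

```python
# Hedged sketch of partitioned, bucketed Hive table writes via PySpark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-partitioning")
    .enableHiveSupport()
    .getOrCreate()
)

orders = spark.table("staging.orders_raw")  # assumed source table

# Partition by date for pruning; bucket by customer_id to speed up joins.
(
    orders.write.mode("overwrite")
    .partitionBy("order_date")
    .bucketBy(32, "customer_id")
    .sortBy("customer_id")
    .format("parquet")
    .saveAsTable("analytics.orders")
)
```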

Environment: Hadoop, Pig, Hive, Informatica, Cognos, HBase, MapReduce, HDFS, Sqoop, SQL, Tableau, Python, SAS, Oozie, NoSQL, MongoDB, Oracle, Linux, Git.

Confidential

Data Analyst

Responsibilities:

  • Assisted in designing the overall ETL solution, including analyzing data and preparing high-level and detailed design documents, test and data validation plans, and the deployment strategy.
  • Prepared the technical mapping specifications, process flow and error handling document.
  • Developed both simple and complex mappings implementing complex business logic using a variety of transformations, such as unconnected and connected Lookups, Router, Filter, Expression, Aggregator, Joiner, Update Strategy, unconnected and connected Stored Procedures, Normalizer, and more.
  • Created various tasks such as Pre/Post Session Commands, Timer, Event Wait, Event Raise, Email, and Command tasks.
  • Experienced in writing live Real-time Processing using Spark Streaming with Kafka
  • Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting
  • Experienced in querying data using SparkSQL on top of Spark engine
  • Involved in managing and monitoring Hadoop cluster using Cloudera Manager.
  • Used Python and Shell scripting to build pipelines.
  • Developed data pipelines using Sqoop, HiveQL, Spark, and Kafka to ingest enterprise message delivery data into HDFS.
  • Developed workflows in Oozie as well as Airflow to automate the tasks of loading data into HDFS and pre-processing it with Pig and Hive (a minimal Airflow DAG sketch follows this responsibilities list).
  • Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
  • Wrote AutoSys JIL files to run workflow components and deploy the files.
  • Extensively worked in database components like SQL, PL/SQL, Stored Procedures, Stored Functions, Packages and Triggers.
  • Performs Code review and troubleshooting of existing Informatica mappings and deployment of code from Development to test to production environment.
  • Supporting other ETL developers, providing mentoring, technical assistance, troubleshooting and alternative development solutions
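The Oozie/Airflow orchestration described above can be sketched as a minimal Airflow DAG; the DAG id, schedule, and shell commands below are assumptions for illustration rather than the project's actual workflow, and the import paths assume Airflow 2.x.

```python
# Hedged sketch of an ingest-then-transform Airflow DAG (placeholder commands).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="message_delivery_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull the day's records from the source database into HDFS.
    ingest = BashOperator(
        task_id="sqoop_ingest",
        bash_command=(
            "sqoop import --connect jdbc:oracle:thin:@//db:1521/ORCL "
            "--table MESSAGES --target-dir /data/raw/messages/{{ ds }}"
        ),
    )
    # Pre-process the ingested data with a Hive script.
    transform = BashOperator(
        task_id="hive_transform",
        bash_command="hive -f /opt/etl/transform_messages.hql",
    )

    ingest >> transform
```

The `ingest >> transform` line defines the task ordering, which is the role Oozie workflow actions and transitions played in the original setup.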

Environment: Hadoop, HDFS, Spark, HiveQL, Kafka, Oozie, Pig, Airflow, Informatica, Oracle, PL/SQL, SQL Server, Unix, Linux, Shell Scripting.

Confidential

Data Analyst

Responsibilities:

  • Imported Legacy data from SQL Server and Teradata into Amazon S3.
  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Compared the data in a leaf-level process across various databases when data transformation or data loading took place, analyzing data quality after these loads to check for any data loss or data corruption.
  • Worked on retrieving data from the file system to S3 using Spark commands (an S3 loading sketch follows this responsibilities list).
  • Built S3 buckets, managed policies for them, and used S3 and Glacier for storage and backup on AWS.
  • Created performance dashboards in Tableau, Excel, and PowerPoint for the key stakeholders.
  • Incorporated predictive modeling (rule engine) to evaluate the Customer/Seller health score using python scripts, performed computations and integrated with the Tableau viz.
  • Worked with stakeholders to communicate campaign results, strategy, issues or needs.
  • Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
  • Evaluated the traffic and performance of Daily Deals PLA ads and compared those items with non-Daily-Deal items to assess the possibility of increasing ROI; suggested improvements and modified existing BI components (reports, stored procedures).
  • Understood business requirements thoroughly and developed a test strategy based on business rules.
  • Implemented a defect tracking process using the JIRA tool by assigning bugs to the development team.
  • Involved in functional testing, integration testing, regression testing, smoke testing, and performance testing; tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
  • Generated custom SQL to verify the dependencies for the daily, weekly, and monthly jobs.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Developed Spark and Spark SQL/Streaming code for faster testing and processing of data.
  • Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on the requirements.
  • Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
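The file-system-to-S3 loads above can be sketched two ways: boto3 for a single exported file and the Spark writer for bulk datasets. Bucket names, paths, and the s3a connector configuration are placeholder assumptions, and appropriate IAM permissions are assumed.

```python
# Hedged sketch: copy local/HDFS data to S3 (placeholder buckets and paths).
import boto3
from pyspark.sql import SparkSession

# Upload a single exported file to S3 with boto3.
s3 = boto3.client("s3")
s3.upload_file(
    "/exports/legacy_customers.csv",
    "legacy-archive-bucket",
    "sqlserver/customers/legacy_customers.csv",
)

# Bulk-copy a dataset from HDFS to S3 as Parquet via Spark.
spark = SparkSession.builder.appName("fs-to-s3").getOrCreate()
(
    spark.read.option("header", True).csv("hdfs:///data/legacy/customers")
    .write.mode("overwrite")
    .parquet("s3a://legacy-archive-bucket/sqlserver/customers_parquet")
)
```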

Environment: Hadoop, MapReduce, Spark SQL, Python, Pig, AWS, GitHub, EMR, Nebula Metadata, Teradata, SQL Server, Apache Spark, Sqoop
