Senior Big Data Engineer Resume

Atlanta, GA

SUMMARY

  • Data Engineer with almost 7 years of IT experience and a strong background in big data and data analytics projects.
  • Hands-on experience in gathering requirements and in developing, mapping, and creating data models.
  • Extensively used Python libraries such as NumPy, SciPy, PyTables, scikit-learn, NLTK, TextBlob, Gensim, Beautiful Soup, PySpark, and pytest.
  • Maintained BigQuery, PySpark, and Hive code by fixing bugs and delivering the enhancements required by business users.
  • Worked with AWS and GCP cloud services, including GCP Cloud Storage, Dataproc, Dataflow, and BigQuery, and AWS EMR, S3, Glacier, and EC2 instances with EMR clusters.
  • Expertise with tools in the Hadoop ecosystem, including Spark, Hive, Airflow, Impala, HDFS, ZooKeeper, Sqoop, Flume, and HBase.
  • Experience with AWS cloud services such as EC2, S3, RDS, ELB, EBS, VPC, Route 53, Auto Scaling groups, CloudWatch, CloudFront, and IAM for build configuration and troubleshooting of server migrations from physical to cloud on various Amazon images.
  • Hands-on experience with test-driven development and Software Development Life Cycle (SDLC) methodologies such as Agile and Scrum.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Strong experience in writing scripts using the Python, PySpark, and Spark APIs to analyze data.
  • Conducted root cause analysis and resolved production problems and data issues.
  • Developed solutions to leverage ETL tools and identify opportunities for process improvements using Python.
  • Good experience in creating data ingestion pipelines, data transformations, data management and data governance.
  • Designed new data pipelines and reworked existing data pipelines to make them more efficient.
  • Coordinated with the Machine Learning team to perform Data Visualization.
  • Worked extensively with the Teradata utilities FastExport and MultiLoad to export and load data to and from different source systems, including flat files.
  • Experienced in building automated regression scripts in Python to validate ETL processes across multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Expertise in Python and Scala, including writing user-defined functions (UDFs) for Hive and Pig in Python.
  • Experience in developing MapReduce programs using Apache Hadoop to analyze big data as required.
  • Hands-on experience with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Experienced in creating shell scripts to push data loads from various sources on the edge nodes onto HDFS.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Worked with Cloudera and Hortonworks distributions.
  • Good working knowledge of the Amazon Web Services (AWS) cloud platform, including services such as EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
  • Extensive experience in SDLC and STLC process development and implementation.
  • Worked in Core Java application development and maintenance support of AMS.
  • Performed project-management-level activities and audits such as CMMI, Lean, and Project Level Configuration Audit (IPWC).
  • Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Knowledge of working with proofs of concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for data exploration using data munging and Teradata.
  • Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Expertise in designing complex mappings, performance tuning, and slowly changing dimension tables and fact tables.
  • Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
  • Good knowledge of data marts, OLAP, and dimensional data modeling with the Ralph Kimball methodology (star schema and snowflake modeling for fact and dimension tables) using Analysis Services.
  • Strong Experience in working with Databases like DB2, SQL Server 2008 and MySQL and proficiency in writing complex SQL queries.
  • Excellent communication, interpersonal and analytical skills and a highly motivated team player with the ability to work independently.

TECHNICAL SKILLS

Big Data Ecosystems: Hadoop, HDFS, Hive, Spark, Sqoop, MapReduce, HBase, Airflow, NiFi, Pig, Kafka, Oozie

Programming Languages: Python, SQL, PL/SQL, HiveQL, Scala, Shell scripting, UNIX

Analytical Tools: SAS EM, SPSS, RapidMiner

Database: Oracle, MySQL, MS SQL, Teradata

Cloud technologies: AWS, Azure, GCP

Version control: Git, GitHub

Visualization tools: Tableau, Power BI

Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors (KNN), Principal Component Analysis

Operating System: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, Atlanta, GA

Senior Big Data Engineer

Responsibilities:

  • Worked as a developer in Hive and Impala for parallel data processing on Cloudera systems.
  • Worked with big data technologies including Spark 2.3 and 3.0, Scala, Hive, and Hadoop clusters (Cloudera platform).
  • Wrote Spark programs that move data from an input storage location to an output location, loading, validating, and transforming the data along the way.
  • Used Scala functions, dictionaries, and data structures (arrays, lists, maps) for better code reusability.
  • Performed unit testing on the developed code.
  • Prepared Technical Release Notes (TRN) for application deployment into the DEV/STAGE/PROD environments.
  • Developed report layouts for Suspicious Activity and Pattern analysis under AML regulations.
  • Prepared and analyzed the AS-IS and TO-BE states of the existing architecture and performed gap analysis. Created workflow scenarios, designed new process flows, and documented business processes, scenarios, and activities from the conceptual to the procedural level.
  • Analyzed business requirements and used Unified Modeling Language (UML) to develop high-level and low-level use cases, activity diagrams, sequence diagrams, class diagrams, data-flow diagrams, business workflow diagrams, and swim lane diagrams using Rational Rose.
  • Stored data files in Google Cloud Storage buckets on a daily basis. Used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
  • Started working with AWS for storage and handling of terabytes of data for customer BI reporting tools.
  • Worked with senior developers to implement ad-hoc and standard reports using Informatica, Cognos, MS SSRS and SSAS.
  • Built data pipelines using Data Fabric jobs, Sqoop, Spark, Scala, and Kafka. In parallel, worked on the Oracle and MySQL side on source-to-target data design.
  • Wrote BigQuery queries for data wrangling with the help of Dataflow in GCP.
  • Worked closely with the pub-sub model as part of the Lambda model implemented at TCF Bank.
  • Designed and implemented Spark SQL tables and Hive script jobs, using Stonebranch for scheduling and for creating workflows and task flows.
  • Used partitioning and bucketing of Hive data to speed up queries as part of Hive optimization.
  • Thorough understanding of various modules of AML, including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
  • Used the SQL Server management tool to check the data in the database against the given requirements.
  • Performed data analysis and design, and created and maintained large, complex logical and physical data models and metadata repositories using ERWIN and MB MDR.
  • Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
  • Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and uploading it to the data warehouse servers.
  • Pre-processed data using Hive and Pig.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Wrote shell scripts to trigger DataStage jobs.
  • Assisted service developers in finding relevant content in the existing reference models.
  • Worked with sources such as Access, Excel, CSV, Oracle, and flat files using connectors, tasks, and transformations provided by AWS Data Pipeline.
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Developed PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns.
  • Responsible for the design, development, and testing of the database; developed stored procedures, views, and triggers.
  • Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
  • Compiled and validated data from all departments and presented it to the Director of Operations.
  • Created a KPI calculator sheet and maintained it within SharePoint.
  • Created Tableau reports with complex calculations and worked on Ad-hoc reporting using Power BI.
  • Created a data model that correlates all the metrics and gives a valuable output.
  • Implemented Copy activities and custom Azure Data Factory pipeline activities.
  • Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.

Confidential, Branchburg, NJ

Big Data Engineer

Responsibilities:

  • Managed jobs using the Fair Scheduler and developed job processing scripts using Oozie workflows.
  • Used Spark and Hive to implement the transformations needed to join daily ingested data to historical data.
  • Created Databricks notebooks using SQL and Python and automated notebooks using jobs.
  • Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
  • Developed logistic regression models (using R and Python) to predict subscription response rate based on customer variables such as past transactions, responses to prior mailings, promotions, demographics, interests, and hobbies.
  • Created Tableau dashboards and reports for data visualization, reporting, and analysis and presented them to the business.
  • Created and managed groups, workbooks, projects, database views, data sources, and data connections.
  • Worked with the Business development managers and other team members on report requirements based on existing reports/dashboards, timelines, testing, and technical delivery.
  • Knowledge of the Tableau administration tool for configuration, adding users, managing licenses and data connections, scheduling tasks, and embedding views by integrating with other platforms.
  • Developed dimension and fact tables for data marts such as Monthly Summary and Inventory, with dimensions such as Time, Services, Customers, and Policies.
  • Developed reusable transformations to load data from flat files and other data sources to the Data Warehouse.
  • Assisted the operations support team with transactional data loads by developing SQL*Loader and Unix scripts.
  • Implemented a Python script to call the Cassandra REST API, perform transformations, and load the data into Hive.
  • Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real time.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Used the Spark API over Hadoop YARN on an EMR cluster to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using DataFrames, Spark SQL, Datasets, and RDDs in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Experienced in performance tuning of Spark applications, including setting the right batch interval, the correct level of parallelism, and memory tuning.
  • Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
  • Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
  • Generated custom SQL to verify the dependencies for the daily, weekly, and monthly jobs.
  • Registered business and technical datasets for the corresponding SQL scripts using Nebula Metadata.
  • Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
  • Closely involved in scheduling daily and monthly jobs with preconditions and postconditions based on the requirements.
  • Monitored the daily, weekly, and monthly jobs and provided support in case of failures or issues.
  • Worked extensively in Python and built a custom ingest framework.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, efficient joins, and transformations during the ingestion process itself.
  • Experienced in writing real-time processing jobs using Spark Streaming with Kafka.
  • Created Cassandra tables to store data in various formats coming from different sources.

Confidential

Sr. Data Engineer

Responsibilities:

  • Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap.
  • Installed, configured, and maintained data pipelines.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks.
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Designed and implemented an incremental Sqoop job to read data from DB2 and load it into Hive tables, and connected Tableau through HiveServer2 to generate interactive reports.
  • Created and executed Hadoop Ecosystem installation and document configuration scripts on Google Cloud Platform.
  • Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into DataFrames using PySpark.
  • Researched and downloaded jars for Spark-avro programming.
  • Used Airflow to schedule, automatically trigger, and execute data ingestion pipelines.
  • Implemented clustering techniques like DBSCAN, K-means, K-means++ and Hierarchical clustering for customer profiling to design insurance plans according to their behavior pattern.
  • Used grid search to find the best hyperparameters for the model and K-fold cross-validation to train it for best results.
  • Worked with Customer Churn Models including Random forest regression, lasso regression along with pre-processing of the data.
  • Used Python 3.x (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python, and built models using deep learning frameworks.
  • Applied various machine learning algorithms and statistical modeling techniques, such as Decision Tree, Text Analytics, Sentiment Analysis, Naive Bayes, Logistic Regression, and Linear Regression, using Python to determine the accuracy rate of each model.
  • Implemented Univariate, Bivariate, and Multivariate Analysis on the cleaned data for getting actionable insights on the 500-product sales data by using visualization techniques in Matplotlib, Seaborn, Bokeh, and created reports in Power BI.
  • Decommissioned and added nodes in the clusters for maintenance.
  • Developed a PySpark program that writes DataFrames to HDFS as Avro files.
  • Utilized Spark's parallel processing capabilities to ingest data.
  • Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
  • Developed a script that copies Avro-formatted data from HDFS to external tables in the raw layer.
  • Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
  • Owned the PySpark code that creates DataFrames from tables in the data service layer and writes them to a Hive data warehouse.
  • Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.
  • Configured the files that allow Airflow to communicate with its PostgreSQL database.
  • Developed Airflow DAGs in Python by importing the Airflow libraries.
  • Monitored cluster health by setting up alerts using Nagios and Ganglia.
  • Added new users and groups of users as requested by the client.
  • Worked on tickets opened by users regarding various incidents and requests.
  • Created Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs.
  • Used Sqoop to move data between HDFS and RDBMS sources.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
  • Used Spark Streaming with Python to receive real-time data from Kafka and store the stream data in HDFS and in NoSQL databases such as HBase and Cassandra.
  • Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.

Confidential

Data Engineer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop.
  • Experienced in loading and transforming of large sets of structured, semi-structured and unstructured data.
  • Developed Spark jobs and Hive Jobs to summarize and transform data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark data frames, Scala and Python.
  • Experienced in developing Spark scripts for data analysis in both Python and Scala.
  • Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts.
  • Built on-premises data pipelines using Kafka and Spark for real-time data analysis.
  • Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, Zookeeper and Sqoop.
  • Implemented partitioning, dynamic partitions, and buckets in Hive.
  • Manipulated and summarized data to maximize possible outcomes efficiently.
  • Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to understand the data on the fly and use quick filters to get information on demand.
  • Analyzed and recommended improvements for better data consistency and efficiency.
  • Designed and developed data mapping procedures and ETL processes (data extraction, data analysis, and loading) for integrating data using R.
  • Effectively communicated plans, project status, project risks, and project metrics to the project team, and planned test strategies in accordance with project scope.
  • Integrated HDP clusters with Active Directory and enabled Kerberos for Authentication.
  • Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
  • Set up alerting and monitoring using Stackdriver in GCP.
  • Evaluated existing infrastructure, systems, and technologies; provided gap analysis; documented requirements, evaluations, and recommendations for systems, upgrades, and technologies; and created the proposed architecture and specifications.
  • Installed and Configured Sqoop to import and export the data into Hive from Relational databases.
  • Administered large Hadoop environments, including building and supporting cluster setup, performance tuning, and monitoring in an enterprise environment.
  • Closely monitored and analyzed MapReduce job executions on the cluster at the task level and optimized Hadoop cluster components to achieve high performance.
  • Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
  • Used Python and SAS to extract, transform, and load source data from transaction systems, and generated reports, insights, and key conclusions.
  • Designed and implemented large-scale distributed solutions in AWS and GCP clouds.
  • Monitored Hadoop cluster operation through MCS and worked on NoSQL databases including HBase.
  • Created Hive tables, loaded data, and wrote Hive UDFs; worked with the Linux server admin team in administering the server hardware and operating system.
  • Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems, using RDBMS and NoSQL data stores for data access and analysis.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
  • Created reports in Tableau to visualize the resulting data sets and tested Spark SQL connectors.
  • Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
  • Collected and aggregated large amounts of log data and staged it in HDFS for further analysis.
  • Experience in managing and reviewing Hadoop Log files.
  • Used Sqoop to transfer data between relational databases and Hadoop.
  • Worked on HDFS to store and access huge datasets within Hadoop.
  • Good hands-on experience with GitHub.

Confidential

Data Analyst

Responsibilities:

  • Explored traffic data from databases, connected it with transaction data, and wrote and presented a report for every campaign with suggestions for future promotions.
  • Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis.
  • Cleaned, merged, and exported datasets in Tableau Prep.
  • Imported Legacy data from SQL Server and Teradata into Amazon S3.
  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Exported data into Snowflake by creating staging tables to load data from different files in Amazon S3.
  • Incorporated predictive modeling (rule engine) to evaluate the Customer/Seller health score using python scripts, performed computations and integrated with the Tableau viz.
  • Worked with stakeholders to communicate campaign results, strategy, issues or needs.
  • Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
  • Worked with the business to identify gaps in mobile tracking and devise solutions.
  • Analyzed click events on the Hybrid landing page, including bounce rate, conversion rate, jump-back rate, and List/Gallery view, and provided valuable information for landing page optimization.
  • Evaluated the traffic and performance of Daily Deals PLA ads and compared those items with non-daily-deal items to assess the possibility of increasing ROI. Suggested improvements to and modified existing BI components (reports, stored procedures).
  • Gained a thorough understanding of business requirements and devised a test strategy based on business rules.
  • Prepared a test plan to ensure that QA and development phases run in parallel.
  • Wrote and executed test cases and reviewed them with business and development teams.
  • Compared data at the leaf level across various databases whenever data transformation or loading took place, and analyzed data quality after these loads to look for data loss or corruption.
  • As part of data migration, wrote many SQL scripts to detect data mismatches and worked on loading historical data from Teradata to Snowflake.
  • Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e., name, address, SSN, phone number) in Teradata, SQL Server Management Studio, and Snowflake databases for the project.
  • Worked on transferring data from FS to S3 using Spark commands.
  • Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
  • Created performance dashboards in Tableau, Excel, and PowerPoint for key stakeholders.
  • Implemented a defect tracking process using JIRA by assigning bugs to the development team.
  • Automated the regression tool (Qute), which reduced manual effort and increased team productivity.
  • Involved in functional, integration, regression, smoke, and performance testing. Tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
  • Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
  • Generated custom SQL to verify the dependencies for the daily, weekly, and monthly jobs.
  • Registered business and technical datasets for the corresponding SQL scripts using Nebula Metadata.
  • Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
  • Closely involved in scheduling daily and monthly jobs with preconditions and postconditions based on the requirements.
  • Applied data processing and cleaning techniques to reduce text noise and dimensionality in order to improve the analysis.
  • Gathered the data visualization requirements from business users.
