
Senior Big Data Engineer Resume


Eagan, MN

SUMMARY

  • Eight-plus years of experience in Analysis, Design, Development and Implementation as a Data Engineer.
  • Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification and testing, in both Waterfall and Agile methodologies.
  • Experience in development and design of various scalable systems using Hadoop technologies in various environments. Extensive experience in analyzing data using Hadoop ecosystem components including HDFS, MapReduce, Hive and Pig.
  • Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, psycopg2, embedPy, NumPy and Beautiful Soup.
  • Proficient in statistical methodologies including hypothesis testing, ANOVA, time series, principal component analysis, factor analysis, cluster analysis and discriminant analysis.
  • Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Strong experience in writing scripts using Python API, PySpark API and Spark API for analyzing the data.
  • Worked with AWS and GCP clouds, using GCP Cloud Storage, Dataproc, Dataflow and BigQuery alongside AWS EMR, S3, Glacier and EC2 instances with EMR clusters.
  • Experience in setting up monitoring infrastructure for Hadoop cluster using Nagios and Ganglia.
  • Maintained BigQuery, PySpark and Hive code by fixing bugs and providing enhancements required by business users.
  • Well versed in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Extensively worked with Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive and MongoDB using Python.
  • Experience with Google Cloud components, Google Container Builder, GCP client libraries and the Cloud SDK.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality (a minimal sketch appears after this list).
  • Expertise in designing complex mappings, performance tuning, and slowly changing dimension and fact tables.
  • Knowledge of working on proofs of concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for data exploration using data munging and Teradata.
  • Excellent communication skills. Work successfully in fast-paced, multitasking environments, both independently and in collaborative teams; a self-motivated, enthusiastic learner.
  • Skilled in data parsing, data ingestion, data manipulation, data architecture, data modelling and data preparation, with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt and reshape.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig using Python.
  • Experience in developing MapReduce programs on Apache Hadoop to analyze big data as per requirements.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like MongoDB, HBase and Cassandra.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS and SES.
  • Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server and Oracle.
  • Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Good knowledge of data marts, OLAP and dimensional data modeling with the Ralph Kimball methodology (star schema and snowflake modeling for fact and dimension tables) using Analysis Services.
  • Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.
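
As a rough illustration of the Python UDF work called out above, the sketch below registers a small Python function with Spark so it can be used from DataFrames and from Spark SQL / Hive-style queries. The SparkSession setup, the orders data and the size threshold are hypothetical placeholders rather than code from any project listed here.

# Minimal PySpark sketch: a Python UDF for labeling rows (placeholders noted above).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

def classify_amount(amount):
    # Label each order by size; the 1000 threshold is illustrative only.
    if amount is None:
        return "unknown"
    return "large" if amount >= 1000 else "small"

classify_udf = udf(classify_amount, StringType())                      # DataFrame API version
spark.udf.register("classify_amount", classify_amount, StringType())  # SQL / Hive-style version

orders = spark.createDataFrame([(1, 250.0), (2, 1800.0)], ["order_id", "amount"])
orders.withColumn("size_label", classify_udf("amount")).show()

orders.createOrReplaceTempView("orders")
spark.sql("SELECT order_id, classify_amount(amount) AS size_label FROM orders").show()

Registering the function by name is what makes it callable inside SQL text, which is the usual way such UDFs get exposed to Hive-style queries.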

TECHNICAL SKILLS

Big Data / Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, YARN, Oozie, Zookeeper, Hue, Ambari Server

Programming/ Query Languages: Java, SQL, Python Programming (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, SAS, R Programming (Caret, Glmnet, XGBoost, rpart, ggplot2, sqldf), RStudio, PL/SQL, Linux shell scripts, Scala.

NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB

Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans.

Public Cloud: EC2, IAM, S3, Auto Scaling, CloudWatch, Route 53, EMR, Redshift

Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall

Build Tools: Jenkins, Toad, SQL*Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos.

Databases: Microsoft SQL Server, MySQL, Oracle, DB2, Teradata, Netezza

Operating Systems: All versions of Windows, UNIX, Linux, macOS, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, Eagan, MN

Senior Big Data Engineer

Responsibilities:

  • Installing, configuring and maintaining Data Pipelines
  • Transforming business problems into Big Data solutions and defining the Big Data strategy and roadmap.
  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Authoring Python (PySpark) scripts and custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling and all cleaning and conforming tasks.
  • Writing Pig scripts to generate MapReduce jobs and performing ETL procedures on the data in HDFS.
  • Develop solutions to leverage ETL tools and identify opportunities for process improvements using Informatica and Python
  • Conduct root cause analysis and resolve production problems and data issues
  • Performance tuning, code promotion and testing of application changes
  • Conduct performance analysis, optimize data processes and make recommendations for continuous improvement of the data processing environment.
  • Developed a data platform from scratch and took part in the requirement gathering and analysis phase of the project, documenting the business requirements.
  • Design and implement multiple ETL solutions with various data sources through extensive SQL scripting, ETL tools, Python, shell scripting and scheduling tools; data profiling and data wrangling of XML, web feeds and files using Python, Unix and SQL.
  • Loading data from different sources into a data warehouse and performing data aggregations for business intelligence using Python.
  • Designed and implemented Sqoop for an incremental job to read data from DB2 and load it into Hive tables, and connected to Tableau for generating interactive reports using HiveServer2.
  • Used Sqoop to channel data from different sources of HDFS and RDBMS.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats.
  • Used SSIS to build automated multi-dimensional cubes.
  • Used Spark Streaming to receive real-time data from Kafka and stored the stream data in HDFS and in NoSQL databases such as HBase and Cassandra using Python (see the sketch after this list).
  • Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Prepared and uploaded SSRS reports. Managed database and SSRS permissions.
  • Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various modules of AML including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD and EDD.
  • Used SQL Server Management Studio to check the data in the database against the requirements given.
  • Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
  • Identified and documented Functional/Non-Functional and other related business decisions for implementing Actimize-SAM to comply with AML Regulations.
  • Work with region and country AML Compliance leads to support start-up of compliance-led projects at regional and country levels, including defining the subsequent phases: training, UAT, staffing to perform test scripts, data migration, and the uplift strategy (updating customer information to bring it to the new KYC standards) with review of customer documentation.
  • End-to-end development of Actimize models for the project bank's trading compliance solutions.
  • Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MLOAD, BTEQ and FastLoad.
  • Implemented Actimize Anti-Money Laundering (AML) system to monitor suspicious transactions and enhance regulatory compliance.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Automated the data processing with Oozie to automate data loading into the Hadoop Distributed File System.
  • Developed automated regression scripts for validation of ETL processes between multiple databases such as AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL) using Python.
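
The Kafka-to-HDFS bullet above might look roughly like the sketch below, shown here with Spark Structured Streaming rather than the older DStream API; the broker address, topic name and HDFS paths are hypothetical placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

# Minimal PySpark Structured Streaming sketch: read from Kafka, persist to HDFS.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-sketch")
         .getOrCreate())

# Subscribe to a Kafka topic (broker and topic are placeholders).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "learner-events")
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

# Persist the stream to HDFS as Parquet, with checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/learner/raw")
         .option("checkpointLocation", "hdfs:///checkpoints/learner")
         .start())

query.awaitTermination()

The checkpoint location is what lets the stream restart after a failure without losing or reprocessing records.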

Environment: Cloudera Manager (CDH5), Hadoop, PySpark, HDFS, NiFi, Pig, Hive, S3, Kafka, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, OLTP, OLAP, HBase, Cassandra, SQL Server, Python, Shell Scripting, XML, Unix.

Confidential, St. Louis, MO

Sr. Data Engineer / Big Data Engineer

Responsibilities:

  • Transforming business problems into Big Data solutions and define Big Data strategy and Roadmap. Installing, configuring, and maintaining Data Pipelines
  • Developed the features, scenarios and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin and Ruby.
  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Creating pipelines in ADF using linked services, datasets and pipelines to extract, transform and load data from different sources such as Azure SQL, Blob Storage and Azure SQL Data Warehouse, including write-back.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Working with data governance and data quality teams to design various models and processes.
  • Involved in all steps and scope of the project's reference data approach to MDM; created a data dictionary and mapping from sources to the target in the MDM data model.
  • Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics and an understanding of how to integrate them with other Azure services. Knowledge of U-SQL.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and AWS Lambda functions in Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
  • Storing data files in Google Cloud Storage buckets on a daily basis. Using Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
  • Decommissioning nodes and adding nodes in the clusters for maintenance
  • Monitored cluster health by setting up alerts using Nagios and Ganglia
  • Adding new users and groups of users as per the requests from the client
  • Working on tickets opened by users regarding various incidents, requests
  • Created a Lambda deployment function and configured it to receive events from S3 buckets (see the sketch after this list).
  • Writing UNIX shell scripts to automate jobs and scheduling cron jobs for job automation using Crontab.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
  • Developed mappings using transformations like Expression, Filter, Joiner and Lookup for better data massaging and to migrate clean and consistent data.
  • Used Apache Spark DataFrames, Spark SQL and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL and MLlib libraries.
  • Data integration ingests, transforms and integrates structured data and delivers it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
  • Applied various machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM and clustering to identify volume, using the scikit-learn package in Python, R and MATLAB. Collaborated with data engineers and software developers to develop experiments and deploy solutions to production.
  • Create and publish multiple dashboards and reports using Tableau Server and work on text analytics, Naive Bayes, sentiment analysis, creating word clouds and retrieving data from Twitter and other social networking platforms.
  • Work on data that was a combination of unstructured and structured data from multiple sources and automate the cleaning using Python scripts.
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
  • Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
  • Involved in unit testing the code and provided feedback to the developers. Performed unit testing of the application using NUnit.
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization.
  • Write research reports describing the experiments conducted, results and findings, and make strategic recommendations to technology, product and senior management. Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python Jupyter Notebook, Hive and NoSQL.
  • Wrote production-level machine learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes within a given time frame.
  • Performed all necessary day-to-day GIT support for different projects, Responsible for design and maintenance of the GIT Repositories, and the access control strategies.
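
An S3-triggered Lambda like the one described above could be sketched as follows; the handler only inspects and logs the incoming object, and the bucket contents and any downstream processing are hypothetical.

# Minimal sketch of an S3-triggered AWS Lambda handler in Python.
import json
import urllib.parse
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Each record describes one S3 object-created event delivered to the function.
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the new object's metadata as a stand-in for real processing.
        obj = s3.get_object(Bucket=bucket, Key=key)
        print(json.dumps({"bucket": bucket, "key": key, "bytes": obj["ContentLength"]}))

    return {"statusCode": 200, "body": "processed %d record(s)" % len(records)}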

Environment: Hadoop, Kafka, Spark, Sqoop, Spark SQL, TDD, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper, Python 3.6, AWS (Glue, Lambda, Step Functions, SQS, CodeBuild, CodePipeline, EventBridge, Athena), Unix/Linux shell scripting, PyCharm, Informatica.

Confidential, Columbus, OH

Big Data Engineer

Responsibilities:

  • Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
  • Worked on Hadoop cluster which ranged from 4-8 nodes during pre-production stage and it was sometimes extended up to 24 nodes during production.
  • Built APIs that will allow customer service representatives to access the data and answer queries.
  • Designed changes to transform current Hadoop jobs to HBase.
  • Handled fixing of defects efficiently and worked with the QA and BA team for clarifications.
  • Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files.
  • Extending the functionality of Hive with custom UDFs and UDAFs.
  • The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports and established self-service reporting model in Cognos for business users.
  • Implemented Bucketing and Partitioning using hive to assist the users with data analysis.
  • Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
  • Implemented partitioning, dynamic partitions and buckets in Hive (see the sketch after this list).
  • Develop database management systems for easy access, storage, and retrieval of data.
  • Perform DB activities such as indexing, performance tuning, and backup and restore.
  • Expertise in writing Hadoop Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom Map Reduce programs in Java.
  • Applied various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Expert in creating Hive UDFs using Java to analyze the data efficiently.
  • Responsible for loading the data from BDW Oracle database, Teradata into HDFS using Sqoop.
  • Implemented AJAX, JSON and JavaScript to create interactive web screens.
  • Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB. Involved in loading and transforming large sets of structured, semi-structured and unstructured data and analyzed them by running Hive queries. Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
  • Created Session Beans and controller Servlets for handling HTTP requests from Talend
  • Performed data visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries and graphs to interpret the findings for the team and stakeholders.
  • Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
  • Utilized Waterfall methodology for team and project management
  • Used Git for version control with Data Engineer and Data Scientist colleagues. Created Tableau dashboards using stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, etc. with the Show Me functionality, and built dashboards and stories as needed using Tableau Desktop and Tableau Server.
  • Performed statistical analysis using SQL, Python, R Programming and Excel.
  • Worked extensively with Excel VBA Macros, Microsoft Access Forms
  • Import, clean, filter and analyze data using tools such as SQL, Hive and Pig.
  • Used Python & SAS to extract, transform & load source data from transaction systems and generated reports, insights and key conclusions.
  • Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, which allowed end users to understand the data on the fly using quick filters for on-demand information.
  • Analyzed and recommended improvements for better data consistency and efficiency
  • Designed and developed data mapping procedures for ETL: data extraction, data analysis and the loading process for integrating data using R programming.
  • Effectively communicated plans, project status, project risks and project metrics to the project team, and planned test strategies in accordance with project scope.
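
The Hive partitioning and bucketing work referenced above could be sketched through PySpark as shown below; the database and table names, columns and bucket count are hypothetical, and Hive support is assumed to be enabled on the cluster.

# Minimal sketch: writing a Hive table with partitioning and bucketing from PySpark.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-bucket-sketch")
         .enableHiveSupport()
         .getOrCreate())

events = spark.createDataFrame(
    [("2023-01-01", "user1", 10), ("2023-01-02", "user2", 20)],
    ["event_date", "user_id", "clicks"])

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")             # hypothetical database
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")  # allow dynamic partitions

(events.write
 .mode("overwrite")
 .partitionBy("event_date")         # one directory per date, pruned at query time
 .bucketBy(8, "user_id")            # 8 buckets on user_id to help joins and sampling
 .sortBy("user_id")
 .saveAsTable("analytics.events"))  # hypothetical Hive table

Partitioning narrows scans to the dates a query touches, while bucketing pre-clusters rows by user_id so joins on that key can avoid a full shuffle.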

Environment: Cloudera CDH4.3, Hadoop, Pig, Hive, Informatica, HBase, Map Reduce, HDFS, Sqoop, Impala, SQL, Tableau, Python, SAS, Flume, Oozie, Linux.

Confidential

Data Engineer/ Data Analyst

Responsibilities:

  • Experience in Big Data Analytics and design in Hadoop ecosystem using Map Reduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
  • Built the Oozie pipeline, which performs several actions such as the file move process, Sqooping the data from the source Teradata or SQL Server, exporting it into Hive staging tables, performing aggregations as per business requirements, and loading into the main tables.
  • Ran Apache Hadoop, CDH and MapR distributions on Elastic MapReduce (EMR) on EC2.
  • Performing a forking action whenever there is scope for parallel processing, to optimize data latency.
  • Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
  • Wrote a Pig script that picks up data from one HDFS path, performs aggregation and loads the result into another path, which later populates another domain table. Converted this script into a JAR and passed it as a parameter in the Oozie script.
  • Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) to process the data using the SQL activity. Built an ETL that uses a Spark JAR to execute the business analytical model.
  • Hands-on experience with Git Bash commands such as git pull to pull code from the source and develop it as per requirements, git add to stage files, git commit after the code builds, and git push to the pre-prod environment for code review; later used screwdriver.yaml, which builds the code and generates artifacts that are released into production.
  • Created the logical data model from the conceptual model and converted it into the physical database design using Erwin. Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.
  • Connected to AWS Redshift through Tableau to extract live data for real time analysis.
  • Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.
  • Involved in creating UNIX shell Scripting. Defragmentation of tables, partitioning, compressing and indexes for improved performance and efficiency.
  • Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
  • Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
  • Developed and implemented an R and Shiny application that showcases machine learning for business forecasting. Developed predictive models using Python & R to predict customer churn and classify customers (a minimal sketch appears after this list).
  • Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
  • Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
  • Data analysis using regressions, data cleaning, Excel VLOOKUP, histograms and the TOAD client, with data representation of the analysis and suggested solutions for investors.
  • Rapid model creation in Python using pandas, NumPy, scikit-learn and Plotly for data visualization. These models were then implemented in SAS, where they are interfaced with MSSQL databases and scheduled to update on a timely basis.
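
A churn classifier of the kind described above could be sketched in Python with scikit-learn as below; the CSV path, feature names and model choice are hypothetical placeholders, not the project's actual model.

# Minimal churn-classification sketch with scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Hypothetical customer dataset with a binary "churned" label.
df = pd.read_csv("customers.csv")
features = ["tenure_months", "monthly_spend", "support_tickets"]
X, y = df[features], df["churned"]

# Stratified split keeps the churn rate similar in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))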

Environment: MapReduce, Spark, Hive, Pig, Sqoop, HBase, AWS, Oozie, Impala, Kafka, JSON, XML, PL/SQL, SQL, Azure, HDFS, Unix, Python, PySpark.

Confidential

Data Analyst

Responsibilities:

  • Imported Legacy data from SQL Server and Teradata into Amazon S3.
  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Exported data into Snowflake by creating staging tables to load data from different files in Amazon S3 (see the sketch after this list).
  • Compared data at a leaf level across various databases when data transformation or data loading takes place, and analyzed data quality after these types of loads (looking for any data loss or data corruption).
  • As part of the data migration, wrote many SQL scripts to identify data mismatches and worked on loading the history data from Teradata SQL to Snowflake.
  • Developed SQL scripts to upload, retrieve, manipulate and handle sensitive data (National Provider Identifier data, i.e. name, address, SSN, phone number) in Teradata, SQL Server Management Studio and Snowflake databases for the project.
  • Worked on retrieving data from the file system to S3 using Spark commands.
  • Built S3 buckets and managed policies for S3 buckets and used S3 bucket and Glacier for storage and backup on AWS
  • Created performance dashboards in Tableau / Excel / PowerPoint for the key stakeholders.
  • Incorporated predictive modeling (rule engine) to evaluate the Customer/Seller health score using python scripts, performed computations and integrated with the Tableau viz.
  • Worked with stakeholders to communicate campaign results, strategy, issues or needs.
  • Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
  • Evaluated the traffic and performance of Daily Deals PLA ads and compared those items with non-daily-deal items to assess the possibility of increasing ROI. Suggested improvements and modified existing BI components (reports, stored procedures).
  • Understood business requirements thoroughly and came up with a test strategy based on business rules.
  • Implemented Defect Tracking process using JIRA tool by assigning bugs to Development Team
  • Involved in functional testing, integration testing, regression testing, smoke testing and performance testing. Tested Hadoop MapReduce jobs developed in Python, Pig and Hive.
  • Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
  • Generated Custom SQL to verify the dependency for the daily, Weekly, Monthly jobs.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
  • Experienced in working with spark ecosystem using Spark SQL and Scala queries on different formats like text file, CSV file.
  • Developed spark code and spark-SQL/streaming for faster testing and processing of data.
  • Closely involved in scheduling Daily, Monthly jobs with Precondition/Postcondition based on the requirement.
  • Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
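
Loading S3 files into Snowflake through a stage, as described above, could be sketched with the Snowflake Python connector roughly as follows; the account, credentials, bucket, stage and table names are all hypothetical placeholders.

# Minimal sketch: stage CSV files from S3 and COPY them into a Snowflake staging table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # hypothetical account identifier
    user="etl_user",             # hypothetical credentials
    password="***",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="STAGING")

cur = conn.cursor()
try:
    # External stage pointing at the S3 prefix holding the extracted files.
    cur.execute("""
        CREATE OR REPLACE STAGE legacy_s3_stage
        URL = 's3://my-bucket/legacy-extracts/'
        CREDENTIALS = (AWS_KEY_ID = '***' AWS_SECRET_KEY = '***')
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)

    # Staging table matching the file layout, then bulk load with COPY INTO.
    cur.execute("CREATE TABLE IF NOT EXISTS stg_customers (id NUMBER, name STRING, state STRING)")
    cur.execute("COPY INTO stg_customers FROM @legacy_s3_stage ON_ERROR = 'CONTINUE'")
finally:
    cur.close()
    conn.close()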

Environment: Snowflake, AWS S3, GitHub, Service Now, HP Service Manager, EMR, Nebula, Teradata, SQL Server, Apache Spark, Sqoop
