
Big Data Engineer Resume


Santa Clara, CA

SUMMARY

  • Data Engineering professional with solid foundational skills and a proven track record of implementations across a variety of data platforms. Self-motivated, with a strong adherence to personal accountability in both individual and team scenarios.
  • Around 8 years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
  • Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification, and Testing per cycle, in both Waterfall and Agile methodologies.
  • Expertise in using various Hadoop ecosystem components such as MapReduce, Pig, Hive, ZooKeeper, HBase, Sqoop, Oozie, Flume, Drill, and Spark for data storage and analysis.
  • Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL), and used UDFs from the Piggybank UDF repository (see the PySpark UDF sketch following this summary).
  • Strong experience in writing scripts using Python API, PySpark API and Spark API for analyzing the data.
  • Experience in setting up monitoring infrastructure for Hadoop cluster using Nagios and Ganglia.
  • Proficient in statistical methodologies including Hypothesis Testing, ANOVA, Time Series, Principal Component Analysis, Factor Analysis, Cluster Analysis, and Discriminant Analysis.
  • Worked with various text analytics libraries such as Word2Vec, GloVe, and LDA; experienced with hyperparameter tuning techniques like Grid Search and Random Search, and with model performance tuning using ensembles and deep learning.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Experience with Proofs of Concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for data exploration using data munging and Teradata.
  • Well experienced in Normalization and De-Normalization techniques for optimum performance in relational and dimensional database environments.
  • Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
  • Knowledge of the Cloudera platform and Apache Hadoop 0.20.x.
  • Very good exposure to OLAP and OLTP.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Expertise in designing complex mappings, in performance tuning, and in implementing Slowly Changing Dimension tables and Fact tables.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
  • Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
  • Excellent communication skills; work successfully in fast-paced, multitasking environments, both independently and in collaborative teams; a self-motivated, enthusiastic learner.
  • Skilled in performing data parsing, data ingestion, data manipulation, data architecture, data modeling, and data preparation with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape.
  • Hands-on use of Spark and Scala API's to compare the performance of Spark with Hive and SQL, and Spark SQL to manipulate Data Frames in Scala.
  • Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
  • Experience in developing Map Reduce Programs using Apache Hadoop for analyzing the big data as per the requirement.
  • Hands-on with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Worked with Cloudera and Hortonworks distributions.
  • Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
  • Good experience in building pipelines using Azure Data Factory and moving the data into Azure Data Lake Store.
  • Experience in designing Star and Snowflake schemas for Data Warehouse and ODS architectures.
  • Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake modeling for Fact and Dimension tables) using Analysis Services.
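The sketch below illustrates the Python UDF pattern referenced above: a small Python function registered through PySpark (with Hive support enabled) so it can be called from SQL/HiveQL-style queries. The table, column, and function names are hypothetical rather than taken from any specific project.

```python
# Minimal sketch (illustrative names): a Python UDF registered with Spark so it
# can be called from SQL/HiveQL-style queries against Hive tables.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("udf-example")
         .enableHiveSupport()
         .getOrCreate())

def normalize_phone(raw):
    """Keep digits only, e.g. '(408) 555-1212' -> '4085551212'."""
    return "".join(ch for ch in raw if ch.isdigit()) if raw else None

# Register the Python function so SQL queries can call it by name.
spark.udf.register("normalize_phone", normalize_phone, StringType())

# 'customers' is a hypothetical Hive table used only for illustration.
df = spark.sql("SELECT customer_id, normalize_phone(phone) AS phone FROM customers")
df.show()
```

The same function could also be wrapped with pyspark.sql.functions.udf and applied directly as a DataFrame column expression.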

TECHNICAL SKILLS

Big Data Tools: Hadoop Ecosystem, MapReduce, Spark 2.3, Airflow, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0

Programming Languages: Python, Scala, SQL, PL/SQL, and UNIX shell scripting.

Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile

Cloud Platform: AWS, Microsoft Azure

Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda

Databases: Oracle, Teradata R15/R14.

BI Tools: SSIS, SSRS, SSAS.

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9

ETL/Data warehouse Tools: Informatica 9.6/9.1 and Tableau.

Operating System: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential - Santa Clara, CA

Big Data Engineer

Responsibilities:

  • Working as a developer on Hive and Impala for highly parallel data processing on Cloudera systems.
  • Working with big data technologies such as Spark 2.3 & 3.0, Scala, Hive, and Hadoop clusters (Cloudera platform).
  • Worked in AWS environment for development and deployment of Custom Hadoop Applications.
  • Installed Hadoop, MapReduce, and HDFS on AWS and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
  • Designed and implemented Spark SQL tables and Hive script jobs, using Stonebranch to schedule them and to create workflows and task flows.
  • Used partitioning and bucketing of data in Hive to speed up queries as part of Hive optimization.
  • Wrote Spark programs to move data from input storage locations to output locations, running data loading, validation, and transformation on the data.
  • Used Scala functions, dictionaries, and data structures (arrays, lists, maps) for better code reusability.
  • Performed unit testing based on the development work.
  • Prepared Technical Release Notes (TRN) for application deployment into the DEV/STAGE/PROD environments.
  • Developed report layouts for Suspicious Activity and Pattern analysis under AML regulations
  • Prepared and analyzed AS-IS and TO-BE views of the existing architecture and performed gap analysis. Created workflow scenarios, designed new process flows, and documented the business process and various business scenarios and activities from the conceptual to the procedural level.
  • Analyzed business requirements and employed Unified Modeling Language (UML) to develop high-level and low-level Use Cases, Activity Diagrams, Sequence Diagrams, Class Diagrams, Data-flow Diagrams, Business Workflow Diagrams, Swim Lane Diagrams, using Rational Rose
  • Worked with senior developers to implement ad-hoc and standard reports using Informatica, Cognos, MS SSRS and SSAS.
  • Worked with teams in setting up AWS EC2 instances by using different AWS services like S3, EBS, Elastic Load Balancer, and Auto scaling groups, VPC subnets and CloudWatch.
  • Thorough understanding of various modules of AML including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
  • Used SQL Server Management Studio to check the data in the database against the given requirements.
  • Performed data analysis and design, and created and maintained large, complex logical and physical data models and metadata repositories using Erwin and MB MDR.
  • Wrote shell scripts to trigger DataStage jobs.
  • Assist service developers in finding relevant content in the existing reference models.
  • Worked with various data sources such as Access, Excel, CSV, Oracle, and flat files using connectors, tasks, and transformations provided by AWS Data Pipeline.
  • Worked on distributed frameworks such as Apache Spark and Presto in Amazon EMR and Redshift, and interacted with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Developed PySpark scripts to protect raw data by applying hashing algorithms to client-specified columns (see the sketch after this list).
  • Responsible for Design, Development, and testing of the database and Developed Stored Procedures, Views, and Triggers
  • Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
  • Compiled and validated data from all departments and presented it to the Director of Operations.
  • Built a KPI calculator sheet and maintained it within SharePoint.
  • Created a data model that correlates all the metrics and produces valuable output.
  • Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
  • Performed ETL testing activities such as running the jobs, extracting the data from the databases using the necessary queries, transforming it, and uploading it into the data warehouse servers.
  • Performed data pre-processing using Hive and Pig.
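Below is a minimal sketch of the column-masking approach described above, assuming hypothetical paths and column names; it applies SHA-256 hashing to client-specified columns with pyspark.sql.functions.sha2 before writing the data back out.

```python
# Minimal sketch (illustrative paths and column names): hash client-specified
# columns of a raw dataset with SHA-256 before writing the masked copy.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col

spark = SparkSession.builder.appName("mask-columns").getOrCreate()

raw = spark.read.parquet("s3://example-bucket/raw/customers/")  # hypothetical input
sensitive_cols = ["ssn", "email", "phone"]                      # client-specified columns

masked = raw
for c in sensitive_cols:
    # sha2(column, 256) returns the hex-encoded SHA-256 digest of the value
    masked = masked.withColumn(c, sha2(col(c).cast("string"), 256))

masked.write.mode("overwrite").parquet("s3://example-bucket/masked/customers/")
```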

Environment: HDFS, Hive, Pig, AWS, Lambda, Sqoop, Spark, Linux, Kafka, Scala, Python, Stonebranch, Cloudera, PySpark, RESTful services, Oracle 12c, PL/SQL, SQL Server, T-SQL, Unix, Tableau, Parquet file format.

Confidential - NYC, NY

Data Engineer

Roles & Responsibilities:

  • Experience in Job management using Fair scheduler and Developed job processing scripts using Oozie workflow.
  • Used Spark and Hive to implement the transformations needed to join the daily ingested data to historic data.
  • Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real time.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Used Spark API over EMR Cluster Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts, UDFs using both Data frames/SQL/Data sets and RDD in Spark for Data Aggregation, queries and writing data back into OLTP system through Sqoop.
  • Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
  • Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, Data Frames, and Pair RDDs.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
  • Developed logistic regression models (using R programming and Python) to predict subscription response rate based on customer’s variables like past transactions, response to prior mailings, promotions, demographics, interests and hobbies, etc.
  • Created Tableau dashboards/reports for data visualization, Reporting and Analysis and presented it to Business.
  • Created/ Managed Groups, Workbooks and Projects, Database Views, Data Sources and Data Connections
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Implemented Copy activity, Custom Azure Data Factory Pipeline Activities
  • Primarily involved in Data Migration using SQL, SQL Azure, Azure Storage, and Azure Data Factory, SSIS, PowerShell.
  • Worked with the Business development managers and other team members on report requirements based on existing reports/dashboards, timelines, testing, and technical delivery.
  • Knowledge in Tableau Administration Tool for Configuration, adding users, managing licenses and data connections, scheduling tasks, embedding views by integrating with other platforms.
  • Developed dimensions and fact tables for data marts like Monthly Summary, Inventory data marts with various Dimensions like Time, Services, Customers and policies.
  • Developed reusable transformations to load data from flat files and other data sources to the Data Warehouse.
  • Created Tableau reports with complex calculations and worked on Ad-hoc reporting using Power BI.
  • Assisted the operations support team with transactional data loads by developing SQL*Loader and Unix scripts.
  • Implemented Python script to call the Cassandra Rest API, performed transformations and loaded the data into Hive.
  • Implemented Univariate, Bivariate, and Multivariate Analysis on the cleaned data for getting actionable insights on the 500-product sales data by using visualization techniques in Matplotlib, Seaborn, Bokeh, and created reports in Power BI.
  • Worked extensively in Python and built a custom ingest framework.
  • Experienced in handling large datasets during the ingestion process itself using partitioning, Spark in-memory capabilities, broadcasts in Spark, and effective, efficient joins and transformations.
  • Experienced in writing real-time processing jobs using Spark Streaming with Kafka (see the sketch after this list).
  • Created Cassandra tables to store various data formats of data coming from different sources.
  • Designed, developed data integration programs in a Hadoop environment with NoSQL data store Cassandra for data access and analysis.
  • Generated Custom SQL to verify the dependency for the daily, Weekly, Monthly jobs.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
  • Closely involved in scheduling Daily, Monthly jobs with Precondition/Post condition based on the requirement.
  • Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
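The sketch below shows the Spark Streaming with Kafka pattern referenced above, using the direct-stream API from the Spark 1.6 line listed in the environment; the broker address, topic name, and transformations are illustrative assumptions.

```python
# Minimal sketch (hypothetical broker/topic): consume events from Kafka with the
# Spark 1.6 direct-stream API and apply simple transformations per micro-batch.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-stream-example")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["learner-events"],                            # hypothetical topic
    kafkaParams={"metadata.broker.list": "broker1:9092"}  # hypothetical broker
)

# Each record arrives as a (key, value) pair; keep the value and drop empty lines.
events = stream.map(lambda kv: kv[1]).filter(lambda line: line.strip() != "")
events.pprint()  # in a real job, each micro-batch would be persisted in foreachRDD

ssc.start()
ssc.awaitTermination()
```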

Environment: Hadoop YARN, Spark 1.6, Spark Streaming, MS Azure, Spark SQL, Scala, Kafka, Python, Hive, Sqoop 1.4.6, Impala, Databricks, Data Lake, Azure HDInsight, Data Storage, Power BI, Tableau, OLTP, Talend, Oozie, Cassandra, Control-M, Java, Oracle 12c, Linux

Citrus Healthcare, Tampa, FL

Data Engineer

Roles & Responsibilities:

  • Created and executed Hadoop Ecosystem installation and document configuration scripts on Google Cloud Platform.
  • Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and csv file datasets into data frames using PySpark.
  • Researched and downloaded jars for Spark-avro programming.
  • Developed a PySpark program that writes data frames to HDFS as avro files.
  • Utilized Spark's parallel processing capabilities to ingest data.
  • Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
  • Developed a script that copies Avro-formatted data from HDFS to external tables in the raw layer.
  • Created PySpark code that uses Spark SQL to generate data frames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
  • In charge of PySpark code that creates data frames from tables in the data service layer and writes them to a Hive data warehouse.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
  • Doing data synchronization between EC2 and S3, Hive stand-up, and AWS profiling.
  • Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.
  • Configured documents which allow Airflow to communicate to its PostgreSQL database.
  • Developed Airflow DAGs in python by importing the Airflow libraries.
  • Utilized Airflow to schedule, automatically trigger, and execute the data ingestion pipeline (see the DAG sketch after this list).
  • Implemented clustering techniques like DBSCAN, K-means, K-means++ and Hierarchical clustering for customer profiling to design insurance plans according to their behavior pattern.
  • Used Grid Search to evaluate the best hyper-parameters for my model and K-fold cross validation technique to train my model for best results.
  • Worked with Customer Churn Models including Random forest regression, lasso regression along with pre-processing of the data.
  • Worked with NoSQL databases like HBase, Cassandra, DynamoDB (AWS) and MongoDB.
  • Used Python 3.X (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop variety of models and algorithms for analytic purposes.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python, and built models using deep learning frameworks.
  • Implemented application of various machine learning algorithms and statistical modeling like Decision Tree, Text Analytics, Sentiment Analysis, Naive Bayes, Logistic Regression and Linear Regression using Python to determine the accuracy rate of each model
  • Decommissioning nodes and adding nodes in the clusters for maintenance
  • Monitored cluster health by Setting up alerts using Nagios and Ganglia
  • Adding new users and groups of users as per the requests from the client
  • Working on tickets opened by users regarding various incidents, requests
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
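Below is a minimal sketch of an Airflow DAG of the kind referenced above for scheduling and triggering the ingestion pipeline; it follows the Airflow 1.x import layout, and the DAG name, schedule, and commands are hypothetical.

```python
# Minimal sketch (hypothetical names and commands): an Airflow DAG that runs the
# daily ingestion steps in order on a fixed schedule.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_ingestion",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 2 * * *",  # run every day at 02:00
    catchup=False,
) as dag:

    ingest_to_hdfs = BashOperator(
        task_id="ingest_to_hdfs",
        bash_command="spark-submit /opt/jobs/ingest_to_hdfs.py",  # hypothetical script
    )

    load_raw_tables = BashOperator(
        task_id="load_raw_tables",
        bash_command="hive -f /opt/jobs/load_raw_layer.hql",      # hypothetical script
    )

    # Run the Spark ingestion first, then load the Hive raw layer.
    ingest_to_hdfs >> load_raw_tables
```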

Environment: Spark, AWS, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile methodology, Stonebranch, Cloudera, Oracle 11g, PL/SQL, Unix, JSON and Parquet file formats

Polestar Solutions & Services - Noida, India

Python Developer

Roles & Responsibilities:

  • Responsible for SDLC process in gathering requirements, system analysis, design, development, testing and deployment.
  • Designed the front-end applications and interactive user interface (UI) web pages using web technologies such as HTML, XHTML, CSS, JavaScript, and jQuery.
  • Worked with a team of developers on Python applications for risk management.
  • Designed, developed, tested, deployed, and maintained the website.
  • Designed and developed data management system using MySQL.
  • Rewrote existing Python/Django modules to deliver data in specific formats.
  • Developed entire frontend and backend modules using Python on Django Web Framework.
  • Responsible for debugging and troubleshooting the web application.
  • Using Subversion control tool to coordinate team-development.
  • Used Django Database API’s to access database objects.
  • Developed a server-based web traffic statistical analysis tool using Flask and pandas (see the sketch after this list).
  • Wrote python scripts to parse XML documents and load the data in database.
  • Handled all the client-side validation using JavaScript.
  • Automated the existing scripts for performance calculations using NumPy and SQLAlchemy.
  • Used jQuery for all client-side JavaScript manipulation.
  • Created unit test/regression test framework for working/new code.
  • Created a Python based GUI application for Freight Tracking and processing
  • Used Django framework for application development.
  • Created database using MySQL, wrote several queries to extract data from database.
  • Wrote scripts in Python for extracting data from HTML file.
  • Effectively communicated with the external vendors to resolve queries.
  • Used Perforce for the version control.
  • Worked on application development, especially in UNIX environments, and familiar with common commands.
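The sketch below illustrates a small Flask-plus-pandas service of the kind referenced above for web-traffic statistics; the log path, column names, and endpoint are hypothetical.

```python
# Minimal sketch (hypothetical log file and columns): a Flask endpoint that
# summarizes web-traffic statistics from a CSV access log with pandas.
import pandas as pd
from flask import Flask, jsonify

app = Flask(__name__)
LOG_PATH = "access_log.csv"  # hypothetical file with columns: page, status, response_ms

@app.route("/stats")
def traffic_stats():
    df = pd.read_csv(LOG_PATH)
    summary = {
        "total_requests": int(len(df)),
        "error_rate": float((df["status"] >= 500).mean()),
        "avg_response_ms": float(df["response_ms"].mean()),
        "top_pages": df["page"].value_counts().head(5).to_dict(),
    }
    return jsonify(summary)

if __name__ == "__main__":
    app.run(debug=True)
```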

Environment: Python 2.7, Flask, PHP, HTML5, CSS, JavaScript, jQuery, AJAX, Web services, GitHub, Selenium, MySQL, PostgreSQL, MongoDB.
