Sr. Data Engineer Resume
Atlanta, GA
SUMMARY
- 8+ years of IT experience in software design, development, implementation, and support of business applications for the telecom, health, and insurance industries
- Experience with Big Data Hadoop and Hadoop ecosystem components such as MapReduce, Sqoop, Flume, Kafka, Pig, Hive, Spark, Storm, HBase, Airflow, Oozie, and ZooKeeper
- Worked extensively on installing and configuring Hadoop ecosystem components such as Hive, Sqoop, HBase, ZooKeeper, and Flume
- Good knowledge of writing Spark applications in Python (PySpark)
- Experienced in data extraction, transformation, and loading using Hive, Sqoop, and HBase
- Hands-on experience in designing and developing Spark applications in Scala to compare the performance of Spark with Hive and SQL/Oracle
- Implemented ETL operations on Big Data platforms
- Hands-on experience in streaming data ingestion and processing
- Experienced in designing time-driven and data-driven automated workflows using Airflow (an illustrative sketch follows this summary)
- Used Spark MLlib for predictive intelligence and customer segmentation within Spark Streaming applications
- Skilled in choosing the most efficient Hadoop ecosystem components and providing effective solutions to Big Data problems
- Well versed with Design and Architecture principles to implement Big Data Systems.
- Experience in configuring ZooKeeper to coordinate servers in clusters and maintain data consistency
- Skilled in data migration from relational databases to the Hadoop platform using Sqoop
- Experienced in migrating ETL transformations to Pig Latin scripts, including transformations and join operations
- Good understanding of MPP databases such as HP Vertica and Impala.
- Hands on experience in configuring and working with Flume to load the data from multiple sources directly into HDFS
- Expertise in relational databases such as Oracle, MySQL, and SQL Server
- Strong analytical and problem-solving skills; highly motivated team player with good communication and interpersonal skills
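As a rough illustration of the time-driven Airflow workflows mentioned above, the sketch below defines a minimal daily DAG with an ingest-then-transform dependency. The DAG name, schedule, and task callables are hypothetical placeholders, not code from any of the projects that follow.

```python
# Minimal Airflow sketch: a daily, time-driven DAG with an illustrative
# ingest -> transform dependency. Task IDs and callables are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_data(**context):
    # Placeholder: pull the day's extract (e.g., via Sqoop or an API) into HDFS.
    print(f"Ingesting data for {context['ds']}")


def transform_data(**context):
    # Placeholder: run the Spark/Hive transformation for that day's partition.
    print(f"Transforming data for {context['ds']}")


with DAG(
    dag_id="daily_ingest_pipeline",          # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",              # time-driven trigger
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_data)
    transform = PythonOperator(task_id="transform", python_callable=transform_data)

    ingest >> transform                      # downstream task runs only after ingest succeeds
```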
PROFESSIONAL EXPERIENCE
Confidential, Atlanta, GA
Sr. Data Engineer
Responsibilities:
- Designed and deployed Hadoop clusters and worked with different Big Data analytic tools, including Pig, Hive, HBase, Oozie, Sqoop, Flume, Spark, and Impala
- Ingested data from relational databases into HDFS using Sqoop
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Python
- Implemented Spark with Python and Spark SQL for faster testing and processing of data
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala
- Worked with Spark to create structured data from the pool of unstructured data received.
- Implemented intermediate functionality, such as counting events or records from Flume sinks or Kafka topics, by writing Spark programs in Java and Python
- Documented the requirements, including the available code to be implemented using Spark, Hive, and HDFS
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (see the streaming sketch after this list)
- Experienced in transferring streaming data and data from different sources into HDFS and NoSQL databases
- Created ETL mappings with Talend Integration Suite to pull data from sources, apply transformations, and load data into target databases
- Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark
- Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra per business requirements
- Developed multiple Kafka producers and consumers from scratch per the software requirement specifications (a producer/consumer sketch also follows this list)
- Worked with Apache Spark, which provides a fast and general engine for large-scale data processing, integrated with Python
- Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data
- Streamed data in real time using Spark with Kafka
- Designed and developed data loading strategies and transformations for the business to analyze the datasets
- Processed flat files in various formats and stored them under various partition models in HDFS
- Responsible for building, developing, and testing shared components used across modules
- Extracted appropriate features from datasets to handle bad, null, and partial records using Spark SQL
- Collected data in near real time using Spark Streaming, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS
- Implemented Spark with Scala and Spark SQL for faster testing and processing of data; responsible for managing data from different sources
- Processed multiple data source inputs to the same reducer using GenericWritable and MultipleInputs
- Involved in performance tuning of Spark jobs by caching and taking full advantage of the cluster environment
- Technologies: Hadoop, Hive, Flume, MapReduce, Sqoop, Kafka, Spark, YARN, Cassandra, Oozie, Shell scripting, Scala, Maven, MySQL
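The Kafka-to-Parquet flow described above could look roughly like the PySpark sketch below, which uses Structured Streaming (rather than the RDD-based DStream API) to read a topic, decode the JSON payload into a DataFrame, and persist it to HDFS as Parquet. The broker addresses, topic name, schema, and HDFS paths are assumptions, and the job needs the spark-sql-kafka connector on the classpath.

```python
# Minimal PySpark sketch of a Kafka -> Spark Streaming -> Parquet-on-HDFS flow.
# Topic name, HDFS paths, and the event schema are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("kafka_to_parquet").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", LongType()),
])

# Read the real-time feed from a Kafka topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
       .option("subscribe", "events")                        # placeholder topic
       .load())

# Kafka values arrive as bytes; decode the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Persist the stream to HDFS in Parquet format.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events/parquet")          # placeholder path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```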
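Likewise, a simple Kafka producer and consumer pair could be sketched as below using the third-party kafka-python package; the broker address, topic, and payload are placeholders, and the real implementations followed the project's requirement specifications.

```python
# Hedged sketch of a Kafka producer and consumer; names and payloads are made up.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: serialize dicts to JSON and publish to a topic.
producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"event_id": "123", "event_type": "click"})
producer.flush()

# Consumer: read the same topic from the beginning and deserialize each record.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="broker1:9092",
    group_id="event-counter",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # e.g., count records per event_type here
```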
Confidential, Atlanta, GA
Data Engineer
Responsibilities:
- Analyzed and prepared data, identifying patterns in datasets by applying historical models; collaborated with senior data scientists to understand the data
- Designed and developed analytics, machine learning models, and visualizations on AWS and the Cloudera platform that drove performance and provided insights, from prototyping through production deployment, product recommendation, and allocation planning; used Amazon SageMaker to set up models and Git to track changes when deploying them
- Managed the activities required to maintain a data & process governance structure
- Utilized data from external providers to properly classify MDM data components (customer category, sub-category, etc.)
- Built DataStage jobs utilizing various stages such as Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample, Surrogate Key, Column Generator, and Row Generator
- Performed data manipulation, data preparation, normalization, and predictive modeling; improved efficiency and accuracy by evaluating models in Python and R
- Used the MDM suite of tools to design, develop, optimize, and support MDM for various domains; led the performance tuning of existing processes
- Collaborated to implement A/B testing for an e-commerce website and created effective calls-to-action that improved CTR and conversion rate by 10%
- Focused on customer segmentation based on machine learning and statistical modeling, including building predictive models and generating data products to support segmentation
- Helped individual teams set up their repositories in Bitbucket and maintain their code, and helped them set up jobs that make use of the CI/CD environment
- Built mid-size to large clusters on AWS using multiple Amazon EC2 instances to enable use cases based on Cloudera or Hortonworks Hadoop distributions
- Conceptualized and executed the migration of SAP upstream systems to AWS to reduce TCO
- Built a price elasticity model for various bundled product and service offerings
- Served as a Data Platform Solution Architect at Confidential Corporation, with a strong consulting and presales background and hands-on experience in Big Data, data science, cloud, and enterprise applications
- Performed data analysis on analytic data in Teradata, Hadoop (Hive/Oozie/Sqoop), and AWS using SQL, Teradata SQL Assistant, Python, Apache Spark, and SQL Workbench
- Managed the partnership with Data Integration Hubs around data modeling, data mappings, data validation, hierarchy management and overall security. Provide continuous enhancement and review of MDM matching rules, data quality and validation processes
- Developed a pricing model for various bundled product and service offerings to optimize and predict gross margin
- Enabled CloudWatch and Ganglia on large AWS-based clusters for effective cluster monitoring
- Developed, optimized, and supported Informatica PowerCenter components to provide MDM ETL processes for data extraction, transformation, and loading
- Developed a predictive causal model using annual failure rate and standard cost basis for the new bundled service offering
- Utilized Scala, HBase, Kafka, Spark Streaming, MLlib, and R, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction
- Partnered with the sales and marketing teams and collaborated with a cross-functional team to frame and answer important data questions, prototyping and experimenting with ML/DL algorithms and integrating them into production systems for different business needs
- Worked on multiple datasets containing two billion values of structured and unstructured data about web application usage and online customer surveys
- Good hands-on experience with the Amazon Redshift platform
- Segmented customers based on demographics using K-means clustering (see the clustering sketch after this list)
- Explored different regression and ensemble models in machine learning to perform forecasting
- Presented dashboards with insights to senior management using Power BI
- Used classification techniques, including Random Forest and Logistic Regression, to quantify the likelihood of each user making a referral (a classification sketch also follows this list)
- Skilled in Advanced Regression Modeling, Time Series Analysis, Statistical Testing, Correlation, Multivariate Analysis, Forecasting, Model Building, Business Intelligence tools and application of Statistical Concepts.
- Able to perform web search and data collection, web data mining, database extraction from websites, data entry, and data processing, with visualization in R, Power BI, and Kibana
- Good knowledge and understanding of data mining techniques such as classification, clustering, regression, and random forests, implemented in Python, R, and Java
- Experience on advanced SAS programming techniques, such as PROC APPEND, PROC DATASETS, and PROC TRANSPOSE.
- Implemented deep learning models and numerical computation with dataflow graphs using TensorFlow in Python
- Proficient knowledge of statistics, mathematics, machine learning, recommendation algorithms and analytics with an excellent understanding of business operations and analytics tools for effective analysis of data.
- Applied boosting methods to the predictive model to improve its efficiency
- Designed and implemented end-to-end systems for data analytics and automation, integrating custom visualization tools using R, Tableau, and Power BI
- Collaborated with project managers and business owners to understand their organizational processes and help design the necessary reports
- Technologies: MS SQL Server, R/RStudio, SQL Enterprise Manager, Python, Redshift, MS Excel, Power BI, Tableau, T-SQL, ETL, MS Access, XML, MS Office, Outlook, SAS E-Miner
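A minimal sketch of the demographic K-means segmentation mentioned above, using pandas and scikit-learn; the input file, feature columns, and cluster count are illustrative assumptions rather than the project's actual values.

```python
# Illustrative customer segmentation with K-means on hypothetical demographics.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical demographic features for each customer.
customers = pd.read_csv("customers.csv")              # placeholder input file
features = customers[["age", "income", "household_size"]]

# Standardize so each feature contributes comparably to the distance metric.
scaled = StandardScaler().fit_transform(features)

# Fit K-means; k=4 is an arbitrary illustrative choice (in practice chosen
# via the elbow method or silhouette scores).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(scaled)

# Inspect the average profile of each segment.
print(customers.groupby("segment")[["age", "income"]].mean())
```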
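Similarly, the referral-likelihood classification could be approximated with scikit-learn's LogisticRegression and RandomForestClassifier as below; the dataset, feature columns, and label are hypothetical.

```python
# Hedged sketch: quantify referral likelihood with two classifiers.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

users = pd.read_csv("users.csv")                          # placeholder input
X = users[["sessions", "tenure_days", "purchases"]]       # assumed features
y = users["referred"]                                     # assumed binary label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=42)):
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]             # likelihood of referring
    print(type(model).__name__, "AUC:", round(roc_auc_score(y_test, proba), 3))
```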
Confidential, Warren, NJ
Hadoop Developer
Responsibilities:
- Worked extensively on Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, Spark, and MapReduce programming
- Converted the existing relational database model to the Hadoop ecosystem
- Worked with Linux systems and RDBMS databases on a regular basis to ingest data using Sqoop
- Strong experience working with Elastic MapReduce (EMR) and setting up environments on Amazon EC2 instances
- Able to spin up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates
- Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS
- Managed and reviewed Hadoop and HBase log files.
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive.
- Designed and implemented HIVE queries and functions for evaluation, filtering, loading and storing of data.
- Analyzed table data and implemented compression techniques such as Teradata multi-value compression
- Involved in ETL process from design, development, testing and migration to production environments.
- Involved in writing the ETL test scripts and guided the testing team in executing the test scripts.
- Involved in performance tuning of the ETL process by addressing various performance issues at the extraction and transformation stages.
- Provided guidance to the development team working on PySpark as the ETL platform
- Wrote Hadoop MapReduce jobs to run on Amazon EMR clusters and created workflows for running the jobs
- Generated analytics reports on probe data by writing Elastic MapReduce (EMR) jobs to run on an Amazon VPC cluster and using Amazon Data Pipeline for automation
- Good understanding of Teradata MPP architecture, including partitioning and primary indexes
- Good knowledge in Teradata Unity, Teradata Data Mover, OS PDE Kernel internals, Backup and Recovery
- Created HBase tables to store variable data formats of data coming from different portfolios.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing
- Created partitions and buckets based on state for further processing using bucket-based Hive joins (see the partitioning sketch after this list)
- Involved in transforming data from Mainframe tables to HDFS, and HBase tables using Sqoop
- Creating Hive tables and working on them using HiveQL.
- Created and truncated HBase tables in Hue and took backups of submitter IDs
- Developed data pipeline using Kafka to store data into HDFS.
- Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager
- Involved in review of functional and non-functional requirements.
- Developed ETL Process using HIVE and HBASE.
- Worked as an ETL Architect/ETL Technical Lead and provided the ETL framework Solution for the Delta process, Hierarchy Build and XML generation.
- Prepared the Technical Specification document for the ETL job development.
- Responsible to manage data coming from different sources.
- Loaded CDRs into the Hadoop cluster from relational databases using Sqoop and from other sources using Flume
- Installed and configured Apache Hadoop, Hive and Pig environment.
- Technologies: Hadoop, HDFS, Pig, Hive, Flume, Sqoop, Oozie, Python, Shell scripting, SQL, Talend, Spark, HBase, Elasticsearch, Linux (Ubuntu), Kafka
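The state-based partition/bucket layout mentioned above could be sketched as follows. The project created the tables in HiveQL; this is an approximate PySpark DataFrameWriter equivalent, and the table, path, and column names (and the assumption that a matching bucketed customers table exists) are hypothetical.

```python
# Rough PySpark sketch of partitioning by state and bucketing by customer_id
# so that joins on customer_id can avoid a full shuffle (analogous to Hive's
# CLUSTERED BY buckets used for bucket-based joins).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition_bucket_demo")
         .enableHiveSupport()
         .getOrCreate())

orders = spark.read.parquet("hdfs:///staging/orders")     # placeholder source

# Partition on state and bucket on customer_id; bucketBy requires saveAsTable.
(orders.write
 .partitionBy("state")
 .bucketBy(32, "customer_id")
 .sortBy("customer_id")
 .mode("overwrite")
 .saveAsTable("orders_bucketed"))

# If a second table is bucketed the same way, the join avoids shuffling it.
customers = spark.table("customers_bucketed")              # assumed to exist
joined = spark.table("orders_bucketed").join(customers, "customer_id")
joined.explain()   # the plan should show no Exchange before the sort-merge join
```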
Confidential, Chicago, IL
Data Engineer
Responsibilities:
- Gathered data and business requirements from end users and management; designed and built data solutions to migrate existing source data from the data warehouse to the Atlas Data Lake (Big Data)
- Performed all Technical Data Quality (TDQ) validations, including header/footer validation, record count, data lineage, data profiling, checksum, empty-file, duplicate, delimiter, threshold, and DC validations for all data sources (a validation sketch follows this list)
- Analyzed huge volumes of data and devised simple and complex Hive and SQL scripts to validate data flow in various applications; performed Cognos report validation and used MHUB for validating data profiling and data lineage
- Devised PL/SQL statements - Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
- Created reports using Tableau, Power BI, and Cognos to perform data validation
- Set up a governance process around Tableau dashboard processes
- Worked with senior management to plan, define, and clarify Tableau dashboard goals, objectives, and requirements
- Created Tableau dashboards and stories as needed using Tableau Desktop and Tableau Server, with stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, etc. via the Show Me functionality
- Responsible for daily communications to management and internal organizations regarding status of all assigned projects and tasks.
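The sketch below illustrates, in PySpark, a few of the TDQ checks listed above (record count against a control total, duplicate keys, empty files, and a null-ratio threshold). File paths, the key column, the control count, and the thresholds are made-up placeholders; the actual checks were driven by Hive/SQL scripts.

```python
# Hedged sketch of a handful of technical data quality (TDQ) checks.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("tdq_checks").getOrCreate()

df = spark.read.option("header", True).csv("hdfs:///landing/customers.csv")

failures = []

# Empty-file / record-count check against an expected control count.
expected_count = 125_000                     # would come from the header/trailer record
actual_count = df.count()
if actual_count == 0:
    failures.append("empty file")
elif actual_count != expected_count:
    failures.append(f"record count mismatch: {actual_count} != {expected_count}")

# Duplicate check on the business key.
dup_keys = df.groupBy("customer_id").count().filter(col("count") > 1).count()
if dup_keys > 0:
    failures.append(f"{dup_keys} duplicate customer_id values")

# Threshold check: no more than 1% nulls in a critical column.
null_ratio = df.filter(col("email").isNull()).count() / max(actual_count, 1)
if null_ratio > 0.01:
    failures.append(f"null ratio {null_ratio:.2%} exceeds 1% threshold")

print("TDQ failures:", failures or "none")
```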
Confidential, Raritan, NJ
Business Intelligence Developer
Responsibilities:
- Provided technical support for all 64 key IMS applications and fully managed the support teams
- As an SME for more than 30 applications, provided several innovative solutions and value-adds to the existing applications
- Designed user interfaces; performed framework development and customization
- Involved in several key enhancements and maintenance activities, completed within agreed timescales and meeting the clients' expectations
- Managed the fast-paced growth in the number of processes and the expansion of existing business processes, earning the client's appreciation
- Responsible for all project deliverables and ensured adherence to the agreed SLAs
- Other KRAs were Enterprise Change Management, Project Planning, Resource Planning and Mobilization, Managing Deliveries and resolving dependencies, Client Relationship Management and internal interfacing for project requirements.
- Delivered projects within the estimated budget and timescales; involved in the preparation of SLAs
- Technologies: SQL Server 2008/2014, MSBI (SSIS, SSAS & SSRS), .NET, VB