Sr. Data Engineer Resume
Oklahoma City, OK
SUMMARY
- Data Engineering professional with solid foundational skills and a proven track record of implementation across a variety of data platforms. Self-motivated, with strong personal accountability in both individual and team settings.
- Over 9 years of experience in the IT industry, including 5 years in Data Engineering, Data Pipeline Design, Development and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler, and 4 years in Data Warehouse/ETL technologies.
- Strong experience in Software Development Life Cycle (SDLC) including Requirements Analysis, Design Specification and Testing as per Cycle in both Waterfall and Agile methodologies.
- Strong experience in writing scripts using Python API, PySpark API and Spark API for analyzing the data.
- Extensively used the Python libraries PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy and Beautiful Soup.
- Experience with Google Cloud components, Google container builders, GCP client libraries and Cloud SDKs.
- Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala, including writing user-defined functions (UDFs) for Hive and Pig in Python.
- Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per the requirements.
- Hands-on experience with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Experience in working with Flume and NiFi for loading log files into Hadoop.
- Experience in working with NoSQL databases like HBase and Cassandra.
- Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.
- Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
- Worked with Cloudera and Hortonworks distributions.
- Expert in developing SSIS/DTS Packages to extract, transform and load (ETL) data into data warehouse/ data marts from heterogeneous sources.
- Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
- Expertise in designing complex mappings, performance tuning, and working with slowly changing dimension tables and fact tables.
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
- Experienced in building automated regression scripts in Python for validating ETL processes between multiple databases like Oracle, SQL Server, Hive, and MongoDB.
- Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
- Expert in building Enterprise Data Warehouses or data warehouse appliances from scratch using both the Kimball and Inmon approaches.
- Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
TECHNICAL SKILLS
Big Data Tools: Hadoop Ecosystem, MapReduce, Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
BI Tools: SSIS, SSRS, SSAS.
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX shell scripting.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Azure, Google Cloud.
Cloud Management: Amazon Web Services (AWS): EC2, EMR, S3, Redshift, Lambda, Athena
Databases: Oracle 12c/11g, Teradata R15/R14.
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
Operating System: Windows, Unix, Sun Solaris
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Oklahoma City, OK
Responsibilities:
- Implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive with cloud architecture.
- Constructed and maintained an appropriate, scalable, and easy-to-use infrastructure with various tools to support the development of actionable reports used in decision-making across the strategy team.
- Worked on AWS, implementing solutions using services such as EC2, S3, RDS, Redshift, and VPC.
- Extracted data from Netezza and AWS Redshift into HDFS using Sqoop.
- Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
- Performed data profiling and transformation on the raw data using Pig, Python, and Java.
- Used Apache Spark for batch processing to source the data.
- Developed business analytics scripts using Hive SQL.
- Imported and exported data into HDFS and Hive using Sqoop.
- Built data ingestion pipelines with NiFi and Kafka for data sources such as LMS, MVP, and RDBMS.
- Analyzed, designed, developed, implemented, and maintained parallel jobs using IBM InfoSphere DataStage.
- Involved in design of dimensional data model - Star schema and Snowflake Schema.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Pulled data from the data lake (HDFS) and shaped it with various RDD transformations.
- Generated DB scripts from the data modeling tool and created physical tables in the database.
- Used DataStage Director to schedule and run jobs, test and debug their components, and monitor performance statistics.
- Worked with Hadoop distributions including Cloudera (CDH3 and CDH4), Hortonworks (HDP), and MapR.
- Led testing efforts in support of projects and programs across a large landscape of technologies (Unix, AngularJS, AWS, Sauce Labs, Cucumber-JVM, MongoDB, GitHub, Bitbucket, SQL, NoSQL databases, APIs, Java, Jenkins).
- Experienced in PX file stages that include Complex Flat File stage, Dataset stage, LookUp File Stage, Sequential file stage.
- Created reusable routines (before/after routines and transform functions) used across the project.
- Experienced in developing parallel jobs using various Development/debug stages (Peek stage, Head & Tail Stage, Row generator stage, Column generator stage, Sample Stage) and processing stages (Aggregator, Change Capture, Change Apply, Filter, Sort & Merge, Funnel, Remove Duplicate Stage).
- Repartitioned job flows based on the best available resource usage in DataStage PX.
- Successfully implemented pipeline and partitioning parallelism techniques and ensured load balancing of data.
- Involved in creating UNIX shell scripts for database connectivity and executing queries in parallel job execution.
- Documented all changes implemented across all systems and components using Confluence and Atlassian Jira.
- Documentation covered technical, infrastructure, and business process changes.
- Post-release documentation also included known issues from the production implementation and deferred defects.
Environment: DataStage, Netezza, E3 Framework, Unix scripting, Hadoop 3.0, HBase 1.2, Hive 2.3, AWS, EC2, S3, RDS, VPC, MySQL, Redshift, Sqoop, GitHub, HDFS, Spark, ETL, YARN, Python, UDF, HQL, NoSQL, Cassandra 3.11, Hortonworks, MapR, NiFi.
Data Engineer
Confidential, Newark, NJ
Responsibilities:
- Performed data profiling to understand behavior across features such as traffic pattern, location, date, and time.
- Integrated with external data sources and APIs to discover interesting trends.
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Worked on Cloudera distribution for Hadoop ecosystem and installed and configured Flume, Hive, Pig, Sqoop and Oozie, Automatic on the Hadoop cluster.
- Explored Spark to improve the performance of, and optimize, existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Good working knowledge of Snowflake and Teradata databases.
- Worked extensively with Spark using Scala on a cluster for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
- Worked with IT security auditors to resolve security vulnerabilities in Linux, UNIX, and Apache.
- Developed Spark jobs using PySpark and Scala to create a generic framework for processing files of various types, such as JSON, TXT, and CSV.
- Delivered zero-defect code for three large projects that involved changes to both the front end (web services) and the back end (Oracle, Snowflake, Teradata).
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
- Performed data cleaning, feature scaling, featurization, and feature engineering, and deployed the data to Amazon S3 and Athena.
- Created data pipelines that migrate data from on-premises servers through S3 and Glue to Athena, for consumption by Amazon QuickSight and Tableau.
- Used the AWS CLI to automate backups of ephemeral data stores to S3 buckets and migrated applications from the internal data center to AWS Athena and Glue.
- Strong experience implementing data warehouse solutions in Amazon Redshift; worked on various projects to migrate data from on-premises databases to Amazon Redshift, RDS, and S3.
- Implemented continuous integration using Git and GitHub from scratch.
- Involved in all stages of the Software Development Life Cycle, primarily in database architecture, logical and physical modeling, Data Warehouse/ETL development using MS SQL Server 2012/2008R2/2008 and Oracle 11g/10g, and ETL solutions/analytics applications development.
- Experience with Unix/Linux systems with scripting experience and building data pipelines.
- Managed and reviewed Hadoop log files to identify issues when jobs fail, and used HUE for UI-based Pig script execution and automatic scheduling.
- Involved in creating a data lake by extracting customer data from various sources into HDFS, including Excel files, databases, and server logs.
- Hands-on experience writing Python and Bash scripts.
- Extensive experience designing and implementing continuous integration, continuous delivery, and continuous deployment with Jenkins.
- Designed and Developed ETL Processes in AWS Glue to migrate Campaign data from external sources like S3, ORC/Parquet/Text Files into AWS Redshift.
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
- Experience on Cloud Databases and Data warehouses (Redshift/RDS).
- Used various Spark transformations and actions to cleanse the input data, and used the Spark application master to monitor Spark jobs and capture their logs.
- Refactored the existing Spark batch process, written in Scala, for different logs.
- Hands-on work developing in SAS, SQL, Python, and Java with Eclipse to extract patterns from very large datasets and transform data into an informational advantage for decision support.
- Performed and assisted in the design, development and testing of predictive analytics models, including large-scale data collection, data organization, text segmentation, categorization, summarization, and topic modeling.
- Performed advanced statistical analysis in SAS and built predictive solutions.
- Implemented Spark in Scala, using DataFrames and the Spark SQL API for faster data processing, and worked on an extensible framework for building high-performance batch and interactive data processing applications on Hive.
- Debugged and maintained automation test scripts in batch mode and implemented a sprint-based plan for automation scripts.
- Developed Oozie workflows to schedule the scripts on a daily basis.
Environment: Hadoop/Big Data Technologies: Spark-Scala, Kafka, Spark Streaming, MLlib, Sqoop, HBase, HDFS, MapReduce, Pig, Hive, Zeppelin (distributions: Databricks, Hortonworks and Cloudera), Cassandra, GitHub, Flume, Oozie, JDBC, Tomcat, Apache, Shell Scripting.
Data Engineer
Confidential - Jacksonville, FL
Responsibilities:
- Designed and created various data visualization dashboards, worksheets, and analytical reports in Tableau to help users identify critical KPIs and facilitate strategic planning, according to end-user requirements.
- Determined operational objectives by studying business functions, gathering information, and evaluating output requirements and formats.
- Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts from enterprise data, automated using Oozie.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
- Worked on cloud deployments using Maven, Docker and Jenkins.
- Coordinated with the Data Science team to design and implement advanced analytical models on the Hadoop cluster over large datasets.
- Developed the PySpark code for AWS Glue jobs and for EMR.
- Involved in developing various ETL jobs to load, extract and map data from flat files and heterogeneous database sources like Oracle, SQL Server, MySQL, and DB2.
- Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, the bq command-line utility, Dataproc, and Stackdriver.
- Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP and BigQuery, and coordinated tasks among the team.
- Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP, working with BigQuery.
- Created DDLs for tables and executed them to create tables in the warehouse for ETL data loads.
- Exported the analyzed and processed data to the RDBMS using Sqoop for visualization and report generation for the BI team.
- Good Knowledge of web services using SOAP and REST protocols.
- Expertise in developing data-driven applications using Python 2.7 and Python 3.0 with the PyCharm and Anaconda Spyder IDEs.
- Wrote technical documents and mentored the global UNIX team.
- Proficient in all aspects of software life cycle like Build/Release/Deploy and specialized in cloud automation through open-source DevOps tools like Jenkins.
- Hands-on experience writing Python and Bash scripts.
- Primarily involved in Data Migration process using Azure by integrating with GitHub repository and Jenkins.
- Dockerized applications by creating Docker images from Dockerfiles.
- Extensive experience designing and implementing continuous integration, continuous delivery, and continuous deployment with Jenkins.
- Performed periodic patch management on Unix/Linux environments.
- Created reports using Tableau and Power BI to help forecast the provider information.
- Used Postman and SoapUI for REST service testing.
- Created SQL scripts to insert/update and delete data in MS SQL database.
- Created database tables, wrote stored procedures to update and clean the old data and also helped the front-end application developers with their queries.
- Extracted data from the legacy system and loaded/integrated into another database through the ETL process.
- Experience with Azure transformation projects and Azure architecture decision making; architected and implemented ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
- Worked with Azure data platform services: Database, Azure Data Lake (ADLS), Azure Data Factory (ADF) V2, Azure SQL Data Warehouse, Azure Service Bus, Azure Key Vault, Azure Analysis Services (AAS), Azure Blob Storage, Azure Search, and Azure App Service.
- Worked with Azure Data Factory (ADF) and Integration Runtime (IR) for file system data ingestion and relational data ingestion.
- Executed SQL queries to test back end data validation of DB2 database tables based on business requirement.
- Recommended controls by identifying problems and writing improved procedures for the portal.
- Designed and created different ETL packages using SSIS and transferred data from Oracle source to MS SQL server destination.
- Performance tuning of SQL queries and stored procedures using SQL profiler and index tuning advisor.
- Created T-SQL queries for schemas, views, stored procedures, triggers and functions for data migration.
- Involved in the project from the planning stage through pushing code to production.
- Scheduled SSAS cube processing from staging database tables using SQL Server Agent.
- Translated technical applications specification into functional and nonfunctional business requirements and created user stories based on those requirements in Rally.
- Created dashboards, worksheets, and storyboards for the stakeholders using Tableau and Excel.
Environment: GCP, BigQuery, GCS, BigQuery Bucket, Cloud Functions, Apache Beam, Cloud Dataflow, GitHub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, VM instances, Cloud SQL, MySQL, Postgres, SQL Server, Salesforce SOQL, Python, Azure Data Factory (ADF), Azure Database Migration Service (DMS), SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), ETL (Extract, Transform, and Load), Business Intelligence (BI), BCP, Scala, Spark, Hive, Sqoop, MS SQL Server 2005/2008.
Data Engineer
Confidential - Boston, MA
Responsibilities:
- Developed data ingestion pipelines using the ETL tool Talend and Bash scripting, with big data technologies including but not limited to Hive, Impala, Spark, and Kafka.
- Developed scalable and secure data pipelines for large datasets.
- Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.
- Supported data quality management by implementing proper data quality checks in data pipelines.
- Delivered data engineering services such as data exploration, ad hoc ingestion, and subject-matter expertise to data scientists using big data technologies.
- Built machine learning models to showcase big data capabilities using PySpark and MLlib.
- Enhanced the data ingestion framework by creating more robust and secure data pipelines.
- Implemented data streaming capability using Kafka and Talend for multiple data sources.
- Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
- Managed the S3 data lake; responsible for maintaining and handling inbound and outbound data requests through the big data platform.
- Working knowledge of cluster security components such as Kerberos, Sentry, and SSL/TLS.
- Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
- Knowledge of implementing JILs to automate jobs in the production cluster.
- Troubleshot bugs in users' analyses (JIRA and IRIS tickets).
- Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
- Worked on analyzing and resolving the production job failures in several scenarios.
- Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
Environment: Spark, Redshift, Python, HDFS, GitHub, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.
Data Warehouse Engineer
Confidential - Jersey City, NJ
Responsibilities:
- Designed, developed and implemented a Big Data data warehouse from scratch using SQL Server 2012
- Created and configured a SQL Server Analysis Services database, which introduced the company to multidimensional tracking of subscribers and special statistical techniques using SQL and Excel
- Developed and implemented custom data validation stored procedures for metadata summarization for the data warehouse tables, for aggregating telephone subscribers switching data, for identifying winning and losing carriers, and for identifying value subscribers
- Identified an issue and developed a procedure to correct it, improving the quality of critical tables by eliminating the possibility of entering duplicate data into the data warehouse.
- Spearheaded a project to implement company standards and establish procedures to ensure a unified data management approach
- Analyzed and compared performance of Redshift, Hadoop, MySQL and SQL Server databases using TPC-H benchmarking and made recommendations to management
- Implemented partitions and index functions on a large dataset using SQL Server 2012, resulting in a 90% performance improvement
- Designed and implemented SQL based tools, stored procedures and functions for daily data volume and aggregation status
- Completed a data warehouse dictionary
- Introduced the company to geographic distance calculation, chi-square tests for identifying affinity among subscribers, and survival analysis for quantifying subscriber events over the customer life cycle, all using analytic SQL
- Created management reports with SSRS as well as SQL and MDX queries
Environment: Redshift, Hadoop, MySQL, TPC-H benchmarking, Unix Scripting, SQL, Maven, Eclipse, TOAD
