
Senior Data Engineer Resume


SUMMARY:

  • Strong experience in the Software Development Life Cycle (SDLC), including requirement analysis, design specification, and testing, in both Agile and Waterfall models
  • Strong experience in writing scripts using SQL, Hive, and the Python and PySpark APIs to analyze data
  • Good hands-on experience in designing and implementing data engineering pipelines and analyzing data using the AWS stack, including AWS EMR, AWS Glue, EC2, AWS Lambda, Athena, Redshift, Sqoop, and Hive
  • Experience in leveraging Python libraries like NumPy and Pandas for data manipulation
  • Expertise working with AWS cloud services like EMR, S3, Redshift, Athena, Glue for big data development
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and Spark SQL to manipulate Data Frames in Scala
  • Experience using PySpark to process XML and JSON data stored in HDFS
  • Worked with Avro and Parquet file formats and used various compression techniques to optimize storage in HDFS
  • Progressive experience in the field of big data technologies such as Hadoop and Hive, plus knowledge of PySpark
  • Experience in writing Spark applications using Python to analyze and process large datasets and run scripts
  • Implemented dynamic partitioning and bucketing in Hive for efficient data access (see the partitioning sketch after this list)
  • Experience with the design of large-scale ETL solutions integrating multiple source systems such as SQL Server, Oracle, and Snowflake databases
  • Experience with the Hierarchical stage to parse data into JSON or XML to create POST and PUT requests to a REST API
  • Proven track record in troubleshooting DataStage jobs and addressing production issues such as performance tuning and enhancements
  • Extensive knowledge of reporting objects such as facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters; worked with Flume and NiFi to load log files into Hadoop
  • Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality
  • Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS
  • Worked on the development of Dashboard reports for the Key Performance Indicators (KPIs) to present to the higher management
  • Experienced in building highly scalable Big-data solutions using Hadoop
  • Extensively used DataStage Director for executing and monitoring jobs and analyzing logs
  • Created S3 buckets, managed bucket policies, and used S3 and Glacier for storage and backup on AWS
  • Experience in designing star and snowflake schemas for data warehouse and ODS architectures
  • Skilled in System Analysis, E-R/Dimensional Data Modelling, Database Design and implementing RDBMS specific features
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments
  • Ability to work effectively in cross-functional team environments; excellent communication and interpersonal skills
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala
  • Experience in complete project life cycle (design, development, testing and implementation) of Client Server and Web applications
  • Excellent programming skills with experience in Java, C, SQL, Shell scripting and Python Programming.
  • Prepare technical design and mapping documents and develop ETL data pipelines for error handling
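
To illustrate the Hive dynamic partitioning and bucketing noted in the summary, here is a minimal PySpark sketch; the table, column, and path names are hypothetical, and the settings shown are the commonly used Hive options rather than anything taken from the resume itself:

    # Minimal PySpark sketch of dynamic partitioning and bucketing in Hive.
    # All table, column, and path names below are hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-partitioning-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Hive settings commonly enabled for dynamic-partition inserts.
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    events = spark.read.parquet("hdfs:///data/raw/events")  # hypothetical source

    # Partition by event_date and bucket by user_id so that date filters and
    # user_id joins read less data.
    (
        events.write
        .mode("overwrite")
        .partitionBy("event_date")
        .bucketBy(32, "user_id")
        .sortBy("user_id")
        .saveAsTable("analytics.events_partitioned")
    )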

TECHNICAL SKILLS:

Programming: Python, Spark, Shell Scripting, SQL, Java, C#, .Net Core

Database: Oracle SQL, MySQL, MS SQL Server, Hive, DynamoDB

Big Data Tools: Hadoop HDFS, Hive, Sqoop, Oozie, Hue

ETL Tools: DataStage, SSIS, Teradata Utilities, BIML

Data Visualization: SSRS, Power BI, PowerShell

Data Analysis: NumPy, Pandas, Matplotlib, Advanced Excel, Statistics

Data Modeling: Visual Studio

Software/Tools: Jupyter, PyCharm, Eclipse, Splunk, MS Office Suite, JIRA, Confluence, SharePoint

Version Control: GitHub, Bitbucket, SVN

Operating Systems: Windows, Linux (Unix)

Cloud Computing: AWS (S3, EC2, Redshift, Elasticsearch, Athena, Glue)

Methodologies: System Development Life Cycle (SDLC), Agile, Scrum, Waterfall

Deployment Tools: Jenkins, Octopus

PROFESSIONAL EXPERIENCE:

Confidential

Senior Data Engineer

Responsibilities:

  • Collaborated with business analysts and SMEs across departments to gather business requirements, identify workable items for further development, and create technical specification documents based on those requirements
  • Collaborated with architects and team leads on code implementation best practices, performing code reviews prior to deployment from the development environments
  • Partnered with ETL developers to ensure that data is well cleaned and the data warehouse is up to date for reporting purposes, using Pig
  • Responsible for building scalable distributed data solutions using Hadoop and assisted in enhancement of current systems to ensure backend scalability and data integrity
  • Transformed, parsed, and loaded student data stored in JSON (JavaScript Object Notation) format from an Amazon S3 bucket, interacting with API client systems, into HDFS using Sqoop
  • Built packages to validate the QTI (Question & Test Interoperability) scoring engine based on scoring rubric provided in Test Form Planner and captured the error log for incorrect Item XML
  • Developed pipelines to extract data from source systems, including a transactional system for online assessments and a legacy system for paper-pencil assessments stored in DynamoDB and RDBMS; transformed the data based on business rules, loaded it into Hive tables, and analyzed it using Hive queries
  • Utilized the Spark Scala API to implement batch processing jobs
  • Fine-tuned Spark applications/jobs to improve efficiency and overall processing time for the pipelines; troubleshot Spark applications for improved error tolerance
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala
  • Developed Spark scripts using Python on AWS EMR for data aggregation, validation, and ad hoc querying (see the PySpark sketch after this list)
  • Optimized Hive tables using techniques such as partitioning and bucketing to provide better performance for HiveQL queries
  • Used the Python Pandas module to read CSV files containing student data and store the data in data structures
  • Created partitioned tables and loaded data using both static-partition and dynamic-partition methods
  • Designed and implemented Data Warehouse life cycle and entity-relationship and multidimensional modeling using Star, Snowflake schema
  • Developed Python/PowerShell scripts to automate the generation of different data file exports (tab-delimited, pipe-delimited, fixed-width, and comma-separated text files) into an Amazon S3 bucket using AWS EC2, with the data then structured and stored in AWS Redshift (see the export sketch after this list)
  • Used the EPPlus .NET and .NET Core libraries to read and write Excel files in the Office Open XML format when exporting data with over 2,000 columns, ensuring backend scalability
  • Developed PowerShell and Python scripts using .NET Core utilities to generate reports by interacting with the REST API of an online reporting server, rendering them as CSV or PDF into district- or school-specific subfolders in a shared directory, an SFTP server, a Reporting Service Point, or an Amazon S3 bucket, based on end-user requirements
  • Developed multiple PowerShell scripts using C#, .NET, and .NET Core utilities to interact with different source systems, including AWS API Gateway clients, Hive, SQL Server, SQL Server Agent jobs, SQL Server Integration Services, JSON files stored in an Amazon S3 bucket, and DynamoDB, to invoke nightly jobs and capture execution logs to notify specific users of failures
  • Developed various dashboard, ad hoc and operational reports daily to monitor scheduled tasks or jobs using Power BI
  • Designed and implemented a variety of SSRS reports such as parameterized, drilldown, ad hoc, and sub-reports using Report Designer and Report Builder based on the requirements
  • Created different types of Tabular reports like Cascaded Parameters, Drill through, Drilldown, sub-reports and Matrix reports and developed some graphical reports using Report Designer
  • Deployed SSRS reports to Report Manager, created linked reports, snapshots, and subscriptions for the reports, and worked on report scheduling
  • Built a CI/CD pipeline with Jenkins using repositories stored in Bitbucket, configured with multiple environments/stages, and published to the database using Octopus
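
A minimal PySpark sketch of the Spark-on-EMR aggregation work described above; the S3 path, field names, and table names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("assessment-aggregation-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read raw assessment responses landed in S3 as JSON (hypothetical path and fields).
    responses = spark.read.json("s3://example-assessment-bucket/raw/responses/")

    # The equivalent of a Hive GROUP BY query expressed as DataFrame transformations.
    scores = (
        responses
        .filter(F.col("status") == "scored")
        .groupBy("student_id", "test_id")
        .agg(
            F.sum("item_score").alias("total_score"),
            F.count("item_id").alias("items_answered"),
        )
    )

    # Persist the results to a Hive table for downstream reporting.
    scores.write.mode("overwrite").saveAsTable("assessments.student_scores")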
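
A minimal Python sketch of the delimited-file export flow described above, using pandas and boto3; the staging file, bucket, and key names are hypothetical:

    import boto3
    import pandas as pd

    # Hypothetical staged extract produced by an upstream warehouse query.
    df = pd.read_parquet("staging/student_export.parquet")

    # Write the same data in whichever delimiter the consumer expects.
    df.to_csv("student_export_pipe.txt", sep="|", index=False)   # pipe delimited
    df.to_csv("student_export_tab.txt", sep="\t", index=False)   # tab delimited

    # Upload the generated files to S3, from where they can be loaded into Redshift.
    s3 = boto3.client("s3")
    for name in ("student_export_pipe.txt", "student_export_tab.txt"):
        s3.upload_file(name, "example-exports-bucket", "exports/" + name)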

Environment: PowerShell Scripting, Python, Spark, PySpark, Redshift, Hive, HDFS, T-SQL, SSRS, BIML, JSON, XML, C#, .Net, .Net Core, AWS, Amazon S3, DynamoDB, Athena, Glue, Power BI, Bitbucket, GitHub, SVN, Jenkins, Octopus, Visual Studio, PowerShell Core.

Confidential - Irving, TX

ETL Developer

Responsibilities:

  • Worked with business users, business analysts, program managers, project managers, system analysts, quality assurance analysts for reviewing business requirements
  • Collected business requirements from users and translated them into technical specifications and design documents for development
  • Designed, developed, and created ETL (Extract, Transform and Load) solutions for loading data from multiple source systems into data warehouse and data extracts using SSIS, Powershell Scripting
  • Developed logging for ETL load at package level and task level to log number of records processed by each package and each task in a package using SSIS
  • Used various control flow components such as For Each loop containers, Sequence containers, Execute SQL Task, Data Flow Task, File System Task, Expression Task, XML Task
  • Created and used extensive transformations within data flow tasks in SSIS such as Lookups, Merge Joins, Derived Column, Data Conversion, Conditional Split, Union, Union All, Multicast, Script component, Row count
  • Automated existing SSIS packages for current and future projects using BIML (business intelligence markup language) and maintained the design pattern for reporting ETL SSIS packages
  • Responsible for deploying, scheduling jobs, alerting, and maintaining SSIS packages
  • Built PowerShell scripts to execute bulk copy program (bcp) commands to process bulk volumes of data into multiple databases
  • Developed Transact-SQL queries, stored procedures, views, user-defined functions, inline table-valued functions, triggers, and error handling
  • Used advanced SQL and dynamic SQL methods, including PIVOT and UNPIVOT functions, dynamic table expressions, and dynamic execution loads through parameters and variables, for generating data files
  • Wrote complex queries, stored procedures to validate the test results based on the requirements and to verify if the test’s XML is defined correctly
  • Loaded various data mart tables, including lookup and dimension tables using Type 2 slowly changing dimensions that hold student demographic data, multiple hash components to create unique hash values for all input records, fact tables that hold scoring data, and aggregate fact tables used mainly for data files and reporting
  • Generated files with very large column lists by building logical data sets against business requirements using T-SQL queries, joining multiple tables with different join clauses, CASE expressions, WHILE loops, CONVERT, etc.
  • Performance-tuned SQL queries using execution plans and SQL Trace to pinpoint time-consuming queries, then tuned them using hints, partitions, and indexes
  • Developed packages using Python and/or PowerShell scripts to automate some of the menial tasks (see the automation sketch after this list)
  • Developed Power Pivot/SSRS (SQL Server Reporting Services) Reports and added logos, pie charts, bar graphs for display purposes as per business needs
  • Designed SSRS reports using parameters, drill down options, filters, sub reports
  • Supported and maintained the existing SSRS reports and responsible for source code fixes
  • Developed internal dashboards for the team using Power BI tools for tracking daily tasks
  • Migrated scripts and packages through environments from development to production using DevOps tools while maintaining versions in the Git repository
  • Used DevOps tools to build Jenkins jobs with build parameters and deployed code into multiple environments (DEV, QA, STG, PRD) for repeatable and reliable deployments
  • Followed the process of updating and maintaining JIRA support tickets, project and sub-task workflows, and communicating with ticket submitters; tracked all loads in JIRA
  • Uploaded detailed documents on process flows, ETL flows, explanations of scripts used for processing data files, and data mart tables to Confluence for knowledge sharing and team building
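
A minimal Python sketch of the kind of routine-task automation mentioned above; the directory layout, delimiter, and expected column count are hypothetical:

    import csv
    import shutil
    from pathlib import Path

    INBOX = Path("exports/inbox")        # where new delimited files land (hypothetical)
    ARCHIVE = Path("exports/archive")    # where validated files are moved (hypothetical)
    EXPECTED_COLUMNS = 42                # hypothetical column count from the file spec

    def is_valid(path: Path) -> bool:
        """Check that every row in a pipe-delimited file has the expected width."""
        with path.open(newline="") as handle:
            reader = csv.reader(handle, delimiter="|")
            return all(len(row) == EXPECTED_COLUMNS for row in reader)

    ARCHIVE.mkdir(parents=True, exist_ok=True)
    for data_file in INBOX.glob("*.txt"):
        if is_valid(data_file):
            shutil.move(str(data_file), str(ARCHIVE / data_file.name))
        else:
            print("Validation failed, leaving in place:", data_file.name)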

Environment: Powershell Scripting, C#, JSON, XML, T-SQL, MS SQL Server, SSIS (SQL Server Integration Services), SSRS (SQL Server Reporting Services), Report Designer, Report Builder, Power BI, Power Pivot, Red Gate Tools, Sentry Execution Planner, Visual Studio, GitHub, Jenkins, JIRA, Confluence.

Confidential

Big Data Engineer

Responsibilities:

  • Wrote Hive queries for data analysis to meet business requirements
  • Migrated an existing on-premises application to AWS
  • Developed Pig Latin scripts to extract data from web server output files and load it into HDFS
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting
  • Created many Spark UDFs and UDAFs in Hive for functions not already available in Hive and Spark SQL
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Implemented performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins
  • Good knowledge on Spark platform parameters like memory, cores, and executors
  • Used the ZooKeeper implementation in the cluster to provide concurrent access to Hive tables with shared and exclusive locking
  • Used Sqoop to import and export data between Oracle/MySQL and HDFS for analysis
  • Migrated existing MapReduce programs to Spark models using Python
  • Migrated data from the data lake (Hive) into an S3 bucket
  • Performed data validation between the data present in the data lake and the S3 bucket
  • Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data
  • Designed batch processing jobs using Apache Spark to increase speeds tenfold compared to MapReduce jobs
  • Used Kafka for real-time data ingestion, creating topics and reading data from them (see the streaming sketch after this list)
  • Involved in converting HQL queries into Spark transformations using Spark RDDs with Python and Scala
  • Moved data from S3 bucket to Snowflake Data Warehouse for generating the reports
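
A minimal PySpark structured-streaming sketch of the Kafka ingestion described above; it assumes the spark-sql-kafka connector is on the classpath, and the broker addresses, topic names, and output paths are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-ingestion-sketch").getOrCreate()

    # Subscribe to the (hypothetical) topics that carry real-time events.
    stream = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
        .option("subscribe", "orders,clickstream")
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers key/value as binary; cast to strings before processing.
    events = stream.select(
        F.col("topic"),
        F.col("key").cast("string"),
        F.col("value").cast("string"),
        F.col("timestamp"),
    )

    # Land the raw events in S3 with checkpointing for recovery.
    query = (
        events.writeStream
        .format("parquet")
        .option("path", "s3a://example-data-lake/raw/kafka_events/")
        .option("checkpointLocation", "s3a://example-data-lake/checkpoints/kafka_events/")
        .start()
    )
    query.awaitTermination()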

Environment: Linux, Apache Hadoop Framework, Snowflake, HDFS, Hive, HBase, AWS (S3, EMR), Scala, Spark, Sqoop, Oracle, MySQL.

Confidential

Data Engineer

Responsibilities:

  • Designed a data story framework and new financial benchmark metrics on costs and departmental expenditures
  • Extracted and validated financial data from external data sources such as Quandl to generate reports for C-level executives; implemented charts, graphs, and revenue distributions through visualization tools for CFOs
  • Saved 500 man-hours by automating data cleaning and validation using Python to improve efficiency
  • Predicted revenue based on R&D and Sales expenses using financial econometric models.
  • Worked with large amounts of structured and unstructured data.
  • Developed data processing pipelines (40-50 GB daily) using the Python API and SQL with Google internal tools (Plx, Pantheon ETL) to create BigQuery datasets (see the BigQuery sketch after this list)
  • Migrated data from SAP and Oracle, created a data mart using Cloud Composer (Airflow), and moved Hadoop jobs to Datapost workflows
  • Implemented data transformations, data workflows, scripts, tables and views using GoogleSQL
  • Developed machine learning models such as random forests using Colab notebooks and TensorFlow
  • Designed data visualizations using Tableau, Data Studio platforms for near real-time analytics and optimized queries for performance improvements.
  • Implemented data quality validations with scripts during data migration and backfilled data for previous months
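
The internal Google tools named above (Plx, Pantheon ETL) are not public, so as an approximation here is a minimal sketch using the public google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")  # hypothetical project

    # Create (or reuse) a dataset for the daily pipeline output.
    client.create_dataset("example-project.finance_marts", exists_ok=True)

    # Run a transformation query and materialize the result into a destination table.
    sql = """
        SELECT department, DATE(expense_ts) AS expense_date, SUM(amount) AS total_spend
        FROM `example-project.raw.expenses`
        GROUP BY department, expense_date
    """
    destination = bigquery.TableReference.from_string(
        "example-project.finance_marts.daily_department_spend"
    )
    job_config = bigquery.QueryJobConfig(
        destination=destination,
        write_disposition="WRITE_TRUNCATE",
    )
    client.query(sql, job_config=job_config).result()  # wait for the job to finish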

Environment: Python, Spark, Hadoop (HDFS, MapReduce), Hive, Pig, Sqoop, Flume, Yarn, HBase, Oozie, Red Hat, AWS Services, Redshift, EMR, UNIX, SQL scripting, Linux shell scripting, Eclipse and Cloudera
