
Senior Data Engineer Resume


SUMMARY:

  • Strong experience in the Software Development Life Cycle (SDLC), including requirement analysis, design specification, and testing, in both Agile and Waterfall models
  • Strong experience in writing scripts using SQL, Hive, and the Python and PySpark APIs to analyze data
  • Good hands-on experience in designing and implementing data engineering pipelines and analyzing data using the AWS stack, including AWS EMR, AWS Glue, EC2, AWS Lambda, Athena, Redshift, Sqoop, and Hive
  • Experience in leveraging Python libraries like NumPy and Pandas for data manipulation
  • Expertise working with AWS cloud services like EMR, S3, Redshift, Athena, Glue for big data development
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and Spark SQL to manipulate Data Frames in Scala
  • Experience using PySpark to process XML and JSON data stored in HDFS
  • Worked with Avro and Parquet file formats and used various compression techniques to optimize storage in HDFS
  • Progressive experience in the field of big data technologies such as Hadoop and Hive, plus knowledge of PySpark
  • Experience in writing Spark applications using Python to analyze and process large datasets and run scripts
  • Implemented dynamic partitioning and bucketing in Hive for efficient data access (see the partitioning sketch after this list)
  • Experience with the design of large-scale ETL solutions integrating multiple source systems such as SQL Server, Oracle, and Snowflake databases
  • Experience with the Hierarchical stage to parse data into JSON or XML to create POST and PUT requests to a REST API
  • Proven track record in troubleshooting DataStage jobs and addressing production issues such as performance tuning and enhancements
  • Extensive knowledge of reporting objects such as facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters; worked with Flume and NiFi to load log files into Hadoop
  • Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality
  • Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS
  • Worked on the development of Dashboard reports for the Key Performance Indicators (KPIs) to present to the higher management
  • Experienced in building highly scalable Big-data solutions using Hadoop
  • Extensively used DataStage Director for executing and monitoring jobs and analyzing logs
  • Created S3 buckets, managed bucket policies, and used S3 and Glacier for storage and backup on AWS
  • Experience in designing star and snowflake schemas for data warehouse and ODS architectures
  • Skilled in System Analysis, E-R/Dimensional Data Modelling, Database Design and implementing RDBMS specific features
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments
  • Ability to work effectively in cross-functional team environments; excellent communication and interpersonal skills
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala
  • Experience in complete project life cycle (design, development, testing and implementation) of Client Server and Web applications
  • Excellent programming skills with experience in Java, C, SQL, Shell scripting and Python Programming.
  • Prepare technical design and mapping documents and develop ETL data pipelines for error handling
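
To illustrate the Hive dynamic partitioning and bucketing noted in the summary, here is a minimal PySpark sketch; the table, column, and path names are hypothetical, and the settings shown are the commonly used Hive options rather than anything taken from the resume itself:

    # Minimal PySpark sketch of dynamic partitioning and bucketing in Hive.
    # All table, column, and path names below are hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-partitioning-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Hive settings commonly enabled for dynamic-partition inserts.
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    events = spark.read.parquet("hdfs:///data/raw/events")  # hypothetical source

    # Partition by event_date and bucket by user_id so that date filters and
    # user_id joins read less data.
    (
        events.write
        .mode("overwrite")
        .partitionBy("event_date")
        .bucketBy(32, "user_id")
        .sortBy("user_id")
        .saveAsTable("analytics.events_partitioned")
    )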

TECHNICAL SKILLS:

Programming: Python, Spark, Shell Scripting, SQL, Java, C#, .Net Core

Database: Oracle SQL, MySQL, MS SQL Server, Hive, DynamoDB

Big Data Tools: Hadoop HDFS, Hive, Sqoop, Oozie, Hue

ETL Tools: DataStage, SSIS, Teradata Utilities, BIML

Data Visualization: SSRS, Power BI, PowerShell

Data Analysis: NumPy, Pandas, Matplotlib, Advanced Excel, Statistics

Data Modeling: Visual Studio

Software/Tools: Jupyter, PyCharm, Eclipse, Splunk, MS Office Suite, JIRA, Confluence, SharePoint

Version Control: GitHub, Bitbucket, SVN

Operating Systems: Windows, Linux (Unix)

Cloud Computing: AWS (S3, EC2, Redshift, Elasticsearch, Athena, Glue)

Methodologies: System Development Life Cycle (SDLC), Agile, Scrum, Waterfall

Deployment Tools: Jenkins, Octopus

PROFESSIONAL EXPERIENCE:

Confidential

Senior Data Engineer

Responsibilities:

  • Collaborated with business analysts and SMEs across departments to gather business requirements, identify workable items for further development, and create technical specification documents based on those requirements
  • Collaborated with architects and team leads on code implementation best practices, performing code reviews prior to deployment from the development environments
  • Partnered with ETL developers to ensure that data is well cleaned and the data warehouse is up to date for reporting purposes, using Pig
  • Responsible for building scalable distributed data solutions using Hadoop and assisted in enhancement of current systems to ensure backend scalability and data integrity
  • Transformed, parsed, and loaded student data stored in JSON (JavaScript Object Notation) format from an Amazon S3 bucket, interacting with API client systems, into HDFS using Sqoop
  • Built packages to validate the QTI (Question & Test Interoperability) scoring engine based on scoring rubric provided in Test Form Planner and captured the error log for incorrect Item XML
  • Developed pipelines to extract data from source systems, including a transactional system for online assessments and a legacy system for paper-pencil assessments stored in DynamoDB and RDBMS; transformed the data based on business rules, loaded it into Hive tables, and analyzed it using Hive queries
  • Utilized the Spark Scala API to implement batch processing jobs
  • Fine-tuned Spark applications/jobs to improve efficiency and overall processing time for the pipelines; troubleshot Spark applications for improved error tolerance
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala
  • Developed Spark scripts using Python on AWS EMR for data aggregation, validation, and ad hoc querying (see the PySpark sketch after this list)
  • Optimized Hive tables using techniques such as partitioning and bucketing to provide better performance for HiveQL queries
  • Used the Python Pandas module to read CSV files containing student data and store the data in data structures
  • Created partitioned tables and loaded data using both static-partition and dynamic-partition methods
  • Designed and implemented Data Warehouse life cycle and entity-relationship and multidimensional modeling using Star, Snowflake schema
  • Developed Python/PowerShell scripts to automate the generation of different data file exports (tab-delimited, pipe-delimited, fixed-width, and comma-separated text files) into an Amazon S3 bucket using AWS EC2, with the data then structured and stored in AWS Redshift (see the export sketch after this list)
  • Used the EPPlus .NET and .NET Core libraries to read and write Excel files in the Office Open XML format when exporting data with over 2,000 columns, ensuring backend scalability
  • Developed PowerShell and Python scripts using .NET Core utilities to generate reports by interacting with the REST API of an online reporting server, rendering them as CSV or PDF into district- or school-specific subfolders in a shared directory, an SFTP server, a Reporting Service Point, or an Amazon S3 bucket, based on end-user requirements
  • Developed multiple PowerShell scripts using C#, .NET, and .NET Core utilities to interact with different source systems, including AWS API Gateway clients, Hive, SQL Server, SQL Server Agent jobs, SQL Server Integration Services, JSON files stored in an Amazon S3 bucket, and DynamoDB, to invoke nightly jobs and capture execution logs to notify specific users of failures
  • Developed various dashboard, ad hoc and operational reports daily to monitor scheduled tasks or jobs using Power BI
  • Designed and implemented a variety of SSRS reports such as parameterized, drilldown, ad hoc, and sub-reports using Report Designer and Report Builder based on the requirements
  • Created different types of Tabular reports like Cascaded Parameters, Drill through, Drilldown, sub-reports and Matrix reports and developed some graphical reports using Report Designer
  • Deployed SSRS reports to Report Manager, created linked reports, snapshots, and subscriptions for the reports, and worked on report scheduling
  • Built a CI/CD pipeline with Jenkins using repositories stored in Bitbucket, configured with multiple environments/stages, and published to the database using Octopus
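
A minimal PySpark sketch of the Spark-on-EMR aggregation work described above; the S3 path, field names, and table names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("assessment-aggregation-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read raw assessment responses landed in S3 as JSON (hypothetical path and fields).
    responses = spark.read.json("s3://example-assessment-bucket/raw/responses/")

    # The equivalent of a Hive GROUP BY query expressed as DataFrame transformations.
    scores = (
        responses
        .filter(F.col("status") == "scored")
        .groupBy("student_id", "test_id")
        .agg(
            F.sum("item_score").alias("total_score"),
            F.count("item_id").alias("items_answered"),
        )
    )

    # Persist the results to a Hive table for downstream reporting.
    scores.write.mode("overwrite").saveAsTable("assessments.student_scores")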
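
A minimal Python sketch of the delimited-file export flow described above, using pandas and boto3; the staging file, bucket, and key names are hypothetical:

    import boto3
    import pandas as pd

    # Hypothetical staged extract produced by an upstream warehouse query.
    df = pd.read_parquet("staging/student_export.parquet")

    # Write the same data in whichever delimiter the consumer expects.
    df.to_csv("student_export_pipe.txt", sep="|", index=False)   # pipe delimited
    df.to_csv("student_export_tab.txt", sep="\t", index=False)   # tab delimited

    # Upload the generated files to S3, from where they can be loaded into Redshift.
    s3 = boto3.client("s3")
    for name in ("student_export_pipe.txt", "student_export_tab.txt"):
        s3.upload_file(name, "example-exports-bucket", "exports/" + name)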

Environment: PowerShell Scripting, Python, Spark, PySpark, Redshift, Hive, HDFS, T-SQL, SSRS, BIML, JSON, XML, C#, .Net, .Net Core, AWS, Amazon S3, DynamoDB, Athena, Glue, Power BI, Bitbucket, GitHub, SVN, Jenkins, Octopus, Visual Studio, PowerShell Core.

Confidential - Irving, TX

ETL Developer

Responsibilities:

  • Worked with business users, business analysts, program managers, project managers, system analysts, quality assurance analysts for reviewing business requirements
  • Collected business requirements from users and translated them into technical specifications and design documents for development
  • Designed, developed, and created ETL (Extract, Transform and Load) solutions for loading data from multiple source systems into data warehouse and data extracts using SSIS, Powershell Scripting
  • Developed logging for ETL load at package level and task level to log number of records processed by each package and each task in a package using SSIS
  • Used various control flow components such as For Each loop containers, Sequence containers, Execute SQL Task, Data Flow Task, File System Task, Expression Task, XML Task
  • Created and used extensive transformations within data flow tasks in SSIS such as Lookups, Merge Joins, Derived Column, Data Conversion, Conditional Split, Union, Union All, Multicast, Script component, Row count
  • Automated existing SSIS packages for current and future projects using BIML (business intelligence markup language) and maintained the design pattern for reporting ETL SSIS packages
  • Responsible for deploying, scheduling jobs, alerting, and maintaining SSIS packages
  • Built PowerShell scripts to execute bulk copy program (bcp) commands to process bulk volumes of data into multiple databases
  • Developed Transact-SQL queries, stored procedures, views, user-defined functions, inline table-valued functions, triggers, and error handling
  • Used advanced SQL and dynamic SQL methods, including PIVOT and UNPIVOT functions, dynamic table expressions, and dynamic execution loads through parameters and variables, for generating data files
  • Wrote complex queries, stored procedures to validate the test results based on the requirements and to verify if the test’s XML is defined correctly
  • Loaded various data mart tables, including lookup and dimension tables using Type 2 slowly changing dimensions that hold student demographic data, multiple hash components to create unique hash values for all input records, fact tables that hold scoring data, and aggregate fact tables used mainly for data files and reporting
  • Generated files with very large column lists by building logical data sets against business requirements using T-SQL queries, joining multiple tables with different join clauses, CASE expressions, WHILE loops, CONVERT, etc.
  • Performance-tuned SQL queries using execution plans and SQL Trace to pinpoint time-consuming queries, then tuned them using hints, partitions, and indexes
  • Developed packages using Python and/or PowerShell scripts to automate some of the menial tasks (see the automation sketch after this list)
  • Developed Power Pivot/SSRS (SQL Server Reporting Services) Reports and added logos, pie charts, bar graphs for display purposes as per business needs
  • Designed SSRS reports using parameters, drill down options, filters, sub reports
  • Supported and maintained the existing SSRS reports and responsible for source code fixes
  • Developed internal dashboards for the team using Power BI tools for tracking daily tasks
  • Migrated scripts and packages through environments from development to production using DevOps tools while maintaining versions in the Git repository
  • Used DevOps tools to build Jenkins jobs with build parameters and deployed code into multiple environments (DEV, QA, STG, PRD) for repeatable and reliable deployments
  • Followed the process of updating and maintaining JIRA support tickets, project and sub-task workflows, and communicating with ticket submitters; tracked all loads in JIRA
  • Uploaded detailed documents on process flows, ETL flows, explanations of scripts used for processing data files, and data mart tables to Confluence for knowledge sharing and team building
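
A minimal Python sketch of the kind of routine-task automation mentioned above; the directory layout, delimiter, and expected column count are hypothetical:

    import csv
    import shutil
    from pathlib import Path

    INBOX = Path("exports/inbox")        # where new delimited files land (hypothetical)
    ARCHIVE = Path("exports/archive")    # where validated files are moved (hypothetical)
    EXPECTED_COLUMNS = 42                # hypothetical column count from the file spec

    def is_valid(path: Path) -> bool:
        """Check that every row in a pipe-delimited file has the expected width."""
        with path.open(newline="") as handle:
            reader = csv.reader(handle, delimiter="|")
            return all(len(row) == EXPECTED_COLUMNS for row in reader)

    ARCHIVE.mkdir(parents=True, exist_ok=True)
    for data_file in INBOX.glob("*.txt"):
        if is_valid(data_file):
            shutil.move(str(data_file), str(ARCHIVE / data_file.name))
        else:
            print("Validation failed, leaving in place:", data_file.name)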

Environment: Powershell Scripting, C#, JSON, XML, T-SQL, MS SQL Server, SSIS (SQL Server Integration Services), SSRS (SQL Server Reporting Services), Report Designer, Report Builder, Power BI, Power Pivot, Red Gate Tools, Sentry Execution Planner, Visual Studio, GitHub, Jenkins, JIRA, Confluence.

Confidential

Big Data Engineer

Responsibilities:

  • Wrote Hive queries for data analysis to meet business requirements
  • Migrated an existing on-premises application to AWS
  • Developed Pig Latin scripts to extract data from web server output files and load it into HDFS
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting
  • Created many Spark UDFs and UDAFs in Hive for functions not already available in Hive and Spark SQL
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Implemented performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins
  • Good knowledge on Spark platform parameters like memory, cores, and executors
  • Used the ZooKeeper implementation in the cluster to provide concurrent access to Hive tables with shared and exclusive locking
  • Used Sqoop to import and export data between Oracle/MySQL and HDFS for analysis
  • Migrated existing MapReduce programs to Spark models using Python
  • Migrated data from the data lake (Hive) into an S3 bucket
  • Performed data validation between the data present in the data lake and the S3 bucket
  • Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data
  • Designed batch processing jobs using Apache Spark to increase speeds tenfold compared to MapReduce jobs
  • Used Kafka for real-time data ingestion, creating topics and reading data from them (see the streaming sketch after this list)
  • Involved in converting HQL queries into Spark transformations using Spark RDDs with Python and Scala
  • Moved data from S3 bucket to Snowflake Data Warehouse for generating the reports
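
A minimal PySpark structured-streaming sketch of the Kafka ingestion described above; it assumes the spark-sql-kafka connector is on the classpath, and the broker addresses, topic names, and output paths are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-ingestion-sketch").getOrCreate()

    # Subscribe to the (hypothetical) topics that carry real-time events.
    stream = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
        .option("subscribe", "orders,clickstream")
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers key/value as binary; cast to strings before processing.
    events = stream.select(
        F.col("topic"),
        F.col("key").cast("string"),
        F.col("value").cast("string"),
        F.col("timestamp"),
    )

    # Land the raw events in S3 with checkpointing for recovery.
    query = (
        events.writeStream
        .format("parquet")
        .option("path", "s3a://example-data-lake/raw/kafka_events/")
        .option("checkpointLocation", "s3a://example-data-lake/checkpoints/kafka_events/")
        .start()
    )
    query.awaitTermination()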

Environment: Linux, Apache Hadoop Framework, Snowflake, HDFS, Hive, HBase, AWS (S3, EMR), Scala, Spark, Sqoop, Oracle, MySQL.

Confidential

Data Engineer

Responsibilities:

  • Designed a data story framework and new financial benchmark metrics on costs and departmental expenditures
  • Extracted and validated financial data from external data sources such as Quandl to generate reports for C-level executives; implemented charts, graphs, and revenue distributions through visualization tools for CFOs
  • Saved 500 man-hours by automating data cleaning and validation using Python to improve efficiency
  • Predicted revenue based on R&D and Sales expenses using financial econometric models.
  • Worked with large amounts of structured and unstructured data.
  • Developed data processing pipelines (40-50 GB daily) using the Python API and SQL with Google internal tools (Plx, Pantheon ETL) to create BigQuery datasets (see the BigQuery sketch after this list)
  • Migrated data from SAP and Oracle, created a data mart using Cloud Composer (Airflow), and moved Hadoop jobs to Datapost workflows
  • Implemented data transformations, data workflows, scripts, tables and views using GoogleSQL
  • Developed machine learning models such as random forests using Colab notebooks and TensorFlow
  • Designed data visualizations using Tableau, Data Studio platforms for near real-time analytics and optimized queries for performance improvements.
  • Implemented data quality validations with scripts during data migration and backfilled data for previous months
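
The internal Google tools named above (Plx, Pantheon ETL) are not public, so as an approximation here is a minimal sketch using the public google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")  # hypothetical project

    # Create (or reuse) a dataset for the daily pipeline output.
    client.create_dataset("example-project.finance_marts", exists_ok=True)

    # Run a transformation query and materialize the result into a destination table.
    sql = """
        SELECT department, DATE(expense_ts) AS expense_date, SUM(amount) AS total_spend
        FROM `example-project.raw.expenses`
        GROUP BY department, expense_date
    """
    destination = bigquery.TableReference.from_string(
        "example-project.finance_marts.daily_department_spend"
    )
    job_config = bigquery.QueryJobConfig(
        destination=destination,
        write_disposition="WRITE_TRUNCATE",
    )
    client.query(sql, job_config=job_config).result()  # wait for the job to finish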

Environment: Python, Spark, Hadoop (HDFS, MapReduce), Hive, Pig, Sqoop, Flume, Yarn, HBase, Oozie, Red Hat, AWS Services, Redshift, EMR, UNIX, SQL scripting, Linux shell scripting, Eclipse and Cloudera
