
Sr. Data Engineer Resume


SUMMARY:

  • A proactive, results-oriented IT professional with 8+ years of experience across various industries working with Big Data technologies, including the Cloudera and Hortonworks distributions. The Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
  • Fluent programming experience with Scala, Java, Python, SQL, T-SQL, and R.
  • Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, and Kafka.
  • Adept at configuring and installing Hadoop/Spark Ecosystem Components.
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala.
  • Worked with Spark to improve the efficiency of existing algorithms using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
  • Experience integrating various data sources, such as Oracle SE2, SQL Server, flat files, and unstructured files, into a data warehouse.
  • Able to precisely use Sqoop to migrate data between RDBMS, NoSQL databases, and HDFS.
  • Experience in extracting, transforming, and loading (ETL) data from various sources into data warehouses, as well as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, PowerBI, and Microsoft SSIS.
  • Hands-on experience with Hadoop architecture and various components such as Hadoop File System HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and Hadoop MapReduce programming.
  • Comprehensive experience in developing simple to complex MapReduce and streaming jobs using Scala and Java for data cleansing, filtering, and data aggregation. Also possess detailed knowledge of the MapReduce framework.
  • Used IDEs like Eclipse, IntelliJ IDE, PyCharm IDE, Notepad++, and Vis

PROFESSIONAL EXPERIENCE:

Confidential

Sr. Data Engineer

Responsibilities:

  • Worked on AWS Data Pipeline to configure data loads from S3 to Redshift. Using AWS Redshift, extracted, transformed, and loaded data from various heterogeneous data sources and destinations such as Access, Excel, CSV, Oracle, and flat files, using connectors, tasks, and transformations provided by AWS Data Pipeline.
  • Created tables and stored procedures, and extracted data using T-SQL for business users whenever required.
  • Performed data analysis and design, and created and maintained large, complex logical and physical data models and metadata repositories using ERWIN and MB MDR.
  • Wrote a shell script to trigger DataStage jobs.
  • Assisted service developers in finding relevant content in the existing models.
  • Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Developed a PySpark script to mask the raw data by applying hashing algorithms to client-specified columns.
  • Responsible for the design, development, and testing of the database; developed stored procedures, views, and triggers.
  • Developed a Python-based API (RESTful web service) to track revenue and perform revenue analysis.
  • Compiled and validated data from all departments and presented it to the Director of Operations. Maintained a KPI calculator sheet within SharePoint.
  • Created Tableau reports with complex calculations and worked on ad-hoc reporting using PowerBI.
  • Created a data model that correlates all the metrics and gives a valuable output.
  • Tuned SQL queries to bring down run time by working on indexes and execution plans.
  • Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
  • Involved in the integration of the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
  • Developed and validated machine learning models, including Ridge and Lasso regression, for predicting the total amount of trade. Boosted the performance of the regression models by applying polynomial transformation and feature selection, and used those methods to select stocks.
  • Generated reports on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.
  • Implemented Copy activity and custom Azure Data Factory pipeline activities.
  • Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
  • Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Migration of on-premise data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake and Stored (ADLS) u
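
The column-hashing step mentioned above can be illustrated with a minimal sketch. It uses plain Python and `hashlib` rather than PySpark so that it runs without a cluster; the function name `hash_columns`, the `salt` parameter, and the sample rows are hypothetical and not taken from the original project. Note that hashing is one-way masking, not reversible encryption. In PySpark, the built-in `pyspark.sql.functions.sha2` function could play a similar role.

```python
import hashlib


def hash_columns(rows, columns, salt=""):
    """Return copies of `rows` with the named columns replaced by SHA-256 hex digests.

    `rows` is a list of dicts; `columns` lists the keys to mask.
    The optional `salt` is an illustrative assumption, not part of the source.
    """
    hashed = []
    for row in rows:
        out = dict(row)  # copy so the input rows are left untouched
        for col in columns:
            value = out.get(col)
            if value is not None:  # leave missing or null values as they are
                digest = hashlib.sha256((salt + str(value)).encode("utf-8"))
                out[col] = digest.hexdigest()
        hashed.append(out)
    return hashed


# Hypothetical sample data: only the "email" column is client-specified.
rows = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": None, "amount": 5.5},
]
masked = hash_columns(rows, ["email"])
```

Equal inputs always produce equal digests, so the masked column can still be used as a join key or for grouping.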

Confidential

Sr. Data Engineer

Responsibilities:

  • Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap.
  • Installed, configured, and maintained data pipelines.
  • Developed the features, scenarios, and step definitions for BDD (Behavior-Driven Development) and TDD (Test-Driven Development) using Cucumber, Gherkin, and Ruby.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and the write-back tool, and in the reverse direction.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Worked with data governance and data quality teams to design various models and processes.
  • Involved in all the steps and scope of the project's data approach to MDM; created a data dictionary and mapping from sources to the target in the MDM data model.
  • Experience managing Azure Data Lakes (ADLs) and Data Lake Analytics, with an understanding of how to integrate with other Azure services. Knowledge of U-SQL.
  • Worked with various teams on a project to develop analytics-based solutions to target customer subscribers specifically.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
  • Responsible for wide-ranging data ingestion using Sqoop and HDFS commands. Accumulated partitioned data in various storage formats like text, JSON, and Parquet. Loaded data from the Linux file system to HDFS.
  • Stored data files in Google Cloud S3 buckets on a daily basis. Used Dataproc and BigQuery to develop and maintain a GCP cloud-based solution.
  • Started working with AWS for storage and handling of a terabyte of data for customer BI reporting tools.
  • Built a 12-node Hadoop cluster. Installed and configured Hadoop ecosystem components. Decommissioned and added nodes in the clusters for maintenance.
  • Monitored cluster health by setting up alerts using Nagios and Ganglia.
  • Added new users and groups of users as per client requests. Worked on tickets opened by users regarding various incidents and requests.
  • Created a Lambda deployment function and configured it to receive events from S3 buckets.
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs using crontab.
  • Developed various mappings with the collection of all sources, targets, and transformations using Informatica Designer. Developed mappings using transformations like Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, and developed and designed POCs using Scala, Spark SQL, and MLlib libraries. Da

Confidential

Big Data Engineer

Responsibilities:

  • Migrated data from FS to Snowflake within the organization.
  • Imported legacy data from SQL Server and Teradata into Amazon S3.
  • Created consumption views on top of metrics to reduce the running time of complex queries.
  • Exported data into Snowflake by creating staging tables to load data of different files from Amazon S3.
  • Compared data at the leaf level across various databases whenever data transformation or data loading took place, and analyzed data quality after these loads to check for data loss or data corruption.
  • As part of data migration, wrote many SQL scripts to detect data mismatches and worked on loading the history data from Teradata SQL to Snowflake.
  • Created metric tables and end-user views in Snowflake to feed data for Tableau refresh.
  • Generated custom SQL to verify the dependencies for the daily, weekly, and monthly jobs.
  • Using Nebula Metadata, registered business and technical datasets for the corresponding SQL scripts.
  • Experienced in working with the Spark ecosystem using Spark SQL and Scala queries on different formats like text and CSV files.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
  • Closely involved in scheduling daily and monthly jobs with preconditions and postconditions based on the requirements. Monitored the daily, weekly, and monthly jobs and provided support in case of failures or issues.
  • Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Worked on analyzing Hadoop clusters and different big data analytic tools including Pig and Hive.
  • Working experience with data streaming processes using Kafka, Apache Spark, and Hive.
  • Worked with various HDFS file formats like Avro, Sequence File, Nifi, and JSON, and various compression formats like Snappy and bzip2.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on the data received from Kafka.

Environment: Snowflake, AWS S3, GitHub, ServiceNow, HP Service Manager, EMR, Nebula, Kafka, Jira, Confluence, Shell/Perl Scripting, Python, AVRO, Zookeeper, Teradata, SQL Server, Apache Spark, Sqoop.

Confidential

Data & Reporting Analyst

Responsibilities:

  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames, Scala, and Python. Researched and recommended a suitable technology stack for Hadoop migration considering the current enterprise architecture. Responsible for building scalable distributed data solutions using Hadoop.
  • Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data. Developed Spark jobs and Hive jobs to summarize and transform data. Experienced in developing Spark scripts for data analysis in both Python and Scala. Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts. Built on-premise data pipelines using Kafka and Spark for real-time data analysis. Created reports in Tableau for visualization of the data sets created and tested Spark SQL connectors.
  • Implemented complex Hive UDFs to execute business logic with Hive queries. Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API. Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
  • Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive, and then loading the data into HDFS. Exported result sets from Hive to MySQL using the Sqoop export tool for further processing. Collected and aggregated large amounts of log data and staged the data in HDFS for further analysis.
  • Experience in managing and reviewing Hadoop log files. Used Sqoop to transfer data between relational databases and Hadoop. Worked on HDFS to store and access huge datasets within Hadoop. Good hands-on experience with GitHub.

Environment: Cloudera Manager (CDH5), HDFS, Sqoop, Pig, Hive, Tableau, Python, Scala, Oozie, Kafka, Flume, MySQL, Java, Git.

Confidential

Data Analyst

Responsibilities:

  • Collaborated with Business Analysts, SMEs across departments to gather business requirements, and identify workable items for further development.
  • Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was up to date for reporting purposes using Pig. Selected and generated data into CSV files and stored them in AWS S3 using AWS EC2, then structured and stored the data in AWS Redshift.
  • Performed simple statistical analysis for data profiling, such as cancel rate, variance, skew, and kurtosis of trades, and runs of each stock every day, grouped by 1-minute, 5-minute, and 15-minute intervals.
  • Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded the results into the data warehouse.
  • Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
  • Involved in the integration of the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Developed complex SQL statements to extract the data, and packaged and encrypted data for delivery to customers.
  • Provided business intelligence analysis to decision-makers using an interactive OLAP tool.
  • Created T-SQL statements (select, insert, update, delete) and stored procedures.
  • Defined data requirements and elements used in XML transactions.
  • Created Informatica mappings using various transformations like Joiner, Aggregate, Expression, Filter, and Update Strategy.
  • Performed Tableau administration using Tableau admin commands.
  • Involved in defining the source-to-target data mappings, business rules, and data definitions.
  • Ensured the compliance of the extracts with the Data Quality Center initiatives.
  • Performed metrics reporting, data mining, and trend analysis in a helpdesk environment using Access.
  • Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple heterogeneous information sources.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
  • Developed and validated machine learning models, including Ridge and Lasso regression, for predicting the total amount of trade. Boosted the performance of the regression models by applying polynomial transformation and feature selection, and used those methods to select stocks.
  • Generated reports on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.
  • Utilized Agile and Scrum methodology for team and project management. Used Git for version control with colleagues.

Environment: Spark, AWS Redshift, Python, Tableau, Informatica, Pandas, Pig, PySpark, SQL Server, T-SQL, XML, Git.
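
As an illustration of the moving average and RSI calculation listed in this role, here is a minimal sketch in plain Python rather than PySpark or Pandas. The resume does not say which RSI variant was used; this sketch assumes the simple-average form (average gain divided by average loss over the last `period` price changes) rather than Wilder's smoothed form. The function names, the default `period`, and the sample prices are hypothetical.

```python
def moving_average(prices, window):
    """Simple moving average: one value per full window of `window` prices."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(prices[i - window:i]) / window
        for i in range(window, len(prices) + 1)
    ]


def rsi(prices, period=14):
    """Relative Strength Index over the last `period` price changes.

    Assumption: uses simple averages of gains and losses, not Wilder's smoothing.
    """
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")
    # Pair each of the last `period` prices with the price just before it.
    changes = [b - a for a, b in zip(prices[-period - 1:-1], prices[-period:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0  # no losses in the window
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)


# Hypothetical closing prices for one stock.
closes = [1, 2, 1, 2, 1]
sma = moving_average(closes, 3)
score = rsi(closes, period=4)
```

With equal average gains and losses, as in the sample prices, the RSI sits at the midpoint of its 0 to 100 range.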
