Sr. Data Engineer Resume
SUMMARY:
- 8+ years of IT experience across a variety of industries, working with Big Data technologies on Cloudera and Hortonworks distributions. The Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
- Fluent programming experience with Scala, Java, Python, SQL, T-SQL, and R.
- Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
- Adept at configuring and installing Hadoop/Spark Ecosystem Components.
- Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala.
- Worked with Spark to improve the efficiency of existing algorithms using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Experience integrating data from various sources such as Oracle SE2, SQL Server, flat files, and unstructured files into a data warehouse.
- Able to use Sqoop to migrate data between RDBMS, NoSQL databases, and HDFS.
- Experience extracting, transforming, and loading (ETL) data from various sources into data warehouses, as well as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
- Hands-on experience with Hadoop architecture and its components, such as the Hadoop Distributed File System (HDFS), JobTracker, TaskTracker, NameNode, DataNode, and Hadoop MapReduce programming.
- Comprehensive experience in developing simple to complex MapReduce and streaming jobs using Scala and Java for data cleansing, filtering, and aggregation, with detailed knowledge of the MapReduce framework.
- Used IDEs and editors like Eclipse, IntelliJ IDEA, PyCharm, Notepad++, and Visual Studio for development.
- Seasoned practice in M
PROFESSIONAL EXPERIENCE:
Confidential
Sr. Data Engineer
Responsibilities:
- Worked on AWS Data Pipeline to configure data loads from S3 to Redshift.
- Using AWS Redshift, extracted, transformed, and loaded data from various heterogeneous data sources and destinations.
- Created tables and stored procedures, and extracted data using T-SQL for business users whenever required.
- Performed data analysis and design, and created and maintained large, complex logical and physical data models and metadata repositories using ERWIN and MB MDR.
- Wrote a shell script to trigger DataStage jobs.
- Assisted service developers in finding relevant content in the existing models.
- Worked with sources like Access, Excel, CSV, Oracle, and flat files using connectors, tasks, and transformations provided by AWS Data Pipeline.
- Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Developed a PySpark script to encrypt raw data by applying hashing algorithms to client-specified columns.
- Responsible for design, development, and testing of the database; developed stored procedures, views, and triggers.
- Developed a Python-based API (RESTful web service) to track revenue and perform revenue analysis.
- Compiled and validated data from all departments and presented it to the Director of Operations.
- Maintained a KPI calculator sheet within SharePoint.
- Created Tableau reports with complex calculations and worked on ad hoc reporting using Power BI.
- Created a data model that correlates all the metrics and gives a valuable output.
- Tuned SQL queries to bring down run time by working on indexes and execution plans.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
- Developed and validated machine learning models, including Ridge and Lasso regression, for predicting the total amount of trade.
- Boosted the performance of regression models by applying polynomial transformation and feature selection, and used those methods to select stocks.
- Generated reports on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.
- Implemented Copy activity and custom Azure Data Factory pipeline activities.
- Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Migration of on-premise data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Storage (ADLS) using
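For illustration only, the column-hashing step listed in the responsibilities above could look roughly like the sketch below. It uses plain Python with `hashlib` as a stand-in for the PySpark script, which is not shown in this resume; the function name `hash_columns`, the `salt` parameter, and the sample record are hypothetical. Note that hashing is one-way masking rather than reversible encryption.

```python
import hashlib


def hash_columns(rows, columns, salt=""):
    """Return copies of `rows` with the named columns replaced by SHA-256 hex digests.

    `rows` is a list of dicts (one dict per record). Columns that are missing
    or None are left untouched. All names here are illustrative assumptions.
    """
    hashed = []
    for row in rows:
        out = dict(row)  # copy so the caller's data is not modified
        for col in columns:
            value = out.get(col)
            if value is not None:
                digest = hashlib.sha256((salt + str(value)).encode("utf-8"))
                out[col] = digest.hexdigest()
        hashed.append(out)
    return hashed


if __name__ == "__main__":
    # Made-up sample record, not real data
    sample = [{"id": 1, "ssn": "123-45-6789", "city": "Austin"}]
    print(hash_columns(sample, ["ssn"]))
```

In PySpark, the same idea is usually expressed per column with the built-in `pyspark.sql.functions.sha2` function rather than a Python loop.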
Confidential
Sr. Data Engineer
Responsibilities:
- Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap.
- Installed, configured, and maintained data pipelines.
- Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
- Designed the business requirement collection approach based on the project scope and SDLC methodology.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources like Azure SQL, Blob storage, Azure SQL Data Warehouse, and the write-back tool, and in the reverse direction.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
- Worked with Data Governance and Data Quality to design various models and processes.
- Involved in all the steps and scope of the project's data approach to MDM; created a data dictionary and mapping from sources to the target in the MDM data model.
- Experience managing Azure Data Lake (ADLS) and Data Lake Analytics, with an understanding of how to integrate with other Azure services. Knowledge of U-SQL.
- Responsible for working with various teams on a project to develop analytics-based solutions to target customer subscribers specifically.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing.
- Created Lambda jobs and configured roles using the AWS CLI.
- Responsible for wide-ranging data ingestion using Sqoop and HDFS commands.
- Accumulated partitioned data in various storage formats like text, JSON, and Parquet.
- Involved in loading data from the Linux file system to HDFS.
- Stored data files in Google Cloud S3 buckets on a daily basis.
- Used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
- Started working with AWS for storage and handling of a terabyte of data for customer BI reporting tools.
- Built a 12-node Hadoop cluster.
- Installed and configured Hadoop ecosystem components.
- Decommissioned and added nodes in the clusters for maintenance.
- Monitored cluster health by setting up alerts using Nagios and Ganglia.
- Added new users and groups of users per client requests.
- Worked on tickets opened by users regarding various incidents and requests.
- Created a Lambda deployment function and configured it to receive events from S3 buckets.
- Wrote UNIX shell scripts to automate jobs and scheduled cron jobs using crontab.
- Developed various mappings with the collection of all sources, targets, and transformations using Informatica Designer.
- Developed mappings using transformations like Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.
- Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and MLlib libraries.
- Data Integra
Confidential
Big Data Engineer
Responsibilities:
- Migrated data from FS to Snowflake within the organization.
- Imported legacy data from SQL Server and Teradata into Amazon S3.
- Created consumption views on top of metrics to reduce the running time of complex queries.
- Exported data into Snowflake by creating staging tables to load data from different files in Amazon S3.
- Compared data at the leaf level across various databases whenever data transformation or loading took place, and analyzed data quality after such loads to check for data loss or corruption.
- As part of data migration, wrote many SQL scripts to detect data mismatches and worked on loading historical data from Teradata SQL to Snowflake.
- Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e. name, address, SSN, phone number) in Teradata, SQL Server Management Studio, and Snowflake databases for the project.
- Worked on retrieving data from FS to S3 using Spark commands.
- Built S3 buckets, managed bucket policies, and used S3 and Glacier for storage and backup on AWS.
- Created metric tables and end-user views in Snowflake to feed data for Tableau refresh.
- Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
- Using Nebula Metadata, registered business and technical datasets for corresponding SQL scripts.
- Experienced in working with the Spark ecosystem using Spark SQL and Scala queries on different formats like text and CSV files.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Closely involved in scheduling daily and monthly jobs with preconditions and postconditions based on requirements.
- Monitored the daily, weekly, and monthly jobs and provided support in case of failures or issues.
- Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Worked on analyzing Hadoop clusters and different big data analytic tools, including Pig and Hive.
- Working experience with data streaming processes using Kafka, Apache Spark, and Hive.
- Worked with various HDFS file formats like Avro, SequenceFile, NiFi, and JSON, and various compression formats like Snappy and bzip2.
- Used Spark Streaming APIs to perform necessary transformations and actions on data received from Kafka.
Environment: Snowflake, AWS S3, GitHub, ServiceNow, HP Service Manager, EMR, Nebula, Kafka, Jira, Confluence, Shell/Perl Scripting, Python, Avro, Zookeeper, Teradata, SQL Server, Apache Spark, Sqoop.
Confidential
Data & Reporting Analyst
Responsibilities:
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames, Scala, and Python.
- Researched and recommended a suitable technology stack for Hadoop migration, considering the current enterprise architecture.
- Responsible for building scalable distributed data solutions using Hadoop.
- Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
- Developed Spark jobs and Hive jobs to summarize and transform data.
- Experienced in developing Spark scripts for data analysis in both Python and Scala.
- Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts.
- Built on-premise data pipelines using Kafka and Spark for real-time data analysis.
- Created reports in Tableau for visualization of the data sets created, and tested Spark SQL connectors.
- Implemented complex Hive UDFs to execute business logic with Hive queries.
- Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
- Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
- Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive, and then loading data into HDFS.
- Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
- Collected and aggregated large amounts of log data and staged it in HDFS for further analysis.
- Experience in managing and reviewing Hadoop log files.
- Used Sqoop to transfer data between relational databases and Hadoop.
- Worked on HDFS to store and access huge datasets within Hadoop.
- Good hands-on experience with GitHub.
Environment: Cloudera Manager (CDH5), HDFS, Sqoop, Pig, Hive, Tableau, Python, Scala, Oozie, Kafka, Flume, MySQL, Java, Git.
Confidential
Data Analyst
Responsibilities:
- Collaborated with Business Analysts and SMEs across departments to gather business requirements and identify workable items for further development.
- Partnered with ETL developers, using Pig, to ensure that data is well cleaned and the data warehouse is up to date for reporting purposes.
- Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
- Performed simple statistical analysis for data profiling, such as cancel rate, variance, skewness, and kurtosis of trades, and runs of each stock each day, grouped by 1-minute, 5-minute, and 15-minute intervals.
- Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded the results into the data warehouse.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Developed complex SQL statements to extract data, and packaged and encrypted data for delivery to customers.
- Provided business intelligence analysis to decision-makers using an interactive OLAP tool.
- Created T-SQL statements (select, insert, update, delete) and stored procedures.
- Defined data requirements and elements used in XML transactions.
- Created Informatica mappings using various transformations like Joiner, Aggregator, Expression, Filter, and Update Strategy.
- Performed Tableau administration using Tableau admin commands.
- Involved in defining the source-to-target data mappings, business rules, and data definitions.
- Ensured that extracts complied with the Data Quality Center initiatives.
- Performed metrics reporting, data mining, and trend analysis in a helpdesk environment using Access.
- Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple heterogeneous information sources.
- Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
- Developed and validated machine learning models, including Ridge and Lasso regression, for predicting the total amount of trade.
- Boosted the performance of regression models by applying polynomial transformation and feature selection, and used those methods to select stocks.
- Generated reports on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.
- Utilized Agile and Scrum methodology for team and project management.
- Used Git for version control with colleagues.
Environment: Spark, AWS Redshift, Python, Tableau, Informatica, Pandas, Pig, PySpark, SQL Server, T-SQL, XML, Git.
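For illustration only, the moving-average and RSI calculation listed under the Data Analyst role above can be sketched as follows. This is plain Python standing in for the PySpark and Pandas code, which is not shown in this resume. The function names, the default 14-period window, and the simple-average form of RSI (rather than Wilder's smoothed variant) are all assumptions.

```python
def moving_average(prices, window):
    """Simple moving average: one value for each full window of prices."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


def rsi(prices, period=14):
    """Relative Strength Index over the last `period` price changes.

    Uses plain averages of gains and losses (an assumed variant).
    """
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")
    # Differences between consecutive prices in the trailing window
    changes = [b - a for a, b in zip(prices[-period - 1 : -1], prices[-period:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        # No losses in the window (flat prices also land here by this convention)
        return 100.0
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)


if __name__ == "__main__":
    closes = [10, 11, 10, 12, 13, 12, 14]  # made-up closing prices
    print(moving_average(closes, 3))
    print(rsi(closes, period=6))
```

With Pandas, the moving average is typically a one-liner using `Series.rolling(window).mean()`; the loop version here only makes the arithmetic explicit.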
