Sr Data Engineer Resume
SUMMARY
- Over 7 years of experience as a Data Engineer with highly proficient knowledge of Data Analysis.
- Experienced in Big Data work with Hadoop, Spark, PySpark, Hive, HDFS, and NoSQL platforms.
- Good understanding of and exposure to Python programming.
- Experience transferring data from AWS S3 to AWS Redshift using Informatica.
- Hands-on experience with Amazon Web Services, including provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, RDS, and others.
- Hands-on experience with Google Cloud Platform (GCP) services such as BigQuery, GCS buckets, and Cloud Functions.
- Experienced in Informatica ILM (Information Lifecycle Management) and its tools.
- Efficient in all phases of the development lifecycle, including Data Cleansing, Data Conversion, Data Profiling, Data Mapping, Performance Tuning, and System Testing.
- Good knowledge of SQL queries and of creating database objects such as stored procedures, triggers, packages, and functions using SQL and PL/SQL to implement business logic.
- Supported ad-hoc business requests, developed stored procedures and triggers, and extensively used Quest tools such as TOAD.
- Excellent working experience in Scrum/Agile framework and Waterfall project execution methodologies.
- Extensive experience working with business users/SMEs as well as senior management.
- Experience in the Big Data Hadoop ecosystem for ingestion, storage, querying, processing, and analysis of big data.
- Experience developing MapReduce programs with Apache Hadoop to analyze big data as per requirements.
- Experienced in technical consulting and end-to-end delivery covering architecture, data modeling, data governance, and solution design, development, and implementation.
- Experience installing, configuring, supporting, and managing the Cloudera Hadoop platform, including CDH4 and CDH5 clusters.
- Strong experience and knowledge of NoSQL databases such as MongoDB and Cassandra.
- Experience migrating data between HDFS/Hive and relational database systems using Sqoop, according to client requirements.
- Proficient in Normalization/De-normalization techniques in relational/dimensional database environments and have done normalizations up to 3NF.
- Good understanding of Ralph Kimball (dimensional) and Bill Inmon (relational) modeling methodologies.
- Strong experience using MS Excel and MS Access to load and analyze data based on business needs.
- Experienced in Data Analysis; proficient in gathering business requirements and handling requirements management.
TECHNICAL SKILLS
Tools: Informatica PowerCenter 10.4.1, Informatica PowerExchange, IICS, Informatica Data Quality 10.2.2, Informatica BDM, Talend
Languages: HTML, C, UNIX Shell Scripting, Python 3.7, PowerShell, XML.
Database: Oracle, SQL Server, Teradata, MySQL, PostgreSQL, Hadoop, ANSI SQL, PL/SQL, T-SQL.
Reporting Tools: Tableau, Power BI, SSRS
Big Data Tools: HDFS, Hive, Spark, Airflow, Oozie, Sqoop, Kafka, Scala
Cloud: AWS EMR, S3, Lambda, SageMaker, Azure Data Center, GCP, BigQuery.
Other Tools: SQL*Loader, SQL*Plus, Query Analyzer, PuTTY, MS Office, MS Word.
PROFESSIONAL EXPERIENCE
Sr Data Engineer
Confidential
Responsibilities:
- Worked on the large-scale Hadoop Yarn cluster for distributed data processing and analysis using Spark, Hive, and HBase.
- Migrated an existing on-premises application to AWS, using EC2 and S3 for processing and storing small data sets; experienced in maintaining the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDD, and performed transformations and actions on RDDs.
- Assisted ETL developers with specifications, documentation, and development of data migration mappings and transformations for Data Warehouse loading.
- Deployed Data Lake cluster with Hortonworks Ambari on AWS using EC2 and S3.
- Evaluated performance and conducted performance management planning.
- Assisted in the development of standards and procedures affecting database management, design, and maintenance.
- Provided support and participated with the development team during the analysis, design, coding, development, and testing processes.
- Designed and developed ETL processes; established coding standards, performed code reviews, and automated platform health checks.
- Performed data analysis, developed database designs, shared and gained consensus on the designs, and then developed the stored procedures to fulfill them.
- Managed end-user accounts and accessibility; provided technical expertise to end users creating complex queries and reports.
- Worked on unloading tables from the Cassandra database and writing them to the data lake.
- Worked on Databricks for daily report generation and data analysis between data lake files and Snowflake tables.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
- Worked on Snowflake, creating tables and other analytical objects.
- Involved in creating a data lake by extracting customer data from various data sources into HDFS, including data from Excel, databases, and server log data.
- Worked with cloud-based technologies such as Redshift, S3, and EC2; extracted data from Oracle Financials and the Redshift database; and created Glue jobs in AWS to load incremental data into the S3 staging and persistence areas.
- Used Apache Solr to index the documents and used free-form queries to search the indexed documents.
- Experience converting existing AWS infrastructure to serverless architecture (AWS Lambda, Kinesis), deployed via Terraform and AWS CloudFormation templates.
- Developed Spark applications by using Scala and Python and implemented Apache Spark for data processing from various streaming sources.
- Developed Spark applications using Scala and Python to perform analytics on data stored in HDFS.
- Served as the team's Spark expert and performance optimizer.
- Worked with the Cassandra DBA team to write Cassandra configuration properties, optimizing and managing Cassandra through version upgrades.
- Worked on AWS services such as AWS SNS to send automated emails and messages using Boto3 after the nightly run.
- Created AWS Lambda functions, provisioned EC2 instances in the AWS environment, implemented security groups, and administered Amazon VPCs.
- Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Developed a preprocessing job using Spark DataFrames to transform JSON documents into flat files (see the PySpark sketch after this list).
- Worked with the GridGain in-memory file system to share state across Spark jobs and applications when working with files instead of RDDs.
- Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS using Scala.
- Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the transformed data back into HDFS.
- Implemented Spark using Python and Spark SQL for faster testing and processing of data.
- Made extensive use of the Cloud Shell SDK in GCP to configure and deploy services such as Dataproc, Cloud Storage, and BigQuery.
- Worked on Databricks for data analysis, generated daily attribute-level comparison results, and sent alerts to Slack.
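A minimal PySpark sketch of the S3-to-flat-file preprocessing described above. The bucket paths and field names (event_id, payload.customer_id, payload.amount, event_ts) are hypothetical and used purely for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical bucket paths, for illustration only.
SOURCE = "s3a://example-raw-bucket/events/"        # nested JSON documents
TARGET = "s3a://example-curated-bucket/events_flat/"

spark = SparkSession.builder.appName("json-flatten").getOrCreate()

# Read nested JSON documents from S3 into a DataFrame.
events = spark.read.json(SOURCE)

# Flatten the nested payload struct and derive a date column for partitioning.
flat = (
    events
    .select(
        "event_id",
        F.col("payload.customer_id").alias("customer_id"),
        F.col("payload.amount").alias("amount"),
        F.to_date("event_ts").alias("event_date"),
    )
    .filter(F.col("amount").isNotNull())
)

# Write the flattened records as CSV "flat files", partitioned by date.
(flat.write
     .mode("overwrite")
     .partitionBy("event_date")
     .csv(TARGET, header=True))
```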
Sr Data Engineer
Confidential
Responsibilities:
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleansing and conforming tasks. Migrated data from on-premises systems to AWS storage buckets.
- Performed data analysis and developed analytic solutions; investigated data to discover correlations and trends and explained them.
- Worked with Data Engineers and Data Architects to define back-end requirements for data products (aggregations, materialized views, tables, visualizations).
- Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression, and k-means.
- Implemented statistical and deep learning models (logistic regression, XGBoost, random forest, SVM, RNN, CNN).
- Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversion, and data cleansing.
- Performed data analysis and statistical analysis and generated reports, listings, and graphs using SAS tools: SAS/GRAPH, SAS/SQL, SAS/CONNECT, and SAS/ACCESS.
- Followed Agile methodology, including test-driven development and pair programming.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and built AWS Lambda functions in Java for event-driven processing.
- Developed a Python script using REST APIs to extract and transfer data from on-premises systems to AWS S3. Implemented a microservices-based cloud architecture using Spring Boot.
- Worked on ingesting data through cleansing and transformation, leveraging AWS Lambda, AWS Glue, and Step Functions.
- Created YAML files for each data source, including Glue table stack creation. Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
- Developed Lambda functions and assigned IAM roles to run Python scripts with various triggers (SQS, EventBridge, SNS).
- Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for automation using crontab. Created a Lambda deployment function and configured it to receive events from S3 buckets (a minimal handler sketch follows this section).
- Built machine learning models including SVM, random forest, and XGBoost to score and identify potential new business cases with Python scikit-learn.
- Experience converting existing AWS infrastructure to serverless architecture (AWS Lambda, Kinesis), deployed via Terraform and AWS CloudFormation templates.
- Worked on Docker container snapshots, attaching to running containers, removing images, managing directory structures, and managing containers.
- Set up GCP firewall rules to control ingress and egress traffic to and from VM instances based on specified configurations, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
- Expertise in using Docker to run and deploy applications in multiple containers with Docker Swarm and Docker Weave.
- Continuously monitored and managed data pipeline (CI/CD) performance alongside applications from a single console in GCP.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Developed complex Talend ETL jobs to migrate data from flat files to databases; pulled files from the mainframe into the Talend execution server using multiple FTP components.
- Developed Talend ESB services and deployed them on ESB servers across different instances.
- Architected and designed serverless application CI/CD using the AWS Serverless Application Model (Lambda).
- Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
- Developed merge scripts to UPSERT data into Snowflake from an ETL source.
Environment: Hadoop, MapReduce, HDFS, Hive, Spring Boot, Cassandra, Swamp, Data Lake, Sqoop, Oozie, SQL, Kafka, Spark, Java, AWS, GitHub, Talend Big Data Integration, Solr, Impala.
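A minimal sketch of an S3-triggered Lambda handler of the kind described above. The staging bucket name and the copy-to-staging action are hypothetical, shown only to illustrate event-driven processing with Boto3:

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Triggered by S3 PutObject events; copies each new object to a staging prefix."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Hypothetical staging bucket, for illustration only.
        s3.copy_object(
            Bucket="example-staging-bucket",
            Key=f"staging/{key}",
            CopySource={"Bucket": bucket, "Key": key},
        )

    return {"statusCode": 200, "body": json.dumps("processed")}
```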
Sr Data Engineer
Confidential
Responsibilities:
- Analyzed, designed, and developed modern data solutions that enable data visualization using Azure PaaS services.
- Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Handled the project lifecycle from analysis to production implementation, with emphasis on data validation, developing logic and transformations per requirements, and creating notebooks to load the data into Delta Lake.
- Created a Databricks Delta Lake process for real-time data loads from various sources (databases, Adobe, and SAP) to the AWS S3 data lake using Python/PySpark code.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from a variety of sources, including Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back using the write-back tool.
- Experienced in writing Hive queries to analyze massive data sets of structured, unstructured, and semi-structured data.
- Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
- Used advanced Hive techniques such as bucketing, partitioning, and optimizing self-joins to boost performance on structured data (a partitioning sketch follows this section).
- Designed, tested, and deployed the CI/CD framework using Kubernetes and Docker as the runtime environment.
- Experience in moving data between GCP and Azure using Azure Data Factory.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Owned several end-to-end transformations of customer business analytics problems, breaking them down into a mix of appropriate hardware (IaaS/PaaS/Hybrid) and software (MapReduce) paradigms, and then applying machine learning algorithms to extract useful information from data lakes.
- On both Cloud and On-Prem hardware, sized and engineered scalable Big Data landscapes with central Hadoop processing platforms and associated technologies including ETL tools and NoSQL databases to support end-to-end business use cases.
- Conducted numerous Big Data training and demonstration sessions for various government and private-sector customers to ramp them up on Azure Big Data solutions.
- Developed a number of technology demonstrators using the Confidential Edison Arduino shield, Azure Event Hubs, and Stream Analytics, and integrated them with Power BI and Azure ML to demonstrate the capabilities of Azure Stream Analytics.
Environment: Azure Data Factory (V2), Azure Databricks, Python 2.0, SSIS, Azure SQL, Azure Data Lake, Azure Blob Storage, Spark 2.0, Hive.
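A minimal PySpark sketch of the partitioning and bucketing approach mentioned above for speeding up Hive queries and self-joins. The database, table, and column names are hypothetical:

```python
from pyspark.sql import SparkSession

# Hypothetical database, table, and column names, for illustration only.
spark = (SparkSession.builder
         .appName("hive-partitioning")
         .enableHiveSupport()
         .getOrCreate())

orders = spark.table("raw_db.orders")

# Write the data partitioned by date and bucketed by customer_id so that
# self-joins and point lookups on customer_id scan far fewer files.
(orders.write
       .mode("overwrite")
       .partitionBy("order_date")
       .bucketBy(16, "customer_id")
       .sortBy("customer_id")
       .saveAsTable("curated_db.orders_bucketed"))
```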
Data Engineer
Confidential
Responsibilities:
- This project focused on customer clustering. Used the ETL DataStage Director to schedule and run jobs, test and debug components, and monitor performance statistics.
- Installed Hadoop, MapReduce, and HDFS on AWS and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
- Architected, designed, and developed business applications and data marts for reporting. Involved in different phases of the development lifecycle, including Analysis, Design, Coding, Unit Testing, Integration Testing, Review, and Release, as per the business requirements.
- Implemented a Spark GraphX application to analyze guest behavior for data science segments.
- Worked on batch processing of data sources using Apache Spark and Elasticsearch.
- Developed Big Data solutions focused on pattern matching and predictive modeling.
- Collaborated with the EDW team on high-level design documents for the extract, transform, validate, and load (ETL) process, including data dictionaries, metadata descriptions, file layouts, and flow diagrams.
- Developed an estimation model for various bundled product and service offerings to optimize and predict gross margin.
- Designed the OLTP system environment and maintained metadata documentation. Used a forward-engineering approach for designing and creating databases for the OLAP model.
- Explored Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Completed a highly immersive Data Science program involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, Unix commands, NoSQL, MongoDB, and Hadoop.
- Worked on migrating Pig scripts and MapReduce programs to the Spark DataFrame API and Spark SQL to improve performance.
- Involved in creating UNIX shell scripts for database connectivity and executing queries in parallel job execution.
- Worked closely with the ETL Developers in designing and planning the ETL requirements for reporting, as well as with business and IT management in the dissemination of project progress updates, risks, and issues.
- Performed scoring and financial forecasting for collection priorities using Python and SAS.
- Handled importing data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
- Worked in AWS environment for development and deployment of custom Hadoop applications.
- Managed existing team members and led the recruiting and onboarding of a larger Data Science team to address analytical knowledge requirements.
- Developed predictive causal model using annual failure rate and standard cost basis for the new bundled services.
- Used classification techniques including random forest and logistic regression to quantify the likelihood of each user making a referral (a scikit-learn sketch follows this section).
- Designed and implemented end-to-end systems for Data Analytics and Automation, integrating custom visualization tools using R, Tableau, and Power BI.
Environment: IBM DataStage, Python, Spark framework, AWS, Redshift, MS Excel, NoSQL, Tableau, T-SQL, ETL, RNN, LSTM, MS Access, XML, MS Office 2007, Outlook, MS SQL Server.
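A minimal scikit-learn sketch of the referral-likelihood classification mentioned above, using synthetic data in place of the real user features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic feature matrix and labels standing in for real user/engagement data.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = (
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200, random_state=0),
)

for model in models:
    model.fit(X_train, y_train)
    # predict_proba[:, 1] is the estimated likelihood that a user makes a referral.
    proba = model.predict_proba(X_test)[:, 1]
    print(type(model).__name__, "AUC:", round(roc_auc_score(y_test, proba), 3))
```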
