Senior Data Engineer Resume
Orange, CA
SUMMARY
- Over 8 years of IT experience in analysis, design, and development of Big Data solutions in Scala, Spark, Hadoop, Pig, and HDFS environments, with additional experience in Python.
- Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
- Broad exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.
- Hands-on experience architecting legacy data migration projects, such as Teradata to AWS Redshift and on-premises to AWS Cloud migrations.
- Used OpenNLP and Stanford NLP for natural language processing and sentiment analysis.
- Hands-on experience with Azure Cloud Services (PaaS and IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
- Extensive knowledge of developing Spark Streaming jobs with RDDs (Resilient Distributed Datasets) using Scala, PySpark, and Spark-Shell.
- Experienced in data manipulation using Python for loading and extraction, and with Python libraries such as NumPy and Pandas for data analysis and numerical computation.
- Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow and Oozie.
- Well versed in big data on AWS cloud services, i.e., EC2, S3, Glue, Athena, DynamoDB, and Redshift.
- Performed the migration of Hive and MapReduce jobs from on-premises MapR to AWS cloud using EMR.
- Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, RDDs, and Spark MLlib.
- Strong experience in Business and Data Analysis, Data Profiling, Data Migration, Data Integration, Data governance and Metadata Management, Master Data Management and Configuration Management.
- Experience implementing various Big Data engineering, cloud data engineering, data warehouse, data mart, data visualization, reporting, data quality, and data virtualization solutions.
- Experience using R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for data mining, data cleansing, and machine learning.
- Good experience developing web applications that implement the Model-View-Controller (MVC) architecture using the Django, Flask, and Pyramid Python web application frameworks.
- Hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
- Experienced in using Pig scripts to perform transformations, event joins, and filters before storing the data in HDFS.
- Developed AWS CloudFormation templates and set up Auto Scaling for EC2 instances. Strong knowledge of Hive analytical functions and of extending Hive functionality by writing custom UDFs for data cleansing and data mining.
- Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Expertise in writing MapReduce jobs in Python for processing large structured, semi-structured, and unstructured data sets and storing the results in HDFS.
- Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
- Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
- Experience working with Azure services such as Data Lake, Data Lake Analytics, SQL Database, Synapse, Databricks, Data Factory, Logic Apps, and SQL Data Warehouse, and GCP services such as BigQuery, Dataproc, and Pub/Sub.
- Hands-on experience setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
TECHNICAL SKILLS
Operating Systems: UNIX, Linux, Windows, and Mac
Programming Languages: Python and PL/SQL
Databases: Oracle 10/11g, MySQL, SQL Server, and PostgreSQL
Tools: IntelliJ, PyCharm, FileZilla, PL/SQL Developer, and TOAD
Integration Tools: Jenkins and Web Builder
Version Control: GitHub and SVN
Defect Tracking: JIRA, Git, and Version One
Cloud: VPC creation, EC2 instance, S3 buckets, RDS instances, Amazon command line, CloudFront, IAM, creating security groups, managing S3 object lifecycle, creating CDN, implementing S3 security & Encryption, Route53, working with databases and DNS, Using bootstrap scripts, AWS EC2, Azure.
PROFESSIONAL EXPERIENCE
Confidential, Orange, CA
Senior Data Engineer
Responsibilities:
- Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
- Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
- Worked on big data on AWS cloud services, i.e., EC2, S3, EMR, and DynamoDB.
- Implemented an end-to-end solution for hosting the web application on AWS cloud with integration to S3 buckets.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing the results to the data scientists for further analysis.
- Developed REST APIs using Python with the Flask and Django frameworks and integrated various data sources including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files.
- Analyzed, designed, and built modern, scalable distributed data solutions using Hadoop and AWS cloud services.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
- Migrated SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, Delta Lake, and Azure SQL Data Warehouse; controlled and granted database access; and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Applied clustering, NLP, and neural networks; visualized and presented the results using interactive dashboards.
- Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- Worked on data migration to Hadoop and Hive query optimization.
- Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications, and executed machine learning use cases under Spark ML and MLlib.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Automated the resulting scripts using Apache Airflow and shell scripting to ensure daily execution in production.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool, in both directions.
- Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Developed a data pipeline using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
- Designed and built data pipelines to load data into the GCP platform.
- Profiled structured, unstructured, and semi-structured data across various sources to identify patterns in the data, and implemented data quality metrics using the necessary queries or Python scripts based on the source.
- Used Arcadia to connect with Impala, and designed interactive dashboards and reports for the BI team.
- Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
- Created a new data model that embeds NoSQL submodules within a relational data model by applying Hybrid data modelling concepts.
- Involved in using Sqoop for importing and exporting data between RDBMS and HDFS.
- Used the Cloud Shell SDK in GCP to configure the Dataproc, Cloud Storage, and BigQuery services.
- Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.
- Used MongoDB to store data in JSON format, and developed and tested many dashboard features using Python, Bootstrap, CSS, and JavaScript.
- Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
- Deployed the code to EMR via CI/CD using Jenkins.
- Extensively used Code Cloud for code check-ins and checkouts for version control.
Environment: Apache Spark, Scala, GCP, Azure, Airflow, NLP, Azure Data Factory, Azure Data Lake, NoSQL, AWS, Impala, Avro, Parquet, HDFS, UNIX, Python, PySpark, RDBMS, Arcadia, Hive, CI/CD, Sqoop, PowerShell.
Confidential, Chicago, IL
Data Engineer
Responsibilities:
- Reviewed the data modelers' requirement specifications with the client and provided comments to the manager about the ETL logic.
- Analyzed business needs and documented functional and technical specifications based on user requirements, with extensive interaction with business users.
- Worked with the project manager to determine needs, and applied or customized existing technology to meet those needs.
- Developed generic graphs and plans using Ab Initio, with active transformations such as Transform, Partition, De-partition, and Sort components and different lookup functions.
- Developed a PSET generation Unix shell script that takes its values from a list file.
- Used checkpoints and phasing to avoid deadlocks and to re-run the graph in case of failure.
- Performed transformations of source data with transform components such as Join, Match Sorted, Reformat, Dedup Sorted, and Filter by Expression.
- Made wide use of lookup files when getting data from multiple sources where the size of the data is limited.
- Actively involved in creating the data warehouse using data warehousing concepts such as CDC and SCD types.
- Created tables in the Hive database using DDL statements.
- Developed Python code for different tasks, dependencies, SLA watchers, and time sensors for each job for workflow management and automation using Airflow.
- Analyzed, designed, and built modern data solutions using Azure PaaS services to support visualization of data.
- Assessed the current production state of the application and determined the impact of new implementations on existing business processes.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool, in both directions.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Responsible for estimating the cluster size, monitoring, and troubleshooting of the Spark Databricks cluster.
- Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Wrote UDFs in Scala and PySpark to meet specific business requirements.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
- Developed SQL scripts for automation purposes.
- Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).
Environment: Azure, Ab Initio (GDE, Co>Op), Oracle Exadata, Hadoop Hive, Unix, HP ALM, Jenkins, SQL, Python IDLE, MS Visio, Airflow, Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW).
Confidential
Data Engineer
Responsibilities:
- Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications, and executed machine learning use cases under Spark ML and MLlib.
- Identified areas of improvement in the existing business by analyzing vast amounts of data using machine learning techniques.
- Interpreted business problems and provided solutions using data analysis, data mining, optimization tools, machine learning techniques, and statistics.
- Designed and developed NLP models for sentiment analysis.
- Led discussions with users to gather business processes requirements and data requirements to develop a variety of Conceptual, Logical and Physical Data Models.
- Expert in Business Intelligence and Data Visualization tools: Tableau, MicroStrategy.
- Worked on machine learning on large datasets using Spark and MapReduce.
- Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, and utilized optimization techniques, linear regression, K-means clustering, Naive Bayes, and other approaches.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Extracted, transformed, and loaded data sources to generate CSV data files using Python and SQL queries.
- Stored and retrieved data from data-warehouses using Amazon Redshift.
- Worked on Teradata SQL queries, Teradata indexes, and utilities such as MLoad, TPump, FastLoad, and FastExport.
- Used Data Warehousing Concepts like Ralph Kimball Methodology, Bill Inmon Methodology, OLAP, OLTP, Star Schema, Snowflake Schema, Fact Table and Dimension Table.
- Refined time-series data and validated mathematical models using analytical tools like R and SPSS to reduce forecasting errors.
- Worked on data pre-processing and cleaning the data to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data.
- Created various types of data visualizations using Python and Tableau.
- Consulted on broad areas including data science, spatial econometrics, machine learning, information technology and systems, and economic policy, using R.
- Performed Data mapping between source systems to Target systems, logical data modeling, created class diagrams and ER diagrams and used SQL queries to filter data.
- Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.
- Used R data structures and various techniques to get the data into the right format for analysis; the results were later used by other internal applications to calculate thresholds.
- Maintained conceptual, logical, and physical data models along with the corresponding metadata.
- Worked on data migration from an RDBMS to a NoSQL database, providing a complete picture of the data deployed across various data systems.
Environment: Python, Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, VBA, SAS, MATLAB, AWS, SPSS, Cassandra, Oracle, MongoDB, SQL Server 2012, DB2, T-SQL, PL/SQL, XML.
Confidential
Data Engineer
Responsibilities:
- Migrated existing data from Teradata/SQL Server to Hadoop and performed ETL operations on it.
- Wrote shell scripts to extract data from Unix servers into Hadoop HDFS for long-term storage.
- Implemented a microservices architecture using the Spring Boot framework.
- Created messaging queues using RabbitMQ to read data from HDFS for processing. Wrote a Spark application to implement slowly changing dimensions (SCD Type I).
- Created an Oozie workflow to automate the Spark application. Wrote a Pig script to load processed data from HDFS into MongoDB.
- Used MongoDB to store processed products and commodities data, which can be passed downstream to the web application (Green Box/Zoltar).
- Deployed the Spark application and Java web services on Pivotal Cloud Foundry.
- Followed Agile methodology, including test-driven development and pair programming.
- Strong communication and analytical skills and a demonstrated ability to handle multiple tasks as well as work independently or in a team.
- Gathered business requirements from the business partners and subject matter experts.
- Installed and configured a Hadoop cluster using Amazon Web Services (AWS) for POC purposes.
- Involved in implementing a nine-node CDH4 Hadoop cluster on Red Hat Linux.
- Imported data from RDBMS to HDFS and Hive using Sqoop on a regular basis.
- Created Hive tables and worked on them using HiveQL, which automatically invokes and runs MapReduce jobs in the backend. Responsible for developing Pig Latin scripts.
- Developed custom MapReduce programs for data analysis and data cleaning using Pig Latin scripts.
- Managed and scheduled batch jobs on a Hadoop cluster using Oozie.
- Managed and reviewed Hadoop log files.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Used Zookeeper for providing coordination services to the cluster.
- Assisted in monitoring Hadoop cluster using Cloudera Manager.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
Environment: Spring, Oracle, Linux, JDBC, Git, HTML, CSS, Angular, NodeJS, Postman, Servlets, Struts, JSP, WebLogic, PL/SQL, Eclipse.
