Sr Data Engineer Resume
Atlanta, GA
SUMMARY
- Around 8 years of professional IT experience working with various legacy database systems as well as Big Data technologies.
- More than 5 years of working experience as a Data Engineer and 2 years of working experience in Data Analysis.
- Good understanding of architecting, designing, and operating large-scale data and analytics solutions on the Snowflake Cloud Data Warehouse.
- Experience in Requirement gathering, System analysis, handling business and technical issues & communicating with both business and technical users.
- Hands-on experience across the complete Software Development Life Cycle (SDLC) using Agile and hybrid methodologies.
- Experience in analyzing data using the Big Data ecosystem, including HDFS, Hive, HBase, ZooKeeper, Pig, Sqoop, and Flume.
- Knowledge and working experience on big data tools like Hadoop, Azure Data Lake, AWS Redshift.
- Good understanding of Apache Airflow.
- Experience in workflow scheduling with Airflow, AWS Data Pipelines, Azure, SSIS, etc.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Good understanding of Big Data Hadoop and YARN architecture along with the various Hadoop daemons such as Job Tracker, Task Tracker, Name Node, Data Node, and Resource/Cluster Manager, as well as Kafka (distributed stream processing).
- Experience in Text Analytics, Data Mining solutions to various business problems and generating data visualizations using SAS and Python.
- Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Good understanding of Spark Architecture including Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors and Tasks.
- Strong experience and knowledge of NoSQL databases such as MongoDB and Cassandra.
- Experience in development and support of Oracle SQL, PL/SQL, and T-SQL queries.
- Experienced in configuring and administering the Hadoop Cluster using major Hadoop Distributions like Apache Hadoop and Cloudera.
- Excellent experience in creating cloud-based solutions and architectures using Amazon Web Services (Amazon EC2, Amazon S3, Amazon RDS, EMR, Glue) and Microsoft Azure.
- Experienced in technical consulting and end-to-end delivery covering architecture, data modeling, data governance, and solution design, development, and implementation.
- Experience in Big Data Hadoop Ecosystem in ingestion, storage, querying, processing and analysis of big data.
- Extensive working experience in agile environment using a CI/CD model.
- Extensive experience working with structured data using Spark SQL, DataFrames, and HiveQL, optimizing queries, and incorporating complex UDFs into business logic (a brief sketch follows this list).
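A minimal, illustrative sketch of the kind of Spark SQL UDF usage referenced above; the session setup, column names, and tier threshold are assumptions made only for this example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

# Hypothetical business rule wrapped as a UDF: bucket customers by monthly usage.
@udf(returnType=StringType())
def usage_tier(monthly_events):
    if monthly_events is None:
        return "unknown"
    return "heavy" if monthly_events > 1000 else "light"

# Illustrative DataFrame standing in for structured source data.
df = spark.createDataFrame([("c1", 1500), ("c2", 40)], ["customer_id", "monthly_events"])
df.withColumn("tier", usage_tier(col("monthly_events"))).show()
```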
TECHNICAL SKILLS
Big Data & Hadoop Ecosystem: Hadoop 3.3/3.0, Hive 2.3, Solr 7.2, Apache Flume 1.8, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hue, Cloudera Manager, StreamSets
Cloud Technologies: AWS (Glue, EC2, S3, EMR, Redshift), MS Azure, Snowflake
Data Modeling Tools: Erwin R9.7, ER Studio v16
Packages: Microsoft Office 2019, Microsoft Project, SAP, Microsoft Visio 2019, SharePoint Portal Server
Other Tools: VSS, SVN, CVS, Docker, CI/CD, Kubernetes
RDBMS / NoSQL Databases: Oracle 19c, Teradata R15, MS SQL Server 2019, Cosmos DB, Cassandra 3.11, HBase 1.2
Testing and Defect Tracking Tools: HP Mercury Quality Center, WinRunner, MS Visio 2016 & Visual SourceSafe
Operating System: Windows 10/8, Unix, Sun Solaris
ETL/Data warehouse Tools: Informatica 9.6, SAP Business Objects XIR3.1/XIR2, Talend, Tableau
Methodologies: RAD, JAD, RUP, UML, System Development Life Cycle (SDLC), Agile, Waterfall Model.
PROFESSIONAL EXPERIENCE
Confidential - Atlanta, GA
Sr Data Engineer
Responsibilities:
- Used Azure Data Factory extensively for ingesting data from disparate source systems.
- Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems.
- Automated jobs using different triggers (Event, Scheduled and Tumbling) in ADF.
- Used Cosmos DB for storing catalog data and for event sourcing in order processing pipelines.
- Designed and developed user-defined functions, stored procedures, and triggers for Cosmos DB.
- Analyzed the data flow from different sources to target to provide the corresponding design Architecture in Azure environment.
- Took initiative and ownership to provide business solutions on time.
- Created High level technical design documents and Application design document as per the requirements and delivered clear, well-communicated and complete design documents.
- Created DA specs and Mapping Data flow and provided the details to developer along with HLDs.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks (a representative PySpark sketch follows this list).
- Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to deliver streaming analytics in Databricks.
- Created, provisioned different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
- Integrated Azure Active Directory authentication into every Cosmos DB request and demoed the feature to stakeholders.
- Improved performance by optimizing computing time to process the streaming data and saved cost to company by optimizing the cluster run time.
- Performed ongoing monitoring, automation, and refinement of data engineering solutions; prepared complex SQL views and stored procedures in Azure SQL DW and Hyperscale.
- Created Linked service to land the data from SFTP location to Azure Data Lake.
- Created numerous pipelines in Azure using Azure Data Factory v2 to get data from disparate source systems, using different Azure activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
- Extensively used SQL Server Import and Export Data tool.
- Created database users, logins, and permissions as part of environment setup.
- Worked with complex SQL, stored procedures, triggers, and packages in large databases across various servers.
- Helped team members resolve technical issues; handled troubleshooting as well as project risk and issue identification and management.
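A minimal sketch, assuming a Databricks notebook with access to an ADLS Gen2 landing path populated by an ADF copy activity, of the kind of PySpark transformation described above; the storage account, container, column names, and target table are illustrative placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("adf-landed-batch").getOrCreate()

# Hypothetical landing path populated by an upstream ADF copy activity.
landing_path = "abfss://landing@examplestorage.dfs.core.windows.net/orders/"

# Read the landed files, derive a date column, and drop incomplete records.
orders = (
    spark.read.format("parquet").load(landing_path)
    .withColumn("order_date", to_date(col("order_ts")))
    .filter(col("status").isNotNull())
)

# Persist the curated result as a managed table for downstream consumption.
orders.write.mode("overwrite").saveAsTable("curated.orders")
```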
Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Functions Apps, Azure Data Lake, Blob Storage, SQL Server, Teradata utilities, Windows Remote Desktop, UNIX shell scripting, Azure PowerShell, Databricks, Python, Erwin data modeling tool, Azure Cosmos DB, Azure Stream Analytics, Azure Event Hub, Azure Machine Learning.
Confidential - Pittsburgh, PA
Sr. Data Engineer
Responsibilities:
- As a Data Engineer, responsible for building scalable distributed data solutions using Hadoop.
- Involved in Agile Development process (Scrum and Sprint planning).
- Handled Hadoop cluster installations in Windows environment.
- Migrated the on-premises environment to GCP (Google Cloud Platform).
- Migrated data warehouses to Snowflake Data warehouse.
- Defined virtual warehouse sizing in Snowflake for different types of workloads.
- Integrated and automated data workloads to Snowflake Warehouse.
- Created tables in Snowflake and loaded and analyzed data using Spark Scala scripts.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL.
- Wrote POCs in Python to analyze the data quickly before applying big data solutions to process it at scale.
- Responsible for data governance rules and standards to maintain the consistency of the business element names in the different data layers.
- Developed data pipeline using Sqoop to ingest cargo data and customer histories into HDFS for analysis.
- Designed ETL using Internal/External tables and store in parquet format for efficiency.
- Involved in porting the existing on-premises Hive code migration to GCP (Google Cloud Platform) BigQuery.
- Involved in migrating an Oracle SQL ETL to run on Google Cloud Platform using Cloud Dataproc and BigQuery, with Cloud Pub/Sub triggering the Apache Airflow jobs.
- Extracted data from data lakes, EDW to relational databases for analyzing and getting more meaningful insights using SQL Queries and PySpark.
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems.
- Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.
- Wrote Sqoop Scripts for importing and exporting data from RDBMS to HDFS.
- Set up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.
- Developed scripts in BigQuery and connected them to reporting tools.
- Designed workflows using Airflow to automate the services developed for change data capture (a simplified DAG sketch follows this list).
- Carried out data transformation and cleansing using SQL queries and PySpark.
- Used Kafka and Spark streaming to ingest real time or near real time data in HDFS.
- Worked on loading BigQuery data into Spark DataFrames for advanced ETL capabilities.
- Built reports for monitoring data loads into GCP and drive reliability at the site level.
- Participated in daily stand-ups, bi-weekly scrums, and PI planning.
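A simplified Airflow DAG sketch of the change-data-capture scheduling pattern referenced above; the DAG id, task callables, and schedule are assumptions for illustration, not the production workflow.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables standing in for the real CDC extract and load steps.
def extract_changes(**context):
    pass  # e.g., pull changed rows since the last watermark

def load_to_bigquery(**context):
    pass  # e.g., write the extracted batch into a BigQuery staging table

with DAG(
    dag_id="cdc_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_changes", python_callable=extract_changes)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    extract >> load
```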
Environment: Hadoop 3.3, GCP, BigQuery, Bigtable, Spark 3.0, Sqoop 1.4.7, ETL, HDFS, Snowflake DW, Oracle SQL, MapReduce, Kafka 2.8, and Agile process.
Confidential - Charlotte, NC
Data Engineer
Responsibilities:
- As a Data Engineer, worked with the analysis and management teams and supported them based on their requirements.
- Architected, Designed and Developed Business applications and Data marts for reporting.
- Interacted with users for verifying User Requirements, managing Change Control Process, updating existing documentation.
- Facilitated and participated in Joint Application Development (JAD) sessions for communicating and managing expectations with the business users and end users.
- Involved in Agile development methodology; active member in scrum meetings.
- Designed and configured Azure Cloud relational servers and databases after analyzing current and future business requirements.
- Interacted with the SMEs (Subject Matter Experts) and stakeholders to get a better understanding of client business processes and gathered and analyzed business requirements.
- Designed, documented and deployed systems pertaining to Enterprise Data Warehouse standards and best practices.
- Installed the Hadoop distribution system.
- Worked on migration of data from On-prem SQL server to Cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
- Designed and developed a data pipeline in the Azure cloud that gets customer data from an API and processes it into Azure SQL DB (a simplified sketch follows this list).
- Worked on moving data from databases for consumption on Databricks.
- Performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Wrote UDFs in Scala and PySpark to meet specific business requirements.
- Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
- Ingested data into HDFS using Sqoop and scheduled incremental loads to HDFS.
- Worked with the Hadoop infrastructure to store data in HDFS and used Hive SQL to migrate the underlying SQL codebase to Azure.
- Worked with Azure Blob and Data Lake storage and loaded data into Azure Synapse Analytics (DW).
- Involved in creating Pipelines and Datasets to load the data onto Data Warehouse.
- Built a data warehouse on the Azure platform using Azure Databricks and Data Factory.
- Implemented a Python-based distributed random forest via Python streaming.
- Worked on collection of large data sets using Python scripting.
- Worked on storing DataFrames into Hive as tables using Python (PySpark).
- Used Python packages for processing JSON and HDFS file formats.
- Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Worked on creating tabular models on Azure analysis services for meeting business reporting requirements.
- Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the Azure Synapse SQL activity.
- Worked on all data management activities on the project data sources, data migration.
- Worked with the data compliance and data governance teams to maintain data models, metadata, and data dictionaries, and to define source fields and their definitions.
- Used Azure reporting services to upload and download reports.
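A minimal sketch, under the assumption of a REST customer API and a pandas/SQLAlchemy load into Azure SQL DB, of the API-to-database pipeline described above; the endpoint URL, connection string, column names, and table name are hypothetical.

```python
import requests
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical API endpoint and Azure SQL connection string; replace with real values.
API_URL = "https://api.example.com/customers"
AZURE_SQL_URL = (
    "mssql+pyodbc://user:password@example-server.database.windows.net/customerdb"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Pull customer records from the upstream API.
records = requests.get(API_URL, timeout=30).json()
df = pd.DataFrame(records)

# Light cleanup before loading: drop duplicates and normalize column names.
df = df.drop_duplicates(subset=["customer_id"]).rename(columns=str.lower)

# Append the batch into an Azure SQL DB staging table.
engine = create_engine(AZURE_SQL_URL)
df.to_sql("stg_customers", engine, if_exists="append", index=False)
```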
Environment: Hadoop 2.8, Spark 2.8, Azure Blob, JSON, ADF, Scala 2.13, PySpark 3.0, Azure SQL DB, Azure Synapse, HDFS, Glue, Hive SQL, Azure Data Warehouse, SQL & Agile methodology
Confidential - Philadelphia, PA
Data Engineer
Responsibilities:
- Worked as Data Engineer to review business requirement and compose source to target data mapping documents.
- Conducted technical orientation sessions using documentation and training materials.
- Gathered the business requirements from the Business Partners and Subject Matter Experts.
- Served as technical expert guiding choices to implement analytical and reporting solutions for client.
- Worked closely with the business, other architecture team members and global project teams to understand, document and design data warehouse processes and needs.
- Implemented Installation and configuration of multi-node cluster on Cloud using Amazon Web Services (AWS) on EC2.
- Developed a reconciliation process to make sure Elasticsearch index document counts match source records.
- Maintained Tableau functional reports based on user requirements.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
- Used Agile (SCRUM) methodologies for Software Development.
- Developed data pipelines to consume data from Enterprise Data Lake (MapR Hadoop distribution - Hive tables/HDFS) for analytics solution.
- Created Hive External tables to stage data and then move the data from Staging to main tables.
- Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data into HDFS.
- Implemented the Big Data solution using Hadoop, hive and Informatica to pull/load the data into the HDFS system.
- Developed incremental and complete load Python processes to ingest data into Elasticsearch from Hive.
- Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
- Created Oozie workflow and Coordinator jobs to kick off the jobs on time for data availability.
- Developed REST services to write data into an Elasticsearch index using Python Flask (a brief sketch follows this list).
- Developed complete end-to-end big data processing in the Hadoop ecosystem.
- Used AWS Cloud with Infrastructure Provisioning / Configuration.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Created dashboards for analyzing POS data using Tableau.
- Developed Tableau visualizations and dashboards using Tableau Desktop.
- Involved in PL/SQL query optimization to reduce the overall run time of stored procedures.
- Continuously tuned Hive queries, including UDFs, for faster execution by employing partitioning and bucketing.
- Implemented partitioning, dynamic partitions and buckets in Hive.
- Deployed RMAN to automate backup and maintaining scripts in recovery catalog.
- Worked on QA of the data and added data sources, snapshots, and caching to the reports.
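A minimal Flask sketch of the Elasticsearch indexing service referenced above, assuming the elasticsearch 8.x Python client; the cluster endpoint, index name, and route are placeholders for illustration.

```python
from flask import Flask, request, jsonify
from elasticsearch import Elasticsearch

app = Flask(__name__)

# Hypothetical cluster endpoint and index name.
es = Elasticsearch("http://localhost:9200")
INDEX_NAME = "cargo_events"

@app.route("/events", methods=["POST"])
def index_event():
    # Index the posted JSON document and return the generated document id.
    doc = request.get_json(force=True)
    result = es.index(index=INDEX_NAME, document=doc)
    return jsonify({"id": result["_id"]}), 201

if __name__ == "__main__":
    app.run(port=5000)
```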
Environment: AWS, Python, Agile, Hive 2.1, Oracle 12c, Tableau, HDFS, PL/SQL, Sqoop 1.2, Flume 1.6
Confidential - MI
Data Engineer
Responsibilities:
- Involved in building database models and views utilizing Python to build an interactive web-based solution.
- Collaborated with other developers to handle complicated issues related to the deployment of Django-based applications.
- Handled development and management of front-end user interfaces with HTML, CSS, jQuery, and JavaScript.
- Modified existing Python/Django modules to deliver specific data formats and added new features.
- Automated a reporting process using Python, Luigi (a library for task workflows and dependencies), and other APIs.
- Wrote Python scripts using libraries such as pandas and NumPy to perform read/write operations on large CSV files, run data aggregations, and compare data by column (a chunked-processing sketch follows this list).
- Experience integrating Python REST API frameworks using Django.
- Working experience in data warehouse ETL design and implementation of complex big data pipelines.
- Used Python, PySpark, shell scripts, Oracle Scheduler, Luigi, Oracle PL/SQL, etc.
- Analyzed the possibility of using Kafka for data streaming and the Python Lite framework for file-based interfaces.
- Developed Python scripts that read customer attributes from the AWS Redshift warehouse, process those records, and upload CSV files to Adobe Marketing Cloud.
- Developed a Marketing Cloud service on Amazon AWS; developed a serverless application using AWS Lambda, S3, Redshift, EMR, and RDS.
- Used Atom for coding, Postman and SQL Workbench for testing and debugging, and PyCharm for development and CloudFormation authoring.
- Integrated AWS DynamoDB with AWS Lambda to store item values and back up DynamoDB streams.
- Experienced in AWS Elastic Beanstalk for app deployments and worked on AWS Lambda with Amazon Kinesis.
- Performed S3 bucket creation, configured bucket and IAM role-based policies, customized JSON templates, and used Glacier for storage and backup on AWS.
- Used Jenkins to deploy code into different environments and to schedule jobs.
- Used bug-tracking tools like Jira and Confluence, and version control with Git and GitLab.
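A minimal pandas sketch of the chunked large-CSV aggregation pattern mentioned above; the file name, key column, and metric are hypothetical and only illustrate the approach.

```python
import pandas as pd

# Hypothetical input file and columns; chunked reads keep memory bounded on large CSVs.
INPUT_CSV = "large_input.csv"

partials = []
for chunk in pd.read_csv(INPUT_CSV, chunksize=100_000):
    # Aggregate within each chunk, then combine the partial results at the end.
    partials.append(chunk.groupby("customer_id")["amount"].sum())

result = pd.concat(partials).groupby(level=0).sum()
result.to_csv("aggregated_by_customer.csv")
```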
Environment: Python 2.7, Django 1.7, Luigi, Windows, Linux, MySQL, SQL, Cassandra, AWS RDS, AWS S3, AWS EC2, Kafka, JSON, RESTful API, MVC architecture, GitLab, Agile, Enterprise Scheduler, Bitvise SSH Client, Scrum, JIRA, Git.
Confidential
Data Analyst
Responsibilities:
- Created and analyzed business requirements to compose functional and implementable technical data solutions.
- Identified integration impact, data flows and data stewardship.
- Involved in data analysis, data discrepancy reduction in the source and target schemas.
- Conducted detailed analysis of data issues, mapped data from source to target, and performed design and data cleansing on the data warehouse.
- Created new data constraints and/or leveraged existing constraints for reuse.
- Created data dictionaries, data mappings for ETL and application support, DFDs, ERDs, mapping documents, metadata, DDL, and DML as required.
- Participated in JAD sessions as the primary modeler in expanding existing databases and developing new ones.
- Identified and analyzed source data coming from SQL server and flat files.
- Evaluated and enhanced current data models to reflect business requirements.
- Generated, wrote, and ran SQL scripts to implement DB changes, including table updates, addition or modification of indexes, and creation of views and stored procedures.
- Consolidated and updated various data models through reverse and forward engineering.
- Restructured logical and physical data models to respond to changing business needs and to assure data integrity using PowerDesigner.
- Created naming convention files and coordinated with DBAs to apply the data model changes.
- Designed ETL specification documents to load the data in target using various transformations according to the business requirements.
- Used Informatica PowerCenter for extracting, transforming, and loading data.
- Performed data profiling, validation, and integration (a brief profiling sketch follows this list).
- Created materialized views to improve performance and tuned the database design.
- Involved in Data migration and Data distribution testing.
- Developed and presented Business Intelligence reports and product demos to the team using SSRS (SQL Server Reporting Services).
- Performed testing, knowledge transfer and mentored other team members.
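A minimal pandas sketch of the kind of column-level data profiling referenced above; the source extract file is a hypothetical placeholder for data pulled from the SQL Server source.

```python
import pandas as pd

# Hypothetical extract from the SQL Server source; profiling checks run per column.
df = pd.read_csv("source_extract.csv")

# Summarize data types, null rates, and cardinality for each column.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_count": df.isna().sum(),
    "null_pct": (df.isna().mean() * 100).round(2),
    "distinct_values": df.nunique(),
})
print(profile)
```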
Environment: PowerDesigner, ETL, Informatica, JAD, SSRS, SQL Server, SQL & SDLC.