
Sr. Data Engineer Resume


Dallas, TX

SUMMARY

  • Over 8 years of experience in Big Data technologies across domains such as Banking.
  • Experience as an Azure Cloud Data Engineer with Microsoft Azure technologies including Azure Data Factory (ADF), Azure Data Lake Storage (ADLS), Azure Synapse Analytics (SQL Data Warehouse), Azure SQL Database, Azure Analysis Services, PolyBase, Azure Cosmos DB (NoSQL), Azure Key Vault, Azure DevOps, and Azure HDInsight, as well as Big Data technologies such as Hadoop, Apache Spark, and Azure Databricks.
  • Big Data - Hadoop (MapReduce & Hive), Spark (SQL, Streaming), Azure Cosmos DB, SQL Data Warehouse, Azure DMS, Azure Data Factory, AWS Redshift, Athena, Lambda, Step Functions, and SQL.
  • Strong knowledge of the Spark ecosystem, including Spark Core, Spark SQL, and Spark Streaming.
  • Extensive hands-on experience with Azure DevOps, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB, Azure HDInsight Big Data technologies (Hadoop and Apache Spark), and Databricks.
  • Experience in designing Azure cloud architecture and implementation plans for hosting complex application workloads on Microsoft Azure. Involved in data analysis and data mapping.
  • Experience in ETL development and as a Data Analyst (data warehouse implementation/development) for Healthcare and Banking. Performed data analysis and maintenance on information stored in MySQL databases.
  • Expert in data analysis, design, development, implementation, and testing using data conversions, extraction, transformation, and loading (ETL) with SQL Server, Oracle, and other relational and non-relational databases.
  • Experience reading continuous JSON data from different source systems into Databricks Delta using Kafka, processing the files with Apache Structured Streaming and PySpark, and writing the output in Parquet format (a minimal sketch of this pattern follows this summary).
  • Good knowledge of Apache Hadoop ecosystem components such as Spark, Cassandra, HDFS, Hive, Sqoop, and Airflow.
  • Experienced in working with different data formats such as CSV and JSON.
  • Experience in implementing data analysis with various analytic tools. Strong in data warehousing concepts, Star schema and Snowflake schema methodologies, and understanding business processes/requirements.
  • Good exposure to Spark SQL, Spark Streaming, and the core Spark API for building data pipelines. Validated data elements using exploratory data analysis (univariate, bivariate, and multivariate analysis).
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, controlling and granting database access, and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Designed multiple DB2 databases with logical partitions and used range partitioning based on month and employee number as the partitioning keys of applicable tables.
  • Expert in implementing Business Rules by creating re-usable transformations like mapplets and mappings.
  • Developed and worked on Machine Learning algorithms for predictive modelling.
  • Performed Exploratory Data Analysis and Data Visualizations using R, and Tableau.
  • Worked on refining the agile process to fit data science project delivery and published MOS.
  • Designed and implemented reports for Healthcare Accreditation. Architected complete, scalable data pipelines and data warehouses for optimized data ingestion.
  • Collaborated with data scientists and architects on several projects to create data mart as per requirement.
  • Responsible for maintaining Data Quality, Data Governance, Data Profiling and versioning of Master Data.
  • Constructed data staging layers and fast real-time systems to feed BI applications and machine learning algorithms.
  • Understanding of AWS and Azure web services, with hands-on project experience. Knowledge of the software development life cycle, agile methodologies, and test-driven development.
  • Developed scalable and reliable data solutions to move data across systems from multiple sources in real time (Kafka) as well as batch mode (Sqoop). Performed data analysis on various source systems.
  • Built an enterprise ingestion Spark framework to ingest data from different sources (S3, Salesforce, Excel, SFTP, FTP, and JDBC databases); the framework is 100% metadata driven with 100% code reuse, letting junior developers concentrate on core business logic rather than Spark/Scala coding.
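
As an illustration of the Kafka streaming ingestion pattern noted above, here is a minimal PySpark Structured Streaming sketch. The broker address, topic name, event schema, and output paths are assumptions made for the example, not project specifics, and the spark-sql-kafka connector is assumed to be available on the classpath.

    # Minimal sketch: continuous JSON from Kafka -> parsed columns -> Parquet files.
    # Broker, topic, schema, and paths below are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("kafka-json-to-parquet").getOrCreate()

    # Hypothetical schema for the incoming JSON events
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("account_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Read the continuous stream of JSON messages from Kafka
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
           .option("subscribe", "transactions")                 # assumed topic
           .load())

    # Parse the JSON payload into typed columns
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
                 .select(from_json(col("json"), event_schema).alias("e"))
                 .select("e.*"))

    # Write out as Parquet (or "delta" on Databricks) with a checkpoint for recovery
    query = (events.writeStream
             .format("parquet")
             .option("path", "/mnt/datalake/events/parquet")        # assumed path
             .option("checkpointLocation", "/mnt/datalake/events/_checkpoints")
             .outputMode("append")
             .start())
    query.awaitTermination()

The checkpoint location is what allows the stream to restart after a failure without reprocessing or dropping records.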

TECHNICAL SKILLS

Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, Yarn, Oozie, Zookeeper, Hue, Ambari Server, Apache Airflow

Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/RStudio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting

NO SQL Databases: Cassandra, HBase, MongoDB, MariaDB

Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML

Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans.

Public Cloud: EC2, IAM, S3, Auto scaling, CloudWatch, Route53, EMR, RedShift

Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall

Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos.

Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza

Operating Systems: All versions of Windows, UNIX, Linux, macOS, Sun Solaris

PROFESSIONAL EXPERIENCE

Sr. Data Engineer

Confidential, Dallas, TX

Responsibilities:

  • Implemented a proof of concept deploying this product in Amazon Web Services (AWS).
  • Developed solutions for import/export of data from Teradata and Oracle to HDFS and S3, and from S3 to Snowflake.
  • Worked on setting up AWS DMS and SNS for data transfer and replication, and used SQL on the new AWS databases such as Redshift and Relational Database Service (RDS).
  • Predominantly used Python, AWS (Amazon Web Services), and MySQL along with NoSQL (MongoDB) databases to meet end requirements and build scalable real-time systems.
  • Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
  • Responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue using Python and PySpark.
  • Used Cosmos DB for storing catalog data and for event sourcing in order processing pipelines.
  • Implemented Real-time streaming of AWS CloudWatch Logs to Splunk using Kinesis Firehose.
  • Designed and developed user defined functions, stored procedures, triggers for Cosmos DB.
  • Parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark, and created Hive DDL on Parquet and Avro data files residing in both HDFS and S3 buckets (see the sketch following this list).
  • Involved in designing the data warehouses and data lakes on regular (Oracle, SQL Server), high-performance (Netezza and Teradata), and big data (Hadoop - MongoDB, Hive, Cassandra, and HBase) databases.
  • Analyzed the data flow from different sources to targets to provide the corresponding design architecture in the Azure environment.
  • Involved in building the ETL architecture and Source to Target mapping to load data into the data warehouse.
  • Created Build definitions and Release definitions for Continuous Integration and Continuous Deployment.
  • Involved in extracting customers' big data from various data sources into Hadoop HDFS. This included data from Excel, ERP systems, databases, and log data from servers.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL) and processed the data in Azure Databricks.
  • Created the Application Interface Document for the downstream team to create a new interface to transfer and receive the files through Azure Data Share.
  • Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
  • Improved performance by optimizing computing time to process the streaming data and saved the company cost by optimizing the cluster run time.
  • Performed ongoing monitoring, automation, and refinement of data engineering solutions; prepared complex SQL views and stored procedures in Azure SQL DW and Hyperscale.
  • Created several Databricks Spark jobs with PySpark to perform several tables to table operations.
  • Extensively used SQL Server Import and Export Data tool.
  • Created database users, logins and permissions to setup.
  • Worked with complex SQL, stored procedures, triggers, and packages in large databases across various servers.
  • Helped team members resolve technical issues; handled troubleshooting, project risk and issue identification, and management.
  • Addressed resource issues and conducted monthly one-on-ones and weekly meetings.
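
For reference, a hedged PySpark sketch of the JSON-to-Parquet conversion and Hive DDL step mentioned above; the bucket names, paths, column list, and the load_date partition column are illustrative assumptions.

    # Sketch: semi-structured JSON on S3 -> Parquet -> Hive external table.
    # Bucket names, paths, columns, and the load_date partition are assumptions.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("json-to-parquet-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Read semi-structured JSON (schema inferred here for brevity)
    df = spark.read.json("s3a://example-bucket/raw/events/")

    # Write columnar Parquet, partitioned by a hypothetical load_date column
    (df.write
       .mode("overwrite")
       .partitionBy("load_date")
       .parquet("s3a://example-bucket/curated/events/"))

    # Register a Hive external table over the Parquet files
    spark.sql("CREATE DATABASE IF NOT EXISTS curated")
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS curated.events (
            event_id STRING,
            account_id STRING,
            amount DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        STORED AS PARQUET
        LOCATION 's3a://example-bucket/curated/events/'
    """)
    spark.sql("MSCK REPAIR TABLE curated.events")   # register existing partitions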

Environment: Hadoop/Big Data Ecosystem (Spark, Kafka, Hive, HDFS, Sqoop, Oozie, Cassandra, MongoDB), AWS (S3, AWS Glue, Redshift, RDS, Lambda, Athena, SNS, SQS, CloudFormation), Blob Storage, SQL Server, Teradata Utilities, Windows Remote Desktop, UNIX Shell Scripting, Azure PowerShell, Databricks, Python, Erwin Data Modelling Tool, Azure Cosmos DB, Azure Stream Analytics, Azure Event Hub, Azure Machine Learning.

Data Engineer

Confidential, New York, NY

Responsibilities:

  • Used custom-developed PySpark scripts to pre-process and transform data and map it to tables inside the CIF (Corporate Information Factory) data warehouse.
  • Developed shell scripts for Sqoop jobs to load periodic incremental imports of structured data from various RDBMSs to S3, and used Kafka to ingest real-time website traffic data to HDFS.
  • Used Azure Data Factory extensively for ingesting data from disparate source systems.
  • Data analysis projects for finance and medical management related to optimizing medical cost ratios and quality scores for self-insured and outsourced services.
  • Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems.
  • Automated jobs using different triggers (Event, Scheduled and Tumbling) in ADF.
  • Created numerous pipelines in Azure using Azure Data Factory v2 to get the data from disparate source systems using different Azure activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.
  • Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to run streaming analytics in Databricks.
  • Integrated Azure Active Directory authentication into every Cosmos DB request sent and demoed the feature to stakeholders.
  • Created and provisioned the different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters. Created PySpark data frames to bring data from DB2 to Amazon S3.
  • Designed and developed a new solution to process near-real-time (NRT) data using Azure Stream Analytics, Azure Event Hub, and Service Bus Queue.
  • Created a Linked Service to land the data from the SFTP location into Azure Data Lake.
  • As part of reverse engineering, discussed issues and complex code to be resolved, translated them into Informatica logic, and prepared ETL design documents.
  • Used Informatica Designer to create complex mappings using different transformations to move data to a Data Warehouse.
  • Performed Exploratory Data Analysis, trying to find trends and clusters.
  • Developed mappings in Informatica to load the data from various sources into the data warehouse using transformations such as Source Qualifier, Expression, Lookup, Aggregator, Update Strategy, and Joiner.
  • Optimized the performance of the mappings through various tests on sources, targets, and transformations.
  • Scheduled sessions to extract, transform, and load data into the warehouse database per business requirements using a scheduling tool.
  • Extracted (flat files, mainframe files), transformed, and loaded data into the landing area and then into the staging area, followed by the integration and semantic layers of the data warehouse (Teradata), using Informatica mappings and complex transformations (Aggregator, Joiner, Lookup, Update Strategy, Source Qualifier, Filter, Router, and Expression). Optimized the existing ETL pipelines by tuning SQL queries and data partitioning techniques.
  • Created independent data marts from the existing data warehouse per application requirements and updated them on a bi-weekly basis.
  • Ran SQL scripts and created indexes and stored procedures for data analysis. Performed data analysis and formatted reports using Microsoft Excel.
  • Decreased the Azure billing by pivoting from Redshift storage to Hive tables for unpaid services, and implemented techniques such as partitioning and bucketing over Hive tables to improve query performance.
  • Automated and validated data pipelines using Apache Airflow (a representative DAG sketch follows this list).
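
As a sketch of the Airflow automation noted in the last item, the DAG below wires an ingest step to a row-count validation step; the DAG id, schedule, task names, and the hard-coded counts are assumptions for illustration only.

    # Hedged Apache Airflow sketch: ingest, then validate row counts.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def validate_row_counts():
        """Placeholder validation: compare source and target row counts."""
        source_count = 1000   # in practice, queried from the source system
        target_count = 1000   # in practice, queried from the warehouse
        if source_count != target_count:
            raise ValueError(f"Row count mismatch: {source_count} vs {target_count}")


    with DAG(
        dag_id="daily_ingest_and_validate",          # assumed DAG id
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        ingest = PythonOperator(
            task_id="ingest",
            python_callable=lambda: print("ingest step (e.g., trigger an ADF pipeline or Spark job)"),
        )
        validate = PythonOperator(
            task_id="validate",
            python_callable=validate_row_counts,
        )
        ingest >> validate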

Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Data Lake, Sqoop, Informatica, Amazon EMR/Redshift, Presto, Apache Airflow, Hive, Azure Function Apps, Databricks

Data Engineer

Confidential, New York, NY

Responsibilities:

  • Designed and developed the real-time matching solution for customer data ingestion.
  • Worked on converting multiple SQL Server and Oracle stored procedures into Hadoop using Spark SQL, Hive, Scala, and Java.
  • Created a production data lake that can handle transactional processing operations using the Hadoop ecosystem.
  • Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR and perform the necessary transformations.
  • Involved in validating and cleansing the data using Pig statements; hands-on experience developing Pig macros.
  • Analyzed a dataset of 14M records and reduced it to 1.3M by filtering out rows with duplicate customer IDs and removing outliers using boxplots and univariate methods (see the pandas sketch after this list).
  • Worked with Hadoop Big Data integration with ETL, performing data extract, load, and transformation processes for ERP data.
  • Performed extensive exploratory data analysis using Teradata to improve the quality of the dataset and created data visualizations using Tableau.
  • Experienced in Python libraries such as Pandas and NumPy (one- and two-dimensional arrays).
  • Experienced in using the PyTorch library and implementing natural language processing.
  • Developed data visualizations in Tableau to display the day-to-day accuracy of the model against newly incoming data.
  • Worked with R for statistical modeling, such as Bayesian modeling and hypothesis testing with the dplyr and BAS packages, and visualized testing results in R to deliver business insight.
  • Validated models with confusion matrices, ROC curves, and AUC, and developed diagnostic tables and graphs that demonstrated how a model can improve the efficiency of the selection process.
  • Presented and reported business insights via SSRS and Tableau dashboards combined with different diagrams.
  • Used Jira for project management and Git for version control to build the program.
  • Reported and displayed the analysis results in the web browser with HTML and JavaScript.
  • Engaged constructively with project teams, supported the project's goals, and delivered insights for the team and client.
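
A small pandas sketch of the dedup-and-outlier-filter step described above; the column names and sample values are made up for the example.

    # Illustrative pandas sketch: drop duplicate customer IDs, then apply the
    # 1.5 * IQR (boxplot) rule for univariate outlier removal. Data is made up.
    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 1, 2, 3, 4, 5],
        "monthly_spend": [120.0, 120.0, 95.0, 5000.0, 110.0, 87.0],
    })

    # Drop rows with duplicate customer IDs, keeping the first occurrence
    deduped = df.drop_duplicates(subset="customer_id", keep="first")

    # Univariate outlier removal using the 1.5 * IQR rule
    q1 = deduped["monthly_spend"].quantile(0.25)
    q3 = deduped["monthly_spend"].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    cleaned = deduped[deduped["monthly_spend"].between(lower, upper)]

    print(f"{len(df)} rows -> {len(deduped)} after dedup -> {len(cleaned)} after outlier removal")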

Environment: Hadoop, Spark SQL, Hive, Scala, Java, MS Access, SQL Server, Pig, PySpark, Tableau, Excel

Data Analyst

Confidential

Responsibilities:

  • Worked with data analysts on requirements gathering, business analysis, and project coordination.
  • Performed migration of reports (Crystal Reports and Excel) from one domain to another using the Import/Export Wizard.
  • Wrote complex SQL, PL/SQL, procedures, functions, and packages to validate data and support the testing process.
  • Used advanced Excel formulas and functions such as Pivot Tables, LOOKUP, IF with AND, INDEX, and MATCH for data cleaning.
  • Redesigned some of the previous models by adding new entities and attributes per the business requirements.
  • Reviewed stored procedures for reports and wrote test queries against the source system (SQL Server) to match the results with the actual report against the data mart (Oracle).
  • Involved in data profiling for multiple sources and answered complex business questions by providing data to business users.
  • Performed SQL validation to verify the integrity of data extracts and record counts in the database tables (a minimal validation sketch follows this list).
  • Created Schema objects like Indexes, Views, and Sequences, triggers, grants, roles, Snapshots.
  • Effectively used the data blending feature in Tableau to connect different databases such as Oracle and MS SQL Server.
  • Transferred data with SAS/ACCESS from MS Access and Oracle databases into SAS data sets on Windows and UNIX.
  • Provided guidance and insight on data visualization and dashboard design best practices in Tableau.
  • Performed verification, validation, and transformations on the input data (text files) before loading it into the target database.
  • Executed data extraction programs/data profiling and analyzed data for accuracy and quality.
  • Wrote complex SQL queries to validate the data against different kinds of reports generated by Business Objects.
  • Documented designs and the Transformation Rules engine for use by all the designers across the project.
  • Designed and implemented basic SQL queries for testing and report/data validation.
  • Used ad hoc queries for querying and analyzing the data.
  • Performed gap analysis to check the compatibility of the existing system infrastructure with the new business requirements.
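
A minimal Python sketch of the record-count validation mentioned above, using an in-memory SQLite database as a stand-in for the real source and target systems; table names and sample rows are assumptions.

    # Sketch: compare record counts between a source extract and a target table.
    # SQLite stands in for the actual databases; names and rows are illustrative.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Stand-in "source" and "target" tables with sample rows
    cur.execute("CREATE TABLE source_extract (id INTEGER, value TEXT)")
    cur.execute("CREATE TABLE target_table (id INTEGER, value TEXT)")
    cur.executemany("INSERT INTO source_extract VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
    cur.executemany("INSERT INTO target_table VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

    # Compare row counts between the extract and the loaded table
    src_count = cur.execute("SELECT COUNT(*) FROM source_extract").fetchone()[0]
    tgt_count = cur.execute("SELECT COUNT(*) FROM target_table").fetchone()[0]

    if src_count == tgt_count:
        print(f"Record counts match: {src_count}")
    else:
        print(f"Mismatch: source={src_count}, target={tgt_count}")

    conn.close()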

Environment: SQL, PL/SQL, Oracle 9i, SAS, Business Objects, Tableau, Crystal Reports, T-SQL, UNIX, MS Access 2010

Data Analyst/Hadoop Developer

Confidential 

Responsibilities:

  • Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Learned to create Business Process Models.
  • Wrote SQL scripts to meet the business requirements.
  • Documented data cleansing and data profiling.
  • Coordinated between the business users and the development team to resolve issues.
  • Assisted the QA team in creating test scenarios that cover a day in the life of the patient for Inpatient and Ambulatory workflows.
  • Analyzed data using the Hadoop components Hive and Pig.
  • Worked on Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Involved in loading data from UNIX file system to HDFS.
  • Involved in development using Cloudera distribution system.
  • Worked hands-on with the ETL process.
  • Developed Hadoop Streaming jobs to ingest large amounts of data (a minimal mapper/reducer sketch follows this list).
  • Loaded and transformed large data sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
  • Imported data from Teradata using Sqoop with the Teradata connector.
  • Created sub-queries for filtering and faster execution of data.
  • Created multiple join tables and fetched the required data.
  • Worked with Hadoop clusters using the Cloudera (CDH5) distribution.
  • Documented requirements and obtained signoffs.
  • Performed importing and exporting of data using Sqoop between HDFS and relational database systems.
  • Installed and set up HBase and Impala.
  • Used Python libraries such as Beautiful Soup, NumPy, and SQLAlchemy.
  • Used Apache Impala to read, write, and query Hadoop data in HDFS, HBase, and Cassandra.
  • Implemented Partitioning, Dynamic Partitions and Buckets in Hive.
  • Supported MapReduce programs running on the cluster.
  • Worked on debugging and performance tuning of Hive and Pig jobs.
  • Bulk loaded data into Oracle using the JDBC template.
  • Analyzed views and produced reports.
  • Worked on Python OpenStack APIs and used NumPy for numerical analysis.
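
As an illustration of the Hadoop Streaming jobs mentioned above, here is a hedged word-count-style mapper/reducer sketch in Python. In practice the mapper and reducer are usually separate scripts passed to hadoop-streaming; the key/value format here is an assumption for the example.

    # Hadoop Streaming sketch: run with "python streaming_job.py map" as the mapper
    # and "python streaming_job.py reduce" as the reducer (hadoop jar
    # hadoop-streaming.jar -mapper ... -reducer ...). Reads stdin, writes stdout.
    import sys
    from itertools import groupby


    def mapper():
        # Emit one (word, 1) pair per token on stdin
        for line in sys.stdin:
            for word in line.strip().split():
                print(f"{word}\t1")


    def reducer():
        # Hadoop sorts mapper output by key, so identical words arrive contiguously
        pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            total = sum(int(count) for _, count in group)
            print(f"{word}\t{total}")


    if __name__ == "__main__":
        if len(sys.argv) > 1 and sys.argv[1] == "reduce":
            reducer()
        else:
            mapper()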

Environment: Cloudera, HDFS, Pig, Hive, MapReduce, Python, Sqoop, Storm, Kafka, Linux, HBase, Impala, Java, SQL, Cassandra, MongoDB, SVN, data profiling, data loading, QA team.
