Data Engineer Resume
Nashville, TN
SUMMARY
- 7 years of IT experience, with a strong emphasis on designing and implementing statistically significant analytic solutions for Hadoop- and Spark-based enterprise applications.
- 4 years of implementation and extensive working experience in writing Hadoop jobs for analyzing data using a wide array of Big Data tools such as Hive, Hadoop, Spark, Oozie, Sqoop, Kafka, Zookeeper, and HBase, and cloud services such as Azure and AWS.
- An accomplished Data Engineer experienced in ingestion, storage, querying, processing, and analysis of big data, and an expert in designing data warehousing solutions across a variety of database technologies.
- Extensive experience focused on Data warehousing, Data modeling, Data integration, Data Migration, ETL process, and Business Intelligence. Package Software: Expertise in SSIS, Informatica ETL, and reporting tools.
- Worked extensively with Spark using Scala and Python on the cluster for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
- Good understanding of statistics and of developing machine learning models; experience in implementing data science solutions using Azure Databricks.
- Strong Programming experience in Python, Scala, and core CS concepts such as Data Structures and algorithms.
- Extensive experience in developing applications that perform Data Processing tasks using Teradata, Snowflake, Oracle, SQL Server, and Postgres databases.
- Hands-on experience building ETL pipelines, Visualizations, Analytics based quality solutions in-house using AWS, Azure Databricks, and other Open-source frameworks.
- Hands-on experience with Azure Data Factory, ADLS Gen2, Azure Databricks, and Azure Cognitive Services.
- Extensive experience working with various Hadoop distributions, including enterprise versions of Cloudera and Hortonworks, and good knowledge of the MapR distribution and AWS EMR (Elastic MapReduce).
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, Amazon Elastic Load Balancing, and other services in AWS. Deep understanding of Cloud Architectures including AWS, Azure, and GCP.
- Experienced in implementing schedulers using Oozie, Airflow, Crontab, and Shell scripts.
- Good working experience in importing data using Sqoop from various sources like RDBMS, Snowflake, Teradata, and Oracle to HDFS and performing transformations on it using Hive, Pig, and Spark.
- Extensive experience in importing and exporting streaming data into HDFS using stream processing platforms like Flume and Kafka messaging systems.
- Experienced in migrating data from different sources using the pub-sub model in Redis and Kafka producers and consumers, and in preprocessing data using Spark.
- Expertise in writing Spark RDD transformations, actions, DataFrames, and case classes for the required input data, and in performing data transformations using Spark Core.
- Engaged in performance tuning, scalability engineering, reliability, and feasibility in solutions design.
- Experience in developing data pipelines using Pig, Sqoop, and Flume to extract data from weblogs and store it in HDFS. Developed customized UDFs, UDTFs, and UDAFs in Python to extend Hive core functionality.
- Proficient in NoSQL databases including HBase, Cassandra, and MongoDB, and their integration with Hadoop clusters.
- Working knowledge of installing and maintaining Cassandra by configuring the cassandra.yaml file as per business requirements and performing reads/writes using Java JDBC connectivity.
- Wrote multiple MapReduce jobs using the Java API and Hive for data extraction, transformation, and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV, and ORC, and from files compressed with codecs such as gzip, Snappy, and LZO.
- Strong understanding of Data Modeling (Relational, dimensional, Star and Snowflake Schema), Data analysis, implementations of Data warehousing using Windows and UNIX.
- Experience in complete Software Development Life Cycle (SDLC) in both Waterfall and Agile methodologies.
- Generated various kinds of knowledge reports using Tableau, Power BI, and Qlik based on Business specifications.
TECHNICAL SKILLS
Big Data Technologies: HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Flume, Spark, Apache Kafka, Zookeeper, Ambari, Oozie, Avro, Parquet, Snappy.
NoSQL Databases: HBase, Cassandra, MongoDB, Amazon DynamoDB, Redis
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks and Apache.
Languages: C, Java, Scala, Python, R, XML, SQL, PL/SQL, HiveQL, Unix, JavaScript, Shell Scripting
Source Code Control: GitHub, CVS, SVN, ClearCase
Cloud Computing Tools: AWS (S3, EMR, EC2, Lambda, VPC, Route 53, CloudWatch, CloudFront), Microsoft Azure, GCP
Databases: Teradata, Snowflake, Microsoft SQL Server, MySQL, DB2
DB languages: MySQL, PL/SQL, PostgreSQL & Oracle
Build Tools: Jenkins, Maven, Ant, Log4j
Business Intelligence Tools: Tableau, Power BI
Development Tools: Eclipse, IntelliJ, Microsoft SQL Studio, Toad, NetBeans
ETL Tools: Talend, Pentaho, Informatica, Ab Initio, SSIS
Development Methodologies: Agile, Scrum, Waterfall, V model, Spiral, UML
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer
Responsibilities:
- Designed and implemented efficient data pipelines to integrate data from a variety of sources into the Data Lake.
- Analyzed, designed, and built modern data solutions using Azure PaaS services to support visualization of data. Assessed the current production state of the application and determined the impact of new implementations on existing business processes.
- Extracted, transformed, and loaded data from on-prem SQL Server to Snowflake using a combination of Azure Data Factory, PySpark, and Azure Databricks. Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Databricks.
- Built pipelines to migrate on-premises data systems to Snowflake, helping Confidential achieve efficient data loading and querying on reporting objects.
- Built data management frameworks and user-defined functions to achieve better data flow.
- Designed and developed data processing pipelines using Azure Data Factory, Python, and SQL.
- Used the continuous integration and continuous delivery (CI/CD) framework in Azure DevOps to automate all data processes.
- Used tools and technologies such as Spark, Hadoop, Snowflake, Azure Pipelines, Azure Data Factory, Azure HDInsight, Azure DevOps, SQL Server, OLAP cubes, Python, and shell scripts to complete tasks and meet business goals efficiently.
- Designed and implemented ETL/ELT processes using Azure Data Factory; working knowledge of Azure Active Directory properties.
- Handled large data files in multiple formats using PySpark and Databricks.
- Worked with Microsoft SQL Server Management Studio on stored procedures, SQL development, and performance tuning; automated and scheduled workflows using Azure Data Factory for end-to-end data processing pipelines.
- Responsible for loading data and implementing schemas in Snowflake.
- Designed and implemented data warehousing and Business Intelligence solutions using ETL tools.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake SnowSQL. Wrote SQL queries against Snowflake.
Environment: Python, Spark, ETL, Databricks, Azure DevOps, Azure SQL Server, Streamlit library, Azure Data Factory, SQL Server, SSMS, SSRS, Snowflake, SQL, PyCharm, Git, OLAP Cubes, RDBMS.
Confidential
Big Data Developer
Responsibilities:
- Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks in Python to automate the ingestion, processing, and delivery of both structured and unstructured batch and real-time streaming data.
- Analyze and implement various data pipelines and data requests from the business which will include processing high-volume Ingestion, extraction, and transformation.
- Designed and developed Cloudera HDFS-based solutions using Spark (with Java, Python, and Spark SQL interfaces), HiveQL, Flume, Talend, IBM MQ, and Kafka.
- Used IBM DataStage, Flume, and Sqoop for data integration and data migration from multiple source systems into HDFS.
- Designed, developed, and tested ETL processes in AWS Glue to migrate campaign data from external sources such as S3 and Parquet/text files into AWS Redshift.
- Determined the EMR cluster size needed to run Spark scripts on large data sets; good understanding of IAM authentication and role-based access.
- Created Athena and Hive external tables on top of S3 data and performed ad-hoc analysis to understand the data.
- Developed Spark applications in PySpark on the distributed environment to load a huge number of CSV files with different schemas into Hive Parquet tables.
- Applied transformations to the data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
- Hands-on experience in developing UDF, Data Frames, and SQL queries in Spark SQL.
- Built real-time streaming data pipelines using AWS Kinesis Data streams and Kinesis Firehose to S3.
- Involved in developing Spark application using PySpark as per business requirement.
- Responsible for design and development of Spark SQL Scripts based on Functional Specifications. Created HBase tables to store various data formats of data coming from spark.
- Hands-on experience with Continuous Integration and Deployment using Jenkins and UrbanCode Deploy.
Environment: Python, Java, AWS, HDFS, Spark, Kafka, ETL, Hive, YARN, HBase, Jenkins, UrbanCode Deploy, Tableau, Sqoop, Flume, HANA, IBM DataStage, Linux, Shell Script, BI Reports, MySQL, RDBMS.
Confidential, Nashville, TN
Data Engineer
Responsibilities:
- Created data pipelines to ingest, aggregate, and load consumer response data from an AWS S3 bucket into Hive external tables in HDFS, serving as the feed for an AWS QuickSight dashboard.
- Used Hive join queries to join multiple tables of a source system and load the results into Elasticsearch.
- Tuned PySpark scripts to improve performance.
- Envisioned the architectural scheme, structure, features, functionality, and user-interface design.
- Evolved the overall master data model, including the functions, entities within those functions, and attributes within those entities as the platform design is completed and business needs shift and change.
- Applied data warehousing solutions while working with database technologies like Snowflake, Teradata.
- Developed Python jobs incorporating SQL queries to improve performance over repeated extraction of the same data in a PostgreSQL database environment.
- Wrote Spark programs in Python for data quality checks.
- Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirement.
- Consumed XML messages using Kafka and processed the XML file using Spark Streaming to capture UI updates.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into a flat file.
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
- Involved in loading data from REST endpoints to Kafka producers and transferring the data to Kafka brokers.
- Used HiveQL to analyze partitioned and bucketed data, and executed Hive queries on Parquet tables stored in Hive to perform data analysis that met the business specification logic.
- Worked with AWS cloud services such as EC2, S3, Glue, DynamoDB, EBS, RDS, and VPC.
- Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for processing and storing small data sets. Maintained the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Worked with Elastic MapReduce and set up a Hadoop environment on AWS EC2 instances.
- Worked on connecting Snowflake and Cassandra database to the Amazon EMR File System for storing in S3.
- Implemented usage of Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering.
- Developed an API for Spark jobs to connect to AWS S3 buckets and push the final JSON export, which is the entry point for clients to access the data.
- Designed Airflow DAGs to schedule Spark jobs on the cluster to generate JSON exports every day.
- Regularly used DbVisualizer, PuTTY, IntelliJ, and Excel; used Flowdock to keep track of one-on-one communication between the teams.
- Implemented partitions and buckets, and developed Hive queries to process the data and generate data cubes for visualization. Involved in cluster maintenance, cluster monitoring, and troubleshooting.
Environment: Python, AWS, Scala, Spark, Docker, Spark RDD, AWS EC2, AWS S3, Cassandra, Snowflake, Java, PySpark, Oozie, DB Visualizer, Putty, IntelliJ, Excel, SQL, YARN, Spark SQL, HDFS, Hive, Maven, Apache Kafka, Shell scripting, Linux, PostgreSQL Database, Git, and Agile Methodologies.
Confidential, AZ
Big Data Engineer
Responsibilities:
- Designed Apache Spark programs to read millions of transactions from an Oracle database, implement Structured Streaming, and perform the necessary transformations using Spark SQL.
- Processed streaming data in a variety of formats such as JSON, CSV, XML, XLSX, and HTML.
- Imported data from Oracle and MySQL databases into Spark and performed transformations and actions.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Developed Python code for tasks, dependencies, an SLA watcher, and a time sensor for each job, for workflow management and automation using the Control-M and D-series tool.
- Experienced in writing Real-time Processing and core jobs using Spark Streaming with Kafka system.
- Worked with static and dynamic memory caches to improve the throughput of sessions containing Rank, Lookup, Joiner, Sorter, and Aggregator transformations.
- Worked on analyzing Hadoop clusters using different big data analytic tools including Hive, Oozie, Zookeeper, Sqoop, Spark, Kafka, and Impala with Cloudera distribution.
- Used Spark to improve the performance of and optimize existing Hadoop algorithms, working with SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Designed Oozie workflows to schedule Spark jobs on the cluster to generate JSON exports every day.
- Experience in using Avro, Parquet, ORC, and JSON file formats; developed UDFs in Hive and Pig.
- Played a major role in the continuous build, test, and integration process (CI/CD Pipelines).
- Developed Sqoop and Kafka jobs to load data from RDBMS, Snowflake, and external systems into HDFS and Hive.
- Developed Oozie coordinators to schedule Pig and Hive scripts to create Data pipelines.
Environment: Spark, Spark-Streaming, Python, Pig, Cassandra, Log4j, Oozie, Hadoop, YARN, Impala, Spark SQL, HiveQL, Java, HDFS, Hive, Maven, Apache Kafka, Shell scripting, Linux, MySQL, Oracle DB, Eclipse, Git, and Agile Methodologies.
Confidential, Colorado
Sr. Data Modeler
Responsibilities:
- Worked on various kinds of transformations like Expression, Aggregator, Stored Procedure, Java, Lookup, Filter, Joiner, Rank, Router and Update Strategy. Developed reusable Mapplets and Transformations.
- Involved in design, analysis, implementation, testing, and support of ETL processes for Stage, ODS, and Mart. Defined the scope of the project, gathered business requirements, and performed a gap analysis.
- Laid the Architecture design for Data Lake and Implemented it using Hadoop architecture.
- Generated ad-hoc SQL queries using joins, database connections, and transformation rules to fetch data from legacy DB2 and SQL Server database systems.
- Exhaustively collected business and technical metadata and maintained naming standards.
- Used Erwin to reverse engineer the existing database and ODS, creating graphical representations in the form of entity relationships to elicit more information.
- Developed Data Mapping, Data Governance, Transformation, and Cleansing rules for the Master Data Management Architecture involving OLTP, ODS, and OLAP.
- Collaborated with ETL, BI, and DBA teams to analyze and provide solutions to data issues and other challenges while implementing the OLAP model.
- Created and maintained the Logical Data Model (LDM) for the project, including documentation of all entities, attributes, data relationships, primary and foreign key structures, business rules, glossary terms, etc.
- Designed and developed Informatica mappings and sessions based on business user requirements and business rules to load data from source flat files and Oracle tables to target tables.
- Developed and maintained a data dictionary to create metadata reports for technical and business purposes.
- Worked on different data formats such as flat files, SQL files, databases, XML schema, and CSV files.
- Involved in the project cycle plan for the data warehouse, source data analysis, data extraction process, transformation, and ETL loading strategy design.
- Generated various kinds of reports using Power BI and Tableau based on Client specifications.
- Responsible for generating actionable insights from complex data to drive real business results for various application teams and worked in Agile Methodology projects extensively.
Environment: Bitbucket, Erwin, Hadoop, Jira, Confluence, HDFS, Hive, Pig, HBase, Big Data, Oozie, Sqoop, Zookeeper, MapReduce, Cassandra, Scala, Linux, NoSQL, MySQL Workbench, Java, Eclipse, Oracle 10g, SQL, Python, Hive SQL, Spark SQL, Databricks, and Agile Methodologies.
