
Sr Data Engineer Resume

Aurora, CO

SUMMARY

  • Around 8+ years of professional experience as a Big Data and Python developer, with expertise in Python, Hadoop, Spark, and related technologies.
  • Experienced in the development, implementation, deployment, and maintenance of complete end-to-end Hadoop-based data analytics solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.
  • Expertise in extending Hive and Pig core functionality by writing custom UDFs and MapReduce jobs in Python.
  • Good working knowledge of data warehousing and ETL processes.
  • Strong knowledge of developing production-ready Spark applications using Spark Core, Spark Streaming, Spark SQL, DataFrames, Datasets, and Spark ML.
  • Experienced in building real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka.
  • Worked on NoSQL databases including HBase, Cassandra, and MongoDB.
  • Strong Hadoop and platform support experience across the entire suite of tools and services in the major Hadoop distributions: Cloudera, Amazon EMR, and Hortonworks.
  • In-depth understanding of Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, Job History Server, NameNode, DataNode, MapReduce, and Spark.
  • Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
  • Good knowledge of building and maintaining highly scalable, fault-tolerant infrastructure in AWS environments spanning multiple availability zones.
  • Expertise in working with AWS cloud services such as EMR, S3, Redshift, Lambda, DynamoDB, RDS, SNS, SQS, Glue, Data Pipeline, and Athena for big data development.
  • Experienced in setting up Apache NiFi and performing POCs with NiFi to orchestrate data-ingestion pipelines.
  • Developed ETL solutions for GCP migration using GCP Dataflow, GCP Composer (Apache Airflow), and GCP BigQuery.
  • Experience creating and running Docker images for multiple microservices.
  • Excellent technical and analytical skills with a clear understanding of the design goals of entity-relationship modeling for OLTP and dimensional modeling for OLAP.
  • Experienced in orchestrating, scheduling, and monitoring jobs with tools such as cron, Oozie, and Airflow.
  • Expertise in writing DDL and DML scripts in SQL and HQL for analytics applications on RDBMS platforms.
  • Worked with various file formats such as CSV, JSON, XML, ORC, Avro, and Parquet.
  • Expertise in Python and shell scripting.
  • Proficient in Tableau and Power BI for analyzing large datasets and creating visually powerful, actionable, interactive reports and dashboards.
  • Experience in infrastructure automation using Chef and Docker.
  • Involved in all phases of the Software Development Life Cycle (requirements analysis, design, development, testing, deployment, and support) and Agile methodologies.
  • Experience in all phases of data warehouse development: requirements gathering, design, development, implementation, testing, and documentation.
  • Solid knowledge of dimensional data modeling with star schemas and fact and dimension tables using Analysis Services.
  • Experienced with continuous integration and build tools such as Jenkins, and with Git and SVN for version control.
  • Proficient knowledge of data analytics, machine learning (ML), predictive modeling, natural language processing (NLP), and deep learning algorithms.
  • A data science enthusiast with strong problem-solving, debugging, and analytical capabilities who actively engages in understanding and delivering on business requirements.

TECHNICAL SKILLS

Hadoop/Big Data Technologies: HDFS, Apache NiFi, MapReduce, Sqoop, Flume, Pig, Hive, Oozie, Impala, Apache ZooKeeper, Ambari, Storm, Spark, and Kafka

NoSQL Databases: HBase, Cassandra, MongoDB

Monitoring and Reporting: Tableau, Power BI, Custom Shell Scripts

Hadoop Distributions: Hortonworks, Cloudera, MapR

Build and Deployment Tools: Maven, sbt, Git, SVN, Jenkins

Programming and Scripting: Scala, SQL, Shell Scripting, Python, Pig Latin, HiveQL

Databases: Oracle, MySQL, MS SQL Server

Analytics Tools: Tableau, Microsoft SSIS, SSAS, and SSRS

Web Dev. Technologies: HTML, XML, JSON, CSS, jQuery, JavaScript

IDE Dev. Tools: PyCharm, Vi/Vim, Sublime Text, Visual Studio Code, Jupyter Notebook

Operating Systems: Linux, Unix, Windows 8, Windows 7, Windows Server 2008/2003, Mac OS

AWS Services: EC2, EMR, S3, Redshift, Lambda, Glue, Data Pipeline, Athena

Network Protocols: TCP/IP, UDP, HTTP, DNS, DHCP

Methodologies: Agile/Scrum, Waterfall

Others: Machine Learning, NLP, StreamSets, Terraform, Docker, Chef, Ansible, Splunk, GCP, Jira

PROFESSIONAL EXPERIENCE

Confidential, Aurora, CO

Sr Data Engineer

Responsibilities:

  • Developed batch and stream processing applications requiring functional pipelining with the Spark APIs.
  • Developed ETL solutions for the GCP migration using GCP Dataflow, GCP Composer (Apache Airflow), and GCP BigQuery.
  • Worked on Google Cloud Platform (GCP) services such as Cloud Storage, Cloud SQL, and Stackdriver monitoring.
  • Developed streaming applications using PySpark to read from Kafka and persist the data in NoSQL databases such as HBase and Cassandra.
  • Developed tools using Python, shell scripting, and XML to automate routine tasks.
  • Developed analytical components using Scala, Spark, Apache Mesos, and Spark Streaming.
  • Developed streaming and batch processing applications using PySpark to ingest data from various sources into the HDFS data lake.
  • Developed back-end web services using Python and the Django REST Framework.
  • Developed and implemented HQL scripts to create partitioned and bucketed tables in Hive for optimized data access (see the Hive DDL sketch after this list).
  • Implemented PySpark scripts using Spark SQL to read Hive tables into Spark for faster data processing.
  • Implemented microservices in Scala along with Apache Kafka.
  • Extracted real-time data feeds using Kafka, processed them with Spark Streaming as Resilient Distributed Datasets (RDDs) and DataFrames, and saved the results in Parquet format in HDFS and in NoSQL databases (see the streaming sketch after this list).
  • Developed Spark jobs on Databricks to perform data cleansing, validation, and standardization, and then applied transformations per the use cases.
  • Developed a high-speed BI layer on the Hadoop platform with Apache Spark and Python.
  • Performed data cleansing and applied transformations using Databricks and Spark for data analysis.
  • Extensively used Databricks notebooks for interactive analytics using the Spark APIs.
  • Wrote Hive UDFs to implement custom aggregation functions in Hive.
  • Worked extensively with Sqoop to import and export data between HDFS and relational database systems/mainframes.
  • Processed schema-oriented and non-schema-oriented data using Scala and Spark.
  • Provided architecture and design as the product was migrated to Scala, the Play Framework, and Sencha UI.
  • Implemented applications in Scala with Akka and the Play Framework.
  • Developed DDL and DML scripts in SQL and HQL for applications on RDBMS and Hive.
  • Used the Oozie scheduler to automate pipeline workflows and orchestrate the MapReduce extraction jobs, and ZooKeeper to provide coordination services to the cluster.
  • Worked with Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver monitoring, and Cloud Deployment Manager.
  • Migrated an Oracle SQL ETL to run on Google Cloud Platform using Cloud Dataproc and BigQuery, with Cloud Pub/Sub triggering the Airflow jobs.
  • Developed and deployed the resulting Spark and Scala code on a Hadoop cluster running on GCP.
  • Used Apache Airflow in the GCP Composer environment to build data pipelines, using operators such as the Bash operator, Hadoop operators, Python callables, and branching operators.
  • Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
  • Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
  • Experienced in moving raw data between different systems using Apache NiFi.
  • Built ETL with multiple Informatica transformations (Source Qualifier, Lookup, Router, Update Strategy) to create SCD-type mappings that capture changes in loan-related data in a timely manner.
  • Designed roles and groups for users and resources using AWS Identity and Access Management (IAM).
  • Used REST APIs with Python to ingest data from external sites into BigQuery.
  • Built a program with Python and Apache Beam and executed it on Cloud Dataflow to run data validation between raw source files and BigQuery tables.
  • Built a configurable Scala and Spark based framework to connect to common data sources such as MySQL, Oracle, Postgres, SQL Server, Salesforce, and BigQuery and load the data into BigQuery.
  • Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across all environments.
  • Involved in designing and deploying a multitude of applications utilizing much of the AWS stack (including EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM), focusing on high availability, fault tolerance, and auto-scaling with CloudFormation.
  • Involved in all phases of the Software Development Life Cycle (requirements analysis, design, development, testing, deployment, and support) and Agile methodologies.
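
A minimal sketch of the Kafka-to-HDFS streaming flow referenced in the bullets above, written with PySpark Structured Streaming (used here in place of the DStream API). The broker address, topic name, JSON schema, and HDFS paths are illustrative placeholders, and the spark-sql-kafka connector is assumed to be on the classpath; writing to Cassandra would use the spark-cassandra-connector in a similar writeStream call.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-feed-to-parquet").getOrCreate()

# Assumed schema for the incoming JSON feed (illustrative only).
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# Read the raw feed from Kafka (placeholder broker and topic).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "events")
       .load())

# Parse the Kafka value bytes as JSON into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Persist micro-batches as Parquet in HDFS (placeholder paths).
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events/parquet")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```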
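
A minimal sketch, under assumed database, table, and column names, of the partitioned and bucketed Hive layout and the Spark SQL access pattern mentioned in the bullets above.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-ddl-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

# Partition by load date and bucket by the join key for optimized access.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.loan_events (
        loan_id    STRING,
        event_type STRING,
        amount     DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (loan_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Read the Hive table back into a DataFrame, pruning on the partition column
# so only the requested date is scanned.
daily = spark.sql("""
    SELECT loan_id, event_type, amount
    FROM analytics.loan_events
    WHERE load_date = '2021-06-30'
""")
daily.show(10)
```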

Environment: Hortonworks, Apache Hadoop 2.6.0, HDFS, Hive 1.2.1000, Sqoop 1.4.6, HBase 1.1.2, Oozie 4.1.0, Storm 0.9.3, YARN, NiFi, Cassandra, ZooKeeper, Spark, Kafka, Oracle 11g, MySQL, shell scripting, GCP, EC2, Git source control, Teradata SQL Assistant.

Confidential, Bronx, NY

Senior Data Engineer

Responsibilities:

  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, NiFi, and Spark on YARN.
  • Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Scala for data cleaning and preprocessing.
  • Responsible for managing data coming from different sources through Kafka.
  • Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
  • Created functions and assigned roles in AWS Lambda to run Python scripts; created Lambda jobs and configured roles using the AWS CLI.
  • Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
  • Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and Spark SQL; designed and developed RDD seeds using Scala and Cascading, and streamed data to Spark Streaming using Kafka.
  • Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, including creating DataFrames handled in Spark with Scala.
  • Good exposure to MapReduce programming using Pig Latin scripting, distributed applications, and HDFS.
  • Implemented data quality checks in the ETL tool Talend; good knowledge of data warehousing.
  • Installed applications on AWS EC2 instances and configured storage in S3 buckets.
  • Experience using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
  • Developed custom Kafka producers and consumers for publishing and subscribing to different Kafka topics (see the producer/consumer sketch after this list).
  • Migrated MapReduce jobs to Spark jobs to achieve better performance.
  • Used the Spark DataFrame API in Scala to analyze data.
  • Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant users fine-grained access to AWS resources.
  • Evaluated client needs and translated business requirements into functional specifications, onboarding clients onto the Hadoop ecosystem.
  • Extracted and updated data in HDFS using Sqoop import and export.
  • Developed Hive UDFs to incorporate external business logic into Hive scripts and developed dataset join scripts.
  • Worked on creating and updating Auto Scaling and CloudWatch monitoring via the AWS CLI.
  • Worked with various HDFS file formats such as Parquet and JSON for serializing and deserializing.
  • Implemented a cluster for the NoSQL store HBase as part of a POC to address HBase limitations.
  • Used IAM to detect and stop risky identity behaviors using rules, machine learning, and other statistical algorithms.
  • Developed end-to-end data processing pipelines that receive data through the distributed messaging system Kafka and persist it into Cassandra.
  • Worked on AWS Lambda functions in Python that invoke scripts to perform various transformations and analytics on large datasets in EMR clusters (see the Lambda sketch after this list).
  • Developed Apache Spark applications for processing data from various streaming sources.
  • Responsible for developing a data pipeline using Spark, Scala, and Apache Kafka to ingest data from the CSL source and store it in a protected HDFS folder.
  • Implemented many Kafka ingestion jobs to consume data for both real-time and batch processing.
  • Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS, and worked extensively with Sqoop to import metadata from Oracle.
  • Used Jira for ticketing and issue tracking and Jenkins for continuous integration and continuous deployment.
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake schemas.
  • Worked on designing the MapReduce and YARN flow, writing MapReduce scripts, performance tuning, and debugging.
  • Developed a NiFi workflow to pick up data from an SFTP server and send it to a Kafka broker.
  • Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop, and Spark jobs.
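
A minimal sketch of a custom Kafka producer/consumer pair of the kind described in the bullets above, using the kafka-python client; the broker address, topic, and consumer group are illustrative placeholders.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["broker1:9092"]   # placeholder broker list
TOPIC = "clickstream"        # placeholder topic

# Producer: publish JSON-encoded events to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"user_id": "u123", "action": "page_view"})
producer.flush()

# Consumer: subscribe to the same topic and process records as they arrive.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="clickstream-processors",   # placeholder consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
```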
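
A hedged sketch of an AWS Lambda handler along the lines of the Lambda/EMR bullet above: it uses boto3 to submit a Spark step to an existing EMR cluster. The cluster id, S3 script location, and argument names are hypothetical, not the actual setup.

```python
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    """Submit a spark-submit step to the EMR cluster named in the event."""
    cluster_id = event.get("cluster_id", "j-XXXXXXXXXXXX")  # placeholder id
    step = {
        "Name": "nightly-transform",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://example-bucket/scripts/transform.py",  # placeholder script
                "--input", "s3://example-bucket/raw/",
                "--output", "s3://example-bucket/curated/",
            ],
        },
    }
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return {"step_ids": response["StepIds"]}
```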

Environment: Hadoop (HDFS, MapReduce), Kafka, Scala, MongoDB, Pig, Sqoop, Flume, HBase, AWS services (Lambda, EMR, Auto Scaling, EC2, S3, IAM, CloudWatch, DynamoDB), YARN, PostgreSQL, Spark, Impala, Oozie, Hue, Oracle, NiFi, Git.

Confidential, San Diego, CA

Data Engineer

Responsibilities:

  • Migrated data from on-premises systems to AWS storage buckets.
  • Analyzed and cleansed raw data using HiveQL.
  • Deployed and migrated new application environments and infrastructure on AWS (VPC, DynamoDB, SQS, ELB, EC2, SNS, SES, Redshift, Lambda, EMR, S3, Glacier, etc.).
  • Developed a Python script to transfer data from on-premises systems to AWS S3.
  • Used the Oozie scheduler to automate pipeline workflows and orchestrate Spark jobs.
  • Worked in a Scrum/Agile environment using tools such as Jira.
  • Created Airflow DAGs for backfill, history, and incremental loads (see the DAG sketch after this list).
  • Interacted with data architects, business analysts, and users to understand business and functional needs.
  • Created efficient Spark jobs using cache, coalesce, and repartition to improve performance.
  • Supported the application in production and participated in code reviews.
  • Contributed to architecture reviews.
  • Wrote unit test cases for Spark code as part of the CI/CD process.
  • Used and configured multiple AWS services such as Redshift, EMR, EC2, and S3 to maintain compliance with organizational standards.
  • Involved in creating Hive tables (managed and external), loading data, and analyzing it using Hive queries.
  • Created an automated loan leads and opportunities match-back model used to analyze loan performance and convert more business leads.
  • Demonstrated Hadoop best practices and knowledge of technical solutions, design patterns, and code for medium-to-complex applications deployed in Hadoop production.
  • Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Experience in data transformations using MapReduce and Hive for different file formats.
  • Developed a Python script to validate that the daily files in one S3 bucket match the same daily files in another bucket (see the validation sketch after this list).
  • Automated the workflow by scheduling the jobs with the Airflow job scheduler.
  • Responsible for managing data coming from different sources, including files and RDBMSs such as MS SQL Server, DB2, and Oracle.
  • Imported data from different sources, such as AWS S3 and the local file system, into Spark RDDs.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
  • Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
  • Involved in loading data from the Linux file system into HDFS.
  • Developed Spark programs in Scala to compare the performance of Spark with Hive and Spark SQL.
  • Good knowledge of configuration management tools such as Bitbucket/GitHub and Bamboo (CI/CD).
  • Developed a PySpark script to process and transfer files to a third-party vendor on an automated basis.
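
A minimal Airflow 2.x DAG sketch in the spirit of the backfill/incremental loads mentioned above; the DAG id, schedule, bucket path, and task callables are illustrative, not the actual pipeline.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_increment(**context):
    # Pull only the records for the run's logical date (placeholder logic).
    print(f"extracting increment for {context['ds']}")

def load_to_s3(**context):
    # Push the extracted increment to a staging bucket (placeholder path).
    print("loading increment to s3://example-staging-bucket/incremental/")

default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="incremental_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # catchup enables backfill of historical runs
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract_increment", python_callable=extract_increment)
    load = PythonOperator(task_id="load_to_s3", python_callable=load_to_s3)
    extract >> load
```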
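
A hedged boto3 sketch of the daily-file validation described above, comparing the object keys under a date prefix in two buckets; the bucket names and prefix layout are placeholders.

```python
import boto3

s3 = boto3.client("s3")

def list_keys(bucket, prefix):
    """Return object keys under the prefix, relative to the prefix."""
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"][len(prefix):])
    return keys

def validate(date_str):
    """Compare the daily files landed in the source and target buckets."""
    source = list_keys("example-raw-bucket", f"daily/{date_str}/")
    target = list_keys("example-processed-bucket", f"daily/{date_str}/")
    return {
        "missing_in_target": sorted(source - target),
        "unexpected_in_target": sorted(target - source),
    }

if __name__ == "__main__":
    print(validate("2021-06-30"))
```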

Environment: Apache Spark, Scala, Amazon EMR/Redshift, AWS Step Functions, AWS CloudWatch, Oracle, Hive/SQL, Python, PySpark.

Confidential

ETL Developer

Responsibilities:

  • Proficient working experience with SQL, PL/SQL, and database objects such as stored procedures, functions, triggers, inline views, and global temporary tables, using the latest features to optimize performance.
  • Performed data analysis and mapping, database normalization, performance tuning, query optimization, data extraction, transfer, and loading (ETL), and data cleanup.
  • Created SSIS packages using SSIS Designer to export heterogeneous data from OLE DB sources (Oracle) and Excel spreadsheets to SQL Server.
  • Made extensive use of triggers to implement business logic and to audit changes to critical tables in the database; experienced in developing external tables, views, joins, clustered indexes, and cursors.
  • Defined the data warehouse (star and snowflake schemas), fact tables, cubes, dimensions, and measures using SQL Server Analysis Services.
  • Used Execution Plan, SQL Profiler, and Database Engine Tuning Advisor to optimize queries and enhance database performance.
  • Worked on the data warehouse design and analyzed various approaches for maintaining different dimensions and facts in the process of building a data warehousing application.
  • Generated various reports using Reporting Services (SSRS).
  • Optimized query performance by creating indexes.

Environment: Oracle, SSIS, MySQL, Microsoft Office Suite

Confidential

Data Analyst

Responsibilities:

  • Developed stored procedures in MS SQL to fetch data from different servers using FTP and processed these files to update the tables.
  • Responsible for designing logical and physical data models for various data sources on Confidential Redshift.
  • Designed and developed ETL jobs to extract data from the Salesforce replica and load it into the data mart in Redshift.
  • Experience building data pipelines in Python/PySpark/HiveQL/Presto/BigQuery and building Python DAGs in Apache Airflow.
  • Created an ETL pipeline using Spark and Hive to ingest data from multiple sources.
  • Used SAP and transactions in the SAP SD module to handle the client's customers and generate sales reports.
  • Coordinated with clients directly to get data from different databases.
  • Worked on MS SQL Server, including SSRS, SSIS, and T-SQL.
  • Designed and developed schema data models.
  • Documented business workflows for stakeholder review.
