
Sr Data Engineer Resume


Santa Clara, California

SUMMARY:

  • Over 8 years of IT experience across a variety of industries working on Big Data technology using the Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala. Able to work in both GCP and Azure clouds in parallel.
  • Carried out data transformation and cleansing using SQL queries, Python, and PySpark.
  • Adept at configuring and installing Hadoop/Spark Ecosystem Components.
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala. Worked with Spark to improve the efficiency of existing algorithms using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
  • Keen on keeping up with the newer technology stack that Google Cloud Platform (GCP) adds.
  • Fluent programming experience with Scala, Python, SQL, T-SQL, and R.
  • Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
  • Worked on dimensional data modeling in Star and Snowflake schemas and Slowly Changing Dimensions (SCD).
  • Solid experience and understanding of implementing large-scale data warehousing programs and end-to-end data integration solutions on Snowflake and AWS Redshift.
  • Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, the bq command-line utility, Dataproc, and Stackdriver.
  • Hands-on experience with different programming languages such as Python and SAS.
  • Experience in handling the Python and Spark contexts when writing PySpark programs for ETL (a minimal sketch follows this list).
  • Seasoned practice in Machine Learning algorithms and Predictive Modeling such as Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering.
  • Ample knowledge of data architecture, including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning, and advanced data processing.
  • Experience working with NoSQL databases like Cassandra and HBase; developed real-time read/write access to very large datasets via HBase.
  • Developed Spark Applications that can handle data from various RDBMS (MySQL, Oracle Database) and Streaming sources.
  • Proficient SQL experience in querying, data extraction/transformations and developing queries for a wide range of applications.
  • Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
  • Experience in Extraction, Transformation and Loading (ETL) of data from various sources into data warehouses, as well as data processing such as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
  • Hands-on experience with Hadoop architecture and various components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and Hadoop MapReduce programming.
  • Comprehensive experience in developing simple to complex MapReduce and streaming jobs using Scala and Java for data cleansing, filtering, and data aggregation. Also possess detailed knowledge of the MapReduce framework.
  • Implemented generalized solution models using AWS SageMaker.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other services of the AWS family.
  • Used IDEs like Eclipse, IntelliJ IDEA, PyCharm, Notepad++, and Visual Studio for development.
  • Used SQL (Presto SQL, HiveQL), Python (Pandas, NumPy, SciPy, Matplotlib), and PySpark to cope with the increasing volume of data.
  • Experience in analyzing data using HiveQL, Pig, HBase and custom MapReduce programs in Java 8.
  • Experience working with GitHub/Git 2.12 source and version control systems.
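
A minimal PySpark ETL sketch of the kind referenced in the summary above; the bucket paths, schema, and column names are hypothetical placeholders, not a specific production job:

    from pyspark.sql import SparkSession, functions as F

    # Hypothetical ETL skeleton: read raw CSV, cleanse, and write partitioned Parquet.
    spark = SparkSession.builder.appName("orders-etl").getOrCreate()

    raw = spark.read.option("header", True).csv("s3a://example-bucket/raw/orders/")

    cleaned = (
        raw.dropna(subset=["order_id"])                      # drop rows missing the key
           .withColumn("amount", F.col("amount").cast("double"))
           .withColumn("order_ts", F.to_timestamp("order_ts"))
           .withColumn("order_date", F.to_date("order_ts"))
    )

    cleaned.write.mode("overwrite").partitionBy("order_date").parquet(
        "s3a://example-bucket/curated/orders/")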

TECHNICAL SKILLS:

Hadoop/Big Data Technologies: HDFS, Hive, Pig, Sqoop, Yarn, Spark, Spark SQL, Kafka

Google Cloud Platform: Cloud Storage, BigQuery, Composer, Cloud Dataproc, Cloud SQL, Cloud Functions, Cloud Pub/Sub.

Big Data: Spark, Azure Storage, Azure Database, Azure Data Factory, Azure Analysis Services.

Hadoop Distributions: Hortonworks and Cloudera Hadoop

Languages: C, C++, Python, Scala, JavaScript, UNIX shell script, COBOL, SQL and PL/SQL

Python libraries: Pandas, NumPy, Sklearn, Matplotlib, and Seaborn.

Tools: Teradata SQL Assistant, PyCharm, Autosys

Operating Systems: Linux, Unix, z/OS and Windows

Databases: Teradata, Oracle 9i/10g, DB2, SQL Server, MySQL 4.x/5.x

ETL Tools: IBM InfoSphere Information Server V8, V8.5 & V9.1, Power BI, Data Studio, Tableau

Reporting: Tableau

PROFESSIONAL EXPERIENCE:

Confidential, Santa Clara, California

Sr Data Engineer

Responsibilities:

  • Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinated tasks among the team.
  • The role included creating data pipelines from application databases to the data lake and warehouse and building stream and batch data processing pipelines.
  • Maintained Cloudera distribution Hadoop production and dev clusters of more than 16 PB across 300 nodes; performed daily health checks, worked on alerts, and handled other related tasks.
  • Developed Spark applications using Spark SQL and RDDs. Excellent knowledge of Spark Streaming and working knowledge of Spark ML.
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
  • Developed a PySpark application for processing database health information and storing it in an Oracle database. In-depth knowledge of monitoring Spark applications through the Spark UI and debugging failed Spark jobs. Maintained existing Spark jobs and made changes per business needs.
  • Created Hive databases, tables, and queries for data analytics as needed, integrated them with data processing jobs, and dropped them as part of cleanup. Also maintained an HBase database used by other applications.
  • Leveraged Google Cloud Platform Services to process and manage the data from streaming and file-based sources.
  • Designed and developed an in-house database visualization tool on the Oracle Application Express platform that visualizes real-time database health information and presents management reports.
  • Experience in moving data between GCP and Azure using Azure Data Factory.
  • Worked as an AWS cloud engineer to create and manage EC2 instances and S3 buckets and configure RDS instances; monitored performance, spun up instances regularly, and helped developers get access to the cloud instances.
  • Created a Python Flask application to deploy PL/SQL code automatically onto the production database. Created an adaptive database alerting system in Python, SQL, and Bash that generates database alerts in near real time based on thresholds that can be changed on the fly.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (see the DAG sketch after this list).
  • Created Azure Stream Analytics jobs to replicate real-time data and load it into Azure SQL Data Warehouse.
  • Deployed code to multiple environments with the help of the CI/CD process, worked on code defects during SIT and UAT testing, and supported data loads for testing; implemented reusable components to reduce manual intervention.
  • Processed structured and semi-structured files such as JSON and XML using Spark in HDInsight and Databricks environments.
  • Prepared data models for the Data Science and Machine Learning teams. Worked with the teams in setting up the environment to analyze the data using Pandas.
  • Worked with VSTS for the CI/CD implementation.
  • Reviewed individual work on ingesting data into Azure Data Lake and provided feedback on architecture, naming conventions, guidelines, and best practices.
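
A sketch of the Airflow-on-GCP pipeline pattern referenced above; the DAG id, project, dataset, and query are hypothetical, and it assumes the apache-airflow-providers-google package (as shipped with Cloud Composer):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

    # Hypothetical daily ETL DAG; project/dataset/table names are placeholders.
    with DAG(
        dag_id="daily_orders_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        load_curated_orders = BigQueryInsertJobOperator(
            task_id="load_curated_orders",
            configuration={
                "query": {
                    "query": (
                        "SELECT * FROM `example-project.staging.orders` "
                        "WHERE order_date = '{{ ds }}'"
                    ),
                    "destinationTable": {
                        "projectId": "example-project",
                        "datasetId": "curated",
                        "tableId": "orders",
                    },
                    "writeDisposition": "WRITE_APPEND",
                    "useLegacySql": False,
                },
            },
        )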

Environment: GCP, BigQuery, GCS buckets, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, Cloud SQL, MySQL, Postgres, AWS, SQL Server, Python, Scala, Spark, Hive, Spark SQL

Confidential, New York

Sr. Data Engineer

Responsibilities:

  • Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
  • Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines
  • Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, and wrote SQL queries against Snowflake (see the connector sketch after this list).
  • Strong understanding of AWS components such as EC2 and S3
  • Implemented AWS Lambdas to drive real-time monitoring dashboards from system logs.
  • Conducted statistical analysis on healthcare data using Python and various tools.
  • Developed simple and complex MapReduce programs in Java for data analysis on different data formats.
  • Used AWS for Tableau Server scaling and secured Tableau Server on AWS to protect the Tableau environment using Amazon VPC, security groups, AWS IAM, and AWS Direct Connect.
  • Developed SSIS packages to extract student data from source systems (a transactional system for online assessments and a legacy system for paper-and-pencil assessments), transform it based on business rules, and load it into reporting data mart tables such as dimensions, facts, and aggregated fact tables.
  • Developed T-SQL (transact SQL) queries, stored procedures, user-defined functions, built-in functions.
  • Expertise in Snowflake for creating and maintaining tables and views.
  • Optimized queries by adding necessary non-clustered indexes and covering indexes.
  • Developed Power Pivot/SSRS (SQL Server Reporting Services) Reports and added logos, pie charts, bar graphs for display purposes as per business needs.
  • Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
  • Designed SSRS reports using parameters, drill down options, filters, sub reports.
  • Developed internal dashboards for the team using Power BI tools for tracking daily tasks.
  • Responsible for data services and data movement infrastructures
  • Experienced in ETL concepts, building ETL solutions and Data modeling
  • Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
  • Aggregated daily sales team updates to send report to executives and to organize jobs running on Spark clusters
  • Loaded application analytics data into data warehouse in regular intervals of time
  • Experienced in dimensional modeling (Star schema, Snowflake schema), transactional modeling, and SCDs (Slowly Changing Dimensions)
  • Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB (a sketch of this access pattern follows this list).
  • Used AWS Athena to query data directly from AWS S3.
  • Worked on Confluence and Jira
  • Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores
  • Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regression) and Statistical Modeling
  • Configured AWS Lambda wif multiple functions.
  • Compiled data from various sources to perform complex analysis for actionable results
  • Measured Efficiency of Hadoop/Hive environment ensuring SLA is met
  • Optimized the TensorFlow model for efficiency
  • Implemented a generalized solution model using AWS SageMaker
  • Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes
  • Implemented a Continuous Delivery pipeline with Docker, GitHub, and AWS
  • Built performant, scalable ETL processes to load, cleanse and validate data
  • Collaborated with team members and stakeholders in the design and development of the data environment
  • Prepared associated documentation for specifications, requirements, and testing
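
A minimal sketch of the Python-plus-Snowflake loading pattern referenced in the ETL bullet above; the connection parameters, file path, and table name are hypothetical placeholders:

    import os
    import snowflake.connector

    # Hypothetical load step: stage a local file and COPY it into a Snowflake table.
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        cur.execute("PUT file:///tmp/orders.csv @%ORDERS")  # upload to the table stage
        cur.execute(
            "COPY INTO ORDERS FROM @%ORDERS "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
        )
        cur.execute("SELECT COUNT(*) FROM ORDERS")
        print(cur.fetchone()[0])
    finally:
        conn.close()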
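And a rough sketch, under assumed table, bucket, and event-field names, of the fine-grained S3 access pattern described in the security-framework bullet above: an API-Gateway-invoked Lambda checks a DynamoDB entitlement record before handing back a short-lived presigned URL.

    import boto3

    # Hypothetical resources; the real framework's names and schema are not shown here.
    dynamodb = boto3.resource("dynamodb")
    entitlements = dynamodb.Table("object_entitlements")  # keys: user_id, object_key
    s3 = boto3.client("s3")

    BUCKET = "example-secured-bucket"

    def handler(event, context):
        """Return a presigned GET URL only if the caller is entitled to the object."""
        user_id = event["requestContext"]["authorizer"]["principalId"]
        object_key = event["pathParameters"]["key"]

        item = entitlements.get_item(
            Key={"user_id": user_id, "object_key": object_key}
        ).get("Item")
        if item is None:
            return {"statusCode": 403, "body": "Access denied"}

        url = s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": BUCKET, "Key": object_key},
            ExpiresIn=300,
        )
        return {"statusCode": 200, "body": url}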

Environment: AWS, GCP, BigQuery, GCS buckets, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, Cloud SQL, MySQL, Postgres, SQL Server, Python, Scala, Spark, Hive, Spark SQL

Confidential, Washington DC

Data Engineer

Responsibilities:

  • Responsible for developing a highly scalable and flexible authority engine for all customer data.
  • Worked on resetting customer attributes that provide insight about the customer, such as purchase frequency, marketing channel, and Groupon deal categorization, drawing on different sources of data using SQL, Hive, and Scala.
  • Integrated third-party agency data (gender, age, and purchase history from other sites) into the existing data store.
  • Used the Kafka HDFS Connector to export data from Kafka topics to HDFS files in a variety of formats, integrated with Apache Hive to make data immediately available for SQL querying.
  • Normalized the data according to business needs through data cleansing, data type modifications, and various transformations using Spark, Scala, and GCP Dataproc.
  • Implemented dynamic partitioning in BigQuery tables and used appropriate file formats and compression techniques to improve the performance of PySpark jobs on Dataproc.
  • Built a system for analyzing column names from all tables and identifying information columns of data across on-premises databases for migration to GCP.
  • Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python (see the Beam sketch after this list).
  • Worked on partitions of Pub/Sub messages and setting up the replication factors.
  • Effectively worked and communicated with product, marketing, business owners, business intelligence, and the data infrastructure and warehouse teams.
  • Performed analysis on data discrepancies and recommended solutions based upon root cause.
  • Designed and developed job flows using Apache Airflow.
  • Worked with IntelliJ IDEA, Eclipse, Maven, SBT, and Git.
  • Worked on a data pipeline built on top of Spark using Scala.
  • Designed, developed, and created ETL (Extract, Transform, and Load) packages using Python and SQL Server Integration Services (SSIS) to load data from Excel workbooks and flat files into the data warehouse (Microsoft SQL Server).
  • Implemented an application for cleansing and processing terabytes of data using Python and Spark.
  • Developed packages using Python, shell scripting, and XML to automate some of the menial tasks.
  • Used Python to write data into JSON files for testing student item-level information.
  • Created scripts for data modelling and data import and export.
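
A condensed Apache Beam / Dataflow sketch of the Pub/Sub-to-BigQuery flow referenced above; the project, topic, and table identifiers are hypothetical, and runner/Dataflow options are omitted:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical streaming pipeline: Pub/Sub JSON messages into a BigQuery table.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/events")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:analytics.events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
        )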

Environment: Python, GCP, Spark, Hive, Scala, Snowflake, Jupyter Notebook, shell scripting, SQL Server 2016/2012, T-SQL, SSIS, Visual Studio, Power BI, PowerShell.

Confidential

ETL Developer

Responsibilities:

  • Experienced in defining job flows to run multiple Map Reduce and Pig jobs using Oozie.
  • Imported log files into HDFS using Flume and loaded them into Hive tables to query data.
  • Monitored running MapReduce programs on the cluster.
  • Responsible for loading data from UNIX file systems to HDFS.
  • Used HBase-Hive integration and wrote multiple Hive UDFs for complex queries.
  • Involved in writing APIs to read HBase tables, cleanse data, and write to another HBase table.
  • Installed, configured, and maintained Apache Hadoop clusters for application development and major components of Hadoop Ecosystem: Hive, Pig, HBase, Sqoop, Flume, Oozie and Zookeeper.
  • Implemented six nodes CDH4 Hadoop Cluster on CentOS.
  • Importing and exporting data into HDFS and Hive from different RDBMS using Sqoop.
  • Created multiple Hive tables, implemented Partitioning, Dynamic Partitioning and Buckets in Hive for efficient data access.
  • Experienced in design, development, tuning and maintenance of NoSQL database.
  • Wrote MapReduce programs in Python with the Hadoop Streaming API (see the mapper/reducer sketch after this list).
  • Developed unit test cases for Hadoop MapReduce jobs with MRUnit.
  • Excellent experience in ETL analysis, designing, developing, testing and implementing ETL processes including performance tuning and query optimizing of database.
  • Continuously monitored and managed the Hadoop cluster using Cloudera manager and Web UI.
  • Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Used Maven as the build tool and SVN for code management.
  • Worked on writing RESTful web services for the application.
  • Implemented testing scripts to support test driven development and continuous integration.
  • Written multiple Map Reduce programs in Java for data extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other compressed file formats.
  • Experienced in running batch processes using Pig Scripts and developed Pig UDFs for data manipulation according to Business Requirements.
  • Experienced in writing programs using HBase Client API.
  • Involved in loading data into HBase using HBase Shell, HBase Client API, Pig and Sqoop.
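
A small Hadoop Streaming example in Python in the spirit of the MapReduce bullet above (a classic word count; the file names are illustrative). Both scripts read stdin and write tab-separated key/value pairs, as the streaming API expects:

    #!/usr/bin/env python
    # mapper.py - emit "<word>\t1" for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py - sum counts per word; input arrives sorted by key.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

These would typically be submitted with the cluster's hadoop-streaming jar via -mapper, -reducer, -input, and -output arguments; the exact jar location depends on the distribution.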

Environment: Hadoop, MapReduce, HDFS, HBase, Hive, Impala, Pig, Java, SQL, Ganglia, Sqoop, Flume, Oozie, Unix, JavaScript, Maven, Eclipse.

Confidential

Software Engineer

Responsibilities:

  • Developed session facades with stateless session beans for coarse-grained functionality.
  • Worked with Log4J for logging purposes in the project.
  • Implemented Java web services, JSPs, and Servlets for handling data.
  • Designed and developed the user interface using Struts 2.0, JavaScript, XHTML
  • Made use of Struts validation framework for validations at the server side.
  • Created and implemented the DAO layer using Hibernate tools.
  • Implemented custom interceptors and exception handlers for the Struts 2 application.
  • Ajax was used to provide dynamic search capabilities for the application.
  • Involved in the complete SDLC life cycle, design, and development of the application.
  • Followed Agile methodology and was involved in Scrum meetings.
  • Created various java bean classes to capture the data from the UI controls.
  • Designed UML diagrams like class diagrams, sequence diagrams and activity diagrams.
  • Developed business components using service locator, session facade design patterns.

Environment: Java 1.5, JavaScript, Struts 2.0, XML, XSLT, Eclipse, Tomcat, Hibernate 3.0, Ajax, JAXB.
