
Sr Data Engineer Resume


NC

SUMMARY

  • Over 8 years of professional IT experience in Big Data using the Hadoop framework and as a Cloud Data Engineer, covering analysis, design, development, documentation, deployment, and integration using SQL and Big Data technologies as well as Java/J2EE technologies.
  • Expertise in Object Oriented concepts: Object Oriented Analysis (OOA), Object Oriented Design (OOD), and Object Oriented Programming.
  • Expertise in Statistical analysis, Predictive modeling, Text mining, Supervised learning, Unsupervised Learning, and Reinforcement learning.
  • Highly experienced in Data Acquisition processes and handling multiple file formats (CSV, JSON).
  • Worked on building ETL pipelines using GCP (Google Cloud Platform) and Python to create Dashboards in Data Studio.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD and PySpark concepts (a brief PySpark sketch follows this list).
  • Experience in Big Data analytics and developing data models using Hive, Pig, MapReduce, and SQL, with strong data architecting skills for designing data-centric solutions.
  • Highly skilled at using SQL, NumPy, Pandas, and Spark for data analysis, model building, and cognitive design, and at deploying and operating highly available, scalable, and fault-tolerant systems using Amazon Web Services (AWS).
  • Experience in handling Python and Spark contexts when writing PySpark programs for ETL.
  • Implemented and maintained branching and build/release strategies using version control tools such as Git, Subversion, and Bitbucket; experienced in migrating Git repositories to AWS.
  • Strong experience in Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export.
  • Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data.
  • Good understanding of Apache Spark High level architecture and performance tuning pattern.
  • Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines.
  • Proficient in Statistical methodologies such as Hypothesis Testing, ANOVA, Monte Carlo Sampling and Time Series Analysis.
  • Used Python libraries such as NumPy, SciPy, PyTables, SQLAlchemy, Matplotlib, Pandas, Beautiful Soup, and urllib in various tasks.
  • Extensively worked on statistical analysis tools and adept at writing code in Advanced Excel, R, MATLAB, Python.
  • Integrated Python with Big Data and analytics stacks based on Hadoop, Spark, and NoSQL databases like HBase and MongoDB.
  • Highly skilled at writing Subqueries, Stored Procedures, Views, Triggers, Cursors, and Functions on MySQL and PostgreSQL databases.
  • Skilled in developing applications using Amazon Web Services such as EC2, Virtual Private Clouds (VPCs), storage models (EBS, S3, and instance storage), and Elastic Load Balancers (ELBs).
  • Experienced in using AWS Lambda and Glue for creating highly functional Data pipelines.
  • Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, Auto scaling groups.
  • Implemented a Continuous Delivery pipeline with Docker and GitHub.
  • Good experience in handling messaging services using Apache Kafka.
  • Excellent knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.
  • Good understanding and knowledge of NoSQL databases like HBase and Cassandra.
  • Good understanding and knowledge of Microsoft Azure services like HDInsight Clusters, BLOB, ADLS, Data Factory and Logic Apps.
  • Worked with various formats of files like delimited text files, JSON files, and XML files. Mastered in using different columnar file formats like RC, ORC, and Parquet, and have a good understanding of various compression techniques used in Hadoop processing such as Gzip, Snappy, and LZO.
  • Hands on experience building enterprise applications utilizing Java, J2EE, Spring, Hibernate, JSF, JMS, XML, EJB, JSP, Servlets, JSON, JNDI, HTML, CSS and JavaScript, SQL, PL/SQL.
  • Experience using build tools like Maven and version control tools like Git.
  • Experienced in Software Development Lifecycle (SDLC) using SCRUM, Agile methodologies.
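As a rough illustration of the Hive/SQL-to-Spark conversion work noted above, the following is a minimal PySpark sketch; the database, table, and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Minimal sketch: a Hive aggregation query re-expressed as DataFrame transformations.
    # Database, table, and column names (sales_db.orders, region, amount) are hypothetical.
    spark = SparkSession.builder.appName("hive-to-spark-etl").enableHiveSupport().getOrCreate()

    # Equivalent Hive/SQL form:
    #   SELECT region, SUM(amount) AS total_amount
    #   FROM sales_db.orders
    #   WHERE order_date >= '2020-01-01'
    #   GROUP BY region
    orders = spark.table("sales_db.orders")
    totals = (
        orders
        .filter(F.col("order_date") >= "2020-01-01")
        .groupBy("region")
        .agg(F.sum("amount").alias("total_amount"))
    )
    totals.write.mode("overwrite").saveAsTable("sales_db.region_totals")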

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Flume, Impala

Java/J2EE Technologies: Java, Java Beans, J2EE (JSP, Servlets, EJB), Struts, Spring, JDBC.

Hadoop Distribution: Cloudera CDH, Hortonworks HDP, Apache, AWS, Azure, GCP

Cloud Technologies: Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure

Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, HBase, MongoDB

Languages: Python (Pandas, NumPy, Scikit-Learn, Matplotlib, Seaborn, NLTK, TensorFlow), SQL, PL/SQL, Java, Shell scripting, R, PySpark, Pig, HiveQL, Scala, HTML, CSS, JavaScript

IDE and Tools: Eclipse, NetBeans, IntelliJ, Maven, Jupyter Notebook, Spyder

Operating Systems: Windows, Linux, Ubuntu, CentOS, macOS

Version Control: GIT, SVN, CVS

SDLC Methodologies: Agile, Waterfall

PROFESSIONAL EXPERIENCE

Confidential, NC

Sr Data Engineer

Responsibilities:

  • Migrated an entire Oracle database to BigQuery and used Power BI for reporting.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (a DAG sketch follows this list).
  • Experience in GCP Dataproc, GCS, Cloud Functions, and BigQuery.
  • Experience in moving data between GCP and Azure using Azure Data Factory.
  • Experience in building Power BI reports on Azure Analysis Services for better performance.
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Storage, and BigQuery.
  • Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts from enterprise data in BigQuery.
  • Hands-on experience with building data pipelines in Python/PySpark/Hive SQL/Presto.
  • Work related to downloading BigQuery data into pandas or Spark data frames for advanced ETL capabilities (a brief sketch follows this list).
  • Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.
  • Carried out data transformation and cleansing using SQL queries, Python, and PySpark.
  • Involved in importing real-time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
  • Developed Kafka consumer API in Scala for consuming data from Kafka topics.
  • Experienced in writing live real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system, using Scala programming.
  • Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python.
  • Designed and coordinated with the Data Science team in implementing advanced analytical models in the Hadoop cluster over large datasets.
  • Wrote scripts in Hive SQL for creating complex tables with high-performance features like partitioning, clustering, and skewing.
  • Worked on creating a POC for utilizing ML models and Cloud ML for table quality analysis in the batch process.
  • Knowledge of Cloud Dataflow and Apache Beam.
  • Good knowledge in using cloud shell for various tasks and deploying services.
  • Created BigQuery authorized views for row-level security and for exposing data to other teams.
  • Expertise in designing and deploying Hadoop clusters and different Big Data analytic tools, including Pig, Hive, Sqoop, and Apache Spark, with the Cloudera distribution.
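As a hedged illustration of the Airflow-on-GCP pipelines described above, the following is a minimal DAG sketch; it assumes the Airflow Google provider package is installed, and the project, bucket, dataset, and SQL are hypothetical.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

    with DAG(
        dag_id="daily_gcs_to_bigquery",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Load raw CSV files from GCS into a staging table (bucket/dataset names are hypothetical).
        load_raw = GCSToBigQueryOperator(
            task_id="load_raw",
            bucket="example-raw-bucket",
            source_objects=["orders/{{ ds }}/*.csv"],
            destination_project_dataset_table="analytics.orders_staging",
            source_format="CSV",
            write_disposition="WRITE_TRUNCATE",
        )

        # Aggregate the staging data into a reporting table with a BigQuery SQL job.
        transform = BigQueryInsertJobOperator(
            task_id="transform",
            configuration={
                "query": {
                    "query": (
                        "SELECT region, SUM(amount) AS total_amount "
                        "FROM analytics.orders_staging GROUP BY region"
                    ),
                    "destinationTable": {
                        "projectId": "example-project",
                        "datasetId": "analytics",
                        "tableId": "region_totals",
                    },
                    "writeDisposition": "WRITE_TRUNCATE",
                    "useLegacySql": False,
                },
            },
        )

        load_raw >> transform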
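Similarly, for the bullet on downloading BigQuery data into pandas data frames, a minimal sketch using the google-cloud-bigquery client is shown below; the project, dataset, and table names are hypothetical, and credentials are assumed to come from Application Default Credentials.

    from google.cloud import bigquery

    # Minimal sketch: pull a BigQuery result set into a pandas DataFrame for further ETL.
    # Project, dataset, and table names are hypothetical.
    client = bigquery.Client(project="example-project")

    query = """
        SELECT region, SUM(amount) AS total_amount
        FROM `example-project.analytics.orders`
        GROUP BY region
    """
    df = client.query(query).to_dataframe()  # requires the pandas/BigQuery Storage extras
    print(df.head())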

Environment: GCP, Redshift, Spark, Hive, Sqoop, Oozie, HBase, Scala, MapReduce, Azure, Teradata, SQL, Python, R Studio, Excel, PowerPoint, Tableau, Hadoop, PySpark, random forest, Apache Airflow.

Confidential, NC

Sr Data Engineer

Responsibilities:

  • Designed and set up an Enterprise Data Lake to provide support for various use cases including analytics, processing, storing, and reporting of voluminous, rapidly changing data.
  • Responsible for maintaining quality reference data in the source by performing operations such as cleaning and transformation and ensuring integrity in a relational environment, working closely with the stakeholders and solution architect.
  • Designed and developed Security Framework to provide fine grained access to objects in AWS S3 using AWS Lambda, DynamoDB.
  • Set up and worked on Kerberos authentication principals to establish secure network communication on cluster and testing of HDFS, Hive, Pig and MapReduce to access cluster for new users.
  • Performed end-to-end architecture and implementation assessment of various AWS services like Amazon EMR, Redshift, and S3.
  • Worked on AWS Elastic Beanstalk for fast deployment of various applications developed with Java, PHP, Node.js, and Python on familiar servers such as Apache.
  • Evaluated data import-export capabilities, data analysis performance of Apache Hadoop framework.
  • Implemented machine learning algorithms using Python to predict the quantity a user might want to order for a specific item, so suggestions can be made automatically, using Kinesis Firehose and an S3 data lake.
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
  • Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.
  • Provided guidance to the development team working on PySpark as an ETL platform.
  • Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
  • Created PySpark data frames to bring data from DB2 to Amazon S3 (a JDBC-to-S3 sketch follows this list).
  • Performed ETL using AWS Glue.
  • Used AWS Athena to Query directly from AWS S3.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data; used Spark and Sqoop export to export data from Hadoop to Oracle DB.
  • Involved in developing Hive scripts to parse the raw data, populate staging tables, and store the refined data in partitioned tables in Hive.
  • Imported data from different sources like HDFS/HBase into Spark RDDs and performed computations using Spark to generate the output response.
  • Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce the cost of EC2 resources (a minimal sketch follows this list).
  • Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS packages).
  • Coded Teradata BTEQ scripts to load, transform data, fix defects like SCD 2 date chaining, cleaning up duplicates.
  • Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.
  • Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.
  • Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near real-time log analysis and monitoring of end-to-end transactions.
  • Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
  • Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
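As a hedged sketch of the DB2-to-S3 movement mentioned above, the following PySpark job reads a DB2 table over JDBC and lands it in S3 as Parquet; the JDBC URL, credentials, table, and bucket are hypothetical, and the DB2 JDBC driver jar is assumed to be on the Spark classpath.

    from pyspark.sql import SparkSession

    # Minimal sketch: extract a DB2 table and write it to S3 as Parquet.
    # Connection details, table, and bucket names are hypothetical.
    spark = SparkSession.builder.appName("db2-to-s3").getOrCreate()

    source = (
        spark.read.format("jdbc")
        .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .option("dbtable", "SCHEMA1.CUSTOMERS")
        .option("user", "db2user")
        .option("password", "db2password")
        .load()
    )

    # Land the extract in the raw zone of the data lake for downstream Glue/Athena use.
    source.write.mode("overwrite").parquet("s3a://example-data-lake/raw/customers/")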
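And for the Lambda-based AMI cleanup, a simplified Boto3 handler is sketched below; real cleanup logic would also need pagination and checks against launch templates, Auto Scaling groups, and snapshot retention, so treat this as an illustration only.

    import boto3

    def lambda_handler(event, context):
        """Deregister self-owned AMIs not referenced by any instance in this region."""
        ec2 = boto3.client("ec2")

        # Collect AMI IDs still referenced by instances (simplified: no pagination).
        in_use = set()
        for reservation in ec2.describe_instances()["Reservations"]:
            for instance in reservation["Instances"]:
                in_use.add(instance["ImageId"])

        # Deregister self-owned images that are no longer referenced.
        removed = []
        for image in ec2.describe_images(Owners=["self"])["Images"]:
            if image["ImageId"] not in in_use:
                ec2.deregister_image(ImageId=image["ImageId"])
                removed.append(image["ImageId"])

        return {"deregistered": removed}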

Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, SSRS, Tableau

Confidential, NYC

Big Data/Hadoop Developer

Responsibilities:

  • Developed a data pipeline to ingest customer behavioral data and financial histories into a Hadoop cluster for analysis.
  • Responsible for implementing a generic framework to handle different data collection methodologies from the client's primary data sources, validate and transform the data using Spark, and load it into an S3 bucket.
  • Involved in all phases of installation and upgrades of the Hadoop big data platform, and implemented security for the platform.
  • Designed the sequence diagrams to depict the data flow into Hadoop.
  • Involved in importing and exporting data between HDFS and Relational Systems like Oracle, MySQL and DB2 using Sqoop.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark.
  • Helped Application and Operations team to troubleshoot the performance issues.
  • Implemented Partitioning, Dynamic Partitions and bucketing in HIVE for efficient data access.
  • Created final tables in Parquet format and used Impala to query and manage the Parquet tables.
  • Implemented data Ingestion and handling clusters in real time processing using Apache Kafka.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries.
  • Collected data using Spark from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS (a PySpark sketch follows this list).
  • Explored the usage of Spark for improving the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, and Spark YARN.
  • Developed Spark Code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Involved in converting Hive/SQL queries into Spark Transformations using Spark RDDs and Scala.
  • Worked on the Spark SQL and Spark Streaming modules of Spark and used Scala and Python to write code for all Spark use cases.
  • Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
  • Migrated historical data to S3 and developed a reliable mechanism for processing the incremental updates.
  • Used the Oozie workflow engine to manage independent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, and Sqoop, as well as system-specific jobs.
  • Monitored and debugged Hadoop jobs/applications running in production.
  • Worked on providing user support and application support on Hadoop infrastructure.
  • Designed, developed and created ETL (Extract, Transform and Load) packages using Python to load data into Data warehouse tools (Teradata) from databases such as Oracle SQL Developer, MS SQL Server.
  • Used Python Pandas module to read CSV files to obtain member data and store the data in data structures.
  • Automated all the Python jobs using Crontab scheduler.
  • Supported the testing team on Hadoop Application Testing.
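As a hedged illustration of the S3 collection and aggregation work described above, the following is a minimal batch-style PySpark sketch; the bucket, paths, and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Minimal sketch: read learner activity JSON from S3, aggregate it, and persist to HDFS.
    # Bucket, paths, and column names are hypothetical.
    spark = SparkSession.builder.appName("learner-model-etl").getOrCreate()

    events = spark.read.json("s3a://example-bucket/learner-events/")

    learner_model = (
        events
        .filter(F.col("event_type").isNotNull())
        .groupBy("learner_id", "course_id")
        .agg(
            F.count("*").alias("event_count"),
            F.max("event_time").alias("last_activity"),
        )
    )

    learner_model.write.mode("overwrite").parquet("hdfs:///data/common_learner_model/")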

Environment/Skills: Hadoop, HDFS, Pig, Hive, Spark, MapReduce, Python, Cloudera CDH 4.6, Sqoop, Oozie.

Confidential

Hadoop Developer

Responsibilities:

  • Responsible for loading customers' data and event logs into HBase using the Java API (a Python-side illustration follows this list).
  • Created HBase tables to store variable data formats of input data coming from different portfolios.
  • Involved in adding huge volumes of data in rows and columns to store data in HBase.
  • Responsible for architecting Hadoop clusters with CDH4 on CentOS and managing them with Cloudera Manager.
  • Involved in initiating and successfully completing a Proof of Concept on Flume for pre-processing.
  • Used Flume to collect log data from different resources and transfer the data to Hive tables, using different SerDes to store it in JSON, XML, and Sequence file formats.
  • Used Hive to find correlations between customer's browser logs in different sites and analyzed them.
  • End-to-end performance tuning of Hadoop clusters and Hadoop MapReduce routines against very large data sets.
  • Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
  • Created user accounts and gave users access to the Hadoop cluster.
  • Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
  • Developed Pig UDFs to pre-process the data for analysis.
  • Loaded files to Hive and HDFS from MongoDB and Solr.
  • Monitored Hadoop cluster job performance and performed capacity planning and managed nodes on Hadoop cluster.
  • Responsible for using Oozie to control workflow.
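The bullets above describe loading customer data and event logs into HBase via the Java client API; purely as an illustration, a rough Python-side equivalent using the happybase Thrift client (not mentioned in the original) could look like the sketch below, with the host, table, column families, and row key all hypothetical.

    import happybase

    # Illustration only: write a customer event into an HBase table over Thrift.
    # Host, table name, column families, and row key are hypothetical.
    connection = happybase.Connection("hbase-thrift-host")
    table = connection.table("customer_events")

    table.put(
        b"customer123|2015-06-01T12:00:00Z",
        {
            b"event:type": b"page_view",
            b"event:url": b"/products/42",
            b"profile:segment": b"retail",
        },
    )

    connection.close()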

Environment/Skills: Hadoop 2.0, HDFS, Pig 0.11, Hive 0.12.0, MapReduce 2.5.2, Sqoop, Linux, Flume 1.94, Kafka 0.8.1, HBase 0.94.6, CDH4, Oozie 3.3.0

Confidential

Java Developer

Responsibilities:

  • Designed and implemented the user interface using JSP, Servlets, JavaScript, HTML, CSS and AJAX.
  • Developed Restful Microservices using Spring Rest and MVC, for OSS services.
  • Designed and developed Microservices business components using Spring Boot.
  • Used microservices with Spring Boot, with services interacting through a combination of REST calls and an MQ message broker.
  • Used Eureka for discovery of each microservice and to send transactions to them.
  • Implemented Action Classes and Action Forms using Struts Framework in Payroll module.
  • Used Swagger for manual testing and documentation of microservices.
  • Created Docker images using the Spotify Maven plugin for deployment of microservices.
  • Consumed REST-based microservices with RestTemplate, based on RESTful APIs.
  • Experienced with inter-service communication between microservices.

Environment/Skills: J2EE, JDK, Servlets, JSP, JSTL, HTML, CSS, jQuery, Struts, EJB, Spring, Swing, JMS, iBATIS, Rational Rose, LDAP, JBoss, PL/SQL, MySQL, Toad, EMC Documentum - enterprise content management (ECM), JIRA, UNIX Shell Scripting, Linux, CVS, NetBeans, ANT, ClearCase, Selenium, web services, UNIX, and Windows.
