
Sr. Big Data Engineer Resume


Richmond, VA

SUMMARY

  • 7 years of professional experience in analysis, design, development, and implementation as a Data Engineer.
  • Excellent understanding of Hadoop architecture and its daemons, such as JobTracker, TaskTracker, NameNode, DataNode, and ResourceManager.
  • Knowledge of designing and implementing complete end-to-end Hadoop infrastructure using MapReduce, Hive, Pig, Sqoop, Oozie, Spark, HBase, ZooKeeper, and Airflow.
  • Experienced in HDFS data storage and running MapReduce jobs.
  • Proficient in data migration from various databases to HDFS using Sqoop.
  • Worked extensively with Spark in Python and Scala for cluster-based analytics, with Spark installed on top of Hadoop. Built advanced analytical applications using Spark with Hive, SQL, and Oracle.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
  • Worked with distributions such as Cloudera and Hortonworks on both on-premises and cloud clusters.
  • Improved the performance of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and YARN.
  • Expertise in writing custom Kafka consumer code and modifying existing producer code in Python to push data to Spark Streaming jobs.
  • Hands-on experience integrating, configuring, and maintaining Hadoop clusters in multiple environments, including production.
  • Solid knowledge of Apache Kafka and Apache Storm for building data platforms, pipelines, and storage systems, and of search technologies such as Elasticsearch.
  • Hands-on experience automating Sqoop incremental imports and scheduling jobs with Oozie.
  • Skilled at implementing custom Kafka encoders for custom input formats to load data into partitions.
  • Designed and developed a core data pipeline in Java and Scala using the Kafka and Spark frameworks.
  • Expertise in using Kafka for log aggregation with low-latency processing and distributed data consumption, and in widely used Enterprise Integration Patterns (EIPs).
  • Used Stash, Bitbucket, and GitHub extensively for source control.
  • Proficient in handling and writing SQL queries.

TECHNICAL SKILLS

Programming Languages: Python, Scala (Functional), Java, SQL, PySpark.

Hadoop ecosystem: HDFS, MapReduce, YARN, DataNode, NodeManager, Sqoop, HBase, Hive, Cassandra, Spark, Kafka, Scala, ZooKeeper.

Cloud services: Amazon EC2, Amazon EMR, AWS Lambda, AWS Glue, Amazon S3, Amazon Athena, Azure Data Lake, Azure Data Factory, Azure Databricks, Azure SQL Database, Azure SQL Data Warehouse.

Web Technologies: JDBC, HTML, JavaScript, React JS, and CSS.

Databases: Oracle, MS SQL Server, MySQL, HBase, Cassandra, MongoDB.

Version Control Tools: GitHub, Bitbucket.

Operating Systems: Windows, Linux, Unix, macOS.

PROFESSIONAL EXPERIENCE

Sr. Big Data Engineer

Confidential, Richmond, VA

Responsibilities:

  • Participated in the full project life cycle, including design, development, and implementation of verification for data received in the data lake.
  • Accessed data from the data lake and loaded it into AWS S3.
  • Used Spark with Python to extract XML data and convert it into Hive tables.
  • Wrote Python scripts in notebooks to read tables as PySpark DataFrames for analysis.
  • Used joins to integrate tables originating from different sources.
  • Used PySpark to partition and bucket the data for more efficient processing in larger stages.
  • Implemented user-defined functions (UDFs) on DataFrames for analyzing and processing the data.
  • Defined and used window functions for aggregation.
  • Used ranking functions (rank, dense_rank, percent_rank, ntile, row_number) and aggregate functions (sum, min, max) in Spark.
  • Used Talend Open Studio for data integration to combine, convert, and update data held in various sources.
  • Performed real-time analysis of transaction data with Spark Streaming and the Apache Cassandra database.
  • Stored resultant tables and DataFrames as PSV files in AWS S3.
  • Added support for AWS S3 and RDS to host static/media files and the database in the Amazon cloud.
  • Created custom Docker container images, and tagged and pushed the images.
  • Created PL/SQL views, stored procedures, database triggers and packages.
  • Performed unit and integration testing using pytest.
  • Used the Selenium library to write test automation that simulated submitting different requests from multiple browsers to the web application.

Environment: Python 3.4, Spark 2.4.5, Hive, AWS S3, Apache Cassandra, Docker, Oracle, MySQL.

Big Data Engineer

Confidential, Seattle, WA

Responsibilities:

  • Installed and configured Hive, HDFS, and NiFi, and implemented an HDP Hadoop cluster. Assisted with performance tuning and monitoring.
  • Loaded and transformed large sets of structured data from router locations into the EDW using a NiFi data pipeline flow.
  • Developed PySpark and Spark SQL code for faster testing and processing of data.
  • Worked with data serialization formats (Parquet, ORC, Avro, JSON, and CSV) to convert complex objects into sequences of bits.
  • Created Hive tables to load large sets of structured data coming from WADL after transformation of the raw data.
  • Created reports for the BI team, using Sqoop to import data into HDFS and Hive.
  • Developed custom NiFi processors to convert data from XML to JSON and filter out broken files.
  • Created Hive queries to spot trends by comparing fresh data with EDW reference tables and historical metrics.
  • Used PySpark to convert pandas DataFrames to Spark DataFrames.
  • Used the KafkaUtils module in PySpark to create an input stream that pulls messages directly from the Kafka broker.
  • Partitioned Hive tables and ran scripts in parallel to reduce their run time.
  • Worked extensively on end-to-end data pipeline orchestration using NiFi.
  • Implemented business logic by writing UDFs in Spark Scala and configuring cron jobs.
  • Provided design recommendations and resolved technical problems.
  • Assisted with data capacity planning and node forecasting.
  • Involved in performance tuning and troubleshooting of the Hadoop cluster.
  • Developed HCatalog Streaming code to stream JSON data into Hive (EDW) continuously.
  • Administered Hive and Kafka, installing updates, patches, and upgrades.
  • Supported code/design analysis, strategy development and project planning.
  • Managed and reviewed Hadoop log files.
  • Evaluated the suitability of Hadoop and its ecosystem for the project and implemented several proof-of-concept applications for eventual adoption under the Hadoop initiative.

Environment: Spark, Scala, Hive, Maven, Microservices, GitHub, Splunk, PySpark, Tableau, Sqoop, Java 1.8, Linux, Aqua Data Studio, NiFi, Google Cloud, J2EE, HDFS, Kafka, MySQL.

Data Analyst

Confidential

Responsibilities:

  • Worked on building a centralized data lake on AWS using primary services such as S3, EMR, Redshift, and Athena.
  • Worked on migrating datasets and ETL workloads from on-premises systems to AWS cloud services.
  • Built a series of Spark applications and Hive scripts to produce analytical datasets needed by digital marketing teams.
  • Worked extensively on building and automating data ingestion pipelines and moving terabytes of data from existing data warehouses to cloud.
  • Worked extensively on fine-tuning Spark applications and providing production support for various pipelines.
  • Worked closely with business and data science teams and ensured all requirements were translated accurately into our data pipelines.
  • Worked on writing different RDD (Resilient Distributed Datasets) transformations and actions using Scala and Python.
  • Executed multiple ETL jobs using AWS Step Functions and Lambda; also used AWS Glue for loading and preparing data for customer analytics.
  • Used Python packages including pytest, pyodbc, NumPy, openpyxl, MySQLdb, sqlite3, and snowflake-connector-python.
  • Administered and monitored a multi-datacenter Cassandra cluster, based on an understanding of the Cassandra architecture.
  • Worked on the full spectrum of data engineering pipelines: data ingestion, data transformation, and data analysis/consumption.
  • Developed Spark scripts using Python shell commands as required.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Developed a series of data ingestion jobs in Scala for collecting data from multiple channels and external applications.
  • Configured various big data workflows to run on top of Hadoop; these workflows comprise heterogeneous jobs such as MapReduce.
  • Evaluated existing server and virtualization environments for needed and useful upgrade opportunities.
  • Automated infrastructure setup, including launching and terminating EMR clusters.
  • Created Hive external tables on top of datasets loaded in AWS S3 buckets and wrote Hive scripts to produce a series of aggregated datasets for downstream analysis.
  • Created Kafka producers using the Kafka Java Producer API to connect to an external REST live-stream application and produce messages to a Kafka topic.

Environment: AWS S3, EMR, Lambda, Redshift, Athena, Glue, Spark, SQL, PySpark, Scala, Python, Java, Hive, Kafka.

Java Developer

Confidential

Responsibilities:

  • Professional experience in the development and deployment of various object-oriented and web-based enterprise applications using Java/J2EE technologies, working across the complete System Development Life Cycle (SDLC).
  • Designed and developed the UI of the website using HTML, Spring Boot, React JS, CSS, and JavaScript.
  • Used Spring Boot and Java for the backend, React JS for the frontend, and MySQL as the database.
  • Designed and developed a data management system using MySQL. Built the application using Spring JPA for database persistence.
  • Expertise in application and web servers such as IBM WebSphere, WebLogic, JBoss, and Tomcat.
  • On the backend, persisted the data shown on screen to the database for a particular user after an Excel sheet upload.
  • Created a dashboard for managers to compare allocation details based on their monthly time.
  • Worked on the upload feature, which allows the user to upload an Excel sheet and converts the data into a readable format with the help of React JS.
  • Created a framework based on Spring Boot concepts, using Spring JPA for database persistence.
  • Experienced in developing complex queries, stored procedures, packages, and views in MySQL.
  • Ensured availability and security of the database in a production environment.
  • Configured, tuned, and maintained MySQL database servers.
  • Implemented monitoring and established best practices for using React libraries.
  • Effectively communicated with the external vendors to resolve queries.

Environment: Java, JavaScript, Spring Boot, CSS, SQL, MySQL, React JS, Apache web server, IBM WebSphere.
