
Sr. Big Data/Data Engineer Resume


SUMMARY:

  • Around 9 years of professional experience in project development, implementation, deployment, and maintenance with Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytics solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.
  • Experience in working with different Hadoop distributions like CDH and Hortonworks.
  • Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Proficient in various Big Data technologies including Hadoop, Apache NiFi, Hive Query Language, the HBase NoSQL database, Sqoop, Spark, Scala, Oozie, and Pig, as well as Oracle Database and Unix shell scripting.
  • Implemented Enterprise Data Lakes using Apache NiFi.
  • Designed and developed microservice components for the business using Spring Boot.
  • Experience developing Pig Latin and HiveQL scripts for data analysis and ETL, extending the default functionality by writing User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) for custom, data-specific processing (see the sketch after this list).
  • Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
  • Good experience in creating data ingestion Pipelines, Data Transformations, Data Management, Data Governance, and real time streaming at an enterprise level.
  • Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached and Redis).
  • Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
  • Experience using SDLC methodologies such as Waterfall and Agile Scrum for design and development.
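
The following is a minimal PySpark sketch of the kind of custom-function work mentioned above. Hive UDFs and UDAFs on these projects were typically written in Java and HiveQL; this sketch expresses the same idea with a Python UDF registered for Spark SQL. The function, column, and table names (mask_ssn, raw_events, masked_events) are hypothetical.

```python
# Hedged sketch only: Hive UDFs/UDAFs were typically written in Java on these
# projects; this shows the equivalent idea as a Python UDF usable from Spark SQL.
# Table and column names (raw_events, ssn, masked_events) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("udf-sketch")
         .enableHiveSupport()
         .getOrCreate())

def mask_ssn(value):
    """Keep only the last four characters of an SSN-like string."""
    if value is None:
        return None
    return "***-**-" + value[-4:]

# Register the function so it can be called from Spark SQL / HiveQL-style queries.
spark.udf.register("mask_ssn", mask_ssn, StringType())

spark.sql("""
    SELECT id, mask_ssn(ssn) AS ssn_masked
    FROM raw_events
""").write.mode("overwrite").saveAsTable("masked_events")
```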

PROFESSIONAL EXPERIENCE:

Confidential

Sr. Big Data/Data Engineer

Responsibilities:

  • Developed data pipelines using Spark and PySpark; analyzed SQL scripts and designed solutions to implement them in PySpark. Developed data processing tasks such as reading data from external sources, merging data, performing data enrichment, and loading into target data destinations.
  • Used Pandas, NumPy, and Spark in Python for developing data pipelines; performed data cleaning, feature scaling, and feature engineering with the Pandas and NumPy packages. Took part in logical data analysis and data modeling JAD sessions and communicated data-related standards.
  • Worked collaboratively to manage build-outs of large data clusters and real-time streaming with Spark. Implemented the Kafka-to-Hive streaming process flow and batch loading of data into MongoDB using Apache NiFi, and implemented end-to-end data flows in Apache NiFi.
  • Responsible for loading data pipelines from web servers using Kafka and the Spark Streaming API; used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for large data volumes (see the streaming sketch after this list).
  • Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework; implemented Spark with Scala and Spark SQL for faster testing and processing of data.
  • Processed data using MapReduce and YARN, worked on Kafka as a proof of concept for log processing, and monitored the Hive metastore and cluster nodes with Hue.
  • Created data pipelines with processor groups and multiple processors in Apache NiFi for flat-file and RDBMS sources as part of a POC on Amazon EC2; created AWS EC2 instances and used JIT servers.
  • Migrated an entire Oracle database to BigQuery and used Power BI for reporting. Built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators (see the DAG sketch after this list); experienced with GCP Dataproc, GCS, Cloud Functions, and BigQuery, and used the Cloud Shell SDK in GCP to configure the Dataproc, Storage, and BigQuery services.
  • Developed various UDFs in MapReduce and Python for Pig and Hive; handled data integrity checks using Hive queries, Hadoop, and Spark. Built Hadoop solutions for big data problems using MR1 and MR2 with S3.
  • Performed transformations and actions on RDDs and Spark Streaming data with Scala, and implemented machine learning algorithms using Spark with Python.
  • Defined job flows and developed simple to complex MapReduce jobs as required, optimizing MapReduce jobs to use HDFS efficiently through various compression mechanisms.
  • Developed Pig UDFs for manipulating data according to business requirements, worked on custom Pig loaders, and developed Pig Latin scripts for the analysis of semi-structured data.
  • Responsible for handling streaming data from web server console logs; installed the Oozie workflow engine to run multiple Hive and Pig jobs.
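
As a companion to the Kafka and Spark Streaming work above, here is a hedged Spark Structured Streaming sketch of a Kafka-to-warehouse ingestion flow. The broker address, topic, checkpoint path, JSON schema, and target table (analytics.web_events) are placeholders rather than the project's actual settings, and writeStream.toTable assumes Spark 3.1 or later.

```python
# Hedged sketch of a Kafka -> Spark Structured Streaming -> warehouse flow.
# Broker, topic, checkpoint path, schema, and target table are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (SparkSession.builder
         .appName("kafka-ingest")
         .enableHiveSupport()
         .getOrCreate())

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
       .option("subscribe", "web_server_logs")              # placeholder topic
       .load())

# Kafka delivers bytes; parse the value column as JSON into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Micro-batch append into a warehouse table; the checkpoint location is assumed.
query = (events.writeStream
         .outputMode("append")
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/web_server_logs")
         .toTable("analytics.web_events"))
query.awaitTermination()
```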
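
The Airflow-on-GCP pipelines mentioned above could be structured along these lines; this is a sketch under stated assumptions, not the project's actual DAG. The project ID, region, bucket paths, cluster name, and dataset/table names are placeholders, and the operators assume the apache-airflow-providers-google package is installed.

```python
# Hedged sketch of an Airflow DAG for GCP ETL (Dataproc transform, BigQuery load).
# Project ID, region, cluster, paths, and table names are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

PROJECT_ID = "my-gcp-project"   # placeholder
REGION = "us-central1"          # placeholder

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Run a PySpark transformation on Dataproc; the script path is assumed.
    transform = DataprocSubmitJobOperator(
        task_id="transform_on_dataproc",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": "etl-cluster"},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform.py"},
        },
    )

    # Materialize the transformed staging data into the reporting dataset.
    load = BigQueryInsertJobOperator(
        task_id="load_to_bigquery",
        project_id=PROJECT_ID,
        configuration={
            "query": {
                "query": "SELECT * FROM `my-gcp-project.staging.events`",
                "destinationTable": {
                    "projectId": PROJECT_ID,
                    "datasetId": "analytics",
                    "tableId": "events",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    transform >> load
```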

Confidential

Data Engineer

Responsibilities:

  • Experienced in development on the Cloudera distribution. Designed and developed ETL integration patterns using Python on Spark and optimized PySpark jobs to run on secured clusters for faster data processing.
  • Developed Spark scripts using Python and Scala shell commands as required, and used Python for SQL/CRUD operations in the database and for file extraction, transformation, and generation.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables (see the CSV-to-ORC sketch after this list).
  • Designed and developed Apache NiFi jobs to move files from transaction systems into the data lake raw zone; analyzed user requirements and implemented the use cases in Apache NiFi.
  • Proficient with big data tools such as Hadoop, Azure Data Lake, and AWS Redshift; read and wrote multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
  • Worked with data modeling tools such as Erwin, PowerDesigner, and ER/Studio. Managed the data pipelines and data lake, and worked with the Snowflake data warehouse.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster processing of data, designed a custom Spark REPL application to handle similar datasets, and used Hadoop scripts for HDFS data loading and manipulation.
  • Performed Hive test queries on local sample files and HDFS files, and developed Hive queries to analyze data and generate results.
  • Used AWS services such as EC2 and S3 for small datasets, managed user and AWS access with AWS IAM and KMS, and deployed microservices into AWS EC2. Strong working knowledge of Kubernetes and Docker; developed the application in the Eclipse IDE.
  • Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing; analyzed the Hadoop cluster and various big data analytics tools including Pig, Hive, HBase, Spark, and Sqoop. Exported data from HDFS to RDBMS via Sqoop for business intelligence, visualization, and user report generation.
  • Used Scala to write code for all Spark use cases; analyzed user request patterns and implemented performance optimizations including partitions and buckets in HiveQL (see the partitioning and bucketing sketch after this list). Assigned names to columns using the case class option in Scala.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala (initially done in PySpark), and converted HQL scripts into Spark transformations using Spark RDDs with support of Python and Scala.
  • Developed multiple Spark SQL jobs for data cleaning, created Hive tables and worked on them using HiveQL, assisted in loading large sets of structured, semi-structured, and unstructured data to HDFS, and developed Spark SQL to load tables into HDFS and run select queries on top.
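
A minimal sketch, under assumed paths and column names, of the PySpark pattern described above for loading CSV files with differing schemas into a Hive ORC table: columns missing from older file versions are filled with nulls so every file aligns to one target layout.

```python
# Hedged sketch: mixed-schema CSV drops aligned to one layout and appended to ORC.
# Paths, column names, and the database/table names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col

spark = (SparkSession.builder
         .appName("csv-to-orc")
         .enableHiveSupport()
         .getOrCreate())

TARGET_COLUMNS = ["customer_id", "order_id", "amount", "order_date"]  # assumed layout

df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("hdfs:///data/raw/orders/*.csv"))          # placeholder landing path

# Fill columns missing from a given file with typed nulls so the select below
# always matches the target layout (real code would cast to the target types).
for c in TARGET_COLUMNS:
    if c not in df.columns:
        df = df.withColumn(c, lit(None).cast("string"))

(df.select([col(c) for c in TARGET_COLUMNS])
   .write
   .mode("append")
   .format("orc")
   .saveAsTable("warehouse.orders"))
```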
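
The HiveQL partitioning and bucketing optimization mentioned above can be expressed through the PySpark writer API as shown below. The table names, partition column, bucket column, and bucket count are illustrative choices rather than the original project's settings.

```python
# Hedged sketch of partitioned, bucketed table layout via the PySpark writer API.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-layout")
         .enableHiveSupport()
         .getOrCreate())

orders = spark.table("warehouse.orders")   # placeholder source table

# Partition by a low-cardinality date column and bucket by the join key so
# downstream HiveQL/Spark SQL queries can prune partitions and use bucketed joins.
(orders.write
 .mode("overwrite")
 .partitionBy("order_date")
 .bucketBy(32, "customer_id")
 .sortBy("customer_id")
 .saveAsTable("warehouse.orders_bucketed"))
```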

Confidential

Big Data Engineer

Responsibilities:

  • Used Sqoop to import and export data between Oracle/PostgreSQL and HDFS for analysis. Migrated existing MapReduce programs to Spark models using Python. Migrated data from the data lake (Hive) into an S3 bucket and performed data validation between the data lake and the S3 bucket.
  • Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data, and designed batch processing jobs with Apache Spark that ran roughly ten-fold faster than the corresponding MR jobs.
  • Used Kafka for real-time data ingestion: created separate topics and read data from the different topics in Kafka. Converted HQL scripts into Spark transformations using Spark RDDs with support of Python and Scala.
  • Moved data from the S3 bucket to the Snowflake data warehouse for generating reports, and wrote Hive queries for data analysis to meet business requirements.
  • Migrated an existing on-premises application to AWS and used the AWS cloud for infrastructure provisioning and configuration.
  • Developed Pig Latin scripts to extract data from the web server output files and load it into HDFS, and used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Created many Spark UDFs and Hive UDAFs for functions not preexisting in Hive or Spark SQL, and converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Implemented different performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins (see the broadcast-join sketch after this list). Good knowledge of Spark platform parameters such as memory, cores, and executors.
  • Used ZooKeeper in the cluster to provide concurrent access to Hive tables with shared and exclusive locking.
  • Configured the monitoring solutions for the project using Datadog for infrastructure and ELK for application logging.
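
The map-side join and distributed-cache optimization noted above corresponds to a broadcast join in the Spark DataFrame API; the sketch below shows the idea with placeholder table names and join keys.

```python
# Hedged sketch: broadcast a small dimension table so the large fact table
# is never shuffled, the DataFrame equivalent of a Hive map-side join.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (SparkSession.builder
         .appName("broadcast-join")
         .enableHiveSupport()
         .getOrCreate())

fact = spark.table("warehouse.transactions")     # large fact table (placeholder)
dim = spark.table("warehouse.store_lookup")      # small lookup table (placeholder)

# broadcast() hints Spark to ship the small table to every executor.
enriched = fact.join(broadcast(dim), on="store_id", how="left")

enriched.write.mode("overwrite").saveAsTable("warehouse.transactions_enriched")
```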

Environment: Linux, Apache Hadoop Framework, HDFS, YARN, Hive, HBase, AWS (S3, EMR), Scala, Spark, Sqoop.

Confidential

Hadoop Developer

Responsibilities:

  • Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleansing and preprocessing. Loaded data from the UNIX file system to HDFS. Installed and configured Hive and wrote Hive UDFs. Imported and exported data into HDFS and Hive using Sqoop.
  • Used Cassandra CQL and Java APIs to retrieve data from Cassandra tables.
  • Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, managing and reviewing data backups, and managing and reviewing Hadoop log files. Worked hands-on with the ETL process.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS. Extracted data from Teradata into HDFS using Sqoop, analyzed the data with Hive queries and Pig scripts to understand user behavior, and exported the analyzed patterns back into Teradata using Sqoop.
  • Continuously monitored and managed the Hadoop cluster through Cloudera Manager, and installed the Oozie workflow engine to run multiple Hive jobs.
  • Developed Hive queries to process the data and generate data cubes for visualization (see the sketch after this list).
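
The data-cube generation above was done with Hive queries; purely as an illustration, the sketch below expresses the equivalent roll-up with PySpark's cube() over a placeholder Hive table and placeholder columns.

```python
# Hedged sketch: roll-up aggregation over every combination of grouping columns,
# matching HiveQL's GROUP BY ... WITH CUBE. Table and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_, count

spark = (SparkSession.builder
         .appName("data-cube")
         .enableHiveSupport()
         .getOrCreate())

activity = spark.table("warehouse.user_activity")    # placeholder Hive table

cube_df = (activity.cube("region", "channel", "activity_date")
           .agg(count("*").alias("events"),
                sum_("duration").alias("total_duration")))

cube_df.write.mode("overwrite").saveAsTable("warehouse.activity_cube")
```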

Environment: Hadoop, MapReduce, HDFS, UNIX, Hive, Sqoop, Cassandra, ETL, Pig Script, Cloudera, Oozie.

Confidential

Data Analyst

Responsibilities:

  • Documented the complete process flow to describe program development, logic, testing, implementation, application integration, and coding. Recommended structural changes and enhancements to systems and databases, and conducted design and technical reviews with other project stakeholders.
  • Took part in the complete project life cycle, from requirements through production support, and created test plan documents for all back-end database modules.
  • Used MS Excel, MS Access, and SQL to write and run various queries, and worked extensively on creating tables, views, and SQL queries in MS SQL Server.
  • Worked with internal architects, assisting in the development of current- and target-state data architectures. Coordinated with business users to design new reporting needs in an appropriate, effective, and efficient way based on the existing functionality. Remained knowledgeable in all areas of business operations to identify system needs and requirements.

Environment: SQL, SQL Server, MS Office, and MS Visio.
