Senior Big Data Engineer Resume
IN
SUMMARY
- 8+ years of IT industry experience in the design, development, maintenance, and support of Big Data applications using Java, Python, Scala, AWS, and Hadoop ecosystem tools such as HDFS, MapReduce, Hive, Sqoop, Spark, and Kafka.
- Experience in processing big data on the Hortonworks, Cloudera, and Databricks platforms.
- Excellent understanding of Hadoop architecture and underlying framework including storage management.
- Good working experience with Python, developing a custom framework for generating rules (similar to a rules engine); developed Hadoop Streaming jobs in Python to integrate applications that expose Python APIs.
- Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, Spark SQL for Data Mining, Data Cleansing, Data Munging.
- Experience with Hadoop applications such as administration, configuration management, monitoring, debugging, and performance tuning.
- Developed Python and Pyspark programs for data analysis.
- Experience in extracting source data from Sequential files, XML files, CSV files, transforming and loading it into the target Data warehouse.
- Involved in full life-cycle projects using object-oriented methodologies and programming (OOP).
- Expertise in using Hadoop ecosystem components such as MapReduce, Hive, ZooKeeper, Sqoop, and Spark for data storage and analysis.
- Good working experience using Sqoop to import data from RDBMS into HDFS and vice versa.
- Good experience with Spark SQL, Spark Streaming, and the core Spark API, using Spark features to build data pipelines (see the sketch at the end of this summary).
- Experience in developing Hive Query Language scripts for data analytics.
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other services of the AWS family.
- Hands-on expertise with AWS storage services such as S3, EFS, and Storage Gateway, and some familiarity with Snowball.
- Experience in cluster coordination using ZooKeeper; worked with file formats such as Text, ORC, Avro, and Parquet and compression codecs such as Gzip and Zlib.
- Experience in implementing projects in Agile, SCRUM, TDD and Waterfall methodologies.
- Experience in writing complex SQL queries, creating reports and dashboards.
- Experience in core Java, JDBC and proficient in using Java APIs for application development.
- Trained junior developers on business requirements and provided documentation of the processes implemented across the team.
- Solid experience working with CSV, text, sequential, Avro, Parquet, ORC, and JSON data formats.
- Experience in developing, supporting, and maintaining ETL (Extract, Transform, Load) processes using Talend Integration Suite.
- Extensively worked on building RESTful web services and APIs.
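Illustrative sketch (assumption-based, not project code): a minimal PySpark example of the file-format and Spark SQL pipeline work summarized above. The paths, column names, and view name are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("format-etl-sketch").getOrCreate()

# Hypothetical CSV source; the same read/transform/load pattern applies to XML and sequential inputs.
orders = (spark.read
               .option("header", "true")
               .option("inferSchema", "true")
               .csv("/data/landing/orders.csv"))

# Basic cleansing/munging: drop rows missing the key, normalize the date, derive a column.
curated = (orders.dropna(subset=["order_id"])
                 .withColumn("order_date", F.to_date("order_date"))
                 .withColumn("order_year", F.year("order_date")))

# Spark SQL view for ad hoc analysis alongside the DataFrame API.
curated.createOrReplaceTempView("orders_curated")
spark.sql("SELECT order_year, COUNT(*) AS cnt FROM orders_curated GROUP BY order_year").show()

# Load the conformed data into the warehouse zone as Parquet.
curated.write.mode("overwrite").parquet("/data/warehouse/orders/")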
TECHNICAL SKILLS
Hadoop Technologies: Hadoop, MapReduce, Pig, Hive, Sqoop, Flume, HBase, PySpark, Spark, Kafka.
Scripting Languages: Shell Scripting, Python, Scala.
Hadoop Distributions: Cloudera CDH, Hortonworks HDP, Amazon EMR
Databases: Oracle, MySQL, SQL Server, MongoDB, ETL, NoSQL
Languages: Java, Python, Scala
Development Methodologies: Agile, Scrum
Version Control System: Git, Bitbucket
Build & CI Tools: Ant, Maven, Gradle, Jenkins
IDEs: Eclipse, IntelliJ, Microsoft Visual Studio
Cloud Platforms: AWS, Azure
Operating System: Windows 7/8/10, Linux, Ubuntu, Mac OS X
PROFESSIONAL EXPERIENCE
Confidential, IN
Senior Big Data Engineer
Responsibilities:
- Automated jobs for extracting data from sources such as MySQL and pushing the result sets to the Hadoop Distributed File System (HDFS).
- Responsible for writing Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Created a PySpark framework to bring data from DB2 into Amazon S3.
- Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations.
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulation, merges, aggregations, stacking, data labeling, and all cleansing and conforming tasks (see the UDF sketch after this role).
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Developed solutions leveraging ETL tools and identified opportunities for process improvement using Informatica and Python.
- Created External Hive tables and executed complex Hive queries on them using Hive QL.
- Used Spark for transformations, event joins and some aggregations before storing the data into HDFS.
- Possess excellent working knowledge of Spark Core, Spark SQL, and Spark Streaming.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Provided guidance to the development team using PySpark as the ETL platform.
- Performed various benchmarking steps to optimize the performance of spark jobs and thus improve the overall processing.
- Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Encoded and decoded JSON objects using PySpark to create and modify data frames in Apache Spark.
- Developed POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Used GitHub as repository for committing code and retrieving it and Jenkins for continuous integration.
- Designed and implemented a test environment on AWS.
- Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.
- Used SSIS to build automated multi-dimensional cubes.
- Used Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS and in NoSQL databases such as HBase and Cassandra using Python (see the streaming sketch after this role).
- Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.
- Able to spin up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
- Implemented a variety of AWS computing and networking services to meet application needs.
Environment: HDFS, MapReduce, Sqoop, Cloudera, MySQL, Eclipse, Spark, Git, GitHub, Jenkins, Airflow, Shell Scripting, PySpark, Python 3.6, Spark SQL, AWS Cloud, S3, EC2.
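Illustrative sketch (assumption-based, not project code): a minimal PySpark example of the custom-UDF row/column cleansing pattern described in this role. The S3 bucket, column names, and normalization rule are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("cleansing-udf-sketch").getOrCreate()

# Hypothetical raw input on S3 (on EMR, s3:// paths resolve through EMRFS).
df = spark.read.parquet("s3://example-bucket/raw_events/")

# Custom UDF implementing a simple conforming rule: trim and upper-case a code column.
@F.udf(returnType=StringType())
def normalize_code(code):
    return code.strip().upper() if code else None

cleaned = (df.withColumn("event_code", normalize_code(F.col("event_code")))
             .dropDuplicates(["event_id"])
             .withColumn("load_date", F.current_date()))

# Aggregate and write the conformed data back to S3 in Parquet for downstream jobs.
(cleaned.groupBy("event_code")
        .agg(F.count("*").alias("event_count"))
        .write.mode("overwrite")
        .parquet("s3://example-bucket/curated/event_counts/"))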
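Illustrative sketch (assumption-based, not project code): the Kafka-to-HDFS ingestion pattern from this role, shown with Spark Structured Streaming. The broker address, topic, and output paths are assumptions, and the Kafka connector package is assumed to be on the classpath.

from pyspark.sql import SparkSession

# Assumes the spark-sql-kafka connector package is supplied at submit time.
spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

# Hypothetical broker and topic; the real job consumed production weblog topics.
stream = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "weblogs")
               .load())

# Kafka delivers the payload as binary; cast it to string before any parsing.
parsed = stream.selectExpr("CAST(value AS STRING) AS raw_json", "timestamp")

# Persist the stream to HDFS as Parquet, with checkpointing for fault tolerance.
query = (parsed.writeStream
               .format("parquet")
               .option("path", "hdfs:///data/streaming/weblogs/")
               .option("checkpointLocation", "hdfs:///checkpoints/weblogs/")
               .outputMode("append")
               .start())

query.awaitTermination()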
Confidential, Bothell, WA
Senior Big Data Engineer
Responsibilities:
- Developed simple to complex MapReduce jobs using Hive and created Hive external tables over data in HDFS locations.
- Worked on Cloudera CDH distribution.
- Implemented Spark Core in Scala to process data in memory.
- Performed job functions using Spark APIs in Scala for real time analysis and for fast querying purposes.
- Involved in file movements between HDFS and AWS S3 and extensively worked with S3 buckets in AWS.
- Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, Auto Scaling groups, and the AWS CLI.
- Wrote HiveQL queries, configuring the number of mappers and reducers needed to produce the output.
- Worked on importing data from SQL Server into HDFS using Sqoop and created Hive external table.
- Wrote Hive queries for data analysis to meet the business requirements.
- Used various Hive optimization techniques such as partitioning, bucketing, map joins, small-file merging, and vectorization.
- Developed Spark scripts by using Scala as per the requirement.
- Loaded the data into Spark RDD and performed in-memory data computation to generate the output response.
- Performed different types of transformations and actions on the RDD to meet the business requirements.
- Involved in loading data from UNIX file system to HDFS.
- Used Spark SQL to load Parquet data, created Datasets defined by case classes, and handled structured data with Spark SQL before storing it into Hive tables for downstream consumption (see the sketch after this role).
- Implemented best offer logic in Spark by writing Spark UDFs in Scala.
- Used Spark as ETL tool.
- Imported real time weblogs using Kafka as a messaging system and ingested the data to Spark Streaming.
- Wrote user-defined functions in Python and Scala for Spark applications.
Environment: Hive, Impala, Bitbucket, Jira, Shell Scripting, PySpark, Python 3.3, Spark SQL, AWS Cloud, S3, EC2, Scala, Kafka, Cassandra, HBase, Spark, HiveQL, UDF, UNIX, HDFS, API, MapReduce.
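Illustrative sketch (assumption-based, not project code): this role used Scala case-class Datasets; for a consistent illustration, the PySpark sketch below shows the same Parquet-to-Hive flow with Spark SQL. The paths, view, and database/table names are hypothetical.

from pyspark.sql import SparkSession

# enableHiveSupport() so saveAsTable writes through the Hive metastore.
spark = (SparkSession.builder
                     .appName("parquet-to-hive-sketch")
                     .enableHiveSupport()
                     .getOrCreate())

# Hypothetical Parquet landing path.
events = spark.read.parquet("/data/landing/events/")

# Handle the structured data with Spark SQL before the Hive write.
events.createOrReplaceTempView("events_raw")
daily = spark.sql("""
    SELECT event_date, event_type, COUNT(*) AS event_count
    FROM events_raw
    GROUP BY event_date, event_type
""")

# Store into a Hive table for downstream consumption (database assumed to exist).
daily.write.mode("overwrite").saveAsTable("analytics.daily_event_counts")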
Confidential, Scottsdale, AZ
Big Data Engineer
Responsibilities:
- Worked on importing data from Sybase into HDFS using Sqoop and created Hive external table.
- Worked on converting HQL queries into Spark-based programs to improve performance.
- Involved in designing and developing enhancements of CSG using AWS APIs.
- Performed different types of transformations and actions on the RDD to meet the business requirements.
- Expertly handled the stream processing and storage of data to feed into the HDFS systems using Apache Spark, Sqoop.
- Deployed the Scala code for stream processing using Apache Kafka in Amazon S3.
- Performed various import functions using Sqoop on the data from MySQL to HDFS.
- Performed performance tuning using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Monitored and controlled Local disk storage and Log files using Amazon CloudWatch.
- Built scalable distributed data solutions using Hadoop.
- Created Sqoop Jobs, Pig and Hive Scripts to perform data ingestion from relational databases and compared with the historical data.
- Played a key role in dynamic partitioning and bucketing of the data stored in the Hive metastore (see the sketch after this role).
- Wrote HiveQL queries integrating different tables and creating views to produce result sets.
- Experience loading and transforming large sets of structured and unstructured data.
- Expertise in extending Hive and Pig core functionality by writing custom UDFs using Java.
- Analyzed large volumes of structured data using Spark SQL.
- Wrote shell scripts to execute HiveQL.
- Responsible for writing Automated shell scripts in Linux/Unix environment using bash.
- Extensive experience in tuning Hive queries using in-memory (map) joins for faster execution.
- Involved in loading data from UNIX file system to HDFS.
- Developed Python and Spark programs for cleaning and pre-processing raw data.
- Developed a shell script to import Spark cluster logs onto the edge node.
- Migrated Map Reduce jobs to Spark jobs to achieve better performance.
- Used GitHub as repository for committing code and retrieving it and Jenkins for continuous integration.
Environment: Hortonworks, Hadoop, Java, HDFS, Pig, Sqoop, Hive, Oozie, ZooKeeper, NoSQL, HBase, Shell Scripting, Scala, Spark, Spark SQL, PySpark, AWS Cloud, S3, EC2, Git, GitHub.
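Illustrative sketch (assumption-based, not project code): the dynamic-partitioning and bucketing pattern mentioned in this role, expressed with PySpark writing a Hive table. The table names, columns, and bucket count are hypothetical.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
                     .appName("partition-bucket-sketch")
                     .enableHiveSupport()
                     .config("hive.exec.dynamic.partition", "true")
                     .config("hive.exec.dynamic.partition.mode", "nonstrict")
                     .getOrCreate())

# Hypothetical cleansed source table produced by the Sqoop/Hive ingestion steps.
txns = spark.table("staging.transactions_clean")

# Partition by load date and bucket by customer_id so lookups and joins on
# customer_id scan fewer files; bucketBy requires writing via saveAsTable.
(txns.write
     .mode("overwrite")
     .partitionBy("load_date")
     .bucketBy(16, "customer_id")
     .sortBy("customer_id")
     .saveAsTable("warehouse.transactions"))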