Big Data Engineer Resume

Branchburg, NJ

SUMMARY

  • More than 8 years of development experience using Java, J2EE, JSP, Servlets, Scala and Python.
  • 7+ years of comprehensive IT experience in Big Data and Big Data analytics, Hadoop, HDFS, MapReduce, YARN, the Hadoop ecosystem and shell scripting.
  • Highly capable of processing large structured, semi-structured and unstructured datasets and supporting Big Data applications.
  • Hands-on experience with Hadoop ecosystem components such as MapReduce (processing), HDFS (storage), YARN, Sqoop, Pig, Hive, HBase, Oozie, ZooKeeper and Spark for data storage and analysis.
  • Expertise in transferring data between the Hadoop ecosystem and structured storage in an RDBMS such as MySQL, Oracle, Teradata and DB2 using Sqoop.
  • Experience with NoSQL databases such as MongoDB, HBase and Cassandra.
  • Good knowledge of loading data from various data sources and legacy systems into Teradata production and development warehouses using the BTEQ, FastExport, MultiLoad and FastLoad utilities.
  • Experience with Apache Spark clusters and stream processing using Spark Streaming for real-time and near-real-time applications.
  • Experience with the Apache Airflow scheduler for batch job processing using Spark.
  • Experience using NiFi dataflow software to automate data flow from sources such as Kafka and IBM MQ and load the data into target tables.
  • Creating NiFi processors to meet data requirements and regulate the flow of data.
  • Expertise in moving large amounts of log data, streaming event data and transactional data using Spark and Flume.
  • Experience working with Cloudera Hue Interface and Impala.
  • Experience in developing MapReduce jobs in Java for data cleaning and preprocessing.
  • Strong knowledge of AWS Lambda, Amazon Kinesis, Amazon Simple Queue Service (Amazon SQS), Amazon SNS and SWF.
  • Design, implement and maintain all AWS infrastructure and services within a managed service environment.
  • Keeping track of Hadoop cluster connectivity and security.
  • Strong knowledge of web services, API gateways, and application integration design and development.
  • Expertise in organizing data layouts in Hive using partitions and bucketing.
  • Strong experience working with file formats such as Parquet, Avro and JSON.
  • Expertise in preparing interactive data visualizations using Tableau and data virtualization using Denodo software.
  • Hands-on experience developing Oozie workflows that execute MapReduce, Sqoop, Pig, Hive and shell scripts.
  • Hands on experience developing Solr Indexes using MapReduce Indexer Tool.
  • Support ongoing development and code reviews of data acquisition, data movement, data cleansing, data transformation, data mapping, data quality screens, ETL jobs and schedules, and other ETL and data integration activities.
  • Ensure that ETL jobs are scheduled, monitored and generate detailed logs to support ongoing diagnostics, exception processing, and audit trails for compliance.
  • Expertise in Object-Oriented Analysis and Design (OOAD) with UML and in the use of various design patterns.
  • Fluent in core Java concepts such as I/O, multithreading, exceptions, regular expressions, data structures and serialization.
  • Experience in process improvement, normalization/de-normalization, and data extraction, cleansing and manipulation.
  • Experience converting requirement specifications and source-system analysis into conceptual, logical and physical data models and data flow diagrams (DFD).
  • Expertise in working with transactional databases such as Oracle, SQL Server, MySQL, Db2 and MariaDB.
  • Expertise in developing SQL queries and stored procedures, with development experience under Agile methodology.
  • Ability to adapt to evolving technology, Strong sense of Responsibility and Accomplishment.
  • Excellent leadership, interpersonal, problem solving and time management skills.
  • Excellent communication skills both Written (documentation) and Verbal (presentation).

TECHNICAL SKILLS

Languages: Java, Python, Scala, HiveQL.

Big Data Technologies: HDFS, Hive, SAS, MapReduce, Pig, Apache Spark, Sqoop, 3.0, CDH 5.x, Kafka, Oozie, Flume, HDP 2.2, 2.4, 2.6, Apache Airflow, YARN

Hadoop Ecosystem: HDFS, Hive, MapReduce, HBase, YARN, Sqoop, Flume, Oozie, Zookeeper, Impala.

Databases: Oracle, SQL Server, Teradata, HBase, MongoDB

Scripting Languages: JavaScript, CSS, Python, Perl, and Shell Script.

Operating Systems: Windows, UNIX, Linux

PROFESSIONAL EXPERIENCE

Big Data Engineer

Confidential, Branchburg, NJ

Responsibilities:

  • Design, develop, document and test new requirements in the data pipeline using Spark, Scala, Python and Kafka in the Hadoop ecosystem on AWS.
  • Participate in the full development life cycle, including requirements analysis, design, development, deployment and operations support.
  • Build pipelines that pull Avro-format data from Kafka using Kafka Connect and load it into S3 buckets, read the data into DataFrames using Spark and Scala, run the transformations and compaction rules, and write the data back to S3 in Parquet format.
  • Create Hive tables and Athena views for validation by business analysts.
  • Set up enterprise infrastructure on Amazon Web Services (AWS), including EC2, ELB, EBS, S3, Auto Scaling, AMI, RDS, IAM, CloudFormation, VPC, CodeDeploy, Elastic Beanstalk, CloudWatch and CloudTrail.
  • Work with container orchestration tools and container-based technologies such as Kubernetes, Docker and ECS, and with build and automation tools such as Ant, Maven and Gradle.
  • Work daily with the OpenShift platform to manage Docker containers and Kubernetes clusters.
  • Create development and test environments for different applications by provisioning Kubernetes clusters on AWS using Docker, Ansible and Terraform.
  • Write Terraform scripts for CloudWatch alerts.
  • Use the AWS CLI to automate backups of ephemeral data stores to S3 buckets and EBS, and create nightly AMIs of mission-critical production servers as backups.
  • Work with engineering team members to explore and create interesting solutions while sharing knowledge within the team.
  • Provide full operational support: analyze code to identify root causes of production issues, provide solutions or workarounds, and lead issues to resolution.
  • Provide guidance on new technologies and methodologies.
  • Responsible for implementing monitoring solutions in Ansible, Terraform, Docker, and Jenkins.

Environment: AWS, EC2, S3, RDS, Docker, Kubernetes, Tomcat, Jenkins, Ansible, Terraform, Python, Groovy, Linux, Shell, Salt, CloudFormation, Jira, Git.

Sr. Hadoop Developer

Confidential, Vernon Hills, IL

Responsibilities:

  • Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files.
  • Involved in low-level design of MapReduce, Hive, Impala and shell scripts to process data.
  • Involved in the complete Big Data flow of the application, from upstream data ingestion into HDFS through processing and analyzing the data in HDFS.
  • Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment implemented in Scala.
  • Performed data migration from on-premises environments into AWS.
  • Used the Spark Streaming API with Kafka to build live dashboards; worked on transformations and actions in RDDs, Spark Streaming, pair RDD operations, checkpointing and SBT.
  • Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala IDE for Eclipse.
  • Created Hive tables to import large data sets from various relational databases using Sqoop and exported the analyzed data back for visualization and report generation by the BI team.
  • Installed and configured Hive, Sqoop, Flume and Oozie on the Hadoop clusters.
  • Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Developed a process for batch ingestion of CSV files and Sqoop imports from different sources, and for generating views on the data sources using shell scripting and Python.
  • Integrated a shell script to create collections/morphlines and Solr indexes on top of table directories using MapReduce Indexer Tool within the batch ingestion framework.
  • Implemented partitioning, dynamic partitions and buckets in Hive.
  • Developed Hive Scripts to create the views and apply transformation logic in the Target Database.
  • Involved in the design of Data Mart and Data Lake to provide faster insight into the Data.
  • Used the StreamSets Data Collector tool and created data flows for one of the streaming applications.
  • Experienced in using Kafka as a data pipeline between JMS (producer) and a Spark Streaming application (consumer).
  • Involved in developing a Spark Streaming application for one of the data sources using Scala and Spark, applying the required transformations.
  • Developed a Scala script to read all the Parquet tables in a database and convert them to JSON files, and another script to load them as structured tables in Hive.
  • Designed and Maintained Oozie workflows to manage the flow of jobs in the cluster.
  • Configured Zookeeper for Cluster co-ordination services.
  • Developed a unit test script to read a Parquet file for testing PySpark on the cluster.
  • Involved in exploring new technologies such as AWS, Apache Flink and Apache NiFi that can increase business value.

Environment: Hadoop, HDFS, MapReduce, Hive, HBase, ZooKeeper, Impala, Java (JDK 1.6), Cloudera, Oracle, SQL Server, UNIX Shell Scripting, Flume, Oozie, Scala, Spark, Sqoop, Python, Kafka, PySpark, AWS.

Hadoop Developer

Confidential, Round Rock, TX

Responsibilities:

  • Responsible for operational development and support of the Hadoop cluster used at Prime Therapeutics.
  • Developed custom input adapters in Java for moving the data from raw sources to HDFS.
  • Developed Spark applications using Scala to perform data cleansing, data validation, data transformations and other enrichments.
  • Wrote Spark-Streaming applications to consume the data from WSO2 topics and write the processed streams to HBase.
  • Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs and YARN.
  • Developed Airflow workflows to automate and productionize the data pipelines.
  • Worked on various performance optimizations like using distributed cache for small datasets, Partition, Bucketing in Hive and Map Side joins.
  • Worked on fine-tuning spark applications to improve the overall processing time for the pipelines.
  • Developed many Spark applications for the data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning exercises.
  • Install, upgrade, configure, and apply patches for Cloudera Manager.
  • Setup of Hadoop cluster and maintenance support of the cluster.
  • Cluster coordination Services.
  • Keeping track of Hadoop cluster connectivity and security.
  • Capacity planning and monitoring of Hadoop cluster job performances.
  • HDFS maintenance and support.
  • Resource manager configuration and troubleshooting.
  • Setting up Hadoop users.
  • Testing HDFS, Hive, HBase, and MapReduce access for the new users.
  • Backup, recovery and maintenance.
  • Consult with business users to manage tasks and incidents, and create, update and manage reports for metrics and performance.
  • Work with Shell Scripts, Python Scripts, and Ansible.
  • Prepare file system / mount points.
  • Install required services LDAP, DNS, etc.
  • Collaborate with engineering, development, and operation teams to troubleshoot and resolve their issues. Work closely with the infrastructure engineering build team.
  • Manage Hadoop jobs using a scheduler such as Airflow for batch jobs and the WSO2 message broker for near-real-time streaming jobs.
  • Point of contact for vendor escalation.
  • Execute and provide feedback on operational policies, procedures, processes and standards.
  • Automate manual tasks.
  • Develop infrastructure documents.
  • Troubleshoot production problems within assigned software applications.

Environment: Cloudera, HDFS, Hive, Spark, Scala, HBase, Impala, Airflow, WSO2, UNIX, Eclipse.

Hadoop Developer

Confidential

Responsibilities:

  • Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files.
  • Involved in low-level design of MapReduce, Hive, Impala and shell scripts to process data.
  • Involved in the complete Big Data flow of the application, from upstream data ingestion into HDFS through processing and analyzing the data in HDFS.
  • Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment implemented in Scala.
  • Performed data migration from on-premises environments into AWS.
  • Used the Spark Streaming API with Kafka to build live dashboards; worked on transformations and actions in RDDs, Spark Streaming, pair RDD operations, checkpointing and SBT.
  • Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala IDE for Eclipse.
  • Created Hive tables to import large data sets from various relational databases using Sqoop and exported the analyzed data back for visualization and report generation by the BI team.
  • Installed and configured Hive, Sqoop, Flume and Oozie on the Hadoop clusters.
  • Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Developed a process for batch ingestion of CSV files and Sqoop imports from different sources, and for generating views on the data sources using shell scripting and Python.
  • Integrated a shell script to create collections/morphlines and Solr indexes on top of table directories using MapReduce Indexer Tool within the batch ingestion framework.
  • Implemented partitioning, dynamic partitions and buckets in Hive.
  • Developed Hive Scripts to create the views and apply transformation logic in the Target Database.
  • Involved in the design of Data Mart and Data Lake to provide faster insight into the Data.
  • Used the StreamSets Data Collector tool and created data flows for one of the streaming applications.
  • Experienced in using Kafka as a data pipeline between JMS (producer) and a Spark Streaming application (consumer).
  • Involved in developing a Spark Streaming application for one of the data sources using Scala and Spark, applying the required transformations.
  • Developed a Scala script to read all the Parquet tables in a database and convert them to JSON files, and another script to load them as structured tables in Hive.
  • Designed and Maintained Oozie workflows to manage the flow of jobs in the cluster.
  • Configured Zookeeper for Cluster co-ordination services.
  • Developed a unit test script to read a Parquet file for testing PySpark on the cluster.
  • Involved in exploring new technologies such as AWS, Apache Flink and Apache NiFi that can increase business value.

Environment: Hadoop, HDFS, MapReduce, Hive, HBase, ZooKeeper, Impala, Java (JDK 1.6), Cloudera, Oracle, SQL Server, UNIX Shell Scripting, Flume, Oozie, Scala, Spark, Sqoop, Python, Kafka, PySpark, AWS.

Web Developer

Confidential 

Responsibilities:

  • Involved in various phases of SDLC including requirement gathering, analysis, development & customization of the application.
  • Implemented the service layer based on a Spring container and exploited Spring IOC features for bean management.
  • Worked on loading data from Linux file system to HDFS.
  • Analyzed requirements; designed, developed and validated the user interface using HTML, JavaScript and XML.
  • Wrote SQL queries using joins and stored procedures; used Maven to build and deploy the applications to the JBoss application server as part of the software development life cycle.
  • Used Eclipse IDE as the front-end development environment; implemented insert, update and retrieval operations on data in the Oracle database by writing stored procedures.
  • Developed MapReduce jobs to convert data files into Parquet file format and included MR-Unit to test the correctness of MapReduce programs.
  • Worked with structured, semi-structured and unstructured datasets in Teradata and Oracle; loaded files from Teradata into HDFS and from HDFS into Hive.
  • Installed the Oozie workflow engine to run multiple Hive jobs. Developed Hive queries to process the data and generate data cubes for visualization.
  • Concatenated ETL logic from RDBMS to Hive.
  • Implemented partitioning and bucketing in Hive, and worked with file formats and compression techniques for optimization.
  • Used MapReduce to compute various metrics that define user experience.
  • Assisted Oracle DB development team in developing stored procedures and designing the database.
  • Logged defects and database change requests using ClearQuest.
  • Used Maven for project builds and SVN as versioning system.
  • Provided production support for the application both onsite and remotely.
