
Big Data Engineer Resume


Newark, NJ

SUMMARY

  • Around 7 years of IT experience spanning analysis, design, and development in the Big Data ecosystem, design and development of web applications using Java and J2EE, and database and data warehousing development using MySQL, Oracle, and Informatica.
  • Complete understanding of Hadoop daemons such as JobTracker, TaskTracker, NameNode, and DataNode, which provide client communication, job execution and management, and resource scheduling and management.
  • Expertise in Hadoop architecture and ecosystem components such as HDFS, Sqoop, Spark, NiFi, Pig, and Oozie.
  • Good knowledge of MapReduce for processing large data sets with a parallel, distributed algorithm.
  • Skilled in using Hive for data querying and analysis and Sqoop for transferring data between relational databases and Hadoop.
  • Expertise in ingesting data from external sources to HDFS for data processing using Flume.
  • Hands on experience in creating workflows and scheduling jobs using Oozie.
  • Expertise in ZooKeeper, an open-source server that provides distributed configuration, synchronization, and naming registry services.
  • Experience in installing, configuring, managing, supporting, and monitoring Hadoop clusters using distributions such as Cloudera and Hortonworks and cloud services such as AWS and GCP.
  • Knowledge of NiFi for automating data flows, and of StreamSets for delivering data across the business with smart data pipelines.
  • Expertise in writing custom Kafka consumer code and modifying existing producer code in Python to push data to Spark Streaming jobs (see the sketch at the end of this summary).
  • Experience with NumPy for multi-dimensional arrays and matrices, Matplotlib for plotting, pandas for concise data manipulation and analysis, and PySpark for writing Spark applications with the Python API.
  • Ample knowledge of Apache Kafka and Apache Storm for building data platforms, pipelines, and storage systems, and of search technologies such as Elasticsearch.
  • Worked extensively with semi-structured data (fixed-length and delimited files) for data sanitization, standardization, and report generation.
  • Extensive experience with AWS cloud services and AWS SDKs, including API Gateway, Lambda, S3, IAM, and EC2.
  • Utilized machine learning algorithms such as linear regression, multivariate regression, Naive Bayes, Random Forests, K-means, and KNN for data analysis.
  • Experience in HANA security including User Management, Roles, and Analytic Privileges.
  • Experience using Kafka and Kafka brokers to initiate Spark contexts and process live streaming data.
  • Good at implementing custom Kafka encoders for custom input formats to load data into partitions.
  • Experienced in delivering highly available and fault-tolerant applications using orchestration technology on Google Cloud Platform (GCP).
  • Experienced in planning and capacity sizing for migrating an on-premises IBM BigInsights solution to a cloud-native GCP solution, involving tools such as Dataproc, Dataflow, Cloud Functions, Google Cloud Storage, and Pub/Sub.
  • Experienced with Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
  • Knowledge of automated deployments leveraging Azure Resource Manager templates, Azure DevOps, and Git repositories for automation and continuous integration/continuous delivery (CI/CD).
  • Experienced in writing user-defined functions (UDFs) in Spark, HiveQL, and SQL for data processing and analysis.
  • Developed expertise in Cassandra, a highly scalable database management system built to handle large amounts of data.
  • Gained expertise in PostgreSQL, an open-source object-relational database system used as the primary data store or data warehouse for many web, mobile, geospatial, and analytics applications.
  • Worked with Impala to access and analyze data stored on Hadoop data nodes without data movement, providing fast access to data in HDFS, and used Couchbase to move data from the data center to the cloud and the edge.
  • Working experience with NoSQL databases such as HBase for fault-tolerant storage of sparse data sets, and used Teradata's built-in SQL extension functions.
  • Extensive experience across relational and non-relational databases, including Oracle for managing data with high performance, authorized access, and failure recovery, along with PL/SQL, SQL Server, MySQL, and DB2.
  • Gained experience in Databricks, a collaborative environment for running interactive and scheduled data analysis.
  • Gained expertise in the ELK stack for faster troubleshooting and security analytics, and in Splunk for making information searchable and generating alerts, reports, and visualizations.
  • Extensive skills in Tableau for turning large and small data sets into insightful, actionable visualizations.
  • Hands-on experience with Grafana, an open-source solution for running data analytics and monitoring applications through customizable dashboards.
  • Expertise in implementing Service Oriented Architectures (SOA) with XML based Web Services (SOAP/REST).
  • Expertise in creating dashboards and alerts using Splunk Enterprise, Tableau, and Qlik Sense, and in monitoring using DAGs.
  • Skilled in monitoring servers using Nagios, Datadog, CloudWatch, and the ELK stack (Elasticsearch, Logstash).
  • Highly motivated self-starter with good communication and interpersonal skills.
  • Good team player, dependable resource, and ability to learn new tools and software quickly as required.
  • Good domain knowledge of Telecommunications, Banking, Retail, Healthcare, and Insurance.
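As referenced above, a minimal sketch of one way to consume a Kafka topic from a PySpark streaming job; the broker address and topic name are placeholders, and the spark-sql-kafka connector is assumed to be available on the cluster:

```python
from pyspark.sql import SparkSession

# Broker and topic names below are placeholders; the spark-sql-kafka
# connector package must be available for the "kafka" source to load.
spark = SparkSession.builder.appName("kafka-usage-stream").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "router-usage")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload")  # Kafka values arrive as bytes
)

# Write the decoded messages to the console for inspection; a real job
# would apply transformations and write to a sink such as Hive or HDFS.
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```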

TECHNICAL SKILLS

Hadoop Core Services: HDFS, MapReduce, Spark, YARN.

Hadoop Distributions: Cloudera, Hortonworks, Apache Hadoop.

Databases: HBase, Spark-Redis, Cassandra, Oracle, MySQL, PostgreSQL

Data Services: Hive, Pig, Impala, Sqoop, Flume, Kafka.

Scheduling Tools: Zookeeper, Oozie.

Monitoring Tools: Cloudera Manager.

Cloud Computing Tools: AWS (Amazon EC2, Amazon EMR, AWS Lambda, AWS Glue, Amazon S3, Amazon Athena), Azure (Azure Data Lake, Azure Data Factory, Azure Databricks, Azure SQL Database, Azure SQL Data Warehouse), GCP

Programming Languages: Python, Java, Scala, R, SQL, PL/SQL, Pig Latin, HiveQL, Unix, JavaScript, Shell Scripting.

Operating Systems: UNIX, Windows, LINUX.

Build Tools: Jenkins, Maven, ANT.

Frameworks: MVC, Struts, Maven, JUnit, Log4j, Tableau, Splunk, Aqua Data Studio

J2EE technologies: Spring, Servlets, J2SE, JSP, JDBC

PROFESSIONAL EXPERIENCE

Confidential - NEWARK, NJ

BIG DATA ENGINEER

Responsibilities:

  • Used Spark with Scala to import customer information from an Oracle database into HDFS for data processing, along with minor cleansing.
  • Developed MapReduce jobs to calculate the total data usage of commercial routers in different locations using the Hortonworks distribution.
  • Involved in information gathering for new Spark/Scala enhancements, production support for field issues, and label installs for Hive scripts and MapReduce jobs.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables.
  • Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights.
  • Used PySpark's optimized APIs to read data from various data sources containing different file formats.
  • Used Maven to build RPMs from Scala source code checked out from a Git repository, with Jenkins as the continuous integration server and Artifactory as the repository manager.
  • Used PySpark RDDs and DataFrames to process batch pipelines requiring high throughput.
  • Stored data in AWS S3, similar to HDFS, and ran EMR programs on the data stored in S3.
  • Used the AWS Glue ETL service to consume raw data from an S3 bucket, transform it per requirements, and write the output back to S3 in Parquet format for data analytics.
  • Developed a Python script to load CSV files into S3 buckets (see the first sketch after this list); created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
  • Worked with CloudFormation to automate AWS environment creation, along with the ability to deploy AWS resources using build scripts (Boto3 and the AWS CLI).
  • Set up scalability for application servers using the command line interface, and set up and administered DNS in AWS using Route 53.
  • Wrote Python scripts to update content in the database and manipulate files; involved in building database models, APIs, and views utilizing Python technologies.
  • Visualized and manipulated data using libraries such as NumPy, SciPy, and pandas in Python scripts for data analysis.
  • Translated customer business requirements into technical design documents, established specific solutions, and led the efforts, including Spark/Scala programming and testing, that culminated in client acceptance of the results.
  • Applied Object-Oriented Design (OOD) and end-to-end software development experience to Scala coding and implementing mathematical models in Spark analytics.
  • Created Hive external tables on top of datasets loaded into AWS S3 buckets and created various Hive scripts to produce series of aggregated datasets for downstream analysis.
  • Used AWS Lambda to perform data validation, filtering, sorting, and other transformations for every data change in an HBase table and to load the transformed data into RDS.
  • Used Informatica PowerCenter to handle the ETL workload for data integration.
  • Used the AWS CLI to suspend an AWS Lambda function processing an Amazon Kinesis stream and later resume it (a boto3 equivalent is sketched after this list).
  • Loaded data from different servers into S3 buckets and set appropriate bucket permissions.
  • Used Spark GraphX and GraphFrames for graph processing in PySpark.
  • Configured routing to send JMS messages to interact with the application for real-time data using Kafka.
  • Managed ZooKeeper for cluster coordination and Kafka offset monitoring.
  • Optimized legacy queries to extract the customer information from Oracle.
  • Reviewed HDFS usage and system design for future scalability and fault tolerance.
  • Strong Experience in implementing Data warehouse solutions in Confidential Redshift.
  • Worked on various projects to migrate data from on premise databases to Confidential Redshift and RDS
  • Worked on Informatica Data Quality to provide an extensive array of cleansing and standardization capabilities.
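A minimal sketch of the CSV-to-S3 loading script referenced above, assuming boto3 credentials are already configured; the bucket name, prefix, and file names are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Bucket, prefix, and file names are placeholders for illustration.
bucket = "example-raw-landing-bucket"
for csv_file in ["customer_info.csv", "router_usage.csv"]:
    # Upload each local CSV under a landing/ prefix in the bucket.
    s3.upload_file(csv_file, bucket, f"landing/{csv_file}")
    print(f"Uploaded {csv_file} to s3://{bucket}/landing/{csv_file}")
```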
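The Kinesis suspend/resume bullet above mentions the AWS CLI; a boto3 equivalent that disables and re-enables the Lambda function's Kinesis event source mapping might look like this (the mapping UUID is a placeholder):

```python
import boto3

lambda_client = boto3.client("lambda")

# UUID of the Lambda function's Kinesis event source mapping (placeholder).
mapping_uuid = "00000000-0000-0000-0000-000000000000"

# Suspend polling of the Kinesis stream...
lambda_client.update_event_source_mapping(UUID=mapping_uuid, Enabled=False)

# ...and resume it later.
lambda_client.update_event_source_mapping(UUID=mapping_uuid, Enabled=True)
```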

Environment: HDFS, Spark, Scala, Python, Shell Scripting, Hive, HBase, Oracle, MapReduce, PySpark, Logstash, Jenkins, Versant, Java, Kafka, Hortonworks, Git, ClearCase, Zookeeper, Ansible, AWS.

Confidential, MARLBOROUGH, MA

BIG DATA DEVELOPER

Responsibilities:

  • Worked on importing and exporting utility data into HDFS and the Hive Metastore from an RDBMS (Oracle) using Sqoop.
  • Used Apache Flume to aggregate and move data from web servers to HDFS.
  • Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Applied partitioning and bucketing concepts in Hive and designed both managed and external tables for optimized performance.
  • Developed custom UDFs in Java to extend Hive and Pig Latin functionality.
  • Used Tableau for visualization to generate business reports.
  • Created Oozie coordinated workflow to execute Sqoop jobs.
  • Supported mapping editors to build maps of product and transactional environments.
  • Involved in loading and transforming large sets of structured data from router locations to the EDW using a NiFi data pipeline flow.
  • Developed PySpark code and Spark-SQL for faster testing and processing of data.
  • Worked with data serialization formats (Parquet, ORC, Avro, JSON, and CSV) for converting complex objects into sequences of bits.
  • Created Hive tables to load large data sets of structured data coming from WADL after transformation of raw data.
  • Migrated the EDL data to GCP, reusing as much of the existing HiveQL data integration work as possible on the Google Dataproc cluster for historical and recurring/delta loads, largely following a lift-and-shift architecture pattern.
  • Designed and developed an event-triggered data pipeline based on Cloud Pub/Sub for ingestion of PII and non-PII data into the landing area in Google Cloud Storage (GCS) buckets.
  • Worked on Google Cloud Functions in Python to load data into BigQuery as CSV files arrive in a GCS bucket (see the first sketch after this list).
  • Extracted the current EDL data using HiveQL and stored it on the edge node.
  • Developed and deployed Dataflow jobs to write event data from Pub/Sub to BigQuery and from Pub/Sub to Pub/Sub.
  • Created Hive queries to spot trends by comparing fresh data with EDW reference tables and historical metrics.
  • Used PySpark to convert pandas DataFrames to Spark DataFrames (see the second sketch after this list).
  • Used the KafkaUtils module in PySpark to create an input stream that pulls messages directly from the Kafka broker.
  • Worked on partitioning Hive tables and running scripts in parallel to reduce script run time.
  • Extensively worked on creating an end-to-end data pipeline orchestration using NiFi.
  • Implemented business logic by writing UDFs in Spark Scala and configuring CRON Jobs.
  • Provided design recommendations and resolved technical problems.
  • Assisted with data capacity planning and node forecasting.
  • Involved in performance tuning and troubleshooting Hadoop cluster.
  • Developed HCatalog streaming code to stream JSON data into Hive (EDW) continuously; administered Hive and Kafka, installing updates, patches, and upgrades.
  • Supported code/design analysis, strategy development and project planning.
  • Managed and reviewed Hadoop log files.
  • Evaluated the suitability of Hadoop and its ecosystem for the project and implemented various proof-of-concept applications to support their adoption and benefit from the Hadoop initiative.
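A minimal sketch of the kind of GCS-triggered Cloud Function referenced above, written as a first-generation background function; the destination dataset and table names are placeholders:

```python
from google.cloud import bigquery

def load_csv_to_bigquery(event, context):
    """Background Cloud Function triggered when a CSV file lands in a GCS bucket."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"  # object that fired the event

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # assume a header row
        autodetect=True,       # infer the schema from the file
    )

    # Destination dataset and table are placeholders.
    load_job = client.load_table_from_uri(
        uri, "analytics_landing.events", job_config=job_config
    )
    load_job.result()  # block until the load job completes
```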
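And a short sketch of the pandas-to-Spark DataFrame conversion mentioned above; the sample data is purely illustrative:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# Illustrative pandas DataFrame; in practice this came from upstream processing.
pdf = pd.DataFrame({"router_id": [101, 102], "usage_gb": [12.4, 3.7]})

sdf = spark.createDataFrame(pdf)  # convert the pandas DataFrame to a Spark DataFrame
sdf.printSchema()
sdf.show()
```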

Environment: Spark, Scala, Hive, Maven, Google Cloud Platform (GCP), Python, Microservices, GitHub, Splunk, PySpark, Tableau, Tidal, Sqoop, Java 1.8, Linux, Aqua Data Studio, NiFi, J2EE, HDFS, Kafka, MySQL

Confidential, DEARBORN

HADOOP DEVELOPER

Responsibilities:

  • Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
  • Participated in design reviews and daily project scrums.
  • Worked closely with the business analysts to convert the Business Requirements into Technical Requirements and prepared low- and high-level documentation.
  • Hands-on experience writing MapReduce jobs for encryption and for converting text data into Avro format.
  • Hands-on experience joining raw data with reference data using Pig scripts.
  • Wrote custom UDFs in Hive.
  • Hands-on experience extracting data from different databases and copying it into HDFS using Sqoop.
  • Wrote Sqoop incremental import jobs to move new/updated data from the database to HDFS.
  • Created Oozie coordinated workflow to execute Sqoop incremental job daily.
  • Used Oozie workflow engine to run multiple Hive and Pig jobs.
  • Exported the analyzed data into relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Involved in installing and configuring Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
  • Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
  • Worked with clients on requirements based on their business needs.
  • Communicated deliverable status to users, stakeholders, and the client, and drove periodic review meetings.
  • Completed tasks and the project on time, in line with quality goals.
  • Good knowledge of HBase.

Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, HBase, Oozie, MySQL, RSA editor, PuTTY, Zookeeper, Ganglia, UNIX, and Shell scripting

Confidential - BOSTON, MA

JAVA / J2EE DEVELOPER

Responsibilities:

  • Responsible for gathering and analyzing requirements and converting them into technical specifications
  • Used Rational Rose for creating sequence and class diagrams
  • Developed presentation layer using JSP, Java, HTML and JavaScript
  • Used Spring Core Annotations for Dependency Injection
  • Designed and developed a 'Convention Based Coding' utilizing Hibernate's persistence framework and O-R mapping capability to enable dynamic fetching and displaying of various table data with JSF tag libraries
  • Designed and developed Hibernate configuration and session-per-request design pattern for making database connectivity and accessing the session for database transactions respectively.
  • Used HQL and SQL for fetching and storing data in databases
  • Participated in the design and development of database schema and Entity-Relationship diagrams of the backend Oracle database tables for the application
  • Implemented web services with Apache Axis
  • Designed and developed stored procedures and triggers in Oracle to cater to the needs of the entire application.
  • Developed complex SQL queries for extracting data from the database
  • Designed and built SOAP web service interfaces implemented in Java
  • Used Apache Ant for the build process

Environment: Java, JDK 1.5, Servlets, Hibernate, Ajax, Oracle 10g, Eclipse, Apache Ant, Web Services (SOAP), Apache Axis, WebLogic Server, JavaScript, HTML, CSS, XML

Confidential

JAVA DEVELOPER

Responsibilities:

  • Developed and deployed various object-oriented and web-based enterprise applications using Java/J2EE technologies, working across the complete System Development Life Cycle (SDLC).
  • Designed and developed the UI of the website using HTML, Spring Boot, React JS, CSS, and JavaScript.
  • Utilized Spring Boot and Java for the backend, React JS for the frontend, and MySQL as the database.
  • Designed and developed data management system using MySQL. Built application using Spring JPA for database persistence.
  • Worked with application/web servers such as IBM WebSphere, WebLogic, JBoss, and Tomcat.
  • On the backend, worked on persisting the data shown on screen to the database after a particular user uploads an Excel sheet.
  • Created a dashboard for managers to compare allocation details based on their monthly time.
  • Worked on an upload feature that allows the user to upload an Excel sheet and converts the data into a readable format with the help of React JS.
  • Created a framework based on Spring Boot concepts, using Spring JPA for database persistence.
  • Experienced in developing complex MySQL queries, procedures, stored procedures, packages, and views in the MySQL database.
  • Ensured availability and security for database in a production environment.
  • Configured, tuned, and maintained MySQL Server database servers.
  • Implemented monitoring and established best practices around using react libraries.
  • Effectively communicated with the external vendors to resolve queries.

Environment: Java, JavaScript, Spring Boot, CSS, SQL, MySQL, React JS, Apache web server, IBM WebSphere.
