
Big Data Engineer Resume


New York

SUMMARY

  • Around 7+ years of IT experience in software development and support, with experience in developing strategic methods for deploying Big Data technologies to efficiently solve Big Data processing requirements.
  • Expertise in Hadoop ecosystem components HDFS, MapReduce, YARN, HBase, Pig, Sqoop, Spark, Spark SQL, Spring Boot, Spark Streaming, and Hive for scalability, distributed computing, and high-performance computing.
  • Experience in using Hive Query Language for data analytics.
  • Experienced in Installing, Maintaining and Configuring Hadoop Cluster.
  • Strong knowledge of creating and monitoring Hadoop clusters on Amazon EC2, VMs, Hortonworks Data Platform 2.1 & 2.2, and CDH3/CDH4 with Cloudera Manager on Linux (Ubuntu).
  • Capable of processing large sets of structured, semi-structured, and unstructured data and supporting systems application architecture.
  • Good knowledge of single-node and multi-node cluster configurations.
  • Strong knowledge of NoSQL column-oriented databases like HBase, Cassandra, MongoDB, and MarkLogic, and their integration with the Hadoop cluster.
  • Expertise in the Scala programming language and Spark Core.
  • Worked with AWS based data ingestion and transformations.
  • Worked with Cloudbreak and blueprints to configure the AWS platform.
  • Worked with data warehouse tools like Informatica, Talend.
  • Experienced in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
  • Good knowledge of Amazon EMR, Amazon RDS, S3 buckets, DynamoDB, and Redshift.
  • Analyze data, interpret results, and convey findings in a concise and professional manner.
  • Partner with the Data Infrastructure team and business owners to implement new data sources and ensure consistent definitions are used in reporting and analytics.
  • Promote a full-cycle approach including request analysis, creating/pulling datasets, report creation and implementation, and providing the final analysis to the requestor.
  • Good experience with Kafka and Storm.
  • Worked with Docker to establish a connection between Spark and a Neo4j database.
  • Knowledge of the Java Virtual Machine (JVM) and multithreaded processing.
  • Hands on experience working with ANSI SQL.
  • Strong programming skills in the design and implementation of applications using Core Java, J2EE, JDBC, JSP, HTML, Spring Framework, Spring Batch framework, Spring AOP, Spring Boot, Struts, JavaScript, and Servlets.
  • Experience in writing build scripts using Maven and working with continuous integration systems like Jenkins.
  • Java developer with extensive experience with various Java libraries, APIs, and frameworks.
  • Hands-on development experience with RDBMS, including writing complex SQL queries, stored procedures, and triggers.
  • Very good understanding of SQL, ETL, and data warehousing technologies.
  • Knowledge of MS SQL Server 2012/2008/2005, Oracle 11g/10g/9i, and E-Business Suite.
  • Expert in T-SQL, creating and using stored procedures, views, and user-defined functions, and implementing Business Intelligence solutions using SQL Server 2000/2005/2008.
  • Developed Web-Services module for integration using SOAP and REST.
  • NoSQL database experience with HBase, Cassandra, and DynamoDB.
  • Flexible with Unix/Linux and Windows environments, working with operating systems like CentOS 5/6, Ubuntu 13/14, and Cosmos.
  • Sound knowledge of designing data warehousing applications using tools like Teradata, Oracle, and SQL Server.
  • Experience working with Solr for text search.
  • Experience using the Talend ETL tool.
  • Experience working with job schedulers like Autosys and Maestro.
  • Strong in databases like Sybase, DB2, Oracle, MS SQL, and Clickstream.
  • Strong understanding of Agile Scrum and Waterfall SDLC methodologies.
  • Strong working experience in Snowflake.
  • Hands-on experience with automation tools such as Puppet, Jenkins, Chef, Ganglia, and Nagios.
  • Strong communication, collaboration, and team-building skills, with proficiency at grasping new technical concepts quickly and utilizing them in a productive manner.
  • Adept in analyzing information system needs, evaluating end-user requirements, custom designing solutions, and troubleshooting information systems.
  • Strong analytical and problem-solving skills.

TECHNICAL SKILLS

Hadoop/Big Data Technologies: HDFS, MapReduce, Sqoop, Flume, Pig, Hive, Oozie, Impala, Spark, Zookeeper, Cloudera Manager, Splunk

NoSQL Databases: HBase, Cassandra

Monitoring and Reporting: Tableau, custom shell scripts

Hadoop Distributions: Hortonworks, Cloudera, MapR

Build Tools: Maven, SQL Developer

Programming & Scripting: Java, C, SQL, Shell Scripting, Python, Scala

Java Technologies: Servlets, JavaBeans, JDBC, Spring, Hibernate, SOAP/REST services

Databases: Oracle, MySQL, MS SQL Server, Teradata

Web Dev. Technologies: HTML, XML, JSON, CSS, jQuery, JavaScript, AngularJS

Version Control: SVN, CVS, Git

Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7, Windows Server 2008/2003

PROFESSIONAL EXPERIENCE

Confidential - New York

Big Data Engineer

Responsibilities:

  • Developed parser and loader MapReduce applications to retrieve data from HDFS and store it in HBase and Hive.
  • Worked on the Analytics Infrastructure team to develop a stream filtering system on top of Apache Kafka and Storm.
  • Imported unstructured data into HDFS using Flume.
  • Used Oozie to orchestrate the MapReduce jobs that extract the data in a timely manner.
  • Wrote MapReduce Java programs to analyze log data for large-scale data sets.
  • Used the HBase Java API in a Java application.
  • Automated all the jobs that extract data from different data sources like MySQL and push the result sets to the Hadoop Distributed File System on Cloudera.
  • Implemented MapReduce jobs using the Java API as well as Pig Latin and HiveQL.
  • Worked with big data processing using Hadoop technologies: MapReduce, Apache Spark, Apache Crunch, Hive, Apache Kafka, Pig, and YARN.
  • Worked on an application developed using Scala, Spark, and DataFrames to read data from Hive tables on the YARN framework.
  • Participated in the setup and deployment of a Cloudera Hadoop cluster.
  • Hands-on design and development of an application using Hive UDFs.
  • Responsible for writing Hive queries for analyzing data in the Hive warehouse using Hive Query Language (HQL).
  • Worked with and learned a great deal from Amazon Web Services (AWS) cloud services like EC2 and S3.
  • Hands-on experience with cloud services like Amazon Web Services (AWS).
  • Created data pipelines for different events to load the data from DynamoDB into an AWS S3 bucket and then into HDFS.
  • Worked on reading multiple data formats on HDFS using Python.
  • Automatically scaled up the EMR instances based on the data volume.
  • Implemented Spark using Python (PySpark) and Spark SQL for faster testing and processing of data.
  • Imported real-time weblogs using Kafka as a messaging system and ingested the data into Spark Streaming (see the sketch after this list).
  • Deployed the project on Amazon EMR with S3 connectivity.
  • Implemented usage of Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Worked on AWS cloud services (EC2, S3, RDS, CloudWatch, Redshift, EMR, Kinesis).
  • Loaded the data into Simple Storage Service (S3) in the AWS Cloud.
  • Good knowledge of using Amazon load balancers and Auto Scaling for EC2 servers.
  • Worked with a high-quality data lake and data warehousing team, helped design the team to scale, and built cross-functional relationships with data analysts.
  • Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets. Created a Lambda deployment function and configured it to receive events from the S3 bucket.
  • Designed the data models to be used in data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability.
  • Worked with a Data Lake and a Data Swamp, and with NoSQL, the technology responsible for the benefits of a Data Lake.
  • Executed the Spark jobs in Amazon EMR.
  • Migrated an existing on-premises application to AWS.
  • Initially migrated existing MapReduce programs to the Spark model using Python.
  • Designed data visualizations to present the current impact and growth of the department using the Python package Matplotlib.
  • Involved in data analysis using Python and handling ad-hoc requests as per requirements.
  • Developing Python scripts for automating tasks.
  • Provide support to data analysts in running Pig and Hive queries.
  • Develop and deploy the outcome using Spark and Scala code in a Hadoop cluster running on GCP.
  • Experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
  • Involved in writing HiveQL and Pig Latin.
  • Creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.
  • Developed a data warehouse model in Snowflake for datasets using WhereScape.
  • Involved in migrating objects from Teradata to Snowflake.
  • Importing and exporting data from MySQL/Oracle to Hive using Sqoop.
  • Configured an HA cluster for both manual failover and automatic failover.
  • Excellent working knowledge of Spark Core, Spark SQL, and Spark Streaming.
  • Extensive experience importing and exporting data using stream processing platforms like Flume and Kafka.
  • Developed a framework for converting existing PowerCenter mappings to PySpark.
  • Worked with the development team on PySpark as an ETL platform.
  • Optimized Hive queries using best practices and the right parameters, and using technologies like Hadoop, YARN, Python, and PySpark.
  • Performed API-level testing for web services, enhanced the test harness, and developed many test suites using XML and Python.
  • Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
  • Designed and built many applications to deal with vast amounts of data flowing through multiple Hadoop clusters, using Pig Latin and Java-based MapReduce.
  • Specified the cluster size, allocated resource pools, and distributed Hadoop by writing the specification texts in JSON file format.
  • Responsible for defining the data flow within the Hadoop ecosystem and directing the team in implementing it.
  • Designs and develops test plans for ETL unit testing and integration testing.
  • Involved in converting Hive/SQL queries into Spark transformations using APIs like Spark SQL and DataFrames in Python.
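
A minimal PySpark sketch of the Kafka-to-Spark Streaming weblog ingestion described above, assuming Spark Structured Streaming with the spark-sql-kafka connector on the classpath (the same ingestion could equally use the older DStream API); the broker, topic, and HDFS paths are hypothetical placeholders.

    # Illustrative sketch only: read weblog events from a Kafka topic and land them on HDFS.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("weblog-kafka-ingest")
             .getOrCreate())

    # Subscribe to the weblog topic as a streaming DataFrame (placeholder broker/topic).
    weblogs = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker1:9092")
               .option("subscribe", "weblogs")
               .load()
               .selectExpr("CAST(value AS STRING) AS value"))

    # Persist the raw lines to HDFS in near real time (placeholder paths).
    query = (weblogs.writeStream
             .format("text")
             .option("path", "hdfs:///data/weblogs/raw")
             .option("checkpointLocation", "hdfs:///checkpoints/weblogs")
             .start())

    query.awaitTermination()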

Environment: Hortonworks Big Data platform, Apache Hadoop, Hive, Python, Hue, Zookeeper, MapReduce, Sqoop, Crunch API, Pig 0.10 and 0.11, HCatalog, Unix, Java, JSP, Eclipse, Maven, Oracle, SQL Server, Linux, MySQL.

Confidential, Chicago, IL

Azure Data Engineer

Responsibilities:

  • Analyze, design, and build modern data solutions using Azure PaaS services to support visualization of data. Understand the current production state of the application and determine the impact of the new implementation on existing business processes.
  • Extract, transform, and load data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingest data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and process the data in Azure Databricks.
  • Migrated SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlled and granted database access; and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources like Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including a write-back tool and the reverse direction.
  • Worked on the Analytics Infrastructure team to develop a stream filtering system on top of Apache Kafka and Storm.
  • Implemented large Lambda architectures using Azure data platform capabilities like Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML, and Power BI.
  • Demonstrated expert-level technical capabilities in areas of Azure batch and interactive solutions, Azure Machine Learning solutions, and operationalizing end-to-end Azure cloud analytics solutions.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Involved in data analysis using Python and handling ad-hoc requests as per requirements.
  • Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Experienced in performance tuning of Spark applications: setting the right batch interval time, the correct level of parallelism, and memory tuning.
  • Wrote UDFs in Scala and PySpark to meet specific business requirements (see the sketch after this list).
  • Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL activity.
  • Hands-on experience developing SQL scripts for automation purposes.
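
A minimal PySpark sketch of the kind of business-rule UDF mentioned above, written against the standard pyspark.sql.functions.udf API; the column name and the banding rule are hypothetical examples, not the actual business logic.

    # Illustrative sketch only: register and apply a simple Python UDF.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()

    @udf(returnType=StringType())
    def usage_band(monthly_usage):
        # Hypothetical rule: bucket customer usage into bands.
        if monthly_usage is None:
            return "unknown"
        return "high" if monthly_usage > 1000 else "low"

    df = spark.createDataFrame([(1, 1500), (2, 300)], ["customer_id", "monthly_usage"])
    df.withColumn("band", usage_band("monthly_usage")).show()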

Environment: Hadoop, MapReduce, HDFS, Pig, Hive, Spark, Kafka, IntelliJ, ADF, Cosmos, sbt, Zeppelin, YARN, Scala, SQL, Git.

Confidential, Santa Clara, CA

Big Data Developer

Responsibilities:

  • Processed Big Data using a Hadoop cluster consisting of 40 nodes.
  • Designed and configured Flume servers to collect data from the network proxy servers and store it to HDFS.
  • Loaded the customer profile data, customer spending data, and credit data from legacy warehouses onto HDFS using Sqoop.
  • Applied transformations and filtered traffic using Pig.
  • Used pattern-matching algorithms to recognize the customer across different sources, built risk profiles for each customer using Hive, and stored the results in HBase.
  • Performed unit testing using MRUnit.
  • Used Spark Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model, which gets the data from Kafka in near real time and persists it into Cassandra.
  • Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
  • Designed and developed a POC in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Hands-on experience in AWS Cloud with various AWS services such as Redshift clusters and Route 53 domain configuration.
  • Consumed the data from Kafka using Apache Spark.
  • Performed various benchmarking steps to optimize the performance of Spark jobs and thus improve the overall processing.
  • Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive, and was involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
  • Set up and benchmarked Hadoop/HBase clusters for internal use.
  • Developed simple to complex MapReduce jobs using Hive and Pig.
  • Involved in converting MapReduce programs into Spark transformations using Spark RDDs in Python (see the sketch after this list).
  • Developed Spark scripts using Python shell commands as per the requirements.
  • Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted the data from MySQL into HDFS using Sqoop.
  • Developed merge jobs in Python to extract and load data into a MySQL database.
  • Analyzed the data by performing Hive queries and running Pig scripts to study employee behavior.
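
A minimal PySpark sketch of converting a classic MapReduce-style job (word count over log files) into Spark RDD transformations, as mentioned above; the input and output paths are hypothetical placeholders.

    # Illustrative sketch only: MapReduce-style word count rewritten with Spark RDDs.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-rdd")

    counts = (sc.textFile("hdfs:///data/logs/sample.txt")    # map phase: split lines into words
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))            # reduce phase: sum the counts

    counts.saveAsTextFile("hdfs:///data/logs/wordcount_out")
    sc.stop()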

Environment: Hadoop, Hive, Zookeeper, Python, MapReduce, Sqoop, Pig 0.10 and 0.11, JDK 1.6, HDFS, Flume, Oozie, DB2, HBase, Mahout, Unix, Linux

Confidential

Big Data Developer

Responsibilities:

  • Worked with respective business units in understanding the scope of the analytics requirements.
  • Performed core ETL transformations in Spark.
  • Automated data pipelines which involve data ingestion, data cleansing, data preparation and data analytics.
  • Created end to end Spark applications using Python to perform various data cleansing, validation, transformation, and summarization activities on user behavioral data.
  • Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, DataFrames, and the Spark SQL API.
  • Developed end-to-end data pipeline using FTP Adaptor, Spark, Hive, and Impala.
  • Used Python to write code for all Spark use cases.
  • Implemented design patterns in Scala for the application.
  • Implemented Spark using Scala and utilized Spark SQL heavily for faster development and processing of data.
  • Explored Spark for improving the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and YARN.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Scala (see the sketch after this list).
  • Handled importing other enterprise data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HBase tables.
  • Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
  • Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Created components like Hive UDFs for functionality missing in Hive for analytics.
  • Worked on various performance optimizations like using the distributed cache for small datasets, and partitioning, bucketing, and map-side joins in Hive.
  • Created Oozie workflows and coordinators to automate data pipelines daily, weekly, and monthly.
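
A minimal sketch of converting a Hive/SQL query into equivalent Spark DataFrame transformations, as described above; the bullet mentions Spark SQL and Scala, but this sketch uses PySpark for consistency with the other Python examples here, and the table and column names are hypothetical.

    # Illustrative sketch only: the same aggregation expressed in HiveQL and in the DataFrame API.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("hive-to-dataframe")
             .enableHiveSupport()
             .getOrCreate())

    # Original HiveQL-style query.
    sql_result = spark.sql("""
        SELECT user_id, COUNT(*) AS events
        FROM user_behavior
        GROUP BY user_id
    """)

    # Equivalent DataFrame transformations.
    df_result = (spark.table("user_behavior")
                 .groupBy("user_id")
                 .agg(F.count("*").alias("events")))

    df_result.show()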

Environment: Snowflake Web UI, SnowSQL, Hadoop MapR 5.2, Hive, Hue, Azure, Control-M, AWS, Teradata Studio, Oracle 12c, Tableau, Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Spark MLlib

Confidential

Big Data Developer/Admin

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop.
  • Analyzed large amounts of data sets to determine the optimal way to aggregate and report on them.
  • Developed simple to complex MapReduce jobs using Hive and Pig.
  • Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted the data from MySQL into HDFS using Sqoop.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team. Extensively used Pig for data cleansing.
  • Created partitioned tables in Hive (see the sketch after this list). Managed and reviewed Hadoop log files.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting. Installed and configured Pig and wrote Pig Latin scripts.
  • Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data; responsible for managing data coming from different sources.
  • Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
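
A minimal PySpark sketch of loading data into a partitioned Hive table like the one described above, assuming a Spark session with Hive support; the table name, partition column, and staging path are hypothetical placeholders.

    # Illustrative sketch only: append one day's logs into a date-partitioned Hive table.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("hive-partitioned-load")
             .enableHiveSupport()
             .getOrCreate())

    # Read one day's raw web-server logs from a staging area (placeholder path).
    logs = spark.read.json("hdfs:///staging/web_logs/2016-01-01")

    # Append the data into a Hive table partitioned by log date.
    (logs.withColumn("log_date", F.lit("2016-01-01"))
         .write.mode("append")
         .partitionBy("log_date")
         .format("parquet")
         .saveAsTable("web_logs"))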

Environment: Hadoop, MapReduce, HDFS, Hive, Pig, Java, SQL, Sqoop, Java (JDK 1.6), Eclipse, Git, Unix, Linux, Subversion.
