Data Engineer Resume
Bronx, NY
SUMMARY:
- Over 7 years of professional IT experience, including work with Big Data ecosystem technologies.
- 5 years of hands-on experience with Hadoop, HDFS, the MapReduce framework, and Hadoop ecosystem tools such as Hive, HBase, Sqoop, and Oozie.
- Excellent understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm. Working experience with standard methodologies such as the System Development Life Cycle (SDLC), Rational Unified Process (RUP), and Agile/Scrum.
- Expertise in Big Data analytics and data manipulation using Hadoop ecosystem tools such as MapReduce, HDFS, YARN, Pig, Hive, Spark, Flume, Sqoop, Avro, AWS, and ZooKeeper.
- Experienced in cloud services such as AWS EC2, EMR, RDS, and S3 to support Big Data tools, address data storage issues, and deploy solutions.
- Experience working with DStreams, accumulators, broadcast variables, and RDD caching for Spark Streaming.
- Able to build deployments on AWS, write scripts (Boto3 and AWS CLI), and automate solutions using Shell and Python.
- Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, Auto Scaling groups, and AWS CLI.
- Hands-on experience developing Spark applications using Spark components such as RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.
- Experienced in data integration, validation, and data quality controls for ETL processes and data warehousing using MS Visual Studio SSIS, SSAS, and SSRS.
- Experience writing complex SQL queries, stored procedures, and triggers to obtain filtered data from RDBMSs such as SQL Server and Teradata, and working with NoSQL databases such as MongoDB and HBase.
- Experience in performing ETL on top of streaming log data from various web servers into HDFS using Flume.
- Good experience in building pipelines using Azure Data Factory and moving the data into Azure Data Lake Store.
- Experience integrating Kafka and Spark, using Avro to serialize and deserialize data for Kafka producers and consumers.
- Proficient in Oozie for workflow management, with separate workflows for each layer, such as Staging, Transformation, and Archive.
- Experience in large-scale data processing with Hadoop by writing Pig Latin scripts, Hive queries, and UDFs.
- Experience with Splunk Enterprise deployments and enabling continuous integration as part of configuration management.
- Experience building Spark batch applications in Scala to ingest data into a common data lake.
- Expertise in Hadoop cluster capacity planning, performance tuning, cluster monitoring, and troubleshooting.
- Expertise in transforming business requirements into models, algorithm designs, data mining, and reporting solutions across large volumes of structured and unstructured data.
TECHNICAL SKILLS:
Big Data Ecosystem: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, Oozie, Flume, HBase, Zookeeper, Hue, Cloudera, Hortonworks, Spark, Scala, Kafka, Storm
Programming Skills: C, C++, Core Java, Shell Scripting, PL/SQL, Scala, Python
Java/J2EE: J2EE, JSF, Servlets, Struts, Spring
Web Technologies: HTML, CSS, XML, JDBC, JSP, JSTL, Web Services
Operating System: Windows, Linux, Unix
Design: UML, E-R Modelling, Rational Rose
Tools: Eclipse, TOAD, Maven, Git, AWS, Bitbucket
Database: MySQL, SQLite, Oracle, MS SQL Server, HBase
EXPERIENCE:
Data Engineer
Confidential, Bronx, NY
Responsibilities:
- Participated in release planning, sprint ceremonies, and daily stand-ups; responsible for the design and development of Big Data applications using Cloudera Hadoop.
- Collaborated with data engineers and the operations team to implement ETL processes; wrote and optimized SQL queries to extract data that fit the analytical requirements.
- Created Hive tables, loaded them with data, and wrote Hive queries that execute internally as MapReduce jobs.
- Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
- Designed a web crawler to collect customer-related data and store it in a JSON file.
- Handled importing data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
- Developed a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Designed both 3NF data models for ODS and OLTP systems and dimensional data models using star and snowflake schemas.
- Involved in integrating HBase with PySpark to import data into HBase and performed CRUD operations on HBase.
- Analyzed data and predicted end customer behaviors and product performance by applying machine learning algorithms using Spark MLlib.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Used Sqoop to transfer data between HDFS and RDBMS in both directions for visualization and report generation.
- Involved in migrating ETL processes from Oracle to Hive to evaluate the ease of data manipulation.
- Created detailed AWS Security Groups, which behaved as virtual firewalls that controlled the traffic allowed to reach one or more AWS EC2 instances.
- Used Tableau to build and publish customized interactive visualizations that present analysis results, highlighting patterns, anomalies, and predictions.
- Used cross-validation to test the model on batches of data and tuned its parameters, which improved model performance.
- Performed unit and system testing to validate the output of the above data wrangling techniques against the expected results.
Environment: Hadoop, HDFS, Cloudera, Python, AWS, Spark, YARN, MapReduce, Hive, Teradata SQL, PL/SQL, Pig, Talend, Data Lake, Data Integration 6.1/5.5.1 (ETL), Kafka, Sqoop, Oozie, HBase, Cassandra, Java, Scala, UNIX Shell Scripting
Data Engineer
Confidential, Irving, TX
Responsibilities:
- Performed cluster capacity planning with the operations and management teams, and handled cluster maintenance, addition and removal of nodes, and HDFS support and maintenance.
- Strong knowledge of rack awareness topology in the Hadoop cluster.
- Involved in loading data from the Linux file system to the Hadoop Distributed File System (HDFS).
- Responsible for building scalable distributed data solutions using Hadoop.
- Managed and reviewed Hadoop log files.
- Migrated data from RDBMS to Hadoop using Sqoop for analysis and implemented Oozie jobs for automatic data imports from the source.
- Created HBase tables to store various data formats of PII data coming from different portfolios.
- Exported the analyzed and processed data to relational databases using Sqoop for visualization and report generation for the team.
- Involved in integrating HBase with PySpark to import data into HBase and performed CRUD operations on HBase.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Implemented Cassandra connections with Spark Resilient Distributed Datasets (RDDs), both local and cloud.
- Used Sqoop for various data transfers through HBase tables for processing into several NoSQL databases, including Cassandra and MongoDB.
- Developed Hadoop data processes using Hive and Impala.
- Imported and exported data into and out of HDFS using Sqoop.
- Implemented Pig and Hive queries and developed UDFs to pre-process the data for analysis.
- Used the Hive data warehouse tool to analyze the data in HDFS and developed Hive queries.
- Created external tables with proper partitions for efficiency and loaded the structured data in HDFS that resulted from MapReduce jobs.
- Built pipelines to move hashed and un-hashed data from Azure Blob to Data Lake.
- Utilized Azure HDInsight to monitor and manage the Hadoop Cluster.
- Collaborated on insights with Data Scientists, Business Analysts and Partners.
- Performed advanced procedures such as text analytics and processing, using Spark's in-memory computing capabilities with Python.
- Created pipelines to move data from on-premises servers to Azure Data Lake.
- Designed and implemented partitioning and bucketing in Hive.
- Supported QA environment setup and updated configurations for implementing Pig and Sqoop scripts.
- Analyzed large data sets by running Hive queries and Pig scripts.
- Worked with the data science team to gather requirements for various data mining projects.
Environment: Cloudera, Hadoop, MapReduce, HDFS, Pig, Sqoop, Hive, HBase, Cassandra, MySQL, NoSQL, Shell Scripting, Linux, Zookeeper, Impala, Maven, Eclipse
Hadoop Developer
Confidential, Salt Lake City, UT
Responsibilities:
- Primary responsibilities included building scalable distributed data solutions using the Hadoop ecosystem.
- Analyzed business requirements, estimated tasks, and prepared design documents for converting the existing Ab Initio and Teradata code to Hive/Spark SQL.
- Developed Spark SQL logic that mimics the Teradata ETL logic and pointed the output delta back to newly created Hive tables as well as the existing Teradata dimension, fact, and aggregate tables.
- Imported data from Ab Initio LDR (Load Ready) files into Spark RDDs and performed transformations and actions on the RDDs.
- Experienced in designing and deploying Hadoop clusters and different Big Data analytics tools, including Pig, Hive, Flume, HBase, and Sqoop.
- Loaded CDRs into the Hadoop cluster from relational databases using Sqoop and from other sources using Flume.
- Implemented quality checks and transformations using Spark.
- Developed simple and complex MapReduce programs in Hive, Pig, and Python for data analysis on different data formats.
- Performed data transformations by writing MapReduce and Pig scripts as per business requirements.
- Implemented MapReduce programs to handle semi-structured and unstructured data such as XML, JSON, Avro data files, and sequence files for log files.
- Developed various Python scripts to find vulnerabilities in SQL queries through SQL injection testing, permission checks, and analysis.
- Experienced in using Kerberos authentication to establish more secure network communication on the cluster.
- Analyzed substantial data sets by running Hive queries and Pig scripts.
- Managed and reviewed Hadoop and HBase log files.
- Experience creating, dropping, and altering tables at run time without blocking updates and queries, using Spark and Hive.
- Experienced in writing Spark Applications in Scala and Python.
- Used Spark SQL to handle structured data in Hive.
- Imported semi-structured data from Avro files using Pig to make serialization faster.
- Processed web server logs by developing multi-hop Flume agents using Avro sinks and loaded the data into MongoDB for further analysis.
- Experienced in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
- Experienced in connecting Avro sink ports directly to Spark Streaming for analysis of web logs.
- Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
- Managed and scheduled jobs on a Hadoop cluster using UC4 (Confidential proprietary scheduling tool) workflows.
- Continuously monitored and managed the Hadoop cluster through the Hortonworks (HDP) distribution.
- Configured various views in YARN Queue Manager.
- Involved in review of functional and non-functional requirements.
- Indexed documents using Elasticsearch.
- Responsible for using a Flume sink to remove data from the Flume channel and deposit it in a NoSQL database such as MongoDB.
- Involved in loading data from UNIX file system and FTP to HDFS.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Loaded JSON-style documents into a NoSQL database (MongoDB) and deployed the data to the Amazon Redshift cloud service.
- Responsible for developing a data pipeline with AWS to extract data from web logs and store it in Amazon EMR and Azure.
- Used Zookeeper to provide coordination services to the cluster.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with reference tables and historical metrics.
- Involved in migrating tables from RDBMS into Hive tables using Sqoop and later generated data visualizations using Tableau.
- Experience processing large volumes of data and executing processes in parallel using Ab Initio functionality.
- Designed and implemented Spark jobs to support distributed data processing.
- Experience optimizing MapReduce programs using combiners, partitioners, and custom counters.
- Wrote shell scripts to monitor the health of Hadoop daemon services and respond to any warning or failure conditions.
- Involved in Hadoop cluster tasks such as adding and removing nodes without affecting running jobs or data.
Environment: Hortonworks (HDP), Hadoop, Spark, Sqoop, Flume, Elasticsearch, AWS, EC2, S3, Pig, Hive, MySQL, Python, MapReduce, HDFS, Tableau, Ab Initio
Hadoop Developer
Confidential, South Jordan, UT
Responsibilities:
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
- Extracted the data from MySQL into HDFS using Sqoop.
- Exported the analyzed data to the Relational databases using Sqoop for visualization and to generate reports for the BI team.
- Developed simple to complex MapReduce jobs.
- Analyzed the data by running Hive queries and Pig scripts to understand user behavior.
- Created partitioned tables in Hive.
- Administered and supported the Hortonworks distribution.
- Wrote Korn shell, Bash shell, and Perl scripts to automate most DB maintenance tasks.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Imported and exported data into HDFS and Hive using Sqoop.
- Responsible for managing data coming from different sources.
- Monitored running MapReduce programs on the cluster.
- Responsible for loading data from UNIX file systems to HDFS.
- Installed and configured Hive and created Hive UDFs.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Developed scripts to automate end-to-end data management and synchronization between clusters.
Environment: Apache Hadoop, Java, Bash, ETL, MapReduce, Hive, Pig, Hortonworks, Deployment tools, DataStax, Flat files, Oracle 11g/10g, MySQL, Windows NT, UNIX, Sqoop, Oozie
