
Big Data ETL Engineer Resume


San Antonio, TX

PROFESSIONAL SUMMARY:

  • 7 years of experience in the field of data analytics, data processing and database technologies.
  • 6 years of experience with the Hadoop ecosystem and Big Data tools and frameworks.
  • Ability to troubleshoot and tune code in languages such as SQL and Python, and to design clear solutions from well-defined problem statements.
  • Accustomed to working with large, complex data sets, real-time/near-real-time analytics, and distributed big data platforms.
  • Proficient in the major vendor Hadoop distributions: Cloudera, Hortonworks, and MapR.
  • Deep knowledge of incremental imports and of the partitioning and bucketing concepts in Hive and Spark SQL needed for optimization (see the sketch after this list).
  • Experience collecting real-time log data from sources such as web server logs and social media feeds (Facebook, Twitter) using Flume, and staging it in HDFS for further analysis.
  • Experience deploying large, multi-node Hadoop and Spark clusters.
  • Experience developing custom large-scale enterprise applications using Spark for data processing.
  • Experience developing Oozie workflows for scheduling and orchestrating the ETL process.
  • Excellent knowledge of Hadoop architecture and its ecosystem, including HDFS, node configuration, YARN, MapReduce, Spark, Falcon, HBase, Hive, Pig, Sentry, and Ranger.
  • Developed scripts to automate end-to-end data management and synchronization between clusters.
  • Strong hands-on experience with the Hadoop framework and its ecosystem, including but not limited to HDFS architecture, MapReduce programming, Hive, Sqoop, HBase, and Oozie.
  • Worked on disaster recovery for Hadoop clusters.
  • Involved in building a multi-tenant cluster.
  • Experience migrating mainframe data and batch workloads to Hadoop.
  • Hands-on experience installing and configuring the Cloudera and Hortonworks distributions.
  • Extending Hive and Pig core functionality by writing custom UDFs.
  • Extensively used Apache Flume to collect logs and error messages across the cluster.
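
The following is a minimal PySpark sketch of the Hive/Spark SQL partitioning and bucketing approach referenced above; the dataset, column names, and table name are hypothetical placeholders rather than details from any specific engagement.

  from pyspark.sql import SparkSession

  # A minimal sketch, assuming a hypothetical "events" dataset with an
  # event_date partition column and a customer_id join key.
  spark = (SparkSession.builder
           .appName("partition-bucket-sketch")
           .enableHiveSupport()
           .getOrCreate())

  df = spark.read.parquet("/data/raw/events")  # hypothetical source path

  # Partition by a low-cardinality column and bucket by the join key so that
  # predicate pruning and bucketed joins can skip unnecessary data.
  (df.write
     .mode("overwrite")
     .partitionBy("event_date")
     .bucketBy(32, "customer_id")
     .sortBy("customer_id")
     .saveAsTable("analytics.events_partitioned"))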

TECHNICAL SKILLS:

Programming Languages & IDEs: Unix shell scripting, Object-oriented design, Object-oriented programming, Functional programming, SQL, HiveQL, Python, XML, REST API, Jupyter Notebooks, IntelliJ, PyCharm

Data & File Management: Apache Cassandra, Apache HBase, MapR-DB, MongoDB, Oracle, SQL Server, DB2, RDBMS, MapReduce, HDFS, Parquet, Avro, JSON, Snappy, Gzip, DAS, NAS, SAN

Methodologies: Agile, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Design Thinking, Lean, Six Sigma

Cloud Services & Distributions: Azure, Elasticsearch, Solr, Lucene, Cloudera, Databricks, Hortonworks, MapR

Big Data Platforms, Software, & Tools: Apache Ant, Apache Cassandra, Apache Flume, Apache Hadoop, Apache Hadoop YARN, Apache HBase, Apache HCatalog, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Pig, Apache Spark, Spark Streaming, Spark MLlib, GraphX, SciPy, Pandas, RDDs, DataFrames, Datasets, Mesos, Apache Tez, Apache ZooKeeper, Cloudera Impala, HDFS, Hortonworks, MapR, MapReduce, Apache Airflow, Apache Camel, Apache Lucene, Elasticsearch, Elastic Cloud, Kibana, Apache Solr, Apache Drill, Presto, Hue, Sqoop, AWS, Cloud Foundry, GitHub, Bitbucket

PROFESSIONAL EXPERIENCE:

Big Data ETL Engineer

Confidential, San Antonio, TX

Responsibilities:

  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage) and processed it in Azure Databricks.
  • Created pipelines in ADF using linked services and datasets to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back to the source systems.
  • Developed Spark applications using PySpark and Spark SQL to extract, transform, and aggregate data from multiple file formats, uncovering insights into customer usage patterns (see the sketch after this list).
  • Tuned the performance of Spark applications by setting the correct level of parallelism and tuning memory.
  • Migrated on-premises data (Oracle, SQL Server, Teradata) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF v1/v2) and AzCopy.
  • Designed end-to-end ETL processes and mappings to data lakes and Blob storage that were used enterprise-wide.
  • Performed data analysis and collaborated with the downstream analytics team to shape the data to their requirements.
  • Conducted code reviews for team members to ensure proper test coverage and consistent code standards.
  • Worked in an Agile development environment with two-week sprint cycles, dividing and organizing tasks accordingly.
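
The following is a minimal PySpark sketch of the kind of multi-format extract/transform/aggregate job referenced above; the ADLS paths, column names, and configuration values are hypothetical placeholders, not production settings.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = (SparkSession.builder
           .appName("usage-aggregation-sketch")
           # illustrative parallelism and memory settings, not tuned values
           .config("spark.sql.shuffle.partitions", "200")
           .config("spark.executor.memory", "8g")
           .getOrCreate())

  # Read two hypothetical source feeds in different formats.
  usage = spark.read.json("abfss://raw@account.dfs.core.windows.net/usage/")
  plans = spark.read.parquet("abfss://raw@account.dfs.core.windows.net/plans/")

  # Join the feeds, then aggregate daily usage per customer.
  daily_usage = (usage
                 .join(plans, "customer_id")
                 .groupBy("customer_id", F.to_date("event_ts").alias("usage_date"))
                 .agg(F.count("*").alias("events"),
                      F.sum("bytes_used").alias("total_bytes")))

  daily_usage.write.mode("overwrite").parquet(
      "abfss://curated@account.dfs.core.windows.net/daily_usage/")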

Environment: HDFS, Hive, Cloudera, Databricks, Azure, MySQL, Spark, PySpark, Informatica, Tableau, MicroStrategy

Big Data Developer

Confidential, Miami, FL

Responsibilities:

  • Developed and maintained a 39-node MapR cluster that supported about 50 BI analysts, business users, and developers.
  • Implemented workflows using the Apache Oozie and Airflow frameworks to automate tasks.
  • Designed and presented a POC on introducing Impala into the project architecture.
  • Implemented YARN resource pools to share cluster resources among YARN jobs submitted by users.
  • Administered the MapR cluster and reviewed log files of all daemons when problems occurred.
  • Tuned the Hive service for better query performance on ad-hoc queries.
  • Collected, aggregated, and moved data from servers to HDFS using Apache Spark and Sqoop.
  • Used Spark SQL to perform analytics on data in Hive.
  • Optimized data storage in Hive using partitioning and bucketing on both managed and external tables.
  • Used Impala where possible to achieve faster results than Hive during data analysis.
  • Implemented real-time data ingestion and cluster handling using Kafka.
  • Tuned Spark Streaming performance, e.g., setting the right batch interval, using the correct level of parallelism, selecting appropriate serialization, and tuning memory (see the sketch after this list).
  • Developed programs using Apache Spark and PySpark RDD operations.
  • Performed storage capacity management, performance tuning and benchmarking of clusters.
  • Created Hive tables, loaded them with data, and wrote Hive queries that invoke MapReduce jobs in the backend.
  • Worked on disaster recovery for the Hadoop cluster.
  • Created Hive external tables and designed data models in Hive.
  • Performed both major and minor upgrades to the existing MapR Hadoop cluster.
  • Implemented high availability using vSphere images.
  • Worked with clients to better understand their reporting and dashboarding needs and presented solutions using structured Waterfall and Agile project methodologies.
  • Integrated Hadoop with Active Directory and enabled LDAP for Authentication.
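
The following is a minimal PySpark sketch of the Kafka-to-HDFS ingestion and tuning referenced above. The bullets describe Spark Streaming; Structured Streaming is shown here only as one analogous way to express it, and the brokers, topic, and paths are hypothetical (the Kafka connector package is assumed to be on the classpath).

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = (SparkSession.builder
           .appName("kafka-ingest-sketch")
           # serialization and parallelism tuning knobs mentioned above
           .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
           .config("spark.sql.shuffle.partitions", "64")
           .getOrCreate())

  events = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
            .option("subscribe", "pos-events")
            .option("startingOffsets", "latest")
            .load()
            .select(F.col("value").cast("string").alias("payload"),
                    F.col("timestamp")))

  query = (events.writeStream
           .format("parquet")
           .option("path", "hdfs:///data/streaming/pos_events")
           .option("checkpointLocation", "hdfs:///checkpoints/pos_events")
           # micro-batch trigger, analogous to the DStream batch interval
           .trigger(processingTime="30 seconds")
           .start())

  query.awaitTermination()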

Environment: HDFS, Hive, Sqoop, Oozie, ZooKeeper, MapR, MySQL, Sentry, Spark, PySpark, YARN, MapReduce, Informatica, Tableau, MicroStrategy

DevOps Architect

Confidential - Orlando, FL

Responsibilities:

  • Consulted on a project to get real-time insights into customer experience, what drives it, and the impact of collaborative offers in a competitive market. The consulting team built a system to analyze customer data derived from POS systems, including loyalty programs and promotions; the system analyzed ERP, CRM, conversion, social media, and various other disparate data sources.
  • Worked on importing and exporting data between HDFS and RDBMS using Sqoop.
  • Optimized data storage in Hive using partitioning and bucketing on both managed and external tables.
  • Used Impala where possible to achieve faster results than Hive during data analysis.
  • Implemented workflows using the Apache Oozie framework to automate tasks.
  • Implemented real-time data ingestion and cluster handling using Kafka.
  • Designed and presented a POC on introducing Impala into the project architecture.
  • Implemented YARN resource pools to share cluster resources among YARN jobs submitted by users.
  • For one use case, used Spark Streaming with Kafka, HDFS, and MongoDB to build a continuous ETL pipeline for real-time analytics on the data.
  • Administered the Hadoop cluster (CDH) and reviewed log files of all daemons.
  • Tuned Spark Streaming performance, e.g., setting the right batch interval, using the correct level of parallelism, selecting appropriate serialization, and tuning memory.
  • Tuned the Hive service for better query performance on ad-hoc queries.
  • Collected, aggregated, and moved data from servers to HDFS using Apache Spark and Spark Streaming.
  • Used the Spark API over Hadoop YARN to perform analytics on data in Hive.
  • Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS.
  • Performed data ingestion using Flume with a Kafka source and an HDFS sink.
  • Migrated ETL jobs to Pig scripts for transformations, joins, and aggregations before loading into HDFS.
  • Migrated complex MapReduce programs into Apache Spark RDD operations (see the sketch after this list).
  • Performed storage capacity management, performance tuning and benchmarking of clusters.
  • Created Hive tables, loaded them with data, and wrote Hive queries that invoke MapReduce jobs in the backend.
  • Worked on disaster recovery for the Hadoop cluster.
  • Created Hive external tables and designed data models in Hive.
  • Performed both major and minor upgrades to the existing Cloudera Hadoop cluster.
  • Implemented high availability of the NameNode and ResourceManager on the Hadoop cluster.
  • Used the Spark SQL and DataFrame APIs to load structured and semi-structured data into Spark clusters.
  • Involved in designing the Cassandra architecture, including data modeling.
  • Integrated Hadoop with Active Directory and enabled Kerberos for Authentication.
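
The following is a minimal PySpark sketch of rewriting a MapReduce-style job as RDD operations, as referenced above; the input path and record layout (store_id in the first field, amount in the third) are hypothetical.

  from pyspark import SparkConf, SparkContext

  sc = SparkContext(conf=SparkConf().setAppName("mr-to-rdd-sketch"))

  # Equivalent of a mapper emitting (store_id, amount) pairs and a reducer
  # summing amounts per key.
  lines = sc.textFile("hdfs:///data/pos/transactions")   # one CSV record per line
  pairs = (lines.map(lambda line: line.split(","))
                .map(lambda fields: (fields[0], float(fields[2]))))
  totals = pairs.reduceByKey(lambda a, b: a + b)          # shuffle + combine, like the reduce phase

  totals.saveAsTextFile("hdfs:///data/pos/store_totals")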

Environment: HDFS, Pig, Hive, Sqoop, Oozie, ZooKeeper, Cloudera Manager, Ambari, Oracle, MySQL, Cassandra, Sentry, Falcon, Spark, YARN, MapReduce

Hadoop Engineer

Confidential - San Antonio, TX

Responsibilities:

  • Consulted on projects for clients using POS-system analytics and automation to capture real-time customer transaction data for predictive analytics and forecasting. Analysis of the data supported inventory, logistics, and merchandising as well as sales and marketing. The systems tracked customer activity in real time and provided advanced analytics on data streams, such as windowing, event correlation, event clustering, and anomaly detection.
  • Built continuous Spark streaming ETL pipeline with Spark, Kafka, Scala, HDFS and MongoDB.
  • Scheduled and executed workflows in Oozie to run Hive and Pig jobs.
  • Worked on importing unstructured data into HDFS using Spark Streaming and Kafka.
  • Worked on installing clusters, commissioning and decommissioning data nodes, configuring slots, NameNode high availability, and capacity planning.
  • Configured Fair Scheduler to allocate resources to all the applications across the cluster.
  • Configured Spark Streaming to receive real-time data from Kafka and store it in HDFS using Scala.
  • Loaded data from different servers to AWS S3 buckets and set appropriate bucket permissions (see the sketch after this list).
  • Extracted data from different databases and scheduled Oozie workflows to execute the tasks daily.
  • Wrote shell scripts to execute scripts (Pig, Hive, and MapReduce) and move the data files to/from HDFS.
  • Used Zookeeper for various types of centralized configurations, GIT for version control, and Maven as a build tool for deploying the code.
  • Handled 20 TB of data volume on a 120-node cluster in the production environment.
  • Created Hive tables, loaded the data, and wrote Hive queries.
  • Worked with SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Developed various data connections from data sources to SSIS and Tableau Server for report and dashboard development.
  • Analyzed Hadoop cluster using big data analytic tools including Kafka, Pig, Hive, Spark, MapReduce.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
  • Developed metrics, attributes, filters, reports, and dashboards, and created advanced chart types, visualizations, and complex calculations to manipulate the data.
  • Used Hive and Spark SQL connections to generate Tableau BI reports.
  • Imported data into HDFS and Hive using Sqoop and Kafka; created Kafka topics and distributed data to different consumer applications.
  • Created Hive generic UDFs to process business logic that varies by policy.
  • Worked with Amazon Web Services (AWS) and involved in ETL, Data Integration and Migration.
  • Worked with clients to better understand their reporting and dashboarding needs and presented solutions using structured Waterfall and Agile project methodologies.
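
The following is a minimal boto3 sketch of loading files into S3 and setting bucket permissions, as referenced above; the bucket name, object key, local path, and policy are hypothetical placeholders.

  import json
  import boto3

  s3 = boto3.client("s3")

  # Upload a local extract to the (hypothetical) landing bucket.
  s3.upload_file("/data/exports/daily_extract.csv",
                 "example-landing-bucket",
                 "pos/daily_extract.csv")

  # Apply an illustrative bucket policy that denies non-TLS access.
  policy = {
      "Version": "2012-10-17",
      "Statement": [{
          "Sid": "DenyInsecureTransport",
          "Effect": "Deny",
          "Principal": "*",
          "Action": "s3:*",
          "Resource": ["arn:aws:s3:::example-landing-bucket",
                       "arn:aws:s3:::example-landing-bucket/*"],
          "Condition": {"Bool": {"aws:SecureTransport": "false"}}
      }]
  }
  s3.put_bucket_policy(Bucket="example-landing-bucket", Policy=json.dumps(policy))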

Environment: Hadoop, HDFS, Hive, Spark, YARN, MapReduce, Kafka, Pig, MongoDB, Sqoop, Storm, Cloudera, Impala, ZooKeeper, Oozie
