Big Data Engineer Resume
Tampa, Florida
SUMMARY
- 5+ years of professional experience in the IT industry, including 3+ years of experience with Big Data tools developing applications on the Apache Hadoop/Spark ecosystems and 2 years of experience in the software application development lifecycle with Python technologies.
- Excellent understanding/knowledge of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, Edge Node, and the MapReduce programming paradigm.
- Extensive hands-on development of complex data ingestion pipelines, data transformations, data management, and data governance in a centralized enterprise data hub.
- Experienced in working with the Spark ecosystem, using Spark SQL, DataFrames, and Scala/Python queries on data file formats such as .txt and .csv.
- Designed and created Hive external tables using a shared metastore instead of the default Derby metastore, with partitioning and bucketing (see the sketch after this list).
- Experience with Autosys scheduler to manage Hadoop jobs by developing, deploying, and maintaining JIL scripts.
- Experience in integrating Hive and HBase for effective operations.
- Knowledge of supporting data projects using Elastic MapReduce (EMR) on Amazon Web Services (AWS) and importing/exporting data to and from S3.
- Experience working with file formats such as Avro, Parquet, ORC, and SequenceFile, and compression techniques such as Snappy in Hadoop.
- Strong understanding of NoSQL databases and hands-on work experience in writing applications on NoSQL databases like HBase.
- Experience in creating HBase tables to load large sets of data from various data sources.
- Implemented a POC to migrate MapReduce jobs to Spark RDD and DataFrame transformations using Scala and Python.
- Knowledge of creating real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka.
- Working knowledge of major Hadoop ecosystem components such as Hive, Sqoop, and Impala.
- Good experience with the Cloudera, Hortonworks, and Apache Hadoop distributions.
- Knowledge of high-throughput streaming applications that read from Kafka queues and write enriched data back to outbound Kafka queues.
- Experience in tuning and troubleshooting performance issues in the Hadoop cluster.
- Good working experience using Sqoop to import data from RDBMS into HDFS and vice versa.
- Experience working with various Robotic Process Automation tools, such as Blue Prism, UiPath, and Appian Process Robot, to develop automation solutions for repetitive, rule-based business tasks.
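
To illustrate the Spark SQL and Hive work summarized above: a minimal PySpark sketch of creating a partitioned, bucketed Hive external table against a shared metastore and querying a CSV feed with Spark SQL. The database, table, columns, and HDFS paths are hypothetical placeholders, not the actual project objects.

```python
from pyspark.sql import SparkSession

# Spark session with Hive support so DDL is issued against the shared metastore
spark = (SparkSession.builder
         .appName("hive-external-table-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical database, table, and HDFS locations used purely for illustration
spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales.transactions (
        txn_id      STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS ORC
    LOCATION 'hdfs:///data/sales/transactions'
""")

# Read a raw CSV feed into a DataFrame and query it with Spark SQL
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("hdfs:///landing/sales/2020-01-01/*.csv"))
raw.createOrReplaceTempView("raw_transactions")

spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM raw_transactions
    GROUP BY customer_id
""").show()
```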
TECHNICAL SKILLS
Programming Languages: Java, Python, Scala
Scripting Languages: Shell script, JavaScript, HTML, CSS, XML
Development tools: IntelliJ, Eclipse, Visual Studio
Database: MySQL, Oracle, SQL Server, HBase, Cassandra, MongoDB
Cloud: AWS, S3, Redshift, EC2
Operating System: Mac OS, Windows 10, Linux
Version Control Tools: Subversion, GitHub
Methodologies: Agile, Waterfall
Big Data Ecosystem: Hadoop, MapReduce, YARN, Hive, Pig, Sqoop, HBase, Kafka, Oozie, Impala, Spark, Spark SQL (DataFrames and Datasets)
Data Visualization: Tableau
PROFESSIONAL EXPERIENCE
Confidential, Tampa, Florida
Big Data Engineer
Responsibilities:
- Responsible for designing, implementing, and testing ETL data pipelines end to end.
- Responsible for ingestion, consumption, maintenance, production bug fixes, data cleansing, and troubleshooting production job failures with workarounds or by re-running jobs when necessary.
- Implemented Spark utilizing DataFrames and the Spark SQL API for faster processing of batch and real-time streaming data.
- Handled large datasets using partitions, broadcasts in Spark, and effective and efficient joins and transformations during the ingestion process itself.
- Wrote ETL jobs using Spark/Scala, Pig/MapReduce/HBase, and Databricks.
- Wrote MapReduce/Pig programs for ETL and developed custom UDFs in Java.
- Developed a data pipeline to retrieve data from Kafka and store it in HDFS (see the sketch after this list).
- Experience with Autosys to automate and schedule daily jobs.
- Developed complex ETL transformations and performed performance tuning.
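
A minimal sketch of the Kafka-to-HDFS pipeline pattern referenced above, using PySpark Structured Streaming. The broker addresses, topic name, and HDFS paths are hypothetical, and the spark-sql-kafka connector package is assumed to be available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-sketch")
         .getOrCreate())

# Subscribe to a hypothetical Kafka topic; brokers and topic are placeholders
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "events-topic")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the payload to string for downstream parsing
parsed = events.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("value"),
    col("timestamp"))

# Land the stream on HDFS as Parquet, with a checkpoint location for recovery
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```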
Environment: Linux, Eclipse, JDK 1.8.0, Hadoop 2.9.0, HDFS, MapReduce, Hive 2.3, Kafka 2.0.0, CDH 5.4.0, Autosys r11.3, Sqoop 1.4.7, Tableau, Shell Scripting, Scala 2.12, Spark 2, Python 3.6/3.5/3.4, Maven Repository, Gradle Build.
Confidential
Big Data Support Engineer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop and Spark with Scala.
- Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of batch and real-time streaming data.
- Developed scripts to perform business transformations on the data using Hive and Impala for downstream applications.
- Handled large datasets using partitions, broadcasts in Spark, and effective and efficient joins and transformations during the ingestion process itself.
- Involved in converting Hive/SQL queries into Spark transformations using Spark Data Frames, Spark RDDs, and Scala.
- Created a common data lake for migrated data to be used by other members of the team.
- Implemented pre-defined operators in Spark such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey (see the sketch after this list).
- Worked with different file formats (SequenceFile, Avro, RC, Parquet, and ORC) and different compression codecs (gzip, Snappy, LZO).
- Experience working with Amazon Redshift clusters for storing large datasets.
- Expertise in reading data from Amazon S3 and processing it using Spark applications.
- Developed complex ETL transformations and performed performance tuning.
- Imported and exported data using Sqoop between HDFS and relational databases (Oracle and Netezza).
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Worked extensively with Hive, SQL, Scala, Spark, and shell scripting.
- Developed a data pipeline using Kafka to store data into HDFS.
- Experienced in writing Spark RDD transformations and actions on input data, and Spark SQL queries and DataFrames to import data from data sources, perform data transformations and read/write operations using Spark Core, and save the results to an output directory in HDFS.
- Responsible for the design and development of Spark applications using Scala to interact with Hive and MySQL databases.
- Experience with Oozie workflow to automate and schedule daily jobs.
- Experience with job control tools like Autosys.
- Scheduled and managed cron jobs and wrote shell scripts to generate alerts.
- Hands-on experience installing, configuring, and using ecosystem components like Hadoop, MapReduce, and HDFS.
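
A minimal PySpark sketch of the pre-defined RDD operators listed above (flatMap, filter, map, reduceByKey, aggregateByKey), applied to a simple word-count style aggregation. The input and output HDFS paths are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-operators-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical raw text input on HDFS
lines = sc.textFile("hdfs:///data/input/events.txt")

# flatMap: split each line into words; filter: drop empty tokens;
# map: pair each word with 1; reduceByKey: sum the counts per word
counts = (lines.flatMap(lambda line: line.split())
               .filter(lambda word: word != "")
               .map(lambda word: (word.lower(), 1))
               .reduceByKey(lambda a, b: a + b))

# aggregateByKey: the same per-key sum expressed with an explicit zero value
totals = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word.lower(), 1))
               .aggregateByKey(0, lambda acc, v: acc + v, lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///data/output/word_counts")
print(totals.take(5))
```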
Environment: Linux, Eclipse, JDK 1.8.0, Hadoop 2.9.0, HDFS, MapReduce, Hive 2.3, Kafka 2.0.0, CDH 5.4.0, Autosys r11.3, Sqoop 1.4.7, Tableau, Shell Scripting, Scala 2.12, Spark 2, Python 3.6/3.5/3.4, Maven Repository, Gradle Build.
Confidential
Software Engineer / Python Developer
Responsibilities:
- Involved in Requirements gathering, Requirement analysis, Design, Development, Integration, and Deployment.
- Built ETL jobs using the PySpark API with Jupyter notebooks on an on-premises cluster for various transformation needs, with HDFS as the data storage system.
- Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
- Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables.
- Built database models, views, and APIs using Python for interactive web-based solutions.
- Worked closely with DevOps on CI/CD, deploying to the cloud using Jenkins and Chef.
- Developed entire frontend and backend modules using Python on the Django web framework.
- Developed Python batch processors to consume and produce various feeds.
- Wrote and executed various MySQL database queries from Python using the Python MySQL connector and the MySQLdb package (see the sketch after this list).
- Utilized PyUnit, the Python unit testing framework, for testing the functionality of the application.
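
A minimal sketch of running a parameterized MySQL query from Python with the mysql-connector-python driver, in the spirit of the batch-feed queries above. The connection settings, table, and column names are hypothetical.

```python
import mysql.connector

# Connection parameters are placeholders for illustration only
conn = mysql.connector.connect(
    host="localhost",
    user="app_user",
    password="secret",
    database="feeds_db",
)

try:
    cursor = conn.cursor()
    # Parameterized query keeps values out of the SQL string
    cursor.execute(
        "SELECT feed_id, status FROM feed_runs WHERE run_date = %s",
        ("2019-06-01",),
    )
    for feed_id, status in cursor.fetchall():
        print(feed_id, status)
finally:
    conn.close()
```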
Environment: Python, Django, MySQL, PyUnit, Git, DevOps, Flask, JSON, Ansible, PHP.
Confidential
Data Analyst Trainee
Responsibilities:
- Experienced in working with a team of developers on Python applications for task prioritization and risk management.
- Hands-on experience with Python libraries such as NumPy, SciPy, and Matplotlib.
- Experience writing subqueries, stored procedures, triggers, cursors, and functions on MySQL and PostgreSQL databases.
- Worked on the development of SQL and stored procedures on MySQL.
- Involved in Agile methodologies and the Scrum process.
- Delivered interactive dashboards to predict daily job metrics and visualize trends and seasonality across locations (see the sketch after this list).
- Involved in Requirements gathering, Requirement analysis, Design, Development, Integration, and Deployment.
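
A minimal pandas/Matplotlib sketch of the kind of trend-and-seasonality view behind the dashboards above. The CSV file and column names (run_date, location, duration_minutes) are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical daily job-metrics extract; file and column names are illustrative
df = pd.read_csv("daily_job_metrics.csv", parse_dates=["run_date"])

# Weekly average job duration per location to surface trends and seasonality
weekly = (df.set_index("run_date")
            .groupby("location")["duration_minutes"]
            .resample("W")
            .mean()
            .unstack(level=0))

weekly.plot(figsize=(10, 4), title="Weekly average job duration by location")
plt.xlabel("Week")
plt.ylabel("Duration (minutes)")
plt.tight_layout()
plt.savefig("job_duration_trends.png")
```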
Environment: Python, MySQL, Numpy, Pandas, NLTK, Scikit-learn, Seaborn, Matplotlib, Tidyverse, Git and Linux.