Big Data Engineer Resume
New Orleans, Louisiana
SUMMARY
- Overall, 8 years of professional IT experience in Application Development and Data Analytics using languages and tools such as SQL, Scala, Java, and Python.
- 2+ years of experience with Apache Spark Core, Spark SQL, and Spark Streaming.
- Around 2 years of experience working on Apache Kafka.
- 5+ years of experience in the design and development of Big Data analytics using Hadoop ecosystem technologies.
- Expertise in Big Data technologies as a consultant, with proven capability both on project-based teams and as an individual developer, and good communication skills.
- Over 3 years of extensive experience working on big data ETL.
- Competent with big data frameworks such as Kafka, Hive, Elasticsearch, Solr, HDFS, and YARN.
- Experience in implementing AWS solutions using EC2, S3, and EMR.
- Extensive experience in Big Data analytics with hands-on experience writing MapReduce jobs on the Hadoop ecosystem, including Hive, Pig, HBase, Sqoop, Impala, Oozie, Zookeeper, Spark, Kafka, Cassandra, and Flume.
- Very good understanding of ecosystems with Cloudera CDH1, CDH2, CDH3, CDH4, CDH5, Hortonworks HDP 2.1, and Hadoop MR1 & MR2 architectures.
- Strong knowledge of Rack awareness topology in the Hadoop cluster.
- Hands-on experience in installing, configuring, supporting, and managing Cloudera and Hortonworks Hadoop platforms, along with CDH3 and CDH4 clusters.
- Worked on Data Lake setup with the Hortonworks and AWS teams.
- Experience in managing multi-tenant Cassandra clusters on public cloud environment - Amazon Web Services (AWS)-EC2.
- Experience in executing batch jobs and processing data streams with Spark Streaming.
- Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries (a minimal sketch follows at the end of this summary).
- Developed reusable solutions to maintain proper coding standards across different Java projects.
- Very good in application development and maintenance of SDLC projects using different programming languages such as Java, C, Scala, SQL, and NoSQL.
- Developed various cross-platform products while working with Hadoop file formats such as SequenceFile, RCFile, ORC, Avro, and Parquet.
- Expert in importing and exporting data from different Relational Database Systems like MySQL and Oracle into HDFS and Hive using Sqoop.
- Strong in analyzing data using HiveQL, Pig Latin, HBase, and MapReduce programs in Java.
- Expertise in extending Hive and Pig core functionality by writing custom UDFs.
- Experience with databases such as DB2, MySQL, SQL Server, and MongoDB.
- Configured Kerberos security for authentication and integrated it with Active Directory (AD).
- Configured Ranger and Knox services for authorization.
- Good technical skills in SQL Server and ETL development using Spark.
- Wrote documentation describing program development, logic, coding, testing, changes, and corrections.
- Expertise in writing SQL and PL/SQL to integrate complex OLTP and OLAP database models and data marts; worked extensively on Oracle, SQL Server, and DB2.
- Experience in all the life cycle phases of the projects on large data sets and experience with performance tuning and troubleshooting.
- Extensive knowledge of UNIX and Shell scripting.
- Strong background in mathematics and have very good analytical and problem-solving skills.
- Proactive problem-solving mentality that thrives in an agile work environment.
- Ability to work effectively with associates at all levels within the organization.
- Have good knowledge on NoSQL databases like HBase, Cassandra and MongoDB.
- Experienced in data ingestion projects to load data into the Data Lake from multiple source systems using Talend Big Data.
- Extensively used ETL methodology for data migration, extraction, transformation, and loading using Talend, and designed data conversions from a wide variety of source systems.
- Experience in creating complex SQL queries and SQL tuning, and in writing PL/SQL blocks such as stored procedures, functions, cursors, indexes, triggers, and packages.
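The Hive partitioning and bucketing work noted above can be illustrated with a minimal, hedged sketch. The table, columns, and bucket count are hypothetical, and the DDL is issued through a Hive-enabled SparkSession rather than the Hive CLI:

```python
# Hypothetical sketch: creating and querying a partitioned, bucketed Hive table
# through a Hive-enabled SparkSession. Table, columns, and bucket count are
# illustrative only.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-bucket-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Partition by load date and bucket by customer_id so date-range scans prune
# partitions and join/lookup work is spread across a fixed number of buckets.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_events (
        customer_id BIGINT,
        amount      DOUBLE,
        channel     STRING
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Typical optimized HiveQL query: the load_date predicate enables partition pruning.
spark.sql("""
    SELECT channel, SUM(amount) AS total_amount
    FROM sales_events
    WHERE load_date = '2020-01-15'
    GROUP BY channel
""").show()
```

Partitioning on a low-cardinality date column and bucketing on the join key is the usual trade-off: partitions cut the data scanned per query, while a fixed bucket count keeps file counts bounded.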
TECHNICAL SKILLS
Hadoop/Big Data: HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Flume, Oozie, Cassandra, YARN, Zookeeper, Spark SQL, Apache Spark, Impala, Apache Drill, Kafka, Elastic MapReduce.
Hadoop Frameworks: Cloudera CDHs, Hortonworks HDPs, MAPR.
Java & J2EE Technologies: Core Java, Servlets, Java API, JDBC, Java Beans.
IDE and Tools: Eclipse, NetBeans, Maven, ANT, Hue (Cloudera-specific), Toad, Sonar, JDeveloper.
Frameworks: MVC, Struts, Hibernate, Spring.
Programming Languages: C, C++, Java, Scala, Python, Linux shell.
Web Technologies: HTML, XML, DHTML, HTML5, CSS, JavaScript.
Databases: MySQL, DB2, MS SQL Server, Oracle.
NoSQL Databases: HBase, Cassandra, MongoDB.
Methodologies: Agile Software Development, Waterfall.
Version Control Systems: GitHub, SVN, CVS, Clearcase.
Operating Systems: RedHat Linux, Ubuntu Linux, Windows XP/Vista/7/8/10, Sun Solaris, Suse Linux.
Data Visualization: Power BI, Tableau, Qlik.
PROFESSIONAL EXPERIENCE
Confidential, New Orleans, Louisiana
Big Data Engineer
Responsibilities:
- Worked on a live 65-node Hadoop cluster running CDH 4.7.
- Installed & configured multi-node Hadoop cluster for data store & processing.
- Experience in AWS cloud environment on S3 storage and EC2 instances.
- Assisted in upgrading, configuration and maintenance of various Hadoop infrastructures like Pig, Hive, and HBase.
- Configured Flume to capture the news from various sources for testing the classifier.
- Experience in developing MapReduce jobs using various Input and output formats.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and preprocessing, analyzing and training the classifier using MapReduce jobs, Pig jobs and Hive jobs.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Involved in loading data into Cassandra NoSQL Database.
- Developed Spark applications to move data into Cassandra tables from various sources such as relational databases and Hive.
- Worked on Spark Streaming jobs that collect data from Kafka in near real time, perform the necessary transformations and aggregations on the fly to build the common learner data model, and persist the data in Cassandra (see the sketch after this list).
- Worked on Cassandra data modelling, NoSQL architecture, DSE Cassandra database administration, keyspace creation, table creation, secondary and Solr index creation, and user creation and access administration.
- Experience in performance tuning a Cassandra cluster to optimize writes and reads.
- Developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into RDBMS through Sqoop.
- Used Pig and Hive in the analysis of data.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Performed Sqoop transfers of various files through HBase tables, processing data into several NoSQL databases: Cassandra and MongoDB.
- Implemented Storm builder topologies to perform cleansing operations before moving data into Cassandra.
- Developed ETL workflow which pushes webserver logs to an Amazon S3 bucket.
- Implemented Cassandra connection with the Resilient Distributed Datasets (local and cloud).
- Imported and exported data into HDFS and Hive.
- Implemented ETL code to load data from multiple sources into HDFS using Pig Scripts.
- Implemented Pig as an ETL tool to perform transformations, event joins, and some pre-aggregations before storing the data in HDFS.
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Created PySpark jobs to bring data from DB2 to Amazon S3.
- Translated business requirements into maintainable software components and assessed their technical and business impact.
- Provided guidance to the development team working on PySpark as an ETL platform.
- Ensured that quality standards were defined and met.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Worked on Talend ETL scripts to pull data from TSV files and an Oracle database into HDFS.
- Worked extensively on design, development, and deployment of Talend jobs to extract, filter, and load data into the data lake.
- Implemented and maintained monitoring and alerting of production and corporate servers such as EC2 and storage such as S3 buckets using AWS CloudWatch.
- Created S3 buckets, managed S3 bucket policies, and utilized S3 for storage and backup on AWS.
- Created an AWS RDS Aurora DB cluster and connected to the database through an Amazon RDS Aurora DB Instance using the Amazon RDS Console.
- Configured an AWS Virtual Private Cloud (VPC) and Database Subnet Group for isolation of resources within the Amazon RDS Aurora DB cluster.
- Hands on experience in Amazon RDS Aurora performance tuning.
- Extracted data from source systems and transformed it for newer systems using Talend DI components.
- Worked on Storm to handle parallelization, partitioning, and retrying on failures, and developed a data pipeline using Kafka and Storm to store data in HDFS.
- Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Developed Spark code and Spark-SQL/Streaming for faster testing and processing of data.
- Experience with MySQL on both Linux and Windows.
- Converted databases from the MyISAM format to the InnoDB storage engine for databases that needed greater referential integrity.
- Managed database clustering on NDB technology.
- Automated data import scripts using shell scripting, PHP, MySQL, and regular expressions.
- Implemented MySQL database backup and recovery strategies, replication, and synchronization.
- Created, tested, and maintained PHP scripts, MySQL programming, forms, reports, triggers and procedures for the Data Warehouse.
- Created database application using PHP and MySQL as the database to monitor customer profiles and complaints.
- Involved in troubleshooting and fine-tuning databases for performance and concurrency.
- Troubleshot performance problems over the phone and via email.
- Managed MySQL processes, security, and query optimization.
- Recovered databases from backups during disasters.
- Exported and imported 10g database objects from development to production.
- Created different Power BI reports utilizing the desktop client and the online service, and scheduled refreshes.
- Assisted end users with problems installing Power BI Desktop, installing and configuring the Personal and On-Premises gateways, connecting to data sources, and adding different users.
- Assisted customer in configuring and troubleshooting their row-level security for their company.
- Created different visualizations in reports using custom visuals such as bar charts, pie charts, line charts, cards, and slicers; also used different transformations in the query editor to clean up the data.
- Experienced in developing Power BI reports and dashboards from multiple data sources using data blending.
- Developed Stored Procedures and used them in Stored Procedure transformation for data processing and have used data migration tools.
- Experience in using SSIS tools such as the Import and Export Wizard, Package Installation, and SSIS Package Designer.
- Experience in importing/exporting data between different sources such as Oracle, Access, and Excel using the SSIS/DTS utility.
- Experience in ETL processes involving migrations and sync processes between two databases.
- Experience with Microsoft Visual C# in the script component of SSIS.
- Transformed data from one server to other servers using tools like Bulk Copy Program (BCP), and SQL Server Integration Services (SSIS) (2005/2008).
- Experience in creating configuration files to deploy SSIS packages across all environments.
- Documented Informatica mappings in Excel spreadsheets.
- Tuned the Informatica mappings for optimal load performance.
- Used BTEQ, FEXP, FLOAD, and MLOAD Teradata utilities to export and load data to/from flat files.
- Created and Configured Workflows and Sessions to transport the data to target warehouse Oracle tables using Informatica Workflow Manager.
- Migrated Oracle 9i databases to MySQL using the MySQL Migration Toolkit.
- Supported MapReduce programs running on the cluster.
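A hedged sketch of the Kafka-to-Cassandra streaming flow described above. The original jobs may have used DStreams; this version uses Structured Streaming with foreachBatch and the DataStax Spark Cassandra Connector, and the topic, schema, keyspace, and table names are invented:

```python
# Hypothetical sketch: near-real-time Kafka -> Spark -> Cassandra pipeline.
# Assumes the spark-sql-kafka and spark-cassandra-connector packages are on the
# classpath; topic, schema, keyspace, and table names are invented.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, LongType

spark = SparkSession.builder.appName("learner-model-stream").getOrCreate()

event_schema = (StructType()
                .add("learner_id", StringType())
                .add("course_id", StringType())
                .add("score", LongType())
                .add("event_ts", LongType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "learner-events")
       .load())

# Kafka delivers bytes; parse the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
             .select(F.from_json("json", event_schema).alias("e"))
             .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Aggregate within the micro-batch, then persist to the Cassandra table.
    (batch_df.groupBy("learner_id", "course_id")
             .agg(F.max("score").alias("best_score"))
             .write
             .format("org.apache.spark.sql.cassandra")
             .options(keyspace="learning", table="learner_model")
             .mode("append")
             .save())

query = (events.writeStream
               .option("checkpointLocation", "/tmp/chk-learner-model")
               .foreachBatch(write_to_cassandra)
               .start())
query.awaitTermination()
```

The checkpoint location lets the stream resume from its last committed Kafka offsets after a restart; the per-batch write keeps the Cassandra table in step with the incoming events.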
Environment: Hadoop, HDFS, Cloudera, Python, AWS, Spark, YARN, MapReduce, Hive, Teradata SQL, PL/SQL, Pig, Talend, Data Lake, Data Integration 6.1/5.5.1 (ETL), Kafka, Sqoop, Oozie, HBase, Cassandra, Java, Scala, UNIX Shell Scripting.
Confidential, Dallas, TX
Sr. Hadoop Developer
Responsibilities:
- Prepared design blueprints and application flow documentation, gathering requirements from the business.
- Maintained the data in the Data Lake (ETL) coming from the Teradata database, writing an average of 80 GB daily.
- Overall, the data warehouse had 5 PB of data and used a 135-node cluster to process the data.
- Responsible for creating Hive Tables to load the data from MySQL by using Sqoop, writing java snippets to perform cleaning, pre-processing, and data validation.
- Experienced in creating Hive schema, external tables, and managing views. Worked on performing Join operations in Spark using hive.
- Wrote HiveQL statements per user requirements.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and working with the Spark shell.
- Developed Spark code using Java and Spark-SQL for faster testing and data processing.
- Imported millions of structured data from relational databases using Sqoop import to process using Spark and stored the data into HDFS in parquet format.
- Experienced working with AWS services like EMR, Redshift, S3, Glue, Kinesis, and Lambda for serverless ETL.
- Used Spark SQL to process the massive volume of structured data; implemented Spark DataFrame transformations and steps to migrate MapReduce algorithms (see the sketch after this list).
- Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Used DataFrame API solutions to pre-process massive volumes of structured data in various file formats, including text files, CSV, sequence files, XML, JSON, and Parquet, and then turned the data into named columns.
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
- Created complex ETL packages using SSIS to extract data from staging tables to partitioned tables with incremental load.
- Created reusable SSIS packages to extract data from multi-formatted flat files, Excel, and XML files into the UL database and DB2 billing systems.
- Developed, deployed, and monitored SSIS Packages.
- Created SSIS packages using SSIS Designer to export heterogeneous data from OLE DB sources (Oracle) and Excel spreadsheets to SQL Server 2005/2008.
- Performed operations like Data reconciliation, validation and error handling after Extracting data into SQL Server.
- Worked on SSIS Package, DTS Import/Export for transferring data from Database (Oracle and Text format data) to SQL Server.
- Created SSRS reports using report parameters, drop-down parameters, and multi-valued parameters; debugged parameter issues; and built matrix reports and charts.
- Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.
- Ensured necessary system security by using best-in-class AWS cloud security solutions; additionally experienced in deploying Java projects using Maven/Ant and Jenkins.
- DevOps and CI/CD pipeline knowledge, mainly TeamCity and Selenium; implemented continuous integration/delivery (CI/CD) pipelines in AWS when necessary.
- Experienced with batch processing of data sources using Apache Spark. Developed predictive analytics using Apache Spark Java APIs.
- Expert in implementing advanced procedures like text analytics and processing using in-memory computing capabilities like Apache Spark written in Java.
- Worked on the core and Spark SQL modules of Spark extensively. Extensively used Broadcast Variables and Accumulators for better performances.
- Hands-on experience in Azure development; worked on Azure web applications, App Services, Azure Storage, Azure SQL Database, virtual machines, Fabric Controller, Azure AD, Azure Search, and Notification Hubs.
- Designed, configured, and deployed Microsoft Azure for a multitude of applications utilizing the Azure stack (including Compute, Web & Mobile, Blobs, Resource Groups, Azure SQL, Cloud Services, and ARM), focusing on high availability, fault tolerance, and auto-scaling.
- Expertise in Microsoft Azure Cloud Services (PaaS & IaaS), Application Insights, DocumentDB, Internet of Things (IoT), Azure Monitoring, Key Vault, Visual Studio Online (VSO), and SQL Azure.
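A minimal sketch of the Sqoop-import-then-Spark-SQL pattern described above. The original code was written in Java; the equivalent PySpark DataFrame calls are shown here, and the HDFS paths, view name, and columns are invented:

```python
# Hypothetical sketch: Spark SQL over Parquet data landed in HDFS by Sqoop.
# Paths, view name, and columns are invented; the original project used the
# equivalent Java DataFrame API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqoop-parquet-sql").getOrCreate()

# Parquet files written to HDFS by a Sqoop import job.
orders = spark.read.parquet("hdfs:///data/raw/orders")

# Expose the data to plain SQL via a temporary view.
orders.createOrReplaceTempView("orders")

daily_totals = spark.sql("""
    SELECT order_date,
           COUNT(*)          AS order_cnt,
           SUM(order_amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

# Write the aggregate back in Parquet for downstream Hive/BI consumers.
daily_totals.write.mode("overwrite").parquet("hdfs:///data/curated/daily_order_totals")
```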
Environment: Hadoop, HDFS, Hive, Java 1.7, Spark 1.6, SQL, HBase, UNIX Shell Scripting, MapReduce, Putty, WinSCP, IntelliJ, Teradata, Linux.
Confidential, Charlotte, NC
BIG Data Engineer
Responsibilities:
- Designed and supported the new and evolving sources of data being brought into the data warehouse using PySpark Framework.
- Worked closely with data architects and followed best practices for data management and consumption.
- Worked closely with business analysts to work through business requirements and developed processes to provide the needed data visibility via the data warehouse and reporting platform.
- Designed and created automated applications and reporting solutions.
- Developed shell scripts for running Hive scripts in Hive and Impala.
- Responsible for optimization of data-ingestion, data-processing, and data-analytics.
- Expertise in developing PySpark applications that build connections between HDFS and HBase and allow data transfer between them.
- Worked on RDBMSs such as Oracle, DB2, SQL Server, and MySQL databases.
- Developed workflows to cleanse and transform raw data into useful information to load it to a Kafka Queue to be loaded into HDFS and NoSQL database.
- Responsible for sanity testing of the system once code was deployed to production.
- Involved in quality assurance of the data mapped into production.
- Involved in code walkthroughs, reviews, testing, and bug fixing.
- Worked closely with the data science/analyst team to ensure data integrity was maintained.
- Monitor and troubleshoot performance issues on the data warehouse servers.
- Worked on the setup of a Hadoop cluster on Amazon EMR/S3 for a POC.
- Involved in managing and reviewing Hadoop log files for troubleshooting any Data or HDP service failures.
- Developed scripts and batch jobs to schedule various Hadoop programs.
- Used Amazon EC2 instances with Amazon S3 web services on the Databricks framework.
- Daily Duties included management of design services, providing sizing and configuration assistance, ensuring strict data quality.
- Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket (see the sketch after this list).
- Analyzed current business practices, processes, and procedures, and identified future business opportunities for leveraging data storage and retrieval system capabilities.
- Implemented data platform improvements and new features.
- Assisted with job failures and the resolution of data platform bug fixes.
- Experienced with Hadoop clusters on the Azure HDInsight platform and deployed data analytics solutions using tools like Spark and BI reporting tools.
- Designed and developed effective business solutions (Azure Blob storage) to store and retrieve data.
- Designed advanced analytics ranging from descriptive to predictive models to machine learning techniques.
- Knowledge of the Azure IoT framework and data streaming.
- Interfaced with clients, vendors, and internal users of the data platform to build understanding of the data.
- Participated in group design and architecture sessions.
- Authored documentation for standard operating procedures, knowledge base articles, etc.
- Developed integration tests to validate solutions.
- Analyzed data and made recommendations to optimize current business operations.
- Took part in regular agile rituals (stand-ups, sprint planning/retro, technical feasibility analysis).
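A minimal sketch of the CSV-to-S3 loader described above. The bucket name, prefix, and local path are invented, and boto3 is assumed (the original script's library is not specified):

```python
# Hypothetical sketch of a CSV-to-S3 loader: bucket, prefix, and local path are
# invented, and boto3 is assumed.
import os
import boto3

s3 = boto3.client("s3")

def upload_csvs(local_dir: str, bucket: str, prefix: str = "csv/") -> None:
    """Upload every .csv file under local_dir to s3://bucket/prefix."""
    for name in os.listdir(local_dir):
        if name.endswith(".csv"):
            key = f"{prefix}{name}"
            s3.upload_file(os.path.join(local_dir, name), bucket, key)
            print(f"uploaded s3://{bucket}/{key}")

if __name__ == "__main__":
    upload_csvs("/data/exports", "example-data-lake-landing")
```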
Technologies: AWS, HDP 2.6.5, Spark, Python, Java 1.6, Sqoop, Hive, Tez, DaaS, MR, Stonebranch, Tableau, Oracle, MySQL.
Confidential
BIG Data Engineer
Responsibilities:
- Experience in Requirement Gathering from Business/Data analyst teams.
- Experience designing and building data pipelines from various on-premises and cloud source systems for the enterprise data warehouse and legacy applications.
- Partnered with data architects, domain experts, data analysts, and other teams to build foundational data sets that are trusted, well understood, aligned with business strategy, and enable self-service.
- Developed data pipeline for real time use cases using Kafka and Spark Streaming.
- Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant fine-grained access to AWS resources for users.
- Designed Database Schema and created Data Model to store real-time Data with NoSQL store.
- Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing them, and storing the results in Cassandra.
- Hands-on experience with HDP 2.6.5/HDP 3.1.5 and familiarity with ecosystem tools (Spark/Hive/DaaS/Sqoop/Ambari).
- Experience working on Java/Python ingestion frameworks to bring required data from source systems per business requirements.
- Experience in production support and maintenance of the analytics ecosystem; worked on schedulers such as Zeena/Stonebranch.
- Experience identifying areas of improvement and ensuring application of standards and best practices.
- Experience with PySpark and the Sqoop data fabric ingestion framework.
- Experience in ETL optimization, writing custom PySpark functions (UDFs), and tuning PySpark or Spark SQL code (see the sketch after this list).
- Experience in package management and writing custom Python packages.
- Experience in handling data lineage, data governance, ensuring data quality, feature stores etc.
- Knowledge of data engineering best practices and in using industry-standard methodologies.
- Experience in working on data requests from internal & external stakeholders (like investors and business partners).
- Experience in monitoring data pipeline status and troubleshooting failures in a timely manner.
- Experience working alongside data science, business analyst, admin, and operations teams to resolve failures and documenting best practices to avoid future failures.
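A minimal sketch of the kind of custom PySpark UDF referenced above; the cleansing rule and column names are hypothetical examples:

```python
# Hypothetical sketch of a custom PySpark UDF; the cleansing rule and columns are
# invented. Built-in functions are usually preferred over UDFs for performance,
# so this is purely illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

@F.udf(returnType=StringType())
def normalize_phone(raw):
    """Keep the last 10 digits of a free-form phone string, else return None."""
    if raw is None:
        return None
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[-10:] if len(digits) >= 10 else None

customers = spark.createDataFrame(
    [("c1", "(504) 555-0187"), ("c2", "bad-number")],
    ["customer_id", "phone_raw"],
)

customers.withColumn("phone", normalize_phone("phone_raw")).show()
```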
Technologies: AWS, HDP 3.1.5, HDP 2.6.5, Spark, Python, Java 1.8, Sqoop, Hive, Tez, DaaS, MR, Zeena, Tableau, Alteryx, Jupyter, PyCharm, Anaconda, Oracle, MySQL, PostgreSQL, SageMaker, Athena, Glue.