Sr Big Data Engineer Resume
Chesterfield, MO
SUMMARY
- Around 8 years of professional experience in Information Technology, with expertise in Big Data on the Hadoop framework and in analysis, design, development, testing, documentation, deployment and integration using SQL and Big Data technologies.
- Excellent knowledge of Hadoop architecture, including HDFS, Job Tracker, Task Tracker, Name Node, Data Node and the MapReduce programming paradigm.
- Good understanding of Hadoop Gen1/Gen2 architecture and hands-on experience with Hadoop components such as Job Tracker, Task Tracker, Name Node, Secondary Name Node, Data Node and MapReduce concepts, as well as the YARN architecture, which includes the Node Manager, Resource Manager and Application Master.
- Involved in writing data transformations and data cleansing using Pig operations, with good experience in data retrieval and processing using Hive.
- Hands-on experience with data ingestion tools such as Kafka and Flume and the workflow management tool Oozie.
- Developed Apache Spark jobs using Scala and Python for faster data processing and used the Spark Core and Spark SQL libraries for querying (a minimal PySpark sketch follows this list).
- Experience working with AWS services such as EC2, EMR, S3, KMS, Kinesis, Lambda, API Gateway and IAM.
- Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data.
- Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
- Proficient in using Hive optimization techniques such as buckets and partitions.
- Experience in creating Spark Streaming jobs to process huge data sets in real time.
- Experienced in creating Vizboards for data visualization in Platfora for real-time dashboards on Hadoop.
- Worked with HBase to conduct quick lookups (updates, inserts and deletes) in Hadoop.
- Experienced in loading datasets into Hive for ETL (Extract, Transform and Load) operations.
- Experience in importing and exporting data using Sqoop from relational database systems to HDFS and vice versa.
- Worked on a Scala code base for Apache Spark, performing actions and transformations on RDDs, DataFrames and Datasets using Spark SQL and Spark Streaming contexts.
- Experience with Microsoft Azure cloud services such as SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage and Azure Data Factory.
- Extensive experience importing and exporting data using stream processing platforms such as Flume.
- Experience in database development using SQL and PL/SQL and experience working with databases such as Oracle 12c/11g/10g, SQL Server and MySQL.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
- Extensive experience using Maven as a build tool to build deployable artifacts from source code.
- Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying them on public or private clouds.
- Proficient with tools such as Erwin (Data Modeler, Model Mart, Navigator), ER Studio, IBM Metadata Workbench, Oracle data profiling tools, Informatica, Oracle Forms, Reports, SQL*Plus, Toad and Crystal Reports.
- Tested, cleaned and standardized data to meet business standards using Execute SQL Task, Conditional Split, Data Conversion and Derived Column transformations in different environments.
- Experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, generating data visualizations using R, SAS and Python, and creating dashboards using tools like Tableau.
- Experience tuning Spark jobs for efficiency in terms of storage and processing.
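The Spark bullet above references a PySpark sketch; the following is a minimal, illustrative example of that kind of Spark Core/Spark SQL batch job. The file paths, column names and query are placeholder assumptions, not taken from any specific project.

```python
# Minimal PySpark sketch: load raw CSV data, register it as a view, query it
# with Spark SQL, and write the result out partitioned for downstream Hive use.
# Paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Load raw data into a DataFrame.
sales = (spark.read
         .option("header", True)
         .option("inferSchema", True)
         .csv("hdfs:///data/raw/sales/"))

# Query with Spark SQL.
sales.createOrReplaceTempView("sales")
summary = spark.sql("""
    SELECT region, product, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region, product
""")

# Partitioning the output by region keeps later queries pruned to the
# partitions they actually need.
summary.write.mode("overwrite").partitionBy("region").parquet(
    "hdfs:///data/curated/sales_summary/")

spark.stop()
```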
TECHNICAL SKILLS
Big Data Technologies: HDFS, Hive, MapReduce, Pig, Sqoop, Flume, Oozie, Hadoop distributions, HBase, Spark, Spark Streaming, YARN, Zookeeper, Kafka, ETL (NiFi, Talend, etc.)
Programming languages: Core Java, Spring Boot, R, Scala, Terraform
Databases: MySQL, MS SQL Server 2012/16, Oracle 10g/11g/12c, Cassandra, HBase, MongoDB, Teradata.
Scripting/Web Languages: HTML5, CSS3, XML, SQL, Shell/Unix, Perl, Python.
Operating Systems: Linux, Windows XP/7/8/10, Mac.
Software Life Cycle: SDLC, Waterfall and Agile models.
Utilities/Tools: Eclipse, Tomcat, NetBeans, JUnit, SQL, SVN, Log4j, SOAP UI, ANT, Maven, Alteryx, Visio, Jenkins, Jira, IntelliJ.
Data Visualization Tools: Tableau, SSRS
Cloud Services: AWS (EC2, S3, EMR, RDS, Lambda, CloudWatch, Auto Scaling, Redshift, CloudFormation, Glue, etc.), Azure Databricks, Azure Data Lake, Azure Data Factory, Azure Data Storage, Azure SQL
EXPERIENCE:
Confidential, Chesterfield, MO
Sr Big Data Engineer
Responsibilities:
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Developed RDDs/DataFrames in Spark and applied several transformations to load data from Hadoop data lakes.
- Worked in an AWS environment for development and deployment of custom Hadoop applications.
- Developed Spark programs with Python and applied principles of functional programming to process complex structured data sets.
- Used Spark SQL to load JSON data, create schema RDDs and load them into Hive tables, and handled structured data using Spark SQL.
- Reduced access times by refactoring data models, streamlining queries and implementing a Redis cache to support Snowflake.
- Involved as primary on-site ETL developer during the analysis, planning, design, development and implementation stages of projects using IBM WebSphere software (QualityStage v9.1, Web Service, Information Analyzer, ProfileStage).
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming and a broad variety of machine learning methods, including classification, regression and dimensionality reduction.
- Prepared data mapping documents and designed the ETL jobs based on the DMD with the required tables in the development environment.
- Generated metadata and created Talend ETL jobs and mappings to load the data warehouse and data lake.
- Designed and developed a real-time stream processing application using Spark, Kafka, Scala and Hive to perform streaming ETL and apply machine learning.
- Involved in relational and dimensional data modeling to create logical and physical database designs and ER diagrams, with all related entities and relationships based on the rules provided by the business manager, using ERwin r9.6.
- Strong experience working with Elastic MapReduce (EMR) and setting up environments on Amazon EC2 instances.
- Developed automated regression scripts in Python to validate ETL processes between multiple databases such as AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL).
- Filtered and cleaned data using Scala code and SQL queries.
- Experience in data processing tasks such as collecting, aggregating and moving data using Apache Kafka.
- Used Kafka to load data into HDFS and moved data back to S3 after processing.
- Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to AWS.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
- Analyzed SQL scripts and designed solutions to implement them using PySpark.
- Exported tables from Teradata to HDFS using Sqoop and built tables in Hive.
- Loaded and transformed large sets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts.
- Implemented the installation and configuration of a multi-node cluster on the cloud using Amazon Web Services (AWS) EC2.
- Developed reusable objects such as PL/SQL program units and libraries, database procedures and functions, and database triggers to be used by the team while satisfying the business rules.
- Experience with data analytics, data reporting, ad-hoc reporting, graphs, scales, pivot tables and OLTP reporting.
- Involved in writing scripts against Oracle, SQL Server and Netezza databases to extract data for reporting and analysis, and worked on importing and cleansing high-volume data from various sources such as DB2, Oracle and flat files onto SQL Server.
- Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL and Big Data technologies. Extracted large data sets from Amazon Redshift, AWS and Elasticsearch using SQL queries to create reports.
- Used Talend for Big Data Integration using Spark and Hadoop.
- Used Kafka and Kafka brokers, initiated the Spark context and processed live streaming information with RDDs; used Kafka to load data into HDFS and NoSQL databases.
- Used Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
- Used Kafka features such as distribution, partitioning and the replicated commit log for messaging, maintained feeds, and created applications that monitor consumer lag within Apache Kafka clusters.
- Collected data with Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS (see the streaming sketch after this list).
- Worked with SQL Server components SSIS (SQL Server Integration Services), SSAS (Analysis Services) and SSRS (Reporting Services). Used Informatica, SSIS, SPSS and SAS to extract, transform and load source data from transaction systems.
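As referenced in the Spark Streaming bullet above, the following is a minimal, illustrative PySpark Structured Streaming sketch of that pattern: read JSON events landing in an S3 bucket, aggregate them on the fly and persist the result to HDFS. The bucket name, event schema and paths are placeholder assumptions.

```python
# Minimal Structured Streaming sketch: S3 JSON events -> windowed aggregation
# -> Parquet on HDFS. All names below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("learner-model-stream").getOrCreate()

event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Stream new JSON files as they arrive in the S3 landing prefix.
events = (spark.readStream
          .schema(event_schema)
          .json("s3a://example-bucket/landing/learner-events/"))

# Aggregate events per learner and course over 5-minute windows,
# with a watermark so late data is bounded.
learner_model = (events
                 .withWatermark("event_time", "10 minutes")
                 .groupBy(F.window("event_time", "5 minutes"),
                          "learner_id", "course_id")
                 .agg(F.count("*").alias("event_count")))

# Persist the aggregated model to HDFS as Parquet.
query = (learner_model.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "hdfs:///data/learner_model/")
         .option("checkpointLocation", "hdfs:///checkpoints/learner_model/")
         .start())

query.awaitTermination()
```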
Environment: Hadoop, Spark, Scala, HBase, Hive, UNIX, Erwin, TOAD, MS SQL Server database, XML files, AWS, Cassandra, MongoDB, Kafka, IBM InfoSphere DataStage, Snowflake, PL/SQL, Oracle 12c, flat files, Autosys, MS Access database.
Confidential, Charlotte, NC
Big Data Engineer
Responsibilities:
- Used the Agile Scrum methodology (Scrum Alliance) for development.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks as part of the cloud migration.
- Worked on creating data pipelines with Copy Activity, moving and transforming the data with custom Azure Data Factory pipeline activities for on-cloud ETL processing.
- Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow and Oozie
- Used Apache NiFi to automate data movement between different Hadoop components
- Used NiFi to perform conversion of raw XML data into JSON, AVRO
- Designed and published visually rich and intuitive Tableau dashboards and Crystal Reports for executive decision making.
- Developed Spark scripts using Python in the PySpark shell during development.
- Experienced in Hadoop production support tasks, analyzing application and cluster logs.
- Created Hive tables, loaded them with data and wrote Hive queries to process the data. Created partitions and used bucketing on Hive tables with the required parameters to improve performance. Developed Pig and Hive UDFs per business use cases.
- Created reports using visualizations such as bar charts, clustered column charts, waterfall charts, gauges, pie charts and treemaps in Power BI.
- Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
- Performed Big Data analysis using Scala, Spark, Spark SQL, Hive, MLlib and machine learning algorithms.
- Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and XML
- Developed an automation system using PowerShell scripts and JSON templates to remediate the Azure services.
- Implemented ETL jobs using NiFi to import data from multiple databases such as Exadata, Teradata and MS SQL into HDFS for business intelligence.
- Worked on real-time streaming and performed transformations on the data using Kafka and Spark Streaming.
- Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows
- Assisted in upgrading, configuring and maintaining various Hadoop components such as Pig, Hive and HBase.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Used HBase to store the Kafka topic, partition number and offset values; also used the Phoenix JAR to connect to HBase tables.
- Extracted, transformed and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics).
- Created pipelines in ADF using linked services, datasets and pipelines to extract, transform and load data from different sources such as Azure SQL, Blob Storage and Azure SQL Data Warehouse, and to write data back.
- Used the Spark DataFrame API in Scala for analyzing data.
- Designed and implemented an ETL framework using Java to load data from multiple sources into Hive and from Hive into Vertica
- Extracted data from data lakes and the EDW into relational databases for analysis and deeper insights using SQL queries and PySpark.
- Utilized Sqoop, Kafka, Flume and the Hadoop File System APIs to implement data ingestion pipelines.
- Wrote MapReduce programs and Hive UDFs in Java.
- Experienced in working with Hadoop from Cloudera Data Platform and running services through Cloudera manager
- Hands on experience in Hadoop administration and support activities for installations and configuring Apache Big Data Tools and Hadoop clusters using Cloudera Manager
- Used PySpark to create a batch job that merges many small files (Kafka stream files) into larger files in Parquet format (a minimal sketch follows this list).
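As referenced above, a minimal, illustrative PySpark sketch of the small-file compaction batch job: read the many small Kafka-landed files for a day and rewrite them as a handful of larger Parquet files. The paths, input format and target file count are placeholder assumptions.

```python
# Minimal small-file compaction sketch in PySpark.
# Paths and the output partition count are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-kafka-files").getOrCreate()

# Read the day's small JSON files produced by the streaming ingestion.
small_files = spark.read.json("hdfs:///data/raw/kafka_stream/dt=2021-01-01/")

# coalesce() reduces the number of output partitions (and therefore files)
# without a full shuffle; repartition() could be used if evenly sized files
# matter more than avoiding the shuffle.
compacted = small_files.coalesce(8)

# Write larger Parquet files to the curated zone.
(compacted.write
 .mode("overwrite")
 .parquet("hdfs:///data/curated/kafka_stream/dt=2021-01-01/"))

spark.stop()
```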
Environment: Hadoop YARN, Azure, Databricks, Data Lake, Data Storage, Power BI, Azure SQL, Spark Core, Spark Streaming, Spark SQL, Spark MLlib, Python, Kafka, Hive, Java, Scala, Sqoop, Impala, Cassandra, Tableau, Talend, Cloudera, MySQL, Linux.
Confidential, Denver, CO
Data Engineer
Responsibilities:
- Worked on Big Data infrastructure for batch processing as well as real-time processing; responsible for building scalable, distributed data solutions using Hadoop.
- Experience in designing and developing applications in Spark using Scala and comparing the performance of Spark with Hive and SQL/Oracle.
- Implemented Python scripts that perform transformations and actions on tables and send incremental data to the next zone using spark-submit.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
- Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
- Experienced in managing and reviewing the Hadoop log files.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Developed Spark scripts using Scala shell commands as per the requirements.
- Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation.
- Experience in importing and exporting terabytes of data using Sqoop between HDFS and relational database systems.
- Involved in creating Hive tables, loading them with data and writing Hive queries that invoke MapReduce jobs in the backend.
- Migrated an existing on-premises application to AWS.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Worked with different file formats such as SequenceFiles, XML files and MapFiles using MapReduce programs.
- Implemented data ingestion and handled clusters for real-time processing using Kafka.
- Moved relational database data into Hive dynamic partition tables using Sqoop and staging tables (see the sketch after this list).
- Designed and implemented incremental imports into Hive tables.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Experience in dimensional modeling of facts and dimensions (star schema, snowflake schema), transactional modeling and slowly changing dimensions (SCD).
- Devised PL/SQL stored procedures, functions, triggers, views and packages. Made use of indexing, aggregation and materialized views to optimize query performance.
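As referenced above, a minimal, illustrative PySpark sketch of loading a Sqoop-populated staging table into a dynamically partitioned Hive table. The table names, column names and partition column are placeholder assumptions.

```python
# Minimal sketch: append a staging table into a Hive dynamic-partition table.
# Assumes the target table already exists and is partitioned by order_date;
# all identifiers are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-dynamic-partition-load")
         .enableHiveSupport()
         .getOrCreate())

# Enable Hive-style dynamic partitioning for the insert below.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Staging table assumed to have been populated by a Sqoop incremental import.
staged = spark.table("staging.orders_incremental")

# insertInto() maps columns by position, so the partition column (order_date)
# must come last in the selected column list.
(staged.select("order_id", "customer_id", "amount", "order_date")
 .write
 .mode("append")
 .insertInto("warehouse.orders_partitioned"))

spark.stop()
```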
Environment: Hadoop, HDFS, Pig, Apache Hive, Sqoop, Flume, Python, Kafka, Apache Spark, HBase, Scala, Zookeeper, Maven, AWS, MySQL.
Confidential
Hadoop Developer
Responsibilities:
- Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Involved in collecting, aggregating and moving data from servers to HDFS using Flume.
- Experience in creating various Oozie jobs to manage processing workflows.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Developed Pig scripts to store unstructured data in HDFS.
- Used AWS S3 to store large amounts of data in a common repository.
- Responsible for coding Java batch programs, RESTful services, MapReduce programs and Hive queries, as well as testing, debugging, peer code reviews, troubleshooting and status reporting.
- Handled continuous streaming data coming from different sources using Flume, with HDFS as the destination.
- Developed job workflows in Oozie to automate the loading of data into HDFS and a few other Hive jobs.
- Developed Pig Latin scripts to extract and filter relevant data from the web server output files and load it into HDFS.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
- Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.
- Experienced in managing and reviewing the Hadoop log files using shell scripts.
- Developed Flume agents for loading and filtering the streaming data into HDFS.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Experience in writing custom MapReduce programs and UDFs in Java to extend Hive and Pig core functionality (a Hadoop Streaming sketch in Python follows this list).
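As noted in the last bullet, the original MapReduce programs were written in Java; purely for illustration, here is a Python Hadoop Streaming analogue that counts requests per URL from web server logs. The log layout, field positions and paths are assumptions, not taken from any specific project.

```python
# Illustrative Hadoop Streaming job (Python stand-in for the Java MapReduce
# programs described above): count requests per URL in web server logs.
# Example invocation (paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#     -input /logs/webserver/ -output /reports/url_counts/ \
#     -mapper "python job.py map" -reducer "python job.py reduce" -file job.py
import sys


def run_mapper():
    # Emit "url<TAB>1" for each request line; skip lines that are too short.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 6:
            print(f"{fields[6]}\t1")


def run_reducer():
    # Hadoop sorts mapper output by key, so counts can be accumulated per URL.
    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t", 1)
        if url != current_url and current_url is not None:
            print(f"{current_url}\t{count}")
            count = 0
        current_url = url
        count += int(value)
    if current_url is not None:
        print(f"{current_url}\t{count}")


if __name__ == "__main__":
    run_mapper() if sys.argv[1:] == ["map"] else run_reducer()
```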
Environment: Hadoop, HDFS, Map Reduce, Hive, Pig, AWS, Flume, Oozie, HBase, Sqoop, RDBMS/DB, Flat files, MySQL, Java.
Confidential
Data Analyst
Responsibilities:
- The implementation of the project went through several phases: data set analysis, data set preprocessing, user-generated data extraction and modeling.
- Used SSIS to create ETL packages to Validate, Extract, Transform, and Load data into Data Warehouse and Data Mart.
- Used SAS for pre-processing data, SQL queries, data analysis, generating reports, graphics, and statistical analyses.
- Provided statistical research analyses and data modeling support for a mortgage product.
- Performed analyses such as regression analysis, logistic regression, discriminant analysis and cluster analysis using SAS programming.
- Performed data cleaning, data visualization, information retrieval and feature engineering using Python libraries such as Pandas, NumPy, scikit-learn, Matplotlib and Seaborn.
- Maintained and developed complex SQL queries, stored procedures, views, functions and reports that meet customer requirements using Microsoft SQL Server.
- Optimized query performance by modifying T-SQL queries, removing unnecessary columns and redundant data, normalizing tables, establishing joins and creating indexes.
- Used SAS/SQL to pull data from databases and aggregate it to provide detailed reporting based on user requirements.
- Built machine learning models based on decision trees, support vector machines and random forests to predict the different risk levels of applicants, and used grid search to improve accuracy on the cleaned data (a minimal scikit-learn sketch follows this list).
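As referenced above, a minimal, illustrative scikit-learn sketch of risk-level modeling with grid search. The data file, feature layout and hyperparameter grid are placeholder assumptions.

```python
# Minimal sketch: train a random forest risk-level classifier with grid search.
# The CSV file, target column and feature set are illustrative placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Cleaned applicant data with a categorical risk_level target (assumed layout).
data = pd.read_csv("applicants_clean.csv")
X = data.drop(columns=["risk_level"])
y = data["risk_level"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Grid search over a small hyperparameter grid with 5-fold cross-validation.
param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))
```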
Environment: SQL Server, DB2, Oracle, SQL Server Management Studio, MS BI Suite (SSIS/SSRS), T-SQL, machine learning, Linux, Python 2.x (scikit-learn/SciPy/NumPy/Pandas).