Sr Big Data Engineer Resume
O'Fallon, MO
SUMMARY
- Over 8 years of professional experience in Information Technology, with expertise in Big Data using the Hadoop framework and in analysis, design, development, testing, documentation, deployment, and integration using SQL and Big Data technologies.
- Excellent knowledge of Hadoop architecture, including HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and the MapReduce programming paradigm.
- Good understanding of Hadoop Gen1/Gen2 architecture and hands-on experience with Hadoop components such as Job Tracker, Task Tracker, Name Node, Secondary Name Node, Data Node, and MapReduce concepts, as well as YARN architecture, including Node Manager, Resource Manager, and Application Master.
- Involved in writing data transformations and data cleansing using Pig operations; good experience in data retrieval and processing using Hive.
- Worked with HBase to perform quick lookups (updates, inserts, and deletes) in Hadoop.
- Experienced in loading datasets into Hive for ETL (Extract, Transform, Load) operations.
- Experience in importing and exporting data using Sqoop between Relational Database Systems and HDFS.
- Experience with Microsoft Azure cloud services such as SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, and Azure Data Factory.
- Extensive experience importing and exporting data using stream-processing platforms such as Flume.
- Experience in database development using SQL and PL/SQL, and experience working with databases such as Oracle 12c/11g/10g, SQL Server, and MySQL.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
- Developed Apache Spark jobs using Scala and Python for faster data processing, using the Spark Core and Spark SQL libraries for querying.
- Experience working with Amazon AWS services such as EC2, EMR, S3, KMS, Kinesis, Lambda, API Gateway, and IAM.
- Experience with the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
- Proficient in Hive optimization techniques such as bucketing and partitioning.
- Experience in creating Spark Streaming jobs to process huge data sets in real time.
- Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying them on public or private clouds.
- Proficient in tools such as Erwin (Data Modeler, Model Mart, Navigator), ER Studio, IBM Metadata Workbench, Oracle data profiling tools, Informatica, Oracle Forms, Reports, SQL*Plus, Toad, and Crystal Reports.
- Experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, generating data visualizations using R, SAS, and Python, and creating dashboards using tools such as Tableau.
- Familiar with the Airflow scheduler, which spins up a subprocess that monitors and stays in sync with all DAGs in the specified DAG directory; once per minute by default, it collects DAG parsing results and checks whether any active tasks can be triggered (a minimal DAG sketch follows this list).
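The following is a minimal sketch of the kind of Airflow workflow referred to above, assuming Airflow 2.x imports; the DAG id, task names, paths, and commands are hypothetical placeholders rather than details from any actual project.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily ingest workflow; paths and commands are placeholders.
with DAG(
    dag_id="daily_ingest",            # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",       # the scheduler re-parses the DAG folder and triggers runs
    catchup=False,
) as dag:
    land_files = BashOperator(
        task_id="land_files",
        bash_command="hdfs dfs -put /staging/*.json /data/raw/",
    )
    load_hive = BashOperator(
        task_id="load_hive",
        bash_command="hive -f /scripts/load_raw.hql",
    )
    # Control-flow dependency: load into Hive only after the files have landed.
    land_files >> load_hive
```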
TECHNICAL SKILLS
Hadoop and Big Data Technologies: HDFS, MapReduce, Flume, Sqoop, Pig, Hive, Morphline, Kafka, Oozie, Spark, Nifi, Zookeeper, Elastic Search, Apache Solr, Talend, Cloudera Manager, R Studio, Confluent, Grafana
Databases: Oracle, DB2, MS-SQL Server, MySQL, MS-Access, Teradata
NoSQL Databases: HBase, Couchbase, Mongo DB, Cassandra
Programming and Scripting Languages: C, SQL, Python, C++, Shell scripting, R
Cloud Environment: Amazon Web Services (AWS), Microsoft Azure
Operating Systems: Windows, Unix (Red Hat Linux, CentOS, Ubuntu), macOS
IDE/Development Tools: Eclipse, NetBeans, IntelliJ, R Studio
PROFESSIONAL EXPERIENCE
Confidential, O’Fallon, MO
Sr Big Data Engineer
Responsibilities:
- Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie
- Used Apache NiFi to automate data movement between different Hadoop components
- Used NiFi to convert raw XML data into JSON and Avro formats
- Designed and published visually rich, intuitive Tableau dashboards and Crystal Reports for executive decision making
- Developed Spark scripts using Python in the PySpark shell during development.
- Experienced in Hadoop production support tasks, analyzing application and cluster logs
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL Data Warehouse) and handled cloud migration, processing the data in Azure Databricks
- Worked on creating data pipelines with Copy Activity, moving and transforming data with custom Azure Data Factory pipeline activities for on-cloud ETL processing
- Created reports using visualizations such as bar charts, clustered column charts, waterfall charts, gauges, pie charts, and tree maps in Power BI.
- Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
- Performed big data analysis using Scala, Spark, Spark SQL, Hive, MLlib, and machine learning algorithms.
- Worked on various data formats such as Avro, SequenceFile, JSON, MapFile, Parquet, and XML
- Developed an automation system using PowerShell scripts and JSON templates to remediate Azure services
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Used the Spark DataFrame API in Scala for analyzing data
- Extracted data from data lakes and the EDW into relational databases for analysis and deeper insights using SQL queries and PySpark
- Utilized Sqoop, Kafka, Flume, and Hadoop File System APIs to implement data ingestion pipelines
- Wrote MapReduce programs and Hive UDFs in Java.
- Worked on real-time streaming and performed transformations on the data using Kafka and Spark Streaming
- Assisted in upgrading, configuring, and maintaining Hadoop ecosystem components such as Pig, Hive, and HBase
- Installed and configured Apache Airflow for workflow management and created workflows in Python
- Used HBase to store Kafka topic names, partition numbers, and offset values, and used the Phoenix JAR to connect to HBase tables (an offset-tracking sketch follows this list).
- Hands-on experience in Hadoop administration and support activities, installing and configuring Apache big data tools and Hadoop clusters using Cloudera Manager
- Used PySpark to create a batch job that merges multiple small files (Kafka stream files) into larger Parquet files (a compaction sketch follows this list).
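Below is a minimal sketch of the small-file compaction job mentioned above, assuming hypothetical HDFS paths and JSON-formatted landing files; the real inputs, partition layout, and output file count would differ.

```python
from pyspark.sql import SparkSession

# Hypothetical HDFS paths; the landing directory holds many small JSON files
# written from a Kafka stream, which are rewritten as a few larger Parquet files.
spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

raw = spark.read.json("hdfs:///data/raw/kafka_landing/dt=2021-01-01/")

(raw
 .coalesce(8)            # collapse many input splits into a handful of output files
 .write
 .mode("overwrite")
 .parquet("hdfs:///data/curated/events/dt=2021-01-01/"))
```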
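And a minimal sketch of tracking Kafka offsets in HBase, shown here with the happybase Thrift client rather than Phoenix; the host, table name, and column family are hypothetical.

```python
import happybase

# Hypothetical host, table, and column family; one row per topic-partition,
# holding the last offset that was successfully processed.
connection = happybase.Connection("hbase-thrift-host")
offsets = connection.table("kafka_offsets")   # assumed pre-created with family 'o'

def save_offset(topic: str, partition: int, offset: int) -> None:
    row_key = f"{topic}:{partition}".encode()
    offsets.put(row_key, {b"o:offset": str(offset).encode()})

def load_offset(topic: str, partition: int) -> int:
    row = offsets.row(f"{topic}:{partition}".encode())
    return int(row.get(b"o:offset", b"0"))
```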
Environment: Hadoop YARN, Azure, Databricks, Data Lake, Data Storage, Power BI, Azure SQL, Spark Core, Spark Streaming, Spark SQL, Spark MLlib, Python, Kafka, Hive, Java, Scala, Sqoop, Impala, Cassandra, Tableau, Talend, Cloudera, MySQL, Linux.
Confidential, Green Bay, WI
Big Data Engineer
Responsibilities:
- Developed Spark scripts using Scala and Java as per requirements.
- Developed automated regression scripts in Python to validate ETL processes across multiple databases such as AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server.
- Automated data processing with Oozie, loading data into the Hadoop Distributed File System.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Worked on dimensional and relational data modeling using star and snowflake schemas, OLTP/OLAP systems, and conceptual, logical, and physical data modeling using Erwin.
- Developed shell scripts to run Hive scripts in Hive and Impala.
- Responsible for developing a data pipeline on Amazon AWS to extract data from weblogs and store it in MongoDB.
- Experience in working with NoSQL databases like HBase and Cassandra.
- Used Zookeeper to provide coordination services to the cluster. Experienced in managing and reviewing Hadoop log files.
- Implemented a variety of AWS computing and networking services to meet application needs.
- Developed a Spark Streaming application to read raw packet data from Kafka topics, format it as JSON, and push it back to Kafka for future use cases (a streaming sketch follows this list).
- Designed a data analysis pipeline in Python using Amazon Web Services such as S3, EC2, and Elastic MapReduce.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
- Designed and implemented Sqoop incremental and delta imports on tables without primary keys or date columns from Teradata and SAP HANA, appending directly into the Hive warehouse.
- Designed and developed Oracle PL/SQL and shell scripts, data import/export, data conversions, and data cleansing.
- Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig in Python.
- Performed data ingestion using Sqoop, Apache Kafka, Spark Streaming, and Flume.
- Designed and implemented static and dynamic partitioning and bucketing in Hive.
- Worked on cluster coordination services through Zookeeper.
- Created and modified shell scripts to schedule various data cleansing scripts and ETL load processes.
- Worked extensively on AWS components such as Elastic MapReduce (EMR).
- Started working with AWS for storing terabytes of data for customer BI reporting tools.
- Implemented Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Worked with Apache NiFi to develop custom processors for processing and distributing data among cloud systems. Created a new CFT, validated the IP addresses in Lambda, ran the Spark master, and destroyed the old CFT stack in Dev, QA, and Prod.
- Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop, and Spark jobs.
- Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
- Used the AWS CLI to suspend an AWS Lambda function and to automate backups of ephemeral data stores to S3 buckets and EBS.
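A minimal PySpark Structured Streaming sketch of the Kafka round trip described above; the broker address, topic names, and the simplified delimited packet layout are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_csv, to_json

spark = SparkSession.builder.appName("packets-to-json").getOrCreate()

# Hypothetical brokers, topics, and packet schema.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "raw_packets")
       .load())

# Parse the delimited payload into a struct, then serialize it back out as JSON.
parsed = raw.select(
    from_csv(col("value").cast("string"),
             "src_ip STRING, dst_ip STRING, bytes INT").alias("p"))

query = (parsed
         .select(to_json(col("p")).alias("value"))
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("topic", "packets_json")
         .option("checkpointLocation", "hdfs:///checkpoints/packets_json")
         .start())

query.awaitTermination()
```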
Environment: Hadoop YARN, MapReduce, HBase, Spark Core, Spark SQL, Scala, Python, Java, Hive, Sqoop, Impala, Oracle, Kafka, Linux, Git, Oozie.
Confidential, Franklin, TN
Big Data Engineer
Responsibilities:
- Worked on Big Data infrastructure for batch processing as well as real-time processing. Responsible for building scalable distributed data solutions using Hadoop.
- Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying on the log data.
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
- Experienced in managing and reviewing Hadoop log files.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Experience in designing and developing applications in Spark using Scala, comparing the performance of Spark with Hive and SQL/Oracle.
- Implemented Python scripts that perform transformations and actions on tables and send incremental data to the next zone using spark-submit.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
- Experience in importing and exporting terabytes of data using Sqoop between HDFS and relational database systems.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
- Designed and implemented incremental imports into Hive tables.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Devised PL/SQL stored procedures, functions, triggers, views, and packages. Made use of indexing, aggregation, and materialized views to optimize query performance.
- Migrated an existing on-premises application to AWS.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Worked on different file formats such as SequenceFiles, XML files, and MapFiles using MapReduce programs.
- Implemented data ingestion and handled clusters for real-time processing using Kafka.
- Moved relational database data using Sqoop into Hive dynamic-partition tables via staging tables (a load sketch follows this list).
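A minimal sketch of the staging-to-partitioned-table load pattern mentioned above, issued through Spark SQL with Hive support; the table and column names are hypothetical.

```python
from pyspark.sql import SparkSession

# Hypothetical tables: data lands in a flat staging table (e.g. via Sqoop)
# and is then inserted into a date-partitioned Hive table with dynamic partitioning.
spark = (SparkSession.builder
         .appName("staging-to-partitioned")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

spark.sql("""
    INSERT INTO TABLE sales_partitioned PARTITION (txn_date)
    SELECT order_id, customer_id, amount, txn_date
    FROM   sales_staging
""")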
Environment: Hadoop, HDFS, Pig, Apache Hive, Sqoop, Flume, Python, Kafka, Apache Spark, HBase, Scala, Zookeeper, Maven, AWS, MySQL.
Confidential
Hadoop Developer
Responsibilities:
- Installed and configured Flume, Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.
- Responsible for coding Java batch programs, RESTful services, MapReduce programs, and Hive queries, along with testing, debugging, peer code review, troubleshooting, and status reporting.
- Handled continuous streaming data from different sources using Flume, with HDFS as the destination.
- Developed job workflows in Oozie to automate the tasks of loading data into HDFS and running a few other Hive jobs.
- Involved in collecting, aggregating, and moving data from servers to HDFS using Flume.
- Experience in creating various Oozie jobs to manage processing workflows.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Developed Pig scripts to store unstructured data in HDFS.
- Developed Pig Latin scripts to extract and filter relevant data from web server output files and load it into HDFS.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Optimized MapReduce jobs to use HDFS efficiently by applying various compression mechanisms.
- Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.
- Experienced in managing and reviewing Hadoop log files using shell scripts.
- Developed Flume agents for loading and filtering streaming data into HDFS.
- Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
- Experienced in performing manual and automation testing on web and desktop applications.
- Developed automation test cases using Selenium with Java and Python and executed them locally and remotely using Jenkins (a minimal sketch follows this list).
- Helped the testing team set up Selenium Grid to run large volumes of test cases on VMs.
- Reported defects found during test execution and generated the corresponding test reports.
- Experience in writing custom MapReduce programs and UDFs in Java to extend Hive and Pig core functionality.
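A minimal sketch of the kind of Selenium test case described above, written in Python with the Selenium 4 API; the URL, element locators, and expected title are hypothetical placeholders.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical login smoke test; page and locators are placeholders.
driver = webdriver.Chrome()
try:
    driver.get("https://example.com/login")
    driver.find_element(By.ID, "username").send_keys("qa_user")
    driver.find_element(By.ID, "password").send_keys("not-a-real-password")
    driver.find_element(By.ID, "submit").click()
    assert "Dashboard" in driver.title
finally:
    driver.quit()
```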
Environment: Hadoop, HDFS, Map Reduce, Hive, Pig, AWS, Flume, Oozie, HBase, Sqoop, RDBMS/DB, Flat files, MySQL, Java.
Confidential
Junior Software Engineer
Responsibilities:
- Involved in all phases of Software Development Life Cycle (SDLC).
- Wrote Hive UDFs in Java where the functionality is too complex.
- Used Pig (Pig Latin) scripts for ad-hoc data retrieval
- Created Hive tables and wrote Hive queries using HiveQL
- Used the Spring framework for Dependency Injection and integrated it with Hibernate.
- Extensively worked with Hibernate Query Language (HQL) to store and retrieve the data from Oracle database.
- Worked on Installing and configuring MapReduce, HDFS and developed multiple MapReduce jobs in java for data cleaning and pre-processing.
- Developed shell scripts to run the nightly batch cycle and to set environment variables.
- Used Maven to build the project, run unit tests and deployed artifacts to Nexus repository.
- Involved in writing SQL queries and procedures.
- Developed RESTful Web Services to retrieve mutual funds data.
- Used SOAPUI to test the web services.
Environment: Hadoop, Cloudera Distribution, Java, HDFS, Hive, Sqoop, Eclipse, JSP, HTML, JavaScript, Spring MVC, Hibernate, Jersey, SOAP UI, Oracle 9i, JBoss