Sr. Big Data Engineer Resume
Boise, ID
SUMMARY
- Over 7 years of IT experience in software development, with strong work experience as a Big Data/Hadoop Developer and a solid understanding of the Hadoop framework.
- Expertise in Hadoop architecture and various components such as HDFS, YARN, High Availability, Job Tracker, Task Tracker, Name Node, Data Node, and the MapReduce programming paradigm.
- Experience with all aspects of development, from initial implementation and requirement discovery through release, enhancement, and support (SDLC & Agile techniques).
- Good experience in core Python object-oriented programming.
- Good experience communicating with devices using Python (ports, sockets, etc.).
- Experience with Delta Lake and data lakes on AWS and Azure.
- Experience in design, development, data migration, testing, support and maintenance using Redshift databases.
- Experience with Apache Hadoop technologies such as the Hadoop Distributed File System (HDFS), the MapReduce framework, Hive, Pig, PySpark, Sqoop, Oozie, HBase, Spark, Scala and Python.
- Experience in AWS cloud solution development using Lambda, SQS, SNS, Dynamo DB, Athena, S3, EMR, EC2, Redshift, Glue, and CloudFormation.
- Experience in using Microsoft Azure SQL database, Data Lake, Azure ML, Azure data factory, Functions, Databricks and HDInsight.
- Working experience with big data in the cloud using AWS EC2 and Microsoft Azure; handled Redshift and DynamoDB databases with data volumes of around 300 TB.
- Extensive experience in migrating on premise Hadoop platforms to cloud solutions using AWS and Azure.
- Experience writing Python-based ETL frameworks and PySpark jobs to process large volumes of data daily.
- Strong experience implementing data models and loading unstructured data using HBase, DynamoDB and Cassandra.
- Monitored and created alerts for critical KPIs, metrics and data visualizations for business processes.
- Created multiple report dashboards, visualizations and heat maps using Tableau, QlikView and Qlik Sense reporting tools.
- Experience with SSIS, Power BI Desktop, Power BI Service, the M language and DAX.
- Strong experience extracting and loading data from different sources using complex business logic in Hive, and building ETL pipelines that process terabytes of data daily.
- Experienced in transporting and processing real-time event streams using Kafka and Spark Streaming.
- Hands-on experience importing and exporting data between relational databases and HDFS, Hive and HBase using Sqoop.
- Experienced in processing real-time data using Kafka producers and stream processors; implemented stream processing with Kinesis, landing data into an S3 data lake.
- Experience in implementing multitenant models for the Hadoop 2.0 ecosystem using various big data technologies.
- Designed and developed Spark pipelines to ingest real-time, event-based data from Kafka and other message queue systems, and processed large volumes of data into the Hive data warehouse with Spark batch jobs (see the streaming sketch at the end of this summary).
- Experienced in creating and analyzing Software Requirement Specifications (SRS) and Functional Specification Documents (FSD).
- Excellent working experience in Scrum / Agile framework, Iterative and Waterfall project execution methodologies.
- Development-level experience in Microsoft Azure, providing data movement and scheduling functionality for cloud-based technologies such as Azure Blob Storage and Azure SQL Database.
- Designed data models for both OLAP and OLTP applications using Erwin and used both star and snowflake schemas in the implementations.
- Capable of organizing, coordinating, and managing multiple tasks simultaneously.
- Excellent communication and interpersonal skills; self-motivated, organized and detail-oriented; able to work well under deadlines in a changing environment and perform multiple tasks effectively and concurrently.
- Strong analytical skills with the ability to quickly understand clients' business needs. Involved in meetings to gather information and requirements from the clients.
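The Kafka-to-data-lake pipelines described above follow a standard Structured Streaming pattern; a minimal PySpark sketch is shown below. The broker address, topic name, event schema and S3 paths are hypothetical placeholders rather than details from this resume, and the job assumes the spark-sql-kafka connector is on the classpath.

```python
# Minimal sketch: stream Kafka events into a data lake as parquet.
# Broker, topic, schema, and S3 paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-datalake").getOrCreate()

# Assumed event payload; adjust to the real schema.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
       .option("subscribe", "events")                      # hypothetical topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers bytes; cast the value and parse the JSON payload.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Land the parsed events in the data lake; the checkpoint enables exactly-once file output.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://datalake/events/")            # hypothetical bucket
         .option("checkpointLocation", "s3a://datalake/_chk/events/")
         .outputMode("append")
         .start())
query.awaitTermination()
```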
TECHNICAL SKILLS
Hadoop: Hadoop, Spark (PySpark), MapReduce, Hive, Pig, Impala, Sqoop, HDFS, HBase, Oozie, Ambari, Scala and MongoDB
Cloud Technologies: AWS Kinesis, Lambda, EMR, EC2, SNS, SQS, Dynamo DB, Step Functions, Glue, Athena, CloudWatch, Azure Data Factory, Azure Data Lake, Functions, Azure SQL Data Warehouse, Databricks and HDInsight, Snowflake
RDBMS: Amazon Redshift, PostgreSQL, Oracle, SQL Server, IBM DB2, Teradata, Netezza and MS SQL
NoSQL: MongoDB, Cassandra, HBase
ETL Tools: DataStage, Talend and Ab Initio
Reporting Tools: Power BI, Tableau, TIBCO Spotfire, QlikView and Qlik Sense
Deployment Tools: Git, Jenkins, Terraform and CloudFormation
Programming Language: Python, Scala, PL/SQL, SQL and Java
Scripting: Unix Shell and Bash scripting
PROFESSIONAL EXPERIENCE
Confidential, Boise, ID
Sr. Big Data Engineer
Responsibilities:
- Developed Spark scripts using Scala and Java as per the requirements.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
- Developed automated regression scripts in Python to validate the ETL process between multiple databases such as AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL).
- Automated data processing with Oozie, including data loading into the Hadoop Distributed File System.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Worked on dimensional and relational data modeling using star and snowflake schemas for OLTP/OLAP systems, and on conceptual, logical and physical data modeling using Erwin.
- Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
- Designed and implemented static and dynamic partitioning and bucketing in Hive (see the sketch at the end of this section).
- Worked on cluster coordination services through ZooKeeper.
- Integrated and automated data workloads to Snowflake Warehouse.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Created and modified shell scripts for scheduling various data cleansing scripts and ETL load processes.
- Worked extensively on AWS components such as Elastic MapReduce (EMR).
- Started working with AWS for storage and handling of terabytes of data for customer BI reporting tools.
- Implemented Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Worked with Apache NiFi to develop custom processors for processing and distributing data among cloud systems. Created a new CloudFormation template (CFT), validated the IP addresses in Lambda, ran the Spark master, and destroyed the old CFT stack in Dev, QA and Prod.
- Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop and Spark jobs.
- Developed shell scripts for running Hive scripts in Hive and Impala.
- Responsible for developing a data pipeline with AWS to extract data from weblogs and store it in MongoDB.
- Experience in working with NoSQL databases like HBase and Cassandra.
- Used ZooKeeper to provide coordination services to the cluster. Experienced in managing and reviewing Hadoop log files.
- Implemented a variety of AWS computing and networking services to meet the needs of applications.
- Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation, queries and writing data back into the OLTP system through Sqoop.
- Developed a Spark Streaming application to read raw packet data from Kafka topics, format it as JSON, and push it back to Kafka for future use cases.
- Migrated an existing on-premises application to AWS.
- Used CloudWatch Logs to move application logs to S3 and created alarms based on a few exceptions raised by applications.
- Designed a data analysis pipeline in Python using Amazon Web Services such as S3, EC2 and Elastic MapReduce.
- Installed applications on AWS EC2 instances and configured storage on S3 buckets.
- Stored data in AWS S3, similar to HDFS, and ran EMR programs on the stored data.
- Used the AWS CLI to suspend an AWS Lambda function and to automate backups of ephemeral data stores to S3 buckets and EBS.
- Implemented large-scale technical solutions using object-oriented design and programming concepts in Python.
- Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per the requirements.
- Designed and implemented Sqoop incremental and delta imports on tables without primary keys or dates from Teradata and SAP HANA, appending directly into the Hive warehouse.
- Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
- Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversion and data cleansing.
- Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig in Python.
- Worked with relational database systems (RDBMS) such as Oracle and NoSQL databases like HBase.
- Involved in writing T-SQL and working on SSIS, SSAS, data cleansing, data scrubbing and data migration.
- Performed data ingestion using Sqoop, Apache Kafka, Spark Streaming and Flume.
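As a companion to the Hive partitioning and bucketing bullet above, here is a minimal PySpark-driven sketch of static and dynamic partition loads into a Hive ORC table. The table and column names (sales_part, staging_sales, load_date) are illustrative only, and the staging table is assumed to already exist.

```python
# Minimal sketch of static vs. dynamic partition inserts into a Hive table via Spark SQL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Required for dynamic-partition inserts.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# Partitioned ORC target table. (In Hive itself, a bucketed variant would add
# e.g. CLUSTERED BY (order_id) INTO 8 BUCKETS to this DDL.)
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_part (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
""")

# Static partition: the partition value is fixed in the statement.
spark.sql("""
    INSERT INTO sales_part PARTITION (load_date = '2021-01-01')
    SELECT order_id, amount FROM staging_sales WHERE load_date = '2021-01-01'
""")

# Dynamic partition: partition values come from the data itself.
spark.sql("""
    INSERT INTO sales_part PARTITION (load_date)
    SELECT order_id, amount, load_date FROM staging_sales
""")
```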
Environment: Hadoop YARN, MapReduce, AWS, EC2, S3, Auto Scaling, CloudWatch, CloudFormation, IAM, Security Groups, Redshift, EMR, Snowflake, HBase, Spark Core, Spark SQL, Scala, Python, Java, Hive, Sqoop, Impala, Oracle, Kafka, Yarn, Linux, GIT, Oozie.
Confidential, Sunnyvale, CA
Big Data Engineer
Responsibilities:
- Developed SSRS reports and SSIS packages to extract, transform and load data from various source systems.
- Implemented and managed ETL solutions and automated operational processes.
- Developed Python scripts to automate the data sampling process. Ensured data integrity by checking for completeness, duplication, accuracy and consistency.
- Wrote various data normalization jobs for new data ingested into Redshift.
- Created various complex SSIS/ETL packages to Extract, Transform and Load data
- Unit tested the data between Redshift and Snowflake.
- Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract the data in a timely manner.
- Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
- Analyzed the existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler and Database Engine Tuning Advisor to enhance performance.
- Used Hive SQL, Presto SQL and Spark SQL for ETL jobs, choosing the right technology to get the job done.
- Defined and deployed monitoring, metrics, and logging systems on AWS.
- Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
- Redesigned the views in Snowflake to increase performance.
- Integrated Kafka with Spark Streaming for real-time data processing.
- Managed security groups on AWS, focusing on high availability, fault tolerance and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
- Created entity relationship diagrams (ERD), functional diagrams and data flow diagrams, enforced referential integrity constraints, and created logical and physical models using Erwin.
- Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
- Applied various machine learning algorithms and statistical models such as decision trees, logistic regression and gradient boosting machines to build predictive models using the scikit-learn package in Python (see the sketch at the end of this section).
- Worked on big data with AWS cloud services, i.e. EC2, S3, EMR and DynamoDB.
- Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to AWS.
- Worked on publishing interactive data visualization dashboards, reports and workbooks on Tableau and SAS Visual Analytics.
- Developed RDDs/DataFrames in Spark and applied several transformations to load data from Hadoop data lakes.
- Developed code to handle exceptions and push them into the exception Kafka topic.
- Was responsible for ETL and data validation using SQL Server Integration Services.
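As a companion to the scikit-learn bullet above, here is a minimal sketch comparing the model families named there (logistic regression, decision tree, gradient boosting). The CSV path, feature layout and label column are hypothetical.

```python
# Minimal scikit-learn sketch: fit and compare three classifiers on a labeled extract.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("training_data.csv")   # hypothetical extract with a "label" column
X = df.drop(columns=["label"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=6),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=200),
}

# Train each model and compare holdout AUC.
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```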
Environment: SQL Server, Erwin, Kafka, Python, MapReduce, Oracle, AWS, Redshift, Informatica, RDS, NoSQL, Snowflake, MySQL, PostgreSQL.
Confidential, St Louis, Missouri
Data Engineer
Responsibilities:
- Extensively used Databricks notebooks for interactive analysis with Spark APIs.
- Developed a data pipeline using Kafka and Spark to store data into HDFS.
- Used Azure Synapse to manage processing workloads and serve data for BI and predictions.
- Responsible for the design and deployment of Spark SQL scripts and Scala shell commands based on functional specifications.
- Implemented versatile microservices to handle concurrency and high traffic. Enhanced existing Scala code and improved cluster performance.
- Designed and automated custom-built input connectors using Spark, Sqoop and Oozie to ingest and analyze data from RDBMS sources into Azure Data Lake.
- Involved in building an enterprise data lake using Data Factory and Blob Storage, enabling different teams to work on more complex scenarios and ML solutions.
- Experience working with the Azure cloud platform (HDInsight, Databricks, Data Lake, Blob, Data Factory, Synapse, SQL DB and SQL DWH).
- Created DAX queries to generate computed columns in Power BI.
- Broad experience working with SQL, with deep knowledge of T-SQL (MS SQL Server).
- Used Azure Databricks as a fast, easy and collaborative Spark-based platform on Azure.
- Used Databricks to integrate easily with the whole Microsoft stack.
- Experience configuring, designing, implementing and monitoring Kafka clusters and connectors.
- Used Azure Event Grid, a managed event service, to easily manage events across many different Azure services and applications.
- Worked with the data science team on preprocessing and feature engineering, and helped move machine learning algorithms into production.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL Activity.
- Developed Spark Scala scripts for mining information and performed transformations on large datasets to deliver timely insights and reports.
- Supported analytical phases, dealt with data quality, and improved performance using Scala's higher-order functions, lambda expressions, pattern matching and collections.
- Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
- Provided guidance to the development team working on PySpark as an ETL platform.
- Managed resources and scheduling across the cluster using Azure Kubernetes Service.
- Performed data cleansing and applied transformations using Databricks and Spark data analysis.
- Created ADF pipelines using Linked Services, Datasets and Pipelines to extract, transform and load data to and from various sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse and the write-back tool.
- Used Azure Data Factory with the SQL API and MongoDB API, and integrated data from MongoDB, MS SQL and the cloud (Blob, Azure SQL DB).
- Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables (see the sketch at the end of this section).
- Involved in creating database components such as tables, views and triggers using T-SQL to structure and maintain data effectively.
- Used Power BI and Power Pivot to develop data analysis prototypes, and used Power View and Power Map to visualize reports.
- Used Azure Synapse to provide a unified experience to ingest, explore, prepare, manage and serve data for immediate BI and machine learning needs.
- Worked on Kafka and Spark integration for real-time data processing.
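As a companion to the CSV-to-Hive bullet above, here is a minimal PySpark sketch that loads CSV feeds with different layouts into separate Hive ORC tables. The ADLS paths, feed names and the staging Hive database are hypothetical and assumed to exist.

```python
# Minimal sketch: land differently-shaped CSV feeds as Hive ORC tables.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-to-hive-orc")
         .enableHiveSupport()
         .getOrCreate())

# Each feed has its own layout, so the schema is inferred per source rather than shared.
feeds = {
    "customers": "abfss://raw@lakestore.dfs.core.windows.net/customers/*.csv",  # hypothetical path
    "orders":    "abfss://raw@lakestore.dfs.core.windows.net/orders/*.csv",     # hypothetical path
}

for table, path in feeds.items():
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(path))
    (df.write
       .mode("overwrite")
       .format("orc")
       .saveAsTable(f"staging.{table}"))   # assumes the "staging" Hive database exists
```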
Environment: Hadoop, Spark, Hive, Sqoop, HBase, Oozie, Talend, Kafka, Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, AKS), Scala, Python, Cosmos DB, MS SQL, MongoDB, Ambari, Power BI, Azure DevOps, Microservices, K-Means, KNN, Ranger, Git
Confidential, Boston, MA
Data Engineer
Responsibilities:
- Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
- Created ETL mappings with Talend Integration Suite to pull data from sources, apply transformations, and load data into the target database.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Ingested real-time and near-real-time (NRT) streaming data into HDFS using Flume.
- Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data.
- Visualized the results using Tableau dashboards; the Python Seaborn library was used for data interpretation in deployment.
- Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
- Involved in collecting, aggregating and moving data from servers to HDFS using Flume.
- Imported and exported data between relational data sources like DB2, SQL Server and Teradata and HDFS using Sqoop.
- Involved in data ingestion into HDFS using Sqoop for full loads and Flume for incremental loads from a variety of sources such as web servers, RDBMS and data APIs.
- Responsible for loading data into HBase using the HBase shell and the HBase client API.
- Experienced in handling administration activities using Cloudera Manager.
- Involved in developing Spark SQL queries and DataFrames to import data from data sources, perform transformations and read/write operations, and save the results to an output directory in HDFS.
- Involved in writing optimized Pig scripts along with developing and testing Pig Latin scripts.
- Involved in transforming data from mainframe tables to HDFS and HBase tables using Sqoop.
- Created custom Solr query components to optimize search matching.
- Stored the transformed time-series data from the Spark engine, built on top of a Hive platform, in Amazon S3 and Redshift.
- Collected data using Spark Streaming from an AWS S3 bucket in near-real time and performed the necessary transformations and aggregations to build the data model, persisting the data in HDFS (see the sketch at the end of this section).
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
- Automatically scaled up the EMR instances based on the data volume.
- Imported bulk data into HBase using MapReduce programs.
- Used Scala to store streaming data to HDFS and implemented Spark for faster data processing.
- Involved in migrating tables from RDBMS into Hive tables using Sqoop and later generating visualizations using Tableau.
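As a companion to the S3 streaming bullet above, here is a minimal Structured Streaming sketch that picks up new JSON files from an S3 prefix in near-real time, applies a windowed aggregation, and persists the result to HDFS. The bucket, schema and output paths are hypothetical.

```python
# Minimal sketch: near-real-time file stream from S3, windowed counts written to HDFS.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("s3-stream-to-hdfs").getOrCreate()

schema = StructType([
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

stream = (spark.readStream
          .schema(schema)                        # file sources require an explicit schema
          .json("s3a://ingest-bucket/events/"))  # hypothetical S3 prefix

# Watermark + window so the append-mode file sink only receives finalized windows.
counts = (stream
          .withWatermark("event_ts", "10 minutes")
          .groupBy(window(col("event_ts"), "5 minutes"), col("event_type"))
          .count())

query = (counts.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/event_counts/")
         .option("checkpointLocation", "hdfs:///checkpoints/event_counts/")
         .outputMode("append")
         .start())
query.awaitTermination()
```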
Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, AWS, YARN, Hive, Pig, Sqoop, Oozie, Tableau, Java, Solr.
Confidential
Hadoop Developer
Responsibilities:
- Involved in collecting, aggregating and moving data from servers to HDFS using Flume.
- Experience in creating various Oozie jobs to manage processing workflows.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Developed Pig scripts to store unstructured data in HDFS.
- Developed Pig Latin scripts to extract and filter relevant data from the web server output files to load into HDFS.
- Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Used AWS S3 to store large amounts of data in a common repository.
- Responsible for coding Java batch jobs, RESTful services, MapReduce programs and Hive queries, as well as testing, debugging, peer code reviews, troubleshooting and maintaining status reports.
- Handled continuous streaming data from different sources using Flume, with HDFS set as the destination.
- Developed job workflows in Oozie to automate the tasks of loading data into HDFS and running a few other Hive jobs.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
- Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.
- Experienced in managing and reviewing the Hadoop log files using shell scripts.
- Developed Flume agents for loading and filtering the streaming data into HDFS.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Experience in writing custom MapReduce programs and UDFs in Java to extend Hive and Pig core functionality (see the sketch below).
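The resume mentions custom UDFs for Hive and Pig, written in Java here and in Python in the earlier role. For consistency with the other sketches, below is a minimal Python streaming-script variant (Hive TRANSFORM) rather than a Java UDF; the column layout, table name and script name are hypothetical.

```python
#!/usr/bin/env python
# Minimal Hive streaming "UDF" sketch: Hive pipes tab-separated rows to this script
# via TRANSFORM and reads the transformed rows back from stdout.
#
# Example HiveQL invocation (hypothetical table and columns):
#   ADD FILE clean_url.py;
#   SELECT TRANSFORM (user_id, raw_url)
#          USING 'python clean_url.py'
#          AS (user_id, domain)
#   FROM web_logs;
import sys
from urllib.parse import urlparse

for line in sys.stdin:
    user_id, raw_url = line.rstrip("\n").split("\t")
    domain = urlparse(raw_url).netloc or "unknown"   # keep a placeholder when parsing fails
    print(f"{user_id}\t{domain}")
```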
Environment: Hadoop, HDFS, Map Reduce, Hive, Pig, AWS, Flume, Oozie, HBase, Sqoop, RDBMS/DB, Flat files, MySQL, Java.