Sr. Big Data Engineer Resume
El Segundo, CA
SUMMARY
- 8+ years of IT experience in analysis, design, development, implementation, maintenance, and support, with experience in developing strategic methods for deploying big data technologies to efficiently solve Big Data processing requirements.
- Configured ZooKeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase and HDFS.
- Implemented continuous integration/continuous delivery (CI/CD) best practices using Azure DevOps, ensuring code versioning.
- Created and maintained various Shell and Python scripts for automating processes; optimized MapReduce code and Pig scripts and performed performance tuning and analysis.
- Performance-tuned Phoenix/HBase, Hive queries and Spark jobs.
- Experience developing Kafka producers and consumers for streaming millions of events per second.
- Extensive knowledge in writing Hadoop jobs for data analysis per the business requirements using Hive; worked on HiveQL queries for required data extraction, join operations, and writing custom UDFs as required, with good experience optimizing Hive queries.
- Experience in importing and exporting data using Sqoop between HDFS and relational database systems, and loading it into partitioned Hive tables.
- Good knowledge of and experience with Amazon Web Services (AWS) concepts such as EMR and EC2; successfully loaded files to HDFS from Oracle, SQL Server, Teradata and Netezza using Sqoop.
- Created notebooks using Databricks, Scala and Spark, capturing data from Delta tables in Delta Lake.
- Designed and developed Hive and HBase data structures and Oozie workflows.
- Used Pig Latin on the client-side cluster and HiveQL on the server-side cluster.
- Worked extensively on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
- Designed workflows and coordinators for task management and scheduling, using Oozie to orchestrate the jobs.
- Experienced in implementing a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka- and ZooKeeper-based log collection platform.
- Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
- Solid experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance (see the sketch following this summary).
- Exposed transformed data in the Azure Databricks (Spark) platform in Parquet format for efficient data storage.
- Capable of using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
- Developed Spark/Scala and Python regular expression (regex) code in the Hadoop/Hive environment with Linux/Windows for big data resources.
- Developed Oozie workflow schedulers to run multiple Hive and Pig jobs that run independently with time and data availability.
- Working experience with NoSQL databases like HBase, Azure Cosmos DB, MongoDB and Cassandra, including functionality and implementation.
- Good knowledge of AWS CloudFormation templates; configured the SQS service through the Java API to send and receive information.
- Extensively used open-source languages Perl, Python, Scala and Java.
- Used Oozie and ZooKeeper operational services for coordinating the cluster and scheduling workflows.
- Experience in extracting files from MongoDB through Sqoop, placing them in HDFS, and processing them.
- Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
- Good working experience with Hadoop data warehousing tools like Hive and Pig; also involved in extracting data from these tools onto the cluster using Sqoop.
- Containerized data wrangling jobs in Docker containers, utilizing Git and Azure DevOps for version control.
- Experience with operating systems: Linux, RedHat, and UNIX.
- Worked with various programming languages using IDEs and tools like Eclipse, NetBeans, IntelliJ, PuTTY, and Git.
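The sketch below illustrates the partitioned and bucketed Hive table design mentioned above (managed vs. external tables). It is a minimal, hypothetical PySpark example; the table names, columns, and HDFS paths are placeholders rather than details of any project listed here.

```python
# Hypothetical sketch: one external and one managed Hive table, partitioned and
# bucketed, created through Spark SQL with Hive support enabled.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-table-design-sketch")
    .enableHiveSupport()   # the DDL below targets the Hive metastore
    .getOrCreate()
)

# External table: Hive tracks only metadata, so dropping the table leaves the
# underlying HDFS files in place.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
    LOCATION '/data/raw/sales'
""")

# Managed table: partitioned by date and bucketed by customer_id so joins and
# sampling on that key touch fewer files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_curated (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")
```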
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, ZooKeeper, Spark 2.0, Ambari, Mahout, MongoDB, Cassandra, Avro, Storm, Parquet and Snappy.
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR and Apache
Languages: Java, Python, SQL, Scala, JavaScript, XML and C/C++
NoSQL Databases: Cassandra, MongoDB and HBase
Java Technologies: Servlets, JavaBeans, JSP, JDBC, JNDI, EJB and Struts
XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB
Cloud Platform: AWS, Azure and Snowflake
Development / Build Tools: Eclipse, Ant, Maven, IntelliJ, JUnit and Log4j
Frameworks: Struts, Spring and Hibernate
App/Web servers: WebSphere, WebLogic, JBoss and Tomcat
DB Languages: MySQL, PL/SQL, PostgreSQL and Oracle
RDBMS: Teradata, Oracle 9i/10g/11i, MS SQL Server, MySQL and DB2
Operating systems: UNIX, Linux, macOS and Windows variants
Data analytical tools: R and MATLAB
ETL Tools: Talend, Pentaho
PROFESSIONAL EXPERIENCE
Confidential, El Segundo, CA
Sr. Big Data Engineer
Responsibilities:
- Extensively utilized Databricks notebooks for interactive analysis utilizing Spark APIs.
- Developed a data pipeline using Kafka and Spark to store data into HDFS.
- Designed and automated custom-built input connectors using Spark, Sqoop and Oozie to ingest and analyze data from RDBMS into Azure Data Lake.
- Involved in building an Enterprise Data Lake using Data Factory and Blob Storage, enabling different teams to work with more complex scenarios and ML solutions.
- Experience working with the Azure cloud platform (HDInsight, Databricks, Data Lake, Blob, Data Factory, Synapse, SQL DB and SQL DWH).
- Broad experience working with SQL, with deep knowledge of T-SQL (MS SQL Server).
- Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
- Used Databricks to integrate easily with the whole Microsoft stack.
- Experience in configuring, designing, implementing and monitoring Kafka clusters and connectors.
- Used Azure Event Grid as a managed event service that makes it easy to handle events across many different Azure services and applications.
- Used Azure Synapse to manage processing workloads and serve data for BI and predictions.
- Responsible for design and deployment of Spark SQL scripts and Scala shell commands based on functional specifications.
- Implemented scalable microservices to handle concurrency and high traffic. Optimized existing Scala code and improved cluster performance.
- Worked with the data science team on preprocessing and feature engineering, and helped move Machine Learning algorithms into production.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Managed resources and scheduling across the cluster using Azure Kubernetes Service.
- Performed data cleansing and applied transformations using Databricks and Spark data analysis.
- Created ADF pipelines using Linked Services, Datasets and Pipelines to extract, transform and load data to and from various sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse and a write-back tool.
- Used Azure Data Factory, the SQL API and the MongoDB API to integrate data from MongoDB, MS SQL and cloud storage (Blob, Azure SQL DB).
- Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables (see the sketch after this list).
- Created database components like tables, views and triggers using T-SQL to structure and maintain data effectively.
- Used Azure Synapse to bring data warehousing and big data analytics together in a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.
- Worked on Kafka and Spark integration for real-time data processing.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.
- Developed Spark Scala scripts to mine data and performed transformations on large datasets to support real-time insights and reports.
- Supported analytical phases, handled data quality, and improved performance using Scala's higher-order functions, lambda expressions, pattern matching and collections.
- Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
- Provided guidance to the development team working on PySpark as an ETL platform.
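As referenced above, a minimal sketch of the PySpark CSV-to-Hive-ORC load, assuming a single target schema; the ADLS path, database, table, and column names are hypothetical placeholders.

```python
# Hypothetical sketch: read CSV files, cast to a target schema, and append to a
# partitioned Hive table stored as ORC.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv-to-hive-orc-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Read a batch of CSV files with headers; the schema is inferred from the data.
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("abfss://landing@examplestorage.dfs.core.windows.net/feeds/*.csv")
)

# Cast to the columns the target table expects.
curated = raw.selectExpr(
    "CAST(order_id AS STRING)  AS order_id",
    "CAST(amount AS DOUBLE)    AS amount",
    "CAST(load_date AS STRING) AS load_date",
)

# Append into a Hive table stored as ORC, partitioned by load date.
(
    curated.write
    .format("orc")
    .mode("append")
    .partitionBy("load_date")
    .saveAsTable("analytics.orders_orc")
)
```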
Environment: Hadoop, Spark, Hive, Sqoop, HBase, Oozie, Talend, Kafka, Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, AKS), Scala, Python, Cosmos DB, MS SQL, MongoDB, Ambari, PowerBI, Azure DevOps, Microservices, K-Means, KNN, Ranger, Git
Confidential, Englewood, CO
Big Data Engineer
Responsibilities:
- Involved in managing and monitoring the Hadoop cluster using Cloudera Manager.
- Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
- Involved in HBASE setup and storing data into HBASE, which will be used for further analysis.
- Wrote a Python script that automates launching the EMR cluster and configuring the Hadoop applications using boto3 (see the sketch after this list).
- Involved in working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift and EC2 for data processing.
- Experienced in analyzing and optimizing RDDs by controlling partitions for the given data.
- Experienced in writing real-time processing using Spark Streaming with Kafka.
- Implemented Spark using Python and Spark SQL for faster processing of data; worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).
- Wrote Spark SQL and embedded the SQL in Scala files to generate JAR files for submission onto the Hadoop cluster.
- Extensively worked with Avro, Parquet, XML and JSON files; parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark.
- Assisted in cluster maintenance, cluster monitoring, and adding and removing cluster nodes; installed and configured Hadoop, MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Stored and retrieved data from data-warehouses using Amazon Redshift.
- Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
- Converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
- Involved in configuring the Hadoop cluster and load balancing across the nodes.
- Involved in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring and debugging the configuration of multiple nodes using the Hortonworks platform.
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data.
- Created various types of data visualizations using Python and Tableau.
- Automated and monitored the complete AWS infrastructure with Terraform.
- Created data partitions on large data sets in S3 and DDL on partitioned data.
- Used Python and Shell scripting to build pipelines.
- Developed data pipelines using Sqoop, HQL, Spark and Kafka to ingest enterprise message delivery data into HDFS.
- Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig and Hive.
- Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
- Involved in working with Spark on top of YARN/MRv2 for interactive and batch analysis.
- Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Worked with querying data using SparkSQL on top of Spark engine.
- Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
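A minimal boto3 sketch in the spirit of the EMR launch automation mentioned above; the cluster name, instance types, roles, and log bucket are hypothetical placeholders.

```python
# Hypothetical sketch: launch an EMR cluster with Hadoop, Spark, and Hive installed.
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="nightly-etl-cluster",
    ReleaseLabel="emr-5.30.0",
    LogUri="s3://example-emr-logs/",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)

# The cluster id can later be polled with describe_cluster until it reaches WAITING.
print("Launched EMR cluster:", response["JobFlowId"])
```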
Environment: HDFS, Hive, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS, etc.), Spark, Tableau, YARN, Cloudera, Scala, Sqoop, DataStage, SQL, Terraform, Splunk, RDBMS, Python, Elasticsearch, Data Lake, Kerberos, Jira, Confluence, Shell/Perl Scripting, ZooKeeper, NiFi, Ranger, Git, Kafka, CI/CD (Jenkins), Kubernetes.
Confidential, Franklin, TN
Big Data Engineer
Responsibilities:
- Worked on ingesting data into HBase using the HBase shell as well as the HBase client API.
- Experienced with handling administration activities using Cloudera Manager.
- Involved in developing Spark SQL queries and DataFrames, importing data from data sources, performing transformations and read/write operations, and saving the results to an output directory in HDFS.
- Involved in writing optimized Pig scripts along with developing and testing Pig Latin scripts.
- Involved in transforming data from Mainframe tables to HDFS and HBase tables using Sqoop.
- Created custom Solr query segments to optimize search matching.
- Stored the time-series transformed data from the Spark engine, built on top of a Hive platform, in Amazon S3 and Redshift.
- Facilitated deployment of a multi-clustered environment using AWS EC2 and EMR, apart from deploying Docker containers for cross-functional deployment.
- Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
- Created ETL mappings with Talend Integration Suite to pull data from the source, apply transformations, and load data into the target database.
- Implemented the workflows using the Apache Oozie framework to automate tasks.
- Ingested real-time and near-real-time (NRT) streaming data into HDFS using Flume.
- Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data.
- Visualized the results using Tableau dashboards; the Python Seaborn library was used for data interpretation in deployment.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Involved in collecting, aggregating and moving data from servers to HDFS using Flume.
- Imported and exported data from different relational data sources like DB2, SQL Server and Teradata to HDFS using Sqoop.
- Involved in data ingestion into HDFS using Sqoop for full loads and Flume for incremental loads from a variety of sources such as web servers, RDBMS and data APIs.
- Collected data using Spark Streaming from an AWS S3 bucket in near-real time, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS (see the sketch after this list).
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently with time and data availability.
- Automatically scaled up the EMR instances based on the data.
- Imported Bulk Data into HBase Using MapReduce programs.
- Used Scala to store streaming data to HDFS and implemented Spark for faster processing of data.
- Involved in migrating tables from RDBMS into Hive tables using Sqoop and later generating visualizations using Tableau.
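A minimal Structured Streaming sketch of the S3-to-HDFS flow referenced above; the bucket, schema, and output paths are hypothetical placeholders, and the file-source stream stands in for the original Spark Streaming job.

```python
# Hypothetical sketch: treat new JSON objects landing in an S3 prefix as a
# near-real-time stream, aggregate hourly, and persist the results to HDFS.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("s3-stream-sketch").getOrCreate()

schema = (
    StructType()
    .add("event_id", StringType())
    .add("amount", DoubleType())
    .add("event_time", TimestampType())
)

events = (
    spark.readStream
    .schema(schema)
    .json("s3a://example-ingest-bucket/events/")
)

# Watermark the event time so the append-mode file sink can finalize windows.
hourly = (
    events
    .withWatermark("event_time", "1 hour")
    .groupBy(F.window("event_time", "1 hour"))
    .agg(F.sum("amount").alias("total_amount"))
)

query = (
    hourly.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "hdfs:///data/curated/hourly_totals")
    .option("checkpointLocation", "hdfs:///checkpoints/hourly_totals")
    .start()
)
query.awaitTermination()
```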
Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, AWS, YARN, Hive, Pig, Sqoop, Oozie, Tableau, Java, Solr.
Confidential
Hadoop Developer
Responsibilities:
- Developed solutions to process data into HDFS.
- Analyzed the data using MapReduce, Pig and Hive and produced summary results from Hadoop for downstream systems.
- Installed and configured Hadoop, MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Used Pig as an ETL tool to do transformations, event joins and some pre-aggregations before storing the data in HDFS.
- Worked extensively with Hive DDLs and Hive Query Language (HQL).
- Developed UDF, UDAF and UDTF functions and implemented them in Hive queries.
- Implemented Sqoop for large dataset transfers between Hadoop and RDBMS.
- Created MapReduce jobs to convert periodic XML messages into partitioned Avro data.
- Used Sqoop widely to import data from various systems/sources (like MySQL) into HDFS (see the sketch after this list).
- Created components like Hive UDFs for functionality missing in Hive for analytics.
- Provided cluster coordination services through ZooKeeper.
- Developed a data pipeline using Flume, Sqoop and Pig to extract data from weblogs and store it in HDFS.
- Used Sqoop to import and export data from HDFS to RDBMS and vice-versa.
- Created HBase tables to load large sets of structured data.
- Managed and reviewed Hadoop log files.
- Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Assisted in cluster maintenance, cluster monitoring, adding and removing cluster nodes, and troubleshooting.
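A minimal Python sketch wrapping a Sqoop import of the kind described above; the JDBC URL, credentials file, table, and target directory are hypothetical placeholders.

```python
# Hypothetical sketch: invoke a Sqoop import from MySQL into HDFS and fail the
# job if Sqoop exits with a non-zero status.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl_user/.db_password",  # keep secrets off the command line
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
    "--as-avrodatafile",
]

subprocess.run(sqoop_cmd, check=True)
```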
Environment: Hadoop, HDFS, Map Reduce, Hive, Pig, Sqoop, HBase, Shell Scripting, Oozie, Oracle 11g.
Confidential
BI Developer
Responsibilities:
- Involved in various phases of the Software Development Life Cycle (SDLC), such as requirement gathering, data modeling, analysis, and architecture design and development for the project.
- Formulated the strategy, development and implementation of executive and line-of-business dashboards through SSRS.
- Developed SSIS pipelines to automate ETL activities and migrate SQL server data to Azure SQL database.
- Built SSIS packages and scheduled jobs to migrate data from disparate sources into SQL server and vice versa.
- Created SSIS packages with which data from different sources was loaded daily to create and maintain a centralized data warehouse. Made the packages dynamic so they fit the environment.
- Developed data profiling, munging and missing-value imputation scripts in Python on raw data as part of understanding the data and its structure (see the sketch after this list).
- Managed internal sprints, release schedules and milestones through JIRA. Functioned as the primary point of contact for the client's Business Analysts, Directors and Data Engineers for project communications.
- Involved in the design, development and modification of PL/SQL stored procedures, functions, packages and triggers to implement business rules in the application.
- Developed ETL processes to load data from flat files, SQL Server and Access into the target Oracle database, applying business logic in transformation mappings to insert and update records when loaded.
- Good Informatica ETL development experience in an offshore and onsite model; involved in ETL code reviews and testing ETL processes.
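A minimal pandas sketch of the data profiling and missing-value imputation described above; the file names and imputation rules (median for numeric columns, mode otherwise) are hypothetical placeholders.

```python
# Hypothetical sketch: profile a raw extract and impute missing values.
import pandas as pd

df = pd.read_csv("raw_extract.csv")

# Profiling: per-column dtype, null counts/percentages, and distinct counts.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_count": df.isna().sum(),
    "null_pct": df.isna().mean().round(3),
    "distinct": df.nunique(),
})
print(profile)

# Imputation: median for numeric columns, most frequent value otherwise.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        mode = df[col].mode()
        if not mode.empty:
            df[col] = df[col].fillna(mode.iloc[0])

df.to_csv("cleaned_extract.csv", index=False)
```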
Environment: MSBI, SSIS, SSRS, SSAS, Informatica, ETL, PL/SQL, SQL Server 2000, Ant, CVS, Hibernate, Eclipse, Linux