Senior Big Data Engineer Resume
St. Louis, Missouri
SUMMARY
- Overall 8+ years of professional experience in Information Technology, with expertise in Big Data on the Hadoop framework and in analysis, design, development, testing, documentation, deployment, and integration using SQL and Big Data technologies.
- Excellent hands-on experience in business requirement analysis and in designing, developing, testing, and maintaining complete data management and processing systems, process documentation, and ETL technical and design documents.
- Experience importing and exporting data with Sqoop between HDFS and relational database systems, and loading it into partitioned Hive tables.
- Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
- Responsible for data engineering functions including, but not limited to: data extraction, transformation, loading, and integration in support of enterprise data infrastructure - data warehouses, operational data stores, and master data management.
- Good understanding of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
- Solid experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files. Proficient with columnar file formats such as RC, ORC, and Parquet, with a good understanding of compression techniques used in Hadoop processing, such as Gzip, Snappy, and LZO.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Experience using Kafka and Kafka brokers with Spark Streaming to process live data streams.
- Good understanding and knowledge of NoSQL databases such as MongoDB, HBase, and Cassandra, as well as Azure and PostgreSQL databases.
- Experienced in working with Amazon Web Services (AWS) using EC2 for computing and S3 as storage mechanism.
- Strong experience in writing scripts using the Python, PySpark, and Spark APIs for analyzing data (a minimal PySpark sketch follows this summary).
- Extensively used Python libraries including PySpark, Pytest, PyMongo, Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
- Expertise working with AWS cloud services such as EMR, S3, Redshift, and CloudWatch for big data development.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
- Expertise in designing complex mappings, with strong skills in performance tuning and in implementing slowly changing dimension tables and fact tables.
- Developed custom Kafka producers and consumers for publishing to and subscribing to Kafka topics.
- Ability to work effectively in cross-functional team environments; excellent communication and interpersonal skills.
- Worked on Spark SQL: created DataFrames by loading data from Hive tables, prepared the data, and stored it in AWS S3.
- Experience with operating systems: Linux (Red Hat) and UNIX.
- Experience in complete project life cycle (design, development, testing and implementation) of Client Server and Web applications.
- Maintained BigQuery, PySpark, and Hive code by fixing bugs and providing the enhancements required by business users.
- Good working experience with Spark (Spark Streaming, Spark SQL) using Scala and Kafka; worked on reading multiple data formats from HDFS using Scala.
- Experience with NoSQL databases, including table row-key design and loading and retrieving data for real-time processing, with performance improvements based on data access patterns.
- Extensive experience in Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and Map Reduce concepts.
- Experience in analyzing the data using HQL, Pig Latin, and custom Map Reduce programs in Java.
- Experience in integrating Apache Kafka with Apache Storm and created Storm data pipelines for real time processing.
- Developed extraction mappings to load data from Source systems to ODS to Data Warehouse.
- Involved in conceptual, logical, and physical data modeling and used star schemas in designing the data warehouse.
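A minimal, illustrative PySpark sketch of the Hive/SQL-to-DataFrame conversion and Python/PySpark scripting referenced above; the table, column, and bucket names (sales_db.orders, region, amount, s3a://example-bucket) are hypothetical placeholders, and a configured Hive metastore and S3 connector are assumed.

```python
# Minimal PySpark sketch: rewriting a Hive-style SQL aggregation as DataFrame
# transformations. All table/column/path names are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-dataframe-example")
    .enableHiveSupport()        # assumes a Hive metastore is configured
    .getOrCreate()
)

# Equivalent of:
#   SELECT region, SUM(amount) AS total_amount
#   FROM sales_db.orders
#   WHERE order_status = 'COMPLETE'
#   GROUP BY region
orders_df = spark.table("sales_db.orders")
totals_df = (
    orders_df
    .filter(F.col("order_status") == "COMPLETE")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)

# Persist the prepared data to S3 (assumes the s3a connector is configured).
totals_df.write.mode("overwrite").parquet("s3a://example-bucket/prep/region_totals/")
```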
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Flume, Sqoop, Impala, Spark, Parquet, Snappy, ORC, Ambari, Cassandra, Tez, AWS S3, EC2 and EMR, Oracle, SQL Server, SNS, NiFi, ADL, ADF, Azure SQL.
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR and Apache, AWS
Languages: Java, SQL, Scala & Python
Development / Build Tools: Eclipse, Maven and IntelliJ
Scheduling and Automation: Shell scripts, Oozie workflows, Cron, and TWS (Tivoli Workload Scheduler) for automation and job scheduling
DB Languages: MySQL and PL/SQL
Cloud Technologies: MS Azure and AWS
RDBMS: Oracle, MySQL, MS SQL, and DB2
NoSQL Databases: Cassandra, MongoDB, HBase
Operating systems: UNIX, LINUX and Windows Variants
PROFESSIONAL EXPERIENCE
Confidential, St. Louis, Missouri
Senior Big Data Engineer
Responsibilities:
- Worked in Azure environment for development and deployment of Custom Hadoop Applications.
- Responsible for managing data coming from different sources through Kafka.
- Worked with big data technologies such as Spark, Scala, Hive, and Hadoop clusters (Cloudera platform).
- Built data pipelines with Data Fabric jobs using Sqoop, Spark, Scala, and Kafka; in parallel, worked on the database side with Oracle and MySQL Server on source-to-target data design.
- Wrote Spark programs to move data from storage input locations to output locations, performing data loading, validation, and transformation along the way.
- Used Scala functions and data structures (arrays, lists, maps) for better code reusability.
- Performed unit testing based on the development work.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Primarily involved in Data Migration process using Azure by integrating with GitHub repository and Jenkins.
- Used Spark DataFrame operations to perform required validations and analytics on the Hive data. Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
- Developed workflow in Oozie to manage and schedule jobs on Hadoop cluster to trigger daily, weekly and monthly batch cycles.
- Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
- Deployed the initial Azure components like Azure Virtual Networks, Azure Application Gateway, Azure Storage and Affinity groups.
- Used the Hadoop Resource Manager to monitor the jobs run on the Hadoop cluster.
- Monitored Spark cluster using Log Analytics and Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data warehouse and improved the query performance.
- Involved in developing data ingestion pipelines on Azure HDInsight Spark clusters using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
- Worked extensively on Azure data factory including data transformations, Integration Runtimes, Azure Key Vaults, Triggers and migrating data factory pipelines to higher environments using ARM Templates.
- Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using FLUME and SQOOP, and performed structural modifications using Map Reduce, HIVE.
- Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL queries for data aggregation and querying, and wrote data back into RDBMS tables through Sqoop.
- Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV, and ORC, using compression codecs such as Gzip, Snappy, and LZO.
- Implemented Data Lake to consolidate data from multiple source databases such as Exadata, Teradata using Hadoop stack technologies SQOOP, HIVE/HQL.
- Developed real-time streaming applications integrated with Kafka and Nifi to handle large volume and velocity data streams in a scalable, reliable and fault tolerant manner for Confidential Campaign management analytics.
- Designed highly efficient data model for optimizing large-scale queries utilizing Hive complex datatypes and Parquet file format.
- Used Cloudera Manager for continuous monitoring and management of the Hadoop cluster, working with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
- Developed data pipelines using Sqoop, Pig and Hive to ingest customer member data, clinical, biometrics, lab and claims data into HDFS to perform data analytics.
- Analyzed Teradata procedures and imported data from Teradata into a MySQL database, then developed HiveQL queries, including UDFs for functionality not available among Hive's default functions.
- Provided design recommendations and thought leadership to sponsors/stakeholders that improved review processes and resolved technical problems. Managed and reviewed Hadoop log files.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (an illustrative sketch follows this list).
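A hedged sketch of the load-validate-transform pattern described above, assuming a Spark cluster (HDInsight or Databricks) with the ADLS Gen2 connector configured; the storage account, container, paths, and column names are hypothetical placeholders rather than project specifics.

```python
# Illustrative PySpark batch job: load raw files from Azure storage, apply
# basic validations, and write curated Parquet output. Paths and columns are
# placeholders; ADLS Gen2 credentials/connector configuration is assumed.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adls-batch-etl").getOrCreate()

source_path = "abfss://raw@examplestorage.dfs.core.windows.net/claims/"
target_path = "abfss://curated@examplestorage.dfs.core.windows.net/claims/"

raw_df = spark.read.option("header", "true").csv(source_path)

# Validations: required keys present, duplicates removed, load date stamped.
valid_df = (
    raw_df
    .filter(F.col("claim_id").isNotNull() & F.col("member_id").isNotNull())
    .dropDuplicates(["claim_id"])
    .withColumn("load_date", F.current_date())
)

valid_df.write.mode("overwrite").partitionBy("load_date").parquet(target_path)
```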
Environment: Spark, PySpark, Spark SQL, Kafka, MapReduce, Python, Hadoop, Hive, Pig, Azure SQL DW, Databricks, Azure Synapse, Azure Data Lake, ARM, Azure HDInsight, Blob Storage, Oracle 12c, Cassandra, Git, Zookeeper, Oozie.
Confidential, King of Prussia, PA
Big Data Engineer
Responsibilities:
- Worked on ingesting data, applying cleansing and transformations, and leveraging AWS Lambda, AWS Glue, and Step Functions.
- Developed Hive UDFs to incorporate external business logic into Hive scripts, and developed dataset join scripts using Hive join operations.
- Created various Hive external tables and staging tables and joined them as per requirements; implemented static partitioning, dynamic partitioning, and bucketing.
- Developed custom Kafka producers and consumers for publishing to and subscribing to Kafka topics (see the sketch following this list).
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Designed MapReduce and YARN flows, wrote MapReduce scripts, and performed performance tuning and debugging.
- Stored data in AWS S3 (used like HDFS) and ran EMR programs on the stored data.
- Used the AWS-CLI to suspend an AWS Lambda function. Used AWS CLI to automate backups of ephemeral data-stores to S3 buckets, EBS.
- Designed the data models to be used in data intensive AWS Lambda applications which are aimed to do complex analysis creating analytical reports for end-to-end traceability, lineage, definition of Key Business elements from Aurora.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the data received from Kafka.
- Developed workflow in Oozie to automate the tasks of loading the data into Nifi and pre-processing with Pig.
- Worked on Apache NiFi to decompress and move JSON files from local storage to HDFS.
- Worked with sources such as Access, Excel, CSV, Oracle, and flat files using connectors, tasks, and transformations provided by AWS Data Pipeline.
- Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
- Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and loading it into data warehouse servers.
- Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management
- Strong understanding of AWS components such as EC2 and S3
- Designed and implemented configurable data delivery pipeline for scheduled updates to customer facing data stores built with Python
- Implemented a continuous delivery pipeline with Docker, GitHub, and AWS.
- Built performant, scalable ETL processes to load, cleanse and validate data.
- Experience in working with different join patterns and implemented both Map and Reduce Side Joins.
- Wrote Flume configuration files for importing streaming log data into HBase with Flume.
- Experience in setting up the whole application stack, and in setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Implemented a variety of AWS computing and networking services to meet application needs.
- Wrote HiveQL as per requirements, processed data in the Spark engine, and stored it in Hive tables.
- Imported existing datasets from Oracle into the Hadoop system using Sqoop.
- Brought data from various sources into Hadoop and Cassandra using Kafka.
- Experienced in using Tidal Enterprise Scheduler and Oozie operational services for coordinating the cluster and scheduling workflows.
- Applied Spark Streaming for real-time data transformation.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation for the BI team using Tableau.
- Implemented Composite server for data virtualization needs and created multiple views for restricted data access using a REST API.
- Implemented machine learning algorithms in Python to predict the quantity a user might order for a specific item, enabling automatic suggestions, using Kinesis Firehose and an S3 data lake.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
- Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.
- Allotted permissions, policies and roles to users and groups using AWS Identity and Access Management (IAM).
- Worked on the development of tools which automate AWS server provisioning, automated application deployments, and implementation of basic failover among regions through AWS SDK’s.
- Loaded the aggregated data onto Oracle from Hadoop environment using Sqoop for reporting on the dashboard.
- Created base Hive scripts for analyzing requirements and processing data, designing the cluster to handle large volumes of data and to cross-examine data loaded by Hive and MapReduce jobs.
- When data was not present on the HDFS cluster, Sqooped the data from Netezza onto the HDFS cluster.
- Transferred data from AWS S3 to AWS Redshift using Informatica.
- Worked on Hive UDFs; had to stop the task midway due to security privilege restrictions.
- Wrote MapReduce code in Python to eliminate certain security issues in the data.
- Synchronized both unstructured and structured data using Pig and Hive from a business perspective.
- Ingested data into this application using Hadoop technologies such as Pig and Hive.
- Integrated Kafka data sources (Producer and Consumer APIs) for stream processing in Spark over the AWS network.
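A hedged sketch of the custom Kafka producer/consumer pattern mentioned above, assuming the kafka-python client and a broker reachable at localhost:9092; the topic name and record fields are hypothetical placeholders.

```python
# Illustrative kafka-python producer/consumer pair for a topic; broker address,
# topic name, and record fields are placeholders, not project values.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)
producer.send("clickstream-events", {"user_id": 42, "action": "page_view"})
producer.flush()  # make sure the message is delivered before exiting

consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop iterating if no messages arrive
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    # Downstream processing (e.g. handing records to Spark) would go here.
    print(message.value)
```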
Environment: Hadoop (HDFS, MapReduce), Scala, Databricks, PostgreSQL, Spark, Spark SQL, Impala, Hive, MongoDB, Pig, DevOps, HBase, Oozie, Hue, Sqoop, Flume, Oracle, AWS services (Lambda, EMR, Auto Scaling), MySQL, SQL Server, Python.
Confidential, Columbus, OH
Data Engineer
Responsibilities:
- Experience in job management using the Fair Scheduler; developed job processing scripts using Oozie workflows.
- Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS and VPC.
- Responsible for implementing ETL process through Kafka-Spark-HBase Integration as per the requirements of customer facing API
- Used Spark and Hive to implement the transformations needed to join daily ingested data with historic data.
- Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real time.
- Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory tuning.
- Helped maintain and troubleshoot UNIX and Linux environment
- Optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Built pipelines to move hashed and un-hashed data from XML files to Data lake.
- Expertise in analyzing data using Pig scripting, Hive queries, Spark (Python), and Impala.
- Experienced in writing live Real-time Processing using Spark Streaming with Kafka.
- Created Cassandra tables to store various data formats of data coming from different sources.
- Involved in importing the real time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
- Developed Pig program for loading and filtering the streaming data into HDFS using Flume.
- Experienced in handling data from different datasets, join them and pre-process using Pig join operations.
- Developed HBase data model on top of HDFS data to perform real time analytics using Java API.
- Created Source to Target Mappings (STM) for the required tables by understanding the business requirements for the reports
- Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR, performing the necessary transformations based on the STMs developed (a minimal sketch follows this list).
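A minimal PySpark sketch, under stated assumptions, of applying a simple source-to-target mapping (STM) on EMR: renaming and casting source columns to their target definitions before loading. The mapping, S3 paths, and column names are hypothetical, not the actual STMs.

```python
# Illustrative STM-driven transformation: rename/cast source columns to target
# names and types. Mapping, paths, and columns are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stm-transform").getOrCreate()

# source column -> (target column, target type), as a simplified STM
stm = {
    "cust_nm": ("customer_name", "string"),
    "ord_dt":  ("order_date", "date"),
    "ord_amt": ("order_amount", "decimal(18,2)"),
}

source_df = spark.read.parquet("s3://example-bucket/landing/orders/")
target_df = source_df.select(
    *[F.col(src).cast(dtype).alias(tgt) for src, (tgt, dtype) in stm.items()]
)
target_df.write.mode("append").parquet("s3://example-bucket/conformed/orders/")
```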
Environment: Spark, Kafka, Hadoop, HDFS, Spark SQL, AWS, Python, MapReduce, Pig, Hive, Oracle 11g, MySQL, MongoDB, HBase, Oozie, Zookeeper, Tableau.
Confidential
Data Engineer
Responsibilities:
- As a Data Engineer, analyzed and evaluated business rules, data sources, and data volumes, and produced estimates, plans, and execution strategies to ensure the architecture met business requirements.
- Managed data imported from different sources, performed transformations using Hive and MapReduce, and loaded the data into HDFS.
- Recommended improvements and modifications to existing data and ETL pipelines.
- Supported other data engineers by providing mentoring, technical assistance, troubleshooting, and alternative development solutions.
- Experience in writing stored procedures and complex SQL queries using relational databases like Oracle, SQL Server and MySQL.
- Developed a Python script to transfer data from on-premises systems to AWS S3.
- Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
- Used Hive to implement data warehouse and stored data into HDFS. Stored data into Hadoop clusters which are set up in AWS EMR.
- Performed Data Preparation by using Pig Latin to get the right data format needed.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Responsible for data extraction and data ingestion from different data sources into the Hadoop Data Lake by creating ETL pipelines using Pig and Hive.
- Involved in designing the row key in HBase to store Text and JSON as key values in HBase table and designed row key in such a way to get/scan it in a sorted order.
- Created Hive schemas using performance techniques such as partitioning and bucketing (illustrated by the sketch following this list).
- Used Hadoop YARN to perform analytics on data in Hive.
- Developed and maintained batch data flow using HiveQL and Unix scripting
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.
- Worked extensively with Sqoop for importing metadata from Oracle.
- Involved in creating Hive tables and loading and analyzing data using hive queries.
- Developed Hive queries to process the data and generate the data cubes for visualizing.
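A minimal sketch, assuming a Hive metastore is available to Spark, of writing a partitioned and bucketed table (Spark-managed bucketing via saveAsTable) in the spirit of the partitioning/bucketing techniques above; the database, table, paths, and columns are hypothetical placeholders.

```python
# Illustrative partitioned + bucketed table write; all names are placeholders,
# and the source data is assumed to contain event_date and user_id columns.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-bucket-example")
    .enableHiveSupport()          # assumes a reachable Hive metastore
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

events_df = spark.read.json("hdfs:///data/raw/events/")

(
    events_df.write
    .mode("overwrite")
    .partitionBy("event_date")    # partition pruning on date filters
    .bucketBy(32, "user_id")      # cluster rows to speed up joins/aggregations
    .sortBy("user_id")
    .format("parquet")
    .saveAsTable("analytics.events")
)
```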
Environment: Hadoop, MapReduce, AWS, HBase, JSON, Spark, Kafka, Hive, Pig, YARN, Spark Core, Spark SQL, Scala, Python, Java, Sqoop, Impala, Oracle, Linux, Oozie.
Confidential
Data Engineer
Responsibilities:
- Implemented Avro and Parquet data formats for Apache Hive computations to handle custom business requirements (see the sketch following this list).
- Designed, implemented, and deployed a series of custom parallel algorithms for various customer-defined metrics and unsupervised learning models within a customer's existing Hadoop/Cassandra cluster.
- Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Developed MapReduce programs for data analysis and data cleaning.
- Extensively used SSIS transformations such as Lookup, Derived column, Data conversion, Aggregate, Conditional split, SQL task, Script task and Send Mail task etc.
- Performed data cleansing, enrichment, mapping tasks and automated data validation processes to ensure meaningful and accurate data was reported efficiently.
- Implemented advanced procedures like text analytics and processing using the in-memory computing capabilities like Apache Spark written in Scala.
- Implemented Apache Pig scripts to load data from and store data into Hive.
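A hedged PySpark sketch of the Avro-to-Parquet cleansing and validation step suggested above, assuming the spark-avro package is on the classpath; the HDFS paths, column names, and validation rule are hypothetical placeholders.

```python
# Illustrative Avro -> cleanse/validate -> Parquet step; requires the
# spark-avro package (e.g. submitted with --packages for spark-avro).
# Paths, columns, and the validation rule are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("avro-to-parquet-cleanse").getOrCreate()

raw_df = spark.read.format("avro").load("hdfs:///data/raw/transactions/")

clean_df = (
    raw_df
    .dropDuplicates(["transaction_id"])                 # de-duplicate records
    .filter(F.col("amount") > 0)                        # simple validation rule
    .withColumn("customer_name", F.trim(F.col("customer_name")))
)

clean_df.write.mode("overwrite").parquet("hdfs:///data/clean/transactions/")
```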
Environment: Hive, Hadoop, Cassandra, Spark, Pig, Sqoop, Oozie, Scala, Python, MS Office.