
Senior Big Data Engineer Resume


Foster City, CA

SUMMARY

  • 7+ years of IT experience as a data engineering professional with solid foundational skills and a proven track record of implementations across a variety of data platforms.
  • Expertise in Business Intelligence, data warehousing, ETL, and Big Data technologies.
  • Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Hands-on experience with tools like Pig and Hive for data analysis, Sqoop for data ingestion, Oozie for scheduling, and Zookeeper for coordinating cluster resources.
  • Extensive experience with real-time streaming technologies such as Spark, Storm, and Kafka.
  • Optimized MapReduce programs using combiners, partitioners, and custom counters to deliver the best results.
  • Migrated Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR and Qubole.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively; designed and developed POCs using Scala, Spark SQL, and the MLlib libraries.
  • Responsible for wide-ranging data ingestion using Sqoop and HDFS commands; accumulated partitioned data in various storage formats like text, JSON, and Parquet. Involved in loading data from the Linux file system to HDFS.
  • Developed ETL pipelines in and out of the data warehouse using a combination of Python and SnowSQL.
  • Strong knowledge of data preparation, data modeling, and data visualization using Power BI; experienced in developing various reports and dashboards with visualizations in Tableau.
  • Keen awareness of the technology stack that Google Cloud Platform (GCP) offers.
  • Analyzed data and provided insights with R programming and Python pandas.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Experience in creating ETL mappings using Informatica to move data from multiple sources like flat files and Oracle into a common target area such as a data warehouse.
  • Implemented a generalized solution model using AWS SageMaker.
  • Experienced in big data analysis and developing data models using Hive, Pig, MapReduce, and SQL, with strong data architecting skills for designing data-centric solutions.
  • Experience working with data modeling tools like Erwin and ER/Studio, including RDBMS-specific features.
  • Knowledge of working on Proofs of Concept (PoCs) and gap analysis; gathered necessary data for analysis from different sources and prepared it for exploration using data munging and Teradata.
  • Good knowledge of integrating Spark Streaming with Kafka for real-time processing of streaming data (see the PySpark sketch after this list).
  • Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig in Python.
  • Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.
  • Good understanding of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
  • Experience in building Power BI reports on the Azure cloud platform.
  • Configured Zookeeper, Kafka, and Logstash clusters for data ingestion and Elasticsearch performance optimization; worked on Kafka for live streaming of data.
  • Designed and implemented large-scale pub-sub message queues using Apache Kafka.
  • Knowledge of ETL methods; experienced in collecting log data and JSON data into HDFS using Flume and processing the data using Hive/Pig.
  • Installation, configuration, and administration experience with Big Data platforms such as Cloudera Manager (Cloudera) and MCS (MapR).
  • Good experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Experience in importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables.
  • Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
  • Extensive knowledge of writing Hadoop jobs for data analysis per business requirements using Hive; worked on HiveQL queries for data extraction and join operations, wrote custom UDFs as required, and have good experience optimizing Hive queries.
  • Responsible for designing and developing data ingestion from Kroger using Apache NiFi/Kafka.
  • Used Zookeeper to provide coordination services to the cluster.
  • Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files. Mastered the use of columnar file formats like RC, ORC, and Parquet. Good understanding of the compression techniques used in Hadoop processing, such as Gzip, Snappy, and LZO.
  • Hands-on experience with NoSQL databases such as MongoDB, HBase, and Cassandra, including their functionality and implementation, along with Azure data stores and PostgreSQL.
  • Experienced in working with Amazon Web Services (AWS), using EC2 for compute and S3 as the storage mechanism.
  • Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
  • Experienced in working within SDLC, Agile, and Waterfall methodologies.
  • Excellent experience in designing and developing Enterprise Applications for J2EE platform using Servlets, JSP, Struts, Spring, Hibernate and Web services.
  • Able to work across both the GCP and Azure clouds in parallel and coherently.
  • Intermediate proficiency with performance testing tools such as HP Quality Center, JConsole, and VisualVM.
  • Hands-on experience with the Spark architecture and its integrations such as the Spark SQL, DataFrames, and Datasets APIs.
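
The Spark Streaming and Kafka bullets above describe a common pattern: consume JSON events from Kafka with Spark and land them as partitioned Parquet on HDFS/S3. Below is a minimal PySpark sketch of that pattern, not code from any specific engagement; the broker address, topic, schema, and paths are hypothetical placeholders, and it assumes the spark-sql-kafka connector is available on the cluster.

```python
# Minimal sketch: Kafka -> Spark Structured Streaming -> partitioned Parquet.
# Broker, topic, schema and paths below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the raw Kafka stream; the value column arrives as bytes.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
       .option("subscribe", "events")                       # placeholder topic
       .load())

# Parse the JSON payload and derive a partition column.
events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withColumn("event_date", to_date(col("event_time"))))

# Write the stream as date-partitioned Parquet files on HDFS/S3.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/events")               # placeholder output path
         .option("checkpointLocation", "/chk/events")  # required for streaming sinks
         .partitionBy("event_date")
         .start())

query.awaitTermination()
```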

TECHNICAL SKILLS

Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase

Programming Languages: Java, PL/SQL, SQL, Python, Scala, PySpark, C, C++

Cluster Mgmt & Monitoring: CDH 4, CDH 5, Hortonworks Ambari 2.5

Databases: MySQL, SQL Server, Oracle 12c, MS Access

NoSQL Databases: MongoDB, Cassandra, HBase

Workflow mgmt. tools: Oozie, Apache Airflow

Visualization & ETL tools: Tableau, BananaUI, D3.js, Informatica, Talend, GCP, Power BI.

Cloud Technologies: AWS and Microsoft Azure

IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ

Version Control Systems: Git, SVN

Operating Systems: Unix, Linux, Windows

PROFESSIONAL EXPERIENCE

Confidential, Foster City, CA

Senior Big Data Engineer

Responsibilities:

  • Developed Spark scripts using Scala and Java as per requirements.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop.
  • Developed a Spark Streaming application to read raw packet data from Kafka topics, format it as JSON, and push it back to Kafka for future use cases.
  • Migrated an existing on-premises application to AWS.
  • Used CloudWatch Logs to move application logs to S3 and created alarms based on exceptions raised by applications.
  • Developed a high-fidelity Spark/Kafka streaming application that consumes JSON-format packet messages and returns geolocation data to the mobile application for the requested IMEI.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Used Zookeeper to provide coordination services to the cluster. Experienced in managing and reviewing Hadoop log files.
  • Implemented a variety of AWS computing and networking services to meet application needs.
  • Designed a data analysis pipeline in Python, using Amazon Web Services such as S3, EC2 and Elastic Map Reduce.
  • Implemented large scale technical solutions using Object Oriented Design and Programming concepts using Python
  • Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
  • Designed and implemented Sqoop incremental and delta imports on tables without primary keys or dates from Teradata and SAP HANA, appending directly into the Hive warehouse.
  • Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
  • Designing production-level data pipelines using Kafka
  • Proficiency in working on Event-Driven Architecture
  • Experience moving data between GCP and Azure using Azure Data Factory.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Involved in the assessment of tools and the technology stack, including Databricks, for the new platform.
  • Developed automation regression scripts for validation of the ETL process between multiple databases such as AWS Redshift, Oracle, MongoDB, Confidential -SQL, and SQL Server using Python.
  • Used Oozie to automate data loading into the Hadoop Distributed File System.
  • Designing and Developing Oracle PL/SQL and Shell Scripts, Data Import/Export, Data Conversions and Data Cleansing.
  • Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig in Python.
  • Worked with relational database systems (RDBMS) such as Oracle and with database systems like HBase.
  • Involved in writing Confidential -SQL; worked on SSIS, SSAS, data cleansing, data scrubbing, and data migration.
  • Worked on dimensional and relational data modeling using star and snowflake schemas, OLTP/OLAP systems, and conceptual, logical, and physical data modeling using Erwin.
  • Developed Python programs and Excel functions using VBScript to move and transform data.
  • Used Power BI to develop a data analysis prototype and used Power View to visualize reports.
  • Published Power BI reports to the required organizations and made Power BI dashboards available in web clients and mobile apps.
  • Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
  • Optimized the Hive queries by setting different combinations of Hive parameters.
  • Developed shell scripts for running Hive scripts in Hive and Impala.
  • Responsible for developing data pipeline with Amazon AWS to extract the data from weblogs and store in MongoDB.
  • Performed data ingestion using Sqoop, Apache Kafka, Spark Streaming, and Flume.
  • Designed and implemented partitioning (static and dynamic) and buckets in Hive (see the partitioning sketch after this list).
  • Developed multiple POCs using PySpark, deployed them on the YARN cluster, and compared the performance of Spark with Hive.
  • Worked on Cluster co-ordination services through Zookeeper.
  • Created and modified shell scripts for scheduling various data cleansing scripts and ETL load processes.
  • Worked extensively on AWS Components such as Elastic Map Reduce (EMR)
  • Started working with Databricks on AWS for storage and handling of terabytes of data for customer BI reporting tools.
  • Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Worked with Apache NiFi to develop custom processors for processing and distributing data among cloud systems. Created a new CFT, validated the IP addresses in Lambda, ran the Spark master, and destroyed the old CFT stack in Dev, QA, and Prod.
  • Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop, and Spark jobs.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios
  • Installed application on AWS EC2 instances and configured the storage on S3 buckets.
  • Stored data in AWS S3, much like HDFS, and ran EMR programs on the stored data.
  • Used the AWS CLI to suspend an AWS Lambda function and to automate backups of ephemeral data stores to S3 buckets and EBS (see the boto3 sketch after this list).
  • Used Databricks to convert files into Parquet with Scala commands and write them back to S3.
  • Participated in all Sprint Planning and Scrum meetings as part of Agile Methodology
  • Coordinated with the offshore team, guiding them to deliver their components on time.
  • Multithreaded the RESTful web services (JAX-RS) used to import location files into a SaMS system.
  • Implemented the RESTful web services (JAX-RS) with Hibernate to import location files into a SaMS system.
  • Created ODBC/JDBC connectivity to the Databricks platform and executed the MicroStrategy reports.
  • Used the Hibernate framework for the persistence layer; involved in writing stored procedures for data retrieval, storage, and updates in the Oracle database using Hibernate.
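
As referenced in the Hive partitioning bullet above, here is a minimal PySpark/Hive sketch contrasting static and dynamic partition inserts. The database, table, and column names are hypothetical placeholders, and bucketing (a CLUSTERED BY ... INTO n BUCKETS clause) is omitted to keep the example small; it illustrates the technique rather than any production job.

```python
# Static vs. dynamic partition inserts into a Hive table from PySpark.
# Table and column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Dynamic partitioning must be enabled explicitly.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Partitioned target table (bucketing would add a CLUSTERED BY clause).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_part (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    STORED AS ORC
""")

# Static partition: the partition value is fixed in the statement.
spark.sql("""
    INSERT INTO TABLE sales_part PARTITION (sale_date = '2021-01-01')
    SELECT order_id, amount FROM staging_sales WHERE sale_date = '2021-01-01'
""")

# Dynamic partition: Hive derives the partition value from the last column.
spark.sql("""
    INSERT INTO TABLE sales_part PARTITION (sale_date)
    SELECT order_id, amount, sale_date FROM staging_sales
""")
```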
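
The AWS CLI bullet above mentions suspending a Lambda function and backing up ephemeral data stores to S3 and EBS. The sketch below shows a boto3 equivalent of those steps, under the assumption that "suspending" means reserving zero concurrent executions; the function name, bucket, file path, and volume ID are hypothetical placeholders.

```python
# boto3 sketch of the CLI steps: throttle ("suspend") a Lambda function and
# back up ephemeral data to S3 and EBS. All resource names are placeholders.
import boto3

lambda_client = boto3.client("lambda")
s3 = boto3.client("s3")
ec2 = boto3.client("ec2")

# Suspend the function by reserving zero concurrent executions, the boto3
# analogue of `aws lambda put-function-concurrency --reserved-concurrent-executions 0`.
lambda_client.put_function_concurrency(
    FunctionName="ingest-handler",          # placeholder function name
    ReservedConcurrentExecutions=0,
)

# Back up an ephemeral local data store to S3.
s3.upload_file("/tmp/cache.db", "backup-bucket", "backups/cache.db")

# Snapshot an EBS volume as a point-in-time backup.
ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",       # placeholder volume id
    Description="nightly backup of ephemeral data store",
)
```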

Environment: Hadoop YARN, MapReduce, HBase, Spark Core, Spark SQL, Scala, Python, Java, Hive, Sqoop, Impala, Oracle, Kafka, Linux, Git, Oozie, Power BI, GCP.

Confidential, Plano, TX

Big Data Engineer

Responsibilities:

  • Involved in the complete Big Data flow of the application, from data ingestion upstream to HDFS through processing and analyzing the data in HDFS.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool, and in the reverse direction.
  • Configured Flume to extract the data from the web server output files to load into HDFS.
  • Used Flume to collect, aggregate and store the web log data from different sources like web servers, mobile and network devices and pushed into HDFS.
  • Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the Cosmos activity.
  • Demonstrated expert-level technical capabilities in Azure Batch and interactive solutions, Azure Machine Learning solutions, and operationalizing end-to-end Azure cloud analytics solutions.
  • Designed end-to-end scalable architecture to solve business problems using various Azure components such as HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
  • Designed changes to transform current Hadoop jobs to HBase.
  • Partnered with ETL developers to ensure that data is well cleaned and the data warehouse is up to date for reporting purposes, using Pig.
  • Responsible for data services and data movement infrastructures
  • Experienced in ETL concepts, building ETL solutions and Data modeling
  • Worked on architecting the ETL transformation layers and writing spark jobs to do the processing.
  • Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
  • Aggregated daily sales team updates to send report to executives and to organize jobs running on Spark clusters.
  • Experience and expertise in using Terraform to deploy GCP resources in CI/CD.
  • Compiled data from various sources to perform complex analysis for actionable results
  • Measured the efficiency of the Hadoop/Hive environment, ensuring SLAs were met.
  • Optimized the TensorFlow model for efficiency.
  • Analyzed the system for new enhancements/functionalities and perform Impact analysis of the application for implementing ETL changes
  • Built performant, scalable ETL processes to load, cleanse and validate data
  • Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies
  • Extensively used Agile methodology as the organization standard to implement the data models. Used a microservice architecture with Spring Boot-based services interacting through a combination of REST and Apache Kafka message brokers.
  • Created DAX queries to generate computed columns in Power BI.
  • Developed pipelines to move data from Azure Blob Storage/file shares to Azure SQL Data Warehouse and Blob Storage.
  • Published Power BI reports to the required organizations and made Power BI dashboards available in web clients and mobile apps.
  • Collaborate with team members and stakeholders in design and development of data environment
  • Experienced in designing RESTful services using Java-based APIs like Jersey.
  • Used Oozie operational services for batch processing and scheduling workflows dynamically.
  • Created Cassandra tables to store various formats of data coming from different sources.
  • Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
  • Analyzed functional and non-functional business requirements and translated them into technical data requirements, creating or updating existing logical and physical data models. Developed a data pipeline using Kafka to store data in HDFS.
  • Develop solutions to leverage ETL tools and identify opportunities for process improvements using Informatica and Python
  • Used Windows Azure SQL Reporting Services to create reports with tables, charts, and maps.
  • Wrote Flume configuration files for importing streaming log data into HBase with Flume.
  • Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
  • Using Flume and Spool directory for loading the data from local system (LFS) to HDFS.
  • Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
  • Involved in designing the row key in HBase to store Text and JSON as key values in HBase table and designed row key in such a way to get/scan it in a sorted order.
  • Created Partitioned Hive tables and worked on them using HiveQL.
  • Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
  • Worked on analyzing and resolving the production job failures in several scenarios.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (see the DAG sketch after this list).
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Storage, and BigQuery.
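
As referenced in the Airflow bullet above, this is a hedged sketch of a GCP ETL DAG using Google provider operators: a Dataproc PySpark step followed by a BigQuery aggregation. The project, region, cluster, bucket, dataset, and query are hypothetical placeholders, and the apache-airflow-providers-google package is assumed to be installed.

```python
# Airflow DAG sketch: Dataproc PySpark transform, then a BigQuery aggregation.
# Project, region, cluster, bucket, dataset and query are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

PROJECT_ID = "my-gcp-project"   # placeholder
REGION = "us-central1"          # placeholder

with DAG(
    dag_id="gcp_etl_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Run a PySpark transformation on an existing Dataproc cluster.
    transform = DataprocSubmitJobOperator(
        task_id="dataproc_transform",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": "etl-cluster"},  # placeholder cluster
            "pyspark_job": {"main_python_file_uri": "gs://etl-bucket/jobs/transform.py"},
        },
    )

    # Aggregate the transformed data inside BigQuery.
    aggregate = BigQueryInsertJobOperator(
        task_id="bq_aggregate",
        configuration={
            "query": {
                "query": "SELECT region, SUM(amount) AS total FROM staging.sales GROUP BY region",
                "destinationTable": {
                    "projectId": PROJECT_ID,
                    "datasetId": "reporting",
                    "tableId": "sales_daily",
                },
                "useLegacySql": False,
                "writeDisposition": "WRITE_TRUNCATE",
            }
        },
    )

    transform >> aggregate
```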

Environment: Hadoop YARN, Spark, Spark Streaming, Power BI, Data Warehouse, GCP, MapReduce, Spark SQL, Kafka, Scala, Azure, Python, Hive, Sqoop, Impala, Tableau, Talend, Oozie, Control-M, HBase, Java, Oracle 12c, Linux

Confidential, Madison, WI

Big Data Engineer

Responsibilities:

  • Built custom Tableau/SAP BusinessObjects dashboards for Salesforce, accepting parameters from Salesforce to show the relevant data for the selected object.
  • Involved in requirement gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data.
  • Involved in the Sqoop implementation, which helps in loading data from various RDBMS sources to Hadoop systems and vice versa.
  • Experienced in Maintaining the Hadoop cluster on AWS EMR.
  • Experience in implementing Spark RDD's in Scala.
  • Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers after aggregations for other ETL operations.
  • Configured Spark Streaming to get ongoing information from Kafka and store the streaming information in HDFS.
  • Used Hive to implement data warehouse and stored data into HDFS. Stored data into Hadoop clusters which are set up in AWS EMR.
  • Performed Data Preparation by using Pig Latin to get the right data format needed.
  • Used Python pandas, NiFi, Jenkins, NLTK, and TextBlob to finish the ETL process of clinical data for future NLP analysis (see the pandas/TextBlob sketch after this list).
  • Manipulated and summarized data to maximize possible outcomes efficiently
  • Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, which allowed end users to understand the data on the fly with the use of quick filters for on-demand information.
  • Analyzed and recommended improvements for better data consistency and efficiency
  • Designed and developed data mapping procedures for ETL (data extraction, data analysis, and loading) for integrating data using R programming.
  • Developed a near-real-time data pipeline using Spark.
  • Involved in continuous integration of the application using Jenkins.
  • Used OOZIE Operational Services for batch processing and scheduling workflows dynamically.
  • Supported in setting up QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.
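
The clinical-data ETL bullet above mentions pandas, NLTK, and TextBlob; the sketch below shows the kind of pandas/TextBlob preparation step that could fit there. The input file and column names are hypothetical placeholders, and the feature set (sentiment polarity, word count) is illustrative rather than taken from the original pipeline.

```python
# pandas/TextBlob sketch of a clinical-notes preparation step.
# File path and column names are illustrative placeholders.
import pandas as pd
from textblob import TextBlob

# Load the raw export (placeholder path; expects a 'note_text' column).
notes = pd.read_csv("clinical_notes.csv")

# Basic cleanup before NLP: drop empty notes, normalise whitespace and case.
notes = notes.dropna(subset=["note_text"])
notes["note_text"] = (notes["note_text"]
                      .str.replace(r"\s+", " ", regex=True)
                      .str.strip()
                      .str.lower())

# Lightweight NLP features for downstream analysis:
# sentiment polarity and a simple word count per note.
notes["polarity"] = notes["note_text"].apply(lambda t: TextBlob(t).sentiment.polarity)
notes["word_count"] = notes["note_text"].str.split().str.len()

# Persist the prepared data for the next stage of the pipeline.
notes.to_parquet("clinical_notes_prepared.parquet", index=False)
```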

Environment: Ubuntu, Hadoop, Spark (PySpark, NiFi, Jenkins, Talend, Spark SQL, Spark MLlib), Pig, Python, Tableau, GitHub, AWS EMR/EC2/S3, and OpenCV.

Confidential

Data Engineer

Responsibilities:

  • Used the DataStage Designer to develop processes for Extracting, Cleansing, Transforming, Integrating, and Loading data into Data warehouse.
  • Used JMS to pass messages as payload to track statuses, milestones and states in the workflows.
  • Worked on Performance tuning of WebSphere ESB in different environments on different platforms.
  • Configured and implemented web services specifications in collaboration with the offshore team; created data models and database designs.
  • Developed various server and parallel jobs using Oracle, ODBC, FTP, Peek, Aggregator, Filter, Funnel, Copy, Hash File, Change Capture, Merge, Lookup, Join, and Sort stages.
  • Developed PL/SQL Procedures, Functions, Packages, Triggers, Normal and Materialized Views.
  • Continuous monitoring and managing of the Hadoop cluster through Cloudera Manager.
  • Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers.
  • Developed data pipeline using Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
  • Wrote a Kafka REST API to collect events from the front end (see the sketch after this list).
  • Extracted feeds from social media sites such as Facebook and Twitter using Python scripts.
  • Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
  • Implemented a prototype for the complete requirements using Splunk, Python, and machine learning concepts.
  • Developed Spark scripts to import large files from Amazon S3 buckets.
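
As referenced in the Kafka REST API bullet above, this is a minimal sketch of a REST endpoint that accepts front-end events and publishes them to Kafka. Flask and kafka-python are assumed libraries chosen for illustration; the broker address, topic name, and route are hypothetical placeholders.

```python
# Small REST collector: accept a JSON event from the front end and forward it
# to Kafka. Broker, topic and route are illustrative placeholders.
import json

from flask import Flask, jsonify, request
from kafka import KafkaProducer

app = Flask(__name__)

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",                        # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.route("/events", methods=["POST"])
def collect_event():
    """Accept a JSON event from the front end and forward it to Kafka."""
    event = request.get_json(force=True)
    producer.send("frontend-events", value=event)            # placeholder topic
    return jsonify({"status": "queued"}), 202

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```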

Environment: PL/SQL, HDFS, Kafka, Apache Nifi, AWS, S3, Spark, Python, Splunk, JMS, Git.

Confidential

Data Analyst

Responsibilities:

  • Participated in testing of procedures and data, utilizing PL/SQL to ensure the integrity and quality of data in the data warehouse.
  • Gathered data from the help desk ticketing system and wrote ad-hoc reports, charts, and graphs for analysis.
  • Developed and ran ad-hoc data queries from multiple database types to identify systems of record, data inconsistencies, and data quality issues (see the sketch after this list).
  • Developed complex SQL statements to extract data, and packaged/encrypted data for delivery to customers. Performed Tableau administration using Tableau admin commands.
  • Involved in defining the source to target Data mappings, business rules and Data definitions.
  • Ensured the compliance of the extracts to the Data Quality Center initiatives
  • Designed DataStage parallel jobs using the Designer to extract data from various source systems, transform and convert the data, load it into the data warehouse, and send data from the warehouse to third-party systems like the mainframe.
  • Performed ETL Performance tuning to increase the ETL process speed.
  • Addressed production and UAT issues; proper action was taken based on priority and requirements.
  • Performed debugging and troubleshooting the web applications using Git as a version-controlling tool to collaborate and coordinate with the team members.
  • JUnit was used to implement test cases for beans.
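
The ad-hoc data-quality bullet above is the kind of check the sketch below illustrates: a small Python script that runs null-rate and duplicate-key queries against a warehouse table over ODBC. The DSN, credentials, schema, table, and column names are hypothetical placeholders, and pyodbc is an assumed client library.

```python
# Ad-hoc data-quality checks over ODBC: null keys and duplicate business keys.
# DSN, credentials, table and column names are illustrative placeholders.
import pyodbc

conn = pyodbc.connect("DSN=warehouse;UID=analyst;PWD=secret")  # placeholder DSN
cursor = conn.cursor()

# Null-rate check on a key column.
cursor.execute("""
    SELECT COUNT(*) AS total_rows,
           SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_keys
    FROM dw.orders
""")
total_rows, null_keys = cursor.fetchone()
print(f"{null_keys} of {total_rows} rows have a NULL customer_id")

# Duplicate check on what should be a unique business key.
cursor.execute("""
    SELECT order_number, COUNT(*) AS cnt
    FROM dw.orders
    GROUP BY order_number
    HAVING COUNT(*) > 1
""")
for order_number, cnt in cursor.fetchall():
    print(f"duplicate order_number {order_number}: {cnt} rows")

conn.close()
```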

Environment: SQL, PL/SQL, Confidential/SQL, XML, Informatica, Tableau, OLAP, SSIS, SSRS, Excel, OLTP.
