Senior Big Data Engineer Resume
PA
SUMMARY
- Around 9 years of experience in software analysis, datasets, design, development, testing, and implementation of Cloud, Big Data, BigQuery, Spark, Scala, and Hadoop solutions.
- Expertise in Big Data technologies, Data Pipelines, SQL/NoSQL, cloud-based RDS, Distributed Databases, Serverless Architecture, Data Mining, Web Scraping, and Cloud technologies like AWS EMR, Redshift, Lambda, Step Functions, and CloudWatch.
- Hands on experience in designing and implementing data engineering pipelines and analyzing data using Hadoop ecosystem tools like HDFS, MapReduce, Spark, Sqoop, Hive, Flume, Kafka, Impala, PySpark, Oozie and HBase.
- Experience in implementing E2E Big Data solutions using the Hadoop framework; designed and executed big data solutions on multiple distributions such as Cloudera (CDH3 & CDH4) and Hortonworks.
- Strong knowledge in writing Hive UDFs and Generic UDFs to incorporate complex business logic into Hive queries.
- Experience in designing, developing, and deploying projects on the GCP suite, including BigQuery, Dataflow, Dataproc, Google Cloud Storage, Composer, Looker, etc.
- Vast experience in designing, creating, testing, and maintaining complete data management flows across Data Ingestion, Data Curation, and Data Provisioning, with in-depth knowledge of Spark APIs (Spark SQL, DSL, Streaming), working with different file formats like Parquet and JSON, and tuning Spark application performance from various aspects.
- Gathering and translating business requirements into technical designs, and developing the physical aspects of a specified design by creating Materialized Views, Views, and Lookups.
- Experience in designing and testing highly scalable, mission-critical systems and Spark jobs in both Scala and PySpark, as well as Kafka.
- Expertise in end-to-end Data Processing jobs to analyze data using MapReduce, Spark, and Hive.
- Good understanding and experience in Data Modeling (Dimensional and Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension Tables.
- Strong Experience in working with Linux/Unix environments, writing Shell Scripts.
- Good at conceptualizing and building solutions quickly; recently developed a Data Lake using a pub-sub architecture. Developed a pipeline using Scala and Kafka to load data from a server to Hive with automatic ingestion and quality audits of the data into the RAW layer of the Data Lake (a minimal sketch follows this summary).
- Experience in working with AWS services AWS Glue, Amazon Managed Kafka, Athena, IAM roles and policies.
- Strong experience in using Spark Streaming, Spark SQL, and other Spark components like accumulators, broadcast variables, different levels of caching, and optimization techniques for Spark jobs.
- Designed and developed services to persist and read data from Hadoop, HDFS, and Hive, and wrote Java-based MapReduce batch jobs using the Hortonworks Hadoop Data Platform.
- Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases like Cassandra and MongoDB.
- Good knowledge in querying data from Cassandra for searching, grouping, and sorting.
- Strong experience in core Java, Scala, SQL, PL/SQL, and RESTful web services.
- Good experience in generating statistics and reports from Hadoop.
- In-depth understanding/knowledge of Hadoop Architecture and various components such as HDFS, MR, Hadoop Gen2 Federation, High Availability, and YARN architecture, and good understanding of workload management, scalability, and distributed platform architectures.
- Implemented various algorithms for analytics using Cassandra with Spark and Scala.
- Experienced in building Data Warehouses on the Azure platform using Azure Databricks and Data Factory.
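A minimal PySpark Structured Streaming sketch of the kind of Kafka-to-RAW-layer ingestion described above (the original pipeline was written in Scala; broker, topic, schema, and paths are illustrative assumptions, and the spark-sql-kafka package is assumed to be available at submit time):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# Hive-enabled session; requires the spark-sql-kafka-0-10 package on the classpath.
spark = (SparkSession.builder
         .appName("raw-layer-ingest")
         .enableHiveSupport()
         .getOrCreate())

# Illustrative schema for the incoming JSON events.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
    StructField("event_ts", TimestampType()),
])

# Subscribe to the Kafka topic feeding the Data Lake's RAW layer.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # assumption
       .option("subscribe", "server.events")                # assumption
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Basic quality audit before landing: drop records missing a key or timestamp.
audited = events.filter(F.col("event_id").isNotNull() & F.col("event_ts").isNotNull())

# Land as Parquet under the RAW layer; an external Hive table can point here.
(audited.writeStream
 .format("parquet")
 .option("path", "hdfs:///datalake/raw/server_events")            # assumption
 .option("checkpointLocation", "hdfs:///checkpoints/server_events")
 .trigger(processingTime="1 minute")
 .start())
```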
TECHNICAL SKILLS
Programming languages: Python, Scala, PySpark, Shell Scripting, SQL, PL/SQL and UNIX Bash
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Operating Systems: UNIX, LINUX, Solaris, Mainframes
Big Data: Hadoop, Sqoop, Apache Spark, NiFi, Kafka, Snowflake, Cloudera, StreamSets, PySpark, Spark, Spark SQL
Databases: Oracle, SQL Server, MySQL, DB2, Sybase, Netezza, Hive, Impala
Cloud Technologies: AWS, Azure
IDE Tools: Aginity for Hadoop, PyCharm, Toad, SQL Developer, SQL*Plus, Sublime Text, VI Editor
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data warehouse Tools: Informatica and Tableau.
Others: AutoSys, Crontab, ArcGIS, Clarity, Informatica, Business Objects, IBM MQ, Splunk
PROFESSIONAL EXPERIENCE
Confidential, PA
Senior Big Data Engineer
Responsibilities:
- Participated in all phases including Analysis, Design, Coding, Testing, and Documentation; gathered requirements and performed Business Analysis.
- Worked on building dashboards in Tableau with ODBC connections from different sources like the BigQuery/Presto SQL engines.
- Developed Entity-Relationship diagrams and modeled Transactional Databases and Data Warehouses using ER/Studio and PowerDesigner.
- Converted PowerCenter code to DBM mappings using Informatica Developer.
- Writing shell scripts to schedule the Informatica domain and the repository backups on a weekly basis.
- Maintained data pipeline up-time of 99.9% while ingesting streaming and transactional data across 7 different primary data sources using Spark, Redshift, S3, and Python.
- Ingested data from disparate data sources using a combination of SQL, the Google Analytics API, and the Salesforce API with Python, creating data views to be used in BI tools like Tableau.
- Wrote MapReduce code in Python to eliminate certain security issues in the data.
- Developed different pipelines in the Streamsets according to the requirements of the business owner.
- Used Python, JSON, and Groovy scripting extensively to deploy the StreamSets pipelines onto the server.
- Synchronized both unstructured and structured data using Pig and Hive per the business prospectus.
- Worked with Kafka to integrate data from multiple topics into the database; managed RESTful APIs and integrated them with StreamSets to move data.
- Used Pig Latin at client-side cluster and HiveQL at server-side cluster.
- Importing the complete data from RDBMS to HDFS cluster using Sqoop.
- Creating external tables and moving the data onto the tables from managed tables.
- Performing the subqueries in Hive and partitioning and bucketing the imported data using HiveQL.
- Moving this partitioned data onto the different tables as per business requirements.
- Invoked external UDF/UDAF/UDTF Python scripts from Hive using the Hadoop Streaming approach, with cluster monitoring supported by Ganglia.
- Setting up the work schedule using oozie and identifying the errors in the logs, rescheduling/resuming the job.
- Able to handle whole data using HWI (Hive Web Interface) using Cloudera Hadoop distribution UI.
- Involved in Designing and Developing Enhancements to product features.
- Involved in Designing and Developing Enhancements of CSG using AWS APIS.
- Worked with Terraform templates to automate the Azure lab's virtual machines using Terraform modules, and deployed virtual machine scale sets in a production environment.
- Designing and developing ETL Solutions in the Informatica power center.
- Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
- Created a Lambda deployment function and configured it to receive events from the S3 bucket (a hedged handler sketch appears at the end of this section).
- Designed the data models used in data-intensive AWS Lambda applications aimed at complex analysis, creating end-to-end analytical reports.
- Writing code that optimizes the performance of AWS services used by application teams and provides Code-level application security for clients (IAM roles, credentials, encryption, etc.)
- Installed and configured Hadoop Map Reduce, HDFS, developed multiple Map Reduce jobs in Java and Scala for data cleaning and preprocessing.
- Written Templates for Azure infrastructure as code using Terraform to build staging and production environments.
- Integrated Azure Log Analytics with Azure VMs to monitor log files, store them, and track metrics; used Terraform to manage different infrastructure resources across cloud, VMware, and Docker.
- Developed file cleaners in Python, utilizing libraries such as Boto3 and Pandas.
- Created RDDs in Spark.
- Extracted data from the Teradata data warehouse onto Spark RDDs.
- Worked on stateful transformations in Spark Streaming.
- Good hands-on experience loading data into Hive from Spark RDDs (see the PySpark sketch at the end of this section).
- Worked on Spark SQL UDFs and Hive UDFs; also worked with Spark accumulators and broadcast variables.
- User profile and other unstructured data storage using Java and MongoDB.
- Advocated, demonstrated, and trained in Git, Python, SDLC, SQL, and RDB design for the team.
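A minimal PySpark sketch of the kind of warehouse-to-Hive load described above: reading extracts landed in HDFS, broadcasting a small lookup, and writing into a partitioned Hive table (table names, paths, and columns are illustrative assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive-enabled session; assumes the cluster's Hive metastore is configured.
spark = (SparkSession.builder
         .appName("warehouse-to-hive")   # illustrative app name
         .enableHiveSupport()
         .getOrCreate())

# Illustrative source: CSV extracts landed in HDFS by Sqoop from the warehouse.
claims = spark.read.option("header", "true").csv("hdfs:///data/raw/claims/")
states = spark.read.option("header", "true").csv("hdfs:///data/raw/states/")

# Broadcast the small lookup to every executor to avoid a shuffle join.
enriched = claims.join(F.broadcast(states), on="state_code", how="left")

# Write into a partitioned Hive table; the partition column is an assumption.
(enriched
 .withColumn("load_date", F.current_date())
 .write.mode("append")
 .format("parquet")
 .partitionBy("load_date")
 .saveAsTable("analytics.claims_enriched"))
```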
Environment: Hadoop, Sqoop, Hive, HDFS, StreamSets, YARN, Java, PySpark, Zookeeper, HBase, Apache Spark, Scala, Kafka, Oracle, Python, Terraform, RESTful web services.
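A hedged sketch of the kind of Lambda deployment noted above: a handler receiving S3 object-created events (bucket layout, logging, and any downstream handling are illustrative assumptions):

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by S3 object-created notifications; names are illustrative."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Inspect the newly landed object and emit a structured log line.
        obj = s3.head_object(Bucket=bucket, Key=key)
        print(json.dumps({"bucket": bucket, "key": key, "size": obj["ContentLength"]}))

    return {"status": "ok"}
```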
Confidential, CT
Senior Big Data Engineer
Responsibilities:
- Worked closely with the business analysts to convert the Business Requirements into Technical Requirements and prepared low and high-level documentation.
- Performed transformations using Hive and MapReduce; hands-on experience copying .log and Snappy files into HDFS from Greenplum using Flume & Kafka, loading data into HDFS, and extracting data into HDFS from MySQL using Sqoop.
- Imported required tables from RDBMS to HDFS using Sqoop and used Storm/ Spark streaming and Kafka to get real time streaming of data into HBase.
- Experience in building multiple Data pipelines, end to end ETL and ELT process for Data ingestion and transformation in GCP.
- Developed views and templates with Python and Django's view controller and templating language to create a user-friendly website interface.
- Wrote MapReduce jobs for text mining, worked with the predictive analysis team, and have experience working with Hadoop components such as HBase, Spark, YARN, Kafka, Zookeeper, Pig, Hive, Sqoop, Oozie, Impala, and Flume.
- Wrote Hive UDFs as per requirements and to handle different schemas and XML data.
- Implemented ETL code to load data from multiple sources into HDFS using Pig Scripts.
- Integrated data from the Cloudera Big Data stack (Hadoop, Hive, HBase, MongoDB) and built StreamSets pipelines to accommodate change.
- Developed data pipelines using Python and Hive to load data into the data lake; performed data analysis and data mapping for several data sources.
- Responsible for sending quality data through a secure channel to the downstream system using role-based access control and StreamSets.
- Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data from different sources like Azure SQL and SQL Data Warehouse to meet business functional requirements.
- Extensive experience creating pipeline jobs, scheduling triggers, and Mapping Data Flows using Azure Data Factory (V2), and using Key Vaults to store credentials.
- Implemented the Azure self-hosted integration runtime in ADF; created and provisioned different Databricks clusters, notebooks, jobs, and autoscaling.
- Processing Healthcare HL7 files using the Informatica Data transformation libraries and loading them into Hadoop.
- Created customized DBM mappings for incremental loads using Informatica Developer and deployed them as part of the application.
- Written the Map Reduce programs, and Hive UDFs in Java.
- Developed Java Map Reduce Programs for the analysis of sample log files stored in the cluster.
- Designed and developed User Defined Functions (UDFs) for Hive, developed Pig UDFs to pre-process the data for analysis, and wrote UDAFs for custom data-specific processing.
- Created Airflow scheduling scripts in Python.
- Automated the existing performance-calculation scripts using scheduling tools like Airflow (see the DAG sketch at the end of this section).
- Designed and developed the core data pipeline code, involving work in Python and built on Kafka and Storm.
- Good knowledge of Partitions, bucketing concepts in Hive and designed both Managed and External tables in Hive for optimized performance.
- Performance tuning using Partitioning, bucketing of IMPALA tables.
- Created cloud-based software solutions written in Scala using Spray IO, Akka, and Slick.
- Hands on experience on fetching the live stream data from DB2 to HBase table using Spark Streaming and Apache Kafka.
- Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
- Worked on NoSQL databases including HBase and Cassandra.
- Populated HDFS and Cassandra with huge amounts of data using Apache Kafka.
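A minimal Airflow sketch of the kind of scheduled performance-calculation job mentioned above (the DAG id, schedule, and callable are illustrative assumptions):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def compute_performance_metrics(**context):
    # Placeholder for the existing performance-calculation script.
    print("running performance calculations for", context["ds"])


default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_performance_metrics",      # illustrative
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",           # daily at 06:00
    default_args=default_args,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="compute_metrics",
        python_callable=compute_performance_metrics,
    )
```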
Environment: Map Reduce, HDFS, Hive, Pig, HBase, Python, Streamsets, SQL, Sqoop, Java, Flume, Oozie, Impala, Scala, Spark, Apache Kafka, Play, GCP, AKKA, Zookeeper, J2EE, Linux Red Hat, HP-ALM, Eclipse, Cassandra, SSIS.
Confidential, Englewood, CO
Big Data Engineer
Responsibilities:
- Involved in design and development phases of Software Development Life Cycle (SDLC) using Scrum methodology.
- Involved in Requirement gathering, Business Analysis and translated business requirements into Technical design in Hadoop and Big Data.
- Worked on AWS CLI Auto Scaling and CloudWatch monitoring creation and updates.
- Allotted permissions, policies, and roles to users and groups using AWS Identity and Access Management (IAM) (see the boto3 sketch at the end of this section).
- Designed and implemented a Cassandra NoSQL based database that persists high-volume user profile data.
- Migrated high-volume OLTP transactions from Oracle to Cassandra
- Doing data synchronization between EC2 and S3, Hive stand-up, and AWS profiling
- Created Data Pipeline of Map Reduce programs using Chained Mappers.
- Implemented optimized joins across different data sets to get top claims by state using MapReduce.
- Modelled Hive partitions extensively for data separation and faster data processing, and followed Pig and Hive best practices for tuning.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Importing and exporting data into HDFS from databases and vice versa using Sqoop.
- Developed data pipeline using Flume, Sqoop, Pig and Java Map Reduce to ingest behavioral data into HDFS for analysis.
- Used Maven extensively for building JAR files of MapReduce programs and deployed them to the cluster.
- Created a customized BI tool for the manager team that performs query analytics using HiveQL.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
- Worked with NoSQL databases like HBase, Cassandra, DynamoDB (AWS), and MongoDB.
- Implemented optimization and performance tuning in Hive and Pig.
- Developed job flows in Oozie to automate the workflow for extraction of data from warehouses and weblogs.
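A hedged boto3 sketch of the kind of IAM role/policy assignment and CloudWatch monitoring described above (role names, policy ARNs, the Auto Scaling group, account ID, and thresholds are illustrative assumptions):

```python
import json

import boto3

iam = boto3.client("iam")
cloudwatch = boto3.client("cloudwatch")

# Trust policy letting EC2 instances assume the role (illustrative role name).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName="etl-ec2-role", AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.attach_role_policy(
    RoleName="etl-ec2-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",   # illustrative policy
)

# CPU alarm on an Auto Scaling group, notifying an SNS topic (both illustrative).
cloudwatch.put_metric_alarm(
    AlarmName="etl-asg-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "etl-asg"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:etl-alerts"],  # placeholder ARN
)
```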
Environment: RHEL, HDFS, Map-Reduce, AWS, Hive, Pig, Sqoop, Flume, Oozie, Mahout, HBase, Hortonworks Data Platform distribution, Cassandra.
Confidential, Richmond, VA
Big Data Engineer
Responsibilities:
- Responsible for designing and implementing end-to-end data pipelines using Big Data tools including HDFS, Hive & Spark.
- Extracting, Parsing, Cleaning, and ingesting the incoming web feed data and server logs into the HDFS by handling structured and unstructured data.
- Worked on loading CSV/TXT/Avro/Parquet files using PySpark in the Spark framework, processing the data by creating Spark DataFrames and RDDs, and saving the files in Parquet format in HDFS to load into the fact table (see the PySpark sketch at the end of this section).
- Worked extensively on tuning SQL queries and database modules for optimum performance.
- Writing complex SQL queries like CTEs, subqueries, joins, Recursive CTEs.
- Good experience in Database, Data Warehouse and schema concepts like SnowFlake Schema.
- Worked on a multi-node cluster; communicated with business users and source data owners to gather reporting requirements and to access and discover source data content, quality, and availability.
- Imported millions of structured data from relational databases using Sqoop import to process using Spark and stored the data into HDFS in CSV format.
- Involved in file movements between HDFS & AWS S3 and extensively worked with S3 buckets in AWS.
- Integrated data stored in S3 with Databricks to perform ETL processes using PySpark and Spark SQL.
- Used Spark SQL to load data from JSON, create schema RDDs, and load them into Hive tables.
- Used Scala SBT to develop Scala coded spark projects and executed using spark-submit.
- Expertise on Spark, Spark SQL, Tuning and Debugging the Spark Cluster (Yarn).
- Improving Efficiency by modifying existing Data pipelines on Matillion to load the data into AWS Redshift.
- Deployed the Airflow server and set up DAGs for scheduled tasks.
- Very good experience with HashiCorp Vault for writing and reading secrets into and from lockboxes.
- Migration of MicroStrategy reports and data from Netezza to IIAS.
- Experienced with batch processing of Data sources using Apache Spark.
- Extensive usage of Python libraries, Pylint, and the auto-testing framework Behave.
- Well-versed with Pandas data frames and Spark data frames.
- Developed PowerCenter mappings to extract data from various databases and flat files and load it into the DataMart using PySpark and Airflow.
- Created data partitions on large data sets in S3 and DDL on partitioned data.
- Implemented rapid provisioning and life-cycle management for using Amazon EC2 and custom Bash scripts.
- Experienced in writing MapReduce jobs in Java for processing large sets of structured, semi-structured, and unstructured data and storing them in HDFS.
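A minimal PySpark sketch of the kind of file loading described above: reading CSV landed in S3, cleaning it, and writing date-partitioned Parquet back to S3 for the fact table (bucket names, columns, and partitions are illustrative assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fact-load").getOrCreate()

# Illustrative raw CSV feed landed in S3.
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3a://raw-bucket/orders/"))

# Light cleanup and a derived partition column.
cleaned = (orders
           .dropDuplicates(["order_id"])
           .withColumn("order_date", F.to_date("order_ts")))

# Write Parquet partitioned by date; downstream DDL (Hive/Athena) can point here.
(cleaned.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("s3a://curated-bucket/fact_orders/"))
```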
Environment: Unix Shell Script, Python 2&3, Scheduler (Cron), Jenkins, Artifactory, Matillion, EMR, Databricks, PyCharm, Spark SQL, Hive, SQL, Jupyter, MicroStrategy, PuTTY, Power BI, Java, AWS.
Confidential
Hadoop Developer
Responsibilities:
- Involved in frequent meetings with clients to gather business requirements & converting them to technical specifications for development team.
- Importing and exporting data into HDFS from Oracle Database and vice versa using Sqoop
- Imported data using Sqoop to load data from Oracle to HDFS on regular basis.
- Wrote Hive queries for data analysis to meet the business requirements (see the PyHive sketch at the end of this section).
- Creating Hive tables and working on them using HiveQL. Experienced in defining job flows.
- Involved in creating Hive tables, loading the data and writing hive queries that will run internally in a map reduce way. Developed a custom File System plugin for Hadoop so it can access files on Data Platform.
- The custom File System plugin allows Hadoop MapReduce programs, HBase, Pig and Hive to work unmodified and access files directly.
- Used Pig as ETL tool to do transformations, event joins, filters and some pre-aggregations before storing the data onto HDFS.
- Designed and implemented MapReduce-based large-scale parallel relation-learning system.
- Setup and benchmarked Hadoop/HBase clusters for internal use
- Loaded the aggregated data onto DB2 for reporting on the dashboard.
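A hedged sketch, using the PyHive client, of the kind of Hive table creation and analysis query described above (the HiveServer2 host, table, columns, and path are illustrative assumptions; the original work used Hive directly on the cluster):

```python
from pyhive import hive  # assumes a reachable HiveServer2 endpoint

conn = hive.Connection(host="hive-server.example.com", port=10000, username="etl")
cursor = conn.cursor()

# External table over Sqoop-imported order data (path and schema are illustrative).
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS orders_raw (
        order_id BIGINT,
        customer_id BIGINT,
        amount DOUBLE,
        order_date STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/etl/orders_raw'
""")

# Typical analysis query feeding the downstream DB2 reporting load.
cursor.execute("""
    SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM orders_raw
    GROUP BY order_date
""")
for row in cursor.fetchall():
    print(row)
```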
Environment: Hadoop, MapReduce, HDFS, Hive, Java, HBase, DB2, MS Office, Windows