Big Data Engineer | Hadoop Resume
Houston, TX
SUMMARY:
- Around 5+ years of experience in the IT industry in various roles as Big Data Engineer, Solution Engineer, and AWS Cloud Developer, with experience in DataOps, cloud migration, ETL/ELT in both data lakes and data warehouses, data logistics, capacity planning, managing, debugging, performance tuning, and administration.
- Expertise in Python, PySpark, SQL, HiveQL, Pig, CQL, Spark SQL (Scala), Java, and shell scripting (AWK), with solid knowledge of the functional programming and data structures needed to write UDFs, higher-order methods, and business logic.
- Very good experience in developing big data mechanisms and patterns for complex business problems involving volume, velocity, veracity, and variety.
- Experience with Cloudera CDH 5.x, Hortonworks HDP 2.x/3.x, EMR 4.x/5.x, and HDInsight, with proven expertise in Spark and Hadoop administration using Ambari, HDF, Cloudera Manager, YARN, ZooKeeper, and Hue in fully distributed, highly available cluster environments.
- Decent experience working with AWS services such as Lambda, Kinesis, S3, RDS, Redshift, EMR, SQS, SNS, EBS, MSK, EC2, IAM, AWS Glue, AWS Athena, the ELK/EKK stacks, and DynamoDB. Involved in core development and administration of an AWS data lake and wrote scripts that make API calls to many AWS services using Python and the Boto3 client (a minimal Boto3 sketch follows this summary).
- Integrated AWS Glue with the dev, staging, QA, and production environments using AWS CloudFormation, and authored Python and Scala ETL scripts for managing metadata and automating ETL jobs using the Glue Data Catalog, crawlers, and DynamicFrames (a Glue DynamicFrame sketch follows this summary).
- Decent skills in data warehousing and AWS migration with tools such as AWS SCT, DMS, and Data Pipeline.
- Expertise in diagnosing, optimizing, testing, debugging, and performance tuning Hadoop clusters and their components.
- Very good experience working with StreamSets and NiFi to create data pipelines, solving issues related to data drift and schema evolution, and involved in ETL/ELT automation tasks by integrating Hive and Spark with DataOps.
- Good consulting experience working with corporate C-level and non-technical executives, providing data-intensive solutions that turn data into ROI in the insurance, banking, airline, telecom, and mining domains.
- Extracted queue data as DStreams from near-real-time sources using Kafka and Spark Streaming with Scala; development experience with Presto on EMR as well as Spark Core, SparkContext, the DataFrame API, RDDs, and Spark SQL using Scala 2.11 with build tools such as Maven and SBT.
- Extracted data from FTP servers, SAN, MySQL, Oracle DB, Teradata, TIBCO, and Salesforce for batch/real-time processing using Spark, then transported it to NoSQL stores such as Cassandra, HBase, Presto, Elasticsearch, and DynamoDB; also involved in drill-down and roll-up operations depending on business requirements.
- Implemented a full CI/CD pipeline by integrating SCM (Git) with Gradle for automated builds and tests, deploying with Jenkins (declarative pipelines) and Dockerized containers in production; also worked with DevOps tools such as Ansible, Chef, AWS CloudFormation, AWS CodePipeline, Terraform, and Kubernetes.
- Good experience with ETL data structures, data warehousing, and delivering tables in multiple data formats per requirements; provided an end-to-end solution for healthcare big data warehousing and data modeling.
- Decent knowledge of Azure big data with HDInsight, including creating clusters on a VNet, ETL operations using Hadoop, and real-time processing with Spark Streaming and Kafka.
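A minimal Boto3 sketch in the spirit of the AWS API-call scripts mentioned above; the bucket name and prefix are hypothetical placeholders, not values from any real environment.

```python
# Hedged sketch: the bucket and prefix below are made up for illustration.
import boto3

s3 = boto3.client("s3")

def list_keys(bucket, prefix=""):
    """Page through all object keys under a prefix using the S3 list API."""
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys

if __name__ == "__main__":
    print(list_keys("example-datalake-bucket", prefix="raw/2019/"))
```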
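A rough sketch of a Glue ETL job that reads a catalogued table into a DynamicFrame and writes it back out as Parquet, as described in the Glue bullet above; it only runs inside a Glue job environment, and the database, table, and S3 path are illustrative assumptions.

```python
# Hedged sketch: database, table, and S3 path are hypothetical.
import sys

from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table previously populated by a crawler in the Glue Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Rename/retype a couple of columns before writing out.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "double", "amount", "double")])

glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet")

job.commit()
```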
TECHNICAL SKILLS:
Distributions: Cloudera, Hortonworks, AWS EMR, Azure HDInsight
Big Data Components: Hadoop, HDFS, YARN, Oozie, ZooKeeper, Hive, Pig, Spark, Tez, Presto, Hue, Sqoop, Flume, Airflow, StreamSets, NiFi and Apache Kudu
Security: Kerberos, Apache Ranger, Knox Gateway, LDAP, Apache Sentry, Nginx
DevOps: CloudFormation, Terraform, Ansible, Jenkins CI/CD, Docker and Kubernetes
Real-time: Kafka, Flume, Logstash, Beats, TIBCO StreamBase, TIBCO EMS and Spark Streaming
Programming Languages: Python, Scala, Spark SQL, PySpark, HiveQL, CQL, Java, R
Databases: Cassandra, Elasticsearch, MongoDB, HBase (TSDB), SQL Server, Oracle DB, Solr, Teradata, Redshift, Redis
Visualization Tools: Kibana, Grafana, TIBCO Spotfire X, Tableau, QlikView
Monitoring Tools: Splunk, CloudWatch, ELK, Nagios
PROFESSIONAL EXPERIENCE:
Big Data Engineer | Hadoop
Confidential - Houston, TX
Responsibilities:
- Involved in capacity planning, optimizing, installing, configuring, and monitoring a fully distributed Hadoop cluster.
- Performed DDL, DML, queries, views, and indexing in Hive, including creating external and managed tables.
- Wrote many UDFs, UDTFs, and UDAFs in Scala and PySpark for cache access, null checks, and row/column manipulations.
- Involved in cleaning and conforming data; integrated StreamSets for data-quality screening in ETL streams and configured an ETL pipeline with the JDBC Multitable Consumer plus a CDC pipeline, enabling data drift handling in SDC to automate SQL merges and replicate table schemas for data sync.
- Responsible for exploratory analytics, data wrangling, and occasional feature engineering using Spark SQL and Spark MLlib.
- Managed custom workflows daily in the Airflow UI and maintained DAGs by cleaning up logs and killing high-latency DAG runs during batch jobs (an Airflow DAG sketch follows this list).
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks (a PySpark UDF sketch follows this list).
- Implemented Watcher alerts for next-day alerting/monitoring of application logs partitioned by day, using an ELK stack deployed in Docker containers, and configured Logstash forwarders.
- Authored custom filters/parsers using Grok patterns and regexes for unstructured log files in Logstash, shipped the indexed data to Elasticsearch, and visualized it in Kibana.
- Worked with the Elasticsearch CRUD APIs for document indexing and reindexing via the Bulk API, with the X-Pack extension added for monitoring the Elastic Stack and auditing both REST and transport calls (a bulk-indexing sketch follows this list).
- Involved in creating centralized authorization using Ambari and OpenTSDB (HBase) on the HDP sandbox, with Grafana as the visualization layer, for AAA-standard security.
- Configured, deployed, and maintained multi-node dev and test Kafka clusters, worked with multiple Kafka consumer groups, and persisted data in Cassandra.
- Successfully tested Cassandra and Hive integration with Presto by deploying 4 worker nodes and 2 Presto nodes on 16 GB Solaris VM boxes, processing 300K rows/sec.
- Configured Presto for query optimization by changing default parameters such as memory pools, and performed benchmark testing of different file formats using HiBench.
- Created a StreamSets pipeline for event logs using Kafka, StreamSets Data Collector, and Spark Streaming in cluster mode, customized with masking plugins and filters, and distributed existing Kafka topics across applications using StreamSets Control Hub.
- Shipped indexed documents from HDFS to Elasticsearch and wrote Scala scripts for querying and ingesting DataFrames in bulk transport using the embedded Elastic4s (Scala) module for CRUD.
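A hedged sketch of a small Airflow DAG of the kind maintained above; the dag_id, schedule, and cleanup command are placeholders, and the import style assumes Airflow 1.x.

```python
# Hedged sketch: dag_id, owner, schedule, and the bash command are invented.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="nightly_log_cleanup",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Purge task-log files older than a week so the scheduler host stays healthy.
    purge_old_logs = BashOperator(
        task_id="purge_old_logs",
        bash_command="find /var/log/airflow -name '*.log' -mtime +7 -delete",
    )
```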
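A minimal PySpark sketch of the null-check / column-manipulation UDF pattern described above; the column name and sample rows are invented for illustration.

```python
# Hedged sketch: column names and sample data are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

@F.udf(returnType=StringType())
def normalize_code(value):
    """Trim, upper-case, and null-out empty codes."""
    if value is None or value.strip() == "":
        return None
    return value.strip().upper()

df = spark.createDataFrame([(" ab12 ",), ("",), (None,)], ["product_code"])

# Apply the UDF as part of a cleaning/conforming step.
cleaned = df.withColumn("product_code", normalize_code("product_code"))
cleaned.show()
```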
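An illustrative Python stand-in for the Bulk API indexing described above, using the elasticsearch-py bulk helper (the original work used the Elasticsearch REST/transport APIs directly); the host, index name, and documents here are made up, and a typeless 7.x-style index is assumed.

```python
# Hedged illustration: host, index name, and documents are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

docs = [
    {"@timestamp": "2019-06-01T00:01:00", "level": "ERROR", "message": "disk full"},
    {"@timestamp": "2019-06-01T00:02:00", "level": "INFO", "message": "job finished"},
]

# One bulk action per document, targeting a by-day index.
actions = ({"_index": "app-logs-2019.06.01", "_source": doc} for doc in docs)

ok, errors = helpers.bulk(es, actions)
print("indexed:", ok, "errors:", errors)
```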
AWS Cloud Developer | Big Data
Confidential - Boston, MA
Responsibilities:
- Ingested data from FTP servers into HDFS using shell scripts and wrote API calls to the GDS server to push data to the FTP server daily.
- Wrote Pig scripts to transform text data into Avro files and created Hive DDL tables with partitions, dynamic partitions, and buckets depending on business actions (a partitioned-table sketch follows this list).
- Responsible for administering a 250+ node cluster on the Hortonworks data lake; involved in changing Hadoop configuration for new rack requirements using Ambari 2.0 and upgrading the cluster as well.
- Provided 24x7 on-call support for production issues such as high disk space or memory utilization at peak hours and any related errors/bugs.
- Worked on the enterprise messaging bus with the Kafka-TIBCO connector; published queues were consumed as Spark DStreams, and XML and JSON data was parsed in Hive (a DStream sketch follows this list).
- Involved in setting up a 10-node Kafka cluster environment with 3 web servers, 4 Kafka brokers, 3 Spark Streaming nodes (Kafka consumers), and 3 ZooKeeper nodes, with broker capacity of 1M messages per second.
- Good awareness of best practices for big data integration and automation to achieve high throughput; implemented practices such as quality-control and validation checks, data profiling, checkpointing for recovery, windowing, serialization, and application debugging.
- Wrote Spark SQL (Scala) scripts for business actions per front-end requirements and produced exploratory reports.
- Configured Flume agents and a file-roll sink for third-party data ingestion into HDFS.
- Configured executor, driver, and memory resources in YARN and optimized existing resources by introducing broadcast variables and changing the number of partitions in RDDs based on their DAGs.
- Implemented a NiFi data pipeline that handles 15 TB, automatically parsing XML datasets to JSON; responsible for governing data lineage and centralized audit tracking in the NiFi metadata repository.
- Authored replication and retention policies to automate weekly dataset copies and six-month retention using the NiFi UI and Ranger Access Manager.
- Configured NiFi and Ranger using the Ambari NiFi Ranger plugin to establish NiFi CA communication with the Ranger truststore while enabling Kerberos with Active Directory, preconfigured with LDAPS in HDF.
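A minimal sketch of the partitioned external-table pattern referenced above, issued through Spark SQL with Hive support; the database, table, columns, and HDFS location are assumptions, and bucketing is omitted to keep the sketch short.

```python
# Hedged sketch: database/table/column names and the location are invented.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-ddl-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS staging")

# External table stored as Avro, partitioned by booking date.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.bookings (
        booking_id STRING,
        passenger_id STRING,
        fare DOUBLE
    )
    PARTITIONED BY (booking_date STRING)
    STORED AS AVRO
    LOCATION '/data/staging/bookings'
""")

# Dynamic-partition insert from a raw landing table (assumed to exist).
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE staging.bookings PARTITION (booking_date)
    SELECT booking_id, passenger_id, fare, booking_date
    FROM staging.bookings_raw
""")
```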
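A rough PySpark Streaming sketch (Spark 2.x DStream API) of consuming a Kafka topic and parsing JSON payloads, in the spirit of the messaging-bus work above; the broker addresses, topic, and field names are assumptions, and the original connector/Scala specifics are not reproduced.

```python
# Hedged sketch: brokers, topic, and event fields are placeholders.
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-dstream-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["booking-events"],
    kafkaParams={"metadata.broker.list": "broker1:9092,broker2:9092"})

# Each record is a (key, value) pair; parse the JSON value and keep valid events.
events = (stream
          .map(lambda kv: kv[1])
          .map(json.loads)
          .filter(lambda e: e.get("event_type") is not None))

events.pprint()

ssc.start()
ssc.awaitTermination()
```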
Solution Engineer Associate
Confidential
Responsibilities:
- Worked closely with the solution architect; our team engaged C-level executives to gather requirements from clients and was responsible for every stage of the project through delivery.
- Gained extensive knowledge of big data mechanisms and patterns by solving multiple challenges across different domains and varied problems.
- Supported a migration project and worked on Azure Active Directory, Azure IAM, and multi-factor authentication.
- Involved in ingress and egress with data stores, such as pulling blobs from the data lake, adding blobs to Azure Storage, and duplicating data across regions in Cosmos DB as per requirements.
- Worked on setting up the migration network, including certificates, VNets and subnets, application gateways, and load balancers, plus occasional Azure Site Recovery and security configuration changes.
- Documented use cases, POCs, feasibility reports, workflow diagrams, presentations, and project architecture.
- Involved in a cloud data migration project from SQL Server to Azure using Azure Data Factory.
- Created on-premises cluster reports, workload analytics, and data insights that aided HDInsight planning and resource management.
- Co-authored a 300-page proprietary guide to advanced big data architectural patterns covering ingestion, storage, processing, governance, transformation, and data logistics.
- Collaborated with clients and stakeholders on schema design and functional requirements, closing fit-gap sessions, and monitoring production.
- Provided technical and infrastructure solutions for implementing a Lambda architecture and a distributed framework from scratch.
- Worked with clickstream data/events, using many join methods, grouping records by hot keys, and aggregating metrics in Hive (an aggregation sketch follows this list).
- Involved in a Customer 360 project intended to decrease churn, and took part in a textual-analytics project processing 12,000 sets of healthcare notes while adhering to HIPAA 5010 compliance requirements.
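A short Spark SQL sketch of the kind of clickstream join-and-aggregate query mentioned above; the database, tables, and columns are hypothetical.

```python
# Hedged sketch: the clickstream tables and columns below are illustrative only.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("clickstream-agg-sketch")
         .enableHiveSupport()
         .getOrCreate())

daily_metrics = spark.sql("""
    SELECT u.segment,
           c.page,
           COUNT(*)                  AS page_views,
           COUNT(DISTINCT c.user_id) AS unique_visitors
    FROM clickstream.events c
    JOIN clickstream.users u
      ON c.user_id = u.user_id
    WHERE c.event_date = '2018-05-01'
    GROUP BY u.segment, c.page
""")

daily_metrics.show(20, truncate=False)
```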