
Big Data Engineer Resume

Plano, TX

SUMMARY:

  • 8+ years of overall IT experience across a variety of industries, with the majority of that experience in Big Data analytics and development.
  • Experience in Spark/Scala programming with good knowledge of Spark architecture and its in-memory processing.
  • Experienced in transforming Hive/SQL queries into Spark transformations using Spark DataFrames and Python (an illustrative sketch follows this list). Excellent experience with Scala, Apache Spark, Spark Streaming, pattern matching, and MapReduce.
  • Experienced in working with Flume to load log data from multiple sources directly into HDFS.
  • Excellent experience with Scala and Spark task-level APIs such as addTaskCompletionListener(), barrier(), allGather(), resourcesJMap(), and taskMetrics().
  • Experience with streaming frameworks such as Apache Storm to load data from messaging systems such as Apache Kafka into HDFS.
  • Good exposure to Apache Hadoop MapReduce programming, Pig scripting, distributed applications, and HDFS.
  • Configured and enabled SSL for Hadoop web UIs (HDFS, YARN, Job History, Spark, Tez, and Hue) in AWS.
  • Proficient in big data ingestion and streaming tools like Flume, Sqoop, Kafka, and Storm.
  • Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, Auto Scaling groups, and the AWS CLI.
  • Extensive experience working with Oracle, DB2, SQL Server, and MySQL databases, and with core Java concepts such as OOP, multithreading, collections, and I/O.
  • Good experience with AWS Elastic Block Store (EBS), its different volume types, and selecting EBS volume types based on requirements.
  • Good experience working with Azure cloud platform services such as Azure Data Factory (ADF), Azure Data Lake, Azure Blob Storage, Azure SQL Analytics, and HDInsight/Databricks.
  • Improved infrastructure design and approaches across projects on the AWS cloud platform by configuring security groups, Elastic IPs, and storage on S3 buckets.
  • Worked with version control tools such as Subversion and Git, and used source code management client tools such as GitHub.
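
As an illustration of the Hive/SQL-to-Spark DataFrame work noted above, a minimal PySpark sketch (table, column, and application names are hypothetical, not taken from any actual project):

    # Rewriting a Hive aggregation as a Spark DataFrame transformation.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive_to_dataframe_example")   # hypothetical app name
             .enableHiveSupport()
             .getOrCreate())

    # Hive/SQL version:
    #   SELECT customer_id, COUNT(*) AS events FROM usage_logs GROUP BY customer_id;
    usage = spark.table("usage_logs")                # hypothetical Hive table
    events_per_customer = (usage.groupBy("customer_id")
                                .agg(F.count("*").alias("events")))
    events_per_customer.show()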

TECHNICAL SKILLS:

Big Data Ecosystem: HDFS, MapReduce, YARN, Hive, Pig, HBase, Impala, Zookeeper, Sqoop, Flume, Spark, Solr.

Cloud Environments: AWS, Azure, and GCP.

NoSQL: HBase, Cassandra, MongoDB.

Databases: Oracle 11g/12C, Teradata, DB2, MS-SQL Server, MySQL, MS-Access.

Programming Languages: Scala, Python, SQL, Java, Advanced SQL, PL/SQL, Linux shell scripting.

BI Tools: Tableau, Apache Superset.

Alerting & Logging: Grafana, Kibana, Splunk.

Automation: Airflow, Oozie.

PROFESSIONAL EXPERIENCE:

Confidential, Plano, TX

Big Data Engineer

Responsibilities:

  • Developed Kafka producers and consumers, Cassandra clients, and PySpark applications integrated with HDFS and Hive.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Constructed product-usage SDK data and data aggregations using PySpark, Scala, Spark SQL, and HiveContext in partitioned Hive external tables maintained in AWS S3 for reporting, data science dashboarding, and ad hoc analyses.
  • Converted all Hadoop jobs to run on EMR by configuring the cluster according to the data size.
  • Very good knowledge of Oracle tools such as TOAD, SQL*Plus, and SQL Navigator.
  • Worked on BI re-platforming; provided a report on Tableau, Power BI, Domo, Apache Superset, QlikView, Qlik Sense, and MicroStrategy.
  • Worked with Apache Superset for indexing and load-balanced querying to search for specific data in larger datasets; used a job-management scheduler to execute the workflows.
  • Proficient in advanced Oracle 9i PL/SQL features such as cursor variables, REF cursors, nested tables, and dynamic SQL.
  • Wrote a PL/SQL package containing several procedures and functions to handle sequencing issues.
  • Experience in designing, architecting, and implementing scalable cloud-based web applications using AWS and GCP.
  • Set up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configurations, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
  • Worked with Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
  • Created, configured, and monitored shard sets; analyzed the data to be sharded and chose shard keys to distribute data evenly. Performed architecture and capacity planning for MongoDB clusters and implemented scripts for MongoDB import, export, dump, and restore.
  • Used MongoDB tools such as MongoDB Compass, Atlas, Ops Manager, and Cloud Manager. Worked with MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design.
  • Developed Oozie workflows and Pig jobs that run independently based on time and data availability.
  • Excellent understanding and knowledge of NoSQL databases such as MongoDB and HBase.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Performed data extraction, transformation, and loading (Databricks and Hadoop).
  • Set up AWS and Azure Databricks accounts.
  • Experienced in developing Spark applications using the Spark Core, Spark SQL, and Spark Streaming APIs.
  • Worked closely with the Kafka admin team on Kafka cluster setup in the QA and production environments.
  • Used Kibana and Elasticsearch to identify Kafka message failure scenarios.
  • Worked on big data integration and analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods.
  • Worked on analyzing Hadoop cluster using different big data analytic tools including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark and Kafka.
  • Created Airflow scheduling scripts in Python (an illustrative DAG sketch follows this list).
  • Responsible for designing and implementing data pipelines using big data tools including Hive, Oozie, Airflow, Spark, Drill, Kylin, Sqoop, Kylo, NiFi, EC2, ELB, S3, and EMR.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python.
  • Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job, supporting workflow management and automation with Airflow.
  • Developed highly scalable classifiers and tools by leveraging machine learning, Apache Spark, and deep learning; configured the corresponding jobs in Airflow.
  • Helped develop a validation framework using Airflow for data processing.
  • Extensive working knowledge and experience in building and automating processes using Airflow.
  • Worked with version control systems such as Subversion, Perforce, and Git to provide a common platform for all developers.
  • Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Involved in developing, building, testing, and deploying to the Hadoop cluster in distributed mode.
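
As referenced in the Airflow bullets above, a minimal DAG sketch (assumes Airflow 2.x; the DAG id, task names, and callables are hypothetical and stand in for the actual production workflow):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_usage(**context):
        # Placeholder: pull product-usage data for the execution date.
        print(f"extracting usage data for {context['ds']}")

    def aggregate_usage(**context):
        # Placeholder: trigger the PySpark aggregation job (e.g., via spark-submit).
        print("aggregating usage data")

    with DAG(
        dag_id="usage_aggregation_example",            # hypothetical DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={
            "retries": 2,
            "retry_delay": timedelta(minutes=5),
            "sla": timedelta(hours=2),                 # SLA watcher hook
        },
    ) as dag:
        extract = PythonOperator(task_id="extract_usage", python_callable=extract_usage)
        aggregate = PythonOperator(task_id="aggregate_usage", python_callable=aggregate_usage)

        extract >> aggregate                           # task dependency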

Environment: Apache Spark, Kafka, Cassandra, GCP, MongoDB, Databricks, Flume, YARN, Sqoop, Oozie, Hive, Pig, Java, Cloudera Hadoop distribution 5.4/5.5, Linux, XML, Eclipse, MySQL, AWS.

Confidential, Phoenix, AZ

Big Data Engineer

Responsibilities:

  • Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
  • Implemented Spark SQL to load JSON data; frequently used collect() and coalesce() for faster data processing.
  • Created data sources to load data into SQL Server (staging database) before and after performing cleansing on the extract tables.
  • Set up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configurations, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
  • Set up alerting and monitoring using Stackdriver in GCP.
  • Experience in managing the MongoDB life cycle, including sizing, automation, monitoring, and tuning.
  • Experience working with MongoDB Ops Manager, Cloud Manager, and Atlas.
  • Experience in integrating databases such as MongoDB and MySQL with web pages built in HTML, PHP, and CSS to update, insert, delete, and retrieve data with simple ad hoc queries.
  • Developed data models using data virtualization techniques in Denodo 5.5/6.0, connecting to multiple data sources such as SQL Server, Oracle, and Hadoop.
  • Good experience working on data lake design and implementation on AWS.
  • Worked with continuous integration/continuous delivery tools such as Jenkins, Git, Ant, and Maven; created workflows in Jenkins and worked on the CI/CD model setup using Jenkins.
  • Responsible for performing various transformations such as sort, join, aggregation, and filter in order to retrieve various datasets using Apache Spark.
  • DevOps role converting existing AWS infrastructure to a serverless architecture (AWS Lambda, Kinesis) deployed via CloudFormation.
  • Migrated ETL jobs to Pig scripts that perform transformations, joins, and some pre-aggregations before storing the data in HDFS.
  • Exported data to Cassandra (NoSQL) database from HDFS using Sqoop and performed various CQL commands on Cassandra to obtain various datasets as required.
  • Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
  • Built reusable data ingestion and data transformation frameworks using Python.
  • Designed and built data quality frameworks covering aspects such as completeness, accuracy, and coverage using Python, Spark, and Kafka.
  • Used Python for SQL/CRUD operations in the database and for file extraction, transformation, and generation.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables (an illustrative sketch follows this list).
  • Transformed and analyzed the data using PySpark and Hive based on ETL mappings.
  • Developed PySpark programs, created DataFrames, and worked on transformations.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Provided guidance to the development team working on PySpark as an ETL platform.
  • Built ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
  • Consulted on Snowflake data platform solution architecture, design, development, and deployment, focused on building a data-driven culture across the enterprise.
  • Architected SQL Server Failover Solution for Mirror and Replication architecture.
  • Managed MS SQL Server 2008 customer-facing databases with mirroring and replication architecture.
  • Designed and implemented the Exadata/ZBA database backup architecture, utilizing three InfiniBand networks for the multiple sets of Exadata and one InfiniBand network for the attached TSM backup system for offsite tape storage, creating a ZBA SAN.
  • Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
  • Practical work experience in building data science solutions and production-ready systems on big data platforms such as Snowflake, Spark, and Hadoop.
  • Created and maintained the development operations pipeline and systems, including continuous integration, continuous deployment, code review tools, and change management systems.
  • Followed Agile software methodologies on the project per client direction and participated in daily scrum meetings.
  • Worked on a strategic initiative that enabled the firm to move data applications from the legacy Sybase system to the data lake.
  • Performed architecture design, POC builds, Hadoop administration, Linux administration, performance tuning, and upgrades.
  • Provided big data Hadoop and Cassandra production support and architecture.
  • Experienced in developing Spark applications using the Spark Core, Spark SQL, and Spark Streaming APIs.
  • Functional knowledge of Banking and Health Insurance domain.
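
As referenced above, a minimal PySpark sketch for loading CSV feeds with differing schemas into a Hive ORC table (assumes Spark 3.1+ for allowMissingColumns; paths, database, and table names are hypothetical):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("csv_to_hive_orc_example")
             .enableHiveSupport()
             .getOrCreate())

    # Read each CSV feed with its own header, then align columns by name
    # before the union so feeds with extra or missing columns still load.
    feed_a = spark.read.option("header", True).csv("hdfs:///landing/feed_a/")   # hypothetical path
    feed_b = spark.read.option("header", True).csv("hdfs:///landing/feed_b/")   # hypothetical path
    combined = feed_a.unionByName(feed_b, allowMissingColumns=True)

    # Land the combined data in a Hive-managed ORC table for downstream queries.
    combined.write.format("orc").mode("append").saveAsTable("staging.usage_events")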

Environment: Confidential, GCP, MongoDB, Hadoop, Snowflake, Python, Pig, Hive, Oozie, NoSQL, Sqoop, Flume, HDFS, HBase, MapReduce, MySQL, Hortonworks, Impala, Cassandra, IBM WebSphere, Tomcat.

Confidential, Nutley, NJ

Big Data Engineer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop.
  • Experience in integrating Denodo with Oracle, SQL Server, MySQL Workbench databases using JDBC.
  • Worked with AWS services such as EMR and EC2 for fast and efficient processing of big data, using Amazon S3 as the data lake and Amazon Redshift as the data warehouse.
  • Implemented and maintained the monitoring and alerting of production and corporate servers/storage using CloudWatch.
  • Experience in extracting source data from sequential files, XML files, and CSV files, then transforming and loading it into the target data warehouse.
  • Expertise in analytics, design, data warehouse modeling, development, implementation, maintenance, migration, and production support of large-scale enterprise data warehouses.
  • Designed and developed high-quality integration solutions using the Denodo virtualization tool, reading data from multiple sources including Oracle, Hadoop, and MySQL.
  • Analyzed the data by performing Hive queries and running Pig scripts to know user behavior.
  • Cluster balancing and performance tuning of Hadoop components like HDFS, Hive, Impala, MapReduce, Oozie workflows.
  • Created the framework for the dashboard using Tableau and optimized the same using open-source Google optimization tools.
  • Involved in publishing various kinds of live, interactive data visualizations, dashboards, reports, and workbooks from Tableau Desktop to Tableau Server.
  • Extensively participated in translating business needs into Business Intelligence reporting solutions by ensuring the correct selection of toolset available across the Tableau BI suite.
  • Handled importing of data from various data sources using Sqoop, performed transformations using Hive, MapReduce and loaded data into HDFS.
  • Designed and automated dataflow pipelines that ingest data through both real-time and batch processing.
  • Strong exposure to Spark SQL, Spark Streaming, and the core Spark API for building data pipelines (an illustrative streaming sketch follows this list).
  • Developed data pipeline using Flume, Pig, Sqoop to ingest cargo data and customer histories into HDFS for analysis.
  • Configured Sqoop and developed scripts to extract data from MySQL into HDFS.
  • Designed, developed, and implemented solutions with data warehouse, ETL, data analysis, and BI reporting technologies.
  • Used Oozie as the workflow engine and Falcon for job scheduling; debugged technical issues and resolved errors.
  • Experience in preparing deployment packages, deploying to the Dev and QA environments, and preparing deployment instructions for the production deployment team.
  • Analyzed data with Hive, Tez, and Spark SQL, and compared the results between Tez and Spark SQL.
  • Wrote Hive queries that structure the log data in tabular format to facilitate effective querying for business analytics.
  • Triggered workflows based on time or availability of data using Oozie.
  • Handled all infrastructure installation: Spark and Scala coding, sbt build architecture, Docker architecture and container design, Kafka cluster installation and topic design, Filebeat architecture, Zeppelin configuration, and Cassandra cluster installation and database design.
  • Performed visualization using SQL integrated with Zeppelin on different input data and created rich dashboards.
  • Worked on overall architecture assessment of the technology stack and the adaptability of the open-source systems.
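
As referenced above, a minimal PySpark Structured Streaming sketch that reads from a Kafka topic and lands the events on HDFS as Parquet (broker address, topic, and paths are hypothetical; assumes the spark-sql-kafka package is supplied, e.g., via --packages):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka_to_hdfs_example").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
              .option("subscribe", "cargo_events")                 # hypothetical topic
              .load()
              .select(F.col("value").cast("string").alias("payload"),
                      F.col("timestamp")))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/cargo_events/")             # hypothetical path
             .option("checkpointLocation", "hdfs:///chk/cargo_events/")
             .trigger(processingTime="1 minute")
             .start())
    query.awaitTermination()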

Environment: Hadoop, MapReduce, Tableau, AWS, EMR, HBase, NiFi, Hive, Impala, Pig, Sqoop, HDFS, Flume, Oozie, Spark, Spark SQL, Spark Streaming, Scala, Cloud Foundry, Kafka, and Confidential.

Confidential, Warangal

Big Data Engineer

Responsibilities:

  • Integrated Metadata Management, data quality assessment, data quality monitoring, cleansing package builder for data stewards and business analysts to collaborate and govern trustworthiness of data.
  • Upgraded the Hadoop Cluster from CDH 3 to CDH 4, setting up High Availability Cluster and integrating HIVE with existing applications.
  • Worked on analyzing Hadoop cluster and different big data analytic tools including Pig, HBase database and Sqoop.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics.
  • Transferred data between servers using tools such as Data Transformation Services (DTS) and SQL Server Integration Services (SSIS) 2005/2008/2013.
  • SQL Server 2005/2008 RDBMS database development, including T-SQL programming.
  • Created stored procedures, functions, indexes, tables, and views, and wrote T-SQL code and SQL joins for applications.
  • Created stored procedures to transform the data and wrote T-SQL statements for the various transformations while creating the packages.
  • Designed, developed, tested, and maintained Tableau functional reports based on user requirements.
  • Good knowledge of NoSQL databases such as DynamoDB and MongoDB.
  • Created HBase tables to store various data formats of PII data coming from different portfolios.
  • Configured Sqoop and developed scripts to extract data from MySQL into HDFS.
  • Developed scripts in Hive to perform transformations on the data and load to target systems for use by the data analysts for reporting.
  • Designed and developed an application based on the Spring framework using MVC design patterns.
  • Used AWS Infrastructure and features of AWS like S3, EC2, RDS, ELB to host the portal.
  • Used Git for version control and JIRA for project tracking.
  • Demonstrated best practices for unit testing, CI/CD, performance testing, capacity planning, documentation, monitoring, alerting, and incident response.
  • Consistently attended meetings with the client subject matter experts to acquire functional business requirements.
  • Integrated DQ workflows with DI ETL jobs with cleansing and matching logic.
  • Created several data warehouses and data marts from the ground up; created conceptual, logical, and physical models along with processes, and defined enterprise standards and swim lanes.
  • Architectural and in-depth knowledge of RDBMS and columnar databases such as MS SQL, Azure SQL DB, Azure SQL DW, and ADLS (Azure Data Lake Store).
  • Hands-on working experience dealing with multiple file formats such as JSON, Avro, and Parquet, and leveraging Storm/Spark/Pig scripts to load into NoSQL databases such as MongoDB and Cassandra (an illustrative sketch follows this list).
  • Built new Power BI reports and leveraged the Microsoft Azure cloud to publish and pin them.
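
As referenced above, a minimal PySpark sketch of the multi-format loading pattern: reading JSON and Parquet inputs and writing to Cassandra (paths, keyspace, and table names are hypothetical; assumes Spark 3.1+ and the spark-cassandra-connector package on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi_format_to_cassandra_example").getOrCreate()

    json_df = spark.read.json("hdfs:///landing/events_json/")        # hypothetical path
    parquet_df = spark.read.parquet("hdfs:///landing/events_parquet/")

    # Align the two inputs by column name before the union.
    combined = json_df.unionByName(parquet_df, allowMissingColumns=True)

    (combined.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="portfolio", table="pii_events")           # hypothetical target
        .mode("append")
        .save())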

Environment: Hadoop, Talend ETL tool, AWS, Falcon, HDFS, Denodo, MapReduce, Pig, Hive, Sqoop, HBase, Oozie, Flume, Zookeeper, Java, SQL, scripting, Spark, Kafka.
