
Big Data Engineer Resume


Madison, NJ

SUMMARY

  • 8+ years of Software/IT and Big Data experience.
  • Experienced developing Big Data projects using Hadoop, Hive, Flume, and other Apache open-source tools and technologies.
  • Skilled in applying Big Data analytics, with hands-on experience in data extraction, transformation, loading, analysis, and visualization on the Cloudera platform (HDFS, Hive, Sqoop, Flume, HBase, Oozie).
  • Experienced working with different Hadoop ecosystem components such as HDFS, HBase, Spark, YARN, Kafka, Zookeeper, Hive, Sqoop, Oozie, and Flume.
  • Experience importing and exporting data between HDFS and Relational Database Management systems using Sqoop.
  • Experienced in application development using Hadoop, RDBMS, and Linux shell scripting, as well as performance tuning.
  • Experienced in loading data into Hive partitions and creating buckets in Hive.
  • In-depth Knowledge of AWS Cloud Services like Compute, Network, Storage, and Identity and Access management.
  • Skilled in working with multi-cluster environments and setting up the Cloudera Hadoop ecosystem.
  • Background with traditional databases such as Oracle, Teradata, SQL Server, ETL tools/processes and data warehousing architectures.
  • Hands-on with Zookeeper and Oozie operational services for coordinating the cluster and scheduling workflows.
  • Proven skill with various Apache component technologies.
  • Experience with web-based languages such as HTML, CSS, PHP, and XML, and with web methodologies including Web Services (SOAP and REST).
  • Extensive knowledge of NoSQL databases such as HBase and Cassandra.
  • Experience working with Cloudera and Hortonworks Distribution of Hadoop.
  • Imported data from various sources, performed transformations using Hive, and loaded data extracted from relational databases such as Oracle, MySQL, and Teradata into HDFS and Hive using Sqoop (see the Hive loading sketch after this list).
  • Expertise writing Hive queries and scripts to load data from the local file system and HDFS into Hive.
  • Hands-on experience fetching live streaming data from RDBMS into HBase tables using Spark Streaming and Apache Kafka.
  • Experience with cloud environments such as Amazon Web Services (AWS), including EC2 and S3.
  • Hands-on experience with the Amazon EMR framework, transferring data to EC2 servers.
  • Expert in extracting data from structured, semi-structured, and unstructured data sets for storage in HDFS.
  • Skilled programming with Python.
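As a minimal illustration of the Sqoop-to-Hive loading pattern above, the following PySpark sketch writes HDFS data into a partitioned, bucketed Hive table; the paths, table name, and columns (orders, order_date, customer_id) are assumptions for illustration, not details from the projects described.

```python
# Minimal sketch: load data already landed in HDFS (e.g., by a Sqoop import)
# into a partitioned, bucketed Hive table. Paths, table, and column names are
# assumptions for illustration only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partition-load")
    .enableHiveSupport()   # needed so saveAsTable targets the Hive metastore
    .getOrCreate()
)

# Read raw records previously imported into HDFS (hypothetical path).
raw = spark.read.parquet("hdfs:///data/raw/orders")

(
    raw.write
    .mode("append")
    .partitionBy("order_date")        # one partition directory per date
    .bucketBy(16, "customer_id")      # 16 buckets keyed on customer_id
    .sortBy("customer_id")
    .format("parquet")
    .saveAsTable("analytics.orders")  # managed Hive table
)
```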

TECHNICAL SKILLS

Scripting: Python, PHP

Big Data Platforms: Hadoop, Cloudera Hadoop, Hortonworks Hadoop, Cloudera Impala, Talend, Informatica, AWS, Microsoft Azure, Adobe Cloud, Elastic Cloud, Anaconda Cloud

Big Data Tools: Apache Hue, Apache Sqoop, Spark, Scala, Hive, HDFS, Zookeeper, Oozie, Airflow

Database Technologies: SQL Server 2008 R2/2012/2014, MySQL, SQL Server Reporting Services (SSRS), SQL Server Integration Services (SSIS), SQL Server Analysis Services (SSAS), SQL Server Management Studio; NoSQL: HBase, Cassandra

Data Processing: ETL Processes, EDP, Real-time processing, Batch processing, Streaming Processes, Cloud Security, Cloud Filtering, Linear and Logistic regression, DataCleaner

SharePoint Technologies: Workflows, Event Receivers, Web Parts, Site Definitions, Site Templates, Timer Jobs, SharePoint Hosted Apps, Provider Hosted Apps, SharePoint Search, Business Connectivity Services (BCS), SharePoint User Profiles, Master Pages, Page Layouts, Managed Metadata, Taxonomy, Templates, SharePoint Designer, InfoPath, Nintex Forms, ShareGate, Metalogix, OAuth, Visual Studio, MS Office.

BI/Reporting & Visualization: Business Analysis, Data Analysis, Use of Dashboards and Visualization Tools, Power BI, Tableau

PROFESSIONAL EXPERIENCE

Big Data Engineer

Confidential, Madison, NJ

Responsibilities:

  • Reviewing business requirements documents for completeness and analyzing actual and forecast budgeting/forecasting requirements.
  • Developing Spark jobs using Spark SQL, Python, and the DataFrame API to process structured data in Spark clusters (see the Spark job sketch after this list).
  • Writing Spark applications for data validation, cleansing, transformation, and custom aggregation, and tuning Spark to improve job performance.
  • Utilizing the DataFrame and Spark SQL APIs for faster data processing.
  • Programming a Python function to ingest image responses into a Kafka producer (see the producer sketch after this list).
  • Drafting a schema for a custom HBase table.
  • Working on AWS Kinesis for processing real-time data.
  • Handling schema changes in the data stream using Spark.
  • Developing multiple SQL queries to join tables and create dashboards.
  • Applying AWS step-function for orchestrating and automating the pipeline.
  • Configuring master and slave nodes for Spark.
  • Creating Kafka topics for Kafka brokers to consume data from a REST API.
  • Processing multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and Amazon Redshift.
  • Using AWS EMR to process big data across Hadoop clusters of virtual servers on Amazon Simple Storage Service (S3).
  • Implementing AWS security measures (e.g., AWS Identity and Access Management (IAM)).
  • Ingesting data through AWS Kinesis Data Stream and Firehose from various sources to S3.
  • Monitoring and managing services with AWS CloudWatch.
  • Querying with Athena on data residing in AWS S3 bucket.
  • Creating a pipeline to gather data using PySpark, Kafka, and Hive.
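A hedged sketch of the kind of Spark SQL / DataFrame job referenced above, covering validation, cleansing, and custom aggregation; the source path, column names, and rules are assumptions for illustration.

```python
# Sketch of a Spark SQL / DataFrame job: validate, cleanse, and aggregate
# structured records. Source path, columns, and rules are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("validate-aggregate").getOrCreate()

events = spark.read.json("s3a://example-bucket/events/")  # hypothetical source

cleaned = (
    events
    .filter(F.col("event_id").isNotNull())                 # basic validation
    .withColumn("amount", F.col("amount").cast("double"))  # type cleansing
    .dropDuplicates(["event_id"])
)

# Custom aggregation: per-customer totals and distinct-event counts per day.
daily = (
    cleaned
    .groupBy("customer_id", F.to_date("event_ts").alias("event_date"))
    .agg(
        F.sum("amount").alias("total_amount"),
        F.countDistinct("event_id").alias("event_count"),
    )
)

daily.write.mode("overwrite").parquet("s3a://example-bucket/curated/daily/")
```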
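A minimal kafka-python producer sketch for the image-response ingestion bullet above; the broker address, topic name, and fetch URL are hypothetical.

```python
# Minimal sketch of a Kafka producer that publishes image responses fetched
# from a REST endpoint. Broker address, topic name, and URL are assumptions.
import requests
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    # image payloads are raw bytes, so no value serializer is configured
)

def publish_image(url: str, topic: str = "image-responses") -> None:
    """Fetch an image over HTTP and publish the raw bytes to a Kafka topic."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    producer.send(topic, value=response.content, key=url.encode("utf-8"))

if __name__ == "__main__":
    publish_image("https://example.com/camera/frame.jpg")  # hypothetical URL
    producer.flush()  # block until buffered records are delivered
```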

Cloud Data Engineer

Confidential, Ashburn, VA

Responsibilities:

  • Wrote Flume and HiveQL scripts to extract, transform, and load data into the database.
  • Set up extract/transform/load (ETL) processes into the Hadoop Distributed File System (HDFS) and wrote Hive UDFs.
  • Implemented Spark on EMR to process Big Data across the data lake in AWS.
  • Worked with Amazon AWS IAM console to create custom users and groups.
  • Created an AWS Lambda function to extract data from Kinesis Firehose and post it to an AWS S3 bucket on a schedule (every 4 hours) using an AWS CloudWatch event (see the Lambda sketch after this list).
  • Implemented serverless architecture using AWS Lambda with Amazon S3 and Amazon DynamoDB.
  • Migrated data between database platforms, such as from a local SQL Server to Amazon RDS and EMR Hive.
  • Used Spark DataFrame API over Cloudera platform to perform analytics on Hive data.
  • Configured the Flume agent with the source, memory as the channel, and HDFS as the sink.
  • Configured the Flume agent's batch size, channel capacity, transaction capacity, roll size, roll count, and roll interval.
  • Worked with different big data formats such as Avro, CSV, Parquet, JSON, and SequenceFile.
  • Performed performance tuning and provided a successful migration path to Redshift clusters and AWS RDS database engines.
  • Worked on AWS S3 bucket integration for application and development projects.
  • Managed and reviewed Hadoop log files in AWS S3.
  • Designed logical and physical data models for various data sources on Confidential's Amazon Redshift.
  • Designed and developed ETL jobs to extract data from a Salesforce replica and load it into a data mart in Amazon Redshift.
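A hedged sketch of a scheduled Lambda handler of the kind described above: it copies objects that Kinesis Data Firehose delivered to a staging prefix into a curated S3 bucket. Bucket and prefix names are assumptions; the every-4-hours schedule lives on the CloudWatch Events/EventBridge rule, not in the code.

```python
# Hedged sketch: scheduled Lambda handler that copies Firehose-delivered
# objects from a staging prefix to a curated bucket. Names are assumptions.
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "example-firehose-staging"  # hypothetical
TARGET_BUCKET = "example-curated-data"      # hypothetical

def lambda_handler(event, context):
    copied = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix="firehose/"):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.copy_object(
                Bucket=TARGET_BUCKET,
                Key=key,
                CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
            )
            copied += 1
    return {"copied_objects": copied}
```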

Big Data Developer

Confidential, Sylmar, CA

Responsibilities:

  • Built large-scale and complex data processing pipelines.
  • Built Apache Spark data structures and programmed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Used Spark SQL to load JSON data, created schema RDDs, loaded them into Hive tables, and handled structured data with Spark SQL.
  • Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
  • Analyzed Hadoop clusters using big data analytics tools, including MapReduce, Hive, HDFS, Spark, Kafka, and Apache NiFi.
  • Installed Oozie workflow engine to run multiple Spark Jobs.
  • Created a Kafka broker and used a declared schema to fetch structured data with Spark Structured Streaming (see the streaming sketch after this list).
  • Used Spark to interact with and process data residing in HDFS.
  • Configured Spark Streaming to receive real-time data and store it in HDFS.
  • Ingested production line data from IoT sources through internal RESTful APIs.
  • Created Python scripts to download data from the APIs and perform pre-cleaning steps.
  • Built Spark applications to perform data enrichments and transformations using Spark DataFrames with Cassandra lookups.
  • Performed ETL operations on IoT data using PySpark.
  • Wrote user-defined functions (UDFs) to apply custom business logic to datasets using PySpark.
  • Configured AWS S3 to receive and store data from the resulting PySpark job.
  • Wrote Airflow DAGs to schedule and automatically execute the pipeline (see the DAG sketch after this list).
  • Performed cluster-level and code-level Spark optimizations.
  • Created AWS Redshift Spectrum external tables to query data in S3 directly for data analysts.
  • Configured Zookeeper to coordinate the servers in the cluster, maintain data consistency, and monitor services.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs and Python.
  • Installed and configured Tableau Desktop to connect, through the Hortonworks JDBC connector, to the Hortonworks Hive database containing the bandwidth data for further analytics.
  • Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau.
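A hedged Spark Structured Streaming sketch matching the Kafka-to-HDFS bullets above: it applies a declared schema to the JSON payload and writes Parquet to HDFS. The broker, topic, schema fields, and paths are assumptions (the Kafka source also requires the spark-sql-kafka package on the classpath).

```python
# Sketch: consume a Kafka topic with Structured Streaming, apply a schema to
# the JSON payload, and write to HDFS. All names and paths are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_ts", TimestampType()),
])

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "iot-readings")
    .load()
)

parsed = (
    stream
    .select(F.from_json(F.col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/iot/readings")
    .option("checkpointLocation", "hdfs:///checkpoints/iot-readings")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```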
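A minimal Airflow DAG sketch for the pipeline scheduling bullet above; the task IDs, schedule, and script paths are hypothetical, and the jobs are launched with BashOperator for simplicity.

```python
# Sketch of an Airflow DAG that chains a download step and a Spark transform.
# Task IDs, schedule, and script paths are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="iot_pipeline",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    download = BashOperator(
        task_id="download_api_data",
        bash_command="python /opt/jobs/download_api_data.py",  # hypothetical script
    )

    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/jobs/transform_iot.py",  # hypothetical job
    )

    download >> transform
```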

Hadoop Administrator

Confidential, Dallas, TX

Responsibilities:

  • Helped design back-up and disaster recovery methodologies involving Hadoop clusters and related databases.
  • Performed cluster capacity and growth planning and recommended nodes configuration.
  • Worked with structured and highly unstructured data totaling 1.2 PB in raw size.
  • Performed Hadoop system administration using Hortonworks/Ambari and Linux system administration (RHEL 7, CentOS).
  • Configured YARN Capacity and Fair scheduler based on organizational needs.
  • Optimized and integrated Hive, Sqoop, and Flume into existing ETL processes, accelerating the extraction, transformation, and loading of massive structured and unstructured data.
  • Used Hive to simulate a data warehouse for client-based transit system analytics.
  • Monitored production cluster by setting up alerts and notifications using metrics thresholds.
  • Tuned MapReduce counters for faster and optimal data processing.
  • Performed upgrades, patches and fixes using either rolling or express method.
  • Completed HDFS balancing and fine-tuning for MapReduce applications.
  • Developed data migration plan for other data sources into the Hadoop system.

Software/Database Analyst

Confidential, Charlotte, NC

Responsibilities:

  • Configured tables and supported database operations.
  • Implemented mail alert mechanism for alerting users when their selection criteria were met.
  • Conducted unit and systems integration tests to ensure system functioned to specification.
  • Established communications interfacing between the software program and the database backend.
  • Programmed a range of functions (e.g., automating logistical tracking and analysis, and automating schematic measurements).
  • Developed client-side testing/validation using JavaScript.
  • Worked hands-on with technologies such as XML, Java, and JavaScript.
