Big Data Engineer Resume
Parsippany, NJ
SUMMARY
- 10+ years’ experience in IT; 7+ years’ experience in the Hadoop/Big Data/Cloud space.
- Demonstrated hands-on skill with Big Data technologies such as Amazon Web Services (AWS), Microsoft Azure, Apache Kafka, Python, Apache Spark, Hive, and Hadoop.
- Experienced in analyzing Microsoft SQL Server data models and identifying and creating inputs to convert existing dashboards that use Excel as a data source.
- Skilled in Python-based design and development.
- Skilled in creating PySpark DataFrames on multiple projects and integrating them with Kafka.
- Configure Hadoop and Apache Spark in Big Data environments.
- Build AWS CloudFormation templates used alongside Terraform with existing plugins.
- Develop AWS CloudFormation templates to create custom pipeline infrastructure.
- Implement AWS IAM user roles and policies to authenticate and control user access.
- Apply expertise in designing custom reports using data extraction and reporting tools and in developing algorithms based on business cases.
- Performance-tune data-heavy dashboards and reports using extracts, context filters, efficient calculations, data source filters, and indexing and partitioning in the data source.
- Program SQL queries for data validation of reports and dashboards.
- Proven skill in setting up Data Lakes and Big Data ecosystems (Hadoop, Spark, Hortonworks, Cloudera).
- Effective in working with project stakeholders to gather requirements and create As-Is and As-Was dashboards.
- Recommend and apply best practices to improve dashboard performance for Tableau Server users.
- Design and develop custom reports using data extraction and reporting tools, and develop algorithms based on business cases.
- Apply in-depth knowledge of Hadoop architecture and components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce concepts.
- Maintain ELK (Kibana) and write Spark scripts using Scala.
- Configure Spark Streaming to receive real-time data from internal systems and store the streamed data in HDFS (a minimal sketch of this Kafka-to-HDFS pattern follows this summary).
- Create Hive scripts for ETL, create Hive tables, and write Hive queries.
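The summary above references tying PySpark DataFrames into Kafka and landing streamed data on HDFS. Below is a minimal sketch of that pattern using PySpark Structured Streaming; the broker address, topic name, and HDFS paths are placeholders, and the spark-sql-kafka connector is assumed to be available on the cluster.

```python
# Minimal sketch: read events from Kafka and land them on HDFS as Parquet.
# Broker, topic, and paths below are placeholders, not values from the original projects.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs")                             # hypothetical app name
         .getOrCreate())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "events")                       # placeholder topic
          .load()
          .select(col("key").cast("string"),                   # Kafka delivers bytes;
                  col("value").cast("string")))                # cast to strings here

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/events")            # placeholder HDFS path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .start())

query.awaitTermination()
```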
TECHNICAL SKILLS
Apache: Kafka, Flume, Hadoop, YARN, Hive, Maven, Oozie, Spark, ZooKeeper
Distributions & Cloud: Hortonworks, Cloudera, Amazon Web Services (AWS), AWS Lambda, Amazon EMR, ELK, Azure
Scripting: HiveQL, MapReduce, XML, FTP, Python, UNIX/Linux shell scripting
Data Processing (Compute) Engines: Apache Spark, Spark Streaming
Databases: Microsoft SQL Server, Apache HBase, Apache Hive
Data Visualization Tools: Tableau, Power BI
Scheduler Tool: Airflow
File Formats: Parquet, Avro, JSON, ORC, Text, CSV
Operating Systems: Unix/Linux, Windows, Ubuntu, macOS
PROFESSIONAL EXPERIENCE
Big Data Engineer
Confidential
Responsibilities:
- Applied Amazon Web Services (AWS) Cloud services such as EC2, S3, EBS, RDS, VPC, and IAM.
- Set up Puppet master and client nodes and wrote scripts to deploy applications to Dev, QA, and production environments.
- Developed and maintained build/deployment scripts for testing, staging, and production environments using Maven, Shell, ANT, and Perl Scripts.
- Developed Puppet modules and roles/profiles for installation and configuration of software required for various applications/blueprints.
- Wrote Python scripts to manage AWS resources through API calls using the Boto SDK and worked with the AWS CLI (see the sketch after this list).
- Wrote Ansible playbooks to launch AWS instances and used Ansible to manage web applications.
- Created UDFs using Scala.
- Archived data using Amazon Glacier.
- Monitored resources such as Amazon database services, CPU and memory utilization, and EBS volumes.
- Monitored logs to better understand how the system was functioning.
- Automated, configured, and deployed instances on AWS, Azure environments, and Data Centers.
- Worked hands-on with EC2, CloudWatch, and CloudFormation, and managed security groups on AWS.
- Maintained high-availability clustered and standalone server environments and refined automation components with scripting and configuration management (Ansible).
- Implemented Apache Spark and Spark Streaming projects using Scala and Spark SQL.
- Wrote Spark applications using Python and Scala.
- Implemented Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Web Apps, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Visual Studio Online (VSO) and SQL Azure.
- Automated vulnerability-management patching and CI/CD using Chef, GitLab, Jenkins, and AWS/OpenStack.
- Implemented Spark on AWS EMR using PySpark and utilized the DataFrame and Spark SQL APIs for faster data processing.
- Configured network architecture on AWS with VPC, Subnets, Internet gateway, NAT, and Route table.
- Set up systems and network security using CloudWatch and Nagios.
- Worked on the CI/CD pipeline for code deployment using Git, Jenkins, and CodePipeline, from developer code check-in through deployment.
- Customized Kibana for dashboards and reporting to provide visualization of log data and streaming data.
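One bullet above mentions managing AWS resources from Python through API calls with the Boto SDK alongside the AWS CLI. The sketch below shows roughly what such a script can look like with boto3; the region, bucket, and artifact names are illustrative assumptions rather than details from the original environment.

```python
# Minimal boto3 sketch: list running EC2 instances and upload a build artifact to S3.
# Region, bucket, and file names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region
s3 = boto3.client("s3")

# List running EC2 instances
response = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["InstanceType"])

# Upload an artifact to S3 (placeholder bucket and key)
s3.upload_file("app.tar.gz", "my-artifacts-bucket", "releases/app.tar.gz")
```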
Cloud Engineer
Confidential, Parsippany, NJ
Responsibilities:
- Configured, deployed, and automated instances on AWS, Azure environments, and Data Centers.
- Applied EC2, CloudWatch, and CloudFormation, and managed security groups on AWS.
- Programmed software installation shell scripts.
- Programmed scripts to extract data from multiple databases.
- Programmed scripts to schedule Oozie workflows to execute daily tasks.
- Produced distributed query agents to run queries against Hive.
- Loaded data from different sources such as HDFS or HBase into Spark DataFrames and implemented in-memory computation to generate the output response.
- Developed Spark programs using PySpark.
- Produced Hive external tables and designed information models in Hive.
- Produced multiple Spark Streaming and batch Spark jobs using Python.
- Processed terabytes of data in real time using Spark Streaming.
- Created and managed code reviews.
- Wrote streaming applications with Spark Streaming/Kafka.
- Created automated Python scripts to convert data from different sources and to generate ETL pipelines.
- Applied Hive optimization techniques such as partitioning, bucketing, map join, and parallel execution.
- Ingested data from various sources and processed the data at rest using Big Data technologies such as HBase, Hadoop, the MapReduce framework, and Hive.
- Monitored Amazon database services and CPU and memory utilization using CloudWatch.
- Used Spark SQL to achieve faster results compared to Hive during data analysis.
- Converted HiveQL/SQL queries into Spark transformations using Spark RDDs, Python, and Scala (see the sketch after this list).
- Developed JDBC/ODBC connectors between Hive and Spark to transfer newly populated DataFrames from MSSQL.
- Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 Buckets.
- Implemented a Hadoop cluster using the Cloudera distribution on AWS EC2.
- Worked with AWS Lambda functions for event-driven processing to various AWS resources.
- Managed AWS Redshift clusters, including launching clusters by specifying node types and running data analysis queries.
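One bullet above (converting HiveQL/SQL queries into Spark transformations) is sketched below with a hypothetical example: the same aggregation written once through spark.sql and once as PySpark DataFrame transformations. The sales table and its columns are made up for illustration.

```python
# Illustrative sketch: replace a HiveQL aggregation with equivalent DataFrame
# transformations. The "sales" table and its columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hiveql-to-spark")
         .enableHiveSupport()          # lets Spark see existing Hive tables
         .getOrCreate())

# HiveQL form, run as-is through Spark SQL:
hive_df = spark.sql(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM sales WHERE order_date >= '2020-01-01' "
    "GROUP BY customer_id"
)

# The same logic expressed as DataFrame transformations:
df = (spark.table("sales")
      .filter(F.col("order_date") >= "2020-01-01")
      .groupBy("customer_id")
      .agg(F.sum("amount").alias("total")))

df.show()   # hive_df and df produce the same result
```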
Data Engineer
Confidential, Houston, TX
Responsibilities:
- Designed a relational database management system (RDBMS) and integrated it with Sqoop and HDFS.
- Created Hive external tables on the RAW data layer pointing to HDFS locations (see the sketch after this list).
- Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
- Utilized Hive Query Language (HQL) to create Hive internal tables on a data service layer (DSL) Hive database.
- Integrated Hive internal tables with Apache HBase data store for data analysis and read/write access.
- Developed Spark jobs using Spark SQL, PySpark, and the DataFrames API to process structured and unstructured data on Spark clusters.
- Analyzed and processed Hive internal tables according to business requirements and saved the new queried tables in the application service layer (ASL) Hive database.
- Developed in-depth knowledge of Hadoop architecture and components such as HDFS, Name Node, Data Node, Resource Manager, Secondary Name Node, Node Manager, and MapReduce concepts.
- Installed, configured, and tested Sqoop data ingestion tool and Hadoop ecosystems.
- Imported and appended data using Sqoop from different Relational Database Systems to HDFS.
- Exported and inserted data from HDFS into Relational Database Systems using Sqoop.
- Automated the pipeline flow using Bash script.
- Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Connected Business Intelligence tools such as Tableau and Power BI to tables in the data warehouse.
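The raw-layer external tables and DSL internal tables described above can be driven from PySpark with Hive support enabled. The sketch below is a minimal illustration; the database names (raw, dsl), table names, columns, and HDFS location are placeholders.

```python
# Sketch of a raw -> DSL layering: an external Hive table over files already in
# HDFS, then a curated managed table built from it. All names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("raw-to-dsl")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS raw")
spark.sql("CREATE DATABASE IF NOT EXISTS dsl")

# External table pointing at raw delimited files already sitting in HDFS
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw.orders (
        order_id STRING, customer_id STRING, amount STRING, order_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs:///data/raw/orders'
""")

# Managed (internal) table in the DSL database, typed and stored as ORC
spark.sql("""
    CREATE TABLE IF NOT EXISTS dsl.orders STORED AS ORC AS
    SELECT order_id,
           customer_id,
           CAST(amount AS DOUBLE) AS amount,
           TO_DATE(order_date)    AS order_date
    FROM raw.orders
    WHERE order_id IS NOT NULL
""")
```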
Hadoop Data Engineer
Confidential, Salisbury, NC
Responsibilities:
- Used ZooKeeper and Oozie to coordinate the cluster and schedule workflows.
- Used Sqoop to efficiently transfer data between relational databases and HDFS, and used Flume to stream log data from servers.
- Implemented partitioning and bucketing in Hive for better organization of the data (see the sketch after this list).
- Worked with different file formats and compression techniques to meet standards.
- Loaded data from UNIX file systems into HDFS.
- Used UNIX shell scripts to automate the build process and to perform routine jobs such as file transfers between different hosts.
- Documented technical specs, data flows, data models, and class models.
- Documented requirements gathered from stakeholders.
- Successfully loaded files from Teradata into HDFS and from HDFS into Hive.
- Involved in researching various available technologies, industry trends, and cutting-edge applications.
- Ingested data using Flume with Kafka as the source and HDFS as the sink.
- Performed storage capacity management, performance tuning, and benchmarking of clusters.
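The Hive partitioning and bucketing mentioned above is sketched below as DDL issued through PySpark with Hive support. The table, columns, and the logs_staging source table are hypothetical; bucketing is noted in a comment because it would normally be added in Hive itself.

```python
# Sketch of Hive partitioning driven from PySpark. Table, columns, and the
# logs_staging source table are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning")
         .enableHiveSupport()
         .getOrCreate())

# One directory per load_date so queries only scan the dates they touch.
# (In Hive, bucketing would be added with: CLUSTERED BY (user_id) INTO 32 BUCKETS.)
spark.sql("""
    CREATE TABLE IF NOT EXISTS logs_curated (
        user_id STRING, event STRING, event_ts TIMESTAMP)
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
""")

# Dynamic-partition insert from a staging table (placeholder)
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE logs_curated PARTITION (load_date)
    SELECT user_id, event, event_ts, load_date
    FROM logs_staging
""")
```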
Linux Systems Administrator
Confidential, San Jose, CA
Responsibilities:
- Installed, configured, monitored, and administrated Linux servers.
- Installed, deployed, and managed Red Hat Enterprise Linux, CentOS, and Ubuntu, and installed patches and packages for Red Hat Linux servers.
- Configured and installed Red Hat and CentOS Linux servers on virtual machines and bare-metal installations.
- Performed kernel and database configuration optimization such as I/O resource usage on disks.
- Created and modified users and groups with root permissions.
- Administered local and remote servers using SSH on a daily basis.
- Created and maintained Python scripts for automating build and deployment processes (see the sketch after this list).
- Utilized Nagios-based open-source monitoring tools to monitor Linux Cluster nodes.
- Created users, managed user permissions, maintained user and file system quotas, and installed and configured DNS.
- Worked with the DBA team on database performance issues and network-related issues on Linux/UNIX servers, and with vendors on hardware-related issues.
- Monitored CPU, memory, hardware, and software, including RAID, physical disks, multipath, filesystems, and networks, using the Nagios monitoring tool.
- Hosted servers using Vagrant on Oracle virtual machines.
- Automated daily tasks using Bash scripts, documented changes in the environment and on each server, and analyzed error logs, user logs, and /var/log/messages.
- Adhered to industry standards by securing systems, directory and file permissions, and groups, and by supporting user account management and user creation.
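The Python build/deployment automation referenced above might look roughly like the following: run a build, copy the artifact to a remote host, restart the service, and log every step. The commands, host name, paths, and service name are placeholders, not details from the original environment.

```python
# Minimal sketch of a deployment helper: run commands, log them, stop on failure.
# All commands, hosts, and paths below are placeholders.
import logging
import subprocess
import sys

logging.basicConfig(
    filename="deploy.log",                       # placeholder log location
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def run(cmd):
    """Run a command, log it, and abort the deployment if it fails."""
    logging.info("running: %s", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        logging.error("failed: %s", result.stderr.strip())
        sys.exit(result.returncode)
    return result.stdout

if __name__ == "__main__":
    run(["make", "build"])                                       # placeholder build step
    run(["scp", "app.tar.gz", "deploy@web01:/opt/releases/"])    # placeholder host/path
    run(["ssh", "deploy@web01", "systemctl restart app"])        # placeholder service
    logging.info("deployment finished")
```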