
Big Data Engineer Resume


Dallas, TX

SUMMARY

  • Around 6 years of technical experience as a Big Data Engineer, Data Modeler, Data Architect, and Data Analyst, including the design, development, and implementation of data models for enterprise-level applications and systems.
  • Expertise in writing Hadoop Jobs to analyze data using MapReduce, Apache Crunch, Hive, Pig, and Splunk.
  • Experienced in using distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, and Spark, and in the effective use of MapReduce, SQL, and Cassandra to solve big data problems.
  • Experience in developing Spark programs for batch and real-time processing (a minimal PySpark sketch follows this list); experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, generating data visualizations using R, SAS, and Python, and creating dashboards using tools like Tableau.
  • Experienced in configuring and administering the Hadoop Cluster using major Hadoop Distributions like Apache Hadoop and Cloudera.
  • Expertise in integration of various data sources like RDBMS, Spreadsheets, Text files, JSON and XML files.
  • Solid knowledge of Data Marts, Operational Data Store (ODS), OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake modeling for fact and dimension tables) using Analysis Services.
  • Expertise in data architecture, data modeling, data migration, data profiling, data cleansing, transformation, integration, data import, and data export using ETL tools such as Informatica PowerCenter.
  • Experience in designing, building, and implementing a complete Hadoop ecosystem comprising MapReduce, HDFS, Hive, Impala, Pig, Sqoop, Oozie, HBase, MongoDB, and Spark.
  • Experience with client-server application development using Oracle PL/SQL, SQL*Plus, SQL Developer, TOAD, and SQL*Loader.
  • Strong experience architecting highly performant databases using PostgreSQL, PostGIS, MySQL, and Cassandra.
  • Extensive experience in using ER modeling tools such as Erwin and ER/Studio, Teradata, BTEQ, MLDM and MDM.
  • Experienced in R and Python for statistical computing; also experienced with MLlib (Spark), MATLAB, Excel, Minitab, SPSS, and SAS.
  • Extensive experience in loading and analyzing large datasets with Hadoop framework (MapReduce, HDFS, PIG, HIVE, Flume, Sqoop, SPARK, Impala, Scala), NoSQL databases like MongoDB, HBase, Cassandra.
  • Experienced in implementing a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and ZooKeeper-based log collection platform (a Python sketch of the same idea appears after this list).
  • Excellent experience with NoSQL databases like MongoDB and Cassandra, and in writing Apache Spark Streaming applications on a big data distribution in an active cluster environment.
  • Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
  • Strong experience working with databases like Teradata and proficiency in writing complex SQL and PL/SQL for creating tables, views, indexes, stored procedures, and functions.
  • Experience in importing and exporting Terabytes of data between HDFS and Relational Database Systems using Sqoop.
  • Performed performance tuning at the source, target, and DataStage job levels using indexes, hints, and partitioning in DB2, Oracle, and DataStage.
  • Strong knowledge of Software Development Life Cycle (SDLC) and expertise in detailed design documentation.
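
Illustrative sketch of the batch-processing pattern mentioned above: a minimal PySpark job that reads semi-structured JSON from HDFS, aggregates it, and writes the result to a Hive table. The HDFS path, column names, and the analytics.daily_events table are illustrative assumptions, not details from a specific engagement.

    # Minimal PySpark batch sketch; paths, columns, and table names are assumptions.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("daily-event-rollup")
             .enableHiveSupport()
             .getOrCreate())

    # Read semi-structured JSON landed on HDFS.
    events = spark.read.json("hdfs:///data/raw/events/")

    # Batch aggregation: event counts per type per hour.
    rollup = (events
              .withColumn("event_hour", F.date_trunc("hour", F.col("event_ts")))
              .groupBy("event_hour", "event_type")
              .agg(F.count("*").alias("event_count")))

    # Persist the result to a Hive table for downstream reporting and dashboards.
    rollup.write.mode("overwrite").saveAsTable("analytics.daily_events")

    spark.stop()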
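
Illustrative sketch of the log-producer idea above (originally implemented in Scala): a hedged Python version that tails an application log and publishes new lines to a Kafka topic using the kafka-python client. The log path, topic name, and broker address are assumptions.

    # Hypothetical log-forwarding sketch: tail an application log and publish new lines to Kafka.
    import time
    from kafka import KafkaProducer  # kafka-python client (assumed available)

    producer = KafkaProducer(bootstrap_servers="kafka-broker:9092")

    def follow(path):
        """Yield lines appended to the file, similar to `tail -f`."""
        with open(path, "r") as handle:
            handle.seek(0, 2)  # start at the end of the file so only new entries are sent
            while True:
                line = handle.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                yield line.rstrip("\n")

    for record in follow("/var/log/app/application.log"):
        producer.send("application-logs", record.encode("utf-8"))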

TECHNICAL SKILLS

Big Data Technologies: MapReduce, HBase, HDFS, Sqoop, Spark, Hadoop, Hive, PIG, Impala

Cloud Architecture: AWS EC2, Elasticsearch, Elastic Load Balancing & Azure

Databases: Oracle, SQL Server, MySQL, HBase, MongoDB, DynamoDB and ElastiCache

OLAP tools: Tableau, SAP BO, SSAS, Business Objects & Crystal Reports

Operating System: Linux, Unix and Windows

Web Technologies: HTML, CSS, JavaScript, XML, REST

Tools and IDEs: Eclipse, Maven, ANT, DbVisualizer

Languages: C, C++, Java, Python, SQL, HiveQL

Web Application Servers: Apache Tomcat, Weblogic, JBoss

PROFESSIONAL EXPERIENCE

Confidential, Dallas, TX

Big Data Engineer

Responsibilities:

  • Responsible for implementation and ongoing administration of Hadoop infrastructure.
  • General operational expertise such as good troubleshooting skills, understanding of system's capacity, bottlenecks, basics of memory, CPU, OS, storage, and networks.
  • Working with data delivery teams to setup new Hadoop users. This job includes setting up Linux users, setting up Kerberos principals and HDFS, Hive and MapReduce access for the new users.
  • Responsible for upgrading and configuring HDP 2.6.4 to HDP 3.1.0, including upgrading Ambari 2.6.1 to Ambari 2.7.3.
  • Install and configure Data Analytics Studio (DAS) in HDP 3.1.
  • Manage and review Hadoop log files.
  • Collaborating with application teams to install operating system and Hadoop updates, patches, version upgrades when required.
  • Troubleshoot connectivity issues with applications or tools (e.g., QlikView, Qlik Sense, SAS, MongoDB, Informatica, and R) and memory issues for Spark.
  • Responsible for installing, upgrading, and configuring R and RStudio (3.5.3 to 3.6.2) with multi-user authentication.
  • Documenting project design and test plan for various projects landing on Hadoop platform.
  • Work closely with platform Data Engineering teams and Data Scientist team to set level expectations for big data projects.
  • Installed and configured multiple versions of Python (2.7.5 and 3.7.3) in the HDP environment for data science and development users.
  • Performed several upgrades on the Hortonworks distribution of Hadoop using Ambari; responsible for installing and configuring the Anaconda distribution and setting up JupyterHub for multi-user access, including Python 3, PySpark, and R kernels in Jupyter notebooks (see the session sketch after this list).
  • Implement best practices to configure and tune Big Data environments, application and services, including capacity scheduling.
  • Experience monitoring overall infrastructure security and availability, and monitoring space and capacity usage across Hadoop, Hadoop clusters, and Hadoop APIs.
  • Configured and tuned Hadoop using the various configuration files available within Hadoop; responsible for loading and managing unstructured and semi-structured data coming into the Hadoop cluster from different sources using Flume.
  • Knowledge on Hadoop Architecture and ecosystems such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, YARN and Map Reduce programming paradigm.
  • Experience with deploying Hadoop in a VM and AWS Cloud as well as physical server environment.
  • Monitor Hadoop cluster connectivity, security, and file system management.
  • Perform capacity planning based on Enterprise project pipeline and Enterprise Big Data roadmap.
  • Provide technical inputs during project solution design, development, deployment and maintenance phases.
  • Work closely with hardware & software vendors, design & implement optimal solutions.
  • Assist and advise network architecture and datacenter teams during hardware installation, configuration, and troubleshooting.
  • Provide guidance and assistance for administrators in such areas as server builds, operating system upgrades, capacity planning, performance tuning.
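
Illustrative sketch of how a JupyterHub PySpark session on such a cluster might be configured. The Anaconda interpreter path, YARN queue name, and executor sizing are assumptions rather than values from the actual environment.

    # Hypothetical PySpark session configuration for a JupyterHub notebook on YARN.
    import os
    from pyspark.sql import SparkSession

    # Point executors at the same Python 3 interpreter the notebook kernel uses.
    os.environ["PYSPARK_PYTHON"] = "/opt/anaconda3/bin/python3"

    spark = (SparkSession.builder
             .appName("jupyterhub-pyspark-session")
             .master("yarn")
             .config("spark.yarn.queue", "datascience")
             .config("spark.executor.memory", "4g")
             .getOrCreate())

    print(spark.version)  # quick sanity check that the session came up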

Confidential, Minneapolis, MN

Big Data Engineer

Responsibilities:

  • Worked as Data Engineer on several Hadoop Ecosystem components with Cloudera Hadoop distribution.
  • Worked on managing and reviewing Hadoop log files.
  • Tested and reported defects in Agile Methodology perspective.
  • Worked on migrating Pig scripts to Spark and Spark SQL to improve performance.
  • Extensively involved in writing Oracle PL/SQL stored procedures, functions, and packages.
  • Loaded data from different sources (databases and files) into Hive using the Talend tool.
  • Worked with NoSQL databases like HBase in creating tables to load large sets of semi structured data coming from source systems.
  • Worked on interviewing business users to gather requirements and documenting requirements.
  • Used Flume to collect, aggregate, and store web log data from different sources.
  • Imported and exported data into HDFS and Hive using Sqoop and Flume.
  • Used Pattern matching algorithms to recognize the customer across different sources and built risk profiles for each customer using Hive and stored the results in HBase.
  • Implemented a proof of concept deploying this product in Amazon Web Services AWS.
  • Developed and maintained stored procedures, implemented changes to database design including tables.
  • Ingested data from various sources and processed data at rest utilizing big data technologies.
  • Developed advanced PL/SQL packages, procedures, triggers, functions, indexes, and collections to implement business logic using SQL Navigator.
  • Worked with AWS to implement client-side encryption, as DynamoDB did not support encryption at rest at the time.
  • Provided thought leadership for the architecture and design of Big Data Analytics solutions for customers, actively driving Proof of Concept (POC) and Proof of Technology (POT) evaluations to implement Big Data solutions.
  • Created integrated relational 3NF models that can functionally relate to other subject areas, and was responsible for determining the corresponding transformation rules in the Functional Specification Document.
  • Involved in reports development using reporting tools.
  • Loaded and transformed huge sets of structured, semi structured and unstructured data.
  • Developed and implemented logical and physical data models using the enterprise modeling tool Erwin.
  • Created Hive queries and tables that helped the line of business identify trends by applying strategies to historical data before promoting them to production (a representative query sketch follows this list).
  • Developed Pig scripts to parse the raw data, populate staging tables and store the refined data in partitioned DB2 tables for Business analysis.
  • Designed and developed cubes using SQL Server Analysis Services (SSAS) using Microsoft Visual Studio.
  • Performed performance tuning of OLTP and Data warehouse environments using SQL.
  • Created data structures to store the dimensions in an effective way to retrieve, delete, and insert the data.
  • Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.
  • Implemented referential integrity using primary key and foreign key relationships.
  • Developed staging jobs that consume data from different sources.
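
Illustrative sketch of the kind of Hive trend query described above, issued through a Hive-enabled SparkSession; the retail.customer_txn table and its columns are hypothetical.

    # Hypothetical month-over-month trend query over historical Hive data.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("monthly-spend-trend")
             .enableHiveSupport()
             .getOrCreate())

    trend = spark.sql("""
        SELECT customer_segment,
               date_format(txn_date, 'yyyy-MM') AS txn_month,
               SUM(txn_amount)                  AS total_spend,
               COUNT(DISTINCT customer_id)      AS active_customers
        FROM   retail.customer_txn
        GROUP BY customer_segment, date_format(txn_date, 'yyyy-MM')
        ORDER BY customer_segment, txn_month
    """)

    trend.show(20, truncate=False)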

Environment: HBase, Oozie 4.3, Hive 2.3, Sqoop 1.4, SDLC, OLTP, SSAS, SQL, Oracle 12c, PL/SQL, ETL, AWS, Sqoop, Flume.

Confidential

Big Data Developer

Responsibilities:

  • Responsible for loading customer data and event logs from Kafka into HBase using the REST API.
  • Worked on debugging, performance tuning, and analyzing data using the Hadoop components Hive and Pig.
  • Imported streaming data into HBase using Apache Storm and Apache Kafka and designed Hive tables on top.
  • Created Hive tables from JSON data using data serialization frameworks like Avro.
  • Developed multiple POCs using PySpark and deployed them on the YARN cluster; compared the performance of Spark with Hive and SQL/Teradata.
  • Deployed Hadoop cluster using Cloudera Hadoop 4 (CDH4) with Pig, Hive, HBase and Spark.
  • Developed RESTful web services using Spring Boot and deployed them to Pivotal Web Services.
  • Used build and deployment tools like Maven.
  • Involved in Test Driven Development (TDD).
  • Developed Kafka producers and consumers, HBase clients, Spark and Hadoop MapReduce jobs, along with components on HDFS and Hive (a consumer sketch follows this list).
  • Importing and exporting data into HDFS and Hive using Sqoop.
  • Responsible for processing unstructured data using Pig and Hive.
  • Managed and reviewed Hadoop log files. Used Scala to integrate Spark with Hadoop.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Extensively used Pig for data cleansing and HIVE queries for the analysts.
  • Created Pig script jobs with attention to query optimization.
  • Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance (a DDL sketch follows this list).
  • Worked on various Business Object Reporting functionalities such as Slice and Dice, Master/detail, User Response function and different Formulas.
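
Illustrative sketch of a Kafka consumer of the sort mentioned above, written with the kafka-python client; the topic, consumer group, and broker address are assumptions.

    # Hypothetical consumer that reads log events from Kafka for downstream processing.
    from kafka import KafkaConsumer  # kafka-python client (assumed available)

    consumer = KafkaConsumer(
        "application-logs",
        bootstrap_servers="kafka-broker:9092",
        group_id="log-processing",
        auto_offset_reset="earliest",
        value_deserializer=lambda raw: raw.decode("utf-8"),
    )

    for message in consumer:
        # In a real pipeline the value would be parsed and written to HBase or HDFS;
        # printing keeps the sketch self-contained.
        print(f"{message.topic}:{message.partition}:{message.offset} {message.value}")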
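
Illustrative sketch of partitioned external vs. managed Hive tables, issued through a Hive-enabled SparkSession; the database names, columns, and HDFS locations are hypothetical.

    # Hypothetical DDL contrasting an external partitioned table with a managed table.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-table-layout")
             .enableHiveSupport()
             .getOrCreate())

    # External table: Hive tracks only metadata; the data stays at the HDFS location.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS staging.web_logs (
            user_id STRING,
            url     STRING,
            status  INT
        )
        PARTITIONED BY (log_date STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///data/landing/web_logs'
    """)

    # Managed table: dropping it also removes the underlying data files.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.web_logs_daily (
            log_date STRING,
            hits     BIGINT
        )
        STORED AS ORC
    """)

    # Register a newly landed partition on the external table.
    spark.sql("ALTER TABLE staging.web_logs ADD IF NOT EXISTS PARTITION (log_date='2019-06-01')")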
