Senior Big Data Engineer Resume
Boise, ID
SUMMARY
- Over 8 years of experience as a Big Data Engineer/Data Engineer and Data Analyst, including designing, developing, and implementing data models for enterprise-level applications and systems.
- Hands-on experience with Spark Core, Spark SQL, and Spark Streaming, and with creating and handling DataFrames in Spark with Scala.
- Analyzed data and provided insights using R and Python pandas.
- Expertise in Business Intelligence, Data warehousing technologies, ETL and Big Data technologies.
- Experience in creating ETL mappings using Informatica to move data from multiple sources, such as flat files and Oracle, into a common target area such as a data warehouse.
- Experience in writing PL/SQL statements: stored procedures, functions, triggers, and packages.
- Expertise in Hadoop components - HDFS, YARN, Name Node, Data Node and Apache Spark.
- Proficient in data analysis, cleansing, transformation, data migration, data integration, data import, and data export using ETL tools such as Informatica.
- Implemented large scale technical solutions using Object Oriented Design and Programming concepts using Python.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage, experienced in maintaining the Hadoop cluster on AWS EMR.
- Experience with NoSQL databases, including table row key design and loading and retrieving data for real-time processing, with performance improvements based on data access patterns.
- Hands-on experience with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Experience in working with Flume and NiFi for loading log files into Hadoop.
- Hands-on experience with Hadoop architecture and various components such as the Hadoop file system (HDFS), Job Tracker, Task Tracker, Name Node, Data Node, and Hadoop Map Reduce programming.
- Experienced in using various Python libraries like NumPy, SciPy, Python-Twitter, Pandas.
- Experience in utilizing SAS Procedures, Macros, and other SAS applications for data extraction using Oracle and Teradata.
- Extensive experience in relational data modeling, dimensional data modeling, logical/physical design, and ER diagrams.
- Cloudera certified Developer for Apache Hadoop. Good knowledge of Cassandra, Hive, Pig, HDFS, Sqoop and Map Reduce.
- Extensive experience in Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and Map Reduce concepts.
- Experience in building large scale highly available Web Applications. Working knowledge of web services and other integration patterns.
- Understanding of data storage and retrieval techniques, ETL and databases to include Key-Value data stores, document data stores.
- Involved in creating database objects like tables, views, procedures, triggers, and functions using T-SQL to provide definition, structure and to maintain data efficiently.
- Hands-on experience with Spark Core, Spark SQL, and Data Frames/Data Sets/RDD API.
- Developed applications using Spark and Scala for data processing.
- Performed predictive Modeling, Pattern Discovery, Market Basket Analysis, Segmentation Analysis, Regression Models, and Clustering.
- Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
- Experience in developing Map Reduce Programs using Apache Hadoop for analyzing the big data as per the requirement.
- Experience with Amazon Web Services (AWS) such as S3, EC2, and EMR, and with Microsoft Azure.
- Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage, Azure AD Licenses, Office365).
- Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML and Power BI.
- Exposure to Apache Kafka for developing data pipelines that handle logs as a stream of messages using producers and consumers.
- Excellent understanding and knowledge of NoSQL databases like HBase, MongoDB and Cassandra.
- Used GitHub version control tool to push and pull functions to get the updated code from the repository.
- Worked with Spark SQL: created DataFrames by loading data from Hive tables, prepared the data, and stored it in AWS S3.
- Hands-on experience with other Amazon Web Services such as Auto Scaling, Redshift, DynamoDB, and Route53.
- Worked in various programming languages using IDEs such as Eclipse, NetBeans, and IntelliJ, along with tools such as PuTTY and Git.
TECHNICAL SKILLS
Big Data/Hadoop Technologies: Map Reduce, Spark, SparkSQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, Yarn, Oozie, Zookeeper, Hue, Ambari Server
Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/RStudio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, AJAX, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting
Databases: Microsoft SQL Server, MySQL, Oracle, DB2, Teradata
NoSQL Databases: Cassandra, HBase, MongoDB
Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML
Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans.
Cloud Technologies: AWS (EC2, IAM, S3, Auto Scaling, CloudWatch, Route53, EMR, Redshift, DynamoDB), Azure and Google Cloud Platform (GCP)
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, Apache Ant, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
Reporting Tools: MS Office (Word/Excel/PowerPoint/ Visio/Outlook), Crystal Reports XI, SSRS, Cognos.
Operating Systems: Windows (all versions), UNIX, Linux, macOS, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, Boise, ID
Senior Big Data Engineer
Responsibilities:
- Evaluating client needs and translating their business requirement to functional specifications thereby onboarding them onto the Hadoop ecosystem.
- Extracted and updated the data into HDFS using Sqoop import and export.
- Developed Hive UDFs to incorporate external business logic into Hive scripts, and developed scripts to join data sets using Hive join operations.
- Created various Hive external tables and staging tables and joined them as required. Implemented static partitioning, dynamic partitioning, and bucketing.
- Worked with various HDFS file formats like Parquet, IAM and JSON for serializing and deserializing.
- Worked with Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, NiFi, and Spark on YARN.
- Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for huge volumes of data.
- Good experience with relational databases, including Oracle, SQL, and PostgreSQL.
- Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
- Implemented Cluster for NoSQL tool HBase as a part of POC to address HBase limitations.
- Used AWS IAM to detect and stop risky identity behaviors using rules and other statistical algorithms.
- Responsible for managing data coming from different sources through Kafka.
- Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
- Implemented data quality checks in the ETL tool Talend; good knowledge of data warehousing.
- Worked with ETL processes to transfer/migrate data from relational databases and flat files into common staging tables in various formats, and then into meaningful data in Oracle and MS-SQL.
- Developed Apache Spark applications to process data from various streaming sources.
- Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and Spark SQL. Designed and developed RDD seeds using Scala and Cascading. Streamed data to Spark Streaming using Kafka.
- Exposure to Spark, Spark Streaming, Snowflake, and Scala, including creating and handling DataFrames in Spark with Scala.
- Good exposure to MapReduce programming in Java, Pig Latin scripting, distributed applications, and HDFS.
- Good understanding of NoSQL databases and hands-on experience writing applications on the NoSQL databases HBase, Cassandra, and MongoDB.
- Very good implementation experience with object-oriented concepts, multithreading, and Java/Scala.
- Experienced with Scala and Spark in improving the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, pair RDDs, and Spark on YARN.
- Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Experience using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
- Developed custom Kafka producers and consumers for publishing and subscribing to Kafka topics.
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Worked on designing the MapReduce and YARN flow, writing MapReduce scripts, performance tuning, and debugging.
- Developed a NiFi workflow to pick up data from the SFTP server and send it to the Kafka broker.
- Used Hue for running Hive queries. Created partitions using Hive to improve performance.
- Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop, and Spark jobs.
- Installed application on AWS EC2 instances and configured the storage on S3 buckets.
- Performed Data Extraction, aggregations, and consolidation of Adobe data within AWS Glue using PySpark.
- Stored data in AWS S3, used similarly to HDFS, and ran EMR programs on the stored data.
- Used the AWS CLI to suspend an AWS Lambda function and to automate backups of ephemeral data stores to S3 buckets and EBS.
- Integration with the cloud, and microservices-based IAM architectures. Forrester sees API security solutions being used.
- Designed data models for data-intensive AWS Lambda applications that perform complex analysis and create analytical reports for end-to-end traceability, lineage, and definition of key business elements from Aurora.
- Worked on AWS Lambda functions in Python that invoke Python scripts to perform various transformations and analytics on large data sets in EMR clusters.
- Worked on auto-scaling the instances to design cost-effective, fault-tolerant, and highly reliable systems.
Environment: Hadoop (HDFS, Map Reduce), Scala, Databricks, Yarn, IAM, PostgreSQL, Spark, Impala, Hive, MongoDB, Pig, HBase, Oozie, Hue, Sqoop, Flume, Oracle, NiFi, Git, AWS Services (Lambda, EMR, Auto Scaling).
Confidential, Austin, TX
Sr. Data Engineer
Responsibilities:
- As a Big Data Developer, worked on Hadoop cluster scaling from 4 nodes in a development environment to 8 nodes in the pre-production stage and up to 24 nodes in production.
- Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.
- Responsible for data extraction and data ingestion from different data sources into the Hadoop Data Lake by creating ETL pipelines using Pig and Hive.
- Built pipelines to move hashed and un-hashed data from XML files to Data lake.
- Developed Spark scripts using Python on Azure HDInsight for data aggregation and validation, and verified their performance against MapReduce jobs.
- Extensively worked with the Spark SQL context to create DataFrames and Datasets to preprocess the model data.
- Data Analysis: Expertise in analyzing data using Pig scripting, Hive queries, Spark (Python), and Impala.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Experienced in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and Pig scripts.
- Involved in designing the HBase row key to store text and JSON as key values in an HBase table, designed so that rows can be retrieved with get/scan in sorted order.
- Wrote JUnit tests and integration test cases for the microservices.
- Worked in Azure environment for development and deployment of Custom Hadoop Applications.
- Developed and deployed the outcome using Spark and Scala code on the Hadoop cluster running on GCP.
- Worked heavily with Python, C++, Spark, SQL, Airflow, and Looker.
- Developed a NiFi workflow to pick up multiple files from the FTP location and move them to HDFS on a daily basis.
- Scripting: Expertise in Hive, PIG, Impala, Shell Scripting, Perl Scripting, and Python.
- Worked with developer teams on a NiFi workflow to pick up data from the REST API server, the data lake, and the SFTP server and send it to Kafka.
- Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
- Proven experience with ETL frameworks (Airflow, Luigi, or our own open-sourced garcon)
- Created Hive schemas using performance techniques like partitioning and bucketing.
- Used Hadoop YARN to perform analytics on data in Hive.
- Developed and maintained batch data flow using HiveQL and Unix scripting
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
- Built large-scale data processing systems for data warehousing solutions, and worked with unstructured data mining on NoSQL.
- Specified the cluster size, resource pool allocation, and Hadoop distribution by writing the specifications in JSON format.
- Extracted and loaded data into the Data Lake environment (MS Azure) using Sqoop, where it was accessed by business users.
- Primarily involved in the data migration process using Azure by integrating with the GitHub repository and Jenkins.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Developed workflow in Oozie to manage and schedule jobs on the Hadoop cluster to trigger daily, weekly and monthly batch cycles.
- Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
- Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Queried both Managed and External tables created by Hive using Impala.
- Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment.
- Developed customized Hive UDFs and UDAFs in Java, set up JDBC connectivity with Hive, and developed and executed Pig scripts and Pig UDFs.
- Used Windows Azure SQL Reporting Services to create reports with tables, charts, and maps.
Environment: Hadoop, Azure, Microservices, MapReduce, Agile, HBase, JSON, Spark, Kafka, JDBC, Hive, Pig, Oozie, Sqoop, Zookeeper, Flume, Impala, SQL, Scala, Python, Unix, GitHub.
Confidential, IL
Big Data Engineer
Responsibilities:
- Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
- Worked on Hadoop cluster which ranged from 4-8 nodes during pre-production stage, and it was sometimes extended up to 24 nodes during production.
- Built APIs that will allow customer service representatives to access the data and answer queries.
- Designed changes to transform current Hadoop jobs to HBase.
- Handled fixing of defects efficiently and worked with the QA and BA team for clarifications.
- Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files.
- Extended the functionality of Hive with custom UDFs and UDAFs.
- The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports, and established a self-service reporting model in Cognos for business users.
- Implemented bucketing and partitioning using Hive to assist users with data analysis.
- Used Oozie scripts for application deployment and Perforce as the secure versioning software.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Developed database management systems for easy access, storage, and retrieval of data.
- Performed DB activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom Map Reduce programs in Java.
- Responsible for loading data from the BDW Oracle database and Teradata into HDFS using Sqoop.
- Implemented AJAX, JSON, and JavaScript to create interactive web screens.
- Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB. Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries. Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
- Created Session Beans and controller Servlets for handling HTTP requests from Talend.
- Performed data visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
- Used Git for version control with the Data Engineer team and Data Scientist colleagues. Created Tableau dashboards with stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, etc. using the Show Me functionality, and built dashboards and stories as needed using Tableau Desktop and Tableau Server.
- Performed statistical analysis using SQL, Python, R Programming and Excel.
- Worked extensively with Excel VBA macros and Microsoft Access forms.
- Imported, cleaned, filtered, and analyzed data using tools such as SQL, Hive, and Pig.
- Used Python and SAS to extract, transform, and load source data from transaction systems, and generated reports, insights, and key conclusions.
- Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to understand the data on the fly with quick filters for on-demand information.
- Analyzed and recommended improvements for better data consistency and efficiency.
- Designed and developed data mapping procedures and the ETL (data extraction, data analysis, and loading) process for integrating data using R.
- Effectively communicated plans, project status, project risks, and project metrics to the project team, and planned test strategies per project scope.
Environment: Cloudera CDH4.3, Hadoop, Pig, Hive, Informatica, HBase, Map Reduce, HDFS, Sqoop, Impala, SQL, Tableau, Python, SAS, Flume, Oozie, Linux.
Confidential, Troy, MI
Data Engineer / Hadoop Developer
Responsibilities:
- Experience in Big Data Analytics and design in Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
- Built an Oozie pipeline that performs several actions: moving files, using Sqoop to pull data from the source Teradata or SQL systems into Hive staging tables, performing aggregations per business requirements, and loading the results into the main tables.
- Ran Apache Hadoop, CDH, and MapR distributions, dubbed Elastic MapReduce (EMR), on EC2.
- Used the fork action wherever parallel processing was possible to reduce data latency.
- Worked with different data formats such as JSON and XML, and applied machine learning algorithms in Python.
- Wrote a Pig script that picks up data from one HDFS path, performs aggregation, and loads the result into another path, from which it later populates another domain table. Converted this script into a JAR and passed it as a parameter in the Oozie script.
- Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity. Built an ETL process that uses a Spark JAR to execute the business analytical model.
- Hands-on experience with Git Bash commands: git pull to pull code from the source and develop it per the requirements, git add to add files, git commit after the code build, and git push to the pre-prod environment for code review. Later used screwdriver.yaml, which builds the code and generates artifacts that are released into production.
- Created a logical data model from the conceptual model and converted it into the physical database design using Erwin. Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.
- Connected to AWS Redshift through Tableau to extract live data for real-time analysis.
- Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.
- Involved in creating UNIX shell scripts. Performed defragmentation of tables, partitioning, compression, and indexing for improved performance and efficiency.
- Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
- Used SQL Server Integration Services (SSIS) for extracting, transforming, and loading data into the target system from multiple sources.
- Developed and implemented R and Shiny application which showcases machine learning for business forecasting. Developed predictive models using Python & R to predict customer churn and classification of customers.
- Partnered with infrastructure and platform teams to configure and tune tools, automate tasks, and guide the evolution of the internal big data ecosystem; served as a bridge between data scientists and infrastructure/platform teams.
- Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
- Performed data analysis using regressions, data cleaning, Excel VLOOKUP, histograms, and the TOAD client; presented the analysis and suggested solutions for investors.
- Rapidly created models in Python using pandas, NumPy, and sklearn, with plot.ly for data visualization. These models were then implemented in SAS, where they interfaced with MSSQL databases and were scheduled to update on a timely basis.
Environment: MapReduce, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka, JSON, XML, PL/SQL, SQL, HDFS, Unix, Python, PySpark, Azure.
Confidential
Java/Hadoop Developer
Responsibilities:
- Involved in review of functional and non-functional requirements.
- Installed and configured Pig and wrote Pig Latin scripts.
- Developed scripts and batch jobs to schedule various Hadoop programs.
- Wrote MapReduce jobs using Pig Latin. Involved in ETL, data integration, and migration.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs.
- Imported data using Sqoop to load data from Oracle to HDFS on a regular basis.
- Utilized various utilities like Struts Tag Libraries, JSP, JavaScript, HTML, & CSS.
- Built and deployed WAR files on a WebSphere application server.
- Implemented Patterns such as Singleton, Factory, Facade, Prototype, Decorator, Business Delegate and MVC.
- Wrote Hive queries for data analysis to meet the business requirements.
- Created Hive tables and worked with them using HiveQL. Experienced in defining job flows.
- Involved in frequent meetings with clients to gather business requirements & converting them to technical specifications for the development team.
- Adopted agile methodology with pair programming technique and addressed issues during system testing.
- Involved in the bug fixing and enhancement phase; used the FindBugs tool.
- Imported and exported data between Oracle Database and HDFS using Sqoop.
- Involved in creating Hive tables, loading the data, and writing Hive queries that run internally as MapReduce jobs. Developed a custom FileSystem plugin for Hadoop so it can access files on the Data Platform.
- Used SVN for version control.
- Developed applications in the Eclipse IDE. Experience in developing Spring Boot applications for transformations.
- Primarily involved in front-end UI using HTML5, CSS3, JavaScript, jQuery, and AJAX.
- Used the Struts framework to build an MVC architecture and separate presentation from business logic.
- Involved in rewriting middle-tier on WebLogic application server. Actively involved in Code-Reviews & Coding Standards, Unit testing & Integration Testing.
- The custom FileSystem plugin allows Hadoop MapReduce programs, HBase, Pig and Hive to work unmodified and access files directly.
- Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
- Set up and benchmarked Hadoop/HBase clusters for internal use.
Environment: Hadoop, MapReduce, HDFS, Hive, Java, Hadoop distribution of Cloudera, Pig, HBase, Linux, XML, Eclipse, Oracle, PL/SQL, MongoDB, Toad.
