
Hadoop Developer Resume


Virginia

SUMMARY:

  • Over 9 years of experience as a Data Engineer, including deep expertise in statistical data analysis: transforming business requirements into analytical models, designing algorithms, and building strategic solutions that scale across massive volumes of data.
  • Excellent experience in designing, developing, documenting, and testing ETL jobs and mappings in Server and Parallel jobs using DataStage to populate tables in data warehouses and data marts.
  • Strong knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.
  • Experience with different ETL tool environments like SSIS and Informatica, and reporting tool environments like SQL Server Reporting Services and Business Objects.
  • Experience in Apache Spark, Spark Streaming, Spark SQL, and NoSQL databases like HBase, Cassandra, and MongoDB.
  • Established and executed a Data Quality Governance Framework, including an end-to-end process and data quality framework for assessing decisions that ensure the suitability of data for its intended purpose.
  • Good experience with Amazon Web Services (AWS) such as EMR and EC2, which provide fast and efficient processing for Teradata big data analytics.
  • Big Data/Hadoop, data analysis, and data modeling professional with applied information technology experience.
  • Strong experience working with HDFS, MapReduce, Spark, Hive, Sqoop, Flume, Kafka, Oozie, Pig and HBase.
  • Utilized analytical applications like SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
  • Good knowledge in Database Creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase and SQL Server databases.
  • Deep understanding of MapReduce with Hadoop and Spark. Good knowledge of Big Data ecosystem like Hadoop 2.0 (HDFS, Hive, Pig, Impala), Spark (SparkSQL, Spark MLlib, Spark Streaming).
  • Experienced in dimensional modeling (star schema, snowflake schema), transactional modeling, and SCDs (slowly changing dimensions).
  • Extensive experience in loading and analyzing large datasets with the Hadoop framework (MapReduce, HDFS, Pig, Hive, Flume, Sqoop, Spark, Impala, Scala) and NoSQL databases like MongoDB, HBase, and Cassandra.
  • Integrated Kafka with Spark Streaming for real time data processing.
  • Strong experience in the analysis, design, development, testing, and implementation of business intelligence solutions using data warehouse/data mart design, ETL, BI, and client/server applications, including writing ETL scripts with regular expressions and custom tools (Informatica, Pentaho, and SyncSort).
  • Experienced with Hadoop ecosystem and Big Data components including Apache Spark, Scala, Python, HDFS, MapReduce, and Kafka.
  • Expert in designing Server jobs using various types of stages like Sequential file, ODBC, Hashed file, Aggregator, Transformer, Sort, Link Partitioner and Link Collector.
  • Proficiency in Big Data practices and technologies like HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Oozie, Flume, Spark, and Kafka.
  • Working knowledge of Azure cloud components (HDInsight, Databricks, DataLake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, CosmosDB).
  • Experienced in building data pipelines using Azure Data Factory, Azure Databricks, and loading data to Azure Data Lake, Azure SQL Database, Azure SQL Data Warehouse, and controlling database access.
  • Extensive experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, and Storage Explorer.
  • Good knowledge in understanding the security requirements and implementation using Azure Active Directory, Sentry, Ranger, and Kerberos for authentication and authorizing resources.
  • Experienced in working with the Spark ecosystem using Scala and Hive queries on different data formats such as text files and Parquet.
  • Expertise in configuring monitoring and alerting tools as required, such as AWS CloudWatch.
  • Extensive experience in Text Analytics, generating data visualizations using R, Python and creating dashboards using tools like Tableau.
  • Solid knowledge of AWS services like EMR, Redshift, S3, and EC2, including configuring servers for auto scaling and elastic load balancing.
  • Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
  • Expertise in transforming business requirements into analytical models and designing algorithms; building models and developing data mining, data acquisition, data preparation, data manipulation, feature engineering, machine learning, validation, visualization, and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Excellent record of building and publishing customized interactive reports and dashboards with custom parameters and user filters, producing tables, graphs, and listings using Tableau.
  • Experience in developing Map Reduce Programs using Apache Hadoop for analyzing the big data as per the requirement. Practical understanding of the Data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.

TECHNICAL SKILLS:

Big Data: Cloudera Distribution, HDFS, Yarn, Data Node, Name Node, Resource Manager, Node Manager, MapReduce, PIG, SQOOP, Kafka, Hbase, Hive, Flume, Cassandra, Spark, Storm, Scala, Impala

Programming: Python, PySpark, Scala, Java, C, C++, Shell script, Perl script, SQL, PL/SQL

Databases: Snowflake(cloud), Teradata, IBM DB2, Oracle, SQL Server, MySQL, NoSQL

Cloud Technologies: AWS, Microsoft Azure

Frameworks: Django REST framework, MVC, Hortonworks

ETL/Reporting: Ab Initio GDE 3.0, Co>Op 2.15/3.0.3, Informatica, Tableau

Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman

Database Modelling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling

Visualization/ Reporting: Tableau, ggplot2, matplotlib, SSRS and Power BI

Web/App Server: UNIX server, Apache Tomcat

Operating System: UNIX, Windows, Linux, Sun Solaris

Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Association Rules, NLP and Clustering.

PROFESSIONAL EXPERIENCE:

Confidential, Virginia

Hadoop Developer

Responsibilities:

  • Experience managing Azure Data Lake Store (ADLS) and Data Lake Analytics and an understanding of how to integrate with other Azure services; knowledge of U-SQL.
  • Responsible for working with various teams on a project to develop an analytics-based solution to specifically target customer subscribers.
  • Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
  • Responsible for wide-ranging data ingestion using Sqoop and HDFS commands; accumulated partitioned data in various storage formats like text, JSON, and Parquet.
  • Monitored cluster health by setting up alerts using Nagios and Ganglia.
  • Worked on tickets opened by users regarding various incidents and requests.
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using crontab.
  • Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap; installed, configured, and maintained data pipelines.
  • Developed features, scenarios, and step definitions for BDD (Behavior-Driven Development) and TDD (Test-Driven Development) using Cucumber, Gherkin, and Ruby.
  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.
  • Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis. Worked with data governance and data quality teams to design various models and processes.
  • Implemented Kafka producer and consumer applications on Kafka cluster setup with help of Zookeeper.
  • Used Spring Kafka API calls to process the messages smoothly on Kafka Cluster setup.
  • Involved in all steps and scope of the project's data approach to MDM; created a Data Dictionary and mappings from sources to targets in the MDM data model.
  • Responsible for designing and developing data ingestion from Kroger using Apache NiFi/Kafka.
  • Developed various mappings with the collection of all sources, targets, and transformations using Informatica Designer.
  • Built a real-time pipeline for streaming data using Kafka and Spark Streaming (see the sketch after this list).
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures to extract data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization.
  • Used IBM InfoSphere DataStage Designer, Director, and Administrator for creating and implementing jobs.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib.
  • Data integration: ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
  • Developed UNIX shell scripts to run IBM InfoSphere DataStage jobs and transfer files to different landing zones.
  • Collaborated with data engineers and software developers to develop experiments and deploy solutions to production.
  • Masked sensitive PII information in claims notes datasets using the DataGuise DgSecure tool.
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
  • Scheduling the IBM InfoSphere DataStage job using AutoSys.
  • Used SQL Server Integration Services (SSIS) for extracting, transforming, and loading data into target systems from multiple sources.
  • Wrote production-level machine learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes in a given time frame (a hedged sketch follows this list).
  • Performed all necessary day-to-day GIT support for different projects, Responsible for design and maintenance of the GIT Repositories, and the access control strategies.
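
Illustrative sketch of the Kafka-to-Spark Streaming pattern referenced above: a minimal PySpark Structured Streaming job, not project code. The broker address, topic name, JSON schema, and ADLS output path are hypothetical placeholders.

    # Consume subscriber events from a Kafka topic and land them as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-streaming-ingest").getOrCreate()

    # Assumed JSON layout of each Kafka message.
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
           .option("subscribe", "subscriber_events")            # placeholder topic
           .option("startingOffsets", "latest")
           .load())

    # Kafka delivers the payload as binary; cast and parse the JSON value.
    events = (raw.selectExpr("CAST(value AS STRING) AS json_str")
              .select(from_json(col("json_str"), event_schema).alias("e"))
              .select("e.*"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "abfss://lake@account.dfs.core.windows.net/events/")  # placeholder ADLS path
             .option("checkpointLocation", "/tmp/checkpoints/subscriber_events")
             .outputMode("append")
             .start())
    query.awaitTermination()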
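
A minimal sketch of the kind of PySpark MLlib binary classifier mentioned above, assuming a hypothetical Hive feature table and placeholder column names; the actual feature engineering and model selection were project-specific.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("binary-classifier").enableHiveSupport().getOrCreate()

    # Assumed source: a Hive table with numeric features and a 0/1 label column.
    df = spark.table("analytics.customer_features")

    feature_cols = ["tenure_days", "monthly_spend", "support_tickets"]  # placeholder features
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=[assembler, rf]).fit(train)

    # Area under ROC on the held-out split.
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
    print(f"Test AUC: {auc:.3f}")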

Environment: Spark Streaming, Hive, Scala, Hadoop, Kafka, Spark, Sqoop, Docker, Spark SQL, TDD, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, ZooKeeper, Azure, Unix/Linux Shell Scripting, Python, PyCharm, Informatica PowerCenter, Linux.

Confidential, Chicago, IL

Big Data Developer

Responsibilities:

  • Developed Spark applications using Pyspark and Spark-SQL for data extraction, transformation and aggregation from multiple file formats.
  • Used SSIS to build automated multi-dimensional cubes and Importing Table definitions and Metadata using IBM InfoSphere DataStage Manager
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python and NoSQL databases such as HBase and Cassandra.
  • Developed a Spark Streaming application to read raw packet data from Kafka topics, format it as JSON, and push it back to Kafka for future use cases (see the sketch after this list).
  • Collected data using Spark Streaming from an AWS S3 bucket in near-real-time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Analyzed existing SQL and IBM InfoSphere DataStage jobs for better performance
  • Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
  • Identified and documented Functional/Non-Functional and other related business decisions for implementing Actimize-SAM to comply with AML Regulations.
  • Developed Shell scripts for running IBM InfoSphere DataStage Jobs and transferring files to other internal teams and External vendors.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleansing and conforming tasks (a small UDF example follows this list).
  • Create data pipeline of gathering, cleaning and optimizing data using Hive, Spark.
  • Combining various datasets in HIVE to generate Business reports.
  • Transforming business problems into Big Data solutions and define Big Data strategy and Roadmap.
  • Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis.
  • Prepared and uploaded SSRS reports; managed database and SSRS permissions.
  • Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various modules of AML including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
  • Used SQL Server Management Studio to check the data in the database against the given requirements.
  • Writing Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS.
  • Developed solutions to leverage ETL tools and identify opportunities for process improvements using Informatica and Python.
  • Designed and implemented multiple ETL solutions with various data sources via extensive SQL scripting, ETL tools, Python, shell scripting, and scheduling tools. Performed data profiling and data wrangling of XML, web feeds, and files using Python, Unix, and SQL.
  • Loaded data from different sources to a data warehouse to perform data aggregations for business intelligence using Python.
  • Designed and implemented Sqoop for the incremental job to read data from DB2 and load to Hive tables and connected to Tableau for generating interactive reports using Hive server2.
  • Used Sqoop to channel data from different sources of HDFS and RDBMS.
  • Conduct root cause analysis and resolve production problems and data issues
  • Performance tuning, code promotion and testing of application changes
  • Automated the data processing with Oozie to automate data loading into the Hadoop Distributed File System.
  • Developed NiFi workflows to pick up data from REST API servers, the data lake, and SFTP servers and send it to the Kafka broker.
  • Implement Spark Kafka streaming to pick up the data from Kafka and send to Spark pipeline.
  • End-to-end development of Actimize models for the bank's trading compliance solutions.
  • Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MLOAD, BTEQ, and FastLoad.
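
A hedged sketch of the Kafka read/reshape/write-back flow described above; the topic names, broker, and pipe-delimited record layout are assumptions made for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, to_json, struct, col

    spark = SparkSession.builder.appName("packet-formatter").getOrCreate()

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
           .option("subscribe", "raw_packets")                  # placeholder input topic
           .load())

    # Assume each raw record is a pipe-delimited string: src|dst|bytes
    parts = split(col("value").cast("string"), r"\|")
    packets = raw.select(
        parts.getItem(0).alias("src_ip"),
        parts.getItem(1).alias("dst_ip"),
        parts.getItem(2).cast("long").alias("bytes"),
    )

    # Serialize each row as JSON and publish it to the downstream topic.
    out = packets.select(to_json(struct(*packets.columns)).alias("value"))

    query = (out.writeStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")
             .option("topic", "packets_json")                   # placeholder output topic
             .option("checkpointLocation", "/tmp/checkpoints/packets_json")
             .start())
    query.awaitTermination()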
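
A small example of the kind of custom PySpark UDF used for the cleansing and conforming tasks above, shown on a toy DataFrame; the column name and normalization rule are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()

    @udf(returnType=StringType())
    def normalize_state(value):
        """Trim, upper-case, and bucket unknown state codes."""
        if value is None:
            return "UNKNOWN"
        cleaned = value.strip().upper()
        return cleaned if len(cleaned) == 2 else "UNKNOWN"

    df = spark.createDataFrame([(" va",), ("Illinois",), (None,)], ["state"])
    df.withColumn("state_clean", normalize_state("state")).show()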

Environment: HDFS, NiFi, Pig, Hive, Cloudera Manager (CDH5), Hadoop, PySpark, S3, Kafka, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, OLTP, OLAP, HBase, Python, Shell Scripting, XML, Unix, Cassandra, SQL Server.

Confidential, San Francisco, CA

Java/ HadoopDeveloper

Responsibilities:

  • Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using FLUME and SQOOP, and performed structural modifications using Map Reduce, HIVE.
  • Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL queries for data aggregation, querying, and writing data back into RDBMS through Sqoop.
  • Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV, and ORC, with compression codecs like Gzip, Snappy, and LZO.
  • Strong understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance (see the table-design sketch after this list).
  • Interacted with business partners, Business Analysts and product owner to understand requirements and build scalable distributed data solutions using Hadoop ecosystem.
  • Developed Spark Streaming programs to process near-real-time data from Kafka, processing data with both stateless and stateful transformations.
  • Experience in report writing using SQL Server Reporting Services (SSRS) and creating various types of reports like drill down, Parameterized, Cascading, Conditional, Table, Matrix, Chart and Sub Reports.
  • Used the DataStax Spark connector to store data into and retrieve data from the Cassandra database.
  • Wrote Oozie scripts and setting up workflow using Apache Oozie workflow engine for managing and scheduling Hadoop jobs.
  • Wrote research reports describing the experiments conducted, results, and findings, and made strategic recommendations to technology, product, and senior management. Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python Jupyter Notebook, Hive, and NoSQL.
  • Worked on the implementation of a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka and ZooKeeper based log collection platform.
  • Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.
  • Worked with HIVE data warehouse infrastructure-creating tables, data distribution by implementing partitioning and bucketing, writing and optimizing the HQL queries.
  • Built and implemented automated procedures to split large files into smaller batches of data to facilitate FTP transfer which reduced 60% of execution time.
  • Developed PIG UDFs for manipulating the data according to Business Requirements and also worked on developing custom PIG Loaders.
  • Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake SnowSQL, writing SQL queries against Snowflake.
  • Transformed data using AWS Glue DynamicFrames with PySpark, cataloged the transformed data using crawlers, and scheduled the job and crawler using the workflow feature (a Glue sketch follows this list).
  • Worked on installing cluster, commissioning & decommissioning of data node, name node recovery, capacity planning, and slots configuration.
  • Developed data pipeline programs with Spark Scala APIs, data aggregations with Hive, and data formatting (JSON) for visualization.
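
A sketch of the managed vs. external Hive table design with partitioning and bucketing mentioned above, issued through spark.sql; the database, table, and column names are placeholders, and plain HiveQL run through Beeline would use STORED AS PARQUET and the EXTERNAL keyword instead of the Spark-native USING/LOCATION form shown here.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-table-design")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("CREATE DATABASE IF NOT EXISTS sales")

    # Managed table: partitioned by load date, bucketed by customer for join/scan pruning.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales.orders_managed (
            order_id    STRING,
            customer_id STRING,
            amount      DECIMAL(12,2),
            load_date   STRING
        )
        USING PARQUET
        PARTITIONED BY (load_date)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
    """)

    # External table: an explicit LOCATION over files already landed in HDFS,
    # so dropping the table leaves the underlying data in place.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales.orders_raw (
            order_id    STRING,
            customer_id STRING,
            amount      DECIMAL(12,2),
            load_date   STRING
        )
        USING PARQUET
        PARTITIONED BY (load_date)
        LOCATION '/data/raw/orders'
    """)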
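
A rough sketch of an AWS Glue job using DynamicFrames with PySpark, as described above; the catalog database, table, column mappings, and S3 output path are placeholders, not the project's actual configuration.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source table as registered in the Glue Data Catalog by a crawler (placeholder names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="analytics_db", table_name="raw_events")

    # Rename and retype columns on the way through.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("eventid", "string", "event_id", "string"),
                  ("ts", "string", "event_ts", "timestamp")])

    # Write the transformed data back to S3 as Parquet for the next crawler run.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/events/"},
        format="parquet")

    job.commit()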

Environment: Apache Spark, Map Reduce, Snowflake, Apache Pig, Python, Java, SSRS, HBase, AWS, Cassandra, PySpark, Apache Kafka, HIVE, SQOOP, FLUME, Apache Oozie, Zookeeper, ETL, UDF.

Confidential

Software Engineer

Responsibilities:

  • Implemented Lambda to configure the DynamoDB Auto Scaling feature and implemented a data access layer to access AWS DynamoDB data (see the boto3 sketch after this list).
  • Automated a nightly build to run quality control using Python with the Boto3 library to make sure the pipeline does not fail, reducing effort by 70%.
  • Worked on AWS services like SNS to send out automated emails and messages using Boto3 after the nightly run.
  • Worked on the development of tools that automate AWS server provisioning, automated application deployments, and implementation of basic failover among regions through AWS SDKs.
  • Created AWS Lambda, EC2 instances provisioning on AWS environment and implemented security groups, administered Amazon VPC's.
  • Used Jenkins pipelines to drive all micro-services builds out to the Docker registry and then deployed to Kubernetes, Created Pods and managed using Kubernetes.
  • Involved with development of Ansible playbooks with Python and SSH as wrapper for management of AWS node configurations and testing playbooks on AWS instances.
  • Developed Python AWS serverless Lambda functions with concurrency and multi-threading to make the process faster by executing callables asynchronously.
  • Implemented CloudTrail in order to capture the events related to API calls made to AWS infrastructure.
  • Monitored containers in AWS EC2 machines using the Datadog API and ingested and enriched data into the internal cache system.
  • Chunked larger datasets into smaller pieces using Python scripts for faster data processing.
  • Implemented End to End solution for hosting the web application on AWS cloud with integration to S3 buckets.
  • Worked on AWS CLI Auto Scaling and Cloud Watch Monitoring creation and update.
  • Allotted permissions, policies and roles to users and groups using AWS Identity and Access Management (IAM).
  • Developed a fully automated continuous integration system using Git, Jenkins, MySQL and custom tools developed in Python and Bash.
  • Worked on AWS Elastic Beanstalk for fast deploying of various applications developed with Java, PHP, Node.js, Python on familiar servers such as Apache.
  • Developed server-side software modules and client-side user interface components and deployed entirely in Compute Cloud of Amazon Web Services (AWS).
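
A simplified boto3 sketch of the DynamoDB data access layer and SNS notification pattern referenced in the bullets above; the table name, key schema, region, and topic ARN are invented for illustration.

    from typing import Optional

    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    sns = boto3.client("sns", region_name="us-east-1")

    table = dynamodb.Table("pipeline_runs")  # placeholder table name

    def save_run(run_id: str, status: str) -> None:
        """Persist a pipeline run record."""
        table.put_item(Item={"run_id": run_id, "status": status})

    def get_run(run_id: str) -> Optional[dict]:
        """Fetch a pipeline run record by its key."""
        response = table.get_item(Key={"run_id": run_id})
        return response.get("Item")

    def notify_failure(run_id: str) -> None:
        """Send an automated alert via SNS when a nightly run fails."""
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:nightly-alerts",  # placeholder ARN
            Subject="Nightly pipeline failure",
            Message=f"Run {run_id} failed quality checks.")

    if __name__ == "__main__":
        save_run("2024-01-01", "FAILED")
        run = get_run("2024-01-01")
        if run and run["status"] == "FAILED":
            notify_failure(run["run_id"])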

Environment: AWS, S3, EC2, Lambda, EBS, MySQL, Python, Git, Jenkins, IAM, Datadog, CloudTrail, CLI, Ansible, DynamoDB, CloudWatch, Docker, Kubernetes.
