- 7+ years of experience in Data Engineering and Data Pipeline Design, Development and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
- Strong experience across the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification and Testing, in both Waterfall and Agile methodologies.
- Strong experience in writing scripts using the Python API, PySpark API and Spark API for analyzing data.
- Extensively used Python and data science libraries: NumPy, Pandas, SciPy, awswrangler, PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy and Beautiful Soup.
- Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
- Strong experience in QlikView Product family such as QlikView Desktop Client, QlikView Management Console, QlikView Server.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data as required.
- Hands-on experience with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction.
- Experience in working with Flume and NiFi for loading log files into Hadoop.
- Experience in working with NoSQL databases like DynamoDB, MongoDB, HBase and Cassandra.
- Experienced in creating shell scripts to push data loads from various sources on the edge nodes onto HDFS.
- Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
- Worked with Cloudera and Hortonworks distributions.
- Expertise working with AWS cloud services like EMR, S3, Redshift and CloudWatch for big data development.
- Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS and SES.
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
- Experience with QlikView sheet objects (Pivot, List, Multi-box), multiple chart types, KPIs, custom requests for Excel Export, and Fast Change and objects for Management Dashboard reporting.
- Extensively worked with Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
- Hands-on experience with HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala and HBase.
- Experience with Hive partitioning and bucketing, performing joins on Hive tables, and implementing Hive SerDes.
- Worked on different file formats like delimited files, Avro, JSON and Parquet.
- Experience developing Kafka producers and consumers for streaming millions of events per second.
- Hands-on experience in designing and developing applications in Spark using Scala and PySpark to compare the performance of Spark with Hive and SQL/Oracle.
- Experience in manipulating/analysing large datasets and finding patterns and insights within structured and unstructured data.
- Experience in Implementing Continuous Delivery pipeline with Maven, Ant, Jenkins and AWS.
- Able to work on own initiative; highly proactive, self-motivated, resourceful and committed to the work.
- Strong debugging and critical-thinking ability, with a good understanding of advances in frameworks, methodologies and strategies.
Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala
Programming languages: Python, Java, R
Hadoop Distribution: Cloudera CDH, Hortonworks HDP, Apache, AWS
Machine Learning Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbour (KNN), Principal Component Analysis
Languages: Shell scripting, SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Regular Expressions
Operating Systems: Windows (98/2000/XP/7/8/10), Mac OS, UNIX, Linux (Ubuntu, CentOS)
Version Control: Git, GitHub
IDE & Tools, Design: Eclipse, Visual Studio, NetBeans, JUnit, CI/CD, SQL Developer, MySQL, Workbench, Tableau
Databases: Oracle, SQL Server, MySQL, DynamoDB, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL Databases (HBase, MongoDB)
Cloud Technologies: MS Azure, Amazon Web Services (AWS), Google Cloud
Data Engineering / Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, MLlib, Oozie, Zookeeper; AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, Google Shell, Linux, Bash Shell, Unix; Tableau, Power BI, SAS, Crystal Reports, Dashboard Design.
Utilities/Tools: Eclipse, Tomcat, NetBeans, JUnit, SQL, SVN, Log4j, SOAP UI, ANT, Maven, Alteryx, Visio.
Confidential, Dallas, TX
Senior Data Engineer/ Big Data Developer
- Working on migration, integration, design and development of data according to requirements, covering mainframe files and other integration files.
- Designing DynamoDB data and performing different operations using data analytics.
- Primarily responsible for converting a manual reporting system into a fully automated CI/CD data pipeline that ingests data from different marketing platforms into an AWS S3 data lake.
- Utilized AWS services with a focus on big data analytics, enterprise data warehouse and business intelligence solutions to ensure optimal architecture, scalability and flexibility.
- Developed Scala based Spark applications for performing data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.
- Designed AWS architecture and cloud migration covering EMR, DynamoDB and Redshift, with event processing using Lambda functions.
- Gathered data from Google AdWords, Apple Search Ads, Facebook Ads, Bing Ads, Snapchat Ads, Omniture and CSG using their APIs.
- Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Used AWS Systems Manager to automate operational tasks across AWS resources.
- Wrote Lambda function code and set a CloudWatch Events rule with a cron expression as the trigger.
- Connected Redshift to Tableau to create dynamic dashboards for the analytics team.
- Set up a connection between S3 and AWS SageMaker (machine learning platform), used for predictive analytics, and uploaded inference results to Redshift.
- Deployed the project on Amazon EMR with S3 connectivity for setting backup storage.
- Conducted ETL data integration, cleansing and transformations using AWS Glue Spark scripts.
- Wrote Python modules to extract data from the MySQL source database.
- Worked on Cloudera distribution and deployed on AWS EC2 Instances.
- Migrated high-availability web servers and databases to AWS EC2 and RDS with minimal or no downtime.
- Worked with AWS IAM to generate new accounts, assign roles and groups.
- Good knowledge of Data Marts, OLAP and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake modeling for fact and dimension tables) using Analysis Services.
- Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
- Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.
Environment: Redshift, PySpark, EC2, EMR, Glue, S3, Kafka, IAM, PostgreSQL, Jenkins, Maven, AWS CLI, Git.
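A minimal sketch of the scheduled Lambda trigger mentioned in this role, assuming a CloudWatch Events rule with a schedule such as `cron(0 6 * * ? *)` (daily at 06:00 UTC). The handler body and the returned fields are hypothetical placeholders, not the project's actual code.

```python
# Hypothetical handler for a Lambda invoked by a CloudWatch Events schedule rule.
# Scheduled events arrive with "source": "aws.events" and an ISO-8601 "time" field.

def lambda_handler(event, context):
    # Ignore invocations that did not come from a scheduled rule.
    if event.get("source") != "aws.events":
        return {"status": "ignored"}

    # Placeholder for the real work (for example, starting an ingestion job).
    return {"status": "ran", "scheduled_time": event.get("time")}


if __name__ == "__main__":
    sample_event = {
        "source": "aws.events",
        "detail-type": "Scheduled Event",
        "time": "2020-01-01T06:00:00Z",
    }
    print(lambda_handler(sample_event, None))
```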
Confidential, Addison, TX
Big Data Engineer/ Hadoop Developer
- Developed a data platform from scratch and took part in the requirements gathering and analysis phase of the project, documenting the business requirements.
- Designed tables in Hive and MySQL, used Sqoop to import and export databases to and from HDFS, and processed large datasets in structured, semi-structured and unstructured forms.
- Developed REST APIs using Python with the Flask and Django frameworks and integrated various data sources including Java, JDBC, RDBMS, shell scripts, spreadsheets and text files.
- Worked with Hadoop architecture and the daemons of Hadoop including NameNode, DataNode, JobTracker, TaskTracker and ResourceManager.
- Used AWS Data Pipeline for data extraction, transformation and loading from homogeneous or heterogeneous data sources and built various graphs for business decision-making using the Python Matplotlib library.
- Developed scripts to load data into Hive from HDFS and ingested data into the Data Warehouse using various data loading techniques.
- Wrote new Spark jobs in Scala to analyse customer data and sales history.
- Scheduled jobs using crontab, Rundeck and Control-M.
- Built Cassandra queries for CRUD operations (create, read, update and delete); also used Bootstrap to manage and organize the HTML page layout.
- Built import and export jobs to copy data to and from HDFS using Sqoop, and developed Spark code with Spark SQL/Streaming for faster testing and processing of data.
- Analyzed SQL scripts and designed the solutions to implement using PySpark.
- Used JSON and XML SerDes for serialization and deserialization to load JSON and XML data into Hive tables.
- Used Spark SQL to load JSON data, create a SchemaRDD and load it into Hive tables, and handled structured data using Spark SQL.
- Designed and Developed Real time Stream processing Application using Spark, Kafka, Scala, and Hive to perform Streaming ETL and apply Machine Learning.
- Created dashboard-style reports using QlikView components like List Box, Slider, Buttons, Charts and Bookmarks.
- Developed data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment and loading into target data destinations.
- Involved in converting the existing QlikView applications to Qlik Sense.
- Added support for AWS S3 and RDS to host static/media files and the database in the Amazon cloud.
- Developed applications primarily in a Linux environment, and used the Jenkins continuous integration tool with the Git version control system for project deployment.
- Managed the imported data from different data sources, performed transformations using Hive, Pig and MapReduce and loaded data into HDFS.
- Used the Oozie workflow engine to run multiple Hive and Pig jobs independently based on time and data availability, and developed an Oozie workflow that runs jobs when transaction data becomes available.
- To achieve the Continuous Delivery goal in a highly scalable environment, used Docker coupled with the load-balancing tool Nginx.
- Implemented a secure environment using QlikView Section Access.
- Worked on QlikView Server and Publisher to automate the QlikView jobs and define user security and data reduction.
Confidential, Tampa, Florida
- Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap.
- Installed, configured and maintained data pipelines.
- Designed the business requirement collection approach based on the project scope and SDLC methodology.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
- Authored Python (PySpark) scripts for custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling and all cleaning and conforming tasks.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Designed and implemented an incremental Sqoop job to read data from DB2 and load it into Hive tables, and connected Tableau through HiveServer2 to generate interactive reports.
- Used Sqoop to channel data between HDFS and different RDBMS sources.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats.
- Used Spark Streaming with Python to receive real-time data from Kafka and store the stream data in HDFS and in NoSQL databases such as HBase and Cassandra.
- Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Used Apache NiFi to copy data from local file system to HDP.
- Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
- Developed QlikView Dashboards using Chart Box (Drill Down, Drill up & Cyclic Grouping).
- Used Oozie to automate data processing and data loading into the Hadoop Distributed File System.
- Developed automated regression scripts in Python to validate the ETL process between multiple databases such as AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL).
Environment: Cloudera Manager (CDH5), PySpark, Qlik, HDFS, NiFi, Pig, Hive, S3, Kafka, Snowflake, PyCharm, Scrum, Git.
- Experience in Big Data analytics and design in the Hadoop ecosystem using MapReduce programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala and Kafka.
- Applied Hive tuning techniques such as partitioning, bucketing and memory optimization.
- Worked on different file formats like Parquet, ORC, JSON and text files.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
- Used Spark SQL to load data and create a SchemaRDD on top of it, loaded it into Hive tables, and handled structured data using Spark SQL.
- Worked on analysing Hadoop cluster using different big data analytic tools including Flume, Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Spark and Kafka.
- As a Big Data Developer, implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce frameworks, MongoDB, Hive, Oozie, Flume, Sqoop and Talend.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, YARN and PySpark.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig and Sqoop.
- The Databricks platform follows best practices for securing network access to cloud applications.
- Hands-on experience with Git Bash commands: git pull to fetch the code from source, git add to stage files, git commit after the code build, and git push to the pre-prod environment for code review; later used screwdriver.yaml to build the code and generate the artifacts released into production.
- Performed data validation by comparing record-wise counts between source and destination.
- Served on the data support team, handling bug fixes, schedule changes, memory tuning, schema changes and loading of historic data.
- Implemented checkpoints such as Hive count check, Sqoop record check, done-file creation check, done-file check and touch-file lookup.
- Worked with both Agile and Kanban methodologies.
Environment: Hadoop, MapReduce, HDFS, Hive, Cassandra, Sqoop, Oozie, SQL, Kafka, Spark, Scala, Java, GitHub, Talend Big Data Integration, Impala
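A minimal sketch of the record-count validation described in this role, assuming per-table counts have already been collected from the source and destination systems (for example via `SELECT COUNT(*)` queries). The function name and the sample table names are hypothetical and only illustrate the comparison step.

```python
from typing import Dict, List, Tuple


def compare_counts(
    source: Dict[str, int], target: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (table, source_count, target_count) for every table whose counts differ.

    A table missing on one side is treated as having a count of 0 there.
    """
    mismatches = []
    for table in sorted(set(source) | set(target)):
        src = source.get(table, 0)
        tgt = target.get(table, 0)
        if src != tgt:
            mismatches.append((table, src, tgt))
    return mismatches


if __name__ == "__main__":
    # Hypothetical counts; in practice these would come from the two systems.
    source_counts = {"customers": 1200, "orders": 5400}
    target_counts = {"customers": 1200, "orders": 5390}
    for table, src, tgt in compare_counts(source_counts, target_counts):
        print(f"{table}: source={src} target={tgt}")
```

An empty result means every table matched, which could serve as the pass condition for a post-load checkpoint.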