Big Data Engineer Resume
Costa Mesa, CA
SUMMARY
- Over 8 years of experience in the IT industry on Big Data platforms, with extensive hands-on experience in the Apache Hadoop ecosystem and enterprise application development. Good knowledge of extracting models and trends from raw data in collaboration with the data science team.
- Proficient in all phases of the Software Development Lifecycle (SDLC)
- Extensive knowledge in Data Integration, Data Mapping, Information Gathering, Data Cleansing, Data Manipulation, Data Processing, Performance Tuning, and data governance.
- Strong Knowledge in Data Validation, Data Cleansing, Data Verification, Data Profiling, Integration and Master Data Management Services
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, and Azure SQL Database.
- Experience in Cloud Platform - AWS. Trained in AWS EC2, S3, Load Balancer, Auto Scaling.
- Managed databases and Azure Data Platform services (Azure Data Lake (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), SQL Server, Oracle, data warehouses, etc. Built multiple data lakes.
- Experience in statistical model approaches and Machine Learning algorithms such as Linear Regression, Random-Forest Regression, Logistic Regression, Naive Bayes, Decision Trees, K-Means Clustering and Association Rules using R Studio.
- Experienced in data manipulation for loading and extraction, and in Python libraries such as NumPy, SciPy and Pandas for data analysis and numerical computations.
- Hands-on experience with tools like Pig & Hive for data analysis, Sqoop for data ingestion, Oozie for scheduling and Zookeeper for coordinating cluster resources.
- Worked on Scala code base related to Apache Spark performing the Actions, Transformations on RDDs, Data Frames & Datasets using SparkSQL and Spark Streaming Contexts.
- Good knowledge of the Hadoop ecosystem, big data, HDFS (Hadoop Distributed File System) and RDBMS
- Experience working with Spark and Spark SQL, including tuning and debugging Spark clusters.
- Worked with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
- Knowledge in Hadoop security concepts and implementations (Sentry/Ranger)
- Optimized Extract Transform and Load (ETL) workflows from spreadsheets, database tables and other sources using SQL Server Integration Services (SSIS) and SQL Server Reporting Service (SSRS)
- Extensive experience in relational Data modeling, Dimensional data modeling, logical/Physical Design, ER Diagrams and OLTP and OLAP System Study and Analysis.
- Familiar with Hadoop architecture and its different components such as HDFS, Job tracker, Task tracker, Resource Manager, Name Node, Data Node and MapReduce concepts.
- Expert in MS Excel including date functions, text calculations, look ups, Pivot tables, ODBC, advanced summations, and VBA.
- Ability to troubleshoot and tune relevant programming languages like SQL, Java, Python, Scala, PIG, Hive, RDDs, Data Frames & MapReduce.
- Executed medium-to-complex SQL queries for data analysis and data validation.
- Knowledge in Business Intelligence (BI) tools like Tableau, QlikView, Power BI
- Analyzed business requirements and developed Requirement Traceability Matrix (RTM)
- Excellent communication skills and analytical skills and ability to perform as part of the team.
TECHNICAL SKILLS
Bigdata Framework: Hadoop, HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, Zookeeper, Flume, HBase, Amazon (EC2, S3 and Redshift), Spark, Storm, Impala, Kafka, Ranger, YARN, Airflow
Databases: Oracle, MySQL, SQLite, NoSQL, RDBMS, SQL Server 2014, HBase 1.2, MongoDB 3.2, Teradata, Cassandra
Database Tools: PL/SQL Developer, Toad, SQL Loader.
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapReduce, Amazon EMR
Web Programming: HTML, CSS, XML, JavaScript.
Programming Languages: Python, SQL, Scala, UNIX, C
Machine Learning: Regression, clustering, SVM, Decision trees, Classification, Random Forest, Artificial Neural Network
Data Visualization: QlikView, Tableau 9.4/9.2, ggplot2 (R), D3, Zeppelin
Technologies/Tools: Azure Machine Learning, Informatica, Elasticsearch, NiFi, Theano, Torch, NumPy.
Version Controllers: GIT, SVN, CVS
Operating Systems: LINUX, UNIX, Windows; Azure, AWS; VMWare, EMC.
PROFESSIONAL EXPERIENCE
Confidential, Costa Mesa, CA
Big Data Engineer
Responsibilities:
- Designed and developed architecture for a data services ecosystem spanning relational, NoSQL, and Big Data technologies. Extracted large data sets from Amazon Redshift, AWS, and Elasticsearch using SQL queries to create reports.
- Involved in Relational and Dimensional Data modeling for creating Logical and Physical Design of Database and ER Diagrams with all related entities and relationship with each entity based on the rules provided by the business manager using ERWIN r9.6.
- Analysis of functional and non-functional categorized data elements for data profiling and mapping from source to target data environment. Developed working documents to support findings and assign specific tasks.
- Experienced in designing and deployment of Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Oozie, Sqoop, Kafka, Spark with Cloudera distribution.
- Used Amazon Web Services (AWS), including EC2, S3, CloudFront, Elastic File System, RDS, VPC, Direct Connect, Route53, CloudWatch, CloudTrail, CloudFormation, and IAM, which allowed automated operations.
- Worked on Cloudera distribution and deployed on AWS EC2 Instances.
- Used Cloudera Hue to import data through the GUI.
- Worked on integrating Apache Kafka with Spark Streaming process to consume data from external REST APIs and run custom functions.
- Responsible to manage data coming from different sources through Kafka.
- Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
- Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, including creating and handling Data Frames in Spark with Scala.
- Used the Data Frame API in Scala for converting the distributed collection of data organized into named columns, developing predictive analytics using Apache Spark Scala APIs
- Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow and Oozie.
- Created Airflow Scheduling scripts in Python.
- Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
- Developed Spark scripts by using Scala Shell commands as per the requirement.
- Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
- Scheduled Oozie workflows to run multiple Hive and Pig jobs.
- Worked with Google Cloud Dataflow and BigQuery to manage and move data within a 200-petabyte cloud data lake for GDPR compliance. Also designed a star schema in BigQuery.
- Involved in running Hadoop streaming jobs to process terabytes of text data. Worked with different file formats such as Text, Sequence files, Avro, ORC and Parquet.
- Configured, supported, and maintained all network, firewall, storage, load balancers, operating systems, and software in AWS EC2.
- Implemented the use of Amazon EMR for Big Data processing among a Hadoop Cluster of virtual servers on Amazon related EC2 and S3.
- Worked on custom Pig Loaders and storage classes to work with variety of data formats such as JSON and XML file formats.
- Worked on business problem statement and provided solution, technical specification document with ETL mapping.
- Imported and exported data between the local system and HDFS.
- Composed Pig scripts to process the data and developed a data pipeline using Talend Integration ETL to store data into HDFS and Hive, and performed real-time analytics on the incoming data.
- Involved in creating Hive tables, loading with data and writing Hive queries using the HiveQL which will run internally in the map-reduce way.
- Extracted the data from Oracle into Hive using Sqoop.
- Worked extensively with Dimensional modeling, Data migration, Data cleansing, ETL Processes for data warehouses.
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling in AWS CloudFormation.
- Supported continuous storage in AWS using Elastic Block Storage, S3, and Glacier. Created volumes and configured snapshots for EC2 instances.
- Implemented a generalized solution model using AWS SageMaker.
- Extensive expertise using the core Spark APIs and processing data on an EMR cluster.
- Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
- Used DBT to automate the testing and deployment of data transformations.
- Acted as technical liaison between customer and team on all AWS technical aspects.
- Good knowledge in using Data Manipulations, Tombstones, Compactions in Cassandra. Well experienced in avoiding faulty Writes and Reads in Cassandra.
- Performed data analysis with Cassandra using Hive External tables.
- Designed the Column families in Cassandra.
- Experienced in running Hadoop streaming jobs to process terabytes of XML format data.
- Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.
- Migrated the data from the Data Lake (Hive) into an S3 bucket.
- Performed data validation between data present in the Data Lake and the S3 bucket.
- Experience with Agile and Scrum Methodologies. Involved in designing, creating, managing Continuous Build and Integration Environments.
- Used Pandas, NumPy, seaborn, SciPy, Matplotlib, Scikit-learn, NLTK in Python for developing various machine learning algorithms and utilized machine learning algorithms such as linear regression, multivariate regression, naive Bayes, Random Forests, K-means, & KNN for data analysis.
- Participated in all phases of Machine Learning and Data Mining: data collection, data cleaning, developing models, validation, and visualization.
- Worked on SQL Server concepts SSIS (SQL Server Integration Services), SSAS (Analysis Services) and SSRS (Reporting Services). Using Informatica & SSIS, SPSS, SAS to extract transform & load source data from transaction systems.
Confidential, Scottsdale, AZ
Data Engineer
Responsibilities:
- Architected, designed and developed business applications and data marts for reporting. Involved in different phases of the development life cycle including Analysis, Design, Coding, Unit Testing, Integration Testing, Review and Release as per the business requirements.
- Developed Big Data solutions focused on pattern matching and predictive modeling. Worked on an Amazon Redshift and AWS solution to load data, create data models and run BI on it.
- Developed various operational Drill-through and Drill-down reports using SSRS. Generated periodic reports based on the statistical analysis of the data using SQL Server Reporting Services (SSRS)
- Used advanced features of T-SQL in order to design and tune T-SQL to interface with the Database
- Designed OLTP system environment and maintained documentation of Metadata. Used forward engineering approach for designing and creating databases for OLAP model.
- Created PL/SQL packages and Database Triggers and developed user procedures and prepared user manuals for the new programs.
- Loaded and transformed large sets of structured, semi structured, and unstructured data using Hadoop/Big Data concepts. Developed Hive and MapReduce tools to design and manage HDFS data blocks and data distribution methods.
- Used Kafka and Kafka brokers to initiate the Spark context and process live streams.
- Developed custom Kafka producers and consumers for publishing and subscribing to Kafka topics.
- Worked closely with the ETL Developers in designing and planning the ETL requirements for reporting, as well as with business and IT management in the dissemination of project progress updates, risks, and issues.
- Worked on AWS S3 bucket integration for application and development projects. Worked on AWS Redshift and RDS for implementing models and data on RDS and Redshift.
- Created HBase tables to store various data formats of PII data coming from different portfolios. Implemented Forward engineering to create tables, views and SQL scripts and mapping documents.
- Implemented Kafka High level consumers to get data from Kafka partitions and move into HDFS. Worked with MDM systems team with respect to technical aspects and generating reports. Used Impala to read, write and query the Hadoop data in HDFS or HBase or Cassandra.
- Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake schemas.
- Developed a NiFi workflow to pick up the data from an SFTP server and send it to a Kafka broker.
- Developed Scala scripts using both Data frames/SQL/Data sets and RDD/MapReduce in Spark for Data Aggregation, queries and writing data back into OLTP system through Sqoop.
- Worked extensively on Spark using Scala on a cluster for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
- Participated in Normalization /De-normalization, Normal Form and database design methodology. Expertise in using data modeling tools like MS Visio and Erwin Tool for logical and physical design of databases.
- Involved in Planning, Defining and Designing data base using Erwin on business requirement and provided documentation.
- Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and SparkSQL. Designed and developed RDD seeds using Scala and Cascading. Streamed data to Spark Streaming using Kafka.
- Used partitioning and bucketing concepts in Hive and designed both Managed and External tables in Hive for optimized performance.
- Participated in several facets of MDM implementations including Data Profiling, metadata acquisition and data migration.
- Developed consumer-based features and applications using Python, Django, HTML, Behavior Driven Development (BDD) and pair programming.
- Designed and developed components using Python with Django framework. Implemented code in python to retrieve and manipulate data.
- Participated in all phases of data mining: data collection, data cleaning, developing models, validation, and visualization, and performed gap analysis.
- Derived insights from machine learning algorithm using SAS to analyze web log files and campaign data to recommend/improve promotional opportunities.
- Data Manipulation and Aggregation from a different source using Nexus, Toad, BusinessObjects, Power BI and Smart View.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Used Pandas, NumPy, seaborn, SciPy, Matplotlib, Scikit-learn, NLTK in Python for developing various machine learning algorithms and utilized machine learning algorithms such as linear regression, multivariate regression, naive Bayes, Random Forests, K-means, & KNN for data analysis.
- Solved performance issues in Hive and Pig scripts through an understanding of joins, grouping and aggregation and how they translate to MapReduce jobs. Tuned Hive and Pig scripts to improve performance.
- Extensively used Apache Sqoop for efficiently transferring bulk data between Apache Hadoop and relational databases (Oracle) for product level forecast. Extracted the data from Teradata into HDFS using Sqoop.
- Developed TWS workflow for scheduling and orchestrating the ETL process. Functional, non-functional and performance testing of key systems prior to cutover to AWS
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, manage and review data backups, manage and review Hadoop log files.
- Good experience working with Azure Blob and Data Lake storage and loading data into Azure SQL Synapse Analytics (DW).
- As a Hadoop Developer, managed the data pipelines and data lake.
Confidential, CA
Data Engineer
Responsibilities:
- Migrated the Django database from SQLite to MySQL to PostgreSQL with complete data integrity. Designed, developed, and deployed CSV parsing using the big data approach on AWS EC2.
- Developed tools using Python 3.6/3.4.6, Shell scripting, and XML to automate some of the menial tasks. Interfaced with supervisors, artists, systems administrators, and production to ensure production deadlines were met. Developed frontend and backend modules using Python on Django, including the Tastypie web framework, using Git.
- Involved in analysing business requirements and prepared detailed specifications that follow project guidelines required for project development.
- Created and maintained technical documentation for launching Hadoop Clusters and for executing Hive queries and Pig Scripts.
- Experience with Hadoop security tools Kerberos, Ranger on HDP 2.x stack and CDH 5.x.
- Developed reusable transformations to load data from Flat files and other data sources to the Data Warehouse.
- Developed Hive SQL queries, Mappings, tables, external tables in Hive for analysis across different banners and worked on partitioning, optimization, compilation, and execution.
- Wrote complex queries to get the data into HBase and responsible for executing hive queries using Hive Command Line.
- Automated workflows using shell scripts to pull data from various data bases into Hadoop.
- Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying on the log data.
- Developed bash scripts to bring the T-Log files from the FTP server and then process them for loading into Hive tables.
- Developed Oozie workflows for daily incremental loads, which get data from Teradata and then import it into Hive tables.
- Ingested user behaviour log files from external servers such as FTP servers and external S3 buckets into a centralized data lake.
- Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the STMs developed.
- Worked on migrating on-prem Hadoop cluster data and data pipelines to AWS cloud.
- Involved in deploying our Microservices on Docker containers and created Kubernetes clusters for reliability and scalability of the Microservices.
- Configured AWS IAM and Security Group as per requirement and distributed them as groups into various availability zones of the VPC.
- Created Cassandra tables to store various data formats of data coming from different sources.
- Generated search commands to retrieve multiline log events as a single transaction, giving start line and end line as inputs.
- Experienced in working with spark ecosystem using Spark SQL and Scala queries on different formats like text file, CSV file.
- Developed spark code and spark-SQL/streaming for faster testing and processing of data
- Ensured high availability & performance through horizontal scaling & load-balanced components.
- Used Reporting tools like Tableau to connect with Hive for generating daily reports of data.
- Collaborated with the infrastructure, network, database, application, and BI teams to ensure data quality and availability.
Confidential
Data Engineer
Responsibilities:
- Wrote complex PL/SQL queries, scripts, and stored procedures to support data integrity issues for large Oracle database applications. Analyzed, identified, and resolved data issues by creating complex scripts to resolve data conditions and anomalies.
- Involved in last phases of Software Development Life Cycle including Code Re-Design, Implementation, Bug-fixing, Performance Testing, Penetration Testing, Debugging and Documentation.
- Involved in generating reports of the script results to analyze the necessities, using the data visualization tool Tableau.
- Involved in developing web applications using the Django framework to implement the Model-View-Controller architecture.
- Designed and coded Hibernate Plug-In for Spring ORM mapping and implemented HQLs by creating DAO, which connects to Oracle DB, to persist and retrieve data.
- Implemented Spring Security to protect against SQL injection and enforce user access privileges; used various Java/J2EE design patterns such as DAO, DTO, and Singleton.
- Involved in Developing a Restful service using Python Flask framework.
- Added support for AWS S3 and RDS to host static/media files and the database in the Amazon cloud.
- Worked on a multi-threading factory to distribute the learning process and back-testing across various worker processes.
- Used Unix sockets in a client-server application framework and worked on Linux server virtualization by creating Linux VMs for server consolidation.
- Created the entire application using Python, Django, MySQL and Linux, and created data pipelines using Apache Spark, a big-data processing and computing framework.
- Developed the presentation layer using HTML, CSS, JavaScript, jQuery and AJAX, and used jQuery libraries for all client-side JavaScript manipulations.
- Designed and created backend data access modules using PL/SQL stored procedures and Oracle.
- Created PDF reports from XML documents to send to all customers at the end of each month.
- Developed a fully automated continuous integration system using Git, Gerrit, Jenkins, MySQL and custom tools developed in Python and Bash.
- Designed object model, data model, tables, constraints, necessary stored procedures, functions, triggers, and packages for Oracle Database.
- Used SAX/DOM parsers for parsing data into the Oracle Database.
- Automated the existing scripts for performance calculations using NumPy and SQLAlchemy.
- Interacted with QA to develop test plans from high-level design documentation.
- Used AWS CloudWatch for monitoring, customized metrics, and file logging.
- Participated in requirement gathering and worked closely with the architect in designing and modeling.
- Developed SQL and stored procedures on MySQL, and designed and developed horizontally scalable APIs using Python Flask.
Confidential
Data Analyst
Responsibilities:
- Worked with Data Warehouse team in developing Dimensional Model and analyzing the ER-Diagrams.
- Identified and analyzed stakeholders and subject areas.
- Participated in business analysis, talking to business users and determining the entities and attributes for the data model.
- Identified and determined physical attributes and their relationships through cross-analysis of functional areas.
- Identified and analyzed source data coming from Oracle, SQL server and flat files.
- Extensively used ERWIN to design and restructure Logical and Physical Data Models.
- Evaluated and enhanced current data model as per the requirements
- Performed forward and reverse engineering, applying DDLs to database in restructuring the existing data Model using ERWIN.
- Designed ETL specification documents to load the data in target using various transformations according to the business requirements.
- Used Informatica PowerCenter for extracting, transforming, and loading.
- Performed Data profiling, Validation, and Integration.
- Created materialized views to improve performance and tuned the database design.
- Involved in Data migration and Data distribution testing.
- Performed testing, knowledge transfer and mentored other team members.
Environment: Informatica, Repository Manager, Workflow Manager, ERWIN 3.0, Oracle 10g/9i, Teradata, TOAD, UNIX, and Shell scripting.
