Big Data Engineer Resume
Phoenix, AZ
SUMMARY
- 8 years of experience in the IT industry, including 5+ years building scalable, distributed, and complex applications in Big Data environments, covering the design and implementation of end-to-end pipelines using Spark and the Hadoop ecosystem.
- Expertise in Spark, Hive, HDFS, MapReduce, HBase, Presto, Pig, Sqoop, Kafka, Flume, and Oozie.
- Hands-on experience in programming using Java, Python, Scala, and SQL.
- Well versed in manipulating and analyzing large structured and semi-structured (JSON, XML) datasets.
- Experienced working with Hadoop distributions both on-prem (CDH, HDP) and in the cloud (AWS).
- Good experience working with data analytics and big data services in AWS Cloud such as EMR, Redshift, S3, Athena, and Glue.
- Experienced in developing production-ready Spark applications using the Spark RDD, DataFrame, Spark SQL, and Spark Streaming APIs.
- Worked extensively on fine-tuning Spark applications to improve performance and on troubleshooting Spark application failures.
- Strong experience with Spark Streaming, Spark SQL, and other Spark features such as accumulators, broadcast variables, the different caching/persistence levels, and job optimization techniques (see the sketch at the end of this summary).
- Worked on exporting/importing data between HDFS and relational database systems using Sqoop.
- Proficient with distributed query-processing tools such as Hive and Spark SQL.
- Used tools such as Sqoop and Kafka to ingest data into Hadoop.
- Developed multiple Kafka producers and consumers from scratch per the software requirement specifications.
- Used Talend Open Studio to load files into Hive tables and performed ETL aggregations in Hive.
- Strong ability to use analytical tools to mine data, perform predictive analysis, evaluate underlying patterns, and implement complex algorithms for data analysis.
- Assisted with cluster maintenance, monitoring, and troubleshooting, and with managing and reviewing data backups and log files.
- Basic knowledge of Kudu, NiFi, Kylin, and Zeppelin with Apache Spark.
- Experience in working with NoSQL data stores like HBase.
- Strong knowledge of version control systems such as SVN and GitHub.
- Experienced with Waterfall, Agile, and Scrum software development processes.
- Experience working in an onshore/offshore model, performing code reviews, and resolving defects.
- Good level of experience in Core Java and JEE technologies.
- Good knowledge of Oracle PL/SQL and shell scripting.
- Strong problem-solving skills; quick learner, able to work independently or as a member of teams of varying sizes.
- Able to plan, manage, motivate, and work efficiently both independently and collaboratively in a team.
- Self-motivated, enthusiastic, and always keen to learn new methodologies and techniques.
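The following is a minimal, illustrative PySpark sketch of the broadcast-variable and caching techniques referenced in the summary above. It is not project code; the input path, column names, and lookup values are hypothetical placeholders.

```python
# Minimal PySpark sketch: broadcast a small lookup dictionary and cache a reused DataFrame.
# Paths, columns, and lookup values are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("broadcast-and-cache-demo").getOrCreate()

# Small reference data shipped to every executor once via a broadcast variable.
country_lookup = spark.sparkContext.broadcast({"US": "United States", "IN": "India"})

@F.udf("string")
def to_country_name(code):
    # Look up the full country name from the broadcast dictionary.
    return country_lookup.value.get(code, "Unknown")

orders = spark.read.parquet("/data/warehouse/orders/")  # hypothetical path
orders = orders.withColumn("country_name", to_country_name(F.col("country_code")))

# Cache the enriched DataFrame because it is reused by several downstream aggregations.
orders.cache()
orders.groupBy("country_name").count().show()
orders.groupBy("country_name").agg(F.sum("order_total")).show()
```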
TECHNICAL SKILLS
Languages: C, C++, XML, Python, R/R Studio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Shell scripting, JSON, Ajax, Java, Scala
Databases: Oracle 11g, MySQL, Teradata, Cassandra, HBase, MongoDB, MariaDB, Neo4j.
Cloud Technologies/Libraries: AWS, Azure, Data Lake, Hadoop, MapReduce, HDFS, HBase, Hive, Pig, Impala, Spark, Keras, Caffe, TensorFlow, OpenCV, Scikit-learn, Pandas.
Development Tools: Microsoft SQL Studio, IntelliJ, Eclipse, NetBeans, Visual Studio
Machine Learning Algorithms: Neural Networks, Decision Trees, Support Vector Machines, Random Forest, Convolutional Neural Networks, Logistic Regression, PCA, K-means, KNN.
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos 7.0/6.0.
BI Tools: Microsoft Power BI, Tableau, SSIS, SSRS, SSAS, Business Intelligence Development Studio (BIDS), Visual Studio, Crystal Reports, Informatica 6.1
Database Design Tools and Data Modeling: MS Visio, ERWIN 4.5/4.0, Star Schema/Snowflake Schema modeling, Fact & Dimension tables, Physical & Logical data modeling, Normalization, and Denormalization techniques.
PROFESSIONAL EXPERIENCE
Confidential
Big Data Engineer
Responsibilities:
- Installed and configured a multi-node, fully distributed Hadoop cluster.
- Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Hive UDFs, Pig, Sqoop, ZooKeeper, and Spark.
- Imported data into HDFS from various SQL databases and files using Sqoop, and from streaming systems into the big data lake using Storm.
- Worked with NoSQL databases such as HBase to create tables and store data.
- Developed custom aggregate functions using Spark SQL and performed interactive querying.
- Wrote Pig scripts to store the data into HBase.
- Created Hive tables with dynamic partitions and buckets for sampling, and worked on them using HiveQL.
- Stored the data in tabular formats using Hive tables and Hive SerDe.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Collaborated with internal application teams to fit our business models onto the existing on-prem platform setup.
- Implemented algorithms for real-time analysis in Spark.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for large volumes of data.
- Used the Spark-Cassandra Connector to load data to and from Cassandra (see the sketch at the end of this section).
- Streamed data in real time using Spark with Kafka and SOA (see the sketch at the end of this section).
- Handled importing data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the transformed data back into HDFS.
- Exported the analyzed data to relational databases using Sqoop for further visualization and report generation by the BI team.
- Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
- Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin) to study customer behavior.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Developed Pig Latin scripts to perform MapReduce jobs.
- Developed product profiles using Pig and commodity UDFs.
- Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
- Created, dropped, and altered tables at run time without blocking updates and queries, using HBase and Hive.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Scheduled batch jobs using an event engine and created dependency jobs.
- Created flow diagrams and UML diagrams of the designed architecture to explain the solution and obtain approval from product owners and business teams for all requested user requirements.
- Integrated with RESTful APIs to create ServiceNow incidents when a process fails within a batch job.
- Developed a capability to implement audit logging at required stages while applying business logic.
- Implemented Spark DataFrames over large incoming datasets in various formats such as JSON, CSV, and Parquet.
- Actively worked on resolving many technical challenges, one of them being handling nested JSON with multiple data sections in the same file and converting it into Spark-friendly DataFrames (see the sketch at the end of this section).
Environment: Hadoop, MapReduce, Hive, Pig, HBase, Sqoop, Flume, Cassandra, Scala, Spark, Oozie, Kafka, Linux, Java (JDK), Tableau, Eclipse, HDFS, MySQL
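The sketch below illustrates the kind of real-time Spark-with-Kafka streaming described in this role, using Spark Structured Streaming's Kafka source. It is not the project's actual code; the broker address, topic, and HDFS paths are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
# Minimal Structured Streaming sketch: read events from Kafka and land them in HDFS.
# Broker, topic, and output/checkpoint paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events-topic")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the value to a string payload.
parsed = events.select(F.col("value").cast("string").alias("payload"),
                       F.col("timestamp"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```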
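Below is a minimal, hypothetical sketch of loading data to and from Cassandra with the Spark-Cassandra Connector mentioned above; the keyspace, tables, and contact host are placeholders, and the connector package is assumed to be available to Spark.

```python
# Minimal sketch of reading from and writing to Cassandra via the Spark-Cassandra Connector.
# Keyspace/table names and the contact host are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-demo")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

orders = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="sales", table="orders")
          .load())

high_value = orders.filter("order_total > 1000")

(high_value.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="sales", table="high_value_orders")
 .mode("append")
 .save())
```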
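The following sketch shows one way nested JSON with multiple data sections can be flattened into Spark-friendly DataFrames, as described above; the input path and field names are hypothetical.

```python
# Minimal sketch of flattening nested JSON (an array of line items per record) into a
# flat, Spark-friendly DataFrame. The input path and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nested-json-demo").getOrCreate()

# multiLine handles JSON records that span several lines in the same file.
raw = spark.read.option("multiLine", "true").json("hdfs:///data/raw/nested_orders.json")

flat = (raw
        .select("order_id", "customer.name", F.explode("items").alias("item"))
        .select("order_id",
                F.col("name").alias("customer_name"),
                F.col("item.sku").alias("sku"),
                F.col("item.qty").alias("qty")))

flat.show(truncate=False)
```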
Confidential, Phoenix, AZ
Sr. Data Engineer
Responsibilities:
- Developed complete end-to-end big data processing in the Hadoop ecosystem.
- Provided application support during the build and test phases of the SDLC for their product.
- Used Oozie to automate the end-to-end data pipelines and Oozie coordinators to schedule the workflows.
- Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, Databricks, SQL Database, and SQL Data Warehouse environment.
- Performed data profiling and transformation on the raw data using Pig, Python, and Oracle.
- Developed predictive analytics using Apache Spark.
- Created dimensional model for the reporting system by identifying required dimensions and facts using Erwin.
- Developed and implemented a data pipeline using Kafka and Storm to store data into HDFS.
- Created automated Python scripts to convert data from different sources and to generate the ETL pipelines.
- Worked with Snowflake SaaS for cost-effective data warehouse implementation in the cloud.
- Designed and implemented database solutions in Azure SQL Data Warehouse.
- Developed customer cleanse functions, cleanse lists, and mappings for the MDM Hub.
- Worked extensively on Oracle PL/SQL, and SQL Performance Tuning.
- Involved in Star Schema modeling, building and designing the logical data model into dimensional models.
- Created shared dimension tables, measures, hierarchies, levels, cubes, and aggregations on MS OLAP/OLTP/Analysis Server (SSAS).
- Created both clustered and non-clustered indexes to maximize query performance in T-SQL.
- Created Hive external tables, loaded data into them, and queried the data using HQL (see the sketch at the end of this section).
- Generated multiple enterprise reports using SSRS and Crystal Reports, and worked on Tableau.
- Managed Azure Data Lake Storage (ADLS) and Data Lake Analytics, with an understanding of how to integrate them with other Azure services.
- Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
- Used Pig as an ETL tool to perform transformations, joins, and pre-aggregations before storing the data in HDFS.
- Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for dashboard reporting.
- Worked on creating ad hoc reports and database imports and exports using SSIS.
Environment: Erwin 9.8, SQL, Oracle 12c, PL/SQL, Big Data 3.0, Hadoop 3.0, Azure Data Lake, Spark, Scala, APIs, Pig 0.17, Python, Kafka 1.1, HDFS, ETL, MDM, OLAP, OLTP, SSAS, T-SQL, Hive 2.3, SSRS, Tableau, MapReduce, Sqoop 1.4, HBase 1.2, SSIS.
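A short, hypothetical sketch of the Hive external-table work mentioned above, issued through Spark SQL with Hive support; the database, table, columns, and storage location are placeholders, not the project's actual schema.

```python
# Minimal sketch: create a Hive external table over files already landed in the lake
# and query it with HQL. Database, table, columns, and the location are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-external-table-demo")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_events (
        event_id   STRING,
        user_id    STRING,
        event_type STRING,
        event_ts   TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION '/data/curated/web_events'
""")

# Register any partitions already present on disk, then query with HQL.
spark.sql("MSCK REPAIR TABLE analytics.web_events")
spark.sql("""
    SELECT event_type, COUNT(*) AS events
    FROM analytics.web_events
    WHERE event_date = '2021-01-01'
    GROUP BY event_type
""").show()
```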
Confidential, Los Angeles, CA
Data Analyst/Data Engineer
Responsibilities:
- Developed normalized Logical and Physical database models to design OLTP system.
- Extensively involved in creating PL/SQL objects (procedures, functions, and packages) and documented the requirements.
- Worked with cloud provider APIs for Amazon (AWS) EC2, S3, and VPC with GFS storage.
- Worked with Data ingestion, querying, processing and analysis of big data.
- Tuned and optimized various complex SQL queries.
- Performed bug verification, release testing and provided support for Oracle based applications.
- Used Erwin Model Mart for effective model management, sharing, dividing, and reusing model information and designs to improve productivity.
- Followed test-driven development within the Agile methodology to produce high-quality software.
- Extensively used Hive optimization techniques such as partitioning, bucketing, MapJoin, and parallel execution (see the sketch at the end of this section).
- Worked with Real-time Streaming using Kafka and HDFS.
- Worked with Alteryx, a data analytics tool, to develop workflows for the ETL jobs.
- Designed the data marts in dimensional data modeling using star and snowflake schemas.
- Wrote, tested, and implemented Teradata FastLoad, MultiLoad, DML, and DDL.
- Used various OLAP operations such as slice/dice, drill down, and roll up as per business requirements.
- Wrote SQL queries, stored procedures, views, triggers, T-SQL and DTS/SSIS.
- Handled importing of data from various data sources, performed data control checks using Spark and loaded data into HDFS.
- Designed SSRS reports with sub reports, dynamic sorting, defining data source and subtotals for the report.
- Designed and implemented importing data to HDFS using Sqoop from different RDBMS servers.
- Worked with Sqoop commands to import the data from different databases.
- Gathered SSRS report requirements and created the reports in Tableau.
- Designed and developed Map Reduce jobs to process data coming in different file formats like XML.
Environment: Erwin 9.8, SQL, PL/SQL, Kafka 1.1, AWS, APIs, Agile, ETL, HDFS, OLAP, T-SQL, SSIS, Teradata 15, Hive 2.3, SSRS, Sqoop 1.4, Tableau, MapReduce, XML.
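The sketch below illustrates the Hive optimization techniques named above (dynamic partitioning, bucketing, and a map-side join hint), issued through Spark SQL with Hive support. All table and column names are hypothetical; the parallel-execution setting (hive.exec.parallel) applies when the same HQL runs in Hive itself and is noted only in a comment.

```python
# Minimal sketch of Hive optimization techniques via Spark SQL with Hive support.
# Tables and columns are hypothetical. (hive.exec.parallel=true would additionally
# let independent stages run in parallel when this HQL runs directly in Hive.)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-optimization-demo")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic-partition inserts into the partitioned table.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Partitioned target table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_by_day (
        order_id    STRING,
        customer_id STRING,
        order_total DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
""")

# Bucketed dimension table (DDL only here; bucketed data is typically populated
# from Hive with hive.enforce.bucketing enabled).
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_dim_bucketed (
        customer_id STRING,
        segment     STRING
    )
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS ORC
""")

# Dynamic-partition insert from a (hypothetical) staging table.
spark.sql("""
    INSERT OVERWRITE TABLE sales_by_day PARTITION (order_date)
    SELECT order_id, customer_id, order_total, order_date
    FROM staging_sales
""")

# MAPJOIN hint: the small dimension table is broadcast to avoid a shuffle join.
spark.sql("""
    SELECT /*+ MAPJOIN(c) */ c.segment, SUM(s.order_total) AS revenue
    FROM sales_by_day s
    JOIN customer_dim_bucketed c ON s.customer_id = c.customer_id
    GROUP BY c.segment
""").show()
```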
Confidential, Chicago, IL
Data Engineer
Responsibilities:
- Identified, evaluated, and documented potential data sources in support of project requirements within the assigned departments as per Agile methodology.
- Built a data pipeline to enable streaming of scraped data (Scrapy, Beautiful Soup) and ingestion into PostgreSQL and Amazon Redshift by writing PySpark jobs (see the sketch at the end of this section).
- Developed MapReduce/Spark Python modules for predictive analytics and machine learning in Hadoop on AWS.
- Provided regular maintenance and upgrades for data warehouse servers.
- Data mining and modeling: collected, cleansed, modeled, and analyzed structured and unstructured data used for major business initiatives.
- Exported/imported data between HDFS, Hive, and databases using Spark and Sqoop.
- Re-designed and developed a critical ingestion pipeline to process over 300 TB of data, utilizing the following Big Data technologies for the HEB Marketing Data Warehouse: Hadoop (HDFS, PySpark, HBase, Hive, Oozie, Spark SQL, Sqoop, and ZooKeeper).
- Constructed a state-of-the-art data lake on AWS using EMR, PySpark, and Airflow (see the sketch at the end of this section).
- Worked on Amazon Redshift to load data, create data models and run BI on it.
- Worked with DBAs to create a best fit physical data model from the logical data model.
- Worked on AWS Redshift and RDS to implement models and data on both.
- Applied Data Governance rules (primary qualifier, class words and valid abbreviation in Table name and Column names).
- Conducted data profiling and quality control/auditing to ensure accurate and appropriate use of data.
- Modified data sources and wrote complex custom SQL to get data into the required form or layout for visualization purposes.
- Explored data in a variety of ways and across multiple visualizations using Power BI.
- Performed data quality analysis using advanced SQL skills.
- Reported on the credit risk of customers according to established policies.
- Provided data analysis and portfolio stratification across the entire credit risk portfolio in support of capital modeling and credit risk reporting.
Environment: ETL, Hadoop (HDFS, PySpark, Hive, Spark SQL, Sqoop), AWS, SQL, PostgreSQL, Tableau, UNIX, and SQL Server.
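Below is a minimal sketch of a PySpark job of the kind described above, loading scraped data and writing it to PostgreSQL over JDBC. The input path, connection URL, credentials, and table name are hypothetical, and the PostgreSQL JDBC driver is assumed to be on the Spark classpath; Amazon Redshift can be targeted the same way with its own JDBC URL and driver.

```python
# Minimal PySpark sketch: load scraped JSON and append it to a PostgreSQL table over JDBC.
# Path, URL, credentials, and table are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scraped-data-load").getOrCreate()

scraped = spark.read.json("s3://example-bucket/scraped/products/")  # hypothetical path

(scraped.write
 .format("jdbc")
 .option("url", "jdbc:postgresql://db-host:5432/analytics")
 .option("dbtable", "public.products_raw")
 .option("user", "etl_user")
 .option("password", "REDACTED")
 .option("driver", "org.postgresql.Driver")
 .mode("append")
 .save())
```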
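The following is an illustrative Airflow DAG sketch (Airflow 2.x imports assumed) for scheduling a daily PySpark job on an existing EMR cluster, in the spirit of the EMR/PySpark/Airflow data lake work mentioned above. The DAG id, schedule, script path, and spark-submit invocation are hypothetical placeholders, not the project's actual pipeline.

```python
# Minimal Airflow DAG sketch: submit a daily PySpark job with spark-submit.
# DAG id, schedule, and the job script path are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_lake_ingest",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    ingest = BashOperator(
        task_id="spark_ingest",
        bash_command=(
            "spark-submit --deploy-mode cluster "
            "s3://example-bucket/jobs/ingest_daily.py --run-date {{ ds }}"
        ),
    )
```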
Confidential
Data Analyst/BI Developer
Responsibilities:
- Analyzed the requirements and segregated them into high-level and low-level use cases and activity diagrams using MS Visio according to UML methodology, thus defining the data process models.
- Monitored and analyzed the historical data, productivity and current trends to identify opportunities using Tableau. Worked on creating Dashboards and Reports using Tableau as per the business needs. Developed executive KPI reports using Power BI.
- Conducted sessions with the Business Analysts and Technical Analysts to gather the requirements.
- Used SQL joins, aggregate functions, analytic functions, and GROUP BY and ORDER BY clauses, and interacted with the DBA and developers for query optimization and tuning.
- Developed stored procedures, user-defined functions, and triggers as needed using T-SQL.
- Participated in all phases of data mining, data collection, data cleaning, developing models, validation, and visualization and performed Gap analysis.
- Worked on data verification and validation to evaluate whether the data generated according to the requirements was appropriate and consistent.
- Performed data analysis and data profiling using complex SQL on various source systems (see the sketch at the end of this section).
- Utilized a diverse array of technologies and tools as needed to deliver insights, such as Python, Tableau, and more.
- Developed complex PL/SQL procedures and packages using views and SQL joins.
- Optimized the data environment for efficient access to data marts and implemented efficient data extraction routines for data delivery.
- Used MS Excel and PowerPoint to process data, create reports, analyze metrics, implement verification procedures, and fulfill client requests for information.
Environment: Python, SQL, PL/SQL, T-SQL, MS Visio, UML, MS Excel, Tableau, PowerPoint, SQL Server
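As an illustration of the SQL-based data profiling mentioned above, the sketch below runs a simple profiling query from Python against SQL Server via pyodbc. The connection string, database, table, and columns are hypothetical placeholders, not the actual source systems.

```python
# Hypothetical data-profiling check run from Python against SQL Server using pyodbc.
# Connection details, table, and columns are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=sales;Trusted_Connection=yes;"
)

profile_sql = """
SELECT
    COUNT(*)                                       AS row_count,
    COUNT(DISTINCT customer_id)                    AS distinct_customers,
    SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_emails
FROM dbo.customers;
"""

cursor = conn.cursor()
row = cursor.execute(profile_sql).fetchone()
print(f"rows={row.row_count}, distinct_customers={row.distinct_customers}, null_emails={row.null_emails}")
conn.close()
```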