Sr. Big Data Engineer Resume
Symmes, OH
SUMMARY
- 9+ years of experience in Data Analysis, Data Modeling, Data Architecture and Big Data/Hadoop as an applied Information Technology professional.
- Experience in Data modeling, complex data structures, Data processing, Data quality, Data lifecycle.
- Experience in Amazon AWS cloud which includes services like: EC2, S3, EBS, ELB, AMI, IAM, Route53, Autoscaling, CloudFront, CloudWatch, Security Groups.
- A very good understanding of job workflow scheduling and monitoring tools like Oozie and Control-M.
- Experience in metadata design, real time BI Architecture including Data Governance for greater ROI.
- Experienced in designing Architecture for Modeling a Data Warehouse by using tools like Erwin r9.6/r9.5, Sybase PowerDesigner and ER/Studio.
- Proficient in System Analysis, ER/Dimensional Data Modeling, Database design and implementing RDBMS specific features.
- Well versed in Data Migration, Data Conversions, Data Extraction/Transformation/Loading (ETL).
- Experience with Object Oriented Analysis and Design (OOAD) using UML, Rational Unified Process (RUP), Rational Rose and MS Visio.
- Experienced in Developing Triggers, Batch Apex, and Scheduled Apex classes.
- Experience in building high performance and scalable solutions using various Hadoop ecosystem tools like Pig, Hive, Sqoop, Spark, Solr and Kafka.
- Defined real-time data streaming solutions across the cluster using Spark Streaming, Apache Storm, Kafka, NiFi and Flume.
- Excellent experience in Normalization (1NF, 2NF, 3NF and BCNF) and De-normalization techniques for effective and optimum performance in OLTP and OLAP environments.
- Experience in Teradata RDBMS using the FastLoad, FastExport, MultiLoad, TPump and BTEQ utilities and Teradata SQL Assistant.
- Experienced in Data Modeling including Data Validation/Scrubbing and Operational assumptions.
- Very good knowledge in Data Analysis, Data Validation, Data Cleansing, Data Verification and Identifying Data Mismatch.
- Hands-on experience in installing, configuring, and using Hadoop ecosystem components like Hadoop MapReduce, HDFS, HBase, Oozie, Hive, Sqoop, Pig, Zookeeper and Apache Storm.
- Experience in working with MapReduce programs using Apache Hadoop for working with Big Data.
- Strong experience working with conceptual, logical and physical data modeling considering Metadata standards.
- Experience working with Agile and Waterfall data modeling methodologies.
- Experience in Ralph Kimball and Bill Inmon approaches.
- Good knowledge in Database Creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2 and SQL Server databases.
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
- In-depth understanding/knowledge of Hadoop Architecture and various components such as HDFS, JobTracker, TaskTracker, NameNode, DataNodes and MapReduce concepts.
- Strong knowledge in working with UNIX/LINUX environments, writing shell scripts and PL/SQL Stored Procedures.
- Implemented POC to migrate MapReduce jobs into Spark RDD transformations using Scala.
- Developed Apache Spark jobs using Scala in test environment for faster data processing and used Spark SQL for querying.
- Hands-on experience developing and debugging YARN (MR2) Jobs to process large Datasets.
- Data Processing: Processed data using MapReduce and YARN. Worked on Kafka as a proof of concept for log processing.
- Worked with Oozie workflow engine to schedule time based jobs to perform multiple actions.
- Experienced in importing and exporting data from RDBMS into HDFS using Sqoop.
- Hands-on experience in working with databases like Oracle and MySQL, and with PL/SQL.
- Experienced in developing Web Services with Python programming language.
- Experience in Performance Tuning, Optimization and Customization.
TECHNICAL SKILLS
Big Data Eco-System: Hadoop 3.0, HDFS, MapReduce, Hive 2.3, Pig, HBase 1.2, Spark 2.2, Spark Streaming, Spark SQL, Kafka, Cloudera CDH4, CDH5, Hortonworks, Hadoop Streaming, Splunk, Zookeeper 3.4, Oozie, Sqoop, Flume 1.8, Impala, Solr, and Ranger.
Data Modeling Tools: ER/Studio V17, Erwin 9.6/9.5, Sybase PowerDesigner.
OLAP Tools: Tableau, SAP BO, SSAS, Business Objects, and Crystal Reports 9
Testing and defect tracking Tools: HP/Mercury Quality Center, WinRunner, MS Visio & Visual SourceSafe
Operating System: Windows, Macintosh, Linux, Unix, Sun Solaris
ETL/Data warehouse Tools: Informatica 9.6/9.1, SAP Business Objects XIR3.1/XIR2, Talend, Tableau, Pentaho.
Languages: SQL, Shell Scripting, C/C++, Python 3.6, R, Scala
DBMS / RDBMS: Oracle 12c, SQL Server 2016/2014, DB2, Teradata 15/14
AWS tools: EC2, S3 Bucket, AMI, RDS, Redshift.
Methodologies: RAD, JAD, RUP, UML, System Development Life Cycle (SDLC), Agile, Waterfall Model.
PROFESSIONAL EXPERIENCE
Confidential - Symmes, OH
Sr. Big data Engineer
Responsibilities:
- Extensively involved in the Design phase and delivered Design documents in the Hadoop ecosystem with HDFS, Hive, Pig, Sqoop and Spark with Scala.
- Collected the logs from the physical machines and the OpenStack controller and integrated them into HDFS using Kafka.
- Involved in the high-level design of the Hadoop architecture for the existing data structure and Business process.
- Worked with clients to better understand their reporting and dashboarding needs and presented solutions using a structured Agile project methodology approach.
- Worked on analyzing the Hadoop cluster and different Big Data Components including Pig, Hive, Spark, HBase, Kafka, Elasticsearch, databases and Sqoop.
- Involved in loading disparate datasets into the Hadoop Data Lake, making them available to the data science team for predictive analysis.
- Developed Data Mapping, Data Governance, Transformation and Cleansing rules for the Master Data Management (MDM).
- Implemented Partitioning, Dynamic Partitions and Buckets in Hive to improve performance and organize data in a logical fashion.
- Installed Hadoop, MapReduce, HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
- Pulled data from Amazon S3 buckets into the Data Lake, built Hive tables on top of it, and created DataFrames in Spark to perform further analysis.
- Used cloud computing on the multi-node cluster, deployed Hadoop applications on cloud S3, and used Elastic MapReduce (EMR) to run MapReduce jobs.
- Explored MLlib algorithms in Spark to understand the possible Machine Learning functionalities that can be used for the use case.
- In the preprocessing phase of data extraction, used Spark to remove missing data and transform the data to create new features.
- Worked with commercial distributions of Hadoop including Hortonworks production HDP, Cloudera CDH and AWS (EMR, S3, and EC2).
- Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
- Involved in loading data from UNIX file system to HDFS using Flume and HDFS API.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
- Participated in design reviews, code reviews, unit testing and integration testing.
- Developed RDDs/DataFrames in Spark using Scala and Python and applied several transformation logics to load data from the Hadoop Data Lake to Cassandra DB.
- Exported the analyzed data to the NoSQL Database using HBase for visualization and to generate reports for the Business Intelligence team using SAS.
- Created Hive tables as internal or external tables, as requirements dictated, for efficiency.
- Implemented installation and configuration of a multi-node cluster on the cloud using Amazon Web Services (AWS) on EC2.
- Created and maintained Technical documentation for launching Hadoop Clusters and for executing Hive queries and Pig Scripts.
- Worked with Elastic MapReduce (EMR) and set up environments on Amazon AWS EC2 instances.
- Used JIRA for bug tracking and GIT for version control.
Environment: Hadoop 3.0, HDFS, Hive 2.3, Pig, Sqoop, Spark 2.2, Scala, HBase 1.2, Kafka, Elasticsearch, MapReduce, MLlib, Flume 1.8, Python, AWS, Web Services, GIT, JIRA, MDM
Confidential - Battle Creek, MI
Sr. Hadoop/Data Engineer
Responsibilities:
- Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters with agile methodology.
- Worked on evaluation and analysis of Hadoop cluster and different big data analytic tools like HBase and Sqoop.
- Developed MapReduce programs to perform data filtering for unstructured data.
- Worked on analyzing Hadoop cluster and different big data analytic tools including Pig, Hive and Impala.
- Successfully loaded files to Hive and HDFS from MongoDB, Cassandra and HBase.
- Worked on Classic and YARN distributions of Hadoop like Apache Hadoop, Cloudera CDH4 and CDH5.
- Created and altered HBase tables on top of data residing in Data Lake.
- Worked on analyzing and writing Hadoop MapReduce jobs using Java API, Pig and Hive.
- Created and managed S3 buckets and policies for storage and backup purposes.
- Worked on developing ETL processes to load data from multiple data sources to HDFS using Flume and Sqoop.
- Performed structural modifications using MapReduce and Hive, and analyzed data using visualization/reporting tools.
- Worked on the disaster recovery plan for the Hadoop cluster by implementing the cluster data backup in Amazon S3 buckets.
- Installed and configured Zookeeper service for coordinating configuration-related information of all the nodes in the cluster to manage it efficiently.
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark RDDs in Scala and Python.
- Worked with SQL queries, Stored Procedures, User Defined Functions (UDF) and Database Triggers, tuning them with tools like SQL Profiler and Database Tuning Advisor (DTA).
- Worked with multiple teams to understand their business requirements and the data in the source files.
- Created end to end Spark applications using Scala to perform various data cleansing, validation, transformation and summarization activities according to the requirements.
- Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, Pair RDDs and YARN.
- Worked collaboratively to manage build outs of large data clusters and real time streaming with Spark.
Environment: Hadoop 3.0, HBase 1.2, Sqoop, MapReduce, Pig, Hive 2.3, Impala, HDFS, MongoDB 3.6, Cassandra, Zookeeper, SQL queries, Spark, Scala, Python, YARN
Confidential - Union, NJ
Sr. Data Architect/Data Modeler
Responsibilities:
- Designed the Logical Data Model using ER Studio with the entities and attributes for each subject area.
- Worked with BTEQ to submit SQL statements, import and export data, and generate reports in Teradata.
- Defined and deployed monitoring, metrics, and logging systems on AWS.
- Responsible for technical Data governance, enterprise wide Data modeling and Database design.
- Implemented data warehouse designs to collect, extract, transform and load legacy data into the core SAP system.
- Designed the data marts using Ralph Kimball's Dimensional Data Mart modeling methodology in ER Studio.
- Incorporated business requirements into quality conceptual and logical data models using ER Studio and created physical data models using forward engineering techniques to generate DDL scripts.
- Advised on and enforced data governance to improve the quality/integrity of data and provided oversight of the collection and management of operational data.
- Implemented Dimensional Modeling using Star and Snowflake Schema, Identifying Facts and Dimensions, Physical and logical data modeling using ER Studio.
- Responsible for Big data initiatives and engagement including analysis, brainstorming, POC, and architecture.
- Demonstrated expertise utilizing ETL tools, including SQL Server Integration Services (SSIS), Data Transformation Services (DTS) and DataStage, in ETL package design, and in RDBMS systems like SQL Server, Oracle, and DB2.
- Reviewed and patched Netezza and Oracle environments, including DB2, OS and Server firmware.
- Designed and Deployed high-performance, custom applications at scale on Hadoop/Spark.
- Selected the appropriate AWS service based on data, compute, database, or security requirements.
- Used Flume extensively in gathering and moving log data files from Application Servers to a central location in Hadoop Distributed File System (HDFS) for data science.
- Extracted Mega Data from Amazon Redshift, AWS, and the Elasticsearch engine using SQL Queries to create reports.
- Involved in architecting Hadoop clusters and translating functional and technical requirements into detailed architecture and design.
- Used Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
- Performed Data Analysis, Data Migration and data profiling using complex SQL on various source systems including Oracle and Netezza.
- Designed ER diagrams (Physical and Logical using ER Studio) and mapped the data into database objects.
Environment: ER Studio, BTEQ, SQL, Teradata, AWS, Oracle, RDBMS, Netezza, Hadoop, Spark, HDFS, Flume, Amazon Redshift, Elasticsearch, Oozie, MapReduce, Hive, Pig, Sqoop
Confidential - West Chester, PA
Sr. Data Analyst/Data Modeler
Responsibilities:
- Analyzed the OLTP Source Systems and Operational Data Store and researched the tables/entities required for the project.
- Designed the measures, dimensions and facts matrix document to ease the design work.
- Created data flowcharts and attribute mapping documents, and analyzed the source meaning to retain and provide proper business names following FTB's very stringent data standards.
- Developed several scripts to gather all the required data from different databases to build the LAR file monthly.
- Designed ER diagrams, logical model and physical database for Oracle and Teradata as per business requirements using Erwin.
- Developed numerous reports to capture the transactional data for the business analysis.
- Developed complex SQL queries to bring data together from various systems.
- Responsible for technical data governance, enterprise wide data modeling and database design.
- Organized and conducted cross-functional meetings to ensure linearity of the phase approach.
- Collaborated with a team of Business Analysts to ascertain capture of all requirements.
- Created multiple reports on the daily transactional data which involves millions of records.
- Used joins such as inner and outer joins while creating tables from multiple tables.
- Created multiset, temporary, derived and volatile tables in Teradata database.
- Implemented indexes, statistics collection, and constraints while creating tables.
- Utilized ODBC for connectivity to Teradata via MS Excel to retrieve data automatically from Teradata Database.
- Developed various ad hoc reports based on the requirements.
- Designed & developed various ad hoc reports for different teams in Business (Teradata and Oracle SQL, MS Access, MS Excel).
- Developed SQL Queries to fetch complex data from different tables in remote databases using joins, database links and formatted the results into reports and kept logs.
- Involved in writing complex SQL queries using correlated subqueries, joins, and recursive queries.
- Delivered the artifacts within the time lines and excelled in the quality of deliverables.
- Validated the data during UAT.
- Performed source-to-target mapping.
- Involved in Metadata management, listing all the table specifications and implementing them in the Ab Initio metadata hub as per data governance.
- Developed Korn Shell scripts to extract and process data from different sources in parallel, improving execution time and resource utilization.
- Used Teradata utilities such as TPT (Teradata Parallel Transporter), FLOAD (FastLoad) and MLOAD (MultiLoad) for handling various tasks.
- Developed Logical data model using Erwin and created physical data models using forward engineering.
Environment: Erwin 8.0, Teradata 13, TOAD, Oracle 10g/11g, MS SQL Server 2008, Teradata SQL Assistant, XML Files, Flat files
Confidential
Data Analyst
Responsibilities:
- Analyzed functional and non-functional categorized data elements for Data Migration, data profiling and mapping from source to target data environment. Developed working documents to support findings and assign specific tasks.
- Participated in requirements sessions with IT Business Analysts, SMEs and business users to understand and document the business requirements as well as the goals of the project.
- Used and supported database applications and tools for extraction, transformation and analysis of raw data.
- Developed complex T-SQL code such as Stored Procedures, functions, triggers, Indexes, and views for the business application.
- Involved in the complete SSIS life cycle: creating SSIS packages, then building, deploying and executing the packages in all environments (QA, Development and Production).
- Created SSIS Packages for migration of data to MS SQL Server database from other databases and sources like Flat Files, MS Excel, Sybase, CSV files.
- Optimized stored procedures using temp tables and indexing strategies to increase speed and reduce runtime.
- Automated processes from MS Access and Excel by rewriting them as SQL views and tables.
- Developed reports for users in different departments in the organization using SQL Server Reporting Services (SSRS).
- Designed report models based on user requirements and used report builder to generate the reports.
- Used tools (Excel and SQL) to analyze, query, sort and manipulate data according to defined business rules and procedures.
- Performed data mining using very complex SQL queries and discovered patterns.
- Extensively used MS Access to pull the data from various databases and integrate the data.
- Developed SQL, BTEQ (Teradata) queries for extracting data from production database and built data structures and reports.
- Performed in-depth analysis of data & prepared weekly, biweekly, monthly reports by using SQL, SAS, MS Excel, MS Access, and UNIX.
Environment: T-SQL, SSIS, MS SQL, MS Excel, MS Access, SQL queries, BTEQ, UNIX
