Sr. Data Engineer Resume
FL
SUMMARY
- 8+ years of diversified IT experience across end-to-end (E2E) data analytics platforms (ETL, BI, Java), spanning Big Data, Hadoop, Java/J2EE development, Informatica, data modeling and system analysis in the Banking, Finance, Insurance and Telecom domains.
- Hands-on experience with the Hadoop framework and its ecosystem, including the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, Sqoop, Flume and Spark.
- Experience across the layers of the Hadoop framework - storage (HDFS), analysis (Pig and Hive) and engineering (jobs and workflows) - extending functionality by writing custom UDFs.
- Extensive experience in developing data warehouse applications using Hadoop, Informatica, Oracle, Teradata and MS SQL Server on UNIX and Windows platforms; experienced in creating complex mappings using various transformations and developing Extraction, Transformation and Loading (ETL) strategies with Informatica 9.x/8.x.
- Proficient in Hive Query Language (HiveQL) and experienced in Hive performance optimization using static partitioning, dynamic partitioning, bucketing and parallel execution (a minimal sketch follows this summary).
- Designed and maintained high-performance ELT/ETL processes in a Data Architect role.
- Experience in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs and UDFs written in Java.
- Strong experience writing data analysis scripts with the Python, PySpark and Spark APIs.
- Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, Psycopg, embedPy, NumPy and Beautiful Soup.
- Good understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode and MapReduce concepts.
- Knowledge of cloud computing infrastructure on AWS (Amazon Web Services).
- Created Spark Streaming modules for streaming data into a data lake.
- Experience in dimensional data modeling (star schema, snowflake schema, fact and dimension tables) and in concepts such as Lambda Architecture, batch processing and Oozie.
- Extensively used Informatica client tools: Source Analyzer, Warehouse Designer, Mapping Designer, Mapplet Designer, ETL transformations, Repository Manager, Server Manager, Workflow Manager and Workflow Monitor.
- Expertise in core Java, J2EE, multithreading, JDBC and shell scripting; proficient with Java APIs such as Collections, Servlets and JSP for application development.
- Worked closely with Dev and QA teams to review pre- and post-processed data and ensure data accuracy and integrity.
- Experience in Java, J2EE, JDBC, Collections, Servlets, JSP, Struts, Spring, Hibernate, JSON, XML, REST, SOAP web services, Groovy, MVC, Eclipse, WebLogic, WebSphere and Apache Tomcat servers.
- Working experience with functional programming in Scala and Java.
- Extensive knowledge of data modeling, data conversions, data integration and data migration, with specialization in Informatica PowerCenter.
- Expertise in extracting, transforming and loading data from heterogeneous systems such as flat files, Excel, Oracle, Teradata and MS SQL Server.
- Good working experience with UNIX/Linux commands, scripting and deploying applications to servers.
- Strong skills in algorithms, data structures, object-oriented design, design patterns, documentation and QA/testing.
- Experienced in working on fast-paced Agile teams, with exposure to testing in Scrum teams and test-driven development.
- Excellent domain knowledge in Insurance, Telecom and Banking/Finance.
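A minimal sketch of the Hive dynamic partitioning and parallel execution tuning referenced above, expressed through the Spark SQL interface; the table names (sales_raw, sales_part) and columns are hypothetical, not taken from any specific project.

# Minimal sketch: Hive dynamic partitioning and parallel execution via
# Spark SQL. Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Partitioned target table stored as ORC.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_part (
        txn_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    STORED AS ORC
""")

# Enable dynamic partitioning and parallel execution before loading.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("SET hive.exec.parallel = true")

# Dynamic-partition insert: rows are routed to partitions by sale_date.
spark.sql("""
    INSERT OVERWRITE TABLE sales_part PARTITION (sale_date)
    SELECT txn_id, amount, sale_date FROM sales_raw
""")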
TECHNICAL SKILLS
Big Data Technologies: AWS EMR (Elastic MapReduce), S3, EC2 Fleet, Apache Spark 2.2/2.0/1.6, Spark SQL, Hortonworks HDP, Hadoop, MapReduce, Pig, Hive, Informatica PowerCenter 9.6.1/8.x, Kafka, NoSQL, Hue, YARN, NiFi, Impala, Sqoop, Solr, Oozie.
Databases: Cloudera Hadoop CDH 15.x, Hortonworks HDP, Oracle 10g/11g, Teradata, DB2, Microsoft SQL Server, MySQL, NoSQL and SQL databases.
Platforms (O/S): Red Hat Linux, Ubuntu, Windows NT/2000/XP.
Programming Languages: Java, Scala, SQL, UNIX shell scripting, JDBC, Python, Perl.
Security Management: Hortonworks Ambari, Cloudera Manager, Apache Knox, XA Secure, Kerberos.
Web Technologies: DHTML, HTML, XHTML, XML, XSL (XSLT, XPath), XSD, CSS, JavaScript, SOAP, RESTful, Agile, Design Patterns.
Data Warehousing: Informatica PowerCenter/PowerMart/Data Quality/Big Data, Pentaho, ETL development, Amazon Redshift, IDQ.
Database Tools: JDBC, Hadoop, Hive, NoSQL, SQL Navigator, SQL Developer, TOAD, SQL*Plus, SAP BusinessObjects.
Data Modeling: Rational Rose, Erwin 7.3/7.1/4.1/4.0.
Code Editors: Eclipse, IntelliJ IDEA.
PROFESSIONAL EXPERIENCE
Confidential - FL
Sr. Data Engineer
Responsibilities:
- Loaded data from HDFS into Spark RDDs to run predictive analytics on the data.
- Used HiveContext, which provides a superset of the functionality of SQLContext, and preferred writing queries with the HiveQL parser to read data from Hive tables (fact, syndicate).
- Modeled Hive partitions extensively for data separation and faster processing, and followed Hive best practices for tuning.
- Developed Spark scripts by writing custom RDDs in Scala for data transformations and performed actions on RDDs.
- Cached RDDs for better performance and performed actions on each RDD.
- Created Hive fact tables on top of raw data from different retailers, partitioned by time dimension key, retailer name and data supplier name, which were further processed and pulled by the analytics service engine.
- Developed complex yet maintainable and easy-to-use Python and Scala code that satisfies application requirements for data processing and analytics using built-in libraries.
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework.
- Created and maintained data pipelines using Matillion ETL and Fivetran.
- Evaluated Fivetran and Matillion for streaming and batch data ingestion into Snowflake.
- Involved in designing and optimizing Spark SQL queries and DataFrames: importing data from data sources, performing transformations and read/write operations, and saving the results to output directories in HDFS/AWS S3.
- Developed REST APIs in Python using the Flask and Django frameworks and integrated various data sources including Java/JDBC, RDBMS, shell scripts, spreadsheets and text files.
- Developed a Python application for Google Analytics aggregation and reporting and used Django configuration to manage URLs and application parameters.
- Developed a PySpark program that writes DataFrames to HDFS as Avro files (see the sketch at the end of this section).
- Used PySpark and Pandas to calculate moving averages and RSI scores for stocks and loaded the results into the data warehouse.
- Responsible for building scalable distributed data solutions on Amazon EMR clusters.
- Worked on importing and exporting data from Snowflake, Oracle and DB2 into HDFS and Hive using Sqoop for analysis, visualization and report generation.
- Involved in migrating objects from Teradata to Snowflake.
- Worked with the Kafka REST API to collect and load data onto the Hadoop file system, and used Sqoop to load data from relational databases.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames and saved it in Parquet format in HDFS.
- Loaded files into Hive and HDFS from Oracle and SQL Server using Sqoop.
Environment: Cloudera Hadoop 15.8, AWS, Spark-Java, Python, Spark-Scala, Hive, Impala, HDFS, Oozie, TFS, Agile, MS-SQL, PySpark, Eclipse, Snowflake, Fivetran
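A minimal sketch of the PySpark pattern described above (writing a DataFrame to HDFS as Avro and computing a moving average over stock quotes); the paths, column names and 20-row window are hypothetical.

# Minimal sketch: write a DataFrame to HDFS as Avro and compute a trailing
# moving average per ticker with a window function. Paths, columns and the
# 20-row window are hypothetical; requires the external spark-avro package.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("avro-moving-average-sketch").getOrCreate()

quotes = spark.read.parquet("hdfs:///data/raw/quotes")  # hypothetical input

# 20-row trailing window per ticker, ordered by trade date.
w = Window.partitionBy("ticker").orderBy("trade_date").rowsBetween(-19, 0)
enriched = quotes.withColumn("moving_avg_20", F.avg("close").over(w))

# Persist the enriched DataFrame to HDFS in Avro format.
(enriched.write
    .format("avro")
    .mode("overwrite")
    .save("hdfs:///data/curated/quotes_avro"))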
Confidential, Lake Success, NY
Data Engineer
Responsibilities:
- Responsible for building scalable distributed data solutions on Amazon EMR 5.6.1 clusters.
- Worked with the Kafka REST API to collect and load data onto the Hadoop file system, and used Sqoop to load data from relational databases.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames and saved it in Parquet format in HDFS (see the streaming sketch at the end of this section).
- Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka and persisted it into HDFS.
- Developed Spark scripts by writing custom RDDs in Scala for data transformations and performed actions on RDDs.
- Created pipelines in Azure Data Factory (ADF) using Linked Services, Datasets and Pipelines to extract, transform and load data between sources such as Azure SQL, Blob Storage and Azure SQL Data Warehouse, including write-back.
- Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL DW) and processed it in Azure Databricks.
- Worked on creating Spring-Boot services for Oozie orchestration.
- Deployed Spring-Boot entity services for Audit Framework of the loaded data.
- Worked with Avro, Parquet and ORC file formats and compression techniques such as LZO.
- Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
- Used Hive as an abstraction over structured data residing in HDFS and implemented static partitions, dynamic partitions and buckets on Hive tables.
- In charge of PySpark code that creates DataFrames from tables in the data service layer and writes them to a Hive data warehouse.
- Maintained and developed complex SQL queries, views, functions and reports that meet customer requirements on Snowflake.
- Analyzed Incident, Change and Job data from Snowflake and created a dependency tree-based model of incident occurrence for every application service present internally.
- Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
- Worked on migrating MapReduce programs into Spark transformations using Scala.
- Designed, developed data integration programs in a Hadoop environment with NoSQL data store Cassandra for data access and analysis.
- Used Job management scheduler apache Oozie to execute the workflow.
- Used Ambari to monitor node's health and status of the jobs in Hadoop clusters.
- Designing and implementing data warehouses and data marts using components of Kimball Methodology, like Data Warehouse Bus, Conformed Facts & Dimensions, Slowly Changing Dimensions, Surrogate Keys, Star Schema, Snowflake Schema, etc.
- Worked on Tableau to build customized interactive reports, worksheets and dashboards.
- Implemented Kerberos for strong authentication to provide data security.
- Implemented LDAP and Active Directory for Hadoop clusters.
- Worked on Apache Solr for indexing and load-balanced querying to search for specific data in large datasets.
- Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
Environment: Azure, EMR, Lambda, CloudWatch, Amazon Redshift, Snowflake, Spark-Java, Spark-Scala, Athena, Hive, HDFS, Spark, Scala, Oozie, Bitbucket, GitHub, Fivetran
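A minimal sketch of the Kafka-to-HDFS streaming flow described above, written against the Structured Streaming API (the original work may have used the DStream API instead); the broker, topic and paths are hypothetical.

# Minimal sketch: consume a Kafka feed and persist it to HDFS as Parquet
# using Structured Streaming. Broker, topic and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-to-parquet-sketch").getOrCreate()

# Read the raw feed; Kafka delivers key/value as binary columns.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")
          .load()
          .select(F.col("value").cast("string").alias("payload"),
                  F.col("timestamp")))

# Continuously append micro-batches to HDFS in Parquet format.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/stream/clickstream_parquet")
         .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
         .outputMode("append")
         .start())

query.awaitTermination()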
Confidential, Rochester MN
Data Engineer
Responsibilities:
- Prepared ETL design document which consists of the database structure, change data capture, Error handling, restart and refresh strategies.
- Worked with different data feeds such as JSON, CSV, XML and DAT, and implemented a data lake.
- Developed Informatica mappings using various transformations.
- Most of the infrastructure ran on AWS; used AWS Lambda to perform data validation, filtering, sorting and other transformations for every data change in a DynamoDB table and loaded the transformed data into another data store (see the Lambda sketch at the end of this section).
- Programmed ETL functions between Oracle and Amazon Redshift.
- Maintained end-to-end ownership of analyzed data, framework development, implementation and communication for a range of customer analytics projects.
- Good exposure to the IRI end-to-end analytics service engine and the new big data platform (Hadoop loader framework, big data Spark framework, etc.).
- Used AWS Data Pipeline for data extraction, transformation and loading from homogeneous and heterogeneous data sources and built various graphs for business decision-making using the Python matplotlib library.
- Conducted statistical analysis on healthcare data using Python and various tools.
- Developed frontend and backend modules using Python on the Django web framework and created the user interface (UI) using JavaScript, Bootstrap and HTML5/CSS, backed by Cassandra and MySQL.
- Built machine learning models to showcase big data capabilities using PySpark and MLlib.
- Used AWS services like EC2 and S3 for small data sets.
- Used a Kafka producer to ingest raw data into Kafka topics and ran the Spark Streaming application to process clickstream events.
- Performed data analysis and predictive data modeling.
- Explored clickstream event data with Spark SQL.
- Architecture and hands-on production implementation of the big data MapR Hadoop solution for digital media marketing using telecom data, shipment data, point of sale (POS), exposure and advertising data related to consumer product goods.
- Partnered with the D4 Rise Analytics team to eliminate the manual work they were doing on incident data by generating metrics from sources such as ServiceNow, Snowflake, AROW job data and a few other API calls, and created an Incident Dashboard with substantial built-in intelligence.
- Used Spark SQL, as part of the Apache Spark big data framework, to process structured shipment, POS, consumer, household, individual digital impression and household TV impression data.
- Created DataFrames from different data sources such as existing RDDs, structured data files, JSON datasets, Hive tables and external databases.
- Loaded terabytes of raw data at different levels into Spark RDDs for computation and generation of output responses.
- Provided leadership for a major new initiative focused on media analytics and forecasting, with the ability to deliver the sales lift associated with customer marketing campaign initiatives.
- Responsibilities included platform specification and redesign of load processes, as well as projections of future platform growth.
- Coordinated deployments to the QA and PROD environments.
- Used Python to automate Hive jobs and read configuration files.
- Used Spark for fast data processing, working with both the Spark shell and a Spark standalone cluster.
- Used Hive to analyze partitioned data and compute various metrics for reporting.
Environment: MapReduce, HDFS, Hive, Python, Scala, Kafka, Spark, Spark SQL, Oracle, Informatica 9.6, SQL, MapR, Sqoop, Zookeeper, Snowflake, AWS EMR, AWS S3, Data Pipeline, Jenkins, GIT, JIRA, Unix/Linux, Agile Methodology, Scrum.
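A minimal sketch of the AWS Lambda pattern mentioned above (transforming DynamoDB stream records and loading them into another store); the table names, attributes and filter rule are hypothetical.

# Minimal sketch: Lambda handler triggered by a DynamoDB stream that
# validates and filters each change and writes the transformed record to a
# second table. Table and attribute names are hypothetical.
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
target = dynamodb.Table("orders_curated")  # hypothetical target table

def handler(event, context):
    written = 0
    for record in event.get("Records", []):
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue  # ignore deletes
        new_image = record["dynamodb"].get("NewImage", {})
        # Basic validation/filtering: skip changes missing required fields.
        if "order_id" not in new_image or "amount" not in new_image:
            continue
        item = {
            "order_id": new_image["order_id"]["S"],
            "amount": Decimal(new_image["amount"]["N"]),
            "status": new_image.get("status", {}).get("S", "UNKNOWN"),
        }
        target.put_item(Item=item)
        written += 1
    return {"written": written}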
Confidential
Hadoop Consultant
Responsibilities:
- Understood the requirements and prepared the architecture document for the big data project.
- Worked with the Hortonworks distribution.
- Supported MapReduce Java programs running on the cluster.
- Optimized Amazon Redshift clusters, Apache Hadoop clusters, data distribution and data processing.
- Developed MapReduce programs to process Avro files and produce results by performing calculations on the data, including map-side joins.
- Imported bulk data into HBase using MapReduce programs.
- Used the REST API to access HBase data for analytics.
- Designed and implemented incremental imports into Hive tables.
- Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
- Involved in collecting, aggregating and moving data from servers to HDFS using Flume.
- Imported and exported data from different relational data sources such as DB2, SQL Server and Teradata to HDFS using Sqoop.
- Migrated complex MapReduce programs to in-memory Spark processing using transformations and actions (see the sketch at the end of this section).
- Worked on a POC for IoT device data with Spark.
- Used Scala to store streaming data in HDFS and to implement Spark for faster data processing.
- Created RDDs and DataFrames for the required input data and performed data transformations using Spark with Python.
- Involved in developing Spark SQL queries and DataFrames: importing data from data sources, performing transformations and read/write operations, and saving the results to output directories in HDFS.
- Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
- Developed Pig scripts for the analysis of semi-structured data.
- Developed Pig UDFs to manipulate data according to business requirements and worked on developing custom Pig loaders.
- Worked on Oozie workflow engine for job scheduling.
- Developed Oozie workflow for scheduling and orchestrating the ETL process.
- Experienced in managing and reviewing Hadoop log files using shell scripts.
- Migrated ETL jobs to Pig scripts to perform transformations, joins and pre-aggregations before storing the data in HDFS.
- Worked with different file formats such as SequenceFiles, XML files and MapFiles using MapReduce programs.
- Worked with the Avro data serialization system to handle JSON data formats.
- Used AWS S3 to store large amounts of data in a common repository.
- Involved in building applications using Maven and integrating with continuous integration servers such as Jenkins to run build jobs.
- Used the Enterprise Data Warehouse database to store information and make it accessible across the organization.
- Responsible for preparing technical specifications, analyzing functional Specs, development and maintenance of code.
- Worked with the Data Science team to gather requirements for various data mining projects.
- Wrote shell scripts to automate rolling day-to-day processes.
Environment: Amazon Redshift clusters, Apache Hadoop clusters, SQL Server, XML, JSON
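A minimal sketch of the MapReduce-to-Spark migration pattern referenced above: a MapReduce-style aggregation (events counted per status code) rewritten as PySpark RDD transformations and actions. The input path and log layout are hypothetical.

# Minimal sketch: a MapReduce-style aggregation rewritten as PySpark RDD
# transformations/actions. Input path and log layout are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapreduce-to-spark-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///logs/app/*.log")

# "Map" phase: emit (status_code, 1) for each well-formed line.
pairs = (lines
         .map(lambda line: line.split("\t"))
         .filter(lambda fields: len(fields) >= 3)
         .map(lambda fields: (fields[2], 1)))

# "Reduce" phase: sum the counts per status code in memory.
counts = pairs.reduceByKey(lambda a, b: a + b)

# Action: bring the small result set back to the driver.
for status, total in counts.collect():
    print(status, total)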
Confidential
Sr. ETL Consultant
Responsibilities:
- Gathered requirements for the RDM project, which involved implementing EDW data quality fixes and a retail data mart.
- Prepared the functional and technical specification design document for building the Member Data Mart according to the ICDW banking model.
- Responsible for gathering data from multiple sources such as Teradata and Oracle.
- Created Hive tables to store the processed results in a tabular format.
- Wrote MapReduce jobs in Java to process the log data.
- Implemented external and managed tables using Hive (see the sketch at the end of this section).
- Worked with the Teradata analysis team, using big data technologies, to gather the business requirements.
- Fixed erroneous data as part of the data reconciliation process.
- Used partitioning and bucketing for performance optimization in Hive.
- Responsible for delivering the Informatica artifacts for the mart-specific semantic layer for subject areas such as Reference, Third Party, Involved Party, Event and Customer.
- Ensured code quality before delivery to the client by reviewing and testing the code.
- Involved in implementing the Kimball methodology, OLAP, SCDs (Type 1, Type 2 and Type 3), star schema and snowflake schema.
- Involved in understanding the existing EDW process of the Retail business and implementing the components in ICDW.
- Prepared and successfully implemented automated UNIX scripts to execute the end-to-end history load process.
- Prepared the Tivoli job execution design to run the Membership Reporting Data Mart in the production environment.
- Managed versioning of mappings, scripts and documents in the SCM version control tool.
Environment: Hadoop, Informatica 9.01, Teradata 13, Ab Initio, UNIX, DB2.
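A minimal sketch of the external vs. managed Hive table and partitioning/bucketing work mentioned above, driven from a small Python wrapper around the Hive CLI; the table names, columns and HDFS location are hypothetical.

# Minimal sketch: create an external Hive table over an existing HDFS
# location plus a managed, partitioned and bucketed table, via the Hive CLI.
# Table names, columns and the HDFS path are hypothetical.
import subprocess

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS member_raw (
    member_id STRING,
    event_type STRING,
    amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/edw/member_raw';

CREATE TABLE IF NOT EXISTS member_curated (
    member_id STRING,
    amount DOUBLE
)
PARTITIONED BY (event_type STRING)
CLUSTERED BY (member_id) INTO 16 BUCKETS
STORED AS ORC;
"""

def run_hive(statements):
    """Run a batch of HiveQL statements through the Hive CLI (hive -e)."""
    subprocess.run(["hive", "-e", statements], check=True)

if __name__ == "__main__":
    run_hive(DDL)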