Data Engineer Resume

Rochester, MN

SUMMARY

  • 9+ years of experience designing and developing data-driven solutions - data warehousing, Business Intelligence, analytics, and data ingestion - covering extraction, transformation, and loading of data from transactional databases (OLTP) into data warehousing systems (OLAP).
  • Analyzed source systems and business requirements to design enterprise data warehousing and Business Intelligence solutions, data marts, and operational data stores.
  • Experience in designing and developing applications using Big Data technologies such as HDFS, MapReduce, Sqoop, Hive, PySpark & Spark SQL, HBase, Python, Snowflake, S3 storage, and Airflow.
  • Experience in job workflow scheduling and monitoring tools like Airflow and Autosys.
  • Experienced in designing, developing, documenting, and testing ETL jobs and mappings (Server and Parallel jobs) using DataStage to populate tables in the Data Warehouse and Data Marts.
  • Worked in Production support team for maintaining the mappings, sessions and workflows to load the data in Data Warehouse.
  • Hands-on experience in using Hadoop ecosystem components like Hadoop, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, Zookeeper, Kafka, Flume, MapReduce framework, Yarn, Scala, and Hue.
  • Extensively worked on AWS services like EC2, S3, EMR, RDS (Aurora), SageMaker, Athena, Glue Data Catalog, Redshift, DynamoDB, ElastiCache (Memcached & Redis), QuickSight, and other services of the AWS family.
  • Strong experience implementing data warehouse solutions in Confidential Redshift; worked on various projects to migrate data from on-premises databases to Confidential Redshift, RDS, and S3.
  • Experience with cloud databases and data warehouses (SQL Azure and Confidential Redshift/RDS).
  • Worked closely with the Enterprise Data Warehouse team and Business Intelligence Architecture team to understand repository objects that support the business requirement and process.
  • Extensive knowledge in working with Azure cloud platforms (HDInsight, Datalake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
  • Extensive experience working with NoSQL databases and their integration: DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase.
  • Experience building data pipelines using Azure Data Factory and Azure Databricks, loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
  • Worked with cross-functional teams on planning, modeling, and implementing solutions utilizing NoSQL technologies.
  • Expertise in transforming business requirements into analytical models and designing algorithms; built solutions covering data mining, data acquisition, data preparation, data manipulation, feature engineering, machine learning algorithms, validation, visualization, and reporting that scale across massive volumes of structured and unstructured data.
  • Excellent knowledge of Spark architecture and components; efficient in working with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
  • Extensively used Spark Data Frames API over Cloudera platform to perform analytics on Hive data and used Spark Data Frame Operations to perform required Validations in the data.
  • Proficient in Python scripting and developed various internal packages to process big data.
  • Developed various shell scripts and Python scripts to automate Spark jobs and Hive scripts.
  • Strong experience in Data Analysis, Data Profiling, Data Cleansing & Quality, Data Migration, Data Integration
  • Thorough knowledge in all phases of the Software Development Life Cycle (SDLC) with expertise in methodologies like Waterfall and Agile.
  • Automated resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production (a minimal DAG sketch follows this summary).
  • Worked with healthcare claims data and wrote SQL queries to organize the data and sorted, summarized, and reported salient changes within the datasets.
  • Evaluated and established the validity of incoming claims data and data element combinations to ensure accuracy and completeness of all reporting results.
  • Collaborated with data and product staff on various aspects of incoming healthcare claims data.
  • Experienced in change implementation, monitoring and troubleshooting of AWS Snowflake databases and cluster related issues.
  • Good understanding of Data Modeling techniques, Normalization and Data Warehouse concepts using Star schema and Snowflake schema modeling.
  • Well versed with Snowflake features like clustering, time travel, cloning, logical data warehouse, caching etc.
  • Developed Talend jobs to populate the claims data to data warehouse - star schema, snowflake schema, Hybrid Schema.
  • Strong knowledge in Logical and Physical Data Model design using ERWIN
  • Good skills in Python programming
  • Experienced with Big Data Hadoop ecosystem components, HDFS, and Apache Spark.
  • Strong knowledge of the Snowflake database; worked on reading data from semi-structured sources (XML files).
  • Proficient in Talend Cloud Real Time Big Data Platform, Informatica Power Center, Informatica Power Exchange, Informatica B2B Data Transformation, Oracle SQL and PL/SQL, Snowflake, Unix Shell Scripting
  • Experience with Informatica Power Center in all phases of data analysis, design, development, implementation, and production support of data warehousing applications using Informatica Power Center 10.x/9.x/8.x, SQL, PL/SQL, Oracle, DB2, Unix, and PowerShell.
  • Worked on designing and developing ETL solutions for complex data ingestion requirements using Talend Cloud Real Time Big Data Platform, Informatica Power Center, Informatica Intelligent Cloud Services, Python, PySpark and implemented data streaming using Informatica Power Exchange.
  • Developed PySpark programs and created the data frames and worked on transformations.
  • Performed root cause analysis on the slowly running solutions (ETL Jobs, reporting jobs, SQL Queries, Stored Procedures and Views) and improved the solutions for better performance
  • Collaborated with different teams like source vendors, data governance, and business teams on data quality issues, as well as architecture or structure of data repositories
  • Assisted in root cause analysis, investigated data errors and anomalies, helped implement solutions and frameworks to correct data problems, and established and published KPIs and SLAs.
  • Created SOX Control, DQR Framework to monitor data quality issues and provide support for internal audit and external SOX compliance audits
  • Worked with Senior Management, Business users, Analytical Team, PMO and Business Analyst team on Requirement discussion and Data integration strategy planning
  • Research new technologies and keep up to date with technological developments in relevant areas of ETL, cloud solutions, Big Data, analytics, and relational/NoSQL databases.
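For illustration, a minimal sketch of the daily Airflow automation referenced above, in Airflow 2.x style; the DAG id, schedule, and wrapper script paths are assumptions for this sketch, not values from a specific engagement:

    # Minimal daily-automation DAG sketch; DAG id, schedule, and script
    # paths are illustrative assumptions.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_claims_load",           # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 2 * * *",        # run daily at 02:00
        catchup=False,
    ) as dag:
        extract = BashOperator(
            task_id="extract_claims",
            # trailing space avoids Jinja treating the .sh path as a template file
            bash_command="sh /opt/etl/extract_claims.sh ",   # hypothetical script
        )
        load = BashOperator(
            task_id="load_to_warehouse",
            bash_command="sh /opt/etl/load_claims.sh ",      # hypothetical script
        )
        extract >> load   # extract runs before the warehouse load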

PROFESSIONAL EXPERIENCE

Data Engineer

Confidential, Rochester MN

Responsibilities:

  • Architected the analytical data pipeline end to end, including stakeholder interviews, data profiling, extraction process design from diverse sources, and data load optimization strategies.
  • Using the Kimball four-step process (business process definition, grain declaration, fact and dimension identification), designed dimensional data models for Loan Servicing and Loan Origination with daily transactional facts and slowly changing dimensions such as Customer, Account, Loan Status, and Credit Profile.
  • Developed ETL using Microsoft toolset (SSIS, TSQL, MS SQL Server) to implement Type 2 Change Data Capture process for various dimensions.
  • After data extraction from AWS S3 buckets and DynamoDB, implemented a daily Python/SQL JSON-parsing pipeline (using pandas, NumPy, json, urllib, pyodbc, and SQLAlchemy) for Credit Profile data, including Experian credit reports (Prequal Credit Report, Full Credit Profile, BizAggs, SbcsAggregates, SbcsV1, SbcsV2, and Premier Profile).
  • Conducted quantitative analyses of raw claims data, Rx claims, and various healthcare data sources.
  • Designed and implemented AWS solutions using EC2, S3, EBS, Elastic Load Balancer (ELB), VPC, Amazon RDS, CloudFormation, Amazon SQS, and other services of the AWS infrastructure.
  • Parsed and evaluated Lending Club® historical small business loan JSON data using Python/SQL to tune the in-house loan scorecard model and test what-if analysis for different products in SSIS.
  • Fulfilled Customer Behavior Score model ETL requirements using rolled-up dimensional model data, applying feature engineering methodologies (identification, aggregation, and processing) based on feedback from data scientists and machine learning modelers.
  • Created data pipeline for different events in Azure Blob storage into Hive external tables and used various Hive optimization techniques like partitioning, bucketing, and Mapjoin.
  • Worked on Azure Data Factory to integrate data of both on-prem (MySQL, PostgreSQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) and applied transformations to load back to Azure Synapse.
  • Created pipelines in ADF using linked services to extract, transform and load data from multiple sources like Azure SQL, Blob storage and Azure SQL Data warehouse.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Primarily involved in Data Migration process using SQL, Azure SQL, SQL Azure DW, Azure storage and Azure Data Factory (ADF) for Azure Subscribers and Customers.
  • Implemented Custom Azure Data Factory (ADF) pipeline Activities and SCOPE scripts.
  • Created Spark clusters and configured high concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries, providing a more reliable and faster reporting interface with sub-second response for basic queries.
  • Migrated on-premises database structures to the Confidential Redshift data warehouse.
  • Primarily responsible for creating new Azure subscriptions, data factories, virtual machines, SQL Azure instances, SQL Azure DW instances, and HDInsight clusters, and installing Data Management Gateways (DMGs) on VMs to connect to on-premises servers.
  • Responsible for ingesting data from various source systems (RDBMS, Flat files, Big Data) into Azure (Blob Storage) using framework model.
  • Involved in application design and data architecture using cloud and Big Data solutions on AWS and Microsoft Azure.
  • Led the effort to migrate a legacy system to a Microsoft Azure cloud-based solution, re-designing the legacy application with minimal changes to run on the cloud platform.
  • Worked on building data pipelines using Azure services such as Data Factory to load data from the legacy SQL Server to Azure databases, using Data Factories, API Gateway services, SSIS packages, Talend jobs, and custom .NET and Python code.
  • Built Azure WebJobs and Functions for Product Management teams to connect to different APIs and sources, extract the data, and load it into Azure Data Warehouse.
  • Built various pipelines integrating Azure with AWS S3 to bring data into Azure databases.
  • Interacted directly with business users and the Data Architect on ongoing changes to the data warehouse design.
  • Involved in Data modeling and design of data warehouse in star schema methodology with conformed and granular dimensions and FACT tables.
  • Identified/documented data sources and transformation rules required to populate and maintain data warehouse content.
  • Implemented Azure Data Factory operations and deployment into Azure for moving data from on-premises into cloud
  • Used Spark DataFrames to create various datasets and applied business transformations and data cleansing operations using Databricks notebooks.
  • Efficient in writing Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Apache Airflow and Apache NiFi.
  • Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL, MongoDB) into HDFS.
  • Consulted on Snowflake data platform solution architecture, design, development, and deployment, focused on bringing a data-driven culture across the enterprise.
  • Developed stored procedures/views in Snowflake and used them in Talend for loading dimensions and facts.
  • Defined virtual warehouse sizing in Snowflake for different types of workloads.
  • Created DWH, databases, schemas, and tables, and wrote SQL queries against Snowflake.
  • Optimized the PySpark jobs to run on Kubernetes Cluster for faster data processing.
  • Optimized Hive queries using best practices and the right parameters, leveraging technologies like Hadoop, YARN, Python, and PySpark.
  • Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
  • Developed Spark applications in Python/PySpark on a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the sketch after this list).
  • Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
  • Worked on migration of data from On-prem SQL server to Cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
  • Extracted Tables and exported data from Teradata through Sqoop and placed them in Cassandra.
  • Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication, and Apache Ranger for authorization.
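As an illustration of the PySpark extraction and transformation work above, a minimal sketch follows; the HDFS path, table name, and column names are assumptions for illustration only:

    # Minimal PySpark sketch: read semi-structured JSON, aggregate usage,
    # and persist as an ORC table for Hive consumers. Paths and columns
    # are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder.appName("usage_aggregation")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read raw JSON events from HDFS (hypothetical path)
    events = spark.read.json("hdfs:///data/raw/usage_events/")

    # Basic cleansing and daily aggregation of customer usage
    daily_usage = (
        events
        .filter(F.col("event_type").isNotNull())
        .groupBy("customer_id", F.to_date("event_ts").alias("event_date"))
        .agg(F.count("*").alias("event_count"))
    )

    # Write as ORC so downstream Hive jobs can consume it
    daily_usage.write.mode("overwrite").format("orc").saveAsTable("analytics.daily_usage")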

Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.

Data Engineer

Confidential, Illinois

Responsibilities:

  • Implemented CARS (Customer Anti-Money Laundering Risk Scoring) and Transaction Monitoring (TM) Model requirements and played key role in data source requirement analysis, ETL Datastage code development and deployment.
  • Broad understanding of healthcare data such as claims, clinical data, quality metrics, and health outcomes.
  • Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Redshift.
  • Wrote scripts and indexing strategy for a migration to Confidential Redshift from SQL Server and MySQL databases
  • Provided seamless connectivity between BI tools like Tableau and Qlik to Redshift endpoints.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow.
  • Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
  • Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow.
  • Presented efficient TM and CARS model enhancement strategies for risk score assignment across various financial-activity and profile-trigger risk factors, applying preferred feature selection methods (entropy, mutual information gain, and decision trees) to streamline high-risk customer alert processing and SAR/CTR filing.
  • Played a key role in the design and implementation of predictive analytics based enrichments to the CARS and TM models using a Bayesian network algorithm, coordinating with multiple business domains and stakeholders to classify independent and dependent risk factors for high-risk customer alert stacking for investigators and the Customer Due Diligence (CDD) process.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Implemented TM and CARS model outlier identification algorithms using PySpark, involving feature (risk factor) engineering, StringIndexer, VectorAssembler, linear regression, and evaluation (RMSE, feature correlation checks) to detect members’ unusual behavior, which in turn tuned the CDD and SAM processes through feedback (see the sketch after this list).
  • Ingested a wide variety of structured, unstructured, and semi-structured data into RDBMS (where feasible per the architecture) as well as into AWS data ecosystems with batch processing and real-time streaming.
  • Worked as a data engineer on member clustering and grouping for general activity reporting, employing PySpark classification approaches including logistic regression, decision trees, and random forests (feature importance identification), as well as unsupervised K-Means clustering for pattern matching.
  • Worked on Airflow 1.8 (Python 2) and Airflow 1.9 (Python 3) for orchestration; familiar with building custom Airflow operators and orchestrating workflows with dependencies spanning multiple clouds.
  • Designed stacks using AWS CloudFormation templates to launch AWS infrastructure and resources; developed CloudFormation templates to create custom-sized VPCs, subnets, EC2 instances, ELBs, and security groups.
  • Worked on creating serverless microservices by integrating AWS Lambda, S3, CloudWatch, and API Gateway.
  • Used JSON schema to define table and column mapping from S3 data to Redshift.
  • Provided ML data engineering expertise for Negative News model enhancement with diverse data provided by LexisNexis and other international vendors. For data ingestion: AWS (EMR, Kinesis Streams & Firehose, RDS, DynamoDB) and Spark Streaming; for data prep: Python web scraping, PyPDF2, Spark natural language processing (Tokenizer, StopWordsRemover, CountVectorizer, inverse document frequency, StringIndexer), AWS Glue (ETL), and IBM DataStage.
  • Did a POC on Redshift Spectrum to create external tables over S3 files.
  • Designed and developed data cleansing, data validation, load processes ETL using Oracle SQL and PL/SQL and UNIX.
  • Performed ETL jobs with Hadoop technologies and tools like Hive, Sqoop, and Oozie to extract records from different databases into HDFS.
  • Installed NoSQL MongoDB on physical machines, virtual machines, and AWS.
  • Supported and managed NoSQL databases: installed, configured, administered, and supported multiple NoSQL instances, and performed database maintenance and troubleshooting.
  • Experienced in developing web-based applications using Python, Django, QT, C++, XML, CSS, JSON, HTML, DHTML, JavaScript, and jQuery.
  • Developed entire frontend and backend modules using Python on Django Web Framework.
  • Coordinated with the Marketing group on Machine Learning Lab activities for MSA (Member Sentiment Analysis) model development using Spark Streaming, Python and PySpark NLP data prep techniques, and the Spark Alternating Least Squares (ALS) model with tuned parameters.
  • Effectively resolved persistent overfitting problems in the model tuning process by putting in place feedback controls, periodic model review strategies (variable data splits for train, test, and evaluation), and detailed documentation, so that data anomalies, scalability issues, and the cold-start problem do not adversely affect the established model over time.
  • Implemented a Continuous Delivery pipeline with Docker, Jenkins, GitHub, and AWS AMIs; whenever a new GitHub branch is started, Jenkins, the Continuous Integration server, automatically attempts to build a new Docker container from it.
  • Monitored resources and applications using AWS CloudWatch, including creating alarms to monitor metrics for EBS, EC2, ELB, RDS, and S3, and configured notifications for the alarms generated based on defined events.
  • Worked with an in-depth level of understanding in the strategy and practical implementation of AWS Cloud-Specific technologies including EC2 and S3.
  • Managed AWS EC2 instances utilizing Auto Scaling and Elastic Load Balancing for QA and UAT environments, as well as infrastructure servers for Git and Puppet.
  • Extensively used Kubernetes, which made it possible to handle all the online and batch workloads required to feed analytics and machine learning applications.
  • Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS). AKS has been used to create, configure, and manage a cluster of virtual machines.
  • Used Scala for its concurrency support, which played a key role in parallel processing of large data sets.
  • Developed MapReduce jobs in Scala, compiling program code into JVM bytecode for data processing.
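A minimal sketch of the StringIndexer / VectorAssembler / linear regression outlier flow referenced above follows; the DataFrame, column names, and outlier threshold are illustrative assumptions only:

    # PySpark ML sketch: index a categorical risk factor, assemble features,
    # fit a linear regression, check RMSE, and flag large residuals as
    # unusual behavior. Columns and thresholds are hypothetical.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    indexer = StringIndexer(inputCol="customer_segment", outputCol="segment_idx")
    assembler = VectorAssembler(
        inputCols=["segment_idx", "txn_count_30d", "wire_ratio"],
        outputCol="features",
    )
    lr = LinearRegression(featuresCol="features", labelCol="expected_activity")
    pipeline = Pipeline(stages=[indexer, assembler, lr])

    # members_df is assumed to be prepared upstream with the columns above
    train, test = members_df.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)
    predictions = model.transform(test)

    rmse = RegressionEvaluator(
        labelCol="expected_activity", metricName="rmse"
    ).evaluate(predictions)

    # Flag members whose residual is well above the RMSE as unusual behavior
    outliers = predictions.filter(
        (predictions.expected_activity - predictions.prediction) > 3 * rmse
    )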

Environment: ETL, Tableau, AWS EC2, AWS Lambda, AWS Glue, NoSQL, MongoDB, Python, Django, QT, C++, XML, CSS, JSON, HTML, DHTML, JavaScript and JQuery.

Data Engineer

Confidential, Columbus, OH

Responsibilities:

  • Developed MapReduce programs to parse and filter the raw data and store the refined data in partitioned tables in Greenplum.
  • Worked on scheduling all jobs using Airflow scripts in Python, adding different tasks to DAGs and Lambda.
  • Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with Greenplum reference tables and historical metrics
  • Developed a data pipeline using Kafka and Storm to store data into HDFS.
  • Managing and scheduling Jobs on a Hadoop cluster using Oozie and cron jobs.
  • Involved in running MapReduce jobs for processing millions of records.
  • Responsible for creating Hive tables, loading the structured data resulted from MapReduce jobs into the tables and writing hive queries to further analyze the logs to identify issues and behavioral patterns.
  • Setting up and managing Kafka for stream processing and Kafka cluster with separate nodes for brokers.
  • Experienced in migrating Hive QL into Impala to minimize query response time.
  • Created UDFs to calculate the pending payment for the given Residential or Small Business customer, and used in Pig and Hive Scripts.
  • Extensively used Marvel plugin for checking and maintenance of Elastic Search Cluster Health.
  • Wrote test cases in MRUnit for unit testing of MapReduce programs.
  • Used Elastic Search & MongoDB for storing and querying the offers and non-offers data.
  • Deployed and built the application using Maven.
  • Maintained Hadoop, Hadoop ecosystems, third-party software, and databases with updates/upgrades, performance tuning, and monitoring using Ambari.
  • Worked on creating and working with indexes using Solr on the Hadoop distributed platform.
  • Experience in managing and reviewing Hadoop log files
  • Extensively worked on the user interface for a few modules using JSPs, JavaScript, and Ajax.
  • Used Python scripting for large scale text processing utilities
  • Experienced in moving data from Hive tables into Cassandra for real time analytics on Hive tables
  • Responsible for data modeling in MongoDB in order to load data which is coming as structured as well as unstructured data
  • Implemented CRUD operations involving lists, sets and maps in DataStax Cassandra.
  • Obtained good experience with NOSQL database Cassandra.
  • Sound programming capability using Python and core Java along with the Hadoop framework, utilizing Cloudera Hadoop ecosystem projects (HDFS, Spark, Sqoop, Hive, HBase, Oozie, Impala, Zookeeper, etc.).
  • Strong exposure to automating maintenance tasks in the Big Data environment through the Cloudera Manager API.
  • Administration, installing, upgrading and managing distributions of Hadoop (CDH5, Cloudera manager), HBase. Managing, monitoring and troubleshooting Hadoop Cluster using Cloudera Manager.
  • Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
  • Participated in development/implementation of Cloudera Hadoop environment.
  • Migrate data from on-premises to AWS storage buckets.
  • Participated in JAD sessions with business users and SME's for better understanding of the reporting requirements.
  • Designed and developed end-to-end ETL processes from various source systems to the staging area, and from staging to data marts.
  • Analyzed the source data to assess data quality using Talend Data Quality.
  • Broad design, development and testing experience with Talend Integration Suite and knowledge in Performance Tuning of mappings.
  • Developed jobs in Talend Enterprise edition from stage to source, intermediate, conversion and target.
  • Developed a Python script to transfer data from on-premises systems to AWS S3.
  • Developed a Python script to hit REST APIs and extract data to AWS S3 (see the sketch after this list).
  • Worked on Ingesting data by going through cleansing and transformations and leveraging AWS Lambda, AWS Glue and Step Functions.
  • Created YAML files for each data source, including Glue table stack creation.
  • Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Developed Lambda functions and assigned IAM roles to run Python scripts along with various triggers (SQS, EventBridge, SNS).
  • Created a Lambda Deployment function and configured it to receive events from S3 buckets.
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using crontab.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
  • Developed Python scripts to update content in the database and manipulate files.
  • Developed mappings using transformations like Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.
  • Exported the analyzed data into relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Converted data load pipeline algorithms written in Python and SQL to Scala Spark and PySpark.
  • Mentored and supported other members of the team (both onshore and offshore) to assist in completing tasks and meeting objectives.
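A minimal sketch of the REST-API-to-S3 extraction pattern mentioned above follows; the endpoint URL, bucket name, and key layout are hypothetical placeholders, not values from the original project:

    # Pull one day's records from a REST API and land them in S3 as JSON.
    # API_URL and BUCKET are illustrative assumptions.
    import json

    import boto3
    import requests

    API_URL = "https://api.example.com/v1/records"   # hypothetical endpoint
    BUCKET = "example-landing-bucket"                # hypothetical bucket

    def extract_to_s3(run_date: str) -> str:
        """Fetch records for run_date and write them to S3, returning the key."""
        response = requests.get(API_URL, params={"date": run_date}, timeout=60)
        response.raise_for_status()

        key = f"raw/records/{run_date}.json"
        boto3.client("s3").put_object(
            Bucket=BUCKET,
            Key=key,
            Body=json.dumps(response.json()).encode("utf-8"),
        )
        return key

    if __name__ == "__main__":
        print(extract_to_s3("2023-01-01"))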

Environment: Hadoop, Spark, Hive, Hbase, Abinitio, Scala, Python, ETL, NoSQL (Cassandra), Azure Databricks, HDFS, MapReduce, Azure Data Lake Analytics, Spark SQL, T-SQL, U-SQL, Azure SQL, Sqoop, Apache Airflow.

Confidential

Jr. Big Data Developer

Responsibilities:

  • Worked on analyzing Hadoop cluster using different big data analytic tools including Pig, Hive, and MapReduce.
  • Involved in loading data from LINUX file system to HDFS.
  • Importing and exporting data into HDFS and Hive using Sqoop.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Used Sqoop to import data into HDFS and Hive from other data systems.
  • Configured performance tuning and monitoring for Cassandra read and write processes for fast I/O operations and low latency; used the Java API and Sqoop to export data into a DataStax Cassandra cluster from RDBMS.
  • Experience working on processing unstructured data using Pig and Hive.
  • Experienced in running Hadoop Streaming jobs to process terabytes of XML-format data (a Python mapper sketch follows this list).
  • Involved in scheduling Oozie workflow engine to run multiple Hive and pig jobs.
  • Developed Pig Latin scripts to extract data from the web server output files to load into HDFS.
  • Extensively used Pig for data cleansing.
  • Implemented SQL, PL/SQL Stored Procedures.
  • Worked on debugging, performance tuning of Hive & Pig Jobs.
  • Implemented test scripts to support test driven development and continuous integration.
  • Worked on tuning the performance of Pig queries.
  • Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
  • Actively involved in code review and bug fixing for improving the performance.
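For illustration, a minimal Hadoop Streaming mapper in Python in the spirit of the XML-processing streaming jobs above; the one-record-per-line XML layout and the customerId field are assumptions:

    # Hadoop Streaming mapper sketch: reads one XML record per line on stdin
    # and emits tab-separated "customer<TAB>1" pairs for a downstream reducer.
    import sys
    import xml.etree.ElementTree as ET

    def main():
        for line in sys.stdin:
            line = line.strip()
            if not line:
                continue
            try:
                record = ET.fromstring(line)      # assumes one XML record per line
            except ET.ParseError:
                continue                          # skip malformed records
            customer = record.findtext("customerId", default="UNKNOWN")
            print(f"{customer}\t1")               # count one event per customer

    if __name__ == "__main__":
        main()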

Application Developer (ETL DataStage)

Confidential

Responsibilities:

  • Used IBM InfoSphere suite products for ETL development, enhancement, testing, support, maintenance and debugging software applications that support business units and support functions in consumer banking sector.
  • Utilized Hadoop Ecosystem for Big Data sources in Customer Relationship Hub and Master Data Management: for data ingestion: Kafka, Storm and Spark Streaming; for data landing: HBase, Phoenix relational DB layer on HBase; for query and ETL used Phoenix, Pig and HiveQL; for job runtime management: Yarn and Ambari.
  • Developed ETL packages using SQL Server Integration Services (SSIS) to migrate data from legacy systems such as DB2, SQL Server, Excel sheets, XML files, and flat files to SQL Server databases.
  • Performed daily database health-check tasks, including backup and restore, using SQL Server tools such as SQL Server Management Studio, SQL Server Profiler, SQL Server Agent, and Database Engine Tuning Advisor on Development and UAT environments.
  • Performed the ongoing delivery, migrating client mini-data warehouses or functional data-marts from different environments to MS SQL server.
  • Involved in Implementation of database design and administration of SQL based database.
  • Developed SQL scripts, Stored Procedures, functions and Views.
  • Worked on DTS packages and DTS Import/Export for transferring data from various databases (Oracle) and text-format data to SQL Server 2005.
  • Designed and implemented various machine learning models (e.g., a customer propensity scoring model and a customer churn model) using Python (NumPy, SciPy, pandas, scikit-learn) and Apache Spark (Spark SQL, MLlib); see the sketch after this list.
  • Provided performance tuning and optimization of data integration frameworks and distributed database system architecture.
  • Designed and developed a solution in Apache Spark to extract transactional data from various HDFS sources and ingest it into Apache HBase tables.
  • Designed and developed Streaming jobs to send events and logs from Gateway systems to Kafka.
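A minimal scikit-learn sketch of the churn-style classification modeling described above follows; the CSV path, feature names, and label column are hypothetical assumptions:

    # Train a simple churn classifier and report AUC. Input file and
    # feature/label names are illustrative assumptions.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("customer_features.csv")    # hypothetical extract
    features = ["tenure_months", "monthly_spend", "support_tickets", "product_count"]

    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["churned"], test_size=0.2, random_state=42
    )

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    # Probability of churn for the held-out customers, scored with AUC
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))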

Environment: HortonWorks, DataStage 11.3, Oracle, DB2, UNIX, Mainframe, Autosys.
