
Sr. Data Engineer Resume


Weehawken, NJ

SUMMARY

  • Experience in data engineering, development, and implementation of data solutions as a Data Engineer.
  • Proficient in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
  • Strong experience writing scripts with the Python and Spark APIs to analyze data.
  • Software professional with solid IT industry experience developing, implementing, and maintaining web-based applications using Java, the Big Data ecosystem, and ETL tools such as Talend and Informatica.
  • Well versed in implementing end-to-end (E2E) big data solutions using the Hadoop framework.
  • Proficient in data warehousing and data mining concepts and in ETL transformations from source to target systems.
  • Worked with multiple Hadoop distributions and platforms, including Cloudera, Hortonworks, MapR, and AWS.
  • Experience developing and maintaining applications for Amazon Simple Storage Service (S3), AWS Elastic MapReduce (EMR), and AWS CloudFormation. Imported data from sources such as AWS S3 and the local file system into Spark RDDs.
  • Uploaded and processed terabytes of data from various structured and unstructured sources into HDFS using Sqoop.
  • Implemented POCs to migrate MapReduce programs to Apache Spark transformations using Spark and Scala.
  • Used Spark-Streaming APIs to perform necessary transformations and actions on the fly.
  • Improved performance and optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and Pair RDDs.
  • Hands-on experience developing UDFs, DataFrames, and SQL queries in Spark SQL (see the Spark sketch at the end of this summary).
  • Experience using Sqoop to import and export data between RDBMS and HDFS/Hive.
  • Used Sqoop to transfer data between RDBMS and HDFS. Extensively used Apache Flume to collect logs and error messages across the cluster.
  • Good knowledge of scripting languages such as Linux/Unix shell and Python, and of continuous integration and automated deployment (CI/CD) using Jenkins.
  • Hands-on experience with major Hadoop ecosystem components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
  • Experienced in performing real-time analytics and data transfer using HBase, Hive queries, and Pig.
  • Implemented dynamic partitioning and bucketing as best practices for performance improvement.
  • Experienced in creating producer and consumer applications using the Kafka APIs.
  • Experience developing data pipelines that use Kafka to store data in HDFS. Performed real-time data streaming using Spark Streaming, Kafka, and Flume.
  • Configured Zookeeper to coordinate the servers in clusters to maintain the data consistency.
  • Experience working with OpenStack, Ansible, Kafka, Elasticsearch, Hadoop, StreamSets, MySQL, Cloudera, MongoDB, UNIX shell scripting, Pig scripting, Hive, Flume, Zookeeper, Sqoop, Oozie, Python, Spark, Git, and a variety of RDBMS in UNIX and Windows environments under Agile methodology.
  • Deployment and implementation of distributed enterprise applications in a J2EE environment.
  • Comprehensive knowledge of the Software Development Life Cycle (SDLC), with a thorough understanding of phases such as requirements analysis, design, development, and testing.
  • Proficient in developing web pages quickly and effectively using HTML, CSS3, JavaScript, and jQuery, with experience making web pages cross-browser compatible.
  • Experienced in using Ansible scripts to deploy Cloudera CDH and set up Hadoop clusters.
  • Proficient in PL/SQL for data warehousing, with strong experience implementing data warehousing methodologies.
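
A minimal PySpark sketch of the kind of Spark SQL and DataFrame work described above; the S3 path, column names, and view name are hypothetical placeholders rather than project details.

```python
# Illustrative sketch: read S3 data into a DataFrame, apply a UDF, query with Spark SQL.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("s3-dataframe-demo").getOrCreate()

# Read raw CSV files from a (hypothetical) S3 prefix into a DataFrame.
df = spark.read.option("header", "true").csv("s3a://example-bucket/raw/transactions/")

# A simple UDF that normalizes a status column to upper case.
normalize = udf(lambda s: s.strip().upper() if s else None, StringType())
df = df.withColumn("status", normalize(df["status"]))

# Register the DataFrame as a temporary view and query it with Spark SQL.
df.createOrReplaceTempView("transactions")
spark.sql("""
    SELECT status, COUNT(*) AS cnt
    FROM transactions
    GROUP BY status
""").show()
```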

TECHNICAL SKILLS

Data Modeling Tools: Erwin r9.7, Rational System Architect, IBM InfoSphere Data Architect, ER/Studio v16

BI Tools: Tableau 10, SAP Business Objects, Crystal Reports

Methodologies: Agile, SDLC, Ralph Kimball data warehousing methodology, Joint Application Development (JAD)

RDBMS: Microsoft SQL Server 2017, Teradata 15.0, Oracle 12c, and MS Access

Operating Systems: Microsoft Windows 7/8 and 10, UNIX, and Linux.

Packages: Microsoft Office 2019, Microsoft Project, SAP, Microsoft Visio 2019, SharePoint Portal Server

Cloud Platform: Amazon Web Services, MS Azure

Databases: Oracle 12c/11g, Teradata R15/R14, MS SQL Server 2016/2014, DB2.

PROFESSIONAL EXPERIENCE

Sr. Data Engineer

Confidential, Weehawken, NJ

Responsibilities:

  • Designed and developed Sqoop and Linux shell scripts for data ingestion from various Credit Suisse data sources into the HDFS data lake.
  • Created Datasets/ DataFrames from RDDs using reflection and programmatic inference of schema over RDD.
  • Developed Python Spark programs for processing HDFS Files using RDDs, Pair RDDs, Spark SQL, Spark Streaming, DataFrames, Accumulators, Broadcast variables.
  • Developed PySpark Kafka streaming programs to integrate various Credit Suisse source systems with Hadoop, and developed PySpark programs using various transformations and actions (see the streaming sketch after this list).
  • Comprehensive knowledge and experience in process improvement, normalization/de-normalization, data extraction, data cleansing, data manipulation.
  • Analyzed, designed, and built modern data solutions using Azure PaaS services to support visualization of data. Understood the current production state of the application and determined the impact of new implementations on existing business processes.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
  • Created pipelines in ADF using linked services and datasets to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
  • Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
  • Hands-on experience setting up workflows using Apache Airflow.
  • Performed data transformations in Hive and used static and dynamic partitioning and bucketing for performance improvements. Established and followed Spark programming best practices.
  • Tuned the performance of Pig, Hive, and Spark jobs using caching/persistence, partitioning, and best practices.
  • Worked with support teams to resolve operational and performance issues.
  • Researched, evaluated, and adopted new technologies, tools, and frameworks in the Hadoop ecosystem. Involved in all phases of the Software Development Life Cycle (SDLC) using Agile methodology.
  • Worked on MySQL databases, writing queries using the Python MySQL Connector and MySQL packages.
  • Implemented CRUD operations for the business logic using RESTFUL services.
  • Designed and developed use-case, Class and Object Diagrams for the business requirements.
  • Used multithreading in programming to improve overall performance. Automated build and deploy process to production environment. Used Jenkins for continuous integration (CI) and continuous deployment (CD).
  • Used Agile methodologies - Scrums, Sprints, tracking of tasks using JIRA management tool.
  • Responsible for managing large datasets using pandas DataFrames and MySQL.
  • Used the pandas library for statistical analysis, together with regular expressions and Python collections (see the pandas sketch after this list).
  • Designed and implemented a dedicated MySQL database server to drive the web applications and report on daily progress.
  • Used regular expressions to match patterns in incoming data against existing ones.
  • Skilled in using collections in Python for manipulating and looping through different user defined objects.
  • Developed complex SQL Queries, Stored Procedures, Triggers, Cursors, Functions, and Packages along with performing DDL and DML operations on the database.
  • Involved in designing and developing the JSON, XML Objects with MySQL.
  • Demonstrated expertise using ETL tools: Talend Data Integration, SQL Server Integration services.
  • Expertise in extracting, transforming, and loading data from Oracle, DB2, SQL Server, MS Access, flat files, and XML using Informatica and Talend.
  • Created ETL/Talend jobs, covering both design and code, to process data into target databases.
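
The bullets above mention PySpark Kafka streaming into Hadoop; below is a hedged Structured Streaming sketch of that pattern. The broker list, topic, and HDFS paths are illustrative, and the Kafka connector package must be supplied at submit time.

```python
# Sketch: read a Kafka topic with Structured Streaming and land records in HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to a (hypothetical) source-system topic.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "source_system_events")
          .load())

# Kafka delivers key/value as binary; cast the value to a string for downstream parsing.
events = stream.select(col("value").cast("string").alias("raw_event"))

# Continuously append the events to HDFS in Parquet format.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/lake/raw/source_system_events")
         .option("checkpointLocation", "hdfs:///data/lake/checkpoints/source_system_events")
         .start())

query.awaitTermination()
```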
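
As a companion to the pandas/MySQL bullets above, here is a small illustrative sketch; the connection details, table, and column names are hypothetical.

```python
# Sketch: pull rows via the Python MySQL connector, analyze with pandas, regex, and collections.
import re
from collections import Counter

import mysql.connector
import pandas as pd

conn = mysql.connector.connect(
    host="localhost", user="report_user", password="secret", database="reporting"
)

# Load a (hypothetical) daily-progress table into a DataFrame.
df = pd.read_sql("SELECT task_id, status, notes FROM daily_progress", conn)

# Basic descriptive statistics on the frame.
print(df.describe(include="all"))

# Flag notes that reference an error code such as ERR-1234 using a regular expression.
error_pattern = re.compile(r"ERR-\d{4}")
df["has_error"] = df["notes"].fillna("").apply(lambda s: bool(error_pattern.search(s)))

# Summarize statuses with collections.Counter.
print(Counter(df["status"]).most_common())

conn.close()
```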

Confidential, Branchburg, NJ

Sr. Data Engineer

Responsibilities:

  • Involved in Analysis, Design and Implementation/translation of Business User requirements.
  • Worked on collecting large data sets using Python scripting and Spark SQL.
  • Worked on large sets of Structured and Unstructured data.
  • Worked on creating DL algorithms using LSTM and RNN.
  • Actively involved in designing and developing data ingestion, aggregation, and integration in Hadoop environment.
  • Developed Sqoop scripts to import and export data from relational sources and handled incremental loading of customer and transaction data by date.
  • Developed Sqoop scripts to migrate data from Oracle to the big data environment.
  • Extensively worked with Avro and Parquet files and converted data between the two formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in Spark.
  • Converted all Hadoop jobs to run on EMR by configuring the cluster according to the data size.
  • Implemented Spring Security for protection against SQL injection and for user access privileges; used various Java/J2EE design patterns such as DAO, DTO, and Singleton.
  • Experience in creating Hive Tables, Partitioning and Bucketing.
  • Performed data analysis and data profiling using complex SQL queries on various source systems, including Oracle 10g/11g and SQL Server 2012.
  • Identified inconsistencies in data collected from different sources.
  • Participated in requirement gathering and worked closely with the architect in designing and modeling.
  • Designed object model, data model, tables, constraints, necessary stored procedures, functions, triggers, and packages for Oracle Database.
  • Created Talend jobs to copy the files from one server to another and utilized Talend FTP component.
  • Wrote Spark applications for Data validation, cleansing, transformations and custom aggregations.
  • Imported data from various sources into Spark RDD for processing.
  • Implemented End to End solution for hosting the web application on AWS cloud with integration to S3 buckets.
  • Worked on AWS CLI, Auto Scaling, and CloudWatch monitoring creation and updates.
  • Allotted permissions, policies and roles to users and groups using AWS Identity and Access Management (IAM).
  • Worked on AWS Elastic Beanstalk for fast deploying of various applications developed with Java, PHP, Node.js, Python on familiar servers such as Apache.
  • Developed server-side software modules and client-side user interface components and deployed entirely in Compute Cloud of Amazon Web Services (AWS).
  • Implemented Lambda to configure Dynamo DB Autoscaling feature and implemented Data Access Layer to access AWS DynamoDB data.
  • Automated the nightly build to run quality control using Python with the Boto3 library to ensure the pipeline does not fail, reducing effort by 70%.
  • Worked with AWS services such as SNS to send automated emails and messages using Boto3 after the nightly run (see the Boto3 sketch after this list).
  • Worked on the development of tools which automate AWS server provisioning, automated application deployments, and implementation of basic failover among regions through AWS SDK’s.
  • Created AWS Lambda, EC2 instances provisioning on AWS environment and implemented security groups, administered Amazon VPC's.
  • Used Jenkins pipelines to drive all micro-services builds out to the Docker registry and then deployed to Kubernetes, Created Pods and managed using Kubernetes.
  • Involved with development of Ansible playbooks with Python and SSH as wrapper for management of AWS node configurations and testing playbooks on AWS instances.
  • Developed Python AWS serverless Lambda functions with concurrency and multi-threading to speed up processing and execute callables asynchronously.
  • Implemented CloudTrail to capture events related to API calls made to the AWS infrastructure.
  • Monitored containers in AWS EC2 machines using the Datadog API, and ingested and enriched data into the internal cache system.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Worked on installing cluster, commissioning & decommissioning of Data node, Name node high availability, capacity planning, and slots configuration.
  • Developed Spark applications for the entire batch processing by using Scala.
  • Automatically scaled EMR instances up based on the volume of data.
  • Stored the time-series transformed data from the Spark engine built on top of a Hive platform to Amazon S3 and Redshift.
  • Facilitated deployment of a multi-cluster environment using AWS EC2 and EMR, in addition to deploying Docker containers for cross-functional deployment.
  • Visualized the results using Tableau dashboards and the Python Seaborn libraries were used for Data interpretation in deployment.
  • Created PDF reports using Golang and XML documents and sent them to all customers at the end of each month.
  • Worked with business owners/stakeholders to assess Risk impact, provided solution to business owners.
  • Experienced in determining trends and significant data relationships using advanced statistical methods.
  • Carried out data processing and statistical techniques such as sampling, estimation, hypothesis testing, time series, correlation, and regression analysis using R.
  • Installed and configured Apache Airflow and created DAGs to run the workflows (see the Airflow sketch after this list).
  • Scheduled Glue jobs using the AWS CLI and the Zeke scheduler. Exposure to building pipelines using Apache Airflow.
  • Applied various data mining techniques: Linear Regression & Logistic Regression, classification, clustering.
  • Took personal responsibility for meeting deadlines and delivering high quality work.
  • Strived to continually improve existing methodologies, processes, and deliverable templates.
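
A hedged Boto3 sketch of the nightly-run notification pattern mentioned above; the quality check is a stub and the SNS topic ARN is a placeholder.

```python
# Sketch: run a nightly quality check and publish the result to an SNS topic via Boto3.
import boto3

def nightly_quality_check() -> bool:
    """Placeholder for the real quality-control logic run against the pipeline."""
    # e.g. compare row counts, null ratios, schema drift, etc.
    return True

def notify(result: bool) -> None:
    sns = boto3.client("sns", region_name="us-east-1")
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:nightly-run-status",  # hypothetical ARN
        Subject="Nightly pipeline quality check",
        Message="Quality check passed." if result else "Quality check FAILED - investigate.",
    )

if __name__ == "__main__":
    notify(nightly_quality_check())
```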
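
An illustrative Airflow sketch of the DAG setup referenced above; the task commands, IDs, and schedule are assumptions, not the project's actual jobs.

```python
# Sketch: a small DAG that runs a Sqoop import followed by a Spark job on a daily schedule.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_ingest_and_transform",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Stage raw data from the relational source into HDFS (placeholder command).
    ingest = BashOperator(
        task_id="sqoop_import",
        bash_command="sqoop import --connect jdbc:oracle:thin:@//db:1521/orcl "
                     "--table TRANSACTIONS --target-dir /data/raw/transactions",
    )

    # Run the Spark transformation over the newly staged data (placeholder command).
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/jobs/transform_transactions.py",
    )

    ingest >> transform
```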

ENVIRONMENT: R, SQL server, Oracle, HDFS, HBase, AWS, MapReduce, Hive, Impala, Pig, Sqoop, NoSQL, Tableau, RNN, LSTM, Unix/Linux, Core Java.

Confidential, Rensselaer, NY

Hadoop Developer

Responsibilities:

  • Worked on a data sourcing and visualization platform aimed at enabling analysts to discover, visualize, model, and present Fitch data alongside external data, to facilitate better comparisons and analysis.
  • This is a pilot project at Fitch with milestone achievements such as the establishment of a data lake where high-quality, influential data is stored and distributed to authorized users. The project also features modern technology for building a centralized data repository in AWS, with security built in using IAM.
  • Envisioned the possibility of AI-powered data discovery and employed technologies such as Amazon S3 and AWS Glue to form the data lake; AWS Glue automatically crawls the Amazon S3 data, identifies data formats, and suggests schemas for use with other AWS analytics services (see the Glue sketch after this list).
  • Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL.
  • Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Design and implement migration strategies for traditional systems on Azure (lift and shift, Azure Migrate, and other third-party tools).
  • Engage with business users to gather requirements, design visualizations and provide training to use self-service BI tools.
  • Used various sources to pull data into Power BI such as SQL Server, Excel, Oracle, SQL Azure etc.
  • Propose architectures considering cost/spend in Azure and develop recommendations to right-size data infrastructure.
  • One of the milestones is the recent approval of the project following a POC with a sample set of publicly available Fitch data from the Rating, Navigator, and Financial datasets used for the evaluation.
  • Responsible for analysis of requirements and designing generic and standard ETL process to load data from different source systems.
  • Successfully implemented POC in a Development Databases to validate the requirements and benchmarking the ETL loads. Understanding the existing business model and customer requirements.
  • Tested the developed objects at the unit/component level and prepared test case documents for mappings, sessions, and workflows.
  • Handled Classification System part of the project which involved loading of the data based on some preconditions.
  • Involved in developing and documenting the ETL strategy to populate the Data Warehouse from various source systems.
  • Involved in Data Extraction, Staging, Targeting Transformation and Loading.
  • Involved in testing at the data base end and reviewing the Talend Mappings as per the business logic.
  • Listed the issues that did not conform to business requirements, and developed new maps and changes to existing maps.
  • Wrote several test cases, identified issues that could occur, and developed an understanding of the data merge and match process.
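
A hedged Boto3 sketch of the AWS Glue crawler pattern described above; the crawler name, IAM role, catalog database, and S3 path are placeholders.

```python
# Sketch: point a Glue crawler at an S3 prefix so inferred schemas land in the Glue Data Catalog.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler over a (hypothetical) data-lake prefix; the IAM role must
# grant Glue read access to the bucket.
glue.create_crawler(
    Name="datalake-ratings-crawler",
    Role="arn:aws:iam::123456789012:role/GlueDataLakeCrawlerRole",
    DatabaseName="datalake_catalog",
    Targets={"S3Targets": [{"Path": "s3://example-datalake/ratings/"}]},
)

# Kick off a crawl; when it finishes, the inferred tables appear in the catalog
# and can be queried by other AWS analytics services.
glue.start_crawler(Name="datalake-ratings-crawler")
```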

Confidential

Hadoop Developer

Responsibilities:

  • Our mission is to identify and create effective company-wide transparency into our external data sources, providing business leaders with the information they need to make strategic decisions regarding the purchase and use of the external data hub.
  • Delivered enterprise-level data assets to standardize, centralize, and seamlessly connect data across all marketing-facing businesses through the creation of an external data hub that supports a comprehensive slate of digital and analytic capabilities utilizing quantum technologies.
  • Recognizes and understands use of design patterns for intermediate applications.
  • Develops or confirms detailed project or system change estimates or project plans; calibrates estimating factors for continuous improvements.
  • Develops code for intermediate modules, following documentation and development standards, creates enhanced technical documentation and implement changes.
  • Conducts timely, structured code reviews to ensure standards and systems interoperability; reviews and critiques team members' code; creates accurate test plans, conditions, and data; and participates in testing reviews.
  • Conducts basic levels of module and integration testing according to process standards; tracks and resolves moderate defects. Assists Quality Control personnel with functional tests.
  • Executes change management activities supporting production deployment to Developers, Quality Control Analysts, and Environment Management personnel.
  • Completed a POC and implementation to interface with a storage system called Scality S3, which is an AWS S3-compatible implementation.
  • The automation validates the Scality RING, which uses an object storage core for scalable data management deployed on hardware.
  • Secured the big data/Hadoop cluster with a robust security architecture using Kerberos, Active Directory, Ranger, Knox, and Centrify. Automated Hadoop cluster access controls through Ranger policies to keep security up to date, providing granular security down to the column level. Implemented data governance through TMM data lineage and Ranger-based access controls on HDFS and Hive metadata, and end-user security through Kerberos CLI tools, Beeline, and Ambari views.
  • Enabled Kerberos for Hadoop cluster authentication and integrated it with Active Directory for managing users and application groups.
  • Worked on data processing for transformations and actions in the Spark computational engine using PySpark, transforming and analyzing data with PySpark based on ETL (Talend) mappings (see the PySpark sketch after this list).
  • Completes and delivers migration or change management form to above parties, creates and executes unit tests.
  • Implements technical process improvements. Manages technical process and application process flows.
  • Serve as a subject matter expert and an advocate of the data quality strategic vision and enterprise data strategy.
  • Design and drive the implementation of end-to-end data quality solutions for multiple data ecosystems that leverage non-relational data. Consistently contribute to process improvements and help lead the team to deliver working solutions and data quality standards.
  • Participate and review documentation for self-service data quality practices, the process framework, and internal guides.
  • Perform thorough data profiling with multiple usage patterns, root cause analysis, and data cleansing; develop scorecards utilizing Tableau, Excel, and other data quality tools; and configure and manage the connection process.
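
An illustrative PySpark sketch of the transform-and-analyze step mentioned above, expressed as a simple ETL-style mapping; the Hive tables and columns are hypothetical.

```python
# Sketch: read a staging table from Hive, apply a filter/derive/aggregate mapping,
# and write the result to a curated, partitioned Hive table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("talend-style-mapping")
         .enableHiveSupport()
         .getOrCreate())

# Source extract: read the (hypothetical) staging table from Hive.
src = spark.table("staging.customer_events")

# Transformations mirroring a typical mapping: drop bad rows, derive a date column.
clean = (src.filter(F.col("event_ts").isNotNull())
            .withColumn("event_date", F.to_date("event_ts")))

# Aggregate to the target grain.
daily = (clean.groupBy("customer_id", "event_date")
              .agg(F.count("*").alias("event_count")))

# Load: write the result to the curated zone as a partitioned Hive table.
(daily.write.mode("overwrite")
      .partitionBy("event_date")
      .saveAsTable("curated.customer_daily_events"))
```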
